Personal Portfolio

Thomas
Peng.

Graphic designer turned AI-native builder. Builds agentic systems and evaluates them honestly. Deterministic scoring, adversarial verification, cost-gated reproducible runs, honest nulls.

thomas@thomaspeng.ca View work

The work
is the proof.

SHARED KERNEL: Quorum core/

The four artifacts below share a single verified substrate: Quorum's orchestration kernel. One codebase, three applied problems, reproducible evaluations. That is the story.

Every eval uses deterministic scoring. No LLM judge in the success path. Honest nulls are reported as findings, because a null you understand is better evidence than a lift you cannot reproduce.

Target roles: Applied AI, Forward-Deployed Engineer, Agent Engineer, Design Engineer. Frontier lab or serious builder.

SHT 01

Artifacts

PN-001

Quorum

github.com/7P3ng/quorum

Key finding

K=3 adversarial verification cut false positives 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%) on a 36-snippet labeled set including prompt-injection traps. Held-out real target: 3/3 genuine bugs found, 0 surviving false positives.

Cost-aware model routing (DeepSeek to Haiku to Sonnet to Opus) with adversarial multi-agent verification and full tracing. Fans out finders per file, then K skeptics per finding (concurrency cap 8). Approximately $0.25 total per run. 58 tests, ruff + mypy + CI green.

Cost-routing claim is operator-gated on an Anthropic key. Honest statement: harness committed, live multi-tier number gated on key availability. make eval-dry reproduces offline.

The trace UI is deployed as a real product surface. View it live below.

Metric	Value
FP rate (before)	27.8%
FP rate (after K=3)	0.0%
95% CI (after)	[0, 0]
Recall	77.8%
Cost per run	~$0.25
Tests	58

LIVE SURFACEOpen live

Quorum trace UI, live deploymentOpen live

PN-002

Aegis

github.com/7P3ng/aegis

Honest finding

A reasoning model is significantly more robust (injection ASR 49.3% vs 68.1%, p=0.0012; canary 10.4% vs 21.5%, p=0.010; overall p=0.0002). But the full defense stack erases the gap entirely (1.7% vs 2.8%, p=0.40, not significant). The null is the finding.

An adaptive attacker agent red-teams a target on two harmless proxies (canary-string extraction plus prompt-injection sentinel). Scored deterministically via exact match, no LLM judge. Layered defenses measurably cut attack success. Vendors Quorum's core/.

Adaptation lift: 24.0% to 29.9% (became significant only after scaling the benchmark: McNemar b=17/c=0, p approx 0; was a null at small n). Scaling is the legitimate power lever, not p-hacking.

Defense reduction: 29.2% to 4.2% (-25%). The input classifier is the workhorse. 78 tests, CI + Pages green.

Metric	Value
Injection ASR (base)	68.1%
Injection ASR (reasoning)	49.3%
Full defense ASR	1.7% vs 2.8% (p=0.40, NS)
Adaptation lift	24.0% to 29.9%
Defense reduction	-25 pp
Tests	78

LIVE SURFACEOpen live

Aegis red-team demo, live deploymentOpen live

PN-003

FieldAgent

github.com/7P3ng/fieldagent

Honest null

The "agentic chunking lift" is model-specific noise, not a real advantage. It looked like +0.45 F1 on DeepSeek only because of a truncation artifact. A fair rerun collapses it to +0.07 (CIs overlap) and it ties on Claude Sonnet. The real result: F1=0.548 (P=0.741/R=0.435), +0.21 over a keyword floor.

An agent reads real commercial contracts and flags risk-bearing clauses (span, severity, plain-English risk), graded by span-IoU against CUAD gold. No LLM judge. Vendors Quorum's core/.

20 held-out CUAD contracts. 47 tests, CI green. Party names and dollar figures are redacted in the demo.

The honest null is the artifact: a lift that looked like +0.45 on a flawed baseline, corrected to +0.07 in a fair rerun. Reporting it is the point.

Metric	Value
Detection F1	0.548
Precision	0.741
Recall	0.435
95% CI	[0.460, 0.637]
Lift vs keyword floor	+0.21 F1
Agentic lift (honest)	+0.07 (CI overlap, NS)
Tests	47

LIVE SURFACEOpen live

FieldAgent contract analysis demo, live deploymentOpen live

SHT 02

Skill-Tuning Council

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. Pipeline: adversary, editors, merger, council, escalate-on-disagreement.

576 tests. Internal infrastructure; no public URL. Presented as a methodology piece, not a shipped product.

The council pattern is reusable: any domain where a single reviewer has blind spots benefits from orthogonal critic perspectives with a synthesis step and escalation on disagreement.

ADVERSARY

generates attack

EDITORS

4 orthogonal critics

MERGER

synthesizes edits

COUNCIL

votes on ship

ESCALATE

on disagreement

Tests	576
Critics	4 (taste, pragmatism, intent, anti-drift)
Public URL	None (internal infra)

SHT 03

Eval discipline

Deterministic scoring

No LLM judge in the success path. Exact match, span-IoU, CI bounds. The metric does not hallucinate.

Adversarial verification

K skeptics review each finding before it counts. Prompt-injection traps in the labeled set catch jailbroken positives.

Honest nulls

Nulls are reported as primary findings. A lift that collapses on a fair rerun is reported as collapsed. make eval-dry reproduces offline.

SHT 04

Contact

Let's
talk.

Emailthomas@thomaspeng.ca

GitHubgithub.com/7P3ng

ClassificationDoc No: TP-2026-ARev: AStatus: RELEASED

ThomasPeng.

The workis the proof.

Artifacts

Quorum

Aegis

FieldAgent

Skill-Tuning Council

Eval discipline

Deterministic scoring

Adversarial verification

Honest nulls

Contact

Let'stalk.

Thomas
Peng.

The work
is the proof.

Let's
talk.