Personal Portfolio

Thomas
Peng.

Graphic designer turned AI-native builder. Builds agentic systems and evaluates them honestly. Deterministic scoring, adversarial verification, cost-gated reproducible runs, honest nulls.

The work
is the proof.

SHARED KERNEL: Quorum core/

The four artifacts below share a single verified substrate: Quorum's orchestration kernel. One codebase, three applied problems, reproducible evaluations. That is the story.

Every eval uses deterministic scoring. No LLM judge in the success path. Honest nulls are reported as findings, because a null you understand is better evidence than a lift you cannot reproduce.

Target roles: Applied AI, Forward-Deployed Engineer, Agent Engineer, Design Engineer. Frontier lab or serious builder.

SHT 01

Artifacts

Key finding

K=3 adversarial verification cut false positives 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%) on a 36-snippet labeled set including prompt-injection traps. Held-out real target: 3/3 genuine bugs found, 0 surviving false positives.

Cost-aware model routing (DeepSeek to Haiku to Sonnet to Opus) with adversarial multi-agent verification and full tracing. Fans out finders per file, then K skeptics per finding (concurrency cap 8). Approximately $0.25 total per run. 58 tests, ruff + mypy + CI green.

Cost-routing claim is operator-gated on an Anthropic key. Honest statement: harness committed, live multi-tier number gated on key availability. make eval-dry reproduces offline.

The trace UI is deployed as a real product surface. View it live below.

MetricValue
FP rate (before)27.8%
FP rate (after K=3)0.0%
95% CI (after)[0, 0]
Recall77.8%
Cost per run~$0.25
Tests58
LIVE SURFACEOpen live

Honest finding

A reasoning model is significantly more robust (injection ASR 49.3% vs 68.1%, p=0.0012; canary 10.4% vs 21.5%, p=0.010; overall p=0.0002). But the full defense stack erases the gap entirely (1.7% vs 2.8%, p=0.40, not significant). The null is the finding.

An adaptive attacker agent red-teams a target on two harmless proxies (canary-string extraction plus prompt-injection sentinel). Scored deterministically via exact match, no LLM judge. Layered defenses measurably cut attack success. Vendors Quorum's core/.

Adaptation lift: 24.0% to 29.9% (became significant only after scaling the benchmark: McNemar b=17/c=0, p approx 0; was a null at small n). Scaling is the legitimate power lever, not p-hacking.

Defense reduction: 29.2% to 4.2% (-25%). The input classifier is the workhorse. 78 tests, CI + Pages green.

MetricValue
Injection ASR (base)68.1%
Injection ASR (reasoning)49.3%
Full defense ASR1.7% vs 2.8% (p=0.40, NS)
Adaptation lift24.0% to 29.9%
Defense reduction-25 pp
Tests78
LIVE SURFACEOpen live

Honest null

The "agentic chunking lift" is model-specific noise, not a real advantage. It looked like +0.45 F1 on DeepSeek only because of a truncation artifact. A fair rerun collapses it to +0.07 (CIs overlap) and it ties on Claude Sonnet. The real result: F1=0.548 (P=0.741/R=0.435), +0.21 over a keyword floor.

An agent reads real commercial contracts and flags risk-bearing clauses (span, severity, plain-English risk), graded by span-IoU against CUAD gold. No LLM judge. Vendors Quorum's core/.

20 held-out CUAD contracts. 47 tests, CI green. Party names and dollar figures are redacted in the demo.

The honest null is the artifact: a lift that looked like +0.45 on a flawed baseline, corrected to +0.07 in a fair rerun. Reporting it is the point.

MetricValue
Detection F10.548
Precision0.741
Recall0.435
95% CI[0.460, 0.637]
Lift vs keyword floor+0.21 F1
Agentic lift (honest)+0.07 (CI overlap, NS)
Tests47
LIVE SURFACEOpen live
SHT 02

Skill-Tuning Council

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. Pipeline: adversary, editors, merger, council, escalate-on-disagreement.

576 tests. Internal infrastructure; no public URL. Presented as a methodology piece, not a shipped product.

The council pattern is reusable: any domain where a single reviewer has blind spots benefits from orthogonal critic perspectives with a synthesis step and escalation on disagreement.

ADVERSARY
generates attack
EDITORS
4 orthogonal critics
MERGER
synthesizes edits
COUNCIL
votes on ship
ESCALATE
on disagreement
Tests576
Critics4 (taste, pragmatism, intent, anti-drift)
Public URLNone (internal infra)
SHT 03

Eval discipline

01

Deterministic scoring

No LLM judge in the success path. Exact match, span-IoU, CI bounds. The metric does not hallucinate.

02

Adversarial verification

K skeptics review each finding before it counts. Prompt-injection traps in the labeled set catch jailbroken positives.

03

Honest nulls

Nulls are reported as primary findings. A lift that collapses on a fair rerun is reported as collapsed. make eval-dry reproduces offline.

SHT 04

Contact

Let's
talk.

ClassificationDoc No: TP-2026-ARev: AStatus: RELEASED