THE FORGE

Discriminating tests for language models · assay office

“A test is only as good as the fake it catches.”

Each test below is a sealed pair: a brief you hand to the model under evaluation, and a key the model never sees. The key holds hard anchors with tolerances, fake-detector probes, a taste rubric, and a documented cheat pass proving that the obvious fake fails. Lineage: Ethan Mollick's isochronic-map challenge.

① Sealed split. Anchor values live only in the key. The brief demands a __probe surface but never names what gets probed, so hardcoding the graded cases is impossible.

② Cheat pass. Before sealing, the forge builds the laziest convincing fake and sharpens anchors until it fails decisively while a genuine mechanism passes.

③ Ledger. Every forged test is recorded; the next one must diverge in domain, mechanism, or pattern. New tests: /visual-test-forge in Claude Code.
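
A minimal sketch of how anchors with tolerances and a cheat pass might fit together. Everything here is an assumption made for illustration: the `Anchor` type, the `grade` function, the free-fall "genuine mechanism", the straight-line "lazy fake", and the anchor values themselves are hypothetical and do not come from any real key.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Anchor:
    """One sealed check: a probe input, its expected value, and the allowed deviation."""
    probe_input: float
    expected: float
    tolerance: float


def grade(probe: Callable[[float], float], anchors: Sequence[Anchor]) -> bool:
    """Return True only if every probed value lands within its anchor's tolerance."""
    return all(
        abs(probe(a.probe_input) - a.expected) <= a.tolerance for a in anchors
    )


# Hypothetical stand-in for a genuine mechanism: free-fall distance after t seconds.
def genuine(t: float) -> float:
    return 0.5 * 9.81 * t * t


# Hypothetical lazy fake: a plausible-looking straight line with no mechanism behind it.
def lazy_fake(t: float) -> float:
    return 5.0 * t


# Invented anchors for illustration only; real anchor values would stay in the key.
anchors = [Anchor(1.0, 4.905, 0.05), Anchor(3.0, 44.145, 0.05)]

genuine_passes = grade(genuine, anchors)   # the mechanism should clear every anchor
fake_passes = grade(lazy_fake, anchors)    # the cheat pass: the fake should fail
```

In this toy setup the fake already misses the first anchor, and the second anchor sits far from the straight line, so the failure is decisive rather than marginal. Because the graded inputs live only in `anchors`, an artifact that exposes a probe function cannot know which inputs to hardcode.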

Exhibit: the test that seeded the forge

Nº 000 · agentic · transit networks · live artifact

ISOCHRON: equal-time chart of the world

Mollick's prompt, executed: real rail timetables, nonstop flight networks, airport overheads and a Dijkstra wavefront over a land grid. The passing artifact itself can probe any point for its door-to-door itinerary.
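
As a rough illustration of the wavefront idea, the sketch below runs Dijkstra's algorithm over a small land grid. It is not the artifact's code: the `travel_times` function, the uniform `step_cost`, the 4-connected neighborhood, and the seed cells standing in for arrival points are all assumptions.

```python
import heapq


def travel_times(grid, seeds, step_cost=1.0):
    """Dijkstra wavefront over a 4-connected grid.

    grid[r][c] is True for land (passable) and False for water.
    seeds maps (row, col) -> arrival time at that cell, e.g. a station or
    airport reached by some earlier leg of the journey (hypothetical input).
    Returns a dict of the earliest arrival time for every reachable land cell.
    """
    rows, cols = len(grid), len(grid[0])
    best = dict(seeds)
    frontier = [(t, cell) for cell, t in seeds.items()]
    heapq.heapify(frontier)

    while frontier:
        t, (r, c) = heapq.heappop(frontier)
        if t > best.get((r, c), float("inf")):
            continue  # stale queue entry; a faster route was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc]:
                nt = t + step_cost
                if nt < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = nt
                    heapq.heappush(frontier, (nt, (nr, nc)))
    return best


# Toy 3x3 map: True = land, False = water.
grid = [
    [True, True, False],
    [False, True, False],
    [False, True, True],
]

# Two hypothetical entry points with different arrival times.
times = travel_times(grid, {(0, 0): 0.0, (2, 2): 1.5})
```

Seeding the queue with several cells at once lets wavefronts from different entry points merge, so each land cell ends up with the time of whichever route reaches it first. Water cells never enter the result, which is one way an isochrone map could leave oceans unshaded.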


Test registry

Ungraded queue

Drawing board

Strawmen · not yet forged

TEN CANDIDATES — sims, games, sound & 3D

Three-Body Ballet, Schelling's Neighborhood, The Trading Pit, Murmuration, Slingshot, The Counterpoint Machine, Bessel's Bells, Tissot's Mirror, Patient Zero, The Ant Economist. Each sketch names the mechanism, the lazy fake, the anchor that kills it, and the fun. Forge with /visual-test-forge.


Protocol: COPY BRIEF → paste into the model under test → grade the returned artifact against the key's console snippet and eyeball track. Keys are sealed and not published, because anchor values on the open web would leak into training data and burn the tests. Grading happens offline.