THE FORGE
Discriminating tests for language models · assay office
“A test is only as good as the fake it catches.”
Each test below is a sealed pair: a brief you hand to the model under
evaluation, and a key the model never sees. The key holds hard anchors with
tolerances, fake-detector probes, a taste rubric, and a documented cheat pass
proving the obvious fake fails. Lineage: Ethan Mollick's isochronic-map challenge.
① Sealed split
Anchor values live only in the key. The brief demands a __probe surface but
never names what gets probed, so hardcoding the graded cases is impossible.
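As a rough illustration of the sealed split, the sketch below keeps anchors with tolerances on the key side and grades a submission only through a probe function. Everything here is an assumption made for illustration: the `Anchor` fields, the `grade` helper, the absolute-error tolerance, and the toy distance probe are hypothetical and are not the forge's actual key format or `__probe` signature.

```python
from dataclasses import dataclass
from typing import Callable

Query = tuple[float, float]


@dataclass(frozen=True)
class Anchor:
    """One sealed reference value (hypothetical key entry)."""
    query: Query       # input handed to the probe surface
    expected: float    # reference answer, stored only in the key
    tolerance: float   # allowed absolute error


def grade(probe: Callable[[Query], float], key: list[Anchor]) -> list[bool]:
    """Ask the artifact's probe about each sealed query; compare within tolerance."""
    return [abs(probe(a.query) - a.expected) <= a.tolerance for a in key]


# Toy key: distance from the origin, which the brief would never mention.
KEY = [Anchor((0.0, 0.0), 0.0, 0.5), Anchor((3.0, 4.0), 5.0, 0.5)]


def euclid_probe(q: Query) -> float:
    """A genuine mechanism: computes the answer for any query."""
    x, y = q
    return (x * x + y * y) ** 0.5


def constant_probe(q: Query) -> float:
    """A lazy fake: returns one plausible-looking value everywhere."""
    return 0.0


print(grade(euclid_probe, KEY))    # genuine mechanism meets both anchors
print(grade(constant_probe, KEY))  # fake survives only the trivial anchor
```

Because the submission sees only the probe's signature and never the queries, it has to answer arbitrary inputs rather than memorize a table.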
② Cheat pass
Before sealing, the forge builds the laziest convincing fake and sharpens
anchors until it fails decisively while a genuine mechanism passes.
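One way to picture the cheat pass is as a gate that refuses to seal until the anchors separate the fake from the real thing. The sketch below is an assumption-laden toy: `miss_ratio`, `ready_to_seal`, the `margin` of 3 tolerances standing in for "decisively", and the quadratic-versus-linear example are all hypothetical, not the forge's procedure.

```python
def miss_ratio(probe, anchors):
    """Worst error across anchors, measured in units of each anchor's tolerance."""
    return max(abs(probe(q) - expected) / tol for q, expected, tol in anchors)


def ready_to_seal(genuine, fake, anchors, margin=3.0):
    """Seal only if the genuine mechanism is within tolerance everywhere
    and the fake misses at least one anchor by `margin` tolerances."""
    return miss_ratio(genuine, anchors) <= 1.0 and miss_ratio(fake, anchors) >= margin


def genuine(x):
    return x * x      # stand-in for the real mechanism


def fake(x):
    return 2 * x      # lazy fake that happens to agree at x = 0 and x = 2


# Each anchor is (query, expected, tolerance).
weak_anchors = [(0.0, 0.0, 0.1), (2.0, 4.0, 0.1)]
sharp_anchors = weak_anchors + [(3.0, 9.0, 0.1)]

print(ready_to_seal(genuine, fake, weak_anchors))   # False: the fake still passes
print(ready_to_seal(genuine, fake, sharp_anchors))  # True: the added anchor kills it
```

The weak anchor set shows why the pass matters: a fake can agree with the truth at every graded point until an anchor is placed where the two mechanisms diverge.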
③ Ledger
Every forged test is recorded; the next one must diverge in domain,
mechanism, or pattern. To forge a new test, run /visual-test-forge in Claude Code.
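The ledger rule can be read as a simple admissibility check. The sketch below assumes that "diverge" means differing from every recorded entry on at least one of the three axes; that reading, the `Entry` record, and the example values are hypothetical and not taken from the forge's ledger.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One ledger row (hypothetical shape)."""
    domain: str
    mechanism: str
    pattern: str


def divergent_axes(candidate: Entry, past: Entry) -> list[str]:
    """Names of the axes on which the candidate differs from a past entry."""
    return [axis for axis in Entry._fields
            if getattr(candidate, axis) != getattr(past, axis)]


def admissible(candidate: Entry, ledger: list[Entry]) -> bool:
    """True if the candidate differs from every recorded test on some axis."""
    return all(divergent_axes(candidate, past) for past in ledger)


# Illustrative values only.
ledger = [Entry("transit networks", "shortest-path wavefront", "live map")]
repeat = Entry("transit networks", "shortest-path wavefront", "live map")
fresh = Entry("transit networks", "agent-based simulation", "live map")

print(admissible(repeat, ledger))         # False: identical on all three axes
print(divergent_axes(fresh, ledger[0]))   # ['mechanism']
```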
Nº 000 · agentic · transit networks · live artifact
ISOCHRON: equal-time chart of the world
Mollick's prompt, executed: real rail timetables, nonstop flight
networks, airport overheads and a Dijkstra wavefront over a land grid. The passing
artifact itself can probe any point for its door-to-door itinerary.
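For readers unfamiliar with the term, a Dijkstra wavefront over a grid can be sketched as below. This is a minimal toy, not the ISOCHRON artifact's code: the `wavefront` function, the 4-neighbour grid, the per-cell crossing costs in minutes, and the use of `None` for impassable cells are all assumptions. Seeding `sources` with several cells and their arrival times would be one way to model airports or stations, each starting its own front.

```python
import heapq


def wavefront(grid, sources):
    """Earliest arrival time at every reachable cell.

    grid[r][c] is the cost in minutes to enter a cell, or None if impassable.
    sources maps (row, col) to the time at which a front starts there.
    """
    rows, cols = len(grid), len(grid[0])
    best = dict(sources)
    heap = [(t, cell) for cell, t in sources.items()]
    heapq.heapify(heap)
    while heap:
        t, (r, c) = heapq.heappop(heap)
        if t > best.get((r, c), float("inf")):
            continue  # stale queue entry; a faster route was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] is not None:
                nt = t + grid[nr][nc]
                if nt < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = nt
                    heapq.heappush(heap, (nt, (nr, nc)))
    return best


# Toy land grid: the centre cell is impassable, the far corner is slow terrain.
grid = [
    [1, 1, 1],
    [1, None, 1],
    [1, 1, 5],
]
times = wavefront(grid, {(0, 0): 0})
print(times[(2, 2)])  # 8: three steps around the obstacle, then the slow cell
```

Cells with equal arrival time form one contour of the equal-time chart, and storing a predecessor per cell would let a probe reconstruct the route to any point.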
Strawmen · not yet forged
TEN CANDIDATES — sims, games, sound & 3D
Three-Body Ballet, Schelling's Neighborhood, The Trading Pit,
Murmuration, Slingshot, The Counterpoint Machine, Bessel's Bells, Tissot's Mirror,
Patient Zero, The Ant Economist. Each sketch names the mechanism, the lazy fake,
the anchor that kills it, and the fun. Forge with /visual-test-forge.