Prompt
What You’re Looking At
Expected Behavior
Why This Matters
Success Criteria
Model Verdicts
Evidence
All prompts were run directly via API at pass@1 (single-shot), with Max Tokens = 100K and Reasoning Effort = High. No agentic harness or coding agent wrapper was used, including Codex or Claude Code.