
Methodology

How It Works

The benchmark that models can't game

Fair Comparison

Every model receives identical inputs, time constraints, and environmental conditions. No prompt engineering advantages. No cherry-picked scenarios. The same challenge, the same rules, the same opportunity to succeed or fail.

Real-Time Reasoning

Watch each model's decision process as it happens. See the reasoning traces, the alternatives considered, the final choices made. Full transparency into how frontier models think.
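
To make this concrete, here is a minimal sketch of what one logged trace entry could look like. The class and field names are hypothetical, not the project's actual schema; they only illustrate recording the reasoning, the alternatives considered, and the final choice.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningStep:
    """One step of a model's decision process (hypothetical schema)."""
    thought: str                                            # the model's stated reasoning
    alternatives: list[str] = field(default_factory=list)   # options it considered
    chosen_action: str = ""                                 # the action it committed to

@dataclass
class ReasoningTrace:
    """Full trace for one model on one environment episode."""
    model: str
    environment: str
    steps: list[ReasoningStep] = field(default_factory=list)
```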

15 Cognitive Challenges

From spatial reasoning to social intelligence, each environment tests a different aspect of general intelligence. Together, they form a comprehensive picture of model capabilities.
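
A common way to keep fifteen very different challenges comparable is to run them all behind one interface. The sketch below is illustrative only, assuming a reset/step loop; the benchmark's real environment API may differ.

```python
from abc import ABC, abstractmethod

class CognitiveEnvironment(ABC):
    """Minimal interface a challenge environment might expose (illustrative)."""

    @abstractmethod
    def reset(self, seed: int) -> str:
        """Build a fresh instance from a seed and return the initial observation."""

    @abstractmethod
    def step(self, action: str) -> tuple[str, float, bool]:
        """Apply the model's action; return (observation, reward, done)."""
```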

Transparent Scoring

Every metric is public. Win rates, average scores, head-to-head records, environment-specific performance. The data speaks for itself.
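
As a sketch of how such metrics can be derived from public match logs, the function below computes win rates and head-to-head records. The record format (model_a, model_b, winner keys) is an assumption for illustration, not the site's actual log schema.

```python
from collections import defaultdict

def summarize(results):
    """Aggregate win rates and head-to-head records from per-match logs."""
    wins = defaultdict(int)
    games = defaultdict(int)
    head_to_head = defaultdict(lambda: defaultdict(int))

    for r in results:
        for m in (r["model_a"], r["model_b"]):
            games[m] += 1
        if r["winner"] is not None:
            wins[r["winner"]] += 1
            loser = r["model_b"] if r["winner"] == r["model_a"] else r["model_a"]
            head_to_head[r["winner"]][loser] += 1

    win_rate = {m: wins[m] / games[m] for m in games}   # draws count as played games
    return win_rate, {w: dict(losses) for w, losses in head_to_head.items()}
```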

Our Methodology

Identical Prompts

Each model receives the exact same system prompt and environmental context. No model-specific optimizations.
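
In practice this usually means one shared prompt template for every model. The snippet below is a minimal sketch of that idea; the prompt text and message format are placeholders, not the benchmark's real prompt.

```python
SYSTEM_PROMPT = (
    "You are an agent in a benchmark environment. "
    "Observe the state, reason step by step, then output a single action."
)

def build_messages(observation: str) -> list[dict]:
    """Every model gets this exact message list; nothing is tuned per model."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": observation},
    ]
```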

Same Compute Budget

All models get equal time to respond. No advantages from faster inference.
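
One simple way to enforce an equal budget is to wrap every model call in the same wall-clock timeout, as in the sketch below. The timeout value and the treatment of overruns are assumptions for illustration.

```python
import concurrent.futures

RESPONSE_TIMEOUT_S = 60  # identical wall-clock budget for every model (illustrative value)

def timed_call(model_fn, messages):
    """Run one model call under a fixed time limit; exceeding it counts as no response."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(model_fn, messages)
    try:
        return future.result(timeout=RESPONSE_TIMEOUT_S)
    except concurrent.futures.TimeoutError:
        return None  # scored as a failure to respond
    finally:
        pool.shutdown(wait=False)  # don't block on a stuck call
```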

Reproducible Results

Every run is logged with seeds and parameters. Anyone can verify our results.
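
A minimal sketch of what such a run log could contain is shown below, assuming one JSONL record per run. The field names are hypothetical; the point is that seed and sampling parameters travel with every result so a run can be replayed.

```python
import json
import time

def log_run(path, *, model, environment, seed, temperature, max_tokens, score):
    """Append one fully specified run record so anyone can re-run and verify it."""
    record = {
        "timestamp": time.time(),
        "model": model,
        "environment": environment,
        "seed": seed,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "score": score,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")  # JSONL: one verifiable run per line
```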

No Training on Test Data

Environments are procedurally generated. No model has seen these specific challenges before.
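
To illustrate the idea, the sketch below generates a small grid-world deterministically from a seed: the published seed reproduces the exact instance, but the instance itself never exists anywhere before the run. This is a toy example, not one of the actual fifteen environments.

```python
import random

def generate_grid(seed: int, size: int = 8):
    """Deterministically build a fresh grid-world instance from a seed (toy example)."""
    rng = random.Random(seed)
    grid = [["#" if rng.random() < 0.25 else "." for _ in range(size)]
            for _ in range(size)]
    grid[0][0] = "S"                    # start corner
    grid[size - 1][size - 1] = "G"      # goal corner
    return grid
```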

“The goal isn't to crown a winner. It's to understand how different architectures approach intelligence.”

— ClaudeRL Research Team