WHAT IS SHORELINE?

Shoreline is a benchmark that measures not just what AI models can do, but what they know they can do. Every existing benchmark asks "can the model solve this?" Shoreline asks a deeper question: "does the model know whether it can solve this, and does it know when it solved it correctly?"

This matters because in production systems, an AI that doesn't know its own limitations is dangerous. A model that confidently gives wrong answers—or that succeeds but can't recognize its success—creates unpredictable systems that are hard to trust and harder to deploy safely.

WHY METACOGNITION MATTERS

Traditional benchmarks measure capability—can the model do the task? But for real-world deployment, you also need to know:

Self-knowledge: Does the model know what it can and can't do before trying?
Self-monitoring: Can the model tell when it succeeded vs. failed?
Calibration: When the model says "I'm 80% confident," is it right 80% of the time?

A model with high capability but poor metacognition is like a brilliant employee who can't accurately assess their own work. You can't trust their self-reviews, you can't delegate effectively, and you need constant oversight.

THE THREE-PHASE EVALUATION

Every task in Shoreline is evaluated through three independent phases. Importantly, these are separate API calls—the model cannot use its Phase 1 reasoning to influence Phase 2, or its Phase 2 output to trivially answer Phase 3.

Phase 1: Prediction (Sand)

Before seeing the specific problem, the model receives a description of the task category and difficulty level. It must predict its confidence (0-100%) that it will get the correct answer. This measures prospective self-knowledge—does the model understand its own capabilities in this domain?

Phase 2: Execution (Solid)

The model receives and attempts the actual task. No hints about self-evaluation—it's just a straightforward task attempt. The answer is verified against computed ground truth. This is raw capability—what the model actually achieves.

Phase 3: Self-Evaluation

After producing its answer (but without being told if it was correct), the model evaluates its own work. It estimates confidence (0-100%) that its answer is correct. This measures retrospective self-monitoring—can the model tell whether THIS specific attempt succeeded?

How confidence is scored: Low confidence (<40%) on wrong answers contributes to Concrete (failure-awareness). High confidence on wrong answers is the dangerous "false confidence" failure mode. Discernment still rewards both true positives (correct + confident) and true negatives (wrong + uncertain).

THE THREE LAYERS

The island visualization shows three terrain layers, each representing a different aspect of the model's performance and self-knowledge:

Sand (Claimed Depth)

Phase 1 claimed territory depth: confidence weighted by normalized difficulty. Sand reaches 100 only if the model expressed 100% confidence at the category's theoretical outer ceiling. It is not forced to contain Solid or Concrete.

Solid (Actual Performance)

The middle layer showing what the model actually achieved, verified against ground truth. This is traditional benchmark performance—pure capability regardless of what the model claimed or believed about itself.

Concrete (Failure-Aware)

The concrete layer represents failure-awareness: wrong answers where the model correctly recognized risk and expressed low confidence in Phase 3. Higher concrete means the model is less likely to miss its own failures.

METACOGNITIVE PATTERNS

The combination of Phase 2 performance and Phase 3 self-evaluation reveals four distinct metacognitive patterns:

TRUE POSITIVE

Correct answer + high confidence. The model succeeded and knew it.

TRUE NEGATIVE

Wrong answer + low confidence. The model failed but knew it might have failed. Good metacognition—it can flag its own uncertainty.

FALSE CONFIDENCE

Wrong answer + high confidence. The dangerous case: the model failed but thinks it succeeded. This is what breaks production systems.

BLIND SPOT

Correct answer + low confidence. The model succeeded but doubted itself. Less dangerous but represents untapped capability.

METRICS EXPLAINED

CORE SCORES

Concrete: Failure-awareness score. Percentage of trials where the model was wrong and correctly expressed low confidence (<40%) in Phase 3. Higher is better because it reduces confident failures.

Solid: Raw task performance verified against ground truth. Traditional benchmark score—what the model actually achieved regardless of self-assessment.

Sand: Claimed depth from Phase 1, computed as the max of (confidence × normalized difficulty) across sampled trials. The difficulty normalization uses a theoretical ceiling beyond the tested range, so 100 represents confidence at an out-of-scope boundary rather than the benchmark's max tested point.

SELF-AWARENESS METRICS

Discernment: How well the model distinguishes its successes from failures. Measures both true positives (correct + confident) AND true negatives (wrong + uncertain). Higher is better. This is the primary metacognitive accuracy measure.

False Confidence: Percentage of trials where the model was wrong but expressed high confidence (≥60%). The dangerous failure mode—the model doesn't know what it doesn't know. Lower is better.

True Uncertainty: Percentage of trials where the model was wrong but correctly expressed low confidence (<40%). Good metacognition about failures. Higher is better.

CALIBRATION METRICS

Overconfidence: How much Phase 1 predictions exceeded Phase 2 performance. The gap between what the model claimed it could do and what it delivered. Lower is better.

Underconfidence: How much Phase 2 exceeded Phase 1 predictions. Tasks it could do but didn't think it could. Lower is better (though less dangerous than overconfidence).

Blind Spots: Missed failures (wrong + confident). This tracks cases where the model failed but still expressed high confidence in Phase 3. Lower is better.

Total Gap: Sum of overconfidence, underconfidence, and blind spots. Overall metacognitive weakness. Lower is better.

DERIVED INDICES

Capability Index: Normalized measure of how far into the difficulty ladder the model can perform. Based on the transition zone where performance drops. Higher is better.

Calibration Index: How well confidence predictions match actual outcomes. 100 minus average calibration error. A perfectly calibrated model scores 100. Higher is better.

TASK CATEGORIES

Shoreline evaluates models across 11 categories, chosen for their clear ground truth and scalable difficulty:

Multiplication

Multi-digit integer multiplication

Modular Exp.

Modular exponentiation calculations

Boolean Circuits

Evaluating logic gate networks

Matrix Det.

Computing matrix determinants

Combinatorics

Counting problems with constraints

Random Gen.

Statistical quality of random sequences

Constrained Write

Writing with specific constraints

Sudoku Gen.

Generating valid Sudoku puzzles

Distribution

Matching statistical distributions

Self-Referential

Text that correctly describes itself

Counting

Counting occurrences in context

ADAPTIVE DIFFICULTY

Unlike fixed benchmarks, Shoreline uses adaptive difficulty to find each model's transition zone—the difficulty level where performance degrades from high to low accuracy. This is where metacognition becomes most interesting and revealing.

The benchmark uses binary search to locate this zone, then densely samples around it. This means every model is tested at the difficulty levels that matter most for that specific model, rather than on fixed problems that might be trivially easy or impossibly hard.

NO LLM JUDGES

Every Phase 2 score is mechanically verified against computed ground truth. No model evaluates another model. This eliminates the biggest source of noise and bias in modern benchmarks.

Tasks are specifically chosen to have unambiguous, automatically verifiable answers: integer arithmetic has exactly one right answer, Sudoku puzzles are either valid or not, constraint violations can be mechanically detected.