WHAT IS SHORELINE?
Shoreline is a benchmark that measures not just what AI models can do, but what they know they can do. Every existing benchmark asks "can the model solve this?" Shoreline asks a deeper question: "does the model know whether it can solve this, and does it know when it solved it correctly?"
This matters because in production systems, an AI that doesn't know its own limitations is dangerous. A model that confidently gives wrong answers—or that succeeds but can't recognize its success—creates unpredictable systems that are hard to trust and harder to deploy safely.
WHY METACOGNITION MATTERS
Traditional benchmarks measure capability—can the model do the task? But for real-world deployment, you also need to know:
- Self-knowledge: Does the model know what it can and can't do before trying?
- Self-monitoring: Can the model tell when it succeeded vs. failed?
- Calibration: When the model says "I'm 80% confident," is it right 80% of the time?
A model with high capability but poor metacognition is like a brilliant employee who can't accurately assess their own work. You can't trust their self-reviews, you can't delegate effectively, and you need constant oversight.
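The calibration bullet can be checked mechanically: group answers by stated confidence and compare each group's average confidence with its actual accuracy. The sketch below is a generic reliability check, not one of Shoreline's published metrics; the function name, bin width, and output format are illustrative.

    # Generic calibration (reliability) check; not Shoreline's exact metric.
    # confidences: stated confidence per task (0-100); corrects: True/False per task.
    def reliability_table(confidences, corrects, bin_width=10):
        bins = {}
        for conf, ok in zip(confidences, corrects):
            b = min(int(conf // bin_width), (100 // bin_width) - 1)
            bins.setdefault(b, []).append((conf, ok))
        table = []
        for b in sorted(bins):
            items = bins[b]
            avg_conf = sum(c for c, _ in items) / len(items)
            accuracy = 100.0 * sum(1 for _, ok in items if ok) / len(items)
            # The model is well calibrated when avg_conf and accuracy are close
            # in every bin (e.g. ~80% accuracy in the 80-90 confidence bin).
            table.append({"bin": b, "avg_confidence": avg_conf,
                          "accuracy": accuracy, "count": len(items)})
        return table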
THE THREE-PHASE EVALUATION
Every task in Shoreline is evaluated through three independent phases. Importantly, these are separate API calls—the model cannot use its Phase 1 reasoning to influence Phase 2, or its Phase 2 output to trivially answer Phase 3.
Phase 1: Prediction (Sand)
Before seeing the specific problem, the model receives a description of the task category and difficulty level. It must predict its confidence (0-100%) that it will get the correct answer. This measures prospective self-knowledge—does the model understand its own capabilities in this domain?
Phase 2: Execution (Solid)
The model receives and attempts the actual task. No hints about self-evaluation—it's just a straightforward task attempt. The answer is verified against computed ground truth. This is raw capability—what the model actually achieves.
Phase 3: Self-Evaluation
After producing its answer (but without being told if it was correct), the model evaluates its own work. It estimates confidence (0-100%) that its answer is correct. This measures retrospective self-monitoring—can the model tell whether THIS specific attempt succeeded?
How confidence is scored: Low confidence (<40%) on wrong answers contributes to Concrete (failure-awareness). High confidence on wrong answers is the dangerous "false confidence" failure mode. The Discernment metric rewards both true positives (correct + confident) and true negatives (wrong + uncertain).
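To make the phase separation concrete, here is a minimal sketch of the loop, assuming a generic ask_model(prompt) callable that makes one independent API call per phase. The Task fields, prompt wording, and numeric parsing are assumptions for illustration, not Shoreline's actual interface.

    # Illustrative three-phase loop; ask_model(prompt) is a hypothetical callable
    # that sends one independent API call and returns the model's text reply.
    from dataclasses import dataclass

    @dataclass
    class Task:
        category: str      # e.g. "integer arithmetic"
        difficulty: int    # difficulty level within the category
        prompt: str        # the concrete problem statement
        ground_truth: str  # precomputed answer used for mechanical verification

    def evaluate_task(task: Task, ask_model) -> dict:
        # Phase 1: prediction. Only category and difficulty are shown, never the
        # concrete problem. This is its own call with no shared context.
        predicted = float(ask_model(
            f"Task category: {task.category}, difficulty: {task.difficulty}. "
            "How confident (0-100) are you that you would answer correctly?"
        ))

        # Phase 2: execution. A plain attempt at the task, with no mention of
        # self-evaluation; again a fresh call.
        answer = ask_model(task.prompt)
        correct = answer.strip() == task.ground_truth  # mechanical check

        # Phase 3: self-evaluation. The model sees its own answer but is never
        # told whether it was correct.
        self_conf = float(ask_model(
            f"Problem: {task.prompt}\nYour answer: {answer}\n"
            "How confident (0-100) are you that this answer is correct?"
        ))

        return {
            "predicted_confidence": predicted,  # Phase 1
            "correct": correct,                 # Phase 2
            "self_confidence": self_conf,       # Phase 3
        }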
THE THREE LAYERS
The island visualization shows three terrain layers, each representing a different aspect of the model's performance and self-knowledge:
Sand (Claimed Depth)
The outer layer showing the depth of territory the model claimed in Phase 1: its confidence weighted by normalized difficulty. Sand reaches 100 only if the model expressed 100% confidence at the category's theoretical outer ceiling. It is not forced to contain Solid or Concrete.
Solid (Actual Performance)
The middle layer showing what the model actually achieved, verified against ground truth. This is traditional benchmark performance—pure capability regardless of what the model claimed or believed about itself.
Concrete (Failure-Aware)
The concrete layer represents failure-awareness: wrong answers where the model correctly recognized the risk and expressed low confidence in Phase 3. A higher Concrete score means the model is less likely to miss its own failures.
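As a rough per-task reading of the three layers, the sketch below assumes difficulty is normalized against the category's ceiling and reuses the <40% low-confidence threshold quoted above; Shoreline's actual formulas and aggregation across tasks may differ.

    # Rough per-task sketch of the three layers, under the stated assumptions.
    # `result` is the dict returned by evaluate_task() in the earlier sketch.
    def layer_contributions(result, difficulty, ceiling, low_conf=40.0):
        norm = difficulty / ceiling  # 0..1, where 1 is the category's outer ceiling

        # Sand: Phase 1 confidence weighted by normalized difficulty.
        # Reaches 100 only for 100% confidence at the theoretical ceiling.
        sand = result["predicted_confidence"] * norm

        # Solid: verified capability, independent of any confidence claims.
        solid = 100.0 * norm if result["correct"] else 0.0

        # Concrete: failure-awareness, i.e. a wrong answer flagged with low
        # Phase 3 confidence.
        aware_of_failure = (not result["correct"]
                            and result["self_confidence"] < low_conf)
        concrete = 100.0 * norm if aware_of_failure else 0.0

        return {"sand": sand, "solid": solid, "concrete": concrete}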
METACOGNITIVE PATTERNS
The combination of Phase 2 performance and Phase 3 self-evaluation reveals four distinct metacognitive patterns:
TRUE POSITIVE
Correct answer + high confidence. The model succeeded and knew it.
TRUE NEGATIVE
Wrong answer + low confidence. The model failed but knew it might have failed. Good metacognition—it can flag its own uncertainty.
FALSE CONFIDENCE
Wrong answer + high confidence. The dangerous case: the model failed but thinks it succeeded. This is what breaks production systems.
BLIND SPOT
Correct answer + low confidence. The model succeeded but doubted itself. Less dangerous but represents untapped capability.
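These four patterns can be tallied directly from Phase 2 correctness and Phase 3 confidence. The sketch below treats confidence at or above 40% as "high", mirroring the <40% low-confidence threshold quoted earlier; the exact cutoff Shoreline uses for "high confidence" is an assumption here.

    # Classify each (correct, Phase 3 confidence) pair into one of the four
    # patterns and count them across a run. The 40% cutoff mirrors the
    # low-confidence threshold above; treat it as an assumption.
    from collections import Counter

    def metacognitive_pattern(correct: bool, self_confidence: float) -> str:
        confident = self_confidence >= 40.0
        if correct and confident:
            return "true_positive"      # succeeded and knew it
        if not correct and not confident:
            return "true_negative"      # failed, but flagged its own uncertainty
        if not correct and confident:
            return "false_confidence"   # failed but thinks it succeeded
        return "blind_spot"             # succeeded but doubted itself

    def pattern_counts(results) -> Counter:
        # results: iterable of dicts with "correct" and "self_confidence" keys,
        # e.g. the output of evaluate_task() above.
        return Counter(
            metacognitive_pattern(r["correct"], r["self_confidence"])
            for r in results
        )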
METRICS EXPLAINED
Shoreline's metrics fall into four groups: core scores, self-awareness metrics, calibration metrics, and derived indices.
TASK CATEGORIES
Shoreline evaluates models across 11 categories, chosen for their clear ground truth and scalable difficulty.
ADAPTIVE DIFFICULTY
Unlike fixed benchmarks, Shoreline uses adaptive difficulty to find each model's transition zone—the difficulty range where its accuracy falls from high to low. This is where metacognition becomes most interesting and revealing.
The benchmark uses binary search to locate this zone, then densely samples around it. This means every model is tested at the difficulty levels that matter most for that specific model, rather than on fixed problems that might be trivially easy or impossibly hard.
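A minimal sketch of that search, assuming accuracy decreases roughly monotonically with difficulty and that accuracy_at(level) runs a batch of tasks at a given level and returns the accuracy; the names, bounds, and 50% threshold are illustrative.

    # Illustrative transition-zone search. accuracy_at(level) is a hypothetical
    # callable that runs a batch of tasks at that difficulty and returns accuracy.
    def find_transition_zone(accuracy_at, lo=1, hi=100, threshold=0.5):
        # Binary search for the first difficulty where accuracy falls below the
        # threshold, assuming accuracy decreases roughly monotonically.
        while lo < hi:
            mid = (lo + hi) // 2
            if accuracy_at(mid) >= threshold:
                lo = mid + 1   # still succeeding here: the zone lies deeper
            else:
                hi = mid       # already failing: the zone lies shallower
        return lo

    def dense_sample(zone, width=3):
        # Test every difficulty level in a small window around the zone.
        return list(range(max(1, zone - width), zone + width + 1))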
NO LLM JUDGES
Every Phase 2 score is mechanically verified against computed ground truth. No model evaluates another model. This eliminates one of the biggest sources of noise and bias in modern benchmarks.
Tasks are specifically chosen to have unambiguous, automatically verifiable answers: integer arithmetic has exactly one right answer, Sudoku puzzles are either valid or not, constraint violations can be mechanically detected.
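Two toy verifiers in that spirit, shown only to illustrate what mechanical verification looks like; the real checkers may differ in detail.

    # Toy examples of mechanical verification (no LLM involved).
    def verify_arithmetic(model_answer: str, ground_truth: int) -> bool:
        # Integer arithmetic has exactly one right answer: parse and compare.
        try:
            return int(model_answer.strip()) == ground_truth
        except ValueError:
            return False

    def verify_sudoku(grid) -> bool:
        # A completed 9x9 grid is valid iff every row, column, and 3x3 box
        # contains the digits 1-9 exactly once.
        units = [list(row) for row in grid]                             # rows
        units += [[grid[r][c] for r in range(9)] for c in range(9)]     # columns
        units += [[grid[r][c]
                   for r in range(br, br + 3)
                   for c in range(bc, bc + 3)]
                  for br in range(0, 9, 3) for bc in range(0, 9, 3)]    # 3x3 boxes
        return all(sorted(unit) == list(range(1, 10)) for unit in units)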