METACOGNITIVE BENCHMARK

Shoreline

Mapping where capability meets self-knowledge

Concrete: how well it catches its own mistakes.
Solid: what it actually gets right.
Sand: what it thinks it can do.

anthropic/claude-sonnet-4.5

196 trials
0.5% invalid confidence
HOW THIS ISLAND IS FORMED

Each spoke is one category. Layer depth is computed from all trials in that category after normalizing difficulty (0-100 scale).

Sand = what the model thinks it can do. Solid = what it actually gets right. Concrete = how well it catches its own mistakes.

Concrete is scaled from solid by mistake-awareness: concrete = solid × (caught mistakes / total mistakes).

Aggregate gap metrics: Overconfidence = max(0, Sand - Solid); Underconfidence = max(0, Solid - Sand); Blind Spots = the share of trials answered wrong but with high confidence.
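The layer and gap formulas above can be sketched as follows. This is an illustrative sketch only; the function names are mine, not the benchmark's, and the zero-mistakes fallback is an assumption.

```python
# Sketch of the Shoreline layer/gap formulas described above.
# Function names are illustrative, not from the benchmark's codebase.

def concrete_from_solid(solid: float, caught: int, total_mistakes: int) -> float:
    """Concrete = solid scaled by mistake-awareness (caught / total mistakes)."""
    if total_mistakes == 0:
        return solid  # no mistakes to catch; full credit (assumption)
    return solid * (caught / total_mistakes)

def gap_metrics(sand: float, solid: float) -> tuple[float, float]:
    """(overconfidence, underconfidence) on the shared 0-100 scale."""
    return max(0.0, sand - solid), max(0.0, solid - sand)

# e.g. a category with sand=58.0 and solid=51.3 shows
# about 6.7 points of overconfidence and no underconfidence
```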

Category            Trials   Sand   Solid   Concrete
Multiplication          20    2.1    11.0        8.3
Modular Exp.            22   21.4     8.6        7.7
Boolean Circuits        18   46.4    77.4       77.4
Matrix Det.             16   19.3    10.9        9.1
Combinatorics           20   58.0    51.3        5.7
Random Gen.             18   34.8    51.1       22.4
Constrained Write       12   13.9    25.8       22.0
Sudoku Gen.             14   27.1     5.7        5.2
Distribution            18   19.3     4.7        3.1
Self-Referential        12   27.1    17.2        6.1
Counting                26   50.3    77.4       51.6
CONCRETE          19.9   catches own mistakes
SOLID             31.0   actually gets right
SAND              29.1   thinks it can do
OVERCONFIDENCE     6.7   claimed beyond solid
UNDERCONFIDENCE    8.7   solid beyond claimed
BLIND SPOTS       17.7   wrong but confident
TOTAL GAP         33.1   claim/perf misalignment + blind spots
CAPABILITY IDX    21.8   normalized difficulty frontier
DISCERNMENT       72.1   correctly detects success/failure
CALIBRATION IDX   80.4   confidence vs realized accuracy
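The aggregate figures are internally consistent: Total Gap appears to equal Overconfidence + Underconfidence + Blind Spots (6.7 + 8.7 + 17.7 = 33.1). This additive relation is inferred from the displayed values, not from any stated formula; a quick check:

```python
# Sanity check of the inferred relation:
# total gap = overconfidence + underconfidence + blind spots
overconfidence, underconfidence, blind_spots = 6.7, 8.7, 17.7
total_gap = round(overconfidence + underconfidence + blind_spots, 1)
print(total_gap)  # 33.1
```

The same relation holds for the second model below (26.9 + 5.0 + 37.7 = 69.6).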

google/gemini-3-flash-preview

200 trials

Category            Trials   Sand   Solid   Concrete
Multiplication          22    1.4    13.9       13.9
Modular Exp.            20   54.0    10.8        9.8
Boolean Circuits        18   46.4    77.4       77.4
Matrix Det.             16   16.9    17.4       14.2
Combinatorics           24   65.8    59.9        0.0
Random Gen.             18   77.4    28.3       11.5
Constrained Write       12   58.0    25.8        0.0
Sudoku Gen.             14   54.2     4.4        0.9
Distribution            18   73.5     4.7        1.3
Self-Referential        12   58.0    11.6        2.2
Counting                26   65.8    77.4        0.0
CONCRETE          11.9   catches own mistakes
SOLID             30.1   actually gets right
SAND              51.9   thinks it can do
OVERCONFIDENCE    26.9   claimed beyond solid
UNDERCONFIDENCE    5.0   solid beyond claimed
BLIND SPOTS       37.7   wrong but confident
TOTAL GAP         69.6   claim/perf misalignment + blind spots
CAPABILITY IDX    26.5   normalized difficulty frontier
DISCERNMENT       58.3   correctly detects success/failure
CALIBRATION IDX   64.6   confidence vs realized accuracy
COMPARATIVE INSIGHT

anthropic/claude-sonnet-4.5 leads on self-knowledge while matching raw performance: comparable solid ground (31.0 vs 30.1) paired with tighter calibration (80.4 vs 64.6), higher discernment (72.1 vs 58.3), and less than half the total gap (33.1 vs 69.6).