Shoreline
Mapping where capability meets self-knowledge
anthropic/claude-sonnet-4.5
HOW THIS ISLAND IS FORMED
Each spoke is one category. Layer depth is computed from all trials in that category after normalizing difficulty (0-100 scale).
Sand = what the model thinks it can do. Solid = what it actually gets right. Concrete = how well it catches its own mistakes.
Concrete is scaled from solid by mistake-awareness: concrete = solid × (caught mistakes / total mistakes).
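A minimal sketch of this scaling rule follows. The function name `concrete_depth` and its parameter names are hypothetical. The formula is undefined when a category has no mistakes at all, so the sketch assumes concrete equals solid in that case; the page may handle it differently.

```python
def concrete_depth(solid: float, caught_mistakes: int, total_mistakes: int) -> float:
    """Scale the solid layer by the fraction of mistakes the model caught.

    Implements: concrete = solid * (caught mistakes / total mistakes).
    """
    if caught_mistakes < 0 or caught_mistakes > total_mistakes:
        raise ValueError("caught_mistakes must be between 0 and total_mistakes")
    if total_mistakes == 0:
        # Assumption (not stated in the source): with no mistakes there is
        # nothing to discount, so the concrete layer equals the solid layer.
        return solid
    return solid * caught_mistakes / total_mistakes


# Illustrative values only, not taken from the tables.
print(concrete_depth(10.0, 3, 4))
```

Under this rule, a category in which no mistake is caught gets a concrete depth of 0 regardless of how large its solid layer is.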
Aggregate gap metrics: Overconfidence = max(0, Sand - Solid), Underconfidence = max(0, Solid - Sand), Blind Spots = answers that were wrong yet confident.
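The gap metrics can be sketched as below, under stated assumptions. The `Trial` record and its boolean `confident` flag are hypothetical, since the page does not say how a trial is judged confident or how per-category gaps are combined into an aggregate. The functions therefore compute the gaps for a single Sand/Solid pair and count blind spots over a list of trials.

```python
from typing import Iterable, NamedTuple


class Trial(NamedTuple):
    # Hypothetical per-trial record; the source does not define these fields.
    correct: bool
    confident: bool


def overconfidence(sand: float, solid: float) -> float:
    """How far self-assessment (sand) exceeds actual performance (solid)."""
    return max(0.0, sand - solid)


def underconfidence(sand: float, solid: float) -> float:
    """How far actual performance (solid) exceeds self-assessment (sand)."""
    return max(0.0, solid - sand)


def blind_spots(trials: Iterable[Trial]) -> int:
    """Count trials that were answered wrong while the model was confident."""
    return sum(1 for t in trials if not t.correct and t.confident)


# Sand and Solid values from the claude-sonnet-4.5 "Modular Exp." row.
print(round(overconfidence(21.4, 8.6), 1))
print(round(underconfidence(21.4, 8.6), 1))
```

Because both gaps are clipped at zero, at most one of them is non-zero for any given category.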
| Category | Trials | Sand | Solid | Concrete |
|---|---|---|---|---|
| Multiplication | 20 | 2.1 | 11.0 | 8.3 |
| Modular Exp. | 22 | 21.4 | 8.6 | 7.7 |
| Boolean Circuits | 18 | 46.4 | 77.4 | 77.4 |
| Matrix Det. | 16 | 19.3 | 10.9 | 9.1 |
| Combinatorics | 20 | 58.0 | 51.3 | 5.7 |
| Random Gen. | 18 | 34.8 | 51.1 | 22.4 |
| Constrained Write | 12 | 13.9 | 25.8 | 22.0 |
| Sudoku Gen. | 14 | 27.1 | 5.7 | 5.2 |
| Distribution | 18 | 19.3 | 4.7 | 3.1 |
| Self-Referential | 12 | 27.1 | 17.2 | 6.1 |
| Counting | 26 | 50.3 | 77.4 | 51.6 |
google/gemini-3-flash-preview
| Category | Trials | Sand | Solid | Concrete |
|---|---|---|---|---|
| Multiplication | 22 | 1.4 | 13.9 | 13.9 |
| Modular Exp. | 20 | 54.0 | 10.8 | 9.8 |
| Boolean Circuits | 18 | 46.4 | 77.4 | 77.4 |
| Matrix Det. | 16 | 16.9 | 17.4 | 14.2 |
| Combinatorics | 24 | 65.8 | 59.9 | 0.0 |
| Random Gen. | 18 | 77.4 | 28.3 | 11.5 |
| Constrained Write | 12 | 58.0 | 25.8 | 0.0 |
| Sudoku Gen. | 14 | 54.2 | 4.4 | 0.9 |
| Distribution | 18 | 73.5 | 4.7 | 1.3 |
| Self-Referential | 12 | 58.0 | 11.6 | 2.2 |
| Counting | 26 | 65.8 | 77.4 | 0.0 |
anthropic/claude-sonnet-4.5 leads on raw performance and self-awareness, pairing higher solid ground with tighter calibration.