Overall Accuracy Across 8 Configurations
GSM-Level8 Benchmark | Round 6

*[Chart: vertical grouped bar of overall accuracy per configuration]*
This page compares eight peer systems: CoT-20B, Direct-20B-Low, Direct-20B-Med, Direct-20B-High, CoT-120B, Direct-120B-Low, Direct-120B-Med, and Direct-120B-High. The test questions are listed on the questions.html page. Each direct reasoning level is treated as an independent model configuration, not averaged into a single direct baseline.
The objective is to evaluate whether a structured chain-of-thought program can lift analytical performance beyond native direct-call behavior, especially for the smaller 20B model. Each direct call was asked to return a parseable numeric value under the low, medium, and high reasoning settings; when every parse attempt failed, the output was recorded as ERR. Correctness is scored with the benchmark rule used in these logs and then stress-tested under stricter tolerance thresholds, both absolute and percent. Because Level 8 tasks blend branching logic, numerical methods, and heavy distractors, this is intentionally a high-friction comparison designed to measure robustness, not just raw arithmetic precision.
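The parse step described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual parser; the helper name `parse_numeric` and the last-number-wins convention are assumptions.

```python
import re

def parse_numeric(raw: str):
    """Extract a numeric answer from a model's raw output, or "ERR".

    Hypothetical sketch of the parse step: direct calls must yield a
    parseable numeric value; if no number can be extracted, the run is
    logged as ERR.
    """
    cleaned = raw.replace(",", "")  # drop thousands separators
    matches = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    if not matches:
        return "ERR"           # no parseable number anywhere in the output
    return float(matches[-1])  # assumed convention: final number is the answer
```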
| Configuration | Family | Correct | Incorrect | ERR | Accuracy | ERR Rate |
|---|---|---|---|---|---|---|
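The table's derived columns can be reproduced from the raw counts. A minimal sketch, assuming (the page does not state this explicitly) that ERR runs count against accuracy, i.e. the denominator is all attempted questions:

```python
def summarize(correct: int, incorrect: int, err: int) -> dict:
    """Derive Accuracy and ERR Rate from per-configuration counts.

    Assumption: ERR outputs are scored as failures, so both rates use
    the full question count as denominator.
    """
    total = correct + incorrect + err
    if total == 0:
        return {"Accuracy": 0.0, "ERR Rate": 0.0}
    return {
        "Accuracy": correct / total,
        "ERR Rate": err / total,
    }
```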
Correctness rule used on this page: absolute error <= 0.5 OR percent error <= 1%.
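The rule above translates directly into a predicate. A minimal sketch; the zero-expected-value handling is an assumption, since the page does not say how percent error is defined when the reference answer is 0:

```python
def is_correct(predicted: float, expected: float,
               abs_tol: float = 0.5, pct_tol: float = 1.0) -> bool:
    """Benchmark rule: abs error <= 0.5 OR percent error <= 1%."""
    abs_err = abs(predicted - expected)
    if abs_err <= abs_tol:
        return True
    if expected != 0:  # assumed: percent error undefined at expected == 0
        return 100.0 * abs_err / abs(expected) <= pct_tol
    return False
```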
Charts rendered on this page:

- Vertical grouped bar chart of accuracy per configuration
- Horizontal bar chart by base model family
- Stacked bar chart: correct, incorrect, ERR counts
- Doughnut chart: overall correctness composition
- Multi-line chart with filled areas
- Radar chart of accuracy across absolute and percent tolerances
- Polar area chart of parse-failure (ERR) pressure
- Bubble chart: x = ERR rate, y = accuracy, bubble size = model size
- Filled multi-line progression chart
- Mixed bar + line chart by question ID
- Grouped bar chart by topic family
- Scatter chart of percent error by question
- Floating bar chart spanning the abs <= 0.01 to pct <= 5% tolerance range
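The tolerance stress test behind the radar and floating-bar views can be sketched as a sweep over (absolute, percent) threshold pairs. The helper name, the `None`-for-ERR encoding, and the intermediate sweep points between the two endpoints shown on the charts are assumptions:

```python
def accuracy_at(results, abs_tol: float, pct_tol: float) -> float:
    """Accuracy of (predicted, expected) pairs under one tolerance pair.

    ERR entries are encoded as predicted=None and never count as correct,
    but they stay in the denominator.
    """
    correct = 0
    for predicted, expected in results:
        if predicted is None:  # ERR: parse failed, scored as wrong
            continue
        abs_err = abs(predicted - expected)
        pct_ok = expected != 0 and 100.0 * abs_err / abs(expected) <= pct_tol
        if abs_err <= abs_tol or pct_ok:
            correct += 1
    return correct / len(results) if results else 0.0

# Sweep from the strictest band on the charts (abs <= 0.01) to the
# loosest (pct <= 5%); the middle points are illustrative only.
SWEEP = [(0.01, 0.0), (0.1, 0.5), (0.5, 1.0), (1.0, 5.0)]
```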