Overall Accuracy Across 8 Configurations
GSM-Level8 Benchmark | Round 6

*[Chart: vertical grouped bar of overall accuracy per configuration]*
This page compares eight peer systems: CoT-20B, Direct-20B-Low, Direct-20B-Med, Direct-20B-High, CoT-120B, Direct-120B-Low, Direct-120B-Med, and Direct-120B-High. The test questions are listed on the questions.html page. Each direct reasoning level is treated as an independent model configuration, not averaged into a single direct baseline.
The objective is to evaluate whether a structured chain-of-thought program can lift analytical performance beyond native direct-call behavior, especially for the smaller 20B model. Each direct call was asked to return a parseable numeric value under the low, medium, and high reasoning settings; when every parse attempt failed, the output was recorded as ERR. Correctness is scored with the benchmark rule used in these logs and then stress-tested under stricter tolerance thresholds, both absolute and percent. Because Level 8 tasks blend branching logic, numerical methods, and heavy distractors, this is intentionally a high-friction comparison designed to measure robustness, not just raw arithmetic precision.
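The parse step described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual parser; the helper name `parse_numeric` and the last-number-wins convention are assumptions.

```python
import re

def parse_numeric(raw: str):
    """Extract a numeric answer from a model's raw output, or "ERR".

    Hypothetical sketch of the parse step: direct calls must yield a
    parseable numeric value; if no number can be extracted, the run is
    logged as ERR.
    """
    cleaned = raw.replace(",", "")  # drop thousands separators
    matches = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    if not matches:
        return "ERR"           # no parseable number anywhere in the output
    return float(matches[-1])  # assumed convention: final number is the answer
```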
| Configuration | Family | Correct | Incorrect | ERR | Accuracy | ERR Rate |
|---|---|---|---|---|---|---|
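The table's derived columns can be reproduced from the raw counts. A minimal sketch, assuming (the page does not state this explicitly) that ERR runs count against accuracy, i.e. the denominator is all attempted questions:

```python
def summarize(correct: int, incorrect: int, err: int) -> dict:
    """Derive Accuracy and ERR Rate from per-configuration counts.

    Assumption: ERR outputs are scored as failures, so both rates use
    the full question count as denominator.
    """
    total = correct + incorrect + err
    if total == 0:
        return {"Accuracy": 0.0, "ERR Rate": 0.0}
    return {
        "Accuracy": correct / total,
        "ERR Rate": err / total,
    }
```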
Correctness rule used on this page: absolute error <= 0.5 OR percent error <= 1%.
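The rule above translates directly into a predicate. A minimal sketch; the zero-expected-value handling is an assumption, since the page does not say how percent error is defined when the reference answer is 0:

```python
def is_correct(predicted: float, expected: float,
               abs_tol: float = 0.5, pct_tol: float = 1.0) -> bool:
    """Benchmark rule: abs error <= 0.5 OR percent error <= 1%."""
    abs_err = abs(predicted - expected)
    if abs_err <= abs_tol:
        return True
    if expected != 0:  # assumed: percent error undefined at expected == 0
        return 100.0 * abs_err / abs(expected) <= pct_tol
    return False
```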
Charts rendered on this page:

- Vertical grouped bar chart of accuracy per configuration
- Horizontal bar chart by base model family
- Stacked bar chart: correct, incorrect, ERR counts
- Doughnut chart: overall correctness composition
- Multi-line chart with filled areas
- Radar chart of accuracy across absolute and percent tolerances
- Polar area chart of parse-failure (ERR) pressure
- Bubble chart: x = ERR rate, y = accuracy, bubble size = model size
- Filled multi-line progression chart
- Mixed bar + line chart by question ID
- Grouped bar chart by topic family
- Scatter chart of percent error by question
- Floating bar chart spanning the abs <= 0.01 to pct <= 5% tolerance range
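The tolerance stress test behind the radar and floating-bar views can be sketched as a sweep over (absolute, percent) threshold pairs. The helper name, the `None`-for-ERR encoding, and the intermediate sweep points between the two endpoints shown on the charts are assumptions:

```python
def accuracy_at(results, abs_tol: float, pct_tol: float) -> float:
    """Accuracy of (predicted, expected) pairs under one tolerance pair.

    ERR entries are encoded as predicted=None and never count as correct,
    but they stay in the denominator.
    """
    correct = 0
    for predicted, expected in results:
        if predicted is None:  # ERR: parse failed, scored as wrong
            continue
        abs_err = abs(predicted - expected)
        pct_ok = expected != 0 and 100.0 * abs_err / abs(expected) <= pct_tol
        if abs_err <= abs_tol or pct_ok:
            correct += 1
    return correct / len(results) if results else 0.0

# Sweep from the strictest band on the charts (abs <= 0.01) to the
# loosest (pct <= 5%); the middle points are illustrative only.
SWEEP = [(0.01, 0.0), (0.1, 0.5), (0.5, 1.0), (1.0, 5.0)]
```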