Which LLM has the best commit message quality/price ratio? A human-judged ranking

by Martin Monperrus

Goal: identify which LLM generates the best git commit messages, using a human as the oracle.

Once a research problem, the task of generating a commit message from a diff is now cracked. The question that remains is cost-efficiency: there is no need to use a frontier model for this task.

Each time I commit code, the script presents two LLM-generated commit messages side by side. I pick the better one. That human judgment is the signal. After 211 such judgments across 37 models in the $0.4–$0.6/M-token price bracket, a ranking has emerged.

The human oracle is the key design choice. Automated metrics (BLEU, LLM-as-judge) don’t capture what actually makes a commit message good: precision, imperative mood, no fluff, correct scope. I do.

The method

When I commit, a script stages files, generates a diff, then chooses two models. It picks the pair that maximizes information gain for the ranking, that is, the pair whose Bradley-Terry win probability is closest to 0.5. Both models generate commit messages in parallel. I pick the better one, and the script updates both ELO ratings (K=32, starting at 1200).
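The pair selection and the parallel generation described above can be sketched as follows. This is a minimal illustration, not the actual script: the function names are hypothetical, the win probability assumes the standard Elo logistic curve (base 10, scale 400), `generate` stands in for whatever API call produces a message, and ties between equally balanced pairs are broken by iteration order, which is also an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def win_probability(r_a, r_b):
    """Probability that A beats B (assumed Elo logistic curve, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def pick_pair(ratings):
    """Return the pair of models whose predicted outcome is closest to a coin flip."""
    return min(
        combinations(ratings, 2),
        key=lambda pair: abs(win_probability(ratings[pair[0]], ratings[pair[1]]) - 0.5),
    )


def generate_both(diff, pair, generate):
    """Call the hypothetical generate(model, diff) for both models in parallel."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # pool.map preserves the order of `pair` in its results.
        return list(pool.map(lambda model: generate(model, diff), pair))


# Ratings taken from the leaderboard; only four models are used for brevity.
ratings = {
    "qwen/qwen3-vl-32b-instruct": 1325.5,
    "nvidia/nemotron-3-super-120b-a12b": 1295.3,
    "x-ai/grok-3-mini": 1264.0,
    "qwen/qwq-32b": 1264.0,
}
pair = pick_pair(ratings)
```

With these four ratings, the two models at 1264.0 are selected, since their predicted win probability is exactly 0.5.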

Leaderboard (211 battles, 37 models)

ELO: rating starting at 1200, updated after each battle (K=32). W: wins. L: losses. M: total matches; higher M means a more reliable ELO.
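For readers unfamiliar with the rating update, here is a minimal sketch of one battle with K=32 and a starting rating of 1200. The logistic expected-score formula with scale 400 is the standard Elo definition and is assumed here; the post does not state which variant the script uses, and the function names are illustrative.

```python
K = 32          # update step, as stated above
START = 1200.0  # initial rating of every model


def expected_score(r_a, r_b):
    """Expected score of A against B (assumed standard Elo curve, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_winner, r_loser, k=K):
    """Return the new (winner, loser) ratings after one human judgment."""
    delta = k * (1.0 - expected_score(r_winner, r_loser))
    # The update is zero-sum: the winner gains what the loser gives up.
    return r_winner + delta, r_loser - delta


first_battle = update(START, START)
```

Two unrated models both start at 1200, so the first battle moves the winner to 1216 and the loser to 1184. Beating a higher-rated model yields a larger gain than beating a lower-rated one.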

Rank Model ELO W L M
1 qwen/qwen3-vl-32b-instruct 1325.5 8 0 8
2 nvidia/nemotron-3-super-120b-a12b 1295.3 11 5 16
3 nvidia/nemotron-nano-12b-v2-vl 1279.8 7 2 9
4 x-ai/grok-3-mini 1264.0 7 3 10
5 qwen/qwq-32b 1264.0 5 1 6
6 x-ai/grok-4-fast 1262.8 7 3 10
7 openai/gpt-4o-mini-search-preview 1259.2 9 5 14
8 mistralai/mistral-small-3.1-24b-instruct 1245.4 9 6 15
9 x-ai/grok-3-mini-beta 1233.6 6 4 10
10 qwen/qwen3-vl-8b-instruct 1232.4 8 6 14
11 meta-llama/llama-4-maverick 1231.8 9 7 16
12 qwen/qwen2.5-vl-32b-instruct 1231.2 3 1 4
13 allenai/olmo-3.1-32b-instruct 1218.5 5 4 9
14 openai/gpt-4o-mini-2024-07-18 1216.4 8 7 15
15 allenai/olmo-3.1-32b-think 1216.0 2 1 3
16 qwen/qwen3-vl-30b-a3b-instruct 1216.0 8 7 15
17 openai/gpt-4o-mini 1216.0 8 7 15
18 x-ai/grok-4.1-fast 1205.1 5 5 10
19 qwen/qwen3-235b-a22b-thinking-2507 1200.8 2 2 4
20 qwen/qwen-vl-plus 1199.9 5 5 10
21 deepseek/deepseek-v3.2-speciale 1198.6 6 6 12
22 deepseek/deepseek-v3.2 1198.0 2 2 4
23 qwen/qwen3-30b-a3b 1184.8 7 8 15
24 thedrummer/cydonia-24b-v4.1 1184.2 7 8 15
25 cohere/command-r-08-2024 1183.9 7 8 15
26 alibaba/tongyi-deepresearch-30b-a3b 1168.0 5 7 12
27 mistralai/mistral-small-2603 1168.0 7 9 16
28 arcee-ai/trinity-large-preview 1167.7 5 7 12
29 deepseek/deepseek-v3.2-exp 1154.8 6 9 15
30 tencent/hunyuan-a13b-instruct 1151.9 6 9 15
31 nex-agi/deepseek-v3.1-nex-n1 1151.9 6 9 15
32 upstage/solar-pro-3 1121.8 2 7 9
33 thedrummer/rocinante-12b 1120.3 5 10 15
34 mistralai/mistral-saba 1120.2 5 10 15
35 allenai/olmo-3-32b-think 1120.1 1 6 7
36 mistralai/mixtral-8x7b-instruct 1118.4 2 7 9
37 baidu/ernie-4.5-vl-28b-a3b 1073.7 0 8 8

Observations

Qwen3-VL-32B is undefeated at 8-0. The VL (vision-language) variant outperforms its text-only sibling at this price point — no images involved, just a larger base model.

Nvidia Nemotron outperforms its size class. The 120B MoE model has 16 matches (tied for the most of any model) and holds 1295 ELO, the most statistically grounded of the top scores in the table. The 12B VL nano variant also punches above its weight.

Thinking models don’t help here. Qwen3-235B-thinking (1200), OLMo-think variants (1120–1216): extended reasoning adds latency without improving commit messages.

DeepSeek variants diverge. Base deepseek-v3.2 and speciale hover around 1200. The experimental endpoint deepseek-v3.2-exp and the third-party nex-agi/deepseek-v3.1-nex-n1 drop below 1160.

Baidu Ernie is the worst tested: 0W/8L. Outputs are verbose, often in mixed Chinese/English, and poorly formatted for a terse imperative commit line.

Conclusion

The more I let coding agents commit directly, the less this ranking matters.

The exception is a harness that lets you explicitly choose the model used for commit message generation.

See also