Which LLM has the best commit message / price ratio? A human-judged ranking

Goal: identify which LLM generates the best git commit messages, using a human as the oracle.

Once a research problem, the task of generating a commit message from a diff is completely cracked. The question that remains is cost-efficiency: there is no need to use a frontier model for this task.

Each time I commit code, the script presents two LLM-generated commit messages side by side. I pick the better one. That human judgment is the signal. After 211 such judgments across 37 models in the $0.4–$0.6/M-token price bracket, a ranking has emerged.
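As a rough illustration of the side-by-side step, here is a minimal sketch that requests a message from two models concurrently and prints both for comparison. The names `generate_pair` and `fake_generate` are hypothetical; `fake_generate` is a stand-in for whatever API call the real script makes, which is not shown in this post.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def generate_pair(
    diff: str,
    model_a: str,
    model_b: str,
    generate: Callable[[str, str], str],
) -> tuple[str, str]:
    """Request a commit message from both models concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(generate, model_a, diff)
        future_b = pool.submit(generate, model_b, diff)
        return future_a.result(), future_b.result()


def fake_generate(model: str, diff: str) -> str:
    # Hypothetical stand-in for a real LLM API call.
    return f"[{model}] Update {len(diff.splitlines())} lines"


msg_a, msg_b = generate_pair("+foo\n-bar\n", "model-a", "model-b", fake_generate)
print("A:", msg_a)
print("B:", msg_b)
```

Passing the generator in as a callable keeps the comparison logic independent of any particular provider.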

The human oracle is the key design choice. Automated metrics (BLEU, LLM-as-judge) don’t capture what actually makes a commit message good: precision, imperative mood, no fluff, correct scope. I do.

The method

When I commit, a script stages files, generates a diff, then chooses two models. It picks the pair that maximizes information gain for the ranking (the pair whose Bradley-Terry win probability is closest to 0.5). Both models generate commit messages in parallel. I pick the better one, and the script updates both models' ELO ratings (K=32, starting at 1200).
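The pair selection can be sketched as follows. This is a minimal sketch, not the actual script: it assumes the win probability is the usual logistic curve on the rating difference with the conventional scale of 400, and it breaks ties by taking the first pair in sorted order. Neither detail is stated above. The model names and the function names `expected_score` and `pick_pair` are placeholders.

```python
from itertools import combinations


def expected_score(rating_a: float, rating_b: float) -> float:
    """Predicted probability that A beats B (logistic on the rating gap)."""
    # Assumption: conventional scale of 400 rating points.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def pick_pair(ratings: dict[str, float]) -> tuple[str, str]:
    """Return the pair whose predicted win probability is closest to 0.5."""
    return min(
        combinations(sorted(ratings), 2),
        key=lambda pair: abs(
            expected_score(ratings[pair[0]], ratings[pair[1]]) - 0.5
        ),
    )


# Placeholder model names with three ratings taken from the leaderboard.
ratings = {"model-a": 1325.5, "model-b": 1295.3, "model-c": 1073.7}
print(pick_pair(ratings))  # ('model-a', 'model-b')
```

With this criterion alone, closely rated models are matched against each other repeatedly; a real script may also weigh how few matches a model has played.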

Leaderboard (211 battles, 37 models)

ELO: rating starting at 1200, updated after each battle (K=32). W: wins. L: losses. M: total matches; higher M means a more reliable ELO.
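For reference, a single rating update with K=32 can be sketched like this. It assumes the standard update rule with a scale of 400; the post only specifies K and the starting rating, so the exact formula used by the script is an assumption. The function names are illustrative.

```python
K = 32          # update step, as stated above
START = 1200.0  # initial rating, as stated above


def expected_score(rating_a: float, rating_b: float) -> float:
    """Predicted probability that A beats B (assumed scale of 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(winner: float, loser: float, k: float = K) -> tuple[float, float]:
    """Return the new (winner, loser) ratings after one battle."""
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


# Two unrated models: the winner gains exactly K/2 and the loser drops by K/2.
new_winner, new_loser = update(START, START)
print(new_winner, new_loser)  # 1216.0 1184.0
```

An upset moves ratings more than an expected result: the larger the rating gap in the loser's favor, the closer the transfer gets to the full K points.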

Rank	Model	ELO	W	L	M
1	qwen/qwen3-vl-32b-instruct	1325.5	8	0	8
2	nvidia/nemotron-3-super-120b-a12b	1295.3	11	5	16
3	nvidia/nemotron-nano-12b-v2-vl	1279.8	7	2	9
4	x-ai/grok-3-mini	1264.0	7	3	10
5	qwen/qwq-32b	1264.0	5	1	6
6	x-ai/grok-4-fast	1262.8	7	3	10
7	openai/gpt-4o-mini-search-preview	1259.2	9	5	14
8	mistralai/mistral-small-3.1-24b-instruct	1245.4	9	6	15
9	x-ai/grok-3-mini-beta	1233.6	6	4	10
10	qwen/qwen3-vl-8b-instruct	1232.4	8	6	14
11	meta-llama/llama-4-maverick	1231.8	9	7	16
12	qwen/qwen2.5-vl-32b-instruct	1231.2	3	1	4
13	allenai/olmo-3.1-32b-instruct	1218.5	5	4	9
14	openai/gpt-4o-mini-2024-07-18	1216.4	8	7	15
15	allenai/olmo-3.1-32b-think	1216.0	2	1	3
16	qwen/qwen3-vl-30b-a3b-instruct	1216.0	8	7	15
17	openai/gpt-4o-mini	1216.0	8	7	15
18	x-ai/grok-4.1-fast	1205.1	5	5	10
19	qwen/qwen3-235b-a22b-thinking-2507	1200.8	2	2	4
20	qwen/qwen-vl-plus	1199.9	5	5	10
21	deepseek/deepseek-v3.2-speciale	1198.6	6	6	12
22	deepseek/deepseek-v3.2	1198.0	2	2	4
23	qwen/qwen3-30b-a3b	1184.8	7	8	15
24	thedrummer/cydonia-24b-v4.1	1184.2	7	8	15
25	cohere/command-r-08-2024	1183.9	7	8	15
26	alibaba/tongyi-deepresearch-30b-a3b	1168.0	5	7	12
27	mistralai/mistral-small-2603	1168.0	7	9	16
28	arcee-ai/trinity-large-preview	1167.7	5	7	12
29	deepseek/deepseek-v3.2-exp	1154.8	6	9	15
30	tencent/hunyuan-a13b-instruct	1151.9	6	9	15
31	nex-agi/deepseek-v3.1-nex-n1	1151.9	6	9	15
32	upstage/solar-pro-3	1121.8	2	7	9
33	thedrummer/rocinante-12b	1120.3	5	10	15
34	mistralai/mistral-saba	1120.2	5	10	15
35	allenai/olmo-3-32b-think	1120.1	1	6	7
36	mistralai/mixtral-8x7b-instruct	1118.4	2	7	9
37	baidu/ernie-4.5-vl-28b-a3b	1073.7	0	8	8

Observations

Qwen3-VL-32B is undefeated at 8-0. The VL (vision-language) variant outperforms its text-only sibling at this price point — no images involved, just a larger base model.

Nvidia Nemotron outperforms its size class. The 120B MoE model has 16 matches (tied for the most of any model) and holds 1295 ELO, which makes it one of the most statistically grounded scores in the table. The 12B VL nano variant also punches above its weight.

Thinking models don’t help here. Qwen3-235B-thinking (1200), OLMo-think variants (1120–1216): extended reasoning adds latency without improving commit messages.

DeepSeek variants diverge. Base deepseek-v3.2 and speciale hover around 1200. The experimental endpoint deepseek-v3.2-exp and the third-party nex-agi/deepseek-v3.1-nex-n1 drop below 1160.

Baidu Ernie is the worst tested: 0W/8L. Outputs are verbose, often in mixed Chinese/English, and poorly formatted for a terse imperative commit line.

Conclusion

The more I let coding agents commit directly, the less this ranking matters.

That changes only if your harness lets you explicitly choose which model generates the commit message.
