Goal: identify which LLM generates the best git commit messages, using a human as the oracle.
Once a research problem, generating a commit message from a diff is now effectively solved. The question that remains is cost-efficiency: there is no need to use a frontier model for this task.
Each time I commit code, the script presents two LLM-generated commit messages side by side. I pick the better one. That human judgment is the signal. After 211 such judgments across 37 models in the $0.4–$0.6/M-token price bracket, a ranking has emerged.
The human oracle is the key design choice. Automated metrics (BLEU, LLM-as-judge) don’t capture what actually makes a commit message good: precision, imperative mood, no fluff, correct scope. I do.
The method
When I commit, a script stages files, generates a diff, then chooses two models: the pair that maximizes information gain for the ranking (the pair whose Bradley-Terry win probability is closest to 0.5). Both models generate commit messages in parallel. I pick the better one, and the script updates both models' ELO ratings (K=32, starting at 1200).
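A minimal sketch of the selection and update steps described above, not the actual script. K=32 and the 1200 starting rating come from the text. The function names, the 400-point logistic scale (the standard Elo convention, equivalent to a Bradley-Terry model with strengths 10^(rating/400)), and the tie-breaking by sorted model name are assumptions made for illustration.

```python
import itertools

K = 32           # update step size, from the text
START = 1200.0   # initial rating, from the text


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve (assumed scale: 400)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def pick_pair(ratings: dict[str, float]) -> tuple[str, str]:
    """Return the pair whose predicted win probability is closest to 0.5.

    Ties go to the first pair in sorted-name order; this tie-break is an assumption.
    """
    return min(
        itertools.combinations(sorted(ratings), 2),
        key=lambda pair: abs(expected_score(ratings[pair[0]], ratings[pair[1]]) - 0.5),
    )


def update(ratings: dict[str, float], winner: str, loser: str) -> None:
    """Apply one Elo update in place after the human picks a winner."""
    gain = K * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain
```

With two models both at 1200, the expected score is 0.5, so the first battle moves the winner to 1216.0 and the loser to 1184.0. When every model starts at the same rating, all pairs tie on the 0.5 criterion, so a real script needs some extra rule (for example, preferring models with fewer matches) to spread battles across the pool; the text does not say which rule is used.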
Leaderboard (211 battles, 37 models)
ELO: rating starting at 1200, updated after each battle (K=32). W: wins. L: losses. M: total matches; higher M means a more reliable ELO.
| Rank | Model | ELO | W | L | M |
|---|---|---|---|---|---|
| 1 | qwen/qwen3-vl-32b-instruct | 1325.5 | 8 | 0 | 8 |
| 2 | nvidia/nemotron-3-super-120b-a12b | 1295.3 | 11 | 5 | 16 |
| 3 | nvidia/nemotron-nano-12b-v2-vl | 1279.8 | 7 | 2 | 9 |
| 4 | x-ai/grok-3-mini | 1264.0 | 7 | 3 | 10 |
| 5 | qwen/qwq-32b | 1264.0 | 5 | 1 | 6 |
| 6 | x-ai/grok-4-fast | 1262.8 | 7 | 3 | 10 |
| 7 | openai/gpt-4o-mini-search-preview | 1259.2 | 9 | 5 | 14 |
| 8 | mistralai/mistral-small-3.1-24b-instruct | 1245.4 | 9 | 6 | 15 |
| 9 | x-ai/grok-3-mini-beta | 1233.6 | 6 | 4 | 10 |
| 10 | qwen/qwen3-vl-8b-instruct | 1232.4 | 8 | 6 | 14 |
| 11 | meta-llama/llama-4-maverick | 1231.8 | 9 | 7 | 16 |
| 12 | qwen/qwen2.5-vl-32b-instruct | 1231.2 | 3 | 1 | 4 |
| 13 | allenai/olmo-3.1-32b-instruct | 1218.5 | 5 | 4 | 9 |
| 14 | openai/gpt-4o-mini-2024-07-18 | 1216.4 | 8 | 7 | 15 |
| 15 | allenai/olmo-3.1-32b-think | 1216.0 | 2 | 1 | 3 |
| 16 | qwen/qwen3-vl-30b-a3b-instruct | 1216.0 | 8 | 7 | 15 |
| 17 | openai/gpt-4o-mini | 1216.0 | 8 | 7 | 15 |
| 18 | x-ai/grok-4.1-fast | 1205.1 | 5 | 5 | 10 |
| 19 | qwen/qwen3-235b-a22b-thinking-2507 | 1200.8 | 2 | 2 | 4 |
| 20 | qwen/qwen-vl-plus | 1199.9 | 5 | 5 | 10 |
| 21 | deepseek/deepseek-v3.2-speciale | 1198.6 | 6 | 6 | 12 |
| 22 | deepseek/deepseek-v3.2 | 1198.0 | 2 | 2 | 4 |
| 23 | qwen/qwen3-30b-a3b | 1184.8 | 7 | 8 | 15 |
| 24 | thedrummer/cydonia-24b-v4.1 | 1184.2 | 7 | 8 | 15 |
| 25 | cohere/command-r-08-2024 | 1183.9 | 7 | 8 | 15 |
| 26 | alibaba/tongyi-deepresearch-30b-a3b | 1168.0 | 5 | 7 | 12 |
| 27 | mistralai/mistral-small-2603 | 1168.0 | 7 | 9 | 16 |
| 28 | arcee-ai/trinity-large-preview | 1167.7 | 5 | 7 | 12 |
| 29 | deepseek/deepseek-v3.2-exp | 1154.8 | 6 | 9 | 15 |
| 30 | tencent/hunyuan-a13b-instruct | 1151.9 | 6 | 9 | 15 |
| 31 | nex-agi/deepseek-v3.1-nex-n1 | 1151.9 | 6 | 9 | 15 |
| 32 | upstage/solar-pro-3 | 1121.8 | 2 | 7 | 9 |
| 33 | thedrummer/rocinante-12b | 1120.3 | 5 | 10 | 15 |
| 34 | mistralai/mistral-saba | 1120.2 | 5 | 10 | 15 |
| 35 | allenai/olmo-3-32b-think | 1120.1 | 1 | 6 | 7 |
| 36 | mistralai/mixtral-8x7b-instruct | 1118.4 | 2 | 7 | 9 |
| 37 | baidu/ernie-4.5-vl-28b-a3b | 1073.7 | 0 | 8 | 8 |
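The note that higher M means a more reliable ELO can be made concrete with a rough gauge of how much a W/L record alone pins down a model's win rate. The sketch below computes a Wilson score interval; it is not part of the original script, the function name is hypothetical, and it ignores opponent strength and the adaptive pair selection, so it should be read only as a sanity check on small samples.

```python
import math


def wilson_interval(wins: int, matches: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a win rate (z=1.96 is an assumed default)."""
    if matches == 0:
        return (0.0, 1.0)  # no data: the win rate could be anything
    p = wins / matches
    denom = 1.0 + z * z / matches
    center = (p + z * z / (2 * matches)) / denom
    half = z * math.sqrt(p * (1 - p) / matches + z * z / (4 * matches * matches)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


low, high = wilson_interval(8, 8)  # the undefeated 8-0 record from the table
```

For the 8-0 record the lower bound comes out at roughly 0.68, so even an undefeated run over eight battles is compatible with a model that loses about one battle in three.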
Observations
Qwen3-VL-32B is undefeated at 8-0. The VL (vision-language) variant outperforms its text-only sibling at this price point — no images involved, just a larger base model.
Nvidia Nemotron over-performs its size class. The 120B MoE model has 16 matches (tied for the most of any model) and holds 1295 ELO, making it one of the best-supported scores in the table. The 12B VL nano variant also punches above its weight.
Thinking models don’t help here. Qwen3-235B-thinking (1200), OLMo-think variants (1120–1216): extended reasoning adds latency without improving commit messages.
DeepSeek variants diverge. Base deepseek-v3.2 and speciale hover around 1200. The experimental endpoint deepseek-v3.2-exp and the third-party nex-agi/deepseek-v3.1-nex-n1 drop below 1160.
Baidu Ernie is the worst tested: 0W/8L. Outputs are verbose, often in mixed Chinese/English, and poorly formatted for a terse imperative commit line.
Conclusion
The more I let coding agents commit directly, the less this ranking matters, unless the harness lets you explicitly choose which model generates the commit message.
See also
- LMSYS Chatbot Arena — the same idea applied to general conversation at scale
- Bradley-Terry model — statistical foundation for the pair-selection criterion