Overview
SWE-bench has updated its coding benchmark leaderboard with fresh results from current AI models. The results are notable because they come from independent third-party evaluation rather than self-reported scores from AI labs. Claude 4.5 Opus leads with a 76.8% resolution rate, followed closely by Gemini 3 Flash and the Chinese model MiniMax M2.5.
Key Points
- Independent benchmark testing shows Claude 4.5 Opus achieving a 76.8% resolution rate - avoiding the bias of lab self-reported results
- Chinese AI models are increasingly competitive, with MiniMax M2.5, GLM-5, Kimi K2.5, and DeepSeek V3.2 all ranking in top 10 - signaling global AI capability convergence
- SWE-bench tests real-world coding problems from 12 major open source repositories including Django and scikit-learn - measuring practical software engineering skills, not just coding syntax
- OpenAI’s best general model, GPT-5.2, ranks 6th at 72.8%, but its specialized coding model GPT-5.3-Codex wasn’t included - leaving open whether a purpose-built coding model would close the gap with a general-purpose one
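The "resolution rate" cited above can be sketched as a simple fraction: each benchmark task is a real GitHub issue, and a model's patch counts as resolving it if the repository's test suite passes afterward. The task IDs and pass/fail values below are made up for illustration, not actual SWE-bench data.

```python
def resolution_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks whose generated patch made the tests pass."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

# Toy results: task id -> whether the model's patch resolved the issue.
# IDs mimic the SWE-bench naming style but are hypothetical.
results = {
    "django__django-0001": True,
    "scikit-learn__scikit-learn-0002": True,
    "django__django-0003": False,
    "sympy__sympy-0004": True,
}

print(f"{resolution_rate(results):.1%}")  # 3 of 4 resolved -> 75.0%
```

A leaderboard score like 76.8% is this same fraction computed over the full benchmark task set rather than four toy entries.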