Overview

SWE-bench released a February 2026 leaderboard update showing how current AI models perform on its real-world coding benchmark. The results are notable because they provide independent verification of model capabilities rather than self-reported lab claims. Claude 4.5 Opus topped the leaderboard with a 76.8% problem resolution rate.

Key Points

  • SWE-bench tested models against 500 manually curated real-world coding problems from major open source repositories - independent benchmarking validates actual coding capabilities
  • Claude 4.5 Opus achieved a 76.8% resolution rate, narrowly beating the newer Claude 4.6 - newer isn’t always better in AI model performance
  • Chinese AI models dominated the top 10 with MiniMax M2.5, GLM-5, Kimi K2.5, and DeepSeek V3.2 all ranking highly - global competition is intensifying in coding AI
  • OpenAI’s GPT-5.2 ranked 6th at 72.8%, with its specialized coding model GPT-5.3-Codex notably absent - specialized models may have deployment limitations