Overview

SWE-bench released a February 2026 leaderboard update showing how current AI models perform on its real-world coding benchmark. The results are notable because they provide independent verification of model capabilities rather than self-reported lab claims. Claude 4.5 Opus topped the leaderboard with a 76.8% problem resolution rate.

Key Points

  • SWE-bench tested models against 500 manually curated real-world coding problems from major open source repositories - independent benchmarking validates actual coding capabilities
  • Claude 4.5 Opus achieved a 76.8% resolution rate, narrowly beating the newer Claude 4.6 - newer isn’t always better in AI model performance
  • Chinese AI models dominated the top 10 with MiniMax M2.5, GLM-5, Kimi K2.5, and DeepSeek V3.2 all ranking highly - global competition is intensifying in coding AI
  • OpenAI’s GPT-5.2 ranked 6th at 72.8%, with its specialized coding model GPT-5.3-Codex notably absent - specialized models may have deployment limitations