The New Grok Times

The news. The narrative. The timeline.

Technology

Kimi K2.6 Holds the HLE-Full Lead on Day Seven as the Western Labs Stay Silent

Seven days after Moonshot AI shipped Kimi K2.6 — a 1-trillion-parameter mixture-of-experts model with 32B active parameters and 262K context — the open-weights lead on agentic and coding benchmarks still belongs to a Chinese laboratory. [1] The paper's Wednesday Day Six account marked the Western-response silence as the structural signal. Thursday and Friday have not produced one. Anthropic has not shipped an Opus 4.7 refresh. OpenAI has not shipped a GPT-5.5. Google has not moved the Gemini 3.1 Pro baseline.

The benchmark stack, as of April 24 on Artificial Analysis and Moonshot's technical report, has Kimi K2.6 at 54.0 on HLE-Full with tools (above GPT-5.4's 52.1 and Claude Opus 4.6's 53.0), 58.6 on SWE-Bench Pro (above both), 83.2 on BrowseComp (86.3 with agent swarm), 96.4 on AIME 2026, and 89.6 on LiveCodeBench v6. [2] Claude Opus 4.7 still holds SWE-Bench Verified at 87.6 against K2.6's 80.2 — the one closed-source lead that has not fallen. [3] The reader who cares only about SWE-Bench Verified sees Anthropic winning. The reader who cares about HLE-Full, long-horizon tool-use, and code-agent orchestration across 4,000+ tool calls in a 12-hour session sees an open-weights Chinese model doing work no Western frontier model has publicly demonstrated at that scale.

The commercial picture compounds the benchmark picture. Kimi K2.6's API is priced at approximately $0.60 per million input tokens against Claude Opus 4.7's $5.00. [1] Cloudflare Workers now hosts the weights at $0.95 per million input. Fireworks, Baseten, Novita, and Parasail all list K2.6 as of Thursday. [2] For a developer shop running long-horizon coding pipelines, the price-performance calculus is not marginal. It is an order of magnitude.

The Western silence is the more interesting half of the story. Anthropic's April analyst communications have continued to emphasize Opus 4.7's SWE-Bench Verified lead without reference to HLE-Full. OpenAI's DevDay window closed in late March without a benchmark refresh. Google's Gemini 3.1 Pro remains dated to March. The working hypothesis on frontier-lab X threads all week has been that a Western response was imminent; seven days in, "imminent" has become the absence of a response. The paper's position is that Day Seven matters because it is the first full work-week in which no US lab has published an answer, and the policy silence — the paper's Wednesday French X / Ohio data-center dual read established the jurisdictional frame — now has a capability counterpart.

The AI-state-power thread's divergence is that MSM covers AI as a US-Europe regulatory story. X reads K2.6 as evidence that the frontier is no longer geographically concentrated. The seven-day Western silence is the best evidence so far for the X reading.

-- DAVID CHEN, Beijing

Sources & X Posts

News Sources
[1] https://whatllm.org/blog/kimi-k2-6
[2] https://blockchain.news/ainews/kimi-k2-6-open-weights-model-vs-claude-opus-4-6-latest-benchmark-analysis-real-world-gaps-and-6-business-takeaways
[3] https://aitoolsrecap.com/Blog/moonshot-ai-kimi-k2-6-release-coding-agent-benchmarks-2026
X Posts
[4] Kimi K2.6: open-weights SOTA on HLE-Full with tools, SWE-Bench Pro, BrowseComp — weights live on Hugging Face. https://x.com/Kimi_Moonshot/status/1913788121003118782

Get the New Grok Times in your inbox

A weekly digest of the stories shaping the timeline — delivered every edition.

No spam. Unsubscribe anytime.