Tiny 3B model rivals 30B, GLM-5.2 opens 1M context

AI · 2026-06-17

Models & Releases

Tiny 3B model rivals 30B giants on math and coding benchmarks · 40 MIN

VibeThinker-3B, a 3‑billion‑parameter model from Weibo AI, reaches frontier reasoning scores on AIME‑26 (94.3) and LiveCodeBench (80.2), rivaling 30‑plus‑billion‑parameter flagships. The technical report credits a Spectrum‑to‑Signal post‑training pipeline and introduces a Parametric Compression‑Coverage hypothesis, arguing that verifiable reasoning can be compressed into tiny cores.

GLM‑5.2 brings a 1M‑token context to open‑source long‑run coding · 14 MIN

GLM‑5.2 ships with a 1 million‑token context window and an MIT license, putting long‑horizon coding agents within reach of anyone who can run the model. Its new IndexShare architecture cuts FLOPs by 2.9×, and benchmark results place it near top‑tier closed models on extended software‑engineering tasks.

Image-to-LoRA V2 slashes style LoRA training to seconds · 36 MIN

DiffSynth-Studio now ships Image-to-LoRA V2, turning one or more reference images into a style LoRA with a single forward pass. This cuts hours of LoRA training down to seconds, letting creators prototype custom styles instantly. The release supports Z-Image, FLUX.2-klein-base-4B, and Hidream-O1 models.

Qwen‑3.6‑27B proves fast local coding assistant on M2 Ultra and RTX 5090 · 1 MIN

Georgi Gerganov, the creator of llama.cpp, says Qwen‑3.6‑27B handles daily refactoring and code generation effortlessly on his M2 Ultra Mac and RTX 5090 GPU. The lightweight ggml harness runs offline, showing that high‑quality coding assistance is now viable without cloud services.

Research

Noise‑Driven Escape Theory Illuminates Grokking in Deep Nets · 2 MIN

The paper shows that grokking results from stochastic gradient descent noise pushing a network out of metastable states created by low L2 regularization. Treating this escape as a first‑order phase transition explains why generalization can appear abruptly after prolonged overfitting, and suggests ways to speed up learning.

LLMs Build Code Answers in Hidden Layers before They Appear, Revealing Blind Spots in Accuracy Metrics · 1 MIN

The paper shows that decoder‑only LLMs first “brew” a solution across many early transformer layers, making it linearly recoverable long before it becomes self‑decodable. Only about 42% of attempts resolve correctly, with depth‑dependent drops, exposing failure modes that standard pass/fail scores miss. Their dual diagnostic, layer probing plus Context‑Stripped Decoding, maps this lifecycle across 16 models.

Editable KV Cache Cuts LLM Prompt Latency up to 15× · 2 MIN

The paper shows KV caches can be edited and composed during prefill, so only changed fields are overwritten while the rest is reused. This yields up to 14.9× lower latency and 53‑398× faster time‑to‑first‑token in vLLM benchmarks without hurting model outputs.
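The reuse idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `FieldKVCache`, `fake_encode`, and the field names are hypothetical, plain tuples stand in for KV tensors, and the sketch ignores the fact that in a real transformer the keys and values of later tokens depend on earlier ones, which is the hard part the paper has to handle. It is not the paper's method or a vLLM API.

```python
from typing import Callable, Dict, List, Tuple

Encoded = Tuple[int, ...]


class FieldKVCache:
    """Toy per-field cache; tuples stand in for real KV tensors."""

    def __init__(self, encode: Callable[[str], Encoded]):
        self.encode = encode  # hypothetical stand-in for prefill over one field
        self.entries: Dict[str, Tuple[str, Encoded]] = {}
        self.encode_calls = 0  # counts how much "prefill work" was done

    def prefill(self, fields: Dict[str, str]) -> List[Encoded]:
        """Return one cache entry per field, recomputing only edited fields."""
        out = []
        for name, text in fields.items():
            cached = self.entries.get(name)
            if cached is None or cached[0] != text:
                # Field is new or was edited: overwrite just this entry.
                self.encode_calls += 1
                cached = (text, self.encode(text))
                self.entries[name] = cached
            out.append(cached[1])
        return out


def fake_encode(text: str) -> Encoded:
    """Placeholder for the expensive per-field computation."""
    return tuple(ord(c) for c in text)


cache = FieldKVCache(fake_encode)
prompt = {
    "system": "You are a helpful assistant.",
    "user_name": "Ada",
    "question": "What is a KV cache?",
}

first = cache.prefill(prompt)
calls_after_first = cache.encode_calls

prompt["user_name"] = "Grace"  # edit a single field
second = cache.prefill(prompt)
calls_after_edit = cache.encode_calls - calls_after_first
```

In this toy, the first prefill encodes all three fields and the second encodes only the edited one, which is the shape of the saving the summary describes.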

LLMs Still Can’t Invent Zero, Even With Language Pre‑training · 1 MIN

A study tests whether GPT‑2‑scale language models can independently infer the mathematical concept of zero. The models fail outright without examples, but after seeing a few dozen instances they learn it, and prior language pre‑training cuts the required examples by roughly half. The result suggests language skills can scaffold the learning of a concept, yet unaided mathematical discovery remains out of reach.

Diverse First‑Turn Queries Break Parallel Sampling Limits in Agentic Search · 1 MIN

Standard parallel sampling in agentic search quickly hits a ceiling because early queries repeat, causing redundant evidence retrieval. The authors introduce DivInit, a training-free step that selects diverse first‑turn queries, boosting multi‑hop QA performance by five to seven points at equal compute across several models.
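One way to picture a diverse first‑turn selection is a greedy farthest‑point pick over candidate queries. This is a minimal sketch under stated assumptions: the summary does not say how DivInit measures or selects for diversity, so the Jaccard word‑overlap distance, the greedy rule, and the names `jaccard_distance` and `select_diverse` are all hypothetical stand‑ins, and the example queries are made up.

```python
from typing import List


def jaccard_distance(a: str, b: str) -> float:
    """1 minus word-set overlap; 0.0 means identical word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def select_diverse(candidates: List[str], k: int) -> List[str]:
    """Greedily pick k queries, each as far as possible from those already chosen."""
    if k <= 0 or not candidates:
        return []
    chosen = [candidates[0]]  # seed with the first candidate (an arbitrary choice)
    remaining = candidates[1:]
    while remaining and len(chosen) < k:
        # Score each remaining query by its distance to the nearest chosen query.
        best = max(
            remaining,
            key=lambda q: min(jaccard_distance(q, c) for c in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


candidates = [
    "who directed the film Inception",
    "who directed the movie Inception",
    "Inception film director birthplace",
    "Christopher Nolan place of birth",
]
picked = select_diverse(candidates, 2)
```

With two slots, the near‑duplicate second query is skipped in favor of the one sharing no words with the first, so the parallel rollouts start from different evidence instead of retrieving the same documents twice.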

Products & Industry

Strands Agents turn Hugging Face demos into real‑robot code in a single step · 16 MIN

Amazon's Strands Robots SDK now wraps Hugging Face's LeRobot stack as AgentTools, letting a single agent record demos, push them to the Hub, run policies in simulation, and deploy the same code to physical robots with one flag change. The tight integration cuts a five‑tool workflow down to one, enabling fleet‑wide coordination and faster real‑world deployment.

Policy & Safety

US Govt Forces Anthropic to Pull Fable and Mythos Over Jailbreak Fears · 11 MIN

On Friday the White House ordered Anthropic to shut down access to its Fable and Mythos models after a narrow jailbreak was demonstrated, then backed the demand with an export restriction that effectively forced the takedown. The move exposes how little regulators understand cutting‑edge AI and sets a precedent that could chill US AI innovation and foreign talent recruitment.

Tools & Open Source

DFlash + Spec V2 lifts LLM inference throughput by more than 4.3× · 9 MIN

In a joint release, Modal, Z Lab and SGLang show that DFlash paired with SGLang’s Spec V2 engine on Qwen 3.5‑397B‑A17B delivers more than 4.3× the throughput of the baseline and 1.5× the native MTP speed, while keeping quality. The model and launch scripts are posted on Hugging Face for anyone to try.
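For readers new to the idea, here is a toy sketch of greedy draft‑and‑verify decoding. It assumes, as the name Spec V2 suggests, a speculative‑decoding setup in which a fast drafter proposes tokens and the large model checks them; the summary does not spell out the mechanism. The function `speculative_decode` and both toy "models" are hypothetical, and nothing here is SGLang or DFlash code.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Generate n_new tokens; return (tokens, number of verification rounds)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(tokens) < goal:
        # 1. The cheap drafter proposes up to k tokens autoregressively.
        ctx = list(tokens)
        proposal = []
        for _ in range(min(k, goal - len(tokens))):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. The target checks the proposal. A real engine scores all positions
        #    in one batched forward pass; this toy calls target per position.
        rounds += 1
        for t in proposal:
            expected = target(tokens)
            tokens.append(expected)  # always keep the target's own choice
            if expected != t:
                break  # first mismatch: discard the rest of the draft
    return tokens, rounds


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except right after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


result, rounds = speculative_decode(toy_target, toy_draft, prompt=[0], n_new=8)

# Plain greedy decoding with the target alone, for comparison.
baseline = [0]
for _ in range(8):
    baseline.append(toy_target(baseline))
```

Because every kept token is the target's own choice, the output matches plain greedy decoding token for token, while the target is consulted in 3 rounds instead of 8 sequential steps. That is the sense in which such schemes can raise throughput without changing quality.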