Tiny 3B model rivals 30B, GLM-5.2 opens 1M context
VibeThinker-3B, a 3‑billion‑parameter model from Weibo AI, reaches frontier reasoning scores on AIME‑26 (94.3) and LiveCodeBench (80.2), rivaling 30‑plus‑billion‑parameter flagships. The technical report credits a Spectrum‑to‑Signal post‑training pipeline and introduces a Parametric Compression‑Coverage hypothesis, arguing that verifiable reasoning can be compressed into tiny cores.
GLM‑5.2 ships with a 1‑million‑token context window and an MIT license, making long‑horizon coding agents broadly accessible. Its new IndexShare architecture cuts FLOPs by 2.9×, and benchmark results place it near top‑tier closed models on extended software‑engineering tasks.
DiffSynth-Studio now ships Image-to-LoRA V2, turning one or more reference images into a style LoRA with a single forward pass. This cuts hours of LoRA training down to seconds, letting creators prototype custom styles instantly. The release supports Z-Image, FLUX.2-klein-base-4B, and Hidream-O1 models.
Georgi Gerganov, the mind behind llama.cpp, says Qwen‑3.6‑27B handles daily refactoring and code generation effortlessly on his M2 Ultra Mac and RTX 5090 GPU. The lightweight ggml harness runs offline, showing that high‑quality coding assistance is now viable without cloud services.
A paper shows that grokking results from stochastic gradient descent noise pushing a network out of metastable states created by low L2 regularization. The authors describe this as a first‑order phase transition, which explains why generalization can appear abruptly after prolonged overfitting and suggests ways to speed up learning.
A paper shows that decoder‑only LLMs first “brew” a solution across many early transformer layers, making it linearly recoverable long before it becomes self‑decodable. Only about 42% of attempts resolve correctly, with depth‑dependent drops, exposing failure modes that standard pass/fail scores miss. The authors' dual diagnostic, layer probing plus Context‑Stripped Decoding, maps this lifecycle across 16 models.
A paper shows that KV caches can be edited and composed during prefill, so only changed fields are overwritten while the rest is reused. This yields up to 14.9× lower latency and 53× to 398× faster time‑to‑first‑token in vLLM benchmarks without degrading model outputs.
A study tests whether GPT‑2‑scale language models can independently infer the mathematical concept of zero. The models fail outright without examples, but after seeing a few dozen instances they learn it, and prior language pre‑training cuts the required examples by roughly half. The result shows language skills can scaffold, yet true mathematical discovery remains out of reach.
Standard parallel sampling in agentic search quickly hits a ceiling because early queries repeat, causing redundant evidence retrieval. The authors introduce DivInit, a training-free step that selects diverse first‑turn queries, boosting multi‑hop QA performance by five to seven points at equal compute across several models.
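To make the idea of picking diverse first‑turn queries concrete, here is a minimal sketch. It is not the authors' DivInit procedure, which the summary above does not spell out. It assumes a pool of candidate queries is already available and uses a greedy max‑min selection over Jaccard token distance as a stand‑in diversity measure. The function names and the distance metric are illustrative assumptions.

```python
# Hypothetical illustration of diversity-aware query selection.
# Assumption: diversity is measured by Jaccard distance over lowercase tokens;
# the actual DivInit criterion may differ.

def tokens(query):
    """Split a query into a set of lowercase whitespace-separated tokens."""
    return set(query.lower().split())


def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for the token sets of two queries."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def select_diverse(candidates, k):
    """Greedily pick up to k queries, each maximizing its minimum
    distance to the queries already selected (max-min selection)."""
    if k <= 0 or not candidates:
        return []
    selected = [candidates[0]]          # seed with the first candidate
    remaining = list(candidates[1:])
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda c: min(jaccard_distance(c, s) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    pool = [
        "who founded acme corp",
        "who founded acme corp inc",
        "acme corp headquarters city",
        "population of springfield 2020",
    ]
    for q in select_diverse(pool, 3):
        print(q)
```

In this toy pool the near‑duplicate second query is dropped in favor of the two queries that share the fewest tokens with the seed, which is the kind of redundancy the blurb says plain parallel sampling runs into.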
Amazon's Strands Robots SDK now wraps Hugging Face's LeRobot stack as AgentTools, letting a single agent record demos, push them to the Hub, run policies in simulation, and deploy the same code to physical robots with one flag change. The tight integration cuts a five‑tool workflow down to one, enabling fleet‑wide coordination and faster real‑world deployment.
On Friday the White House ordered Anthropic to shut down access to its Fable and Mythos models after a narrow jailbreak was demonstrated, then backed the demand with an export restriction that effectively forced the takedown. The move exposes how little regulators understand cutting‑edge AI and sets a precedent that could chill US AI innovation and foreign talent recruitment.
In a joint release, Modal, Z Lab and SGLang show that DFlash paired with SGLang’s Spec V2 engine on Qwen 3.5‑397B‑A17B delivers more than 4.3× the throughput of the baseline and 1.5× the native MTP speed, while preserving output quality. The model and launch scripts are posted on Hugging Face for anyone to try.