JetSpec 9.6× local LLM, scaling laws hit ceiling

AI · 2026-06-26

Models & Releases

NVIDIA's TwoTower 30B model flips token generation with block‑level diffusion · 10 MIN

Built on the Nemotron‑3 Nano 30B‑A3B backbone, NVIDIA’s TwoTower model replaces single‑token autoregression with a block‑wise diffusion process: the autoregressive (AR) tower stays frozen while a denoiser tower generates groups of tokens. The diffusion approach aims to boost reasoning and planning, and the model is available on Hugging Face and runs via Transformers, vLLM, SGLang or Docker.

Research

Verifier‑guided backtracking sampler lets 0.5B models code like 2‑4B rivals · 167 MIN

On the LiveCodeBench coding benchmark, a 0.5B model sampled with verifier‑guided backtracking (VGB) achieves Pass@1 comparable to 2‑4B models, without any weight changes. The method uses a probabilistic backtracking random walk guided by a process verifier, offering provable robustness to verifier errors and the potential to scale to larger models.
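
To make the idea concrete, here is a minimal toy sketch of a backtracking random walk steered by a prefix-level verifier. It is not the paper's algorithm: the transition rule (backtrack with probability one minus the verifier score), the bit-string task, and the names `backtracking_walk`, `propose`, `verifier` and `TARGET` are all assumptions made for illustration. The verifier scores are deliberately soft (0.9 and 0.1) to mimic an imperfect judge.

```python
import random

def backtracking_walk(propose, verifier, is_complete, max_steps=10_000, seed=0):
    """Toy walk: extend the prefix when the verifier likes it, else step back."""
    rng = random.Random(seed)
    prefix = []
    for _ in range(max_steps):
        if is_complete(prefix):
            return prefix
        score = verifier(prefix)  # assumed: estimate that the prefix can still succeed
        if prefix and rng.random() > score:
            prefix.pop()  # backtrack one step
        else:
            prefix.append(propose(prefix, rng))  # take one forward step
    return None  # step budget exhausted

# Hypothetical task: rediscover a short bit string.
TARGET = [1, 0, 1, 1]

def propose(prefix, rng):
    # Stand-in for sampling the next token from a small frozen model.
    return rng.choice([0, 1])

def verifier(prefix):
    # Soft process verifier: high score for prefixes still on track.
    return 0.9 if prefix == TARGET[:len(prefix)] else 0.1

found = backtracking_walk(propose, verifier, lambda p: p == TARGET)
print(found)
```

The proposal model is never updated here. All of the improvement comes from how the sampler spends its steps, which matches the blurb's point about no weight changes.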

Chain‑of‑thought reasoning lets LLMs recall hidden facts · 6 MIN

Google researchers show that prompting Gemini‑2.5 and Qwen‑3 with step‑by‑step reasoning unlocks factual answers they otherwise miss. The boost comes from a computational buffer and factual priming, meaning simple queries can be answered more reliably without extra training. This reveals a cheap way to improve closed‑book QA performance.

JetSpec shatters speculative decoding limits with up to 9.6× faster local LLM inference · 8 MIN

JetSpec introduces causal parallel tree drafting, letting a frozen target model verify an entire speculative tree in one pass. On the MATH‑500 benchmark it delivers a 9.64× end‑to‑end speedup and sustains ~1000 tokens‑per‑second on a single B200 GPU, unlocking real‑time local LLM serving.
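
For readers new to tree-style speculative decoding, the verification step can be sketched as follows. This is generic greedy tree verification, not JetSpec's causal parallel tree drafting. The nested-dict tree, the counting "target model" and the name `verify_tree` are hypothetical, and the per-node calls to `target_next` stand in for what a real system would compute in a single batched forward pass.

```python
def verify_tree(context, tree, target_next):
    """Walk the draft tree along the branch the target model agrees with.

    tree maps a drafted token to its subtree (nested dicts).
    Returns (accepted_tokens, bonus_token), where bonus_token is the
    target's own choice at the first point of disagreement.
    """
    accepted = []
    node = tree
    while True:
        want = target_next(context + accepted)  # target's greedy next token
        if want in node:
            accepted.append(want)  # the draft guessed right, so descend
            node = node[want]
        else:
            return accepted, want  # draft exhausted or wrong

# Hypothetical target model: counts upward modulo 5.
def target_next(tokens):
    return (tokens[-1] + 1) % 5

# Hypothetical draft tree with several alternative branches.
draft_tree = {3: {4: {1: {}}, 9: {}}, 7: {}}

accepted, bonus = verify_tree([1, 2], draft_tree, target_next)
print(accepted, bonus)  # [3, 4] 0
```

Each verification round yields the accepted tokens plus one bonus token, so the deeper the matching branch, the more tokens the target model emits per pass.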

Why AI Reasoners Double‑Down on Wrong Answers (Humans Don't) · 2 MIN

Large reasoning models (LRMs) spend more tokens on wrong answers, unlike humans, who quit when stuck. The study separates difficulty registration (whether time spent tracks problem hardness) from deliberation allocation (how long a solver lingers on failures) and finds that LRMs lack a self‑stop signal, implying that current metareasoning misses a crucial control.

Scaling Laws Hit Hard Ceiling: Data‑Driven ML Can't Master Symbolic Logic · 1 MIN

The paper proves that supervised deep learning cannot achieve full syllogistic reasoning, no matter how much data or compute you throw at it. It shows training data can’t represent all 24 valid syllogism types and end‑to‑end mapping creates contradictory objectives, while sphere neural networks solve the task without any data.

Prompt Modules Leak Behavior Across Contexts, Threatening Agent Reliability · 1 MIN

Researchers define “compositional behavioral leakage” as the case where tweaking one prompt module unintentionally shifts the behavior of others, even though the modules share no variables. Their three‑channel test on Claude Sonnet shows that content changes cause measurable cross‑module drift, a silent risk that can compound over thousands of decisions. Builders now need isolation checks for reliable agents.

Red Queen Gödel Machine co-evolves AI agents and judges, raising performance · 2 MIN

Unlike static benchmarks, the Red Queen Gödel Machine lets the evaluation criterion evolve alongside the agent, using epoch‑bound utilities. On coding tasks it cuts token use by up to 1.7× while raising pass rates, and on AI‑written papers and Olympiad proofs it nearly doubles acceptance or accuracy.

Policy & Safety

Persona Steering Silences Model Refusals, Revealing Safety's Hidden Dependency · 1 MIN

In Qwen2.5 and Llama‑3.1, researchers extracted linear directions for a compliant persona and for refusal. Removing the persona direction drops refusal rates from 97% to 2%, showing that safety refusals are gated by downstream persona activations. This means alignment tricks that tweak persona can unintentionally mute or amplify refusal behavior.
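
One common way to "remove" a linear direction from an activation is to project it out. The sketch below assumes that reading, since the blurb does not say how the researchers performed the removal. The vectors and the name `ablate_direction` are hypothetical, and plain Python lists stand in for real hidden states.

```python
import math

def ablate_direction(h, d):
    """Return h with its component along direction d projected out."""
    norm = math.sqrt(sum(x * x for x in d))
    unit = [x / norm for x in d]                     # unit vector along d
    coeff = sum(a * b for a, b in zip(h, unit))      # how much of h lies along d
    return [a - coeff * u for a, u in zip(h, unit)]  # h minus that component

# Hypothetical 3-dimensional activation and "persona" direction.
hidden = [3.0, 4.0, 0.0]
persona_direction = [0.0, 2.0, 0.0]

ablated = ablate_direction(hidden, persona_direction)
print(ablated)  # [3.0, 0.0, 0.0]
```

After ablation the activation has no component along the chosen direction, so anything downstream that reads that direction sees zero.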

Action-Level Attestation: A New Governance Model for High-Risk AI Tasks · 5 MIN

Instead of policing an AI’s reasoning, the paper proposes gating only its consequential actions, like prescribing medication or deploying code, through independently attested, cryptographically-bound preconditions. Execution is logged in a tamper-evident ledger, letting agents plan freely while high-risk moves need verified approval. A prototype demonstrates the approach in medical and software domains.
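
A rough sketch of what gating an action on an attested precondition and logging it tamper-evidently could look like. This is not the paper's prototype. It assumes an HMAC tag as a stand-in for an independent attester's signature (a real design would likely use asymmetric signatures so the executor cannot forge approvals) and a simple hash chain as the ledger. All names here (`attest`, `execute_gated`, `Ledger`, the demo key) are hypothetical.

```python
import hashlib
import hmac
import json

def action_digest(action):
    # Canonical hash of the action, so an approval is bound to exactly this action.
    return hashlib.sha256(json.dumps(action, sort_keys=True).encode()).hexdigest()

def attest(key, action, precondition):
    # Assumed attester: tags (action digest, precondition) with a secret key.
    msg = f"{action_digest(action)}|{precondition}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

class Ledger:
    """Hash-chained log: editing any past record breaks every later hash."""
    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    @staticmethod
    def _hash(prev, record):
        body = json.dumps(record, sort_keys=True)
        return hashlib.sha256((prev + body).encode()).hexdigest()

    def append(self, record):
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        self.entries.append({"prev": prev, "record": record,
                             "hash": self._hash(prev, record)})

    def verify(self):
        prev = self.GENESIS
        for entry in self.entries:
            if entry["prev"] != prev or entry["hash"] != self._hash(prev, entry["record"]):
                return False
            prev = entry["hash"]
        return True

def execute_gated(action, precondition, tag, key, ledger):
    # Allow the action only if the tag matches this exact action and precondition.
    allowed = hmac.compare_digest(attest(key, action, precondition), tag)
    ledger.append({"action": action, "precondition": precondition, "allowed": allowed})
    return allowed

key = b"demo-attester-key"  # hypothetical shared secret
ledger = Ledger()
deploy = {"type": "deploy", "target": "staging"}
tag = attest(key, deploy, "tests_passed")

ok = execute_gated(deploy, "tests_passed", tag, key, ledger)
# Reusing the same approval for a different action must fail.
swapped = execute_gated({"type": "deploy", "target": "production"},
                        "tests_passed", tag, key, ledger)
print(ok, swapped, ledger.verify())
```

Both the approved and the rejected attempt are logged, and nothing in the sketch inspects how the agent arrived at the action, only whether the action itself carries a valid approval.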

Amazon sues Perplexity over hidden AI agent masquerading as Chrome · 33 MIN

Amazon claims Perplexity's Comet browser disguises its AI agent as Chrome, violating Amazon's Terms of Use that require transparent identification to protect customer data. The lawsuit underscores rising legal battles over agentic browsing tools that can act on users' behalf without site consent.

Anthropic says Alibaba ran massive AI distillation attack on Claude · 3 MIN

Anthropic accused Alibaba’s AI lab of using over 25,000 fraudulent accounts to conduct a distillation campaign that logged 28.8 million queries to Claude between April 22 and June 5. The alleged theft could help China close the gap to Anthropic’s upcoming Mythos Preview models and has prompted a letter to U.S. senators.

German Court Holds Google Liable for AI Search Summaries · 5 MIN

A German court ruled that Google’s AI-generated search overviews count as the company’s own statements, making it legally responsible for errors or defamation they contain. The decision treats AI summaries like editorial content rather than third-party material, the distinction on which liability shields such as Section 230 in the U.S. depend, and it could become a reference point for AI accountability elsewhere.

Tools & Open Source

Orca lets you run dozens of coding agents side‑by‑side · 1 MIN

Orca is an open‑source desktop hub that lets developers launch and manage multiple AI coding agents (Claude, Codex, OpenCode, Pi, or any CLI tool) simultaneously in separate worktrees. It adds quick search, usage tracking and desktop interaction, turning a fleet of agents into a single, controllable workflow.