TabFM zero-shot table predictions, Claude Sonnet 5 cost spike
Google’s new TabFM model treats a spreadsheet as a prompt, delivering classification or regression results in a single forward pass. By using in‑context learning, it skips the usual data‑science grind: no model fitting, hyper‑parameter search, or feature engineering required. The code is open on Hugging Face and GitHub.
Anthropic’s Claude Sonnet 5 hits Opus 4.8‑level quality at a lower headline price, but a new tokenizer inflates token counts by 30% to 40%, so the same text costs more than the per‑token price suggests. The API drops the temperature, top_p, and top_k parameters, adds a 1M‑token context window, and enables adaptive thinking by default.
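The tokenizer effect is easy to check with a little arithmetic. The sketch below compares the cost of the same text before and after a token-count increase. The prices are hypothetical placeholders chosen for illustration, not published rates, and the function names are ours.

```python
def effective_cost_ratio(old_price_per_mtok, new_price_per_mtok, token_inflation):
    """Return new cost / old cost for the same piece of text.

    token_inflation is the fractional increase in token count
    (0.35 means the new tokenizer emits 35% more tokens).
    """
    old_cost = old_price_per_mtok * 1.0
    new_cost = new_price_per_mtok * (1.0 + token_inflation)
    return new_cost / old_cost


def break_even_price(old_price_per_mtok, token_inflation):
    """Highest new per-token price at which the same text costs no more than before."""
    return old_price_per_mtok / (1.0 + token_inflation)


# Hypothetical prices in dollars per million tokens, for illustration only.
ratio = effective_cost_ratio(old_price_per_mtok=5.0,
                             new_price_per_mtok=4.0,
                             token_inflation=0.35)
print(f"cost ratio: {ratio:.2f}")
print(f"break-even price: {break_even_price(5.0, 0.35):.2f}")
```

With these made-up numbers, a 20% cut in the per-token price still leaves the same text about 8% more expensive, because the 35% token increase outweighs the discount.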
A new paper derives a first‑principles closed‑form model for Group Relative Policy Optimization (GRPO) training dynamics, turning heuristic reward fits into a mechanistic framework. It predicts group‑size invariance, a stability threshold, and an overdamped‑to‑oscillatory transition, offering diagnostics that separate reward hacking from genuine instability. Experiments on multiple models achieve R² ≥ 0.91 and validate the predictions.
BayesBench tests whether LLMs reduce epistemic uncertainty in multi-turn dialogs like a Bayesian reasoner. The benchmark shows scaling improves latent inference, but belief updates still fall short of rational posterior tracking, exposing a gap for conversational agents that must adapt to new evidence.
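For readers who want a concrete picture of "rational posterior tracking", here is a minimal sketch of the ideal reasoner the blurb alludes to. The hypotheses, likelihood values, and function names are hypothetical and are not taken from BayesBench. Each dialog turn is treated as evidence that multiplies the prior by a likelihood and renormalizes.

```python
import math


def bayes_update(prior, likelihoods):
    """One Bayesian update: posterior is proportional to prior times likelihood."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


def entropy_bits(dist):
    """Shannon entropy in bits, used here as a simple uncertainty measure."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical setup: three candidate user intents, uniform prior.
belief = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

# Hypothetical per-turn likelihoods P(observed turn | hypothesis).
turns = [
    {"A": 0.8, "B": 0.1, "C": 0.1},
    {"A": 0.9, "B": 0.5, "C": 0.1},
]

history = [entropy_bits(belief)]
for likelihoods in turns:
    belief = bayes_update(belief, likelihoods)
    history.append(entropy_bits(belief))

print(belief)
print(history)
```

A benchmark of this kind would compare a model's stated beliefs after each turn against a reference trajectory like `history`. How BayesBench actually scores that comparison is not described here.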
Deterministic few‑step decoders work for image latents but fail on text because a smooth map can’t commit to sharp categorical choices. The paper proves the failure stems from geometric constraints on readout sharpness, not model size or data. It also offers diagnostics (DABI, CCI) and shows how autoregressive or stochastic tricks bypass the limit.
NVIDIA's ENPIRE framework lets coding agents close the loop on real‑world robot learning: reset, execute, verify, and refine policies autonomously. Using this pipeline, agents achieved 99% success on dexterous tasks like pin insertion and zip‑tie cutting. The work also introduces metrics to track fleet efficiency.
HGA replaces dense causal attention with a two‑level routing scheme that keeps the original QKV/O weights unchanged, so any checkpoint can be patched to run long contexts. On an RTX 5090 it runs a 30B model at 64K tokens using only a tiny routed working set, with negligible quality loss.
Full fine‑tuning often erodes skills a model already has. Fora estimates each layer’s activation subspace and blocks updates from touching those directions, keeping learned functions intact while still allowing new task learning. Experiments on Qwen‑3‑1.7B show markedly better capability retention than weight‑space tricks with minimal performance loss.
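The core idea, blocking updates along directions the layer already uses, can be sketched in a few lines. This is a generic orthogonal-projection illustration, not Fora's actual algorithm: it assumes the protected subspace is simply the span of a few sample input activations, whereas the method described above estimates that subspace per layer in a way not detailed here. All names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given activations."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_update(delta_w, basis):
    """Strip from each row of delta_w its component in the protected subspace."""
    projected = []
    for row in delta_w:
        r = list(row)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        projected.append(r)
    return projected


def matvec(matrix, x):
    return [dot(row, x) for row in matrix]


# Hypothetical layer with 3 inputs and 2 outputs.
old_activations = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # inputs the model already relies on
raw_update = [[0.5, -0.2, 0.3], [0.1, 0.4, -0.7]]     # proposed weight change

basis = orthonormal_basis(old_activations)
safe_update = project_update(raw_update, basis)

# Old inputs see no change in the layer's output; a new direction still does.
print(matvec(safe_update, [2.0, 3.0, 0.0]))
print(matvec(safe_update, [0.0, 0.0, 1.0]))
```

Because the projected update maps every protected input to zero, the layer's output on those inputs is unchanged, while directions outside the subspace remain free to learn the new task.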
A simple perplexity-difference test can expose the hidden finetuning objectives of public model organisms, from backdoors to fabricated facts. Tested on 76 models up to 70B parameters, the technique ranks completions that reveal illicit behavior, achieving state‑of‑the‑art detection on the AuditBench benchmark.
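As a rough sketch of what a perplexity-difference ranking might look like, the snippet below scores each completion by how much less "surprised" the finetuned model is than its base model. The scoring rule, the toy log-probabilities, and the function names are assumptions for illustration; the actual test and its AuditBench setup may differ.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_perplexity_gap(completions):
    """Sort completions by base perplexity minus finetuned perplexity, largest first.

    completions: list of (text, base_logprobs, finetuned_logprobs) tuples.
    A large positive gap means the finetuned model finds the text far more
    natural than the base model does, which flags it for review.
    """
    scored = []
    for text, base_lp, ft_lp in completions:
        gap = perplexity(base_lp) - perplexity(ft_lp)
        scored.append((gap, text))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Toy log-probabilities, invented for illustration.
completions = [
    ("ordinary answer", [-1.0, -1.0], [-1.1, -0.9]),
    ("suspicious answer", [-4.0, -4.0], [-0.5, -0.5]),
    ("slightly odd answer", [-2.0, -2.0], [-1.5, -1.5]),
]

for gap, text in rank_by_perplexity_gap(completions):
    print(f"{gap:8.2f}  {text}")
```

In practice the log-probabilities would come from scoring the same completion with both models, and the top of the ranking is what a human auditor would read first.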
A new agentic autoformalization system uses general-purpose coding LLMs to translate novel research mathematics into Lean 4, extending libraries on the fly and checking proofs mechanically. It succeeded on a random sample of 32 Putnam problems and formalized main theorems from five STOC papers, two of which required only Lean's kernel axioms.
DeepSeek released DSpark, an open‑source speculative decoding framework that can accelerate LLM inference by up to 85% without altering model outputs. The full codebase, training scripts, and checkpoints are available on the DeepSpec GitHub repo, letting developers plug the speed boost into any compatible model.
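Why can speculative decoding be faster without changing outputs? A toy greedy version makes the point. This is a generic sketch of the technique, not DSpark's API: the "models" are stand-in functions, and real systems verify all drafted tokens in one batched forward pass and use a rejection-sampling rule when sampling rather than decoding greedily.

```python
def greedy_decode(next_token, prompt, n_new):
    """Baseline: one target-model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        proposal = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft_next(seq + proposal))
        # 2. The target checks each proposed position in order.
        for i, tok in enumerate(proposal):
            expected = target_next(seq + proposal[:i])
            if tok != expected:
                # Keep the agreed prefix, then take the target's own token.
                seq.extend(proposal[:i])
                seq.append(expected)
                break
        else:
            seq.extend(proposal)  # every drafted token was accepted
    return seq


# Toy stand-ins for the two models (hypothetical).
def target_next(seq):
    return (seq[-1] * 3 + 1) % 7


def draft_next(seq):
    # Agrees with the target except after token 5, to exercise rejection.
    return 0 if seq[-1] == 5 else target_next(seq)


baseline = greedy_decode(target_next, [1], 10)
fast = speculative_decode(target_next, draft_next, [1], 10, k=4)
print(baseline == fast)
```

Every emitted token is either confirmed or supplied by the target model, so the result matches plain greedy decoding. The speedup comes from accepting several drafted tokens per verification step when the draft is usually right.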