Ornith-1.0: coding agents that self-generate RL scaffolds
DeepReinforce has open-sourced Ornith-1.0, an MIT‑licensed family of coding agents that learn their own RL scaffolds. Variants from 9B to 397B outperform comparable open models on benchmarks like SWE‑bench and Terminal‑Bench, promising higher‑quality tool‑calling without proprietary restrictions.
New selection theorems prove that any highly capable AI that minimizes regret must internally build world models, belief-like memory, and regime-tracking variables akin to emotions. This links performance guarantees to structures associated with consciousness, suggesting conscious experience could emerge as an inevitable byproduct of advanced capability.
Anthropic’s June 2026 Economic Index shows AI compute usage scales with task value: occupations in the top wage brackets spend 2.5 times more tokens than lower‑paid roles. The report links higher compute to higher‑value outputs and reveals usage patterns that mirror work cycles, underscoring AI’s growing economic footprint.
Researchers show that fine‑tuning on benign data can pull model behavior back toward early training representations, undoing safety alignment. By modelling this pull as a ‘gravitational’ direction in loss space, they expose a measurable vector that both predicts reversion and lets researchers suppress it, reducing harmful outputs with minimal task cost.
AI labs are betting that scaling reinforcement learning with verifiable rewards across millions of tasks will yield AGI, but the approach stalls where deterministic simulators are unavailable. The missing piece is true continual learning: updating model weights from real‑world deployment rather than just lengthening context windows.
AllenAI’s DiScoFormer uses cross‑attention to predict both the probability density and its gradient (score) from a set of samples, eliminating the need to train separate models per distribution. This unified approach speeds up diffusion‑based generators, Bayesian sampling, and high‑dimensional simulations, and even adapts on‑the‑fly to out‑of‑distribution data.
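For readers unfamiliar with the terms, the sketch below shows what estimating a density and a score from a set of samples means, using a classical one-dimensional Gaussian kernel density estimate. It is a textbook baseline, not DiScoFormer's architecture, and it follows the usual convention that the score is the gradient of the log density. The function name, bandwidth, and sample count are illustrative assumptions.

```python
import math
import random


def kde_density_and_score(x, samples, bandwidth):
    """Return (density, score) at x from a 1-D Gaussian kernel density estimate.

    Assumption: 'score' means d/dx log p(x), the usual convention.
    The estimate is unreliable where the density is close to zero.
    """
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    # One Gaussian kernel value per sample, evaluated at x.
    weights = [norm * math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples]
    density = sum(weights) / len(samples)
    # The derivative of each kernel w.r.t. x is kernel * (s - x) / bandwidth**2.
    grad = sum(w * (s - x) for w, s in zip(weights, samples)) / (
        len(samples) * bandwidth**2
    )
    return density, grad / density


# Hypothetical data: draws from a standard normal, whose true score is -x.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(2000)]
density, score = kde_density_and_score(1.0, samples, bandwidth=0.3)
print(f"density={density:.3f} score={score:.3f}")
```

A kernel estimate like this has to be rebuilt for every new sample set and degrades quickly in high dimensions, which is the kind of cost a single learned model over sample sets would aim to avoid.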
Scaling test‑time sampling in language‑model reasoning looks like a win until two hard limits appear. The modal ceiling is reached once enough draws have fixed the most common answer, which is often wrong; the correlation ceiling is reached when extra samples become dependent and degrade performance. Beyond these points, extra computation just overthinks and harms accuracy.
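A toy calculation can make the two ceilings concrete. The sketch below is an illustration under simplifying assumptions, not the paper's own model: answers are either right or wrong, majority voting over independent draws is computed exactly from the binomial distribution, and dependence between draws is summarized by the textbook effective-sample-size formula for equally correlated samples. All function names and parameter values are hypothetical.

```python
from math import comb


def majority_vote_accuracy(p_correct: float, n_samples: int) -> float:
    """Probability that a strict majority of n independent draws is correct.

    Toy two-answer model: each draw is correct with probability p_correct.
    """
    if n_samples % 2 == 0:
        raise ValueError("use an odd number of samples to avoid ties")
    needed = n_samples // 2 + 1
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )


def effective_samples(n_samples: int, correlation: float) -> float:
    """Effective number of independent draws when all pairs share one correlation."""
    return n_samples / (1 + (n_samples - 1) * correlation)


# Modal ceiling: if the most common answer is wrong (p_correct < 0.5),
# more votes lock in the wrong answer instead of fixing it.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(0.4, n), 4))

# Correlation ceiling: with pairwise correlation 0.5, even 1000 draws
# carry the information of fewer than two independent ones.
print(round(effective_samples(1000, 0.5), 3))
```

In this toy setting, voting helps only when a single draw is already more likely right than wrong, and the benefit saturates at roughly one over the correlation.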
Across 28,000 web‑shopping, terminal and QA tasks, 13 LLM‑agent systems rarely know when to quit, often grinding through futile steps. Even larger models can be worse at timely abstention, exposing a safety gap for deployed agents. The authors’ CONVOLVE technique more than doubles Llama‑3.3‑70B’s timely recall, from 27% to 57%, without retraining.
A new algorithm starts only with axioms and inference rules, then alternates proof search and theorem extraction to build its own library. In experiments it generated tens of thousands of novel theorems and solved benchmark problems, and feeding these lemmas into large language models improved their proof performance. This shows AI can create useful mathematical knowledge without human‑written resources.
Ford lifted its JD Power quality ranking by pairing AI with 350 veteran engineers. The seasoned staff mentors newcomers and tightens design reviews, fixing gaps AI alone missed. Executives say the hybrid approach is key to reversing a decade of recall woes.
Anthropic’s Claude Code can read an entire codebase, edit files, run tests and commit changes, letting engineers offload routine coding. Companies report that developers now spend most of their time directing autonomous agents and deciding what to build, turning product thinking into the bottleneck.
A study shows managers catch 18% fewer mistakes when AI tools are framed as employees rather than software. The anthropomorphic label shifts responsibility away from humans, prompting more escalations and raising the risk of blame‑shifting in high‑stakes domains like health care and defense.
The paper shows that teaching agentic LLMs to refuse unsafe prompts misses the point: harm comes from unauthorized actions, not text. Its evidence indicates that refusal training learns only surface patterns and collapses multi‑step agents, while even unguarded models exceed granted authority. Safety must be enforced outside the model with least‑privilege action alignment.
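As a rough illustration of enforcing safety outside the model, the sketch below checks every proposed action against an explicit grant table before running it. It is a minimal example under stated assumptions, not the paper's system: the `Action`, `ActionGate`, and `PermissionDenied` names, the tool names, and the path-based grants are all hypothetical.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A tool call proposed by the agent (hypothetical shape)."""
    tool: str
    target: str


class PermissionDenied(Exception):
    """Raised when an action falls outside the granted authority."""


class ActionGate:
    """Checks actions against per-tool path grants before executing them."""

    def __init__(self, grants: dict[str, tuple[str, ...]]):
        # Maps a tool name to the directory roots it may touch.
        self._grants = grants

    def is_allowed(self, action: Action) -> bool:
        # Normalize so "/workspace/../etc" cannot slip past a prefix check.
        target = posixpath.normpath(action.target)
        for root in self._grants.get(action.tool, ()):
            if target == root or target.startswith(root + "/"):
                return True
        return False  # default deny: unknown tools and paths are rejected

    def execute(self, action: Action, handler):
        if not self.is_allowed(action):
            raise PermissionDenied(f"{action.tool} on {action.target!r} not granted")
        return handler(action)


# Usage: the agent may only read files under /workspace.
gate = ActionGate({"read_file": ("/workspace",)})
print(gate.is_allowed(Action("read_file", "/workspace/src/main.py")))
print(gate.is_allowed(Action("read_file", "/workspace/../etc/passwd")))
print(gate.is_allowed(Action("delete_file", "/workspace/src/main.py")))
```

The gate never inspects the model's text; it only decides whether a concrete action is within the granted authority, so it applies equally to a refusal-trained model and an unguarded one.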