AI — 2026-06-09
Google’s Gemma 4 model now includes special control tokens that separate internal reasoning (“thinking”) from final answers, letting developers preserve the model’s thought process in prompts. The new chat template defines <|think|>‑style delimiters for agentic workflows and tool use.
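As a rough illustration of what such delimiters allow, the sketch below separates a reasoning span from the final answer in raw model output. The marker strings `<|think|>` and `<|/think|>` are placeholders chosen for this example, not the confirmed Gemma 4 tokens, and `split_reasoning` is a hypothetical helper; the actual chat template should be consulted for the real delimiters.

```python
from typing import NamedTuple


class ParsedReply(NamedTuple):
    thinking: str  # text found between the delimiters ("" if none)
    answer: str    # everything outside the delimiters


def split_reasoning(
    text: str,
    open_tok: str = "<|think|>",    # placeholder, not the confirmed token
    close_tok: str = "<|/think|>",  # placeholder, not the confirmed token
) -> ParsedReply:
    """Split raw output into (thinking, answer) using the first delimiter pair."""
    start = text.find(open_tok)
    if start == -1:
        # No reasoning block at all: the whole text is the answer.
        return ParsedReply("", text.strip())

    body_start = start + len(open_tok)
    end = text.find(close_tok, body_start)
    if end == -1:
        # Unterminated block: treat the remainder as reasoning.
        return ParsedReply(text[body_start:].strip(), text[:start].strip())

    thinking = text[body_start:end].strip()
    answer = (text[:start] + text[end + len(close_tok):]).strip()
    return ParsedReply(thinking, answer)


raw = "<|think|>2 + 2 is 4<|/think|>The answer is 4."
parsed = split_reasoning(raw)
print(parsed.thinking)
print(parsed.answer)
```

Keeping the two parts separate lets an application decide whether to feed the reasoning back into the next prompt or show only the answer.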
Anthropic’s research shows its Claude model can accurately interpret and predict NMR spectra, performing on par with, or better than, established chemistry software. The study highlights Claude’s potential to streamline analytical workflows for chemists by applying AI reasoning to complex molecular data.
PTI runs multiple token streams in parallel via llama.cpp's batch decoding, sharing weight loads to avoid extra model copies. On an AMD Instinct MI50, it achieves a 1.96× speedup for Qwen3.6-27B (38.1 vs 19.4 tok/s) with only ~0.2 GiB extra VRAM.
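The speedup figure follows directly from the two throughput numbers quoted. A minimal check, where `speedup` is just an illustrative helper name:

```python
def speedup(new_tok_s: float, base_tok_s: float) -> float:
    """Return the throughput ratio new/base."""
    if base_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return new_tok_s / base_tok_s


# Throughput figures quoted in the item above (tokens per second).
ratio = speedup(38.1, 19.4)
print(f"{ratio:.2f}x")  # prints 1.96x
```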
A new KV‑cache patch for llama.cpp eliminates costly KV cell copies, restoring performance for Gemma‑4 models. Benchmarks on an RTX 5090 show structured decoding rising from 104 tok/s to 149 tok/s (+43%) and free‑text decoding speeding up by ~20% at 64k context.