DataFusion gets SQL superpowers, Iceberg cuts latency
The new DataFusion release introduces LATERAL joins, SQL lambda functions for arrays, an Arrow‑based Avro reader, and spill‑to‑disk for memory‑heavy nested loop joins. Under the hood, join, scan and planning performance improve by 5‑20%, and scalar subqueries run once instead of being rewritten as joins. These upgrades broaden DataFusion’s SQL expressiveness and cut query runtimes for complex analytics.
A four‑year‑old AWS data lake built on Hive‑style partitions was rebuilt on Apache Iceberg after AWS added native Iceberg support. The move eliminated costly partition pruning issues, enabled schema evolution without rewrites, and cut query latency by up to 40%, freeing engineering time for higher‑value work.
Spotify’s AI data assistant, Vedder, now uses a ‘context layer’ where domain experts curate each of 177 clusters with vetted datasets and question‑SQL pairs. This encoding of expertise lets over 2,100 users retrieve accurate answers from 70,000+ datasets without writing SQL, bridging the gap between raw schemas and real business meaning.
LinkedIn introduced MUSE, a dual‑tower embedding model trained on millions of recruiter interactions, to power semantic search across 1.3 billion member profiles. The system lets recruiters describe roles in natural language and returns higher‑quality matches, driving measurable gains in candidate relevance and recruiter engagement.
The Linux Foundation introduced OpenSharing, an open, vendor‑neutral protocol that extends Delta Sharing to include AI models, agent skills, and unstructured data. By standardizing exchanges across clouds and platforms, it eliminates proprietary silos, letting enterprises share AI assets securely and at scale.
Google Research unveiled Regularized f‑Divergence Kernel Tests, a statistical framework that detects unlearning failures with higher sensitivity and lower sample cost than traditional two‑sample tests. This lets auditors prove GDPR‑style forgetting without full model access, cutting computational expense and tightening privacy guarantees for large‑scale AI systems.
Uber’s AI spend exploded when Claude Code agents went from pilot to production, draining the whole annual budget because each task triggers 5‑30× more tokens than a simple chat. The Cockroach Labs post shows why token‑based pricing fails for agentic AI and outlines a task‑centric cost model and practical controls to keep enterprise AI bills in check.
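To make the 5‑30× multiplier concrete, here is a rough back-of-the-envelope sketch of how far a fixed budget stretches when cost is counted per task rather than per chat. The function name, the price, the budget, and the chat size are all hypothetical numbers chosen for illustration; none of them come from the Cockroach Labs post or from Uber.

```python
def tasks_affordable(budget_usd, usd_per_million_tokens, chat_tokens, multiplier):
    """Number of agentic tasks a budget covers.

    Assumes one task consumes `multiplier` times the tokens of a simple chat
    (the blurb above cites a 5-30x range). All inputs are illustrative integers.
    """
    tokens_affordable = budget_usd * 1_000_000 // usd_per_million_tokens
    return tokens_affordable // (chat_tokens * multiplier)


# Hypothetical inputs: $600 budget, $10 per million tokens, 2,000-token chat.
best_case = tasks_affordable(600, 10, 2_000, 5)    # tasks at the 5x end
worst_case = tasks_affordable(600, 10, 2_000, 30)  # tasks at the 30x end
print(best_case, worst_case)
```

The same budget covers six times fewer tasks at the top of the range than at the bottom, which is one way to see why a per-task budget is easier to reason about than a per-token price.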
A five‑component DIY feature store built with DuckDB, Parquet, Redis, and FastAPI eliminates training‑serving skew and powers real‑time LLM‑driven recommendations. The guide walks through a registry, offline/online stores, materialization, and a low‑latency API, showing code you can copy into production today.
Feldera models streaming data as relational deltas (Z‑sets) and uses the DBSP engine to propagate only the affected rows. This avoids re‑computing joins and aggregations, delivering low‑latency, scalable continuous analytics for complex queries that traditional streaming SQL systems struggle with.
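The Z‑set idea can be sketched in a few lines of plain Python. This is a toy illustration, not Feldera's or DBSP's actual API: a Z‑set is modeled here as a dict from row to integer weight (+1 for an insert, -1 for a delete), and every function name below is hypothetical. The join delta uses the standard bilinear identity: the change to L ⋈ R equals dL ⋈ R plus L ⋈ dR plus dL ⋈ dR.

```python
def zset_add(a, b):
    """Add two Z-sets (row -> integer weight), dropping rows whose weight is 0."""
    out = dict(a)
    for row, weight in b.items():
        out[row] = out.get(row, 0) + weight
    return {row: w for row, w in out.items() if w != 0}


def zset_join(left, right, key_left, key_right):
    """Join two Z-sets; an output row's weight is the product of input weights."""
    out = {}
    for l_row, l_w in left.items():
        for r_row, r_w in right.items():
            if key_left(l_row) == key_right(r_row):
                row = l_row + r_row
                out[row] = out.get(row, 0) + l_w * r_w
    return {row: w for row, w in out.items() if w != 0}


def join_delta(left, right, d_left, d_right, key_left, key_right):
    """Change to (left JOIN right) caused by deltas d_left and d_right."""
    delta = zset_join(d_left, right, key_left, key_right)
    delta = zset_add(delta, zset_join(left, d_right, key_left, key_right))
    return zset_add(delta, zset_join(d_left, d_right, key_left, key_right))


# Toy tables: orders are (order_id, user), users are (user, country).
orders = {("o1", "alice"): 1, ("o2", "bob"): 1}
users = {("alice", "US"): 1, ("bob", "DE"): 1}
by_user = lambda order: order[1]
by_name = lambda user: user[0]

# One insert and one delete arrive on the orders stream; users are unchanged.
d_orders = {("o3", "alice"): 1, ("o2", "bob"): -1}
delta = join_delta(orders, users, d_orders, {}, by_user, by_name)

old_view = zset_join(orders, users, by_user, by_name)
new_view = zset_add(old_view, delta)
```

Only the two changed order rows are joined against `users`; the untouched row `o1` is never revisited, yet `new_view` matches what a full recomputation over the updated tables would produce.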
A benchmark tested six popular enrichment providers on a common set of 349 DNS‑resolved domains from the Majestic Million, measuring coverage and data depth. CompanyEnrich led overall with the highest match rate and richest profiles, while Apollo, People Data Labs and others excelled in specific fields.
A curated Hugging Face dataset bundles 748 mechanistic interpretability papers from arXiv and Semantic Scholar, each annotated with a quality score. Researchers can download the JSONL file to train, benchmark, or meta‑analyze interpretability methods without manual paper collection.