DataFusion gets SQL superpowers, Iceberg cuts latency

Data · 2026-06-15

Data Engineering

DataFusion 54.0.0 adds LATERAL joins and spill‑to‑disk, boosting SQL capabilities · 9 MIN

The new release introduces LATERAL joins, SQL lambda functions for arrays, an Arrow‑based Avro reader, and spill‑to‑disk for memory‑heavy nested loop joins. Under the hood, join, scan and planning performance get a 5‑20% bump, and scalar subqueries run once instead of being rewritten as joins. These upgrades broaden DataFusion’s SQL expressiveness and cut query runtimes for complex analytics.
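For readers new to LATERAL joins, here is a minimal pure‑Python sketch of what one computes: the right‑hand subquery is evaluated once per left row and may reference that row. It does not use DataFusion; `lateral_join`, `top_two_orders`, and the sample rows are hypothetical names and data made up for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

def lateral_join(left: List[Row], subquery: Callable[[Row], List[Row]]) -> List[Row]:
    """Pair each left row with every row its dependent subquery returns."""
    out: List[Row] = []
    for row in left:
        for sub in subquery(row):  # the subquery can see the current left row
            out.append({**row, **sub})
    return out

# Hypothetical sample data.
customers = [{"customer": "a"}, {"customer": "b"}]
orders = [
    {"customer": "a", "amount": 30},
    {"customer": "a", "amount": 50},
    {"customer": "b", "amount": 20},
    {"customer": "a", "amount": 10},
]

def top_two_orders(c: Row) -> List[Row]:
    """Dependent subquery: the two largest orders of the given customer."""
    mine = [o for o in orders if o["customer"] == c["customer"]]
    mine.sort(key=lambda o: o["amount"], reverse=True)
    return [{"amount": o["amount"]} for o in mine[:2]]

result = lateral_join(customers, top_two_orders)
```

This is the classic "top N per group" shape, which is awkward to express with a plain join because the right side depends on the current left row.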

Why moving to Apache Iceberg cut query latency and maintenance overhead · 11 MIN

A four‑year‑old AWS data lake built on Hive‑style partitions was rebuilt on Apache Iceberg after AWS added native Iceberg support. The move eliminated costly partition pruning issues, enabled schema evolution without rewrites, and cut query latency by up to 40%, freeing engineering time for higher‑value work.

ML & AI for Data

Spotify’s Vedder adds a domain‑expert layer for reliable AI data queries · 6 MIN

Spotify’s AI data assistant, Vedder, now uses a ‘context layer’ where domain experts curate each of 177 clusters with vetted datasets and question‑SQL pairs. This encoding of expertise lets over 2,100 users retrieve accurate answers from 70,000+ datasets without writing SQL, bridging the gap between raw schemas and real business meaning.

LinkedIn's MUSE boosts semantic hiring search, delivering higher recruiter relevance · 17 MIN

LinkedIn introduced MUSE, a dual‑tower embedding model trained on millions of recruiter interactions, to power semantic search across 1.3 billion member profiles. The system lets recruiters describe roles in natural language and returns higher‑quality matches, driving measurable gains in candidate relevance and recruiter engagement.

Linux Foundation launches OpenSharing to make AI models and data portable across clouds · 4 MIN

The Linux Foundation introduced OpenSharing, an open, vendor‑neutral protocol that extends Delta Sharing to include AI models, agent skills, and unstructured data. By standardizing exchanges across clouds and platforms, it eliminates proprietary silos, letting enterprises share AI assets securely and at scale.

Google's new Regularized f-Divergence Kernel Tests boost machine-unlearning audits · 6 MIN

Google Research unveils Regularized f‑Divergence Kernel Tests, a statistical framework that detects unlearning failures with higher sensitivity and lower sample cost than traditional two‑sample tests. This lets auditors prove GDPR‑style forgetting without full model access, cutting computational expense and tightening privacy guarantees for large‑scale AI systems.

Agentic AI’s hidden cost: Why per‑task pricing beats token pricing · 22 MIN

Uber’s AI spend exploded when Claude Code agents went from pilot to production, draining the whole annual budget because each task triggers 5‑30× more tokens than a simple chat. The Cockroach Labs post shows why token‑based pricing fails for agentic AI and outlines a task‑centric cost model and practical controls to keep enterprise AI bills in check.
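To put the 5‑30× multiplier in perspective, the sketch below works out how many agent tasks fit in a fixed budget compared with simple chats. It is back‑of‑the‑envelope arithmetic, not the cost model from the post; the token count per chat, the price per million tokens, and the budget are all assumed figures chosen for illustration.

```python
def tasks_within_budget(budget_usd: int, usd_per_million_tokens: int,
                        tokens_per_task: int) -> int:
    """How many tasks a budget covers under flat per-token pricing.

    Integer arithmetic throughout, so there are no float rounding surprises.
    """
    budget_tokens = budget_usd * 1_000_000 // usd_per_million_tokens
    return budget_tokens // tokens_per_task

# Assumed figures, not taken from the post.
CHAT_TOKENS = 2_000          # tokens in one simple chat exchange
PRICE = 10                   # USD per million tokens
BUDGET = 1_000               # USD

chats = tasks_within_budget(BUDGET, PRICE, CHAT_TOKENS)
agent_best = tasks_within_budget(BUDGET, PRICE, CHAT_TOKENS * 5)    # 5x multiplier
agent_worst = tasks_within_budget(BUDGET, PRICE, CHAT_TOKENS * 30)  # 30x multiplier
```

Under these assumptions the same budget covers 50,000 chats but only between 1,666 and 10,000 agent tasks. A plan sized on chat‑style usage therefore runs out far sooner than expected once agents go to production.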

Build a Low‑Latency Feature Store with DuckDB, Redis, and FastAPI · 9 MIN

A five‑component DIY feature store built with DuckDB, Parquet, Redis, and FastAPI eliminates training‑serving skew and powers real‑time LLM‑driven recommendations. The guide walks through a registry, offline/online stores, materialization, and a low‑latency API, showing code you can copy into production today.
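The materialization step can be pictured as copying the latest feature values per entity from the offline store into the online store. The sketch below is not the guide's code: a list of dicts stands in for the Parquet/DuckDB offline store, a plain dict stands in for Redis, and `materialize` and the field names are hypothetical.

```python
from typing import Dict, List

def materialize(offline_rows: List[dict], online_store: Dict[str, dict]) -> Dict[str, dict]:
    """Write each entity's most recent feature row into the online store."""
    latest_ts: Dict[str, int] = {}
    for row in offline_rows:
        entity = row["entity_id"]
        # Keep only the newest row seen so far for this entity.
        if entity not in latest_ts or row["ts"] > latest_ts[entity]:
            latest_ts[entity] = row["ts"]
            online_store[entity] = row["features"]
    return online_store

# Hypothetical offline history (stand-in for a Parquet table).
offline = [
    {"entity_id": "u1", "ts": 1, "features": {"clicks_7d": 3}},
    {"entity_id": "u2", "ts": 1, "features": {"clicks_7d": 8}},
    {"entity_id": "u1", "ts": 2, "features": {"clicks_7d": 5}},
]

online = materialize(offline, {})  # stand-in for Redis
```

Because training reads the offline history and serving reads values derived from that same history, both paths share one feature definition, which is the idea behind avoiding training‑serving skew.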

Databases & Storage

Feldera’s DBSP Engine Turns Streams into Incremental SQL Views · 3 MIN

Feldera models streaming data as relational deltas (Z‑sets) and uses the DBSP engine to propagate only the affected rows. This avoids re‑computing joins and aggregations, delivering low‑latency, scalable continuous analytics for complex queries that traditional streaming SQL systems struggle with.
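A toy illustration of the Z‑set idea: each row carries an integer weight (+1 for an insert, -1 for a delete), and a linear operator such as a per‑key count can be applied to the delta alone and merged into the existing view. This is a pure‑Python sketch, not Feldera's API or the DBSP engine itself; `zset_add`, `count_by_key`, and the sample rows are hypothetical.

```python
from typing import Dict, Hashable

ZSet = Dict[Hashable, int]  # row -> weight

def zset_add(a: ZSet, b: ZSet) -> ZSet:
    """Add two Z-sets, dropping rows whose weights cancel to zero."""
    out = dict(a)
    for row, weight in b.items():
        total = out.get(row, 0) + weight
        if total:
            out[row] = total
        else:
            out.pop(row, None)
    return out

def count_by_key(zset: ZSet) -> ZSet:
    """Per-key row count. Linear, so it works on full tables and on deltas."""
    out: ZSet = {}
    for (key, _value), weight in zset.items():
        out = zset_add(out, {key: weight})
    return out

# Hypothetical table of (region, order_id) rows, all present once.
table = {("eu", 1): 1, ("eu", 2): 1, ("us", 3): 1}
view = count_by_key(table)

# A change arrives: one delete and one insert.
delta = {("eu", 2): -1, ("us", 4): 1}

# Update the view from the delta only, without rescanning the table.
view = zset_add(view, count_by_key(delta))
table = zset_add(table, delta)
```

The incrementally maintained `view` matches what a full recomputation over the updated `table` would give, while touching only the rows named in `delta`.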

Practice & Datasets

Six Company Enrichment APIs Compared on 349 Real Domains: Who Wins? · 20 MIN

The benchmark tested six popular enrichment providers on a common set of 349 DNS‑resolved domains from the Majestic Million, measuring coverage and data depth. CompanyEnrich led overall with the highest match rate and richest profiles, while Apollo, People Data Labs and others excelled in specific fields.

748‑paper JSONL dataset streamlines mechanistic interpretability research · 52 MIN

A curated Hugging Face dataset bundles 748 mechanistic interpretability papers from arXiv and Semantic Scholar, each annotated with a quality score. Researchers can download the JSONL file to train, benchmark, or meta‑analyze interpretability methods without manual paper collection.