OpenSearch slashes log costs 70% as Hudi indexes live

Data · 2026-07-02

Data Engineering

OpenSearch’s new log engine cuts cost 70% and doubles speed · 8 MIN

AWS launched a purpose-built log-analytics engine for Amazon OpenSearch Service that delivers up to 4× price-performance, 2× faster ingestion, and up to 70% lower storage costs. It stores logs in columnar Parquet, routes queries to DataFusion or Lucene mid-query, and works with existing APIs, letting teams cut spend while keeping full search capabilities.

Hudi’s async indexing adds new indexes to live tables without downtime · 17 MIN

Apache Hudi now supports async indexing, letting you add new record‑level or secondary indexes to a multi‑petabyte table while ingest continues uninterrupted. The append‑first architecture stitches historical data together with live writes, eliminating the downtime required by other lakehouse formats.

How Arcesium Halved Query Costs with DuckDB · 10 MIN

Arcesium cut query costs and runtimes by about half by migrating thousands of Athena and Trino SQL jobs to DuckDB over 18 months. The lightweight OLAP engine handled S3‑backed Parquet data with 50% faster execution and 50% lower resource usage, unlocking scale for client onboarding. Their experience highlights practical steps and pitfalls for similar migrations.

ML & AI for Data

Meta’s DEmate boosts data engineering with a recipe‑driven LLM assistant · 8 MIN

Meta released DEmate, an LLM‑driven assistant that writes, reviews, and tests SQL pipelines inside its private analytics stack. By wrapping the model in a ‘Recipe Architecture’ that maps prompts to 70 custom data‑engineering recipes, the tool achieved an 80% acceptance rate and 3,500 weekly active users, showing how tuned LLMs can scale internal data workflows.

LLMs Spot Spark SQL Bottlenecks, Cutting Debug Time · 9 MIN

Expedia’s team fed Spark SQL physical plans to a large language model, which automatically flags anti‑patterns like missing broadcasts, data skew, and excessive shuffles. The LLM then suggests concrete fixes, cutting hours‑long manual triage to a few minutes and lowering cluster costs.

t0-alpha Shows Small Transformer Can Match Large Time‑Series Forecasting Benchmarks · 10 MIN

t0-alpha, a 102M-parameter decoder-style transformer, demonstrates how modern time-series foundation models tokenise, attend causally, and output probabilistic quantiles. Running it reproduces the paper's CRPS 0.4941 and MASE 0.7240, showing competitive accuracy while remaining hardware‑friendly. Its patch‑based design points to calibration and routing as the next big leaps.

Inductive Latent Context Persistence lets LLM agents skip cold‑start re‑prompting · 12 MIN

Inductive Latent Context Persistence (ILCP) compresses an LLM's hidden state into a tiny latent payload and ships it across multi‑agent handovers. The trick wipes out costly context rebuilds, cutting per‑hop latency to 7.7 ms and boosting post‑handover accuracy by up to 13.3 percentage points. It repurposes a 6G handover insight for agent pipelines.

AI agents are reviving ontologies as the new data layer · 9 MIN

Big‑tech platforms (Palantir, Microsoft Fabric, Databricks, Google) are rolling out ontologies as core data layers. Ontologies give AI agents a shared, machine‑readable business vocabulary, enabling richer inference than raw schemas. The shift turns structured meaning into a competitive advantage for any data‑driven organization.

AI agents now turn raw data into charts and insights without a human · 1 MIN

A new framework lets AI agents take raw datasets, clean them, choose optimal visualizations, and generate narrative insights without human input. This end‑to‑end automation could free analysts from repetitive prep work and let teams focus on strategy, accelerating decision‑making cycles.

Databases & Storage

Meta’s BLOB storage revamps AI training, slashing GPU stalls · 12 MIN

Meta introduced a new BLOB storage layer built on its Tectonic fabric to serve exabyte‑scale AI datasets. Smart tiering and placement cut data‑movement latency, boosting GPU utilization and accelerating research cycles across regions.

SedonaDB 0.4's RayBooster lets consumer GPUs outpace H100 on spatial joins · 2 MIN

SedonaDB 0.4 adds RayBooster, a GPU engine that maps spatial joins onto NVIDIA ray‑tracing cores. On a consumer RTX 3090 it delivers up to 5.9× faster joins and can even beat an H100 on certain queries, cutting AWS costs by 60%.

Too many PostgreSQL tables can crash your server and stall queries · 9 MIN

A customer’s PostgreSQL cluster kept crashing under the Linux OOM killer and suffered from CPU‑hungry, long‑running queries. The root cause was an explosion of tables that bloated the system catalogs, slowed planner logic, and inflated I/O. Consolidating tables and pruning unused schemas can restore stability and speed up upgrades.

Practice & Datasets

Google shares global rooftop reflectivity data to accelerate cool‑roof climate action · 3 MIN

Google Research today opened a building‑level rooftop reflectivity dataset covering more than 50 cities, accessible via a new Earth Engine app. The data lets planners pinpoint where cool‑roof interventions would cut urban heat the most, potentially lowering surface temperatures by up to 0.5 °C.

ScarfBench Shows AI Agents Falter on Enterprise Java Migration · 4 MIN

IBM Research’s new ScarfBench benchmark tests AI coding agents on moving Java apps between Spring, Jakarta EE and Quarkus. Early results show agents can compile code but frequently fail to deploy or preserve behavior, especially for whole‑application migrations. The suite gives a realistic bar for AI‑driven modernization tools.