IceStream kills Iceberg stale data faster

Data · 2026-06-14

Data Engineering

IceStream converts Iceberg equality deletes to fast deletion vectors · 16 MIN

IceStream runs as an async service that rewrites Apache Iceberg equality deletes into positional deletes or deletion vectors. Readers no longer have to match every data file against equality‑delete files at query time, which cuts query latency and storage bloat for streaming pipelines, and writers can commit without blocking. The diskless design leverages Flink and a Paimon index for scalable conversion.
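To make the conversion concrete, here is a minimal in-memory sketch of resolving equality deletes ("drop every row where id = 2") into positional deletes ("drop row 1 of file a.parquet"). It is not IceStream's code: the function name, the dict-based file layout, and the sample rows are all assumptions for illustration, and it ignores Iceberg's sequence-number rule that limits which data files a delete applies to.

```python
from collections import defaultdict


def to_positional_deletes(data_files, equality_deletes, key):
    """Resolve equality deletes into {file path: [row positions]}.

    data_files: hypothetical mapping of file path -> list of row dicts.
    equality_deletes: list of dicts naming the key values to delete.
    """
    deleted_keys = {d[key] for d in equality_deletes}
    positions = defaultdict(list)
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            if row[key] in deleted_keys:
                # Record where the row lives instead of what it equals.
                positions[path].append(pos)
    return dict(positions)


# Toy data, invented for this example.
data_files = {
    "a.parquet": [{"id": 1}, {"id": 2}, {"id": 3}],
    "b.parquet": [{"id": 2}, {"id": 4}],
}
equality_deletes = [{"id": 2}, {"id": 4}]

positional = to_positional_deletes(data_files, equality_deletes, "id")
print(positional)
```

Once deletes are stored as positions, a reader only has to skip known row offsets per file rather than compare every row against a set of deleted keys.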

CocoIndex lets you keep AI data fresh with delta‑only incremental indexing · 2 MIN

CocoIndex is an open‑source engine that continuously extracts, transforms, and indexes data (code, docs, Slack messages, PDFs) so LLM apps always see up‑to‑date context. Its Rust core reprocesses only the pieces that changed, scaling from a single repo to petabytes, and it offers a Python API for defining custom indexing flows.
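The delta-only idea can be sketched with content fingerprints: hash each source item, and re-run the transform only when the hash differs from the one stored at the last run. This is a generic illustration, not CocoIndex's API; `incremental_index`, its parameters, and the sample documents are hypothetical.

```python
import hashlib


def incremental_index(sources, index, fingerprints, transform):
    """Re-index only new or changed items; return the ids that were processed."""
    changed = []
    for doc_id, text in sources.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if fingerprints.get(doc_id) != digest:
            index[doc_id] = transform(text)  # the expensive step runs only here
            fingerprints[doc_id] = digest
            changed.append(doc_id)
    # Drop entries whose source item has disappeared.
    for doc_id in list(index):
        if doc_id not in sources:
            del index[doc_id]
            fingerprints.pop(doc_id, None)
    return changed


index, fingerprints = {}, {}
first = incremental_index({"a": "hello", "b": "world"}, index, fingerprints, str.upper)
second = incremental_index({"a": "hello", "b": "world!"}, index, fingerprints, str.upper)
print(first, second, index)
```

On the second run only `b` is transformed again, because `a` still matches its stored fingerprint.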

Discord scales dbt for petabyte‑level analytics with custom engine tweaks · 2 MIN

Discord extended dbt to handle petabytes of data and 100+ developers across 2,500+ models, cutting compile times from 20+ minutes to seconds. Custom table aliasing, time‑based incremental runs, and automatic backfill detection let engineers work without stepping on each other's toes. The result shows that open‑source tools can be extended for enterprise‑scale analytics.

Analytics & Visualization

Explore RGB Gamut Volumes in Oklab and CIELAB with a 3D Web Tool · 1 MIN

A new interactive site lets you render RGB color gamuts as solid 3D volumes in both Oklab and CIELAB, overlaying a comparison gamut and visualizing the spectral locus. It’s a quick way for designers and researchers to see how primaries and transfer curves reshape color space.

ML & AI for Data

GPU Time‑Slicing doubles LLM agent tail latency on a $150 GTX 1080 · 11 MIN

When two LLM agents share a single low‑cost GTX 1080 via Kubernetes CUDA time‑slicing, average throughput stays flat but the latency‑sensitive agent’s p99 jumps by 66% and jitter rises 67%. Because that tail cost is hidden, dashboards built on averages can miss deadline breaches: cheap GPU sharing isn’t free.
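A short sketch shows how an average can stay flat while p99 moves. The latency samples below are synthetic, chosen only to illustrate the effect; they are not the article's measurements, and `p99` is a plain nearest-rank percentile written for this example.

```python
import statistics


def p99(samples):
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


# Synthetic latencies in milliseconds (made up for illustration).
solo = [100] * 100                 # agent has the GPU to itself
shared = [98] * 97 + [166] * 3     # most requests fine, a few stall behind the other agent

print("mean:", statistics.mean(solo), statistics.mean(shared))
print("p99: ", p99(solo), p99(shared))
```

The two means differ by well under one percent, while the p99 of the shared run is 66% higher, which is the kind of gap an averages-only dashboard would not surface.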

Why Bigger Context Windows Still Miss Aggregations in RAG · 12 MIN

An engineer shows that expanding context windows in Retrieval‑Augmented Generation doesn't fix inaccurate aggregations. By benchmarking a 100k‑row CSV against a deterministic full‑scan engine, he demonstrates that RAG should be routed away from heavy numeric computation toward purpose‑built aggregation systems. The findings warn data teams against trusting RAG for summarizing large tables.
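The gap is easy to reproduce in miniature. In the sketch below, `top_k_sum` stands in for a retriever that surfaces only k matching rows to the model, `full_scan_sum` is the deterministic path, and `route` is a naive keyword router. All three functions, the keyword list, and the generated table are assumptions made for this example, not the engineer's benchmark.

```python
# Toy table: 1,000 rows alternating between two regions.
rows = [{"region": "EU" if i % 2 == 0 else "US", "amount": i} for i in range(1000)]

AGG_WORDS = {"sum", "total", "average", "count", "max", "min"}


def full_scan_sum(rows, region):
    """Deterministic aggregation over every matching row."""
    return sum(r["amount"] for r in rows if r["region"] == region)


def top_k_sum(rows, region, k):
    """What a model could compute if retrieval only surfaced k matching rows."""
    retrieved = [r for r in rows if r["region"] == region][:k]
    return sum(r["amount"] for r in retrieved)


def route(question):
    """Send aggregation-style questions to the scan engine, the rest to RAG."""
    words = set(question.lower().replace("?", "").split())
    return "aggregation_engine" if words & AGG_WORDS else "rag"


print(full_scan_sum(rows, "EU"), top_k_sum(rows, "EU", 20))
print(route("What is the total amount for EU?"), route("Summarize the refund policy"))
```

A larger context window raises k, but the answer is only correct once every matching row is seen, which is what a scan engine guarantees by construction.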

DeepSeek’s mHC revamps residual links to scale future AI models · 1 MIN

DeepSeek’s new paper introduces Manifold-Constrained Hyper-Connections (mHC), an extension of residual links that restores identity mapping while scaling to larger models. The approach patches the instability that plagued earlier hyper‑connection variants, promising faster training and better performance for next‑gen foundation models.

How to Build Scalable Semantic Search with Embeddings and Vector DBs · 11 MIN

The guide walks through choosing embedding models, similarity metrics, ANN algorithms, and vector database options to deploy production‑grade semantic search for knowledge bases, product catalogs, or RAG pipelines. It shows concrete code snippets and performance trade‑offs, helping engineers cut latency and cost while delivering meaning‑based results at million‑document scale.
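As a baseline for the pieces the guide covers, here is a minimal exhaustive cosine-similarity search in plain Python. The three-dimensional vectors are hand-written stand-ins for real embeddings, and `cosine` and `search` are names invented for this sketch; a production system would get vectors from an embedding model and swap the full scan for an ANN index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, corpus, k=2):
    """Score every document against the query and return the k best ids."""
    scored = sorted(
        ((cosine(query_vec, vec), doc_id) for doc_id, vec in corpus.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]


# Toy "embeddings", invented for illustration.
corpus = {
    "refund": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "returns": [0.8, 0.2, 0.1],
}
hits = search([1.0, 0.0, 0.0], corpus, k=2)
print(hits)
```

The full scan costs one similarity computation per document, which is why ANN algorithms become the main latency lever at million-document scale.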

Docling lets you parse PDFs with rich tables locally, no cloud needed · 10 MIN

Docling, IBM Research’s open‑source parser, extracts table cells, OCR text and captions from PDFs entirely on‑premises. Because it runs locally after a one‑time model download, no API keys or per‑page fees are required and documents never leave the network, which is crucial for regulated sectors. The JSON tables can be plugged straight into a RAG pipeline.

Build a Deep Research Agent from Scratch with DeepSeek R1 · 19 MIN

The SwirlAI guide walks you through building a Deep Research Agent powered by the open‑source DeepSeek R1 model, handling outline planning, web‑search‑augmented reasoning, and iterative reflection without any orchestration framework. It gives a hands‑on notebook so data engineers can prototype end‑to‑end research pipelines today.

2026’s First‑Half LLM Papers: Trends, Tools, and What to Read Next · 5 MIN

Sebastian Raschka’s curated Markdown list covers the most impactful LLM papers from January to May 2026, spotlighting hybrid architectures, state‑space layers, agent tool use, and long‑context methods. The collection saves researchers hours of hunting and shows where the field’s practical focus is shifting.

Pinterest Boosts Search Relevance with LLM Teacher‑Student Distillation · 8 MIN

Pinterest replaced its legacy relevance model with a cross‑encoder LLM that scores pins on a five‑point relevance scale, then distilled that knowledge into a lightweight student model for real‑time inference. The new pipeline lifted click‑through rates by double‑digit percentages in live A/B tests, proving LLMs can power large‑scale search without sacrificing latency.
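One common way to set up this kind of distillation is to train the student against the teacher's probability distribution over the five relevance levels using a soft-label cross-entropy. The summary does not say which loss Pinterest used, so treat the sketch below as an assumption; the function names and the numbers are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_probs, student_logits):
    """Cross-entropy of the student's prediction against the teacher's soft labels."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Hypothetical teacher output for one (query, pin) pair over relevance levels 1..5.
teacher = [0.02, 0.05, 0.13, 0.50, 0.30]

aligned_student = [math.log(p) for p in teacher]  # reproduces the teacher exactly
confused_student = [2.0, 0.0, 0.0, 0.0, 0.0]      # puts most mass on "irrelevant"

loss_aligned = distillation_loss(teacher, aligned_student)
loss_confused = distillation_loss(teacher, confused_student)
print(loss_aligned, loss_confused)
```

The loss is lowest when the student matches the teacher's distribution, so minimizing it over many scored pairs pushes the lightweight model toward the LLM's judgments.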

Databases & Storage

Redis adds native vector sets, turning the store into a vector-search engine · 6 MIN

Redis 8 now ships vector sets, a first‑class data type that stores embeddings and enables fast similarity search inside the database. VADD inserts items with high‑dimensional vectors, while VSIM returns nearest neighbors, letting you run semantic search, recommendation or face‑recognition workloads without an external vector DB.
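For a sense of what the two commands look like on the wire, the sketch below builds their argument lists using the `VALUES <count> <v1> ... <vN>` form. The helper names, key, and vectors are hypothetical, and nothing here talks to a server, so check the argument order against the Redis documentation for your version before relying on it.

```python
def vadd_args(key, element, vector):
    """Arguments for: VADD key VALUES n v1 ... vn element"""
    return ["VADD", key, "VALUES", len(vector), *vector, element]


def vsim_args(key, vector, count=10):
    """Arguments for: VSIM key VALUES n v1 ... vn WITHSCORES COUNT k"""
    return ["VSIM", key, "VALUES", len(vector), *vector, "WITHSCORES", "COUNT", count]


# Toy two-dimensional vectors, invented for illustration.
add_cmd = vadd_args("docs", "doc:1", [0.1, 0.2])
sim_cmd = vsim_args("docs", [0.1, 0.25], count=3)
print(add_cmd)
print(sim_cmd)
```

With a client such as redis-py, a list like this can be sent with `client.execute_command(*add_cmd)`, assuming the server supports vector sets.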