Kafka stalls, Parquet speeds up, CDP hits 30ms

Data · 2026-06-29

Data Engineering

record_limit mode can stall Kafka share groups with too few consumers · 14 MIN

Running share.acquire.mode=record_limit with fewer consumers than partitions and any partition skew triggers pathological fetch waits. The consumer fetches from brokers round‑robin, which throttles throughput and turns lag drains into hour‑long stalls. This can cripple Kafka pipelines that rely on fast, predictable consumption.

Hardwood 1.0 offers a fast, dependency‑free Parquet reader for Java 21+ · 11 MIN

Hardwood 1.0 is a production‑ready, JVM‑native Parquet reader that drops every mandatory dependency and parallelises page decoding across all CPU cores. It supports the full Parquet type set, projection, and predicate push‑down, giving Java 21+ apps a lean, high‑throughput way to ingest columnar data.

iChibi Lake turns DuckDB into a Dockerized data lake with Kafka and GraphQL · 20 MIN

iChibi Lake wraps DuckDB in a Dockerized gateway that adds REST, GraphQL, Kafka ingestion, and PostgreSQL‑backed metadata. It lets you treat a DuckDB instance as a modern data lake, persisting Parquet files while supporting schema evolution and graph queries, so data teams can build AI pipelines without a massive infrastructure overhaul.

Razorpay’s CDP delivers sub‑30 ms segments for 500 M+ users · 11 MIN

Razorpay built an in‑house Customer Data Platform that unifies transaction data from over 500 million user profiles and serves real‑time audience segments in under 30 ms. The system stitches disparate payment events with Airflow‑driven Spark pipelines, eliminating days‑long data requests for merchants and enabling instant, targeted campaigns.

Pinterest Automates Schema Evolution Across Kafka, Flink, Spark, Iceberg · 11 MIN

Pinterest’s CDC ingestion platform now auto‑propagates schema changes through Kafka, Flink, Spark and Iceberg, turning schema into a contract rather than metadata. The system generates code and metadata artifacts, handles both push‑ and pull‑based migrations, and audits updates via PRs, cutting manual drift and keeping online/offline data consistent.

ML & AI for Data

Dropbox uses DSPy to turn AI evals into sharper, cheaper Chat answers · 10 MIN

Dropbox leveraged the open‑source DSPy framework to turn LLM‑as‑judge evaluations into a feedback loop that fine‑tunes both its judges and the Dash chat system prompt. By calibrating judges with human labels and running DSPy’s optimization algorithms, the chatbot cut incomplete answers and token usage while keeping answer quality high.

Agent‑Optimized Docs Outperform Stale Skills, Raising CLI Success to 87% · 7 MIN

Running 250 controlled evaluations, Wix found that AI‑optimized documentation lifted CLI task completion from 67% to 87% and slashed token usage by 35%. When skills were outdated or misaligned, the docs outperformed skill‑only runs, showing that well‑tuned docs can be a more reliable agent aid than fragile handcrafted skills.

Control response‑time tails to make AI agents reliably on‑time · 1 MIN

Reliable AI agents need consistent answer latency, not just faster responses. The article shows that controlling the tail of the response‑time distribution, using counterintuitive engineering tricks, lets APIs meet strict on‑time guarantees, which is critical for customer‑facing services.
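
To make the "tail" idea concrete, here is a minimal Python sketch that compares the mean of a latency sample with its 50th and 99th percentiles. The sample values, the 100 ms deadline, and the helper name percentile are illustrative assumptions, not figures or techniques from the article.

```python
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, (len(ordered) * p + 99) // 100)  # integer ceil(n * p / 100)
    return ordered[rank - 1]


# Hypothetical sample: 98 fast responses and 2 slow outliers, in milliseconds.
latencies_ms = [20] * 98 + [900, 1500]
deadline_ms = 100  # assumed on-time budget for this illustration

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
on_time_share = sum(1 for x in latencies_ms if x <= deadline_ms) / len(latencies_ms)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms} ms  p99={p99_ms} ms  on-time={on_time_share:.0%}")
```

In this made-up sample the median is far below the deadline, yet the 99th percentile is far above it, which is why an on‑time guarantee is usually stated against a high percentile rather than the average.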

Databases & Storage

SmithDB speeds up inverted‑index construction 2.2× with string interning · 10 MIN

SmithDB builds its object‑storage‑backed inverted index by parsing JSON with a flat tape, tokenizing values, then interning strings to integer IDs. Interning slashes string‑comparison work, delivering a 2.2× speedup in index construction, while streaming compaction keeps memory use flat regardless of index size.
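
String interning in general can be sketched in a few lines of Python. This is not SmithDB's implementation: the Interner class, the build_index helper, and the whitespace tokenizer are hypothetical stand‑ins chosen only to show each distinct token being mapped to a small integer once, with posting lists keyed by that integer.

```python
class Interner:
    """Maps each distinct string to a stable integer ID (hypothetical helper)."""

    def __init__(self):
        self._ids = {}       # string -> integer ID
        self._strings = []   # integer ID -> string

    def intern(self, s):
        existing = self._ids.get(s)
        if existing is not None:
            return existing
        new_id = len(self._strings)
        self._ids[s] = new_id
        self._strings.append(s)
        return new_id

    def lookup(self, token_id):
        return self._strings[token_id]


def build_index(docs):
    """Builds a toy inverted index: token ID -> set of document IDs."""
    interner = Interner()
    postings = {}
    for doc_id, text in enumerate(docs):
        # Assumption: naive lowercase whitespace tokenization, for illustration only.
        for token in text.lower().split():
            token_id = interner.intern(token)
            postings.setdefault(token_id, set()).add(doc_id)
    return interner, postings


docs = ["error disk full", "disk replaced", "error resolved"]
interner, postings = build_index(docs)
error_id = interner.intern("error")
print(error_id, sorted(postings[error_id]))
```

After the first lookup, every later stage (sorting, merging, comparing terms) can work on integers instead of strings, which is the kind of saving the blurb attributes to interning.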

Practice & Datasets

Q2 2026 Common Crawl web graphs enable massive link analysis out‑of‑the‑box · 2 MIN

Common Crawl just dropped its Q2 2026 host‑ and domain‑level web graphs. The host graph packs 247 M nodes and 6.3 B edges; the domain graph contains 121 M nodes and 3.9 B edges. Researchers can now run massive link‑analysis jobs without building a crawler from scratch.