Kafka stalls, Parquet speeds up, CDP hits 30 ms
Setting share.acquire.mode=record_limit when there are fewer consumers than partitions, combined with any skew across partitions, triggers pathological fetch waits. The consumer fetches from brokers round‑robin, which throttles throughput and turns lag drains into hour‑long stalls. This can cripple Kafka pipelines that rely on fast, predictable consumption.
Hardwood 1.0 is a production‑ready, JVM‑native Parquet reader that drops every mandatory dependency and parallelises page decoding across all CPU cores. It supports the full Parquet type set, projection, and predicate push‑down, giving Java 21+ apps a lean, high‑throughput way to ingest columnar data.
iChibi Lake wraps DuckDB in a Dockerized gateway that adds REST, GraphQL, Kafka ingestion, and PostgreSQL‑backed metadata. It lets you treat a DuckDB instance as a modern data lake, persisting Parquet files while supporting schema evolution and graph queries, so data teams can build AI pipelines without a massive infrastructure overhaul.
Razorpay built an in‑house Customer Data Platform that unifies transaction data from over 500 million user profiles and serves real‑time audience segments in under 30 ms. The system stitches disparate payment events with Airflow‑driven Spark pipelines, eliminating days‑long data requests for merchants and enabling instant, targeted campaigns.
Pinterest’s CDC ingestion platform now auto‑propagates schema changes through Kafka, Flink, Spark and Iceberg, turning schema into a contract rather than metadata. The system generates code and metadata artifacts, handles both push‑ and pull‑based migrations, and audits updates via PRs, cutting manual drift and keeping online/offline data consistent.
Dropbox leveraged the open‑source DSPy framework to turn LLM‑as‑judge evaluations into a feedback loop that fine‑tunes both its judges and the Dash chat system prompt. By calibrating judges with human labels and running DSPy’s optimization algorithms, the chatbot cut incomplete answers and token usage while keeping answer quality high.
Running 250 controlled evaluations, Wix found that AI‑optimized documentation lifted CLI task completion from 67 % to 87 % and slashed token usage by 35 %. When skills were outdated or misaligned, the docs outperformed skill‑only runs, showing that well‑tuned docs can be a more reliable agent aid than fragile handcrafted skills.
Reliable AI agents need consistent answer latency, not just faster responses. The article shows that controlling the tail of the response‑time distribution, using counterintuitive engineering tricks, lets APIs meet strict on‑time guarantees, which is critical for customer‑facing services.
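To see why the tail matters more than the average, here is a minimal sketch that compares the mean, the median, and the 99th percentile of a latency sample. The numbers and the 500 ms deadline are invented for illustration and do not come from the article; `percentile` and `on_time_rate` are hypothetical helper names.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Ceiling of p * n / 100 using integer arithmetic only.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def on_time_rate(samples, deadline_ms):
    """Fraction of requests that finished within the deadline."""
    return sum(1 for s in samples if s <= deadline_ms) / len(samples)


# Made-up sample: 98 fast responses and 2 very slow ones (milliseconds).
latencies_ms = [100] * 98 + [4000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
rate = on_time_rate(latencies_ms, deadline_ms=500)
```

In this sample the mean and median both look healthy, yet the 99th percentile sits at the slow outliers, so an on‑time guarantee has to be stated against the tail rather than the average.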
SmithDB builds its object‑storage backed inverted index by parsing JSON with a flat‑tape, tokenizing values, then interning strings to integer IDs. Interning slashes string‑comparison work, delivering a 2.2× speedup in index construction, while streaming compaction keeps memory use flat regardless of index size.
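The interning step can be illustrated with a small sketch. This is not SmithDB's code: it assumes documents arrive as already‑parsed Python dicts (the flat‑tape JSON parser and streaming compaction are out of scope), and `Interner`, `tokenize`, and `build_index` are hypothetical names chosen for the example.

```python
import re
from collections import defaultdict


class Interner:
    """Maps each distinct string to a small integer ID, and back."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def intern(self, s):
        token_id = self._ids.get(s)
        if token_id is None:
            token_id = len(self._strings)
            self._ids[s] = token_id
            self._strings.append(s)
        return token_id

    def lookup(self, token_id):
        return self._strings[token_id]


def tokenize(value):
    """Lowercase the value and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", value.lower())


def build_index(docs):
    """Build token_id -> sorted list of doc IDs from a list of dicts."""
    interner = Interner()
    postings = defaultdict(list)
    for doc_id, fields in enumerate(docs):
        seen = set()
        for value in fields.values():
            for token in tokenize(str(value)):
                # After this point the index only handles integers.
                seen.add(interner.intern(token))
        for token_id in sorted(seen):
            postings[token_id].append(doc_id)
    return interner, dict(postings)


docs = [
    {"msg": "disk full"},
    {"msg": "disk ok", "host": "db1"},
]
interner, index = build_index(docs)
disk_id = interner.intern("disk")
```

Once tokens are integers, deduplication, sorting, and posting‑list lookups compare machine words instead of byte strings, which is the kind of saving the blurb attributes to interning.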
Common Crawl just dropped its Q2 2026 host‑ and domain‑level web graphs. The host graph packs 247 M nodes and 6.3 B edges; the domain graph contains 121 M nodes and 3.9 B edges. Researchers can now run massive link‑analysis jobs without building a crawler from scratch.