Cheap inference crushed satisfaction; logistic beats XGBoost

Data · 2026-06-28

Data Engineering

Self‑Hosted dbt Cloud Clone Offers 80% of Cloud Features In‑House · 6 MIN

A developer built a self‑hosted dbt Cloud clone using React, FastAPI, dbt Core, and Prefect, delivering about 80% of cloud features while keeping data in‑house. The stack recreates the web IDE, job orchestration, run history, and environment management, offering a low‑cost alternative for teams that need control over their pipelines.

ML & AI for Data

Build a Free, Private Coding Agent with Open‑Weight LLMs · 31 MIN

Sebastian Raschka shows how to build a fully local coding agent using open-weight LLMs and a custom harness, sidestepping costly subscriptions to Claude Code or OpenAI Codex. The tutorial outlines hardware, model choices, and a production-ready workflow, proving that privacy‑first, cost‑predictable AI coding is now practical for developers.

Cheap Model Routing Halved Inference Costs, but Crushed User Satisfaction · 12 MIN

A team built a cheap‑model routing layer that cut its AI inference bill by more than 50% in a quarter. Within three months, the routing classifier was misrouting complex queries, which degraded response quality and drove churn, exposing a Pareto trap in cost‑first AI pipelines.
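
To make the trade-off concrete, here is a minimal sketch of a threshold-based router evaluated offline. It is not the team's implementation: the `Query` fields, the relative costs, the sample queries, and the thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Query:
    text: str
    complexity: float    # hypothetical classifier score in [0, 1]; higher = harder
    truly_complex: bool  # ground-truth label, known only in an offline evaluation


CHEAP_COST = 1.0    # assumed relative cost per call to the cheap model
STRONG_COST = 10.0  # assumed relative cost per call to the strong model


def route(query, threshold):
    """Send the query to the strong model when its score reaches the threshold."""
    return "strong" if query.complexity >= threshold else "cheap"


def evaluate(queries, threshold):
    """Return (total cost, number of complex queries sent to the cheap model)."""
    cost = 0.0
    misrouted = 0
    for q in queries:
        target = route(q, threshold)
        cost += STRONG_COST if target == "strong" else CHEAP_COST
        if target == "cheap" and q.truly_complex:
            misrouted += 1
    return cost, misrouted


# Made-up sample data for the sketch.
SAMPLE = [
    Query("reset my password", 0.10, False),
    Query("summarize this paragraph", 0.30, False),
    Query("debug this race condition", 0.55, True),
    Query("draft a migration plan", 0.80, True),
]

for threshold in (0.5, 0.7):
    print(threshold, evaluate(SAMPLE, threshold))
```

Raising the threshold lowers the bill, but it also pushes borderline complex queries to the cheap model. Tracking the misrouted count alongside cost, rather than cost alone, is one way to see that trade-off before users do.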

Logistic Regression Beats XGBoost on World Cup Match Predictions, Simpler Wins · 9 MIN

On 358 historic international matches, a plain logistic regression achieved the lowest log‑loss, while XGBoost performed worse than a uniform guess. The experiment shows that with limited features, the bias‑variance trade‑off favors the smallest model, warning practitioners against defaulting to heavyweight boosters.
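
For readers wondering how a model can score worse than a uniform guess, the sketch below computes log‑loss by hand. It assumes three outcomes per match (win, draw, loss), which the summary does not state, and the probabilities are invented for illustration, not taken from the experiment.

```python
import math


def log_loss(probs, outcomes):
    """Mean negative log-probability assigned to the outcome that occurred."""
    return -sum(math.log(p[y]) for p, y in zip(probs, outcomes)) / len(outcomes)


def uniform_baseline(n_classes):
    """Log-loss of always predicting 1/n_classes for every class."""
    return math.log(n_classes)


# Hypothetical outcomes for four matches: 0 = win, 1 = draw, 2 = loss.
outcomes = [0, 1, 2, 0]

# A uniform guess assigns 1/3 to every outcome.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 4

# An overconfident model that always puts 0.9 on "win".
overconfident = [[0.9, 0.05, 0.05]] * 4

print(log_loss(uniform, outcomes))        # equals ln(3), about 1.0986
print(log_loss(overconfident, outcomes))  # larger than ln(3)
```

Log‑loss punishes confident wrong answers heavily, so a model that overfits a small dataset and produces extreme probabilities can land above the ln(3) baseline even when it picks the right outcome fairly often.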

Fine‑tune LLMs on a Mac for free using Apple’s MLX library · 10 MIN

Apple Silicon now lets you fine‑tune open‑source language models on a Mac without any cloud GPU fees. The MLX library exploits the unified memory architecture, so a 16 GB M‑series laptop can train LoRA adapters locally, keeping data private and costs at zero.
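
A rough sense of why LoRA fits on a laptop comes from counting trainable parameters. The sketch below is plain arithmetic and does not use the MLX API; the layer size (4096 x 4096) and the rank (8) are assumed values for illustration.

```python
def full_params(d_in, d_out):
    """Weights in a dense layer's matrix (bias ignored)."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """A LoRA adapter trains two small matrices: (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


# Assumed dimensions for a single projection layer.
d_in, d_out, rank = 4096, 4096, 8

full = full_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)

print(full)         # 16777216
print(lora)         # 65536
print(lora / full)  # 0.00390625
```

Under these assumptions the adapter trains well under 1% of the layer's weights, which shrinks the gradient and optimizer state that has to fit in memory. The frozen base weights still need to be loaded.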

Practice & Datasets

Parquet‑formatted arXiv LaTeX source dataset cuts S3 egress costs for large‑scale research · 1 MIN

A new Hugging Face dataset aligns 3M+ arXiv LaTeX source files with their metadata in columnar Parquet files stored on the Hub. Because the data is hosted on the Hub, researchers can run large text‑mining jobs without paying S3 download fees, making analysis faster and cheaper.