Long-form, technical, and specific to systems I actually operate. Newest first.
What running agents in a real business taught me about failure.
The cheapest place to catch a bad model decision is before it ships.
Integrating a model is a weekend. Integrating it into the system of record is a year.
Routing between models is mostly a question of when not to.
Retrieval quality is a data-modelling problem wearing a machine-learning costume.
You cannot debug what you did not record. Observability for LLM systems.