Shipping LLM features without drowning in regression tests
We wired a summarisation helper into our support console. The tricky part was not the prompt — it was knowing when the model silently changed behaviour between releases.
How are you pinning behaviour: golden sets, eval harnesses, or something else?
15 replies
We snapshot fifty anonymised tickets and diff model outputs on every deploy; anything that moves more than a threshold blocks the release.
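A minimal sketch of that kind of gate, assuming a crude token-overlap metric and hypothetical names (the real threshold and similarity measure would be tuned per workload):

```python
# Sketch of a release gate that diffs candidate model outputs against a
# golden set of anonymised tickets. Metric and thresholds are illustrative.

def token_overlap(a: str, b: str) -> float:
    """Crude similarity: Jaccard overlap on lowercased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def release_gate(golden: dict[str, str], candidate: dict[str, str],
                 similarity_floor: float = 0.6, max_drifted: int = 5) -> bool:
    """Return True if the deploy may proceed.

    golden/candidate map ticket id -> summary. A ticket whose new output
    falls below the similarity floor counts as drifted; too many drifted
    tickets block the release.
    """
    drifted = [tid for tid, gold in golden.items()
               if token_overlap(gold, candidate.get(tid, "")) < similarity_floor]
    return len(drifted) <= max_drifted
```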
Golden prompts alone were not enough for us — we added a tiny human spot-check queue for edge cases the metrics miss.
Our biggest win was versioning prompts like code and tagging them to the exact model revision in telemetry.
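For anyone curious what that tagging looks like in practice, a sketch (constants and field names hypothetical, not this poster's actual schema):

```python
# Sketch of tagging every model call with the prompt version and the exact
# model revision, so telemetry can be sliced by either. Names illustrative.
import hashlib

PROMPT_TEMPLATE = "Summarise this support ticket:\n{ticket}"
PROMPT_VERSION = "summarise-v3"          # bumped like a code release
MODEL_REVISION = "vendor-model-2024-06"  # exact revision the vendor reports

def telemetry_record(ticket_id: str) -> dict:
    """Build the telemetry line emitted alongside each model call."""
    return {
        "ticket_id": ticket_id,
        "prompt_version": PROMPT_VERSION,
        "prompt_hash": hashlib.sha256(PROMPT_TEMPLATE.encode()).hexdigest()[:12],
        "model_revision": MODEL_REVISION,
    }
```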
We treat nondeterminism as a product bug: users hate 'sometimes it works' more than 'it does not work yet'.
Latency budgets matter as much as accuracy — we cut scope until p95 stayed under two seconds on cold starts.
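A quick sketch of enforcing that budget with a nearest-rank p95 over recorded timings (the two-second figure is from the comment above; everything else is illustrative):

```python
# Sketch of a p95 latency budget check over sampled request timings.
import math

def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of latency samples in milliseconds."""
    ordered = sorted(samples_ms)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)  # nearest rank, 0-indexed
    return ordered[rank]

def within_budget(samples_ms: list[float], budget_ms: float = 2000.0) -> bool:
    """True if the tail stays under the agreed budget."""
    return p95(samples_ms) <= budget_ms
```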
I keep a spreadsheet of failure modes we have seen in production and map each to a regression case the CI runs nightly.
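Something like this is enough to turn that spreadsheet into nightly checks; `summarise` here is a stand-in for the real model call and the cases are made up:

```python
# Sketch of mapping logged failure modes to regression cases for nightly CI.
# Each case pairs a ticket with a phrase its summary must contain.

FAILURE_MODES = [
    ("Customer asks for refund, mentions GDPR", "refund"),
    ("Login loop after password reset email", "password"),
]

def summarise(ticket: str) -> str:
    """Stand-in for the production model call."""
    return f"summary of: {ticket[:40]}"

def run_regressions(cases=FAILURE_MODES) -> list[str]:
    """Return the required phrases whose case regressed tonight."""
    return [required for ticket, required in cases
            if required not in summarise(ticket).lower()]
```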
Shadow mode for two weeks saved us — we logged proposed answers without showing them and measured disagreement with humans.
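The disagreement measurement can be as simple as this sketch; exact-match comparison is deliberately crude, and a real setup would score overlap instead:

```python
# Sketch of shadow mode: the model's proposed answer is logged but never
# shown, then compared with what the human agent actually sent.

def disagreement_rate(shadow_log: list[tuple[str, str]]) -> float:
    """shadow_log holds (model_proposal, human_answer) pairs."""
    if not shadow_log:
        return 0.0
    disagreements = sum(1 for model, human in shadow_log
                        if model.strip().lower() != human.strip().lower())
    return disagreements / len(shadow_log)
```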
We stopped chasing perfect ROUGE scores and instead scored against business outcomes the team actually cares about.
For internal tools, a lightweight 'thumbs down with reason' on every answer gave us better signal than offline evals.
Contract tests against the vendor API shape caught two breaking JSON changes before they hit staging.
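A sketch of what such a contract test can check, assuming a response body shaped like `{"id": ..., "choices": [{"text": ...}]}` (field names hypothetical, not any specific vendor's API):

```python
# Sketch of a contract test on the vendor response shape. It returns the
# list of violations so CI can print all of them at once.
import json

REQUIRED_SHAPE = {"id": str, "choices": list}

def check_contract(raw_body: str) -> list[str]:
    """Return a list of contract violations; empty means the shape held."""
    body = json.loads(raw_body)
    problems = [f"missing or mistyped field: {field}"
                for field, typ in REQUIRED_SHAPE.items()
                if not isinstance(body.get(field), typ)]
    if not problems and not all(isinstance(c.get("text"), str)
                                for c in body["choices"]):
        problems.append("choices[].text missing or not a string")
    return problems
```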
We document the exact temperature and max tokens per workflow so support engineers know what they are looking at.
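Keeping those settings in one place next to the code makes the documentation executable; a tiny sketch with made-up workflows and values:

```python
# Sketch of per-workflow generation settings kept in version control so
# support engineers can see exactly what produced an answer. Values illustrative.
WORKFLOW_SETTINGS = {
    "ticket_summary": {"temperature": 0.2, "max_tokens": 256},
    "draft_reply":    {"temperature": 0.7, "max_tokens": 512},
}

def settings_for(workflow: str) -> dict:
    """Fail loudly (KeyError) if a workflow has undocumented settings."""
    return WORKFLOW_SETTINGS[workflow]
```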
The compliance team wanted traceability — we store prompt hash, model id, and retrieval chunks for every answer shown.
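The record itself can be small; a sketch with illustrative field names, hashing the prompt and answer rather than storing them verbatim:

```python
# Sketch of the traceability record stored for every answer shown,
# per the compliance requirement above. Field names are illustrative.
import datetime
import hashlib

def audit_record(prompt: str, model_id: str, chunks: list[str], answer: str) -> dict:
    """Everything needed to reconstruct why this answer was shown."""
    return {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
        "model_id": model_id,
        "retrieval_chunks": chunks,  # the exact chunks fed to the model
        "answer_hash": hashlib.sha256(answer.encode()).hexdigest(),
    }
```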
Fine-tuning was not worth it until we had at least ten thousand labelled pairs; before that, RAG plus tight instructions won.
Biggest lesson: invest in eval data the same way you invest in training data — garbage in, garbage confidence out.
We rotate one engineer per sprint as 'model shepherd' who reads logs and updates the eval set — ownership fixed the drift problem.