Shipping LLM features without drowning in regression tests

Riley Brown ⭐16 · Mar 6, 2026 23:44
We wired a summarisation helper into our support console. The tricky part was not the prompt — it was knowing when the model silently changed behaviour between releases. How are you pinning behaviour: golden sets, eval harnesses, or something else?
15 replies
Jamie Nguyen ⭐22 · Mar 7, 2026 01:44
We snapshot fifty anonymised tickets and diff model outputs on every deploy; anything that moves more than a threshold blocks the release.
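A minimal sketch of what that diff gate could look like in CI, in Python. The golden-file format, the 0.85 threshold, and the summarise stub are all placeholders, not a real setup:

```python
# Hypothetical release gate: regenerate answers for a pinned golden set
# and fail the build if any output drifts past a similarity threshold.
import json
import sys
from difflib import SequenceMatcher

GOLDEN_PATH = "golden_tickets.json"  # assumed format: [{"ticket": ..., "expected": ...}]
DRIFT_THRESHOLD = 0.85               # minimum similarity before a case counts as drifted

def summarise(ticket_text: str) -> str:
    # stand-in for the real model call; swap in your client here
    return ticket_text[:200]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def main() -> None:
    with open(GOLDEN_PATH) as f:
        golden = json.load(f)
    drifted = []
    for case in golden:
        score = similarity(summarise(case["ticket"]), case["expected"])
        if score < DRIFT_THRESHOLD:
            drifted.append((case["ticket"][:60], score))
    for ticket, score in drifted:
        print(f"DRIFT {score:.2f}: {ticket}")
    if drifted:
        sys.exit(1)  # a non-zero exit is what actually blocks the deploy in CI

if __name__ == "__main__":
    main()
```

The exit code is the whole trick: CI treats it as a failed step, and the release stops there.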
Jamie Lopez ⭐30 · Mar 7, 2026 05:44
Golden prompts alone were not enough for us — we added a tiny human spot-check queue for edge cases the metrics miss.
Cameron Walker ⭐49 · Mar 7, 2026 09:44
Our biggest win was versioning prompts like code and tagging them to the exact model revision in telemetry.
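A sketch of that tagging idea, with made-up names (PROMPTS, log_event) standing in for a real prompt store and telemetry client:

```python
# Tag every model call with the prompt version and model revision so
# telemetry can be sliced by the exact (prompt, model) pair.
import hashlib

PROMPTS = {
    # version -> template, checked into the repo and reviewed like code
    "summarise-v3": "Summarise the ticket below in two sentences:\n{ticket}",
}

def prompt_hash(template: str) -> str:
    return hashlib.sha256(template.encode()).hexdigest()[:12]

def log_event(event: dict) -> None:
    print(event)  # stand-in for the real telemetry client

def call_model(prompt_version: str, ticket: str, model_id: str) -> str:
    template = PROMPTS[prompt_version]
    log_event({
        "prompt_version": prompt_version,
        "prompt_hash": prompt_hash(template),
        "model_id": model_id,  # e.g. the vendor's exact revision string
    })
    return "..."  # stand-in for the actual completion

call_model("summarise-v3", "Customer cannot reset password.", "model-2026-03-01")
```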
Drew Tran ⭐190 · Mar 7, 2026 13:44
We treat nondeterminism as a product bug: users hate 'sometimes it works' more than 'it does not work yet'.
Parker Le ⭐14 · Mar 7, 2026 17:44
Latency budgets matter as much as accuracy — we cut scope until p95 stayed under two seconds on cold starts.
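For reference, a tiny nearest-rank p95 check you could drop into a load-test script; the 2.0s budget mirrors the number above and the samples are illustrative:

```python
import math

def p95(latencies: list[float]) -> float:
    # nearest-rank method: smallest value with at least 95% of samples at or below it
    ordered = sorted(latencies)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

samples = [0.8, 1.1, 0.9, 1.4, 1.0, 1.3, 0.7, 1.9, 1.2, 1.0]  # seconds, illustrative
budget = 2.0
assert p95(samples) <= budget, f"p95 {p95(samples):.2f}s blew the {budget}s budget"
```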
Parker Ahmed ⭐175 · Mar 7, 2026 21:44
I keep a spreadsheet of failure modes we have seen in production and map each to a regression case the CI runs nightly.
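One way to wire that spreadsheet into CI is a parametrised pytest file, one row per failure mode. FAILURE_MODES and the summarise stub below are invented examples, not a real dataset:

```python
# Each known production failure mode becomes a named regression case
# that a nightly CI job runs against the current model.
import pytest

FAILURE_MODES = [
    # (case id, input ticket, substring the summary must contain)
    ("refund-currency", "Refund of \u20ac40 was posted as $40.", "\u20ac40"),
    ("empty-ticket", "", "no content"),
]

def summarise(ticket: str) -> str:
    return ticket or "no content"  # stand-in for the real model call

@pytest.mark.parametrize("case_id,ticket,must_contain", FAILURE_MODES)
def test_known_failure_mode(case_id, ticket, must_contain):
    assert must_contain in summarise(ticket), f"regression: {case_id}"
```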
Emerson Carter ⭐48 · Mar 8, 2026 01:44
Shadow mode for two weeks saved us — we logged proposed answers without showing them and measured disagreement with humans.
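A bare-bones version of that shadow loop might look like this; the similarity threshold and the in-memory log are assumptions, not a production design:

```python
# Shadow mode: the model's draft is logged but never shown to the user,
# and disagreement with the human reply is measured afterwards.
from difflib import SequenceMatcher

shadow_log: list[dict] = []

def propose_answer(ticket: str) -> str:
    return "Have you tried resetting your password?"  # stand-in model call

def handle_ticket(ticket: str, human_reply: str) -> str:
    draft = propose_answer(ticket)  # model output, hidden from the user
    shadow_log.append({"ticket": ticket, "draft": draft, "human": human_reply})
    return human_reply              # only the human answer ships

def disagreement_rate(threshold: float = 0.6) -> float:
    disagree = sum(
        SequenceMatcher(None, e["draft"], e["human"]).ratio() < threshold
        for e in shadow_log
    )
    return disagree / len(shadow_log) if shadow_log else 0.0
```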
Finley Tan ⭐191 · Mar 8, 2026 05:44
We stopped chasing perfect ROUGE scores and instead scored against business outcomes the team actually cares about.
Quinn Carter ⭐143 · Mar 8, 2026 09:44
For internal tools, a lightweight 'thumbs down with reason' on every answer gave us better signal than offline evals.
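The signal itself can be as small as a flat event record; the field names here are made up rather than taken from any framework:

```python
# 'Thumbs down with reason' as a minimal feedback event.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AnswerFeedback:
    answer_id: str
    thumbs_up: bool
    reason: str | None  # required in the UI when thumbs_up is False

def record_feedback(fb: AnswerFeedback) -> dict:
    event = asdict(fb) | {"at": datetime.now(timezone.utc).isoformat()}
    return event  # hand off to whatever event pipeline you already have

record_feedback(AnswerFeedback("a-123", False, "missed the refund amount"))
```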
Quinn Walker ⭐171 · Mar 8, 2026 13:44
Contract tests against the vendor API shape caught two breaking JSON changes before they hit staging.
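A contract test along those lines, sketched with pydantic; the response schema below is a generic assumption about a completion payload, not any specific vendor's shape:

```python
# Pin the response shape you depend on and fail fast when it changes.
from pydantic import BaseModel

class Choice(BaseModel):
    index: int
    text: str

class CompletionResponse(BaseModel):
    id: str
    model: str
    choices: list[Choice]

def fetch_sample_response() -> dict:
    # in practice: a recorded fixture, or a live call against staging
    return {"id": "r1", "model": "m1", "choices": [{"index": 0, "text": "ok"}]}

def test_vendor_response_shape():
    payload = fetch_sample_response()
    CompletionResponse.model_validate(payload)  # raises on breaking changes
```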
Drew Khan ⭐138 · Mar 8, 2026 17:44
We document the exact temperature and max tokens per workflow so support engineers know what they are looking at.
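One low-tech way to keep those settings discoverable is a single checked-in table the console reads from; the values here are illustrative:

```python
# Per-workflow generation settings in one reviewable place.
from dataclasses import dataclass

@dataclass(frozen=True)
class GenConfig:
    temperature: float
    max_tokens: int

WORKFLOWS = {
    "ticket-summary": GenConfig(temperature=0.2, max_tokens=256),
    "reply-draft":    GenConfig(temperature=0.7, max_tokens=512),
}

cfg = WORKFLOWS["ticket-summary"]
print(f"ticket-summary runs at T={cfg.temperature}, max_tokens={cfg.max_tokens}")
```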
Hayden Le ⭐116 · Mar 8, 2026 21:44
The compliance team wanted traceability — we store prompt hash, model id, and retrieval chunks for every answer shown.
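The trace record that requirement implies is small; this schema is an assumption sized to exactly the three fields listed above:

```python
# Per-answer trace: enough to reconstruct what the user was shown and why.
import hashlib
from dataclasses import dataclass, field

@dataclass
class AnswerTrace:
    answer_id: str
    model_id: str
    prompt_hash: str
    retrieval_chunk_ids: list[str] = field(default_factory=list)

def trace_for(answer_id: str, model_id: str, prompt: str, chunks: list[str]) -> AnswerTrace:
    return AnswerTrace(
        answer_id=answer_id,
        model_id=model_id,
        prompt_hash=hashlib.sha256(prompt.encode()).hexdigest(),
        retrieval_chunk_ids=chunks,
    )

t = trace_for("a-42", "model-2026-03-01", "Summarise:\n...", ["chunk-9", "chunk-17"])
```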
Parker Bennett ⭐153 · Mar 9, 2026 01:44
Fine-tuning was not worth it until we had at least ten thousand labelled pairs; before that, RAG plus tight instructions won.
Casey Pham ⭐38 · Mar 9, 2026 05:44
Biggest lesson: invest in eval data the same way you invest in training data — garbage in, garbage confidence out.
Quinn Tan ⭐20 · Mar 9, 2026 09:44
We rotate one engineer per sprint as 'model shepherd' who reads logs and updates the eval set — ownership fixed the drift problem.
