Shipping LLM features without drowning in regression tests

Riley Brown ⭐16 · Mar 6, 2026 23:44
We wired a summarisation helper into our support console. The tricky part was not the prompt — it was knowing when the model silently changed behaviour between releases. How are you pinning behaviour: golden sets, eval harnesses, or something else?
15 replies
Jamie Nguyen ⭐22 · Mar 7, 2026 01:44
We snapshot fifty anonymised tickets and diff model outputs on every deploy; anything that moves more than a threshold blocks the release.
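A minimal sketch of what that diff gate could look like in CI, in Python. The golden-file format, the 0.85 threshold, and the summarise stub are all placeholders, not a real setup:

```python
# Hypothetical release gate: regenerate answers for a pinned golden set
# and fail the build if any output drifts past a similarity threshold.
import json
import sys
from difflib import SequenceMatcher

GOLDEN_PATH = "golden_tickets.json"  # assumed format: [{"ticket": ..., "expected": ...}]
DRIFT_THRESHOLD = 0.85               # minimum similarity before a case counts as drifted

def summarise(ticket_text: str) -> str:
    # stand-in for the real model call; swap in your client here
    return ticket_text[:200]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def main() -> None:
    with open(GOLDEN_PATH) as f:
        golden = json.load(f)
    drifted = []
    for case in golden:
        score = similarity(summarise(case["ticket"]), case["expected"])
        if score < DRIFT_THRESHOLD:
            drifted.append((case["ticket"][:60], score))
    for ticket, score in drifted:
        print(f"DRIFT {score:.2f}: {ticket}")
    if drifted:
        sys.exit(1)  # a non-zero exit is what actually blocks the deploy in CI

if __name__ == "__main__":
    main()
```

The exit code is the whole trick: CI treats it as a failed step, and the release stops there.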
Jamie Lopez ⭐30 · Mar 7, 2026 05:44
Golden prompts alone were not enough for us — we added a tiny human spot-check queue for edge cases the metrics miss.
Cameron Walker ⭐49 · Mar 7, 2026 09:44
Our biggest win was versioning prompts like code and tagging them to the exact model revision in telemetry.
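A sketch of that tagging idea, with made-up names (PROMPTS, log_event) standing in for a real prompt store and telemetry client:

```python
# Tag every model call with the prompt version and model revision so
# telemetry can be sliced by the exact (prompt, model) pair.
import hashlib

PROMPTS = {
    # version -> template, checked into the repo and reviewed like code
    "summarise-v3": "Summarise the ticket below in two sentences:\n{ticket}",
}

def prompt_hash(template: str) -> str:
    return hashlib.sha256(template.encode()).hexdigest()[:12]

def log_event(event: dict) -> None:
    print(event)  # stand-in for the real telemetry client

def call_model(prompt_version: str, ticket: str, model_id: str) -> str:
    template = PROMPTS[prompt_version]
    log_event({
        "prompt_version": prompt_version,
        "prompt_hash": prompt_hash(template),
        "model_id": model_id,  # e.g. the vendor's exact revision string
    })
    return "..."  # stand-in for the actual completion

call_model("summarise-v3", "Customer cannot reset password.", "model-2026-03-01")
```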
Drew Tran ⭐190 · Mar 7, 2026 13:44
We treat nondeterminism as a product bug: users hate 'sometimes it works' more than 'it does not work yet'.
Parker Le ⭐14 · Mar 7, 2026 17:44
Latency budgets matter as much as accuracy — we cut scope until p95 stayed under two seconds on cold starts.
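For reference, a tiny nearest-rank p95 check you could drop into a load-test script; the 2.0s budget mirrors the number above and the samples are illustrative:

```python
import math

def p95(latencies: list[float]) -> float:
    # nearest-rank method: smallest value with at least 95% of samples at or below it
    ordered = sorted(latencies)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

samples = [0.8, 1.1, 0.9, 1.4, 1.0, 1.3, 0.7, 1.9, 1.2, 1.0]  # seconds, illustrative
budget = 2.0
assert p95(samples) <= budget, f"p95 {p95(samples):.2f}s blew the {budget}s budget"
```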
Parker Ahmed ⭐175 · Mar 7, 2026 21:44
I keep a spreadsheet of failure modes we have seen in production and map each to a regression case the CI runs nightly.
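One way to wire that spreadsheet into CI is a parametrised pytest file, one row per failure mode. FAILURE_MODES and the summarise stub below are invented examples, not a real dataset:

```python
# Each known production failure mode becomes a named regression case
# that a nightly CI job runs against the current model.
import pytest

FAILURE_MODES = [
    # (case id, input ticket, substring the summary must contain)
    ("refund-currency", "Refund of \u20ac40 was posted as $40.", "\u20ac40"),
    ("empty-ticket", "", "no content"),
]

def summarise(ticket: str) -> str:
    return ticket or "no content"  # stand-in for the real model call

@pytest.mark.parametrize("case_id,ticket,must_contain", FAILURE_MODES)
def test_known_failure_mode(case_id, ticket, must_contain):
    assert must_contain in summarise(ticket), f"regression: {case_id}"
```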
Emerson Carter ⭐48 · Mar 8, 2026 01:44
Shadow mode for two weeks saved us — we logged proposed answers without showing them and measured disagreement with humans.
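A bare-bones version of that shadow loop might look like this; the similarity threshold and the in-memory log are assumptions, not a production design:

```python
# Shadow mode: the model's draft is logged but never shown to the user,
# and disagreement with the human reply is measured afterwards.
from difflib import SequenceMatcher

shadow_log: list[dict] = []

def propose_answer(ticket: str) -> str:
    return "Have you tried resetting your password?"  # stand-in model call

def handle_ticket(ticket: str, human_reply: str) -> str:
    draft = propose_answer(ticket)  # model output, hidden from the user
    shadow_log.append({"ticket": ticket, "draft": draft, "human": human_reply})
    return human_reply              # only the human answer ships

def disagreement_rate(threshold: float = 0.6) -> float:
    disagree = sum(
        SequenceMatcher(None, e["draft"], e["human"]).ratio() < threshold
        for e in shadow_log
    )
    return disagree / len(shadow_log) if shadow_log else 0.0
```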
Finley Tan ⭐191 · Mar 8, 2026 05:44
We stopped chasing perfect ROUGE scores and instead scored against business outcomes the team actually cares about.
Quinn Carter ⭐143 · Mar 8, 2026 09:44
For internal tools, a lightweight 'thumbs down with reason' on every answer gave us better signal than offline evals.
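The signal itself can be as small as a flat event record; the field names here are made up rather than taken from any framework:

```python
# 'Thumbs down with reason' as a minimal feedback event.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AnswerFeedback:
    answer_id: str
    thumbs_up: bool
    reason: str | None  # required in the UI when thumbs_up is False

def record_feedback(fb: AnswerFeedback) -> dict:
    event = asdict(fb) | {"at": datetime.now(timezone.utc).isoformat()}
    return event  # hand off to whatever event pipeline you already have

record_feedback(AnswerFeedback("a-123", False, "missed the refund amount"))
```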
Quinn Walker ⭐171 · Mar 8, 2026 13:44
Contract tests against the vendor API shape caught two breaking JSON changes before they hit staging.
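A contract test along those lines, sketched with pydantic; the response schema below is a generic assumption about a completion payload, not any specific vendor's shape:

```python
# Pin the response shape you depend on and fail fast when it changes.
from pydantic import BaseModel

class Choice(BaseModel):
    index: int
    text: str

class CompletionResponse(BaseModel):
    id: str
    model: str
    choices: list[Choice]

def fetch_sample_response() -> dict:
    # in practice: a recorded fixture, or a live call against staging
    return {"id": "r1", "model": "m1", "choices": [{"index": 0, "text": "ok"}]}

def test_vendor_response_shape():
    payload = fetch_sample_response()
    CompletionResponse.model_validate(payload)  # raises on breaking changes
```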
Drew Khan ⭐138 · Mar 8, 2026 17:44
We document the exact temperature and max tokens per workflow so support engineers know what they are looking at.
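One low-tech way to keep those settings discoverable is a single checked-in table the console reads from; the values here are illustrative:

```python
# Per-workflow generation settings in one reviewable place.
from dataclasses import dataclass

@dataclass(frozen=True)
class GenConfig:
    temperature: float
    max_tokens: int

WORKFLOWS = {
    "ticket-summary": GenConfig(temperature=0.2, max_tokens=256),
    "reply-draft":    GenConfig(temperature=0.7, max_tokens=512),
}

cfg = WORKFLOWS["ticket-summary"]
print(f"ticket-summary runs at T={cfg.temperature}, max_tokens={cfg.max_tokens}")
```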
Hayden Le ⭐116 · Mar 8, 2026 21:44
The compliance team wanted traceability — we store prompt hash, model id, and retrieval chunks for every answer shown.
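The trace record that requirement implies is small; this schema is an assumption sized to exactly the three fields listed above:

```python
# Per-answer trace: enough to reconstruct what the user was shown and why.
import hashlib
from dataclasses import dataclass, field

@dataclass
class AnswerTrace:
    answer_id: str
    model_id: str
    prompt_hash: str
    retrieval_chunk_ids: list[str] = field(default_factory=list)

def trace_for(answer_id: str, model_id: str, prompt: str, chunks: list[str]) -> AnswerTrace:
    return AnswerTrace(
        answer_id=answer_id,
        model_id=model_id,
        prompt_hash=hashlib.sha256(prompt.encode()).hexdigest(),
        retrieval_chunk_ids=chunks,
    )

t = trace_for("a-42", "model-2026-03-01", "Summarise:\n...", ["chunk-9", "chunk-17"])
```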
Parker Bennett ⭐153 · Mar 9, 2026 01:44
Fine-tuning was not worth it until we had at least ten thousand labelled pairs; before that, RAG plus tight instructions won.
Casey Pham ⭐38 · Mar 9, 2026 05:44
Biggest lesson: invest in eval data the same way you invest in training data — garbage in, garbage confidence out.
Quinn Tan ⭐20 · Mar 9, 2026 09:44
We rotate one engineer per sprint as 'model shepherd' who reads logs and updates the eval set — ownership fixed the drift problem.
