Evaluation metrics that looked great offline and then failed in production
AUC, F1, BLEU — when did optimising the metric diverge from the business outcome you actually cared about, and how long did it take to notice?
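To make the failure mode concrete, here is a toy sketch (hypothetical scores, not from any real system): a model can win on AUC, which rewards global ranking quality, while losing on the business action, e.g. "flag the top 3 items for manual review", measured as precision@3.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1 for p in pos for n in neg if p > n) \
         + 0.5 * sum(1 for p in pos for n in neg if p == n)
    return wins / (len(pos) * len(neg))

def precision_at_k(labels, scores, k):
    """Precision among the k highest-scored items (the only items acted on)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    return sum(l for _, l in ranked[:k]) / k

labels  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Model A: mediocre global ranking, but clean at the top.
model_a = [0.60, 0.55, 0.50, 0.20, 0.40, 0.35, 0.30, 0.25, 0.15, 0.10]
# Model B: better global ranking, but a negative sneaks into the top 3.
model_b = [0.99, 0.90, 0.85, 0.80, 0.95, 0.10, 0.05, 0.04, 0.03, 0.02]

print(auc(labels, model_a), precision_at_k(labels, model_a, 3))  # 0.8333..., 1.0
print(auc(labels, model_b), precision_at_k(labels, model_b, 3))  # 0.875, 0.6666...
```

Model B is strictly better on AUC (0.875 vs. 0.833) and strictly worse on the metric the business actually acts on (2/3 vs. 3/3 correct flags), which is exactly the divergence the question asks about: offline model selection by AUC would ship the worse model.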
7 replies
Worth being explicit about assumptions before starting — we wasted two weeks discovering constraints that were knowable upfront.
Defining 'good enough' up front, rather than after the work is done, made a real difference for us.
The version that ships is always different from the version you planned — the question is whether the delta was intentional.
The pattern I keep seeing: the signal is visible in the data much earlier than anyone acts on it.
Documentation and worked examples mattered more than tooling for us — especially when adoption was uneven across the team.
Who owns the decision vs. who owns the outcome is the execution detail that matters most in our context.
We tried three variants. The simplest one worked, which took us too long to try.