We’ll be the first to tell you that we’re obsessed with delivering high-quality responses at RunLLM. Spending time understanding, analyzing, and improving quality is core to how we build trust with our customers — and, if you ask our engineering team, core to what we nag everyone about. But what are we actually doing when we say we’re optimizing for high-quality responses?
The truth is that there’s no clear definition for what “high-quality” AI actually is. Typically, when we say high quality, we mean, roughly, “We got the answer that we expected to get from the LLM.” In other words, we don’t have an empirical way to measure what high-quality AI is, so we’re mostly relying on whether it meets our expectations. Without an empirical measure, we fall back to trying things out and seeing how the LLM responds. This is what we’ve heard jokingly (though increasingly seriously) called vibes-based evals.
The scientists in us cringe at how fluffy this is, but (to some extent) it’s not bad. We used to begrudgingly tolerate vibes-based evals, but they’ve started to grow on us in the last few months. Slowly, we’ve come around to thinking that vibes are a great place to start (if ultimately insufficient).
Earlier this year, we were frustrated by the idea of vibes-based evals. Most customers we worked with at RunLLM didn’t have pre-built test sets, so we had to rely on just trying some questions to see what worked.
We think of this as a form of the blank page problem — you look at an empty chat window and think, “Okay, well, what do I type here?” and enter the first thing that comes to mind. The style and content of those first few responses have an outsized impact on someone’s impression of the quality of a product.
Having done dozens of POCs with customers in the last few months, we’re starting to feel more warmly about vibes-based evals than we did before. The thing with a product like RunLLM is that your customers aren’t going to run a disciplined, scalable evaluation process to determine whether the answer they get is correct. They’ll ask a question, get an answer, and be satisfied if it solved their problem. If you evaluate our product the same way, that’s probably a reasonable place to start.
Even beyond that, we’ve found that evaluating incrementally and without a pre-defined test set has its benefits.
With all that said, wouldn’t it still be better to have some metrics? They might not be comprehensive or perfect, but something’s better than nothing… right?
The short answer is yes. The issue is that there aren’t great evaluation frameworks out there. Measures like MMLU attempt to capture too many different skills in a single number. The LMSys Leaderboard has begun to help on this front by capturing task-specific Elo scores (e.g., coding, instruction following), but that unfortunately doesn’t help us show how good an individual product is.
Even still, we think we can do better as a community. As we’ve argued in the past, we strongly believe we need better LLM evaluations. We even built our own evaluation framework at RunLLM to help guide our development and our customer conversations. None of those opinions have changed.
What we have found in building our own evaluation framework is that it has pitfalls of its own. The biggest issue is that metrics are hard to understand. Our evaluation framework for RunLLM measures correctness, coherence, conciseness, and relevance, and each of these criteria has a specific rubric we use for scoring. Unfortunately, you’ll need to closely read each answer and its rubric to justify the score that you see on the screen. And because we’re using an LLM as a judge, you’ll occasionally see outlier scores that make you question the trustworthiness of the results. Unless you’re planning on using the same test set to evaluate many products, the time you’ll spend building trust in a metrics framework is better spent building trust in the product itself.
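To make that concrete, here’s a minimal sketch of what rubric-based, LLM-as-a-judge scoring can look like. The four criteria names match the ones above, but the one-line rubrics, the 1-to-5 scale, and the `call_llm` / `judge_answer` helpers are illustrative placeholders, not our actual framework.

```python
# Minimal sketch of rubric-based, LLM-as-a-judge scoring.
# `call_llm` is a stand-in for whatever LLM client you use; the one-line
# rubrics below are illustrative, not RunLLM's real scoring rubrics.
import json

CRITERIA = {
    "correctness": "Is the answer factually consistent with the reference material?",
    "coherence": "Is the answer logically structured and easy to follow?",
    "conciseness": "Does the answer avoid unnecessary detail or repetition?",
    "relevance": "Does the answer actually address the question asked?",
}

def judge_answer(question: str, answer: str, call_llm) -> dict:
    """Score one answer on a 1-5 scale per criterion, with a short justification."""
    scores = {}
    for name, rubric in CRITERIA.items():
        prompt = (
            "You are grading an AI assistant's answer.\n"
            f"Criterion ({name}): {rubric}\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            'Respond with JSON: {"score": <1-5 integer>, "justification": "<one sentence>"}'
        )
        # Assumes the judge returns valid JSON; production code would validate this.
        scores[name] = json.loads(call_llm(prompt))
    return scores
```

Even in a toy version like this, the pain points show up immediately: to trust a number, you still have to read the answer, the rubric, and the judge’s justification, and one outlier score from the judge can undermine confidence in the whole table.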
What’s worse is that the quality of an evaluation depends significantly on its test set, and constructing good test sets is really hard. Unlike software tests, where inputs are typed and constrained, text can come in all sorts of strange formats; as we touched on earlier, generating this unexpected input is easy for humans to do. It’s one thing to generate a single, shared test set, but it’s significantly more difficult to do this programmatically when you’re building a customized product for each customer.
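To illustrate why, here’s a rough sketch of one way to seed a per-customer test set programmatically, by asking an LLM to draft questions from documentation chunks. `call_llm`, `TestCase`, and the chunking are hypothetical placeholders, not a description of how we do this.

```python
# Sketch: seeding a per-customer test set by drafting questions from docs.
# `call_llm` is again a stand-in for an LLM client; chunking is naive on purpose.
from dataclasses import dataclass

@dataclass
class TestCase:
    question: str
    source_doc: str       # where the expected answer should come from
    expected_points: str  # free-text notes on what a good answer should cover

def draft_test_cases(doc_chunks: list[tuple[str, str]], call_llm) -> list[TestCase]:
    cases = []
    for doc_name, chunk in doc_chunks:
        question = call_llm(
            "Write one question a user might ask that is answered by this passage:\n"
            + chunk
        )
        cases.append(TestCase(question=question, source_doc=doc_name, expected_points=chunk))
    return cases
```

The catch is that generated questions tend to come out clean and well-formed, while the strange, unexpected inputs real users send are exactly what this kind of generation struggles to cover.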
Finally, we’ve also found that customers are skeptical of assistants being overfit to a particular test set. This is a valid concern, and while we explain how we keep our testing framework separate from assistant improvements, there’s little that builds trust like using a tool yourself.
Things aren’t always great with vibes-based evals. We’ve found a few consistent trends in where vibes can go off the rails:
These three areas are where evals can make the biggest difference: bringing a consistent, holistic view to the quality of the assistant.
A few months back, we viewed vibes-based evals as a stepping stone towards building more empirical test sets. In the absence of that type of test set, we decided we had to make do with vibes-based evals. This is where we’ve changed our minds the most dramatically.
We still believe that we need better LLM benchmarks — both general-purpose and task-specific ones. We still believe that disciplined evaluation processes are super valuable and should be part of how customers choose to buy LLM-based products. What we’ve now realized is that you also need hands-on-keyboard time with a product to convince yourself that it’s going to do what it’s supposed to do — especially when real-world data is noisy and tricky.
Vibes-based evals are a critical tool for good evaluation, and we shouldn’t deride them as low-quality or undisciplined, as we once did. We should embrace the vibes and make sure that AI products are delivering good experiences both qualitatively and quantitatively.