AI
LLM evaluation
How teams measure whether an LLM application is actually working in production, without relying on vibes.
LLM evaluation is the engineering discipline that lets teams ship AI features with the same confidence they ship anything else. It covers offline benchmarks, online A/B testing, regression suites, and the human-in-the-loop scoring that calibrates the automated checks.
The state of the art shifted in 2025: LLM-as-judge methods became reliable enough for production gating when paired with deterministic structural checks. Notifire tracks the eval-framework releases, public benchmark drama, and the tools (Ragas, Braintrust, Inspect, OpenAI Evals) that engineering teams actually adopt.
Latest briefings on LLM evaluation
Data
New Benchmark Tests Snowflake vs. ClickHouse on Cost
A new benchmark, CostBench, offers an end-to-end cost and performance comparison of Snowflake and ClickHouse Cloud. The analysis helps engineering leaders choose the right platform for real-time analytics by looking beyond just query speed.
Taranpreet Singh
Data
Smarter AI Models Still Lack Context
New AI models consistently achieve higher benchmark scores, yet they often fail in real-world applications by hallucinating or mishandling queries. This gap highlights that raw intelligence isn't enough; models require specific, real-time context to perform reliably and reason effectively in production environments.
Taranpreet Singh
Security
Microsoft Benchmarks Defender Against Email Security Rivals
Microsoft has released a year's worth of benchmarking data, comparing its Defender email security platform against rivals. The reports use real-world threat data to measure effectiveness, offering transparency for teams evaluating their security tools.
Neeraj Dhiman
Infra
AI Retrieval Is Now a Systems Problem
Scaling AI applications is revealing the limits of simple vector search. Production systems now require a retrieval layer that combines keyword matching, semantic search, ranking, and real-time data. This shift treats AI retrieval as a systems problem, not just a tooling one.
Ashish Kale
Data
New Tool Benchmarks Your Key-Value Databases
A developer has released Keybench, a new open-source tool for testing key-value databases. It provides a simple, scriptable way to measure performance, filling a gap left by popular tools that don't support these specific data stores.
Taranpreet Singh
AI
A New Framework for AI Evaluation
Mallika Rao, with experience from Twitter and Netflix, presents a new framework for evaluating production AI systems. She argues traditional metrics are outdated and introduces a five-layer evaluation stack and a maturity model to prevent silent failures and manage "evaluation debt."
Neeraj Dhiman
AI
AI Security Benchmarks Don't Work
A new report highlights that traditional security benchmarks are ineffective for evaluating AI systems. Unlike standard software, AI security is an emergent property that cannot be measured by simple tests, challenging teams to rethink how they approach securing their AI models and applications.
Neeraj Dhiman
Frequently asked questions
What's the minimum viable LLM eval setup?
Three pieces: a frozen test set of 50–200 representative prompts with reference answers, an automated scorer (exact match for structured outputs, LLM-as-judge for free-form), and a CI gate that fails the build if the pass rate drops below the last accepted baseline. That's enough to catch the majority of production regressions before they ship.
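As a rough illustration, the setup described above can be sketched in a few lines of Python. Everything here is an assumption made for the example: the names `EvalCase`, `pass_rate`, `gate`, and `ci_exit_code` are hypothetical, `fake_app` stands in for the real application under test, and only the exact-match scorer is shown (an LLM-as-judge scorer would plug into the same `scorer` parameter).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One entry in the frozen test set: a prompt and its reference answer."""
    prompt: str
    reference: str


def exact_match(output: str, reference: str) -> bool:
    """Scorer for structured outputs: compare after trimming whitespace."""
    return output.strip() == reference.strip()


def pass_rate(
    cases: Sequence[EvalCase],
    generate: Callable[[str], str],
    scorer: Callable[[str, str], bool] = exact_match,
) -> float:
    """Fraction of cases whose generated output the scorer accepts."""
    if not cases:
        raise ValueError("test set is empty")
    passed = sum(scorer(generate(c.prompt), c.reference) for c in cases)
    return passed / len(cases)


def gate(current: float, baseline: float, tolerance: float = 0.0) -> bool:
    """True if the current pass rate has not regressed beyond the tolerance."""
    return current >= baseline - tolerance


def ci_exit_code(current: float, baseline: float, tolerance: float = 0.0) -> int:
    """Exit code for a CI step: 0 lets the build proceed, 1 fails it."""
    return 0 if gate(current, baseline, tolerance) else 1


# Hypothetical stand-in for the application under test.
def fake_app(prompt: str) -> str:
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


# Toy test set; a real one would hold 50-200 representative prompts.
CASES = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "Paris"),
    EvalCase("capital of Peru", "Lima"),
    EvalCase("3*3", "9"),
]
```

In a CI script, the last step would be something like `sys.exit(ci_exit_code(pass_rate(CASES, fake_app), baseline=0.5))`, with the baseline read from wherever the team stores the last accepted score.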
Can you trust an LLM to grade another LLM?
For most subjective tasks, yes, with caveats. LLM judges should be calibrated against human-labelled ground truth on a hold-out set, run with settings that minimise randomness (low temperature, structured output), and, for important decisions, ideally run several times with a majority vote.
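The two safeguards mentioned above, majority voting and calibration against human labels, might look roughly like this. It is a sketch under stated assumptions, not any framework's API: the `Judge` type, `majority_verdict`, `agreement_rate`, and `make_scripted_judge` are hypothetical names, verdicts are assumed to be the binary labels "pass" and "fail", and the scripted judge replaces a real model call so the example runs offline.

```python
from collections import Counter
from typing import Callable, Iterable, Sequence

# Assumed judge interface: (prompt, answer) -> verdict label ("pass" or "fail").
Judge = Callable[[str, str], str]


def majority_verdict(judge: Judge, prompt: str, answer: str, runs: int = 3) -> str:
    """Call the judge several times and return the most common verdict.

    An odd number of runs avoids ties when the labels are binary.
    """
    if runs < 1 or runs % 2 == 0:
        raise ValueError("runs must be a positive odd number")
    votes = Counter(judge(prompt, answer) for _ in range(runs))
    return votes.most_common(1)[0][0]


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Share of hold-out items where the judge matches the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def make_scripted_judge(verdicts: Iterable[str]) -> Judge:
    """Hypothetical stand-in for an LLM judge: replays a fixed list of verdicts."""
    it = iter(verdicts)

    def judge(prompt: str, answer: str) -> str:
        return next(it)

    return judge
```

A team would typically compute `agreement_rate` on the human-labelled hold-out set first, and only let the judge gate releases once that figure clears a threshold they have agreed on.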