by Max Vornovskykh | April 30, 2026 11:08 am
AI agents are entering real business workflows, but their quality is hard to measure consistently. A single request can involve reasoning, tool calls, external data, and several decisions, so teams need visibility into the full workflow before trusting it in production.
At QATestLab, we currently operate 32 AI agents across 4 platforms — from corporate knowledge bots to ML solutions for marketing data enrichment and scoring. In this article, I want to share how we built a unified evaluation approach for these agents using Langfuse, tracing, AI as a Judge, and continuous improvement cycles.
Traditional software is deterministic: the same input produces the same output. Classical ML models add an element of uncertainty, but their behavior remains broadly predictable — a model makes a prediction and waits for the next request.
AI agents work fundamentally differently. They reason, make decisions, invoke tools, analyze results, and continue to act. A single user request can cascade into dozens of LLM calls, API requests, and intermediate decisions. Each step creates new potential points of failure.
LangChain’s State of Agent Engineering report[1] shows that 57% of surveyed teams already run AI agents in production, but quality remains the top production barrier for 32% of respondents. This makes structured evaluation and observability less of a “nice to have” and more of a core requirement for reliable agentic systems.
For AI agents, quality becomes easier to manage when the full execution path is visible: where the agent received context, how it made decisions, which tools it used, where failures appeared, and what can be improved. This turns agent behavior into something teams can measure, analyze, and refine over time.
At QATestLab, we deliberately use different tools for different tasks. The choice of platform depends on the complexity of the logic, customization requirements, and the team that will maintain the solution.
No-code platforms (n8n, Flowise) are used for standard scenarios where speed of deployment and ease of maintenance by non-developers are important. Typical use cases include workflow automations, simple RAG bots, and integrations with corporate systems. Their main advantage is a low barrier to entry and visual debugging.
Code-based solutions (LangChain, LangGraph, Claude Code) are applied in more complex agentic workflows, where control over each step of the reasoning process is required. These include multi-agent systems, advanced business logic, and custom tool integrations. They provide more flexibility, but also require stronger technical expertise.
Regardless of the platform, the evaluation challenges remain the same: non-determinism, cascading failures, and the difficulty of root cause analysis. That’s why we built a unified observability system that covers all 32 agents — even though they run on different platforms.
| Platform Type | Examples | When to Choose | Evaluation Complexity |
|---|---|---|---|
| No-code | n8n, Flowise | Standard scenarios, rapid deployment | Moderate (fewer steps) |
| Low-code | Flowise + custom nodes | Balance of speed and flexibility | Moderate to high |
| Code-based | LangChain, LangGraph | Complex logic, multi-agent | High |
| Hybrid | Claude Code + integrations | R&D, experimentation | Depends on scope |
In traditional QA, we usually expect the same input to produce the same output. With AI agents, this assumption breaks down.
↳ Non-determinism at the generation level. The same prompt can elicit different responses due to sampling, variation in tool selection, and changes in context.
↳ Multi-step workflows. An agent operates through chains of decisions rather than isolated requests. Its behavior depends on conversation history, retrieved documents, and previous steps.
↳ External dependencies. Agents make real API calls, search for information, and trigger downstream automations. The environment may change between evaluations.
↳ Cascading failures. An error at an early step can affect the entire workflow. Without tracing, root cause analysis quickly turns into detective work.
We structure the evaluation across three levels:
This structure allows teams to move beyond checking final outputs and instead evaluate how the agent operates across the full workflow. It creates a clearer view of where failures occur and provides a foundation for consistent improvement.
There are many LLM observability tools on the market. Among them, we chose Langfuse because it combines the features we need for production use with enough flexibility to support diverse AI agents.
These capabilities were important for tool selection, but the real value of Langfuse becomes clearer when we look at how it changes agent debugging and quality control in practice.
Without observability, debugging agents quickly turns into guesswork. A trace in Langfuse shows the full execution path: which prompt was sent to the model, how the model responded, which tools it called, what those tools returned, and how the model interpreted the result.
In practice, this helps identify the specific failure point much faster. The issue may be an outdated document returned during retrieval, a tool call timeout, or incorrectly parsed structured output.
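To make this concrete, below is a minimal sketch of how an agent step can be instrumented so that the prompt, the tool call, and the model’s answer all land in a single Langfuse trace. The agent name (“kb-bot”) and the `search_knowledge_base` tool are hypothetical, and the imports follow the v2-style Python SDK; exact module paths differ in other SDK versions.

```python
# Minimal tracing sketch, assuming the v2-style Langfuse Python SDK.
# The "kb-bot" agent and the search_knowledge_base tool are hypothetical examples.
from langfuse.decorators import observe, langfuse_context
from langfuse.openai import OpenAI  # drop-in wrapper that records model calls as generations

client = OpenAI()

@observe()  # recorded as a nested span inside the parent trace
def search_knowledge_base(query: str) -> str:
    # A real agent would query a vector store or an internal API here.
    return "Refund policy: 30 days with the original receipt."

@observe()  # the root trace: one per user request
def answer_question(question: str) -> str:
    context = search_knowledge_base(question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer strictly from this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    # Tag the trace so runs can be filtered per agent in the Langfuse UI.
    langfuse_context.update_current_trace(metadata={"agent": "kb-bot"}, tags=["production"])
    return response.choices[0].message.content

if __name__ == "__main__":
    print(answer_question("What is your refund policy?"))
```

With instrumentation like this, a failed run points to a specific span rather than to the agent as a whole: the retrieval step, the model call, or the glue code around them.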
Langfuse also provides performance insights that help teams answer practical optimization questions:
Together, these insights support data-driven optimization instead of relying on assumptions.
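As a simplified illustration of that optimization loop, the sketch below aggregates latency and cost per agent from exported trace records. The field names (`agent`, `latency_s`, `cost_usd`) are placeholders for whatever a Langfuse export or API query returns in your setup.

```python
# Hedged sketch: per-agent latency and cost summary from exported trace records.
# The field names (agent, latency_s, cost_usd) are placeholders, not a fixed schema.
from collections import defaultdict
from statistics import median, quantiles

def summarize(traces: list[dict]) -> dict[str, dict[str, float]]:
    grouped = defaultdict(list)
    for trace in traces:
        grouped[trace["agent"]].append(trace)

    summary = {}
    for agent, items in grouped.items():
        latencies = [t["latency_s"] for t in items]
        summary[agent] = {
            "p50_latency_s": median(latencies),
            "p95_latency_s": quantiles(latencies, n=20)[-1] if len(latencies) > 1 else latencies[0],
            "avg_cost_usd": sum(t["cost_usd"] for t in items) / len(items),
        }
    return summary

traces = [
    {"agent": "kb-bot", "latency_s": 2.1, "cost_usd": 0.004},
    {"agent": "kb-bot", "latency_s": 6.8, "cost_usd": 0.012},
    {"agent": "scoring", "latency_s": 1.3, "cost_usd": 0.002},
]
print(summarize(traces))  # e.g. highlights that kb-bot's tail latency needs attention
```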
Human evaluation is the gold standard, but it does not scale. Automated validation works well for objective checks, such as output format or schema validation. AI as a Judge helps cover the gap by using one LLM to evaluate another LLM’s output against defined criteria.
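For the objective part, a deterministic schema check is usually enough and needs no LLM at all. A minimal sketch, assuming a hypothetical `LeadScore` output schema:

```python
# Hedged sketch of an objective check: validate structured agent output against a schema
# before any LLM-based judging runs. LeadScore is a hypothetical example schema.
from pydantic import BaseModel, Field, ValidationError

class LeadScore(BaseModel):
    company: str
    score: int = Field(ge=0, le=100)
    reasoning: str

def is_valid_output(raw_json: str) -> bool:
    try:
        LeadScore.model_validate_json(raw_json)
        return True
    except ValidationError:
        return False

print(is_valid_output('{"company": "Acme", "score": 87, "reasoning": "Strong ICP fit"}'))  # True
print(is_valid_output('{"company": "Acme", "score": 150}'))  # False: score out of range, reasoning missing
```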
Langfuse provides built-in support for LLM-as-Judge evaluators. You configure an evaluator once, and it automatically runs on new traces (or on a sample to control costs). Each evaluation creates its own trace, which allows you to inspect the judge’s reasoning.
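Built-in evaluators are configured in the Langfuse UI, but the same idea can be sketched as a custom judge whose verdict is attached to the original trace as a score. The example below assumes the v2-style Python SDK (`langfuse.score`) and a hypothetical `groundedness` criterion; it is an illustration, not our production prompt.

```python
# Hedged LLM-as-Judge sketch: one evaluator, one failure mode (unsupported claims).
# Assumes the v2-style Langfuse Python SDK; trace_id identifies the agent run being judged.
import json
from langfuse import Langfuse
from openai import OpenAI

langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment
client = OpenAI()

JUDGE_PROMPT = """You are a strict evaluator checking ONE thing only:
does the answer contain claims that are not supported by the context?

Context: {context}
Answer: {answer}

Respond with JSON: {{"score": 0 or 1, "reasoning": "..."}}
(1 = fully grounded in the context, 0 = contains unsupported claims)."""

def judge_groundedness(trace_id: str, context: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
    )
    verdict = json.loads(response.choices[0].message.content)
    # Attach the judgment to the original trace so it appears next to the run in Langfuse.
    langfuse.score(
        trace_id=trace_id,
        name="groundedness",
        value=verdict["score"],
        comment=verdict["reasoning"],
    )
    return verdict["score"]
```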
Currently, 22 of our 32 agents are covered by automated evaluation. For the remaining 10, evaluator configuration is still in progress, or the use case is specific enough that it requires human review.
What we evaluate with AI as a Judge:
What we leave for human review:
This gives us a scalable evaluation layer: routine quality signals are collected automatically, while problematic traces become visible for deeper analysis. As a result, evaluation becomes part of the agent’s lifecycle rather than a separate manual check after something goes wrong.
The most common mistake is trying to evaluate too much with a single evaluator. We follow a simpler principle:
one evaluator — one failure mode.
A reliable evaluator prompt should include:
The critical step is validation against human judgment. We split a labeled dataset into a development set for prompt tuning and a test set for final evaluation. Then we measure the true positive and true negative rates.
If the evaluator systematically diverges from human labels, we iterate on the prompt until the automated judgment becomes consistent with the expected evaluation logic.
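A minimal sketch of that validation step: compare judge verdicts with human labels on the held-out test set and report both rates. The binary (human, judge) pairs below are illustrative.

```python
# Hedged sketch: agreement between an LLM judge and human labels on a test set.
# Each pair is (human_label, judge_score) with 1 = pass and 0 = fail.

def agreement_rates(pairs: list[tuple[int, int]]) -> dict[str, float]:
    tp = sum(1 for human, judge in pairs if human == 1 and judge == 1)
    tn = sum(1 for human, judge in pairs if human == 0 and judge == 0)
    positives = sum(1 for human, _ in pairs if human == 1)
    negatives = sum(1 for human, _ in pairs if human == 0)
    return {
        # True positive rate: share of human-approved outputs the judge also approves.
        "tpr": tp / positives if positives else 0.0,
        # True negative rate: share of human-rejected outputs the judge also rejects.
        "tnr": tn / negatives if negatives else 0.0,
    }

# Example: the judge agrees on 3 of 4 passes and 2 of 2 fails.
test_set = [(1, 1), (1, 1), (1, 0), (1, 1), (0, 0), (0, 0)]
print(agreement_rates(test_set))  # {'tpr': 0.75, 'tnr': 1.0}
```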
Evaluation creates value only when it leads to action. The goal is to turn evaluation results into concrete improvements.
Our cycle looks like this:
AI agents are systems that reason, act, and adapt across complex workflows. This makes quality assurance a continuous process rather than a one-time check.
Our experience with 32 agents across 4 platforms has shown:
Langfuse has become a strong fit for our AI quality workflow: open-source, actively developed, production-ready, and flexible enough for different platforms and use cases.
If you are planning to implement AI agents or improve the reliability of existing solutions, we would be happy to share our experience in setting up an evaluation pipeline and discuss what this process could look like in your specific context. Book a discovery call to get started.