32 AI Agents Across 4 Platforms: Building a Robust Evaluation System for AI Solutions
AI agents are entering real business workflows, but their quality is hard to measure consistently. A single request can involve reasoning, tool calls, external data, and several decisions, so teams need visibility into the full workflow before trusting it in production.
At QATestLab, we currently operate 32 AI agents across 4 platforms — from corporate knowledge bots to ML solutions for marketing data enrichment and scoring. In this article, I want to share how we built a unified evaluation approach for these agents using Langfuse, tracing, AI as a Judge, and continuous improvement cycles.
Why AI Agents Need a Different Quality Approach
Traditional software is deterministic: the same input produces the same output. Classical ML models add an element of uncertainty, but their behavior remains broadly predictable — a model makes a prediction and waits for the next request.
AI agents work fundamentally differently. They reason, make decisions, invoke tools, analyze results, and continue to act. A single user request can cascade into dozens of LLM calls, API requests, and intermediate decisions. Each step creates new potential points of failure.
LangChain’s State of Agent Engineering report shows that 57% of surveyed teams already run AI agents in production, but quality remains the top production barrier for 32% of respondents. This makes structured evaluation and observability less of a “nice to have” and more of a core requirement for reliable agentic systems.
For AI agents, quality becomes easier to manage when the full execution path is visible: where the agent received context, how it made decisions, which tools it used, where failures appeared, and what can be improved. This turns agent behavior into something teams can measure, analyze, and refine over time.
From No-Code to LangGraph: A Unified Approach to Agent Quality
At QATestLab, we deliberately use different tools for different tasks. The choice of platform depends on the complexity of the logic, customization requirements, and the team that will maintain the solution.
No-code platforms (n8n, Flowise) are used for standard scenarios where speed of deployment and ease of maintenance by non-developers are important. Typical use cases include workflow automations, simple RAG bots, and integrations with corporate systems. Their main advantage is a low barrier to entry and visual debugging.
Code-based solutions (LangChain, LangGraph, Claude Code) are applied in more complex agentic workflows, where control over each step of the reasoning process is required. These include multi-agent systems, advanced business logic, and custom tool integrations. They provide more flexibility, but also require stronger technical expertise.
Regardless of the platform, the evaluation challenges remain the same: non-determinism, cascading failures, and the difficulty of root cause analysis. That’s why we built a unified observability system that covers all 32 agents — even though they run on different platforms.
| Platform Type | Examples | When to Choose | Evaluation Complexity |
|---|---|---|---|
| No-code | n8n, Flowise | Standard scenarios, rapid deployment | Moderate (fewer steps) |
| Low-code | Flowise + custom nodes | Balance of speed and flexibility | Moderate to high |
| Code-based | LangChain, LangGraph | Complex logic, multi-agent | High |
| Hybrid | Claude Code + integrations | R&D, experimentation | Depends on scope |
Why AI Agents Need More Than Traditional QA
In traditional QA, we usually expect the same input to produce the same output. With AI agents, this assumption breaks down.
- Non-determinism at the generation level. The same prompt can elicit different responses due to sampling, variation in tool selection, and changes in context.
- Multi-step workflows. An agent operates through chains of decisions rather than isolated requests. Its behavior depends on conversation history, retrieved documents, and previous steps.
- External dependencies. Agents make real API calls, search for information, and trigger downstream automations. The environment may change between evaluations.
- Cascading failures. An error at an early step can affect the entire workflow. Without tracing, root cause analysis quickly turns into detective work.
We structure the evaluation across three levels:
- Task-level — did the agent complete the overall task? This includes success rate, completion time, and resource consumption.
- Step-level — were individual steps correct? This covers tool selection, response interpretation, and recovery after failures.
- Behavioral — how does the agent “think”? This helps identify decision-making patterns and systematic errors.
This structure allows teams to move beyond checking final outputs and instead evaluate how the agent operates across the full workflow. It creates a clearer view of where failures occur and provides a foundation for consistent improvement.
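To make this concrete, the sketch below shows one way these levels can be recorded per run. The structure and field names are illustrative, not tied to any specific framework:

```python
from dataclasses import dataclass, field

@dataclass
class StepResult:
    """Step-level signal: one tool call or decision within a run."""
    name: str                      # e.g. "retrieve_docs" or "select_tool"
    correct_tool: bool             # was an appropriate tool chosen?
    recovered_from_error: bool = False

@dataclass
class AgentRunReport:
    """Task-level signal for a single run, plus its step breakdown."""
    task_completed: bool           # did the agent finish the overall task?
    latency_seconds: float         # completion time
    total_tokens: int              # resource consumption
    steps: list[StepResult] = field(default_factory=list)

    @property
    def step_accuracy(self) -> float:
        """Share of steps where an appropriate tool was chosen."""
        if not self.steps:
            return 1.0
        return sum(s.correct_tool for s in self.steps) / len(self.steps)
```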
Why We Chose Langfuse for LLM Observability
There are many LLM observability tools on the market. We chose Langfuse because it combines the features we need for production use with enough flexibility to support a diverse set of AI agents.
- Open-source with a self-hosted option. For some projects, data must stay within the company perimeter. Langfuse can be deployed on your own infrastructure via Docker Compose or Kubernetes.
- Active development. The Langfuse team releases updates regularly. Recent improvements include an OpenTelemetry-native SDK, enhanced LLM-as-Judge with full execution tracing, and new integrations. This pace of development matters in a rapidly changing technology area.
- Broad integrations. Langfuse supports OpenAI, LangChain, LlamaIndex, Pydantic AI, and Vercel AI SDK. For no-code platforms like n8n, integration is possible via APIs, allowing a single tool to cover different parts of the agent ecosystem.
- Production-optimized setup. Minimal performance overhead, trace batching, and asynchronous sending make Langfuse suitable for agents running in production.
- Prompt management. Centralized prompt versioning and A/B testing make it easier to test and deploy prompt changes without code updates.
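As an illustration of the last point, fetching a centrally managed prompt at runtime can look roughly like the sketch below. The prompt name and variables are hypothetical, and the exact SDK calls may differ between Langfuse versions:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from the environment

# Fetch the currently deployed version of a prompt managed in Langfuse.
# "support-bot-system" is a hypothetical prompt name used for illustration.
prompt = langfuse.get_prompt("support-bot-system")

# Fill in template variables before sending the prompt to the model.
system_message = prompt.compile(company="QATestLab", tone="concise")
```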
These capabilities were important for tool selection, but the real value of Langfuse becomes clearer when we look at how it changes agent debugging and quality control in practice.
From Black Box to Transparent Agent Debugging
Without observability, debugging agents quickly turns into guesswork. A trace in Langfuse shows the full execution path: which prompt was sent to the model, how the model responded, which tools it called, what those tools returned, and how the model interpreted the result.
In practice, this helps identify the specific failure point much faster. The issue may be an outdated document returned during retrieval, a tool call timeout, or incorrectly parsed structured output.
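For code-based agents, producing such a trace usually means instrumenting the entry point and the individual steps. A minimal sketch using the Langfuse observe decorator; the retrieval and answer functions are hypothetical placeholders, and the exact import path depends on the SDK version you use:

```python
from langfuse import observe  # in older SDK versions: from langfuse.decorators import observe

@observe()  # creates a span for the retrieval step inside the parent trace
def retrieve_docs(query: str) -> list[str]:
    # hypothetical retrieval logic; returns document snippets for the query
    return ["..."]

@observe()  # the top-level call becomes the trace; nested calls become spans
def answer_question(question: str) -> str:
    docs = retrieve_docs(question)
    # hypothetical generation step; with the OpenAI or LangChain integrations,
    # model calls are captured as generations automatically
    return f"Answer based on {len(docs)} documents"
```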
Langfuse also provides performance insights that help teams answer practical optimization questions:
- Latency by step — where is the bottleneck?
- Token usage by component — what is driving the cost?
- Error rates across workflows — where do failures occur most frequently?
Together, these insights support data-driven optimization instead of relying on assumptions.
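These questions can be answered in the Langfuse UI, but the same aggregation is easy to reproduce on exported trace data. A rough sketch, assuming each exported observation carries a step name, a latency, and a token count (the field names below are illustrative):

```python
from collections import defaultdict

def summarize(observations: list[dict]) -> dict:
    """Aggregate latency, token usage, and errors per step from exported trace data."""
    latency = defaultdict(float)
    tokens = defaultdict(int)
    errors = defaultdict(int)
    for obs in observations:
        step = obs["name"]
        latency[step] += obs.get("latency_seconds", 0.0)
        tokens[step] += obs.get("total_tokens", 0)
        errors[step] += 1 if obs.get("level") == "ERROR" else 0
    return {
        "latency_by_step": dict(latency),
        "tokens_by_step": dict(tokens),
        "errors_by_step": dict(errors),
    }
```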
AI as a Judge — How We Cover 22 AI Agents with Automated Evaluation
Human evaluation is the gold standard, but it does not scale. Automated validation works well for objective checks, such as output format or schema validation. AI as a Judge helps cover the gap by using one LLM to evaluate another LLM’s output against defined criteria.
Langfuse provides built-in support for LLM-as-Judge evaluators. You configure an evaluator once, and it automatically runs on new traces (or on a sample to control costs). Each evaluation creates its own trace, which allows you to inspect the judge’s reasoning.
Currently, 22 of our 32 agents are covered by automated evaluation. For the remaining 10, either the evaluator configuration is still in progress, or the use case is specific enough that it still requires human review.
What we evaluate with AI as a Judge:
- Adherence to instructions — whether the required format and tone are followed
- Response relevance — whether the response addresses the request
- Hallucination detection — for RAG scenarios, whether the response is grounded in the retrieved context
- Out-of-scope handling — whether the agent correctly declines requests outside its competence
What we leave for human review:
- Edge cases with complex business logic
- New failure modes that have not yet been formalized
- Evaluator calibration — periodic alignment checks between automated and human judgment
This gives us a scalable evaluation layer: routine quality signals are collected automatically, while problematic traces become visible for deeper analysis. As a result, evaluation becomes part of the agent’s lifecycle rather than a separate manual check after something goes wrong.
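When a judge runs outside Langfuse's built-in evaluators, its verdict can still be attached to the original trace as a score, so the result stays next to the execution it describes. A rough sketch; the exact method name (score in older SDK versions, create_score in newer ones) and the way you obtain the trace id depend on your setup:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* keys from environment variables

# Attach a judge verdict to an existing trace so it appears next to the run it describes.
langfuse.score(
    trace_id="abc-123",              # hypothetical id; normally taken from the tracing context
    name="out_of_scope_handling",
    value=1,                         # 1 = pass, 0 = fail
    comment="Agent correctly declined an out-of-scope request.",
)
```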
How to Create a Reliable Evaluator Prompt
The most common mistake is trying to evaluate too much with a single evaluator. We follow a simpler principle:
one evaluator — one failure mode.
A reliable evaluator prompt should include:
- Clear focus — one specific failure mode, such as out-of-scope behavior, hallucination, or wrong tool selection.
- Explicit Pass/Fail criteria — what counts as a pass, and what should be marked as a fail.
- Examples — 2–3 annotated examples to calibrate the evaluator.
- Structured output — a defined format for the score and reasoning.
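For illustration, an out-of-scope evaluator built around this structure could look like the sketch below. The wording, examples, and output schema are simplified placeholders rather than our production prompt:

```python
OUT_OF_SCOPE_EVALUATOR = """You are evaluating a support agent's response for ONE failure mode only:
handling requests that are outside the agent's competence.

Pass:
- The request is in scope and the agent answers it, OR
- The request is out of scope and the agent politely declines and redirects.

Fail:
- The request is out of scope but the agent attempts to answer it anyway.

Examples:
1. Request: "What is your refund policy?" -> agent answers from the knowledge base -> PASS
2. Request: "Write me a poem about cats." -> agent writes the poem -> FAIL

Return JSON: {"verdict": "PASS" | "FAIL", "reasoning": "<one short sentence>"}

Request: <request>
Response: <response>
"""
```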
The critical step is validation against human judgment. We split a labeled dataset into a development set for prompt tuning and a test set for final evaluation. Then we measure the true positive and true negative rates.
If the evaluator systematically diverges from human labels, we iterate on the prompt until the automated judgment becomes consistent with the expected evaluation logic.
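Measuring that alignment takes little code once a labeled set exists. A minimal sketch, assuming boolean labels where True marks a detected failure:

```python
def agreement_rates(human_labels: list[bool], judge_labels: list[bool]) -> tuple[float, float]:
    """True positive and true negative rate of the judge against human labels."""
    tp = sum(h and j for h, j in zip(human_labels, judge_labels))
    tn = sum((not h) and (not j) for h, j in zip(human_labels, judge_labels))
    positives = sum(human_labels) or 1          # human-labeled failures
    negatives = (len(human_labels) - sum(human_labels)) or 1
    return tp / positives, tn / negatives
```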
Turning Evaluation Results into Continuous Improvement
Evaluation creates value only when it leads to action. The goal is to turn evaluation results into concrete improvements.
Our cycle is straightforward: tracing and automated evaluation surface problematic runs, root cause analysis points to the failing step (prompt, tool call, or retrieval), the fix is shipped, and the updated agent is re-evaluated against the same criteria. Over time, this loop, rather than any single check, is what raises agent quality.
Conclusion
AI agents are systems that reason, act, and adapt across complex workflows. This makes quality assurance a continuous process rather than a one-time check.
Our experience with 32 agents across 4 platforms has shown:
- Observability is foundational. Without tracing, debugging agents becomes guesswork. Set up Langfuse or an equivalent tool before the first production deployment.
- The platform does not define quality. No-code, low-code, or LangGraph — the evaluation challenges remain similar. A unified observability system helps keep the full agent stack under control.
- AI as a Judge can scale evaluation. With the right criteria and the principle of one evaluator — one failure mode, automated evaluation becomes practical for production agents.
- Evaluation should lead to improvement. Without a continuous improvement cycle, evaluation becomes passive monitoring. The real value appears when insights turn into changes.
Langfuse has proven to be a strong fit for our AI quality workflow: open-source, actively developed, production-ready, and flexible enough for different platforms and use cases.
If you are planning to implement AI agents or improve the reliability of existing solutions, we would be happy to share our experience in setting up an evaluation pipeline and discuss what this process could look like in your specific context. Book a discovery call to get started.
