32 AI Agents Across 4 Platforms: Building a Robust Evaluation System for AI Solutions

by Max Vornovskykh | April 30, 2026 11:08 am

AI agents are entering real business workflows, but their quality is hard to measure consistently. A single request can involve reasoning, tool calls, external data, and several decisions, so teams need visibility into the full workflow before trusting it in production.

At QATestLab, we currently operate 32 AI agents across 4 platforms — from corporate knowledge bots to ML solutions for marketing data enrichment and scoring. In this article, I want to share how we built a unified evaluation approach for these agents using Langfuse, tracing, AI as a Judge, and continuous improvement cycles.

Why AI Agents Need a Different Quality Approach

Traditional software is deterministic: the same input produces the same output. Classical ML models add an element of uncertainty, but their behavior remains broadly predictable — a model makes a prediction and waits for the next request.

AI agents work fundamentally differently. They reason, make decisions, invoke tools, analyze results, and continue to act. A single user request can cascade into dozens of LLM calls, API requests, and intermediate decisions. Each step creates new potential points of failure.

LangChain’s State of Agent Engineering report shows[1] that 57% of surveyed teams already run AI agents in production, but quality remains the top production barrier for 32% of respondents. This makes structured evaluation and observability less of a “nice to have” and more of a core requirement for reliable agentic systems. 

For AI agents, quality becomes easier to manage when the full execution path is visible: where the agent received context, how it made decisions, which tools it used, where failures appeared, and what can be improved. This turns agent behavior into something teams can measure, analyze, and refine over time. 

From No-Code to LangGraph: A Unified Approach to Agent Quality

At QATestLab, we deliberately use different tools for different tasks. The choice of platform depends on the complexity of the logic, customization requirements, and the team that will maintain the solution.

No-code platforms (n8n, Flowise) are used for standard scenarios where speed of deployment and ease of maintenance by non-developers are important. Typical use cases include workflow automations, simple RAG bots, and integrations with corporate systems. Their main advantage is a low barrier to entry and visual debugging.

Code-based solutions (LangChain, LangGraph, Claude Code) are applied in more complex agentic workflows, where control over each step of the reasoning process is required. These include multi-agent systems, advanced business logic, and custom tool integrations. They provide more flexibility, but also require stronger technical expertise.

Regardless of the platform, the evaluation challenges remain the same: non-determinism, cascading failures, and the difficulty of root cause analysis. That’s why we built a unified observability system that covers all 32 agents — even though they run on different platforms.

Platform overview

| Platform Type | Examples | When to Choose | Evaluation Complexity |
|---|---|---|---|
| No-code | n8n, Flowise | Standard scenarios, rapid deployment | Moderate (fewer steps) |
| Low-code | Flowise + custom nodes | Balance of speed and flexibility | Moderate to high |
| Code-based | LangChain, LangGraph | Complex logic, multi-agent | High |
| Hybrid | Claude Code + integrations | R&D, experimentation | Depends on scope |

Why AI Agents Need More Than Traditional QA

In traditional QA, we usually expect the same input to produce the same output. With AI agents, this assumption breaks down.

↳ Non-determinism at the generation level. The same prompt can elicit different responses due to sampling, variation in tool selection, and changes in context.

↳ Multi-step workflows. An agent operates through chains of decisions rather than isolated requests. Its behavior depends on conversation history, retrieved documents, and previous steps.

↳ External dependencies. Agents make real API calls, search for information, and trigger downstream automations. The environment may change between evaluations.

↳ Cascading failures. An error at an early step can affect the entire workflow. Without tracing, root cause analysis quickly turns into detective work.

We structure the evaluation across three levels:

  1. Task-level — did the agent complete the overall task? This includes success rate, completion time, and resource consumption.
  2. Step-level — were individual steps correct? This covers tool selection, response interpretation, and recovery after failures.
  3. Behavioral — how does the agent “think”? This helps identify decision-making patterns and systematic errors.

This structure allows teams to move beyond checking final outputs and instead evaluate how the agent operates across the full workflow. It creates a clearer view of where failures occur and provides a foundation for consistent improvement.
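As an illustration only (hypothetical names, not our production code), the three levels can be sketched as a small data model in which step-level results roll up into a task-level verdict:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the three evaluation levels; all names are hypothetical.

@dataclass
class StepEval:
    name: str        # e.g. "retrieve_docs", "call_crm_api"
    correct: bool    # step-level: was this individual step right?
    note: str = ""   # behavioral: why the agent acted this way

@dataclass
class TaskEval:
    task_completed: bool                       # task-level: did the agent finish?
    latency_s: float                           # completion time
    steps: list[StepEval] = field(default_factory=list)

    def step_accuracy(self) -> float:
        """Share of individual steps that were correct."""
        if not self.steps:
            return 1.0
        return sum(s.correct for s in self.steps) / len(self.steps)

run = TaskEval(
    task_completed=True,
    latency_s=4.2,
    steps=[
        StepEval("retrieve_docs", correct=True),
        StepEval("call_crm_api", correct=False, note="picked wrong tool first"),
        StepEval("compose_answer", correct=True),
    ],
)
print(run.step_accuracy())  # 2 of 3 steps correct
```

A run like this one would pass a task-level check (the job got done) while still flagging a step-level problem worth investigating, which is exactly the distinction the three levels are meant to capture.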

Why We Chose Langfuse for LLM Observability

There are many LLM observability tools on the market. We chose Langfuse because it combines the features we need for production use with enough flexibility to support diverse AI agents.

  1. Open-source with a self-hosted option. For some projects, data must stay within the company perimeter. Langfuse can be deployed on your own infrastructure via Docker Compose or Kubernetes.
  2. Active development. The Langfuse team releases updates regularly. Recent improvements include an OpenTelemetry-native SDK, enhanced LLM-as-Judge with full execution tracing, and new integrations. This pace of development matters in a rapidly changing technology area.
  3. Broad integrations. Langfuse supports OpenAI, LangChain, LlamaIndex, Pydantic AI, and Vercel AI SDK. For no-code platforms like n8n, integration is possible via APIs, allowing a single tool to cover different parts of the agent ecosystem.
  4. Production-optimized setup. Minimal performance overhead, trace batching, and asynchronous sending make Langfuse suitable for agents running in production.
  5. Prompt management. Centralized prompt versioning and A/B testing make it easier to test and deploy prompt changes without code updates.

These capabilities were important for tool selection, but the real value of Langfuse becomes clearer when we look at how it changes agent debugging and quality control in practice.

From Black Box to Transparent Agent Debugging

Without observability, debugging agents quickly turns into guesswork. A trace in Langfuse shows the full execution path: which prompt was sent to the model, how the model responded, which tools it called, what those tools returned, and how the model interpreted the result.

In practice, this helps identify the specific failure point much faster. The issue may be an outdated document returned during retrieval, a tool call timeout, or incorrectly parsed structured output.
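To make the idea concrete, here is a minimal sketch in plain Python (a hypothetical trace structure, not the Langfuse API; real traces carry far more metadata such as timings, token counts, and model parameters) of walking a trace to find the first failing span:

```python
# Hypothetical flattened trace: one dict per span, in execution order.
trace = [
    {"type": "llm",  "name": "plan",         "error": None},
    {"type": "tool", "name": "search_docs",  "error": None},
    {"type": "tool", "name": "crm_lookup",   "error": "timeout after 10s"},
    {"type": "llm",  "name": "final_answer", "error": None},
]

def first_failure(spans):
    """Return the earliest span with an error, or None if the run was clean."""
    for span in spans:
        if span["error"] is not None:
            return span
    return None

bad = first_failure(trace)
if bad:
    print(f"root cause candidate: {bad['name']} ({bad['error']})")
```

With tracing in place, this "find the earliest broken step" question takes seconds instead of a debugging session, because every span and its error state is already recorded.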

Langfuse also provides performance insights, such as latency per step, token usage, and cost per trace, that help teams answer practical optimization questions: which steps are slowest, which calls drive spend, and where errors cluster.

Together, these insights support data-driven optimization instead of relying on assumptions.

AI as a Judge — How We Cover 22 AI Agents with Automated Evaluation

Human evaluation is the gold standard, but it does not scale. Automated validation works well for objective checks, such as output format or schema validation. AI as a Judge helps cover the gap by using one LLM to evaluate another LLM’s output against defined criteria.

Langfuse provides built-in support for LLM-as-Judge evaluators. You configure an evaluator once, and it automatically runs on new traces (or on a sample to control costs). Each evaluation creates its own trace, which allows you to inspect the judge’s reasoning.
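Evaluator configuration itself lives in the Langfuse UI and API, but the cost-control sampling idea is easy to sketch. Hashing the trace ID (a hypothetical helper, not Langfuse code) makes the judged subset deterministic, so re-runs and backfills stay consistent:

```python
import hashlib

def should_judge(trace_id: str, sample_rate: float = 0.2) -> bool:
    """Deterministically pick ~sample_rate of traces for LLM-as-Judge runs.

    Hashing the trace ID (instead of calling random()) means the same trace
    always gets the same decision, so the judged subset is reproducible.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate

# Same ID always yields the same decision.
print(should_judge("trace-123"), should_judge("trace-123"))
```

The trade-off is standard sampling logic: a 20% rate cuts judge costs five-fold while still surfacing systematic failure modes, though rare one-off failures may be missed.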

Currently, 22 of our 32 agents are covered by automated evaluation. For the remaining 10, either the evaluator configuration is still in progress, or the use case is specific enough that it still requires human review.

We split quality checks into two groups. The AI judge handles routine, well-defined signals that can be scored on every sampled trace. Human reviewers keep the cases where correctness depends on domain context that a judge model cannot reliably assess.

This gives us a scalable evaluation layer: routine quality signals are collected automatically, while problematic traces become visible for deeper analysis. As a result, evaluation becomes part of the agent’s lifecycle rather than a separate manual check after something goes wrong.

How to Create a Reliable Evaluator Prompt

The most common mistake is trying to evaluate too much with a single evaluator. We follow a simpler principle:
one evaluator — one failure mode.

A reliable evaluator prompt should include:

  1. Clear focus — one specific failure mode, such as out-of-scope behavior, hallucination, or wrong tool selection.
  2. Explicit Pass/Fail criteria — what counts as a pass, and what should be marked as a fail.
  3. Examples — 2–3 annotated examples to calibrate the evaluator.
  4. Structured output — a defined format for the score and reasoning.
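Put together, an evaluator following these four elements might look like the sketch below. The wording, the failure mode, and the JSON schema are illustrative, not our production prompt:

```python
import json

# Illustrative evaluator for exactly one failure mode: out-of-scope behavior.
EVALUATOR_PROMPT = """You are a strict evaluator. Focus on ONE failure mode:
did the assistant answer outside the allowed scope (company HR policies)?

Pass criteria: the answer stays within HR policy topics, or politely declines.
Fail criteria: the answer gives advice on unrelated topics (legal, medical, etc.).

Examples:
- Q: "How many vacation days do I get?" A: "20 days per year." -> PASS
- Q: "Can my landlord raise the rent?" A: "Usually yes, if..." -> FAIL
- Q: "Can you diagnose my back pain?" A: "I can only help with HR topics." -> PASS

Respond with JSON only: {"verdict": "PASS" or "FAIL", "reasoning": "<one sentence>"}

Question: {question}
Answer: {answer}"""

def parse_verdict(judge_output: str) -> tuple[bool, str]:
    """Parse the judge's structured output into (passed, reasoning)."""
    data = json.loads(judge_output)
    return data["verdict"] == "PASS", data["reasoning"]

# Handling a (made-up) judge response:
passed, why = parse_verdict('{"verdict": "FAIL", "reasoning": "Gave legal advice."}')
print(passed, why)
```

Keeping the output machine-parseable is what lets scores flow back onto traces automatically instead of requiring someone to read each judgment.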

The critical step is validation against human judgment. We split a labeled dataset into a development set for prompt tuning and a test set for final evaluation. Then we measure the true positive and true negative rates.

If the evaluator systematically diverges from human labels, we iterate on the prompt until the automated judgment becomes consistent with the expected evaluation logic.
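The agreement check itself can be as simple as comparing judge verdicts to human labels on the held-out test set. A sketch with made-up data:

```python
# Each pair: (human_label, judge_label); True means the trace PASSED.
test_set = [
    (True, True), (True, True), (True, False),      # one false negative
    (False, False), (False, False), (False, True),  # one false positive
    (True, True), (False, False),
]

def agreement_rates(pairs):
    """True positive rate and true negative rate of the judge vs. human labels."""
    tp = sum(1 for h, j in pairs if h and j)
    fn = sum(1 for h, j in pairs if h and not j)
    tn = sum(1 for h, j in pairs if not h and not j)
    fp = sum(1 for h, j in pairs if not h and j)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr

tpr, tnr = agreement_rates(test_set)
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")  # TPR=0.75 TNR=0.75
```

Measuring both rates matters because they fail differently: a low true negative rate means the judge lets bad outputs through, while a low true positive rate means it floods the team with false alarms.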

Turning Evaluation Results into Continuous Improvement

Evaluation creates value only when it leads to action. The goal is to turn evaluation results into concrete improvements.

Our cycle looks like this:

  1. Collect evaluation data: automated evaluators generate scores on production traces.
  2. Analyze errors: traces with low scores are filtered and grouped by error type.
  3. Form a hypothesis: decide what exactly should be improved, whether the prompt, retrieval logic, tool selection, or another part of the workflow.
  4. Run an experiment: Langfuse lets us feed a dataset through several variants and compare the results.
  5. Interpret and deploy: we look beyond "Variant B is 10% better" and analyze which error categories decreased and why.
  6. Repeat: each deployed improvement generates new data for the next cycle.

Each step feeds the next, forming a loop that makes AI systems measurably better over time. For critical agents, the cycle is wired into CI/CD: a prompt change automatically triggers regression tests on the evaluation dataset.
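A minimal regression gate along these lines (hypothetical thresholds and score format, not our actual pipeline; in practice the scores come from a Langfuse dataset run) just compares the candidate's average score against a stored baseline and fails the build on a meaningful drop:

```python
import sys

BASELINE_AVG = 0.86  # stored score of the currently deployed prompt (hypothetical)
MAX_DROP = 0.02      # tolerated regression before the build fails

def regression_gate(new_scores: list[float]) -> bool:
    """Return True if the candidate prompt passes (no significant score drop)."""
    new_avg = sum(new_scores) / len(new_scores)
    return new_avg >= BASELINE_AVG - MAX_DROP

if __name__ == "__main__":
    candidate_scores = [0.9, 0.8, 0.85, 0.88]  # per-item evaluator scores
    if not regression_gate(candidate_scores):
        sys.exit("Prompt regression detected; blocking deploy.")
    print("evaluation gate passed")
```

A non-zero exit code is enough for any CI system to block the merge, which turns prompt changes into something as safe to ship as ordinary code changes guarded by tests.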

Conclusion

AI agents are systems that reason, act, and adapt across complex workflows. This makes quality assurance a continuous process rather than a one-time check.

Our experience with 32 agents across 4 platforms has shown that a single evaluation approach can span very different stacks: tracing turns agents from black boxes into debuggable systems, AI as a Judge scales quality checks beyond what human review can cover, and agents improve only when evaluation feeds a continuous cycle.

Langfuse has become a strong fit for our AI quality workflow: open-source, actively developed, production-ready, and flexible enough for different platforms and use cases.

If you are planning to implement AI agents or improve the reliability of existing solutions, we would be happy to share our experience in setting up an evaluation pipeline and discuss what this process could look like in your specific context. Book a discovery call to get started. 

FAQ

How can you evaluate AI agents built on different platforms?
You can use a unified evaluation system like Langfuse to assess AI agents built on different platforms — n8n, Flowise, LangChain, and LangGraph — within a single, consistent framework. This eliminates the need for platform-specific testing tools.

What is AI as a Judge?
AI as a Judge is an approach where one LLM automatically evaluates the output of another AI agent. By following the “one evaluator — one failure mode” principle, you can create automated evaluators that cover specific quality dimensions. This method scales well — for example, automated LLM-based evaluators can cover 22 out of 32 agents.

What is Langfuse?
Langfuse is an open-source, actively developed observability platform for LLM applications. It supports tracing, prompt management, and LLM-as-Judge evaluation with full execution tracing — giving you visibility into every step an AI agent takes.

What are the levels of AI agent evaluation?
There are three levels: (1) Task-level — did the agent achieve the goal? (2) Step-level — were the individual steps correct? (3) Behavioral — why did the agent act that way? Each level provides a different depth of insight into agent performance.

How do you continuously improve AI agents?
Follow a continuous improvement cycle: run evaluation, perform error analysis, form a hypothesis about the root cause, design an experiment to fix it, deploy the change, and repeat. This iterative loop ensures agents get better over time based on data, not guesswork.

Learn more from QATestLab

Related Posts:
  1. AI Agents Testing: 5 Practices That Will Ensure Quality[3]
  2. Web Summit 2025: AI Leads the Future, But User Experience Sets the Direction[4]
  3. Key QA & Game Testing Takeaways from Paris Games Week 2025[5]

Endnotes:
  1. LangChain’s State of Agent Engineering report shows: https://go.qatestlab.com/83hdq
  2. [Image]: https://go.qatestlab.com/2mxj2
  3. AI Agents Testing: 5 Practices That Will Ensure Quality: https://blog.qatestlab.com/ai-agents-testing-5-practices-to-ensure-quality/
  4. Web Summit 2025: AI Leads the Future, But User Experience Sets the Direction: https://blog.qatestlab.com/web-summit-2025-ai-leads-the-future-but-user-experience-sets-the-direction/
  5. Key QA & Game Testing Takeaways from Paris Games Week 2025: https://blog.qatestlab.com/key-qa-game-testing-takeaways-from-paris-games-week-2025/

Source URL: https://blog.qatestlab.com/32-ai-agents-across-4-platforms-building-a-robust-evaluation-system-for-ai-solutions/