32 AI Agents Across 4 Platforms: Building a Robust Evaluation System for AI Solutions
AI agents are entering real business workflows, but their quality is hard to measure consistently. A single request can involve reasoning, tool calls, external data, and several decisions, so teams need visibility into the full workflow before trusting it in production.
At QATestLab, we currently operate 32 AI agents across 4 platforms — from corporate knowledge bots to ML solutions for marketing data enrichment and scoring. In this article, I want to share how we built a unified evaluation approach for these agents using Langfuse, tracing, AI as a Judge, and continuous improvement cycles.
Why AI Agents Need a Different Quality Approach
Traditional software is deterministic: the same input produces the same output. Classical ML models add an element of uncertainty, but their behavior remains broadly predictable — a model makes a prediction and waits for the next request.
AI agents work fundamentally differently. They reason, make decisions, invoke tools, analyze results, and continue to act. A single user request can cascade into dozens of LLM calls, API requests, and intermediate decisions. Each step creates new potential points of failure.
LangChain’s State of Agent Engineering report shows that 57% of surveyed teams already run AI agents in production, but quality remains the top production barrier for 32% of respondents. This makes structured evaluation and observability less of a “nice to have” and more of a core requirement for reliable agentic systems.
For AI agents, quality becomes easier to manage when the full execution path is visible: where the agent received context, how it made decisions, which tools it used, where failures appeared, and what can be improved. This turns agent behavior into something teams can measure, analyze, and refine over time.
From No-Code to LangGraph: A Unified Approach to Agent Quality
At QATestLab, we deliberately use different tools for different tasks. The choice of platform depends on the complexity of the logic, customization requirements, and the team that will maintain the solution.
No-code platforms (n8n, Flowise) are used for standard scenarios where speed of deployment and ease of maintenance by non-developers are important. Typical use cases include workflow automations, simple RAG bots, and integrations with corporate systems. Their main advantage is a low barrier to entry and visual debugging.
Code-based solutions (LangChain, LangGraph, Claude Code) are applied in more complex agentic workflows, where control over each step of the reasoning process is required. These include multi-agent systems, advanced business logic, and custom tool integrations. They provide more flexibility, but also require stronger technical expertise.
Regardless of the platform, the evaluation challenges remain the same: non-determinism, cascading failures, and the difficulty of root cause analysis. That’s why we built a unified observability system that covers all 32 agents — even though they run on different platforms.
| Platform Type | Examples | When to Choose | Evaluation Complexity |
|---|---|---|---|
| No-code | n8n, Flowise | Standard scenarios, rapid deployment | Moderate (fewer steps) |
| Low-code | Flowise + custom nodes | Balance of speed and flexibility | Moderate to high |
| Code-based | LangChain, LangGraph | Complex logic, multi-agent | High |
| Hybrid | Claude Code + integrations | R&D, experimentation | Depends on scope |
Why AI Agents Need More Than Traditional QA
In traditional QA, we usually expect the same input to produce the same output. With AI agents, this assumption breaks down.
- Non-determinism at the generation level. The same prompt can elicit different responses due to sampling, variation in tool selection, and changes in context.
- Multi-step workflows. An agent operates through chains of decisions rather than isolated requests. Its behavior depends on conversation history, retrieved documents, and previous steps.
- External dependencies. Agents make real API calls, search for information, and trigger downstream automations. The environment may change between evaluations.
- Cascading failures. An error at an early step can affect the entire workflow. Without tracing, root cause analysis quickly turns into detective work.
We structure the evaluation across three levels:
- Task-level — did the agent complete the overall task? This includes success rate, completion time, and resource consumption.
- Step-level — were individual steps correct? This covers tool selection, response interpretation, and recovery after failures.
- Behavioral — how does the agent “think”? This helps identify decision-making patterns and systematic errors.
This structure allows teams to move beyond checking final outputs and instead evaluate how the agent operates across the full workflow. It creates a clearer view of where failures occur and provides a foundation for consistent improvement.
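To make this concrete, the sketch below shows one way these levels can be recorded per run. The structure and field names are illustrative, not tied to any specific framework:

```python
from dataclasses import dataclass, field

@dataclass
class StepResult:
    """Step-level signal: one tool call or decision within a run."""
    name: str                      # e.g. "retrieve_docs" or "select_tool"
    correct_tool: bool             # was an appropriate tool chosen?
    recovered_from_error: bool = False

@dataclass
class AgentRunReport:
    """Task-level signal for a single run, plus its step breakdown."""
    task_completed: bool           # did the agent finish the overall task?
    latency_seconds: float         # completion time
    total_tokens: int              # resource consumption
    steps: list[StepResult] = field(default_factory=list)

    @property
    def step_accuracy(self) -> float:
        """Share of steps where an appropriate tool was chosen."""
        if not self.steps:
            return 1.0
        return sum(s.correct_tool for s in self.steps) / len(self.steps)
```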
Why We Chose Langfuse for LLM Observability
There are many LLM observability tools on the market. We chose Langfuse because it combines the features we need for production use with enough flexibility to support a diverse set of AI agents.
- Open-source with a self-hosted option. For some projects, data must stay within the company perimeter. Langfuse can be deployed on your own infrastructure via Docker Compose or Kubernetes.
- Active development. The Langfuse team releases updates regularly. Recent improvements include an OpenTelemetry-native SDK, enhanced LLM-as-Judge with full execution tracing, and new integrations. This pace of development matters in a rapidly changing technology area.
- Broad integrations. Langfuse supports OpenAI, LangChain, LlamaIndex, Pydantic AI, and Vercel AI SDK. For no-code platforms like n8n, integration is possible via APIs, allowing a single tool to cover different parts of the agent ecosystem.
- Production-optimized setup. Minimal performance overhead, trace batching, and asynchronous sending make Langfuse suitable for agents running in production.
- Prompt management. Centralized prompt versioning and A/B testing make it easier to test and deploy prompt changes without code updates.
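As an illustration of the last point, fetching a centrally managed prompt at runtime can look roughly like the sketch below. The prompt name and variables are hypothetical, and the exact SDK calls may differ between Langfuse versions:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from the environment

# Fetch the currently deployed version of a prompt managed in Langfuse.
# "support-bot-system" is a hypothetical prompt name used for illustration.
prompt = langfuse.get_prompt("support-bot-system")

# Fill in template variables before sending the prompt to the model.
system_message = prompt.compile(company="QATestLab", tone="concise")
```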
These capabilities were important for tool selection, but the real value of Langfuse becomes clearer when we look at how it changes agent debugging and quality control in practice.
From Black Box to Transparent Agent Debugging
Without observability, debugging agents quickly turns into guesswork. A trace in Langfuse shows the full execution path: which prompt was sent to the model, how the model responded, which tools it called, what those tools returned, and how the model interpreted the result.
In practice, this helps identify the specific failure point much faster. The issue may be an outdated document returned during retrieval, a tool call timeout, or incorrectly parsed structured output.
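For code-based agents, producing such a trace usually means instrumenting the entry point and the individual steps. A minimal sketch using the Langfuse observe decorator; the retrieval and answer functions are hypothetical placeholders, and the exact import path depends on the SDK version you use:

```python
from langfuse import observe  # in older SDK versions: from langfuse.decorators import observe

@observe()  # creates a span for the retrieval step inside the parent trace
def retrieve_docs(query: str) -> list[str]:
    # hypothetical retrieval logic; returns document snippets for the query
    return ["..."]

@observe()  # the top-level call becomes the trace; nested calls become spans
def answer_question(question: str) -> str:
    docs = retrieve_docs(question)
    # hypothetical generation step; with the OpenAI or LangChain integrations,
    # model calls are captured as generations automatically
    return f"Answer based on {len(docs)} documents"
```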
Langfuse also provides performance insights that help teams answer practical optimization questions:
- Latency by step — where is the bottleneck?
- Token usage by component — what is driving the cost?
- Error rates across workflows — where do failures occur most frequently?
Together, these insights support data-driven optimization instead of relying on assumptions.
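These questions can be answered in the Langfuse UI, but the same aggregation is easy to reproduce on exported trace data. A rough sketch, assuming each exported observation carries a step name, a latency, and a token count (the field names below are illustrative):

```python
from collections import defaultdict

def summarize(observations: list[dict]) -> dict:
    """Aggregate latency, token usage, and errors per step from exported trace data."""
    latency = defaultdict(float)
    tokens = defaultdict(int)
    errors = defaultdict(int)
    for obs in observations:
        step = obs["name"]
        latency[step] += obs.get("latency_seconds", 0.0)
        tokens[step] += obs.get("total_tokens", 0)
        errors[step] += 1 if obs.get("level") == "ERROR" else 0
    return {
        "latency_by_step": dict(latency),
        "tokens_by_step": dict(tokens),
        "errors_by_step": dict(errors),
    }
```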
AI as a Judge — How We Cover 22 AI Agents with Automated Evaluation
Human evaluation is the gold standard, but it does not scale. Automated validation works well for objective checks, such as output format or schema validation. AI as a Judge helps cover the gap by using one LLM to evaluate another LLM’s output against defined criteria.
Langfuse provides built-in support for LLM-as-Judge evaluators. You configure an evaluator once, and it automatically runs on new traces (or on a sample to control costs). Each evaluation creates its own trace, which allows you to inspect the judge’s reasoning.
Currently, 22 of our 32 agents are covered by automated evaluation. For the remaining 10, either the evaluator configuration is still in progress, or the use case is specific enough that it still requires human review.
What we evaluate with AI as a Judge:
- Adherence to instructions — whether the required format and tone are followed
- Response relevance — whether the response addresses the request
- Hallucination detection — for RAG scenarios, whether the response is grounded in the retrieved context
- Out-of-scope handling — whether the agent correctly declines requests outside its competence
What we leave for human review:
- Edge cases with complex business logic
- New failure modes that have not yet been formalized
- Evaluator calibration — periodic alignment checks between automated and human judgment
This gives us a scalable evaluation layer: routine quality signals are collected automatically, while problematic traces become visible for deeper analysis. As a result, evaluation becomes part of the agent’s lifecycle rather than a separate manual check after something goes wrong.
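When a judge runs outside Langfuse's built-in evaluators, its verdict can still be attached to the original trace as a score, so the result stays next to the execution it describes. A rough sketch; the exact method name (score in older SDK versions, create_score in newer ones) and the way you obtain the trace id depend on your setup:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* keys from environment variables

# Attach a judge verdict to an existing trace so it appears next to the run it describes.
langfuse.score(
    trace_id="abc-123",              # hypothetical id; normally taken from the tracing context
    name="out_of_scope_handling",
    value=1,                         # 1 = pass, 0 = fail
    comment="Agent correctly declined an out-of-scope request.",
)
```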
How to Create a Reliable Evaluator Prompt
The most common mistake is trying to evaluate too much with a single evaluator. We follow a simpler principle:
one evaluator — one failure mode.
A reliable evaluator prompt should include:
- Clear focus — one specific failure mode, such as out-of-scope behavior, hallucination, or wrong tool selection.
- Explicit Pass/Fail criteria — what counts as a pass, and what should be marked as a fail.
- Examples — 2–3 annotated examples to calibrate the evaluator.
- Structured output — a defined format for the score and reasoning.
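For illustration, an out-of-scope evaluator built around this structure could look like the sketch below. The wording, examples, and output schema are simplified placeholders rather than our production prompt:

```python
OUT_OF_SCOPE_EVALUATOR = """You are evaluating a support agent's response for ONE failure mode only:
handling requests that are outside the agent's competence.

Pass:
- The request is in scope and the agent answers it, OR
- The request is out of scope and the agent politely declines and redirects.

Fail:
- The request is out of scope but the agent attempts to answer it anyway.

Examples:
1. Request: "What is your refund policy?" -> agent answers from the knowledge base -> PASS
2. Request: "Write me a poem about cats." -> agent writes the poem -> FAIL

Return JSON: {"verdict": "PASS" | "FAIL", "reasoning": "<one short sentence>"}

Request: <request>
Response: <response>
"""
```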
The critical step is validation against human judgment. We split a labeled dataset into a development set for prompt tuning and a test set for final evaluation. Then we measure the true positive and true negative rates.
If the evaluator systematically diverges from human labels, we iterate on the prompt until the automated judgment becomes consistent with the expected evaluation logic.
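Measuring that alignment takes little code once a labeled set exists. A minimal sketch, assuming boolean labels where True marks a detected failure:

```python
def agreement_rates(human_labels: list[bool], judge_labels: list[bool]) -> tuple[float, float]:
    """True positive and true negative rate of the judge against human labels."""
    tp = sum(h and j for h, j in zip(human_labels, judge_labels))
    tn = sum((not h) and (not j) for h, j in zip(human_labels, judge_labels))
    positives = sum(human_labels) or 1          # human-labeled failures
    negatives = (len(human_labels) - sum(human_labels)) or 1
    return tp / positives, tn / negatives
```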
Turning Evaluation Results into Continuous Improvement
Evaluation creates value only when it leads to action. The goal is to turn evaluation results into concrete improvements.
Our cycle is straightforward: tracing and automated evaluation surface problematic runs, root cause analysis points to the failing step (prompt, tool call, or retrieval), the fix is shipped, and the updated agent is re-evaluated against the same criteria. Over time, this loop, rather than any single check, is what raises agent quality.
Conclusion
AI agents are systems that reason, act, and adapt across complex workflows. This makes quality assurance a continuous process rather than a one-time check.
Our experience with 32 agents across 4 platforms has shown:
- Observability is foundational. Without tracing, debugging agents becomes guesswork. Set up Langfuse or an equivalent tool before the first production deployment.
- The platform does not define quality. No-code, low-code, or LangGraph — the evaluation challenges remain similar. A unified observability system helps keep the full agent stack under control.
- AI as a Judge can scale evaluation. With the right criteria and the principle of one evaluator — one failure mode, automated evaluation becomes practical for production agents.
- Evaluation should lead to improvement. Without a continuous improvement cycle, evaluation becomes passive monitoring. The real value appears when insights turn into changes.
Langfuse has proven to be a strong fit for our AI quality workflow: open-source, actively developed, production-ready, and flexible enough for different platforms and use cases.
If you are planning to implement AI agents or improve the reliability of existing solutions, we would be happy to share our experience in setting up an evaluation pipeline and discuss what this process could look like in your specific context. Book a discovery call to get started.
