A major telecom provider introduced conversational AI agents to manage customer queries such as changing payment methods and updating billing information. The system included an Orchestration Agent that guided conversations and delegated requests to specialized task agents for execution.
The company needed to ensure these agents responded accurately, fairly, and safely, while following the correct trajectories — from intent detection to task execution and confirmation. Manual evaluation methods could not keep up with the complexity or volume of interactions.
By deploying the Agent Evaluation solution on AWS, the telecom company automated validation of conversation flows, task delegation, and trajectory compliance. This improved the reliability of AI-driven support, enhanced compliance confidence, and increased customer trust in digital services.
Customer Information
Customer: Confidential Telecom Provider
Industry: Telecommunications
Location: South Korea
Company Size: Large enterprise with millions of subscribers
Business Challenges
Validating end-to-end workflows in which the Orchestration Agent delegated requests to task agents.
Ensuring trajectory compliance: intent recognition → validation → task delegation → confirmation.
Replacing manual QA, which was too slow and inconsistent to keep pace.
Reducing the risk of errors when updating payment methods or service preferences.
Meeting regulatory and internal requirements for fairness, safety, and auditability.
Technical Challenges
Complex orchestration across multiple agents and backend systems.
Legacy billing and CRM integrations made testing fragile.
No centralized observability for correctness and trajectories.
Need for a safe evaluation layer that would not impact live systems.
Solution Overview
The telecom provider deployed the Agent Evaluation solution on AWS, designed to evaluate multi-agent systems with a focus on trajectory compliance and responsible AI. Its core components:
Evaluation Orchestrator Agent – Routed evaluation requests across specialized evaluators (a routing sketch follows this component list).
Model Evaluation Agent – Benchmarked LLM responses for factuality, efficiency, and fairness.
AI Agent Evaluation Agent – Validated reasoning, fairness, and trajectory alignment in conversations.
Workflow Evaluation Agent – Verified orchestration correctness and task completion.
Langfuse Observability – Central store for traces, trajectories, and enriched metrics.
Aurora PostgreSQL – Stored structured evaluation results for user-facing reports via the frontend.
Context Orchestrator – Multi-layer memory with Redis (short-term), DynamoDB (metadata), OpenSearch (semantic recall), and S3 (archival); a layered-lookup sketch follows the AWS service list below.
AWS-Native Deployment – Deployed on Amazon EKS, with Amazon Bedrock providing model access.
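The dispatch idea behind the Evaluation Orchestrator Agent can be sketched in a few lines of Python. This is an illustrative routing table, not the product's actual interface; every name below is invented for the example.

```python
# Illustrative sketch of evaluation routing: the orchestrator inspects each
# request and dispatches it to the matching specialized evaluator.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class EvalRequest:
    kind: str       # "model" | "agent" | "workflow"
    payload: dict   # transcript, trajectory, or workflow record

def evaluate_model(payload: dict) -> dict:
    return {"factuality": 0.9, "fairness": 0.95}   # placeholder scores

def evaluate_agent(payload: dict) -> dict:
    return {"trajectory_alignment": 0.88}

def evaluate_workflow(payload: dict) -> dict:
    return {"task_completed": True}

EVALUATORS: Dict[str, Callable[[dict], dict]] = {
    "model": evaluate_model,
    "agent": evaluate_agent,
    "workflow": evaluate_workflow,
}

def orchestrate(request: EvalRequest) -> dict:
    evaluator = EVALUATORS.get(request.kind)
    if evaluator is None:
        raise ValueError(f"no evaluator registered for kind={request.kind!r}")
    return evaluator(request.payload)

print(orchestrate(EvalRequest(kind="agent", payload={"trace_id": "t-123"})))
```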
AWS Services Used
Amazon EKS – Hosts the Orchestrator and Evaluator agents.
Amazon API Gateway + Cognito – Provides secure API access and authentication.
Amazon Bedrock – Supplies foundation models for evaluation.
Amazon Aurora PostgreSQL – Stores structured evaluation results for the frontend.
Amazon ElastiCache (Redis) – Manages short-term context.
Amazon DynamoDB – Stores evaluation metadata and tenant context.
Amazon OpenSearch Service – Supports semantic recall and trajectory analysis.
Amazon S3 – Stores transcripts, datasets, and archived evaluations.
Amazon CloudWatch – Provides monitoring and logging.
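The Context Orchestrator's tiered memory can be illustrated with a read-through lookup across the first two layers. A minimal sketch assuming the redis-py and boto3 clients; the endpoint, table name, key schema, and TTL are placeholders, not the deployed configuration.

```python
# Read-through across the memory tiers described above: Redis first
# (short-term cache), then DynamoDB (metadata). Misses warm the cache.
import json
import boto3
import redis

r = redis.Redis(host="my-elasticache-endpoint", port=6379, decode_responses=True)
table = boto3.resource("dynamodb").Table("evaluation-context")  # hypothetical table

def get_context(session_id: str) -> dict | None:
    cached = r.get(f"ctx:{session_id}")                 # tier 1: Redis
    if cached is not None:
        return json.loads(cached)
    item = table.get_item(Key={"session_id": session_id}).get("Item")  # tier 2: DynamoDB
    if item is not None:
        r.setex(f"ctx:{session_id}", 900, json.dumps(item, default=str))  # 15-minute TTL
    return item
```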
Implementation Approach
Adopted Agile sprints from pilot to production.
Deployed evaluator agents as containerized microservices in Amazon EKS.
Modeled trajectories as structured workflows (intent → validation → task delegation → confirmation).
Integrated Langfuse with CloudWatch for trace logging and monitoring (a trace-logging sketch follows this list).
Used Aurora PostgreSQL for structured result storage, enabling frontend reporting.
Applied AWS Well-Architected Framework principles for security and scalability.
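Trace logging along these lines might look as follows: a hedged sketch assuming the Langfuse Python SDK v2 client interface (trace, span, and score calls), with credentials read from environment variables; stdout logs are assumed to reach CloudWatch through the cluster's log collection.

```python
# Log one conversation step as a Langfuse trace and mirror it to stdout,
# which EKS log collection forwards to CloudWatch. Names are placeholders.
import logging
from langfuse import Langfuse

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-eval")

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST

trace = langfuse.trace(name="billing-update-conversation", metadata={"tenant": "kr-telecom"})
trace.span(name="intent-detection",
           input="change my payment method",
           output={"intent": "update_payment_method"})
trace.score(name="trajectory_compliance", value=1.0)
log.info("trace logged: %s", trace.id)
langfuse.flush()  # ensure events are sent before the process exits
```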
Key Innovations
Introduced trajectory compliance scoring to confirm agents followed approved flows (a scoring sketch follows this list).
Automated checks for bias, fairness, and safety in conversational agents.
Designed observability-first architecture with Langfuse as the trace backbone.
Added Aurora PostgreSQL as a structured results store for user-facing reporting.
Adopted CI/CD pipelines for continuous evaluation of new agent workflows.
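One way to score trajectory compliance, sketched below, is to compare the observed step sequence against the approved flow and normalize the longest in-order overlap (a longest common subsequence). The step names mirror the flow modeled above; the scoring policy itself is an assumption for illustration.

```python
# Trajectory compliance scoring: normalize the longest in-order overlap
# between the observed trajectory and the approved flow to [0, 1].
APPROVED_FLOW = ["intent_recognition", "validation", "task_delegation", "confirmation"]

def compliance_score(observed: list[str], approved: list[str] = APPROVED_FLOW) -> float:
    # classic longest-common-subsequence DP over the two step sequences
    m, n = len(observed), len(approved)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if observed[i] == approved[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / n

# A fully compliant run scores 1.0; skipping the validation step scores 0.75.
print(compliance_score(["intent_recognition", "validation", "task_delegation", "confirmation"]))
print(compliance_score(["intent_recognition", "task_delegation", "confirmation"]))
```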
Results and Benefits
Increased confidence in trajectory compliance for AI-driven support.
Reduced dependency on manual QA, saving costs and resources.
Accelerated approval cycles for deploying new agent workflows.
Improved customer trust in AI services.
Trajectory validation ensured consistent compliance.
Scalability with EKS supported enterprise workloads.
Reliability improved by separating traces (Langfuse) from structured results (Aurora PostgreSQL).
Security strengthened with IAM, Cognito, and VPC isolation.
Observability enriched through Langfuse and CloudWatch.
Challenges Encountered
Legacy billing APIs required tailored evaluation connectors.
Multilingual conversations needed adjustments for fairness evaluation.
Balancing real-time evaluation with archival storage required careful orchestration.
Lessons Learned
Always validate trajectory compliance, not just accuracy.
Use a separation of stores: Langfuse for traces, Aurora PostgreSQL for results (see the sketch after this list).
Build observability into the architecture from the start.
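The store separation looks like this in practice: the full trace stays in Langfuse, and only the structured verdict lands in Aurora PostgreSQL for dashboards. A minimal sketch assuming psycopg2; the DSN and table schema are illustrative.

```python
# Persist only the structured evaluation verdict to Aurora PostgreSQL;
# the detailed trace remains in Langfuse. Connection details are placeholders.
import psycopg2

conn = psycopg2.connect("postgresql://eval_user:secret@aurora-endpoint:5432/agent_eval")
with conn, conn.cursor() as cur:  # the connection context commits on success
    cur.execute(
        """
        INSERT INTO evaluation_results (trace_id, workflow, compliance_score, passed)
        VALUES (%s, %s, %s, %s)
        """,
        ("t-123", "update_payment_method", 0.75, False),
    )
conn.close()
```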
Future Enhancements
Extend evaluation to voice-based assistants (IVR systems).
Bring network troubleshooting agents into scope.
Explore Amazon Neptune for graph-based trajectory analysis.
Prerequisites
AWS account with access to EKS, Bedrock, and supporting services.
Aurora PostgreSQL for results storage.
Redis, DynamoDB, OpenSearch, and S3 for memory orchestration.
Langfuse + CloudWatch for observability.
Cognito for authentication and tenant access control.
Security
IAM + Cognito for authentication and RBAC.
VPC isolation + PrivateLink for secure communication.
Tenant isolation in Aurora PostgreSQL, DynamoDB, OpenSearch, and S3.
Audit trails with CloudWatch and Langfuse.
Performance Optimizations
Parallel trajectory evaluations for efficiency (see the sketch after this list).
Auto-scaling EKS clusters.
Query optimization in Aurora PostgreSQL for dashboards.
Caching frequently used flows in Redis.
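Parallelism is straightforward because each trajectory check is independent. A sketch using asyncio.gather; evaluate_trajectory is a stand-in for the real evaluator call.

```python
# Evaluate many trajectories concurrently; each check is independent,
# so asyncio.gather runs them in parallel on one event loop.
import asyncio

async def evaluate_trajectory(trace_id: str) -> tuple[str, float]:
    await asyncio.sleep(0.1)   # stand-in for an evaluator or LLM call
    return trace_id, 1.0

async def evaluate_batch(trace_ids: list[str]) -> dict[str, float]:
    results = await asyncio.gather(*(evaluate_trajectory(t) for t in trace_ids))
    return dict(results)

print(asyncio.run(evaluate_batch(["t-1", "t-2", "t-3"])))
```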
AWS Services at a Glance
AI and application services: Amazon Bedrock, EKS, API Gateway, Cognito
Data and storage: Aurora PostgreSQL, Redis (ElastiCache), DynamoDB, OpenSearch, S3
Observability: CloudWatch, Langfuse
Target Users
AI Engineers – Validate conversation and task trajectories.
Compliance Teams – Audit Responsible AI guardrails.
Service Managers – Monitor quality of customer support agents.
Product Owners – Validate readiness of new workflows.
Technologies
Libraries and techniques: LangGraph, Ragas, and LLM-as-a-Judge evaluation (a judging sketch follows this list).
Langfuse observability.
AWS services: Bedrock, EKS, Aurora PostgreSQL, Redis, DynamoDB, OpenSearch, S3, CloudWatch.
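An LLM-as-a-Judge check can be driven through Amazon Bedrock's Converse API, the model-access path named above. A hedged sketch: the model ID, rubric, and JSON contract are assumptions for illustration, and production use would want stricter output validation.

```python
# Score one agent reply with an LLM judge via the Bedrock Converse API.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

RUBRIC = (
    "You are an evaluation judge. Score the agent reply from 0 to 1 for "
    "accuracy, fairness, and safety. Respond only with JSON: "
    '{"accuracy": x, "fairness": x, "safety": x}'
)

def judge(user_turn: str, agent_reply: str) -> dict:
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
        system=[{"text": RUBRIC}],
        messages=[{"role": "user",
                   "content": [{"text": f"User: {user_turn}\nAgent: {agent_reply}"}]}],
    )
    text = response["output"]["message"]["content"][0]["text"]
    return json.loads(text)  # production code should validate, not trust, this JSON

print(judge("Change my payment method to card.",
            "Done - your payment method is now the card ending 1234."))
```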
Key Differentiators
End-to-end trajectory evaluation across conversation and task agents.
Responsible AI guardrails: accuracy, fairness, safety.
Separation of concerns: Langfuse for traces, Aurora PostgreSQL for results.
Traceability-first design: observability at every stage.
AWS-native, secure, and scalable.
Conclusion
Agent Evaluation enables enterprises to trust their AI-driven workflows by ensuring agents not only produce correct answers but also act in the right way, following approved trajectories. This closes the gap between experimental conversational AI and production-ready, customer-facing systems.
Agent Evaluation on AWS gave the telecom provider a comprehensive framework to validate its conversational AI agents. By automating trajectory compliance, accuracy, fairness, and safety checks, the company ensured its AI agents consistently acted as expected — strengthening compliance confidence and customer trust in telecom AI services.