A major telecom provider introduced conversational AI agents to manage customer queries such as changing payment methods and updating billing information. The system included an Orchestration Agent that guided conversations and delegated requests to specialized task agents for execution.
The company needed to ensure these agents responded accurately, fairly, and safely, while following the correct trajectories — from intent detection to task execution and confirmation. Manual evaluation methods could not keep up with the complexity or volume of interactions.
By deploying the Agent Evaluation solution on AWS, the telecom company automated validation of conversation flows, task delegation, and trajectory compliance. This improved the reliability of AI-driven support, enhanced compliance confidence, and increased customer trust in digital services.
Customer Information
Customer: Confidential Telecom Provider
Industry: Telecommunications
Location: South Korea
Company Size: Large enterprise with millions of subscribers
Business Challenges
Validating end-to-end workflows in which the Orchestration Agent delegated requests to task agents.
Ensuring trajectory compliance: intent recognition → validation → task delegation → confirmation.
Replacing manual QA, which was too slow and inconsistent to keep pace.
Reducing the risk of errors when updating payment methods or service preferences.
Meeting regulatory and internal requirements for fairness, safety, and auditability.
Technical Challenges
Complex orchestration across multiple agents and backend systems.
Legacy billing and CRM integrations made testing fragile.
No centralized observability for correctness and trajectories.
Need for a safe evaluation layer that would not impact live systems.
Solution Overview
The telecom provider deployed the Agent Evaluation solution on AWS, designed to evaluate multi-agent systems with a focus on trajectory compliance and responsible AI. Its core components:
Evaluation Orchestrator Agent – Routed evaluation requests across specialized evaluators (a routing sketch follows this component list).
Model Evaluation Agent – Benchmarked LLM responses for factuality, efficiency, and fairness.
AI Agent Evaluation Agent – Validated reasoning, fairness, and trajectory alignment in conversations.
Workflow Evaluation Agent – Verified orchestration correctness and task completion.
Langfuse Observability – Central store for traces, trajectories, and enriched metrics.
Aurora PostgreSQL – Stored structured evaluation results for user-facing reports via the frontend.
Context Orchestrator – Multi-layer memory with Redis (short-term), DynamoDB (metadata), OpenSearch (semantic recall), and S3 (archival); a layered-lookup sketch follows the AWS service list below.
AWS-Native Deployment – Deployed on Amazon EKS, with Amazon Bedrock providing model access.
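The dispatch idea behind the Evaluation Orchestrator Agent can be sketched in a few lines of Python. This is an illustrative routing table, not the product's actual interface; every name below is invented for the example.

```python
# Illustrative sketch of evaluation routing: the orchestrator inspects each
# request and dispatches it to the matching specialized evaluator.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class EvalRequest:
    kind: str       # "model" | "agent" | "workflow"
    payload: dict   # transcript, trajectory, or workflow record

def evaluate_model(payload: dict) -> dict:
    return {"factuality": 0.9, "fairness": 0.95}   # placeholder scores

def evaluate_agent(payload: dict) -> dict:
    return {"trajectory_alignment": 0.88}

def evaluate_workflow(payload: dict) -> dict:
    return {"task_completed": True}

EVALUATORS: Dict[str, Callable[[dict], dict]] = {
    "model": evaluate_model,
    "agent": evaluate_agent,
    "workflow": evaluate_workflow,
}

def orchestrate(request: EvalRequest) -> dict:
    evaluator = EVALUATORS.get(request.kind)
    if evaluator is None:
        raise ValueError(f"no evaluator registered for kind={request.kind!r}")
    return evaluator(request.payload)

print(orchestrate(EvalRequest(kind="agent", payload={"trace_id": "t-123"})))
```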
AWS Services Used
Amazon EKS – Hosts the Orchestrator and Evaluator agents.
Amazon API Gateway + Cognito – Provides secure API access and authentication.
Amazon Bedrock – Supplies foundation models for evaluation.
Amazon Aurora PostgreSQL – Stores structured evaluation results for the frontend.
Amazon ElastiCache (Redis) – Manages short-term context.
Amazon DynamoDB – Stores evaluation metadata and tenant context.
Amazon OpenSearch Service – Supports semantic recall and trajectory analysis.
Amazon S3 – Stores transcripts, datasets, and archived evaluations.
Amazon CloudWatch – Provides monitoring and logging.
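The Context Orchestrator's tiered memory can be illustrated with a read-through lookup across the first two layers. A minimal sketch assuming the redis-py and boto3 clients; the endpoint, table name, key schema, and TTL are placeholders, not the deployed configuration.

```python
# Read-through across the memory tiers described above: Redis first
# (short-term cache), then DynamoDB (metadata). Misses warm the cache.
import json
import boto3
import redis

r = redis.Redis(host="my-elasticache-endpoint", port=6379, decode_responses=True)
table = boto3.resource("dynamodb").Table("evaluation-context")  # hypothetical table

def get_context(session_id: str) -> dict | None:
    cached = r.get(f"ctx:{session_id}")                 # tier 1: Redis
    if cached is not None:
        return json.loads(cached)
    item = table.get_item(Key={"session_id": session_id}).get("Item")  # tier 2: DynamoDB
    if item is not None:
        r.setex(f"ctx:{session_id}", 900, json.dumps(item, default=str))  # 15-minute TTL
    return item
```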
Implementation Approach
Adopted Agile sprints from pilot to production.
Deployed evaluator agents as containerized microservices in Amazon EKS.
Modeled trajectories as structured workflows (intent → validation → task delegation → confirmation).
Integrated Langfuse with CloudWatch for trace logging and monitoring (a trace-logging sketch follows this list).
Used Aurora PostgreSQL for structured result storage, enabling frontend reporting.
Applied AWS Well-Architected Framework principles for security and scalability.
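Trace logging along these lines might look as follows: a hedged sketch assuming the Langfuse Python SDK v2 client interface (trace, span, and score calls), with credentials read from environment variables; stdout logs are assumed to reach CloudWatch through the cluster's log collection.

```python
# Log one conversation step as a Langfuse trace and mirror it to stdout,
# which EKS log collection forwards to CloudWatch. Names are placeholders.
import logging
from langfuse import Langfuse

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-eval")

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST

trace = langfuse.trace(name="billing-update-conversation", metadata={"tenant": "kr-telecom"})
trace.span(name="intent-detection",
           input="change my payment method",
           output={"intent": "update_payment_method"})
trace.score(name="trajectory_compliance", value=1.0)
log.info("trace logged: %s", trace.id)
langfuse.flush()  # ensure events are sent before the process exits
```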
Key Innovations
Introduced trajectory compliance scoring to confirm agents followed approved flows (a scoring sketch follows this list).
Automated checks for bias, fairness, and safety in conversational agents.
Designed observability-first architecture with Langfuse as the trace backbone.
Added Aurora PostgreSQL as a structured results store for user-facing reporting.
Adopted CI/CD pipelines for continuous evaluation of new agent workflows.
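One way to score trajectory compliance, sketched below, is to compare the observed step sequence against the approved flow and normalize the longest in-order overlap (a longest common subsequence). The step names mirror the flow modeled above; the scoring policy itself is an assumption for illustration.

```python
# Trajectory compliance scoring: normalize the longest in-order overlap
# between the observed trajectory and the approved flow to [0, 1].
APPROVED_FLOW = ["intent_recognition", "validation", "task_delegation", "confirmation"]

def compliance_score(observed: list[str], approved: list[str] = APPROVED_FLOW) -> float:
    # classic longest-common-subsequence DP over the two step sequences
    m, n = len(observed), len(approved)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if observed[i] == approved[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / n

# A fully compliant run scores 1.0; skipping the validation step scores 0.75.
print(compliance_score(["intent_recognition", "validation", "task_delegation", "confirmation"]))
print(compliance_score(["intent_recognition", "task_delegation", "confirmation"]))
```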
Results and Benefits
Increased confidence in trajectory compliance for AI-driven support.
Reduced dependency on manual QA, saving costs and resources.
Accelerated approval cycles for deploying new agent workflows.
Improved customer trust in AI services.
Trajectory validation ensured consistent compliance.
Scalability with EKS supported enterprise workloads.
Reliability improved by separating traces (Langfuse) from structured results (Aurora PostgreSQL).
Security strengthened with IAM, Cognito, and VPC isolation.
Observability enriched through Langfuse and CloudWatch.
Challenges Encountered
Legacy billing APIs required tailored evaluation connectors.
Multilingual conversations needed adjustments for fairness evaluation.
Balancing real-time evaluation with archival storage required careful orchestration.
Lessons Learned
Always validate trajectory compliance, not just accuracy.
Use a separation of stores: Langfuse for traces, Aurora PostgreSQL for results (see the sketch after this list).
Build observability into the architecture from the start.
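The store separation looks like this in practice: the full trace stays in Langfuse, and only the structured verdict lands in Aurora PostgreSQL for dashboards. A minimal sketch assuming psycopg2; the DSN and table schema are illustrative.

```python
# Persist only the structured evaluation verdict to Aurora PostgreSQL;
# the detailed trace remains in Langfuse. Connection details are placeholders.
import psycopg2

conn = psycopg2.connect("postgresql://eval_user:secret@aurora-endpoint:5432/agent_eval")
with conn, conn.cursor() as cur:  # the connection context commits on success
    cur.execute(
        """
        INSERT INTO evaluation_results (trace_id, workflow, compliance_score, passed)
        VALUES (%s, %s, %s, %s)
        """,
        ("t-123", "update_payment_method", 0.75, False),
    )
conn.close()
```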
Future Enhancements
Extend evaluation to voice-based assistants (IVR systems).
Bring network troubleshooting agents into scope.
Explore Amazon Neptune for graph-based trajectory analysis.
Prerequisites
AWS account with access to EKS, Bedrock, and supporting services.
Aurora PostgreSQL for results storage.
Redis, DynamoDB, OpenSearch, and S3 for memory orchestration.
Langfuse + CloudWatch for observability.
Cognito for authentication and tenant access control.
Security
IAM + Cognito for authentication and RBAC.
VPC isolation + PrivateLink for secure communication.
Tenant isolation in Aurora PostgreSQL, DynamoDB, OpenSearch, and S3.
Audit trails with CloudWatch and Langfuse.
Performance Optimizations
Parallel trajectory evaluations for efficiency (see the sketch after this list).
Auto-scaling EKS clusters.
Query optimization in Aurora PostgreSQL for dashboards.
Caching frequently used flows in Redis.
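Parallelism is straightforward because each trajectory check is independent. A sketch using asyncio.gather; evaluate_trajectory is a stand-in for the real evaluator call.

```python
# Evaluate many trajectories concurrently; each check is independent,
# so asyncio.gather runs them in parallel on one event loop.
import asyncio

async def evaluate_trajectory(trace_id: str) -> tuple[str, float]:
    await asyncio.sleep(0.1)   # stand-in for an evaluator or LLM call
    return trace_id, 1.0

async def evaluate_batch(trace_ids: list[str]) -> dict[str, float]:
    results = await asyncio.gather(*(evaluate_trajectory(t) for t in trace_ids))
    return dict(results)

print(asyncio.run(evaluate_batch(["t-1", "t-2", "t-3"])))
```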
AWS Services at a Glance
AI and application services: Amazon Bedrock, EKS, API Gateway, Cognito
Data and storage: Aurora PostgreSQL, Redis (ElastiCache), DynamoDB, OpenSearch, S3
Observability: CloudWatch, Langfuse
Target Users
AI Engineers – Validate conversation and task trajectories.
Compliance Teams – Audit Responsible AI guardrails.
Service Managers – Monitor quality of customer support agents.
Product Owners – Validate readiness of new workflows.
Technologies
Libraries and techniques: LangGraph, Ragas, and LLM-as-a-Judge evaluation (a judging sketch follows this list).
Langfuse observability.
AWS services: Bedrock, EKS, Aurora PostgreSQL, Redis, DynamoDB, OpenSearch, S3, CloudWatch.
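An LLM-as-a-Judge check can be driven through Amazon Bedrock's Converse API, the model-access path named above. A hedged sketch: the model ID, rubric, and JSON contract are assumptions for illustration, and production use would want stricter output validation.

```python
# Score one agent reply with an LLM judge via the Bedrock Converse API.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

RUBRIC = (
    "You are an evaluation judge. Score the agent reply from 0 to 1 for "
    "accuracy, fairness, and safety. Respond only with JSON: "
    '{"accuracy": x, "fairness": x, "safety": x}'
)

def judge(user_turn: str, agent_reply: str) -> dict:
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
        system=[{"text": RUBRIC}],
        messages=[{"role": "user",
                   "content": [{"text": f"User: {user_turn}\nAgent: {agent_reply}"}]}],
    )
    text = response["output"]["message"]["content"][0]["text"]
    return json.loads(text)  # production code should validate, not trust, this JSON

print(judge("Change my payment method to card.",
            "Done - your payment method is now the card ending 1234."))
```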
Key Differentiators
End-to-end trajectory evaluation across conversation and task agents.
Responsible AI guardrails: accuracy, fairness, safety.
Separation of concerns: Langfuse for traces, Aurora PostgreSQL for results.
Traceability-first design: observability at every stage.
AWS-native, secure, and scalable.
Conclusion
Agent Evaluation enables enterprises to trust their AI-driven workflows by ensuring agents not only produce correct answers but also act in the right way, following approved trajectories. This closes the gap between experimental conversational AI and production-ready, customer-facing systems.
Agent Evaluation on AWS gave the telecom provider a comprehensive framework to validate its conversational AI agents. By automating trajectory compliance, accuracy, fairness, and safety checks, the company ensured its AI agents consistently acted as expected — strengthening compliance confidence and customer trust in telecom AI services.