Partner Solution
Solution Overview
The telecom provider deployed the Agent Evaluation service on AWS to evaluate multi-agent systems, with a focus on trajectory compliance and responsible AI.
Key Components
- Evaluation Orchestrator Agent – Routed evaluation requests across specialized evaluators.
- Model Evaluation Agent – Benchmarked LLM responses for factuality, efficiency, and fairness.
- AI Agent Evaluation Agent – Validated reasoning, fairness, and trajectory alignment in conversations.
- Workflow Evaluation Agent – Verified orchestration correctness and task completion.
- Langfuse Observability – Central store for traces, trajectories, and enriched metrics.
- Aurora PostgreSQL – Stored structured evaluation results for user-facing reports via the frontend.
- Context Orchestrator – Multi-layer memory with Redis (short-term), DynamoDB (metadata), OpenSearch (semantic recall), and S3 (archival); see the sketch after this list.
- AWS-Native Deployment – Deployed on Amazon EKS, with Amazon Bedrock providing model access.
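The Context Orchestrator is the most involved of these components. The sketch below shows one way its four memory layers might be composed; the hosts, table, index, and bucket names are illustrative assumptions, not the provider's actual configuration.

```python
# Minimal sketch of the Context Orchestrator's four memory layers.
# Hosts, table, index, and bucket names are illustrative.
import json

import boto3
import redis
from opensearchpy import OpenSearch

short_term = redis.Redis(host="cache.internal", port=6379)          # Redis: short-term context
metadata = boto3.resource("dynamodb").Table("evaluation-metadata")  # DynamoDB: metadata
search = OpenSearch(hosts=[{"host": "search.internal", "port": 443}],
                    use_ssl=True)                                   # OpenSearch: semantic recall
archive = boto3.client("s3")                                        # S3: archival

def remember(session_id: str, turn: dict) -> None:
    """Write one conversation turn through all four layers."""
    # Short-term: recent turns, expired after an hour.
    key = f"session:{session_id}"
    short_term.rpush(key, json.dumps(turn))
    short_term.expire(key, 3600)
    # Metadata: lightweight record for tenant/session lookups.
    metadata.put_item(Item={"session_id": session_id,
                            "turn_id": turn["id"],
                            "tenant_id": turn["tenant_id"]})
    # Semantic recall: index the text (a real deployment would
    # index embeddings rather than raw text).
    search.index(index="turns", id=turn["id"], body={"text": turn["text"]})
    # Archival: durable copy for audits and replay.
    archive.put_object(Bucket="agent-eval-archive",
                       Key=f"{session_id}/{turn['id']}.json",
                       Body=json.dumps(turn))
```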
AWS Services Used
- Amazon EKS – Hosts Orchestrator and Evaluator agents.
- Amazon API Gateway + Cognito – Provides secure API access and authentication.
- Amazon Bedrock – Supplies foundation models for evaluation.
- Amazon Aurora PostgreSQL – Stores structured evaluation results for the frontend.
- Amazon ElastiCache (Redis) – Manages short-term context.
- Amazon DynamoDB – Stores evaluation metadata and tenant context.
- Amazon OpenSearch Service – Supports semantic recall and trajectory analysis.
- Amazon S3 – Stores transcripts, datasets, and archived evaluations.
- Amazon CloudWatch – Provides monitoring and logging.
Implementation Details
- Adopted Agile sprints from pilot to production.
- Deployed evaluator agents as containerized microservices in Amazon EKS.
- Modeled trajectories as structured workflows (intent → validation → task delegation → confirmation); see the sketch after this list.
- Integrated Langfuse with CloudWatch for trace logging and monitoring.
- Used Aurora PostgreSQL for structured result storage, enabling frontend reporting.
- Applied AWS Well-Architected Framework principles for security and scalability.
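A minimal sketch of the trajectory model, using stage names that mirror the approved flow above. Real evaluators apply richer matching, but an ordered-subsequence check conveys the idea:

```python
# Sketch of a trajectory modeled as an ordered workflow plus a basic
# compliance check. The matching logic is deliberately simplified.
APPROVED_FLOW = ["intent", "validation", "task_delegation", "confirmation"]

def is_compliant(observed: list[str], approved: list[str] = APPROVED_FLOW) -> bool:
    """True if the observed trace visits every approved stage in order.

    Extra stages (retries, clarifying questions) are tolerated, but the
    approved stages must appear as an ordered subsequence.
    """
    it = iter(observed)
    return all(stage in it for stage in approved)

# A retry before confirmation still complies; skipping validation does not.
assert is_compliant(["intent", "validation", "task_delegation", "retry", "confirmation"])
assert not is_compliant(["intent", "task_delegation", "confirmation"])
```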
Innovation and Best Practices
- Introduced trajectory compliance scoring to confirm agents followed approved flows.
- Automated checks for bias, fairness, and safety in conversational agents.
- Designed an observability-first architecture with Langfuse as the trace backbone; see the sketch after this list.
- Added Aurora PostgreSQL as a structured results store for user-facing reporting.
- Adopted CI/CD pipelines for continuous evaluation of new agent workflows.
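As a rough illustration of the observability-first design, the sketch below records one trace per evaluated conversation in Langfuse and mirrors the compliance score into CloudWatch. It assumes the v2-style Langfuse Python SDK (method names differ across SDK versions); the trace, score, and metric names are illustrative.

```python
# Sketch of trace logging with Langfuse plus a CloudWatch metric.
# Assumes v2-style Langfuse SDK; names are illustrative.
import boto3
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment
cloudwatch = boto3.client("cloudwatch")

def record_evaluation(session_id: str, observed_flow: list[str], compliant: bool) -> None:
    # One Langfuse trace per evaluated conversation; the observed
    # trajectory travels along as metadata so it can be replayed later.
    trace = langfuse.trace(name="agent-evaluation",
                           session_id=session_id,
                           metadata={"observed_flow": observed_flow})
    langfuse.score(trace_id=trace.id,
                   name="trajectory_compliance",
                   value=1.0 if compliant else 0.0)
    # Mirror the headline number into CloudWatch for dashboards and alarms.
    cloudwatch.put_metric_data(
        Namespace="AgentEvaluation",
        MetricData=[{"MetricName": "TrajectoryCompliant",
                     "Value": 1.0 if compliant else 0.0,
                     "Unit": "None"}])
```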
Results and Benefits
Business Outcomes
- Increased confidence in trajectory compliance for AI-driven support.
- Reduced dependency on manual QA, saving costs and resources.
- Accelerated approval cycles for deploying new agent workflows.
- Improved customer trust in AI services.
Technical Benefits
- Trajectory validation ensured consistent compliance.
- Scalability with EKS supported enterprise workloads.
- Reliability improved by separating traces (Langfuse) from structured results (Aurora PostgreSQL).
- Security strengthened with IAM, Cognito, and VPC isolation.
- Observability enriched through Langfuse and CloudWatch.
Customer Testimonial
Confidential
Lessons Learned
Challenges Overcome
- Legacy billing APIs required tailored evaluation connectors.
- Multilingual conversations needed adjustments for fairness evaluation.
- Balancing real-time evaluation with archival storage required careful orchestration.
Best Practices
- Always validate trajectory compliance, not just accuracy.
- Separate stores by concern: Langfuse for traces, Aurora PostgreSQL for results.
- Build observability into the architecture from the start.
Future Plans
- Extend evaluation to voice-based assistants (IVR systems).
- Bring network troubleshooting agents into scope.
- Explore Amazon Neptune for graph-based trajectory analysis.
Technical Requirements
- AWS account with access to EKS, Bedrock, and supporting services.
- Aurora PostgreSQL for results storage.
- Redis, DynamoDB, OpenSearch, and S3 for memory orchestration.
- Langfuse + CloudWatch for observability.
- Cognito for authentication and tenant access control.
Security Architecture
- IAM + Cognito for authentication and RBAC.
- VPC isolation + PrivateLink for secure communication.
- Tenant isolation in Aurora PostgreSQL, DynamoDB, OpenSearch, and S3; see the sketch after this list.
- Audit trails with CloudWatch and Langfuse.
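One way to realize tenant isolation in Aurora PostgreSQL is row-level security, with the tenant taken from the Cognito identity. The table, column, and setting names below are hypothetical, and the policy assumes the application connects as a non-owner role (RLS does not bind the table owner by default).

```python
# Sketch of tenant isolation via PostgreSQL row-level security.
# Table, column, and setting names are hypothetical.
import psycopg2

SETUP_SQL = """
ALTER TABLE evaluation_results ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON evaluation_results
    USING (tenant_id = current_setting('app.current_tenant', true));
"""

def results_for_tenant(dsn: str, tenant_id: str) -> list:
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            # Bind the tenant (e.g., from the Cognito token) to this
            # session; the policy then filters every query automatically.
            cur.execute("SELECT set_config('app.current_tenant', %s, false)",
                        (tenant_id,))
            cur.execute("SELECT * FROM evaluation_results")
            return cur.fetchall()
    finally:
        conn.close()
```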
Performance Considerations
- Parallel trajectory evaluations for efficiency.
- Auto-scaling EKS clusters.
- Query optimization in Aurora PostgreSQL for dashboards.
- Caching frequently used flows in Redis; parallel evaluation and caching are sketched after this list.
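A minimal sketch of those two points, assuming evaluations are independent per session; the key names, TTL, and flow loader are illustrative.

```python
# Sketch of parallel trajectory evaluation with an approved-flow cache
# in Redis. Key names, the TTL, and the flow loader are illustrative.
import asyncio
import json

import redis

cache = redis.Redis(host="cache.internal", port=6379)

def load_approved_flow(workflow: str) -> list[str]:
    # Stand-in for a lookup against the structured results store.
    return ["intent", "validation", "task_delegation", "confirmation"]

def approved_flow(workflow: str) -> list[str]:
    """Fetch an approved flow, cached because flows change rarely."""
    cached = cache.get(f"flow:{workflow}")
    if cached is not None:
        return json.loads(cached)
    flow = load_approved_flow(workflow)
    cache.setex(f"flow:{workflow}", 300, json.dumps(flow))
    return flow

async def evaluate_one(session: dict) -> dict:
    flow = approved_flow(session["workflow"])
    it = iter(session["observed"])
    compliant = all(stage in it for stage in flow)  # ordered-subsequence check
    return {"session_id": session["id"], "compliant": compliant}

async def evaluate_all(sessions: list[dict]) -> list[dict]:
    # Evaluations are independent, so they can run concurrently.
    return await asyncio.gather(*(evaluate_one(s) for s in sessions))
```

Calling asyncio.run(evaluate_all(sessions)) then scores a whole batch in one pass instead of session by session.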
Tools and AWS Services Used
- Amazon Bedrock, EKS, API Gateway, Cognito
- Aurora PostgreSQL, Redis (ElastiCache), DynamoDB, OpenSearch, S3
- CloudWatch, Langfuse
Users of Agent Evaluation
- AI Engineers – Validate conversation and task trajectories.
- Compliance Teams – Audit Responsible AI guardrails.
- Service Managers – Monitor quality of customer support agents.
- Product Owners – Validate readiness of new workflows.
Dependencies
- Libraries and techniques: LangGraph, Ragas, and LLM-as-a-Judge evaluation (see the sketch after this list).
- Langfuse observability.
- AWS services: Bedrock, EKS, Aurora PostgreSQL, Redis, DynamoDB, OpenSearch, S3, CloudWatch.
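For illustration, an LLM-as-a-Judge check can be driven through Amazon Bedrock's Converse API, as sketched below; the model ID and the one-line rubric are illustrative choices, not the provider's configuration.

```python
# Sketch of an LLM-as-a-Judge factuality check via Amazon Bedrock's
# Converse API. The model ID and rubric are illustrative.
import boto3

bedrock = boto3.client("bedrock-runtime")

def judge_factuality(question: str, answer: str) -> str:
    prompt = (
        "You are an evaluation judge. Rate the answer's factuality "
        "from 1 (wrong) to 5 (fully correct) and reply with the number only.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    # The judge's verdict is the text of the model's reply.
    return response["output"]["message"]["content"][0]["text"].strip()
```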
Key Benefits and Differentiators
- End-to-end trajectory evaluation across conversation and task agents.
- Responsible AI guardrails: accuracy, fairness, safety.
- Separation of concerns: Langfuse for traces, Aurora PostgreSQL for results.
- Traceability-first design: observability at every stage.
- AWS-native, secure, and scalable.
Value Proposition
Agent Evaluation enables enterprises to trust their AI-driven workflows by ensuring agents not only produce correct answers but also act in the right way, following approved trajectories. This closes the gap between experimental conversational AI and production-ready, customer-facing systems.
Conclusion
Agent Evaluation on AWS gave the telecom provider a comprehensive framework to validate its conversational AI agents. By automating trajectory compliance, accuracy, fairness, and safety checks, the company ensured its AI agents consistently acted as expected — strengthening compliance confidence and customer trust in telecom AI services.