The Brief
A US-based insurance company came to us with a specific problem: their claims processing team was manually reviewing 10,000 claims per month. Each claim required a human to check policy coverage, assess damage documentation, verify fraud indicators, and make an approval decision. The average review time was 22 minutes per claim, and the backlog was growing.
They wanted an AI agent that could handle the routine cases automatically — freeing their team to focus on complex claims, edge cases, and customer escalations. The target: automate 70% of claims with zero increase in error rate.
This is the story of how we built it.
Week 1: Architecture Design
The first decision was the most important: Bedrock Agents or LangGraph?
We chose a hybrid. Bedrock Agents for the individual reasoning steps — policy lookup, damage assessment, fraud scoring — and LangGraph as the orchestrator that routes between them, manages state, and handles the human-in-the-loop escalation path.
The architecture:
- Intake layer: API Gateway + Lambda receives claim data (structured JSON + document URLs)
- Document processing: Amazon Textract extracts text from damage photos and supporting documents. Amazon Comprehend classifies document types.
- Knowledge Base: Bedrock Knowledge Base ingesting 50,000 policy documents, chunked at 512 tokens with hybrid search enabled
- Agent layer: Three specialised Bedrock Agents — Policy Agent, Damage Agent, Fraud Agent — each with a specific tool set and knowledge base
- Orchestration: LangGraph workflow that routes between agents, accumulates evidence, and makes the final approval/escalation decision
- Output layer: DynamoDB for decision storage, SNS for notifications, a React dashboard for the claims team
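The glue between these layers is a shared claim state that accumulates evidence as each agent runs. A minimal sketch of what that state might look like (the field names here are illustrative, not our production schema):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ClaimState:
    """Evidence accumulated as the claim moves through the agents.

    Field names are illustrative; the real schema is richer.
    """
    claim_id: str
    policy_number: str
    document_urls: list = field(default_factory=list)
    coverage_confirmed: Optional[bool] = None   # set by the Policy Agent
    damage_estimate: Optional[float] = None     # set by the Damage Agent
    damage_confidence: Optional[float] = None   # 0.0 to 1.0
    fraud_risk: Optional[float] = None          # set by the Fraud Agent, 0.0 to 1.0
    decision: Optional[str] = None              # "auto_approved" or "escalated"
```

Each LangGraph node reads the fields it needs and writes its own outputs, so the final decision node sees the full picture in one object.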
Weeks 2-3: Building the Knowledge Base
The Knowledge Base was the most labour-intensive part of the build. The 50,000 policy documents came in various formats — PDFs, Word documents, scanned images — and all of them needed to be ingested, chunked, and indexed.
We learned three things the hard way:
Chunk size matters more than you think. Our initial 1024-token chunks produced poor retrieval results for specific policy clauses. Dropping to 512 tokens with 10% overlap improved retrieval precision by 40%.
Document quality is the ceiling. Scanned documents with poor OCR quality produced hallucinations. We added a pre-processing step that runs Textract on all scanned documents and flags low-confidence extractions for human review before ingestion.
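The flagging step itself is simple: Textract returns per-line confidence scores, and anything below a threshold goes to a human before it can enter the index. A sketch, assuming a threshold of 90% (an illustrative value, not our tuned one):

```python
def flag_low_confidence(textract_blocks, threshold=90.0):
    """Split Textract LINE blocks into accepted text and lines needing review.

    textract_blocks is the "Blocks" list from a Textract response; each LINE
    block carries a "Confidence" score from 0 to 100. The 90.0 threshold is
    illustrative; tune it against your own document set.
    """
    accepted, flagged = [], []
    for block in textract_blocks:
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks; LINE granularity was enough for us
        target = accepted if block.get("Confidence", 0.0) >= threshold else flagged
        target.append(block["Text"])
    return accepted, flagged
```

Only the accepted lines are ingested; flagged lines are queued for a human to correct or discard.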
Hybrid search is not optional. Pure vector search missed exact policy numbers and coverage amounts. Enabling hybrid search (semantic + keyword) in Bedrock Knowledge Bases resolved this immediately.
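Enabling it is a one-line configuration change. This is the retrieval configuration shape passed to the Bedrock agent runtime's `retrieve` or `retrieve_and_generate` calls (the result count here is illustrative):

```python
# Passed as retrievalConfiguration to bedrock-agent-runtime's retrieve() /
# retrieve_and_generate(). "HYBRID" combines semantic (vector) scoring with
# keyword matching, which is what catches exact policy numbers and amounts.
retrieval_configuration = {
    "vectorSearchConfiguration": {
        "numberOfResults": 5,            # illustrative, tune per use case
        "overrideSearchType": "HYBRID",  # default is semantic-only search
    }
}
```

Note that hybrid search requires a vector store that supports it, so check your Knowledge Base backend before relying on this.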
Weeks 4-5: The LangGraph Orchestration Layer
The LangGraph workflow has five nodes:
- Intake: Validates claim data, extracts key fields, routes to the appropriate agent sequence
- Policy Check: Policy Agent queries the Knowledge Base for coverage details and limits
- Damage Assessment: Damage Agent analyses photos and documentation, produces a damage estimate with confidence score
- Fraud Scoring: Fraud Agent checks claim history, cross-references with known fraud patterns, produces a risk score
- Decision: Aggregates all agent outputs. If all confidence scores exceed threshold, auto-approves. Otherwise, routes to human review with a structured summary.
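The decision node's core logic is deliberately boring. A dependency-free sketch, with thresholds that are illustrative rather than our tuned production values:

```python
def decide(policy_conf, damage_conf, fraud_risk,
           conf_threshold=0.85, fraud_threshold=0.30):
    """Final decision node: auto-approve only when every agent is confident
    and fraud risk is low; anything else goes to a human.

    All inputs are in [0.0, 1.0]. Thresholds are illustrative values.
    """
    if (policy_conf >= conf_threshold
            and damage_conf >= conf_threshold
            and fraud_risk <= fraud_threshold):
        return "auto_approve"
    return "escalate_to_human"
```

Keeping this node deterministic (no LLM call) made the approval boundary auditable, which mattered a great deal to the client's compliance team.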
The human-in-the-loop path was the most complex to implement correctly. We used LangGraph's interrupt mechanism to pause the workflow, write the pending decision to DynamoDB, and resume when a human reviewer submits their decision via the dashboard.
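Stripped of the LangGraph machinery, the pause/resume pattern looks like this. In the real build, LangGraph's interrupt mechanism and a checkpointer handle persistence; here a plain dict stands in for the DynamoDB table, so this is a sketch of the shape, not the implementation:

```python
# In-memory stand-in for the DynamoDB pending-decisions table.
pending_table = {}

def pause_for_review(claim_id, agent_summary):
    """Persist the pending decision so the workflow can resume later.

    agent_summary is the structured evidence the reviewer sees on the dashboard.
    """
    pending_table[claim_id] = {"status": "PENDING_REVIEW", "summary": agent_summary}
    return "PENDING_REVIEW"

def resume_with_decision(claim_id, reviewer_decision):
    """Called when the dashboard posts the human reviewer's decision."""
    record = pending_table[claim_id]
    record["status"] = "RESOLVED"
    record["decision"] = reviewer_decision  # e.g. "approve" or "deny"
    return record
```

The subtle part in production is idempotency: the resume endpoint must tolerate the reviewer submitting twice, and the workflow must not re-run agents whose outputs were already checkpointed.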
The Mistakes We Made
We underestimated prompt engineering time. Getting the agents to produce consistently structured outputs — JSON with specific fields, confidence scores in a defined range — took two full weeks of iteration. Budget for this.
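Alongside the prompt iteration, we validated every agent reply before trusting it and retried on failure. A minimal sketch of that gate, with illustrative field names:

```python
import json

def validate_agent_output(raw_json):
    """Return the parsed agent reply, or None if it fails validation.

    Checks: valid JSON, a numeric "confidence" field, confidence in [0, 1].
    Field names are illustrative; the caller retries the agent on None.
    """
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return None
    confidence = data.get("confidence")
    if not isinstance(confidence, (int, float)) or isinstance(confidence, bool):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return data
```

A cheap validation-and-retry loop like this turned out to be more reliable than ever-longer prompt instructions alone.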
We did not test edge cases early enough. Claims with missing documentation, claims in languages other than English, claims where the policy had lapsed — these edge cases broke the workflow in ways we did not anticipate. We now build edge case testing into week 2 of every agent project.
We over-engineered the fraud detection. Our initial fraud agent used a complex multi-step reasoning chain that added 8 seconds to every claim. We simplified it to a single-step assessment with a structured rubric and got the same accuracy in 1.2 seconds.
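The simplified version has the agent answer a fixed checklist in a single call, with the risk score computed as a weighted sum of the answers. A sketch of that rubric, where both the checklist items and the weights are made up for illustration, not our production values:

```python
# Illustrative rubric: each item is a yes/no question the Fraud Agent answers
# in one call. Weights sum to 1.0 so the score stays in [0, 1].
RUBRIC_WEIGHTS = {
    "claim_within_30_days_of_policy_start": 0.35,
    "claimant_has_prior_denied_claims": 0.30,
    "damage_estimate_exceeds_coverage_90pct": 0.20,
    "documentation_incomplete": 0.15,
}

def fraud_score(rubric_answers):
    """Weighted sum of boolean rubric answers, giving a risk score in [0, 1]."""
    return sum(weight for key, weight in RUBRIC_WEIGHTS.items()
               if rubric_answers.get(key))
```

Moving the reasoning into the rubric design (done once, offline) instead of a multi-step chain (run on every claim) is what bought the 8-second-to-1.2-second speedup.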
The Result
After a 4-week pilot with 500 claims, the agent achieved:
- 73% auto-approval rate (target was 70%)
- 0.3% error rate on auto-approved claims (below the human baseline of 0.8%)
- Average processing time: 1.4 seconds per claim (vs 22 minutes manual)
- Infrastructure cost: $0.04 per claim at scale
The claims team now handles 27% of claims — the complex ones — instead of 100%. Their job satisfaction scores went up. The backlog cleared in three weeks.
What We Would Do Differently
Start with a smaller scope. We built three agents in parallel. In retrospect, we should have built and validated the Policy Agent alone first, then added the others. The integration complexity of three agents running simultaneously added debugging overhead that a sequential build would have avoided.
If you are evaluating a similar project, talk to us. We can scope the architecture in a single session and give you a realistic timeline and cost estimate based on what we have already built.
