- System Overview
- RAG Methodologies & Agentic Logic
- Prerequisites & Tooling
- Phase 1: Infrastructure Initialization (Terraform)
- Phase 2: Cluster Bootstrapping (Kubernetes)
- Phase 3: The Data Plane (Ray & Databases)
- Phase 4: The Control Plane (API Deployment)
- Phase 5: Data Ingestion Pipeline
- Validation & Testing
- Cost Optimization & Scaling
- Troubleshooting
This repository contains the source code and Infrastructure-as-Code (IaC) definitions for a production-grade Retrieval-Augmented Generation (RAG) system. Unlike standard RAG implementations, this platform utilizes an Agentic Architecture (via LangGraph) to perform multi-step reasoning, query expansion, and hybrid retrieval (Vector + Knowledge Graph).
The system is decoupled into two primary processing planes:
- Control Plane (The Brain): Handles HTTP requests, state management, agent orchestration, and business logic. Runs on low-cost CPU nodes.
- Data Plane (The Muscle): Handles heavy compute tasks including LLM Inference, Embedding generation, and Graph Extraction. Runs on autoscaling GPU nodes via Ray.
This platform implements advanced RAG techniques to solve common failure modes (hallucination, retrieval misses).
Instead of a linear chain, we use LangGraph to model the RAG process as a state machine.
- Planner Node: Analyzes user intent. Decides whether to perform a direct answer, a retrieval, or use a tool (Code Interpreter).
- Query Rewriter: Uses an LLM to rewrite the user's query, resolving coreferences (e.g., changing "How much does it cost?" to "How much does Kubernetes cost?").
- HyDE (Hypothetical Document Embeddings): Generates a fake "ideal" answer, embeds it, and uses that vector to find real documents. This bridges the semantic gap between a question and a declarative statement.
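The node graph above can be sketched as a tiny state machine in plain Python. The production code uses LangGraph; the routing keywords, node names, and the coreference rule below are illustrative assumptions, not the repository's actual logic:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Shared state threaded through every node, as LangGraph would do."""
    query: str
    rewritten: str = ""
    docs: list = field(default_factory=list)
    answer: str = ""

def planner(state: AgentState) -> str:
    """Route to the next node: tool use, direct answer, or retrieval."""
    q = state.query.lower()
    if any(k in q for k in ("calculate", "plot", "run code")):
        return "tool"            # hand off to the code interpreter
    if q in ("hi", "hello", "thanks"):
        return "answer"          # trivial query; skip retrieval entirely
    return "rewrite"             # default path: rewrite -> retrieve

def rewriter(state: AgentState) -> AgentState:
    """Coreference resolution (an LLM call in the real system)."""
    state.rewritten = state.query.replace("it", "Kubernetes")
    return state

# Minimal graph runner: follow edges until we reach the terminal node.
EDGES = {"rewrite": "retrieve", "retrieve": "answer", "tool": "answer"}

def run(state: AgentState) -> AgentState:
    node = planner(state)
    while node != "answer":
        if node == "rewrite":
            state = rewriter(state)
        node = EDGES[node]       # retrieval/tool node bodies elided here
    state.answer = f"answered: {state.rewritten or state.query}"
    return state
```

The point of the state-machine shape is that the planner can loop back (e.g., rewrite again after a retrieval miss), which a linear chain cannot express.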
- Vector Search (Qdrant): Uses BGE-M3 embeddings (dense retrieval) to find semantically similar text chunks.
- Graph Search (Neo4j): Executes Cypher queries to find entities and their relationships (e.g., `(Entity A)-[:RELATED_TO]->(Entity B)`). This captures structural knowledge that vector search misses.
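The guide does not show how the two result sets are merged, so treat this as a hedged sketch: one common approach is Reciprocal Rank Fusion (RRF) over the ranked document IDs returned by Qdrant and Neo4j:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant used in the original RRF paper.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in result_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked IDs from the two backends:
vector_hits = ["chunk-7", "chunk-2", "chunk-9"]   # Qdrant, dense similarity
graph_hits = ["chunk-2", "chunk-4"]               # Neo4j, entity traversal

fused = reciprocal_rank_fusion([vector_hits, graph_hits])
# "chunk-2" ranks first because it appears in both lists.
```

RRF needs only ranks, not scores, which makes it robust when the two backends return incomparable similarity metrics.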
Ensure the following tools are installed on your workstation (Bastion Host):
- AWS CLI (v2.x): Configured with `AdministratorAccess`.
- Terraform (v1.5+): For infrastructure provisioning.
- Kubectl (v1.29+): For Kubernetes interaction.
- Helm (v3.x): For chart management.
- Python (3.10+): For local scripting.
- Docker: For building container images.
We use Terraform to provision the "Hardware" layer: VPC, EKS Control Plane, S3, and RDS.
Terraform requires a backend to store the state file safely.
- Log in to the AWS Console.
- Navigate to S3 and create a bucket named `rag-platform-terraform-state-prod-001` (must be globally unique).
- Navigate to DynamoDB and create a table named `terraform-state-lock`.
  - Partition Key: `LockID` (String).
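The console steps above can equally be scripted. A boto3 sketch, assuming you run it with the same admin credentials; the bucket/table names and region match the rest of this guide:

```python
BUCKET = "rag-platform-terraform-state-prod-001"
TABLE = "terraform-state-lock"
REGION = "us-east-1"  # assumption: the region used elsewhere in this guide

def lock_table_spec(name: str) -> dict:
    """The DynamoDB shape Terraform's S3 backend expects: one LockID string key."""
    return {
        "TableName": name,
        "AttributeDefinitions": [{"AttributeName": "LockID", "AttributeType": "S"}],
        "KeySchema": [{"AttributeName": "LockID", "KeyType": "HASH"}],
        "BillingMode": "PAY_PER_REQUEST",
    }

if __name__ == "__main__":
    import boto3

    s3 = boto3.client("s3", region_name=REGION)
    s3.create_bucket(Bucket=BUCKET)  # us-east-1 needs no CreateBucketConfiguration
    s3.put_bucket_versioning(        # versioning guards against state corruption
        Bucket=BUCKET,
        VersioningConfiguration={"Status": "Enabled"},
    )
    boto3.client("dynamodb", region_name=REGION).create_table(**lock_table_spec(TABLE))
```

Enabling versioning on the state bucket is worth the extra line: it lets you roll back a corrupted or accidentally overwritten state file.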
Navigate to the infrastructure directory:

```bash
cd infra/terraform
```

Initialize the backend and providers:

```bash
terraform init
```

Review the execution plan. This will show creation of:

- VPC: `10.0.0.0/16` with 3 Public, 3 Private, and 3 Database subnets.
- EKS: Cluster named `rag-platform-cluster` (version 1.29).
- RDS: Aurora Postgres Serverless v2.
- IAM: OIDC provider and IRSA roles.

```bash
terraform plan -var="db_password=YourStrongPassword#123" -out=tfplan
```

Apply the infrastructure (estimated time: 20 minutes):

```bash
terraform apply tfplan
```

Once Terraform completes, configure kubectl to communicate with the new cluster:

```bash
aws eks update-kubeconfig --region us-east-1 --name rag-platform-cluster
```

Verify connectivity:

```bash
kubectl get nodes
# Expected output: ip-10-0-x-x.ec2.internal   Ready   <none>   ...   m6i.large
```

The EKS cluster is currently empty. We need to install the core system controllers.
Execute the helper script to install Karpenter (Autoscaler), KubeRay Operator, External Secrets, and Ingress Controller.
```bash
cd scripts
chmod +x bootstrap_cluster.sh
./bootstrap_cluster.sh
```

Karpenter is responsible for analyzing unschedulable pods and spinning up EC2 instances dynamically.
Apply the CPU Provisioner (for API & system pods):

```bash
kubectl apply -f infra/karpenter/provisioner-cpu.yaml
```

- Technical Detail: This targets `m6i` and `c6i` instances and uses Spot pricing where available.

Apply the GPU Provisioner (for AI inference):

```bash
kubectl apply -f infra/karpenter/provisioner-gpu.yaml
```

- Technical Detail: This targets `g5` (NVIDIA A10G) instances. It applies the taint `nvidia.com/gpu=true:NoSchedule` to prevent non-AI pods from accidentally landing on expensive nodes.
We now deploy the "Muscle" of the system.
In a full production environment, you might use managed services provisioned via Terraform (AWS Neptune / Qdrant Cloud), but for this setup we deploy HA clusters inside Kubernetes.
```bash
# Deploy Qdrant
helm upgrade --install qdrant deploy/helm/qdrant --namespace default

# Deploy Neo4j
helm upgrade --install neo4j deploy/helm/neo4j --namespace default
```

The Ray Cluster consists of a Head Node (orchestrator) and Worker Groups.

```bash
kubectl apply -f deploy/ray/ray-cluster.yaml
```

Verification:

```bash
kubectl get pods -l ray.io/cluster=rag-ray-cluster
# Wait until the 'ray-head' pod is Running.
```

We deploy two separate Ray Services. These use the `ServeConfigV2` specification.
A. Embedding Service (BGE-M3):

```bash
kubectl apply -f deploy/ray/ray-serve-embed.yaml
```

B. LLM Service (vLLM / Llama-3-70B): This is the most resource-intensive step.

```bash
kubectl apply -f deploy/ray/ray-serve-llm.yaml
```

What happens technically:

- The `RayService` CRD submits a request to the Ray Head.
- Ray realizes it needs 1 GPU (`nvidia.com/gpu: 1` resource request).
- The Ray Worker pod goes into `Pending` state.
- Karpenter detects the pending pod, calls the AWS Fleet API, and provisions a `g5.xlarge` instance.
- Once the node joins (approx. 90s), the pod starts, downloads the weights (approx. 40 GB) from Hugging Face, and initializes the vLLM engine (PagedAttention).
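Because of this cold-start sequence (node provisioning plus a ~40 GB weight download), a naive client will see connection errors for several minutes. A sketch of a poll-until-ready helper; treating a 200 response from the `/llm` route as "ready" is an assumption based on the endpoint names used in this guide:

```python
import time
import urllib.error
import urllib.request

def backoff_schedule(base: float = 5.0, cap: float = 60.0, attempts: int = 10) -> list[float]:
    """Exponential backoff delays in seconds, capped: 5, 10, 20, 40, 60, 60, ..."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]

def wait_until_ready(url: str, attempts: int = 10) -> bool:
    """Poll the Serve endpoint until it answers 200, backing off between tries."""
    for delay in backoff_schedule(attempts=attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass                 # node still provisioning, or weights still loading
        time.sleep(delay)
    return False

if __name__ == "__main__":
    # In-cluster service DNS, matching the endpoints configured in Phase 4:
    print(wait_until_ready("http://llm-service:8000/llm"))
```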
Create the Kubernetes Secret containing database credentials and keys.
```bash
kubectl create secret generic app-env-secret \
  --from-literal=DATABASE_URL="postgresql+asyncpg://ragadmin:YourStrongPassword%23123@rag-platform-cluster-postgres.cluster-xxxx.us-east-1.rds.amazonaws.com:5432/rag_db" \
  --from-literal=REDIS_URL="redis://rag-redis-prod.xxxx.ng.0001.use1.cache.amazonaws.com:6379/0" \
  --from-literal=NEO4J_PASSWORD="password" \
  --from-literal=JWT_SECRET_KEY="$(openssl rand -hex 32)" \
  --from-literal=QDRANT_HOST="qdrant" \
  --from-literal=RAY_LLM_ENDPOINT="http://llm-service:8000/llm" \
  --from-literal=RAY_EMBED_ENDPOINT="http://embed-service:8000/embed"
```

- Note: Special characters in the database password must be URL-encoded inside the connection string (`#` becomes `%23`); otherwise the URL parser treats everything after `#` as a fragment.

Deploy the FastAPI application using Helm.
```bash
helm upgrade --install api deploy/helm/api
```

Configure the Load Balancer to route traffic to the API.

```bash
kubectl apply -f deploy/ingress/nginx.yaml
```

- Note: Get your Load Balancer DNS name via `kubectl get ingress`, then map your domain (CNAME) to this DNS name.
The system requires data to function. The ingestion pipeline is an asynchronous, distributed Ray Job.
Upload your dataset (PDF, DOCX, HTML) to the S3 bucket created by Terraform.
```bash
# Retrieve bucket name
BUCKET_NAME=$(cd infra/terraform && terraform output -raw s3_documents_bucket_name)

# Run bulk uploader script
python scripts/bulk_upload_s3.py ./data/finance_reports $BUCKET_NAME
```

The ingestion job is normally triggered by S3 events, but it can also be submitted to the Ray Cluster manually.
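As an aside, the bulk uploader invoked above is not listed in this guide; a plausible boto3 sketch of it follows (the argument order, extension filter, and content-type handling are assumptions):

```python
import mimetypes
import sys
from pathlib import Path

def guess_content_type(path: str) -> str:
    """Map a filename to a MIME type so S3 stores documents with correct metadata."""
    ctype, _ = mimetypes.guess_type(path)
    return ctype or "application/octet-stream"

def iter_documents(root: str) -> list[Path]:
    """Collect only the document types the ingestion pipeline can parse."""
    exts = {".pdf", ".docx", ".html"}
    return [p for p in Path(root).rglob("*") if p.suffix.lower() in exts]

if __name__ == "__main__":
    import boto3  # imported here so the helpers above stay dependency-free

    src, bucket = sys.argv[1], sys.argv[2]
    s3 = boto3.client("s3")
    for doc in iter_documents(src):
        s3.upload_file(
            str(doc), bucket, doc.name,
            ExtraArgs={"ContentType": guess_content_type(doc.name)},
        )
        print(f"uploaded {doc.name}")
```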
- Port-forward the Ray Dashboard:

  ```bash
  kubectl port-forward service/rag-ray-cluster-head-svc 8265:8265
  ```

- Submit the job via the Python SDK:

  ```bash
  python -m pipelines.jobs.s3_event_handler
  ```
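Under the hood, `s3_event_handler` presumably submits against the forwarded dashboard via Ray's Job Submission SDK; a hedged sketch of an equivalent explicit submission (the `working_dir` packaging is an assumption about the repo layout):

```python
def entrypoint_command(module: str = "pipelines.jobs.s3_event_handler") -> str:
    """The shell command Ray runs inside the job's driver process."""
    return f"python -m {module}"

if __name__ == "__main__":
    # Requires `pip install "ray[default]"` and the port-forward above.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient("http://localhost:8265")
    job_id = client.submit_job(
        entrypoint=entrypoint_command(),
        # Ship the local pipeline code to the cluster alongside the job:
        runtime_env={"working_dir": "."},
    )
    print(f"submitted: {job_id}")
    print(client.get_job_status(job_id))
```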
Technical Workflow:
- Ray Data reads binaries from S3 lazily.
- MapBatches (CPU): The `unstructured` library parses PDFs (OCR via Tesseract if needed) and chunks text (512 tokens).
- MapBatches (GPU - Embed): Chunks are sent to the `embed-service` Actor.
- MapBatches (GPU - Graph): Chunks are sent to the `llm-service` to extract `(Subject, Predicate, Object)` tuples.
- Write:
  - Vectors -> Qdrant (Upsert).
  - Nodes/Edges -> Neo4j (MERGE queries).
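The workflow above maps onto Ray Data roughly as follows. This is a sketch, not the repository's pipeline: the bucket placeholder, the whitespace-based chunker (standing in for a real tokenizer), and the UTF-8 decode (standing in for `unstructured` parsing) are all simplifying assumptions:

```python
def chunk_text(text: str, max_tokens: int = 512) -> list[str]:
    """Greedy whitespace chunker; the real pipeline counts actual tokens."""
    words = text.split()
    return [
        " ".join(words[i : i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]

if __name__ == "__main__":
    import ray

    # Lazy read: nothing is downloaded until a downstream stage needs it.
    ds = ray.data.read_binary_files("s3://<bucket>/", include_paths=True)

    def parse_and_chunk(batch):
        # Stand-in for the `unstructured` parse step described above.
        texts = [blob.decode("utf-8", errors="ignore") for blob in batch["bytes"]]
        return {"chunk": [c for t in texts for c in chunk_text(t)]}

    chunks = ds.map_batches(parse_and_chunk)
    # Subsequent map_batches stages would call the embed/LLM services,
    # then write vectors to Qdrant and triples to Neo4j as described above.
```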
Verify the API connects to all subsystems.
```bash
curl https://<YOUR_ALB_DNS>/health/readiness
# Expected: {"redis": "up", "neo4j": "up"}
```

Perform a request to verify the agentic flow (authentication required).
- Obtain Token (Dev Mode): Use the `jwt.py` utility, or temporarily disable auth in `config.py` for testing.
- Curl Request:

```bash
curl -X POST https://<YOUR_ALB_DNS>/api/v1/chat/stream \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Analyze the financial risks mentioned in the Q3 report.",
    "session_id": "test-session-1"
  }'
```
The system uses aggressive scaling policies to minimize costs:
- Spot Instances: `provisioner-gpu.yaml` is configured to request Spot instances (`karpenter.sh/capacity-type: spot`). This reduces GPU costs by ~70%.
- Scale-to-Zero:
  - The Ray Autoscaler is configured in `ray-serve-llm.yaml` with `min_replicas: 1` (can be 0 for dev).
  - If `min_replicas` is 0 and no requests arrive, Ray terminates the pod.
  - Karpenter sees the node is empty (TTL 30s) and terminates the EC2 instance.
- Pod Pending (Insufficient CPU/Mem): Check `kubectl describe pod <pod_name>`. If it shows `FailedScheduling`, check whether the Karpenter logs show `launching node`.
- Ray Actor Death: Check the Ray Dashboard at `http://localhost:8265`. The most common cause is OOM (Out of Memory) on the GPU; decrease `max_num_seqs` in `llama-70b.yaml`.
- Database Connection Refused: Ensure the Security Groups in `infra/terraform/vpc.tf` allow traffic on ports 5432 (Postgres), 6333 (Qdrant), and 7687 (Neo4j) from the EKS subnet CIDR.
- Create a feature branch (`git checkout -b feature/amazing-feature`).
- Commit your changes.
- Run tests (`make test`).
- Push to the branch.
- Open a Pull Request.
Distributed under the MIT License. See LICENSE for more information.
