Building an AI-Powered HIV Guidelines Chatbot
HIV Guidelines Chatbot - Technical Documentation
This comprehensive documentation covers the architecture, implementation details, and technical specifications of the Kenya Medical Guidelines RAG (Retrieval-Augmented Generation) System.
Table of Contents
- System Overview
- Backend Architecture
- RAG Implementation
- Data Pipeline
- API Reference
- Caching System
- Deployment & Scaling
1. System Overview
The HIV Guidelines Chatbot is a specialized clinical decision support system designed to provide evidence-based answers from the Kenya HIV Prevention and Treatment Guidelines (2022). It uses a Retrieval-Augmented Generation (RAG) architecture to ensure accuracy and reduce hallucinations.
Key Capabilities
- Domain Specificity: Strictly scoped to HIV/AIDS management in Kenya
- Evidence-Based: Citations and references to specific tables/sections
- Clinical Accuracy: Detailed dosing tables and regimens
- Hybrid Response System: Mix of pre-computed cached responses and real-time RAG generation
Technology Stack
| Component | Technology | Role |
|---|---|---|
| Backend API | FastAPI (Python) | REST API, Streaming, Async processing |
| LLM | Anthropic Claude 3 Haiku | Response generation, reasoning |
| Embeddings | OpenAI text-embedding-3-small | Vector representation of text |
| Vector Store | ChromaDB | Similarity search, document storage |
| Orchestration | LangChain | RAG pipeline, prompt management |
| Frontend | Next.js 15, AI SDK | User interface, chat management |
2. Backend Architecture
The backend is structured as a modular Python application following Clean Architecture principles.
Directory Structure (backend/src/)
```
backend/src/
├── api.py                 # FastAPI entry point & route handlers
├── rag_chain.py           # Core RAG logic & LLM interaction
├── vector_store.py        # ChromaDB management & retrieval
├── document_processor.py  # PDF ingestion & chunking
├── cache_manager.py       # Response caching system
└── config.py              # Configuration & environment variables
```
Core Components
1. API Layer (api.py)
- Handles HTTP requests and Server-Sent Events (SSE) for streaming
- Manages CORS and anti-buffering headers for deployment
- Routes requests to Cache Manager or RAG System
- Key Endpoints:
  - `POST /api/v1/chat/completions`: Main chat endpoint (OpenAI-compatible)
  - `POST /api/query`: Hybrid endpoint (Cache -> RAG)
  - `POST /api/cache/warm`: Triggers background cache warming
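A minimal sketch of the cache-first routing behind `POST /api/query`; `cache_manager` and `rag_chain` are hypothetical module-level singletons, and the method names are assumptions, not the production handler:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    messages: list[dict]

@app.post("/api/query")
async def hybrid_query(request: QueryRequest) -> dict:
    """Serve from the pre-computed cache when possible, else run RAG."""
    query = request.messages[-1]["content"]
    cached = cache_manager.get_response(query)   # hypothetical CacheManager lookup
    if cached is not None:
        return {"cached": True, "response": cached}
    answer = rag_chain.answer(query)             # hypothetical RAG entry point
    return {"cached": False, "response": answer}
```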
2. Configuration (config.py)
- Centralizes environment variables
- Manages paths for data, cache, and vector store
- Defines model parameters (chunk size, overlap, top-k)
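A plausible shape for `config.py`, shown as a sketch: the constant names are assumptions, while the parameter values are the ones documented elsewhere in this section.

```python
import os
from pathlib import Path

# Paths for data, cache, and the persisted vector store
BASE_DIR = Path(__file__).resolve().parent.parent
DATA_DIR = BASE_DIR / "data"
CACHE_DIR = BASE_DIR / "cache"
CHROMA_DIR = BASE_DIR / "chroma_db"

# API keys read from the environment
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")

# Model and retrieval parameters
CHUNK_SIZE = 1000          # characters per chunk
CHUNK_OVERLAP = 200        # characters shared between adjacent chunks
TOP_K = 30                 # documents retrieved per query
RELEVANCE_THRESHOLD = 0.9  # maximum acceptable distance score
```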
3. RAG Implementation
The RAG system (rag_chain.py) is the brain of the application. It follows a Retrieve-Then-Generate pattern with strict guardrails.
The Pipeline
- Input: User query received via API
- Retrieval:
  - `VectorStoreManager` searches for relevant chunks
  - Metric: Cosine similarity
  - Top-K: 30 documents (high-recall strategy)
- Relevance Check:
  - Scores calculated for retrieved documents
  - Threshold check (distance must be < 0.9) to detect out-of-scope queries
- Context Construction: Relevant chunks formatted into a context string
- Generation: Claude 3 Haiku generates response using strict system prompt
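A condensed sketch of this retrieve-then-generate flow using LangChain interfaces; the function layout and prompt wording are illustrative, not the production code:

```python
from langchain_anthropic import ChatAnthropic

RELEVANCE_THRESHOLD = 0.9  # maximum acceptable distance (see config.py)

def answer_query(query: str, vector_store) -> str:
    # 1. Retrieve: top-30 chunks with distance scores (lower = more similar)
    docs_with_scores = vector_store.similarity_search_with_score(query, k=30)

    # 2. Relevance check: reject if nothing retrieved or the best match is too far
    if not docs_with_scores or docs_with_scores[0][1] > RELEVANCE_THRESHOLD:
        return "I cannot find information about this specific question..."

    # 3. Context construction: join the retrieved chunks into one string
    context = "\n\n".join(doc.page_content for doc, _ in docs_with_scores)

    # 4. Generation: Claude 3 Haiku answers strictly from the supplied context
    llm = ChatAnthropic(model="claude-3-haiku-20240307", temperature=0)
    prompt = f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"
    return llm.invoke(prompt).content
```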
System Prompt Engineering
The system prompt enforces:
- Scope Limitation: Rejects non-HIV questions
- Formatting: Markdown tables for regimens
- Tone: Clinical, direct, and comprehensive
- Safety: "Answer First" policy (no clarifying questions for clinical queries)
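An illustrative system prompt that encodes these four rules (the production wording is not reproduced here):

```python
SYSTEM_PROMPT = """You are a clinical assistant for the Kenya HIV Prevention \
and Treatment Guidelines (2022).
- Answer ONLY questions about HIV/AIDS management in Kenya; decline anything else.
- Present regimens and dosing as Markdown tables, citing the source table/section.
- Keep the tone clinical, direct, and comprehensive.
- Answer first: never ask clarifying questions for clinical queries."""
```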
Relevance Guardrails
```python
# Logic to reject out-of-scope queries
if not relevant_docs or relevant_docs[0][1] > RELEVANCE_THRESHOLD:
    return "I cannot find information about this specific question..."
```
4. Data Pipeline
The data pipeline transforms raw PDF guidelines into a searchable vector index.
Ingestion Process (scripts/ingest_documents.py)
- Extraction:
  - Uses `MarkItDown` to convert PDF to Markdown (preserves table structure)
  - Falls back to `pypdf` for raw text extraction
- Chunking:
  - Splitter: `RecursiveCharacterTextSplitter`
  - Chunk Size: 1000 characters
  - Overlap: 200 characters (preserves context across boundaries)
- Embedding:
  - Model: `text-embedding-3-small` (OpenAI)
  - Dimensions: 1536
- Storage:
  - Persisted to local ChromaDB instance (`backend/chroma_db/`)
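The steps above could be wired together roughly as follows; the PDF filename is a placeholder, and the one-shot script structure is an assumption:

```python
from markitdown import MarkItDown
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# 1. Extraction: PDF -> Markdown (table structure survives the conversion)
result = MarkItDown().convert("backend/data/raw/guidelines.pdf")  # placeholder filename

# 2. Chunking: 1000-character chunks with 200-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(result.text_content)

# 3 + 4. Embedding and storage: persist vectors to the local ChromaDB directory
Chroma.from_texts(
    texts=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="backend/chroma_db",
)
```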
File Management
- Raw Data: `backend/data/raw/*.pdf`
- Processed: `backend/data/processed/*.md`
- Vector DB: `backend/chroma_db/`
5. API Reference
Chat Completions
Endpoint: POST /api/v1/chat/completions
Description: OpenAI-compatible endpoint for chat interfaces.
Request Body:
{ "messages": [ {"role": "user", "content": "What are the first-line ART regimens?"} ], "stream": true, "temperature": 0 }
Response (Streamed): Server-Sent Events (SSE) with JSON data chunks.
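For example, a minimal Python client consuming the stream (the host/port and the OpenAI-style `[DONE]` sentinel are assumptions):

```python
import json
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/chat/completions",  # host/port assumed
    json={
        "messages": [{"role": "user", "content": "What are the first-line ART regimens?"}],
        "stream": True,
        "temperature": 0,
    },
    stream=True,
)
for line in resp.iter_lines():
    if line.startswith(b"data: "):
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        # OpenAI-style chunks carry incremental text in choices[0].delta.content
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```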
Hybrid Query
Endpoint: POST /api/query
Description: Checks cache first, then falls back to RAG.
Request Body:
{ "messages": [{"role": "user", "content": "..."}] }
Response:
{ "cached": boolean, "response": "Markdown string...", "topic_id": "string (optional)" }
Cache Management
- `POST /api/cache/warm`: Triggers background cache generation
- `GET /api/cache/stats`: Returns cache hit/miss statistics
- `DELETE /api/cache/clear`: Purges all cached responses
6. Caching System
To reduce latency and LLM costs, a robust caching system (cache_manager.py) pre-computes answers for high-frequency queries.
Strategy
- Predefined Topics: List of ~20 common clinical questions (regimens, dosing, PMTCT)
- Hashing: MD5 hash of normalized query strings used as keys
- Warming: Background task runs RAG pipeline for all topics and saves results
- Storage: JSON file persistence (`backend/cache/responses.json`)
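A sketch of the key derivation and lookup, assuming normalization means lowercasing and collapsing whitespace (the function names are illustrative):

```python
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("backend/cache/responses.json")

def cache_key(query: str) -> str:
    """MD5 hash of the normalized query string."""
    normalized = " ".join(query.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def get_cached_response(query: str) -> str | None:
    """Return a pre-computed answer, or None on a cache miss."""
    if not CACHE_FILE.exists():
        return None
    cache = json.loads(CACHE_FILE.read_text())
    return cache.get(cache_key(query))
```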
Benefits
- Near-Zero Latency: Instant responses, even for complex, table-heavy answers
- Reliability: Guarantees consistent answers for critical guidelines
- Cost Savings: Eliminates API calls for common queries
7. Deployment & Scaling
Deployment Model (Render.com)
- Backend: Python Web Service (Uvicorn/FastAPI)
- Frontend: Node.js Web Service (Next.js)
- Communication: HTTPS REST API
Production Configuration
- Streaming: Nginx buffering disabled via the `X-Accel-Buffering: no` header
- Keep-Alive: Periodic pings prevent timeouts during long generations
- CORS: Restricted to frontend domain in production
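A sketch of how the anti-buffering header and keep-alive pings might be attached to a FastAPI streaming response (the 15-second ping interval is an assumption):

```python
import asyncio
from fastapi.responses import StreamingResponse

async def sse_stream(token_source):
    """Yield SSE data frames, emitting a comment ping whenever generation stalls."""
    it = token_source.__aiter__()
    while True:
        try:
            token = await asyncio.wait_for(it.__anext__(), timeout=15.0)
            yield f"data: {token}\n\n"
        except asyncio.TimeoutError:
            yield ": keep-alive\n\n"  # SSE comment frame; clients ignore it
        except StopAsyncIteration:
            break

def streaming_response(token_source) -> StreamingResponse:
    return StreamingResponse(
        sse_stream(token_source),
        media_type="text/event-stream",
        headers={
            "X-Accel-Buffering": "no",  # stop Nginx from buffering the stream
            "Cache-Control": "no-cache",
        },
    )
```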
Scaling Considerations
- Vector Store: Currently local ChromaDB. For scale, migrate to Chroma Client/Server or Pinecone.
- Concurrency: FastAPI handles async requests; multiple workers can be configured in Uvicorn.
- Cache: Currently file-based. Can be migrated to Redis for distributed caching.
Documentation generated for HIV Guidelines Chatbot v1.0