Building an AI-Powered HIV Guidelines Chatbot
HIV Guidelines Chatbot - Technical Documentation
This comprehensive documentation covers the architecture, implementation details, and technical specifications of the Kenya Medical Guidelines RAG (Retrieval-Augmented Generation) System.
Table of Contents
- System Overview
- Backend Architecture
- RAG Implementation
- Data Pipeline
- API Reference
- Caching System
- Deployment & Scaling
1. System Overview
The HIV Guidelines Chatbot is a specialized clinical decision support system designed to provide evidence-based answers from the Kenya HIV Prevention and Treatment Guidelines (2022). It uses a Retrieval-Augmented Generation (RAG) architecture to ensure accuracy and reduce hallucinations.
Key Capabilities
- Domain Specificity: Strictly scoped to HIV/AIDS management in Kenya
- Evidence-Based: Citations and references to specific tables/sections
- Clinical Accuracy: Detailed dosing tables and regimens
- Hybrid Response System: Mix of pre-computed cached responses and real-time RAG generation
Technology Stack
| Component | Technology | Role |
|---|---|---|
| Backend API | FastAPI (Python) | REST API, Streaming, Async processing |
| LLM | Anthropic Claude 3 Haiku | Response generation, reasoning |
| Embeddings | OpenAI text-embedding-3-small | Vector representation of text |
| Vector Store | ChromaDB | Similarity search, document storage |
| Orchestration | LangChain | RAG pipeline, prompt management |
| Frontend | Next.js 15, AI SDK | User interface, chat management |
2. Backend Architecture
The backend is structured as a modular Python application following Clean Architecture principles.
Directory Structure (backend/src/)
```
backend/src/
├── api.py                 # FastAPI entry point & route handlers
├── rag_chain.py           # Core RAG logic & LLM interaction
├── vector_store.py        # ChromaDB management & retrieval
├── document_processor.py  # PDF ingestion & chunking
├── cache_manager.py       # Response caching system
└── config.py              # Configuration & environment variables
```
Core Components
1. API Layer (api.py)
- Handles HTTP requests and Server-Sent Events (SSE) for streaming
- Manages CORS and anti-buffering headers for deployment
- Routes requests to Cache Manager or RAG System
- Key Endpoints:
  - `POST /api/v1/chat/completions`: Main chat endpoint (OpenAI-compatible)
  - `POST /api/query`: Hybrid endpoint (Cache -> RAG)
  - `POST /api/cache/warm`: Triggers background cache warming
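A minimal sketch of the cache-first routing behind `POST /api/query`; `cache_manager` and `rag_chain` are hypothetical module-level singletons, and the method names are assumptions, not the production handler:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    messages: list[dict]

@app.post("/api/query")
async def hybrid_query(request: QueryRequest) -> dict:
    """Serve from the pre-computed cache when possible, else run RAG."""
    query = request.messages[-1]["content"]
    cached = cache_manager.get_response(query)   # hypothetical CacheManager lookup
    if cached is not None:
        return {"cached": True, "response": cached}
    answer = rag_chain.answer(query)             # hypothetical RAG entry point
    return {"cached": False, "response": answer}
```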
2. Configuration (config.py)
- Centralizes environment variables
- Manages paths for data, cache, and vector store
- Defines model parameters (chunk size, overlap, top-k)
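A plausible shape for `config.py`, shown as a sketch: the constant names are assumptions, while the parameter values are the ones documented elsewhere in this section.

```python
import os
from pathlib import Path

# Paths for data, cache, and the persisted vector store
BASE_DIR = Path(__file__).resolve().parent.parent
DATA_DIR = BASE_DIR / "data"
CACHE_DIR = BASE_DIR / "cache"
CHROMA_DIR = BASE_DIR / "chroma_db"

# API keys read from the environment
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")

# Model and retrieval parameters
CHUNK_SIZE = 1000          # characters per chunk
CHUNK_OVERLAP = 200        # characters shared between adjacent chunks
TOP_K = 30                 # documents retrieved per query
RELEVANCE_THRESHOLD = 0.9  # maximum acceptable distance score
```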
3. RAG Implementation
The RAG system (rag_chain.py) is the brain of the application. It follows a Retrieve-Then-Generate pattern with strict guardrails.
The Pipeline
- Input: User query received via API
- Retrieval:
  - `VectorStoreManager` searches for relevant chunks
  - Metric: Cosine similarity
  - Top-K: 30 documents (high-recall strategy)
- Relevance Check:
  - Scores calculated for retrieved documents
  - Threshold check (distance must be < 0.9) to detect out-of-scope queries
- Context Construction: Relevant chunks formatted into a context string
- Generation: Claude 3 Haiku generates response using strict system prompt
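A condensed sketch of this retrieve-then-generate flow using LangChain interfaces; the function layout and prompt wording are illustrative, not the production code:

```python
from langchain_anthropic import ChatAnthropic

RELEVANCE_THRESHOLD = 0.9  # maximum acceptable distance (see config.py)

def answer_query(query: str, vector_store) -> str:
    # 1. Retrieve: top-30 chunks with distance scores (lower = more similar)
    docs_with_scores = vector_store.similarity_search_with_score(query, k=30)

    # 2. Relevance check: reject if nothing retrieved or the best match is too far
    if not docs_with_scores or docs_with_scores[0][1] > RELEVANCE_THRESHOLD:
        return "I cannot find information about this specific question..."

    # 3. Context construction: join the retrieved chunks into one string
    context = "\n\n".join(doc.page_content for doc, _ in docs_with_scores)

    # 4. Generation: Claude 3 Haiku answers strictly from the supplied context
    llm = ChatAnthropic(model="claude-3-haiku-20240307", temperature=0)
    prompt = f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"
    return llm.invoke(prompt).content
```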
System Prompt Engineering
The system prompt enforces:
- Scope Limitation: Rejects non-HIV questions
- Formatting: Markdown tables for regimens
- Tone: Clinical, direct, and comprehensive
- Safety: "Answer First" policy (no clarifying questions for clinical queries)
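An illustrative system prompt that encodes these four rules (the production wording is not reproduced here):

```python
SYSTEM_PROMPT = """You are a clinical assistant for the Kenya HIV Prevention \
and Treatment Guidelines (2022).
- Answer ONLY questions about HIV/AIDS management in Kenya; decline anything else.
- Present regimens and dosing as Markdown tables, citing the source table/section.
- Keep the tone clinical, direct, and comprehensive.
- Answer first: never ask clarifying questions for clinical queries."""
```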
Relevance Guardrails
```python
# Logic to reject out-of-scope queries
if not relevant_docs or relevant_docs[0][1] > RELEVANCE_THRESHOLD:
    return "I cannot find information about this specific question..."
```
4. Data Pipeline
The data pipeline transforms raw PDF guidelines into a searchable vector index.
Ingestion Process (scripts/ingest_documents.py)
- Extraction:
  - Uses `MarkItDown` to convert PDF to Markdown (preserves table structure)
  - Falls back to `pypdf` for raw text extraction
- Chunking:
  - Splitter: `RecursiveCharacterTextSplitter`
  - Chunk Size: 1000 characters
  - Overlap: 200 characters (preserves context across boundaries)
- Embedding:
  - Model: `text-embedding-3-small` (OpenAI)
  - Dimensions: 1536
- Storage:
  - Persisted to local ChromaDB instance (`backend/chroma_db/`)
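The steps above could be wired together roughly as follows; the PDF filename is a placeholder, and the one-shot script structure is an assumption:

```python
from markitdown import MarkItDown
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# 1. Extraction: PDF -> Markdown (table structure survives the conversion)
result = MarkItDown().convert("backend/data/raw/guidelines.pdf")  # placeholder filename

# 2. Chunking: 1000-character chunks with 200-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(result.text_content)

# 3 + 4. Embedding and storage: persist vectors to the local ChromaDB directory
Chroma.from_texts(
    texts=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="backend/chroma_db",
)
```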
File Management
- Raw Data: `backend/data/raw/*.pdf`
- Processed: `backend/data/processed/*.md`
- Vector DB: `backend/chroma_db/`
5. API Reference
Chat Completions
Endpoint: POST /api/v1/chat/completions
Description: OpenAI-compatible endpoint for chat interfaces.
Request Body:
{ "messages": [ {"role": "user", "content": "What are the first-line ART regimens?"} ], "stream": true, "temperature": 0 }
Response (Streamed): Server-Sent Events (SSE) with JSON data chunks.
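For example, a minimal Python client consuming the stream (the host/port and the OpenAI-style `[DONE]` sentinel are assumptions):

```python
import json
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/chat/completions",  # host/port assumed
    json={
        "messages": [{"role": "user", "content": "What are the first-line ART regimens?"}],
        "stream": True,
        "temperature": 0,
    },
    stream=True,
)
for line in resp.iter_lines():
    if line.startswith(b"data: "):
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        # OpenAI-style chunks carry incremental text in choices[0].delta.content
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```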
Hybrid Query
Endpoint: POST /api/query
Description: Checks cache first, then falls back to RAG.
Request Body:
{ "messages": [{"role": "user", "content": "..."}] }
Response:
{ "cached": boolean, "response": "Markdown string...", "topic_id": "string (optional)" }
Cache Management
- `POST /api/cache/warm`: Triggers background cache generation
- `GET /api/cache/stats`: Returns cache hit/miss statistics
- `DELETE /api/cache/clear`: Purges all cached responses
6. Caching System
To reduce latency and LLM costs, a robust caching system (cache_manager.py) pre-computes answers for high-frequency queries.
Strategy
- Predefined Topics: List of ~20 common clinical questions (regimens, dosing, PMTCT)
- Hashing: MD5 hash of normalized query strings used as keys
- Warming: Background task runs RAG pipeline for all topics and saves results
- Storage: JSON file persistence (`backend/cache/responses.json`)
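A sketch of the key derivation and lookup, assuming normalization means lowercasing and collapsing whitespace (the function names are illustrative):

```python
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("backend/cache/responses.json")

def cache_key(query: str) -> str:
    """MD5 hash of the normalized query string."""
    normalized = " ".join(query.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def get_cached_response(query: str) -> str | None:
    """Return a pre-computed answer, or None on a cache miss."""
    if not CACHE_FILE.exists():
        return None
    cache = json.loads(CACHE_FILE.read_text())
    return cache.get(cache_key(query))
```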
Benefits
- Near-Zero Latency: Instant responses, even for complex, table-heavy answers
- Reliability: Guarantees consistent answers for critical guidelines
- Cost Savings: Eliminates API calls for common queries
7. Deployment & Scaling
Deployment Model (Render.com)
- Backend: Python Web Service (Uvicorn/FastAPI)
- Frontend: Node.js Web Service (Next.js)
- Communication: HTTPS REST API
Production Configuration
- Streaming: Nginx buffering disabled via the `X-Accel-Buffering: no` header
- Keep-Alive: Periodic pings prevent timeouts during long generations
- CORS: Restricted to frontend domain in production
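A sketch of how the anti-buffering header and keep-alive pings might be attached to a FastAPI streaming response (the 15-second ping interval is an assumption):

```python
import asyncio
from fastapi.responses import StreamingResponse

async def sse_stream(token_source):
    """Yield SSE data frames, emitting a comment ping whenever generation stalls."""
    it = token_source.__aiter__()
    while True:
        try:
            token = await asyncio.wait_for(it.__anext__(), timeout=15.0)
            yield f"data: {token}\n\n"
        except asyncio.TimeoutError:
            yield ": keep-alive\n\n"  # SSE comment frame; clients ignore it
        except StopAsyncIteration:
            break

def streaming_response(token_source) -> StreamingResponse:
    return StreamingResponse(
        sse_stream(token_source),
        media_type="text/event-stream",
        headers={
            "X-Accel-Buffering": "no",  # stop Nginx from buffering the stream
            "Cache-Control": "no-cache",
        },
    )
```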
Scaling Considerations
- Vector Store: Currently local ChromaDB. For scale, migrate to Chroma Client/Server or Pinecone.
- Concurrency: FastAPI handles async requests; multiple workers can be configured in Uvicorn.
- Cache: Currently file-based. Can be migrated to Redis for distributed caching.
Documentation generated for HIV Guidelines Chatbot v1.0