
Building an AI-Powered HIV Guidelines Chatbot

December 1, 2025

HIV Guidelines Chatbot - Technical Documentation

This documentation covers the architecture, implementation details, and technical specifications of the Kenya Medical Guidelines RAG (Retrieval-Augmented Generation) System.

📋 Table of Contents

  1. System Overview
  2. Backend Architecture
  3. RAG Implementation
  4. Data Pipeline
  5. API Reference
  6. Caching System
  7. Deployment & Scaling

1. System Overview

The HIV Guidelines Chatbot is a specialized clinical decision support system designed to provide evidence-based answers from the Kenya HIV Prevention and Treatment Guidelines (2022). It uses a Retrieval-Augmented Generation (RAG) architecture to ensure accuracy and reduce hallucinations.

Key Capabilities

  • Domain Specificity: Strictly scoped to HIV/AIDS management in Kenya
  • Evidence-Based: Citations and references to specific tables/sections
  • Clinical Accuracy: Detailed dosing tables and regimens
  • Hybrid Response System: Mix of pre-computed cached responses and real-time RAG generation

Technology Stack

Component       Technology                      Role
Backend API     FastAPI (Python)                REST API, streaming, async processing
LLM             Anthropic Claude 3 Haiku        Response generation, reasoning
Embeddings      OpenAI text-embedding-3-small   Vector representation of text
Vector Store    ChromaDB                        Similarity search, document storage
Orchestration   LangChain                       RAG pipeline, prompt management
Frontend        Next.js 15, AI SDK              User interface, chat management

2. Backend Architecture

The backend is structured as a modular Python application following Clean Architecture principles.

Directory Structure (backend/src/)

backend/src/
├── api.py                # FastAPI entry point & route handlers
├── rag_chain.py          # Core RAG logic & LLM interaction
├── vector_store.py       # ChromaDB management & retrieval
├── document_processor.py # PDF ingestion & chunking
├── cache_manager.py      # Response caching system
└── config.py             # Configuration & environment variables

Core Components

1. API Layer (api.py)

  • Handles HTTP requests and Server-Sent Events (SSE) for streaming
  • Manages CORS and anti-buffering headers for deployment
  • Routes requests to Cache Manager or RAG System
  • Key Endpoints:
    • POST /api/v1/chat/completions: Main chat endpoint (OpenAI compatible)
    • POST /api/query: Hybrid endpoint (Cache -> RAG)
    • POST /api/cache/warm: Triggers background cache warming
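
A sketch of how the streaming chat endpoint might be wired up, assuming rag_chain.py exposes an async token generator (stream_answer is a hypothetical name, not the production code):

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

from rag_chain import stream_answer  # hypothetical async token generator

app = FastAPI()

@app.post("/api/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    query = body["messages"][-1]["content"]

    async def event_stream():
        async for token in stream_answer(query):
            yield f"data: {token}\n\n"  # one SSE frame per token
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")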

2. Configuration (config.py)

  • Centralizes environment variables
  • Manages paths for data, cache, and vector store
  • Defines model parameters (chunk size, overlap, top-k)
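
A plausible shape for config.py, using the paths and parameters quoted elsewhere in this documentation (the variable names are assumptions):

import os
from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent.parent

# Paths
RAW_DATA_DIR = BASE_DIR / "data" / "raw"
PROCESSED_DATA_DIR = BASE_DIR / "data" / "processed"
VECTOR_STORE_DIR = BASE_DIR / "chroma_db"
CACHE_FILE = BASE_DIR / "cache" / "responses.json"

# API keys
ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

# Model & retrieval parameters
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
TOP_K = 30
RELEVANCE_THRESHOLD = 0.9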

3. RAG Implementation

The RAG system (rag_chain.py) is the brain of the application. It follows a Retrieve-Then-Generate pattern with strict guardrails.

The Pipeline

  1. Input: User query received via API
  2. Retrieval: VectorStoreManager searches for relevant chunks
    • Metric: Cosine similarity
    • Top-K: 30 documents (high recall strategy)
  3. Relevance Check:
    • Cosine distance scores are read from the retrieved documents
    • If even the best match exceeds the 0.9 distance threshold, the query is flagged as out-of-scope
  4. Context Construction: Relevant chunks formatted into a context string
  5. Generation: Claude 3 Haiku generates response using strict system prompt
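
Condensed into code, the pipeline might look like this sketch, assuming LangChain's Chroma and Anthropic integrations (the answer helper is illustrative, and SYSTEM_PROMPT is the guardrailed prompt described in the next subsection):

from langchain_anthropic import ChatAnthropic
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vector_store = Chroma(
    persist_directory="backend/chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)
llm = ChatAnthropic(model="claude-3-haiku-20240307", temperature=0)

def answer(query: str) -> str:
    # Steps 1-2: retrieve the top 30 chunks with cosine distances (lower = closer)
    results = vector_store.similarity_search_with_score(query, k=30)
    # Step 3: relevance check; reject if even the best match is too far away
    if not results or results[0][1] > 0.9:
        return "I cannot find information about this specific question..."
    # Step 4: context construction
    context = "\n\n".join(doc.page_content for doc, _ in results)
    # Step 5: generation with the strict system prompt
    messages = [("system", f"{SYSTEM_PROMPT}\n\nContext:\n{context}"), ("human", query)]
    return llm.invoke(messages).content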

System Prompt Engineering

The system prompt enforces:

  • Scope Limitation: Rejects non-HIV questions
  • Formatting: Markdown tables for regimens
  • Tone: Clinical, direct, and comprehensive
  • Safety: "Answer First" policy (no clarifying questions for clinical queries)
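
The production prompt itself is not reproduced here, but a prompt enforcing these four rules might read roughly as follows:

SYSTEM_PROMPT = """You are a clinical decision support assistant answering strictly
from the Kenya HIV Prevention and Treatment Guidelines (2022).

- Scope: answer only questions about HIV/AIDS management in Kenya; decline all others.
- Formatting: present regimens and dosing as Markdown tables, and cite the
  specific guideline tables/sections used.
- Tone: clinical, direct, and comprehensive.
- Safety: answer first; do not ask clarifying questions for clinical queries.
"""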

Relevance Guardrails

# Reject out-of-scope queries.
# relevant_docs holds (document, cosine_distance) pairs, best match first;
# a best-match distance above RELEVANCE_THRESHOLD (0.9) means nothing close was found.
if not relevant_docs or relevant_docs[0][1] > RELEVANCE_THRESHOLD:
    return "I cannot find information about this specific question..."

4. Data Pipeline

The data pipeline transforms raw PDF guidelines into a searchable vector index.

Ingestion Process (scripts/ingest_documents.py)

  1. Extraction:
    • Uses MarkItDown to convert PDF to Markdown (preserves table structure)
    • Fallback to pypdf for raw text extraction
  2. Chunking:
    • Splitter: RecursiveCharacterTextSplitter
    • Chunk Size: 1000 characters
    • Overlap: 200 characters (preserves context across boundaries)
  3. Embedding:
    • Model: text-embedding-3-small (OpenAI)
    • Dimensions: 1536
  4. Storage:
    • Persisted to local ChromaDB instance (backend/chroma_db/)
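
In code, the four steps might look like this sketch (not the actual ingest_documents.py; the PDF filename is illustrative):

from markitdown import MarkItDown
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# 1. Extraction: convert the PDF to Markdown, preserving table structure
result = MarkItDown().convert("backend/data/raw/kenya_hiv_guidelines_2022.pdf")

# 2. Chunking: 1000-character chunks with 200-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(result.text_content)

# 3-4. Embedding and storage: persist 1536-dimensional vectors to local ChromaDB
Chroma.from_texts(
    texts=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="backend/chroma_db",
)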

File Management

  • Raw Data: backend/data/raw/*.pdf
  • Processed: backend/data/processed/*.md
  • Vector DB: backend/chroma_db/

5. API Reference

Chat Completions

Endpoint: POST /api/v1/chat/completions
Description: OpenAI-compatible endpoint for chat interfaces.

Request Body:

{
  "messages": [
    {"role": "user", "content": "What are the first-line ART regimens?"}
  ],
  "stream": true,
  "temperature": 0
}

Response (Streamed): Server-Sent Events (SSE) with JSON data chunks.
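
As a usage sketch, a Python client could consume the stream like this (the host is a placeholder):

import requests

resp = requests.post(
    "https://chatbot-backend.example.com/api/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What are the first-line ART regimens?"}],
        "stream": True,
        "temperature": 0,
    },
    stream=True,
)
for line in resp.iter_lines(decode_unicode=True):
    if line.startswith("data: "):
        print(line[len("data: "):])  # each frame carries a JSON data chunk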

Hybrid Query

Endpoint: POST /api/query
Description: Checks cache first, then falls back to RAG.

Request Body:

{
  "messages": [{"role": "user", "content": "..."}]
}

Response:

{
  "cached": boolean,
  "response": "Markdown string...",
  "topic_id": "string (optional)"
}

Cache Management

  • POST /api/cache/warm: Triggers background cache generation
  • GET /api/cache/stats: Returns cache hit/miss statistics
  • DELETE /api/cache/clear: Purges all cached responses

6. Caching System

To reduce latency and LLM costs, a robust caching system (cache_manager.py) pre-computes answers for high-frequency queries.

Strategy

  • Predefined Topics: List of ~20 common clinical questions (regimens, dosing, PMTCT)
  • Hashing: MD5 hash of normalized query strings used as keys
  • Warming: Background task runs RAG pipeline for all topics and saves results
  • Storage: JSON file persistence (backend/cache/responses.json)
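
A sketch of the key scheme, assuming simple lowercase/whitespace normalization (the helper names are illustrative):

import hashlib
import json

def cache_key(query: str) -> str:
    # Normalize, then hash: keys become fixed-length and filesystem-safe
    normalized = " ".join(query.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def lookup(query: str, cache_path: str = "backend/cache/responses.json"):
    with open(cache_path) as f:
        cache = json.load(f)
    return cache.get(cache_key(query))  # None signals a cache miss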

Benefits

  • Low Latency: Cached answers, including complex regimen tables, return instantly
  • Reliability: Guarantees consistent answers for critical guidelines
  • Cost Savings: Eliminates API calls for common queries

7. Deployment & Scaling

Deployment Model (Render.com)

  • Backend: Python Web Service (Uvicorn/FastAPI)
  • Frontend: Node.js Web Service (Next.js)
  • Communication: HTTPS REST API

Production Configuration

  • Streaming: Nginx buffering disabled via X-Accel-Buffering: no
  • Keep-Alive: Periodic pings to prevent timeout during generation
  • CORS: Restricted to frontend domain in production
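
In FastAPI these headers can be set directly on the streaming response, for example:

from fastapi.responses import StreamingResponse

response = StreamingResponse(
    event_stream(),  # the async generator from the API layer
    media_type="text/event-stream",
    headers={
        "X-Accel-Buffering": "no",  # stop Nginx from buffering SSE chunks
        "Cache-Control": "no-cache",
    },
)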

Scaling Considerations

  1. Vector Store: Currently local ChromaDB. For scale, migrate to Chroma Client/Server or Pinecone.
  2. Concurrency: FastAPI handles async requests; multiple workers can be configured in Uvicorn.
  3. Cache: Currently file-based. Can be migrated to Redis for distributed caching.

Documentation generated for HIV Guidelines Chatbot v1.0