Using Semantic Similarity for Intelligent Data Import
Data from health facilities often arrives in unstructured formats—scattered across Excel files, PDFs, and heterogeneous data systems. The JSI DHIS2 Data Uploader project leverages semantic similarity and AI to automatically match and import this unstructured data into DHIS2, Kenya's national health information system.
This guide provides a comprehensive walkthrough on how to reproduce this project locally, understand its architecture, and use it for your own data matching needs.
The Problem
- Health facility data is often collected in unstructured formats
- Manual data entry into DHIS2 is time-consuming and error-prone
- Data inconsistencies and naming variations cause import failures
- Data import bottlenecks delay reporting and decision-making
The Innovation
Using semantic similarity algorithms to:
- Automatically match variables: Understand that "Total Cases Reported" and "Cases Reported (Total)" refer to the same data element
- Handle naming variations: Bridge gaps between facility naming conventions and standardized DHIS2 data elements
- Reduce manual work: Automate 80%+ of data import tasks
- Improve accuracy: Eliminate human error in manual data entry
Technical Architecture
The application is built using Python and Shiny for Python, leveraging the Sentence Transformers library for semantic matching.
- Frontend: Shiny for Python provides a reactive web interface.
- Backend Logic:
  - app.py: Main application entry point handling UI and server logic.
  - workbook.py: Processes incoming Excel files (DAGU and Manual formats).
  - match.py: Contains the core semantic matching logic using cosine similarity.
- AI Model: Uses all-MiniLM-L6-v2 (the default) to generate embeddings for text comparison.
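To make this concrete, here is a minimal sketch of how the default model scores the two example phrases from the introduction. It uses the public sentence-transformers API and is illustrative only, not an excerpt from match.py.

```python
from sentence_transformers import SentenceTransformer, util

# Load the same compact model the app defaults to
model = SentenceTransformer("all-MiniLM-L6-v2")

# Two phrasings that should resolve to the same DHIS2 data element
incoming = "Total Cases Reported"
standard = "Cases Reported (Total)"

embeddings = model.encode([incoming, standard], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()

# Compare against SIMILARITY_THRESHOLD (0.7 in the sample configuration below)
print(f"Cosine similarity: {score:.3f}")
```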
Reproduction Guide
Follow these steps to set up the project on your local machine.
1. Prerequisites
- Python 3.9+: Ensure Python is installed.
- Git: For version control.
- Virtual Environment: Recommended to keep dependencies isolated.
2. Installation
- Clone the repository:

```bash
git clone https://github.com/danielmaangi/ai-powered-semantic-similarity.git
cd ai-powered-semantic-similarity
```

- Create and activate a virtual environment:

```bash
# Linux/macOS
python3 -m venv venv
source venv/bin/activate

# Windows
python -m venv venv
.\venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```
3. Configuration
The application uses environment variables for configuration. Create a .env file in the root directory:
touch .env
Add the following configuration to .env:
```ini
# Embedding Model Configuration
EMBEDDING_MODEL=all-MiniLM-L6-v2
SIMILARITY_THRESHOLD=0.7
TOP_K_MATCHES=3
ENABLE_FUZZY_MATCHING=true

# DHIS2 Configuration (Optional for local testing if not uploading)
DHIS2_BASE_URL=https://your-dhis2-instance.org
DHIS2_USERNAME=your_username
DHIS2_PASSWORD=your_password
```
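The repository's app.py is the authority on how these values are consumed; the snippet below is only a sketch of the conventional python-dotenv pattern, assuming that is how the settings are read.

```python
import os
from dotenv import load_dotenv

# Read the .env file created above into the process environment
load_dotenv()

EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
SIMILARITY_THRESHOLD = float(os.getenv("SIMILARITY_THRESHOLD", "0.7"))
TOP_K_MATCHES = int(os.getenv("TOP_K_MATCHES", "3"))
ENABLE_FUZZY_MATCHING = os.getenv("ENABLE_FUZZY_MATCHING", "true").lower() == "true"

# May be None when you are only testing the matching locally
DHIS2_BASE_URL = os.getenv("DHIS2_BASE_URL")
```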
4. Data Setup
The application requires specific metadata files to function correctly. These act as the "Ground Truth" for matching. Create a data/ directory and place the following CSV files inside it:
- rrf_categoryCombos_separated.csv: Contains the standardized list of products/items in your system.
  - Required columns: product_id, product, units, id (category option combo ID).
- rrf_dataElements.csv: Metadata for data elements.
- orgunits.csv: List of organization units (facilities).
  - Required columns: orgUnit, username (for mapping users to facilities).
Note: If you don't have these files, you can create dummy CSVs with the required columns to test the interface.
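A quick way to generate such placeholders is a short pandas script like the one below; the values are made up and only the column names matter (rrf_dataElements.csv is omitted because its columns are not listed above).

```python
import os
import pandas as pd

os.makedirs("data", exist_ok=True)

# Hypothetical product metadata with the required columns
pd.DataFrame({
    "product_id": ["P001"],
    "product": ["Amoxicillin 250mg capsules"],
    "units": ["Capsules"],
    "id": ["cocUID00001"],  # category option combo ID
}).to_csv("data/rrf_categoryCombos_separated.csv", index=False)

# Hypothetical organisation unit mapping
pd.DataFrame({
    "orgUnit": ["ouUID0000001"],
    "username": ["facility_user"],
}).to_csv("data/orgunits.csv", index=False)
```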
5. Running the Application
Start the Shiny application from your terminal:
shiny run app.py
The app will launch in your default web browser (usually at http://127.0.0.1:8000).
Usage Workflow
1. Data Processing
- Navigate to the "Data Processing" tab.
- Workbook Source: Select "Manual" or "DAGU" depending on your input file format.
- Upload: Upload your Excel workbook containing unstructured data.
- Process: Click "Process Data". The app will:
- Read the Excel file.
- Clean and normalize text.
- Generate embeddings for the uploaded data.
- Compare them against the rrf_categoryCombos_separated.csv metadata using cosine similarity (a batched sketch of this step follows after this list).
- Display a preview of the processed data.
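Conceptually, the matching step is a batched version of the earlier snippet. Here is a sketch of how top-k candidates could be selected against the threshold; the function and variable names are illustrative, not the actual match.py API.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

incoming_items = ["Amoxicilin 250 mg caps", "ORS sachets"]  # rows from the uploaded workbook
reference_items = ["Amoxicillin 250mg capsules", "Oral Rehydration Salts sachet"]  # from the metadata CSV

incoming_emb = model.encode(incoming_items, convert_to_tensor=True)
reference_emb = model.encode(reference_items, convert_to_tensor=True)

# One query per incoming row; keep the top 3 reference candidates (TOP_K_MATCHES)
hits = util.semantic_search(incoming_emb, reference_emb, top_k=3)

for item, candidates in zip(incoming_items, hits):
    best = candidates[0]
    status = "matched" if best["score"] >= 0.7 else "needs review"
    print(f"{item} -> {reference_items[best['corpus_id']]} ({best['score']:.2f}, {status})")
```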
2. Match Analysis
- Switch to the "Match Analysis" tab.
- Review the Match Statistics (Match rate, confidence scores).
- Examine the Match Review Table to see exactly how incoming items were matched to standardized items.
- Green: High confidence matches.
- Yellow/Red: Low confidence or unmatched items requiring attention.
- Download the Matched Dataset CSV for further analysis.
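The colour coding follows the similarity score. A simple banding rule like the one below is enough to reproduce it on the downloaded CSV; the exact cut-offs used by the app may differ.

```python
def confidence_band(score: float, threshold: float = 0.7) -> str:
    """Map a cosine similarity score to a review category (illustrative cut-offs)."""
    if score >= threshold:
        return "green"   # high confidence, safe to import
    if score >= threshold - 0.15:
        return "yellow"  # borderline, review manually
    return "red"         # low confidence or unmatched
```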
3. DHIS2 Upload (Optional)
- If you configured DHIS2 credentials, you can use the "DHIS2 Upload" tab.
- Test Auth: Verify your credentials.
- Upload: Push the matched and structured data directly to your DHIS2 instance.
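The same two operations can also be reproduced outside the app with plain HTTP against the standard DHIS2 Web API endpoints (/api/me and /api/dataValueSets); the UIDs below are placeholders.

```python
import os
import requests

base = os.getenv("DHIS2_BASE_URL")
auth = (os.getenv("DHIS2_USERNAME"), os.getenv("DHIS2_PASSWORD"))

# Test Auth: /api/me returns details about the authenticated user
resp = requests.get(f"{base}/api/me", auth=auth, timeout=30)
resp.raise_for_status()
print("Authenticated:", resp.status_code)

# Upload: POST a data value set payload (shape shown in Key Code Components below)
payload = {
    "dataValues": [
        {
            "dataElement": "deUID0000001",        # placeholder UIDs
            "categoryOptionCombo": "cocUID00001",
            "orgUnit": "ouUID0000001",
            "period": "202401",
            "value": "42",
        }
    ]
}
resp = requests.post(f"{base}/api/dataValueSets", json=payload, auth=auth, timeout=60)
print("Import response:", resp.status_code)
```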
Key Code Components
If you are reviewing the code, focus on these functions in app.py:
- process_workbook(): Handles the ingestion of messy Excel files.
- ProductMatcher class:
  - Uses SentenceTransformer to encode text.
  - find_best_matches() calculates cosine similarity between the input vectors and your metadata vectors.
- convert_to_dhis2_format(): Transforms the matched tabular data into the specific JSON format required by the DHIS2 Data Value Sets API.
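For orientation, the target shape that convert_to_dhis2_format() has to produce looks roughly like this. This is a hand-rolled sketch, not the function from the repository, and it assumes the matched dataset already carries the relevant DHIS2 UIDs in columns with these names.

```python
import pandas as pd

def to_data_value_set(df: pd.DataFrame, period: str) -> dict:
    """Turn matched rows into a DHIS2 dataValueSets payload (column names assumed)."""
    return {
        "dataValues": [
            {
                "dataElement": row["dataElement"],
                "categoryOptionCombo": row["id"],  # category option combo UID from the metadata CSV
                "orgUnit": row["orgUnit"],
                "period": period,
                "value": str(row["value"]),
            }
            for _, row in df.iterrows()
        ]
    }
```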
Results & Impact
This project demonstrates how AI can bridge the gap between messy real-world data and structured health information systems, significantly reducing the manual burden on health data workers.
GitHub Repository: ai-powered-semantic-similarity