Using Semantic Similarity for Intelligent Data Import
Data from health facilities often arrives in unstructured formats—scattered across Excel files, PDFs, and heterogeneous data systems. The JSI DHIS2 Data Uploader project leverages semantic similarity and AI to automatically match and import this unstructured data into DHIS2, Kenya's national health information system.
This guide provides a comprehensive walkthrough on how to reproduce this project locally, understand its architecture, and use it for your own data matching needs.
The Problem
- Health facility data is often collected in unstructured formats
- Manual data entry into DHIS2 is time-consuming and error-prone
- Data inconsistencies and naming variations cause import failures
- Data import bottlenecks delay reporting and decision-making
The Innovation
Using semantic similarity algorithms to:
- Automatically match variables: Understand that "Total Cases Reported" and "Cases Reported (Total)" refer to the same data element
- Handle naming variations: Bridge gaps between facility naming conventions and standardized DHIS2 data elements
- Reduce manual work: Automate 80%+ of data import tasks
- Improve accuracy: Eliminate human error in manual data entry
Technical Architecture
The application is built using Python and Shiny for Python, leveraging the Sentence Transformers library for semantic matching.
- Frontend: Shiny for Python provides a reactive web interface.
- Backend Logic:
  - app.py: Main application entry point handling UI and server logic.
  - workbook.py: Processes incoming Excel files (DAGU and Manual formats).
  - match.py: Contains the core semantic matching logic using cosine similarity.
- AI Model: Uses all-MiniLM-L6-v2 (the default) to generate embeddings for text comparison.
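To make this concrete, here is a minimal sketch of how the default model scores the two example phrases from the introduction. It uses the public sentence-transformers API and is illustrative only, not an excerpt from match.py.

```python
from sentence_transformers import SentenceTransformer, util

# Load the same compact model the app defaults to
model = SentenceTransformer("all-MiniLM-L6-v2")

# Two phrasings that should resolve to the same DHIS2 data element
incoming = "Total Cases Reported"
standard = "Cases Reported (Total)"

embeddings = model.encode([incoming, standard], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()

# Compare against SIMILARITY_THRESHOLD (0.7 in the sample configuration below)
print(f"Cosine similarity: {score:.3f}")
```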
Reproduction Guide
Follow these steps to set up the project on your local machine.
1. Prerequisites
- Python 3.9+: Ensure Python is installed.
- Git: For version control.
- Virtual Environment: Recommended to keep dependencies isolated.
2. Installation
- Clone the repository:

```bash
git clone https://github.com/danielmaangi/ai-powered-semantic-similarity.git
cd ai-powered-semantic-similarity
```

- Create and activate a virtual environment:

```bash
# Linux/macOS
python3 -m venv venv
source venv/bin/activate

# Windows
python -m venv venv
.\venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```
3. Configuration
The application uses environment variables for configuration. Create a .env file in the root directory:
touch .env
Add the following configuration to .env:
```ini
# Embedding Model Configuration
EMBEDDING_MODEL=all-MiniLM-L6-v2
SIMILARITY_THRESHOLD=0.7
TOP_K_MATCHES=3
ENABLE_FUZZY_MATCHING=true

# DHIS2 Configuration (Optional for local testing if not uploading)
DHIS2_BASE_URL=https://your-dhis2-instance.org
DHIS2_USERNAME=your_username
DHIS2_PASSWORD=your_password
```
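The repository's app.py is the authority on how these values are consumed; the snippet below is only a sketch of the conventional python-dotenv pattern, assuming that is how the settings are read.

```python
import os
from dotenv import load_dotenv

# Read the .env file created above into the process environment
load_dotenv()

EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
SIMILARITY_THRESHOLD = float(os.getenv("SIMILARITY_THRESHOLD", "0.7"))
TOP_K_MATCHES = int(os.getenv("TOP_K_MATCHES", "3"))
ENABLE_FUZZY_MATCHING = os.getenv("ENABLE_FUZZY_MATCHING", "true").lower() == "true"

# May be None when you are only testing the matching locally
DHIS2_BASE_URL = os.getenv("DHIS2_BASE_URL")
```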
4. Data Setup
The application requires specific metadata files to function correctly. These act as the "Ground Truth" for matching. Create a data/ directory and place the following CSV files inside it:
- rrf_categoryCombos_separated.csv: Contains the standardized list of products/items in your system.
  - Required columns: product_id, product, units, id (category option combo ID).
- rrf_dataElements.csv: Metadata for data elements.
- orgunits.csv: List of organization units (facilities).
  - Required columns: orgUnit, username (for mapping users to facilities).
Note: If you don't have these files, you can create dummy CSVs with the required columns to test the interface.
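A quick way to generate such placeholders is a short pandas script like the one below; the values are made up and only the column names matter (rrf_dataElements.csv is omitted because its columns are not listed above).

```python
import os
import pandas as pd

os.makedirs("data", exist_ok=True)

# Hypothetical product metadata with the required columns
pd.DataFrame({
    "product_id": ["P001"],
    "product": ["Amoxicillin 250mg capsules"],
    "units": ["Capsules"],
    "id": ["cocUID00001"],  # category option combo ID
}).to_csv("data/rrf_categoryCombos_separated.csv", index=False)

# Hypothetical organisation unit mapping
pd.DataFrame({
    "orgUnit": ["ouUID0000001"],
    "username": ["facility_user"],
}).to_csv("data/orgunits.csv", index=False)
```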
5. Running the Application
Start the Shiny application from your terminal:
shiny run app.py
The app will launch in your default web browser (usually at http://127.0.0.1:8000).
Usage Workflow
1. Data Processing
- Navigate to the "Data Processing" tab.
- Workbook Source: Select "Manual" or "DAGU" depending on your input file format.
- Upload: Upload your Excel workbook containing unstructured data.
- Process: Click "Process Data". The app will:
- Read the Excel file.
- Clean and normalize text.
- Generate embeddings for the uploaded data.
- Compare them against the rrf_categoryCombos_separated.csv metadata using cosine similarity (a batched sketch of this step follows after this list).
- Display a preview of the processed data.
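Conceptually, the matching step is a batched version of the earlier snippet. Here is a sketch of how top-k candidates could be selected against the threshold; the function and variable names are illustrative, not the actual match.py API.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

incoming_items = ["Amoxicilin 250 mg caps", "ORS sachets"]  # rows from the uploaded workbook
reference_items = ["Amoxicillin 250mg capsules", "Oral Rehydration Salts sachet"]  # from the metadata CSV

incoming_emb = model.encode(incoming_items, convert_to_tensor=True)
reference_emb = model.encode(reference_items, convert_to_tensor=True)

# One query per incoming row; keep the top 3 reference candidates (TOP_K_MATCHES)
hits = util.semantic_search(incoming_emb, reference_emb, top_k=3)

for item, candidates in zip(incoming_items, hits):
    best = candidates[0]
    status = "matched" if best["score"] >= 0.7 else "needs review"
    print(f"{item} -> {reference_items[best['corpus_id']]} ({best['score']:.2f}, {status})")
```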
2. Match Analysis
- Switch to the "Match Analysis" tab.
- Review the Match Statistics (Match rate, confidence scores).
- Examine the Match Review Table to see exactly how incoming items were matched to standardized items.
- Green: High confidence matches.
- Yellow/Red: Low confidence or unmatched items requiring attention.
- Download the Matched Dataset CSV for further analysis.
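The colour coding follows the similarity score. A simple banding rule like the one below is enough to reproduce it on the downloaded CSV; the exact cut-offs used by the app may differ.

```python
def confidence_band(score: float, threshold: float = 0.7) -> str:
    """Map a cosine similarity score to a review category (illustrative cut-offs)."""
    if score >= threshold:
        return "green"   # high confidence, safe to import
    if score >= threshold - 0.15:
        return "yellow"  # borderline, review manually
    return "red"         # low confidence or unmatched
```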
3. DHIS2 Upload (Optional)
- If you configured DHIS2 credentials, you can use the "DHIS2 Upload" tab.
- Test Auth: Verify your credentials.
- Upload: Push the matched and structured data directly to your DHIS2 instance.
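The same two operations can also be reproduced outside the app with plain HTTP against the standard DHIS2 Web API endpoints (/api/me and /api/dataValueSets); the UIDs below are placeholders.

```python
import os
import requests

base = os.getenv("DHIS2_BASE_URL")
auth = (os.getenv("DHIS2_USERNAME"), os.getenv("DHIS2_PASSWORD"))

# Test Auth: /api/me returns details about the authenticated user
resp = requests.get(f"{base}/api/me", auth=auth, timeout=30)
resp.raise_for_status()
print("Authenticated:", resp.status_code)

# Upload: POST a data value set payload (shape shown in Key Code Components below)
payload = {
    "dataValues": [
        {
            "dataElement": "deUID0000001",        # placeholder UIDs
            "categoryOptionCombo": "cocUID00001",
            "orgUnit": "ouUID0000001",
            "period": "202401",
            "value": "42",
        }
    ]
}
resp = requests.post(f"{base}/api/dataValueSets", json=payload, auth=auth, timeout=60)
print("Import response:", resp.status_code)
```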
Key Code Components
If you are reviewing the code, focus on these functions in app.py:
- process_workbook(): Handles the ingestion of messy Excel files.
- ProductMatcher class:
  - Uses SentenceTransformer to encode text.
  - find_best_matches() calculates cosine similarity between the input vectors and your metadata vectors.
- convert_to_dhis2_format(): Transforms the matched tabular data into the specific JSON format required by the DHIS2 Data Value Sets API.
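For orientation, the target shape that convert_to_dhis2_format() has to produce looks roughly like this. This is a hand-rolled sketch, not the function from the repository, and it assumes the matched dataset already carries the relevant DHIS2 UIDs in columns with these names.

```python
import pandas as pd

def to_data_value_set(df: pd.DataFrame, period: str) -> dict:
    """Turn matched rows into a DHIS2 dataValueSets payload (column names assumed)."""
    return {
        "dataValues": [
            {
                "dataElement": row["dataElement"],
                "categoryOptionCombo": row["id"],  # category option combo UID from the metadata CSV
                "orgUnit": row["orgUnit"],
                "period": period,
                "value": str(row["value"]),
            }
            for _, row in df.iterrows()
        ]
    }
```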
Results & Impact
This project demonstrates how AI can bridge the gap between messy real-world data and structured health information systems, significantly reducing the manual burden on health data workers.
GitHub Repository: ai-powered-semantic-similarity