Explainable AI Diabetes Prediction Project


Project Description

This project is an introduction to building explainable machine learning systems for healthcare, using the Pima Indians Diabetes Dataset as the foundation. You’ll start by exploring the data to understand its features and how the diabetic and non-diabetic classes are distributed. From there, you’ll tackle a real-world challenge in medical datasets: class imbalance. You’ll experiment with three different strategies (random undersampling, random oversampling, and synthetic oversampling with SMOTE) to see whether balancing the data improves the model’s ability to detect positive (diabetic) cases.
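The three balancing strategies can be sketched in plain Python. The snippet below is a minimal illustration, not the project's reference solution: the function names (`random_undersample`, `random_oversample`, `smote_like_oversample`), the toy rows, and the fixed seed are all assumptions made for this example. The SMOTE-style function is a simplified version of the idea (interpolating between a minority row and one of its nearest minority neighbours) and assumes binary labels. In a notebook you would more likely reach for imbalanced-learn's `RandomUnderSampler`, `RandomOverSampler`, and `SMOTE` classes.

```python
import math
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def smote_like_oversample(rows, labels, minority_label, k=2, seed=0):
    """Simplified SMOTE-style oversampling for binary labels (an assumption).

    Each synthetic row lies on the line segment between a random minority
    row and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    minority = [r for r, l in zip(rows, labels) if l == minority_label]
    n_needed = max((len(rows) - len(minority)) - len(minority), 0)
    synthetic = []
    for _ in range(n_needed):
        base = rng.choice(minority)
        neighbours = sorted(
            (r for r in minority if r is not base),
            key=lambda r: math.dist(r, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return rows + synthetic, labels + [minority_label] * len(synthetic)


# Toy two-feature data (made up for illustration): five negatives, three positives.
rows = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0],
        [1.2, 0.9], [0.8, 1.1], [5.5, 8.5], [1.1, 1.4]]
labels = [0, 0, 1, 1, 0, 0, 1, 0]

under_rows, under_labels = random_undersample(rows, labels)
over_rows, over_labels = random_oversample(rows, labels)
smote_rows, smote_labels = smote_like_oversample(rows, labels, minority_label=1)

print(Counter(labels), Counter(under_labels),
      Counter(over_labels), Counter(smote_labels))
```

Resampling is usually applied to the training split only, so that the held-out test set keeps the original class distribution.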

With the data explored and rebalanced, you’ll build and compare classification models, evaluating them with metrics such as recall for the positive class, which measures the share of truly diabetic cases the model catches. The project then goes a step further into model interpretability: you’ll use LIME to explain individual predictions and SHAP to understand global feature importance, making the model’s decisions more transparent and easier to trust.
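To make the recall metric concrete, here is a small sketch of how recall for the positive class differs from plain accuracy. The helper names (`recall_for_class`, `accuracy`) and the label vectors are hypothetical and chosen only for illustration. In a notebook, scikit-learn's `recall_score` with its `pos_label` argument computes the same quantity.

```python
def recall_for_class(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN): the share of actual positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        return 0.0  # no actual positives; returning 0.0 is a convention chosen here
    return tp / (tp + fn)


def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up labels: four diabetic (1) and six non-diabetic (0) cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1]

print("accuracy:", accuracy(y_true, y_pred))
print("recall (positive class):", recall_for_class(y_true, y_pred))
```

In this toy case the model is right on most rows yet misses half of the positive cases, which is why recall for the positive class is tracked separately on imbalanced data.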

By the end, you won’t just have a model that predicts diabetes risk – you’ll have a full, reproducible notebook that shows how to explore medical data, address imbalance, train models, and explain their predictions.

Please note: This project is intended solely for educational and analytical purposes. It does not constitute medical advice, diagnosis, or treatment, nor should it be used to guide healthcare decisions. The dataset used in this analysis is derived from a specific sample of individuals, and the findings do not generalize to all individuals or communities. Any insights or model predictions should be interpreted with this limitation in mind.


Dataset

Pima Indians Diabetes Dataset


Relevant Skills You May Apply

Python Programming and Machine Learning knowledge


Skills You May Gain

Machine Learning, Class Imbalance, AI Explainability and Interpretability


Total Time

10 – 15 hours


Milestones

Milestone 1: Exploratory Data Analysis
Milestone 2: Building Machine Learning Models
Milestone 3: Random Undersampling
Milestone 4: Random Oversampling
Milestone 5: Synthetic Minority Oversampling Technique (SMOTE)
Milestone 6: Explainability and Interpretability


Deliverables

Deliverables include a project report highlighting new skills gained and an interactive Python notebook (Jupyter/Google Colab).