Stroke Prediction - Spark, MLib

The objective is to predict brain stroke from patient’s records such as age, bmi score, heart problem, hypertension and smoking practice. The dataset includes 100k patient records. Among the records, 1.5% of them are related to stroke patients and the remaining 98.5% of them are related to non-stroke patients. Therefore, the data is extremely imbalanced.

The dataset is collected from https://bigml.com/dashboard/dataset/5e92c6d14f6bfd2dd00044a9

Dataproc and Google Cloud Platform is used to set up spark clusters.
PySpark and MLlib is used to develop the model.
Different data imputation techniques are applied to process the missing data.
Edited Nearest Neighbours under-sampling technique is used on the majority class (non-stroke patient) and SMOTE over sampling technique is used on the minority class (stroke-patient).
Bagging (i.e., Random Forest) and Boosting approach (i.e., Gradient Boosting Tree) are applied on the processed data.
The best performance is achieved using the bagging approach with AUC = 0.796.

This Project’s GitHub Repository