Below you will find pages that utilize the taxonomy term “Spark”
Post
Earthquake Prediction Dashboard - Spark, Tableau, MongoDB
The objective is to predict earthquakes from historical data. A machine learning model is trained on worldwide earthquake records from 1965 to 2016. The data includes the geographical location and magnitude of each earthquake (23.5k samples). The model predicts earthquake magnitudes for 2017. Finally, a dashboard visualizes the predictions alongside a historical analysis of the data.
Post
Stroke Prediction - Spark, MLlib
The objective is to predict brain stroke from patient records such as age, BMI, heart problems, hypertension and smoking habits. The dataset includes 100k patient records. Of these, 1.5% belong to stroke patients and the remaining 98.5% to non-stroke patients. The data is therefore extremely imbalanced.
The dataset is collected from https://bigml.com/dashboard/dataset/5e92c6d14f6bfd2dd00044a9
Dataproc on Google Cloud Platform is used to set up the Spark clusters.
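To see why the 1.5% / 98.5% split matters, the short sketch below works through the arithmetic for a hypothetical baseline that always predicts "non-stroke". The function name and the baseline itself are illustrative assumptions, not part of the project; only the record count and class percentages come from the summary above.

```python
# Class counts implied by the summary: 100k records, 1.5% stroke patients.
total = 100_000
stroke = round(total * 0.015)   # 1,500 positive records
non_stroke = total - stroke     # 98,500 negative records


def majority_baseline_metrics(positives: int, negatives: int) -> dict:
    """Metrics of a hypothetical model that always predicts the negative class."""
    true_positives = 0            # the baseline never predicts "stroke"
    true_negatives = negatives    # every non-stroke record is labelled correctly
    accuracy = (true_positives + true_negatives) / (positives + negatives)
    recall = true_positives / positives if positives else 0.0
    return {"accuracy": accuracy, "recall": recall}


metrics = majority_baseline_metrics(stroke, non_stroke)
print(metrics)
```

Such a baseline reaches 98.5% accuracy while finding no stroke patient at all, which is why recall, precision or a similar class-aware metric is usually more informative than accuracy on data like this.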
Post
Customer Churn Detection - Spark, MLlib
The dataset includes different information about customers. The objective is to predict customer churn from the data. The input data is highly imbalanced, consisting of 150 churn (i.e., churn = 1) and 750 no-churn (i.e., churn = 0) customers. Check the customer_churn.csv dataset for details.
MLlib and PySpark are used to build the model. Feature vectorization is performed to convert the categorical features. Random undersampling is applied to the majority class (i.
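As a rough illustration of random undersampling, here is a minimal sketch in plain Python (not PySpark, so it runs without a cluster). It assumes a 1:1 target ratio, which the summary does not state, and uses synthetic rows with a hypothetical `customer_id` field in place of customer_churn.csv.

```python
import random


def undersample_majority(rows, label_key="churn", seed=42):
    """Randomly drop rows from larger classes until all classes have equal size."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    minority_size = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    balanced = []
    for group in by_class.values():
        # Sampling without replacement; the minority class is kept whole.
        balanced.extend(rng.sample(group, minority_size))
    rng.shuffle(balanced)
    return balanced


# Synthetic stand-in data with the class counts from the summary.
rows = [{"customer_id": i, "churn": 1} for i in range(150)]
rows += [{"customer_id": i, "churn": 0} for i in range(150, 900)]

balanced = undersample_majority(rows)
churn_count = sum(r["churn"] == 1 for r in balanced)
no_churn_count = sum(r["churn"] == 0 for r in balanced)
print(churn_count, no_churn_count)
```

On a Spark DataFrame the same idea is commonly expressed by sampling the majority class with a fraction close to the minority/majority ratio; the exact approach used in the project is not shown here.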
Post
Spam Detection - Spark, MLlib
The objective of this project is to detect spam messages. The dataset includes tagged SMS messages. It contains 5,574 English SMS messages in total, each tagged as ham or spam. Check the SMSSpamCollection dataset for details.
MLlib with PySpark is used to build the model. Preprocessing steps are performed and feature engineering is applied using TF-IDF. Logistic regression, random forest and naive Bayes are used for classification. The best performance is achieved by logistic regression, with 97% accuracy.
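The TF-IDF step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the project's pipeline: it tokenizes on whitespace, uses exact term counts instead of feature hashing, and applies the smoothed IDF formula log((N + 1) / (df + 1)) documented for Spark MLlib. The three messages are made up for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    TF is the raw term count in the document; IDF is log((N + 1) / (df + 1)),
    where N is the number of documents and df the number containing the term.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document

    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


# Made-up messages, not taken from SMSSpamCollection.
messages = ["free prize call now", "call me later", "free free entry"]
weights = tf_idf(messages)
print(weights[0])
```

Terms that occur in fewer messages (such as "prize") receive a higher weight than terms spread across many messages (such as "call"), which is what makes the resulting vectors useful as classifier input.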