Customer Churn Detection - Spark, MLib

The dataset includes different information about customers. The objective is to predict customer churn from the data. The input data is highly imbalanced consisting 150 churn (i.e., churn = 1) and 750 no churn (i.e., churn = 0) customers. Check the customer_churn.csv dataset for details.

MLlib and PySpark is used to build the model.
Feature vectorization is performed to convert the categorical features.
Random undersampling is performed to the majority class (i.e., No Churn) and random oversampling is performed to the minority class (i.e., Churn) to balance the class distribution.
Logistic regression, Random Forest and Gradient Boosting Tree are applied to the balanced data.
The best performance is achieved for the Gradient Boosting Tree with AUC (Area Under Curve) = 0.92.

This Project’s GitHub Repository

Big Data
Classification
Spark
MLib