Below you will find pages that utilize the taxonomy term “Classification”
Post
TutVis: A Tool to Visualize Tutorials - Topic Modeling (LDA), Random Forest, AngularJS
Online text and video tutorials are among the most popular and heavily used resources for learning feature-rich software applications (e.g., Photoshop, Weka, AutoCAD, Fusion 360). However, when searching, users can find it difficult to assess whether a tutorial is designed for their level of expertise. TutVis (tutorial visualization) is a Photoshop tutorial browsing system that provides auto-generated information to assist users in tutorial search and selection. The provided information includes difficulty level (beginner/advanced), topics covered, length, text complexity, and command-usage ratio.
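A minimal sketch of how LDA topic features could feed a Random Forest difficulty classifier; the tutorial texts, labels, and hyperparameters below are placeholder assumptions, not the TutVis data or pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier

# Hypothetical tutorial texts and difficulty labels (placeholders only)
tutorials = [
    "open an image and crop the canvas with the crop tool",
    "build a frequency separation retouch using adjustment layers and masks",
]
labels = ["beginner", "advanced"]

# Bag-of-words counts -> per-document topic distributions (LDA)
counts = CountVectorizer(stop_words="english").fit_transform(tutorials)
lda = LatentDirichletAllocation(n_components=5, random_state=0)
topic_features = lda.fit_transform(counts)

# Random Forest trained on the topic distributions to predict difficulty
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(topic_features, labels)
```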
Post
Sentiment Analysis from Health and Fitness app reviews - BERT
The objective of this project is sentiment analysis (i.e., positive, neutral, negative) of popular health and fitness app reviews. For this task, app reviews are collected from the Google Play Store: ten popular health and fitness apps are chosen, and around 12,000 of the most recent reviews are collected in total. Ratings are used as the labels for positive, negative, and neutral sentiment. The collected data are preprocessed and a transformer model is trained on them.
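A hedged sketch of the rating-to-label mapping and tokenization step; the checkpoint name and rating thresholds are assumptions, not details stated above:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def rating_to_label(stars: int) -> int:
    # Assumed mapping: 1-2 stars -> negative (0), 3 -> neutral (1), 4-5 -> positive (2)
    if stars <= 2:
        return 0
    return 1 if stars == 3 else 2

# BERT checkpoint chosen for illustration; any transformer classifier would work similarly
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

batch = tokenizer(["Great app, the workout plans are excellent!"],
                  truncation=True, padding=True, return_tensors="pt")
logits = model(**batch).logits  # one row of 3 sentiment logits per review
```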
Post
Credit Card Fraud Detection - Autoencoder, KNN, SVM, MLP
The purpose of this project is to leverage machine learning to detect fraudulent credit card transactions. The idea is to prevent fraudulent activity by analyzing the transaction data alone. The transaction data used is highly imbalanced, with only 0.2% fraud cases. The overall challenge is to build a supervised model that can separate fraudulent transactions from normal ones. The data can be found at the following link:
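A minimal sketch of the autoencoder variant of this idea: train only on normal transactions and flag high reconstruction error as potential fraud. The layer sizes, threshold, and placeholder data below are assumptions for illustration:

```python
import numpy as np
import tensorflow as tf

n_features = 30                                                  # e.g., anonymized transaction features
normal_tx = np.random.rand(2000, n_features).astype("float32")   # placeholder data, not the real dataset

# Small symmetric autoencoder trained to reconstruct normal transactions
autoencoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(n_features, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(normal_tx, normal_tx, epochs=10, batch_size=64, verbose=0)

# Transactions the model reconstructs poorly are flagged as potential fraud
errors = np.mean((autoencoder.predict(normal_tx) - normal_tx) ** 2, axis=1)
flagged = errors > np.percentile(errors, 99.8)   # threshold chosen for illustration (~0.2% flagged)
```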
Post
Customer Loan Enquiry - ANN, Django
The objective of this project is to predict whether a customer will get a loan, given applicant income, loan amount, loan amount term, credit history, education status, self-employment status, property area, etc. A model is trained on previous customers’ loan approval history. A web service is created, which runs the trained model in the background. The service presents an interface through which any user can request an automated decision/prediction (i.e., whether the loan would be approved).
Post
Question Classification - SVM, Logistic Regression, LSTM, BERT, Doc2Vec, TF-IDF
The objective is to build a question classification model. The questions fall into six categories: Description (DESC), Entity (ENTY), Abbreviation (ABBR), Human (HUM), Location (LOC), and Numeric Value (NUM).
To investigate different approaches, the following data is used (downloaded from https://cogcomp.seas.upenn.edu/Data/QA/QC/):
Training set 5 (5,500 labeled questions); Test set: TREC 10 questions
Different data analyses have been performed and four different models are trained. The models are as follows:
Tf-Idf + SVM: Tf-Idf is used for vectorizing the texts and a linear model (i.e., a linear SVM) is trained on the resulting vectors.
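A minimal sketch of such a Tf-Idf + linear SVM pipeline in scikit-learn; the example questions and labels are placeholders in the TREC coarse-label style:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Placeholder questions and coarse labels (not the actual TREC training set)
questions = ["What is the capital of France ?", "Who wrote Hamlet ?"]
labels = ["LOC", "HUM"]

# Tf-Idf vectorization followed by a linear SVM classifier
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(questions, labels)
model.predict(["Where is the Eiffel Tower located ?"])  # predicts one of the trained labels
```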
Post
Heart Disease Prediction - Classification, Flask, Streamlit, Docker, Feature Selection
Heart disease, or cardiovascular disease, is one of the leading causes of mortality worldwide, causing roughly 1 out of 4 deaths in the US. Prediction of cardiovascular disease is therefore considered an important subject in clinical data analysis. Several contributory risk factors, such as diabetes, high blood pressure, high cholesterol, and abnormal pulse rate, can lead to cardiac arrest. The purpose of this work is to predict whether a patient is at risk of heart disease.
Post
Analysis of Parkinson Patient - Feature Selection, Classification
The objective is to detect patients with Parkinson’s Disease (PD) from voice samples. The training data includes voice measurements such as average, maximum, and minimum vocal fundamental frequency, several measures of variation in fundamental frequency, variation in amplitude, the ratio of noise to tonal components, the signal fractal scaling exponent, and nonlinear measures of fundamental frequency variation. From these measurements, feature selection (filter and wrapper methods) is performed to select the features most useful for detecting PD.
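A hedged sketch of the two feature-selection styles mentioned above, using scikit-learn on placeholder data (the feature count, estimator, and k values are assumptions):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 22)          # placeholder voice-measurement features
y = np.random.randint(0, 2, 100)     # placeholder PD / healthy labels

# Filter method: keep the k features with the highest ANOVA F-score
filter_selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
filter_mask = filter_selector.get_support()

# Wrapper method: recursively eliminate features using a classifier
wrapper_selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)
wrapper_mask = wrapper_selector.support_

print("Selected by both methods:", np.where(filter_mask & wrapper_mask)[0])
```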
Post
News Classification - LSTM
The objective of this project is to classify the category of news articles. The input data consist of 2,225 news articles from the BBC news website, corresponding to stories in 5 topical areas (business, entertainment, politics, sport, and tech). An LSTM is applied to categorize the articles. TensorFlow 2.0 is used to train the model, word embeddings are used for feature generation, and t-SNE is used to visualize the word vectors in 2D space.
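A minimal sketch of an embedding + LSTM classifier in TensorFlow 2.x; the vocabulary size, sequence length, and layer widths are assumptions, not the project's actual settings:

```python
import tensorflow as tf

vocab_size, max_len, n_classes = 10000, 200, 5   # assumed vocab/sequence length; 5 BBC topics

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(max_len,), dtype="int32"),   # padded token-id sequences
    tf.keras.layers.Embedding(vocab_size, 64),                # learned word embeddings
    tf.keras.layers.LSTM(64),                                 # sequence encoder
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(n_classes, activation="softmax"),   # one probability per topic
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```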
Post
Fake or Real Tweets - BERT, LSTM, TF-IDF
The dataset includes tweets about disasters, e.g., earthquake and wildfire. The objective is to detect whether a tweet refers to a real disaster or a fake one. Different approaches have been explored for data cleaning and model training. The best model can distinguish real from fake tweets with 89% accuracy using transfer learning (BERT).
The following models have been developed for training: a BOW model with Logistic Regression (77% accuracy); Tf-Idf with Logistic Regression.
Post
Stroke Prediction - Spark, MLlib
The objective is to predict brain stroke from patient records such as age, BMI score, heart problems, hypertension, and smoking habits. The dataset includes 100k patient records. Among these, 1.5% relate to stroke patients and the remaining 98.5% to non-stroke patients, so the data is extremely imbalanced.
The dataset is collected from https://bigml.com/dashboard/dataset/5e92c6d14f6bfd2dd00044a9
Dataproc on Google Cloud Platform is used to set up the Spark clusters.
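A hedged sketch of an MLlib pipeline for this kind of tabular data; the file name, column names, and choice of Logistic Regression are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("stroke-prediction").getOrCreate()
df = spark.read.csv("stroke.csv", header=True, inferSchema=True)   # assumed file name

# Assemble the numeric patient attributes into a single feature vector
assembler = VectorAssembler(
    inputCols=["age", "bmi", "heart_disease", "hypertension", "smoking"],  # assumed column names
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="stroke")

model = Pipeline(stages=[assembler, lr]).fit(df)
```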
Post
Handwritten Digits Classification - CNN
Classification of handwritten digits (0-9) has been performed using a CNN. The dataset includes 60,000 training samples and 10,000 testing samples, collected from MNIST.
TensorFlow 2.0 is used. The model includes 2 convolutional, 2 max-pooling, 1 dense, and 1 dropout layer. Early stopping is performed, and 99% accuracy is achieved. This Project’s GitHub Repository
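A minimal sketch of the described architecture (two conv, two max-pool, one dense, one dropout) with early stopping; the filter counts, dropout rate, and patience are assumptions:

```python
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0   # add channel dimension, scale to [0, 1]
x_test = x_test[..., None] / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, validation_split=0.1, epochs=10,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True)])
```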
Post
Rain Prediction - ANN
The purpose of this project is to predict whether it will rain tomorrow. The data includes weather information (i.e., temperature, evaporation, wind speed, humidity, pressure, and cloud status) for different locations in Australia. The data is quite imbalanced, with relatively few rainy instances. The objective is to train a neural network to predict tomorrow’s rain from the given information.
PyTorch is used for training. Feature selection is performed using Pearson’s correlation.
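A hedged sketch of the two steps described: Pearson-correlation-based feature filtering with pandas, followed by a small PyTorch network for the binary rain/no-rain label. The file name, correlation threshold, and network size are assumptions, and the target is assumed to be already encoded as 0/1:

```python
import pandas as pd
import torch
import torch.nn as nn

# Assumed file and column names; RainTomorrow assumed to be encoded as 0/1
df = pd.read_csv("weather.csv").dropna()
numeric = df.select_dtypes(include="number")

# Keep features whose absolute Pearson correlation with the target exceeds a threshold
corr = numeric.corr()["RainTomorrow"].abs().drop("RainTomorrow")
selected = corr[corr > 0.1].index.tolist()   # threshold chosen for illustration

X = torch.tensor(numeric[selected].values, dtype=torch.float32)
y = torch.tensor(numeric["RainTomorrow"].values, dtype=torch.float32).unsqueeze(1)

# Small feed-forward network for the binary prediction
net = nn.Sequential(nn.Linear(len(selected), 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(100):                  # short training loop for illustration
    optimizer.zero_grad()
    loss = loss_fn(net(X), y)
    loss.backward()
    optimizer.step()
```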
Post
Customer Churn Detection - Spark, MLlib
The dataset includes different information about customers. The objective is to predict customer churn from the data. The input data is highly imbalanced, consisting of 150 churn (i.e., churn = 1) and 750 no-churn (i.e., churn = 0) customers. Check the customer_churn.csv dataset for details.
MLlib and PySpark are used to build the model. Feature vectorization is performed to convert the categorical features. Random undersampling is applied to the majority class (i.e., the no-churn customers) to balance the training data.
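A hedged sketch of the random-undersampling step in PySpark; the column name `churn` and class sizes follow the description above, while the sampling logic itself is an assumed implementation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("churn").getOrCreate()
df = spark.read.csv("customer_churn.csv", header=True, inferSchema=True)

minority = df.filter(df.churn == 1)   # churn customers
majority = df.filter(df.churn == 0)   # no-churn customers (majority class)

# Randomly undersample the majority class down to roughly the minority-class size
fraction = minority.count() / majority.count()
balanced = minority.union(majority.sample(withReplacement=False, fraction=fraction, seed=42))
```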
Post
Intent Detection - BERT
The objective of this project is to detect intent from texts. For this, a benchmark dataset is used, which includes 7 intents (Search Creative Work, Get Weather, Book Restaurant, Play Music, Add to Playlist, Rate Book, Search Screening Event) and about 14,000 samples. Transfer learning is leveraged to train a machine learning model. The raw texts are tokenized and vectorized before being fed into the pre-trained model.
Post
Spam Detection - Spark, MLlib
The objective of this project is to detect spam messages. The dataset includes tagged SMS messages: 5,574 English SMS messages in total, each tagged as ham or spam. Check the SMSSpamCollection dataset for details.
MLlib with PySpark is used to build the model. Preprocessing steps are performed and feature engineering is applied using TF-IDF. Logistic Regression, Random Forest, and Naive Bayes are used for classification. The best performance is achieved with Logistic Regression, at 97% accuracy.
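A minimal sketch of the TF-IDF feature engineering and Logistic Regression stages in MLlib; the column names and the assumed tab-separated layout of the file are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("spam").getOrCreate()
# Assumed schema: tab-separated label ("ham"/"spam") and message text
df = spark.read.csv("SMSSpamCollection", sep="\t").toDF("class", "text")

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="class", outputCol="label"),            # ham/spam -> 0/1
    Tokenizer(inputCol="text", outputCol="words"),                 # split message into tokens
    HashingTF(inputCol="words", outputCol="tf"),                   # term-frequency features
    IDF(inputCol="tf", outputCol="features"),                      # inverse-document-frequency weighting
    LogisticRegression(featuresCol="features", labelCol="label"),  # final classifier
])
model = pipeline.fit(df)
```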