Below you will find the pages tagged with the taxonomy term "Embedding".
Post
Refining Word Embeddings - Word2Vec, Glove, FastText, LexVec
The objective is to implement different word embeddings and compare their performance on a dataset. The chosen dataset consists of tweets about disasters, e.g., earthquake, wildfire. Some of the tweets are real and some are fake, giving two labels: 1 for a real tweet and 0 for a fake tweet. This data was chosen to test whether the features extracted with different embedding techniques retain enough distinguishing information to detect which tweets are fake and which are real.
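A minimal sketch of one such embedding-based feature extractor, assuming gensim's downloader and a pretrained GloVe Twitter model (an illustrative choice, not necessarily the exact embeddings used in the post):

```python
# Sketch: average pretrained word vectors to turn each tweet into a fixed-length feature vector.
import numpy as np
import gensim.downloader as api

# Illustrative pretrained embeddings; the post compares Word2Vec, GloVe, FastText, and LexVec.
wv = api.load("glove-twitter-100")  # 100-dimensional GloVe vectors trained on tweets

def tweet_to_vector(tweet: str) -> np.ndarray:
    """Average the embeddings of in-vocabulary tokens; return zeros if none match."""
    tokens = tweet.lower().split()
    vectors = [wv[t] for t in tokens if t in wv]
    if not vectors:
        return np.zeros(wv.vector_size)
    return np.mean(vectors, axis=0)

X = np.vstack([tweet_to_vector(t) for t in ["Forest fire near La Ronge", "I love this song"]])
print(X.shape)  # (2, 100) — one feature vector per tweet
```

The resulting feature matrix can then be fed to any downstream classifier to compare how well each embedding separates real from fake tweets.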
Post
Question Classification - SVM, Logistic Regression, LSTM, BERT, Doc2Vec, TF-IDF
The objective is to build a question classification model. The questions fall into six categories: Description (DESC), Entity (ENTY), Abbreviation (ABBR), Human (HUM), Location (LOC), and Numeric Value (NUM).
To investigate different approaches, the following data is used (downloaded from https://cogcomp.seas.upenn.edu/Data/QA/QC/):
Training set: training set 5 (5,500 labeled questions); test set: TREC 10 questions.
Several data analyses were performed and four models were trained. The models are the following:
Tf-Idf + SVM: Tf-Idf is used to vectorize the text, and a linear SVM is trained on the resulting features.
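A minimal sketch of such a Tf-Idf + SVM pipeline, assuming scikit-learn; the feature settings here are illustrative, not the post's exact configuration:

```python
# Sketch: Tf-Idf features feeding a linear SVM for question classification.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

questions = ["How far is it from Denver to Aspen ?",
             "Who was the first woman to fly across the Atlantic ?"]
labels = ["NUM", "HUM"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),  # word + bigram Tf-Idf features
    LinearSVC(),                                             # linear SVM classifier
)
model.fit(questions, labels)
print(model.predict(["Where is the Eiffel Tower ?"]))  # would be 'LOC' once trained on the full set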
Post
News Classification - LSTM
The objective of this project is to classify the news category of articles. The input data consists of 2,225 news articles from the BBC News website covering stories in five topical areas: business, entertainment, politics, sport, and tech. An LSTM is applied to categorize the articles, and TensorFlow 2.0 is used to train the model. Word embeddings are used for feature generation, and t-SNE is used to visualize the word vectors in 2D space.
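A minimal sketch of that kind of model in TensorFlow 2 / Keras; vocabulary size, sequence length, and layer widths are illustrative assumptions, not the post's exact settings:

```python
# Sketch: embedding + LSTM classifier for the five BBC news categories.
import tensorflow as tf

VOCAB_SIZE, MAX_LEN, EMBED_DIM, NUM_CLASSES = 5000, 200, 64, 5

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_LEN,)),                    # padded token-id sequences
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),           # learned word embeddings
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),    # sequence encoder
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),   # one unit per news category
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The weights of the trained Embedding layer are what the post projects with t-SNE to visualize the word vectors in 2D.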
Post
Fake or Real Tweets - BERT, LSTM, TF-IDF
The dataset includes tweets about disasters, e.g., earthquake, wildfire. The objective is to detect whether a tweet describes a real disaster or a fake one. Different approaches were tried for data cleaning and model training. The best model predicts real vs. fake tweets with 89% accuracy using transfer learning (BERT).
The following models were developed: a BOW model with Logistic Regression (accuracy 77%); Tf-Idf with Logistic Regression.
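A minimal sketch of the bag-of-words baseline, assuming scikit-learn; the 77% figure comes from the post, not from this toy example:

```python
# Sketch: bag-of-words counts feeding a Logistic Regression baseline.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

tweets = ["Forest fire near La Ronge Sask. Canada",   # 1 = real disaster
          "I love fruits"]                            # 0 = not a disaster
labels = [1, 0]

baseline = make_pipeline(
    CountVectorizer(),                  # swap in TfidfVectorizer for the Tf-Idf variant
    LogisticRegression(max_iter=1000),
)
baseline.fit(tweets, labels)
print(baseline.predict(["Huge forest fire reported near the highway"]))
```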