Refining Word Embeddings - Word2Vec, GloVe, FastText, LexVec
The objective is to implement different word embeddings and investigate their performance on a dataset. For this task, the chosen dataset contains tweets about disasters (e.g., earthquakes, wildfires). Some of the tweets are real and some are fake, so there are two labels: 1 (real tweet) and 0 (fake tweet). This dataset was chosen to test whether the features extracted with different embedding techniques retain enough distinguishing information to tell fake tweets from real ones. The data is collected from the following link:
https://www.kaggle.com/c/nlp-getting-started
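As a minimal loading sketch, assuming the competition's `train.csv` has been downloaded locally (in this competition, the tweet is in the `text` column and the label in the `target` column):

```python
import pandas as pd

# Load the Kaggle "NLP with Disaster Tweets" training data.
# Assumes train.csv was downloaded from the competition page;
# "text" holds the tweet and "target" the label (1 = real, 0 = fake).
df = pd.read_csv("train.csv")
texts = df["text"].tolist()
labels = df["target"].values
print(df["target"].value_counts())
```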
The tweets are vectorized using four different embedding approaches (a loading sketch follows the list):
- Word2Vec.
- GloVe.
- FastText.
- LexVec.
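A hedged loading sketch: Word2Vec, GloVe, and FastText pre-trained vectors are available through gensim's downloader, while LexVec vectors have to be fetched separately (the LexVec file name below is a hypothetical placeholder for whichever release is downloaded):

```python
import gensim.downloader as api
from gensim.models import KeyedVectors

# Pre-trained models shipped via gensim-data (names are real gensim-data IDs).
word2vec = api.load("word2vec-google-news-300")
glove = api.load("glove-wiki-gigaword-100")
fasttext = api.load("fasttext-wiki-news-subwords-300")

# LexVec is not in gensim-data; assume its pre-trained vectors (word2vec
# text format) were downloaded from the LexVec GitHub releases.
# The file name below is an assumption, not a fixed path.
lexvec = KeyedVectors.load_word2vec_format(
    "lexvec.commoncrawl.300d.W.pos.vectors", binary=False)
```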
The following steps are performed (hedged sketches for each step follow the list):
- Pre-trained embeddings are used for vectorization.
- Each tweet is represented by the average of its word embeddings.
- The embeddings are visualized in a 2D plane to check the linear separability of the two classes.
- Classification is performed using Logistic Regression.
- The Davies-Bouldin Index and the Silhouette Index are calculated to quantify class separation.
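Averaging step, as a sketch building on the `texts` list and the `word2vec` KeyedVectors from the earlier snippets (the 300-dimension default matches the Google News vectors; other models use different sizes):

```python
import numpy as np

def average_embedding(text, kv, dim=300):
    """Represent a tweet as the mean of its word vectors.

    Words missing from the vocabulary are skipped; an all-zero
    vector is returned if no word is found.
    """
    vectors = [kv[w] for w in text.lower().split() if w in kv]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# One row per tweet, one column per embedding dimension.
X = np.vstack([average_embedding(t, word2vec) for t in texts])
```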
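Visualization step. The original projection method is not stated; PCA is assumed here as a simple way to eyeball separability in 2D:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the averaged tweet embeddings onto their first two
# principal components and color the points by class.
X_2d = PCA(n_components=2).fit_transform(X)
for label, name in [(1, "Real"), (0, "Fake")]:
    mask = labels == label
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], s=5, alpha=0.5, label=name)
plt.legend()
plt.title("Averaged tweet embeddings (PCA, 2D)")
plt.show()
```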
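Classification step. The train/test split ratio and random seed below are assumptions, not values from the project:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Hold out a stratified 20% test split (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("F1:", f1_score(y_test, pred))
```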
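Cluster-quality step. Both indices measure how well the two label groups separate in the embedding space: a lower Davies-Bouldin score and a higher silhouette score indicate better separation. scikit-learn provides both directly:

```python
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Treat the ground-truth labels as cluster assignments and score
# the separation of the averaged embeddings.
print("Davies-Bouldin:", davies_bouldin_score(X, labels))
print("Silhouette:", silhouette_score(X, labels))
```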
This Project’s GitHub Repository