Automatic Topic Extraction from Documents - LDA, Top2Vec

The objective is to develop an automated tool that can extract topics from any article. Different topic modeling techniques are investigated for this task, using the Yahoo Answers dataset, which contains ~1.5M question-answer pairs. From it, 60,000 samples were chosen at random, and only the answers are used to train the models. Two models are trained: LDA and Top2Vec. The full dataset is available at: https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset
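The sampling step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `sample_answers`, the zero-based `answer_column` index, and the fixed seed are all assumptions, and the column layout of the Kaggle CSV should be checked before use.

```python
import csv
import random


def sample_answers(csv_path, answer_column, sample_size=60_000, seed=42):
    """Return a random sample of non-empty answers from a CSV file.

    answer_column is the zero-based index of the answer field. The real
    column layout of the dataset file is an assumption to verify locally.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        answers = [
            row[answer_column].strip()
            for row in csv.reader(handle)
            if len(row) > answer_column and row[answer_column].strip()
        ]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    # Never ask for more items than exist, otherwise rng.sample raises.
    return rng.sample(answers, min(sample_size, len(answers)))
```

Sampling without replacement avoids training on duplicated answers, and fixing the seed means both models can be trained on the same 60,000 texts.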

Model 1: Topic Model - LDA:

Several preprocessing steps (e.g., lemmatization, stop word removal) are applied to the input text.
The topic coherence metric is used to select the number of topics for the LDA model (20 topics).
Each topic is then analyzed and given a semantic name (e.g., renaming “topic 1” to “science”).
The trained model is used to extract topics from unknown documents.
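
The LDA steps above could be sketched with gensim as shown below. The source does not name the library, so gensim is an assumption, as are the candidate topic counts, the `c_v` coherence measure, the training parameters, and the helper names (`best_num_topics`, `train_lda`, `top_topic`). The input is assumed to be already lemmatized, stop-word-filtered token lists.

```python
def best_num_topics(coherence_by_k):
    """Pick the topic count with the highest coherence score."""
    return max(coherence_by_k, key=coherence_by_k.get)


def train_lda(tokenized_docs, candidate_ks=(10, 20, 30)):
    """Train one LDA model per candidate k and keep the most coherent one.

    tokenized_docs: list of token lists (preprocessing already applied).
    gensim is imported here so the small helpers stay usable without it.
    """
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    models, scores = {}, {}
    for k in candidate_ks:
        models[k] = LdaModel(corpus=corpus, id2word=dictionary,
                             num_topics=k, passes=5, random_state=0)
        scores[k] = CoherenceModel(model=models[k], texts=tokenized_docs,
                                   dictionary=dictionary,
                                   coherence="c_v").get_coherence()
    return models[best_num_topics(scores)], dictionary, scores


def top_topic(lda, dictionary, tokens, topic_names):
    """Return the semantic name of the most probable topic for a new document."""
    bow = dictionary.doc2bow(tokens)
    # minimum_probability=0.0 keeps low-probability topics in the result,
    # so the list is never empty.
    distribution = lda.get_document_topics(bow, minimum_probability=0.0)
    topic_id, _ = max(distribution, key=lambda pair: pair[1])
    return topic_names.get(topic_id, f"topic {topic_id}")
```

Here `topic_names` is the hand-made mapping from topic ids to semantic names, for example `{1: "science"}`. An unseen document must go through the same preprocessing as the training text before it is passed to `top_topic`.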

Model 2: Topic Model - Top2Vec:

This model does not need any preprocessing, so the raw text (with only hyperlinks removed) is used for training.
BERT embeddings are used to generate the document vectors.
Hierarchical topic reduction is used to reduce the generated topics to 20.
Topic labeling is then performed to assign a semantic name to each of the 20 topics.
The trained model is used to extract topics from unknown documents.
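
A rough sketch of the Top2Vec steps above, using the `top2vec` package. The URL regex, the helper names, and the embedding model name `distiluse-base-multilingual-cased` are assumptions (the source only says BERT embeddings), and `query_topics` may not be available in older releases of the package.

```python
import re

# Matches http(s) links and bare www. links up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def strip_hyperlinks(text):
    """Remove URLs and collapse the whitespace they leave behind."""
    return " ".join(URL_PATTERN.sub(" ", text).split())


def train_top2vec(raw_answers, num_topics=20):
    """Fit Top2Vec on lightly cleaned text, then reduce to num_topics topics.

    Requires the top2vec package with its sentence-transformer extras.
    The embedding model name is an assumption, not taken from the source.
    """
    from top2vec import Top2Vec

    documents = [strip_hyperlinks(text) for text in raw_answers]
    model = Top2Vec(documents,
                    embedding_model="distiluse-base-multilingual-cased")
    # Reduction only works if the model found more than num_topics topics.
    model.hierarchical_topic_reduction(num_topics=num_topics)
    return model


def topics_for_new_document(model, text, topic_names, top_n=3):
    """Return (name, score) pairs for the reduced topics closest to the text."""
    _, _, topic_scores, topic_nums = model.query_topics(
        strip_hyperlinks(text), num_topics=top_n, reduced=True
    )
    return [(topic_names.get(int(num), f"topic {int(num)}"), float(score))
            for num, score in zip(topic_nums, topic_scores)]
```

As with LDA, `topic_names` stands for the manually assigned labels of the 20 reduced topics. New text gets the same hyperlink removal as the training data before it is embedded.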

This Project’s GitHub Repository