Question Classification - SVM, Logistic Regression, LSTM, BERT, Doc2Vec, TF-IDF
The objective is to build a question classification model. The questions belong to six categories: Description (DESC), Entity (ENTY), Abbreviation (ABBR), Human (HUM), Location (LOC), and Numeric Value (NUM).
To investigate different approaches, the following data is used (downloaded from https://cogcomp.seas.upenn.edu/Data/QA/QC/):
Training set: training set 5 (5,500 labeled questions)
Test set: TREC 10 questions
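Each line of these files pairs a label of the form COARSE:fine with the question text. Below is a minimal loading sketch; the file names and the latin-1 encoding are assumptions based on the files distributed at the link above.

```python
def load_trec(path):
    """Parse lines of the form 'DESC:manner How did serfdom develop ... ?'."""
    questions, labels = [], []
    with open(path, encoding="latin-1") as f:
        for line in f:
            label, question = line.strip().split(" ", 1)
            labels.append(label.split(":")[0])  # keep only the coarse class, e.g. DESC
            questions.append(question)
    return questions, labels

X_train, y_train = load_trec("train_5500.label")  # file names assumed
X_test, y_test = load_trec("TREC_10.label")
```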
Different data analyses have been performed and four different models have been trained. The models are the following:
- Tf-Idf + SVM: Tf-Idf is used to vectorize the text, and a linear model (SVM) is used for classification. The approach combines stochastic gradient descent (SGD) learning with the hinge loss (equivalent to a linear SVM) and l1 regularization (i.e., Lasso); a sketch is given after this list. This model achieves an accuracy of 93%.
- Doc2Vec + Logistic Regression: Vectorization is done with Doc2Vec (PV-DBOW, the Distributed Bag of Words version of Paragraph Vector), representing every question as a 100-dimensional feature vector. Logistic regression is used for classification (see the sketch after this list). It performs slightly better than the previous model (accuracy: 95%).
- GloVe + LSTM: GloVe word embeddings are used to vectorize the text, and an LSTM layer is combined with a few dropout layers (see the sketch after this list). This model achieves an accuracy of 96%.
- BERT: A transfer learning approach is followed using the pre-trained BERT-Large, Uncased model (24-layer, 1024-hidden, 16-heads, 340M parameters). The hyperparameters are tuned following the recommendations in https://github.com/google-research/bert. The Google tokenizer (https://github.com/google-research/bert/blob/master/tokenization.py) is used to tokenize the text (see the tokenization sketch after this list). Overall, the best performance is achieved by this model: an accuracy of 99%.
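A minimal sketch of the Tf-Idf + SVM pipeline with scikit-learn; only the hinge loss and l1 penalty come from the description above, the vectorizer settings and remaining SGD hyperparameters are assumptions.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# X_train, y_train, X_test, y_test come from the loading sketch above.
svm_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # hinge loss makes SGDClassifier equivalent to a linear SVM;
    # penalty="l1" gives the Lasso-style regularization mentioned above
    ("svm", SGDClassifier(loss="hinge", penalty="l1", random_state=42)),
])
svm_clf.fit(X_train, y_train)
print("Tf-Idf + SVM accuracy:", accuracy_score(y_test, svm_clf.predict(X_test)))
```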
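A minimal sketch of the Doc2Vec + Logistic Regression model using gensim, where dm=0 selects the PV-DBOW variant; apart from the 100-dimensional vector size, the training parameters (min_count, epochs) are assumptions.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

tagged = [TaggedDocument(words=q.lower().split(), tags=[i])
          for i, q in enumerate(X_train)]
# dm=0 -> PV-DBOW; vector_size=100 matches the 100d features described above
d2v = Doc2Vec(tagged, dm=0, vector_size=100, min_count=2, epochs=40)

def embed(questions):
    # infer a 100-dimensional vector for each question
    return [d2v.infer_vector(q.lower().split()) for q in questions]

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(embed(X_train), y_train)
print("Doc2Vec + LR accuracy:", accuracy_score(y_test, log_reg.predict(embed(X_test))))
```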
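A minimal Keras sketch of the GloVe + LSTM architecture; the GloVe file name, sequence length, layer sizes, and dropout rates are assumptions, and the training loop is omitted.

```python
import numpy as np
from tensorflow.keras import Sequential, layers, initializers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN, EMB_DIM = 30, 100
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=MAX_LEN)

# Build an embedding matrix from pre-trained 100d GloVe vectors (file name assumed).
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *vec = line.split()
        glove[word] = np.asarray(vec, dtype="float32")

vocab_size = len(tokenizer.word_index) + 1
emb_matrix = np.zeros((vocab_size, EMB_DIM))
for word, idx in tokenizer.word_index.items():
    if word in glove:
        emb_matrix[idx] = glove[word]

model = Sequential([
    layers.Embedding(vocab_size, EMB_DIM,
                     embeddings_initializer=initializers.Constant(emb_matrix),
                     trainable=False),
    layers.Dropout(0.2),
    layers.LSTM(128),
    layers.Dropout(0.2),
    layers.Dense(6, activation="softmax"),  # six coarse question classes
])
# labels must be integer-encoded (e.g. with sklearn's LabelEncoder) before model.fit
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])
```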
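Since the BERT model is fine-tuned with Google's original repository, the sketch below only illustrates the tokenization step with the Google tokenizer mentioned above; the vocabulary file path is an assumption (the file ships with the BERT-Large, Uncased checkpoint).

```python
import tokenization  # tokenization.py from https://github.com/google-research/bert

# WordPiece tokenizer backed by the checkpoint's vocabulary file (path assumed)
tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-24_H-1024_A-16/vocab.txt", do_lower_case=True)

tokens = ["[CLS]"] + tokenizer.tokenize("How far is it from Denver to Aspen ?") + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens, input_ids)
```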
This Project’s GitHub Repository