Discussion

Stop words

Gupta, D., Vani, K., & Leema, L. M. (2016). Plagiarism detection in text documents using sentence bounded stop word n-grams. Journal of Engineering Science and Technology, 11(10), 1403-1420.

Sentence annotation

Classification and Named Entity Detection

Pragmatic words

"你為什麼什麼都這樣?" "Why are you always like that?"

BERT

2. Preparing the data

To use BERT, we need to convert our data into the format it expects. Our reviews are in CSV files, but BERT wants the data in a TSV file with a specific format, given below (four columns and no header row):

  • Column 0: An ID for the row

  • Column 1: The label for the row, as an integer class label (0, 1, 2, 3, etc.)

  • Column 2: The same letter for every row. This is a throw-away column that we need to include because BERT expects it.

  • Column 3: The text examples we want to classify
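The conversion described above can be sketched with Python's csv module. The input column names ("review" and "label") are assumptions about the CSV, and "a" is an arbitrary choice for the throw-away letter:

```python
import csv

def csv_to_bert_tsv(csv_path, tsv_path, text_col="review", label_col="label"):
    """Convert a CSV of labeled reviews into the four-column TSV BERT expects.

    The column names text_col and label_col are assumptions about the
    input file; adjust them to match your own CSV header.
    """
    with open(csv_path, newline="", encoding="utf-8") as fin, \
         open(tsv_path, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        writer = csv.writer(fout, delimiter="\t")
        for i, row in enumerate(reader):
            # Col 0: row id; Col 1: int label; Col 2: throw-away letter; Col 3: text
            writer.writerow([i, int(row[label_col]), "a", row[text_col]])
```

The resulting file has no header row, matching the format listed above.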

Applications of BERT

BERT can be used for a wide variety of language tasks by adding only a small layer on top of the core model:

  1. Classification tasks such as sentiment analysis are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the [CLS] token.

  2. In Question Answering tasks (e.g. SQuAD v1.1), the software receives a question regarding a text sequence and is required to mark the answer in the sequence. Using BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer.

  3. In Named Entity Recognition (NER), the software receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc.) that appear in the text. Using BERT, a NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.
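The first two recipes above can be sketched in a few lines of NumPy. The classification weights (W, b) and the start/end vectors stand in for parameters that would actually be learned during fine-tuning; they are hypothetical, not BERT's real weights:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a vector of logits.
    e = np.exp(x - x.max())
    return e / e.sum()

def classify_cls(cls_vector, W, b):
    # (1) Classification: a single linear layer + softmax on the
    # Transformer output for the [CLS] token.
    return softmax(cls_vector @ W + b)

def answer_span(token_vectors, start_vec, end_vec):
    # (2) Question answering: two extra learned vectors score every token
    # as the beginning or end of the answer; the end is constrained to
    # come at or after the start.
    start = int(np.argmax(token_vectors @ start_vec))
    end = int(np.argmax(token_vectors[start:] @ end_vec)) + start
    return start, end
```

NER (3) follows the same pattern as classify_cls, except the linear layer is applied to every token's output vector rather than only to [CLS].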
