"你為什麼什麼都這樣?" "Why are you always like that?"
To use BERT, we first need to convert our data into the format it expects. Our reviews are in CSV files, but BERT wants a TSV file with a specific layout: four columns and no header row.
Column 0: An ID for the row
Column 1: The label for the row (an int class label: 0, 1, 2, 3, etc.)
Column 2: The same single letter for every row — a throw-away column that we include only because BERT expects it
Column 3: The text examples we want to classify
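The conversion can be sketched with Python's standard `csv` module. The function below assumes the input CSV holds `(label, text)` rows; the file paths, column order, and the throw-away letter `"a"` are illustrative choices, not requirements of BERT beyond the four-column layout described above.

```python
import csv

def csv_to_bert_tsv(in_path, out_path):
    """Rewrite a (label, text) CSV of reviews into the four-column,
    header-less TSV layout BERT expects. The input column order is
    an assumption for this sketch."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, delimiter="\t")
        for i, (label, text) in enumerate(reader):
            # Col 0: row ID; col 1: int label; col 2: throw-away
            # letter; col 3: the text to classify.
            writer.writerow([i, int(label), "a", text])
```

The row ID is just the row's position here; any unique value works, since BERT's data loader does not interpret it.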
BERT can be used for a wide variety of language tasks by adding only a small layer on top of the core model:
Classification tasks such as sentiment analysis are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the [CLS] token.
In Question Answering tasks (e.g. SQuAD v1.1), the software receives a question regarding a text sequence and is required to mark the answer in the sequence. Using BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer.
In Named Entity Recognition (NER), the software receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc.) that appear in the text. Using BERT, a NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.
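As a rough sketch of how these task heads differ, the snippet below uses random weights in place of a trained encoder and learned layers (all shapes and values are stand-ins). It shows the three patterns above: a classification layer on the [CLS] vector, two extra vectors scoring answer start/end positions for QA, and a per-token classification layer for NER.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, seq_len, num_labels, num_ner_tags = 768, 128, 4, 9

# Encoder output for one example: one hidden vector per token.
# Random here; a real BERT encoder would produce these.
seq_out = rng.standard_normal((seq_len, hidden))

# 1) Classification: linear layer on the [CLS] vector (token 0).
W_cls = rng.standard_normal((hidden, num_labels))
class_logits = seq_out[0] @ W_cls                # (num_labels,)
pred_class = int(np.argmax(class_logits))

# 2) QA: two learned vectors score every token as answer start/end.
start_vec = rng.standard_normal(hidden)
end_vec = rng.standard_normal(hidden)
start_scores = seq_out @ start_vec               # (seq_len,)
end_scores = seq_out @ end_vec                   # (seq_len,)
answer_span = (int(np.argmax(start_scores)), int(np.argmax(end_scores)))

# 3) NER: the same classification layer applied to every token.
W_ner = rng.standard_normal((hidden, num_ner_tags))
ner_logits = seq_out @ W_ner                     # (seq_len, num_ner_tags)
ner_tags = np.argmax(ner_logits, axis=-1)        # one tag per token
```

The point is structural: the heavy Transformer encoder is shared, and each task only adds a thin linear projection over the relevant output vectors.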