# Discussion

## Stop words

Gupta, D., Vani, K., & Leema, L. M. (2016). Plagiarism detection in text documents using sentence bounded stop word n-grams. Journal of Engineering Science and Technology, 11(10), 1403-1420.
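A minimal sketch of the core idea in Gupta et al.: within each sentence, discard everything except stop words, then form n-grams over the remaining sequence, never letting an n-gram cross a sentence boundary. The tiny stop-word list and the naive period-based sentence split here are illustrative assumptions, not the paper's actual preprocessing.

```python
# Sentence-bounded stop-word n-grams: keep only stop words per sentence,
# then slide an n-gram window that never crosses a sentence boundary.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "it", "that", "on"}

def stop_word_ngrams(text, n=3):
    ngrams = []
    for sentence in text.lower().split("."):   # naive sentence splitter
        tokens = [t for t in sentence.split() if t in STOP_WORDS]
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return ngrams

print(stop_word_ngrams("The cat sat on the mat in the corner of the room."))
```

Because stop words are topic-independent, two documents sharing long runs of identical stop-word n-grams are likely to share sentence structure, which is the signal the paper exploits for plagiarism detection.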

![](https://2844018201-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M2exRGxOyWlx4fDWu3H%2F-M9JXAO4ogLQUKZBPNHm%2F-M9JY3cVa2u3s_cch1a0%2Fimage.png?alt=media&token=73cf6101-98aa-4cf2-ac3c-04b9ed30a4ac)


## Sentence annotation

{% embed url="https://tbrain.trendmicro.com.tw/Competitions/Details/8" %}

![](https://lh4.googleusercontent.com/BgU9eHEVi0DrC19rnhinyStSYKH1PccqzV2h6pChUjuE1jP1Dv8qSrhjavIPfsKoTWQoiA0OQXXQXoAftMrWfDMsQh9UFmqIJlWlgMYp_tehn3e5b7b2At_OK-pY1E5cdFUmEaAGMCo)

## Classification and Named Entity Detection

{% embed url="https://tbrain.trendmicro.com.tw/Competitions/Details/11" %}

![](https://2844018201-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M2exRGxOyWlx4fDWu3H%2F-M9JXAO4ogLQUKZBPNHm%2F-M9JbcQy9aGKdEX7NZns%2Fimage.png?alt=media&token=03d65a82-9eed-4467-91d4-a9e9c62f2414)

## Pragmatic words

"你為什麼什麼都這樣?" ("Why are you always like that?")

## BERT

{% embed url="https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/" %}

#### Preparing the data

To use BERT, we first need to convert our data into the format it expects. Our reviews are in CSV files, but BERT wants the data in a ***tsv*** file with a specific format (four columns and no header row):

* **Column 0:** An ID for the row
* **Column 1:** The label for the row (an integer class label: 0, 1, 2, 3, etc.)
* **Column 2:** The same letter for every row; a throw-away column included only because BERT's input format expects it
* **Column 3:** The text example we want to classify

![](https://2844018201-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M2exRGxOyWlx4fDWu3H%2F-M9LcDuIJ6folMqzUGzz%2F-M9LcHx7wk6lgvwFP1_N%2Fimage.png?alt=media&token=3d877d5d-ef9a-4155-97fa-8214b9681d8d)

### Applications of BERT

{% embed url="https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270" %}

BERT can be used for a wide variety of language tasks by adding only a small layer on top of the core model:

1. Classification tasks such as sentiment analysis are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the \[CLS] token.
2. In Question Answering tasks (e.g. SQuAD v1.1), the software receives a question regarding a text sequence and is required to mark the answer in the sequence. Using BERT, a Q\&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer.
3. In Named Entity Recognition (NER), the software receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc) that appear in the text. Using BERT, a NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.
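A minimal sketch of point 1 above: the task-specific "small layer" for classification is just an affine map plus softmax applied to the Transformer's output vector for the \[CLS] token. Random weights stand in for a fine-tuned BERT here; the hidden size of 768 matches BERT-base, and everything else is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_classes, seq_len = 768, 2, 16   # 768 = BERT-base hidden size

# Stand-in for BERT's output: one 768-d vector per token; index 0 is [CLS].
sequence_output = rng.normal(size=(seq_len, hidden_size))
cls_vector = sequence_output[0]

# The added classification layer: logits = cls_vector @ W + b, then softmax.
W = rng.normal(size=(hidden_size, num_classes)) * 0.02
b = np.zeros(num_classes)
logits = cls_vector @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs)  # one probability per class
```

During fine-tuning, `W` and `b` (and usually BERT's own weights) are trained jointly on the labeled data; the QA and NER variants in points 2 and 3 differ only in which token vectors feed the extra layer.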
