
# Discussion

## Stop words

Gupta, D., Vani, K., & Leema, L. M. (2016). Plagiarism detection in text documents using sentence bounded stop word n-grams. Journal of Engineering Science and Technology, 11(10), 1403-1420.

![](/files/-M9JY3cVa2u3s_cch1a0)

## Sentence annotation

{% embed url="<https://tbrain.trendmicro.com.tw/Competitions/Details/8>" %}

![](https://lh4.googleusercontent.com/BgU9eHEVi0DrC19rnhinyStSYKH1PccqzV2h6pChUjuE1jP1Dv8qSrhjavIPfsKoTWQoiA0OQXXQXoAftMrWfDMsQh9UFmqIJlWlgMYp_tehn3e5b7b2At_OK-pY1E5cdFUmEaAGMCo)

## Classification and Named Entity Detection

{% embed url="<https://tbrain.trendmicro.com.tw/Competitions/Details/11>" %}

![](/files/-M9JbcQy9aGKdEX7NZns)

## Pragmatic words

"你為什麼什麼都這樣?" "Why are you always like that?"

## BERT

{% embed url="<https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/>" %}

### Preparing the data

To use BERT as in the tutorial above, we need to convert our data into the format it expects. Our reviews are stored as CSV files, but the tutorial's BERT setup reads a **TSV** file with four columns and no header row:

* **Column 0:** An ID for the row
* **Column 1:** The label for the row (an integer class label: 0, 1, 2, 3, etc.)
* **Column 2:** The same letter in every row. This is a throw-away column, included only because BERT expects it.
* **Column 3:** The text examples we want to classify

![](/files/-M9LcHx7wk6lgvwFP1_N)
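The four-column layout above can be produced with a few lines of standard-library Python. This is a minimal sketch, not the tutorial's own code: the function name `reviews_to_bert_tsv` and the CSV column names `review` and `label` are assumptions made for illustration, so adjust them to match your own file.

```python
import csv
import io


def reviews_to_bert_tsv(csv_text, text_col="review", label_col="label"):
    """Turn CSV text (with a header row) into four-column, header-less TSV text.

    The column names are hypothetical defaults, not part of any BERT API.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row_id, row in enumerate(reader):
        # Collapse tabs and newlines inside the text so each example stays on one line.
        text = " ".join(row[text_col].split())
        label = int(row[label_col])  # column 1 must be an integer class label
        # Column 0: row ID, column 1: label, column 2: throw-away letter, column 3: text
        lines.append("\t".join([str(row_id), str(label), "a", text]))
    return "\n".join(lines) + "\n"


sample_csv = 'review,label\n"Great food, friendly staff",1\nToo slow,0\n'
tsv_text = reviews_to_bert_tsv(sample_csv)
print(tsv_text)
```

To convert a real file, read it into a string (or pass an open file object to `csv.DictReader` directly) and write the returned text to a `.tsv` file.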

### Applications of BERT

{% embed url="<https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270>" %}

BERT can be used for a wide variety of language tasks, while only adding a small layer to the core model:

1. Classification tasks such as sentiment analysis are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the \[CLS] token.
2. In Question Answering tasks (e.g. SQuAD v1.1), the software receives a question regarding a text sequence and is required to mark the answer in the sequence. Using BERT, a Q\&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer.
3. In Named Entity Recognition (NER), the software receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc.) that appear in the text. Using BERT, a NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.
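The three task heads listed above differ only in which output vectors they read and how many scores they produce. The toy sketch below illustrates that shape in plain Python. It is not a real BERT model: random vectors stand in for the Transformer output, all weights are untrained, and names such as `encoder_output`, `cls_weights`, and `ner_labels` are assumptions made for illustration.

```python
import random

random.seed(0)
HIDDEN = 4  # toy hidden size; real BERT-base uses 768


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear(weights, vec):
    """Apply a bias-free linear layer: one score per weight row."""
    return [dot(w, vec) for w in weights]


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


tokens = ["[CLS]", "who", "wrote", "it", "[SEP]", "Ada", "wrote", "it", "[SEP]"]
# Stand-in for the Transformer output: one vector per token.
encoder_output = rand_matrix(len(tokens), HIDDEN)

# 1. Classification: a layer on top of the [CLS] vector only.
cls_weights = rand_matrix(2, HIDDEN)  # two classes, e.g. negative / positive
predicted_class = argmax(linear(cls_weights, encoder_output[0]))

# 2. Question answering: two extra vectors score every token as start or end.
start_vec, end_vec = rand_matrix(2, HIDDEN)
start_scores = [dot(start_vec, h) for h in encoder_output]
end_scores = [dot(end_vec, h) for h in encoder_output]
start = argmax(start_scores)
# Simplification: pick the best end at or after the chosen start.
end = start + argmax(end_scores[start:])

# 3. NER: the same classification layer applied to every token vector.
ner_labels = ["O", "PER", "ORG", "DATE"]
ner_weights = rand_matrix(len(ner_labels), HIDDEN)
token_labels = [ner_labels[argmax(linear(ner_weights, h))] for h in encoder_output]

print(predicted_class, tokens[start:end + 1], token_labels)
```

Because the weights are random, the printed predictions are meaningless. The point is only that each task adds a small layer on top of the same encoder output.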

