Collocation
Speech analysis
Introduction
Using dataset https://data.gov.tw/dataset/42540
Loading packages
library(tidyverse)
library(tidyr)
library(jiebaR)
options(stringsAsFactors = FALSE)
options(scipen = 999)
Loading data
raw.df <- readRDS("data/toChinaSpeech.rds") %>%
  mutate(doc_id = str_c("doc", str_pad(row_number(), 2, pad = "0"))) %>%
  mutate(nchar = nchar(content)) %>%
  select(doc_id, content, title, nchar) %>%
  mutate(content = str_replace(content, "【總統府新聞稿】", "")) %>%
  mutate(content = stringr::str_replace_all(content, "台灣", "臺灣"))
Initialize jieba
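The page does not show how the segmenter was set up, so what follows is only a minimal sketch of a typical jiebaR initialization. The object name `cutter`, the two user words, and the sample sentence are illustrative assumptions, not the original configuration.

```r
library(jiebaR)

# Default mixed-model segmenter; dictionaries and stop words stay at package defaults.
cutter <- worker()

# Optionally register domain terms so they are kept as single tokens
# (these two words are illustrative, not taken from the original page).
new_user_word(cutter, c("兩岸關係", "總統府"))

# Segment one string into a character vector of tokens.
tokens <- segment("總統府今天發表兩岸關係談話", cutter)
tokens
```

The same `cutter` object can then be reused for every document, which avoids reloading the dictionaries on each call.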
Significant words between docs
Log-ratio
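As a rough illustration of the log-ratio idea, the sketch below compares smoothed relative word frequencies between two groups of documents. The toy data frame `counts`, its columns `n_a` and `n_b`, and the add-one smoothing are assumptions made for the example; the original analysis may define the ratio differently.

```r
# Hypothetical toy counts: how often each word occurs in two groups of speeches.
counts <- data.frame(
  word = c("peace", "trade", "security"),
  n_a  = c(30, 5, 10),
  n_b  = c(10, 20, 10),
  stringsAsFactors = FALSE
)

# Add-one smoothing keeps the ratio finite for words missing from one group.
p_a <- (counts$n_a + 1) / (sum(counts$n_a) + 1)
p_b <- (counts$n_b + 1) / (sum(counts$n_b) + 1)

# Positive values lean towards group A, negative values towards group B.
counts$log_ratio <- log2(p_a / p_b)

# Show the most distinctive words first.
counts[order(-abs(counts$log_ratio)), ]
```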

tf-idf
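A minimal base R sketch of tf-idf on toy data, assuming term frequency is the word's share of a document's tokens and inverse document frequency is the natural log of (number of documents / documents containing the word). The data frame `term_counts` and its values are hypothetical.

```r
# Hypothetical toy term counts: one row per (document, word) pair.
term_counts <- data.frame(
  doc_id = c("doc01", "doc01", "doc02", "doc02", "doc03"),
  word   = c("peace", "trade", "peace", "security", "peace"),
  n      = c(3, 1, 2, 2, 4),
  stringsAsFactors = FALSE
)

# Term frequency: share of the document's tokens taken by the word.
doc_totals <- tapply(term_counts$n, term_counts$doc_id, sum)
term_counts$tf <- term_counts$n / as.numeric(doc_totals[term_counts$doc_id])

# Inverse document frequency: ln(number of docs / docs containing the word).
n_docs   <- length(unique(term_counts$doc_id))
doc_freq <- table(term_counts$word)
term_counts$idf <- log(n_docs / as.numeric(doc_freq[term_counts$word]))

term_counts$tf_idf <- term_counts$tf * term_counts$idf
term_counts
```

A word that appears in every document, such as "peace" here, gets an idf of zero, so it never counts as distinctive no matter how frequent it is.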
Collocation
Sentence extraction
Tokenization
Pair counts
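One way to count word pairs is to treat each sentence as a bag of distinct words and tally how many sentences contain each unordered pair. The sketch below assumes sentences are already tokenized; the list `sentences` and the helper `pairs_in` are hypothetical names for illustration.

```r
# Hypothetical toy input: each element is the token vector of one sentence.
sentences <- list(
  c("cross", "strait", "peace"),
  c("cross", "strait", "trade"),
  c("peace", "trade")
)

# For one sentence, list every unordered pair of distinct words exactly once.
pairs_in <- function(tokens) {
  words <- sort(unique(tokens))
  if (length(words) < 2) return(character(0))
  combn(words, 2, FUN = paste, collapse = " ")
}

# Number of sentences in which each pair co-occurs, most frequent first.
pair_counts <- sort(table(unlist(lapply(sentences, pairs_in))), decreasing = TRUE)
pair_counts
```

Sorting the words before pairing ensures that "cross strait" and "strait cross" are counted as the same pair.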
Pearson Correlation
phi-correlation https://en.wikipedia.org/wiki/Phi_coefficient
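For two binary indicators (word X present or absent, word Y present or absent), the phi coefficient equals the Pearson correlation of the 0/1 vectors. The sketch below computes phi from the four cells of a 2x2 table and cross-checks it with `cor()`. The function name `phi_coef` and the counts are hypothetical.

```r
# Phi coefficient from the four cells of a 2x2 co-occurrence table:
#   n11 = sentences with both words, n10 = only word X,
#   n01 = only word Y,               n00 = neither word.
phi_coef <- function(n11, n10, n01, n00) {
  # Convert to double so large products cannot overflow integer arithmetic.
  n11 <- as.numeric(n11); n10 <- as.numeric(n10)
  n01 <- as.numeric(n01); n00 <- as.numeric(n00)
  (n11 * n00 - n10 * n01) /
    sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
}

# Cross-check against cor() on the equivalent 0/1 indicator vectors.
x <- rep(c(1, 1, 0, 0), times = c(20, 5, 10, 65))
y <- rep(c(1, 0, 1, 0), times = c(20, 5, 10, 65))
phi_coef(20, 5, 10, 65)
cor(x, y)
```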
Chi-square
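A short sketch of a chi-square test of independence for one word pair, using base R's `chisq.test()` on a 2x2 table of sentence counts. The counts in `tab` are hypothetical, and the continuity correction is switched off so the statistic equals N times phi squared.

```r
# Hypothetical 2x2 table of sentence counts for a word pair (X, Y).
# matrix() fills column by column: the first column is Y = "yes".
tab <- matrix(c(20, 10, 5, 65), nrow = 2,
              dimnames = list(X = c("yes", "no"), Y = c("yes", "no")))

# Pearson chi-square without Yates' continuity correction.
test <- chisq.test(tab, correct = FALSE)
test$statistic
test$p.value

# Expected counts under independence, for comparison with the observed table.
test$expected
```

A large statistic only says the two words are not independent; the sign of the association still has to be read from the observed versus expected counts.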
PMI
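Pointwise mutual information compares how often two words co-occur with how often they would co-occur if they were independent: PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). The sketch below estimates the probabilities from sentence counts; the function name `pmi`, its argument names, and the counts are hypothetical.

```r
# Pointwise mutual information: log2( p(x, y) / (p(x) * p(y)) ),
# with probabilities estimated as counts divided by the number of sentences.
pmi <- function(n_xy, n_x, n_y, n_total) {
  log2((n_xy / n_total) / ((n_x / n_total) * (n_y / n_total)))
}

# Hypothetical counts: out of 100 sentences, X occurs in 25, Y in 30, both in 20.
pmi(n_xy = 20, n_x = 25, n_y = 30, n_total = 100)
```

PMI tends to overrate rare pairs, so it is commonly combined with a minimum count threshold on `n_xy`.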