Collocation

Speech analysis

Introduction

Loading packages

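library(tidyverse)  # attaches dplyr, tidyr, stringr, and the %>% pipe
library(jiebaR)     # Chinese word segmentation
# Keep strings as character vectors (already the default from R 4.0.0)
options(stringsAsFactors = FALSE)
# Print numbers in fixed rather than scientific notation
options(scipen = 999)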

Loading data

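# Read the speeches, assign zero-padded document IDs, and normalize the text
raw.df <- readRDS("data/toChinaSpeech.rds") %>%
    mutate(doc_id = str_c("doc", str_pad(row_number(), 2, pad = "0"))) %>%
    # Character count of the original content, before any cleaning
    mutate(nchar = nchar(content)) %>%
    select(doc_id, content, title, nchar) %>%
    # Drop the press-release header (first occurrence only)
    mutate(content = str_replace(content, "【總統府新聞稿】", "")) %>%
    # Unify the two spellings of "Taiwan"
    mutate(content = str_replace_all(content, "台灣", "臺灣"))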

Initialize jieba
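
A minimal sketch of this step, assuming jiebaR's default dictionaries are acceptable. The two user words are hypothetical examples, not a list taken from the original analysis.

```r
library(jiebaR)

# Create a segmenter with jiebaR's default settings
cutter <- worker()

# Hypothetical domain terms to keep as single tokens; replace with your own list
new_user_word(cutter, c("兩岸關係", "九二共識"))

# Segment one sentence into a character vector of tokens
tokens <- segment("兩岸關係和平穩定發展", cutter)
tokens
```

The same `cutter` object can then be reused for every document, so the dictionary is loaded only once.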

Significant words between docs

Log-ratio
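
One common way to compute a log-ratio is sketched below on toy counts, assuming two groups of documents `a` and `b` and add-one smoothing. The column names and numbers are illustrative assumptions, not values from the speech data.

```r
# Toy word counts for two groups of documents (illustrative numbers only)
counts <- data.frame(
  word = c("和平", "經濟", "民主"),
  a = c(30, 10, 5),
  b = c(10, 10, 20),
  stringsAsFactors = FALSE
)

# Add-one smoothing, then convert counts to within-group proportions
p_a <- (counts$a + 1) / sum(counts$a + 1)
p_b <- (counts$b + 1) / sum(counts$b + 1)

# Positive values: relatively more frequent in group a; negative: in group b
counts$log_ratio <- log2(p_a / p_b)

# Show the most distinctive words first
counts[order(-abs(counts$log_ratio)), ]
```

Smoothing keeps the ratio finite for words that appear in only one group.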

tf-idf
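
The sketch below computes tf-idf by hand in base R on a toy count table, assuming one row per document and word and a natural-log idf. The data frame `tf_df` and its values are made up for illustration.

```r
# Toy token counts per document (illustrative data only)
tf_df <- data.frame(
  doc_id = c("doc01", "doc01", "doc02", "doc02", "doc03"),
  word   = c("和平", "兩岸", "和平", "經濟", "和平"),
  n      = c(3, 1, 2, 2, 4),
  stringsAsFactors = FALSE
)

n_docs <- length(unique(tf_df$doc_id))

# Term frequency: share of a document's tokens taken by each word
tf_df$tf <- tf_df$n / ave(tf_df$n, tf_df$doc_id, FUN = sum)

# Document frequency: number of documents containing the word
# (valid here because each document-word pair appears in exactly one row)
doc_freq <- table(tf_df$word)
tf_df$idf <- log(n_docs / as.numeric(doc_freq[tf_df$word]))

tf_df$tf_idf <- tf_df$tf * tf_df$idf
tf_df
```

A word that occurs in every document, such as 和平 here, gets an idf of zero and therefore a tf-idf of zero.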

Collocation

Sentence extraction
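
A minimal sketch of sentence splitting, assuming sentences end with one of the full-width marks 。, ！ or ？. The sample text is invented, and the original analysis may split on a different set of marks.

```r
text <- "兩岸和平發展。我們要維護臺灣的民主！未來會更好嗎？"

# Append a newline after each sentence-ending mark (literal, not regex, matching)
marked <- text
for (mark in c("。", "！", "？")) {
  marked <- gsub(mark, paste0(mark, "\n"), marked, fixed = TRUE)
}

# Split on the inserted newlines; strsplit() drops the trailing empty piece
sentences <- strsplit(marked, "\n", fixed = TRUE)[[1]]
sentences
```

Keeping the punctuation attached to each sentence makes it easier to trace a sentence back to its place in the original speech.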

Tokenization

Pair counts
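
The sketch below counts how many sentences each unordered word pair shares, assuming the sentences have already been tokenized. The token lists are toy data.

```r
# Toy tokenized sentences (illustrative data only)
sentences <- list(
  c("兩岸", "和平", "發展"),
  c("兩岸", "和平"),
  c("經濟", "發展")
)

# For each sentence, list every unordered pair of distinct words
pairs <- unlist(lapply(sentences, function(words) {
  words <- sort(unique(words))
  if (length(words) < 2) return(character(0))
  combn(words, 2, FUN = paste, collapse = " ")
}))

# Number of sentences in which each pair co-occurs, most frequent first
pair_counts <- sort(table(pairs), decreasing = TRUE)
pair_counts
```

Sorting the words inside each sentence ensures that "A B" and "B A" are counted as the same pair.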

Pearson correlation
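
For word pairs, the Pearson correlation is usually taken over 0/1 indicators of whether each word occurs in a sentence, which gives the phi coefficient. The sketch below assumes that setup; the occurrence matrix is toy data.

```r
# Rows are sentences, columns are words; 1 means the word occurs in the sentence
# (toy data for illustration)
occurs <- rbind(
  c(1, 1, 0),
  c(1, 1, 0),
  c(0, 0, 1),
  c(1, 0, 1),
  c(0, 0, 0)
)
colnames(occurs) <- c("兩岸", "和平", "經濟")

# Pearson correlation on 0/1 indicators equals the phi coefficient
phi <- cor(occurs)
phi
```

Values near 1 mean two words tend to appear in the same sentences, and values near -1 mean they tend to avoid each other.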

Chi-square
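
A minimal sketch of a chi-square test for one word pair, assuming a 2x2 table of sentence counts (word A present or absent by word B present or absent). The counts are invented for illustration.

```r
# 2x2 contingency table of sentence counts for a word pair (toy numbers)
tab <- matrix(
  c(20, 30,    # word A present: B present, B absent
    10, 140),  # word A absent:  B present, B absent
  nrow = 2, byrow = TRUE,
  dimnames = list(A = c("present", "absent"), B = c("present", "absent"))
)

# Pearson chi-square test of independence, without continuity correction
result <- chisq.test(tab, correct = FALSE)
result$statistic  # test statistic
result$expected   # counts expected if A and B were independent
```

Comparing the observed co-occurrence count with `result$expected` shows whether the pair occurs together more or less often than chance would predict.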

PMI
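
Pointwise mutual information compares the observed joint probability of two words with the probability expected under independence: PMI(a, b) = log2(p(a, b) / (p(a) p(b))). The sketch below assumes probabilities are estimated from sentence counts; the numbers are toy values.

```r
n_sent <- 200  # total number of sentences
n_a    <- 50   # sentences containing word A
n_b    <- 30   # sentences containing word B
n_ab   <- 20   # sentences containing both

# Estimated probabilities
p_a  <- n_a / n_sent
p_b  <- n_b / n_sent
p_ab <- n_ab / n_sent

# PMI in bits: positive when the pair co-occurs more often than chance
pmi <- log2(p_ab / (p_a * p_b))
pmi
```

PMI tends to favor rare pairs, so it is often combined with a minimum count threshold before ranking.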
