TM02. Collocation

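Once the frequency of individual words has been analyzed, the next question is usually the context in which words are used. Stated operationally: which words appear near a given word, and which words tend to appear together? Collocation is the natural starting point for methods of this kind.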

Def. Collocation [Wikipedia:collocation]. In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, collocation is a sub-type of phraseme. An example of a phraseological collocation, as propounded by Michael Halliday,[1] is the expression strong tea. While the same meaning could be conveyed by the roughly equivalent powerful tea, this expression is considered excessive and awkward by English speakers.

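Following the course materials of Prof. 黃瀚萱 (Department of Computer Science, National Chengchi University), this unit introduces collocation with a fairly long text, The Communist Manifesto, and tests whether some characteristics of that text can be found. The text can be downloaded free of charge from Project Gutenberg. You can also download other English books and run the same tests to observe how texts differ.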

1. Term frequency

Loading the data

Note: this example assumes that corpus02.txt sits in a folder named data. If the file cannot be read, adjust the path accordingly.

with open("data/corpus02.txt", encoding="utf8") as fin:
    text = fin.read()
print("Number of characters: %d" % len(text))

Number of characters: 75346

Text preprocessing

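When to remove stopwords. For single-word analysis with a general purpose, such as finding the keywords of a text, stopwords are usually filtered out because they contribute nothing. Some analyses need them, though. Suppose you study the speaking habits of U.S. presidential candidates. In terms of word frequency, "speaking habits" has two sides: which rare or distinctive words a candidate uses, and how the candidate habitually uses function words and conjunctions. An example of the second is the use of pronouns such as We, You, and I. Pronoun analysis is often the main finding in studies of speeches, and it can even reveal how a candidate's habits changed between two campaigns. Collocation (word-pair) analysis faces the same choice of whether to remove stopwords, and it adds the question of when: before the pairs are built, or after? If after, should a pair be dropped only when both words are stopwords (an "and" condition) or when either one is (an "or" condition)? These choices depend on the type of text, and they affect the later analysis and interpretation.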
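The timing question can be made concrete with a small sketch. It is only an illustration: the six-word token list and the two-word stopword set are made up for this example and are not taken from the corpus or from NLTK.

```python
from collections import Counter

# Toy data; both the tokens and the tiny stopword set are assumptions for illustration.
toy_tokens = "the history of the class struggle".split()
toy_stopwords = {"the", "of"}

def adjacent_pairs(words):
    """Return every pair of neighbouring words."""
    return list(zip(words, words[1:]))

# (a) Filter first, then pair: words that were not adjacent can become neighbours.
content_words = [w for w in toy_tokens if w not in toy_stopwords]
pairs_filter_first = Counter(adjacent_pairs(content_words))

# (b) Pair first, keep a pair only when both words are content words.
pairs_both_content = Counter(
    (w1, w2) for w1, w2 in adjacent_pairs(toy_tokens)
    if w1 not in toy_stopwords and w2 not in toy_stopwords
)

# (c) Pair first, keep a pair when at least one word is a content word.
pairs_any_content = Counter(
    (w1, w2) for w1, w2 in adjacent_pairs(toy_tokens)
    if w1 not in toy_stopwords or w2 not in toy_stopwords
)

print(pairs_filter_first)
print(pairs_both_content)
print(pairs_any_content)
```

Strategy (a) produces the pair (history, class) even though the two words never sit next to each other in the toy sentence, while (b) keeps only (class, struggle). Which behaviour is wanted depends on the research question.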

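What does "type of text" mean here? Tweets, news articles, speeches, books, and research papers differ greatly in vocabulary and style, so the analysis has to fit the text. Suppose I have tens of thousands of research-paper abstracts. Each abstract has several sentences, and each sentence roughly describes one of BACKGROUND, MOTIVATION, METHOD, RESULTS, or DISCUSSION. To decide which category a sentence belongs to, I would not analyze the keywords specific to the sentence or paper, because those are that paper's own vocabulary and say nothing about the sentence type. Instead, perfectly ordinary phrases such as "The result shows that", "we conclude", or "we review..." are enough to tell which kind of sentence it is. In a case like this I would choose not to remove stopwords.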

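Downloading the NLTK tokenizer and stopword data. Tokenizing or removing stopwords with NLTK requires downloading NLTK's data first: nltk.download('punkt') for the tokenizer and nltk.download('stopwords') for the list read by stopword_list = stopwords.words('english'), both shown below. If you have already downloaded them (for example, by following TM01. Term frequency in this book), there is no need to download them again.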

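import nltk
# Run once if the tokenizer data is missing
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt').
# nltk.download('punkt')

from nltk.tokenize import word_tokenize
# Stopword removal is left disabled at this stage; uncomment to enable it.
# from nltk.corpus import stopwords
# stopword_list = stopwords.words('english')

raw_tokens = word_tokenize(text)
tokens = []
for token in raw_tokens:
    # Keep alphabetic tokens only (drops punctuation and numbers).
    if token.isalpha():
        # if token.lower() not in stopword_list:
        tokens.append(token.lower())
print("Number of tokens: %d" % len(tokens))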

Print the most frequent words in the text

from collections import Counter
word_counts = Counter(tokens)
for w, c in word_counts.most_common(20):
    print("%s\t%d" % (w, c))

2. Word pairs (collocation)

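N-grams. The idea of collocation can be introduced through n-grams. Treat the text as one long list of words and take N consecutive words at a time: two, three, four, and so on. For example, for "Programming for Social Scientists":

the 2-grams are ["Programming for", "for Social", "Social Scientists"]
the 3-grams are ["Programming for Social", "for Social Scientists"], and so on.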
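The n-gram idea above can be sketched in a few lines. The helper name ngrams is chosen for this illustration only; it is not part of the unit's code.

```python
def ngrams(words, n):
    """Return every run of n consecutive words, joined with spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "Programming for Social Scientists".split()
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(bigrams)
print(trigrams)
```

A list of k words yields k - n + 1 n-grams, which is why the four-word example has three 2-grams and two 3-grams.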

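The concept of collocation. A collocation is a word pair, but what counts as a pair is flexible. When reading, we often notice that two words keep showing up together. They may be adjacent, or they may be separated by a few words, and how many intervening words still count as "together" is up to the analyst. N-grams count only consecutive words, whereas collocation has no such strict definition. The approaches fall roughly into the following four, which this unit introduces one by one.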

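Frequency-based: how often a word pair occurs within a given distance
Mean and Variance: considers the distance between the two words, not just the frequency
Hypothesis Testing: uses statistical tests to check whether a pair's frequency is significant
Mutual Information: uses probabilities to check whether the two words are dependent or independent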

Count and print the most frequent word pairs in the text

word_pair_counts = Counter()
for i in range(len(tokens) - 1):
    (w1, w2) = (tokens[i], tokens[i + 1])
    word_pair_counts[(w1, w2)] += 1

for pair, c in word_pair_counts.most_common(20):
    print("%s\t%s\t%d" % (pair[0], pair[1], c))

of	the	244
in	the	91
the	bourgeoisie	66
the	proletariat	50
to	the	43
by	the	40
for	the	38
of	production	38
with	the	34
the	bourgeois	33
conditions	of	29
means	of	25
of	society	24
against	the	23
on	the	23
working	class	23
to	be	22
of	all	22
is	the	21
the	communists	21

# Same result as above; this version unpacks the pair directly in the for statement.
for (w1, w2), c in word_pair_counts.most_common(20):
    print("%s\t%s\t%d" % (w1, w2, c))

of	the	244
in	the	91
the	bourgeoisie	66
the	proletariat	50
to	the	43
by	the	40
for	the	38
of	production	38
with	the	34
the	bourgeois	33
conditions	of	29
means	of	25
of	society	24
against	the	23
on	the	23
working	class	23
to	be	22
of	all	22
is	the	21
the	communists	21

Removing stopwords

# nltk.download('stopwords')
from nltk.corpus import stopwords
stopword_list = stopwords.words('english')

word_pair_nosw_counts = Counter()
for i in range(len(tokens) - 1):
    (w1, w2) = (tokens[i], tokens[i + 1])
    if w1 not in stopword_list and w2 not in stopword_list:
        word_pair_nosw_counts[(w1, w2)] += 1
    
for (w1, w2), c in word_pair_nosw_counts.most_common(20):
    print("%s\t%s\t%d" % (w1, w2, c))

working	class	23
bourgeois	society	15
class	antagonisms	11
modern	industry	11
ruling	class	11
productive	forces	9
modern	bourgeois	8
middle	ages	7
bourgeois	property	7
private	property	7
feudal	society	6
middle	class	6
social	conditions	6
property	relations	6
class	struggle	6
old	society	6
petty	bourgeois	6
existing	society	5
one	word	5
bourgeois	socialism	5

3. Taking the mean distance between paired words into account

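So far we have considered only 2-grams, that is, adjacent words. In practice, though, the two words of a pair are often separated by several other words. The pair "open ... door", for example, can occur in forms like the ones below, and its typical distance is roughly 2 to 4.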

open the door
open the black door
open the third closet door
open a bottle of wine and put on the table near the door

Computing word distances

# Distances 1 to window_size - 1 (here 1 to 8) are counted.
window_size = 9

word_pair_counts = Counter()
word_pair_distance_counts = Counter()
for i in range(len(tokens) - 1):
    for distance in range(1, window_size):
        if i + distance < len(tokens):
            w1 = tokens[i]
            w2 = tokens[i + distance]
            word_pair_distance_counts[(w1, w2, distance)] += 1
            word_pair_counts[(w1, w2)] += 1

for (w1, w2, distance), c in word_pair_distance_counts.most_common(20):
    print("%s\t%s\t%d\t%d" % (w1, w2, distance, c))

the	of	2	302
of	the	1	244
the	the	3	186
the	the	8	134
the	the	6	129
the	the	7	126
the	of	3	125
the	the	4	117
the	the	5	114
of	the	4	92
in	the	1	91
of	the	8	91
of	the	6	88
the	of	7	81
the	of	6	77
of	the	7	76
the	of	5	76
the	of	8	75
of	the	5	72
the	bourgeoisie	1	66

print(word_pair_distance_counts.most_common(1)[0])

print(word_pair_distance_counts['the', 'of', 1])
print(word_pair_distance_counts['the', 'of', 100])


for distance in range(1, window_size):
    print("Occurrences of the word pair (%s, %s) with a distance of %d: %d" % (
        'the', 'of', distance, word_pair_distance_counts['the', 'of', distance]))

# A distance of 2 means exactly one word in between.
print("Occurrences of the usage 'the * of'")
print(word_pair_distance_counts['the', 'of', 2])

print("Occurrences of the usage 'of * the'")
print(word_pair_distance_counts['of', 'the', 2])

(('the', 'of', 2), 302)
3
0
Occurrences of the word pair (the, of) with a distance of 1: 3
Occurrences of the word pair (the, of) with a distance of 2: 302
Occurrences of the word pair (the, of) with a distance of 3: 125
Occurrences of the word pair (the, of) with a distance of 4: 59
Occurrences of the word pair (the, of) with a distance of 5: 76
Occurrences of the word pair (the, of) with a distance of 6: 77
Occurrences of the word pair (the, of) with a distance of 7: 81
Occurrences of the word pair (the, of) with a distance of 8: 75
Occurrences of the usage 'the * of'
302
Occurrences of the usage 'of * the'
27
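As a worked example, the mean distance of the pair (the, of) can be computed by hand from the eight counts printed above. The dictionary below simply copies those counts; the variable names are introduced here for illustration.

```python
# Counts copied from the output above: distance -> occurrences of (the, of).
the_of_counts = {1: 3, 2: 302, 3: 125, 4: 59, 5: 76, 6: 77, 7: 81, 8: 75}

total = sum(the_of_counts.values())
# Count-weighted average of the distances.
mean_distance = sum(d * c for d, c in the_of_counts.items()) / total
# Share of all occurrences that fall at distance 2 ("the * of").
share_at_2 = the_of_counts[2] / total

print(total)
print(round(mean_distance, 3))
print(round(share_at_2, 3))
```

The mean comes out at about 4, even though more than a third of the occurrences sit at distance 2. A mean alone can therefore hide a strong positional pattern, which is one reason to look at the spread of the distances as well.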

Filtering word pairs by mean distance

pair_mean_distances = Counter()

for (w1, w2, distance), c in word_pair_distance_counts.most_common():
    pair_mean_distances[(w1, w2)] += distance * (c / word_pair_counts[(w1, w2)])

for (w1, w2), distance in pair_mean_distances.most_common(20):
    print("%s\t%s\t%f\t%d" % (w1, w2, distance, word_pair_counts[(w1, w2)]))

of	destroyed	8.000000	3
to	petty	8.000000	3
necessarily	of	8.000000	3
in	communistic	8.000000	2
world	and	8.000000	2
and	existing	8.000000	2
is	slave	8.000000	2
an	each	8.000000	2
time	society	8.000000	2
an	communication	8.000000	2
every	we	8.000000	2
that	word	8.000000	2
bourgeoisie	market	8.000000	2
of	feet	8.000000	2
population	property	8.000000	2
at	these	8.000000	2
our	relations	8.000000	2
is	subsistence	8.000000	2
disposal	the	8.000000	2
wealth	bourgeoisie	8.000000	2

Observing word pairs: mean distance

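Below, after discarding pairs that occur only once, we list the pairs with the largest, the smallest, and the middle mean distances (note that the ranking is by mean distance, not by pair count). What information does ranking by mean distance give that ranking by frequency does not? Then compare the 20 pairs with the largest, the middle, and the smallest mean distance: which group tells you the most about the text?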

pair_mean_distances = Counter()

for (w1, w2, distance), c in word_pair_distance_counts.most_common():
    if word_pair_counts[(w1, w2)] > 1:
        pair_mean_distances[(w1, w2)] += distance * (c / word_pair_counts[(w1, w2)])

for (w1, w2), distance in pair_mean_distances.most_common(20):
    print("%s\t%s\t%f\t%d" % (w1, w2, distance, word_pair_counts[(w1, w2)]))

of	destroyed	8.000000	3
to	petty	8.000000	3
necessarily	of	8.000000	3
in	communistic	8.000000	2
world	and	8.000000	2
and	existing	8.000000	2
is	slave	8.000000	2
an	each	8.000000	2
time	society	8.000000	2
an	communication	8.000000	2
every	we	8.000000	2
that	word	8.000000	2
bourgeoisie	market	8.000000	2
of	feet	8.000000	2
population	property	8.000000	2
at	these	8.000000	2
our	relations	8.000000	2
is	subsistence	8.000000	2
disposal	the	8.000000	2
wealth	bourgeoisie	8.000000	2

for (w1, w2), distance in pair_mean_distances.most_common()[-20:]:
    print("%s\t%s\t%f\t%d" % (w1, w2, distance, word_pair_counts[(w1, w2)]))

continued	existence	1.000000	2
complete	systems	1.000000	2
they	wish	1.000000	2
new	jerusalem	1.000000	2
every	revolutionary	1.000000	2
revolutionary	movement	1.000000	2
could	be	1.000000	2
and	others	1.000000	2
these	systems	1.000000	2
social	science	1.000000	2
most	suffering	1.000000	2
suffering	class	1.000000	2
antagonisms	they	1.000000	2
their	ends	1.000000	2
a	critical	1.000000	2
these	proposals	1.000000	2
chiefly	to	1.000000	2
they	support	1.000000	2
partly	of	1.000000	2
existing	social	1.000000	2

num_pairs = len(pair_mean_distances)
mid = num_pairs // 2
for (w1, w2), distance in pair_mean_distances.most_common()[mid-10:mid+10]:
    print("%s\t%s\t%f\t%d" % (w1, w2, distance, word_pair_counts[(w1, w2)]))

which	laborer	4.500000	2
in	bare	4.500000	2
a	existence	4.500000	2
means	appropriation	4.500000	2
intend	the	4.500000	2
personal	appropriation	4.500000	2
labor	appropriation	4.500000	2
and	leaves	4.500000	2
away	is	4.500000	2
is	far	4.500000	2
the	living	4.500000	2
to	accumulated	4.500000	2
increase	labor	4.500000	2
accumulated	is	4.500000	2
enrich	the	4.500000	2
past	society	4.500000	4
dominates	present	4.500000	2
the	person	4.500000	2
trade	selling	4.500000	2
but	selling	4.500000	2

Observing word pairs: standard deviation of the distance

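On the same basis, the standard deviation measures how variable a pair's distance is. If the two words are always the same distance apart, the variability is low and the pattern is fixed; if the distance varies a lot, the standard deviation is high. The Top 20 below have the highest standard deviation: each pair occurs exactly twice, once at distance 8 and once at distance 1, so the mean is exactly (8 + 1) / 2 = 4.5. Because the pairs with a high standard deviation are all among the least frequent ones, the Top 20 says little. The Bottom 20 are also mostly pairs that occur twice, both times at the same distance, so their standard deviation is 0. Taken on these results alone, neither the Top 20 nor the Bottom 20 offers meaningful information, mainly because both are dominated by pairs that occur only twice. So far we have kept every pair that occurs more than once; in the next subsection we raise the threshold and keep only pairs that occur more than 10 times.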

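The formula is the sample standard deviation of the distances $d_1, \dots, d_n$ at which a pair occurs, with $\bar{d}$ the pair's mean distance:

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}} $$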

pair_deviations = Counter()
for (w1, w2, distance), c in word_pair_distance_counts.most_common():
    if word_pair_counts[(w1, w2)] > 1:
        pair_deviations[(w1, w2)] += c * ((distance - pair_mean_distances[(w1, w2)]) ** 2)
    
for (w1, w2), dev_tmp in pair_deviations.most_common():
    s_2 = dev_tmp / (word_pair_counts[(w1, w2)] - 1)
    pair_deviations[(w1, w2)] = s_2 ** 0.5

for (w1, w2), dev in pair_deviations.most_common(20):
    print("%s\t%s\t%f\t%f\t%d" % (w1, w2, pair_mean_distances[(w1, w2)], dev, word_pair_counts[(w1, w2)]))

the	branding	4.500000	4.949747	2
sketched	the	4.500000	4.949747	2
class	struggles	4.500000	4.949747	2
journeyman	in	4.500000	4.949747	2
in	almost	4.500000	4.949747	2
epochs	of	4.500000	4.949747	2
everywhere	a	4.500000	4.949747	2
old	bourgeois	4.500000	4.949747	2
navigation	and	4.500000	4.949747	2
vanished	in	4.500000	4.949747	2
the	giant	4.500000	4.949747	2
the	leaders	4.500000	4.949747	2
armies	the	4.500000	4.949747	2
its	capital	4.500000	4.949747	2
oppressed	class	4.500000	4.949747	2
the	executive	4.500000	4.949747	2
enthusiasm	of	4.500000	4.949747	2
of	egotistical	4.500000	4.949747	2
its	relation	4.500000	4.949747	2
ones	that	4.500000	4.949747	2
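The value 4.949747 that repeats in the rows above can be checked by hand for a pair seen once at distance 1 and once at distance 8. The sketch below uses Python's statistics module as an independent check and then repeats the count-weighted sum that the loop above accumulates; the variable names are introduced here for illustration.

```python
import statistics

observed = [1, 8]  # the two distances at which such a pair was seen
mean = statistics.mean(observed)
sd = statistics.stdev(observed)  # sample standard deviation (divides by n - 1)

# The same value via the count-weighted sum of squared deviations.
counts = {1: 1, 8: 1}  # distance -> number of occurrences
n = sum(counts.values())
weighted = sum(c * (d - mean) ** 2 for d, c in counts.items())
sd_from_counts = (weighted / (n - 1)) ** 0.5

print(mean, sd, sd_from_counts)
```

With only two observations, the standard deviation is determined entirely by the gap between them, so every pair with the (1, 8) pattern ties at the same value.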

for (w1, w2), dev in pair_deviations.most_common()[-20:]:
    print("%s\t%s\t%f\t%f\t%d" % (w1, w2, pair_mean_distances[(w1, w2)], dev, word_pair_counts[(w1, w2)]))

being	class	4.000000	0.000000	2
most	suffering	1.000000	0.000000	2
suffering	class	1.000000	0.000000	2
antagonisms	they	1.000000	0.000000	2
the	appeal	6.000000	0.000000	2
they	ends	5.000000	0.000000	2
their	ends	1.000000	0.000000	2
a	critical	1.000000	0.000000	2
these	proposals	1.000000	0.000000	2
of	c	6.000000	0.000000	2
chiefly	to	1.000000	0.000000	2
chiefly	germany	2.000000	0.000000	2
communists	against	6.000000	0.000000	2
they	support	1.000000	0.000000	2
partly	of	1.000000	0.000000	2
partly	in	4.000000	0.000000	2
germany	fight	2.000000	0.000000	2
revolutionary	against	2.000000	0.000000	2
germany	immediately	8.000000	0.000000	2
existing	social	1.000000	0.000000	2

Observing word pairs: frequent pairs only

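The previous subsection showed that means and standard deviations computed on low-frequency pairs make it hard to see the character of the text. Here we raise the inclusion threshold to more than 10 occurrences to see whether something useful emerges. Looking at the 20 pairs with the lowest variability, we finally see pairs that are both frequent and stable in distance.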

pair_deviations = Counter()
for (w1, w2, distance), c in word_pair_distance_counts.most_common():
    if word_pair_counts[(w1, w2)] > 10:
        pair_deviations[(w1, w2)] += c * ((distance - pair_mean_distances[(w1, w2)]) ** 2)
    
for (w1, w2), dev_tmp in pair_deviations.most_common():
    s_2 = dev_tmp / (word_pair_counts[(w1, w2)] - 1)
    pair_deviations[(w1, w2)] = s_2 ** 0.5

for (w1, w2), dev in pair_deviations.most_common()[-20:]:
    print("%s\t%s\t%f\t%f\t%d" % (w1, w2, pair_mean_distances[(w1, w2)], dev, word_pair_counts[(w1, w2)]))

be	and	3.833333	1.466804	12
have	of	5.869565	1.455533	23
to	be	1.416667	1.442120	24
the	communism	4.454545	1.439697	11
of	can	5.000000	1.414214	11
in	but	5.000000	1.414214	11
for	a	2.076923	1.382120	13
for	class	5.363636	1.361817	11
as	and	5.153846	1.344504	13
has	of	5.730769	1.343360	26
it	has	1.409091	1.333063	22
working	class	1.360000	1.319091	25
in	class	5.272727	1.272078	11
bourgeois	society	1.312500	1.250000	16
it	of	6.088235	1.239933	34
by	means	1.692308	0.947331	13
in	proportion	1.500000	0.904534	12
they	are	1.250000	0.866025	12
modern	industry	1.166667	0.577350	12
ruling	class	1.000000	0.000000	11

4. Pearson's Chi-Square Test

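Statistical tests can also be used to find significant word pairs. The two usual choices are the t-test and the chi-square test. The t-test assumes that the underlying probability distribution is approximately normal. The chi-square test is conceptually easier to understand.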

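An intuition for chi-square. Think of an everyday question. After watching for a semester, I suspect that classmates w1 and w2 are a couple. How can the feeling that "they are a pair" be operationalized? Simply observe every class and every department event during the semester. Suppose you attended 100 classes and events in total. What would convince you that w1 and w2 are a couple? You would quickly think: where w1 is, w2 is too, and when w1 is absent, w2 stays away as well. In other words, the occasions on which w1 shows up without w2, or w2 without w1, are rare. So you can imagine a score of the form (times both come + times neither comes - times w1 comes without w2 - times w2 comes without w1).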

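In the 2x2 contingency table, O11, O12, O21, and O22 are:

O11: the number of times both w1 and w2 come
O22: the number of times neither w1 nor w2 comes
O21: the number of times w1 comes and w2 does not
O12: the number of times w2 comes and w1 does not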

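Written in these terms, the idea above becomes (O11 + O22 - O21 - O12). The chi-square statistic instead uses (O11*O22 - O21*O12), which has the same effect: the product of the times both come and the times neither comes, minus the product of the times only one of them comes. Mathematically, the remaining terms of the chi-square formula can be regarded as normalization.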

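Definition of chi-square

$$ \chi^2 = \frac{N \, (O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11} + O_{12})(O_{11} + O_{21})(O_{12} + O_{22})(O_{21} + O_{22})}, \qquad N = O_{11} + O_{12} + O_{21} + O_{22} $$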

Writing the chi-square function

def chisquare(o11, o12, o21, o22):
    n = o11 + o12 + o21 + o22
    x_2 = (n * ((o11 * o22 - o12 * o21)**2)) / ((o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)) 
    return x_2
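To connect the function with the classroom story, here is a small worked example. The attendance numbers are hypothetical and chosen only for illustration; the function is repeated so that the sketch runs on its own.

```python
def chisquare(o11, o12, o21, o22):
    n = o11 + o12 + o21 + o22
    return (n * ((o11 * o22 - o12 * o21) ** 2)) / (
        (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22))

# Hypothetical semester of 100 events.
# Suspected couple: both present 40 times, neither 50 times, one without the other 5 times each.
couple = chisquare(40, 5, 5, 50)

# Two classmates who attend independently: every cell is equally likely.
strangers = chisquare(25, 25, 25, 25)

print(round(couple, 2), strangers)
```

The independent pair scores exactly 0, while the suspected couple scores about 63.7. For a 2x2 table (one degree of freedom), the usual critical value at the 0.05 level is 3.841, so the second table would be judged far from independent.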

Back to bigrams

word_pair_counts = Counter()
word_counts = Counter(tokens)
num_bigrams = 0

for i in range(len(tokens) - 1):
    w1 = tokens[i]
    w2 = tokens[i + 1]
    word_pair_counts[(w1, w2)] += 1
    num_bigrams += 1
print(num_bigrams) # 11780

Computing chi-square for each word pair

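pair_chi_squares = Counter()
for (w1, w2), w1_w2_count in word_pair_counts.most_common():
    # Unigram counts stand in for "w1 as first word" and "w2 as second word".
    w1_only_count = word_counts[w1] - w1_w2_count
    w2_only_count = word_counts[w2] - w1_w2_count
    rest_count = num_bigrams - w1_only_count - w2_only_count - w1_w2_count
    # The statistic is symmetric in o12 and o21, so the order of the two "only" counts does not matter.
    pair_chi_squares[(w1, w2)] = chisquare(w1_w2_count, w1_only_count, w2_only_count, rest_count)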

for (w1, w2), x_2 in pair_chi_squares.most_common(20):
    print("%s\t%s\t%d\t%f" % (w1, w2, word_pair_counts[(w1, w2)], x_2))

third	estate	2	11780.000000
constitution	adapted	2	11780.000000
karl	marx	1	11780.000000
frederick	engels	1	11780.000000
czar	metternich	1	11780.000000
police	spies	1	11780.000000
nursery	tale	1	11780.000000
italian	flemish	1	11780.000000
danish	languages	1	11780.000000
plebeian	lord	1	11780.000000
complicated	arrangement	1	11780.000000
manifold	gradation	1	11780.000000
patricians	knights	1	11780.000000
knights	plebeians	1	11780.000000
journeymen	apprentices	1	11780.000000
subordinate	gradations	1	11780.000000
cape	opened	1	11780.000000
proper	serving	1	11780.000000
left	remaining	1	11780.000000
heavenly	ecstacies	1	11780.000000
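Every row above ties at 11780, which is the number of bigrams. A short sketch shows why: when two words occur only together, both off-diagonal cells are 0 and the statistic collapses to N regardless of the pair count. The helper score and the assumption that each word in these pairs occurs only inside the pair are introduced here for illustration (the tied scores suggest it, but the word counts themselves are not printed in this unit).

```python
def chisquare(o11, o12, o21, o22):
    n = o11 + o12 + o21 + o22
    return (n * ((o11 * o22 - o12 * o21) ** 2)) / (
        (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22))

NUM_BIGRAMS = 11780  # value printed by the bigram-counting cell

def score(pair_count, w1_count, w2_count, n=NUM_BIGRAMS):
    """Build the 2x2 table the same way as the loop above and score it."""
    w1_only = w1_count - pair_count
    w2_only = w2_count - pair_count
    rest = n - w1_only - w2_only - pair_count
    return chisquare(pair_count, w1_only, w2_only, rest)

once = score(1, 1, 1)   # a pair seen once, each word seen once
twice = score(2, 2, 2)  # a pair seen twice, each word seen twice
print(once, twice)
```

Because a pair that occurs a single time can reach the maximum score, the unfiltered ranking rewards rarity rather than evidence, which motivates the minimum-count filter used next.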

Adjusting the results (top N, stopword removal)

pair_chi_squares = Counter()
for (w1, w2), w1_w2_count in word_pair_counts.most_common():
    if w1_w2_count > 5:
        w1_only_count = word_counts[w1] - w1_w2_count
        w2_only_count = word_counts[w2] - w1_w2_count
        rest_count = num_bigrams - w1_only_count - w2_only_count - w1_w2_count
        pair_chi_squares[(w1, w2)] = chisquare(w1_w2_count, w1_only_count, w2_only_count, rest_count)

for (w1, w2), x_2 in pair_chi_squares.most_common(20):
    print("%s\t%s\t%d\t%f" % (w1, w2, word_pair_counts[(w1, w2)], x_2))

productive	forces	9	9636.544203
middle	ages	7	4928.714781
no	longer	14	4150.496033
working	class	23	2477.732678
modern	industry	11	1128.037662
class	antagonisms	11	1042.736309
private	property	7	1022.522314
ruling	class	11	966.767323
can	not	9	775.745125
their	own	11	759.449519
proportion	as	8	720.619853
have	been	7	702.438620
it	has	20	652.817376
away	with	8	563.758348
to	be	22	468.075784
just	as	6	439.987822
of	the	244	406.859323
the	bourgeoisie	66	397.199063
its	own	8	392.296908
petty	bourgeois	6	381.069314

for (w1, w2), x_2 in pair_chi_squares.most_common()[-20:]:
    print("%s\t%s\t%d\t%f" % (w1, w2, word_pair_counts[(w1, w2)], x_2))

is	the	21	4.412587
and	by	7	2.913108
at	the	7	2.592918
and	that	7	2.542680
be	the	7	2.367553
proletariat	the	10	2.357617
bourgeois	the	6	1.654642
that	of	6	0.910970
and	of	20	0.906966
of	class	9	0.569228
bourgeoisie	the	7	0.548588
that	the	15	0.476118
of	its	7	0.436822
and	in	11	0.401790
society	the	6	0.307430
all	the	11	0.192300
the	property	6	0.041125
and	to	9	0.027804
the	class	10	0.009971
class	the	10	0.009971

pair_chi_squares = Counter()
for (w1, w2), w1_w2_count in word_pair_counts.most_common():
    if w1_w2_count > 1 and w1 not in stopword_list and w2 not in stopword_list:
        w1_only_count = word_counts[w1] - w1_w2_count
        w2_only_count = word_counts[w2] - w1_w2_count
        rest_count = num_bigrams - w1_only_count - w2_only_count - w1_w2_count
        pair_chi_squares[(w1, w2)] = chisquare(w1_w2_count, w1_only_count, w2_only_count, rest_count)

for (w1, w2), x_2 in pair_chi_squares.most_common(20):
    print("%s\t%s\t%d\t%f" % (w1, w2, word_pair_counts[(w1, w2)], x_2))

third	estate	2	11780.000000
constitution	adapted	2	11780.000000
productive	forces	9	9636.544203
eternal	truths	3	8834.249809
corporate	guilds	2	7852.666553
absolute	monarchy	4	7537.599406
eighteenth	century	3	7066.799694
immense	majority	3	6624.749575
laid	bare	2	5888.999830
distinctive	feature	2	5234.221968
torn	asunder	2	5234.221968
middle	ages	7	4928.714781
radical	rupture	2	4710.799796
buying	disappears	2	3925.333107
let	us	3	3310.499278
upper	hand	2	2943.499745
commercial	crises	2	2943.499745
various	stages	2	2943.499745
united	action	3	2942.749427
raw	material	2	2616.221958

for (w1, w2), x_2 in pair_chi_squares.most_common()[-20:]:
    print("%s\t%s\t%d\t%f" % (w1, w2, word_pair_counts[(w1, w2)], x_2))

many	bourgeois	2	49.412739
bourgeois	private	2	44.087627
petty	bourgeoisie	2	43.023037
modern	working	2	41.985003
bourgeois	freedom	2	39.732201
revolutionary	proletariat	2	35.100229
old	conditions	2	32.566311
feudal	property	2	31.386462
old	property	2	28.685615
bourgeois	revolution	2	26.136797
modern	bourgeoisie	3	21.380029
whole	bourgeoisie	2	20.752016
revolutionary	class	2	20.224783
bourgeois	state	2	17.063629
every	class	2	16.075081
bourgeois	conditions	3	16.040705
bourgeois	form	2	14.680146
one	class	2	10.562861
bourgeois	production	2	5.662454
bourgeois	class	3	5.261506

5. Mutual Information (MI)

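Another method now commonly used to assess the dependence between two words is mutual information (MI). Strictly speaking, the score computed in this section is pointwise mutual information (PMI), which work on collocation often also calls mutual information. See Wikipedia:Mutual Information for details.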

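In probability theory and information theory, the mutual information (MI), or transinformation, of two random variables is a measure of the mutual dependence between the variables. Unlike the correlation coefficient, mutual information is not limited to real-valued random variables; it is more general and determines how similar the joint distribution p(X,Y) is to the product of the factored marginal distributions p(X)p(Y). Mutual information is the expected value of the pointwise mutual information (PMI). (from Wikipedia:Mutual Information, Chinese edition)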

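The idea behind PMI is as follows. P(x, y) is the probability that x and y occur together, and

$$ \mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} $$

If x and y are independent, P(x, y) equals P(x)P(y), which is the denominator, so the ratio is 1 and its logarithm is 0. PMI is therefore 0 under independence, positive when the two words co-occur more often than chance predicts, and negative when they co-occur less often.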

Define a mutual information function

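Written with the cells of the chi-square table, the ratio P(x, y) / (P(x)P(y)) inside the logarithm is (O11/N) * (N*N / ((O11+O21) * (O11+O12))) = O11*N / ((O11+O21) * (O11+O12)). Both the numerator and the denominator are easy to make sense of. The only difference from chi-square is that O22 does not appear explicitly; it enters only through N.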

import math
def mutual_information(w1_w2_prob, w1_prob, w2_prob):
    return math.log2(w1_w2_prob / (w1_prob * w2_prob))
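A few hand-picked probabilities show how the function behaves. The numbers are hypothetical and chosen so that the results are easy to verify; the function is repeated so that the sketch runs on its own.

```python
import math

def mutual_information(w1_w2_prob, w1_prob, w2_prob):
    return math.log2(w1_w2_prob / (w1_prob * w2_prob))

# Each word appears in half of the positions (hypothetical values).
independent = mutual_information(0.25, 0.5, 0.5)  # joint equals the product
attracted = mutual_information(0.5, 0.5, 0.5)     # joint is twice the product
repelled = mutual_information(0.1, 0.5, 0.5)      # joint is below the product

print(independent, attracted, repelled)
```

The three cases give 0, 1, and a negative value respectively. Each unit of the score corresponds to a doubling of the observed co-occurrence relative to what independence would predict, because the logarithm is base 2.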

Computing MI

num_unigrams = sum(word_counts.values())

pair_mutual_information_scores = Counter()
for (w1, w2), w1_w2_count in word_pair_counts.most_common():
    if w1_w2_count > 0:
        w1_prob = word_counts[w1] / num_unigrams
        w2_prob = word_counts[w2] / num_unigrams
        w1_w2_prob = w1_w2_count / num_bigrams
        pair_mutual_information_scores[(w1, w2)] = mutual_information(w1_w2_prob, w1_prob, w2_prob)

for (w1, w2), mi in pair_mutual_information_scores.most_common(20):
    print("%s\t%s\t%d\t%f" % (w1, w2, word_pair_counts[(w1, w2)], mi))

karl	marx	1	13.524297
frederick	engels	1	13.524297
czar	metternich	1	13.524297
police	spies	1	13.524297
nursery	tale	1	13.524297
italian	flemish	1	13.524297
danish	languages	1	13.524297
plebeian	lord	1	13.524297
complicated	arrangement	1	13.524297
manifold	gradation	1	13.524297
patricians	knights	1	13.524297
knights	plebeians	1	13.524297
journeymen	apprentices	1	13.524297
subordinate	gradations	1	13.524297
cape	opened	1	13.524297
proper	serving	1	13.524297
left	remaining	1	13.524297
heavenly	ecstacies	1	13.524297
chivalrous	enthusiasm	1	13.524297
egotistical	calculation	1	13.524297
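The tied score 13.524297 above follows from the corpus size alone. The sketch below assumes 11780 bigrams (the value printed earlier) and therefore 11781 tokens, since the bigram loop runs once per token except the last; pmi_from_counts is a helper name introduced here for illustration.

```python
import math

NUM_BIGRAMS = 11780
NUM_UNIGRAMS = NUM_BIGRAMS + 1  # one more token than bigrams

def pmi_from_counts(pair_count, w1_count, w2_count):
    """PMI computed from raw counts, mirroring the probabilities used in the loop above."""
    pair_prob = pair_count / NUM_BIGRAMS
    w1_prob = w1_count / NUM_UNIGRAMS
    w2_prob = w2_count / NUM_UNIGRAMS
    return math.log2(pair_prob / (w1_prob * w2_prob))

hapax_pair = pmi_from_counts(1, 1, 1)  # both words occur once, together
twice_pair = pmi_from_counts(2, 2, 2)  # both words occur twice, always together
print(round(hapax_pair, 6), round(twice_pair, 6))
```

A pair whose words occur once each reaches the highest possible score, and a pair seen twice under the same conditions scores exactly one bit lower. PMI thus favours rare words even more strongly than chi-square does, so a minimum-count filter matters here as well.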

num_unigrams = sum(word_counts.values())

pair_mutual_information_scores = Counter()
for (w1, w2), w1_w2_count in word_pair_counts.most_common():
    if w1_w2_count > 5:
        w1_prob = word_counts[w1] / num_unigrams
        w2_prob = word_counts[w2] / num_unigrams
        w1_w2_prob = w1_w2_count / num_bigrams
        pair_mutual_information_scores[(w1, w2)] = mutual_information(w1_w2_prob, w1_prob, w2_prob)

for (w1, w2), mi in pair_mutual_information_scores.most_common(20):
    print("%s\t%s\t%d\t%f" % (w1, w2, word_pair_counts[(w1, w2)], mi))

productive	forces	9	10.064865
middle	ages	7	9.461287
no	longer	14	8.215308
private	property	7	7.202369
working	class	23	6.762457
modern	industry	11	6.699483
have	been	7	6.669874
class	antagonisms	11	6.582849
proportion	as	8	6.513070
ruling	class	11	6.475934
can	not	9	6.453431
just	as	6	6.223563
away	with	8	6.165646
their	own	11	6.138238
petty	bourgeois	6	6.020471
middle	class	6	5.708380
its	own	8	5.660885
property	relations	6	5.658048
not	only	7	5.550292
class	struggle	6	5.321357

num_unigrams = sum(word_counts.values())

pair_mutual_information_scores = Counter()
for (w1, w2), w1_w2_count in word_pair_counts.most_common():
    if w1_w2_count > 1 and w1 not in stopword_list and w2 not in stopword_list:
        w1_prob = word_counts[w1] / num_unigrams
        w2_prob = word_counts[w2] / num_unigrams
        w1_w2_prob = w1_w2_count / num_bigrams
        pair_mutual_information_scores[(w1, w2)] = mutual_information(w1_w2_prob, w1_prob, w2_prob)

for (w1, w2), mi in pair_mutual_information_scores.most_common(20):
    print("%s\t%s\t%d\t%f" % (w1, w2, word_pair_counts[(w1, w2)], mi))

third	estate	2	12.524297
constitution	adapted	2	12.524297
corporate	guilds	2	11.939334
eternal	truths	3	11.524297
laid	bare	2	11.524297
distinctive	feature	2	11.354372
torn	asunder	2	11.354372
eighteenth	century	3	11.202369
radical	rupture	2	11.202369
immense	majority	3	11.109259
buying	disappears	2	10.939334
absolute	monarchy	4	10.880441
upper	hand	2	10.524297
commercial	crises	2	10.524297
various	stages	2	10.524297
earlier	epochs	2	10.354372
raw	material	2	10.354372
complete	systems	2	10.354372
guild	masters	2	10.202369
best	possible	2	10.202369

<Exercise> Comparing word-pair analysis methods

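Compare and discuss: relative to Mutual Information, what are the drawbacks of the frequency-based approach?
Process metamorphosis_franz_kafka.txt (Kafka's The Metamorphosis) and find three kinds of collocations:
- Frequency-based
- Chi-square test
- Mutual information
If you used collocation methods to compare Metamorphosis and The Communist Manifesto, what differences would you expect to find, and how would you make the comparison?
metamorphosis_franz_kafka.txt contains a lot of dialogue and monologue, set off by double quotation marks. Build collocations using only the text inside the quotation marks.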

"Oh, God", he thought, "what a strenuous career it is that I’ve chosen! Travelling day in and day out. Doing business like this takes much more effort than doing your own business at home, and on top of that there's the curse of travelling, worries about making train connections, bad and irregular food, contact with different people all the time so that you can never get to know anyone or become friendly with them. It can all go to Hell!" He felt a slight itch up on his belly; pushed himself slowly up on his back towards the headboard so that he could lift his head better;

<Exercise> The effect of sentence segmentation on word-pair analysis

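Earlier, the distance of a word pair was defined over each word and the eight words that follow it (window_size = 9). Those words may lie across a sentence boundary and produce spurious pairs. So find a text with periods and other punctuation (Gutenberg texts may contain unnatural line breaks that cause errors), split it at every period or comma, and treat two words as a candidate collocation only when they fall in the same segment. This amounts to letting the window_size defined earlier follow the length of each sentence. Run it, compare with the results in this chapter, and see what differences you observe.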
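One possible starting point for this exercise is sketched below. It is not a reference solution: the sample sentence, the function name segment_pair_counts, the punctuation set, and the use of regular expressions instead of the NLTK tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter

sample = "Open the door. The cat sleeps, the dog barks."

def segment_pair_counts(text, max_distance=8):
    """Count (w1, w2, distance) triples without crossing punctuation boundaries."""
    counts = Counter()
    # Split at sentence and clause punctuation; each piece is handled separately.
    for segment in re.split(r"[.,;:!?]", text):
        words = [w.lower() for w in re.findall(r"[A-Za-z]+", segment)]
        for i, w1 in enumerate(words):
            # Pair w1 with later words in the same segment, up to max_distance away.
            for j in range(i + 1, min(i + max_distance + 1, len(words))):
                counts[(w1, words[j], j - i)] += 1
    return counts

counts = segment_pair_counts(sample)
for (w1, w2, distance), c in counts.most_common():
    print("%s\t%s\t%d\t%d" % (w1, w2, distance, c))
```

In the sample, "door" and the following "the" are never paired because a period separates them, whereas the fixed-window loop used earlier in this unit would have counted them at distance 1.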

