TM05. Word Embedding

Introduction

Features from documents to words

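A word embedding is a feature representation of a word. As the name suggests, it captures the context in which the word is embedded, that is, the context in which it is used. Its operational definition is therefore the set of words that appear around a given word. But how can that be recorded in a form we can compute with? Before turning to word embeddings, first consider what the features of a document are, that is, how a document can be represented.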

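The most intuitive way to represent a document's features is a bag-of-words: which words the document contains and how many times each one occurs. If I have 100 documents and want to record them together, the most natural way is a matrix. Each row is a document, each column is a word, and each cell records how many times that word occurs in that document. This matrix is the Document-Term-Matrix.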
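As a minimal sketch of the Document-Term-Matrix described above, assuming three toy documents that are already tokenized (the documents and variable names are illustrative, not the course data):

```python
from collections import Counter

# Toy, pre-tokenized documents (hypothetical examples, not the course data)
docs = [
    ["我", "喜歡", "優格"],
    ["我", "不", "喜歡", "牛奶"],
    ["優格", "好喝", "優格", "好吃"],
]

# Vocabulary: every distinct token, in a fixed (sorted) column order
vocab = sorted({tok for doc in docs for tok in doc})

# One row per document, one column per term, each cell a raw count
dtm = []
for doc in docs:
    counts = Counter(doc)  # a Counter returns 0 for terms the document lacks
    dtm.append([counts[term] for term in vocab])

for row in dtm:
    print(row)
```

Each row has as many cells as there are distinct words in the whole collection, so most cells are zero once the vocabulary grows.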

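Can a word's features likewise be expressed by the words around it? Certainly. That gives a Term-to-Term matrix, in which the columns record which words "co-occur" with the word in each row. But "co-occur" is clearly a definition that we can operationalize and design. It would be unwise to define co-occurrence as appearing in the same document: the first and last words of a document may have very little to do with each other, so that definition is impractical. One alternative is to define it as "words appearing in the same sentence" or "words appearing within 10 words of the target word". Word embedding takes the latter approach. The result is a Term-to-Term Matrix, and each row of this matrix can be regarded as a word vector.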
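The window-based definition of co-occurrence can be sketched as follows. This is only an illustration under stated assumptions: the function name `cooccurrence`, the toy sentences, and the window size of 2 (instead of 10) are all made up for the example.

```python
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Count, for each word, the words within `window` positions on either side."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[word][tokens[j]] += 1
    return counts

# Toy, pre-tokenized sentences (hypothetical examples)
sentences = [
    ["這", "家", "優格", "很", "好喝"],
    ["這", "家", "牛奶", "很", "好喝"],
]
counts = cooccurrence(sentences, window=2)
print(dict(counts["優格"]))
```

Here `counts[word]` plays the role of one row of the Term-to-Term matrix, stored sparsely as a dictionary rather than as a full row of zeros.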

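But such a Term-to-Term Matrix has a drawback. The matrix records which words surround each word, but the context words of any two words may well not overlap at all. In other words, most of the matrix is zero, so computing word-to-word similarity from it is not very convincing. The problem is likely smaller for a Document-Term-Matrix: if the documents were sampled purposively, they at least share some mid-frequency keywords and even stop words, so comparing the vocabulary of documents is not too strange.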
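To make the sparsity problem concrete, here is a small sketch of cosine similarity over sparse co-occurrence rows. The three rows and their counts are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {context_word: count}."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical co-occurrence rows for three words
row_a = {"家": 1, "很": 1, "好喝": 1}
row_b = {"家": 1, "很": 1, "好吃": 1}
row_c = {"政府": 2, "政策": 1}

print(cosine(row_a, row_b))  # the rows share two of three context words
print(cosine(row_a, row_c))  # the rows share no context words
```

Whenever two rows have no context word in common, the similarity is exactly zero, no matter how related the two words might be in meaning.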

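So ideally, each word would be represented in a lower-dimensional space (for example, 300 dimensions) without losing its context. How? Try reducing the dimensionality directly with SVD? Yes, some people do exactly that. Otherwise, use Word2Vec, which uses a neural network to learn these 300-dimensional word feature vectors.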
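The SVD route can be sketched with NumPy on a toy matrix. The counts, the word list, and the choice of k = 2 are assumptions for illustration; a real vocabulary would call for a sparse matrix and a truncated SVD routine rather than the full decomposition used here.

```python
import numpy as np

# Toy term-to-term count matrix: 4 words x 4 context words (hypothetical counts)
words = ["優格", "牛奶", "政府", "政策"]
cooc = np.array([
    [0, 3, 0, 0],
    [3, 0, 0, 1],
    [0, 0, 0, 4],
    [0, 1, 4, 0],
], dtype=float)

# Full SVD: cooc = U @ diag(S) @ Vt, with singular values in descending order
U, S, Vt = np.linalg.svd(cooc)

# Keep only the top-k dimensions; each row is now a dense k-dimensional word vector
k = 2
embeddings = U[:, :k] * S[:k]

for word, vec in zip(words, embeddings):
    print(word, np.round(vec, 3))
```

Unlike the raw count rows, the reduced vectors are dense, so two words can have a nonzero similarity even when their original context words do not overlap directly.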

Resource

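[自然語言處理] #1 Word to Vector 實作教學 [Natural Language Processing #1: Word to Vector tutorial] (Medium)

[自然語言處理] #2 Word to Vector 實作教學 (實作篇) [Natural Language Processing #2: Word to Vector tutorial, implementation part] (Medium)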

Slide

Using tidy approach to implement word embedding

Word Vectors with tidy data principles | Julia Silge

Tidy word vectors, take 2! | Julia Silge

Emil Hvitfeldt - R Developer & Data Scientist (www.hvitfeldt.me)

Preparation

Download Chinese word embeddings from CKIP https://ckip.iis.sinica.edu.tw/resource/
Unzip it and put it in a suitable directory

Training model

Read the CSV with pandas

import pandas as pd
df = pd.read_csv('data/sentiment.csv')
df.head(5)

print(len(df))
print(df['tag'].value_counts())

6388
N    3347
P    3041
Name: tag, dtype: int64

Tokenization with jieba

import jieba
df['token_text'] = df['text'].apply(lambda x:list(jieba.cut(x)))
df.head(5)

Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/61/5bvzqdmn7455dm96br7vs9jw0000gn/T/jieba.cache
Loading model cost 0.661 seconds.
Prefix dict has been built successfully.

Building model

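# Training
from gensim.models import Word2Vec
# gensim < 4.0 API; in gensim >= 4.0 the `size` argument is named `vector_size`.
# sg=0 trains CBOW (sg=1 would train skip-gram).
model = Word2Vec(df['token_text'], min_count=1, size=300, window=5, sg=0, workers=4)

# Saving (word vectors only, in the binary word2vec format)
model.wv.save_word2vec_format('model.bin', binary=True)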

Using model

# model.most_similar() is deprecated (removed in gensim 4.0.0); query the vectors through model.wv instead.
# '不會' (will not) + '他' (he) - '她' (she), then the reverse
print(model.wv.most_similar(positive=['不會', '他'], negative=['她'], topn=20))
print("-"*40)
print(model.wv.most_similar(positive=['不會', '她'], negative=['他'], topn=20))
print("-"*40)

[('奶', 0.9997791647911072), ('今天', 0.9997557997703552), ('幸福', 0.9997545480728149), ('優益', 0.9997543096542358), ('冠益', 0.9997209310531616), ('?', 0.9996999502182007), ('好喝', 0.9996600151062012), ('=', 0.999654233455658), ('#', 0.9996501803398132), ('純', 0.9995929002761841), ('降', 0.999588131904602), ('你們', 0.9995837211608887), ('我', 0.999580442905426), ('傻', 0.9995710253715515), ('良心', 0.9995673894882202), ('啦', 0.9995420575141907), ('每天', 0.9995291233062744), ('翻', 0.9995279312133789), ('粒', 0.9995216131210327), ('出', 0.9995182156562805)]
----------------------------------------
[('現在', 0.999843418598175), ('讓', 0.999842643737793), ('那', 0.999837338924408), ('已經', 0.999826192855835), ('後', 0.9998255372047424), ('我們', 0.9998246431350708), ('竟然', 0.9998204112052917), ('還在', 0.9998034238815308), ('中國', 0.9997988939285278), ('鬱', 0.9997972249984741), ('繼續', 0.9997922778129578), ('一直', 0.9997907280921936), ('噹噹', 0.9997901320457458), ('”', 0.9997866153717041), ('月', 0.9997856616973877), ('產品', 0.9997832775115967), ('這種', 0.9997782707214355), ('一個', 0.9997742772102356), ('一次', 0.9997730255126953), ('一樣', 0.9997689723968506)]
----------------------------------------

# Index and query through model.wv (direct access on the model is deprecated).
print(len(model.wv['好吃']))
print(model.wv.most_similar('好吃'))

# Gender probes: (word, male term, female term).
# most_similar raises KeyError for any word that is not in the trained vocabulary.
probes = [
    ('工程師', '他', '她'),  # engineer; he / she
    ('科學家', '男', '女'),  # scientist; male / female
    ('醫生', '他', '她'),    # doctor
    ('家長', '他', '她'),    # parent
    ('結婚', '他', '她'),    # to marry
    ('同性', '他', '她'),    # same-sex
    ('同志', '他', '她'),    # gay / comrade
    ('不婚', '他', '她'),    # choosing not to marry
    ('未婚', '他', '她'),    # unmarried
    ('成功', '他', '她'),    # success
    ('外遇', '他', '她'),    # extramarital affair
    ('離婚', '他', '她'),    # divorce
    ('失敗', '他', '她'),    # failure
]
for word, male, female in probes:
    print("-"*40)
    # word + male - female
    print(model.wv.most_similar(positive=[word, male], negative=[female], topn=20))
    print("-"*40)
    # word + female - male
    print(model.wv.most_similar(positive=[word, female], negative=[male], topn=20))
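The queries above mix positive and negative words. Roughly speaking, such a query adds the unit-length vectors of the positive words, subtracts those of the negative words, and ranks the remaining vocabulary by cosine similarity to the result. The sketch below reimplements that idea in plain Python on made-up 3-dimensional vectors. It is an illustration of the arithmetic, not gensim's actual code, and the `toy` vectors and the words in them are assumptions.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, unit(vectors[w]))]
    for w in negative:
        query = [q - x for q, x in zip(query, unit(vectors[w]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # query words are not candidates
    scores = [
        (w, sum(q * x for q, x in zip(query, unit(v))))
        for w, v in vectors.items() if w not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 3-dimensional vectors, for illustration only
toy = {
    "他": [1.0, 0.0, 0.2],
    "她": [0.0, 1.0, 0.2],
    "國王": [1.0, 0.0, 1.0],
    "女王": [0.0, 1.0, 1.0],
    "優格": [0.3, 0.3, -1.0],
}
result = most_similar(toy, positive=["國王", "她"], negative=["他"], topn=2)
print(result)
```

With these toy vectors, 國王 (king) + 她 (she) - 他 (he) ranks 女王 (queen) first, because the subtraction cancels the "male" direction and the addition supplies the "female" one.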

Loading pre-built model

English

import fasttext.util
fasttext.util.download_model('en', if_exists='ignore')  # English
ft = fasttext.load_model('cc.en.300.bin')

Chinese: Using CKIP

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("/Users/jirlong/Downloads/wordVec/w2v_CNA_ASBC_300d.vec",
                                          binary=False,
                                          unicode_errors='ignore')
print(len(model['好吃']))
print(model.most_similar('好吃'))

300
[('美味', 0.7559595704078674), ('品嚐', 0.719903290271759), ('吃起來', 0.7154861092567444), ('吃到', 0.710068941116333), ('爽口', 0.7017905712127686), ('可口', 0.6996539831161499), ('吃', 0.6877780556678772), ('香甜', 0.6855288743972778), ('鮮美', 0.6764703989028931), ('口感', 0.6668637990951538)]

