8. Pandas (35%)
Read CSV and JSON
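Read a CSV file
import pandas as pd
# on_bad_lines='skip' (pandas 1.3+) skips malformed rows instead of raising an error
df = pd.read_csv('data/14196_drug_adv.csv', on_bad_lines='skip')
# 刊播媒體類別 is the media-category column; set() lists its distinct values
set(df.刊播媒體類別)
{'其他', '平面媒體', '廣播電台', '網路', '電視'}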
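To make the effect of skipping malformed rows concrete, here is a minimal sketch that reads a small in-memory CSV. The CSV text and the column names `id` and `media_type` are made up for illustration and are not part of the course data.

```python
import io

import pandas as pd

# Hypothetical CSV text; the third data row has one field too many.
raw_csv = (
    "id,media_type\n"
    "1,web\n"
    "2,tv\n"
    "3,radio,unexpected_extra_field\n"
    "4,web\n"
)

# on_bad_lines="skip" (pandas 1.3+) drops rows with too many fields.
sample_df = pd.read_csv(io.StringIO(raw_csv), on_bad_lines="skip")

# Distinct categories, the same idea as calling set() on a column.
media_types = set(sample_df["media_type"])
print(len(sample_df), sorted(media_types))
```

Skipping rows silently loses data, so it is worth comparing the row count with the number of lines in the file when the difference matters.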
Read an Excel file
import pandas as pd
df = pd.read_excel('../../R/clickbait_detection/labeled/tag_comparison_1st&2nd_round.xlsx')
Count values to get an overview of the data
df.刊播媒體類別.value_counts()
網路 1479
平面媒體 111
電視 86
廣播電台 71
其他 13
Name: 刊播媒體類別, dtype: int64
from collections import Counter
type_dict = Counter(df.刊播媒體類別)
print(type_dict)
Counter({'網路': 1479, '平面媒體': 111, '電視': 86, '廣播電台': 71, '其他': 13})
1760
Pilot analysis: grouped counts with group_by
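In pandas, grouped counting is done with `DataFrame.groupby()`. The following is a minimal sketch on made-up rows. The column names `media_type` and `year` are assumptions for illustration, not columns from the course dataset.

```python
import pandas as pd

# Hypothetical rows; 'media_type' and 'year' are illustrative column names.
ads = pd.DataFrame({
    "media_type": ["web", "web", "tv", "radio", "web", "tv"],
    "year": [2018, 2019, 2018, 2019, 2019, 2019],
})

# Count rows per group; size() counts every row in each group.
counts_by_type = ads.groupby("media_type").size()

# Group by two columns, then flatten the result into a regular DataFrame
# with the count stored in a column named 'n'.
counts_by_type_year = (
    ads.groupby(["media_type", "year"]).size().reset_index(name="n")
)

print(counts_by_type)
print(counts_by_type_year)
```

For a single column, `groupby(...).size()` gives the same counts as `value_counts()`, only sorted by group label instead of by frequency.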

Case study: summarizing YouBike data
Load the data

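Create new variables
Create a new variable (method 1)
df = df.assign(new_var = df.old_var1 / df.old_var2)
Use assign() to create or convert a variable. Be careful: assign() returns a new DataFrame, so you must assign the result back to df to overwrite the original.
Create a new variable (method 2)
df["new_var"] = df.old_var1 / df.old_var2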
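The difference between the two methods can be checked on a toy DataFrame. This is a minimal sketch with made-up numbers. `old_var1` and `old_var2` are the placeholder names used above.

```python
import pandas as pd

# Hypothetical data; old_var1 and old_var2 mirror the placeholder names.
df = pd.DataFrame({"old_var1": [10.0, 20.0, 30.0],
                   "old_var2": [2.0, 4.0, 5.0]})

# Calling assign() without keeping the result leaves df unchanged.
df.assign(new_var=df.old_var1 / df.old_var2)
unchanged = "new_var" not in df.columns

# Method 1: keep the DataFrame that assign() returns.
df1 = df.assign(new_var=df.old_var1 / df.old_var2)

# Method 2: add the column in place (done on a copy to keep df intact).
df2 = df.copy()
df2["new_var"] = df2.old_var1 / df2.old_var2

print(unchanged, df1.equals(df2))
```

Method 1 chains well with other calls, while method 2 modifies the DataFrame directly.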
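Inspect the data overview
Inspect the distribution of each variable
df.info() and df.describe()
Change a variable's data type
Use pd.to_numeric(var) to convert a variable to a numeric data type.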
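As a small illustration of the type conversion, the sketch below converts a text column that contains one unparseable value. The column name `count` and its values are made up. `errors="coerce"` is one possible choice: it turns values that cannot be parsed into missing values instead of raising an error.

```python
import pandas as pd

# Hypothetical column read in as text; "n/a" cannot be parsed as a number.
df = pd.DataFrame({"count": ["12", "7", "n/a", "30"]})

# errors="coerce" replaces unparseable entries with missing values.
df["count"] = pd.to_numeric(df["count"], errors="coerce")

n_missing = int(df["count"].isna().sum())
total = df["count"].sum()  # missing values are skipped by sum()
print(df["count"].dtype, n_missing, total)
```

Checking the number of missing values after the conversion shows how many entries failed to parse.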

Read R RDS
When an RDS file is converted to a pandas DataFrame, its root node is stored under the key None, so the DataFrame is result[None].
Read R RDA
Download the data for testing here: https://www.dropbox.com/s/qrgi3ralwqrhbq5/boy-girl_201906160922.rda?dl=0
odict_keys(['allc.df', 'allp.df'])
Index(['plink', 'board', 'pcontent', 'poster', 'ptitle', 'ptime', 'ipaddr', 'ip.len'], dtype='object')
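The notes do not name the library used to read R files. The sketch below assumes the third-party `pyreadr` package, whose `read_r()` returns an ordered dictionary of DataFrames. The helper names `read_r_objects` and `first_table` are hypothetical.

```python
def read_r_objects(path):
    """Read an .rds or .rda file into a dict-like object of DataFrames."""
    import pyreadr  # third-party package; an assumption, not named in the notes
    return pyreadr.read_r(path)


def first_table(result):
    """Return one table from a read result.

    An RDS file holds a single object stored under the key None.
    An RDA file is keyed by R object name, so the first entry is returned.
    """
    if None in result:
        return result[None]
    return next(iter(result.values()))


# Usage sketch (requires the downloaded file and pyreadr installed):
# result = read_r_objects("boy-girl_201906160922.rda")
# print(result.keys())       # names of the R objects in the file
# posts = result["allp.df"]  # pick one object by name
# print(posts.columns)
```

Selecting a table by its R object name is more explicit than taking the first entry when an RDA file holds several objects.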
Tokenizing post content

M1. Tokenize one column with a for-loop
M2. Tokenize pandas columns with `apply()`
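The two approaches can be compared side by side. This is a minimal sketch in which `tokenize` is a hypothetical stand-in that lowercases and splits on whitespace. Chinese text would need a real word segmenter in its place. The sample rows are made up, and only the column names `ptitle` and `pcontent` come from the data shown above.

```python
import pandas as pd


def tokenize(text):
    """Hypothetical stand-in tokenizer: lowercase and split on whitespace."""
    return str(text).lower().split()


# Made-up posts; only the column names follow the dataset above.
posts = pd.DataFrame({
    "ptitle": ["Hello World", "Pandas apply demo"],
    "pcontent": ["First post here", "Second post with more words"],
})

# M1: tokenize one column with an explicit for-loop.
tokens_loop = []
for text in posts["pcontent"]:
    tokens_loop.append(tokenize(text))
posts["tokens_loop"] = pd.Series(tokens_loop, index=posts.index)

# M2: tokenize the same column with apply().
posts["tokens_apply"] = posts["pcontent"].apply(tokenize)

# M2 on several columns: apply() runs the function on each selected column.
token_cols = posts[["ptitle", "pcontent"]].apply(lambda col: col.map(tokenize))

print(posts[["tokens_loop", "tokens_apply"]])
```

Both methods give the same token lists. `apply()` is shorter and extends to several columns without extra bookkeeping.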
Applications: Building a word2vec model