8. Pandas (35%)

Read CSV and JSON

Read a CSV file

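import pandas as pd
# Skip malformed rows instead of raising (on_bad_lines requires pandas >= 1.3)
df = pd.read_csv('data/14196_drug_adv.csv', on_bad_lines='skip')
# Distinct values in the 刊播媒體類別 (media category) column
set(df.刊播媒體類別)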

{'其他', '平面媒體', '廣播電台', '網路', '電視'}

Read an Excel file

import pandas as pd
# Reading .xlsx files requires the openpyxl package
df = pd.read_excel('../../R/clickbait_detection/labeled/tag_comparison_1st&2nd_round.xlsx')

Count values to get an overview of the data

df.刊播媒體類別.value_counts()

網路      1479
平面媒體     111
電視        86
廣播電台      71
其他        13
Name: 刊播媒體類別, dtype: int64

from collections import Counter
type_dict = Counter(df.刊播媒體類別)
print(type_dict)

Counter({'網路': 1479, '平面媒體': 111, '電視': 86, '廣播電台': 71, '其他': 13})

The counts sum to 1760 rows.

Pilot analysis: grouped counts with groupby
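As a minimal sketch of grouped counting, the snippet below counts rows per group with `groupby()`. The toy DataFrame, its `media` and `year` columns, and all values are made up for illustration; they are not the drug-advertisement data used earlier.

```python
import pandas as pd

# Toy data; column names and values are hypothetical
toy = pd.DataFrame({
    "media": ["web", "web", "tv", "print", "web", "tv"],
    "year": [2018, 2019, 2018, 2019, 2019, 2019],
})

# Rows per media type; size() counts every row in each group
counts_by_media = toy.groupby("media").size()

# Rows per (media, year) pair, flattened back into a regular DataFrame
counts_by_media_year = (
    toy.groupby(["media", "year"]).size().reset_index(name="n")
)

print(counts_by_media)
print(counts_by_media_year)
```

For a single column, `groupby(col).size()` gives the same counts as `value_counts()`, only sorted by group key rather than by frequency.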

Case study: summarizing YouBike data

Load the data

Create new variables

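  • Create a new variable (method 1): `df = df.assign(new_var = df.old_var1 / df.old_var2)`. `assign()` returns a new DataFrame and leaves the original untouched, so assign the result back to `df` to keep the new column.

  • Create a new variable (method 2): `df["new_var"] = df.old_var1 / df.old_var2`. This modifies `df` in place.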
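The difference between the two methods can be sketched on a toy DataFrame. The values are made up, and `old_var1`, `old_var2`, and `new_var2` are placeholder names standing in for real columns.

```python
import pandas as pd

# Toy data; old_var1 and old_var2 stand in for any two numeric columns
df = pd.DataFrame({"old_var1": [10.0, 20.0, 30.0],
                   "old_var2": [2.0, 4.0, 5.0]})

# Method 1: assign() returns a new DataFrame; without reassignment
# the result is discarded and df is unchanged
df.assign(new_var=df.old_var1 / df.old_var2)
has_new_var_before = "new_var" in df.columns

# Assigning the result back to df keeps the new column
df = df.assign(new_var=df.old_var1 / df.old_var2)

# Method 2: bracket assignment adds the column to df directly
df["new_var2"] = df.old_var1 / df.old_var2

print(has_new_var_before)
print(df)
```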

Inspect the data

  • Inspect the distribution of each variable with `df.info()` and `df.describe()`

  • Change a variable's type: `pd.to_numeric(var)` converts a column to a numeric dtype
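A short sketch of these inspection steps follows. The `amount` column and its values are invented for illustration, and `errors="coerce"` is one possible way to handle unparseable entries, not necessarily the one used in the original analysis.

```python
import pandas as pd

# Toy data: 'amount' holds numbers stored as strings, plus one bad entry
df = pd.DataFrame({"amount": ["12", "7", "n/a", "30"]})

df.info()             # column names, non-null counts, and dtypes
print(df.describe())  # summary of the column while it is still text

# errors="coerce" turns values that cannot be parsed into NaN
# instead of raising an exception
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

print(df["amount"].dtype)
print(df.describe())  # numeric summary: mean, std, quartiles
```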

Read R RDS

An RDS file holds a single unnamed object. After the file is converted to pandas, that root object is available as the DataFrame `result[None]`.
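The `result[None]` key matches how the pyreadr package returns RDS contents, so the sketch below assumes pyreadr is the reader in use; the notes do not name the library. The helper name `read_rds_as_dataframe` and the path in the usage comment are hypothetical.

```python
def read_rds_as_dataframe(path):
    """Read an .rds file and return its single object as a pandas DataFrame.

    Assumes the pyreadr package is installed (pip install pyreadr).
    """
    import pyreadr  # imported lazily so this module loads without pyreadr

    result = pyreadr.read_r(path)  # dict-like mapping of name -> DataFrame
    # An RDS file stores one unnamed object, so its only key is None
    return result[None]


# Usage (hypothetical path):
# df = read_rds_as_dataframe("data/example.rds")
# df.head()
```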

Read R RDA

Download the test data here: https://www.dropbox.com/s/qrgi3ralwqrhbq5/boy-girl_201906160922.rda?dl=0

odict_keys(['allc.df', 'allp.df'])
Index(['plink', 'board', 'pcontent', 'poster', 'ptitle', 'ptime', 'ipaddr', 'ip.len'], dtype='object')
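Output of this kind can be produced along the following lines. This is a sketch that again assumes pyreadr as the reader (not confirmed by the notes); `read_rda_tables` and `list_columns` are hypothetical helper names, and the local file name in the usage comment is assumed from the download link.

```python
def read_rda_tables(path):
    """Read an .rda file into a dict-like of {R object name: DataFrame}.

    Assumes the pyreadr package is installed (pip install pyreadr).
    """
    import pyreadr  # imported lazily so this module loads without pyreadr

    return pyreadr.read_r(path)


def list_columns(tables):
    """Map each object name to the list of its column names."""
    return {name: list(frame.columns) for name, frame in tables.items()}


# Usage, assuming the file above was saved locally under this name:
# result = read_rda_tables("boy-girl_201906160922.rda")
# print(result.keys())         # names of the R objects stored in the file
# print(list_columns(result))  # columns of each DataFrame
```

Unlike an RDS file, an RDA file can hold several named objects, so each DataFrame is looked up by its R name rather than by `None`.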

Tokenizing post content

M1. Tokenize one column with a for-loop
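A minimal sketch of the loop approach is shown below. The `tokenize` function is a regex-based stand-in, not the tokenizer used in the original notes (for Chinese post content a word segmenter would presumably take its place), and the toy posts are invented. Only the `pcontent` column name comes from the RDA output above.

```python
import re
import pandas as pd


def tokenize(text):
    """Stand-in tokenizer: lowercase the text and keep runs of word characters."""
    return re.findall(r"\w+", str(text).lower())


# Toy posts; 'pcontent' mirrors the post-content column name
posts = pd.DataFrame({
    "pcontent": ["Hello world", "Pandas makes tables easy", ""],
})

# Walk the column row by row and collect one token list per post
tokens = []
for text in posts["pcontent"]:
    tokens.append(tokenize(text))

# Store the lists as a new column, aligned on the original index
posts["tokens"] = pd.Series(tokens, index=posts.index)
print(posts)
```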

M2. Tokenize pandas columns by `apply()`
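The same result can be sketched with `Series.apply()`, which calls a function on every value of a column and returns a new Series. As before, `tokenize` is a hypothetical regex-based stand-in for the real tokenizer, and the toy data is invented.

```python
import re
import pandas as pd


def tokenize(text):
    """Stand-in tokenizer; missing values yield an empty token list."""
    if pd.isna(text):
        return []
    return re.findall(r"\w+", str(text).lower())


# Toy posts, including one missing value
posts = pd.DataFrame({
    "pcontent": ["Hello world", None, "Tokenize with apply"],
})

# apply() runs tokenize on each cell and keeps the index aligned
posts["tokens"] = posts["pcontent"].apply(tokenize)

# A second apply() derives the number of tokens per post
posts["n_tokens"] = posts["tokens"].apply(len)
print(posts)
```

Compared with the explicit loop, `apply()` avoids building and re-attaching an intermediate list, which keeps the transformation to a single line.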

Applications: Building word2vec model
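One way to go from a tokenized column to a word2vec model is sketched below. It assumes the gensim package (version 4.0 or later) as the implementation, which the notes do not confirm. The toy token lists, the helper name `train_word2vec`, and the parameter values are illustrative only.

```python
import pandas as pd

# Toy tokenized posts; real input would be a token column
# produced by one of the tokenizing methods
posts = pd.DataFrame({
    "tokens": [
        ["boy", "girl", "love"],
        ["girl", "boy", "friend"],
        ["love", "friend", "story"],
    ],
})

# Word2Vec expects an iterable of token lists, one list per document
sentences = posts["tokens"].tolist()


def train_word2vec(sentences, vector_size=50, window=5, min_count=1, seed=1):
    """Train a small Word2Vec model. Assumes gensim >= 4.0 is installed."""
    from gensim.models import Word2Vec  # imported lazily; optional dependency

    return Word2Vec(
        sentences=sentences,
        vector_size=vector_size,  # dimensionality of the word vectors
        window=window,            # context window size
        min_count=min_count,      # keep words seen at least this many times
        workers=1,
        seed=seed,
    )


# Usage:
# model = train_word2vec(sentences)
# model.wv.most_similar("girl", topn=3)
```

A corpus this small yields meaningless vectors; the sketch only shows the shape of the input and the call.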
