2. Counting
The fruit-counting problem
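Programming means mimicking the logic you use to solve a problem so that a computer can repeat it for you, many times over. What matters, then, is your problem-solving logic. Counting objects, such as fruit, is a very practical algorithmic pattern: it tells you how many of each kind of item there are in a pile. Related examples:
A class has many students, and I want to count how many come from each department so that I can balance departments when arranging project groups.
You got roped into running an errand to buy drinks for the whole class; people ordered many different drinks, so you have to tally them cup by cup.
Counting how many times each word appears in an article (word frequency); counting the distribution of student grades in bins of 0~10, 10~20, and so on.
Problem definition: suppose there is a mountain-sized pile of fruit. You do not know how many kinds of fruit it contains, nor how many pieces there are. You must count how many pieces of each kind there are. How would you do it?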
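How do other tools solve this problem? If you are working in Excel, you would use the COUNTIF function, as described in "Excel skills: How do I get the distinct/uniq". If you are working in R, it would be count(vec) (base R also offers table(vec) for the same purpose).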
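Step 1: Try to explain the logic
The following example starts from the most basic syntax and uses Python's most basic data types to do counting.
First, think about how you actually count fruit. Ans. One piece at a time.
Then how do you remember how many pieces of each kind there are? Ans. Take out a sheet of paper. When you see a fruit you have not seen before, add a new "mapping" from the fruit name to 0, then increase the value in that field by one. If you have already seen the fruit, just find its field and increase it by one.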
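Step 2: Standardize the logic
Take out a sheet of paper to serve as a look-up table: the top row holds the fruit names and the bottom row holds the count for each fruit.
Line the fruit up in a row, ready to be counted one by one.
For each fruit in the row:
    If I have not seen it before,
        record the fruit in the look-up table with a count of 1.
    If I have seen it before,
        add 1 to the number under that fruit in the look-up table.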
Step 3: Describe it in English
build a look-up table to record each fruit and the number of that fruit (call it a dictionary), naming it fruit_count
keep all fruits in a list named fruit_list
for each fruit in fruit_list:
    if the fruit does not appear in fruit_count
        create a mapping in fruit_count to map the fruit name to 1
    else
        increase the mapped value of the fruit name in fruit_count
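Writing it in Python
Before writing code, you need to know one symbol, "=" (assignment). The syntax is "variable = value": assignment gives the value, or the result of the expression, on the right to the variable on the left. The left side is usually a single variable. Python is unusual in that the left side may hold several variables, as long as the number of values on the right matches the number of variables on the left. For more on assignment, see the Assignment section of Py01. Basic.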
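A minimal sketch of single and multiple assignment as described above. The variable names (count, name, amount) are assumptions chosen for illustration only.

```python
# Single assignment: the value on the right is stored in the variable on the left.
count = 0

# The right side is evaluated first, then assigned back to count.
count = count + 1

# Multiple assignment: two values on the right, two variables on the left.
name, amount = "apple", 3

print(count, name, amount)
```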
Assigning an empty dictionary to a variable
fruit_count = {}
Assigning a list with several terms to a variable
fruit_list = ['a', 'b', 'c', 'a', 'd', 'a', 'w', 'b']
var[]: brackets are used to access a list or a dictionary
fruit_list[1]
fruit_count["a"]
a = a + 1 is a typical incrementer (increment operation)
fruit_count[fruit] = fruit_count[fruit] + 1
A list is ordered, so items are accessed by position
fruit_list[2]
A dictionary is accessed by key, not by position
fruit_count["b"]
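The pseudocode of Step 3 could be turned into Python roughly as follows. The fruit names in fruit_list are assumed sample data, not taken from the original notebook.

```python
# Assumed sample data: 4 apples, 3 bananas, 1 grape.
fruit_list = ["apple", "banana", "apple", "grape",
              "banana", "apple", "banana", "apple"]

# The look-up table: fruit name -> number of pieces seen so far.
fruit_count = {}

for fruit in fruit_list:
    if fruit not in fruit_count:
        # First time we see this fruit: register it with a count of 1.
        fruit_count[fruit] = 1
    else:
        # Seen before: increase the existing count by 1.
        fruit_count[fruit] = fruit_count[fruit] + 1

print(fruit_count)
```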
{'apple': 4, 'banana': 3, 'grape': 1}
Print all pairs
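Assuming fruit_count holds the counts printed above, the different views of a dictionary might be printed as in this sketch. The output lines that follow correspond to the print calls in order.

```python
# Assumed to be the result of the counting loop.
fruit_count = {"apple": 4, "banana": 3, "grape": 1}

print(fruit_count)           # the whole dictionary
print(fruit_count.keys())    # only the keys (fruit names)
print(fruit_count.values())  # only the values (counts)
print(fruit_count.items())   # (key, value) pairs

# Loop over the pairs and print one fruit per line.
for fruit, count in fruit_count.items():
    print(fruit, count)
```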
{'apple': 4, 'banana': 3, 'grape': 1}
dict_keys(['apple', 'banana', 'grape'])
dict_values([4, 3, 1])
dict_items([('apple', 4), ('banana', 3), ('grape', 1)])
apple 4
banana 3
grape 1
V1. Using list.count(key) to count the frequency of something
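One possible form of this variant, with assumed sample data. list.count(x) scans the whole list each time, so a fruit that appears several times is recounted on every occurrence; the result is still correct, just wasteful.

```python
# Assumed sample data.
fruit_list = ["apple", "banana", "apple", "grape",
              "banana", "apple", "banana", "apple"]

fruit_count = {}
for fruit in fruit_list:
    # list.count returns how many times fruit occurs in the list.
    fruit_count[fruit] = fruit_list.count(fruit)

print(fruit_count)
```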
V2. Using dict.get()
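A sketch of this variant with assumed sample data. dict.get(key, default) returns the default when the key is missing, which removes the need for the if/else branch.

```python
# Assumed sample data.
fruit_list = ["apple", "banana", "apple", "grape",
              "banana", "apple", "banana", "apple"]

fruit_count = {}
for fruit in fruit_list:
    # get() returns 0 for a fruit not yet in the table, so we can always add 1.
    fruit_count[fruit] = fruit_count.get(fruit, 0) + 1

print(fruit_count)
```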
V3. Using set(fruit_list) to guarantee unique values
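A sketch of this variant with assumed sample data. Converting the list to a set keeps one copy of each fruit, so each kind is counted exactly once. A set has no guaranteed order, so the order of the printed keys may vary.

```python
# Assumed sample data.
fruit_list = ["apple", "banana", "apple", "grape",
              "banana", "apple", "banana", "apple"]

fruit_count = {}
# set(fruit_list) contains each distinct fruit name once.
for fruit in set(fruit_list):
    fruit_count[fruit] = fruit_list.count(fruit)

print(fruit_count)
```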
V4. By dictionary comprehension
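The same idea can be written on one line as a dictionary comprehension. This is a sketch with assumed sample data; the key order of the result is not guaranteed because it iterates over a set.

```python
# Assumed sample data.
fruit_list = ["apple", "banana", "apple", "grape",
              "banana", "apple", "banana", "apple"]

# {key: value for item in iterable} builds the whole dictionary in one expression.
fruit_count = {fruit: fruit_list.count(fruit) for fruit in set(fruit_list)}

print(fruit_count)
```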
V5. by Counter()
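A sketch using collections.Counter from the standard library, with assumed sample data. Counter is a dictionary subclass built for exactly this kind of tallying.

```python
from collections import Counter

# Assumed sample data.
fruit_list = ["apple", "banana", "apple", "grape",
              "banana", "apple", "banana", "apple"]

# Counter does the counting in a single call.
fruit_count = Counter(fruit_list)

print(fruit_count["apple"])        # count for a single fruit
print(fruit_count.most_common(2))  # the two most frequent fruits
```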
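Case: counting students per department in a class
This example is almost identical to counting fruit; the kinds of fruit are simply replaced by each student's department.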
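A minimal sketch of the department count. The department names and the class roster are hypothetical, made up for illustration.

```python
# Hypothetical roster: one department name per student.
departments = ["Journalism", "Economics", "Sociology",
               "Economics", "Journalism", "Economics"]

dept_count = {}
for dept in departments:
    # Same counting logic as for fruit, with departments as the keys.
    dept_count[dept] = dept_count.get(dept, 0) + 1

for dept, n in dept_count.items():
    print(dept, n)
```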
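Case: computing term frequency for a Wikipedia article
Counting is in fact a piece of algorithmic logic that many programs rely on. Other such building blocks include searching and sorting, and all of them are part of computational thinking. In this example, I use the counting logic to compute the term frequency of a Wikipedia article, that is, how many times each word appears in the article; the words play the role of the fruit in the earlier example. So I first need to split the article into words and store them in a list like the one in the earlier example. I will use Python's wikipedia package to fetch the summary of a Wikipedia page. It is a third-party package that is usually not yet installed on your computer, so you need to install it by running pip install wikipedia in Terminal.app or cmd.exe.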
1. Loading text
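A sketch of how the summary might be fetched. The helper name load_summary and the page title "Big data" are assumptions; wikipedia.summary is the package's call for retrieving a page summary, and it needs network access plus the third-party package installed.

```python
def load_summary(title):
    """Return the summary text of the Wikipedia page with the given title."""
    # Imported inside the function so the file still loads
    # when the third-party package is not installed.
    import wikipedia
    return wikipedia.summary(title)

# Example call (requires network access and `pip install wikipedia`):
# text = load_summary("Big data")
# print(text)
```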
Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many cases (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source. Big data was originally associated with three key concepts: volume, variety, and velocity. When we handle big data, we may not sample but simply observe and track what happens. Therefore, big data often includes data with sizes that exceed the capacity of traditional software to process within an acceptable time and value.\nCurrent usage of the term big data tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now available are indeed large, but that\'s not the most relevant characteristic of this new data ecosystem."\nAnalysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on." Scientists, business executives, practitioners of medicine, advertising and governments alike regularly meet difficulties with large data-sets in areas including Internet searches, fintech, urban informatics, and business informatics. 
Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, biology and environmental research.Data sets grow rapidly, to a certain extent because they are increasingly gathered by cheap and numerous information-sensing Internet of things devices such as mobile devices, aerial (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks. The world\'s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 exabytes (2.5×260 bytes) of data are generated.
2. Tokenization and text preprocessing
Next, I tokenize the text and apply some basic preprocessing.
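One plausible version of this step is sketched here: lowercase the text, strip ASCII punctuation, and split on whitespace. The exact preprocessing in the original notebook is not shown, so this is an assumption; the sample sentence is one sentence taken from the summary.

```python
import string

# One sentence from the summary, used as a small sample.
text = "Big data challenges include capturing data, data storage, data analysis."

# 1. Lowercase so that "Big" and "big" count as the same word.
lowered = text.lower()

# 2. Remove ASCII punctuation characters such as , . - '
no_punct = lowered.translate(str.maketrans("", "", string.punctuation))

# 3. Split on whitespace to get a list of words.
words = no_punct.split()

print(words)
```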
3. Computing term frequency
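The counting logic from the fruit example applies unchanged once the text is a list of words. The sketch below uses a short assumed word list rather than the full summary, so its output is much smaller than the dictionary shown next.

```python
# Assumed short word list (already lowercased and stripped of punctuation).
words = ["big", "data", "challenges", "include", "capturing",
         "data", "data", "storage", "data", "analysis"]

word_freq = {}
for word in words:
    # Each word plays the role of a fruit.
    word_freq[word] = word_freq.get(word, 0) + 1

print(word_freq)
```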
{'big': 6, 'data': 21, 'is': 2, 'a': 4, 'field': 1, 'that': 5, 'treats': 1, 'ways': 1, 'to': 10, 'analyze': 1, 'systematically': 1, 'extract': 2, 'information': 3, 'from': 2, 'or': 4, 'otherwise': 1, 'deal': 1, 'with': 7, 'sets': 3, 'are': 4, 'too': 1, 'large': 3, 'complex': 2, 'be': 1, 'dealt': 1, 'by': 2, 'traditional': 2, 'dataprocessing': 1, 'application': 1, 'software': 3, 'many': 1, 'cases': 1, 'rows': 1, 'offer': 1, 'greater': 1, 'statistical': 1, 'power': 1, 'while': 1, 'higher': 2, 'complexity': 1, 'more': 1, 'attributes': 1, 'columns': 1, 'may': 2, 'lead': 1, 'false': 1, 'discovery': 1, 'rate': 1, 'challenges': 1, 'include': 1, 'capturing': 1, 'storage': 1, 'analysis': 2, 'search': 1, 'sharing': 1, 'transfer': 1, 'visualization': 1, 'querying': 1, 'updating': 1, 'privacy': 1, 'and': 11, 'source': 1, 'was': 1, 'originally': 1, 'associated': 1, 'three': 1, 'key': 1, 'concepts': 1, 'volume': 1, 'variety': 1, 'velocity': 1, 'when': 1, 'we': 2, 'handle': 1, 'not': 2, 'sample': 1, 'but': 2, 'simply': 1, 'observe': 1, 'track': 1, 'what': 1, 'happens': 1, 'therefore': 1, 'often': 1, 'includes': 1, 'sizes': 1, 'exceed': 1, 'the': 7, 'capacity': 2, 'of': 11, 'process': 1, 'within': 1, 'an': 1, 'acceptable': 1, 'time': 1, 'value': 2, 'current': 1, 'usage': 1, 'term': 1, 'tends': 1, 'refer': 1, 'use': 1, 'predictive': 1, 'analytics': 3, 'user': 1, 'behavior': 1, 'certain': 2, 'other': 1, 'advanced': 1, 'methods': 1, 'seldom': 1, 'particular': 1, 'size': 1, 'set': 1, 'there': 1, 'little': 1, 'doubt': 1, 'quantities': 1, 'now': 1, 'available': 1, 'indeed': 1, 'thats': 1, 'most': 1, 'relevant': 1, 'characteristic': 1, 'this': 1, 'new': 2, 'ecosystem': 1, 'can': 1, 'find': 1, 'correlations': 1, 'spot': 1, 'business': 3, 'trends': 1, 'prevent': 1, 'diseases': 1, 'combat': 1, 'crime': 1, 'so': 1, 'on': 1, 'scientists': 2, 'executives': 1, 'practitioners': 1, 'medicine': 1, 'advertising': 1, 'governments': 1, 'alike': 1, 'regularly': 1, 'meet': 1, 'difficulties': 1, 
'datasets': 1, 'in': 2, 'areas': 1, 'including': 2, 'internet': 2, 'searches': 1, 'fintech': 1, 'urban': 1, 'informatics': 2, 'encounter': 1, 'limitations': 1, 'escience': 1, 'work': 1, 'meteorology': 1, 'genomics': 1, 'connectomics': 1, 'physics': 1, 'simulations': 1, 'biology': 1, 'environmental': 1, 'researchdata': 1, 'grow': 1, 'rapidly': 1, 'extent': 1, 'because': 1, 'they': 1, 'increasingly': 1, 'gathered': 1, 'cheap': 1, 'numerous': 1, 'informationsensing': 1, 'things': 1, 'devices': 2, 'such': 1, 'as': 2, 'mobile': 1, 'aerial': 1, 'remote': 1, 'sensing': 1, 'logs': 1, 'cameras': 1, 'microphones': 1, 'radiofrequency': 1, 'identification': 1, 'rfid': 1, 'readers': 1, 'wireless': 1, 'sensor': 1, 'networks': 1, 'worlds': 1, 'technological': 1, 'percapita': 1, 'store': 1, 'has': 1, 'roughly': 1, 'doubled': 1, 'every': 2, '40': 1, 'months': 1, 'since': 1, '1980s': 1, '2012': 1, 'day': 1, '25': 1, 'exabytes': 1, '25×260': 1, 'bytes': 1, 'generated': 1}
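Python offers a way to express for-loop logic on a single line. Some people call it a generator, and others call the technique a comprehension. Strictly speaking, the bracketed form [...] is a list comprehension, while the parenthesized form (...) is a generator expression.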
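A small sketch contrasting a plain for-loop with the single-line forms. The word list and variable names are assumed for illustration.

```python
# Assumed sample words.
words = ["big", "data", "is", "a", "field"]

# Plain for-loop: collect the length of every word.
lengths = []
for w in words:
    lengths.append(len(w))

# List comprehension: the same result in one line.
lengths_lc = [len(w) for w in words]

# Generator expression: values are produced one at a time and consumed by sum().
total = sum(len(w) for w in words)

print(lengths, lengths_lc, total)
```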
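(Further learning) How to sort a dictionary by its values: sorted(list_of_dict, key=itemgetter(dict_key), reverse=True). itemgetter is not loaded automatically when a program starts, so you need to import it with from operator import itemgetter.
The approach below sorts the keys of the dict word_freq by their values. Now that you have the sorted keys, you can try printing the values as well: just use each key to index word_freq, as in word_freq[key].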
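A sketch of both sorting approaches. The word_freq dictionary here is a small subset of the counts shown earlier, picked by hand for illustration; the original notebook's exact code is not shown, so this is one possible form.

```python
from operator import itemgetter

# A hand-picked subset of the term frequencies from the summary.
word_freq = {"big": 6, "data": 21, "is": 2, "to": 10, "and": 11}

# Sort the keys by their values, largest first.
sorted_keys = sorted(word_freq, key=word_freq.get, reverse=True)
for key in sorted_keys:
    print(key, word_freq[key])

# Alternative: sort (key, value) pairs by the value at index 1.
sorted_pairs = sorted(word_freq.items(), key=itemgetter(1), reverse=True)
print(sorted_pairs)
```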
4. Plotting term frequency

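Case: converting percentage scores to letter grades
Recently Taiwan has also started using letter grades. One reason is to match international standards; another is said to be to keep students from haggling over every point. My own impression, though, is that students now care even more about the difference between A+ and A, and some even come to ask what their raw score was and why it became an A. Still, many teachers grade on a percentage scale, and when teachers upload grades the school offers an option to convert percentage scores to letter grades automatically. This example is a simplified version of that: we convert only to the four grades A, B, C, and F.
First, create some test data: 75 random integers between 35 and 99, drawn with replacement. Next, initialize a dictionary to hold the count of each grade (all starting at 0). Finally, use if-elif-else conditions to decide which range each score falls in.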
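The three steps might look like the sketch below. The grade boundaries (A for 80 and above, B for 70 to 79, C for 60 to 69, F otherwise) are assumptions, since the text does not state the cut-offs; the fixed seed is also an addition, used only to make runs repeatable.

```python
import random

random.seed(0)  # assumed: fixed seed so repeated runs give the same scores

# Step 1: 75 random integers from 35 to 99 (both ends included), repeats allowed.
scores = [random.randint(35, 99) for _ in range(75)]

# Step 2: one counter per grade, all starting at 0.
grade_count = {"A": 0, "B": 0, "C": 0, "F": 0}

# Step 3: decide the range of each score (assumed cut-offs: 80 / 70 / 60).
for score in scores:
    if score >= 80:
        grade_count["A"] += 1
    elif score >= 70:
        grade_count["B"] += 1
    elif score >= 60:
        grade_count["C"] += 1
    else:
        grade_count["F"] += 1

print(grade_count)
```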