Intro

This video explains what fastText is all about as if you were five years old. If the video doesn't load, click on this link. Basically, a fastText model maps a word to a series of numbers (called vectors or embeddings in NLP jargon) so that word similarity can be calculated from those numbers.

fastText cbow 300 dimensions from Facebook

Here are the simple steps for loading the Chinese model released by Facebook, abbreviated here as ft.

!pip install fasttext
import fasttext
Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/f8/85/e2b368ab6d3528827b147fdb814f8189acc981a4bc2f99ab894650e05c40/fasttext-0.9.2.tar.gz (68kB)
     |████████████████████████████████| 71kB 4.4MB/s 
Requirement already satisfied: pybind11>=2.2 in /usr/local/lib/python3.6/dist-packages (from fasttext) (2.6.1)
Requirement already satisfied: setuptools>=0.7.0 in /usr/local/lib/python3.6/dist-packages (from fasttext) (51.1.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from fasttext) (1.19.5)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp36-cp36m-linux_x86_64.whl size=3039122 sha256=5aa81e1045293ebc74315d2013c28cd0018ec96b8868502d535b71438f1faa0c
  Stored in directory: /root/.cache/pip/wheels/98/ba/7f/b154944a1cf5a8cee91c154b75231136cc3a3321ab0e30f592
Successfully built fasttext
Installing collected packages: fasttext
Successfully installed fasttext-0.9.2
import fasttext.util
fasttext.util.download_model('zh', if_exists='ignore')  # zh = Chinese
Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.zh.300.bin.gz

'cc.zh.300.bin'
ft = fasttext.load_model('cc.zh.300.bin')
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.

The ft model covers a whopping 2,000,000 words because it was trained on a huge corpus.

len(ft.words)
2000000

Let's check out the top 10 words most similar to "疫情" (meaning "pandemic situation") according to the ft model. The numbers indicate the degree of similarity. The larger the number, the greater the similarity.

ft.get_nearest_neighbors("疫情")

[(0.7571706771850586, '禽流感'),
 (0.6940484046936035, '甲流'),
 (0.6807129383087158, '流感'),
 (0.6670429706573486, '疫病'),
 (0.6640030741691589, '防疫'),
 (0.6531218886375427, '萨斯病'),
 (0.6506668329238892, 'H1N1'),
 (0.6495682001113892, '疫症'),
 (0.6432098150253296, 'SARS'),
 (0.642063319683075, '疫区')]
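
Under the hood, each of those scores is a measure of how close two word vectors are. Here's a minimal sketch of how you could reproduce such a score yourself with get_word_vector(), assuming the numbers returned by get_nearest_neighbors() are cosine similarities (which is how fastText's nearest-neighbor search is usually described). If that assumption holds, the value for "禽流感" should land close to the 0.7571 shown above.

import numpy as np

def cosine_similarity(word1, word2, model):
  # turn each word into its 300-dimensional vector
  v1 = model.get_word_vector(word1)
  v2 = model.get_word_vector(word2)
  # cosine similarity = dot product of the two L2-normalized vectors
  return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

cosine_similarity("疫情", "禽流感", ft)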

The results are pretty good. The downside, though, is that the ft model is huge: after being unzipped, the model file is about 6.74 GB.
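
If that's too big for your disk or RAM, fastText ships a helper that shrinks the vectors to fewer dimensions, trading a bit of accuracy for a much smaller model. A minimal sketch, assuming 100 dimensions is enough for your use case (the saved file name is hypothetical):

import fasttext.util
fasttext.util.reduce_model(ft, 100)  # reduce the vectors from 300 to 100 dimensions in place
ft.get_dimension()                   # now returns 100
ft.save_model('cc.zh.100.bin')       # save the smaller model under a new name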

fastText cbow 300 dimensions from ToastyNews in Cantonese

This article is what inspired me to write this post. The author trained a fastText model on articles written in Cantonese, which uses traditional characters. Here are the simple steps for loading his model, abbreviated here as hk.

Since his model is stored on Google Drive, I find it more convenient to use the gdown library to download it.

import gdown
url = 'https://drive.google.com/u/0/uc?export=download&confirm=4g-b&id=1kmZ8NKYDngKtA_-1f3ZdmbLV0CDBy1xA'
output = 'toasty_news.bin.gz'
gdown.download(url, output, quiet=False)
Downloading...
From: https://drive.google.com/u/0/uc?export=download&confirm=4g-b&id=1kmZ8NKYDngKtA_-1f3ZdmbLV0CDBy1xA
To: /content/toasty_news.bin.gz
2.77GB [00:26, 106MB/s] 
'toasty_news.bin.gz'

The file first needs to be unzipped before it can be loaded as a fastText model. An easy way to do that in Colab is the !gunzip command plus the file name (a pure-Python alternative is sketched right after the code below).

!gunzip toasty_news.bin.gz
hk = fasttext.load_model('/content/toasty_news.bin')
hk.get_dimension()
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
300
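
If you'd rather stay in Python than shell out to gunzip, the standard library can do the same decompression. A minimal sketch using gzip and shutil:

import gzip, shutil

# decompress toasty_news.bin.gz to toasty_news.bin without calling a shell command
with gzip.open('toasty_news.bin.gz', 'rb') as f_in, open('toasty_news.bin', 'wb') as f_out:
  shutil.copyfileobj(f_in, f_out)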

The hk model covers 222,906 words in total.

len(hk.words)
222906

fastText cbow 100 dimensions from Taiwan news in traditional Chinese

I trained a fastText model on 5,816 Taiwan news articles written in traditional Chinese, most of them related to health and disease.
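
For the curious, training a model like this boils down to a single call to fasttext.train_unsupervised(). A minimal sketch, assuming the articles have already been word-segmented and saved one sentence per line in a plain-text file (tw_corpus.txt and the saved file name are hypothetical):

# train a cbow model with 100-dimensional vectors on the segmented corpus
tw_model = fasttext.train_unsupervised('tw_corpus.txt', model='cbow', dim=100)
tw_model.save_model('tw_news_cbow_100.bin')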

tw = fasttext.load_model(path) # "path" is where my model is stored. 
tw.get_dimension()
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
100

The tw model covers only 11,089 words in total because it was trained on a much smaller corpus than the hk model.

len(tw.words)
11089

Comparison

My original plan was to compare all three models and see what similar words they come up with given the same keyword. But the ft model is so huge that I can't load all of them into RAM at once; the RAM limit on Colab is about 12 GB. So we'll just compare the tw and hk models.
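
If the Facebook model is still sitting in memory from the first section, one simple way to reclaim that RAM before loading the other two models is to delete the object (a small sketch, assuming you no longer need ft in this session):

import gc

del ft        # drop the reference to the big Facebook model
gc.collect()  # nudge the garbage collector to release the memory right away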

Since we're not concerned with the degree of similarity, let's write a simple function to show just similar words.

def similar_words(keyword, model):
  # get_nearest_neighbors() returns a list of (score, word) tuples
  top_10 = model.get_nearest_neighbors(keyword)
  # keep only the words and drop the similarity scores
  top_10 = [w[1] for w in top_10]
  return top_10

Then, calling the function similar_words() with a keyword and a fastText model as its two required arguments shows the top ten words most similar to the keyword.

similar_words("疫情", hk)
['疫症', '病疫情', '武漢肺炎', '疫潮', '疫', '新冠肺炎', '疫調', '疫市', '新型冠狀病毒', '疫病']

Now let's write a function to show the results of the two models side by side in a dataframe.

import pandas as pd

models = {'hk': hk, 'tw': tw}
def compare_models(keyword, **models):
  # look up the top ten similar words in each model
  hk_results = similar_words(keyword, models['hk'])
  tw_results = similar_words(keyword, models['tw'])
  # put the two result lists side by side, one column per model
  data = {'HKNews_'+keyword: hk_results, 'TWNews_'+keyword: tw_results}
  df = pd.DataFrame(data)
  return df

Let's test it out with the keyword "疫情".

test = compare_models("疫情", **models)
test
HKNews_疫情 TWNews_疫情
0 疫症 疫情國
1 病疫情 因應
2 武漢肺炎 防堵
3 疫潮 切記
4 疫 擴散
5 新冠肺炎 屬地
6 疫調 疫情處
7 疫市 升溫
8 新型冠狀病毒 警訊
9 疫病 嚴峻

It's interesting that the words similar to "總統" (meaning "the president") include "蔡總統" (meaning "President Tsai", referring to Taiwan's incumbent president) according to the hk model but not the tw model. I'd have expected the opposite.

test = compare_models("總統", **models)
test
HKNews_總統 TWNews_總統
0 代總統 主持
1 美國總統 總統府
2 前總統 部長
3 民選總統 親臨
4 李總統 局長
5 副總統 蘇益仁
6 下任總統 幹事長
7 總理 副院長
8 首相 李明亮
9 蔡總統 座談會

Again, it is the hk model, not the tw model, that knows "蔡英文" (meaning "Tsai Ing-wen") is most similar to "蔡總統" (meaning "President Tsai"). The two terms refer to the same person.

test = compare_models("蔡總統", **models)
test
HKNews_蔡總統 TWNews_蔡總統
0 蔡英文 總統
1 賴清德 主持
2 馬英九 部長
3 李總統 親臨
4 林全 局長
5 民進黨 陳建仁
6 柯文哲 座談會
7 總統
8 川普 副院長
9 總統府 總統府

Finally, let's write a function to quickly compare a list of keywords.

def concat_dfs(keyword_list):
  # build one comparison dataframe per keyword
  dfs = []
  for word in keyword_list:
    df = compare_models(word, **models)
    dfs.append(df)
  # join all the dataframes side by side
  results = pd.concat(dfs, axis=1)
  return results
keywords = "疫情 疫苗 病毒 肺炎 檢疫 流感 台灣"
key_list = keywords.split()
concat_dfs(key_list)

HKNews_疫情 TWNews_疫情 HKNews_疫苗 TWNews_疫苗 HKNews_病毒 TWNews_病毒 HKNews_肺炎 TWNews_肺炎 HKNews_檢疫 TWNews_檢疫 HKNews_流感 TWNews_流感 HKNews_台灣 TWNews_台灣
0 疫症 疫情國 流感疫苗 接種 輪狀病毒 病毒型 武漢肺炎 豬鏈球菌 檢疫所 檢疫官 流感病毒 新流感 臺灣 臺灣
1 病疫情 因應 免疫針 接種地 含病毒 腺病毒 武肺 鏈球菌 檢疫中心 檢疫站 流行性感冒 防流感 台灣國 根除
2 武漢肺炎 防堵 抗體 接種為 冠狀病毒 病毒株 新冠肺炎 疾患 檢疫局 檢疫局 禽流感 打流感 台灣政府 歷史
3 疫潮 切記 藥物 接種點 新病毒 病毒學 病疫 雙球菌 檢疫站 航機 流行病 對流感 中國大陸 一直
4 疫 擴散 卡介苗 接種卡 腺病毒 型別 病疫情 心包膜炎 隔離 機場 疫症 豬流感 中國 亞太
5 新冠肺炎 屬地 抗生素 疫苗量 殺病毒 流行株 疫症 特殊 自我隔離 入境 病疫情 抗流感 台灣人 諸多
6 疫調 疫情處 輪狀病毒 卡介苗 麻疹病毒 株型別 非典型肺炎 侵襲性 隔離者 調查表 麻疹 流感疫 中國台灣 世紀
7 疫市 升溫 接種 防病毒 疫情 冠狀動脈 病毒檢測 港口 流行性腮腺炎 季節性 台灣獨立 跨國性
8 新型冠狀病毒 警訊 預防接種 多合一 冠状病毒 重組 廢炎 症候群 健康申報 登機 流感疫苗 流感病 台灣社 面臨
9 疫病 嚴峻 麻疹 廠牌 病原體 毒株 疫病 冠狀病毒 檢測 聲明卡 登革熱 新型 中華民國 之中
keywords = "頭痛 發燒 流鼻水 "
key_list = keywords.split()
concat_dfs(key_list)
HKNews_頭痛 TWNews_頭痛 HKNews_發燒 TWNews_發燒 HKNews_流鼻水 TWNews_流鼻水
0 偏頭痛 肌肉痛 咳嗽 出現 鼻水 鼻水
1 頭疼 骨頭痛 病徵 症狀 流鼻涕
2 胃痛 噁心 發高燒 喉嚨痛 咳嗽 鼻塞
3 痠痛 骨頭 發病 嗅覺 喉嚨痛 喉嚨
4 酸痛 肌肉 喉嚨痛 味覺 出疹 喉嚨癢
5 絞痛 後眼 流鼻水 鼻水 發燒 喉嚨痛
6 腫痛 畏寒 症狀 咳嗽 皮疹 嗅覺
7 頭暈 倦怠 徵狀 喉嚨 流鼻血 味覺
8 心絞痛 窩痛 出疹 疲倦 肚瀉 紅疹
9 腰背痛 結膜 呼吸困難 咳血 倦怠

Recap

You can easily find the words most similar to a keyword you're interested in just by loading a fastText model. And for it to work pretty well, you don't even need a huge corpus at hand. Even if you don't know how to train a model from scratch, you can still make good use of fastText by loading pretrained models, like those released by Facebook. In total, 157 languages are covered, including even Malay and Malayalam! (Btw, check out this Malayalam grammar that I wrote, now archived on Semantic Scholar.)
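
Loading one of those other languages is just a matter of changing the language code passed to download_model(). Here's a quick sketch for Malayalam, assuming 'ml', its two-letter ISO code, is what the fastText downloads use:

import fasttext
import fasttext.util

fasttext.util.download_model('ml', if_exists='ignore')  # ml = Malayalam
ml = fasttext.load_model('cc.ml.300.bin')
ml.get_nearest_neighbors("മലയാളം")  # the word "Malayalam" in Malayalam script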

Note: This is my first post written in a Jupyter notebook. After I uploaded the .ipynb file to GitHub, the post didn't show up automatically and I got a CI failure warning in my repo. Listed in the tip section below are the things I did to fix the problem, though I'm not sure which of them did the trick.
Tip: 1. requested an automatic update by following the instructions in the troubleshooting guide 2. deleted the backtick symbol in the summary section of the front matter 3. uploaded the .ipynb file straight from Colab to GitHub instead of doing so manually