fastText embeddings for traditional Chinese
fastText models are useful for finding similar words in a corpus. This post compares two small fastText models trained on data in traditional Chinese.
- Intro
- fastText cbow 300 dimensions from Facebook
- fastText cbow 300 dimensions from ToastyNews in Cantonese
- fastText cbow 100 dimensions from Taiwan news in traditional Chinese
- Comparison
- Recap

Intro
This video explains what fastText is all about as if you were five years old. If the video doesn't load, click on this link. Basically, a fastText model maps a word to a series of numbers (called vectors or embeddings in NLP jargon) so that word similarity can be calculated from those numbers.
fastText cbow 300 dimensions from Facebook
Here are the simple steps for loading the Chinese model released by Facebook, abbreviated here as ft.
!pip install fasttext
import fasttext
import fasttext.util
fasttext.util.download_model('zh', if_exists='ignore') # zh = Chinese
ft = fasttext.load_model('cc.zh.300.bin')
The ft model covers a whopping 2,000,000 words, because it's trained on a huge corpus (Common Crawl plus Wikipedia).
len(ft.words)
Let's check out the top 10 words most similar to "疫情" (meaning "pandemic situation") according to the ft model. The numbers indicate the degree of similarity. The larger the number, the greater the similarity.
ft.get_nearest_neighbors("疫情")
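Under the hood, those scores are cosine similarities between word vectors. Just to make the "series of numbers" idea concrete, here's a quick sketch that recomputes the similarity of two words by hand (the word pair is arbitrary; get_word_vector() is part of the fastText Python API):
import numpy as np
def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over the product of norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
v1 = ft.get_word_vector("疫情")
v2 = ft.get_word_vector("病毒")  # "virus"
cosine_similarity(v1, v2)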
The nearest-neighbor results are pretty good. But the downside is that the ft model is huge: once unzipped, the model file is about 6.74 GB.
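If disk space or RAM is a concern, fasttext.util can also shrink the pretrained model by reducing its dimensionality, at some cost in quality. A minimal sketch:
fasttext.util.reduce_model(ft, 100)  # reduce the 300-dim vectors to 100 dimensions in place
ft.get_dimension()  # now 100
ft.save_model('cc.zh.100.bin')  # the saved file is noticeably smaller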
fastText cbow 300 dimensions from ToastyNews in Cantonese
This article is what inspired me to write this post. The author trained a fastText model on articles written in Cantonese, which uses traditional characters. Here are the simple steps for loading his model, abbreviated here as hk.
Since his model is stored on Google Drive, I find it more convenient to use the gdown library to download it.
import gdown
url = 'https://drive.google.com/u/0/uc?export=download&confirm=4g-b&id=1kmZ8NKYDngKtA_-1f3ZdmbLV0CDBy1xA'
output = 'toasty_news.bin.gz'
gdown.download(url, output, quiet=False)
The file first needs to be unzipped before it can be loaded as a fastText model. An easy way to do that is the command !gunzip followed by the file name.
!gunzip toasty_news.bin.gz
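If you'd rather stay in Python, the standard library can do the same decompression:
import gzip
import shutil
# Decompress toasty_news.bin.gz into toasty_news.bin
with gzip.open('toasty_news.bin.gz', 'rb') as f_in, open('toasty_news.bin', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)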
hk = fasttext.load_model('/content/toasty_news.bin')
hk.get_dimension()
The hk model covers 222,906 words in total.
len(hk.words)
fastText cbow 100 dimensions from Taiwan news in traditional Chinese
I trained a fastText model on 5,816 Taiwan news articles in traditional Chinese, most of them related to health and diseases.
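For reference, training such a model boils down to a single call to fastText's unsupervised API. Here's a minimal sketch (not my exact settings), assuming the articles have already been word-segmented (e.g. with jieba) and saved as space-delimited text in a hypothetical file called corpus.txt:
# Hypothetical training run: 100-dimensional cbow vectors on a pre-segmented corpus
model = fasttext.train_unsupervised('corpus.txt', model='cbow', dim=100)
model.save_model('tw_news.bin')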
tw = fasttext.load_model(path) # "path" is where my model is stored.
tw.get_dimension()
The tw model covers only 11,089 words in total because it's trained on a much smaller corpus than the hk model.
len(tw.words)
Comparison
My original plan was to compare all three models and see what similar words they come up with given the same keyword. But the ft model is so large that I can't fit all three into RAM at once; the RAM limit on Colab is about 12 GB. So we'll just compare the tw and hk models.
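(If you want to check how much memory your own runtime has before loading a big model, psutil, which is usually pre-installed on Colab, gives a quick answer.)
import psutil
# Report total and currently available RAM in gigabytes
mem = psutil.virtual_memory()
print(f"total: {mem.total / 1e9:.1f} GB, available: {mem.available / 1e9:.1f} GB")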
Since we're not concerned with the degree of similarity, let's write a simple function to show just similar words.
def similar_words(keyword, model):
    # get_nearest_neighbors() returns (score, word) tuples; keep only the words
    top_10 = model.get_nearest_neighbors(keyword)
    top_10 = [w[1] for w in top_10]
    return top_10
Then, calling the function similar_words() with a keyword and a fastText model as the required arguments shows the top ten words most similar to the keyword.
similar_words("疫情", hk)
Now let's write a function to show the results of the two models side by side in a dataframe.
import pandas as pd
models = {'hk': hk, 'tw': tw}
def compare_models(keyword, **models):
    hk_results = similar_words(keyword, models['hk'])
    tw_results = similar_words(keyword, models['tw'])
    data = {'HKNews_'+keyword: hk_results, 'TWNews_'+keyword: tw_results}
    df = pd.DataFrame(data)
    return df
Let's test it out with the keyword "疫情".
test = compare_models("疫情", **models)
test
It's interesting that the words most similar to "總統" (meaning "the president") include "蔡總統" (meaning "President Tsai", referring to Taiwan's incumbent president) according to the hk model but not the tw model. I'd have expected the opposite.
test = compare_models("總統", **models)
test
Again, it's the hk model, not the tw model, that knows "蔡英文" (meaning "Tsai Ing-wen") is most similar to "蔡總統" ("President Tsai"). The two terms refer to the same person.
test = compare_models("蔡總統", **models)
test
Finally, let's write a function to quickly compare a list of keywords.
def concat_dfs(keyword_list):
    dfs = []
    for word in keyword_list:
        df = compare_models(word, **models)
        dfs.append(df)
    results = pd.concat(dfs, axis=1)
    return results
keywords = "疫情 疫苗 病毒 肺炎 檢疫 流感 台灣"
key_list = keywords.split()
concat_dfs(key_list)
keywords = "頭痛 發燒 流鼻水 "
key_list = keywords.split()
concat_dfs(key_list)
Recap
You can easily find the words most similar to a keyword you're interested in just by loading a fastText model. And for it to work reasonably well, you don't even need a huge corpus at hand. Even if you don't know how to train a model from scratch, you can still make good use of fastText by loading pretrained models, like those released by Facebook. In total, 157 languages are covered, including even Malay and Malayalam! (Btw, check out this Malayalam grammar that I wrote, now archived on Semantic Scholar.)
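For instance, grabbing one of those other pretrained models follows exactly the same pattern as above ('ml' is the ISO code for Malayalam):
fasttext.util.download_model('ml', if_exists='ignore')  # ml = Malayalam
ml = fasttext.load_model('cc.ml.300.bin')
ml.get_nearest_neighbors("കേരളം")  # "Kerala"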
When I pushed the .ipynb file to GitHub, the post didn't show up automatically and I got a CI failing warning in my repo. Listed in the tip section below is what I did to fix the problem, though I'm not sure which of them was the key.
- Save the .ipynb file straight from Colab to GitHub instead of doing this manually