fastText embeddings for traditional Chinese
fastText models are useful for finding similar words in a corpus. This post compares two small fastText models trained on data in traditional Chinese.
- Intro
- fastText cbow 300 dimensions from Facebook
- fastText cbow 300 dimensions from ToastyNews in Cantonese
- fastText cbow 100 dimensions from Taiwan news in traditional Chinese
- Comparison
- Recap
Intro
This video explains what fastText is all about as if you were five years old. If the video doesn't load, click on this link. Basically, a fastText model maps a word to a series of numbers (called vectors or embeddings in NLP jargon) so that word similarity can be calculated based on those numbers.
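To make "similarity based on numbers" concrete, here's a minimal sketch of the idea. It assumes a fastText model has already been loaded as model (the loading steps come right below) and computes the cosine similarity between two word vectors by hand:
import numpy as np
# Assumes `model` is an already-loaded fastText model (see the loading steps below).
vec_a = model.get_word_vector("疫情")  # "pandemic situation"
vec_b = model.get_word_vector("病毒")  # "virus"
# Cosine similarity: the closer to 1, the more similar the two words.
cosine = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(cosine)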
fastText cbow 300 dimensions from Facebook
Here're the simple steps for loading the Chinese model released by Facebook, abbreviated here as ft.
!pip install fasttext
import fasttext
import fasttext.util
fasttext.util.download_model('zh', if_exists='ignore') # zh = Chinese
ft = fasttext.load_model('cc.zh.300.bin')
The ft model covers a whopping 2,000,000 words because it's trained on a HUGE corpus.
len(ft.words)
Let's check out the top 10 words most similar to "疫情" (meaning "pandemic situation") according to the ft
model. The numbers indicate the degree of similarity. The larger the number, the greater the similarity.
ft.get_nearest_neighbors("疫情")
The results are pretty good. But the downside is that the ft model is huge: after being unzipped, the model file is about 6.74 GB.
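If the size is a problem, one option (which I don't use in this post) is to shrink the model with fasttext.util.reduce_model, which trims the vectors down to fewer dimensions so that a much smaller file can be saved. A rough sketch, with a hypothetical output file name:
# Reduce the vectors from 300 to 100 dimensions to save RAM and disk space.
fasttext.util.reduce_model(ft, 100)
ft.save_model('cc.zh.100.bin')  # hypothetical file name for the reduced model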
fastText cbow 300 dimensions from ToastyNews in Cantonese
This article is what inspired me to write this post. The author trained a fastText
model on articles written in Cantonese, which uses traditional characters. Here're the simple steps for loading his model, abbreviated here as hk.
Since his model is stored on GDrive, I find it more convenient to use the gdown
library to download the model.
import gdown
url = 'https://drive.google.com/u/0/uc?export=download&confirm=4g-b&id=1kmZ8NKYDngKtA_-1f3ZdmbLV0CDBy1xA'
output = 'toasty_news.bin.gz'
gdown.download(url, output, quiet=False)
The file first needs to be unzipped before it can be loaded as a fastText model. An easy way to do that is the !gunzip command followed by the file name.
!gunzip toasty_news.bin.gz
hk = fasttext.load_model('/content/toasty_news.bin')
hk.get_dimension()
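By the way, if you'd rather stay in Python instead of calling a shell command, the same decompression can be done with the standard gzip and shutil modules. This is just an alternative sketch, not what I actually ran:
import gzip
import shutil
# Decompress toasty_news.bin.gz to toasty_news.bin without leaving Python.
with gzip.open('toasty_news.bin.gz', 'rb') as f_in, open('toasty_news.bin', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)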
The hk model covers 222,906 words in total.
len(hk.words)
fastText cbow 100 dimensions from Taiwan news in traditional Chinese
I trained a fastText model on 5,816 articles of Taiwan news in traditional Chinese, most of them related to health and diseases.
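In case you're curious, training such a model boils down to a single call to fasttext.train_unsupervised. The sketch below shows roughly what that looks like; taiwan_news.txt is a placeholder for the corpus file (one article per line), and everything other than model='cbow' and dim=100 is left at the defaults:
# Train a CBOW model with 100-dimensional vectors on the corpus.
# 'taiwan_news.txt' is a placeholder path, not the actual file I used.
tw = fasttext.train_unsupervised('taiwan_news.txt', model='cbow', dim=100)
tw.save_model('taiwan_news.bin')
Once saved, the model can be loaded back in a later session like any other fastText model, which is what I do below.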
tw = fasttext.load_model(path) # "path" is where my model is stored.
tw.get_dimension()
The tw model covers only 11,089 words in total because it's trained on a much smaller corpus than the hk model.
len(tw.words)
Comparison
My original plan was to compare all three models and see what similar words they come up with given the same keyword. But the ft model is so huge that I can't load all of them into RAM; the RAM limit on Colab is about 12 GB. So we'll just compare the tw and hk models.
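If you did load the ft model earlier in the same session, a simple way to reclaim its memory before loading the other two is to delete the reference and force garbage collection, roughly like this:
import gc
del ft        # drop the reference to the huge Facebook model
gc.collect()  # ask Python to free the memory right away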
Since we're not concerned with the degree of similarity, let's write a simple function to show just similar words.
def similar_words(keyword, model):
    # get_nearest_neighbors() returns (score, word) pairs; keep only the words.
    top_10 = model.get_nearest_neighbors(keyword)
    top_10 = [w[1] for w in top_10]
    return top_10
Then, calling the function similar_words(), with a keyword and a fastText model as the required arguments, shows the top ten words most similar to the keyword.
similar_words("疫情", hk)
Now let's write a function to show the results of the two models side by side in a dataframe.
import pandas as pd
models = {'hk': hk, 'tw': tw}
def compare_models(keyword, **models):
    hk_results = similar_words(keyword, models['hk'])
    tw_results = similar_words(keyword, models['tw'])
    data = {'HKNews_'+keyword: hk_results, 'TWNews_'+keyword: tw_results}
    df = pd.DataFrame(data)
    return df
Let's test it out with the keyword "疫情".
test = compare_models("疫情", **models)
test
It's interesting that the words most similar to "總統" (meaning "the president") include "蔡總統" (meaning "President Tsai", referring to Taiwan's incumbent president) according to the hk model but not the tw model. I'd have expected the opposite.
test = compare_models("總統", **models)
test
Again, it is the hk model, not the tw model, that knows "蔡英文" (meaning "Tsai Ing-wen") is most similar to "蔡總統" (meaning "President Tsai"). The two expressions refer to the same person.
test = compare_models("蔡總統", **models)
test
Finally, let's write a function to quickly compare a list of keywords.
def concat_dfs(keyword_list):
    dfs = []
    for word in keyword_list:
        df = compare_models(word, **models)
        dfs.append(df)
    results = pd.concat(dfs, axis=1)
    return results
keywords = "疫情 疫苗 病毒 肺炎 檢疫 流感 台灣"
key_list = keywords.split()
concat_dfs(key_list)
keywords = "頭痛 發燒 流鼻水 "
key_list = keywords.split()
concat_dfs(key_list)
Recap
You can easily find the words most similar to a keyword you're interested in just by loading a fastText model. And for it to work pretty well, you don't even need a huge corpus at hand. Even if you don't know how to train a model from scratch, you can still make good use of fastText by loading pretrained models, like those released by Facebook. In total, 157 languages are covered, including even Malay and Malayalam! (Btw, check out this Malayalam grammar that I wrote, now archived on Semantic Scholar.)
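Getting one of those pretrained models is the same recipe shown earlier, just with a different language code. For example, the Malay model would look something like this (an untested sketch; 'ms' is the language code for Malay):
# Same steps as for Chinese, only with a different language code.
fasttext.util.download_model('ms', if_exists='ignore')  # 'ms' = Malay
ms = fasttext.load_model('cc.ms.300.bin')
ms.get_nearest_neighbors("demam")  # "demam" means "fever" in Malay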
After I uploaded the .ipynb file to GitHub, the post didn't show up automatically and I got a CI failing warning in my repo. Listed in the tip section below is what I did to fix the problem, though I'm not sure which of them was the key.
- Save the .ipynb file straight from Colab to GitHub instead of doing this manually