
Unlike English, Chinese does not use spaces in its writing system, which can be a pain in the neck (or in the eyes, for that matter) if you're learning to read Chinese. In a way, it's like trying to make sense of long German words like Lebensabschnittspartner, which roughly means "the person I'm with today" (taken from David Sedaris's language lessons published in The New Yorker). We'll see how computer models can help us break a stretch of Chinese text into words (called tokenization in NLP jargon). To give the models a hard time, we'll test them on the following text, which has no punctuation.

text = "今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死"

This text is challenging not only because it can be segmented in multiple ways but also because it can express quite different meanings depending on how you interpret it. For instance, the part 今年好煩惱少不得打官司 could mean either "This year will be great for you. You'll have few worries. Don't file any lawsuit" or "This year, you'll be very worried. A lawsuit is inevitable". Either way, it sounds like the kind of aphorism you'd find in fortune cookies. Now that you know that the secret to aphorisms always being right is ambiguity, we'll turn to five Python libraries to do the hard work for us.
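To make the ambiguity concrete, here is a minimal sketch that enumerates every way to split the first part of the text using a tiny hand-picked word list. The word list TOY_DICT and the function name all_segmentations are assumptions made for illustration; none of the libraries below works from this exact dictionary.

```python
from functools import lru_cache

# Toy word list: an assumption for illustration, not any library's lexicon.
TOY_DICT = {"今年", "好", "煩惱", "少", "不得", "少不得", "打官司"}


def all_segmentations(text, vocab):
    """Return every way to split `text` into words that appear in `vocab`."""
    max_len = max(len(word) for word in vocab)

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the text yields one (empty) continuation.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in vocab:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


for seg in all_segmentations("今年好煩惱少不得打官司", TOY_DICT):
    print(" | ".join(seg))
```

With this toy word list the sketch finds two splits, which differ only in whether 少不得 is kept whole or broken into 少 and 不得, mirroring the two readings above. A real segmenter has to pick one of them, usually by scoring the candidates.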


Of the five tools to be introduced here, Jieba is perhaps the most widely used one, and it's even pre-installed on Colab and supported by spaCy. Unfortunately, Jieba told us that a lawsuit is inevitable this year... 😭

import jieba
tokens = jieba.cut(text)  
jieba_default = " | ".join(tokens)
今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸養 | 牛 | 隻 | 隻 | 大如山 | 老鼠 | 隻 | 隻 | 死

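The result is quite satisfying, except for 酸養, which is not even a word. Jieba is famous for being super fast. Timing the call over 1,000,000 loops gave a best result of 256 nanoseconds per loop! Keep in mind, though, that jieba.cut returns a generator, so this figure mostly reflects the cost of creating the generator rather than of segmenting the whole text.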

%timeit jieba.cut(text)
The slowest run took 12.90 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 256 ns per loop
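A caveat on reading this number: jieba.cut returns a generator, so timing the bare call does not walk through the tokens (jieba.lcut, which returns a list, would). The sketch below illustrates the difference with the standard timeit module. The function lazy_tokens is a hypothetical stand-in for a generator-based tokenizer, not Jieba itself, so the absolute numbers say nothing about Jieba's real speed.

```python
import timeit


def lazy_tokens(text):
    """Hypothetical generator-based tokenizer: yields one character at a time."""
    for ch in text:
        yield ch


sample = "今年好煩惱少不得打官司" * 100

# Creating the generator does no work yet; list() forces every token to be produced.
create_only = timeit.timeit(lambda: lazy_tokens(sample), number=2000)
consume_all = timeit.timeit(lambda: list(lazy_tokens(sample)), number=2000)

print(f"create generator only: {create_only:.4f} s")
print(f"consume every token:   {consume_all:.4f} s")
```

For a fairer comparison with libraries that return lists, it may be worth timing a call that consumes the generator, such as list(jieba.cut(text)).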

Let's write a function for later use.

def Jieba_tokenizer(text):
  tokens = jieba.cut(text)  
  result = " | ".join(tokens)
  return result


As its name suggests, PKUSeg is built by the Language Computing and Machine Learning Group at Peking (a.k.a. Beijing) University. It was recently integrated into spaCy.

!pip install -U pkuseg

Here's the result.

import pkuseg

pku = pkuseg.pkuseg()        
result = pku.cut(text) 
result = " | ".join(result)
'今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀 | 酒剛 | 剛 | 好 | 做 | 醋 | 格外 | 酸養 | 牛 | 隻隻 | 大 | 如 | 山 | 老鼠 | 隻隻 | 死'

Compared with Jieba, PKUSeg not only got more wrong tokens (酸養 and 酒剛) but also ran at a much slower speed.

%timeit pku.cut(text)
1000 loops, best of 3: 648 µs per loop

Yet, PKUSeg has one nice feature absent from Jieba.

Users have the option to choose from four domain-specific models, including news, web, medicine, and tourism.

This can be quite helpful if you're specifically dealing with texts in any of the four domains. Let's test the news domain with the first paragraph of a news article about Covid-19 published on Yahoo News.

article = '''

Here's the result with the default setting.

pku = pkuseg.pkuseg()        
result = pku.cut(article) 
result = " | ".join(result)
'台灣 | 新冠 | 肺炎 | 連續 | 第6 | 天 | 零 | 本土 | 病例 | 破功 | ! | 中央 | 流行 | 疫情 | 指揮 | 中心 | 指揮官 | 陳時 | 中 | 今天 | 宣布 | 國內 | 新增 | 4 | 例 | 本土 | 確定 | 病例 | , | 均 | 為 | 桃園 | 醫院 | 感染 | 事件 | 之 | 確 | 診個案 | 相關 | 接觸者 | , | 其中 | 3 | 例 | 為案 | 863 | 之 | 同 | 住家人 | ( | 案 | 907 | 、 | 909 | 、 | 910 | ) | , | 研判 | 與案 | 863 | 、 | 864 | 、 | 865 | 為 | 一起 | 家庭 | 群聚案 | , | 其中 | 1 | 人 | ( | 案 | 907 | ) | 死亡 | , | 是 | 相隔 | 8 | 個 | 月 | 以 | 來 | 再 | 添 | 死亡 | 病例 | ; | 另 | 1 | 例 | 為案 | 889 | 之 | 就 | 醫 | 相關 | 接觸者 | ( | 案 | 908 | ) | 。 | 此外 | , | 今天 | 也 | 新增 | 6例 | 境外 | 移入 | 確定 | 病例 | , | 分別 | 自 | 印尼 | ( | 案 | 901 | ) | 、 | 捷克 | ( | 案 | 902 | ) | 及 | 巴西 | ( | 案 | 903 | 至 | 906 | ) | 入境 | 。 | 衛福部 | 桃園 | 醫院 | 感染 | 累計 | 達 | 19 | 例 | ( | 其中 | 1 | 人 | 死亡 | ) | , | 全 | 台 | 達 | 909 | 例 | 、 | 8 | 死 | 。'

Here's the result with the model_name argument set to news. Both models made some mistakes here and there, but what's surprising to me is that the news-specific model even made a mistake when parsing 新冠肺炎, which literally means "novel coronavirus pneumonia" and refers to Covid-19.

pku = pkuseg.pkuseg(model_name='news')        
result = pku.cut(article) 
result = " | ".join(result)
Downloading: "" to /root/.pkuseg/
100%|██████████| 43767759/43767759 [00:00<00:00, 104004889.71it/s]
'台灣 | 新 | 冠 | 肺 | 炎連 | 續 | 第6天 | 零本土 | 病例 | 破功 | ! | 中央 | 流行疫情指揮中心 | 指揮 | 官 | 陳 | 時 | 中 | 今天 | 宣布 | 國內 | 新增 | 4例 | 本土 | 確定 | 病例 | , | 均 | 為桃園醫院 | 感染 | 事件 | 之 | 確 | 診 | 個 | 案 | 相關 | 接觸 | 者 | , | 其中 | 3例 | 為案 | 863 | 之 | 同 | 住 | 家人 | (案 | 907 | 、 | 909 | 、 | 910) | , | 研判 | 與案 | 863 | 、 | 864 | 、 | 865為 | 一起 | 家庭 | 群 | 聚案 | , | 其中 | 1 | 人 | ( | 案 | 907 | ) | 死亡 | , | 是 | 相隔 | 8個月 | 以 | 來 | 再 | 添 | 死亡 | 病例 | ; | 另 | 1例 | 為案 | 889 | 之 | 就 | 醫 | 相關 | 接觸 | 者 | (案 | 908) | 。 | 此外 | , | 今天 | 也 | 新增 | 6例 | 境外 | 移入 | 確定 | 病例 | , | 分 | 別 | 自 | 印尼 | (案 | 901) | 、 | 捷克 | (案 | 902) | 及 | 巴西 | (案 | 903至906 | ) | 入境 | 。 | 衛 | 福部桃園醫院 | 感染 | 累 | 計達 | 19例 | ( | 其中 | 1 | 人 | 死亡 | ) | , | 全 | 台 | 達 | 909例 | 、 | 8 | 死 | 。'

Let's write a function for later use.

def PKU_tokenizer(text):
  pku = pkuseg.pkuseg()
  tokens = pku.cut(text) 
  result = " | ".join(tokens)
  return result


Next, we'll try PyHanLP. It'll take some time to download the model and data files (about 640MB in total).

!pip install pyhanlp
from pyhanlp import *
下载 到 /usr/local/lib/python3.6/dist-packages/pyhanlp/static/
100.00%, 1 MB, 187 KB/s, 还有 0 分  0 秒   
下载 到 /usr/local/lib/python3.6/dist-packages/pyhanlp/static/
98.24%, 626 MB, 8117 KB/s, 还有 0 分  1 秒   

With PyHanLP, we got a similar parsing result, but without the error that Jieba produced.

tokens = HanLP.segment(text)
token_list = [res.word for res in tokens]
pyhan = " | ".join(token_list)
今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀 | 酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸 | 養 | 牛 | 隻 | 隻 | 大 | 如山 | 老鼠 | 隻 | 隻 | 死

However, going by the figures in this post, PyHanLP is about 96 times slower than Jieba (24.6 µs versus 256 ns per loop), as timed below.

%timeit HanLP.segment(text)
The slowest run took 11.80 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 24.6 µs per loop

Let's write a function for later use.

def PyHan_tokenizer(text):
  tokens = HanLP.segment(text)
  token_list = [res.word for res in tokens]
  result = " | ".join(token_list)
  return result


Next is SnowNLP, which I came across only recently. While PyHanLP is about 640MB in size, SnowNLP takes up less than 40MB.

!pip install snownlp
from snownlp import SnowNLP
SnowNLP gave a similar result, but made two parsing mistakes. Neither 做醋格 nor 外酸 is a legitimate word.

tokens = SnowNLP(text)
token_list = tokens.words
snow =  " | ".join(token_list)
今 | 年 | 好 | 煩 | 惱 | 少不得 | 打 | 官司 | 釀 | 酒 | 剛 | 剛 | 好 | 做醋格 | 外酸 | 養 | 牛 | 隻 | 隻 | 大 | 如 | 山 | 老 | 鼠 | 隻 | 隻 | 死

SnowNLP not only made more mistakes, but also took longer to run.

%timeit  SnowNLP(text)
10000 loops, best of 3: 35.4 µs per loop

But SnowNLP has a convenient feature inspired by TextBlob. Any instance of SnowNLP() has attributes such as words, pinyin (for the romanization of words), tags (for part-of-speech tags), and even sentiments, which gives the probability of a text being positive. The first three are shown below, in that order.

['今', '年', '好', '煩', '惱', '少不得', '打', '官司', '釀', '酒', '剛', '剛', '好', '做醋格', '外酸', '養', '牛', '隻', '隻', '大', '如', '山', '老', '鼠', '隻', '隻', '死']
['jin', 'nian', 'hao', '煩', '惱', 'shao', 'bu', 'de', 'da', 'guan', 'si', '釀', 'jiu', '剛', '剛', 'hao', 'zuo', 'cu', 'ge', 'wai', 'suan', '養', 'niu', '隻', '隻', 'da', 'ru', 'shan', 'lao', 'shu', '隻', '隻', 'si']
[('今', 'Tg'), ('年', 'q'), ('好', 'a'), ('煩', 'Rg'), ('惱', 'Rg'), ('少不得', 'Rg'), ('打', 'v'), ('官司', 'n'), ('釀', 'u'), ('酒', 'n'), ('剛', 'i'), ('剛', 'Mg'), ('好', 'a'), ('做醋格', 'Ag'), ('外酸', 'Ng'), ('養', 'Dg'), ('牛', 'Ag'), ('隻', 'Bg'), ('隻', 'a'), ('大', 'a'), ('如', 'v'), ('山', 'n'), ('老', 'a'), ('鼠', 'Ng'), ('隻', 'Ag'), ('隻', 'Bg'), ('死', 'a')]

Again, let's write a function for later use.

def Snow_tokenizer(text):
  tokens = SnowNLP(text)
  token_list = tokens.words
  result = " | ".join(token_list)
  return result

CKIP Transformers

While the four models above are primarily trained on simplified Chinese, CKIP Transformers is trained on traditional Chinese. It was created by the CKIP Lab at Academia Sinica. As its name suggests, CKIP Transformers is built on Transformer-based models such as BERT and ALBERT.



!pip install -U ckip-transformers
from ckip_transformers.nlp import CkipWordSegmenter
CKIP Transformers gives its users the freedom to choose between speed and accuracy. It comes with three levels; the smaller the number, the shorter the running time. All you need to do is pass a number to the level argument of CkipWordSegmenter(). Here are the models and F1 scores for each level:

  • Level 1: CKIP ALBERT Tiny, 96.66%
  • Level 2: CKIP ALBERT Base, 97.33%
  • Level 3: CKIP BERT Base, 97.60%

By comparison, the F1 score for Jieba is only 81.18%. For more stats, visit the CKIP Lab's repo.

ws_driver  = CkipWordSegmenter(level=1, device=0)

Here's the result at Level 1. What's surprising here is that the big chunk 大如山老鼠 was not segmented any further. But this is not a mistake. It simply means that the model has learned it as an idiom.

tokens  = ws_driver([text])
ckip_1 = " | ".join(tokens[0])
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 3284.50it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00,  3.98it/s]
今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸 | 養 | 牛 | 隻隻 | 大如山老鼠 | 隻隻 | 死

Of the five libraries covered here, CKIP Transformers takes by far the longest to run (17.8 ms per loop, best of 3). But what it lacks in speed, it makes up for in accuracy.

%timeit ws_driver([text])

Let's reinstantiate the CkipWordSegmenter() class and set the level to 2 this time.

ws_driver  = CkipWordSegmenter(level=2, device=0)

Here's the result at Level 2, where 大如山老鼠 was properly segmented into 大, 如, and 山老鼠.

tokens  = ws_driver([text])
ckip_2 = " | ".join(tokens[0])
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 2253.79it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 47.86it/s]
今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛好 | 做醋 | 格外 | 酸 | 養牛 | 隻隻 | 大 | 如 | 山老鼠 | 隻隻 | 死

Finally, let's create an instance of CkipWordSegmenter() at Level 3.

ws_driver  = CkipWordSegmenter(level=3, device=0)

However, Level 3 didn't produce a better result than Level 2. For instance, 牛隻, though a legitimate token, is not appropriate in this context.

tokens  = ws_driver([text])
ckip_3 = " | ".join(tokens[0])
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 976.10it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 59.33it/s]
今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸 | 養 | 牛隻 | 隻 | 大 | 如 | 山 | 老鼠 | 隻隻 | 死

Here's the function for later use, which takes two arguments instead of one, unlike in previous cases.

def Ckip_tokenizer(text, level):
  ws_driver  = CkipWordSegmenter(level=level, device=0)
  tokens  = ws_driver([text])
  result = " | ".join(tokens[0])
  return result
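One thing to keep in mind about the function above is that it builds a new CkipWordSegmenter on every call, which presumably means reloading the model each time. A possible remedy is to cache one instance per level. The sketch below shows that pattern with functools.lru_cache. DummySegmenter is a hypothetical stand-in for the real model so that the example runs without any downloads, and get_segmenter and cached_tokenizer are illustrative names, not part of CKIP Transformers.

```python
from functools import lru_cache


class DummySegmenter:
    """Hypothetical stand-in for an expensive-to-load model such as CkipWordSegmenter."""

    loads = 0  # counts how many times a "model" has been constructed

    def __init__(self, level):
        DummySegmenter.loads += 1
        self.level = level

    def __call__(self, texts):
        # Placeholder behaviour only: split each text into single characters.
        return [list(t) for t in texts]


@lru_cache(maxsize=None)
def get_segmenter(level):
    """Build the segmenter for `level` once and reuse it on later calls."""
    return DummySegmenter(level)


def cached_tokenizer(text, level):
    tokens = get_segmenter(level)([text])
    return " | ".join(tokens[0])


print(cached_tokenizer("今年好", 1))
print(cached_tokenizer("煩惱少", 1))   # reuses the level-1 instance
print(cached_tokenizer("打官司", 2))   # builds a second instance for level 2
print("instances built:", DummySegmenter.loads)
```

In a real notebook, swapping DummySegmenter for the actual segmenter class would keep each level's model in memory after its first use, at the cost of holding up to three models at once.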


To compare the five libraries, let's write a general function.

def Tokenizer(text, style):
  if style == 'jieba':
    result = Jieba_tokenizer(text)
  elif style == 'pku':
    result = PKU_tokenizer(text)
  elif style == 'pyhan':
    result = PyHan_tokenizer(text)
  elif style == 'snow':
    result = Snow_tokenizer(text)
  elif style == 'ckip':
    res1 = Ckip_tokenizer(text, 1)
    res2 = Ckip_tokenizer(text, 2)
    res3 = Ckip_tokenizer(text, 3)
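    result = f"Level 1: {res1}\nLevel 2: {res2}\nLevel 3: {res3}"
  else:
    # Fail early on an unknown style instead of hitting an unbound `result` below
    raise ValueError(f"Unknown tokenizer style: {style}")
  output = f"Result tokenized by {style}: \n{result}"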
  return output

Now I'm interested in finding out whether the choice between simplified and traditional Chinese has any effect on segmentation results. In addition to the text we've been trying (let's rename it text_A), we'll also test another challenging text taken from the PyHanLP repo (let's call it text_B), which is intended to be ambiguous in multiple places. Given these two texts, two scripts (simplified and traditional), and five segmentation libraries, we end up with 20 combinations of texts and libraries in total.

import itertools

textA_tra = "今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死"
textA_sim = "今年好烦恼少不得打官司酿酒刚刚好做醋格外酸养牛隻隻大如山老鼠隻隻死"
textB_tra = "工信處女幹事每月經過下屬科室都要親口交代24口交換機等技術性器件的安裝工作"
textB_sim = "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作"
texts = [textA_tra, textA_sim, textB_tra, textB_sim]
tokenizers = ['jieba', 'pku', 'pyhan', 'snow','ckip']

testing_tup = list(itertools.product(texts, tokenizers))
[('今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死', 'jieba'),
 ('今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死', 'pku'),
 ('今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死', 'pyhan'),
 ('今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死', 'snow'),
 ('今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死', 'ckip'),
 ('今年好烦恼少不得打官司酿酒刚刚好做醋格外酸养牛隻隻大如山老鼠隻隻死', 'jieba'),
 ('今年好烦恼少不得打官司酿酒刚刚好做醋格外酸养牛隻隻大如山老鼠隻隻死', 'pku'),
 ('今年好烦恼少不得打官司酿酒刚刚好做醋格外酸养牛隻隻大如山老鼠隻隻死', 'pyhan'),
 ('今年好烦恼少不得打官司酿酒刚刚好做醋格外酸养牛隻隻大如山老鼠隻隻死', 'snow'),
 ('今年好烦恼少不得打官司酿酒刚刚好做醋格外酸养牛隻隻大如山老鼠隻隻死', 'ckip'),
 ('工信處女幹事每月經過下屬科室都要親口交代24口交換機等技術性器件的安裝工作', 'jieba'),
 ('工信處女幹事每月經過下屬科室都要親口交代24口交換機等技術性器件的安裝工作', 'pku'),
 ('工信處女幹事每月經過下屬科室都要親口交代24口交換機等技術性器件的安裝工作', 'pyhan'),
 ('工信處女幹事每月經過下屬科室都要親口交代24口交換機等技術性器件的安裝工作', 'snow'),
 ('工信處女幹事每月經過下屬科室都要親口交代24口交換機等技術性器件的安裝工作', 'ckip'),
 ('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作', 'jieba'),
 ('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作', 'pku'),
 ('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作', 'pyhan'),
 ('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作', 'snow'),
 ('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作', 'ckip')]
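The slices used in the loops that follow rely on the order in which itertools.product emits pairs: it varies the last iterable fastest, so the five tokenizers for one text always sit next to each other. Here is a small sketch of that ordering, using short placeholder labels instead of the full texts (the labels are an assumption made for brevity).

```python
import itertools

# Short placeholder labels stand in for the four full texts (assumption for brevity).
text_labels = ["textA_tra", "textA_sim", "textB_tra", "textB_sim"]
tokenizers = ["jieba", "pku", "pyhan", "snow", "ckip"]

pairs = list(itertools.product(text_labels, tokenizers))
print(len(pairs), "combinations")

# Each text's five pairs are contiguous, so stepping by len(tokenizers) groups them.
step = len(tokenizers)
for i in range(0, len(pairs), step):
    chunk = pairs[i:i + step]
    print(chunk[0][0], "->", [name for _, name in chunk])
```

This is why slicing the list in steps of five picks out one text at a time.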

Here are the results for traditional textA.

for sent in testing_tup[:5]:
  result = Tokenizer(sent[0], sent[1])
  print(result)
Result tokenized by jieba: 
今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸養 | 牛 | 隻 | 隻 | 大如山 | 老鼠 | 隻 | 隻 | 死
Result tokenized by pku: 
今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀 | 酒剛 | 剛 | 好 | 做 | 醋 | 格外 | 酸養 | 牛 | 隻隻 | 大 | 如 | 山 | 老鼠 | 隻隻 | 死
Result tokenized by pyhan: 
今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀 | 酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸 | 養 | 牛 | 隻 | 隻 | 大 | 如山 | 老鼠 | 隻 | 隻 | 死
Result tokenized by snow: 
今 | 年 | 好 | 煩 | 惱 | 少不得 | 打 | 官司 | 釀 | 酒 | 剛 | 剛 | 好 | 做醋格 | 外酸 | 養 | 牛 | 隻 | 隻 | 大 | 如 | 山 | 老 | 鼠 | 隻 | 隻 | 死
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 1287.78it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 136.95it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 1394.38it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 66.44it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 998.41it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 60.47it/s]
Result tokenized by ckip: 
Level 1: 今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸 | 養 | 牛 | 隻隻 | 大如山老鼠 | 隻隻 | 死
Level 2: 今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛好 | 做醋 | 格外 | 酸 | 養牛 | 隻隻 | 大 | 如 | 山老鼠 | 隻隻 | 死
Level 3: 今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸 | 養 | 牛隻 | 隻 | 大 | 如 | 山 | 老鼠 | 隻隻 | 死

Here are the results for the simplified version of the same text. Notice that the outcome can be quite different simply because a traditional text has been converted to its simplified counterpart.

for sent in testing_tup[5:10]:
  result = Tokenizer(sent[0], sent[1])
  print(result)
Result tokenized by jieba: 
今年 | 好 | 烦恼 | 少不得 | 打官司 | 酿酒 | 刚刚 | 好 | 做 | 醋 | 格外 | 酸 | 养牛 | 隻 | 隻 | 大如山 | 老鼠 | 隻 | 隻 | 死
Result tokenized by pku: 
今年 | 好 | 烦恼 | 少不得 | 打官司 | 酿酒 | 刚刚 | 好 | 做 | 醋 | 格外 | 酸养 | 牛隻 | 隻 | 大 | 如 | 山 | 老鼠 | 隻隻 | 死
Result tokenized by pyhan: 
今年 | 好 | 烦恼 | 少不得 | 打官司 | 酿酒 | 刚刚好 | 做 | 醋 | 格外 | 酸 | 养牛 | 隻 | 隻 | 大 | 如山 | 老鼠 | 隻 | 隻 | 死
Result tokenized by snow: 
今年 | 好 | 烦恼 | 少不得 | 打官司 | 酿酒 | 刚刚 | 好 | 做醋 | 格外 | 酸 | 养 | 牛 | 隻 | 隻 | 大 | 如 | 山 | 老 | 鼠 | 隻 | 隻 | 死
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 303.61it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 123.89it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 695.69it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 66.45it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 392.84it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 72.00it/s]
Result tokenized by ckip: 
Level 1: 今年 | 好 | 烦恼 | 少不得 | 打官司 | 酿 | 酒 | 刚刚 | 好 | 做 | 醋 | 格外 | 酸 | 养 | 牛隻隻 | 大如山老鼠 | 隻隻 | 死
Level 2: 今年 | 好 | 烦恼 | 少不得 | 打官司 | 酿酒 | 刚刚 | 好 | 做醋 | 格外 | 酸 | 养 | 牛 | 隻隻 | 大 | 如 | 山老鼠 | 隻隻 | 死
Level 3: 今年 | 好 | 烦恼 | 少不得 | 打官司 | 酿酒 | 刚刚好 | 做 | 醋 | 格外 | 酸 | 养 | 牛隻 | 隻 | 大 | 如 | 山 | 老鼠 | 隻隻 | 死

Here are the results for traditional textB. Serious mistakes include 處女 (for "virgin") and 口交 (for "blowjob"). Both are correct words in Chinese, but not the intended ones in this context.

for sent in testing_tup[10:15]:
  result = Tokenizer(sent[0], sent[1])
  print(result)
Result tokenized by jieba: 
工信 | 處女 | 幹事 | 每月 | 經過 | 下屬 | 科室 | 都 | 要 | 親口 | 交代 | 24 | 口交 | 換機 | 等 | 技術性 | 器件 | 的 | 安裝 | 工作
Result tokenized by pku: 
工信 | 處女 | 幹事 | 每月 | 經 | 過下 | 屬科室 | 都 | 要 | 親口 | 交代 | 24 | 口 | 交 | 換機 | 等 | 技術性 | 器件 | 的 | 安裝 | 工作
Result tokenized by pyhan: 
工 | 信 | 處女 | 幹 | 事 | 每月 | 經 | 過 | 下 | 屬 | 科室 | 都 | 要 | 親 | 口 | 交代 | 24 | 口交 | 換機 | 等 | 技 | 術 | 性 | 器件 | 的 | 安 | 裝 | 工作
Result tokenized by snow: 
工 | 信 | 處 | 女 | 幹 | 事 | 每 | 月 | 經 | 過 | 下 | 屬 | 科室 | 都 | 要 | 親口 | 交代 | 24 | 口 | 交 | 換 | 機 | 等 | 技 | 術性 | 器件 | 的 | 安 | 裝 | 工作
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 494.49it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 119.49it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 402.87it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 60.66it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 3942.02it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 60.56it/s]
Result tokenized by ckip: 
Level 1: 工信 | 處女 | 幹事 | 每 | 月 | 經過 | 下屬 | 科室 | 都 | 要 | 親口 | 交代 | 24 | 口 | 交換機 | 等 | 技術性 | 器件 | 的 | 安裝 | 工作
Level 2: 工信處 | 女 | 幹事 | 每 | 月 | 經過 | 下屬 | 科室 | 都 | 要 | 親口 | 交代 | 24 | 口 | 交換機 | 等 | 技術性 | 器件 | 的 | 安裝 | 工作
Level 3: 工信處 | 女 | 幹事 | 每 | 月 | 經過 | 下屬 | 科室 | 都 | 要 | 親口 | 交代 | 24 | 口 | 交換機 | 等 | 技術性 | 器件 | 的 | 安裝 | 工作

Here are the results for the simplified version of textB. For textB, CKIP Transformers Levels 2 and 3 are the most stable, giving the same error-free results regardless of the writing system.

for sent in testing_tup[15:]:
  result = Tokenizer(sent[0], sent[1])
  print(result)
Result tokenized by jieba: 
工信处 | 女干事 | 每月 | 经过 | 下属 | 科室 | 都 | 要 | 亲口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作
Result tokenized by pku: 
工信 | 处女 | 干事 | 每月 | 经过 | 下属 | 科室 | 都 | 要 | 亲口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作
Result tokenized by pyhan: 
工信处 | 女干事 | 每月 | 经过 | 下属 | 科室 | 都 | 要 | 亲口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作
Result tokenized by snow: 
工 | 信处女 | 干事 | 每月 | 经过 | 下属 | 科室 | 都 | 要 | 亲口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 1220.69it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 131.83it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 878.39it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 71.48it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 1254.65it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 60.75it/s]
Result tokenized by ckip: 
Level 1: 工信处 | 女干 | 事 | 每 | 月 | 经过 | 下 | 属 | 科室 | 都 | 要 | 亲 | 口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作
Level 2: 工信处 | 女 | 干事 | 每 | 月 | 经过 | 下属 | 科室 | 都 | 要 | 亲口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作
Level 3: 工信处 | 女 | 干事 | 每 | 月 | 经过 | 下属 | 科室 | 都 | 要 | 亲口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作
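The F1 scores quoted earlier for CKIP Transformers and Jieba come from the CKIP Lab's own evaluation. To give a feel for how such a score can be computed, here is a minimal sketch of one common approach, span-level F1: each token becomes a (start, end) character span, and a predicted span counts as a hit only if the reference has the same span. Treating the CKIP Level 2 output for simplified textB as the reference is an assumption made purely for illustration, and to_spans and segmentation_f1 are hypothetical helper names.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans


def segmentation_f1(predicted, reference):
    """Return (precision, recall, F1) over exactly matching word spans."""
    pred, ref = to_spans(predicted), to_spans(reference)
    hits = len(pred & ref)
    precision = hits / len(pred)
    recall = hits / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Outputs copied from the simplified textB results above.
# Reference: CKIP Level 2 (an assumption for illustration). Prediction: PKUSeg.
reference = "工信处 | 女 | 干事 | 每 | 月 | 经过 | 下属 | 科室 | 都 | 要 | 亲口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作".split(" | ")
predicted = "工信 | 处女 | 干事 | 每月 | 经过 | 下属 | 科室 | 都 | 要 | 亲口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作".split(" | ")

p, r, f = segmentation_f1(predicted, reference)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Against that reference, 17 of PKUSeg's 20 tokens match, which works out to an F1 of about 0.83 on this single sentence. A score on one sentence is only a toy figure; published scores are computed over a full test set.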


This post has tested five word segmentation libraries against two challenging Chinese texts. Here are the takeaways:

  • If you value speed more than anything, Jieba is definitely the top choice. If you're dealing with traditional Chinese, it is good practice to convert your texts to simplified Chinese before feeding them to Jieba, as this may produce better results.

  • If you care more about accuracy instead, it's best to use CKIP Transformers. Its Level 2 and 3 produce consistent results whether your texts are in traditional or simplified Chinese.

  • Finally, if you hope to leverage the power of NLP libraries such as spaCy and Texthero (whose slogan, by the way, is really awesome: from zero to hero), you'll have to go for Jieba or PKUSeg. I hope spaCy will also add CKIP to its inventory of tokenizers in the near future.
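As a footnote to the first takeaway, here is a toy sketch of what converting traditional to simplified characters looks like in code. The mapping covers only the five characters that differ between textA_tra and textA_sim in this post, so it is an illustration rather than a real converter; for actual texts you would want a dedicated conversion tool such as OpenCC. TRA_TO_SIM and to_simplified are hypothetical names.

```python
# Toy mapping: only the characters that differ between textA_tra and textA_sim.
# It is NOT a general converter (for instance, it leaves 隻 untouched, as textA_sim does).
TRA_TO_SIM = str.maketrans({"煩": "烦", "惱": "恼", "釀": "酿", "剛": "刚", "養": "养"})


def to_simplified(text):
    """Replace the mapped traditional characters with their simplified forms."""
    return text.translate(TRA_TO_SIM)


textA_tra = "今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死"
textA_sim = "今年好烦恼少不得打官司酿酒刚刚好做醋格外酸养牛隻隻大如山老鼠隻隻死"

converted = to_simplified(textA_tra)
print(converted)
print("matches textA_sim:", converted == textA_sim)
```

The converted string can then be passed to whichever tokenizer function you prefer.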