Intro

spaCy is an industrial-strength natural language processing library in Python, and supports multiple human languages, including Chinese. For segmenting Chinese texts into words, spaCy uses Jieba or PKUSeg under the hood. However, neither of them beats CKIP Transformers in accuracy when it comes to traditional Chinese (see my previous post for a comparison). So I'll show how to plug in CKIP Transformers to spaCy to get the best out of both.

For the purpose of demonstration, I'll situate this integration in a pipeline for extracting keywords from texts. Compared with other NLP tasks, keyword extraction is a relatively easy job. TextRank and RAKE seem to be among the most widely adopted algorithms for keyword extraction. I tried most of the methods mentioned in this article, but there doesn't seem to be any easy-peasy implementation of TextRank or RAKE that produces decent results for traditional Chinese texts. So the first part of this post walks through a pipeline that actually works, and the second part records other methods that failed. I included the second part because I believe in this quote:

“We learn wisdom from failure much more than from success. We often discover what will do, by finding out what will not do; and probably he who never made a mistake never made a discovery.” ― Samuel Smiles

Note: TextRank is based on Google’s PageRank, which is used to compute the rank of webpages. This article on Natural Language Processing for Hackers demonstrates the connection between the two. From it I learned a tidbit: I always assumed that Page as in PageRank refers to webpages, but it turns out to be the family name of Larry Page, the creator of PageRank.

Working pipeline

Set variables

Let's start with defining two variables that users of our keyword extraction program might want to modify: CUSTOM_STOPWORDS for a list of words that users definitely hope to exclude from keyword candidates and KW_NUM for the number of keywords that they'd like to extract from a document.

CUSTOM_STOPWORDS = [
                    "民眾","朋友","市民","人數", "全民","人員","人士","里民",
                    "影本","系統", "項目", "證件", "資格","公民", "對象","個人",
                    ]

KW_NUM = 10

Preprocess texts

I took an announcement from Land Administration Bureau of Kaohsiung City Goverment as a sample text, but you can basically take any text in traditional Chinese to test the program.

Tip: To run the program with your own text, follow the following steps:

Click on Open in Colab at the upper right corner of this page.
Click on File and then Save a copy in Drive.
Replace the following text with your own text.
Click on Runtime and then Run all.
Go to the section Put it together to see the outcome.

raw_text = '''
市府地政局109年度第4季開發區土地標售，共計推出8標9筆優質建地，訂於109年12月16日開標，合計總底價12 億4049萬6164 元。

 

第93期重劃區，原為國軍眷村，緊鄰國定古蹟-「原日本海軍鳳山無線電信所」，市府為保存古蹟同時活化眷村遷移後土地，以重劃方式整體開發，新闢住宅區、道路、公園及停車場，使本區具有歷史文化內涵與綠色休閒特色，生活機能更加健全。地政局首次推出1筆大面積土地，面積約2160坪，地形方整，雙面臨路，利於規劃興建景觀大樓，附近有市場、學校、公園及大東文化園區，距捷運大東站、鳳山國中站及鳳山火車站僅數分鐘車程，交通四通八達，因土地稀少性及區位條件絕佳，勢必成為投資人追逐焦點。

 

第87期重劃區，位於省道台1線旁，鄰近捷運南岡山站，重劃後擁有完善的道路系統、公園綠地及毗鄰醒村懷舊文化景觀建築群，具備優質居住環境及交通便捷要件，地政局一推出土地標售，即掀起搶標熱潮，本季再釋出1筆面積約93坪土地，臨20米介壽路及鵬程東路，附近有岡山文化中心、兆湘國小、公13、公14、陽明公園及劉厝公園，區位條件佳，投資人準備搶進！

 

第77期市地重劃區，位於鳳山區快速道路省道台88線旁，近中山高五甲系統交流道，近年推出土地標售皆順利完銷。本季再推出2筆土地，其中1筆面積約526坪，臨保華一路，適合商業使用；1筆面積107坪，位於代德三街，自用投資兩相宜。

 

高雄大學區段徵收區，為北高雄優質文教特區，優質居住環境，吸引投資人進駐，本季再推出2標2筆土地，其中1筆第三種商業區土地，面積約639坪，位於大學26街，近高雄大學正門及萬坪藍田公園，地形方正，使用強度高，適合興建優質住宅大樓；另1筆住三用地，面積約379坪，臨28米藍昌路，近高雄大學及中山高中，交通便捷。

 

另第37期重劃區及前大寮農地重劃區各推出1至2筆土地，價格合理。

 

第4季土地標售作業於109年12月1日公告，投資大眾可前往地政局土地開發處土地處分科索取標售海報及標單，或直接上網高雄房地產億年旺網站、地政局及土地開發處網站查詢下載相關資料，在期限前完成投標，另再提醒投標人，本年度已更新投標單格式，投標大眾請注意應以新式投標單投標以免投標無效作廢。

 

為配合防疫需求，本季開標作業除於地政局第一會議室辦理外，另將於地政局Facebook粉絲專頁同步直播，請大眾多加利用。

 

洽詢專線：(07)3373451或(07)3314942

高雄房地產億年旺網站（網址：http://eland.kcg.gov.tw/）

高雄市政府地政局網站（網址：http://landp.kcg.gov.tw/）

高雄市政府地政局土地開發處網站（網址：http://landevp.kcg.gov.tw/）　
'''
raw_text[-300:]

'及土地開發處網站查詢下載相關資料，在期限前完成投標，另再提醒投標人，本年度已更新投標單格式，投標大眾請注意應以新式投標單投標以免投標無效作廢。\n\n \n\n為配合防疫需求，本季開標作業除於地政局第一會議室辦理外，另將於地政局Facebook粉絲專頁同步直播，請大眾多加利用。\n\n \n\n洽詢專線：(07)3373451或(07)3314942\n\n高雄房地產億年旺網站（網址：http://eland.kcg.gov.tw/）\n\n高雄市政府地政局網站（網址：http://landp.kcg.gov.tw/）\n\n高雄市政府地政局土地開發處網站（網址：http://landevp.kcg.gov.tw/）\u3000\n'

I find this lightweight library nlp2 quite handy for text cleaning. The clean_all function removes URL links, HTML elements, and unused tags.

Note: I want to give a shoutout to Eric Lam, who created nlp2 and other useful NLP tools such as NLPrep, TFkit, and nlp2go.

!pip install nlp2
from nlp2 import clean_all

After cleaning, our sample text looks like this. Notice that all the URL links are gone now.

text = clean_all(raw_text)
text[-300:]

'合理。\n\n \n\n第4季土地標售作業於109年12月1日公告，投資大眾可前往地政局土地開發處土地處分科索取標售海報及標單，或直接上網高雄房地產億年旺網站、地政局及土地開發處網站查詢下載相關資料，在期限前完成投標，另再提醒投標人，本年度已更新投標單格式，投標大眾請注意應以新式投標單投標以免投標無效作廢。\n\n \n\n為配合防疫需求，本季開標作業除於地政局第一會議室辦理外，另將於地政局Facebook粉絲專頁同步直播，請大眾多加利用。\n\n \n\n洽詢專線： 3373451或 3314942\n\n高雄房地產億年旺網站（網址： ）\n\n高雄市政府地政局網站（網址： ）\n\n高雄市政府地政局土地開發處網站（網址： ）'

Install `spacy` and `ckip-transformers`

!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download zh_core_web_sm

!pip install -U ckip-transformers

Tokenize texts with `ckip-transformers`

Let's create a driver for word segmentation and one for parts of speech. CKIP Transformers also has a built-in driver for named entity recognition, i.e. CkipNerChunker. But we won't use it here.

Tip: By default, CPU is used. If you want to use GPU to speed up word segmentation, initialize ws_driver this way instead: ws_driver = CkipWordSegmenter(device=-1)

from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger
ws_driver  = CkipWordSegmenter()
pos_driver = CkipPosTagger()

Important: Make sure that the input to ws_driver() is a list even if you’re only dealing with a single text. Otherwise, words won’t be properly segmented. Notice that the input to pos_driver() is the output of ws_driver().

ws  = ws_driver([text])
pos = pos_driver(ws)

Here're the segmented tokens.

tokens = ws[0]
print(tokens)

['市府', '地政局', '109年度', '第4', '季', '開發區', '土地', '標售', '，', '共計', '推出', '8', '標', '9', '筆', '優質', '建地', '，', '訂', '於', '109年', '12月', '16日', '開標', '，', '合計', '總底價', '12 億', '4049萬', '6164 ', '元', '。', '\n\n \n\n', '第93', '期', '重劃區', '，', '原', '為', '國軍', '眷村', '，', '緊鄰', '國定', '古蹟', '-', '「', '原', '日本', '海軍', '鳳山', '無線', '電信所', '」', '，', '市府', '為', '保存', '古蹟', '同時', '活化', '眷村', '遷移', '後', '土地', '，', '以', '重劃', '方式', '整體', '開發', '，', '新', '闢', '住宅區', '、', '道路', '、', '公園', '及', '停車場', '，', '使', '本', '區', '具有', '歷史', '文化', '內涵', '與', '綠色', '休閒', '特色', '，', '生活', '機能', '更加', '健全', '。', '地政局', '首次', '推出', '1', '筆', '大', '面積', '土地', '，', '面積', '約', '2160', '坪', '，', '地形', '方整', '，', '雙面', '臨', '路', '，', '利於', '規劃', '興建', '景觀', '大樓', '，', '附近', '有', '市場', '、', '學校', '、', '公園', '及', '大東', '文化', '園區', '，', '距', '捷運', '大東站', '、', '鳳山', '國中站', '及', '鳳山', '火車站', '僅', '數', '分鐘', '車程', '，', '交通', '四通八達', '，', '因', '土地', '稀少性', '及', '區位', '條件', '絕佳', '，', '勢必', '成為', '投資人', '追逐', '焦點', '。', '\n\n \n\n', '第87', '期', '重劃區', '，', '位於', '省道', '台1線', '旁', '，', '鄰近', '捷運', '南', '岡山站', '，', '重劃', '後', '擁有', '完善', '的', '道路', '系統', '、', '公園', '綠地', '及', '毗鄰', '醒村', '懷舊', '文化', '景觀', '建築群', '，', '具備', '優質', '居住', '環境', '及', '交通', '便捷', '要件', '，', '地政局', '一', '推出', '土地', '標售', '，', '即', '掀起', '搶標', '熱潮', '，', '本', '季', '再', '釋出', '1', '筆', '面積', '約', '93', '坪', '土地', '，', '臨', '20', '米', '介壽路', '及', '鵬程東路', '，', '附近', '有', '岡山', '文化', '中心', '、', '兆湘', '國小', '、', '公13', '、', '公14', '、', '陽明', '公園', '及', '劉厝', '公園', '，', '區位', '條件', '佳', '，', '投資人', '準備', '搶進', '！', '\n\n \n\n', '第77', '期', '市地', '重劃區', '，', '位於', '鳳山區', '快速', '道路', '省道', '台88', '線', '旁', '，', '近', '中山高', '五甲', '系統', '交流道', '，', '近年', '推出', '土地', '標售', '皆', '順利', '完銷', '。', '本', '季', '再', '推出', '2', '筆', '土地', '，', '其中', '1', '筆', '面積', '約', '526', '坪', '，', '臨', '保華一路', '，', '適合', '商業', '使用', '；', '1', '筆', '面積', '107', '坪', '，', '位於', '代德三街', '，', '自用', '投資', '兩', '相宜', '。', '\n\n \n\n', '高雄', '大學', '區段', '徵收區', '，', '為', '北', '高雄', '優質', '文教', '特區', '，', '優質', '居住', '環境', '，', '吸引', '投資人', '進駐', '，', '本', '季', '再', '推出', '2', '標', '2', '筆', '土地', '，', '其中', '1', '筆', '第三', '種', '商業區', '土地', '，', '面積', '約', '639', '坪', '，', '位於', '大學', '26街', '，', '近', '高雄', '大學', '正門', '及', '萬', '坪', '藍田', '公園', '，', '地形', '方正', '，', '使用', '強度', '高', '，', '適合', '興建', '優質', '住宅', '大樓', '；', '另', '1', '筆', '住三', '用地', '，', '面積', '約', '379', '坪', '，', '臨', '28', '米', '藍昌路', '，', '近', '高雄', '大學', '及', '中山', '高中', '，', '交通', '便捷', '。', '\n\n \n\n', '另', '第37', '期', '重劃區', '及', '前', '大寮', '農地', '重劃區', '各', '推出', '1', '至', '2', '筆', '土地', '，', '價格', '合理', '。', '\n\n \n\n', '第4', '季', '土地', '標售', '作業', '於', '109年', '12月', '1日', '公告', '，', '投資', '大眾', '可', '前往', '地政局', '土地', '開發處', '土地處', '分科', '索取', '標售', '海報', '及', '標單', '，', '或', '直接', '上網', '高雄', '房地產', '億年旺', '網站', '、', '地政局', '及', '土地', '開發處', '網站', '查詢', '下載', '相關', '資料', '，', '在', '期限', '前', '完成', '投標', '，', '另', '再', '提醒', '投標人', '，', '本', '年度', '已', '更新', '投標單', '格式', '，', '投標', '大眾', '請', '注意', '應', '以', '新式', '投標單', '投標', '以免', '投標', '無效', '作廢', '。', '\n\n \n\n', '為', '配合', '防疫', '需求', '，', '本', '季', '開標', '作業', '除', '於', '地政局', '第一', '會議室', '辦理', '外', '，', '另', '將', '於', '地政局', 'Facebook', '粉絲', '專頁', '同步', '直播', '，', '請', '大眾', '多加', '利用', '。', '\n\n \n\n', '洽詢', '專線', '：', ' 3373', '451', '或', ' 3314942', '\n', '\n', '高雄', '房地產', '億', '年', '旺', '網站', '（', '網址', '：', ' ', '）', '\n', '\n', '高雄市', '政府', '地政局', '網站', '（', '網址', '：', ' ', '）', '\n', '\n', '高雄市', '政府', '地政局', '土地', '開發處', '網站', '（', '網址', '：', ' ', '）']

By contrast, Jieba produced lots of wrongly segmented tokens, which is precisely why we prefer CKIP Transformers.

import jieba
print(list(jieba.cut(text)))

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.958 seconds.
Prefix dict has been built successfully.

['市府', '地', '政局', '109', '年度', '第', '4', '季開', '發區', '土地', '標售', '，', '共計', '推出', '8', '標', '9', '筆優質', '建地', '，', '訂', '於', '109', '年', '12', '月', '16', '日', '開標', '，', '合計', '總底價', '12', ' ', '億', '4049', '萬', '6164', ' ', '元', '。', '\n', '\n', ' ', '\n', '\n', '第', '93', '期重', '劃區', '，', '原為國', '軍', '眷村', '，', '緊鄰', '國定', '古', '蹟', '-', '「', '原', '日本海', '軍鳳山', '無線', '電信', '所', '」', '，', '市府', '為', '保存', '古', '蹟', '同時', '活化', '眷村', '遷移', '後', '土地', '，', '以', '重劃', '方式', '整體', '開發', '，', '新闢', '住宅', '區', '、', '道路', '、', '公園', '及', '停車場', '，', '使本區', '具有', '歷史', '文化', '內涵', '與', '綠色', '休閒', '特色', '，', '生活', '機能', '更加', '健全', '。', '地', '政局', '首次', '推出', '1', '筆大面積', '土地', '，', '面積', '約', '2160', '坪', '，', '地形', '方整', '，', '雙面', '臨路', '，', '利', '於', '規劃', '興建景', '觀大樓', '，', '附近', '有', '市場', '、', '學校', '、', '公園', '及', '大東', '文化', '園區', '，', '距捷', '運大東', '站', '、', '鳳山國', '中站', '及鳳', '山火', '車站', '僅數', '分鐘', '車程', '，', '交通', '四通', '八達', '，', '因', '土地', '稀少', '性及', '區位', '條件', '絕佳', '，', '勢必成', '為', '投資人', '追逐', '焦點', '。', '\n', '\n', ' ', '\n', '\n', '第', '87', '期重', '劃區', '，', '位', '於', '省', '道', '台', '1', '線旁', '，', '鄰近', '捷運', '南岡山', '站', '，', '重劃', '後', '擁有', '完善', '的', '道路', '系統', '、', '公園', '綠地', '及', '毗', '鄰醒', '村懷舊', '文化', '景觀', '建築群', '，', '具備', '優質', '居住', '環境', '及', '交通', '便捷', '要件', '，', '地', '政局', '一', '推出', '土地', '標售', '，', '即', '掀起', '搶標', '熱潮', '，', '本季', '再釋', '出', '1', '筆面', '積約', '93', '坪', '土地', '，', '臨', '20', '米', '介壽路', '及鵬程', '東路', '，', '附近', '有岡山', '文化', '中心', '、', '兆', '湘國', '小', '、', '公', '13', '、', '公', '14', '、', '陽明', '公園', '及', '劉厝公園', '，', '區位', '條件', '佳', '，', '投資人', '準備', '搶進', '！', '\n', '\n', ' ', '\n', '\n', '第', '77', '期市', '地', '重劃區', '，', '位', '於', '鳳山區', '快速道路', '省道', '台', '88', '線旁', '，', '近', '中山', '高', '五甲', '系統', '交流', '道', '，', '近年', '推出', '土地', '標售', '皆', '順利', '完銷', '。', '本季', '再', '推出', '2', '筆', '土地', '，', '其中', '1', '筆面', '積約', '526', '坪', '，', '臨保華', '一路', '，', '適合', '商業', '使用', '；', '1', '筆面積', '107', '坪', '，', '位', '於', '代德三街', '，', '自用', '投資', '兩', '相宜', '。', '\n', '\n', ' ', '\n', '\n', '高雄', '大學區', '段', '徵收', '區', '，', '為', '北高雄', '優質', '文教', '特區', '，', '優質', '居住', '環境', '，', '吸引', '投資人', '進駐', '，', '本季', '再', '推出', '2', '標', '2', '筆', '土地', '，', '其中', '1', '筆', '第三', '種商業區', '土地', '，', '面積', '約', '639', '坪', '，', '位', '於', '大學', '26', '街', '，', '近高雄', '大學', '正門', '及', '萬坪', '藍田公園', '，', '地形', '方正', '，', '使用', '強度', '高', '，', '適合', '興建', '優質', '住宅', '大樓', '；', '另', '1', '筆住', '三', '用地', '，', '面積', '約', '379', '坪', '，', '臨', '28', '米', '藍昌路', '，', '近高雄', '大學及', '中山', '高中', '，', '交通', '便捷', '。', '\n', '\n', ' ', '\n', '\n', '另', '第', '37', '期重', '劃區', '及', '前', '大', '寮', '農地', '重劃區', '各', '推出', '1', '至', '2', '筆', '土地', '，', '價格', '合理', '。', '\n', '\n', ' ', '\n', '\n', '第', '4', '季', '土地', '標售', '作業', '於', '109', '年', '12', '月', '1', '日', '公告', '，', '投資大眾', '可', '前往', '地', '政局', '土地', '開發處', '土地', '處', '分科', '索取', '標售', '海報', '及', '標單', '，', '或', '直接', '上網', '高雄房', '地產', '億年', '旺', '網站', '、', '地', '政局', '及', '土地', '開發處', '網站', '查詢', '下載', '相關', '資料', '，', '在', '期限', '前', '完成', '投標', '，', '另', '再', '提醒', '投標', '人', '，', '本年度', '已', '更新', '投標', '單', '格式', '，', '投標', '大眾', '請', '注意', '應以', '新式', '投標單', '投標', '以免', '投標', '無效作', '廢', '。', '\n', '\n', ' ', '\n', '\n', '為', '配合', '防疫', '需求', '，', '本季', '開標', '作業', '除', '於', '地', '政局', '第一', '會議室', '辦理外', '，', '另將', '於', '地', '政局', 'Facebook', '粉絲', '專頁', '同步', '直播', '，', '請大眾', '多加', '利用', '。', '\n', '\n', ' ', '\n', '\n', '洽詢', '專線', '：', ' ', '3373451', '或', ' ', '3314942', '\n', '\n', '高雄房', '地產', '億年', '旺', '網站', '（', '網址', '：', ' ', '）', '\n', '\n', '高雄市', '政府', '地', '政局', '網站', '（', '網址', '：', ' ', '）', '\n', '\n', '高雄市', '政府', '地', '政局', '土地', '開發處', '網站', '（', '網址', '：', ' ', '）']

Feed tokenized results to `spacy` using `WhitespaceTokenizer`

The official website of spaCy describes several ways of adding a custom tokenizer. The simplest is to define the WhitespaceTokenizer class, which tokenizes a text on space characters. The output of tokenization can then be fed into subsequent operations down the pipeline, including tagger for parts-of-speech (POS) tagging, parser for dependency parsing, and ner for named entity recognition. This is possible primarily because tokenizer creates a Doc object whereas the other three steps operate on the Doc object, as illustrated in this graph.

Note: The original code for words is words = text.split(" "), but it caused an error to my text. So I revised it into words = text.strip().split().

from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.strip().split()
        return Doc(self.vocab, words=words)

Next, let's load the zh_core_web_sm model for Chinese, which we'll need for POS tagging. Then here comes the crucial part: nlp.tokenizer = WhitespaceTokenizer(nlp.vocab). This line of code sets the default tokenizer from Jieba to WhitespaceTokenizer, which we just defined above.

import spacy
nlp = spacy.load('zh_core_web_sm')
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

Then we join the tokenized result from CKIP Transformers to a single string of space-seperated tokens.

token_str = " ".join(tokens)
token_str

'市府 地政局 109年度 第4 季 開發區 土地 標售 ， 共計 推出 8 標 9 筆 優質 建地 ， 訂 於 109年 12月 16日 開標 ， 合計 總底價 12 億 4049萬 6164  元 。 \n\n \n\n 第93 期 重劃區 ， 原 為 國軍 眷村 ， 緊鄰 國定 古蹟 - 「 原 日本 海軍 鳳山 無線 電信所 」 ， 市府 為 保存 古蹟 同時 活化 眷村 遷移 後 土地 ， 以 重劃 方式 整體 開發 ， 新 闢 住宅區 、 道路 、 公園 及 停車場 ， 使 本 區 具有 歷史 文化 內涵 與 綠色 休閒 特色 ， 生活 機能 更加 健全 。 地政局 首次 推出 1 筆 大 面積 土地 ， 面積 約 2160 坪 ， 地形 方整 ， 雙面 臨 路 ， 利於 規劃 興建 景觀 大樓 ， 附近 有 市場 、 學校 、 公園 及 大東 文化 園區 ， 距 捷運 大東站 、 鳳山 國中站 及 鳳山 火車站 僅 數 分鐘 車程 ， 交通 四通八達 ， 因 土地 稀少性 及 區位 條件 絕佳 ， 勢必 成為 投資人 追逐 焦點 。 \n\n \n\n 第87 期 重劃區 ， 位於 省道 台1線 旁 ， 鄰近 捷運 南 岡山站 ， 重劃 後 擁有 完善 的 道路 系統 、 公園 綠地 及 毗鄰 醒村 懷舊 文化 景觀 建築群 ， 具備 優質 居住 環境 及 交通 便捷 要件 ， 地政局 一 推出 土地 標售 ， 即 掀起 搶標 熱潮 ， 本 季 再 釋出 1 筆 面積 約 93 坪 土地 ， 臨 20 米 介壽路 及 鵬程東路 ， 附近 有 岡山 文化 中心 、 兆湘 國小 、 公13 、 公14 、 陽明 公園 及 劉厝 公園 ， 區位 條件 佳 ， 投資人 準備 搶進 ！ \n\n \n\n 第77 期 市地 重劃區 ， 位於 鳳山區 快速 道路 省道 台88 線 旁 ， 近 中山高 五甲 系統 交流道 ， 近年 推出 土地 標售 皆 順利 完銷 。 本 季 再 推出 2 筆 土地 ， 其中 1 筆 面積 約 526 坪 ， 臨 保華一路 ， 適合 商業 使用 ； 1 筆 面積 107 坪 ， 位於 代德三街 ， 自用 投資 兩 相宜 。 \n\n \n\n 高雄 大學 區段 徵收區 ， 為 北 高雄 優質 文教 特區 ， 優質 居住 環境 ， 吸引 投資人 進駐 ， 本 季 再 推出 2 標 2 筆 土地 ， 其中 1 筆 第三 種 商業區 土地 ， 面積 約 639 坪 ， 位於 大學 26街 ， 近 高雄 大學 正門 及 萬 坪 藍田 公園 ， 地形 方正 ， 使用 強度 高 ， 適合 興建 優質 住宅 大樓 ； 另 1 筆 住三 用地 ， 面積 約 379 坪 ， 臨 28 米 藍昌路 ， 近 高雄 大學 及 中山 高中 ， 交通 便捷 。 \n\n \n\n 另 第37 期 重劃區 及 前 大寮 農地 重劃區 各 推出 1 至 2 筆 土地 ， 價格 合理 。 \n\n \n\n 第4 季 土地 標售 作業 於 109年 12月 1日 公告 ， 投資 大眾 可 前往 地政局 土地 開發處 土地處 分科 索取 標售 海報 及 標單 ， 或 直接 上網 高雄 房地產 億年旺 網站 、 地政局 及 土地 開發處 網站 查詢 下載 相關 資料 ， 在 期限 前 完成 投標 ， 另 再 提醒 投標人 ， 本 年度 已 更新 投標單 格式 ， 投標 大眾 請 注意 應 以 新式 投標單 投標 以免 投標 無效 作廢 。 \n\n \n\n 為 配合 防疫 需求 ， 本 季 開標 作業 除 於 地政局 第一 會議室 辦理 外 ， 另 將 於 地政局 Facebook 粉絲 專頁 同步 直播 ， 請 大眾 多加 利用 。 \n\n \n\n 洽詢 專線 ：  3373 451 或  3314942 \n \n 高雄 房地產 億 年 旺 網站 （ 網址 ：   ） \n \n 高雄市 政府 地政局 網站 （ 網址 ：   ） \n \n 高雄市 政府 地政局 土地 開發處 網站 （ 網址 ：   ）'

Next, we feed token_str, our tokenized text, to nlp to create a spaCy Doc object. From this point on, we are able to leverage the power of spaCy. For every token in a Doc object, we have access to its text via the attribute .text and its parts-of-speech label via the attribute .pos_.

doc = nlp(token_str)
print([token.text for token in doc])
print([token.pos_ for token in doc])

['市府', '地政局', '109年度', '第4', '季', '開發區', '土地', '標售', '，', '共計', '推出', '8', '標', '9', '筆', '優質', '建地', '，', '訂', '於', '109年', '12月', '16日', '開標', '，', '合計', '總底價', '12', '億', '4049萬', '6164', '元', '。', '第93', '期', '重劃區', '，', '原', '為', '國軍', '眷村', '，', '緊鄰', '國定', '古蹟', '-', '「', '原', '日本', '海軍', '鳳山', '無線', '電信所', '」', '，', '市府', '為', '保存', '古蹟', '同時', '活化', '眷村', '遷移', '後', '土地', '，', '以', '重劃', '方式', '整體', '開發', '，', '新', '闢', '住宅區', '、', '道路', '、', '公園', '及', '停車場', '，', '使', '本', '區', '具有', '歷史', '文化', '內涵', '與', '綠色', '休閒', '特色', '，', '生活', '機能', '更加', '健全', '。', '地政局', '首次', '推出', '1', '筆', '大', '面積', '土地', '，', '面積', '約', '2160', '坪', '，', '地形', '方整', '，', '雙面', '臨', '路', '，', '利於', '規劃', '興建', '景觀', '大樓', '，', '附近', '有', '市場', '、', '學校', '、', '公園', '及', '大東', '文化', '園區', '，', '距', '捷運', '大東站', '、', '鳳山', '國中站', '及', '鳳山', '火車站', '僅', '數', '分鐘', '車程', '，', '交通', '四通八達', '，', '因', '土地', '稀少性', '及', '區位', '條件', '絕佳', '，', '勢必', '成為', '投資人', '追逐', '焦點', '。', '第87', '期', '重劃區', '，', '位於', '省道', '台1線', '旁', '，', '鄰近', '捷運', '南', '岡山站', '，', '重劃', '後', '擁有', '完善', '的', '道路', '系統', '、', '公園', '綠地', '及', '毗鄰', '醒村', '懷舊', '文化', '景觀', '建築群', '，', '具備', '優質', '居住', '環境', '及', '交通', '便捷', '要件', '，', '地政局', '一', '推出', '土地', '標售', '，', '即', '掀起', '搶標', '熱潮', '，', '本', '季', '再', '釋出', '1', '筆', '面積', '約', '93', '坪', '土地', '，', '臨', '20', '米', '介壽路', '及', '鵬程東路', '，', '附近', '有', '岡山', '文化', '中心', '、', '兆湘', '國小', '、', '公13', '、', '公14', '、', '陽明', '公園', '及', '劉厝', '公園', '，', '區位', '條件', '佳', '，', '投資人', '準備', '搶進', '！', '第77', '期', '市地', '重劃區', '，', '位於', '鳳山區', '快速', '道路', '省道', '台88', '線', '旁', '，', '近', '中山高', '五甲', '系統', '交流道', '，', '近年', '推出', '土地', '標售', '皆', '順利', '完銷', '。', '本', '季', '再', '推出', '2', '筆', '土地', '，', '其中', '1', '筆', '面積', '約', '526', '坪', '，', '臨', '保華一路', '，', '適合', '商業', '使用', '；', '1', '筆', '面積', '107', '坪', '，', '位於', '代德三街', '，', '自用', '投資', '兩', '相宜', '。', '高雄', '大學', '區段', '徵收區', '，', '為', '北', '高雄', '優質', '文教', '特區', '，', '優質', '居住', '環境', '，', '吸引', '投資人', '進駐', '，', '本', '季', '再', '推出', '2', '標', '2', '筆', '土地', '，', '其中', '1', '筆', '第三', '種', '商業區', '土地', '，', '面積', '約', '639', '坪', '，', '位於', '大學', '26街', '，', '近', '高雄', '大學', '正門', '及', '萬', '坪', '藍田', '公園', '，', '地形', '方正', '，', '使用', '強度', '高', '，', '適合', '興建', '優質', '住宅', '大樓', '；', '另', '1', '筆', '住三', '用地', '，', '面積', '約', '379', '坪', '，', '臨', '28', '米', '藍昌路', '，', '近', '高雄', '大學', '及', '中山', '高中', '，', '交通', '便捷', '。', '另', '第37', '期', '重劃區', '及', '前', '大寮', '農地', '重劃區', '各', '推出', '1', '至', '2', '筆', '土地', '，', '價格', '合理', '。', '第4', '季', '土地', '標售', '作業', '於', '109年', '12月', '1日', '公告', '，', '投資', '大眾', '可', '前往', '地政局', '土地', '開發處', '土地處', '分科', '索取', '標售', '海報', '及', '標單', '，', '或', '直接', '上網', '高雄', '房地產', '億年旺', '網站', '、', '地政局', '及', '土地', '開發處', '網站', '查詢', '下載', '相關', '資料', '，', '在', '期限', '前', '完成', '投標', '，', '另', '再', '提醒', '投標人', '，', '本', '年度', '已', '更新', '投標單', '格式', '，', '投標', '大眾', '請', '注意', '應', '以', '新式', '投標單', '投標', '以免', '投標', '無效', '作廢', '。', '為', '配合', '防疫', '需求', '，', '本', '季', '開標', '作業', '除', '於', '地政局', '第一', '會議室', '辦理', '外', '，', '另', '將', '於', '地政局', 'Facebook', '粉絲', '專頁', '同步', '直播', '，', '請', '大眾', '多加', '利用', '。', '洽詢', '專線', '：', '3373', '451', '或', '3314942', '高雄', '房地產', '億', '年', '旺', '網站', '（', '網址', '：', '）', '高雄市', '政府', '地政局', '網站', '（', '網址', '：', '）', '高雄市', '政府', '地政局', '土地', '開發處', '網站', '（', '網址', '：', '）']
['NOUN', 'NOUN', 'NUM', 'NUM', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'PUNCT', 'VERB', 'VERB', 'NUM', 'CCONJ', 'NUM', 'ADV', 'VERB', 'NOUN', 'PUNCT', 'ADV', 'VERB', 'NOUN', 'NOUN', 'NOUN', 'VERB', 'PUNCT', 'NOUN', 'NOUN', 'NUM', 'PROPN', 'NOUN', 'NUM', 'NUM', 'PUNCT', 'NUM', 'NUM', 'NOUN', 'PUNCT', 'ADV', 'VERB', 'NOUN', 'NOUN', 'PUNCT', 'NOUN', 'VERB', 'PROPN', 'PUNCT', 'PUNCT', 'ADJ', 'PROPN', 'NOUN', 'VERB', 'ADV', 'NOUN', 'PUNCT', 'PUNCT', 'NOUN', 'VERB', 'VERB', 'PROPN', 'VERB', 'VERB', 'NOUN', 'NOUN', 'PART', 'NOUN', 'PUNCT', 'ADP', 'NOUN', 'NOUN', 'ADV', 'NOUN', 'PUNCT', 'ADV', 'VERB', 'NOUN', 'PUNCT', 'NOUN', 'PUNCT', 'NOUN', 'CCONJ', 'NOUN', 'PUNCT', 'VERB', 'DET', 'NUM', 'VERB', 'NOUN', 'NOUN', 'VERB', 'ADV', 'VERB', 'NOUN', 'NOUN', 'PUNCT', 'NOUN', 'NOUN', 'ADV', 'VERB', 'PUNCT', 'NOUN', 'ADV', 'VERB', 'NUM', 'NUM', 'ADJ', 'NOUN', 'NOUN', 'PUNCT', 'VERB', 'NOUN', 'NUM', 'NUM', 'PUNCT', 'NOUN', 'NOUN', 'PUNCT', 'VERB', 'NUM', 'NOUN', 'PUNCT', 'NOUN', 'PROPN', 'NOUN', 'NOUN', 'NOUN', 'PUNCT', 'NOUN', 'VERB', 'NOUN', 'PUNCT', 'NOUN', 'PUNCT', 'NOUN', 'CCONJ', 'ADJ', 'NOUN', 'NOUN', 'PUNCT', 'VERB', 'NOUN', 'NOUN', 'PUNCT', 'NOUN', 'NOUN', 'CCONJ', 'VERB', 'NOUN', 'PROPN', 'NOUN', 'VERB', 'NOUN', 'PUNCT', 'NOUN', 'NUM', 'PUNCT', 'ADP', 'NOUN', 'NOUN', 'CCONJ', 'NOUN', 'ADV', 'VERB', 'PUNCT', 'VERB', 'NOUN', 'NOUN', 'VERB', 'NOUN', 'PUNCT', 'NUM', 'NUM', 'NOUN', 'PUNCT', 'ADJ', 'NOUN', 'NOUN', 'PART', 'PUNCT', 'VERB', 'NOUN', 'NOUN', 'NOUN', 'PUNCT', 'NOUN', 'PART', 'NOUN', 'VERB', 'PART', 'NOUN', 'NOUN', 'PUNCT', 'NOUN', 'NOUN', 'CCONJ', 'VERB', 'PROPN', 'NOUN', 'NOUN', 'PROPN', 'NOUN', 'PUNCT', 'VERB', 'PROPN', 'NOUN', 'NOUN', 'CCONJ', 'NOUN', 'NOUN', 'NOUN', 'PUNCT', 'NOUN', 'ADV', 'VERB', 'NOUN', 'NOUN', 'PUNCT', 'ADV', 'VERB', 'ADJ', 'NOUN', 'PUNCT', 'DET', 'NOUN', 'ADV', 'NOUN', 'NUM', 'NUM', 'VERB', 'NOUN', 'NUM', 'NUM', 'NOUN', 'PUNCT', 'VERB', 'NUM', 'NUM', 'NOUN', 'CCONJ', 'VERB', 'PUNCT', 'NOUN', 'VERB', 'VERB', 'NOUN', 'NOUN', 'PUNCT', 'ADJ', 'NOUN', 'PUNCT', 'NOUN', 'PUNCT', 'PROPN', 'PUNCT', 'NOUN', 'NOUN', 'CCONJ', 'NOUN', 'NOUN', 'PUNCT', 'NOUN', 'VERB', 'VERB', 'PUNCT', 'VERB', 'VERB', 'NOUN', 'PUNCT', 'NUM', 'NUM', 'NOUN', 'NOUN', 'PUNCT', 'ADJ', 'NOUN', 'ADJ', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'PART', 'PUNCT', 'ADV', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'PUNCT', 'NOUN', 'VERB', 'NOUN', 'NOUN', 'ADV', 'NOUN', 'VERB', 'PUNCT', 'DET', 'NOUN', 'ADV', 'VERB', 'NUM', 'NUM', 'NOUN', 'PUNCT', 'NOUN', 'NUM', 'NUM', 'VERB', 'NOUN', 'NUM', 'NUM', 'PUNCT', 'NOUN', 'NOUN', 'PUNCT', 'PROPN', 'NOUN', 'VERB', 'PUNCT', 'NUM', 'NUM', 'VERB', 'NUM', 'NUM', 'PUNCT', 'NOUN', 'VERB', 'PUNCT', 'VERB', 'VERB', 'NOUN', 'VERB', 'PUNCT', 'PROPN', 'NOUN', 'VERB', 'NOUN', 'PUNCT', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'NOUN', 'NOUN', 'PUNCT', 'VERB', 'NOUN', 'NOUN', 'PUNCT', 'VERB', 'NOUN', 'NOUN', 'PUNCT', 'DET', 'NOUN', 'ADV', 'VERB', 'NUM', 'ADJ', 'NUM', 'NUM', 'NOUN', 'PUNCT', 'NOUN', 'NUM', 'NOUN', 'NUM', 'CCONJ', 'NOUN', 'NOUN', 'PUNCT', 'VERB', 'NOUN', 'NUM', 'NUM', 'PUNCT', 'ADJ', 'NOUN', 'NOUN', 'PUNCT', 'ADJ', 'PROPN', 'NOUN', 'NOUN', 'CCONJ', 'NUM', 'NUM', 'ADJ', 'NOUN', 'PUNCT', 'NOUN', 'VERB', 'PUNCT', 'VERB', 'NOUN', 'VERB', 'PUNCT', 'ADV', 'NOUN', 'VERB', 'NOUN', 'NOUN', 'PUNCT', 'DET', 'NUM', 'NOUN', 'VERB', 'VERB', 'PUNCT', 'VERB', 'NOUN', 'NUM', 'NUM', 'PUNCT', 'VERB', 'NUM', 'NUM', 'NOUN', 'PUNCT', 'ADJ', 'PROPN', 'NOUN', 'CCONJ', 'PROPN', 'NOUN', 'PUNCT', 'NOUN', 'VERB', 'PUNCT', 'DET', 'NUM', 'NUM', 'NOUN', 'CCONJ', 'ADJ', 'ADJ', 'NOUN', 'NOUN', 'ADV', 'VERB', 'NUM', 'CCONJ', 'NUM', 'NUM', 'NOUN', 'PUNCT', 'NOUN', 'VERB', 'PUNCT', 'NUM', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'ADP', 'NOUN', 'NOUN', 'NOUN', 'VERB', 'PUNCT', 'VERB', 'NOUN', 'VERB', 'VERB', 'ADJ', 'NOUN', 'NOUN', 'NOUN', 'VERB', 'VERB', 'VERB', 'NOUN', 'CCONJ', 'NOUN', 'PUNCT', 'CCONJ', 'ADV', 'NOUN', 'PROPN', 'NOUN', 'NOUN', 'VERB', 'PUNCT', 'NOUN', 'CCONJ', 'NOUN', 'NOUN', 'ADV', 'VERB', 'VERB', 'VERB', 'VERB', 'PUNCT', 'ADP', 'NOUN', 'PART', 'VERB', 'NOUN', 'PUNCT', 'ADV', 'ADV', 'VERB', 'NOUN', 'PUNCT', 'DET', 'NOUN', 'ADV', 'VERB', 'NOUN', 'NOUN', 'PUNCT', 'VERB', 'PROPN', 'ADV', 'VERB', 'VERB', 'ADP', 'ADJ', 'NOUN', 'VERB', 'ADV', 'VERB', 'NOUN', 'NOUN', 'PUNCT', 'VERB', 'VERB', 'NOUN', 'NOUN', 'PUNCT', 'DET', 'NOUN', 'VERB', 'VERB', 'ADP', 'ADP', 'NOUN', 'NUM', 'NOUN', 'NOUN', 'PART', 'PUNCT', 'DET', 'ADV', 'VERB', 'NOUN', 'PROPN', 'PROPN', 'ADV', 'ADV', 'VERB', 'PUNCT', 'VERB', 'PROPN', 'VERB', 'NOUN', 'PUNCT', 'NOUN', 'NOUN', 'PUNCT', 'NUM', 'NOUN', 'CCONJ', 'NUM', 'PROPN', 'NOUN', 'NUM', 'NUM', 'VERB', 'VERB', 'PUNCT', 'VERB', 'PUNCT', 'PUNCT', 'PROPN', 'NOUN', 'NOUN', 'VERB', 'PUNCT', 'VERB', 'PUNCT', 'PUNCT', 'PROPN', 'NOUN', 'ADJ', 'NOUN', 'NOUN', 'VERB', 'PUNCT', 'VERB', 'PUNCT', 'PUNCT']

The POS tagging is made possible by the zh_core_web_sm model. Notice that spaCy uses coarse labels such as NOUN and VERB. By contrast, CKIP Transformers adopts a more fine-grained tagset, such as Nc for locative nouns and Nd for temporal nouns. Here're the POS labels for the same text produced by CKIP Transformers. We'll be using the spaCy's POS tagging to filter out words that we don't want in the candicate pool for keywords.

pos_tags = pos[0]
print(pos_tags)

['Nc', 'Nc', 'Nd', 'Neu', 'Nd', 'Nc', 'Na', 'VC', 'COMMACATEGORY', 'VJ', 'VC', 'Neu', 'Nf', 'Neu', 'Nf', 'A', 'Na', 'COMMACATEGORY', 'VJ', 'P', 'Nd', 'Nd', 'Nd', 'VA', 'COMMACATEGORY', 'VG', 'Na', 'Neu', 'Neu', 'Neu', 'Nf', 'PERIODCATEGORY', 'WHITESPACE', 'Neu', 'Nf', 'Nc', 'COMMACATEGORY', 'D', 'VG', 'Na', 'Nc', 'COMMACATEGORY', 'VJ', 'A', 'Na', 'DASHCATEGORY', 'PARENTHESISCATEGORY', 'A', 'Nc', 'Nc', 'Nc', 'A', 'Nc', 'PARENTHESISCATEGORY', 'COMMACATEGORY', 'Nc', 'P', 'VC', 'Na', 'Nd', 'VHC', 'Nc', 'VC', 'Ng', 'Na', 'COMMACATEGORY', 'P', 'Nv', 'Na', 'Na', 'VC', 'COMMACATEGORY', 'VH', 'VC', 'Nc', 'PAUSECATEGORY', 'Na', 'PAUSECATEGORY', 'Nc', 'Caa', 'Nc', 'COMMACATEGORY', 'VL', 'Nes', 'Nc', 'VJ', 'Na', 'Na', 'Na', 'Caa', 'Na', 'Nv', 'Na', 'COMMACATEGORY', 'Na', 'Na', 'Dfa', 'VHC', 'PERIODCATEGORY', 'Nc', 'D', 'VC', 'Neu', 'Nf', 'VH', 'Na', 'Na', 'COMMACATEGORY', 'Na', 'Da', 'Neu', 'Nf', 'COMMACATEGORY', 'Na', 'VH', 'COMMACATEGORY', 'A', 'VCL', 'Na', 'COMMACATEGORY', 'VK', 'VC', 'VC', 'Na', 'Na', 'COMMACATEGORY', 'Nc', 'V_2', 'Nc', 'PAUSECATEGORY', 'Nc', 'PAUSECATEGORY', 'Nc', 'Caa', 'Nb', 'Na', 'Nc', 'COMMACATEGORY', 'P', 'Na', 'Nc', 'PAUSECATEGORY', 'Nc', 'Nc', 'Caa', 'Nc', 'Nc', 'Da', 'Neu', 'Nf', 'Na', 'COMMACATEGORY', 'Na', 'VH', 'COMMACATEGORY', 'Cbb', 'Na', 'Na', 'Caa', 'Na', 'Na', 'VH', 'COMMACATEGORY', 'D', 'VG', 'Na', 'VC', 'Na', 'PERIODCATEGORY', 'WHITESPACE', 'Neu', 'Nf', 'Nc', 'COMMACATEGORY', 'VCL', 'Nc', 'Nc', 'Ncd', 'COMMACATEGORY', 'VJ', 'Na', 'Nc', 'Nc', 'COMMACATEGORY', 'VC', 'Ng', 'VJ', 'VH', 'DE', 'Na', 'Na', 'PAUSECATEGORY', 'Nc', 'Na', 'Caa', 'VH', 'Nc', 'VH', 'Na', 'Na', 'Na', 'COMMACATEGORY', 'VJ', 'A', 'VA', 'Na', 'Caa', 'Na', 'VH', 'Na', 'COMMACATEGORY', 'Nc', 'D', 'VC', 'Na', 'Nv', 'COMMACATEGORY', 'D', 'VC', 'VD', 'Na', 'COMMACATEGORY', 'Nes', 'Nd', 'D', 'VC', 'Neu', 'Nf', 'Na', 'Da', 'Neu', 'Nf', 'Na', 'COMMACATEGORY', 'P', 'Neu', 'Nf', 'Nc', 'Caa', 'Nc', 'COMMACATEGORY', 'Nc', 'V_2', 'Nc', 'Na', 'Nc', 'PAUSECATEGORY', 'Nb', 'Nc', 'PAUSECATEGORY', 'Na', 'PAUSECATEGORY', 'Na', 'PAUSECATEGORY', 'Nb', 'Nc', 'Caa', 'Nc', 'Nc', 'COMMACATEGORY', 'Na', 'Na', 'VH', 'COMMACATEGORY', 'Na', 'VF', 'VA', 'EXCLAMATIONCATEGORY', 'WHITESPACE', 'Neu', 'Nf', 'Na', 'Nc', 'COMMACATEGORY', 'VCL', 'Nc', 'VH', 'Na', 'Nc', 'Nc', 'Nf', 'Ncd', 'COMMACATEGORY', 'VJ', 'Nc', 'Nc', 'Na', 'Na', 'COMMACATEGORY', 'Nd', 'VC', 'Na', 'Nv', 'D', 'VH', 'VC', 'PERIODCATEGORY', 'Nes', 'Nd', 'D', 'VC', 'Neu', 'Nf', 'Na', 'COMMACATEGORY', 'Nep', 'Neu', 'Nf', 'Na', 'Da', 'Neu', 'Nf', 'COMMACATEGORY', 'P', 'Nc', 'COMMACATEGORY', 'VH', 'Na', 'VC', 'SEMICOLONCATEGORY', 'Neu', 'Nf', 'Na', 'Neu', 'Nf', 'COMMACATEGORY', 'VCL', 'Nc', 'COMMACATEGORY', 'A', 'Na', 'Neu', 'VH', 'PERIODCATEGORY', 'WHITESPACE', 'Nc', 'Nc', 'Na', 'VC', 'COMMACATEGORY', 'VG', 'Ncd', 'Nc', 'A', 'Na', 'Nc', 'COMMACATEGORY', 'A', 'Nv', 'Na', 'COMMACATEGORY', 'VJ', 'Na', 'VCL', 'COMMACATEGORY', 'Nes', 'Nd', 'D', 'VC', 'Neu', 'Nf', 'Neu', 'Nf', 'Na', 'COMMACATEGORY', 'Nep', 'Neu', 'Nf', 'Neu', 'Nf', 'Nc', 'Na', 'COMMACATEGORY', 'Na', 'Da', 'Neu', 'Nf', 'COMMACATEGORY', 'VCL', 'Nc', 'Nc', 'COMMACATEGORY', 'VH', 'Nc', 'Nc', 'Na', 'Caa', 'Neu', 'Nf', 'Nb', 'Nc', 'COMMACATEGORY', 'Na', 'VH', 'COMMACATEGORY', 'Nv', 'Na', 'VH', 'COMMACATEGORY', 'VH', 'VC', 'A', 'Na', 'Na', 'SEMICOLONCATEGORY', 'Nes', 'Neu', 'Nf', 'VCL', 'Na', 'COMMACATEGORY', 'Na', 'Da', 'Neu', 'Nf', 'COMMACATEGORY', 'P', 'Neu', 'Nf', 'Nc', 'COMMACATEGORY', 'VH', 'Nc', 'Nc', 'Caa', 'Nb', 'Nc', 'COMMACATEGORY', 'Na', 'VH', 'PERIODCATEGORY', 'WHITESPACE', 'Cbb', 'Neu', 'Nf', 'Nc', 'Caa', 'Nes', 'Nc', 'Na', 'Nc', 'D', 'VC', 'Neu', 'Caa', 'Neu', 'Nf', 'Na', 'COMMACATEGORY', 'Na', 'VH', 'PERIODCATEGORY', 'WHITESPACE', 'Neu', 'Nd', 'Na', 'VC', 'Na', 'P', 'Nd', 'Nd', 'Nd', 'VE', 'COMMACATEGORY', 'VC', 'Nh', 'D', 'VCL', 'Nc', 'Na', 'Nv', 'Na', 'Nc', 'VD', 'Nv', 'Na', 'Caa', 'Na', 'COMMACATEGORY', 'Caa', 'VH', 'VA', 'Nc', 'Na', 'Nb', 'Nc', 'PAUSECATEGORY', 'Nc', 'Caa', 'Na', 'VC', 'Nc', 'VE', 'VC', 'VH', 'Na', 'COMMACATEGORY', 'P', 'Na', 'Ng', 'VC', 'VA', 'COMMACATEGORY', 'Cbb', 'D', 'VE', 'Na', 'COMMACATEGORY', 'Nes', 'Na', 'D', 'VC', 'Na', 'Na', 'COMMACATEGORY', 'VA', 'Nh', 'VF', 'VK', 'D', 'P', 'A', 'Na', 'VA', 'Cbb', 'VA', 'VH', 'VH', 'PERIODCATEGORY', 'WHITESPACE', 'P', 'VC', 'VA', 'Na', 'COMMACATEGORY', 'Nes', 'Nd', 'Nv', 'Na', 'P', 'P', 'Nc', 'Neu', 'Nc', 'VC', 'Ng', 'COMMACATEGORY', 'Cbb', 'D', 'P', 'Nc', 'FW', 'Na', 'Na', 'VH', 'D', 'COMMACATEGORY', 'VF', 'Nh', 'D', 'VC', 'PERIODCATEGORY', 'WHITESPACE', 'VE', 'Na', 'COLONCATEGORY', 'FW', 'Neu', 'Caa', 'FW', 'WHITESPACE', 'WHITESPACE', 'Nc', 'Na', 'Nb', 'Nf', 'VH', 'Nc', 'PARENTHESISCATEGORY', 'Na', 'COLONCATEGORY', 'WHITESPACE', 'PARENTHESISCATEGORY', 'WHITESPACE', 'WHITESPACE', 'Nc', 'Na', 'Nc', 'Nc', 'PARENTHESISCATEGORY', 'Na', 'COLONCATEGORY', 'WHITESPACE', 'PARENTHESISCATEGORY', 'WHITESPACE', 'WHITESPACE', 'Nc', 'Na', 'Nc', 'Na', 'VC', 'Nc', 'PARENTHESISCATEGORY', 'Na', 'COLONCATEGORY', 'WHITESPACE', 'PARENTHESISCATEGORY']

Convert stopwords in `spaCy` from simplified to Taiwanese traditional

spaCy comes with a built-in set of stopwords (basically words that we'd like to ignore), accessible via spacy.lang.zh.stop_words. To make good use of it, let's convert all the words from simplified characters to traditional ones with the help of OpenCC.

!pip install OpenCC
import opencc

OpenCC does not just convert characters mechanically. It has the ability to convert words from simplified characters to their equivalent phrasing in Taiwan Mandarin, which is done by s2twp.json.

from spacy.lang.zh.stop_words import STOP_WORDS
converter = opencc.OpenCC('s2twp.json')
spacy_stopwords_sim = list(STOP_WORDS)
print(spacy_stopwords_sim[:5])
spacy_stopwords_tra = [converter.convert(w) for w in spacy_stopwords_sim]
print(spacy_stopwords_tra[:5])

['因为', '奇', '嘿嘿', '其次', '偏偏']
['因為', '奇', '嘿嘿', '其次', '偏偏']

Define a class for implementing TextRank

If you're dealing with English texts, you can implement TextRank quite easily with textaCy, the tagline of which is NLP, before and after spaCy. But I couldn't get it to work for Chinese texts, so I had to implement TextRank from scratch. Luckily, I got a jump-start from this gist, which offers a blueprint for the following definitions.

from collections import OrderedDict
import numpy as np

class TextRank4Keyword():
    """Extract keywords from text"""
    
    def __init__(self):
        self.d = 0.85 # damping coefficient, usually is .85
        self.min_diff = 1e-5 # convergence threshold
        self.steps = 10 # iteration steps
        self.node_weight = None # save keywords and its weight

    def set_stopwords(self, custom_stopwords):  
        """Set stop words"""
        for word in set(spacy_stopwords_tra).union(set(custom_stopwords)):
            lexeme = nlp.vocab[word]
            lexeme.is_stop = True
    
    def sentence_segment(self, doc, candidate_pos, lower):
        """Store those words only in cadidate_pos"""
        sentences = []
        for sent in doc.sents:
            selected_words = []
            for token in sent:
                # Store words only with cadidate POS tag
                if token.pos_ in candidate_pos and token.is_stop is False:
                    if lower is True:
                        selected_words.append(token.text.lower())
                    else:
                        selected_words.append(token.text)
            sentences.append(selected_words)
        return sentences
        
    def get_vocab(self, sentences):
        """Get all tokens"""
        vocab = OrderedDict()
        i = 0
        for sentence in sentences:
            for word in sentence:
                if word not in vocab:
                    vocab[word] = i
                    i += 1
        return vocab
    
    def get_token_pairs(self, window_size, sentences):
        """Build token_pairs from windows in sentences"""
        token_pairs = list()
        for sentence in sentences:
            for i, word in enumerate(sentence):
                for j in range(i+1, i+window_size):
                    if j >= len(sentence):
                        break
                    pair = (word, sentence[j])
                    if pair not in token_pairs:
                        token_pairs.append(pair)
        return token_pairs
        
    def symmetrize(self, a):
        return a + a.T - np.diag(a.diagonal())
    
    def get_matrix(self, vocab, token_pairs):
        """Get normalized matrix"""
        # Build matrix
        vocab_size = len(vocab)
        g = np.zeros((vocab_size, vocab_size), dtype='float')
        for word1, word2 in token_pairs:
            i, j = vocab[word1], vocab[word2]
            g[i][j] = 1
            
        # Get Symmeric matrix
        g = self.symmetrize(g)
        
        # Normalize matrix by column
        norm = np.sum(g, axis=0)
        g_norm = np.divide(g, norm, where=norm!=0) # this is to ignore the 0 element in norm
        
        return g_norm
    
    # I revised this function to return keywords as a list
    def get_keywords(self, number=10):
        """Print top number keywords"""
        node_weight = OrderedDict(sorted(self.node_weight.items(), key=lambda t: t[1], reverse=True))
        keywords = []
        for i, (key, value) in enumerate(node_weight.items()):
            keywords.append(key)
            if i > number:
                break
        return keywords

    def analyze(self, text, 
                candidate_pos=['NOUN', 'VERB'], 
                window_size=5, lower=False, stopwords=list()):
        """Main function to analyze text"""
        
        # Set stop words
        self.set_stopwords(stopwords)

        # Pare text with spaCy
        doc = nlp(token_str)
        
        # Filter sentences
        sentences = self.sentence_segment(doc, candidate_pos, lower) # list of list of words
        
        # Build vocabulary
        vocab = self.get_vocab(sentences)
        
        # Get token_pairs from windows
        token_pairs = self.get_token_pairs(window_size, sentences)
        
        # Get normalized matrix
        g = self.get_matrix(vocab, token_pairs)
        
        # Initionlization for weight(pagerank value)
        pr = np.array([1] * len(vocab))
        
        # Iteration
        previous_pr = 0
        for epoch in range(self.steps):
            pr = (1-self.d) + self.d * np.dot(g, pr)
            if abs(previous_pr - sum(pr))  < self.min_diff:
                break
            else:
                previous_pr = sum(pr)

        # Get weight for each node
        node_weight = dict()
        for word, index in vocab.items():
            node_weight[word] = pr[index]
        
        self.node_weight = node_weight

Now we can create an instace of the TextRank4Keyword class and call the set_stopwords function with our CUSTOM_STOPWORDS variable. This created a set of stopwords resulting from the union of both our custom stopwords and spaCy's built-in stopwords. And only words that meet these two criteria would become candidates for keywords:

they are not in the set of stopwords;
their POS labels are one of those listed in candidate_pos, which includes NOUN and VERB by default.

tr4w = TextRank4Keyword()
tr4w.set_stopwords(CUSTOM_STOPWORDS)

Put it together

Let's put it all together by defining a main function for keyword extraction.

def extract_keys_from_str(raw_text):
  text = clean_all(raw_text) #clean the raw text
  ws  = ws_driver([text]) #tokenize the text with CKIP Transformers
  tokenized_text = " ".join(ws[0]) #join a list into a string 
  tr4w.analyze(tokenized_text) #create a spaCy Doc object with the string and calculate weights for words
  keys = tr4w.get_keywords(KW_NUM) #get top 10 keywords, as set by the KW_NUM variable
  return keys

Here're the top ten keywords for our sample text. The results are quite satisfactory.

keys = extract_keys_from_str(raw_text)
keys = [k for k in keys if len(k) > 1]
keys

Tokenization: 100%|██████████| 1/1 [00:00<00:00, 221.73it/s]
Inference: 100%|██████████| 1/1 [00:05<00:00,  5.20s/it]

['土地', '公園', '地政局', '文化', '推出', '面積', '標售', '道路', '優質', '投標']

As a comparison, here're the top 10 keywords produced by Jieba's implementation of TextRank, 7 of which are identical to the list above. Although extracting keywords with Jieba is quick and easy, it tends to give rise to wrongly segmented tokens, such as 政局 in this example, which should have been 地政局 for Land Administration Bureau.

import jieba.analyse as KE
jieba_kw = KE.textrank(text, topK=10)
jieba_kw

['土地', '政局', '投標', '公園', '投資', '標售', '文化', '開發', '優質', '推出']

Other libraries that failed

`textaCy`

!pip install textacy

With textaCy, you can load a spaCy language model and then create a spaCy Doc object using that model.

import textacy
zh = textacy.load_spacy_lang("zh_core_web_sm")
doc = textacy.make_spacy_doc(text, lang=zh)
doc._.preview

'Doc(612 tokens: "市府地政局109年度第4季開發區土地標售，共計推出8標9筆優質建地，訂於109年12月16日開...")'

textaCy implements four algorithms for keyword extraction, including TextRank. But I got useless results by calling the textacy.ke.textrank function with doc.

import textacy.ke as ke
ke.textrank(doc)

[('     ', 6.0)]

`pyate`

!pip install pyate

pyate has a built-in TermExtractionPipeline class for extracting keywords, which can be added to spaCy's pipeline. But it didn't work and this error message showed up: TypeError: load() got an unexpected keyword argument 'parser'.

from pyate.term_extraction_pipeline import TermExtractionPipeline
nlp.add_pipe(TermExtractionPipeline())

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-f5a8398fbc3b> in <module>()
      1 #collapse-output
----> 2 from pyate.term_extraction_pipeline import TermExtractionPipeline
      3 nlp.add_pipe(TermExtractionPipeline())

/usr/local/lib/python3.6/dist-packages/pyate/__init__.py in <module>()
----> 1 from .term_extraction import TermExtraction, add_term_extraction_method
      2 from .basic import basic
      3 from .combo_basic import combo_basic
      4 from .cvalues import cvalues
      5 from .term_extractor import term_extractor

/usr/local/lib/python3.6/dist-packages/pyate/term_extraction.py in <module>()
     20 
     21 
---> 22 class TermExtraction:
     23     # TODO: find some way to prevent redundant loading of csv files
     24     nlp = spacy.load("en_core_web_sm", parser=False, entity=False)

/usr/local/lib/python3.6/dist-packages/pyate/term_extraction.py in TermExtraction()
     22 class TermExtraction:
     23     # TODO: find some way to prevent redundant loading of csv files
---> 24     nlp = spacy.load("en_core_web_sm", parser=False, entity=False)
     25     matcher = Matcher(nlp.vocab)
     26     language = "en"

TypeError: load() got an unexpected keyword argument 'parser'

I found on the documentation page that pyate only supports English and Italian, which may account for the error I got.

`pytextrank`

!pip install pytextrank

To add TextRank to the spaCy pipeline, I followed the instructions found on spaCy's documentation. But an error popped up. Luckily, ValueError offers possible ways to fix the problem.

import pytextrank
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name='textrank', last=True)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-29-cd319957f3b6> in <module>()
      2 import pytextrank
      3 tr = pytextrank.TextRank()
----> 4 nlp.add_pipe(tr.PipelineComponent, name='textrank', last=True)

/usr/local/lib/python3.6/dist-packages/spacy/language.py in add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    746         batch_size=1000,
    747         disable=[],
--> 748         cleanup=False,
    749         component_cfg=None,
    750         n_process=1,

ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <bound method TextRank.PipelineComponent of <pytextrank.pytextrank.TextRank object at 0x7f4fea403550>> (name: 'textrank').

- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.

- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.

- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.

So I used the @Language.factory decorator to define a TextRank component, and then called the nlp.add_pipe function with textrank. But this didn't work either. The error message reads: 'Chinese' object has no attribute 'sents'.

from spacy.language import Language

tr = pytextrank.TextRank()

@Language.factory("textrank")
def create_textrank_component(nlp: Language, name: str):
    return tr.PipelineComponent(nlp)

nlp.add_pipe('textrank')

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-1c84bbe50472> in <module>()
      1 #collapse-output
----> 2 nlp.add_pipe('textrank')

/usr/local/lib/python3.6/dist-packages/spacy/language.py in add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    770         if is_python2 and n_process != 1:
    771             warnings.warn(Warnings.W023)
--> 772             n_process = 1
    773         if n_threads != -1:
    774             warnings.warn(Warnings.W016, DeprecationWarning)

/usr/local/lib/python3.6/dist-packages/spacy/language.py in create_pipe(self, factory_name, name, config, raw_config, validate)
    656         link_vectors_to_models(self.vocab)
    657         if self.vocab.vectors.data.shape[1]:
--> 658             cfg["pretrained_vectors"] = self.vocab.vectors.name
    659         if sgd is None:
    660             sgd = create_default_optimizer(Model.ops)

/usr/local/lib/python3.6/dist-packages/thinc/config.py in resolve(cls, config, schema, overrides, validate)

/usr/local/lib/python3.6/dist-packages/thinc/config.py in _make(cls, config, schema, overrides, resolve, validate)

/usr/local/lib/python3.6/dist-packages/thinc/config.py in _fill(cls, config, schema, validate, resolve, parent, overrides)

<ipython-input-30-fb02aff6bab9> in create_textrank_component(nlp, name)
      5 @Language.factory("textrank")
      6 def create_textrank_component(nlp: Language, name: str):
----> 7     return tr.PipelineComponent(nlp)

/usr/local/lib/python3.6/dist-packages/pytextrank/pytextrank.py in PipelineComponent(self, doc)
    559         Doc.set_extension("phrases", force=True, default=[])
    560         Doc.set_extension("textrank", force=True, default=self)
--> 561         doc._.phrases = self.calc_textrank()
    562 
    563         return doc

/usr/local/lib/python3.6/dist-packages/pytextrank/pytextrank.py in calc_textrank(self)
    375         t0 = time.time()
    376 
--> 377         for sent in self.doc.sents:
    378             self.link_sentence(sent)
    379 

AttributeError: 'Chinese' object has no attribute 'sents'

`rake-spacy`

I couldn't even install rake-spacy.

!pip install rake-spacy

ERROR: Could not find a version that satisfies the requirement rake-spacy
ERROR: No matching distribution found for rake-spacy

`rake-keyword`

!pip install rake-keyword

According to the documentation on PYPI, the import is done by from rake import Rake, but it didn't work.

from rake import Rake

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-34-fe043f886018> in <module>()
      1 #collapse-output
----> 2 from rake import Rake

ImportError: cannot import name 'Rake'

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------

However, based on the documentation on GitHub, this is done by from rake import RAKE instead. But it didn't work either.

from rake import RAKE

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-35-59e63adf01ef> in <module>()
      1 #collapse-output
----> 2 from rake import RAKE

ImportError: cannot import name 'RAKE'

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------

Recap

Integration of CKIP Transformers with spaCy and the TextRank algorithm generates decent results for extracting keywords from texts in traditional Chinese. Although there are many Python libraries out there that implement TextRank, none of them works better than the TextRank4Keyword class crafted from scratch. Until I figure out how to properly add the TextRank component to the spaCy pipeline, I'll stick with my working pipeline shown here. As a final thought, spaCy recently released v3.0, which supports pretrained transformer models. I can't wait to give it a try and see how this would change the workflow of extracting keywords or other NLP tasks. But that'll have to wait until next post.

Intro

Working pipeline

Set variables

Preprocess texts

Install spacy and ckip-transformers

Tokenize texts with ckip-transformers

Feed tokenized results to spacy using WhitespaceTokenizer

Convert stopwords in spaCy from simplified to Taiwanese traditional