Adding a custom tokenizer to spaCy and extracting keywords
This post shows how to plug a custom tokenizer into spaCy and get decent results for extracting keywords from texts in traditional Chinese.
spaCy is an industrial-strength natural language processing library in Python that supports multiple human languages, including Chinese. For segmenting Chinese texts into words, spaCy uses Jieba or PKUSeg under the hood. However, neither of them beats CKIP Transformers in accuracy when it comes to traditional Chinese (see my previous post for a comparison). So I'll show how to plug CKIP Transformers into spaCy to get the best of both.
For the purpose of demonstration, I'll situate this integration in a pipeline for extracting keywords from texts. Compared with other NLP tasks, keyword extraction is a relatively easy job. TextRank and RAKE seem to be among the most widely adopted algorithms for keyword extraction. I tried most of the methods mentioned in this article, but there doesn't seem to be any easy-peasy implementation of TextRank or RAKE that produces decent results for traditional Chinese texts. So the first part of this post walks through a pipeline that actually works, and the second part records other methods that failed. I included the second part because I believe in this quote:
“We learn wisdom from failure much more than from success. We often discover what will do, by finding out what will not do; and probably he who never made a mistake never made a discovery.” ― Samuel Smiles
You might assume that Page as in PageRank refers to webpages, but it turns out to be the family name of Larry Page, the creator of PageRank.
Let's start by defining two variables that users of our keyword extraction program might want to modify: CUSTOM_STOPWORDS for a list of words that users definitely want to exclude from keyword candidates, and KW_NUM for the number of keywords they'd like to extract from a document.
CUSTOM_STOPWORDS = [
    "民眾", "朋友", "市民", "人數", "全民", "人員", "人士", "里民",
    "影本", "系統", "項目", "證件", "資格", "公民", "對象", "個人",
]
KW_NUM = 10
I took an announcement from the Land Administration Bureau of Kaohsiung City Government as a sample text, but you can basically take any text in traditional Chinese to test the program.
- Click on Open in Colab at the upper right corner of this page.
- Click on File and then Save a copy in Drive.
- Replace the following text with your own text.
- Click on Runtime and then Run all.
- Go to the section Put it together to see the outcome.
raw_text = '''
市府地政局109年度第4季開發區土地標售,共計推出8標9筆優質建地,訂於109年12月16日開標,合計總底價12 億4049萬6164 元。
第93期重劃區,原為國軍眷村,緊鄰國定古蹟-「原日本海軍鳳山無線電信所」,市府為保存古蹟同時活化眷村遷移後土地,以重劃方式整體開發,新闢住宅區、道路、公園及停車場,使本區具有歷史文化內涵與綠色休閒特色,生活機能更加健全。地政局首次推出1筆大面積土地,面積約2160坪,地形方整,雙面臨路,利於規劃興建景觀大樓,附近有市場、學校、公園及大東文化園區,距捷運大東站、鳳山國中站及鳳山火車站僅數分鐘車程,交通四通八達,因土地稀少性及區位條件絕佳,勢必成為投資人追逐焦點。
第87期重劃區,位於省道台1線旁,鄰近捷運南岡山站,重劃後擁有完善的道路系統、公園綠地及毗鄰醒村懷舊文化景觀建築群,具備優質居住環境及交通便捷要件,地政局一推出土地標售,即掀起搶標熱潮,本季再釋出1筆面積約93坪土地,臨20米介壽路及鵬程東路,附近有岡山文化中心、兆湘國小、公13、公14、陽明公園及劉厝公園,區位條件佳,投資人準備搶進!
第77期市地重劃區,位於鳳山區快速道路省道台88線旁,近中山高五甲系統交流道,近年推出土地標售皆順利完銷。本季再推出2筆土地,其中1筆面積約526坪,臨保華一路,適合商業使用;1筆面積107坪,位於代德三街,自用投資兩相宜。
高雄大學區段徵收區,為北高雄優質文教特區,優質居住環境,吸引投資人進駐,本季再推出2標2筆土地,其中1筆第三種商業區土地,面積約639坪,位於大學26街,近高雄大學正門及萬坪藍田公園,地形方正,使用強度高,適合興建優質住宅大樓;另1筆住三用地,面積約379坪,臨28米藍昌路,近高雄大學及中山高中,交通便捷。
另第37期重劃區及前大寮農地重劃區各推出1至2筆土地,價格合理。
第4季土地標售作業於109年12月1日公告,投資大眾可前往地政局土地開發處土地處分科索取標售海報及標單,或直接上網高雄房地產億年旺網站、地政局及土地開發處網站查詢下載相關資料,在期限前完成投標,另再提醒投標人,本年度已更新投標單格式,投標大眾請注意應以新式投標單投標以免投標無效作廢。
為配合防疫需求,本季開標作業除於地政局第一會議室辦理外,另將於地政局Facebook粉絲專頁同步直播,請大眾多加利用。
洽詢專線:(07)3373451或(07)3314942
高雄房地產億年旺網站(網址:http://eland.kcg.gov.tw/)
高雄市政府地政局網站(網址:http://landp.kcg.gov.tw/)
高雄市政府地政局土地開發處網站(網址:http://landevp.kcg.gov.tw/)
'''
raw_text[-300:]
I find the lightweight library nlp2 quite handy for text cleaning. Its clean_all function removes URL links, HTML elements, and unused tags. The developer of nlp2 also maintains other useful NLP tools such as NLPrep, TFkit, and nlp2go.
!pip install nlp2
from nlp2 import clean_all
After cleaning, our sample text looks like this. Notice that all the URL links are gone now.
text = clean_all(raw_text)
text[-300:]
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download zh_core_web_sm
!pip install -U ckip-transformers
Let's create a driver for word segmentation and one for parts of speech. CKIP Transformers also has a built-in driver for named entity recognition, i.e. CkipNerChunker
. But we won't use it here.
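In case you're curious, below is a minimal sketch of what the NER driver looks like, run on the cleaned sample text from above; its output isn't used anywhere else in this post.

from ckip_transformers.nlp import CkipNerChunker

# Optional: named-entity recognition with CKIP Transformers. This is only an
# illustration and is not part of the keyword extraction pipeline below.
ner_driver = CkipNerChunker()
ner = ner_driver([text])
print(ner[0][:5])  # first few recognized entities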
If you'd rather run the model on the CPU, you can initialize ws_driver this way instead: ws_driver = CkipWordSegmenter(device=-1)
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger
ws_driver = CkipWordSegmenter()
pos_driver = CkipPosTagger()
Make sure the input to ws_driver() is a list, even if you're only dealing with a single text. Otherwise, words won't be properly segmented. Notice that the input to pos_driver() is the output of ws_driver().
ws = ws_driver([text])
pos = pos_driver(ws)
Here're the segmented tokens.
tokens = ws[0]
print(tokens)
By contrast, Jieba produced lots of wrongly segmented tokens, which is precisely why we prefer CKIP Transformers.
import jieba
print(list(jieba.cut(text)))
The official website of spaCy describes several ways of adding a custom tokenizer. The simplest is to define a WhitespaceTokenizer class, which tokenizes a text on space characters. The output of tokenization can then be fed into subsequent operations down the pipeline, including tagger for parts-of-speech (POS) tagging, parser for dependency parsing, and ner for named entity recognition. This is possible primarily because the tokenizer creates a Doc object and the other three steps operate on that Doc object, as illustrated in the pipeline diagram in spaCy's documentation.
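If you'd like to see those components for yourself, you can list them once the Chinese model is installed. This is just a quick sketch; the exact component names vary with the model version.

import spacy

# The components that run after tokenization in the Chinese model; names
# depend on the model version, but the tagger, parser, and ner steps are among them.
nlp_demo = spacy.load("zh_core_web_sm")
print(nlp_demo.pipe_names)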
In spaCy's example, words is defined as words = text.split(" "), but that caused an error with my text, so I revised it to words = text.strip().split().
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Split on any whitespace and drop leading/trailing spaces
        words = text.strip().split()
        return Doc(self.vocab, words=words)
Next, let's load the zh_core_web_sm model for Chinese, which we'll need for POS tagging. Then comes the crucial part: nlp.tokenizer = WhitespaceTokenizer(nlp.vocab). This line of code replaces the default Jieba-based tokenizer with the WhitespaceTokenizer we just defined above.
import spacy
nlp = spacy.load('zh_core_web_sm')
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
Then we join the tokenized result from CKIP Transformers into a single string of space-separated tokens.
token_str = " ".join(tokens)
token_str
Next, we feed token_str
, our tokenized text, to nlp
to create a spaCy Doc
object. From this point on, we are able to leverage the power of spaCy. For every token in a Doc
object, we have access to its text via the attribute .text
and its parts-of-speech label via the attribute .pos_
.
doc = nlp(token_str)
print([token.text for token in doc])
print([token.pos_ for token in doc])
The POS tagging is made possible by the zh_core_web_sm model. Notice that spaCy uses coarse labels such as NOUN and VERB. By contrast, CKIP Transformers adopts a more fine-grained tagset, such as Nc for locative nouns and Nd for temporal nouns. Here're the POS labels for the same text produced by CKIP Transformers. We'll be using spaCy's POS tagging to filter out words that we don't want in the candidate pool for keywords.
pos_tags = pos[0]
print(pos_tags)
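For a side-by-side look at the two tagsets, here's a quick sketch that pairs each spaCy token with its CKIP tag. The two sequences line up because the Doc was built from the whitespace-joined CKIP tokens.

# Compare spaCy's coarse POS labels with CKIP's fine-grained tags for the
# first ten tokens.
for token, ckip_tag in list(zip(doc, pos_tags))[:10]:
    print(token.text, token.pos_, ckip_tag, sep="\t")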
spaCy comes with a built-in set of stopwords (basically words that we'd like to ignore), accessible via spacy.lang.zh.stop_words
. To make good use of it, let's convert all the words from simplified characters to traditional ones with the help of OpenCC
.
!pip install OpenCC
import opencc
OpenCC does not just convert characters mechanically. It can also convert words from simplified characters to their equivalent phrasing in Taiwan Mandarin, which is what the s2twp.json configuration does.
from spacy.lang.zh.stop_words import STOP_WORDS
converter = opencc.OpenCC('s2twp.json')
spacy_stopwords_sim = list(STOP_WORDS)
print(spacy_stopwords_sim[:5])
spacy_stopwords_tra = [converter.convert(w) for w in spacy_stopwords_sim]
print(spacy_stopwords_tra[:5])
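To see what s2twp.json adds over a purely mechanical character conversion, here's a quick check with a word outside our stopword list: the simplified 软件 ("software") becomes the Taiwan term 軟體 rather than the character-by-character 軟件.

# s2twp converts both the characters and the phrasing: 软件 -> 軟體
print(converter.convert("软件"))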
If you're dealing with English texts, you can implement TextRank quite easily with textaCy
, the tagline of which is NLP, before and after spaCy
. But I couldn't get it to work for Chinese texts, so I had to implement TextRank from scratch. Luckily, I got a jump-start from this gist, which offers a blueprint for the following definitions.
from collections import OrderedDict
import numpy as np
class TextRank4Keyword():
"""Extract keywords from text"""
def __init__(self):
self.d = 0.85 # damping coefficient, usually is .85
self.min_diff = 1e-5 # convergence threshold
self.steps = 10 # iteration steps
        self.node_weight = None  # save keywords and their weights
def set_stopwords(self, custom_stopwords):
"""Set stop words"""
for word in set(spacy_stopwords_tra).union(set(custom_stopwords)):
lexeme = nlp.vocab[word]
lexeme.is_stop = True
    def sentence_segment(self, doc, candidate_pos, lower):
        """Store only those words whose POS is in candidate_pos"""
sentences = []
for sent in doc.sents:
selected_words = []
for token in sent:
                # Store words only with candidate POS tags
if token.pos_ in candidate_pos and token.is_stop is False:
if lower is True:
selected_words.append(token.text.lower())
else:
selected_words.append(token.text)
sentences.append(selected_words)
return sentences
def get_vocab(self, sentences):
"""Get all tokens"""
vocab = OrderedDict()
i = 0
for sentence in sentences:
for word in sentence:
if word not in vocab:
vocab[word] = i
i += 1
return vocab
def get_token_pairs(self, window_size, sentences):
"""Build token_pairs from windows in sentences"""
token_pairs = list()
for sentence in sentences:
for i, word in enumerate(sentence):
for j in range(i+1, i+window_size):
if j >= len(sentence):
break
pair = (word, sentence[j])
if pair not in token_pairs:
token_pairs.append(pair)
return token_pairs
def symmetrize(self, a):
return a + a.T - np.diag(a.diagonal())
def get_matrix(self, vocab, token_pairs):
"""Get normalized matrix"""
# Build matrix
vocab_size = len(vocab)
g = np.zeros((vocab_size, vocab_size), dtype='float')
for word1, word2 in token_pairs:
i, j = vocab[word1], vocab[word2]
g[i][j] = 1
        # Get symmetric matrix
g = self.symmetrize(g)
# Normalize matrix by column
norm = np.sum(g, axis=0)
g_norm = np.divide(g, norm, where=norm!=0) # this is to ignore the 0 element in norm
return g_norm
# I revised this function to return keywords as a list
    def get_keywords(self, number=10):
        """Return the top `number` keywords as a list"""
        node_weight = OrderedDict(sorted(self.node_weight.items(), key=lambda t: t[1], reverse=True))
        keywords = []
        for key in node_weight:
            keywords.append(key)
            if len(keywords) >= number:
                break
        return keywords
def analyze(self, text,
candidate_pos=['NOUN', 'VERB'],
window_size=5, lower=False, stopwords=list()):
"""Main function to analyze text"""
# Set stop words
self.set_stopwords(stopwords)
        # Parse the tokenized text with spaCy
        doc = nlp(text)
# Filter sentences
sentences = self.sentence_segment(doc, candidate_pos, lower) # list of list of words
# Build vocabulary
vocab = self.get_vocab(sentences)
# Get token_pairs from windows
token_pairs = self.get_token_pairs(window_size, sentences)
# Get normalized matrix
g = self.get_matrix(vocab, token_pairs)
        # Initialization for weights (PageRank values)
pr = np.array([1] * len(vocab))
# Iteration
previous_pr = 0
for epoch in range(self.steps):
pr = (1-self.d) + self.d * np.dot(g, pr)
if abs(previous_pr - sum(pr)) < self.min_diff:
break
else:
previous_pr = sum(pr)
# Get weight for each node
node_weight = dict()
for word, index in vocab.items():
node_weight[word] = pr[index]
self.node_weight = node_weight
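To make the update rule in analyze() a bit more concrete, here's a toy run of the same iteration on a three-word chain A-B-C. This is purely illustrative and not part of the pipeline: the middle word, which co-occurs with both neighbours, ends up with the highest score.

import numpy as np

# Column-normalized co-occurrence matrix for a chain A-B-C (A and C each
# co-occur only with B), followed by the same damped update used in analyze().
g = np.array([[0.0, 0.5, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
pr = np.ones(3)
for _ in range(10):
    pr = (1 - 0.85) + 0.85 * np.dot(g, pr)
print(pr)  # B (the middle word) gets the highest weight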
Now we can create an instance of the TextRank4Keyword class and call the set_stopwords function with our CUSTOM_STOPWORDS variable. This creates a set of stopwords from the union of our custom stopwords and spaCy's built-in stopwords. Only words that meet both of these criteria will become candidates for keywords:
- they are not in the set of stopwords;
- their POS labels are among those listed in candidate_pos, which includes NOUN and VERB by default.
tr4w = TextRank4Keyword()
tr4w.set_stopwords(CUSTOM_STOPWORDS)
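To see the effect of these two criteria on our sample Doc, here's a quick sketch that applies the same filter outside the class, assuming the doc object from above and the default candidate_pos.

# Tokens that survive both filters: not a stopword and tagged NOUN or VERB.
candidates = [t.text for t in doc if t.pos_ in ("NOUN", "VERB") and not t.is_stop]
print(candidates[:20])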
Let's put it all together by defining a main function for keyword extraction.
def extract_keys_from_str(raw_text):
text = clean_all(raw_text) #clean the raw text
ws = ws_driver([text]) #tokenize the text with CKIP Transformers
tokenized_text = " ".join(ws[0]) #join a list into a string
tr4w.analyze(tokenized_text) #create a spaCy Doc object with the string and calculate weights for words
keys = tr4w.get_keywords(KW_NUM) #get top 10 keywords, as set by the KW_NUM variable
return keys
Here're the top keywords for our sample text, after filtering out single-character tokens. The results are quite satisfactory.
keys = extract_keys_from_str(raw_text)
keys = [k for k in keys if len(k) > 1]
keys
As a comparison, here're the top 10 keywords produced by Jieba's implementation of TextRank, 7 of which are identical to the list above. Although extracting keywords with Jieba is quick and easy, it tends to give rise to wrongly segmented tokens, such as 政局
in this example, which should have been 地政局
for Land Administration Bureau.
import jieba.analyse as KE
jieba_kw = KE.textrank(text, topK=10)
jieba_kw
!pip install textacy
With textaCy, you can load a spaCy language model and then create a spaCy Doc
object using that model.
import textacy
zh = textacy.load_spacy_lang("zh_core_web_sm")
doc = textacy.make_spacy_doc(text, lang=zh)
doc._.preview
textaCy implements four algorithms for keyword extraction, including TextRank. But I got useless results by calling the textacy.ke.textrank
function with doc
.
import textacy.ke as ke
ke.textrank(doc)
!pip install pyate
pyate
has a built-in TermExtractionPipeline
class for extracting keywords, which can be added to spaCy's pipeline. But it didn't work and this error message showed up: TypeError: load() got an unexpected keyword argument 'parser'
.
from pyate.term_extraction_pipeline import TermExtractionPipeline
nlp.add_pipe(TermExtractionPipeline())
I found on the documentation page that pyate
only supports English and Italian, which may account for the error I got.
!pip install pytextrank
To add TextRank to the spaCy pipeline, I followed the instructions found in spaCy's documentation. But an error popped up. Luckily, the ValueError message suggested possible ways to fix the problem.
import pytextrank
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name='textrank', last=True)
So I used the @Language.factory
decorator to define a TextRank component, and then called the nlp.add_pipe
function with textrank
. But this didn't work either. The error message reads: 'Chinese' object has no attribute 'sents'
.
from spacy.language import Language
tr = pytextrank.TextRank()
@Language.factory("textrank")
def create_textrank_component(nlp: Language, name: str):
return tr.PipelineComponent(nlp)
nlp.add_pipe('textrank')
I couldn't even install rake-spacy
.
!pip install rake-spacy
!pip install rake-keyword
According to the documentation on PyPI, the import is done with from rake import Rake, but it didn't work.
from rake import Rake
However, based on the documentation on GitHub, this is done by from rake import RAKE
instead. But it didn't work either.
from rake import RAKE
Integrating CKIP Transformers with spaCy and the TextRank algorithm generates decent results for extracting keywords from texts in traditional Chinese. Although there are many Python libraries out there that implement TextRank, none of them worked better for me than the TextRank4Keyword class crafted from scratch. Until I figure out how to properly add a TextRank component to the spaCy pipeline, I'll stick with the working pipeline shown here. As a final thought, spaCy recently released v3.0, which supports pretrained transformer models. I can't wait to give it a try and see how this would change the workflow of extracting keywords and other NLP tasks. But that'll have to wait until the next post.
class crafted from scratch. Until I figure out how to properly add the TextRank component to the spaCy pipeline, I'll stick with my working pipeline shown here. As a final thought, spaCy recently released v3.0, which supports pretrained transformer models. I can't wait to give it a try and see how this would change the workflow of extracting keywords or other NLP tasks. But that'll have to wait until next post.