Intro

Text classification is a very common NLP task. Given enough training data, it's relatively easy to build a model that can automatically classify previously unseen texts in a way that follows the logic of the training data. In this post, I'll go through the steps for building such a model. Specifically, I'll leverage the power of the recently released spaCy v3.0 to train two classification models, one for identifying the sentiment of customer reviews in Chinese as positive or negative (i.e. binary classification) and the other for predicting which of five product categories a review belongs to (i.e. multiclass classification). If you can't wait to see how spaCy v3.0 has made the training process an absolute breeze, feel free to jump to the "Training the textcat component with CLI" section. If not, bear with me on this long journey. All the datasets and models created in this post are hosted in this repo of mine.

Preparing the dataset

Getting the dataset

I'd like to build classification models that take traditional Chinese texts as input, but I couldn't find any publicly available datasets of customer reviews in traditional Chinese, so I had to make do with reviews in simplified Chinese. Let's first download the dataset using !wget.

!wget https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/online_shopping_10_cats/online_shopping_10_cats.zip

--2021-03-07 14:08:42--  https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/online_shopping_10_cats/online_shopping_10_cats.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4084428 (3.9M) [application/zip]
Saving to: ‘online_shopping_10_cats.zip’

online_shopping_10_ 100%[===================>]   3.89M  --.-KB/s    in 0.1s    

2021-03-07 14:08:43 (37.1 MB/s) - ‘online_shopping_10_cats.zip’ saved [4084428/4084428]

Then we unzip the downloaded file online_shopping_10_cats.zip with, unsurprisingly, !unzip.

!unzip online_shopping_10_cats.zip

Archive:  online_shopping_10_cats.zip
  inflating: online_shopping_10_cats.csv  

The dataset has three columns: review for review texts, label for sentiment, and cat for product categories. Here's a random sample of five reviews.

import pandas as pd
file_path = '/content/online_shopping_10_cats.csv'
df = pd.read_csv(file_path)
df.sample(5)
cat label review
35479 洗发水 0 买了两套说好的赠品吹风机没给!
35477 洗发水 0 抢购降价一半?坑,爹?没赶上时候?
53299 酒店 1 碰上酒店做活动,加了40元给升级到行政房。房间还不错,比较新。服务员是实习生,不熟练但态度认...
14367 手机 1 1)外观新颖2)拥有强大的多媒体功能和卓越的性能,同时将电池的消耗减到最小,方便了更多的用户...
12549 平板 0 分辨率太低,买的后悔了.

There are 62,774 reviews in total.

df.shape
(62774, 3)

The label column has only two unique values, 1 for positive reviews and 0 for negative ones.

df.label.unique()
array([1, 0])

The cat column has ten unique values.

df.cat.unique()
array(['书籍', '平板', '手机', '水果', '洗发水', '热水器', '蒙牛', '衣服', '计算机', '酒店'],
      dtype=object)

Before moving on, let's save the raw dataset to Google Drive. The dest variable can be any GDrive path you like.

dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/"
!cp {file_path} {dest} 

Filtering the dataset

Now let's do some data filtering. The groupby function from pandas is very useful, and here's how to get the counts of each of the unique values in the cat column.

df.groupby(by='cat').size()
cat
书籍      3851
平板     10000
手机      2323
水果     10000
洗发水    10000
热水器      575
蒙牛      2033
衣服     10000
计算机     3992
酒店     10000
dtype: int64

To create a balanced dataset, I decided to keep only the categories with exactly 10,000 reviews. That leaves five product categories: 平板 for tablets, 水果 for fruits, 洗发水 for shampoo, 衣服 for clothing, and 酒店 for hotels.

There are many ways to filter data in pandas. My favorite is to first create a filt variable that holds a boolean Series of True and False values, which in this case records whether the value in the cat column is in cat_list, the list of categories to keep. Then we can simply filter the data with df[filt]. After filtering, the dataset is reduced to 50,000 reviews.

cat_list = ['平板', '水果', '洗发水', '衣服', '酒店'] 
filt = df['cat'].isin(cat_list)
df = df[filt]
df.shape
(50000, 3)

Now the dataset is balanced in terms of both the cat and label columns. There are 10,000 reviews for each product category.

df.groupby(by='cat').size()
cat
平板     10000
水果     10000
洗发水    10000
衣服     10000
酒店     10000
dtype: int64

And there are 25,000 reviews for each of the two sentiments.

df.groupby(by='label').size()
label
0    25000
1    25000
dtype: int64

Having made sure the filtered dataset is balanced, we can now reset the index, and save the dataset as online_shopping_5_cats_sim.csv.

df.reset_index(inplace=True, drop=True)
df.to_csv(dest+"online_shopping_5_cats_sim.csv", sep=",", index=False)

Converting the dataset to traditional Chinese

Let's load back the file we just saved to make sure the dataset is accessible for later use.

df = pd.read_csv(dest+"online_shopping_5_cats_sim.csv")
df.tail()
cat label review
49995 酒店 0 我们去盐城的时候那里的最低气温只有4度,晚上冷得要死,居然还不开空调,投诉到酒店客房部,得到...
49996 酒店 0 房间很小,整体设施老化,和四星的差距很大。毛巾太破旧了。早餐很简陋。房间隔音很差,隔两间房间...
49997 酒店 0 我感觉不行。。。性价比很差。不知道是银川都这样还是怎么的!
49998 酒店 0 房间时间长,进去有点异味!服务员是不是不够用啊!我在一楼找了半个小时以上才找到自己房间,想找...
49999 酒店 0 老人小孩一大家族聚会,选在吴宫泛太平洋,以为新加坡品牌一定很不错,没想到11点30分到前台,...

Next, I converted the reviews from simplified Chinese to traditional Chinese using the OpenCC library.

!pip install OpenCC

Collecting OpenCC
  Downloading https://files.pythonhosted.org/packages/d5/b4/24e677e135df130fc6989929dc3990a1ae19948daf28beb8f910b4f7b671/OpenCC-1.1.1.post1-py2.py3-none-manylinux1_x86_64.whl (1.3MB)
     |████████████████████████████████| 1.3MB 8.0MB/s 
Installing collected packages: OpenCC
Successfully installed OpenCC-1.1.1.post1

OpenCC has many conversion methods. I specifically used s2twp, which converts simplified Chinese to traditional Chinese adapted to Taiwanese vocabulary. The adaptation is not optimal, but it's better than a mechanical simplified-to-traditional conversion. Here's a sample review in the two writing systems.

from opencc import OpenCC
cc = OpenCC('s2twp') 
test = df.loc[49995, 'review']
print(test)
test_tra = cc.convert(test)
print(test_tra)
我们去盐城的时候那里的最低气温只有4度,晚上冷得要死,居然还不开空调,投诉到酒店客房部,得到的答复是现在还没有领导指示需要开暖气,如果冷到话可以多给一床被子,太可怜了。。。
我們去鹽城的時候那裡的最低氣溫只有4度,晚上冷得要死,居然還不開空調,投訴到酒店客房部,得到的答覆是現在還沒有領導指示需要開暖氣,如果冷到話可以多給一床被子,太可憐了。。。

Having made sure the conversion is correct, we can now go ahead and convert all reviews.

df.loc[ : , 'review']  = df['review'].apply(lambda x: cc.convert(x)) 

Let's make the same change to the cat column.

df.loc[ : , 'cat']  = df['cat'].apply(lambda x: cc.convert(x)) 

And then we save the converted dataset as online_shopping_5_cats_tra.csv.

df.to_csv(dest+'online_shopping_5_cats_tra.csv', sep=",", index=False)

Inspecting the dataset

Let's load back the file just saved to make sure it's accessible in the future.

df = pd.read_csv(dest+'online_shopping_5_cats_tra.csv')
df.tail()
cat label review
49995 酒店 0 我們去鹽城的時候那裡的最低氣溫只有4度,晚上冷得要死,居然還不開空調,投訴到酒店客房部,得到...
49996 酒店 0 房間很小,整體設施老化,和四星的差距很大。毛巾太破舊了。早餐很簡陋。房間隔音很差,隔兩間房間...
49997 酒店 0 我感覺不行。。。價效比很差。不知道是銀川都這樣還是怎麼的!
49998 酒店 0 房間時間長,進去有點異味!服務員是不是不夠用啊!我在一樓找了半個小時以上才找到自己房間,想找...
49999 酒店 0 老人小孩一大家族聚會,選在吳宮泛太平洋,以為新加坡品牌一定很不錯,沒想到11點30分到前臺,...

Before building models, I would normally inspect the dataset. There're many ways to do so. I recently learned that there's a trick on Colab which allows you to filter a dataset in an interactive manner. All it takes is three lines of code.

%load_ext google.colab.data_table
from google.colab import data_table
data_table.DataTable(df, include_index=False, num_rows_per_page=10)

Alternatively, if you'd like to see some sample reviews from all the categories, the groupby function is quite handy. The trick here is to feed pd.DataFrame.sample to the apply function so that you can specify the number of reviews to inspect from each product category.

df.groupby('cat').apply(pd.DataFrame.sample, n=3)[['label', 'review']]
label review
cat
平板 6247 0 這個平板真的是3G的嗎?你們有沒有忽悠唉,為什麼我下了一個百度影片,就卡的要死要活的,跟我以...
1081 1 看網頁玩王者榮耀都很流暢,音質畫面都不錯,就是稍微重了點,綜合性價比還是很好的
1042 1 我覺得還可以,就是把膜貼上了之後,有點滑不動,打遊戲的時候就很煩了
水果 10468 1 還不錯,這個價格比我在外面買的划算,以後還會經常來,個頭不是很大,還可以吧
10986 1 蘋果味道好,就是小了一點,比想象的要小量了一下,好像基本上都沒到70毫米。快遞還是挺快的包裝...
18062 0 這是我在京東消費這麼多年來買到唯一次最爛的東西,還是自營一斤6塊錢的就這貨色還有一個是壞的,...
洗髮水 23271 1 很好,很舒服,清揚就是好用,謝謝老闆,希望一直好用,好好好好好好好好好,快樂
21992 1 一如既往的好 京東速度快 值得信賴 優惠多多
21867 1 京東購物 多快好省 寫評論真的很累 有木有 每次商品很滿意就用這個 各位大佬請放心購買
衣服 30157 1 褲子質量不錯,貨真價實,和店家介紹的基本相符,大小合適,樣式也很滿意,穿上褲子走路感到很輕便,舒服
33190 1 做工精細穿起來很舒服,質量很好
35326 0 質量很差,沒有想象的那麼好
酒店 42243 1 房間是新裝修的,用的是淺色調,感覺很溫馨,佈局較合理,顯的比較寬敞.酒店選用的布草很講究,放...
48082 0 實在忍無可忍。1、水。缺水現象——洗澡至中途,突然斷水,望著滿身的肥皂泡欲哭無淚;多水現象—...
49552 0 下雨天冷,想洗熱水澡,可惜開了半小時水還是冷的

Finally, one of the most powerful ways of exploring a dataset is to use the facets-overview library. Let's first create a column for the length of review texts.

df['len'] = df['review'].apply(len)
df.tail()
cat label review len
49995 酒店 0 我們去鹽城的時候那裡的最低氣溫只有4度,晚上冷得要死,居然還不開空調,投訴到酒店客房部,得到... 86
49996 酒店 0 房間很小,整體設施老化,和四星的差距很大。毛巾太破舊了。早餐很簡陋。房間隔音很差,隔兩間房間... 102
49997 酒店 0 我感覺不行。。。價效比很差。不知道是銀川都這樣還是怎麼的! 29
49998 酒店 0 房間時間長,進去有點異味!服務員是不是不夠用啊!我在一樓找了半個小時以上才找到自己房間,想找... 64
49999 酒店 0 老人小孩一大家族聚會,選在吳宮泛太平洋,以為新加坡品牌一定很不錯,沒想到11點30分到前臺,... 455

Then we install the library.

!pip install facets-overview

Collecting facets-overview
  Downloading https://files.pythonhosted.org/packages/df/8a/0042de5450dbd9e7e0773de93fe84c999b5b078b1f60b4c19ac76b5dd889/facets_overview-1.0.0-py2.py3-none-any.whl
Requirement already satisfied: protobuf>=3.7.0 in /usr/local/lib/python3.7/dist-packages (from facets-overview) (3.12.4)
Requirement already satisfied: pandas>=0.22.0 in /usr/local/lib/python3.7/dist-packages (from facets-overview) (1.1.5)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.7/dist-packages (from facets-overview) (1.19.5)
Requirement already satisfied: six>=1.9 in /usr/local/lib/python3.7/dist-packages (from protobuf>=3.7.0->facets-overview) (1.15.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from protobuf>=3.7.0->facets-overview) (53.0.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.22.0->facets-overview) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.22.0->facets-overview) (2018.9)
Installing collected packages: facets-overview
Successfully installed facets-overview-1.0.0

To render an interactive visualization of the dataset, we first convert the DataFrame object df to JSON and then add it to an HTML template, as shown below. If you choose len for Binning | X-Axis, cat for Binning | Y-Axis, and finally review for Label By, you'll see all the reviews beautifully arranged in terms of text length along the X axis and product category along the Y axis. They're also color-coded with respect to sentiment, blue for positive and red for negative. Clicking on a point of either color shows the values of that particular datapoint. Feel free to play around.

from IPython.display import display, HTML
jsonstr = df.to_json(orient='records')
HTML_TEMPLATE = """
        <script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
display(HTML(html))

Training spaCy models

Instantiating a pretrained spaCy model

spaCy supports many pretrained models in multiple languages, and offers a convenient widget for working out the code for downloading a particular model for a particular language. I specifically picked zh_core_web_md for Chinese.

!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download zh_core_web_md

Everything in spaCy starts with loading a model. The model we downloaded has five built-in components, accessible via the pipe_names attribute.

import spacy
nlp=spacy.load("zh_core_web_md")
nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler']

Let's call the nlp object with a sample review text and print out its tokens to make sure the model is working.

test = df.loc[22, 'review']
doc = nlp(test)
for tok in doc:
  print(tok.text, tok.pos_)

買 VERB
了 PART
送人 NOUN
的 PART
, PUNCT
沒有 VERB
拆開 NOUN
, PUNCT
包裝 VERB
高大上 NOUN
, PUNCT
以前 NOUN
自用 VERB
買 VERB
了 PART
一 NUM
個 NUM
, PUNCT
這次 ADJ
活動 NOUN
便宜 VERB
好 VERB
幾百 VERB
, PUNCT
價格 NOUN
好 ADV
划算 VERB
啊 PART
, PUNCT
以後 NOUN
還會 VERB
降價 NOUN
嗎 VERB
??? PUNCT

Converting the dataset to spaCy format for training

To train a classification model with spaCy, we have to convert our dataset to spaCy format. Before that, we'll create a directory called data under the current working directory cwd. This is where we'll save the data in spaCy format.

Tip: I find the os.makedirs function much more useful than the more commonly seen os.mkdir() function because the former creates all the intermediate directories for you if they don’t exist yet.

import os
def create_dir(dir_name):
  cwd = os.getcwd()
  project_dir = os.path.join(cwd, dir_name)  
  os.makedirs(project_dir)  
  return project_dir

project_dir = create_dir("data")
project_dir
'/content/data'
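
One thing to keep in mind is that os.makedirs raises FileExistsError when the target directory already exists, unless exist_ok=True is passed. Here's a minimal sketch of a re-runnable variant. The ensure_dir helper and the "spacy" subdirectory are hypothetical names used only for illustration, and the demo runs inside a temporary directory so it doesn't touch the real working directory.

```python
import os
import tempfile

def ensure_dir(base, *parts):
    """Create base/parts[0]/parts[1]/... and return the full path.

    exist_ok=True makes the call safe to repeat, e.g. when a notebook
    cell is re-run after the directory already exists.
    """
    path = os.path.join(base, *parts)
    os.makedirs(path, exist_ok=True)
    return path

# Demonstrate inside a throwaway directory rather than the real cwd.
with tempfile.TemporaryDirectory() as tmp:
    first = ensure_dir(tmp, "data", "spacy")   # creates both levels at once
    second = ensure_dir(tmp, "data", "spacy")  # no error on the second call
    created = os.path.isdir(first) and first == second

print(created)
```

Whether to pass exist_ok=True is a judgment call: leaving it out makes an accidental overwrite of an existing data directory fail loudly instead.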

The first step for the conversion is to create a list of tuples with two elements, one for the text and the other for the text class label. Let's start with the binary classification for sentiment first and generalize to multiclass classification for product categories later.

The easiest way to generate such a list is to create a new column called tuples (the name is arbitrary), whose values are derived by applying a lambda function that pairs review (the text) with label (the text class). I learned this trick from this article. Here are the first 10 tuples in the newly created dataset list.

df['tuples'] = df.apply(lambda row: (row['review'], row['label']), axis=1)
dataset = df['tuples'].tolist()
dataset[:10]
[('\ufeff很不錯。。。。。。很好的平板', 1),
 ('幫同學買的,同學說感覺挺好,質量也不錯', 1),
 ('東西不錯,一看就是正品包裝,還沒有開機,相信京東,都是老顧客,還是京東值得信賴,給五星好評', 1),
 ('總體而言,產品還是不錯的。', 1),
 ('好,不錯,真的很好不錯', 1),
 ('很好,音響效果不錯。挺喜歡的,用了一段時間才來評價的。', 1),
 ('包裝不是很好,裡面太空了我不塞多點汽泡袋,其它的還可以', 1),
 ('之前一直用華為手機,覺得不錯,想試一下平板,前天下單,昨天到貨,試了一下,感覺還不錯,家裡也有ipad,兩者不能比較,各有各的好,只是近年一直都用華為,手機,盒子用起來都不錯,現在是老媽用蘋果手機用ipad,我都用華為,用習慣了。大小也正合適,挺喜歡的,唯一欠缺的一點就是沒有耳機。後面多用幾次再追評吧。',
  1),
 ('說實話,非常喜歡,這個是送給客戶的,之前送的蘋果派的,但是不能拷檔案,自從找到這款,物美價廉,經濟實惠,很喜歡!!!買了好幾個了', 1),
 ('續航能力也太差了吧,看影片1個小時就25%了', 1)]
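
Note that the first tuple above starts with '\ufeff', the Unicode byte-order mark, which came along with the raw data. The models in this post were trained without removing it. If you'd rather strip it before training, a minimal sketch could look like this (strip_bom is a hypothetical helper, and the two tuples are copied from the output above):

```python
BOM = "\ufeff"  # U+FEFF, the Unicode byte-order mark

def strip_bom(text):
    """Remove byte-order marks from the start of a string, leaving the rest untouched."""
    return text.lstrip(BOM)

# Two (text, label) tuples in the same shape as the dataset list.
sample = [
    ("\ufeff很不錯。。。。。。很好的平板", 1),
    ("總體而言,產品還是不錯的。", 1),
]
cleaned = [(strip_bom(text), label) for text, label in sample]
print(repr(cleaned[0][0]))
```

Applying the same comprehension to the full dataset list would leave every tuple without a leading mark unchanged.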

Then I split the dataset using train_test_split from scikit-learn. The train_data and valid_data hold 80% and 20% of the dataset, respectively.

from sklearn.model_selection import train_test_split
train_data, valid_data = train_test_split(dataset, test_size=0.2, random_state=1)
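
Because the split above is random rather than stratified, the share of each label in the two parts can drift slightly from the 50/50 balance of the full dataset. Here's a small sketch for checking that, using toy stand-ins for train_data and valid_data (the label_distribution helper and the toy reviews are hypothetical):

```python
from collections import Counter

def label_distribution(pairs):
    """Return {label: share of examples} for a list of (text, label) tuples."""
    counts = Counter(label for _, label in pairs)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy stand-ins for train_data and valid_data.
toy_train = [("好", 1), ("很好", 1), ("差", 0), ("很差", 0)]
toy_valid = [("不錯", 1), ("不好", 0)]

train_share = label_distribution(toy_train)
valid_share = label_distribution(toy_valid)
print(train_share, valid_share)
```

If the shares turn out to be noticeably uneven, train_test_split also accepts a stratify argument that keeps the label proportions the same in both parts.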

Then we need to turn the dataset list into spaCy Doc objects and assign text classes to each of them so that the model can start learning. Assigning text classes is the trickiest part and took me a lot of trial and error to figure out. Unfortunately, the official spaCy documentation is not very clear about this part. At first, I used this make_docs function from p-sodmann/Spacy3Textcat to create Doc objects. That was successful, but the trained model gave weird results. Then I realized that the values of text classes need to be either True or False. So here's my revised version of the make_docs function. The trick here is to use bool(label), which will be True if the value of the label is 1 and False if its value is 0.

from tqdm.auto import tqdm
from spacy.tokens import DocBin

def make_docs(data):
    """
    this will take a list of texts and labels and transform them in spacy documents
    
    texts: List(str)
    labels: List(labels)
    
    returns: List(spacy.Doc.doc)
    """
    
    docs = []

    # nlp.pipe([texts]) is way faster than running nlp(text) for each text
    # as_tuples allows us to pass in a tuple, the first one is treated as text
    # the second one will get returned as it is.
    
    for doc, label in tqdm(nlp.pipe(data, as_tuples=True), total = len(data)):
        
        # we need to set the (text)cat(egory) for each document
        doc.cats["POSITIVE"] = bool(label)
        doc.cats["NEGATIVE"] = not bool(label)

        # put them into a nice list
        docs.append(doc)
    
    return docs
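
To make the bool(label) trick concrete, here's a spaCy-free sketch of the dictionary that ends up in doc.cats for each of the two label values. The binary_cats helper is a hypothetical name that just isolates the two assignments from make_docs:

```python
def binary_cats(label):
    """Map a 0/1 sentiment label to the dict assigned to doc.cats."""
    is_positive = bool(label)
    return {"POSITIVE": is_positive, "NEGATIVE": not is_positive}

positive_example = binary_cats(1)
negative_example = binary_cats(0)
print(positive_example)
print(negative_example)
```

Exactly one of the two classes is True for every review, which is what a mutually exclusive classifier expects.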

spaCy v3.0 introduces the DocBin class, which is the recommended container for serializing a list of Doc objects. It serves a similar purpose to pickling the list, but is designed specifically for Doc objects. After we create a list of Doc objects with the make_docs function, we can generate an instance of the DocBin class to hold that list and save the serialized object to disk by calling its to_disk method. We first do this to valid_data and save the serialized file as valid.spacy in the data directory.

valid_docs = make_docs(valid_data)
doc_bin = DocBin(docs=valid_docs)
doc_bin.to_disk("./data/valid.spacy")

Then we do the same thing to train_data and save it as train.spacy. This will take much more time to run since train_data is much larger than valid_data.

train_docs = make_docs(train_data)
doc_bin = DocBin(docs=train_docs)
doc_bin.to_disk("./data/train.spacy")

And that's the end of the conversion process! Now I'd like to save the serialized data for later use, so I copied it to dest, which is a directory in my Google Drive.

Note: Remember to use the -R flag when copying a directory and everything in it to another location.
source = "/content/data"
dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-sentiment-model/"
!cp -R {source} {dest}

Training the textcat component with CLI

spaCy v3.0 offers a nice and easy way to train spaCy's textcat component with a command line interface (CLI). The first step is to get the base_config.cfg file, which is generated for you if you use spaCy's quickstart widget. The only thing that needs to be changed is train and dev under [paths], which indicate where the serialized files are stored.

%%writefile base_config.cfg

# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = "data/train.spacy"
dev = "data/valid.spacy"

[system]
gpu_allocator = null

[nlp]
lang = "zh"
pipeline = ["tok2vec","textcat"]
batch_size = 1000

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.textcat]
factory = "textcat"

[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

[initialize]
vectors = null
Writing base_config.cfg
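
Since train and dev are the only values edited by hand, it can be worth reading them back before training. The sketch below is only an illustration: it uses Python's standard configparser as a stand-in (spaCy loads the config with its own loader), it parses just the [paths] fragment rather than the whole file, and read_paths is a hypothetical helper.

```python
import configparser
import json

CONFIG_FRAGMENT = """
[paths]
train = "data/train.spacy"
dev = "data/valid.spacy"
"""

def read_paths(config_text):
    """Return the [paths] entries of the config fragment as plain strings."""
    # interpolation=None keeps configparser from trying to expand "$" or "%".
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # The values are written as quoted strings, so json.loads drops the quotes.
    return {key: json.loads(value) for key, value in parser["paths"].items()}

paths = read_paths(CONFIG_FRAGMENT)
print(paths)
```

From there, a call such as os.path.exists(paths["train"]) would confirm that the serialized file is where the config says it is.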

Then we create the config.cfg configuration file with base_config.cfg by using this command.

!python -m spacy init fill-config ./base_config.cfg ./config.cfg
2021-03-05 01:43:52.105583: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy

Here comes the fun part, where we can see how a computer model learns over time! The initial score was 43, which is worse than chance. But after only 200 steps, the score had already skyrocketed to 83! By the end of training, we got a score of 92, which is pretty awesome, considering that we didn't do any text preprocessing.

!python -m spacy train ./config.cfg --output ./output

2021-03-05 01:44:00.345586: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
✔ Created output directory: output
ℹ Using CPU

=========================== Initializing pipeline ===========================
Set up nlp object from config
Pipeline: ['tok2vec', 'textcat']
Created vocabulary
Finished initializing nlp object
Initialized pipeline components: ['tok2vec', 'textcat']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ------------  ----------  ------
  0       0          0.00          0.06       43.32    0.43
  0     200          0.00         34.18       83.25    0.83
  0     400          0.00         21.43       86.60    0.87
  0     600          0.00         12.77       87.25    0.87
  0     800          0.00         15.29       89.00    0.89
  0    1000          0.00         11.78       89.71    0.90
  0    1200          0.00          5.92       89.62    0.90
  0    1400          0.00          4.52       89.98    0.90
  0    1600          0.00          1.51       90.40    0.90
  0    1800          0.00          4.41       90.82    0.91
  0    2000          0.00          0.80       90.76    0.91
  0    2200          0.00          1.53       90.97    0.91
  0    2400          0.00          0.16       91.06    0.91
  0    2600          0.00          0.37       91.34    0.91
  0    2800          0.00          0.13       91.33    0.91
  0    3000          0.00          0.16       91.40    0.91
  0    3200          0.00          0.20       91.56    0.92
  0    3400          0.00          0.38       91.57    0.92
  0    3600          0.00          0.10       91.70    0.92
  1    3800          0.00          0.12       91.41    0.91
  1    4000          0.00          0.17       91.65    0.92
  1    4200          0.00          0.10       91.54    0.92
  1    4400          0.00          0.14       91.58    0.92
  1    4600          0.00          0.15       91.58    0.92
  1    4800          0.00          0.19       91.59    0.92
  1    5000          0.00          0.09       91.66    0.92
  1    5200          0.00          2.16       91.91    0.92
  1    5400          0.00          0.12       91.86    0.92
  1    5600          0.00          0.21       91.68    0.92
  1    5800          0.00          0.10       91.69    0.92
  2    6000          0.00          2.18       91.22    0.91
  2    6200          0.00          0.15       91.50    0.92
  2    6400          0.00          0.16       91.61    0.92
  2    6600          0.00          0.10       91.67    0.92
  2    6800          0.00          0.11       91.57    0.92
✔ Saved pipeline to output directory
output/model-last

spaCy automatically saves two models, model-best and model-last, to the path specified by the --output argument. We'll use the best model for testing. Here's the py file adapted from p-sodmann/Spacy3Textcat.

%%writefile test_input.py
import spacy
# load the best model from training
nlp = spacy.load("output/model-best")
text = ""
print("type : 'quit' to exit")
# predict the sentiment until someone writes quit
while text != "quit":
    text = input("Please enter a review here: ")
    doc = nlp(text)
    print(doc.cats)
Writing test_input.py

%%writefile test_input.py
import spacy
# load the best model from training
nlp = spacy.load("/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-text-classification/spaCy-sentiment-model/model-best")
text = ""
print("type : 'quit' to exit")
# predict the sentiment until someone writes quit
while text != "quit":
    text = input("Please enter a review here: ")
    doc = nlp(text)
    print(doc.cats)
Writing test_input.py

Let's print out some reviews in valid_data for the sake of testing the model.

valid_data[:10]
[('和以前買的不一樣!以前用的特別滑!用的也少!這次買的不僅用的東西多!而且還頭皮癢!有頭屑!', 0),
 ('不好意思我貨未收到,我想起訴你們!', 0),
 ('寶貝收到了,是我想要的版型,質量十分好,這條褲子的顏色跟圖片上一樣,是我想要的顏色,總之,是一次很愉快的購物', 1),
 ('一點也不甜不清脆,只能榨汁了。', 0),
 ('蘋果小,但味道不錯,給個好評吧', 1),
 ('平板收到,很滿意和想像的一樣,沒失望第一時間體驗中,稍後再評', 1),
 ('說好的樂視會員一年 送哪去了 差評', 0),
 ('相當完美啊 好東西', 1),
 ('連生產生日都沒有,真不知道是不是真的。', 0),
 ('蘋果一如既往得好,脆、甜、水分足,推薦購買哦奧~', 1)]

Now let's run test_input.py to start grilling the model about the sentiment of reviews. The results are quite satisfactory. I even intentionally asked about a mixed review, 他們家的香蕉好吃,但是蘋果卻一點也不甜! (Their bananas are delicious, but their apples are not sweet at all!). And the model gave almost equal scores to the two sentiments!

!python test_input.py
2021-03-05 05:16:18.092297: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
type : 'quit' to exit
Please enter a review here: 和以前買的不一樣!以前用的特別滑!用的也少!這次買的不僅用的東西多!而且還頭皮癢!有頭屑!
{'POSITIVE': 0.022647695615887642, 'NEGATIVE': 0.9773523211479187}
Please enter a review here: 平板收到,很滿意和想像的一樣,沒失望第一時間體驗中,稍後再評
{'POSITIVE': 0.43043994903564453, 'NEGATIVE': 0.5695600509643555}
Please enter a review here: 說好的樂視會員一年 送哪去了 差評
{'POSITIVE': 0.20790322124958038, 'NEGATIVE': 0.792096734046936}
Please enter a review here: 連生產生日都沒有,真不知道是不是真的。
{'POSITIVE': 0.04043734818696976, 'NEGATIVE': 0.9595626592636108}
Please enter a review here: 蘋果一如既往得好,脆、甜、水分足,推薦購買哦奧~
{'POSITIVE': 0.9967644214630127, 'NEGATIVE': 0.0032355356961488724}
Please enter a review here: 他們家的香蕉好吃,但是
{'POSITIVE': 0.7071358561515808, 'NEGATIVE': 0.2928641438484192}
Please enter a review here: 他們家的香蕉好吃,但是蘋果卻一點也不甜!
{'POSITIVE': 0.5702691674232483, 'NEGATIVE': 0.4297308325767517}
Please enter a review here: quit
{'POSITIVE': 0.46642377972602844, 'NEGATIVE': 0.533576250076294}
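
The script prints the raw scores for both classes. If you need a single predicted label instead, one option is to take the class with the highest score, as in the sketch below. The top_label helper is hypothetical, and the scores are copied from the mixed-review output above.

```python
def top_label(cats):
    """Return the (label, score) pair with the highest score in a doc.cats-style dict."""
    label = max(cats, key=cats.get)
    return label, cats[label]

# Scores for the mixed review about bananas and apples.
mixed = {"POSITIVE": 0.5702691674232483, "NEGATIVE": 0.4297308325767517}
label, score = top_label(mixed)

# The gap between the two scores gives a rough sense of how decisive the prediction is.
margin = abs(mixed["POSITIVE"] - mixed["NEGATIVE"])
print(label, round(score, 2), round(margin, 2))
```

A small margin, as with this review, is a hint that the text carries mixed sentiment and that the top label alone shouldn't be over-interpreted.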

Finally, I saved the best trained model to Google Drive.

source = "/content/output/model-best"
dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-sentiment-model/"
!cp -R {source} {dest}

Going from binary to multiclass classification

Next, we'll go over pretty much the same steps to train a multiclass classification model. This time, the dataset is a list of tuples with review texts and product categories.

df['tuples'] = df.apply(lambda row: (row['review'], row['cat']), axis=1)
dataset = df['tuples'].tolist()
dataset[-10:]
[('非常一般的酒店,房裡設施很舊,房間送餐竟然要加多50%的送餐費。總之找不到好的方面來說,有其他選擇就不要去了', '酒店'),
 ('房間沒窗戶,攜程網竟然沒有說明!', '酒店'),
 ('1、中午快一點到店,說房間沒有收拾出來,只好寄存行李出去玩,晚上回到房間滿屋煙味,打電話後給調了同型房間。明知道入住的是一家三口親子游,還這樣安排,初始印象分-1 2、電視很多頻道都是雪花,調了幾個臺都是如此,一點看電視的興致都沒了,-0.5毫不過分 3、房間設施一點不人性化,先說燈,除了總控,還要開燈具開關,晚上起夜,前前後後摁了五六個開關才把燈開啟,一點睡意都沒了,-0.5不冤吧;再說洗澡,浴缸很淺,站在裡面淋浴完後外面地上全是水,只能墊個浴巾之類的跳到房間穿衣,還好我和娃她媽還沒老到跳不動。第二天晚上一家三口沒一個人有洗澡的興致,這個-1算客氣了吧 4、最後來說早餐,剛進去服務員一把攔住,娃身高超一米二,要全價一百九,雖然肉痛,一想五星級飯店早餐應該值這個價,就去付了,一個男領班模樣的帶到一邊,讓付九十現金就可以了。付完了還竊喜佔了便宜,等端了盤子找吃的時才發現品種少的可憐。娃之前在四星都能吃到的冰激淋這根本找不到。還沒吃完娃就和我們嘟噥,明天打死也不來吃。-1算理智了吧 5、我要這麼給您減完了,您估計會以為我挑剔,故意找碴。這麼來說吧,稍微讓我們滿意的就是空調溫度了,還有禮賓部寄存行李的服務員了。讓我暫時忘卻你們是五星級飯店吧,這裡給您+0.5 6、沒有比較就沒有差距,遠的不說,這次來北京住五晚,前兩天是紅杉假日酒店,我給評了4.8,和您比起來,除了地段外,您就沒哪項能落了好的。人家酒店剛入住時也說了小孩半價早餐,可覺得過年早餐品種不夠豐富,就免了娃的費用。諷刺的是,我們一家尤其是娃覺得紅杉的早餐比您這要豐盛可口,唯一遺憾的是那也沒冰淇淋。 7、您要是就憑地段就能拉紅杉一晚三百的差距,您自個也覺得心安理得。那我提醒您一句:且行且珍惜',
  '酒店'),
 ('酒店很舊 房間很小。入住體驗非常不好。服務還可以。下面是盤門景區,這點不錯。到各景區交通比較方便。', '酒店'),
 ('開始以為應該不錯,結果大失所望!!頂多跟100塊的小旅館差不多!各種設施都老舊的要命!連淋雨花灑壞的!最坑的是網速還限制每個房間幾百K!!開個網頁都要五分鐘好嗎?真破酒店!',
  '酒店'),
 ('我們去鹽城的時候那裡的最低氣溫只有4度,晚上冷得要死,居然還不開空調,投訴到酒店客房部,得到的答覆是現在還沒有領導指示需要開暖氣,如果冷到話可以多給一床被子,太可憐了。。。',
  '酒店'),
 ('房間很小,整體設施老化,和四星的差距很大。毛巾太破舊了。早餐很簡陋。房間隔音很差,隔兩間房間的聲音都聽得見。晚上有人在走廊裡大聲喧譁很久,也沒有人來勸止。比不上以前入住的附近的經濟型酒店。下次不會入住了。',
  '酒店'),
 ('我感覺不行。。。價效比很差。不知道是銀川都這樣還是怎麼的!', '酒店'),
 ('房間時間長,進去有點異味!服務員是不是不夠用啊!我在一樓找了半個小時以上才找到自己房間,想找個服務員問一下都找不到,總之不推薦!', '酒店'),
 ('老人小孩一大家族聚會,選在吳宮泛太平洋,以為新加坡品牌一定很不錯,沒想到11點30分到前臺,大堂裡擠滿了人,我們所有人都託著大大的行李箱站著,還有的抱著小嬰兒,排隊辦理入住,足足等了1個多小時,期間還被前臺張悅(自稱是當天的值班經理,和另一位自稱值班經理有重複,不知誰是誰非)冷嘲熱諷,總之等待的客人都很不高興,前臺速度太慢,一會兒進裡面問問,一會兒去那裡問問,很混亂的感覺!我們的房間等到4點還沒拿到房! 房間裡的床品有發黴潮溼的味道,絕對不像5星級,家人們都說不如3星的! 排隊吃早飯排了半小時,也算是醉了,而且沒什麼東西吃,問有沒有小餛飩,煎荷包蛋的廚師說有,但要點單45元一碗,沒聽說過自助餐廳裡有收費的專案,要麼說沒有,再說一碗小餛飩要45元?邊上聽著的客人直搖頭! 沒辦法承接這麼多客流,就不要接,入住體驗很不好!勸上海的朋友不要被宣傳圖片和文字忽悠了,起碼旺季真心不咋的,出入不方便,電梯不是直達的,必須經過前臺才能到底,總之體驗非常差,每次去蘇州都住園區那裡,老城區第一次體驗,不好,不會再去!',
  '酒店')]

In the make_docs function above, we hardcoded the string names of the text classes, that is, POSITIVE and NEGATIVE. That was not a good option since the function cannot be reused in other cases. So I'd like a general function that works just as well even if we don't know in advance how many text classes there are in the dataset and what their string names are. After some experiments, I finally came up with the make_docs_multiclass function, which does the job. The trick here is to create the label_dict dictionary, with every unique class name as a key and False as its default value, for every Doc object in the for-loop. Then we assign label_dict to the cats attribute of the Doc object, that is, doc.cats = label_dict. Finally, we update the value of a class to True only when that is the class of the Doc object in question.

unique_labels = df.cat.unique().tolist()
def make_docs_multiclass(data):
    """
    this will take a list of texts and labels and transform them in spacy documents
    
    texts: List(str)
    labels: List(labels)
    
    returns: List(spacy.Doc.doc)
    """
    
    docs = []

    # nlp.pipe([texts]) is way faster than running nlp(text) for each text
    # as_tuples allows us to pass in a tuple, the first one is treated as text
    # the second one will get returned as it is.
    
    for doc, label in tqdm(nlp.pipe(data, as_tuples=True), total = len(data)):
        label_dict = {name: False for name in unique_labels}
        # we need to set the (text)cat(egory) for each document
        doc.cats = label_dict
        doc.cats[label] = True

        # put them into a nice list
        docs.append(doc)
    
    return docs
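
Here's a spaCy-free sketch of the dictionary that make_docs_multiclass builds for a single review, using the five product categories kept in this post. The one_hot_cats helper is a hypothetical name, and the check for unknown labels is an addition that the original function doesn't have.

```python
def one_hot_cats(label, all_labels):
    """Build a doc.cats-style dict: True for `label`, False for every other class."""
    if label not in all_labels:
        raise ValueError(f"unknown label: {label!r}")
    return {name: name == label for name in all_labels}

# The five product categories, in traditional Chinese.
categories = ["平板", "水果", "洗髮水", "衣服", "酒店"]
hotel_cats = one_hot_cats("酒店", categories)
print(hotel_cats)
```

Every review gets the full set of class names as keys, with exactly one of them set to True.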

Then we split the dataset, convert it to spaCy format, save the converted data, and train a model with the CLI, just like we did earlier.

from sklearn.model_selection import train_test_split
train_data, valid_data = train_test_split(dataset, test_size=0.2, random_state=1)
valid_docs = make_docs_multiclass(valid_data)
doc_bin = DocBin(docs=valid_docs)
doc_bin.to_disk("./data/valid.spacy")
train_docs = make_docs_multiclass(train_data)
doc_bin = DocBin(docs=train_docs)
doc_bin.to_disk("./data/train.spacy")
source = "/content/data"
dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-product-cat-model/"
!cp -R {source} {dest}

After training, we got an overall score of 89, which is pretty awesome.

!python -m spacy train ./config.cfg --output ./output-multiclass

2021-03-05 03:12:00.200786: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
✔ Created output directory: output-multiclass
ℹ Using CPU

=========================== Initializing pipeline ===========================
Set up nlp object from config
Pipeline: ['tok2vec', 'textcat']
Created vocabulary
Finished initializing nlp object
Initialized pipeline components: ['tok2vec', 'textcat']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ------------  ----------  ------
  0       0          0.00          0.09        0.00    0.00
  0     200          0.00         45.82       37.83    0.38
  0     400          0.00         14.59       63.48    0.63
  0     600          0.00         12.88       71.83    0.72
  0     800          0.00         12.18       76.29    0.76
  0    1000          0.00          5.34       78.87    0.79
  0    1200          0.00          3.11       80.73    0.81
  0    1400          0.00          4.12       82.13    0.82
  0    1600          0.00          1.41       82.99    0.83
  0    1800          0.00          0.76       83.86    0.84
  0    2000          0.00          0.58       84.57    0.85
  0    2200          0.00          0.40       85.13    0.85
  0    2400          0.00          0.22       85.54    0.86
  0    2600          0.00          0.24       86.10    0.86
  0    2800          0.00          0.17       86.51    0.87
  0    3000          0.00          0.30       86.82    0.87
  0    3200          0.00          0.20       87.06    0.87
  0    3400          0.00          0.26       87.13    0.87
  0    3600          0.00          0.14       87.48    0.87
  1    3800          0.00          0.13       87.73    0.88
  1    4000          0.00          0.13       87.80    0.88
  1    4200          0.00          0.31       87.80    0.88
  1    4400          0.00          0.12       88.07    0.88
  1    4600          0.00          0.25       88.07    0.88
  1    4800          0.00          0.26       87.97    0.88
  1    5000          0.00          0.10       88.25    0.88
  1    5200          0.00          0.20       88.27    0.88
  1    5400          0.00          0.15       88.39    0.88
  1    5600          0.00          0.13       88.51    0.89
  1    5800          0.00          0.11       88.53    0.89
  2    6000          0.00          0.16       88.77    0.89
  2    6200          0.00          0.15       88.39    0.88
  2    6400          0.00          0.15       88.67    0.89
  2    6600          0.00          0.11       88.77    0.89
  2    6800          0.00          0.11       88.65    0.89
  2    7000          0.00          0.14       88.73    0.89
  2    7200          0.00          0.11       88.69    0.89
  2    7400          0.00          0.17       88.67    0.89
  2    7600          0.00          0.17       88.58    0.89
  2    7800          0.00          0.11       88.68    0.89
  2    8000          0.00          0.28       88.96    0.89
  3    8200          0.00          0.21       88.93    0.89
  3    8400          0.00          0.11       88.91    0.89
  3    8600          0.00          0.23       88.86    0.89
  3    8800          0.00          0.15       88.80    0.89
  3    9000          0.00          0.11       88.95    0.89
  3    9200          0.00          0.30       88.90    0.89
  3    9400          0.00          0.11       89.11    0.89
  3    9600          0.00          0.12       89.14    0.89
  3    9800          0.00          0.12       89.09    0.89
  3   10000          0.00          0.16       88.95    0.89
  3   10200          0.00          0.17       89.01    0.89
  4   10400          0.00          0.10       89.07    0.89
  4   10600          0.00          0.12       89.14    0.89
  4   10800          0.00          0.12       89.15    0.89
  4   11000          0.00          0.11       89.09    0.89
  4   11200          0.00          0.13       89.03    0.89
  4   11400          0.00          0.09       89.11    0.89
  4   11600          0.00          0.11       89.28    0.89
  4   11800          0.00          0.18       89.15    0.89
  4   12000          0.00          0.14       89.33    0.89
  4   12200          0.00          0.09       89.22    0.89
  4   12400          0.00          0.15       89.22    0.89
  5   12600          0.00          0.10       89.08    0.89
  5   12800          0.00          0.11       89.17    0.89
  5   13000          0.00          0.15       89.24    0.89
  5   13200          0.00          0.12       89.20    0.89
  5   13400          0.00          0.12       89.08    0.89
  5   13600          0.00          0.15       89.18    0.89
✔ Saved pipeline to output directory
output-multiclass/model-last
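
Note that the log ends by saving model-last, while the test script in this section loads model-best. As far as I understand, spaCy keeps model-best at the checkpoint with the highest score and stops training once the score has not improved for a set number of steps (the patience setting in the config). The sketch below picks the best row from a handful of rows copied from the log above; the parsing helper is my own and is not part of spaCy.

```python
# A few rows copied verbatim from the training log above
log_rows = """
  4   11600          0.00          0.11       89.28    0.89
  4   11800          0.00          0.18       89.15    0.89
  4   12000          0.00          0.14       89.33    0.89
  5   13000          0.00          0.15       89.24    0.89
  5   13600          0.00          0.15       89.18    0.89
"""

def parse_row(line):
    """Split one log row into (epoch, step, cats_score)."""
    epoch, step, _tok2vec_loss, _textcat_loss, cats_score, _score = line.split()
    return int(epoch), int(step), float(cats_score)

rows = [parse_row(line) for line in log_rows.strip().splitlines()]
best_epoch, best_step, best_score = max(rows, key=lambda row: row[2])
last_step = rows[-1][1]
print(f"best CATS_SCORE {best_score} at step {best_step}; training stopped at step {last_step}")
```

In this run the highest CATS_SCORE (89.33) appears at step 12000, which is 1,600 steps before the final row, so model-best and model-last are not the same checkpoint.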

Here's the Python script for testing the multiclass classification model.

%%writefile test_input_multiclass.py

import spacy
# load the best model from training
nlp = spacy.load("output-multiclass/model-best")
text = ""
print("type : 'quit' to exit")
# predict the product category until someone writes quit
while text != "quit":
    text = input("Please enter a review here: ")
    doc = nlp(text)
    print(doc.cats)
Writing test_input_multiclass.py

Again, let's print out 10 reviews from valid_data to test with. The script is also rewritten here so that it loads the model from Google Drive.

%%writefile test_input_multiclass.py

import spacy
# load the best model from training
nlp = spacy.load("/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-text-classification/spaCy-product-cat-model/model-best")
text = ""
print("type : 'quit' to exit")
# predict the product category until someone writes quit
while text != "quit":
    text = input("Please enter a review here: ")
    doc = nlp(text)
    print(doc.cats)
Writing test_input_multiclass.py
valid_data[:10]
[('和以前買的不一樣!以前用的特別滑!用的也少!這次買的不僅用的東西多!而且還頭皮癢!有頭屑!', '洗髮水'),
 ('不好意思我貨未收到,我想起訴你們!', '衣服'),
 ('寶貝收到了,是我想要的版型,質量十分好,這條褲子的顏色跟圖片上一樣,是我想要的顏色,總之,是一次很愉快的購物', '衣服'),
 ('一點也不甜不清脆,只能榨汁了。', '水果'),
 ('蘋果小,但味道不錯,給個好評吧', '水果'),
 ('平板收到,很滿意和想像的一樣,沒失望第一時間體驗中,稍後再評', '平板'),
 ('說好的樂視會員一年 送哪去了 差評', '平板'),
 ('相當完美啊 好東西', '平板'),
 ('連生產生日都沒有,真不知道是不是真的。', '洗髮水'),
 ('蘋果一如既往得好,脆、甜、水分足,推薦購買哦奧~', '水果')]

The model works as expected. I intentionally tested an ambiguous review, 這個牌子值得信賴 (This brand is trustworthy.), which could have been written about tablets, clothing, or shampoo. And our model gave its top three scores to precisely these three classes!
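
Reading five floats per line by eye is error-prone, so here is a small sketch that ranks the scores instead. The dictionary is copied from the output for this review in the transcript in this section; top_k is a hypothetical helper, not a spaCy function.

```python
# Scores copied from the transcript for the review 這個牌子值得信賴
cats = {
    "平板": 0.19572772085666656,
    "水果": 0.10571669787168503,
    "洗髮水": 0.362021267414093,
    "衣服": 0.29673027992248535,
    "酒店": 0.03980398178100586,
}

def top_k(cats, k=3):
    """Return the k highest-scoring (label, score) pairs, best first."""
    return sorted(cats.items(), key=lambda item: item[1], reverse=True)[:k]

top3 = top_k(cats)
for label, score in top3:
    print(f"{label}\t{score:.3f}")
```

In the test script itself, the same function could be applied to doc.cats in place of printing the raw dictionary.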

!python test_input_multiclass.py
2021-03-05 05:07:28.215017: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
type : 'quit' to exit
Please enter a review here: 和以前買的不一樣!以前用的特別滑!用的也少!這次買的不僅用的東西多!而且還頭皮癢!有頭屑!
{'平板': 2.058545214822516e-05, '水果': 1.1623805221461225e-05, '洗髮水': 0.9999675750732422, '衣服': 2.6233973926537146e-07, '酒店': 7.811570945648327e-09}
Please enter a review here: 蘋果小,但味道不錯,給個好評吧
{'平板': 0.0010689852060750127, '水果': 0.9964427351951599, '洗髮水': 0.00197225296869874, '衣服': 0.00019674153008963913, '酒店': 0.0003193040902260691}
Please enter a review here: 脆、甜、水分足,推薦購買哦奧~
{'平板': 0.00028378094430081546, '水果': 0.9972990155220032, '洗髮水': 0.0011949185281991959, '衣服': 0.0006593622965738177, '酒店': 0.0005628817598335445}
Please enter a review here: 收到,很滿意和想像的一樣,沒失望第一時間體驗中,稍後再評
{'平板': 0.5766726732254028, '水果': 0.07788817584514618, '洗髮水': 0.16239948570728302, '衣服': 0.16160377860069275, '酒店': 0.0214359350502491}
Please enter a review here: 連生產生日都沒有,真不知道是不是真的。
{'平板': 0.23529699444770813, '水果': 0.032520271837711334, '洗髮水': 0.6238041520118713, '衣服': 0.017157811671495438, '酒店': 0.09122073650360107}
Please enter a review here: 這個牌子值得信賴,
{'平板': 0.19572772085666656, '水果': 0.10571669787168503, '洗髮水': 0.362021267414093, '衣服': 0.29673027992248535, '酒店': 0.03980398178100586}
Please enter a review here: quit
{'平板': 0.25125113129615784, '水果': 0.061232421547174454, '洗髮水': 0.27915307879447937, '衣服': 0.12709283828735352, '酒店': 0.2812705338001251}
!python test_input_multiclass.py
2021-03-16 10:14:45.823434: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
type : 'quit' to exit
Please enter a review here: 是蘋果,大家都想要
{'平板': 0.07115291804075241, '水果': 0.872447669506073, '洗髮水': 0.011495854705572128, '衣服': 0.027250811457633972, '酒店': 0.01765277236700058}
Please enter a review here: 是蘋果的,大家都想要
{'平板': 0.05921125411987305, '水果': 0.8904093503952026, '洗髮水': 0.01062779687345028, '衣服': 0.025291932746767998, '酒店': 0.014459758996963501}
Please enter a review here: 是蘋果的,大家都想要
{'平板': 0.05921125411987305, '水果': 0.8904093503952026, '洗髮水': 0.01062779687345028, '衣服': 0.025291932746767998, '酒店': 0.014459758996963501}
Please enter a review here: 是
{'平板': 0.23912471532821655, '水果': 0.11478321999311447, '洗髮水': 0.206477552652359, '衣服': 0.26159045100212097, '酒店': 0.17802411317825317}
Please enter a review here: 是蘋果的產品,大家都想要。
{'平板': 0.09895286709070206, '水果': 0.8691270351409912, '洗髮水': 0.016442863270640373, '衣服': 0.012551860883831978, '酒店': 0.002925291657447815}
Please enter a review here: 蘋果的產品,大家都想要。
{'平板': 0.1049538403749466, '水果': 0.865241289138794, '洗髮水': 0.014476561918854713, '衣服': 0.01276717334985733, '酒店': 0.0025611238088458776}
Please enter a review here: 蘋果的手機大家都想要。
{'平板': 0.7589152455329895, '水果': 0.208853617310524, '洗髮水': 0.0034141496289521456, '衣服': 0.013454959727823734, '酒店': 0.015362164005637169}
Please enter a review here: quit
{'平板': 0.25125113129615784, '水果': 0.061232421547174454, '洗髮水': 0.27915307879447937, '衣服': 0.12709283828735352, '酒店': 0.2812705338001251}
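
One wrinkle visible in both transcripts: typing quit still prints a set of scores, because the loop classifies the input before it checks the exit condition. A minimal sketch of a loop that checks first is shown below. The names review_loop, read_line, and classify are stand-ins I introduce so the logic can run without a trained model; in the real script they would correspond to input and a function returning nlp(text).cats.

```python
def review_loop(read_line, classify):
    """Classify lines from read_line() until 'quit' is entered."""
    results = []
    while True:
        text = read_line().strip()
        if text == "quit":
            break  # exit before the sentinel reaches the classifier
        if not text:
            continue  # ignore empty input
        results.append((text, classify(text)))
    return results

# Stand-ins for input() and the spaCy pipeline, for illustration only
fake_inputs = iter(["蘋果小,但味道不錯", "", "quit"])
fake_classify = lambda text: {"水果": 1.0}

results = review_loop(lambda: next(fake_inputs), fake_classify)
print(results)
```

With this structure, the word quit never gets scored, and empty lines are skipped rather than classified.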

With that reassurance, I can save the trained model to Google Drive.

source = "/content/output-multiclass/model-best"
dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-product-cat-model/"
!cp -R {source} {dest}

Checking model performance

I thought I'd have to write custom functions to evaluate the performance of our classification models. But it turns out that spaCy, our unfailingly considerate friend, has already done it for us under the hood. The performance metrics are stored in the meta.json file under the model-best directory. Here's the content of the meta.json file for our multiclass classification model.

import json

meta_path = "/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-text-classification/spaCy-product-cat-model/model-best/meta.json"
with open(meta_path) as json_file:
    metrics = json.load(json_file)
metrics
{'author': '',
 'components': ['tok2vec', 'textcat'],
 'description': '',
 'disabled': [],
 'email': '',
 'labels': {'textcat': ['平板', '水果', '洗髮水', '衣服', '酒店'], 'tok2vec': []},
 'lang': 'zh',
 'license': '',
 'name': 'pipeline',
 'performance': {'cats_f_per_type': {'平板': {'f': 0.8329853862,
    'p': 0.9140893471,
    'r': 0.7651006711},
   '水果': {'f': 0.9107981221, 'p': 0.9525368249, 'r': 0.8725637181},
   '洗髮水': {'f': 0.8430637386, 'p': 0.8827818284, 'r': 0.8067657611},
   '衣服': {'f': 0.8985658409, 'p': 0.9303455724, 'r': 0.868885527},
   '酒店': {'f': 0.9810741688, 'p': 0.9932677369, 'r': 0.9691763517}},
  'cats_macro_auc': 0.9846299582,
  'cats_macro_auc_per_type': 0.0,
  'cats_macro_f': 0.8932974513,
  'cats_macro_p': 0.9346042619,
  'cats_macro_r': 0.8564984058,
  'cats_micro_f': 0.8939148603,
  'cats_micro_p': 0.9357025697,
  'cats_micro_r': 0.8557,
  'cats_score': 0.8932974513,
  'cats_score_desc': 'macro F',
  'textcat_loss': 0.1365536493,
  'tok2vec_loss': 0.0},
 'pipeline': ['tok2vec', 'textcat'],
 'spacy_git_version': 'f4f46b617',
 'spacy_version': '>=3.0.3,<3.1.0',
 'url': '',
 'vectors': {'keys': 0, 'name': None, 'vectors': 0, 'width': 0},
 'version': '0.0.0'}

Specifically, the value of the performance key is what we're looking for.

performance = metrics['performance']
performance
{'cats_f_per_type': {'平板': {'f': 0.8329853862,
   'p': 0.9140893471,
   'r': 0.7651006711},
  '水果': {'f': 0.9107981221, 'p': 0.9525368249, 'r': 0.8725637181},
  '洗髮水': {'f': 0.8430637386, 'p': 0.8827818284, 'r': 0.8067657611},
  '衣服': {'f': 0.8985658409, 'p': 0.9303455724, 'r': 0.868885527},
  '酒店': {'f': 0.9810741688, 'p': 0.9932677369, 'r': 0.9691763517}},
 'cats_macro_auc': 0.9846299582,
 'cats_macro_auc_per_type': 0.0,
 'cats_macro_f': 0.8932974513,
 'cats_macro_p': 0.9346042619,
 'cats_macro_r': 0.8564984058,
 'cats_micro_f': 0.8939148603,
 'cats_micro_p': 0.9357025697,
 'cats_micro_r': 0.8557,
 'cats_score': 0.8932974513,
 'cats_score_desc': 'macro F',
 'textcat_loss': 0.1365536493,
 'tok2vec_loss': 0.0}

Let's make a nice DataFrame object out of the metrics of the overall performance.

score = performance['cats_score']
auc = performance['cats_macro_auc']
f1 = performance['cats_macro_f']
precision = performance['cats_macro_p']
recall = performance['cats_macro_r']
overall_dict = {'score': score, 'precision': precision, 'recall': recall, 'F1': f1, 'AUC': auc}
overall_df = pd.DataFrame(overall_dict, index=[0])
overall_df
      score  precision    recall        F1      AUC
0  0.893297   0.934604  0.856498  0.893297  0.98463

We can also break down the metrics into specific categories, which are saved as the values of the cats_f_per_type key of performance.

per_cat_dict = performance['cats_f_per_type']
per_cat_df = pd.DataFrame(per_cat_dict)
per_cat_df
       平板      水果     洗髮水      衣服      酒店
p  0.914089  0.952537  0.882782  0.930346  0.993268
r  0.765101  0.872564  0.806766  0.868886  0.969176
f  0.832985  0.910798  0.843064  0.898566  0.981074
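
The per-category numbers also explain where the overall ones come from. The macro scores in meta.json are the unweighted means of the per-category values, and each F-score is the harmonic mean of that category's precision and recall. The sketch below checks both relationships against the values copied from meta.json; the helper names macro and f1 are mine.

```python
# Per-category scores copied from meta.json above
per_cat = {
    "平板": {"p": 0.9140893471, "r": 0.7651006711, "f": 0.8329853862},
    "水果": {"p": 0.9525368249, "r": 0.8725637181, "f": 0.9107981221},
    "洗髮水": {"p": 0.8827818284, "r": 0.8067657611, "f": 0.8430637386},
    "衣服": {"p": 0.9303455724, "r": 0.868885527, "f": 0.8985658409},
    "酒店": {"p": 0.9932677369, "r": 0.9691763517, "f": 0.9810741688},
}

def macro(metric):
    """Unweighted mean of one metric over all categories."""
    return sum(scores[metric] for scores in per_cat.values()) / len(per_cat)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

macro_p, macro_r, macro_f = macro("p"), macro("r"), macro("f")
print(round(macro_p, 6), round(macro_r, 6), round(macro_f, 6))
```

Because every category counts equally in a macro average, the weaker recall on 平板 and 洗髮水 pulls the overall score down regardless of how many reviews each category has.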

Previously, I also used the fastText library to train a multiclass classification model on the same dataset. Here are the parameter values I used. The overall accuracy was 0.886, which is pretty close to the macro F-score of 0.893 for our spaCy model, although accuracy and macro F are not strictly the same metric.

{'dim': 200,
 'epoch': 5,
 'loss': 'softmax',
 'lr': 0.1,
 'test_size': 0.2,
 'window': 5,
 'wordNgrams': 1}

And here's the breakdown for each category for the fastText model. One noticeable difference between the spaCy and fastText models is that the spaCy model scores highest in the HOTEL category (酒店) on all three metrics, whereas the fastText model scores highest in the TABLET category (平板) on all three.

 
cat   平板    水果   洗髮水   衣服    酒店
p    0.976  0.892  0.851  0.899  0.816
r    0.972  0.908  0.832  0.885  0.836
f    0.974  0.900  0.841  0.892  0.825
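
To put the two breakdowns side by side, here is a small sketch that computes the per-category gap in F-score between the two models. The numbers are copied from the two tables in this post; note that I can't confirm from the notebook that both models were evaluated on exactly the same validation split, so the gaps are indicative rather than a strict benchmark.

```python
# F-scores per category, copied from the two tables in this post
spacy_f = {"平板": 0.832985, "水果": 0.910798, "洗髮水": 0.843064, "衣服": 0.898566, "酒店": 0.981074}
fasttext_f = {"平板": 0.974, "水果": 0.900, "洗髮水": 0.841, "衣服": 0.892, "酒店": 0.825}

# Positive gap: spaCy scores higher; negative gap: fastText scores higher
gaps = {cat: round(spacy_f[cat] - fasttext_f[cat], 3) for cat in spacy_f}
best_spacy = max(spacy_f, key=spacy_f.get)
best_fasttext = max(fasttext_f, key=fasttext_f.get)

for cat, gap in gaps.items():
    print(f"{cat}\t{gap:+.3f}")
```

The gaps are small for fruit, shampoo, and clothing, and large in opposite directions for tablets and hotels, which is why the two models end up with similar overall scores despite the different per-category profiles.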

Recap

Like fastText, spaCy is good for training text classification models, and the task has become a no-brainer with the release of spaCy v3.0. Even where both methods work equally well, spaCy still has an edge over fastText: with spaCy, you don't need to handle text preprocessing or tokenization yourself. So, watch this space-y!