Classifying customer reviews with spaCy v3.0
This post shows how to train models that can automatically identify the sentiment and product category of customer reviews. With the help of the recently released spaCy v3.0, text classification is an absolute breeze.

Text classification is a very common NLP task. Given enough training data, it's relatively easy to build a model that can automatically classify previously unseen texts in a way that follows the logic of the training data. In this post, I'll go through the steps for building such a model. Specifically, I'll leverage the power of the recently released spaCy v3.0 to train two classification models: one for identifying the sentiment of customer reviews in Chinese as positive or negative (i.e. binary classification), and the other for predicting their product categories out of a list of five (i.e. multiclass classification). If you can't wait to see how spaCy v3.0 has made the training process an absolute breeze, feel free to jump to the section on training the textcat component with the CLI. If not, bear with me on this long journey. All the datasets and models created in this post are hosted in this repo of mine.
I was hoping to build classification models that take traditional Chinese texts as input, but I couldn't find any publicly available datasets of customer reviews in traditional Chinese, so I had to make do with reviews in simplified Chinese. Let's first download the dataset using !wget.
!wget https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/online_shopping_10_cats/online_shopping_10_cats.zip
Then we unzip the downloaded file online_shopping_10_cats.zip with, unsurprisingly, !unzip.
!unzip online_shopping_10_cats.zip
The dataset has three columns: review for review texts, label for sentiment, and cat for product categories. Here's a random sample of five reviews.
import pandas as pd
file_path = '/content/online_shopping_10_cats.csv'
df = pd.read_csv(file_path)
df.sample(5)
There're 62,774 reviews in total.
df.shape
The label column has only two unique values, 1 for positive reviews and 0 for negative ones.
df.label.unique()
The cat column has ten unique values, as the dataset's name suggests.
df.cat.unique()
Before moving on, let's save the raw dataset to Google Drive. The dest variable can be any Google Drive path you like.
dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/"
!cp {file_path} {dest}
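Note that copying to a /content/drive path assumes Google Drive is already mounted in the Colab session. If it isn't, mount it first:

# mount Google Drive so that paths under /content/drive are accessible;
# Colab will prompt you for authorization the first time
from google.colab import drive
drive.mount('/content/drive')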
Now let's do some data filtering. The groupby function from pandas is very useful, and here's how to get the counts of each of the unique values in the cat column.
df.groupby(by='cat').size()
To create a balanced dataset, I decided to keep only the categories with exactly 10,000 reviews each. So we're left with five product categories: 平板 for tablets, 水果 for fruits, 洗发水 for shampoo, 衣服 for clothing, and finally 酒店 for hotels.
There're many ways to filter data in pandas, and my favorite is to first create a filt variable that holds a boolean Series of True and False values, which in this particular case indicates whether the value in the cat column is in the cat_list variable holding the categories to be kept. Then we can simply filter the data with df[filt]. After filtering, the dataset is reduced to 50,000 reviews.
cat_list = ['平板', '水果', '洗发水', '衣服', '酒店']
filt = df['cat'].isin(cat_list)
df = df[filt]
df.shape
Now the dataset is balanced in terms of both the cat and label columns. There're 10,000 reviews for each product category.
df.groupby(by='cat').size()
And there're 25,000 for each of the two sentiments.
df.groupby(by='label').size()
Having made sure the filtered dataset is balanced, we can now reset the index and save the dataset as online_shopping_5_cats_sim.csv.
df.reset_index(inplace=True, drop=True)
df.to_csv(dest+"online_shopping_5_cats_sim.csv", sep=",", index=False)
Let's load back the file we just saved to make sure the dataset is accessible for later use.
df = pd.read_csv(dest+"online_shopping_5_cats_sim.csv")
df.tail()
Next, I converted the reviews from simplified Chinese to traditional Chinese using the OpenCC library.
!pip install OpenCC
OpenCC supports several conversion modes. I specifically used s2twp, which converts simplified Chinese to traditional Chinese adapted to Taiwanese vocabulary. The adaptation is not optimal, but it's better than a mechanical simplified-to-traditional conversion. Here's a random review in the two writing systems.
from opencc import OpenCC
cc = OpenCC('s2twp')
test = df.loc[49995, 'review']
print(test)
test_tra = cc.convert(test)
print(test_tra)
Having made sure the conversion is correct, we can now go ahead and convert all reviews.
df.loc[ : , 'review'] = df['review'].apply(lambda x: cc.convert(x))
Let's make the same change to the cat column.
df.loc[ : , 'cat'] = df['cat'].apply(lambda x: cc.convert(x))
And then we save the converted dataset as online_shopping_5_cats_tra.csv.
df.to_csv(dest+'online_shopping_5_cats_tra.csv', sep=",", index=False)
Let's load back the file we just saved to make sure it's accessible in the future.
df = pd.read_csv(dest+'online_shopping_5_cats_tra.csv')
df.tail()
Before building models, I would normally inspect the dataset. There're many ways to do so. I recently learned that there's a trick on Colab which allows you to filter a dataset in an interactive manner. All it takes is three lines of code.
%load_ext google.colab.data_table
from google.colab import data_table
data_table.DataTable(df, include_index=False, num_rows_per_page=10)
Alternatively, if you'd like to see some sample reviews from all the categories, the groupby function is quite handy. The trick here is to feed pd.DataFrame.sample to the apply function so that you can specify the number of reviews to inspect from each product category.
df.groupby('cat').apply(pd.DataFrame.sample, n=3)[['label', 'review']]
Finally, one of the most powerful ways of exploring a dataset is to use the facets-overview library. Let's first create a column for the length of review texts.
df['len'] = df['review'].apply(len)
df.tail()
Then we install the library.
!pip install facets-overview
In order to render an interactive visualization of the dataset, we first convert the DataFrame object df to the JSON format and then add it to an HTML template, as shown below. If you choose len for Binning | X-Axis, cat for Binning | Y-Axis, and finally review for Label By, you'll see all the reviews beautifully arranged in terms of text length along the X axis and product categories along the Y axis. They're also color-coded with respect to sentiment: blue for positive and red for negative. Clicking on a point of either color shows the values of that particular datapoint. Feel free to play around.
from IPython.core.display import display, HTML
jsonstr = df.to_json(orient='records')
HTML_TEMPLATE = """
<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
<facets-dive id="elem" height="600"></facets-dive>
<script>
var data = {jsonstr};
document.querySelector("#elem").data = data;
</script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
display(HTML(html))
spaCy supports many pretrained models in multiple languages, and offers a convenient widget for working out the code for downloading a particular model for a particular language. I specifically picked zh_core_web_md for Chinese.
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download zh_core_web_md
Everything in spaCy starts with loading a model. The model we downloaded has five built-in components, accessible via the pipe_names attribute.
import spacy
nlp = spacy.load("zh_core_web_md")
nlp.pipe_names
Let's call the nlp object with a sample review text and print out its tokens to make sure the model is working.
test = df.loc[22, 'review']
doc = nlp(test)
for tok in doc:
    print(tok.text, tok.pos_)
To train a classification model with spaCy, we have to convert our dataset to spaCy's format. Before that, we'll create a directory called data under the current working directory cwd. This is where we'll save the data in spaCy format.
I find the os.makedirs function much more useful than the more commonly seen os.mkdir function because the former creates all the intermediate directories for you if they don't exist yet.
import os
def create_dir(dir_name):
    cwd = os.getcwd()
    project_dir = os.path.join(cwd, dir_name)
    # exist_ok=True keeps this from raising an error on re-runs
    # if the directory already exists
    os.makedirs(project_dir, exist_ok=True)
    return project_dir
project_dir = create_dir("data")
project_dir
The first step for the conversion is to create a list of tuples with two elements, one for the text and the other for the text class label. Let's start with the binary classification for sentiment first and generalize to multiclass classification for product categories later.
The easiest way to generate such a list is to create a new column called tuples (or whatever), whose values are derived by applying a lambda function to review for text and label for text class. I learned this trick from this article. Here're the first 10 tuples in the newly created dataset list.
df['tuples'] = df.apply(lambda row: (row['review'], row['label']), axis=1)
dataset = df['tuples'].tolist()
dataset[:10]
Then I split the dataset using train_test_split from scikit-learn: train_data and valid_data hold 80% and 20% of the dataset, respectively.
from sklearn.model_selection import train_test_split
train_data, valid_data = train_test_split(dataset, test_size=0.2, random_state=1)
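To double-check the split, a quick count should show 40,000 training reviews and 10,000 validation reviews out of our 50,000 total.

# sanity check: expect 40000 and 10000
print(len(train_data), len(valid_data))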
Then we need to turn the dataset list into spaCy Doc objects and assign text classes to each of them so that the model can start learning. Assigning text classes is the trickiest part and took me lots of trial and error to figure out. Unfortunately, spaCy's official documentation is not very clear about this part. At first, I used the make_docs function from p-sodmann/Spacy3Textcat to create Doc objects. That was successful, but the trained model gave weird results. Then I realized that the values of text classes need to be either True or False. So here's my revised version of the make_docs function. The trick is to use bool(label), which will be True if the value of the label is 1 and False if it's 0.
from tqdm.auto import tqdm
from spacy.tokens import DocBin
def make_docs(data):
    """
    This will take a list of (text, label) tuples and transform
    them into spaCy Doc objects.
    data: List[Tuple[str, int]]
    returns: List[spacy.tokens.Doc]
    """
    docs = []
    # nlp.pipe(data) is way faster than running nlp(text) on each text;
    # as_tuples=True lets us pass in (text, context) tuples: the first
    # element is treated as the text, the second is returned as is.
    for doc, label in tqdm(nlp.pipe(data, as_tuples=True), total=len(data)):
        # we need to set the (text)cat(egory) for each document
        doc.cats["POSITIVE"] = bool(label)
        doc.cats["NEGATIVE"] = not bool(label)
        # put them into a nice list
        docs.append(doc)
    return docs
spaCy v3.0 introduces the DocBin class, which is the recommended container for serializing a list of Doc objects, much like the pickle format, but better. After we create a list of Doc objects with the make_docs function, we can then generate an instance of the DocBin class for holding that list and save the serialized object to disk by calling the to_disk function. We first do this to valid_data and save the serialized file as valid.spacy in the data directory.
valid_docs = make_docs(valid_data)
doc_bin = DocBin(docs=valid_docs)
doc_bin.to_disk("./data/valid.spacy")
Then we do the same thing to train_data and save it as train.spacy. This will take much more time to run since train_data is much larger than valid_data.
train_docs = make_docs(train_data)
doc_bin = DocBin(docs=train_docs)
doc_bin.to_disk("./data/train.spacy")
And that was the end of the conversion process! Now I'd like to save the serialized data for later use, so I copied it to dest, which is a directory in my Google Drive. Note the -R flag, which is needed for recursively copying whatever is in a directory to another.
source = "/content/data"
dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-sentiment-model/"
!cp -R {source} {dest}
spaCy v3.0 offers a nice and easy way to train spaCy's textcat component with a command line interface (CLI). The first step is to get the base_config.cfg file, which is generated for you if you use spaCy's quickstart widget. The only things that need to be changed are train and dev under [paths], which indicate where the serialized files are stored.
%%writefile base_config.cfg
# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = "data/train.spacy"
dev = "data/valid.spacy"
[system]
gpu_allocator = null
[nlp]
lang = "zh"
pipeline = ["tok2vec","textcat"]
batch_size = 1000
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[components.textcat]
factory = "textcat"
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
[corpora]
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
[training.optimizer]
@optimizers = "Adam.v1"
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
[initialize]
vectors = null
Then we create the full config.cfg configuration file from base_config.cfg using this command.
!python -m spacy init fill-config ./base_config.cfg ./config.cfg
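Optionally, before kicking off the training, we can run spaCy's debug data command on the filled config to validate the training and development corpora and flag potential problems:

!python -m spacy debug data ./config.cfg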
Here comes the fun part, where we can see how a computer model learns over time! The initial score was 43, which is worse than chance. But after only 200 iterations, the score had already skyrocketed to 83! By the end of the training, we got a score of 92, which is pretty awesome considering that we didn't do any text preprocessing.
!python -m spacy train ./config.cfg --output ./output
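As an aside, training on CPU works fine for this model, but if your Colab runtime has a GPU and you've installed spaCy with CUDA support (e.g. pip install 'spacy[cuda111]'), you can speed things up by passing the --gpu-id flag:

!python -m spacy train ./config.cfg --output ./output --gpu-id 0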
spaCy automatically saves two models (model-best and model-last) to the path specified by the --output argument. We'll use the best model for testing. Here's the test script, adapted from p-sodmann/Spacy3Textcat.
%%writefile test_input.py
import spacy
# load the best model from training
nlp = spacy.load("output/model-best")
text = ""
print("type : 'quit' to exit")
# predict the sentiment until someone writes quit
while text != "quit":
text = input("Please enter a review here: ")
doc = nlp(text)
print(doc.cats)
Let's print out some reviews in valid_data for the sake of testing the model.
valid_data[:10]
Now let's run test_input.py to start grilling the model about the sentiment of reviews. The results are quite satisfactory. I even intentionally asked about a mixed review, 他們家的香蕉好吃,但是蘋果卻一點也不甜! (Their bananas are delicious, but their apples are not sweet at all!). And the model gave almost equal scores to the two sentiments!
!python test_input.py
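As an aside, doc.cats is just a dictionary mapping each label to its score, so if you only need the single predicted label rather than the full score dictionary, you can take the key with the highest score. Here's a minimal sketch along the lines of test_input.py:

import spacy

nlp = spacy.load("output/model-best")
doc = nlp("他們家的香蕉好吃,但是蘋果卻一點也不甜!")
# the predicted class is simply the highest-scoring label
predicted = max(doc.cats, key=doc.cats.get)
print(predicted, doc.cats[predicted])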
Finally, I saved the best trained model to Google Drive.
source = "/content/output/model-best"
dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-sentiment-model/"
!cp -R {source} {dest}
Next, we'll go over pretty much the same steps to train a multiclass classification model. This time, the dataset is a list of tuples with review texts and product categories.
df['tuples'] = df.apply(lambda row: (row['review'], row['cat']), axis=1)
dataset = df['tuples'].tolist()
dataset[-10:]
In the make_docs function above, we hardcoded the string names of the text classes, that is, POSITIVE and NEGATIVE. That was not a good option since the function cannot be reused in other cases. So I'd like a general function that works just as well even if we don't know in advance how many text classes there are in the dataset or what their string names are. After doing some experiments, I finally came up with the make_docs_multiclass function, which does the job. The trick here is to create, for every Doc object in the for-loop, a label_dict dictionary with every unique class name as a key and False as its default value. Then we assign label_dict to the cats attribute of the Doc object, that is, doc.cats = label_dict. At last, we update the value of a class in label_dict to True only when that is the class of the Doc object in question.
unique_labels = df.cat.unique().tolist()
def make_docs_multiclass(data):
    """
    This will take a list of (text, label) tuples and transform
    them into spaCy Doc objects.
    data: List[Tuple[str, str]]
    returns: List[spacy.tokens.Doc]
    """
    docs = []
    # nlp.pipe(data) is way faster than running nlp(text) on each text;
    # as_tuples=True lets us pass in (text, context) tuples: the first
    # element is treated as the text, the second is returned as is.
    for doc, label in tqdm(nlp.pipe(data, as_tuples=True), total=len(data)):
        # default every category to False
        label_dict = {cat_name: False for cat_name in unique_labels}
        # we need to set the (text)cat(egory) for each document
        doc.cats = label_dict
        # then flip only this review's actual category to True
        doc.cats[label] = True
        # put them into a nice list
        docs.append(doc)
    return docs
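To make sure the function behaves as intended, we can run it on a single review and check that exactly one of the five categories is True:

# quick sanity check on the first review in the multiclass dataset
sample_docs = make_docs_multiclass(dataset[:1])
print(sample_docs[0].cats)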
Then we split the dataset, convert it to spaCy format, save the converted data, and train a model with the CLI, just like we did earlier.
from sklearn.model_selection import train_test_split
train_data, valid_data = train_test_split(dataset, test_size=0.2, random_state=1)
valid_docs = make_docs_multiclass(valid_data)
doc_bin = DocBin(docs=valid_docs)
doc_bin.to_disk("./data/valid.spacy")
train_docs = make_docs_multiclass(train_data)
doc_bin = DocBin(docs=train_docs)
doc_bin.to_disk("./data/train.spacy")
source = "/content/data"
dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-product-cat-model/"
!cp -R {source} {dest}
After training, we got an overall score of 89, which is pretty awesome.
!python -m spacy train ./config.cfg --output ./output-multiclass
Here's the test script for the multiclass classification model.
%%writefile test_input_multiclass.py
import spacy
# load the best model from training
nlp = spacy.load("output-multiclass/model-best")
text = ""
print("type : 'quit' to exit")
# predict the product category until someone writes quit
while text != "quit":
text = input("Please enter a review here: ")
doc = nlp(text)
print(doc.cats)
Again, let's print out 10 reviews in valid_data for the purpose of testing.
valid_data[:10]
The model works as expected. I intentionally tested the ambiguous review, 這個牌子值得信賴 (This brand is trustworthy.), which could have been a review for tablets, clothing, or shampoo. And our model gave its top three scores to precisely these three classes!
!python test_input_multiclass.py
Now I can rest assured and save the trained model to Google Drive.
source = "/content/output-multiclass/model-best"
dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-product-cat-model/"
!cp -R {source} {dest}
I thought I'd have to write custom functions to evaluate the performance of our classification models. But it turns out that spaCy, our unfailingly considerate friend, has done it for us under the hood. Performance metrics are hidden in the meta.json file under the model-best directory. Here's the content of the meta.json file for our multiclass classification model.
import json
meta_path = "/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-text-classification/spaCy-product-cat-model/model-best/meta.json"
with open(meta_path) as json_file:
    metrics = json.load(json_file)
metrics
Specifically, values of the performance key are what we're looking for.
performance = metrics['performance']
performance
Let's make a nice DataFrame object out of the metrics of the overall performance.
score = performance['cats_score']
auc = performance['cats_macro_auc']
f1 = performance['cats_macro_f']
precision = performance['cats_macro_p']
recall = performance['cats_macro_r']
overall_dict = {'score': score, 'precision': precision, 'recall': recall, 'F1': f1, 'AUC': auc}
overall_df = pd.DataFrame(overall_dict, index=[0])
overall_df
We can also break down the metrics into specific categories, which are saved as the values of the cats_f_per_type key of performance.
per_cat_dict = performance['cats_f_per_type']
per_cat_df = pd.DataFrame(per_cat_dict)
per_cat_df
Previously, I also used the fastText library to train a multiclass classification model on the same dataset. Here're the values of the parameters I set, and the overall accuracy was 0.886, which is pretty close to our spaCy model's.
{'dim': 200,
'epoch': 5,
'loss': 'softmax',
'lr': 0.1,
'test_size': 0.2,
'window': 5,
'wordNgrams': 1}
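For reference, here's a minimal sketch of how a fastText model with these parameters could be trained. The file name train_ft.txt is hypothetical, and the reviews are assumed to have been written out in fastText's expected format (one review per line, prefixed with __label__<category>). Note that fastText's keyword for the context window is ws, and that test_size above belongs to the sklearn train/test split rather than to fastText itself.

import fasttext

# train a supervised fastText classifier with the parameters listed above;
# "train_ft.txt" is a hypothetical file in __label__<category> format
model = fasttext.train_supervised(
    input="train_ft.txt",
    dim=200,
    epoch=5,
    loss="softmax",
    lr=0.1,
    ws=5,
    wordNgrams=1,
)
print(model.predict("這個牌子值得信賴"))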
As for the per-category breakdown, one noticeable difference between the spaCy and fastText models is that the spaCy model scores highest in the hotel category (酒店) across all three metrics, whereas the fastText model consistently performs best in the tablet category (平板).
Like fastText, spaCy is great for training text classification models, and the task has become a no-brainer with the release of spaCy v3.0. Even in cases where both tools perform equally well, spaCy has an edge over fastText: with spaCy, you don't need to deal with text preprocessing or tokenization yourself. So, watch this space-y!