Classifying customer reviews with spaCy v3.0
This post shows how to train models that can automatically identify the sentiment and product category of customer reviews. With the help of the recently released spaCy v3.0, text classification is an absolute breeze.
Text classification is a very common NLP task. Given enough training data, it's relatively easy to build a model that automatically classifies previously unseen texts following the logic of the training data. In this post, I'll go through the steps for building such a model. Specifically, I'll leverage the power of the recently released spaCy v3.0 to train two classification models: one that identifies the sentiment of customer reviews in Chinese as positive or negative (binary classification), and one that predicts their product categories from a list of five (multiclass classification). If you can't wait to see how spaCy v3.0 has made the training process an absolute breeze, feel free to jump to the section on training the textcat component with the CLI. If not, bear with me on this long journey. All the datasets and models created in this post are hosted in this repo of mine.
I was hoping to build classification models that could take traditional Chinese texts as input, but I couldn't find any publicly available datasets of customer reviews in traditional Chinese. So I had to make do with reviews in simplified Chinese. Let's first download the dataset with !wget.
!wget https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/online_shopping_10_cats/online_shopping_10_cats.zip
Then we unzip the downloaded file online_shopping_10_cats.zip with, surprisingly, !unzip.
!unzip online_shopping_10_cats.zip
The dataset has three columns: review for review texts, label for sentiment, and cat for product categories. Here's a random sample of five reviews.
import pandas as pd
file_path = '/content/online_shopping_10_cats.csv'
df = pd.read_csv(file_path)
df.sample(5)
There are 62,774 reviews in total.
df.shape
The label column has only two unique values, 1 for positive reviews and 0 for negative ones.
df.label.unique()
The cat column has ten unique values.
df.cat.unique()
Before moving on, let's save the raw dataset to Google Drive. The dest variable can be any GDrive path you like.
dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/"
!cp {file_path} {dest}
Now let's do some data filtering. The groupby function from pandas is very useful, and here's how to get the counts of each of the unique values in the cat column.
df.groupby(by='cat').size()
To create a balanced dataset, I decided to keep only the categories with exactly 10,000 reviews each. That leaves us with five product categories: 平板 for tablets, 水果 for fruits, 洗发水 for shampoo, 衣服 for clothing, and finally 酒店 for hotels.
There are many ways to filter data in pandas. My favorite is to first create a filt variable holding a boolean Series, which in this particular case indicates whether the value in the cat column is in the cat_list variable of categories to keep. Then we can simply filter the data with df[filt]. After filtering, the dataset is reduced to 50,000 reviews.
cat_list = ['平板', '水果', '洗发水', '衣服', '酒店']
filt = df['cat'].isin(cat_list)
df = df[filt]
df.shape
Now the dataset is balanced in terms of both the cat and label columns. There are 10,000 reviews for each product category.
df.groupby(by='cat').size()
And there are 25,000 reviews for each of the two sentiments.
df.groupby(by='label').size()
Having made sure the filtered dataset is balanced, we can now reset the index and save the dataset as online_shopping_5_cats_sim.csv.
df.reset_index(inplace=True, drop=True)
df.to_csv(dest+"online_shopping_5_cats_sim.csv", sep=",", index=False)
Let's load back the file we just saved to make sure the dataset is accessible for later use.
df = pd.read_csv(dest+"online_shopping_5_cats_sim.csv")
df.tail()
Next, I converted the reviews from simplified Chinese to traditional Chinese using the OpenCC library.
!pip install OpenCC
OpenCC has many conversion methods. I specifically used s2twp, which converts simplified Chinese to traditional Chinese adapted to Taiwanese vocabulary. The adaptation is not optimal, but it's better than a mechanical simplified-to-traditional conversion. Here's a random review in the two writing systems.
from opencc import OpenCC
cc = OpenCC('s2twp')
test = df.loc[49995, 'review']
print(test)
test_tra = cc.convert(test)
print(test_tra)
Having made sure the conversion is correct, we can now go ahead and convert all reviews.
df.loc[ : , 'review'] = df['review'].apply(lambda x: cc.convert(x))
Let's make the same change to the cat column.
df.loc[ : , 'cat'] = df['cat'].apply(lambda x: cc.convert(x))
And then we save the converted dataset as online_shopping_5_cats_tra.csv.
df.to_csv(dest+'online_shopping_5_cats_tra.csv', sep=",", index=False)
Let's load back the file just saved to make sure it's accessible in the future.
df = pd.read_csv(dest+'online_shopping_5_cats_tra.csv')
df.tail()
Before building models, I would normally inspect the dataset. There are many ways to do so. I recently learned a trick on Colab that allows you to filter a dataset interactively. All it takes is three lines of code.
%load_ext google.colab.data_table
from google.colab import data_table
data_table.DataTable(df, include_index=False, num_rows_per_page=10)
Alternatively, if you'd like to see some sample reviews from all the categories, the groupby function is quite handy. The trick here is to feed pd.DataFrame.sample to the apply function so that you can specify the number of reviews to inspect from each product category.
df.groupby('cat').apply(pd.DataFrame.sample, n=3)[['label', 'review']]
Finally, one of the most powerful ways of exploring a dataset is to use the facets-overview library. Let's first create a column for the length of the review texts.
df['len'] = df['review'].apply(len)
df.tail()
Then we install the library.
!pip install facets-overview
In order to render an interactive visualization of the dataset, we first convert the DataFrame object df to JSON and then add it to an HTML template, as shown below. If you choose len for Binning | X-Axis, cat for Binning | Y-Axis, and finally review for Label By, you'll see all the reviews beautifully arranged in terms of text length along the X axis and product categories along the Y axis. They're also color-coded with respect to sentiment, blue for positive and red for negative. Clicking on a point of either color shows the values of that particular data point. Feel free to play around.
from IPython.core.display import display, HTML
jsonstr = df.to_json(orient='records')
HTML_TEMPLATE = """
<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
<facets-dive id="elem" height="600"></facets-dive>
<script>
var data = {jsonstr};
document.querySelector("#elem").data = data;
</script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
display(HTML(html))
spaCy supports many pretrained models in multiple languages and offers a convenient widget for working out the code to download a particular model for a particular language. I specifically picked zh_core_web_md for Chinese.
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download zh_core_web_md
Everything in spaCy starts with loading a model. The model we downloaded has five built-in components, accessible via the pipe_names attribute.
import spacy
nlp = spacy.load("zh_core_web_md")
nlp.pipe_names
Let's call the nlp object with a sample review text and print out its tokens to make sure the model is working.
test = df.loc[22, 'review']
doc = nlp(test)
for tok in doc:
    print(tok.text, tok.pos_)
To train a classification model with spaCy, we have to convert our dataset to spaCy's format. Before that, we'll create a directory called data under the current working directory cwd. This is where we'll save the data in spaCy format. I find the os.makedirs() function much more useful than the more commonly seen os.mkdir() because the former creates all the intermediate directories for you if they don't exist yet.
import os
def create_dir(dir_name):
    cwd = os.getcwd()
    project_dir = os.path.join(cwd, dir_name)
    os.makedirs(project_dir)
    return project_dir
project_dir = create_dir("data")
project_dir
The first step for the conversion is to create a list of tuples with two elements, one for the text and the other for the text class label. Let's start with the binary classification for sentiment first and generalize to multiclass classification for product categories later.
The easiest way to generate such a list is to create a new column called tuples (or whatever), whose values are derived by applying a lambda function to review for the text and label for the text class. I learned this trick from this article. Here are the first 10 tuples in the newly created dataset list.
df['tuples'] = df.apply(lambda row: (row['review'], row['label']), axis=1)
dataset = df['tuples'].tolist()
dataset[:10]
Then I split the dataset using train_test_split from scikit-learn. The train_data and valid_data hold 80% and 20% of the dataset, respectively.
from sklearn.model_selection import train_test_split
train_data, valid_data = train_test_split(dataset, test_size=0.2, random_state=1)
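As a quick sanity check (my own addition, not in the original notebook), we can confirm that the split sizes are roughly what we expect.
# the 80/20 split should give roughly 40,000 training and 10,000 validation reviews
print(len(train_data), len(valid_data))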
Then we need to turn the dataset list into spaCy Doc objects and assign text classes to each of them so that the model can start learning. Assigning text classes is the trickiest part and took me lots of trial and error to figure out. Unfortunately, spaCy's official documentation is not very clear about this part. At first, I used the make_docs function from p-sodmann/Spacy3Textcat to create Doc objects. That was successful, but the trained model gave weird results. Then I realized that the values of the text classes need to be either True or False. So here's my revised version of the make_docs function. The trick here is to use bool(label), which will be True if the value of the label is 1 and False if it is 0.
from tqdm.auto import tqdm
from spacy.tokens import DocBin
def make_docs(data):
    """
    Take a list of (text, label) tuples and transform them into spaCy documents.
    data: List[Tuple[str, int]]
    returns: List[spacy.tokens.Doc]
    """
    docs = []
    # nlp.pipe() is way faster than running nlp(text) for each text;
    # as_tuples lets us pass in (text, label) tuples: the first element is
    # treated as the text and the second one is returned as is.
    for doc, label in tqdm(nlp.pipe(data, as_tuples=True), total=len(data)):
        # we need to set the (text)cat(egory) for each document
        doc.cats["POSITIVE"] = bool(label)
        doc.cats["NEGATIVE"] = not bool(label)
        # put them into a nice list
        docs.append(doc)
    return docs
spaCy v3.0 introduces the DocBin class, which is the recommended container for serializing a list of Doc objects, much like the pickle format, but better. After we create a list of Doc objects with the make_docs function, we can generate an instance of the DocBin class to hold that list and save the serialized object to disk by calling its to_disk method. We first do this to valid_data and save the serialized file as valid.spacy in the data directory.
valid_docs = make_docs(valid_data)
doc_bin = DocBin(docs=valid_docs)
doc_bin.to_disk("./data/valid.spacy")
Then we do the same thing to train_data and save it as train.spacy. This will take much more time to run since train_data is much larger than valid_data.
train_docs = make_docs(train_data)
doc_bin = DocBin(docs=train_docs)
doc_bin.to_disk("./data/train.spacy")
And that's the end of the conversion process! Now I'd like to save the serialized data for later use, so I copied it to dest, which is a directory in my Google Drive. Note that you need the -R flag when copying everything in a directory to another location.
source = "/content/data"
dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-sentiment-model/"
!cp -R {source} {dest}
spaCy v3.0 offers a nice and easy way to train spaCy's textcat component with a command line interface (CLI). The first step is to get the base_config.cfg file, which is generated for you if you use spaCy's quickstart widget. The only things that need to be changed are train and dev under [paths], which indicate where the serialized files are stored.
%%writefile base_config.cfg
# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = "data/train.spacy"
dev = "data/valid.spacy"
[system]
gpu_allocator = null
[nlp]
lang = "zh"
pipeline = ["tok2vec","textcat"]
batch_size = 1000
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[components.textcat]
factory = "textcat"
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
[corpora]
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
[training.optimizer]
@optimizers = "Adam.v1"
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
[initialize]
vectors = null
Then we create the full config.cfg configuration file from base_config.cfg with this command.
!python -m spacy init fill-config ./base_config.cfg ./config.cfg
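Optionally, and this isn't part of the original workflow, you can also run spaCy's debug data command against the filled config before training; it checks that the .spacy files referenced under [paths] load correctly and reports label counts and other potential issues.
!python -m spacy debug data ./config.cfg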
Here comes the fun part, where we can see how a computer model learns over time! The initial score was 43, which is worse than chance. But after only 200 iterations, the score had already skyrocketed to 83! By the end of the training, we got a score of 92, which is pretty awesome, considering that we didn't do any text preprocessing.
!python -m spacy train ./config.cfg --output ./output
spaCy automatically saves two models (model-best and model-last) to the path specified by the --output argument. We'll use the best model for testing. Here's the py file adapted from p-sodmann/Spacy3Textcat.
%%writefile test_input.py
import spacy
# load the best model from training
nlp = spacy.load("output/model-best")
text = ""
print("type : 'quit' to exit")
# predict the sentiment until someone writes quit
while text != "quit":
text = input("Please enter a review here: ")
doc = nlp(text)
print(doc.cats)
%%writefile test_input.py
import spacy
# load the best model from training
nlp = spacy.load("/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-text-classification/spaCy-sentiment-model/model-best")
text = ""
print("type : 'quit' to exit")
# predict the sentiment until someone writes quit
while text != "quit":
text = input("Please enter a review here: ")
doc = nlp(text)
print(doc.cats)
Let's print out some reviews in valid_data for the sake of testing the model.
valid_data[:10]
Now let's run test_input.py to start grilling the model about the sentiment of reviews. The results are quite satisfactory. I even intentionally asked about a mixed review, 他們家的香蕉好吃,但是蘋果卻一點也不甜! (Their bananas are delicious, but their apples are not sweet at all!). And the model gave almost equal scores to the two sentiments!
!python test_input.py
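If you'd rather score reviews in bulk than type them in one at a time, here's a minimal sketch of my own (not from the original workflow) that runs the best model over a few validation reviews with nlp.pipe and picks the label with the highest score.
import spacy
# load the best sentiment model saved by the training run
nlp_sentiment = spacy.load("output/model-best")
texts = [text for text, label in valid_data[:10]]
for doc in nlp_sentiment.pipe(texts):
    predicted = max(doc.cats, key=doc.cats.get)  # label with the highest score
    print(f"{predicted:<9} {doc.text[:30]}")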
Finally, I saved the best trained model to Google Drive.
source = "/content/output/model-best"
dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-sentiment-model/"
!cp -R {source} {dest}
Next, we'll go over pretty much the same steps to train a multiclass classification model. This time, the dataset is a list of tuples with review texts and product categories.
df['tuples'] = df.apply(lambda row: (row['review'], row['cat']), axis=1)
dataset = df['tuples'].tolist()
dataset[-10:]
In the make_docs function above, we hardcoded the string names of the text classes, that is, POSITIVE and NEGATIVE. That was not a good option since the function cannot be reused in other cases. So I'd like a general function that works just as well even if we don't know in advance how many text classes there are in the dataset or what their string names are. After some experimenting, I finally came up with the make_docs_multiclass function, which does the job. The trick here is to create the label_dict dictionary with every unique class name as a key and False as its default value for every Doc object in the for loop. Then we assign label_dict to the cats attribute of every Doc object, that is, doc.cats = label_dict. Finally, we update the value of a class in label_dict to True only when that is the class of the Doc object in question.
unique_labels = df.cat.unique().tolist()
def make_docs_multiclass(data):
    """
    Take a list of (text, category) tuples and transform them into spaCy documents.
    data: List[Tuple[str, str]]
    returns: List[spacy.tokens.Doc]
    """
    docs = []
    # nlp.pipe() is way faster than running nlp(text) for each text;
    # as_tuples lets us pass in (text, category) tuples: the first element is
    # treated as the text and the second one is returned as is.
    for doc, label in tqdm(nlp.pipe(data, as_tuples=True), total=len(data)):
        # start with every category set to False
        label_dict = {cat: False for cat in unique_labels}
        # we need to set the (text)cat(egory) for each document
        doc.cats = label_dict
        # flip only the document's own category to True
        doc.cats[label] = True
        # put them into a nice list
        docs.append(doc)
    return docs
Then we split the dataset, convert it to spaCy format, save the converted data, and train a model with the CLI, just like we did earlier.
from sklearn.model_selection import train_test_split
train_data, valid_data = train_test_split(dataset, test_size=0.2, random_state=1)
valid_docs = make_docs_multiclass(valid_data)
doc_bin = DocBin(docs=valid_docs)
doc_bin.to_disk("./data/valid.spacy")
train_docs = make_docs_multiclass(train_data)
doc_bin = DocBin(docs=train_docs)
doc_bin.to_disk("./data/train.spacy")
source = "/content/data"
dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-product-cat-model/"
!cp -R {source} {dest}
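As another sanity check of my own (not in the original notebook), we can verify that every converted document has exactly one category set to True before training.
# each doc.cats dict should contain the five categories with exactly one True value
assert all(sum(doc.cats.values()) == 1 for doc in valid_docs + train_docs)
print(valid_docs[0].cats)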
After training, we got an overall score of 89, which is pretty awesome.
!python -m spacy train ./config.cfg --output ./output-multiclass
Here's the py file for testing the multiclass classification model.
%%writefile test_input_multiclass.py
import spacy
# load the best model from training
nlp = spacy.load("output-multiclass/model-best")
text = ""
print("type : 'quit' to exit")
# predict the product category until someone writes quit
while text != "quit":
text = input("Please enter a review here: ")
doc = nlp(text)
print(doc.cats)
Again, let's print out 10 reviews in valid_data for the purpose of testing.
%%writefile test_input_multiclass.py
import spacy
# load the best model from training
nlp = spacy.load("/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-text-classification/spaCy-product-cat-model/model-best")
text = ""
print("type : 'quit' to exit")
# predict the product category until someone writes quit
while text != "quit":
text = input("Please enter a review here: ")
doc = nlp(text)
print(doc.cats)
valid_data[:10]
The model works as expected. I intentionally tested the ambiguous review, 這個牌子值得信賴 (This brand is trustworthy.), which could have been a review for tablets, clothing, or shampoo. And the model gave its top three scores to precisely these three classes!
!python test_input_multiclass.py
Now I can rest assured and save the trained model to Google Drive.
source = "/content/output-multiclass/model-best"
dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-product-cat-model/"
!cp -R {source} {dest}
I thought I'd have to write custom functions to evaluate the performance of our classification models. But it turns out that spaCy, our unfailingly considerate friend, has done it for us under the hood. Performance metrics are hidden in the meta.json file under the model-best directory. Here's the content of the meta.json file for our multiclass classification model.
import json
meta_path = "/content/drive/MyDrive/Python/NLP/shopping_comments/spaCy-text-classification/spaCy-product-cat-model/model-best/meta.json"
with open(meta_path) as json_file:
    metrics = json.load(json_file)
metrics
Specifically, values of the performance key are what we're looking for.
performance = metrics['performance']
performance
Let's make a nice DataFrame object out of the metrics of the overall performance.
score = performance['cats_score']
auc = performance['cats_macro_auc']
f1 = performance['cats_macro_f']
precision = performance['cats_macro_p']
recall = performance['cats_macro_r']
overall_dict = {'score': score, 'precision': precision, 'recall': recall, 'F1': f1, 'AUC': auc}
overall_df = pd.DataFrame(overall_dict, index=[0])
overall_df
We can also break down the metrics into specific categories, which are saved as the values of the cats_f_per_type key of performance.
per_cat_dict = performance['cats_f_per_type']
per_cat_df = pd.DataFrame(per_cat_dict)
per_cat_df
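Alternatively, instead of reading meta.json, the same metrics can be recomputed with spaCy's evaluate command. This is an optional step I'm adding here; the paths assume the multiclass model and the serialized validation data from the steps above are still in place.
!python -m spacy evaluate ./output-multiclass/model-best ./data/valid.spacy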
Previously, I also used the fasttext library to train a multiclass classification model on the same dataset. Here are the values of the parameters I set. The overall accuracy was 0.886, which is pretty close to that of our spaCy model.
{'dim': 200,
'epoch': 5,
'loss': 'softmax',
'lr': 0.1,
'test_size': 0.2,
'window': 5,
'wordNgrams': 1}
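For reference, the fastText training roughly followed this pattern. This is a simplified sketch under the assumption that the reviews were segmented and written out in fastText's __label__<category> <text> format; train.txt and valid.txt are hypothetical file names, test_size above refers to the train/validation split rather than a fastText parameter, and the window setting maps to fastText's ws argument.
import fasttext
# assumes each line of train.txt holds a segmented review prefixed with __label__<category>
model = fasttext.train_supervised(
    input="train.txt",
    dim=200,        # embedding dimension
    epoch=5,
    loss="softmax",
    lr=0.1,
    ws=5,           # context window size
    wordNgrams=1,
)
# returns (number of samples, precision@1, recall@1) on the held-out file
print(model.test("valid.txt"))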
And here's the breakdown for each category. One noticeable difference between the spaCy and fastText models is that the spaCy model scores highest in the HOTEL category (酒店) across all three metrics, whereas the fastText model consistently performs best in the TABLET category (平板).
Like fastText, spaCy is good for training text classification models, and this task has become a no-brainer with the release of spaCy v3.0. Even in cases where both methods work equally well, spaCy still has an edge over fastText. That is, with spaCy you don't need to deal with text preprocessing or tokenization. So, watch this space-y!