NLP demos and talks made with Jupyter Notebook and reveal.js
Abstract: In this talk, we showcase some common applications of natural language processing technologies in business. Then we introduce Colab, the coding environment adopted throughout this workshop. Our NLP journey begins with learning how to access pretrained NLP models through spaCy for various functionalities, including tokenization, part-of-speech tagging, named entity recognition, and dependency parsing.
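As a taste of these functionalities, here is a minimal sketch that loads spaCy's small English pipeline (assuming it has been installed with `python -m spacy download en_core_web_sm`) and prints tokens, part-of-speech tags, dependency relations, and named entities:

```python
import spacy

# Load the pretrained small English pipeline.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Tokenization, part-of-speech tagging, and dependency parsing.
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named entity recognition.
for ent in doc.ents:
    print(ent.text, ent.label_)
```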
Abstract: A topic model in NLP consists of two probability distributions: a distribution of topics over documents and a distribution of words over topics. In this talk, we go over the pipeline for creating a topic model with Gensim, covering text preprocessing, bag-of-words representations, N-grams, Latent Dirichlet Allocation (LDA), and the Hierarchical Dirichlet Process (HDP).
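A minimal sketch of that Gensim pipeline, on a toy corpus whose texts and parameter values are illustrative only:

```python
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess

texts = [
    "Stock markets rallied after the central bank cut interest rates.",
    "The new smartphone features a faster chip and a better camera.",
    "Investors worry that inflation will force another rate hike.",
]
tokens = [simple_preprocess(t) for t in texts]        # lowercase and tokenize
dictionary = corpora.Dictionary(tokens)               # map each token to an id
corpus = [dictionary.doc2bow(doc) for doc in tokens]  # bag-of-words vectors

# Fit LDA with a fixed number of topics; gensim.models.HdpModel accepts the
# same corpus when the number of topics should be inferred instead.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```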
Abstract: Text clustering is an unsupervised way of grouping texts into clusters. In this talk, we go over the pipeline for building a text clustering model using the K-means algorithm, where K is a predefined number of clusters. One prerequisite for K-means is vectorizing the texts, and we will learn how to use TF-IDF as a baseline approach to text vectorization.
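A minimal sketch of the pipeline with scikit-learn, on a toy corpus (K=2 is arbitrary here):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "The match ended in a penalty shootout.",
    "The striker scored twice in the second half.",
    "Quarterly earnings beat analyst expectations.",
    "Shares fell sharply after the profit warning.",
]
# TF-IDF turns each text into a sparse vector of weighted term frequencies.
X = TfidfVectorizer(stop_words="english").fit_transform(texts)

# K must be chosen in advance; 2 suits this two-theme toy corpus.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignment for each text
```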
Abstract: Last week, we implemented the K-means algorithm for text clustering by vectorizing texts with TF-IDF. This week, we continue to experiment with text clustering, but with a more sophisticated approach to text vectorization: word embeddings. Common architectures for word embeddings include Word2vec by Google, fastText by Facebook, and GloVe by the Stanford NLP team. We will train a fastText embedding model on the fly with Gensim and apply the same K-means algorithm as last week.
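A minimal sketch, assuming Gensim 4.x: train fastText on the fly, average word vectors into document vectors (a simple baseline for going from words to texts), then reuse K-means:

```python
import numpy as np
from gensim.models import FastText
from sklearn.cluster import KMeans

sentences = [
    ["the", "match", "ended", "in", "a", "shootout"],
    ["the", "striker", "scored", "twice"],
    ["earnings", "beat", "analyst", "expectations"],
    ["shares", "fell", "after", "the", "profit", "warning"],
]
# Gensim 4.x API; older releases use `size`/`iter` for `vector_size`/`epochs`.
model = FastText(sentences=sentences, vector_size=50, window=3,
                 min_count=1, epochs=20)

# Represent each document as the mean of its word vectors.
doc_vecs = np.array([np.mean([model.wv[w] for w in s], axis=0)
                     for s in sentences])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_vecs)
print(labels)
```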
Abstract: While word embedding models like Word2vec, fastText, and GloVe are powerful, they are essentially large lookup tables that always map a token to the same fixed-length dense vector, and thus fail to capture nuances of context. This week, we build a vector-based search engine for texts by leveraging pretrained models for dynamic embeddings, namely Google’s Universal Sentence Encoder (USE) and BERT (Bidirectional Encoder Representations from Transformers), which calculate vectors dynamically from the context of a given stretch of text. To make the vector-based search fast, we tap into Facebook’s FAISS library to create an embedding index, which greatly speeds up the search for similar vectors.
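The sketch below illustrates the FAISS side of such a search engine. For brevity it encodes texts with the sentence-transformers library as a stand-in for a BERT-style encoder (the model name is an assumption, not necessarily the one used in the talk); USE loaded via tensorflow_hub would slot into the same place:

```python
import faiss
from sentence_transformers import SentenceTransformer

texts = [
    "How do I reset my password?",
    "The delivery arrived two days late.",
    "What payment methods do you accept?",
]
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed BERT-style encoder
vectors = encoder.encode(texts).astype("float32")  # FAISS expects float32

index = faiss.IndexFlatL2(vectors.shape[1])        # exact L2-distance index
index.add(vectors)

query = encoder.encode(["I forgot my login credentials"]).astype("float32")
distances, ids = index.search(query, 2)            # two nearest neighbors
print([texts[i] for i in ids[0]])
```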
Abstract: Text classification is a prominent example of supervised learning in NLP, whereby texts are automatically assigned a category. There are numerous use cases for text classification, including email spam detection, hate speech detection, customer sentiment analysis, customer support systems, news classification, and even chatbot intent classification. This week, we go through the steps for training text classification models with traditional machine learning methods: a TF-IDF vectorizer feeding classifiers such as Naive Bayes, Support Vector Machines, and Logistic Regression. We also look into how to evaluate the trained models and explain why they work in the first place.
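A minimal sketch of such a pipeline in scikit-learn, on a tiny, purely illustrative spam/ham dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["free prize, click now", "meeting moved to 3pm",
         "win cash today", "lunch tomorrow?"] * 10
labels = ["spam", "ham", "spam", "ham"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

# TF-IDF features feeding a linear classifier; MultinomialNB or LinearSVC
# can be swapped in for LogisticRegression.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```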
Abstract: Last week, we trained text classification models with traditional machine learning methods using scikit-learn. This week, we move one step forward to carry out the same task, but leverage the power of neural networks using spaCy. We show that training and evaluation can be highly streamlined, replicable, and efficient when done through spaCy’s command-line interface. While evaluating the classification models, we also introduce concepts such as the Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC).
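The ROC and AUC ideas can be illustrated independently of spaCy with scikit-learn on toy scores; the spaCy training itself runs from the shell (in spaCy v3, something along the lines of `python -m spacy train config.cfg --output ./output`, with paths depending on your config and data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # gold labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.3])  # predicted probabilities

# The ROC curve traces the true-positive rate against the false-positive
# rate as the decision threshold varies; AUC summarizes it in one number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))  # 1.0 = perfect, 0.5 = chance
```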
Abstract: Named entity recognition (NER) is a highly valuable AI capability, widely used in industries like e-commerce, social media, and FinTech. This week, we show how to train and evaluate NER models using spaCy’s command-line interface, a process very similar to the one we followed last week for text classification. We then compare the performance of the newly trained model with that of a pretrained spaCy model.
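A minimal sketch of the comparison, assuming a spaCy v3-style run of `spacy train` wrote its best model to a hypothetical ./output directory:

```python
import spacy

custom = spacy.load("./output/model-best")  # hypothetical path from `spacy train`
pretrained = spacy.load("en_core_web_sm")   # off-the-shelf English pipeline

text = "Acme Corp hired Jane Doe as CFO in London last March."
for name, nlp in [("custom", custom), ("pretrained", pretrained)]:
    doc = nlp(text)
    print(name, [(ent.text, ent.label_) for ent in doc.ents])
```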