NLP demos and talks made with Jupyter Notebook and reveal.js
Abstract: In this talk, we showcase some common applications of natural language processing technologies in business. Then we introduce Colab, the coding environment adopted throughout this workshop. Our NLP journey begins with learning how to access pretrained NLP models through spaCy for various functionalities, including tokenization, part-of-speech tagging, named entity recognition, and dependency parsing.
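As a taste of these functionalities, here is a minimal sketch that loads spaCy's small English pipeline (assuming it has been installed with `python -m spacy download en_core_web_sm`) and prints tokens, part-of-speech tags, dependency relations, and named entities:

```python
import spacy

# Load the pretrained small English pipeline.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Tokenization, part-of-speech tagging, and dependency parsing.
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named entity recognition.
for ent in doc.ents:
    print(ent.text, ent.label_)
```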
Abstract: A topic model in NLP consists of two probability distributions: a distribution of topics over documents and a distribution of words over topics. In this talk, we go over the pipeline for creating a topic model with Gensim, covering text preprocessing, bag-of-words representations, N-grams, Latent Dirichlet Allocation (LDA), and the Hierarchical Dirichlet Process (HDP).
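A minimal sketch of that Gensim pipeline, on a toy corpus whose texts and parameter values are illustrative only:

```python
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess

texts = [
    "Stock markets rallied after the central bank cut interest rates.",
    "The new smartphone features a faster chip and a better camera.",
    "Investors worry that inflation will force another rate hike.",
]
tokens = [simple_preprocess(t) for t in texts]        # lowercase and tokenize
dictionary = corpora.Dictionary(tokens)               # map each token to an id
corpus = [dictionary.doc2bow(doc) for doc in tokens]  # bag-of-words vectors

# Fit LDA with a fixed number of topics; gensim.models.HdpModel accepts the
# same corpus when the number of topics should be inferred instead.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```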
Abstract: Text clustering is an unsupervised way of grouping texts into clusters. In this talk, we go over the pipeline for building a text clustering model using the K-means algorithm, where K is a predefined number of clusters. One prerequisite for K-means is vectorizing the texts, and we will learn how to use TF-IDF as a baseline approach to text vectorization.
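A minimal sketch of the pipeline with scikit-learn, on a toy corpus (K=2 is arbitrary here):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "The match ended in a penalty shootout.",
    "The striker scored twice in the second half.",
    "Quarterly earnings beat analyst expectations.",
    "Shares fell sharply after the profit warning.",
]
# TF-IDF turns each text into a sparse vector of weighted term frequencies.
X = TfidfVectorizer(stop_words="english").fit_transform(texts)

# K must be chosen in advance; 2 suits this two-theme toy corpus.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignment for each text
```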
Abstract: Last week, we implemented the K-means algorithm for text clustering by vectorizing texts with TF-IDF. This week, we continue to experiment with text clustering, but with a more sophisticated approach to text vectorization: word embeddings. Common architectures for word embeddings include Word2vec by Google, fastText by Facebook, and GloVe by the Stanford NLP team. We will train a fastText embedding model on the fly with Gensim and apply the same K-means algorithm as last week.
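A minimal sketch, assuming Gensim 4.x: train fastText on the fly, average word vectors into document vectors (a simple baseline for going from words to texts), then reuse K-means:

```python
import numpy as np
from gensim.models import FastText
from sklearn.cluster import KMeans

sentences = [
    ["the", "match", "ended", "in", "a", "shootout"],
    ["the", "striker", "scored", "twice"],
    ["earnings", "beat", "analyst", "expectations"],
    ["shares", "fell", "after", "the", "profit", "warning"],
]
# Gensim 4.x API; older releases use `size`/`iter` for `vector_size`/`epochs`.
model = FastText(sentences=sentences, vector_size=50, window=3,
                 min_count=1, epochs=20)

# Represent each document as the mean of its word vectors.
doc_vecs = np.array([np.mean([model.wv[w] for w in s], axis=0)
                     for s in sentences])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_vecs)
print(labels)
```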
Abstract: While word embedding models like Word2vec, fastText, and GloVe are powerful, they are essentially large lookup tables that always map a token to the same fixed-length dense vector, and thus fail to capture nuances of context. This week, we build a vector-based search engine for texts by leveraging pretrained models for dynamic embeddings, namely Google’s Universal Sentence Encoder (USE) and BERT (Bidirectional Encoder Representations from Transformers), which calculate vectors dynamically from the context of a given stretch of text. To make the vector-based search fast, we tap into Facebook’s FAISS library to create an embedding index, which greatly speeds up the search for similar vectors.
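The sketch below illustrates the FAISS side of such a search engine. For brevity it encodes texts with the sentence-transformers library as a stand-in for a BERT-style encoder (the model name is an assumption, not necessarily the one used in the talk); USE loaded via tensorflow_hub would slot into the same place:

```python
import faiss
from sentence_transformers import SentenceTransformer

texts = [
    "How do I reset my password?",
    "The delivery arrived two days late.",
    "What payment methods do you accept?",
]
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed BERT-style encoder
vectors = encoder.encode(texts).astype("float32")  # FAISS expects float32

index = faiss.IndexFlatL2(vectors.shape[1])        # exact L2-distance index
index.add(vectors)

query = encoder.encode(["I forgot my login credentials"]).astype("float32")
distances, ids = index.search(query, 2)            # two nearest neighbors
print([texts[i] for i in ids[0]])
```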
Abstract: Text classification is a prominent example of supervised learning in NLP, whereby texts are automatically assigned a category. There are numerous use cases for text classification, including email spam detection, hate speech detection, customer sentiment analysis, customer support systems, news classification, and even chatbot intent classification. This week, we go through the steps for training text classification models with traditional machine learning methods: a TF-IDF vectorizer feeding classifiers such as Naive Bayes, Support Vector Machines, and Logistic Regression. We also look into how to evaluate the trained models and explain why they work in the first place.
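A minimal sketch of such a pipeline in scikit-learn, on a tiny, purely illustrative spam/ham dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["free prize, click now", "meeting moved to 3pm",
         "win cash today", "lunch tomorrow?"] * 10
labels = ["spam", "ham", "spam", "ham"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

# TF-IDF features feeding a linear classifier; MultinomialNB or LinearSVC
# can be swapped in for LogisticRegression.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```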
Abstract: Last week, we trained text classification models with traditional machine learning methods using scikit-learn. This week, we move one step forward to carry out the same task, but leverage the power of neural networks using spaCy. We show that training and evaluation can be highly streamlined, replicable, and efficient when done through spaCy’s command-line interface. While evaluating the classification models, we also introduce concepts such as the Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC).
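The ROC and AUC ideas can be illustrated independently of spaCy with scikit-learn on toy scores; the spaCy training itself runs from the shell (in spaCy v3, something along the lines of `python -m spacy train config.cfg --output ./output`, with paths depending on your config and data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # gold labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.3])  # predicted probabilities

# The ROC curve traces the true-positive rate against the false-positive
# rate as the decision threshold varies; AUC summarizes it in one number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))  # 1.0 = perfect, 0.5 = chance
```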
Abstract: Named entity recognition (NER) is a highly valuable AI capability, widely used in industries like e-commerce, social media, and FinTech. This week, we show how to train and evaluate NER models using spaCy’s command-line interface, a process very similar to the one we followed last week for text classification. We then compare the performance of the newly trained model with that of a pretrained spaCy model.
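A minimal sketch of the comparison, assuming a spaCy v3-style run of `spacy train` wrote its best model to a hypothetical ./output directory:

```python
import spacy

custom = spacy.load("./output/model-best")  # hypothetical path from `spacy train`
pretrained = spacy.load("en_core_web_sm")   # off-the-shelf English pipeline

text = "Acme Corp hired Jane Doe as CFO in London last March."
for name, nlp in [("custom", custom), ("pretrained", pretrained)]:
    doc = nlp(text)
    print(name, [(ent.text, ent.label_) for ent in doc.ents])
```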