NLP: Text Classification and Language Modelling
Natural Language Processing (NLP) has grown by leaps and bounds over the past few years. This guide delves into fundamental concepts such as text classification, language modeling, and neural networks, providing a technical yet accessible overview suitable for both beginners and seasoned practitioners.
Text Classification
What is Text Classification?
Text classification involves categorizing text into predefined labels based on various traits such as topic, sentiment, subjectivity/objectivity, and intent. For example, determining whether a movie review is positive or negative falls under sentiment analysis, a type of text classification.
The Basics
At its core, machine learning (ML) for text classification works with paired data $(X, y)$, where $X$ represents the input text and $y$ is the corresponding label. The process involves:
- Training Data: Feeding the model with labeled examples.
- Learning Algorithm: Using algorithms to learn patterns from the data.
- Feature Extractor: Identifying relevant features from the text.
- Scoring Function: Assigning scores to different classes based on the features.
- Model Preparation: Building a model that can make predictions on new, unseen data.
Once trained, the model performs inference on the test set to classify new text samples.
Generative vs. Discriminative Models
Generative Models
Generative models calculate the probability of the input data itself. They focus on modeling the joint probability $P(X, y)$ or the stand-alone probability $P(X)$.
Discriminative Models
Discriminative models, on the other hand, calculate the probability of a label given the data, denoted as $P(y \mid X)$. They directly model the decision boundary between classes, which often leads to better performance in classification tasks.
Language Modeling
Language modeling is the task of calculating the probability of a sentence. Formally, it is expressed as:
$$P(X) = \prod_{i=1}^{|X|} P(x_i \mid x_1, \ldots, x_{i-1})$$
The challenge lies in predicting these probabilities accurately.
Unigram Modeling
A simple approach is unigram modeling, which assumes each word is independent of the others. The probability of each word is estimated with Maximum Likelihood Estimation (MLE) from training counts:
$$P_{\text{MLE}}(x_i) = \frac{c_{\text{train}}(x_i)}{\sum_{\tilde{x}} c_{\text{train}}(\tilde{x})}$$
However, this method doesn’t consider context and assigns zero probability to any word that does not appear in the training data.
Handling Unknown Words
To tackle unknown words, models can be character- or subword-based, calculating probabilities based on a word’s spelling. Additionally, parameterizing in log space ensures numerical stability by converting multiplication into addition:
$$\log P(X) = \sum_{i=1}^{|X|} \log P(x_i)$$
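As a concrete illustration of unigram MLE and log-space scoring, here is a minimal sketch in plain Python; the toy corpus and whitespace tokenization are assumptions made purely for the example.

```python
import math
from collections import Counter

# Toy training corpus; whitespace tokenization is a simplification for the example.
train_sentences = ["the cat sat", "the dog sat", "the cat ran"]
tokens = [w for s in train_sentences for w in s.split()]

counts = Counter(tokens)
total = sum(counts.values())

def unigram_log_prob(word):
    """MLE unigram log probability; unseen words get probability zero (log = -inf)."""
    c = counts.get(word, 0)
    return math.log(c / total) if c > 0 else float("-inf")

def sentence_log_prob(sentence):
    """Product of word probabilities, computed as a sum in log space."""
    return sum(unigram_log_prob(w) for w in sentence.split())

print(sentence_log_prob("the cat sat"))   # finite log probability
print(sentence_log_prob("the cat flew"))  # -inf: an unseen word zeroes out the sentence
```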
Generative Text Classifier
A generative classifier decomposes the joint probability as:
$$P(X, y) = P(X \mid y)\,P(y)$$
Here, $P(X \mid y)$ is a class-conditional language model trained on data specific to class $y$, and $P(y)$ represents the prior probability of class $y$, which can be estimated from label counts:
$$P(y) = \frac{c(y)}{\sum_{\tilde{y}} c(\tilde{y})}$$
Naive Bayes Classifier
A common generative classifier is the Naive Bayes classifier, which uses a bag-of-words approach. It weights the count of each word in the sentence by the corresponding class-conditional log probability $\log P(x \mid y)$, adds the log prior, and predicts the highest-scoring class:
$$\hat{y} = \operatorname*{argmax}_{y} \Big[ \log P(y) + \sum_{i=1}^{|X|} \log P(x_i \mid y) \Big]$$
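A minimal Naive Bayes sketch along these lines, assuming a tiny toy dataset and add-one smoothing of the class-conditional probabilities (the smoothing is an extra assumption here, used only to avoid zero counts):

```python
import math
from collections import Counter, defaultdict

# Toy labeled data: (text, label); whitespace tokenization for simplicity.
train = [("a great fun movie", "pos"), ("truly great acting", "pos"),
         ("a boring slow movie", "neg"), ("boring and painful", "neg")]

class_counts = Counter(y for _, y in train)
word_counts = defaultdict(Counter)
for text, y in train:
    word_counts[y].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def log_prior(y):
    return math.log(class_counts[y] / sum(class_counts.values()))

def log_likelihood(word, y, alpha=1.0):
    # Add-one (Laplace) smoothing over the training vocabulary.
    num = word_counts[y][word] + alpha
    den = sum(word_counts[y].values()) + alpha * len(vocab)
    return math.log(num / den)

def predict(text):
    scores = {y: log_prior(y) + sum(log_likelihood(w, y) for w in text.split())
              for y in class_counts}
    return max(scores, key=scores.get)

print(predict("a fun movie"))      # expected: "pos"
print(predict("slow and boring"))  # expected: "neg"
```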
Discriminative Model Training
Discriminative models train to directly optimize $P(y \mid X; \theta)$ with parameters $\theta$. The loss function, often the negative log-likelihood, is minimized:
$$\mathcal{L}(\theta) = -\sum_{(X, y)} \log P(y \mid X; \theta)$$
Optimization is typically done using gradient descent:
$$\theta \leftarrow \theta - \eta \nabla_{\theta}\,\mathcal{L}(\theta)$$
Bag-of-Words Discriminative Model
For binary classification, the score for input $X$ is a sum of per-word weights plus a bias:
$$s(X) = \sum_{i=1}^{|X|} w_{x_i} + b$$
This score is converted to a probability using the sigmoid function:
$$P(y = 1 \mid X) = \sigma\big(s(X)\big) = \frac{1}{1 + e^{-s(X)}}$$
For multi-class classification, softmax is used instead of sigmoid, with one score $s_y(X)$ per class $y$:
$$P(y \mid X) = \frac{e^{s_y(X)}}{\sum_{\tilde{y}} e^{s_{\tilde{y}}(X)}}$$
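The bag-of-words score and softmax can be sketched with NumPy as below; the vocabulary and the randomly initialized weight matrix are placeholders, and the gradient-descent training loop described above is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"the": 0, "movie": 1, "was": 2, "great": 3, "boring": 4}
num_classes = 2  # e.g. negative vs. positive

# One weight per (word, class) plus a per-class bias; randomly initialized here.
W = rng.normal(scale=0.1, size=(len(vocab), num_classes))
b = np.zeros(num_classes)

def scores(text):
    """Bag-of-words score: sum the weight rows of the words appearing in the text."""
    ids = [vocab[w] for w in text.split() if w in vocab]
    return W[ids].sum(axis=0) + b

def softmax(s):
    s = s - s.max()          # subtract the max for numerical stability
    e = np.exp(s)
    return e / e.sum()

print(softmax(scores("the movie was great")))  # class probabilities summing to 1
```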
Evaluation Metrics
- Accuracy: The percentage of correctly predicted labels.
- Precision: The proportion of true positives among predicted positives.
- Recall: The proportion of true positives among actual positives.
- F1 Score: The harmonic mean of precision and recall.
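For binary labels, these metrics reduce to simple counting over true/false positives and negatives; a small sketch with made-up gold labels and predictions:

```python
# 1 = positive class, 0 = negative class; toy labels for illustration only.
gold = [1, 0, 1, 1, 0, 1]
pred = [1, 0, 0, 1, 1, 1]

tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```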
Language Modeling and Neural Networks
What is a Language Model?
A language model (LM) generates sentences by predicting the next word given the previous context:
$$P(x_i \mid x_1, \ldots, x_{i-1})$$
Applications of Language Models
- Scoring Sentences: Assessing grammatical correctness.
- Generating Sentences: Creating new text by sampling from the probability distribution.
Smoothing Methods
To handle zero probabilities for unseen words, various smoothing techniques are employed:
- Additive/Dirichlet smoothing
- Discounting
- Kneser-Ney smoothing
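As an example of the simplest of these, additive (add-$\alpha$) smoothing reserves probability mass for unseen words. A minimal sketch, where the vocabulary size and $\alpha$ are assumed values:

```python
from collections import Counter

counts = Counter("the cat sat on the mat".split())
vocab_size = 10_000   # assumed vocabulary size, including words never seen in training
alpha = 0.1           # additive smoothing constant

def smoothed_prob(word):
    """P(word) = (c(word) + alpha) / (N + alpha * |V|); never zero."""
    return (counts[word] + alpha) / (sum(counts.values()) + alpha * vocab_size)

print(smoothed_prob("the"))    # seen word: relatively high probability
print(smoothed_prob("zebra"))  # unseen word: small but non-zero probability
```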
Types of Language Models
- Class-Based Language Models
- Skip-Gram Models
When to Use N-gram Models?
N-gram models are extremely fast and effective for modeling low-frequency phenomena, as count-based models never forget seen data.
Evaluation of Language Models
- Log-Likelihood: $\mathrm{LL}(\mathcal{X}_{\text{test}}) = \sum_{X \in \mathcal{X}_{\text{test}}} \log P(X)$
- Per-Word Log Likelihood: $\mathrm{WLL}(\mathcal{X}_{\text{test}}) = \frac{1}{\sum_{X \in \mathcal{X}_{\text{test}}} |X|} \sum_{X \in \mathcal{X}_{\text{test}}} \log P(X)$
- Perplexity: $\mathrm{PPL}(\mathcal{X}_{\text{test}}) = e^{-\mathrm{WLL}(\mathcal{X}_{\text{test}})}$ (lower is better)
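These quantities are easy to compute once a model returns per-word log probabilities; the sketch below uses a constant stand-in probability in place of a real model:

```python
import math

# Toy test set; a real model would supply context-dependent log probabilities.
test_sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]

def log_prob(word, history):
    return math.log(0.05)   # stand-in: a trained model would condition on the history

log_likelihood = sum(log_prob(w, s[:i]) for s in test_sentences for i, w in enumerate(s))
num_words = sum(len(s) for s in test_sentences)

per_word_ll = log_likelihood / num_words
perplexity = math.exp(-per_word_ll)

print(log_likelihood, per_word_ll, perplexity)  # perplexity = 1 / 0.05 = 20 here
```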
Handling Unknown Words
Unknown words are inevitable. Common strategies include:
- Limiting vocabulary by frequency threshold.
- Modeling characters or subwords to capture the probability of unseen words.
Neural Networks in NLP
Neural networks are essentially computation graphs, typically directed and acyclic. Smaller graphs are easier to compute, which is crucial for efficiency.
Sequence Modeling and Recurrent Networks
Language inherently involves long-distance dependencies like agreement in number and gender or selectional preferences. Recurrent Neural Networks (RNNs) and their variants like LSTMs are designed to handle such dependencies.
Bidirectional RNNs
These process the sequence in both forward and backward directions, allowing the model to capture context from both sides.
Long Short-Term Memory (LSTM)
LSTMs mitigate the vanishing gradient problem, enabling the model to learn long-term dependencies effectively.
Increasing Efficiency
- Mini-Batching: Processing multiple samples simultaneously.
- Masking: Padding sequences to uniform lengths and masking the padding tokens.
- Truncated Backpropagation Through Time (BPTT): Handling long sequences by truncating the backpropagation process.
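A small NumPy sketch of padding and masking; the pad id and the toy batch are assumptions, and a real implementation would apply the mask inside the loss or attention computation:

```python
import numpy as np

PAD = 0  # assumed padding token id

# A toy batch of token-id sequences with different lengths.
batch = [[5, 8, 2], [7, 3], [9, 4, 6, 1]]
max_len = max(len(seq) for seq in batch)

# Pad every sequence to the same length so they form a rectangular array.
padded = np.full((len(batch), max_len), PAD, dtype=np.int64)
for i, seq in enumerate(batch):
    padded[i, :len(seq)] = seq

# Mask is 1 for real tokens and 0 for padding; used to zero out padded positions
# in the loss (or attention scores) so they do not affect training.
mask = (padded != PAD).astype(np.float32)

print(padded)
print(mask)
```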
Conditioned Generation
Generating Sentences
Generating sentences involves sampling or selecting words based on the probability distribution:
- Ancestral Sampling: Sampling words one by one until an end-of-sentence token is generated.
- Greedy Search: Selecting the highest-probability word at each step until the end-of-sentence token “</s>” is generated (a minimal decoding sketch follows this list):
$$\hat{y}_j = \operatorname*{argmax}_{y_j} P(y_j \mid X, y_1, \ldots, y_{j-1})$$
- Beam Search: Keeping multiple hypotheses at each step to explore more possibilities.
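Here is the minimal decoding sketch promised above, implementing the greedy loop; `next_word_probs` is a hypothetical stand-in for whatever conditional model supplies $P(y_j \mid X, y_{<j})$:

```python
import numpy as np

EOS = "</s>"
vocab = [EOS, "the", "cat", "sat", "on", "mat"]

def next_word_probs(source, prefix):
    """Hypothetical stand-in for a trained model's distribution over the next word."""
    rng = np.random.default_rng(len(prefix))   # deterministic toy distribution
    p = rng.random(len(vocab))
    p[vocab.index(EOS)] += 0.3 * len(prefix)   # toy: EOS grows more likely over time
    return p / p.sum()

def greedy_decode(source, max_len=20):
    prefix = []
    while len(prefix) < max_len:
        probs = next_word_probs(source, prefix)
        word = vocab[int(np.argmax(probs))]    # pick the highest-probability word
        if word == EOS:
            break
        prefix.append(word)
    return prefix

print(greedy_decode("a toy source sentence"))
```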
Examples of Conditioned Generation
- Translation
- Summarization
- Extractive: Copying input.
- Abstractive: Generating new text.
- Dialog Response Generation
- Image Captioning
Evaluation Metrics
BLEU (Bilingual Evaluation Understudy)
Measures n-gram overlap with reference translations.
Pros:
- Easy to use.
- Good for measuring system improvements.
Cons:
- Doesn’t always match human judgment.
- Penalizes correct paraphrases.
Embedding-Based Metrics:
- BERTScore: Uses BERT embeddings to find similarity.
- BLEURT: Trains BERT to predict human evaluation scores.
- COMET: A learned metric that scores system output using both the source sentence and the reference, trained on human judgments.
- PRISM: Based on a paraphrasing model.
- BARTScore: Uses a sequence-to-sequence model to calculate the probability of generating one text from another (e.g., the system output from the source, or the reference from the output).
Challenges with Beam Search
- Machine Translation: Tends to produce short hypotheses.
- Open-Ended Generation: Can lead to repetition.
Other sampling methods like top-k and nucleus sampling are often used to mitigate these issues.
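A sketch of both truncation strategies over a single toy next-word distribution (the distribution itself is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02])  # toy next-word distribution

def top_k_sample(probs, k=3):
    """Keep only the k most likely words, renormalize, then sample."""
    top = np.argsort(probs)[::-1][:k]
    p = np.zeros_like(probs)
    p[top] = probs[top]
    return rng.choice(len(probs), p=p / p.sum())

def nucleus_sample(probs, p_threshold=0.9):
    """Keep the smallest set of words whose cumulative probability reaches p, then sample."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p_threshold) + 1
    keep = order[:cutoff]
    p = np.zeros_like(probs)
    p[keep] = probs[keep]
    return rng.choice(len(probs), p=p / p.sum())

print(top_k_sample(probs), nucleus_sample(probs))
```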
Attention Mechanisms
What is Attention?
Attention allows models to focus on different parts of the input when generating each part of the output. Instead of compressing the entire input into a single vector, attention uses multiple vectors proportional to the input length.
Calculating Attention
- Query, Key, and Value Vectors: Each word is encoded into query, key, and value vectors.
- Attention Weights: Calculated using a score function $a(q, k)$ between a query and each key.
- Softmax Normalization: Ensures weights sum to one.
- Weighted Sum: Combines value vectors based on attention weights.
Attention Score Functions
- Multilayer Perceptron: $a(q, k) = w_2^{\top} \tanh\big(W_1 [q; k]\big)$
- Bilinear: $a(q, k) = q^{\top} W k$
- Dot Product: $a(q, k) = q^{\top} k$
- Scaled Dot Product: $a(q, k) = \frac{q^{\top} k}{\sqrt{d_k}}$, where $d_k$ is the key dimensionality
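A NumPy sketch of scaled dot-product attention over a toy sequence, tying the steps above together; the random query, key, and value matrices stand in for learned projections of the input words:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 4, 8, 8

# Stand-ins for the learned query/key/value projections of a 4-word sequence.
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))

def scaled_dot_product_attention(Q, K, V):
    # 1. Attention scores: scaled dot product between every query and every key.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # 2. Softmax normalization so each row of weights sums to one.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # 3. Weighted sum of the value vectors.
    return weights @ V, weights

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))  # (4, 8) and rows summing to 1
```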
Self-Attention and Multi-Headed Attention
- Self-Attention: Each word attends to all other words in the sentence, creating context-sensitive encodings.
- Multi-Headed Attention: Multiple attention “heads” focus on different parts of the sentence independently, enhancing the model’s ability to capture diverse relationships.
Transformers
The Transformer Model
Transformers are sequence-to-sequence models based entirely on attention mechanisms, making them highly parallelizable and efficient. The main components include:
- Multi-Headed Attention
- Feed Forward Networks
- Add & Normalize Layers
Training Transformers
- Positional Encoding: Since transformers don’t have a built-in sense of order like RNNs, positional encoding helps distinguish word positions.
- Layer Normalization: Ensures that activations remain in a reasonable range, stabilizing training.
- Specialized Training Schedules: Adjust learning rates and other hyperparameters for optimal performance.
- Label Smoothing: Introduces uncertainty, preventing the model from becoming overconfident.
- Masking: Ensures that the model doesn’t attend to future tokens during training.
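Two of the order-related ingredients above, sinusoidal positional encoding and a causal (future-token) mask, can be sketched in NumPy; the sequence length and model dimension are arbitrary:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sine/cosine positional encodings in the style of the original Transformer."""
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

def causal_mask(seq_len):
    """Lower-triangular mask: position i may only attend to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(sinusoidal_positional_encoding(6, 8).shape)  # (6, 8)
print(causal_mask(4).astype(int))
```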
Language Model Pretraining
Pre-training Methods
Pre-training involves training models on large datasets before fine-tuning them on specific tasks. Common approaches include:
- Standard Multi-Task Learning: Training on multiple tasks simultaneously.
- Pre-train and Fine-Tune: Pre-training on a general task, then fine-tuning on a specific one.
Popular Pre-training Objectives
- Auto-Regressive Language Modeling: Suitable for text generation tasks.
- Masked Language Modeling (MLM): Used in models like BERT for tasks requiring understanding of context.
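As a rough illustration of MLM-style data preparation, the sketch below masks about 15% of the tokens and keeps the originals as prediction targets; the 80/10/10 corruption variants used by BERT are omitted for brevity:

```python
import math
import random

random.seed(0)

tokens = "the quick brown fox jumps over the lazy dog".split()
MASK, mask_rate = "[MASK]", 0.15

# Pick ~15% of positions (at least one) to mask.
num_to_mask = max(1, math.ceil(mask_rate * len(tokens)))
masked_positions = set(random.sample(range(len(tokens)), num_to_mask))

inputs = [MASK if i in masked_positions else tok for i, tok in enumerate(tokens)]
targets = [tok if i in masked_positions else None for i, tok in enumerate(tokens)]

print(inputs)   # model input with [MASK] at the chosen positions
print(targets)  # prediction targets only at the masked positions
```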
Notable Pre-trained Models
- BERT: Uses MLM and next sentence prediction.
- RoBERTa: Optimizes hyperparameters and drops next sentence prediction for better performance.
- XLNet: Extends BERT by predicting words in random order, capturing longer contexts.
- DeBERTa: Introduces disentangled attention mechanisms.
- ALBERT: Reduces model size with parameter sharing.
- DistilBERT: A smaller, faster version of BERT trained via knowledge distillation to retain most of its performance.
- GPT Series:
- GPT-2: 1.5B parameters, focuses on left-to-right language modeling.
- GPT-3: 175B parameters, excels in text generation.
- PaLM: A massive 540B parameter model designed for diverse language tasks.
Conclusion
This guide has covered essential aspects of text classification and language modeling within NLP, from foundational concepts to advanced neural network architectures like transformers. Understanding these principles is crucial for developing robust NLP applications that can interpret and generate human language effectively.
Further Reading
- Attention is All You Need by Vaswani et al.