Table of Contents
- Introduction
- Why Do We Need Self-Attention?
- How Self-Attention Works
- Mathematical Breakdown
- Step-by-Step Example with Python
- Summary
- References
Introduction
Self-attention is a key mechanism in deep learning models, especially Transformers. It lets a neural network weigh different parts of an input sequence when making predictions. Unlike traditional encoder-decoder attention, which relates two different sequences, self-attention operates within a single sequence: each token attends to every token in that same sequence.
Why Do We Need Self-Attention?
- Capturing Long-Range Dependencies: Self-attention allows a model to focus on important words, no matter how far apart they are.
- Parallel Processing: Unlike RNNs, which process tokens sequentially, self-attention can process all tokens simultaneously, making it much faster.
- Foundation of Transformers: The Transformer architecture (used in GPT, BERT, etc.) is built entirely on self-attention, replacing RNNs and CNNs in NLP.
How Self-Attention Works
Given an input sentence, self-attention scores each word against every other word based on their relevance to one another. The process involves the following steps:
- Create Query (Q), Key (K), and Value (V) Matrices
  - Each input token is projected into three different vectors.
- Compute Attention Scores
  - Scores are computed using the formula $\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)$.
  - This determines how much focus each word should get.
- Generate the Output
  - The weighted sum of values forms the new representation of each word.
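To make these steps concrete, here is a minimal NumPy sketch for a hypothetical 3-token sequence of 4-dimensional embeddings; the array names (`x`, `W_q`, `W_k`, `W_v`) and the sizes are illustrative assumptions rather than part of any particular model.

```python
import numpy as np

np.random.seed(0)

# Hypothetical input: 3 tokens, each a 4-dimensional embedding (illustrative sizes)
x = np.random.randn(3, 4)

# Step 1: project every token into Query, Key, and Value vectors
W_q = np.random.randn(4, 4)
W_k = np.random.randn(4, 4)
W_v = np.random.randn(4, 4)
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Step 2: scaled dot-product scores, normalized row-wise with softmax
scores = Q @ K.T / np.sqrt(K.shape[-1])
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Step 3: each token's new representation is a weighted sum of all Value vectors
output = weights @ V

print(weights.round(2))  # each row sums to 1: how much each token focuses on the others
print(output.shape)      # (3, 4): one contextualized vector per token
```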
Mathematical Breakdown
The self-attention mechanism follows these key steps:
- Compute Query (Q), Key (K), and Value (V) Matrices:
  $$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$$
  where $X$ is the matrix of input embeddings and $W_Q$, $W_K$, $W_V$ are learned weight matrices.
- Compute Attention Scores:
  $$\text{scores} = \frac{Q K^\top}{\sqrt{d_k}}$$
  where $d_k$ is the dimensionality of the key vectors.
- Apply Softmax to Normalize Scores:
  $$A = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right)$$
- Compute Final Self-Attention Output:
  $$\text{Output} = A V$$
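As a small worked example with made-up numbers (skipping the projection step and assuming the $2 \times 2$ matrices $Q$, $K$, and $V$ below are already given), the formulas produce:

$$Q = K = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad V = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \qquad d_k = 2$$

$$\frac{Q K^\top}{\sqrt{d_k}} = \begin{pmatrix} 0.71 & 0 \\ 0 & 0.71 \end{pmatrix}, \qquad A = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) \approx \begin{pmatrix} 0.67 & 0.33 \\ 0.33 & 0.67 \end{pmatrix}$$

$$\text{Output} = A V \approx \begin{pmatrix} 1.66 & 2.66 \\ 2.34 & 3.34 \end{pmatrix}$$

Each row of $A$ sums to 1, so every output row is a weighted average of the value rows, pulled toward the token whose key best matches the query.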
Step-by-Step Example with Python
Installing Dependencies

```bash
pip install torch numpy
```
Implementing Self-Attention in PyTorch

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_size):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.W_q = nn.Linear(embed_size, embed_size, bias=False)
        self.W_k = nn.Linear(embed_size, embed_size, bias=False)
        self.W_v = nn.Linear(embed_size, embed_size, bias=False)
        # Scaling factor sqrt(d_k), which keeps the dot products from growing too large
        self.scale = torch.sqrt(torch.tensor(embed_size, dtype=torch.float32))

    def forward(self, x):
        Q = self.W_q(x)  # Query matrix
        K = self.W_k(x)  # Key matrix
        V = self.W_v(x)  # Value matrix

        # Scaled dot-product scores: (batch, seq_len, seq_len)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        # Softmax over the key dimension so each row sums to 1
        attention_weights = torch.softmax(scores, dim=-1)
        # Weighted sum of values: the contextualized token representations
        output = torch.matmul(attention_weights, V)
        return output, attention_weights
```
Running the Code

```python
embed_size = 8       # Example embedding size
sequence_length = 5  # Example sequence length
batch_size = 1

# Random input tensor of shape (batch_size, sequence_length, embed_size)
x = torch.randn(batch_size, sequence_length, embed_size)

# Initialize the self-attention layer and pass the input through it
self_attention = SelfAttention(embed_size)
output, attention_weights = self_attention(x)

print("Self-Attention Output:\n", output)
print("Attention Weights:\n", attention_weights)
```
This implementation processes a sequence of 5 words, each represented by an 8-dimensional embedding. The output contains contextualized word embeddings, where each word's representation depends on the other words in the sequence.
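Because the input is random, the printed values will differ from run to run, but the shapes and the softmax normalization can be checked directly. The following lines are just an optional sanity check on the tensors produced above:

```python
print(output.shape)                   # torch.Size([1, 5, 8]): one contextualized vector per token
print(attention_weights.shape)        # torch.Size([1, 5, 5]): one weight per (query token, key token) pair
print(attention_weights.sum(dim=-1))  # every row sums to 1.0 after the softmax
```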
Summary
- Self-Attention enables models to weigh different parts of a sequence dynamically.
- It replaces recurrence (RNNs) and convolution (CNNs) in NLP models.
- Transformers like GPT and BERT use self-attention to capture long-range dependencies.