Google ML Crash Course #3 Notes: Advanced ML Models

This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This third module covers advanced ML model architectures.

Neural networks

Introduction

Neural networks are a model architecture designed to automatically identify non-linear patterns in data, eliminating the need for manual feature cross experimentation.

Source: Neural networks | Machine Learning | Google for Developers

Nodes and hidden layers

In neural network terminology, additional layers between the input layer and the output layer are called hidden layers, and the nodes in these layers are called neurons. HiddenLayerBigPicture.png

Source: Neural networks: Nodes and hidden layers | Machine Learning | Google for Developers

Activation functions

Each neuron in a neural network performs the following two-step action: it first computes a weighted sum of its input values (plus a bias term), then passes that sum through an activation function, a non-linear transformation that produces the neuron's output.

Common activation functions include sigmoid, tanh, and ReLU.

The sigmoid function maps input x to an output value between 0 and 1: $$ F(x) = \frac{1}{1 + e^{-x}} $$ sigmoid.png

The tanh function (short for “hyperbolic tangent”) maps input x to an output value between -1 and 1: $$ F(x) = \tanh{(x)} $$ tanh.png

The rectified linear unit activation function (or ReLU, for short) applies a simple rule: if the input is negative, the output is 0; otherwise, the output equals the input: $$ F(x) = \max{(0, x)} $$

ReLU often outperforms sigmoid and tanh because it reduces vanishing gradient issues and requires less computation. relu.png
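
As a quick sketch (not from the course), these three activation functions can be written in a few lines of NumPy:

```python
import numpy as np

def sigmoid(x):
    # Squashes any input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes any input into the range (-1, 1).
    return np.tanh(x)

def relu(x):
    # Outputs 0 for negative inputs, the input itself otherwise.
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), sep="\n")
```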

A neural network consists of a set of nodes (neurons) organised in layers, a set of weights representing the connections between layers, a set of biases (one per node), and an activation function that transforms the output of each node in a layer.

Source: Neural networks: Activation functions | Machine Learning | Google for Developers

Training using backpropagation

Backpropagation is the primary training algorithm for neural networks. It calculates how much each weight and bias in the network contributed to the overall prediction error by applying the chain rule of calculus. It works backwards from the output layer to tell the gradient descent algorithm which equations to adjust to reduce loss.

In practice, this involves a forward pass, where the network makes a prediction and the loss function measures the error, followed by a backward pass that propagates that error back through the layers to compute gradients for each parameter.
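
As a minimal sketch of that loop (layer sizes, optimiser settings, and data here are arbitrary placeholders, not the course's example), Keras runs the forward pass, loss computation, and backward pass for each batch inside `model.fit`:

```python
import numpy as np
import tensorflow as tf

# Toy data: 3 input features, binary label.
X = np.random.rand(256, 3).astype("float32")
y = (X.sum(axis=1) > 1.5).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(8, activation="relu"),     # hidden layer
    tf.keras.layers.Dense(4, activation="relu"),     # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer
])

# Forward pass: predictions and binary cross-entropy loss.
# Backward pass: backpropagation computes each parameter's gradient,
# and gradient descent (SGD) nudges weights and biases to reduce the loss.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss="binary_crossentropy")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```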

Best practices for neural network training address its common failure modes: vanishing gradients (mitigated by the ReLU activation function), exploding gradients (mitigated by batch normalisation or lowering the learning rate), and dead ReLU units (mitigated by lowering the learning rate). Dropout regularisation, which randomly ignores a fraction of node activations during each gradient step, also helps prevent overfitting.

Source: Neural Networks: Training using backpropagation | Machine Learning | Google for Developers

Multi-class classification

Multi-class classification models predict from more than two possibilities (binary classification models predict from just two).

Multi-class classification can be achieved through two main approaches: one-vs.-all and one-vs.-one (softmax).

One-vs.-all uses multiple binary classifiers, one for each possible outcome, to determine the probability of each class independently. one_vs_all_binary_classifiers.png

This approach is fairly reasonable when the total number of classes is small.

We can create a more efficient one-vs.-all model with a deep neural network in which each output node represents a different class. one_vs_all_neural_net.png

Note that the probabilities do not sum to 1. With a one-vs.-all approach, the probability of each binary outcome is determined independently of all the others, because the sigmoid function is applied to each output node independently.
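
A small NumPy sketch of that independence (the logits are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up logits from three one-vs.-all output nodes
# (e.g. "is it a beagle?", "is it a bloodhound?", "is it a basset hound?").
logits = np.array([2.0, 1.5, -0.5])

# The sigmoid is applied to each output node independently of the others,
# so the resulting values do not form a single probability distribution.
probs = sigmoid(logits)
print(probs)
print(probs.sum())  # greater than 1
```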

One-vs.-one (softmax) predicts the probability of each class relative to all other classes, using the softmax function in the output layer to assign decimal probabilities that add up to 1.0. This additional constraint helps training converge more quickly.

Note that the softmax layer must have the same number of nodes as the output layer. one_vs_one_neural_net.png

The softmax formula extends logistic regression to multiple classes: $$ p(y = j|\textbf{x}) = \frac{e^{(\textbf{w}_j^{T}\textbf{x} + b_j)}}{\sum_{k\in K} e^{(\textbf{w}_k^{T}\textbf{x} + b_k)}} $$

Full softmax is fairly cheap when the number of classes is small but can become computationally expensive with many classes.
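
The formula can be implemented in a few lines; a minimal NumPy sketch (with the usual max-shift for numerical stability):

```python
import numpy as np

def softmax(logits):
    # Subtracting the max logit avoids overflow and leaves the result unchanged.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

logits = np.array([2.0, 1.5, -0.5])  # w_j^T x + b_j for each class j
probs = softmax(logits)
print(probs)        # one probability per class
print(probs.sum())  # 1.0: the outputs form a single probability distribution
```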

Candidate sampling offers an alternative for increased efficiency. It computes probabilities for all positive labels but only a random sample of negative labels. For example, if we are interested in determining whether an input image is a beagle or a bloodhound, we do not have to provide probabilities for every non-dog example.
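
A rough NumPy sketch of the idea (real sampled-softmax implementations also correct for the probability of drawing each negative, which is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_softmax_loss(logits, true_class, num_sampled):
    # Score the positive class against only a random sample of negative classes
    # instead of against all of them.
    num_classes = logits.shape[0]
    negatives = rng.choice(np.delete(np.arange(num_classes), true_class),
                           size=num_sampled, replace=False)
    sampled = np.concatenate(([true_class], negatives))
    sampled_logits = logits[sampled]
    # Cross-entropy over the sampled subset; index 0 is the true class.
    log_probs = sampled_logits - np.log(np.exp(sampled_logits).sum())
    return -log_probs[0]

logits = rng.normal(size=10_000)  # one logit per class
print(sampled_softmax_loss(logits, true_class=42, num_sampled=20))
```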

One label versus many labels

Softmax assumes that each example is a member of exactly one class. Some examples, however, can simultaneously be a member of multiple classes. For multi-label problems, use multiple independent logistic regressions instead.

Example: To classify dog breeds from images, including mixed-breed dogs, use one-vs.-all, since it predicts each breed independently and can assign high probabilities to multiple breeds, unlike softmax, which forces probabilities to sum to 1.

Source: Neural networks: Multi-class classification | Machine Learning | Google for Developers

Embeddings

Introduction

Embeddings are lower-dimensional representations of sparse data that address problems associated with one-hot encodings.

A one-hot encoded feature “meal” of 5,000 popular meal items: food_images_one_hot_encodings.png

This representation of data has several problems: the huge number of input nodes results in a huge number of weights, which in turn demands more training data, more computation, and more memory to train and serve the model effectively.

Embeddings, lower-dimensional representations of sparse data, address these issues.
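
A small sketch of the size difference (the vocabulary size matches the meal example above; the embedding dimension is an arbitrary choice):

```python
import numpy as np

vocab_size = 5_000   # number of distinct meal items
embedding_dim = 16   # chosen embedding size

# One-hot: a 5,000-dimensional vector with a single 1 in it.
meal_index = 1234
one_hot = np.zeros(vocab_size)
one_hot[meal_index] = 1.0

# Embedding: a (learned) table of dense vectors; representing an item
# is just looking up its row, a 16-dimensional vector.
embedding_table = np.random.normal(size=(vocab_size, embedding_dim))
dense = embedding_table[meal_index]

print(one_hot.shape, dense.shape)  # (5000,) vs (16,)
```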

Source: Embeddings | Machine Learning | Google for Developers

Embedding space and static embeddings

Embeddings are low-dimensional representations of high-dimensional data, often used to capture semantic relationships between items.

Embeddings place similar items closer together in the embedding space, allowing for efficient machine learning on large datasets.

Example of a 1D embedding of a sparse feature vector representing meal items: embeddings_1D.png

2D embedding: embeddings_2D.png

3D embedding: embeddings_3D_tangyuan.png

Distances in the embedding space represent relative similarity between items.
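
As a sketch, similarity can be measured with cosine similarity (or Euclidean distance) between embedding vectors; the 3D vectors below are made up:

```python
import numpy as np

# Made-up 3D embeddings for three meal items.
embeddings = {
    "hot dog":   np.array([0.9, 0.1, 0.2]),
    "hamburger": np.array([0.8, 0.2, 0.3]),
    "salad":     np.array([0.1, 0.9, 0.7]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similar items sit closer together in the embedding space than dissimilar ones.
print(cosine_similarity(embeddings["hot dog"], embeddings["hamburger"]))  # high
print(cosine_similarity(embeddings["hot dog"], embeddings["salad"]))      # lower
```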

Real-world embeddings can encode complex relationships, such as those between countries and their capitals, allowing models to detect patterns.

In practice, embedding spaces have many more than three dimensions, although far fewer than the original data, and the meaning of individual dimensions is often unclear.

Embeddings usually are task-specific, but one task with broad applicability is predicting the context of a word.

When each word or data point has a single embedding vector, this is called a static embedding. Static embeddings such as word2vec represent all meanings of a word with a single point, which can be a limitation in some cases.

word2vec can refer both to an algorithm for obtaining static word embeddings and to a set of word vectors that were pre-trained with that algorithm.

Source: Embeddings: Embedding space and static embeddings | Machine Learning | Google for Developers

Obtaining embeddings

Embeddings can be created using dimensionality reduction techniques such as PCA or by training them as part of a neural network.

Training an embedding within a neural network allows it to be customised for the task at hand: the embedding layer learns the weights that best represent the data in a lower-dimensional space, although this may take longer than training the embedding separately.

In general, you can create a hidden layer of size d in your neural network that is designated as the embedding layer, where d represents both the number of nodes in the hidden layer and the number of dimensions in the embedding space. one_hot_hot_dog_embedding.png
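
A minimal Keras sketch of such an embedding layer (the vocabulary size, d, and the downstream task are placeholders):

```python
import tensorflow as tf

vocab_size = 5_000  # number of distinct items (e.g. meal items or words)
d = 16              # nodes in the embedding layer = dimensions of the embedding space

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype="int32"),                      # an item index
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=d),  # the embedding layer
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),                 # placeholder task head
])

# The Embedding layer's weights (a vocab_size x d table) are learned jointly
# with the rest of the network during training.
model.compile(optimizer="adam", loss="binary_crossentropy")
```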

Word embeddings, such as word2vec, leverage the distributional hypothesis to map semantically similar words to geometrically close vectors. However, such static word embeddings have limitations because they assign a single representation per word.
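
For illustration (not part of the course), the gensim library can train word2vec-style static embeddings on a toy corpus; with so little data the vectors are meaningless, but the API shows that every occurrence of a word shares one vector:

```python
from gensim.models import Word2Vec

# Tiny toy corpus; each sentence is a list of tokens.
sentences = [
    ["the", "dog", "chased", "the", "ball"],
    ["the", "puppy", "chased", "the", "ball"],
    ["the", "cat", "ignored", "the", "ball"],
]

# One static 50-dimensional vector per word, regardless of context.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["dog"].shape)                 # (50,)
print(model.wv.similarity("dog", "puppy"))   # a single similarity score
```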

Contextual embeddings offer multiple representations based on context. For example, “orange” would have a different embedding for every unique sentence containing the word in the dataset (as it could be used as a colour or a fruit).

Because contextual embeddings incorporate contextual and positional information, which static embeddings do not, one token can have multiple contextual embedding vectors, whereas static embeddings allow only a single representation of each token.

Methods for creating contextual embeddings include ELMo, BERT, and transformer models with a self-attention layer.
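
As a rough illustration (using the Hugging Face transformers library and the bert-base-uncased checkpoint; not part of the course), the same word receives a different vector in each sentence:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I peeled an orange for breakfast.",
             "She painted the wall a bright orange."]

for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # One vector per token, conditioned on the whole sentence, so the vector for
    # "orange" (assumed to be a single token in this vocabulary) differs between
    # the fruit sentence and the colour sentence.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    orange_vector = outputs.last_hidden_state[0, tokens.index("orange")]
    print(sentence, orange_vector[:3])
```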

Source: Embeddings: Obtaining embeddings | Machine Learning | Google for Developers

Large language models

Introduction

A language model estimates the probability of a token or sequence of tokens given surrounding text, enabling tasks such as text generation, translation, and summarisation.

Tokens, the atomic units of language modelling, represent words, subwords, or characters and are crucial for understanding and processing language.

Example: “unwatched” would be split into three tokens: un (the prefix), watch (the root), ed (the suffix).

N-grams are ordered sequences of words used to build language models, where N is the number of words in the sequence.

Short N-grams capture too little information, while very long N-grams fail to generalise due to insufficient repeated examples in training data (sparsity issues).
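
A toy sketch of a count-based N-gram model (the corpus is made up):

```python
from collections import Counter

def ngrams(tokens, n):
    # All ordered runs of n consecutive tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "you are very nice and you are very kind".split()
trigram_counts = Counter(ngrams(tokens, 3))
bigram_counts = Counter(ngrams(tokens, 2))

# Estimate P(next word | previous two words) from counts, e.g. P("very" | "you are").
context, candidate = ("you", "are"), "very"
print(trigram_counts[context + (candidate,)] / bigram_counts[context])  # 1.0 here
```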

Recurrent neural networks improve on N-grams by processing sequences token by token and learning which past information to retain or discard, allowing them to model longer dependencies across sentences and gain more context.
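
A minimal Keras sketch of such a model (layer sizes and the vocabulary are placeholders); the LSTM layer's gates are what learn which past information to keep or forget:

```python
import tensorflow as tf

vocab_size = 10_000   # placeholder vocabulary size
sequence_length = 20  # tokens per training example

model = tf.keras.Sequential([
    tf.keras.Input(shape=(sequence_length,), dtype="int32"),
    tf.keras.layers.Embedding(vocab_size, 32),                # token embeddings
    tf.keras.layers.LSTM(64),                                 # reads tokens one by one,
                                                              # carrying context forward
    tf.keras.layers.Dense(vocab_size, activation="softmax"),  # next-token distribution
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```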

Model performance depends on training data size and diversity.

While recurrent neural networks improve context understanding compared to N-grams, they have limitations, paving the way for the emergence of large language models that evaluate the whole context simultaneously.

Source: Large language models | Machine Learning | Google for Developers

What's a large language model?

Large language models (LLMs) predict sequences of tokens and outperform previous models because they use far more parameters and exploit much wider context.

Transformers form the dominant architecture for LLMs and typically combine an encoder that converts input text into an intermediate representation with a decoder that generates output text, for example translating between languages. TransformerBasedTranslator.png

Partial transformers

Encoder-only models focus on representation learning and embeddings (which may serve as input for another system), while decoder-only models specialise in generating long sequences such as dialogue or text continuations.

Self-attention allows the model to weigh the importance of different words in relation to each other, enhancing context understanding.

Example: “The animal didn't cross the street because it was too tired.”

The self-attention mechanism determines the relevance of each nearby word to the pronoun “it” (in the figure, the bluer the line, the more important that word is to the pronoun). As shown, “animal” is more important than “street” to the pronoun “it”. Theanimaldidntcrossthestreet.png
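
A bare-bones NumPy sketch of (single-head) scaled dot-product self-attention, with made-up embeddings and projection matrices:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # Project each token's embedding into query, key, and value vectors.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # Compare every token's query with every token's key ...
    scores = Q @ K.T / np.sqrt(d_k)
    # ... and softmax the scores into weights saying how much each token
    # attends to every other token (these weights are what the figure visualises).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
num_tokens, d_model, d_k = 8, 16, 8
X = rng.normal(size=(num_tokens, d_model))                           # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))  # projections
output, attention_weights = self_attention(X, W_q, W_k, W_v)
print(attention_weights.shape)  # (8, 8): one weight per pair of tokens
```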

Multi-head multi-layer self-attention

Each self-attention layer contains multiple self-attention heads. The output of a layer is a mathematical operation (such as a weighted average or dot product) of the outputs of the different heads.

A complete transformer model stacks multiple self-attention layers. The output from one layer becomes the input for the next, allowing the model to build increasingly complex representations, from basic syntax to more nuanced concepts.

Self-attention is an O(N^2 * S * D) problem.

LLMs are trained using masked predictions on massive datasets, enabling them to learn patterns and generate text based on probabilities. You probably will never train an LLM from scratch.

Instruction tuning can improve an LLM's ability to follow instructions.

Why transformers are so large

This course generally recommends building models with a smaller number of parameters, but research shows that transformers with more parameters consistently achieve better performance.

Text generation

LLMs generate text by repeatedly predicting the most probable next token, effectively acting as highly powerful autocomplete systems. You can think of a user's question to an LLM as the “given” sentence followed by a masked response.
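
A sketch of that loop in plain Python, where `next_token_probabilities` is a hypothetical stand-in for the LLM itself (real systems usually sample rather than always taking the single most probable token):

```python
def generate(prompt_tokens, next_token_probabilities,
             max_new_tokens=50, end_token="<eos>"):
    # next_token_probabilities(tokens) -> {candidate_token: probability}, given
    # everything generated so far. It is a placeholder for the LLM.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probabilities(tokens)
        next_token = max(probs, key=probs.get)  # greedy: most probable next token
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens
```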

Benefits and problems

While LLMs offer benefits such as generating clear, easy-to-understand text for a wide range of tasks, they also present challenges: training and serving them is expensive, and they can hallucinate, confidently producing plausible but incorrect output.

Source: LLMs: What's a large language model? | Machine Learning | Google for Developers

Fine-tuning, distillation, and prompt engineering

General-purpose LLMs, also known as foundation LLMs, base LLMs, or pre-trained LLMs, are pre-trained on vast amounts of text, enabling them to understand language structure and generate creative content, but they act as platforms rather than complete solutions for tasks such as classification or regression.

Fine-tuning updates the parameters of a pre-trained model using task-specific examples, improving its prediction quality on that specialised task.

Distillation aims to reduce model size, typically at the cost of some prediction quality.

Prompt engineering allows users to customise an LLM's output by providing examples or instructions within the prompt, leveraging the model's existing pattern-recognition abilities without changing its parameters.

One-shot, few-shot, and zero-shot prompting differ by how many examples the prompt provides, with more examples usually improving reliability by giving clearer context.
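
A hypothetical few-shot prompt (the task and examples are made up): two worked examples show the model the pattern to follow, and none of its parameters change.

```python
prompt = """Classify the sentiment of each review as positive or negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: positive

Review: "It stopped working after a week and support never replied."
Sentiment: negative

Review: "Setup took five minutes and it just works."
Sentiment:"""
```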

Prompt engineering does not alter the model's parameters. Prompts leverage the pattern-recognition abilities of the existing LLM.

Offline inference pre-computes and caches LLM predictions for tasks where real-time response is not critical, saving resources and enabling the use of larger models.

Responsible use of LLMs requires awareness that models inherit biases from their training and distillation data.

Source: LLMs: Fine-tuning, distillation, and prompt engineering | Machine Learning | Google for Developers