Google ML Crash Course #3 Notes: Advanced ML Models
This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This third module covers advanced ML model architectures.
Neural networks
Introduction
Neural networks are a model architecture designed to automatically identify non-linear patterns in data, eliminating the need for manual feature cross experimentation.
Source: Neural networks | Machine Learning | Google for Developers
Nodes and hidden layers
In neural network terminology, additional layers between the input layer and the output layer are called hidden layers, and the nodes in these layers are called neurons.

Source: Neural networks: Nodes and hidden layers | Machine Learning | Google for Developers
Activation functions
Each neuron in a neural network performs the following two-step action:
- Calculates the weighted sum of input values.
- Applies an activation function to that sum.
Common activation functions include sigmoid, tanh, and ReLU.
The sigmoid function maps input x to an output value between 0 and 1:
$$
F(x) = \frac{1}{1 + e^{-x}}
$$

The tanh function (short for “hyperbolic tangent”) maps input x to an output value between -1 and 1:
$$
F(x) = \tanh{(x)}
$$

The rectified linear unit activation function (or ReLU, for short) applies a simple rule:
- If the input value is less than 0, return 0.
- If the input value is greater than or equal to 0, return the input value.
$$
F(x) = \max{(0,x)}
$$
ReLU often outperforms sigmoid and tanh because it reduces vanishing gradient issues and requires less computation.
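A minimal NumPy sketch of the three activation functions above (the input values are arbitrary and only for illustration):

```python
import numpy as np

def sigmoid(x):
    # Maps any input to a value between 0 and 1.
    return 1 / (1 + np.exp(-x))

def tanh(x):
    # Maps any input to a value between -1 and 1.
    return np.tanh(x)

def relu(x):
    # Returns 0 for negative inputs, the input itself otherwise.
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))  # values between 0 and 1
print(tanh(x))     # values between -1 and 1
print(relu(x))     # [0.  0.  0.  0.5 2. ]
```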

A neural network consists of:
- A set of nodes, analogous to neurons, organised in layers.
- A set of learned weights and biases connecting layers.
- Activation functions that transform each node's output, which may differ across layers.
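As a rough sketch of how these pieces fit together, here is a small Keras model with two hidden ReLU layers and a sigmoid output node; the feature count, layer sizes, and loss are arbitrary placeholders:

```python
import tensorflow as tf

# A set of nodes organised in layers, learned weights and biases between
# layers, and an activation function applied at each layer.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),                     # 10 input features (arbitrary)
    tf.keras.layers.Dense(8, activation="relu"),     # hidden layer 1
    tf.keras.layers.Dense(4, activation="relu"),     # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer
])
model.compile(optimizer="sgd", loss="binary_crossentropy")
```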
Source: Neural networks: Activation functions | Machine Learning | Google for Developers
Training using backpropagation
Backpropagation is the primary training algorithm for neural networks. Using the chain rule of calculus, it calculates how much each weight and bias in the network contributed to the overall prediction error, working backwards from the output layer so that gradient descent knows how to adjust each parameter to reduce loss.
In practice, this involves a forward pass, where the network makes a prediction and the loss function measures the error, followed by a backward pass that propagates that error back through the layers to compute gradients for each parameter.
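A hand-rolled sketch of one forward and backward pass for a toy network with a single sigmoid hidden neuron and a linear output (all numbers are arbitrary; real frameworks compute these gradients automatically):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy network: one input -> one sigmoid hidden neuron -> one linear output.
x, y_true = 2.0, 1.0          # a single training example
w1, b1 = 0.5, 0.0             # hidden-layer weight and bias
w2, b2 = -0.3, 0.1            # output-layer weight and bias

# Forward pass: make a prediction and measure the loss.
h = sigmoid(w1 * x + b1)              # hidden activation
y_pred = w2 * h + b2                  # prediction
loss = 0.5 * (y_pred - y_true) ** 2   # squared-error loss

# Backward pass: apply the chain rule from the loss back to each parameter.
d_loss_d_ypred = y_pred - y_true
d_loss_d_w2 = d_loss_d_ypred * h
d_loss_d_b2 = d_loss_d_ypred
d_loss_d_h = d_loss_d_ypred * w2
d_h_d_z1 = h * (1 - h)                # derivative of the sigmoid
d_loss_d_w1 = d_loss_d_h * d_h_d_z1 * x
d_loss_d_b1 = d_loss_d_h * d_h_d_z1

# Gradient descent uses these gradients to adjust each parameter.
lr = 0.1
w1 -= lr * d_loss_d_w1; b1 -= lr * d_loss_d_b1
w2 -= lr * d_loss_d_w2; b2 -= lr * d_loss_d_b2
```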
Best practices for neural network training:
- Vanishing gradients occur when gradients in earlier layers become very small, slowing or stalling training, and can be mitigated by using the ReLU activation function.
- Exploding gradients happen when large weights cause excessively large gradients in early layers, disrupting convergence, and can be addressed with batch normalisation or by lowering the learning rate.
- Dead ReLU units emerge when a ReLU unit's output gets stuck at 0, halting gradient flow during backpropagation, and can be avoided by lowering the learning rate or using ReLU variants like LeakyReLU.
- Dropout regularisation is a technique to prevent overfitting by randomly dropping unit activations in a network for a single gradient step, with higher dropout rates indicating stronger regularisation (0 = no regularisation, 1 = drop out all nodes).
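As a rough sketch of the dropout idea in the last bullet (applied only during training; the rate and activations below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate):
    # Randomly zero out roughly a fraction `rate` of activations for one
    # gradient step, scaling the survivors so their expected sum is unchanged.
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

h = np.array([0.2, 1.5, 0.7, 0.9])
print(dropout(h, rate=0.5))  # roughly half the activations become 0
```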
Source: Neural Networks: Training using backpropagation | Machine Learning | Google for Developers
Multi-class classification
Multi-class classification models choose among more than two possible classes (binary classification models choose between just two).
Multi-class classification can be achieved through two main approaches:
- One-vs.-all
- One-vs.-one (softmax)
One-vs.-all uses multiple binary classifiers, one for each possible outcome, to determine the probability of each class independently.

This approach is fairly reasonable when the total number of classes is small.
We can create a more efficient one-vs.-all model with a deep neural network in which each output node represents a different class.

Note that the probabilities do not sum to 1. With a one-vs.-all approach, the probability of each binary set of outcomes is determined independently of all the other sets (the sigmoid function is applied to each output node independently).
One-vs.-one (softmax) predicts the probability of each class relative to all other classes, applying the softmax function in the output layer so that the probabilities sum to 1.0. This additional constraint helps training converge more quickly.
Note that the softmax layer must have the same number of nodes as the output layer.

The softmax formula extends logistic regression to multiple classes: $$ p(y = j|\textbf{x}) = \frac{e^{(\textbf{w}_j^{T}\textbf{x} + b_j)}}{\sum_{k\in K} e^{(\textbf{w}_k^{T}\textbf{x} + b_k)}} $$
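A minimal NumPy sketch of that calculation (the logits stand in for the per-class scores w_k^T x + b_k and are arbitrary):

```python
import numpy as np

def softmax(logits):
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the result.
    z = logits - np.max(logits)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])  # one score per class (arbitrary values)
probs = softmax(logits)
print(probs)        # approx. [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0
```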
Full softmax is fairly cheap when the number of classes is small but can become computationally expensive with many classes.
Candidate sampling offers an alternative for increased efficiency. It computes probabilities for all positive labels but only a random sample of negative labels. For example, if we are interested in determining whether an input image is a beagle or a bloodhound, we do not have to provide probabilities for every non-dog example.
One label versus many labels
Softmax assumes that each example is a member of exactly one class. Some examples, however, can simultaneously be a member of multiple classes. For multi-label problems, use multiple independent logistic regressions instead.
Example: To classify dog breeds from images, including mixed-breed dogs, use one-vs.-all, since it predicts each breed independently and can assign high probabilities to multiple breeds, unlike softmax, which forces probabilities to sum to 1.
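A minimal sketch of that one-vs.-all output, where each breed gets an independent sigmoid probability (the logits and breeds are made up):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One output node (logit) per breed for a mixed-breed image.
logits = np.array([1.8, 1.2, -2.0])  # beagle, basset hound, poodle
probs = sigmoid(logits)
print(probs)  # approx. [0.86, 0.77, 0.12] -- the probabilities need not sum to 1
```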
Source: Neural networks: Multi-class classification | Machine Learning | Google for Developers
Embeddings
Introduction
Embeddings are lower-dimensional representations of sparse data that address problems associated with one-hot encodings.
A one-hot encoded feature “meal” of 5,000 popular meal items:

This representation of data has several problems:
- Large input vectors mean a huge number of weights for a neural network.
- The more weights in your model, the more data you need to train effectively.
- The more weights, the more computation required to train and use the model.
- The more weights in your model, the more memory is needed on the accelerators that train and serve it.
- The more weights, the harder the model is to deploy for on-device machine learning (ODML), where model size must be kept small.
Embeddings, lower-dimensional representations of sparse data, address these issues.
Source: Embeddings | Machine Learning | Google for Developers
Embedding space and static embeddings
Embeddings are low-dimensional representations of high-dimensional data, often used to capture semantic relationships between items.
Embeddings place similar items closer together in the embedding space, allowing for efficient machine learning on large datasets.
Example of a 1D embedding of a sparse feature vector representing meal items:

2D embedding:

3D embedding:

Distances in the embedding space represent relative similarity between items.
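For instance, with made-up 3D embeddings for a few meal items, similarity can be read off from a simple distance calculation:

```python
import numpy as np

# Hypothetical 3D embeddings for three meal items (values are made up).
embeddings = {
    "sandwich": np.array([0.9, 0.1, 0.2]),
    "burrito":  np.array([0.8, 0.2, 0.3]),
    "salad":    np.array([0.1, 0.9, 0.1]),
}

def distance(a, b):
    # Euclidean distance: smaller means more similar.
    return np.linalg.norm(a - b)

print(distance(embeddings["sandwich"], embeddings["burrito"]))  # ~0.17 (similar)
print(distance(embeddings["sandwich"], embeddings["salad"]))    # ~1.14 (dissimilar)
```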
Real-world embeddings can encode complex relationships, such as those between countries and their capitals, allowing models to detect patterns.
In practice, embedding spaces have many more than three dimensions, although far fewer than the original data, and the meaning of individual dimensions is often unclear.
Embeddings usually are task-specific, but one task with broad applicability is predicting the context of a word.
When each word or data point has a single embedding vector, this is called a static embedding. Static embeddings such as word2vec therefore represent all meanings of a word with a single point, which can be a limitation.
word2vec can refer both to an algorithm for obtaining static word embeddings and to a set of word vectors that were pre-trained with that algorithm.
Source: Embeddings: Embedding space and static embeddings | Machine Learning | Google for Developers
Obtaining embeddings
Embeddings can be created using dimensionality reduction techniques such as PCA or by training them as part of a neural network.
Training an embedding within a neural network allows customisation for specific tasks, where the embedding layer learns optimal weights to represent data in a lower-dimensional space, but it may take longer than training the embedding separately.
In general, you can create a hidden layer of size d in your neural network that is designated as the embedding layer, where d represents both the number of nodes in the hidden layer and the number of dimensions in the embedding space.
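A minimal Keras sketch of such an embedding layer with d = 3, reusing the 5,000-item meal vocabulary from earlier (the pooling layer, output layer, and loss are arbitrary placeholders):

```python
import tensorflow as tf

# The Embedding layer's 5000 x 3 weight matrix is learned during training;
# each row becomes the 3-dimensional embedding of one meal item.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=5000, output_dim=3),  # d = 3
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```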

Word embeddings, such as word2vec, leverage the distributional hypothesis to map semantically similar words to geometrically close vectors. However, such static word embeddings have limitations because they assign a single representation per word.
Contextual embeddings offer multiple representations based on context. For example, “orange” would have a different embedding for every unique sentence containing the word in the dataset (as it could be used as a colour or a fruit).
Contextual embeddings encode positional and contextual information, while static embeddings do not. As a result, one token can have multiple contextual embedding vectors, whereas static embeddings allow only a single representation of each token.
Methods for creating contextual embeddings include ELMo, BERT, and transformer models with a self-attention layer.
Source: Embeddings: Obtaining embeddings | Machine Learning | Google for Developers
Large language models
Introduction
A language model estimates the probability of a token or sequence of tokens given surrounding text, enabling tasks such as text generation, translation, and summarisation.
Tokens, the atomic units of language modelling, represent words, subwords, or characters and are crucial for understanding and processing language.
Example: “unwatched” would be split into three tokens: un (the prefix), watch (the root), ed (the suffix).
N-grams are ordered sequences of words used to build language models, where N is the number of words in the sequence.
Short N-grams capture too little information, while very long N-grams fail to generalise due to insufficient repeated examples in training data (sparsity issues).
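A toy bigram (N = 2) sketch on a made-up corpus, estimating the probability of the next word from counts:

```python
from collections import Counter, defaultdict

# Tiny corpus, for illustration only.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigrams: how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

# Estimate P(next word | "the") from the counts.
counts = bigram_counts["the"]
total = sum(counts.values())
print({word: c / total for word, c in counts.items()})
# {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
```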
Recurrent neural networks improve on N-grams by processing sequences token by token and learning which past information to retain or discard, allowing them to model longer dependencies across sentences and gain more context.
- Note that training recurrent neural networks for long contexts is constrained by the vanishing gradient problem.
Model performance depends on training data size and diversity.
While recurrent neural networks improve context understanding compared to N-grams, they have limitations, paving the way for the emergence of large language models that evaluate the whole context simultaneously.
Source: Large language models | Machine Learning | Google for Developers
What's a large language model?
Large language models (LLMs) predict sequences of tokens and outperform previous models because they use far more parameters and exploit much wider context.
Transformers form the dominant architecture for LLMs and typically combine an encoder that converts input text into an intermediate representation with a decoder that generates output text, for example translating between languages.

Partial transformers
Encoder-only models focus on representation learning and embeddings (which may serve as input for another system), while decoder-only models specialise in generating long sequences such as dialogue or text continuations.
Self-attention allows the model to weigh the importance of different words in relation to each other, enhancing context understanding.
Example: “The animal didn't cross the street because it was too tired.”
The self-attention mechanism determines the relevance of each nearby word to the pronoun “it” (in the course's figure, the bluer the line, the more important that word is). Here, “animal” is more relevant than “street” to the pronoun “it”.

- Some self-attention mechanisms are bidirectional, meaning they calculate relevance scores for tokens preceding and following the word being attended to. This is useful for generating representations of whole sequences (encoders).
- By contrast, a unidirectional self-attention mechanism can gather context only from words on one side of the word being attended to. This suits applications that generate sequences token by token (decoders).
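A minimal NumPy sketch of a single self-attention head, with an optional causal mask to contrast the bidirectional and unidirectional cases (the learned query/key/value projections are omitted for brevity, and the token embeddings are random):

```python
import numpy as np

def self_attention(X, causal=False):
    # X holds one embedding vector per token (n_tokens x d).
    # For brevity, Q = K = V = X; a real transformer learns separate
    # projection matrices for each head.
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)   # relevance of every token to every other token
    if causal:
        # Unidirectional (decoder-style): each token may attend only to
        # itself and to earlier tokens.
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X              # weighted average of the token embeddings

X = np.random.default_rng(0).normal(size=(4, 8))  # 4 tokens, 8-dimensional embeddings
print(self_attention(X).shape)                # (4, 8), bidirectional
print(self_attention(X, causal=True).shape)   # (4, 8), unidirectional
```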
Multi-head multi-layer self-attention
Each self-attention layer contains multiple self-attention heads. The output of a layer is a mathematical operation (such as a weighted average or dot product) of the outputs of the different heads.
A complete transformer model stacks multiple self-attention layers. The output from one layer becomes the input for the next, allowing the model to build increasingly complex representations, from basic syntax to more nuanced concepts.
Self-attention is an O(N^2 * S * D) problem, where:
- N is the number of tokens in the context.
- S is the number of self-attention layers.
- D is the number of heads per layer.
LLMs are trained using masked predictions on massive datasets, enabling them to learn patterns and generate text based on probabilities. You probably will never train an LLM from scratch.
Instruction tuning can improve an LLM's ability to follow instructions.
Why transformers are so large
This course generally recommends building models with a smaller number of parameters, but research shows that transformers with more parameters consistently achieve better performance.
Text generation
LLMs generate text by repeatedly predicting the most probable next token, effectively acting as highly powerful autocomplete systems. You can think of a user's question to an LLM as the “given” sentence followed by a masked response.
Benefits and problems
While LLMs offer benefits such as clear text generation, they also present challenges.
- Training an LLM involves gathering enormous training sets, consuming vast computational resources and electricity, and solving parallelism challenges.
- Using an LLM for inference raises issues such as hallucinations, high computational and electricity costs, and bias.
Source: LLMs: What's a large language model? | Machine Learning | Google for Developers
Fine-tuning, distillation, and prompt engineering
General-purpose LLMs (also known as foundation LLMs, base LLMs, or pre-trained LLMs) are pre-trained on vast amounts of text, which lets them understand language structure and generate creative content. However, they act as platforms rather than complete solutions for tasks such as classification or regression.
Fine-tuning updates the parameters of a model to improve its prediction quality on a specialised task.
- Adapts a foundation LLM to a specific task by training on task-specific examples, often only hundreds or thousands, which improves performance for that task but retains the original model size (same number of parameters) and can still be computationally expensive.
- Parameter-efficient tuning reduces fine-tuning costs by updating only a subset of model parameters during training rather than all weights and biases.
Distillation aims to reduce model size, typically at the cost of some prediction quality.
- Distillation compresses an LLM into a smaller student model that runs faster and uses fewer resources.
- It typically uses a large teacher model to label data, often with rich numerical scores rather than simple labels, and trains a smaller student model on those outputs.
Prompt engineering allows users to customise an LLM's output by providing examples or instructions within the prompt, leveraging the model's existing pattern-recognition abilities without changing its parameters.
One-shot, few-shot, and zero-shot prompting differ by how many examples the prompt provides, with more examples usually improving reliability by giving clearer context.
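For example (the task and reviews below are made up), a zero-shot prompt provides no examples, while a few-shot prompt includes several labelled examples before the new input:

```python
# Illustrative prompts only.
zero_shot = (
    "Classify the sentiment of this review as positive or negative: "
    "'The soup was cold.'"
)

few_shot = """Classify the sentiment of each review as positive or negative.
Review: 'Loved the service.' Sentiment: positive
Review: 'The wait was endless.' Sentiment: negative
Review: 'The soup was cold.' Sentiment:"""
```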
Prompt engineering does not alter the model's parameters. Prompts leverage the pattern-recognition abilities of the existing LLM.
Offline inference pre-computes and caches LLM predictions for tasks where real-time response is not critical, saving resources and enabling the use of larger models.
Responsible use of LLMs requires awareness that models inherit biases from their training and distillation data.
Source: LLMs: Fine-tuning, distillation, and prompt engineering | Machine Learning | Google for Developers