<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Stefan Angrick</title>
    <link>https://stefan.angrick.me/</link>
    <description></description>
    <pubDate>Mon, 20 Apr 2026 01:32:23 +0000</pubDate>
    <item>
      <title>Google ML Crash Course #4 Notes: Real-World ML</title>
      <link>https://stefan.angrick.me/google-ml-crash-course-4-notes-real-world-ml?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[This post is part of a four-part summary of Google&#39;s Machine Learning Crash Course. For context, check out this post. This fourth module covers critical considerations when building and deploying ML models in the real world, including productionisation best practices, automation, and responsible engineering.!--more--&#xA;&#xA;Production ML systems&#xA;&#xA;Introduction&#xA;&#xA;The model is only a small part of real-world production ML systems. It often represents only 5% or less of the total codebase in the system.&#xA;MlSystem.png&#xA;&#xA;Source: Production ML systems | Machine Learning | Google for Developers&#xA;&#xA;Static versus dynamic training&#xA;&#xA;Machine learning models can be trained statically (once) or dynamically (continuously).&#xA;&#xA;|                   | Static training (offline training)                                                            | Dynamic training (online training)                                                               |&#xA;| ----------------- | --------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |&#xA;| Advantages    | Simpler. You only need to develop and test the model once.                                | More adaptable. Keeps up with changes in data patterns, providing more accurate predictions. |&#xA;| Disadvantages | Sometimes stale. Can become outdated if data patterns change, requiring data monitoring. | More work. You must build, test, and release a new product continuously.                     
|&#xA;&#xA;Choosing between static and dynamic training depends on the specific dataset and how frequently it changes.&#xA;&#xA;Monitoring input data is essential for both static and dynamic training to ensure reliable predictions.&#xA;&#xA;Source: Production ML systems: Static versus dynamic training | Machine Learning | Google for Developers&#xA;&#xA;Static versus dynamic inference&#xA;&#xA;Inference involves using a trained model to make predictions on unlabelled examples, and it can be done as follows:&#xA;&#xA;Static inference (offline inference, batch inference) generates predictions in advance and caches them, which suits scenarios where prediction speed is critical.&#xA;&#xA;Dynamic inference (online inference, real-time inference) generates predictions on demand, offering flexibility for diverse inputs.&#xA;&#xA;|                   | Static inference (offline inference, batch inference)                                               | Dynamic inference (online inference, real-time inference)               |&#xA;| ----------------- | --------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- |&#xA;| Advantages    | No need to worry about cost of inference; allows post-verification of predictions before pushing | Can infer a prediction on any new item as it comes in                   |&#xA;| Disadvantages | Limited ability to handle uncommon inputs                                                    | Compute-intensive and latency-sensitive; monitoring needs are intensive |&#xA;&#xA;Choosing between static and dynamic inference depends on factors such as model complexity, desired prediction speed, and the nature of the input data.&#xA;&#xA;Static inference is advantageous when cost and prediction verification are prioritised, while dynamic inference excels in handling diverse, real-time predictions.&#xA;&#xA;Source: Production ML systems: Static 
versus dynamic inference | Machine Learning | Google for Developers&#xA;&#xA;When to transform data?&#xA;&#xA;Feature engineering can be performed before or during model training, each with its own advantages and disadvantages.&#xA;&#xA;Transforming data before training allows for a one-time transformation of the entire dataset but requires careful recreation of transformations during prediction to avoid training-serving skew.&#xA;Transforming data during training ensures consistency between training and prediction but can increase model latency and complicate batch processing.&#xA;  When transforming data during training, considerations such as Z-score normalisation across batches with varying distributions need to be addressed.&#xA;&#xA;Source: Production ML systems: When to transform data? | Machine Learning | Google for Developers&#xA;&#xA;Deployment testing&#xA;&#xA;Deploying a machine learning model involves validating data, features, model versions, serving infrastructure, and pipeline integration.&#xA;&#xA;Reproducible model training involves deterministic seeding, fixed initialisation order, averaging multiple runs, and using version control.&#xA;&#xA;Integration tests ensure that different components of the ML pipeline work together seamlessly and should run continuously and for new model or software versions.&#xA;&#xA;Before serving a new model, validate its quality by checking for sudden and gradual degradations against previous versions and fixed thresholds.&#xA;&#xA;Ensure model-infrastructure compatibility by staging the model in a sandboxed server environment to avoid dependency conflicts.&#xA;&#xA;Source: Production ML systems: Deployment testing | Machine Learning | Google for Developers&#xA;&#xA;Monitoring pipelines&#xA;&#xA;ML pipeline monitoring involves validating data (using data schemas) and features (using unit tests), tracking real-world metrics, and addressing potential biases in data slices.&#xA;&#xA;Monitoring training-serving skew, 
label leakage, model age, and numerical stability is crucial for maintaining pipeline health and model performance.&#xA;&#xA;Training-serving skew means that input data during training differs from input data during serving, for example because training and serving data use different schemas (schema skew) or because engineered data differs between training and serving (feature skew).&#xA;Label leakage means that the ground truth labels being predicted have inadvertently entered the training features.&#xA;Numerical stability involves writing tests to check for NaN and Inf values in weights and layer outputs, and testing that more than half of the outputs of a layer are not zero.&#xA;&#xA;Live model quality testing uses methods such as human labelling and statistical analysis to ensure ongoing model effectiveness in real-world scenarios.&#xA;&#xA;Implementing proper randomisation through deterministic data generation enables reproducible experiments and consistent analysis.&#xA;&#xA;Maintaining invariant hashing ensures that data splits remain consistent across experiments, contributing to reliable analysis and model evaluation.&#xA;&#xA;Source: Production ML systems: Monitoring pipelines | Machine Learning | Google for Developers&#xA;&#xA;Questions to ask&#xA;&#xA;Continuously monitor models in production to evaluate feature importance and potentially remove unnecessary features, ensuring prediction quality and resource efficiency.&#xA;&#xA;Regularly assess whether features are genuinely helpful and whether their value outweighs the cost of inclusion.&#xA;&#xA;Data reliability is crucial. 
Consider data source stability, potential changes in upstream data processes, and the creation of local data copies to control versioning and mitigate risks.&#xA;&#xA;Be aware of feedback loops, where a model&#39;s predictions influence future input data, potentially leading to unexpected behaviour or biased outcomes, especially in interconnected systems.&#xA;&#xA;Source: Production ML systems: Questions to ask | Machine Learning | Google for Developers&#xA;&#xA;Automated machine learning&#xA;&#xA;Introduction&#xA;&#xA;AutoML automates tasks in the machine learning workflow, such as data engineering (feature selection and engineering), training (algorithm selection and hyperparameter tuning), and analysis, making model building faster and easier.&#xA;ml-workflow.png&#xA;&#xA;While manual training involves writing code and iteratively adjusting it, AutoML reduces repetitive work and the need for specialised skills.&#xA;&#xA;Source: Automated Machine Learning (AutoML) | Google for Developers&#xA;&#xA;Benefits and limitations&#xA;&#xA;Benefits:&#xA;&#xA;To save time.&#xA;To improve the quality of an ML model.&#xA;To build an ML model without needing specialised skills.&#xA;To smoke test a dataset. AutoML can give quick baseline estimates of whether a dataset has enough signal relative to noise.&#xA;To evaluate a dataset. AutoML can help determine which features may be worth using.&#xA;To enforce best practices. Automation includes built-in support for applying ML best practices.&#xA;&#xA;Limitations:&#xA;&#xA;Model quality may not match that of manual training.&#xA;Model search and complexity can be opaque. 
Models generated with AutoML are difficult to reproduce manually.&#xA;Multiple AutoML runs may show greater variance.&#xA;Models cannot be customised during training.&#xA;&#xA;Large amounts of data are generally required for AutoML, although specialised systems using transfer learning (taking a model trained on one task and adapting its learned representations to a different but related task) can reduce this requirement.&#xA;&#xA;AutoML suits teams with limited ML experience or those seeking productivity gains without customisation needs. Custom (manual) training suits cases where model quality and customisation matter most.&#xA;&#xA;Source: AutoML: Benefits and limitations | Machine Learning | Google for Developers&#xA;&#xA;Getting started&#xA;&#xA;AutoML tools fall into two categories:&#xA;&#xA;Tools that require no coding.&#xA;API and CLI tools.&#xA;&#xA;The AutoML workflow follows steps similar to traditional machine learning, including problem definition, data gathering, preparation, model development, evaluation, and potential retraining.&#xA;&#xA;Some AutoML systems also support model deployment.&#xA;&#xA;Data preparation is crucial for AutoML and involves labelling, cleaning and formatting data, and applying feature transformations.&#xA;&#xA;No-code AutoML tools guide users through model development with steps such as data import, analysis, refinement, and configuration of run parameters before starting the automated training process.&#xA;&#xA;Users still need to carry out semantic checks to select the appropriate semantic type for each feature (for example recognising that postal codes are categorical rather than numeric), and to set transformations accordingly.&#xA;&#xA;Source: AutoML: Getting started | Machine Learning | Google for Developers&#xA;&#xA;Fairness&#xA;&#xA;Introduction&#xA;&#xA;Before putting a model into production, it is critical to audit training data and evaluate predictions for bias.&#xA;&#xA;Source: Fairness | Machine Learning | Google 
for Developers&#xA;&#xA;Types of bias&#xA;&#xA;Machine learning models can be susceptible to bias due to human involvement in data selection and curation.&#xA;&#xA;Understanding common human biases is crucial for mitigating their impact on model predictions.&#xA;&#xA;Types of bias include reporting bias, historical bias, automation bias, selection bias, coverage bias, non-response bias, sampling bias, group attribution bias (in-group bias and out-group homogeneity bias), implicit bias, confirmation bias, and experimenter&#39;s bias, among others.&#xA;&#xA;Source: Fairness: Types of bias | Machine Learning | Google for Developers&#xA;&#xA;Identifying bias&#xA;&#xA;Missing or unexpected feature values in a dataset can indicate potential sources of bias.&#xA;&#xA;Data skew, where certain groups are under- or over-represented, can introduce bias and should be addressed.&#xA;&#xA;Evaluating model performance by subgroup ensures fairness and equal performance across different characteristics.&#xA;&#xA;Source: Fairness: Identifying bias | Machine Learning | Google for Developers&#xA;&#xA;Mitigating bias&#xA;&#xA;Machine learning engineers use two primary strategies to mitigate bias in models:&#xA;&#xA;Augmenting training data.&#xA;Adjusting the model&#39;s loss function.&#xA;&#xA;Augmenting training data involves collecting additional data to address missing, incorrect, or skewed data, but it can be infeasible due to data availability or resource constraints.&#xA;&#xA;Adjusting the model&#39;s loss function involves using fairness-aware optimisation functions rather than the common default log loss.&#xA;&#xA;The TensorFlow Model Remediation Library provides optimisation functions designed to penalise errors in a fairness-aware manner:&#xA;&#xA;MinDiff aims to balance errors between different data slices by penalising differences in prediction distributions.&#xA;Counterfactual Logit Pairing (CLP) penalises discrepancies in predictions for similar examples with different 
sensitive attribute values.&#xA;&#xA;Source: Fairness: Mitigating bias | Machine Learning | Google for Developers&#xA;&#xA;Evaluating for bias&#xA;&#xA;Aggregate model performance metrics such as precision, recall, and accuracy can hide biases against minority groups.&#xA;&#xA;Fairness in model evaluation involves ensuring equitable outcomes across different demographic groups.&#xA;&#xA;Fairness metrics can help assess model predictions for bias.&#xA;&#xA;Demographic parity&#xA;Equality of opportunity&#xA;Counterfactual fairness&#xA;&#xA;Candidate pool of 100 students: 80 students belong to the majority group (blue), and 20 students belong to the minority group (orange):&#xA;fairnessmetricscandidatepool.png&#xA;&#xA;Source: Fairness: Evaluating for bias | Machine Learning | Google for Developers&#xA;&#xA;Demographic parity&#xA;&#xA;Demographic parity aims to ensure equal acceptance rates for majority and minority groups, regardless of individual qualifications.&#xA;&#xA;Both the majority (blue) and minority (orange) groups have an acceptance rate of 20%:&#xA;fairnessmetricsdemographicparity.png&#xA;&#xA;While demographic parity promotes equal representation, it can overlook differences in individual qualifications within each group, potentially leading to unfair outcomes.&#xA;&#xA;Qualified students in both groups are shaded in green, and qualified students who were rejected are marked with an X:&#xA;fairnessmetricsdemographicparitybyqualifications.png&#xA;&#xA;Majority acceptance rate = Qualified majority accepted / Qualified majority = 16/35 = 46%&#xA;Minority acceptance rate = Qualified minority accepted / Qualified minority = 4/15 = 27%&#xA;&#xA;When the distribution of a preferred label (&#34;qualified&#34;) differs substantially between groups, demographic parity may not be the most appropriate fairness metric.&#xA;&#xA;There may be additional benefits/drawbacks of demographic parity not discussed here that are also worth considering.&#xA;&#xA;Source: 
Fairness: Demographic parity | Machine Learning | Google for Developers&#xA;&#xA;Equality of opportunity&#xA;&#xA;Equality of opportunity focuses on ensuring that qualified individuals have an equal chance of acceptance, regardless of demographic group.&#xA;&#xA;Qualified students in both groups are shaded in green:&#xA;fairnessmetricsequalityofopportunitybyqualifications.png&#xA;&#xA;Majority acceptance rate = Qualified majority accepted / Qualified majority = 14/35 = 40%&#xA;Minority acceptance rate = Qualified minority accepted / Qualified minority = 6/15 = 40%&#xA;&#xA;Equality of opportunity has limitations, including reliance on a clearly defined preferred label and challenges in settings that lack demographic data.&#xA;&#xA;It is possible for a model to satisfy both demographic parity and equality of opportunity under specific conditions where positive prediction rates and true positive rates align across groups.&#xA;&#xA;Source: Fairness: Equality of opportunity | Machine Learning | Google for Developers&#xA;&#xA;Counterfactual fairness&#xA;&#xA;Counterfactual fairness evaluates fairness by comparing predictions for similar individuals who differ only in a sensitive attribute such as demographic group.&#xA;&#xA;This metric is particularly useful when datasets lack complete demographic information for most examples but contain it for a subset.&#xA;&#xA;Candidate pool, with demographic group membership unknown for most candidates (icons shaded in grey):&#xA;fairnessmetricscounterfactualsatisfied.png&#xA;&#xA;Counterfactual fairness may not capture broader systemic biases across subgroups. 
Other fairness metrics, such as demographic parity and equality of opportunity, provide a more holistic view but may require complete demographic data.&#xA;&#xA;Summary&#xA;&#xA;Selecting the appropriate fairness metric depends on the specific application and desired outcome, with no single &#34;right&#34; metric universally applicable.&#xA;&#xA;For example, if the goal is to achieve equal representation, demographic parity may be the optimal metric. If the goal is to achieve equal opportunity, equality of opportunity may be the best metric.&#xA;&#xA;Some definitions of fairness are mutually incompatible.&#xA;&#xA;Source: Fairness: Counterfactual fairness | Machine Learning | Google for Developers]]&gt;</description>
      <content:encoded><![CDATA[<p>This post is part of a four-part summary of Google&#39;s <a href="https://developers.google.com/machine-learning/crash-course/">Machine Learning Crash Course</a>. For context, check out <a href="notes-from-googles-machine-learning-crash-course">this post</a>. This fourth module covers critical considerations when building and deploying ML models in the real world, including productionisation best practices, automation, and responsible engineering.</p>

<h2 id="production-ml-systems">Production ML systems</h2>

<h3 id="introduction">Introduction</h3>

<p>The model is only a small part of real-world production ML systems. It often represents only 5% or less of the total codebase in the system.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/MlSystem.png" alt="MlSystem.png"/></p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/production-ml-systems">Production ML systems | Machine Learning | Google for Developers</a></p>

<h3 id="static-versus-dynamic-training">Static versus dynamic training</h3>

<p>Machine learning models can be trained statically (once) or dynamically (continuously).</p>

<table>
<thead>
<tr>
<th></th>
<th>Static training (offline training)</th>
<th>Dynamic training (online training)</th>
</tr>
</thead>

<tbody>
<tr>
<td><strong>Advantages</strong></td>
<td><strong>Simpler.</strong> You only need to develop and test the model once.</td>
<td><strong>More adaptable.</strong> Keeps up with changes in data patterns, providing more accurate predictions.</td>
</tr>

<tr>
<td><strong>Disadvantages</strong></td>
<td><strong>Sometimes stale.</strong> Can become outdated if data patterns change, requiring data monitoring.</td>
<td><strong>More work.</strong> You must build, test, and release a new product continuously.</td>
</tr>
</tbody>
</table>

<p>Choosing between static and dynamic training depends on the specific dataset and how frequently it changes.</p>

<p>Monitoring input data is essential for both static and dynamic training to ensure reliable predictions.</p>
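<p>The contrast above can be sketched with a toy "model" that predicts the mean of its training data (not an example from the course; the class and method names are made up, though <code>partial_fit</code> mirrors the convention scikit-learn uses for incremental learners):</p>

```python
# Toy illustration: a "model" that predicts the mean of its training data,
# trained statically (once) versus dynamically (incrementally per batch).

class MeanModel:
    def __init__(self):
        self.total, self.count = 0.0, 0

    def fit(self, xs):                  # static: one-off training on a full dataset
        self.total, self.count = sum(xs), len(xs)

    def partial_fit(self, xs):          # dynamic: incremental update as data arrives
        self.total += sum(xs)
        self.count += len(xs)

    def predict(self):
        return self.total / self.count

static_model = MeanModel()
static_model.fit([1, 2, 3])             # trained once; goes stale if data drifts

dynamic_model = MeanModel()
for batch in ([1, 2, 3], [10, 11, 12]):  # keeps training as new data comes in
    dynamic_model.partial_fit(batch)

print(static_model.predict())   # 2.0 -- unaware of the newer data
print(dynamic_model.predict())  # 6.5 -- tracks the drift
```

The static model's prediction is frozen at training time, which is exactly why monitoring for drift matters even in the simpler setup.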

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/production-ml-systems/static-vs-dynamic-training">Production ML systems: Static versus dynamic training | Machine Learning | Google for Developers</a></p>

<h3 id="static-versus-dynamic-inference">Static versus dynamic inference</h3>

<p>Inference involves using a trained model to make predictions on unlabelled examples, and it can be done as follows:</p>
<ul><li><p><strong>Static inference</strong> (<strong>offline inference</strong>, <strong>batch inference</strong>) generates predictions in advance and caches them, which suits scenarios where prediction speed is critical.</p></li>

<li><p><strong>Dynamic inference</strong> (<strong>online inference</strong>, <strong>real-time inference</strong>) generates predictions on demand, offering flexibility for diverse inputs.</p></li></ul>

<table>
<thead>
<tr>
<th></th>
<th>Static inference (offline inference, batch inference)</th>
<th>Dynamic inference (online inference, real-time inference)</th>
</tr>
</thead>

<tbody>
<tr>
<td><strong>Advantages</strong></td>
<td>Inference cost is not a concern at serving time; predictions can be verified before being pushed</td>
<td>Can infer a prediction on any new item as it comes in</td>
</tr>

<tr>
<td><strong>Disadvantages</strong></td>
<td>Limited ability to handle uncommon inputs</td>
<td>Compute-intensive and latency-sensitive; monitoring needs are intensive</td>
</tr>
</tbody>
</table>

<p>Choosing between static and dynamic inference depends on factors such as model complexity, desired prediction speed, and the nature of the input data.</p>

<p>Static inference is advantageous when cost and prediction verification are prioritised, while dynamic inference excels in handling diverse, real-time predictions.</p>
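<p>The trade-off can be sketched as follows (a toy example, not from the course; the model and function names are hypothetical):</p>

```python
# Toy illustration: serving the same model statically (precomputed lookup
# table) versus dynamically (computed per request).

def model(x):
    return 2 * x + 1   # stand-in for an expensive trained model

# Static inference: predict for all known inputs offline, cache, serve lookups.
known_inputs = [0, 1, 2, 3]
cache = {x: model(x) for x in known_inputs}

def serve_static(x):
    return cache.get(x)    # fast lookup, but None for inputs never precomputed

# Dynamic inference: run the model on demand for any input.
def serve_dynamic(x):
    return model(x)        # handles anything, at serving-time compute cost

print(serve_static(2))    # 5
print(serve_static(42))   # None -- uncommon input was never cached
print(serve_dynamic(42))  # 85
```

The cache miss on an unseen input is the "limited ability to handle uncommon inputs" disadvantage from the table above, made concrete.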

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/production-ml-systems/static-vs-dynamic-inference">Production ML systems: Static versus dynamic inference | Machine Learning | Google for Developers</a></p>

<h3 id="when-to-transform-data">When to transform data?</h3>

<p>Feature engineering can be performed before or during model training, each with its own advantages and disadvantages.</p>
<ul><li>Transforming data <strong>before training</strong> allows for a one-time transformation of the entire dataset but requires careful recreation of transformations during prediction to avoid training-serving skew.</li>
<li>Transforming data <strong>during training</strong> ensures consistency between training and prediction but can increase model latency and complicate batch processing.
<ul><li>When transforming data during training, considerations such as Z-score normalisation across batches with varying distributions need to be addressed.</li></ul></li></ul>
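<p>The Z-score pitfall is easy to demonstrate: the normalisation statistics must be computed once over the full training set and reused verbatim at serving time, not recomputed on whatever batch happens to arrive (a minimal sketch with made-up numbers):</p>

```python
import statistics

# Toy illustration: Z-score parameters from the full training set must be
# reused at serving time; recomputing them on a serving batch makes the same
# raw value normalise differently (training-serving skew).

train = [10.0, 12.0, 14.0, 16.0, 18.0]
mu = statistics.mean(train)        # 14.0
sigma = statistics.pstdev(train)   # population standard deviation

def z_score(x, mean, std):
    return (x - mean) / std

serving_value = 16.0

# Correct: serving reuses the stored training statistics.
ok = z_score(serving_value, mu, sigma)

# Skewed: recomputing statistics on a small serving batch gives a
# different answer for the same raw value.
batch = [16.0, 30.0]
skewed = z_score(serving_value, statistics.mean(batch), statistics.pstdev(batch))

print(round(ok, 3), round(skewed, 3))   # 0.707 -1.0
```

The same input lands on opposite sides of zero depending on which statistics are used, which is exactly the skew the bullet above warns about.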

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/production-ml-systems/transforming-data">Production ML systems: When to transform data? | Machine Learning | Google for Developers</a></p>

<h3 id="deployment-testing">Deployment testing</h3>

<p><strong>Deploying</strong> a machine learning model involves validating data, features, model versions, serving infrastructure, and pipeline integration.</p>

<p><strong>Reproducible model training</strong> involves deterministic seeding, fixed initialisation order, averaging multiple runs, and using version control.</p>
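<p>Deterministic seeding is the core of this: every source of randomness in a run draws from a fixed seed, so the run can be replayed bit-for-bit (a toy sketch; <code>train_run</code> is a hypothetical stand-in for a real training job):</p>

```python
import random

# Toy illustration: seeding every source of randomness (initialisation,
# shuffling) makes a "training run" reproducible exactly.

def train_run(seed):
    rng = random.Random(seed)                        # deterministic, isolated RNG
    weights = [rng.gauss(0, 1) for _ in range(3)]    # fixed initialisation order
    rng.shuffle(weights)                             # shuffling drawn from same seed
    return weights

run_a = train_run(seed=42)
run_b = train_run(seed=42)
run_c = train_run(seed=7)

print(run_a == run_b)   # True  -- same seed, identical run
print(run_a == run_c)   # False -- different seed, different run
```

Real frameworks need the same discipline applied to every RNG they touch (e.g. the framework's own seed, data-loader workers, GPU kernels), not just one.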

<p><strong>Integration tests</strong> ensure that different components of the ML pipeline work together seamlessly, and should run both continuously and whenever a new model or software version is released.</p>

<p>Before serving a new model, <strong>validate</strong> its quality: check for sudden degradation by comparing against the previous version, and for gradual degradation by checking against fixed quality thresholds.</p>
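<p>Such a validation gate can be sketched as a simple pair of checks (the metric, function name, and thresholds below are all hypothetical):</p>

```python
# Toy illustration: gate a new model behind checks for sudden degradation
# (relative to the previous version) and gradual degradation (relative to a
# fixed absolute floor). Thresholds here are made up.

def ok_to_serve(new_auc, previous_auc, max_drop=0.01, floor=0.90):
    no_sudden_drop = new_auc >= previous_auc - max_drop   # vs last version
    above_floor = new_auc >= floor                        # vs fixed threshold
    return no_sudden_drop and above_floor

print(ok_to_serve(new_auc=0.94, previous_auc=0.945))  # True
print(ok_to_serve(new_auc=0.92, previous_auc=0.95))   # False -- sudden drop
print(ok_to_serve(new_auc=0.89, previous_auc=0.89))   # False -- below floor
```

The fixed floor catches slow decay that per-version comparisons miss: each release can pass the relative check while quality drifts downward release after release.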

<p>Ensure <strong>model-infrastructure compatibility</strong> by staging the model in a sandboxed server environment to avoid dependency conflicts.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/production-ml-systems/deployment-testing">Production ML systems: Deployment testing | Machine Learning | Google for Developers</a></p>

<h3 id="monitoring-pipelines">Monitoring pipelines</h3>

<p><strong>ML pipeline monitoring</strong> involves validating data (using data schemas) and features (using unit tests), tracking real-world metrics, and addressing potential biases in data slices.</p>

<p>Monitoring training-serving skew, label leakage, model age, and numerical stability is crucial for maintaining pipeline health and model performance.</p>
<ul><li><strong>Training-serving skew</strong> means that input data during training differs from input data during serving, for example because training and serving data use different schemas (schema skew) or because engineered data differs between training and serving (feature skew).</li>
<li><strong>Label leakage</strong> means that the ground truth labels being predicted have inadvertently entered the training features.</li>
<li><strong>Numerical stability</strong> involves writing tests to check for NaN and Inf values in weights and layer outputs, and testing that more than half of the outputs of a layer are not zero.</li></ul>
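<p>The numerical-stability checks in the last bullet translate directly into small test helpers (a minimal sketch; the function names are made up):</p>

```python
import math

# Toy illustration: numerical-stability checks over a layer's weights and
# outputs, written as plain predicates suitable for unit tests.

def check_finite(values):
    return all(math.isfinite(v) for v in values)   # rejects both NaN and Inf

def check_not_mostly_dead(outputs):
    nonzero = sum(1 for v in outputs if v != 0.0)
    return nonzero > len(outputs) / 2              # more than half non-zero

weights = [0.3, -1.2, 0.05]
layer_outputs = [0.0, 0.7, 1.3, 0.0, 0.4]

print(check_finite(weights))                 # True
print(check_finite([0.3, float("nan")]))     # False
print(check_not_mostly_dead(layer_outputs))  # True -- 3 of 5 non-zero
```

In a real pipeline the same predicates would run against tensors from the framework in use, but the logic is unchanged.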

<p><strong>Live model quality testing</strong> uses methods such as human labelling and statistical analysis to ensure ongoing model effectiveness in real-world scenarios.</p>

<p>Implementing <strong>proper randomisation</strong> through deterministic, seeded data generation enables reproducible experiments and consistent analysis.</p>

<p>Maintaining invariant <strong>hashing</strong> ensures that data splits remain consistent across experiments, contributing to reliable analysis and model evaluation.</p>
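<p>Invariant hashing typically means assigning each example to a split by hashing a stable ID, so the assignment never changes between runs (a minimal sketch; the bucket counts are arbitrary):</p>

```python
import hashlib

# Toy illustration: hashing a stable example ID yields the same train/test
# assignment on every run, so splits never shift between experiments.

def split_for(example_id, test_buckets=2, total_buckets=10):
    digest = hashlib.sha256(example_id.encode()).hexdigest()
    bucket = int(digest, 16) % total_buckets       # stable bucket in [0, 10)
    return "test" if bucket < test_buckets else "train"

ids = ["user_001", "user_002", "user_003"]
first = [split_for(i) for i in ids]
second = [split_for(i) for i in ids]

print(first == second)   # True -- the split is invariant across runs
```

Because the assignment depends only on the ID, newly arriving examples get stable assignments too, and no example ever migrates from test into train.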

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/production-ml-systems/monitoring">Production ML systems: Monitoring pipelines | Machine Learning | Google for Developers</a></p>

<h3 id="questions-to-ask">Questions to ask</h3>

<p>Continuously monitor models in production to <strong>evaluate feature importance</strong> and potentially remove unnecessary features, ensuring prediction quality and resource efficiency.</p>
<ul><li>Regularly assess whether features are genuinely helpful and whether their value outweighs the cost of inclusion.</li></ul>

<p><strong>Data reliability</strong> is crucial. Consider data source stability, potential changes in upstream data processes, and the creation of local data copies to control versioning and mitigate risks.</p>

<p>Be aware of <strong>feedback loops</strong>, where a model&#39;s predictions influence future input data, potentially leading to unexpected behaviour or biased outcomes, especially in interconnected systems.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/production-ml-systems/questions">Production ML systems: Questions to ask | Machine Learning | Google for Developers</a></p>

<h2 id="automated-machine-learning">Automated machine learning</h2>

<h3 id="introduction-1">Introduction</h3>

<p><strong>AutoML</strong> automates tasks in the machine learning workflow, such as data engineering (feature selection and engineering), training (algorithm selection and hyperparameter tuning), and analysis, making model building faster and easier.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/ml-workflow.png" alt="ml-workflow.png"/></p>

<p>While manual training involves writing code and iteratively adjusting it, AutoML reduces repetitive work and the need for specialised skills.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/automl">Automated Machine Learning (AutoML) | Google for Developers</a></p>

<h3 id="benefits-and-limitations">Benefits and limitations</h3>

<p><strong>Benefits:</strong></p>
<ul><li>To save time.</li>
<li>To improve the quality of an ML model.</li>
<li>To build an ML model without needing specialised skills.</li>
<li>To smoke test a dataset. AutoML can give quick baseline estimates of whether a dataset has enough signal relative to noise.</li>
<li>To evaluate a dataset. AutoML can help determine which features may be worth using.</li>
<li>To enforce best practices. Automation includes built-in support for applying ML best practices.</li></ul>

<p><strong>Limitations:</strong></p>
<ul><li>Model quality may not match that of manual training.</li>
<li>Model search and complexity can be opaque. Models generated with AutoML are difficult to reproduce manually.</li>
<li>Multiple AutoML runs may show greater variance.</li>
<li>Models cannot be customised during training.</li></ul>

<p>Large amounts of data are generally required for AutoML, although specialised systems using <strong>transfer learning</strong> (taking a model trained on one task and adapting its learned representations to a different but related task) can reduce this requirement.</p>

<p><strong>AutoML</strong> suits teams with limited ML experience or those seeking productivity gains without customisation needs. <strong>Custom (manual) training</strong> suits cases where model quality and customisation matter most.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/automl/benefits-limitations">AutoML: Benefits and limitations | Machine Learning | Google for Developers</a></p>

<h3 id="getting-started">Getting started</h3>

<p>AutoML tools fall into <strong>two categories</strong>:</p>
<ul><li>Tools that require no coding.</li>
<li>API and CLI tools.</li></ul>

<p>The AutoML workflow follows steps similar to traditional machine learning, including problem definition, data gathering, preparation, model development, evaluation, and potential retraining.</p>
<ul><li>Some AutoML systems also support model deployment.</li></ul>

<p><strong>Data preparation</strong> is crucial for AutoML and involves labelling, cleaning and formatting data, and applying feature transformations.</p>

<p>No-code AutoML tools guide users through <strong>model development</strong> with steps such as data import, analysis, refinement, and configuration of run parameters before starting the automated training process.</p>
<ul><li>Users still need to carry out semantic checks to select the appropriate semantic type for each feature (for example recognising that postal codes are categorical rather than numeric), and to set transformations accordingly.</li></ul>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/automl/getting-started">AutoML: Getting started | Machine Learning | Google for Developers</a></p>

<h2 id="fairness">Fairness</h2>

<h3 id="introduction-2">Introduction</h3>

<p>Before putting a model into production, it is critical to audit training data and evaluate predictions for bias.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/fairness">Fairness | Machine Learning | Google for Developers</a></p>

<h3 id="types-of-bias">Types of bias</h3>

<p>Machine learning models can be susceptible to <strong>bias</strong> due to human involvement in data selection and curation.</p>

<p>Understanding common human biases is crucial for mitigating their impact on model predictions.</p>

<p>Types of bias include reporting bias, historical bias, automation bias, selection bias, coverage bias, non-response bias, sampling bias, group attribution bias (in-group bias and out-group homogeneity bias), implicit bias, confirmation bias, and experimenter&#39;s bias, <a href="https://wikipedia.org/wiki/List_of_cognitive_biases">among others</a>.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/fairness/types-of-bias">Fairness: Types of bias | Machine Learning | Google for Developers</a></p>

<h3 id="identifying-bias">Identifying bias</h3>

<p><strong>Missing</strong> or <strong>unexpected feature values</strong> in a dataset can indicate potential sources of bias.</p>

<p>Data skew, where certain groups are under- or over-represented, can introduce bias and should be addressed.</p>

<p>Evaluating model performance by subgroup ensures fairness and equal performance across different characteristics.</p>
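<p>As a sketch, per-subgroup evaluation amounts to slicing labels and predictions by group membership and computing the metric per slice; the labels, predictions, and group assignments below are made up for illustration:</p>

```python
import numpy as np

# Hypothetical labels, predictions, and group membership for illustration.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 0, 0])
group = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])

# Accuracy computed separately for each subgroup.
for g in np.unique(group):
    mask = group == g
    accuracy = np.mean(y_true[mask] == y_pred[mask])
    print(f"group {g}: accuracy = {accuracy:.2f}")
```

<p>The same slicing works for precision, recall, or any other metric; a large gap between slices is a signal to investigate.</p>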

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/fairness/identifying-bias">Fairness: Identifying bias | Machine Learning | Google for Developers</a></p>

<h3 id="mitigating-bias">Mitigating bias</h3>

<p>Machine learning engineers use two primary strategies to mitigate bias in models:</p>
<ul><li>Augmenting training data.</li>
<li>Adjusting the model&#39;s loss function.</li></ul>

<p><strong>Augmenting training data</strong> involves collecting additional data to address missing, incorrect, or skewed data, but it can be infeasible due to data availability or resource constraints.</p>

<p><strong>Adjusting the model&#39;s loss function</strong> involves using fairness-aware optimisation functions rather than the common default log loss.</p>

<p>The TensorFlow Model Remediation Library provides optimisation functions designed to penalise errors in a fairness-aware manner:</p>
<ul><li><strong>MinDiff</strong> aims to balance errors between different data slices by penalising differences in prediction distributions.</li>
<li><strong>Counterfactual Logit Pairing (CLP)</strong> penalises discrepancies in predictions for similar examples with different sensitive attribute values.</li></ul>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/fairness/mitigating-bias">Fairness: Mitigating bias | Machine Learning | Google for Developers</a></p>

<h3 id="evaluating-for-bias">Evaluating for bias</h3>

<p>Aggregate model performance metrics such as precision, recall, and accuracy can hide biases against minority groups.</p>

<p>Fairness in model evaluation involves ensuring equitable outcomes across different demographic groups.</p>

<p><strong>Fairness metrics</strong> can help assess model predictions for bias.</p>
<ul><li>Demographic parity</li>
<li>Equality of opportunity</li>
<li>Counterfactual fairness</li></ul>

<p>Candidate pool of 100 students: 80 students belong to the majority group (blue), and 20 students belong to the minority group (orange):
<img src="https://media.portblue.net/resources/251229_ml-crash-course/fairness_metrics_candidate_pool.png" alt="fairness_metrics_candidate_pool.png"/></p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/fairness/evaluating-for-bias">Fairness: Evaluating for bias | Machine Learning | Google for Developers</a></p>

<h4 id="demographic-parity">Demographic parity</h4>

<p>Demographic parity aims to <strong>ensure equal acceptance rates for majority and minority groups</strong>, regardless of individual qualifications.</p>

<p>Both the majority (blue) and minority (orange) groups have an acceptance rate of 20%:
<img src="https://media.portblue.net/resources/251229_ml-crash-course/fairness_metrics_demographic_parity.png" alt="fairness_metrics_demographic_parity.png"/></p>

<p>While demographic parity promotes equal representation, it <strong>can overlook differences in individual qualifications</strong> within each group, potentially leading to unfair outcomes.</p>

<p>Qualified students in both groups are shaded in green, and qualified students who were rejected are marked with an X:
<img src="https://media.portblue.net/resources/251229_ml-crash-course/fairness_metrics_demographic_parity_by_qualifications.png" alt="fairness_metrics_demographic_parity_by_qualifications.png"/></p>

<p>Majority acceptance rate = Qualified majority accepted / Qualified majority = 16/35 = 46%<br/>
Minority acceptance rate = Qualified minority accepted / Qualified minority = 4/15 = 27%</p>
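<p>The arithmetic above can be reproduced directly (all counts are taken from the worked example; the variable names are just illustrative):</p>

```python
# Counts from the worked example: 80 majority / 20 minority candidates,
# 20% accepted from each group, 35 and 15 of whom are qualified.
qualified_majority, qualified_minority = 35, 15
accepted_qualified_majority, accepted_qualified_minority = 16, 4

# Overall acceptance rates satisfy demographic parity (20% each)...
assert 16 / 80 == 4 / 20

# ...but acceptance rates among *qualified* candidates diverge.
maj_rate = accepted_qualified_majority / qualified_majority  # ~0.457
min_rate = accepted_qualified_minority / qualified_minority  # ~0.267
print(f"qualified majority: {maj_rate:.0%}, qualified minority: {min_rate:.0%}")
```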

<p>When the distribution of a preferred label (“qualified”) differs substantially between groups, demographic parity may not be the most appropriate fairness metric.</p>

<p>Demographic parity has further benefits and drawbacks, not discussed here, that may also be worth considering.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/fairness/demographic-parity">Fairness: Demographic parity | Machine Learning | Google for Developers</a></p>

<h4 id="equality-of-opportunity">Equality of opportunity</h4>

<p>Equality of opportunity focuses on <strong>ensuring that qualified individuals have an equal chance of acceptance</strong>, regardless of demographic group.</p>

<p>Qualified students in both groups are shaded in green:
<img src="https://media.portblue.net/resources/251229_ml-crash-course/fairness_metrics_equality_of_opportunity_by_qualifications.png" alt="fairness_metrics_equality_of_opportunity_by_qualifications.png"/></p>

<p>Majority acceptance rate = Qualified majority accepted / Qualified majority = 14/35 = 40%<br/>
Minority acceptance rate = Qualified minority accepted / Qualified minority = 6/15 = 40%</p>
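<p>In effect, equality of opportunity asks for an equal true positive rate (acceptance rate among the qualified) across groups. With the counts from the worked example:</p>

```python
# Counts from the worked example above.
qualified = {"majority": 35, "minority": 15}
qualified_accepted = {"majority": 14, "minority": 6}

# True positive rate (acceptance rate among qualified candidates) per group.
tpr = {g: qualified_accepted[g] / qualified[g] for g in qualified}
print(tpr)  # both groups at 0.4, so equality of opportunity is satisfied
```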

<p>Equality of opportunity has limitations, including reliance on a clearly defined preferred label and challenges in settings that lack demographic data.</p>

<p>It is <strong>possible for a model to satisfy both demographic parity and equality of opportunity</strong> under specific conditions where positive prediction rates and true positive rates align across groups.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/fairness/equality-of-opportunity">Fairness: Equality of opportunity | Machine Learning | Google for Developers</a></p>

<h4 id="counterfactual-fairness">Counterfactual fairness</h4>

<p>Counterfactual fairness evaluates fairness by <strong>comparing predictions for similar individuals</strong> who differ only in a sensitive attribute such as demographic group.</p>

<p>This metric is particularly useful when datasets lack complete demographic information for most examples but contain it for a subset.</p>
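<p>A minimal sketch of a counterfactual check, assuming a hypothetical <code>model</code> callable that maps a feature dict to a score: flip the sensitive attribute, re-run the model, and compare the predictions:</p>

```python
def counterfactually_fair(model, example, attribute, values, tol=1e-6):
    """Check whether changing only the sensitive attribute moves the prediction."""
    predictions = []
    for v in values:
        counterfactual = dict(example)  # copy, then flip the attribute
        counterfactual[attribute] = v
        predictions.append(model(counterfactual))
    return max(predictions) - min(predictions) <= tol

# Toy model that ignores the sensitive attribute entirely.
model = lambda ex: 0.7 * ex["score"]
example = {"score": 0.9, "group": "blue"}
print(counterfactually_fair(model, example, "group", ["blue", "orange"]))  # True
```

<p>A model whose score shifts when only <code>group</code> changes would fail this check, even if demographic data is available for just a subset of examples.</p>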

<p>Candidate pool, with demographic group membership unknown for most candidates (icons shaded in grey):
<img src="https://media.portblue.net/resources/251229_ml-crash-course/fairness_metrics_counterfactual_satisfied.png" alt="fairness_metrics_counterfactual_satisfied.png"/></p>

<p>Counterfactual fairness may not capture broader systemic biases across subgroups. Other fairness metrics, such as demographic parity and equality of opportunity, provide a more holistic view but may require complete demographic data.</p>

<p><strong>Summary</strong></p>

<p>Selecting the appropriate fairness metric depends on the specific application and desired outcome, with no single “right” metric universally applicable.</p>

<p>For example, if the goal is to achieve equal representation, demographic parity may be the optimal metric. If the goal is to achieve equal opportunity, equality of opportunity may be the best metric.</p>

<p>Some definitions of fairness are mutually incompatible.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/fairness/counterfactual-fairness">Fairness: Counterfactual fairness | Machine Learning | Google for Developers</a></p>
]]></content:encoded>
      <guid>https://stefan.angrick.me/google-ml-crash-course-4-notes-real-world-ml</guid>
      <pubDate>Mon, 29 Dec 2025 10:04:59 +0000</pubDate>
    </item>
    <item>
      <title>Google ML Crash Course #3 Notes: Advanced ML Models</title>
      <link>https://stefan.angrick.me/google-ml-crash-course-3-notes-advanced-ml-models?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[This post is part of a four-part summary of Google&#39;s Machine Learning Crash Course. For context, check out this post. This third module covers advanced ML model architectures.!--more--&#xA;&#xA;Neural networks&#xA;&#xA;Introduction&#xA;&#xA;Neural networks are a model architecture designed to automatically identify non-linear patterns in data, eliminating the need for manual feature cross experimentation.&#xA;&#xA;Source: Neural networks | Machine Learning | Google for Developers&#xA;&#xA;Nodes and hidden layers&#xA;&#xA;In neural network terminology, additional layers between the input layer and the output layer are called hidden layers, and the nodes in these layers are called neurons.&#xA;HiddenLayerBigPicture.png&#xA;&#xA;Source: Neural networks: Nodes and hidden layers | Machine Learning | Google for Developers&#xA;&#xA;Activation functions&#xA;&#xA;Each neuron in a neural network performs the following two-step action:&#xA;&#xA;Calculates the weighted sum of input values.&#xA;Applies an activation function to that sum.&#xA;&#xA;Common activation functions include sigmoid, tanh, and ReLU.&#xA;&#xA;The sigmoid function maps input x to an output value between 0 and 1:&#xA;$$&#xA;F(x) = \frac{1}{1 + e^{-x}}&#xA;$$&#xA;sigmoid.png&#xA;&#xA;The tanh function (short for &#34;hyperbolic tangent&#34;) maps input x to an output value between -1 and 1:&#xA;$$&#xA;F(x) = \tanh{(x)}&#xA;$$&#xA;tanh.png&#xA;&#xA;The rectified linear unit activation function (or ReLU, for short) applies a simple rule:&#xA;&#xA;If the input value is less than 0, return 0.&#xA;If the input value is greater than or equal to 0, return the input value.&#xA;$$&#xA;F(x) = \max{(0,x)}&#xA;$$&#xA;&#xA;ReLU often outperforms sigmoid and tanh because it reduces vanishing gradient issues and requires less computation.&#xA;relu.png&#xA;&#xA;A neural network consists of:&#xA;&#xA;A set of nodes, analogous to neurons, organised in layers.&#xA;A set of learned weights and 
biases connecting layers.&#xA;Activation functions that transform each node&#39;s output, which may differ across layers.&#xA;&#xA;Source: Neural networks: Activation functions | Machine Learning | Google for Developers&#xA;&#xA;Training using backpropagation&#xA;&#xA;Backpropagation is the primary training algorithm for neural networks. It calculates how much each weight and bias in the network contributed to the overall prediction error by applying the chain rule of calculus. It works backwards from the output layer to tell the gradient descent algorithm which equations to adjust to reduce loss.&#xA;&#xA;In practice, this involves a forward pass, where the network makes a prediction and the loss function measures the error, followed by a backward pass that propagates that error back through the layers to compute gradients for each parameter.&#xA;&#xA;Best practices for neural network training:&#xA;&#xA;Vanishing gradients occur when gradients in earlier layers become very small, slowing or stalling training, and can be mitigated by using the ReLU activation function.&#xA;Exploding gradients happen when large weights cause excessively large gradients in early layers, disrupting convergence, and can be addressed with batch normalisation or by lowering the learning rate.&#xA;Dead ReLU units emerge when a ReLU unit&#39;s output gets stuck at 0, halting gradient flow during backpropagation, and can be avoided by lowering the learning rate or using ReLU variants like LeakyReLU.&#xA;Dropout regularisation is a technique to prevent overfitting by randomly dropping unit activations in a network for a single gradient step, with higher dropout rates indicating stronger regularisation (0 = no regularisation, 1 = drop out all nodes).&#xA;&#xA;Source: Neural Networks: Training using backpropagation | Machine Learning | Google for Developers&#xA;&#xA;Multi-class classification&#xA;&#xA;Multi-class classification models predict from multiple possibilities (binary classification 
models predict just two).&#xA;&#xA;Multi-class classification can be achieved through two main approaches:&#xA;&#xA;One-vs.-all&#xA;One-vs.-one (softmax)&#xA;&#xA;One-vs.-all uses multiple binary classifiers, one for each possible outcome, to determine the probability of each class independently.&#xA;onevsallbinaryclassifiers.png&#xA;&#xA;This approach is fairly reasonable when the total number of classes is small.&#xA;&#xA;We can create a more efficient one-vs.-all model with a deep neural network in which each output node represents a different class.&#xA;onevsallneuralnet.png&#xA;&#xA;Note that the probabilities do not sum to 1. With a one-vs.-all approach, the probability of each binary set of outcomes is determined independently of all the other sets (the sigmoid function is applied to each output node independently).&#xA;&#xA;One-vs.-one (softmax) predicts probabilities of each class relative to all other classes, ensuring all probabilities sum to 1 using the softmax function in the output layer. It assigns decimal probabilities to each class such that all probabilities add up to 1.0. This additional constraint helps training converge more quickly.&#xA;&#xA;Note that the softmax layer must have the same number of nodes as the output layer.&#xA;onevsoneneuralnet.png&#xA;&#xA;The softmax formula extends logistic regression to multiple classes:&#xA;$$&#xA;p(y = j|\textbf{x}) = \frac{e^{(\textbf{w}\j^{T}\textbf{x} + b\j)}}{\sum\{k\in K} e^{(\textbf{w}\k^{T}\textbf{x} + b\k)}}&#xA;$$&#xA;&#xA;Full softmax is fairly cheap when the number of classes is small but can become computationally expensive with many classes.&#xA;&#xA;Candidate sampling offers an alternative for increased efficiency. It computes probabilities for all positive labels but only a random sample of negative labels. 
For example, if we are interested in determining whether an input image is a beagle or a bloodhound, we do not have to provide probabilities for every non-dog example.&#xA;&#xA;One label versus many labels&#xA;&#xA;Softmax assumes that each example is a member of exactly one class. Some examples, however, can simultaneously be a member of multiple classes. For multi-label problems, use multiple independent logistic regressions instead.&#xA;&#xA;Example: To classify dog breeds from images, including mixed-breed dogs, use one-vs.-all, since it predicts each breed independently and can assign high probabilities to multiple breeds, unlike softmax, which forces probabilities to sum to 1.&#xA;&#xA;Source: Neural networks: Multi-class classification | Machine Learning | Google for Developers&#xA;&#xA;Embeddings&#xA;&#xA;Introduction&#xA;&#xA;Embeddings are lower-dimensional representations of sparse data that address problems associated with one-hot encodings.&#xA;&#xA;A one-hot encoded feature &#34;meal&#34; of 5,000 popular meal items:&#xA;foodimagesonehotencodings.png&#xA;&#xA;This representation of data has several problems:&#xA;&#xA;Large input vectors mean a huge number of weights for a neural network.&#xA;The more weights in your model, the more data you need to train effectively.&#xA;The more weights, the more computation required to train and use the model.&#xA;The more weights in your model, the more memory is needed on the accelerators that train and serve it.&#xA;Poor suitability for on-device machine learning (ODML).&#xA;&#xA;Embeddings, lower-dimensional representations of sparse data, address these issues.&#xA;&#xA;Source: Embeddings | Machine Learning | Google for Developers&#xA;&#xA;Embedding space and static embeddings&#xA;&#xA;Embeddings are low-dimensional representations of high-dimensional data, often used to capture semantic relationships between items.&#xA;&#xA;Embeddings place similar items closer together in the embedding space, allowing for 
efficient machine learning on large datasets.&#xA;&#xA;Example of a 1D embedding of a sparse feature vector representing meal items:&#xA;embeddings1D.png&#xA;&#xA;2D embedding:&#xA;embeddings2D.png&#xA;&#xA;3D embedding:&#xA;embeddings3Dtangyuan.png&#xA;&#xA;Distances in the embedding space represent relative similarity between items.&#xA;&#xA;Real-world embeddings can encode complex relationships, such as those between countries and their capitals, allowing models to detect patterns.&#xA;&#xA;In practice, embedding spaces have many more than three dimensions, although far fewer than the original data, and the meaning of individual dimensions is often unclear.&#xA;&#xA;Embeddings usually are task-specific, but one task with broad applicability is predicting the context of a word.&#xA;&#xA;Static embeddings like word2vec represent all meanings of a word with a single point, which can be a limitation in some cases. When each word or data point has a single embedding vector, this is called a static embedding.&#xA;&#xA;word2vec can refer both to an algorithm for obtaining static word embeddings and to a set of word vectors that were pre-trained with that algorithm.&#xA;&#xA;Source: Embeddings: Embedding space and static embeddings | Machine Learning | Google for Developers&#xA;&#xA;Obtaining embeddings&#xA;&#xA;Embeddings can be created using dimensionality reduction techniques such as PCA or by training them as part of a neural network.&#xA;&#xA;Training an embedding within a neural network allows customisation for specific tasks, where the embedding layer learns optimal weights to represent data in a lower-dimensional space, but it may take longer than training the embedding separately.&#xA;&#xA;In general, you can create a hidden layer of size d in your neural network that is designated as the embedding layer, where d represents both the number of nodes in the hidden layer and the number of dimensions in the embedding 
space.&#xA;onehothotdogembedding.png&#xA;&#xA;Word embeddings, such as word2vec, leverage the distributional hypothesis to map semantically similar words to geometrically close vectors. However, such static word embeddings have limitations because they assign a single representation per word.&#xA;&#xA;Contextual embeddings offer multiple representations based on context. For example, &#34;orange&#34; would have a different embedding for every unique sentence containing the word in the dataset (as it could be used as a colour or a fruit).&#xA;&#xA;Contextual embeddings encode positional information, while static embeddings do not. Because contextual embeddings incorporate positional information, one token can have multiple contextual embedding vectors. Static embeddings allow only a single representation of each token.&#xA;&#xA;Methods for creating contextual embeddings include ELMo, BERT, and transformer models with a self-attention layer.&#xA;&#xA;Source: Embeddings: Obtaining embeddings | Machine Learning | Google for Developers&#xA;&#xA;Large language models&#xA;&#xA;Introduction&#xA;&#xA;A language model estimates the probability of a token or sequence of tokens given surrounding text, enabling tasks such as text generation, translation, and summarisation.&#xA;&#xA;Tokens, the atomic units of language modelling, represent words, subwords, or characters and are crucial for understanding and processing language.&#xA;&#xA;Example: &#34;unwatched&#34; would be split into three tokens: un (the prefix), watch (the root), ed (the suffix).&#xA;&#xA;N-grams are ordered sequences of words used to build language models, where N is the number of words in the sequence.&#xA;&#xA;Short N-grams capture too little information, while very long N-grams fail to generalise due to insufficient repeated examples in training data (sparsity issues).&#xA;&#xA;Recurrent neural networks improve on N-grams by processing sequences token by token and learning which past information to retain 
or discard, allowing them to model longer dependencies across sentences and gain more context.&#xA;&#xA;Note that training recurrent neural networks for long contexts is constrained by the vanishing gradient problem.&#xA;&#xA;Model performance depends on training data size and diversity.&#xA;&#xA;While recurrent neural networks improve context understanding compared to N-grams, they have limitations, paving the way for the emergence of large language models that evaluate the whole context simultaneously.&#xA;&#xA;Source: Large language models | Machine Learning | Google for Developers&#xA;&#xA;What&#39;s a large language model?&#xA;&#xA;Large language models (LLMs) predict sequences of tokens and outperform previous models because they use far more parameters and exploit much wider context.&#xA;&#xA;Transformers form the dominant architecture for LLMs and typically combine an encoder that converts input text into an intermediate representation with a decoder that generates output text, for example translating between languages.&#xA;TransformerBasedTranslator.png&#xA;&#xA;Partial transformers&#xA;&#xA;Encoder-only models focus on representation learning and embeddings (which may serve as input for another system), while decoder-only models specialise in generating long sequences such as dialogue or text continuations.&#xA;&#xA;Self-attention allows the model to weigh the importance of different words in relation to each other, enhancing context understanding.&#xA;&#xA;Example: &#34;The animal didn&#39;t cross the street because it was too tired.&#34;&#xA;&#xA;The self-attention mechanism determines the relevance of each nearby word to the pronoun &#34;it&#34;. The bluer the line, the more important that word is to the pronoun it. 
As shown, &#34;animal&#34; is more important than &#34;street&#34; to the pronoun &#34;it&#34;.&#xA;Theanimaldidntcrossthestreet.png&#xA;&#xA;Some self-attention mechanisms are bidirectional, meaning they calculate relevance scores for tokens preceding and following the word being attended to. This is useful for generating representations of whole sequences (encoders).&#xA;By contrast, a unidirectional self-attention mechanism can gather context only from words on one side of the word being attended to. This suits applications that generate sequences token by token (decoders).&#xA;&#xA;Multi-head multi-layer self-attention&#xA;&#xA;Each self-attention layer contains multiple self-attention heads. The output of a layer is a mathematical operation (such as a weighted average or dot product) of the outputs of the different heads.&#xA;&#xA;A complete transformer model stacks multiple self-attention layers. The output from one layer becomes the input for the next, allowing the model to build increasingly complex representations, from basic syntax to more nuanced concepts.&#xA;&#xA;Self-attention is an O(N^2  S  D) problem.&#xA;&#xA;N is the number of tokens in the context.&#xA;S is the number of self-attention layers.&#xA;D is the number of heads per layer.&#xA;&#xA;LLMs are trained using masked predictions on massive datasets, enabling them to learn patterns and generate text based on probabilities. You probably will never train an LLM from scratch.&#xA;&#xA;Instruction tuning can improve an LLM&#39;s ability to follow instructions.&#xA;&#xA;Why transformers are so large&#xA;&#xA;This course generally recommends building models with a smaller number of parameters, but research shows that transformers with more parameters consistently achieve better performance.&#xA;&#xA;Text generation&#xA;&#xA;LLMs generate text by repeatedly predicting the most probable next token, effectively acting as highly powerful autocomplete systems. 
You can think of a user&#39;s question to an LLM as the &#34;given&#34; sentence followed by a masked response.&#xA;&#xA;Benefits and problems&#xA;&#xA;While LLMs offer benefits such as clear text generation, they also present challenges.&#xA;&#xA;Training an LLM involves gathering enormous training sets, consuming vast computational resources and electricity, and solving parallelism challenges.&#xA;Using an LLM for inference raises issues such as hallucinations, high computational and electricity costs, and bias.&#xA;&#xA;Source: LLMs: What&#39;s a large language model? | Machine Learning | Google for Developers&#xA;&#xA;Fine-tuning, distillation, and prompt engineering&#xA;&#xA;General-purpose LLMs, also known as foundation LLMs, base LLMs, or pre-trained LLMs, are pre-trained on vast amounts of text, enabling them to understand language structure and generate creative content, but they act as platforms rather than complete solutions for tasks such as classification or regression.&#xA;&#xA;Fine-tuning updates the parameters of a model to improve its performance on a specialised task, improving prediction quality.&#xA;&#xA;Adapts a foundation LLM to a specific task by training on task-specific examples, often only hundreds or thousands, which improves performance for that task but retains the original model size (same number of parameters) and can still be computationally expensive.&#xA;Parameter-efficient tuning reduces fine-tuning costs by updating only a subset of model parameters during training rather than all weights and biases.&#xA;&#xA;Distillation aims to reduce model size, typically at the cost of some prediction quality.&#xA;&#xA;Distillation compresses an LLM into a smaller student model that runs faster and uses fewer resources, at the cost of some predictive accuracy.&#xA;It typically uses a large teacher model to label data, often with rich numerical scores rather than simple labels, and trains a smaller student model on those 
outputs.&#xA;&#xA;Prompt engineering allows users to customise an LLM&#39;s output by providing examples or instructions within the prompt, leveraging the model&#39;s existing pattern-recognition abilities without changing its parameters.&#xA;&#xA;One-shot, few-shot, and zero-shot prompting differ by how many examples the prompt provides, with more examples usually improving reliability by giving clearer context.&#xA;&#xA;Prompt engineering does not alter the model&#39;s parameters. Prompts leverage the pattern-recognition abilities of the existing LLM.&#xA;&#xA;Offline inference pre-computes and caches LLM predictions for tasks where real-time response is not critical, saving resources and enabling the use of larger models.&#xA;&#xA;Responsible use of LLMs requires awareness that models inherit biases from their training and distillation data.&#xA;&#xA;Source: LLMs: Fine-tuning, distillation, and prompt engineering | Machine Learning | Google for Developers]]&gt;</description>
      <content:encoded><![CDATA[<p>This post is part of a four-part summary of Google&#39;s <a href="https://developers.google.com/machine-learning/crash-course/">Machine Learning Crash Course</a>. For context, check out <a href="notes-from-googles-machine-learning-crash-course">this post</a>. This third module covers advanced ML model architectures.</p>

<h2 id="neural-networks">Neural networks</h2>

<h3 id="introduction">Introduction</h3>

<p>Neural networks are a model architecture designed to automatically identify non-linear patterns in data, eliminating the need for manual feature cross experimentation.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/neural-networks">Neural networks | Machine Learning | Google for Developers</a></p>

<h3 id="nodes-and-hidden-layers">Nodes and hidden layers</h3>

<p>In neural network terminology, additional layers between the input layer and the output layer are called <strong>hidden layers</strong>, and the nodes in these layers are called <strong>neurons</strong>.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/HiddenLayerBigPicture.png" alt="HiddenLayerBigPicture.png"/></p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/neural-networks/nodes-hidden-layers">Neural networks: Nodes and hidden layers | Machine Learning | Google for Developers</a></p>

<h3 id="activation-functions">Activation functions</h3>

<p>Each <strong>neuron</strong> in a neural network performs the following two-step action:</p>
<ul><li>Calculates the weighted sum of input values.</li>
<li>Applies an activation function to that sum.</li></ul>

<p>Common activation functions include sigmoid, tanh, and ReLU.</p>

<p>The <strong>sigmoid function</strong> maps input x to an output value <strong>between 0 and 1</strong>:
$$
F(x) = \frac{1}{1 + e^{-x}}
$$
<img src="https://media.portblue.net/resources/251229_ml-crash-course/sigmoid.png" alt="sigmoid.png"/></p>

<p>The <strong>tanh function</strong> (short for “hyperbolic tangent”) maps input x to an output value <strong>between -1 and 1</strong>:
$$
F(x) = \tanh{(x)}
$$
<img src="https://media.portblue.net/resources/251229_ml-crash-course/tanh.png" alt="tanh.png"/></p>

<p>The <strong>rectified linear unit activation function</strong> (or <strong>ReLU</strong>, for short) applies a simple rule:</p>
<ul><li>If the input value is less than 0, return 0.</li>
<li>If the input value is greater than or equal to 0, return the input value.
$$
F(x) = \max{(0,x)}
$$</li></ul>

<p>ReLU often outperforms sigmoid and tanh because it reduces <strong>vanishing gradient issues</strong> and requires less computation.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/relu.png" alt="relu.png"/></p>
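<p>The three activation functions can be sketched directly in NumPy (a minimal illustration, not the course's own code):</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # maps any input into (0, 1)

def tanh(x):
    return np.tanh(x)                # maps any input into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)        # 0 for x < 0, x otherwise

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```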

<p>A neural network consists of:</p>
<ul><li>A set of nodes, analogous to neurons, organised in layers.</li>
<li>A set of learned weights and biases connecting layers.</li>
<li>Activation functions that transform each node&#39;s output, which may differ across layers.</li></ul>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/neural-networks/activation-functions">Neural networks: Activation functions | Machine Learning | Google for Developers</a></p>

<h3 id="training-using-backpropagation">Training using backpropagation</h3>

<p><strong>Backpropagation</strong> is the primary training algorithm for neural networks. It calculates how much each weight and bias in the network contributed to the overall prediction error by applying the chain rule of calculus. It works backwards from the output layer to tell the gradient descent algorithm which equations to adjust to reduce loss.</p>

<p>In practice, this involves a forward pass, where the network makes a prediction and the loss function measures the error, followed by a backward pass that propagates that error back through the layers to compute gradients for each parameter.</p>
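<p>As a minimal sketch of one forward and backward pass, consider a single sigmoid neuron with a squared-error loss; the input, label, and initial weights below are made up for illustration:</p>

```python
import numpy as np

# One sigmoid neuron trained on one example.
x, y_true = 2.0, 1.0
w, b = 0.5, 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: prediction and squared-error loss.
z = w * x + b
y = sigmoid(z)
loss = (y - y_true) ** 2

# Backward pass: chain rule, dL/dw = dL/dy * dy/dz * dz/dw.
dL_dy = 2.0 * (y - y_true)
dy_dz = y * (1.0 - y)      # derivative of the sigmoid
dL_dw = dL_dy * dy_dz * x
dL_db = dL_dy * dy_dz

# Gradient descent nudges the parameters in the loss-reducing direction.
lr = 0.1
w, b = w - lr * dL_dw, b - lr * dL_db
```

<p>Real networks repeat exactly this pattern layer by layer, propagating the error gradient backwards from the output.</p>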

<p>Best practices for neural network training:</p>
<ul><li><strong>Vanishing gradients</strong> occur when gradients in earlier layers become very small, slowing or stalling training, and can be mitigated by using the ReLU activation function.</li>
<li><strong>Exploding gradients</strong> happen when large weights cause excessively large gradients in early layers, disrupting convergence, and can be addressed with batch normalisation or by lowering the learning rate.</li>
<li><strong>Dead ReLU units</strong> emerge when a ReLU unit&#39;s output gets stuck at 0, halting gradient flow during backpropagation, and can be avoided by lowering the learning rate or using ReLU variants like LeakyReLU.</li>
<li><strong>Dropout regularisation</strong> is a technique to prevent overfitting by randomly dropping unit activations in a network for a single gradient step, with higher dropout rates indicating stronger regularisation (0 = no regularisation, 1 = drop out all nodes).</li></ul>
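<p>Dropout itself can be sketched in a few lines of NumPy; the rescaling by 1/(1 − rate) is the common "inverted dropout" formulation, which the course does not spell out:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate):
    """Inverted dropout: zero a fraction `rate` of units, rescale the survivors."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

h = np.ones(8)
print(dropout(h, rate=0.5))  # roughly half the units zeroed, the rest scaled up
```

<p>The mask is redrawn for every gradient step, and dropout is switched off at inference time.</p>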

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/neural-networks/backpropagation">Neural Networks: Training using backpropagation | Machine Learning | Google for Developers</a></p>

<h3 id="multi-class-classification">Multi-class classification</h3>

<p><strong>Multi-class classification</strong> models predict from multiple possibilities (binary classification models predict just two).</p>

<p>Multi-class classification can be achieved through two main approaches:</p>
<ul><li>One-vs.-all</li>
<li>One-vs.-one (softmax)</li></ul>

<p><strong>One-vs.-all</strong> uses multiple binary classifiers, one for each possible outcome, to determine the probability of each class independently.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/one_vs_all_binary_classifiers.png" alt="one_vs_all_binary_classifiers.png"/></p>

<p>This approach is reasonably efficient when the total number of classes is small.</p>

<p>We can create a more efficient one-vs.-all model with a deep neural network in which each output node represents a different class.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/one_vs_all_neural_net.png" alt="one_vs_all_neural_net.png"/></p>

<p>Note that the probabilities do not sum to 1. With a one-vs.-all approach, the probability of each binary set of outcomes is determined independently of all the other sets (the sigmoid function is applied to each output node independently).</p>
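<p>The independence of the one-vs.-all outputs is easy to see numerically: applying a sigmoid to each output node separately yields probabilities that need not sum to 1 (the logits below are made up):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up logits from a network's three output nodes, one per class
logits = np.array([2.0, -1.0, 0.5])

# One-vs.-all: each node gets its own independent sigmoid
probs = sigmoid(logits)
print(round(probs.sum(), 2))   # 1.77: the probabilities do not sum to 1
```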

<p><strong>One-vs.-one (softmax)</strong> predicts the probability of each class relative to all the other classes, applying the softmax function in the output layer so that the decimal probabilities across classes sum to exactly 1.0. This additional constraint helps training converge more quickly.</p>

<p>Note that the softmax layer must have the same number of nodes as the output layer.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/one_vs_one_neural_net.png" alt="one_vs_one_neural_net.png"/></p>

<p>The softmax formula extends logistic regression to multiple classes:
$$
p(y = j|\textbf{x}) = \frac{e^{(\textbf{w}_j^{T}\textbf{x} + b_j)}}{\sum_{k\in K} e^{(\textbf{w}_k^{T}\textbf{x} + b_k)}}
$$</p>

<p>Full softmax is fairly cheap when the number of classes is small but can become computationally expensive with many classes.</p>
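<p>A direct implementation of the softmax formula above (with the standard max-subtraction trick for numerical stability, an implementation detail not covered in the course) looks like this:</p>

```python
import numpy as np

def softmax(logits):
    """Softmax over class logits.

    Subtracting the max logit first is a standard numerical-stability
    trick; it does not change the result.
    """
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, -1.0, 0.5]))
# Unlike independent sigmoids, these probabilities sum to 1 (up to rounding)
```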

<p><strong>Candidate sampling</strong> offers an alternative for increased efficiency. It computes probabilities for all positive labels but only a random sample of negative labels. For example, if we are interested in determining whether an input image is a beagle or a bloodhound, we do not have to provide probabilities for every non-dog example.</p>
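<p>Candidate sampling can be sketched in a few lines: score the positive class plus a small random sample of negatives instead of every class (the class count and sample size below are made up):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 10_000          # made-up total class count
positive = 42                 # the example's true (positive) class

# Score the positive class plus a small random sample of negatives,
# instead of computing probabilities for all 10,000 classes
sampled_negatives = rng.choice(num_classes, size=20, replace=False)
candidates = np.unique(np.append(sampled_negatives, positive))
print(len(candidates))        # at most 21, far fewer than num_classes
```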

<p><strong>One label versus many labels</strong></p>

<p>Softmax assumes that each example belongs to exactly one class. Some examples, however, can belong to multiple classes simultaneously. For such multi-label problems, use multiple independent logistic regressions instead.</p>

<p><strong>Example:</strong> To classify dog breeds from images, including mixed-breed dogs, use one-vs.-all, since it predicts each breed independently and can assign high probabilities to multiple breeds, unlike softmax, which forces probabilities to sum to 1.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/neural-networks/multi-class">Neural networks: Multi-class classification | Machine Learning | Google for Developers</a></p>

<h2 id="embeddings">Embeddings</h2>

<h3 id="introduction-1">Introduction</h3>

<p><strong>Embeddings</strong> are lower-dimensional representations of sparse data that address problems associated with one-hot encodings.</p>

<p>A one-hot encoded feature “meal” of 5,000 popular meal items:
<img src="https://media.portblue.net/resources/251229_ml-crash-course/food_images_one_hot_encodings.png" alt="food_images_one_hot_encodings.png"/></p>

<p>This representation of data has several problems:</p>
<ul><li>Large input vectors mean a <strong>huge number of weights</strong> for a neural network.</li>
<li>The more weights in your model, the <strong>more data you need</strong> to train effectively.</li>
<li>The more weights, the <strong>more computation required</strong> to train and use the model.</li>
<li>The more weights in your model, the <strong>more memory is needed</strong> on the accelerators that train and serve it.</li>
<li><strong>Poor suitability for on-device machine learning</strong> (ODML).</li></ul>

<p>Embeddings address these issues by representing the same information in far fewer dimensions.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/embeddings">Embeddings | Machine Learning | Google for Developers</a></p>

<h3 id="embedding-space-and-static-embeddings">Embedding space and static embeddings</h3>

<p><strong>Embeddings</strong> are low-dimensional representations of high-dimensional data, often used to capture <strong>semantic relationships</strong> between items.</p>

<p>Embeddings place similar items closer together in the embedding space, allowing for efficient machine learning on large datasets.</p>

<p>Example of a 1D embedding of a sparse feature vector representing meal items:
<img src="https://media.portblue.net/resources/251229_ml-crash-course/embeddings_1D.png" alt="embeddings_1D.png"/></p>

<p>2D embedding:
<img src="https://media.portblue.net/resources/251229_ml-crash-course/embeddings_2D.png" alt="embeddings_2D.png"/></p>

<p>3D embedding:
<img src="https://media.portblue.net/resources/251229_ml-crash-course/embeddings_3D_tangyuan.png" alt="embeddings_3D_tangyuan.png"/></p>

<p>Distances in the embedding space represent relative similarity between items.</p>
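<p>A common way to turn such distances into a similarity score is cosine similarity; the vectors below are made-up 3-D embeddings for illustration:</p>

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1 = same direction)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 3-D embeddings for three meal items
sandwich = np.array([0.9, 0.1, 0.0])
burger = np.array([0.8, 0.2, 0.1])
soup = np.array([0.0, 0.9, 0.4])

# "sandwich" is closer to "burger" than to "soup" in this toy space
print(cosine_similarity(sandwich, burger) > cosine_similarity(sandwich, soup))  # True
```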

<p><strong>Real-world embeddings</strong> can encode complex relationships, such as those between countries and their capitals, allowing models to detect patterns.</p>

<p>In practice, embedding spaces have many more than three dimensions, although far fewer than the original data, and the meaning of individual dimensions is often unclear.</p>

<p>Embeddings are usually task-specific, but one task with broad applicability is predicting the context of a word.</p>

<p>When each word or data point has a single embedding vector, it is called a <strong>static embedding</strong>. Static embeddings such as word2vec therefore represent all meanings of a word with a single point, which can be a limitation in some cases.</p>

<p><strong>word2vec</strong> can refer both to an algorithm for obtaining static word embeddings and to a set of word vectors that were pre-trained with that algorithm.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/embeddings/embedding-space">Embeddings: Embedding space and static embeddings | Machine Learning | Google for Developers</a></p>

<h3 id="obtaining-embeddings">Obtaining embeddings</h3>

<p>Embeddings can be created using <strong>dimensionality reduction techniques</strong> such as PCA or by training them as part of a neural network.</p>
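<p>As an illustration of the dimensionality-reduction route, the sketch below computes PCA via NumPy&#39;s SVD on made-up sparse data (the data, the dimension counts, and the use of plain SVD are assumptions, not the course&#39;s method):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sparse data: 100 examples, each a 50-dim multi-hot vector
X = (rng.random((100, 50)) < 0.05).astype(float)

# PCA via SVD: project centred data onto the top-d principal components
d = 8
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
embeddings = Xc @ Vt[:d].T        # shape (100, 8): the low-dim representation
```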

<p><strong>Training an embedding within a neural network</strong> allows customisation for specific tasks, where the embedding layer learns optimal weights to represent data in a lower-dimensional space, but it may take longer than training the embedding separately.</p>

<p>In general, you can create a hidden layer of size d in your neural network that is designated as the embedding layer, where d represents both the number of nodes in the hidden layer and the number of dimensions in the embedding space.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/one_hot_hot_dog_embedding.png" alt="one_hot_hot_dog_embedding.png"/></p>
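<p>The sketch below shows why such an embedding layer is cheap: multiplying a one-hot input by the layer&#39;s weight matrix is equivalent to looking up a single row (the sizes are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 5000, 8                  # 5,000 items, 8-dim embedding
E = rng.normal(size=(vocab_size, d))     # the embedding layer's weights

# Multiplying a one-hot vector by the weight matrix...
one_hot = np.zeros(vocab_size)
one_hot[1234] = 1.0
via_matmul = one_hot @ E

# ...selects a single row, so embedding layers are implemented as lookups
via_lookup = E[1234]
print(np.allclose(via_matmul, via_lookup))   # True
```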

<p>Word embeddings, such as word2vec, leverage the distributional hypothesis to map semantically similar words to geometrically close vectors. However, such static word embeddings have limitations because they assign a single representation per word.</p>

<p><strong>Contextual embeddings</strong> offer multiple representations based on context. For example, “orange” would have a different embedding for every unique sentence containing the word in the dataset (as it could be used as a colour or a fruit).</p>

<p>Because contextual embeddings encode positional information, which static embeddings do not, one token can have multiple contextual embedding vectors, whereas a static embedding allows only a single representation of each token.</p>

<p>Methods for creating contextual embeddings include <a href="https://wikipedia.org/wiki/ELMo">ELMo</a>, <a href="https://developers.google.com/machine-learning/glossary#bert-bidirectional-encoder-representations-from-transformers">BERT</a>, and transformer models with a <a href="https://developers.google.com/machine-learning/glossary#self-attention-also-called-self-attention-layer">self-attention</a> layer.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/embeddings/obtaining-embeddings">Embeddings: Obtaining embeddings | Machine Learning | Google for Developers</a></p>

<h2 id="large-language-models">Large language models</h2>

<h3 id="introduction-2">Introduction</h3>

<p>A <strong>language model</strong> estimates the probability of a token or sequence of tokens given surrounding text, enabling tasks such as text generation, translation, and summarisation.</p>

<p><strong>Tokens</strong>, the atomic units of language modelling, represent words, subwords, or characters and are crucial for understanding and processing language.</p>

<p>Example: “unwatched” would be split into three tokens: “un” (the prefix), “watch” (the root), and “ed” (the suffix).</p>

<p><strong>N-grams</strong> are ordered sequences of words used to build language models, where N is the number of words in the sequence.</p>

<p>Short N-grams capture too little information, while very long N-grams fail to generalise due to insufficient repeated examples in training data (sparsity issues).</p>
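<p>Extracting N-grams from a token sequence is a one-liner; the helper below is an illustrative sketch (the sentence is made up):</p>

```python
from collections import Counter

def ngrams(tokens, n):
    """All ordered n-token sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(ngrams(tokens, 2)[:2])   # [('the', 'cat'), ('cat', 'sat')]
```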

<p><strong>Recurrent neural networks</strong> improve on N-grams by processing sequences token by token and learning which past information to retain or discard, allowing them to model longer dependencies across sentences and gain more context.</p>
<ul><li>Note that training recurrent neural networks for long contexts is constrained by the vanishing gradient problem.</li></ul>

<p>Model performance depends on training data size and diversity.</p>

<p>While recurrent neural networks improve context understanding compared to N-grams, they have limitations, paving the way for the emergence of <strong>large language models</strong> that evaluate the whole context simultaneously.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/llm">Large language models | Machine Learning | Google for Developers</a></p>

<h3 id="what-s-a-large-language-model">What&#39;s a large language model?</h3>

<p><strong>Large language models (LLMs)</strong> predict sequences of tokens and outperform previous models because they use far more parameters and exploit much wider context.</p>

<p><strong>Transformers</strong> form the dominant architecture for LLMs and typically combine an encoder that converts input text into an intermediate representation with a decoder that generates output text, for example translating between languages.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/TransformerBasedTranslator.png" alt="TransformerBasedTranslator.png"/></p>

<p><strong>Partial transformers</strong></p>

<p><strong>Encoder-only models</strong> focus on representation learning and embeddings (which may serve as input for another system), while <strong>decoder-only models</strong> specialise in generating long sequences such as dialogue or text continuations.</p>

<p><strong>Self-attention</strong> allows the model to weigh the importance of different words in relation to each other, enhancing context understanding.</p>

<p>Example: “The animal didn&#39;t cross the street because it was too tired.”</p>

<p>The self-attention mechanism determines the relevance of each nearby word to the pronoun “it”: the bluer the line, the more important that word is. As shown, “animal” is more relevant than “street”.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/Theanimaldidntcrossthestreet.png" alt="Theanimaldidntcrossthestreet.png"/></p>
<ul><li>Some self-attention mechanisms are <strong>bidirectional</strong>, meaning they calculate relevance scores for tokens preceding and following the word being attended to. This is useful for generating representations of whole sequences (encoders).</li>
<li>By contrast, a <strong>unidirectional</strong> self-attention mechanism can gather context only from words on one side of the word being attended to. This suits applications that generate sequences token by token (decoders).</li></ul>

<p><strong>Multi-head multi-layer self-attention</strong></p>

<p>Each self-attention layer contains multiple <strong>self-attention heads</strong>. The output of a layer is a mathematical operation (such as a weighted average or dot product) of the outputs of the different heads.</p>

<p>A complete transformer model stacks multiple <strong>self-attention layers</strong>. The output from one layer becomes the input for the next, allowing the model to build increasingly complex representations, from basic syntax to more nuanced concepts.</p>

<p>The cost of self-attention is O(N^2 * S * D), where:</p>
<ul><li>N is the number of tokens in the context.</li>
<li>S is the number of self-attention layers.</li>
<li>D is the number of heads per layer.</li></ul>
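<p>The quadratic term comes from the N x N score matrix. A single head of scaled dot-product self-attention can be sketched in NumPy as follows (the shapes and random weights are illustrative, not a production implementation):</p>

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One head of scaled dot-product self-attention (minimal sketch)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (N, N) relevance scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)           # softmax over each row
    return w @ V                                    # context-weighted values

rng = np.random.default_rng(0)
N, D = 5, 4                                         # 5 tokens, 4-dim vectors
X = rng.normal(size=(N, D))
Wq, Wk, Wv = [rng.normal(size=(D, D)) for _ in range(3)]
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                                    # (5, 4)
```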

<p>LLMs are <strong>trained</strong> using masked predictions on massive datasets, enabling them to learn patterns and generate text based on probabilities. You probably will never train an LLM from scratch.</p>

<p>Instruction tuning can improve an LLM&#39;s ability to follow instructions.</p>

<p><strong>Why transformers are so large</strong></p>

<p>This course generally recommends building models with a smaller number of parameters, but research shows that transformers with <strong>more parameters</strong> consistently achieve better performance.</p>

<p><strong>Text generation</strong></p>

<p>LLMs generate text by repeatedly predicting the most probable next token, effectively acting as highly powerful autocomplete systems. You can think of a user&#39;s question to an LLM as the “given” sentence followed by a masked response.</p>

<p><strong>Benefits and problems</strong></p>

<p>While LLMs offer benefits such as clear text generation, they also present challenges.</p>
<ul><li><strong>Training an LLM</strong> involves gathering enormous training sets, consuming vast computational resources and electricity, and solving parallelism challenges.</li>
<li><strong>Using an LLM</strong> for inference raises issues such as hallucinations, high computational and electricity costs, and bias.</li></ul>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/llm/transformers">LLMs: What&#39;s a large language model? | Machine Learning | Google for Developers</a></p>

<h3 id="fine-tuning-distillation-and-prompt-engineering">Fine-tuning, distillation, and prompt engineering</h3>

<p>General-purpose LLMs, also known as <strong>foundation LLMs</strong>, base LLMs, or pre-trained LLMs, are pre-trained on vast amounts of text, enabling them to understand language structure and generate creative content, but they act as platforms rather than complete solutions for tasks such as classification or regression.</p>

<p><strong>Fine-tuning</strong> updates the parameters of a pre-trained model to improve its prediction quality on a specialised task.</p>
<ul><li>Adapts a foundation LLM to a specific task by training on task-specific examples, often only hundreds or thousands, which <strong>improves performance for that task</strong> but retains the original model size (same number of parameters) and can still be computationally expensive.</li>
<li>Parameter-efficient tuning reduces fine-tuning costs by updating only a subset of model parameters during training rather than all weights and biases.</li></ul>

<p><strong>Distillation</strong> aims to reduce model size, typically at the cost of some prediction quality.</p>
<ul><li>Distillation compresses an LLM into a smaller student model that runs faster and uses fewer resources, at the <strong>cost of some predictive accuracy</strong>.</li>
<li>It typically uses a large teacher model to label data, often with rich numerical scores rather than simple labels, and trains a smaller student model on those outputs.</li></ul>

<p><strong>Prompt engineering</strong> allows users to customise an LLM&#39;s output by providing examples or instructions within the prompt, leveraging the model&#39;s existing pattern-recognition abilities without changing its parameters.</p>

<p><strong>One-shot</strong>, <strong>few-shot</strong>, and <strong>zero-shot</strong> prompting differ by how many examples the prompt provides, with more examples usually improving reliability by giving clearer context.</p>
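<p>The difference between zero-shot and few-shot prompting is just the number of worked examples included in the prompt text, as in this made-up sentiment task:</p>

```python
# Zero-shot: the instruction alone, no worked examples
zero_shot = "Classify the sentiment of this review: 'The food was cold.'"

# Few-shot: the same task, with made-up worked examples prepended
few_shot = (
    "Classify the sentiment of each review.\n"
    "Review: 'Loved every minute!' -> positive\n"
    "Review: 'Never going back.' -> negative\n"
    "Review: 'The food was cold.' ->"
)
```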

<p>Prompt engineering does not alter the model&#39;s parameters. Prompts leverage the pattern-recognition abilities of the existing LLM.</p>

<p><strong>Offline inference</strong> pre-computes and caches LLM predictions for tasks where real-time response is not critical, saving resources and enabling the use of larger models.</p>

<p><strong>Responsible use of LLMs</strong> requires awareness that models inherit biases from their training and distillation data.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/llm/tuning">LLMs: Fine-tuning, distillation, and prompt engineering | Machine Learning | Google for Developers</a></p>
]]></content:encoded>
      <guid>https://stefan.angrick.me/google-ml-crash-course-3-notes-advanced-ml-models</guid>
      <pubDate>Mon, 29 Dec 2025 10:04:16 +0000</pubDate>
    </item>
    <item>
      <title>Google ML Crash Course #2 Notes: Data</title>
      <link>https://stefan.angrick.me/google-ml-crash-course-2-notes-data?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[This post is part of a four-part summary of Google&#39;s Machine Learning Crash Course. For context, check out this post. This second module covers fundamental techniques and best practices for working with machine learning data.!--more--&#xA;&#xA;Working with numerical data&#xA;&#xA;Introduction&#xA;&#xA;Numerical data: Integers or floating-point values that behave like numbers. They are additive, countable, ordered, and so on. Examples include temperature, weight, or the number of deer wintering in a nature preserve.&#xA;&#xA;Source: Working with numerical data | Machine Learning | Google for Developers&#xA;&#xA;How a model ingests data with feature vectors&#xA;&#xA;A machine learning model ingests data through floating-point arrays called feature vectors, which are derived from dataset features. Feature vectors often utilise processed or transformed values instead of raw dataset values to enhance model learning.&#xA;&#xA;Example of a feature vector: [0.13, 0.47]&#xA;&#xA;Feature engineering is the process of converting raw data into suitable representations for the model. 
Common techniques are:&#xA;&#xA;Normalization: Converting numerical values into a standard range.&#xA;Binning (bucketing): Converting numerical values into buckets or ranges.&#xA;&#xA;Non-numerical data like strings must be converted into numerical values for use in feature vectors.&#xA;&#xA;Source: Numerical data: How a model ingests data using feature vectors | Machine Learning | Google for Developers&#xA;&#xA;First steps&#xA;&#xA;Before creating feature vectors, it is crucial to analyse numerical data to detect anomalies and patterns in the data, which aids in identifying potential issues early in the data analysis process.&#xA;&#xA;Visualising it through plots and graphs (like scatter plots or histograms)&#xA;Calculating basic statistics like mean, median, standard deviation, or values at the quartile divisions (0th, 25th, 50th, 75th, 100th percentiles, where the 50th percentile is the median)&#xA;&#xA;Outliers, values significantly distant from others, should be identified and handled appropriately.&#xA;&#xA;The outlier is due to a mistake: For example, an experimenter incorrectly entered data, or an instrument malfunctioned. We generally delete examples containing mistake outliers.&#xA;If the outlier is a legitimate data point: If the model needs to infer good predictions on these outliers, keep them. If not, delete them or apply more invasive feature engineering techniques, such as clipping.&#xA;&#xA;A dataset probably contains outliers when:&#xA;&#xA;The delta between the 0th and 25th percentiles differs significantly from the delta between the 75th and 100th percentiles&#xA;The standard deviation is almost as high as the mean&#xA;&#xA;Source: Numerical data: First steps | Machine Learning | Google for Developers&#xA;&#xA;Normalization&#xA;&#xA;Data normalization is crucial for enhancing machine learning model performance by scaling features to a similar range. 
It is also recommended to normalise a single numeric feature that covers a wide range (for example, city population).&#xA;&#xA;Normalisation has the following benefits:&#xA;&#xA;Helps a model converge more quickly.&#xA;Helps models infer better predictions.&#xA;Helps avoid the NaN trap (large numerical values exceeding the floating-point precision limit and flipping into NaN values).&#xA;Helps the model learn appropriate weights (so the model does not pay too much attention to features with wide ranges).&#xA;&#xA;| Normalization technique | Formula                                                | When to use                                                                                      |&#xA;| ----------------------- | ------------------------------------------------------ | ------------------------------------------------------------------------------------------------ |&#xA;| Linear scaling          | $$x&#39;=\frac{x-x\\text{min}}{x\\text{max}-x\\text{min}}$$  | When the feature is mostly uniformly distributed across range; flat-shaped                       |&#xA;| Z-score scaling         | $$x&#39; = (x-\mu)/\sigma$$                                  | When the feature is normally distributed (peak close to mean); bell-shaped                       |&#xA;| Log scaling             | $$x&#39;=ln(x)$$                                             | When the feature distribution is heavy skewed on at least either side of tail; heavy Tail-shaped |&#xA;| Clipping                | If x   max, set $$x&#39;=max$$ If x &lt; min, set $$x&#39; = min$$ | When the feature contains extreme outliers                                                       |&#xA;&#xA;Source: Numerical data: Normalization | Machine Learning | Google for Developers&#xA;&#xA;Binning&#xA;&#xA;Binning (bucketing) is a feature engineering technique used to group numerical data into categories (bins). 
In many cases, this turns numerical data into categorical data.&#xA;&#xA;For example, if a feature X has values ranging from 15 to 425, we can apply binning to represent X as a feature vector divided into specific intervals:&#xA;&#xA;| Bin number | Range   | Feature vector            |&#xA;| ---------- | ------- | ------------------------- |&#xA;| 1          | 15-34   | [1.0, 0.0, 0.0, 0.0, 0.0] |&#xA;| 2          | 35-117  | [0.0, 1.0, 0.0, 0.0, 0.0] |&#xA;| 3          | 118-279 | [0.0, 0.0, 1.0, 0.0, 0.0] |&#xA;| 4          | 280-392 | [0.0, 0.0, 0.0, 1.0, 0.0] |&#xA;| 5          | 393-425 | [0.0, 0.0, 0.0, 0.0, 1.0] |&#xA;&#xA;Even though X is a single column in the dataset, binning causes a model to treat X as five separate features. Therefore, the model learns separate weights for each bin.&#xA;&#xA;Binning offers an alternative to scaling or clipping and is particularly useful for handling outliers and improving model performance on non-linear data.&#xA;&#xA;When to use: Binning works well when features exhibit a &#34;clumpy&#34; distribution, that is, the overall linear relationship between the feature and label is weak or nonexistent, or when feature values are clustered.&#xA;&#xA;Example: Number of shoppers versus temperature. 
By binning them, the model learns separate weights for each bin.&#xA;binningtemperaturevsshoppersdividedinto3bins.png&#xA;&#xA;While creating multiple bins is possible, it is generally recommended to avoid an excessive number, as this can lead to insufficient training examples per bin and increased feature dimensionality.&#xA;&#xA;Quantile bucketing is a specific binning technique that ensures each bin contains a roughly equal number of examples, which can be particularly useful for datasets with skewed distributions.&#xA;&#xA;Quantile buckets give extra information space to the large torso while compacting the long tail into a single bucket.&#xA;Equal intervals give extra information space to the long tail while compacting the large torso into a single bucket.&#xA;QuantileBucketing.png&#xA;&#xA;Source: Numerical data: Binning | Machine Learning | Google for Developers&#xA;&#xA;Scrubbing&#xA;&#xA;| Problem category            | Example                                                         |&#xA;| --------------------------- | --------------------------------------------------------------- |&#xA;| Omitted values              | A census taker fails to record a resident&#39;s age                 |&#xA;| Duplicate examples          | A server uploads the same logs twice                            |&#xA;| Out-of-range feature values | A human accidentally types an extra digit                       |&#xA;| Bad labels                  | A human evaluator mislabels a picture of an oak tree as a maple |&#xA;&#xA;You can use programs or scripts to identify and handle data issues such as omitted values, duplicates, and out-of-range feature values by removing or correcting them.&#xA;&#xA;Source: Numerical data: Scrubbing | Machine Learning | Google for Developers&#xA;&#xA;Qualities of good numerical features&#xA;&#xA;Good feature vectors require features that are clearly named and have obvious meanings to anyone on the project.&#xA;Data should be checked and tested for bad 
data or outliers, such as inappropriate values, before being used for training.&#xA;Features should be sensible, avoiding &#34;magic values&#34; that create discontinuities (for example, setting the value &#34;watch\time\in\seconds&#34; to -1 to indicate an absence of measurement); instead, use separate boolean features or new discrete values to indicate missing data.&#xA;&#xA;Source: Numerical data: Qualities of good numerical features | Machine Learning | Google for Developers&#xA;&#xA;Polynomial transformations&#xA;&#xA;Synthetic features, such as polynomial transforms, enable linear models to represent non-linear relationships by introducing new features based on existing ones.&#xA;&#xA;By incorporating synthetic features, linear regression models can effectively separate data points that are not linearly separable, using curves instead of straight lines. For example, we can separate two classes with y = x^2.&#xA;ftcross1.png&#xA;&#xA;Feature crosses, a related concept for categorical data, synthesise new features by combining existing features, further enhancing model flexibility.&#xA;&#xA;Source: Numerical data: Polynomial transforms | Machine Learning | Google for Developers&#xA;&#xA;Working with categorical data&#xA;&#xA;Introduction&#xA;&#xA;Categorical data has a specific set of possible values. Examples include species of animals, names of streets, whether or not an email is spam, and binned numbers.&#xA;&#xA;Categorical data can include numbers that behave like categories. 
An example is postal codes.&#xA;&#xA;Numerical data can be meaningfully multiplied.&#xA;Data that are native integer values should be represented as categorical data.&#xA;&#xA;Encoding means converting categorical or other data to numerical vectors that a model can train on.&#xA;&#xA;Preprocessing includes converting non-numerical data, such as strings, to floating-point values.&#xA;&#xA;Source: Working with categorical data | Machine Learning | Google for Developers&#xA;&#xA;Vocabulary and one-hot encoding&#xA;&#xA;Machine learning models require numerical input; therefore, categorical data such as strings must be converted to numerical representations.&#xA;&#xA;The term dimension is a synonym for the number of elements in a feature vector. Some categorical features are low dimensional. For example:&#xA;&#xA;| Feature name | # of categories | Sample categories              |&#xA;| ------------ | --------------- | ------------------------------ |&#xA;| snowedtoday | 2               | True, False                    |&#xA;| skilllevel  | 3               | Beginner, Practitioner, Expert |&#xA;| season       | 4               | Winter, Spring, Summer, Autumn |&#xA;| dayofweek  | 7               | Monday, Tuesday, Wednesday     |&#xA;| planet       | 8               | Mercury, Venus, Earth          |&#xA;| carcolour   | 8               | Red, Orange, Blue, Yellow      |&#xA;&#xA;When a categorical feature has a low number of possible categories, you can encode it as a vocabulary. This treats each category as a separate feature, allowing the model to learn distinct weights for each during training.&#xA;&#xA;One-hot encoding transforms categorical values into numerical vectors (arrays) of N elements, where N is the number of categories. 
Exactly one of the elements in a one-hot vector has the value 1.0; all the remaining elements have the value 0.0.&#xA;&#xA;| Feature  | Red | Orange | Blue | Yellow | Green | Black | Purple | Brown |&#xA;| -------- | --- | ------ | ---- | ------ | ----- | ----- | ------ | ----- |&#xA;| &#34;Red&#34;    | 1   | 0      | 0    | 0      | 0     | 0     | 0      | 0     |&#xA;| &#34;Orange&#34; | 0   | 1      | 0    | 0      | 0     | 0     | 0      | 0     |&#xA;| &#34;Blue&#34;   | 0   | 0      | 1    | 0      | 0     | 0     | 0      | 0     |&#xA;| &#34;Yellow&#34; | 0   | 0      | 0    | 1      | 0     | 0     | 0      | 0     |&#xA;| &#34;Green&#34;  | 0   | 0      | 0    | 0      | 1     | 0     | 0      | 0     |&#xA;| &#34;Black&#34;  | 0   | 0      | 0    | 0      | 0     | 1     | 0      | 0     |&#xA;| &#34;Purple&#34; | 0   | 0      | 0    | 0      | 0     | 0     | 1      | 0     |&#xA;| &#34;Brown&#34;  | 0   | 0      | 0    | 0      | 0     | 0     | 0      | 1     |&#xA;&#xA;It is the one-hot vector, not the string or the index number, that gets passed to the feature vector. The model learns a separate weight for each element of the feature vector.&#xA;&#xA;The end-to-end process to map categories to feature vectors:&#xA;vocabulary-index-sparse-feature.png&#xA;&#xA;In a true one-hot encoding, only one element has the value 1.0. In a variant known as multi-hot encoding, multiple values can be 1.0.&#xA;&#xA;A feature whose values are predominantly zero (or empty) is termed a sparse feature.&#xA;&#xA;Sparse representation efficiently stores one-hot encoded data by only recording the position of the &#39;1&#39; value to reduce memory usage.&#xA;&#xA;For example, the one-hot vector for &#34;car\colour&#34; &#34;Blue&#34; is: [0, 0, 1, 0, 0, 0, 0, 0].&#xA;Since the 1 is in position 2 (when starting the count at 0), the sparse representation is: 2.&#xA;&#xA;Notice that the sparse representation consumes far less memory. 
Importantly, the model must train on the one-hot vector, not the sparse representation.&#xA;&#xA;The sparse representation of a multi-hot encoding stores the positions of all the non-zero elements. For example, the sparse representation of a car that is both &#34;Blue&#34; and &#34;Black&#34; is 2, 5.&#xA;&#xA;Categorical features can have outliers. If &#34;car\colour&#34; includes rare values such as &#34;Mauve&#34; or &#34;Avocado&#34;, you can group them into one out-of-vocabulary (OOV) category. All rare colours go into this single bucket, and the model learns one weight for it.&#xA;&#xA;For high-dimensional categorical features with many categories, one-hot encoding might be inefficient, and embeddings or hashing (also called the hashing trick) are recommended.&#xA;&#xA;For example, a feature like &#34;words\in\english&#34; has around 500,000 categories.&#xA;Embeddings substantially reduce the number of dimensions, which helps the model train faster and infer predictions more quickly.&#xA;&#xA;Source: Categorical data: Vocabulary and one-hot encoding | Machine Learning | Google for Developers&#xA;&#xA;Common issues with categorical data&#xA;&#xA;Categorical data quality hinges on how categories are defined and labelled, impacting data reliability.&#xA;&#xA;Human-labelled data, known as &#34;gold labels&#34;, is generally preferred for training due to its higher quality, but it is essential to check for human errors and biases.&#xA;&#xA;Any two human beings may label the same example differently. 
The difference between human raters&#39; decisions is called inter-rater agreement.&#xA;Inter-rater agreement can be measured using kappa and intra-class correlation (Hallgren, 2012), or Krippendorff&#39;s alpha (Krippendorff, 2011).&#xA;&#xA;Machine-labelled data, or &#34;silver labels&#34;, can introduce biases or inaccuracies, necessitating careful quality checks and awareness of potential common-sense violations.&#xA;&#xA;For example, if a computer-vision model mislabels a photo of a chihuahua as a muffin, or a photo of a muffin as a chihuahua.&#xA;Similarly, a sentiment analyser that scores neutral words as -0.25, when 0.0 is the neutral value, might be scoring all words with an additional negative bias.&#xA;&#xA;High dimensionality in categorical data increases training complexity and costs, leading to techniques such as embeddings for dimensionality reduction.&#xA;&#xA;Source: Categorical data: Common issues | Machine Learning | Google for Developers&#xA;&#xA;Feature crosses&#xA;&#xA;Feature crosses are created by combining two or more categorical or bucketed features to capture interactions and non-linearities within a dataset.&#xA;&#xA;For example, consider a leaf dataset with the categorical features:&#xA;&#xA;&#34;edges&#34;, containing values {smooth, toothed, lobed}&#xA;&#34;arrangement&#34;, containing values {opposite, alternate}&#xA;&#xA;The feature cross, or Cartesian product, of these two features would be:&#xA;&#xA;{Smooth_Opposite, Smooth_Alternate, Toothed_Opposite, Toothed_Alternate, Lobed_Opposite, Lobed_Alternate}&#xA;&#xA;For example, if a leaf has a lobed edge and an alternate arrangement, the feature-cross vector will have a value of 1 for &#34;Lobed_Alternate&#34;, and a value of 0 for all other terms:&#xA;&#xA;{0, 0, 0, 0, 0, 1}&#xA;&#xA;This dataset could be used to classify leaves by tree species, since these characteristics do not vary within a species.&#xA;&#xA;Feature crosses are somewhat analogous to polynomial 
transforms.&#xA;&#xA;Feature crosses can be particularly effective when guided by domain expertise. It is often possible, though computationally expensive, to use neural networks to automatically find and apply useful feature combinations during training.&#xA;&#xA;Overuse of feature crosses with sparse features should be avoided, as it can lead to excessive sparsity in the resulting feature set. For example, if feature A is a 100-element sparse feature and feature B is a 200-element sparse feature, a feature cross of A and B yields a 20,000-element sparse feature.&#xA;&#xA;Source: Categorical data: Feature crosses | Machine Learning | Google for Developers&#xA;&#xA;Datasets, generalization, and overfitting&#xA;&#xA;Introduction&#xA;&#xA;Data quality significantly impacts model performance more than algorithm choice.&#xA;Machine learning practitioners typically dedicate a substantial portion of their project time (around 80%) to data preparation and transformation, including tasks such as dataset construction and feature engineering.&#xA;&#xA;Source: Datasets, generalization, and overfitting | Machine Learning | Google for Developers&#xA;&#xA;Data characteristics&#xA;&#xA;A machine learning model&#39;s performance is heavily reliant on the quality and quantity of the dataset it is trained on, with larger, high-quality datasets generally leading to better results.&#xA;&#xA;Datasets can contain various data types, including numerical, categorical, text, multimedia, and embedding vectors, each requiring specific handling for optimal model training.&#xA;&#xA;The following are common causes of unreliable data in datasets:&#xA;&#xA;Omitted values&#xA;Duplicate examples&#xA;Bad feature values&#xA;Bad labels&#xA;Bad sections of data&#xA;&#xA;Maintaining data quality involves addressing issues such as label errors, noisy features, and proper filtering to ensure the reliability of the dataset for accurate predictions.&#xA;&#xA;Incomplete examples with missing feature values 
should be handled by either deletion or imputation to avoid negatively impacting model training.&#xA;&#xA;When imputing missing values, use reliable methods such as mean/median imputation and consider adding an indicator column to signal imputed values to the model. For example, alongside temperature include &#34;temperature_is_imputed&#34;. This lets the model learn to trust real observations more than imputed ones.&#xA;&#xA;Source: Datasets: Data characteristics | Machine Learning | Google for Developers&#xA;&#xA;Labels&#xA;&#xA;Direct labels are generally preferred but often unavailable.&#xA;&#xA;Direct labels exactly match the prediction target and appear explicitly in the dataset, such as a &#34;bicycle_owner&#34; column for predicting bicycle ownership.&#xA;Proxy labels approximate the target and correlate with it, such as a bicycle magazine subscription as a signal of bicycle ownership.&#xA;&#xA;Use a proxy label when no direct label exists or when the direct concept resists easy numeric representation. 
Carefully evaluate proxy labels to ensure they are a suitable approximation.&#xA;&#xA;Human-generated labels, while offering flexibility and nuanced understanding, can be expensive to produce and prone to errors, requiring careful quality control.&#xA;&#xA;Models can train on a mix of automated and human-generated labels, but an extra set of human labels often adds complexity without sufficient benefit.&#xA;&#xA;Source: Datasets: Labels | Machine Learning | Google for Developers&#xA;&#xA;Imbalanced datasets&#xA;&#xA;Imbalanced datasets occur when one label (majority class) is significantly more frequent than another (minority class), potentially hindering model training on the minority class.&#xA;&#xA;Note: Accuracy is usually a poor metric for assessing a model trained on a class-imbalanced dataset.&#xA;&#xA;A highly imbalanced floral dataset containing far more sunflowers (200) than roses (2):&#xA;FloralDataset200Sunflowers2Roses.png&#xA;&#xA;During training, a model should learn two things:&#xA;&#xA;What each class looks like, that is, what feature values correspond to which class.&#xA;How common each class is, that is, what the relative distribution of the classes is.&#xA;&#xA;Standard training conflates these two goals. 
In contrast, a two-step technique of downsampling and upweighting the majority class separates these two goals, enabling the model to achieve both.&#xA;&#xA;Step 1: Downsample the majority class by training on only a small fraction of majority class examples, which makes an imbalanced dataset more balanced during training and increases the chance that each batch contains enough minority examples.&#xA;&#xA;For example, with a class-imbalanced dataset consisting of 99% majority class and 1% minority class examples, we could downsample the majority class by a factor of 25 to create a more balanced training set (80% majority class and 20% minority class).&#xA;&#xA;Downsampling the majority class by a factor of 25:&#xA;FloralDatasetDownsampling.png&#xA;&#xA;Step 2: Upweight the downsampled majority class by the same factor used for downsampling, so each majority class error counts proportionally more during training. This corrects the artificial class distribution and bias introduced by downsampling, because the training data no longer reflects real-world frequencies.&#xA;&#xA;Continuing the example from above, we must upweight the majority class by a factor of 25. 
That is, when the model mistakenly predicts the majority class, treat the loss as if it were 25 errors (multiply the regular loss by 25).&#xA;&#xA;Upweighting the majority class by a factor of 25:&#xA;FloralDatasetUpweighting.png&#xA;&#xA;Experiment with different downsampling and upweighting factors just as you would experiment with other hyperparameters.&#xA;&#xA;Benefits of this technique include a better model (the resultant model knows what each class looks like and how common each class is) and faster convergence.&#xA;&#xA;Source: Datasets: Class-imbalanced datasets | Machine Learning | Google for Developers&#xA;&#xA;Dividing the original dataset&#xA;&#xA;Machine learning models should be tested against unseen data.&#xA;&#xA;It is recommended to split the dataset into three subsets: training, validation, and test sets.&#xA;PartitionThreeSets.png&#xA;&#xA;The validation set is used for initial testing during training (to determine hyperparameter tweaks, add, remove, or transform features, and so on), and the test set is used for final evaluation.&#xA;workflowwithvalidationset.png&#xA;&#xA;The validation and test sets can &#34;wear out&#34; with repeated use. 
For this reason, it is a good idea to collect more data to &#34;refresh&#34; the test and validation sets.&#xA;&#xA;A good test set is:&#xA;&#xA;Large enough to yield statistically significant results&#xA;Representative of the dataset as a whole&#xA;Representative of real-world data the model will encounter (if your model performs poorly on real-world data, determine how your dataset differs from real-life data)&#xA;Free of duplicates from the training set&#xA;&#xA;In theory, the validation set and test set should contain the same number of examples, or nearly so.&#xA;&#xA;Source: Datasets: Dividing the original dataset | Machine Learning | Google for Developers&#xA;&#xA;Transforming data&#xA;&#xA;Machine learning models require all data, including features such as street names, to be transformed into numerical (floating-point) representations for training.&#xA;&#xA;Normalisation improves model training by converting existing floating-point features to a constrained range.&#xA;&#xA;When dealing with large datasets, select a subset of examples for training. When possible, select the subset that is most relevant to your model&#39;s predictions. Safeguard privacy by omitting examples containing personally identifiable information.&#xA;&#xA;Source: Datasets: Transforming data | Machine Learning | Google for Developers&#xA;&#xA;Generalization&#xA;&#xA;Generalisation refers to a model&#39;s ability to perform well on new, unseen data.&#xA;&#xA;Source: Generalization | Machine Learning | Google for Developers&#xA;&#xA;Overfitting&#xA;&#xA;Overfitting means creating a model that matches the training set so closely that the model fails to make correct predictions on new data.&#xA;&#xA;Generalization is the opposite of overfitting. That is, a model that generalises well makes good predictions on new data.&#xA;&#xA;An overfit model is analogous to an invention that performs well in the lab but is worthless in the real world. 
An underfit model is like a product that does not even do well in the lab.&#xA;&#xA;Overfitting can be detected by observing diverging loss curves for training and validation sets on a generalization curve (a graph that shows two or more loss curves). A generalization curve for a well-fit model shows two loss curves that have similar shapes.&#xA;&#xA;Common causes of overfitting include:&#xA;&#xA;A training set that does not adequately represent real-life data (or the validation set or test set).&#xA;A model that is too complex.&#xA;&#xA;Dataset conditions for good generalization include:&#xA;&#xA;Examples must be independently and identically distributed, which is a fancy way of saying that your examples cannot influence each other.&#xA;The dataset is stationary, meaning it does not change significantly over time.&#xA;The dataset partitions have the same distribution, meaning the examples in the training set, validation set, test set, and real-world data are statistically similar.&#xA;&#xA;Source: Overfitting | Machine Learning | Google for Developers&#xA;&#xA;Model complexity&#xA;&#xA;Simpler models often generalise better to new data than complex models, even if they perform slightly worse on training data.&#xA;&#xA;Occam&#39;s Razor favours simpler explanations and models.&#xA;&#xA;Model training should minimise both loss and complexity for optimal performance on new data.&#xA;$$&#xA;\text{minimise}(\text{loss + complexity})&#xA;$$&#xA;&#xA;Unfortunately, loss and complexity are typically inversely related. As complexity increases, loss decreases. 
As complexity decreases, loss increases.&#xA;&#xA;Regularisation techniques help prevent overfitting by penalising model complexity during training.&#xA;&#xA;L1 regularisation (also called LASSO) uses the absolute values of model weights to measure model complexity.&#xA;L2 regularisation (also called ridge regularisation) uses squares of model weights to measure model complexity.&#xA;&#xA;Source: Overfitting: Model complexity | Machine Learning | Google for Developers&#xA;&#xA;L2 regularization&#xA;&#xA;L2 regularisation is a popular regularisation metric to reduce model complexity and prevent overfitting. It uses the following formula:&#xA;$$&#xA;L_2 \text{ regularisation} = w_1^2 + w_2^2 + \ldots + w_n^2&#xA;$$&#xA;&#xA;It penalises especially large weights.&#xA;&#xA;L2 regularisation encourages weights towards 0, but never pushes them all the way to zero.&#xA;&#xA;A regularisation rate (lambda) controls the strength of regularisation.&#xA;$$&#xA;\text{minimise}(\text{loss} + \lambda \text{ complexity})&#xA;$$&#xA;&#xA;A high regularisation rate reduces the likelihood of overfitting and tends to produce a histogram of model weights that are normally distributed around 0.&#xA;A low regularisation rate lowers the influence of regularisation and tends to produce a histogram of model weights with a flat distribution.&#xA;&#xA;Tuning is required to find the ideal regularisation rate.&#xA;&#xA;Early stopping is an alternative regularisation method that involves ending training before the model fully converges to prevent overfitting. It usually increases training loss but decreases test loss. It is a quick but rarely optimal form of regularisation.&#xA;&#xA;Learning rate and regularisation rate tend to pull weights in opposite directions. A high learning rate often pulls weights away from zero, while a high regularisation rate pulls weights towards zero. 
The goal is to find the equilibrium.&#xA;&#xA;Source: Overfitting: L2 regularization | Machine Learning | Google for Developers&#xA;&#xA;Interpreting loss curves&#xA;&#xA;An ideal loss curve looks like this:&#xA;metric-curve-ideal.png&#xA;&#xA;To improve an oscillating loss curve:&#xA;&#xA;Reduce the learning rate.&#xA;Reduce the training set to a tiny number of trustworthy examples.&#xA;Check your data against a data schema to detect bad examples, then remove the bad examples from the training set.&#xA;metric-curve-ex03.png&#xA;&#xA;Possible reasons for a loss curve with a sharp jump include:&#xA;&#xA;The input data contains a burst of outliers.&#xA;The input data contains one or more NaNs (for example, a value caused by a division by zero).&#xA;metric-curve-ex02.png&#xA;&#xA;Test loss diverges from training loss when:&#xA;&#xA;The model overfits the training set.&#xA;metric-curve-ex01.png&#xA;&#xA;The loss curve gets stuck when:&#xA;&#xA;The training set is not shuffled well.&#xA;metric-curve-ex05.png&#xA;&#xA;Source: Overfitting: Interpreting loss curves | Machine Learning | Google for Developers]]&gt;</description>
      <content:encoded><![CDATA[<p>This post is part of a four-part summary of Google&#39;s <a href="https://developers.google.com/machine-learning/crash-course/">Machine Learning Crash Course</a>. For context, check out <a href="notes-from-googles-machine-learning-crash-course">this post</a>. This second module covers fundamental techniques and best practices for working with machine learning data.</p>

<h2 id="working-with-numerical-data">Working with numerical data</h2>

<h3 id="introduction">Introduction</h3>

<p><strong>Numerical data</strong>: Integers or floating-point values that behave like numbers. They are additive, countable, ordered, and so on. Examples include temperature, weight, or the number of deer wintering in a nature preserve.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/numerical-data">Working with numerical data | Machine Learning | Google for Developers</a></p>

<h3 id="how-a-model-ingests-data-with-feature-vectors">How a model ingests data with feature vectors</h3>

<p>A machine learning model ingests data through floating-point arrays called <strong>feature vectors</strong>, which are derived from dataset features. Feature vectors often utilise processed or transformed values instead of raw dataset values to enhance model learning.</p>

<p>Example of a feature vector: [0.13, 0.47]</p>

<p>Feature engineering is the process of converting raw data into suitable representations for the model. Common techniques are:</p>
<ul><li><strong>Normalization</strong>: Converting numerical values into a standard range.</li>
<li><strong>Binning</strong> (<strong>bucketing</strong>): Converting numerical values into buckets or ranges.</li></ul>
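
<p>As a small sketch of the normalisation step, raw values can be turned into a feature vector like so (the raw features and their ranges here are made-up assumptions, not values from the course):</p>

```python
# Build a feature vector from raw values via linear scaling.
# The raw features (temperature, house age) and their ranges are
# illustrative assumptions.
def linear_scale(x, x_min, x_max):
    """Scale x from [x_min, x_max] into [0, 1]."""
    return (x - x_min) / (x_max - x_min)

feature_vector = [
    round(linear_scale(56, 0, 100), 2),  # temperature 56 in 0..100
    round(linear_scale(12, 0, 50), 2),   # house age 12 in 0..50
]
print(feature_vector)  # [0.56, 0.24]
```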

<p>Non-numerical data like strings must be converted into numerical values for use in feature vectors.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/numerical-data/feature-vectors">Numerical data: How a model ingests data using feature vectors | Machine Learning | Google for Developers</a></p>

<h3 id="first-steps">First steps</h3>

<p>Before creating feature vectors, it is crucial to analyse numerical data to detect anomalies and patterns, which helps identify potential issues early. Useful approaches include:</p>
<ul><li><strong>Visualising</strong> it through plots and graphs (like scatter plots or histograms)</li>
<li>Calculating <strong>basic statistics</strong> like mean, median, standard deviation, or values at the quartile divisions (0th, 25th, 50th, 75th, 100th percentiles, where the 50th percentile is the median)</li></ul>

<p><strong>Outliers</strong>, values significantly distant from others, should be identified and handled appropriately.</p>
<ul><li>If the outlier is due to a mistake (for example, an experimenter entered data incorrectly, or an instrument malfunctioned), delete the example.</li>
<li>If the outlier is a legitimate data point: keep it when the model needs to infer good predictions on such values; otherwise, delete it or apply more invasive feature engineering techniques, such as <strong>clipping</strong>.</li></ul>

<p>A dataset probably contains outliers when:</p>
<ul><li>The delta between the 0th and 25th percentiles differs significantly from the delta between the 75th and 100th percentiles</li>
<li>The standard deviation is almost as high as the mean</li></ul>
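
<p>Both heuristics can be computed directly; a minimal sketch on synthetic data with one planted outlier:</p>

```python
import numpy as np

# Synthetic data: a tight cluster plus one extreme value.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 250.0])

p0, p25, p75, p100 = np.percentile(values, [0, 25, 75, 100])
lower_delta = p25 - p0    # 1.25: the lower half is tightly packed
upper_delta = p100 - p75  # 246.0: a huge upper gap flags an outlier

print("percentile deltas:", lower_delta, upper_delta)
# The std (~74) even exceeds the mean (27.7): another outlier signal.
print("mean:", values.mean(), "std:", round(float(values.std()), 1))
```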

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/numerical-data/first-steps">Numerical data: First steps | Machine Learning | Google for Developers</a></p>

<h3 id="normalization">Normalization</h3>

<p>Data normalization is crucial for enhancing machine learning model performance by scaling features to a <strong>similar range</strong>. It is also recommended to normalise a single numeric feature that covers a wide range (for example, city population).</p>

<p>Normalisation has the following benefits:</p>
<ul><li>Helps a model converge more quickly.</li>
<li>Helps models infer better predictions.</li>
<li>Helps avoid the NaN trap (large numerical values exceeding the floating-point precision limit and flipping into NaN values).</li>
<li>Helps the model learn appropriate weights (so the model does not pay too much attention to features with wide ranges).</li></ul>

<table>
<thead>
<tr>
<th>Normalization technique</th>
<th>Formula</th>
<th>When to use</th>
</tr>
</thead>

<tbody>
<tr>
<td>Linear scaling</td>
<td>$$x&#39;=\frac{x-x_\text{min}}{x_\text{max}-x_\text{min}}$$</td>
<td>When the feature is roughly uniformly distributed across its range; flat-shaped</td>
</tr>

<tr>
<td>Z-score scaling</td>
<td>$$x&#39; = (x-\mu)/\sigma$$</td>
<td>When the feature is normally distributed (peak close to mean); bell-shaped</td>
</tr>

<tr>
<td>Log scaling</td>
<td>$$x&#39;=\ln(x)$$</td>
<td>When the feature distribution is heavily skewed, with most values near one end and a long tail; heavy-tail-shaped</td>
</tr>

<tr>
<td>Clipping</td>
<td>If x &gt; max, set $$x&#39;=\text{max}$$; if x &lt; min, set $$x&#39;=\text{min}$$</td>
<td>When the feature contains extreme outliers</td>
</tr>
</tbody>
</table>
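
<p>The four techniques in the table can be sketched as plain functions (the input values are illustrative):</p>

```python
import math

def linear_scale(x, x_min, x_max):
    # Map [x_min, x_max] onto [0, 1].
    return (x - x_min) / (x_max - x_min)

def z_score(x, mean, std):
    # Distance from the mean in standard deviations.
    return (x - mean) / std

def log_scale(x):
    # Natural log compresses a heavy tail.
    return math.log(x)

def clip(x, lo, hi):
    # Pull extreme outliers back to the [lo, hi] boundary.
    return min(max(x, lo), hi)

print(linear_scale(75, 0, 100))   # 0.75
print(z_score(130, 100, 15))      # 2.0
print(round(log_scale(1000), 2))  # 6.91
print(clip(425, 15, 392))         # 392
```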

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/numerical-data/normalization">Numerical data: Normalization | Machine Learning | Google for Developers</a></p>

<h3 id="binning">Binning</h3>

<p>Binning (bucketing) is a feature engineering technique used to group numerical data into categories (bins). In many cases, this turns numerical data into categorical data.</p>

<p>For example, if a feature X has values ranging from 15 to 425, we can apply binning to represent X as a feature vector divided into specific intervals:</p>

<table>
<thead>
<tr>
<th>Bin number</th>
<th>Range</th>
<th>Feature vector</th>
</tr>
</thead>

<tbody>
<tr>
<td>1</td>
<td>15-34</td>
<td>[1.0, 0.0, 0.0, 0.0, 0.0]</td>
</tr>

<tr>
<td>2</td>
<td>35-117</td>
<td>[0.0, 1.0, 0.0, 0.0, 0.0]</td>
</tr>

<tr>
<td>3</td>
<td>118-279</td>
<td>[0.0, 0.0, 1.0, 0.0, 0.0]</td>
</tr>

<tr>
<td>4</td>
<td>280-392</td>
<td>[0.0, 0.0, 0.0, 1.0, 0.0]</td>
</tr>

<tr>
<td>5</td>
<td>393-425</td>
<td>[0.0, 0.0, 0.0, 0.0, 1.0]</td>
</tr>
</tbody>
</table>

<p>Even though X is a single column in the dataset, binning causes a model to treat X as five separate features. Therefore, the model learns separate weights for each bin.</p>
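
<p>A minimal sketch of this binning scheme, using the exact bucket boundaries from the table and Python's standard bisect module:</p>

```python
import bisect

# Upper edges of bins 1-4 from the table; values above 392 fall
# into bin 5.
UPPER_EDGES = [34, 117, 279, 392]

def bin_one_hot(x, edges=UPPER_EDGES):
    """Return the one-hot feature vector for x's bucket."""
    index = bisect.bisect_left(edges, x)  # 0-based bucket index
    vec = [0.0] * (len(edges) + 1)
    vec[index] = 1.0
    return vec

print(bin_one_hot(150))  # bin 3 (118-279): [0.0, 0.0, 1.0, 0.0, 0.0]
```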

<p>Binning offers an alternative to scaling or clipping and is particularly useful for handling outliers and improving model performance on non-linear data.</p>

<p><strong>When to use</strong>: Binning works well when features exhibit a “clumpy” distribution, that is, the overall <em>linear</em> relationship between the feature and label is weak or nonexistent, or when feature values are clustered.</p>

<p>Example: number of shoppers versus temperature. By binning temperature, the model learns a separate weight for each bin.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/binning_temperature_vs_shoppers_divided_into_3_bins.png" alt="binning_temperature_vs_shoppers_divided_into_3_bins.png"/></p>

<p>While creating multiple bins is possible, it is generally recommended to avoid an excessive number, as this can lead to insufficient training examples per bin and increased feature dimensionality.</p>

<p><strong>Quantile bucketing</strong> is a specific binning technique that ensures each bin contains a roughly equal number of examples, which can be particularly useful for datasets with skewed distributions.</p>
<ul><li>Quantile buckets give extra information space to the large torso while compacting the long tail into a single bucket.</li>
<li>Equal intervals give extra information space to the long tail while compacting the large torso into a single bucket.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/QuantileBucketing.png" alt="QuantileBucketing.png"/></li></ul>
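
<p>A small sketch of quantile bucketing on synthetic skewed data (the log-normal "prices" are made up for illustration):</p>

```python
import numpy as np

# 10,000 synthetic, heavily skewed values.
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=10, sigma=0.8, size=10_000)

# Bucket edges at the quartiles give four equal-count buckets.
edges = np.quantile(prices, [0.25, 0.5, 0.75])
buckets = np.digitize(prices, edges)

# Roughly 2,500 examples per bucket, despite the long tail.
print(np.bincount(buckets))
```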

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/numerical-data/binning">Numerical data: Binning | Machine Learning | Google for Developers</a></p>

<h3 id="scrubbing">Scrubbing</h3>

<table>
<thead>
<tr>
<th>Problem category</th>
<th>Example</th>
</tr>
</thead>

<tbody>
<tr>
<td>Omitted values</td>
<td>A census taker fails to record a resident&#39;s age</td>
</tr>

<tr>
<td>Duplicate examples</td>
<td>A server uploads the same logs twice</td>
</tr>

<tr>
<td>Out-of-range feature values</td>
<td>A human accidentally types an extra digit</td>
</tr>

<tr>
<td>Bad labels</td>
<td>A human evaluator mislabels a picture of an oak tree as a maple</td>
</tr>
</tbody>
</table>

<p>You can use programs or scripts to identify and handle data issues such as omitted values, duplicates, and out-of-range feature values by removing or correcting them.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/numerical-data/scrubbing">Numerical data: Scrubbing | Machine Learning | Google for Developers</a></p>

<h3 id="qualities-of-good-numerical-features">Qualities of good numerical features</h3>
<ul><li>Good feature vectors require features that are clearly named and have obvious meanings to anyone on the project.</li>
<li>Data should be checked and tested for bad data or outliers, such as inappropriate values, before being used for training.</li>
<li>Features should be sensible, avoiding “magic values” that create discontinuities (for example, setting the value “watch_time_in_seconds” to -1 to indicate an absence of measurement); instead, use separate boolean features or new discrete values to indicate missing data.</li></ul>
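
<p>A sketch of the indicator-column alternative using pandas; “watch_time_in_seconds” echoes the example above, the indicator column name is an arbitrary choice, and mean imputation is one reasonable strategy among several:</p>

```python
import numpy as np
import pandas as pd

# -1 is the magic value marking a missing measurement.
df = pd.DataFrame({"watch_time_in_seconds": [210.0, -1.0, 95.0, -1.0]})

# Replace the magic value with NaN plus an explicit boolean signal.
df["watch_time_is_missing"] = df["watch_time_in_seconds"].eq(-1.0)
df["watch_time_in_seconds"] = df["watch_time_in_seconds"].mask(
    df["watch_time_is_missing"], np.nan
)
# Impute with the mean of the observed values: (210 + 95) / 2 = 152.5.
df["watch_time_in_seconds"] = df["watch_time_in_seconds"].fillna(
    df["watch_time_in_seconds"].mean()
)
print(df)
```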

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/numerical-data/qualities-of-good-numerical-features">Numerical data: Qualities of good numerical features | Machine Learning | Google for Developers</a></p>

<h3 id="polynomial-transformations">Polynomial transformations</h3>

<p>Synthetic features, such as polynomial transforms, enable linear models to represent non-linear relationships by introducing new features based on existing ones.</p>

<p>By incorporating synthetic features, linear regression models can effectively <strong>separate data points</strong> that are not linearly separable, using curves instead of straight lines. For example, we can separate two classes with y = x^2.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/ft_cross1.png" alt="ft_cross1.png"/></p>
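
<p>A minimal sketch with made-up points: adding the synthetic feature x² makes the non-linear rule "is the point above the curve y = x²?" linear in the expanded features:</p>

```python
# Hypothetical 2-D points, some above and some below y = x**2.
points = [(-2.0, 5.0), (-1.0, 0.5), (0.0, 1.5), (1.0, 0.4), (2.0, 4.5)]

# Expand each feature vector from (x, y) to (x, y, x**2).
expanded = [(x, y, x * x) for x, y in points]

# "y > x**2" is now the linear rule 0*x + 1*y - 1*(x**2) > 0.
labels = [int(y > x * x) for x, y in points]
print(labels)  # [1, 0, 1, 0, 1]
```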

<p><strong>Feature crosses</strong>, a related concept for categorical data, synthesise new features by combining existing features, further enhancing model flexibility.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/numerical-data/polynomial-transforms">Numerical data: Polynomial transforms | Machine Learning | Google for Developers</a></p>

<h2 id="working-with-categorical-data">Working with categorical data</h2>

<h3 id="introduction-1">Introduction</h3>

<p><strong>Categorical data</strong> has a specific set of possible values. Examples include species of animals, names of streets, whether or not an email is spam, and binned numbers.</p>

<p>Categorical data can include <strong>numbers that behave like categories</strong>. An example is postal codes.</p>
<ul><li>Genuinely numerical data can be meaningfully multiplied or averaged; postal codes cannot.</li>
<li>Integer values that behave like categories, such as postal codes, should therefore be represented as categorical data.</li></ul>

<p><strong>Encoding</strong> means converting categorical or other data to numerical vectors that a model can train on.</p>

<p><strong>Preprocessing</strong> includes converting non-numerical data, such as strings, to floating-point values.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/categorical-data">Working with categorical data | Machine Learning | Google for Developers</a></p>

<h3 id="vocabulary-and-one-hot-encoding">Vocabulary and one-hot encoding</h3>

<p>Machine learning models require numerical input; therefore, categorical data such as strings must be converted to numerical representations.</p>

<p>The term <strong>dimension</strong> is a synonym for the <strong>number of elements</strong> in a feature vector. Some categorical features are low dimensional. For example:</p>

<table>
<thead>
<tr>
<th>Feature name</th>
<th># of categories</th>
<th>Sample categories</th>
</tr>
</thead>

<tbody>
<tr>
<td>snowed_today</td>
<td>2</td>
<td>True, False</td>
</tr>

<tr>
<td>skill_level</td>
<td>3</td>
<td>Beginner, Practitioner, Expert</td>
</tr>

<tr>
<td>season</td>
<td>4</td>
<td>Winter, Spring, Summer, Autumn</td>
</tr>

<tr>
<td>day_of_week</td>
<td>7</td>
<td>Monday, Tuesday, Wednesday</td>
</tr>

<tr>
<td>planet</td>
<td>8</td>
<td>Mercury, Venus, Earth</td>
</tr>

<tr>
<td>car_colour</td>
<td>8</td>
<td>Red, Orange, Blue, Yellow</td>
</tr>
</tbody>
</table>

<p>When a categorical feature has a low number of possible categories, you can encode it as a <strong>vocabulary</strong>. This treats each category as a separate feature, allowing the model to learn distinct weights for each during training.</p>

<p><strong>One-hot encoding</strong> transforms categorical values into numerical vectors (arrays) of N elements, where N is the number of categories. Exactly one of the elements in a one-hot vector has the value 1.0; all the remaining elements have the value 0.0.</p>

<table>
<thead>
<tr>
<th>Feature</th>
<th>Red</th>
<th>Orange</th>
<th>Blue</th>
<th>Yellow</th>
<th>Green</th>
<th>Black</th>
<th>Purple</th>
<th>Brown</th>
</tr>
</thead>

<tbody>
<tr>
<td>“Red”</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>

<tr>
<td>“Orange”</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>

<tr>
<td>“Blue”</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>

<tr>
<td>“Yellow”</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>

<tr>
<td>“Green”</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>

<tr>
<td>“Black”</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>

<tr>
<td>“Purple”</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>

<tr>
<td>“Brown”</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>

<p>It is the one-hot vector, not the string or the index number, that gets passed to the feature vector. The <strong>model learns a separate weight for each element</strong> of the feature vector.</p>

<p>The end-to-end process to map categories to feature vectors:
<img src="https://media.portblue.net/resources/251229_ml-crash-course/vocabulary-index-sparse-feature.png" alt="vocabulary-index-sparse-feature.png"/></p>

<p>In a true one-hot encoding, only one element has the value 1.0. In a variant known as <strong>multi-hot encoding</strong>, multiple values can be 1.0.</p>

<p>A feature whose values are predominantly zero (or empty) is termed a <strong>sparse feature</strong>.</p>

<p><strong>Sparse representation</strong> efficiently stores one-hot encoded data by only recording the position of the &#39;1&#39; value to reduce memory usage.</p>
<ul><li>For example, the one-hot vector for “car_colour” “Blue” is: [0, 0, 1, 0, 0, 0, 0, 0].</li>
<li>Since the 1 is in position 2 (when starting the count at 0), the sparse representation is: 2.</li></ul>

<p>Notice that the sparse representation consumes far less memory. Importantly, the model must train on the one-hot vector, not the sparse representation.</p>
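
<p>The vocabulary, one-hot and sparse representations can be sketched in a few lines using the car-colour vocabulary from the table:</p>

```python
VOCABULARY = ["Red", "Orange", "Blue", "Yellow",
              "Green", "Black", "Purple", "Brown"]
INDEX = {colour: i for i, colour in enumerate(VOCABULARY)}

def one_hot(colour):
    """Dense one-hot vector, which is what the model trains on."""
    vec = [0.0] * len(VOCABULARY)
    vec[INDEX[colour]] = 1.0
    return vec

sparse = INDEX["Blue"]   # sparse representation: just the position, 2
dense = one_hot("Blue")  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(sparse, dense)
```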

<p>The sparse representation of a multi-hot encoding stores the positions of all the non-zero elements. For example, the sparse representation of a car that is both “Blue” and “Black” is 2, 5.</p>

<p>Categorical features can have <strong>outliers</strong>. If “car_colour” includes rare values such as “Mauve” or “Avocado”, you can group them into one <strong>out-of-vocabulary (OOV)</strong> category. All rare colours go into this single bucket, and the model learns one weight for it.</p>

<p>For <strong>high-dimensional categorical features</strong> with many categories, one-hot encoding might be inefficient, and <strong>embeddings</strong> or <strong>hashing</strong> (also called the <strong>hashing trick</strong>) are recommended.</p>
<ul><li>For example, a feature like “words_in_english” has around 500,000 categories.</li>
<li>Embeddings substantially reduce the number of dimensions, which helps the model train faster and infer predictions more quickly.</li></ul>
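<p>A minimal sketch of the hashing trick: each category is hashed straight to one of a fixed number of buckets, so no vocabulary needs to be stored. The bucket count of 1,000 is an arbitrary assumption.</p>

```python
import hashlib

NUM_BUCKETS = 1000  # arbitrary; far fewer than the ~500,000 English words

def hash_bucket(word, num_buckets=NUM_BUCKETS):
    """Map a category to a bucket index deterministically, with no vocabulary."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Each word lands in some bucket; the model learns one weight per bucket.
# Collisions (distinct words sharing a bucket) are the price of the trick.
indices = [hash_bucket(w) for w in ["machine", "learning", "crash", "course"]]
assert all(0 <= i < NUM_BUCKETS for i in indices)
```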

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/categorical-data/one-hot-encoding">Categorical data: Vocabulary and one-hot encoding | Machine Learning | Google for Developers</a></p>

<h3 id="common-issues-with-categorical-data">Common issues with categorical data</h3>

<p>The quality of categorical data hinges on how categories are defined and how examples are labelled; poor definitions or inconsistent labelling undermine the reliability of the dataset.</p>

<p><strong>Human-labelled data</strong>, known as “gold labels”, is generally preferred for training due to its higher quality, but it is essential to check for human errors and biases.</p>
<ul><li>Any two human beings may label the same example differently. The extent to which human raters make the same labelling decisions is called <strong>inter-rater agreement</strong>.</li>
<li>Inter-rater agreement can be measured using kappa and intra-class correlation (<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/">Hallgren, 2012</a>), or Krippendorff&#39;s alpha (<a href="https://www.asc.upenn.edu/sites/default/files/2021-03/Computing%20Krippendorff%27s%20Alpha-Reliability.pdf">Krippendorff, 2011</a>).</li></ul>
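<p>As a sketch, Cohen's kappa for two raters can be computed from scratch: it is the observed agreement corrected for the agreement expected by chance. The spam/ham labels are illustrative, not from the course.</p>

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

a = ["spam", "spam", "ham", "ham", "spam", "ham"]
b = ["spam", "ham",  "ham", "ham", "spam", "ham"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```

A kappa of 1.0 means perfect agreement; 0.0 means agreement no better than chance.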

<p><strong>Machine-labelled data</strong>, or “silver labels”, can introduce biases or inaccuracies, necessitating careful quality checks and awareness of potential common-sense violations.</p>
<ul><li>For example, if a computer-vision model mislabels a photo of a chihuahua as a muffin, or a photo of a muffin as a chihuahua.</li>
<li>Similarly, a sentiment analyser that scores neutral words as -0.25, when 0.0 is the neutral value, might be scoring all words with an additional negative bias.</li></ul>

<p><strong>High dimensionality</strong> in categorical data increases training complexity and costs, leading to techniques such as embeddings for dimensionality reduction.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/categorical-data/issues">Categorical data: Common issues | Machine Learning | Google for Developers</a></p>

<h3 id="feature-crosses">Feature crosses</h3>

<p><strong>Feature crosses</strong> are created by combining two or more categorical or bucketed features to capture interactions and non-linearities within a dataset.</p>

<p>For example, consider a leaf dataset with the categorical features:</p>
<ul><li>“edges”, containing values {smooth, toothed, lobed}</li>
<li>“arrangement”, containing values {opposite, alternate}</li></ul>

<p>The feature cross, or Cartesian product, of these two features would be:</p>

<p>{Smooth_Opposite, Smooth_Alternate, Toothed_Opposite, Toothed_Alternate, Lobed_Opposite, Lobed_Alternate}</p>

<p>For example, if a leaf has a lobed edge and an alternate arrangement, the feature-cross vector will have a value of 1 for “Lobed_Alternate”, and a value of 0 for all other terms:</p>

<p>{0, 0, 0, 0, 0, 1}</p>

<p>This dataset could be used to classify leaves by tree species, since these characteristics do not vary within a species.</p>
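<p>The leaf feature cross above can be reproduced with a Cartesian product:</p>

```python
from itertools import product

edges = ["Smooth", "Toothed", "Lobed"]
arrangement = ["Opposite", "Alternate"]

# The crossed feature's vocabulary is the Cartesian product of the two originals.
cross_vocab = [f"{e}_{a}" for e, a in product(edges, arrangement)]
print(cross_vocab)
# ['Smooth_Opposite', 'Smooth_Alternate', 'Toothed_Opposite',
#  'Toothed_Alternate', 'Lobed_Opposite', 'Lobed_Alternate']

def cross_one_hot(edge, arr):
    """One-hot encode a leaf over the crossed vocabulary."""
    return [1 if term == f"{edge}_{arr}" else 0 for term in cross_vocab]

print(cross_one_hot("Lobed", "Alternate"))  # [0, 0, 0, 0, 0, 1]
```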

<p>Feature crosses are somewhat analogous to polynomial transforms.</p>

<p>Feature crosses can be particularly effective when guided by <strong>domain expertise</strong>. It is often possible, though computationally expensive, to use neural networks to automatically find and apply useful feature combinations during training.</p>

<p>Overuse of feature crosses with sparse features should be avoided, as it can lead to <strong>excessive sparsity</strong> in the resulting feature set. For example, if feature A is a 100-element sparse feature and feature B is a 200-element sparse feature, a feature cross of A and B yields a 20,000-element sparse feature.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/categorical-data/feature-crosses">Categorical data: Feature crosses | Machine Learning | Google for Developers</a></p>

<h2 id="datasets-generalization-and-overfitting">Datasets, generalization, and overfitting</h2>

<h3 id="introduction-2">Introduction</h3>
<ul><li>Data quality significantly impacts model performance more than algorithm choice.</li>
<li>Machine learning practitioners typically dedicate a substantial portion of their project time (around 80%) to data preparation and transformation, including tasks such as dataset construction and feature engineering.</li></ul>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/overfitting">Datasets, generalization, and overfitting | Machine Learning | Google for Developers</a></p>

<h3 id="data-characteristics">Data characteristics</h3>

<p>A machine learning model&#39;s performance is heavily reliant on the quality and quantity of the dataset it is trained on, with <strong>larger, high-quality</strong> datasets generally leading to better results.</p>

<p>Datasets can contain various <strong>data types</strong>, including numerical, categorical, text, multimedia, and embedding vectors, each requiring specific handling for optimal model training.</p>

<p>The following are <strong>common causes</strong> of unreliable data in datasets:</p>
<ul><li>Omitted values</li>
<li>Duplicate examples</li>
<li>Bad feature values</li>
<li>Bad labels</li>
<li>Bad sections of data</li></ul>

<p>Maintaining data quality involves addressing issues such as label errors and noisy features, and filtering out bad data, so that the dataset remains reliable enough for accurate predictions.</p>

<p>Incomplete examples with missing feature values should be handled by either <strong>deletion</strong> or <strong>imputation</strong> to avoid negatively impacting model training.</p>

<p>When imputing missing values, use reliable methods such as <strong>mean/median imputation</strong> and consider adding an indicator column to signal imputed values to the model. For example, alongside temperature include “temperature_is_imputed”. This lets the model learn to trust real observations more than imputed ones.</p>
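<p>A minimal pure-Python sketch of mean imputation with an indicator column; the temperature readings are hypothetical:</p>

```python
temperatures = [21.0, None, 19.5, None, 22.0]  # None marks a missing value

observed = [t for t in temperatures if t is not None]
mean = sum(observed) / len(observed)

# Record which rows were imputed BEFORE filling them in, so the model can
# learn to trust real observations more than imputed ones.
temperature_is_imputed = [t is None for t in temperatures]
temperature = [mean if t is None else t for t in temperatures]

print(temperature_is_imputed)  # [False, True, False, True, False]
print(temperature)
```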

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/overfitting/data-characteristics">Datasets: Data characteristics | Machine Learning | Google for Developers</a></p>

<h3 id="labels">Labels</h3>

<p>Direct labels are generally preferred but often unavailable.</p>
<ul><li><strong>Direct labels</strong> exactly match the prediction target and appear explicitly in the dataset, such as a “bicycle_owner” column for predicting bicycle ownership.</li>
<li><strong>Proxy labels</strong> approximate the target and correlate with it, such as a bicycle magazine subscription as a signal of bicycle ownership.</li></ul>

<p>Use a proxy label when no direct label exists or when the direct concept resists easy numeric representation. Carefully evaluate proxy labels to ensure they are a suitable approximation.</p>

<p><strong>Human-generated labels</strong>, while offering flexibility and nuanced understanding, can be expensive to produce and prone to errors, requiring careful quality control.</p>

<p>Models can train on a mix of <strong>automated</strong> and human-generated labels, but an extra set of human labels often adds complexity without sufficient benefit.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/overfitting/labels">Datasets: Labels | Machine Learning | Google for Developers</a></p>

<h3 id="imbalanced-datasets">Imbalanced datasets</h3>

<p><strong>Imbalanced datasets</strong> occur when one label (<strong>majority class</strong>) is significantly more frequent than another (<strong>minority class</strong>), potentially hindering model training on the minority class.</p>

<p>Note: Accuracy is usually a poor metric for assessing a model trained on a class-imbalanced dataset.</p>

<p>A highly imbalanced floral dataset containing far more sunflowers (200) than roses (2):
<img src="https://media.portblue.net/resources/251229_ml-crash-course/FloralDataset200Sunflowers2Roses.png" alt="FloralDataset200Sunflowers2Roses.png"/></p>

<p>During training, a model should learn two things:</p>
<ul><li>What each class looks like, that is, what feature values correspond to which class.</li>
<li>How common each class is, that is, what the relative distribution of the classes is.</li></ul>

<p>Standard training conflates these two goals. In contrast, a two-step technique of <strong>downsampling and upweighting the majority class</strong> separates these two goals, enabling the model to achieve both.</p>

<p><strong>Step 1: Downsample the majority class</strong> by training on only a small fraction of majority class examples, which makes an imbalanced dataset more balanced during training and increases the chance that each batch contains enough minority examples.</p>

<p>For example, with a class-imbalanced dataset consisting of 99% majority class and 1% minority class examples, we could downsample the majority class by a factor of 25 to create a more balanced training set (80% majority class and 20% minority class).</p>

<p>Downsampling the majority class by a factor of 25:
<img src="https://media.portblue.net/resources/251229_ml-crash-course/FloralDatasetDownsampling.png" alt="FloralDatasetDownsampling.png"/></p>

<p><strong>Step 2: Upweight the downsampled majority class</strong> by the same factor used for downsampling, so each majority class error counts proportionally more during training. This compensates for the artificial class distribution introduced by downsampling, which no longer reflects real-world frequencies.</p>

<p>Continuing the example from above, we must upweight the majority class by a factor of 25. That is, when the model mistakenly predicts the majority class, treat the loss as if it were 25 errors (multiply the regular loss by 25).</p>
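<p>The two steps can be sketched for the 99%/1% example above; the dataset and random sampling here are simulated for illustration:</p>

```python
import random

random.seed(0)
FACTOR = 25

# 99,000 majority and 1,000 minority examples (99% / 1%).
dataset = [("majority", None)] * 99_000 + [("minority", None)] * 1_000

# Step 1: downsample — keep only 1/FACTOR of the majority class.
majority = [ex for ex in dataset if ex[0] == "majority"]
minority = [ex for ex in dataset if ex[0] == "minority"]
downsampled = random.sample(majority, len(majority) // FACTOR) + minority

# Step 2: upweight the retained majority examples by the same factor,
# so the loss treats each majority-class error as FACTOR errors.
weights = [FACTOR if label == "majority" else 1 for label, _ in downsampled]

majority_share = sum(l == "majority" for l, _ in downsampled) / len(downsampled)
print(round(majority_share, 2))  # 0.8 — the 80%/20% balance from the example
```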

<p>Upweighting the majority class by a factor of 25:
<img src="https://media.portblue.net/resources/251229_ml-crash-course/FloralDatasetUpweighting.png" alt="FloralDatasetUpweighting.png"/></p>

<p>Experiment with different <strong>downsampling and upweighting factors</strong> just as you would experiment with other hyperparameters.</p>

<p><strong>Benefits of this technique</strong> include a better model (the resultant model knows what each class looks like and how common each class is) and faster convergence.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/overfitting/imbalanced-datasets">Datasets: Class-imbalanced datasets | Machine Learning | Google for Developers</a></p>

<h3 id="dividing-the-original-dataset">Dividing the original dataset</h3>

<p>Machine learning models should be tested against unseen data.</p>

<p>It is recommended to split the dataset into three subsets: training, validation, and test sets.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/PartitionThreeSets.png" alt="PartitionThreeSets.png"/></p>

<p>The validation set is used for initial testing during training (to determine hyperparameter tweaks, add, remove, or transform features, and so on), and the test set is used for final evaluation.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/workflow_with_validation_set.png" alt="workflow_with_validation_set.png"/></p>

<p>The validation and test sets can “wear out” with repeated use. For this reason, it is a good idea to collect more data to “refresh” the test and validation sets.</p>

<p>A <strong>good test set</strong> is:</p>
<ul><li>Large enough to yield statistically significant results</li>
<li>Representative of the dataset as a whole</li>
<li>Representative of real-world data the model will encounter (if your model performs poorly on real-world data, determine how your dataset differs from real-life data)</li>
<li>Free of duplicates from the training set</li></ul>

<p>In theory, the validation set and test set should contain the same number of examples, or nearly so.</p>
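<p>A simple sketch of a three-way split; the 80/10/10 fractions and the shuffling seed are assumptions, not prescribed by the course:</p>

```python
import random

def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle, then carve off equally sized validation and test sets."""
    examples = examples[:]                  # avoid mutating the caller's list
    random.Random(seed).shuffle(examples)   # shuffling guards against ordering bias
    n = len(examples)
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    return (examples[n_val + n_test:],       # training set
            examples[:n_val],                # validation set
            examples[n_val:n_val + n_test])  # test set

train, val, test = split_dataset(list(range(1_000)))
print(len(train), len(val), len(test))  # 800 100 100
```

Splitting disjoint slices of a single shuffle also guarantees no example appears in more than one subset.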

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/overfitting/dividing-datasets">Datasets: Dividing the original dataset | Machine Learning | Google for Developers</a></p>

<h3 id="transforming-data">Transforming data</h3>

<p>Machine learning models require all data, including features such as street names, to be transformed into <strong>numerical (floating-point) representations</strong> for training.</p>

<p><strong>Normalisation</strong> improves model training by converting existing floating-point features to a constrained range.</p>
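<p>As an illustrative sketch, min-max scaling is one common normalisation: it linearly rescales a feature into [0, 1].</p>

```python
def min_max_scale(values):
    """Rescale values linearly into [0, 1]: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10.0, 20.0, 15.0, 30.0]))  # [0.0, 0.5, 0.25, 1.0]
```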

<p>When dealing with large datasets, select a subset of examples for training. When possible, select the subset that is most relevant to your model&#39;s predictions. Safeguard privacy by omitting examples containing personally identifiable information.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/overfitting/transforming-data">Datasets: Transforming data | Machine Learning | Google for Developers</a></p>

<h3 id="generalization">Generalization</h3>

<p>Generalisation refers to a model&#39;s ability to perform well on new, unseen data.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/overfitting/generalization">Generalization | Machine Learning | Google for Developers</a></p>

<h3 id="overfitting">Overfitting</h3>

<p><strong>Overfitting</strong> means creating a model that matches the training set so closely that the model fails to make correct predictions on new data.</p>

<p><strong>Generalization</strong> is the opposite of overfitting. That is, a model that generalises well makes good predictions on new data.</p>

<p>An overfit model is analogous to an invention that performs well in the lab but is worthless in the real world. An underfit model is like a product that does not even do well in the lab.</p>

<p>Overfitting can be <strong>detected</strong> by observing diverging loss curves for training and validation sets on a generalization curve (a graph that shows two or more loss curves). A generalization curve for a well-fit model shows two loss curves that have similar shapes.</p>

<p><strong>Common causes</strong> of overfitting include:</p>
<ul><li>A training set that does not adequately represent the validation set, the test set, or real-life data.</li>
<li>A model that is too complex.</li></ul>

<p>Dataset conditions for good generalization include:</p>
<ul><li>Examples must be <strong>independently and identically distributed</strong>, which is a fancy way of saying that your examples cannot influence each other.</li>
<li>The dataset is <strong>stationary</strong>, meaning it does not change significantly over time.</li>
<li>The dataset partitions have the <strong>same distribution</strong>, meaning the examples in the training set, validation set, test set, and real-world data are statistically similar.</li></ul>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/overfitting/overfitting">Overfitting | Machine Learning | Google for Developers</a></p>

<h3 id="model-complexity">Model complexity</h3>

<p><strong>Simpler models often generalise better</strong> to new data than complex models, even if they perform slightly worse on training data.</p>

<p><strong>Occam&#39;s Razor</strong> favours simpler explanations and models.</p>

<p>Model training should minimise both loss and complexity for optimal performance on new data.
$$
\text{minimise}(\text{loss + complexity})
$$</p>

<p>Unfortunately, loss and complexity are typically inversely related. As complexity increases, loss decreases. As complexity decreases, loss increases.</p>

<p><strong>Regularisation techniques</strong> help prevent overfitting by penalising model complexity during training.</p>
<ul><li><strong>L1 regularisation</strong> (also called LASSO) uses the absolute values of the model weights to measure model complexity.</li>
<li><strong>L2 regularisation</strong> (also called ridge regularisation) uses the squares of the model weights to measure model complexity.</li></ul>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/overfitting/model-complexity">Overfitting: Model complexity | Machine Learning | Google for Developers</a></p>

<h3 id="l2-regularization">L2 regularization</h3>

<p><strong>L2 regularisation</strong> is a popular regularisation metric to reduce model complexity and prevent overfitting. It uses the following formula:
$$
L_2 \text{ regularisation} = w^2_1 + w^2_2 + \ldots + w^2_n
$$</p>

<p>Because the weights are squared, it penalises large weights especially heavily.</p>

<p>L2 regularisation encourages weights towards 0, but never pushes them all the way to zero.</p>

<p>A <strong>regularisation rate (lambda)</strong> controls the strength of regularisation.
$$
\text{minimise}(\text{loss} + \lambda \text{ complexity})
$$</p>
<ul><li>A <strong>high regularisation rate</strong> reduces the likelihood of overfitting and tends to produce a histogram of model weights that are normally distributed around 0.</li>
<li>A <strong>low regularisation rate</strong> lowers the influence of regularisation and tends to produce a histogram of model weights with a flat distribution.</li></ul>

<p>Tuning is required to find the ideal regularisation rate.</p>
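<p>The penalty and the regularised objective can be sketched in a few lines; the weights and lambda are made up for illustration:</p>

```python
def l2_penalty(weights):
    """L2 regularisation term: the sum of squared weights."""
    return sum(w * w for w in weights)

def regularised_loss(loss, weights, lam):
    """The quantity training minimises: loss + lambda * complexity."""
    return loss + lam * l2_penalty(weights)

weights = [0.2, -0.5, 5.0, -1.2]
print(l2_penalty(weights))  # ~26.73 — dominated by the single large weight 5.0
print(regularised_loss(1.0, weights, lam=0.01))
```

Note how the one outlier weight contributes 25.0 of the roughly 26.73 total, illustrating why squaring penalises large weights so heavily.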

<p><strong>Early stopping</strong> is an alternative regularisation method that involves ending training before the model fully converges to prevent overfitting. It usually increases training loss but decreases test loss. It is a quick but rarely optimal form of regularisation.</p>

<p><strong>Learning rate</strong> and <strong>regularisation rate</strong> tend to pull weights in opposite directions. A high learning rate often pulls weights away from zero, while a high regularisation rate pulls weights towards zero. The goal is to find the equilibrium.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/overfitting/regularization">Overfitting: L2 regularization | Machine Learning | Google for Developers</a></p>

<h3 id="interpreting-loss-curves">Interpreting loss curves</h3>

<p>An <strong>ideal loss curve</strong> looks like this:
<img src="https://media.portblue.net/resources/251229_ml-crash-course/metric-curve-ideal.png" alt="metric-curve-ideal.png"/></p>

<p>To improve an <strong>oscillating loss curve</strong>:</p>
<ul><li>Reduce the learning rate.</li>
<li>Reduce the training set to a tiny number of trustworthy examples.</li>
<li>Check your data against a data schema to detect bad examples, then remove the bad examples from the training set.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/metric-curve-ex03.png" alt="metric-curve-ex03.png"/></li></ul>

<p>Possible reasons for a <strong>loss curve with a sharp jump</strong> include:</p>
<ul><li>The input data contains a burst of outliers.</li>
<li>The input data contains one or more NaNs (for example, a value caused by a division by zero).
<img src="https://media.portblue.net/resources/251229_ml-crash-course/metric-curve-ex02.png" alt="metric-curve-ex02.png"/></li></ul>

<p><strong>Test loss diverges from training loss</strong> when:</p>
<ul><li>The model overfits the training set.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/metric-curve-ex01.png" alt="metric-curve-ex01.png"/></li></ul>

<p>The <strong>loss curve gets stuck</strong> when:</p>
<ul><li>The training set is not shuffled well.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/metric-curve-ex05.png" alt="metric-curve-ex05.png"/></li></ul>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/overfitting/interpreting-loss-curves">Overfitting: Interpreting loss curves | Machine Learning | Google for Developers</a></p>
]]></content:encoded>
      <guid>https://stefan.angrick.me/google-ml-crash-course-2-notes-data</guid>
      <pubDate>Mon, 29 Dec 2025 10:02:48 +0000</pubDate>
    </item>
    <item>
      <title>Google ML Crash Course #1 Notes: ML Models</title>
      <link>https://stefan.angrick.me/google-ml-crash-course-1-notes-ml-models?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[This post is part of a four-part summary of Google&#39;s Machine Learning Crash Course. For context, check out this post. This first module covers the fundamentals of building regression and classification models.!--more--&#xA;&#xA;Linear regression&#xA;&#xA;Introduction&#xA;&#xA;The linear regression model uses an equation&#xA;$$&#xA;y&#39; = b + w\1x\1 + w\2x\2 + \ldots&#xA;$$&#xA;to represent the relationship between features and the label.&#xA;&#xA;y&#39; is the predicted label--the output&#xA;b is the bias of the model (the y-intercept in algebraic terms), sometimes referred to as w\0&#xA;w\1 is the weight of the feature (the slope in algebraic terms)&#xA;x\1 is a feature--the input&#xA;&#xA;y and features x are given. b and w are calculated from training by minimizing the difference between predicted and actual values.&#xA;&#xA;Source: Linear regression | Machine Learning | Google for Developers&#xA;&#xA;Loss&#xA;&#xA;Loss is a numerical value indicating the difference between a model&#39;s predictions and the actual values.&#xA;&#xA;The goal of model training is to minimize loss, bringing it as close to zero as possible.&#xA;&#xA;| Loss type                     | Definition                                                                                            | Equation                                                         |&#xA;| ----------------------------- | ----------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------- |&#xA;| L1 loss                   | The sum of the absolute values of the difference  between the predicted values and the actual values. | $$\sum \|\text{actual value}-\text{predicted value}\|$$            |&#xA;| Mean absolute error (MAE) | The average of L1 losses across a set of N examples.                                                
| $$\frac{1}{N}\sum \|\text{actual value}-\text{predicted value}\|$$ |&#xA;| L2 loss                   | The sum of the squared difference between the predicted values and the actual values.                 | $$\sum (\text{actual value}-\text{predicted value})^2$$            |&#xA;| Mean squared error (MSE)  | The average of L2 losses across a set of N examples.                                                | $$\frac{1}{N}\sum (\text{actual value}-\text{predicted value})^2$$ |&#xA;&#xA;The most common methods for calculating loss are Mean Absolute Error (MAE) and Mean Squared Error (MSE), which differ in their sensitivity to outliers.&#xA;&#xA;A model trained with MSE moves the model closer to the outliers but further away from most of the other data points.&#xA;model-mse.png&#xA;&#xA;A model trained with MAE is farther from the outliers but closer to most of the other data points.&#xA;model-mae.png&#xA;&#xA;Source: Linear regression: Loss | Machine Learning | Google for Developers&#xA;&#xA;Gradient descent&#xA;&#xA;Gradient descent is an iterative optimisation algorithm used to find the best weights and bias for a linear regression model by minimising the loss function.&#xA;&#xA;Calculate the loss with the current weight and bias.&#xA;Determine the direction to move the weights and bias that reduce loss.&#xA;Move the weight and bias values a small amount in the direction that reduces loss.&#xA;Return to step one and repeat the process until the model can&#39;t reduce the loss any further.&#xA;&#xA;A model is considered to have converged when further iterations do not significantly reduce the loss, indicating it has found the weights and bias that produce the lowest possible loss.&#xA;&#xA;Loss curves visually represent the model&#39;s progress during training, showing how the loss decreases over iterations and helping to identify convergence.&#xA;&#xA;Linear models have convex loss functions, ensuring that gradient descent will always find the global minimum, 
resulting in the best possible model for the given data.&#xA;&#xA;Source: Linear regression: Gradient descent | Google for Developers&#xA;&#xA;Hyperparameters&#xA;&#xA;Hyperparameters, such as learning rate, batch size, and epochs, are external configurations that influence the training process of a machine learning model.&#xA;&#xA;The learning rate determines the step size during gradient descent, impacting the speed and stability of convergence.&#xA;&#xA;If the learning rate is too low, the model can take a long time to converge.&#xA;However, if the learning rate is too high, the model never converges, but instead bounces around the weights and bias that minimise the loss.&#xA;&#xA;Batch size dictates the number of training examples processed before updating model parameters, influencing training speed and noise.&#xA;&#xA;When a dataset contains hundreds of thousands or even millions of examples, using the full batch isn&#39;t practical.&#xA;Two common techniques to get the right gradient on average without needing to look at every example in the dataset before updating the weights and bias are stochastic gradient descent and mini-batch stochastic gradient descent.&#xA;  Stochastic gradient descent uses only a single random example (a batch size of one) per iteration. Given enough iterations, SGD works but is very noisy.&#xA;  Mini-batch stochastic gradient descent is a compromise between full-batch and SGD. For N number of data points, the batch size can be any number greater than 1 and less than N. 
The model chooses the examples included in each batch at random, averages their gradients, and then updates the weights and bias once per iteration.&#xA;&#xA;Model trained with SGD:&#xA;noisy-gradient.png&#xA;&#xA;Model trained with mini-batch SGD:&#xA;mini-batch-sgd.png&#xA;&#xA;Epochs represent the number of times the entire training dataset is used during training, affecting model performance and training time.&#xA;&#xA;For example, given a training set with 1,000 examples and a mini-batch size of 100 examples, it will take the model 10 iterations to complete one epoch.&#xA;&#xA;Source: Linear regression: Hyperparameters | Machine Learning | Google for Developers&#xA;&#xA;Logistic regression&#xA;&#xA;Introduction&#xA;&#xA;Logistic regression is a model used to predict the probability of an outcome, unlike linear regression which predicts continuous numerical values.&#xA;&#xA;Logistic regression models output probabilities, which can be used directly or converted to binary categories.&#xA;&#xA;Source: Logistic Regression | Machine Learning | Google for Developers&#xA;&#xA;Calculating a probability with the sigmoid function&#xA;&#xA;A logistic regression model uses a linear equation and the sigmoid function to calculate the probability of an event.&#xA;&#xA;The sigmoid function ensures the output of logistic regression is always between 0 and 1, representing a probability.&#xA;$$&#xA;f(x) = \frac{1}{1 + e^{-x}}&#xA;$$&#xA;sigmoidfunctionwithaxes.png&#xA;&#xA;Linear component of a logistic regression model:&#xA;$$&#xA;z = b + w\1 x\1 + w\2 x\2 + \ldots + w\N x\N&#xA;$$&#xA;To obtain the logistic regression prediction, the z value is then passed to the sigmoid function, yielding a value (a probability) between 0 and 1:&#xA;$$&#xA;y&#39; = \frac{1}{1+e^{-z}}&#xA;$$&#xA;&#xA;y&#39; is the output of the logistic regression model.&#xA;z is the linear output (as calculated in the preceding equation).&#xA;&#xA;z is referred to as the log-odds because if you solve the 
sigmoid function for z you get:&#xA;$$&#xA;z = \log(\frac{y}{1-y})&#xA;$$&#xA;This is the log of the ratio of the probabilities of the two possible outcomes: y and 1 - y.&#xA;&#xA;When the linear equation becomes input to the sigmoid function, it bends the straight line into an s-shape.&#xA;lineartologistic.png&#xA;&#xA;Source: Logistic regression: Calculating a probability with the sigmoid function | Machine Learning | Google for Developers&#xA;&#xA;Loss and regularisation&#xA;&#xA;Logistic regression models are trained similarly to linear regression models but use Log Loss instead of squared loss and require regularisation.&#xA;&#xA;Log Loss is used in logistic regression because the rate of change isn&#39;t constant, requiring varying precision levels unlike squared loss used in linear regression.&#xA;&#xA;The Log Loss equation returns the logarithm of the magnitude of the change, rather than just the distance from data to prediction. Log Loss is calculated as follows:&#xA;$$&#xA;\text{Log Loss} = \sum{(x,y)\in D} -y\log(y&#39;) - (1 - y)\log(1 - y&#39;)&#xA;$$&#xA;&#xA;(x,y) is the dataset containing many labelled examples, which are (x, y) pairs.&#xA;y is the label in a labelled example. 
Since this is logistic regression, every value of y must either be 0 or 1.&#xA;y&#39; is your model&#39;s prediction (somewhere between 0 and 1), given the set of features in x.&#xA;&#xA;Regularisation, such as L2 regularisation or early stopping, is crucial in logistic regression to prevent overfitting (due to the model&#39;s asymptotic nature) and improve generalisation.&#xA;&#xA;Source: Logistic regression: Loss and regularization | Machine Learning | Google for Developers&#xA;&#xA;Classification&#xA;&#xA;Introduction&#xA;&#xA;Logistic regression models can be converted into binary classification models for predicting categories instead of probabilities.&#xA;&#xA;Source: Classification | Machine Learning | Google for Developers&#xA;&#xA;Thresholds and the confusion matrix&#xA;&#xA;To convert the raw output from a logistic regression model into binary classification (positive and negative class), you need a classification threshold.&#xA;&#xA;Confusion matrix&#xA;&#xA;|                        | Actual positive     | Actual negative     |&#xA;| ---------------------- | ------------------- | ------------------- |&#xA;| Predicted positive | True positive (TP)  | False positive (FP) |&#xA;| Predicted negative | False negative (FN) | True negative (TN)  |&#xA;&#xA;Total of each row = all predicted positives (TP + FP) and all predicted negatives (FN + TN)&#xA;Total of each column = all real positives (TP + FN) and all real negatives (FP + TN)&#xA;&#xA;When positive examples and negative examples are generally well differentiated, with most positive examples having higher scores than negative examples, the dataset is separated.&#xA;When the total of actual positives is not close to the total of actual negatives, the dataset is imbalanced.&#xA;When many positive examples have lower scores than negative examples, and many negative examples have higher scores than positive examples, the dataset is unseparated.&#xA;&#xA;When we increase the classification threshold, both TP 
and FP decrease, and both TN and FN increase.&#xA;&#xA;Source: Thresholds and the confusion matrix | Machine Learning | Google for Developers&#xA;&#xA;Accuracy, recall, precision, and related metrics&#xA;&#xA;Accuracy, Recall, Precision, and related metrics are all calculated at a single classification threshold value.&#xA;&#xA;Accuracy is the proportion of all classifications that were correct.&#xA;$$&#xA;\text{Accuracy} = \frac{\text{correct classifications}}{\text{total classifications}} = \frac{TP+TN}{TP+TN+FP+FN}&#xA;$$&#xA;&#xA;Use as a rough indicator of model training progress/convergence for balanced datasets. Typically the default.&#xA;For model performance, use only in combination with other metrics.&#xA;Avoid for imbalanced datasets. Consider using another metric.&#xA;&#xA;Recall, or true positive rate, is the proportion of all actual positives that were classified correctly as positives. Also known as probability of detection.&#xA;$$&#xA;\text{Recall (or TPR)} = \frac{\text{correctly classified actual positives}}{\text{all actual positives}} = \frac{TP}{TP+FN}&#xA;$$&#xA;&#xA;Use when false negatives are more expensive than false positives.&#xA;Better than Accuracy in imbalanced datasets.&#xA;Improves when false negatives decrease.&#xA;&#xA;False positive rate is the proportion of all actual negatives that were classified incorrectly as positives. 
Also known as probability of a false alarm.&#xA;$$&#xA;\text{FPR} = \frac{\text{incorrectly classified actual negatives}}{\text{all actual negatives}}=\frac{FP}{FP+TN}&#xA;$$&#xA;&#xA;Use when false positives are more expensive than false negatives.&#xA;Less meaningful and useful in a dataset where the number of actual negatives is very, very low.&#xA;&#xA;Precision is the proportion of all the model&#39;s positive classifications that are actually positive.&#xA;$$&#xA;\text{Precision} = \frac{\text{correctly classified actual positives}}{\text{everything classified as positive}}=\frac{TP}{TP+FP}&#xA;$$&#xA;&#xA;Use when it&#39;s very important for positive predictions to be accurate.&#xA;Less meaningful and useful in a dataset where the number of actual positives is very, very low.&#xA;Improves as false positives decrease.&#xA;&#xA;Precision and Recall often show an inverse relationship.&#xA;&#xA;F1 score is the harmonic mean of Precision and Recall.&#xA;$$&#xA;\text{F1} = 2 * \frac{\text{precision} * \text{recall}}{\text{precision} + \text{recall}} = \frac{2TP}{2TP + FP + FN}&#xA;$$&#xA;&#xA;Preferable for class-imbalanced datasets.&#xA;When Precision and Recall are close in value, F1 will be close to their value.&#xA;When Precision and Recall are far apart, F1 will be similar to whichever metric is worse.&#xA;&#xA;Source: Classification: Accuracy, recall, precision, and related metrics | Machine Learning | Google for Developers&#xA;&#xA;ROC and AUC&#xA;&#xA;ROC and AUC evaluate a model&#39;s quality across all possible thresholds.&#xA;&#xA;The ROC curve, or receiver operating characteristic curve, plots the true positive rate (TPR) against the false positive rate (FPR) at different thresholds. A perfect model would pass through (0,1), while a random guesser forms a diagonal line from (0,0) to (1,1).&#xA;&#xA;AUC, or area under the curve, represents the probability that the model will rank a randomly chosen positive example higher than a randomly chosen negative example. 
A perfect model has AUC = 1.0, while a random model has AUC = 0.5.&#xA;&#xA;ROC and AUC of a hypothetical perfect model (AUC = 1.0) and for completely random guesses (AUC = 0.5):&#xA;auc1-0.pngauc0-5.png&#xA;&#xA;ROC and AUC are effective when class distributions are balanced. For imbalanced data, precision-recall curves (PRCs) can be more informative.&#xA;prauc.png&#xA;&#xA;A higher AUC generally indicates a better-performing model.&#xA;&#xA;ROC and AUC of two hypothetical models; the second curve (AUC = 0.93) represents the better of the two models:&#xA;auc0-65.png&#xA;auc0-93.png&#xA;&#xA;Threshold choice depends on the cost of false positives versus false negatives. The most relevant thresholds are those closest to (0,1) on the ROC curve. For costly false positives, a conservative threshold (like A in the chart below) is better. For costly false negatives, a more sensitive threshold (like C) is preferable. If costs are roughly equivalent, a threshold in the middle (like B) may be best.&#xA;aucabc.png&#xA;&#xA;Source: Classification: ROC and AUC | Machine Learning | Google for Developers&#xA;&#xA;Prediction bias&#xA;&#xA;Prediction bias measures the difference between the average of a model&#39;s predictions and the average of the true labels in the data. For example, if 5% of emails in the dataset are spam, a model without prediction bias should also predict about 5% as spam. 
A large mismatch between these averages indicates potential problems.&#xA;&#xA;Prediction bias can be caused by:&#xA;&#xA;Biased and noisy data (e.g., skewed sampling)&#xA;Overly strong regularisation that oversimplifies the model&#xA;Bugs in the model training pipeline&#xA;Insufficient features provided to the model&#xA;&#xA;Source: Classification: Prediction bias | Machine Learning | Google for Developers&#xA;&#xA;Multi-class classification&#xA;&#xA;Multi-class classification extends binary classification to cases with more than two classes.&#xA;&#xA;If each example belongs to only one class, the problem can be broken down into a series of binary classifications. For instance, with three classes (A, B, C), you could first separate C from A+B, then distinguish A from B within the A+B group.&#xA;&#xA;Source: Classification: Multi-class classification | Machine Learning | Google for Developers]]&gt;</description>
      <content:encoded><![CDATA[<p>This post is part of a four-part summary of Google&#39;s <a href="https://developers.google.com/machine-learning/crash-course/">Machine Learning Crash Course</a>. For context, check out <a href="notes-from-googles-machine-learning-crash-course">this post</a>. This first module covers the fundamentals of building regression and classification models.</p>

<h2 id="linear-regression">Linear regression</h2>

<h3 id="introduction">Introduction</h3>

<p>The linear regression model uses an equation
$$
y&#39; = b + w_1x_1 + w_2x_2 + \ldots
$$
to represent the relationship between features and the label.</p>
<ul><li>y&#39; is the <strong>predicted label</strong>—the output</li>
<li>b is the <strong>bias</strong> of the model (the y-intercept in algebraic terms), sometimes referred to as w_0</li>
<li>w_1 is the <strong>weight</strong> of the feature (the slope in algebraic terms)</li>
<li>x_1 is a <strong>feature</strong>—the input</li></ul>

<p>y and features x are given. b and w are calculated from training by minimizing the difference between predicted and actual values.</p>
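
<p>As a quick illustration (not part of the course), the prediction equation can be sketched in Python; the weights, bias, and feature values below are hypothetical stand-ins for learned and observed quantities:</p>

```python
def predict(features, weights, bias):
    """Return the predicted label y' = b + w1*x1 + w2*x2 + ... for one example."""
    return bias + sum(w * x for w, x in zip(weights, features))

# Hypothetical two-feature model: b = 1.5, w1 = 0.5, w2 = -1.0
print(predict([2.0, 3.0], weights=[0.5, -1.0], bias=1.5))  # 1.5 + 1.0 - 3.0 = -0.5
```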

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/linear-regression">Linear regression | Machine Learning | Google for Developers</a></p>

<h3 id="loss">Loss</h3>

<p>Loss is a numerical value indicating the difference between a model&#39;s predictions and the actual values.</p>

<p>The goal of model training is to minimize loss, bringing it as close to zero as possible.</p>

<table>
<thead>
<tr>
<th>Loss type</th>
<th>Definition</th>
<th>Equation</th>
</tr>
</thead>

<tbody>
<tr>
<td><strong>L1 loss</strong></td>
<td>The sum of the absolute values of the difference  between the predicted values and the actual values.</td>
<td>$$\sum |\text{actual value}-\text{predicted value}|$$</td>
</tr>

<tr>
<td><strong>Mean absolute error (MAE)</strong></td>
<td>The average of L1 losses across a set of <em>N</em> examples.</td>
<td>$$\frac{1}{N}\sum |\text{actual value}-\text{predicted value}|$$</td>
</tr>

<tr>
<td><strong>L2 loss</strong></td>
<td>The sum of the squared difference between the predicted values and the actual values.</td>
<td>$$\sum (\text{actual value}-\text{predicted value})^2$$</td>
</tr>

<tr>
<td><strong>Mean squared error (MSE)</strong></td>
<td>The average of L2 losses across a set of <em>N</em> examples.</td>
<td>$$\frac{1}{N}\sum (\text{actual value}-\text{predicted value})^2$$</td>
</tr>
</tbody>
</table>

<p>The most common methods for calculating loss are Mean Absolute Error (MAE) and Mean Squared Error (MSE), which differ in their sensitivity to outliers.</p>

<p>A model trained with <strong>MSE</strong> moves the model closer to the outliers but further away from most of the other data points.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/model-mse.png" alt="model-mse.png"/></p>

<p>A model trained with <strong>MAE</strong> is farther from the outliers but closer to most of the other data points.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/model-mae.png" alt="model-mae.png"/></p>
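
<p>To make the difference concrete, here is a minimal sketch of MAE and MSE on made-up numbers; note how the single outlier residual of 10 dominates MSE far more than MAE:</p>

```python
def mae(actual, predicted):
    """Mean absolute error: average L1 loss."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error: average L2 loss."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [3.0, 5.0, 20.0]      # hypothetical labels; 20.0 acts as an outlier
predicted = [2.5, 5.5, 10.0]   # hypothetical predictions
print(mae(actual, predicted))  # (0.5 + 0.5 + 10.0) / 3 ≈ 3.67
print(mse(actual, predicted))  # (0.25 + 0.25 + 100.0) / 3 = 33.5
```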

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/linear-regression/loss">Linear regression: Loss | Machine Learning | Google for Developers</a></p>

<h3 id="gradient-descent">Gradient descent</h3>

<p><strong>Gradient descent</strong> is an iterative optimisation algorithm used to find the best weights and bias for a linear regression model by minimising the loss function.</p>
<ol><li>Calculate the loss with the current weight and bias.</li>
<li>Determine the direction in which to move the weights and bias to reduce loss.</li>
<li>Move the weight and bias values a small amount in the direction that reduces loss.</li>
<li>Return to step one and repeat the process until the model can&#39;t reduce the loss any further.</li></ol>
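
<p>A minimal sketch of this loop for a one-feature model (the data, learning rate, and iteration count below are made up for illustration):</p>

```python
# Hypothetical data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0        # start from arbitrary values
learning_rate = 0.05
n = len(xs)
for _ in range(2000):
    # Gradient of MSE loss with respect to the weight and the bias
    grad_w = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
    # Move a small amount in the direction that reduces loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))  # approaches w = 2, b = 1
```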

<p>A model is considered to have <strong>converged</strong> when further iterations do not significantly reduce the loss, indicating it has found the weights and bias that produce the lowest possible loss.</p>

<p><strong>Loss curves</strong> visually represent the model&#39;s progress during training, showing how the loss decreases over iterations and helping to identify convergence.</p>

<p>Linear models have <strong>convex</strong> loss functions, ensuring that gradient descent will always find the global minimum, resulting in the best possible model for the given data.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/linear-regression/gradient-descent">Linear regression: Gradient descent | Google for Developers</a></p>

<h3 id="hyperparameters">Hyperparameters</h3>

<p>Hyperparameters, such as <strong>learning rate</strong>, <strong>batch size</strong>, and <strong>epochs</strong>, are external configurations that influence the training process of a machine learning model.</p>

<p>The <strong>learning rate</strong> determines the step size during gradient descent, impacting the <strong>speed and stability of convergence</strong>.</p>
<ul><li>If the learning rate is too low, the model can take a long time to converge.</li>
<li>However, if the learning rate is too high, the model never converges, but instead bounces around the weights and bias that minimise the loss.</li></ul>

<p><strong>Batch size</strong> dictates the number of training examples processed before updating model parameters, influencing training speed and noise.</p>
<ul><li>When a dataset contains hundreds of thousands or even millions of examples, using the full batch isn&#39;t practical.</li>
<li>Two common techniques to get the right gradient on average without needing to look at every example in the dataset before updating the weights and bias are <strong>stochastic gradient descent</strong> and <strong>mini-batch stochastic gradient descent</strong>.
<ul><li><strong>Stochastic gradient descent</strong> uses only a single random example (a batch size of one) per iteration. Given enough iterations, SGD works but is very noisy.</li>
<li><strong>Mini-batch stochastic gradient descent</strong> is a compromise between full-batch and SGD. For a dataset with N examples, the batch size can be any number greater than 1 and less than N. The model chooses the examples included in each batch at random, averages their gradients, and then updates the weights and bias once per iteration.</li></ul></li></ul>

<p>Model trained with SGD:
<img src="https://media.portblue.net/resources/251229_ml-crash-course/noisy-gradient.png" alt="noisy-gradient.png"/></p>

<p>Model trained with mini-batch SGD:
<img src="https://media.portblue.net/resources/251229_ml-crash-course/mini-batch-sgd.png" alt="mini-batch-sgd.png"/></p>
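
<p>One way to sketch mini-batch selection (shuffling once per pass is a common choice; the helper below is illustrative, not from the course):</p>

```python
import random

def minibatches(examples, batch_size, seed=0):
    """Yield shuffled mini-batches covering the dataset once (one epoch)."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = examples[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

data = list(range(10))
batches = list(minibatches(data, batch_size=3))
print([len(b) for b in batches])  # [3, 3, 3, 1] -- the last batch may be smaller
```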

<p><strong>Epochs</strong> represent the number of times the entire training dataset is used during training, affecting model performance and training time.</p>
<ul><li>For example, given a training set with <strong>1,000 examples</strong> and a <strong>mini-batch size of 100</strong> examples, it will take the model <strong>10 iterations</strong> to complete <strong>one epoch</strong>.</li></ul>
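
<p>The epoch arithmetic can be checked directly (rounding up so that a final partial batch still counts as one iteration):</p>

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """Number of parameter updates needed to see every example once."""
    return math.ceil(num_examples / batch_size)

print(iterations_per_epoch(1000, 100))  # 10, matching the example above
```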

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/linear-regression/hyperparameters">Linear regression: Hyperparameters | Machine Learning | Google for Developers</a></p>

<h2 id="logistic-regression">Logistic regression</h2>

<h3 id="introduction-1">Introduction</h3>

<p>Logistic regression is a model used to predict the probability of an outcome, unlike linear regression which predicts continuous numerical values.</p>

<p>Logistic regression models output probabilities, which can be used directly or converted to binary categories.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/logistic-regression">Logistic Regression | Machine Learning | Google for Developers</a></p>

<h3 id="calculating-a-probability-with-the-sigmoid-function">Calculating a probability with the sigmoid function</h3>

<p>A logistic regression model uses a linear equation and the sigmoid function to calculate the probability of an event.</p>

<p>The <strong>sigmoid function</strong> ensures the output of logistic regression is always between 0 and 1, representing a probability.
$$
f(x) = \frac{1}{1 + e^{-x}}
$$
<img src="https://media.portblue.net/resources/251229_ml-crash-course/sigmoid_function_with_axes.png" alt="sigmoid_function_with_axes.png"/></p>

<p>Linear component of a logistic regression model:
$$
z = b + w_1 x_1 + w_2 x_2 + \ldots + w_N x_N
$$
To obtain the logistic regression prediction, the <em>z</em> value is then passed to the sigmoid function, yielding a value (a probability) between 0 and 1:
$$
y&#39; = \frac{1}{1+e^{-z}}
$$</p>
<ul><li>y&#39; is the output of the logistic regression model.</li>
<li>z is the linear output (as calculated in the preceding equation).</li></ul>
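
<p>A minimal sketch of the two-step calculation (the weights, bias, and inputs below are hypothetical); applying the log-odds formula to the output recovers z:</p>

```python
import math

def sigmoid(z):
    """Map any real z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical model: b = 1.0, w = [-2.0, 0.5], input x = [1.5, 2.0]
b, w, x = 1.0, [-2.0, 0.5], [1.5, 2.0]
z = b + sum(wi * xi for wi, xi in zip(w, x))  # linear output: 1.0 - 3.0 + 1.0 = -1.0
y_pred = sigmoid(z)                           # probability, about 0.269
print(round(y_pred, 3))
print(round(math.log(y_pred / (1 - y_pred)), 3))  # log-odds recovers z = -1.0
```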

<p>z is referred to as the <strong>log-odds</strong> because if you solve the sigmoid function for z you get:
$$
z = \log(\frac{y}{1-y})
$$
This is the log of the ratio of the probabilities of the two possible outcomes: y and 1 - y.</p>

<p>When the linear equation becomes input to the sigmoid function, it bends the straight line into an s-shape.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/linear_to_logistic.png" alt="linear_to_logistic.png"/></p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/logistic-regression/sigmoid-function">Logistic regression: Calculating a probability with the sigmoid function | Machine Learning | Google for Developers</a></p>

<h3 id="loss-and-regularisation">Loss and regularisation</h3>

<p>Logistic regression models are trained similarly to linear regression models but use Log Loss instead of squared loss and require regularisation.</p>

<p><strong>Log Loss</strong> is used in logistic regression because the rate of change isn&#39;t constant, requiring varying precision levels unlike squared loss used in linear regression.</p>

<p>The Log Loss equation returns the logarithm of the magnitude of the change, rather than just the distance from data to prediction. Log Loss is calculated as follows:
$$
\text{Log Loss} = \sum_{(x,y)\in D} -y\log(y&#39;) - (1 - y)\log(1 - y&#39;)
$$</p>
<ul><li>(x,y) is the dataset containing many labelled examples, which are (x, y) pairs.</li>
<li>y is the label in a labelled example. Since this is logistic regression, every value of y must either be 0 or 1.</li>
<li>y&#39; is your model&#39;s prediction (somewhere between 0 and 1), given the set of features in x.</li></ul>
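
<p>The Log Loss sum can be sketched directly from the equation (the labels and predictions below are made up):</p>

```python
import math

def log_loss(labels, predictions):
    """Sum of -y*log(y') - (1 - y)*log(1 - y') over all (label, prediction) pairs."""
    return sum(-y * math.log(p) - (1 - y) * math.log(1 - p)
               for y, p in zip(labels, predictions))

labels = [1, 0, 1, 0]            # y must be 0 or 1
preds = [0.9, 0.2, 0.6, 0.4]     # hypothetical model probabilities
print(round(log_loss(labels, preds), 4))  # lower is better; 0 only for perfectly confident correct predictions
```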

<p><strong>Regularisation</strong>, such as L2 regularisation or early stopping, is crucial in logistic regression to prevent overfitting (due to the model&#39;s asymptotic nature) and improve generalisation.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/logistic-regression/loss-regularization">Logistic regression: Loss and regularization | Machine Learning | Google for Developers</a></p>

<h2 id="classification">Classification</h2>

<h3 id="introduction-2">Introduction</h3>

<p>Logistic regression models can be converted into binary classification models for predicting categories instead of probabilities.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/classification">Classification | Machine Learning | Google for Developers</a></p>

<h3 id="thresholds-and-the-confusion-matrix">Thresholds and the confusion matrix</h3>

<p>To convert the raw output from a logistic regression model into binary classification (positive and negative class), you need a classification threshold.</p>

<p><strong>Confusion matrix</strong></p>

<table>
<thead>
<tr>
<th></th>
<th>Actual positive</th>
<th>Actual negative</th>
</tr>
</thead>

<tbody>
<tr>
<td><strong>Predicted positive</strong></td>
<td>True positive (TP)</td>
<td>False positive (FP)</td>
</tr>

<tr>
<td><strong>Predicted negative</strong></td>
<td>False negative (FN)</td>
<td>True negative (TN)</td>
</tr>
</tbody>
</table>

<p>Total of each row = all predicted positives (TP + FP) and all predicted negatives (FN + TN)
Total of each column = all real positives (TP + FN) and all real negatives (FP + TN)</p>
<ul><li>When positive examples and negative examples are generally well differentiated, with most positive examples having higher scores than negative examples, the dataset is <strong>separated</strong>.</li>
<li>When the total of actual positives is not close to the total of actual negatives, the dataset is <strong>imbalanced</strong>.</li>
<li>When many positive examples have lower scores than negative examples, and many negative examples have higher scores than positive examples, the dataset is <strong>unseparated</strong>.</li></ul>

<p>When we increase the classification threshold, both TP and FP decrease, and both TN and FN increase.</p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/classification/thresholding">Thresholds and the confusion matrix | Machine Learning | Google for Developers</a></p>

<h3 id="accuracy-recall-precision-and-related-metrics">Accuracy, recall, precision, and related metrics</h3>

<p>Accuracy, Recall, Precision, and related metrics are all calculated at a single classification threshold value.</p>

<p><strong>Accuracy</strong> is the proportion of all classifications that were correct.
$$
\text{Accuracy} = \frac{\text{correct classifications}}{\text{total classifications}} = \frac{TP+TN}{TP+TN+FP+FN}
$$</p>
<ul><li>Use as a rough indicator of model training progress/convergence for balanced datasets. Typically the default.</li>
<li>For model performance, use only in combination with other metrics.</li>
<li>Avoid for imbalanced datasets. Consider using another metric.</li></ul>

<p><strong>Recall</strong>, or <strong>true positive rate</strong>, is the proportion of all actual positives that were classified correctly as positives. Also known as <strong>probability of detection</strong>.
$$
\text{Recall (or TPR)} = \frac{\text{correctly classified actual positives}}{\text{all actual positives}} = \frac{TP}{TP+FN}
$$</p>
<ul><li>Use when false negatives are more expensive than false positives.</li>
<li>Better than Accuracy in imbalanced datasets.</li>
<li>Improves when false negatives decrease.</li></ul>

<p><strong>False positive rate</strong> is the proportion of all actual negatives that were classified <em>incorrectly</em> as positives. Also known as <strong>probability of a false alarm</strong>.
$$
\text{FPR} = \frac{\text{incorrectly classified actual negatives}}{\text{all actual negatives}}=\frac{FP}{FP+TN}
$$</p>
<ul><li>Use when false positives are more expensive than false negatives.</li>
<li>Less meaningful and useful in a dataset where the number of actual negatives is very, very low.</li></ul>

<p><strong>Precision</strong> is the proportion of all the model&#39;s positive classifications that are actually positive.
$$
\text{Precision} = \frac{\text{correctly classified actual positives}}{\text{everything classified as positive}}=\frac{TP}{TP+FP}
$$</p>
<ul><li>Use when it&#39;s very important for positive predictions to be accurate.</li>
<li>Less meaningful and useful in a dataset where the number of actual positives is very, very low.</li>
<li>Improves as false positives decrease.</li></ul>

<p>Precision and Recall often show an inverse relationship.</p>

<p><strong>F1 score</strong> is the harmonic mean of Precision and Recall.
$$
\text{F1} = 2 * \frac{\text{precision} * \text{recall}}{\text{precision} + \text{recall}} = \frac{2TP}{2TP + FP + FN}
$$</p>
<ul><li>Preferable for class-imbalanced datasets.</li>
<li>When Precision and Recall are close in value, F1 will be close to their value.</li>
<li>When Precision and Recall are far apart, F1 will be similar to whichever metric is worse.</li></ul>
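
<p>All of these single-threshold metrics follow directly from the four confusion-matrix counts; a small sketch with made-up counts:</p>

```python
def metrics(tp, fp, fn, tn):
    """Compute the threshold metrics defined above from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)            # true positive rate
    fpr = fp / (fp + tn)               # false positive rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, fpr, precision, f1

# Hypothetical counts: 40 TP, 10 FP, 20 FN, 30 TN
acc, rec, fpr, prec, f1 = metrics(tp=40, fp=10, fn=20, tn=30)
print(acc, rec, fpr, prec, f1)  # 0.7, 0.667, 0.25, 0.8, 0.727 (rounded)
```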

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall">Classification: Accuracy, recall, precision, and related metrics | Machine Learning | Google for Developers</a></p>

<h3 id="roc-and-auc">ROC and AUC</h3>

<p>ROC and AUC evaluate a model&#39;s quality across all possible thresholds.</p>

<p>The <strong>ROC curve</strong>, or <strong>receiver operating characteristic curve</strong>, plots the true positive rate (TPR) against the false positive rate (FPR) at different thresholds. A perfect model would pass through (0,1), while a random guesser forms a diagonal line from (0,0) to (1,1).</p>

<p><strong>AUC</strong>, or <strong>area under the curve</strong>, represents the probability that the model will rank a randomly chosen positive example higher than a randomly chosen negative example. A perfect model has AUC = 1.0, while a random model has AUC = 0.5.</p>
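
<p>The ranking interpretation of AUC can be sketched by comparing every positive-negative score pair (the scores below are hypothetical):</p>

```python
def auc_by_ranking(pos_scores, neg_scores):
    """Fraction of positive/negative pairs ranked correctly; ties count half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.4]   # hypothetical scores for positive examples
neg = [0.7, 0.3, 0.2]   # hypothetical scores for negative examples
print(auc_by_ranking(pos, neg))  # 8 of 9 pairs ranked correctly ≈ 0.889
```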

<p>ROC and AUC of a hypothetical perfect model (AUC = 1.0) and for completely random guesses (AUC = 0.5):
<img src="https://media.portblue.net/resources/251229_ml-crash-course/auc_1-0.png" alt="auc_1-0.png"/><img src="https://media.portblue.net/resources/251229_ml-crash-course/auc_0-5.png" alt="auc_0-5.png"/></p>

<p>ROC and AUC are effective when class distributions are balanced. For imbalanced data, <strong>precision-recall curves (PRCs)</strong> can be more informative.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/prauc.png" alt="prauc.png"/></p>

<p>A higher AUC generally indicates a better-performing model.</p>

<p>ROC and AUC of two hypothetical models; the second curve (AUC = 0.93) represents the better of the two models:
<img src="https://media.portblue.net/resources/251229_ml-crash-course/auc_0-65.png" alt="auc_0-65.png"/>
<img src="https://media.portblue.net/resources/251229_ml-crash-course/auc_0-93.png" alt="auc_0-93.png"/></p>

<p>Threshold choice depends on the cost of false positives versus false negatives. The most relevant thresholds are those closest to (0,1) on the ROC curve. For costly false positives, a conservative threshold (like A in the chart below) is better. For costly false negatives, a more sensitive threshold (like C) is preferable. If costs are roughly equivalent, a threshold in the middle (like B) may be best.
<img src="https://media.portblue.net/resources/251229_ml-crash-course/auc_abc.png" alt="auc_abc.png"/></p>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc">Classification: ROC and AUC | Machine Learning | Google for Developers</a></p>

<h3 id="prediction-bias">Prediction bias</h3>

<p><strong>Prediction bias</strong> measures the difference between the average of a model&#39;s predictions and the average of the true labels in the data. For example, if 5% of emails in the dataset are spam, a model without prediction bias should also predict about 5% as spam. A large mismatch between these averages indicates potential problems.</p>
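
<p>Checking for prediction bias amounts to comparing two averages; a sketch with made-up labels and predictions:</p>

```python
def prediction_bias(predictions, labels):
    """Average prediction minus average label; values near zero are healthy."""
    return sum(predictions) / len(predictions) - sum(labels) / len(labels)

labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # 10% of examples are positive
preds = [0.05, 0.1, 0.2, 0.05, 0.1, 0.1, 0.05, 0.05, 0.1, 0.8]  # hypothetical outputs
print(round(prediction_bias(preds, labels), 2))  # 0.06: the model predicts 16% positives on average
```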

<p>Prediction bias can be caused by:</p>
<ul><li>Biased and noisy data (e.g., skewed sampling)</li>
<li>Overly strong regularisation that oversimplifies the model</li>
<li>Bugs in the model training pipeline</li>
<li>Insufficient features provided to the model</li></ul>

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/classification/prediction-bias">Classification: Prediction bias | Machine Learning | Google for Developers</a></p>

<h3 id="multi-class-classification">Multi-class classification</h3>

<p><strong>Multi-class classification</strong> extends binary classification to cases with more than two classes.</p>

<p>If each example belongs to only one class, the problem can be broken down into a series of binary classifications. For instance, with three classes (A, B, C), you could first separate C from A+B, then distinguish A from B within the A+B group.</p>
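
<p>A toy sketch of that breakdown, using two stand-in binary classifiers (the decision rules below are invented purely to show the control flow):</p>

```python
def is_c(x):
    """Stand-in binary classifier: C versus (A or B)."""
    return x >= 10

def is_a(x):
    """Stand-in binary classifier: A versus B, applied within the A+B group."""
    return x % 2 == 0

def classify(x):
    # First separate C from A+B, then distinguish A from B
    if is_c(x):
        return "C"
    return "A" if is_a(x) else "B"

print([classify(x) for x in [2, 3, 12]])  # ['A', 'B', 'C']
```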

<p>Source: <a href="https://developers.google.com/machine-learning/crash-course/classification/multiclass">Classification: Multi-class classification | Machine Learning | Google for Developers</a></p>
]]></content:encoded>
      <guid>https://stefan.angrick.me/google-ml-crash-course-1-notes-ml-models</guid>
      <pubDate>Mon, 29 Dec 2025 10:02:39 +0000</pubDate>
    </item>
    <item>
      <title>Notes From Google’s Machine Learning Crash Course</title>
      <link>https://stefan.angrick.me/notes-from-googles-machine-learning-crash-course?pk_campaign=rss-feed</link>
<description>&lt;![CDATA[I like to revisit Google&#39;s Machine Learning Crash Course every now and then to refresh my understanding of key machine learning concepts. It&#39;s a fantastic free resource, published under the Creative Commons Attribution 4.0 License, now updated with content on recent developments like large language models and automated ML.&#xA;&#xA;I take notes as a personal reference, and I thought I would post them here to keep track of what I&#39;ve learned--while hopefully offering something useful to others doing the same.&#xA;&#xA;The notes are organised into four posts, one for each course module:&#xA;&#xA;Google ML Crash Course #1: ML Models&#xA;Google ML Crash Course #2: Data&#xA;Google ML Crash Course #3: Advanced ML Models&#xA;Google ML Crash Course #4: Real-World ML]]&gt;</description>
      <content:encoded><![CDATA[<p>I like to revisit Google&#39;s <a href="https://developers.google.com/machine-learning/crash-course/">Machine Learning Crash Course</a> every now and then to refresh my understanding of key machine learning concepts. It&#39;s a fantastic free resource, published under the <a href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 License</a>, now updated with content on recent developments like large language models and automated ML.</p>

<p>I take notes as a personal reference, and I thought I would post them here to keep track of what I&#39;ve learned—while hopefully offering something useful to others doing the same.</p>

<p>The notes are organised into four posts, one for each course module:</p>
<ol><li><a href="google-ml-crash-course-1-notes-ml-models">Google ML Crash Course #1: ML Models</a></li>
<li><a href="google-ml-crash-course-2-notes-data">Google ML Crash Course #2: Data</a></li>
<li><a href="google-ml-crash-course-3-notes-advanced-ml-models">Google ML Crash Course #3: Advanced ML Models</a></li>
<li><a href="google-ml-crash-course-4-notes-real-world-ml">Google ML Crash Course #4: Real-World ML</a></li></ol>
]]></content:encoded>
      <guid>https://stefan.angrick.me/notes-from-googles-machine-learning-crash-course</guid>
      <pubDate>Mon, 29 Dec 2025 09:20:44 +0000</pubDate>
    </item>
    <item>
      <title>Does Japan Have a Labour Shortage?</title>
      <link>https://stefan.angrick.me/does-japan-have-a-labour-shortage?pk_campaign=rss-feed</link>
<description>&lt;![CDATA[The notion that Japan has a labour shortage is widely accepted. The country&#39;s ageing and shrinking population leaves fewer working-age people for firms to hire. Businesses often cite difficulties finding workers as a reason for raising pay, and labour shortages are seen as a driver of inflation. However, Mizuho executive economist Momma Kazuo, in a commentary published on 21 January 2025, challenges this view. He argues that labour market conditions are more nuanced than commonly perceived and that declining real wages contradict the narrative of a general labour shortage. This post summarises key points from the report.&#xA;&#xA;A PDF of the original Japanese-language report can be found here.&#xA;&#xA;Nominal wages are increasing&#xA;&#xA;The 2024 shunto spring wage negotiations produced a 5.1% nominal pay gain, the highest in 33 years. Labour unions are aiming for a 5%+ increase in 2025.&#xA;Inflation and labour shortages are widely cited as drivers of wage increases. While inflation is a clear driver of wage gains, the impact of labour shortages is less certain.&#xA;Businesses cite pay increases as necessary to retain workers. But if labour shortages were driving inflation, nominal pay gains would outpace inflation and real wages would rise.&#xA;Real wages briefly increased in summer 2024 but remain 2.4% lower than in 2019.&#xA;To recover this loss by 2026, real wages need to rise 1.2% annually, which requires nominal pay gains of 3.2% assuming 2% inflation. This aligns with shunto outcomes of 5%.&#xA;The modest union wage demands suggest limited negotiating power. 
True labour shortages would lead to stronger wage demands.&#xA;&#xA;2017 again?&#xA;&#xA;In 2017, wage stagnation was partly attributed to increased employment of women and seniors, who often take lower-wage jobs, reducing average wage levels.&#xA;Workforce growth from these groups has since slowed, leading some to argue that current wage increases reflect genuine labour shortages.&#xA;But the continued stagnation of real wages casts doubt on this argument. It also suggests that wage stagnation in 2017 has yet to be fully explained.&#xA;&#xA;Not really a labour shortage?&#xA;&#xA;Some researchers argue that labour shortages are overstated, including Professor Saito Makoto of Nagoya University and Professor Shioji Etsuro of Chuo University.&#xA;Indicators like the job-to-applicant ratio, the unemployment rate, and the Bank of Japan&#39;s labour supply-demand gap estimate show that the labour market is not as tight as before the pandemic.&#xA;Labour shortages may be sector-specific, with industries like IT, construction, and essential services facing shortages, while others, such as office roles, experience surplus supply.&#xA;&#xA;Financial constraints impeding wage growth&#xA;&#xA;Businesses may lack the financial resources to support significant wage increases.&#xA;Japan&#39;s real GDP growth has been so weak as to suggest that Japan is heading towards four lost decades.&#xA;Weak real GDP growth and worsening terms of trade have eroded Japan&#39;s real Gross Domestic Income (GDI), with ¥5.8 trillion (approximately 1% of GDP) in income flowing out of the economy since 2019.&#xA;This weak macroeconomic environment limits the ability to achieve sustained real wage growth.&#xA;&#xA;Domestic growth expectations are key&#xA;&#xA;Many businesses increasingly rely on price increases to fund wage hikes, but this does not improve real wages as higher costs offset nominal gains.&#xA;Policies to sustainably increase real wages are difficult to identify, but obstacles that 
need to be resolved include institutional constraints, time and cost barriers to reskilling, a lack of investment in workers, and weak domestic growth expectations.&#xA;&#xA;Links&#xA;&#xA;Momma Kazuo&#39;s research pieces at Mizuho Research &amp; Technologies]]&gt;</description>
      <content:encoded><![CDATA[<p>The notion that Japan has a labour shortage is widely accepted. The country&#39;s ageing and shrinking population leaves fewer working-age people for firms to hire. Businesses often cite difficulties finding workers as a reason for raising pay, and labour shortages are seen as a driver of inflation. However, Mizuho executive economist Momma Kazuo, in a commentary published on 21 January 2025, challenges this view. He argues that labour market conditions are more nuanced than commonly perceived and that declining real wages contradict the narrative of a general labour shortage. This post summarises key points from the report.</p>

<p>A PDF of the original Japanese-language report can be found <a href="https://www.mizuho-rt.co.jp/publication/executive/pdf/km_c250121.pdf">here</a>.</p>

<h2 id="nominal-wages-are-increasing">Nominal wages are increasing</h2>
<ul><li>The 2024 shunto spring wage negotiations produced a 5.1% nominal pay gain, the highest in 33 years. Labour unions are aiming for a 5%+ increase in 2025.</li>
<li>Inflation and labour shortages are widely cited as drivers of wage increases. While inflation is a clear driver of wage gains, the impact of labour shortages is less certain.</li>
<li>Businesses cite pay increases as necessary to retain workers. But if labour shortages were driving inflation, nominal pay gains would outpace inflation and real wages would rise.</li>
<li>Real wages briefly increased in summer 2024 but remain 2.4% lower than in 2019.</li>
<li>To recover this loss by 2026, real wages need to rise 1.2% annually, which requires nominal pay gains of 3.2% assuming 2% inflation. This aligns with shunto outcomes of 5%.</li>
<li>The modest union wage demands suggest limited negotiating power. True labour shortages would lead to stronger wage demands.</li></ul>
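The recovery arithmetic summarised above can be checked in a few lines. This is an illustrative sketch using the report's figures (a 2.4% real-wage shortfall, a 2024–2026 horizon, and an assumed 2% inflation rate); the variable names are my own.

```python
# Sketch of the real-wage recovery arithmetic summarised above.
real_wage_gap = 2.4      # % shortfall of real wages vs the 2019 level
years_to_recover = 2     # 2024 to 2026
assumed_inflation = 2.0  # % per year, the report's assumption

required_real_growth = real_wage_gap / years_to_recover             # % per year
required_nominal_growth = required_real_growth + assumed_inflation  # % per year

print(f"Required real wage growth: {required_real_growth:.1f}% per year")
print(f"Required nominal wage growth: {required_nominal_growth:.1f}% per year")
```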

<h2 id="2017-again">2017 again?</h2>
<ul><li>In 2017, wage stagnation was partly attributed to increased employment of women and seniors, who often take lower-wage jobs, reducing average wage levels.</li>
<li>Workforce growth from these groups has since slowed, leading some to argue that current wage increases reflect genuine labour shortages.</li>
<li>But the continued stagnation of real wages casts doubt on this argument. It also suggests that wage stagnation in 2017 has yet to be fully explained.</li></ul>

<h2 id="not-really-a-labour-shortage">Not really a labour shortage?</h2>
<ul><li>Some researchers, including Professor Saito Makoto of Nagoya University and Professor Shioji Etsuro of Chuo University, argue that labour shortages are overstated.</li>
<li>Indicators like the job-to-applicant ratio, the unemployment rate, and the Bank of Japan&#39;s labour supply-demand gap estimate show that the labour market is not as tight as before the pandemic.</li>
<li>Labour shortages may be sector-specific, with industries like IT, construction, and essential services facing shortages, while others, such as office roles, experience surplus supply.</li></ul>

<h2 id="financial-constraints-impeding-wage-growth">Financial constraints impeding wage growth</h2>
<ul><li>Businesses may lack the financial resources to support significant wage increases.</li>
<li>Japan&#39;s real GDP growth has been so weak as to suggest that Japan is heading towards <a href="https://stefan.angrick.me/on-the-brink-of-40-lost-years-momma-kazuos-commentary-on-the-japanese-economy">four lost decades</a>.</li>
<li>Weak real GDP growth and worsening terms of trade have eroded Japan&#39;s real Gross Domestic Income (GDI), with ¥5.8 trillion (approximately 1% of GDP) in income flowing out of the economy since 2019.</li>
<li>This weak macroeconomic environment limits the ability to achieve sustained real wage growth.</li></ul>

<h2 id="domestic-growth-expectations-are-key">Domestic growth expectations are key</h2>
<ul><li>Many businesses increasingly rely on price increases to fund wage hikes, but this does not improve real wages as higher costs offset nominal gains.</li>
<li>Policies to sustainably increase real wages are difficult to identify, but obstacles that need to be resolved include institutional constraints, time and cost barriers to reskilling, a lack of investment in workers, and weak domestic growth expectations.</li></ul>

<h2 id="links">Links</h2>
<ul><li><a href="https://www.mizuho-rt.co.jp/publication/executive/index.html">Momma Kazuo&#39;s research pieces</a> at Mizuho Research &amp; Technologies</li></ul>
]]></content:encoded>
      <guid>https://stefan.angrick.me/does-japan-have-a-labour-shortage</guid>
      <pubDate>Thu, 23 Jan 2025 12:09:25 +0000</pubDate>
    </item>
    <item>
      <title>On the Brink of Four Lost Decades</title>
      <link>https://stefan.angrick.me/on-the-brink-of-40-lost-years-momma-kazuos-commentary-on-the-japanese-economy?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[This post summarises key points from the latest economic commentary by Momma Kazuo, executive economist with Mizuho Research &amp; Technologies and former assistant governor at the Bank of Japan. Published on 25 December 2024, this piece is part of a series of articles exploring issues in the Japanese and global economies. It highlights the risk of Japan&#39;s lost decades continuing and questions whether the country can overcome its demographic challenges.!--more--&#xA;&#xA;A PDF of the original Japanese-language report is available here.&#xA;&#xA;The end of deflation, but no growth&#xA;&#xA;Looking back, 2024 marked the year Japan&#39;s economy exited deflation, as Shunto spring wage negotiations achieved increases exceeding 5% for the first time in 33 years, consumer price inflation surpassed 2% for three consecutive years, and the stock market reached record highs.&#xA;But real GDP growth remains alarmingly sluggish, suggesting that after the &#34;Lost 30 Years,&#34; Japan&#39;s economy is now heading towards a &#34;Lost 40 Years.&#34;&#xA;Since the COVID-19 pandemic, real GDP growth has averaged only 0.2% per year, far below the 0.5% trend growth rate.&#xA;&#xA;Wages and inflation reflect economic problems&#xA;&#xA;Deflation occurs when demand is too weak to utilise available labour and other supply capacities. In Japan, this happened after the bursting of the bubble which made households and businesses turn extremely cautious and unwilling to spend.&#xA;This demand weakness persists but is now exacerbated by supply-side constraints, such as rising import costs and a shrinking workforce due to population ageing.&#xA;Deflation has ended, with wages and prices rising, not because of stronger demand but because supply capacity can no longer meet even weak demand. 
Labelling this a &#34;virtuous cycle of wages and prices&#34; is an unrealistic and overly optimistic interpretation.&#xA;If a true virtuous cycle were occurring, public opinion surveys by the Bank of Japan would reflect greater optimism about living conditions, and the ruling coalitions&#39; loss in recent general elections would be considered odd.&#xA;&#xA;Stocks and the real economy diverge&#xA;&#xA;Corporations and privileged individuals claim a virtuous cycle is underway because they are relatively well-off, can absorb the rising cost of living, and mistakenly believe higher prices reflect an improved economy due to increased profits and stock gains.&#xA;Nominal GDP growth appears strong, even as real GDP stagnates, but there are two critical considerations.&#xA;First, inflation erodes the value of financial assets.&#xA;Second, stock market gains since the early 2010s are attributable to corporate reforms, as wages, prices, and nominal output showed limited growth during this period.&#xA;Stock prices have risen due to shareholder-focused and profitability-enhancing reforms, such as the elimination of cross-shareholdings, increased influence of foreign investors, and government-led reforms, countering the perception that Abenomics&#39; structural reform agenda was neglected.&#xA;If investors assume Japan&#39;s domestic market will continue shrinking, companies are evaluated primarily on (1) overseas expansion and (2) domestic business efficiency.&#xA;The issue is that (1) overseas expansion is unlikely to translate into domestic investment or wage increases, and (2) efficiency improvements may even have a negative impact.&#xA;This dynamic explains the disconnect between rising stock prices and the stagnation of the domestic economy, particularly among small and medium-sized enterprises and households.&#xA;&#xA;On a genuine virtuous cycle&#xA;&#xA;Ensuring households directly benefit from rising stock prices is one approach, but focusing solely on asset-based 
measures risks increasing inequality.&#xA;Outright redistribution policies, such as increased taxes on financial income, are impractical as they would disrupt the stock market&#39;s momentum and corporate reform progress.&#xA;There is hope in Japanese companies&#39; inherent resilience and the potential for labour shortages to drive productivity gains, but achieving this requires businesses and shareholders to view domestic market expansion as profitable.&#xA;Initiatives like the TSMC boom in Kyushu demonstrate how government intervention can stimulate virtuous cycles by creating business opportunities, even amid demographic challenges. Economic security was a key driver in this case.&#xA;To avoid another four decades of stagnation, Japan must address critical issues like decarbonisation, developing new energy sources, strengthening science and technology, renewing infrastructure, and reforming education. This requires vision and insight from both the public and private sectors.&#xA;&#xA;Links&#xA;&#xA;Momma Kazuo&#39;s research pieces at Mizuho Research &amp; Technologies]]&gt;</description>
      <content:encoded><![CDATA[<p>This post summarises key points from the latest economic commentary by Momma Kazuo, executive economist with Mizuho Research &amp; Technologies and former assistant governor at the Bank of Japan. Published on 25 December 2024, this piece is part of a series of articles exploring issues in the Japanese and global economies. It highlights the risk of Japan&#39;s lost decades continuing and questions whether the country can overcome its demographic challenges.</p>

<p>A PDF of the original Japanese-language report is available <a href="https://www.mizuho-rt.co.jp/publication/executive/pdf/km_c241225.pdf">here</a>.</p>

<h2 id="the-end-of-deflation-but-no-growth">The end of deflation, but no growth</h2>
<ul><li>Looking back, 2024 marked the year Japan&#39;s economy exited deflation, as Shunto spring wage negotiations achieved increases exceeding 5% for the first time in 33 years, consumer price inflation surpassed 2% for three consecutive years, and the stock market reached record highs.</li>
<li>But real GDP growth remains alarmingly sluggish, suggesting that after the “Lost 30 Years,” Japan&#39;s economy is now heading towards a “Lost 40 Years.”</li>
<li>Since the COVID-19 pandemic, real GDP growth has averaged only 0.2% per year, far below the 0.5% trend growth rate.</li></ul>

<h2 id="wages-and-inflation-reflect-economic-problems">Wages and inflation reflect economic problems</h2>
<ul><li>Deflation occurs when demand is too weak to utilise available labour and other supply capacities. In Japan, this happened after the bursting of the bubble, which made households and businesses extremely cautious and unwilling to spend.</li>
<li>This demand weakness persists but is now exacerbated by supply-side constraints, such as rising import costs and a shrinking workforce due to population ageing.</li>
<li>Deflation has ended, with wages and prices rising, not because of stronger demand but because supply capacity can no longer meet even weak demand. Labelling this a “virtuous cycle of wages and prices” is an unrealistic and overly optimistic interpretation.</li>
<li>If a true virtuous cycle were occurring, public opinion surveys by the Bank of Japan would reflect greater optimism about living conditions, and the ruling coalitions&#39; loss in recent general elections would be considered odd.</li></ul>

<h2 id="stocks-and-the-real-economy-diverge">Stocks and the real economy diverge</h2>
<ul><li>Corporations and privileged individuals claim a virtuous cycle is underway because they are relatively well-off, can absorb the rising cost of living, and mistakenly believe higher prices reflect an improved economy due to increased profits and stock gains.</li>
<li>Nominal GDP growth appears strong, even as real GDP stagnates, but there are two critical considerations.</li>
<li>First, inflation erodes the value of financial assets.</li>
<li>Second, stock market gains since the early 2010s are attributable to corporate reforms, as wages, prices, and nominal output showed limited growth during this period.</li>
<li>Stock prices have risen due to shareholder-focused and profitability-enhancing reforms, such as the elimination of cross-shareholdings, increased influence of foreign investors, and government-led reforms, countering the perception that Abenomics&#39; structural reform agenda was neglected.</li>
<li>If investors assume Japan&#39;s domestic market will continue shrinking, companies are evaluated primarily on (1) overseas expansion and (2) domestic business efficiency.</li>
<li>The issue is that (1) overseas expansion is unlikely to translate into domestic investment or wage increases, and (2) efficiency improvements may even have a negative impact.</li>
<li>This dynamic explains the disconnect between rising stock prices and the stagnation of the domestic economy, particularly among small and medium-sized enterprises and households.</li></ul>

<h2 id="on-a-genuine-virtuous-cycle">On a genuine virtuous cycle</h2>
<ul><li>Ensuring households directly benefit from rising stock prices is one approach, but focusing solely on asset-based measures risks increasing inequality.</li>
<li>Outright redistribution policies, such as increased taxes on financial income, are impractical as they would disrupt the stock market&#39;s momentum and corporate reform progress.</li>
<li>There is hope in Japanese companies&#39; inherent resilience and the potential for labour shortages to drive productivity gains, but achieving this requires businesses and shareholders to view domestic market expansion as profitable.</li>
<li>Initiatives like the TSMC boom in Kyushu demonstrate how government intervention can stimulate virtuous cycles by creating business opportunities, even amid demographic challenges. Economic security was a key driver in this case.</li>
<li>To avoid another four decades of stagnation, Japan must address critical issues like decarbonisation, developing new energy sources, strengthening science and technology, renewing infrastructure, and reforming education. This requires vision and insight from both the public and private sectors.</li></ul>

<h2 id="links">Links</h2>
<ul><li><a href="https://www.mizuho-rt.co.jp/publication/executive/index.html">Momma Kazuo&#39;s research pieces</a> at Mizuho Research &amp; Technologies</li></ul>
]]></content:encoded>
      <guid>https://stefan.angrick.me/on-the-brink-of-40-lost-years-momma-kazuos-commentary-on-the-japanese-economy</guid>
      <pubDate>Sun, 29 Dec 2024 00:34:09 +0000</pubDate>
    </item>
    <item>
      <title>Proxying the Yen Carry Trade</title>
      <link>https://stefan.angrick.me/proxying-the-yen-carry-trade?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[The big story over the past two weeks has been the unwinding of the yen carry trade. This trade involves borrowing yen at low rates in Japan and investing the funds in higher-yielding foreign currencies, such as the US dollar or the Mexican peso. Many have argued that a surprise rate hike from the Bank of Japan on 31 July and disappointing US labour market data on 2 August triggered the collapse of this trade. As a result, the yen surged, equities stumbled, and government bonds rallied. While not much is known about the exact scale of the yen carry trade, this post explores two methods of proxying it.px.gif!--more--&#xA;&#xA;Method one: Tracking yen credit to borrowers outside Japan via BIS&#xA;&#xA;One method to track the yen carry trade was suggested by Hyun Song Shin, Economic Adviser and Head of Research at the Bank for International Settlements (BIS), in a Twitter/X thread on 9 August. Shin pointed to the BIS Global Liquidity Indicators data Q.JPY.3P.N.B.I.G.JPY, which is quarterly data that tracks yen-denominated loans as foreign currency, excluding yen securities. By the end of March 2024, these loans tallied just above 41 trillion JPY, up from about 30 trillion pre-pandemic. Shin notes that not all of this reflects carry trade activity, so this figure should be considered a ceiling for carry trades conducted on-balance sheet.&#xA;&#xA;bisyencredityencarry.png&#xA;&#xA;Shin further explains that carry trades can also be done through FX swaps, where a lender provides dollars in return for yen. The dollar provider normally parks the yen proceeds in a safe yen asset, such as short-term Japanese government bonds. But if the dollar provider instead sells the yen proceeds in the spot market for dollars, it leaves an unhedged yen obligation, which constitutes a carry trade that isn&#39;t captured in BIS data. 
Shin points to this study, which puts the size of dollar/yen FX swaps at around 14 trillion USD, with about 1 trillion USD held by foreigners in official assets.&#xA;&#xA;Method two: Monitoring foreign bank borrowing from Tokyo offices via BOJ&#xA;&#xA;Another method for proxying the yen carry trade that is popular in Japan is monitoring foreign banks&#39; borrowing from their Tokyo offices. A bank with a global presence can raise yen through its Tokyo branch. When that happens, the Tokyo office acquires a claim on the foreign office through interoffice accounts. This data is available at monthly frequency through the Bank of Japan&#39;s release on Principal Assets and Liabilities of Foreign Banks in Japan data BS02&#39;FAFBKFAFB2A9. As of May 2024, foreign borrowing from Tokyo offices was about 11 trillion JPY, up from about 7.5 trillion pre-pandemic.&#xA;&#xA;bojfrgnbankborrowingyencarry.png&#xA;&#xA;Here is an R script to produce the charts above.&#xA;&#xA;Update (1 September 2024): The BIS has released new analysis on the carry trade unwind, which shows additional data for yen claims on non-banks in offshore centres. This data is part of the Locational Banking Statistics and can be compiled by summing up the figures for offshore economies, starting with the Cayman Islands under the code Q.S.C.A.JPY.A.5J.A.5A.N.KY.N. 
Alternatively, you can visit the Bank of Japan&#39;s statistics portal and download the flat file for BIS International Locational Banking Statistics in Japan (Claims), which contains the relevant totals under indicator codes BIS3G00103203200071N and BIS3G00303203200071N.&#xA;&#xA;Links&#xA;&#xA;Adam Tooze&#39;s Chartbook 305 Yen carry trades and the turmoil in global fx and equity markets&#xA;Bloomberg Odd Lots podcast episode Lots More on Solving the Mystery of the Big Market Selloff, featuring Charlie McElligott, cross-asset macro strategist at Nomura&#xA;Wall Street Journal article What Is the Yen Carry Trade?&#xA;The Economist article Time to shine a light on the shadowy carry trade&#xA;Nikkei article Is the yen carry trade the culprit behind the stock market crash? (Japanese)]]&gt;</description>
<content:encoded><![CDATA[<p>The big story over the past two weeks has been the unwinding of the yen carry trade. This trade involves borrowing yen at low rates in Japan and investing the funds in higher-yielding foreign currencies, such as the US dollar or the Mexican peso. Many have argued that a surprise rate hike from the Bank of Japan on 31 July and disappointing US labour market data on 2 August triggered the collapse of this trade. As a result, the yen surged, equities stumbled, and government bonds rallied. While not much is known about the exact scale of the yen carry trade, this post explores two methods of proxying it.</p>

<h2 id="method-one-tracking-yen-credit-to-borrowers-outside-japan-via-bis">Method one: Tracking yen credit to borrowers outside Japan via BIS</h2>

<p>One method to track the yen carry trade was suggested by Hyun Song Shin, Economic Adviser and Head of Research at the Bank for International Settlements (BIS), in a <a href="https://x.com/HyunSongShin/status/1821905766439076224">Twitter/X thread</a> on 9 August. Shin pointed to the BIS Global Liquidity Indicators series <a href="https://data.bis.org/topics/GLI/BIS,WS_GLI,1.0/Q.JPY.3P.N.B.I.G.JPY">Q.JPY.3P.N.B.I.G.JPY</a>, a quarterly series that tracks yen-denominated loans to borrowers outside Japan (for whom the yen is a foreign currency), excluding yen securities. By the end of March 2024, these loans tallied just above 41 trillion JPY, up from about 30 trillion pre-pandemic. Shin notes that not all of this reflects carry trade activity, so this figure should be considered a ceiling for carry trades conducted on-balance sheet.</p>

<p><img src="https://media.portblue.net/resources/240818_yen-carry-trade/bis_yen_credit_yen_carry.png" alt="bis_yen_credit_yen_carry.png"/></p>

<p>Shin further explains that carry trades can also be done through FX swaps, where a lender provides dollars in return for yen. The dollar provider normally parks the yen proceeds in a safe yen asset, such as short-term Japanese government bonds. But if the dollar provider instead sells the yen proceeds in the spot market for dollars, it leaves an unhedged yen obligation, which constitutes a carry trade that isn&#39;t captured in BIS data. Shin points to <a href="https://www.bis.org/publ/qtrpdf/r_qt2309b.htm">this study</a>, which puts the size of dollar/yen FX swaps at around 14 trillion USD, with about 1 trillion USD held by foreigners in official assets.</p>

<h2 id="method-two-monitoring-foreign-bank-borrowing-from-tokyo-offices-via-boj">Method two: Monitoring foreign bank borrowing from Tokyo offices via BOJ</h2>

<p>Another method for proxying the yen carry trade that is popular in Japan is monitoring foreign banks&#39; borrowing from their Tokyo offices. A bank with a global presence can raise yen through its Tokyo branch. When that happens, the Tokyo office acquires a claim on the foreign office through interoffice accounts. This data is available at monthly frequency through the Bank of Japan&#39;s release on <a href="https://www.boj.or.jp/en/statistics/outline/note/notest32.htm">Principal Assets and Liabilities of Foreign Banks in Japan</a> data <a href="https://www.stat-search.boj.or.jp/index_en.html">BS02&#39;FAFBK_FAFB2A9</a>. As of May 2024, foreign borrowing from Tokyo offices was about 11 trillion JPY, up from about 7.5 trillion pre-pandemic.</p>

<p><img src="https://media.portblue.net/resources/240818_yen-carry-trade/boj_frgn_bank_borrowing_yen_carry.png" alt="boj_frgn_bank_borrowing_yen_carry.png"/></p>

<p><a href="https://media.portblue.net/resources/240818_yen-carry-trade/yen_carry_trade.R">Here</a> is an R script to produce the charts above.</p>
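For readers who prefer not to run the R script, the headline changes implied by the two proxies can be reproduced from the approximate levels quoted in this post. These are the rounded figures cited above, not a live data pull, and the variable names are my own.

```python
# Approximate levels cited in the post, in trillions of JPY.
# BIS Global Liquidity Indicators: yen loans to borrowers outside Japan.
bis_yen_loans = {"pre-pandemic": 30.0, "2024-03": 41.0}
# BOJ: foreign offices' borrowing from their Tokyo offices (interoffice accounts).
boj_interoffice = {"pre-pandemic": 7.5, "2024-05": 11.0}

def change(series, start, end):
    """Change between two observations, in trillions of JPY."""
    return series[end] - series[start]

print(change(bis_yen_loans, "pre-pandemic", "2024-03"))    # 11.0
print(change(boj_interoffice, "pre-pandemic", "2024-05"))  # 3.5
```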

<p><em>Update (1 September 2024): The BIS has released new <a href="https://www.bis.org/publ/bisbull90.htm">analysis</a> on the carry trade unwind, which shows additional data for yen claims on non-banks in offshore centres. This data is part of the Locational Banking Statistics and can be compiled by summing up the figures for offshore economies, starting with the Cayman Islands under the code Q.S.C.A.JPY.A.5J.A.5A.N.KY.N. Alternatively, you can visit the Bank of Japan&#39;s <a href="https://www.stat-search.boj.or.jp/info/dload_en.html">statistics portal</a> and download the flat file for <a href="https://www.stat-search.boj.or.jp/info/bis1-1_q_en.zip">BIS International Locational Banking Statistics in Japan (Claims)</a>, which contains the relevant totals under indicator codes BIS3G00103203200071N and BIS3G00303203200071N.</em></p>

<h2 id="links">Links</h2>
<ul><li>Adam Tooze&#39;s <a href="https://adamtooze.substack.com/p/chartbook-305-yen-carry-trades-and">Chartbook 305 Yen carry trades and the turmoil in global fx and equity markets</a></li>
<li>Bloomberg Odd Lots podcast episode <a href="https://podcasts.apple.com/us/podcast/lots-more-on-solving-the-mystery-of-the-big-market-selloff/id1056200096?i=1000664508917">Lots More on Solving the Mystery of the Big Market Selloff</a>, featuring Charlie McElligott, cross-asset macro strategist at Nomura</li>
<li>Wall Street Journal article <a href="https://www.wsj.com/finance/currencies/what-is-the-yen-carry-trade-e5ab9670">What Is the Yen Carry Trade?</a></li>
<li>The Economist article <a href="https://www.economist.com/leaders/2024/08/15/time-to-shine-a-light-on-the-shadowy-carry-trade">Time to shine a light on the shadowy carry trade</a></li>
<li>Nikkei article <a href="https://www.nikkei.com/article/DGXZQOUB0925S0Z00C24A8000000/">Is the yen carry trade the culprit behind the stock market crash? (Japanese)</a></li></ul>
]]></content:encoded>
      <guid>https://stefan.angrick.me/proxying-the-yen-carry-trade</guid>
      <pubDate>Sun, 18 Aug 2024 10:57:22 +0000</pubDate>
    </item>
    <item>
      <title>Measuring Expected Inflation with Breakevens</title>
      <link>https://stefan.angrick.me/measuring-expected-inflation-with-breakevens?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[Central banks in recent years have paid close attention to &#34;inflation expectations&#34; when setting monetary policy. But measuring these expectations is not always straightforward. Expectations can vary significantly depending on whether they are drawn from businesses, households, or professional forecasters. One of the more practical methods for gauging inflation expectations is by analysing bond market data. This post provides an overview of how this works.px.gif!--more--&#xA;&#xA;Bond market-based measures of inflation expectations are known as &#34;breakevens&#34;. These breakeven rates offer a snapshot of what investors anticipate inflation will be over a specific timeframe. To calculate a breakeven rate, we compare the yield of an inflation-protected government bond (such as US TIPS) with a nominal government bond of the same maturity. The difference between these two yields represents the breakeven inflation rate, or the rate at which an investor would earn the same return whether they bought an inflation-protected bond or a nominal bond.&#xA;&#xA;Examples&#xA;&#xA;Using data from the U.S. 5-year Treasuries available on the St. Louis Fed&#39;s FRED database, we calculate the breakeven rate as follows:&#xA;&#xA;5-Year Breakeven Inflation Rate (T5YIE) = Yield on 5-year US Treasury (DGS5) - Yield on 5-year US Treasury Inflation-Indexed (DFII5)&#xA;&#xA;As of 8 August 2024, the yield on a 5-year US Treasury is 3.83%, while the yield on an inflation-indexed 5-year US Treasury stands at 1.85%. This translates to a breakeven rate of 1.98% (3.83% - 1.85%).&#xA;&#xA;breakeven5yr.png&#xA;&#xA;Let&#39;s repeat this for 10-year US Treasuries:&#xA;&#xA;10-Year Breakeven Inflation Rate (T10YIE) = Yield on 10-year US Treasury (DGS10) - Yield on 10-year US Treasury Inflation-Indexed (DFII10)&#xA;&#xA;As of 8 August 2024, the yield on a 10-year US Treasury is 3.99%, while the yield on an inflation-indexed 10-year US Treasury is 1.87%. 
This yields a breakeven rate of 2.12% (3.99% - 1.87%).&#xA;&#xA;breakeven10yr.png&#xA;&#xA;Links&#xA;&#xA;FRED data for US breakeven inflation rates: 5-year breakeven, 7-year breakeven, 10-year breakeven, 20-year breakeven, and 30-year breakeven&#xA;FRED blog post on Measuring expected inflation with breakevens&#xA;Brad DeLong blog post on The Bond Market Thinks a Possible Return to Secular Stagnation Has Just Entered the Chat]]&gt;</description>
<content:encoded><![CDATA[<p>Central banks in recent years have paid close attention to “inflation expectations” when setting monetary policy. But measuring these expectations is not always straightforward. Expectations can vary significantly depending on whether they are drawn from businesses, households, or professional forecasters. One of the more practical methods for gauging inflation expectations is by analysing bond market data. This post provides an overview of how this works.</p>

<p>Bond market-based measures of inflation expectations are known as “breakevens”. These breakeven rates offer a snapshot of what investors anticipate inflation will be over a specific timeframe. To calculate a breakeven rate, we compare the yield of an inflation-protected government bond (such as <a href="https://en.wikipedia.org/wiki/United_States_Treasury_security#TIPS">US TIPS</a>) with a nominal government bond of the same maturity. The difference between these two yields represents the breakeven inflation rate, or the rate at which an investor would earn the same return whether they bought an inflation-protected bond or a nominal bond.</p>

<h2 id="examples">Examples</h2>

<p>Using data on 5-year US Treasuries from the <a href="https://fred.stlouisfed.org/">St. Louis Fed&#39;s FRED database</a>, we calculate the breakeven rate as follows:</p>

<p>5-Year Breakeven Inflation Rate (<a href="https://fred.stlouisfed.org/series/T5YIE">T5YIE</a>) = Yield on 5-year US Treasury (<a href="https://fred.stlouisfed.org/series/DGS5">DGS5</a>) – Yield on 5-year US Treasury Inflation-Indexed (<a href="https://fred.stlouisfed.org/series/DFII5">DFII5</a>)</p>

<p>As of 8 August 2024, the yield on a 5-year US Treasury is 3.83%, while the yield on an inflation-indexed 5-year US Treasury stands at 1.85%. This translates to a breakeven rate of 1.98% (3.83% – 1.85%).</p>

<p><a href="https://fred.stlouisfed.org/graph/?g=JlHQ"><img src="https://media.portblue.net/resources/240811_breakeven-inflation/breakeven5yr.png" alt="breakeven5yr.png"/></a></p>

<p>Let&#39;s repeat this for 10-year US Treasuries:</p>

<p>10-Year Breakeven Inflation Rate (<a href="https://fred.stlouisfed.org/series/T10YIE">T10YIE</a>) = Yield on 10-year US Treasury (<a href="https://fred.stlouisfed.org/series/DGS10">DGS10</a>) – Yield on 10-year US Treasury Inflation-Indexed (<a href="https://fred.stlouisfed.org/series/DFII10">DFII10</a>)</p>

<p>As of 8 August 2024, the yield on a 10-year US Treasury is 3.99%, while the yield on an inflation-indexed 10-year US Treasury is 1.87%. This yields a breakeven rate of 2.12% (3.99% – 1.87%).</p>

<p><a href="https://fred.stlouisfed.org/graph/?g=1rNRy"><img src="https://media.portblue.net/resources/240811_breakeven-inflation/breakeven10yr.png" alt="breakeven10yr.png"/></a></p>
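The calculation above is simple enough to express directly in code. This is a minimal sketch: the yields are the FRED values quoted in the text, and the function name is my own.

```python
def breakeven_rate(nominal_yield, tips_yield):
    """Breakeven inflation rate: the nominal Treasury yield minus the
    same-maturity inflation-indexed (TIPS) yield, both in percent."""
    return nominal_yield - tips_yield

# Yields as of 8 August 2024, in percent (FRED series DGS5/DFII5 and DGS10/DFII10).
print(round(breakeven_rate(3.83, 1.85), 2))  # 5-year breakeven: 1.98
print(round(breakeven_rate(3.99, 1.87), 2))  # 10-year breakeven: 2.12
```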

<h2 id="links">Links</h2>
<ul><li>FRED data for US breakeven inflation rates: <a href="https://fred.stlouisfed.org/series/T5YIE">5-year breakeven</a>, <a href="https://fred.stlouisfed.org/series/T7YIEM">7-year breakeven</a>, <a href="https://fred.stlouisfed.org/series/T10YIE">10-year breakeven</a>, <a href="https://fred.stlouisfed.org/series/T20YIEM">20-year breakeven</a>, and <a href="https://fred.stlouisfed.org/series/T30YIEM">30-year breakeven</a></li>
<li>FRED blog post on <a href="https://fredblog.stlouisfed.org/2021/12/measuring-expected-inflation-with-breakevens/">Measuring expected inflation with breakevens</a></li>
<li>Brad DeLong blog post on <a href="https://braddelong.substack.com/p/the-bond-market-thinks-a-possible">The Bond Market Thinks a Possible Return to Secular Stagnation Has Just Entered the Chat</a></li></ul>
]]></content:encoded>
      <guid>https://stefan.angrick.me/measuring-expected-inflation-with-breakevens</guid>
      <pubDate>Sun, 11 Aug 2024 08:36:23 +0000</pubDate>
    </item>
    <item>
      <title>Estimating the Scale of Japanese Foreign Exchange Intervention</title>
      <link>https://stefan.angrick.me/estimating-the-scale-of-japanese-foreign-exchange-intervention?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[Japanese FX intervention has been in the headlines lately, as the yen&#39;s depreciation has prompted Japanese authorities to step into the market multiple times over the past two years. According to media reports, the Ministry of Finance and the Bank of Japan likely intervened this past Thursday, 11 July, which saw the yen surge from nearly 162 JPY/USD to almost 157 JPY/USD. Official confirmation of the intervention won&#39;t arrive until 31 July, but we can estimate the scale of the intervention using BOJ accounts and money market broker forecasts.px.gif!--more--&#xA;&#xA;Pinpoint the relevant dates&#xA;&#xA;Spot FX interventions settle two business days after the transaction date. If the authorities intervened late on Thursday, 11 July, the corresponding settlement date would be Tuesday, 16 July, since Monday, 15 July, was a public holiday in Japan.&#xA;&#xA;Check money broker forecasts&#xA;&#xA;Next, check money broker forecasts for interbank liquidity due to fiscal factors (財政) on the settlement date. University of Tokyo Professor Hattori Takahiro provides a summary of projections from the three major money broker firms:&#xA;&#xA;Ueda Yagi: 200 billion JPY&#xA;Central Tanshi: 400 billion JPY&#xA;Tokyo Tanshi: 500 billion JPY&#xA;&#xA;Note that these figures are typically expressed in hundreds of millions (億円). Averaging across all three forecasts, we get a surplus of 367 billion JPY.&#xA;&#xA;Check BOJ figures&#xA;&#xA;Now check the BOJ&#39;s projections for changes in interbank liquidity due to fiscal factors (財政等要因) on the settlement date. The BOJ publishes this data in three stages: a forecast one business day in advance, a tentative figure on the day itself, and a revised figure the following business day. 
The current forecast for 16 July shows a 3,170 billion JPY deficit, as reported here.&#xA;&#xA;Calculate the difference&#xA;&#xA;To estimate the amount involved in the intervention, calculate:&#xA;&#xA;BOJ figure - Money broker forecasts = Estimated scale of intervention&#xA;&#xA;Using the available data, we get:&#xA;&#xA;-3,170 - 367 = -3,537 billion JPY&#xA;&#xA;This suggests the MOF and BOJ likely spent about 3.5 trillion JPY to support the yen—roughly 22 billion USD. This estimate aligns with reports by Bloomberg, Nikkei, and Professor Hattori.&#xA;&#xA;Update (10 August 2024): The latest BOJ data, both tentative and revised, now show a deficit of -2,960 billion JPY. Based on this update, the estimated intervention falls to -3,327 billion JPY.&#xA;&#xA;Update (19 April 2025): The latest official MOF data confirming the intervention shows that the government sold USD and bought JPY to the tune of 3,167.8 billion JPY, which is close to our estimate.&#xA;&#xA;Summary&#xA;&#xA;| FX intervention                                                                                                                           |            |&#xA;| --------------------------------------------------------------------------------------------------------------------------------------------- | ---------- |&#xA;| Intervention and settlement dates                                                                                                         | Dates  |&#xA;| Intervention date                                                                                                                             | 2024/07/11 |&#xA;| Settlement date                                                                                                                               | 2024/07/16 |&#xA;| (1) Money broker forecasts for settlement date                                                                                            | JPY bn |&#xA;| Ueda Tanshi                                     
                                           | 200.0      |&#xA;| Central Tanshi                                                 | 400.0      |&#xA;| Tokyo Tanshi                                                         | 400.0      |&#xA;| Average                                                                                                                                       | 333.3      |&#xA;| (2) BoJ figures for settlement date                                                                                                       | JPY bn |&#xA;| Forecast                                                                                | -3,170.0   |&#xA;| Preliminary                                                                             | -2,960.0   |&#xA;| Revised                                                                                 | -2,960.0   |&#xA;| Estimated FX intervention = (1) – (2)                                                                                                     | JPY bn |&#xA;| Based on BoJ forecast                                                                                                                         | 3,503.3    |&#xA;| Based on BoJ preliminary                                                                                                                      | 3,293.3    |&#xA;| Based on BoJ revised                                                                                                                          | 3,293.3    |&#xA;| Reported FX intervention                                                                                                                  | JPY bn |&#xA;| JPY bought, USD sold | 3,167.8    |&#xA;&#xA;Links&#xA;&#xA;Official MOF data reporting FX intervention can be found here or here (Japanese)&#xA;The release schedule for official MOF reports can be found here (Japanese)&#xA;The BOJ reports sources of changes in its current account balances here (Japanese)&#xA;Daily 
Ueda Yagi money market broker reports can be found here (Japanese)&#xA;Daily Central Tanshi money market broker reports can be found here (Japanese)&#xA;Daily Tokyo Tanshi money market broker reports can be found here (Japanese)&#xA;Asahi TV post on X/Twitter highlighting the suspected intervention&#xA;A Bloomberg report from 12 July quoting Tokyo Tanshi forecast as 400 billion JPY]]&gt;</description>
<content:encoded><![CDATA[<p>Japanese FX intervention has been in the headlines lately, as the yen&#39;s depreciation has prompted Japanese authorities to step into the market multiple times over the past two years. According to media reports, the Ministry of Finance and the Bank of Japan likely intervened this past Thursday, 11 July, which saw the yen surge from nearly 162 JPY/USD to almost 157 JPY/USD. Official confirmation of the intervention won&#39;t arrive until 31 July, but we can estimate its scale using BOJ accounts and money market broker forecasts.</p>

<h2 id="pinpoint-the-relevant-dates">Pinpoint the relevant dates</h2>

<p>Spot FX interventions settle two business days after the transaction date. If the authorities intervened late on Thursday, 11 July, the corresponding settlement date would be <strong>Tuesday, 16 July</strong>, since Monday, 15 July, was a public holiday in Japan.</p>
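<p>The T+2 logic can be sketched as a roll-forward that skips weekends and holidays. Note the holiday set below is a hypothetical stand-in containing only the date relevant here; a real implementation would load a full Japanese holiday calendar:</p>

```python
from datetime import date, timedelta

# Hypothetical minimal holiday set; only the holiday relevant to this example.
JP_HOLIDAYS = {date(2024, 7, 15)}  # Marine Day

def is_business_day(d: date) -> bool:
    return d.weekday() < 5 and d not in JP_HOLIDAYS

def settlement_date(trade_date: date, lag: int = 2) -> date:
    """Roll forward `lag` business days from the trade date (T+2 for spot FX)."""
    d = trade_date
    while lag > 0:
        d += timedelta(days=1)
        if is_business_day(d):
            lag -= 1
    return d

print(settlement_date(date(2024, 7, 11)))  # → 2024-07-16
```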

<h2 id="check-money-broker-forecasts">Check money broker forecasts</h2>

<p>Next, check money broker forecasts for interbank liquidity due to fiscal factors (財政) on the settlement date. University of Tokyo Professor Hattori Takahiro provides a <a href="https://note.com/hattori0819/n/ncac8722b6d05">summary</a> of projections from the three major money broker firms:</p>
<ul><li><a href="https://www.uedayagi.com/dailysignal/2024-7-11/">Ueda Yagi</a>: 200 billion JPY</li>
<li><a href="https://www.central-tanshi.com/market/marketpdf/daily/centdaily20240711.pdf">Central Tanshi</a>: 400 billion JPY</li>
<li><a href="https://www.tokyotanshi.co.jp/archives/market_report/daily/2024-07-11">Tokyo Tanshi</a>: 500 billion JPY</li></ul>

<p>Note that these figures are typically expressed in units of 100 million JPY (億円) in the original reports. Averaging across all three forecasts, we get a surplus of 367 billion JPY.</p>
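<p>As a sketch of the unit handling and the averaging step, with the broker figures as they would appear in the daily reports, in 億円 (units of 100 million JPY):</p>

```python
OKU_TO_BN = 0.1  # 1 億円 = 100 million JPY = 0.1 billion JPY

# Forecast surpluses due to fiscal factors on the settlement date, in 億円.
forecasts_oku = {"Ueda Yagi": 2_000, "Central Tanshi": 4_000, "Tokyo Tanshi": 5_000}

forecasts_bn = {name: oku * OKU_TO_BN for name, oku in forecasts_oku.items()}
broker_average_bn = sum(forecasts_bn.values()) / len(forecasts_bn)
print(f"Average surplus: {broker_average_bn:.0f} billion JPY")  # → 367
```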

<h2 id="check-boj-figures">Check BOJ figures</h2>

<p>Now check the BOJ&#39;s projections for changes in interbank liquidity due to fiscal factors (財政等要因) on the settlement date. The BOJ <a href="https://www3.boj.or.jp/market/jp/menu.htm">publishes this data</a> in three stages: a forecast one business day in advance, a tentative figure on the day itself, and a revised figure the following business day. The current forecast for 16 July shows a 3,170 billion JPY deficit, as reported <a href="https://www3.boj.or.jp/market/jp/stat/jp240716.htm">here</a>.</p>

<h2 id="calculate-the-difference">Calculate the difference</h2>

<p>To estimate the amount involved in the intervention, calculate:</p>

<p>BOJ figure – Money broker forecasts = Estimated scale of intervention</p>

<p>Using the available data, we get:</p>

<p>-3,170 – 367 = -3,537 billion JPY</p>

<p>This suggests the MOF and BOJ likely spent about 3.5 trillion JPY to support the yen—roughly 22 billion USD. This estimate aligns with reports by <a href="https://www.bloomberg.com/news/articles/2024-07-12/BOJ-accounts-suggest-japan-intervened-to-boost-yen-after-us-cpi">Bloomberg</a>, <a href="https://www.nikkei.com/article/DGXZQOUB123PE0S4A710C2000000/">Nikkei</a>, and <a href="https://note.com/hattori0819/n/n5aa715ca5666">Professor Hattori</a>.</p>
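<p>The estimation step itself, using the figures above (sign convention as in the post: the BOJ deficit is negative, so a large negative result indicates JPY drained from the interbank market, i.e. JPY bought):</p>

```python
boj_fiscal_factors_bn = -3_170.0  # BOJ forecast for 16 July, billion JPY (deficit)
broker_average_bn = 367.0         # average money broker forecast, billion JPY (surplus)

# Estimated FX intervention = BOJ figure - broker forecasts: the part of the
# liquidity drain that ordinary fiscal flows cannot explain is attributed
# to intervention.
estimated_bn = boj_fiscal_factors_bn - broker_average_bn
print(f"Estimated intervention: {estimated_bn:,.0f} billion JPY")  # → -3,537
```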

<p><em>Update (10 August 2024): The <a href="https://www3.boj.or.jp/market/jp/stat/jd240716.htm">latest BOJ data</a>, both tentative and revised, now show a deficit of -2,960 billion JPY. Based on this update, the estimated intervention falls to -3,327 billion JPY.</em></p>

<p><em>Update (19 April 2025): The <a href="https://www.mof.go.jp/english/policy/international_policy/reference/feio/foreign_exchange_intervention_operations.csv">latest official MOF data</a> confirming the intervention shows that the government sold USD and bought JPY to the tune of 3,167.8 billion JPY, which is close to our estimate.</em></p>

<h2 id="summary">Summary</h2>

<table>
<thead>
<tr>
<th><strong>FX intervention</strong></th>
<th></th>
</tr>
</thead>

<tbody>
<tr>
<td><strong>Intervention and settlement dates</strong></td>
<td><strong>Dates</strong></td>
</tr>

<tr>
<td>Intervention date</td>
<td>2024/07/11</td>
</tr>

<tr>
<td>Settlement date</td>
<td>2024/07/16</td>
</tr>

<tr>
<td><strong>(1) Money broker forecasts for settlement date</strong></td>
<td><strong>JPY bn</strong></td>
</tr>

<tr>
<td><a href="https://www.uedayagi.com/dailysignal/2024-7-11/">Ueda Yagi</a></td>
<td>200.0</td>
</tr>

<tr>
<td><a href="https://www.central-tanshi.com/market/marketpdf/daily/centdaily20240711.pdf">Central Tanshi</a></td>
<td>400.0</td>
</tr>

<tr>
<td><a href="https://www.tokyotanshi.co.jp/archives/market_report/daily/2024-07-11">Tokyo Tanshi</a></td>
<td>400.0</td>
</tr>

<tr>
<td>Average</td>
<td>333.3</td>
</tr>

<tr>
<td><strong>(2) BOJ figures for settlement date</strong></td>
<td><strong>JPY bn</strong></td>
</tr>

<tr>
<td><a href="https://www3.boj.or.jp/market/jp/stat/jd240716.htm">Forecast</a></td>
<td>-3,170.0</td>
</tr>

<tr>
<td><a href="https://www3.boj.or.jp/market/jp/stat/jd240716.htm">Preliminary</a></td>
<td>-2,960.0</td>
</tr>

<tr>
<td><a href="https://www3.boj.or.jp/market/jp/stat/jd240716.htm">Revised</a></td>
<td>-2,960.0</td>
</tr>

<tr>
<td><strong>Estimated FX intervention = (1) – (2)</strong></td>
<td><strong>JPY bn</strong></td>
</tr>

<tr>
<td>Based on BOJ forecast</td>
<td>3,503.3</td>
</tr>

<tr>
<td>Based on BOJ preliminary</td>
<td>3,293.3</td>
</tr>

<tr>
<td>Based on BOJ revised</td>
<td>3,293.3</td>
</tr>

<tr>
<td><strong>Reported FX intervention</strong></td>
<td><strong>JPY bn</strong></td>
</tr>

<tr>
<td><a href="https://www.mof.go.jp/english/policy/international_policy/reference/feio/foreign_exchange_intervention_operations.csv">JPY bought, USD sold</a></td>
<td>3,167.8</td>
</tr>
</tbody>
</table>

<h2 id="links">Links</h2>
<ul><li>Official MOF data reporting FX intervention can be found <a href="https://www.mof.go.jp/english/policy/international_policy/reference/feio/index.html">here</a> or <a href="https://www.mof.go.jp/policy/international_policy/reference/feio/data/index.html">here (Japanese)</a></li>
<li>The release schedule for official MOF reports can be found <a href="https://www.mof.go.jp/policy/international_policy/reference/feio/index.html">here (Japanese)</a></li>
<li>The BOJ reports sources of changes in its current account balances <a href="https://www3.boj.or.jp/market/jp/menu.htm">here (Japanese)</a></li>
<li>Daily Ueda Yagi money market broker reports can be found <a href="https://www.uedayagi.com/dailysignal/">here (Japanese)</a></li>
<li>Daily Central Tanshi money market broker reports can be found <a href="https://www.central-tanshi.com/market/short-termdailypast.html">here (Japanese)</a></li>
<li>Daily Tokyo Tanshi money market broker reports can be found <a href="https://www.tokyotanshi.co.jp/market_report/daily.html">here (Japanese)</a></li>
<li><a href="https://x.com/tv_asahi_news/status/1811394631265120301">Asahi TV post on X/Twitter</a> highlighting the suspected intervention</li>
<li>A <a href="https://www.bloomberg.co.jp/news/articles/2024-07-12/SGHHO4T1UM0W00">Bloomberg report from 12 July</a> quoting Tokyo Tanshi&#39;s forecast as 400 billion JPY</li></ul>
]]></content:encoded>
      <guid>https://stefan.angrick.me/estimating-the-scale-of-japanese-foreign-exchange-intervention</guid>
      <pubDate>Mon, 15 Jul 2024 07:00:53 +0000</pubDate>
    </item>
  </channel>
</rss>