Google ML Crash Course #4 Notes: Real-World ML
This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This fourth module covers critical considerations when building and deploying ML models in the real world, including productionisation best practices, automation, and responsible engineering.
Production ML systems
Introduction
The model is only a small part of real-world production ML systems. It often represents only 5% or less of the total codebase in the system.

Source: Production ML systems | Machine Learning | Google for Developers
Static versus dynamic training
Machine learning models can be trained statically (once) or dynamically (continuously).
| | Static training (offline training) | Dynamic training (online training) |
|---|---|---|
| Advantages | Simpler. You only need to develop and test the model once. | More adaptable. Keeps up with changes in data patterns, providing more accurate predictions. |
| Disadvantages | Sometimes stale. Can become outdated if data patterns change, requiring data monitoring. | More work. You must build, test, and release a new product continuously. |
Choosing between static and dynamic training depends on the specific dataset and how frequently it changes.
Monitoring input data is essential for both static and dynamic training to ensure reliable predictions.
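To make the distinction concrete, here is a minimal sketch contrasting the two approaches, using scikit-learn's SGDClassifier and synthetic data (the library and data are illustrative assumptions, not prescribed by the course):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Static (offline) training: fit once on the full historical dataset.
X_hist = rng.normal(size=(1000, 4))
y_hist = rng.integers(0, 2, size=1000)
static_model = SGDClassifier(random_state=0).fit(X_hist, y_hist)

# Dynamic (online) training: keep updating the same model as new data arrives.
dynamic_model = SGDClassifier(random_state=0)
dynamic_model.partial_fit(X_hist, y_hist, classes=np.array([0, 1]))
for _ in range(10):  # e.g. one new batch per day
    X_new = rng.normal(size=(100, 4))
    y_new = rng.integers(0, 2, size=100)
    dynamic_model.partial_fit(X_new, y_new)
```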
Source: Production ML systems: Static versus dynamic training | Machine Learning | Google for Developers
Static versus dynamic inference
Inference involves using a trained model to make predictions on unlabelled examples, and it can be done as follows:
Static inference (offline inference, batch inference) generates predictions in advance and caches them, which suits scenarios where the possible inputs are known ahead of time and serving only needs a fast cache lookup.
Dynamic inference (online inference, real-time inference) generates predictions on demand, offering flexibility for diverse inputs.
| | Static inference (offline inference, batch inference) | Dynamic inference (online inference, real-time inference) |
|---|---|---|
| Advantages | No need to worry about cost of inference; allows post-verification of predictions before pushing | Can infer a prediction on any new item as it comes in |
| Disadvantages | Limited ability to handle uncommon inputs | Compute-intensive and latency-sensitive; monitoring needs are intensive |
Choosing between static and dynamic inference depends on factors such as model complexity, desired prediction speed, and the nature of the input data.
Static inference is advantageous when cost and prediction verification are prioritised, while dynamic inference excels in handling diverse, real-time predictions.
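A minimal sketch of the two serving patterns, using a scikit-learn model and hypothetical item IDs (both are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 3))
y_train = rng.integers(0, 2, size=500)
model = LogisticRegression().fit(X_train, y_train)

# Static (batch) inference: predict for every known item ahead of time and cache.
known_items = {f"item_{i}": rng.normal(size=3) for i in range(100)}
prediction_cache = {
    item_id: model.predict_proba(features.reshape(1, -1))[0, 1]
    for item_id, features in known_items.items()
}
print(prediction_cache["item_0"])  # serving is just a cheap cache lookup

# Dynamic (online) inference: run the model on demand for an item not seen before.
new_features = rng.normal(size=3)
print(model.predict_proba(new_features.reshape(1, -1))[0, 1])
```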
Source: Production ML systems: Static versus dynamic inference | Machine Learning | Google for Developers
When to transform data?
Feature engineering can be performed before or during model training, each with its own advantages and disadvantages.
- Transforming data before training allows for a one-time transformation of the entire dataset but requires careful recreation of transformations during prediction to avoid training-serving skew.
- Transforming data during training ensures consistency between training and prediction but can increase model latency and complicate batch processing (see the sketch after this list).
- When transforming data during training, be aware that per-batch statistics can vary; for example, Z-score normalisation computed within each batch may differ from batch to batch when the batches have different distributions.
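One common way to transform data during training while keeping training and serving consistent is to put the transformation inside the model itself. A minimal Keras sketch, assuming a Normalization layer handles the Z-score step on synthetic data:

```python
import numpy as np
import tensorflow as tf

X_train = np.random.normal(loc=50.0, scale=10.0, size=(1000, 3)).astype("float32")
y_train = np.random.randint(0, 2, size=(1000,)).astype("float32")

# The normalisation layer learns the per-feature mean and variance from the
# training data...
normalizer = tf.keras.layers.Normalization()
normalizer.adapt(X_train)

# ...and lives inside the model, so serving applies exactly the same transform,
# avoiding training-serving skew.
model = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X_train, y_train, epochs=2, verbose=0)
```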
Source: Production ML systems: When to transform data? | Machine Learning | Google for Developers
Deployment testing
Deploying a machine learning model involves validating data, features, model versions, serving infrastructure, and pipeline integration.
Reproducible model training involves deterministic seeding, fixed initialisation order, averaging multiple runs, and using version control.
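A minimal sketch of the deterministic-seeding part only (the seed value and the specific frameworks seeded are illustrative assumptions):

```python
import random
import numpy as np
import tensorflow as tf

SEED = 42

# Seed every source of randomness the pipeline touches so that
# repeated training runs produce the same results.
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
```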
Integration tests ensure that the different components of the ML pipeline work together and should run both continuously and before pushing a new model or software version.
Before serving a new model, validate its quality by checking for sudden degradation against the previous version and for gradual degradation against a fixed threshold.
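Such checks can be expressed as simple assertions that gate deployment. A sketch with hypothetical AUC values and thresholds:

```python
def validate_candidate(candidate_auc: float, previous_auc: float,
                       fixed_threshold: float = 0.80,
                       max_allowed_drop: float = 0.01) -> bool:
    """Reject the candidate on sudden degradation relative to the previous
    version, or on gradual degradation below a fixed quality threshold."""
    no_sudden_drop = candidate_auc >= previous_auc - max_allowed_drop
    above_threshold = candidate_auc >= fixed_threshold
    return no_sudden_drop and above_threshold

# Hypothetical values: the candidate slightly improves on the previous model.
assert validate_candidate(candidate_auc=0.86, previous_auc=0.85)
```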
Ensure model-infrastructure compatibility by staging the model in a sandboxed server environment to avoid dependency conflicts.
Source: Production ML systems: Deployment testing | Machine Learning | Google for Developers
Monitoring pipelines
ML pipeline monitoring involves validating data (using data schemas) and features (using unit tests), tracking real-world metrics, and addressing potential biases in data slices.
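A data schema check can be as simple as asserting expected columns, types, and ranges before data enters the pipeline. A minimal sketch with hypothetical feature names:

```python
import pandas as pd

def validate_schema(df: pd.DataFrame) -> None:
    """Raise if incoming data violates the expected schema (hypothetical features)."""
    assert set(df.columns) >= {"age", "country"}, "missing required columns"
    assert df["age"].notna().all(), "unexpected missing values in age"
    assert df["age"].between(0, 120).all(), "age out of expected range"
    assert pd.api.types.is_string_dtype(df["country"]), "country should be a string column"

validate_schema(pd.DataFrame({"age": [34, 58], "country": ["NZ", "UK"]}))
```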
Monitoring training-serving skew, label leakage, model age, and numerical stability is crucial for maintaining pipeline health and model performance.
- Training-serving skew means that input data during training differs from input data during serving, for example because training and serving data use different schemas (schema skew) or because engineered data differs between training and serving (feature skew).
- Label leakage means that the ground truth labels being predicted have inadvertently entered the training features.
- Numerical stability involves writing tests to check for NaN and Inf values in weights and layer outputs, and testing that more than half of the outputs of a layer are not zero.
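A minimal sketch of such numerical stability checks, assuming a Keras model (for brevity, only the final output is inspected; intermediate layer outputs can be checked the same way):

```python
import numpy as np
import tensorflow as tf

def check_numerical_stability(model: tf.keras.Model, batch: np.ndarray) -> None:
    # Model weights should contain no NaN or Inf values.
    for weight in model.get_weights():
        assert np.all(np.isfinite(weight)), "non-finite value found in weights"

    # Outputs should contain no NaN/Inf, and more than half should be non-zero.
    outputs = model(batch).numpy()
    assert np.all(np.isfinite(outputs)), "non-finite value found in outputs"
    assert np.mean(outputs != 0) > 0.5, "half or more of the outputs are zero"

model = tf.keras.Sequential([tf.keras.Input(shape=(3,)), tf.keras.layers.Dense(4)])
check_numerical_stability(model, np.random.normal(size=(8, 3)).astype("float32"))
```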
Live model quality testing uses methods such as human labelling and statistical analysis to ensure ongoing model effectiveness in real-world scenarios.
Implementing proper randomisation through deterministic data generation enables reproducible experiments and consistent analysis.
Maintaining invariant hashing ensures that data splits remain consistent across experiments, contributing to reliable analysis and model evaluation.
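A common way to get an invariant split is to hash a stable identifier for each example rather than drawing random numbers. A minimal sketch (the ID format and split fraction are illustrative):

```python
import hashlib

def assign_split(example_id: str, test_fraction: float = 0.2) -> str:
    """Hash the example's stable ID so its train/test assignment never changes,
    regardless of when or how often the dataset is regenerated."""
    digest = int(hashlib.sha256(example_id.encode("utf-8")).hexdigest(), 16)
    return "test" if (digest % 100) < test_fraction * 100 else "train"

print(assign_split("user_12345"))  # always the same answer for the same ID
```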
Source: Production ML systems: Monitoring pipelines | Machine Learning | Google for Developers
Questions to ask
Continuously monitor models in production to evaluate feature importance and potentially remove unnecessary features, ensuring prediction quality and resource efficiency.
- Regularly assess whether features are genuinely helpful and whether their value outweighs the cost of inclusion.
Data reliability is crucial. Consider data source stability, potential changes in upstream data processes, and the creation of local data copies to control versioning and mitigate risks.
Be aware of feedback loops, where a model's predictions influence future input data, potentially leading to unexpected behaviour or biased outcomes, especially in interconnected systems.
Source: Production ML systems: Questions to ask | Machine Learning | Google for Developers
Automated machine learning
Introduction
AutoML automates tasks in the machine learning workflow, such as data engineering (feature selection and engineering), training (algorithm selection and hyperparameter tuning), and analysis, making model building faster and easier.

While manual training involves writing code and iteratively adjusting it, AutoML reduces repetitive work and the need for specialised skills.
Source: Automated Machine Learning (AutoML) | Google for Developers
Benefits and limitations
Benefits:
- To save time.
- To improve the quality of an ML model.
- To build an ML model without needing specialised skills.
- To smoke test a dataset. AutoML can give quick baseline estimates of whether a dataset has enough signal relative to noise.
- To evaluate a dataset. AutoML can help determine which features may be worth using.
- To enforce best practices. Automation includes built-in support for applying ML best practices.
Limitations:
- Model quality may not match that of manual training.
- Model search and complexity can be opaque. Models generated with AutoML are difficult to reproduce manually.
- Multiple AutoML runs may show greater variance.
- Models cannot be customised during training.
Large amounts of data are generally required for AutoML, although specialised systems using transfer learning (taking a model trained on one task and adapting its learned representations to a different but related task) can reduce this requirement.
AutoML suits teams with limited ML experience or those seeking productivity gains without customisation needs. Custom (manual) training suits cases where model quality and customisation matter most.
Source: AutoML: Benefits and limitations | Machine Learning | Google for Developers
Getting started
AutoML tools fall into two categories:
- Tools that require no coding.
- API and CLI tools.
The AutoML workflow follows steps similar to traditional machine learning, including problem definition, data gathering, preparation, model development, evaluation, and potential retraining.
- Some AutoML systems also support model deployment.
Data preparation is crucial for AutoML and involves labelling, cleaning and formatting data, and applying feature transformations.
No-code AutoML tools guide users through model development with steps such as data import, analysis, refinement, and configuration of run parameters before starting the automated training process.
- Users still need to carry out semantic checks to select the appropriate semantic type for each feature (for example recognising that postal codes are categorical rather than numeric), and to set transformations accordingly.
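For example, a numeric-looking postal code column can be explicitly recast as categorical before handing the data to an AutoML tool. A minimal pandas sketch with hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({"postal_code": [1010, 6011, 8011],
                   "price": [870_000, 650_000, 540_000]})

# Postal codes look numeric but are really categorical: treat them as strings
# so the model does not learn a spurious ordering between codes.
df["postal_code"] = df["postal_code"].astype(str).astype("category")
print(df.dtypes)
```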
Source: AutoML: Getting started | Machine Learning | Google for Developers
Fairness
Introduction
Before putting a model into production, it is critical to audit training data and evaluate predictions for bias.
Source: Fairness | Machine Learning | Google for Developers
Types of bias
Machine learning models can be susceptible to bias due to human involvement in data selection and curation.
Understanding common human biases is crucial for mitigating their impact on model predictions.
Types of bias include reporting bias, historical bias, automation bias, selection bias, coverage bias, non-response bias, sampling bias, group attribution bias (in-group bias and out-group homogeneity bias), implicit bias, confirmation bias, and experimenter's bias, among others.
Source: Fairness: Types of bias | Machine Learning | Google for Developers
Identifying bias
Missing or unexpected feature values in a dataset can indicate potential sources of bias.
Data skew, where certain groups are under- or over-represented, can introduce bias and should be addressed.
Evaluating model performance by subgroup ensures fairness and equal performance across different characteristics.
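A minimal sketch of sliced evaluation, computing metrics per subgroup from hypothetical predictions rather than only in aggregate:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical evaluation results with a sensitive attribute per example.
results = pd.DataFrame({
    "group":      ["a", "a", "a", "b", "b", "b"],
    "label":      [1, 0, 1, 1, 0, 1],
    "prediction": [1, 0, 1, 0, 0, 1],
})

# Compute metrics per subgroup rather than only over the full dataset.
for group, slice_ in results.groupby("group"):
    acc = accuracy_score(slice_["label"], slice_["prediction"])
    rec = recall_score(slice_["label"], slice_["prediction"])
    print(f"group={group}: accuracy={acc:.2f}, recall={rec:.2f}")
```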
Source: Fairness: Identifying bias | Machine Learning | Google for Developers
Mitigating bias
Machine learning engineers use two primary strategies to mitigate bias in models:
- Augmenting training data.
- Adjusting the model's loss function.
Augmenting training data involves collecting additional data to address missing, incorrect, or skewed data, but it can be infeasible due to data availability or resource constraints.
Adjusting the model's loss function involves using fairness-aware optimisation functions rather than the common default log loss.
The TensorFlow Model Remediation Library provides optimisation functions designed to penalise errors in a fairness-aware manner:
- MinDiff aims to balance errors between different data slices by penalising differences in prediction distributions.
- Counterfactual Logit Pairing (CLP) penalises discrepancies in predictions for similar examples with different sensitive attribute values.
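The library's actual API is more involved; purely as a simplified illustration of the MinDiff idea (not the library's implementation), a loss can combine ordinary log loss with a penalty on the gap between the prediction distributions of two slices:

```python
import tensorflow as tf

def min_diff_style_loss(y_true, y_pred, slice_mask, penalty_weight=1.5):
    """Log loss plus a penalty on the gap between mean predictions of two slices.

    slice_mask is 1.0 for examples in the sensitive slice and 0.0 otherwise.
    (Simplified illustration only; the real MinDiff loss compares the two
    prediction distributions with a kernel-based measure.)
    """
    log_loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))
    slice_mean = tf.reduce_sum(y_pred * slice_mask) / (tf.reduce_sum(slice_mask) + 1e-8)
    rest_mean = tf.reduce_sum(y_pred * (1.0 - slice_mask)) / (tf.reduce_sum(1.0 - slice_mask) + 1e-8)
    penalty = tf.abs(slice_mean - rest_mean)
    return log_loss + penalty_weight * penalty

# Hypothetical batch: predictions for the sensitive slice skew lower.
y_true = tf.constant([1.0, 0.0, 1.0, 1.0])
y_pred = tf.constant([0.9, 0.2, 0.4, 0.3])
mask   = tf.constant([0.0, 0.0, 1.0, 1.0])
print(min_diff_style_loss(y_true, y_pred, mask).numpy())
```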
Source: Fairness: Mitigating bias | Machine Learning | Google for Developers
Evaluating for bias
Aggregate model performance metrics such as precision, recall, and accuracy can hide biases against minority groups.
Fairness in model evaluation involves ensuring equitable outcomes across different demographic groups.
Fairness metrics can help assess model predictions for bias.
- Demographic parity
- Equality of opportunity
- Counterfactual fairness
Candidate pool of 100 students: 80 students belong to the majority group (blue), and 20 students belong to the minority group (orange):

Source: Fairness: Evaluating for bias | Machine Learning | Google for Developers
Demographic parity
Demographic parity aims to ensure equal acceptance rates for majority and minority groups, regardless of individual qualifications.
Both the majority (blue) and minority (orange) groups have an acceptance rate of 20%:

While demographic parity promotes equal representation, it can overlook differences in individual qualifications within each group, potentially leading to unfair outcomes.
Qualified students in both groups are shaded in green, and qualified students who were rejected are marked with an X:

Majority acceptance rate = Qualified majority accepted / Qualified majority = 16/35 = 46%
Minority acceptance rate = Qualified minority accepted / Qualified minority = 4/15 = 27%
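These rates can be reproduced directly from the example's counts; a small sketch using the numbers above:

```python
# Counts from the worked example: 80 majority / 20 minority candidates,
# with 20% of each group accepted under demographic parity.
qualified = {"majority": 35, "minority": 15}
qualified_accepted = {"majority": 16, "minority": 4}

for group in qualified:
    rate = qualified_accepted[group] / qualified[group]
    print(f"{group}: acceptance rate among qualified candidates = {rate:.0%}")
# majority: 46%, minority: 27% -- overall acceptance rates are equal,
# but qualified minority candidates are accepted at a much lower rate.
```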
When the distribution of a preferred label (“qualified”) differs substantially between groups, demographic parity may not be the most appropriate fairness metric.
There may be additional benefits/drawbacks of demographic parity not discussed here that are also worth considering.
Source: Fairness: Demographic parity | Machine Learning | Google for Developers
Equality of opportunity
Equality of opportunity focuses on ensuring that qualified individuals have an equal chance of acceptance, regardless of demographic group.
Qualified students in both groups are shaded in green:

Majority acceptance rate = Qualified majority accepted / Qualified majority = 14/35 = 40%
Minority acceptance rate = Qualified minority accepted / Qualified minority = 6/15 = 40%
Equality of opportunity has limitations, including reliance on a clearly defined preferred label and challenges in settings that lack demographic data.
It is possible for a model to satisfy both demographic parity and equality of opportunity under specific conditions where positive prediction rates and true positive rates align across groups.
Source: Fairness: Equality of opportunity | Machine Learning | Google for Developers
Counterfactual fairness
Counterfactual fairness evaluates fairness by comparing predictions for similar individuals who differ only in a sensitive attribute such as demographic group.
This metric is particularly useful when datasets lack complete demographic information for most examples but contain it for a subset.
Candidate pool, with demographic group membership unknown for most candidates (icons shaded in grey):

Counterfactual fairness may not capture broader systemic biases across subgroups. Other fairness metrics, such as demographic parity and equality of opportunity, provide a more holistic view but may require complete demographic data.
Summary
Selecting the appropriate fairness metric depends on the specific application and desired outcome, with no single “right” metric universally applicable.
For example, if the goal is to achieve equal representation, demographic parity may be the optimal metric. If the goal is to achieve equal opportunity, equality of opportunity may be the best metric.
Some definitions of fairness are mutually incompatible.
Source: Fairness: Counterfactual fairness | Machine Learning | Google for Developers