Google ML Crash Course #4 Notes: Real-World ML

This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This fourth module covers critical considerations when building and deploying ML models in the real world, including productionisation best practices, automation, and responsible engineering.

Production ML systems

Introduction

The model itself is only a small part of a real-world production ML system, often representing 5% or less of the total codebase.

MlSystem.png

Source: Production ML systems | Machine Learning | Google for Developers

Static versus dynamic training

Machine learning models can be trained statically (once) or dynamically (continuously).

|  | Static training (offline training) | Dynamic training (online training) |
| --- | --- | --- |
| Advantages | Simpler. You only need to develop and test the model once. | More adaptable. Keeps up with changes in data patterns, providing more accurate predictions. |
| Disadvantages | Sometimes stale. Can become outdated if data patterns change, requiring data monitoring. | More work. You must continuously build, test, and release new versions of the model. |

Choosing between static and dynamic training depends on the specific dataset and how frequently it changes.

Monitoring input data is essential for both static and dynamic training to ensure reliable predictions.

Source: Production ML systems: Static versus dynamic training | Machine Learning | Google for Developers

Static versus dynamic inference

Inference involves using a trained model to make predictions on unlabelled examples, and it can be done as follows:

|  | Static inference (offline/batch inference) | Dynamic inference (online/real-time inference) |
| --- | --- | --- |
| Advantages | Inference cost is paid once, offline, and predictions can be verified before being pushed. | Can produce a prediction for any new item as it arrives. |
| Disadvantages | Limited ability to handle uncommon inputs, since only precomputed predictions can be served. | Compute-intensive and latency-sensitive, with intensive monitoring needs. |

Choosing between static and dynamic inference depends on factors such as model complexity, desired prediction speed, and the nature of the input data.

Static inference is advantageous when cost and prediction verification are prioritised, while dynamic inference excels in handling diverse, real-time predictions.
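As a minimal sketch of the two serving modes (the `model` object and its scikit-learn-style `predict` method are hypothetical):

```python
# Static (batch) inference: precompute predictions for all known items
# offline, store them in a lookup table, and serve by key.
def build_prediction_table(model, items):
    """items: dict mapping item_id -> feature vector."""
    return {item_id: model.predict([features])[0]
            for item_id, features in items.items()}

def serve_static(prediction_table, item_id):
    # Returns None for items not seen at batch time: the
    # "limited ability to handle uncommon inputs" drawback.
    return prediction_table.get(item_id)

# Dynamic (online) inference: run the model on every request.
def serve_dynamic(model, features):
    # Handles any new item, but pays model latency on every request.
    return model.predict([features])[0]
```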

Source: Production ML systems: Static versus dynamic inference | Machine Learning | Google for Developers

When to transform data?

Feature engineering can be performed before training or within the model itself, each with its own trade-offs: transforming before training does the work once, but the same transformation must be reproduced exactly at prediction time, risking training-serving skew; transforming within the model rules out that skew, at the cost of extra computation on every training step and every prediction.
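A minimal sketch of both options (the framework choice and feature shapes are illustrative, not from the course):

```python
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(1000, 4).astype("float32")

# Option 1: transform before training. The work happens once, but the
# fitted scaler must be re-applied identically at prediction time, or
# the model will see skewed inputs (training-serving skew).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Option 2: transform inside the model. The transformation ships with
# the model, so skew is impossible, at the cost of running it on every
# training step and every prediction.
norm = tf.keras.layers.Normalization()
norm.adapt(X_train)  # learn mean and variance from the training data
model = tf.keras.Sequential([norm, tf.keras.layers.Dense(1)])
```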

Source: Production ML systems: When to transform data? | Machine Learning | Google for Developers

Deployment testing

Deploying a machine learning model involves validating data, features, model versions, serving infrastructure, and pipeline integration.

Reproducible model training involves deterministic seeding, fixed initialisation order, averaging multiple runs, and using version control.
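A minimal sketch of the seeding step (TensorFlow is an illustrative choice; other frameworks have equivalents):

```python
import os
import random

import numpy as np
import tensorflow as tf

# Seed every source of randomness the training job touches.
SEED = 42
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Where supported (TensorFlow >= 2.8), also force deterministic ops.
tf.config.experimental.enable_op_determinism()
```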

Integration tests ensure that the different components of the ML pipeline work together seamlessly; they should run both continuously and whenever a new model or software version is released.

Before serving a new model, validate its quality against both the previous version (to catch sudden degradation) and fixed quality thresholds (to catch gradual degradation accumulating across releases).
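A minimal sketch of such a validation gate (the metric name and thresholds are hypothetical):

```python
def validate_new_model(new_metrics, baseline_metrics,
                       fixed_threshold=0.85, max_regression=0.02):
    """Gate a candidate model before serving.

    Checks against a fixed quality bar and against the live baseline,
    so both gradual and sudden degradations are caught.
    """
    auc = new_metrics["auc"]
    if auc < fixed_threshold:
        return False, f"AUC {auc:.3f} is below the fixed threshold"
    if auc < baseline_metrics["auc"] - max_regression:
        return False, "AUC regressed too far against the previous version"
    return True, "ok"

ok, reason = validate_new_model({"auc": 0.91}, {"auc": 0.92})
print(ok, reason)  # True ok
```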

Ensure model-infrastructure compatibility by staging the model in a sandboxed server environment to avoid dependency conflicts.

Source: Production ML systems: Deployment testing | Machine Learning | Google for Developers

Monitoring pipelines

ML pipeline monitoring involves validating data (using data schemas) and features (using unit tests), tracking real-world metrics, and addressing potential biases in data slices.
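A minimal sketch of schema-based data validation (the schema itself is hypothetical; production pipelines often use a dedicated library such as TensorFlow Data Validation):

```python
# Expected type and range for each input feature.
SCHEMA = {
    "age": {"dtype": float, "min": 0.0, "max": 120.0},
    "country": {"dtype": str},
}

def validate_example(example: dict) -> None:
    """Raise if an example violates the data schema."""
    for name, spec in SCHEMA.items():
        if name not in example:
            raise ValueError(f"missing feature: {name}")
        value = example[name]
        if not isinstance(value, spec["dtype"]):
            raise TypeError(f"{name}: expected {spec['dtype'].__name__}")
        if "min" in spec and not (spec["min"] <= value <= spec["max"]):
            raise ValueError(f"{name}: {value} is out of range")

validate_example({"age": 34.0, "country": "NZ"})  # passes silently
```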

Monitoring training-serving skew, label leakage, model age, and numerical stability is crucial for maintaining pipeline health and model performance.

Live model quality testing uses methods such as human labelling and statistical analysis to ensure ongoing model effectiveness in real-world scenarios.

Generating data with seeded, deterministic randomisation enables reproducible experiments and consistent analysis.

Maintaining invariant hashing (hashing a stable identifier to decide which split an example belongs to) ensures that data splits remain consistent across experiments, contributing to reliable analysis and model evaluation.
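A minimal sketch of hashing-based splitting (the identifier scheme is hypothetical):

```python
import hashlib

def assign_split(example_id: str, test_fraction: float = 0.2) -> str:
    """Assign an example to a split by hashing a stable identifier.

    The same id always lands in the same split, so train/test splits
    stay consistent across experiments and dataset regenerations.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "test" if bucket < test_fraction * 100 else "train"

assert assign_split("user_12345") == assign_split("user_12345")
```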

Source: Production ML systems: Monitoring pipelines | Machine Learning | Google for Developers

Questions to ask

Continuously monitor models in production to evaluate feature importance and potentially remove unnecessary features, ensuring prediction quality and resource efficiency.

Data reliability is crucial. Consider data source stability, potential changes in upstream data processes, and the creation of local data copies to control versioning and mitigate risks.

Be aware of feedback loops, where a model's predictions influence future input data, potentially leading to unexpected behaviour or biased outcomes, especially in interconnected systems.

Source: Production ML systems: Questions to ask | Machine Learning | Google for Developers

Automated machine learning

Introduction

AutoML automates tasks in the machine learning workflow, such as data engineering (feature selection and engineering), training (algorithm selection and hyperparameter tuning), and analysis, making model building faster and easier.

ml-workflow.png

While manual training involves writing code and iteratively adjusting it, AutoML reduces repetitive work and the need for specialised skills.

Source: Automated Machine Learning (AutoML) | Google for Developers

Benefits and limitations

Benefits:

- Increased productivity: automating repetitive workflow tasks saves time and reduces manual iteration.
- Lower barrier to entry: less specialised ML expertise is required to build a working model.

Limitations:

- Less control and customisation than manual training, so model quality may fall short of a carefully hand-tuned model.
- Large amounts of data are generally required, although specialised systems using transfer learning (taking a model trained on one task and adapting its learned representations to a different but related task) can reduce this requirement.
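A minimal Keras sketch of transfer learning as defined above (the pretrained base, input shape, and binary task are illustrative):

```python
import tensorflow as tf

# Reuse representations learned on ImageNet for a new, related task.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pretrained representations

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new task head
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```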

AutoML suits teams with limited ML experience or those seeking productivity gains without customisation needs. Custom (manual) training suits cases where model quality and customisation matter most.

Source: AutoML: Benefits and limitations | Machine Learning | Google for Developers

Getting started

AutoML tools fall into two categories: no-code tools, which guide users through model building in a graphical interface, and code-based tools (APIs and libraries), which automate parts of the workflow from within custom code.

The AutoML workflow follows steps similar to traditional machine learning, including problem definition, data gathering, preparation, model development, evaluation, and potential retraining.

Data preparation is crucial for AutoML and involves labelling, cleaning and formatting data, and applying feature transformations.

No-code AutoML tools guide users through model development with steps such as data import, analysis, refinement, and configuration of run parameters before starting the automated training process.

Source: AutoML: Getting started | Machine Learning | Google for Developers

Fairness

Introduction

Before putting a model into production, it is critical to audit training data and evaluate predictions for bias.

Source: Fairness | Machine Learning | Google for Developers

Types of bias

Machine learning models can be susceptible to bias due to human involvement in data selection and curation.

Understanding common human biases is crucial for mitigating their impact on model predictions.

Types of bias include reporting bias, historical bias, automation bias, selection bias, coverage bias, non-response bias, sampling bias, group attribution bias (in-group bias and out-group homogeneity bias), implicit bias, confirmation bias, and experimenter's bias, among others.

Source: Fairness: Types of bias | Machine Learning | Google for Developers

Identifying bias

Missing or unexpected feature values in a dataset can indicate potential sources of bias.

Data skew, where certain groups are under- or over-represented, can introduce bias and should be addressed.

Evaluating model performance by subgroup ensures fairness and equal performance across different characteristics.
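A minimal sketch of per-subgroup evaluation (the data and the choice of recall are illustrative):

```python
import numpy as np
from sklearn.metrics import recall_score

def recall_by_subgroup(y_true, y_pred, groups):
    """Compute recall separately for each demographic subgroup."""
    return {g: recall_score(y_true[groups == g], y_pred[groups == g])
            for g in np.unique(groups)}

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
groups = np.array(["a", "a", "a", "b", "b", "b"])
print(recall_by_subgroup(y_true, y_pred, groups))  # {'a': 0.5, 'b': 1.0}
```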

Source: Fairness: Identifying bias | Machine Learning | Google for Developers

Mitigating bias

Machine learning engineers use two primary strategies to mitigate bias in models: augmenting the training data and adjusting the model's loss function.

Augmenting training data involves collecting additional data to address missing, incorrect, or skewed data, but it can be infeasible due to data availability or resource constraints.

Adjusting the model's loss function involves using fairness-aware optimisation functions rather than the common default log loss; a sketch of the idea follows below.
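As a minimal illustration (a simplified penalty of my own construction, not the course's exact formulation): the standard log loss plus a term that penalises the gap in mean predicted score between two groups.

```python
import numpy as np

def log_loss(y_true, p):
    """Standard log loss."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def fairness_aware_loss(y_true, p, group, weight=1.0):
    """Log loss plus a penalty on the gap in mean predicted score
    between two groups (group is a 0/1 array)."""
    gap = abs(p[group == 0].mean() - p[group == 1].mean())
    return log_loss(y_true, p) + weight * gap
```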

The TensorFlow Model Remediation Library provides optimisation functions designed to penalise errors in a fairness-aware manner:

- MinDiff, which penalises differences in the distribution of predictions between two slices of data.
- Counterfactual Logit Pairing (CLP), which penalises differences in the model's predictions for pairs of examples that are identical except for a sensitive attribute.

Source: Fairness: Mitigating bias | Machine Learning | Google for Developers

Evaluating for bias

Aggregate model performance metrics such as precision, recall, and accuracy can hide biases against minority groups.

Fairness in model evaluation involves ensuring equitable outcomes across different demographic groups.

Fairness metrics can help assess model predictions for bias.

Candidate pool of 100 students: 80 students belong to the majority group (blue), and 20 students belong to the minority group (orange): fairness_metrics_candidate_pool.png

Source: Fairness: Evaluating for bias | Machine Learning | Google for Developers

Demographic parity

Demographic parity aims to ensure equal acceptance rates for majority and minority groups, regardless of individual qualifications.

Both the majority (blue) and minority (orange) groups have an acceptance rate of 20%: fairness_metrics_demographic_parity.png

While demographic parity promotes equal representation, it can overlook differences in individual qualifications within each group, potentially leading to unfair outcomes.

Qualified students in both groups are shaded in green, and qualified students who were rejected are marked with an X: fairness_metrics_demographic_parity_by_qualifications.png

Majority acceptance rate = Qualified majority accepted / Qualified majority = 16/35 ≈ 46%
Minority acceptance rate = Qualified minority accepted / Qualified minority = 4/15 ≈ 27%

When the distribution of a preferred label (“qualified”) differs substantially between groups, demographic parity may not be the most appropriate fairness metric.

Demographic parity also has further benefits and drawbacks beyond those discussed here that may be worth considering.

Source: Fairness: Demographic parity | Machine Learning | Google for Developers

Equality of opportunity

Equality of opportunity focuses on ensuring that qualified individuals have an equal chance of acceptance, regardless of demographic group.

Qualified students in both groups are shaded in green: fairness_metrics_equality_of_opportunity_by_qualifications.png

Majority acceptance rate = Qualified majority accepted / Qualified majority = 14/35 = 40%
Minority acceptance rate = Qualified minority accepted / Qualified minority = 6/15 = 40%
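The arithmetic behind both metrics, using the post's numbers:

```python
# Demographic parity compares overall acceptance rates per group.
assert 16 / 80 == 4 / 20 == 0.20   # both groups accepted at 20%

# Equality of opportunity compares acceptance rates among *qualified*
# candidates only (the true positive rate).
assert 14 / 35 == 6 / 15 == 0.40   # both qualified groups at 40%
```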

Equality of opportunity has limitations, including reliance on a clearly defined preferred label and challenges in settings that lack demographic data.

It is possible for a model to satisfy both demographic parity and equality of opportunity under specific conditions where positive prediction rates and true positive rates align across groups.

Source: Fairness: Equality of opportunity | Machine Learning | Google for Developers

Counterfactual fairness

Counterfactual fairness evaluates fairness by comparing predictions for similar individuals who differ only in a sensitive attribute such as demographic group.

This metric is particularly useful when datasets lack complete demographic information for most examples but contain it for a subset.

Candidate pool, with demographic group membership unknown for most candidates (icons shaded in grey): fairness_metrics_counterfactual_satisfied.png
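A minimal sketch of a counterfactual fairness check (the `model.predict_one` interface is hypothetical):

```python
def counterfactually_fair(model, example, sensitive_key="group"):
    """Flip only the sensitive attribute and compare predictions.

    The prediction for an individual should not change when nothing
    but their demographic group membership changes.
    """
    counterfactual = dict(example)
    counterfactual[sensitive_key] = (
        "minority" if example[sensitive_key] == "majority" else "majority")
    return model.predict_one(example) == model.predict_one(counterfactual)
```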

Counterfactual fairness may not capture broader systemic biases across subgroups. Other fairness metrics, such as demographic parity and equality of opportunity, provide a more holistic view but may require complete demographic data.

Summary

Selecting the appropriate fairness metric depends on the specific application and desired outcome, with no single “right” metric universally applicable.

For example, if the goal is to achieve equal representation, demographic parity may be the optimal metric. If the goal is to achieve equal opportunity, equality of opportunity may be the best metric.

Some definitions of fairness are mutually incompatible.

Source: Fairness: Counterfactual fairness | Machine Learning | Google for Developers