Google ML Crash Course #2 Notes: Data

This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This second module covers fundamental techniques and best practices for working with machine learning data.

Working with numerical data

Introduction

Numerical data: Integers or floating-point values that behave like numbers. They are additive, countable, ordered, and so on. Examples include temperature, weight, or the number of deer wintering in a nature preserve.

Source: Working with numerical data | Machine Learning | Google for Developers

How a model ingests data with feature vectors

A machine learning model ingests data through floating-point arrays called feature vectors, which are derived from dataset features. Feature vectors often utilise processed or transformed values instead of raw dataset values to enhance model learning.

Example of a feature vector: [0.13, 0.47]

Feature engineering is the process of converting raw data into representations suitable for the model. Common techniques, covered in the sections that follow, include normalisation and binning (bucketing).

Non-numerical data like strings must be converted into numerical values for use in feature vectors.

Source: Numerical data: How a model ingests data using feature vectors | Machine Learning | Google for Developers

First steps

Before creating feature vectors, it is crucial to analyse the numerical data, using basic statistics and visualisations, so that anomalies and patterns are detected early.

Outliers, values significantly distant from others, should be identified and handled appropriately.

A dataset probably contains outliers when the mean differs substantially from the median, or when the gap between the minimum and the 25th percentile looks very different from the gap between the 75th percentile and the maximum.

Source: Numerical data: First steps | Machine Learning | Google for Developers

Normalization

Normalisation is crucial for enhancing machine learning model performance by scaling features to a similar range. It is also recommended for a single numeric feature that covers a wide range (for example, city population).

Normalisation has the following benefits: it helps models converge more quickly during training, helps models make better predictions, helps avoid the “NaN trap” when feature values are very large, and helps the model learn appropriate weights for each feature (otherwise the model pays too much attention to features with wider ranges).

| Normalization technique | Formula | When to use |
| --- | --- | --- |
| Linear scaling | $$x'=\frac{x-x_\text{min}}{x_\text{max}-x_\text{min}}$$ | When the feature is roughly uniformly distributed across a fixed range (flat-shaped) |
| Z-score scaling | $$x' = \frac{x-\mu}{\sigma}$$ | When the feature is roughly normally distributed, with a peak close to the mean (bell-shaped) |
| Log scaling | $$x'=\ln(x)$$ | When the feature distribution is heavily skewed, with a long tail on one side (power-law shaped) |
| Clipping | If $$x > \text{max}$$, set $$x'=\text{max}$$; if $$x < \text{min}$$, set $$x' = \text{min}$$ | When the feature contains extreme outliers |
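A minimal NumPy sketch of these four techniques, applied to an illustrative array of raw feature values (the values and clipping thresholds are made up for demonstration):

```python
import numpy as np

x = np.array([15.0, 42.0, 87.0, 120.0, 425.0])  # illustrative raw feature values

# Linear scaling: map the observed range onto [0, 1]
linear_scaled = (x - x.min()) / (x.max() - x.min())

# Z-score scaling: centre on the mean and scale by the standard deviation
z_scored = (x - x.mean()) / x.std()

# Log scaling: compress a heavy-tailed distribution (values must be positive)
log_scaled = np.log(x)

# Clipping: cap extreme outliers at chosen minimum/maximum thresholds
clipped = np.clip(x, a_min=20.0, a_max=300.0)
```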

Source: Numerical data: Normalization | Machine Learning | Google for Developers

Binning

Binning (bucketing) is a feature engineering technique used to group numerical data into categories (bins). In many cases, this turns numerical data into categorical data.

For example, if a feature X has values ranging from 15 to 425, we can apply binning to represent X as a feature vector divided into specific intervals:

| Bin number | Range | Feature vector |
| --- | --- | --- |
| 1 | 15-34 | [1.0, 0.0, 0.0, 0.0, 0.0] |
| 2 | 35-117 | [0.0, 1.0, 0.0, 0.0, 0.0] |
| 3 | 118-279 | [0.0, 0.0, 1.0, 0.0, 0.0] |
| 4 | 280-392 | [0.0, 0.0, 0.0, 1.0, 0.0] |
| 5 | 393-425 | [0.0, 0.0, 0.0, 0.0, 1.0] |

Even though X is a single column in the dataset, binning causes a model to treat X as five separate features. Therefore, the model learns separate weights for each bin.
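As a rough sketch of the mechanics, assuming the bin boundaries from the table above, the mapping from a raw value to its one-hot bin vector could look like this:

```python
import numpy as np

# Upper boundaries of the five bins from the table above
bin_edges = [34, 117, 279, 392, 425]

def bin_to_one_hot(value):
    """Map a raw value of X to the one-hot feature vector of its bin."""
    bin_index = np.searchsorted(bin_edges, value)  # index 0..4
    one_hot = np.zeros(len(bin_edges))
    one_hot[bin_index] = 1.0
    return one_hot

print(bin_to_one_hot(42))   # [0. 1. 0. 0. 0.] -> bin 2 (35-117)
print(bin_to_one_hot(400))  # [0. 0. 0. 0. 1.] -> bin 5 (393-425)
```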

Binning offers an alternative to scaling or clipping and is particularly useful for handling outliers and improving model performance on non-linear data.

When to use: Binning works well when the linear relationship between the feature and the label is weak or nonexistent, or when the feature values cluster into groups (a “clumpy” distribution).

Example: number of shoppers versus temperature. By binning the temperature values, the model learns a separate weight for each temperature bin. binning_temperature_vs_shoppers_divided_into_3_bins.png

While creating multiple bins is possible, it is generally recommended to avoid an excessive number, as this can lead to insufficient training examples per bin and increased feature dimensionality.

Quantile bucketing is a specific binning technique that ensures each bin contains a roughly equal number of examples, which can be particularly useful for datasets with skewed distributions.

Source: Numerical data: Binning | Machine Learning | Google for Developers

Scrubbing

| Problem category | Example |
| --- | --- |
| Omitted values | A census taker fails to record a resident's age |
| Duplicate examples | A server uploads the same logs twice |
| Out-of-range feature values | A human accidentally types an extra digit |
| Bad labels | A human evaluator mislabels a picture of an oak tree as a maple |

You can use programs or scripts to identify and handle data issues such as omitted values, duplicates, and out-of-range feature values by removing or correcting them.
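A hedged pandas sketch of such a scrubbing script, using a hypothetical census-style table with made-up column names:

```python
import pandas as pd

# Hypothetical data illustrating the problem categories above
df = pd.DataFrame({
    "age": [34, None, 29, 29, 310],                 # an omitted value and an extra-digit typo
    "label": ["oak", "maple", "oak", "oak", "oak"],
})

df = df.drop_duplicates()               # remove duplicate examples
df = df.dropna(subset=["age"])          # drop examples with omitted values
df = df[df["age"].between(0, 120)]      # drop out-of-range feature values
```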

Source: Numerical data: Scrubbing | Machine Learning | Google for Developers

Qualities of good numerical features

Source: Numerical data: Qualities of good numerical features | Machine Learning | Google for Developers

Polynomial transformations

Synthetic features, such as polynomial transforms, enable linear models to represent non-linear relationships by introducing new features based on existing ones.

By incorporating synthetic features, linear regression models can effectively separate data points that are not linearly separable, using curves instead of straight lines. For example, adding the synthetic feature $$x^2$$ lets a linear model separate two classes whose boundary is the curve $$y = x^2$$. ft_cross1.png
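A small sketch of this idea on toy data, using NumPy and scikit-learn (not the course's own code): the true boundary is the curve $$y = x^2$$, and adding $$x^2$$ as a synthetic feature makes that boundary linear in the new feature space.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = rng.uniform(0, 9, size=200)
label = (y > x ** 2).astype(int)        # the true class boundary is the curve y = x^2

# Adding the synthetic feature x^2 lets a linear model draw the curved boundary
features = np.column_stack([x, y, x ** 2])
model = LogisticRegression(max_iter=1000).fit(features, label)
print(model.score(features, label))      # close to 1.0 on this toy data
```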

Feature crosses, a related concept for categorical data, synthesise new features by combining existing features, further enhancing model flexibility.

Source: Numerical data: Polynomial transforms | Machine Learning | Google for Developers

Working with categorical data

Introduction

Categorical data has a specific set of possible values. Examples include species of animals, names of streets, whether or not an email is spam, and binned numbers.

Categorical data can include numbers that behave like categories. An example is postal codes.

Encoding means converting categorical or other data to numerical vectors that a model can train on.

Preprocessing includes converting non-numerical data, such as strings, to floating-point values.

Source: Working with categorical data | Machine Learning | Google for Developers

Vocabulary and one-hot encoding

Machine learning models require numerical input; therefore, categorical data such as strings must be converted to numerical representations.

The term dimension is a synonym for the number of elements in a feature vector. Some categorical features are low dimensional. For example:

| Feature name | # of categories | Sample categories |
| --- | --- | --- |
| snowed_today | 2 | True, False |
| skill_level | 3 | Beginner, Practitioner, Expert |
| season | 4 | Winter, Spring, Summer, Autumn |
| dayofweek | 7 | Monday, Tuesday, Wednesday |
| planet | 8 | Mercury, Venus, Earth |
| car_colour | 8 | Red, Orange, Blue, Yellow |

When a categorical feature has a low number of possible categories, you can encode it as a vocabulary. This treats each category as a separate feature, allowing the model to learn distinct weights for each during training.

One-hot encoding transforms categorical values into numerical vectors (arrays) of N elements, where N is the number of categories. Exactly one of the elements in a one-hot vector has the value 1.0; all the remaining elements have the value 0.0.

| Feature | Red | Orange | Blue | Yellow | Green | Black | Purple | Brown |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| “Red” | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| “Orange” | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| “Blue” | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| “Yellow” | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| “Green” | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| “Black” | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| “Purple” | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| “Brown” | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |

It is the one-hot vector, not the string or the index number, that gets passed to the feature vector. The model learns a separate weight for each element of the feature vector.

The end-to-end process to map categories to feature vectors: vocabulary-index-sparse-feature.png

In a true one-hot encoding, only one element has the value 1.0. In a variant known as multi-hot encoding, multiple values can be 1.0.

A feature whose values are predominantly zero (or empty) is termed a sparse feature.

Sparse representation efficiently stores one-hot encoded data by only recording the position of the '1' value to reduce memory usage.

Notice that the sparse representation consumes far less memory. Importantly, the model must train on the one-hot vector, not the sparse representation.

The sparse representation of a multi-hot encoding stores the positions of all the non-zero elements. For example, the sparse representation of a car that is both “Blue” and “Black” is 2, 5.
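A minimal sketch of the whole mapping, using the car_colour vocabulary above (plain Python and NumPy rather than any particular ML framework):

```python
import numpy as np

vocabulary = ["Red", "Orange", "Blue", "Yellow", "Green", "Black", "Purple", "Brown"]
index = {colour: i for i, colour in enumerate(vocabulary)}

def encode(colours):
    """One-hot (or multi-hot) encode a list of categories from the vocabulary."""
    vector = np.zeros(len(vocabulary))
    for colour in colours:
        vector[index[colour]] = 1.0
    return vector

print(encode(["Blue"]))            # one-hot: [0. 0. 1. 0. 0. 0. 0. 0.]
print(encode(["Blue", "Black"]))   # multi-hot: 1.0 at positions 2 and 5

# Sparse representation: store only the positions of the non-zero elements
print([index[c] for c in ["Blue", "Black"]])   # [2, 5]
```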

Categorical features can have outliers. If “car_colour” includes rare values such as “Mauve” or “Avocado”, you can group them into one out-of-vocabulary (OOV) category. All rare colours go into this single bucket, and the model learns one weight for it.

For high-dimensional categorical features with many categories, one-hot encoding might be inefficient, and embeddings or hashing (also called the hashing trick) are recommended.

Source: Categorical data: Vocabulary and one-hot encoding | Machine Learning | Google for Developers

Common issues with categorical data

Categorical data quality hinges on how categories are defined and labelled, impacting data reliability.

Human-labelled data, known as “gold labels”, is generally preferred for training due to its higher quality, but it is essential to check for human errors and biases.

Machine-labelled data, or “silver labels”, can introduce biases or inaccuracies, necessitating careful quality checks and awareness of potential common-sense violations.

High dimensionality in categorical data increases training complexity and costs, leading to techniques such as embeddings for dimensionality reduction.

Source: Categorical data: Common issues | Machine Learning | Google for Developers

Feature crosses

Feature crosses are created by combining two or more categorical or bucketed features to capture interactions and non-linearities within a dataset.

For example, consider a leaf dataset with two categorical features: edges, with the values {Smooth, Toothed, Lobed}, and arrangement, with the values {Opposite, Alternate}.

The feature cross, or Cartesian product, of these two features would be:

{Smooth_Opposite, Smooth_Alternate, Toothed_Opposite, Toothed_Alternate, Lobed_Opposite, Lobed_Alternate}

For example, if a leaf has a lobed edge and an alternate arrangement, the feature-cross vector will have a value of 1 for “Lobed_Alternate”, and a value of 0 for all other terms:

{0, 0, 0, 0, 0, 1}
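A short sketch of building this feature cross in plain Python (the feature names follow the example above):

```python
from itertools import product

edges = ["Smooth", "Toothed", "Lobed"]
arrangement = ["Opposite", "Alternate"]

# The feature cross is the Cartesian product of the two vocabularies
crossed = [f"{e}_{a}" for e, a in product(edges, arrangement)]
# ['Smooth_Opposite', 'Smooth_Alternate', 'Toothed_Opposite',
#  'Toothed_Alternate', 'Lobed_Opposite', 'Lobed_Alternate']

# One-hot encode a leaf with a lobed edge and an alternate arrangement
leaf = "Lobed_Alternate"
print([1.0 if term == leaf else 0.0 for term in crossed])
# [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```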

This dataset could be used to classify leaves by tree species, since these characteristics do not vary within a species.

Feature crosses are somewhat analogous to polynomial transforms.

Feature crosses can be particularly effective when guided by domain expertise. It is often possible, though computationally expensive, to use neural networks to automatically find and apply useful feature combinations during training.

Overuse of feature crosses with sparse features should be avoided, as it can lead to excessive sparsity in the resulting feature set. For example, if feature A is a 100-element sparse feature and feature B is a 200-element sparse feature, a feature cross of A and B yields a 20,000-element sparse feature.

Source: Categorical data: Feature crosses | Machine Learning | Google for Developers

Datasets, generalization, and overfitting

Introduction

Source: Datasets, generalization, and overfitting | Machine Learning | Google for Developers

Data characteristics

A machine learning model's performance is heavily reliant on the quality and quantity of the dataset it is trained on, with larger, high-quality datasets generally leading to better results.

Datasets can contain various data types, including numerical, categorical, text, multimedia, and embedding vectors, each requiring specific handling for optimal model training.

Common causes of unreliable data include omitted values, duplicate examples, bad feature values, and bad labels.

Maintaining data quality involves addressing issues such as label errors, noisy features, and proper filtering to ensure the reliability of the dataset for accurate predictions.

Incomplete examples with missing feature values should be handled by either deletion or imputation to avoid negatively impacting model training.

When imputing missing values, use reliable methods such as mean/median imputation and consider adding an indicator column to signal imputed values to the model. For example, alongside temperature include “temperature_is_imputed”. This lets the model learn to trust real observations more than imputed ones.
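A quick pandas sketch of median imputation with an indicator column (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical weather data with a missing temperature reading
df = pd.DataFrame({"temperature": [21.0, None, 19.5, 23.0]})

# Flag imputed rows so the model can learn to trust real observations more
df["temperature_is_imputed"] = df["temperature"].isna()
df["temperature"] = df["temperature"].fillna(df["temperature"].median())
```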

Source: Datasets: Data characteristics | Machine Learning | Google for Developers

Labels

Direct labels are generally preferred but often unavailable.

Use a proxy label when no direct label exists or when the direct concept resists easy numeric representation. Carefully evaluate proxy labels to ensure they are a suitable approximation.

Human-generated labels, while offering flexibility and nuanced understanding, can be expensive to produce and prone to errors, requiring careful quality control.

Models can train on a mix of automated and human-generated labels, but an extra set of human labels often adds complexity without sufficient benefit.

Source: Datasets: Labels | Machine Learning | Google for Developers

Imbalanced datasets

Imbalanced datasets occur when one label (majority class) is significantly more frequent than another (minority class), potentially hindering model training on the minority class.

Note: Accuracy is usually a poor metric for assessing a model trained on a class-imbalanced dataset.

A highly imbalanced floral dataset containing far more sunflowers (200) than roses (2): FloralDataset200Sunflowers2Roses.png

During training, a model should learn two things: what each class looks like, and how common each class is.

Standard training conflates these two goals. In contrast, a two-step technique of downsampling and upweighting the majority class separates these two goals, enabling the model to achieve both.

Step 1: Downsample the majority class by training on only a small fraction of majority class examples, which makes an imbalanced dataset more balanced during training and increases the chance that each batch contains enough minority examples.

For example, with a class-imbalanced dataset consisting of 99% majority class and 1% minority class examples, we could downsample the majority class by a factor of 25 to create a more balanced training set (80% majority class and 20% minority class).

Downsampling the majority class by a factor of 25: FloralDatasetDownsampling.png

Step 2: Upweight the downsampled majority class by the same factor used for downsampling, so each majority class error counts proportionally more during training. This corrects the artificial class distribution and bias introduced by downsampling, because the training data no longer reflects real-world frequencies.

Continuing the example from above, we must upweight the majority class by a factor of 25. That is, when the model makes an error on a majority-class example, treat the loss as if it were 25 errors (multiply the regular loss by 25).

Upweighting the majority class by a factor of 25: FloralDatasetUpweighting.png
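A rough pandas/NumPy sketch of both steps on a toy class-imbalanced dataset (the column names are made up; the 25x factor follows the example above):

```python
import numpy as np
import pandas as pd

FACTOR = 25  # downsampling (and therefore upweighting) factor

# Toy dataset: roughly 99% majority class (label 0) and 1% minority class (label 1)
rng = np.random.default_rng(0)
df = pd.DataFrame({"feature": rng.normal(size=10_000),
                   "label": (rng.random(10_000) < 0.01).astype(int)})

majority, minority = df[df["label"] == 0], df[df["label"] == 1]

# Step 1: downsample -- train on only 1/25 of the majority class examples
majority_down = majority.sample(frac=1 / FACTOR, random_state=42)

# Step 2: upweight the downsampled majority class by the same factor,
# so the model still learns how common each class really is
train = pd.concat([majority_down, minority])
train["example_weight"] = np.where(train["label"] == 0, FACTOR, 1)
```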

Experiment with different downsampling and upweighting factors just as you would experiment with other hyperparameters.

Benefits of this technique include a better model (the resultant model knows what each class looks like and how common each class is) and faster convergence.

Source: Datasets: Class-imbalanced datasets | Machine Learning | Google for Developers

Dividing the original dataset

Machine learning models should be tested against unseen data.

It is recommended to split the dataset into three subsets: training, validation, and test sets. PartitionThreeSets.png

The validation set is used for initial testing during training (to determine hyperparameter tweaks, add, remove, or transform features, and so on), and the test set is used for final evaluation. workflow_with_validation_set.png
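One common way to produce this three-way split, sketched here with scikit-learn's train_test_split on a stand-in dataset (the 70% / 15% / 15% proportions are just an example, not a course requirement):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset standing in for the real one
df = pd.DataFrame({"feature": range(1_000), "label": [i % 2 for i in range(1_000)]})

# Carve out the test set first, then split the remainder into training and validation,
# giving roughly a 70% / 15% / 15% partition
train_val, test = train_test_split(df, test_size=0.15, random_state=42)
train, validation = train_test_split(train_val, test_size=0.15 / 0.85, random_state=42)
```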

The validation and test sets can “wear out” with repeated use. For this reason, it is a good idea to collect more data to “refresh” the test and validation sets.

A good test set is large enough to yield statistically significant results, representative of the dataset as a whole and of the real-world data the model will encounter, and contains no examples that also appear in the training set.

In theory, the validation set and test set should contain the same number of examples, or nearly so.

Source: Datasets: Dividing the original dataset | Machine Learning | Google for Developers

Transforming data

Machine learning models require all data, including features such as street names, to be transformed into numerical (floating-point) representations for training.

Normalisation improves model training by converting existing floating-point features to a constrained range.

When dealing with large datasets, select a subset of examples for training. When possible, select the subset that is most relevant to your model's predictions. Safeguard privacy by omitting examples containing personally identifiable information.

Source: Datasets: Transforming data | Machine Learning | Google for Developers

Generalization

Generalisation refers to a model's ability to perform well on new, unseen data.

Source: Generalization | Machine Learning | Google for Developers

Overfitting

Overfitting means creating a model that matches the training set so closely that the model fails to make correct predictions on new data.

Generalization is the opposite of overfitting. That is, a model that generalises well makes good predictions on new data.

An overfit model is analogous to an invention that performs well in the lab but is worthless in the real world. An underfit model is like a product that does not even do well in the lab.

Overfitting can be detected by observing diverging loss curves for training and validation sets on a generalization curve (a graph that shows two or more loss curves). A generalization curve for a well-fit model shows two loss curves that have similar shapes.

Common causes of overfitting include a training set that is too small or does not adequately represent real-world data, and a model that is too complex.

Dataset conditions for good generalisation include: examples are drawn independently and identically distributed (i.i.d.), the distribution is stationary (it does not change over time), and all partitions (training, validation, and test sets) are drawn from the same distribution.

Source: Overfitting | Machine Learning | Google for Developers

Model complexity

Simpler models often generalise better to new data than complex models, even if they perform slightly worse on training data.

Occam's Razor favours simpler explanations and models.

Model training should minimise both loss and complexity for optimal performance on new data. $$ \text{minimise}(\text{loss + complexity}) $$

Unfortunately, loss and complexity are typically inversely related. As complexity increases, loss decreases. As complexity decreases, loss increases.

Regularisation techniques help prevent overfitting by penalising model complexity during training.

Source: Overfitting: Model complexity | Machine Learning | Google for Developers

L2 regularization

L2 regularisation is a popular regularisation metric to reduce model complexity and prevent overfitting. It uses the following formula: $$ L_2 \text{ regularisation} = w^2_1 + w^2_2 + \ldots + w^2_n $$

Because the weights are squared, it penalises large weights far more heavily than weights close to zero.

L2 regularisation encourages weights towards 0, but never pushes them all the way to zero.

A regularisation rate (lambda) controls the strength of regularisation. $$ \text{minimise}(\text{loss} + \lambda \text{ complexity}) $$
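A tiny NumPy sketch of the penalised objective (the weights, data loss, and lambda are illustrative numbers, not values from the course):

```python
import numpy as np

weights = np.array([0.5, -2.0, 1.3, 0.01])   # illustrative model weights
data_loss = 0.8                               # loss computed on the training data (illustrative)
lambda_ = 0.1                                 # regularisation rate, tuned like any other hyperparameter

l2_penalty = np.sum(weights ** 2)             # w1^2 + w2^2 + ... + wn^2
regularised_loss = data_loss + lambda_ * l2_penalty
print(l2_penalty, regularised_loss)
```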

Tuning is required to find the ideal regularisation rate.

Early stopping is an alternative regularisation method that involves ending training before the model fully converges to prevent overfitting. It usually increases training loss but decreases test loss. It is a quick but rarely optimal form of regularisation.

Learning rate and regularisation rate tend to pull weights in opposite directions. A high learning rate often pulls weights away from zero, while a high regularisation rate pulls weights towards zero. The goal is to find the equilibrium.

Source: Overfitting: L2 regularization | Machine Learning | Google for Developers

Interpreting loss curves

An ideal loss curve looks like this: metric-curve-ideal.png

To improve an oscillating loss curve, reduce the learning rate or remove bad examples (such as NaNs and outliers) from the training set.

Possible reasons for a loss curve with a sharp jump include NaNs in the input data or a burst of outliers or anomalous examples.

Test loss diverges from training loss when the model overfits the training set; remedies include simplifying the model, adding regularisation, or training on more data.

The loss curve gets stuck when the training data contains long runs of repetitive examples; shuffling the training data thoroughly usually resolves this.

Source: Overfitting: Interpreting loss curves | Machine Learning | Google for Developers