Google ML Crash Course #2 Notes: Data
This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This second module covers fundamental techniques and best practices for working with machine learning data.
Working with numerical data
Introduction
Numerical data: Integers or floating-point values that behave like numbers. They are additive, countable, ordered, and so on. Examples include temperature, weight, or the number of deer wintering in a nature preserve.
Source: Working with numerical data | Machine Learning | Google for Developers
How a model ingests data with feature vectors
A machine learning model ingests data through floating-point arrays called feature vectors, which are derived from dataset features. Feature vectors often utilise processed or transformed values instead of raw dataset values to enhance model learning.
Example of a feature vector: [0.13, 0.47]
Feature engineering is the process of converting raw data into suitable representations for the model. Common techniques are:
- Normalization: Converting numerical values into a standard range.
- Binning (bucketing): Converting numerical values into buckets or ranges.
Non-numerical data like strings must be converted into numerical values for use in feature vectors.
First steps
Before creating feature vectors, it is crucial to analyse numerical data to detect anomalies and patterns, which helps identify potential issues early. Useful first steps include:
- Visualising it through plots and graphs (like scatter plots or histograms)
- Calculating basic statistics like mean, median, standard deviation, or values at the quartile divisions (0th, 25th, 50th, 75th, 100th percentiles, where the 50th percentile is the median)
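As a rough sketch, these first-look statistics can be computed with pandas; the `city_population` column and its values are just illustrative placeholders:

```python
import pandas as pd

# Hypothetical numerical feature; replace with your own dataset and column.
df = pd.DataFrame({"city_population": [1200, 4500, 9800, 12000, 250000, 3100, 870]})

# count, mean, std, min, 25th/50th/75th percentiles, and max in one call.
print(df["city_population"].describe())

# The 50th percentile is the median.
print("median:", df["city_population"].median())
```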
Outliers, values significantly distant from others, should be identified and handled appropriately.
- If the outlier is due to a mistake: for example, an experimenter incorrectly entered data, or an instrument malfunctioned. We generally delete examples containing mistake outliers.
- If the outlier is a legitimate data point: keep it if the model needs to infer good predictions on such values. If not, delete it or apply more invasive feature engineering techniques, such as clipping.
A dataset probably contains outliers when:
- The delta between the 0th and 25th percentiles differs significantly from the delta between the 75th and 100th percentiles
- The standard deviation is almost as high as the mean
Source: Numerical data: First steps | Machine Learning | Google for Developers
Normalization
Data normalization is crucial for enhancing machine learning model performance by scaling features to a similar range. It is also recommended to normalise a single numeric feature that covers a wide range (for example, city population).
Normalisation has the following benefits:
- Helps a model converge more quickly.
- Helps models infer better predictions.
- Helps avoid the NaN trap (large numerical values exceeding the floating-point precision limit and flipping into NaN values).
- Helps the model learn appropriate weights (so the model does not pay too much attention to features with wide ranges).
| Normalization technique | Formula | When to use |
|---|---|---|
| Linear scaling | $$x'=\frac{x-x_\text{min}}{x_\text{max}-x_\text{min}}$$ | When the feature is roughly uniformly distributed across a fixed range; flat-shaped |
| Z-score scaling | $$x'=\frac{x-\mu}{\sigma}$$ | When the feature is normally distributed (peak close to the mean); bell-shaped |
| Log scaling | $$x'=\ln(x)$$ | When the feature follows a power-law distribution, with most values small and a few very large; heavy tail-shaped |
| Clipping | If $$x > \text{max}$$, set $$x'=\text{max}$$; if $$x < \text{min}$$, set $$x'=\text{min}$$ | When the feature contains extreme outliers |
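A minimal NumPy sketch of the four techniques in the table, assuming `x` is a one-dimensional array of raw feature values (the numbers are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 5.0, 10.0, 100.0])  # illustrative raw feature values

# Linear scaling: map values into [0, 1].
linear = (x - x.min()) / (x.max() - x.min())

# Z-score scaling: centre on the mean, scale by the standard deviation.
z_score = (x - x.mean()) / x.std()

# Log scaling: compress a heavy-tailed distribution (values must be positive).
log_scaled = np.log(x)

# Clipping: cap values at chosen minimum and maximum thresholds.
clipped = np.clip(x, a_min=1.0, a_max=10.0)
```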
Source: Numerical data: Normalization | Machine Learning | Google for Developers
Binning
Binning (bucketing) is a feature engineering technique used to group numerical data into categories (bins). In many cases, this turns numerical data into categorical data.
For example, if a feature X has values ranging from 15 to 425, we can apply binning to represent X as a feature vector divided into specific intervals:
| Bin number | Range | Feature vector |
|---|---|---|
| 1 | 15-34 | [1.0, 0.0, 0.0, 0.0, 0.0] |
| 2 | 35-117 | [0.0, 1.0, 0.0, 0.0, 0.0] |
| 3 | 118-279 | [0.0, 0.0, 1.0, 0.0, 0.0] |
| 4 | 280-392 | [0.0, 0.0, 0.0, 1.0, 0.0] |
| 5 | 393-425 | [0.0, 0.0, 0.0, 0.0, 1.0] |
Even though X is a single column in the dataset, binning causes a model to treat X as five separate features. Therefore, the model learns separate weights for each bin.
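A hedged sketch of this binning scheme with pandas, using the bin boundaries from the table above (the feature values are illustrative):

```python
import pandas as pd

x = pd.Series([15, 90, 200, 300, 425], name="X")

# Bin edges matching the table: 15-34, 35-117, 118-279, 280-392, 393-425.
edges = [15, 35, 118, 280, 393, 426]
labels = ["bin_1", "bin_2", "bin_3", "bin_4", "bin_5"]
binned = pd.cut(x, bins=edges, labels=labels, right=False)

# One column per bin, so the model learns a separate weight for each bin.
feature_vectors = pd.get_dummies(binned)
print(feature_vectors)
```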
Binning offers an alternative to scaling or clipping and is particularly useful for handling outliers and improving model performance on non-linear data.
When to use: Binning works well when features exhibit a “clumpy” distribution, that is, the overall linear relationship between the feature and label is weak or nonexistent, or when feature values are clustered.
Example: number of shoppers versus temperature. Binning the temperature lets the model learn a separate weight for each temperature bin rather than forcing a single linear relationship.

While creating multiple bins is possible, it is generally recommended to avoid an excessive number, as this can lead to insufficient training examples per bin and increased feature dimensionality.
Quantile bucketing is a specific binning technique that ensures each bin contains a roughly equal number of examples, which can be particularly useful for datasets with skewed distributions.
- Quantile buckets give extra information space to the large torso while compacting the long tail into a single bucket.
- Equal intervals give extra information space to the long tail while compacting the large torso into a single bucket.
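The difference is easy to see in a small pandas sketch; the skewed `prices` data below is synthetic and purely illustrative:

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed data: most values are small, a few are very large.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

# Equal intervals: the long tail claims most of the range,
# so the large torso is squeezed into one or two buckets.
equal_intervals = pd.cut(prices, bins=4)

# Quantile buckets: each bucket holds roughly the same number of examples,
# giving the torso more resolution and compacting the tail into one bucket.
quantile_buckets = pd.qcut(prices, q=4)

print(equal_intervals.value_counts().sort_index())
print(quantile_buckets.value_counts().sort_index())
```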

Source: Numerical data: Binning | Machine Learning | Google for Developers
Scrubbing
| Problem category | Example |
|---|---|
| Omitted values | A census taker fails to record a resident's age |
| Duplicate examples | A server uploads the same logs twice |
| Out-of-range feature values | A human accidentally types an extra digit |
| Bad labels | A human evaluator mislabels a picture of an oak tree as a maple |
You can use programs or scripts to identify and handle data issues such as omitted values, duplicates, and out-of-range feature values by removing or correcting them.
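A minimal pandas sketch of scrubbing for these problem categories (the `age` column and its plausible bounds are placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    "age":   [34, None, 29, 29, 340],   # omitted value, duplicate row, extra digit
    "label": ["oak", "maple", "oak", "oak", "maple"],
})

# Omitted values: drop rows with missing feature values (or impute them instead).
df = df.dropna(subset=["age"])

# Duplicate examples: keep only the first occurrence of identical rows.
df = df.drop_duplicates()

# Out-of-range feature values: keep only plausible ages.
df = df[(df["age"] >= 0) & (df["age"] <= 120)]

print(df)
```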
Source: Numerical data: Scrubbing | Machine Learning | Google for Developers
Qualities of good numerical features
- Good feature vectors require features that are clearly named and have obvious meanings to anyone on the project.
- Data should be checked and tested for bad data or outliers, such as inappropriate values, before being used for training.
- Features should be sensible, avoiding “magic values” that create discontinuities (for example, setting the value “watch_time_in_seconds” to -1 to indicate an absence of measurement); instead, use separate boolean features or new discrete values to indicate missing data.
Source: Numerical data: Qualities of good numerical features | Machine Learning | Google for Developers
Polynomial transformations
Synthetic features, such as polynomial transforms, enable linear models to represent non-linear relationships by introducing new features based on existing ones.
By incorporating synthetic features, linear regression models can effectively separate data points that are not linearly separable, using curves instead of straight lines. For example, we can separate two classes with the curve $y = x^2$.
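A small sketch of a polynomial transform with pandas; the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"x": [-2.0, -1.0, 0.0, 1.0, 2.0]})

# Polynomial transform: add x squared as a new synthetic feature.
# A model trained on [x, x_squared] can fit y = w1*x + w2*x^2 + b, i.e. a
# parabola, while remaining linear in its weights.
df["x_squared"] = df["x"] ** 2

print(df)
```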

Feature crosses, a related concept for categorical data, synthesise new features by combining existing features, further enhancing model flexibility.
Source: Numerical data: Polynomial transforms | Machine Learning | Google for Developers
Working with categorical data
Introduction
Categorical data has a specific set of possible values. Examples include species of animals, names of streets, whether or not an email is spam, and binned numbers.
Categorical data can include numbers that behave like categories. An example is postal codes.
- Unlike numerical data, such values cannot be meaningfully multiplied (doubling a postal code does not produce a meaningful value).
- Therefore, even though they are stored as native integers, they should be represented as categorical data.
Encoding means converting categorical or other data to numerical vectors that a model can train on.
Preprocessing includes converting non-numerical data, such as strings, to floating-point values.
Source: Working with categorical data | Machine Learning | Google for Developers
Vocabulary and one-hot encoding
Machine learning models require numerical input; therefore, categorical data such as strings must be converted to numerical representations.
The term dimension is a synonym for the number of elements in a feature vector. Some categorical features are low dimensional. For example:
| Feature name | # of categories | Sample categories |
|---|---|---|
| snowed_today | 2 | True, False |
| skill_level | 3 | Beginner, Practitioner, Expert |
| season | 4 | Winter, Spring, Summer, Autumn |
| dayofweek | 7 | Monday, Tuesday, Wednesday |
| planet | 8 | Mercury, Venus, Earth |
| car_colour | 8 | Red, Orange, Blue, Yellow |
When a categorical feature has a low number of possible categories, you can encode it as a vocabulary. This treats each category as a separate feature, allowing the model to learn distinct weights for each during training.
One-hot encoding transforms categorical values into numerical vectors (arrays) of N elements, where N is the number of categories. Exactly one of the elements in a one-hot vector has the value 1.0; all the remaining elements have the value 0.0.
| Feature | Red | Orange | Blue | Yellow | Green | Black | Purple | Brown |
|---|---|---|---|---|---|---|---|---|
| “Red” | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| “Orange” | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| “Blue” | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| “Yellow” | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| “Green” | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| “Black” | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| “Purple” | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| “Brown” | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
It is the one-hot vector, not the string or the index number, that gets passed to the feature vector. The model learns a separate weight for each element of the feature vector.
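A sketch of one-hot encoding the “car_colour” vocabulary with pandas; declaring the full vocabulary up front is an assumption, made so that every category gets a column even when it is absent from a given batch:

```python
import pandas as pd

colours = pd.Series(["Red", "Orange", "Blue"], name="car_colour")

# Declare the full vocabulary so every category gets a column,
# even if it does not appear in this particular batch of data.
vocabulary = ["Red", "Orange", "Blue", "Yellow", "Green", "Black", "Purple", "Brown"]
one_hot = pd.get_dummies(pd.Categorical(colours, categories=vocabulary))

print(one_hot)  # one column per colour; exactly one 1 per row
```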
The end-to-end process to map categories to feature vectors:

In a true one-hot encoding, only one element has the value 1.0. In a variant known as multi-hot encoding, multiple values can be 1.0.
A feature whose values are predominantly zero (or empty) is termed a sparse feature.
Sparse representation efficiently stores one-hot encoded data by only recording the position of the '1' value to reduce memory usage.
- For example, the one-hot vector for “car_colour” “Blue” is: [0, 0, 1, 0, 0, 0, 0, 0].
- Since the 1 is in position 2 (when starting the count at 0), the sparse representation is: 2.
Notice that the sparse representation consumes far less memory. Importantly, the model must train on the one-hot vector, not the sparse representation.
The sparse representation of a multi-hot encoding stores the positions of all the non-zero elements. For example, the sparse representation of a car that is both “Blue” and “Black” is 2, 5.
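A minimal plain-Python sketch of converting between a multi-hot vector and its sparse representation, using the “Blue” and “Black” example above:

```python
# Multi-hot vector for a car that is both "Blue" (position 2) and "Black" (position 5).
multi_hot = [0, 0, 1, 0, 0, 1, 0, 0]

# Sparse representation: store only the positions of the non-zero elements.
sparse = [i for i, value in enumerate(multi_hot) if value == 1]
print(sparse)  # [2, 5]

# Reconstruct the dense vector that the model actually trains on.
dense = [1 if i in sparse else 0 for i in range(len(multi_hot))]
print(dense)  # [0, 0, 1, 0, 0, 1, 0, 0]
```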
Categorical features can have outliers. If “car_colour” includes rare values such as “Mauve” or “Avocado”, you can group them into one out-of-vocabulary (OOV) category. All rare colours go into this single bucket, and the model learns one weight for it.
For high-dimensional categorical features with many categories, one-hot encoding might be inefficient, and embeddings or hashing (also called the hashing trick) are recommended.
- For example, a feature like “words_in_english” has around 500,000 categories.
- Embeddings substantially reduce the number of dimensions, which helps the model train faster and infer predictions more quickly.
Source: Categorical data: Vocabulary and one-hot encoding | Machine Learning | Google for Developers
Common issues with categorical data
Categorical data quality hinges on how categories are defined and labelled, impacting data reliability.
Human-labelled data, known as “gold labels”, is generally preferred for training due to its higher quality, but it is essential to check for human errors and biases.
- Any two human beings may label the same example differently. The degree to which human raters make the same decisions on the same examples is called inter-rater agreement.
- Inter-rater agreement can be measured using kappa and intra-class correlation (Hallgren, 2012), or Krippendorff's alpha (Krippendorff, 2011).
Machine-labelled data, or “silver labels”, can introduce biases or inaccuracies, necessitating careful quality checks and awareness of potential common-sense violations.
- For example, if a computer-vision model mislabels a photo of a chihuahua as a muffin, or a photo of a muffin as a chihuahua.
- Similarly, a sentiment analyser that scores neutral words as -0.25, when 0.0 is the neutral value, might be scoring all words with an additional negative bias.
High dimensionality in categorical data increases training complexity and costs, leading to techniques such as embeddings for dimensionality reduction.
Source: Categorical data: Common issues | Machine Learning | Google for Developers
Feature crosses
Feature crosses are created by combining two or more categorical or bucketed features to capture interactions and non-linearities within a dataset.
For example, consider a leaf dataset with the categorical features:
- “edges”, containing values {smooth, toothed, lobed}
- “arrangement”, containing values {opposite, alternate}
The feature cross, or Cartesian product, of these two features would be:
{Smooth_Opposite, Smooth_Alternate, Toothed_Opposite, Toothed_Alternate, Lobed_Opposite, Lobed_Alternate}
For example, if a leaf has a lobed edge and an alternate arrangement, the feature-cross vector will have a value of 1 for “Lobed_Alternate”, and a value of 0 for all other terms:
{0, 0, 0, 0, 0, 1}
This dataset could be used to classify leaves by tree species, since these characteristics do not vary within a species.
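A hedged pandas sketch of building and one-hot encoding this feature cross (the example values are illustrative):

```python
import pandas as pd

leaves = pd.DataFrame({
    "edges":       ["lobed", "smooth", "toothed"],
    "arrangement": ["alternate", "opposite", "alternate"],
})

# Feature cross: concatenate the two categorical values into one combined category.
leaves["edges_x_arrangement"] = leaves["edges"] + "_" + leaves["arrangement"]

# One-hot encode the crossed feature; the model learns a weight per combination.
crossed = pd.get_dummies(leaves["edges_x_arrangement"])
print(crossed)
```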
Feature crosses are somewhat analogous to polynomial transforms.
Feature crosses can be particularly effective when guided by domain expertise. It is often possible, though computationally expensive, to use neural networks to automatically find and apply useful feature combinations during training.
Overuse of feature crosses with sparse features should be avoided, as it can lead to excessive sparsity in the resulting feature set. For example, if feature A is a 100-element sparse feature and feature B is a 200-element sparse feature, a feature cross of A and B yields a 20,000-element sparse feature.
Source: Categorical data: Feature crosses | Machine Learning | Google for Developers
Datasets, generalization, and overfitting
Introduction
- Data quality significantly impacts model performance more than algorithm choice.
- Machine learning practitioners typically dedicate a substantial portion of their project time (around 80%) to data preparation and transformation, including tasks such as dataset construction and feature engineering.
Source: Datasets, generalization, and overfitting | Machine Learning | Google for Developers
Data characteristics
A machine learning model's performance is heavily reliant on the quality and quantity of the dataset it is trained on, with larger, high-quality datasets generally leading to better results.
Datasets can contain various data types, including numerical, categorical, text, multimedia, and embedding vectors, each requiring specific handling for optimal model training.
The following are common causes of unreliable data in datasets:
- Omitted values
- Duplicate examples
- Bad feature values
- Bad labels
- Bad sections of data
Maintaining data quality involves addressing issues such as label errors, noisy features, and proper filtering to ensure the reliability of the dataset for accurate predictions.
Incomplete examples with missing feature values should be handled by either deletion or imputation to avoid negatively impacting model training.
When imputing missing values, use reliable methods such as mean/median imputation and consider adding an indicator column to signal imputed values to the model. For example, alongside temperature include “temperature_is_imputed”. This lets the model learn to trust real observations more than imputed ones.
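A minimal sketch of mean imputation with an indicator column, assuming a pandas DataFrame with a `temperature` feature:

```python
import pandas as pd

df = pd.DataFrame({"temperature": [21.5, None, 19.0, None, 23.2]})

# Indicator column: 1 where the value was missing and is about to be imputed.
df["temperature_is_imputed"] = df["temperature"].isna().astype(int)

# Mean imputation: fill the missing values with the column mean.
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())

print(df)
```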
Source: Datasets: Data characteristics | Machine Learning | Google for Developers
Labels
Direct labels are generally preferred but often unavailable.
- Direct labels exactly match the prediction target and appear explicitly in the dataset, such as a “bicycle_owner” column for predicting bicycle ownership.
- Proxy labels approximate the target and correlate with it, such as a bicycle magazine subscription as a signal of bicycle ownership.
Use a proxy label when no direct label exists or when the direct concept resists easy numeric representation. Carefully evaluate proxy labels to ensure they are a suitable approximation.
Human-generated labels, while offering flexibility and nuanced understanding, can be expensive to produce and prone to errors, requiring careful quality control.
Models can train on a mix of automated and human-generated labels, but an extra set of human labels often adds complexity without sufficient benefit.
Source: Datasets: Labels | Machine Learning | Google for Developers
Imbalanced datasets
Imbalanced datasets occur when one label (majority class) is significantly more frequent than another (minority class), potentially hindering model training on the minority class.
Note: Accuracy is usually a poor metric for assessing a model trained on a class-imbalanced dataset.
A highly imbalanced floral dataset containing far more sunflowers (200) than roses (2):

During training, a model should learn two things:
- What each class looks like, that is, what feature values correspond to which class.
- How common each class is, that is, what the relative distribution of the classes is.
Standard training conflates these two goals. In contrast, a two-step technique of downsampling and upweighting the majority class separates these two goals, enabling the model to achieve both.
Step 1: Downsample the majority class by training on only a small fraction of majority class examples, which makes an imbalanced dataset more balanced during training and increases the chance that each batch contains enough minority examples.
For example, with a class-imbalanced dataset consisting of 99% majority class and 1% minority class examples, we could downsample the majority class by a factor of 25 to create a more balanced training set (80% majority class and 20% minority class).
Downsampling the majority class by a factor of 25:

Step 2: Upweight the downsampled majority class by the same factor used for downsampling, so each majority class error counts proportionally more during training. This corrects the artificial class distribution and bias introduced by downsampling, because the training data no longer reflects real-world frequencies.
Continuing the example from above, we must upweight the majority class by a factor of 25. That is, when the model mistakenly predicts the majority class, treat the loss as if it were 25 errors (multiply the regular loss by 25).
Upweighting the majority class by a factor of 25:

Experiment with different downsampling and upweighting factors just as you would experiment with other hyperparameters.
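A rough pandas sketch of the two-step technique; the function name and the `example_weight` column are illustrative, not part of the course:

```python
import pandas as pd

def downsample_and_upweight(df: pd.DataFrame, label_col: str,
                            majority_label, factor: int) -> pd.DataFrame:
    majority = df[df[label_col] == majority_label]
    minority = df[df[label_col] != majority_label]

    # Step 1: downsample - keep only 1/factor of the majority class examples.
    majority_down = majority.sample(frac=1 / factor, random_state=42)

    # Step 2: upweight - each remaining majority example counts `factor` times
    # in the loss, restoring the real-world class frequencies.
    majority_down = majority_down.assign(example_weight=float(factor))
    minority = minority.assign(example_weight=1.0)

    return pd.concat([majority_down, minority])

# Usage (hypothetical): balanced = downsample_and_upweight(df, "label", 0, factor=25)
```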
Benefits of this technique include a better model (the resultant model knows what each class looks like and how common each class is) and faster convergence.
Source: Datasets: Class-imbalanced datasets | Machine Learning | Google for Developers
Dividing the original dataset
Machine learning models should be tested against unseen data.
It is recommended to split the dataset into three subsets: training, validation, and test sets.

The validation set is used for initial testing during training (to determine hyperparameter tweaks, add, remove, or transform features, and so on), and the test set is used for final evaluation.

The validation and test sets can “wear out” with repeated use. For this reason, it is a good idea to collect more data to “refresh” the test and validation sets.
A good test set is:
- Large enough to yield statistically significant results
- Representative of the dataset as a whole
- Representative of real-world data the model will encounter (if your model performs poorly on real-world data, determine how your dataset differs from real-life data)
- Free of duplicates from the training set
In theory, the validation set and test set should contain the same number of examples, or nearly so.
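A sketch of a three-way split with scikit-learn's `train_test_split`, assuming an illustrative 80/10/10 division:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# df stands in for the full, shuffled dataset of examples.
df = pd.DataFrame({"feature": range(100), "label": [i % 2 for i in range(100)]})

# Split off 80% for training, then divide the remaining 20% equally
# into validation and test sets (80% / 10% / 10% overall).
train_df, holdout_df = train_test_split(df, test_size=0.2, random_state=42)
validation_df, test_df = train_test_split(holdout_df, test_size=0.5, random_state=42)

print(len(train_df), len(validation_df), len(test_df))  # 80 10 10
```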
Source: Datasets: Dividing the original dataset | Machine Learning | Google for Developers
Transforming data
Machine learning models require all data, including features such as street names, to be transformed into numerical (floating-point) representations for training.
Normalisation improves model training by converting existing floating-point features to a constrained range.
When dealing with large datasets, select a subset of examples for training. When possible, select the subset that is most relevant to your model's predictions. Safeguard privacy by omitting examples containing personally identifiable information.
Source: Datasets: Transforming data | Machine Learning | Google for Developers
Generalization
Generalisation refers to a model's ability to perform well on new, unseen data.
Source: Generalization | Machine Learning | Google for Developers
Overfitting
Overfitting means creating a model that matches the training set so closely that the model fails to make correct predictions on new data.
Generalization is the opposite of overfitting. That is, a model that generalises well makes good predictions on new data.
An overfit model is analogous to an invention that performs well in the lab but is worthless in the real world. An underfit model is like a product that does not even do well in the lab.
Overfitting can be detected by observing diverging loss curves for training and validation sets on a generalization curve (a graph that shows two or more loss curves). A generalization curve for a well-fit model shows two loss curves that have similar shapes.
Common causes of overfitting include:
- A training set that does not adequately represent real-life data (or the validation set or test set).
- A model that is too complex.
Dataset conditions for good generalization include:
- Examples must be independently and identically distributed, which is a fancy way of saying that your examples cannot influence each other.
- The dataset is stationary, meaning it does not change significantly over time.
- The dataset partitions have the same distribution, meaning the examples in the training set, validation set, test set, and real-world data are statistically similar.
Source: Overfitting | Machine Learning | Google for Developers
Model complexity
Simpler models often generalise better to new data than complex models, even if they perform slightly worse on training data.
Occam's Razor favours simpler explanations and models.
Model training should minimise both loss and complexity for optimal performance on new data. $$ \text{minimise}(\text{loss + complexity}) $$
Unfortunately, loss and complexity are typically inversely related. As complexity increases, loss decreases. As complexity decreases, loss increases.
Regularisation techniques help prevent overfitting by penalising model complexity during training.
- L1 regularisation (also called LASSO) uses the sum of the absolute values of the model weights to measure model complexity.
- L2 regularisation (also called ridge regularisation) uses the sum of the squares of the model weights to measure model complexity.
Source: Overfitting: Model complexity | Machine Learning | Google for Developers
L2 regularization
L2 regularisation is a popular regularisation metric to reduce model complexity and prevent overfitting. It uses the following formula: $$ L_2 \text{ regularisation} = w^2_1 + w^2_2 + \ldots + w^2_n $$
It penalises especially large weights.
L2 regularisation encourages weights towards 0, but never pushes them all the way to zero.
A regularisation rate (lambda) controls the strength of regularisation. $$ \text{minimise}(\text{loss} + \lambda \text{ complexity}) $$
- A high regularisation rate reduces the likelihood of overfitting and tends to produce a histogram of model weights that are normally distributed around 0.
- A low regularisation rate lowers the influence of regularisation and tends to produce a histogram of model weights with a flat distribution.
Tuning is required to find the ideal regularisation rate.
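A tiny NumPy sketch of the regularised objective, with L2 complexity as the sum of squared weights (all numbers are made up):

```python
import numpy as np

weights = np.array([0.2, -0.5, 5.0, 0.25, -0.1])
data_loss = 1.8   # e.g. mean squared error on a training batch (made-up value)
lam = 0.01        # regularisation rate (lambda)

# L2 complexity: w1^2 + w2^2 + ... + wn^2; large weights dominate the penalty.
l2_complexity = np.sum(weights ** 2)

# The quantity the training procedure minimises: loss + lambda * complexity.
total_loss = data_loss + lam * l2_complexity
print(l2_complexity, total_loss)
```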
Early stopping is an alternative regularisation method that involves ending training before the model fully converges to prevent overfitting. It usually increases training loss but decreases test loss. It is a quick but rarely optimal form of regularisation.
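A minimal plain-Python sketch of the early-stopping rule, stopping when the validation loss has not improved for a set number of epochs (the loss values are made up):

```python
# Made-up validation losses for illustration; training stops when the
# validation loss has not improved for `patience` consecutive epochs.
val_losses = [0.90, 0.71, 0.60, 0.55, 0.56, 0.57, 0.58]
patience = 2

best_loss = float("inf")
epochs_without_improvement = 0

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss = loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:
        print(f"Stopping early at epoch {epoch}; best validation loss {best_loss}")
        break
```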
Learning rate and regularisation rate tend to pull weights in opposite directions. A high learning rate often pulls weights away from zero, while a high regularisation rate pulls weights towards zero. The goal is to find the equilibrium.
Source: Overfitting: L2 regularization | Machine Learning | Google for Developers
Interpreting loss curves
An ideal loss curve looks like this:

To improve an oscillating loss curve:
- Reduce the learning rate.
- Reduce the training set to a tiny number of trustworthy examples.
- Check your data against a data schema to detect bad examples, then remove the bad examples from the training set.

Possible reasons for a loss curve with a sharp jump include:
- The input data contains a burst of outliers.
- The input data contains one or more NaNs (for example, a value caused by a division by zero).

Test loss diverges from training loss when:
- The model overfits the training set.

The loss curve gets stuck when:
- The training set is not shuffled well.

Source: Overfitting: Interpreting loss curves | Machine Learning | Google for Developers