What is One Hot Encoding?

When preprocessing data for machine learning, we often need to convert categorical data (generally strings, but not always), such as countries, colors, or education levels, into numeric values that our models can work with. There are various approaches to this. The approach discussed here is one-hot encoding.

The one-hot encoding process converts each unique value in a categorical column into its own new field. In this format each column now represents one specific categorical value. One-hot encoded fields are populated by zeros and ones. A value of 1 indicates the presence of a categorical value. A value of 0 indicates the absence of that categorical value.

One Hot Encoding Example

Let's go through a simple one-hot encoding example. We have a small dataset that contains two fields, employee_id and city.

╒═══════════════╤═══════════════╕
│   employee_id │ city          │
╞═══════════════╪═══════════════╡
│             1 │ san francisco │
├───────────────┼───────────────┤
│             2 │ foster city   │
├───────────────┼───────────────┤
│             3 │ dublin        │
├───────────────┼───────────────┤
│             4 │ san francisco │
╘═══════════════╧═══════════════╛

Let's convert the city field to a number of one-hot encoded fields (features). After creating the one-hot encoded fields, we will have a separate field for each unique value.

╒═══════════════╤═══════════════╤════════════════════╤══════════════════════╕
│   employee_id │   city_dublin │   city_foster city │   city_san francisco │
╞═══════════════╪═══════════════╪════════════════════╪══════════════════════╡
│             1 │             0 │                  0 │                    1 │
├───────────────┼───────────────┼────────────────────┼──────────────────────┤
│             2 │             0 │                  1 │                    0 │
├───────────────┼───────────────┼────────────────────┼──────────────────────┤
│             3 │             1 │                  0 │                    0 │
├───────────────┼───────────────┼────────────────────┼──────────────────────┤
│             4 │             0 │                  0 │                    1 │
╘═══════════════╧═══════════════╧════════════════════╧══════════════════════╛

Note how each employee's city is recorded in this format. For example, instead of employee 2 having city=foster city, they now have city_foster city=1 with zeros in all other city fields.
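The conversion above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the helper name one_hot_encode and the list-of-dicts layout are assumptions made for this example, and the new field names follow the column_value pattern used in the table.

```python
def one_hot_encode(rows, column):
    """Return copies of rows with `column` replaced by 0/1 indicator fields."""
    # Sorted unique values give a stable, predictable field order.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every field except the one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            # 1 marks the presence of the category, 0 its absence.
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# The small dataset from the example (hypothetical in-memory layout).
employees = [
    {"employee_id": 1, "city": "san francisco"},
    {"employee_id": 2, "city": "foster city"},
    {"employee_id": 3, "city": "dublin"},
    {"employee_id": 4, "city": "san francisco"},
]

encoded_employees = one_hot_encode(employees, "city")
for row in encoded_employees:
    print(row)
```

Each encoded row contains exactly one 1 among its city fields, because every employee has exactly one city.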

Why not Assign Every Value an ID?

Why not simply assign each unique value a numeric ID? This approach is intuitive, but it can hurt model performance. For example, we could assign 1=USA, 2=Germany, and 3=South Africa. The problem is that this implies an ordering or ranking between the values (here, that South Africa is "greater than" Germany, which is "greater than" the USA) that does not actually exist.
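To make the problem concrete, the sketch below compares the two encodings for the three countries. It assumes a model that treats inputs as magnitudes, so the numeric gap between two codes matters. The helper names are hypothetical, and the count of differing positions is just one simple way to compare one-hot vectors.

```python
# Integer IDs from the text: 1=USA, 2=Germany, 3=South Africa.
country_ids = {"USA": 1, "Germany": 2, "South Africa": 3}

# One-hot vectors for the same three countries.
country_one_hot = {
    "USA": (1, 0, 0),
    "Germany": (0, 1, 0),
    "South Africa": (0, 0, 1),
}


def id_distance(a, b):
    """Numeric gap between two integer IDs."""
    return abs(country_ids[a] - country_ids[b])


def one_hot_distance(a, b):
    """Number of positions where two one-hot vectors differ."""
    return sum(x != y for x, y in zip(country_one_hot[a], country_one_hot[b]))


pairs = [("USA", "Germany"), ("Germany", "South Africa"), ("USA", "South Africa")]
for a, b in pairs:
    print(a, b, id_distance(a, b), one_hot_distance(a, b))
```

With integer IDs, the USA appears twice as far from South Africa as from Germany, purely because of the arbitrary numbering. With one-hot vectors, every pair of countries is equally far apart, so no ranking is implied.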

However, this approach can be appropriate for categories that actually do have an ordinal relationship. An example of this could be demographic generations in the United States. Since generations follow one another over time, these categories do have an ordinal/ranking relationship. For example, 1=Baby Boomers, 2=Gen X, 3=Millennials, 4=Gen Z, etc.

╒══════════════════╤═════════════════╕
│ Generation Name  │   Encoded Value │
╞══════════════════╪═════════════════╡
│ Baby Boomers     │               1 │
├──────────────────┼─────────────────┤
│ Gen X            │               2 │
├──────────────────┼─────────────────┤
│ Millennials      │               3 │
├──────────────────┼─────────────────┤
│ Gen Z            │               4 │
╘══════════════════╧═════════════════╛
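An ordinal encoding like this one can be sketched as a simple lookup. The example below follows the 1-to-4 numbering given in the text. The names GENERATION_ORDER and encode_generation are hypothetical, chosen only for illustration.

```python
# Categories listed from oldest to youngest; list position defines the rank.
GENERATION_ORDER = ["Baby Boomers", "Gen X", "Millennials", "Gen Z"]

# Map each generation name to its rank, starting at 1.
generation_codes = {
    name: rank for rank, name in enumerate(GENERATION_ORDER, start=1)
}


def encode_generation(name):
    """Return the ordinal code for a generation name."""
    return generation_codes[name]


people = ["Gen X", "Gen Z", "Baby Boomers"]
encoded_generations = [encode_generation(name) for name in people]
print(encoded_generations)
```

Because the codes follow the real order of the categories, comparisons such as "Gen X comes before Gen Z" are preserved in the encoded values.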

One Hot Code Examples

One-hot encoding is a common task, and there are many libraries in various languages that make creating one-hot encoded fields easy. We will publish examples here as they become available.