Categories
Algorithms

Fast One Hot Encoding using Numpy (No For Loops)

In a lot of artificial intelligence applications, specially in supervised learning classification problems, where the labels for each of the datapoints are available, we often have a one-dimensional array containing the classes of each of the datapoints. Depending upon the machine learning algorithm we are going to apply to our dataset for classification, we might need to one-hot encode our labels.

What is One Hot Encoding?

Say, we are given a labels array like,

>>> Y=numpy.randdom.randint(15,size=10).reshape(-1,1)
>>> Y
array([[12],
       [ 8],
       [ 4],
       [ 4],
       [11],
       [ 0],
       [10],
       [10],
       [ 1],
       [ 8]])

Where each row contains the class number of the correspoding row for the datapoint X. Note that, in our example there are 6 unique classes, namely, 0,1,4,8,10,11,12 .One hot encoding for the above array Y will be,

>>> one_hot(Y)
array([[0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0.]])

Where in each row has only one column set to 1 representing the class to which the datapoint belongs. For example, the first row has the last column set to 1. Which represents the column of the class ’12’. Which would mean we would also require the class to column mapping along with the one hot encoding. Something like,

array([[ 0],
       [ 1],
       [ 4],
       [ 8],
       [10],
       [11],
       [12]])

Here the row number of the class represents it’s column in the one hot encoding. It is simply a sorted order of all unique classes in Y.

The easy solution to the easy problem(For loop)

def one_hot_for(Y):
    data_size=Y.shape[0]
    classes=np.unique(Y).reshape(-1,1)
    num_classes=classes.shape[0]

    one_hot=np.zeros((data_size,num_classes))
    for row in range(data_size):
        one_hot[row,np.where(classes==Y[row])[0]]=1

    return one_hot,classes

The hard solution to the easy problem(vector(ish))

def one_hot(Y):
    data_size=Y.shape[0]
    classes=np.unique(Y).reshape(-1,1)
    num_classes=classes.shape[0]

    class_mappings=np.arange(0,max(Y)+1)
    class_mappings[np.unique(classes)]=np.arange(num_classes)
    Y=class_mappings[Y]

    one_hot=np.zeros((data_size,num_classes))
    one_hot[np.arange(data_size).reshape(-1,1),Y.reshape(-1,1)]=1
    class_col=np.sort(classes)
    return one_hot,class_col

Speed comparison

Generating random labels and storing in a file

import random

file_name="randoms.txt"
with open(file_name,"w+") as random_labels:
    for i in range(10000):
        random_labels.write(str(random.randint(0,1000))+"\n")

The above code will generate 10,000 random numbers and store them on individual lines in the file randoms.txt

Script for comparing the two functions

import matplotlib.pyplot as plt
import time
import numpy as np
from helper_functions import one_hot,one_hot_for

filename= "randoms.txt"

with open(filename,"r+") as f:
    Y=f.readlines()
    int_map=map(int,Y)
    Y=list(int_map)
    Y=np.asarray(Y).reshape(-1,1)

one_hot_timings=[]
one_hot_for_timings=[]
for i in range(100,10000,100):
    start=time.time()
    _,_=one_hot(Y[:i])
    end=time.time()
    one_hot_timings.append(end-start)

    start=time.time()
    _,_=one_hot_for(Y[:i])
    end=time.time()
    one_hot_for_timings.append(end-start)

plt.plot(one_hot_timings,label="one_hot_vector")
plt.plot(one_hot_for_timings,label="one_hot_for")
plt.xlabel('data_size for every 100 datapoints')
plt.ylabel('time of execution')
plt.legend(loc='best')
plt.show()

Result

one hot encoding algorithm execution time comparison
Comparison for time of execution for different one-hot encoding algos

Additional notes

  • Libraries such as scipy, torch, sklearn etc, could probably do this faster.
  • Depending on how often you call the above functions in your applications, it might not be relevant for you to choose between the above two methods as both need less than a few seconds at most.
  • The assimilated code for the above can be found here.

By attackonalgorithms

Techie, Poet, Gamer, Philosopher, Artist!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s