Creating a dataset for image classification

In this article we will go through the necessary steps of building a dataset for an image classification task. When you get started with deep learning, most of the ‘Hello World’ tutorials use datasets provided by the framework (MNIST, Fashion-MNIST, etc.). Usually, at that point, you don’t spend too much time worrying about data.

However, when building real-life applications, obtaining and preparing the data is a big part of the task. This post presents such a situation: we have two folders containing pictures of cats and dogs respectively. Our objective is to train a neural network that, given any picture of a cat or a dog from the internet, can distinguish between the two.


Downloading the images

The cat/dog images can be downloaded from here, but feel free to try with your own images as well.

The structure of the /data folder looks like this:

data/
    PetImages/
        Cat/    (.jpg images of cats)
        Dog/    (.jpg images of dogs)


The names of the images and the number of categories don’t matter, as long as you group the images belonging to the same category in a subfolder.


Identifying the categories

The first goal is to extract the image categories from the directory. This can be easily achieved in Python using the function os.walk(directory_name):

import os

directory = 'data/PetImages/'

# Get the labels from the directory: the first entry yielded by os.walk
# lists the immediate subfolders, one per category
labels = next(os.walk(directory))[1]   # e.g. ['Cat', 'Dog']

# Sort the labels to be consistent
labels = sorted(labels)
NUM_LABELS = len(labels)

# Build a dictionary mapping each label to its index
label_indexes = {label: index for index, label in enumerate(labels)}
print(label_indexes)


For the cat/dog images, this should output:

{'Cat': 0, 'Dog': 1}
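To try this step without downloading the dataset, the sketch below builds a throwaway folder tree and derives the same mapping from it. The helper name find_labels and the temporary layout are assumptions made for illustration; it uses os.scandir rather than os.walk, so only the top level of the directory is listed.

```python
import os
import tempfile


def find_labels(directory):
    """Return {category_name: index} for the immediate subfolders of `directory`."""
    # Only the first level matters: each subfolder is one category
    names = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
    return {name: index for index, name in enumerate(names)}


# Hypothetical layout for demonstration: <tmp>/Dog and <tmp>/Cat
with tempfile.TemporaryDirectory() as root:
    for category in ("Dog", "Cat"):
        os.makedirs(os.path.join(root, category))
    demo_label_indexes = find_labels(root)

print(demo_label_indexes)
```

Sorting the names before numbering them is what keeps the indexes stable between runs, regardless of the order in which the file system lists the folders.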

Extracting the filenames

When working with thousands of images, it is often impractical to keep them all in memory as NumPy arrays. At this point, we are not interested in the actual pixels, only in the category and the location of each image. The image content can be read right before feeding it to the neural network.

For extracting the filenames, we are going to use Python’s glob module, which finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. In this case, the function will look into the data folder for any file with the extension .jpg. By setting the parameter recursive to True, the ** in the pattern also matches subfolders, so the look-up happens there as well.

from random import shuffle
import glob

# Get the file paths
data_files = glob.glob(directory + '**/*.jpg', recursive=True)

# Shuffle the data so that the categories are mixed
shuffle(data_files)

num_data_files = len(data_files)
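One detail worth keeping in mind: on case-sensitive file systems the pattern *.jpg does not match files ending in .JPG or .jpeg. A possible way to cover several extensions, sketched here with pathlib instead of glob, is to walk the tree and filter on the lowercased suffix. The helper list_images and the extension set are assumptions for illustration, not part of the original code.

```python
import tempfile
from pathlib import Path

# Assumed set of extensions; adjust it to whatever your dataset contains
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def list_images(directory):
    """Recursively collect image paths under `directory`, as sorted strings."""
    return sorted(
        str(path)
        for path in Path(directory).rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


# Demonstration on a throwaway tree with empty placeholder files
with tempfile.TemporaryDirectory() as root:
    for relative in ("Cat/1.jpg", "Cat/2.JPG", "Dog/3.jpeg", "Dog/notes.txt"):
        target = Path(root, relative)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    demo_names = [Path(p).name for p in list_images(root)]

print(demo_names)
```

The text file is skipped, while the three image files are kept whatever the case of their extension.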

Adding the labels

Since all images from a category are placed in the same subfolder, it’s easy to extract the label from the filepath. Let’s build a label array, which will contain the category for each image in data_files:

data_labels = []

# Build the labels
for file in data_files:
    # file looks like data/PetImages/{category}/image_name.jpg, so the
    # category is the third path component (with '/' as the separator)
    label = file.split('/')[2]
    data_labels.append(label_indexes[label])

assert num_data_files == len(data_labels)
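Splitting on '/' and picking a fixed position depends on how deep the data folder sits and on the path separator of the operating system. A slightly more defensive sketch takes the name of the folder that directly contains the file. The helper label_from_path and the two sample paths are hypothetical.

```python
from pathlib import PurePath


def label_from_path(file_path):
    """Return the name of the folder that directly contains `file_path`."""
    return PurePath(file_path).parent.name


# Same mapping as in the article, repeated so the snippet runs on its own
label_indexes = {"Cat": 0, "Dog": 1}

demo_files = ["data/PetImages/Cat/1.jpg", "data/PetImages/Dog/7.jpg"]
demo_labels = [label_indexes[label_from_path(f)] for f in demo_files]

print(demo_labels)
```

With this approach the data folder can be moved or nested differently without touching the label extraction.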

One hot encoding

For our cat/dog classifier, the numerical representation of the labels is pretty straightforward: 0 for cats and 1 for dogs. The neural network will output a single number between 0 and 1, telling us how confident it is that the picture shows a dog. If the value leans towards 0 (let’s say < 0.5), then we probably showed it a picture of a cat.

Output: 0.2 
This means that the network is 80% sure that this is a cat
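Reading such an output can be written as a small helper. This is only a sketch: the function name interpret_binary_output and the 0.5 threshold are assumptions chosen for illustration.

```python
def interpret_binary_output(dog_probability, threshold=0.5):
    """Turn a single dog probability into a (label, confidence) pair."""
    if dog_probability >= threshold:
        return "Dog", dog_probability
    # The confidence for a cat is the complement of the dog probability
    return "Cat", 1.0 - dog_probability


label, confidence = interpret_binary_output(0.2)
print(f"{label} ({confidence:.0%} confident)")
```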

With this binary approach, the network tells us whether a picture leans towards a cat or a dog. However, when we have more than two categories, things get a bit more complicated.

Let’s say, for example, that we want to train the network to tell us if the picture contains a cat, a dog or neither of them (a third category).

To understand how to better represent the labels as numerical structures, let’s have a look at how a modern neural network makes predictions.

The final layer of the network usually contains one neuron for each class. The value of that neuron tells us how confident the algorithm is that the input belongs to that class (on a scale from 0 to 1). The neuron with the highest activation gives the prediction, but we would also like to know the probability for that class (how confident the neural network is that it is correct).

For this we use a softmax function. A softmax function takes all the input values (the activations in the final layer), exponentiates them and divides each result by their sum, so that the outputs add up to 1 while larger activations keep larger shares. This way, we know the probability for each category.

Sample images

First image 
Last layer activation:     0.80   0.30   0.50
Softmax                    0.43   0.26   0.32 (43% sure it's a cat)
Correct prediction         1      0      0 

Second image 
Last layer activation:     0.20   0.70   0.20
Softmax                    0.27   0.45   0.27 (45% sure it's a dog)
Correct prediction         0      1      0 

Third image 
Last layer activation:     0.30   0.40   0.40
Softmax                    0.31   0.34   0.34 (tie at 34% between dog and neither)
Correct prediction         0      0      1

As you can see, the network favors the correct class for the first two images and cannot decide between a dog and neither for the third; in every case its confidence is modest.
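For reference, the standard softmax exponentiates each activation and then divides by the sum of the exponentials. Below is a minimal NumPy sketch applied to the three activation vectors above; the function name softmax is simply a choice made for this example.

```python
import numpy as np


def softmax(activations):
    """Exponentiate the activations and normalize them so they sum to 1."""
    activations = np.asarray(activations, dtype=float)
    # Subtracting the maximum does not change the result but avoids overflow
    shifted = np.exp(activations - activations.max())
    return shifted / shifted.sum()


# Last-layer activations of the three sample images
samples = [[0.80, 0.30, 0.50], [0.20, 0.70, 0.20], [0.30, 0.40, 0.40]]
for activations in samples:
    print(np.round(softmax(activations), 2))
```

Because of the exponential, differences between activations matter rather than their ratios: adding the same constant to every activation leaves the probabilities unchanged.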

Ideally, we would like the output for the correct class to be as close to 1 as possible and the other outputs to be near 0. A network that outputs similar values for multiple classes is not confident in its prediction.

The correct label for an image should then be a vector containing zeros for all classes except the correct one (which has the value one). This representation is named one-hot encoding.

Let’s transform our labels into one hot:

import numpy as np

def one_hot(label_array, num_classes):
  return np.squeeze(np.eye(num_classes)[label_array.reshape(-1)])

data_labels = np.array(data_labels)

# For our Cat/Dogs datasets we have 2 categories 
data_labels = one_hot(data_labels, 2) 
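To see what the function produces, here is a small sketch on four made-up labels. It also shows that np.argmax recovers the original class index, which is handy when reading predictions. The one_hot function is repeated (without the squeeze) so that the snippet runs on its own.

```python
import numpy as np


def one_hot(label_array, num_classes):
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes)[np.asarray(label_array).reshape(-1)]


example_labels = np.array([0, 1, 1, 0])   # made-up labels: cat, dog, dog, cat
encoded = one_hot(example_labels, 2)
decoded = np.argmax(encoded, axis=1)      # back to class indexes

print(encoded)
print(decoded)
```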

Train/Test split

Most of the images we have are going to be used for training the network, but we should put aside a small percentage of them for validation purposes. In this case, we chose 15%.

# TRAIN/TEST split

# The percentage of the data which will be used in the test set
TRAIN_TEST_SPLIT = 0.15

nr_test_data = int(num_data_files * TRAIN_TEST_SPLIT)

train_data_files = data_files[nr_test_data:]
test_data_files = data_files[:nr_test_data]

train_labels = data_labels[nr_test_data:]
test_labels = data_labels[:nr_test_data]

assert len(train_labels) + len(test_labels) == num_data_files
assert len(test_data_files) + len(train_data_files) == num_data_files
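Because the split is a plain slice, it is only fair if the file list was shuffled beforehand. A quick sanity check, sketched here with a hypothetical helper class_counts and made-up one-hot labels, is to count how many examples of each class land in each part.

```python
import numpy as np


def class_counts(one_hot_labels):
    """Number of examples per class, given an array of one-hot rows."""
    return np.asarray(one_hot_labels).sum(axis=0).astype(int)


# Made-up data: 20 one-hot labels, 10 per class, in shuffled order
rng = np.random.default_rng(seed=0)
demo_labels = np.eye(2)[np.arange(20) % 2]
demo_labels = demo_labels[rng.permutation(len(demo_labels))]

nr_test = int(len(demo_labels) * 0.15)
demo_test, demo_train = demo_labels[:nr_test], demo_labels[nr_test:]

print("train:", class_counts(demo_train), "test:", class_counts(demo_test))
```

If one class is missing from the test part, or heavily over-represented in it, the list was probably not shuffled or the dataset is too small for the chosen percentage.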

Saving/restoring the data

We are now almost done – we have a test/train dataset containing the filename and the category for each of our images.

Up to this point, there is no need to execute this process more than once (unless we add more images). So we can save our dataset in a file and restore it any time we want to train the model.

data_file = 'data.npz'

# Save data to file
np.savez(data_file,
         train_data_files=train_data_files,
         test_data_files=test_data_files,
         train_labels=train_labels,
         test_labels=test_labels)

# Reload the data from file
data = np.load(data_file)

# the filenames
train_data_files = data['train_data_files']
test_data_files = data['test_data_files']

# the labels 
train_labels = data['train_labels']
test_labels = data['test_labels']

NUM_LABELS = train_labels.shape[1]
TRAINING_SIZE = train_labels.shape[0]
TEST_SIZE = test_labels.shape[0]
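A round trip on tiny made-up arrays shows how np.savez and np.load fit together. The key names files and labels, the sample paths and the temporary file are assumptions for this sketch only.

```python
import os
import tempfile

import numpy as np

# Made-up stand-ins for the real arrays
demo_files = np.array(["data/PetImages/Cat/1.jpg", "data/PetImages/Dog/2.jpg"])
demo_labels = np.eye(2)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.npz")
    np.savez(demo_path, files=demo_files, labels=demo_labels)

    # np.load on an .npz file returns a lazy, dict-like object;
    # the context manager closes the file when we are done
    with np.load(demo_path) as restored:
        restored_files = restored["files"]
        restored_labels = restored["labels"]

print(restored_files.dtype, restored_labels.shape)
```

The filenames come back as a NumPy array of strings, which is what allows indexing them with an array of indexes later on.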

Creating batches

If mini-batch gradient descent is used, we feed the network a random subset of the training data at each step. This time, we would like to return the actual pixels of the image.

import cv2

# Side length, in pixels, of the square images fed to the network
# (assumed value; use whatever your model expects)
IMAGE_SIZE = 128


def read_image(filename):
    image = cv2.imread(filename)
    resized_image = cv2.resize(image,
                               dsize=(IMAGE_SIZE, IMAGE_SIZE))
    return resized_image

# Creating batches
def get_random_batch(batch_size=128):
    # Extract a random subset of indexes from the training set
    indexes_subset = np.random.randint(0, train_data_files.shape[0],
                                       batch_size)
    filename_batch = train_data_files[indexes_subset]
    image_batch = [read_image(file) for file in filename_batch]
    label_batch = train_labels[indexes_subset]
    return image_batch, label_batch

And that’s it! Every call to get_random_batch() gives us a random sample of data which we can feed directly to our model.
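Sampling indexes independently for every batch means that some images may be drawn several times before others are seen at all. A common alternative, sketched below with a hypothetical generator iterate_epoch, is to shuffle the indexes once per epoch and walk through them in consecutive slices. The sketch only yields indexes, so it runs without any image files; in practice they would be used to index train_data_files and train_labels.

```python
import numpy as np


def iterate_epoch(num_examples, batch_size=128, rng=None):
    """Yield arrays of indexes that together cover every example exactly once."""
    rng = np.random.default_rng() if rng is None else rng
    shuffled = rng.permutation(num_examples)
    for start in range(0, num_examples, batch_size):
        # The final batch may be smaller than batch_size
        yield shuffled[start:start + batch_size]


batches = list(iterate_epoch(num_examples=10, batch_size=4,
                             rng=np.random.default_rng(seed=42)))
print([len(batch) for batch in batches])
```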

A faster alternative in TensorFlow

TensorFlow provides a Dataset API which can help you create complex pipelines for data. You can see an example on my GitHub account, or see it live in action in this blog post about detecting sign language.

A way faster alternative in Keras

Keras provides ImageDataGenerator in its preprocessing package, which not only takes the heavy lifting of preprocessing off your shoulders, but also provides the means to easily expand the dataset with data augmentation (rotation, shearing, horizontal/vertical flip).

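from keras.preprocessing.image import ImageDataGenerator

# Side length of the images fed to the network (assumed value)
IMG_SIZE = 128

# Augmentation options (e.g. rotation_range, shear_range, horizontal_flip)
# can be passed to the training generator alongside rescale
train_datagen = ImageDataGenerator(rescale=1./255)

test_datagen = ImageDataGenerator(rescale=1./255)

training_set = train_datagen.flow_from_directory('data_train',
                                                 target_size=(IMG_SIZE, IMG_SIZE))

test_set = test_datagen.flow_from_directory('data_test',
                                            target_size=(IMG_SIZE, IMG_SIZE))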


Preprocessing images is definitely not the most appealing part of building machine learning models, but that doesn’t make it less important. The silver lining is that you usually go through this process only once when training a model. As you saw in this article, you don’t necessarily have to rely on a framework for this process as long as you have a good template to build upon. You can find the source code here.

Let me know your feedback or ideas for improvement in the comments!
