10 Best Keras Datasets for Building and Training Deep Learning Models

8 Mar 2023

Keras provides a high-level API that simplifies the process of building and training complex neural network models. With a wide range of pre-built layers and functions, developers can easily build and train deep learning models on large datasets using optimization algorithms. Keras also supports GPU acceleration for training and inference, making it a popular choice for both research and industry applications.

What are “Keras Datasets”?

Keras datasets are preprocessed datasets that can be loaded through the keras.datasets module. The data is downloaded and cached locally the first time a loader is called. These datasets are commonly used in the deep learning community for benchmarking models on tasks such as image classification, text classification, and regression. By leveraging these datasets, developers can experiment with different deep learning models and easily compare their performance.

This article looks at ten of the best Keras datasets for building and training deep learning models, all accessible to developers and researchers worldwide.

List of Keras Datasets

1. MNIST

The MNIST dataset is widely used in machine learning and computer vision. It consists of 70,000 grayscale images of handwritten digits 0–9, with 60,000 images for training and 10,000 for testing. Each image is 28x28 pixels in size and has a corresponding label denoting which digit it represents.

This dataset can be downloaded from Kaggle or loaded from Keras:

tf.keras.datasets.mnist.load_data(path="mnist.npz")
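As a minimal sketch of how the loaded arrays are often prepared for a network, the helper below scales pixel values to the range [0, 1] and adds a channel axis. The names prepare_images and load_mnist are illustrative and not part of Keras; load_mnist assumes TensorFlow is installed and that the download succeeds.

```python
import numpy as np


def prepare_images(images):
    """Scale uint8 pixels to floats in [0, 1] and append a channel axis."""
    scaled = images.astype("float32") / 255.0
    return scaled[..., np.newaxis]  # (n, 28, 28) -> (n, 28, 28, 1)


def load_mnist():
    """Load MNIST via Keras (assumes TensorFlow is installed)."""
    import tensorflow as tf  # imported lazily so prepare_images works without it

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    return (prepare_images(x_train), y_train), (prepare_images(x_test), y_test)
```

Calling load_mnist() returns the same (train, test) tuple structure as the Keras loader, with the images ready for a convolutional model.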

2. CIFAR-10

The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. It is split into 50,000 training images and 10,000 test images. The original distribution stores these as five training batches and one test batch, each with 10,000 images.

This dataset can be downloaded from Kaggle, or loaded from Keras:

tf.keras.datasets.cifar10.load_data()
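The Keras loader returns CIFAR-10 labels as integers in a column of shape (n, 1), which often needs flattening before use. The sketch below one-hot encodes such labels and maps them to class names. The helper names one_hot and class_names are illustrative; the class list follows the label order documented for CIFAR-10.

```python
import numpy as np

# Class names in label order (0-9), as documented for CIFAR-10.
CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]


def one_hot(labels, num_classes=10):
    """Convert integer labels of shape (n,) or (n, 1) to an (n, num_classes) matrix."""
    flat = np.asarray(labels).reshape(-1)
    encoded = np.zeros((flat.size, num_classes), dtype="float32")
    encoded[np.arange(flat.size), flat] = 1.0
    return encoded


def class_names(labels):
    """Map integer labels to their CIFAR-10 class names."""
    return [CIFAR10_CLASSES[int(i)] for i in np.asarray(labels).reshape(-1)]
```

One-hot labels pair with a categorical cross-entropy loss; the integer labels can instead be used directly with the sparse variant of that loss.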

3. CIFAR-100

The CIFAR-100 dataset has 60,000 32x32 colour images (50,000 training images and 10,000 test images) in 100 classes, with 600 images per class. The 100 classes are grouped into 20 super-classes. Each image has a fine label denoting its class and a coarse label denoting the super-class that it belongs to.

This dataset can be downloaded from Kaggle, or loaded from Keras:

tf.keras.datasets.cifar100.load_data(label_mode="fine")
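Since label_mode selects either fine or coarse labels, getting both for the same images takes two loader calls. The sketch below shows one way to do that. The function name load_fine_and_coarse and its load_data parameter are hypothetical; the parameter exists only so the function can be exercised without downloading the data.

```python
import numpy as np


def load_fine_and_coarse(load_data=None):
    """Return training images with both fine (0-99) and coarse (0-19) labels."""
    if load_data is None:
        # Assumes TensorFlow is installed; the data is downloaded on first use.
        from tensorflow.keras.datasets import cifar100
        load_data = cifar100.load_data

    (x_train, y_fine), _ = load_data(label_mode="fine")
    (_, y_coarse), _ = load_data(label_mode="coarse")
    # Both calls read the same images in the same order, so labels line up by index.
    return x_train, np.ravel(y_fine), np.ravel(y_coarse)
```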

4. Fashion-MNIST

The Fashion-MNIST dataset was created by Zalando Research as a replacement for the original MNIST dataset. It consists of 70,000 grayscale images of clothing items (a training set of 60,000 and a test set of 10,000).

The images are 28x28 pixels in size and represent 10 different classes of clothing items: T-shirts/tops, trousers, pullovers, dresses, coats, sandals, shirts, sneakers, bags, and ankle boots. The format matches the original MNIST dataset, but the classification task is more challenging because of the greater complexity and variety of the clothing items.

This dataset can be downloaded from Kaggle, or loaded from Keras:

tf.keras.datasets.fashion_mnist.load_data()

Fashion-MNIST images

5. IMDB

The IMDB dataset is commonly used for sentiment analysis tasks, where the goal is to classify the reviews as positive or negative based on their content. It consists of a collection of 50,000 movie reviews (training set of 25,000 and a test set of 25,000) from the Internet Movie Database website, split evenly between positive and negative reviews.

Each review in this dataset is a text document that has been preprocessed into a sequence of integers, where each integer represents a word in the review. The vocabulary can be limited with the num_words argument (for example, num_words=10000 keeps the 10,000 most frequent words), and any less frequent words are replaced with a special “unknown” token.

This dataset can be downloaded from Kaggle, or loaded from Keras:

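import tensorflow as tf

# Default arguments shown explicitly; set num_words=10000 to cap the vocabulary.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(
    path="imdb.npz",
    num_words=None,
    skip_top=0,
    maxlen=None,
    seed=113,
    start_char=1,
    oov_char=2,
    index_from=3,
)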
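To see how the integer encoding relates to words, the sketch below decodes a review back to text. It assumes the default start_char=1, oov_char=2 and index_from=3, and that word_index maps each word to its frequency rank, as returned by imdb.get_word_index(). The function names and the placeholder tokens such as <UNK> are illustrative, not part of Keras.

```python
def decode_review(encoded, word_index, index_from=3):
    """Map an integer-encoded review back to a space-separated string.

    word_index maps word -> frequency rank (1 = most frequent); the loader
    shifts every rank by index_from to make room for the special tokens.
    """
    reverse = {rank + index_from: word for word, rank in word_index.items()}
    special = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}
    tokens = []
    for index in encoded:
        if index in special:
            tokens.append(special[index])
        else:
            tokens.append(reverse.get(index, "<UNK>"))
    return " ".join(tokens)


def decode_first_training_review(num_words=10000):
    """Load IMDB and decode its first review (assumes TensorFlow is installed)."""
    from tensorflow.keras.datasets import imdb

    (x_train, y_train), _ = imdb.load_data(num_words=num_words)
    return decode_review(x_train[0], imdb.get_word_index()), int(y_train[0])
```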

6. Boston Housing

The Boston Housing dataset contains information about housing in the Boston area. It consists of 506 instances (404 training and 102 test instances), with 13 attributes for each instance.

The attributes have a mix of quantitative and categorical variables, such as the average number of rooms per dwelling, per capita crime rate and the proportion of non-retail business acres per town.

This dataset can be downloaded from Kaggle, or loaded from Keras:

tf.keras.datasets.boston_housing.load_data(
    path="boston_housing.npz", test_split=0.2, seed=113
)
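Because the attributes are measured on very different scales, a common preprocessing step for regression is to standardize each feature using statistics from the training set only. The following is a minimal sketch of that step; the function name standardize is illustrative and not part of Keras.

```python
import numpy as np


def standardize(train, test):
    """Standardize features column-wise using the training mean and std."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (train - mean) / std, (test - mean) / std
```

The test set is transformed with the training statistics so that no information from the test data leaks into preprocessing.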

7. Wine Quality

The Wine Quality dataset contains information on red and white wine samples. The goal of this dataset is to classify the quality of the wine based on chemical properties like pH, density, alcohol content and citric acid content.

The variables in this dataset include:

Fixed Acidity - The amount of fixed acids in the wine, expressed in g/dm^3.
Volatile Acidity - The amount of volatile acids in the wine, expressed in g/dm^3.
Citric Acid - The amount of citric acid in the wine, expressed in g/dm^3.
Residual Sugar - The amount of residual sugar in the wine, expressed in g/dm^3.
Chlorides - The amount of chloride in the wine, expressed in g/dm^3.
Free Sulfur Dioxide - The amount of free sulfur dioxide in the wine, expressed in mg/dm^3.
Total Sulfur Dioxide - The amount of total sulfur dioxide in the wine, expressed in mg/dm^3.
Density - The density of the wine, expressed in g/cm^3.
pH - The pH level of the wine.
Sulphates - The amount of sulphates in the wine, expressed in g/dm^3.
Alcohol - The alcohol content of the wine, expressed in % vol.
Quality - The quality rating of the wine, on a scale of 0 to 10.

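Keras does not ship a loader for this dataset (there is no wine_quality module in keras.datasets). You can download the dataset here and load the CSV file with pandas:

import pandas as pd

# Hypothetical local path; assumes the UCI layout (";" separator, "quality" column).
wine = pd.read_csv("path/to/winequality-red.csv", sep=";")
X = wine.drop(columns="quality").to_numpy()
y = wine["quality"].to_numpy()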
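For tabular data held in NumPy arrays, a reproducible 80/20 train/test split can be sketched as follows. The function name split_arrays and its defaults (test_split=0.2 and seed=113, chosen to mirror the argument names of the Keras loaders) are illustrative assumptions, not a Keras API.

```python
import numpy as np


def split_arrays(features, labels, test_split=0.2, seed=113):
    """Shuffle rows with a fixed seed and split them into train and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(len(features) * test_split)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (
        (features[train_idx], labels[train_idx]),
        (features[test_idx], labels[test_idx]),
    )
```

The same index permutation is applied to features and labels, so each row keeps its label after shuffling.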

8. Reuters Newswire

The Reuters Newswire dataset is a preprocessed version of the original Reuters dataset, with the text encoded as sequences of integers. It consists of 11,228 news articles with a vocabulary of 30,979 words.

Each article is classified into one of 46 different topics like “corn”, “crude”, “earnings” and “acquisitions”.

You can download the dataset from Kaggle, or it can be loaded from Keras:

import tensorflow as tf
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.reuters.load_data(
    path="reuters.npz", num_words=None, skip_top=0, maxlen=None,
    test_split=0.2, seed=113, start_char=1, oov_char=2, index_from=3)
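Since each article is a variable-length sequence of integers, one simple way to feed it to a dense network is a fixed-size multi-hot vector that marks which word indices occur. The sketch below assumes the data was loaded with num_words equal to dimension, so that no index falls outside the vector; the function name vectorize_sequences is illustrative.

```python
import numpy as np


def vectorize_sequences(sequences, dimension=10000):
    """Encode integer sequences as multi-hot rows of length `dimension`."""
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for row, sequence in enumerate(sequences):
        # Mark every word index that appears in this sequence.
        results[row, np.asarray(sequence, dtype=int)] = 1.0
    return results
```

This representation discards word order and counts, which is often acceptable for a first topic-classification baseline.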

9. Pima Indians Diabetes

This dataset consists of medical data about Pima Indian women: number of pregnancies, glucose level, blood pressure, skin thickness, insulin level, BMI, diabetes pedigree function and age. It contains 768 samples with 8 input variables and 1 output variable that indicates whether the patient has diabetes.

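Keras does not ship a loader for this dataset (there is no pima_indians_diabetes module in keras.datasets). The Pima Indians Diabetes dataset can be downloaded on Kaggle and loaded from the CSV file with NumPy:

import numpy as np

# Hypothetical local path; assumes one header row and the label in the last column.
data = np.loadtxt("path/to/diabetes.csv", delimiter=",", skiprows=1)
x, y = data[:, :8], data[:, 8]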

10. Dogs vs Cats

The Dogs vs Cats dataset consists of 25,000 labelled images of dogs and cats, with 12,500 images of each class. These images were collected from various sources with varying sizes and quality.

Keras has no built-in loader for this dataset. You can download it from Kaggle and load the images from a local directory with Keras:

# Import the necessary Keras libraries:
from keras.preprocessing.image import ImageDataGenerator

# Set the paths to the training and validation directories:
train_dir = 'path/to/train'
validation_dir = 'path/to/validation'

# Define an ImageDataGenerator object to perform data augmentation and normalization:
train_datagen = ImageDataGenerator(rescale=1./255,
                                   rotation_range=40,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True)

# The validation images are only rescaled, not augmented:
validation_datagen = ImageDataGenerator(rescale=1./255)

# Use flow_from_directory to load batches from each directory:
train_generator = train_datagen.flow_from_directory(train_dir,
                                                    target_size=(150, 150),
                                                    batch_size=32,
                                                    class_mode='binary')

validation_generator = validation_datagen.flow_from_directory(validation_dir,
                                                              target_size=(150, 150),
                                                              batch_size=32,
                                                              class_mode='binary')

# flow_from_directory returns a DirectoryIterator that yields batches of preprocessed images and labels.

Note that in the above code, we are using data augmentation to create variations of the training images to help prevent overfitting. The validation data is not augmented.
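flow_from_directory infers labels from subdirectory names, so each of the training and validation directories needs one folder per class. The sketch below copies a flat folder of images into that layout. It assumes files are named like cat.0.jpg and dog.0.jpg, which is how the Kaggle archive is commonly distributed; the function name and folder names are hypothetical.

```python
import shutil
from pathlib import Path


def sort_into_class_dirs(source_dir, target_dir, classes=("cat", "dog")):
    """Copy files named '<class>.<n>.jpg' into '<target>/<class>s/' folders."""
    source, target = Path(source_dir), Path(target_dir)
    counts = {}
    for name in classes:
        class_dir = target / f"{name}s"  # e.g. train/cats, train/dogs
        class_dir.mkdir(parents=True, exist_ok=True)
        files = sorted(source.glob(f"{name}.*.jpg"))
        for file in files:
            shutil.copy2(file, class_dir / file.name)
        counts[name] = len(files)
    return counts
```

The returned counts give a quick check that both classes were found before training starts.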

Dogs vs Cats images

Common Use Cases for Keras Datasets

MNIST - Handwritten digit recognition.

CIFAR-10 - Object recognition in images.

CIFAR-100 - Object recognition in images (more fine-grained than CIFAR-10).

Fashion-MNIST - Clothing item recognition.

IMDB - Sentiment analysis on movie reviews.

Boston Housing - Regression on housing prices.

Wine Quality - Classification of wine quality.

Reuters Newswire - Topic classification of news articles.

Pima Indians Diabetes - Binary classification of diabetes in Pima Indian women.

Dogs vs Cats - Binary classification of images of dogs and cats.

Final Thoughts

Keras datasets are a valuable resource for machine learning practitioners and researchers. They save time and effort in data collection and preprocessing, leaving more room for model development and experimentation.

These Keras datasets are also available for anyone to download and use freely.
