16 Best Sklearn Datasets for Building Machine Learning Models

15 Apr 2023

Data powers machine learning algorithms and scikit-learn or sklearn offers high quality datasets that are widely used by researchers, practitioners and enthusiasts. Scikit-learn (sklearn) is a Python module for machine learning built on top of SciPy. It is unique due to its wide range of algorithms, ease of use and integration with other Python libraries.

What are “Sklearn Datasets”?

Sklearn datasets are included as part of the scikit-learn (sklearn) library, so they come pre-installed with the library. Due to this, you can easily access and load these datasets, without having to download them separately.

To use a specific dataset, you can simply import it from sklearn.datasets module and call the appropriate function to load the data into your program.

These datasets are usually pre-processed and ready to use, which saves time and effort for data practitioners who need to experiment with different machine learning models and algorithms.

Complete List of Datasets in the Sklearn Library

Iris
Diabetes
Digits
Linnerud
Wine
Breast Cancer Wisconsin
Boston Housing
Olivetti Faces
California Housing
MNIST
Fashion-MNIST
make_classification
make_regression
make_blobs
make_moons and make_circles
Make_sparse_coded_signal

Pre-Installed(Toy) Sklearn Datasets

1. Iris

This dataset includes measurements of the sepal length, sepal width, petal length and petal width of 150 iris flowers, which belong to 3 different species: setosa, versicolor and virginica. The iris dataset has 150 rows and 5 columns, which are stored as a dataframe, including a column for the species of each flower.

The variables include:

Sepal.Length - The sepal.length represents the length of the sepal in centimetres.
Sepal.Width - The sepal.width represents the width of the sepal in centimetres.
Petal.Length - The petal.length represents the length of the petal in centimetres.
Species - The species variable represents the species of the iris flower, with three possible values: setosa, versicolor and virginica.

You can load the iris dataset directly from sklearn using the load_iris function from the sklearn.datasets module.

# To install sklearn
pip install scikit-learn

# To import sklearn
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()

# Print the dataset description
print(iris.describe())

Code for loading the Iris dataset using sklearn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html on 27/3/2023.

2. Diabetes

This sklearn dataset contains information on 442 patients with diabetes, including demographic and clinical measurements:

Age
Sex
Body mass index (BMI)
Average blood pressure
Six blood serum measurements (e.g. total cholesterol, low-density lipoprotein (LDL) cholesterol, high-density lipoprotein (HDL) cholesterol).
A quantitative measure of diabetes disease progression (HbA1c).

The Diabetes dataset can be loaded using the load_diabetes() function from the sklearn.datasets module.

from sklearn.datasets import load_diabetes

# Load the diabetes dataset
diabetes = load_diabetes()

# Print some information about the dataset
print(diabetes.describe())

Code for loading the Diabetes dataset using sklearn. Retrieved from https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset on 28/3/2023.

3. Digits

This sklearn dataset is a collection of hand-written digits from 0 to 9, stored as grayscale images. It contains a total of 1797 samples, with each sample is a 2D array of shape (8,8). There are 64 variables (or features) in the digits sklearn dataset, corresponding to the 64 pixels in each digit image.

The Digits dataset can be loaded using the load_digits() function from the sklearn.datasets module.

from sklearn.datasets import load_digits

# Load the digits dataset
digits = load_digits()

# Print the features and target data
print(digits.data)
print(digits.target)

Code for loading the Digits dataset using sklearn. Retrieved from https://scikit-learn.org/stable/datasets/toy_dataset.html#optical-recognition-of-handwritten-digits-dataset on 29/3/2023.

4. Linnerud

The Linnerud dataset contains physical and physiological measurements of 20 professional athletes.

The dataset includes the following variables:

Three physical exercise variables - chin-ups, sit-ups, and jumping jacks.
Three physiological measurement variables - pulse, systolic blood pressure, and diastolic blood pressure.

To load the Linnerud dataset in Python using sklearn:

from sklearn.datasets import load_linnerud
linnerud = load_linnerud()

Code for loading the linnerud dataset using sklearn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_linnerud.html#sklearn.datasets.load_linnerud on 27/3/2023.

5. Wine

This sklearn dataset contains the results of chemical analyses of wines grown in a specific area of Italy, to classify the wines into their correct varieties.

Some of the variables in the dataset:

Alcohol
Malic acid
Ash
Alkalinity of ash
Magnesium
Total phenols
Flavanoids

The Wine dataset can be loaded using the load_wine() function from the sklearn.datasets module.

from sklearn.datasets import load_wine

# Load the Wine dataset
wine_data = load_wine()

# Access the features and targets of the dataset
X = wine_data.data  # Features
y = wine_data.target  # Targets

# Access the feature names and target names of the dataset
feature_names = wine_data.feature_names
target_names = wine_data.target_names

Code for loading the Wine Quality dataset using sklearn. Retrieved from https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-recognition-dataset on 28/3/2023.

6. Breast Cancer Wisconsin Dataset

This sklearn dataset consists of information about breast cancer tumours and was initially created by Dr. William H. Wolberg. The dataset was created to assist researchers and machine learning practitioners in classifying tumours as either malignant(cancerous) or benign (non-cancerous).

Some of the variables included in this dataset:

ID number
Diagnosis (M = malignant, B = benign).
Radius (the mean of distances from the centre to points on the perimeter).
Texture (the standard deviation of gray-scale values).
Perimeter
Area
Smoothness (the local variation in radius lengths).
Compactness (the perimeter^2 / area - 1.0).
Concavity (the severity of concave portions of the contour).
Concave points (the number of concave portions of the contour).
Symmetry
Fractal dimension ("coastline approximation" - 1).

You can load the Breast Cancer Wisconsin dataset directly from sklearn using the load_breast_cancer function from the sklearn.datasets module.

from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer Wisconsin dataset
cancer = load_breast_cancer()

# Print the dataset description
print(cancer.describe())

Code for loading the Breast Cancer Wisconsin dataset using sklearn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html on 28/3/2023.

Breast Cancer Wisconsin dataset

Real World Sklearn Datasets

Real world sklearn datasets are based on real-world problems, commonly used to practice and experiment with machine learning algorithms and techniques using the sklearn library in Python.

7. Boston Housing

The Boston Housing dataset consists of information on housing in the area of Boston, Massachusetts. It has about 506 rows and 14 columns of data.

Some of the variables in the dataset include:

CRIM - Per capita crime rate by town.
ZN - The proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - The proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
NOX - The nitric oxide concentration (parts per 10 million).
RM - The average number of rooms per dwelling.
AGE - The proportion of owner-occupied units built prior to 1940.
DIS - The weighted distances to five Boston employment centres.
RAD - The Index of accessibility to radial highways.
TAX - The full-value property-tax rate per $10,000.
PTRATIO - The pupil-teacher ratio by town.
B - 1000(Bk - 0.63)^2 where -Bk is the proportion of blacks by town.
LSTAT - The percentage lower status of the population.
MEDV - The median value of owner-occupied homes in $1000's.

You can load the Boston Housing dataset directly from scikit-learn using the load_boston function from the sklearn.datasets module.

from sklearn.datasets import load_boston

# Load the Boston Housing dataset
boston = load_boston()

# Print the dataset description
print(boston.describe())

Code for loading the Boston Housing dataset using sklearn. Retrieved from https://scikit-learn.org/0.15/modules/generated/sklearn.datasets.load_boston.html on 29/3/2023.

8. Olivetti Faces

The Olivetti Faces dataset is a collection of grayscale images of human faces taken between April 1992 and April 1994 at AT&T Laboratories. It contains 400 images of 10 individuals, with each individual having 40 images shot at different angles and different lighting conditions.

You can load the Olivetti Faces dataset in sklearn by using the fetch_olivetti_faces function from the datasets module.

from sklearn.datasets import fetch_olivetti_faces

# Load the dataset
faces = fetch_olivetti_faces()

# Get the data and target labels
X = faces.data
y = faces.target

Code for loading the Olivetti Faces dataset using sklearn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_olivetti_faces.html on 29/3/2023.

9. California Housing

This sklearn dataset contains information on median house values, as well as attributes for census tracts in California. It also includes 20,640 instances and 8 features.

Some of the variables in the dataset:

MedInc - The median income in block.
HouseAge - The median age of houses in block.
AveRooms - The average number of rooms per household.
AveBedrms - The average number of bedrooms per household.
Population - The block population.
AveOccup - The average household occupancy.
Latitude - The latitude of the block in decimal degrees.
Longitude - The longitude of the block in decimal degrees.

You can load the California Housing dataset using the fetch_california_housing function from sklearn.

from sklearn.datasets import fetch_california_housing

# Load the dataset
california_housing = fetch_california_housing()

# Get the features and target variable
X = california_housing.data
y = california_housing.target

Code for loading the California Housing dataset using sklearn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html on 29/3/2023.

10. MNIST

The MNIST dataset is popular and widely used in the fields of machine learning and computer vision. It consists of 70,000 grayscale images of handwritten digits 0–9, with 60,000 images for training and 10,000 for testing. Each image is 28x28 pixels in size and has a corresponding label denoting which digits it represents.

You can load the MNIST dataset from sklearn using the following code:

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')

Note: The MNIST dataset is a subset of the Digits dataset.

Code for loading the MNIST dataset using sklearn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html#sklearn.datasets.fetch_openml on 30/3/2023.

11. Fashion-MNIST

The Fashion MNIST dataset was created by Zalando Research as a replacement for the original MNIST dataset. The Fashion MNIST dataset consists of 70,000 grayscale images(training set of 60,000 and a test set of 10,000) of clothing items.

The images are 28x28 pixels in size and represent 10 different classes of clothing items, including T-shirts/tops, trousers, pullovers, dresses, coats, sandals, shirts, sneakers, bags, and ankle boots. It is similar to the original MNIST dataset, but with more challenging classification tasks due to the greater complexity and variety of the clothing items.

You can load this sklearn dataset using the fetch_openml function.

from sklearn.datasets import fetch_openml

fmnist = fetch_openml(name='Fashion-MNIST')

Code for loading the Fashion MNIST dataset using sklearn. Retrieved from__https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html#sklearn.datasets.fetch_openml__ on 30/3/2023.

Generated Sklearn Datasets

Generated sklearn datasets are synthetic datasets, generated using the sklearn library in Python. They are used for testing, benchmarking and developing machine learning algorithms/models.

12. make_classification

This function generates a random n-class classification dataset with a specified number of samples, features, and informative features.

Here's an example code to generate this sklearn dataset with 100 samples, 5 features, and 3 classes:

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=5, n_informative=3, n_classes=3, random_state=42)

This code generates a dataset with 100 samples and 5 features, with 3 classes and 3 informative features. The remaining features will be redundant or noise.

Code for loading the make_classification dataset using sklearn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification on 30/3/2023.

13. make_regression

This function generates a random regression dataset with a specified number of samples, features, and noise.

Here's an example code to generate this sklearn dataset with 100 samples, 5 features, and noise level of 0.1:

from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

This code generates a dataset with 100 samples and 5 features, with a noise level of 0.1. The target variable y will be a continuous variable.

Code for loading the make_regression dataset using sklearn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html#sklearn.datasets.make_regression on 30/3/2023.

14. make_blobs

This function generates a random dataset with a specified number of samples and clusters.

Here's an example code to generate this sklearn dataset with 100 samples and 3 clusters:

from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=3, random_state=42)

This code generates a dataset with 100 samples and 2 features (x and y coordinates), with 3 clusters centred at random locations, and with no noise.

Code for loading the make_blobs dataset using sklearn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs on 30/3/2023.

15. make_moons and make_circles

These functions generate datasets with non-linear boundaries that are useful for testing non-linear classification algorithms.

Here's an example code for loading the make_moons dataset:

from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)

This code generates a dataset with 1000 samples and 2 features (x and y coordinates) with a non-linear boundary between the two classes, and with 0.2 standard deviations of Gaussian noise added to the data.

Code for loading the make_moons dataset using sklearn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html#sklearn.datasets.make_moons on 30/3/2023.

Here's an example code to generate and load the make_circles dataset:

from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000, noise=0.05, random_state=42)

Code for loading the make_circles dataset using sklearn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html#sklearn.datasets.make_circles on 30/3/2023.

16. make_sparse_coded_signal

This function generates a sparse coded signal dataset that is useful for testing compressive sensing algorithms.

Here's an example code for loading this sklearn dataset:

from sklearn.datasets import make_sparse_coded_signal

X, y, w = make_sparse_coded_signal(n_samples=100, n_components=10, n_features=50, n_nonzero_coefs=3, random_state=42)

This code generates a sparse coded signal dataset with 100 samples, 50 features, and 10 atoms.

Code for loading the make_sparse_coded_signal dataset using sklearn. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_sparse_coded_signal.html#sklearn-datasets-make-sparse-coded-signal on 30/3/2023.

Common Use Cases for Sklearn Datasets

Pre-Installed(Toy) Sklearn Datasets

Iris - This sklearn dataset is commonly used for classification tasks and is used as a benchmark dataset for testing classification algorithms.

Diabetes - This dataset contains medical information about patients with diabetes and is used for classification and regression tasks in healthcare analytics.

Digits - This sklearn dataset contains images of handwritten digits and is commonly used for image classification and pattern recognition tasks.

Linnerud - This dataset contains physical fitness and medical data of 20 athletes and is commonly used for multivariate regression analysis.

Wine - This sklearn dataset contains chemical analysis of wines and is commonly used for classification and clustering tasks.

Breast Cancer Wisconsin - This dataset contains medical information about breast cancer patients and is commonly used for classification tasks in healthcare analytics.

Real World Sklearn Datasets

Boston Housing - This sklearn dataset contains information about housing in Boston and is commonly used for regression tasks.

Olivetti Faces - This dataset contains grayscale images of faces and is commonly used for image classification and facial recognition tasks.

California Housing - This sklearn dataset contains information about housing in California and is commonly used for regression tasks.

MNIST - This dataset contains images of handwritten digits and is commonly used for image classification and pattern recognition tasks.

Fashion-MNIST - This sklearn dataset contains images of clothing items and is commonly used for image classification and pattern recognition tasks.

Generated Sklearn Datasets

make_classification - This dataset is a randomly generated dataset for binary and multiclass classification tasks.

make_regression - This dataset is a randomly generated dataset for regression tasks.

make_blobs - This sklearn dataset is a randomly generated dataset for clustering tasks.

make_moons and make_circles - These datasets are randomly generated datasets for classification tasks and are commonly used for testing nonlinear classifiers.

make_sparse_coded_signal - This dataset is a randomly generated dataset for sparse coding tasks in signal processing.

Final Thoughts

Sklearn datasets provide a convenient way for developers and researchers to test and evaluate machine learning models without having to manually collect and preprocess data.

They are also available for anyone to download and use freely.

The lead image of this article was generated via HackerNoon's AI Stable Diffusion model using the prompt 'iris dataset'.

More Dataset Listicles:

← Previous

11 Torchvision Datasets for Computer Vision You Need to Know

Up Next →

MS MARCO Web Search: Powering Next-Gen Information Access & Neural Indexers