Data powers machine learning algorithms and scikit-learn or
What are “Sklearn Datasets”?
Sklearn datasets are included as part of the scikit-learn (
To use a specific dataset, you can simply import it from sklearn.datasets module and call the appropriate function to load the data into your program.
These datasets are usually pre-processed and ready to use, which saves time and effort for data practitioners who need to experiment with different machine learning models and algorithms.
Complete List of Datasets in the Sklearn Library
- Iris
- Diabetes
- Digits
- Linnerud
- Wine
- Breast Cancer Wisconsin
- Boston Housing
- Olivetti Faces
- California Housing
- MNIST
- Fashion-MNIST
- make_classification
- make_regression
- make_blobs
- make_moons and make_circles
- Make_sparse_coded_signal
Pre-Installed(Toy) Sklearn Datasets
1. Iris
This dataset includes measurements of the sepal length, sepal width, petal length and petal width of 150 iris flowers, which belong to 3 different species: setosa, versicolor and virginica. The iris dataset has 150 rows and 5 columns, which are stored as a dataframe, including a column for the species of each flower.
The variables include:
- Sepal.Length - The sepal.length represents the length of the sepal in centimetres.
- Sepal.Width - The sepal.width represents the width of the sepal in centimetres.
- Petal.Length - The petal.length represents the length of the petal in centimetres.
- Species - The species variable represents the species of the iris flower, with three possible values: setosa, versicolor and virginica.
You can load the iris dataset directly from sklearn using the load_iris function from the sklearn.datasets module.
# To install sklearn
pip install scikit-learn
# To import sklearn
from sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()
# Print the dataset description
print(iris.describe())
Code for loading the Iris dataset using sklearn. Retrieved from
2. Diabetes
This sklearn dataset contains information on 442 patients with diabetes, including demographic and clinical measurements:
- Age
- Sex
- Body mass index (BMI)
- Average blood pressure
- Six blood serum measurements (e.g. total cholesterol, low-density lipoprotein (LDL) cholesterol, high-density lipoprotein (HDL) cholesterol).
- A quantitative measure of diabetes disease progression (HbA1c).
The Diabetes dataset can be loaded using the load_diabetes() function from the sklearn.datasets module.
from sklearn.datasets import load_diabetes
# Load the diabetes dataset
diabetes = load_diabetes()
# Print some information about the dataset
print(diabetes.describe())
Code for loading the Diabetes dataset using sklearn. Retrieved from
3. Digits
This sklearn dataset is a collection of hand-written digits from 0 to 9, stored as grayscale images. It contains a total of 1797 samples, with each sample is a 2D array of shape (8,8). There are 64 variables (or features) in the digits sklearn dataset, corresponding to the 64 pixels in each digit image.
The Digits dataset can be loaded using the load_digits() function from the sklearn.datasets module.
from sklearn.datasets import load_digits
# Load the digits dataset
digits = load_digits()
# Print the features and target data
print(digits.data)
print(digits.target)
Code for loading the Digits dataset using sklearn. Retrieved from
4. Linnerud
The Linnerud dataset contains physical and physiological measurements of 20 professional athletes.
The dataset includes the following variables:
- Three physical exercise variables - chin-ups, sit-ups, and jumping jacks.
- Three physiological measurement variables - pulse, systolic blood pressure, and diastolic blood pressure.
To load the Linnerud dataset in Python using sklearn:
from sklearn.datasets import load_linnerud
linnerud = load_linnerud()
Code for loading the linnerud dataset using sklearn. Retrieved from
5. Wine
This sklearn dataset contains the results of chemical analyses of wines grown in a specific area of Italy, to classify the wines into their correct varieties.
Some of the variables in the dataset:
- Alcohol
- Malic acid
- Ash
- Alkalinity of ash
- Magnesium
- Total phenols
- Flavanoids
The Wine dataset can be loaded using the load_wine() function from the sklearn.datasets module.
from sklearn.datasets import load_wine
# Load the Wine dataset
wine_data = load_wine()
# Access the features and targets of the dataset
X = wine_data.data # Features
y = wine_data.target # Targets
# Access the feature names and target names of the dataset
feature_names = wine_data.feature_names
target_names = wine_data.target_names
Code for loading the Wine Quality dataset using sklearn. Retrieved from
6. Breast Cancer Wisconsin Dataset
This sklearn dataset consists of information about breast cancer tumours and was initially created by Dr. William H. Wolberg. The dataset was created to assist researchers and machine learning practitioners in classifying tumours as either malignant(cancerous) or benign (non-cancerous).
Some of the variables included in this dataset:
- ID number
- Diagnosis (M = malignant, B = benign).
- Radius (the mean of distances from the centre to points on the perimeter).
- Texture (the standard deviation of gray-scale values).
- Perimeter
- Area
- Smoothness (the local variation in radius lengths).
- Compactness (the perimeter^2 / area - 1.0).
- Concavity (the severity of concave portions of the contour).
- Concave points (the number of concave portions of the contour).
- Symmetry
- Fractal dimension ("coastline approximation" - 1).
You can load the Breast Cancer Wisconsin dataset directly from sklearn using the load_breast_cancer function from the sklearn.datasets module.
from sklearn.datasets import load_breast_cancer
# Load the Breast Cancer Wisconsin dataset
cancer = load_breast_cancer()
# Print the dataset description
print(cancer.describe())
Code for loading the Breast Cancer Wisconsin dataset using sklearn. Retrieved from
Real World Sklearn Datasets
Real world sklearn datasets are based on real-world problems, commonly used to practice and experiment with machine learning algorithms and techniques using the sklearn library in Python.
7. Boston Housing
The Boston Housing dataset consists of information on housing in the area of Boston, Massachusetts. It has about 506 rows and 14 columns of data.
Some of the variables in the dataset include:
- CRIM - Per capita crime rate by town.
- ZN - The proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - The proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- NOX - The nitric oxide concentration (parts per 10 million).
- RM - The average number of rooms per dwelling.
- AGE - The proportion of owner-occupied units built prior to 1940.
- DIS - The weighted distances to five Boston employment centres.
- RAD - The Index of accessibility to radial highways.
- TAX - The full-value property-tax rate per $10,000.
- PTRATIO - The pupil-teacher ratio by town.
- B - 1000(Bk - 0.63)^2 where -Bk is the proportion of blacks by town.
- LSTAT - The percentage lower status of the population.
- MEDV - The median value of owner-occupied homes in $1000's.
You can load the Boston Housing dataset directly from scikit-learn using the load_boston function from the sklearn.datasets module.
from sklearn.datasets import load_boston
# Load the Boston Housing dataset
boston = load_boston()
# Print the dataset description
print(boston.describe())
Code for loading the Boston Housing dataset using sklearn. Retrieved from
8. Olivetti Faces
The Olivetti Faces dataset is a collection of grayscale images of human faces taken between April 1992 and April 1994 at AT&T Laboratories. It contains 400 images of 10 individuals, with each individual having 40 images shot at different angles and different lighting conditions.
You can load the Olivetti Faces dataset in sklearn by using the fetch_olivetti_faces function from the datasets module.
from sklearn.datasets import fetch_olivetti_faces
# Load the dataset
faces = fetch_olivetti_faces()
# Get the data and target labels
X = faces.data
y = faces.target
Code for loading the Olivetti Faces dataset using sklearn. Retrieved from
9. California Housing
This sklearn dataset contains information on median house values, as well as attributes for census tracts in California. It also includes 20,640 instances and 8 features.
Some of the variables in the dataset:
- MedInc - The median income in block.
- HouseAge - The median age of houses in block.
- AveRooms - The average number of rooms per household.
- AveBedrms - The average number of bedrooms per household.
- Population - The block population.
- AveOccup - The average household occupancy.
- Latitude - The latitude of the block in decimal degrees.
- Longitude - The longitude of the block in decimal degrees.
You can load the California Housing dataset using the fetch_california_housing function from sklearn.
from sklearn.datasets import fetch_california_housing
# Load the dataset
california_housing = fetch_california_housing()
# Get the features and target variable
X = california_housing.data
y = california_housing.target
Code for loading the California Housing dataset using sklearn. Retrieved from
10. MNIST
The MNIST dataset is popular and widely used in the fields of machine learning and computer vision. It consists of 70,000 grayscale images of handwritten digits 0–9, with 60,000 images for training and 10,000 for testing. Each image is 28x28 pixels in size and has a corresponding label denoting which digits it represents.
You can load the MNIST dataset from sklearn using the following code:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
Note: The MNIST dataset is a subset of the Digits dataset.
Code for loading the MNIST dataset using sklearn. Retrieved from
11. Fashion-MNIST
The Fashion MNIST dataset was created by Zalando Research as a replacement for the original MNIST dataset. The Fashion MNIST dataset consists of 70,000 grayscale images(training set of 60,000 and a test set of 10,000) of clothing items.
The images are 28x28 pixels in size and represent 10 different classes of clothing items, including T-shirts/tops, trousers, pullovers, dresses, coats, sandals, shirts, sneakers, bags, and ankle boots. It is similar to the original MNIST dataset, but with more challenging classification tasks due to the greater complexity and variety of the clothing items.
You can load this sklearn dataset using the fetch_openml function.
from sklearn.datasets import fetch_openml
fmnist = fetch_openml(name='Fashion-MNIST')
Code for loading the Fashion MNIST dataset using sklearn. Retrieved from__https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html#sklearn.datasets.fetch_openml__ on 30/3/2023.
Generated Sklearn Datasets
Generated sklearn datasets are synthetic datasets, generated using the sklearn library in Python. They are used for testing, benchmarking and developing machine learning algorithms/models.
12. make_classification
This function generates a random n-class classification dataset with a specified number of samples, features, and informative features.
Here's an example code to generate this sklearn dataset with 100 samples, 5 features, and 3 classes:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=5, n_informative=3, n_classes=3, random_state=42)
This code generates a dataset with 100 samples and 5 features, with 3 classes and 3 informative features. The remaining features will be redundant or noise.
Code for loading the make_classification dataset using sklearn. Retrieved from
13. make_regression
This function generates a random regression dataset with a specified number of samples, features, and noise.
Here's an example code to generate this sklearn dataset with 100 samples, 5 features, and noise level of 0.1:
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)
This code generates a dataset with 100 samples and 5 features, with a noise level of 0.1. The target variable y will be a continuous variable.
Code for loading the make_regression dataset using sklearn. Retrieved from
14. make_blobs
This function generates a random dataset with a specified number of samples and clusters.
Here's an example code to generate this sklearn dataset with 100 samples and 3 clusters:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=3, random_state=42)
This code generates a dataset with 100 samples and 2 features (x and y coordinates), with 3 clusters centred at random locations, and with no noise.
Code for loading the make_blobs dataset using sklearn. Retrieved from
15. make_moons and make_circles
These functions generate datasets with non-linear boundaries that are useful for testing non-linear classification algorithms.
Here's an example code for loading the make_moons dataset:
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
This code generates a dataset with 1000 samples and 2 features (x and y coordinates) with a non-linear boundary between the two classes, and with 0.2 standard deviations of Gaussian noise added to the data.
Code for loading the make_moons dataset using sklearn. Retrieved from
Here's an example code to generate and load the make_circles dataset:
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=1000, noise=0.05, random_state=42)
Code for loading the make_circles dataset using sklearn. Retrieved from
16. make_sparse_coded_signal
This function generates a sparse coded signal dataset that is useful for testing compressive sensing algorithms.
Here's an example code for loading this sklearn dataset:
from sklearn.datasets import make_sparse_coded_signal
X, y, w = make_sparse_coded_signal(n_samples=100, n_components=10, n_features=50, n_nonzero_coefs=3, random_state=42)
This code generates a sparse coded signal dataset with 100 samples, 50 features, and 10 atoms.
Code for loading the make_sparse_coded_signal dataset using sklearn. Retrieved from
Common Use Cases for Sklearn Datasets
Pre-Installed(Toy) Sklearn Datasets
Real World Sklearn Datasets
Generated Sklearn Datasets
Final Thoughts
Sklearn datasets provide a convenient way for developers and researchers to test and evaluate machine learning models without having to manually collect and preprocess data.
They are also available for anyone to download and use freely.
The lead image of this article was generated via HackerNoon's AI Stable Diffusion model using the prompt 'iris dataset'.
More Dataset Listicles: