Wednesday, November 29, 2017

Loading First Data Set In Machine Learning


In my previous post, I explained the difference between Deep Learning and Machine Learning. In this post, I am going to talk about how to load a data set in Scikit-Learn, a machine learning library, for newcomers.

This post focuses on loading the IRIS dataset from the Scikit-Learn library using a Jupyter notebook. Edgar Anderson collected 50 samples from each of 3 different species of iris (150 samples in total). For each sample he measured the sepal length, sepal width, petal length and petal width, and recorded the measurements along with the species. The data became widely known in 1936 through Ronald Fisher's paper.

Use the code below to view the CSV file in your Jupyter notebook directly from the internet.
from IPython.display import HTML
The data file is available at this link: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

Output: There are 150 rows, but I have pasted only one sample row from each species.
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
5.9,3.0,5.1,1.8,Iris-virginica

The output above is comma separated: each row lists the sepal length, sepal width, petal length and petal width, followed by the species (setosa, versicolor or virginica). With this dataset we can predict the species of a flower from its measurements. This kind of problem is known as supervised learning, because we are trying to learn the relationship between the data (the iris measurements) and the outcome (the species of iris).
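To make the row layout concrete, here is a minimal sketch that splits the three sample rows shown above into measurements and species. It uses only Python's standard `csv` module; the names `raw`, `features` and `labels` are my own choices for illustration and are not part of the dataset or of Scikit-Learn.

```python
import csv
import io

# The three sample rows from the output above (one per species).
raw = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
5.9,3.0,5.1,1.8,Iris-virginica
"""

features = []  # sepal length, sepal width, petal length, petal width
labels = []    # species name, the outcome we want to predict

for row in csv.reader(io.StringIO(raw)):
    if not row:  # skip blank lines
        continue
    # The first four columns are numeric measurements.
    features.append([float(value) for value in row[:4]])
    # The last column is the species.
    labels.append(row[4])

print(features)
print(labels)
```

Keeping the measurements and the species in two separate lists mirrors how the data is split into features and outcome later in this post.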

I am using the Scikit-Learn library to write machine learning code. You can also install it from here. The best way to learn machine learning or deep learning is to install the Anaconda distribution on your laptop. It comes with a lot of built-in libraries and a GUI that can help you sharpen your skills. The GUI is nothing but the Jupyter (IPython) Notebook.

The IRIS dataset is probably the most famous dataset in machine learning, and it is already built into Scikit-Learn, so we don’t need to load it from anywhere else. Let’s load the dataset.

# Import the load_iris function from sklearn.datasets
from sklearn.datasets import load_iris

# Save the iris dataset and its attributes in an object called iris
iris = load_iris()
type(iris)

# iris is a Bunch, Scikit-Learn's special container type for storing a dataset and its attributes.
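As a small sketch of what the Bunch container offers: it behaves like a Python dictionary whose keys can also be read as attributes. The variable name `same` below is only for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# List the keys stored in the Bunch (data, target, feature_names, ...).
print(sorted(iris.keys()))

# Dictionary-style and attribute-style access give the same array.
same = iris["data"] is iris.data
print(same)
```

This is why the rest of the post can write `iris.data` or `iris.DESCR` instead of using square brackets.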

# To see the measurements in the dataset, run the code below
print(iris.data)

# To read the full description of the dataset, run the code below
print(iris.DESCR)

In Machine Learning, each row is known as an observation and each column is known as a feature.

# The command below prints the feature names of the IRIS dataset
print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

# The command below prints the species (categories) of IRIS
print(iris.target_names)
['setosa' 'versicolor' 'virginica']

Each value that we are predicting is called the response, target or outcome. Scikit-Learn expects the features and the response as separate objects, and both should be numeric, which is why the species are stored as numbers rather than names. The features and the response should also be NumPy arrays, otherwise it is not going to work. We can check the type of the dataset by using the “type(iris.data)” command.
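The checks described above can be sketched as follows. This is just one way to inspect the data, using NumPy's `shape` attribute and `np.unique`.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Both the features and the response are NumPy arrays.
print(type(iris.data), type(iris.target))

# Rows are observations and columns are features;
# there is one response value per observation.
print(iris.data.shape, iris.target.shape)

# The distinct numeric codes used for the species.
print(np.unique(iris.target))
```

The first number in each shape should match, since every observation needs exactly one response.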

In this dataset, the attributes (features) play the role of X and the outcome plays the role of Y. So let’s store the features in X and the outcome in Y.

#Store the IRIS features in X
X = iris.data

#Store the IRIS outcome in Y
Y = iris.target
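To see how X and Y line up, here is a short sketch that prints one observation from each species together with its numeric outcome and the matching species name. The row positions 0, 50 and 100 are picked only for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
Y = iris.target

# Look at one observation from each species.
for i in (0, 50, 100):
    # Y[i] is a numeric code; target_names turns it back into a name.
    species = iris.target_names[Y[i]]
    print(X[i], "->", Y[i], species)
```

Row i of X always belongs with element i of Y, so the two objects must be kept in the same order.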

Stay tuned for the next post, which will help you understand how to use this dataset to write your first machine learning algorithm and check its accuracy score.
