I'm using only the first 10000 samples from the data to save time. You can change that below if you want to.
from sklearn.datasets import fetch_openml

# fetch_mldata has been removed from scikit-learn; fetch_openml is its replacement
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
data = mnist.data[:10000]
target = mnist.target[:10000]
print("Shape of Data: ", data.shape)
print("Shape of Target: ", target.shape)
Shape of Data:  (10000, 784)
Shape of Target:  (10000,)
We split the data into a training set and a held-out test set. Do not touch the test data until you're ready to make a submission on Kaggle.
from sklearn.model_selection import train_test_split  # cross_validation was renamed model_selection

X, kaggle_x, Y, kaggle_y = train_test_split(data, target, train_size=0.8)
In Kaggle competitions, you will not have access to the targets of the test set (kaggle_y in this case).
sample = data[0]  # take the first image vector; `data` itself has shape (10000, 784)
print("Sample shape: ", sample.shape)
Sample shape: (784,)
Each vector is created by flattening a $28 \times 28$ matrix. To visualize the images, we can reshape the array into the original matrix shape and visualize it.
import matplotlib.pyplot as plt
%matplotlib inline

sample = sample.reshape(28, 28)
plt.imshow(sample, cmap='gray')
plt.show()
Let's compare the performance of a few different classifiers. We do this with scikit-learn's built-in cross-validation scoring function, cross_val_score.
from sklearn.model_selection import cross_val_score  # cross_validation was renamed model_selection
from sklearn.neighbors import KNeighborsClassifier

knnmodel = KNeighborsClassifier()
knncvscore = cross_val_score(knnmodel, X, Y, scoring='accuracy', n_jobs=-1)
print("Knn CV Score is: ", knncvscore)
Knn CV Score is: [ 0.99850019 0.99850019 0.99924981]
from sklearn.svm import SVC

# Note: SVC with default parameters is sensitive to feature scaling,
# which is why it scores poorly on the raw pixel values below.
svcmodel = SVC()
svccvscore = cross_val_score(svcmodel, X, Y, scoring='accuracy', n_jobs=-1)
print("SVC CV Score is: ", svccvscore)
SVC CV Score is: [ 0.59130109 0.59130109 0.59152288]
from xgboost import XGBClassifier

xgbmodel = XGBClassifier()
xgbcvscore = cross_val_score(xgbmodel, X, Y, scoring='accuracy', n_jobs=-1)
print("Gradient Boosting CV Score is: ", xgbcvscore)
Gradient Boosting CV Score is: [ 0.99775028 0.99700037 0.99737434]
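To choose between the classifiers, it helps to reduce each array of fold scores to a mean and standard deviation. A quick sketch using the score arrays printed above:

```python
import numpy as np

# Fold scores copied from the cross_val_score outputs above
scores = {
    'knn': np.array([0.99850019, 0.99850019, 0.99924981]),
    'svc': np.array([0.59130109, 0.59130109, 0.59152288]),
    'xgb': np.array([0.99775028, 0.99700037, 0.99737434]),
}
for name, s in scores.items():
    print("%s  mean: %.4f  std: %.4f" % (name, s.mean(), s.std()))
```

KNN edges out XGBoost on mean accuracy here, which is why it is picked as the final model below.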
The best-scoring model is then fit on the entire training set and used to make predictions.
finalmodel = knnmodel
finalmodel.fit(X, Y)
predictions = finalmodel.predict(kaggle_x)
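For an actual submission, the predictions go into a CSV file. A minimal sketch, assuming the common ImageId/Label column layout (check the competition's sample_submission.csv for the exact column names before uploading):

```python
import numpy as np
import pandas as pd

# Stand-in for finalmodel.predict(kaggle_x) -- a small array for illustration
predictions = np.array([3, 8, 1])

submission = pd.DataFrame({
    'ImageId': np.arange(1, len(predictions) + 1),  # 1-based row ids
    'Label': predictions,
})
submission.to_csv('submission.csv', index=False)
print(submission.head())
```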
The final score from Kaggle is the one you get from your predictions on the test set (kaggle_x). It is verified against the targets to produce your leaderboard score. If you are not overfitting, your cross-validation score and leaderboard score should be almost identical.
from sklearn.metrics import accuracy_score

accscore = accuracy_score(kaggle_y, predictions)  # accuracy_score expects (y_true, y_pred)
print("The accuracy score is", accscore)
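One way to put that "almost identical" claim into practice is to look at the gap between the cross-validation mean and the holdout accuracy. A sketch with a hypothetical holdout value and an arbitrary 2% threshold (substitute your own numbers):

```python
import numpy as np

cv_mean = np.mean([0.99850019, 0.99850019, 0.99924981])  # KNN fold scores from above
holdout = 0.996  # hypothetical accuracy on kaggle_x/kaggle_y -- use your own accscore
gap = cv_mean - holdout
print("CV mean: %.4f  holdout: %.4f  gap: %.4f" % (cv_mean, holdout, gap))
if abs(gap) > 0.02:  # 2% is an arbitrary rule of thumb, not a standard
    print("Large gap -- possible overfitting or data leakage")
```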