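Getting Started with Kaggle

Kaggle is a popular site in the data mining and machine learning community. It hosts a steady stream of competitions, which makes it a good place to build practical skills in both fields. This post uses the CIFAR-10 - Object Recognition in Images competition as a walkthrough for getting started.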

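1. Download and extract the data

Download the data directly from the competition page. The files are usually large, so extraction takes a while. On macOS you can extract an archive with 7za x vps12.7z.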

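2. Read in the data

The data usually comes in CSV format and can be loaded with pandas:

import pandas as pd
df = pd.read_csv("data/train.csv")

pandas is a widely used Python library; for details see: Python Scientific Computing (Part 2)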
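After loading, a quick look at the columns and the class distribution is a useful sanity check. The sketch below uses a small in-memory CSV in place of the real file. The column names id and label match those used by the script later in this post, while the sample rows are made up purely for illustration.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the real labels file.
sample_csv = io.StringIO("id,label\n1,frog\n2,truck\n3,truck\n4,deer\n")

df = pd.read_csv(sample_csv)

# Count how many examples each class has.
label_counts = df["label"].value_counts()

print(df.head())
print(label_counts)
```

With the real file, passing its path to pd.read_csv instead of the StringIO object is the only change needed.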

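3. Implement the algorithm

Classify the images with a nearest-neighbor classifier, that is, k-nearest neighbors with k = 1, using the L1 distance.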

import numpy as np
import pandas as pd
from PIL import Image


def imread(path):
    # scipy.misc.imread was removed in SciPy 1.2, so Pillow is used here instead
    return np.asarray(Image.open(path).convert("RGB"))


class NearestNeighbor(object):
    def __init__(self):
        pass

    def train(self, X, y):
        """X is N x D where each row is an example. y is 1-dimensional of size N."""
        # the nearest neighbor classifier simply remembers all the training data
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        """X is N x D where each row is an example we wish to predict a label for."""
        num_test = X.shape[0]
        # make sure that the output type matches the label type
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)

        # loop over all test rows
        for i in range(num_test):
            # find the nearest training image to the i'th test image
            # using the L1 distance (sum of absolute value differences)
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            min_index = np.argmin(distances)  # index with the smallest distance
            Ypred[i] = self.ytr[min_index]  # predict the label of the nearest example

        return Ypred
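The classifier above always takes the single closest training example. As a rough sketch of how the same L1 distance extends to k neighbors with a majority vote, consider the function below. The name knn_predict and the toy 2-D data are illustrative assumptions and are not part of the original script.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_test, k=1):
    """Predict a label for each test row by majority vote among
    its k nearest training rows under the L1 distance."""
    predictions = []
    for x in X_test:
        # L1 distance from this test row to every training row
        distances = np.abs(X_train - x).sum(axis=1)
        # indices of the k smallest distances
        nearest = np.argsort(distances)[:k]
        # majority vote over the labels of those neighbors
        votes = Counter(y_train[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)


# Toy data (made up for illustration): two well-separated clusters.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                    [10.0, 10.0], [11.0, 10.0]])
y_train = np.array(["cat", "cat", "cat", "dog", "dog"])
X_test = np.array([[0.5, 0.5], [10.0, 11.0]])

predicted = knn_predict(X_train, y_train, X_test, k=3)
print(predicted)
```

With k=1 this reduces to the nearest-neighbor rule used in this post. Looping over the test rows keeps memory use at one distance vector at a time, which matters for 3072-dimensional image rows.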

4. Run the algorithm and save the results as a CSV file

if __name__ == '__main__':
    nearestNeighbor = NearestNeighbor()
    trainSize = 500
    testSize = 500
    imgMatrix = np.ones((trainSize, 32 * 32 * 3))
    testImag = np.ones((testSize, 32 * 32 * 3))

    # flatten each 32x32x3 training image into one row
    for i in range(trainSize):
        print(i)
        img = imread('train/%d.png' % (i + 1))
        imgMatrix[i] = img.reshape(32 * 32 * 3)

    df = pd.read_csv("trainLabels.csv")
    # keep only the labels of the images that were loaded
    nearestNeighbor.train(imgMatrix, df.label.values[:trainSize])

    # images 501-1000 of the training folder serve as the test rows
    for j in range(testSize):
        print(j)
        img = imread('train/%d.png' % (j + 501))
        testImag[j] = img.reshape(32 * 32 * 3)

    yPred = nearestNeighbor.predict(testImag)

    output = pd.DataFrame({'id': list(range(1, testSize + 1)), 'label': yPred})
    output.to_csv('output.csv', index=False)
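Because the script draws its test rows from the training folder (images 501 to 1000), their true labels are available in trainLabels.csv, so the accuracy can be estimated locally before uploading anything. The helper below is a minimal sketch of that check. The function name accuracy and the sample labels are assumptions made for illustration.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(y_true == y_pred))


# Made-up labels for illustration: three of four predictions are correct.
y_true = np.array(["frog", "truck", "deer", "truck"])
y_pred = np.array(["frog", "cat", "deer", "truck"])

score = accuracy(y_true, y_pred)
print(score)
```

In the script above, the true labels for the held-out rows would be the label values at positions 500 through 999 of trainLabels.csv, assuming the file is ordered by id.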

5. Upload the resulting CSV file to Kaggle and wait for the score

With the nearest-neighbor algorithm the accuracy is low, only about 30%.