A Quick Look at KFold and Its Variants

Author: 井底蛙蛙呱呱呱 | Published 2020-08-17 11:00

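sklearn offers several ways to split a dataset, such as the commonly used train_test_split, KFold, and the KFold variants GroupKFold and StratifiedKFold. This article gives a brief introduction to the three K-fold cross-validation splitters: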

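KFold: the plainest K-fold splitter. By default it cuts the dataset into k folds in its existing order; setting shuffle=True shuffles the data first and then cuts it into k folds.
GroupKFold: splits the data into training and validation sets according to user-supplied groups, so that all samples of one group end up on the same side of a split. For example, some Kaggle participants building validation sets for time-series data form the groups by weekday vs. weekend, or by month.
StratifiedKFold: as the name suggests, it uses stratified sampling, so each fold keeps roughly the same class proportions as the full dataset.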
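
To make the effect of the shuffle parameter concrete, here is a minimal sketch on toy data. The array X and the seed random_state=0 are arbitrary choices for illustration, not values from the article.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features each

# Without shuffling, each test fold is a contiguous block of indices.
ordered_folds = [test for _, test in KFold(n_splits=5).split(X)]

# With shuffle=True the sample indices are permuted before being cut into
# folds; random_state (an arbitrary seed here) makes the result repeatable.
shuffled_kf = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_folds = [test for _, test in shuffled_kf.split(X)]

for ordered, shuffled in zip(ordered_folds, shuffled_folds):
    print("ordered:", ordered, "| shuffled:", shuffled)
```

Either way, every sample lands in exactly one test fold; shuffling only changes which samples share a fold.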

import numpy as np
from sklearn.model_selection import KFold,GroupKFold,StratifiedKFold
X=np.array([[1,2],[3,4],[5,6],[7,8],[9,10],[11,12],[13,14],[15,16],[17,18],[19,20]])
y=np.array([1,1,1,1,2,2,2,3,3,3])
 
#KFold
print('KFold:')
kf=KFold(n_splits=5)
for train_index,test_index in kf.split(X):
    print("Train Index:",train_index,",Test Index:",test_index)
    X_train,X_test=X[train_index],X[test_index]
    y_train,y_test=y[train_index],y[test_index]

# GroupKFold
print('GroupKFold:')
# The distinct values in groups can be read as months, e.g. 1 = January, 2 = February, 3 = March; each month's data is used in turn as the validation set
groups=np.array([1,1,1,1,2,2,2,3,3,3])
group_kfold=GroupKFold(n_splits=3)
for train_index,test_index in group_kfold.split(X,y,groups):
    print("Train Index:",train_index,",Test Index:",test_index)
    X_train,X_test=X[train_index],X[test_index]
    y_train,y_test=y[train_index],y[test_index]


# StratifiedKFold
print('StratifiedKFold:')
skf=StratifiedKFold(n_splits=3)
for train_index,test_index in skf.split(X,y):
    print("Train Index:",train_index,",Test Index:",test_index)
    X_train,X_test=X[train_index],X[test_index]
    y_train,y_test=y[train_index],y[test_index]
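
One way to see what stratification buys is to count the class labels in each test fold. The sketch below reuses the labels y from the example above and compares plain KFold with StratifiedKFold; the helper class_counts_per_test_fold is a hypothetical name introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3])  # labels sorted by class

def class_counts_per_test_fold(splitter):
    """Return, for each test fold, a dict mapping class label -> sample count."""
    counts = []
    for _, test_index in splitter.split(X, y):
        labels, n = np.unique(y[test_index], return_counts=True)
        counts.append(dict(zip(labels.tolist(), n.tolist())))
    return counts

kfold_counts = class_counts_per_test_fold(KFold(n_splits=3))
stratified_counts = class_counts_per_test_fold(StratifiedKFold(n_splits=3))

print("KFold:          ", kfold_counts)
print("StratifiedKFold:", stratified_counts)
```

Because y is sorted by class, unshuffled KFold puts a single class in each test fold, whereas every StratifiedKFold test fold contains all three classes.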

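Note that for GroupKFold the number of distinct values in groups must be at least n_splits (it does not have to equal it); with fewer groups than splits, split raises a ValueError.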
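
A small sketch to probe how the number of distinct groups relates to n_splits. The group labels below are made up for illustration. It tries five groups with three splits, then two groups with three splits, and checks that no group ever appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3])

# Five distinct groups (hypothetical labels) split into three folds.
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
leaked = []  # groups found in both train and test of the same split
for train_index, test_index in GroupKFold(n_splits=3).split(X, y, groups):
    overlap = set(groups[train_index].tolist()) & set(groups[test_index].tolist())
    leaked.append(overlap)
    print("test groups:", sorted(set(groups[test_index].tolist())))

# Only two distinct groups for three folds: split() refuses.
too_few_groups = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
try:
    list(GroupKFold(n_splits=3).split(X, y, too_few_groups))
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

With more groups than splits, several whole groups share a test fold; the constraint that matters is that a group is never divided between training and test data.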

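Besides the cross-validation splitters mentioned above, there are also ShuffleSplit, GroupShuffleSplit, and StratifiedShuffleSplit. These three can be seen as a blend of train_test_split and KFold; for usage details, see "Dataset splitting in sklearn" in the references.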
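
As a rough illustration of that blend, the sketch below draws three random 80/20 splits with each of the three classes. The toy data, the group labels, test_size=0.2, and the seed are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import (
    GroupShuffleSplit,
    ShuffleSplit,
    StratifiedShuffleSplit,
)

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)    # two balanced classes
groups = np.repeat(np.arange(5), 4)  # 5 hypothetical groups of 4 samples each

# Like train_test_split, each iteration is an independent random split of a
# given size; like KFold, the splitter yields n_splits (train, test) pairs.
ss = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
ss_tests = [test for _, test in ss.split(X)]

# Stratified variant: each test set keeps the class proportions of y.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
sss_tests = [test for _, test in sss.split(X, y)]

# Group variant: test_size is a fraction of the groups, and whole groups
# are held out together.
gss = GroupShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
gss_splits = list(gss.split(X, y, groups))

for train_index, test_index in gss_splits:
    print("held-out groups:", sorted(set(groups[test_index].tolist())))
```

Unlike the K-fold splitters, the test sets of different iterations are drawn independently, so a sample may appear in several test sets or in none.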

References:
Distinguishing sklearn's KFold, GroupKFold, and StratifiedKFold
Dataset splitting in sklearn