How to Split Data into Three Sets (Training, Validation, and Test)

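Background

In machine learning and data analysis, a dataset is usually split into a training set, a validation set, and a test set so that a model's performance and generalization can be evaluated. The training set is used to fit the model, the validation set to tune its hyperparameters, and the test set for the final performance evaluation. The train_test_split function in sklearn splits data into only two sets, so a three-way split takes an extra step.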

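Implementation Steps

Method 1: Using NumPy's `np.split` function

First, shuffle the entire dataset.
Then call np.split on the shuffled dataset with two cut points to divide it into three parts: the training set, the validation set, and the test set.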

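Method 2: Calling `train_test_split` twice

The first call to train_test_split separates the test set from the rest of the data.
The second call to train_test_split divides the remaining data into the training set and the validation set.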

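Method 3: Splitting with a custom function

Define a custom function that randomly permutes the row positions of the dataset and then slices the permutation according to the requested proportions.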

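Method 4: Stratified split

Define a function that calls train_test_split twice, stratifying on a specified column, so that each of the three sets keeps roughly the class proportions of that column.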

Core Code

Method 1: Using NumPy's `np.split` function

import numpy as np
import pandas as pd

# Create sample data
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))

# Shuffle the rows, then cut at 60% and 80% to get a 60/20/20 split
train, validate, test = np.split(df.sample(frac=1, random_state=42), [int(.6*len(df)), int(.8*len(df))])
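The same shuffle-then-cut idea can also be written with plain positional slicing, which avoids passing a DataFrame through np.split (some recent pandas versions may emit a deprecation warning for that). This is a minimal sketch rather than the article's method; the names shuffled, train_end, and validate_end are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data: 10 rows, 5 columns
df = pd.DataFrame(np.random.rand(10, 5), columns=list("ABCDE"))

# Shuffle all rows once, with a fixed seed for reproducibility
shuffled = df.sample(frac=1, random_state=42)

# Cut points at 60% and 80% of the row count
n_rows = len(shuffled)
train_end = int(0.6 * n_rows)
validate_end = int(0.8 * n_rows)

# Slice by position: 60% train, 20% validation, 20% test
train = shuffled.iloc[:train_end]
validate = shuffled.iloc[train_end:validate_end]
test = shuffled.iloc[validate_end:]

print(len(train), len(validate), len(test))
```

Because the slices are contiguous ranges of one shuffled frame, every row lands in exactly one of the three sets.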

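Method 2: Calling `train_test_split` twice

import numpy as np
from sklearn.model_selection import train_test_split

# Sample data: xtrain holds the full feature matrix, labels the targets
xtrain = np.random.rand(100, 5)
labels = np.random.randint(0, 2, 100)

# First split: hold out 20% of the data as the test set
x, x_test, y, y_test = train_test_split(xtrain, labels, test_size=0.2, train_size=0.8)

# Second split: 25% of the remaining 80% (20% of the total) becomes the validation set
x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.25, train_size=0.75)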
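The second test_size is relative to what is left after the first split, which is easy to get wrong. The sketch below wraps the two calls in a hypothetical helper, two_stage_split (not part of scikit-learn), that derives the relative fraction from the overall validation and test fractions.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def two_stage_split(features, labels, val_frac=0.2, test_frac=0.2, seed=None):
    """Hypothetical helper: split into train/validation/test by overall fractions."""
    # Stage 1: hold out the test set
    x_rest, x_test, y_rest, y_test = train_test_split(
        features, labels, test_size=test_frac, random_state=seed
    )
    # Stage 2: the validation fraction must be rescaled to the remaining data
    relative_val_frac = val_frac / (1.0 - test_frac)
    x_train, x_val, y_train, y_val = train_test_split(
        x_rest, y_rest, test_size=relative_val_frac, random_state=seed
    )
    return x_train, x_val, x_test, y_train, y_val, y_test


rng = np.random.default_rng(0)
features = rng.random((100, 5))
labels = rng.integers(0, 2, 100)

x_train, x_val, x_test, y_train, y_val, y_test = two_stage_split(features, labels, seed=0)
print(x_train.shape, x_val.shape, x_test.shape)
```

With 100 rows and the default fractions, the validation share of the remaining 80 rows is 0.2 / 0.8 = 0.25, matching the hard-coded value used above.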

Method 3: Splitting with a custom function

import numpy as np
import pandas as pd

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)
    m = len(df.index)
    # Permute row positions (not index labels) so that iloc works for any index
    perm = np.random.permutation(m)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.iloc[perm[:train_end]]
    validate = df.iloc[perm[train_end:validate_end]]
    # Whatever remains after the training and validation rows is the test set
    test = df.iloc[perm[validate_end:]]
    return train, validate, test

# Create sample data
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))

# Split the data (60% train, 20% validation, 20% test by default)
train, validate, test = train_validate_test_split(df)
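A variant of the same idea, sketched here under the assumption that you would rather not touch NumPy's global random state: it uses a local np.random.default_rng generator and permutes row positions, so it also works when the DataFrame index is not 0..n-1. The function name split_by_position is illustrative only.

```python
import numpy as np
import pandas as pd


def split_by_position(df, train_frac=0.6, validate_frac=0.2, seed=None):
    """Illustrative variant: three-way split using a local random generator."""
    rng = np.random.default_rng(seed)   # local generator, global state untouched
    n_rows = len(df)
    perm = rng.permutation(n_rows)      # shuffled row positions 0..n_rows-1
    train_end = int(train_frac * n_rows)
    validate_end = train_end + int(validate_frac * n_rows)
    train = df.iloc[perm[:train_end]]
    validate = df.iloc[perm[train_end:validate_end]]
    test = df.iloc[perm[validate_end:]]
    return train, validate, test


# A DataFrame with string labels as its index
df = pd.DataFrame(
    np.arange(20).reshape(10, 2), columns=["A", "B"], index=list("abcdefghij")
)
train, validate, test = split_by_position(df, seed=0)
print(len(train), len(validate), len(test))
```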

Method 4: Stratified split

import math

import pandas as pd
from sklearn.model_selection import train_test_split

def split_stratified_into_train_val_test(df_input, stratify_colname='y',
                                         frac_train=0.6, frac_val=0.15, frac_test=0.25,
                                         random_state=None):
    # Compare with a tolerance: sums such as 0.7 + 0.2 + 0.1 are not exactly 1.0 in floating point
    if not math.isclose(frac_train + frac_val + frac_test, 1.0):
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' % (frac_train, frac_val, frac_test))

    if stratify_colname not in df_input.columns:
        raise ValueError('%s is not a column in the dataframe' % (stratify_colname))

    X = df_input  # all columns, including the stratification column
    y = df_input[[stratify_colname]]  # column whose class proportions are preserved

    # First split: training set versus a temporary set (validation + test)
    df_train, df_temp, y_train, y_temp = train_test_split(X, y, stratify=y,
                                                          test_size=(1.0 - frac_train),
                                                          random_state=random_state)

    # Second split: divide the temporary set into validation and test sets
    relative_frac_test = frac_test / (frac_val + frac_test)
    df_val, df_test, y_val, y_test = train_test_split(df_temp, y_temp, stratify=y_temp,
                                                      test_size=relative_frac_test,
                                                      random_state=random_state)

    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

    return df_train, df_val, df_test

# Create sample data with imbalanced classes (75% foo, 15% bar, 10% baz)
df = pd.DataFrame({'A': list(range(0, 100)),
                   'B': list(range(100, 0, -1)),
                   'label': ['foo'] * 75 + ['bar'] * 15 + ['baz'] * 10})

# Split the data
df_train, df_val, df_test = split_stratified_into_train_val_test(df, stratify_colname='label',
                                                                 frac_train=0.60, frac_val=0.20, frac_test=0.20)
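To see what stratification buys, it can help to print the class proportions of each resulting set. The sketch below is self-contained and calls train_test_split directly with stratify on a smaller version of the sample data; the random_state value of 0 is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced sample data: 75% foo, 15% bar, 10% baz
df = pd.DataFrame({
    "A": list(range(100)),
    "label": ["foo"] * 75 + ["bar"] * 15 + ["baz"] * 10,
})

# First split: 60% train, 40% temporary, stratified on the label column
df_train, df_temp = train_test_split(
    df, test_size=0.4, stratify=df["label"], random_state=0
)
# Second split: halve the temporary set into validation and test
df_val, df_test = train_test_split(
    df_temp, test_size=0.5, stratify=df_temp["label"], random_state=0
)

# Class proportions in each set should stay close to those of the full data
for name, part in [("train", df_train), ("val", df_val), ("test", df_test)]:
    proportions = part["label"].value_counts(normalize=True).round(2).to_dict()
    print(name, len(part), proportions)
```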

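Best Practices

Shuffle the dataset before splitting so that any ordering in the source data does not carry over into the splits. train_test_split shuffles by default; np.split does not, so shuffle explicitly first.
If the classes in the dataset are imbalanced, use a stratified split so that the class proportions in each set match those of the original dataset.
When using a custom function, set a random seed so that the results are reproducible.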
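The effect of fixing the seed can be checked directly. A minimal sketch, assuming scikit-learn's train_test_split and an arbitrary seed value of 42:

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)

# Two calls with the same random_state produce the same split
first_train, first_test = train_test_split(data, test_size=0.25, random_state=42)
second_train, second_test = train_test_split(data, test_size=0.25, random_state=42)

same_split = np.array_equal(first_test, second_test)
print(same_split)  # True: the same seed reproduces the same split
```

Leaving random_state unset draws a fresh shuffle on each run, so the two test sets would usually differ.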

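Common Problems

Split proportions: make sure the proportions sum to 1; otherwise the sets will not have the sizes you intended. Because the proportions are floating-point numbers, check the sum with a small tolerance rather than exact equality.
Stratification column: when using a stratified split, make sure the specified column exists in the dataset; otherwise an error is raised.
Random seed: without a random seed, the split may differ on every run, which makes results hard to reproduce.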
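The proportion check is a common place for floating-point surprises. The sketch below shows why an exact comparison can reject a valid set of fractions, and one tolerant alternative using math.isclose; the helper name fractions_sum_to_one and its tolerance are illustrative assumptions.

```python
import math


def fractions_sum_to_one(*fractions, tol=1e-9):
    """Illustrative helper: True if the fractions add up to 1 within a tolerance."""
    return math.isclose(sum(fractions), 1.0, abs_tol=tol)


# Exact comparison fails here because of binary floating-point rounding
exact_match = (0.7 + 0.2 + 0.1 == 1.0)
print(exact_match)                           # False

# The tolerant check accepts the same fractions
print(fractions_sum_to_one(0.7, 0.2, 0.1))   # True

# It still rejects fractions that really do not add up
print(fractions_sum_to_one(0.6, 0.2, 0.1))   # False
```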


https://119291.xyz/posts/2025-04-11.how-to-split-data-into-three-sets/


Published on April 22, 2025
