Stratified splitting of a pandas DataFrame into training, validation and test sets

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/50781562/

Date: 2020-09-14 05:40:47  Source: igfitidea

python pandas dataframe machine-learning deep-learning

Asked by user1934212

The following extremely simplified DataFrame represents a much larger DataFrame containing medical diagnoses:

import pandas as pd
medicalData = pd.DataFrame({'diagnosis': ['positive', 'positive', 'negative', 'negative', 'positive', 'negative', 'negative', 'negative', 'negative', 'negative']})
medicalData

    diagnosis
0   positive
1   positive
2   negative
3   negative
4   positive
5   negative
6   negative
7   negative
8   negative
9   negative

For machine learning, I need to randomly split this DataFrame into three subframes in the following way:

trainingDF, validationDF, testDF = SplitData(medicalData,fractions = [0.6,0.2,0.2])

Here the fractions array specifies the share of the complete data that goes into each subframe. The data in the subframes must be mutually exclusive, and the fractions must sum to one. Additionally, the fraction of positive diagnoses in each subset needs to be approximately the same.
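
The three requirements can be written down as a small check. This is only a sketch: check_split, its tol parameter, and the demo frame are hypothetical names that do not appear in the question, and the check assumes the DataFrame has a unique index.

```python
import numpy as np
import pandas as pd


def check_split(df, parts, fractions, label_col, positive="positive", tol=0.1):
    """Raise AssertionError if the parts violate the stated requirements."""
    # The fractions must sum to one.
    assert np.isclose(sum(fractions), 1.0), "fractions must sum to one"
    # The parts must be mutually exclusive and together cover the whole frame
    # (this relies on df having a unique index).
    all_idx = np.concatenate([p.index.to_numpy() for p in parts])
    assert len(all_idx) == len(set(all_idx)) == len(df), "parts overlap or miss rows"
    # Every part should have roughly the same share of positive labels.
    overall = (df[label_col] == positive).mean()
    for part in parts:
        share = (part[label_col] == positive).mean()
        assert abs(share - overall) <= tol, "class balance differs too much"


# Hypothetical demo data: 30 positive rows followed by 70 negative rows.
demo = pd.DataFrame({"diagnosis": ["positive"] * 30 + ["negative"] * 70})
# A hand-built 60/20/20 split that keeps 30% positives in every part.
parts = [
    demo.iloc[list(range(0, 18)) + list(range(30, 72))],
    demo.iloc[list(range(18, 24)) + list(range(72, 86))],
    demo.iloc[list(range(24, 30)) + list(range(86, 100))],
]
check_split(demo, parts, [0.6, 0.2, 0.2], "diagnosis")  # passes silently
```

The tolerance is a judgment call: on a frame as small as the 10-row example, an exactly proportional 60/20/20 split is impossible, so some slack is needed.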

Answers to this question recommend using the pandas sample method or the train_test_split function from sklearn. However, none of these solutions seems to generalize well to n splits, and none provides a stratified split.

Answered by cs95

np.array_split

If you want to generalise to n splits, np.array_split is your friend (it works well with DataFrames).

import numpy as np

fractions = np.array([0.6, 0.2, 0.2])
# shuffle your input
df = df.sample(frac=1)
# split into 3 parts at the cumulative row boundaries (60% and 80% of the rows)
train, val, test = np.array_split(
    df, (fractions[:-1].cumsum() * len(df)).astype(int))
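
The snippet above shuffles and cuts the whole frame at once, so it does not balance the classes by itself. One way to get a stratified n-way split with the same cumulative-boundary idea is to apply it to each class separately and then concatenate the pieces. The following is a minimal sketch under that assumption; stratified_split, label_col, seed, and the demo frame are hypothetical names, not part of the answer.

```python
import numpy as np
import pandas as pd


def stratified_split(df, label_col, fractions, seed=0):
    """Split df into len(fractions) disjoint parts, one class at a time."""
    fractions = np.asarray(fractions, dtype=float)
    if not np.isclose(fractions.sum(), 1.0):
        raise ValueError("fractions must sum to one")
    parts = [[] for _ in fractions]
    for _, group in df.groupby(label_col):
        # Shuffle the rows of this class, then cut at the cumulative boundaries.
        group = group.sample(frac=1, random_state=seed)
        bounds = np.round(fractions[:-1].cumsum() * len(group)).astype(int)
        start = 0
        for part, stop in zip(parts, list(bounds) + [len(group)]):
            part.append(group.iloc[start:stop])
            start = stop
    # Glue the per-class pieces of each part back together.
    return [pd.concat(pieces) for pieces in parts]


# Hypothetical demo data: 30 positive and 70 negative rows.
demo = pd.DataFrame({"diagnosis": ["positive"] * 30 + ["negative"] * 70})
train, val, test = stratified_split(demo, "diagnosis", [0.6, 0.2, 0.2])
```

Each returned part is grouped by class, so shuffle it again (for example with sample(frac=1)) if row order matters. With very small classes the rounding can leave a part without any rows of that class.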


train_test_split

A long-winded solution using train_test_split for stratified splitting.

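from sklearn.model_selection import train_test_split

# separate the label column from the features
y = df.pop('diagnosis').to_frame()
X = df

# first split: 60% training, 40% held out
X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.4)

# second split: halve the held-out 40% into test and validation (20% each)
X_test, X_val, y_test, y_val = train_test_split(
        X_test, y_test, stratify=y_test, test_size=0.5)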

Where X is a DataFrame of your features, and y is a single-column DataFrame of your labels.
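
The two chained calls can be generalized to n parts by peeling off one stratified part at a time. This is a sketch, not the answer's own code: chained_stratified_split, label_col, seed, and the demo frame are hypothetical names, and the part sizes are passed to train_test_split as absolute row counts to sidestep floating-point rounding of the fractions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def chained_stratified_split(df, label_col, fractions, seed=0):
    """Peel off one stratified part at a time with train_test_split."""
    parts = []
    remaining = df
    for frac in fractions[:-1]:
        # Absolute number of rows this part should receive.
        n_part = int(round(frac * len(df)))
        part, remaining = train_test_split(
            remaining,
            train_size=n_part,
            stratify=remaining[label_col],
            random_state=seed,
        )
        parts.append(part)
    parts.append(remaining)  # whatever is left becomes the last part
    return parts


# Hypothetical demo data: 30 positive and 70 negative rows.
demo = pd.DataFrame({"diagnosis": ["positive"] * 30 + ["negative"] * 70})
train, val, test = chained_stratified_split(demo, "diagnosis", [0.6, 0.2, 0.2])
```

On very small frames train_test_split may raise a ValueError, because stratification needs enough rows of every class on both sides of each split.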

Answered by Tom Hale

Pure pandas solution

To split into train / validation / test in the ratio 70 / 20 / 10%:

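random_seed = 42  # assumed value; any fixed integer makes the split reproducible

# 70% of all rows go to training
train_df = df.sample(frac=0.7, random_state=random_seed)
tmp_df = df.drop(train_df.index)
# one third of the remaining 30% (10% overall) goes to test, the rest to validation
test_df = tmp_df.sample(frac=0.33333, random_state=random_seed)
valid_df = tmp_df.drop(test_df.index)

assert len(df) == len(train_df) + len(valid_df) + len(test_df), "Dataset sizes don't add up"
del tmp_df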
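
The code above samples rows without looking at the label column, so the class proportions in each part are only approximately preserved by chance. A stratified variant of the same pattern could sample within each class via groupby(...).sample, which is available in pandas 1.1 and later. This is a sketch under that assumption; the demo frame and the seed value are made up for illustration.

```python
import pandas as pd

random_seed = 42  # assumed value; any fixed integer works

# Hypothetical demo data: 30 positive and 70 negative rows.
df = pd.DataFrame({"diagnosis": ["positive"] * 30 + ["negative"] * 70})

# Draw 70% of every class for training.
train_df = df.groupby("diagnosis").sample(frac=0.7, random_state=random_seed)
tmp_df = df.drop(train_df.index)
# Draw one third of what is left in every class for testing (10% overall).
test_df = tmp_df.groupby("diagnosis").sample(frac=1 / 3, random_state=random_seed)
# The remaining 20% becomes the validation set.
valid_df = tmp_df.drop(test_df.index)

assert len(df) == len(train_df) + len(valid_df) + len(test_df), "Dataset sizes don't add up"
```

As in the original snippet, dropping by index assumes the DataFrame index is unique.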