Stratified splitting of a Pandas DataFrame into training, validation, and test sets

Question

Asked by user1934212

The following extremely simplified DataFrame represents a much larger DataFrame containing medical diagnoses:

import pandas as pd

medicalData = pd.DataFrame({'diagnosis':['positive','positive','negative','negative','positive','negative','negative','negative','negative','negative']})
medicalData

    diagnosis
0   positive
1   positive
2   negative
3   negative
4   positive
5   negative
6   negative
7   negative
8   negative
9   negative

For machine learning, I need to randomly split this DataFrame into three subframes in the following way:

trainingDF, validationDF, testDF = SplitData(medicalData,fractions = [0.6,0.2,0.2])

The fractions array specifies the share of the complete data that goes into each subframe. The subframes must be mutually exclusive, and the fractions must sum to one. Additionally, the fraction of positive diagnoses in each subset needs to be approximately the same.

Answers to this question recommend using the pandas sample method or the train_test_split function from sklearn. But none of these solutions seems to generalize well to n splits, and none provides a stratified split.

Answer 1

Answered by cs95

`np.array_split`

If you want to generalise to n splits, np.array_split is your friend (it works well with DataFrames).

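import numpy as np

fractions = np.array([0.6, 0.2, 0.2])
# shuffle the rows (this split is random, not stratified)
df = df.sample(frac=1)
# cut at the cumulative fractions (60% and 80% of the rows) to get 3 parts
train, val, test = np.array_split(
    df, (fractions[:-1].cumsum() * len(df)).astype(int))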
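
To make the cut-point arithmetic above concrete, here is a minimal sketch on the ten-row example from the question. It is an illustration, not part of the original answer: the seed value and the names `shuffled`, `cuts`, `bounds`, and `parts` are assumptions, and it slices with `iloc` rather than calling `np.array_split`. It also rounds the cut points before casting, because `astype(int)` truncates, so a product that lands a hair below a whole number through floating-point error would otherwise lose a row.

```python
import numpy as np
import pandas as pd

# The toy data from the question: 3 positive and 7 negative diagnoses.
df = pd.DataFrame({"diagnosis": ["positive"] * 3 + ["negative"] * 7})
fractions = np.array([0.6, 0.2, 0.2])

# Shuffle once; the seed is arbitrary and only makes the run repeatable.
shuffled = df.sample(frac=1, random_state=0)

# Cumulative fractions give the row positions where one part ends
# and the next begins. Round before casting to avoid truncation.
cuts = np.round(fractions[:-1].cumsum() * len(shuffled)).astype(int)

# Slice between consecutive boundaries: [0, cut_1, ..., cut_{n-1}, len].
bounds = [0, *cuts, len(shuffled)]
parts = [shuffled.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]
train, val, test = parts

print(len(train), len(val), len(test))
```

The same slicing works for any number of fractions, since `parts` always has one entry per fraction. Like the original snippet, it is a purely random split and does nothing to balance the diagnosis classes.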

`train_test_split`

A long-winded solution using train_test_split for stratified splitting.

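from sklearn.model_selection import train_test_split

# separate the labels from the features (pop removes the column from df)
y = df.pop('diagnosis').to_frame()
X = df

# first split: 60% train, 40% held out, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.4)

# second split: halve the held-out 40% into test and validation (20% each)
X_test, X_val, y_test, y_val = train_test_split(
        X_test, y_test, stratify=y_test, test_size=0.5)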

Here X is a DataFrame of your features, and y is a single-column DataFrame of your labels.
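
The two `test_size` values used in the chained splits follow from the target fractions, and the same arithmetic extends to more than three parts. The sketch below is a hypothetical helper (`chained_test_sizes` is not part of sklearn or of the original answer). It assumes each split keeps one target fraction of the full data as its first output and passes the remainder on to the next split.

```python
def chained_test_sizes(fractions):
    """Return the test_size for each of the len(fractions) - 1 chained splits.

    Hypothetical helper: split i keeps fractions[i] of the FULL data and
    hands everything else on to the next split.
    """
    sizes = []
    remaining = 1.0  # share of the full data not yet assigned to a part
    for fraction in fractions[:-1]:
        # Share of the currently remaining rows that must be held out.
        sizes.append(1.0 - fraction / remaining)
        remaining -= fraction
    return sizes


sizes_60_20_20 = chained_test_sizes([0.6, 0.2, 0.2])
sizes_70_20_10 = chained_test_sizes([0.7, 0.2, 0.1])
print(sizes_60_20_20, sizes_70_20_10)
```

For 0.6 / 0.2 / 0.2 this gives 0.4 for the first split and 0.5 for the second, matching the values used above. For 0.7 / 0.2 / 0.1 the second value is one third, because the last part is one third of the 30% that remains after the training set is removed.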

Answer 2

Answered by Tom Hale

Pure `pandas` solution

To split into train / validation / test in the ratio 70 / 20 / 10%:

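random_seed = 42  # any fixed integer; makes the split reproducible

# 70% of the rows for training (drop-by-index assumes df has a unique index)
train_df = df.sample(frac=0.7, random_state=random_seed)
tmp_df = df.drop(train_df.index)
# one third of the remaining 30% (10% overall) for testing, the rest for validation
test_df = tmp_df.sample(frac=1/3, random_state=random_seed)
valid_df = tmp_df.drop(test_df.index)

assert len(df) == len(train_df) + len(valid_df) + len(test_df), "Dataset sizes don't add up"
del tmp_df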
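
The snippet above samples at random and does not stratify. One way to get the stratified, n-way split the question asks for while staying in pandas is to shuffle and cut each class separately, then concatenate the pieces. The following is a minimal sketch under stated assumptions, not the answer author's method: `stratified_split`, its parameters, and the seed are hypothetical names chosen for illustration, and per-class sizes are rounded, so very small classes can leave a part with no rows of that class.

```python
import numpy as np
import pandas as pd


def stratified_split(df, label_col, fractions, random_state=None):
    """Split df into len(fractions) disjoint frames, one class at a time.

    Hypothetical helper: every class in label_col is shuffled and cut at
    the same cumulative fractions, so each part keeps roughly the
    original class proportions.
    """
    fractions = np.asarray(fractions, dtype=float)
    if not np.isclose(fractions.sum(), 1.0):
        raise ValueError("fractions must sum to 1")

    pieces = [[] for _ in fractions]
    for _, group in df.groupby(label_col):
        shuffled = group.sample(frac=1, random_state=random_state)
        # Row positions where one part ends and the next begins.
        cuts = np.round(fractions[:-1].cumsum() * len(shuffled)).astype(int)
        bounds = [0, *cuts, len(shuffled)]
        for i in range(len(fractions)):
            pieces[i].append(shuffled.iloc[bounds[i]:bounds[i + 1]])
    # Rows in each part come out grouped by class; reshuffle if order matters.
    return [pd.concat(chunks) for chunks in pieces]


# Example data with the same 30% positive rate as the question, but larger.
data = pd.DataFrame({"diagnosis": ["positive"] * 30 + ["negative"] * 70})
train_df, valid_df, test_df = stratified_split(
    data, "diagnosis", [0.6, 0.2, 0.2], random_state=0)

for part in (train_df, valid_df, test_df):
    print(len(part), (part["diagnosis"] == "positive").mean())
```

Because every class is cut at the same cumulative fractions, the parts are mutually exclusive by construction, and the function accepts any number of fractions that sum to one.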
