Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/28556942/
Pandas: Remove rows at random without shuffling dataset
Asked by Black
I've got a dataset which needs to omit a few rows whilst preserving the order of the rows. My idea was to use a mask with a random number between 0 and the length of my dataset, but I'm not sure how to set up a mask without shuffling the rows around, i.e. a method similar to sampling a dataset.
Example: Dataset has 5 rows and 2 columns and I would like to remove a row at random.
Col1 | Col2
A | 1
B | 2
C | 5
D | 4
E | 0
transforms to:
Col1 | Col2
A | 1
B | 2
D | 4
E | 0
with the third row (Col1='C') omitted by a random choice.
How should I go about this?
Answered by cel
The following should work for you. Here I sample remove_n random row ids from df's index without replacement. After that, df.drop removes those rows and returns a new data frame; df itself is left unchanged, and the remaining rows keep their original order.
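import pandas as pd
import numpy as np
np.random.seed(10)  # fix the seed so the example is reproducible
remove_n = 1  # number of rows to remove
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
# sample remove_n distinct index labels to drop
drop_indices = np.random.choice(df.index, remove_n, replace=False)
# drop returns a new DataFrame; the remaining rows stay in order
df_subset = df.drop(drop_indices)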
DataFrame df:
   a  b
0  1  5
1  2  6
2  3  7
3  4  8
DataFrame df_subset:
   a  b
0  1  5
1  2  6
3  4  8
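The question also mentions using a mask. A minimal sketch of that variant follows, assuming NumPy's Generator API (np.random.default_rng) is available. The helper name drop_random_rows and its seed parameter are hypothetical and not part of the original answer. It picks row positions rather than index labels, so it should also behave sensibly when the index contains duplicate labels.

```python
import numpy as np
import pandas as pd


def drop_random_rows(df, remove_n, seed=None):
    """Return a new DataFrame with remove_n random rows removed, order preserved.

    Hypothetical helper for illustration; not from the original answer.
    """
    rng = np.random.default_rng(seed)
    # Choose distinct row positions (0 .. len(df) - 1) to drop.
    drop_positions = rng.choice(len(df), size=remove_n, replace=False)
    # Boolean mask: True for rows to keep, False for rows to drop.
    keep = np.ones(len(df), dtype=bool)
    keep[drop_positions] = False
    # Boolean indexing never reorders rows, so the original order is kept.
    return df[keep]


# The 5-row example from the question.
df = pd.DataFrame({"Col1": list("ABCDE"), "Col2": [1, 2, 5, 4, 0]})
df_subset = drop_random_rows(df, remove_n=1, seed=0)
print(df_subset)
```

Both approaches leave df untouched. The mask version may be preferable when the index is not unique, since df.drop removes every row that shares a dropped label.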

