Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/18713929/

Time: 2020-09-13 21:08:31  Source: igfitidea

Subsample pandas dataframe

python numpy pandas subsampling

Asked by Nishant

I have a DataFrame loaded from a .tsv file. I wanted to generate some exploratory plots. The problem is that the data set is large (~1 million rows), so there are too many points on the plot to see a trend. Plus, it takes a while to plot.

I wanted to sub-sample 10000 randomly distributed rows. This should be reproducible so the same sequence of random numbers is generated in each run.


This question, "Sample two pandas dataframes the same way", seems to be on the right track, but I cannot guarantee the subsample size.

Answered by joris

You can select random elements from your index with np.random.choice. E.g., to select 5 random rows:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10))

df.loc[np.random.choice(df.index, 5, replace=False)]

np.random.choice is new in NumPy 1.7. If you need a solution for an older NumPy, you can shuffle the index and take the first elements of the result:

df.loc[np.random.permutation(df.index)[:5]]

This way your DataFrame is no longer sorted, but if sorted order is needed for plotting (e.g. for a line plot), you can simply call .sort() afterwards (.sort_index() in current pandas versions, where DataFrame.sort has been removed).
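
As a self-contained sketch of the approach above, assuming a synthetic one-column frame (the column name `x`, the frame size, and the seed 0 are illustrative choices, not from the answer) and a seeded `np.random.RandomState` so that reruns pick the same rows:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)            # fixed seed (assumption) so reruns match
df = pd.DataFrame({"x": rng.rand(1000)})  # synthetic stand-in for the real data

# Pick 5 distinct index labels, then select those rows.
picked = rng.choice(df.index, 5, replace=False)
subsample = df.loc[picked]

# Restore index order, e.g. before drawing a line plot.
subsample_sorted = subsample.sort_index()
```

The subsample comes back in the random order of `picked`; `sort_index()` is only needed when the plot depends on row order.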

Answered by Andy Hayden

Unfortunately, np.random.choice appears to be quite slow for small samples (less than 10% of all rows); you may be better off using plain ol' sample:

from random import sample
df.loc[sample(df.index, 1000)]
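
On recent Python versions `random.sample` is stricter about its population argument and may reject a pandas Index (Python 3.11 requires a sequence). A hedged variant, assuming a synthetic frame and an illustrative seed, converts the index to a list first:

```python
import random

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10_000)})  # synthetic stand-in data

random.seed(0)  # fixed seed (assumption) for reproducible picks
# list() turns the index into a plain sequence that random.sample accepts.
labels = random.sample(list(df.index), 1000)
subsample = df.loc[labels]
```

Building the list has its own cost on a million-row index, so timings measured with the original call would not necessarily carry over to this variant.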

For a large DataFrame (a million rows), here are the timings for small samples:

In [11]: %timeit df.loc[sample(df.index, 10)]
1000 loops, best of 3: 1.19 ms per loop

In [12]: %timeit df.loc[np.random.choice(df.index, 10, replace=False)]
1 loops, best of 3: 1.36 s per loop

In [13]: %timeit df.loc[np.random.permutation(df.index)[:10]]
1 loops, best of 3: 1.38 s per loop

In [21]: %timeit df.loc[sample(df.index, 1000)]
10 loops, best of 3: 14.5 ms per loop

In [22]: %timeit df.loc[np.random.choice(df.index, 1000, replace=False)]
1 loops, best of 3: 1.28 s per loop

In [23]: %timeit df.loc[np.random.permutation(df.index)[:1000]]
1 loops, best of 3: 1.3 s per loop

But at around 10% of the rows, the timings are about the same:

In [31]: %timeit df.loc[sample(df.index, 100000)]
1 loops, best of 3: 1.63 s per loop

In [32]: %timeit df.loc[np.random.choice(df.index, 100000, replace=False)]
1 loops, best of 3: 1.36 s per loop

In [33]: %timeit df.loc[np.random.permutation(df.index)[:100000]]
1 loops, best of 3: 1.4 s per loop

And if you are sampling everything (don't use sample!):

In [41]: %timeit df.loc[sample(df.index, 1000000)]
1 loops, best of 3: 10 s per loop

Note: both numpy.random and random accept a seed, which lets you reproduce randomly generated output.
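
A minimal sketch of seeding both modules; the seed value 42 and the population size are arbitrary assumptions for illustration:

```python
import random

import numpy as np

# Standard-library generator: reseeding replays the same picks.
random.seed(42)
first_py = random.sample(range(100), 5)
random.seed(42)
second_py = random.sample(range(100), 5)

# NumPy global generator: same idea with np.random.seed.
np.random.seed(42)
first_np = np.random.permutation(100)[:5]
np.random.seed(42)
second_np = np.random.permutation(100)[:5]

print(first_py == second_py)                 # True
print(np.array_equal(first_np, second_np))   # True
```

The two modules keep separate state, so seeding one does not affect the other.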

As @joris points out in the comments, choice (without replacement) is actually sugar for permutation, so it's no surprise that it takes constant time and is slower for smaller samples...
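
A small check of that claim, assuming NumPy's legacy `RandomState` interface (the newer `Generator` returned by `np.random.default_rng` uses different sampling code, so this is not expected to hold there). With the same seed, `choice` without replacement and a sliced `permutation` should pick the same elements; the population and seed are illustrative:

```python
import numpy as np

values = np.arange(100, 200)  # illustrative population

# Same seed for both generators so their random streams start identically.
via_choice = np.random.RandomState(0).choice(values, 5, replace=False)
via_permutation = values[np.random.RandomState(0).permutation(len(values))[:5]]

same = np.array_equal(via_choice, via_permutation)
print(same)
```

Either way the whole index is shuffled, which is why the cost barely depends on how many rows are requested.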

Answered by Alex Coventry

These days, one can simply use the sample method on a DataFrame:

>>> help(df.sample)
Help on method sample in module pandas.core.generic:

sample(self, n=None, frac=None, replace=False, weights=None, random_state=None, axis=None) method of pandas.core.frame.DataFrame instance
    Returns a random sample of items from an axis of object.

Replicability can be achieved by using the random_state keyword:

>>> len(set(df.sample(n=1, random_state=np.random.RandomState(0)).iterations.values[0] for _ in xrange(1000)))
1
>>> len(set(df.sample(n=1).iterations.values[0] for _ in xrange(1000)))
40
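
Applied to the original question, a hedged sketch might look like the following. It assumes a synthetic frame (smaller than the ~1 million rows in the question, to keep it quick) and an arbitrary integer seed:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in; the column name "x" is an illustrative assumption.
df = pd.DataFrame({"x": np.arange(100_000)})

# An integer random_state makes the draw repeatable across runs.
sample_a = df.sample(n=10_000, random_state=0)
sample_b = df.sample(n=10_000, random_state=0)

print(len(sample_a))              # 10000
print(sample_a.equals(sample_b))  # True

# Sort by index if the plot depends on row order.
ordered = sample_a.sort_index()
```

Unlike a fraction-based filter, `n=10_000` fixes the subsample size exactly, which was the sticking point in the question.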