pandas: subsample a pandas DataFrame

Question

Asked by Nishant

I have a DataFrame loaded from a .tsv file. I wanted to generate some exploratory plots. The problem is that the data set is large (~1 million rows), so there are too many points on the plot to see a trend. Plus, it takes a while to plot.

I wanted to sub-sample 10000 randomly distributed rows. This should be reproducible so the same sequence of random numbers is generated in each run.

The question "Sample two pandas dataframes the same way" seems to be on the right track, but with that approach I cannot guarantee the subsample size.

Answer 1

Answered by joris

You can select random elements from your index with np.random.choice. For example, to select 5 random rows:

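import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10))

# pick 5 distinct index labels at random, then select those rows
df.loc[np.random.choice(df.index, 5, replace=False)]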

np.random.choice is new in NumPy 1.7. If you need a solution that works with an older NumPy, you can shuffle the index and take its first elements:

df.loc[np.random.permutation(df.index)[:5]]

After this, the DataFrame is no longer in its original order. If you need that order for plotting (for a line plot, for example), you can simply call .sort() afterwards (.sort_index() in current pandas, where DataFrame.sort has been removed).
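
As a minimal sketch of the sample-then-sort step described above: the column names, sizes, and the fixed seed below are assumptions made for illustration, and `sort_index()` is used in place of the older `.sort()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the real DataFrame.
df = pd.DataFrame({"x": np.arange(1000), "y": np.arange(1000) * 0.5})

# A seeded generator is used only so that this sketch is repeatable.
rng = np.random.RandomState(0)

# Draw 100 distinct index labels and select the matching rows.
picked = df.loc[rng.choice(df.index, 100, replace=False)]

# Restore the original row order before drawing a line plot.
ordered = picked.sort_index()
```

Sorting by index only recovers the original order when the index itself was ordered to begin with, as it is for the default integer index used here.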

Answer 2

Answered by Andy Hayden

Unfortunately, np.random.choice appears to be quite slow for small samples (less than 10% of all rows). You may be better off using plain ol' random.sample:

from random import sample
df.loc[sample(df.index, 1000)]
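
A variant of the same idea that samples row positions instead of index labels is sketched below. It rests on two assumptions: the DataFrame here is a hypothetical stand-in, and, to my understanding, recent Python versions require the population passed to `random.sample` to be a sequence, which a pandas Index may not qualify as. A `range` object always does.

```python
import random

import numpy as np
import pandas as pd

# Hypothetical one-million-row frame, similar in size to the one timed in this answer.
df = pd.DataFrame({"value": np.arange(1_000_000)})

# range(len(df)) is a lightweight sequence, so nothing is copied before sampling.
positions = random.sample(range(len(df)), 1000)

# Positions are passed to .iloc, which works whatever the index labels are.
subset = df.iloc[positions]
```

Because positions rather than labels are drawn, the result has exactly 1000 distinct rows even if the index contains duplicate labels.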

For a large DataFrame (a million rows), small samples give these timings:

In [11]: %timeit df.loc[sample(df.index, 10)]
1000 loops, best of 3: 1.19 ms per loop

In [12]: %timeit df.loc[np.random.choice(df.index, 10, replace=False)]
1 loops, best of 3: 1.36 s per loop

In [13]: %timeit df.loc[np.random.permutation(df.index)[:10]]
1 loops, best of 3: 1.38 s per loop

In [21]: %timeit df.loc[sample(df.index, 1000)]
10 loops, best of 3: 14.5 ms per loop

In [22]: %timeit df.loc[np.random.choice(df.index, 1000, replace=False)]
1 loops, best of 3: 1.28 s per loop

In [23]: %timeit df.loc[np.random.permutation(df.index)[:1000]]
1 loops, best of 3: 1.3 s per loop

But at around 10% of the rows, the timings are about the same:

In [31]: %timeit df.loc[sample(df.index, 100000)]
1 loops, best of 3: 1.63 s per loop

In [32]: %timeit df.loc[np.random.choice(df.index, 100000, replace=False)]
1 loops, best of 3: 1.36 s per loop

In [33]: %timeit df.loc[np.random.permutation(df.index)[:100000]]
1 loops, best of 3: 1.4 s per loop

And if you are sampling everything (don't use sample!):

In [41]: %timeit df.loc[sample(df.index, 1000000)]
1 loops, best of 3: 10 s per loop

Note: both numpy.random and random accept a seed, which makes the randomly generated output reproducible.
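
A minimal sketch of seeding both modules follows. The seed value 0 and the tiny population of 10 are arbitrary choices for illustration.

```python
import random

import numpy as np

# Re-seeding NumPy's global generator replays the same permutation.
np.random.seed(0)
first_np = np.random.permutation(10)[:3]
np.random.seed(0)
second_np = np.random.permutation(10)[:3]

# The standard-library generator is seeded separately from NumPy's.
random.seed(0)
first_py = random.sample(range(10), 3)
random.seed(0)
second_py = random.sample(range(10), 3)

print(np.array_equal(first_np, second_np), first_py == second_py)
```

The two generators are independent, so seeding one has no effect on the other.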

As @joris points out in the comments, choice (without replacement) is actually sugar for permutation, so it's no surprise that its running time does not depend on the sample size and that it is comparatively slow for smaller samples.
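
The relationship between the two calls can be checked with a small sketch like the one below. It assumes NumPy's legacy `RandomState` interface, where, as far as I know, `choice` without replacement and without weights takes the head of a full permutation. The population size and seed are arbitrary.

```python
import numpy as np

population_size = 1000  # arbitrary size for illustration
sample_size = 10
seed = 0

# Two generators in identical states, one per call being compared.
via_choice = np.random.RandomState(seed).choice(
    population_size, sample_size, replace=False
)
via_permutation = np.random.RandomState(seed).permutation(population_size)[:sample_size]

# Compare the two draws element by element.
same = np.array_equal(via_choice, via_permutation)
print(same)
```

If the two results match, every draw pays for shuffling the whole population, which is consistent with the flat timings of about 1.3 s above.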

Answer 3

Answered by Alex Coventry

These days, one can simply use the sample method on a DataFrame:

>>> help(df.sample)
Help on method sample in module pandas.core.generic:

sample(self, n=None, frac=None, replace=False, weights=None, random_state=None, axis=None) method of pandas.core.frame.DataFrame instance
    Returns a random sample of items from an axis of object.

Replicability can be achieved by using the random_state keyword:

>>> len(set(df.sample(n=1, random_state=np.random.RandomState(0)).iterations.values[0] for _ in range(1000)))
1
>>> len(set(df.sample(n=1).iterations.values[0] for _ in range(1000)))
40
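
Applied to the original question, a reproducible 10000-row subsample might look like the sketch below. The DataFrame is a hypothetical stand-in for the one loaded from the .tsv file, and the integer seed 0 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical one-million-row frame standing in for the loaded data.
df = pd.DataFrame({"x": np.random.rand(1_000_000)})

# n fixes the subsample size; an integer random_state fixes the rows drawn.
sub_a = df.sample(n=10_000, random_state=0)
sub_b = df.sample(n=10_000, random_state=0)

# frac draws a fraction of the rows instead of an absolute count.
tenth = df.sample(frac=0.1, random_state=0)
```

Using `n` addresses the concern about guaranteeing the subsample size, while `frac` is convenient when the total row count is not known in advance.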
