Drawing a bootstrap sample from a pandas.DataFrame

Question

Asked by Till Hoffmann

I would like to draw a bootstrap sample of a pandas.DataFrame as efficiently as possible. Using the built-in iloc together with a list of integers seems to be slow:

import pandas
import numpy as np
# Generate some data
n = 5000
values = np.random.uniform(size=(n, 5))
# Construct a pandas.DataFrame
columns = ['a', 'b', 'c', 'd', 'e']
df = pandas.DataFrame(values, columns=columns)
# Bootstrap
%timeit df.iloc[np.random.randint(n, size=n)]
# Out: 1000 loops, best of 3: 1.46 ms per loop

Indexing the numpy array is of course much faster:

%timeit values[np.random.randint(n, size=n)]
# Out: 10000 loops, best of 3: 159 μs per loop

But even extracting the values, sampling the numpy array, and constructing a new pandas.DataFrame is faster than iloc:

%timeit pandas.DataFrame(df.values[np.random.randint(n, size=n)], columns=columns)
# Out: 1000 loops, best of 3: 302 μs per loop
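The extract, sample, and rebuild approach above could be wrapped up roughly as follows. This is a minimal sketch, not a definitive implementation: the function name `bootstrap_via_numpy` and the `rng` parameter are illustrative, and it assumes every column shares one numeric dtype (with mixed dtypes, `df.values` becomes an object array and the column dtypes are lost). Unlike the one-liner above, it also carries the original row labels over to the new frame.

```python
import numpy as np
import pandas


def bootstrap_via_numpy(df, rng=None):
    """Resample the rows of df with replacement via the underlying array.

    Hypothetical helper; assumes all columns share one numeric dtype.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(df)
    # Row positions of the resample, drawn with replacement
    idx = rng.integers(0, n, size=n)
    # Fancy-index the NumPy array, then rebuild the frame,
    # keeping the row labels of the sampled rows
    return pandas.DataFrame(df.values[idx], index=df.index[idx], columns=df.columns)


df = pandas.DataFrame(np.random.uniform(size=(5000, 5)), columns=list("abcde"))
boot = bootstrap_via_numpy(df, rng=np.random.default_rng(0))
```

Passing a seeded generator makes the resample repeatable, which can help when comparing bootstrap estimates across runs.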

@JohnE suggested sample, which is unfortunately even slower:

%timeit df.sample(n, replace=True)
# Out: 100 loops, best of 3: 5.14 ms per loop
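For reference, the `sample` call can be made repeatable by passing `random_state`. The sketch below only illustrates what the call returns, assuming the same shape of data as in the question; the seed value and the variable `n_unique` are illustrative.

```python
import numpy as np
import pandas

n = 5000
df = pandas.DataFrame(np.random.uniform(size=(n, 5)), columns=list("abcde"))

# replace=True draws rows with replacement, as a bootstrap requires;
# random_state makes the draw repeatable
boot = df.sample(n=n, replace=True, random_state=0)

# Some rows are drawn more than once, so fewer than n distinct labels appear
n_unique = boot.index.nunique()
```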

@firelynx suggested merge:

%timeit df.merge(pandas.DataFrame(index=np.random.randint(n, size=n)), left_index=True, right_index=True, how='right')
# Out: 1000 loops, best of 3: 1.23 ms per loop

Does anyone have an idea why iloc is so slow, and/or whether there are better alternatives than extracting the values, sampling, and then constructing a new pandas.DataFrame?

Answer 1

Answered by firelynx

The merge method in pandas is fairly well optimized, so I tried my luck with it, and it gave me a significant speed increase. My machine is a bit slower than yours and I'm using pandas 0.15.2, so the numbers may be a bit different.

%timeit df.iloc[np.random.randint(n, size=n)]
# 100 loops, best of 3: 2.41 ms per loop

randlist = pandas.DataFrame(index=np.random.randint(n, size=n))
%timeit df.merge(randlist, left_index=True, right_index=True, how='right')
# 1000 loops, best of 3: 1.87 ms per loop

%timeit df.merge(pandas.DataFrame(index=np.random.randint(n, size=n)), left_index=True, right_index=True, how='right')
# 100 loops, best of 3: 2.29 ms per loop
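Since these timings depend on the machine and the pandas version, it may be worth re-running the comparison locally. The following is a rough sketch using the standard-library `timeit` module instead of the IPython `%timeit` magic; the helper `draw`, the `candidates` dictionary, and the repeat counts are illustrative choices, not part of the original posts.

```python
import timeit

import numpy as np
import pandas

n = 5000
columns = ["a", "b", "c", "d", "e"]
df = pandas.DataFrame(np.random.uniform(size=(n, 5)), columns=columns)


def draw():
    # n row positions drawn with replacement
    return np.random.randint(n, size=n)


# The four approaches discussed in this thread
candidates = {
    "iloc": lambda: df.iloc[draw()],
    "numpy + rebuild": lambda: pandas.DataFrame(df.values[draw()], columns=columns),
    "sample": lambda: df.sample(n, replace=True),
    "merge": lambda: df.merge(
        pandas.DataFrame(index=draw()),
        left_index=True, right_index=True, how="right",
    ),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats of 100 calls each, converted to microseconds per call
    best = min(timeit.repeat(func, number=100, repeat=3))
    timings[name] = best / 100 * 1e6

for name, us in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:>16}: {us:8.1f} us per loop")
```

Each candidate includes the cost of drawing the random indices, as the one-line benchmarks above do.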

Answer 2

Answered by tmthydvnprt

Indexing Speeds

In my tests, boolean indexing was slightly faster:

Boolean Indexing

%timeit -n10000 df[np.random.randint(2, size=n).astype(bool)]
# 10000 loops, best of 3: 307 μs per loop
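One caveat when reading this timing: a boolean mask can select each row at most once, so it returns a random subset (roughly half the rows here) rather than n rows drawn with replacement. The sketch below contrasts the two; the seed and variable names are illustrative.

```python
import numpy as np
import pandas

n = 5000
df = pandas.DataFrame(np.random.uniform(size=(n, 5)), columns=list("abcde"))
rng = np.random.default_rng(0)

# Boolean mask: each row is either kept once or dropped
mask = rng.integers(0, 2, size=n).astype(bool)
subset = df[mask]

# Integer positions drawn with replacement: a bootstrap sample
boot = df.iloc[rng.integers(0, n, size=n)]

# The subset never repeats a row; the bootstrap sample does
subset_has_repeats = subset.index.has_duplicates
boot_has_repeats = boot.index.has_duplicates
```

If the goal is a bootstrap in the usual sense (same size as the data, sampled with replacement), the two timings are therefore not measuring the same operation.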

`numpy` sampling & re-`DataFrame`ing

%timeit -n10000 pandas.DataFrame(df.values[np.random.randint(n, size=n)], columns=columns)
# 10000 loops, best of 3: 380 μs per loop
