Python 如何有效地迭代 Pandas 数据帧的连续块
声明:本页面是 StackOverFlow 热门问题的中英对照翻译,遵循 CC BY-SA 4.0 协议。如果您需要使用它,必须同样遵循 CC BY-SA 许可,注明原文地址和作者信息,并将内容归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/25699439/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
How to iterate over consecutive chunks of Pandas dataframe efficiently
提问 by Andrew Clegg
I have a large dataframe (several million rows).
我有一个大数据框(几百万行)。
I want to be able to do a groupby operation on it, but just grouping by arbitrary consecutive (preferably equal-sized) subsets of rows, rather than using any particular property of the individual rows to decide which group they go to.
我希望能够对其进行 groupby 操作,但只是按行的任意连续(最好是相等大小)子集进行分组,而不是使用单个行的任何特定属性来决定它们去哪个组。
The use case: I want to apply a function to each row via a parallel map in IPython. It doesn't matter which rows go to which back-end engine, as the function calculates a result based on one row at a time. (Conceptually at least; in reality it's vectorized.)
用例:我想通过 IPython 中的并行映射将函数应用于每一行。哪些行进入哪个后端引擎并不重要,因为该函数一次基于一行计算结果。(至少在概念上;实际上它是矢量化的。)
I've come up with something like this:
我想出了类似这样的做法:
# Generate a number from 0-9 for each row, indicating which tenth of the DF it belongs to
max_idx = dataframe.index.max()
tenths = ((10 * dataframe.index) / (1 + max_idx)).astype(np.uint32)
# Use this value to perform a groupby, yielding 10 consecutive chunks
groups = [g[1] for g in dataframe.groupby(tenths)]
# Process chunks in parallel
results = dview.map_sync(my_function, groups)
But this seems very long-winded, and doesn't guarantee equal-sized chunks, especially if the index is sparse, non-integer, or otherwise irregular.
但这似乎很啰嗦,而且不能保证各块大小相等,尤其是当索引稀疏、非整数或有其他特殊情况时。
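To make that complaint concrete, here is a small made-up illustration (the index values are invented for the example): with a sparse index, the "tenths" trick produces very unequal groups:
为了把这个问题说得更具体,下面是一个虚构的小例子(索引值是为演示随意取的):当索引稀疏时,上面按十分位分组的技巧会产生大小非常不均的组:
import numpy as np
import pandas as pd
# Sparse index: five small values plus one large one.
dataframe = pd.DataFrame({'x': range(6)}, index=[0, 1, 2, 3, 4, 1000])
max_idx = dataframe.index.max()
tenths = ((10 * dataframe.index) / (1 + max_idx)).astype(np.uint32)
print(tenths.tolist())  # [0, 0, 0, 0, 0, 9] -- five rows land in one group, one in another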
Any suggestions for a better way?
对更好的方法有什么建议吗?
Thanks!
谢谢!
采纳答案 by DSM
In practice, you can't guarantee equal-sized chunks. The number of rows (N) might be prime, in which case you could only get equal-sized chunks at 1 or N. Because of this, real-world chunking typically uses a fixed size and allows for a smaller chunk at the end. I tend to pass an array to groupby. Starting from:
实际上,您无法保证各块大小相等。行数 (N) 可能是质数,这种情况下只有块大小为 1 或 N 时才能等分。因此,现实中的分块通常使用固定大小,并允许最后一块较小。我倾向于把一个数组传给 groupby。从下面的数据开始:
>>> df = pd.DataFrame(np.random.rand(15, 5), index=[0]*15)
>>> df[0] = range(15)
>>> df
0 1 2 3 4
0 0 0.746300 0.346277 0.220362 0.172680
0 1 0.657324 0.687169 0.384196 0.214118
0 2 0.016062 0.858784 0.236364 0.963389
[...]
0 13 0.510273 0.051608 0.230402 0.756921
0 14 0.950544 0.576539 0.642602 0.907850
[15 rows x 5 columns]
where I've deliberately made the index uninformative by setting it to 0, we simply decide on our size (here 10) and integer-divide an array by it:
(这里我故意把索引全部设为 0,让它不携带任何信息)我们只需确定块大小(这里是 10),然后用它对一个数组做整除:
>>> df.groupby(np.arange(len(df))//10)
<pandas.core.groupby.DataFrameGroupBy object at 0xb208492c>
>>> for k,g in df.groupby(np.arange(len(df))//10):
... print(k,g)
...
0 0 1 2 3 4
0 0 0.746300 0.346277 0.220362 0.172680
0 1 0.657324 0.687169 0.384196 0.214118
0 2 0.016062 0.858784 0.236364 0.963389
[...]
0 8 0.241049 0.246149 0.241935 0.563428
0 9 0.493819 0.918858 0.193236 0.266257
[10 rows x 5 columns]
1 0 1 2 3 4
0 10 0.037693 0.370789 0.369117 0.401041
0 11 0.721843 0.862295 0.671733 0.605006
[...]
0 14 0.950544 0.576539 0.642602 0.907850
[5 rows x 5 columns]
Methods based on slicing the DataFrame can fail when the index isn't compatible with that, although you can always use .iloc[a:b] to ignore the index values and access data by position.
当索引与其不兼容时,基于切片 DataFrame 的方法可能会失败,不过您始终可以使用 .iloc[a:b] 忽略索引值,按位置访问数据。
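For illustration, a minimal sketch of that positional approach might look like the following, assuming a chunk size of 4 (df and chunk_size here are placeholder names, not taken from the answer above):
作为示意,下面是这种按位置切片思路的一个最小示例,假设块大小为 4(这里的 df 和 chunk_size 只是演示用的占位名称,并非来自上面的回答):
import numpy as np
import pandas as pd
# Illustrative sketch: slice purely by position with .iloc, so a repeated
# or non-integer index does not matter.
df = pd.DataFrame(np.random.rand(15, 5), index=[0] * 15)
chunk_size = 4
chunks = [df.iloc[pos:pos + chunk_size] for pos in range(0, len(df), chunk_size)]
for part in chunks:
    print(len(part))  # prints 4, 4, 4, 3 -- the last chunk is smaller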
回答 by Ryan
I'm not sure if this is exactly what you want, but I found these grouper functions on another SO thread fairly useful for doing a multiprocessor pool.
我不确定这是否正是您想要的,但我发现另一个 SO 帖子里的这些 grouper 函数在配合多进程池时非常有用。
Here's a short example from that thread, which might do something like what you want:
这是该线程中的一个简短示例,它可能会执行您想要的操作:
import numpy as np
import pandas as pds

df = pds.DataFrame(np.random.rand(14, 4), columns=['a', 'b', 'c', 'd'])

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

for i in chunker(df, 5):
    print(i)
Which gives you something like this:
这给了你这样的东西:
a b c d
0 0.860574 0.059326 0.339192 0.786399
1 0.029196 0.395613 0.524240 0.380265
2 0.235759 0.164282 0.350042 0.877004
3 0.545394 0.881960 0.994079 0.721279
4 0.584504 0.648308 0.655147 0.511390
a b c d
5 0.276160 0.982803 0.451825 0.845363
6 0.728453 0.246870 0.515770 0.343479
7 0.971947 0.278430 0.006910 0.888512
8 0.044888 0.875791 0.842361 0.890675
9 0.200563 0.246080 0.333202 0.574488
a b c d
10 0.971125 0.106790 0.274001 0.960579
11 0.722224 0.575325 0.465267 0.258976
12 0.574039 0.258625 0.469209 0.886768
13 0.915423 0.713076 0.073338 0.622967
I hope that helps.
我希望这有帮助。
EDIT
编辑
In this case, I used this function with a pool of processors in (approximately) this manner:
在这种情况下,我以(大约)这种方式将此函数与处理器池一起使用:
from multiprocessing import Pool

nprocs = 4
pool = Pool(nprocs)

for chunk in chunker(df, nprocs):
    data = pool.map(myfunction, chunk)
    data.domorestuff()
I assume this should be very similar to using the IPython distributed machinery, but I haven't tried it.
我认为这应该与使用 IPython 分布式机器非常相似,但我还没有尝试过。
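For completeness, a rough, untested sketch of the same idea with IPython's parallel machinery (ipyparallel) might look like the following; it assumes a cluster has already been started (e.g. with ipcluster start -n 4), chunker() is defined as above, and my_function is the per-chunk function from the question:
作为补充,用 IPython 的并行组件(ipyparallel)实现同样思路的一个粗略、未经测试的示例大致如下;它假设集群已经启动(例如通过 ipcluster start -n 4),chunker() 按上面的定义,my_function 是题主提到的按块处理的函数:
from ipyparallel import Client

rc = Client()          # connect to the running cluster
dview = rc[:]          # a DirectView over all engines
chunks = list(chunker(df, 5))                  # chunker() as defined above
results = dview.map_sync(my_function, chunks)  # one chunk per task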
回答 by Miles
A sign of a good environment is many choices, so I'll add this from Anaconda Blaze, really using Odo.
一个好的环境的标志就是有很多选择,所以我再补充一个来自 Anaconda Blaze 的方法,实际上用的是 Odo。
import blaze as bz
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [2, 4, 6, 8, 10]})

for chunk in bz.odo(df, target=bz.chunks(pd.DataFrame), chunksize=2):
    # Do stuff with chunked dataframe, e.g.:
    print(chunk)
回答 by Ivelin
Use numpy's array_split():
使用 numpy 的 array_split():
import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.rand(10, 3))
for chunk in np.array_split(data, 5):
    assert len(chunk) == len(data) / 5
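Note that the assert above only passes because the 10 rows split evenly into 5 chunks of 2; with a length that is not an exact multiple, np.array_split does not raise, it simply makes some chunks one row shorter. Continuing the snippet above:
注意,上面的 assert 能通过只是因为 10 行恰好能平均分成 5 块(每块 2 行);当长度不能整除时,np.array_split 不会报错,只是让某些块少一行。接着上面的代码片段:
# Uneven split: 10 rows into 4 parts gives chunks of 3, 3, 2 and 2 rows.
sizes = [len(chunk) for chunk in np.array_split(data, 4)]
print(sizes)  # [3, 3, 2, 2]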
回答 by Andrei Krivoshei
Chunks generator function for iterating pandas DataFrames and Series
用于迭代 pandas DataFrame 和 Series 的分块生成器函数
A generator version of the chunk function is presented below. Moreover, this version works with a custom index on the pd.DataFrame or pd.Series (e.g. a float-type index).
分块函数的生成器版本如下所示。此外,此版本也适用于带自定义索引(例如浮点型索引)的 pd.DataFrame 或 pd.Series。
import numpy as np
import pandas as pd

df_sz = 14

df = pd.DataFrame(np.random.rand(df_sz, 4),
                  index=np.linspace(0., 10., num=df_sz),
                  columns=['a', 'b', 'c', 'd'])

def chunker(seq, size):
    for pos in range(0, len(seq), size):
        yield seq.iloc[pos:pos + size]

chunk_size = 6
for i in chunker(df, chunk_size):
    print(i)

chnk = chunker(df, chunk_size)
print('\n', chnk)
print(next(chnk))
print(next(chnk))
print(next(chnk))
The output is
输出是
a b c d
0.000000 0.560627 0.665897 0.683055 0.611884
0.769231 0.241871 0.357080 0.841945 0.340778
1.538462 0.065009 0.234621 0.250644 0.552410
2.307692 0.431394 0.235463 0.755084 0.114852
3.076923 0.173748 0.189739 0.148856 0.031171
3.846154 0.772352 0.697762 0.557806 0.254476
a b c d
4.615385 0.901200 0.977844 0.250316 0.957408
5.384615 0.400939 0.520841 0.863015 0.177043
6.153846 0.356927 0.344220 0.863067 0.400573
6.923077 0.375417 0.156420 0.897889 0.810083
7.692308 0.666371 0.152800 0.482446 0.955556
8.461538 0.242711 0.421591 0.005223 0.200596
a b c d
9.230769 0.735748 0.402639 0.527825 0.595952
10.000000 0.420209 0.365231 0.966829 0.514409
<generator object chunker at 0x7f503c9d0ba0>
First "next()":
a b c d
0.000000 0.560627 0.665897 0.683055 0.611884
0.769231 0.241871 0.357080 0.841945 0.340778
1.538462 0.065009 0.234621 0.250644 0.552410
2.307692 0.431394 0.235463 0.755084 0.114852
3.076923 0.173748 0.189739 0.148856 0.031171
3.846154 0.772352 0.697762 0.557806 0.254476
Second "next()":
a b c d
4.615385 0.901200 0.977844 0.250316 0.957408
5.384615 0.400939 0.520841 0.863015 0.177043
6.153846 0.356927 0.344220 0.863067 0.400573
6.923077 0.375417 0.156420 0.897889 0.810083
7.692308 0.666371 0.152800 0.482446 0.955556
8.461538 0.242711 0.421591 0.005223 0.200596
Third "next()":
a b c d
9.230769 0.735748 0.402639 0.527825 0.595952
10.000000 0.420209 0.365231 0.966829 0.514409
回答 by wllbll
import pandas as pd

def batch(iterable, batch_num=10):
    """
    Split an iterable into mini-batches of length batch_num.
    Also supports batching a DataFrame.

    usage:
        for i in batch([1, 2, 3, 4, 5], batch_num=2):
            print(i)

        for idx, mini_data in enumerate(batch(df, batch_num=10)):
            print(idx)
            print(mini_data)
    """
    l = len(iterable)
    for idx in range(0, l, batch_num):
        if isinstance(iterable, pd.DataFrame):
            # A DataFrame should be sliced by position here, not by index label
            yield iterable.iloc[idx:min(idx + batch_num, l)]
        else:
            yield iterable[idx:min(idx + batch_num, l)]
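As a quick usage sketch (the 25-row DataFrame below is made up purely for illustration), batching it by 10 yields chunks of 10, 10 and 5 rows:
作为一个简单的用法示例(下面这个 25 行的 DataFrame 纯属演示虚构),按每批 10 行进行分批会得到 10、10、5 行的三个块:
import numpy as np

# Illustrative usage of batch() defined above.
df = pd.DataFrame(np.random.rand(25, 3), columns=['a', 'b', 'c'])
for idx, mini_df in enumerate(batch(df, batch_num=10)):
    print(idx, len(mini_df))   # 0 10 / 1 10 / 2 5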

