Pandas groupby apply performing slow
原文地址: http://stackoverflow.com/questions/33498061/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverflow
Pandas groupby apply performing slow
Asked by Alex
I am working on a program that involves large amounts of data. I am using the python pandas module to look for errors in my data. This usually works very fast. However, the current piece of code I wrote seems to be far slower than it should be, and I am looking for a way to speed it up.
In order for you guys to properly test it I uploaded a rather large piece of code. You should be able to run it as is. The comments in the code should explain what I am trying to do here. Any help would be greatly appreciated.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
# Filling dataframe with data
# Just ignore this part for now, real data comes from csv files, this is an example of how it looks
TimeOfDay_options = ['Day','Evening','Night']
TypeOfCargo_options = ['Goods','Passengers']
np.random.seed(1234)
n = 10000
df = pd.DataFrame()
df['ID_number'] = np.random.randint(3, size=n)
df['TimeOfDay'] = np.random.choice(TimeOfDay_options, size=n)
df['TypeOfCargo'] = np.random.choice(TypeOfCargo_options, size=n)
df['TrackStart'] = np.random.randint(400, size=n) * 900
df['SectionStart'] = np.nan
df['SectionStop'] = np.nan
grouped_df = df.groupby(['ID_number','TimeOfDay','TypeOfCargo','TrackStart'])
for index, group in grouped_df:
    if len(group) == 1:
        df.loc[group.index, 'SectionStart'] = group['TrackStart']
        df.loc[group.index, 'SectionStop'] = group['TrackStart'] + 899
    if len(group) > 1:
        track_start = group.loc[group.index[0], 'TrackStart']
        track_end = track_start + 899
        section_stops = np.random.randint(track_start, track_end, size=len(group))
        section_stops[-1] = track_end
        section_stops = np.sort(section_stops)
        section_starts = np.insert(section_stops, 0, track_start)
        for i, start, stop in zip(group.index, section_starts, section_stops):
            df.loc[i, 'SectionStart'] = start
            df.loc[i, 'SectionStop'] = stop
#%% This is what a random group looks like without errors
#Note that each section neatly starts where the previous section ended
#There are no gaps (The whole track is defined)
grouped_df.get_group((2, 'Night', 'Passengers', 323100))
#%% Introducing errors to the data
df.loc[2640,'SectionStart'] += 100
df.loc[5390,'SectionStart'] += 7
#%% This is what the same group looks like after introducing errors
#Note that the 'SectionStop' of row 1525 is no longer similar to the 'SectionStart' of row 2640
#This track now has a gap of 100, it is not completely defined from start to end
grouped_df.get_group((2, 'Night', 'Passengers', 323100))
#%% Try to locate the errors
#This is the part of the code I need to speed up
def Full_coverage(group):
    if len(group) > 1:
        #Sort the grouped data by column 'SectionStart' from low to high
        #Updated for newer pandas versions
        #group.sort('SectionStart', ascending=True, inplace=True)
        group.sort_values('SectionStart', ascending=True, inplace=True)
        #Some initial values, overwritten at the end of each loop
        #These variables correspond to the first row of the group
        start_km = group.iloc[0, 4]
        end_km = group.iloc[0, 5]
        end_km_index = group.index[0]
        #Loop through all the rows in the group
        #index is the index of the row
        #i is the 'SectionStart' of the row
        #j is the 'SectionStop' of the row
        #The loop starts from the 2nd row in the group
        for index, (i, j) in group.iloc[1:, [4, 5]].iterrows():
            #The start of the next row must be equal to the end of the previous row in the group
            if i != end_km:
                #Add the faulty data to the error list
                incomplete_coverage.append(('Expected startpoint: '+str(end_km)+' (row '+str(end_km_index)+')', \
                    'Found startpoint: '+str(i)+' (row '+str(index)+')'))
            #Overwrite these values for the next loop
            start_km = i
            end_km = j
            end_km_index = index
    return group
#Check if the complete track is completely defined (from start to end) for each combination of:
#'ID_number','TimeOfDay','TypeOfCargo','TrackStart'
incomplete_coverage = [] #Create empty list for storing the error messages
df_grouped = df.groupby(['ID_number','TimeOfDay','TypeOfCargo','TrackStart']).apply(lambda x: Full_coverage(x))
#Print the error list
print('\nFound incomplete coverage in the following rows:')
for i, j in incomplete_coverage:
    print(i)
    print(j)
    print()
#%%Time the procedure -- It is very slow, taking about 6.6 seconds on my pc
%timeit df.groupby(['ID_number','TimeOfDay','TypeOfCargo','TrackStart']).apply(lambda x: Full_coverage(x))
Accepted answer by jakevdp
The problem, I believe, is that your data has 5300 distinct groups. Due to this, anything slow within your function will be magnified. You could probably use a vectorized operation rather than a for loop in your function to save time, but a much easier way to shave off a few seconds is to return 0 rather than return group. When you return group, pandas will actually create a new data object combining your sorted groups, which you don't appear to use. When you return 0, pandas will combine 5300 zeros instead, which is much faster.
For example:
cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
groups = df.groupby(cols)
print(len(groups))
# 5353
%timeit df.groupby(cols).apply(lambda group: group)
# 1 loops, best of 3: 2.41 s per loop
%timeit df.groupby(cols).apply(lambda group: 0)
# 10 loops, best of 3: 64.3 ms per loop
Just combining the results you don't use is taking about 2.4 seconds; the rest of the time is actual computation in your loop which you should attempt to vectorize.
Edit:
With a quick additional vectorized check before the for loop and returning 0 instead of group, I got the time down to about 2 seconds, which is basically the cost of sorting each group. Try this function:
def Full_coverage(group):
    if len(group) > 1:
        # sort_values replaces the removed DataFrame.sort
        group = group.sort_values('SectionStart', ascending=True)
        # this condition is sufficient to find when the loop
        # will add to the list
        if np.any(group.values[1:, 4] != group.values[:-1, 5]):
            start_km = group.iloc[0, 4]
            end_km = group.iloc[0, 5]
            end_km_index = group.index[0]
            for index, (i, j) in group.iloc[1:, [4, 5]].iterrows():
                if i != end_km:
                    incomplete_coverage.append(('Expected startpoint: '+str(end_km)+' (row '+str(end_km_index)+')', \
                        'Found startpoint: '+str(i)+' (row '+str(index)+')'))
                start_km = i
                end_km = j
                end_km_index = index
    return 0
cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
%timeit df.groupby(cols).apply(Full_coverage)
# 1 loops, best of 3: 1.74 s per loop
Edit 2: here's an example which incorporates my suggestion to move the sort outside the groupby and to remove the unnecessary loops. Removing the loops is not much faster for the given example, but will be faster if there are a lot of incompletes:
def Full_coverage_new(group):
    if len(group) > 1:
        mask = group.values[1:, 4] != group.values[:-1, 5]
        if np.any(mask):
            err = ('Expected startpoint: {0} (row {1}) '
                   'Found startpoint: {2} (row {3})')
            incomplete_coverage.extend([err.format(group.iloc[i, 5],
                                                   group.index[i],
                                                   group.iloc[i + 1, 4],
                                                   group.index[i + 1])
                                        for i in np.where(mask)[0]])
    return 0

incomplete_coverage = []
cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
df_s = df.sort_values(['SectionStart','SectionStop'])
df_s.groupby(cols).apply(Full_coverage_new)
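Note that Full_coverage_new stores each error as a single formatted string, unlike the (expected, found) tuples built in the question, so the printing loop from the question needs a small adjustment; a minimal sketch:

print('\nFound incomplete coverage in the following rows:')
for msg in incomplete_coverage:
    print(msg)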
Answer by Alex
I found that the pandas locate commands (.loc or .iloc) were also slowing things down. By moving the sort out of the loop and converting the data to numpy arrays at the start of the function, I got an even faster result. I am aware that the data is no longer a dataframe, but the indices returned in the list can be used to find the data in the original df.
If there is any way to speed up the process even further I would appreciate the help. What I have so far:
def Full_coverage(group):
    if len(group) > 1:
        group_index = group.index.values
        group = group.values
        # this condition is sufficient to find when the loop will add to the list
        if np.any(group[1:, 4] != group[:-1, 5]):
            start_km = group[0, 4]
            end_km = group[0, 5]
            end_km_index = group_index[0]
            # pair the rows from the 2nd onward with their own indices
            for index, (i, j) in zip(group_index[1:], group[1:, [4, 5]]):
                if i != end_km:
                    incomplete_coverage.append(('Expected startpoint: '+str(end_km)+' (row '+str(end_km_index)+')', \
                        'Found startpoint: '+str(i)+' (row '+str(index)+')'))
                start_km = i
                end_km = j
                end_km_index = index
    return 0

incomplete_coverage = []
# sort_values replaces the removed DataFrame.sort
df.sort_values(['SectionStart','SectionStop'], ascending=True, inplace=True)
cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
%timeit df.groupby(cols).apply(Full_coverage)
# 1 loops, best of 3: 272 ms per loop
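For reference, a fully vectorized gap check can skip apply altogether by using groupby().shift() to align each row with the previous row of its group. This is a sketch building on the df and cols defined above; the names df_v, prev_stop, and gaps are illustrative, not part of the original answers:

# Sort so rows within each group are in track order
df_v = df.sort_values(cols + ['SectionStart', 'SectionStop'])
# For every row, fetch the 'SectionStop' of the previous row in the same group;
# the first row of each group gets NaN
prev_stop = df_v.groupby(cols)['SectionStop'].shift()
# A gap exists wherever a row's start does not match the previous row's stop
gaps = df_v[prev_stop.notna() & (df_v['SectionStart'] != prev_stop)]
print(gaps[['SectionStart', 'SectionStop']])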