Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/38143717/
Groupby in python pandas: Fast Way
Asked by Náthali
I want to improve the running time of a groupby in python pandas. I have this code:
df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
The objective is to count how many contracts a client has in a given month and add this information in a new column (Nbcontrats).
Client: client code
Month: month of data extraction
Contrat: contract number
I want to improve the running time. Below I am only working with a subset of my real data:
%timeit df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
1 loops, best of 3: 391 ms per loop
df.shape
Out[309]: (7464, 61)
How can I improve the execution time?
Accepted answer by Náthali
With the DataFrameGroupBy.size method:
# Move the grouping keys into the index so the size() result aligns on assignment
df.set_index(['Client', 'Month'], inplace=True)
# size() gives the row count per (Client, Month) group, broadcast via index alignment
df['Nbcontrats'] = df.groupby(level=(0, 1)).size()
# Restore Client and Month as regular columns
df.reset_index(inplace=True)
Most of the work goes into assigning the result back into a column of the source DataFrame.
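A side note not in the original answer: assuming a reasonably recent pandas version, the index round-trip can be avoided entirely by passing the string 'size' to transform, which uses the optimized built-in group-size computation instead of calling len on each group:

# Sketch of an equivalent one-liner; gives the same Nbcontrats column without touching the index
df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform('size')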
Answered by Divakar
Here's one way to proceed:
1. Slice out the relevant columns (['Client', 'Month']) from the input dataframe into a NumPy array. This is mostly a performance-focused idea, as we will be using NumPy functions later on, and those are optimized to work with NumPy arrays.

2. Convert the data from the two ['Client', 'Month'] columns into a single 1D array that is its linear-index equivalent, treating the elements from the two columns as pairs. Thus, we can assume that the elements from 'Client' represent the row indices, whereas the 'Month' elements are the column indices. This is like going from 2D to 1D. The issue is deciding the shape of the 2D grid on which to perform such a mapping. To cover all pairs, one safe assumption is a 2D grid whose dimensions are one more than the max along each column, because of 0-based indexing in Python. Thus, we get linear indices (made concrete in the toy run after the implementation below).

3. Next up, we tag each linear index based on its uniqueness among the others. I think these tags would correspond to the keys obtained with groupby. We also need the counts of each group/unique key along the entire length of that 1D array. Finally, indexing into the counts with those tags maps each element to its respective count.
That's the whole idea! Here's the implementation -
# Save relevant columns as a NumPy array for performing NumPy operations afterwards
arr_slice = df[['Client', 'Month']].values
# Get linear indices equivalent of those columns
lidx = np.ravel_multi_index(arr_slice.T, arr_slice.max(0) + 1)
# Get unique IDs corresponding to each linear index (i.e. group) and grouped counts
unq, unqtags, counts = np.unique(lidx, return_inverse=True, return_counts=True)
# Index counts with the unique tags to map across all elements with the counts
df["Nbcontrats"] = counts[unqtags]
Runtime test
1) Define functions:
def original_app(df):
    df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)

def vectorized_app(df):
    arr_slice = df[['Client', 'Month']].values
    lidx = np.ravel_multi_index(arr_slice.T, arr_slice.max(0) + 1)
    unq, unqtags, counts = np.unique(lidx, return_inverse=True, return_counts=True)
    df["Nbcontrats"] = counts[unqtags]
2) Verify results:
In [143]: # Let's create a dataframe with 100 unique IDs and of length 10000
...: arr = np.random.randint(0,100,(10000,3))
...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
...: df1 = df.copy()
...:
...: # Run the function on the inputs
...: original_app(df)
...: vectorized_app(df1)
...:
In [144]: np.allclose(df["Nbcontrats"],df1["Nbcontrats"])
Out[144]: True
3) Finally, time them:
In [145]: # Let's create a dataframe with 100 unique IDs and of length 10000
...: arr = np.random.randint(0,100,(10000,3))
...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
...: df1 = df.copy()
...:
In [146]: %timeit original_app(df)
1 loops, best of 3: 645 ms per loop
In [147]: %timeit vectorized_app(df1)
100 loops, best of 3: 2.62 ms per loop