Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/38143717/
Groupby in python pandas: Fast Way
Asked by Náthali
I want to improve the running time of a groupby in python pandas. I have this code:
df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
The objective is to count how many contracts a client has in a given month and add this information in a new column (Nbcontrats).
Client: client code
Month: month of data extraction
Contrat: contract number
I want to improve the running time. Below I am only working with a subset of my real data:
%timeit df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
1 loops, best of 3: 391 ms per loop
df.shape
Out[309]: (7464, 61)
How can I improve the execution time?
Accepted answer by Náthali
With the DataFrameGroupBy.size method:
# Move the grouping keys into the index so the size() result aligns on assignment
df.set_index(['Client', 'Month'], inplace=True)
# size() gives the row count per (Client, Month) group, broadcast via index alignment
df['Nbcontrats'] = df.groupby(level=(0, 1)).size()
# Restore Client and Month as regular columns
df.reset_index(inplace=True)
Most of the work goes into assigning the result back into a column of the source DataFrame.
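A side note not in the original answer: assuming a reasonably recent pandas version, the index round-trip can be avoided entirely by passing the string 'size' to transform, which uses the optimized built-in group-size computation instead of calling len on each group:

# Sketch of an equivalent one-liner; gives the same Nbcontrats column without touching the index
df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform('size')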
Answered by Divakar
Here's one way to proceed:
1. Slice out the relevant columns (['Client', 'Month']) from the input dataframe into a NumPy array. This is mostly a performance-focused idea, as we will be using NumPy functions later on, and those are optimized to work with NumPy arrays.

2. Convert the data from the two ['Client', 'Month'] columns into a single 1D array that is its linear-index equivalent, treating the elements from the two columns as pairs. Thus, we can assume that the elements from 'Client' represent the row indices, whereas the 'Month' elements are the column indices. This is like going from 2D to 1D. The issue is deciding the shape of the 2D grid on which to perform such a mapping. To cover all pairs, one safe assumption is a 2D grid whose dimensions are one more than the max along each column, because of 0-based indexing in Python. Thus, we get linear indices (made concrete in the toy run after the implementation below).

3. Next up, we tag each linear index based on its uniqueness among the others. I think these tags would correspond to the keys obtained with groupby. We also need the counts of each group/unique key along the entire length of that 1D array. Finally, indexing into the counts with those tags maps each element to its respective count.
That's the whole idea! Here's the implementation -
# Save relevant columns as a NumPy array for performing NumPy operations afterwards
arr_slice = df[['Client', 'Month']].values
# Get linear indices equivalent of those columns
lidx = np.ravel_multi_index(arr_slice.T, arr_slice.max(0) + 1)
# Get unique IDs corresponding to each linear index (i.e. group) and grouped counts
unq, unqtags, counts = np.unique(lidx, return_inverse=True, return_counts=True)
# Index counts with the unique tags to map across all elements with the counts
df["Nbcontrats"] = counts[unqtags]
Runtime test
1) Define functions:
def original_app(df):
    df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)

def vectorized_app(df):
    arr_slice = df[['Client', 'Month']].values
    lidx = np.ravel_multi_index(arr_slice.T, arr_slice.max(0) + 1)
    unq, unqtags, counts = np.unique(lidx, return_inverse=True, return_counts=True)
    df["Nbcontrats"] = counts[unqtags]
2) Verify results:
In [143]: # Let's create a dataframe with 100 unique IDs and of length 10000
...: arr = np.random.randint(0,100,(10000,3))
...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
...: df1 = df.copy()
...:
...: # Run the function on the inputs
...: original_app(df)
...: vectorized_app(df1)
...:
In [144]: np.allclose(df["Nbcontrats"],df1["Nbcontrats"])
Out[144]: True
3) Finally, time them:
In [145]: # Let's create a dataframe with 100 unique IDs and of length 10000
...: arr = np.random.randint(0,100,(10000,3))
...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
...: df1 = df.copy()
...:
In [146]: %timeit original_app(df)
1 loops, best of 3: 645 ms per loop
In [147]: %timeit vectorized_app(df1)
100 loops, best of 3: 2.62 ms per loop