Pandas: How to make apply on a dataframe faster?

Question

Asked by Khris

Consider this pandas example, where I calculate column C by multiplying A with B and a float if a certain condition is fulfilled, using apply with a lambda function:

import pandas as pd
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9],'B':[9,8,7,6,5,4,3,2,1]})

df['C'] = df.apply(lambda x: x.A if x.B > 5 else 0.1*x.A*x.B, axis=1)

The expected result would be:

   A  B    C
0  1  9  1.0
1  2  8  2.0
2  3  7  3.0
3  4  6  4.0
4  5  5  2.5
5  6  4  2.4
6  7  3  2.1
7  8  2  1.6
8  9  1  0.9

The problem is that this code is slow and I need to do this operation on a dataframe with around 56 million rows.

The %timeit result of the lambda operation above is:

1000 loops, best of 3: 1.63 ms per loop

Judging by the calculation time and the memory usage when I run this on my large dataframe, I presume this operation creates intermediate Series while doing the calculations.

I tried to formulate it in different ways including using temporary columns, but every alternative solution I came up with is even slower.

Is there a way to get the result I need in a different and faster way, e.g. by using numpy?

Answer 1

Accepted answer by Divakar

For performance, you might be better off working with a NumPy array and using np.where:

import numpy as np

a = df.values  # Assuming you have two columns A and B, in that order
# Keep A where B > 5, otherwise use 0.1 * A * B
df['C'] = np.where(a[:, 1] > 5, a[:, 0], 0.1 * a[:, 0] * a[:, 1])
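As a quick sanity check, the sketch below runs the np.where approach on the sample data from the question and compares it with the original row-wise apply. It pulls each column out by name with to_numpy() rather than slicing df.values by position; that is a substitution made here for illustration (it assumes pandas 0.24 or later) and not part of the answer itself.

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

# Row-wise baseline from the question (one Python call per row).
expected = df.apply(lambda x: x.A if x.B > 5 else 0.1 * x.A * x.B, axis=1)

# Vectorized version: extract each column as a NumPy array once.
# Selecting by name (an assumption of this sketch) avoids depending on column order.
a = df['A'].to_numpy()
b = df['B'].to_numpy()
df['C'] = np.where(b > 5, a, 0.1 * a * b)

# Both approaches should agree up to floating-point rounding.
assert np.allclose(df['C'], expected)
print(df)
```

Selecting columns by name also keeps the result correct if the frame later gains extra columns, such as C itself.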

Runtime test

def numpy_based(df):
    a = df.values # Assuming you have two columns A and B
    df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])

Timings:

In [271]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

In [272]: %timeit numpy_based(df)
1000 loops, best of 3: 380 μs per loop

In [273]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

In [274]: %timeit df['C'] = df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
100 loops, best of 3: 3.39 ms per loop

In [275]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

In [276]: %timeit df['C'] = np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
1000 loops, best of 3: 1.12 ms per loop

In [277]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

In [278]: %timeit df['C'] = np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
1000 loops, best of 3: 1.19 ms per loop

Closer look

Let's take a closer look at NumPy's number-crunching capability and compare it with pandas:

# Extract as an array (typically a view for a single-dtype frame, so not really
#   expensive compared to the later computations themselves)

In [291]: a = df.values 

In [296]: %timeit df.values
10000 loops, best of 3: 107 μs per loop

Case #1: Work with a NumPy array and use numpy.where:

In [292]: %timeit np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
10000 loops, best of 3: 86.5 μs per loop

Again, assigning the result to a new column, df['C'], is not very expensive either:

In [300]: %timeit df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
1000 loops, best of 3: 323 μs per loop

Case #2: Work with the pandas dataframe and use its .where method (no NumPy):

In [293]: %timeit df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
100 loops, best of 3: 3.4 ms per loop

Case #3: Work with the pandas dataframe (no NumPy array), but use numpy.where:

In [294]: %timeit np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
1000 loops, best of 3: 764 μs per loop

Case #4: Work with the pandas dataframe again (no NumPy array), but use numpy.where together with Series.mul:

In [295]: %timeit np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
1000 loops, best of 3: 830 μs per loop
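To reproduce a comparison like this outside IPython, a rough harness using the standard timeit module might look like the sketch below. The function names, the fixed seed, and the repeat counts are arbitrary choices made here, not part of the answer, and the absolute numbers will differ by machine and by pandas/NumPy version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data shaped like the benchmark above (10,000 rows, values 0..8).
rng = np.random.default_rng(0)  # fixed seed is an arbitrary choice
df = pd.DataFrame(rng.integers(0, 9, size=(10_000, 2)), columns=['A', 'B'])

def with_arrays(df):
    # Case #1: NumPy arrays in, NumPy array out.
    a, b = df['A'].to_numpy(), df['B'].to_numpy()
    return np.where(b > 5, a, 0.1 * a * b)

def with_series_where(df):
    # Case #2: pandas only, via Series.where.
    return df.A.where(df.B.gt(5), df[['A', 'B']].prod(axis=1).mul(.1))

def with_np_where_on_series(df):
    # Case #3: numpy.where applied directly to pandas Series.
    return np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])

candidates = [with_arrays, with_series_where, with_np_where_on_series]
reference = with_arrays(df)
for func in candidates:
    # Check that every variant agrees before comparing speed.
    assert np.allclose(np.asarray(func(df)), reference)
    seconds = min(timeit.repeat(lambda: func(df), number=20, repeat=3)) / 20
    print(f"{func.__name__}: {seconds * 1e6:.0f} µs per call")
```

Taking the minimum over several repeats mirrors the "best of 3" figure that %timeit reports in the session above.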

Answer 2

Answer by piRSquared

Pure pandas, using pd.Series.where:

df['C'] = df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))

   A  B    C
0  1  9  1.0
1  2  8  2.0
2  3  7  3.0
3  4  6  4.0
4  5  5  2.5
5  6  4  2.4
6  7  3  2.1
7  8  2  1.6
8  9  1  0.9
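A possible variant that also stays in pandas multiplies the two columns directly instead of calling prod across rows. This is only a sketch of an alternative spelling; whether it is faster than the version above is not measured here.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

# Keep A where B > 5; elsewhere use 0.1 * A * B, computed column-wise.
df['C'] = df['A'].where(df['B'] > 5, 0.1 * df['A'] * df['B'])
print(df)
```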

Answer 3

Answer by IanS

Using numpy.where:

df['C'] = numpy.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])

Answer 4

Answer by jezrael

Use:

df['C'] = np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
print(df)
   A  B    C
0  1  9  1.0
1  2  8  2.0
2  3  7  3.0
3  4  6  4.0
4  5  5  2.5
5  6  4  2.4
6  7  3  2.1
7  8  2  1.6
8  9  1  0.9
