Pandas: How to make apply on dataframe faster?
Original question: http://stackoverflow.com/questions/41588034/
Notice: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors and keep the same license.
Asked by Khris
Consider this pandas example where I'm calculating column C by multiplying A with B and a float if a certain condition is fulfilled, using apply with a lambda function:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9],'B':[9,8,7,6,5,4,3,2,1]})
df['C'] = df.apply(lambda x: x.A if x.B > 5 else 0.1*x.A*x.B, axis=1)
The expected result would be:
A B C
0 1 9 1.0
1 2 8 2.0
2 3 7 3.0
3 4 6 4.0
4 5 5 2.5
5 6 4 2.4
6 7 3 2.1
7 8 2 1.6
8 9 1 0.9
The problem is that this code is slow and I need to do this operation on a dataframe with around 56 million rows.
The %timeit result of the above lambda operation is:
1000 loops, best of 3: 1.63 ms per loop
Judging from the calculation time and the memory usage when doing this on my large dataframe, I presume this operation uses intermediate Series while doing the calculations.
I tried to formulate it in different ways including using temporary columns, but every alternative solution I came up with is even slower.
Is there a way to get the result I need in a different and faster way, e.g. by using numpy?
Accepted answer by Divakar
For performance, you might be better off working with a NumPy array and using np.where:
a = df.values # Assuming you have two columns A and B
df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
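As a self-contained sketch of the snippet above, the following rebuilds the sample dataframe from the question and checks that the np.where version produces the same column as the original apply call. The variable names expected and same are only illustrative and do not appear in the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

# Reference result from the question (row-wise apply, slow)
expected = df.apply(lambda x: x.A if x.B > 5 else 0.1 * x.A * x.B, axis=1)

# Vectorized version from the answer: column 0 is A, column 1 is B
a = df.values
df['C'] = np.where(a[:, 1] > 5, a[:, 0], 0.1 * a[:, 0] * a[:, 1])

# Both approaches should agree up to floating-point rounding
same = np.allclose(df['C'].to_numpy(), expected.to_numpy(dtype=float))
print(same)
```

Note that np.where takes the condition first, then the values to use where it is true, then the values to use where it is false.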
Runtime test
def numpy_based(df):
    a = df.values # Assuming you have two columns A and B
    df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
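The function above relies on A and B being the first two columns and on df.values being numeric; with mixed column types, df.values becomes an object array. A possible variant, sketched here under the assumption that the columns can be addressed by name, looks them up explicitly. The function name numpy_based_by_name and its parameters are hypothetical and not part of the answer.

```python
import numpy as np
import pandas as pd

def numpy_based_by_name(df, a_col='A', b_col='B', out_col='C'):
    """Hypothetical variant of numpy_based that selects columns by name."""
    a = df[a_col].to_numpy()
    b = df[b_col].to_numpy()
    # Same rule as in the question: keep A where B > 5, else 0.1 * A * B
    df[out_col] = np.where(b > 5, a, 0.1 * a * b)
    return df

# Column order and extra non-numeric columns no longer matter
df = pd.DataFrame({'B': [9, 5, 1], 'other': ['x', 'y', 'z'], 'A': [1, 5, 9]})
numpy_based_by_name(df)
print(df)
```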
Timings -
In [271]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])
In [272]: %timeit numpy_based(df)
1000 loops, best of 3: 380 μs per loop
In [273]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])
In [274]: %timeit df['C'] = df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
100 loops, best of 3: 3.39 ms per loop
In [275]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])
In [276]: %timeit df['C'] = np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
1000 loops, best of 3: 1.12 ms per loop
In [277]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])
In [278]: %timeit df['C'] = np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
1000 loops, best of 3: 1.19 ms per loop
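The timings above come from IPython's %timeit. Outside IPython, a comparable measurement could be sketched with the standard-library timeit module, as below. The helper names with_apply and with_numpy are made up for this example, and the numbers it prints will depend on the machine and library versions, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 9, size=(10_000, 2)), columns=['A', 'B'])

def with_apply(df):
    # Row-wise version from the question
    return df.apply(lambda x: x.A if x.B > 5 else 0.1 * x.A * x.B, axis=1)

def with_numpy(df):
    # Array-based version from the accepted answer
    a = df[['A', 'B']].to_numpy()
    return np.where(a[:, 1] > 5, a[:, 0], 0.1 * a[:, 0] * a[:, 1])

# Sanity check before timing: both versions must give the same values
assert np.allclose(with_apply(df).to_numpy(dtype=float), with_numpy(df))

for func in (with_apply, with_numpy):
    # apply is much slower, so it gets fewer loops per repeat
    number = 1 if func is with_apply else 100
    best = min(timeit.repeat(lambda: func(df), number=number, repeat=3)) / number
    print(f"{func.__name__}: {best * 1e3:.3f} ms per call")
```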
Closer look
Let's take a closer look at NumPy's number-crunching capability and bring pandas into the mix for comparison:
# Extract out as array (it's a view, so not really expensive
# .. as compared to the later computations themselves)
In [291]: a = df.values
In [296]: %timeit df.values
10000 loops, best of 3: 107 μs per loop
Case #1 : Work with NumPy array and use numpy.where :
In [292]: %timeit np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
10000 loops, best of 3: 86.5 μs per loop
Again, assigning into a new column, df['C'], would not be very expensive either:
In [300]: %timeit df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
1000 loops, best of 3: 323 μs per loop
Case #2 : Work with pandas dataframe and use its .where method (no NumPy) :
In [293]: %timeit df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
100 loops, best of 3: 3.4 ms per loop
Case #3 : Work with pandas dataframe (no NumPy array), but use numpy.where :
In [294]: %timeit np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
1000 loops, best of 3: 764 μs per loop
Case #4 : Work with pandas dataframe again (no NumPy array), but use numpy.where with the Series .mul method for the products :
In [295]: %timeit np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
1000 loops, best of 3: 830 μs per loop
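One thing the cases above have in common is that np.where receives both branch expressions fully evaluated, so 0.1 * A * B is computed for every row, including rows where it is discarded. On a very large dataframe, a possible alternative is boolean-mask assignment, which computes the product only where it is needed. This is not one of the benchmarked cases and is offered only as a sketch; whether it is faster or lighter on memory for a given dataset would have to be measured.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

a = df['A'].to_numpy()
b = df['B'].to_numpy()

# Start from a float copy of A (the value used where B > 5)
c = a.astype(float)

# Overwrite only the rows where the condition B > 5 does not hold
mask = ~(b > 5)
c[mask] = 0.1 * a[mask] * b[mask]

df['C'] = c
print(df)
```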
Answer by piRSquared
Pure pandas, using pd.Series.where:
df['C'] = df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
A B C
0 1 9 1.0
1 2 8 2.0
2 3 7 3.0
3 4 6 4.0
4 5 5 2.5
5 6 4 2.4
6 7 3 2.1
7 8 2 1.6
8 9 1 0.9
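The argument order of Series.where differs from np.where, which is easy to trip over: the Series method keeps its own values where the condition is true and takes the replacement elsewhere. A minimal illustration with made-up values, followed by the same rule applied to the question's columns (written with axis=1 as a keyword for readability):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])
cond = pd.Series([True, False, True, False])

# Keep s where cond is True, use the replacement value where it is False
result = s.where(cond, -1)
print(result.tolist())  # [10, -1, 30, -1]

# Same idea on the question's data: keep A where B > 5, else 0.1 * A * B
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})
row_product = df[['A', 'B']].prod(axis=1)  # row-wise A * B
df['C'] = df['A'].where(df['B'] > 5, row_product * 0.1)
```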
Answer by IanS
Using numpy.where:
df['C'] = numpy.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
Answer by jezrael
Use:
df['C'] = np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
print (df)
A B C
0 1 9 1.0
1 2 8 2.0
2 3 7 3.0
3 4 6 4.0
4 5 5 2.5
5 6 4 2.4
6 7 3 2.1
7 8 2 1.6
8 9 1 0.9