Original source: http://stackoverflow.com/questions/22084338/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas DataFrame performance
Asked by Owen
Pandas is really great, but I am really surprised by how inefficient it is to retrieve values from a Pandas.DataFrame. In the following toy example, even the DataFrame.iloc method is more than 100 times slower than a dictionary.
The question: Is the lesson here just that dictionaries are the better way to look up values? Yes, I get that that is precisely what they were made for. But I just wonder if there is something I am missing about DataFrame lookup performance.
I realize this question is more "musing" than "asking" but I will accept an answer that provides insight or perspective on this. Thanks.
import timeit

setup = '''
import numpy, pandas
df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
dictionary = df.to_dict()
'''

f = ['value = dictionary[5][5]', 'value = df.loc[5, 5]', 'value = df.iloc[5, 5]']
for func in f:
    print(func)
    # Best of 3 runs, each executing the statement 100,000 times (seconds)
    print(min(timeit.Timer(func, setup).repeat(3, 100000)))
value = dictionary[5][5]
0.130625009537
value = df.loc[5, 5]
19.4681699276
value = df.iloc[5, 5]
17.2575249672
Accepted answer by unutbu
A dict is to a DataFrame as a bicycle is to a car. You can pedal 10 feet on a bicycle faster than you can start a car, get it in gear, etc, etc. But if you need to go a mile, the car wins.
For certain small, targeted purposes, a dict may be faster. And if that is all you need, then use a dict, for sure! But if you need/want the power and luxury of a DataFrame, then a dict is no substitute. It is meaningless to compare speed if the data structure does not first satisfy your needs.
Now for example -- to be more concrete -- a dict is good for accessing columns, but it is not so convenient for accessing rows.
import timeit
setup = '''
import numpy, pandas
df = pandas.DataFrame(numpy.zeros(shape=[10, 1000]))
dictionary = df.to_dict()
'''
# f = ['value = dictionary[5][5]', 'value = df.loc[5, 5]', 'value = df.iloc[5, 5]']
f = ['value = [val[5] for col,val in dictionary.items()]', 'value = df.loc[5]', 'value = df.iloc[5]']
for func in f:
    print(func)
    print(min(timeit.Timer(func, setup).repeat(3, 100000)))
yields
value = [val[5] for col,val in dictionary.iteritems()]
25.5416321754
value = df.loc[5]
5.68071913719
value = df.iloc[5]
4.56006002426
So the dict of lists is 5 times slower at retrieving rows than df.iloc. The speed deficit becomes greater as the number of columns grows. (The number of columns is like the number of feet in the bicycle analogy. The longer the distance, the more convenient the car becomes...)
This is just one example of when a dict of lists would be less convenient/slower than a DataFrame.
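To check how the row-retrieval gap changes with the number of columns, a sketch along the following lines could be used. The helper name time_row_access, the column counts, and the repeat count are assumptions chosen for illustration, not part of the original answer, and absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def time_row_access(n_cols, repeats=200):
    """Time one row lookup via a dict comprehension vs. df.iloc (hypothetical helper)."""
    df = pd.DataFrame(np.zeros(shape=[10, n_cols]))
    dictionary = df.to_dict()  # {column_label: {row_label: value}}

    def from_dict():
        # Visit every column and pick out the value stored for row 5.
        return [col_values[5] for col_values in dictionary.values()]

    def from_frame():
        return df.iloc[5]

    # Both approaches should return the same row values.
    assert from_dict() == from_frame().tolist()

    t_dict = min(timeit.repeat(from_dict, number=repeats, repeat=3))
    t_frame = min(timeit.repeat(from_frame, number=repeats, repeat=3))
    return t_dict, t_frame


for n_cols in (10, 1000, 10000):
    t_dict, t_frame = time_row_access(n_cols)
    print(f"{n_cols:>6} columns: dict {t_dict:.4f}s, iloc {t_frame:.4f}s")
```

The dict comprehension touches every column in Python, so its cost should grow roughly linearly with the column count, while df.iloc does most of its work in compiled code.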
Another example would be when you have a DatetimeIndex for the rows and wish to select all rows between certain dates. With a DataFrame you can use
df.loc['2000-1-1':'2000-3-31']
There is no easy analogue for that if you were to use a dict of lists. And the Python loops you would need to use to select the right rows would again be terribly slow compared to the DataFrame.
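As a rough illustration of the date-range point, the sketch below builds a small daily series for the year 2000 and selects the first quarter both ways. The column name "value" and the generated data are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one value per day for the year 2000.
dates = pd.date_range("2000-01-01", "2000-12-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(dates))}, index=dates)

# DataFrame: label-based slicing on a DatetimeIndex includes both endpoints.
selected = df.loc["2000-1-1":"2000-3-31"]

# dict-of-lists equivalent: an explicit Python loop over every row.
data = {"date": list(dates), "value": list(range(len(dates)))}
start, end = pd.Timestamp("2000-01-01"), pd.Timestamp("2000-03-31")
selected_values = [
    v for d, v in zip(data["date"], data["value"]) if start <= d <= end
]

print(len(selected), len(selected_values))
```

Both selections contain the same rows, but the dict version has to compare every date in Python, whereas the DataFrame can rely on its sorted index.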
Answer by user3566825
I encountered the same problem. You can use at to improve performance.
"Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you're asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures."
See the official reference at http://pandas.pydata.org/pandas-docs/stable/indexing.html, chapter "Fast scalar value getting and setting".
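A minimal sketch of the two scalar accessors, reusing the 10x10 frame of zeros from the question. The assigned values 1.5 and 2.5 are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros(shape=[10, 10]))

# Label-based scalar access (row label 5, column label 5).
value_at = df.at[5, 5]

# Position-based scalar access (sixth row, sixth column).
value_iat = df.iat[5, 5]

# Both accessors also support assigning a single cell.
df.at[5, 5] = 1.5
df.iat[0, 0] = 2.5

print(value_at, value_iat, df.loc[5, 5], df.iloc[0, 0])
```

Here the labels and positions happen to coincide because the frame uses the default integer index. With a non-default index, at takes labels and iat takes integer positions.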
Answer by joon
It seems the performance difference is much smaller now (0.21.1; I forgot which version of Pandas was used in the original example). Not only has the performance gap between dictionary access and .loc shrunk (from about 335 times to 126 times slower), but loc (iloc) is now less than two times slower than at (iat).
In [1]: import numpy, pandas
   ...: df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
   ...: dictionary = df.to_dict()
In [2]: %timeit value = dictionary[5][5]
85.5 ns ± 0.336 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [3]: %timeit value = df.loc[5, 5]
10.8 μs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [4]: %timeit value = df.at[5, 5]
6.87 μs ± 64.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [5]: %timeit value = df.iloc[5, 5]
14.9 μs ± 114 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [6]: %timeit value = df.iat[5, 5]
9.89 μs ± 54.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: print(pandas.__version__)
0.21.1
---- Original answer below ----
+1 for using at or iat for scalar operations. Example benchmark:
In [1]: import numpy, pandas
...: df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
...: dictionary = df.to_dict()
In [2]: %timeit value = dictionary[5][5]
The slowest run took 34.06 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 310 ns per loop
In [4]: %timeit value = df.loc[5, 5]
10000 loops, best of 3: 104 μs per loop
In [5]: %timeit value = df.at[5, 5]
The slowest run took 6.59 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 9.26 μs per loop
In [6]: %timeit value = df.iloc[5, 5]
10000 loops, best of 3: 98.8 μs per loop
In [7]: %timeit value = df.iat[5, 5]
The slowest run took 6.67 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 9.58 μs per loop
It seems that using at (iat) is about 10 times faster than loc (iloc).
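Outside IPython, a comparable measurement could be taken with the standard timeit module, in the same style as the script in the question. This is only a sketch: the variable names statements and results and the count of 10,000 lookups per run are assumptions, and the numbers it prints depend on the machine and the installed pandas version.

```python
import timeit

setup = '''
import numpy, pandas
df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
dictionary = df.to_dict()
'''

statements = [
    'value = dictionary[5][5]',
    'value = df.loc[5, 5]',
    'value = df.at[5, 5]',
    'value = df.iloc[5, 5]',
    'value = df.iat[5, 5]',
]

results = {}
for stmt in statements:
    # Best of 3 runs of 10,000 lookups each, in seconds.
    results[stmt] = min(timeit.repeat(stmt, setup, repeat=3, number=10000))
    print(f"{stmt:<28} {results[stmt]:.4f}")
```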
Answer by amityaffliction
I observed a different result when accessing a DataFrame row. Test this simple example on a DataFrame with about 10,000,000 rows. The dictionary rocks.
import time

def testRow(go):
    go_dict = go.to_dict()
    times = 100000

    # Row access by position
    ot = time.time()
    for i in range(times):
        go.iloc[100, :]
    nt = time.time()
    print('for iloc {}'.format(nt - ot))

    # Scalar access by label (row 100, column 2)
    ot = time.time()
    for i in range(times):
        go.loc[100, 2]
    nt = time.time()
    print('for loc {}'.format(nt - ot))

    # Row access by collecting the value for row 100 from every column of the dict
    ot = time.time()
    for i in range(times):
        [val[100] for col, val in go_dict.items()]
    nt = time.time()
    print('for dict {}'.format(nt - ot))
Answer by Orvar Korvar
I think the fastest way of accessing a cell is
df.get_value(row,column)
df.set_value(row,column,value)
Both are faster than (I think)
df.iat[...]
df.at[...]
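Note that get_value and set_value were deprecated in pandas 0.21 and removed in pandas 1.0, so on current versions the at and iat accessors are the way to read or write a single cell. A minimal sketch of the equivalent calls follows. The small frame and its labels are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

# Label-based replacement for df.get_value("y", "b").
value = df.at["y", "b"]

# Label-based replacement for df.set_value("y", "b", 50).
df.at["y", "b"] = 50

# Position-based variants (the removed methods used takeable=True for this).
first = df.iat[0, 0]
df.iat[0, 0] = 10

print(value, first, df.at["y", "b"], df.iat[0, 0])
```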

