Python: Using pandas .append within a for loop
Original source: http://stackoverflow.com/questions/37009287/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Using pandas .append within for loop
Asked by calpyte
I am appending rows to a pandas DataFrame within a for loop, but at the end the DataFrame is always empty. I don't want to add the rows to an array and then call the DataFrame constructor, because my actual for loop handles lots of data. I also tried pd.concat without success. Could anyone point out what I am missing to make the append statement work? Here's a dummy example:
import pandas as pd
import numpy as np
data = pd.DataFrame([])
for i in np.arange(0, 4):
    if i % 2 == 0:
        data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
    else:
        data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)
print data.head()
Empty DataFrame
Columns: []
Index: []
[Finished in 0.676s]
Answer by Alexander
Every time you call append, Pandas returns a copy of the original dataframe plus your new row. This is called quadratic copy, and it is an O(N^2) operation that will quickly become very slow (especially since you have lots of data).
In your case, I would recommend using lists, appending to them, and then calling the dataframe constructor.
a_list = []
b_list = []
for data in my_data:
    a, b = process_data(data)
    a_list.append(a)
    b_list.append(b)
df = pd.DataFrame({'A': a_list, 'B': b_list})
del a_list, b_list
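The snippet above leaves `my_data` and `process_data` undefined. A minimal runnable sketch of the same idea follows, assuming a hypothetical `process_data` that mirrors the question's dummy example and a hypothetical four-item input. It collects one dict per row instead of one list per column, which is a variation on the answer rather than the answer's exact method; both build the DataFrame once at the end.

```python
import pandas as pd


def process_data(item):
    # Hypothetical stand-in for the real per-item work: returns (a, b),
    # where b is None for odd inputs, mirroring the question's example.
    return item, (item + 1 if item % 2 == 0 else None)


my_data = range(4)  # hypothetical input

rows = []
for item in my_data:
    a, b = process_data(item)
    rows.append({'A': a, 'B': b})  # cheap list append, no DataFrame copy

# Build the DataFrame once, after the loop has finished.
df = pd.DataFrame(rows, columns=['A', 'B'])
print(df)
```

Appending to a Python list is cheap, so the cost of constructing the DataFrame is paid a single time instead of once per row.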
Timings
%%timeit
data = pd.DataFrame([])
for i in np.arange(0, 10000):
    if i % 2 == 0:
        data = data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
    else:
        data = data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)
1 loops, best of 3: 6.8 s per loop
%%timeit
a_list = []
b_list = []
for i in np.arange(0, 10000):
    if i % 2 == 0:
        a_list.append(i)
        b_list.append(i + 1)
    else:
        a_list.append(i)
        b_list.append(None)
data = pd.DataFrame({'A': a_list, 'B': b_list})
100 loops, best of 3: 8.54 ms per loop
Answer by johnchase
You need to set the variable data equal to the appended data frame. Unlike the append method on a Python list, the pandas append does not happen in place.
import pandas as pd
import numpy as np
data = pd.DataFrame([])
for i in np.arange(0, 4):
    if i % 2 == 0:
        data = data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
    else:
        data = data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)
print(data.head())
   A    B
0  0  1.0
1  1  NaN
2  2  3.0
3  3  NaN
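To make the in-place distinction concrete, here is a small sketch that contrasts `list.append` with a pandas operation that returns a new object. It uses `pd.concat` rather than `DataFrame.append`, because `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0; the point about discarding the return value applies to both. The variable names are illustrative only.

```python
import pandas as pd

items = []
result = items.append(1)  # list.append mutates the list and returns None

data = pd.DataFrame({'A': [0]})
new_row = pd.DataFrame({'A': [1]})

# The return value is discarded, so `data` is left untouched.
pd.concat([data, new_row], ignore_index=True)
unchanged_len = len(data)

# Rebinding the name keeps the result.
data = pd.concat([data, new_row], ignore_index=True)
updated_len = len(data)

print(unchanged_len, updated_len)
```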
NOTE: This answer aims to answer the question as it was posed. It is not, however, the optimal strategy for combining large numbers of dataframes. For a more efficient solution, have a look at Alexander's answer above.
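If the per-iteration pieces really are DataFrames, one way to avoid the repeated copying is to collect them in a list and concatenate once after the loop. The sketch below reuses the question's dummy data and assumes the pieces fit in memory; it is one possible arrangement, not the only one.

```python
import numpy as np
import pandas as pd

frames = []
for i in np.arange(0, 4):
    if i % 2 == 0:
        frames.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]))
    else:
        frames.append(pd.DataFrame({'A': i}, index=[0]))

# A single concat copies each piece once, instead of re-copying
# the growing result on every iteration.
data = pd.concat(frames, ignore_index=True)
print(data)
```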
Answer by Mike Müller
You can build your dataframe without a loop:
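import numpy as np
import pandas as pd

n = 4
data = pd.DataFrame({'A': np.arange(n)})
data['B'] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
data.loc[data['A'] % 2 == 0, 'B'] = data['A'] + 1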
For:
n = 10000
This is a bit faster:
%%timeit
data = pd.DataFrame({'A': np.arange(n)})
data['B'] = np.nan
data.loc[data['A'] % 2 == 0, 'B'] = data['A'] + 1
100 loops, best of 3: 3.3 ms per loop
vs.
%%timeit
a_list = []
b_list = []
for i in np.arange(n):
    if i % 2 == 0:
        a_list.append(i)
        b_list.append(i + 1)
    else:
        a_list.append(i)
        b_list.append(None)
data1 = pd.DataFrame({'A': a_list, 'B': b_list})
100 loops, best of 3: 12.4 ms per loop
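As a variation on the loop-free construction in this answer, the conditional column can also be built in a single expression with `Series.where`, which keeps values where the condition holds and fills the rest with NaN. This is a sketch of an alternative spelling, not part of the original answer, and it makes no claim about relative speed.

```python
import numpy as np
import pandas as pd

n = 4
a = pd.Series(np.arange(n), name='A')

# Keep a + 1 where A is even; odd rows become NaN.
data = pd.DataFrame({'A': a, 'B': (a + 1).where(a % 2 == 0)})
print(data)
```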