Python - Efficient way to add rows to dataframe
Disclaimer: this page is a translation of a popular StackOverflow question, licensed under CC BY-SA 4.0. If you use or share it, you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/41888080/
Asked by Jarrod
From this question and others, it seems that it is not recommended to use concat or append to build a pandas dataframe, because they recopy the whole dataframe each time.
My project involves retrieving a small amount of data every 30 seconds. This might run over a 3-day weekend, so someone could easily expect over 8000 rows to be created one row at a time. What would be the most efficient way to add rows to this dataframe?
Accepted answer by sundance
Editing the chosen answer here since it was completely mistaken. What follows is an explanation of why you should not use setting with enlargement. "Setting with enlargement" is actually worse than append.
The tl;dr here is that there is no efficient way to do this with a DataFrame, so if you need speed you should use another data structure instead. See other answers for better solutions.
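As a minimal hedged sketch of what "another data structure" can mean here (the column names and the loop below are placeholders, not part of the original answer), you could accumulate the incoming readings in a plain Python list, which appends cheaply, and build the DataFrame once at the end:

import pandas as pd

rows = []
for i in range(8000):                        # e.g. one reading every 30 seconds
    rows.append({"A": i, "B": i, "C": i})    # placeholder values for one reading

df = pd.DataFrame(rows, columns=["A", "B", "C"])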
More on setting with enlargement
You can add rows to a DataFrame in-place using loc on a non-existent index, but that also performs a copy of all of the data (see this discussion). Here's how it would look, from the Pandas documentation:
In [119]: dfi
Out[119]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
In [120]: dfi.loc[3] = 5
In [121]: dfi
Out[121]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
3 5 5 5
For something like the use case described, setting with enlargement actually takes 50% longer than append:
With append(), 8000 rows took 6.59 s (0.8 ms per row):
%%timeit df = pd.DataFrame(columns=["A", "B", "C"]); new_row = pd.Series({"A": 4, "B": 4, "C": 4})
for i in range(8000):
    df = df.append(new_row, ignore_index=True)
# 6.59 s ± 53.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With .loc, 8000 rows took 10 s (1.25 ms per row):
%%timeit df = pd.DataFrame(columns=["A", "B", "C"]); new_row = pd.Series({"A": 4, "B": 4, "C": 4})
for i in range(8000):
    df.loc[i] = new_row
# 10.2 s ± 148 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
What about a longer DataFrame?
As with all profiling in data-oriented code, YMMV, and you should test this for your use case. One characteristic of the copy-on-write behavior of append and "setting with enlargement" is that they get slower and slower with large DataFrames:
%%timeit df = pd.DataFrame(columns=["A", "B", "C"]); new_row = pd.Series({"A": 4, "B": 4, "C": 4})
for i in range(16000):
    df.loc[i] = new_row
# 23.7 s ± 286 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Building a 16k-row DataFrame with this method takes 2.3x longer than building an 8k-row one.
Answered by Tom Harvey
I used this answer's df.loc[i] = [new_data] suggestion, but I have > 500,000 rows and that was very slow.
While the answers given are good for the OP's question, I found it more efficient, when dealing with large numbers of rows up front (instead of the trickling in described by the OP), to use csvwriter to add data to an in-memory CSV object, then finally use pandas.read_csv(csv) to generate the desired DataFrame output.
from io import StringIO  # csv.writer expects a text stream in Python 3 (the in-memory buffer replaces a file on disk)
from csv import writer
import pandas as pd

output = StringIO()
csv_writer = writer(output)

for row in iterable_object:       # any iterable of row tuples/lists
    csv_writer.writerow(row)

output.seek(0)                    # we need to get back to the start of the in-memory buffer
df = pd.read_csv(output)
This, for ~500,000 rows, was 1000x faster, and as the row count grows the speed improvement will only get larger (the df.loc[1] = [data] approach will get a lot slower comparatively).
Hope this helps someone who needs efficiency when dealing with more rows than the OP.
Answered by Théo Simier
The response of Tom Harvey works well. However, I would like to add a simpler answer based on pandas.DataFrame.from_dict.
By adding the data of each row to a list, and then adding those lists to a dictionary, you can use pd.DataFrame.from_dict(dict) to create a dataframe without iteration.
If each value of the dictionary is a row, you can use just:
pd.DataFrame.from_dict(dictionary, orient='index')
Small example:
import pandas as pd

# Dictionary containing the data
dic = {'row_1': ['some', 'test', 'values', 78, 90], 'row_2': ['some', 'test', 'values', 100, 589]}

# Creation of the dataframe
df = pd.DataFrame.from_dict(dic, orient='index')
df
0 1 2 3 4
row_1 some test values 78 90
row_2 some test values 100 589
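For the OP's streaming use case, a hedged variation of the same idea (the keys and the fake values below are placeholders, not from the original answer) is to grow the dictionary one entry per reading and convert it only once at the end:

import pandas as pd

readings = {}
for i in range(8000):                                                 # e.g. one reading every 30 seconds
    readings['row_%d' % i] = ['some', 'test', 'values', i, i * 2]     # placeholder row data

df = pd.DataFrame.from_dict(readings, orient='index')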
Answered by Burhan Khalid
You need to split the problem into two parts:
- Accepting the data (collecting it) every 30 seconds efficiently.
- Processing the data once it's collected.
If your data is critical (that is, you cannot afford to lose it) - send it to a queue and then read it from the queue in batches.
The queue will provide reliable (guaranteed) acceptance and ensure that your data will not be lost.
You can read the data from the queue and dump it in a database.
Now your Python app simply reads from the database and does the analysis at whatever interval makes sense for the application - perhaps you want to do hourly averages; in this case you would run your script each hour to pull the data from the db and perhaps write the results in another database / table / file.
The bottom line - split the collecting and analyzing parts of your application.
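A minimal sketch of this split, assuming SQLite as the store and made-up column names (neither is prescribed by the answer):

import sqlite3
import time
import pandas as pd

# Collector side: durably append each reading as it arrives (every 30 seconds).
conn = sqlite3.connect("readings.db")
conn.execute("CREATE TABLE IF NOT EXISTS readings (ts REAL, a REAL, b REAL, c REAL)")

def collect_once(a, b, c):
    conn.execute("INSERT INTO readings VALUES (?, ?, ?, ?)", (time.time(), a, b, c))
    conn.commit()

# Analysis side: run on its own schedule (e.g. hourly) and load everything at once.
def analyze():
    return pd.read_sql_query("SELECT * FROM readings", conn)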
Answered by sparrow
Assuming that your dataframe is indexed in order, you can:
First check to see what the next index value is to create a new row:
myindex = df.shape[0]  # next label for a default 0-based, in-order index
Then use 'at' to write to each desired column
df.at[myindex,'A']=val1
df.at[myindex,'B']=val2
df.at[myindex,'C']=val3
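A self-contained sketch of this approach (the starting frame and the values are placeholders, not from the original answer):

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [1, 2], "C": [1, 2]})

myindex = df.shape[0]          # next label for a 0-based, in-order index
df.at[myindex, "A"] = 10
df.at[myindex, "B"] = 20
df.at[myindex, "C"] = 30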
Answered by dasjanik
sundance's answer might be correct in terms of usage, but the benchmark is just wrong. As moobie correctly pointed out, an index 3 already exists in this example, which makes access much quicker than with a non-existent index. Have a look at this:
%%timeit
test = pd.DataFrame({"A": [1,2,3], "B": [1,2,3], "C": [1,2,3]})
for i in range(0,1000):
    testrow = pd.DataFrame([0,0,0])
    pd.concat([test[:1], testrow, test[1:]])
2.15 s ± 88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
test = pd.DataFrame({"A": [1,2,3], "B": [1,2,3], "C": [1,2,3]})
for i in range(0,1000):
    test2 = pd.DataFrame({'A': 0, 'B': 0, 'C': 0}, index=[i+0.5])
    test.append(test2, ignore_index=False)
test.sort_index().reset_index(drop=True)
972 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
test = pd.DataFrame({"A": [1,2,3], "B": [1,2,3], "C": [1,2,3]})
for i in range(0,1000):
    test3 = [0,0,0]
    test.loc[i+0.5] = test3
test.reset_index(drop=True)
1.13 s ± 46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Of course, this is purely synthetic, and I admittedly wasn't expecting these results, but it seems that with non-existent indices .loc and .append perform quite similarly. Just leaving this here.