Python: Improve Row Append Performance on Pandas DataFrames
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
原文地址: http://stackoverflow.com/questions/27929472/
Improve Row Append Performance On Pandas DataFrames
Asked by Brideau
I am running a basic script that loops over a nested dictionary, grabs data from each record, and appends it to a Pandas DataFrame. The data looks something like this:
data = {"SomeCity": {"Date1": {record1, record2, record3, ...}, "Date2": {}, ...}, ...}
In total it has a few million records. The script itself looks like this:
from pandas import DataFrame, Series

cities = ["SomeCity"]
df = DataFrame({}, columns=['Date', 'HouseID', 'Price'])
for city in cities:
    for dateRun in data[city]:
        for record in data[city][dateRun]:
            recSeries = Series([record['Timestamp'],
                                record['Id'],
                                record['Price']],
                               index=['Date', 'HouseID', 'Price'])
            df = df.append(recSeries, ignore_index=True)
This runs painfully slow, however. Before I look for a way to parallelize it, I just want to make sure I'm not missing something obvious that would make this perform faster as it is, as I'm still quite new to Pandas.
Accepted answer by P-S
I also used the DataFrame's append function inside a loop and was perplexed by how slowly it ran.
A useful example for those who are struggling with this, based on the correct answer on this page.
Python version: 3
Pandas version: 0.20.3
from pandas import DataFrame

# the dictionary to pass to pandas' DataFrame
# (renamed from "dict" to avoid shadowing the built-in)
entries = {}

# example data to loop over and append to a DataFrame
data = [{"foo": "foo_val_1", "bar": "bar_val_1"},
        {"foo": "foo_val_2", "bar": "bar_val_2"}]

# build one dictionary entry per row, keyed by a row counter
for i, entry in enumerate(data):
    entries[i] = {"col_1_title": entry['foo'], "col_2_title": entry['bar']}

# create the DataFrame using 'from_dict';
# setting orient="index" makes the dictionary keys the row labels
df = DataFrame.from_dict(entries, orient="index")
The "from_dict" function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html
Answered by Ryan Sheftel
I ran into a similar problem where I had to append to a DataFrame many times, but did not know the values in advance of the appends. I wrote a lightweight DataFrame-like data structure that is just blists under the hood. I use that to accumulate all of the data and then, when it is complete, transform the output into a Pandas DataFrame. Here is a link to my project; it is all open source, so I hope it helps others:
Answered by Mahidhar Surapaneni
Appending rows to lists is far more efficient than appending to a DataFrame. Hence you would want to:

- append the rows to a list,
- then convert the list into a DataFrame, and
- set the index as required.
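Applied to the question's data, the steps above might look like the following sketch (the field names Timestamp, Id, and Price, and the example values, are taken or invented from the question's code, and the nested records are assumed to be held in lists):

```python
import pandas as pd

# nested structure from the question: city -> date -> records
data = {"SomeCity": {"Date1": [{"Timestamp": "2015-01-01", "Id": 1, "Price": 100},
                               {"Timestamp": "2015-01-01", "Id": 2, "Price": 200}]}}

rows = []  # appending plain tuples to a list is cheap
for city in data:
    for date_run in data[city]:
        for record in data[city][date_run]:
            rows.append((record["Timestamp"], record["Id"], record["Price"]))

# build the DataFrame once, at the end
df = pd.DataFrame(rows, columns=["Date", "HouseID", "Price"])
```

The loop only touches Python lists; pandas is involved a single time, at the end.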
Answered by Rob
I think the best way to do it is: if you know the data you are going to receive, allocate beforehand.
import numpy as np
import pandas as pd
random_matrix = np.random.randn(100, 100)
insert_df = pd.DataFrame(random_matrix)
df = pd.DataFrame(columns=range(100), index=range(200))
df.loc[range(100), df.columns] = random_matrix
df.loc[range(100, 200), df.columns] = random_matrix
This is the pattern that I think makes the most sense. append will be faster if you have a very small DataFrame, but it doesn't scale.
In [1]: import numpy as np; import pandas as pd

In [2]: random_matrix = np.random.randn(100, 100)
   ...: insert_df = pd.DataFrame(random_matrix)
   ...: df = pd.DataFrame(np.random.randn(100, 100))

In [3]: %timeit df.append(insert_df)
272 μs ± 2.36 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit df.loc[range(100), df.columns] = random_matrix
493 μs ± 4.25 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit df.loc[range(100), df.columns] = insert_df
821 μs ± 8.68 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
When we run this with a 100,000 row dataframe, we see much more dramatic results.
In [1]: df = pd.DataFrame(np.random.randn(100_000, 100))
In [2]: %timeit df.append(insert_df)
17.9 ms ± 253 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [3]: %timeit df.loc[range(100), df.columns] = random_matrix
465 μs ± 13.7 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]: %timeit df.loc[range(99_900, 100_000), df.columns] = random_matrix
465 μs ± 5.75 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %timeit df.loc[range(99_900, 100_000), df.columns] = insert_df
1.02 ms ± 3.42 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So we can see that an append is about 17 times slower than an insert with a DataFrame, and about 35 times slower than an insert with a NumPy array.
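A note for readers on newer pandas: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the append timings above can only be reproduced via pd.concat, while the .loc-based insert pattern still works unchanged. A minimal equivalent of df.append(insert_df):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 100))
insert_df = pd.DataFrame(np.random.randn(100, 100))

# pd.concat is the pandas >= 2.0 replacement for df.append(insert_df)
out = pd.concat([df, insert_df])
```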
Answered by libertasT
Another way is to collect everything into a list and then use pd.concat.
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

def append(df):
    df_out = df.copy()
    for i in range(1000):
        df_out = df_out.append(df)
    return df_out

def concat(df):
    df_list = []
    for i in range(1001):
        df_list.append(df)
    return pd.concat(df_list)

# some testing
df2 = concat(df)
df3 = append(df)
pd.testing.assert_frame_equal(df2, df3)
%timeit concat(df)
20.2 ms ± 794 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit append(df)
275 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is now the recommended way to concatenate rows in pandas:

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once. link
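Following the quoted recommendation, a minimal sketch (the column name "x" and the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2]})

# collect new rows in a plain list instead of appending one by one
new_rows = []
for i in range(3, 6):
    new_rows.append({"x": i})

# a single concatenate with the original DataFrame at the end
df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
```

The expensive DataFrame construction happens once, regardless of how many rows were accumulated.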