Note: this page reproduces a Stack Overflow question and its answers under the CC BY-SA 4.0 license. If you reuse or share it, you must likewise follow CC BY-SA and attribute the original authors (not this site). Source: http://stackoverflow.com/questions/42999332/

Fastest way to convert python iterator output to pandas dataframe

Tags: python, pandas

Asked by James

I have a generator that returns an unknown number of rows of data that I want to convert to an indexed pandas dataframe. The fastest way I know of is to write a CSV to disk then parse back in via 'read_csv'. I'm aware that it is not efficient to create an empty dataframe then constantly append new rows. I can't create a pre-sized dataframe because I do not know how many rows will be returned. Is there a way to convert the iterator output to a pandas dataframe without writing to disk?

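The answers below keep everything in memory: exhaust the generator into a list and hand it to the DataFrame constructor, with set_index covering the indexing requirement. A minimal sketch (the generator and column names here are illustrative, not from the question):

import pandas as pd

def rows():
    # stand-in for the real generator of unknown length
    for i in range(5):
        yield ('key%d' % i, i * 2)

# materialise the rows in memory and build the frame in one shot -- no CSV round-trip
df = pd.DataFrame(list(rows()), columns=['key', 'value']).set_index('key')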

Answered by James

Iteratively appending to a pandas data frame is not the best solution. It is better to build your data as a list, and then pass it to pd.DataFrame.

import random
import pandas as pd

alpha = list('abcdefghijklmnopqrstuvwxyz')

Here we create a generator, use it to construct a list, then pass it to the dataframe constructor:

%%timeit
gen = ((random.choice(alpha), random.randint(0,100)) for x in range(10000))
my_data = [x for x in gen]
df = pd.DataFrame(my_data, columns=['letter','value'])

# result: 1 loop, best of 3: 373 ms per loop
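For reference, the intermediate list is not strictly required: pd.DataFrame.from_records accepts an iterator of tuples and builds the frame in memory as well. A sketch reusing the alpha list and imports from above (this alternative is not part of the original answer, and its timing was not measured here):

gen = ((random.choice(alpha), random.randint(0, 100)) for x in range(10000))
# from_records consumes the iterable of tuples directly
df = pd.DataFrame.from_records(gen, columns=['letter', 'value'])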

This is quite a bit faster than creating a generator, constructing an empty dataframe, and appending rows one at a time, as seen here:

%%timeit
gen = ((random.choice(alpha), random.randint(0,100)) for x in range(10000))
df = pd.DataFrame(columns=['letter','value'])
for tup in gen:
    df.loc[df.shape[0],:] = tup

# result: 1 loop, best of 3: 13.6 s per loop

This is incredibly slow: roughly 13 seconds to construct 10,000 rows. Each row-wise .loc assignment enlarges the frame, which forces pandas to reallocate and copy the existing data, so the cost grows with every added row.

Answered by blacksite

Would something general like this do the trick?

import numpy as np
import pandas as pd

def make_equal_length_cols(df, new_iter, col_name):
    # convert the generator to a list so we can measure its length and append to it
    new_iter = list(new_iter)
    # if the passed generator (as a list) has fewer elements than the dataframe, pad it with NaN until the lengths are equal
    if len(new_iter) < df.shape[0]:
        new_iter += [np.nan]*(df.shape[0]-len(new_iter))
    else:
        # otherwise, add n all-NaN rows to the dataframe, where n is the difference between len(new_iter) and the length of the dataframe
        new_rows = [{c: np.nan for c in df.columns} for _ in range(len(new_iter)-df.shape[0])]
        new_rows_df = pd.DataFrame(new_rows)
        # note: DataFrame.append was removed in pandas 2.0; pd.concat is the modern replacement
        df = df.append(new_rows_df).reset_index(drop=True)
    df[col_name] = new_iter
    return df
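The df used in the examples below is not shown in the answer; judging by the printed output, it is a small two-column frame along these lines (an assumption):

# hypothetical setup for the examples below: ten rows with integer columns A and B
# (floats appear in the first example only because all-NaN rows get appended)
df = pd.DataFrame({'A': range(10), 'B': range(10)})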

Test it out:

make_equal_length_cols(df, (x for x in range(20)), 'new')
Out[22]: 
      A    B  new
0   0.0  0.0    0
1   1.0  1.0    1
2   2.0  2.0    2
3   3.0  3.0    3
4   4.0  4.0    4
5   5.0  5.0    5
6   6.0  6.0    6
7   7.0  7.0    7
8   8.0  8.0    8
9   9.0  9.0    9
10  NaN  NaN   10
11  NaN  NaN   11
12  NaN  NaN   12
13  NaN  NaN   13
14  NaN  NaN   14
15  NaN  NaN   15
16  NaN  NaN   16
17  NaN  NaN   17
18  NaN  NaN   18
19  NaN  NaN   19

And it also works when the passed generator is shorter than the dataframe:

make_equal_length_cols(df, (x for x in range(5)), 'new')
Out[26]: 
   A  B  new
0  0  0  0.0
1  1  1  1.0
2  2  2  2.0
3  3  3  3.0
4  4  4  4.0
5  5  5  NaN
6  6  6  NaN
7  7  7  NaN
8  8  8  NaN
9  9  9  NaN

Edit: removed the row-by-row pandas.DataFrame.append call and instead constructed a separate dataframe that is appended in one shot. Timings:

New append:

%timeit make_equal_length_cols(df, (x for x in range(10000)), 'new')
10 loops, best of 3: 40.1 ms per loop

Old append:

very slow...
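For context, the removed row-by-row version presumably looked something like the sketch below; this is a reconstruction, not the author's original code. DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, which is one more reason to avoid the pattern:

def make_equal_length_cols_old(df, new_iter, col_name):
    new_iter = list(new_iter)
    # grow the frame one all-NaN row at a time -- every append copies the whole frame
    # (this sketch only handles the case where new_iter is longer than df, the slow path being timed)
    while df.shape[0] < len(new_iter):
        df = df.append({c: np.nan for c in df.columns}, ignore_index=True)
    df[col_name] = new_iter
    return df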