
Disclaimer: this page is an English version of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/18915941/

Date: 2020-08-19 12:15:28  Source: igfitidea

Create a pandas DataFrame from generator?

python, python-3.x, pandas

Asked by tinproject

I've created a tuple generator that extracts information from a file, filtering only the records of interest and converting them to tuples that the generator returns.


I've tried to create a DataFrame from:


import pandas as pd
df = pd.DataFrame.from_records(tuple_generator, columns = tuple_fields_name_list)

but it throws an error:


... 
C:\Anaconda\envs\py33\lib\site-packages\pandas\core\frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
   1046                 values.append(row)
   1047                 i += 1
-> 1048                 if i >= nrows:
   1049                     break
   1050 

TypeError: unorderable types: int() >= NoneType()

I managed to get it to work by consuming the generator in a list, but that uses twice the memory:


df = pd.DataFrame.from_records(list(tuple_generator), columns = tuple_fields_name_list)

The files I want to load are big, and memory consumption matters. On my last try, my computer spent two hours trying to grow the virtual memory :(


The question: Does anyone know a method to create a DataFrame directly from a record generator, without first converting it to a list?


Note: I'm using python 3.3 and pandas 0.12 with Anaconda on Windows.


Update:


Reading the file is not the problem; my tuple generator does that well. It scans a compressed text file of intermixed records line by line, converts only the wanted data to the correct types, and then yields the fields as a generator of tuples. Some numbers: it scans 2,111,412 records in a 130 MB gzip file (about 6.5 GB uncompressed) in about a minute, with little memory used.

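For reference, such a filtering generator can be sketched roughly like this (the delimiter, filter condition, and field types are hypothetical, not taken from the original code):

```python
import gzip

def record_generator(path):
    # Scan a gzip-compressed text file line by line, keep only the
    # records of interest, and yield each one as a typed tuple.
    with gzip.open(path, "rt") as f:
        for line in f:
            fields = line.rstrip("\n").split(",")
            if fields[0] != "WANTED":  # hypothetical filter condition
                continue
            yield (fields[0], int(fields[1]), float(fields[2]))
```

Because it yields one tuple at a time, memory use stays flat regardless of file size.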

Pandas 0.12 does not accept generators; the dev version accepts them, but it puts the whole generator into a list and then converts that to a frame. It's not efficient, but it's something pandas has to handle internally. Meanwhile, I must think about buying some more memory.


Accepted answer by Viktor Kerkez

You cannot create a DataFrame from a generator with the 0.12 version of pandas. You can update to the development version (get it from GitHub and compile it, which is a little painful on Windows, but I would prefer this option).


Or, since you said you are filtering the lines, you can first filter them, write them to a file, and then load them using read_csv or something else...


If you want to get super complicated, you can create a file-like object that will return the lines:


def gen():
    lines = [
        'col1,col2\n',
        'foo,bar\n',
        'foo,baz\n',
        'bar,baz\n'
    ]
    for line in lines:
        yield line

class Reader(object):
    """Minimal file-like wrapper: read_csv only needs a read() method."""
    def __init__(self, g):
        self.g = g
    def read(self, n=0):
        # Return the next line on each call; '' signals end of input.
        try:
            return next(self.g)
        except StopIteration:
            return ''

And then use read_csv:


>>> pd.read_csv(Reader(gen()))
  col1 col2
0  foo  bar
1  foo  baz
2  bar  baz

Answer by Jeff

To make it memory efficient, read in chunks. Something like this, using Viktor's Reader class from above:


df = pd.concat(list(pd.read_csv(Reader(gen()), chunksize=10000)), ignore_index=True)
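If the goal is to avoid materializing the whole generator at once, an alternative sketch builds the frame from fixed-size slices of the tuple generator with itertools.islice (the function name and chunk size here are illustrative):

```python
import itertools
import pandas as pd

def dataframe_from_tuples(tuples, columns, chunksize=10000):
    # Build a small DataFrame per slice of the generator, then
    # concatenate them row-wise; peak memory is one chunk plus the result.
    frames = []
    while True:
        chunk = list(itertools.islice(tuples, chunksize))
        if not chunk:
            break
        frames.append(pd.DataFrame(chunk, columns=columns))
    return pd.concat(frames, ignore_index=True)

df = dataframe_from_tuples(((i, i % 2) for i in range(25)), ["n", "parity"], chunksize=10)
```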

Answer by Guilherme Freitas

You can also use something like this (tested with Python 2.7.5):


from itertools import izip

import pandas as pd

def dataframe_from_row_iterator(row_iterator, colnames):
    # izip(*rows) transposes the row iterator into per-column tuples
    col_iterator = izip(*row_iterator)
    return pd.DataFrame({cn: cv for (cn, cv) in izip(colnames, col_iterator)})

You can also adapt this to append rows to a DataFrame.
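On Python 3, where izip no longer exists, the same idea can be sketched with the built-in zip:

```python
import pandas as pd

def dataframe_from_row_iterator(row_iterator, colnames):
    # zip(*rows) transposes the stream of rows into per-column tuples;
    # note this still materializes all the data in memory.
    col_iterator = zip(*row_iterator)
    return pd.DataFrame({cn: list(cv) for cn, cv in zip(colnames, col_iterator)})

df = dataframe_from_row_iterator(((i, i * i) for i in range(3)), ["n", "square"])
```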


-- Edit, Dec 4th: s/row/rows in last line


Answer by C8H10N4O2

You certainly can construct a pandas.DataFrame() from a generator of tuples, as of version 0.19 (and probably earlier). Don't use .from_records(); just use the constructor, for example:


import pandas as pd
someGenerator = ( (x, chr(x)) for x in range(48,127) )
someDf = pd.DataFrame(someGenerator)

Produces:


type(someDf) #pandas.core.frame.DataFrame

someDf.dtypes
#0     int64
#1    object
#dtype: object

someDf.tail(10)
#      0  1
#69  117  u
#70  118  v
#71  119  w
#72  120  x
#73  121  y
#74  122  z
#75  123  {
#76  124  |
#77  125  }
#78  126  ~
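The constructor also takes column names directly via the columns parameter, e.g.:

```python
import pandas as pd

# Pair each printable ASCII code in 48..126 with its character
gen = ((x, chr(x)) for x in range(48, 127))
df = pd.DataFrame(gen, columns=["code", "char"])
```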

Answer by Natalia Sashnikova

If the generator is just like a list of DataFrames, you only need to create a new DataFrame concatenating the elements of the list:


result = pd.concat(list)

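A minimal sketch of this approach (pd.concat accepts any iterable of DataFrames, not only a list):

```python
import pandas as pd

# A generator that yields DataFrames, e.g. one per file or per query page
frames = (pd.DataFrame({"x": [i, i + 1]}) for i in range(3))

# Concatenate them row-wise into a single DataFrame
result = pd.concat(frames, ignore_index=True)
```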

Recently I've faced the same problem.
