Python: Create a pandas DataFrame from a generator?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, note the original address, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/18915941/
Create a pandas DataFrame from generator?
Asked by tinproject
I've created a tuple generator that extracts information from a file, filtering only the records of interest and converting them to tuples that the generator returns.
I've tried to create a DataFrame from it:
import pandas as pd
df = pd.DataFrame.from_records(tuple_generator, columns = tuple_fields_name_list)
but it throws an error:
...
C:\Anaconda\envs\py33\lib\site-packages\pandas\core\frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
1046 values.append(row)
1047 i += 1
-> 1048 if i >= nrows:
1049 break
1050
TypeError: unorderable types: int() >= NoneType()
I managed to get it to work by consuming the generator into a list, but that uses twice the memory:
df = pd.DataFrame.from_records(list(tuple_generator), columns = tuple_fields_name_list)
The files I want to load are big, and memory consumption matters. On my last try, my computer spent two hours trying to grow virtual memory :(
The question: does anyone know a method to create a DataFrame directly from a record generator, without first converting it to a list?
Note: I'm using Python 3.3 and pandas 0.12 with Anaconda on Windows.
Update:
Reading the file is not the problem; my tuple generator does that well. It scans a compressed text file of intermixed records line by line, converts only the wanted data to the correct types, and then yields the fields as a generator of tuples. Some numbers: it scans 2111412 records in a 130MB gzip file, about 6.5GB uncompressed, in about a minute and with little memory used.
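For reference, a minimal sketch of what such a generator might look like (the file name, field layout, and filter condition below are hypothetical, not from the original post):

import gzip

def tuple_generator(path):
    # Scan a gzipped text file line by line, keep only the records
    # of interest, and yield each one as a typed tuple.
    with gzip.open(path, 'rt') as f:
        for line in f:
            fields = line.rstrip('\n').split(',')
            if fields[0] != 'WANTED':  # hypothetical filter condition
                continue
            yield (fields[0], int(fields[1]), float(fields[2]))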
Pandas 0.12 does not accept generators; the dev version accepts them, but it puts the whole generator into a list and then converts that to a frame. It's not efficient, but it's something pandas has to deal with internally. Meanwhile, I must think about buying some more memory.
Accepted answer by Viktor Kerkez
You cannot create a DataFrame from a generator with the 0.12 version of pandas. You can either update to the development version (get it from GitHub and compile it, which is a little bit painful on Windows, but I would prefer this option).
Or, since you said you are filtering the lines, you can first filter them, write them to a file, and then load them using read_csv or something else...
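A minimal sketch of that idea (the file names and the filter condition are made up for illustration; tuple_fields_name_list is the column list from the question):

import gzip
import pandas as pd

# First pass: write only the interesting lines to a plain CSV file...
with gzip.open('big_input.gz', 'rt') as src, open('filtered.csv', 'w') as dst:
    for line in src:
        if 'WANTED' in line:  # hypothetical filter
            dst.write(line)

# ...then let read_csv parse it, so no Python list of tuples is ever built.
df = pd.read_csv('filtered.csv', names=tuple_fields_name_list)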
If you want to get super complicated, you can create a file-like object that will return the lines:
def gen():
    # A toy generator that yields CSV-formatted lines.
    lines = [
        'col1,col2\n',
        'foo,bar\n',
        'foo,baz\n',
        'bar,baz\n'
    ]
    for line in lines:
        yield line

class Reader(object):
    # Wraps a generator of lines in a minimal file-like interface:
    # read_csv calls read() repeatedly until it returns '' (end of file).
    def __init__(self, g):
        self.g = g
    def read(self, n=0):
        try:
            return next(self.g)
        except StopIteration:
            return ''
And then use read_csv:
>>> pd.read_csv(Reader(gen()))
col1 col2
0 foo bar
1 foo baz
2 bar baz
Answered by Jeff
To make it memory efficient, read in chunks. Something like this, using Viktor's Reader class from above:
df = pd.concat(list(pd.read_csv(Reader(gen()), chunksize=10000)))
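Applied to a real file on disk rather than the toy generator, the same chunked pattern might look like this (a sketch assuming a recent pandas; the compression='gzip' option of read_csv is not guaranteed to exist in 0.12, and the file and column names reuse the hypothetical ones above):

import pandas as pd

# Read the gzipped file in 10000-row chunks and stack them vertically.
chunks = pd.read_csv('big_input.gz', names=tuple_fields_name_list,
                     compression='gzip', chunksize=10000)
df = pd.concat(list(chunks), ignore_index=True)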
Answered by Guilherme Freitas
You can also use something like this (tested in Python 2.7.5):
from itertools import izip  # Python 2; on Python 3, use the built-in zip instead
import pandas as pd

def dataframe_from_row_iterator(row_iterator, colnames):
    # Transpose the rows into columns (the * unpacking consumes the iterator)
    col_iterator = izip(*row_iterator)
    return pd.DataFrame({cn: cv for (cn, cv) in izip(colnames, col_iterator)})
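For example, with a small generator of row tuples (the sample data is made up):

rows = ((i, i * i) for i in range(5))
df = dataframe_from_row_iterator(rows, ['n', 'n_squared'])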
You can also adapt this to append rows to a DataFrame.
-- Edit, Dec 4th: s/row/rows in last line
Answered by C8H10N4O2
You certainly can construct a pandas.DataFrame() from a generator of tuples, as of version 0.19 (and probably earlier). Don't use .from_records(); just use the constructor, for example:
import pandas as pd
someGenerator = ( (x, chr(x)) for x in range(48,127) )
someDf = pd.DataFrame(someGenerator)
Produces:
type(someDf) #pandas.core.frame.DataFrame
someDf.dtypes
#0 int64
#1 object
#dtype: object
someDf.tail(10)
# 0 1
#69 117 u
#70 118 v
#71 119 w
#72 120 x
#73 121 y
#74 122 z
#75 123 {
#76 124 |
#77 125 }
#78 126 ~
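If you want named columns, the constructor also accepts a columns argument (the names here are made up):

someGenerator = ( (x, chr(x)) for x in range(48,127) )  # recreate it; the one above is exhausted
someDf = pd.DataFrame(someGenerator, columns=['code', 'char'])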
Answered by Natalia Sashnikova
If the generator is just like a list of DataFrames, you only need to create a new DataFrame concatenating the elements of the list:
result = pd.concat(list)
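A minimal sketch, assuming the generator yields DataFrames that share the same columns (the data is made up):

import pandas as pd

df_generator = (pd.DataFrame({'a': [i], 'b': [i * 2]}) for i in range(3))
result = pd.concat(list(df_generator), ignore_index=True)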
Recently I've faced the same problem.