Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/33642951/

python - Using pandas structures with large csv (iterate and chunksize)

Tags: python, pandas, csv, dataframe

Asked by Thodoris P

I have a large csv file, about 600 MB with 11 million rows, and I want to create statistical data like pivots, histograms, graphs etc. Obviously, just trying to read it normally:

df = pd.read_csv('Check400_900.csv', sep='\t')

doesn't work, so I found iterate and chunksize in a similar post, so I used

df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)

All good, I can for example print df.get_chunk(5) and search the whole file with just

for chunk in df:
    print(chunk)

My problem is that I don't know how to use stuff like the following for the whole df and not for just one chunk:

plt.plot()
print(df.head())
print(df.describe())
print(df.dtypes)
customer_group3 = df.groupby('UserID')
y3 = customer_group3.size()

I hope my question is not too confusing.

Accepted answer by jezrael

Solution, if you need to create one big DataFrame and need to process all the data at once (which is possible, but not recommended):

Then use concat on all the chunks to build df, because the output type of:

df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)

isn't a DataFrame, but a pandas.io.parsers.TextFileReader (source).

tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(tp)
#<pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(tp, ignore_index=True)

I think it is necessary to add the parameter ignore_index to concat, to avoid duplicate indexes.
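
A toy illustration of what ignore_index does in concat; the frames a and b here are made up, not the actual chunks:

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})  # index 0, 1
b = pd.DataFrame({'x': [3, 4]})  # index 0, 1 again

print(pd.concat([a, b]).index.tolist())                     # [0, 1, 0, 1] - duplicate labels
print(pd.concat([a, b], ignore_index=True).index.tolist())  # [0, 1, 2, 3]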

EDIT:

But if you want to work with large data by aggregating it, it is much better to use dask, because it provides advanced parallelism.
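
For example, here is a minimal sketch of the same per-UserID aggregation done with dask, assuming dask is installed and reusing the Check1_900.csv file and UserID column from the question:

import dask.dataframe as dd

# dask reads the CSV lazily and partitions it itself, so no manual chunksize handling is needed
ddf = dd.read_csv('Check1_900.csv', sep='\t')

# the groupby is built lazily; compute() materializes the result as a pandas Series
y3 = ddf.groupby('UserID').size().compute()
print(y3.head())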

Answered by user29791

You need to concatenate the chunks. For example:

df2 = pd.concat([chunk for chunk in df])

And then run your commands on df2:
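
For instance, the statistics from the question could then be run on the concatenated frame, reusing the UserID column from the question:

print(df2.describe())
print(df2.dtypes)
customer_group3 = df2.groupby('UserID')
print(customer_group3.size())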

Answered by abarnert

You do not need concat here. It's exactly like writing sum(map(list, grouper(tup, 1000))) instead of list(tup). The only thing iterator and chunksize=1000 do is give you a reader object that iterates over 1000-row DataFrames instead of reading the whole thing. If you want the whole thing at once, just don't use those parameters.

But if reading the whole file into memory at once is too expensive (e.g., it takes so much memory that you get a MemoryError, or slows your system to a crawl by throwing it into swap hell), that's exactly what chunksize is for.

The problem is that you named the resulting iterator df, and then tried to use it as a DataFrame. It's not a DataFrame; it's an iterator that gives you 1000-row DataFrames one by one.

When you say this:

My problem is that I don't know how to use stuff like the following for the whole df and not for just one chunk

The answer is that you can't. If you can't load the whole thing into one giant DataFrame, you can't use one giant DataFrame. You have to rewrite your code around chunks.

Instead of this:

df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(df.dtypes)
customer_group3 = df.groupby('UserID')

… you have to do things like this:

for df in pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000):
    print(df.dtypes)
    customer_group3 = df.groupby('UserID')

Often, what you need to do is aggregate some data: reduce each chunk down to something much smaller containing only the parts you need. For example, if you want to sum the entire file by groups, you can groupby each chunk, then sum the chunk by groups, and store a series/array/list/dict of running totals for each group.
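
Here is a minimal sketch of that running-total pattern, reusing the Check1_900.csv file and UserID column from the question:

import pandas as pd

# per-UserID row counts, accumulated one chunk at a time
totals = pd.Series(dtype='int64')

for chunk in pd.read_csv('Check1_900.csv', sep='\t', chunksize=1000):
    # reduce this chunk to per-group counts, then fold them into the running totals
    counts = chunk.groupby('UserID').size()
    totals = totals.add(counts, fill_value=0)

print(totals.sort_values(ascending=False).head())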

Of course it's slightly more complicated than just summing a giant series all at once, but there's no way around that. (Except to buy more RAM and/or switch to 64 bits.) That's how iterator and chunksize solve the problem: by allowing you to make this tradeoff when you need to.