pandas - How to stream in and manipulate a large data file in Python
Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must keep the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/38208389/
How to stream in and manipulate a large data file in python
Asked by HFBrowning
I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:
Geography AgeGroup Gender Race Count
County1 1 M 1 12
County1 2 M 1 3
County1 2 M 2 0
To:
Geography Count
County1 15
County2 23
This would be a simple matter if the whole file could fit in memory, but using pandas.read_csv() gives MemoryError. So I have been looking into other methods, and there appear to be many options - HDF5? Using itertools (which seems complicated - generators?) Or just using the standard file methods to read in the first geography (70 lines), sum the count column, and write out before loading in another 70 lines.
Does anyone have any suggestions on the best way to do this? I especially like the idea of streaming data in, especially because I can think of a lot of other places where this would be useful. I am most interested in this method, or one that similarly uses the most basic functionality possible.
Edit: In this small case I only want the sums of count by geography. However, it would be ideal if I could read in a chunk, specify any function (say, add 2 columns together, or take the max of a column by geography), apply the function, and write the output before reading in a new chunk.
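(For reference, a minimal sketch of the "standard file methods" idea mentioned above - streaming the rows and keeping only running per-geography totals in memory. The file name, comma delimiter, and column names are assumptions based on the sample data and the answers below.)

import csv
from collections import defaultdict

totals = defaultdict(int)

# Stream the file row by row; only the per-geography running totals stay in memory.
with open('my_file.csv', newline='') as f:
    for row in csv.DictReader(f):
        totals[row['Geography']] += int(row['Count'])

# Write out the aggregated result.
with open('my_output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Geography', 'Count'])
    for geography, count in sorted(totals.items()):
        writer.writerow([geography, count])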
Answered by root
You can use dask.dataframe, which is syntactically similar to pandas, but performs manipulations out-of-core, so memory shouldn't be an issue:
import dask.dataframe as dd
df = dd.read_csv('my_file.csv')
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
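(As a side note, since dask mirrors the pandas syntax, the "any function" case from the question's edit - adding two columns together or taking a per-geography maximum - looks essentially the same. A rough sketch, with the file and column names taken from the sample above:)

import dask.dataframe as dd

df = dd.read_csv('my_file.csv')

# A derived column and a different reduction, written as in pandas;
# nothing is computed until .compute() (or .to_csv()) is called.
df['AgePlusRace'] = df['AgeGroup'] + df['Race']
per_geo_max = df.groupby('Geography')['Count'].max().compute()
print(per_geo_max.head())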
Alternatively, if pandas is a requirement you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter.
# Operate on chunks.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
data.append(chunk)
# Combine the chunked data.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
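(Regarding the edit in the question: this chunked pattern carries over to any reduction whose per-chunk results can be combined again - for example a per-geography maximum, since the max of the chunk maxima equals the overall max. A hedged sketch, reusing the file and column names assumed above:)

import pandas as pd

partials = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    # Per-chunk maxima; small enough to hold in memory.
    partials.append(chunk.groupby('Geography', as_index=False)['Count'].max())

# Max of the per-chunk maxima gives the overall per-geography max.
result = (pd.concat(partials, ignore_index=True)
            .groupby('Geography')['Count']
            .max()
            .to_frame())
result.to_csv('my_max_output.csv')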
Answered by MaxU
I do like @root's solution, but I would go a bit further in optimizing memory usage - keeping only the aggregated DF in memory and reading only the columns that you really need:
import pandas as pd

cols = ['Geography', 'Count']
df = pd.DataFrame()
chunksize = 2  # adjust it! for example --> 10**5
filename = 'my_file.csv'  # path to the source CSV

for chunk in pd.read_csv(filename, usecols=cols, chunksize=chunksize):
    # merge previously aggregated DF with a new portion of data and aggregate it again
    df = (pd.concat([df,
                     chunk.groupby('Geography')['Count'].sum().to_frame()])
            .groupby(level=0)['Count']
            .sum()
            .to_frame()
         )

df.reset_index().to_csv('c:/temp/result.csv', index=False)
test data:
Geography,AgeGroup,Gender,Race,Count
County1,1,M,1,12
County2,2,M,1,3
County3,2,M,2,0
County1,1,M,1,12
County2,2,M,1,33
County3,2,M,2,11
County1,1,M,1,12
County2,2,M,1,111
County3,2,M,2,1111
County5,1,M,1,12
County6,2,M,1,33
County7,2,M,2,11
County5,1,M,1,12
County8,2,M,1,111
County9,2,M,2,1111
output.csv:
Geography,Count
County1,36
County2,147
County3,1122
County5,24
County6,33
County7,11
County8,111
County9,1111
PS: using this approach, you can process huge files.
PPS: the chunking approach should work unless you need to sort your data - in that case I would use classic UNIX tools, like awk, sort, etc., to sort the data first.
I would also recommend using PyTables (HDF5 storage) instead of CSV files - it is very fast and allows you to read data conditionally (using the where parameter), so it's very handy, saves a lot of resources, and is usually much faster compared to CSV.