Reading large text files with Pandas

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/23411619/

Time: 2020-09-13 21:59:32  Source: igfitidea

python csv pandas ipython large-files

Asked by marillion

I have been trying to read a few large text files (around 1.4GB - 2GB each) with Pandas, using the read_csv function, to no avail. Below are the versions I am using:

  • Python 2.7.6
  • Anaconda 1.9.2 (64-bit) (default, Nov 11 2013, 10:49:15) [MSC v.1500 64 bit (AMD64)]
  • IPython 1.1.0
  • Pandas 0.13.1

I tried the following:

import pandas as pd
df = pd.read_csv('data.txt')

and it crashed IPython with the message: Kernel died, restarting.

Then I tried using an iterator:

tp = pd.read_csv('data.txt', iterator=True, chunksize=1000)

Again, I got the "Kernel died, restarting" error.

Any ideas? Or any other way to read big text files?

Thank you!

Answered by DarkCygnus

A solution to a similar question was given here some time after this question was posted. Basically, it suggests reading the file in chunks, as follows:

chunksize = 10 ** 6  # number of rows per chunk
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

You should set the chunksize parameter according to your machine's capabilities (that is, make sure each chunk fits comfortably in memory and can be processed).
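
To make the process(chunk) placeholder above more concrete, here is a minimal sketch of two common patterns: accumulating a summary across chunks, and filtering each chunk so that only the rows of interest are kept in memory. The function names summarize_in_chunks and filter_in_chunks, and the column and threshold arguments, are hypothetical and chosen only for illustration; they are not part of the original answer.

```python
import pandas as pd


def summarize_in_chunks(path, column, chunksize=10 ** 6):
    """Return (row_count, column_sum) without loading the whole file at once.

    Hypothetical helper: `column` is assumed to be a numeric column in the file.
    """
    row_count = 0
    column_sum = 0.0
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(path, chunksize=chunksize):
        row_count += len(chunk)
        column_sum += chunk[column].sum()
    return row_count, column_sum


def filter_in_chunks(path, column, threshold, chunksize=10 ** 6):
    """Keep only rows where `column` > threshold, then combine the survivors.

    Hypothetical helper: assumes the file has at least one data row.
    """
    kept = [
        chunk[chunk[column] > threshold]
        for chunk in pd.read_csv(path, chunksize=chunksize)
    ]
    # Only the filtered rows are held in memory, not the full file.
    return pd.concat(kept, ignore_index=True)
```

The filtering variant only helps if the rows that survive the filter are much smaller than the full file; if nearly every row is kept, the concatenated result will need about as much memory as reading the file in one go.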
