How can I partially read a huge CSV file?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/29334463/
Asked by lserlohn
I have a very big CSV file, so I cannot read it all into memory. I only want to read and process a few lines of it. So I am seeking a function in Pandas that can handle this task, which basic Python handles well:
with open('abc.csv') as f:
line = f.readline()
# pass until it reaches a particular line number....
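The plain-Python idea above can be sketched with `itertools.islice`, which skips ahead to a line range lazily. This is a minimal, self-contained example; the data is made up and stands in for 'abc.csv':

```python
import csv
import io
from itertools import islice

# A small in-memory stand-in for 'abc.csv' (made-up data).
text = "\n".join(f"row{i},a,b" for i in range(5000))

# islice advances the reader lazily: rows 0-999 are skipped and
# iteration stops after row 1999, so nothing else is ever parsed.
with io.StringIO(text) as f:
    reader = csv.reader(f)
    rows = list(islice(reader, 1000, 2000))

print(len(rows))    # 1000
print(rows[0][0])   # row1000
```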
However, if I do this in pandas, I always read the first line:
datainput1 = pd.read_csv('matrix.txt', sep=',', header=None, nrows=1)
datainput2 = pd.read_csv('matrix.txt', sep=',', header=None, nrows=1)
I am looking for an easier way to handle this task in pandas, for example, reading rows 1000 to 2000. How can I do this quickly?
I want to use pandas because I want to read data into the dataframe.
Accepted answer by EdChum
Use chunksize:
for df in pd.read_csv('matrix.txt', sep=',', header=None, chunksize=1):
    # do something with each single-row DataFrame
To answer the second part of your question, do this:
df = pd.read_csv('matrix.txt', sep=',', header=None, skiprows=1000, chunksize=1000)
This will skip the first 1000 rows and then read only the next 1000 rows, giving you rows 1000-2000. It is unclear whether you need the end points included, but you can fiddle with the numbers to get what you want. (Note that with chunksize set, read_csv returns an iterator of DataFrames rather than a single DataFrame.)
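A runnable sketch of this recipe, using an in-memory CSV in place of 'matrix.txt' (the data is made up). Because chunksize makes read_csv return an iterator, the first chunk pulled from it is exactly rows 1000-1999:

```python
import io
import pandas as pd

# Stand-in for 'matrix.txt': 5000 rows whose first column is the row number.
csv_text = "\n".join(f"{i},{i * 2}" for i in range(5000))

# skiprows drops rows 0-999; with chunksize set, read_csv returns an
# iterator of DataFrames, so the first chunk covers rows 1000-1999.
reader = pd.read_csv(io.StringIO(csv_text), sep=',', header=None,
                     skiprows=1000, chunksize=1000)
chunk = next(reader)

print(chunk.shape)            # (1000, 2)
print(int(chunk.iloc[0, 0]))  # 1000
```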
Answered by petezurich
In addition to EdChum's answer, I find the nrows argument useful; it simply defines the number of rows you want to import. You thereby don't get an iterator, but can instead just import a part of the whole file, of size nrows. It works with skiprows too.
df = pd.read_csv('matrix.txt', sep=',', header=None, skiprows=1000, nrows=1000)
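A minimal sketch of the skiprows + nrows combination, again on made-up in-memory data standing in for 'matrix.txt'. Unlike the chunksize approach, this reads one slice directly into a DataFrame:

```python
import io
import pandas as pd

# Stand-in for 'matrix.txt' (first column numbers the rows).
csv_text = "\n".join(f"{i},{i % 7}" for i in range(5000))

# skiprows + nrows reads a single slice straight into a DataFrame:
# no iterator, just rows 1000-1999.
df = pd.read_csv(io.StringIO(csv_text), sep=',', header=None,
                 skiprows=1000, nrows=1000)

print(df.shape)             # (1000, 2)
print(int(df.iloc[-1, 0]))  # 1999
```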