How do you split reading a large csv file into evenly-sized chunks in Python?

Disclaimer: This page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA terms, link the original, and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4956984/


How do you split reading a large csv file into evenly-sized chunks in Python?

Tags: python, list, csv, chunks

Question by Mario César

Basically, I have the following process:


import csv
reader = csv.reader(open('huge_file.csv', 'rb'))

for line in reader:
    process_line(line)

See this related question. I want to dispatch the processing in batches of 100 rows, to implement batch sharding.


The problem with implementing the related answer is that the csv reader object is not subscriptable and does not support len():


>>> import csv
>>> reader = csv.reader(open('dataimport/tests/financial_sample.csv', 'rb'))
>>> len(reader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type '_csv.reader' has no len()
>>> reader[10:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable
>>> reader[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable

How can I solve this?


Accepted answer by miku

Just make your reader subscriptable by wrapping it in a list. Obviously this will break on really large files (see the alternatives in the Updates below):


>>> reader = csv.reader(open('big.csv', 'rb'))
>>> lines = list(reader)
>>> print lines[:100]
...

Further reading: How do you split a list into evenly sized chunks in Python?

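For completeness, the list-chunking idea from that question looks roughly like this (a sketch only; slicing works because lines is now a plain list, and process_chunk stands in for whatever batch handler you use):

chunksize = 100
for start in range(0, len(lines), chunksize):
    chunk = lines[start:start + chunksize]
    process_chunk(chunk)  # each chunk holds up to chunksize rows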



Update 1 (list version): Another possible approach is to just process each chunk as it arrives while iterating over the lines:


#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', 'rb'))

chunk, chunksize = [], 100

def process_chunk(chunk):
    print len(chunk)
    # do something useful ...

for i, line in enumerate(reader):
    if (i % chunksize == 0 and i > 0):
        process_chunk(chunk)
        del chunk[:]  # or: chunk = []
    chunk.append(line)

# process the remainder
process_chunk(chunk)
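
To tie this back to the question's process_line, the batch handler might simply forward each buffered row (a sketch; process_line is the question's own placeholder):

def process_chunk(chunk):
    # hand every buffered row to the per-line handler from the question
    for row in chunk:
        process_line(row)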


Update 2 (generator version): I haven't benchmarked it, but maybe you can increase performance by using a chunk generator:


#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', 'rb'))

def gen_chunks(reader, chunksize=100):
    """ 
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices. 
    """
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]  # or: chunk = []
        chunk.append(line)
    yield chunk

for chunk in gen_chunks(reader):
    print chunk # process chunk

# test gen_chunk on some dummy sequence:
for chunk in gen_chunks(range(10), chunksize=3):
    print chunk # process chunk

# => yields
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]

There is a minor gotcha, as @totalhack points out:


Be aware that this yields the same object over and over with different contents. This works fine if you plan on doing everything you need to with the chunk between each iteration.

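If that reuse is a concern, a small change to the generator avoids it: rebind chunk to a fresh list instead of clearing it in place, so every yielded chunk is an independent object (a sketch of the same gen_chunks, not part of the original answer):

def gen_chunks(reader, chunksize=100):
    """Yield chunksize-sized slices of reader, each as a new list."""
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield chunk
            chunk = []  # new list, so previously yielded chunks stay intact
        chunk.append(line)
    yield chunk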

Answer by D.Shawley

There isn't a good way to do this for all .csv files. You should be able to divide the file into chunks using file.seek to skip a section of the file. Then you have to scan one byte at a time to find the end of the row. Then you can process the two chunks independently. Something like the following (untested) code should get you started.


file_one = open('foo.csv')
file_two = open('foo.csv') 
file_two.seek(0, 2)     # seek to the end of the file
sz = file_two.tell()    # fetch the offset
file_two.seek(sz / 2)   # seek back to the middle
chr = ''
while chr != '\n':
    chr = file_two.read(1)
# file_two is now positioned at the start of a record
segment_one = csv.reader(file_one)
segment_two = csv.reader(file_two)

I'm not sure how you can tell that you have finished traversing segment_one. If you have a column in the CSV that is a row id, then you can stop processing segment_one when you encounter the row id from the first row in segment_two.

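If there is no such row-id column, another option is to bound the first segment by byte offset instead of by content. A rough, untested sketch in the same spirit, reusing sz and file_one from above and assuming rows contain no embedded newlines:

def rows_until(f, stop_offset):
    # yield raw lines until the file position moves past stop_offset
    while f.tell() <= stop_offset:
        line = f.readline()
        if not line:
            break
        yield line

# segment_one now exhausts itself around the midpoint, which is
# roughly where file_two picked up
segment_one = csv.reader(rows_until(file_one, sz / 2))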

Answer by debaonline4u

We can use the pandas module to handle these big CSV files:


import pandas as pd

# read the file in 1000-row pieces and stitch them back into one DataFrame
temp = pd.read_csv('BIG_File.csv', iterator=True, chunksize=1000)
df = pd.concat(temp, ignore_index=True)
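
If the goal is to work on the rows in batches rather than rebuild one big DataFrame, the same reader can be consumed chunk by chunk instead of concatenated (a sketch; process_chunk stands in for whatever batch handler you have):

import pandas as pd

for chunk in pd.read_csv('BIG_File.csv', chunksize=100):
    # each chunk is a DataFrame of at most 100 rows
    process_chunk(chunk)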