Splitting large text file into smaller text files by line numbers using Python
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/16289859/
Asked by walterfaye
I have a text file, say really_big_file.txt, that contains:
line 1
line 2
line 3
line 4
...
line 99999
line 100000
I would like to write a Python script that divides really_big_file.txt into smaller files with 300 lines each. For example, small_file_300.txt would have lines 1-300, small_file_600.txt would have lines 301-600, and so on, until there are enough small files to contain all the lines from the big file.
I would appreciate any suggestions on the easiest way to accomplish this using Python.
Answered by jamylak
Using the itertools grouper recipe:
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

n = 300

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}'.format(i * n), 'w') as fout:
            fout.writelines(g)
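To see what the recipe produces before pointing it at a large file, here is a small sketch on a seven-line list. The sample data is made up for illustration. The `[iter(iterable)] * n` expression repeats one shared iterator n times, so zip_longest pulls n consecutive items into each tuple and pads the last tuple with fillvalue.

```python
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n  # n references to the SAME iterator
    return zip_longest(fillvalue=fillvalue, *args)

# Hypothetical sample data standing in for the lines of a file
sample_lines = ['line {0}\n'.format(k) for k in range(1, 8)]

groups = list(grouper(3, sample_lines, fillvalue=''))
for g in groups:
    print(g)
```

The last tuple holds one real line and two empty-string fillers, which is why writelines adds nothing extra to the final file.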
The advantage of this method, as opposed to storing every line in a list, is that it works with iterables, line by line, so it never has to load the whole big file into memory; only one group of n lines is held at a time.
Note that the last file in this case will be small_file_100200, but it will only go up to line 100000. This happens because fillvalue='': when the line count does not divide evenly by the group size, the last group is padded with empty strings, so nothing is written for the missing lines. You can fix this by writing to a temp file and renaming it afterwards, instead of naming it first as I have. Here's how that can be done:
import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
        # dir='.' keeps the temp file on the same filesystem as the target,
        # since os.rename can fail when moving across filesystems
        with tempfile.NamedTemporaryFile('w', delete=False, dir='.') as fout:
            for j, line in enumerate(g, 1): # count number of lines in group
                if line is None:
                    j -= 1 # don't count this line
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))
This time fillvalue=None, and I go through each line checking for None. When it occurs, I know the input is exhausted, so I subtract 1 from j to avoid counting the filler, and then rename the file.
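A related sketch, not part of the answer above: itertools.islice can pull up to n lines at a time without any filler, so the number of lines actually read is known before the output file is named, and no temporary file is needed. The function name split_by_lines and the out_dir parameter are assumptions made for this example.

```python
import os
from itertools import islice

def split_by_lines(path, n, out_dir='.'):
    """Split the file at `path` into files of at most n lines each.

    Each output file is named after the number of its last line.
    Returns the list of file names created (hypothetical helper).
    """
    created = []
    last_line = 0
    with open(path) as f:
        while True:
            chunk = list(islice(f, n))  # reads at most n lines, no filler
            if not chunk:
                break
            last_line += len(chunk)
            name = os.path.join(out_dir, 'small_file_{0}.txt'.format(last_line))
            with open(name, 'w') as fout:
                fout.writelines(chunk)
            created.append(name)
    return created
```

Calling split_by_lines('really_big_file.txt', 300) would then name the final file after the true last line of the input rather than the next multiple of 300.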
Answered by juliomalegria
lines_per_file = 300  # Lines on each small file
lines = []  # Stores lines not yet written on a small file
lines_counter = 0  # Same as len(lines)
created_files = 0  # Counting how many small files have been created

with open('really_big_file.txt') as big_file:
    for line in big_file:  # Go through the whole big file
        lines.append(line)
        lines_counter += 1
        if lines_counter == lines_per_file:
            idx = lines_per_file * (created_files + 1)
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                # Write all lines on small file (each line already ends with '\n')
                small_file.write(''.join(lines))
            lines = []  # Reset variables
            lines_counter = 0
            created_files += 1  # One more small file has been created
    # After for-loop has finished
    if lines_counter:  # There are still some lines not written on a file?
        idx = lines_per_file * (created_files + 1)
        with open('small_file_%s.txt' % idx, 'w') as small_file:
            # Write them on a last small file
            small_file.write(''.join(lines))
        created_files += 1

print('%s small files (with up to %s lines each) were created.' % (created_files,
                                                                   lines_per_file))
Answered by Ryan Saxe
I do this in a more understandable way, using fewer shortcuts, to give you a better understanding of how and why this works. The previous answers work, but if you are not familiar with certain built-in functions, you will not understand what the code is doing.
Because you posted no code, I decided to do it this way: the way you phrased the question suggests you had not tried anything yet and were unsure how to approach it, so you may be unfamiliar with anything beyond basic Python syntax.
Here are the steps to do this in basic Python:
First, read your file into a list for safekeeping:
my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
Second, you need a way to create the new files by name. I would suggest a loop along with a couple of counters:
outer_count = 1
line_count = 0
sorting = True
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
Third, inside that loop you need some nested loops that save the correct rows into a list:
    hold_new_lines = []
    # <= rather than <, so an exact multiple of 300 lines
    # does not leave an empty trailing file
    if left <= 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
Last, still inside the first (outer) loop, write the new file and increment the outer counter so that the loop runs again and writes the next file:
    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)
Note: if the number of lines is not divisible by 300, the last file's name will not correspond to its last line number.
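One way to make the name match the last line, sketched here with a hypothetical helper chunk_ranges (not part of this answer), is to compute each chunk's first and last line number up front and cap the last one at the total line count.

```python
def chunk_ranges(total_lines, chunk_size=300):
    """Return (first, last) 1-based line numbers for each chunk."""
    ranges = []
    for first in range(1, total_lines + 1, chunk_size):
        last = min(first + chunk_size - 1, total_lines)  # cap at the end of the file
        ranges.append((first, last))
    return ranges

# With 700 lines the last file would be named after line 700, not 900
for first, last in chunk_ranges(700):
    print("small_file_" + str(last) + ".txt", "holds lines", first, "to", last)
```

With the lines held in a list such as hold_lines, the rows for one chunk would be the slice hold_lines[first - 1:last].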
It is important to understand why these loops work. The file name depends on a variable that changes on every pass (outer_count), so each iteration of the loop writes to a differently named file. This is a very useful scripting technique for accessing, opening, writing, and organizing files.
In case you could not follow which code goes in which loop, here is the script in its entirety:
my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    # <= rather than <, so an exact multiple of 300 lines
    # does not leave an empty trailing file
    if left <= 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)
Answered by Matt Anderson
lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()
Answered by Varun
import csv
import os
import re

MAX_CHUNKS = 300  # rows per output file


def writeRow(idr, row):
    # Python 3: csv files are opened in text mode with newline=''
    with open("file_%d.csv" % idr, 'a', newline='') as file:
        writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
        writer.writerow(row)


def cleanup():
    # Remove output files left over from a previous run
    for f in os.listdir("."):
        if re.search(r"^file_\d+\.csv$", f):
            os.remove(os.path.join(".", f))


def main():
    cleanup()
    with open("large_file.csv", newline='') as results:
        r = csv.reader(results, delimiter=',', quotechar='"')
        idr = 1
        for i, x in enumerate(r):
            # Move on to the next output file after every MAX_CHUNKS rows
            if i and not (i % MAX_CHUNKS):
                idr += 1
            writeRow(idr, x)


if __name__ == "__main__":
    main()
Answered by knowingpark
I had to do the same with 650,000-line files.
Take the enumerate index and integer-divide it (//) by the chunk size.
When that number changes, close the current file and open a new one.
This is a Python 3 solution using format strings.
chunk = 50000  # number of lines from the big file to put in small file
this_small_file = open('./a_folder/0', 'a')

with open('massive_web_log_file') as file_to_read:
    for i, line in enumerate(file_to_read):  # iterate lazily instead of readlines()
        file_name = f'./a_folder/{i // chunk}'
        print(i, file_name)  # a bit of feedback that slows the process down a
        if file_name != this_small_file.name:
            # The chunk index changed: close the current file and open the next one
            this_small_file.close()
            this_small_file = open(file_name, 'a')
        this_small_file.write(line)

this_small_file.close()
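To make the integer-division idea concrete, here is a small sketch with a chunk size of 3 and made-up lines. It uses itertools.groupby as a stand-in for the open/close bookkeeping, which is a swapped-in technique rather than what the answer above does.

```python
from itertools import groupby

chunk = 3  # small chunk size, for illustration only
sample_lines = ['a\n', 'b\n', 'c\n', 'd\n', 'e\n', 'f\n', 'g\n']

# i // chunk is the index of the small file that line i belongs to
file_index = [i // chunk for i, _ in enumerate(sample_lines)]

# Consecutive lines that share an index form one output file
chunks = {
    idx: [line for _, line in group]
    for idx, group in groupby(enumerate(sample_lines), key=lambda pair: pair[0] // chunk)
}
print(file_index)
print(chunks)
```

Every time the value of i // chunk changes, a new group starts, which is the same moment the answer's loop closes one file and opens the next.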

