Original question: http://stackoverflow.com/questions/22751000/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Split large text file(around 50GB) into multiple files
Asked by saz
I would like to split a large text file of around 50GB into multiple files. The data in the file looks like this (x = any digit from 0 to 9):
xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
...............
...............
There might be a few billion lines in the file, and I would like to write, for example, 30 to 40 million lines per file. I guess the steps would be:
- open the file
- read it line by line using readline(), writing each line to a new file at the same time
- as soon as the maximum number of lines is reached, create another file and start writing again
I'm wondering how to put all these steps together in a memory-efficient and fast way. I've seen some examples on Stack Overflow, but none of them quite does what I need. I would really appreciate it if anyone could help me out.
Accepted answer by Andrey
This working solution uses the split command available in the shell. Since the author has already accepted the possibility of a non-Python solution, please do not downvote.
First, I created a test file with 1000M entries (15 GB) with:
awk 'BEGIN{for (i = 0; i < 1000000000; i++) {print "123.123.123.123"} }' > t.txt
Then I used split:
split --lines=30000000 --numeric-suffixes --suffix-length=2 t.txt t
It took 5 minutes to produce a set of 34 small files named t00 to t33. The first 33 files are 458 MB each, and the last one, t33, is 153 MB.
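The file count and sizes reported above can be checked with a little arithmetic. The sketch below assumes each line is the 15-character test string plus a one-byte newline (16 bytes) and that the reported "MB" means 2**20 bytes. Both are assumptions, not statements from the answer, and the function name split_summary is only illustrative.

```python
import math

TOTAL_LINES = 1_000_000_000      # lines written by the awk command
LINES_PER_FILE = 30_000_000      # value passed to split --lines
BYTES_PER_LINE = len("123.123.123.123\n")  # 16, assuming a one-byte newline
MIB = 2 ** 20                    # assumed unit behind the reported "MB"


def split_summary(total_lines, lines_per_file, bytes_per_line):
    """Return (number of files, bytes in a full file, bytes in the last file)."""
    n_files = math.ceil(total_lines / lines_per_file)
    last_lines = total_lines - (n_files - 1) * lines_per_file
    return n_files, lines_per_file * bytes_per_line, last_lines * bytes_per_line


n_files, full_bytes, last_bytes = split_summary(
    TOTAL_LINES, LINES_PER_FILE, BYTES_PER_LINE)
# Number of files, size of a full file and size of the last file (in MiB)
print(n_files, round(full_bytes / MIB), round(last_bytes / MIB))
```

Under these assumptions the numbers agree with the answer: 34 files, about 458 MiB for each full file and about 153 MiB for the last one.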
Answered by tommy.carstensen
I would use the Unix utility split if it is available to you and your only task is to split the file. Here, however, is a pure Python solution:
import contextlib

file_large = 'large_file.txt'
l = 30*10**6  # lines per split file
with contextlib.ExitStack() as stack:
    fd_in = stack.enter_context(open(file_large))
    for i, line in enumerate(fd_in):
        if not i % l:
            # start a new output file every l lines
            file_split = '{}.{}'.format(file_large, i//l)
            fd_out = stack.enter_context(open(file_split, 'w'))
        # each line already ends with its newline, so write it unchanged
        fd_out.write(line)
If every line consists of four 3-digit numbers (so that all lines have the same length) and you have multiple cores available, you can exploit file seeking and run multiple processes.
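A minimal sketch of the seek-based idea, assuming every line is exactly 16 bytes long ('xxx.xxx.xxx.xxx' plus a one-byte newline). The helper names copy_part and part_count are hypothetical and not part of the answer. Since every line has the same length, the byte offset of any part can be computed directly, so each part can be copied without reading the ones before it.

```python
import os

LINE_LEN = 16  # assumed: 'xxx.xxx.xxx.xxx' plus a one-byte newline


def copy_part(src_path, dst_path, part_index, lines_per_part,
              line_len=LINE_LEN, buf_size=1 << 20):
    """Copy one part of src_path to dst_path by seeking to its byte offset."""
    start = part_index * lines_per_part * line_len
    remaining = lines_per_part * line_len
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        src.seek(start)
        while remaining > 0:
            buf = src.read(min(buf_size, remaining))
            if not buf:  # end of file: the last part may be shorter
                break
            dst.write(buf)
            remaining -= len(buf)


def part_count(src_path, lines_per_part, line_len=LINE_LEN):
    """Number of parts needed to cover the whole file."""
    part_bytes = lines_per_part * line_len
    return -(-os.path.getsize(src_path) // part_bytes)  # ceiling division
```

Because each call depends only on its part index, the calls could be handed to separate worker processes, for example with concurrent.futures.ProcessPoolExecutor. Whether that is actually faster depends on the storage device, since all workers read from the same file.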
Answered by log0
from itertools import chain, islice

def chunks(iterable, n):
    "chunks(ABCDE,2) => AB CD E"
    iterable = iter(iterable)
    while True:
        # store one line in memory,
        # chain it to an iterator on the rest of the chunk
        try:
            first = next(iterable)
        except StopIteration:
            # input exhausted; since Python 3.7 a StopIteration escaping
            # a generator would be turned into a RuntimeError
            return
        yield chain([first], islice(iterable, n-1))

l = 30*10**6  # lines per split file
file_large = 'large_file.txt'
with open(file_large) as bigfile:
    for i, lines in enumerate(chunks(bigfile, l)):
        file_split = '{}.{}'.format(file_large, i)
        with open(file_split, 'w') as f:
            f.writelines(lines)
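To see how the chunks generator behaves, it can be run on a small in-memory list instead of a file. This sketch restates the generator with an explicit guard for the end of input (needed on Python 3.7 and later), and the sample data is made up for illustration.

```python
from itertools import chain, islice


def chunks(iterable, n):
    """Yield successive lazy chunks of up to n items from iterable."""
    iterator = iter(iterable)
    while True:
        try:
            first = next(iterator)
        except StopIteration:
            return  # no more items: stop the generator cleanly
        # Only `first` is held in memory; the rest of the chunk is read lazily.
        yield chain([first], islice(iterator, n - 1))


lines = ['A\n', 'B\n', 'C\n', 'D\n', 'E\n']  # stand-in for lines of a file
result = [''.join(chunk) for chunk in chunks(lines, 2)]
print(result)
```

Each chunk shares the underlying iterator, so a chunk has to be consumed completely (here by join, in the answer by writelines) before the next one is requested.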
Answered by Saeed Zahedian Abroodi
This class may solve your problem. I've tested it on Linux and Windows, and it worked perfectly on both. I've also tested it with binary and text files of different sizes each time, and it worked well. Enjoy :)
import os
import math


class FileSpliter:
    # If file type is text then CHUNK_SIZE is a count of chars
    # If file type is binary then CHUNK_SIZE is a count of bytes
    def __init__(self, InputFile, FileType="b", CHUNK_SIZE=524288, OutFile="outFile"):
        self.CHUNK_SIZE = CHUNK_SIZE  # byte or char
        self.InputFile = InputFile
        self.FileType = FileType  # b: binary, t: text
        self.OutFile = OutFile
        self.FileSize = 0
        self.Parts = None
        self.CurrentPartNo = 0
        self.Progress = 0.0

    def Prepare(self):
        if not (os.path.isfile(self.InputFile) and os.path.getsize(self.InputFile) > 0):
            print("ERROR: The file does not exist or is empty!")
            return False
        self.FileSize = os.path.getsize(self.InputFile)
        if self.CHUNK_SIZE >= self.FileSize:
            self.Parts = 1
        else:
            self.Parts = math.ceil(self.FileSize / self.CHUNK_SIZE)
        return True

    def Split(self):
        if self.FileSize == 0 or self.Parts is None:
            print("ERROR: File is not prepared for split!")
            return False
        if self.FileType not in ("b", "t"):
            # bail out early instead of continuing with an undefined buffer
            print("ERROR: File type error!")
            return False
        with open(self.InputFile, "r" + self.FileType) as f:
            while True:
                # reads bytes in binary mode and chars in text mode
                buf = f.read(self.CHUNK_SIZE)
                if not buf:
                    # we've read the entire file in, so we're done.
                    break
                of = self.OutFile + str(self.CurrentPartNo)
                with open(of, "w" + self.FileType) as outFile:
                    outFile.write(buf)
                self.CurrentPartNo += 1
                self.ProgressBar()
        return True

    def Rebuild(self):
        self.CurrentPartNo = 0
        if self.Parts is None:
            return False
        with open(self.OutFile, "w" + self.FileType) as f:
            while self.CurrentPartNo < self.Parts:
                If = self.OutFile + str(self.CurrentPartNo)
                if not (os.path.isfile(If) and os.path.getsize(If) > 0):
                    print("ERROR: The file [" + If + "] does not exist or is empty!")
                    return False
                with open(If, "r" + self.FileType) as InputFile:
                    buf = InputFile.read()
                if not buf:
                    # nothing left to read, so we're done.
                    break
                f.write(buf)
                os.remove(If)
                self.CurrentPartNo += 1
                self.ProgressBar()
        return True

    def ProgressBar(self, BarLength=20, ProgressIcon="#", BarIcon="-"):
        try:
            # You can't have a progress bar with zero or negative length.
            if BarLength < 1:
                BarLength = 20
            # Use status variable for going to the next line after progress completion.
            Status = ""
            # Calculating progress between 0 and 1 for percentage.
            self.Progress = float(self.CurrentPartNo) / float(self.Parts)
            # Clamp and finish the line once the progress is complete.
            if self.Progress >= 1.:
                self.Progress = 1
                Status = "\r\n"  # Going to the next line
            # Calculating how many places should be filled
            Block = int(round(BarLength * self.Progress))
            # Show this
            Bar = "\r[{}] {:.0f}% {}".format(ProgressIcon * Block + BarIcon * (BarLength - Block), round(self.Progress * 100, 0), Status)
            print(Bar, end="")
        except Exception:
            print("\rERROR")


def main():
    fp = FileSpliter(InputFile="inFile", FileType="b")  # , CHUNK_SIZE=300000)
    if fp.Prepare():
        # Splitting ...
        print("Splitting ...")
        sr = fp.Split()
        if sr:
            print("The file was split successfully.")
        print()
        # Rebuilding ...
        print("Rebuilding ...")
        rr = fp.Rebuild()
        if rr:
            print("The file was rebuilt successfully.")


if __name__ == "__main__":
    main()
Answered by Jyo the Whiff
This is a Python 3 solution that I usually use to split files whose size is in the MB range.
However, I have not yet tried it on files whose size is in the GB range.
TextFileSplitter.py
import traceback

# get the name of the file to be read
fileToRead = input("Enter file name : ")
# max lines you want to write in a single file
fileLineCount = 2000
lineCount = 0
fileCount = 1
fileReader = None
fileWriter = None
try:
    print('Start splitting...')
    # open the input file
    fileReader = open(fileToRead)
    line = fileReader.readline()
    while line != '':  # an empty string means EOF
        if lineCount == 0:
            # create a file in append mode
            fileWriter = open(str(fileCount) + ".txt", "a")
            # increment file count, use it for the next file name
            fileCount += 1
        # write the line (it already ends with a newline)
        fileWriter.write(line)
        lineCount += 1
        if lineCount == fileLineCount:
            lineCount = 0
            fileWriter.close()
        # read the next line
        line = fileReader.readline()
    if fileWriter is not None:
        fileWriter.close()
except Exception:
    # print the exception, if any
    traceback.print_exc()
finally:
    # close the file reader if it was opened
    if fileReader is not None:
        fileReader.close()
The output will be a set of files, each containing fileLineCount (i.e. 2000) lines, created in the same directory:
1.txt
2.txt
3.txt
.
.
.
.
n.txt