Python: Split a large text file (around 50GB) into multiple files

Disclaimer: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): http://stackoverflow.com/questions/22751000/


Split large text file (around 50GB) into multiple files

Tags: python, unix, python-2.7, split

Asked by saz

I would like to split a large text file of around 50GB into multiple files. The data in the file looks like this [x = any integer between 0-9]:


xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
...............
...............

There might be a few billion lines in the file, and I would like to write, for example, 30/40 million lines per file. I guess the steps would be:


  • I have to open the file
  • then, using readline(), read the file line by line while writing to a new file at the same time
  • and as soon as it hits the maximum number of lines, create another file and start writing again.

I'm wondering how to put all these steps together in a way that is memory-efficient and fast. I've seen some examples on Stack Overflow, but none of them does exactly what I need. I would really appreciate it if anyone could help me out.


Accepted answer by Andrey

This working solution uses the split command available in the shell. Since the author has already accepted the possibility of a non-Python solution, please do not downvote.


First, I created a test file with 1000M entries (15 GB):


awk 'BEGIN{for (i = 0; i < 1000000000; i++) {print "123.123.123.123"} }' > t.txt

Then I used split:


split --lines=30000000 --numeric-suffixes --suffix-length=2 t.txt t

It took 5 min to produce a set of 34 small files with names t00-t33. 33 of the files are 458 MB each and the last one, t33, is 153 MB.

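If you want to drive the same command from Python instead of typing it in the shell, a minimal sketch (assuming GNU coreutils' split is on the PATH) could be:

import subprocess

# Works on Python 2.7 and 3.x; assumes GNU coreutils' split is installed.
subprocess.check_call([
    'split', '--lines=30000000', '--numeric-suffixes',
    '--suffix-length=2', 't.txt', 't',
])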

Answer by tommy.carstensen

I would use the Unix utility split, if it is available to you and your only task is to split the file. Here, however, is a pure Python solution:


import contextlib

file_large = 'large_file.txt'
l = 30*10**6  # lines per split file
with contextlib.ExitStack() as stack:
    fd_in = stack.enter_context(open(file_large))
    for i, line in enumerate(fd_in):
        if not i % l:
            # start a new part file every l lines
            file_split = '{}.{}'.format(file_large, i//l)
            fd_out = stack.enter_context(open(file_split, 'w'))
        fd_out.write(line)  # the line already ends with '\n'
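Note that ExitStack keeps every output handle open until the with block exits, which is fine for a few dozen part files; if the split produced many thousands of parts, you would want to close each part file before opening the next.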

If all of your lines have four 3-digit numbers on them and you have multiple cores available, then you can exploit file seek and run multiple processes, as in the sketch below.

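Here is a minimal sketch of that idea (an illustration added here, not part of the original answer): it assumes every line is exactly 16 bytes ("xxx.xxx.xxx.xxx" plus a newline), so each worker can seek straight to its part's byte offset.

import os
from multiprocessing import Pool

FILE_LARGE = 'large_file.txt'    # hypothetical input name
LINE_BYTES = 16                  # 15 characters + '\n' (assumed fixed width)
LINES_PER_PART = 30 * 10**6
PART_BYTES = LINE_BYTES * LINES_PER_PART

def write_part(part_no):
    # Each worker seeks directly to its own byte offset; no line scanning.
    with open(FILE_LARGE, 'rb') as fd_in:
        fd_in.seek(part_no * PART_BYTES)
        data = fd_in.read(PART_BYTES)
    if data:
        with open('{}.{}'.format(FILE_LARGE, part_no), 'wb') as fd_out:
            fd_out.write(data)

if __name__ == '__main__':
    total_bytes = os.path.getsize(FILE_LARGE)
    n_parts = (total_bytes + PART_BYTES - 1) // PART_BYTES  # ceiling division
    with Pool() as pool:
        pool.map(write_part, range(n_parts))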

Answer by log0

from itertools import chain, islice

def chunks(iterable, n):
    """chunks('ABCDE', 2) => AB CD E"""
    iterable = iter(iterable)
    while True:
        try:
            head = next(iterable)
        except StopIteration:
            return  # PEP 479: end the generator instead of leaking StopIteration
        # keep one line in memory and chain it to an
        # iterator over the rest of the chunk
        yield chain([head], islice(iterable, n - 1))

l = 30*10**6
file_large = 'large_file.txt'
with open(file_large) as bigfile:
    for i, lines in enumerate(chunks(bigfile, l)):
        file_split = '{}.{}'.format(file_large, i)
        with open(file_split, 'w') as f:
            f.writelines(lines)
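A quick way to see what the helper yields, on a toy in-memory input:

for chunk in chunks('ABCDE', 2):
    print(''.join(chunk))  # prints AB, then CD, then E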

Answer by Saeed Zahedian Abroodi

This class may solve your problem. I've tested it on both Linux and Windows, and it worked perfectly on both. I've also tested binary and text files of different sizes each time, and it worked great. Enjoy :)


import os
import math

class FileSpliter:
    # If file type is text then CHUNK_SIZE is count of chars
    # If file type is binary then CHUNK_SIZE is count of bytes
    def __init__(self, InputFile, FileType="b", CHUNK_SIZE=524288, OutFile="outFile"):
        self.CHUNK_SIZE = CHUNK_SIZE    # byte or char
        self.InputFile = InputFile
        self.FileType = FileType        # b: binary,  t: text
        self.OutFile = OutFile
        self.FileSize = 0
        self.Parts = None
        self.CurrentPartNo = 0
        self.Progress = 0.0

    def Prepare(self):
        if not(os.path.isfile(self.InputFile) and os.path.getsize(self.InputFile) > 0):
            print("ERROR: The file is not exists or empty!")
            return False
        self.FileSize = os.path.getsize(self.InputFile)
        if self.CHUNK_SIZE >= self.FileSize:
            self.Parts = 1
        else:
            self.Parts = math.ceil(self.FileSize / self.CHUNK_SIZE)
        return True

    def Split(self):
        if self.FileSize == 0 or self.Parts is None:
            print("ERROR: File is not prepared for split!")
            return False        
        with open(self.InputFile, "r" + self.FileType) as f:
            while True:
                if self.FileType == "b":
                    buf = bytearray(f.read(self.CHUNK_SIZE))
                elif self.FileType == "t":
                    buf = f.read(self.CHUNK_SIZE)
                else:
                    print("ERROR: File type error!")
                    return False
                if not buf:
                    # we've read the entire file in, so we're done.
                    break
                of = self.OutFile + str(self.CurrentPartNo)
                outFile = open(of, "w" + self.FileType)
                outFile.write(buf)                              
                outFile.close()
                self.CurrentPartNo += 1 
                self.ProgressBar()
        return True

    def Rebuild(self):
        self.CurrentPartNo = 0
        if self.Parts is None:
            return False    
        with open(self.OutFile, "w" + self.FileType) as f:
            while self.CurrentPartNo < self.Parts:
                part_file = self.OutFile + str(self.CurrentPartNo)
                if not(os.path.isfile(part_file) and os.path.getsize(part_file) > 0):
                    print("ERROR: The file [" + part_file + "] does not exist or is empty!")
                    return False
                InputFile = open(part_file, "r" + self.FileType)
                buf = InputFile.read()
                if not buf:
                    # we've read the entire file in, so we're done.
                    break
                f.write(buf)
                InputFile.close()
                os.remove(part_file)
                self.CurrentPartNo += 1 
                self.ProgressBar()
        return True 

    def ProgressBar(self, BarLength=20, ProgressIcon="#", BarIcon="-"):
        try:
            # You can't have a progress bar with zero or negative length.
            if BarLength < 1:
                BarLength = 20
            # Use status variable for going to the next line after progress completion.
            Status = ""
            # Calculating progress between 0 and 1 for percentage.
            self.Progress = float(self.CurrentPartNo) / float(self.Parts)
            # Clamp the progress when the final part is reached.
            if self.Progress >= 1.:
                self.Progress = 1
                Status = "\r\n"    # Going to the next line             
            # Calculating how many places should be filled
            Block = int(round(BarLength * self.Progress))
            # Show this
            Bar = "\r[{}] {:.0f}% {}".format(ProgressIcon * Block + BarIcon * (BarLength - Block), round(self.Progress * 100, 0), Status)
            print(Bar, end="")
        except Exception:
            print("\rERROR")

def main():
    fp = FileSpliter(InputFile="inFile", FileType="b") #, CHUNK_SIZE=300000)
    if fp.Prepare():
        # Splitting ...
        print("Splitting ...")
        sr = fp.Split()
        if sr:
            print("The file was split successfully.")
        print()
        # Rebuilding ...
        print("Rebuilding ...")
        rr = fp.Rebuild()
        if rr:
            print("The file was rebuilt successfully.")

if __name__ == "__main__":
    main()
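One design note: unlike the line-based answers above, this class splits by a fixed number of bytes or characters (CHUNK_SIZE), so a part boundary can fall in the middle of a line. That is fine for binary payloads or for a later Rebuild(), but for the line-oriented data in the question the earlier approaches are a better fit.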

Answer by Jyo the Whiff

I am writing a Python 3 solution which I usually use to split files whose size is in MBs.


However, I have not yet tried it on files whose size is in GBs.


TextFileSplitter.py


import traceback

# get a file name to be read
fileToRead = input("Enter file name : ")

# max lines you want to write in a single file
fileLineCount = 2000
lineCount = 0
fileCount = 1
fileReader = None
fileWriter = None

try:
    print('Start splitting...')
    # open the input file
    fileReader = open(fileToRead)
    line = fileReader.readline()

    while line != '':  # empty string means EOF
        if lineCount == 0:
            # open the next output file in append mode
            fileWriter = open(str(fileCount) + ".txt", "a")
            # increment file count, use it for the next file name
            fileCount += 1
        # write the line (readline() keeps the trailing newline)
        fileWriter.write(line)
        lineCount += 1
        if lineCount == fileLineCount:
            lineCount = 0
            fileWriter.close()
        # read the next line
        line = fileReader.readline()

    if fileWriter is not None:
        fileWriter.close()

except Exception:
    # print the exception if any
    traceback.print_exc()
finally:
    # close the file reader
    if fileReader is not None:
        fileReader.close()

The output will be a set of files, each having fileLineCount (i.e. 2000) lines, created in the same directory:


1.txt
2.txt
3.txt
.
.
.
.
n.txt