Fastest Way to Delete a Line from Large File in Python
Original question: http://stackoverflow.com/questions/2329417/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverFlow.
Asked by AJ.
I am working with a very large (~11GB) text file on a Linux system. I am running it through a program which is checking the file for errors. Once an error is found, I need to either fix the line or remove the line entirely. And then repeat...
Eventually once I'm comfortable with the process, I'll automate it entirely. For now however, let's assume I'm running this by hand.
What would be the fastest (in terms of execution time) way to remove a specific line from this large file? I thought of doing it in Python...but would be open to other examples. The line might be anywhere in the file.
If Python, assume the following interface:
def removeLine(filename, lineno):
Thanks,
-aj
Accepted answer by K. Brafford
You can have two file objects for the same file at the same time (one for reading, one for writing):
def removeLine(filename, lineno):
    # lineno is zero-based here: lineno lines are skipped, the next one is dropped
    fro = open(filename, "rb")

    current_line = 0
    while current_line < lineno:
        fro.readline()
        current_line += 1

    seekpoint = fro.tell()
    frw = open(filename, "r+b")
    frw.seek(seekpoint, 0)

    # read the line we want to discard
    fro.readline()

    # now move the rest of the lines in the file
    # one line back
    chars = fro.readline()
    while chars:
        # write(), not writelines(): iterating a bytes object yields ints in Python 3
        frw.write(chars)
        chars = fro.readline()

    fro.close()
    frw.truncate()
    frw.close()
Answer by John La Rooy
Modify the file in place: the offending line is replaced with spaces, so the remainder of the file does not need to be shuffled around on disk. You can also "fix" the line in place if the fix is no longer than the line you are replacing.
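import os
from mmap import mmap

def removeLine(filename, lineno):
    f = os.open(filename, os.O_RDWR)
    m = mmap(f, 0)
    p = 0
    # lineno is 1-based: skip lineno-1 newlines to reach the start of the line
    for i in range(lineno - 1):
        p = m.find(b'\n', p) + 1
    q = m.find(b'\n', p)
    if q == -1:
        # last line has no trailing newline
        q = len(m)
    # mmap works on bytes in Python 3, so search for and write bytes
    m[p:q] = b' ' * (q - p)
    m.close()
    os.close(f)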
If the other program can be changed to output the file offset instead of the line number, you can assign the offset to p directly and do without the for loop.
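A minimal sketch of the offset variant described above, assuming the checking program reports the byte offset at which the bad line starts. The helper name `blank_line_at_offset` is hypothetical and not part of the original answer.

```python
import mmap

def blank_line_at_offset(filename, offset):
    """Overwrite the line that starts at byte `offset` with spaces.

    Hypothetical helper: `offset` is assumed to point at the first
    byte of the offending line.
    """
    with open(filename, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:
            end = m.find(b"\n", offset)
            if end == -1:
                # the line is the last one and has no trailing newline
                end = len(m)
            # same-length replacement, so nothing after the line moves
            m[offset:end] = b" " * (end - offset)
```

Because the replacement has exactly the same length as the original line, only the pages containing that line are touched.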
Answer by Justin Peel
As far as I know, you can't just open a text file with Python and remove a line. You have to make a new file and copy everything but that line to it. If you know the specific line, then you would do something like this:
# linenumtoremove is the 1-based number of the line to drop
f = open('in.txt')
fo = open('out.txt','w')

ind = 1
for line in f:
    if ind != linenumtoremove:
        fo.write(line)
    ind += 1

f.close()
fo.close()
You could of course check the contents of the line instead to determine whether you want to keep it. I also recommend that, if you have a whole list of lines to be removed or changed, you make all those changes in one pass through the file.
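The one-pass suggestion could look roughly like this. It is only a sketch: the function name `remove_lines` and the `.new` scratch-file suffix are assumptions, and line numbers are taken to be 1-based.

```python
import os

def remove_lines(filename, linenos):
    """Drop every 1-based line number in `linenos` in a single pass."""
    unwanted = set(linenos)          # set lookup keeps the per-line check cheap
    tmpname = filename + ".new"      # assumed scratch file name
    with open(filename, "rb") as src, open(tmpname, "wb") as dst:
        for i, line in enumerate(src, 1):
            if i not in unwanted:
                dst.write(line)
    # swap the filtered copy in place of the original
    os.replace(tmpname, filename)
```

For an 11GB file this still reads and writes the whole file, but it does so once no matter how many lines are dropped.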
Answer by Dancrumb
If the lines are variable length then I don't believe that there is a better algorithm than reading the file line by line and writing out all lines, except for the one(s) that you do not want.
You can identify these lines by checking some criteria, or by keeping a running tally of lines read and suppressing the writing of the line(s) that you do not want.
If the lines are fixed length and you want to delete specific line numbers, then you may be able to use seek to move the file pointer... I doubt you're that lucky though.
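For the fixed-length case, a sketch under the assumption that every line occupies exactly `record_len` bytes including its newline. The function name and parameters are hypothetical. The start of the line is computed rather than searched for, and the tail of the file is then shifted forward in large blocks.

```python
import os

def remove_fixed_line(filename, lineno, record_len, block_size=1 << 20):
    """Remove 1-based line `lineno` from a file of fixed-length lines."""
    write_pos = (lineno - 1) * record_len   # where the doomed line starts
    read_pos = write_pos + record_len       # where the following line starts
    with open(filename, "r+b") as f:
        f.seek(0, os.SEEK_END)
        if lineno < 1 or read_pos > f.tell():
            raise ValueError("line number out of range")
        while True:
            f.seek(read_pos)
            block = f.read(block_size)
            if not block:
                break
            # copy the block one record earlier in the file
            f.seek(write_pos)
            f.write(block)
            read_pos += len(block)
            write_pos += len(block)
        # the file is now one record shorter
        f.truncate(write_pos)
```

Seeking avoids scanning the part of the file before the line, but everything after it still has to be rewritten.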
Answer by Mark Byers
Update: a solution using sed, as requested by the poster in a comment.
For example, to delete the second line of a file:
sed '2d' input.txt
Use the -i switch to edit in place. Warning: this is a destructive operation. Read the help for this command for information on how to make a backup automatically.
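If you would rather stay in Python, the standard library's `fileinput` module offers a comparable edit-in-place mode with an automatic backup. This is a sketch, not a drop-in replacement for sed; the function name and the `.bak` suffix are assumptions.

```python
import fileinput

def remove_line_inplace(filename, lineno, backup=".bak"):
    """Delete 1-based line `lineno`, keeping a copy at filename + backup."""
    # With inplace=True, fileinput redirects standard output into the file,
    # so print() writes the lines that should be kept.
    with fileinput.input(filename, inplace=True, backup=backup) as lines:
        for line in lines:
            if lines.filelineno() != lineno:
                print(line, end="")
```

The whole file is still rewritten, and the backup needs as much disk space as the original.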
Answer by Matt Joiner
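import os

def removeLine(filename, lineno):
    # "in" is a reserved word in Python, so the file objects are named fin/fout
    fin = open(filename)
    fout = open(filename + ".new", "w")
    # lineno is 1-based
    for i, l in enumerate(fin, 1):
        if i != lineno:
            fout.write(l)
    fin.close()
    fout.close()
    os.rename(filename + ".new", filename)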
Answer by Heikki Toivonen
I think there was a somewhat similar, if not exactly the same, type of question asked here. Reading (and writing) line by line is slow, but you can read a bigger chunk into memory at once, go through it line by line skipping the lines you don't want, then write the result as a single chunk to a new file. Repeat until done. Finally, replace the original file with the new file.
The thing to watch out for is that when you read in a chunk, you need to deal with the last, potentially partial, line you read and prepend it to the next chunk you read.
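A rough sketch of the chunked approach, including the partial-line carry-over. The function name, the `.new` suffix and the default chunk size are assumptions, and the line number is 1-based.

```python
import os

def remove_line_chunked(filename, lineno, chunk_size=1 << 20):
    """Copy the file in large chunks, leaving out 1-based line `lineno`."""
    tmpname = filename + ".new"   # assumed scratch file name
    current = 1                   # number of the first line not yet written
    leftover = b""                # partial line carried into the next chunk
    with open(filename, "rb") as src, open(tmpname, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            data = leftover + chunk
            # split off the trailing partial line, if any
            cut = data.rfind(b"\n") + 1
            complete, leftover = data[:cut], data[cut:]
            n = complete.count(b"\n")
            if current <= lineno < current + n:
                # only the chunk holding the target line is split into lines
                lines = complete.split(b"\n")
                del lines[lineno - current]
                complete = b"\n".join(lines)
            current += n
            dst.write(complete)
        # a final line without a trailing newline is still pending here
        if leftover and current != lineno:
            dst.write(leftover)
    os.replace(tmpname, filename)
```

Only the one chunk that contains the target line is ever split into individual lines; every other chunk is written out untouched.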
Answer by ghostdog74
@OP, if you can use awk, e.g. assuming the line number is 10:
$ awk 'NR!=10' file > newfile
Answer by lpapp
I will provide two alternatives based on the look-up factor (line number or a search string):
Line number
def removeLine2(filename, lineNumber):
    # lineNumber is zero-based: lineNumber lines are skipped, the next one is dropped
    with open(filename, 'r+') as outputFile:
        with open(filename, 'r') as inputFile:

            currentLineNumber = 0
            while currentLineNumber < lineNumber:
                inputFile.readline()
                currentLineNumber += 1

            seekPosition = inputFile.tell()
            outputFile.seek(seekPosition, 0)

            # skip the line to be removed
            inputFile.readline()

            # copy the remaining lines over the removed one
            currentLine = inputFile.readline()
            while currentLine:
                outputFile.write(currentLine)
                currentLine = inputFile.readline()

        outputFile.truncate()
String
def removeLine(filename, key):
    # removes the first line that starts with the key wrapped in double quotes
    with open(filename, 'r+') as outputFile:
        with open(filename, 'r') as inputFile:
            seekPosition = 0
            currentLine = inputFile.readline()
            # stop at end of file too, otherwise a missing key loops forever
            while currentLine and not currentLine.strip().startswith('"%s"' % key):
                seekPosition = inputFile.tell()
                currentLine = inputFile.readline()

            if not currentLine:
                # key not found: leave the file unchanged
                return

            outputFile.seek(seekPosition, 0)

            currentLine = inputFile.readline()
            while currentLine:
                outputFile.write(currentLine)
                currentLine = inputFile.readline()

        outputFile.truncate()