Cheap way to search a large text file for a string in Python
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow.
Original: http://stackoverflow.com/questions/3893885/
Cheap way to search a large text file for a string
Asked by iman453
I need to search a pretty large text file for a particular string. It's a build log with about 5000 lines of text. What's the best way to go about doing that? Using regex shouldn't cause any problems, should it? I'll go ahead and read blocks of lines and use a simple find.
Accepted answer by eumiro
If it is a "pretty large" file, then access the lines sequentially and don't read the whole file into memory:
with open('largeFile', 'r') as inF:
    for line in inF:
        if 'myString' in line:
            # do_something
            pass
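If you also want the line number of the match (handy for a build log), here is a minimal variant of the same sequential scan using enumerate; the file name and search string are placeholders:

with open('largeFile', 'r') as inF:
    for lineno, line in enumerate(inF, start=1):
        if 'myString' in line:
            print(lineno, line.rstrip())  # report the first hit and stop
            break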
Answer by JoshD
You could do a simple find:
with open('file.txt', 'r') as f:
    lines = f.read()
answer = lines.find('string')
A simple find will be quite a bit quicker than regex if you can get away with it.
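To verify that claim on your own log, here is a quick sketch using timeit (the file name and search string are placeholders):

import re
import timeit

with open('file.txt', 'r') as f:
    lines = f.read()
print(timeit.timeit(lambda: lines.find('string'), number=1000))        # plain substring search
print(timeit.timeit(lambda: re.search('string', lines), number=1000))  # regex equivalent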
Answer by Bitgamma
If there is no way to tell where the string will be (first half, second half, etc.), then there is really no optimized way to do the search other than the built-in "find" function. You could reduce the I/O time and memory consumption by not reading the file all in one shot, but in 4 KB blocks (usually the size of a hard-disk block). This will not make the search faster unless the string is in the first part of the file, but it will in any case reduce memory consumption, which might be a good idea if the file is huge.
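A minimal sketch of that block-wise idea (the function name and parameters are placeholders); note that a match straddling two blocks would be missed here, which the next answer fixes with an overlap:

def find_in_blocks(fname, s, bsize=4096):
    # s must be a bytes pattern, e.g. b'myString', since the file is opened in binary mode
    offset = 0
    with open(fname, 'rb') as f:
        while True:
            block = f.read(bsize)
            if not block:
                return -1            # end of file, no match
            pos = block.find(s)
            if pos >= 0:
                return offset + pos  # absolute byte offset of the match
            offset += len(block)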
Answer by laurasia
The following function works for text files and binary files (though it returns only the position as a byte count); it has the benefit of finding strings even when they span a line or buffer boundary and would therefore be missed by a line-wise or buffer-wise search.
import os

def fnd(fname, s, start=0):
    with open(fname, 'rb') as f:
        fsize = os.path.getsize(fname)
        bsize = 4096
        buffer = None
        if start > 0:
            f.seek(start)
        overlap = len(s) - 1
        while True:
            # step back so a match straddling two reads is still caught
            if (f.tell() >= overlap and f.tell() < fsize):
                f.seek(f.tell() - overlap)
            buffer = f.read(bsize)
            if buffer:
                pos = buffer.find(s)
                if pos >= 0:
                    # convert buffer-relative position to absolute file offset
                    return f.tell() - (len(buffer) - pos)
            else:
                return -1
The idea behind this is:
- seek to a start position in the file
- read from the file into a buffer (the search string has to be smaller than the buffer size), but if not at the beginning, drop back len(search string) - 1 bytes, to catch the string if it started at the end of the previously read buffer and continues into the next one
- return the position, or -1 if not found
I used something like this to find signatures of files inside larger ISO9660 files; it was quite fast and did not use much memory. You can also use a larger buffer to speed things up.
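For example, a call might look like this (build.log and the pattern are placeholders; under Python 3 the pattern must be bytes, because the file is opened in binary mode):

pos = fnd('build.log', b'ERROR')  # byte offset of the first occurrence, or -1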
Answer by John Conroy
5000 lines isn't big (well, depends on how long the lines are...)
Anyway: assuming the string will be a word and will be separated by whitespace...
lines = open(file_path, 'r').readlines()
str_wanted = "whatever_youre_looking_for"
for i in range(len(lines)):
    l1 = lines[i].split()
    for p in range(len(l1)):
        if l1[p] == str_wanted:
            # found: i is the line index, lines[i] is the full line, etc.
            pass
Answer by Martlark
I've had a go at putting together a multiprocessing example of file text searching. This is my first effort at using the multiprocessing module, and I'm a Python n00b. Comments quite welcome. I'll have to wait until I'm at work to test it on really big files. It should be faster on multi-core systems than single-core searching. Bleagh! How do I stop the processes once the text has been found, and reliably report the line number?
import multiprocessing, os, time

NUMBER_OF_PROCESSES = multiprocessing.cpu_count()

def FindText(host, file_name, text):
    file_size = os.stat(file_name).st_size
    m1 = open(file_name, "r")
    # work out file size to divide up to farm out line counting
    chunk = (file_size / NUMBER_OF_PROCESSES) + 1
    lines = 0
    line_found_at = -1
    seekStart = chunk * (host)
    seekEnd = chunk * (host + 1)
    if seekEnd > file_size:
        seekEnd = file_size
    if host > 0:
        m1.seek(seekStart)
        m1.readline()
    line = m1.readline()
    while len(line) > 0:
        lines += 1
        if text in line:
            # found the line
            line_found_at = lines
            break
        if m1.tell() > seekEnd or len(line) == 0:
            break
        line = m1.readline()
    m1.close()
    return host, lines, line_found_at

# Function run by worker processes
def worker(input, output):
    for host, file_name, text in iter(input.get, 'STOP'):
        output.put(FindText(host, file_name, text))

def main(file_name, text):
    t_start = time.time()
    # Create queues
    task_queue = multiprocessing.Queue()
    done_queue = multiprocessing.Queue()
    # submit file to open and text to find
    print 'Starting', NUMBER_OF_PROCESSES, 'searching workers'
    for h in range(NUMBER_OF_PROCESSES):
        t = (h, file_name, text)
        task_queue.put(t)
    # Start worker processes
    for _i in range(NUMBER_OF_PROCESSES):
        multiprocessing.Process(target=worker, args=(task_queue, done_queue)).start()
    # Get and print results
    results = {}
    for _i in range(NUMBER_OF_PROCESSES):
        host, lines, line_found = done_queue.get()
        results[host] = (lines, line_found)
    # Tell child processes to stop
    for _i in range(NUMBER_OF_PROCESSES):
        task_queue.put('STOP')
        # print "Stopping Process #%s" % i
    total_lines = 0
    for h in range(NUMBER_OF_PROCESSES):
        if results[h][1] > -1:
            print text, 'Found at', total_lines + results[h][1], 'in', time.time() - t_start, 'seconds'
            break
        total_lines += results[h][0]

if __name__ == "__main__":
    main(file_name='testFile.txt', text='IPI1520')
Answer by Javier
I'm surprised no one mentioned mapping the file into memory: mmap
With this you can access the file as if it were already loaded into memory, and the OS will take care of mapping it in and out as needed. Also, if you do this from 2 independent processes and they map the file "shared", they will share the underlying memory.
Once mapped, it will behave like a bytearray. You can use regular expressions, find or any of the other common methods.
Beware that this approach is a little OS specific. It will not be automatically portable.
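A minimal sketch of the mmap approach, assuming Python 3 (the file name and search string are placeholders):

import mmap

with open('largeFile', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    pos = mm.find(b'myString')  # byte offset of the first match, or -1
    mm.close()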
Answer by GrazingScientist
I like Javier's solution. I did not try it, but it sounds cool!
For reading through an arbitrarily large text, checking whether a string exists, and replacing it, you can use Flashtext, which is faster than regex for very large files.
Edit:
From the developer page:
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # keyword_processor.add_keyword(<unclean name>, <standardised name>)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
>>> # ['New York', 'Bay Area']
Or when extracting the offset:
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.', span_info=True)
>>> keywords_found
>>> # [('New York', 7, 16), ('Bay Area', 21, 29)]
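And since the answer above also mentions replacing strings, FlashText can rewrite keywords in one pass as well. A small sketch in the same style (the keywords here are illustrative placeholders, not from the question):

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('New Delhi', 'NCR region')
>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
>>> new_sentence
>>> # 'I love Big Apple and NCR region.'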
Answer by Graham
This is entirely inspired by laurasia's answer above, but it refines the structure.
It also adds some checks:
- It will correctly return 0 when searching an empty file for the empty string. In laurasia's answer, this is an edge case that will return -1.
- It also pre-checks whether the goal string is larger than the buffer size, and raises an error if this is the case.
In practice, the goal string should be much smaller than the buffer for efficiency, and there are more efficient methods of searching if the size of the goal string is very close to the size of the buffer.
def fnd(fname, goal, start=0, bsize=4096):
    if bsize < len(goal):
        raise ValueError("The buffer size must be larger than the string being searched for.")
    with open(fname, 'rb') as f:
        if start > 0:
            f.seek(start)
        overlap = len(goal) - 1
        while True:
            buffer = f.read(bsize)
            pos = buffer.find(goal)
            if pos >= 0:
                return f.tell() - len(buffer) + pos
            if not buffer:
                return -1
            f.seek(f.tell() - overlap)
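A usage sketch under the same assumptions (the file name and pattern are placeholders; the goal must be bytes because the file is opened in binary mode):

pos = fnd('build.log', b'IPI1520', bsize=65536)  # a larger buffer means fewer reads
if pos >= 0:
    print('found at byte offset', pos)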

