Java 在流中搜索字符串的有效方法
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/846175/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Efficient way to search a stream for a string
提问by Alex Spurling
Let's suppose that have a stream of text (or Reader in Java) that I'd like to check for a particular string. The stream of text might be very large so as soon as the search string is found I'd like to return true and also try to avoid storing the entire input in memory.
假设有一个我想检查特定字符串的文本流(或 Java 中的 Reader)。文本流可能非常大,因此一旦找到搜索字符串,我就想返回 true 并尽量避免将整个输入存储在内存中。
Naively, I might try to do something like this (in Java):
天真地,我可能会尝试做这样的事情(在 Java 中):
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
char[] buffer = new char[1024];
int numCharsRead;
while((numCharsRead = reader.read(buffer)) > 0) {
if ((new String(buffer, 0, numCharsRead)).indexOf(searchString) >= 0)
return true;
}
return false;
}
Of course this fails to detect the given search string if it occurs on the boundary of the 1k buffer:
当然,如果给定的搜索字符串发生在 1k 缓冲区的边界上,则无法检测到它:
Search text: "stackoverflow"
Stream buffer 1: "abc.........stack"
Stream buffer 2: "overflow.......xyz"
搜索文本:“stackoverflow”
流缓冲区 1:“abc .......stack”
流缓冲区 2:“溢出.......xyz”
How can I modify this code so that it correctly finds the given search string across the boundary of the buffer but without loading the entire stream into memory?
如何修改此代码,使其在缓冲区边界上正确找到给定的搜索字符串,但不将整个流加载到内存中?
Edit:Note when searching a stream for a string, we're trying to minimise the number of reads from the stream(to avoid latency in a network/disk) and to keep memory usage constantregardless of the amount of data in the stream. Actual efficiency of the string matching algorithmis secondary but obviously, it would be nice to find a solution that used one of the more efficient of those algorithms.
编辑:注意在流中搜索字符串时,我们试图最小化从流中读取的次数(以避免网络/磁盘中的延迟)并保持内存使用量不变,而不管流中的数据量如何。字符串匹配算法的实际效率是次要的,但很明显,找到使用这些算法中效率更高的一种的解决方案会很好。
采纳答案by sw.
I did a few changes to the Knuth Morris Pratt algorithm for partial searches. Since the actual comparison position is always less or equal than the next one there is no need for extra memory. The code with a Makefile is also available on githuband it is written in Haxe to target multiple programming languages at once, including Java.
我对部分搜索的 Knuth Morris Pratt 算法做了一些更改。由于实际比较位置总是小于或等于下一个位置,因此不需要额外的内存。带有 Makefile 的代码也可以在github 上找到,它是用 Haxe 编写的,可以同时针对多种编程语言,包括 Java。
I also wrote a related article: searching for substrings in streams: a slight modification of the Knuth-Morris-Pratt algorithm in Haxe. The article mentions the Jakarta RegExp, now retired and resting in the Apache Attic. The Jakarta Regexp library “match” method in the RE class uses a CharacterIterator as a parameter.
我还写了一篇相关文章:在流中搜索子串:对 Haxe 中 Knuth-Morris-Pratt 算法的轻微修改。文章提到了Jakarta RegExp,现在已经退休并在 Apache Attic 中休息。RE 类中的 Jakarta Regexp 库“ match”方法使用 CharacterIterator 作为参数。
class StreamOrientedKnuthMorrisPratt {
var m: Int;
var i: Int;
var ss:
var table: Array<Int>;
public function new(ss: String) {
this.ss = ss;
this.buildTable(this.ss);
}
public function begin() : Void {
this.m = 0;
this.i = 0;
}
public function partialSearch(s: String) : Int {
var offset = this.m + this.i;
while(this.m + this.i - offset < s.length) {
if(this.ss.substr(this.i, 1) == s.substr(this.m + this.i - offset,1)) {
if(this.i == this.ss.length - 1) {
return this.m;
}
this.i += 1;
} else {
this.m += this.i - this.table[this.i];
if(this.table[this.i] > -1)
this.i = this.table[this.i];
else
this.i = 0;
}
}
return -1;
}
private function buildTable(ss: String) : Void {
var pos = 2;
var cnd = 0;
this.table = new Array<Int>();
if(ss.length > 2)
this.table.insert(ss.length, 0);
else
this.table.insert(2, 0);
this.table[0] = -1;
this.table[1] = 0;
while(pos < ss.length) {
if(ss.substr(pos-1,1) == ss.substr(cnd, 1))
{
cnd += 1;
this.table[pos] = cnd;
pos += 1;
} else if(cnd > 0) {
cnd = this.table[cnd];
} else {
this.table[pos] = 0;
pos += 1;
}
}
}
public static function main() {
var KMP = new StreamOrientedKnuthMorrisPratt("aa");
KMP.begin();
trace(KMP.partialSearch("ccaabb"));
KMP.begin();
trace(KMP.partialSearch("ccarbb"));
trace(KMP.partialSearch("fgaabb"));
}
}
回答by Tetha
Implement a sliding window. Have your buffer around, move all elements in the buffer one forward and enter a single new character in the buffer at the end. If the buffer is equal to your searched word, it is contained.
实现滑动窗口。准备好缓冲区,将缓冲区中的所有元素向前移动一个,最后在缓冲区中输入一个新字符。如果缓冲区等于您搜索的单词,则将其包含在内。
Of course, if you want to make this more efficient, you can look at a way to prevent moving all elements in the buffer around, for example by having a cyclic buffer and a representation of the strings which 'cycles' the same way the buffer does, so you only need to check for content-equality. This saves moving all elements in the buffer.
当然,如果您想让这更有效,您可以考虑一种防止移动缓冲区中所有元素的方法,例如通过具有循环缓冲区和以与缓冲区相同的方式“循环”的字符串的表示确实如此,所以您只需要检查内容是否相等。这节省了移动缓冲区中的所有元素。
回答by ChrisW
I think you need to buffer a small amount at the boundary between buffers.
我认为您需要在缓冲区之间的边界处缓冲少量。
For example if your buffer size is 1024 and the length of the SearchString is 10, then as well as searching each 1024-byte buffer you also need to search each 18-byte transition between two buffers (9 bytes from the end of the previous buffer concatenated with 9 bytes from the start of the next buffer).
例如,如果您的缓冲区大小为 1024 并且 SearchString 的长度为 10,那么除了搜索每个 1024 字节的缓冲区之外,您还需要搜索两个缓冲区之间的每个 18 字节转换(距前一个缓冲区末尾的 9 个字节)从下一个缓冲区开始与 9 个字节连接)。
回答by AlbertoPL
I'd say switch to a character by character solution, in which case you'd scan for the first character in your target text, then when you find that character increment a counter and look for the next character. Every time you don't find the next consecutive character restart the counter. It would work like this:
我会说切换到逐字符解决方案,在这种情况下,您将扫描目标文本中的第一个字符,然后当您发现该字符时增加一个计数器并查找下一个字符。每次找不到下一个连续字符时,重新启动计数器。它会像这样工作:
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
char[] buffer = new char[1024];
int numCharsRead;
int count = 0;
while((numCharsRead = reader.read(buffer)) > 0) {
if (buffer[numCharsRead -1] == searchString.charAt(count))
count++;
else
count = 0;
if (count == searchString.size())
return true;
}
return false;
}
The only problem is when you're in the middle of looking through characters... in which case there needs to be a way of remembering your count variable. I don't see an easy way of doing so except as a private variable for the whole class. In which case you would not instantiate count inside this method.
唯一的问题是当您正在查看字符时……在这种情况下,需要有一种方法来记住您的计数变量。除了作为整个班级的私有变量之外,我没有看到这样做的简单方法。在这种情况下,您不会在此方法中实例化计数。
回答by brabster
This answer applied to the initial version of the question where the key was to read the stream only as far as necessary to match on a String, if that String was present. This solution would not meet the requirement to guarantee fixed memory utilisation, but may be worth considering if you have found this question and are not bound by that constraint.
此答案适用于问题的初始版本,其中关键是仅在匹配字符串时读取流,前提是该字符串存在。此解决方案不符合保证固定内存利用率的要求,但如果您发现了此问题并且不受该约束的约束,则可能值得考虑。
If you are bound by the constant memory usage constraint, Java stores arrays of any type on the heap, and as such nulling the reference does not deallocate memory in any way; I think any solution involving arrays in a loop will consume memory on the heap and require GC.
如果您受到常量内存使用限制的约束,Java 会在堆上存储任何类型的数组,因此清空引用不会以任何方式释放内存;我认为任何涉及循环中数组的解决方案都会消耗堆上的内存并需要 GC。
For simple implementation, maybe Java 5's Scannerwhich can accept an InputStream and use a java.util.regex.Patternto search the input for might save you worrying about the implementation details.
对于简单的实现,也许 Java 5 的Scanner可以接受 InputStream 并使用java.util.regex.Pattern来搜索输入,这可能会让您不必担心实现细节。
Here's an example of a potential implementation:
这是一个潜在实现的示例:
public boolean streamContainsString(Reader reader, String searchString)
throws IOException {
Scanner streamScanner = new Scanner(reader);
if (streamScanner.findWithinHorizon(searchString, 0) != null) {
return true;
} else {
return false;
}
}
I'm thinking regex because it sounds like a job for a Finite State Automaton, something that starts in an initial state, changing state character by character until it either rejects the string (no match) or gets to an accept state.
我在考虑正则表达式,因为它听起来像是有限状态自动机的工作,它从初始状态开始,逐个字符地改变状态,直到它拒绝字符串(不匹配)或进入接受状态。
I think this is probably the most efficient matching logic you could use, and how you organize the reading of the information can be divorced from the matching logic for performance tuning.
我认为这可能是您可以使用的最有效的匹配逻辑,并且您可以将信息的读取方式与用于性能调整的匹配逻辑分开。
It's also how regexes work.
这也是正则表达式的工作方式。
回答by Alex
You can increase the speed of search for very large strings by using some string search algorithm
您可以通过使用一些字符串搜索算法来提高搜索非常大的字符串的速度
回答by deverton
If you're not tied to using a Reader, then you can use Java's NIO API to efficiently load the file. For example (untested, but should be close to working):
如果您不依赖于使用 Reader,那么您可以使用 Java 的 NIO API 来有效地加载文件。例如(未经测试,但应该接近工作):
public boolean streamContainsString(File input, String searchString) throws IOException {
Pattern pattern = Pattern.compile(Pattern.quote(searchString));
FileInputStream fis = new FileInputStream(input);
FileChannel fc = fis.getChannel();
int sz = (int) fc.size();
MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, sz);
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
CharBuffer cb = decoder.decode(bb);
Matcher matcher = pattern.matcher(cb);
return matcher.matches();
}
This basically mmap()'s the file to search and relies on the operating system to do the right thing regarding cache and memory usage. Note however that map() is more expensive the just reading the file in to a large buffer for files less than around 10 KiB.
这基本上是 mmap() 是要搜索的文件,并依赖于操作系统在缓存和内存使用方面做正确的事情。但是请注意,对于小于 10 KiB 的文件,map() 只是将文件读入大缓冲区的成本更高。
回答by deverton
If you're looking for a constant substring rather than a regex, I'd recommend Boyer-Moore. There's plenty of source code on the internet.
如果您正在寻找常量子字符串而不是正则表达式,我会推荐 Boyer-Moore。互联网上有大量的源代码。
Also, use a circular buffer, to avoid think too hard about buffer boundaries.
此外,使用循环缓冲区,以避免对缓冲区边界考虑太多。
Mike.
麦克风。
回答by Norman Ramsey
There are three good solutions here:
这里有三个很好的解决方案:
If you want something that is easy and reasonably fast, go with no buffer, and instead implement a simple nondeterminstic finite-state machine. Your state will be a list of indices into the string you are searching, and your logic looks something like this (pseudocode):
String needle; n = needle.length(); for every input character c do add index 0 to the list for every index i in the list do if c == needle[i] then if i + 1 == n then return true else replace i in the list with i + 1 end else remove i from the list end end end
This will find the string if it exists and you will never need a buffer.
Slightly more work but also faster: do an NFA-to-DFA conversion that figures out in advance what lists of indices are possible, and assign each one to a small integer. (If you read about string search on Wikipedia, this is called the powerset construction.) Then you have a single state and you make a state-to-state transition on each incoming character. The NFA you want is just the DFA for the string preceded with a state that nondeterministically either drops a character or tries to consume the current character. You'll want an explicit error state as well.
If you want something faster, create a buffer whose size is at least twice
n
, and user Boyer-Moore to compile a state machine fromneedle
. You'll have a lot of extra hassle because Boyer-Moore is not trivial to implement (although you'll find code online) and because you'll have to arrange to slide the string through the buffer. You'll have to build or find a circularbuffer that can 'slide' without copying; otherwise you're likely to give back any performance gains you might get from Boyer-Moore.
如果您想要一些简单且相当快的东西,请不要使用缓冲区,而是实现一个简单的非确定性有限状态机。您的状态将是您正在搜索的字符串的索引列表,您的逻辑看起来像这样(伪代码):
String needle; n = needle.length(); for every input character c do add index 0 to the list for every index i in the list do if c == needle[i] then if i + 1 == n then return true else replace i in the list with i + 1 end else remove i from the list end end end
如果它存在,这将找到字符串,您将永远不需要缓冲区。
工作稍多,但速度也更快:进行 NFA 到 DFA 的转换,提前确定哪些索引列表是可能的,并将每个索引分配给一个小整数。(如果您在 Wikipedia 上阅读有关字符串搜索的内容,这称为powerset 构造。)然后您有一个状态,并且您对每个传入的字符进行状态到状态的转换。您想要的 NFA 只是该字符串的 DFA,该字符串前面带有一个非确定性地删除字符或尝试使用当前字符的状态。您还需要一个明确的错误状态。
如果您想要更快的速度,请创建一个大小至少为 2 倍的缓冲区
n
,然后用户 Boyer-Moore 从needle
. 你会遇到很多额外的麻烦,因为 Boyer-Moore 实现起来并不容易(尽管你会在网上找到代码)并且因为你必须安排将字符串滑过缓冲区。您必须构建或找到一个可以“滑动”而无需复制的循环缓冲区;否则,您可能会放弃从 Boyer-Moore 获得的任何性能提升。
回答by Norman Ramsey
Instead of having your buffer be an array, use an abstraction that implements a circular buffer. Your index calculation will be buf[(next+i) % sizeof(buf)]
, and you'll have to be careful to full the buffer one-half at a time. But as long as the search string fits in half the buffer, you'll find it.
不要让你的缓冲区是一个数组,而是使用一个实现循环缓冲区的抽象。您的索引计算将为buf[(next+i) % sizeof(buf)]
,并且您必须小心地一次将缓冲区填满一半。但是只要搜索字符串能容纳一半的缓冲区,您就会找到它。
回答by Darius Bacon
The Knuth-Morris-Pratt search algorithmnever backs up; this is just the property you want for your stream search. I've used it before for this problem, though there may be easier ways using available Java libraries. (When this came up for me I was working in C in the 90s.)
该克努特莫里斯普拉特搜索算法从未备份; 这只是您想要用于流搜索的属性。我以前用它来解决这个问题,尽管使用可用的 Java 库可能有更简单的方法。(当我想到这一点时,我在 90 年代使用 C 语言。)
KMP in essence is a fast way to build a string-matching DFA, like Norman Ramsey's suggestion #2.
KMP 本质上是一种构建字符串匹配 DFA 的快速方法,就像 Norman Ramsey 的建议 #2 一样。