用 C# 代码/算法在文本中搜索术语
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/382263/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
C# Code/Algorithm to Search Text for Terms
提问by user47892
We have 5mb of typical text (just plain words). We have 1000 words/phrases to use as terms to search for in this text.
我们有 5mb 的典型文本(只是普通的单词)。我们有 1000 个单词/短语,用作在该文本中搜索的术语。
What's the most efficient way to do this in .NET (ideally C#)?
在 .NET(最好是 C#)中执行此操作的最有效方法是什么?
Our ideas include regex's (a single one, lots of them) plus even the String.Contains stuff.
我们的想法包括正则表达式(单个的,或者很多个),甚至还有 String.Contains 之类的东西。
The input is a 2mb to 5mb text string - all text. Multiple hits are good: for each term (of the 1000) that matches, we do want to know about it. Performance means total time to execute; we don't care about memory footprint. The current algorithm takes about 60+ seconds using naive string.Contains. We don't want 'cat' to match 'category' or even 'cats' (i.e. the entire term word must hit, no stemming).
输入是一个 2mb 到 5mb 的文本字符串,全是文本。多次命中是好的:对于(1000 个术语中)每一个匹配到的术语,我们都想知道。性能指的是整个执行时间,不关心内存占用。当前算法使用朴素的 string.Contains 大约需要 60 秒以上。我们不希望 'cat' 匹配到 'category' 甚至 'cats'(即整个术语单词必须命中,不做词干提取)。
We expect a <5% hit ratio in the text. The results would ideally just be the terms that matched (we don't need position or frequency just yet). We get a new 2-5mb string every 10 seconds, so we can't assume we can index the input. The 1000 terms are dynamic, although they change at a rate of about one change per hour.
我们期望文本中的命中率小于 5%。理想情况下,结果只是匹配的项(暂时不需要位置或频率)。我们每 10 秒得到一个新的 2-5mb 字符串,所以不能假设我们可以索引输入。1000 个术语是动态的,但其变化率约为每小时 1 次。
采纳答案by Mark Brackett
A naive string.Contains with 762 words (the final page) of War and Peace (3.13MB) runs in about 10s for me. Switching to 1000 GUIDs runs in about 5.5 secs.
用朴素的 string.Contains 在《战争与和平》(3.13MB)中搜索 762 个单词(取自最后一页),对我来说大约要 10 秒。换成 1000 个 GUID 大约需要 5.5 秒。
Regex.IsMatch found the 762 words (much of which were probably in earlier pages as well) in about .5 seconds, and ruled out the GUIDs in 2.5 seconds.
Regex.IsMatch 在大约 0.5 秒内找到了 762 个单词(其中大部分也可能在较早的页面中),并在 2.5 秒内排除了 GUID。
I'd suggest your problem lies elsewhere... or you just need some decent hardware.
我建议你的问题出在别处……或者你只需要一些像样的硬件。
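For reference, a minimal benchmark sketch along the lines of this comparison; the file names, term list, and use of Stopwatch are my own assumptions, not part of the original test:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

class ContainsVsRegexBenchmark
{
    static void Main()
    {
        // Placeholder inputs: any large plain-text file and any list of terms will do.
        string text = File.ReadAllText("war-and-peace.txt");
        List<string> terms = new List<string>(File.ReadAllLines("terms.txt"));

        var sw = Stopwatch.StartNew();
        int containsHits = 0;
        foreach (string term in terms)
            if (text.Contains(term))
                containsHits++;
        Console.WriteLine($"string.Contains: {containsHits} hits in {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        int regexHits = 0;
        foreach (string term in terms)
            if (Regex.IsMatch(text, Regex.Escape(term)))
                regexHits++;
        Console.WriteLine($"Regex.IsMatch: {regexHits} hits in {sw.ElapsedMilliseconds} ms");
    }
}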
回答by Vilx-
A modified suffix tree would be very fast, though it would take up a lot of memory and I don't know how fast it would be to build. After that, however, every search would take O(1).
修改后的后缀树会非常快,尽管它会占用大量内存,而且我不知道构建它会有多快。不过在那之后,每次搜索都只需 O(1)。
回答by BenAlabaster
Are you trying to obtain a list of matched words, or are you trying to highlight them in the text by getting the start and length of each match? If all you're trying to do is find out whether the words exist, then you could use subset theory to perform this fairly efficiently.
您是想获得匹配单词的列表,还是试图在文本中突出显示它们以获取匹配位置的开始和长度?如果您要做的只是找出单词是否存在,那么您可以使用子集理论来相当有效地执行此操作。
However, I expect you're trying to get each match's start position in the text... in which case this approach wouldn't work.
但是,我猜您是想获取每个匹配项在文本中的起始位置……在这种情况下,这种方法将不起作用。
The most efficient approach I can think is to dynamically build a match pattern using a list and then use regex. It's far easier to maintain a list of 1000 items than it is to maintain a regex pattern based on those same 1000 items.
我能想到的最有效的方法是使用列表动态构建匹配模式,然后使用正则表达式。维护 1000 个项目的列表比维护基于 1000 个项目的正则表达式模式要容易得多。
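As a hedged illustration of building that pattern dynamically (the term list and variable names below are invented for the example; \b word boundaries are added so 'cat' does not match 'category'):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class PatternBuilder
{
    static void Main()
    {
        // Hypothetical term list; in practice this would be the 1000 dynamic terms.
        List<string> terms = new List<string> { "cat", "dog house", "mouse" };

        // Escape each term so regex metacharacters are treated literally,
        // wrap it in \b word boundaries, and join everything with alternation.
        string pattern = string.Join("|",
            terms.Select(t => @"\b" + Regex.Escape(t) + @"\b"));

        Regex regex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

        string text = "The cat sat by the dog house, far from the category aisle.";
        foreach (Match m in regex.Matches(text))
            Console.WriteLine($"{m.Value} at index {m.Index}");
    }
}

Maintaining the plain list of terms and regenerating the pattern whenever the terms change keeps the regex itself out of the maintenance picture, which is the point of this answer.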
It is my understanding that Regex uses the KMP algorithm suggested elsewhere in this thread for efficiently processing large amounts of data - so unless you really need to dig through and understand the minutiae of how it works (which might be beneficial for personal growth), then perhaps regex would be OK.
我的理解是,Regex 使用的就是本帖别处建议的那种 KMP 算法来高效处理大量数据,所以除非你真的需要深入研究并理解它的工作细节(这可能有利于个人成长),否则用正则表达式应该就可以了。
There's quite an interesting paper on search algorithms for many patterns in large files here: http://webglimpse.net/pubs/TR94-17.pdf
这里有一篇关于大文件中许多模式的搜索算法的有趣论文:http://webglimpse.net/pubs/TR94-17.pdf
回答by Vilx-
Here's another idea: Make a class something like this:
这是另一个想法:创建一个这样的类:
class Word
{
    public string Text;                           // the word itself
    public List<int> Positions = new List<int>(); // positions (counted in words) where it occurs
}
For every unique word in your text you create an instance of this class. The Positions list will store the positions (counted in words, not characters) from the start of the text where this word was found.
对于文本中的每个唯一单词,创建此类的一个实例。Positions 列表存储该单词出现的位置(以单词计数,而不是字符计数),从文本开头算起。
Then make another two lists which will serve as indexes. One will store all these instances sorted by their texts, the other by their positions in the text. In essence, the text index would probably be a SortedDictionary, while the position index would be a simple List<Word>.
然后再建两个列表作为索引。一个按文本排序存储所有这些实例,另一个按它们在文本中的位置存储。本质上,文本索引可能是一个 SortedDictionary,而位置索引则是一个简单的 List<Word>。
Then to search for a phrase, you split that phrase into words. Look up the first word in the Dictionary (that's O(log(n))). From there you know what are the possible words that follow it in the text (you have them from the Positions array). Look at those words (use the position index to find them in O(1)) and go on, until you've found one or more full matches.
然后要搜索一个短语,您将该短语拆分为单词。查找字典中的第一个单词(即 O(log(n)))。从那里你知道文本中跟在它后面的可能的词是什么(你从 Positions 数组中得到它们)。查看这些单词(使用位置索引在 O(1) 中找到它们)并继续,直到找到一个或多个完整匹配项。
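A rough, self-contained sketch of this index idea, assuming naive whitespace tokenization and ignoring punctuation and casing; the class and method names are my own, not from the answer:

using System;
using System.Collections.Generic;

class Word
{
    public string Text;
    public List<int> Positions = new List<int>();
}

class PhraseIndex
{
    // Text index: word -> its entry (O(log n) lookup; a plain Dictionary would also work).
    private readonly SortedDictionary<string, Word> byText = new SortedDictionary<string, Word>();
    // Position index: which word sits at each word position in the text.
    private readonly List<Word> byPosition = new List<Word>();

    public PhraseIndex(string text)
    {
        string[] tokens = text.Split(' ', '\t', '\r', '\n');
        foreach (string token in tokens)
        {
            if (token.Length == 0) continue;
            if (!byText.TryGetValue(token, out Word entry))
            {
                entry = new Word { Text = token };
                byText[token] = entry;
            }
            entry.Positions.Add(byPosition.Count);
            byPosition.Add(entry);
        }
    }

    // Returns true if the phrase occurs as a consecutive run of words in the text.
    public bool ContainsPhrase(string phrase)
    {
        string[] words = phrase.Split(' ');
        if (!byText.TryGetValue(words[0], out Word first)) return false;

        foreach (int start in first.Positions)
        {
            bool match = true;
            for (int i = 1; i < words.Length; i++)
            {
                int pos = start + i;
                if (pos >= byPosition.Count || byPosition[pos].Text != words[i])
                {
                    match = false;
                    break;
                }
            }
            if (match) return true;
        }
        return false;
    }
}

Usage would be along the lines of new PhraseIndex(bigText).ContainsPhrase("dog house"). Note that the index has to be rebuilt for every new 2-5mb string, which is the main cost of this approach.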
回答by dbones
Have you considered the following:
您是否考虑过以下问题:
Do you care about substrings? Let's say I am looking for the word "cat", nothing more and nothing less. Now consider the Knuth-Morris-Pratt algorithm, or string.Contains, run against "concatenate": both of these will return true (or an index). Is this OK?
Also, you will have to look into the idea of the stemmed or "finite" state of a word. Let's look for "diary" again; the test sentence is "there are many kinds of diaries". To you and me the word "diaries" is there, but does this count? If so, we will need to preprocess the sentence, converting the words to their stemmed form (diaries -> diary), so the sentence becomes "there are many kind of diary". Now we can say that "diary" is in the sentence (please look at the Porter Stemmer algorithm).
Also, when it comes to processing text (a.k.a. Natural Language Processing) you can remove some words as noise. Take, for example, "a, have, you, I, me, some, to" <- these could be considered useless words and removed before any processing takes place. For example:
你关心子串吗?假设我要找的是 "cat" 这个词,不多也不少。现在考虑 Knuth-Morris-Pratt 算法,或者对 "concatenate" 使用 string.Contains:两者都会返回 true(或一个索引)。这样可以吗?
此外,你还必须考虑单词的词干或"有限"状态。再以 "diary" 为例,测试句子是 "there are many kinds of diaries"。对你我来说句子里有 "diaries" 这个词,这算不算?如果算,我们需要对句子做预处理,把单词转换成词干形式(diaries -> diary),句子就变成 "there are many kind of diary"。现在我们可以说 "diary" 在句子中了(请参考 Porter Stemmer 算法)。
此外,在处理文本(也就是自然语言处理)时,你可以把一些词当作噪音去掉,例如 "a, have, you, I, me, some, to" <- 这些可以被视为无用的词,在任何处理之前先删除。例如:
"I have written some C# today", if i have 10,000 key works to look for I would have to scan the entire sentance 10,000 x the number of words in the sentance. removing noise before hand will shorting the processing time
“我今天写了一些 C#”,如果我有 10,000 个关键作品要查找,我将不得不扫描整个句子 10,000 x 句子中的单词数。事先去除噪音会缩短处理时间
"written C# today" <- removed noise, now there are lots less to look throught.
“今天写的 C#” <- 消除了噪音,现在可以看透的东西少了很多。
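A small sketch of this noise-removal step; the stop-word list is just the illustrative subset from the answer, not a real one:

using System;
using System.Collections.Generic;
using System.Linq;

class NoiseRemoval
{
    // Illustrative stop-word list; a real one would be much longer.
    static readonly HashSet<string> StopWords = new HashSet<string>(
        new[] { "a", "have", "you", "i", "me", "some", "to" },
        StringComparer.OrdinalIgnoreCase);

    static void Main()
    {
        string sentence = "I have written some C# today";

        // Keep only the words that are not noise, before any term matching runs.
        string[] kept = sentence
            .Split(' ')
            .Where(w => !StopWords.Contains(w))
            .ToArray();

        Console.WriteLine(string.Join(" ", kept)); // prints: written C# today
    }
}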
A great article on NLP can be found here: Sentence comparing.
可以在这里找到一篇关于 NLP 的精彩文章。句子比较
HTH
HTH
Bones
Bones
回答by Konrad Rudolph
Is this a bottleneck? How long does it take? 5 MiB isn't actually a lot of data to search in. Regular expressions might do just fine, especially if you encode all the search strings into one pattern using alternations. This basically amortizes the overall cost of the search to O(n + m), where n is the length of your text and m is the combined length of all patterns. Notice that this is very good performance.
这是瓶颈吗?需要多长时间?5 MiB 实际上并不算多的搜索数据量。正则表达式可能就够用了,特别是如果你用交替(alternation)把所有搜索字符串编码成一个模式。这基本上把搜索的总成本摊销为 O(n + m),其中 n 是文本的长度,m 是所有模式的总长度。请注意,这是非常好的性能。
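Not from this answer, but as an illustration of the same O(n + m) idea without regex: for single-word, whole-word terms, one pass that tokenizes the text and checks each token against a hash set of terms achieves a comparable amortized cost. The tokenization and names here are simplified assumptions:

using System;
using System.Collections.Generic;

class HashSetScan
{
    // Returns the subset of single-word terms that occur as whole words in the text.
    static HashSet<string> FindTerms(string text, IEnumerable<string> terms)
    {
        var wanted = new HashSet<string>(terms, StringComparer.OrdinalIgnoreCase);
        var found = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // Very rough tokenization on whitespace and common punctuation.
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '"', '\'', '(', ')' };
        foreach (string token in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (wanted.Contains(token))
                found.Add(token);
        }
        return found;
    }

    static void Main()
    {
        var hits = FindTerms("The cat chased cats into the category aisle.",
                             new[] { "cat", "dog" });
        Console.WriteLine(string.Join(", ", hits)); // only "cat"; "cats" and "category" don't count
    }
}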
An alternative that's well suited for many patterns is the Wu Manber algorithm. I've already posted a very simplistic C++ implementation of the algorithm.
非常适合多模式的一个替代方法是 Wu Manber 算法。我已经发布了该算法的一个非常简单的 C++ 实现。
回答by user47892
OK, the current rework shows this as fastest (pseudocode):
好的,目前重写后的版本表明下面这种是最快的(伪代码):
foreach (var term in allTerms)
{
    // Wrap the escaped term in \b word boundaries so only whole-word matches count.
    string pattern = @"\b" + Regex.Escape(term) + @"\b";
    Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
    if (regex.IsMatch(bigTextToSearchForTerms))
    {
        result.Add(term);
    }
}
What was surprising (to me at least!) is that running the regex 1000 times was faster than a single regex with 1000 alternatives, i.e. "\b term1 \b | \b term2 \b | \b termN \b", and then trying to use regex.Matches.Count.
令人惊讶的是(至少对我来说!),运行正则表达式 1000 次,比使用一个带有 1000 个备选分支的正则表达式(即 "\b term1 \b | \b term2 \b | \b termN \b")再尝试用 regex.Matches.Count 要快。
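Since the question states that the 1000 terms change only about once an hour while a new 2-5mb string arrives every 10 seconds, one possible refinement of this approach (my own sketch, not from the answer) is to build and cache the compiled per-term regexes once per term-list change and reuse them for every incoming text:

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class TermMatcher
{
    private readonly List<(string Term, Regex Pattern)> compiled;

    // Build once whenever the term list changes (roughly hourly, per the question).
    public TermMatcher(IEnumerable<string> terms)
    {
        compiled = terms
            .Select(t => (t, new Regex(@"\b" + Regex.Escape(t) + @"\b",
                                       RegexOptions.IgnoreCase | RegexOptions.Compiled)))
            .ToList();
    }

    // Run for every incoming 2-5 MB text (roughly every 10 seconds).
    public List<string> Match(string bigText)
    {
        return compiled
            .Where(p => p.Pattern.IsMatch(bigText))
            .Select(p => p.Term)
            .ToList();
    }
}

RegexOptions.Compiled trades a one-time compilation cost for faster matching, which only pays off because each regex is reused hundreds of times before the term list changes.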
回答by BenAlabaster
How does this perform in comparison? It uses LINQ, so it may be a little slower, not sure...
相比之下,这表现如何?它使用LINQ,所以可能会慢一点,不确定......
List<String> allTerms = new List<String>(new String[] { "string1", "string2", "string3", "string4" });
// ToList() is needed because Where() returns an IEnumerable<String>, not a List<String>.
List<String> matches = allTerms.Where(item => Regex.IsMatch(bigTextToSearchForTerms, item, RegexOptions.IgnoreCase)).ToList();
This uses classic predicates to implement the FIND method, so it should be quicker than LINQ:
这使用经典谓词来实现 FIND 方法,因此它应该比 LINQ 更快:
static bool Match(string checkItem)
{
    // bigTextToSearchForTerms is assumed to be a field visible to this static method.
    return Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase);
}

static void Main(string[] args)
{
    List<String> allTerms = new List<String>(new String[] { "string1", "string2", "string3", "string4" });
    List<String> matches = allTerms.FindAll(Match); // FindAll returns every match; Find would return only the first
}
Or this uses the lambda syntax to implement the classic predicate, which again should be faster than the LINQ, but is more readable than the previous syntax:
或者这使用 lambda 语法来实现经典谓词,这同样应该比 LINQ 更快,但比以前的语法更具可读性:
List<String> allTerms = new List<String>(new String[] { "string1", "string2", "string3", "string4" });
List<String> matches = allTerms.FindAll(checkItem => Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase));
I haven't tested any of them for performance, but they all implement your idea of iteration through the search list using the regex. It's just different methods of implementing it.
我没有测试它们中的任何一个的性能,但是它们都使用正则表达式通过搜索列表实现了您的迭代想法。这只是实现它的不同方法。
回答by Kent Boogaart
Why reinvent the wheel? Why not just leverage something like Lucene.NET?
为什么要重新发明轮子?为什么不直接利用像Lucene.NET这样的东西?