string 字符串相似度算法?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/3576211/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
String similarity algorithms?
提问by TheFlash
I need to compare 2 strings and calculate their similarity, to filter down a list of the most similar strings.
我需要比较 2 个字符串并计算它们的相似度,以筛选出最相似的字符串列表。
Eg. searching for "dog" would return
例如。搜索“狗”将返回
- dog
- doggone
- bog
- fog
- foggy
- 狗
- 狗哥
- 沼泽
- 多雾路段
- 有雾的
Eg. searching for "crack" would return
例如。搜索“裂缝”将返回
- crack
- wisecrack
- rack
- Hyman
- quack
- 裂缝
- 俏皮话
- 架子
- Hyman
- 嘎嘎
I have come across:
我遇到过:
Do you know of any more string similarity algorithms?
你知道更多的字符串相似度算法吗?
采纳答案by Mojo Risin
It seems you are needing some kind of fuzzy matching. Here is java implementation of some set of similarity metrics http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. Here is more detailed explanation of string metrics http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdfit depends on how fuzzy and how fast your implementation must be.
看来您需要某种模糊匹配。这是一些相似性度量的 java 实现http://www.dcs.shef.ac.uk/~sam/stringmetrics.html。这里是字符串度量的更详细解释http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf这取决于你的实现必须有多模糊和多快。
回答by Peter
The Levenshteindistance is the algorithm I would recommend. It calculates the minimum number of operations you must do to change 1 string into another. The fewer changes means the strings are more similar...
在莱文斯坦距离算法,我会建议。它计算将 1 个字符串更改为另一个字符串必须执行的最小操作数。变化越少意味着字符串越相似......
回答by e2-e4
If the focus is on performance, I would implement an algorithm based on a trie
structure
(works well to find words in a text, or to help correct a word, but in your case you can find quickly all words containing a given word or all but one letter, for instance).
如果重点是性能,我将实现基于trie
结构的算法
(在文本中查找单词或帮助纠正单词效果很好,但在您的情况下,您可以快速找到包含给定单词的所有单词或所有单词例如,一封信)。
Please follow first the wikipedia link above.Tries
is the fastest words sorting method (nwords, search s, O(n) to create the trie, O(1) to search s(or if you prefer, if ais the average length, O(an) for the trie and O(s) for the search)).
请首先按照上面的维基百科链接进行操作。Tries
是最快的单词排序方法(n 个单词,搜索s,O( n) 来创建特里树,O(1) 来搜索s(或者,如果您愿意,如果a是平均长度,则 O( an) 为特里树和O( s) 用于搜索))。
A fast and easy implementation (to be optimized) of your problem (similar words) consists of
您的问题(相似词)的快速简便实施(待优化)包括
- Make the triewith the list of words, having all letters indexed front and back (see example below)
- To search s, iterate from s[0] to find the word in the trie, then s[1] etc...
- In the trie, if the number of letters found is len(s)-kthe word is displayed, where kis the tolerance (1 letter missing, 2...).
- The algorithm may be extended to the words in the list (see below)
- 让特里与单词列表,让所有字母索引的正面和背面(见下面的例子)
- 要搜索s,请从s[0]迭代以查找树中的单词,然后是 s[1] 等...
- 在树中,如果找到的字母数为 len( s)- k则显示该词,其中k是容差(1 个字母缺失,2 个...)。
- 该算法可以扩展到列表中的单词(见下文)
Example, with the words car
, vars
.
例如,单词car
, vars
。
Building the trie (big letter means a word end here, while another may continue). The >
is post-index (go forward) and <
is pre-index (go backward). In another example we may have to indicate also the starting letter, it is not presented here for clarity.
The <
and >
in C++ for instance would be Mystruct *previous,*next
, meaning from a > c < r
, you can go directly from a
to c
, and reversely, also from a
to R
.
构建trie(大字母表示一个词在这里结束,而另一个可能继续)。本>
是后指数(向前),并<
为预指数(留后路)。在另一个示例中,我们可能还必须指示起始字母,为了清楚起见,此处未显示。例如,C++ 中
的<
and>
是Mystruct *previous,*next
,意思是 from a > c < r
,你可以直接从a
to c
,反过来,也可以从a
to R
。
1. c < a < R
2. a > c < R
3. > v < r < S
4. R > a > c
5. > v < S
6. v < a < r < S
7. S > r > a > v
Looking strictly for carthe trie gives you access from 1., and you find car(you would have found also everything starting with car, but also anything with car inside - it is not in the example - but vicarfor instance would have been found from c > i > v < a < R
).
严格查找汽车,trie 可以让您从 1. 访问,然后您会找到汽车(您还可以找到以car开头的所有内容,以及里面包含汽车的任何内容-示例中没有-但例如会找到vicar来自c > i > v < a < R
)。
To search while allowing 1-letter wrong/missing tolerance, you iterate from each letter of s, and, count the number of consecutive - or by skipping 1 letter - letters you get from sin the trie.
要在允许 1 个字母错误/缺失容限的同时进行搜索,您可以从s 的每个字母进行迭代,并计算连续的数量 - 或者跳过 1 个字母 - 您从树中的s获得的字母。
looking for car
,
寻找car
,
c
: searching the trie forc < a
andc < r
(missing letter in s). To accept a wrong letter in a word w, try to jump at each iteration the wrong letter to see ifar
is behind, this is O(w). With two letters, O(w2) etc... but another level of index could be added to the trie to take into account the jumpover letters - making the trie complex, and greedy regarding memory.a
, thenr
: same as above, but searching backwards as well
c
:在树中搜索c < a
和c < r
(s 中缺少字母)。要接受单词w 中的错误字母,请尝试在每次迭代中跳出错误的字母以查看是否ar
在后面,这是 O( w)。有两个字母,O( w2) 等等……但是可以将另一个级别的索引添加到 trie 中以考虑跳过字母 - 使 trie 复杂,并且对内存贪婪。a
, thenr
: 同上,但也向后搜索
This is just to provide an idea about the principle - the example above may have some glitches (I'll check again tomorrow).
这只是提供一个关于原理的想法 - 上面的例子可能有一些小故障(我明天再检查)。
回答by Gumbo
You could do this:
你可以这样做:
Foreach string in haystack Do offset := -1; matchedCharacters := 0; Foreach char in needle Do offset := PositionInString(string, char, offset+1); If offset = -1 Then Break; End; matchedCharacters := matchedCharacters + 1; End; If matchedCharacters > 0 Then // (partial) match found End; End;
With matchedCharactersyou can determine the “degree” of the match. If it is equal to the length of needle, all characters in needleare also in string. If you also store the offset of the first matched character, you can also sort the result by the “density” of the matched characters by subtracting the offset of the first matched character from the offset of the last matched character offset; the lower the difference, the more dense the match.
使用matchedCharacters,您可以确定匹配的“程度”。如果它等于长度针,在所有字符针也是字符串。如果你还存储了第一个匹配字符的偏移量,你也可以通过从最后一个匹配字符的偏移量中减去第一个匹配字符的偏移量,按匹配字符的“密度”对结果进行排序;差异越小,匹配越密集。
回答by Indrit Kello
class Program {
static int ComputeLevenshteinDistance(string source, string target) {
if ((source == null) || (target == null)) return 0;
if ((source.Length == 0) || (target.Length == 0)) return 0;
if (source == target) return source.Length;
int sourceWordCount = source.Length;
int targetWordCount = target.Length;
int[,] distance = new int[sourceWordCount + 1, targetWordCount + 1];
// Step 2
for (int i = 0; i <= sourceWordCount; distance[i, 0] = i++);
for (int j = 0; j <= targetWordCount; distance[0, j] = j++);
for (int i = 1; i <= sourceWordCount; i++) {
for (int j = 1; j <= targetWordCount; j++) {
// Step 3
int cost = (target[j - 1] == source[i - 1]) ? 0 : 1;
// Step 4
distance[i, j] = Math.Min(Math.Min(distance[i - 1, j] + 1, distance[i, j - 1] + 1), distance[i - 1, j - 1] + cost);
}
}
return distance[sourceWordCount, targetWordCount];
}
static void Main(string[] args){
Console.WriteLine(ComputeLevenshteinDistance ("Stackoverflow","StuckOverflow"));
Console.ReadKey();
}
}