什么是最好的自动完成/建议算法,数据结构 [C++/C]

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/1783652/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-27 21:09:21  来源:igfitidea点击:

What is the best autocomplete/suggest algorithm,datastructure [C++/C]

c++calgorithmsearchautocomplete

提问by subbul

We see Google, Firefox some AJAX pages show up a list of probable items while user types characters.

我们看到 Google、Firefox 的一些 AJAX 页面会在用户键入字符时显示可能的项目列表。

Can someone give a good algorithm, data structure for implementing autocomplete?

有人可以给出一个很好的算法,实现自动完成的数据结构吗?

回答by Glen

A trieis a data structure that can be used to quickly find words that match a prefix.

一个线索是可以用来快速找到匹配前缀词的数据结构。

Edit: Here's an example showing how to use one to implement autocomplete http://rmandvikar.blogspot.com/2008/10/trie-examples.html

编辑:这是一个示例,展示了如何使用一个来实现自动完成http://rmandvikar.blogspot.com/2008/10/trie-examples.html

Here's a comparison of 3 different auto-complete implementations(though it's in Java not C++).

这是 3 种不同的自动完成实现的比较(尽管它是在 Java 中而不是在 C++ 中)。

* In-Memory Trie
* In-Memory Relational Database
* Java Set

When looking up keys, the trie is marginally faster than the Set implementation. Both the trie and the set are a good bit faster than the relational database solution.

查找键时,trie 比 Set 实现略快。trie 和 set 都比关系数据库解决方案快很多。

The setup cost of the Set is lower than the Trie or DB solution. You'd have to decide whether you'd be constructing new "wordsets" frequently or whether lookup speed is the higher priority.

Set 的设置成本低于 Trie 或 DB 解决方案。您必须决定是否经常构建新的“词集”,或者查找速度是否具有更高的优先级。

These results are in Java, your mileage may vary with a C++ solution.

这些结果是在 Java 中的,您的里程可能因 C++ 解决方案而异。

回答by Joy Dutta

For large datasets, a good candidate for the backend would be Ternary search trees. They combine the best of two worlds: the low space overhead of binary search trees and the character-based time efficiency of digital search tries.

对于大型数据集,后端的一个很好的候选者是三元搜索树。它们结合了两个世界的优点:二叉搜索树的低空间开销和数字搜索尝试的基于字符的时间效率。

See in Dr. Dobbs Journal: http://www.ddj.com/windows/184410528

参见 Dobbs 博士期刊:http: //www.ddj.com/windows/184410528

The goal is the fast retrieval of a finite resultset as the user types in. Lets first consider that to search "computer science" you can start typing from "computer" or "science" but not "omputer". So, given a phrase, generate the sub-phrases starting with a word. Now for each of the phrases, feed them into the TST (ternary search tree). Each node in the TST will represent a prefix of a phrase that has been typed so far. We will store the best 10 (say) results for that prefix in that node. If there are many more candidates than the finite amount of results (10 here) for a node, there should be a ranking function to resolve competition between two results.

目标是在用户输入时快速检索有限结果集。让我们首先考虑搜索“计算机科学”,您可以从“计算机”或“科学”而不是“计算机”开始输入。因此,给定一个短语,生成以单词开头的子短语。现在对于每个短语,将它们输入到 TST(三元搜索树)中。TST 中的每个节点将代表迄今为止已键入的短语的前缀。我们将在该节点中存储该前缀的最佳 10(比如说)结果。如果一个节点的候选结果比有限数量的结果(这里是 10 个)多得多,那么应该有一个排名函数来解决两个结果之间的竞争。

The tree can be built once every few hours, depending on the dynamism of the data. If the data is in real time, then I guess some other algorithm will give a better balance. In this case, the absolute requirement is the lightning-fast retrieval of results for every keystroke typed which it does very well.

该树可以每隔几个小时构建一次,具体取决于数据的动态性。如果数据是实时的,那么我想其他一些算法会提供更好的平衡。在这种情况下,绝对要求是对每次键入的击键结果进行闪电般的快速检索,它做得很好。

More complications will arise if the suggestion of spelling corrections is involved. In that case, the edit distance algorithms will have to be considered as well.

如果涉及拼写更正的建议,则会出现更多的复杂情况。在这种情况下,还必须考虑编辑距离算法。

For small datasets like a list of countries, a simple implementation of Trie will do. If you are going to implement such an autocomplete drop-down in a web application, the autocomplete widget of YUI3 will do everything for you after you provide the data in a list. If you use YUI3 as just the frontend for an autocomplete backed by large data, make the TST based web services in C++, and then use script node data source of the autocomplete widget to fetch data from the web service instead of a simple list.

对于像国家列表这样的小数据集,一个简单的 Trie 实现就可以了。如果您打算在 Web 应用程序中实现这样的自动完成下拉菜单,在您提供列表中的数据后,YUI3 的自动完成小部件将为您完成所有工作。如果您仅使用 YUI3 作为大数据支持的自动完成的前端,请使用 C++ 制作基于 TST 的 Web 服务,然后使用自动完成小部件的脚本节点数据源从 Web 服务而不是简单列表中获取数据。

回答by r15habh

Segment treescan be used for efficiently implementing auto complete

段树可用于有效地实现自动完成

回答by Nicolai

If you want to suggest the most popular completions, a "Suggest Tree" may be a good choice: Suggest Tree

如果你想推荐最受欢迎的完成,“建议树”可能是一个不错的选择: 建议树

回答by anno

For a simple solution : you generate a 'candidate' with a minimum edit (Levenshtein) distance (1 or 2) then you test the existence of the candidate with a hash container (setwill suffice for a simple soltion, then use unordered_setfrom the tr1 or boost).

对于一个简单的解决方案:您生成一个具有最小编辑(Levenshtein)距离(1 或 2)的“候选” ,然后使用散列容器测试候选者的存在(set足以用于简单的解决方案,然后使用unordered_set从tr1 或增强)。

Example: You wrote carr and you want car. arr is generated by 1 deletion. Is arr in your unordered_set ? No. crr is generated by 1 deletion. Is crr in your unordered_set ? No. car is generated by 1 deletion. Is car in your unordered_set ? Yes, you win.

例子:你写了 carr 并且你想要 car。arr 由 1 次删除生成。arr 在您的 unordered_set 中吗?编号 crr 是由 1 次删除生成的。crr 在您的 unordered_set 中吗?号车是通过1次删除生成的。汽车在您的 unordered_set 中吗?是的,你赢了。

Of course there's insertion, deletion, transposition etc...

当然还有插入、删除、转置等......

You see that your algorithm for generating candidates is really where you're wasting time, especially if you have a very little unordered_set.

您会发现生成候选的算法确实是您浪费时间的地方,尤其是当您的unordered_set很少时。