Java: full-text search like Google

Disclaimer: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/1977815/

Date: 2020-10-29 18:47:13  Source: igfitidea

Full Text Search like Google

Tags: java, full-text-search, lucene

Asked by Eduardo

I would like to implement full-text-search in my off-line (android) application to search the user generated list of notes.

I would like it to behave just like Google, since most people are already used to querying Google.

My initial requirements are:

  • Fast: like Google or as fast as possible, having 100000 documents with 200 hundred words each.
  • Searching for two words should only return documents that contain both words (not just one word) (unless the OR operator is used)
  • Case insensitive (aka: normalization): If I have the word 'Hello' and I search for 'hello' it should match.
  • Diacritical mark insensitive: If I have the word 'así', a search for 'asi' should match. In Spanish, many people either omit diacritical marks or place them incorrectly.
  • Stop word elimination: To keep the index small, meaningless words like 'and', 'the' or 'for' should not be indexed at all.
  • Dictionary substitution (aka: stem words): Similar words should be indexed as one. For example, instances of 'hungrily' and 'hungry' should be replaced with 'hunger'.
  • Phrase search: If I have the text 'Hello world!' a search of '"world hello"' should not match it but a search of '"hello world"' should match.
  • Search all fields (in multifield documents) if no field specified (not just a default field)
  • Auto-completion in search results while typing to give popular searches. (just like Google Suggest)

How may I configure a full-text-search engine to behave as much as possible as Google?

(I am mostly interested in Open Source, Java and in particular Lucene)

Answered by Yuval F

I think Lucene can address your requirements. You should also consider using Solr, which has similar functionality and is much easier to set up.

I will discuss each requirement separately, using Lucene. I believe Solr has similar mechanisms.

  • Fast: like Google or as fast as possible, having 100000 documents with 200 hundred words each.

This is a reasonable index size both for Lucene and Solr, enabling retrieval at several tens of milliseconds per query.

  • Searching for two words should only return documents that contain both words (not just one word) (unless the OR operator is used)

You can do that in Lucene using a BooleanQuery with MUST as the default occurrence for each term.
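
To illustrate what "every term is required" means, here is a minimal plain-Java sketch of a toy inverted index that intersects the posting sets of all query terms. It is not Lucene code and does not reflect how BooleanQuery is implemented; the class `AndSearchSketch` and its methods are hypothetical names chosen for this example, and the ASCII-only tokenizer is a simplifying assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index, for illustration only (not part of Lucene).
final class AndSearchSketch {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Assumption: split on non-word characters, which only suits ASCII text.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Every query term is required, so the result is the intersection
    // of the posting sets of all terms.
    Set<Integer> searchAll(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        AndSearchSketch index = new AndSearchSketch();
        index.add(1, "buy milk and eggs");
        index.add(2, "buy eggs");
        System.out.println(index.searchAll("milk eggs")); // prints [1]
    }
}
```

Document 2 contains only one of the two terms, so it is excluded. An OR operator would take the union of the posting sets instead.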

The next four requirements can be handled by customizing a Lucene Analyzer:

  • Case insensitive (aka: normalization): If I have the word 'Hello' and I search for 'hello' it should match.

A LowerCaseFilter can be used for this.

  • Diacritical mark insensitive: If I have the word 'así', a search for 'asi' should match. In Spanish, many people either omit diacritical marks or place them incorrectly.

This requires Unicode normalization followed by diacritic removal. You can build a custom Analyzer for this.
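
As a rough sketch of the normalization step itself, the following plain-Java example decomposes the text with the JDK's `java.text.Normalizer` and then strips the combining marks. The class and method names (`DiacriticFolding`, `fold`) are made up for this example, and wiring the logic into a custom Lucene Analyzer is left out. Lucene also ships an ASCIIFoldingFilter that may be worth evaluating before writing your own.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of Lucene).
final class DiacriticFolding {
    static String fold(String text) {
        // NFD splits an accented letter into base letter + combining mark,
        // e.g. U+00ED (i with acute) becomes "i" followed by U+0301.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches the combining marks; removing them leaves the base letters.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Lower-casing here also covers the case-insensitivity requirement.
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "as\u00ed" is the Spanish word from the question, written with an escape.
        System.out.println(fold("as\u00ed"));                     // prints asi
        System.out.println(fold("Asi").equals(fold("as\u00ed"))); // prints true
    }
}
```

The same function must be applied both at index time and at query time. Note that this approach also folds 'ñ' to 'n', which may or may not be acceptable for Spanish text.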

  • Stop word elimination: To keep the index small, meaningless words like 'and', 'the' or 'for' should not be indexed at all.

A StopFilter removes stop words in Lucene.
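
The idea behind stop word removal can be sketched in plain Java as follows. This is only an illustration of the concept, not Lucene's StopFilter; the class `StopWordSketch`, its three-word stop list (taken from the question), and the ASCII-only tokenizer are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only (not part of Lucene).
final class StopWordSketch {
    // Assumption: a tiny stop list; a real one would be much longer and language-specific.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("and", "the", "for"));

    // Returns the lower-cased terms that would be indexed, skipping stop words.
    static List<String> indexableTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(indexableTerms("The milk and the eggs")); // prints [milk, eggs]
    }
}
```

Lower-casing happens before the stop list lookup, so 'The' and 'the' are treated alike. The same filtering has to be applied to queries, otherwise a query containing a stop word would look for a term that was never indexed.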

  • Dictionary substitution (aka: stem words): Similar words should be indexed as one. For example, instances of 'hungrily' and 'hungry' should be replaced with 'hunger'.

Lucene has many Snowball Stemmers. One of them may be appropriate.

  • Phrase search: If I have the text 'Hello world!' a search of '"world hello"' should not match it but a search of '"hello world"' should match.

This is covered by Lucene's specialized PhraseQuery.
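
To make the expected phrase semantics concrete, here is a small plain-Java sketch that checks whether the tokens of a phrase occur consecutively and in order inside a document. It only demonstrates the matching rule from the requirement; it is not how PhraseQuery works internally, and the class `PhraseMatchSketch` with its ASCII-only tokenizer is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of Lucene).
final class PhraseMatchSketch {
    // Assumption: lower-case and split on non-word characters (ASCII text only).
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // A phrase matches only if its tokens appear as a contiguous, ordered sublist.
    static boolean containsPhrase(String document, String phrase) {
        List<String> wanted = tokens(phrase);
        return !wanted.isEmpty()
                && Collections.indexOfSubList(tokens(document), wanted) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(containsPhrase("Hello world!", "hello world")); // prints true
        System.out.println(containsPhrase("Hello world!", "world hello")); // prints false
    }
}
```

Scanning every document like this would be far too slow for 100000 documents; a real index stores token positions so that phrase matches can be found without rereading the text.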

As you can see, Lucene covers all of the required functionality. To get a more general picture, I suggest the book Lucene in Action, the Apache Lucene Wiki, or the Lucid Imagination site.

Answered by Kaleb Brasee

A lot of these behaviors are default for Lucene. The first (including all terms) is not, but you can force this behavior by setting the default operator:

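// fields: a String[] with the names of the document fields to search
MultiFieldQueryParser parser = new MultiFieldQueryParser(fields, new StandardAnalyzer());
// Require every query term by default (AND) instead of the parser's default OR
parser.setDefaultOperator(QueryParser.AND_OPERATOR);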

I know that items 2, 4, and 6 are possible, and IIRC, they happen by default. I'm not sure about items 3 and 5, but Lucene offers a ton of customization options, so I'd suggest implementing a proof-of-concept with your data to see if it meets these requirements as well.

Answered by Jim Mitchener

Buy a Google Search Appliance. Or, as the comments say, use Lucene, as you already mentioned.

Answered by Kristopher Ives

HyperSQL is a pure-Java SQL implementation that can be run quite easily, as can SQLite. You could use their full-text capabilities and querying to reinvent the wheel, but as the other commenters have pointed out, an existing implementation is probably best.

Answered by John Doe

Unless you buy a search engine, your options are Lucene, Nutch, Apache Solr, and a few others.
