Lucene index problems with the "-" character (java)
Disclaimer: This page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must likewise follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/10186675/
Lucene Index problems with "-" character
Asked by Zteve
I'm having trouble with a Lucene index which has indexed words that contain "-" characters.
It works for some words that contain "-" but not for all of them, and I can't find the reason why it's not working.
The field I'm searching in is analyzed and contains versions of the word both with and without the "-" character.
I'm using the analyzer org.apache.lucene.analysis.standard.StandardAnalyzer.
Here is an example:
If I search for "gsx-*" I get a result; the indexed field contains "SUZUKI GSX-R 1000 GSX-R1000 GSXR".
But if I search for "v-*" I get no result. The indexed field of the expected result contains: "SUZUKI DL 1000 V-STROM DL1000V-STROMVSTROM V STROM"
If I search for "v-strom" without the "*" it works, but if I just search for "v-str", for example, I get no result. (There should be a result, because this is for a live search in a webshop.)
So, what's the difference between the two expected results? Why does it work for "gsx-*" but not for "v-*"?
Accepted answer by Marko Topolnik
StandardAnalyzer will treat the hyphen as whitespace, I believe. So it turns your query "gsx-*" into "gsx*", and "v-*" into nothing, because it also eliminates single-letter tokens. What you see as the field contents in the search result is the stored value of the field, which is completely independent of the terms that were indexed for that field.
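The claimed behaviour can be sketched in plain Java. This is a simplified simulation of the analysis described above, not the real StandardAnalyzer; the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Rough simulation of the behaviour described above: split on any
// non-alphanumeric character (so '-' acts like whitespace) and drop
// single-letter tokens, which is why "v-*" degenerates to nothing.
public class HyphenTokenizerSketch {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{Alnum}]+")) {
            // Single-letter fragments like the "v" in "v-*" are discarded.
            if (part.length() > 1) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("gsx-*")); // [gsx]
        System.out.println(tokenize("v-*"));   // []
    }
}
```

Under these assumed rules, "gsx-*" still leaves a usable prefix term, while "v-*" leaves nothing to search for.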
So what you want is for "v-strom" as a whole to be an indexed term. StandardAnalyzer is not suited to this kind of text. Maybe have a go with the WhitespaceAnalyzer or SimpleAnalyzer. If that still doesn't cut it, you also have the option of throwing together your own analyzer, or just starting off from those two mentioned and composing them with further TokenFilters. A very good explanation is given in the Lucene Analysis package Javadoc.
BTW, there's no need to enter all the variants in the index, like V-strom, V-Strom, etc. The idea is for the same analyzer to normalize all these variants to the same string, both in the index and while parsing the query.
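The normalization point can be illustrated with a toy sketch: if the same function is applied to field values at index time and to queries at search time, spelling variants collapse to the same terms. This is illustrative plain Java, not the Lucene API:

```java
import java.util.List;
import java.util.Locale;

// Toy "analyzer": lowercase, then split on non-alphanumeric characters.
// Applying the SAME function on both the index side and the query side
// makes all casing/hyphenation variants meet at identical terms.
public class NormalizeSketch {
    public static List<String> analyze(String text) {
        return List.of(text.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+"));
    }

    public static void main(String[] args) {
        // "V-Strom", "v-STROM", etc. all normalize identically.
        System.out.println(analyze("V-Strom").equals(analyze("v-STROM"))); // true
    }
}
```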
Answered by Mark Leighton Fisher
ClassicAnalyzer handles '-' as a useful, non-delimiter character. As I understand ClassicAnalyzer, it handles '-' like the pre-3.1 StandardAnalyzer, because ClassicAnalyzer uses ClassicTokenizer, which treats numbers with an embedded '-' as a product code, so the whole thing is tokenized as one term.
When I was at the Regenstrief Institute I noticed this after upgrading Luke, as the LOINC standard medical terms (LOINC was initiated by R.I.) are identified by a number followed by a '-' and a check digit, like '1-8' or '2857-1'. My searches for LOINCs like '45963-6' failed using StandardAnalyzer in Luke 3.5.0, but succeeded with ClassicAnalyzer (and this was because we had built the index with Lucene.NET 2.9.2).
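The product-code rule described here can be sketched in plain Java. The heuristic below (keep a token whole if it mixes digits with an embedded '-') is an assumption for illustration, not the actual ClassicTokenizer grammar:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the ClassicTokenizer rule described above:
// a token containing a digit around an embedded '-' is treated as a
// product code and kept whole; otherwise it is split at the hyphen.
public class ProductCodeSketch {
    public static List<String> tokenize(String token) {
        List<String> out = new ArrayList<>();
        if (token.matches(".*\\d.*-.*") || token.matches(".*-.*\\d.*")) {
            out.add(token); // e.g. "45963-6" stays one term
        } else {
            for (String p : token.split("-")) {
                if (!p.isEmpty()) out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("45963-6"));   // [45963-6]
        System.out.println(tokenize("chat-room")); // [chat, room]
    }
}
```

Under such a rule a LOINC like '45963-6' survives as a single searchable term, while plain hyphenated words are still split.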
Answered by PVR
(Based on Lucene 4.7) StandardTokenizer splits hyphenated words in two: for example, "chat-room" becomes "chat" and "room", and the two words are indexed separately instead of as a single whole word. It is quite common for separate words to be connected with a hyphen: "sport-mad", "camera-ready", "quick-thinking", and so on. A significant number are hyphenated names, such as "Emma-Claire". When doing a whole-word search or query, users expect to find the word including those hyphens. But since there are cases where the parts really are separate words, Lucene keeps the hyphen out of the default word definition.
To add support for the hyphen in StandardAnalyzer, you have to make changes in StandardTokenizerImpl.java, which is a class generated from jFlex.
Refer to this link for the complete guide. You have to add the following line in SUPPLEMENTARY.jflex-macro, which is included by the StandardTokenizerImpl.jflex file.
MidLetterSupp = ( [\u002D] )
After making the changes, provide the StandardTokenizerImpl.jflex file as input to the jFlex engine and click generate. The output will be StandardTokenizerImpl.java.
Then rebuild the index using that class file.
Answered by Ralph
The ClassicAnalyzer is recommended for indexing text containing product codes like 'GSX-R1000'. It will recognize this as a single term and will not split up its parts. But, for example, the text 'Europe/Berlin' will be split up by the ClassicAnalyzer into the words 'Europe' and 'Berlin'. This means if you have a text indexed by the ClassicAnalyzer containing the phrase
Europe/Berlin GSX-R1000
you can search for "europe", "berlin" or "GSX-R1000".
But be careful which analyzer you use for the search. I think the best choice for searching a Lucene index is the KeywordAnalyzer. With the KeywordAnalyzer you can also search for specific fields in a document, and you can build complex queries like:
(processid:4711) (berlin)
This query will search for documents with the phrase 'berlin' but also a field 'processid' containing the number 4711.
But if you search the index for the phrase "europe/berlin" you will get no result! This is because the KeywordAnalyzer does not change your search phrase, whereas the phrase 'Europe/Berlin' was split into two separate words by the ClassicAnalyzer. This means you have to search for 'europe' and 'berlin' separately.
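The mismatch can be shown with a toy model: the index holds the terms an indexing analyzer produced, while a KeywordAnalyzer-style query is used verbatim. The term set and helper names below are made up for illustration:

```java
import java.util.Set;

// Toy illustration of the analyzer mismatch described above: the index
// contains the split-up terms, but a KeywordAnalyzer-style query keeps the
// user's input as one untouched string, so "europe/berlin" can never match.
public class AnalyzerMismatchSketch {
    // Terms as an analyzer like ClassicAnalyzer would have indexed them.
    static final Set<String> INDEXED = Set.of("europe", "berlin", "gsx-r1000");

    public static boolean keywordQueryMatches(String query) {
        return INDEXED.contains(query); // query is used verbatim
    }

    public static void main(String[] args) {
        System.out.println(keywordQueryMatches("europe/berlin")); // false
        System.out.println(keywordQueryMatches("berlin"));        // true
    }
}
```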
To resolve this conflict you can translate a search term entered by the user into a search query that fits your needs, using the following code:
// parse the user's input with the same analyzer that was used for indexing
QueryParser parser = new QueryParser("content", new ClassicAnalyzer());
Query result = parser.parse(searchTerm);
// render the parsed query back as a plain string for the "content" field
searchTerm = result.toString("content");
This code will translate the search phrase
Europe/Berlin
into
europe berlin
which will result in the expected document set.
Note: This will also work in more complex situations. The search term
Europe/Berlin GSX-R1000
will be translated into:
(europe berlin) GSX-R1000
which will search correctly for all phrases in combination using the KeywordAnalyzer.
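The whole rewriting trick can be sketched in plain Java as a stand-in for `QueryParser.parse(...).toString(field)`. The tokenization rules here (keep digit-bearing codes whole, split the rest on non-letters) are simplified assumptions, and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java sketch of the query rewriting above: run the user's input
// through (simplified) index-side tokenization, grouping multi-token
// results in parentheses, mimicking QueryParser.parse(...).toString(field).
public class QueryRewriteSketch {
    public static String rewrite(String searchTerm) {
        StringBuilder out = new StringBuilder();
        for (String word : searchTerm.split("\\s+")) {
            List<String> parts = analyzeWord(word);
            if (parts.isEmpty()) continue;
            if (out.length() > 0) out.append(' ');
            if (parts.size() > 1) {
                out.append('(').append(String.join(" ", parts)).append(')');
            } else {
                out.append(parts.get(0));
            }
        }
        return out.toString();
    }

    // Keep product codes (words containing digits) whole; split the rest
    // on non-letter characters and lowercase them.
    private static List<String> analyzeWord(String word) {
        List<String> parts = new ArrayList<>();
        if (word.matches(".*\\d.*")) {
            parts.add(word);
        } else {
            for (String p : word.split("[^\\p{Alpha}]+")) {
                if (!p.isEmpty()) parts.add(p.toLowerCase(Locale.ROOT));
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("Europe/Berlin GSX-R1000"));
        // (europe berlin) GSX-R1000
    }
}
```

Under these assumed rules, the input 'Europe/Berlin GSX-R1000' comes out as '(europe berlin) GSX-R1000', matching the translated query shown above.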