Java Lucene - Exact string matching
Original source: http://stackoverflow.com/questions/25809704/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Lucene - Exact string matching
Asked by LucaT
I'm trying to create a Lucene 4.10 index. I just want to save in the index the exact strings that I put into the document, without tokenization.
I'm using the StandardAnalyzer.
Directory dir = FSDirectory.open(new File("myDire"));
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analyzer);
iwc.setOpenMode(OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, iwc);
// content1..content3 hold the strings to index verbatim
Document doc = new Document();
StringField field1 = new StringField("1", content1, Store.YES);
StringField field2 = new StringField("2", content2, Store.YES);
StringField field3 = new StringField("3", content3, Store.YES);
doc.add(field1);
doc.add(field2);
doc.add(field3);
writer.addDocument(doc, analyzer);
writer.close();
If I print the index's contents, I can see that my data is stored. For example, my document has this field "3":
stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<3:"Fuel Tank Capacity"@en>
I'm trying to query the index in order to get it back:
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser("3", analyzer);
String queryString = "\"\"Fuel Tank Capacity\"@en\"";
Query query = parser.createPhraseQuery("3", QueryParser.escape(queryString));
TopDocs docs = searcher.search(query, null, 20);
I'm trying to search for the term "Fuel Tank Capacity"@en (quotation marks included), so I tried to escape the quotes and wrapped the whole term in another pair of quotes so that Lucene understands that I'm searching for the entire text.
If I print the query, I get: 3:"fuel tank capacity en" but I don't want the text to be split on the @ symbol.
I think that my first problem is the StandardAnalyzer, because it seems to tokenize, if I'm not mistaken. However, I cannot understand how to query the index in order to get exactly "Fuel Tank Capacity"@en (quotation marks included).
Thank you
Answered by femtoRgon
You could simplify matters and just cut the QueryParser out of the equation entirely. Since you are using a StringField, the whole content of the field is a single term, so a simple TermQuery should work well:
Query query = new TermQuery(new Term("3","\"Fuel Tank Capacity\"@en"));
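To see why a single-term lookup finds the document while an analyzed query does not, here is a toy model in plain Java. It is not Lucene code: the class ExactTermSketch and its methods are hypothetical, and roughTokenize is only a crude stand-in for what an analyzer does, not a reproduction of StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model only: NOT Lucene, and not an exact copy of StandardAnalyzer's rules.
class ExactTermSketch {
    // Maps each indexed term to the ids of the documents that contain it.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Index the whole value as one term, as an untokenized field would.
    void addExact(int docId, String value) {
        postings.computeIfAbsent(value, k -> new ArrayList<>()).add(docId);
    }

    // Look up one term as-is, without analyzing the query text.
    List<Integer> termLookup(String term) {
        return postings.getOrDefault(term, new ArrayList<>());
    }

    // Crude analyzer stand-in: lowercase, then split on non-alphanumeric characters.
    static List<String> roughTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String value = "\"Fuel Tank Capacity\"@en";
        ExactTermSketch index = new ExactTermSketch();
        index.addExact(1, value);

        // The exact string is the only term in the index, so it matches.
        System.out.println(index.termLookup(value));
        // Tokens produced by analysis are different terms, so none of them match.
        for (String token : roughTokenize(value)) {
            System.out.println(token + " -> " + index.termLookup(token));
        }
    }
}
```

In this model the exact string is the only key in the index, so the pieces produced by tokenizing the query text never match it. That is the same reason the analyzed phrase query in the question cannot hit a StringField value.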
Answered by mindas
When escaping a quote (or any other special character in Lucene), you need to use \, but don't forget that the backslash itself needs to be escaped inside a Java string.
The following works for me:
Query q = new QueryParser(
Version.LUCENE_4_10_0,
"",
new StandardAnalyzer(Version.LUCENE_4_10_0)
).parse("3:\"\\\"Fuel Tank Capacity\\\"@en\"");
How did I arrive at this?
- Took the original string: "Fuel Tank Capacity"@en
- Added the escaping that Lucene requires (escaped each " with \): \"Fuel Tank Capacity\"@en
- Added quotes at the beginning and the end of the string: "\"Fuel Tank Capacity\"@en"
- Added the escaping that a Java string requires (each backslash becomes a double backslash, and each double quote is escaped with a backslash): \"\\\"Fuel Tank Capacity\\\"@en\"
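The escaping layers listed above can be reproduced mechanically. The sketch below uses plain Java string operations only. The class and helper names are hypothetical and are not part of Lucene, and escapeQuotesForQuery handles only double quotes, whereas Lucene's query syntax has other special characters as well.

```java
// Sketch of the escaping layers; the helper names are hypothetical, not Lucene API.
class EscapeLayersSketch {
    // Layer 1: backslash-escape each double quote for the query syntax.
    static String escapeQuotesForQuery(String raw) {
        return raw.replace("\"", "\\\"");
    }

    // Layer 2: wrap the escaped text in quotes so it is read as one phrase.
    static String asPhrase(String escaped) {
        return "\"" + escaped + "\"";
    }

    // Layer 3: render the text as it must appear inside a Java string literal
    // (backslashes are doubled first, then each quote gets a backslash).
    static String asJavaLiteralBody(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        String raw = "\"Fuel Tank Capacity\"@en";
        String escaped = escapeQuotesForQuery(raw);
        String phrase = asPhrase(escaped);

        System.out.println(raw);                       // step 1: original string
        System.out.println(escaped);                   // step 2: query-level escaping
        System.out.println(phrase);                    // step 3: wrapped in quotes
        System.out.println(asJavaLiteralBody(phrase)); // step 4: Java-literal form
    }
}
```

The four printed lines correspond, in order, to the four steps in the list. The order of the two replacements in asJavaLiteralBody matters: doubling the backslashes after escaping the quotes would also double the backslashes that were just added.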