Difference between StandardTokenizerFactory and KeywordTokenizerFactory in Solr? (java)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/7645465/

Date: 2020-10-30 20:50:06  Source: igfitidea

Difference between StandardTokenizerFactory and KeywordTokenizerFactory in Solr?

Tags: java, solr, solrnet, tokenize

Asked by ravidev

I am new to Solr. I want to know when to use StandardTokenizerFactory and when to use KeywordTokenizerFactory.

I read the docs on the Apache wiki, but I am still not getting it.

Can anybody explain the difference between StandardTokenizerFactory and KeywordTokenizerFactory?

Answered by Jayendra

StandardTokenizerFactory :-
It tokenizes on whitespace and punctuation, stripping the punctuation characters.

Documentation :-

Splits words at punctuation characters, removing the punctuation. However, a dot that's not followed by whitespace is considered part of a token. Splits words at hyphens, unless there's a number in the token; in that case, the whole token is interpreted as a product number and is not split. Recognizes email addresses and Internet hostnames as one token.

You would use this for fields where you want to search on the field data.

e.g. -

http://example.com/I-am+example?Text=-Hello

would generate 7 tokens (comma-separated) -

http,example.com,I,am,example,Text,Hello
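In Solr you choose between these tokenizers in a field type definition in schema.xml. A minimal sketch of a searchable text type using StandardTokenizerFactory (the field and type names here are illustrative, not from the original post):

```xml
<!-- schema.xml excerpt (hypothetical names): a general-purpose searchable text type -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizerFactory splits the input as described above -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- lower-casing is a common companion filter so searches are case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- a field indexed with that type -->
<field name="title" type="text_general" indexed="true" stored="true"/>
```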

KeywordTokenizerFactory :-

Keyword Tokenizer does not split the input at all.
No processing is performed on the string, and the whole string is treated as a single entity.
It doesn't actually do any tokenization: it returns the original text as one term.

Mainly used for sorting or faceting requirements, where you want to match the exact value when filtering on multi-word terms; it also matters for sorting, since sorting does not work on tokenized fields.

e.g.

http://example.com/I-am+example?Text=-Hello

would generate a single token -

http://example.com/I-am+example?Text=-Hello
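For the faceting/sorting use case, a corresponding schema.xml sketch with KeywordTokenizerFactory (again, the names are illustrative). Compared to plain solr.StrField, which is also not tokenized, wrapping KeywordTokenizerFactory in a TextField analyzer lets you still apply filters to the single token:

```xml
<!-- schema.xml excerpt (hypothetical names): exact-match type for faceting/sorting -->
<fieldType name="string_exact" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <!-- KeywordTokenizerFactory emits the whole input as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- filters still run on that single token; lower-casing gives case-insensitive exact match -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="category" type="string_exact" indexed="true" stored="true"/>
```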