Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/13178550/

Date: 2020-10-31 11:46:50  Source: igfitidea

elasticsearch - Return the tokens of a field

Tags: java, search, lucene, token, elasticsearch

Asked by Kennedy

How can I have the tokens of a particular field returned in the result?

For example, a GET request

curl -XGET 'http://localhost:9200/twitter/tweet/1'

returns

{
    "_index" : "twitter",
    "_type" : "tweet",
    "_id" : "1", 
    "_source" : {
        "user" : "kimchy",
        "postDate" : "2009-11-15T14:12:12",
        "message" : "trying out Elastic Search"
    } 
}

I would like to have the tokens of the '_source.message' field included in the result.

Answered by imotov

There is also another way to do it using the following script_fields script:

curl 'http://localhost:9200/test-idx/_search?pretty=true' -d '{
    "query" : {
        "match_all" : { }
    },
    "script_fields": {
        "terms" : {
            "script": "doc[field].values",
            "params": {
                "field": "message"
            }
        }
    }
}'

It's important to note that while this script returns the actual terms that were indexed, it also caches all field values and can use a lot of memory on large indices. So, on large indices, it might be more useful to retrieve field values from stored fields or from the source and analyze them again on the fly using the following MVEL script:

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import java.io.StringReader;

// Cache analyzer for further use
cachedAnalyzer=(isdef cachedAnalyzer)?cachedAnalyzer:doc.mapperService().documentMapper(doc._type.value).mappers().indexAnalyzer();

terms=[];
// Get value from Fields Lookup
//val=_fields[field].values;

// Get value from Source Lookup
val=_source[field];

if(val != null) {
  tokenStream=cachedAnalyzer.tokenStream(field, new StringReader(val)); 
  CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute); 
  while(tokenStream.incrementToken()) { 
    terms.add(termAttribute.toString())
  }; 
  tokenStream.close(); 
} 
terms

This MVEL script can be stored as config/scripts/analyze.mvel and used with the following query:

curl 'http://localhost:9200/test-idx/_search?pretty=true' -d '{
    "query" : {
        "match_all" : { }
    },
    "script_fields": {
        "terms" : {
            "script": "analyze",
            "params": {
                "field": "message"
            }
        }
    }
}'

Answered by javanna

If you mean the tokens that have been indexed, you can make a terms facet on the message field. Increase the size value in order to get more entries back, or set it to 0 to get all terms.
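
A terms facet request along those lines might look like the sketch below. It assumes an Elasticsearch release that still supports facets (later releases replaced them with aggregations) and the twitter index from the question. The facet label message_terms and the helper function names are hypothetical, chosen only for illustration. Keep in mind that a facet counts terms across all documents matching the query, not per document.

```shell
# Base URL of the node; localhost:9200 is an assumption taken from the question.
ES_URL="${ES_URL:-http://localhost:9200}"

# Print the JSON body for a terms facet on field $1 returning up to $2 terms.
# "message_terms" is an arbitrary (hypothetical) label for the facet.
terms_facet_body() {
  cat <<EOF
{
  "query": { "match_all": {} },
  "facets": {
    "message_terms": {
      "terms": { "field": "$1", "size": $2 }
    }
  }
}
EOF
}

# Send the facet request to the search endpoint of the twitter index.
# curl reads the request body from stdin via "-d @-".
terms_facet() {
  terms_facet_body "$1" "$2" | curl -s "$ES_URL/twitter/_search?pretty=true" -d @-
}
```

With a node running, terms_facet message 10 would send the request; the indexed terms and their counts come back under the facets key of the response.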

Lucene can store term vectors, but as far as I know there is currently no way to access them through elasticsearch.

Why do you need that? If you only want to check what you're indexing, you can have a look at the analyze API.
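
As a rough sketch of the analyze API, the helpers below run the analyzer configured for a field over a piece of text. They assume the older request style, in which the field is passed as a query parameter and the text is the raw request body; newer releases expect a JSON body instead, so check the documentation for your version. The function names are hypothetical, and the index and field names are taken from the question.

```shell
# Base URL of the node; localhost:9200 is an assumption taken from the question.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the analyze URL for index $1 and field $2.
analyze_url() {
  printf '%s/%s/_analyze?field=%s&pretty=true' "$ES_URL" "$1" "$2"
}

# Analyze the text in $3 with the analyzer mapped to field $2 of index $1.
analyze_text() {
  curl -s -XGET "$(analyze_url "$1" "$2")" -d "$3"
}
```

For example, analyze_text twitter message 'trying out Elastic Search' would list the tokens that the message field's analyzer produces for that text.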