ElasticSearch returning only documents with distinct value

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/24508191/
Asked by user962206
Let's say I have this given data
{
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "GEORGE",
"favorite_cars" : [ "honda","Hyundae" ]
}
Whenever I query this data searching for people whose favorite car is a toyota, it returns this data:
{
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}
The result is two records with the name ABC. How do I select distinct documents only? The result I want to get is only this:
{
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}
Here's my query:
{
"fuzzy_like_this_field" : {
"favorite_cars" : {
"like_text" : "toyota",
"max_query_terms" : 12
}
}
}
I am using ElasticSearch 1.0.0 with the Java API client.
Answered by dark_shadow
ElasticSearch doesn't provide any query by which you can get distinct documents based on a field value.
Ideally you should have indexed the same document with the same type and id, since these two things are used by ElasticSearch to build the _uid that uniquely identifies a document. A unique id is important not only for detecting duplicate documents, but also because a modified document gets updated in place instead of being inserted as a new one. For more information about indexing documents you can read this.
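For example, indexing both ABC documents under the same explicit id (the index and type names here are illustrative, not from the question) makes the second request an update rather than a second document:

```
PUT /myindex/people/1
{
    "name" : "ABC",
    "favorite_cars" : [ "ferrari", "toyota" ]
}
```

Re-sending this request bumps the document's _version instead of creating a duplicate hit.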
But there is definitely a workaround for your problem. Since you are using the Java API client, you can remove duplicate documents based on a field value yourself. In fact, this gives you more flexibility to perform custom operations on the responses that you get from ES.
SearchResponse response = client.prepareSearch().execute().actionGet();
SearchHits hits = response.getHits();
Iterator<SearchHit> iterator = hits.iterator();
// Keep only one hit per distinct "name"; later hits with the same name overwrite earlier ones
Map<String, SearchHit> distinctObjects = new HashMap<String, SearchHit>();
while (iterator.hasNext()) {
    SearchHit searchHit = iterator.next();
    Map<String, Object> source = searchHit.getSource();
    if (source.get("name") != null) {
        distinctObjects.put(source.get("name").toString(), searchHit);
    }
}
So you will end up with a map of unique SearchHit objects, keyed by name.
You can also create an object mapping and use that in place of SearchHit.
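As a minimal, self-contained sketch of that deduplication step (the sample data and the `distinctByName` helper are illustrative, standing in for the maps you would get from SearchHit.getSource()):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DistinctByName {

    // Keeps the first document seen for each distinct "name", preserving hit order
    static List<Map<String, Object>> distinctByName(List<Map<String, Object>> hits) {
        Map<String, Map<String, Object>> seen = new LinkedHashMap<String, Map<String, Object>>();
        for (Map<String, Object> hit : hits) {
            Object name = hit.get("name");
            if (name != null && !seen.containsKey(name.toString())) {
                seen.put(name.toString(), hit);
            }
        }
        return new ArrayList<Map<String, Object>>(seen.values());
    }

    public static void main(String[] args) {
        Map<String, Object> a = new LinkedHashMap<String, Object>();
        a.put("name", "ABC");
        a.put("favorite_cars", Arrays.asList("ferrari", "toyota"));
        Map<String, Object> b = new LinkedHashMap<String, Object>();
        b.put("name", "ABC");
        b.put("favorite_cars", Arrays.asList("ferrari", "toyota"));

        // The two ABC hits collapse into one
        System.out.println(distinctByName(Arrays.asList(a, b)).size());
    }
}
```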
I hope this solves your problem. Please forgive me if there are any errors in the code; this is just pseudo-ish code to show how you can solve your problem.
Thanks
Answered by JRL
You can eliminate duplicates using aggregations. With a terms aggregation the results will be grouped by one field, e.g. name, also providing a count of the occurrences of each value of the field, and the results will be sorted by this count (descending).
{
"query": {
"fuzzy_like_this_field": {
"favorite_cars": {
"like_text": "toyota",
"max_query_terms": 12
}
}
},
"aggs": {
"grouped_by_name": {
"terms": {
"field": "name",
"size": 0
}
}
}
}
In addition to the hits, the result will also contain the buckets, with the unique values in key and the counts in doc_count:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.19178301,
"hits" : [ {
"_index" : "pru",
"_type" : "pru",
"_id" : "vGkoVV5cR8SN3lvbWzLaFQ",
"_score" : 0.19178301,
"_source":{"name":"ABC","favorite_cars":["ferrari","toyota"]}
}, {
"_index" : "pru",
"_type" : "pru",
"_id" : "IdEbAcI6TM6oCVxCI_3fug",
"_score" : 0.19178301,
"_source":{"name":"ABC","favorite_cars":["ferrari","toyota"]}
} ]
},
"aggregations" : {
"grouped_by_name" : {
"buckets" : [ {
"key" : "abc",
"doc_count" : 2
} ]
}
}
}
Note that using aggregations will be costly because of duplicate elimination and result sorting.
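The bucketing that the terms aggregation performs server-side can be sketched in plain Java (this illustrates the count-and-sort semantics, not the client API; the `buckets` helper is hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermsBuckets {

    // Counts occurrences of each value, then sorts buckets by count descending,
    // mirroring the default ordering of a terms aggregation
    static List<Map.Entry<String, Integer>> buckets(List<String> values) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String v : values) {
            Integer c = counts.get(v);
            counts.put(v, c == null ? 1 : c + 1);
        }
        List<Map.Entry<String, Integer>> sorted =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        sorted.sort((x, y) -> y.getValue() - x.getValue());
        return sorted;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : buckets(Arrays.asList("abc", "abc", "george"))) {
            System.out.println(e.getKey() + " : " + e.getValue()); // key : doc_count
        }
    }
}
```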
Answered by Ajey Dudhe
For a single shard this can be handled using a custom filter, which also takes care of pagination. To handle the above use case we can use script support as follows:
- Define a custom script filter. For this discussion assume it is called AcceptDistinctDocumentScriptFilter
- This custom filter takes in a list of primary keys as input.
- These primary keys are the fields whose values will be used to determine uniqueness of records.
- Now, instead of using aggregation we use normal search request and pass the custom script filter to the request.
- If the search already has a filter/query criteria defined, then append the custom filter using a logical AND operator.
- Following is an example using pseudo syntax. If the request is:
select * from myindex where file_hash = 'hash_value'
then append the custom filter as:
select * from myindex where file_hash = 'hash_value' AND AcceptDistinctDocumentScriptFilter(params= ['file_name', 'file_folder'])
For a distributed search this is tricky and needs a plugin to hook into the QUERY phase. More details here.
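The accept/reject logic such a filter would apply can be sketched in plain Java (AcceptDistinctDocumentScriptFilter is the answer's hypothetical name; the real implementation would live in a plugin, and this class is only an illustration of the idea):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DistinctDocumentFilter {

    private final List<String> keyFields;
    private final Set<String> seenKeys = new HashSet<String>();

    DistinctDocumentFilter(List<String> keyFields) {
        this.keyFields = keyFields;
    }

    // Accepts a document only if its composite key has not been seen yet
    boolean accept(Map<String, Object> doc) {
        StringBuilder key = new StringBuilder();
        for (String field : keyFields) {
            key.append(doc.get(field)).append('\u0000'); // separator avoids key collisions
        }
        return seenKeys.add(key.toString()); // Set.add returns false for duplicates
    }

    public static void main(String[] args) {
        DistinctDocumentFilter filter =
                new DistinctDocumentFilter(Arrays.asList("file_name", "file_folder"));
        Map<String, Object> doc = new HashMap<String, Object>();
        doc.put("file_name", "report.txt");
        doc.put("file_folder", "/tmp");
        System.out.println(filter.accept(doc)); // first occurrence: accepted
        System.out.println(filter.accept(doc)); // duplicate: rejected
    }
}
```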
Answered by Eulalie367
@JRL is almost correct. You will need an aggregation in your query. This will get you a list of the top 10000 "favorite_cars" values in your index, ordered by occurrence count.
{
    "query": { "match_all": {} },
    "size": 0,
    "aggs": {
        "Cars": {
            "terms": { "field": "favorite_cars", "order": { "_count": "desc" }, "size": 10000 }
        }
    }
}
It is also worth noting that you will want your "favorite_cars" field to not be analyzed, in order to get "McLaren F1" back instead of "McLaren" and "F1".
"favorite_cars": {
    "type": "string",
    "index": "not_analyzed"
}