ElasticSearch returning only documents with distinct value

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/24508191/
Asked by user962206
Let's say I have this given data
{
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "GEORGE",
"favorite_cars" : [ "honda","Hyundae" ]
}
Whenever I query this data searching for people whose favorite car is a toyota, it returns this data:
{
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}
The result is two records with the name ABC. How do I select distinct documents only? The result I want to get is only this:
{
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}
Here's my query:
{
"fuzzy_like_this_field" : {
"favorite_cars" : {
"like_text" : "toyota",
"max_query_terms" : 12
}
}
}
I am using ElasticSearch 1.0.0 with the Java API client.
Answered by dark_shadow
ElasticSearch doesn't provide any query by which you can get distinct documents based on a field value.
Ideally you should have indexed the same document with the same type and id, since these two things are used by ElasticSearch to build the _uid that uniquely identifies a document. A unique id is important not only for detecting duplicate documents, but also because a modified document gets updated in place instead of being inserted as a new one. For more information about indexing documents you can read this.
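For example, indexing both ABC documents under the same explicit id (the index and type names here are illustrative, not from the question) makes the second request an update rather than a second document:

```
PUT /myindex/people/1
{
    "name" : "ABC",
    "favorite_cars" : [ "ferrari", "toyota" ]
}
```

Re-sending this request bumps the document's _version instead of creating a duplicate hit.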
But there is definitely a workaround for your problem. Since you are using the Java API client, you can remove duplicate documents based on a field value yourself. In fact, this gives you more flexibility to perform custom operations on the responses that you get from ES.
SearchResponse response = client.prepareSearch().execute().actionGet();
SearchHits hits = response.getHits();
Iterator<SearchHit> iterator = hits.iterator();
// Keep only one hit per distinct "name"; later hits with the same name overwrite earlier ones
Map<String, SearchHit> distinctObjects = new HashMap<String, SearchHit>();
while (iterator.hasNext()) {
    SearchHit searchHit = iterator.next();
    Map<String, Object> source = searchHit.getSource();
    if (source.get("name") != null) {
        distinctObjects.put(source.get("name").toString(), searchHit);
    }
}
So you will end up with a map of unique SearchHit objects, keyed by name.
You can also create an object mapping and use that in place of SearchHit.
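As a minimal, self-contained sketch of that deduplication step (the sample data and the `distinctByName` helper are illustrative, standing in for the maps you would get from SearchHit.getSource()):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DistinctByName {

    // Keeps the first document seen for each distinct "name", preserving hit order
    static List<Map<String, Object>> distinctByName(List<Map<String, Object>> hits) {
        Map<String, Map<String, Object>> seen = new LinkedHashMap<String, Map<String, Object>>();
        for (Map<String, Object> hit : hits) {
            Object name = hit.get("name");
            if (name != null && !seen.containsKey(name.toString())) {
                seen.put(name.toString(), hit);
            }
        }
        return new ArrayList<Map<String, Object>>(seen.values());
    }

    public static void main(String[] args) {
        Map<String, Object> a = new LinkedHashMap<String, Object>();
        a.put("name", "ABC");
        a.put("favorite_cars", Arrays.asList("ferrari", "toyota"));
        Map<String, Object> b = new LinkedHashMap<String, Object>();
        b.put("name", "ABC");
        b.put("favorite_cars", Arrays.asList("ferrari", "toyota"));

        // The two ABC hits collapse into one
        System.out.println(distinctByName(Arrays.asList(a, b)).size());
    }
}
```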
I hope this solves your problem. Please forgive me if there are any errors in the code; this is just pseudo-ish code to show how you can solve your problem.
Thanks
Answered by JRL
You can eliminate duplicates using aggregations. With a terms aggregation the results will be grouped by one field, e.g. name, also providing a count of the occurrences of each value of the field, and the results will be sorted by this count (descending).
{
"query": {
"fuzzy_like_this_field": {
"favorite_cars": {
"like_text": "toyota",
"max_query_terms": 12
}
}
},
"aggs": {
"grouped_by_name": {
"terms": {
"field": "name",
"size": 0
}
}
}
}
In addition to the hits, the result will also contain the buckets, with the unique values in key and the counts in doc_count:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.19178301,
"hits" : [ {
"_index" : "pru",
"_type" : "pru",
"_id" : "vGkoVV5cR8SN3lvbWzLaFQ",
"_score" : 0.19178301,
"_source":{"name":"ABC","favorite_cars":["ferrari","toyota"]}
}, {
"_index" : "pru",
"_type" : "pru",
"_id" : "IdEbAcI6TM6oCVxCI_3fug",
"_score" : 0.19178301,
"_source":{"name":"ABC","favorite_cars":["ferrari","toyota"]}
} ]
},
"aggregations" : {
"grouped_by_name" : {
"buckets" : [ {
"key" : "abc",
"doc_count" : 2
} ]
}
}
}
Note that using aggregations will be costly because of duplicate elimination and result sorting.
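The bucketing that the terms aggregation performs server-side can be sketched in plain Java (this illustrates the count-and-sort semantics, not the client API; the `buckets` helper is hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermsBuckets {

    // Counts occurrences of each value, then sorts buckets by count descending,
    // mirroring the default ordering of a terms aggregation
    static List<Map.Entry<String, Integer>> buckets(List<String> values) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String v : values) {
            Integer c = counts.get(v);
            counts.put(v, c == null ? 1 : c + 1);
        }
        List<Map.Entry<String, Integer>> sorted =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        sorted.sort((x, y) -> y.getValue() - x.getValue());
        return sorted;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : buckets(Arrays.asList("abc", "abc", "george"))) {
            System.out.println(e.getKey() + " : " + e.getValue()); // key : doc_count
        }
    }
}
```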
Answered by Ajey Dudhe
For a single shard this can be handled using a custom filter, which also takes care of pagination. To handle the above use case we can use script support as follows:
- Define a custom script filter. For this discussion assume it is called AcceptDistinctDocumentScriptFilter
- This custom filter takes in a list of primary keys as input.
- These primary keys are the fields whose values will be used to determine uniqueness of records.
- Now, instead of using aggregation we use normal search request and pass the custom script filter to the request.
- If the search already has a filter/query criteria defined, then append the custom filter using a logical AND operator.
- Following is an example using pseudo syntax. If the request is:
select * from myindex where file_hash = 'hash_value'
then append the custom filter as:
select * from myindex where file_hash = 'hash_value' AND AcceptDistinctDocumentScriptFilter(params= ['file_name', 'file_folder'])
For a distributed search this is tricky and needs a plugin to hook into the QUERY phase. More details here.
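The accept/reject logic such a filter would apply can be sketched in plain Java (AcceptDistinctDocumentScriptFilter is the answer's hypothetical name; the real implementation would live in a plugin, and this class is only an illustration of the idea):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DistinctDocumentFilter {

    private final List<String> keyFields;
    private final Set<String> seenKeys = new HashSet<String>();

    DistinctDocumentFilter(List<String> keyFields) {
        this.keyFields = keyFields;
    }

    // Accepts a document only if its composite key has not been seen yet
    boolean accept(Map<String, Object> doc) {
        StringBuilder key = new StringBuilder();
        for (String field : keyFields) {
            key.append(doc.get(field)).append('\u0000'); // separator avoids key collisions
        }
        return seenKeys.add(key.toString()); // Set.add returns false for duplicates
    }

    public static void main(String[] args) {
        DistinctDocumentFilter filter =
                new DistinctDocumentFilter(Arrays.asList("file_name", "file_folder"));
        Map<String, Object> doc = new HashMap<String, Object>();
        doc.put("file_name", "report.txt");
        doc.put("file_folder", "/tmp");
        System.out.println(filter.accept(doc)); // first occurrence: accepted
        System.out.println(filter.accept(doc)); // duplicate: rejected
    }
}
```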
Answered by Eulalie367
@JRL is almost correct. You will need an aggregation in your query. This will get you a list of the top 10000 "favorite_cars" values in your index, ordered by occurrence count.
{
    "query": { "match_all": {} },
    "size": 0,
    "aggs": {
        "Cars": {
            "terms": { "field": "favorite_cars", "order": { "_count": "desc" }, "size": 10000 }
        }
    }
}
It is also worth noting that you will want your "favorite_cars" field to not be analyzed, in order to get "McLaren F1" back instead of "McLaren" and "F1".
"favorite_cars": {
    "type": "string",
    "index": "not_analyzed"
}