Sync PostgreSQL data with Elasticsearch

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/35813923/

Date: 2020-10-21 02:11:44 · Source: igfitidea

Sync postgreSql data with ElasticSearch

Tags: postgresql, elasticsearch, logstash

Asked by Khanetor

Ultimately I want to have a scalable search solution for the data in PostgreSQL. My findings point me towards using Logstash to ship write events from Postgres to Elasticsearch; however, I have not found a usable solution. The solutions I have found involve using the jdbc input to query all data from Postgres on an interval, and the delete events are not captured.


I think this is a common use case so I hope you guys could share with me your experience, or give me some pointers to proceed.


Accepted answer by Val

If you also need to be notified on DELETEs and delete the respective record in Elasticsearch, it is true that the Logstash jdbc input will not help. You'd have to use a solution that works off the database's transaction log (MySQL's binlog, or PostgreSQL's write-ahead log via logical decoding), as suggested here.


However, if you still want to use the Logstash jdbc input, what you could do is simply soft-delete records in PostgreSQL, i.e. create a new BOOLEAN column in order to mark your records as deleted. The same flag would then exist in Elasticsearch, and you can exclude those records from your searches with a simple term query on the deleted field.

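As a sketch, the search-side exclusion could look like the following bool query (the match clause on a title field is purely illustrative; only the filter on the boolean deleted flag is the point here):

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "elasticsearch" } }
      ],
      "filter": [
        { "term": { "deleted": false } }
      ]
    }
  }
}
```

Putting the term clause in the filter context also means it is cacheable and does not affect scoring.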

Whenever you need to perform some cleanup, you can delete all records flagged as deleted in both PostgreSQL and Elasticsearch.

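On the PostgreSQL side, that cleanup could be as simple as (table and column names are illustrative):

```sql
-- Purge rows that were previously soft-deleted
DELETE FROM book WHERE deleted;
```

On the Elasticsearch side, the matching documents can be purged with the Delete By Query API, using the same term query on the deleted field.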

Answered by Yegor Zaremba

Please take a look at Debezium. It's a change data capture (CDC) platform, which allows you to stream your data changes.


I created a simple GitHub repository which shows how it works.

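As an illustrative sketch (connector name, connection details, and table list are placeholders, and the exact property names depend on your Debezium version), a Debezium PostgreSQL connector registered with Kafka Connect might be configured like this:

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "secret",
    "database.dbname": "mydb",
    "topic.prefix": "mydb",
    "table.include.list": "public.book"
  }
}
```

A consumer can then read the resulting change events (including deletes) from Kafka and index them into Elasticsearch.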


Answered by taina

You can also take a look at PGSync.


It's similar to Debezium but a lot easier to get up and running.


PGSync is a change data capture (CDC) tool for moving data from Postgres to Elasticsearch. It allows you to keep Postgres as your source of truth and expose structured, denormalized documents in Elasticsearch.


You simply define a JSON schema describing the structure of the data in Elasticsearch.


Here is an example schema: (you can also have nested objects)


{
    "nodes": [
        {
            "table": "book",
            "columns": [
                "isbn",
                "title",
                "description"
            ]
        }
    ]
}

PGSync generates queries for your document on the fly. There is no need to write queries as you would with Logstash. It also supports and tracks deletion operations.


It combines a polling model and an event-driven model: the initial sync polls the database for changes made since the last time the daemon was run, and thereafter it relies on event notifications (based on triggers and handled by pg_notify) to capture changes to the database as they occur.

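PGSync installs and manages its own triggers, but the underlying Postgres notification mechanism it relies on can be illustrated roughly like this (function, channel, and table names are hypothetical; EXECUTE FUNCTION requires PostgreSQL 11+):

```sql
-- Notify listeners on the 'table_changes' channel whenever a row changes
CREATE OR REPLACE FUNCTION notify_change() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify(
    'table_changes',
    json_build_object('table', TG_TABLE_NAME, 'op', TG_OP)::text
  );
  RETURN NULL;  -- the return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER book_notify
AFTER INSERT OR UPDATE OR DELETE ON book
FOR EACH ROW EXECUTE FUNCTION notify_change();
```

A daemon holding a LISTEN table_changes session then receives each payload and can apply the corresponding change downstream.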

It has very little development overhead.


  • Create a schema as described above
  • Point pgsync at your Postgres database and Elasticsearch cluster
  • Start up the daemon.

You can easily create a document that includes multiple relations as nested objects. PGSync tracks any changes for you.


Have a look at the GitHub repo for more details.


You can install the package from PyPI with pip.
