Java: Synchronizing table data across databases

Disclaimer: the content below is from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/15222679/


Synchronizing table data across databases

Tags: java, sql, spring-batch

Asked by Serkan Arıkuşu

I have a table that records its rows' insert/update timestamps in a field.


I want to synchronize the data in this table with another table on another db server. The two db servers are not connected, and synchronization is one-way (master/slave). Using table triggers is not suitable.


My workflow:


  • I use a global last_sync_date parameter and query the Master table for the changed/inserted records (a sketch of this query follows the list)
  • Output the resulting rows to XML
  • Parse the XML and update the Slave table using updates and inserts
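
A minimal JDBC sketch of that extraction step, assuming a hypothetical item table whose last_modified column holds the insert/update timestamp (the table and column names are placeholders, not from the question):

```java
import java.sql.*;

public class DeltaExtractor {

    // Fetch rows changed since the last sync, using the global last_sync_date.
    // The item table and its columns are hypothetical placeholders.
    public static void extractChanges(Connection masterConn, Timestamp lastSyncDate) throws SQLException {
        String sql = "SELECT id, payload, last_modified FROM item WHERE last_modified > ?";
        try (PreparedStatement ps = masterConn.prepareStatement(sql)) {
            ps.setTimestamp(1, lastSyncDate);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // In the real workflow each row would be written to the XML export here.
                    System.out.printf("id=%d modified=%s%n",
                            rs.getLong("id"), rs.getTimestamp("last_modified"));
                }
            }
        }
    }
}
```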

The complexity of the problem rises when dealing with deleted records of the Master table. To catch the deleted records, I think I have to maintain a log table of the previously inserted records and use SQL NOT IN. This becomes a performance problem when dealing with large datasets.

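To make the performance concern concrete, the log-table approach ends up issuing something like the following on the slave side. This is a sketch with hypothetical table names; the NOT IN subquery is exactly the part that degrades on large datasets:

```java
import java.sql.*;

public class SlaveCleanup {

    // Hypothetical cleanup: delete slave rows whose ids no longer appear in the
    // exported master id log. The NOT IN subquery must be evaluated against the
    // whole id set, which is the performance problem described above.
    public static int deleteMissingRows(Connection slaveConn) throws SQLException {
        String sql = "DELETE FROM item_slave WHERE id NOT IN (SELECT id FROM master_id_log)";
        try (Statement st = slaveConn.createStatement()) {
            return st.executeUpdate(sql);
        }
    }
}
```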

What would be an alternative workflow for dealing with this scenario?


Accepted answer by Adam Gent

It sounds like you need a transactional message queue.


How this works is simple. When you update the master db you send a message to the message broker (describing whatever the update was), which can fan out to any number of queues. Each slave db can have its own queue, and because queues preserve order, the process should eventually synchronize correctly (ironically, this is roughly how most RDBMSs do replication internally).


Think of the message queue as a sort of SCM change-list or patch-list database. That is, for the most part the same (or roughly the same) SQL statements sent to the master should eventually be replicated to the other databases. Don't worry about losing messages, as most message queues support durability and transactions.


I recommend you look at spring-amqp and/or spring-integration, especially since you tagged this question with spring-batch.

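As a rough illustration of the producer side with spring-amqp, here is a minimal sketch. It assumes a configured RabbitTemplate and a pre-declared exchange; the exchange name, routing key, and message format are made up for the example, not taken from the answer:

```java
import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class ChangePublisher {

    private final RabbitTemplate rabbitTemplate;

    public ChangePublisher(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    // Publish a description of the change. Each slave db consumes from its own
    // durable queue bound to the exchange, so per-queue ordering is preserved.
    public void publishChange(String changeType, long rowId, String payload) {
        String message = changeType + "|" + rowId + "|" + payload;
        rabbitTemplate.convertAndSend("table-changes", "item", message); // hypothetical names
    }
}
```

On the consuming side, each slave would read its own queue and apply the change inside a transaction, acknowledging the message only on success.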

Based on your comments:


BTW, your concern about NOT IN being a performance problem is not a very strong one, as there are a plethora of workarounds; but given that you don't want to do DB-specific things (like triggers and replication), I still feel a message queue is your best option.


EDIT - Non-MQ route


Since I gave you a tough time about asking this question, I will continue to try to help. Besides the message queue, you can do some sort of XML file like you were trying before. THE CRITICAL FEATURE you need in the schema is a CREATE TIMESTAMP column on your master database, so that you can do the batch processing while the system is up and running (otherwise you will have to stop the system). Now if you go this route, you will want to SELECT * WHERE CREATE_TIME < ?, where the parameter is a time less than the current time. Basically you're only getting the rows as of a snapshot.


Now, on your other database, for the delete you're going to remove rows by joining against an ID table and keeping the non-matches (that is, you can use JOINs instead of the slow NOT IN). Luckily you only need all the ids for the delete and not the other columns. For the other columns you can use a delta based on the update timestamp column (for update, and create, a.k.a. insert).

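A sketch of both statements, assuming a hypothetical item table, a master_ids staging table loaded from the shipped id list, and MySQL-style multi-table DELETE syntax. The LEFT JOIN / IS NULL anti-join is a common JOIN-based replacement for NOT IN:

```java
public final class SyncSql {

    // Snapshot read on the master: only rows created before the batch start time,
    // so the export stays stable while the system keeps running. The ? parameter
    // is the batch start timestamp.
    public static final String SNAPSHOT_SELECT =
            "SELECT * FROM item WHERE create_time < ?";

    // Slave-side delete via an anti-join against the shipped id list instead of
    // a slow NOT IN. master_ids is a hypothetical staging table loaded from the
    // export; the multi-table DELETE form is MySQL-style syntax.
    public static final String ANTI_JOIN_DELETE =
            "DELETE s FROM item_slave s " +
            "LEFT JOIN master_ids m ON m.id = s.id " +
            "WHERE m.id IS NULL";

    private SyncSql() {
    }
}
```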

Answered by user1050755

Why don't you just add a TIMESTAMP column that indicates the last update/insert/delete time? Then add a deleted column, i.e., mark the row as deleted instead of actually deleting it immediately. Delete it for real after the delete action has been exported.

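A sketch of what that soft-delete scheme might look like, with hypothetical table and column names (a deleted flag plus a last_change timestamp):

```java
public final class SoftDeleteSql {

    // Instead of DELETE, flag the row; last_change drives the incremental export.
    public static final String MARK_DELETED =
            "UPDATE item SET deleted = 1, last_change = CURRENT_TIMESTAMP WHERE id = ?";

    // After the export has shipped the delete action, physically remove the rows.
    public static final String PURGE_EXPORTED =
            "DELETE FROM item WHERE deleted = 1 AND last_change <= ?";

    private SoftDeleteSql() {
    }
}
```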

In case you cannot alter how the schema is used in an existing app:


Can't you use triggers at all? How about a second ("hidden") table that gets populated on every insert/update/delete and which would constitute the content of the next-to-be-generated XML export file? That is a common concept: a history (or "log") table. It would have its own progressing id column, which can be used as an export marker.

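A sketch of reading such a log table incrementally, assuming a hypothetical item_log table whose monotonically increasing log_id column serves as the export marker:

```java
import java.sql.*;

public class LogTableExporter {

    // Read everything logged since the last export marker. item_log and its
    // columns are hypothetical; a trigger would populate it on every
    // insert/update/delete of the main table.
    public static long exportSince(Connection conn, long lastExportedLogId) throws SQLException {
        String sql = "SELECT log_id, op, row_id FROM item_log WHERE log_id > ? ORDER BY log_id";
        long maxSeen = lastExportedLogId;
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, lastExportedLogId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    maxSeen = rs.getLong("log_id");
                    // Each row (op is "I", "U" or "D" plus the affected row id)
                    // would be written to the XML export file here.
                }
            }
        }
        return maxSeen; // persist this as the new export marker
    }
}
```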

Answered by Mikhail

Have a look at Oracle GoldenGate:


Oracle GoldenGate is a comprehensive software package for enabling the replication of data in heterogeneous data environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems.


SymmetricDS:


SymmetricDS is open source software for multi-master database replication, filtered synchronization, or transformation across the network in a heterogeneous environment. It supports multiple subscribers with one-directional or bi-directional asynchronous data replication.


Daffodil Replicator:


Daffodil Replicator is a Java tool for data synchronization, data migration, and data backup between various database servers.


Answered by basav

Make a history table for the table that needs to be synchronized (basically a duplicate of that table, perhaps with a few extra fields) and insert the entire row every time something is inserted/updated/deleted in the active table.


Write a Spring Batch job to sync the data to the Slave machine based on the history table's extra fields.

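A minimal sketch of the reading side of such a Spring Batch job, assuming hypothetical table and column names and an inline holder class; the writer that applies the rows to the Slave is omitted:

```java
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;

public class HistoryReaderConfig {

    // Simple holder for one history row; the fields are hypothetical.
    public static class ChangeRow {
        public long rowId;
        public String op; // "I", "U" or "D"
    }

    // Streams unsynced history rows in order; a processor/writer (not shown)
    // would apply each row to the Slave and mark it as synced.
    public JdbcCursorItemReader<ChangeRow> historyReader(DataSource dataSource) {
        return new JdbcCursorItemReaderBuilder<ChangeRow>()
                .name("historyReader")
                .dataSource(dataSource)
                .sql("SELECT row_id, op FROM item_history WHERE synced = 0 ORDER BY history_id")
                .rowMapper((rs, rowNum) -> {
                    ChangeRow row = new ChangeRow();
                    row.rowId = rs.getLong("row_id");
                    row.op = rs.getString("op");
                    return row;
                })
                .build();
    }
}
```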

Hope this helps.


Answered by badpanda

A potential option for allowing deletes within your current workflow:


In the case that the trigger restriction is limited to triggers that reference other databases, a possible solution within your current workflow would be to create a helper table in your Master database that stores only the unique identifiers of the deleted rows (or whatever unique key would enable you to most efficiently delete the deleted rows).


Those ids would need to be inserted by an on-delete trigger on your master table.


Using the same mechanism as your inserts/updates, create a task that runs after your inserts and updates. You could export your helper table to XML, as you noted in your current workflow.


This task would simply delete those rows from the slave table, then delete all data from the helper table once the task completes. Log any errors from the task so that you can troubleshoot, since there is no audit trail.

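A sketch of the slave-side half of that task, assuming a hypothetical item_slave table and that the exported deleted ids have already been parsed from the XML into a list:

```java
import java.sql.*;
import java.util.List;

public class DeleteTask {

    // Apply the exported deleted ids to the slave as a JDBC batch. After this
    // succeeds, the caller would clear the master-side helper table.
    public static void applyDeletes(Connection slaveConn, List<Long> deletedIds) throws SQLException {
        String sql = "DELETE FROM item_slave WHERE id = ?";
        try (PreparedStatement ps = slaveConn.prepareStatement(sql)) {
            for (Long id : deletedIds) {
                ps.setLong(1, id);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```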

Answered by Michail Nikolaev

Very interesting question.


In my case I had enough RAM to load all the ids from the master and slave tables and diff them.


If the ids in the master table are sequential, you may try to maintain a set of fully-filled ranges in the master table (ranges in which every id is used, with no gaps, like 100, 101, 102, 103).


To find removed ids without loading all of them into memory, you may execute a SQL query to count the number of records with id >= full_region.start and id <= full_region.end for each fully-filled region. If the result of the query == (full_region.end - full_region.start) + 1, it means no record in the region has been deleted. Otherwise, split the region into 2 parts and do the same check for both of them (in a lot of cases only one side contains removed records).


Once a range is down to some length (about 5000, I think), it will be faster to load all the present ids and check for absent ones using a Set.


It also makes sense to load all the ids into memory at once for a batch of small (10-20 record) regions.

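A sketch of the range-splitting check in Java; the item table name is a placeholder, and the 5000-row threshold is taken as an assumption from the answer above:

```java
import java.sql.*;
import java.util.List;

public class RangeChecker {

    // Below this range size, loading the actual ids and diffing with a Set is
    // assumed to be cheaper than issuing further COUNT queries.
    private static final long SMALL_RANGE = 5000;

    private final Connection conn;

    public RangeChecker(Connection conn) {
        this.conn = conn;
    }

    // Recursively narrow down which ranges lost records. A range is intact when
    // COUNT(*) == (end - start) + 1; otherwise split it and check both halves.
    public void findRangesWithDeletions(long start, long end, List<long[]> damaged) throws SQLException {
        long expected = (end - start) + 1;
        if (countIds(start, end) == expected) {
            return; // nothing deleted in this range
        }
        if (expected <= SMALL_RANGE) {
            damaged.add(new long[] {start, end}); // small enough: diff ids directly
            return;
        }
        long mid = start + (end - start) / 2;
        findRangesWithDeletions(start, mid, damaged);
        findRangesWithDeletions(mid + 1, end, damaged);
    }

    private long countIds(long start, long end) throws SQLException {
        String sql = "SELECT COUNT(*) FROM item WHERE id >= ? AND id <= ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, start);
            ps.setLong(2, end);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}
```

Intact regions are confirmed with a single COUNT, so the number of queries grows with the number of damaged regions rather than with the table size.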

Answered by user1050755

If your database has a transaction dump log, just ship that one.


It is possible with MySQL and should be possible with PostgreSQL.


Answered by Aaron Saray

I would agree with another comment - this requires the usage of triggers. I think another table should hold the history of your SQL statements. See this answer about using SQL Server 2008 extended events... Then you can get the entire SQL and store the resulting query in the history table. It's up to you whether you want to store it as a MySQL query or an MSSQL query.


Answered by Robert Co

Here's my take. Do you really need to deal with this? I assume that the slave is for reporting purposes. So the question I would ask is: how up to date does it need to be? Is it OK if the data is one day old? Do you plan a nightly refresh?


If so, forget about this online sync process: download the full tables, ship them to the MySQL server, and batch load them. Processing time might be a lot quicker than you think.
