Best incremental load method using SSIS with over 20 million records

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/37197332/
Asked by LTLTLT
What is needed: I need to incrementally load 25 million records from Oracle into SQL Server 2012. The package needs to handle UPDATEs, DELETEs, and NEW RECORDS. The Oracle data source is always changing.
What I have: I've done this many times before, but never with anything past 10 million records. First I have an [Execute SQL Task] that grabs the result set of the [Max Modified Date]. I then have a query that only pulls data from the [ORACLE SOURCE] where the modified date is greater than the [Max Modified Date], and I look that result up against my destination table.
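As a minimal sketch of that watermark step in SQL, assuming hypothetical names (DestinationTable, LAST_MODIFIED_DATE, SOURCE_SCHEMA.SOURCE_TABLE) and that the exact parameter syntax depends on the connection type:

-- Run by the [Execute SQL Task] against the destination to get the high-water mark
SELECT ISNULL(MAX(LAST_MODIFIED_DATE), '1900-01-01') AS MaxModifiedDate
FROM dbo.DestinationTable;

-- Oracle source query filtered by that watermark (bind variable name is assumed)
SELECT ID, CASE_NUMBER, LAST_MODIFIED_DATE
FROM SOURCE_SCHEMA.SOURCE_TABLE
WHERE LAST_MODIFIED_DATE > :MaxModifiedDate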
I have the [ORACLE Source] connecting to the [Lookup-Destination table]. The lookup is set to NO CACHE mode; I get errors if I use partial or full cache mode, I assume because the [ORACLE Source] is always changing. The [Lookup] then connects to a [Conditional Split] where I input an expression like the one below.
(REPLACENULL(ORACLE.ID,"") != REPLACENULL(Lookup.ID,""))
|| (REPLACENULL(ORACLE.CASE_NUMBER,"") != REPLACENULL(Lookup.CASE_NUMBER,""))
I then send the rows that the [Conditional Split] outputs into a staging table. I then add an [Execute SQL Task] and perform an UPDATE on the DESTINATION-TABLE with the query below:
UPDATE SD
SET SD.CASE_NUMBER = UP.CASE_NUMBER,
    SD.ID = UP.ID
FROM Destination SD
JOIN STAGING.TABLE UP
    ON UP.ID = SD.ID
Problem: This becomes very slow, takes a very long time, and just keeps running. How can I improve the run time and get it to work? Should I use a cache transformation? Should I use a MERGE statement instead?
How would I use the REPLACENULL expression in the conditional split when the column is a date column? Would I use something like:
(REPLACENULL(ORACLE.LAST_MODIFIED_DATE,"01-01-1900 00:00:00.000")
 != REPLACENULL(Lookup.LAST_MODIFIED_DATE,"01-01-1900 00:00:00.000"))
(Screenshots of the package were attached to the original question but are not reproduced here.)
Answered by Nick.McDermaid
A pattern that is usually faster for larger datasets is to load the source data into a local staging table, then use a query like the one below to identify the new records:
SELECT column1, column2
FROM StagingTable SRC
WHERE NOT EXISTS (
    SELECT * FROM TargetTable TGT
    WHERE TGT.MatchKey = SRC.MatchKey
)
Then you just feed that dataset into an insert:
INSERT INTO TargetTable (column1, column2)
SELECT column1, column2
FROM StagingTable SRC
WHERE NOT EXISTS (
    SELECT * FROM TargetTable TGT
    WHERE TGT.MatchKey = SRC.MatchKey
)
Updates look like this:
UPDATE TGT
SET
    column1 = SRC.column1,
    column2 = SRC.column2,
    DTUpdated = GETDATE()
FROM TargetTable TGT
JOIN StagingTable SRC
    ON TGT.MatchKey = SRC.MatchKey
Note the additional column DTUpdated. You should always have a 'last updated' column in your table to help with auditing and debugging.
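If such a column does not exist yet, a hedged illustration of adding it (the table and constraint names are assumptions):

-- Add a 'last updated' audit column with a default (assumed table/constraint names)
ALTER TABLE TargetTable
ADD DTUpdated DATETIME NOT NULL
    CONSTRAINT DF_TargetTable_DTUpdated DEFAULT (GETDATE());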
This is an INSERT/UPDATE approach. There are other data load approaches, such as windowing (pick a trailing window of data to be fully deleted and reloaded), but the right approach depends on how your system works and whether you can make assumptions about the data (e.g. posted data in the source will never be changed).
You can squash the separate INSERT and UPDATE statements into a single MERGE statement, although it gets pretty huge; I've had performance issues with it, and there are other documented issues with MERGE.
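For reference, a minimal sketch of what that combined MERGE might look like, using the same assumed table and column names as the queries above:

-- Hedged MERGE sketch; table/column names follow the examples above
MERGE TargetTable AS TGT
USING StagingTable AS SRC
    ON TGT.MatchKey = SRC.MatchKey
WHEN MATCHED THEN
    UPDATE SET TGT.column1 = SRC.column1,
               TGT.column2 = SRC.column2,
               TGT.DTUpdated = GETDATE()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (column1, column2, DTUpdated)
    VALUES (SRC.column1, SRC.column2, GETDATE());

Whether this outperforms the separate INSERT and UPDATE statements is workload-dependent, as noted above.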
Answered by Patrick Tucci
Unfortunately, there's not a good way to do what you're trying to do. SSIS has some controls and documented ways to do this, but as you have found, they don't work as well when you start dealing with large amounts of data.
At a previous job, we had something similar that we needed to do. We needed to update medical claims from a source system to another system, similar to your setup. For a very long time, we just truncated everything in the destination and rebuilt it every night. I think we were doing this daily with more than 25M rows. If you're able to transfer all the rows from Oracle to SQL in a decent amount of time, then truncating and reloading may be an option.
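A hedged T-SQL illustration of that truncate-and-reload pattern, assuming the Oracle rows are first landed in a local staging table (the table and column names here are made up for the example):

-- Truncate-and-reload: wipe the destination, then bulk-copy from staging
TRUNCATE TABLE dbo.DestinationClaims;

INSERT INTO dbo.DestinationClaims WITH (TABLOCK) (ClaimNumber, ChangeID, ClaimData)
SELECT ClaimNumber, ChangeID, ClaimData
FROM dbo.StagingClaims;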
We eventually had to get away from this as our volumes grew, however. We tried to do something along the lines of what you're attempting, but never got anything we were satisfied with. We ended up with a somewhat unconventional process. First, each medical claim had a unique numeric identifier. Second, whenever a medical claim was updated in the source system, an incremental ID on the individual claim was also incremented.
Step one of our process was to bring over any new medical claims, or claims that had changed. We could determine this quite easily, since the unique ID and the "change ID" column were both indexed in source and destination. These records would be inserted directly into the destination table. The second step was our "deletes", which we handled with a logical flag on the records. For actual deletes, where records existed in the destination but were no longer in the source, I believe it was actually fastest to select the DISTINCT claim numbers from the source system and place them in a temporary table on the SQL side. Then we simply did a LEFT JOIN update to set the missing claims to logically deleted. We did something similar with our updates: if a newer version of the claim was brought over by our original Lookup, we would logically delete the old one. Every so often we would clean up the logical deletes and actually delete them, but since the logical delete indicator was indexed, this didn't need to be done too frequently. We never saw much of a performance hit, even when the logically deleted records numbered in the tens of millions.
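A hedged sketch of that logical-delete step; the table and column names (StagingClaims, DestinationClaims, ClaimNumber, IsDeleted) are assumptions made for the example:

-- Collect the DISTINCT claim numbers currently present in the source extract
SELECT DISTINCT ClaimNumber
INTO #SourceClaims
FROM dbo.StagingClaims;

-- LEFT JOIN update: flag destination rows that no longer exist in the source
UPDATE TGT
SET IsDeleted = 1
FROM dbo.DestinationClaims TGT
LEFT JOIN #SourceClaims SRC
    ON TGT.ClaimNumber = SRC.ClaimNumber
WHERE SRC.ClaimNumber IS NULL
  AND TGT.IsDeleted = 0;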
This process was always evolving as our server loads and data source volumes changed, and I suspect the same may be true for your process. Because every system and setup is different, some of the things that worked well for us may not work for you, and vice versa. I know our data center was relatively good and we were on some stupid fast flash storage, so truncating and reloading worked for us for a very, very long time. This may not be true on conventional storage, where your data interconnects are not as fast, or where your servers are not colocated.
When designing your process, keep in mind that deletes are one of the more expensive operations you can perform, followed by updates and by non-bulk inserts, respectively.
Answered by sandeep rawat
I am assuming you have some identity-like (PK) column in your Oracle table.

1. Get the max identity (business key) from the destination database (the SQL Server one).

2. Create two data flows (a sketch of both source queries appears after this list):

a) Pull only the data > max identity from Oracle and put it into the destination directly (as these are new records).

b) Get all records < max identity with an update date > the last load, and put them into a temp (staging) table (as this is updated data).

3. Update the destination table with the records from the temp table (created in step b).
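A hedged sketch of the two source queries described in steps 2a and 2b; the table and column names (SOURCE_TABLE, ID, CASE_NUMBER, LAST_MODIFIED_DATE) and the parameter placeholders are assumptions, and the exact parameter syntax depends on the connection type used in SSIS:

-- 2a: new records (ID greater than the max identity already in the destination),
--     fed straight into the destination table
SELECT ID, CASE_NUMBER, LAST_MODIFIED_DATE
FROM SOURCE_TABLE
WHERE ID > ?                     -- max identity from step 1

-- 2b: existing records changed since the last load, landed in the staging table
SELECT ID, CASE_NUMBER, LAST_MODIFIED_DATE
FROM SOURCE_TABLE
WHERE ID <= ?                    -- max identity from step 1
  AND LAST_MODIFIED_DATE > ?     -- last load date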