oracle 如何加快对具有有效日期数据的庞大数据仓库表的查询?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/3113448/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How can I speed up queries against huge data warehouse tables with effective-dated data?
提问by aw crud
So I am querying some extremely large tables. The reason they are so large is because PeopleSoft inserts new records every time a change is made to some data, rather than updating existing records. In effect, its transactional tables are also a data warehouse.
所以我正在查询一些非常大的表。它们如此之大的原因是 PeopleSoft 每次对某些数据进行更改时都会插入新记录,而不是更新现有记录。实际上,它的事务表也是一个数据仓库。
This necessitates queries that have nested selects in them, to get the most recent/current row. They are both effective dated and within each date (cast to a day) they can have an effective sequence. Thus, in order to get the current record for user_id=123
, I have to do this:
这需要在其中嵌套选择的查询,以获取最新/当前行。它们都有生效日期,并且在每个日期内(投射到一天),它们可以有一个有效的序列。因此,为了获得 的当前记录user_id=123
,我必须这样做:
select * from sometable st
where st.user_id = 123
and st.effective_date = (select max(sti.effective_date)
from sometable sti where sti.user_id = st.user_id)
and st.effective_sequence = (select max(sti.effective_sequence)
from sometable sti where sti.user_id = st.user_id
and sti.effective_date = st.effective_date)
There are a phenomenal number of indexes on these tables, and I can't find anything else that would speed up my queries.
这些表上有数量惊人的索引,我找不到其他可以加快查询速度的东西。
My trouble is that I often times want to get data about an individual from these tables for maybe 50 user_ids, but when I join my tables having only a few records in them with a few of these PeopleSoft tables, things just go to crap.
我的问题是,我经常想从这些表中获取有关个人的数据,可能有 50 个 user_id,但是当我将其中只有几条记录的表与这些 PeopleSoft 表中的一些表连接起来时,事情就变得糟糕了。
The PeopleSoft tables are on a remote database that I access through a database link. My queries tend to look like this:
PeopleSoft 表位于我通过数据库链接访问的远程数据库上。我的查询往往是这样的:
select st.* from local_table lt, sometable@remotedb st
where lt.user_id in ('123', '456', '789')
and lt.user_id = st.user_id
and st.effective_date = (select max(sti.effective_date)
from sometable@remotedb sti where sti.user_id = st.user_id)
and st.effective_sequence = (select max(sti.effective_sequence)
from sometable@remotedb sti where sti.user_id = st.user_id
and sti.effective_date = st.effective_date)
Things get even worse when I have to join several PeopleSoft tables with my local table. Performance is just unacceptable.
当我必须将多个 PeopleSoft 表与本地表连接时,情况会变得更糟。性能是不可接受的。
What are some things I can do to improve performance? I've tried query hints to ensure that my local table is joined to its partner in PeopleSoft first, so it doesn't attempt to join all its tables together before narrowing it down to the correct user_id. I've tried the LEADING
hint and toyed around with hints that tried to push the processing to the remote database, but the explain plan was obscured and just said 'REMOTE' for several of the operations and I had no idea what was going on.
我可以做些什么来提高性能?我已经尝试了查询提示以确保我的本地表首先连接到它在 PeopleSoft 中的合作伙伴,因此在将其缩小到正确的 user_id 之前,它不会尝试将所有表连接在一起。我已经尝试过LEADING
提示并尝试将处理推送到远程数据库的提示,但是解释计划被模糊了,并且只对几个操作说“远程”,我不知道发生了什么。
Assuming I don't have the power to change PeopleSoft and the location of my tables, are hints my best choice? If I was joining a local table with four remote tables, and the local table joined with two of them, how would I format the hint so that my local table (which is very small -- in fact, I can just do an inline view to have my local table only be the user_ids I'm interested in) is joined first with each of the remote ones?
假设我无权更改 PeopleSoft 和我的表的位置,提示是我的最佳选择吗?如果我加入了一个带有四个远程表的本地表,而本地表又加入了其中的两个,我将如何格式化提示以便我的本地表(它非常小——事实上,我可以做一个内联视图让我的本地表只是我感兴趣的 user_ids) 首先与每个远程表连接?
EDIT:The application needs real-time data so unfortunately a materialized view or other method of caching data will not suffice.
编辑:应用程序需要实时数据,因此不幸的是,物化视图或其他缓存数据的方法是不够的。
采纳答案by DCookie
Does refactoring your query something like this help at all?
像这样重构您的查询有帮助吗?
SELECT *
FROM (SELECT st.*, MAX(st.effective_date) OVER (PARTITION BY st.user_id) max_dt,
MAX(st.effective_sequence) OVER (PARTITION BY st.user_id, st.effective_date) max_seq
FROM local_table lt JOIN sometable@remotedb st ON (lt.user_id = st.user_id)
WHERE lt.user_id in ('123', '456', '789'))
WHERE effective_date = max_dt
AND effective_seq = max_seq;
I agree with @Mark Baker that performance joining over DB Links really can suck and you're likely to be limited in what you can accomplish with this approach.
我同意@Mark Baker 的观点,即通过 DB Links 加入的性能确实很糟糕,而且您可能会限制使用这种方法可以完成的工作。
回答by Gary Myers
One approach would be to stick PL/SQL functions around everything. As an example
一种方法是将 PL/SQL 函数放在所有东西上。举个例子
create table remote (user_id number, eff_date date, eff_seq number, value varchar2(10));
create type typ_remote as object (user_id number, eff_date date, eff_seq number, value varchar2(10));
.
/
create type typ_tab_remote as table of typ_remote;
.
/
insert into remote values (1, date '2010-01-02', 1, 'a');
insert into remote values (1, date '2010-01-02', 2, 'b');
insert into remote values (1, date '2010-01-02', 3, 'c');
insert into remote values (1, date '2010-01-03', 1, 'd');
insert into remote values (1, date '2010-01-03', 2, 'e');
insert into remote values (1, date '2010-01-03', 3, 'f');
insert into remote values (2, date '2010-01-02', 1, 'a');
insert into remote values (2, date '2010-01-02', 2, 'b');
insert into remote values (2, date '2010-01-03', 1, 'd');
create function show_remote (i_user_id_1 in number, i_user_id_2 in number) return typ_tab_remote pipelined is
CURSOR c_1 is
SELECT user_id, eff_date, eff_seq, value
FROM
(select user_id, eff_date, eff_seq, value,
rank() over (partition by user_id order by eff_date desc, eff_seq desc) rnk
from remote
where user_id in (i_user_id_1,i_user_id_2))
WHERE rnk = 1;
begin
for c_rec in c_1 loop
pipe row (typ_remote(c_rec.user_id, c_rec.eff_date, c_rec.eff_seq, c_rec.value));
end loop;
return;
end;
/
select * from table(show_remote(1,null));
select * from table(show_remote(1,2));
Rather than having user_id's passed individually as parameters, you could load them into a local table (eg a global temporary table). The PL/SQL would loop then through the table, doing the remote select for each row in the local table. No single query would have both local and remote tables. Effectively you would be writing your own join code.
您可以将 user_id 作为参数单独传递,而不是将它们加载到本地表(例如全局临时表)中。PL/SQL 将循环遍历表,对本地表中的每一行执行远程选择。没有一个查询会同时拥有本地表和远程表。实际上,您将编写自己的连接代码。
回答by josephj1989
One option is to first materialize the remote part of the query using a common table expression so you can be sure only relevantt data is fetched from remote db.Another improvement would be to merge the 2 subqueries against the remote db into one analytical function based subquery.Such a query can be used in your current query also. I can make other suggestions only after playing with the db.
一种选择是首先使用公共表表达式来具体化查询的远程部分,这样您就可以确保只从远程数据库中获取相关数据。另一种改进是将针对远程数据库的 2 个子查询合并到一个基于分析函数的子查询中.此类查询也可用于您当前的查询。只有在玩过数据库后,我才能提出其他建议。
see below
见下文
with remote_query as
(
select /*+ materialize */ st.* from sometable@remotedb st
where st.user_id in ('123', '456', '789')
and st.rowid in( select first_value(rowid) over (order by effective_date desc,
effective_sequence desc ) from sometable@remotedb st1
where st.user_id=st1.user_id)
)
select lt.*,st.*
FROM local_table st,remote_query rt
where st.user_id=rt.user_id
回答by dpbradley
You haven't mentioned the requirements for the freshness of the data, but one option would be to create materialized views (you'll be restricted to REFRESH COMPLETE since you can't create snapshot logs in the source system) that have data only for the current versioned row of the transaction tables. These materialized view tables will reside in your local system and additional indexing can be added to them to improve query performance.
您还没有提到对数据新鲜度的要求,但一种选择是创建物化视图(您将被限制为 REFRESH COMPLETE,因为您无法在源系统中创建快照日志),其数据仅用于事务表的当前版本行。这些物化视图表将驻留在您的本地系统中,并且可以向它们添加额外的索引以提高查询性能。
回答by Mark Baker
The performance issue is going to be the access across the link. With part of the query against local tables, it's all being executed locally so no access to the remote indexes and it's pulling all the remote data back to test lkocally.
性能问题将是跨链接的访问。对于针对本地表的部分查询,它全部在本地执行,因此无法访问远程索引,并且将所有远程数据拉回以进行本地测试。
If you could use materialized views in a local database refreshed from the peoplesoft database on a periodic (nightly) basis for the historic data, only accessing the remote peoplesoft database for today's changes (adding a effective_date = today to your where clause) and merging the two queries.
如果您可以在定期(每晚)的历史数据基础上使用从 peoplesoft 数据库刷新的本地数据库中的物化视图,则仅访问远程 peoplesoft 数据库以获取今天的更改(在您的 where 子句中添加一个 Effective_date = today)并合并两个查询。
Another option might be to use an INSERT INTO X SELECT FROM just for the remote data to pull it into a temporary local table or materialized view, then a second query to join that with your local data... similar to josephj1989's suggestion
另一种选择可能是仅对远程数据使用 INSERT INTO X SELECT FROM 将其拉入临时本地表或物化视图,然后第二个查询将其与本地数据连接...类似于 josephj1989 的建议
Alternatively (though there may be licensing issues) try RAC Clustering your local db with the remote peoplesoft db.
或者(尽管可能存在许可问题)尝试使用远程 peoplesoft 数据库对本地数据库进行 RAC 集群。
回答by Tom H
Instead of using the subqueries, you can try this. I don't know if Oracle will perform better with this or not, since I don't use Oracle much.
你可以试试这个,而不是使用子查询。我不知道 Oracle 是否会在这方面表现得更好,因为我很少使用 Oracle。
SELECT
ST1.col1,
ST1.col2,
...
FROM
Some_Table ST1
LEFT OUTER JOIN Some_Table ST2 ON
ST2.user_id = ST1.user_id AND
(
ST2.effective_date > ST1.effective_date OR
(
ST2.effective_date = ST1.effective_date AND
ST2.effective_sequence > ST1.effective_sequence
)
)
WHERE
ST2.user_id IS NULL
Another possible solution would be:
另一种可能的解决方案是:
SELECT
ST1.col1,
ST1.col2,
...
FROM
Some_Table ST1
WHERE
NOT EXISTS
(
SELECT
FROM
Some_Table ST2
WHERE
ST2.user_id = ST1.user_id AND
(
ST2.effective_date > ST1.effective_date OR
(
ST2.effective_date = ST1.effective_date AND
ST2.effective_sequence > ST1.effective_sequence
)
)
)
回答by Abe Miessler
Would it be an option to create a database that you use for non-warehousing type stuff that you could update on a nightly basis? If it is you could create a nightly process that will move over only the most recent records. That would get rid of the MAX stuff you are doing for every day queries and significantly reduce the number or records.
是否可以选择创建一个用于每晚更新的非仓储类型资料的数据库?如果是,您可以创建一个夜间进程,只移动最近的记录。这将摆脱您为每天查询所做的 MAX 工作,并显着减少数量或记录。
Also, depends on whether you can have a 1 day lapse between the most recent data and what is available.
此外,取决于您是否可以在最新数据和可用数据之间间隔 1 天。
I'm not super familiar with Oracle so there may be a way to get improvements by making changes to your query also...
我对 Oracle 不是很熟悉,所以可能有一种方法可以通过对查询进行更改来获得改进......
回答by Frank R.
Can you ETL the rows with the desired user_id's into your own table, creating only the needed indexes to support your queries and perform your queries on it?
您能否将具有所需 user_id 的行 ETL 放入您自己的表中,仅创建所需的索引来支持您的查询并对其执行查询?
回答by Chip L
Is the PeopleSoft table a delivered one, or is it custom? Are you sure it's a physical table, and not a poorly-written view on the PS side? If it's a delivered record you're going against (example looks much like PS_JOB or a view that references it), maybe you could indicate this. PS_JOB is a beast with tons of indexes delivered, and most sites add even more.
PeopleSoft 表是交付的还是定制的?你确定它是一个物理表,而不是 PS 方面写得不好的视图?如果它是您反对的交付记录(示例看起来很像 PS_JOB 或引用它的视图),也许您可以指出这一点。PS_JOB 是一个交付了大量索引的野兽,大多数站点添加了更多。
If you know the indexes on the table, you can use Oracle hints to specify a preferred index to use; that sometimes helps.
如果您知道表上的索引,则可以使用 Oracle 提示指定要使用的首选索引;这有时会有所帮助。
Have you done an explain plan to see if you can determine where the problem is? Maybe there's a cartesian join, full table scan, etc.?
你有没有做一个解释计划,看看你是否可以确定问题出在哪里?也许有笛卡尔连接、全表扫描等?
回答by Damir Sudarevic
It looks to me that you are dealing with a type 2 dimension in the data warehouse. There are several ways how to implement type 2 dimension, mostly having columns like ValidFrom, ValidTo, Version, Status
. Not all of them are always present, it would be interesting if you could post the schema for your table. Here is an example of how it may look like (John Smith moved from Indiana to Ohio on 2010-06-24)
在我看来,您正在处理数据仓库中的类型 2 维。有几种方法可以实现类型 2 维度,主要是像ValidFrom, ValidTo, Version, Status
. 并非所有这些都始终存在,如果您可以发布表的架构会很有趣。这是它的外观示例(约翰史密斯于 2010 年 6 月 24 日从印第安纳州搬到俄亥俄州)
UserKey UserBusinessKey State ValidFrom ValidTo Version Status
7234 John_Smith_17 Indiana 2005-03-20 2010-06-23 1 expired
9116 John_Smith_17 Ohio 2010-06-24 3000-01-01 2 current
To obtain the latest version of a row, it is common to use
要获取一行的最新版本,通常使用
WHERE Status = 'current'
or
或者
WHERE ValidTo = '3000-01-01'
Note that this one has some constant far in the future.
请注意,这个在未来有一些常数。
or
或者
WHERE ValidTo > CURRENT_DATE
Seems that your example uses ValidFrom
(effective_date), so you are forced to find max()
in order to locate the latest row. Take a look at the schema -- is there Status or ValidTo
equivalents in your tables?
似乎您的示例使用了 ValidFrom
(effective_date),因此您必须进行查找max()
才能找到最新的行。看看模式——Status or ValidTo
你的表中有等价物吗?