在 oracle 中更快地“从 table1 中选择不同的 thing_id,thing_name”
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/936328/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Faster 'select distinct thing_id,thing_name from table1' in oracle
提问by Don Branson
I have this query:
我有这个查询:
select distinct id,name from table1
For a given ID, the name will always be the same. Both fields are indexed. There's no separate table that maps the id to the name. The table is very large (10 of millions of rows), so the query could take some time.
对于给定的 ID,名称将始终相同。这两个字段都已编入索引。没有单独的表将 id 映射到名称。该表非常大(数百万行中的 10 行),因此查询可能需要一些时间。
This query is very fast, since it's indexed:
此查询非常快,因为它已编入索引:
select distinct name from table1
Likewise for this query:
同样对于这个查询:
select distinct id from table1
Assuming I can't get the database structure changed (a very safe assumption) what's a better way to structure the first query for performance?
假设我无法更改数据库结构(一个非常安全的假设),构建第一个查询以提高性能的更好方法是什么?
Edit to add a sanitized desc of the table:
编辑以添加表格的消毒描述:
Name Null Type ------------------------------ -------- ---------------------------- KEY NOT NULL NUMBER COL1 NOT NULL NUMBER COL2 NOT NULL VARCHAR2(4000 CHAR) COL3 VARCHAR2(1000 CHAR) COL4 VARCHAR2(4000 CHAR) COL5 VARCHAR2(60 CHAR) COL6 VARCHAR2(150 CHAR) COL7 VARCHAR2(50 CHAR) COL8 VARCHAR2(3 CHAR) COL9 VARCHAR2(3 CHAR) COLA VARCHAR2(50 CHAR) COLB NOT NULL DATE COLC NOT NULL DATE COLD NOT NULL VARCHAR2(1 CHAR) COLE NOT NULL NUMBER COLF NOT NULL NUMBER COLG VARCHAR2(600 CHAR) ID NUMBER NAME VARCHAR2(50 CHAR) COLH VARCHAR2(3 CHAR) 20 rows selected
回答by spencer7593
[LATEST EDIT]
[最新编辑]
My ORIGINAL ANSWERregarding creating the appropriate index on (name,id) to replace the index on (name) is below. (That wasn't an answer to the original question, which disallowed any database changes.)
我关于在 (name,id) 上创建适当的索引以替换 (name) 上的索引的原始答案如下。(这不是对原始问题的回答,它不允许任何数据库更改。)
Here are statements that I have notyet tested. There's probably some obvious reason these won't work. I'd never actually suggestwriting statements like this (at the risk of being drummed thoroughly for such ridiculous suggestion.)
以下是声明,我不是还没有进行测试。可能有一些明显的原因这些不起作用。我从来没有真正建议过写这样的陈述(冒着被这种荒谬的建议彻底鼓吹的风险。)
If these queries even return result sets, the ressult set will only resemble the result set from the OP query, almost by accident, taking advantage of a quirky guaranteeabout the data that Don has provided us. This statement is NOT equivalent to the original SQL, these statements are designed for the special caseas described by Don.
如果这些查询甚至返回结果集,结果集只会类似于来自 OP 查询的结果集,这几乎是偶然的,利用了Don 提供给我们的数据的古怪保证。此语句不等同于原始 SQL,这些语句是为Don 描述的特殊情况而设计的。
select m1.id
, m2.name
from (select min(t1.rowid) as min_rowid
, t1.id
from table1 t1
where t1.id is not null
group by t1.id
) m1
, (select min(t2.rowid) as min_rowid
, t2.name from table1 t2
where t2.name is not null
group by t2.name
) m2
where m1.min_rowid = m2.min_rowid
order
by m1.id
Let's unpack that:
让我们解压一下:
- m1is an inline view that gets us a list of distinct id values.
- m2is an inline view that gets us a list of distinct name values.
- materialize the views m1and m2
- match the ROWID from m1and m2to match
id
withname
- m1是一个内联视图,它为我们提供了不同 id 值的列表。
- m2是一个内联视图,它为我们提供了不同名称值的列表。
- 实现视图m1和m2
- 从匹配ROWID M1和M2,以匹配
id
与name
Someone else suggested the idea of an index merge. I had previously dismissed that idea, an optimizer plan to match 10s of millions of rowids without eliminating any of them.
其他人提出了索引合并的想法。我之前驳回了这个想法,一个优化器计划匹配数百万个 rowids 而不消除任何一个。
With sufficiently low cardinality for id and name, and with the right optimizer plan:
使用足够低的 id 和 name 基数,以及正确的优化器计划:
select m1.id
, ( select m2.name
from table1 m2
where m2.id = m1.id
and rownum = 1
) as name
from (select t1.id
from table1 t1
where t1.id is not null
group by t1.id
) m1
order
by m1.id
Let's unpack that
让我们打开它
- m1is an inline view that gets us a list of distinct id values.
- materialize the view m1
- for each row in m1, query table1 to get the name value from a single row (stopkey)
- m1是一个内联视图,它为我们提供了不同 id 值的列表。
- 物化视图m1
- 对于m1 中的每一行,查询 table1 以从单行中获取名称值(停止键)
IMPORTANT NOTE
重要的提示
These statements are FUNDAMENTALLY different that the OP query. They are designed to return a DIFFERENT result set than the OP query. The happento return the desired result set because of a quirky guarantee about the data. Don has told us that a name
is determined by id
. (Is the converse true? Is id
determined by name
? Do we have a STATED GUARANTEE, not necessarily enforced by the database, but a guarantee that we can take advantage of?) For any ID
value, every row with that ID
value will have the same NAME
value. (And we are also guaranteed the converse is true, that for any NAME
value, every row with that NAME
value will have the same ID
value?)
这些语句与 OP 查询完全不同。它们旨在返回与 OP 查询不同的结果集。由于对数据的古怪保证,碰巧返回了所需的结果集。Don 告诉我们 aname
由 决定id
。(反过来是真的吗?id
由决定name
吗?我们是否有 STATED GUARANTEE,不一定由数据库强制执行,但保证我们可以利用?)对于任何ID
值,具有该ID
值的每一行都将具有相同的NAME
值。(而且我们还保证反过来是正确的,对于任何NAME
值,具有该NAME
值的每一行都将具有相同的ID
值?)
If so, maybe we can make use of that information. If ID
and NAME
appear in distinct pairs, we only need to find one particular row. The "pair" is going to have a matching ROWID, which conveniently happens to be available from each of the existing indexes. What if we get the minimum ROWID for each ID
, and get the minimum ROWID for each NAME
. Couldn't we then match the ID
to the NAME
based on the ROWID that contains the pair? I think it might work, given a low enough cardinality. (That is, if we're dealing with only hundreds of ROWIDs rather than 10s of millions.)
如果是这样,也许我们可以利用这些信息。如果ID
和NAME
出现在不同的对中,我们只需要找到一个特定的行。“对”将有一个匹配的 ROWID,它恰好可以从每个现有索引中获得。如果我们得到每个 的最小 ROWID ID
,并得到每个的最小 ROWID 呢NAME
?我们能不能再搭配ID
到NAME
基于包含一对ROWID?如果基数足够低,我认为它可能会起作用。(也就是说,如果我们只处理数百个 ROWID 而不是数百万个。)
[/LATEST EDIT]
[/最新编辑]
[EDIT]
[编辑]
The question is now updated with information concerning the table, it shows that the ID
column and the NAME
column both allow for NULL values. If Don can live without any NULLs returned in the result set, then adding the IS NOT NULL predicate on both of those columns may enable an index to be used. (NOTE: in an Oracle (B-Tree) index, NULL values do NOT appear in the index.)
问题现在更新了有关表的信息,它显示ID
列和NAME
列都允许 NULL 值。如果 Don 可以在结果集中没有返回任何 NULL 的情况下生存,那么在这两个列上添加 IS NOT NULL 谓词可以启用索引。(注意:在 Oracle (B-Tree) 索引中,NULL 值不会出现在索引中。)
[/EDIT]
[/编辑]
ORIGINAL ANSWER:
原始答案:
create an appropriate index
创建合适的索引
create index table1_ix3 on table_1 (name,id) ... ;
Okay, that's notthe answer to the question you asked, but it's the right answer to fixing the performance problem. (You specified no changes to the database, but in this case, changing the database is the right answer.)
好的,这不是您提出的问题的答案,但它是解决性能问题的正确答案。(您指定不对数据库进行任何更改,但在这种情况下,更改数据库是正确的答案。)
Note that if you have an index defined on (name,id)
, then you (very likely) don't need an index on (name)
, sine the optimizer will consider the leading name
column in the other index.
请注意,如果您在 上定义了索引(name,id)
,那么您(很可能)不需要在 上的索引(name)
,因为优化器会考虑name
其他索引中的前导列。
(UPDATE: as someone more astute than I pointed out, I hadn't even considered the possibility that the existing indexes were bitmap indexes and not B-tree indexes...)
(更新:作为比我指出的更精明的人,我什至没有考虑现有索引是位图索引而不是 B 树索引的可能性......)
Re-evaluate your need for the result set... do you need to return id
, or would returning name
be sufficient.
重新评估您对结果集的需求...您是否需要返回id
,或者返回name
就足够了。
select distinct name from table1 order by name;
For a particular name, you could submit a second query to get the associated id
, if and when you needed it...
对于特定名称,您可以提交第二个查询以获取关联的id
,如果需要的话...
select id from table1 where name = :b1 and rownum = 1;
If you you really needthe specified result set, you can try some alternatives to see if the performance is any better. I don't hold out much hope for any of these:
如果你真的需要指定的结果集,你可以尝试一些替代方案,看看性能是否更好。我对以下任何一个都不抱太大希望:
select /*+ FIRST_ROWS */ DISTINCT id, name from table1 order by id;
or
或者
select /*+ FIRST_ROWS */ id, name from table1 group by id, name order by name;
or
或者
select /*+ INDEX(table1) */ id, min(name) from table1 group by id order by id;
UPDATE: as others have astutely pointed out, with this approach we're testing and comparing performance of alternative queries, which is a sort of hit or miss approach. (I don't agree that it's random, but I would agree that it's hit or miss.)
更新:正如其他人敏锐地指出的那样,使用这种方法,我们正在测试和比较替代查询的性能,这是一种命中或未命中的方法。(我不同意它是随机的,但我同意它是命中还是未命中。)
UPDATE: tom suggests the ALL_ROWS hint. I hadn't considered that, because I was really focused on getting a query plan using an INDEX. I suspect the OP query is doing a full table scan, and it's probably not the scan that's taking the time, it's the sort unique operation (<10g) or hash operation (10gR2+) that takes the time. (Absent timed statistics and event 10046 trace, I'm just guessing here.) But then again, maybe it is the scan, who knows, the high water mark on the table could be way out in a vast expanse of empty blocks.
更新:汤姆建议使用 ALL_ROWS 提示。我没有考虑过这一点,因为我真正专注于使用 INDEX 获取查询计划。我怀疑 OP 查询正在执行全表扫描,这可能不是扫描需要时间,而是排序唯一操作 (<10g) 或哈希操作 (10gR2+) 需要时间。(没有定时统计数据和事件 10046 跟踪,我只是在这里猜测。)但话又说回来,也许是扫描,谁知道呢,桌子上的高水位可能会在广阔的空块中消失。
It almost goes without saying that the statistics on the table should be up-to-date, and we should be using SQL*Plus AUTOTRACE, or at least EXPLAIN PLAN to look at the query plans.
几乎不用说,表上的统计信息应该是最新的,我们应该使用 SQL*Plus AUTOTRACE,或者至少是 EXPLAIN PLAN 来查看查询计划。
But none of the suggested alternative queries really address the performance issue.
但是建议的替代查询都没有真正解决性能问题。
It's possible that hints will influence the optimizer to chooze a different plan, basically satisfying the ORDER BY from an index, but I'm not holding out much hope for that. (I don't think the FIRST_ROWS hint works with GROUP BY, the INDEX hint may.) I can see the potential for such an approach in a scenario where there's gobs of data blocks that are empty and sparsely populated, and ny accessing the data blocks via an index, it could actually be significantly fewer data blocks pulled into memory... but that scenario would be the exception rather than the norm.
提示可能会影响优化器选择不同的计划,基本上满足来自索引的 ORDER BY,但我对此并不抱太大希望。(我认为 FIRST_ROWS 提示不适用于 GROUP BY,INDEX 提示可能。)我可以看到这种方法在存在大量空且稀疏填充的数据块并且无法访问数据的情况下的潜力块,它实际上可能会显着减少被拉入内存的数据块......但这种情况将是例外而不是常态。
UPDATE: As Rob van Wijk points out, making use of the Oracle trace facility is the most effective approach to identifying and resolving performance issues.
更新:正如 Rob van Wijk 指出的那样,利用 Oracle 跟踪工具是识别和解决性能问题的最有效方法。
Without the output of an EXPLAIN PLAN or SQL*Plus AUTOTRACE output, I'm just guessing here.
如果没有 EXPLAIN PLAN 或 SQL*Plus AUTOTRACE 输出的输出,我只是在这里猜测。
I suspect the performance problem you have right now is that the table data blocks have to be referenced to get the specified result set.
我怀疑您现在遇到的性能问题是必须引用表数据块才能获得指定的结果集。
There's no getting around it, the query can not be satisfied from just an index, since there isn't an index that contains both the NAME
and ID
columns, with either the ID
or NAME
column as the leading column. The other two "fast" OP queries can be satisfied from index without need reference the row (data blocks).
无法绕过它,不能仅从索引中满足查询,因为不存在同时包含NAME
和ID
列的索引,其中ID
orNAME
列作为前导列。其他两个“快速”OP 查询可以通过索引满足,而无需引用行(数据块)。
Even if the optimizer plan for the query was to use one of the indexes, it still has to retrieve the associated row from the data block, in order to get the value for the other column. And with no predicate (no WHERE clause), the optimizer is likely opting for a full table scan, and likely doing a sort operation (<10g). (Again, an EXPLAIN PLAN would show the optimizer plan, as would AUTOTRACE.)
即使查询的优化程序计划使用其中一个索引,它仍然必须从数据块中检索关联的行,以获取另一列的值。并且没有谓词(没有 WHERE 子句),优化器可能会选择全表扫描,并可能执行排序操作(<10g)。(同样,EXPLAIN PLAN 将显示优化器计划,AUTOTRACE 也是如此。)
I'm also assuming here (big assumption) that both columns are defined as NOT NULL.
我还在这里假设(大假设)两列都定义为 NOT NULL。
You might also consider defining the table as an index organized table (IOT), especially if these are the only two columns in the table. (An IOT isn't a panacea, it comes with it's own set of performance issues.)
您还可以考虑将表定义为索引组织表 (IOT),尤其是当这些是表中仅有的两列时。(物联网不是万能药,它有自己的一系列性能问题。)
You can try re-writing the query (unless that's a database change that is also verboten) In our database environments, we consider a query to be as much a part of the database as the tables and indexes.)
您可以尝试重新编写查询(除非这也是禁止的数据库更改)在我们的数据库环境中,我们认为查询与表和索引一样是数据库的一部分。)
Again, without a predicate, the optimizer will likely not use an index. There's a chance you could get the query plan to use one of the existing indexes to get the first rows returned quickly, by adding a hint, test a combination of:
同样,如果没有谓词,优化器可能不会使用索引。您有可能让查询计划使用现有索引之一来快速返回第一行,通过添加提示,测试以下组合:
select /*+ INDEX(table1) */ ...
select /*+ FIRST_ROWS */ ...
select /*+ ALL_ROWS */ ...
distinct id, name from table1;
distinct id, name from table1 order by id;
distinct id, name from table1 order by name;
id, name from table1 group by id, name order by id;
id, min(name) from table1 group by id order by id;
min(id), name from table1 group by name order by name;
With a hint, you may be able to influence the optimizer to use an index, and that may avoid the sort operation, but overall, it make take more time to return the entire result set.
通过一个提示,您可以影响优化器使用索引,这可能会避免排序操作,但总的来说,返回整个结果集需要更多时间。
(UPDATE: someone else pointed out that the optimizer might choose to merge two indexes based on ROWID. That's a possibility, but without a predicate to eliminate some rows, that's likely going to be a much more expensive approach (matching 10s of millions ROWIDs) from two indexes, especially when none of the rows are going to be excluded on the basis of the match.)
(更新:其他人指出优化器可能会选择基于 ROWID 合并两个索引。这是一种可能性,但没有谓词来消除某些行,这可能是一种更昂贵的方法(匹配数百万个 ROWID)来自两个索引,特别是当没有任何行将根据匹配被排除时。)
But all that theorizing doesn't amount to squat without some performance statistics.
但是,所有这些理论化都不足以在没有一些性能统计数据的情况下进行深蹲。
Absent altering anything else in the database, the only other hope (I can think of) of you speeding up the query is to make sure the sort operation is tuned so that the (required) sort operation can be performed in memory, rather than on disk. But that's not really the right answer. The optimizer may not be doing a sort operation at all, it may be doing a hash operation (10gR2+) instead, in which case, that should be tuned. The sort operation is just a guess on my part, based on past experience with Oracle 7.3, 8, 8i, 9i.)
如果没有更改数据库中的任何其他内容,您加速查询的唯一其他希望(我能想到)是确保排序操作已调整,以便(所需的)排序操作可以在内存中执行,而不是在内存中执行盘。但这并不是真正的正确答案。优化器可能根本不执行排序操作,它可能执行哈希操作 (10gR2+),在这种情况下,应该对其进行调整。排序操作只是我的猜测,基于过去使用 Oracle 7.3、8、8i、9i 的经验。)
A serious DBA is going to have more issue with you futzing with the SORT_AREA_SIZE
and/or HASH_AREA_SIZE
parameters for your session(s) than he will in creating the correct indexes. (And those session parameters are "old school" for versions prior to 10g automatic memory management magic.)
一个严肃的 DBA 会比他在创建正确的索引时遇到更多的问题,因为他会为您的会话处理SORT_AREA_SIZE
和/或HASH_AREA_SIZE
参数。(对于 10g 自动内存管理魔法之前的版本,这些会话参数是“老派”。)
Show your DBA the specification for the result set, let the DBA tune it.
向您的 DBA 展示结果集的规范,让 DBA 对其进行调整。
回答by Rob van Wijk
A query cannot be tuned by looking at it, or randomly suggesting some equivalent queries, regardless how well meant they are.
不能通过查看查询或随机建议一些等效查询来调整查询,无论它们的意思如何。
You, we or the optimizer needs to know statistics about your data. And then you can measure with tools like EXPLAIN PLAN or SQLTrace/tkprof or even the simple autotrace tool from SQLPlus.
您、我们或优化器需要了解有关您的数据的统计信息。然后,您可以使用 EXPLAIN PLAN 或 SQL Trace/tkprof 之类的工具甚至 SQLPlus的简单自动跟踪工具进行测量。
Can you show us the output of this:
你能告诉我们这个输出吗:
set serveroutput off
select /*+ gather_plan_statistics */ distinct id,name from table1;
select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));
And how does your entire table1 look like? Please show a describe output.
你的整个 table1 是什么样子的?请显示描述输出。
Regards, Rob.
问候,罗伯。
回答by Gary Myers
"The table is very large (10 of millions of rows)" If you can't change the database (add index etc). Then your query will have no choice but to read the entire table. So firstly, determine how long that takes (ie time the SELECT ID,NAME FROM TABLE1). You won't get it any quicker than that. The second step it has to do is the DISTINCT. In 10g+ that should use a HASH GROUP BY. Prior to that it is a SORT operation. The former is quicker. If your database is 9i, then you MAY get an improvement by copying the 10 million rows into a 10g database and doing it there. Alternatively, allocate gobs of memory (google ALTER SESSION SET SORT_AREA_SIZE). That may harm other processes on the database, but then your DBAs aren't giving you much option.
“表非常大(数百万行中的 10 行)”如果您无法更改数据库(添加索引等)。那么您的查询将别无选择,只能读取整个表。所以首先,确定需要多长时间(即选择 ID,NAME FROM TABLE1 的时间)。你不会比这更快得到它。它必须做的第二步是 DISTINCT。在 10g+ 中应该使用 HASH GROUP BY。在此之前,它是一个 SORT 操作。前者更快。如果您的数据库是 9i,那么您可以通过将 1000 万行复制到 10g 数据库中并在那里进行操作来获得改进。或者,分配大量内存(google ALTER SESSION SET SORT_AREA_SIZE)。这可能会损害数据库上的其他进程,但是您的 DBA 并没有给您太多选择。
回答by tom
Really try to work something out with the DBAs. Really. Attempt to communicate the benefits and ease their fears of degraded performance.
真正尝试与 DBA 一起解决问题。真的。尝试传达好处并减轻他们对性能下降的恐惧。
Got a development environment/database to test this stuff?
有开发环境/数据库来测试这些东西吗?
How timely must the data be?
数据必须有多及时?
How about a copy of the table already grouped by id and name with proper indexing? A batch job could be configured to refresh your new table once a night.
已按 id 和 name 分组并具有适当索引的表副本怎么样?批处理作业可以配置为每晚刷新一次新表。
But if that doesn't work out...
但如果这不起作用......
How about exporting all of the id and name pairs to an alternate database where you can group and index to your benefit and leave the DBAs with all of their smug rigidness?
如何将所有 id 和 name 对导出到备用数据库,您可以在其中对您的利益进行分组和索引,并让 DBA 保持自鸣得意的僵化?
回答by Quassnoi
If for a given id
the same name
is always returned, you can run the following:
如果始终返回给定id
的相同name
值,则可以运行以下命令:
SELECT (
SELECT name
FROM table1
WHERE id = did
AND rownum = 1
)
FROM (
SELECT DISTINCT id AS did
FROM table1
WHERE id IS NOT NULL
)
Both queries will use the index on id
.
两个查询都将使用 上的索引id
。
If you still need the NULL
values, run this:
如果您仍然需要这些NULL
值,请运行以下命令:
SELECT (
SELECT name
FROM table1
WHERE id = did
AND rownum = 1
)
FROM (
SELECT DISTINCT id AS did
FROM table1
WHERE id IS NOT NULL
)
UNION ALL
SELECT NULL, name
FROM table1
WHERE id IS NULL
AND rownum = 1
This will be less efficient, since the second query doesn't use indexes, but it will stop on the first NULL
it encounters: if it's close to the beginning of the tables, then you're lucky.
这会降低效率,因为第二个查询不使用索引,但它会在NULL
遇到的第一个查询时停止:如果它接近表的开头,那么你很幸运。
See the entry in my blog for performance details:
有关性能详细信息,请参阅我博客中的条目:
回答by Dave Costa
This may perform better. It assumes that, as you said, the name is always the same for a given id.
这可能会表现得更好。它假设,正如您所说,给定 id 的名称始终相同。
WITH id_list AS (SELECT DISTINCT id FROM table1)
SELECT id_list.id, (SELECT name FROM table1 WHERE table1.id = id_list.id AND rownum = 1)
FROM id_list;
回答by Stefan Steinegger
You could try this:
你可以试试这个:
select id, max(name) from table1 group by id
This uses the index on id for sure, but you have to try if it performs fast.
这肯定会使用 id 上的索引,但是您必须尝试它是否执行得快。
回答by skaffman
Without wishing to indulge in the practice of throwing stuff at the wall until something sticks, try this:
不想沉迷于向墙上扔东西直到有东西粘住的做法,试试这个:
select id, name from table1 group by id, name
I have vague memories of a GROUP BY being inexplicably quicker than a DISTINCT.
我对 GROUP BY 莫名其妙地比 DISTINCT 快有模糊的记忆。
回答by GreenieMeanie
Why do you need to even have "name" in the clause if the name is always the same for a given id? (nm...you want the name you aren't just checking for existence)
如果给定 id 的名称始终相同,为什么您甚至需要在子句中包含“名称”?(nm ...你想要的名字不仅仅是检查是否存在)
SELECT name, id FROM table WHERE id in (SELECT DISTINCT id FROM table)?
Don't know if that helps...
不知道有没有帮助...
回答by Carl Manaster
Is id unique? If so, you could drop DISTINCT
from the query. If not - maybe it needs a new name? Yeah, I know, can't change the schema...
id 是唯一的吗?如果是这样,您可以DISTINCT
从查询中删除。如果没有 - 也许它需要一个新名称?是的,我知道,无法更改架构...