Oracle: how to optimize a SELECT from several tables with millions of rows

Notice: this page mirrors a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4205749/


How to optimize a select from several tables with millions of rows

Tags: oracle, optimization, tree, indexing

Asked by Anton Schukin

Have the following tables (Oracle 10g):


catalog (
  id NUMBER PRIMARY KEY,
  name VARCHAR2(255),
  owner NUMBER,
  root NUMBER REFERENCES catalog(id)
  ...
)
university (
  id NUMBER PRIMARY KEY,
  ...
)
securitygroup (
  id NUMBER PRIMARY KEY
  ...
)
catalog_securitygroup (
  catalog REFERENCES catalog(id),
  securitygroup REFERENCES securitygroup(id)
)
catalog_university (
  catalog REFERENCES catalog(id),
  university REFERENCES university(id)
)

Catalog: 500 000 rows, catalog_university: 500 000, catalog_securitygroup: 1 500 000.


I need to select any 50 rows from catalog with a specified root, ordered by name, for the current university and current securitygroup. Here is the query:


SELECT ccc.* FROM (
  SELECT cc.*, ROWNUM AS n FROM (
      SELECT c.id, c.name, c.owner
        FROM catalog c, catalog_securitygroup cs, catalog_university cu
        WHERE c.root = 100
          AND cs.catalog = c.id
          AND cs.securitygroup = 200
          AND cu.catalog = c.id
          AND cu.university = 300
        ORDER BY name
    ) cc 
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;

Here 100 is some catalog (root), 200 is some securitygroup, and 300 is some university. This query returns 50 rows out of ~170,000 in 3 minutes.


But the next query returns these rows in 2 seconds:


SELECT ccc.* FROM (
  SELECT cc.*, ROWNUM AS n FROM (
      SELECT c.id, c.name, c.owner
        FROM catalog c
        WHERE c.root = 100
        ORDER BY name
    ) cc 
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;

I built the following indexes: (catalog.id, catalog.name, catalog.owner), (catalog_securitygroup.catalog, catalog_securitygroup.index), (catalog_university.catalog, catalog_university.university).


Plan for the first query (using PL/SQL Developer):


http://habreffect.ru/66c/f25faa5f8/plan2.jpg


Plan for the second query:


http://habreffect.ru/f91/86e780cc7/plan1.jpg


What are the ways to optimize the query I have?


Accepted answer by Stephanie Page

First, I assume that your University and SecurityGroup tables are rather small. You posted the sizes of the large tables, but it's really the other sizes that are part of the problem.


Your problem stems from the fact that you can't join the smallest tables first. Your join order should be from small to large. But because your mapping tables don't include a securitygroup-to-university table, you can't join the smallest ones first. So you wind up starting with one or the other, joining to a big table, then to another big table, and only then, with that large intermediate result, can you go to a small table.


If you always have current_univ, current_secgrp, and root as inputs, you want to use them to filter as soon as possible. The only way to do that is to change your schema a bit. In fact, you can leave the existing tables in place if you have to, but you'll be using extra space with this suggestion.


You've normalized the data very well. That's great for speed of update... not so great for querying. We denormalize to speed up querying (that's the whole reason for data warehouses; OK, that and history). Build a single mapping table with the following columns.


Univ_id, SecGrp_ID, Root, catalog_id. Make it an index-organized table with the first three columns as the PK.


Now when you query that index with all three PK values, you'll finish the index scan with a complete list of allowable catalog IDs; then it's just a single join to the catalog table to get the item details, and you're off and running.

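A minimal SQL sketch of that suggestion - the table, constraint, and index names are illustrative and not from the original answer, and catalog_id is added to the key so it stays unique when one (university, securitygroup, root) combination maps to many catalogs:

-- Illustrative names; assumes the mapping data is maintained alongside the original tables.
CREATE TABLE catalog_access_map (
  univ_id    NUMBER,
  secgrp_id  NUMBER,
  root       NUMBER,
  catalog_id NUMBER REFERENCES catalog(id),
  CONSTRAINT catalog_access_map_pk PRIMARY KEY (univ_id, secgrp_id, root, catalog_id)
) ORGANIZATION INDEX;

-- Paging query against the mapping table: the IOT scan yields the allowed catalog ids,
-- then a single join to catalog picks up the item details.
SELECT ccc.* FROM (
  SELECT cc.*, ROWNUM AS n FROM (
      SELECT c.id, c.name, c.owner
        FROM catalog_access_map m, catalog c
       WHERE m.univ_id = 300
         AND m.secgrp_id = 200
         AND m.root = 100
         AND c.id = m.catalog_id
       ORDER BY c.name
    ) cc
   WHERE ROWNUM <= 50
) ccc WHERE ccc.n > 0;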

Answered by Unreason

The indexes that can be useful and should be considered deal with


WHERE c.root = 100
      AND cs.catalog = c.id
      AND cs.securitygroup = 200
      AND cu.catalog = c.id
      AND cu.university = 300

So the following fields can be interesting for indexes


c: id, root   
cs: catalog, securitygroup   
cu: catalog, university

So, try creating


(catalog_securitygroup.catalog, catalog_securitygroup.securitygroup)

and

(catalog_university.catalog, catalog_university.university)

EDIT: I missed the ORDER BY - these fields should also be considered, so


(catalog.name, catalog.id)

might be beneficial (or some other composite index that could be used for sorting and the conditions - possibly (catalog.root, catalog.name, catalog.id))

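A sketch of the corresponding DDL (the index names are made up for illustration):

CREATE INDEX cs_catalog_sg_ix  ON catalog_securitygroup (catalog, securitygroup);
CREATE INDEX cu_catalog_uni_ix ON catalog_university (catalog, university);
CREATE INDEX cat_root_name_ix  ON catalog (root, name, id);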

EDIT 2: Although another answer has been accepted, I'll provide some more food for thought. I have created some test data and run some benchmarks.


The test cases are minimal in terms of record width (in catalog_securitygroup and catalog_university the primary keys are (catalog, securitygroup) and (catalog, university)). Here is the number of records per table:


test=# SELECT (SELECT COUNT(*) FROM catalog), (SELECT COUNT(*) FROM catalog_securitygroup), (SELECT COUNT(*) FROM catalog_university);
 ?column? | ?column? | ?column? 
----------+----------+----------
   500000 |  1497501 |   500000
(1 row)

Database is PostgreSQL 8.4, default Ubuntu install; hardware: i5, 4 GB RAM.


First I rewrote the query to


SELECT c.id, c.name, c.owner
FROM catalog c, catalog_securitygroup cs, catalog_university cu
WHERE c.root < 50 
  AND cs.catalog = c.id 
  AND cu.catalog = c.id
  AND cs.securitygroup < 200
  AND cu.university < 200
ORDER BY c.name
LIMIT 50 OFFSET 100

Note: the conditions are turned into less-than to maintain a comparable number of intermediate rows (the above query would return 198,801 rows without the LIMIT clause).


If run as above, without any extra indexes (save for PKs and foreign keys), it runs in 556 ms on a cold database (this is actually an indication that I oversimplified the sample data somehow - I would be happier if I had 2-4 s here without resorting to less-than operators).


This brings me to my point - any straight query that only joins and filters (a certain number of tables) and returns only a certain number of records should run under 1 s on any decent database, without the need to use cursors or to denormalize data (one of these days I'll have to write a post on that).


Furthermore, if a query returns only 50 rows and uses simple equality joins and restrictive equality conditions, it should run much faster still.


Now let's see what happens if I add some indexes. The biggest potential win in queries like this is usually the sort order, so let me try that:


CREATE INDEX test1 ON catalog (name, id);

This brings the execution time of the query to 22 ms on a cold database.


And that's the point - if you are trying to get only a page of data, you should only get a page of data, and execution times of queries such as this on normalized data with proper indexes should be less than 100 ms on decent hardware.


I hope I didn't oversimplify the case to the point of no comparison (as I stated before, some simplification is present, since I don't know the cardinality of the relationships between catalog and the many-to-many tables).


So, the conclusion is:


  • If I were you I would not stop tweaking the indexes (and the SQL) until I got the performance of the query below 200 ms, as a rule of thumb.
  • Only if I found an objective explanation of why it can't go below that value would I resort to denormalization and/or cursors, etc.

Answered by Mike Meyers

The use of ROWNUM is wrong and causes all the rows to be processed. It will process all the rows, assign them all a row number, and then find those between 0 and 50. What you want to look for in the explain plan is COUNT STOPKEY rather than just COUNT.

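One way to check this is to look at the plan with DBMS_XPLAN and see whether COUNT STOPKEY appears; here is a sketch against the simpler single-table query (in practice you would run it against the full join):

EXPLAIN PLAN FOR
SELECT ccc.* FROM (
  SELECT cc.*, ROWNUM AS n FROM (
      SELECT c.id, c.name, c.owner FROM catalog c WHERE c.root = 100 ORDER BY name
    ) cc
   WHERE ROWNUM <= 50
) ccc WHERE ccc.n > 0;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);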

The query below should be an improvement as it will only get the first 50 rows... but there is still the issue of the joins to look at too:


SELECT ccc.* FROM (
  SELECT cc.*, ROWNUM AS n FROM (
      SELECT c.id, c.name, c.owner
        FROM catalog c
        WHERE c.root = 100
        ORDER BY name
    ) cc 
    where rownum <= 50
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;

Also, assuming this is for a web page or something similar, maybe there is a better way to handle this than just running the query again to get the data for the next page.

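For reference, a generalized form of that paging pattern with bind variables (:page_start and :page_end are assumed names supplied by the application):

SELECT ccc.* FROM (
  SELECT cc.*, ROWNUM AS n FROM (
      SELECT c.id, c.name, c.owner
        FROM catalog c
       WHERE c.root = 100
       ORDER BY name
    ) cc
   WHERE ROWNUM <= :page_end
) ccc WHERE ccc.n > :page_start;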

Answered by Mike Meyers

The Oracle cost-based optimizer makes use of all the information that it has to decide what the best access paths are for the data and what the least costly methods are for getting that data. So below are some random points related to your question.


The first three tables that you've listed all have primary keys. Do the other tables (catalog_university and catalog_securitygroup) also have primary keys on them? A primary key defines a column or set of columns that are non-null and unique, and is very important in a relational database.

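If they are missing, one option is to add composite primary keys to the mapping tables - a sketch, assuming each pair really is unique (constraint names are illustrative):

ALTER TABLE catalog_securitygroup
  ADD CONSTRAINT catalog_securitygroup_pk PRIMARY KEY (catalog, securitygroup);
ALTER TABLE catalog_university
  ADD CONSTRAINT catalog_university_pk PRIMARY KEY (catalog, university);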

Oracle generally enforces a primary key by generating a unique index on the given columns. The Oracle optimizer is more likely to make use of a unique index if it is available, as it is likely to be more selective.


If possible, an index that contains unique values should be defined as unique (CREATE UNIQUE INDEX ...), as this provides the optimizer with more information.


The additional indexes that you have provided are no more selective than the existing indexes. For example, the index on (catalog.id, catalog.name, catalog.owner) is unique but is less useful than the existing primary key index on (catalog.id). If a query is written to select on the catalog.name column, it is possible to do an index skip scan, but this starts to get costly (and may not even be possible in this case).


Since you are trying to select based on the catalog.root column, it might be worth adding an index on that column. This would mean it could quickly find the relevant rows from the catalog table. The timing of the second query could be a bit misleading. It might be taking 2 seconds to find 50 matching rows from catalog, but these could easily be the first 50 rows from the catalog table... finding 50 that match all your conditions might take longer, and not just because you need to join to other tables to get them. I would always use CREATE TABLE AS SELECT without restricting on ROWNUM when trying to tune performance. With a complex query I generally care about how long it takes to get all the rows back... and a simple select with ROWNUM can be misleading.

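A sketch of both suggestions (the index and table names are illustrative):

-- Index on the filter column.
CREATE INDEX catalog_root_ix ON catalog (root);

-- Timing check that materializes the full result instead of stopping at the first 50 rows.
CREATE TABLE tuning_check AS
  SELECT c.id, c.name, c.owner
    FROM catalog c, catalog_securitygroup cs, catalog_university cu
   WHERE c.root = 100
     AND cs.catalog = c.id
     AND cs.securitygroup = 200
     AND cu.catalog = c.id
     AND cu.university = 300;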

Everything about Oracle performance tuning is about providing the optimizer enough information and the right tools (indexes, constraints, etc.) to do its job properly. For this reason it's important to gather optimizer statistics using something like DBMS_STATS.GATHER_TABLE_STATS(). Indexes should have stats gathered automatically in Oracle 10g or later.

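For example, a sketch that gathers table and index stats for the three tables in the current schema:

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'CATALOG', cascade => TRUE);
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'CATALOG_SECURITYGROUP', cascade => TRUE);
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'CATALOG_UNIVERSITY', cascade => TRUE);
END;
/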

Somehow this grew into quite a long answer about the Oracle optimizer. Hopefully some of it answers your question. Here is a summary of what is said above:


  • Give the optimizer as much information as possible, e.g. if an index is unique then declare it as such.
  • Add indexes on your access paths.
  • Find the correct timings for queries without limiting by rownum. It will always be quicker to find the first 50 M&Ms in a jar than to find the first 50 red M&Ms.
  • Gather optimizer stats.
  • Add unique/primary keys on all tables where they exist.

Answered by RobertoBr

Try declaring a cursor. I don't know Oracle, but in SQL Server it would look like this:


declare @result 
table ( 
    id numeric,
    name varchar(255)
); 

declare __dyn_select_cursor cursor LOCAL SCROLL DYNAMIC for 

--Select
select distinct 
    c.id, c.name
From [catalog] c
    inner join catalog_university u  -- join the mapping table, not university itself
    on     u.catalog = c.id
       and u.university = 300
    inner join catalog_securitygroup s
    on     s.catalog = c.id
       and s.securitygroup = 200
Where
    c.root = 100
Order by name   

--Cursor
declare @id numeric;
declare @name varchar(255);

open __dyn_select_cursor; 

fetch relative 1 from __dyn_select_cursor into @id, @name;

declare @maxrowscount int;
set @maxrowscount = 50;

while (@@fetch_status = 0 and @maxrowscount <> 0) 
begin 
     insert into @result values (@id, @name);
     set @maxrowscount = @maxrowscount - 1;
     fetch next from __dyn_select_cursor into  @id, @name; 
end 
close __dyn_select_cursor; 
deallocate __dyn_select_cursor; 


--Select temp, final result
select 
 id, 
 name
from @result;