Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/628685/

Left Joins are what I want but they are very slow?

Tags: sql, optimization, join

Asked by jskulski

Overview:

I have three tables (subscribers, bios, and shirtsizes), and I need to find the subscribers without a bio or a shirt size.

The tables are laid out as follows:

subscribers

| season_id | user_id |

bio

| bio_id | user_id |

shirtsizes

| bio_id | shirtsize |

I need to find all users who do not have a bio or a shirtsize for any given season (if there is no bio, then there is no shirtsize either, via the relation).

I originally wrote a query like:

SELECT *
   FROM subscribers s
   LEFT JOIN bio b ON b.user_id = s.user_id
   LEFT JOIN shirtsizes sh ON sh.bio_id = b.bio_id
WHERE s.season_id = 185181 AND (b.bio_id IS NULL OR sh.size IS NULL);

but it now takes 10 seconds to complete.

I am wondering how I can restructure the query (or possibly the problem) so that it will perform reasonably.

Here is the MySQL EXPLAIN output (ogu = subscribers, b = bio, tn = shirtsizes):

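+----+-------------+-------+-------+---------------+---------+---------+-------------+--------+-------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref         | rows   | Extra       |
+----+-------------+-------+-------+---------------+---------+---------+-------------+--------+-------------+
|  1 | SIMPLE      | ogu   | ref   | PRIMARY       | PRIMARY | 4       | const       |    133 | Using where |
|  1 | SIMPLE      | b     | index | NULL          | PRIMARY | 8       | NULL        | 187644 | Using index |
|  1 | SIMPLE      | tn    | ref   | nid           | nid     | 4       | waka2.b.nid |      1 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+-------------+--------+-------------+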

The above is heavily sanitized; here is the real info:

mysql> DESCRIBE subscribers;
+-------------+---------+------+-----+---------+-------+
| Field       | Type    | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| subscribers | int(11) | NO   | PRI |         |       |
| uid         | int(11) | NO   | PRI |         |       |
+-------------+---------+------+-----+---------+-------+


mysql> DESCRIBE bio;
+--------+------------------+------+-----+---------+-------+
| Field  | Type             | Null | Key | Default | Extra |
+--------+------------------+------+-----+---------+-------+
| bio_id | int(10) unsigned | NO   | PRI | 0       |       |
| uid    | int(10) unsigned | NO   | PRI | 0       |       |
+--------+------------------+------+-----+---------+-------+


mysql> DESCRIBE shirtsize;
+-----------+------------------+------+-----+---------+-------+
| Field     | Type             | Null | Key | Default | Extra |
+-----------+------------------+------+-----+---------+-------+
| bio_id    | int(10) unsigned | NO   | PRI | 0       |       |
| shirtsize | int(10) unsigned | NO   | PRI | 0       |       |
+-----------+------------------+------+-----+---------+-------+

And the real query looks like:

SELECT ogu.nid, ogu.is_active, ogu.uid, b.nid AS bio_node, tn.nid AS size
                  FROM og_uid ogu
                  LEFT JOIN bio b ON b.uid = ogu.uid
                  LEFT JOIN term_node tn ON tn.nid = b.nid
                  WHERE ogu.nid = 185033 AND ogu.is_admin = 0
                  AND (b.nid IS NULL OR tn.tid IS NULL)

Here nid is either season_id or bio_id (with a type), and term_node plays the role of the shirtsize table.

Answered by Tor Haugen

The query should be OK. I would run it through a query analyzer and refine the indexes on the tables.

Answered by Brian

Joins are among the most expensive operations you can perform in an SQL query. The database should be able to optimize your query somewhat on its own, but it may be worth restructuring it. First of all, instead of SELECT *, specify which columns you need from which tables. This will speed things up quite a bit.

If you only need the user ID, for example:

SELECT s.user_id
   FROM subscribers s
   LEFT JOIN bio b ON b.user_id = s.user_id
   LEFT JOIN shirtsizes sh ON sh.bio_id = b.bio_id
WHERE s.season_id = 185181 AND (b.bio_id IS NULL OR sh.size IS NULL);

That will allow the SQL database to restructure your query a little more efficiently on its own.

Answered by tvanfosson

Obviously I haven't checked this, but it seems that what you want is to select any subscriber for whom there isn't a matching bio, or for whom the join between bios and shirtsizes fails. I would consider using NOT EXISTS for this condition. You'll probably want indexes on bio.user_id and shirtsizes.bio_id.

select *
from subscribers s
where s.season_id = 185181
      and not exists (select *
                      from bio join shirtsizes on bio.bio_id = shirtsizes.bio_id
                      where bio.user_id = s.user_id)
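
A minimal runnable sketch of the NOT EXISTS form, for anyone who wants to check the semantics on toy data. It uses Python's built-in sqlite3 module purely as a stand-in for MySQL; the sample rows, the index names, and the `size` column are assumptions made for illustration, not the asker's real schema.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subscribers (season_id INTEGER, user_id INTEGER);
    CREATE TABLE bio (bio_id INTEGER, user_id INTEGER);
    CREATE TABLE shirtsizes (bio_id INTEGER, size INTEGER);
    -- Indexes on the join columns, as suggested above (names are made up).
    CREATE INDEX idx_bio_user ON bio (user_id);
    CREATE INDEX idx_shirtsizes_bio ON shirtsizes (bio_id);

    -- User 1 has a bio and a size, user 2 has a bio only, user 3 has neither.
    INSERT INTO subscribers VALUES (185181, 1), (185181, 2), (185181, 3), (999, 4);
    INSERT INTO bio VALUES (10, 1), (20, 2);
    INSERT INTO shirtsizes VALUES (10, 3);
""")

not_exists_sql = """
    SELECT s.user_id
    FROM subscribers s
    WHERE s.season_id = ?
      AND NOT EXISTS (SELECT 1
                      FROM bio b
                      JOIN shirtsizes sh ON sh.bio_id = b.bio_id
                      WHERE b.user_id = s.user_id)
    ORDER BY s.user_id
"""
missing = [row[0] for row in conn.execute(not_exists_sql, (185181,))]
print(missing)  # users 2 and 3 lack a bio or a shirt size
```

Whether this beats the LEFT JOIN version on MySQL depends on the optimizer and the indexes, so it is worth comparing the EXPLAIN output of both.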

EDIT:

Based on your update, you may want to create separate keys on each column instead of, or in addition to, the compound primary keys. It's possible that the joins can't take optimal advantage of the compound primary indexes, and an index on the join columns themselves may speed things up.
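
To illustrate the idea, here is a small sketch using Python's sqlite3 module as a stand-in for MySQL (an assumption; MySQL's planner and EXPLAIN format differ). The bio table has a compound primary key (nid, uid), so a join on uid alone cannot seek into that key; the extra single-column index `idx_bio_uid` (a hypothetical name) gives the planner something to use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Compound primary keys, loosely mirroring the DESCRIBE output in the question.
    CREATE TABLE og_uid (nid INTEGER, uid INTEGER, PRIMARY KEY (nid, uid));
    CREATE TABLE bio (nid INTEGER, uid INTEGER, PRIMARY KEY (nid, uid));
    -- Separate index on the join column itself (hypothetical name).
    CREATE INDEX idx_bio_uid ON bio (uid);
""")

plan_rows = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT ogu.uid, b.nid
    FROM og_uid ogu
    LEFT JOIN bio b ON b.uid = ogu.uid
    WHERE ogu.nid = 185033
""").fetchall()

# The last column of each plan row is the human-readable detail string.
plan_text = "\n".join(row[-1] for row in plan_rows)
print(plan_text)
```

In the printed plan, the lookup into bio should mention idx_bio_uid rather than a scan. On MySQL the equivalent check would be to add the index and rerun EXPLAIN to see whether the `b` row still shows type `index` over 187644 rows.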

Answered by Jonathan Leffler

Would it be any quicker to take the difference between the list of subscribers for the relevant season and the list of subscribers for that season who have bios and shirt sizes?

SELECT *
   FROM subscribers
   WHERE season_id = 185181
     AND user_id NOT IN
         (SELECT DISTINCT s.user_id
             FROM subscribers s
             JOIN bio b ON s.user_id = b.user_id
             JOIN shirtsizes z ON b.bio_id = z.bio_id
             WHERE s.season_id = 185181
         )
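
As a sanity check of the NOT IN approach, here is a toy run using Python's sqlite3 module in place of MySQL. The sample data and the `size` column are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subscribers (season_id INTEGER, user_id INTEGER);
    CREATE TABLE bio (bio_id INTEGER, user_id INTEGER);
    CREATE TABLE shirtsizes (bio_id INTEGER, size INTEGER);

    -- User 1 has a bio and a size, user 2 has a bio only, user 3 has neither.
    INSERT INTO subscribers VALUES (185181, 1), (185181, 2), (185181, 3), (999, 4);
    INSERT INTO bio VALUES (10, 1), (20, 2);
    INSERT INTO shirtsizes VALUES (10, 3);
""")

not_in_sql = """
    SELECT user_id
    FROM subscribers
    WHERE season_id = :season
      AND user_id NOT IN (SELECT DISTINCT s.user_id
                          FROM subscribers s
                          JOIN bio b ON s.user_id = b.user_id
                          JOIN shirtsizes z ON b.bio_id = z.bio_id
                          WHERE s.season_id = :season)
    ORDER BY user_id
"""
without_shirt = [row[0] for row in conn.execute(not_in_sql, {"season": 185181})]
print(without_shirt)  # users 2 and 3
```

One caveat with NOT IN in general: if the subquery can return a NULL user_id, the outer query returns no rows at all, so this form is safest when user_id is declared NOT NULL.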

This avoids outer joins, which are not as fast as inner joins, and may therefore be quicker. On the other hand, it might be creating two large lists with very few differences between them. It is not clear whether the DISTINCT in the sub-query would improve or harm performance. It implies a sort operation (expensive) but paves the way for a merge-join if the MySQL optimizer supports such things.

There might be other notations available - MINUS or DIFFERENCE, for example.
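
For what such a set-difference notation might look like, here is a sketch in SQLite (via Python's sqlite3 module), which spells the operator EXCEPT. Whether your MySQL version offers an equivalent operator is something to check in its documentation; the toy data is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subscribers (season_id INTEGER, user_id INTEGER);
    CREATE TABLE bio (bio_id INTEGER, user_id INTEGER);
    CREATE TABLE shirtsizes (bio_id INTEGER, size INTEGER);

    INSERT INTO subscribers VALUES (185181, 1), (185181, 2), (185181, 3), (999, 4);
    INSERT INTO bio VALUES (10, 1), (20, 2);
    INSERT INTO shirtsizes VALUES (10, 3);
""")

except_sql = """
    -- All subscribers for the season ...
    SELECT user_id FROM subscribers WHERE season_id = :season
    EXCEPT
    -- ... minus those who have both a bio and a shirt size.
    SELECT s.user_id
    FROM subscribers s
    JOIN bio b ON s.user_id = b.user_id
    JOIN shirtsizes z ON b.bio_id = z.bio_id
    WHERE s.season_id = :season
    ORDER BY 1
"""
difference = [row[0] for row in conn.execute(except_sql, {"season": 185181})]
print(difference)  # users 2 and 3
```

Note that EXCEPT also removes duplicate rows from its result, which is fine here because only distinct user IDs are wanted.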

Answered by John Saunders

Is bio_id the primary key of bios? Is it really possible for there to be a bios row with b.user_id = subscribers.user_id but with b.bio_id NULL?

Are there shirtsize rows with shirtsize.bio_id NULL? Do those rows ever have shirtsize.size not NULL?

Answered by SeanJA

If you specify exactly the columns you are looking for rather than using SELECT *, it might speed things up a bit. Also, OR is not the fastest condition to evaluate; if you can rewrite the query without the OR, it will be faster.

Also, you could maybe try unions instead of left joins?

SELECT s.user_id
   FROM subscribers s
   LEFT JOIN bio b ON b.user_id = s.user_id
   LEFT JOIN shirtsizes sh ON sh.bio_id = b.bio_id
WHERE s.season_id = 185181 AND (b.bio_id IS NULL OR sh.size IS NULL);

would be something like:

(SELECT s.user_id FROM subscribers s WHERE s.season_id = 185181)
UNION
(SELECT b.user_id FROM bio b WHERE b.bio_id IS NULL)
UNION
(SELECT shirtsizes.bio_id FROM shirtsizes WHERE shirtsizes.size IS NULL)

(To be honest, that doesn't look right to me, but then I never use joins, join syntax, or unions.)

I would do:

SELECT *
FROM subscribers s, bio b, shirtsizes sh
WHERE s.season_id = 185181
AND sh.bio_id = b.bio_id
AND b.user_id = s.user_id
AND (b.bio_id IS NULL
     OR
     sh.size IS NULL);

Answered by Quassnoi

Your query, as it is written now, evaluates all bio and term_node rows if they exist, and then filters them out.

But what you want is just to find the og_uid rows that don't have a term_node (not having a bio also implies not having a term_node).

So you just want to stop evaluating bio and term_node rows as soon as you find the first existing term_node:

SELECT  *
FROM    (
        SELECT  ogu.nid, ogu.is_active, ogu.uid,
                (
                SELECT  1
                FROM    bio b, term_node tn
                WHERE   b.uid = ogu.uid
                        AND tn.nid = b.nid
                LIMIT   1
                ) AS ex
        FROM    og_uid ogu
        WHERE   ogu.nid = 185033
                AND ogu.is_admin = 0
        ) ogu1
WHERE   ex IS NULL

This will evaluate at most one bio and at most one term_node for each og_uid, instead of evaluating all the existing thousands and then filtering them out.

Should work much faster.
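
The early-exit query above can be tried on toy data with the sketch below. It runs on SQLite through Python's sqlite3 module as a stand-in for MySQL; the table contents are invented for illustration, and the speed-up itself is not something this tiny example can demonstrate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE og_uid (nid INTEGER, uid INTEGER, is_active INTEGER, is_admin INTEGER);
    CREATE TABLE bio (nid INTEGER, uid INTEGER);
    CREATE TABLE term_node (nid INTEGER, tid INTEGER);

    -- uid 1: bio with terms, uid 2: bio without terms, uid 3: no bio, uid 5: admin.
    INSERT INTO og_uid VALUES (185033, 1, 1, 0), (185033, 2, 1, 0),
                              (185033, 3, 1, 0), (185033, 5, 1, 1);
    INSERT INTO bio VALUES (100, 1), (200, 2);
    INSERT INTO term_node VALUES (100, 7), (100, 8);
""")

early_exit_sql = """
    SELECT uid
    FROM (
        SELECT ogu.nid, ogu.is_active, ogu.uid,
               (SELECT 1
                FROM bio b, term_node tn
                WHERE b.uid = ogu.uid
                  AND tn.nid = b.nid
                LIMIT 1) AS ex          -- NULL when no bio/term_node pair exists
        FROM og_uid ogu
        WHERE ogu.nid = 185033
          AND ogu.is_admin = 0
    ) ogu1
    WHERE ex IS NULL
    ORDER BY uid
"""
uids_without_term = [row[0] for row in conn.execute(early_exit_sql)]
print(uids_without_term)  # uids 2 and 3
```

The scalar subquery stops at the first matching row because of LIMIT 1, which is the behavior the answer relies on.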

Answered by Hafthor

select * from subscribers where user_id not in (
  -- users who have a bio with a shirt size
  select user_id from bio where bio_id in (
    select bio_id from shirtsizes
  )
) and season_id=185181

Answered by Hafthor

I presume that your "big table" is subscribers, and that season_id is probably neither selective nor indexed (indexing it is rather meaningless if it's not selective anyway), which means that you'll have to fully scan subscribers in any case. To start, I would join (with an inner join) the two other tables. Note that if there is no matching bio_id in shirtsizes, it's exactly the same for your query as if there were no bio. First bit:

select uid
from bio
     inner join shirtsizes
             on shirtsizes.bio_id = bio.bio_id

At this point you want to check that shirtsizes is indexed on bio_id. Now you can left outer join this query to subscribers:

select *
from subscribers s
     left outer join (select uid
                      from bio
                      inner join shirtsizes
                              on shirtsizes.bio_id = bio.bio_id) x
                  on x.uid = s.uid
where s.season_id = 185181
  and x.uid is null

which is likely to run reasonably fast if neither bio nor shirtsizes are gigantic ...
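
Here is a toy check of the derived-table approach, run on SQLite through Python's sqlite3 module rather than MySQL. The column names follow the sanitized schema from the question, and the rows and the index name are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subscribers (season_id INTEGER, user_id INTEGER);
    CREATE TABLE bio (bio_id INTEGER, user_id INTEGER);
    CREATE TABLE shirtsizes (bio_id INTEGER, size INTEGER);
    -- The answer asks for shirtsizes to be indexed on bio_id (hypothetical name).
    CREATE INDEX idx_shirtsizes_bio ON shirtsizes (bio_id);

    INSERT INTO subscribers VALUES (185181, 1), (185181, 2), (185181, 3), (999, 4);
    INSERT INTO bio VALUES (10, 1), (20, 2);
    INSERT INTO shirtsizes VALUES (10, 3);
""")

derived_sql = """
    SELECT s.user_id
    FROM subscribers s
    LEFT OUTER JOIN (SELECT b.user_id
                     FROM bio b
                     INNER JOIN shirtsizes sh ON sh.bio_id = b.bio_id) x
                 ON x.user_id = s.user_id
    WHERE s.season_id = 185181
      AND x.user_id IS NULL      -- no bio/shirtsize pair for this subscriber
    ORDER BY s.user_id
"""
unmatched = [row[0] for row in conn.execute(derived_sql)]
print(unmatched)  # users 2 and 3
```

The inner join is computed once over bio and shirtsizes, and only its result is outer-joined to subscribers, which is why the sizes of those two tables matter most.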
