Notice: This page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/1017239/

Time: 2020-09-01 02:29:17  Source: igfitidea

How do NULL values affect performance in a database search?

Tags: sql, database, oracle, database-performance, query-performance

Asked by Jakob Ojvind Nielsen

In our product we have a generic search engine, and we are trying to optimize the search performance. A lot of the tables used in the queries allow null values. Should we redesign our tables to disallow null values for optimization, or not?

Our product runs on both Oracle and MS SQL Server.

Accepted answer by Quassnoi

In Oracle, NULL values are not indexed, i.e. this query:

SELECT  *
FROM    table
WHERE   column IS NULL

will always use a full table scan, since the index doesn't cover the values you need.

More than that, this query:

SELECT  column
FROM    table
ORDER BY
        column

will also use a full table scan and a sort, for the same reason.

If your values don't intrinsically allow NULLs, then mark the column as NOT NULL.

Answer by Rob van Wijk

An extra answer to draw some extra attention to David Aldridge's comment on Quassnoi's accepted answer.

The statement:

this query:

SELECT * FROM table WHERE column IS NULL

will always use full table scan

is not true. Here is a counterexample using an index with a literal value:

SQL> create table mytable (mycolumn)
  2  as
  3   select nullif(level,10000)
  4     from dual
  5  connect by level <= 10000
  6  /

Table created.

SQL> create index i1 on mytable(mycolumn,1)
  2  /

Index created.

SQL> exec dbms_stats.gather_table_stats(user,'mytable',cascade=>true)

PL/SQL procedure successfully completed.

SQL> set serveroutput off
SQL> select /*+ gather_plan_statistics */ *
  2    from mytable
  3   where mycolumn is null
  4  /

  MYCOLUMN
----------


1 row selected.

SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'))
  2  /

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------
SQL_ID  daxdqjwaww1gr, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ *   from mytable  where mycolumn
is null

Plan hash value: 1816312439

-----------------------------------------------------------------------------------
| Id  | Operation        | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |      |      1 |        |      1 |00:00:00.01 |       2 |
|*  1 |  INDEX RANGE SCAN| I1   |      1 |      1 |      1 |00:00:00.01 |       2 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("MYCOLUMN" IS NULL)


19 rows selected.

As you can see, the index is being used.

Regards, Rob.

Answer by Jeremy Smyth

Short answer: yes, conditionally!

The main issue with null values and performance is to do with forward lookups.

If you insert a row into a table, with null values, it's placed in the natural page that it belongs to. Any query looking for that record will find it in the appropriate place. Easy so far....

...but let's say the page fills up, and now that row is cuddled in amongst the other rows. Still going well...

...until the row is updated, and the null value now contains something. The row's size has increased beyond the space available to it, so the DB engine has to do something about it.

The fastest thing for the server to do is to move the row off that page onto another, and to replace the row's entry with a forward pointer. Unfortunately, this requires an extra lookup when a query is performed: one to find the natural location of the row, and one to find its current location.

So, the short answer to your question is yes, making those fields non-nullable will help search performance. This is especially true if it often happens that the null fields in records you search on are updated to non-null.

Of course, there are other penalties (notably I/O and, to a tiny extent, index depth) associated with larger datasets, and then you have application issues with disallowing nulls in fields that conceptually require them, but hey, that's another problem :)

Answer by Vincent Malgrat

If your column doesn't contain NULLs, it is best to declare the column NOT NULL; the optimizer may then be able to take a more efficient path.

However, if you have NULLs in your column you don't have much choice (a non-null default value may create more problems than it solves).

As Quassnoi mentioned, NULLs are not indexed in Oracle, or, to be more precise, a row won't be indexed if all the indexed columns are NULL. This means:

  • NULLs can potentially speed up your searches because the index will have fewer rows
  • you can still index the NULL rows if you add another NOT NULL column, or even a constant, to the index.

The following script demonstrates a way to index NULL values:

CREATE TABLE TEST AS 
SELECT CASE
          WHEN MOD(ROWNUM, 100) != 0 THEN
           object_id
          ELSE
           NULL
       END object_id
  FROM all_objects;

CREATE INDEX idx_null ON test(object_id, 1);

SET AUTOTRACE ON EXPLAIN

SELECT COUNT(*) FROM TEST WHERE object_id IS NULL;

Answer by Andrew

I would say that testing is required, but it is nice to know other people's experiences. In my experience on MS SQL Server, nulls can and do cause massive performance issues (differences). In a very simple test just now, I saw a query return in 45 seconds when NOT NULL was set on the related fields in the CREATE TABLE statement, and take over 25 minutes when it wasn't set (I gave up waiting and just took a peek at the estimated query plan).

Test data is 1 million rows x 20 columns, which are constructed from 62 random lowercase alpha characters, on an i5-3320 with a normal HD and 8GB RAM (SQL Server using 2GB), running SQL Server 2012 Enterprise Edition on Windows 8.1. It's important to use random/irregular data to make the test a realistic "worst" case. In both cases the table was recreated and reloaded with random data, which took about 30 seconds on database files that already had a suitable amount of free space.

select count(field0) from myTable where field0 
                     not in (select field1 from myTable) 1000000

CREATE TABLE [dbo].[myTable]([Field0] [nvarchar](64) , ...

 vs

CREATE TABLE [dbo].[myTable]([Field0] [nvarchar](64) not null,

For performance reasons, both had the table option data_compression = page set, and everything else was left at its default. No indexes.

alter table myTable rebuild partition = all with (data_compression = page);

Not having nulls is a requirement for in-memory optimized tables, which I am not specifically using here. However, SQL Server will obviously do what is fastest, which in this specific case appears to be massively in favor of not having nulls in the data and using NOT NULL in the table create statement.

Any subsequent queries of the same form on this table return in two seconds, so I would assume that the standard default statistics, and possibly having the (1.3GB) table fit into memory, are working well, i.e.

select count(field19) from myTable where field19 
                       not in (select field18 from myTable) 1000000

As an aside, not having nulls and not having to deal with null cases also makes queries much simpler, shorter, less error-prone and, very often, faster. If at all possible, it is best to avoid nulls in general, on MS SQL Server at least, unless they are explicitly required and cannot reasonably be worked out of the solution.

Starting with a new table and sizing it up to 10 million rows / 13GB, the same query takes 12 minutes, which is very respectable considering the hardware and that no indexes are in use. For information, the query was completely I/O bound, with I/O hovering between 20MB/s and 60MB/s. A repeat of the same query took 9 minutes.

Answer by HLGEM

The issue of whether to use Nulls because they affect performance is one of those balancing acts of database design. You have to balance business needs against performance.

Nulls should be used if they are needed. For instance, you may have a begin date and an end date in a table. You often would not know the end date at the time the record is created. Therefore you must allow nulls whether they affect performance or not as the data is simply not there to be put in. However, if the data must, by the business rules, be there at the time the record is created, then you should not allow nulls. This would improve performance, make coding a bit simpler and make sure the data integrity is preserved.

If you have existing data that you would like to change to no longer allow nulls, then you have to consider the impact of that change. First, do you know what value you need to put into the records which are currently null? Second, do you have a lot of code that is using isnull or coalesce which you need to update (these things slow performance, so if you no longer need to check for them, you should change the code)? Do you need a default value? Can you really assign one? If not, will some of the insert or update code break because it does not account for the field no longer being nullable?

Sometimes people will put in bad information to allow them to get rid of nulls. So now the price field needs to contain decimal values and things like 'unknown', and thus can't properly be a decimal datatype, and then you have to go to all sorts of lengths in order to do calculations. This often creates performance problems as bad as or worse than those the null created. Plus, you need to go through all your code, and wherever you used a reference to the field being null or not being null, you need to rewrite it to exclude or include rows based on the possible bad values someone will put in because the data is not allowed to be null.
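
To make the cost of placeholder values concrete, here is a minimal sketch, not a recommendation for any particular schema. It uses Python's built-in sqlite3 module purely as a convenient stand-in for Oracle or SQL Server; the table names `with_null` and `with_sentinel` and the choice of 0.0 as the "unknown" placeholder are assumptions made up for this illustration. Aggregates skip NULLs on their own, while a placeholder is silently counted unless every query remembers to filter it out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: one allows NULL, one forces a 0.0 placeholder.
    CREATE TABLE with_null     (price REAL);
    CREATE TABLE with_sentinel (price REAL NOT NULL);
    INSERT INTO with_null     VALUES (10.0), (20.0), (NULL);
    INSERT INTO with_sentinel VALUES (10.0), (20.0), (0.0);
""")

# AVG ignores NULLs, so the unknown price does not distort the result.
avg_null = conn.execute("SELECT AVG(price) FROM with_null").fetchone()[0]

# The placeholder is treated as a real price and drags the average down.
avg_sentinel = conn.execute("SELECT AVG(price) FROM with_sentinel").fetchone()[0]

# Every query now has to know about the placeholder and exclude it.
avg_filtered = conn.execute(
    "SELECT AVG(price) FROM with_sentinel WHERE price <> 0.0"
).fetchone()[0]

print(avg_null, avg_sentinel, avg_filtered)  # 15.0 10.0 15.0
```

The extra WHERE clause in the last query is the kind of rewrite described above: it has to be repeated everywhere the column is used.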

I do a lot of data imports from client data, and every time we get a file where some field that should allow nulls does not, we get garbage data that needs to be cleaned up before we import it into our system. Email is one of these. Often the data is input without knowing this value, and it's generally some type of string data, so the user can type anything in here. We go to import emails and find things like "I don't know". Tough to try to actually send an email to "I don't know". If the system requires a valid email address and checks for something like the existence of an @ sign, we would get '[email protected]'. How is garbage data like this useful to the users of the data?

Some of the performance issues with nulls are a result of writing nonsargable queries. Sometimes just rearranging the where clause rather than eliminating a necessary null can improve the performance.
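
As a rough illustration of what "nonsargable" means here, the sketch below compares the plan for a predicate that wraps the column in coalesce with the plan for the same predicate written against the bare column. It runs on SQLite through Python's sqlite3 module only because that is easy to execute; Oracle and SQL Server have different planners and plan output, and the `orders` table, its `end_date` column and the index name are hypothetical. The two predicates return the same rows, since a NULL end_date matches neither.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table with a nullable, indexed column.
    CREATE TABLE orders (id INTEGER PRIMARY KEY, end_date TEXT);
    CREATE INDEX idx_orders_end_date ON orders(end_date);
""")

def plan(sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

# Wrapping the column in a function hides it from the ordinary index.
wrapped = plan(
    "SELECT id FROM orders WHERE coalesce(end_date, '') = '2020-01-01'"
)

# The bare column can be looked up through the index.
rearranged = plan(
    "SELECT id FROM orders WHERE end_date = '2020-01-01'"
)

print("wrapped:   ", wrapped)
print("rearranged:", rearranged)
```

On SQLite the first plan is reported as a scan and the second as an index search; the exact wording of the plan text varies between SQLite versions.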

Answer by Daniel Emge

Nullable fields can have a big impact on performance when doing "NOT IN" queries. Because rows with all indexed fields set to null aren't indexed in B-Tree indexes, Oracle must do a full table scan to check for null entries, even when an index exists.

For example:

create table t1 as select rownum rn from all_objects;

create table t2 as select rownum rn from all_objects;

create unique index t1_idx on t1(rn);

create unique index t2_idx on t2(rn);

delete from t2 where rn = 3;

explain plan for
select *
  from t1
 where rn not in ( select rn
                     from t2 );

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      | 50173 |   636K|  3162   (1)| 00:00:38 |
|*  1 |  FILTER            |      |       |       |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 50205 |   637K|    24   (5)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| T2   | 45404 |   576K|     2   (0)| 00:00:01 |
---------------------------------------------------------------------------

The query has to check for null values so it has to do a full table scan of t2 for each row in t1.
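
The reason the NULL check cannot be skipped is the three-valued logic of NOT IN: a single NULL in the subquery changes the answer for every row. The sketch below shows this on tiny tables named t1 and t2 like the ones above, with hand-picked values. It uses Python's sqlite3 module as a stand-in for Oracle, so it demonstrates only the semantics, not Oracle's execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (rn INTEGER);
    CREATE TABLE t2 (rn INTEGER);   -- nullable on purpose
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (1), (2);
""")

not_in = "SELECT rn FROM t1 WHERE rn NOT IN (SELECT rn FROM t2) ORDER BY rn"

# No NULL in t2: the one value missing from t2 comes back.
without_null = [row[0] for row in conn.execute(not_in)]

# One NULL in t2: "3 NOT IN (1, 2, NULL)" evaluates to UNKNOWN,
# so no row at all satisfies the predicate.
conn.execute("INSERT INTO t2 VALUES (NULL)")
with_null = [row[0] for row in conn.execute(not_in)]

# NOT EXISTS compares row by row and is not affected by the NULL.
not_exists = [row[0] for row in conn.execute("""
    SELECT rn FROM t1
     WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.rn = t1.rn)
     ORDER BY rn
""")]

print(without_null, with_null, not_exists)  # [3] [] [3]
```

Because the database has to preserve this behaviour, it cannot treat NOT IN as a plain anti-join unless it knows the columns involved can never be NULL.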

Now, if we make the fields not nullable, it can use the index.

alter table t1 modify rn not null;

alter table t2 modify rn not null;

explain plan for
select *
  from t1
 where rn not in ( select rn
                     from t2 );

-----------------------------------------------------------------------------
| Id  | Operation          | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |  2412 | 62712 |    24   (9)| 00:00:01 |
|   1 |  NESTED LOOPS ANTI |        |  2412 | 62712 |    24   (9)| 00:00:01 |
|   2 |   INDEX FULL SCAN  | T1_IDX | 50205 |   637K|    21   (0)| 00:00:01 |
|*  3 |   INDEX UNIQUE SCAN| T2_IDX | 45498 |   577K|     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Answer by David

In my experience NULL is a valid value and usually means "don't know". If you don't know then it really is pointless to make up some default value for the column or to try to enforce some NOT NULL constraint. NULL just happens to be a specific case.

The real challenge with NULLs is that they complicate retrieval a bit. For instance, you cannot say WHERE column_name IN (NULL,'value1','value2').
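
A small sketch of that retrieval problem, run on SQLite via Python's sqlite3 module as a stand-in for the databases discussed here. The table `demo` and its rows are invented for the example. Listing NULL inside IN is accepted, but it never matches a NULL row; an explicit IS NULL test is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table; row 'c' has no value in column_name.
    CREATE TABLE demo (id TEXT, column_name TEXT);
    INSERT INTO demo VALUES
        ('a', 'value1'), ('b', 'value2'), ('c', NULL), ('d', 'other');
""")

# NULL in the IN list is compared with "=", which never yields TRUE.
in_with_null = [row[0] for row in conn.execute("""
    SELECT id FROM demo
     WHERE column_name IN (NULL, 'value1', 'value2')
     ORDER BY id
""")]

# The NULL case has to be spelled out separately.
with_is_null = [row[0] for row in conn.execute("""
    SELECT id FROM demo
     WHERE column_name IN ('value1', 'value2') OR column_name IS NULL
     ORDER BY id
""")]

print(in_with_null)  # ['a', 'b']
print(with_is_null)  # ['a', 'b', 'c']
```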

Personally, if you find that a lot of your columns, or certain columns, contain a lot of NULLs, I think you might want to revisit your data model. Maybe those null columns can be put into a child table? For example: a table with phone numbers, with columns such as name, homephone, cellphone, faxno, worknumber, emergencynumber, etc. You may only populate one or two of those, and it would be better to normalize it.
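
One possible shape for that child table is sketched below, again using Python's sqlite3 module so it can be run as-is. The names `person`, `person_phone` and `phone_type`, and the sample rows, are assumptions for illustration only. Each known phone number becomes a row, so no column in either table needs to be nullable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    -- One row per known number instead of one nullable column per type.
    CREATE TABLE person_phone (
        person_id  INTEGER NOT NULL REFERENCES person(person_id),
        phone_type TEXT    NOT NULL,   -- e.g. 'home', 'cell', 'fax'
        number     TEXT    NOT NULL,
        PRIMARY KEY (person_id, phone_type)
    );
    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO person_phone VALUES
        (1, 'home', '555-0100'), (1, 'cell', '555-0101');
""")

# NULLs only appear in the result of the outer join, not in storage.
phones = conn.execute("""
    SELECT p.name, ph.phone_type, ph.number
      FROM person p
      LEFT JOIN person_phone ph ON ph.person_id = p.person_id
     ORDER BY p.name, ph.phone_type
""").fetchall()

for row in phones:
    print(row)
```

Bob has no phone rows at all, so he shows up once with empty phone fields only because of the LEFT JOIN.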

What you need to do is step back and see how the data will be accessed. Is this a column that should have a value? Is this a column that only has a value for certain cases? Is this a column that will be queried a lot?
