SQL: select * vs select column

Note: this page is a Chinese-English side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/3180375/


select * vs select column

sql performance

Asked by Neel Basu

If I need just 2/3 columns and I query SELECT * instead of naming those columns in the select query, is there any performance degradation in terms of more or less I/O or memory?

The network overhead might be present if I do select * without a need.

But in a select operation, does the database engine always pull atomic tuple from the disk, or does it pull only those columns requested in the select operation?

If it always pulls a tuple then I/O overhead is the same.

At the same time, if it pulls a tuple, stripping the requested columns out of that tuple might consume some memory.

So if that's the case, select someColumn will have more memory overhead than select *.

Accepted answer by Charles Bretana

It always pulls a tuple (except in cases where the table has been vertically segmented, that is, broken up into column pieces), so, to answer the question you asked, it doesn't matter from a performance perspective. However, for many other reasons (below), you should always select specifically those columns you want, by name.

It always pulls a tuple because (in every vendor's RDBMS I am familiar with) the underlying on-disk storage structure for everything (including table data) is based on defined I/O pages (in SQL Server, for example, each page is 8 kilobytes). And every I/O read or write is by page; i.e., every write or read is a complete page of data.
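
To make the page-granularity point concrete, here is a rough back-of-the-envelope sketch. The 8 KB page size comes from the SQL Server figure above; the 200-byte row width is a made-up number, and real engines also spend part of each page on headers and per-row overhead, which this ignores.

```python
# Rough page arithmetic. PAGE_SIZE matches the 8 KB SQL Server page mentioned
# above; ROW_SIZE is a hypothetical average row width, and page/row overhead
# is ignored.
PAGE_SIZE = 8 * 1024   # bytes per I/O page
ROW_SIZE = 200         # hypothetical average row width in bytes

rows_per_page = PAGE_SIZE // ROW_SIZE


def pages_for_rows(row_count: int) -> int:
    """Pages needed to hold row_count contiguous rows (ceiling division)."""
    return -(-row_count // rows_per_page)


print(rows_per_page)         # 40
print(pages_for_rows(1))     # 1: reading a single row still costs a whole page
print(pages_for_rows(1000))  # 25
```

Under these assumptions, the number of pages read depends on how many rows are touched, not on how many columns the select list names.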

Because of this underlying structural constraint, each row of data in a database must always be on one and only one page. It cannot span multiple pages of data (except for special things like blobs, where the actual blob data is stored in separate page-chunks, and the table row column then holds only a pointer). But these exceptions are just that, exceptions, and generally do not apply except in special cases (for special types of data, or certain optimizations for special circumstances).
Even in these special cases, the actual table row itself (which contains the pointer to the actual data for the blob, or whatever) must generally be stored on a single I/O page.

EXCEPTION. The only place where Select * is OK is in the sub-query after an Exists or Not Exists predicate clause, as in:

   Select colA, colB
   From table1 t1
   Where Exists (Select * From Table2
                 Where column = t1.colA)
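
As a quick check of why the select list inside Exists does not matter, the sketch below runs the same query with Select * and with Select 1 in the sub-query. It uses Python's sqlite3 module with an in-memory database as a stand-in for whatever RDBMS you use; the column names col and payload and the sample rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (colA INTEGER, colB TEXT);
    CREATE TABLE table2 (col INTEGER, payload TEXT);   -- hypothetical columns
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (1, 'x'), (3, 'y');
""")

# EXISTS only asks whether at least one matching row exists.
star = conn.execute("""
    SELECT colA, colB FROM table1 t1
    WHERE EXISTS (SELECT * FROM table2 t2 WHERE t2.col = t1.colA)
    ORDER BY colA
""").fetchall()

one = conn.execute("""
    SELECT colA, colB FROM table1 t1
    WHERE EXISTS (SELECT 1 FROM table2 t2 WHERE t2.col = t1.colA)
    ORDER BY colA
""").fetchall()

print(star)         # [(1, 'a'), (3, 'c')]
print(star == one)  # True
```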

EDIT: To address @Mike Sherer's comment: yes, it is true, both technically (with a bit of definition for your special case) and aesthetically. First, even when the set of columns requested is a subset of those stored in some index, the query processor must fetch every column stored in that index, not just the ones requested, for the same reason: ALL I/O must be done in pages, and index data is stored in I/O pages just like table data. So if you define "tuple" for an index page as the set of columns stored in the index, the statement is still true.
And the statement is true aesthetically because the point is that it fetches data based on what is stored in the I/O page, not on what you ask for, and this is true whether you are accessing the base table I/O page or an index I/O page.

For other reasons not to use Select *, see "Why is SELECT * considered harmful?".

Answer by marc_s

There are several reasons you should never (never ever) use SELECT * in production code:

  • since you're not giving your database any hints as to what you want, it will first need to check the table's definition in order to determine the columns on that table. That lookup will cost some time - not much in a single query - but it adds up over time

  • if you need only 2/3 of the columns, you're selecting 1/3 too much data, which needs to be retrieved from disk and sent across the network

  • if you start to rely on certain aspects of the data, e.g. the order of the columns returned, you could get a nasty surprise once the table is reorganized and new columns are added (or existing ones removed)

  • in SQL Server (not sure about other databases), if you need a subset of columns, there's always a chance a non-clustered index might be covering that request (contain all columns needed). With a SELECT *, you're giving up on that possibility right from the get-go. In this particular case, the data would be retrieved from the index pages (if those contain all the necessary columns), and thus disk I/O and memory overhead would be much less compared to doing a SELECT * query.

Yes, it takes a bit more typing initially (tools like SQL Prompt for SQL Server will even help you there), but this is really one case where there's a rule without any exception: do not ever use SELECT * in your production code. EVER.

Answer by Donnie

You should always select only the columns that you actually need. It is never less efficient to select less instead of more, and you also run into fewer unexpected side effects, like accessing your result columns on the client side by index, then having those indexes become incorrect when a new column is added to the table.

[edit]: Meant accessing. Stupid brain still waking up.

Answer by gxti

Unless you're storing large blobs, performance isn't a concern. The big reason not to use SELECT * is that if you're using returned rows as tuples, the columns come back in whatever order the schema happens to specify, and if that changes you will have to fix all your code.

On the other hand, if you use dictionary-style access then it doesn't matter what order the columns come back in because you are always accessing them by name.
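
A small sketch of dictionary-style access, using Python's sqlite3 module and its sqlite3.Row row factory as one example of a client library that supports lookup by column name. The users table and its data are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row   # rows allow lookup by column name
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, telephone TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'John', '555-0100')")

queries = (
    "SELECT id, name, telephone FROM users",
    "SELECT telephone, name, id FROM users",   # same data, different column order
)
by_position = []
by_name = []
for query in queries:
    row = conn.execute(query).fetchone()
    by_position.append(row[0])           # depends on the column order
    by_name.append(row["telephone"])     # does not

print(by_position)  # [1, '555-0100']
print(by_name)      # ['555-0100', '555-0100']
```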

Answer by Richard JP Le Guen

This immediately makes me think of a table I was using which contained a column of type blob; it usually contained a JPEG image, a few MB in size.

Needless to say, I didn't SELECT that column unless I really needed it. Having that data floating around, especially when I selected multiple rows, was just a hassle.

However, I will admit that I otherwise usually query for all the columns in a table.

Answer by Will Hartung

During a SQL select, the DB is always going to refer to the metadata for the table, regardless of whether it's SELECT * or SELECT a, b, c... Why? Because that's where the information on the structure and layout of the table on the system is.

It has to read this information for two reasons. One, to simply compile the statement. It needs to make sure you specify an existing table at the very least. Also, the database structure may have changed since the last time a statement was executed.

Now, obviously, DB metadata is cached in the system, but it's still processing that needs to be done.

Next, the metadata is used to generate the query plan. This happens each time a statement is compiled as well. Again, this runs against cached metadata, but it's always done.

The only time this processing is not done is when the DB is using a pre-compiled query, or has cached a previous query. This is the argument for using binding parameters rather than literal SQL. "SELECT * FROM TABLE WHERE key = 1" is a different query than "SELECT * FROM TABLE WHERE key = ?" and the "1" is bound on the call.
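
A minimal sketch of the difference between bound parameters and literal SQL, using Python's sqlite3 module (which uses ? as its placeholder). Whether a given database actually reuses a compiled plan for the repeated statement depends on the engine and driver; the sketch only shows that the bound form keeps one statement text while the literal form produces a new text per key. The table and sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

# One statement text, reused with different bound values.
QUERY = "SELECT id, name FROM customer WHERE id = ?"
results = [conn.execute(QUERY, (key,)).fetchone() for key in (1, 2)]
print(results)  # [(1, 'Ann'), (2, 'Bob')]

# The literal form yields a distinct statement text for every key value.
literal_queries = {f"SELECT id, name FROM customer WHERE id = {key}" for key in (1, 2)}
print(len(literal_queries))  # 2
```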

DBs rely heavily on page caching for their work. Many modern DBs are small enough to fit completely in memory (or, perhaps I should say, modern memory is large enough to fit many DBs). Then your primary I/O cost on the back end is logging and page flushes.

However, if you're still hitting the disk for your DB, a primary optimization done by many systems is to rely on the data in indexes, rather than the tables themselves.

If you have:

CREATE TABLE customer (
    id INTEGER NOT NULL PRIMARY KEY,
    name VARCHAR(150) NOT NULL,
    city VARCHAR(30),
    state VARCHAR(30),
    zip VARCHAR(10));

CREATE INDEX k1_customer ON customer(id, name);

Then if you do "SELECT id, name FROM customer WHERE id = 1", it is very likely that your DB will pull this data from the index, rather than from the table.

Why? It will likely use the index anyway to satisfy the query (vs a table scan), and even though 'name' isn't used in the where clause, that index will still be the best option for the query.

Now the database has all of the data it needs to satisfy the query, so there's no reason to hit the table pages themselves. Using the index results in less disk traffic since you have a higher density of rows in the index vs the table in general.

This is a hand-wavy explanation of a specific optimization technique used by some databases. Many have several optimization and tuning techniques.

In the end, SELECT * is useful for dynamic queries you have to type by hand, but I'd never use it for "real code". Identifying individual columns gives the DB more information that it can use to optimize the query, and gives you better control in your code against schema changes, etc.

Answer by M.Torres

I think there is no exact answer to your question, because you have to weigh performance against the ease of maintaining your apps. select column performs better than select *, but if you are developing an object-oriented system, you will want to use object.properties, and you may need a property in any part of the app. If you don't use select * and populate all properties, you will need to write more methods to fetch properties in special situations. Your apps should perform well enough using select *, and in some cases you will need to use select column to improve performance. Then you get the best of both worlds: ease of writing and maintaining apps, and performance when you need it.

Answer by symcbean

The accepted answer here is wrong. I came across this when another question was closed as a duplicate of this one (while I was still writing my answer, grr, hence the SQL below references the other question).

You should always use SELECT attribute, attribute.... NOT SELECT *

It's primarily for performance issues.

SELECT name FROM users WHERE name='John';

Is not a very useful example. Consider instead:

SELECT telephone FROM users WHERE name='John';

If there's an index on (name, telephone), then the query can be resolved without having to look up the relevant values from the table: there is a covering index.
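
One way to see a covering index at work is to ask the database for its query plan. The sketch below does this with SQLite through Python's sqlite3 module, purely as a convenient stand-in; other engines expose plans through their own tools and may choose differently. The extra id and email columns and the index name idx_users_name_tel are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, telephone TEXT, email TEXT);
    CREATE INDEX idx_users_name_tel ON users(name, telephone);
    INSERT INTO users VALUES (1, 'John', '555-0100', 'john@example.com');
""")


def plan(query: str) -> str:
    """Return SQLite's query-plan description for the given statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return " | ".join(row[3] for row in rows)   # column 3 holds the detail text


narrow = plan("SELECT telephone FROM users WHERE name = 'John'")
star = plan("SELECT * FROM users WHERE name = 'John'")

print(narrow)
print(star)
print("COVERING INDEX" in narrow)  # True: answered from the index alone
print("COVERING INDEX" in star)    # False: the table rows must be visited too
```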

Further, suppose the table has a BLOB containing a picture of the user, and an uploaded CV, and a spreadsheet... Using SELECT * will pull all this information back into the DBMS buffers (forcing other useful information out of the cache). Then it will all be sent to the client, using up time on the network and memory on the client for data which is redundant.
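
To get a feel for the amount of redundant data involved, the sketch below compares the payload a client receives from a narrow query and from SELECT *. It uses an in-memory SQLite database via Python's sqlite3 module, so nothing crosses a real network; the picture column and the 1 MiB zero-filled blob are stand-ins invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, picture BLOB)")
# A 1 MiB zero-filled blob stands in for an uploaded picture.
conn.execute("INSERT INTO users VALUES (1, 'John', ?)", (bytes(1024 * 1024),))


def payload_size(query: str) -> int:
    """Approximate size of the text and blob values returned by the query."""
    total = 0
    for row in conn.execute(query):
        for value in row:
            if isinstance(value, (bytes, str)):
                total += len(value)
    return total


narrow_size = payload_size("SELECT name FROM users WHERE id = 1")
star_size = payload_size("SELECT * FROM users WHERE id = 1")
print(narrow_size)  # 4
print(star_size)    # 1048580
```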

It can also cause functional issues if the client retrieves the data as an enumerated array (such as PHP's mysql_fetch_array($x, MYSQL_NUM)). Maybe when the code was written 'telephone' was the third column to be returned by SELECT *, but then someone comes along and decides to add an email address to the table, positioned before 'telephone'. The desired field is now shifted to the 4th column.
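
The column-shift problem can be reproduced in a few lines. The sketch below uses Python's sqlite3 module rather than PHP, and models the "before" and "after" schemas as two separate throwaway tables, since SQLite's ALTER TABLE can only append columns at the end. The helper third_column and the sample data are invented for the illustration.

```python
import sqlite3


def third_column(create_sql: str, insert_sql: str, query: str):
    """Build a throwaway users table and return index 2 of the first result row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_sql)
    conn.execute(insert_sql)
    return conn.execute(query).fetchone()[2]


# Original layout: telephone is the third column.
old_schema = "CREATE TABLE users (id INTEGER, name TEXT, telephone TEXT)"
old_insert = "INSERT INTO users VALUES (1, 'John', '555-0100')"

# Later layout: email was added before telephone.
new_schema = "CREATE TABLE users (id INTEGER, name TEXT, email TEXT, telephone TEXT)"
new_insert = "INSERT INTO users VALUES (1, 'John', 'john@example.com', '555-0100')"

before = third_column(old_schema, old_insert, "SELECT * FROM users")
after = third_column(new_schema, new_insert, "SELECT * FROM users")
print(before)  # 555-0100
print(after)   # john@example.com

# An explicit column list pins the positions regardless of the table layout.
pinned = third_column(new_schema, new_insert, "SELECT id, name, telephone FROM users")
print(pinned)  # 555-0100
```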

Answer by Chris Travers

There are reasons for doing things either way. I use SELECT * a lot on PostgreSQL because there are a lot of things you can do with SELECT * in PostgreSQL that you can't do with an explicit column list, particularly when in stored procedures. Similarly in Informix, SELECT * over an inherited table tree can give you jagged rows while an explicit column list cannot because additional columns in child tables are returned as well.

The main reason why I do this in PostgreSQL is that it ensures that I get a well-formed type specific to a table. This allows me to take the results and use them as the table type in PostgreSQL. This also allows for many more options in the query than a rigid column list would.

On the other hand, a rigid column list gives you an application-level check that db schemas haven't changed in certain ways and this can be helpful. (I do such checks on another level.)

As for performance, I tend to use VIEWs and stored procedures returning types (and then a column list inside the stored procedure). This gives me control over what types are returned.

But keep in mind I am using SELECT * usually against an abstraction layer rather than base tables.

Answer by Anvesh

Reference taken from this article:

Without SELECT *: When you use "SELECT *", you are selecting more columns from the database, and some of these columns might not be used by your application. This creates extra cost and load on the database system, and more data travels across the network.

With SELECT *: If you have special requirements and have created a dynamic environment in which added or deleted columns are handled automatically by application code, you don't need to change the application and database code, and the change automatically takes effect in the production environment. In this special case you can use "SELECT *".
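
A minimal sketch of what such a dynamic environment could look like: application code that discovers the column names from the result set at run time, so an added column flows through without a code change. It uses Python's sqlite3 module and the DB-API cursor.description attribute; the settings table and the fetch_as_dicts helper are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (id INTEGER, label TEXT)")
conn.execute("INSERT INTO settings VALUES (1, 'dark mode')")


def fetch_as_dicts(query: str) -> list:
    """Run the query and key each row by whatever columns came back."""
    cursor = conn.execute(query)
    names = [col[0] for col in cursor.description]   # column names of the result set
    return [dict(zip(names, row)) for row in cursor]


first = fetch_as_dicts("SELECT * FROM settings")
print(first)   # [{'id': 1, 'label': 'dark mode'}]

# A new column appears in the output without touching fetch_as_dicts.
conn.execute("ALTER TABLE settings ADD COLUMN enabled INTEGER DEFAULT 1")
second = fetch_as_dicts("SELECT * FROM settings")
print(second)  # [{'id': 1, 'label': 'dark mode', 'enabled': 1}]
```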
