oracle UNIQUE 约束与 INSERT 前检查

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/21889843/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-19 02:14:38  来源:igfitidea点击:

UNIQUE constraint vs checking before INSERT

mysqlsqlsql-serveroracle

提问by Steam

I have a SQL server table RealEstate with columns - Id, Property, Property_Value. This table has about 5-10 million rows and can increase even more in the future. I want to insert a row only if a combination of Id, Property, Property_Value does not exist in this table.

我有一个带有列的 SQL 服务器表 RealEstate - Id、Property、Property_Value。该表大约有 5-1000 万行,将来可能会增加更多。仅当此表中不存在 Id、Property、Property_Value 的组合时,我才想插入一行。

Example Table -

示例表 -

1,Rooms,5
1,Bath,2
1,Address,New York
2,Rooms,2
2,Bath,1
2,Address,Miami

Inserting 2,Address,Miamishould NOT be allowed. But, 2,Price,2billionis okay. I am curious to know which is the "best" way to do this and why. The why part is most important to me. The two ways of checking are -

2,Address,Miami不应允许插入。但是,2,Price,2billion没关系。我很想知道哪种“最好”的方式来做到这一点以及为什么。为什么部分对我来说最重要。两种检查方式是——

  1. At application level - The app should check if a row exists before it inserts a row.
  2. At database level - Set unique constraints on all 3 columns and let the database do the checking instead of person/app.
  1. 在应用程序级别 - 应用程序应在插入行之前检查行是否存在。
  2. 在数据库级别 - 在所有 3 列上设置唯一约束,并让数据库而不是人员/应用程序进行检查。

Is there any scenario where one would be better than the other ?

有没有一种情况会比另一种更好?

Thanks.

谢谢。

PS:I know there is a similar question already, but it does not answer my problem - Unique constraint vs pre checkingAlso, I think that UNIQUE is applicable to all databases, so I don't think I should remove the mysql and oracle tags.

PS:我知道已经有一个类似的问题,但它没有回答我的问题 - 唯一约束与预检查另外,我认为 UNIQUE 适用于所有数据库,所以我认为我不应该删除 mysql 和 oracle 标签.

采纳答案by ivanatpr

I think it most cases the differences between that two are going to be small enough that the choice should mostly be driven by picking the implementation that ends up being most understandable to someone looking at the code for the first time.

我认为在大多数情况下,这两者之间的差异将足够小,以至于选择应该主要通过选择最终对第一次查看代码的人来说最容易理解的实现来驱动。

However, I think exception handling has a few smalladvantages:

但是,我认为异常处理有一些优势:

  • Exception handling avoids a potential race condition. The 'check, then insert' method might fail if another process inserts a record between your check and your insert. So, even if you're doing 'check then insert' you still want exception handling on the insert and if you're already doing exception handling anyways then you might as well do away with the initial check.

  • If your code is not a stored procedure and has to interact with the database via the network (i.e. the application and the db are not on the same box), then you want to avoid having two separate network calls (one for the check and the other for the insert) and doing it via exception handling provides a straightforward way of handling the whole thing with a single network call. Now, there are tons of ways to do the 'check then insert' method while still avoiding the second network call, but simply catching the exception is likely to be the simplest way to go about it.

  • 异常处理避免了潜在的竞争条件。如果另一个进程在您的检查和插入之间插入一条记录,则“检查,然后插入”方法可能会失败。因此,即使您正在执行“检查然后插入”,您仍然希望对插入进行异常处理,如果您已经在进行异常处理,那么您也可以取消初始检查。

  • 如果您的代码不是存储过程并且必须通过网络与数据库交互(即应用程序和数据库不在同一个框中),那么您希望避免有两个单独的网络调用(一个用于检查和其他用于插入)并通过异常处理来完成它提供了一种通过单个网络调用处理整个事情的直接方法。现在,有很多方法可以在避免第二次网络调用的同时执行“检查然后插入”方法,但简单地捕获异常可能是最简单的方法。

On the other hand, exception handling requires a unique constraint (which is really a unique index), which comes with a performance tradeoff:

另一方面,异常处理需要一个唯一约束(它实际上是一个唯一索引),它伴随着性能权衡:

  • Creating a unique constraint will be slow on very large tables and it will cause a performance hit on every single insert to that table. On truly large databases you also have to budget for the extra disk space consumed by the unique index used to enforce the constraint.
  • On the other hand, it might make selecting from the table faster if your queries can take advantage of that index.
  • 在非常大的表上创建唯一约束会很慢,并且会导致对该表的每次插入都会影响性能。在真正的大型数据库上,您还必须为用于强制约束的唯一索引消耗的额外磁盘空间做预算。
  • 另一方面,如果您的查询可以利用该索引,则可能会更快地从表中进行选择。

I'd also note that if you're in a situation where what you actually want to do is 'update else insert' (i.e. if a record with the unique value already exists then you want to update that record, else you insert a new record) then what you actually want to use is your particular database's UPSERT method, if it has one. For SQL Server and Oracle, this would be a MERGE statement.

我还要注意的是,如果您处于这样一种情况,您实际上想要做的是“更新其他插入”(即,如果具有唯一值的记录已经存在,那么您想更新该记录,否则您插入一个新的记录)那么你真正想要使用的是你的特定数据库的 UPSERT 方法,如果它有的话。对于 SQL Server 和 Oracle,这将是一个 MERGE 语句。

回答by Ben

Dependent on the cost of #1 (doing a lookup) being reasonable, I would do both. At least, in Oracle, which is the database I have the most experience with.

取决于#1(进行查找)的成本是否合理,我会两者都做。至少,在 Oracle 中,这是我最有经验的数据库。

Rationale:

理由:

  • Unique/primary keys should be a core part of your data model design, I can't see any reason to not implement them - if you have so much data that performance suffers from maintaining the unique index:
    • that's a lotof data
    • partition it or archive it away from your OLTP work
  • The more constraints you have, the safer your data is against application logic errors.
  • If you check that a row exists first, you can easily extract other information from that row to use as part of an error message, or otherwise fork the application logic to cope with the duplication.
  • In Oracle, rolling back DML statements is relatively expensive because Oracle expects to succeed (i.e. COMMITchanges that have been written) by default.
  • 唯一键/主键应该是数据模型设计的核心部分,我看不出有什么理由不实现它们——如果你有太多的数据以至于维护唯一索引会影响性能:
    • 这是很多数据
    • 将其分区或将其存档,使其远离 OLTP 工作
  • 您拥有的约束越多,您的数据就越能防止应用程序逻辑错误。
  • 如果您首先检查某行是否存在,您可以轻松地从该行中提取其他信息以用作错误消息的一部分,或者以其他方式分叉应用程序逻辑以处理重复。
  • 在 Oracle 中,回滚 DML 语句的成本相对较高,因为 Oracle 期望在COMMIT默认情况下成功(即已写入的更改)。

回答by Steam

This does not answer the question directly, but I thought it might be helpful to post it here since its better than wikipedia and the link might just become dead someday.

这并没有直接回答问题,但我认为将其发布在这里可能会有所帮助,因为它比维基百科更好,而且该链接可能有一天会失效。

Link - http://www.celticwolf.com/blog/2010/04/27/what-is-a-race-condition/

链接 - http://www.celticwolf.com/blog/2010/04/27/what-is-a-race-condition/

Wikipedia has a good description of a race condition, but it's hard to follow if you don't understand the basics of programming. I'm going to try to explain it in less technical terms, using the example of generating an identifier as described above. I'll also use analogies to human activities to try to convey the ideas.

维基百科对竞争条件有很好的描述,但如果你不了解编程的基础知识,就很难理解。我将尝试使用较少的技术术语来解释它,使用如上所述的生成标识符的示例。我还将使用人类活动的类比来尝试传达这些想法。

A race condition is when two or more programs (or independent parts of a single program) all try to acquire some resource at the same time, resulting in an incorrect answer or conflict. This resource can be information, like the next available appointment time, or it can be exclusive access to something, like a spreadsheet. If you've ever used Microsoft Excel to edit a document on a shared drive, you've probably had the experience of being told by Excel that someone else was already editing the spreadsheet. This error message is Excel's way of handling the potential race condition gracefully and preventing errors.

竞争条件是当两个或多个程序(或单个程序的独立部分)都尝试同时获取某些资源时,导致错误答案或冲突。此资源可以是信息,例如下一个可用的约会时间,也可以是对某些内容的独占访问,例如电子表格。如果您曾经使用 Microsoft Excel 编辑共享驱动器上的文档,您可能有过被 Excel 告知其他人已经在编辑电子表格的经历。此错误消息是 Excel 优雅地处理潜在竞争条件和防止错误的方式。

A common task for programs is to identify the next available value of some sort and then assign it. This technique is used for invoice numbers, student IDs, etc. It's an old problem that has been solved before. One of the most common solutions is to allow the database that is storing the data to generate the number. There are other solutions, and they all have their strengths and weaknesses.

程序的一个常见任务是识别某种类型的下一个可用值,然后分配它。这种技术用于发票号、学生证等,这是一个以前已经解决的老问题。最常见的解决方案之一是允许存储数据的数据库生成数字。还有其他解决方案,它们都有自己的优点和缺点。

Unfortunately, programmers who are ignorant of this area or simply bad at programming frequently try to roll their own. The smart ones discover quickly that it's a much more complex problem than it seems and look for existing solutions. The bad ones never see the problem or, once they do, insist on making their unworkable solution ever more complex without fixing the error. Let's take the example of a student ID. The neophyte programmer says “to know what the next student number should be, we'll just get the last student number and increment it.” Here's what happens under the hood:

不幸的是,对这个领域一无所知或根本不擅长编程的程序员经常尝试自己动手。聪明的人很快就会发现这是一个比看起来要复杂得多的问题,并会寻找现有的解决方案。坏的人永远不会看到问题,或者一旦他们看到了,就坚持让他们不可行的解决方案变得更加复杂而不修复错误。我们以学生证为例。新手程序员说“要知道下一个学生编号应该是多少,我们只需获取最后一个学生编号并将其递增。” 这是幕后发生的事情:

  1. Betty, an admin. assistant in the admissions office fires up the student management program. Note that this is really just a copy of the program that runs on her PC. It talks to the database server over the school's network, but has no way to talk to other copies of the program running on other PCs.
  2. Betty creates a new student record for Bob Smith, entering all of the information.
  3. While Betty is doing her data entry, George, another admin. assistant, fires up the student management program on his PC and begins creating a record for Gina Verde.
  4. George is a faster typist, so he finishes at the same time as Betty. They both hit the “Save” button at the same time.
  5. Betty's program connects to the database server and gets the highest student number in use, 5012.
  6. George's program, at the same time, gets the same answer to the same question.
  7. Both programs decide that the new student ID for the record that they're saving should be 5013. They add that information to the record and then save it in the database.
  8. Now Bob Smith (Betty's student) and Gina Verde (George's student) have the same student ID.
  1. 贝蒂,管理员。招生办公室的助理启动了学生管理计划。请注意,这实际上只是在她的 PC 上运行的程序的副本。它通过学校网络与数据库服务器通信,但无法与其他 PC 上运行的程序的其他副本通信。
  2. Betty 为 Bob Smith 创建一个新的学生记录,输入所有信息。
  3. Betty 正在输入数据时,另一位管理员 George。助理,在他的 PC 上启动学生管理程序并开始为 Gina Verde 创建记录。
  4. 乔治打字速度更快,所以他和贝蒂同时打完。他们都同时按下了“保存”按钮。
  5. Betty 的程序连接到数据库服务器并获得使用中的最高学生号 5012。
  6. 同时,George 的程序对同样的问题得到了同样的答案。
  7. 两个程序都决定他们正在保存的记录的新学生 ID 应该是 5013。他们将该信息添加到记录中,然后将其保存在数据库中。
  8. 现在 Bob Smith(Betty 的学生)和 Gina Verde(George 的学生)拥有相同的学生 ID。

This student ID will be attached to all sorts of other records, from grades to meal cards for the dining hall. Eventually this problem will come to light and someone will have to spend a lot of time assigning one of them a new ID and sorting out the mixed-up records.

该学生证将附在各种其他记录上,从成绩到食堂的餐卡。最终这个问题会暴露出来,有人将不得不花费大量时间为其中一个人分配一个新的 ID 并整理混乱的记录。

When I describe this problem to people, the usual reaction is “But how often will that happen in practice? Never, right?”. Wrong. First, when data entry is being done by your staff, it's generally done during a relatively small period of time by everyone. This increases the chances of an overlap. If the application in question is a web application open to the general public, the chances of two people hitting the “Save” button at the same time are even higher. I saw this in a production system recently. It was a web application in public beta. The usage rate was quite low, with only a few people signing up every day. Nevertheless, six pairs of people managed to get identical IDs over the space of a few months. In case you're wondering, no, neither I nor anyone from my team wrote that code. We were quite surprised, however, at how many times that problem occurred. In hindsight, we shouldn't have been. It's really a simple application of Murphy's Law.

当我向人们描述这个问题时,通常的反应是“但在实践中这种情况多久发生一次?从来没有,对吧?”。错误的。首先,当您的员工进行数据输入时,通常每个人都会在相对较短的时间内完成。这增加了重叠的机会。如果所讨论的应用程序是对公众开放的 Web 应用程序,则两个人同时点击“保存”按钮的机会更高。我最近在生产系统中看到了这一点。这是一个公共测试版的网络应用程序。使用率很低,每天只有几个人注册。尽管如此,在几个月的时间里,有六对人设法获得了相同的 ID。如果您想知道,不,我和我的团队中的任何人都没有编写该代码。然而,我们感到非常惊讶,该问题发生的次数。事后看来,我们不应该如此。这实际上是墨菲定律的一个简单应用。

How can this problem be avoided? The easiest way is to use an existing solution to the problem that has been well tested. All of the major databases (MS SQL Server, Oracle, MySQL, PostgreSQL, etc.) have a way to increment numbers without creating duplicates. MS SQL server calls it an “identity” column, while MySQL calls it an “auto number” column, but the function is the same. Whenever you insert a new record, a new identifier is automatically created and is guaranteed to be unique. This would change the above scenario as follows:

如何避免这个问题?最简单的方法是使用已经过测试的问题的现有解决方案。所有主要数据库(MS SQL Server、Oracle、MySQL、PostgreSQL 等)都有一种增加数字而不创建重复项的方法。MS SQL Server 称其为“身份”列,而 MySQL 称其为“自动编号”列,但功能相同。每当您插入新记录时,都会自动创建一个新标识符并保证是唯一的。这将改变上述情况如下:

  1. Betty, an admin. assistant in the admissions office fires up the student management program. Note that this is really just a copy of the program that runs on her PC. It talks to the database server over the school's network, but has no way to talk to other copies of the program running on other PCs.
  2. Betty creates a new student record for Bob Smith, entering all of the information.
  3. While Betty is doing her data entry, George, another admin. assistant, fires up the student management program on his PC and begins creating a record for Gina Verde.
  4. George is a faster typist, so he finishes at the same time as Betty. They both hit the “Save” button at the same time.
  5. Betty's program connects to the database server and hands it the record to be saved.
  6. George's program, at the same time, hands over the other record to be saved.
  7. The database server puts both records into a queue and saves them one at a time, assigning the next available number to them.
  8. Now Bob Smith (Betty's student) gets ID 5013 and Gina Verde (George's student) gets id 5014.
  1. 贝蒂,管理员。招生办公室的助理启动了学生管理计划。请注意,这实际上只是在她的 PC 上运行的程序的副本。它通过学校网络与数据库服务器通信,但无法与其他 PC 上运行的程序的其他副本通信。
  2. Betty 为 Bob Smith 创建一个新的学生记录,输入所有信息。
  3. Betty 正在输入数据时,另一位管理员 George。助理,在他的 PC 上启动学生管理程序并开始为 Gina Verde 创建记录。
  4. 乔治打字速度更快,所以他和贝蒂同时打完。他们都同时按下了“保存”按钮。
  5. Betty 的程序连接到数据库服务器并将要保存的记录交给它。
  6. 同时,乔治的程序将另一条记录移交给要保存的记录。
  7. 数据库服务器将两条记录放入一个队列中并一次保存一条,并为它们分配下一个可用编号。
  8. 现在 Bob Smith(Betty 的学生)的 ID 为 5013,Gina Verde(George 的学生)的 ID 为 5014。

With this solution, there is no problem with duplication. The code that does this for each database server has been tested repeatedly over the years, both by the manufacturer and by users. Millions of applications around the world rely on it and continue to stress test it every day. Can anyone say the same about their homegrown solution?

有了这个解决方案,重复就没有问题了。多年来,制造商和用户都反复测试了为每个数据库服务器执行此操作的代码。全球数以百万计的应用程序依赖它,并每天继续对其进行压力测试。任何人都可以对他们自己开发的解决方案说同样的话吗?

There is at least one well tested way to create identifiers in the software rather than in the database: uuids (Universally Unique Identifiers). However, a uuid takes the form of xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxwhere “x” stands for a hexadecimal digit (0-9 and a-f). Do you want to use that for an invoice number, student ID or some other identifier seen by the public? Probably not.

至少有一种经过充分测试的方法可以在软件中而不是在数据库中创建标识符:uuids(通用唯一标识符)。但是,uuid 的形式是xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx“x”代表十六进制数字(0-9 和 af)。您想将其用于发票编号、学生证或公众看到的其他一些标识符吗?可能不是。

To summarize, a race condition occurs when two programs, or two independent parts of a program, attempt to access some information or access a resource at the same time, resulting in an error, be it an incorrect calculation, a duplicated identifier or conflicting access to a resource. There are many more types of race conditions than I've presented here and they affect many other areas of software and hardware.

总而言之,当两个程序或程序的两个独立部分试图同时访问某些信息或访问资源时,会发生竞争条件,从而导致错误,无论是计算不正确、标识符重复还是访问冲突到一个资源。竞争条件的类型比我在此介绍的要多得多,它们会影响软件和硬件的许多其他领域。

回答by mjuarez

The description of your problem is exactly why primary keys can be compound, e.g., they consist of multiple fields. That way, the database will handle the uniqueness for you, and you don't need to care about it.

您的问题的描述正是主键可以复合的原因,例如,它们由多个字段组成。这样,数据库将为您处理唯一性,您无需关心它。

In your case, the table definition could be something similar to the following like:

在您的情况下,表定义可能类似于以下内容:

 CREATE TABLE `real_estate` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `property` varchar(255) DEFAULT NULL,
   `property_value` varchar(255) DEFAULT NULL,
   PRIMARY KEY (`id`),
   UNIQUE KEY `index_id_property_property_value` (`id`, `property`, `property_value`),
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8;