Is storing a delimited list in a database column really that bad?

Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/3653462/

Time: 2020-09-08 07:51:02  Source: igfitidea

Is storing a delimited list in a database column really that bad?

Tags: database, database-design, database-normalization

Asked by Mad Scientist

Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.

Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.

I thought the saved time and simpler code were worth it in my situation. Is this a defensible design choice, or should I have normalized it from the start?

Some more context: this is a small internal application that essentially replaces an Excel file that was stored in a shared folder. I'm also asking because I'm thinking about cleaning up the program and making it more maintainable. There are some things in there I'm not entirely happy with; one of them is the topic of this question.

Answered by Bill Karwin

In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:

  • Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
  • Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
  • Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
  • Can't delete a value from the list without fetching the whole list.
  • Can't store a list longer than what fits in the string column.
  • Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
    idlist REGEXP '[[:<:]]2[[:>:]]'*
  • Hard to count elements in the list, or do other aggregate queries.
  • Hard to join the values to the lookup table they reference.
  • Hard to fetch the list in sorted order.

To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
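
As a rough illustration of what the RDBMS gives you once the list is normalized, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (`choice`, `form_entry`, `form_entry_choice`) are invented for this example, and SQLite only stands in for whatever database you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE choice (              -- lookup table: one row per check box
        id    INTEGER PRIMARY KEY,
        label TEXT NOT NULL UNIQUE
    );
    CREATE TABLE form_entry (
        id INTEGER PRIMARY KEY
    );
    CREATE TABLE form_entry_choice (   -- one row per ticked box
        entry_id  INTEGER NOT NULL REFERENCES form_entry(id),
        choice_id INTEGER NOT NULL REFERENCES choice(id),
        PRIMARY KEY (entry_id, choice_id)
    );
""")
conn.executemany("INSERT INTO choice (id, label) VALUES (?, ?)",
                 [(1, "email"), (2, "phone"), (3, "post")])
conn.execute("INSERT INTO form_entry (id) VALUES (1)")
insert = "INSERT INTO form_entry_choice (entry_id, choice_id) VALUES (?, ?)"
conn.executemany(insert, [(1, 3), (1, 1)])


def rejected(sql, params):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql, params)
        return False
    except sqlite3.IntegrityError:
        return True


duplicate_rejected = rejected(insert, (1, 3))       # primary key: no 1,2,3,3,3,5
unknown_rejected = rejected(insert, (1, 99))        # foreign key: no such choice
banana_rejected = rejected(insert, (1, "banana"))   # foreign key again: no matching id

# Fetch the list in sorted order, joined to the lookup table.
labels = [row[0] for row in conn.execute(
    "SELECT c.label FROM form_entry_choice ec "
    "JOIN choice c ON c.id = ec.choice_id "
    "WHERE ec.entry_id = ? ORDER BY c.label", (1,))]

# Count the elements of the list.
count = conn.execute(
    "SELECT COUNT(*) FROM form_entry_choice WHERE entry_id = ?", (1,)).fetchone()[0]

# Delete one value without fetching the rest of the list.
conn.execute("DELETE FROM form_entry_choice WHERE entry_id = ? AND choice_id = ?", (1, 3))
remaining = [row[0] for row in conn.execute(
    "SELECT choice_id FROM form_entry_choice WHERE entry_id = ?", (1,))]
```

Each problem in the list maps to a constraint or a one-line query here; none of it needs string parsing in application code.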

Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns: Avoiding the Pitfalls of Database Programming.

There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.



*MySQL 8.0 no longer supports this word-boundary expression syntax.

Answered by Hammerite

"One reason was laziness".

This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.

Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.

(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)

Answered by OMG Ponies

There are numerous questions on SO asking:

  • how to get a count of specific values from the comma separated list
  • how to get records that have only the same 2/3/etc specific value from that comma separated list

Another problem with the comma separated list is ensuring the values are consistent - storing text means the possibility of typos...

These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
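
For comparison, the two kinds of question listed above become short queries once each value has its own row. This is only a sketch: the `entry_value` table and its sample data are hypothetical, and it uses Python's sqlite3 module so it can be run as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entry_value (
        entry_id INTEGER NOT NULL,
        value    TEXT NOT NULL,
        PRIMARY KEY (entry_id, value)
    )
""")
conn.executemany("INSERT INTO entry_value VALUES (?, ?)", [
    (1, "red"), (1, "green"),
    (2, "red"),
    (3, "red"), (3, "green"), (3, "blue"),
])

# How many entries carry each value?
counts = dict(conn.execute("SELECT value, COUNT(*) FROM entry_value GROUP BY value"))

wanted = ["red", "green"]
placeholders = ", ".join("?" for _ in wanted)

# Entries that have at least all of the wanted values.
at_least = [row[0] for row in conn.execute(
    f"SELECT entry_id FROM entry_value WHERE value IN ({placeholders}) "
    "GROUP BY entry_id HAVING COUNT(*) = ? ORDER BY entry_id",
    (*wanted, len(wanted)))]

# Entries that have exactly the wanted values and nothing else.
exactly = [row[0] for row in conn.execute(
    "SELECT entry_id FROM entry_value GROUP BY entry_id "
    f"HAVING SUM(value IN ({placeholders})) = ? AND COUNT(*) = ? "
    "ORDER BY entry_id",
    (*wanted, len(wanted), len(wanted)))]
```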

Answered by bobbymcr

In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...

In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
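
The bit-flag idea could look roughly like this on the application side. The `Contact` class and its members are invented for the example; the integer in `stored` is what would go into a single integer column.

```python
from enum import IntFlag


class Contact(IntFlag):
    """Hypothetical check boxes, one bit each."""
    EMAIL = 1
    PHONE = 2
    POST = 4


selected = Contact.EMAIL | Contact.POST   # boxes ticked on the form
stored = int(selected)                    # value to write to the integer column

restored = Contact(stored)                # value read back from the column
has_phone = Contact.PHONE in restored
has_post = Contact.POST in restored
```

The same maintenance caveat applies: the meaning of each bit lives in the code, so changing or reordering the form items changes what the stored integers mean.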

Answered by duffymo

Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.

It breaks first normal form.

A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
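
A minimal sketch of the validation and binding mentioned above, using Python's sqlite3 module. The `answer` table, the `ALLOWED` set, and the `validate` helper are assumptions made for the example, not part of the original application.

```python
import sqlite3

ALLOWED = {"1", "2", "3"}   # hypothetical set of valid check box values


def validate(raw):
    """Split the submitted list and refuse anything not on the allow-list."""
    values = raw.split(",")
    bad = [v for v in values if v not in ALLOWED]
    if bad:
        raise ValueError(f"unexpected values: {bad}")
    return values


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE answer (id INTEGER PRIMARY KEY, choices TEXT NOT NULL)")

hostile = "1,2'); DROP TABLE answer; --"

# Validation rejects the input before it gets near the database.
try:
    validate(hostile)
    validation_failed = False
except ValueError:
    validation_failed = True

# Even without validation, a bound parameter is stored as plain data,
# never interpreted as SQL.
conn.execute("INSERT INTO answer (choices) VALUES (?)", (hostile,))
stored = conn.execute("SELECT choices FROM answer").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM answer").fetchone()[0]
```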

What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.

Or leave it as it is and learn the painful lesson of a SQL injection attack.

Answered by Raj

Well, I've been using a tab-separated list of key/value pairs in an NTEXT column in SQL Server for more than 4 years now, and it works. You do lose the flexibility of making queries, but on the other hand, if you have a library that persists/depersists the key/value pairs, then it's not that bad an idea.
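
The answer does not show its format or library, so the following is only a guess at what such a persist/depersist helper might look like. The `key=value` encoding and the function names `dumps` and `loads` are assumptions.

```python
def dumps(pairs: dict[str, str]) -> str:
    """Serialize a dict into tab-separated key=value items (assumed format)."""
    for key, value in pairs.items():
        if "\t" in key or "\t" in value or "=" in key:
            raise ValueError(f"cannot encode pair {key!r}: {value!r}")
    return "\t".join(f"{key}={value}" for key, value in pairs.items())


def loads(text: str) -> dict[str, str]:
    """Parse the string produced by dumps back into a dict."""
    if not text:
        return {}
    # Split on the first "=" only, so values may contain "=".
    return dict(item.split("=", 1) for item in text.split("\t"))


column_value = dumps({"color": "red", "size": "a=b"})
round_trip = loads(column_value)
```

Note that the helper has to reject tabs in keys and values itself; the database sees only an opaque string.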

Answered by James A Mohler

I needed a multi-value column; it could be implemented as an XML field.

It could be converted to a comma-delimited list as necessary.

See: querying an XML list in SQL Server using XQuery.

By being an XML field, some of the concerns can be addressed.

With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5

With XML: values in a tag can be forced to be the correct type



With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.

With XML: still an issue



With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5

With XML: still an issue



With CSV: Can't delete a value from the list without fetching the whole list.

With XML: single items can be removed



With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.

With XML: the XML field can be indexed



With CSV: Hard to count elements in the list, or do other aggregate queries.

With XML: not particularly hard



With CSV: Hard to join the values to the lookup table they reference.

With XML: not particularly hard



With CSV: Hard to fetch the list in sorted order.

With XML: not particularly hard



With CSV: Storing integers as strings takes about twice as much space as storing binary integers.

With XML: storage is even worse than with CSV



With CSV: Plus a lot of comma characters.

With XML: tags are used instead of commas



In short, using XML gets around some of the issues with a delimited list, and it can be converted to a delimited list as needed.
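
The answer is about SQL Server's XML column and XQuery, which are not reproduced here. As a loose application-side analogy only, this Python sketch shows the same list held as XML, converted to a comma-delimited string, and edited one item at a time. The `<ids>`/`<id>` layout is an assumption.

```python
import xml.etree.ElementTree as ET

xml_value = "<ids><id>1</id><id>2</id><id>3</id><id>5</id></ids>"
root = ET.fromstring(xml_value)

# Typed access: int() raises ValueError for something like "banana".
ids = [int(node.text) for node in root.findall("id")]

# Convert to a comma-delimited list when one is needed.
as_csv = ",".join(str(i) for i in ids)

# Remove a single item without rebuilding the list by hand.
for node in root.findall("id"):
    if node.text == "3":
        root.remove(node)
after_removal = ET.tostring(root, encoding="unicode")
```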

Answered by Robin

Yes, it is that bad. My view is that if you don't like using relational databases, then look for an alternative that suits you better; there are lots of interesting "NOSQL" projects out there with some really advanced features.

Answered by Jerry Coffin

I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.

Answered by Solomon Ucko

If you have a fixed number of boolean fields, you could use an INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR(0) (nullable) for each. You could also use a SET (I forget the exact syntax).
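
A minimal sketch of the one-column-per-check-box layout, run through Python's sqlite3 module. SQLite has no BIT or SET type, so this uses an integer column with a CHECK constraint instead; the `signup` table and its `wants_*` columns are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE signup (
        id          INTEGER PRIMARY KEY,
        wants_email INTEGER NOT NULL DEFAULT 0 CHECK (wants_email IN (0, 1)),
        wants_phone INTEGER NOT NULL DEFAULT 0 CHECK (wants_phone IN (0, 1)),
        wants_post  INTEGER NOT NULL DEFAULT 0 CHECK (wants_post  IN (0, 1))
    )
""")
conn.execute("INSERT INTO signup (wants_email, wants_post) VALUES (1, 1)")
conn.execute("INSERT INTO signup (wants_phone) VALUES (1)")

# Searching for a given box is an ordinary column comparison.
email_ids = [row[0] for row in conn.execute(
    "SELECT id FROM signup WHERE wants_email = 1 ORDER BY id")]

# Values other than 0 or 1 are refused by the CHECK constraint.
try:
    conn.execute("INSERT INTO signup (wants_email) VALUES (2)")
    bad_value_rejected = False
except sqlite3.IntegrityError:
    bad_value_rejected = True
```

This only fits when the set of check boxes is fixed; adding a box means adding a column.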
