SQL 如何为用户定义的字段设计数据库?
声明:本页面是 StackOverFlow 热门问题的中英对照翻译,遵循 CC BY-SA 4.0 协议。如果您需要使用它,必须同样遵循 CC BY-SA 许可,注明原文地址和作者信息,并将其归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/5106335/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
How to design a database for User Defined Fields?
提问 by Rachel
My requirements are:
我的要求是:
- Need to be able to dynamically add User-Defined fields of any data type
- Need to be able to query UDFs quickly
- Need to be able to do calculations on UDFs based on datatype
- Need to be able to sort UDFs based on datatype
- 需要能够动态添加任何数据类型的用户定义字段
- 需要能够快速查询 UDF
- 需要能够根据数据类型对 UDF 进行计算
- 需要能够根据数据类型对 UDF 进行排序
Other Information:
其他信息:
- I'm looking for performance primarily
- There are a few million Master records which can have UDF data attached
- When I last checked, there were over 50mil UDF records in our current database
- Most of the time, a UDF is only attached to a few thousand of the Master records, not all of them
- UDFs are not joined or used as keys. They're just data used for queries or reports
- 我主要关注的是性能
- 有几百万条主记录可以附加 UDF 数据
- 当我上次检查时,我们当前的数据库中有超过 5000 万条 UDF 记录
- 大多数时候,一个 UDF 只附加到几千条主记录上,而不是全部
- UDF 不连接或用作键。它们只是用于查询或报告的数据
Options:
选项:
Create a big table with StringValue1, StringValue2... IntValue1, IntValue2,... etc. I hate this idea, but will consider it if someone can tell me it is better than other ideas and why.
Create a dynamic table which adds a new column on demand as needed. I also don't like this idea since I feel performance would be slow unless you indexed every column.
Create a single table containing UDFName, UDFDataType, and Value. When a new UDF gets added, generate a View which pulls just that data and parses it into whatever type is specified. Items which don't meet the parsing criteria return NULL.
Create multiple UDF tables, one per data type. So we'd have tables for UDFStrings, UDFDates, etc. Probably would do the same as #2 and auto-generate a View anytime a new field gets added
XML DataTypes? I haven't worked with these before but have seen them mentioned. Not sure if they'd give me the results I want, especially with performance.
Something else?
用 StringValue1、StringValue2...IntValue1、IntValue2...等创建一个大表。我讨厌这个想法,但如果有人能告诉我它比其他想法更好以及为什么更好,我会考虑它。
创建一个动态表,根据需要按需添加新列。我也不喜欢这个想法,因为我觉得除非你索引每一列,否则性能会很慢。
创建一个包含 UDFName、UDFDataType 和 Value 的表。添加新的 UDF 时,生成一个视图,该视图仅提取该数据并将其解析为指定的任何类型。不符合解析条件的项目返回 NULL。
创建多个 UDF 表,每个数据类型一个。所以我们会有 UDFStrings、UDFDates 等表。可能会做与 #2 相同的工作,并在添加新字段时自动生成视图
XML 数据类型?我以前没有用过,但见过有人提到它们。不确定它们能否给我想要的结果,尤其是在性能方面。
还有什么?
采纳答案 by Phil Helmer
If performance is the primary concern, I would go with #6... a table per UDF (really, this is a variant of #2). This answer is tailored specifically to the data distribution and access patterns described in the question.
如果性能是主要考虑因素,我会选择 #6……即每个 UDF 一个表(实际上,这是 #2 的变体)。这个回答是专门针对问题中描述的数据分布和访问模式量身定制的。
Pros:
优点:
Because you indicate that some UDFs have values for a small portion of the overall data set, a separate table would give you the best performance because that table will be only as large as it needs to be to support the UDF. The same holds true for the related indices.
You also get a speed boost by limiting the amount of data that has to be processed for aggregations or other transformations. Splitting the data out into multiple tables lets you perform some of the aggregating and other statistical analysis on the UDF data, then join that result to the master table via foreign key to get the non-aggregated attributes.
You can use table/column names that reflect what the data actually is.
You have complete control to use data types, check constraints, default values, etc. to define the data domains. Don't underestimate the performance hit resulting from on-the-fly data type conversion. Such constraints also help RDBMS query optimizers develop more effective plans.
Should you ever need to use foreign keys, built-in declarative referential integrity is rarely out-performed by trigger-based or application level constraint enforcement.
因为您指出某些 UDF 只在整个数据集的一小部分上有值,所以单独的表能给您最佳性能,因为该表的大小只需要满足支持这个 UDF 的需要。相关的索引也是如此。
您还可以通过限制必须为聚合或其他转换处理的数据量来提高速度。将数据拆分到多个表中可以让您对 UDF 数据执行一些聚合和其他统计分析,然后通过外键将该结果连接到主表以获得非聚合属性。
您可以使用反映数据实际内容的表/列名称。
您可以完全控制使用数据类型、检查约束、默认值等来定义数据域。不要低估动态数据类型转换对性能的影响。此类约束还有助于 RDBMS 查询优化器开发更有效的计划。
如果您需要使用外键,那么基于触发器或应用程序级别的约束执行很少能胜过内置的声明性参照完整性。
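For concreteness, here is a minimal sketch of what one such per-UDF table might look like (T-SQL assumed; the Master table, its MasterId key and the "warranty expiration" field are made-up examples, not from the question):
为直观起见,下面是“每个 UDF 一个表”方案的一个最小示意(假定使用 T-SQL;Master 表、MasterId 主键以及“保修到期”字段都是虚构的示例,并非出自原问题):

    -- One narrow table per UDF, holding rows only for the master records
    -- that actually have a value for this field.
    CREATE TABLE UDF_WarrantyExpiration (
        MasterId int NOT NULL
            CONSTRAINT PK_UDF_WarrantyExpiration PRIMARY KEY
            CONSTRAINT FK_UDF_WarrantyExpiration_Master REFERENCES Master (MasterId),
        Value date NOT NULL
    );

    -- Index the value so range queries and sorts on this UDF stay fast.
    CREATE INDEX IX_UDF_WarrantyExpiration_Value ON UDF_WarrantyExpiration (Value);

    -- Example: master records whose warranty expires in 2011, joined back for other attributes.
    SELECT m.MasterId, u.Value
    FROM Master AS m
    JOIN UDF_WarrantyExpiration AS u ON u.MasterId = m.MasterId
    WHERE u.Value >= '20110101' AND u.Value < '20120101';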
Cons:
缺点:
This could create a lot of tables. Enforcing schema separation and/or a naming convention would alleviate this.
There is more application code needed to operate the UDF definition and management. I expect this is still less code needed than for the original options 1, 3, & 4.
这可能会产生很多表。强制实行模式(schema)分离和/或命名约定可以缓解这一点。
需要更多的应用程序代码来处理 UDF 的定义和管理。我预计这仍然比最初的选项 1、3 和 4 所需的代码少。
Other Considerations:
其他注意事项:
If there is anything about the nature of the data that would make sense for the UDFs to be grouped, that should be encouraged. That way, those data elements can be combined into a single table. For example, let's say you have UDFs for color, size, and cost. The tendency in the data is that most instances of this data look like
'red', 'large', 45.03
rather than
NULL, 'medium', NULL
In such a case, you won't incur a noticeable speed penalty by combining the 3 columns in 1 table because few values would be NULL and you avoid making 2 more tables, which is 2 fewer joins needed when you need to access all 3 columns.
If you hit a performance wall from a UDF that is heavily populated and frequently used, then that should be considered for inclusion in the master table.
Logical table design can take you to a certain point, but when the record counts get truly massive, you also should start looking at what table partitioning options are provided by your RDBMS of choice.
如果数据的性质对 UDF 进行分组有意义,那么应该鼓励这样做。这样,这些数据元素就可以组合到一个表中。例如,假设您有颜色、大小和成本的 UDF。数据中的趋势是该数据的大多数实例看起来像
'red', 'large', 45.03
而不是
NULL, 'medium', NULL
在这种情况下,将这 3 列合并到 1 个表中不会带来明显的速度损失,因为很少有值是 NULL,而且您避免了再多建 2 个表,也就是在需要访问全部 3 列时少了 2 个连接。
如果您遇到了大量填充和频繁使用的 UDF 的性能壁垒,则应考虑将其包含在主表中。
逻辑表设计可以将您带到某个点,但是当记录数量变得非常庞大时,您还应该开始查看您选择的 RDBMS 提供了哪些表分区选项。
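A sketch of the grouping idea from the color/size/cost example above, under the same assumptions (illustrative names only):
下面是上文颜色/尺寸/成本示例中所说的分组思路的示意(同样只是假设的示例名称):

    -- Color, size and cost tend to be populated together, so they share one table.
    CREATE TABLE UDF_ProductAttributes (
        MasterId int NOT NULL
            CONSTRAINT PK_UDF_ProductAttributes PRIMARY KEY
            CONSTRAINT FK_UDF_ProductAttributes_Master REFERENCES Master (MasterId),
        Color varchar(20)   NULL,
        Size  varchar(20)   NULL,
        Cost  decimal(10,2) NULL
    );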
回答 by Bill Karwin
I have written about this problem a lot. The most common solution is the Entity-Attribute-Value antipattern, which is similar to what you describe in your option #3. Avoid this design like the plague.
我已经写了很多关于这个问题的文章。最常见的解决方案是实体-属性-值(Entity-Attribute-Value)反模式,它类似于您在选项 #3 中描述的内容。要像躲避瘟疫一样避免这种设计。
What I do for this when I need truly dynamic custom fields is to store them in a blob of XML, so I can add new fields at any time. But to make it speedy, also create additional tables for each field you need to search or sort on (you don't need a table per field--just a table per searchable field). This is sometimes called an inverted index design.
当我需要真正动态的自定义字段时,我的做法是把它们存储在一个 XML blob 中,这样就可以随时添加新字段。但为了保证速度,还要为每个需要搜索或排序的字段创建额外的表(不需要每个字段一个表——只需要每个可搜索的字段一个表)。这有时称为倒排索引设计。
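A rough sketch of that inverted-index layout (MySQL-flavoured SQL; the birthdate field and the table names are hypothetical):
下面是该倒排索引布局的一个粗略示意(MySQL 风格的 SQL;birthdate 字段与表名均为假设):

    -- All custom fields live in one serialized blob on the main table.
    CREATE TABLE entities (
        entity_id  BIGINT PRIMARY KEY,
        attributes TEXT NOT NULL        -- XML (or JSON) blob of every custom field
    );

    -- One extra table per *searchable* field, acting as an inverted index.
    CREATE TABLE index_birthdate (
        entity_id BIGINT NOT NULL,
        birthdate DATE   NOT NULL,
        PRIMARY KEY (birthdate, entity_id),
        FOREIGN KEY (entity_id) REFERENCES entities (entity_id)
    );

    -- Search via the index table, then read the blob for display.
    SELECT e.entity_id, e.attributes
    FROM index_birthdate AS i
    JOIN entities AS e ON e.entity_id = i.entity_id
    WHERE i.birthdate BETWEEN '1980-01-01' AND '1989-12-31';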
You can read an interesting article from 2009 about this solution here: http://backchannel.org/blog/friendfeed-schemaless-mysql
您可以在这里阅读一篇 2009 年的关于此解决方案的有趣文章:http://backchannel.org/blog/friendfeed-schemaless-mysql
Or you can use a document-oriented database, where it's expected that you have custom fields per document. I'd choose Solr.
或者,您可以使用面向文档的数据库,在这类数据库中,每个文档拥有自定义字段本来就是预期的用法。我会选择 Solr。
回答 by Stefan Steinegger
I would most probably create a table of the following structure:
我很可能会创建一个具有以下结构的表:
- varchar Name
- varchar Type
- decimal NumberValue
- varchar StringValue
- date DateValue
- varchar Name
- varchar Type
- decimal NumberValue
- varchar StringValue
- date DateValue
The exact types of course depend on your needs (and of course on the dbms you are using). You could also use the NumberValue (decimal) field for int's and booleans. You may need other types as well.
当然,确切的类型取决于您的需求(当然也取决于您使用的 DBMS)。您还可以将 NumberValue(decimal)字段用于整数和布尔值。您可能还需要其他类型。
You need some link to the Master records which own the value. It's probably easiest and fastest to create a user fields table for each master table and add a simple foreign key. This way you can filter master records by user fields easily and quickly.
您需要一些指向拥有该值的主记录的链接。为每个主表创建一个用户字段表并添加一个简单的外键可能是最简单和最快的。通过这种方式,您可以轻松快速地按用户字段过滤主记录。
You may want to have some kind of meta data information. So you end up with the following:
您可能想要某种元数据信息。所以你最终得到以下结果:
Table UdfMetaData
表 UdfMetaData
- int id
- varchar Name
- varchar Type
- int id
- varchar Name
- varchar Type
Table MasterUdfValues
表 MasterUdfValues
- int Master_FK
- int MetaData_FK
- decimal NumberValue
- varchar StringValue
- date DateValue
- int Master_FK
- int MetaData_FK
- decimal NumberValue
- varchar StringValue
- date DateValue
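A sketch of how filtering master records by one of these user fields could look with that structure (the Master table, its Id column and the 'Priority' field are assumptions for illustration):
以下示意展示了在这种结构下按用户字段过滤主记录的大致写法(Master 表、其 Id 列以及 'Priority' 字段均为示意性假设):

    -- Find master records whose 'Priority' user field has a numeric value above 3.
    SELECT m.*
    FROM Master AS m
    JOIN MasterUdfValues AS v ON v.Master_FK = m.Id
    JOIN UdfMetaData AS d ON d.id = v.MetaData_FK
    WHERE d.Name = 'Priority'
      AND v.NumberValue > 3;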
Whatever you do, I would not change the table structure dynamically. It is a maintenance nightmare. I would also not use XML structures; they are much too slow.
无论您做什么,我都不会动态更改表结构。这是一个维护噩梦。我也不会使用 XML 结构,它们太慢了。
回答 by Data Monk
This sounds like a problem that might be better solved by a non-relational solution, like MongoDB or CouchDB.
这听起来像是一个问题,可以通过非关系解决方案更好地解决,例如 MongoDB 或 CouchDB。
They both allow for dynamic schema expansion while allowing you to maintain the tuple integrity you seek.
它们都允许动态模式扩展,同时允许您保持您寻求的元组完整性。
I agree with Bill Karwin: the EAV model is not a performant approach for you. Using name-value pairs in a relational system is not intrinsically bad, but it only works well when the name-value pairs make a complete tuple of information. When using it forces you to dynamically reconstruct a table at run-time, all kinds of things start to get hard. Querying becomes an exercise in pivot maintenance or forces you to push the tuple reconstruction up into the object layer.
我同意 Bill Karwin 的观点,EAV 模型对您来说并不是一种高性能的方法。在关系系统中使用名称-值对本身并不坏,但只有当名称-值对构成一个完整的信息元组时才行得通。当使用它迫使您在运行时动态重建表时,各种事情都会开始变得困难。查询会变成不断维护透视(pivot)的苦差事,或者迫使您把元组重建推到对象层去做。
You can't determine whether a null or missing value is a valid entry or lack of entry without embedding schema rules in your object layer.
如果不在对象层中嵌入架构规则,则无法确定空值或缺失值是有效条目还是缺少条目。
You lose the ability to efficiently manage your schema. Is a 100-character varchar the right type for the "value" field? 200 characters? Should it be nvarchar instead? It can be a hard trade-off and one that ends with you having to place artificial limits on the dynamic nature of your set. Something like "you can only have x user-defined fields and each can only be y characters long."
您失去了有效管理架构的能力。100 个字符的 varchar 是“值”字段的正确类型吗?200 个字符?还是应该用 nvarchar?这可能是一个艰难的权衡,最终你不得不对集合的动态特性施加人为的限制,类似于“您只能有 x 个用户定义的字段,每个字段的长度只能是 y 个字符”。
With a document-oriented solution, like MongoDB or CouchDB, you maintain all attributes associated with a user within a single tuple. Since joins are not an issue, life is happy, as neither of these two does well with joins, despite the hype. Your users can define as many attributes as they want (or you will allow) at lengths that don't get hard to manage until you reach about 4MB.
使用面向文档的解决方案,如 MongoDB 或 CouchDB,您可以在单个元组中维护与用户关联的所有属性。由于连接不是问题,生活是快乐的,因为尽管大肆宣传,但这两个都不能很好地处理连接。您的用户可以定义任意数量的属性(或者您将允许),长度在您达到大约 4MB 之前不会变得难以管理。
If you have data that requires ACID-level integrity, you might consider splitting the solution, with the high-integrity data living in your relational database and the dynamic data living in a non-relational store.
如果您有需要 ACID 级别完整性的数据,您可以考虑拆分解决方案,将高完整性数据保存在关系数据库中,将动态数据保存在非关系存储中。
回答 by Thomas
Even if you provide for a user adding custom columns, it will not necessarily be the case that querying on those columns will perform well. There are many aspects that go into query design that allow them to perform well, the most important of which is the proper specification of what should be stored in the first place. Thus, fundamentally, do you want to allow users to create schema without thought as to specifications and be able to quickly derive information from that schema? If so, then it is unlikely that any such solution will scale well, especially if you want to allow the user to do numerical analysis on the data.
即使您为用户提供添加自定义列的功能,查询这些列也不一定会表现良好。查询设计有很多方面可以让它们很好地执行,其中最重要的是首先正确规范应该存储什么。因此,从根本上说,您是否希望允许用户在不考虑规范的情况下创建模式并能够从该模式中快速获取信息?如果是这样,那么任何这样的解决方案都不太可能很好地扩展,特别是如果您希望允许用户对数据进行数值分析。
Option 1
选项1
IMO this approach gives you schema with no knowledge as to what the schema means, which is a recipe for disaster and a nightmare for report designers. I.e., you must have the metadata to know what column stores what data. If that metadata gets messed up, it has the potential to hose your data. Plus, it makes it easy to put the wrong data in the wrong column. ("What? String1 contains the name of convents? I thought it was Charlie Sheen's favorite drugs.")
IMO,这种方法给您的架构不带任何关于架构含义的信息,这是灾难的配方,也是报表设计者的噩梦。也就是说,您必须依靠元数据才能知道哪个列存了哪些数据。如果元数据被弄乱了,就有可能毁掉您的数据。另外,这也很容易把错误的数据放进错误的列。(“什么?String1 存的是修道院的名称?我还以为是 Charlie Sheen 最喜欢的药物。”)
Option 3,4,5
选项 3、4、5
IMO, requirements 2, 3, and 4 eliminate any variation of an EAV. If you need to query, sort or do calculations on this data, then an EAV is Cthulhu's dream and your development team's and DBA's nightmare. EAV's will create a bottleneck in terms of performance and will not give you the data integrity you need to quickly get to the information you want. Queries will quickly turn to crosstab Gordian knots.
IMO,要求 2、3 和 4 排除了 EAV 的任何变体。如果您需要对这些数据进行查询、排序或计算,那么 EAV 是克苏鲁的美梦,却是您的开发团队和 DBA 的噩梦。EAV 会在性能上造成瓶颈,也无法提供快速获取所需信息所要求的数据完整性。查询很快就会变成解不开的交叉表死结。
Option 2,6
选项 2,6
That really leaves one choice: gather specifications and then build out the schema.
这真的只剩下一个选择:收集规范,然后构建模式。
If the client wants the best performance on data they wish to store, then they need to go through the process of working with a developer to understand their needs so that it is stored as efficiently as possible. It could still be stored in a table separate from the rest of the tables with code that dynamically builds a form based on the schema of the table. If you have a database that allows for extended properties on columns, you could even use those to help the form builder use nice labels, tooltips etc. so that all that was necessary is to add the schema. Either way, to build and run reports efficiently, the data needs to be stored properly. If the data in question will have lots of nulls, some databases have the ability to store that type of information. For example, SQL Server 2008 has a feature called Sparse Columns specifically for data with lots of nulls.
如果客户希望在他们要存储的数据上获得最佳性能,那么就需要与开发人员合作梳理需求,以便尽可能高效地存储数据。这些数据仍然可以存储在与其余表分开的一个表中,并用代码根据该表的架构动态生成表单。如果您的数据库支持列的扩展属性,您甚至可以利用它们来帮助表单生成器使用友好的标签、工具提示等,这样只需添加架构即可。无论哪种方式,要高效地构建和运行报表,数据都需要正确存储。如果相关数据会有很多空值,一些数据库有能力存储这种类型的信息。例如,SQL Server 2008 就有一个称为稀疏列(Sparse Columns)的特性,专门用于包含大量空值的数据。
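For illustration, a sparse column in SQL Server 2008+ is declared roughly like this (table and column names are invented):
作为示意,SQL Server 2008 及以上版本中的稀疏列大致这样声明(表名和列名均为虚构):

    -- Sparse columns store NULLs almost for free, at a small cost for non-NULL values.
    CREATE TABLE MasterRecord (
        MasterId     int IDENTITY(1,1) PRIMARY KEY,
        Name         nvarchar(100) NOT NULL,
        WarrantyDate date          SPARSE NULL,  -- rarely populated UDF
        Color        nvarchar(20)  SPARSE NULL   -- rarely populated UDF
    );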
If this were only a bag of data on which no analysis, filtering, or sorting was to be done, I'd say some variation of an EAV might do the trick. However, given your requirements, the most efficient solution will be to get the proper specifications even if you store these new columns in separate tables and build forms dynamically off those tables.
如果这只是一包不需要分析、过滤或排序的数据,我想说 EAV 的一些变体可能会奏效。但是,根据您的要求,即使您将这些新列存储在单独的表中并从这些表动态构建表单,最有效的解决方案还是获得正确的规范。
回答 by Amit Contractor
- Create multiple UDF tables, one per data type. So we'd have tables for UDFStrings, UDFDates, etc. Probably would do the same as #2 and auto-generate a View anytime a new field gets added
- 创建多个 UDF 表,每个数据类型一个。所以我们会有 UDFStrings、UDFDates 等表。可能会做与 #2 相同的工作,并在添加新字段时自动生成视图
According to my research, multiple tables based on the data type are not going to help you in terms of performance. Especially if you have bulk data, like 20K or 25K records with 50+ UDFs. Performance was the worst.
根据我的研究,基于数据类型的多个表不会帮助您提高性能。尤其是当您有大量数据时,例如具有 50 多个 UDF 的 20K 或 25K 记录。性能是最差的。
You should go with single table with multiple columns like:
您应该使用具有多列的单个表,例如:
varchar Name
varchar Type
decimal NumberValue
varchar StringValue
date DateValue
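A sketch of that single mixed-column table; the link back to the master record and the composite key are added here as assumptions, since the answer doesn't spell them out:
下面是这个混合列单表的示意;指向主记录的外键和复合主键是这里补充的假设,原回答并未给出:

    CREATE TABLE MasterUdf (
        MasterId    int           NOT NULL,  -- assumed link to the master table
        Name        varchar(100)  NOT NULL,
        Type        varchar(20)   NOT NULL,  -- e.g. 'number', 'string' or 'date'
        NumberValue decimal(18,4) NULL,
        StringValue varchar(400)  NULL,
        DateValue   date          NULL,
        PRIMARY KEY (MasterId, Name)
    );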
回答 by Ophir Yoktan
This is a problematic situation, and none of the solutions appears "right". However option 1 is probably the best both in terms of simplicity and in terms of performance.
这是一个有问题的情况,没有一个解决方案看起来是“正确的”。然而,选项 1 可能在简单性和性能方面都是最好的。
This is also the solution used in some commercial enterprise applications.
这也是一些商业企业应用程序中使用的解决方案。
EDIT
编辑
Another option that is available now, but didn't exist (or at least wasn't mature) when the question was originally asked, is to use JSON fields in the DB.
现在可用但在最初提出问题时不存在(或至少不成熟)的另一个选项是在数据库中使用 json 字段。
Many relational DBs now support JSON-based fields (which can include a dynamic list of sub-fields) and allow querying on them.
许多关系数据库现在支持基于 json 的字段(可以包括子字段的动态列表)并允许对它们进行查询
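As an illustration of the JSON-field option (PostgreSQL syntax is assumed here; MySQL and SQL Server offer similar JSON support):
下面以 PostgreSQL 语法示意 JSON 字段这一选项(MySQL 和 SQL Server 也有类似的 JSON 支持):

    -- User-defined fields kept in a JSONB column on the master table.
    CREATE TABLE master (
        id   bigserial PRIMARY KEY,
        name text NOT NULL,
        udf  jsonb NOT NULL DEFAULT '{}'
    );

    -- A GIN index makes containment queries on the JSON reasonably fast.
    CREATE INDEX idx_master_udf ON master USING gin (udf);

    -- Records whose user-defined "color" field equals "red".
    SELECT id, name
    FROM master
    WHERE udf @> '{"color": "red"}';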
回答 by Jon Egerton
I've had experience of 1, 3 and 4, and they all end up either messy, with it not being clear what the data is, or really complicated, with some sort of soft categorisation to break the data down into dynamic types of record.
我有过 1、3 和 4 的经验,它们最终要么一团糟、搞不清楚数据到底是什么,要么因为要用某种软分类把数据拆成动态的记录类型而变得非常复杂。
I'd be tempted to try XML. You should be able to enforce schemas against the contents of the XML to check data typing etc., which will help hold different sets of UDF data. In newer versions of SQL Server you can index on XML fields, which should help with the performance (see http://blogs.technet.com/b/josebda/archive/2009/03/23/sql-server-2008-xml-indexing.aspx for example).
我很想尝试 XML。您应该可以针对 XML 的内容强制执行架构来检查数据类型等,这将有助于存放不同的 UDF 数据集。在较新版本的 SQL Server 中,您可以对 XML 字段建立索引,这应该有助于提高性能(例如参见 http://blogs.technet.com/b/josebda/archive/2009/03/23/sql-server-2008-xml-indexing.aspx)。
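A minimal sketch of that XML-column approach in SQL Server (names are illustrative; a primary XML index also requires a clustered primary key on the table):
下面是 SQL Server 中这种 XML 列做法的最小示意(名称仅作示意;建立主 XML 索引还要求表上有聚集主键):

    -- Custom fields stored as an XML blob on the master table.
    ALTER TABLE Master ADD UdfData xml NULL;

    -- A primary XML index speeds up queries into the XML content.
    CREATE PRIMARY XML INDEX PXML_Master_UdfData ON Master (UdfData);

    -- Pull a single field out of the XML and filter on it.
    SELECT MasterId,
           UdfData.value('(/udf/color)[1]', 'varchar(20)') AS Color
    FROM Master
    WHERE UdfData.exist('/udf[color = "red"]') = 1;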
回答 by Tim Rogers
If you're using SQL Server, don't overlook the sql_variant type. It's pretty fast and should do your job. Other databases might have something similar.
如果您使用的是 SQL Server,请不要忽视 sql_variant 类型。它非常快,应该可以完成您的工作。其他数据库可能有类似的东西。
XML datatypes are not so good for performance reasons. If you're doing calculations on the server then you're constantly having to deserialize these.
出于性能原因,XML 数据类型并不是那么好。如果您在服务器上进行计算,那么您必须不断地反序列化这些。
Option 1 sounds bad and looks cruddy, but performance-wise it can be your best bet. I have created tables with columns named Field00-Field99 before because you just can't beat the performance. You might need to consider your INSERT performance too, in which case this is also the one to go for. You can always create Views on this table if you want it to look neat!
选项 1 听起来很糟糕,看起来也很粗糙,但在性能方面它可能是您最好的选择。我以前就创建过包含 Field00-Field99 这类列名的表,因为这种做法的性能确实无可匹敌。您可能还需要考虑 INSERT 性能,在那种情况下这也是首选。如果您希望它看起来整洁,随时可以在这个表上创建视图!
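A small sketch of the sql_variant idea (hypothetical table; SQL_VARIANT_PROPERTY exposes the underlying type at query time):
sql_variant 思路的一个小示意(表为假设;SQL_VARIANT_PROPERTY 可在查询时取得底层类型):

    -- One value column that can hold ints, decimals, strings or dates.
    CREATE TABLE MasterUdfValue (
        MasterId int         NOT NULL,
        FieldId  int         NOT NULL,
        Value    sql_variant NULL,
        PRIMARY KEY (MasterId, FieldId)
    );

    -- Inspect the stored base type when type-aware calculations are needed.
    SELECT MasterId, FieldId,
           SQL_VARIANT_PROPERTY(Value, 'BaseType') AS BaseType,
           Value
    FROM MasterUdfValue;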
回答 by Alex
Our database powers a SaaS app (helpdesk software) where users have over 7k "custom fields". We use a combined approach:
我们的数据库为 SaaS 应用程序(帮助台软件)提供支持,其中用户拥有超过 7000 个“自定义字段”。我们使用组合方法:
- An (EntityID, FieldID, Value) table for searching the data
- A JSON field in the entities table that holds all entity values, used for displaying the data (this way you don't need a million JOINs to get the values).
- 一个用于搜索数据的 (EntityID, FieldID, Value) 表
- entities 表中的一个 JSON 字段,保存所有实体值,用于显示数据(这样您就不需要一百万个 JOIN 来获取这些值)。
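A sketch of that combined layout (table and column names are guesses for illustration, not the actual schema):
该组合方案的示意(表名和列名只是示意性的猜测,并非真实架构):

    -- 1) Narrow (EntityID, FieldID, Value) table used only for searching/filtering.
    CREATE TABLE custom_field_values (
        EntityID bigint       NOT NULL,
        FieldID  int          NOT NULL,
        Value    varchar(400) NULL,
        PRIMARY KEY (EntityID, FieldID)
    );
    CREATE INDEX IX_cfv_field_value ON custom_field_values (FieldID, Value);

    -- 2) Denormalized JSON blob on the entities table, used only for display,
    --    so rendering a record never needs one JOIN per field.
    ALTER TABLE entities ADD custom_fields nvarchar(max) NULL;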
You could further split #1 to have a "table per datatype" like this answer suggests; this way you can even index your UDFs.
您可以进一步拆分 #1 以获得“每个数据类型的表”,就像这个答案所建议的那样,这样您甚至可以索引您的 UDF。
P.S. A couple of words to defend the "Entity-Attribute-Value" approach everyone keeps bashing. We have used #1 without #2 for decades and it worked just fine. Sometimes it's a business decision. Do you have time to rewrite your app and redesign the db, or can you throw a couple of bucks at cloud servers, which are really cheap these days? By the way, when we were using the #1 approach, our DB was holding millions of entities, accessed by hundreds of thousands of users, and a 16GB dual-core db server was doing just fine.
PS 用几句话来为大家一直在抨击的“实体-属性-值”方法辩护。几十年来,我们一直只用 #1 而没有 #2,效果很好。有时这是一个商业决策:您是有时间重写应用程序并重新设计数据库,还是愿意在云服务器上花几块钱(如今云服务器真的很便宜)?顺便说一句,当我们使用 #1 方法时,我们的数据库保存着数百万个实体,由数十万用户访问,而一台 16GB 的双核数据库服务器就运行得很好。