MySQL first-time database design: am I overengineering?
Original question: http://stackoverflow.com/questions/2320633/
This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not the translator): StackOverflow
Asked by bob esponja
Background
I'm a first year CS student and I work part time for my dad's small business. I don't have any experience in real world application development. I have written scripts in Python, some coursework in C, but nothing like this.
My dad has a small training business and currently all classes are scheduled, recorded and followed up via an external web application. There is an export/"reports" feature but it is very generic and we need specific reports. We don't have access to the actual database to run the queries. I've been asked to set up a custom reporting system.
My idea is to generate the generic CSV exports every night and import them (probably with Python) into a MySQL database hosted in the office, where I can run the specific queries that are needed. I don't have experience with databases but understand the very basics. I've read a little about database creation and normal forms.
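A minimal sketch of what such a nightly import could look like. The CSV column names (`client_id`, `name`, `credit`) are hypothetical, since the real export format is not shown, and SQLite stands in for MySQL so the example runs without a server.

```python
import csv
import io
import sqlite3

# Hypothetical export contents; the real export's columns are not known.
SAMPLE_CSV = """client_id,name,credit
1,Alice,5
2,Bob,0
"""

def import_clients(conn, csv_file):
    """Load client rows from a CSV file object, replacing rows that already exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clients ("
        "client_id INTEGER PRIMARY KEY, name TEXT NOT NULL, credit INTEGER NOT NULL)"
    )
    rows = [
        (int(r["client_id"]), r["name"], int(r["credit"]))
        for r in csv.DictReader(csv_file)
    ]
    # INSERT OR REPLACE is SQLite syntax; it keeps repeated nightly runs
    # from failing on, or duplicating, rows that were imported before.
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT OR REPLACE INTO clients VALUES (?, ?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
imported = import_clients(conn, io.StringIO(SAMPLE_CSV))
imported_again = import_clients(conn, io.StringIO(SAMPLE_CSV))  # next night, same data
total = conn.execute("SELECT COUNT(*) FROM clients").fetchone()[0]
print(imported, imported_again, total)
```

Keying the import on the export's own ID column means re-running it on the same file leaves the table unchanged, which is convenient for a job that runs every night.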
We may start having international clients soon, so I want the database to not explode if/when that happens. We also currently have a couple big corporations as clients, with different divisions (e.g. ACME parent company, ACME healthcare division, ACME bodycare division)
The schema I have come up with is the following:
- From the client perspective:
  - Clients is the main table
  - Clients are linked to the department they work for
  - Departments can be scattered around a country: HR in London, Marketing in Swansea, etc.
  - Departments are linked to the division of a company
  - Divisions are linked to the parent company
- From the classes perspective:
  - Sessions is the main table
  - A teacher is linked to each session
  - A statusid is given to each session. E.g. 0 - Completed, 1 - Cancelled
  - Sessions are grouped into "packs" of an arbitrary size
  - Each pack is assigned to a client
I "designed" (more like scribbled) the schema on a piece of paper, trying to keep it normalised to third normal form. I then plugged it into MySQL Workbench and it made it all pretty for me:
(Schema diagram; source: maian.org)
Example queries I'll be running
- Which clients with credit still left are inactive (those without a class scheduled in the future)
- What is the attendance rate per client/department/division (measured by the status id in each session)
- How many classes has a teacher had in a month
- Flag clients who have low attendance rate
- Custom reports for HR departments with attendance rates of people in their division
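Two of the example queries above could be sketched as follows. The table and column names are assumptions loosely based on the description of the schema (the diagram itself is not reproduced here), status 2 = scheduled is an invented value for future sessions, and SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clients  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE packs    (id INTEGER PRIMARY KEY,
                       client_id INTEGER NOT NULL REFERENCES clients(id),
                       credit INTEGER NOT NULL);
CREATE TABLE sessions (id INTEGER PRIMARY KEY,
                       pack_id INTEGER NOT NULL REFERENCES packs(id),
                       status_id INTEGER NOT NULL,  -- 0 completed, 1 cancelled, 2 scheduled (assumed)
                       starts_at TEXT NOT NULL);    -- ISO date string
INSERT INTO clients VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO packs VALUES (10, 1, 3), (20, 2, 2);
INSERT INTO sessions VALUES
  (100, 10, 0, '2010-01-05'), (101, 10, 1, '2010-01-12'),
  (102, 10, 2, '2010-03-01'), (200, 20, 0, '2010-01-07');
""")

TODAY = "2010-02-01"  # fixed date so the example is reproducible

# Attendance rate per client: share of past sessions that were completed.
attendance = conn.execute("""
    SELECT c.name, AVG(s.status_id = 0) AS rate
    FROM clients c
    JOIN packs p    ON p.client_id = c.id
    JOIN sessions s ON s.pack_id = p.id
    WHERE s.starts_at < ?
    GROUP BY c.id
    ORDER BY c.name
""", (TODAY,)).fetchall()

# Clients with credit left but no session scheduled from today onwards.
inactive = conn.execute("""
    SELECT c.name
    FROM clients c
    JOIN packs p ON p.client_id = c.id
    WHERE p.credit > 0
      AND NOT EXISTS (SELECT 1
                      FROM packs p2
                      JOIN sessions s ON s.pack_id = p2.id
                      WHERE p2.client_id = c.id AND s.starts_at >= ?)
    GROUP BY c.id
    ORDER BY c.name
""", (TODAY,)).fetchall()

print(attendance)
print(inactive)
```

Both reports need only two joins each, which is typical of the queries listed above.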
Question(s)
- Is this overengineered or am I headed the right way?
- Will the need to join multiple tables for most queries result in a big performance hit?
- I have added a 'lastsession' column to clients, as it is probably going to be a common query. Is this a good idea or should I keep the database strictly normalised?
Thanks for your time
Accepted answer by Tom Crowe
Some more answers to your questions:
1) You're pretty much on target for someone who is approaching a problem like this for the first time. I think the pointers from others on this question thus far pretty much cover it. Good job!
2 & 3) The performance hit you will take will largely be dependent on having and optimizing the right indexes for your particular queries / procedures and more importantly the volume of records. Unless you are talking about well over a million records in your main tables you seem to be on track to having a sufficiently mainstream design that performance will not be an issue on reasonable hardware.
That said, and this relates to your question 3: given the start you have, you probably shouldn't be overly worried about performance or overly sensitive to normalization orthodoxy here. This is a reporting server you are building, not a transaction-based application backend, which would have a much different profile with respect to the importance of performance or normalization. A database backing a live signup and scheduling application has to be mindful of queries that take seconds to return data. A reporting server not only has more tolerance for complex and lengthy queries, but the strategies to improve its performance are also much different.
For example, in a transaction-based application environment your performance improvement options might include refactoring your stored procedures and table structures to the nth degree, or developing a caching strategy for small amounts of commonly requested data. In a reporting environment you can certainly do this, but you can have an even greater impact on performance by introducing a snapshot mechanism: a scheduled process runs and stores pre-configured reports, and your users access the snapshot data without putting stress on your db tier on a per-request basis.
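One way to picture the snapshot mechanism is a report table that a scheduled job rebuilds, while report users only ever read from it. This is a sketch under assumptions: the table names, columns and the `refresh_snapshot` / `read_report` helpers are invented for illustration, and SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (id INTEGER PRIMARY KEY,
                       teacher TEXT NOT NULL,
                       month TEXT NOT NULL);
CREATE TABLE report_teacher_monthly (teacher TEXT NOT NULL,
                                     month TEXT NOT NULL,
                                     classes INTEGER NOT NULL,
                                     PRIMARY KEY (teacher, month));
INSERT INTO sessions (teacher, month) VALUES
  ('Smith', '2010-01'), ('Smith', '2010-01'), ('Jones', '2010-01');
""")

def refresh_snapshot(conn):
    """Rebuild the pre-computed report; intended to be run by a scheduler."""
    with conn:  # single transaction: the old rows are replaced atomically
        conn.execute("DELETE FROM report_teacher_monthly")
        conn.execute("""
            INSERT INTO report_teacher_monthly (teacher, month, classes)
            SELECT teacher, month, COUNT(*) FROM sessions GROUP BY teacher, month
        """)

def read_report(conn, month):
    """What a report user runs: a cheap lookup against the snapshot only."""
    return conn.execute(
        "SELECT teacher, classes FROM report_teacher_monthly "
        "WHERE month = ? ORDER BY teacher",
        (month,),
    ).fetchall()

refresh_snapshot(conn)
report = read_report(conn, "2010-01")
print(report)
```

The aggregation cost is paid once per refresh rather than once per request, at the price of the report being as stale as the last refresh.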
All of this is a long-winded rant to illustrate that the design principles and tricks you employ may differ given the role of the db you're creating. I hope that's helpful.
Answer by Reverend Gonzo
You've got the right idea. You can, however, clean it up and remove some of the mapping (has*) tables.
What you can do is in the Departments table, add CityId and DivisionId.
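A sketch of what that change might look like: the two mapping tables are replaced by two foreign key columns on Departments. Only `CityId` and `DivisionId` come from the suggestion; the other columns are assumptions, and SQLite stands in for MySQL. Note that this only works if each department belongs to exactly one city and one division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Cities    (CityId INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Divisions (DivisionId INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Departments (
    DepartmentId INTEGER PRIMARY KEY,
    Name         TEXT NOT NULL,
    CityId       INTEGER NOT NULL REFERENCES Cities(CityId),
    DivisionId   INTEGER NOT NULL REFERENCES Divisions(DivisionId)
);
INSERT INTO Cities VALUES (1, 'London'), (2, 'Swansea');
INSERT INTO Divisions VALUES (1, 'ACME healthcare');
INSERT INTO Departments VALUES (1, 'HR', 1, 1), (2, 'Marketing', 2, 1);
""")

# One join per lookup table, with no intermediate mapping table.
rows = conn.execute("""
    SELECT d.Name, c.Name, v.Name
    FROM Departments d
    JOIN Cities c    ON c.CityId = d.CityId
    JOIN Divisions v ON v.DivisionId = d.DivisionId
    ORDER BY d.DepartmentId
""").fetchall()

# A department pointing at a city that does not exist is rejected.
try:
    conn.execute("INSERT INTO Departments VALUES (3, 'Sales', 99, 1)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print(rows, fk_rejected)
```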
Besides that, I think everything is fine...
Answer by Larry Lustig
No. It looks like you're designing at a good level of detail.
I think that Countries and Companies are really the same entity in your design, as are Cities and Divisions. I'd get rid of the Countries and Cities tables (and Cities_Has_Departments) and, if necessary, add a boolean flag IsPublicSector to the Companies table (or a CompanyType column if there are more choices than simply Private Sector / Public Sector).
Also, I think there's an error in your usage of the Departments table. It looks like the Departments table serves as a reference to the various kinds of departments that each customer division can have. If so, it should be called DepartmentTypes. But your clients (who are, I assume, attendees) do not belong to a department TYPE, they belong to an actual department instance in a company. As it stands now, you will know that a given client belongs to an HR department somewhere, but not which one!
In other words, Clients should be linked to the table that you call Divisions_Has_Departments (but that I would call simply Departments). If this is so, then you must collapse Cities into Divisions as discussed above if you want to use standard referential integrity in the database.
Answer by Jacob G
The only changes I would make are:
1- Change your VARCHAR columns to NVARCHAR. If you might be going international, you may want Unicode.
2- Change your int IDs to GUIDs (uniqueidentifier) if possible (this might just be my personal preference). Assuming you eventually get to the point where you have multiple environments (dev/test/staging/prod), you may want to migrate data from one to the other. Having GUID IDs makes this significantly easier.
3- Three layers for your Company -> Division -> Department structure may not be enough. Now, this might be over-engineering, but you could generalize that hierarchy such that you can support n-levels of depth. This will make some of your queries more complex, so that may not be worth the trade-off. Further, it could be that any client that has more layers may be easily "stuffable" into this model.
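One common way to generalize such a hierarchy to n levels is a single self-referencing table (an adjacency list) queried with a recursive common table expression. This is a sketch of that technique, not something taken from the schema in the question: the `org_units` table and the `descendants` helper are hypothetical, and SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE org_units (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    parent_id INTEGER REFERENCES org_units(id)  -- NULL for a top-level company
);
INSERT INTO org_units VALUES
  (1, 'ACME', NULL),
  (2, 'ACME healthcare', 1),
  (3, 'ACME bodycare', 1),
  (4, 'HR', 2),
  (5, 'Recruiting', 4);
""")

def descendants(conn, root_id):
    """Return (name, depth) for a unit and everything below it, at any depth."""
    return conn.execute("""
        WITH RECURSIVE subtree(id, name, depth) AS (
            SELECT id, name, 0 FROM org_units WHERE id = ?
            UNION ALL
            SELECT u.id, u.name, s.depth + 1
            FROM org_units u
            JOIN subtree s ON u.parent_id = s.id
        )
        SELECT name, depth FROM subtree ORDER BY depth, id
    """, (root_id,)).fetchall()

tree = descendants(conn, 2)
print(tree)
```

This illustrates the trade-off mentioned above: the table is simpler, but every roll-up query now needs recursion instead of a fixed number of joins.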
4- You also have a Status in the Client Table that is a VARCHAR and has no link to the Statuses table. I'd expect a little more clarity there as to what the Client Status represents.
Answer by jrheard
By the way, it's worth noting that if you're generating CSVs already and want to load them into a MySQL database, LOAD DATA LOCAL INFILE is your best friend: http://dev.mysql.com/doc/refman/5.1/en/load-data.html. mysqlimport is also worth looking into; it is a command-line tool that's basically a nice wrapper around LOAD DATA INFILE.
Answer by Will
The following comments are based on my role as a Business Intelligence/Reporting specialist and strategy/planning manager:
I agree with Larry's direction above. IMHO, it's not so much over-engineered; some things just look a little out of place. To keep it simple, I would tag each client directly with a Company ID, Department Description, Division Description, Department Type ID and Division Type ID. Use Department Type ID and Division Type ID as references to lookup tables and as internal reporting/analysis fields for long-term consistency.
The Packs table contains a "Credit" column. Shouldn't that actually be tied to the Client base table, so that if they have many packs you can see how much credit is left for future classes? The application can take care of the calculation and store it centrally in the Client table.
Company info could use many more fields, including the obvious address/phone/etc. information. I'd also be prepared to add D&B "DUNS" columns (Site/Branch/Ultimate) in the long term. Dun and Bradstreet (D&B) has a huge catalog of companies, and you'll find later down the road that their information is very helpful for reporting/analysis. This will take care of the multiple-division issue you mention, and allow you to roll up the hierarchy of sub/division/branches/etc. of large corporations.
You don't mention how many records you'll be working with, which could mean you're setting yourself up for a large development initiative that could have been done quicker, and with far fewer headaches, with prepackaged "reporting" software. If you're not dealing with a large database (< 65,000 rows), make sure MS-Access, OpenOffice (Base) or related report/app dev solutions couldn't do the trick. I use Oracle's free APEX software quite a bit myself; it comes with their free database, Oracle XE. Just download it from their site.
FYI - Reporting insight: for large databases, you typically have two database instances: a) a transaction database for recording each detailed record, and b) a reporting database (data mart/data warehouse) housed on a separate machine. For more information, search for both Star Schema and Snowflake Schema.
Regards.
Answer by Hans Westerbeek
Most things have already been said, but I feel that I can add one thing: it is quite common for younger developers to worry about performance a little bit too much up-front, and your question about joining tables seems to go into that direction. This is a software development anti-pattern called 'Premature Optimization'. Try to banish that reflex from your mind :)
One more thing: Do you believe you really need the 'cities' and 'countries' tables? Wouldn't having a 'city' and 'country' column in the departments table suffice for your use cases? E.g. does your application need to list departments by city and cities by country?
Answer by HLGEM
I want to address only the concern that joining to multiple tables will cause a performance hit. Do not be afraid to normalize because you will have to do joins. Joins are normal and expected in relational databases, and those databases are designed to handle them well. You will need to set PK/FK relationships (for data integrity; this is important to consider in designing), but in many databases FKs are not automatically indexed. Since they will be used in the joins, you will definitely want to start by indexing the FKs. PKs generally get an index on creation, as they have to be unique. It is true that data warehouse design reduces the number of joins, but usually one doesn't get to the point of data warehousing until one has millions of records that need to be accessed in one report. Even then, almost all data warehouses start with a transactional database to collect the data in real time, and then data is moved to the warehouse on a schedule (nightly or monthly or whatever the business need is). So this is a good start even if you need to design a data warehouse later to improve report performance.
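The effect of indexing a foreign key column can be checked with the database's query plan. The sketch below uses SQLite, which does not index foreign key columns on its own, as a stand-in for MySQL; the table names and the index name are assumptions, and the `plan` helper is written for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE packs    (id INTEGER PRIMARY KEY, client_id INTEGER NOT NULL);
CREATE TABLE sessions (id INTEGER PRIMARY KEY,
                       pack_id INTEGER NOT NULL REFERENCES packs(id),
                       status_id INTEGER NOT NULL);
""")

def plan(conn, sql):
    """Return the human-readable part of SQLite's query plan for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # 4th column holds the description

QUERY = "SELECT id FROM sessions WHERE pack_id = 10"

plan_before = plan(conn, QUERY)  # no index on the FK column yet
conn.execute("CREATE INDEX idx_sessions_pack_id ON sessions(pack_id)")
plan_after = plan(conn, QUERY)   # the planner can now use the new index

print(plan_before)
print(plan_after)
```

The exact plan wording differs between SQLite versions, and MySQL has its own `EXPLAIN` output, so treat the printed text as illustrative rather than something to match literally.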
I must say your design is impressive for a first year CS student.
Answer by Larry OBrien
I've worked in the training / school domain and I thought I'd point out that there's generally a M:1 relationship between what you call "sessions" (instances of a given course) and the course itself. In other words, your catalog offers the course ("Spanish 101" or whatever), but you might have two different instances of it during a single semester (Tu-Th taught by Smith, Wed-Fri taught by Jones).
Other than that, it looks like a good start. I bet you'll find that the client domain (graphs leading to "clients") is more complex than you've modeled, but don't go overboard with that until you've got some real data to guide you.
Answer by Chris Dennett
It isn't over-engineered; this is how I would approach the problem. Joining is fine, and there won't be much of a performance hit (joins are completely necessary unless you de-normalise the database, which isn't recommended!). For statuses, see if you can use an enum datatype instead, to optimise that table out.