database: What is a "big database"?
Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/647210/
Just what is 'A big database'?
Asked by Randin
Ok, dumb question I know but I see the nebulous comment 'a large database' as well as small and medium and I wonder just what that means. Can someone define what a small, medium and large database is for us SQL neophytes?
Answered by John Feminella
There isn't a threshold where a small database becomes medium or a medium database becomes large. Generally, when I hear these terms, I think of particular orders of magnitude in terms of total records being stored.
- Small: 10^5 or fewer records.
- Medium: 10^5 to 10^7 records.
- Large: 10^7 to 10^9 records.
- Very large: 10^9 or more records.
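As a rough illustration, the record-count buckets above can be written as a small lookup. This is only a sketch: the function name is made up, and because the listed ranges overlap at their edges, it assumes that exactly 10^5 counts as small while exactly 10^7 and 10^9 fall into the larger bucket.

```python
def classify_by_record_count(total_records: int) -> str:
    """Map a total record count to the rough size labels listed above."""
    if total_records < 0:
        raise ValueError("record count cannot be negative")
    if total_records <= 10**5:   # "10^5 or fewer"
        return "small"
    if total_records < 10**7:    # assumption: 10^7 itself counts as large
        return "medium"
    if total_records < 10**9:    # "10^9 or more" is very large
        return "large"
    return "very large"


for count in (5_000, 2_000_000, 300_000_000, 40_000_000_000):
    print(f"{count:>14,} records -> {classify_by_record_count(count)}")
```

The cut-offs are orders of magnitude, not hard limits, so a database sitting near a boundary could reasonably be described either way.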
As poster dkretz suggested, you could also think about it in terms of the properties each kind of database has. Categorizing it this way, I'd say:
Small: Performance is not a concern. Your queries run fine without making any special optimizations. You see only a marginal performance difference when using front-line enhancements like indexes.
Medium: Your database probably has one or more staff that are assigned part-time to its maintenance and care. These people pay attention to the database's health; their primary administrative responsibility is to prevent unacceptable performance problems and minimize downtime.
Large: Probably has dedicated staff member(s) whose job is to work on the database and improve performance, as well as make sure that application changes don't cause schema breakage over the lifetime of the database. Metrics about the health and status of the database are monitored closely. Significant expertise is required to understand and perform optimizations.
Very large: The database stores vast amounts of information that must be readily accessible. Performance optimizations are absolutely required to wring every last ounce of speed out of each query, and without them, the database would be much less usable or even impossible to use. The database may be using sophisticated or innovative replication or clustering techniques, pushing the boundaries of current technology.
Note that these are entirely subjective, and that someone may very well have a perfectly legitimate alternate definition of "large".
Answered by dkretz
One way to figure it is by observing your test queries.
A small database is one where indexes don't matter.
A medium database is one where queries take longer than one second if you don't have an appropriate index in place.
A big database is one where queries often take hours to optimize, using a combination of query design, index modification, and many test cycles.
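One way to observe whether an index matters for a test query is to ask the database for its query plan before and after creating the index. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table, column, and index names are made up for illustration, and other database engines expose their plans through different commands.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return "; ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    ((i % 1000, i * 0.5) for i in range(50_000)),
)

sql = "SELECT total FROM orders WHERE customer_id = ?"
plan_without_index = query_plan(conn, sql, (42,))

# Hypothetical index on the column used in the WHERE clause.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_with_index = query_plan(conn, sql, (42,))

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

On a table this small the elapsed time barely changes either way, which is the point of the "small database" definition; the plan text only shows whether the engine scans the whole table or searches through the index.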
Answered by core
Large databases are ones that force you to stop using relational databases.
In other words, a normalized, relational database where all the indexes in the world can't help you meet your response time requirements because of the massive JOINs.
If you've ever had to abandon relational databases for something else, you're either a poor database developer, have no expert DBA, or have a very large database.
Answered by vmarquez
“Large Database” is indeed a nebulous concept. The answers to this question already contain very different opinions. Some approaches to defining “small”, “medium” and “large” databases may make more sense than others, but to some extent I consider each definition right, true and valid.
Some definitions make more sense than others because they focus on different aspects of importance for the design, programming, use, maintenance and administration of a Database and these different aspects are what really matter for a usable Database. It just happens that all these aspects are impacted by the nebulous concept of “Database size”.
So, does this mean it does not matter whether you can say if a particular database is big or not?
Certainly not. What it means is that you will apply the concept differently when evaluating different design, operational and administrative aspects of your database. It also means that the concept will remain nebulous every time.
As an example: Database Index strategy (an aspect of Database design) is impacted by record count for each table (a measure of “size”), by record size times record count (another measure of “size”), and by Query Vs. Creation/Update/Delete operations ratio (an aspect of Database usage).
Query response times are better if indexes are used for tables with a large number of records. Depending on the nature of your WHERE, ORDER BY and record-aggregation clauses, you may need several indexes for certain tables.
Creation, Update and Delete operations are impacted negatively with the increase of number of indexes on the affected table(s). More indexes for an affected table means more changes that the RDBMS must perform, spending more time and more resources to apply those changes.
Also, if your RDBMS spends more time applying those changes, then the locks are held for longer too, impacting the response times of other queries being sent to the system at the same time.
So, How do you balance the quantity and design of your indexes? How do you know if you need an additional index and if by adding that index you will not be introducing a big negative impact on query response times? Answer: You test and profile your database against a target load as per your load/performance requirements and analyze the profiling data in order to discover if further optimizations/redesigns/indexes are needed.
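As a very small example of that kind of measurement, the sketch below times the same batch of inserts against a table carrying zero to three secondary indexes. It uses Python's built-in sqlite3 module and an in-memory database. The table, column, and index names and the row count are made up, and a real test would run your own schema against a realistic load on your actual RDBMS.

```python
import sqlite3
import time


def time_inserts(num_indexes: int, num_rows: int = 20_000) -> float:
    """Time bulk inserts into a table carrying `num_indexes` secondary indexes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, c INTEGER)"
    )
    # Create up to three hypothetical single-column indexes.
    for column in ("a", "b", "c")[:num_indexes]:
        conn.execute(f"CREATE INDEX idx_events_{column} ON events ({column})")
    # Scatter the values so index entries do not arrive in sorted order.
    rows = [(i * 7919 % 10_007, i * 104_729 % 10_009, i) for i in range(num_rows)]
    start = time.perf_counter()
    with conn:  # one transaction, committed when the block exits
        conn.executemany("INSERT INTO events (a, b, c) VALUES (?, ?, ?)", rows)
    elapsed = time.perf_counter() - start
    inserted = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    assert inserted == num_rows
    conn.close()
    return elapsed


for n in range(4):
    print(f"{n} indexes: {time_inserts(n):.3f} s for 20,000 inserts")
```

The absolute numbers depend on the machine and are noisy between runs, so only the trend across several runs is worth reading. The same harness could be extended to time representative queries alongside the inserts.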
Different index strategies are required for different Query Vs. Creation/Update/Delete operation ratios. If your database is under a heavy load of queries but is rarely updated, the performance of the overall application will be better if you add every index that improves query response times. On the other hand, if your database is constantly being updated but does not serve large query operations, then the performance will be better if you use fewer indexes.
There are other aspects of course: Database Schema design, Storage Strategy, Network design, Backup strategy, Stored Procedures/Triggers/Etc. programming, Application Programming (against the Database), Etc. All these aspects are impacted differently by distinct concepts of “size” (record size, record count, index size, index count, schema design, storage size, etc.).
I'd like to have more time, as this topic is fascinating. I hope this small contribution serves as a starting point for you in this fascinating world of SQL.
Answered by obecalp
You have to account for hardware advancement for this definition:
Small database: working set fits into the physical RAM of a single commodity server (about 16GB now)
Medium database: fits into a single or several (through RAID) commodity hard drives on a single machine (up to several TBs now)
Large database: Data needs to be distributed across multiple commodity servers in order to fit (up to several PBs now).
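These hardware-relative definitions could be sketched as a check against the capacity of one commodity server. Everything numeric here is an assumption to adjust for current hardware: the 16 GB RAM default comes from the figure above, the 4 TB disk default is a stand-in for "several TBs", and the function name and order of checks are only one possible reading.

```python
GB = 10**9   # decimal units, for simplicity
TB = 10**12


def classify_by_hardware(working_set_bytes, total_bytes,
                         ram_bytes=16 * GB, disk_bytes=4 * TB):
    """Classify a database relative to one commodity server's capacity."""
    if total_bytes > disk_bytes:
        # Does not fit on one machine's disks: must be distributed.
        return "large"
    if working_set_bytes <= ram_bytes:
        # The frequently accessed data fits in physical RAM.
        return "small"
    # Fits on one machine's disks, but the working set spills out of RAM.
    return "medium"


print(classify_by_hardware(8 * GB, 200 * GB))
print(classify_by_hardware(64 * GB, 2 * TB))
print(classify_by_hardware(64 * GB, 500 * TB))
```

Because the defaults are parameters, the same database can move from "medium" to "small" simply by re-running the check with the specifications of newer hardware.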
Answered by karlcow
According to the Wikipedia article on Very Large Database:
A very large database, or VLDB, is a database that contains an extremely high number of tuples (database rows), or occupies an extremely large physical filesystem storage space. The most common definition of VLDB is a database that occupies more than 1 terabyte or contains several billion rows, although naturally this definition changes over time.
Answered by pearcewg
If you have a database that is large enough that you can't just "back it up" to put on a development or test box, you likely have a "large database".
Answered by Zoredache
I think something like Wikipedia or the US census data is a 'big' database. My personal address list or to-do list is a small database. A middle-sized database is something in between.
You could try and define the sizes by how many servers you needed. A small database is a component of an application you run on your desktop, a mid-sized database would be a single mysql (whatever) server somewhere, and a large database is going to require multiple servers with some kind of replication/failover support.