Database choice for large data volume?

Note: this page is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/629445/
Asked by Marko
I'm about to start a new project which should have a rather large database.
The number of tables will not be large (<15); the majority of the data (99%) will be contained in one big table, which is almost insert/read only (no updates).
The estimated amount of data in that one table is going to grow at 500,000 records a day, and we should keep at least 1 year of them to be able to do various reports.
There needs to be a (read-only) replicated database as a backup/failover, and perhaps for offloading reports at peak times.
I don't have first-hand experience with databases that large, so I'm asking those who do: which DB is the best choice in this situation? I know that Oracle is the safe bet, but I'm more interested in whether anyone has experience with PostgreSQL or MySQL in a similar setup.
Accepted answer by DNS
I've used PostgreSQL in an environment where we're seeing 100K-2M new rows per day, most added to a single table. However, those rows tend to be reduced to samples and then deleted within a few days, so I can't speak about long-term performance with more than ~100M rows.
I've found that insert performance is quite reasonable, especially if you use the bulk COPY. Query performance is fine, although the choices the planner makes sometimes puzzle me; particularly when doing JOINs / EXISTS. Our database requires pretty regular maintenance (VACUUM/ANALYZE) to keep it running smoothly. I could avoid some of this by more carefully optimizing autovacuum and other settings, and it's not so much of an issue if you're not doing many DELETEs. Overall, there are some areas where I feel it's more difficult to configure and maintain than it should be.
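For illustration, here is a minimal sketch of the kind of bulk loading and maintenance described above; the table, columns, and file path are all hypothetical:

```sql
-- Hypothetical append-mostly table; names and types are illustrative only.
CREATE TABLE events (
    id          bigserial PRIMARY KEY,
    recorded_at timestamptz NOT NULL,
    payload     text
);

-- Bulk-load a day's worth of rows with COPY, which is far faster than
-- row-by-row INSERTs; '/path/to/events.csv' is an assumed server-side path.
COPY events (recorded_at, payload)
FROM '/path/to/events.csv' WITH (FORMAT csv);

-- Periodic maintenance keeps the planner's statistics fresh; autovacuum
-- usually covers this, but it can also be run explicitly.
VACUUM ANALYZE events;
```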
I have not used Oracle, and I've used MySQL only for small datasets, so I can't compare performance. But PostgreSQL does work fine for large datasets.
Answered by S.Lott
Do you have a copy of "The Data Warehouse Toolkit"?
The suggestion there is to do the following.
Separate the fact (measurable, numeric) values from the dimensions that qualify or organize those facts. One big table isn't really the best idea; it's a fact table that dominates the design, plus a number of small dimension tables to allow "slicing and dicing" the facts (see the sketch after this list).
Keep the facts in simple flat files until you want to do SQL-style reporting. Don't create and back up a database. Create and back up files; load a database only for the reports you must do from SQL.
Where possible, create summary or extra datamarts for analysis. In some cases, you may need to load the whole thing into a database. If your files reflect your table design, all databases have bulk-loader tools that can populate and index SQL tables from the files.
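As an illustration of that fact/dimension split, here is a minimal star-schema sketch; all table and column names are hypothetical:

```sql
-- Small dimension tables that qualify the facts (names are assumptions).
CREATE TABLE dim_date (
    date_id   integer PRIMARY KEY,   -- e.g. 20090310
    full_date date NOT NULL,
    year      smallint NOT NULL,
    month     smallint NOT NULL
);

CREATE TABLE dim_source (
    source_id   integer PRIMARY KEY,
    source_name text NOT NULL
);

-- The one big fact table: narrow rows, numeric measures, and foreign
-- keys into the dimensions for "slicing and dicing".
CREATE TABLE fact_measurements (
    date_id   integer NOT NULL REFERENCES dim_date,
    source_id integer NOT NULL REFERENCES dim_source,
    value     numeric NOT NULL
);
```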
Answered by user76035
The amount of data (200m records per year) is not really big and should be manageable with any standard database engine.
The case is even easier if you do not need live reports on it. I'd mirror and pre-aggregate the data on some other server, e.g. in a daily batch. As S.Lott suggested, you might like to read up on data warehousing.
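A minimal sketch of such a daily pre-aggregation batch, reusing the hypothetical events table from above and an assumed summary table named daily_counts:

```sql
-- Hypothetical summary table, populated once per day on the reporting server.
CREATE TABLE daily_counts (
    day    date PRIMARY KEY,
    n_rows bigint NOT NULL
);

-- Nightly batch: roll yesterday's raw rows up into a single summary row,
-- so that peak-time reports never have to touch the big table.
INSERT INTO daily_counts (day, n_rows)
SELECT date_trunc('day', recorded_at)::date, count(*)
FROM events
WHERE recorded_at >= date_trunc('day', now() - interval '1 day')
  AND recorded_at <  date_trunc('day', now())
GROUP BY 1;
```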
Answered by MrValdez
Google's BigTable database and Hadoop are two database engines that can handle large amounts of data.
Answered by kevchadders
Some interesting points regarding Google BigTable are...
Bigtable vs. DBMS
- Fast query rate
- No joins, no SQL support; a column-oriented database
- Uses one Bigtable instead of many normalized tables
- Is not even in 1NF from a traditional point of view
- Designed to support historical queries via a timestamp field => what did this webpage look like yesterday?
- Data compression is easier: rows are sparse
I highlighted the joins and the lack of SQL support because you mentioned you will need to run a series of reports. I don't know how much of an impact (if any) losing the ability to do this would have on running your reports if you were to use it.
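To make that concrete, a typical report over the star schema sketched earlier joins the fact table to its dimensions; a query like the following (using those hypothetical tables) is exactly what BigTable's no-join, no-SQL model would not let you express directly:

```sql
-- Monthly totals per source: an ordinary reporting join that relies
-- on SQL support, which BigTable lacks.
SELECT d.year, d.month, s.source_name, sum(f.value) AS total
FROM fact_measurements f
JOIN dim_date   d ON d.date_id   = f.date_id
JOIN dim_source s ON s.source_id = f.source_id
GROUP BY d.year, d.month, s.source_name
ORDER BY d.year, d.month;
```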
Answered by Xn0vv3r
We use Firebird for a really huge database (keeping data for more than 30 years now) and it scales very well.
The best thing about it is that while you have properties to configure, unlike e.g. Oracle you just install it and it works very well, without the need to start configuring before you can use it.