自动分片 postgresql?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议。如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,并将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/10323327/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
Auto sharding postgresql?
提问 by Lostsoul
I have a problem where I need to load a lot of data (5+ billion rows) into a database very quickly (ideally less than 30 min, but quicker is better), and I was recently advised to look into PostgreSQL (I failed with MySQL and was looking at HBase/Cassandra). My setup is that I have a cluster (currently 8 servers) that generates a lot of data, and I was thinking of running databases locally on each machine in the cluster so it writes quickly locally, and then at the end (or throughout the data generation) the data is merged together. The data is not in any order, so I don't care which specific server it's on (as long as it's eventually there).
我有一个问题:我需要非常快地把大量数据(50 多亿行)加载到数据库中(理想情况下少于 30 分钟,越快越好)。最近有人建议我看看 PostgreSQL(我用 MySQL 失败了,也在考察 HBase/Cassandra)。我的环境是一个会产生大量数据的集群(目前 8 台服务器),我想在集群的每台机器上本地运行数据库,这样可以快速地在本地写入,然后在最后(或在生成数据的过程中)把数据合并到一起。数据没有任何顺序,所以我不在乎它最终落在哪台具体的服务器上(只要它最终在那里就行)。
My questions are: are there any good tutorials or places to learn about PostgreSQL auto sharding (I found results about firms like Skype doing auto sharding, but no tutorials; I want to play with this myself)? Is what I'm trying to do possible? Because the data is not in any order, I was going to use an auto-incrementing ID number; will that cause a conflict if the data is merged (this is not a big issue anymore)?
我的问题是:有没有什么好的教程或资料可以学习 PostgreSQL 自动分片(我查到像 Skype 这样的公司在做自动分片,但没有找到教程,我想自己动手试试)?我想做的事情可行吗?因为数据没有任何顺序,我打算使用自动递增的 ID;如果合并数据,这会不会导致冲突(这已经不再是大问题了)?
Update: Frank's idea below kind of eliminated the auto-incrementing conflict issue I was asking about. The question is basically now, how can I learn about auto sharding and would it support distributed uploads of data to multiple servers?
更新:下面 Frank 的想法基本上解决了我所问的自动递增冲突问题。现在的问题基本上是:我该如何学习自动分片,以及它是否支持把数据分布式上传到多台服务器?
回答 by Craig Ringer
First: Do you really need to insert the generated data from your cluster straight into a relational database? You don't mind merging it at the end anyway, so why bother inserting into a database at all? In your position I'd have your cluster nodes write flat files, probably gzip'd CSV data. I'd then bulk import and merge that data using a tool like pg_bulkload.
第一:您真的需要把集群生成的数据直接插入关系数据库吗?反正您不介意在最后再合并它,那为什么还要插入数据库呢?换作是我,我会让集群节点写平面文件,例如 gzip 压缩的 CSV 数据,然后用 pg_bulkload 之类的工具批量导入并合并这些数据。
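下面是这种"先写平面文件、再批量导入"思路的一个简单示意:假设每个节点把数据写成 CSV 文件,之后在目标库上导入(表名 readings 和文件路径都是假设值;pg_bulkload 的具体用法请参考其文档,这里仅用内置的 COPY 代替演示):

    -- 目标表(示例结构,请按实际数据调整)
    CREATE TABLE readings (
        recorded_at timestamptz NOT NULL,
        node_id     int         NOT NULL,
        value       double precision
    );

    -- 在 psql 中从客户端的 CSV 文件批量导入(文件路径为假设值)
    \copy readings FROM '/data/node1/readings.csv' WITH (FORMAT csv)

    -- PostgreSQL 9.3+ 也可以直接从命令读取,省去手动解压 gzip:
    -- COPY readings FROM PROGRAM 'gunzip -c /data/node1/readings.csv.gz' WITH (FORMAT csv);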
If you do need to insert directly into a relational database: that's (part of) what PgPool-II and (especially) PgBouncer are for. Configure PgBouncer to load-balance across different nodes and you should be pretty much sorted.
如果您确实需要直接插入关系数据库:这(部分)正是 PgPool-II 和(尤其是)PgBouncer 的用途。把 PgBouncer 配置为在不同节点之间做负载均衡,基本上就差不多了。
Note that PostgreSQL is a transactional database with strong data durability guarantees. That also means that if you use it in a simplistic way, doing lots of small writes can be slow. You have to consider what trade-offs you're willing to make between data durability, speed, and cost of hardware.
请注意,PostgreSQL 是具有强数据持久性保证的事务型数据库。这也意味着,如果您以简单粗暴的方式使用它,大量的小规模写入可能会很慢。您必须考虑愿意在数据持久性、速度和硬件成本之间做出怎样的权衡。
At one extreme, each INSERT can be its own transaction that's synchronously committed to disk before returning success. This limits the number of transactions per second to the number of fsync()s your disk subsystem can do, which is often only in the tens or hundreds per second (without a battery-backed RAID controller). This is the default if you do nothing special and if you don't wrap your INSERTs in a BEGIN and COMMIT.
在一个极端情况下,每条 INSERT 都可以是一个独立的事务,在返回成功之前同步提交到磁盘。这会把每秒事务数限制在磁盘子系统每秒能完成的 fsync() 次数,通常每秒只有几十到几百次(在没有带电池备份的 RAID 控制器的情况下)。如果您不做任何特殊处理,也不把 INSERT 包裹在 BEGIN 和 COMMIT 中,这就是默认行为。
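为了直观说明,这里给出一个简化的对比示例(表名 readings 沿用上文的假设):逐条自动提交的 INSERT 每条都要等待一次同步刷盘,而把一批 INSERT 包在显式事务里,整批只需要一次提交:

    -- 慢:每条语句都是一个独立事务,各自同步刷盘
    INSERT INTO readings VALUES (now(), 1, 20.5);
    INSERT INTO readings VALUES (now(), 1, 20.7);

    -- 快得多:一批插入共享一次提交
    BEGIN;
    INSERT INTO readings VALUES (now(), 1, 20.5);
    INSERT INTO readings VALUES (now(), 1, 20.7);
    -- ……几千条……
    COMMIT;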
At the other extreme, you say "I really don't care if I lose all this data" and use unlogged tables for your inserts. This basically gives the database permission to throw your data away if it can't guarantee it's OK - say, after an OS crash, database crash, power loss, etc.
在另一个极端,您说"我真的不在乎丢掉所有这些数据",然后用未记录表(unlogged table)来做插入。这基本上是允许数据库在无法保证数据完好时直接丢弃您的数据,比如在操作系统崩溃、数据库崩溃、断电等情况之后。
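未记录表只需要在建表时加一个关键字(下面沿用假设的 readings 结构;注意崩溃后表内容会被清空):

    -- 不写 WAL,写入快得多,但不具备崩溃安全性
    CREATE UNLOGGED TABLE readings_staging (
        recorded_at timestamptz NOT NULL,
        node_id     int         NOT NULL,
        value       double precision
    );

    -- 需要持久化时,可以把数据转入普通表,
    -- 或者(PostgreSQL 9.5+)直接把它改成带日志的表:
    -- ALTER TABLE readings_staging SET LOGGED;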
The middle ground is where you will probably want to be. This involves some combination of asynchronous commit, group commits (commit_delay and commit_siblings), batching inserts into groups wrapped in an explicit BEGIN and END, etc. Instead of INSERT batching you could do COPY loads of a few thousand records at a time. All these things trade data durability off against speed.
中间地带可能才是您想要的位置。这涉及异步提交、组提交(commit_delay 和 commit_siblings)、把插入分批包裹在显式的 BEGIN 和 END 中等手段的某种组合。除了分批 INSERT,您也可以用 COPY 一次加载几千条记录。所有这些做法都是在用数据持久性换取速度。
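下面是一个折中方案的示意(synchronous_commit 可以按会话设置;commit_delay、commit_siblings 通常在 postgresql.conf 中调整,这里不展开;文件路径为假设值):

    -- 本会话内异步提交:崩溃时最近提交的少量事务可能丢失,但数据库不会损坏
    SET synchronous_commit = off;

    -- 用 COPY 成批加载,比逐条 INSERT 快得多
    COPY readings FROM '/data/node1/batch_0001.csv' WITH (FORMAT csv);
    COPY readings FROM '/data/node1/batch_0002.csv' WITH (FORMAT csv);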
For fast bulk inserts you should also consider inserting into tables without any indexes except a primary key. Maybe not even that. Create the indexes once your bulk inserts are done. This will be a hell of a lot faster.
对于快速批量插入,您还应该考虑插入到除主键外没有任何索引的表中,甚至连主键也可以先不要。等批量插入完成后再创建索引,这样会快得多。
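也就是先用"裸表"装数据,装完再建索引,例如(索引名和列名仅作示意):

    -- 加载期间:表上没有索引,也没有主键约束
    COPY readings FROM '/data/node1/batch_0001.csv' WITH (FORMAT csv);

    -- 全部加载完成后再一次性建索引,比边插入边维护索引快得多
    CREATE INDEX readings_time_idx ON readings (recorded_at);
    CREATE INDEX readings_node_idx ON readings (node_id);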
回答 by Edmund
Here are a few things that might help:
以下是一些可能有帮助的事情:
The DB on each server should have a small metadata table with that server's unique characteristics, such as which server it is; servers can be numbered sequentially. Apart from the contents of that table, it's probably wise to try to keep the schema on each server as similar as possible.
With billions of rows you'll want bigint ids (or UUID or the like). With bigints, you could allocate a generous range for each server, and set its sequence up to use it. E.g. server 1 gets 1..1000000000000000, server 2 gets 1000000000000001 to 2000000000000000 etc.
If the data is simple data points (like a temperature reading from exactly 10 instruments every second) you might get efficiency gains by storing it in a table with columns (time timestamp, values double precision[]) rather than the more correct (time timestamp, instrument_id int, value double precision). This is an explicit denormalisation in aid of efficiency. (I blogged about my own experience with this scheme.)
每台服务器上的数据库都应该有一个记录该服务器独有特征的小型元数据表,比如它是哪台服务器;服务器可以按顺序编号。除了这张表的内容之外,尽量让每台服务器上的模式(schema)保持一致可能是明智的。
对于数十亿行数据,您会需要 bigint 类型的 id(或 UUID 之类)。使用 bigint 时,您可以为每台服务器分配一个足够大的区间,并把它的序列设置为只在该区间内取值。例如服务器 1 用 1..1000000000000000,服务器 2 用 1000000000000001 到 2000000000000000,以此类推(见本列表后面的示例)。
如果数据是简单的数据点(比如每秒恰好从 10 台仪器各读取一次温度),您可以把它存成列为 (time timestamp, values double precision[]) 的表,而不是更"规范"的 (time timestamp, instrument_id int, value double precision),从而提高效率。这是一种为了效率而做的显式反规范化。(我在博客里写过自己使用这种方案的经验。)
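针对上面第二点,为不同服务器划分互不重叠的 id 区间,可以用类似下面的方式实现(区间大小、表名和表结构都是假设值,仅作示意):

    -- 在 1 号服务器上执行:序列只在本机的区间内取值
    CREATE SEQUENCE data_points_id_seq
        MINVALUE 1
        MAXVALUE 1000000000000000;

    -- 2 号服务器则改用下一个区间:
    -- CREATE SEQUENCE data_points_id_seq
    --     MINVALUE 1000000000000001
    --     MAXVALUE 2000000000000000;

    -- 各服务器建同样结构的表,id 取自本机的序列,
    -- 这样日后合并时主键不会冲突
    CREATE TABLE data_points (
        id    bigint PRIMARY KEY DEFAULT nextval('data_points_id_seq'),
        value double precision
    );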
回答 by C. Ramseyer
Sorry I don't have a tutorial at hand, but here's an outline of a possible solution:
抱歉,我手头没有教程,但这里有一个可能的解决方案的概述:
- Load one eighth of your data into a PG instance on each of the servers
- For optimum load speed, don't use inserts but the COPY method
- When the data is loaded, do not combine the eight databases into one. Instead, use plProxy to launch a single statement to query all databases at once (or the right one to satisfy your query)
- 把八分之一的数据分别加载到每台服务器上的 PG 实例中
- 为获得最佳加载速度,不要用 INSERT,而要用 COPY 方法
- 数据加载完成后,不要把八个数据库合并成一个。相反,用 plProxy 发出一条语句同时查询所有数据库(或只查询能满足您查询的那一个),大致写法见下面的示例
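plProxy 的大致用法是:在一台"代理"库上定义函数,函数体里声明集群和路由方式,由它把查询转发到各个分片。下面是一个简化示意(集群名 data_cluster、函数名、表 data_points 都是假设值,实际使用前还需要先安装 plproxy 扩展并配置好集群的连接信息):

    -- 在代理库上启用扩展
    CREATE EXTENSION plproxy;

    -- 在所有分片上运行查询;每个分片返回一行,由调用方再汇总
    CREATE FUNCTION count_points()
    RETURNS SETOF bigint AS $$
        CLUSTER 'data_cluster';
        RUN ON ALL;
        SELECT count(*) FROM data_points;
    $$ LANGUAGE plproxy;

    -- 或者按键路由到某一个分片
    CREATE FUNCTION get_point_value(p_id bigint)
    RETURNS SETOF double precision AS $$
        CLUSTER 'data_cluster';
        RUN ON hashtext(p_id::text);
        SELECT value FROM data_points WHERE id = p_id;
    $$ LANGUAGE plproxy;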
As already noted, keys might be an issue. Use non-overlapping sequences or uuids or sequence numbers with a string prefix, shouldn't be too hard to solve.
如前所述,键(主键)可能是个问题。使用互不重叠的序列、UUID 或带字符串前缀的序号,应该不难解决。
You should start with a COPY test on one of the servers and see how close to your 30-minute goal you can get. If your data is not important and you have a recent PostgreSQL version, you can try using unlogged tables, which should be a lot faster (but not crash-safe). Sounds like a fun project, good luck.
您应该先在其中一台服务器上做一次 COPY 测试,看看离 30 分钟的目标还差多远。如果您的数据不重要,而且用的是较新的 PostgreSQL 版本,可以试试未记录表(unlogged table),它应该会快得多(但不具备崩溃安全性)。听起来是个有趣的项目,祝你好运。
回答 by Erik Aronesty
You could use MySQL, which supports auto-sharding across a cluster.
您可以使用 MySQL,它支持在集群内自动分片。