postgresql AWS Redshift JDBC insert performance
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/16485425/
AWS Redshift JDBC insert performance
Asked by dty
I am writing a proof-of-concept app which is intended to take live clickstream data at the rate of around 1000 messages per second and write it to Amazon Redshift.
I am struggling to get anything like the performance some others claim (for example, here).
I am running a cluster with 2 x dw.hs1.xlarge nodes (+ leader), and the machine that is doing the load is an EC2 m1.xlarge instance on the same VPC as the Redshift cluster running 64 bit Ubuntu 12.04.1.
I am using Java 1.7 (openjdk-7-jdk from the Ubuntu repos) and the Postgresql 9.2-1002 driver (principally because it's the only one in Maven Central which makes my build easier!).
I've tried all the techniques shown here, except the last one.
I cannot use COPY FROM because we want to load data in "real time", so staging it via S3 or DynamoDB isn't really an option, and Redshift doesn't support COPY FROM stdin for some reason.
Here is an excerpt from my logs showing that individual rows are being inserted at the rate of around 15/second:
2013-05-10 15:05:06,937 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Beginning batch of 170
2013-05-10 15:05:18,707 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Done
2013-05-10 15:05:18,708 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Beginning batch of 712
2013-05-10 15:06:03,078 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Done
2013-05-10 15:06:03,078 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Beginning batch of 167
2013-05-10 15:06:14,381 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Done
What am I doing wrong? What other approaches could I take?
Answered by Joe Harris
Redshift (aka ParAccel) is an analytic database. The goal is to enable analytic queries to be answered quickly over very large volumes of data. To that end, Redshift stores data in a columnar format. Each column is held separately and compressed against the previous values in the column. This compression tends to be very effective because a given column usually holds many repetitive and similar values.
This storage approach provides many benefits at query time, because only the requested columns need to be read and the data being read is highly compressed. However, the cost is that inserts tend to be slower and require much more effort. Also, inserts that are not perfectly ordered may result in poor query performance until the tables are VACUUMed.
So, by inserting a single row at a time you are working completely against the way that Redshift works. The database has to append your data to each column in succession and calculate the compression. It's a little bit (but not exactly) like adding a single value to a large number of zip archives. Additionally, even after your data is inserted you still won't get optimal performance until you run VACUUM to reorganise the tables.
If you want to analyse your data in "real time" then, for all practical purposes, you should probably choose another database and/or approach. Off the top of my head here are 3:
- Accept a "small" batching window (5-15 minutes) and plan to run VACUUM at least daily (see the micro-batching sketch after this list).
- Choose an analytic database (more $) which copes with small inserts, e.g., Vertica.
- Experiment with "NoSQL" DBs that allow single path analysis, e.g., Acunu Cassandra.
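To make option 1 concrete, here is a minimal micro-batching sketch in Java. Everything in it is illustrative rather than part of the original answer: the 5-minute flush interval, the String[] row type and the writeBatch placeholder (which would issue the batched INSERT described in the answers below) are all assumptions.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class MicroBatcher {
        private final BlockingQueue<String[]> buffer = new LinkedBlockingQueue<String[]>();
        private final ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();

        // Producers call this as clickstream messages arrive.
        public void offer(String[] row) {
            buffer.offer(row);
        }

        // Flush the buffer every 5 minutes -- the "small batching window".
        public void start() {
            timer.scheduleAtFixedRate(new Runnable() {
                public void run() {
                    List<String[]> batch = new ArrayList<String[]>();
                    buffer.drainTo(batch);
                    if (!batch.isEmpty()) {
                        writeBatch(batch); // e.g. one multi-row INSERT + single COMMIT
                    }
                }
            }, 5, 5, TimeUnit.MINUTES);
        }

        void writeBatch(List<String[]> batch) {
            // Placeholder: issue the batched INSERT here (see the sketches below).
        }
    }

A second scheduled task (or a plain cron job) would run VACUUM off-peak to cover the daily-VACUUM part of the suggestion.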
Answered by xpapad
We have been able to insert 1000 rows/sec into Redshift by batching several requests together in the same INSERT statement (in our case we had to batch ~200 value tuples in each INSERT). If you use an ORM layer like Hibernate, you can configure it for batching (e.g., see http://docs.jboss.org/hibernate/orm/3.3/reference/en/html/batch.html)
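For illustration, here is a sketch of what that multi-row batching could look like with plain JDBC, without an ORM. The clicks table, its user_id and url columns, and the two-column rows are hypothetical stand-ins for whatever schema you are loading:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class MultiRowInsert {
        // Builds "INSERT INTO clicks (user_id, url) VALUES (?, ?), (?, ?), ..."
        // with one (?, ?) group per row, so one statement carries the whole batch.
        static String buildSql(int rows) {
            StringBuilder sql = new StringBuilder("INSERT INTO clicks (user_id, url) VALUES ");
            for (int i = 0; i < rows; i++) {
                sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
            }
            return sql.toString();
        }

        static void insertBatch(Connection conn, List<String[]> rows) throws SQLException {
            PreparedStatement ps = conn.prepareStatement(buildSql(rows.size()));
            try {
                int p = 1;
                for (String[] row : rows) { // bind user_id and url for each tuple
                    ps.setString(p++, row[0]);
                    ps.setString(p++, row[1]);
                }
                ps.executeUpdate();         // one statement, one round trip
            } finally {
                ps.close();
            }
        }
    }

One caveat worth knowing: the PostgreSQL protocol limits a prepared statement to 32,767 bind parameters, so keep tuples-per-statement times columns-per-tuple below that (~200 tuples of a few columns is comfortably within range).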
Answered by scorpio155
The reason single inserts are slow is the way Redshift handles commits. Redshift has a single commit queue.
Say you insert row 1, then commit - it goes to the Redshift commit queue to finish the commit.
Next you insert row 2, then commit - again it goes to the commit queue. If during this time the commit of row 1 has not completed, row 2 waits for commit 1 to finish before work starts on the commit for row 2.
So if you batch your inserts, it does a single commit and is faster than single commits to the Redshift system.
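A sketch of that single-commit pattern with JDBC, again using the hypothetical clicks table from the sketch above; the point is only that setAutoCommit(false) plus one commit() sends the whole batch through the commit queue once:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class SingleCommitInsert {
        static void insertWithSingleCommit(Connection conn, List<String[]> rows)
                throws SQLException {
            conn.setAutoCommit(false); // autocommit would queue one commit per row
            PreparedStatement ps =
                    conn.prepareStatement("INSERT INTO clicks (user_id, url) VALUES (?, ?)");
            try {
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();
                }
                ps.executeBatch(); // send all the inserts...
                conn.commit();     // ...then enter the commit queue exactly once
            } finally {
                ps.close();
            }
        }
    }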
You can get commit queue information via issue #8 in the link below. https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
Answered by Jonathan Leger
I've been able to achieve 2,400 inserts/second by batching writes into transactions of 75,000 records each. Each record is small, as you might expect: only about 300 bytes per record.
I'm querying a MariaDB installed on an EC2 instance and inserting the records into RedShift from the same EC2 instance that Maria is installed on.
UPDATE
I modified the way I was doing writes so that it loads the data from MariaDB in 5 parallel threads and writes to RedShift from each thread. That increased performance to 12,000+ writes/second.
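A hedged sketch of that fan-out: five writer threads, each with its own JDBC connection (JDBC connections should not be shared across threads) and its own slice of the data. The slicing, the String[] rows and the reuse of insertWithSingleCommit from the sketch above are assumptions about one way to structure it, not the author's actual code:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelLoader {
        public static void load(List<List<String[]>> slices, final String jdbcUrl)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(5); // 5 writers, as above
            for (final List<String[]> slice : slices) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            // Each thread opens its own connection to Redshift.
                            Connection conn = DriverManager.getConnection(jdbcUrl);
                            try {
                                SingleCommitInsert.insertWithSingleCommit(conn, slice);
                            } finally {
                                conn.close();
                            }
                        } catch (Exception e) {
                            e.printStackTrace(); // real code would handle/retry
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }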
So yeah, if you plan it correctly you can get great performance from RedShift writes.