MySQL: How to unload a table on RedShift to a single CSV file?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/20323919/

How to unload a table on RedShift to a single CSV file?

Tags: mysql, amazon-web-services, amazon-redshift

Asked by ciphor

I want to migrate a table from Amazon RedShift to MySQL, but using "unload" will generate multiple data files which are hard to import into MySQL directly.

Is there any approach to unload the table to a single CSV file so that I can import it to MySQL directly?

Answered by Dan Ciborowski - MSFT

In order to unload to a single file, use PARALLEL OFF:

unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_' credentials 
'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
parallel off;
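
With PARALLEL OFF, the output lands at the given prefix, typically with a numeric suffix (for example, venue_000), rolling over to additional files only if the per-file size limit is exceeded.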

Also, I recommend using GZIP to make that file even smaller for download.

unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_' credentials 
'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
parallel off
gzip;
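
Once you have a single file in S3, you can download it, gunzip it, and load it into MySQL. A minimal sketch, assuming the file has been downloaded to /tmp/venue_000 and a matching venue table already exists in MySQL (the path and table name are placeholders; the pipe delimiter matches UNLOAD's default when no DELIMITER option is given):

-- Hypothetical MySQL import of the unloaded file; path and table name are placeholders.
LOAD DATA LOCAL INFILE '/tmp/venue_000'
INTO TABLE venue
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n';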

Answered by Brent Writes Code

This is an old question at this point, but I feel like all the existing answers are slightly misleading. If your question is, "Can I absolutely 100% guarantee that Redshift will ALWAYS unload to a SINGLE file in S3?", the answer is simply NO.

That being said, for most cases, you can generally limit your query in such a way that you'll end up with a single file. Per the documentation (https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html), the main factor in limiting the number of files you generate is the actual raw size in bytes of your export (NOT the number of rows). The limit on the size of an output file generated by the Redshift UNLOAD command is 6.2GB.

So if you want to try to guarantee that you get a single output file from UNLOAD, here's what you should try:

  • Specify PARALLEL OFF. Parallel is "ON" by default and will generally write to multiple files unless you have a tiny cluster (the number of output files with "PARALLEL ON" set is proportional to the number of slices in your cluster). PARALLEL OFF will write files serially to S3 instead of in parallel and will only spill over to using multiple files if you exceed the size limit.
  • Limit the size of your output. The raw size of the data must be less than 6.2GB if you want a single file. So you need to make your query have a more restrictive WHERE clause or use a LIMIT clause to keep the number of records down. Unfortunately, neither of these techniques is perfect, since rows can be of variable size. It's also not clear to me whether the GZIP option affects the output file size spillover limit (it's unclear if 6.2GB is the pre-GZIP size limit or the post-GZIP size limit).

For me, the UNLOAD command that ended up generating a single CSV file in most cases was:

UNLOAD
('SELECT <fields> FROM <table> WHERE <restrict_query>')
TO 's3://<bucket_name>/<filename_prefix>'
CREDENTIALS 'aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>'
DELIMITER AS ','
ADDQUOTES
NULL AS ''
PARALLEL OFF;
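
One usage note on this command: ADDQUOTES wraps every field in double quotes, so a MySQL import of the resulting file would need an ENCLOSED BY '"' clause in its LOAD DATA statement, and a Redshift COPY of the same file would need the REMOVEQUOTES option.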

The other nice side effect of PARALLEL OFF is that it will respect your ORDER BY clause if you have one and generate the files in an order that keeps all the records ordered, even across multiple output files.

Addendum: There seems to be some folkloric knowledge around using LIMIT 2147483647 to force the leader node to do all the processing and generate a single output file, but this doesn't seem to be actually documented anywhere in the Redshift documentation and as such, relying on it seems like a bad idea since it could change at any time.

Answered by davefender

It is a bit of a workaround, but you need to make your query a subquery and include a limit. It will then output to one file. E.g.

select * from (select * from bizdata LIMIT 2147483647);

So basically you are selecting all from a limited set. That is the only way it works. 2147483647 is your max limit, as the LIMIT clause takes a signed 32-bit integer argument (2147483647 is 2^31 - 1, the largest such value).

So the following will unload to one file:

UNLOAD ('SELECT * FROM (
    SELECT bizid, data
    FROM biztable
    LIMIT 2147483647
)')
TO 's3://.......'
CREDENTIALS 'aws_access_key_id=<<aws_access_key_id>>;aws_secret_access_key=<<aws_secret_access_key>>'
CSV;

Answered by Suvrat

There is certainly no way to force Redshift to generate only a single output file.

Under a standard UNLOAD, you will have output files created equal in number to the number of system slices, i.e. a system with 8 slices will create 8 files for a single unload command (this is the fastest method to unload).

If you add the clause PARALLEL OFF to the UNLOAD command, your output will be created as a single file, up to the point where the data extract size exceeds 6.2GB, after which Redshift will automatically break the file into new chunks.

The same thing holds true if you produce compressed output files as well (though of course you will then have a greater chance of producing a single output file, since a compressed file can accommodate more records within the size limit).
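
If the unload does spill into multiple chunks, the MANIFEST option can help you keep track of them: it writes a JSON file alongside the data listing every object the UNLOAD created. A sketch, with placeholder bucket, prefix, and credentials:

-- Sketch: MANIFEST writes <prefix>manifest, a JSON list of all output files,
-- which is handy when PARALLEL OFF still spills past the per-file size limit.
UNLOAD ('SELECT * FROM biztable')
TO 's3://mybucket/biztable_'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
PARALLEL OFF
MANIFEST;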

Answered by Joe Harris

Nope. My earlier claim that you can use a manifest and tell Redshift to direct all output to a single file was wrong; I had used manifests for loading but not unloading.

There appear to be two possible ways to get a single file:

似乎有两种可能的方法来获取单个文件:

  1. Easier: Wrap a SELECT … LIMIT query around your actual output query, as per this SO answer, but this is limited to ~2 billion rows.
  2. Harder: Use the Unix cat utility to join the files together: cat File1.txt File2.txt > union.txt. This will require you to download the files from S3 first.