
Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/18129581/


How do I output the results of a HiveQL query to CSV?

Tags: database, hadoop, hive, hiveql

Asked by AAA

We would like to put the results of a Hive query into a CSV file. I thought the command should look like this:


insert overwrite directory '/home/output.csv' select books from table;

When I run it, it says it completed successfully, but I can never find the file. How do I find this file, or should I be extracting the data in a different way?


Answered by Lukas Vermeer

Although it is possible to use INSERT OVERWRITE to get data out of Hive, it might not be the best method for your particular case. First let me explain what INSERT OVERWRITE does, then I'll describe the method I use to get tsv files from Hive tables.


According to the manual, your query will store the data in a directory in HDFS. The format will not be csv.


Data written to the filesystem is serialized as text with columns separated by ^A and rows separated by newlines. If any of the columns are not of primitive type, then those columns are serialized to JSON format.

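This is also why the file from the question never turns up on the local disk: the path names a directory in HDFS, not a local file. You can locate the output and copy it down with the hadoop CLI (a sketch, assuming the path from the question and default settings):

# The path from the question is an HDFS directory, not a local file
hadoop fs -ls /home/output.csv

# Copy its files down to the local filesystem
hadoop fs -get /home/output.csv /tmp/hive_output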

A slight modification (adding the LOCAL keyword) will store the data in a local directory.


INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp' select books from table;

When I run a similar query, here's what the output looks like.


[lvermeer@hadoop temp]$ ll
total 4
-rwxr-xr-x 1 lvermeer users 811 Aug  9 09:21 000000_0
[lvermeer@hadoop temp]$ head 000000_0 
"row1""col1"1234"col3"1234FALSE
"row2""col1"5678"col3"5678TRUE

Personally, I usually run my query directly through Hive on the command line for this kind of thing, and pipe it into the local file like so:


hive -e 'select books from table' > /home/lvermeer/temp.tsv

That gives me a tab-separated file that I can use. Hope that is useful for you as well.


Based on patch HIVE-3682, I suspect a better solution is available when using Hive 0.11, but I am unable to test this myself. The new syntax should allow the following.


INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp' 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
select books from table;

Hope that helps.


Answered by David Kjerrumgaard

If you want a CSV file, you can modify Lukas' solution as follows (assuming you are on a Linux box):


hive -e 'select books from table' | sed 's/[[:space:]]\+/,/g' > /home/lvermeer/temp.csv

Answered by Olaf

You should use the CREATE TABLE AS SELECT (CTAS) statement to create a directory in HDFS with the files containing the results of the query. After that you will have to export those files from HDFS to your regular disk and merge them into a single file.


You also might have to do some trickery to convert the files from '\001'-delimited to CSV. You could use a custom CSV SerDe or postprocess the extracted file.

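A minimal sketch of that flow, assuming a hypothetical source table my_table, a results table results_csv, and the default warehouse location:

# CTAS with an explicit comma delimiter avoids the '\001' post-processing
hive -e "CREATE TABLE results_csv ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' AS SELECT books FROM my_table"

# Merge the table's HDFS files into a single local file
hadoop fs -getmerge /user/hive/warehouse/results_csv /tmp/results.csv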

Answered by bigmakers

You can use INSERT … DIRECTORY, as in this example:


INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE state = 'CA';

OVERWRITE and LOCAL have the same interpretations as before and paths are interpreted following the usual rules. One or more files will be written to /tmp/ca_employees, depending on the number of reducers invoked.

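If several reducer files do land in that directory, they can be concatenated into one (a sketch, assuming the local directory above):

# Hive names the output files 000000_0, 000001_0, ... inside the target directory
cat /tmp/ca_employees/0* > /tmp/ca_employees_all.txt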

Answered by Ray

If you are using HUE this is fairly simple as well. Simply go to the Hive editor in HUE, execute your Hive query, then save the result file locally as XLS or CSV, or you can save the result file to HDFS.


Answered by Ram Ghadiyaram

You can use the Hive string function CONCAT_WS(string delimiter, string str1, string str2, ..., strn)


For example:


hive -e "select CONCAT_WS(',', cola, colb, colc, ..., coln) from Mytable" > /home/user/Mycsv.csv

Answered by sisanared

I was looking for a similar solution, but the ones mentioned here would not work. My data had all variations of whitespace (space, newline, tab) chars and commas.


To make the column data TSV-safe, I replaced all \t chars in the column data with a space, and executed Python code on the command line to generate a CSV file, as shown below:

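The tab replacement itself can be done inside the query with Hive's built-in regexp_replace (col1 and my_table here are hypothetical, just to illustrate what tab_replaced_hql_query might contain):

# Replace tab characters in a column with spaces before the CSV conversion below
hive -e "select regexp_replace(col1, '\t', ' ') from my_table"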

hive -e 'tab_replaced_hql_query' |  python -c 'exec("import sys;import csv;reader = csv.reader(sys.stdin, dialect=csv.excel_tab);writer = csv.writer(sys.stdout, dialect=csv.excel)\nfor row in reader: writer.writerow(row)")'

This created a perfectly valid csv. Hope this helps those who come looking for this solution.

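For readability, the exec one-liner above unrolls into this equivalent script (the same csv module calls; save it as, say, tsv_to_csv.py and pipe the Hive output through it instead of the one-liner):

# tsv_to_csv.py - read tab-separated rows on stdin, write CSV rows to stdout
import csv
import sys

reader = csv.reader(sys.stdin, dialect=csv.excel_tab)
writer = csv.writer(sys.stdout, dialect=csv.excel)
for row in reader:
    writer.writerow(row)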

Answered by Dattatrey Sindol

I had a similar issue and this is how I was able to address it.


Step 1 - Loaded the data from the Hive table into another table, as follows:


DROP TABLE IF EXISTS TestHiveTableCSV;
CREATE TABLE TestHiveTableCSV
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' AS
SELECT <column list> FROM TestHiveTable;

Step 2 - Copied the blob from the Hive warehouse to the new location with the appropriate extension:


Start-AzureStorageBlobCopy `
    -DestContext $destContext `
    -SrcContainer "Source Container" `
    -SrcBlob "hive/warehouse/TestHiveTableCSV/000000_0" `
    -DestContainer "Destination Container" `
    -DestBlob "CSV/TestHiveTable.csv"

Answered by Terminator17

hive --outputformat=csv2 -e "select * from yourtable" > my_file.csv

or


hive --outputformat=csv2 -e "select * from yourtable" > [your_path]/file_name.csv

For TSV, just change csv2 to tsv2 in the commands above and run your queries.

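For example (assuming your Hive CLI accepts tsv2 the same way it accepts csv2 above):

hive --outputformat=tsv2 -e "select * from yourtable" > my_file.tsv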

Answered by Rishabh Sachdeva

This is the most CSV-friendly way I have found to output the results of HiveQL.
You don't need any grep or sed commands to format the data; Hive supports it directly, you just need to add the extra outputformat flag.


hive --outputformat=csv2 -e 'select * from <table_name> limit 20' > /path/toStore/data/results.csv