Python: How to copy and convert Parquet files to CSV
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, cite the original source, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/39419975/
How to copy and convert parquet files to csv
Asked by eleanora
I have access to an HDFS filesystem and can see Parquet files with
hadoop fs -ls /user/foo
How can I copy those Parquet files to my local system and convert them to CSV so I can use them? The files should be simple text files with a number of fields per row.
Answered by Zoltan
Try
df = spark.read.parquet("/path/to/infile.parquet")
df.write.csv("/path/to/outfile.csv")
Relevant API documentation: see the Spark documentation for spark.read.parquet and df.write.csv.
Both /path/to/infile.parquet and /path/to/outfile.csv should be locations on the HDFS filesystem. You can specify hdfs://... explicitly, or you can omit it, as it is usually the default scheme.
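For example, a minimal sketch with the scheme written out explicitly (the paths and the header option here are only illustrative):

# Illustrative paths; header=True writes a header row to the CSV output
df = spark.read.parquet("hdfs:///user/foo/infile.parquet")
df.write.csv("hdfs:///user/foo/outfile.csv", header=True)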
You should avoid using file://..., because a local file means a different file to every machine in the cluster. Output to HDFS instead, then transfer the results to your local disk from the command line:
hdfs dfs -get /path/to/outfile.csv /path/to/localfile.csv
Or display it directly from HDFS:
hdfs dfs -cat /path/to/outfile.csv
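Note that df.write.csv produces a directory of part files rather than a single CSV file. If a single file is required and the data comfortably fits in one partition, a common workaround is to coalesce before writing — a sketch (the output path is illustrative):

# Sketch: force a single output part file; only sensible for reasonably small data
df.coalesce(1).write.csv("/path/to/outfile_single.csv", header=True)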
Answered by Zoltan
If there is a table defined over those parquet files in Hive (or if you define such a table yourself), you can run a Hive query on that and save the results into a CSV file. Try something along the lines of:
insert overwrite local directory dirname row format delimited fields terminated by ',' select * from tablename;
Substitute dirname and tablename with actual values. Be aware that any existing content in the specified directory gets deleted. See Writing data into the filesystem from queries for details.
Answered by Yusuf Hassan
A snippet for a more dynamic approach, since you might not know the exact name of your Parquet file, would be:
import glob

# sqlContext is assumed to be an existing SQLContext (e.g. the one provided by the PySpark shell)
for filename in glob.glob("[location_of_parquet_file]/*.snappy.parquet"):
    print(filename)
    df = sqlContext.read.parquet(filename)
    # append so that output from successive files accumulates under one destination
    df.write.csv("[destination]", mode="append")
    print("csv generated")
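As a quick sanity check, the generated output can be read back and inspected — a sketch (assumes Spark 2.x, where the reader exposes csv(); the destination placeholder is the same as above):

# Read the generated CSV back and display a few rows
check = sqlContext.read.csv("[destination]")
check.show(5)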