Python: How to copy and convert Parquet files to CSV

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/39419975/

How to copy and convert parquet files to csv

Tags: python, hadoop, apache-spark, pyspark, parquet

Asked by eleanora

I have access to an HDFS file system and can see Parquet files with

hadoop fs -ls /user/foo

How can I copy those parquet files to my local system and convert them to csv so I can use them? The files should be simple text files with a number of fields per row.

Answered by Zoltan

Try

df = spark.read.parquet("/path/to/infile.parquet")
df.write.csv("/path/to/outfile.csv")

Relevant API documentation: pyspark.sql.DataFrameReader.parquet and pyspark.sql.DataFrameWriter.csv.

Both /path/to/infile.parquet and /path/to/outfile.csv should be locations on the HDFS filesystem. You can specify hdfs://... explicitly, or you can omit it, as it is usually the default scheme.

You should avoid using file://..., because a local file means a different file to every machine in the cluster. Output to HDFS instead, then transfer the results to your local disk from the command line:

hdfs dfs -get /path/to/outfile.csv /path/to/localfile.csv

Or display it directly from HDFS:

hdfs dfs -cat /path/to/outfile.csv
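
A practical note on the approach above: df.write.csv writes a directory of part files, not a single file. Below is a minimal sketch (assuming a running SparkSession named spark and a hypothetical NameNode address namenode:8020, both of which you would adjust for your cluster) that spells out the hdfs:// scheme explicitly and collapses the output into one part file with a header row:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-csv").getOrCreate()

# "namenode:8020" is a hypothetical NameNode address; adjust for your cluster.
df = spark.read.parquet("hdfs://namenode:8020/path/to/infile.parquet")

# coalesce(1) collapses the data into a single partition so the output
# directory contains one part file; header=True writes a header row.
df.coalesce(1).write.csv("hdfs://namenode:8020/path/to/outfile.csv", header=True)

The output path is still a directory; the single part-*.csv file inside it is what you would fetch or display with the hdfs dfs commands shown above.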

Answered by Zoltan

If there is a table defined over those parquet files in Hive (or if you define such a table yourself), you can run a Hive query on that and save the results into a CSV file. Try something along the lines of:

insert overwrite local directory dirname
  row format delimited fields terminated by ','
  select * from tablename;

Substitute dirname and tablename with actual values. Be aware that any existing content in the specified directory gets deleted. See Writing data into the filesystem from queries for details.
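
If you would rather stay in PySpark than go through the Hive CLI, the same idea can be expressed with Spark SQL. A rough sketch, assuming the Hive metastore is reachable from Spark and tablename is again a placeholder for your actual table:

from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark read tables registered in the Hive metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# tablename and the output path are placeholders; substitute actual values.
df = spark.sql("select * from tablename")
df.write.csv("/path/to/outfile.csv", header=True)

As with the first answer, the output lands on HDFS as a directory of part files.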

Answered by Yusuf Hassan

Since you might not know the exact name of your Parquet file, a snippet for a more dynamic approach would be:

import glob

# Assumes an existing SQLContext (or SparkSession) named sqlContext.
for filename in glob.glob("[location_of_parquet_file]/*.snappy.parquet"):
    print(filename)
    df = sqlContext.read.parquet(filename)
    df.write.csv("[destination]")
    print("csv generated")