Python: How to copy and convert Parquet files to CSV

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/39419975/

How to copy and convert parquet files to csv

Tags: python, hadoop, apache-spark, pyspark, parquet

Asked by eleanora

I have access to an HDFS file system and can see Parquet files with

hadoop fs -ls /user/foo

How can I copy those parquet files to my local system and convert them to csv so I can use them? The files should be simple text files with a number of fields per row.

Answered by Zoltan

Try

df = spark.read.parquet("/path/to/infile.parquet")
df.write.csv("/path/to/outfile.csv")

Relevant API documentation: pyspark.sql.DataFrameReader.parquet and pyspark.sql.DataFrameWriter.csv.

Both /path/to/infile.parquet and /path/to/outfile.csv should be locations on the HDFS filesystem. You can specify hdfs://... explicitly, or you can omit it, as it is usually the default scheme.

You should avoid using file://..., because a local file means a different file to every machine in the cluster. Output to HDFS instead, then transfer the results to your local disk from the command line:

hdfs dfs -get /path/to/outfile.csv /path/to/localfile.csv

Or display it directly from HDFS:

hdfs dfs -cat /path/to/outfile.csv
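
A practical note on the approach above: df.write.csv writes a directory of part files, not a single file. Below is a minimal sketch (assuming a running SparkSession named spark and a hypothetical NameNode address namenode:8020, both of which you would adjust for your cluster) that spells out the hdfs:// scheme explicitly and collapses the output into one part file with a header row:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-csv").getOrCreate()

# "namenode:8020" is a hypothetical NameNode address; adjust for your cluster.
df = spark.read.parquet("hdfs://namenode:8020/path/to/infile.parquet")

# coalesce(1) collapses the data into a single partition so the output
# directory contains one part file; header=True writes a header row.
df.coalesce(1).write.csv("hdfs://namenode:8020/path/to/outfile.csv", header=True)

The output path is still a directory; the single part-*.csv file inside it is what you would fetch or display with the hdfs dfs commands shown above.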

Answered by Zoltan

If there is a table defined over those parquet files in Hive (or if you define such a table yourself), you can run a Hive query on that and save the results into a CSV file. Try something along the lines of:

insert overwrite local directory dirname
  row format delimited fields terminated by ','
  select * from tablename;

Substitute dirname and tablename with actual values. Be aware that any existing content in the specified directory gets deleted. See Writing data into the filesystem from queries for details.
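
If you would rather stay in PySpark than go through the Hive CLI, the same idea can be expressed with Spark SQL. A rough sketch, assuming the Hive metastore is reachable from Spark and tablename is again a placeholder for your actual table:

from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark read tables registered in the Hive metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# tablename and the output path are placeholders; substitute actual values.
df = spark.sql("select * from tablename")
df.write.csv("/path/to/outfile.csv", header=True)

As with the first answer, the output lands on HDFS as a directory of part files.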

Answered by Yusuf Hassan

Since you might not know the exact name of your Parquet file, a snippet for a more dynamic approach would be:

import glob

# Assumes an existing SQLContext (or SparkSession) named sqlContext.
for filename in glob.glob("[location_of_parquet_file]/*.snappy.parquet"):
    print(filename)
    df = sqlContext.read.parquet(filename)
    df.write.csv("[destination]")
    print("csv generated")