Java - How to view an Apache Parquet file in Windows?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow CC BY-SA and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/50933429/

Date: 2020-08-10 23:53:00  Source: igfitidea

How to view Apache Parquet file in Windows?

Tags: java, .net, parquet

Asked by Sal

I couldn't find any plain English explanations regarding Apache Parquet files. Such as:

  1. What are they?
  2. Do I need Hadoop or HDFS to view/create/store them?
  3. How can I create parquet files?
  4. How can I view parquet files?

Any help regarding these questions is appreciated.

Accepted answer by Sal

What is Apache Parquet?

Apache Parquet is a binary file format that stores data in a columnar fashion. Data inside a Parquet file is similar to an RDBMS style table where you have columns and rows. But instead of accessing the data one row at a time, you typically access it one column at a time.

Apache Parquet is one of the modern big data storage formats. It has several advantages, some of which are:

  • Columnar storage: efficient data retrieval, efficient compression, etc.
  • Metadata at the end of the file: allows Parquet files to be generated from a stream of data (common in big data scenarios)
  • Supported by all Apache big data products

Do I need Hadoop or HDFS?

No. Parquet files can be stored in any file system, not just HDFS. As mentioned above, it is a file format. So it's just like any other file, where it has a name and a .parquet extension. What will usually happen in big data environments, though, is that one dataset will be split (or partitioned) into multiple parquet files for even more efficiency.

All Apache big data products support Parquet files by default. That is why it might seem like it can only exist in the Apache ecosystem.

How can I create/read Parquet Files?

As mentioned, all current Apache big data products such as Hadoop, Hive, Spark, etc. support Parquet files by default.

So it's possible to leverage these systems to generate or read Parquet data. But this is far from practical. Imagine that in order to read or create a CSV file you had to install Hadoop/HDFS + Hive and configure them. Luckily there are other solutions.

To create your own parquet files:

To view parquet file contents:

Are there other methods?

Possibly. But not many exist, and they mostly aren't well documented. This is due to Parquet being a very complicated file format (I could not even find a formal definition). The ones I've listed are the only ones I'm aware of as I'm writing this response.

Answered by nirolo

In addition to @sal's extensive answer there is one further question I encountered in this context:

How can I access the data in a parquet file with SQL?

As we are still in the Windows context here, I don't know of that many ways to do that. The best results were achieved by using Spark as the SQL engine, with Python as the interface to Spark. However, I assume that the Zeppelin environment works as well, but I have not tried that out myself yet.

There is a very well done guide by Michael Garlanyk to guide one through the installation of the Spark/Python combination.

Once set up, I'm able to interact with parquets through:

from os import walk
from os.path import join
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

parquetdir = r'C:\PATH\TO\YOUR\PARQUET\FILES'

# Getting all parquet files in a dir.
# There might be easier ways to access single parquets, but I had nested dirs
dirpath, dirnames, filenames = next(walk(parquetdir), (None, [], []))

# For each parquet file, i.e. table in our database, Spark creates a temp view
# named after the parquet filename (minus its ".parquet" extension)
print('New tables available: \n')

for parquet in filenames:
    print(parquet[:-8])
    spark.read.parquet(join(parquetdir, parquet)).createOrReplaceTempView(parquet[:-8])

Once your parquets are loaded this way, you can interact with them through the Pyspark API, e.g. via:

my_test_query = spark.sql("""
select
  field1,
  field2
from parquetfilename1
where
  field1 = 'something'
""")

my_test_query.show()

Answered by meow

This is possible now through Apache Arrow, which helps to simplify communication/transfer between different data formats; see my answer here or the official docs in the case of Python.

Basically this allows you to quickly read/write parquet files in a pandas DataFrame-like fashion, giving you the benefit of using notebooks to view and handle such files as if they were regular csv files.

EDIT:

As an example, given a recent version of Pandas, make sure pyarrow is installed:

pip install pyarrow

Then you can simply use pandas to manipulate parquet files:

import pandas as pd

# read
df = pd.read_parquet('myfile.parquet')

# write
df.to_parquet('my_newfile.parquet')

df.head()

Answered by Eugene

Maybe too late for this thread, but here is a complement for anyone who wants to view Parquet files with a desktop application running on Mac or Linux.
There is a desktop application to view Parquet and also other binary-format data like ORC and AVRO. It's a pure Java application, so it can run on Linux, Mac, and also Windows. Please check Bigdata File Viewer for details.

It supports complex data types like array, map, etc.

(screenshot: the Bigdata File Viewer application)