
Disclaimer: this page is a mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/34814082/


Apache Parquet Could not read footer: java.io.IOException:

Tags: java, hadoop, io, apache-spark, parquet

Asked by Lavdërim Shala

I have a Spark project running on a Cloudera VM. In my project I load data from a Parquet file and then process it. Everything works fine, but I need to run this project on a school cluster, and there I am having problems reading the Parquet file at this part of the code:


DataFrame schemaRDF = sqlContext.parquetFile("/var/tmp/graphs/sib200.parquet");

I get the following error:


Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=file:/var/tmp/graphs/sib200.parquet/_common_metadata; isDirectory=false; length=413; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
    at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$28.apply(ParquetRelation.scala:750)


Based on searching online, this seems to be a Parquet version problem.


What I would like to know is how I can find the installed Parquet version on a machine, so I can check whether both environments have the same version. Even better, if you know the exact solution for this error, that would be perfect!


Answered by Bruno Faria

I got the same problem trying to read a Parquet file from S3. In my case the issue was that the required libraries were not available to all workers in the cluster.


There are two ways to fix that:


  • Make sure you add the dependencies on the spark-submit command so they are distributed to the whole cluster (see the sketch after this list)
  • Add the dependencies to the /jars directory under SPARK_HOME on each worker in the cluster.
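As a rough sketch of the first approach (the class name, application jar, and package coordinates below are placeholders, not taken from the question):

    spark-submit \
      --class com.example.MyApp \
      --master yarn \
      --packages org.apache.parquet:parquet-hadoop:1.7.0 \
      my-app.jar

With --packages the dependency is resolved from Maven and shipped to the executors automatically; with --jars you would list local jar paths instead.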

Answered by Ryan Garaygay

If you open a Parquet file in a text editor, at the very bottom you will see something like "parquet-mr", which can help you figure out what version/tool the file was created with.

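As a rough illustration on Linux, you can dump the printable strings from the end of the file, where the footer lives (the path is the one from the question; tail and strings are standard tools):

    tail -c 300 /var/tmp/graphs/sib200.parquet | strings

The created-by string, something like "parquet-mr version 1.6.0", is stored in the footer, so it should appear in the output.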

The method above is simple, but the "creator" can be something else, like Impala or another component that can create Parquet files. Alternatively, you can use parquet-tools: https://github.com/apache/parquet-mr/tree/master/parquet-tools

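For instance, assuming you have built the parquet-tools jar, its meta command prints the footer metadata, including the creator string (the version in the jar name below is just an example):

    java -jar parquet-tools-1.8.1.jar meta /var/tmp/graphs/sib200.parquet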

Since it looks like you are using Spark to read the Parquet file, you might be able to work around it by setting spark.sql.parquet.filterPushdown to false. Maybe try that first (more info here: https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration; change "latest" to your version of Spark).

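A minimal sketch of that setting with the SQLContext API used in the question (Spark 1.x); sqlContext is assumed to be an existing SQLContext:

    // Disable Parquet filter pushdown before reading the file
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "false");
    DataFrame schemaRDF = sqlContext.parquetFile("/var/tmp/graphs/sib200.parquet");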

If that does not work, then maybe check whether this is still an issue with the latest version of Spark. If it is fixed there, you can try to trace the history of which commits fixed it, which might give you insight into a possible work-around.


Or, if you know the Parquet version, you can switch to the corresponding branch of parquet-mr (and build parquet-tools from it) and use the tools for that version to test your metadata files (_metadata, _common_metadata) or one of the Parquet files; you should be able to reproduce the error and debug from there.


Answered by Srini

Can you try sqlContext.read().load() instead of sqlContext.parquetFile()?

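In the Java API from the question, that would look roughly like this (Parquet is the default data source in Spark SQL, so format("parquet") is optional but explicit):

    // Generic load path through DataFrameReader instead of the older parquetFile()
    DataFrame schemaRDF = sqlContext.read()
            .format("parquet")
            .load("/var/tmp/graphs/sib200.parquet");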

Please refer to the link below: http://spark.apache.org/docs/latest/sql-programming-guide.html#generic-loadsave-functions


Please try it and let me know if that works. If not, we can try another way.
