
Disclaimer: this page is a mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/34814082/


Apache Parquet Could not read footer: java.io.IOException:

Tags: java, hadoop, io, apache-spark, parquet

Asked by Lavdërim Shala

I have a Spark project running on a Cloudera VM. In my project I load data from a Parquet file and then process it. Everything works fine, but I need to run this project on a school cluster, and there I am having problems reading the Parquet file at this part of the code:


DataFrame schemaRDF = sqlContext.parquetFile("/var/tmp/graphs/sib200.parquet");

I get the following error:


Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=file:/var/tmp/graphs/sib200.parquet/_common_metadata; isDirectory=false; length=413; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
    at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$28.apply(ParquetRelation.scala:750)


Based on searching online, this seems to be a Parquet version problem.


What I would like to know is how I can find the installed Parquet version on a machine, so I can check whether both environments have the same version. Even better, if you know the exact solution for this error, that would be perfect!


Answered by Bruno Faria

I got the same problem trying to read a Parquet file from S3. In my case the issue was that the required libraries were not available to all workers in the cluster.


There are two ways to fix that:


  • Make sure you add the dependencies on the spark-submit command so they are distributed to the whole cluster (see the sketch after this list)
  • Add the dependencies to the /jars directory under SPARK_HOME on each worker in the cluster.
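As a rough sketch of the first approach (the class name, application jar, and package coordinates below are placeholders, not taken from the question):

    spark-submit \
      --class com.example.MyApp \
      --master yarn \
      --packages org.apache.parquet:parquet-hadoop:1.7.0 \
      my-app.jar

With --packages the dependency is resolved from Maven and shipped to the executors automatically; with --jars you would list local jar paths instead.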

Answered by Ryan Garaygay

If you open a Parquet file in a text editor, at the very bottom you will see something like "parquet-mr", which can help you figure out what version/tool the file was created with.

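As a rough illustration on Linux, you can dump the printable strings from the end of the file, where the footer lives (the path is the one from the question; tail and strings are standard tools):

    tail -c 300 /var/tmp/graphs/sib200.parquet | strings

The created-by string, something like "parquet-mr version 1.6.0", is stored in the footer, so it should appear in the output.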

The method above is simple, but the "creator" can be something else, like Impala or another component that can create Parquet files. Alternatively, you can use parquet-tools: https://github.com/apache/parquet-mr/tree/master/parquet-tools

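For instance, assuming you have built the parquet-tools jar, its meta command prints the footer metadata, including the creator string (the version in the jar name below is just an example):

    java -jar parquet-tools-1.8.1.jar meta /var/tmp/graphs/sib200.parquet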

Since it looks like you are using Spark to read the Parquet file, you might be able to work around it by setting spark.sql.parquet.filterPushdown to false. Maybe try that first (more info here: https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration; change "latest" to your version of Spark).

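A minimal sketch of that setting with the SQLContext API used in the question (Spark 1.x); sqlContext is assumed to be an existing SQLContext:

    // Disable Parquet filter pushdown before reading the file
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "false");
    DataFrame schemaRDF = sqlContext.parquetFile("/var/tmp/graphs/sib200.parquet");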

If that does not work, then maybe check whether this is still an issue with the latest version of Spark. If it is fixed there, you can try to trace the history of which commits fixed it, which might give you insight into a possible work-around.


Or, if you know the Parquet version, you can switch to the corresponding branch of parquet-mr (and build parquet-tools from it) and use the tools for that version to test your metadata files (_metadata, _common_metadata) or one of the Parquet files; you should be able to reproduce the error and debug from there.


Answered by Srini

Can you try sqlContext.read().load() instead of sqlContext.parquetFile()?

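In the Java API from the question, that would look roughly like this (Parquet is the default data source in Spark SQL, so format("parquet") is optional but explicit):

    // Generic load path through DataFrameReader instead of the older parquetFile()
    DataFrame schemaRDF = sqlContext.read()
            .format("parquet")
            .load("/var/tmp/graphs/sib200.parquet");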

Please refer to the link below: http://spark.apache.org/docs/latest/sql-programming-guide.html#generic-loadsave-functions


Please try it and let me know if that works. If not, we can try another way.
