parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me) at StackOverflow.
Original question: http://stackoverflow.com/questions/37829334/
Asked by serverliving.com
I have saved a remote DB table in Hive using the saveAsTable method. Now, when I try to access the Hive table data using the CLI command select * from table_name, it gives me the error below:
2016-06-15 10:49:36,866 WARN [HiveServer2-Handler-Pool: Thread-96]:
thrift.ThriftCLIService (ThriftCLIService.java:FetchResults(681)) -
Error fetching results: org.apache.hive.service.cli.HiveSQLException:
java.io.IOException: parquet.io.ParquetDecodingException: Can not read
value at 0 in block -1 in file hdfs:
Any idea what I might be doing wrong here?
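For context, the write path is not shown in the question; a minimal sketch of what it presumably looked like is below (the JDBC URL, credentials, and table names are placeholders, not taken from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical read of the remote DB table over JDBC (connection details are made up).
remote_df = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://db-host:3306/source_db")
             .option("dbtable", "source_table")
             .option("user", "db_user")
             .option("password", "db_password")
             .load())

# saveAsTable persists the data as a Hive table; Spark's default table format is Parquet,
# which is where the decoding error on the Hive CLI side comes into play.
remote_df.write.mode("overwrite").saveAsTable("table_name")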
Answered by Amit Kulkarni
Problem: Facing the issue below while querying the data in impyla (data written by a Spark job):
ERROR: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1521667682013_4868_1_00, diagnostics=[Task failed, taskId=task_1521667682013_4868_1_00_000082, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://shastina/sys/datalake_dev/venmo/data/managed_zone/integration/ACCOUNT_20180305/part-r-00082-bc0c080c-4080-4f6b-9b94-f5bafb5234db.snappy.parquet
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.run(TezTaskRunner.java:194)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.run(TezTaskRunner.java:185)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Root Cause:
This issue is caused by the different Parquet conventions used in Hive and Spark. In Hive, the decimal datatype is represented as fixed bytes (INT 32). In Spark 1.4 or later, the default convention is to use the standard Parquet representation for the decimal datatype, under which the underlying physical representation changes with the precision of the column's datatype.
e.g., DECIMAL can be used to annotate the following types:
int32: for 1 <= precision <= 9
int64: for 1 <= precision <= 18; precision < 10 will produce a warning
Hence this issue happens only with datatypes that have different representations in the different Parquet conventions. If the datatype is DECIMAL(10,3), both conventions represent it as INT32, so we won't face an issue. If you are not aware of the internal representation of the datatypes, it is safe to read with the same convention that was used for writing. With Hive, you do not have the flexibility to choose the Parquet convention, but with Spark you do.
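To make the precision-to-physical-type mapping concrete, here is a small illustrative sketch (assuming a recent Spark with the default spark.sql.parquet.writeLegacyFormat=false; the column names and output path are made up):

from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DecimalType

spark = SparkSession.builder.getOrCreate()

# Under the standard convention, the physical Parquet type follows the precision:
# precision <= 9 -> int32, precision <= 18 -> int64, larger -> fixed_len_byte_array.
# The legacy/Hive-style convention stores decimals as fixed-length bytes instead.
schema = StructType([
    StructField("small_dec", DecimalType(9, 2), True),     # annotated as int32
    StructField("medium_dec", DecimalType(18, 2), True),   # annotated as int64
    StructField("large_dec", DecimalType(38, 10), True),   # fixed_len_byte_array
])
df = spark.createDataFrame([(Decimal("1.23"), Decimal("4.56"), Decimal("7.89"))], schema)
df.write.mode("overwrite").parquet("/tmp/decimal_layout_demo")
# Inspecting the written file (e.g. with parquet-tools schema) shows the physical types above.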
Solution: The convention used by Spark to write Parquet data is configurable. It is determined by the property spark.sql.parquet.writeLegacyFormat, whose default value is false. If it is set to "true", Spark will use the same convention as Hive for writing the Parquet data, which resolves the issue.
--conf "spark.sql.parquet.writeLegacyFormat=true"
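The same flag can also be set programmatically instead of on the command line; a sketch (the table names here are placeholders):

from pyspark.sql import SparkSession

# Set it when building the session ...
spark = (SparkSession.builder
         .enableHiveSupport()
         .config("spark.sql.parquet.writeLegacyFormat", "true")
         .getOrCreate())

# ... or toggle it on an existing session before rewriting the data.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

df = spark.table("source_table")                      # placeholder source
df.write.mode("overwrite").saveAsTable("table_name")  # rewritten in the Hive-compatible layout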
Answered by Eric O Lebigot
I had a similar error (but at a positive index in a non-negative block), and it came from the fact that I had created the Parquet data with some Spark dataframe types marked as non-nullable when the values were actually null.
In my case, I thus interpret the error as Spark attempting to read data from a certain non-nullable type and stumbling across an unexpected null value.
To add to the confusion, after reading the Parquet file, Spark reports with printSchema() that all the fields are nullable, whether they are or not. However, in my case, making them really nullable in the original Parquet file solved the problem.
Now, the fact that the error happens at "0 in block -1" is suspicious: it almost looks as if the data was not found, since block -1 makes it look as though Spark has not even started reading anything (just a guess).
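If it helps, one way to make the fields really nullable before writing is to rebuild the DataFrame against a relaxed copy of its schema. This is only a sketch, assuming an existing SparkSession named spark and a DataFrame named df; it only relaxes top-level fields, not nested structs, and the output path is a placeholder:

from pyspark.sql.types import StructType, StructField

def as_nullable(schema):
    # Copy of the schema with every top-level field marked nullable.
    return StructType([StructField(f.name, f.dataType, True) for f in schema.fields])

nullable_df = spark.createDataFrame(df.rdd, as_nullable(df.schema))
nullable_df.write.mode("overwrite").parquet("/tmp/reexported_nullable")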
Answered by Wong Tat Yau
It looks like a schema mismatch problem here. If you set your schema to be not nullable and create your dataframe with a None value, Spark will throw a ValueError: This field is not nullable, but got None error.
[Pyspark]
# Assumes a SparkSession named `spark` (e.g. the one provided by the PySpark shell).
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType

schema = ArrayType(StructType([StructField('A', IntegerType(), nullable=False)]))

# This will throw "ValueError: This field is not nullable, but got None".
df = spark.createDataFrame([[[None]], [[2]]], schema=schema)
df.show()
But that is not the case if you use a udf.
Using the same schema, if you use a udf for the transformation, it won't throw a ValueError even if your udf returns a None. And that is where the data schema mismatch happens.
For example:
df = spark.createDataFrame([[[1]], [[2]]], schema=schema)

def throw_none():
    def _throw_none(x):
        if x[0][0] == 1:
            return [['I AM ONE']]
        else:
            return x
    return udf(_throw_none, schema)

# Since field 'A' only accepts IntegerType, the string "I AM ONE" in the
# first row is silently turned into null. But Spark did not throw a
# ValueError this time! This is where the data schema type mismatch happens.
df = df.select(throw_none()(col("value")).name('value'))
df.show()
Then the following Parquet write and read will throw the parquet.io.ParquetDecodingException error.
df.write.parquet("tmp")
spark.read.parquet("tmp").collect()
So be very careful with null values if you are using a udf, and return the right data type from your udf. And unless it is necessary, please don't set nullable=False in your StructField; setting nullable=True will solve all of these problems.
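For completeness, the schema from the first snippet with that fix applied (same imports as above):

# With nullable=True, None values are accepted and the later Parquet read no longer fails.
schema = ArrayType(StructType([StructField('A', IntegerType(), nullable=True)]))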
Answered by EmmaOnThursday
Are you able to use Avro instead of Parquet to store your Hive table? I ran into this issue because I was using Hive's Decimal datatype, and Parquet from Spark doesn't play nice with Decimal. If you post your table schema and some data samples, debugging will be easier.
Another possible option, from the DataBricks Forum, is to use a Double instead of a Decimal, but that was not an option for my data so I can't report on whether it works.
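For anyone who does want to try the Double route, a minimal sketch would be to cast the decimal column before writing; the column and table names below are placeholders, and note that double is an approximate type, so exactness is traded for compatibility:

from pyspark.sql.functions import col

df_double = df.withColumn("amount", col("amount").cast("double"))
df_double.write.mode("overwrite").saveAsTable("table_name")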
Answered by Sergey Romanovsky
One more way to catch a possible discrepancy is to eyeball the difference in the schemata of the Parquet files produced by the two sources, say Hive and Spark. You can dump a schema with parquet-tools (brew install parquet-tools on macOS):
λ $ parquet-tools schema /usr/local/Cellar/apache-drill/1.16.0/libexec/sample-data/nation.parquet
message root {
  required int64 N_NATIONKEY;
  required binary N_NAME (UTF8);
  required int64 N_REGIONKEY;
  required binary N_COMMENT (UTF8);
}