
Note: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license; if you use or share it, you must keep the same license and attribute the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/40952441/


Spark case class - decimal type encoder error "Cannot up cast from decimal"

Tags: scala, apache-spark, apache-spark-sql

Asked by mispp

I'm extracting data from MySQL/MariaDB, and during creation of a Dataset an error occurs with the data types:


Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast AMOUNT from decimal(30,6) to decimal(38,18) as it may truncate
The type path of the target object is:
- field (class: "org.apache.spark.sql.types.Decimal", name: "AMOUNT")
- root class: "com.misp.spark.Deal"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;


The case class is defined like this:


case class Deal
(
  AMOUNT: Decimal
)

Does anyone know how to fix it without touching the database?


Accepted answer by mispp

That error says that Apache Spark can't automatically convert the BigDecimal(30,6) from the database to the BigDecimal(38,18) wanted in the Dataset (I don't know why it needs the fixed parameters 38,18, and it is even stranger that Spark can't automatically convert a type with low precision to a type with high precision).


A bug was reported for this: https://issues.apache.org/jira/browse/SPARK-20162 (maybe it was you). Anyway, I found a good workaround: cast the columns to BigDecimal(38,18) in the dataframe while reading the data, and then cast the dataframe to the dataset.


// first read the data into a DataFrame in any way suitable for you
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DecimalType

var df: DataFrame = ???
val dfSchema = df.schema

dfSchema.foreach { field =>
  field.dataType match {
    case t: DecimalType if t != DecimalType(38, 18) =>
      // widen every decimal column to the precision/scale the encoder expects
      df = df.withColumn(field.name, col(field.name).cast(DecimalType(38, 18)))
    case _ => // leave non-decimal columns untouched (avoids a MatchError)
  }
}
df.as[YourCaseClassWithBigDecimal]

It should solve problems with reading (but not with writing, I guess).

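For completeness, here is one hedged way to fill in the "var df: DataFrame = ???" placeholder above, reading from MySQL via Spark's JDBC source (the URL, table name, and credentials are hypothetical placeholders):

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("decimal-upcast").getOrCreate()

// hypothetical connection details - adjust to your environment
var df: DataFrame = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("dbtable", "deals")
  .option("user", "user")
  .option("password", "password")
  .load()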

Answer by jenglert

While I don't have a solution, here is my understanding of what is going on:


By default Spark will infer the schema of the Decimal type (or BigDecimal) in a case class to be DecimalType(38, 18) (see org.apache.spark.sql.types.DecimalType.SYSTEM_DEFAULT). The 38 means the Decimal can hold 38 digits in total (both left and right of the decimal point), while the 18 means 18 of those 38 digits are reserved for the right of the decimal point. That means a Decimal(38, 18) may have 20 digits to the left of the decimal point. Your MySQL schema is decimal(30, 6), which means it may contain values with 24 digits (30 - 6) to the left of the decimal point and 6 digits to the right of the decimal point. Since 24 digits is greater than 20 digits, there could be values that are truncated when converting from your MySQL schema to that Decimal type.

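As a quick check, both numbers can be read straight off that constant (a minimal sketch; the arithmetic in the comments restates the explanation above):

import org.apache.spark.sql.types.DecimalType

// the encoder's target type: 38 digits in total, 18 to the right of the point,
// leaving 38 - 18 = 20 digits for the left of the point
val target = DecimalType.SYSTEM_DEFAULT
println(target.precision) // 38
println(target.scale)     // 18

// the source column decimal(30, 6) allows 30 - 6 = 24 digits to the left,
// and 24 > 20 is exactly why Spark refuses the automatic up-cast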

Unfortunately, inferring the schema from a Scala case class is considered a convenience by the Spark developers, and they have chosen not to support allowing the programmer to specify the precision and scale for Decimal or BigDecimal types within the case class (see https://issues.apache.org/jira/browse/SPARK-18484).

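To see that inference in action, here is a small sketch (the Deal case class mirrors the one from the question; Encoders.product derives the encoder Spark would use for it):

import org.apache.spark.sql.Encoders

case class Deal(AMOUNT: BigDecimal)

// the derived encoder always reports decimal(38,18) for BigDecimal fields;
// there is no annotation to request decimal(30,6) instead (see SPARK-18484)
println(Encoders.product[Deal].schema)
// prints a StructType whose AMOUNT field is DecimalType(38,18)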

Answer by ChoppyTheLumberHyman

As was previously stated by jenglert, since your DB uses DecimalType(30,6), you have 30 slots total and 6 slots past the decimal point, which leaves 30 - 6 = 24 for the area in front of the decimal point. I like to call it a (24 left, 6 right) big-decimal. This of course does not fit into a (20 left, 18 right) (i.e. DecimalType(38,18)), since the latter does not have enough slots on the left.


What we can do here is down-cast the (24 left, 6 right) into a (20 left, 6 right) (i.e. DecimalType(26,6)), so that when it's auto-casted to a (20 left, 18 right) (i.e. DecimalType(38,18)), both sides will fit. The way you do that is, before converting anything to a Dataset, run the following operation on the DataFrame:


import org.apache.spark.sql.types.DecimalType
import spark.implicits._ // for the $"column" syntax, assuming a SparkSession named spark

val downCastableData =
  originalData.withColumn("amount", $"amount".cast(DecimalType(26, 6)))

Then converting to a Dataset should work.

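Putting it together, a hedged sketch (the Deal case class and the amount column name are illustrative, not from the original post):

case class Deal(amount: BigDecimal)

// decimal(26,6) fits inside decimal(38,18), so the automatic up-cast
// performed by .as succeeds (spark.implicits._ imported above)
val ds = downCastableData.as[Deal]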

(Actually, you can cast to anything that's (20 left, 6 right) or less, e.g. (19 left, 5 right), etc...)


Answer by rileyss

According to pyspark, Decimal(38,18) is the default.


When creating a DecimalType, the default precision and scale is (10, 0). When inferring a schema from decimal.Decimal objects, it will be DecimalType(38, 18).

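The same two defaults are visible from Scala (a minimal sketch using constants that exist on org.apache.spark.sql.types.DecimalType):

import org.apache.spark.sql.types.DecimalType

println(DecimalType.USER_DEFAULT)   // precision 10, scale 0: the default when you construct one
println(DecimalType.SYSTEM_DEFAULT) // precision 38, scale 18: the default when inferring from objects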