Scala: How to define schema for custom type in Spark SQL?

Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must attribute the original authors (not this site). Original question: http://stackoverflow.com/questions/32440461/


How to define schema for custom type in Spark SQL?

scala, apache-spark, apache-spark-sql, case-class

Asked by Martin Senne

The following example code tries to put some case objects into a DataFrame. The code includes the definition of a case object hierarchy and a case class using this trait:


import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext

sealed trait Some
case object AType extends Some
case object BType extends Some

case class Data( name : String, t: Some)

object Example {
  def main(args: Array[String]) : Unit = {
    val conf = new SparkConf()
      .setAppName( "Example" )
      .setMaster( "local[*]")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    import sqlContext.implicits._

    val df = sc.parallelize( Seq( Data( "a", AType), Data( "b", BType) ), 4).toDF()
    df.show()
  }
}    

When executing the code, I unfortunately encounter the following exception:


java.lang.UnsupportedOperationException: Schema for type Some is not supported

Questions


  • Is there a possibility to add or define a schema for certain types (here type Some)?
  • Does another approach exist to represent this kind of enumerations?
    • I tried to use Enumeration directly, but also without success (see below).

Code for Enumeration:


object Some extends Enumeration {
  type Some = Value
  val AType, BType = Value
}

Thanks in advance. I hope that the best approach is not to use strings instead.


Accepted answer by zero323

Spark 2.0.0+:


UserDefinedType has been made private in Spark 2.0.0 and as of now it has no Dataset-friendly replacement.


See: SPARK-14155 (Hide UserDefinedType in Spark 2.0)


Most of the time a statically typed Dataset can serve as a replacement. There is a pending JIRA, SPARK-7768, to make the UDT API public again, targeting version 2.4.


See also How to store custom objects in Dataset?

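To illustrate the Dataset route mentioned above, here is a minimal sketch, assuming Spark 2.x and reusing the Data / AType / BType definitions from the question. The Kryo-backed encoder is just one possible workaround (the names DatasetExample and dataEncoder are purely illustrative), since the built-in product encoders cannot handle the sealed trait field:

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetExample")
      .master("local[*]")
      .getOrCreate()

    // Kryo serializes each Data object into a single binary column
    implicit val dataEncoder: Encoder[Data] = Encoders.kryo[Data]

    val ds = spark.createDataset(Seq(Data("a", AType), Data("b", BType)))
    ds.show()                              // one binary "value" column, no per-field columns
    ds.map(_.name)(Encoders.STRING).show() // mapping back to supported types works as usual
  }
}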

Spark < 2.0.0


Is there a possibility to add or define a schema for certain types (here type Some)?


I guess the answer depends on how badly you need this. It looks like it is possible to create a UserDefinedType, but it requires access to DeveloperApi and is not exactly straightforward or well documented.


import org.apache.spark.sql.types._

@SQLUserDefinedType(udt = classOf[SomeUDT])
sealed trait Some
case object AType extends Some
case object BType extends Some

class SomeUDT extends UserDefinedType[Some] {
  override def sqlType: DataType = IntegerType

  override def serialize(obj: Any) = {
    obj match {
      case AType => 0
      case BType => 1
    }
  }

  override def deserialize(datum: Any): Some = {
    datum match {
      case 0 => AType
      case 1 => BType
    }
  }

  override def userClass: Class[Some] = classOf[Some]
}
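
With the UDT in place, the original example should go through instead of raising the UnsupportedOperationException. A quick sanity check might look like this, reusing the SparkContext / SQLContext setup (including import sqlContext.implicits._) from the question:

val df = sc.parallelize(Seq(Data("a", AType), Data("b", BType)), 4).toDF()
df.printSchema()  // the t column is now backed by SomeUDT rather than an unsupported type
df.show()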

You should probably override hashCode and equals as well.

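For reference, a minimal sketch of what such overrides could look like inside the SomeUDT class above; treating all instances as interchangeable is an assumption made here, justified only because SomeUDT carries no state:

  // sketch only: SomeUDT is stateless, so equality is decided by class alone
  override def equals(other: Any): Boolean = other.isInstanceOf[SomeUDT]
  override def hashCode(): Int = classOf[SomeUDT].getName.hashCode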

Its PySpark counterpart can look like this:


from enum import Enum, unique
from pyspark.sql.types import UserDefinedType, IntegerType

class SomeUDT(UserDefinedType):
    @classmethod
    def sqlType(cls):
        return IntegerType()

    @classmethod
    def module(cls):
        return cls.__module__

    @classmethod 
    def scalaUDT(cls): # Required in Spark < 1.5
        return 'net.zero323.enum.SomeUDT'

    def serialize(self, obj):
        return obj.value

    def deserialize(self, datum):
        return {x.value: x for x in Some}[datum]

@unique
class Some(Enum):
    __UDT__ = SomeUDT()
    AType = 0
    BType = 1

In Spark < 1.5 a Python UDT requires a paired Scala UDT, but it looks like this is no longer the case in 1.5.


For a simple UDT like this you can use simple types (for example IntegerType instead of a whole Struct).
