Original URL: http://stackoverflow.com/questions/36317002/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Spark: Programmatically creating dataframe schema in scala
Asked by Stuart
I have a smallish dataset that will be the result of a Spark job. I am thinking about converting this dataset to a dataframe for convenience at the end of the job, but have struggled to correctly define the schema. The problem is the last field below (topValues); it is an ArrayBuffer of tuples -- keys and counts.
val innerSchema =
  StructType(
    Array(
      StructField("value", StringType),
      StructField("count", LongType)
    )
  )

val outputSchema =
  StructType(
    Array(
      StructField("name", StringType, nullable=false),
      StructField("index", IntegerType, nullable=false),
      StructField("count", LongType, nullable=false),
      StructField("empties", LongType, nullable=false),
      StructField("nulls", LongType, nullable=false),
      StructField("uniqueValues", LongType, nullable=false),
      StructField("mean", DoubleType),
      StructField("min", DoubleType),
      StructField("max", DoubleType),
      StructField("topValues", innerSchema)
    )
  )
val result = stats.columnStats.map { c =>
  Row(c._2.name, c._1, c._2.count, c._2.empties, c._2.nulls, c._2.uniqueValues,
    c._2.mean, c._2.min, c._2.max, c._2.topValues.topN)
}
val rdd = sc.parallelize(result.toSeq)
val outputDf = sqlContext.createDataFrame(rdd, outputSchema)
outputDf.show()
The error I'm getting is a MatchError: scala.MatchError: ArrayBuffer((10,2), (20,3), (8,1)) (of class scala.collection.mutable.ArrayBuffer)
When I debug and inspect my objects, I'm seeing this:
rdd: ParallelCollectionRDD[2]
rdd.data: "ArrayBuffer" size = 2
rdd.data(0): [age,2,6,0,0,3,14.666666666666666,8.0,20.0,ArrayBuffer((10,2), (20,3), (8,1))]
rdd.data(1): [gender,3,6,0,0,2,0.0,0.0,0.0,ArrayBuffer((M,4), (F,2))]
It seems to me that I've accurately described the ArrayBuffer of tuples in my innerSchema, but Spark disagrees.
Any idea how I should be defining the schema?
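For reference, the error can be reproduced in isolation. Below is a minimal sketch (assuming the same Spark 1.x sc and sqlContext as in the question): a field declared as a plain StructType expects a Row or a Product (tuple/case class) as its value, so handing it an ArrayBuffer trips the match in Spark's internal type converter.

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// topValues declared as a single struct, but the value is a collection:
val badSchema = StructType(Array(
  StructField("topValues", StructType(Array(
    StructField("value", StringType),
    StructField("count", LongType)
  )))
))

val badRdd = sc.parallelize(Seq(Row(ArrayBuffer(("10", 2L), ("20", 3L)))))
// Fails with the scala.MatchError from the question (exact wording may
// vary by Spark version): a StructType field cannot hold an ArrayBuffer.
sqlContext.createDataFrame(badRdd, badSchema).show()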
Answer by David Griffin
val rdd = sc.parallelize(Array(Row(ArrayBuffer(1, 2, 3, 4))))
val df = sqlContext.createDataFrame(
  rdd,
  StructType(Seq(StructField("arr", ArrayType(IntegerType, false), false)))
)
df.printSchema
root
|-- arr: array (nullable = false)
| |-- element: integer (containsNull = false)
df.show
+------------+
| arr|
+------------+
|[1, 2, 3, 4]|
+------------+
Answer by Stuart
As David pointed out, I needed to use an ArrayType. Spark is happy with this:
val outputSchema =
  StructType(
    Array(
      StructField("name", StringType, nullable=false),
      StructField("index", IntegerType, nullable=false),
      StructField("count", LongType, nullable=false),
      StructField("empties", LongType, nullable=false),
      StructField("nulls", LongType, nullable=false),
      StructField("uniqueValues", LongType, nullable=false),
      StructField("mean", DoubleType),
      StructField("min", DoubleType),
      StructField("max", DoubleType),
      StructField("topValues", ArrayType(StructType(Array(
        StructField("value", StringType),
        StructField("count", LongType)
      ))))
    )
  )
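To confirm the fix end to end, here is a self-contained sketch with made-up sample rows modeled on the debug output above (the tuple keys are written as strings so they match the StringType value field; sc, sqlContext, and outputSchema are assumed as defined above):

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row

val sampleRows = Seq(
  Row("age", 2, 6L, 0L, 0L, 3L, 14.67, 8.0, 20.0,
    ArrayBuffer(("10", 2L), ("20", 3L), ("8", 1L))),
  Row("gender", 3, 6L, 0L, 0L, 2L, 0.0, 0.0, 0.0,
    ArrayBuffer(("M", 4L), ("F", 2L)))
)

val sampleDf = sqlContext.createDataFrame(sc.parallelize(sampleRows), outputSchema)
sampleDf.select("name", "topValues").show()

Spark converts each tuple in the buffer to a struct, so topValues comes out as an array of {value, count} pairs.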
Answer by Arun Goudar
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val searchPath = "/path/to/.csv"
val columns = "col1,col2,col3,col4,col5,col6,col7"
val fields = columns.split(",")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val customSchema = StructType(fields)

val dfPivot = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .schema(customSchema)
  .load(searchPath)
Loading the data with a custom schema is much faster than loading it with the default inferred schema.
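The speed-up comes from skipping schema inference, which costs Spark an extra pass over the input to guess column types. A sketch of the two variants side by side (assuming Spark 2.x, where the csv reader is built in, and the customSchema defined above):

// Inferred: Spark scans the file once to guess types, then again to load it.
val inferred = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv(searchPath)

// Explicit: a single pass, using the customSchema defined above.
val explicit = spark.read
  .option("header", "false")
  .schema(customSchema)
  .csv(searchPath)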

