Original URL: http://stackoverflow.com/questions/36317002/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Spark: Programmatically creating dataframe schema in scala
Asked by Stuart
I have a smallish dataset that will be the result of a Spark job. I am thinking about converting this dataset to a dataframe for convenience at the end of the job, but have struggled to correctly define the schema. The problem is the last field below (topValues); it is an ArrayBuffer of tuples -- keys and counts.
val innerSchema =
  StructType(
    Array(
      StructField("value", StringType),
      StructField("count", LongType)
    )
  )

val outputSchema =
  StructType(
    Array(
      StructField("name", StringType, nullable=false),
      StructField("index", IntegerType, nullable=false),
      StructField("count", LongType, nullable=false),
      StructField("empties", LongType, nullable=false),
      StructField("nulls", LongType, nullable=false),
      StructField("uniqueValues", LongType, nullable=false),
      StructField("mean", DoubleType),
      StructField("min", DoubleType),
      StructField("max", DoubleType),
      StructField("topValues", innerSchema)
    )
  )
val result = stats.columnStats.map { c =>
  Row(c._2.name, c._1, c._2.count, c._2.empties, c._2.nulls, c._2.uniqueValues,
    c._2.mean, c._2.min, c._2.max, c._2.topValues.topN)
}
val rdd = sc.parallelize(result.toSeq)
val outputDf = sqlContext.createDataFrame(rdd, outputSchema)
outputDf.show()
The error I'm getting is a MatchError: scala.MatchError: ArrayBuffer((10,2), (20,3), (8,1)) (of class scala.collection.mutable.ArrayBuffer)
When I debug and inspect my objects, I'm seeing this:
rdd: ParallelCollectionRDD[2]
rdd.data: "ArrayBuffer" size = 2
rdd.data(0): [age,2,6,0,0,3,14.666666666666666,8.0,20.0,ArrayBuffer((10,2), (20,3), (8,1))]
rdd.data(1): [gender,3,6,0,0,2,0.0,0.0,0.0,ArrayBuffer((M,4), (F,2))]
It seems to me that I've accurately described the ArrayBuffer of tuples in my innerSchema, but Spark disagrees.
Any idea how I should be defining the schema?
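For reference, the error can be reproduced in isolation. Below is a minimal sketch (assuming the same Spark 1.x sc and sqlContext as in the question): a field declared as a plain StructType expects a Row or a Product (tuple/case class) as its value, so handing it an ArrayBuffer trips the match in Spark's internal type converter.

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// topValues declared as a single struct, but the value is a collection:
val badSchema = StructType(Array(
  StructField("topValues", StructType(Array(
    StructField("value", StringType),
    StructField("count", LongType)
  )))
))

val badRdd = sc.parallelize(Seq(Row(ArrayBuffer(("10", 2L), ("20", 3L)))))
// Fails with the scala.MatchError from the question (exact wording may
// vary by Spark version): a StructType field cannot hold an ArrayBuffer.
sqlContext.createDataFrame(badRdd, badSchema).show()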
Answer by David Griffin
val rdd = sc.parallelize(Array(Row(ArrayBuffer(1, 2, 3, 4))))
val df = sqlContext.createDataFrame(
  rdd,
  StructType(Seq(StructField("arr", ArrayType(IntegerType, false), false)))
)
df.printSchema
root
|-- arr: array (nullable = false)
| |-- element: integer (containsNull = false)
df.show
+------------+
| arr|
+------------+
|[1, 2, 3, 4]|
+------------+
Answer by Stuart
As David pointed out, I needed to use an ArrayType. Spark is happy with this:
val outputSchema =
  StructType(
    Array(
      StructField("name", StringType, nullable=false),
      StructField("index", IntegerType, nullable=false),
      StructField("count", LongType, nullable=false),
      StructField("empties", LongType, nullable=false),
      StructField("nulls", LongType, nullable=false),
      StructField("uniqueValues", LongType, nullable=false),
      StructField("mean", DoubleType),
      StructField("min", DoubleType),
      StructField("max", DoubleType),
      StructField("topValues", ArrayType(StructType(Array(
        StructField("value", StringType),
        StructField("count", LongType)
      ))))
    )
  )
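To confirm the fix end to end, here is a self-contained sketch with made-up sample rows modeled on the debug output above (the tuple keys are written as strings so they match the StringType value field; sc, sqlContext, and outputSchema are assumed as defined above):

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row

val sampleRows = Seq(
  Row("age", 2, 6L, 0L, 0L, 3L, 14.67, 8.0, 20.0,
    ArrayBuffer(("10", 2L), ("20", 3L), ("8", 1L))),
  Row("gender", 3, 6L, 0L, 0L, 2L, 0.0, 0.0, 0.0,
    ArrayBuffer(("M", 4L), ("F", 2L)))
)

val sampleDf = sqlContext.createDataFrame(sc.parallelize(sampleRows), outputSchema)
sampleDf.select("name", "topValues").show()

Spark converts each tuple in the buffer to a struct, so topValues comes out as an array of {value, count} pairs.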
Answer by Arun Goudar
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val searchPath = "/path/to/.csv"
val columns = "col1,col2,col3,col4,col5,col6,col7"
val fields = columns.split(",")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val customSchema = StructType(fields)

val dfPivot = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .schema(customSchema)
  .load(searchPath)
Loading the data with a custom schema is much faster than loading it with the default inferred schema.
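The speed-up comes from skipping schema inference, which costs Spark an extra pass over the input to guess column types. A sketch of the two variants side by side (assuming Spark 2.x, where the csv reader is built in, and the customSchema defined above):

// Inferred: Spark scans the file once to guess types, then again to load it.
val inferred = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv(searchPath)

// Explicit: a single pass, using the customSchema defined above.
val explicit = spark.read
  .option("header", "false")
  .schema(customSchema)
  .csv(searchPath)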

