Scala: How to create an empty DataFrame with a specified schema?
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must share it under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/31477598/
How to create an empty DataFrame with a specified schema?
Asked by user1735076
I want to create a DataFrame with a specified schema in Scala. I have tried to use a JSON read (I mean, reading an empty file) but I don't think that's the best practice.
Answered by zero323
Let's assume you want a data frame with the following schema:
root
|-- k: string (nullable = true)
|-- v: integer (nullable = false)
You simply define a schema for the data frame and use an empty RDD[Row]:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row

val schema = StructType(
  StructField("k", StringType, true) ::
  StructField("v", IntegerType, false) :: Nil)

// Spark < 2.0
// sqlContext.createDataFrame(sc.emptyRDD[Row], schema)
spark.createDataFrame(sc.emptyRDD[Row], schema)
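A quick sanity check (a minimal sketch, assuming the spark session and the schema value defined above) is to print the schema and confirm the row count:

val df = spark.createDataFrame(sc.emptyRDD[Row], schema)
df.printSchema()
// root
//  |-- k: string (nullable = true)
//  |-- v: integer (nullable = false)
println(df.count())  // 0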
The PySpark equivalent is almost identical:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("k", StringType(), True), StructField("v", IntegerType(), False)
])

# or df = sc.parallelize([]).toDF(schema)

# Spark < 2.0
# sqlContext.createDataFrame([], schema)
df = spark.createDataFrame([], schema)
Using implicit encoders (Scala only) with Product types like Tuple:
import spark.implicits._
Seq.empty[(String, Int)].toDF("k", "v")
or a case class:
case class KV(k: String, v: Int)
Seq.empty[KV].toDF
or
spark.emptyDataset[KV].toDF
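Both encoder-based variants should yield the same schema; a short check (a sketch, assuming the KV case class and the spark session from above):

val fromSeq = Seq.empty[KV].toDF
val fromDs  = spark.emptyDataset[KV].toDF
fromSeq.printSchema()                     // k: string (nullable = true), v: integer (nullable = false)
println(fromSeq.schema == fromDs.schema)  // expected: true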
Answered by Jacek Laskowski
As of Spark 2.0.0, you can do the following.
Case Class
Let's define a Person case class:
scala> case class Person(id: Int, name: String)
defined class Person
Import the implicit Encoders from the spark SparkSession object:
scala> import spark.implicits._
import spark.implicits._
And use SparkSession to create an empty Dataset[Person]:
scala> spark.emptyDataset[Person]
res0: org.apache.spark.sql.Dataset[Person] = [id: int, name: string]
Schema DSL
You could also use a schema "DSL" (see Support functions for DataFrames in org.apache.spark.sql.ColumnName).
scala> val id = $"id".int
id: org.apache.spark.sql.types.StructField = StructField(id,IntegerType,true)
scala> val name = $"name".string
name: org.apache.spark.sql.types.StructField = StructField(name,StringType,true)
scala> import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructType
scala> val mySchema = StructType(id :: name :: Nil)
mySchema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true), StructField(name,StringType,true))
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val emptyDF = spark.createDataFrame(sc.emptyRDD[Row], mySchema)
emptyDF: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> emptyDF.printSchema
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
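The DSL marks fields nullable by default. If you need a non-nullable field, one possible tweak (a sketch, relying on StructField being a regular case class with a nullable flag) is:

scala> val strictId = $"id".int.copy(nullable = false)
strictId: org.apache.spark.sql.types.StructField = StructField(id,IntegerType,false)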
Answered by Ravindra
import scala.reflect.runtime.{universe => ru}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

// Derive the schema from a type tag via reflection, then pair it with an empty RDD
def createEmptyDataFrame[T: ru.TypeTag] =
  hiveContext.createDataFrame(sc.emptyRDD[Row],
    ScalaReflection.schemaFor(ru.typeTag[T].tpe).dataType.asInstanceOf[StructType]
  )

case class RawData(id: String, firstname: String, lastname: String, age: Int)
val sourceDF = createEmptyDataFrame[RawData]
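A hedged usage check (assuming the hiveContext and sc values this snippet relies on are in scope): printing the schema should show the fields derived from RawData by reflection:

sourceDF.printSchema()
// root
//  |-- id: string (nullable = true)
//  |-- firstname: string (nullable = true)
//  |-- lastname: string (nullable = true)
//  |-- age: integer (nullable = false)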
Answered by Nilesh Shinde
Here you can create a schema using StructType in Scala and pass an empty RDD, which lets you create an empty table. The following code does exactly that.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, LongType, StringType}

object EmptyTable extends App {
  val conf = new SparkConf()
  val sc = new SparkContext(conf)

  // Create the SparkSession object
  val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()

  // Schema for the three columns
  val schema = StructType(
    StructField("Emp_ID", LongType, true) ::
    StructField("Emp_Name", StringType, false) ::
    StructField("Emp_Salary", LongType, false) :: Nil)

  // Empty RDD of rows
  val dataRDD = sc.emptyRDD[Row]

  // Pass the RDD and the schema to create the DataFrame
  val newDFSchema = sparkSession.createDataFrame(dataRDD, schema)

  newDFSchema.createOrReplaceTempView("tempSchema")
  sparkSession.sql("create table Finaltable AS select * from tempSchema")
}
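To confirm that the table was created empty, a quick follow-up query (a sketch, assuming the Hive-enabled session above) should return only the header:

sparkSession.sql("select * from Finaltable").show()
// +------+--------+----------+
// |Emp_ID|Emp_Name|Emp_Salary|
// +------+--------+----------+
// +------+--------+----------+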
Answered by Molay
Java version to create an empty Dataset:
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public Dataset<Row> emptyDataSet() {
    SparkSession spark = SparkSession.builder().appName("Simple Application")
            .config("spark.master", "local").getOrCreate();
    // An empty list of rows plus a schema yields an empty Dataset<Row>
    Dataset<Row> emptyDataSet = spark.createDataFrame(new ArrayList<>(), getSchema());
    return emptyDataSet;
}

public StructType getSchema() {
    String schemaString = "column1 column2 column3 column4 column5";
    List<StructField> fields = new ArrayList<>();
    StructField indexField = DataTypes.createStructField("column0", DataTypes.LongType, true);
    fields.add(indexField);
    for (String fieldName : schemaString.split(" ")) {
        StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
        fields.add(field);
    }
    StructType schema = DataTypes.createStructType(fields);
    return schema;
}
Answered by braj
Here is a solution that creates an empty DataFrame in PySpark 2.0.0 or later.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = spark.sparkContext
schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
spark.createDataFrame(sc.emptyRDD(), schema)
Answered by Abraham
As of Spark 2.4.3:
val df = SparkSession.builder().getOrCreate().emptyDataFrame
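Note that emptyDataFrame carries no columns at all, so it only fits cases where no schema is needed; a quick check (a sketch, assuming the df value above):

df.printSchema()            // prints only "root", with no fields
println(df.schema.isEmpty)  // true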

