Scala: How to create an empty DataFrame with a specified schema?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31477598/

How to create an empty DataFrame with a specified schema?

Tags: scala, apache-spark, dataframe, apache-spark-sql

Asked by user1735076

I want to create a DataFrame with a specified schema in Scala. I have tried using a JSON read (I mean, reading an empty file) but I don't think that's the best practice.

Answered by zero323

Lets assume you want a data frame with the following schema:

root
 |-- k: string (nullable = true)
 |-- v: integer (nullable = false)

You simply define a schema for the data frame and use an empty RDD[Row]:

import org.apache.spark.sql.types.{
    StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row

val schema = StructType(
    StructField("k", StringType, true) ::
    StructField("v", IntegerType, false) :: Nil)

// Spark < 2.0
// sqlContext.createDataFrame(sc.emptyRDD[Row], schema) 
spark.createDataFrame(sc.emptyRDD[Row], schema)
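
To confirm the result really is empty but carries the intended schema, a quick sanity check (a minimal sketch; the df name is introduced here for illustration):

val df = spark.createDataFrame(sc.emptyRDD[Row], schema)

df.printSchema()
// root
//  |-- k: string (nullable = true)
//  |-- v: integer (nullable = false)

df.count()  // 0 -- no rows, only a schema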

The PySpark equivalent is almost identical:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("k", StringType(), True), StructField("v", IntegerType(), False)
])

# or df = sc.parallelize([]).toDF(schema)

# Spark < 2.0 
# sqlContext.createDataFrame([], schema)
df = spark.createDataFrame([], schema)

Using implicit encoders (Scala only) with Product types like Tuple:

import spark.implicits._

Seq.empty[(String, Int)].toDF("k", "v")

or a case class:

case class KV(k: String, v: Int)

Seq.empty[KV].toDF

or

spark.emptyDataset[KV].toDF
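
Note that with Product types the nullability is inferred from the field types, so a primitive like Int comes out non-nullable. A quick check (a sketch; output shown for illustration):

Seq.empty[(String, Int)].toDF("k", "v").printSchema()
// root
//  |-- k: string (nullable = true)
//  |-- v: integer (nullable = false)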

Answered by Jacek Laskowski

As of Spark 2.0.0, you can do the following.

Case Class

Let's define a Person case class:

scala> case class Person(id: Int, name: String)
defined class Person

Import the SparkSession's implicit Encoders:

scala> import spark.implicits._
import spark.implicits._

And use SparkSession to create an empty Dataset[Person]:

scala> spark.emptyDataset[Person]
res0: org.apache.spark.sql.Dataset[Person] = [id: int, name: string]
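
If you need a DataFrame rather than a Dataset, toDF converts it and keeps the derived schema (a minimal sketch continuing the same REPL session):

scala> spark.emptyDataset[Person].toDF
res1: org.apache.spark.sql.DataFrame = [id: int, name: string]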

Schema DSL

You could also use a Schema "DSL" (see Support functions for DataFrames in org.apache.spark.sql.ColumnName).

scala> val id = $"id".int
id: org.apache.spark.sql.types.StructField = StructField(id,IntegerType,true)

scala> val name = $"name".string
name: org.apache.spark.sql.types.StructField = StructField(name,StringType,true)

scala> import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructType

scala> val mySchema = StructType(id :: name :: Nil)
mySchema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true), StructField(name,StringType,true))

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val emptyDF = spark.createDataFrame(sc.emptyRDD[Row], mySchema)
emptyDF: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> emptyDF.printSchema
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)

Answered by Ravindra

import scala.reflect.runtime.{universe => ru}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

// Derive the schema from a case class via reflection, then build an empty DataFrame.
// hiveContext and sc are a pre-existing HiveContext and SparkContext.
def createEmptyDataFrame[T: ru.TypeTag] =
  hiveContext.createDataFrame(sc.emptyRDD[Row],
    ScalaReflection.schemaFor(ru.typeTag[T].tpe).dataType.asInstanceOf[StructType]
  )

case class RawData(id: String, firstname: String, lastname: String, age: Int)
val sourceDF = createEmptyDataFrame[RawData]
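
Keep in mind that ScalaReflection lives in the internal catalyst package, so it is not a stable public API across Spark versions. A quick check of the derived schema (a sketch; output illustrative):

sourceDF.printSchema()
// root
//  |-- id: string (nullable = true)
//  |-- firstname: string (nullable = true)
//  |-- lastname: string (nullable = true)
//  |-- age: integer (nullable = false)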

Answered by Nilesh Shinde

Here you can create a schema using StructType in Scala and pass an empty RDD, so that you are able to create an empty table. The following code does exactly that.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object EmptyTable extends App {
  val conf = new SparkConf
  val sc = new SparkContext(conf)

  // Create a SparkSession object
  val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()

  // Schema for three columns
  val schema = StructType(
    StructField("Emp_ID", LongType, true) ::
      StructField("Emp_Name", StringType, false) ::
      StructField("Emp_Salary", LongType, false) :: Nil)

  // Create an empty RDD
  val dataRDD = sc.emptyRDD[Row]

  // Pass the RDD and schema to create the DataFrame
  val newDFSchema = sparkSession.createDataFrame(dataRDD, schema)

  newDFSchema.createOrReplaceTempView("tempSchema")

  sparkSession.sql("create table Finaltable AS select * from tempSchema")
}

Answered by Molay

Java version to create an empty Dataset:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public Dataset<Row> emptyDataSet() {

    SparkSession spark = SparkSession.builder().appName("Simple Application")
                .config("spark.master", "local").getOrCreate();

    // An empty list of rows plus an explicit schema yields an empty DataFrame
    Dataset<Row> emptyDataSet = spark.createDataFrame(new ArrayList<Row>(), getSchema());

    return emptyDataSet;
}

public StructType getSchema() {

    String schemaString = "column1 column2 column3 column4 column5";

    List<StructField> fields = new ArrayList<>();

    // A leading index column of type long, followed by five string columns
    StructField indexField = DataTypes.createStructField("column0", DataTypes.LongType, true);
    fields.add(indexField);

    for (String fieldName : schemaString.split(" ")) {
        StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
        fields.add(field);
    }

    StructType schema = DataTypes.createStructType(fields);

    return schema;
}

Answered by braj

Here is a solution that creates an empty DataFrame in PySpark 2.0.0 or later.

from pyspark.sql.types import IntegerType, StringType, StructField, StructType

sc = spark.sparkContext
schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
spark.createDataFrame(sc.emptyRDD(), schema)

Answered by Abraham

As of Spark 2.4.3

val df = SparkSession.builder().getOrCreate().emptyDataFrame
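
Note that emptyDataFrame carries no columns at all, so it only answers the question when you do not actually need a schema. A quick check (a minimal sketch):

df.printSchema()
// root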