Scala: How to create an empty DataFrame with a specified schema?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31477598/

How to create an empty DataFrame with a specified schema?

Tags: scala, apache-spark, dataframe, apache-spark-sql

Asked by user1735076

I want to create a DataFrame with a specified schema in Scala. I have tried using a JSON read (I mean, reading an empty file) but I don't think that's the best practice.

Answered by zero323

Lets assume you want a data frame with the following schema:

root
 |-- k: string (nullable = true)
 |-- v: integer (nullable = false)

You simply define a schema for the data frame and use an empty RDD[Row]:

import org.apache.spark.sql.types.{
    StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row

val schema = StructType(
    StructField("k", StringType, true) ::
    StructField("v", IntegerType, false) :: Nil)

// Spark < 2.0
// sqlContext.createDataFrame(sc.emptyRDD[Row], schema) 
spark.createDataFrame(sc.emptyRDD[Row], schema)
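
To confirm the result really is empty but carries the intended schema, a quick sanity check (a minimal sketch; the df name is introduced here for illustration):

val df = spark.createDataFrame(sc.emptyRDD[Row], schema)

df.printSchema()
// root
//  |-- k: string (nullable = true)
//  |-- v: integer (nullable = false)

df.count()  // 0 -- no rows, only a schema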

The PySpark equivalent is almost identical:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("k", StringType(), True), StructField("v", IntegerType(), False)
])

# or df = sc.parallelize([]).toDF(schema)

# Spark < 2.0 
# sqlContext.createDataFrame([], schema)
df = spark.createDataFrame([], schema)

Using implicit encoders (Scala only) with Product types like Tuple:

import spark.implicits._

Seq.empty[(String, Int)].toDF("k", "v")

or a case class:

case class KV(k: String, v: Int)

Seq.empty[KV].toDF

or

spark.emptyDataset[KV].toDF
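
Note that with Product types the nullability is inferred from the field types, so a primitive like Int comes out non-nullable. A quick check (a sketch; output shown for illustration):

Seq.empty[(String, Int)].toDF("k", "v").printSchema()
// root
//  |-- k: string (nullable = true)
//  |-- v: integer (nullable = false)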

Answered by Jacek Laskowski

As of Spark 2.0.0, you can do the following.

Case Class

Let's define a Person case class:

scala> case class Person(id: Int, name: String)
defined class Person

Import the SparkSession's implicit Encoders:

scala> import spark.implicits._
import spark.implicits._

And use SparkSession to create an empty Dataset[Person]:

scala> spark.emptyDataset[Person]
res0: org.apache.spark.sql.Dataset[Person] = [id: int, name: string]
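
If you need a DataFrame rather than a Dataset, toDF converts it and keeps the derived schema (a minimal sketch continuing the same REPL session):

scala> spark.emptyDataset[Person].toDF
res1: org.apache.spark.sql.DataFrame = [id: int, name: string]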

Schema DSL

You could also use a Schema "DSL" (see Support functions for DataFrames in org.apache.spark.sql.ColumnName).

scala> val id = $"id".int
id: org.apache.spark.sql.types.StructField = StructField(id,IntegerType,true)

scala> val name = $"name".string
name: org.apache.spark.sql.types.StructField = StructField(name,StringType,true)

scala> import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructType

scala> val mySchema = StructType(id :: name :: Nil)
mySchema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true), StructField(name,StringType,true))

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val emptyDF = spark.createDataFrame(sc.emptyRDD[Row], mySchema)
emptyDF: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> emptyDF.printSchema
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)

Answered by Ravindra

import scala.reflect.runtime.{universe => ru}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

// Derive the schema from a case class via reflection, then build an empty DataFrame.
// hiveContext and sc are a pre-existing HiveContext and SparkContext.
def createEmptyDataFrame[T: ru.TypeTag] =
  hiveContext.createDataFrame(sc.emptyRDD[Row],
    ScalaReflection.schemaFor(ru.typeTag[T].tpe).dataType.asInstanceOf[StructType]
  )

case class RawData(id: String, firstname: String, lastname: String, age: Int)
val sourceDF = createEmptyDataFrame[RawData]
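
Keep in mind that ScalaReflection lives in the internal catalyst package, so it is not a stable public API across Spark versions. A quick check of the derived schema (a sketch; output illustrative):

sourceDF.printSchema()
// root
//  |-- id: string (nullable = true)
//  |-- firstname: string (nullable = true)
//  |-- lastname: string (nullable = true)
//  |-- age: integer (nullable = false)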

Answered by Nilesh Shinde

Here you can create a schema using StructType in Scala and pass an empty RDD, so that you are able to create an empty table. The following code does exactly that.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object EmptyTable extends App {
  val conf = new SparkConf
  val sc = new SparkContext(conf)

  // Create a SparkSession object
  val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()

  // Schema for three columns
  val schema = StructType(
    StructField("Emp_ID", LongType, true) ::
      StructField("Emp_Name", StringType, false) ::
      StructField("Emp_Salary", LongType, false) :: Nil)

  // Create an empty RDD
  val dataRDD = sc.emptyRDD[Row]

  // Pass the RDD and schema to create the DataFrame
  val newDFSchema = sparkSession.createDataFrame(dataRDD, schema)

  newDFSchema.createOrReplaceTempView("tempSchema")

  sparkSession.sql("create table Finaltable AS select * from tempSchema")
}

Answered by Molay

Java version to create an empty Dataset:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public Dataset<Row> emptyDataSet() {

    SparkSession spark = SparkSession.builder().appName("Simple Application")
                .config("spark.master", "local").getOrCreate();

    // An empty list of rows plus an explicit schema yields an empty DataFrame
    Dataset<Row> emptyDataSet = spark.createDataFrame(new ArrayList<Row>(), getSchema());

    return emptyDataSet;
}

public StructType getSchema() {

    String schemaString = "column1 column2 column3 column4 column5";

    List<StructField> fields = new ArrayList<>();

    // A leading index column of type long, followed by five string columns
    StructField indexField = DataTypes.createStructField("column0", DataTypes.LongType, true);
    fields.add(indexField);

    for (String fieldName : schemaString.split(" ")) {
        StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
        fields.add(field);
    }

    StructType schema = DataTypes.createStructType(fields);

    return schema;
}

Answered by braj

Here is a solution that creates an empty DataFrame in PySpark 2.0.0 or later.

from pyspark.sql.types import IntegerType, StringType, StructField, StructType

sc = spark.sparkContext
schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
spark.createDataFrame(sc.emptyRDD(), schema)

Answered by Abraham

As of Spark 2.4.3

val df = SparkSession.builder().getOrCreate().emptyDataFrame
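
Note that emptyDataFrame carries no columns at all, so it only answers the question when you do not actually need a schema. A quick check (a minimal sketch):

df.printSchema()
// root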