Scala: copy schema from one dataframe to another dataframe

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/36795680/

Copy schema from one dataframe to another dataframe

scala, apache-spark, dataframe, apache-spark-sql

Asked by RudyVerboven

I'm trying to change the schema of an existing dataframe to the schema of another dataframe.

DataFrame 1:

Column A | Column B | Column C | Column D
   "a"   |    1     |   2.0    |   300
   "b"   |    2     |   3.0    |   400
   "c"   |    3     |   4.0    |   500

DataFrame 2:

Column K | Column B | Column F
   "c"   |    4     |   5.0
   "b"   |    5     |   6.0
   "f"   |    6     |   7.0

So I want to apply the schema of the first dataframe to the second: all columns that exist in both dataframes remain, columns in dataframe 2 that are not in dataframe 1 get dropped, and the columns from dataframe 1 that are missing in dataframe 2 become NULL.

Output

Column A | Column B | Column C | Column D
 "NULL"  |    4     |   "NULL" |  "NULL"
 "NULL"  |    5     |   "NULL" |  "NULL"
 "NULL"  |    6     |   "NULL" |  "NULL"

So I came up with a possible solution:

val schema = df1.schema
val newRows: RDD[Row] = df2.map(row => {
  val values = row.schema.fields.map(s => {
    if(schema.fields.contains(s)){
      row.getAs(s.name).toString
    }else{
      "NULL"
    }
  })
  Row.fromSeq(values)
})
sqlContext.createDataFrame(newRows, schema)

Now, as you can see, this will not work because the schema contains String, Int and Double types, while all my row values are Strings.

This is where I'm stuck: is there a way to automatically convert the types of my values to match the schema?

Answered by zero323

If the schema is flat, I would simply map over the pre-existing schema and select the required columns:

import org.apache.spark.sql.functions.{col, lit}

// For each field in df1's schema: keep the column when df2 has an identical field,
// otherwise create a typed null column with the same name
val exprs = df1.schema.fields.map { f =>
  if (df2.schema.fields.contains(f)) col(f.name)
  else lit(null).cast(f.dataType).alias(f.name)
}

df2.select(exprs: _*).printSchema

// root
//  |-- A: string (nullable = true)
//  |-- B: integer (nullable = false)
//  |-- C: double (nullable = true)
//  |-- D: integer (nullable = true)
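
Note that df2.schema.fields.contains(f) compares whole StructFields, so a column only counts as matching when its name, data type and nullability are all identical. If the same column can appear in df2 under a different type, a variant along the same lines that matches on the name alone and casts everything to df1's types could be sketched as follows:

// Match columns by name only and cast each one to the type taken from df1's schema
val exprsByName = df1.schema.fields.map { f =>
  if (df2.columns.contains(f.name)) col(f.name).cast(f.dataType).alias(f.name)
  else lit(null).cast(f.dataType).alias(f.name)
}

df2.select(exprsByName: _*)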

Answered by Antonio Cachuan

Working in 2018 (Spark 2.3), reading a .sas7bdat file

Scala

val sasFile = "file.sas7bdat"
val dfSas = spark.sqlContext.sasFile(sasFile)
val myManualSchema = dfSas.schema // getting the schema from another dataframe
// csvFile is the path of the CSV file that should be read with the reused schema
val df = spark.read.format("csv").option("header", "true").schema(myManualSchema).load(csvFile)

PS: spark.sqlContext.sasFile uses the saurfang library; you could skip that part of the code and get the schema from another dataframe instead.
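
For instance, skipping the SAS part entirely and taking the schema from an arbitrary existing dataframe (otherDf here is just a placeholder name) could look like this sketch:

// Reuse the schema of any existing dataframe when reading the CSV
val df = spark.read.format("csv").option("header", "true").schema(otherDf.schema).load(csvFile)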

Answered by charles gomes

You could simply do a left join on your dataframes with a query like this:

SELECT foo.`Column A`, foo.`Column B`, foo.`Column C`, foo.`Column D` FROM foo LEFT JOIN bar ON foo.`Column C` = bar.`Column C`
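
The same left join can be written with the DataFrame API; foo, bar and the column names below are just the placeholder names from the query above, so treat this as a sketch rather than a drop-in solution:

// A left join keeps every row of foo; columns that only exist in bar are null where no match is found
val joined = foo.join(bar, Seq("Column C"), "left")
val result = joined.select("Column A", "Column B", "Column C", "Column D")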

Please check out the answer by @zero323 in this post:

Spark specify multiple column conditions for dataframe join

Thanks, Charles.
