Scala: copy schema from one dataframe to another dataframe

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/36795680/

Copy schema from one dataframe to another dataframe

scala, apache-spark, dataframe, apache-spark-sql

Asked by RudyVerboven

I'm trying to change the schema of an existing dataframe to the schema of another dataframe.

DataFrame 1:

Column A | Column B | Column C | Column D
   "a"   |    1     |   2.0    |   300
   "b"   |    2     |   3.0    |   400
   "c"   |    3     |   4.0    |   500

DataFrame 2:

Column K | Column B | Column F
   "c"   |    4     |   5.0
   "b"   |    5     |   6.0
   "f"   |    6     |   7.0

So I want to apply the schema of the first dataframe to the second: all columns that exist in both dataframes remain, columns in dataframe 2 that are not in dataframe 1 get dropped, and the columns from dataframe 1 that are missing in dataframe 2 become NULL.

Output

Column A | Column B | Column C | Column D
 "NULL"  |    4     |   "NULL" |  "NULL"
 "NULL"  |    5     |   "NULL" |  "NULL"
 "NULL"  |    6     |   "NULL" |  "NULL"

So I came up with a possible solution:

val schema = df1.schema
val newRows: RDD[Row] = df2.map(row => {
  val values = row.schema.fields.map(s => {
    if(schema.fields.contains(s)){
      row.getAs(s.name).toString
    }else{
      "NULL"
    }
  })
  Row.fromSeq(values)
})
sqlContext.createDataFrame(newRows, schema)

Now, as you can see, this will not work because the schema contains String, Int and Double types, while all my row values are Strings.

This is where I'm stuck: is there a way to automatically convert the types of my values to match the schema?

Answered by zero323

If the schema is flat, I would simply map over the pre-existing schema and select the required columns:

import org.apache.spark.sql.functions.{col, lit}

// For each field in df1's schema: keep the column when df2 has an identical field,
// otherwise create a typed null column with the same name
val exprs = df1.schema.fields.map { f =>
  if (df2.schema.fields.contains(f)) col(f.name)
  else lit(null).cast(f.dataType).alias(f.name)
}

df2.select(exprs: _*).printSchema

// root
//  |-- A: string (nullable = true)
//  |-- B: integer (nullable = false)
//  |-- C: double (nullable = true)
//  |-- D: integer (nullable = true)
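
Note that df2.schema.fields.contains(f) compares whole StructFields, so a column only counts as matching when its name, data type and nullability are all identical. If the same column can appear in df2 under a different type, a variant along the same lines that matches on the name alone and casts everything to df1's types could be sketched as follows:

// Match columns by name only and cast each one to the type taken from df1's schema
val exprsByName = df1.schema.fields.map { f =>
  if (df2.columns.contains(f.name)) col(f.name).cast(f.dataType).alias(f.name)
  else lit(null).cast(f.dataType).alias(f.name)
}

df2.select(exprsByName: _*)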

Answered by Antonio Cachuan

Working in 2018 (Spark 2.3), reading a .sas7bdat file

Scala

val sasFile = "file.sas7bdat"
val dfSas = spark.sqlContext.sasFile(sasFile)
val myManualSchema = dfSas.schema // getting the schema from another dataframe
// csvFile is the path of the CSV file that should be read with the reused schema
val df = spark.read.format("csv").option("header", "true").schema(myManualSchema).load(csvFile)

PS: spark.sqlContext.sasFile uses the saurfang library; you could skip that part of the code and get the schema from another dataframe instead.
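
For instance, skipping the SAS part entirely and taking the schema from an arbitrary existing dataframe (otherDf here is just a placeholder name) could look like this sketch:

// Reuse the schema of any existing dataframe when reading the CSV
val df = spark.read.format("csv").option("header", "true").schema(otherDf.schema).load(csvFile)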

Answered by charles gomes

You could simply do a left join on your dataframes with a query like this:

SELECT foo.`Column A`, foo.`Column B`, foo.`Column C`, foo.`Column D` FROM foo LEFT JOIN bar ON foo.`Column C` = bar.`Column C`
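
The same left join can be written with the DataFrame API; foo, bar and the column names below are just the placeholder names from the query above, so treat this as a sketch rather than a drop-in solution:

// A left join keeps every row of foo; columns that only exist in bar are null where no match is found
val joined = foo.join(bar, Seq("Column C"), "left")
val result = joined.select("Column A", "Column B", "Column C", "Column D")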

Please check out the answer by @zero323 in this post:

Spark specify multiple column conditions for dataframe join

Thanks, Charles.
