Scala: copy the schema from one dataframe to another dataframe
Original URL: http://stackoverflow.com/questions/36795680/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverflow
Copy schema from one dataframe to another dataframe
Asked by RudyVerboven
I'm trying to change the schema of an existing dataframe to the schema of another dataframe.
DataFrame 1:
Column A | Column B | Column C | Column D
"a"      | 1        | 2.0      | 300
"b"      | 2        | 3.0      | 400
"c"      | 3        | 4.0      | 500
DataFrame 2:
Column K | Column B | Column F
"c"      | 4        | 5.0
"b"      | 5        | 6.0
"f"      | 6        | 7.0
So I want to apply the schema of the first dataframe to the second: all columns that exist in both remain, columns in dataframe 2 that are not in dataframe 1 get dropped, and columns from dataframe 1 that are missing in dataframe 2 are filled with "NULL".
Output
Column A | Column B | Column C | Column D
"NULL"   | 4        | "NULL"   | "NULL"
"NULL"   | 5        | "NULL"   | "NULL"
"NULL"   | 6        | "NULL"   | "NULL"
So I came up with a possible solution:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val schema = df1.schema

val newRows: RDD[Row] = df2.map(row => {
  val values = row.schema.fields.map(s => {
    // Keep the value if the field also exists in df1's schema, otherwise substitute "NULL"
    if (schema.fields.contains(s)) {
      row.getAs(s.name).toString
    } else {
      "NULL"
    }
  })
  Row.fromSeq(values)
})

sqlContext.createDataFrame(newRows, schema)
Now, as you can see, this will not work because the schema contains String, Int and Double types, while all my rows have String values.
This is where I'm stuck: is there a way to automatically convert my values to the types in the schema?
Answered by zero323
If the schema is flat, I would simply map over the pre-existing schema and select the required columns:
import org.apache.spark.sql.functions.{col, lit}

val exprs = df1.schema.fields.map { f =>
  if (df2.schema.fields.contains(f)) col(f.name)
  else lit(null).cast(f.dataType).alias(f.name)
}

df2.select(exprs: _*).printSchema
// root
// |-- A: string (nullable = true)
// |-- B: integer (nullable = false)
// |-- C: double (nullable = true)
// |-- D: integer (nullable = true)
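For completeness, here is a minimal end-to-end sketch of this approach, assuming an existing SparkSession named spark; it matches fields by name rather than by the full StructField, since the type or nullability of a shared column may differ between the two frames:

import spark.implicits._
import org.apache.spark.sql.functions.{col, lit}

val df1 = Seq(("a", 1, 2.0, 300), ("b", 2, 3.0, 400), ("c", 3, 4.0, 500))
  .toDF("A", "B", "C", "D")
val df2 = Seq(("c", 4, 5.0), ("b", 5, 6.0), ("f", 6, 7.0))
  .toDF("K", "B", "F")

val df2Columns = df2.schema.fieldNames.toSet
val exprs = df1.schema.fields.map { f =>
  // Keep matching columns (cast to the target type), fill the rest with typed nulls
  if (df2Columns.contains(f.name)) col(f.name).cast(f.dataType).alias(f.name)
  else lit(null).cast(f.dataType).alias(f.name)
}

df2.select(exprs: _*).show()
// +----+---+----+----+
// |   A|  B|   C|   D|
// +----+---+----+----+
// |null|  4|null|null|
// |null|  5|null|null|
// |null|  6|null|null|
// +----+---+----+----+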
Answered by Antonio Cachuan
Working in 2018 (Spark 2.3), reading a .sas7bdat file:
Scala
val sasFile = "file.sas7bdat"
val dfSas = spark.sqlContext.sasFile(sasFile)
val myManualSchema = dfSas.schema //getting the schema from another dataframe
val df = spark.read.format("csv").option("header","true").schema(myManualSchema).load(csvFile)
PS: spark.sqlContext.sasFile uses the saurfang library; you could skip that part of the code and get the schema from any other dataframe.
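If you don't have SAS data, the same schema-reuse pattern works with plain Spark and no extra library; a minimal sketch, assuming a SparkSession named spark and placeholder file paths:

// Take the schema from any existing dataframe, e.g. one loaded from Parquet
val dfReference = spark.read.parquet("reference.parquet") // placeholder path
val mySchema = dfReference.schema

// Reuse it when reading a schemaless format such as CSV
val df = spark.read
  .format("csv")
  .option("header", "true")
  .schema(mySchema)
  .load("data.csv") // placeholder path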
Answered by charles gomes
You could simply do a left join on your dataframes with a query like this:
SELECT Column A, Column B, Column C, Column D FROM foo LEFT JOIN BAR ON Column C = Column C
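A rough Spark sketch of that idea; the temp view names and the join key (Column B, the only column shared by both example frames) are assumptions, and column names containing spaces must be backtick-quoted:

df1.createOrReplaceTempView("foo")
df2.createOrReplaceTempView("bar")

// Left join keeps every row of bar; foo's columns come back as NULL
// wherever there is no match on the assumed join key.
val joined = spark.sql("""
  SELECT foo.`Column A`, bar.`Column B`, foo.`Column C`, foo.`Column D`
  FROM bar LEFT JOIN foo ON foo.`Column B` = bar.`Column B`
""")
joined.show()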
Please check out the answer by @zero323 in this post:
Spark specify multiple column conditions for dataframe join
Thanks, Charles.

