
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/31859271/


SparkSQL and explode on DataFrame in Java

Tags: java, apache-spark, apache-spark-sql

Asked by JiriS

Is there an easy way to use explode on an array column of a SparkSQL DataFrame? It's relatively simple in Scala, but this function seems to be unavailable in Java (as mentioned in the javadoc).


An option is to use SQLContext.sql(...) with the explode function inside the query, but I'm looking for a somewhat better and especially cleaner way. The DataFrames are loaded from Parquet files.

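For illustration, the SQL-based workaround mentioned above could look roughly like the sketch below. The table name "people", the file path, and the column names are assumptions, not from the original question; depending on the Spark 1.x version, a HiveContext may be needed for the LATERAL VIEW syntax.

```java
// Sketch only: assumes an existing SQLContext `sqlContext` and a Parquet
// file with a fullName column and a positions array column.
DataFrame people = sqlContext.read().parquet("people.parquet");
people.registerTempTable("people");

// LATERAL VIEW explode produces one output row per array element.
DataFrame flattened = sqlContext.sql(
    "SELECT fullName, pos FROM people LATERAL VIEW explode(positions) p AS pos");
```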

Accepted answer by JiriS

It seems it is possible to use a combination of org.apache.spark.sql.functions.explode(Column col) and DataFrame.withColumn(String colName, Column col) to replace the column with the exploded version of it.

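A minimal sketch of that combination, assuming a DataFrame df with an array column; the column name "tags" is hypothetical:

```java
import static org.apache.spark.sql.functions.explode;

// Replace the array column "tags" with its exploded version:
// each input row yields one output row per array element.
DataFrame exploded = df.withColumn("tags", explode(df.col("tags")));
```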

Answer by marilena.oita

I solved it in this manner: say you have an array column named "positions" that holds the job descriptions for each person identified by "fullName".


Then you go from the initial schema:


root
 |-- fullName: string (nullable = true)
 |-- positions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- companyName: string (nullable = true)
 |    |    |-- title: string (nullable = true)
...

to the schema:

root
 |-- personName: string (nullable = true)
 |-- companyName: string (nullable = true)
 |-- positionTitle: string (nullable = true)

by doing:


    DataFrame personPositions = persons.select(
        persons.col("fullName").as("personName"),
        org.apache.spark.sql.functions.explode(persons.col("positions")).as("pos"));

    DataFrame test = personPositions.select(
        personPositions.col("personName"),
        personPositions.col("pos").getField("companyName").as("companyName"),
        personPositions.col("pos").getField("title").as("positionTitle"));