Scala - Extract column values of a DataFrame as a List in Apache Spark

Note: the content below is taken from a popular StackOverflow question and its answers and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/32000646/

Extract column values of Dataframe as List in Apache Spark

scala, apache-spark, apache-spark-sql

Asked by SH Y.

I want to convert a string column of a data frame to a list. What I can find from the DataFrame API is RDD, so I tried converting it back to an RDD first, and then applying the toArray function to the RDD. In this case, the length and SQL work just fine. However, the result I got from the RDD has square brackets around every element, like this: [A00001]. I was wondering if there's an appropriate way to convert a column to a list, or a way to remove the square brackets.

Any suggestions would be appreciated. Thank you!

Answered by Niemand

This should return the collection containing the values of the column as a single list:

dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()

Without the mapping, you just get a Row object, which contains every column from the database.

Keep in mind that this will probably get you a list of type Any. If you want to specify the result type, you can use .asInstanceOf[YOUR_TYPE] in the mapping: r => r(0).asInstanceOf[YOUR_TYPE]

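For example, a minimal sketch of the typed variant (the DataFrame and column names are placeholders, and the cast assumes the column really holds strings):

val values: List[String] =
  dataFrame.select("YOUR_COLUMN_NAME")
    .rdd
    .map(r => r(0).asInstanceOf[String]) // cast each value instead of keeping Any
    .collect()
    .toList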

P.S. Due to automatic conversion, you can skip the .rdd part.

Answered by mrsrinivas

With Spark 2.x and Scala 2.11

I can think of 3 possible ways to convert the values of a specific column to a List.

Common code snippets for all the approaches

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate    
import spark.implicits._ // for .toDF() method

val df = Seq(
    ("first", 2.0),
    ("test", 1.5), 
    ("choose", 8.0)
  ).toDF("id", "val")

Approach 1

df.select("id").collect().map(_(0)).toList
// res9: List[Any] = List(one, two, three)

What happens now? We are collecting the data to the driver with collect() and picking element zero from each record.

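If you want a List[String] rather than a List[Any] out of this approach, a small sketch using the sample df defined above (the getter assumes the "id" column really holds strings):

df.select("id").collect().map(_.getString(0)).toList
// List[String] = List(first, test, choose)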

This may not be the best way of doing it. Let's improve it with the next approach.


Approach 2

df.select("id").rdd.map(r => r(0)).collect.toList 
//res10: List[Any] = List(one, two, three)

How is this better? The map transformation is now distributed among the workers rather than being done on the single driver.

I know rdd.map(r => r(0)) does not seem elegant to you. So, let's address it in the next approach.


Approach 3

df.select("id").map(r => r.getString(0)).collect.toList 
//res11: List[String] = List(one, two, three)

Here we are not converting the DataFrame to an RDD. Because of encoder issues in DataFrame, map won't accept r => r(0) (or _(0)) as in the previous approach, so we end up using r => r.getString(0); this may be addressed in future versions of Spark.

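As a small variant (a sketch, not part of the original answer), the field can also be fetched by name rather than by position, which reads a little more clearly when several columns are selected:

df.select("id").map(r => r.getAs[String]("id")).collect.toList
// List[String] = List(first, test, choose)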

Conclusion

All the options give the same output, but approaches 2 and 3 are effective; the third one, finally, is both effective and elegant (I'd think).


Databricks notebook link, which will be available for 6 months from 2017/05/20.

Answered by abby sobh

I know the answer given and asked for assumes Scala, so I am just providing a little snippet of Python code in case a PySpark user is curious. The syntax is similar to the given answer, but to properly pop the list out I actually have to reference the column name a second time in the mapping function, and I do not need the select statement.

i.e., a DataFrame containing a column named "Raw".

To get each row value in "Raw" combined as a list where each entry is a row value from "Raw" I simply use:

MyDataFrame.rdd.map(lambda x: x.Raw).collect()

Answered by kanielc

In Scala and Spark 2+, try this (assuming your column name is "s"): df.select('s).as[String].collect

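A slightly fuller sketch of the same idea (assuming a SparkSession named spark and that column "s" really holds strings; the names are illustrative):

import spark.implicits._ // enables the 's column syntax and supplies the String encoder
val values: List[String] = df.select('s).as[String].collect().toList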

Answered by Shaina Raza

sqlContext.sql(" select filename from tempTable").rdd.map(r => r(0)).collect.toList.foreach(out_streamfn.println) //remove brackets

It works perfectly.

Answered by amarnath pimple

from pyspark.sql.functions import col

df.select(col("column_name")).collect()

Here collect is the function which in turn converts the column to a list. Be wary of using a list on a huge data set; it will decrease performance. It is good to check the data first.

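In Scala, for instance, one way to keep the driver safe is to cap how many rows are materialized before building the list (a sketch; the limit value is arbitrary and the column is assumed to hold strings):

val sample: List[String] =
  df.select("column_name").limit(1000).collect().map(_.getString(0)).toList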

Answered by vahbuna

This is the Java answer (note that collectAsList returns a java.util.List of Row objects):

df.select("id").collectAsList();

Answered by user12910640

// Requires: import org.apache.spark.sql.Row; and import org.apache.spark.api.java.function.Function;
List<String> whatever_list = df.toJavaRDD().map(new Function<Row, String>() {
    public String call(Row row) {
        // read the field by name and render it as a String
        return row.getAs("column_name").toString();
    }
}).collect();

logger.info(String.format("list is %s",whatever_list)); //verification

Since no one had given any solution in Java (a real programming language), you can thank me later.

Answered by Athanasios Tsiaras

An updated solution that gets you a list:

dataFrame.select("YOUR_COLUMN_NAME").map(r => r.getString(0)).collect.toList