Java: Trying to use map on a Spark DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/42561084/

Trying to use map on a Spark DataFrame

java, apache-spark, java-8, apache-spark-sql, spark-dataframe

Asked by LetsPlayYahtzee

I recently started experimenting with both Spark and Java. I initially went through the famous WordCount example using RDDs, and everything went as expected. Now I am trying to implement my own example, but using DataFrames and not RDDs.

So I am reading a dataset from a file with

DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("delimiter", ";")
        .option("header", "true")
        .load(inputFilePath);

and then I try to select a specific column and apply a simple transformation to every row, like this:

df = df.select("start")
        .map(text -> text + "asd");

But compilation reports a problem with the second line that I don't fully understand (the start column is inferred to be of type string).

Multiple non-overriding abstract methods found in interface scala.Function1

Why is my lambda function treated as a Scala function and what does the error message actually mean?

Answered by jojo_Berlin

If you use the select function on a DataFrame you get a DataFrame back, so your function is applied to the Row datatype, not to the value inside the row. You should extract the value first, so do the following:

df.select("start").map(el->el.getString(0)+"asd")

df.select("start").map(el->el.getString(0)+"asd")

But you will get an RDD as the return value, not a DataFrame.
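
In Java specifically, DataFrame.map still expects a scala.Function1, so a plain lambda runs into the same compiler error as in the question. A minimal Java sketch, assuming the Spark 1.6-style API shown above, is to convert to a JavaRDD first:

import org.apache.spark.api.java.JavaRDD;

// DataFrame -> JavaRDD<Row>, whose map() accepts an ordinary Java lambda
JavaRDD<String> result = df.select("start")
        .toJavaRDD()
        .map(row -> row.getString(0) + "asd");   // extract the value, then transform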

Answered by Dee

I use concat to achieve this:

df.withColumn('start', concat(col('start'), lit('asd')))

As you're mapping the same text twice, I'm not sure whether you're also looking to replace the first part of the string; but if you are, I would do:

df.withColumn('start', concat(
                      when(col('start') == 'text', lit('new'))
                      .otherwise(col('start'))
                      , lit('asd')
                      ))

This solution scales well with big data, as it concatenates two columns instead of iterating over individual values.
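
In Java, a hedged sketch of the same column-based approach (assuming the static helpers concat, col and lit from org.apache.spark.sql.functions, available since Spark 1.5) would look like this:

import static org.apache.spark.sql.functions.*;

// Append "asd" to every value of the start column, staying inside the DataFrame API
DataFrame withSuffix = df.withColumn("start", concat(col("start"), lit("asd")));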
