Trying to use map on a Spark DataFrame

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/42561084/
Asked by LetsPlayYahtzee
I recently started experimenting with both Spark and Java. I initially went through the famous WordCount example using an RDD, and everything went as expected. Now I am trying to implement my own example, but using DataFrames and not RDDs.
So I am reading a dataset from a file with:
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("delimiter", ";")
.option("header", "true")
.load(inputFilePath);
and then I try to select a specific column and apply a simple transformation to every row, like this:
df = df.select("start")
.map(text -> text + "asd");
But compilation fails on the second line, which I don't fully understand (the start column is inferred to be of type string):
Multiple non-overriding abstract methods found in interface scala.Function1
Why is my lambda function treated as a Scala function, and what does the error message actually mean?
Answered by jojo_Berlin
If you use the select function on a DataFrame, you get a DataFrame back. Your function is then applied to the Row datatype, not to the value inside the row. So you should get the value out first, like this:
df.select("start").map(el->el.getString(0)+"asd")
df.select("start").map(el->el.getString(0)+"asd")
But you will get an RDD as the return value, not a DataFrame.
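In Java specifically, the single-argument map on a Spark 1.x DataFrame expects a scala.Function1, which a Java lambda cannot implement directly; that is what the compiler error is complaining about. Below is a sketch of two Java-friendly alternatives, assuming df is the DataFrame from the question (the first uses the Spark 1.x javaRDD() bridge, the second the Spark 2.x typed Dataset API):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Spark 1.x: drop down to the Java RDD API, whose map accepts a Java lambda.
JavaRDD<String> result = df.select("start")
    .javaRDD()
    .map(row -> row.getString(0) + "asd");

// Spark 2.x: Dataset.map is overloaded, so cast the lambda to MapFunction
// and supply an Encoder for the result type.
Dataset<String> typed = df.select("start")
    .map((MapFunction<Row, String>) row -> row.getString(0) + "asd",
        Encoders.STRING());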
Answered by Dee
I use concat to achieve this (note that this answer uses PySpark syntax):
df.withColumn('start', concat(col('start'), lit('asd')))
Since you're mapping the same text twice, I'm not sure whether you're also looking to replace the first part of the string. But if you are, I would do:
df.withColumn('start', concat(
    when(col('start') == 'text', lit('new'))
        .otherwise(col('start')),
    lit('asd')
))
This solution scales up when using big data, as it concatenates two columns instead of iterating over individual values.
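Since the question is about Java, a rough Java equivalent of the same column-level approach is sketched below (assuming the usual static imports from org.apache.spark.sql.functions and that df is the DataFrame from the question):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.when;

// Append "asd" to every value of the "start" column, staying in the DataFrame API.
df = df.withColumn("start", concat(col("start"), lit("asd")));

// Conditional variant mirroring the PySpark snippet above:
// replace "text" with "new" before appending "asd".
df = df.withColumn("start", concat(
    when(col("start").equalTo("text"), lit("new"))
        .otherwise(col("start")),
    lit("asd")));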