Java: Trying to use map on a Spark DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/42561084/

Trying to use map on a Spark DataFrame

java, apache-spark, java-8, apache-spark-sql, spark-dataframe

Asked by LetsPlayYahtzee

I recently started experimenting with both Spark and Java. I initially went through the famous WordCount example using RDDs, and everything went as expected. Now I am trying to implement my own example, but using DataFrames and not RDDs.

So I am reading a dataset from a file with

DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("delimiter", ";")
        .option("header", "true")
        .load(inputFilePath);

and then I try to select a specific column and apply a simple transformation to every row, like this:

df = df.select("start")
        .map(text -> text + "asd");

But compilation reports a problem with the second line that I don't fully understand (the start column is inferred to be of type string).

Multiple non-overriding abstract methods found in interface scala.Function1

Why is my lambda function treated as a Scala function and what does the error message actually mean?

Answered by jojo_Berlin

If you use the select function on a DataFrame you get a DataFrame back, so your function is applied to the Row datatype, not to the value inside the row. You should extract the value first, so do the following:

df.select("start").map(el->el.getString(0)+"asd")

df.select("start").map(el->el.getString(0)+"asd")

But you will get an RDD as the return value, not a DataFrame.
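
In Java specifically, DataFrame.map still expects a scala.Function1, so a plain lambda runs into the same compiler error as in the question. A minimal Java sketch, assuming the Spark 1.6-style API shown above, is to convert to a JavaRDD first:

import org.apache.spark.api.java.JavaRDD;

// DataFrame -> JavaRDD<Row>, whose map() accepts an ordinary Java lambda
JavaRDD<String> result = df.select("start")
        .toJavaRDD()
        .map(row -> row.getString(0) + "asd");   // extract the value, then transform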

Answered by Dee

I use concat to achieve this:

df.withColumn('start', concat(col('start'), lit('asd')))

As you're mapping the same text twice, I'm not sure whether you're also looking to replace the first part of the string; but if you are, I would do:

df.withColumn('start', concat(
                      when(col('start') == 'text', lit('new'))
                      .otherwise(col('start'))
                      , lit('asd')
                      ))

This solution scales well with big data, as it concatenates two columns instead of iterating over individual values.
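
In Java, a hedged sketch of the same column-based approach (assuming the static helpers concat, col and lit from org.apache.spark.sql.functions, available since Spark 1.5) would look like this:

import static org.apache.spark.sql.functions.*;

// Append "asd" to every value of the start column, staying inside the DataFrame API
DataFrame withSuffix = df.withColumn("start", concat(col("start"), lit("asd")));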
