Spark dataframe reducebykey like operation

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/34249841/

Tags: sql, scala, apache-spark, apache-spark-sql

Asked by Carson Pun

I have a Spark dataframe with the following data (I use spark-csv to load the data in):

key,value
1,10
2,12
3,0
1,20

is there anything similar to the Spark RDD reduceByKey which can return a Spark DataFrame as follows (basically, summing up the values for the same key):

key,value
1,30
2,12
3,0

(I can convert the data to an RDD and do a reduceByKey operation, but is there a way to do this with the Spark DataFrame API?)

Answered by zero323

If you don't care about column names, you can use groupBy followed by sum:

df.groupBy($"key").sum("value")

otherwise it is better to replace sum with agg:

df.groupBy($"key").agg(sum($"value").alias("value"))

Finally you can use raw SQL:

df.registerTempTable("df")
sqlContext.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key")
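
Note that registerTempTable was deprecated in Spark 2.x; on newer versions the equivalent, assuming a SparkSession named spark, would be:

df.createOrReplaceTempView("df")
spark.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key")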

See also DataFrame / Dataset groupBy behaviour/optimization

Answered by Ans u man

I think user goks missed out on some part of the code. It's not tested code.

.map should have been used to convert the RDD into a pair RDD first, as in .map(lambda x: (x, 1)).reduceByKey(...).

reduceByKey is not available on a single-value or regular RDD, only on a pair RDD.

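A minimal Scala sketch of that idea, assuming key and value were loaded as integers (not stated in the question):

// reduceByKey is defined on RDD[(K, V)] via PairRDDFunctions,
// so each Row must first be mapped to a (key, value) pair
import sqlContext.implicits._  // for rdd.toDF

val summed = df.rdd
  .map(row => (row.getInt(0), row.getInt(1)))  // Row -> (key, value)
  .reduceByKey(_ + _)                          // sum the values per key
  .toDF("key", "value")
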
Thx

Answered by goks

How about this? I agree this still converts to an RDD and then back to a DataFrame.

# Map each Row to a (key, value) tuple first so reduceByKey sees a pair RDD
df.select('key', 'value').rdd.map(lambda row: (row[0], row[1])).reduceByKey(lambda a, b: a + b).toDF(['key', 'value'])
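
Going through the RDD API like this bypasses the DataFrame optimizer, which is one reason the groupBy/agg approaches in the first answer are generally preferable (see the groupBy behaviour/optimization link there).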