Spark dataframe reducebykey like operation
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/34249841/
Asked by Carson Pun
I have a Spark dataframe with the following data (I use spark-csv to load the data in):
key,value
1,10
2,12
3,0
1,20
Is there anything similar to the Spark RDD reduceByKey which can return the result as a Spark DataFrame (basically, summing up the values for the same key):
key,value
1,30
2,12
3,0
(I can transform the data to an RDD and do a reduceByKey operation, but is there a more idiomatic Spark DataFrame API way to do this?)
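For context, here is a minimal sketch of how such a CSV might be loaded with spark-csv under Spark 1.x; the file name data.csv and the SQLContext setup are assumptions, not part of the original question:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="reducebykey-like")
sqlContext = SQLContext(sc)

# spark-csv (com.databricks.spark.csv) reads the header row and infers column types.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("data.csv"))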
Answered by zero323
If you don't care about column names you can use groupBy followed by sum:
df.groupBy($"key").sum("value")
otherwise it is better to replace sum with agg:
df.groupBy($"key").agg(sum($"value").alias("value"))
Finally you can use raw SQL:
df.registerTempTable("df")
sqlContext.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key")
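For PySpark, a rough equivalent of the agg variant might look like this; it is a sketch assuming df has the key and value columns from the question:

from pyspark.sql import functions as F

# Group by key and sum the values, keeping the original column name.
df.groupBy("key").agg(F.sum("value").alias("value")).show()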
Answered by Ans u man
I think user goks missed out on some part of the code. It is not tested code.
.map should have been used to convert the RDD to a pair RDD, e.g. .map(lambda x: (x, 1)), before calling .reduceByKey (see the sketch below).
reduceByKey is not available on a single-value or regular RDD, only on a pair RDD.
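As a sketch only (not the original poster's tested code), the corrected round trip through a pair RDD might look like this, assuming df has the key and value columns from the question:

# Turn each Row into a (key, value) pair so reduceByKey can combine values per key,
# then convert the reduced pairs back into a DataFrame.
summed = (df.rdd
          .map(lambda row: (row["key"], row["value"]))
          .reduceByKey(lambda a, b: a + b)
          .toDF(["key", "value"]))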
Thx
Answered by goks
How about this? I agree this still converts to an RDD and then to a DataFrame.
df.select('key','value').map(lambda x: x).reduceByKey(lambda a,b: a+b).toDF(['key','value'])

