Original question: http://stackoverflow.com/questions/34249841/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Spark DataFrame reduceByKey-like operation
Asked by Carson Pun
I have a Spark dataframe with the following data (I use spark-csv to load the data in):
key,value
1,10
2,12
3,0
1,20
Is there anything similar to the Spark RDD reduceByKey, which can return a Spark DataFrame like this (basically, summing up the values for the same key):
key,value
1,30
2,12
3,0
(I can transform the data to an RDD and do a reduceByKey operation, but is there a way to do this with the Spark DataFrame API?)
Answered by zero323
If you don't care about column names, you can use groupBy followed by sum:
df.groupBy($"key").sum("value")
Otherwise it is better to replace sum with agg, which lets you alias the result column (plain sum would name it sum(value)):
df.groupBy($"key").agg(sum($"value").alias("value"))
Finally you can use raw SQL:
df.registerTempTable("df")
sqlContext.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key")
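For reference, here is a minimal self-contained PySpark sketch of the same three approaches (the sample data and app name are assumptions on my part; note that in Spark 2.x, SparkSession replaces sqlContext and createOrReplaceTempView replaces the deprecated registerTempTable):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reducebykey-demo").getOrCreate()
df = spark.createDataFrame([(1, 10), (2, 12), (3, 0), (1, 20)], ["key", "value"])

# groupBy + sum: the result column is named sum(value)
df.groupBy("key").sum("value").show()

# groupBy + agg: the alias keeps the result column named value
df.groupBy("key").agg(F.sum("value").alias("value")).show()

# raw SQL, via the Spark 2.x replacement for registerTempTable
df.createOrReplaceTempView("df")
spark.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key").show()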
Answered by Ansuman
I think user goks missed out on some part of the code. It's not tested code.
.map should have been used first to convert the RDD to a pair RDD, using .map(lambda x: (x, 1)).reduceByKey. ....
reduceByKey is not available on a single-value RDD or a regular RDD, only on a pair RDD.
Thx
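A minimal sketch of the fix being described, assuming the (key, value) schema from the question (in Spark 2.x the PySpark DataFrame no longer has .map, so the conversion goes through df.rdd):

# Build a pair RDD of (key, value) tuples before calling reduceByKey,
# then convert the summed pairs back to a DataFrame
df.rdd.map(lambda row: (row['key'], row['value'])) \
    .reduceByKey(lambda a, b: a + b) \
    .toDF(['key', 'value'])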
Answered by goks
How about this? I agree this still converts to an RDD and then to a DataFrame.
df.select('key','value').map(lambda x: x).reduceByKey(lambda a,b: a+b).toDF(['key','value'])
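Note that in Spark 2.x the PySpark DataFrame no longer exposes .map directly, so this snippet would need to go through df.select('key', 'value').rdd first; mapping each row to an explicit (key, value) tuple, as in the sketch after the previous answer, is the safer way to hand reduceByKey a proper pair RDD.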