Spark dataframe reducebykey like operation

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/34249841/

Tags: sql, scala, apache-spark, apache-spark-sql

Asked by Carson Pun

I have a Spark dataframe with the following data (I use spark-csv to load the data in):

key,value
1,10
2,12
3,0
1,20

is there anything similar to the Spark RDD reduceByKey which can return a Spark DataFrame as follows (basically, summing up the values for the same key):

key,value
1,30
2,12
3,0

(I can convert the data to an RDD and do a reduceByKey operation, but is there a way to do this with the Spark DataFrame API?)

Answered by zero323

If you don't care about column names, you can use groupBy followed by sum:

df.groupBy($"key").sum("value")

otherwise it is better to replace sum with agg:

df.groupBy($"key").agg(sum($"value").alias("value"))

Finally you can use raw SQL:

df.registerTempTable("df")
sqlContext.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key")
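
Note that registerTempTable was deprecated in Spark 2.x; on newer versions the equivalent, assuming a SparkSession named spark, would be:

df.createOrReplaceTempView("df")
spark.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key")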

See also DataFrame / Dataset groupBy behaviour/optimization

Answered by Ans u man

I think user goks missed out on some part of the code. It's not tested code.

.map should have been used to convert the RDD into a pair RDD first, as in .map(lambda x: (x, 1)).reduceByKey(...).

reduceByKey is not available on a single-value or regular RDD, only on a pair RDD.

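A minimal Scala sketch of that idea, assuming key and value were loaded as integers (not stated in the question):

// reduceByKey is defined on RDD[(K, V)] via PairRDDFunctions,
// so each Row must first be mapped to a (key, value) pair
import sqlContext.implicits._  // for rdd.toDF

val summed = df.rdd
  .map(row => (row.getInt(0), row.getInt(1)))  // Row -> (key, value)
  .reduceByKey(_ + _)                          // sum the values per key
  .toDF("key", "value")
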
Thx

Answered by goks

How about this? I agree this still converts to an RDD and then back to a DataFrame.

# Map each Row to a (key, value) tuple first so reduceByKey sees a pair RDD
df.select('key', 'value').rdd.map(lambda row: (row[0], row[1])).reduceByKey(lambda a, b: a + b).toDF(['key', 'value'])
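
Going through the RDD API like this bypasses the DataFrame optimizer, which is one reason the groupBy/agg approaches in the first answer are generally preferable (see the groupBy behaviour/optimization link there).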