Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/33706408/

Time: 2020-08-19 13:49:32  Source: igfitidea

How to sort by value efficiently in PySpark?

Tags: python, sorting, lambda, apache-spark

Asked by Hunle

I want to sort my (K, V) tuples by V, i.e. by the value. I know that takeOrdered is good for this if you know how many elements you need:

b = sc.parallelize([('t',3),('b',4),('c',1)])

Using takeOrdered:

b.takeOrdered(3,lambda atuple: atuple[1])
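
For intuition about why takeOrdered can avoid a full sort when only a few elements are needed, here is a plain-Python sketch that runs without Spark. It assumes the approach is roughly "keep the n smallest elements of each partition, then merge the partial results"; the function name take_ordered_local and the way the data is split into two partitions are hypothetical and only serve the illustration.

```python
import heapq

def take_ordered_local(partitions, num, key=None):
    """Local stand-in: n smallest items per partition, then merge and trim."""
    # Each "partition" keeps only its own `num` smallest elements.
    partials = [heapq.nsmallest(num, part, key=key) for part in partitions]
    # Combine the small per-partition results and trim to `num` again.
    merged = [item for partial in partials for item in partial]
    return heapq.nsmallest(num, merged, key=key)

# Hypothetical split of the example data into two partitions.
partitions = [[('t', 3), ('b', 4)], [('c', 1)]]
top2 = take_ordered_local(partitions, 2, key=lambda kv: kv[1])
print(top2)
```

Because each partition hands back at most num elements, the merge step handles far less data than a full sort would when num is small.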

Using a lambda with sortByKey:

b.map(lambda aTuple: (aTuple[1], aTuple[0])).sortByKey().map(
    lambda aTuple: (aTuple[1], aTuple[0])).collect()  # second map swaps back to (K, V)

I've checked out the question here, which suggests the latter. I find it hard to believe that takeOrdered is so succinct and yet requires the same number of operations as the lambda solution.

Does anyone know of a simpler, more concise transformation in Spark to sort by value?

Accepted answer by Rohan Aletty

I think sortBy() is more concise:

b = sc.parallelize([('t', 3),('b', 4),('c', 1)])
bSorted = b.sortBy(lambda a: a[1])
bSorted.collect()
...
[('c', 1),('t', 3),('b', 4)]

It's actually not more efficient at all, as it involves keying by the values, sorting by the keys, and then grabbing the values, but it looks prettier than your latter solution. In terms of efficiency, I don't think you'll find a more efficient solution, as you would need a way to transform your data such that the values become your keys (and then eventually transform that data back to the original schema).
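
To make the three steps named above concrete (key by the value, sort by the key, grab the values), here is a minimal plain-Python sketch that runs without Spark. It is a local analogy only, not Spark's code; the names sort_by_local and b_local are hypothetical, and the comments mapping each line to keyBy, sortByKey and values describe the correspondence I am assuming.

```python
def sort_by_local(records, keyfunc, ascending=True):
    """Local stand-in for: key by value, sort by key, then drop the key."""
    # Step 1: pair every record with its sort key (analogous to keyBy).
    keyed = [(keyfunc(r), r) for r in records]
    # Step 2: sort on the key only (analogous to sortByKey).
    keyed.sort(key=lambda pair: pair[0], reverse=not ascending)
    # Step 3: discard the key and keep the original records (analogous to values).
    return [r for _, r in keyed]

b_local = [('t', 3), ('b', 4), ('c', 1)]
ascending_result = sort_by_local(b_local, lambda kv: kv[1])
descending_result = sort_by_local(b_local, lambda kv: kv[1], ascending=False)
print(ascending_result)
print(descending_result)
```

The records come back in their original (K, V) shape, so no extra swap is needed afterwards; the amount of work is the same as in the manual swap-sort-swap version, only the bookkeeping is hidden.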