How to sort by value efficiently in PySpark?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow.

Original question: http://stackoverflow.com/questions/33706408/

Asked by Hunle
I want to sort my K,V tuples by V, i.e. by the value. I know that takeOrdered is good for this if you know how many you need:
b = sc.parallelize([('t',3),('b',4),('c',1)])
Using takeOrdered:
b.takeOrdered(3,lambda atuple: atuple[1])
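For the sample RDD above, this returns the three pairs with the smallest values. The expected result (added here as a quick check, not part of the original post) is:

[('c', 1), ('t', 3), ('b', 4)]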
Using a lambda with sortByKey:
# Swap each pair to (V, K), sort by the new key, then swap back to (K, V).
b.map(lambda aTuple: (aTuple[1], aTuple[0])).sortByKey().map(
    lambda aTuple: (aTuple[1], aTuple[0])).collect()
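With the swap-back in place, this chain produces the same ordering as the takeOrdered call above on the sample data: [('c', 1), ('t', 3), ('b', 4)] (result added for illustration).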
I've checked out the question here, which suggests the latter. I find it hard to believe that takeOrdered is so succinct and yet requires the same number of operations as the lambda solution.
Does anyone know of a simpler, more concise transformation in Spark to sort by value?
Accepted answer by Rohan Aletty
I think sortBy() is more concise:
b = sc.parallelize([('t', 3),('b', 4),('c', 1)])
bSorted = b.sortBy(lambda a: a[1])
bSorted.collect()
...
[('c', 1),('t', 3),('b', 4)]
It's actually not more efficient at all, as it involves keying by the values, sorting by the keys, and then grabbing the values, but it looks prettier than your latter solution. In terms of efficiency, I don't think you'll find a more efficient approach, because you would need some way to transform your data such that the values become your keys (and then eventually transform that data back to the original schema).
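To make that point concrete, here is a rough sketch of what sortBy does internally (PySpark implements it as keyBy followed by sortByKey and values; the intermediate variable names below are mine, added for illustration):

# Roughly what b.sortBy(lambda a: a[1]) does under the hood:
keyed = b.keyBy(lambda a: a[1])   # ('t', 3) becomes (3, ('t', 3)), etc.
inOrder = keyed.sortByKey()       # shuffle and sort on the temporary key
result = inOrder.values()         # drop the temporary key
result.collect()                  # [('c', 1), ('t', 3), ('b', 4)]

So sortBy performs the same key/sort/unkey dance as the manual map-swap solution; it is a readability win, not a performance one.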