How to sort by value efficiently in PySpark?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow.

Original question: http://stackoverflow.com/questions/33706408/

Asked by Hunle
I want to sort my K,V tuples by V, i.e. by the value. I know that takeOrdered is good for this if you know how many you need:
b = sc.parallelize([('t',3),('b',4),('c',1)])
Using takeOrdered:
b.takeOrdered(3,lambda atuple: atuple[1])
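For the sample RDD above, this returns the three pairs with the smallest values. The expected result (added here as a quick check, not part of the original post) is:

[('c', 1), ('t', 3), ('b', 4)]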
Using a lambda with sortByKey:
# Swap each pair to (V, K), sort by the new key, then swap back to (K, V).
b.map(lambda aTuple: (aTuple[1], aTuple[0])).sortByKey().map(
    lambda aTuple: (aTuple[1], aTuple[0])).collect()
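With the swap-back in place, this chain produces the same ordering as the takeOrdered call above on the sample data: [('c', 1), ('t', 3), ('b', 4)] (result added for illustration).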
I've checked out the question here, which suggests the latter. I find it hard to believe that takeOrdered is so succinct and yet requires the same number of operations as the lambda solution.
Does anyone know of a simpler, more concise transformation in Spark to sort by value?
Accepted answer by Rohan Aletty
I think sortBy() is more concise:
b = sc.parallelize([('t', 3),('b', 4),('c', 1)])
bSorted = b.sortBy(lambda a: a[1])
bSorted.collect()
...
[('c', 1),('t', 3),('b', 4)]
It's actually not more efficient at all, as it involves keying by the values, sorting by the keys, and then grabbing the values, but it looks prettier than your latter solution. In terms of efficiency, I don't think you'll find a more efficient approach, because you would need some way to transform your data such that the values become your keys (and then eventually transform that data back to the original schema).
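To make that point concrete, here is a rough sketch of what sortBy does internally (PySpark implements it as keyBy followed by sortByKey and values; the intermediate variable names below are mine, added for illustration):

# Roughly what b.sortBy(lambda a: a[1]) does under the hood:
keyed = b.keyBy(lambda a: a[1])   # ('t', 3) becomes (3, ('t', 3)), etc.
inOrder = keyed.sortByKey()       # shuffle and sort on the temporary key
result = inOrder.values()         # drop the temporary key
result.collect()                  # [('c', 1), ('t', 3), ('b', 4)]

So sortBy performs the same key/sort/unkey dance as the manual map-swap solution; it is a readability win, not a performance one.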