Python PySpark groupByKey returning pyspark.resultiterable.ResultIterable

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/29717257/


PySpark groupByKey returning pyspark.resultiterable.ResultIterable

Tags: python, apache-spark, pyspark

Asked by theMadKing

I am trying to figure out why my groupByKey is returning the following:

[(0, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a210>), (1, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a4d0>), (2, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a390>), (3, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a290>), (4, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a450>), (5, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a350>), (6, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a1d0>), (7, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a490>), (8, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a050>), (9, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a650>)]

I have flatMapped values that look like this:

[(0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D')]

I'm just doing a simple:

groupRDD = columnRDD.groupByKey()

Accepted answer by dpeacock

What you're getting back is an object which allows you to iterate over the results. You can turn the results of groupByKey into a list by calling list() on the values, e.g.

example = sc.parallelize([(0, u'D'), (0, u'D'), (1, u'E'), (2, u'F')])

example.groupByKey().collect()
# Gives [(0, <pyspark.resultiterable.ResultIterable object ......]

example.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()
# Gives [(0, [u'D', u'D']), (1, [u'E']), (2, [u'F'])]

Answered by Jayaram

You can also use:

example.groupByKey().mapValues(list)
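
Applied to the same example RDD from the accepted answer, this gives the same result as the map() version above (a quick sketch, assuming that RDD is still in scope):

example.groupByKey().mapValues(list).collect()
# Gives [(0, [u'D', u'D']), (1, [u'E']), (2, [u'F'])]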

Answered by Harsha

Instead of using groupByKey(), I would suggest you use cogroup(). You can refer to the example below.

# Note: the comprehension's loop variables x and y shadow the RDDs; x.cogroup(y) is evaluated first, so this still works
[(x, tuple(map(list, y))) for x, y in sorted(list(x.cogroup(y).collect()))]

Example:

>>> x = sc.parallelize([("foo", 1), ("bar", 4)])
>>> y = sc.parallelize([("foo", -1)])
>>> z = [(x, tuple(map(list, y))) for x, y in sorted(list(x.cogroup(y).collect()))]
>>> print(z)
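[('bar', ([4], [])), ('foo', ([1], [-1]))]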

You should get the desired output, as shown above.

Answered by bin yan

Example:

from operator import add
import __builtin__  # Python 2; __builtin__.map is used in case map is shadowed inside the closure

r1 = sc.parallelize([('a', 1), ('b', 2)])
r2 = sc.parallelize([('b', 1), ('d', 2)])
# Concatenate the two per-key iterables from cogroup into a single tuple of values
r1.cogroup(r2).mapValues(lambda x: tuple(reduce(add, __builtin__.map(list, x)))).collect()

Result:

[('d', (2,)), ('b', (2, 1)), ('a', (1,))]
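
The snippet above is Python 2 (__builtin__, builtin reduce). A rough Python 3 equivalent, assuming the same r1 and r2, would be:

from functools import reduce  # reduce is no longer a builtin in Python 3
from operator import add
import builtins               # __builtin__ was renamed to builtins

r1.cogroup(r2).mapValues(lambda x: tuple(reduce(add, builtins.map(list, x)))).collect()
# Gives the same result as above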

Answered by Aniruddha Kalburgi

In addition to the above answers, if you want a sorted list of unique items, use the following:

List of Distinct and Sorted Values

example.groupByKey().mapValues(set).mapValues(sorted)

Just List of Sorted Values

example.groupByKey().mapValues(sorted)

Alternatives to the above:

# List of distinct sorted items
example.groupByKey().map(lambda x: (x[0], sorted(set(x[1]))))

# just sorted list of items
example.groupByKey().map(lambda x: (x[0], sorted(x[1])))
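
With the example RDD from the accepted answer, the distinct-and-sorted variant would give (a sketch, reusing that RDD):

example.groupByKey().mapValues(set).mapValues(sorted).collect()
# Gives [(0, [u'D']), (1, [u'E']), (2, [u'F'])]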

Answered by yeamusic21

Say your code is:

ex2 = ex1.groupByKey()

And then you run:

ex2.take(5)

You're going to see an iterable. That's fine if you're going to do something with this data; you can just move on. But if all you want is to print/see the values first before moving on, here is a bit of a hack:

ex2.toDF().show(20, False)

Or just:

ex2.toDF().show()

This will show the values of the data. You shouldn't use collect() because that will return all the data to the driver, and if you're working with a lot of data, that's going to blow up on you. Now, if ex2 = ex1.groupByKey() was your final step and you want those results returned, then yes, use collect(), but make sure you know that the data being returned is low volume.

print(ex2.collect())
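
Note that collect() on a grouped RDD still returns ResultIterable objects for the values; if you want readable output, you can combine this with the earlier answers and materialize the values first (a hedged sketch):

print(ex2.mapValues(list).collect())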

Here is another nice post on using collect() on RDDs:

View RDD contents in Python Spark?
