Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/31006438/

Date: 2020-08-19 09:19:04  Source: igfitidea

Extracting a dictionary from an RDD in Pyspark

Tags: python, apache-spark, pyspark

Asked by Roman Rdgz

This is a homework question:

I have an RDD which is a collection of tuples. I also have a function which returns a dictionary from each input tuple. In a way, it is the opposite of a reduce function.

With map, I can easily go from an RDD of tuples to an RDD of dictionaries. But, since a dictionary is a collection of (key, value) pairs, I would like to convert the RDD of dictionaries into an RDD of (key, value) tuples holding each dictionary's contents.

That way, if my RDD contains 10 tuples, then I get an RDD containing 10 dictionaries with 5 elements each (for example), and finally I get an RDD of 50 tuples.

I assume this has to be possible, but how? (Maybe the problem is that I don't know what this operation is called in English.)

Accepted answer by zero323

I guess what you want is just a flatMap:

# `sc` is an existing SparkContext (predefined in the pyspark shell)
dicts = sc.parallelize([{"foo": 1, "bar": 2}, {"foo": 3, "baz": -1, "bar": 5}])
dicts.flatMap(lambda x: x.items())  # RDD of (key, value) tuples
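
To see what this produces without a Spark cluster, the same flattening can be imitated in plain Python. This is only a sketch of the semantics, not how Spark executes it: the helper name flat_map is made up for illustration, and on a real RDD you would call .collect() on the result of flatMap to bring the pairs back to the driver.

```python
from itertools import chain


def flat_map(func, items):
    """Hypothetical stand-in for RDD.flatMap: apply func to each item, then concatenate the results."""
    return list(chain.from_iterable(func(item) for item in items))


dicts = [{"foo": 1, "bar": 2}, {"foo": 3, "baz": -1, "bar": 5}]

# Each dictionary contributes one (key, value) tuple per entry.
pairs = flat_map(lambda d: d.items(), dicts)
print(pairs)
# [('foo', 1), ('bar', 2), ('foo', 3), ('baz', -1), ('bar', 5)]
```

Two dictionaries with 2 and 3 entries give 5 tuples, which is the same counting as the 10 x 5 = 50 example in the question.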

flatMap takes a function that maps each element of the RDD to an iterable, and then concatenates the results. Outside the Spark context, another name for the same type of operation is mapcat:

>>> from toolz.curried import map, mapcat, concat, pipe
>>> from itertools import repeat
>>> pipe(range(4), mapcat(lambda i: repeat(i, i + 1)), list)
[0, 1, 1, 2, 2, 2, 3, 3, 3, 3]

Or, going step by step:

>>> pipe(range(4), map(lambda i: repeat(i, i + 1)), concat, list)
[0, 1, 1, 2, 2, 2, 3, 3, 3, 3]

The same thing using itertools.chain:

>>> from itertools import chain
>>> pipe((repeat(i, i + 1) for i in range(4)), chain.from_iterable, list)
[0, 1, 1, 2, 2, 2, 3, 3, 3, 3]

Answer by Leandro Mora

My 2 cents:

There is a PairRDD function named "collectAsMap" that returns a dictionary from an RDD.

Let me show you an example:

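# someRDD is assumed to be an existing RDD of (key, value) pairs
sample = someRDD.sample(False, 0.0001, 0)  # withReplacement, fraction, seed
sample_dict = sample.collectAsMap()
print(sample.collect())
print(sample_dict)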

[('hi', 4123.0)]
{'hi': 4123.0}
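
Note that a dictionary holds a single value per key, so if the RDD contains the same key more than once (as the flattened pairs from the dictionaries in the accepted answer would), only one value per key survives. The result is also collected onto the driver, so this is best kept for small RDDs. A plain-Python sketch of that behaviour, assuming the result is built like dict(rdd.collect()) and using made-up data:

```python
# Pairs as they might come back from rdd.collect(); the data is illustrative only.
pairs = [("foo", 1), ("bar", 2), ("foo", 3), ("baz", -1), ("bar", 5)]

# Building a dict keeps the last value seen for each key.
as_map = dict(pairs)
print(as_map)
# {'foo': 3, 'bar': 5, 'baz': -1}
```

If every value is needed, keeping the data as (key, value) tuples, or grouping by key first, avoids the loss.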

Documentation here

Hope it helps! Regards!
