Extracting a dictionary from an RDD in PySpark
Original question: http://stackoverflow.com/questions/31006438/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Roman Rdgz
This is a homework question:
I have an RDD which is a collection of tuples. I also have a function which returns a dictionary from each input tuple. In a sense, it is the opposite of a reduce function.
With map, I can easily go from an RDD of tuples to an RDD of dictionaries. But since a dictionary is a collection of (key, value) pairs, I would like to convert the RDD of dictionaries into an RDD of (key, value) tuples holding the contents of each dictionary.
That way, if my RDD contains 10 tuples, then I get an RDD containing 10 dictionaries with 5 elements each (for example), and finally I get an RDD of 50 tuples.
I assume this has to be possible, but how? (Maybe the problem is that I don't know what this operation is called in English.)
Accepted answer by zero323
I guess what you want is just a flatMap:
dicts = sc.parallelize([{"foo": 1, "bar": 2}, {"foo": 3, "baz": -1, "bar": 5}])
dicts.flatMap(lambda x: x.items())
flatMap takes a function from an element of the RDD to an iterable and then concatenates the results. Another name for the same type of operation outside the Spark context is mapcat:
>>> from toolz.curried import map, mapcat, concat, pipe
>>> from itertools import repeat
>>> pipe(range(4), mapcat(lambda i: repeat(i, i + 1)), list)
[0, 1, 1, 2, 2, 2, 3, 3, 3, 3]
Or going step by step:
>>> pipe(range(4), map(lambda i: repeat(i, i + 1)), concat, list)
[0, 1, 1, 2, 2, 2, 3, 3, 3, 3]
The same thing using itertools.chain:
>>> from itertools import chain
>>> pipe((repeat(i, i + 1) for i in range(4)), chain.from_iterable, list)
[0, 1, 1, 2, 2, 2, 3, 3, 3, 3]
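The same idea can be applied to the numbers in the question without a Spark cluster. The following is a minimal pure-Python sketch, not Spark code: flat_map and to_dict are hypothetical names introduced here for illustration, and to_dict only stands in for the asker's function (its keys and values are made up). It shows 10 tuples becoming 10 dictionaries of 5 entries each, and then 50 (key, value) tuples.

```python
from itertools import chain


def flat_map(func, iterable):
    """Apply func to each element and concatenate the resulting iterables."""
    return list(chain.from_iterable(func(x) for x in iterable))


def to_dict(record):
    # Hypothetical stand-in for the asker's function:
    # one input tuple in, one dictionary with 5 entries out.
    record_id, value = record
    return {f"{record_id}_{i}": value * i for i in range(5)}


records = [(n, n + 1) for n in range(10)]        # 10 input tuples
dicts = [to_dict(r) for r in records]            # analogous to rdd.map(to_dict)
pairs = flat_map(lambda d: d.items(), dicts)     # analogous to flatMap(lambda d: d.items())

print(len(records), len(dicts), len(pairs))
```

In PySpark the two steps would presumably be written as rdd.map(to_dict).flatMap(lambda d: d.items()), or collapsed into a single flatMap.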
Answer by Leandro Mora
My 2 cents:
There is a PairRDD function named "collectAsMap" that returns a dictionary from an RDD of (key, value) pairs.
Let me show you an example:
sample = someRDD.sample(False, 0.0001, 0)  # withReplacement, fraction, seed
sample_dict = sample.collectAsMap()
print(sample.collect())
print(sample_dict)
[('hi', 4123.0)]
{'hi': 4123.0}
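One caveat worth illustrating: a dictionary holds one value per key, so pairs that share a key collapse when turned into a map. The sketch below is plain Python, not Spark, and assumes collectAsMap behaves like building a dict from the collected pairs (which, as far as I know, is how PySpark does it). The pairs are the flattened items of the two dictionaries from the accepted answer.

```python
from collections import defaultdict

# (key, value) pairs as flatMap would produce them from
# {"foo": 1, "bar": 2} and {"foo": 3, "baz": -1, "bar": 5}
pairs = [("foo", 1), ("bar", 2), ("foo", 3), ("baz", -1), ("bar", 5)]

# Assumption: collectAsMap is roughly dict(rdd.collect()),
# so for a repeated key the value seen last wins.
as_map = dict(pairs)
print(as_map)

# To keep every value, group them per key instead
# (on a pair RDD, groupByKey serves a similar purpose).
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
print(dict(grouped))
```

So collecting into a map fits best when keys are unique; otherwise the RDD of tuples, or a grouped form, keeps all the data.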
Documentation here
Hope it helps! Regards!