How to convert a DataFrame back to a normal RDD in pyspark?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/29000514/
Asked by javadba
I need to use the

(rdd.)partitionBy(npartitions, custom_partitioner)

method that is not available on the DataFrame. All of the DataFrame methods refer only to DataFrame results. So how do I create an RDD from the DataFrame's data?
Note: this is a change (in 1.3.0) from 1.2.0.
Update, from the answer by @dapangmao: the method is .rdd. I was interested to understand (a) whether it is public and (b) what the performance implications are.
Well, (a) yes, and (b) you can see here that there are significant performance implications: a new RDD must be created by invoking mapPartitions:
In dataframe.py (note that the file name changed as well; it was sql.py):
@property
def rdd(self):
    """
    Return the content of the :class:`DataFrame` as an :class:`RDD`
    of :class:`Row` s.
    """
    if not hasattr(self, '_lazy_rdd'):
        jrdd = self._jdf.javaToPython()
        rdd = RDD(jrdd, self.sql_ctx._sc, BatchedSerializer(PickleSerializer()))
        schema = self.schema
        def applySchema(it):
            cls = _create_cls(schema)
            return itertools.imap(cls, it)
        self._lazy_rdd = rdd.mapPartitions(applySchema)
    return self._lazy_rdd
Accepted answer by kennyut
@dapangmao's answer works, but it doesn't give the regular Spark RDD; it returns an RDD of Row objects. If you want the regular RDD format,
Try this:
rdd = df.rdd.map(tuple)
or
rdd = df.rdd.map(list)
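
For illustration, a minimal sketch of the difference between the three forms, assuming a hypothetical DataFrame df with columns k and v:

df.rdd.take(1)             # [Row(k=1, v='a')] -- an RDD of Row objects
df.rdd.map(tuple).take(1)  # [(1, 'a')]        -- an RDD of plain tuples
df.rdd.map(list).take(1)   # [[1, 'a']]        -- an RDD of plain lists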
Answered by dapangmao
Use the method .rdd like this:
rdd = df.rdd
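
With the RDD in hand, the questioner's original goal can be sketched as below. This is only a sketch: custom_partitioner and npartitions are hypothetical, and note that RDD.partitionBy requires a pair RDD of (key, value) tuples, so the rows must be keyed first.

npartitions = 4  # hypothetical partition count

def custom_partitioner(key):
    # hypothetical partitioner: any function mapping a key to a partition index
    return hash(key) % npartitions

# RDD.partitionBy works on pair RDDs, so key each Row first:
pair_rdd = df.rdd.map(lambda row: (row[0], row))
partitioned = pair_rdd.partitionBy(npartitions, custom_partitioner)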
Answered by Nilesh
The answer given by kennyut/Kistian works very well, but to get an exact RDD-like output when each row consists of a list of attributes, e.g. [1, 2, 3, 4], we can use the flatMap command as below,
rdd = df.rdd.flatMap(list)
or
rdd = df.rdd.flatMap(lambda x: list(x))
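
To see the flattening effect, a small sketch, assuming a SparkSession named spark (in Spark 1.x this would be a SQLContext):

df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
df.rdd.map(list).collect()      # [[1, 2], [3, 4]] -- one list per row
df.rdd.flatMap(list).collect()  # [1, 2, 3, 4]     -- values flattened into one RDD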

