How to convert a DataFrame back to a normal RDD in pyspark?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/29000514/
Asked by javadba
I need to use the

(rdd.)partitionBy(npartitions, custom_partitioner)

method that is not available on the DataFrame. All of the DataFrame methods refer only to DataFrame results. So how do I create an RDD from the DataFrame's data?
Note: this is a change (in 1.3.0) from 1.2.0.
Update, from the answer by @dapangmao: the method is .rdd. I was interested to understand (a) whether it is public and (b) what the performance implications are.
Well, (a) yes, and (b) you can see here that there are significant performance implications: a new RDD must be created by invoking mapPartitions:
In dataframe.py (note that the file name changed as well; it was sql.py):
@property
def rdd(self):
    """
    Return the content of the :class:`DataFrame` as an :class:`RDD`
    of :class:`Row` s.
    """
    if not hasattr(self, '_lazy_rdd'):
        jrdd = self._jdf.javaToPython()
        rdd = RDD(jrdd, self.sql_ctx._sc, BatchedSerializer(PickleSerializer()))
        schema = self.schema
        def applySchema(it):
            cls = _create_cls(schema)
            return itertools.imap(cls, it)
        self._lazy_rdd = rdd.mapPartitions(applySchema)
    return self._lazy_rdd
Accepted answer by kennyut
@dapangmao's answer works, but it doesn't give the regular Spark RDD; it returns an RDD of Row objects. If you want the regular RDD format,
Try this:
rdd = df.rdd.map(tuple)
or
rdd = df.rdd.map(list)
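
For illustration, a minimal sketch of the difference between the three forms, assuming a hypothetical DataFrame df with columns k and v:

df.rdd.take(1)             # [Row(k=1, v='a')] -- an RDD of Row objects
df.rdd.map(tuple).take(1)  # [(1, 'a')]        -- an RDD of plain tuples
df.rdd.map(list).take(1)   # [[1, 'a']]        -- an RDD of plain lists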
Answered by dapangmao
Use the method .rdd like this:
rdd = df.rdd
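
With the RDD in hand, the questioner's original goal can be sketched as below. This is only a sketch: custom_partitioner and npartitions are hypothetical, and note that RDD.partitionBy requires a pair RDD of (key, value) tuples, so the rows must be keyed first.

npartitions = 4  # hypothetical partition count

def custom_partitioner(key):
    # hypothetical partitioner: any function mapping a key to a partition index
    return hash(key) % npartitions

# RDD.partitionBy works on pair RDDs, so key each Row first:
pair_rdd = df.rdd.map(lambda row: (row[0], row))
partitioned = pair_rdd.partitionBy(npartitions, custom_partitioner)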
Answered by Nilesh
The answer given by kennyut/Kistian works very well, but to get an exact RDD-like output when each row consists of a list of attributes, e.g. [1, 2, 3, 4], we can use the flatMap command as below,
rdd = df.rdd.flatMap(list)
or
rdd = df.rdd.flatMap(lambda x: list(x))
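
To see the flattening effect, a small sketch, assuming a SparkSession named spark (in Spark 1.x this would be a SQLContext):

df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
df.rdd.map(list).collect()      # [[1, 2], [3, 4]] -- one list per row
df.rdd.flatMap(list).collect()  # [1, 2, 3, 4]     -- values flattened into one RDD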

