How to convert a Spark RDD to a pandas DataFrame in IPython?

Notice: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/34817549/

Time: 2020-08-19 15:33:31. Source: igfitidea

Tags: python, pandas, ipython, pyspark, rdd

Asked by user2966197

I have an RDD and I want to convert it to a pandas DataFrame. I know that to convert an RDD to a normal (Spark) DataFrame we can do:

df = rdd1.toDF()

But I want to convert the RDD to a pandas DataFrame, not a normal DataFrame. How can I do it?

Answered by jezrael

You can use the toPandas() function, which is a method of a Spark DataFrame (not of an RDD). Its documentation says:

Returns the contents of this DataFrame as Pandas pandas.DataFrame.

This is only available if Pandas is installed and available.

>>> df.toPandas()  
   age   name
0    2  Alice
1    5    Bob
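
Putting the question and this answer together, the two steps can be chained: toDF() first, then toPandas(). The following is only a minimal sketch. The helper name rdd_to_pandas, the demo() function, the local[1] master, and the sample rows are assumptions made for illustration and are not part of the original answer.

```python
def rdd_to_pandas(rdd, columns):
    """Convert an RDD of tuples to a pandas DataFrame via a Spark DataFrame."""
    # toDF() is only attached to RDDs once a SparkSession (or SQLContext) exists.
    spark_df = rdd.toDF(columns)
    # toPandas() collects every row to the driver, so the data must fit in driver memory.
    return spark_df.toPandas()


def demo():
    # Imported here so the helper above can be defined without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[1]")          # assumption: a local test session
             .appName("rdd-to-pandas")
             .getOrCreate())
    rdd = spark.sparkContext.parallelize([("Alice", 2), ("Bob", 5)])
    pdf = rdd_to_pandas(rdd, ["name", "age"])
    print(type(pdf))
    print(pdf)
    spark.stop()
```

Calling demo() in an environment with pyspark and pandas installed should print the pandas type followed by the two rows. In an IPython or pyspark shell where sc already exists, rdd_to_pandas(rdd1, [...]) can be called directly with your own column names.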

Answered by RKD314

You'll have to use a Spark DataFrame as an intermediary step between your RDD and the desired Pandas DataFrame.

For example, let's say I have a text file, flights.csv, that has been read into an RDD:

flights = sc.textFile('flights.csv')

You can check the type:

type(flights)
<class 'pyspark.rdd.RDD'>

If you call toPandas() directly on the RDD, it won't work, because toPandas() is defined on Spark DataFrames rather than RDDs. Depending on the format of the objects in your RDD, some processing may be necessary to get to a Spark DataFrame first. In this example, the following code does the job:

# RDD to Spark DataFrame
sparkDF = flights.map(lambda x: str(x)).map(lambda w: w.split(',')).toDF()

# Spark DataFrame to pandas DataFrame
pdsDF = sparkDF.toPandas()

You can check the type:

type(pdsDF)
<class 'pandas.core.frame.DataFrame'>
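
One caveat with the split(',') approach above: toDF() without arguments names the columns _1, _2, and so on, every value stays a string, and a header line (if the file has one) ends up as a data row. The sketch below is one possible variation, not the answerer's code. It assumes that the first line of flights.csv is a header, and the helper names parse_csv_line and csv_lines_to_pandas are hypothetical.

```python
import csv


def parse_csv_line(line):
    """Split one CSV line into a tuple of fields, honouring quoted commas."""
    return tuple(next(csv.reader([line])))


def csv_lines_to_pandas(lines):
    """lines: an RDD of strings, e.g. the result of sc.textFile('flights.csv')."""
    header = lines.first()                  # assumption: first line holds the column names
    columns = list(parse_csv_line(header))
    rows = (lines
            # Skip blank lines and the header (plus any line identical to it).
            .filter(lambda line: bool(line.strip()) and line != header)
            .map(parse_csv_line))
    # Every column is still a string here; cast in Spark or in pandas as needed.
    return rows.toDF(columns).toPandas()
```

With the flights RDD from above, this would be used as pdsDF = csv_lines_to_pandas(flights). The csv module is used instead of split(',') only so that quoted fields containing commas are not broken apart.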

Answered by Shuai Liu

I recommend a fast version of toPandas by joshlk: https://gist.github.com/joshlk/871d58e01417478176e7