How to convert Spark RDD to pandas dataframe in ipython?
Original source: http://stackoverflow.com/questions/34817549/
License: CC BY-SA 4.0. You are free to use and share this content, but you must attribute it to the original authors on StackOverflow.
Asked by user2966197
I have an RDD and I want to convert it to a pandas dataframe. I know that to convert an RDD to a normal dataframe we can do:
df = rdd1.toDF()
But I want to convert the RDD to a pandas dataframe and not a normal dataframe. How can I do it?
Answered by jezrael
You can use the function toPandas():
Returns the contents of this DataFrame as Pandas pandas.DataFrame.
This is only available if Pandas is installed and available.
>>> df.toPandas()
   age   name
0    2  Alice
1    5    Bob
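Note that toPandas() is defined on Spark DataFrames rather than on RDDs, so for the RDD in the question the call has to be chained after toDF(). A minimal sketch, assuming an active SparkSession (or SQLContext in older Spark versions), which is what makes toDF() available on RDDs, and assuming the RDD holds tuples or Rows. The helper name rdd_to_pandas and its column_names parameter are made up for illustration:

```python
def rdd_to_pandas(rdd, column_names=None):
    """Convert an RDD of tuples/Rows to a pandas DataFrame (hypothetical helper).

    Assumes a SparkSession is already active and pandas is installed.
    """
    # Step 1: RDD -> Spark DataFrame. column_names is optional; when it is
    # omitted, Spark infers the schema from the data.
    spark_df = rdd.toDF(column_names)
    # Step 2: Spark DataFrame -> pandas DataFrame. This collects all rows
    # to the driver, so the data must fit in driver memory.
    return spark_df.toPandas()
```

For the question's RDD this would be called as rdd_to_pandas(rdd1), or as rdd_to_pandas(rdd1, ["age", "name"]) to set the column names explicitly.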
Answered by RKD314
You'll have to use a Spark DataFrame as an intermediary step between your RDD and the desired Pandas DataFrame.
For example, let's say I have a text file, flights.csv, that has been read into an RDD:
flights = sc.textFile('flights.csv')
You can check the type:
type(flights)
<class 'pyspark.rdd.RDD'>
If you just use toPandas() on the RDD, it won't work. Depending on the format of the objects in your RDD, some processing may be necessary to go to a Spark DataFrame first. In the case of this example, this code does the job:
# RDD to Spark DataFrame
sparkDF = flights.map(lambda x: str(x)).map(lambda w: w.split(',')).toDF()
# Spark DataFrame to Pandas DataFrame
pdsDF = sparkDF.toPandas()
You can check the type:
type(pdsDF)
<class 'pandas.core.frame.DataFrame'>
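One caveat with the code above: split(',') breaks on quoted fields that contain commas, and toDF() without arguments names the columns _1, _2, and so on. A hedged sketch of one way to handle both, assuming (this is not stated in the answer) that the first line of flights.csv is a header row and that the file has no blank lines. parse_csv_line and csv_rdd_to_pandas are hypothetical helper names:

```python
import csv


def parse_csv_line(line):
    """Split one CSV line into a list of fields, honoring quoted commas."""
    return next(csv.reader([line]))


def csv_rdd_to_pandas(lines_rdd):
    """Convert an RDD of CSV text lines to a pandas DataFrame (hypothetical helper).

    Assumes the first line is a header row, a SparkSession is active,
    and no field contains an embedded newline.
    """
    header = lines_rdd.first()  # assumed to be the header line
    column_names = parse_csv_line(header)
    rows = (
        lines_rdd
        .filter(lambda line: line != header)  # drop every line equal to the header
        .map(parse_csv_line)
    )
    # All values are still strings here; cast them afterwards if needed.
    return rows.toDF(column_names).toPandas()
```

With the RDD from this answer, the call would be pdsDF = csv_rdd_to_pandas(flights).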
Answered by Shuai Liu
I recommend a fast version of toPandas by joshlk: https://gist.github.com/joshlk/871d58e01417478176e7