Convert a Spark DataFrame to a pandas DataFrame

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/50958721/


Tags: pandas, apache-spark, apache-spark-sql

Asked by vikky

Is there a way to convert a Spark DataFrame (not an RDD) to a pandas DataFrame?

I tried the following:

var some_df = Seq(
  ("A", "no"),
  ("B", "yes"),
  ("B", "yes"),
  ("B", "no")
).toDF("user_id", "phone_number")

Code:

%pyspark
pandas_df = some_df.toPandas()

Error:

 NameError: name 'some_df' is not defined

Any suggestions?

Answered by Gaurang Shah

The following should work:

some_df = sc.parallelize([
    ("A", "no"),
    ("B", "yes"),
    ("B", "yes"),
    ("B", "no")
]).toDF(["user_id", "phone_number"])
pandas_df = some_df.toPandas()
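
The NameError in the question most likely means some_df was created in a different interpreter (the Seq(...).toDF syntax above is Scala), so the %pyspark paragraph never sees the variable; rebuilding the DataFrame on the Python side, as this answer does, sidesteps that. A minimal sketch of the same idea using the SparkSession API instead of sc.parallelize, assuming a SparkSession named spark is already available (as it is in most notebook environments):

%pyspark
# build the DataFrame directly from a list of rows and the column names
some_df = spark.createDataFrame(
    [("A", "no"), ("B", "yes"), ("B", "yes"), ("B", "no")],
    ["user_id", "phone_number"])

# collect the result to the driver as a pandas DataFrame
pandas_df = some_df.toPandas()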

Answered by Inna

In my case, the following conversion from a Spark DataFrame to a pandas DataFrame worked:

pandas_df = spark_df.select("*").toPandas()
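
The select("*") is not strictly required, since toPandas() can be called on the DataFrame directly; selecting only the columns you actually need before converting reduces the data pulled back to the driver. A small sketch (the column names here are placeholders borrowed from the question, not part of this answer):

# convert only the columns that are needed downstream
pandas_df = spark_df.select("user_id", "phone_number").toPandas()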

Answered by Shikha

Converting a Spark DataFrame to pandas can take time if you have a large DataFrame, so you can use something like this:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pd_df = df_spark.toPandas()

I have tried this in Databricks.
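
As a side note, in Spark 3.x the Arrow option was renamed; a small sketch, assuming Spark 3.x, that sets the newer key (the 2.x key used above still works but is deprecated) plus the optional fallback switch:

# Spark 3.x name of the Arrow toggle ("spark.sql.execution.arrow.enabled" is the deprecated 2.x key)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# optionally fall back to the regular, non-Arrow conversion if Arrow fails
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")

pd_df = df_spark.toPandas()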