Convert a pandas dataframe to a PySpark dataframe

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/52943627/
Asked by kikee1222
I have a script with the below setup.
I am using:
1) Spark dataframes to pull data in
2) Converting to pandas dataframes after the initial aggregation
3) Want to convert back to Spark for writing to HDFS
The conversion from Spark --> pandas was simple, but I am struggling with how to convert a pandas dataframe back to Spark.
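For reference, that first direction is a single call; a minimal sketch, assuming df is an existing Spark dataframe:

# collect the Spark dataframe to the driver as a pandas dataframe
df_pd = df.toPandas()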
Can you advise?
from pyspark.sql import SparkSession
import pyspark.sql.functions as sqlfunc
from pyspark.sql.types import *
import argparse, sys
import pandas as pd

def create_session(appname):
    # build (or reuse) a Hive-enabled SparkSession running on YARN
    spark_session = SparkSession\
        .builder\
        .appName(appname)\
        .master('yarn')\
        .config("hive.metastore.uris", "thrift://uds-far-mn1.dab.02.net:9083")\
        .enableHiveSupport()\
        .getOrCreate()
    return spark_session

### START MAIN ###
if __name__ == '__main__':
    spark_session = create_session('testing_files')
I've tried the below - no errors, just no data! To confirm, df6 does have data & is a pandas dataframe
# note: ascending expects booleans; the original ["true"] passed a truthy string
df6 = df5.sort_values(['sdsf'], ascending=[True])
sdf = spark_session.createDataFrame(df6)
sdf.show()
Answered by Andrea
Here we go:
# Spark to Pandas
df_pd = df.toPandas()
# Pandas to Spark
df_sp = spark_session.createDataFrame(df_pd)
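If createDataFrame comes back empty or with wrong types, a common culprit is pandas columns of object dtype that Spark's type inference cannot handle. A minimal sketch of passing an explicit schema instead (the column names and types below are hypothetical, standing in for your dataframe's actual columns):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# hypothetical schema -- replace the fields with your dataframe's real columns
schema = StructType([
    StructField("name", StringType(), True),
    StructField("count", LongType(), True),
])

# an explicit schema bypasses Spark's type inference over the pandas data
df_sp = spark_session.createDataFrame(df_pd, schema=schema)
df_sp.show()

On Spark 2.3+, setting the spark.sql.execution.arrow.enabled config to "true" (renamed spark.sql.execution.arrow.pyspark.enabled in Spark 3.x) makes createDataFrame and toPandas go through Apache Arrow, which is substantially faster for large frames.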