pandas: How to convert a pandas dataframe to a pyspark dataframe which has an rdd attribute?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/49555269/

Date: 2020-09-14 05:23:18  Source: igfitidea

How to convert pandas dataframe to pyspark dataframe which has attribute to rdd?

python, pandas, dataframe, pyspark

Asked by Carmelo Smith

Now I am doing a project for my course, and I ran into a problem converting a pandas dataframe to a pyspark dataframe. I have produced a pandas dataframe named data_org as follows. (screenshot of data_org omitted)

And I want to convert it into a pyspark dataframe to adjust it into libsvm format. So my code is

from pyspark.sql import SQLContext  
spark_df = SQLContext.createDataFrame(data_org)

However, it went wrong:

TypeError: createDataFrame() missing 1 required positional argument: 'data'


I really do not know what to do. My python version is 3.5.2 and my pyspark version is 2.0.1. I am looking forward to your reply.

Answer by Sociopath

createDataFrame() is an instance method, which is why calling it on the SQLContext class itself raises the "missing 1 required positional argument: 'data'" error. First pass a SparkContext to SQLContext to build an instance:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "App Name")
sql = SQLContext(sc)

then use createDataFrame like below:

spark_df = sql.createDataFrame(data_org)