pandas: How to convert a pandas dataframe to a pyspark dataframe which has the rdd attribute?
Disclaimer: this page is an English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/49555269/
How to convert a pandas dataframe to a pyspark dataframe which has the rdd attribute?
Asked by Carmelo Smith
Now I am doing a project for my course, and I have run into a problem converting a pandas dataframe to a pyspark dataframe.
I have produced a pandas dataframe named data_org as follows.
[screenshot of the data_org dataframe]
And I want to convert it into a pyspark dataframe to adjust it into libsvm format. So my code is
from pyspark.sql import SQLContext
spark_df = SQLContext.createDataFrame(data_org)
However, it failed with the following error:
TypeError: createDataFrame() missing 1 required positional argument: 'data'
I really do not know what to do. My Python version is 3.5.2 and my pyspark version is 2.0.1. I am looking forward to your reply.
Answered by Sociopath
First, pass a SparkContext to SQLContext. createDataFrame is an instance method, so it must be called on an SQLContext object rather than on the class itself; calling it on the class is what produced the "missing 1 required positional argument: 'data'" error above:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "App Name")
sql = SQLContext(sc)
then use createDataFrame like below:
spark_df = sql.createDataFrame(data_org)
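For the asker's follow-up goal (using the rdd attribute of spark_df to write the data out in libsvm format), here is a minimal end-to-end sketch. The sample data_org, the assumption that the first column is the label and the remaining columns are numeric features, the app name, and the output path are all hypothetical placeholders, not part of the original question.

# Minimal sketch: pandas DataFrame -> pyspark DataFrame -> RDD -> libsvm files.
# Assumes a numeric DataFrame whose first column is the label (hypothetical layout).
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

sc = SparkContext("local", "App Name")
sql = SQLContext(sc)

# Hypothetical stand-in for data_org; columns listed explicitly to fix their order
data_org = pd.DataFrame(
    {"label": [0.0, 1.0], "f1": [1.0, 2.0], "f2": [3.0, 4.0]},
    columns=["label", "f1", "f2"])
spark_df = sql.createDataFrame(data_org)

# spark_df now has the rdd attribute, so each Row can be mapped to a LabeledPoint
labeled = spark_df.rdd.map(lambda row: LabeledPoint(row[0], row[1:]))

# MLUtils writes an RDD of LabeledPoint as libsvm-formatted text files
MLUtils.saveAsLibSVMFile(labeled, "data_org_libsvm")

Note that in Spark 2.x the preferred entry point is SparkSession (spark.createDataFrame(data_org)), but the SQLContext route above also works in the asker's pyspark 2.0.1.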