pyspark error: AttributeError: 'SparkSession' object has no attribute 'parallelize'
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/39521341/
Asked by Edamame
I am using pyspark in a Jupyter notebook. Here is how Spark is set up:
import findspark
findspark.init(spark_home='/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive', python_path='python2.7')
import pyspark
from pyspark.sql import *
sc = pyspark.sql.SparkSession.builder.master("yarn-client").config("spark.executor.memory", "2g").config('spark.driver.memory', '1g').config('spark.driver.cores', '4').enableHiveSupport().getOrCreate()
sqlContext = SQLContext(sc)
Then when I do:
spark_df = sqlContext.createDataFrame(df_in)
where df_in is a pandas dataframe. I then got the following errors:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-9-1db231ce21c9> in <module>()
----> 1 spark_df = sqlContext.createDataFrame(df_in)
/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio)
297 Py4JJavaError: ...
298 """
--> 299 return self.sparkSession.createDataFrame(data, schema, samplingRatio)
300
301 @since(1.3)
/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio)
520 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
521 else:
--> 522 rdd, schema = self._createFromLocal(map(prepare, data), schema)
523 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
524 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive/python/pyspark/sql/session.pyc in _createFromLocal(self, data, schema)
400 # convert python objects to sql data
401 data = [schema.toInternal(row) for row in data]
--> 402 return self._sc.parallelize(data), schema
403
404 @since(2.0)
AttributeError: 'SparkSession' object has no attribute 'parallelize'
Does anyone know what I did wrong? Thanks!
Answered by zero323
SparkSession is not a replacement for a SparkContext but an equivalent of the SQLContext. Just use it the same way as you used to use SQLContext:
spark.createDataFrame(...)
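
For example, a minimal sketch (the sample data here is hypothetical; createDataFrame on a SparkSession accepts a pandas DataFrame directly):

import pandas as pd

df_in = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})  # hypothetical sample data
spark_df = spark.createDataFrame(df_in)               # spark is the SparkSession
spark_df.show()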
and if you ever have to access SparkContext use the sparkContext attribute:
spark.sparkContext
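
For instance, the parallelize call that failed in the traceback is a SparkContext method, so going through the sparkContext attribute works (a small sketch):

rdd = spark.sparkContext.parallelize([1, 2, 3])  # parallelize lives on SparkContext, not SparkSession
rdd.collect()                                    # [1, 2, 3]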
so if you need SQLContext for backwards compatibility, you can:
SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
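
Putting it together, here is a sketch of the question's setup rewritten along these lines (same Spark home and config values as in the question). Note that the original code passed the SparkSession into SQLContext() as its first, sparkContext, argument, which is why self._sc.parallelize later failed:

import findspark
findspark.init(spark_home='/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive', python_path='python2.7')

from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder \
    .master("yarn-client") \
    .config("spark.executor.memory", "2g") \
    .config("spark.driver.memory", "1g") \
    .config("spark.driver.cores", "4") \
    .enableHiveSupport() \
    .getOrCreate()

# only if legacy code still needs a SQLContext:
sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)

spark_df = spark.createDataFrame(df_in)  # df_in: the pandas DataFrame from the question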