Python spark 2.1.0 session config settings (pyspark)

Disclaimer: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/41886346/

spark 2.1.0 session config settings (pyspark)

python, apache-spark, pyspark, spark-dataframe

Asked by Harish

I am trying to overwrite the Spark session/Spark context default configs, but it is picking up the entire node/cluster resources.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("ip") \
    .enableHiveSupport() \
    .getOrCreate()

spark.conf.set("spark.executor.memory", '8g')
spark.conf.set('spark.executor.cores', '3')
spark.conf.set('spark.cores.max', '3')
spark.conf.set("spark.driver.memory", '8g')
sc = spark.sparkContext

It works fine when I put the configuration in spark-submit:

spark-submit --master ip --executor-cores 3 --driver-memory 10G code.py

Accepted answer by Grr

You aren't actually overwriting anything with this code. Just so you can see for yourself, try the following.

As soon as you start the pyspark shell, type:

sc.getConf().getAll()

This will show you all of the current config settings. Then try your code and do it again. Nothing changes.
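
To make this concrete, here is a minimal sketch of that check, assuming you run it in the pyspark shell (where sc and spark already exist) and using spark.executor.memory purely as an example key:

before = dict(sc.getConf().getAll())
spark.conf.set("spark.executor.memory", "8g")   # attempted override at runtime
after = dict(sc.getConf().getAll())
# Both values are identical: the running SparkContext was not modified.
print(before.get("spark.executor.memory"), after.get("spark.executor.memory"))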

What you should do instead is create a new configuration and use that to create a SparkContext. Do it like this:

import pyspark  # the pyspark module itself is not pre-imported in the shell

conf = pyspark.SparkConf().setAll([('spark.executor.memory', '8g'), ('spark.executor.cores', '3'), ('spark.cores.max', '3'), ('spark.driver.memory', '8g')])
sc.stop()
sc = pyspark.SparkContext(conf=conf)

Then you can check for yourself, just as above, with:

sc.getConf().getAll()

This should reflect the configuration you wanted.
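
If, as in the question, you also want a SparkSession on top of the new context, a minimal follow-up sketch (assuming the conf and sc from the snippet above) is to rebuild the session so it picks up the fresh SparkContext:

from pyspark.sql import SparkSession

spark = SparkSession.builder.config(conf=conf).getOrCreate()       # reuses the new sc
print(spark.sparkContext.getConf().get("spark.executor.memory"))   # '8g'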

Answer by bob

Update configuration in Spark 2.3.1

To change the default Spark configuration you can follow these steps (a consolidated sketch follows the last step):

Import the required classes

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

Get the default configurations

spark.sparkContext._conf.getAll()

Update the default configurations

conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.app.name', 'Spark Updated Conf'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory','4g')])

Stop the current Spark Session

spark.sparkContext.stop()

Create a Spark Session

spark = SparkSession.builder.config(conf=conf).getOrCreate()
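
Putting the steps together, here is a consolidated sketch of the above (the same calls in one block, assuming you start from an existing session such as the pyspark shell's spark):

from pyspark.sql import SparkSession

# Start from whatever session already exists (e.g. the shell's 'spark').
spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext._conf.getAll())            # current defaults

# Build an updated conf, stop the current context, and start a fresh session.
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.app.name', 'Spark Updated Conf'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory', '4g')])
spark.sparkContext.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()

print(spark.sparkContext._conf.get('spark.executor.memory'))   # '4g'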

Answer by Vivek

Setting 'spark.driver.host' to 'localhost' in the config works for me

spark = SparkSession \
    .builder \
    .appName("MyApp") \
    .config("spark.driver.host", "localhost") \
    .getOrCreate()
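
As a quick sanity check, you can read the setting back from the session's runtime config:

print(spark.conf.get("spark.driver.host"))   # 'localhost'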

Answer by user3282611

You could also set the configuration when you start pyspark, just as with spark-submit:

pyspark --conf property=value

Here is one example:

-bash-4.2$ pyspark
Python 3.6.8 (default, Apr 25 2019, 21:02:35) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.2.0
      /_/

Using Python version 3.6.8 (default, Apr 25 2019 21:02:35)
SparkSession available as 'spark'.
>>> spark.conf.get('spark.eventLog.enabled')
'true'
>>> exit()


-bash-4.2$ pyspark --conf spark.eventLog.enabled=false
Python 3.6.8 (default, Apr 25 2019, 21:02:35) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.2.0
      /_/

Using Python version 3.6.8 (default, Apr 25 2019 21:02:35)
SparkSession available as 'spark'.
>>> spark.conf.get('spark.eventLog.enabled')
'false'