Python spark 2.1.0 session config settings (pyspark)

Disclaimer: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/41886346/

spark 2.1.0 session config settings (pyspark)

python, apache-spark, pyspark, spark-dataframe

Asked by Harish

I am trying to overwrite the Spark session/Spark context default configs, but it is picking up the entire node/cluster resources.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("ip") \
    .enableHiveSupport() \
    .getOrCreate()

spark.conf.set("spark.executor.memory", '8g')
spark.conf.set('spark.executor.cores', '3')
spark.conf.set('spark.cores.max', '3')
spark.conf.set("spark.driver.memory", '8g')
sc = spark.sparkContext

It works fine when I put the configuration in spark-submit:

spark-submit --master ip --executor-cores 3 --driver-memory 10G code.py

Accepted answer by Grr

You aren't actually overwriting anything with this code. Just so you can see for yourself, try the following.

As soon as you start the pyspark shell, type:

sc.getConf().getAll()

This will show you all of the current config settings. Then try your code and do it again. Nothing changes.
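
To make this concrete, here is a minimal sketch of that check, assuming you run it in the pyspark shell (where sc and spark already exist) and using spark.executor.memory purely as an example key:

before = dict(sc.getConf().getAll())
spark.conf.set("spark.executor.memory", "8g")   # attempted override at runtime
after = dict(sc.getConf().getAll())
# Both values are identical: the running SparkContext was not modified.
print(before.get("spark.executor.memory"), after.get("spark.executor.memory"))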

What you should do instead is create a new configuration and use that to create a SparkContext. Do it like this:

import pyspark  # the pyspark module itself is not pre-imported in the shell

conf = pyspark.SparkConf().setAll([('spark.executor.memory', '8g'), ('spark.executor.cores', '3'), ('spark.cores.max', '3'), ('spark.driver.memory', '8g')])
sc.stop()
sc = pyspark.SparkContext(conf=conf)

Then you can check for yourself, just as above, with:

sc.getConf().getAll()

This should reflect the configuration you wanted.
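
If, as in the question, you also want a SparkSession on top of the new context, a minimal follow-up sketch (assuming the conf and sc from the snippet above) is to rebuild the session so it picks up the fresh SparkContext:

from pyspark.sql import SparkSession

spark = SparkSession.builder.config(conf=conf).getOrCreate()       # reuses the new sc
print(spark.sparkContext.getConf().get("spark.executor.memory"))   # '8g'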

Answer by bob

Update configuration in Spark 2.3.1

To change the default Spark configuration you can follow these steps (a consolidated sketch follows the last step):

Import the required classes

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

Get the default configurations

spark.sparkContext._conf.getAll()

Update the default configurations

conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.app.name', 'Spark Updated Conf'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory','4g')])

Stop the current Spark Session

spark.sparkContext.stop()

Create a Spark Session

spark = SparkSession.builder.config(conf=conf).getOrCreate()
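
Putting the steps together, here is a consolidated sketch of the above (the same calls in one block, assuming you start from an existing session such as the pyspark shell's spark):

from pyspark.sql import SparkSession

# Start from whatever session already exists (e.g. the shell's 'spark').
spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext._conf.getAll())            # current defaults

# Build an updated conf, stop the current context, and start a fresh session.
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.app.name', 'Spark Updated Conf'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory', '4g')])
spark.sparkContext.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()

print(spark.sparkContext._conf.get('spark.executor.memory'))   # '4g'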

Answer by Vivek

Setting 'spark.driver.host' to 'localhost' in the config works for me

spark = SparkSession \
    .builder \
    .appName("MyApp") \
    .config("spark.driver.host", "localhost") \
    .getOrCreate()
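
As a quick sanity check, you can read the setting back from the session's runtime config:

print(spark.conf.get("spark.driver.host"))   # 'localhost'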

Answer by user3282611

You could also set the configuration when you start pyspark, just as with spark-submit:

pyspark --conf property=value

Here is one example:

-bash-4.2$ pyspark
Python 3.6.8 (default, Apr 25 2019, 21:02:35) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.2.0
      /_/

Using Python version 3.6.8 (default, Apr 25 2019 21:02:35)
SparkSession available as 'spark'.
>>> spark.conf.get('spark.eventLog.enabled')
'true'
>>> exit()


-bash-4.2$ pyspark --conf spark.eventLog.enabled=false
Python 3.6.8 (default, Apr 25 2019, 21:02:35) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.2.0
      /_/

Using Python version 3.6.8 (default, Apr 25 2019 21:02:35)
SparkSession available as 'spark'.
>>> spark.conf.get('spark.eventLog.enabled')
'false'