Python: Spark 1.4 increase maxResultSize memory

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/31058504/


Spark 1.4 increase maxResultSize memory

Tags: python, memory, apache-spark, pyspark, jupyter

Asked by ahajib

I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16GB of memory, so no problem there, since the size of my file is only 300MB. However, when I try to convert a Spark RDD to a pandas DataFrame using the toPandas() function, I receive the following error:

serialized results of 9 tasks (1096.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

I tried to fix this by changing the spark-config file and I am still getting the same error. I've heard that this is a problem with Spark 1.4 and am wondering if you know how to solve it. Any help is much appreciated.

Accepted answer by zero323

You can set the spark.driver.maxResultSize parameter in the SparkConf object:

from pyspark import SparkConf, SparkContext

# In Jupyter you have to stop the current context first
sc.stop()

# Create new config
conf = (SparkConf()
    .set("spark.driver.maxResultSize", "2g"))

# Create new context
sc = SparkContext(conf=conf)

You should probably create a new SQLContext as well:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
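
As an aside that is not part of the original answer: on Spark 2.x and later the same property can be set while building a SparkSession. A minimal sketch, assuming no session is already running (an existing one would have to be stopped first, since driver settings cannot be changed on a live context):

from pyspark.sql import SparkSession

# Build a session with a larger driver result-size limit ("2g" is just an example value).
spark = (SparkSession.builder
         .config("spark.driver.maxResultSize", "2g")
         .getOrCreate())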

Answered by Zia Kayani

It looks like you are collecting the RDD, so it will definitely pull all the data to the driver node, which is why you are facing this issue. You should avoid collecting the data if it is not required, or, if it is necessary, specify spark.driver.maxResultSize. There are two ways of defining this variable (a short sketch follows the list):

1 - Create the Spark config by setting this variable:
conf.set("spark.driver.maxResultSize", "3g")
2 - Or set this variable in the spark-defaults.conf file present in the conf folder of Spark, e.g. spark.driver.maxResultSize 3g, and restart Spark.
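
A minimal PySpark sketch of option 1 (the "3g" value is only an example); option 2 is the equivalent one-line entry in spark-defaults.conf described above:

from pyspark import SparkConf, SparkContext

# Option 1: set the limit programmatically before the context is created.
conf = SparkConf().set("spark.driver.maxResultSize", "3g")
sc = SparkContext(conf=conf)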

Answered by Dolan Antenucci

From the command line, such as with pyspark, --conf spark.driver.maxResultSize=3g can also be used to increase the max result size.

Answered by Iraj Hedayati

Tuning spark.driver.maxResultSize is a good practice considering the running environment. However, it is not the solution to your problem, as the amount of data may change over time. As @Zia-Kayani mentioned, it is better to collect data wisely. So if you have a DataFrame df, then you can call df.rdd and do all the magic stuff on the cluster, not in the driver. However, if you need to collect the data, I would suggest:

  • Do not turn on spark.sql.parquet.binaryAsString. String objects take more space.
  • Use spark.rdd.compress to compress RDDs when you collect them.
  • Try to collect it using pagination (code in Scala, from another answer, Scala: How to get a range of rows in a dataframe; a rough PySpark version is sketched after the Scala code):

    var remaining = df
    var count = remaining.count()
    val limit = 50
    while (count > 0) {
      val df1 = remaining.limit(limit)
      df1.show()                      // will print 50, next 50, etc. rows
      remaining = remaining.except(df1)
      count = count - limit
    }

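A rough PySpark equivalent of the pagination idea above, as a sketch only: df is assumed to be an existing Spark DataFrame, and repeated subtract() calls get expensive on large inputs, so treat this as an illustration rather than a tuned solution.

# Collect/print a large DataFrame in pages instead of all at once.
page_size = 50
remaining = df                       # df: an existing Spark DataFrame (assumed)
count = remaining.count()
while count > 0:
    page = remaining.limit(page_size)
    page.show()                      # or convert each chunk, e.g. page.toPandas()
    remaining = remaining.subtract(page)
    count -= page_size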

Answered by Tagar

There is also a Spark bug, https://issues.apache.org/jira/browse/SPARK-12837, that gives the same error

 serialized results of X tasks (Y MB) is bigger than spark.driver.maxResultSize

even though you may not be pulling data to the driver explicitly.

SPARK-12837 addresses a Spark bug where, prior to Spark 2, accumulators/broadcast variables were pulled to the driver unnecessarily, causing this problem.

Answered by Mike

When starting the job or the terminal, you can use

--conf spark.driver.maxResultSize="0"

to remove the limit entirely (0 means unlimited, though this can risk out-of-memory errors on the driver)

Answered by korahtm

You can set spark.driver.maxResultSize to 2GB when you start the pyspark shell:

pyspark  --conf "spark.driver.maxResultSize=2g"

This allows 2 GB for spark.driver.maxResultSize.

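Once the shell is up, a quick way to confirm the setting took effect (assuming the default sc context that the pyspark shell creates for you):

# Print the effective value from inside the pyspark shell.
print(sc.getConf().get("spark.driver.maxResultSize"))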