Python Spark 1.4: increase maxResultSize memory
Original URL: http://stackoverflow.com/questions/31058504/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Spark 1.4 increase maxResultSize memory
Asked by ahajib
I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16 GB of memory, and my file is only 300 MB, so memory itself should not be a problem. However, when I try to convert a Spark RDD to a pandas DataFrame using the toPandas() function, I receive the following error:
serialized results of 9 tasks (1096.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
I tried to fix this by changing the spark-config file but am still getting the same error. I've heard that this is a problem with Spark 1.4 and am wondering if you know how to solve it. Any help is much appreciated.
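For context, a minimal sketch of the kind of code that triggers this error; the input path and format below are assumptions, since the question does not show the actual code:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="toPandasExample")
sqlContext = SQLContext(sc)

# Hypothetical input: any DataFrame built from a ~300 MB file will do
df = sqlContext.read.json("data/events.json")

# toPandas() collects every partition to the driver; once the serialized
# results exceed spark.driver.maxResultSize (1024.0 MB by default),
# the job fails with the error above
pdf = df.toPandas()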
Accepted answer by zero323
You can set the spark.driver.maxResultSize parameter in the SparkConf object:
from pyspark import SparkConf, SparkContext
# In Jupyter you have to stop the current context first
sc.stop()
# Create new config
conf = (SparkConf()
.set("spark.driver.maxResultSize", "2g"))
# Create new context
sc = SparkContext(conf=conf)
You should probably create a new SQLContext as well:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
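With the new context in place, the DataFrame has to be re-created against it before retrying the conversion; the path below is only a placeholder:

df = sqlContext.read.json("data/events.json")  # placeholder path; reload your data here
pdf = df.toPandas()                            # now allowed up to the 2g limit set above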
Answered by Zia Kayani
It looks like you are collecting the RDD, which pulls all of the data to the driver node; that is why you are facing this issue. Avoid collecting data from an RDD if it is not required (a small sketch of this follows the two options below), or, if it is necessary, specify spark.driver.maxResultSize. There are two ways of defining this variable:
1 - Create the Spark config by setting this variable, e.g. conf.set("spark.driver.maxResultSize", "3g")
2 - Or set this variable in the spark-defaults.conf file in Spark's conf folder, e.g. spark.driver.maxResultSize 3g, and then restart Spark.
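As a sketch of the "avoid collecting if it is not required" advice, assuming df is the DataFrame from the question: pull only a small sample to the driver, or write the full result out from the executors (the output path is a placeholder):

# Inspect a handful of rows instead of collecting everything to the driver
preview = df.limit(20).collect()   # or df.take(20)

# Persist the full result from the executors instead of collecting it
df.write.parquet("output/results.parquet")   # placeholder output path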
Answered by Dolan Antenucci
From the command line, such as with pyspark, --conf spark.driver.maxResultSize=3g can also be used to increase the max result size.
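For example, when launching the PySpark shell or submitting a job (the script name my_job.py is hypothetical):

pyspark --conf spark.driver.maxResultSize=3g
spark-submit --conf spark.driver.maxResultSize=3g my_job.py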
Answered by Iraj Hedayati
Tuning spark.driver.maxResultSize is a good practice given the running environment. However, it is not the solution to your problem, as the amount of data may change from time to time. As @Zia-Kayani mentioned, it is better to collect data wisely. So if you have a DataFrame df, you can call df.rdd and do all the magic stuff on the cluster, not in the driver. However, if you need to collect the data, I would suggest:
- Do not turn on spark.sql.parquet.binaryAsString. String objects take more space.
- Use spark.rdd.compress to compress RDDs when you collect them.
- Try to collect the data using pagination. (The code below is in Scala, from another answer: Scala: How to get a range of rows in a dataframe.)
var count = df.count()          // total number of rows still to show
val limit = 50                  // page size
var rest = df
while (count > 0) {
  val page = rest.limit(limit)
  page.show(limit)              // print the whole page: first 50 rows, then the next 50, and so on
  rest = rest.except(page)      // drop the rows already shown
  count -= limit
}
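A rough PySpark equivalent of the same pagination idea, assuming df is the DataFrame from the question (note that Scala's except() is called subtract() in PySpark):

count = df.count()   # total number of rows still to show
limit = 50           # page size
current = df
while count > 0:
    page = current.limit(limit)
    page.show(limit)                  # print the whole page of up to 50 rows
    current = current.subtract(page)  # drop the rows already shown
    count -= limit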
Answered by Tagar
There is also a Spark bug, https://issues.apache.org/jira/browse/SPARK-12837, that gives the same error
serialized results of X tasks (Y MB) is bigger than spark.driver.maxResultSize
even though you may not be pulling data to the driver explicitly.
SPARK-12837 addresses a Spark bug in which, prior to Spark 2, accumulators/broadcast variables were pulled to the driver unnecessarily, causing this problem.
Answered by Mike
When starting the job or the terminal, you can use
--conf spark.driver.maxResultSize="0"
to remove the limit entirely (a value of 0 means unlimited).
Answered by korahtm
You can set spark.driver.maxResultSize to 2GB when you start the pyspark shell:
pyspark --conf "spark.driver.maxResultSize=2g"
This allows 2 GB for spark.driver.maxResultSize.