Python: Spark 1.4 increase maxResultSize memory

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/31058504/


Spark 1.4 increase maxResultSize memory

Tags: python, memory, apache-spark, pyspark, jupyter

Asked by ahajib

I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16GB of memory, so no problem there, since the size of my file is only 300MB. However, when I try to convert a Spark RDD to a pandas DataFrame using the toPandas() function, I receive the following error:

serialized results of 9 tasks (1096.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

I tried to fix this by changing the spark-config file and I am still getting the same error. I've heard that this is a problem with Spark 1.4 and am wondering if you know how to solve it. Any help is much appreciated.

Accepted answer by zero323

You can set the spark.driver.maxResultSize parameter in the SparkConf object:

from pyspark import SparkConf, SparkContext

# In Jupyter you have to stop the current context first
sc.stop()

# Create new config
conf = (SparkConf()
    .set("spark.driver.maxResultSize", "2g"))

# Create new context
sc = SparkContext(conf=conf)

You should probably create a new SQLContext as well:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
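
As an aside that is not part of the original answer: on Spark 2.x and later the same property can be set while building a SparkSession. A minimal sketch, assuming no session is already running (an existing one would have to be stopped first, since driver settings cannot be changed on a live context):

from pyspark.sql import SparkSession

# Build a session with a larger driver result-size limit ("2g" is just an example value).
spark = (SparkSession.builder
         .config("spark.driver.maxResultSize", "2g")
         .getOrCreate())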

Answered by Zia Kayani

It looks like you are collecting the RDD, so it will definitely pull all the data to the driver node, which is why you are facing this issue. You should avoid collecting the data if it is not required, or, if it is necessary, specify spark.driver.maxResultSize. There are two ways of defining this variable (a short sketch follows the list):

1 - Create the Spark config by setting this variable:
conf.set("spark.driver.maxResultSize", "3g")
2 - Or set this variable in the spark-defaults.conf file present in the conf folder of Spark, e.g. spark.driver.maxResultSize 3g, and restart Spark.
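
A minimal PySpark sketch of option 1 (the "3g" value is only an example); option 2 is the equivalent one-line entry in spark-defaults.conf described above:

from pyspark import SparkConf, SparkContext

# Option 1: set the limit programmatically before the context is created.
conf = SparkConf().set("spark.driver.maxResultSize", "3g")
sc = SparkContext(conf=conf)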

Answered by Dolan Antenucci

From the command line, such as with pyspark, --conf spark.driver.maxResultSize=3g can also be used to increase the max result size.

Answered by Iraj Hedayati

Tuning spark.driver.maxResultSize is a good practice considering the running environment. However, it is not the solution to your problem, as the amount of data may change over time. As @Zia-Kayani mentioned, it is better to collect data wisely. So if you have a DataFrame df, then you can call df.rdd and do all the magic stuff on the cluster, not in the driver. However, if you need to collect the data, I would suggest:

  • Do not turn on spark.sql.parquet.binaryAsString. String objects take more space.
  • Use spark.rdd.compress to compress RDDs when you collect them.
  • Try to collect it using pagination (code in Scala, from another answer, Scala: How to get a range of rows in a dataframe; a rough PySpark version is sketched after the Scala code):

    var remaining = df
    var count = remaining.count()
    val limit = 50
    while (count > 0) {
      val df1 = remaining.limit(limit)
      df1.show()                      // will print 50, next 50, etc. rows
      remaining = remaining.except(df1)
      count = count - limit
    }

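A rough PySpark equivalent of the pagination idea above, as a sketch only: df is assumed to be an existing Spark DataFrame, and repeated subtract() calls get expensive on large inputs, so treat this as an illustration rather than a tuned solution.

# Collect/print a large DataFrame in pages instead of all at once.
page_size = 50
remaining = df                       # df: an existing Spark DataFrame (assumed)
count = remaining.count()
while count > 0:
    page = remaining.limit(page_size)
    page.show()                      # or convert each chunk, e.g. page.toPandas()
    remaining = remaining.subtract(page)
    count -= page_size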

Answered by Tagar

There is also a Spark bug, https://issues.apache.org/jira/browse/SPARK-12837, that gives the same error

 serialized results of X tasks (Y MB) is bigger than spark.driver.maxResultSize

even though you may not be pulling data to the driver explicitly.

SPARK-12837 addresses a Spark bug where, prior to Spark 2, accumulators/broadcast variables were pulled to the driver unnecessarily, causing this problem.

Answered by Mike

When starting the job or the terminal, you can use

--conf spark.driver.maxResultSize="0"

to remove the limit entirely (0 means unlimited, though this can risk out-of-memory errors on the driver)

Answered by korahtm

You can set spark.driver.maxResultSize to 2GB when you start the pyspark shell:

pyspark  --conf "spark.driver.maxResultSize=2g"

This allows 2 GB for spark.driver.maxResultSize.

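Once the shell is up, a quick way to confirm the setting took effect (assuming the default sc context that the pyspark shell creates for you):

# Print the effective value from inside the pyspark shell.
print(sc.getConf().get("spark.driver.maxResultSize"))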