PySpark: java.lang.OutOfMemoryError: Java heap space

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/32336915/



Tags: java, apache-spark, out-of-memory, heap-memory, pyspark

Asked by pg2455

I have been using PySpark with IPython lately on my server with 24 CPUs and 32GB RAM. It runs on a single machine only. In my process, I want to collect a huge amount of data, as shown in the code below:


train_dataRDD = (train.map(lambda x: getTagsAndText(x))
                 .filter(lambda x: x[-1] != [])
                 .flatMap(lambda (x, text, tags): [(tag, (x, text)) for tag in tags])
                 .groupByKey()
                 .mapValues(list))

When I do


training_data =  train_dataRDD.collectAsMap()

It gives me an OutOfMemoryError: Java heap space. Also, I cannot perform any operations on Spark after this error, as it loses its connection with Java. It gives Py4JNetworkError: Cannot connect to the java server.


It looks like the heap space is too small. How can I give it a bigger limit?


EDIT:


Things that I tried before running: sc._conf.set('spark.executor.memory','32g').set('spark.driver.memory','32g').set('spark.driver.maxResultsSize','0')

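Note that calling sc._conf.set(...) on a SparkContext that is already running cannot enlarge the driver JVM, since its heap size is fixed when the JVM starts (also, the real property name is spark.driver.maxResultSize). A minimal sketch of setting these values up front instead, assuming no context has been created yet:

from pyspark import SparkConf, SparkContext

# The configuration has to exist before the driver JVM is launched;
# setting these on an already-running context has no effect on its heap.
conf = (SparkConf()
        .setMaster('local[*]')
        .set('spark.driver.memory', '32g')
        .set('spark.driver.maxResultSize', '0'))  # 0 means no limit on collected result size

sc = SparkContext(conf=conf)

Depending on how the driver JVM is launched (for example, inside the pyspark shell, where it already exists before your code runs), even this may not take effect, which is why the accepted answer below sets the value in spark-defaults.conf instead.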

I changed the Spark options as per the documentation here (if you do Ctrl-F and search for spark.executor.extraJavaOptions): http://spark.apache.org/docs/1.2.1/configuration.html


It says that I can avoid OOMs by setting the spark.executor.memory option. I did the same thing, but it does not seem to be working.


Accepted answer by pg2455

After trying out loads of configuration parameters, I found that only one needs to be changed to enable more heap space: spark.driver.memory.


sudo vim $SPARK_HOME/conf/spark-defaults.conf
# Uncomment spark.driver.memory and change it according to your use case. I changed it to the value below.
spark.driver.memory 15g
# press : and then wq! to exit the vim editor

Close your existing Spark application and re-run it. You will not encounter this error again. :)

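After restarting, one way to double-check that the new value was actually picked up (a quick check, not part of the original answer) is to read it back from the running context:

# Should print the value from spark-defaults.conf, e.g. '15g'
print(sc.getConf().get('spark.driver.memory'))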

Answered by Francesco Boi

I had the same problem with pyspark (installed with brew). In my case it was installed at the path /usr/local/Cellar/apache-spark.


The only configuration file I had was in apache-spark/2.4.0/libexec/python//test_coverage/conf/spark-defaults.conf.


As suggested here, I created the file spark-defaults.conf in the path /usr/local/Cellar/apache-spark/2.4.0/libexec/conf/spark-defaults.conf and appended to it the line spark.driver.memory 12g.

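If you are not sure which installation (and therefore which conf/ directory) your Python session is actually using, a quick sanity check, assuming SPARK_HOME is set in your environment:

import os
import pyspark

print(pyspark.__file__)               # location of the pyspark package in use
print(os.environ.get('SPARK_HOME'))   # installation whose conf/spark-defaults.conf is read, if set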

Answered by louis_guitton

If you're looking for a way to set this from within your script or a Jupyter notebook, you can do:


from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "15g") \
    .appName('my-cool-app') \
    .getOrCreate()
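
Keep in mind that this only works if no SparkSession (and no underlying JVM) exists in the process yet: getOrCreate() will otherwise return the already-running session, whose driver heap size cannot be changed after the fact, so set spark.driver.memory before anything else touches Spark.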