Disclaimer: this content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/27462061/

Why does Spark fail with java.lang.OutOfMemoryError: GC overhead limit exceeded?

scala apache-spark

Asked by Augusto

I'm trying to implement in Spark a Hadoop Map/Reduce job that worked fine before. The Spark app definition is the following:

val data = spark.textFile(file, 2).cache()
val result = data
  .map(//some pre-processing)
  .map(docWeightPar => (docWeightPar(0), docWeightPar(1)))
  .flatMap(line => MyFunctions.combine(line))
  .reduceByKey( _ + _)

Where MyFunctions.combine is:

def combine(tuples: Array[(String, String)]): IndexedSeq[(String,Double)] =
  for (i <- 0 to tuples.length - 2;
       j <- 1 to tuples.length - 1
  ) yield (toKey(tuples(i)._1,tuples(j)._1),tuples(i)._2.toDouble * tuples(j)._2.toDouble)

The combine function produces lots of map keys if the input list is big, and this is where the exception is thrown.
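
For a rough sense of scale, here is a back-of-the-envelope sketch (the 10,000 line length is an assumed example, not a figure from the question): the two nested generators emit roughly n² pairs per input line, each a fresh (String, Double) object that Spark keeps in memory.

// Rough illustration of the quadratic blow-up; 10000 is an assumed line length.
val n = 10000L                       // tuples parsed from a single input line
val pairsPerLine = (n - 1) * (n - 1) // what the two generators above yield: ~1e8 pairs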

In the Hadoop Map Reduce setting this wasn't a problem, because the point where the combine function yields is the point where Hadoop writes the map pairs to disk. Spark seems to keep everything in memory until it explodes with a java.lang.OutOfMemoryError: GC overhead limit exceeded.

I am probably doing something really basic wrong, but I couldn't find any pointers on how to move forward from this, and I would like to know how I can avoid it. Since I am a total noob at Scala and Spark, I am not sure if the problem comes from one or the other, or both. I am currently trying to run this program on my own laptop, and it works for inputs where the tuples array is not very long. Thanks in advance.

Accepted answer by ohruunuruus

Adjusting the memory is probably a good way to go, as has already been suggested, because this is an expensive operation that scales in an ugly way. But maybe some code changes will help.

You could take a different approach in your combine function that avoids if statements by using the combinations function. I'd also convert the second element of the tuples to doubles before the combination operation:

tuples
  // Convert to doubles only once
  .map { x =>
    (x._1, x._2.toDouble)
  }
  // Take all pairwise combinations. Though this function
  // will not give self-pairs, which it looks like you might need
  .combinations(2)
  // Your operation
  .map { x =>
    (toKey(x(0)._1, x(1)._1), x(0)._2 * x(1)._2)
  }

This will give an iterator, which you can use downstream or, if you want, convert to list (or something) with toList.
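
For illustration, here is a minimal self-contained sketch of the combinations-based combine (assumption: toKey just concatenates the two keys with a separator, which the question does not specify, so adapt it to the real implementation):

object CombineSketch {
  // Hypothetical stand-in for the question's toKey; replace with the real one.
  def toKey(a: String, b: String): String = s"$a|$b"

  def combine(tuples: Array[(String, String)]): Iterator[(String, Double)] =
    tuples
      .map { case (k, v) => (k, v.toDouble) }  // convert to Double only once
      .combinations(2)                         // pairwise combinations, no self-pairs
      .map { pair => (toKey(pair(0)._1, pair(1)._1), pair(0)._2 * pair(1)._2) }

  def main(args: Array[String]): Unit = {
    val tuples = Array(("a", "1.0"), ("b", "2.0"), ("c", "3.0"))
    combine(tuples).foreach(println)           // (a|b,2.0), (a|c,3.0), (b|c,6.0)
  }
}

Because combine now returns an Iterator, it can still be passed to flatMap in the original pipeline.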

Answered by javadba

Add the following JVM arg when you launch spark-shell or spark-submit:

-Dspark.executor.memory=6g

You may also consider explicitly setting the number of workers when you create an instance of SparkContext:

Distributed Cluster

Set the slave names in conf/slaves:

val sc = new SparkContext("master", "MyApp")
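
As an alternative to the -D system property above, here is a minimal sketch using SparkConf (the master URL and the 6g value are placeholders, not taken from the answer):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("spark://master:7077")      // placeholder master URL
  .set("spark.executor.memory", "6g")    // same setting as the -D flag above
val sc = new SparkContext(conf)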

Answered by Carlos AG

In the documentation (http://spark.apache.org/docs/latest/running-on-yarn.html) you can read how to configure the executors and the memory limit. For example:

--master yarn-cluster --num-executors 10 --executor-cores 3 --executor-memory 4g --driver-memory 5g  --conf spark.yarn.executor.memoryOverhead=409

The memoryOverhead should be 10% of the executor memory.
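
As a quick sanity check of that rule of thumb against the flags above (assumption: the overhead value is interpreted in MB):

// 10% rule of thumb applied to the example above.
val executorMemoryMb = 4 * 1024          // --executor-memory 4g
val overheadMb = executorMemoryMb / 10   // = 409 MB, matching spark.yarn.executor.memoryOverhead=409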

Edit: Fixed 4096 to 409 (Comment below refers to this)

Answered by Erkan Şirin

I had the same issue during a long regression fit. I cached the train and test sets, and it solved my problem.

train_df, test_df = df3.randomSplit([0.8, 0.2], seed=142)
pipeline_model = pipeline_object.fit(train_df)

The pipeline_model line was giving java.lang.OutOfMemoryError: GC overhead limit exceeded. But when I used

train_df, test_df = df3.randomSplit([0.8, 0.2], seed=142)
train_df.cache()
test_df.cache()
pipeline_model = pipeline_object.fit(train_df)

It worked.

Answered by asmaier

In my case this JVM garbage collection error happened reproducibly when I increased spark.memory.fraction to values greater than 0.6. So it is better to leave the value at its default to avoid JVM garbage collection errors. This is also recommended by https://forums.databricks.com/questions/2202/javalangoutofmemoryerror-gc-overhead-limit-exceede.html.

For more information on why 0.6 is the best value for spark.memory.fraction, see https://issues.apache.org/jira/browse/SPARK-15796.
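
For reference, this is the setting the answer is talking about; a minimal sketch of configuring it explicitly (illustrative only, since the advice above is to keep the 0.6 default):

import org.apache.spark.SparkConf

// Illustrative only: the answer recommends leaving this at its 0.6 default.
val conf = new SparkConf().set("spark.memory.fraction", "0.6")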
