Scala: Spark losing println() on stdout

Declaration: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me), citing the original source: http://stackoverflow.com/questions/33225994/


Spark losing println() on stdout

scala, apache-spark, println, accumulator

Asked by Edamame

I have the following code:


val blueCount = sc.accumulator[Long](0)
val output = input.map { data =>
  for (value <- data.getValues()) {
    if (value.getEnum() == DataEnum.BLUE) {
      blueCount += 1
      println("Enum = BLUE : " + value.toString())
    }
  }
  data
}.persist(StorageLevel.MEMORY_ONLY_SER)

output.saveAsTextFile("myOutput")


Then blueCount is not zero, but I got no println() output! Am I missing anything here? Thanks!


Answered by Alberto Bonsanto

This is a conceptual question...


Imagine you have a big cluster composed of many workers, let's say n workers, and those workers store a partition of an RDD or DataFrame. Imagine you start a map task across that data, and inside that map you have a print statement. First of all:


  • Where will that data be printed out?
  • Which node has priority, and which partition?
  • If all nodes are running in parallel, which one will print first?
  • How would this print queue be created?

Those are too many questions, thus the designers/maintainers of apache-spark logically decided to drop any support for print statements inside any map-reduce operation (this includes accumulators and even broadcast variables).


This also makes sense because Spark is a framework designed for very large datasets. While printing can be useful for testing and debugging, you wouldn't want to print every line of a DataFrame or RDD, because they are built to hold millions or billions of rows! So why deal with these complicated questions when you wouldn't even want to print in the first place?

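When you do need to look at a few rows while debugging, the usual pattern is to pull a small, bounded sample back to the driver and print it there, as the proof code below does for an RDD. A minimal sketch of the DataFrame equivalent (assuming a recent spark-shell session, where the spark SparkSession is predefined):

// show() collects a bounded number of rows and prints them on the driver
val df = spark.range(1000000).toDF("id")
df.show(5)

// take() gives you the rows themselves if you want to format them yourself
df.take(3).foreach(row => println(row.getLong(0)))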

To prove that printlns inside a map never reach the driver, you can run this Scala code, for example:


// Let's create a simple RDD
val rdd = sc.parallelize(1 to 10000)

def printStuff(x: Int): Int = {
  println(x)
  x + 1
}

// It doesn't print anything on the driver! (and map is lazy, so nothing
// even runs until an action is triggered)
rdd.map(printStuff)

// But you can print the RDD by doing the following:
rdd.take(10).foreach(println)
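
Applied to the original question, the same idea would look roughly like this. This is only a sketch: input, getValues, getEnum and DataEnum are the question's own types, and getValues() is assumed here to return a Scala collection:

// Instead of println inside map, pull a small sample of the BLUE values
// back to the driver and print them there
val blueValues = input.flatMap { data =>
  data.getValues().filter(_.getEnum() == DataEnum.BLUE)
}
blueValues.take(10).foreach(value => println("Enum = BLUE : " + value))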

Answered by Edamame

I was able to work it around by making a utility function:


object PrintUtiltity {
  def print(data: String) = {
    println(data)
  }
}
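
Called from the question's map, it would look like this (a sketch only; note that the utility still executes on the executors, so where its output ends up depends on how the job is run, e.g. local mode vs. a real cluster):

val output = input.map { data =>
  for (value <- data.getValues()) {
    if (value.getEnum() == DataEnum.BLUE) {
      blueCount += 1
      PrintUtiltity.print("Enum = BLUE : " + value.toString())
    }
  }
  data
}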