Java: How can I force Spark to execute code?

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31383904/

How can I force Spark to execute code?

java, scala, hadoop, apache-spark

Asked by MetallicPriest

How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?

I have tried to put cache() with the map call, but that still doesn't do the trick. My map method actually uploads results to HDFS, so it's not useless, but Spark thinks it is.

Accepted answer by eliasah

Short answer:

To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.

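For illustration, here is a minimal self-contained sketch using the Java API (the class name and variable names are made up for this example). The map call alone schedules nothing; the count action forces the whole lineage, including the map, to run:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ForceExecution {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("force-execution").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

            // map() only records the transformation; nothing runs yet
            JavaRDD<Integer> doubled = numbers.map(x -> x * 2);

            // count() is an action, so Spark now executes the lineage,
            // including the map above
            long n = doubled.count();
            System.out.println("Processed " + n + " elements");

            sc.stop();
        }
    }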

TL;DR:

OK, let's review the RDD operations.

RDDs support two types of operations:

  • transformations - which create a new dataset from an existing one.
  • actions - which return a value to the driver program after running a computation on the dataset.

For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

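As a rough sketch in the Java API (assuming a JavaRDD<String> named lines already exists), the map below is only recorded, while the reduce is the action that triggers the job and sends a single number back to the driver:

    // transformation: nothing is computed yet
    JavaRDD<Integer> lengths = lines.map(String::length);

    // action: runs the job and returns one value to the driver
    int totalLength = lengths.reduce((a, b) -> a + b);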

All transformations in Spark are lazy, in that they do not compute their results right away.

Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.

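Note that cache() (like persist()) is itself lazy: it only marks the RDD to be kept once it has been computed, which is why it didn't help in the question above. A small sketch with the Java API (the HDFS path and variable names are placeholders):

    JavaRDD<String> parsed = sc.textFile("hdfs:///path/to/input")  // placeholder path
            .map(String::trim);

    parsed.cache();  // only marks the RDD for caching; still nothing runs

    long total = parsed.count();                               // first action: computes and caches the RDD
    long nonEmpty = parsed.filter(s -> !s.isEmpty()).count();  // second action: reuses the cached data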

Conclusion

To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.

Reference

Answered by zero323

Spark transformations only describe what has to be done. To trigger an execution, you need an action.

In your case there is a deeper problem. If the goal is to create some kind of side effect, like storing data on HDFS, the right method to use is foreach. It is an action, and it has clean semantics. Just as important, unlike map, it doesn't imply referential transparency.

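As a hedged sketch of that idea in the Java API (records is an assumed JavaRDD, and uploadToHdfs is a hypothetical helper standing in for the actual HDFS write), foreach and its per-partition variant are actions, so the side effect runs immediately and nothing pretends to produce a new dataset the way map does:

    // foreachPartition is a common variant of foreach that lets you set up
    // one connection or FileSystem handle per partition
    records.foreachPartition(partition -> {
        while (partition.hasNext()) {
            uploadToHdfs(partition.next());  // hypothetical helper doing the actual write
        }
    });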