Profiling a Scala Spark application

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/30900104/

Tags: scala, apache-spark

Asked by svKris

I would like to profile my Spark Scala applications to figure out which parts of the code I have to optimize. I enabled -Xprof in --driver-java-options, but this is not of much help to me, as it gives a lot of overly granular detail. I am just interested in how much time each function call in my application takes. As in other Stack Overflow questions, many people suggested YourKit, but it is not inexpensive. So I would like to use something that is not costly, in fact free of cost.

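For reference, driver-side JVM options such as -Xprof are passed on the spark-submit command line. A minimal sketch (the main class and jar names are hypothetical placeholders):

    # -Xprof applies to the driver JVM only; executor JVMs would need
    # spark.executor.extraJavaOptions instead.
    # com.example.MyApp and my-app.jar are placeholders.
    spark-submit \
      --driver-java-options "-Xprof" \
      --class com.example.MyApp \
      my-app.jar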

Are there any better ways to solve this?

Accepted answer by hveiga

I would recommend using the UI that Spark provides directly. It offers a lot of information and metrics regarding time, steps, network usage, etc.

You can check more about it here: https://spark.apache.org/docs/latest/monitoring.html

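If you also need this UI after the application has finished, event logging can be enabled so that the Spark history server can replay it. A minimal sketch (the log directory is an example path; spark.eventLog.* are standard Spark settings):

    # Persist UI and metrics data so the history server can replay it later.
    # com.example.MyApp and my-app.jar are placeholders.
    spark-submit \
      --conf spark.eventLog.enabled=true \
      --conf spark.eventLog.dir=hdfs:///spark-logs \
      --class com.example.MyApp \
      my-app.jar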

Also, in the newer Spark version (1.4.0) there is a nice visualizer for understanding the steps and stages of your Spark jobs.

Answer by aviemzur

As you said, profiling a distributed process is trickier than profiling a single JVM process, but there are ways to achieve this.

You can use sampling as a thread-profiling method. Add a Java agent to the executors that captures stack traces, then aggregate over these stack traces to see which methods your application spends the most time in.

For example, you can use Etsy's statsd-jvm-profiler Java agent, configure it to send the stack traces to InfluxDB, and then aggregate them using flame graphs.

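As a rough sketch of what attaching the agent can look like (the jar name, InfluxDB host, and agent arguments follow the statsd-jvm-profiler README and should be checked against the version you actually use):

    # Ship the profiler jar to the executors and attach it as a -javaagent;
    # it samples executor stack traces and reports them to InfluxDB.
    # The class/jar names and InfluxDB coordinates are placeholders.
    spark-submit \
      --jars statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar \
      --conf "spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar=server=influxdb.example.com,port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=my_app" \
      --class com.example.MyApp \
      my-app.jar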

For more information, check out my post on profiling Spark applications: https://www.paypal-engineering.com/2016/09/08/spark-in-flames-profiling-spark-applications-using-flame-graphs/

Answer by Michael Spector

I've recently written an article and a script that wraps spark-submit and generates a flame graph after executing a Spark application.

Here's the article: https://www.linkedin.com/pulse/profiling-spark-applications-one-click-michael-spector

Here's the script: https://raw.githubusercontent.com/spektom/spark-flamegraph/master/spark-submit-flamegraph

Just use it instead of regular spark-submit.

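For example (the class and jar names are placeholders), the invocation mirrors a normal spark-submit:

    # Drop-in replacement for spark-submit: the script collects profiling
    # data while the job runs and renders a flame graph afterwards.
    ./spark-submit-flamegraph \
      --class com.example.MyApp \
      my-app.jar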

Answer by pkumbhar

Look at the JVM Profiler released by Uber.

JVM Profiler is a tool developed by Uber for analysing JVM applications in a distributed environment. It can attach a Java agent to the executors of a Spark/Hadoop application in a distributed way and collect various metrics at runtime. It allows tracing arbitrary Java methods and arguments without source code changes (similar to DTrace).

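A sketch of attaching it to Spark executors (the jar name, reporter class, and interval arguments follow the project's README and are worth double-checking against your version):

    # Distribute the profiler jar and attach it to each executor JVM.
    # ConsoleOutputReporter prints metrics to stdout; other reporters
    # can ship them to external systems instead. Class/jar names are placeholders.
    spark-submit \
      --jars jvm-profiler-1.0.0.jar \
      --conf "spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter,metricInterval=5000,sampleInterval=100" \
      --class com.example.MyApp \
      my-app.jar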

Here is the blog post.

Answer by apnith

I would suggest checking out Sparklens. It is a profiling and performance prediction tool for Spark with a built-in Spark scheduler simulator. It gives an overall idea of how efficiently your cluster resources are utilized and what effect (approximately) a change in cluster resource configuration could have on performance.

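A minimal way to try it (the package coordinates come from the Sparklens README and depend on your Spark/Scala versions):

    # Pull Sparklens from spark-packages and register its listener;
    # it prints an efficiency and simulation report when the app finishes.
    # com.example.MyApp and my-app.jar are placeholders.
    spark-submit \
      --packages qubole:sparklens:0.3.2-s_2.11 \
      --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
      --class com.example.MyApp \
      my-app.jar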