Scala: Using Apache Spark as a backend for a web application

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/29276381/


Using Apache Spark as a backend for web application

Tags: scala, hadoop, apache-spark

Asked by Raju Rama Krishna

We have terabytes of data stored in HDFS, comprising customer data and behavioral information. Business analysts want to slice and dice this data using filters.


These filters are similar to Spark RDD filters. Some examples of the filters are: age > 18 and age < 35, date between 10-02-2015 and 20-02-2015, gender = male, country in (UK, US, India), etc. We want to integrate this filter functionality into our JSF (or Play) based web application.

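A rough sketch in Scala of how such filters could be expressed with Spark's DataFrame API (assuming Spark 2.x); the input path and the column names (age, event_date, gender, country) are assumptions made only for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("analyst-filters").getOrCreate()
    val customers = spark.read.parquet("hdfs:///data/customers")  // hypothetical path

    // Filters compose, so they can be added or removed one at a time
    val filtered = customers
      .filter(col("age") > 18 && col("age") < 35)
      .filter(col("event_date").between("2015-02-10", "2015-02-20"))
      .filter(col("gender") === "male")
      .filter(col("country").isin("UK", "US", "India"))

    println(filtered.count())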

Analysts would like to experiment by applying/removing filters, and verifying if the count of the final filtered data is as desired. This is a repeated exercise, and the maximum number of people using this web application could be around 100.


We are planning to use Scala as the programming language for implementing the filters. The web application would initialize a single SparkContext when the server starts, and every filter would reuse that same SparkContext.

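A minimal sketch (assuming Spark 2.x, where SparkSession wraps the SparkContext) of initializing Spark once at server startup and reusing it for every request; the path, view name, and helper method are illustrative only:

    import org.apache.spark.sql.SparkSession

    object SharedSpark {
      // Created once per JVM at first use; every request reuses the same session,
      // and therefore the same underlying SparkContext.
      lazy val session: SparkSession = {
        val s = SparkSession.builder().appName("analyst-web-backend").getOrCreate()
        s.read.parquet("hdfs:///data/customers").createOrReplaceTempView("customers") // hypothetical path
        s
      }

      // Called from a JSF bean or Play controller for each filter request
      def countForFilter(whereClause: String): Long =
        session.sql(s"SELECT COUNT(*) FROM customers WHERE $whereClause").first().getLong(0)
    }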

Is Spark a good fit for this use case of interactive querying through a web application? Also, is sharing a single SparkContext a work-around, or is it the intended approach? The other alternative we have is Apache Hive with the Tez engine, using the ORC compressed file format and querying over JDBC/Thrift. Is that option better than Spark for this job?


Answered by Marius Soutier

It's not the best use case for Spark, but it is completely possible. The latency can be high though.


You might want to check out Spark Jobserver, which should offer most of the features you need. You can also get an SQL view over your data using Spark's JDBC Thrift server.

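As a hedged sketch of the Thrift-server route: it speaks the HiveServer2 protocol, so a plain JDBC query works from Scala. The host, port, credentials, and table below are assumptions, and the Hive JDBC driver must be on the classpath:

    import java.sql.DriverManager

    object ThriftServerQuery {
      def main(args: Array[String]): Unit = {
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        // Hypothetical endpoint of Spark's JDBC Thrift server
        val conn = DriverManager.getConnection("jdbc:hive2://spark-thrift-host:10000/default", "analyst", "")
        try {
          val rs = conn.createStatement().executeQuery(
            """SELECT COUNT(*) FROM customers
              |WHERE age > 18 AND age < 35 AND gender = 'male'
              |  AND country IN ('UK', 'US', 'India')""".stripMargin)
          while (rs.next()) println(s"count = ${rs.getLong(1)}")
        } finally conn.close()
      }
    }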

In general I'd advise using Spark SQL for this; it already handles a lot of the things you might be interested in.


Another option would be to use Databricks Cloud, but it's not publicly available yet.


Answered by quickinsights

Analysts would like to experiment by applying/removing filters, and verifying if the count of the final filtered data is as desired. This is a repeated exercise, and the maximum number of people using this web application could be around 100.


Apache Zeppelin provides a framework for interactively ingesting and visualizing data (via a web application), using Apache Spark as the back end. Here is a video demonstrating the features.


Also, the idea of sharing a single SparkContext, is this a work-around approach?


It looks like that project uses a single SparkContext for low-latency query jobs.


Answered by Juh_

I'd like to know which solution you chose in the end.


I have two suggestions:


  1. Following the Zeppelin idea from @quickinsights, there is also the interactive notebook Jupyter, which is well established by now. It was primarily designed for Python, but specialized kernels can be installed. I tried using Toree a couple of months ago. The basic installation is simple:

    pip install jupyter

    pip install toree

    jupyter toree install

    but at the time I had to do a couple of low-level tweaks to make it work (such as editing /usr/local/share/jupyter/kernels/toree/kernel.json). But it worked, and I could use a Spark cluster from a Scala notebook. Check this tutorial; it matches what I remember.

  2. Most (all?) docs on Spark talk about running apps with spark-submit or using spark-shell for interactive usage (sorry, but the Spark/Scala shell is so disappointing...). They never talk about using Spark in an interactive app, such as a web app. It is possible (I tried), but there are indeed some issues to check, such as sharing the SparkContext as you mentioned, and also some issues around managing dependencies. You can check the two getting-started prototypes I made to use Spark in a Spring web app. They are in Java, but I would strongly recommend using Scala. I did not work with this long enough to learn a lot. However, I can say that it is possible, and it works well (tried on a 12-node cluster, with the app running on an edge node).

    Just remember that the Spark driver, i.e. where the code with the RDDs runs, should be physically on the same cluster as the Spark nodes: there is a lot of communication between the driver and the workers.


Answered by leo9r

Apache Livy enables programmatic, fault-tolerant, multi-tenant submission of Spark jobs from web/mobile apps (no Spark client needed), so multiple users can interact with your Spark cluster concurrently.

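A rough sketch of submitting Scala code to a Livy server over its REST API, using only the JDK HTTP client (Java 11+). The Livy URL, session id, and submitted snippet are assumptions; real code would also poll the session until it is idle and poll the statement URL for the result:

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    object LivySubmit {
      private val livy = "http://livy-host:8998"   // hypothetical endpoint
      private val client = HttpClient.newHttpClient()

      private def post(path: String, json: String): String = {
        val req = HttpRequest.newBuilder(URI.create(livy + path))
          .header("Content-Type", "application/json")
          .POST(HttpRequest.BodyPublishers.ofString(json))
          .build()
        client.send(req, HttpResponse.BodyHandlers.ofString()).body()
      }

      def main(args: Array[String]): Unit = {
        // 1. Create an interactive Scala session (the response JSON contains the session id)
        println(post("/sessions", """{"kind": "spark"}"""))

        // 2. Once the session is idle, submit a statement against it (session id 0 assumed here)
        println(post("/sessions/0/statements",
          """{"code": "spark.read.parquet(\"hdfs:///data/customers\").filter(\"age > 18 AND age < 35\").count()"}"""))
      }
    }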

Answered by Ashish Gupta

We had a similar problem at our company. We have ~2-2.5 TB of data in the form of logs, and we had some basic analytics to do on that data.


We used the following:


  • Apache Flink for streaming data from the source into HDFS via Hive.

  • Zeppelin configured on top of HDFS.

  • An SQL interface for joins, and a JDBC connection to HDFS via Hive.

  • Spark for offline batch processing of the data.


You can use Flink + Hive-HDFS


Filters can be applied via SQL (yes, everything is supported in the latest releases).
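
A small sketch in Scala (assuming the log tables live in the Hive metastore) of running such SQL filters through Spark; the table and column names are made up for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("log-analytics")
      .enableHiveSupport()   // read the Hive tables written into HDFS
      .getOrCreate()

    spark.sql(
      """SELECT country, COUNT(*) AS users
        |FROM logs
        |WHERE age BETWEEN 18 AND 35 AND gender = 'male'
        |GROUP BY country""".stripMargin).show()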

Zeppelin can automate the task of report generation, and it has nice filtering features via the ${sql-variable} syntax, without actually modifying the SQL queries.


Check it out. I am sure you'll find your answer:)


Thanks.
