Java: How to run concurrent jobs (actions) in Apache Spark using a single Spark context

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/28712420/


How to run concurrent jobs (actions) in Apache Spark using a single Spark context

Tags: java, concurrency, apache-spark

Asked by Sporty


The Apache Spark documentation says that "within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads". Can someone explain how to achieve this concurrency for the following sample code?



    SparkConf conf = new SparkConf().setAppName("Simple_App");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> file1 = sc.textFile("/path/to/test_doc1");
    JavaRDD<String> file2 = sc.textFile("/path/to/test_doc2");

    System.out.println(file1.count());
    System.out.println(file2.count());

These two jobs are independent and must run concurrently.
Thank You.


Answered by G Quintana

Try something like this:


    // Requires imports from java.util.concurrent (Callable, ExecutorService, Executors, Future)
    // and org.apache.spark.api.java (JavaRDD, JavaSparkContext).
    // "local[2]" gives the driver two cores so the two jobs can actually run in parallel.
    final JavaSparkContext sc = new JavaSparkContext("local[2]", "Simple_App");
    ExecutorService executorService = Executors.newFixedThreadPool(2);
    // Submit job 1 from its own thread
    Future<Long> future1 = executorService.submit(new Callable<Long>() {
        @Override
        public Long call() throws Exception {
            JavaRDD<String> file1 = sc.textFile("/path/to/test_doc1");
            return file1.count();
        }
    });
    // Submit job 2 from a second thread
    Future<Long> future2 = executorService.submit(new Callable<Long>() {
        @Override
        public Long call() throws Exception {
            JavaRDD<String> file2 = sc.textFile("/path/to/test_doc2");
            return file2.count();
        }
    });
    // Wait for job 1 to finish
    System.out.println("File1:" + future1.get());
    // Wait for job 2 to finish
    System.out.println("File2:" + future2.get());
    // Remember to call executorService.shutdown() and sc.stop() when the application is done.

Answered by Tagar

Using the Scala parallel collections feature:


    Range(0, 10).par.foreach { project_id =>
      spark.table("store_sales")
        .selectExpr(project_id + " as project_id", "count(*) as cnt")
        .write
        .saveAsTable(s"counts_$project_id")
    }

PS: The above launches up to 10 parallel Spark jobs, but it may be fewer depending on the number of cores available on the Spark driver, since that is what sizes the default thread pool used by Scala parallel collections. The Futures-based approach by G Quintana above is more flexible in this regard.

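For readers following along in Java (the question's language), here is a roughly equivalent sketch using a parallel IntStream. It is an illustration, not part of the original answer: spark is assumed to be an existing SparkSession, and store_sales / counts_<id> are the example table names used above.

    // Requires java.util.stream.IntStream and org.apache.spark.sql.SparkSession.
    // Each iteration runs its own Spark job; the common fork-join pool decides how many run at once.
    IntStream.range(0, 10).parallel().forEach(projectId ->
        spark.table("store_sales")
             .selectExpr(projectId + " as project_id", "count(*) as cnt")
             .write()
             .saveAsTable("counts_" + projectId)
    );

As with the Scala version, the degree of parallelism here comes from the JVM's thread pool rather than from Spark itself.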