Java: Specifying an external configuration file for Apache Spark

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/29441316/

Specifying an external configuration file for Apache Spark

Tags: java, amazon-web-services, apache-spark

Asked by Alexander

I'd like to specify all of Spark's properties in a configuration file, and then load that configuration file at runtime.

~~~~~~~~~~Edit~~~~~~~~~~~

It turns out I was pretty confused about how to go about doing this. Ignore the rest of this question. For a simple solution (in Java Spark) showing how to load a .properties file into a Spark cluster, see my answer below.

The original question is kept below for reference purposes only.

~~~~~~~~~~~~~~~~~~~~~~~~

I want:

  • Different configuration files depending on the environment (local, AWS)
  • To specify application-specific parameters

As a simple example, let's imagine I'd like to filter lines in a log file depending on a string. Below I've got a simple Java Spark program that reads data from a file and filters it depending on a string the user defines. The program takes one argument, the input source file.

Java Spark Code

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class SimpleSpark {
    public static void main(String[] args) {
        String inputFile = args[0]; // Should be some file on your system

        SparkConf conf = new SparkConf();// .setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(inputFile).cache();

        final String filterString = conf.get("filterstr");

        long numberLines = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains(filterString);
            }
        }).count();

        System.out.println("Line count: " + numberLines);
    }
}

Config File

The configuration file is based on https://spark.apache.org/docs/1.3.0/configuration.html and looks like:

spark.app.name          test_app
spark.executor.memory   2g
spark.master            local
simplespark.filterstr   a

The Problem

I execute the application using the following arguments:

/path/to/inputtext.txt --conf /path/to/configfile.config

However, this doesn't work, since the exception

Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration

gets thrown. To me, this means the configuration file is not being loaded.

My questions are:

  1. What is wrong with my setup?
  2. Is specifying application-specific parameters in the Spark configuration file good practice?

Accepted answer by Alexander

So after a bit of time, I realized I was pretty confused. The easiest way to get a configuration file into memory is to use a standard properties file, put it into HDFS, and load it from there. For the record, here is the code to do it (in Java Spark):

import java.io.InputStream;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf();
JavaSparkContext ctx = new JavaSparkContext(sparkConf);

// Open the properties file stored in HDFS via the context's Hadoop configuration
Path pt = new Path("hdfs:///user/hadoop/myproperties.properties");
FileSystem fs = FileSystem.get(ctx.hadoopConfiguration());
InputStream inputStream = fs.open(pt);

// Load the key/value pairs into a standard java.util.Properties object
Properties properties = new Properties();
properties.load(inputStream);
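
A value from that file can then be read like any other property. A quick illustration, assuming the simplespark.filterstr key from the config file above (the default value "a" here is just an example, not part of the original answer):

// Read an application-specific setting from the loaded properties
final String filterString = properties.getProperty("simplespark.filterstr", "a");
System.out.println("Filtering on: " + filterString);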

Answered by Marius Soutier

  1. --conf only sets a single Spark property; it's not for reading files.
    For example: --conf spark.shuffle.spill=false.
  2. Application parameters don't go into spark-defaults, but are passed as program args (and are read from your main method). spark-defaults should contain SparkConf properties that apply to most or all jobs. If you want to use a config file instead of application parameters, take a look at Typesafe Config; it also supports environment variables. (See the sketch after this list.)
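
To make the Typesafe Config suggestion concrete, here is a minimal Java sketch; the resource name "simplespark" and the keys used are assumptions for illustration, not anything mandated by the library:

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class TypesafeConfigExample {
    public static void main(String[] args) {
        // Loads simplespark.conf / simplespark.properties from the classpath (assumed resource name)
        Config config = ConfigFactory.load("simplespark");

        SparkConf sparkConf = new SparkConf()
                .setAppName(config.getString("spark.app.name"))
                .setMaster(config.getString("spark.master"));
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);

        // Application-specific parameter kept in the same config file
        String filterString = config.getString("simplespark.filterstr");
        System.out.println("Filtering on: " + filterString);

        ctx.stop();
    }
}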

Answered by Objektwerks

FWIW, using the Typesafe Config library, I just verified that this works in ScalaTest:

  val props = ConfigFactory.load("spark.properties")
  val conf = new SparkConf().
    setMaster(props.getString("spark.master")).
    setAppName(props.getString("spark.app.name"))

Answered by Poojaa Karaande

Try this:

--properties-file /path/to/configfile.config

then access it in your Scala program as:

sc.getConf.get("spark.app.name")
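
For the Java side of the original question, the equivalent after launching with spark-submit --properties-file /path/to/configfile.config would be roughly this sketch:

SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);

// spark.* entries from the --properties-file are applied to this SparkConf
String appName = sc.getConf().get("spark.app.name");
System.out.println("App name: " + appName);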