Java: Specifying an external configuration file for Apache Spark

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/29441316/

Specifying an external configuration file for Apache Spark

Tags: java, amazon-web-services, apache-spark

Asked by Alexander

I'd like to specify all of Spark's properties in a configuration file, and then load that configuration file at runtime.

~~~~~~~~~~Edit~~~~~~~~~~~

It turns out I was pretty confused about how to go about doing this. Ignore the rest of this question. For a simple solution (in Java Spark) showing how to load a .properties file into a Spark cluster, see my answer below.

The original question is kept below for reference purposes only.

~~~~~~~~~~~~~~~~~~~~~~~~

I want:

  • Different configuration files depending on the environment (local, AWS)
  • To specify application-specific parameters

As a simple example, let's imagine I'd like to filter lines in a log file depending on a string. Below I've got a simple Java Spark program that reads data from a file and filters it depending on a string the user defines. The program takes one argument, the input source file.

Java Spark Code

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class SimpleSpark {
    public static void main(String[] args) {
        String inputFile = args[0]; // Should be some file on your system

        SparkConf conf = new SparkConf();// .setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(inputFile).cache();

        final String filterString = conf.get("filterstr");

        long numberLines = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains(filterString);
            }
        }).count();

        System.out.println("Line count: " + numberLines);
    }
}

Config File

The configuration file is based on https://spark.apache.org/docs/1.3.0/configuration.html and looks like:

spark.app.name          test_app
spark.executor.memory   2g
spark.master            local
simplespark.filterstr   a

The Problem

I execute the application using the following arguments:

/path/to/inputtext.txt --conf /path/to/configfile.config

However, this doesn't work, since the exception

Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration

gets thrown. To me, this means the configuration file is not being loaded.

My questions are:

  1. What is wrong with my setup?
  2. Is specifying application-specific parameters in the Spark configuration file good practice?

Accepted answer by Alexander

So after a bit of time, I realized I was pretty confused. The easiest way to get a configuration file into memory is to use a standard properties file, put it into HDFS, and load it from there. For the record, here is the code to do it (in Java Spark):

import java.io.InputStream;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf();
JavaSparkContext ctx = new JavaSparkContext(sparkConf);

// Open the properties file stored in HDFS via the context's Hadoop configuration
Path pt = new Path("hdfs:///user/hadoop/myproperties.properties");
FileSystem fs = FileSystem.get(ctx.hadoopConfiguration());
InputStream inputStream = fs.open(pt);

// Load the key/value pairs into a standard java.util.Properties object
Properties properties = new Properties();
properties.load(inputStream);
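
A value from that file can then be read like any other property. A quick illustration, assuming the simplespark.filterstr key from the config file above (the default value "a" here is just an example, not part of the original answer):

// Read an application-specific setting from the loaded properties
final String filterString = properties.getProperty("simplespark.filterstr", "a");
System.out.println("Filtering on: " + filterString);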

Answered by Marius Soutier

  1. --conf only sets a single Spark property; it's not for reading files.
    For example: --conf spark.shuffle.spill=false.
  2. Application parameters don't go into spark-defaults, but are passed as program args (and are read from your main method). spark-defaults should contain SparkConf properties that apply to most or all jobs. If you want to use a config file instead of application parameters, take a look at Typesafe Config; it also supports environment variables. (See the sketch after this list.)
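
To make the Typesafe Config suggestion concrete, here is a minimal Java sketch; the resource name "simplespark" and the keys used are assumptions for illustration, not anything mandated by the library:

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class TypesafeConfigExample {
    public static void main(String[] args) {
        // Loads simplespark.conf / simplespark.properties from the classpath (assumed resource name)
        Config config = ConfigFactory.load("simplespark");

        SparkConf sparkConf = new SparkConf()
                .setAppName(config.getString("spark.app.name"))
                .setMaster(config.getString("spark.master"));
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);

        // Application-specific parameter kept in the same config file
        String filterString = config.getString("simplespark.filterstr");
        System.out.println("Filtering on: " + filterString);

        ctx.stop();
    }
}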

Answered by Objektwerks

FWIW, using the Typesafe Config library, I just verified that this works in ScalaTest:

  val props = ConfigFactory.load("spark.properties")
  val conf = new SparkConf().
    setMaster(props.getString("spark.master")).
    setAppName(props.getString("spark.app.name"))

Answered by Poojaa Karaande

Try this:

--properties-file /path/to/configfile.config

then access it in your Scala program as:

sc.getConf.get("spark.app.name")
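
For the Java side of the original question, the equivalent after launching with spark-submit --properties-file /path/to/configfile.config would be roughly this sketch:

SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);

// spark.* entries from the --properties-file are applied to this SparkConf
String appName = sc.getConf().get("spark.app.name");
System.out.println("App name: " + appName);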