Can I add arguments to Python code when I submit a Spark job?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/32217160/

Date: 2020-08-19 11:14:10 · Source: igfitidea

Can I add arguments to python code when I submit spark job?

Tags: python, apache-spark, cluster-mode

Asked by Jinho Yoo

I'm trying to use spark-submit to execute my Python code on a Spark cluster.

Generally we run spark-submit with Python code like below.

# Run a Python application on a cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  my_python_code.py \
  1000

But I want to run my_python_code.py by passing several arguments. Is there a smart way to pass arguments?

Accepted answer by Paul

Yes: Put this in a file called args.py

import sys
print(sys.argv)

If you run

spark-submit args.py a b c d e 

You will see:

['/spark/args.py', 'a', 'b', 'c', 'd', 'e']
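
For context, here is a minimal sketch of how such a positional argument might feed into an actual PySpark job. The script name, the meaning of the argument, and the SparkContext usage below are illustrative assumptions, not part of the original answer.

# my_python_code.py -- hypothetical job that treats the first CLI argument
# as a sample size (an assumption for illustration)
import sys

from pyspark import SparkContext

if __name__ == "__main__":
    n = int(sys.argv[1])  # e.g. 1000 from "spark-submit my_python_code.py 1000"
    sc = SparkContext(appName="ArgsDemo")
    count = sc.parallelize(range(n)).count()
    print(count)
    sc.stop()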

Answer by Jinho Yoo

Ah, it's possible. http://caen.github.io/hadoop/user-spark.html

# Run as a Hadoop job on your_queue, with 10 executors,
# 12 GB of memory per executor and 2 cores per executor
spark-submit \
    --master yarn-client \
    --queue <your_queue> \
    --num-executors 10 \
    --executor-memory 12g \
    --executor-cores 2 \
    job.py ngrams/input ngrams/output
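
The linked example does not show the body of job.py. As a hedged sketch, it might read the two positional arguments as input and output paths; the word-count logic below is purely an illustrative assumption:

# job.py -- hypothetical script body; reads input/output paths from sys.argv
import sys

from pyspark import SparkContext

if __name__ == "__main__":
    input_path, output_path = sys.argv[1], sys.argv[2]  # ngrams/input, ngrams/output
    sc = SparkContext(appName="NgramsJob")
    counts = (sc.textFile(input_path)
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile(output_path)
    sc.stop()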

Answer by noleto

Even though sys.argv is a good solution, I still prefer this more proper way of handling command-line args in my PySpark jobs:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--ngrams", help="some useful description.")
args = parser.parse_args()
if args.ngrams:
    ngrams = args.ngrams

This way, you can launch your job as follows:

spark-submit job.py --ngrams 3
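
A slightly fuller sketch of how the parsed value might then be used inside the job; the SparkSession usage, the type conversion, and the default value below are assumptions added for illustration:

# job.py -- hypothetical: combine argparse with a SparkSession
import argparse

from pyspark.sql import SparkSession

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ngrams", type=int, default=2,
                        help="n-gram size to use (illustrative)")
    args = parser.parse_args()

    spark = SparkSession.builder.appName("NgramsJob").getOrCreate()
    print("Running with ngrams =", args.ngrams)
    spark.stop()

Note that the script's own arguments must come after the script name; spark-submit's options (such as --master) go before it.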

More information about the argparse module can be found in the Argparse Tutorial.

Answer by Vivarsh Kondalkar

You can pass arguments on the spark-submit command line and then access them in your code as follows.

sys.argv[1] gives you the first argument, sys.argv[2] the second, and so on. Refer to the example below.

The code below takes the arguments that you pass in the spark-submit command:

import sys

# The first argument is the number of table names that follow it.
n = int(sys.argv[1])
tables = sys.argv[2:2 + n]
print(tables)

Save the above file as PysparkArg.py and execute the spark-submit command below:

spark-submit PysparkArg.py 3 table1 table2 table3

Output:

['table1', 'table2', 'table3']

This pattern is useful in PySpark jobs that need to fetch multiple tables from a database, where the number of tables and their names are supplied by the user when executing the spark-submit command.
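
As a follow-up, a hedged sketch of how those table names might then be consumed; the use of spark.read.table and the loop below are assumptions for illustration, since the original answer stops at printing the list:

# PysparkArg.py (extended) -- hypothetical: load each table named on the command line
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    n = int(sys.argv[1])
    tables = sys.argv[2:2 + n]

    spark = SparkSession.builder.appName("MultiTableFetch").getOrCreate()
    for name in tables:
        df = spark.read.table(name)  # assumes the table is registered in the metastore
        print(name, df.count())
    spark.stop()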

Answer by trevorgrayson

Aniket Kulkarni's spark-submit args.py a b c d e seems to suffice, but it's worth mentioning that we had issues with optional/named args (e.g. --param1).

It appears that a double dash -- helps signal that Python optional args follow:

spark-submit --sparkarg xxx yourscript.py -- --scriptarg 1 arg1 arg2
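
A small, hedged sketch of how yourscript.py might read those arguments; the parser below is an assumption for illustration, and it prints sys.argv first so you can verify exactly which tokens spark-submit forwards to the script after the -- separator:

# yourscript.py -- hypothetical sketch: inspect, then parse, the forwarded arguments
import argparse
import sys

if __name__ == "__main__":
    # Show the raw argument vector to confirm what actually reached the script.
    print(sys.argv)

    parser = argparse.ArgumentParser()
    parser.add_argument("--scriptarg", help="illustrative named argument")
    parser.add_argument("rest", nargs="*", help="remaining positional arguments")
    known, unknown = parser.parse_known_args()
    print(known, unknown)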