Can I add arguments to Python code when I submit a Spark job?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/32217160/

Date: 2020-08-19 11:14:10 · Source: igfitidea

Can I add arguments to python code when I submit spark job?

Tags: python, apache-spark, cluster-mode

Asked by Jinho Yoo

I'm trying to use spark-submit to execute my Python code on a Spark cluster.

Generally we run spark-submit with Python code like below.

# Run a Python application on a cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  my_python_code.py \
  1000

But I want to run my_python_code.py by passing several arguments. Is there a smart way to pass arguments?

Accepted answer by Paul

Yes: Put this in a file called args.py

import sys
print(sys.argv)

If you run

spark-submit args.py a b c d e 

You will see:

['/spark/args.py', 'a', 'b', 'c', 'd', 'e']
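
For context, here is a minimal sketch of how such a positional argument might feed into an actual PySpark job. The script name, the meaning of the argument, and the SparkContext usage below are illustrative assumptions, not part of the original answer.

# my_python_code.py -- hypothetical job that treats the first CLI argument
# as a sample size (an assumption for illustration)
import sys

from pyspark import SparkContext

if __name__ == "__main__":
    n = int(sys.argv[1])  # e.g. 1000 from "spark-submit my_python_code.py 1000"
    sc = SparkContext(appName="ArgsDemo")
    count = sc.parallelize(range(n)).count()
    print(count)
    sc.stop()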

Answer by Jinho Yoo

Ah, it's possible. http://caen.github.io/hadoop/user-spark.html

# Run as a Hadoop job on your_queue, with 10 executors,
# 12 GB of memory per executor and 2 cores per executor
spark-submit \
    --master yarn-client \
    --queue <your_queue> \
    --num-executors 10 \
    --executor-memory 12g \
    --executor-cores 2 \
    job.py ngrams/input ngrams/output
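
The linked example does not show the body of job.py. As a hedged sketch, it might read the two positional arguments as input and output paths; the word-count logic below is purely an illustrative assumption:

# job.py -- hypothetical script body; reads input/output paths from sys.argv
import sys

from pyspark import SparkContext

if __name__ == "__main__":
    input_path, output_path = sys.argv[1], sys.argv[2]  # ngrams/input, ngrams/output
    sc = SparkContext(appName="NgramsJob")
    counts = (sc.textFile(input_path)
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile(output_path)
    sc.stop()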

Answer by noleto

Even though sys.argv is a good solution, I still prefer this more proper way of handling command-line args in my PySpark jobs:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--ngrams", help="some useful description.")
args = parser.parse_args()
if args.ngrams:
    ngrams = args.ngrams

This way, you can launch your job as follows:

spark-submit job.py --ngrams 3
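
A slightly fuller sketch of how the parsed value might then be used inside the job; the SparkSession usage, the type conversion, and the default value below are assumptions added for illustration:

# job.py -- hypothetical: combine argparse with a SparkSession
import argparse

from pyspark.sql import SparkSession

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ngrams", type=int, default=2,
                        help="n-gram size to use (illustrative)")
    args = parser.parse_args()

    spark = SparkSession.builder.appName("NgramsJob").getOrCreate()
    print("Running with ngrams =", args.ngrams)
    spark.stop()

Note that the script's own arguments must come after the script name; spark-submit's options (such as --master) go before it.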

More information about the argparse module can be found in the Argparse Tutorial.

Answer by Vivarsh Kondalkar

You can pass arguments on the spark-submit command line and then access them in your code as follows.

sys.argv[1] gives you the first argument, sys.argv[2] the second, and so on. Refer to the example below.

The code below takes the arguments that you pass in the spark-submit command:

import sys

# The first argument is the number of table names that follow it.
n = int(sys.argv[1])
tables = sys.argv[2:2 + n]
print(tables)

Save the above file as PysparkArg.py and execute the spark-submit command below:

spark-submit PysparkArg.py 3 table1 table2 table3

Output:

['table1', 'table2', 'table3']

This pattern is useful in PySpark jobs that need to fetch multiple tables from a database, where the number of tables and their names are supplied by the user when executing the spark-submit command.
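
As a follow-up, a hedged sketch of how those table names might then be consumed; the use of spark.read.table and the loop below are assumptions for illustration, since the original answer stops at printing the list:

# PysparkArg.py (extended) -- hypothetical: load each table named on the command line
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    n = int(sys.argv[1])
    tables = sys.argv[2:2 + n]

    spark = SparkSession.builder.appName("MultiTableFetch").getOrCreate()
    for name in tables:
        df = spark.read.table(name)  # assumes the table is registered in the metastore
        print(name, df.count())
    spark.stop()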

Answer by trevorgrayson

Aniket Kulkarni's spark-submit args.py a b c d e seems to suffice, but it's worth mentioning that we had issues with optional/named args (e.g. --param1).

It appears that a double dash -- helps signal that Python optional args follow:

spark-submit --sparkarg xxx yourscript.py -- --scriptarg 1 arg1 arg2
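
A small, hedged sketch of how yourscript.py might read those arguments; the parser below is an assumption for illustration, and it prints sys.argv first so you can verify exactly which tokens spark-submit forwards to the script after the -- separator:

# yourscript.py -- hypothetical sketch: inspect, then parse, the forwarded arguments
import argparse
import sys

if __name__ == "__main__":
    # Show the raw argument vector to confirm what actually reached the script.
    print(sys.argv)

    parser = argparse.ArgumentParser()
    parser.add_argument("--scriptarg", help="illustrative named argument")
    parser.add_argument("rest", nargs="*", help="remaining positional arguments")
    known, unknown = parser.parse_known_args()
    print(known, unknown)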