Python: What is the difference between spark-submit and pyspark?

Disclaimer: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/26726780/

What is the difference between spark-submit and pyspark?

python, apache-spark, pyspark

Asked by user592419

If I start up pyspark and then run this command:

import my_script; spark = my_script.Sparker(sc); spark.collapse('./data/')

Everything is A-OK. If, however, I try to do the same thing through the command line and spark-submit, I get an error:

Command: /usr/local/spark/bin/spark-submit my_script.py collapse ./data/
  File "/usr/local/spark/python/pyspark/rdd.py", line 352, in func
    return f(iterator)
  File "/usr/local/spark/python/pyspark/rdd.py", line 1576, in combineLocally
    merger.mergeValues(iterator)
  File "/usr/local/spark/python/pyspark/shuffle.py", line 245, in mergeValues
    for k, v in iterator:
  File "/.../my_script.py", line 173, in _json_args_to_arr
    js = cls._json(line)
RuntimeError: uninitialized staticmethod object

my_script:

...
if __name__ == "__main__":
    args = sys.argv[1:]
    if args[0] == 'collapse':
        directory = args[1]
        from pyspark import SparkContext
        sc = SparkContext(appName="Collapse")
        spark = Sparker(sc)
        spark.collapse(directory)
        sc.stop()

Why is this happening? What's the difference between running pyspark and running spark-submit that would cause this divergence? And how can I make this work in spark-submit?

EDIT: I tried running this from the bash shell by doing pyspark my_script.py collapse ./data/ and I got the same error. The only time everything works is when I am in a python shell and import the script.

Answered by avrsanjay

  1. If you built a Spark application, you need to use spark-submit to run the application (see the sketch after this list).

    • The code can be written in either Python or Scala

    • The mode can be either local or cluster

  2. If you just want to test or run a few individual commands, you can use the shell provided by Spark:

    • pyspark (for Spark in Python)
    • spark-shell (for Spark in Scala)
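
As a minimal sketch of the first case (the file name my_app.py and the word-count logic are illustrative assumptions, not taken from the question), a self-contained application meant for spark-submit creates its own SparkContext:

# my_app.py -- a hypothetical, minimal application intended for spark-submit
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount")    # the application builds its own context
    rdd = sc.parallelize(["a b", "a c"])      # small in-memory sample data
    counts = (rdd.flatMap(lambda line: line.split())
                 .map(lambda w: (w, 1))
                 .reduceByKey(lambda a, b: a + b))
    print(counts.collect())
    sc.stop()

You would launch it with something like spark-submit my_app.py; in the pyspark shell, by contrast, a SparkContext already exists as sc and you would type the same transformations interactively.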

Answered by H Roy

spark-submit is a utility to submit your Spark program (or job) to Spark clusters. If you open the spark-submit script, it eventually calls a Scala program:

org.apache.spark.deploy.SparkSubmit 

On the other hand, pyspark or spark-shell is a REPL (read-eval-print loop) utility that allows developers to run/execute their Spark code as they write it and evaluate it on the fly.
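
For example, inside the pyspark shell the context is already bound to the name sc, so expressions are evaluated as soon as you type them (the ./data/ path is the one from the question; it is assumed to contain text files):

# typed directly at the pyspark prompt; the shell has already created sc
rdd = sc.textFile("./data/")    # lazily reads the files under ./data/
rdd.take(5)                     # runs the job; the REPL echoes the first five records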

Eventually, both of them run a job behind the scenes, and the majority of the options are the same, as you can see if you run the following commands:

spark-submit --help
pyspark --help
spark-shell --help

spark-submit has some additional options for taking your Spark program (Scala or Python) as a bundle (a jar, or a zip for Python) or as an individual .py or .class file.

spark-submit --help
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
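
For instance (a sketch: deps.zip is an assumed archive of helper modules, and the remaining arguments mirror the command from the question), a Python program and its bundled dependencies can be submitted like this:

spark-submit --master local[4] --py-files deps.zip my_script.py collapse ./data/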

Both of them also provide a WebUI to track Spark job progress and other metrics.

When you kill your shell (pyspark or spark-shell) using Ctrl+C, your Spark session is killed and the WebUI can no longer show details.

If you look into spark-shell, it has one additional option, -I, to run a script line by line:

Scala REPL options:
  -I <file>                   preload <file>, enforcing line-by-line interpretation
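
For example (init.scala is a hypothetical file of setup statements):

spark-shell -I init.scala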