Python Pyspark --py-files doesn't work

Disclaimer: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/27644525/

Pyspark --py-files doesn't work

python, hadoop, apache-spark, emr

Asked by C19

I'm using this as the documentation suggests: http://spark.apache.org/docs/1.1.1/submitting-applications.html

Spark version 1.1.0

./spark/bin/spark-submit --py-files /home/hadoop/loganalysis/parser-src.zip \
/home/hadoop/loganalysis/ship-test.py 

and the conf in the code:

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("LogAnalysis")
        .set("spark.executor.memory", "1g")
        .set("spark.executor.cores", "4")
        .set("spark.executor.num", "2")
        .set("spark.driver.memory", "4g")
        .set("spark.kryoserializer.buffer.mb", "128"))

and the slave nodes complain with an ImportError:

14/12/25 05:09:53 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-172-31-10-8.cn-north-1.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/hadoop/spark/python/pyspark/worker.py", line 75, in main
    command = pickleSer._read_with_length(infile)
  File "/home/hadoop/spark/python/pyspark/serializers.py", line 150, in _read_with_length
    return self.loads(obj)
ImportError: No module named parser

and parser-src.zip was tested locally:

[hadoop@ip-172-31-10-231 ~]$ python
Python 2.7.8 (default, Nov  3 2014, 10:17:30) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path.insert(1, '/home/hadoop/loganalysis/parser-src.zip')
>>> from parser import parser
>>> parser.parse
<function parse at 0x7fa5ef4c9848>
>>> 

I'm trying to get information about the remote workers, to see whether the zip file was copied and what sys.path looks like, and it's tricky.

UPDATE: using the snippet below I found that the zip file was shipped and sys.path was set, but the import still fails.

import os, sys

data = list(range(4))
disdata = sc.parallelize(data)
result = disdata.map(lambda x: "sys.path:  {0}\nDIR: {1}   \n FILES: {2} \n parser: {3}".format(sys.path, os.getcwd(), os.listdir('.'), str(parser)))
result.collect()
print(result.take(4))

It seems I have to dig into cloudpickle, which means I first need to understand how cloudpickle works and where it fails.

: An error occurred while calling o40.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 (TID 23, ip-172-31-10-8.cn-north-1.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/hadoop/spark/python/pyspark/worker.py", line 75, in main
    command = pickleSer._read_with_length(infile)
  File "/home/hadoop/spark/python/pyspark/serializers.py", line 150, in _read_with_length
    return self.loads(obj)
  File "/home/hadoop/spark/python/pyspark/cloudpickle.py", line 811, in subimport
    __import__(name)
ImportError: ('No module named parser', <function subimport at 0x7f219ffad7d0>, ('parser.parser',))

UPDATE:

Someone encountered the same problem in Spark 0.8: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Importing-other-py-files-in-PYTHONPATH-td2301.html

but he put his lib in Python's dist-packages and the import worked. I tried that and still get the import error.

UPDATE:

Oh gosh... I think the problem is caused by not understanding zip files and Python import behaviour. Passing parser.py to --py-files works (it then complains about another dependency), and zipping only the .py files (not including the .pyc files) seems to work too.

But I still can't quite understand why.

Answered by lolcaks

It sounds like one or more of the nodes aren't configured properly. Do all of the nodes on the cluster have the same version/configuration of Python (i.e. they all have the parser module installed)?

If you don't want to check node by node, you could write a script that checks whether it is installed and installs it for you. This thread shows a few ways to do that.

Answered by Gnat

Try to import your custom module from inside the method itself rather than at the top of the driver script, e.g.:

def parse_record(record):
    import parser
    p = parser.parse(record)
    return p

rather than

import parser
def parse_record(record):
    p = parser.parse(record)
    return p

Cloud Pickle doesn't seem to recognise when a custom module has been imported, so it appears to try to pickle the top-level modules along with the other data needed to run the method. In my experience, this means that top-level modules appear to exist but lack usable members, and nested modules can't be used as expected. Once I either imported with from A import * or imported from inside the method (import A.B), the modules worked as expected.

Answered by apurva.nandan

I was facing a similar kind of problem: my worker nodes could not detect the modules even though I was using the --py-files switch.

There were a couple of things I did. First I tried putting the import statement after I created the SparkContext (sc) variable, hoping that the import would take place after the module had been shipped to all nodes, but it still did not work. I then tried sc.addFile to add the module inside the script itself (instead of sending it as a command line argument) and afterwards imported the module's functions. This did the trick, at least in my case.

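For reference, a minimal sketch of that approach, using sc.addPyFile (the .py/.zip-aware variant of sc.addFile) and reusing the paths and module names from the question as placeholders for your own setup:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("yarn-client").setAppName("LogAnalysis")
sc = SparkContext(conf=conf)

# Ship the dependency from inside the script instead of via --py-files.
sc.addPyFile("/home/hadoop/loganalysis/parser-src.zip")

def parse_record(record):
    # Import only inside the function that runs on the workers,
    # after the file has been shipped.
    from parser import parser
    return parser.parse(record)

print(sc.parallelize(["a log line"]).map(parse_record).collect())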

Answered by noli

PySpark on EMR is configured for Python 2.6 by default, so make sure the modules aren't being installed for the Python 2.7 interpreter.

Answered by Raymond

Try this method of SparkContext:

sc.addPyFile(path)

According to the pyspark documentation here:

Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

Try uploading your Python module file to public cloud storage (e.g. AWS S3) and passing the URL to that method.

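As a sketch, assuming a hypothetical bucket and key, an HTTPS URL to the uploaded archive can then be passed directly:

# Hypothetical S3 location; any HTTP/HTTPS/FTP or Hadoop-supported URI
# (as described in the documentation quoted above) works the same way.
sc.addPyFile("https://my-bucket.s3.amazonaws.com/deps/parser-src.zip")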

Here is some more comprehensive reading material: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_python.html

Answered by newToJS_HTML

You need to package your Python code using tools like setuptools. This will let you create an .egg file, which is similar to a Java jar file. You can then specify the path of this .egg file using --py-files:

spark-submit --py-files path_to_egg_file path_to_spark_driver_file

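For reference, a minimal setup.py sketch for building such an .egg (the package name and version are placeholders); running python setup.py bdist_egg then drops the .egg file under dist/:

# setup.py -- minimal sketch; name and version are placeholders.
from setuptools import setup, find_packages

setup(
    name="loganalysis-deps",
    version="0.1",
    packages=find_packages(),  # picks up e.g. a parser/ package next to setup.py
)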

Answered by Prashant Singh

Create a zip file (for example, abc.zip) containing all your dependencies.

While creating the Spark context, mention the zip file name as:

    sc = SparkContext(conf=conf, pyFiles=["abc.zip"])
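
For completeness, a small sketch of how such a zip could be built in Python (the deps/ directory name is illustrative), including only the .py files as the question's update suggests:

import os
import zipfile

# Walk a hypothetical deps/ directory and add only the .py files, keeping
# paths relative to deps/ so the packages sit at the root of the zip.
with zipfile.ZipFile("abc.zip", "w") as zf:
    for root, _, files in os.walk("deps"):
        for name in files:
            if name.endswith(".py"):
                path = os.path.join(root, name)
                zf.write(path, os.path.relpath(path, "deps"))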