Shipping Python modules in pyspark to other nodes

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link to the original question, and attribute it to the original authors (not me): StackOverflow.

Original question: http://stackoverflow.com/questions/24686474/
Asked by mgoldwasser
How can I ship C compiled modules (for example, python-Levenshtein) to each node in a Spark cluster?
I know that I can ship Python files in Spark using a standalone Python script (example code below):
from pyspark import SparkContext
# pyFiles lists local Python files that Spark ships to every node in the cluster
sc = SparkContext("local", "App Name", pyFiles=['MyFile.py', 'MyOtherFile.py'])
But in situations where there is no '.py' file, how do I ship the module?
Accepted answer by Josh Rosen
If you can package your module into a .egg or .zip file, you should be able to list it in pyFiles when constructing your SparkContext (or you can add it later through sc.addPyFile).
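A minimal sketch of both approaches, assuming the module has already been packaged into a file named dependencies.egg (the name is just an example):

from pyspark import SparkContext

# Ship the packaged dependency when the context is created...
sc = SparkContext("local", "App Name", pyFiles=['dependencies.egg'])

# ...or add it afterwards; Spark distributes the file to the worker nodes
# and makes it importable on the executors.
sc.addPyFile('dependencies.egg')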
For Python libraries that use setuptools, you can run python setup.py bdist_egg to build an egg distribution.
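For illustration, a minimal setup.py for a hypothetical package named mymodule might look like the following; running python setup.py bdist_egg next to it produces an .egg file under dist/ that can then be passed to pyFiles or sc.addPyFile:

from setuptools import setup, find_packages

setup(
    name='mymodule',
    version='0.1',
    packages=find_packages(),
)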
Another option is to install the library cluster-wide, either by using pip/easy_install on each machine or by sharing a Python installation over a cluster-wide filesystem (like NFS).
Answered by ivan_pozdeev
There are two main options here:
- If it's a single file or a .zip/.egg, pass it to SparkContext.addPyFile.
- Insert pip install into the bootstrap code for the cluster's machines.
People also suggest using the Python shell to test whether the module is present on the cluster; a quick check along those lines is sketched below.
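A minimal sketch of such a check, assuming an existing SparkContext sc and using Levenshtein purely as an example module name; it runs the import attempt as a small job so that the executors, rather than the driver, report the result:

def has_module(_):
    # Attempt the import on the executor and report success or failure.
    try:
        import Levenshtein  # replace with the module you shipped
        return True
    except ImportError:
        return False

# One task per default partition; collect() brings the per-task results back to the driver.
print(sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
        .map(has_module)
        .collect())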