Python: Cannot find col function in pyspark
Note: this page reproduces a popular StackOverflow question and its answers. The content is provided under the CC BY-SA 4.0 license; you are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/40163106/
Cannot find col function in pyspark
Asked by Bamqf
In pyspark 1.6.2, I can import the col function with
from pyspark.sql.functions import col
but when I try to look it up in the GitHub source code I find no col function in the functions.py file. How can Python import a function that doesn't exist?
Accepted answer by zero323
It exists. It just isn't explicitly defined. Functions exported from pyspark.sql.functions are thin wrappers around JVM code and, with a few exceptions which require special treatment, are generated automatically using helper methods.
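Assuming pyspark itself is installed, a quick runtime check (a sketch, not part of the original answer) confirms that the attribute really exists, even though it is never written out literally in functions.py:

from pyspark.sql import functions as F

print(hasattr(F, "col"))       # True - the wrapper is created when the module is imported
print(callable(F.col))         # True
print("col" in F.__all__)      # True - it is exported along with the other generated functions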
If you carefully check the source you'll find col listed among the other _functions. This dictionary is further iterated and _create_function is used to generate wrappers. Each generated function is directly assigned to a corresponding name in globals.
Finally, __all__, which defines the list of items exported from the module, simply exports all globals excluding the ones contained in the blacklist.
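Putting those pieces together, here is a minimal sketch of that generation pattern. It is simplified and is not pyspark's actual source; the names _functions and _create_function are only borrowed from it for illustration:

# Simplified sketch of the pattern, not the real pyspark source
_functions = {
    "col": "Returns a Column based on the given column name.",
    "lit": "Creates a Column of literal value.",
}

def _create_function(name, doc=""):
    # The real wrapper delegates to the JVM; here we only illustrate the shape
    def _(col):
        return "{0}({1})".format(name, col)
    _.__name__ = name
    _.__doc__ = doc
    return _

# Assign each generated wrapper to its name in the module globals
for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)

# Export every public name created above
__all__ = sorted(k for k in globals() if not k.startswith("_"))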
If this mechanism is still not clear, you can create a toy example:
1. Create a Python module called foo.py with the following content:

# Creates a function assigned to the name foo
globals()["foo"] = lambda x: "foo {0}".format(x)

# Exports all entries from globals which start with foo
__all__ = [x for x in globals() if x.startswith("foo")]

2. Place it somewhere on the Python path (for example in the working directory).

3. Import foo:

from foo import foo
foo(1)
An undesired side effect of such a metaprogramming approach is that the defined functions might not be recognized by tools that depend purely on static code analysis. This is not a critical issue and can be safely ignored during development.
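If the noise comes from a static checker such as pylint (an assumption; the exact tool and message depend on your setup), one common stopgap is to silence the check at the import site:

from pyspark.sql.functions import col  # pylint: disable=no-name-in-module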
Depending on the IDE, installing type annotations might resolve the problem (see for example zero323/pyspark-stubs#172).
Answered by Dmytro
As of VS Code 1.26.1, this can be solved by modifying the python.linting.pylintArgs setting:
"python.linting.pylintArgs": [
"--generated-members=pyspark.*",
"--extension-pkg-whitelist=pyspark",
"--ignored-modules=pyspark.sql.functions"
]
That issue was explained on github: https://github.com/DonJayamanne/pythonVSCode/issues/1418#issuecomment-411506443
Answered by Vincent Claes
Answered by Thomas
As explained above, pyspark generates some of its functions on the fly, which means that most IDEs cannot detect them properly. However, there is a Python package, pyspark-stubs, that includes a collection of stub files so that type hints, static error detection, code completion, and so on are improved. By just installing it with
pip install pyspark-stubs==x.x.x
(where x.x.x has to be replaced with your pyspark version (2.3.0 in my case, for instance)), col and other functions will be detected, without changing anything in your code, for most IDEs (PyCharm, Visual Studio Code, Atom, Jupyter Notebook, ...)
Answered by AEDWIP
I ran into a similar problem trying to set up a PySpark development environment with Eclipse and PyDev. PySpark uses a dynamic namespace. To get it to work I needed to add PySpark to "force Builtins" as below.