Python Spark RDD - Mapping with extra arguments
原文地址: http://stackoverflow.com/questions/33019420/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Spark RDD - Mapping with extra arguments
Asked by Stan
Is it possible to pass extra arguments to the mapping function in PySpark? Specifically, I have the following code:
import json
# sc is assumed to be an existing SparkContext
raw_data_rdd = sc.textFile("data.json", use_unicode=True)
json_data_rdd = raw_data_rdd.map(lambda line: json.loads(line))
mapped_rdd = json_data_rdd.flatMap(processDataLine)
The function processDataLine takes extra arguments in addition to the JSON object:
def processDataLine(dataline, arg1, arg2): ...
How can I pass the extra arguments arg1 and arg2 to the flatMap function?
Accepted answer by zero323
You can use an anonymous function either directly in a flatMap:
json_data_rdd.flatMap(lambda j: processDataLine(j, arg1, arg2))
or to curry processDataLine:
f = lambda j: processDataLine(j, arg1, arg2)
json_data_rdd.flatMap(f)
You can generate processDataLine like this:
def processDataLine(arg1, arg2):
    def _processDataLine(dataline):
        return ...  # Do something with dataline, arg1, arg2
    return _processDataLine

json_data_rdd.flatMap(processDataLine(arg1, arg2))
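To make the factory pattern concrete, here is a minimal sketch that runs without a cluster. The name make_processor, the body of the inner function (emitting one record per requested field, tagged with arg2), and the sample data are all hypothetical. A plain list stands in for json_data_rdd, and itertools.chain.from_iterable imitates the one-level flattening that flatMap performs.

```python
import json
from itertools import chain


def make_processor(arg1, arg2):
    """Return a one-argument function with arg1 and arg2 captured in a closure."""
    def _process(dataline):
        # Hypothetical body: emit (arg2, field, value) for each requested field
        return [(arg2, key, dataline[key]) for key in arg1 if key in dataline]
    return _process


# Stand-in for json_data_rdd: parsed JSON lines held in a local list
lines = ['{"a": 1, "b": 2}', '{"a": 3, "c": 4}']
records = [json.loads(line) for line in lines]

process = make_processor(["a", "b"], "tag")

# Local imitation of flatMap: apply the function, then flatten one level.
# With Spark this would be json_data_rdd.flatMap(process).
result = list(chain.from_iterable(process(r) for r in records))
print(result)
```

Spark serializes the function together with the values it captures before shipping it to the executors, so arg1 and arg2 should be picklable and reasonably small.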
The toolz library provides a useful curry decorator:
from toolz.functoolz import curry

@curry
def processDataLine(arg1, arg2, dataline):
    return ...  # Do something with dataline, arg1, arg2

json_data_rdd.flatMap(processDataLine(arg1, arg2))
Note that I've pushed the dataline argument to the last position. This is not required, but this way we don't have to use keyword args. Finally, there is functools.partial, already mentioned by Avihoo Mamka in the comments.
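Since functools.partial is only mentioned by name, a minimal sketch may help. It assumes the question's original signature, with dataline first, which is why arg1 and arg2 are bound by keyword. The function body and the sample records are hypothetical, and a plain list stands in for the RDD so the snippet runs locally.

```python
from functools import partial
from itertools import chain


def processDataLine(dataline, arg1, arg2):
    # Hypothetical body: repeat each value arg1 times, prefixed with arg2
    return [f"{arg2}{value}" for value in dataline["values"] for _ in range(arg1)]


# Bind the extra arguments by keyword; dataline stays as the only free parameter
f = partial(processDataLine, arg1=2, arg2="x")

records = [{"values": [1, 2]}, {"values": [3]}]

# Local stand-in for json_data_rdd.flatMap(f)
result = list(chain.from_iterable(map(f, records)))
print(result)
```

If dataline were the last parameter instead, partial(processDataLine, arg1, arg2) would work with positional arguments.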