Python Spark RDD - Mapping with extra arguments
Original URL: http://stackoverflow.com/questions/33019420/
Note: this page reproduces a popular Stack Overflow question and its accepted answer under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me) on Stack Overflow.
Asked by Stan
Is it possible to pass extra arguments to the mapping function in PySpark? Specifically, I have the following code recipe:
import json

# sc is an existing SparkContext
raw_data_rdd = sc.textFile("data.json", use_unicode=True)
json_data_rdd = raw_data_rdd.map(lambda line: json.loads(line))
mapped_rdd = json_data_rdd.flatMap(processDataLine)
The function processDataLine takes extra arguments in addition to the JSON object:
def processDataLine(dataline, arg1, arg2):
    ...
How can I pass the extra arguments arg1 and arg2 to the flatMap function?
Accepted answer by zero323
You can use an anonymous function either directly in a flatMap:
json_data_rdd.flatMap(lambda j: processDataLine(j, arg1, arg2))
or to curry processDataLine:
f = lambda j: processDataLine(j, arg1, arg2)
json_data_rdd.flatMap(f)
You can also generate processDataLine like this:
def processDataLine(arg1, arg2):
    def _processDataLine(dataline):
        return ...  # Do something with dataline, arg1, arg2
    return _processDataLine

json_data_rdd.flatMap(processDataLine(arg1, arg2))
The toolz library provides a useful curry decorator:
from toolz.functoolz import curry

@curry
def processDataLine(arg1, arg2, dataline):
    return ...  # Do something with dataline, arg1, arg2

json_data_rdd.flatMap(processDataLine(arg1, arg2))
Note that I've pushed the dataline argument to the last position. This is not required, but this way we don't have to use keyword args.
Finally, there is functools.partial, already mentioned by Avihoo Mamka in the comments.
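The answer mentions functools.partial without showing it, so here is a minimal sketch of that option. The body of processDataLine, the argument values "id" and "name", and the sample records are all hypothetical and only serve the illustration. A plain list with itertools.chain stands in for the RDD and flatMap so the snippet runs without Spark.

```python
from functools import partial
from itertools import chain


def processDataLine(dataline, arg1, arg2):
    # Hypothetical body, for illustration only: emit one (key, value) pair
    # for each of the two extra arguments that is present in the record.
    return [(key, dataline[key]) for key in (arg1, arg2) if key in dataline]


# dataline is the first positional parameter, so the extra arguments are
# bound by keyword; the resulting callable takes only the record.
process = partial(processDataLine, arg1="id", arg2="name")

# With Spark this would be passed as: json_data_rdd.flatMap(process)
# Here a plain list and itertools.chain stand in for the RDD and flatMap.
records = [{"id": 1, "name": "a"}, {"id": 2}]
flattened = list(chain.from_iterable(map(process, records)))
print(flattened)  # [('id', 1), ('name', 'a'), ('id', 2)]
```

If dataline were the last parameter instead, as in the curry example, the extra arguments could be bound positionally with partial(processDataLine, arg1, arg2).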

