Python Spark RDD - Mapping with extra arguments

Note: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/33019420/

Date: 2020-08-19 12:37:11  Source: igfitidea


Tags: python, apache-spark, pyspark, rdd

Asked by Stan

Is it possible to pass extra arguments to the mapping function in PySpark? Specifically, I have the following code:


import json

raw_data_rdd = sc.textFile("data.json", use_unicode=True)
json_data_rdd = raw_data_rdd.map(lambda line: json.loads(line))
mapped_rdd = json_data_rdd.flatMap(processDataLine)

The function processDataLine takes extra arguments in addition to the JSON object:


def processDataLine(dataline, arg1, arg2): ...

How can I pass the extra arguments arg1 and arg2 to the function given to flatMap?


Accepted answer by zero323

  1. You can use an anonymous function, either directly in flatMap:

    json_data_rdd.flatMap(lambda j: processDataLine(j, arg1, arg2))
    

    or to curry processDataLine first:

    f = lambda j: processDataLine(j, arg1, arg2)  # pass the lambda's own parameter j
    json_data_rdd.flatMap(f)
    
  2. You can generate processDataLine like this:

    def processDataLine(arg1, arg2):
        def _processDataLine(dataline):
            return ... # Do something with dataline, arg1, arg2
        return _processDataLine
    
    json_data_rdd.flatMap(processDataLine(arg1, arg2))
    
  3. The toolz library provides a useful curry decorator:

    from toolz.functoolz import curry
    
    @curry
    def processDataLine(arg1, arg2, dataline): 
        return ... # Do something with dataline, arg1, arg2
    
    json_data_rdd.flatMap(processDataLine(arg1, arg2))
    

    Note that I've moved the dataline argument to the last position. This is not required, but this way we don't have to use keyword arguments.

  4. Finally, there is functools.partial, already mentioned by Avihoo Mamka in the comments.
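
For the functools.partial option, a minimal sketch could look like the following. The body of processDataLine and the "text" field are hypothetical stand-ins (the original does not show them), and a plain list is used in place of the RDD so the snippet runs without a Spark context.

```python
from functools import partial


# Hypothetical stand-in for processDataLine: emits one tuple per word in the
# record's "text" field. The field name and return shape are assumptions.
def processDataLine(dataline, arg1, arg2):
    return [(arg1, arg2, word) for word in dataline["text"].split()]


# dataline is the first parameter, so the extra arguments are bound by keyword.
f = partial(processDataLine, arg1="a", arg2=1)

# Local stand-in for json_data_rdd; with Spark this would be
# json_data_rdd.flatMap(f).collect()
records = [{"text": "hello world"}, {"text": "spark"}]
result = [item for rec in records for item in f(rec)]
print(result)
```

With a real RDD the same callable would be passed as json_data_rdd.flatMap(f). Because dataline comes first in this signature, the extras are bound by keyword; if dataline were the last parameter, partial(processDataLine, arg1, arg2) would work positionally.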
