Python Spark RDD - map with extra arguments

Question

Asked by Stan

Is it possible to pass extra arguments to the mapping function in PySpark? Specifically, I have the following code:

import json

# sc is an existing SparkContext
raw_data_rdd = sc.textFile("data.json", use_unicode=True)
json_data_rdd = raw_data_rdd.map(lambda line: json.loads(line))
mapped_rdd = json_data_rdd.flatMap(processDataLine)

The function processDataLine takes extra arguments in addition to the JSON object:

def processDataLine(dataline, arg1, arg2):
    ...

How can I pass the extra arguments arg1 and arg2 to the function used in flatMap?

Answer 1

Accepted answer by zero323

You can use an anonymous function, either directly in flatMap:

json_data_rdd.flatMap(lambda j: processDataLine(j, arg1, arg2))

or to curry processDataLine first:

f = lambda j: processDataLine(j, arg1, arg2)
json_data_rdd.flatMap(f)

You can also write processDataLine so that it returns the mapping function:

def processDataLine(arg1, arg2):
    def _processDataLine(dataline):
        return ... # Do something with dataline, arg1, arg2
    return _processDataLine

json_data_rdd.flatMap(processDataLine(arg1, arg2))
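
As a concrete sketch of the factory pattern above, the following fills in a hypothetical body. The names make_processor, prefix, and repeat are assumptions for illustration (prefix and repeat play the role of arg1 and arg2), and flat_map is a local stand-in for RDD.flatMap so the example runs without a Spark cluster.

```python
from itertools import chain


def make_processor(prefix, repeat):
    """Return a one-argument function suitable for flatMap.

    prefix and repeat stand in for arg1 and arg2 (hypothetical).
    """
    def _process(dataline):
        # flatMap expects an iterable of output records per input record.
        return [f"{prefix}:{dataline['id']}"] * repeat
    return _process


def flat_map(func, records):
    """Local stand-in for RDD.flatMap: apply func, then flatten one level."""
    return list(chain.from_iterable(func(r) for r in records))


records = [{"id": 1}, {"id": 2}]
result = flat_map(make_processor("row", 2), records)
print(result)
```

With a real RDD, the equivalent call would be json_data_rdd.flatMap(make_processor("row", 2)).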

The toolz library provides a useful curry decorator:

from toolz.functoolz import curry

@curry
def processDataLine(arg1, arg2, dataline): 
    return ... # Do something with dataline, arg1, arg2

json_data_rdd.flatMap(processDataLine(arg1, arg2))

Note that I've moved the dataline argument to the last position. This is not required, but this way we don't have to use keyword arguments.

Finally, there is functools.partial, already mentioned by Avihoo Mamka in the comments.
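
The answer does not show functools.partial in use, so here is a minimal sketch. It keeps dataline as the first parameter, as in the question, and binds the extra arguments by keyword. The function body, the values "a" and "b", and the sample records are hypothetical, and the flattening step is a local stand-in for flatMap so the example runs without Spark.

```python
from functools import partial
from itertools import chain


def process_data_line(dataline, arg1, arg2):
    # Hypothetical body: emit one record per extra argument.
    return [(dataline["id"], arg1), (dataline["id"], arg2)]


# dataline is the first parameter, so bind the extras by keyword;
# the resulting callable takes only the remaining dataline argument.
bound = partial(process_data_line, arg1="a", arg2="b")

# Local stand-in for json_data_rdd.flatMap(bound).
records = [{"id": 1}, {"id": 2}]
flattened = list(chain.from_iterable(bound(r) for r in records))
print(flattened)
```

If dataline were the last parameter instead, the extras could be bound positionally, for example partial(process_data_line, "a", "b").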
