java spark - filter within map
Original question: http://stackoverflow.com/questions/28843591/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
spark - filter within map
Asked by nir
I am trying to filter inside a map function. In classic MapReduce, I would do this by having the mapper write nothing to the context when the filter criteria are met. How can I achieve something similar with Spark? I can't seem to return null from the map function, as it fails in the shuffle step. I could use the filter function, but that seems like an unnecessary extra iteration over the data set when I could perform the same task during the map. I could also output null with a dummy key, but that's a bad workaround.
Answered by maasg
There are a few options:
rdd.flatMap flattens a Traversable collection into the RDD. To pick elements, you typically return an Option as the result of the transformation: Some to keep an element, None to drop it.
rdd.flatMap(elem => if (filter(elem)) Some(f(elem)) else None)
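To make the one-liner above concrete, here is a minimal sketch in local mode. The names keep and transform are hypothetical stand-ins for filter and f, and the input range is made up for illustration; it assumes Spark is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FlatMapFilterSketch {
  // Hypothetical predicate and transformation, standing in for filter and f above.
  def keep(n: Int): Boolean = n % 2 == 0
  def transform(n: Int): String = s"even-$n"

  def main(args: Array[String]): Unit = {
    // Local mode, so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("flatMap-filter-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 10)
      // Some(...) contributes one element and None contributes nothing,
      // so no null value ever reaches a later shuffle.
      val result = rdd.flatMap(n => if (keep(n)) Some(transform(n)) else None)
      result.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

This is the closest match to a mapper that simply writes nothing for rejected records: the function decides per element whether anything is emitted.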
rdd.collect(pf: PartialFunction) allows you to provide a partial function that can filter and transform elements from the original RDD. You can use the full power of pattern matching with this method.
rdd.collect{case t if (cond(t)) => f(t)}
rdd.collect{case t:GivenType => f(t)}
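Both forms above (a guard and a type pattern) can be tried with a sketch like the following. The partial functions doubledInts and upperStrings and the sample data are hypothetical, chosen only for illustration; it assumes Spark is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CollectPartialFunctionSketch {
  // Type pattern plus guard: defined only for Ints greater than 1.
  val doubledInts: PartialFunction[Any, Int] = { case n: Int if n > 1 => n * 2 }
  // Type pattern only: defined only for Strings.
  val upperStrings: PartialFunction[Any, String] = { case s: String => s.toUpperCase }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("collect-pf-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val mixed = sc.parallelize(Seq[Any](1, "two", 3, "four", 5.0))
      // collect(pf) is a transformation: elements where pf is not defined are dropped.
      val ints = mixed.collect(doubledInts)
      val strings = mixed.collect(upperStrings)
      // The no-argument collect() is a different overload: an action that
      // returns the elements to the driver as an Array.
      println(ints.collect().mkString(","))
      println(strings.collect().mkString(","))
    } finally {
      sc.stop()
    }
  }
}
```

Note that the two collect overloads are easy to confuse: only the one taking a partial function filters and transforms.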
As Dean Wampler mentions in the comments, rdd.map(f(_)).filter(cond(_)) might be just as good as, or even faster than, the other more 'terse' options mentioned above.
Here f is a transformation (or map) function.
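On the concern that a separate filter means a second iteration over the data set: Spark pipelines narrow transformations such as map and filter within a single stage, so each record should pass through both functions in one pass over its partition. The sketch below is only an analogy using a plain Scala Iterator, not Spark itself; trace and the sample values are hypothetical.

```scala
import scala.collection.mutable.ListBuffer

object PipelinedMapFilterSketch {
  // Returns the order in which the two functions ran, plus the surviving values.
  def trace(input: Seq[Int]): (List[String], List[Int]) = {
    val log = ListBuffer.empty[String]
    val pipeline = input.iterator
      .map { n => log += s"map($n)"; n * 10 }        // stands in for f
      .filter { m => log += s"filter($m)"; m > 20 }  // stands in for cond
    // Nothing has run yet; consuming the iterator pulls each element
    // through map and then filter before the next element is touched.
    val kept = pipeline.toList
    (log.toList, kept)
  }

  def main(args: Array[String]): Unit = {
    val (steps, kept) = trace(Seq(1, 2, 3, 4))
    println(steps.mkString(" "))
    println(kept)
  }
}
```

The log interleaves map and filter calls element by element, rather than showing all map calls followed by all filter calls, which is why the map-then-filter form need not cost an extra pass.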