Original question: http://stackoverflow.com/questions/33608526/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Is there a way to filter a field not containing something in a spark dataframe using scala?
Asked by Dean
Hopefully I'm stupid and this will be easy.
I have a dataframe containing the columns 'url' and 'referrer'.
I want to extract all the referrers that contain the top level domain 'www.mydomain.com' and 'mydomain.co'.
I can use
val filteredDf = unfilteredDf.filter(($"referrer").contains("www.mydomain."))
However, this also pulls out www.google.co.uk search URLs, because they happen to contain my web domain as well. Is there a way, using Scala in Spark, to filter out anything with google in it while keeping the correct results I already have?
Thanks
Dean
Answered by zero323
You can negate a predicate using either not or !, so all that's left is to add another condition:
import org.apache.spark.sql.functions.not
df.where($"referrer".contains("www.mydomain.") &&
not($"referrer".contains("google")))
or use a separate filter:
df
.where($"referrer".contains("www.mydomain."))
.where(!$"referrer".contains("google"))
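For anyone who wants to try this end to end, here is a minimal sketch. The local SparkSession, the sample rows, and the names ReferrerFilterSketch and keepOwnReferrers are assumptions made up for illustration; only the where condition comes from the answer above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, not}

object ReferrerFilterSketch {
  // Keep rows whose referrer mentions the site but does not mention "google".
  // (hypothetical helper name; the condition mirrors the answer above)
  def keepOwnReferrers(df: DataFrame): DataFrame =
    df.where(col("referrer").contains("www.mydomain.") &&
      not(col("referrer").contains("google")))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("referrer-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data with the two columns from the question.
    val unfilteredDf = Seq(
      ("/a", "http://www.mydomain.com/home"),
      ("/b", "https://www.google.co.uk/search?q=www.mydomain.com"),
      ("/c", "http://other.example.org/")
    ).toDF("url", "referrer")

    // Only the first row passes both conditions.
    keepOwnReferrers(unfilteredDf).show(false)

    spark.stop()
  }
}
```

Writing !col("referrer").contains("google") in place of not(...) should behave the same way. Note that rows with a null referrer are dropped by either form, since the condition evaluates to null rather than true.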
Answered by mgaido
You may use a regex. Here you can find a reference for the usage of regex in Scala. And here you can find some hints about how to create a proper regex for URLs.
Thus in your case you will have something like:
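// Column.rlike applies a Java regex to each value; Scala's Regex.findFirstIn works on Strings, not Columns
// something like ^(https?|ftp)://www.mydomain.com?(/[^\s]*)? should work
// (anchored with ^ because rlike matches anywhere in the string)
val pattern = "PUT_YOUR_REGEX_HERE"
val filteredDf = unfilteredDf.filter($"referrer".rlike(pattern))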
This solution requires a bit of work but is the safest one.
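To make the regex idea concrete, here is a small sketch on plain strings, with no Spark involved. The pattern, the sample URLs, and the names ReferrerRegexSketch, ownHost and isOwnReferrer are hypothetical. The pattern is one possible choice, not the answer's own regex: it requires the host itself to be www.mydomain.com or mydomain.co, so a search URL that only mentions the domain in its query string does not match.

```scala
import scala.util.matching.Regex

object ReferrerRegexSketch {
  // Hypothetical pattern: scheme, then the host must be exactly
  // www.mydomain.com or mydomain.co, then an optional port/path/query part.
  val ownHost: Regex =
    """^(?:https?|ftp)://(?:www\.mydomain\.com|mydomain\.co)(?:[:/?#]\S*)?$""".r

  // True when the referrer's host is one of the two domains above.
  def isOwnReferrer(referrer: String): Boolean =
    ownHost.findFirstIn(referrer).isDefined

  def main(args: Array[String]): Unit = {
    // Made-up sample referrers.
    val samples = Seq(
      "http://www.mydomain.com/home",
      "http://mydomain.co",
      "https://www.google.co.uk/search?q=www.mydomain.com",
      "http://mydomain.co.uk/page"
    )
    samples.foreach(s => println(s"$s -> ${isOwnReferrer(s)}"))
  }
}
```

With these samples, the first two referrers match and the last two do not. The same pattern string could presumably be passed to Column.rlike on the referrer column, which uses Java regex syntax, though that combination is untested here.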

