Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors. Original: http://stackoverflow.com/questions/33608526/

Is there a way to filter a field not containing something in a spark dataframe using scala?

Tags: scala, apache-spark, apache-spark-sql

Asked by Dean

Hopefully I'm stupid and this will be easy.

I have a dataframe containing the columns 'url' and 'referrer'.

I want to extract all the referrers that contain the top level domain 'www.mydomain.com' and 'mydomain.co'.

I can use

val filteredDf = unfilteredDf.filter(($"referrer").contains("www.mydomain."))

However, for some reason this also pulls out www.google.co.uk search URLs that happen to contain my web domain. Is there a way, using Scala in Spark, to filter out anything with google in it while keeping the correct results I already have?

Thanks

Dean

Answered by zero323

You can negate a predicate using either not(...) or the ! operator, so all that's left is to add another condition:

import org.apache.spark.sql.functions.not

df.where($"referrer".contains("www.mydomain.") &&
  not($"referrer".contains("google")))

Or chain a separate filter:

df
 .where($"referrer".contains("www.mydomain."))
 .where(!$"referrer".contains("google"))
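
For anyone who wants to try both variants end to end, here is a minimal sketch. It assumes a local SparkSession and a few made-up referrer rows; the object name NegatedFilterSketch, the helper names, and the sample URLs are illustrative and not part of the original question. It uses col("referrer") instead of $"referrer" so the helpers do not need the session implicits in scope.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, not}

// Hypothetical helper object, named only for this illustration.
object NegatedFilterSketch {
  // Single where() with both conditions, negating the second via functions.not.
  def combined(df: DataFrame): DataFrame =
    df.where(col("referrer").contains("www.mydomain.") &&
      not(col("referrer").contains("google")))

  // Two chained where() calls, negating the second via the ! operator on Column.
  def chained(df: DataFrame): DataFrame =
    df.where(col("referrer").contains("www.mydomain."))
      .where(!col("referrer").contains("google"))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("negated-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows: one own-domain referrer, one Google search URL
    // that embeds the domain, and one unrelated site.
    val df = Seq(
      ("/home", "http://www.mydomain.com/landing"),
      ("/home", "http://www.google.co.uk/search?q=www.mydomain.com"),
      ("/home", "http://www.example.org/")
    ).toDF("url", "referrer")

    combined(df).show(false)
    chained(df).show(false)
    spark.stop()
  }
}
```

Both helpers express the same predicate, so they should return the same rows; which form to use is mostly a matter of readability.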

Answered by mgaido

You may use a regex. Here you can find a reference for the usage of regex in Scala. And here you can find some hints about how to create a proper regex for URLs.

Thus in your case you will have something like:

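// A Scala Regex cannot be applied to a Column directly; Column.rlike takes the pattern as a string.
// Put your own regex here; something like the pattern below should work.
// "^" anchors the match at the start, so a URL that merely embeds the domain later on is rejected.
val pattern = """^(https?|ftp)://www\.mydomain\.com?(/\S*)?$"""
val filteredDf = unfilteredDf.filter($"referrer".rlike(pattern))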

This solution requires a bit of work but is the safest one.
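
To illustrate why a regex can be safer than a plain substring check, here is a small sketch that applies an anchored pattern through Column.rlike. The pattern, the object name AnchoredRegexSketch, and the sample rows are assumptions made for this example; adjust the pattern to the hosts you actually want to accept.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, named only for this illustration.
object AnchoredRegexSketch {
  // Assumed pattern: scheme, then www.mydomain.co or www.mydomain.com,
  // then an optional path. rlike looks for a match anywhere in the string,
  // so the explicit ^ and $ anchors are what force a whole-string match.
  val OwnDomainPattern: String = """^(https?|ftp)://www\.mydomain\.com?(/\S*)?$"""

  def keepOwnDomain(df: DataFrame): DataFrame =
    df.filter(col("referrer").rlike(OwnDomainPattern))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("anchored-regex-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows; the Google row embeds the domain in its query string.
    val df = Seq(
      ("/home", "http://www.mydomain.com/landing"),
      ("/home", "https://www.mydomain.co/offers"),
      ("/home", "http://www.google.co.uk/search?q=http://www.mydomain.com/"),
      ("/home", "http://www.example.org/")
    ).toDF("url", "referrer")

    // Substring check: also keeps the Google row.
    df.filter(col("referrer").contains("www.mydomain.")).show(false)
    // Anchored regex: keeps only referrers that start with the domain.
    keepOwnDomain(df).show(false)
    spark.stop()
  }
}
```

With the anchored pattern there is no need to list unwanted sites such as google explicitly, because any referrer that does not begin with the expected host is dropped.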
