How do I use "not rlike" in spark-sql? (scala)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/34534088/

How do I use "not rlike" in spark-sql?

Tags: scala, apache-spark, apache-spark-sql

Asked by WoodChopper

rlike works fine but not rlike throws an error:

scala> sqlContext.sql("select * from T where columnB rlike '^[0-9]*$'").collect()
res42: Array[org.apache.spark.sql.Row] = Array([412,0], [0,25], [412,25], [0,25])

scala> sqlContext.sql("select * from T where columnB not rlike '^[0-9]*$'").collect()
java.lang.RuntimeException: [1.35] failure: ``in'' expected but `rlike' found


val df = sc.parallelize(Seq(
  (412, 0),
  (0, 25), 
  (412, 25), 
  (0, 25)
)).toDF("columnA", "columnB")

Or is it a continuation of issue https://issues.apache.org/jira/browse/SPARK-4207?

Answered by Srini

There is no such thing as not rlike, but regex has something called a negative lookahead, which lets you match the strings that do not contain a given pattern.

For the above query you can use a regex like the one below. Say you want to keep only the rows where columnB contains no non-zero digit (i.e. it consists of zeros only).

Then you can do it like this:

sqlContext.sql("select * from T where columnB rlike '^(?!.*[1-9]).*$'").collect() 
Result: Array[org.apache.spark.sql.Row] = Array([412,0])

The overall point is that you have to negate the match with the regex itself, not with rlike. rlike simply matches whatever regex you give it: if your regex expresses a negative match, that is what gets applied; if it expresses a positive match, that is what gets applied instead.
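
Applied to the original query, one way to get the effect of not rlike '^[0-9]*$' with a plain rlike is to wrap the whole pattern in a negative lookahead. This is only a sketch based on the idea above, not something taken from the answer:

// hedged sketch: '^(?![0-9]*$).*' matches any value that is NOT made up entirely of digits
sqlContext.sql("select * from T where columnB rlike '^(?![0-9]*$).*'").collect()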

Answered by Highbrainer

I know your question is getting a bit old, but just in case: have you simply tried Scala's unary "!" operator?

In Java you would do something like this:

DataFrame df = sqlContext.table("T");
DataFrame notLikeDf = df.filter(
  df.col("columnB").rlike("^[0-9]*$").unary_$bang()
);
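
In Scala the same negation is shorter, because "!" can be written directly. A minimal sketch, assuming the sqlContext and table T from the question:

// negate the rlike match with Scala's unary ! on the Column
val df = sqlContext.table("T")
val notLikeDf = df.filter(!df.col("columnB").rlike("^[0-9]*$"))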

Answered by dre-hh

The answers above suggest using a negative lookahead. That works in some cases, but regexes were not designed for efficient negative matching: such regexes are error-prone and hard to read.

Spark has supported "not rlike" since version 2.0.

 # given 'url' is a column of the dataframe
 df.filter("""url not rlike "stackoverflow.com"""")

The only usage known to me is the SQL string expression shown above. I could not find a "not" SQL DSL function in the Python API; there might be one in Scala.

Answered by bikashg

In pyspark, I did it like this:

df = load_your_df()

matching_regex = "yourRegexString"

matching_df = df.filter(df.fieldName.rlike(matching_regex))

non_matching_df = df.subtract(matching_df)
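
For reference, a rough Scala sketch of the same split-and-subtract idea (assuming Spark 2.x, where the Dataset set difference is called except, and reusing the placeholder names from above):

// split the DataFrame into matching and non-matching rows by regex
val matchingDf = df.filter(df("fieldName").rlike("yourRegexString"))
val nonMatchingDf = df.except(matchingDf)  // rows that did not match the regex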