How do I use "not rlike" in spark-sql? (scala)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/34534088/

How do I use "not rlike" in spark-sql?

Tags: scala, apache-spark, apache-spark-sql

Asked by WoodChopper

rlike works fine but not rlike throws an error:

scala> sqlContext.sql("select * from T where columnB rlike '^[0-9]*$'").collect()
res42: Array[org.apache.spark.sql.Row] = Array([412,0], [0,25], [412,25], [0,25])

scala> sqlContext.sql("select * from T where columnB not rlike '^[0-9]*$'").collect()
java.lang.RuntimeException: [1.35] failure: ``in'' expected but `rlike' found


val df = sc.parallelize(Seq(
  (412, 0),
  (0, 25), 
  (412, 25), 
  (0, 25)
)).toDF("columnA", "columnB")

Or is it a continuation of issue https://issues.apache.org/jira/browse/SPARK-4207?

Answered by Srini

There is no such thing as not rlike, but regex has something called a negative lookahead, which lets you match the strings that do not contain a given pattern.

For the above query you can use a regex like the one below. Say you want to keep only the rows where columnB contains no non-zero digit (i.e. it consists of zeros only).

Then you can do it like this:

sqlContext.sql("select * from T where columnB rlike '^(?!.*[1-9]).*$'").collect() 
Result: Array[org.apache.spark.sql.Row] = Array([412,0])

The overall point is that you have to negate the match with the regex itself, not with rlike. rlike simply matches whatever regex you give it: if your regex expresses a negative match, that is what gets applied; if it expresses a positive match, that is what gets applied instead.
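
Applied to the original query, one way to get the effect of not rlike '^[0-9]*$' with a plain rlike is to wrap the whole pattern in a negative lookahead. This is only a sketch based on the idea above, not something taken from the answer:

// hedged sketch: '^(?![0-9]*$).*' matches any value that is NOT made up entirely of digits
sqlContext.sql("select * from T where columnB rlike '^(?![0-9]*$).*'").collect()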

Answered by Highbrainer

I know your question is getting a bit old, but just in case: have you simply tried Scala's unary "!" operator?

In Java you would do something like this:

DataFrame df = sqlContext.table("T");
DataFrame notLikeDf = df.filter(
  df.col("columnB").rlike("^[0-9]*$").unary_$bang()
);
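
In Scala the same negation is shorter, because "!" can be written directly. A minimal sketch, assuming the sqlContext and table T from the question:

// negate the rlike match with Scala's unary ! on the Column
val df = sqlContext.table("T")
val notLikeDf = df.filter(!df.col("columnB").rlike("^[0-9]*$"))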

Answered by dre-hh

The answers above suggest using a negative lookahead. That works in some cases, but regexes were not designed for efficient negative matching: such regexes are error-prone and hard to read.

Spark has supported "not rlike" since version 2.0.

 # given 'url' is a column of the dataframe
 df.filter("""url not rlike "stackoverflow.com"""")

The only usage known to me is the SQL string expression shown above. I could not find a "not" SQL DSL function in the Python API; there might be one in Scala.

Answered by bikashg

In pyspark, I did it like this:

df = load_your_df()

matching_regex = "yourRegexString"

matching_df = df.filter(df.fieldName.rlike(matching_regex))

non_matching_df = df.subtract(matching_df)
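
For reference, a rough Scala sketch of the same split-and-subtract idea (assuming Spark 2.x, where the Dataset set difference is called except, and reusing the placeholder names from above):

// split the DataFrame into matching and non-matching rows by regex
val matchingDf = df.filter(df("fieldName").rlike("yourRegexString"))
val nonMatchingDf = df.except(matchingDf)  // rows that did not match the regex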