scala - How do I check for equality using a Spark DataFrame without a SQL query?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/31324539/

Date: 2020-10-22 07:19:59  Source: igfitidea

How do I check for equality using Spark Dataframe without SQL Query?

scala, apache-spark, dataframe, apache-spark-sql

Asked by Instinct

I want to select a column that equals a certain value. I am doing this in Scala and having a little trouble.


Here's my code:


df.select(df("state")==="TX").show()

This returns the state column with boolean values instead of just the TX rows.


I've also tried:


df.select(df("state")=="TX").show() 

but this doesn't work either.


Answered by user3487888

I had the same issue, and the following syntax worked for me:


df.filter(df("state")==="TX").show()

I'm using Spark 1.6.


Answered by Jegan

There is another simple SQL-like option. The following should also work with Spark 1.6:


df.filter("state = 'TX'")

This is a new way of specifying SQL-like filters. For a full list of supported operators, check out this class.

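The string syntax can also express compound conditions with ordinary SQL operators; a small sketch, assuming the same df as above:

```scala
// Compound conditions in the same SQL-string style:
df.filter("state = 'TX' OR state = 'CA'")
df.filter("state IN ('TX', 'CA')")
df.filter("state != 'TX'")
```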

Answered by Justin Pihony

You should be using where; select is a projection that returns the output of the statement, which is why you get boolean values. where is a filter that keeps the structure of the DataFrame, but only keeps the rows where the filter condition holds.


Along the same lines, per the documentation, you can write this in 3 different ways:


// The following are equivalent:
peopleDf.filter($"age" > 15)
peopleDf.where($"age" > 15)
peopleDf($"age" > 15)

Answered by Srini

df.filter($"state" like "T%%") for pattern matching


df.filter($"state" === "TX") or df.filter("state = 'TX'") for equality


Answered by dman

To get the negation, do this ...


df.filter(not( ..expression.. ))

e.g.


df.filter(not($"state" === "TX"))
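Since Spark 2.0, the Column class also defines a =!= operator for inequality, so the negation can be written more directly; a sketch assuming Spark 2.0+ and the df from the earlier answers:

```scala
// Equivalent ways to express "state is not TX" (Spark 2.0+):
df.filter(not($"state" === "TX"))  // explicit not(...)
df.filter($"state" =!= "TX")       // Column's inequality operator
df.filter(!($"state" === "TX"))    // unary ! on a Column expression
```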

Answered by Farshad Javadi

Works on Spark 2.x:


import sqlContext.implicits._
df.filter($"state" === "TX")

If it needs to be compared against a variable (note that var is a reserved word in Scala, so a legal identifier such as myVar must be used):

import sqlContext.implicits._
df.filter($"state" === myVar)

Note: import sqlContext.implicits._ is required for the $ column syntax.


Answered by Phani

We can chain multiple filter/where conditions on a DataFrame.


For example:


table1_df
  .filter($"Col_1_name" === "buddy")             // check for equal to string
  .filter($"Col_2_name" === "A")
  .filter(not($"Col_2_name".contains(" .sql")))  // filter out strings that are not relevant
  .filter("Col_2_name is not null")              // null filter
  .take(5).foreach(println)
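Chained .filter calls are ANDed together. The same composition rule can be checked locally with plain Scala collections; a minimal sketch using hypothetical sample data:

```scala
// Each successive filter narrows the previous result, so chaining
// behaves like a single combined AND condition.
case class Row(col1: String, col2: String)

val rows = Seq(
  Row("buddy", "A"),
  Row("buddy", "B"),
  Row("other", "A")
)

val chained  = rows.filter(_.col1 == "buddy").filter(_.col2 == "A")
val combined = rows.filter(r => r.col1 == "buddy" && r.col2 == "A")

println(chained == combined)  // true: both keep only Row("buddy", "A")
```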

Answered by Ram Ghadiyaram

Here is a complete example using Spark 2.2+, reading the data from JSON:


val myjson = "[{\"name\":\"Alabama\",\"abbreviation\":\"AL\"},{\"name\":\"Alaska\",\"abbreviation\":\"AK\"},{\"name\":\"American Samoa\",\"abbreviation\":\"AS\"},{\"name\":\"Arizona\",\"abbreviation\":\"AZ\"},{\"name\":\"Arkansas\",\"abbreviation\":\"AR\"},{\"name\":\"California\",\"abbreviation\":\"CA\"},{\"name\":\"Colorado\",\"abbreviation\":\"CO\"},{\"name\":\"Connecticut\",\"abbreviation\":\"CT\"},{\"name\":\"Delaware\",\"abbreviation\":\"DE\"},{\"name\":\"District Of Columbia\",\"abbreviation\":\"DC\"},{\"name\":\"Federated States Of Micronesia\",\"abbreviation\":\"FM\"},{\"name\":\"Florida\",\"abbreviation\":\"FL\"},{\"name\":\"Georgia\",\"abbreviation\":\"GA\"},{\"name\":\"Guam\",\"abbreviation\":\"GU\"},{\"name\":\"Hawaii\",\"abbreviation\":\"HI\"},{\"name\":\"Idaho\",\"abbreviation\":\"ID\"},{\"name\":\"Illinois\",\"abbreviation\":\"IL\"},{\"name\":\"Indiana\",\"abbreviation\":\"IN\"},{\"name\":\"Iowa\",\"abbreviation\":\"IA\"},{\"name\":\"Kansas\",\"abbreviation\":\"KS\"},{\"name\":\"Kentucky\",\"abbreviation\":\"KY\"},{\"name\":\"Louisiana\",\"abbreviation\":\"LA\"},{\"name\":\"Maine\",\"abbreviation\":\"ME\"},{\"name\":\"Marshall Islands\",\"abbreviation\":\"MH\"},{\"name\":\"Maryland\",\"abbreviation\":\"MD\"},{\"name\":\"Massachusetts\",\"abbreviation\":\"MA\"},{\"name\":\"Michigan\",\"abbreviation\":\"MI\"},{\"name\":\"Minnesota\",\"abbreviation\":\"MN\"},{\"name\":\"Mississippi\",\"abbreviation\":\"MS\"},{\"name\":\"Missouri\",\"abbreviation\":\"MO\"},{\"name\":\"Montana\",\"abbreviation\":\"MT\"},{\"name\":\"Nebraska\",\"abbreviation\":\"NE\"},{\"name\":\"Nevada\",\"abbreviation\":\"NV\"},{\"name\":\"New Hampshire\",\"abbreviation\":\"NH\"},{\"name\":\"New Jersey\",\"abbreviation\":\"NJ\"},{\"name\":\"New Mexico\",\"abbreviation\":\"NM\"},{\"name\":\"New York\",\"abbreviation\":\"NY\"},{\"name\":\"North Carolina\",\"abbreviation\":\"NC\"},{\"name\":\"North Dakota\",\"abbreviation\":\"ND\"},{\"name\":\"Northern Mariana 
Islands\",\"abbreviation\":\"MP\"},{\"name\":\"Ohio\",\"abbreviation\":\"OH\"},{\"name\":\"Oklahoma\",\"abbreviation\":\"OK\"},{\"name\":\"Oregon\",\"abbreviation\":\"OR\"},{\"name\":\"Palau\",\"abbreviation\":\"PW\"},{\"name\":\"Pennsylvania\",\"abbreviation\":\"PA\"},{\"name\":\"Puerto Rico\",\"abbreviation\":\"PR\"},{\"name\":\"Rhode Island\",\"abbreviation\":\"RI\"},{\"name\":\"South Carolina\",\"abbreviation\":\"SC\"},{\"name\":\"South Dakota\",\"abbreviation\":\"SD\"},{\"name\":\"Tennessee\",\"abbreviation\":\"TN\"},{\"name\":\"Texas\",\"abbreviation\":\"TX\"},{\"name\":\"Utah\",\"abbreviation\":\"UT\"},{\"name\":\"Vermont\",\"abbreviation\":\"VT\"},{\"name\":\"Virgin Islands\",\"abbreviation\":\"VI\"},{\"name\":\"Virginia\",\"abbreviation\":\"VA\"},{\"name\":\"Washington\",\"abbreviation\":\"WA\"},{\"name\":\"West Virginia\",\"abbreviation\":\"WV\"},{\"name\":\"Wisconsin\",\"abbreviation\":\"WI\"},{\"name\":\"Wyoming\",\"abbreviation\":\"WY\"}]"
import spark.implicits._
val df = spark.read.json(Seq(myjson).toDS)
df.show

    scala> df.show
    +------------+--------------------+
    |abbreviation|                name|
    +------------+--------------------+
    |          AL|             Alabama|
    |          AK|              Alaska|
    |          AS|      American Samoa|
    |          AZ|             Arizona|
    |          AR|            Arkansas|
    |          CA|          California|
    |          CO|            Colorado|
    |          CT|         Connecticut|
    |          DE|            Delaware|
    |          DC|District Of Columbia|
    |          FM|Federated States ...|
    |          FL|             Florida|
    |          GA|             Georgia|
    |          GU|                Guam|
    |          HI|              Hawaii|
    |          ID|               Idaho|
    |          IL|            Illinois|
    |          IN|             Indiana|
    |          IA|                Iowa|
    |          KS|              Kansas|
    +------------+--------------------+

    // equals matching
    scala> df.filter(df("abbreviation") === "TX").show
    +------------+-----+
    |abbreviation| name|
    +------------+-----+
    |          TX|Texas|
    +------------+-----+
    // or using lit

    scala> df.filter(df("abbreviation") === lit("TX")).show
    +------------+-----+
    |abbreviation| name|
    +------------+-----+
    |          TX|Texas|
    +------------+-----+

    //not expression
    scala> df.filter(not(df("abbreviation") === "TX")).show
    +------------+--------------------+
    |abbreviation|                name|
    +------------+--------------------+
    |          AL|             Alabama|
    |          AK|              Alaska|
    |          AS|      American Samoa|
    |          AZ|             Arizona|
    |          AR|            Arkansas|
    |          CA|          California|
    |          CO|            Colorado|
    |          CT|         Connecticut|
    |          DE|            Delaware|
    |          DC|District Of Columbia|
    |          FM|Federated States ...|
    |          FL|             Florida|
    |          GA|             Georgia|
    |          GU|                Guam|
    |          HI|              Hawaii|
    |          ID|               Idaho|
    |          IL|            Illinois|
    |          IN|             Indiana|
    |          IA|                Iowa|
    |          KS|              Kansas|
    +------------+--------------------+
    only showing top 20 rows

Answered by Powers

Let's create a sample dataset and do a deep dive into exactly why OP's code didn't work.


Here's our sample data:


val df = Seq(
  ("Rockets", 2, "TX"),
  ("Warriors", 6, "CA"),
  ("Spurs", 5, "TX"),
  ("Knicks", 2, "NY")
).toDF("team_name", "num_championships", "state")

We can pretty-print our dataset with the show() method:


+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
|  Rockets|                2|   TX|
| Warriors|                6|   CA|
|    Spurs|                5|   TX|
|   Knicks|                2|   NY|
+---------+-----------------+-----+

Let's examine the results of df.select(df("state")==="TX").show():


+------------+
|(state = TX)|
+------------+
|        true|
|       false|
|        true|
|       false|
+------------+

It's easier to understand this result by simply appending a column with df.withColumn("is_state_tx", df("state")==="TX").show():


+---------+-----------------+-----+-----------+
|team_name|num_championships|state|is_state_tx|
+---------+-----------------+-----+-----------+
|  Rockets|                2|   TX|       true|
| Warriors|                6|   CA|      false|
|    Spurs|                5|   TX|       true|
|   Knicks|                2|   NY|      false|
+---------+-----------------+-----+-----------+

The other code OP tried (df.select(df("state")=="TX").show()) returns this error:


<console>:27: error: overloaded method value select with alternatives:
  [U1](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1])org.apache.spark.sql.Dataset[U1] <and>
  (col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
 cannot be applied to (Boolean)
       df.select(df("state")=="TX").show()
          ^

The === operator is defined in the Column class. Column does not override Scala's == operator, so df("state")=="TX" compares the Column object to a String and returns a plain Boolean, which select cannot accept; that's why this code errors out. Read this blog for more background information about the Spark Column class.

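The difference is easy to reproduce without Spark. A minimal sketch, using a made-up MiniColumn class (not Spark's actual implementation), shows why a custom === operator is needed: Scala's built-in == always yields a plain Boolean, while a user-defined === can return an expression object:

```scala
// A toy stand-in for Spark's Column class (hypothetical, for illustration).
case class Expr(description: String)

class MiniColumn(name: String) {
  // A custom operator may return anything, e.g. an expression object
  // that a query engine could interpret later.
  def ===(value: Any): Expr = Expr(s"($name = $value)")
}

val state = new MiniColumn("state")

val expr = state === "TX"  // Expr("(state = TX)"): a real expression
val bool = state == "TX"   // false: == just compares the two objects
println(expr.description)
println(bool)
```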

Here's the accepted answer that works:


df.filter(df("state")==="TX").show()

+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
|  Rockets|                2|   TX|
|    Spurs|                5|   TX|
+---------+-----------------+-----+

As other posters have mentioned, the === method takes an argument of type Any, so this isn't the only solution that works. This works too, for example:


df.filter(df("state") === lit("TX")).show

+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
|  Rockets|                2|   TX|
|    Spurs|                5|   TX|
+---------+-----------------+-----+

The Column equalTo method can also be used:


df.filter(df("state").equalTo("TX")).show()

+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
|  Rockets|                2|   TX|
|    Spurs|                5|   TX|
+---------+-----------------+-----+

It's worthwhile to study this example in detail. Scala's syntax seems magical at times, especially when methods are invoked without dot notation. It's hard for the untrained eye to see that === is a method defined in the Column class!


See this blog post if you'd like even more details on Spark Column equality.


Answered by Venkatesh Surya Goli

In Spark 2.4


To compare with one value:


df.filter(lower(trim($"col_name")) === "<value>").show()

To compare with a collection of values:


df.filter($"col_name".isInCollection(Seq("value1", "value2"))).show()
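For Spark versions that predate isInCollection (added in 2.4), the Column.isin method takes the values as varargs and performs the same membership test; a sketch assuming the same df:

```scala
// Membership test with isin (available in earlier Spark versions too):
df.filter($"col_name".isin("value1", "value2")).show()
```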