Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/39727742/


How to filter out a null value from a Spark dataframe

Tags: scala, apache-spark, apache-spark-sql, spark-dataframe

Asked by Steven Li

I created a dataframe in spark with the following schema:

root
 |-- user_id: long (nullable = false)
 |-- event_id: long (nullable = false)
 |-- invited: integer (nullable = false)
 |-- day_diff: long (nullable = true)
 |-- interested: integer (nullable = false)
 |-- event_owner: long (nullable = false)
 |-- friend_id: long (nullable = false)

And the data is shown below:

+----------+----------+-------+--------+----------+-----------+---------+
|   user_id|  event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
|   4236494| 110357109|      0|      -1|         0|  937597069|     null|
|  78065188| 498404626|      0|       0|         0| 2904922087|     null|
| 282487230|2520855981|      0|      28|         0| 3749735525|     null|
| 335269852|1641491432|      0|       2|         0| 1490350911|     null|
| 437050836|1238456614|      0|       2|         0|  991277599|     null|
| 447244169|2095085551|      0|      -1|         0| 1579858878|     null|
| 516353916|1076364848|      0|       3|         1| 3597645735|     null|
| 528218683|1151525474|      0|       1|         0| 3433080956|     null|
| 531967718|3632072502|      0|       1|         0| 3863085861|     null|
| 627948360|2823119321|      0|       0|         0| 4092665803|     null|
| 811791433|3513954032|      0|       2|         0|  415464198|     null|
| 830686203|  99027353|      0|       0|         0| 3549822604|     null|
|1008893291|1115453150|      0|       2|         0| 2245155244|     null|
|1239364869|2824096896|      0|       2|         1| 2579294650|     null|
|1287950172|1076364848|      0|       0|         0| 3597645735|     null|
|1345896548|2658555390|      0|       1|         0| 2025118823|     null|
|1354205322|2564682277|      0|       3|         0| 2563033185|     null|
|1408344828|1255629030|      0|      -1|         1|  804901063|     null|
|1452633375|1334001859|      0|       4|         0| 1488588320|     null|
|1625052108|3297535757|      0|       3|         0| 1972598895|     null|
+----------+----------+-------+--------+----------+-----------+---------+

I want to filter out the rows that have null values in the "friend_id" field.

scala> val aaa = test.filter("friend_id is null")

scala> aaa.count

I got res52: Long = 0, which is obviously not right. What is the right way to get it?

One more question: I want to replace the values in the friend_id field, replacing null with 0 and any other value with 1. The code I could figure out is:

val aaa = train_friend_join.select($"user_id", $"event_id", $"invited", $"day_diff", $"interested", $"event_owner", ($"friend_id" != null)?1:0)

This code doesn't work either. Can anyone tell me how I can fix it? Thanks
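
Aside: the snippet above cannot work because Scala has no ?: ternary operator, and $"friend_id" != null compares the Column object itself to null (which is always true) rather than testing the column's values. Spark's column-level equivalents are isNull/isNotNull together with when/otherwise. A minimal sketch of the intended expression (friend_id_flag is a hypothetical output column name):

import org.apache.spark.sql.functions.when

val aaa = train_friend_join.select(
  $"user_id", $"event_id", $"invited", $"day_diff", $"interested", $"event_owner",
  when($"friend_id".isNull, 0).otherwise(1).as("friend_id_flag")  // null -> 0, anything else -> 1
)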

Answered by Sachin Tyagi

Let's say you have this data setup (so that results are reproducible):

// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)

// setting up example data
val e1 = Employee("n1", null, "[email protected]", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "[email protected]", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "[email protected]", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "[email protected]", Company("c2", "2", "d2"))
val e5 = Employee("n5", null, "[email protected]", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "[email protected]", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "[email protected]", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "[email protected]", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(employees).toDF

Data is:

+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|[email protected]|[c1,1,d1]|
|  n2|   2|[email protected]|[c1,1,d1]|
|  n3|   3|[email protected]|[c1,1,d1]|
|  n4|   4|[email protected]|[c2,2,d2]|
|  n5|null|[email protected]|[c2,2,d2]|
|  n6|   6|[email protected]|[c2,2,d2]|
|  n7|   7|[email protected]|[c3,3,d3]|
|  n8|   8|[email protected]|[c3,3,d3]|
+----+----+---------+---------+

Now, to filter the employees with null ids, you will do:

df.filter("id is null").show

which will correctly show you the following:

+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|[email protected]|[c1,1,d1]|
|  n5|null|[email protected]|[c2,2,d2]|
+----+----+---------+---------+

Coming to the second part of your question, you can replace the null ids with 0 and other values with 1 with this:

df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show

This results in:

+----+---+---------+---------+
|name| id|    email|  company|
+----+---+---------+---------+
|  n1|  0|[email protected]|[c1,1,d1]|
|  n2|  1|[email protected]|[c1,1,d1]|
|  n3|  1|[email protected]|[c1,1,d1]|
|  n4|  1|[email protected]|[c2,2,d2]|
|  n5|  0|[email protected]|[c2,2,d2]|
|  n6|  1|[email protected]|[c2,2,d2]|
|  n7|  1|[email protected]|[c3,3,d3]|
|  n8|  1|[email protected]|[c3,3,d3]|
+----+---+---------+---------+

Answered by Adriana Lazar

Or simply: df.filter($"friend_id".isNotNull)
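
Applied to the question's DataFrame, a quick sketch (test and friend_id are the asker's names):

val nullFriends = test.filter($"friend_id".isNull)      // rows where friend_id is null
val realFriends = test.filter($"friend_id".isNotNull)   // rows where it is set
nullFriends.count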

Answered by Michael Kopaniov

df.where(df.col("friend_id").isNull)

Answered by chAlexey

A good solution for me was to drop the rows with any null values:

// import org.apache.spark.api.java.function.FilterFunction
Dataset<Row> filtered = df.filter((FilterFunction<Row>) row -> !row.anyNull());

In case one is interested in the opposite case, just call row.anyNull without the negation. (Spark 2.1.0 using the Java API)
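
The same idea in Scala, as a sketch: Row.anyNull is also reachable through the typed filter API, and df.na.drop() is a more direct equivalent.

val filtered = df.filter(row => !row.anyNull)  // keep rows with no null in any column
val same     = df.na.drop()                    // equivalent: drop rows containing any null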

Answered by Ayush Vatsyayan

There are two ways to do it, creating the filter condition either 1) manually or 2) dynamically.

Sample DataFrame:

val df = spark.createDataFrame(Seq(
  (0, "a1", "b1", "c1", "d1"),
  (1, "a2", "b2", "c2", "d2"),
  (2, "a3", "b3", null, "d3"),
  (3, "a4", null, "c4", "d4"),
  (4, null, "b5", "c5", "d5")
)).toDF("id", "col1", "col2", "col3", "col4")

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  0|  a1|  b1|  c1|  d1|
|  1|  a2|  b2|  c2|  d2|
|  2|  a3|  b3|null|  d3|
|  3|  a4|null|  c4|  d4|
|  4|null|  b5|  c5|  d5|
+---+----+----+----+----+

1) Creating the filter condition manually, i.e. using the DataFrame where or filter function:

df.filter(col("col1").isNotNull && col("col2").isNotNull).show

or

df.where("col1 is not null and col2 is not null").show

Result:

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  0|  a1|  b1|  c1|  d1|
|  1|  a2|  b2|  c2|  d2|
|  2|  a3|  b3|null|  d3|
+---+----+----+----+----+

2) Creating the filter condition dynamically: this is useful when we don't want any column to have a null value and there is a large number of columns, which is usually the case.

Creating the filter condition manually in such cases wastes a lot of time. In the code below we include all columns dynamically, using map and reduce over the DataFrame's columns:

val filterCond = df.columns.map(x=>col(x).isNotNull).reduce(_ && _)

How filterCond looks:

filterCond: org.apache.spark.sql.Column = (((((id IS NOT NULL) AND (col1 IS NOT NULL)) AND (col2 IS NOT NULL)) AND (col3 IS NOT NULL)) AND (col4 IS NOT NULL))

Filtering:

val filteredDf = df.filter(filterCond)

Result:

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  0|  a1|  b1|  c1|  d1|
|  1|  a2|  b2|  c2|  d2|
+---+----+----+----+----+

Answered by mputha

For the first question: it is correct, you are filtering out nulls, and hence the count is zero.

For the second part, the replacement, use something like the below:

val options = Map("path" -> "...\ex.csv", "header" -> "true")
val dfNull = spark.sqlContext.load("com.databricks.spark.csv", options)

scala> dfNull.show

+----------+----------+-------+--------+----------+-----------+---------+
|   user_id|  event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
|   4236494| 110357109|      0|      -1|         0|  937597069|     null|
|  78065188| 498404626|      0|       0|         0| 2904922087|     null|
| 282487230|2520855981|      0|      28|         0| 3749735525|     null|
| 335269852|1641491432|      0|       2|         0| 1490350911|     null|
| 437050836|1238456614|      0|       2|         0|  991277599|     null|
| 447244169|2095085551|      0|      -1|         0| 1579858878|        a|
| 516353916|1076364848|      0|       3|         1| 3597645735|        b|
| 528218683|1151525474|      0|       1|         0| 3433080956|        c|
| 531967718|3632072502|      0|       1|         0| 3863085861|     null|
| 627948360|2823119321|      0|       0|         0| 4092665803|     null|
| 811791433|3513954032|      0|       2|         0|  415464198|     null|
| 830686203|  99027353|      0|       0|         0| 3549822604|     null|
|1008893291|1115453150|      0|       2|         0| 2245155244|     null|
|1239364869|2824096896|      0|       2|         1| 2579294650|        d|
|1287950172|1076364848|      0|       0|         0| 3597645735|     null|
|1345896548|2658555390|      0|       1|         0| 2025118823|     null|
|1354205322|2564682277|      0|       3|         0| 2563033185|     null|
|1408344828|1255629030|      0|      -1|         1|  804901063|     null|
|1452633375|1334001859|      0|       4|         0| 1488588320|     null|
|1625052108|3297535757|      0|       3|         0| 1972598895|     null|
+----------+----------+-------+--------+----------+-----------+---------+

dfNull.withColumn("friend_idTmp", when($"friend_id".isNull, "1").otherwise("0")).drop($"friend_id").withColumnRenamed("friend_idTmp", "friend_id").show

+----------+----------+-------+--------+----------+-----------+---------+
|   user_id|  event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
|   4236494| 110357109|      0|      -1|         0|  937597069|        1|
|  78065188| 498404626|      0|       0|         0| 2904922087|        1|
| 282487230|2520855981|      0|      28|         0| 3749735525|        1|
| 335269852|1641491432|      0|       2|         0| 1490350911|        1|
| 437050836|1238456614|      0|       2|         0|  991277599|        1|
| 447244169|2095085551|      0|      -1|         0| 1579858878|        0|
| 516353916|1076364848|      0|       3|         1| 3597645735|        0|
| 528218683|1151525474|      0|       1|         0| 3433080956|        0|
| 531967718|3632072502|      0|       1|         0| 3863085861|        1|
| 627948360|2823119321|      0|       0|         0| 4092665803|        1|
| 811791433|3513954032|      0|       2|         0|  415464198|        1|
| 830686203|  99027353|      0|       0|         0| 3549822604|        1|
|1008893291|1115453150|      0|       2|         0| 2245155244|        1|
|1239364869|2824096896|      0|       2|         1| 2579294650|        0|
|1287950172|1076364848|      0|       0|         0| 3597645735|        1|
|1345896548|2658555390|      0|       1|         0| 2025118823|        1|
|1354205322|2564682277|      0|       3|         0| 2563033185|        1|
|1408344828|1255629030|      0|      -1|         1|  804901063|        1|
|1452633375|1334001859|      0|       4|         0| 1488588320|        1|
|1625052108|3297535757|      0|       3|         0| 1972598895|        1|
+----------+----------+-------+--------+----------+-----------+---------+
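
Note that this maps null to 1 and everything else to 0, the inverse of what the question asked for. For null to 0 and anything else to 1, a sketch: swap the literals and write the column directly, e.g.

dfNull.withColumn("friend_id", when($"friend_id".isNull, "0").otherwise("1")).show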

Answered by Andrushenko Alexander

Here is a solution for Spark in Java. To select the data rows containing nulls, when you have Dataset<Row> data, you do:

Dataset<Row> containingNulls =  data.where(data.col("COLUMN_NAME").isNull())

To filter the data to rows without nulls, you do:

Dataset<Row> withoutNulls = data.where(data.col("COLUMN_NAME").isNotNull())

Often dataframes contain columns of type String where, instead of nulls, we have empty strings like "". To filter out such data as well, we do:

Dataset<Row> withoutNullsAndEmpty = data.where(data.col("COLUMN_NAME").isNotNull().and(data.col("COLUMN_NAME").notEqual("")))
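
The same three filters in Scala, as a sketch for comparison (COLUMN_NAME stands in for a real column name):

data.where($"COLUMN_NAME".isNull)                              // rows containing nulls
data.where($"COLUMN_NAME".isNotNull)                           // rows without nulls
data.where($"COLUMN_NAME".isNotNull && $"COLUMN_NAME" =!= "")  // non-null and non-empty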

Answered by Robin Wang

Based on the hint from Michael Kopaniov, the following works:

df.where(df("id").isNotNull).show

Answered by Steven Li

I used the following code to solve my question. It works, but as we all know, it goes a country mile around the problem. So, is there a shortcut for that? Thanks

def filter_null(field : Any) : Int = field match {
    case null => 0
    case _    => 1
}

val test = train_event_join.join(
    user_friends_pair,
    train_event_join("user_id") === user_friends_pair("user_id") &&
    train_event_join("event_owner") === user_friends_pair("friend_id"),
    "left"
).select(
    train_event_join("user_id"),
    train_event_join("event_id"),
    train_event_join("invited"),
    train_event_join("day_diff"),
    train_event_join("interested"),
    train_event_join("event_owner"),
    user_friends_pair("friend_id")
).rdd.map{
    line => (
        line(0).toString.toLong,
        line(1).toString.toLong,
        line(2).toString.toLong,
        line(3).toString.toLong,
        line(4).toString.toLong,
        line(5).toString.toLong,
        filter_null(line(6))
        )
    }.toDF("user_id", "event_id", "invited", "day_diff", "interested", "event_owner", "creator_is_friend")
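
A shorter equivalent, as a sketch: the creator_is_friend flag can be computed as a column expression inside the select, avoiding the round trip through the RDD API (same inputs as above):

import org.apache.spark.sql.functions.when

val test = train_event_join.join(
    user_friends_pair,
    train_event_join("user_id") === user_friends_pair("user_id") &&
    train_event_join("event_owner") === user_friends_pair("friend_id"),
    "left"
).select(
    train_event_join("user_id"),
    train_event_join("event_id"),
    train_event_join("invited"),
    train_event_join("day_diff"),
    train_event_join("interested"),
    train_event_join("event_owner"),
    when(user_friends_pair("friend_id").isNull, 0).otherwise(1).as("creator_is_friend")
)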

Answered by Erkan Şirin

Another easy way to filter out null values from multiple columns in a Spark dataframe. Note that the columns are AND-connected here, i.e. a row is dropped only when all of the listed columns are null:

df.filter(" COALESCE(col1, col2, col3, col4, col5, col6) IS NOT NULL")

If you need to filter out rows that contain any null (OR-connected), please use:

df.na.drop()
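
For comparison, a sketch using DataFrameNaFunctions, which covers both cases directly:

df.na.drop("all")  // drop rows where all columns are null (the AND-connected case above)
df.na.drop()       // default "any": drop rows containing any null (the OR-connected case)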