Python: Spark SQL filtering (selecting with a where clause) with multiple conditions

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/33747834/


Sparksql filtering (selecting with where clause) with multiple conditions

python, sql, apache-spark, apache-spark-sql, pyspark

Asked by user3803714

Hi, I have the following issue:

numeric.registerTempTable("numeric"). 

All the values that I want to filter on are literal null strings and not N/A or Null values.
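
This distinction matters: a literal 'null' string and a real SQL NULL behave differently under comparisons. A minimal sketch of the difference, assuming the same sqlContext as elsewhere on this page (the DataFrame and column name here are illustrative only):

from pyspark.sql.functions import col

# One row holds the literal string 'null', the other a real NULL (None).
df = sqlContext.createDataFrame([('null',), (None,)], ['value'])

df.where(col('value') == 'null').show()  # matches only the string row
df.where(col('value').isNull()).show()   # matches only the real NULL row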

I tried these three options:

  1. numeric_filtered = numeric.filter(numeric['LOW'] != 'null').filter(numeric['HIGH'] != 'null').filter(numeric['NORMAL'] != 'null')

  2. numeric_filtered = numeric.filter(numeric['LOW'] != 'null' AND numeric['HIGH'] != 'null' AND numeric['NORMAL'] != 'null')

  3. sqlContext.sql("SELECT * from numeric WHERE LOW != 'null' AND HIGH != 'null' AND NORMAL != 'null'")

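As an aside, option 2 is not valid Python: there is no AND keyword, so it raises a SyntaxError before Spark ever sees it. PySpark column conditions are combined with & (or | for OR), with each comparison wrapped in parentheses. A corrected version of that conjunction would look like this:

# Corrected syntax for option 2 (still AND semantics; the accepted answer
# below explains why this data actually calls for OR):
numeric_filtered = numeric.filter(
    (numeric['LOW'] != 'null') &
    (numeric['HIGH'] != 'null') &
    (numeric['NORMAL'] != 'null'))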

Unfortunately, numeric_filtered is always empty. I checked, and numeric contains data that should be selected by these conditions.

Here are some sample values:

Low   High  Normal
3.5   5.0   null
2.0   14.0  null
null  38.0  null
null  null  null
1.0   null  4.0

Accepted answer by zero323

You are using a logical conjunction (AND). It means that all columns have to be different from 'null' for a row to be included. Let's illustrate that using the filter version as an example:

numeric = sqlContext.createDataFrame([
    ('3.5', '5.0', 'null'), ('2.0', '14.0', 'null'),  ('null', '38.0', 'null'),
    ('null', 'null', 'null'),  ('1.0', 'null', '4.0')],
    ('low', 'high', 'normal'))

numeric_filtered_1 = numeric.where(numeric['LOW'] != 'null')
numeric_filtered_1.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## |3.5| 5.0|  null|
## |2.0|14.0|  null|
## |1.0|null|   4.0|
## +---+----+------+

numeric_filtered_2 = numeric_filtered_1.where(
    numeric_filtered_1['NORMAL'] != 'null')
numeric_filtered_2.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## |1.0|null|   4.0|
## +---+----+------+

numeric_filtered_3 = numeric_filtered_2.where(
    numeric_filtered_2['HIGH'] != 'null')
numeric_filtered_3.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## +---+----+------+

All the remaining methods you've tried follow exactly the same scheme. What you need here is a logical disjunction (OR).

from pyspark.sql.functions import col 

numeric_filtered = numeric.where(
    (col('LOW')    != 'null') | 
    (col('NORMAL') != 'null') |
    (col('HIGH')   != 'null'))
numeric_filtered.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## | 3.5| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+
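
With more than a few columns, writing the disjunction out by hand gets tedious. A minimal sketch of building the same OR expression programmatically with functools.reduce (the column list is taken from the example above):

from functools import reduce
from operator import or_

from pyspark.sql.functions import col

cols = ['LOW', 'NORMAL', 'HIGH']
# Fold the per-column comparisons into a single OR condition.
condition = reduce(or_, [col(c) != 'null' for c in cols])
numeric_filtered = numeric.where(condition)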

or with raw SQL:

numeric.registerTempTable("numeric")
sqlContext.sql("""SELECT * FROM numeric
    WHERE low != 'null' OR normal != 'null' OR high != 'null'"""
).show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## | 3.5| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+
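
A side note on API vintage: registerTempTable and sqlContext are Spark 1.x constructs. On Spark 2.0+, where registerTempTable is deprecated, the equivalent (assuming a SparkSession named spark) would be:

# Spark 2.0+ equivalent of the snippet above.
numeric.createOrReplaceTempView("numeric")
spark.sql("""SELECT * FROM numeric
    WHERE low != 'null' OR normal != 'null' OR high != 'null'""").show()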

See also: Pyspark: multiple conditions in when clause

Answered by Sudhakar

from pyspark.sql.functions import countDistinct

# Count the distinct non-NULL values of a column.
totalrecordcount = df.where("ColumnName is not null").select(countDistinct("ColumnName")).collect()[0][0]
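
Note that this snippet filters on real SQL NULLs ("ColumnName is not null"), not on the literal 'null' strings the question describes; ColumnName is a placeholder for your own column. For the question's literal strings, you would compare as text instead:

# Variant for literal 'null' strings (as in the question):
totalrecordcount = df.where("ColumnName != 'null'").select(countDistinct("ColumnName")).collect()[0][0]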