Python: how to filter a Spark dataframe by a boolean column

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must likewise follow the CC BY-SA license and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/36784000/

Date: 2020-08-19 18:20:48  Source: igfitidea

how to filter a spark dataframe by a boolean column

python apache-spark filter spark-dataframe

Asked by Nasreddin

I created a dataframe that has the following schema:


In [43]: yelp_df.printSchema()
root
 |-- business_id: string (nullable = true)
 |-- cool: integer (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: integer (nullable = true)
 |-- id: string (nullable = true)
 |-- stars: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- type: string (nullable = true)
 |-- useful: integer (nullable = true)
 |-- user_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- full_address: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- neighborhoods: string (nullable = true)
 |-- open: boolean (nullable = true)
 |-- review_count: integer (nullable = true)
 |-- state: string (nullable = true)

Now I want to select only the records whose "open" column is "true". As shown below, lots of them are "open".


business_id          cool date       funny id                   stars text                 type     useful user_id              name               full_address         latitude      longitude      neighborhoods open review_count state
9yKzy9PApeiPPOUJE... 2    2011-01-26 0     fWKvX83p0-ka4JS3d... 4     My wife took me h... business 5      rLtl8ZkDX5vH5nAx9... Morning Glory Cafe 6106 S 32nd St Ph... 33.3907928467 -112.012504578 []            true 116          AZ   
ZRJwVLyzEJq1VAihD... 0    2011-07-27 0     IjZ33sJrzXqU-0X6U... 4     I have no idea wh... business 0      0a2KyEL0d3Yb1V6ai... Spinato's Pizzeria 4848 E Chandler B... 33.305606842  -111.978759766 []            true 102          AZ   
6oRAC4uyJCsJl1X0W... 0    2012-06-14 0     IESLBzqUCLdSzSqm0... 4     love the gyro pla... business 1      0hT2KtfLiobPvh6cD... Haji-Baba          1513 E  Apache Bl... 33.4143447876 -111.913032532 []            true 265          AZ   
_1QQZuf4zZOyFCvXc... 1    2010-05-27 0     G-WvGaISbqqaMHlNn... 4     Rosie, Dakota, an... business 2      uZetl9T0NcROGOyFf... Chaparral Dog Park 5401 N Hayden Rd ... 33.5229454041 -111.90788269  []            true 88           AZ   
6ozycU1RpktNG2-1B... 0    2012-01-05 0     1uJFq2r5QfJG_6ExM... 4     General Manager S... business 0      vYmM4KTsC8ZfQBg-j... Discount Tire      1357 S Power Road... 33.3910255432 -111.68447876  []            true 5            AZ   

However, the following command I run in pyspark returns nothing:


yelp_df.filter(yelp_df["open"] == "true").collect()

What is the right way to do it?


Answered by Akshat Mahajan

You're comparing data types incorrectly. "open" is listed as a Boolean value, not a string, so doing yelp_df["open"] == "true" is incorrect - "true" is a string.


Instead you want to do


yelp_df.filter(yelp_df["open"] == True).collect()

This correctly compares the values of "open" against the Boolean primitive True, rather than the non-Boolean string "true".


Answered by user11428312

In Spark - Scala, I can think of two approaches. Approach 1: a Spark SQL command to get all the bool columns, by creating a temporary view and selecting only the Boolean columns from the whole dataframe. However, this requires the Boolean columns to be known in advance, or fetched from the schema based on data type.


    // define the bool columns (plain names, not quoted strings,
    // so the SQL engine treats them as column references)
    val SqlBoolCols = "boolcolumn1, boolcolumn2, boolcolumn3"

    dataframe.createOrReplaceTempView("Booltable")
    val dfwithboolcolumns = sqlcontext.sql(s"Select ${SqlBoolCols} from Booltable")

Approach 2: Filter the dataframe if the schema is defined


// keep only the columns whose declared type is Boolean
import org.apache.spark.sql.types.BooleanType

val boolColNames = rawdata.schema.fields.filter(_.dataType == BooleanType).map(_.name)
val boolDataframe = rawdata.select(boolColNames.head, boolColNames.tail: _*)