Python - how to filter a Spark dataframe by a boolean column
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must comply with the same CC BY-SA license, cite the original URL and author information, and attribute it to the original author (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36784000/
How to filter a Spark dataframe by a boolean column
Asked by Nasreddin
I created a dataframe that has the following schema:
In [43]: yelp_df.printSchema()
root
|-- business_id: string (nullable = true)
|-- cool: integer (nullable = true)
|-- date: string (nullable = true)
|-- funny: integer (nullable = true)
|-- id: string (nullable = true)
|-- stars: integer (nullable = true)
|-- text: string (nullable = true)
|-- type: string (nullable = true)
|-- useful: integer (nullable = true)
|-- user_id: string (nullable = true)
|-- name: string (nullable = true)
|-- full_address: string (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- neighborhoods: string (nullable = true)
|-- open: boolean (nullable = true)
|-- review_count: integer (nullable = true)
|-- state: string (nullable = true)
Now I want to select only the records whose "open" column is "true". As shown below, lots of them are "open".
business_id cool date funny id stars text type useful user_id name full_address latitude longitude neighborhoods open review_count state
9yKzy9PApeiPPOUJE... 2 2011-01-26 0 fWKvX83p0-ka4JS3d... 4 My wife took me h... business 5 rLtl8ZkDX5vH5nAx9... Morning Glory Cafe 6106 S 32nd St Ph... 33.3907928467 -112.012504578 [] true 116 AZ
ZRJwVLyzEJq1VAihD... 0 2011-07-27 0 IjZ33sJrzXqU-0X6U... 4 I have no idea wh... business 0 0a2KyEL0d3Yb1V6ai... Spinato's Pizzeria 4848 E Chandler B... 33.305606842 -111.978759766 [] true 102 AZ
6oRAC4uyJCsJl1X0W... 0 2012-06-14 0 IESLBzqUCLdSzSqm0... 4 love the gyro pla... business 1 0hT2KtfLiobPvh6cD... Haji-Baba 1513 E Apache Bl... 33.4143447876 -111.913032532 [] true 265 AZ
_1QQZuf4zZOyFCvXc... 1 2010-05-27 0 G-WvGaISbqqaMHlNn... 4 Rosie, Dakota, an... business 2 uZetl9T0NcROGOyFf... Chaparral Dog Park 5401 N Hayden Rd ... 33.5229454041 -111.90788269 [] true 88 AZ
6ozycU1RpktNG2-1B... 0 2012-01-05 0 1uJFq2r5QfJG_6ExM... 4 General Manager S... business 0 vYmM4KTsC8ZfQBg-j... Discount Tire 1357 S Power Road... 33.3910255432 -111.68447876 [] true 5 AZ
However, the following command I run in pyspark returns nothing:
yelp_df.filter(yelp_df["open"] == "true").collect()
What is the right way to do it?
Answered by Akshat Mahajan
You're comparing data types incorrectly. open is listed as a Boolean value, not a string, so yelp_df["open"] == "true" is incorrect - "true" is a string.
Instead you want to do:
yelp_df.filter(yelp_df["open"] == True).collect()
This correctly compares the values of open against the Boolean primitive True, rather than the non-Boolean string "true".
Answered by user11428312
In Spark - Scala, I can think of two approaches.

Approach 1: Use a Spark SQL command to get all the Boolean columns by creating a temporary view and selecting only the Boolean columns from the whole dataframe. However, this requires the Boolean columns to be known in advance, or fetched from the schema based on data type.
//define bool columns (placeholder names; single quotes would produce
//string literals in SQL, so the names go unquoted)
val SqlBoolCols = "boolcolumn1, boolcolumn2, boolcolumn3"
dataframe.createOrReplaceTempView("Booltable")
val dfwithboolcolumns = sqlcontext.sql(s"SELECT ${SqlBoolCols} FROM Booltable")
Approach 2: Filter the dataframe columns by data type, if the schema is defined.
import org.apache.spark.sql.types.BooleanType

val boolcolnames = rawdata.schema.fields.filter(x => x.dataType == BooleanType).map(booltype => booltype.name)
val booldataframe = rawdata.select(boolcolnames.head, boolcolnames.tail: _*)