
Note: this content is from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/41889974/


Filter df when values match part of a string in pyspark

Tags: python, apache-spark, pyspark, apache-spark-sql

Asked by gaatjeniksaan

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (i.e. filter) all rows where the URL saved in the location column contains a pre-determined string, e.g. 'google.com'.


I have tried:


import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)

but this throws a


TypeError: 'Column' object is not callable

How do I get around this and filter my df properly? Many thanks in advance!


Answered by mrsrinivas

Spark 2.2 onwards


df.filter(df.location.contains('google.com'))

Spark 2.2 documentation link
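
As a quick sanity check, here is a minimal self-contained sketch of the above; the SparkSession setup, the id/location schema, and the sample URLs are illustrative assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the real DataFrame; 'location' holds the URLs.
df = spark.createDataFrame(
    [("a", "https://www.google.com/search"), ("b", "https://example.org/")],
    ["id", "location"],
)

# Keep only the rows whose location contains the literal substring.
df.filter(df.location.contains("google.com")).show(truncate=False)  # keeps only row "a"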




Spark 2.1 and before


You can use plain SQL in filter

df.filter("location like '%google.com%'")

or with DataFrame column methods

df.filter(df.location.like('%google.com%'))

Spark 2.1 documentation link
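
One point worth noting: in SQL LIKE only % and _ are wildcards, so the dot in 'google.com' matches literally (unlike in a regex). A minimal sketch, under the same assumed toy schema as above, showing that the two forms agree:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", "https://www.google.com/x"), ("b", "https://example.org/")],
    ["id", "location"],
)

# Both forms compile to the same LIKE predicate. Only '%' (any run of
# characters) and '_' (any single character) are wildcards in SQL LIKE,
# so the '.' in 'google.com' is matched literally.
sql_form = df.filter("location like '%google.com%'")
column_form = df.filter(df.location.like("%google.com%"))
assert sql_form.collect() == column_form.collect()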


Answered by joaofbsm

pyspark.sql.Column.contains() is only available in pyspark version 2.2 and above. On older versions, the unknown contains attribute is resolved by Column.__getattr__ as a nested-field lookup and returns a Column, which is why calling it raises TypeError: 'Column' object is not callable.


df.where(df.location.contains('google.com'))
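
(df.where() is an alias for df.filter(), so this is interchangeable with the filter() calls in the other answers.)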

Answered by caffreyd

When filtering a DataFrame with string values, I find that the pyspark.sql.functions lower and upper come in handy, if your data could have column entries like "foo" and "Foo":


import pyspark.sql.functions as sql_fun
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
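
A self-contained version of that idea (the sample rows and the col_name column are illustrative assumptions):

from pyspark.sql import SparkSession
import pyspark.sql.functions as sql_fun

spark = SparkSession.builder.getOrCreate()

# Sample data; only the case of "foo" varies across rows.
source_df = spark.createDataFrame([("Foo bar",), ("FOO",), ("baz",)], ["col_name"])

# Lower-case the column before matching so "foo", "Foo", and "FOO" all match.
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
result.show()  # keeps "Foo bar" and "FOO"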