Original question: http://stackoverflow.com/questions/41889974/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Filter df when values matches part of a string in pyspark
Asked by gaatjeniksaan
I have a large pyspark.sql.dataframe.DataFrame and I want to keep (so filter) all rows where the URL saved in the location column contains a pre-determined string, e.g. 'google.com'.
I have tried:
import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)
but this throws a
TypeError: 'Column' object is not callable
How do I get around this and filter my df properly? Many thanks in advance!
Answered by mrsrinivas
Spark 2.2 onwards
df.filter(df.location.contains('google.com'))
Spark 2.1 and before
You can use plain SQL in filter
df.filter("location like '%google.com%'")
or with DataFrame column methods
df.filter(df.location.like('%google.com%'))
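One caveat with the like approach: in a LIKE pattern, % and _ are wildcards, so a search string that itself contains either character can match more rows than intended. Below is a minimal sketch of a helper that escapes them before building the pattern. It assumes backslash is the LIKE escape character (Spark SQL's default), and the name like_contains_pattern is hypothetical, not part of pyspark.

```python
def like_contains_pattern(substring, escape="\\"):
    """Build a LIKE pattern matching `substring` literally, anywhere in the value.

    Hypothetical helper (not part of pyspark). Assumes backslash is the
    LIKE escape character, which is Spark SQL's default.
    """
    escaped = (
        substring
        .replace(escape, escape + escape)  # escape the escape character first
        .replace("%", escape + "%")        # % would otherwise match any run of characters
        .replace("_", escape + "_")        # _ would otherwise match any single character
    )
    return "%" + escaped + "%"


# Intended use with the column method shown above (df is the asker's DataFrame):
#   df.filter(df.location.like(like_contains_pattern("my_site.com")))
print(like_contains_pattern("google.com"))
print(like_contains_pattern("my_site.com"))
```

For 'google.com' the result is the same '%google.com%' pattern used above, since a dot has no special meaning in LIKE. The helper is meant for Column.like; inside a SQL string passed to filter, backslashes may additionally be processed by the SQL string-literal parser.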
Answered by joaofbsm
pyspark.sql.Column.contains() is only available in pyspark version 2.2 and above.
df.where(df.location.contains('google.com'))
Answered by caffreyd
When filtering a DataFrame on string values, I find that lower and upper from pyspark.sql.functions come in handy if your data could have column entries like "foo" and "Foo":
import pyspark.sql.functions as sql_fun
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))