Python: filter a df when values match part of a string in pyspark

Question

Asked by gaatjeniksaan

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (so filter) all rows where the URL saved in the location column contains a pre-determined string, e.g. 'google.com'.

I have tried:

import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)

but this throws a

TypeError: 'Column' object is not callable

How do I get around this and filter my df properly? Many thanks in advance!

Answer 1

Answered by mrsrinivas

Spark 2.2 onwards

df.filter(df.location.contains('google.com'))
Spark 2.2 documentation link

Spark 2.1 and before

You can use plain SQL in filter
df.filter("location like '%google.com%'")
or with DataFrame column methods
df.filter(df.location.like('%google.com%'))
Spark 2.1 documentation link
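
One thing to watch with the like approach: in SQL LIKE patterns, % matches any run of characters and _ matches any single character, so a search string that itself contains one of them (for example 'utm_source') will match more than intended. Below is a minimal sketch of a helper that escapes those characters before wrapping the string in %...%. The helper name like_contains_pattern is hypothetical, and the sketch assumes that backslash is the LIKE escape character in your Spark version, which is worth checking against the documentation for that version.

```python
def like_contains_pattern(substring, escape="\\"):
    """Build a LIKE pattern that matches `substring` anywhere in a value.

    Hypothetical helper, not part of pyspark. Assumes `escape` is the
    escape character your Spark version uses for LIKE patterns.
    """
    # Escape the escape character first so it is not doubled again later.
    escaped = substring.replace(escape, escape + escape)
    # Then neutralise the two LIKE wildcards.
    for wildcard in ("%", "_"):
        escaped = escaped.replace(wildcard, escape + wildcard)
    # Surround with % so the match can occur anywhere in the value.
    return "%" + escaped + "%"


print(like_contains_pattern("google.com"))
print(like_contains_pattern("utm_source"))
```

With a DataFrame at hand, the result would be passed straight to the column method, e.g. df.filter(df.location.like(like_contains_pattern('utm_source'))).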

Answer 2

Answered by joaofbsm

pyspark.sql.Column.contains() is only available in pyspark version 2.2 and above.

df.where(df.location.contains('google.com'))
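
This version limit is also the likely source of the "'Column' object is not callable" error in the question. As far as I understand it, a Column treats an unknown attribute name as a nested-field lookup and hands back another Column, so on older versions .contains is not a method at all, and calling the result fails. The sketch below reproduces that mechanism in plain Python; FakeColumn is a hypothetical stand-in written for illustration, not the real pyspark class.

```python
class FakeColumn:
    """Hypothetical stand-in for a pyspark Column; NOT the real class."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Only reached when normal attribute lookup fails.
        if item.startswith("__"):
            raise AttributeError(item)
        # Assumed behaviour: unknown names become nested-field lookups.
        return FakeColumn(self.name + "." + item)


col = FakeColumn("location")
field = col.contains  # no such method here, so we get another FakeColumn

try:
    field("google.com")  # calling a column object, not a method
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If this is indeed what happens, upgrading to 2.2+ or switching to like, as in the first answer, avoids the problem.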

Answer 3

Answered by caffreyd

When filtering a DataFrame with string values, I find that the pyspark.sql.functions lower and upper come in handy if your data could have column entries like "foo" and "Foo":

import pyspark.sql.functions as sql_fun
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
