Python: Filtering a DataFrame using the length of a column

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/33695389/

Date: 2020-08-19 13:47:50  Source: igfitidea

Filtering DataFrame using the length of a column

Tags: python, apache-spark, dataframe, pyspark, apache-spark-sql

Asked by Alberto Bonsanto

I want to filter a DataFrame using a condition related to the length of a column. This question might be very easy, but I didn't find any related question on SO.

More specifically, I have a DataFrame with only one Column of ArrayType(StringType()), and I want to filter the DataFrame using the length as the filter; I show a snippet below.

df = sqlContext.read.parquet("letters.parquet")
df.show()

# The output will be 
# +------------+
# |      tokens|
# +------------+
# |[L, S, Y, S]|
# |[L, V, I, S]|
# |[I, A, N, A]|
# |[I, L, S, A]|
# |[E, N, N, Y]|
# |[E, I, M, A]|
# |[O, A, N, A]|
# |   [S, U, S]|
# +------------+

# But I want only the entries with length 3 or less
fdf = df.filter(len(df.tokens) <= 3)
fdf.show()  # Raises TypeError: object of type 'Column' has no len(), so the previous statement is obviously incorrect

I read the Column documentation, but didn't find any property useful for this matter. I'd appreciate any help!

Accepted answer by zero323

In Spark >= 1.5 you can use the size function:

from pyspark.sql.functions import col, size

df = sqlContext.createDataFrame([
    (["L", "S", "Y", "S"],  ),
    (["L", "V", "I", "S"],  ),
    (["I", "A", "N", "A"],  ),
    (["I", "L", "S", "A"],  ),
    (["E", "N", "N", "Y"],  ),
    (["E", "I", "M", "A"],  ),
    (["O", "A", "N", "A"],  ),
    (["S", "U", "S"],  )], 
    ("tokens", ))

df.where(size(col("tokens")) <= 3).show()

## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+
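
For reference, filter and where are aliases on DataFrame, and size also accepts the column name as a plain string, so the following variants (a minimal sketch against the same df) should behave the same:

from pyspark.sql.functions import size

# where() and filter() are interchangeable aliases
df.filter(size(df["tokens"]) <= 3).show()

# size() also accepts the column name as a plain string
df.where(size("tokens") <= 3).show()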

In Spark < 1.5, a UDF should do the trick:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

size_ = udf(lambda xs: len(xs), IntegerType())

df.where(size_(col("tokens")) <= 3).show()

## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+
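
One caveat, as a side note beyond the original answer: len(xs) raises a TypeError inside the UDF if the array column contains nulls, so a slightly more defensive variant may be safer (a minimal sketch):

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import col, udf

# Return None for null arrays instead of failing inside the UDF;
# rows where the computed size is null are then dropped by the filter
safe_size = udf(lambda xs: len(xs) if xs is not None else None, IntegerType())

df.where(safe_size(col("tokens")) <= 3).show()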

If you use HiveContext, then the size UDF with raw SQL should work with any version:

df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()

## +--------------------+
## |              tokens|
## +--------------------+
## |ArrayBuffer(S, U, S)|
## +--------------------+
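
As a side note beyond the original answer: in Spark 2.0+, registerTempTable is deprecated in favor of createOrReplaceTempView, and the usual entry point is a SparkSession; a rough equivalent, assuming a session named spark:

# Spark 2.0+ sketch: the same raw-SQL size() filter via a temp view
df.createOrReplaceTempView("df")
spark.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()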

For string columns you can use either the udf defined above or the length function:

from pyspark.sql.functions import length

df = sqlContext.createDataFrame([("fooo", ), ("bar", )], ("k", ))
df.where(length(col("k")) <= 3).show()

## +---+
## |  k|
## +---+
## |bar|
## +---+
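
To inspect the computed lengths before filtering, the same function can be attached as an extra column (a small sketch beyond the original answer, mirroring the Scala example below):

from pyspark.sql.functions import col, length

# Attach the computed length as a column to sanity-check the values
df.withColumn("k_length", length(col("k"))).show()

## +----+--------+
## |   k|k_length|
## +----+--------+
## |fooo|       4|
## | bar|       3|
## +----+--------+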

Answer by mputha

Here is an example for String in Scala:

val stringData = Seq(("Maheswara"), ("Mokshith"))
val df = sc.parallelize(stringData).toDF
df.where((length($"value")) <= 8).show
+--------+
|   value|
+--------+
|Mokshith|
+--------+
df.withColumn("length", length($"value")).show
+---------+------+
|    value|length|
+---------+------+
|Maheswara|     9|
| Mokshith|     8|
+---------+------+

Answer by mputha

@AlbertoBonsanto: the code below filters based on array size:

val input = Seq(("a1,a2,a3,a4,a5"), ("a1,a2,a3,a4"), ("a1,a2,a3"), ("a1,a2"), ("a1"))
val df = sc.parallelize(input).toDF("tokens")
val tokensArrayDf = df.withColumn("tokens", split($"tokens", ","))
tokensArrayDf.show
+--------------------+
|              tokens|
+--------------------+
|[a1, a2, a3, a4, a5]|
|    [a1, a2, a3, a4]|
|        [a1, a2, a3]|
|            [a1, a2]|
|                [a1]|
+--------------------+

tokensArrayDf.filter(size($"tokens") > 3).show
+--------------------+
|              tokens|
+--------------------+
|[a1, a2, a3, a4, a5]|
|    [a1, a2, a3, a4]|
+--------------------+