Python 使用 Pyspark 计算 Spark 数据帧每列中非 NaN 条目的数量

Question

提问by RKD314

I have a very large dataset that is loaded in Hive. It consists of about 1.9 million rows and 1450 columns. I need to determine the "coverage" of each of the columns, meaning, the fraction of rows that have non-NaN values for each column.

我有一个在 Hive 中加载的非常大的数据集。它由大约 190 万行和 1450 列组成。我需要确定每一列的“覆盖率”，即每列具有非 NaN 值的行的比例。

Here is my code:

这是我的代码：

from pyspark import SparkContext
from pyspark.sql import HiveContext
import string as string

sc = SparkContext(appName="compute_coverages") ## Create the context
sqlContext = HiveContext(sc)

df = sqlContext.sql("select * from data_table")
nrows_tot = df.count()

covgs=sc.parallelize(df.columns)
        .map(lambda x: str(x))
        .map(lambda x: (x, float(df.select(x).dropna().count()) / float(nrows_tot) * 100.))

Trying this out in the pyspark shell, if I then do covgs.take(10), it returns a rather large error stack. It says that there's a problem in save in the file /usr/lib64/python2.6/pickle.py. This is the final part of the error:

在 pyspark shell 中尝试这个，如果我再做 covgs.take(10)，它会返回一个相当大的错误堆栈。它说保存在文件中存在问题/usr/lib64/python2.6/pickle.py。这是错误的最后一部分：

py4j.protocol.Py4JError: An error occurred while calling o37.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
        at py4j.Gateway.invoke(Gateway.java:252)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)

If there is a better way to accomplish this than the way I'm trying, I'm open to suggestions. I can't use pandas, though, as it's not currently available on the cluster I work on and I don't have rights to install it.

如果有比我尝试的方式更好的方法来实现这一点，我愿意接受建议。但是，我不能使用 pandas，因为它目前在我工作的集群上不可用，而且我无权安装它。

Answer 1

采纳答案by zero323

Let's start with a dummy data:

让我们从一个虚拟数据开始：

from pyspark.sql import Row

row = Row("v", "x", "y", "z")
df = sc.parallelize([
    row(0.0, 1, 2, 3.0), row(None, 3, 4, 5.0),
    row(None, None, 6, 7.0), row(float("Nan"), 8, 9, float("NaN"))
]).toDF()

## +----+----+---+---+
## |   v|   x|  y|  z|
## +----+----+---+---+
## | 0.0|   1|  2|3.0|
## |null|   3|  4|5.0|
## |null|null|  6|7.0|
## | NaN|   8|  9|NaN|
## +----+----+---+---+

All you need is a simple aggregation:

您只需要一个简单的聚合：

from pyspark.sql.functions import col, count, isnan, lit, sum

def count_not_null(c, nan_as_null=False):
    """Use conversion between boolean and integer
    - False -> 0
    - True ->  1
    """
    pred = col(c).isNotNull() & (~isnan(c) if nan_as_null else lit(True))
    return sum(pred.cast("integer")).alias(c)

df.agg(*[count_not_null(c) for c in df.columns]).show()

## +---+---+---+---+
## |  v|  x|  y|  z|
## +---+---+---+---+
## |  2|  3|  4|  4|
## +---+---+---+---+

or if you want to treat NaNa NULL:

或者如果你想治疗NaN一个NULL：

df.agg(*[count_not_null(c, True) for c in df.columns]).show()

## +---+---+---+---+
## |  v|  x|  y|  z|
## +---+---+---+---+
## |  1|  3|  4|  3|
## +---+---+---+---

You can also leverage SQL NULLsemantics to achieve the same result without creating a custom function:

您还可以利用 SQLNULL语义来实现相同的结果，而无需创建自定义函数：

df.agg(*[
    count(c).alias(c)    # vertical (column-wise) operations in SQL ignore NULLs
    for c in df.columns
]).show()

## +---+---+---+
## |  x|  y|  z|
## +---+---+---+
## |  1|  2|  3|
## +---+---+---+

but this won't work with NaNs.

但这不适用于NaNs.

If you prefer fractions:

如果您更喜欢分数：

exprs = [(count_not_null(c) / count("*")).alias(c) for c in df.columns]
df.agg(*exprs).show()

## +------------------+------------------+---+
## |                 x|                 y|  z|
## +------------------+------------------+---+
## |0.3333333333333333|0.6666666666666666|1.0|
## +------------------+------------------+---+

or

或者

# COUNT(*) is equivalent to COUNT(1) so NULLs won't be an issue
df.select(*[(count(c) / count("*")).alias(c) for c in df.columns]).show()

## +------------------+------------------+---+
## |                 x|                 y|  z|
## +------------------+------------------+---+
## |0.3333333333333333|0.6666666666666666|1.0|
## +------------------+------------------+---+

Scala equivalent:

Scala 等价物：

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, isnan, sum}

type JDouble = java.lang.Double

val df = Seq[(JDouble, JDouble, JDouble, JDouble)](
  (0.0, 1, 2, 3.0), (null, 3, 4, 5.0),
  (null, null, 6, 7.0), (java.lang.Double.NaN, 8, 9, java.lang.Double.NaN)
).toDF()


def count_not_null(c: Column, nanAsNull: Boolean = false) = {
  val pred = c.isNotNull and (if (nanAsNull) not(isnan(c)) else lit(true))
  sum(pred.cast("integer"))
}

df.select(df.columns map (c => count_not_null(col(c)).alias(c)): _*).show
// +---+---+---+---+                                                               
// | _1| _2| _3| _4|
// +---+---+---+---+
// |  2|  3|  4|  4|
// +---+---+---+---+

 df.select(df.columns map (c => count_not_null(col(c), true).alias(c)): _*).show
 // +---+---+---+---+
 // | _1| _2| _3| _4|
 // +---+---+---+---+
 // |  1|  3|  4|  3|
 // +---+---+---+---+

Answer 2

回答by LaSul

You can use isNotNull():

您可以使用isNotNull()：

df.where(df[YOUR_COLUMN].isNotNull()).select(YOUR_COLUMN).show()

Python 使用 Pyspark 计算 Spark 数据帧每列中非 NaN 条目的数量

提问by RKD314

采纳答案by zero323

回答by LaSul

相关推荐

最近更新

标签

Python 使用 Pyspark 计算 Spark 数据帧每列中非 NaN 条目的数量

提问by RKD314

采纳答案by zero323

回答by LaSul

相关推荐

使用 STARTTLS 从 Python 发送电子邮件

使用 Python 遍历目录

Python 运行时错误：针对 API 版本 a 编译的模块，但此版本的 numpy 是 9

Python 在 Django 模板中相乘

相关推荐

最近更新

标签