
Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/37077432/


How to estimate dataframe real size in pyspark?

Tags: python, apache-spark, dataframe, spark-csv

Asked by TheSilence

How to determine a dataframe size?


Right now I estimate the real size of a dataframe as follows:


# Rough estimate: sum the string lengths of the header names and of every cell value
headers_size = sum(len(key) for key in df.first().asDict())
rows_size = df.rdd.map(lambda row: sum(len(str(value)) for value in row.asDict().values())).sum()
total_size = headers_size + rows_size

It is too slow and I'm looking for a better way.


Answered by Ziggy Eunicien

Nice post from Tamas Szuromi: http://metricbrew.com/how-to-estimate-rdd-or-dataframe-real-size-in-pyspark/


from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
def _to_java_object_rdd(rdd):  
    """ Return a JavaRDD of Object by unpickling
    It will convert each Python object into Java object by Pyrolite, whenever the
    RDD is serialized in batch or not.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

# Convert the DataFrame's underlying RDD into an RDD of JVM objects
JavaObj = _to_java_object_rdd(df.rdd)

# Ask Spark's SizeEstimator (via the active SparkContext `sc`) for the
# in-memory size of the JVM object graph, in bytes
nbytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
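
SizeEstimator returns the estimated in-memory footprint in bytes. As a small follow-up sketch (assuming the snippet above has run and `nbytes` is set), you can convert it to megabytes:

size_mb = nbytes / (1024 ** 2)  # bytes -> megabytes
print("Estimated DataFrame size: %.2f MB" % size_mb)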

回答by Kiran Thati

Currently I am using the approach below; I'm not sure whether it is the best way.


from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_ONLY)
df.count()


On the Spark web UI, under the Storage tab, you can check the size, which is displayed in MB; then I unpersist to clear the memory.


df.unpersist()

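If you would rather read the cached size programmatically than from the web UI, here is a minimal sketch using Spark's internal storage-info API (the `_jsc` handle and `getRDDStorageInfo` are internal and may change between versions; `spark` is assumed to be an existing SparkSession):

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)
df.count()  # force materialization so the cached size is recorded

# getRDDStorageInfo() returns one entry per cached RDD; memSize() is in bytes
for info in spark.sparkContext._jsc.sc().getRDDStorageInfo():
    print(info.name(), info.memSize() / (1024 ** 2), "MB in memory")

df.unpersist()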