Scala: how to find Spark RDD/DataFrame size?
Disclaimer: This page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original address and author information, and attribute it to the original author (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/35008123/
How to find spark RDD/Dataframe size?
Asked by Venu A Positive
I know how to find a file's size in Scala, but how do I find the size of an RDD/DataFrame in Spark?
Scala:
object Main extends App {
val file = new java.io.File("hdfs://localhost:9000/samplefile.txt").toString()
println(file.length)
}
Spark:
val distFile = sc.textFile(file)
println(distFile.length)
But when I process it, I do not get the file size. How do I find the size of the RDD?
Accepted answer by Venu A Positive
Yes, finally I got the solution. Include these imports:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd
How to find the RDD Size:
def calcRDDSize(rdd: RDD[String]): Long = {
rdd.map(_.getBytes("UTF-8").length.toLong)
.reduce(_+_) //add the sizes together
}
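For example, a usage sketch that reuses the HDFS path from the question (any text file path works):
val lines: RDD[String] = sc.textFile("hdfs://localhost:9000/samplefile.txt")
println(s"RDD size in bytes: ${calcRDDSize(lines)}")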
Function to find the DataFrame size (this function just converts the DataFrame to an RDD internally):
// .toDF() needs the SQL implicits in scope (e.g. import sqlContext.implicits._ on Spark 1.x, or import spark.implicits._ on Spark 2.x+)
val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path
val rddOfDataframe = dataFrame.rdd.map(_.toString())
val size = calcRDDSize(rddOfDataframe)
Answered by Glennie Helles Sindholt
If you are simply looking to count the number of rows in the RDD, do:
val distFile = sc.textFile(file)
println(distFile.count)
If you are interested in the size in bytes, you can use the SizeEstimator:
import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))
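SizeEstimator measures the object graph of whatever you hand it on the driver. One variant (an illustrative sketch, not from the answer; the sample size of 1000 is an arbitrary assumption) is to measure a small collected sample and extrapolate by the row count:
import org.apache.spark.util.SizeEstimator

val sampleRows = distFile.take(1000)                  // small driver-side sample
val bytesPerRow =
  if (sampleRows.nonEmpty) SizeEstimator.estimate(sampleRows) / sampleRows.length else 0L
val approxTotalBytes = bytesPerRow * distFile.count() // extrapolate to the full RDD
println(s"Approximate in-memory size: $approxTotalBytes bytes")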
https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
Answered by Ram Ghadiyaram
Below is one way, apart from SizeEstimator, that I use frequently.
To find out from code whether an RDD is cached, and more precisely how many of its partitions are cached in memory and how many on disk; to get its storage level; and to know the current actual caching status and memory consumption:
SparkContext has a developer API method, getRDDStorageInfo(); occasionally you can use this.
Return information about what RDDs are cached, if they are in mem or on disk, how much space they take, etc.
For example:
scala> sc.getRDDStorageInfo
res3: Array[org.apache.spark.storage.RDDInfo] = Array(RDD "HiveTableScan [name#0], (MetastoreRelation sparkdb, firsttable, None), None " (3) StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 1; TotalPartitions: 1; MemorySize: 256.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B)
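A small sketch of reading this programmatically (getRDDStorageInfo is a developer API, so the exact RDDInfo fields may vary between Spark versions):
sc.getRDDStorageInfo.foreach { info =>
  // one RDDInfo per RDD the driver knows about
  println(s"RDD ${info.id} '${info.name}': " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"memory = ${info.memSize} B, disk = ${info.diskSize} B, " +
    s"level = ${info.storageLevel}")
}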
It seems the Spark UI also uses the same information from this code.
- See this source issue, SPARK-17019, which describes...
Description
With SPARK-13992, Spark supports persisting data into off-heap memory, but the usage of off-heap memory is not currently exposed, which makes it inconvenient for users to monitor and profile, so it is proposed here to expose off-heap as well as on-heap memory usage in various places:
- Spark UI's executor page will display both on-heap and off-heap memory usage.
- REST request returns both on-heap and off-heap memory.
- Also, these two memory usage figures can be obtained programmatically from SparkListener (see the sketch after this list).
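The last bullet mentions SparkListener; below is a minimal sketch (an illustration, not code from the issue) that logs the memory and disk footprint of the RDDs touched by each completed stage:
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

class RddSizeListener extends SparkListener {
  // Called on the driver when a stage finishes; stageInfo.rddInfos describes the RDDs it touched.
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    stageCompleted.stageInfo.rddInfos.foreach { info =>
      println(s"Stage ${stageCompleted.stageInfo.stageId}, RDD '${info.name}': " +
        s"memory = ${info.memSize} B, disk = ${info.diskSize} B")
    }
  }
}

sc.addSparkListener(new RddSizeListener)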

