Python 如何在 PySpark 中删除 RDD 以释放资源?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/27990616/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverFlow
How to delete an RDD in PySpark for the purpose of releasing resources?
提问 by Ego
If I have an RDD that I no longer need, how do I delete it from memory? Would the following be enough to get this done:
如果我有一个不再需要的 RDD,如何从内存中删除它?以下是否足以完成这项工作:
del thisRDD
Thanks!
谢谢!
回答 by 0x0FFF
No, del thisRDD is not enough, it would just delete the pointer to the RDD. You should call thisRDD.unpersist() to remove the cached data.
不,del thisRDD 还不够,它只会删除指向 RDD 的指针。您应该调用 thisRDD.unpersist() 来删除缓存的数据。
For your information, Spark uses a lazy computation model, which means that when you run this code:
供您参考,Spark 使用延迟计算模型,这意味着当您运行此代码时:
>>> thisRDD = sc.parallelize(xrange(10),2).cache()
you won't actually have any data cached yet; it will only be marked as 'to be cached' in the RDD execution plan. You can check it this way:
您不会真正缓存任何数据,它只会在 RDD 执行计划中标记为“要缓存”。你可以这样检查:
>>> print thisRDD.toDebugString()
(2) PythonRDD[6] at RDD at PythonRDD.scala:43 [Memory Serialized 1x Replicated]
| ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:364 [Memory Serialized 1x Replicated]
But when you call an action on top of this RDD at least once, it would become cached:
但是当你在这个 RDD 之上调用一个动作至少一次时,它会被缓存:
>>> thisRDD.count()
10
>>> print thisRDD.toDebugString()
(2) PythonRDD[6] at RDD at PythonRDD.scala:43 [Memory Serialized 1x Replicated]
| CachedPartitions: 2; MemorySize: 174.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
| ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:364 [Memory Serialized 1x Replicated]
You can easily check the persisted data and the level of persistence in the Spark UI at http://<driver_node>:4040/storage. There you would see that del thisRDD won't change the persistence of this RDD, but thisRDD.unpersist() would unpersist it, while you would still be able to use thisRDD in your code (it just won't be kept in memory anymore and will be recomputed each time it is queried).
您可以通过地址 http://<driver_node>:4040/storage 在 Spark UI 中轻松查看持久化的数据和持久化级别。在那里你会看到,del thisRDD 不会改变这个 RDD 的持久化状态,而 thisRDD.unpersist() 会取消它的持久化,同时你仍然可以在代码中继续使用 thisRDD(只是它不再驻留在内存中,每次被查询时都会重新计算)。
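To make the difference concrete, here is a minimal sketch (assuming a running SparkContext sc and the same Python 2 style as the examples above) showing that unpersist() drops the cached data while the variable itself remains usable:
为了让区别更直观,下面是一段示意代码(假设已有 SparkContext sc,并沿用上文的 Python 2 写法),展示 unpersist() 会释放缓存数据,而变量本身仍然可用:
thisRDD = sc.parallelize(xrange(10), 2).cache()
thisRDD.count()        # the first action actually caches the data / 第一次 action 才真正触发缓存
thisRDD.unpersist()    # drop the cached partitions; the call returns the RDD itself / 释放缓存的分区,调用会返回 RDD 本身
thisRDD.count()        # still usable, just recomputed on every action / 仍然可用,只是每次 action 都会重新计算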
回答 by nonsleepr
Short answer: it depends.
简短的回答:这取决于。
According to the pyspark v1.3.0 source code, del thisRDD should be enough for a PipelinedRDD, which is an RDD generated by a Python mapper/reducer:
根据 pyspark v1.3.0 的源代码,对于 PipelinedRDD(由 Python mapper/reducer 生成的 RDD)来说,del thisRDD 应该就足够了:
class PipelinedRDD(RDD):
    # ...
    def __del__(self):
        if self._broadcast:
            self._broadcast.unpersist()
            self._broadcast = None
The RDD class, on the other hand, doesn't have a __del__ method (though it probably should), so you should call the unpersist method yourself.
另一方面,RDD 类没有 __del__ 方法(虽然它或许应该有),所以你需要自己调用 unpersist 方法。
Edit: the __del__ method was deleted in this commit.
编辑:__del__ 方法已在此提交中被删除。
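In other words, for a plain RDD you have to combine the two steps yourself; a minimal illustrative sketch (assuming a running SparkContext sc) would be:
换句话说,对于普通的 RDD,你需要自己把这两步结合起来;一个示意写法如下(假设已有 SparkContext sc):
someRDD = sc.parallelize(range(100)).cache()
someRDD.count()        # materialize and cache the data / 触发计算并缓存数据
someRDD.unpersist()    # a plain RDD has no __del__, so release the cache explicitly / 普通 RDD 没有 __del__,需要显式释放缓存
del someRDD            # then drop the Python reference / 然后删除 Python 端的引用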
回答 by joshsuihn
Just FYI,
I would recommend gc.collect() after del (if the RDD takes up a lot of memory).
仅供参考,我建议在 del 之后调用 gc.collect()(如果 RDD 占用大量内存)。
回答 by Stuart Berg
Short answer: The following code should do the trick:
简短回答:以下代码应该可以解决问题:
import gc
del thisRDD
gc.collect()
Explanation:
解释:
Even if you are using PySpark, your RDD's data is managed on the Java side, so first let's ask the same question, but for Java instead of Python:
即使你使用的是 PySpark,你的 RDD 的数据也是在 Java 端管理的,所以首先让我们问同样的问题,但是对于 Java 而不是 Python:
If I'm using Java, and I simply release all references to my RDD, is that sufficient to automatically unpersist it?
如果我使用 Java,并且我只是释放对我的 RDD 的所有引用,这是否足以自动取消它?
For Java, the answer is YES, the RDD will be automatically unpersisted when it is garbage collected, according to this answer. (Apparently that functionality was added to Spark in this PR.)
对于 Java,答案是 YES,根据这个答案,RDD 在被垃圾收集时会自动取消持久化。(显然这个功能是在这个 PR 中添加到 Spark 中的。)
OK, what happens in Python? If I remove all references to my RDD in Python, does that cause them to be removed on the Java side?
好的,在 Python 中会发生什么?如果我在 Python 中删除对我的 RDD 的所有引用,是否会导致它们在 Java 端被删除?
PySpark uses Py4J to send objects from Python to Java and vice-versa. According to the Py4J Memory Model Docs:
PySpark 使用 Py4J 将对象从 Python 发送到 Java,反之亦然。根据 Py4J 内存模型文档:
Once the object is garbage collected on the Python VM (reference count == 0), the reference is removed on the Java VM
一旦对象在 Python VM 上被垃圾回收(引用计数 == 0),Java VM 上的引用就会被删除
But take note: Removing the Python references to your RDD won't cause it to be immediately deleted. You have to wait for the Python garbage collector to clean up the references. You can read the Py4J explanation for details, where they recommend the following:
但请注意:删除对 RDD 的 Python 引用不会导致它立即被删除。您必须等待 Python 垃圾收集器清理引用。您可以阅读 Py4J 解释以了解详细信息,他们推荐以下内容:
A call to gc.collect() also usually works.
调用 gc.collect() 通常也有效。
OK, now back to your original question:
好的,现在回到你最初的问题:
Would the following be enough to get this done:
del thisRDD
以下是否足以完成这项工作:
del thisRDD
Almost. You should remove the last reference to it (i.e. del thisRDD), and then, if you really need the RDD to be unpersisted immediately**, call gc.collect().
几乎。您应该删除指向它的最后一个引用(即 del thisRDD),然后,如果您确实需要 RDD 被立即取消持久化**,请调用 gc.collect()。
**Well, technically, this will immediately delete the reference on the Java side, but there will be a slight delay until Java's garbage collector actually executes the RDD's finalizer and thereby unpersists the data.
**好吧,从技术上讲,这会立即删除 Java 端的引用,但在 Java 的垃圾收集器真正执行该 RDD 的终结器并由此取消数据持久化之前,会有一点延迟。
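Putting it all together, a summary sketch (illustrative only) of the two options would be:
综合起来,两种做法的示意写法如下(仅作说明):
import gc

# Option 1: eventual cleanup is enough — drop the reference and nudge the Python GC;
# the Java GC will later finalize the RDD and unpersist its data.
# 方式一:只需要最终被清理——删除引用并触发 Python GC,Java GC 稍后会终结该 RDD 并取消持久化。
del thisRDD
gc.collect()

# Option 2: the memory must be released right away — unpersist explicitly first.
# 方式二:需要立即释放内存——先显式调用 unpersist()。
# thisRDD.unpersist()
# del thisRDD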

