Python 如何在 PySpark 中删除 RDD 以释放资源?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/27990616/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverFlow
How to delete an RDD in PySpark for the purpose of releasing resources?
提问 by Ego
If I have an RDD that I no longer need, how do I delete it from memory? Would the following be enough to get this done:
如果我有一个不再需要的 RDD,如何从内存中删除它?以下是否足以完成这项工作:
del thisRDD
Thanks!
谢谢!
回答 by 0x0FFF
No, del thisRDD is not enough, it would just delete the pointer to the RDD. You should call thisRDD.unpersist() to remove the cached data.
不,del thisRDD 还不够,它只会删除指向 RDD 的指针。您应该调用 thisRDD.unpersist() 来删除缓存的数据。
For your information, Spark uses a lazy computation model, which means that when you run this code:
供您参考,Spark 使用延迟计算模型,这意味着当您运行此代码时:
>>> thisRDD = sc.parallelize(xrange(10),2).cache()
you won't actually have any data cached yet; it will only be marked as 'to be cached' in the RDD execution plan. You can check it this way:
您不会真正缓存任何数据,它只会在 RDD 执行计划中标记为“要缓存”。你可以这样检查:
>>> print thisRDD.toDebugString()
(2) PythonRDD[6] at RDD at PythonRDD.scala:43 [Memory Serialized 1x Replicated]
| ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:364 [Memory Serialized 1x Replicated]
But when you call an action on top of this RDD at least once, it would become cached:
但是当你在这个 RDD 之上调用一个动作至少一次时,它会被缓存:
>>> thisRDD.count()
10
>>> print thisRDD.toDebugString()
(2) PythonRDD[6] at RDD at PythonRDD.scala:43 [Memory Serialized 1x Replicated]
| CachedPartitions: 2; MemorySize: 174.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
| ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:364 [Memory Serialized 1x Replicated]
You can easily check the persisted data and the level of persistence in the Spark UI at http://<driver_node>:4040/storage. There you would see that del thisRDD won't change the persistence of this RDD, but thisRDD.unpersist() would unpersist it, while you would still be able to use thisRDD in your code (it just won't be kept in memory anymore and will be recomputed each time it is queried).
您可以通过地址 http://<driver_node>:4040/storage 在 Spark UI 中轻松查看持久化的数据和持久化级别。在那里你会看到,del thisRDD 不会改变这个 RDD 的持久化状态,而 thisRDD.unpersist() 会取消它的持久化,同时你仍然可以在代码中继续使用 thisRDD(只是它不再驻留在内存中,每次被查询时都会重新计算)。
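To make the difference concrete, here is a minimal sketch (assuming a running SparkContext sc and the same Python 2 style as the examples above) showing that unpersist() drops the cached data while the variable itself remains usable:
为了让区别更直观,下面是一段示意代码(假设已有 SparkContext sc,并沿用上文的 Python 2 写法),展示 unpersist() 会释放缓存数据,而变量本身仍然可用:
thisRDD = sc.parallelize(xrange(10), 2).cache()
thisRDD.count()        # the first action actually caches the data / 第一次 action 才真正触发缓存
thisRDD.unpersist()    # drop the cached partitions; the call returns the RDD itself / 释放缓存的分区,调用会返回 RDD 本身
thisRDD.count()        # still usable, just recomputed on every action / 仍然可用,只是每次 action 都会重新计算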
回答 by nonsleepr
Short answer: it depends.
简短的回答:这取决于。
According to the pyspark v1.3.0 source code, del thisRDD should be enough for a PipelinedRDD, which is an RDD generated by a Python mapper/reducer:
根据 pyspark v1.3.0 的源代码,对于 PipelinedRDD(由 Python mapper/reducer 生成的 RDD)来说,del thisRDD 应该就足够了:
class PipelinedRDD(RDD):
    # ...
    def __del__(self):
        if self._broadcast:
            self._broadcast.unpersist()
            self._broadcast = None
The RDD class, on the other hand, doesn't have a __del__ method (though it probably should), so you should call the unpersist method yourself.
另一方面,RDD 类没有 __del__ 方法(虽然它或许应该有),所以你需要自己调用 unpersist 方法。
Edit: the __del__ method was deleted in this commit.
编辑:__del__ 方法已在此提交中被删除。
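In other words, for a plain RDD you have to combine the two steps yourself; a minimal illustrative sketch (assuming a running SparkContext sc) would be:
换句话说,对于普通的 RDD,你需要自己把这两步结合起来;一个示意写法如下(假设已有 SparkContext sc):
someRDD = sc.parallelize(range(100)).cache()
someRDD.count()        # materialize and cache the data / 触发计算并缓存数据
someRDD.unpersist()    # a plain RDD has no __del__, so release the cache explicitly / 普通 RDD 没有 __del__,需要显式释放缓存
del someRDD            # then drop the Python reference / 然后删除 Python 端的引用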
回答 by joshsuihn
Just FYI,
I would recommend gc.collect() after del (if the RDD takes up a lot of memory).
仅供参考,我建议在 del 之后调用 gc.collect()(如果 RDD 占用大量内存)。
回答 by Stuart Berg
Short answer: The following code should do the trick:
简短回答:以下代码应该可以解决问题:
import gc
del thisRDD
gc.collect()
Explanation:
解释:
Even if you are using PySpark, your RDD's data is managed on the Java side, so first let's ask the same question, but for Java instead of Python:
即使你使用的是 PySpark,你的 RDD 的数据也是在 Java 端管理的,所以首先让我们问同样的问题,但是对于 Java 而不是 Python:
If I'm using Java, and I simply release all references to my RDD, is that sufficient to automatically unpersist it?
如果我使用 Java,并且我只是释放对我的 RDD 的所有引用,这是否足以自动取消它?
For Java, the answer is YES, the RDD will be automatically unpersisted when it is garbage collected, according to this answer. (Apparently that functionality was added to Spark in this PR.)
对于 Java,答案是 YES,根据这个答案,RDD 在被垃圾收集时会自动取消持久化。(显然这个功能是在这个 PR 中添加到 Spark 中的。)
OK, what happens in Python? If I remove all references to my RDD in Python, does that cause them to be removed on the Java side?
好的,在 Python 中会发生什么?如果我在 Python 中删除对我的 RDD 的所有引用,是否会导致它们在 Java 端被删除?
PySpark uses Py4J to send objects from Python to Java and vice-versa. According to the Py4J Memory Model Docs:
PySpark 使用 Py4J 将对象从 Python 发送到 Java,反之亦然。根据 Py4J 内存模型文档:
Once the object is garbage collected on the Python VM (reference count == 0), the reference is removed on the Java VM
一旦对象在 Python VM 上被垃圾回收(引用计数 == 0),Java VM 上的引用就会被删除
But take note: Removing the Python references to your RDD won't cause it to be immediately deleted. You have to wait for the Python garbage collector to clean up the references. You can read the Py4J explanation for details, where they recommend the following:
但请注意:删除对 RDD 的 Python 引用不会导致它立即被删除。您必须等待 Python 垃圾收集器清理引用。您可以阅读 Py4J 解释以了解详细信息,他们推荐以下内容:
A call to gc.collect() also usually works.
调用 gc.collect() 通常也有效。
OK, now back to your original question:
好的,现在回到你最初的问题:
Would the following be enough to get this done:
del thisRDD
以下是否足以完成这项工作:
del thisRDD
Almost. You should remove the last reference to it (i.e. del thisRDD), and then, if you really need the RDD to be unpersisted immediately**, call gc.collect().
几乎。您应该删除指向它的最后一个引用(即 del thisRDD),然后,如果您确实需要 RDD 被立即取消持久化**,请调用 gc.collect()。
**Well, technically, this will immediately delete the reference on the Java side, but there will be a slight delay until Java's garbage collector actually executes the RDD's finalizer and thereby unpersists the data.
**好吧,从技术上讲,这会立即删除 Java 端的引用,但在 Java 的垃圾收集器真正执行该 RDD 的终结器并由此取消数据持久化之前,会有一点延迟。
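Putting it all together, a summary sketch (illustrative only) of the two options would be:
综合起来,两种做法的示意写法如下(仅作说明):
import gc

# Option 1: eventual cleanup is enough — drop the reference and nudge the Python GC;
# the Java GC will later finalize the RDD and unpersist its data.
# 方式一:只需要最终被清理——删除引用并触发 Python GC,Java GC 稍后会终结该 RDD 并取消持久化。
del thisRDD
gc.collect()

# Option 2: the memory must be released right away — unpersist explicitly first.
# 方式二:需要立即释放内存——先显式调用 unpersist()。
# thisRDD.unpersist()
# del thisRDD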

