Python - Google Colaboratory: misleading information about its GPU (only 5% of the RAM available to some users)

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/48750199/

Google Colaboratory: misleading information about its GPU (only 5% RAM available to some users)

python, machine-learning, gpu, ram, google-colaboratory

Asked by stason

Update: this question relates to Google Colab's "Notebook settings: Hardware accelerator: GPU". It was written before the "TPU" option was added.

After reading multiple excited announcements about Google Colaboratory providing a free Tesla K80 GPU, I tried to run a fast.ai lesson on it, only for it to never complete - it quickly ran out of memory. I started investigating why.

The bottom line is that the "free Tesla K80" is not "free" for everyone - for some users, only a small slice of it is "free".

I connect to Google Colab from the west coast of Canada and I get only 0.5GB of what is supposed to be 24GB of GPU RAM. Other users get access to 11GB of GPU RAM.

Clearly, 0.5GB of GPU RAM is insufficient for most ML/DL work.

If you're not sure what you get, here is a little debug function I scraped together (it only works with the GPU setting enabled for the notebook):

# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil psutil humanize

import os
import psutil
import humanize
import GPUtil as GPU

# XXX: only one GPU on Colab, and even that isn't guaranteed
GPUs = GPU.getGPUs()
gpu = GPUs[0]

def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
          " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(
        gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil * 100, gpu.memoryTotal))

printm()

Executing it in a Jupyter notebook before running any other code gives me:

Gen RAM Free: 11.6 GB  | Proc size: 666.0 MB
GPU RAM Free: 566MB | Used: 10873MB | Util  95% | Total 11439MB

The lucky users who get access to the full card will see:

Gen RAM Free: 11.6 GB  | Proc size: 666.0 MB
GPU RAM Free: 11439MB | Used: 0MB | Util  0% | Total 11439MB

Do you see any flaw in my calculation of the GPU RAM availability, borrowed from GPUtil?

Can you confirm that you get similar results if you run this code in a Google Colab notebook?

If my calculations are correct, is there any way to get more of that GPU RAM on the free box?

Update: I'm not sure why some of us get 1/20th of what other users get. E.g., the person who helped me debug this is from India and he gets the whole thing!

Note: please don't send any more suggestions on how to kill potentially stuck/runaway/parallel notebooks that might be consuming parts of the GPU. No matter how you slice it, if you are in the same boat as me and were to run the debug code, you'd see that you still get a total of 5% of the GPU RAM (as of this update, still).

Accepted answer by stason

So, to prevent another dozen answers suggesting !kill -9 -1, which is invalid in the context of this thread, let's close this thread:

The answer is simple:

As of this writing, Google simply gives only 5% of the GPU to some of us, and 100% to the others. Period.

Dec 2019 update: The problem still exists - this question's upvotes still keep coming.

Mar 2019 update: A year later, a Google employee, @AmiF, commented on the state of things, stating that the problem doesn't exist and that anybody who seems to have it simply needs to reset their runtime to recover the memory. Yet the upvotes continue, which tells me that the problem still exists, despite @AmiF's suggestion to the contrary.

Dec 2018 update: I have a theory that Google may keep a blacklist of certain accounts, or perhaps browser fingerprints, when its robots detect non-standard behavior. It could be a total coincidence, but for quite some time I had an issue with Google reCAPTCHA on any website that happened to require it: I'd have to go through dozens of puzzles before being allowed through, often taking me 10+ minutes to complete. This lasted for many months. Then, all of a sudden, as of this month I get no puzzles at all, and any Google reCAPTCHA gets resolved with a single mouse click, as it used to be almost a year ago.

And why am I telling this story? Well, because at the same time I was given 100% of the GPU RAM on Colab. That's why my suspicion is that if you are on a theoretical Google blacklist, then you aren't trusted to be given a lot of resources for free. I wonder whether any of you find the same correlation between limited GPU access and the reCAPTCHA nightmare. As I said, it could be a total coincidence as well.

Answer by Nguyễn Tài Long

Last night I ran your snippet and got exactly what you got:

Gen RAM Free: 11.6 GB  | Proc size: 666.0 MB
GPU RAM Free: 566MB | Used: 10873MB | Util  95% | Total 11439MB

but today:

Gen RAM Free: 12.2 GB  | Proc size: 131.5 MB
GPU RAM Free: 11439MB | Used: 0MB | Util   0% | Total 11439MB

I think the most probable reason is that the GPUs are shared among VMs, so each time you restart the runtime you have a chance of switching GPUs, and there is also a chance that you switch to one that is being used by other users.

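One quick way to check which physical GPU the VM was handed, and how much of its memory is already claimed, is to query nvidia-smi from a cell. This is a minimal sketch, assuming the nvidia-smi binary is reachable (the symlink from the question's snippet sets that up):

# Show the assigned GPU's identity and its current memory state
!nvidia-smi --query-gpu=name,uuid,memory.total,memory.used,memory.free --format=csv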

UPDATE: It turns out that I can use the GPU normally even when the GPU RAM Free is 504 MB, which I had thought was the cause of the ResourceExhaustedError I got last night.

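One possible reason this can work: TensorFlow's default allocator tries to grab most of the GPU memory up front, which fails on a mostly-occupied card, while letting the allocation grow on demand (or hard-capping it) can keep the remaining slice usable. A minimal sketch, assuming the TensorFlow 1.x API (tf.ConfigProto / tf.Session) that Colab shipped at the time:

import tensorflow as tf

# Allocate GPU memory as needed instead of grabbing it all up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# ...or hard-cap the allocation to a small fraction of the card:
# config.gpu_options.per_process_gpu_memory_fraction = 0.04
sess = tf.Session(config=config)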

Answer by Ajaychhimpa1

If you execute a cell that contains nothing but

!kill -9 -1

that'll cause all of your runtime's state (including memory, filesystem, and GPU) to be wiped clean and restarted. Wait 30-60 seconds and press the CONNECT button at the top-right to reconnect.

Answer by ivan_bilan

A misleading description on Google's part. I got too excited about it too, I guess. I set everything up, loaded the data, and now I'm not able to do anything with it because only 500MB of memory is allocated to my notebook.

Answer by Manivannan Murugavel

Find the python3 PID and kill that PID. (The original answer includes a screenshot of the process list.)

Note: kill only python3 (pid=130), not the Jupyter python process (pid=122).

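A minimal sketch of doing the same from a notebook cell, assuming standard procps tools in the VM. The PIDs above (130 and 122) are just the examples from that screenshot, so list the processes first; <pid> below is a placeholder for whichever python3 PID your own listing shows:

# List python processes with their PID, memory use and elapsed time
!ps -eo pid,rss,etime,cmd | grep python
# Kill the runaway python3 process - NOT the one backing the Jupyter/Colab runtime
!kill -9 <pid>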

Answer by mkczyk

Restart the Jupyter IPython kernel:

!pkill -9 -f ipykernel_launcher

Answer by Kregnach

I'm not sure if this blacklisting theory is true! It's quite possible that the cards are shared among users. I also ran the test, and my results are the following:

Gen RAM Free: 12.9 GB | Proc size: 142.8 MB
GPU RAM Free: 11441MB | Used: 0MB | Util 0% | Total 11441MB

It seems I'm getting the full card as well. However, I ran it a few times and got the same result. Maybe I will repeat this check a few times during the day to see if anything changes.

Answer by Jainil Patel

Just give Google Colab a heavy task and it will ask us to switch to 25 GB of RAM.

For example, run this code twice:

import numpy as np
from keras.layers import Conv2D, MaxPooling2D, AveragePooling2D
from keras.layers import Dropout, Flatten, Dense
from keras.models import Sequential
from keras.datasets import cifar10

(train_features, train_labels), (test_features, test_labels) = cifar10.load_data()

# A deliberately oversized model: the stack of 25600-unit Dense layers holds roughly
# two billion parameters, enough to exhaust the default runtime's memory so that
# Colab offers the higher-RAM runtime.
model = Sequential()

model.add(Conv2D(filters=16, kernel_size=(2, 2), padding="same", activation="relu", input_shape=(train_features.shape[1:])))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(filters=32, kernel_size=(3, 3), padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(filters=64, kernel_size=(4, 4), padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Flatten())

model.add(Dense(25600, activation="relu"))
model.add(Dense(25600, activation="relu"))
model.add(Dense(25600, activation="relu"))
model.add(Dense(25600, activation="relu"))
model.add(Dense(10, activation="softmax"))

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.fit(train_features, train_labels, validation_split=0.2, epochs=10, batch_size=128, verbose=1)

Then click on "Get more RAM" :)

Answer by Ritwik G

I believe the issue is having multiple notebooks open: just closing a notebook doesn't actually stop its process, and I haven't figured out how to stop it from the UI. So I used top to find the PID of the python3 process that had been running the longest and was using most of the memory, and I killed it. Everything is back to normal now.

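A minimal sketch of that lookup from a notebook cell, assuming standard procps tools; sorting by elapsed time puts the longest-running (likely leftover) python3 process first:

# Python processes sorted by elapsed time, longest-running first
!ps -eo pid,etime,rss,cmd --sort=-etime | grep python3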