Failed to get convolution algorithm. This is probably because cuDNN failed to initialize
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/53698035/
Asked by Steve-0 Dev.
In TensorFlow/Keras, when running the code from https://github.com/pierluigiferrari/ssd_keras (specifically ssd300_evaluation), I received this error:
Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
This is very similar to the unsolved question "Google Colab Error: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize".
I'm running the following setup:
Python: 3.6.4
TensorFlow version: 1.12.0
Keras version: 2.2.4
CUDA: V10.0
cuDNN: V7.4.1.5
GPU: NVIDIA GeForce GTX 1080
I also ran:
import tensorflow as tf
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
with tf.Session() as sess:
    print(sess.run(c))
This ran with no errors or issues.
A minimal example is:
from keras import backend as K
from keras.models import load_model
from keras.optimizers import Adam
from scipy.misc import imread
import numpy as np
from matplotlib import pyplot as plt
from models.keras_ssd300 import ssd_300
from keras_loss_function.keras_ssd_loss import SSDLoss
from keras_layers.keras_layer_AnchorBoxes import AnchorBoxes
from keras_layers.keras_layer_DecodeDetections import DecodeDetections
from keras_layers.keras_layer_DecodeDetectionsFast import DecodeDetectionsFast
from keras_layers.keras_layer_L2Normalization import L2Normalization
from data_generator.object_detection_2d_data_generator import DataGenerator
from eval_utils.average_precision_evaluator import Evaluator
import tensorflow as tf
%matplotlib inline
import keras
keras.__version__
# Set a few configuration parameters.
img_height = 300
img_width = 300
n_classes = 20
model_mode = 'inference'
K.clear_session() # Clear previous models from memory.
model = ssd_300(image_size=(img_height, img_width, 3),
                n_classes=n_classes,
                mode=model_mode,
                l2_regularization=0.0005,
                scales=[0.1, 0.2, 0.37, 0.54, 0.71, 0.88, 1.05],  # The scales for MS COCO [0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05]
                aspect_ratios_per_layer=[[1.0, 2.0, 0.5],
                                         [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                                         [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                                         [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                                         [1.0, 2.0, 0.5],
                                         [1.0, 2.0, 0.5]],
                two_boxes_for_ar1=True,
                steps=[8, 16, 32, 64, 100, 300],
                offsets=[0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
                clip_boxes=False,
                variances=[0.1, 0.1, 0.2, 0.2],
                normalize_coords=True,
                subtract_mean=[123, 117, 104],
                swap_channels=[2, 1, 0],
                confidence_thresh=0.01,
                iou_threshold=0.45,
                top_k=200,
                nms_max_output_size=400)
# 2: Load the trained weights into the model.
# TODO: Set the path of the trained weights.
weights_path = 'C:/Users/USAgData/TF SSD Keras/weights/VGG_VOC0712Plus_SSD_300x300_iter_240000.h5'
model.load_weights(weights_path, by_name=True)
# 3: Compile the model so that Keras won't complain the next time you load it.
adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
ssd_loss = SSDLoss(neg_pos_ratio=3, alpha=1.0)
model.compile(optimizer=adam, loss=ssd_loss.compute_loss)
dataset = DataGenerator()
# TODO: Set the paths to the dataset here.
dir= "C:/Users/USAgData/TF SSD Keras/VOC/VOCtest_06-Nov-2007/VOCdevkit/VOC2007/"
Pascal_VOC_dataset_images_dir = dir+ 'JPEGImages'
Pascal_VOC_dataset_annotations_dir = dir + 'Annotations/'
Pascal_VOC_dataset_image_set_filename = dir+'ImageSets/Main/test.txt'
# The XML parser needs to know what object class names to look for and in which order to map them to integers.
classes = ['background',
'aeroplane', 'bicycle', 'bird', 'boat',
'bottle', 'bus', 'car', 'cat',
'chair', 'cow', 'diningtable', 'dog',
'horse', 'motorbike', 'person', 'pottedplant',
'sheep', 'sofa', 'train', 'tvmonitor']
dataset.parse_xml(images_dirs=[Pascal_VOC_dataset_images_dir],
                  image_set_filenames=[Pascal_VOC_dataset_image_set_filename],
                  annotations_dirs=[Pascal_VOC_dataset_annotations_dir],
                  classes=classes,
                  include_classes='all',
                  exclude_truncated=False,
                  exclude_difficult=False,
                  ret=False)
evaluator = Evaluator(model=model,
                      n_classes=n_classes,
                      data_generator=dataset,
                      model_mode=model_mode)
results = evaluator(img_height=img_height,
                    img_width=img_width,
                    batch_size=8,
                    data_generator_mode='resize',
                    round_confidences=False,
                    matching_iou_threshold=0.5,
                    border_pixels='include',
                    sorting_algorithm='quicksort',
                    average_precision_mode='sample',
                    num_recall_points=11,
                    ignore_neutral_boxes=True,
                    return_precisions=True,
                    return_recalls=True,
                    return_average_precisions=True,
                    verbose=True)
Accepted answer by gatefun
I had this error and I fixed it by uninstalling all CUDA and cuDNN versions from my system. Then I installed CUDA Toolkit 9.0 (without any patches) and cuDNN v7.4.1 for CUDA 9.0.
Answer by waterproof
I've seen this error message for three different reasons, with different solutions:
1. You have cache issues
I regularly work around this error by shutting down my Python process, removing the ~/.nv directory (on Linux, rm -rf ~/.nv), and restarting the Python process. I don't know exactly why this works. It's probably at least partly related to the second option:
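For anyone who prefers to do the cleanup from Python rather than the shell, here is a minimal sketch of the same step. The helper name clear_nv_cache and its cache_dir parameter are hypothetical, not part of any library; it simply deletes the directory the answer names.

```python
import shutil
from pathlib import Path


def clear_nv_cache(cache_dir=None):
    """Delete the NVIDIA cache directory; return True if something was removed.

    Hypothetical helper: defaults to ~/.nv, the directory named in the answer.
    """
    path = Path(cache_dir) if cache_dir is not None else Path.home() / ".nv"
    if not path.is_dir():
        # Nothing to clean up.
        return False
    shutil.rmtree(path)
    return True
```

Call clear_nv_cache() only after every Python process that uses the GPU has been shut down, then start your script again.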
2. You're out of memory
The error can also show up if you run out of graphics card RAM. With an NVIDIA GPU you can check graphics card memory usage with nvidia-smi. This will give you not only a readout of how much GPU RAM you have in use (something like 6025MiB / 6086MiB if you're almost at the limit) but also a list of which processes are using GPU RAM.
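If you want the same readout inside a script, the following is a rough sketch that shells out to nvidia-smi and parses its CSV output. The function names are made up for illustration, and it assumes nvidia-smi is on the PATH and that the output has one "used, total" pair per GPU.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs (one GPU per line) into a list of tuples."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory use; needs an NVIDIA driver installed."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_memory(out)
```

On the nearly full card from the example above, query_gpu_memory() would be expected to return something like [(6025, 6086)].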
If you've run out of RAM, you'll need to restart the process (which should free up the RAM) and then take a less memory-intensive approach. A few options are:
- reducing your batch size
- using a simpler model
- using less data
- limiting the TensorFlow GPU memory fraction. For example, the following will make sure TensorFlow uses <= 90% of your GPU RAM:
import keras
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
keras.backend.tensorflow_backend.set_session(tf.Session(config=config))
This can slow down your model evaluation if not used together with the items above, presumably since the large data set will have to be swapped in and out to fit into the small amount of memory you've allocated.
3. You have incompatible versions of CUDA, TensorFlow, NVIDIA drivers, etc.
If you've never had similar models working, you're not running out of VRAM, and your cache is clean, I'd go back and set up CUDA + TensorFlow using the best available installation guide. I have had the most success following the instructions at https://www.tensorflow.org/install/gpu rather than those on the NVIDIA / CUDA site. Lambda Stack is also a good way to go.
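To make the version check concrete, here is a small sketch that compares an installed combination against a lookup table. The table entries are an assumption based on TensorFlow's published tested build configurations, so please verify them against the current install guide. The helper name check_versions is hypothetical.

```python
# Assumed tested GPU configurations (TensorFlow version -> CUDA / cuDNN).
# Verify against the official table before relying on these values.
TESTED_CONFIGS = {
    "1.8.0": {"cuda": "9.0", "cudnn": "7"},
    "1.12.0": {"cuda": "9.0", "cudnn": "7"},
    "1.13.1": {"cuda": "10.0", "cudnn": "7.4"},
    "2.0.0": {"cuda": "10.0", "cudnn": "7.4"},
}


def check_versions(tf_version, cuda_version, cudnn_version):
    """Return a list of mismatch messages (empty if the versions line up)."""
    expected = TESTED_CONFIGS.get(tf_version)
    if expected is None:
        return ["no table entry for TensorFlow " + tf_version]
    problems = []
    if cuda_version != expected["cuda"]:
        problems.append("CUDA %s installed, %s expected"
                        % (cuda_version, expected["cuda"]))
    # Compare only the leading cuDNN components listed in the table.
    want = expected["cudnn"]
    if not (cudnn_version == want or cudnn_version.startswith(want + ".")):
        problems.append("cuDNN %s installed, %s expected"
                        % (cudnn_version, want))
    return problems


# The setup from the question: TensorFlow 1.12.0 with CUDA 10.0.
print(check_versions("1.12.0", "10.0", "7.4.1.5"))
```

Under these assumptions the question's setup reports a CUDA mismatch, which is consistent with the accepted answer's fix of moving to CUDA 9.0.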
Answer by Bensuperpc
I had the same issue and solved it with this:
import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'  # set this before TensorFlow initializes the GPU
or
import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices('GPU')
if len(physical_devices) > 0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
Answer by Mainak Dutta
The problem is that TensorFlow 1.10.x and later versions are incompatible with cuDNN 7.0.5 and CUDA 9.0. The easiest fix is to downgrade TensorFlow to 1.8.0:
pip install --upgrade tensorflow-gpu==1.8.0
Answer by Shanu Dey
Keras is included in TensorFlow 2.0 and above. So:
- remove import keras, and
- replace each from keras.module.module import class statement with from tensorflow.keras.module.module import class
Maybe your GPU memory is full. In that case, use allow_growth = True in the GPU options. This is deprecated now, but using the code snippet below after your imports may solve your problem.
import tensorflow as tf
from tensorflow.compat.v1.keras.backend import set_session
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True # dynamically grow the memory used on the GPU
config.log_device_placement = True # to log device placement (on which device the operation ran)
sess = tf.compat.v1.Session(config=config)
set_session(sess)
Answer by Vidit Varshney
I got the same error. It is caused by a mismatch between your CUDA/cuDNN version and your TensorFlow version. There are two ways to solve this:
Either downgrade your TensorFlow version:
pip install --upgrade tensorflow-gpu==1.8.0
Or you can follow the steps here.
Tip: choose your Ubuntu version and follow the steps. :-)
Answer by RadV
I had this problem after upgrading to TF 2.0. The following line started giving the error:
outputs = tf.nn.conv2d(images, filters, strides=1, padding="SAME")
I am using Ubuntu 16.04.6 LTS (Azure data science VM) and TensorFlow 2.0. I upgraded following the instructions on the TensorFlow GPU instructions page, and this resolved the issue for me. By the way, it's a bunch of apt-get update/install commands, and I executed all of them.
Answer by Ralph Bisschops
This is a follow-up to point 2 of https://stackoverflow.com/a/56511889/2037998.
2. You're out of memory
I used the following code to limit the GPU RAM usage:
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate X GB of memory on the first GPU (here 1024*4 MB = 4 GB)
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=(1024*4))])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
This code sample comes from: TensorFlow: Use a GPU: Limiting GPU memory growth. Put this code before any other TF/Keras code you are using.
Note: The application might still use a bit more GPU RAM than the number above.
Note 2: If the system also runs other applications (like a UI), these programs can also consume some GPU RAM (Xorg, Firefox, ... sometimes up to 1 GB of GPU RAM combined).
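To pick a value for memory_limit, it can help to relate it to the memory fraction used in the earlier answer. This is only a sketch: the helper name and the reserved_mib parameter (headroom for Xorg, a browser, and so on) are hypothetical, and it assumes memory_limit is given in MB as the TensorFlow documentation describes.

```python
def fraction_to_memory_limit_mb(total_mib, fraction, reserved_mib=0):
    """Turn a memory fraction into a memory_limit value in MB.

    reserved_mib is headroom set aside for other programs before the
    fraction is applied (an illustrative choice, not a TensorFlow rule).
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    usable = total_mib - reserved_mib
    return int(usable * fraction)


# 90% of a 6086 MiB card, with and without 1 GiB kept back for the desktop.
print(fraction_to_memory_limit_mb(6086, 0.9))
print(fraction_to_memory_limit_mb(6086, 0.9, reserved_mib=1024))
```

The result can be passed as memory_limit in the VirtualDeviceConfiguration call shown above.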
Answer by Paktalin
I struggled with this problem for a week. The reason was very silly: I used high-res photos for training.
Hopefully, this will save someone some time :)
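A quick back-of-the-envelope calculation shows why image resolution matters so much. The sketch below only counts the float32 input batch. Convolutional feature maps grow roughly in proportion to the pixel count as well, so the real footprint is considerably larger. The 4000x3000 photo size is an assumed example, not a figure from the answer.

```python
def batch_input_bytes(batch_size, height, width, channels=3, bytes_per_value=4):
    """Bytes needed for one float32 input batch of shape (batch, H, W, C)."""
    return batch_size * height * width * channels * bytes_per_value


small = batch_input_bytes(8, 300, 300)    # SSD300-sized inputs
large = batch_input_bytes(8, 3000, 4000)  # assumed 12-megapixel photos
print(small, large, large / small)
```

With a batch size of 8, the 300x300 inputs take about 8.6 MB while the unresized photos take about 1.15 GB for the input tensor alone, which is why resizing images before training often makes the error disappear.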
Answer by kHarshit
The problem can also occur if there is an incompatible version of cuDNN, which could be the case if you installed TensorFlow with conda, as conda also installs CUDA and cuDNN while installing TensorFlow.
The solution is to install TensorFlow with pip, and install CUDA and cuDNN separately without conda. For example, if you have CUDA 10.0.130 and cuDNN 7.4.1 (tested configurations), then:
pip install tensorflow-gpu==1.13.1