Tensorflow出现CUDNN_STATUS_INTERNAL_ERROR如何解决?

发现在最近安装的Ubuntu20.4上使用TensorFlow偶尔会出现如下错误,

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6253 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
Epoch 1/12
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
E tensorflow/stream_executor/cuda/cuda_dnn.cc:328] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:328] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

这个CUDNN_STATUS_INTERNAL_ERROR导致模型训练无法使用GPU,只能在CPU上跑。我看Github上有解决方法是重启主机,但是这个万能重启打法并不友好。

这里有一种解决方案。有些情况下是因为硬件和系统的问题,导致TensorFlow无法连接到GPU,可以通过重启NVIDIA内核解决,

1
2
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

我们也可以把这段小代码写到.bashrc上,

1
2
3
4
restart_gpu() {
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
}

这样遇到CUDNN_STATUS_INTERNAL_ERROR问题时直接在命令行上执行restart_gpu即可。

如果以上尝试桌面版系统如果还是无办法解决,重启主机即可。

转载请包括本文地址:https://allenwind.github.io/blog/12321
更多文章请参考:https://allenwind.github.io/blog/archives/