CUDNN_STATUS_INTERNAL_ERROR in TensorFlow 2.1 C++

I hit the error in the title when I load a pre-trained model (a .pb model of YOLOv3) and run inference with it in TensorFlow 2.1 C++. The error messages are as follows:
2020-10-30 21:36:20.245492: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-10-30 21:36:20.269906: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
[InferCC] Model infer failed(2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node yolov3/yolo_darknet/conv2d/Conv2D}}]]
[[StatefulPartitionedCall/_791]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node yolov3/yolo_darknet/conv2d/Conv2D}}]]
0 successful operations.
Here is my configuration:
Ubuntu 18.04
TensorFlow 2.1 C++
CUDA 10.1
cuDNN 7.6.5
GPU memory: ~6 GB
$ nvidia-smi
Fri Oct 30 21:50:24 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 2060 Off | 00000000:01:00.0 On | N/A |
| N/A 46C P8 6W / N/A | 553MiB / 5931MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1242 G /usr/lib/xorg/Xorg 18MiB |
| 0 N/A N/A 1886 G /usr/bin/gnome-shell 50MiB |
| 0 N/A N/A 9708 G /usr/lib/xorg/Xorg 321MiB |
| 0 N/A N/A 9876 G /usr/bin/gnome-shell 128MiB |
| 0 N/A N/A 12591 G /usr/lib/firefox/firefox 3MiB |
| 0 N/A N/A 14848 G ...s/QtCreator/bin/qtcreator 3MiB |
| 0 N/A N/A 15696 G ...AAAAAAAAA= --shared-files 21MiB |
+-----------------------------------------------------------------------------+
I searched on the internet and it seems my GPU memory may be running out (I'm not sure about that), so I added the following code to enable GPU memory growth before loading the model:
tensorflow::ConfigProto config;
config.mutable_gpu_options()->set_allow_growth(true);
But no luck, the errors are still there.
I also want to know whether it is indeed a lack of GPU memory (is ~6 GB not enough for the YOLOv3 model)?
Please, someone help me out. Thanks.

Finally, I found a way to fix the error: just pass a SessionOptions variable (with set_allow_growth(true) called on its GPU options) into the LoadSavedModel function:
if(tensorflow::MaybeSavedModelDirectory(modelName.toStdString()))
{
    // Set GPU memory growth to true here...
    tensorflow::SessionOptions sessionOpt;
    sessionOpt.config.mutable_gpu_options()->set_allow_growth(true);

    // ...then pass that SessionOptions variable into LoadSavedModel.
    tensorflow::Status status = tensorflow::LoadSavedModel(
        sessionOpt,
        tensorflow::RunOptions(),
        modelName.toStdString(),
        {tensorflow::kSavedModelTagServe},
        _model);
    if(!status.ok())
    {
        qCritical("[InferCC] Load %s model failed - %s.", modelName.toStdString().c_str(), status.error_message().c_str());
        _loaded = false;
        return false;
    }
}
Then I ran the program while repeatedly calling nvidia-smi on the command line; it showed the GPU memory usage increasing, with the final state (5278 MiB used by the process) as follows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 2060 Off | 00000000:01:00.0 On | N/A |
| N/A 58C P0 63W / N/A | 5832MiB / 5931MiB | 37% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 656 C ...VisTrack/release/VisTrack 5278MiB |
| 0 N/A N/A 1242 G /usr/lib/xorg/Xorg 18MiB |
| 0 N/A N/A 1886 G /usr/bin/gnome-shell 50MiB |
| 0 N/A N/A 4843 G /usr/lib/firefox/firefox 3MiB |
| 0 N/A N/A 5879 G /usr/lib/firefox/firefox 3MiB |
| 0 N/A N/A 9708 G /usr/lib/xorg/Xorg 293MiB |
| 0 N/A N/A 9876 G /usr/bin/gnome-shell 143MiB |
| 0 N/A N/A 12591 G /usr/lib/firefox/firefox 3MiB |
| 0 N/A N/A 13952 G /usr/lib/firefox/firefox 3MiB |
| 0 N/A N/A 14848 G ...s/QtCreator/bin/qtcreator 3MiB |
| 0 N/A N/A 15696 G ...AAAAAAAAA= --shared-files 21MiB |
+-----------------------------------------------------------------------------+
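As a side note, if growing memory on demand were still not enough on a ~6 GB card, the GPUOptions proto also exposes a cap on the fraction of device memory TensorFlow may allocate. A rough sketch of combining the two settings (the 0.7 fraction is only an example value; modelName and _model are the same variables as in the snippet above):
// Rough sketch: enable growth and additionally cap TensorFlow at ~70% of GPU memory.
// The 0.7 fraction is an arbitrary example value, not something measured.
tensorflow::SessionOptions sessionOpt;
sessionOpt.config.mutable_gpu_options()->set_allow_growth(true);
sessionOpt.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(0.7);

tensorflow::Status status = tensorflow::LoadSavedModel(
    sessionOpt,
    tensorflow::RunOptions(),
    modelName.toStdString(),
    {tensorflow::kSavedModelTagServe},
    _model);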

Related

CMake failing to detect a default CUDA architecture

I'm trying to build this repository and am stuck at building some third-party dependencies (note that I'm very new and have basically no knowledge about C++ / CMake).
I'm strictly following the installation guide provided in the repo and am stuck trying to build ngp with the following command:
cmake ./thirdparty/instant-ngp -B build_ngp
I receive the following error message:
CMake Error at /opt/cmake-3.25.2-linux-x86_64/share/cmake-3.25/Modules/CMakeDetermineCUDACompiler.cmake:603 (message):
Failed to detect a default CUDA architecture.
Compiler output:
Call Stack (most recent call first):
CMakeLists.txt:11 (project)
I'm running Ubuntu 20.04 as the OS and have an NVIDIA 2080 Super installed in the computer.
cmake --version --> 3.25.2
nvidia-smi output:
Tue Feb 14 13:40:27 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:2D:00.0 On | N/A |
| 0% 39C P8 20W / 250W | 449MiB / 8192MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1386 G /usr/lib/xorg/Xorg 53MiB |
| 0 N/A N/A 1971 G /usr/lib/xorg/Xorg 120MiB |
| 0 N/A N/A 2099 G /usr/bin/gnome-shell 36MiB |
| 0 N/A N/A 2542 G /usr/lib/firefox/firefox 226MiB |
+-----------------------------------------------------------------------------+
I would appreciate any kind of help; please let me know if I can provide more useful info.

How to start minikube on a specific network

I want to start a minikube cluster on a specific network/network adapter in VirtualBox, so that I can launch other VMs on the same network, like below:
+-------+ +------+ +----------------+
| | | | | |
| VM2 | | VM1 | | Minikube |
| | | | | Cluster |
| | | | | |
+---+---+ +---+--+ +------------+---+
| | |
| | |
| +------+------------+ |
+--+ | |
| 192.168.10.0/24 +-----+
+-------------------+
But I don't see many networking options in the minikube start CLI.
Is it possible to start minikube like that, or is there any trick to set it up as above?
When it comes to adjusting networking with minikube start, you can use the following option:
--host-only-cidr string The CIDR to be used for the minikube VM (only supported with Virtualbox driver) (default "192.168.99.1/24")
As you can see in the table here, by default the NAT option doesn't give you access to the Minikube VM from either the host or other guests (VMs), but you can additionally set up port forwarding, which is well described in this article.
Although, as mentioned, minikube start doesn't support many options for modifying the networking of your default VM, you can easily modify it by adding an additional bridged adapter once the Minikube VM is created, using the VirtualBox GUI or the vboxmanage command-line tool, as some users suggest here and here.
I have checked again; the minikube cluster is attached to 2 networks:
NAT
Host-Only Network (vboxnet1)
Since it is already connected to an adapter, I can attach a VM to the existing adapter and use it like below:
+--------+ +---------------------+
| | | Minikube |
| | | |
| VM | | eth1 eth0 |
| | | + + |
| | +---------------------+
+---+----+ | |
| | |
| | |
| +------------v------+ |
| | | v
+------->+ vboxnet1 | NAT
| 192.168.99.0/24 |
| |
+-------------------+
Any other suggestions are welcome.

Prevent TensorFlow from creating devices on multiple GPUs (C++)

I'm using a dynamically linked TensorFlow library to run a neural network from C++ code.
2018-06-07 19:03:10.578031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9168 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1)
2018-06-07 19:03:10.615271: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 9822 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:17:00.0 Off | N/A |
| 0% 54C P2 106W / 250W | 10734MiB / 11172MiB | 33% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:65:00.0 On | N/A |
| 1% 54C P2 65W / 250W | 10655MiB / 11171MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 22230 C ./demo 10723MiB |
| 1 1264 G /usr/lib/xorg/Xorg 309MiB |
| 1 3230 G compiz 235MiB |
| 1 22230 C ./demo 10095MiB |
| 1 25625 G unity-control-center 3MiB |
+-----------------------------------------------------------------------------+
The code
tf::Session* session_ptr;
auto options = tf::SessionOptions();
options.config.mutable_gpu_options()->set_visible_device_list("0");
auto status = NewSession(tf::SessionOptions(), &session_ptr);
session.reset(tensorflow::NewSession(options));
doesn't seem to prevent TensorFlow from setting up devices on all of the GPUs.
I know you can use the CUDA_VISIBLE_DEVICES environment variable, but I need to do this at runtime. I also need to run several instances of this program, probably on the same GPU (probably 4 instances).
Is there any way to do this?
I also tried using
tf::GraphDef graph_def;
auto status1 = ReadBinaryProto(tf::Env::Default(), tf_model, &graph_def);
tensorflow::graph::SetDefaultDevice("0",&graph_def);
which also allocates memory on both GPUs...
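For reference, a rough sketch of what "doing it at runtime" could look like — assuming a POSIX setenv and that nothing in this process has initialized CUDA yet; note it also hands the configured options object to NewSession rather than a default-constructed one:
#include <cstdlib>                              // setenv (POSIX)
#include "tensorflow/core/public/session.h"

int main()
{
    // Must run before any TensorFlow/CUDA call in this process; once CUDA
    // has enumerated the GPUs, changing the variable has no effect.
    setenv("CUDA_VISIBLE_DEVICES", "0", /*overwrite=*/1);

    // Pass the same options object that carries the GPU settings to NewSession.
    tensorflow::SessionOptions options;
    options.config.mutable_gpu_options()->set_visible_device_list("0");
    options.config.mutable_gpu_options()->set_allow_growth(true);

    tensorflow::Session* session = nullptr;
    tensorflow::Status status = tensorflow::NewSession(options, &session);
    if (!status.ok())
    {
        // handle the error ...
        return 1;
    }
    // ... load the GraphDef and run inference ...
    delete session;
    return 0;
}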

Apache camel-jetty and activemq conflict in karaf?

I am having an issue in a clean karaf (4.0.3) installing both camel-jetty and activemq (for example, activemq-client or activemq-broker) together. It doesn't matter in which order the features are installed. The second one hangs during install, with no information displayed in the karaf log beyond that the install has begun.
Has anyone seen this before? Is there a workaround? I tried having the activemq-broker in its own instance, but my app that uses both camel-jetty and JMS still needs to have the ActiveMQ connector initialized, and thus I need to load the activemq bundles/features.
Here is the output of both installs performed separately; when performed one after the other, the second always hangs the karaf instance. There don't appear to be any bundles in common.
karaf#root()> feature:install activemq-client
karaf#root()> list
START LEVEL 100 , List Threshold: 50
ID | State | Lvl | Version | Name
------------------------------------------------------------------------
52 | Active | 80 | 5.12.1 | activemq-osgi
53 | Active | 80 | 3.3.0 | Commons Net
54 | Active | 80 | 2.4.2 | Apache Commons Pool
55 | Active | 80 | 1.0.1 | geronimo-j2ee-management_1.1_spec
56 | Active | 80 | 1.1.1 | geronimo-jms_1.1_spec
57 | Active | 80 | 1.1.1 | geronimo-jta_1.1_spec
58 | Active | 80 | 3.4.6 | ZooKeeper Bundle
63 | Active | 80 | 2.2.11.1 | Apache ServiceMix :: Bundles :: jaxb-impl
70 | Active | 80 | 3.18.0 | Apache XBean :: Spring
71 | Active | 80 | 0.6.4 | JAXB2 Basics - Runtime
(Performed clean karaf launch in between installs to get bundle listings)
karaf#root()> feature:install camel-jetty
karaf#root()> list
START LEVEL 100 , List Threshold: 50
ID | State | Lvl | Version | Name
--------------------------------------------------------------------------------
55 | Active | 80 | 2.12.2 | camel-core
56 | Active | 80 | 2.12.2 | camel-http
57 | Active | 80 | 2.12.2 | camel-jetty
58 | Active | 80 | 2.12.2 | camel-karaf-commands
59 | Active | 80 | 1.8.0 | Commons Codec
63 | Active | 80 | 3.1.0.7 | Apache ServiceMix :: Bundles :: commons-httpclient

Doctrine 2 memory hogging

I'm using Doctrine 2 with my ZF2 project, but I'm getting a weird problem with my server's CPU and memory, and my server simply crashes.
I'm getting a lot of queries in the Sleep state, and they don't seem to get cleaned up.
mysql> show processlist;
+---------+--------------+-----------+------------------+----------------+------+--------------------+------------------------------------------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+---------+--------------+-----------+------------------+----------------+------+--------------------+------------------------------------------------------------------------------------------------------+
| 2832346 | leechprotect | localhost | leechprotect | Sleep | 197 | | NULL |
| 2832629 | db_user | localhost | db_exemple | Sleep | 3 | | NULL |
| 2832643 | db_user | localhost | db_exemple | Sleep | 3 | | NULL |
| 2832646 | db_user | localhost | db_exemple | Sleep | 3 | | NULL |
| 2832664 | db_user | localhost | db_exemple | Sleep | 154 | | NULL |
| 2832666 | db_user | localhost | db_exemple | Sleep | 153 | | NULL |
| 2832669 | db_user | localhost | db_exemple | Sleep | 152 | | NULL |
| 2832674 | db_user | localhost | db_exemple | Sleep | 7 | | NULL |
| 2832681 | db_user | localhost | db_exemple | Sleep | 1 | | NULL |
| 2832683 | db_user | localhost | db_exemple | Sleep | 4 | | NULL |
| 2832690 | db_user | localhost | db_exemple | Sleep | 149 | | NULL |
(.......)
Also, it seems the PHP GC is not cleaning all the objects from memory, or even killing processes.
Is there a way to disable the cache system? Would that improve the use of my resources?
Most of my queries are similar to:
$query = $this->createQueryBuilder('i');
$query->innerJoin('\Application\Relation', 'r', 'WITH', 'r.child = i.id');
$query->innerJoin('\Application\Taxonomy', 't', 'WITH', 't.id = r.taxonomy');
$query->where('t.type = :type')->setParameter('type', $relation);
$query->groupBy('i.id');
$items = $query->getQuery()->getResult(2); // 2 = Query::HYDRATE_ARRAY
Thanks in advance.
First, check MySQL's wait_timeout variable. From the documentation:
Wait_timeout : The number of seconds the server waits for activity on
a noninteractive connection before closing it.
In the normal flow (when not using persistent connections), PHP closes the connection automatically after script execution. To ensure there are no sleeping threads, simply close the connection at the end of your script:
$entityManager->getConnection()->close();
If these queries are running in a big while/for loop, you might want to read the Doctrine 2 batch processing documentation.