Significant RAM cost while running TensorFlow in C++ - c++

I'm currently writing some inference code using a trained TensorFlow graph on GPU machine with C++ APIs.
Here are my settings:
Platform: CentOS 7
TensorFlow Version: TensorFlow 1.5
CUDA Version: CUDA 9.0
C++ Version: C++11
There are a couple of questions that I'm struggling with.
1) First, I followed this tutorial to learn the basic template to load a graph in C++. The example in this tutorial is quite simple, but the program takes almost 0.9G in RAM when I run it (on GPU machine).
2) My graph is way more complicated than the one in that tutorial. There are approximately 20 layers and the numbers of nodes in layers vary from 300 to 5000.
My (pseudo) code snippet is here. For simplicity, I only keep the code that causes (potential) memory issue:
tensorflow::Tensor input = getDataFromSomewhere(...);
int length = size of the input;
int g_batch_size = 50;
// 1) Create session...
// 2) Load graph...
// 3) Inference
for (int x = 0; x < length; x += g_batch_size) {
tensorflow::Tensor out;
auto cur_slice = input.Slice(x, std::min(x + g_batch_size, length));
inference(cur_slice, out);
// doSomethingWithOutput(out);
}
// 4) Close session and free session memory
// Inference helper function
tensorflow::Status inference(tensorflow::Tensor& input_tensors, tensorflow::Tensor& out) {
// This line increases a lot more memory usage
TensorDict feed_dict = {{"IteratorGetNext:0", input_tensors}};
std::vector<tensorflow::Tensor> outputs;
tensorflow::Status status = session->Run(feed_dict, {"final_dense:0"}, {}, &outputs);
// UpdateOutWithOutputs();
return tensorflow::Status::OK();
}
After I created session and loaded graph, the memory cost is around 1.2G.
Then, as I noted in my code, when the program reached session->Run(...), the memory usage went up to more than 2G.
I'm not sure if this is a normal behavior of TensorFlow. I've checked this and this thread, but I don't quite know if I created redundant ops in my code.
Any comments or suggestions are appreciated! Thanks in advance!

The issue that I found was that Tensorflow dynamic libraries would take about 200MB and CUDA dynamic libraries would take more than 500MB memory. So loading those libraries already takes a great amount of memory.

Related

is libtorch c++ auto jit any computation?

These days I built libtorch 1.12 from source under cuda 11.6.2 docker image with intel-MKL v2020.0.166 in the official repository (using tbb as parallel framework instead of default openmp enter link description here. I also tried several times to build with pip installed MKL 2022.1/2021.4 and oneTBB 2021.6/2021.5, but always have some build or runtime problems). It built successfully and run libtorch example enter link description here fine.
Then I also faced the speed problem: it runs some test code enter link description here slower than pytorch about 5-40x times. In pytorch following code cost ~0.001-0.002s, but in built libtorch with mkl it cost average ~0.02-0.03s, sometime > 0.04s.
CPU:AMD5900x, 32GB memory, GPU: Nvidia 3070 super.
libtorch:
#include<torch/torch.h>
#include<iostream>
#include <chrono>
int main(){
torch::Tensor tensor = torch::randn({2708, 1433});
torch::Tensor weight = torch::randn({1433, 16});
auto start = std::chrono::high_resolution_clock::now();
tensor.mm(weight);
auto end = std::chrono::high_resolution_clock::now();
std::cout<< "C++ Operation Time(s) " << std::chrono::duration<double>(end - start).count() << "s" << std::endl;
return 0;
}
pytorch:
import torch
import torch.nn as nn
import torch.nn.functional as F
tensor = torch.randn(2708, 1433)
weight = torch.randn(1433, 16)
t0 = time.time()
tensor.mm(weight)
t1 = time.time()
print("Python Operation Time(s) {:.4f}".format(t1 - t0))
By google, someone noted to use at::set_num_threads() to adjust parallel threads number in each process enter link description here. no effect.
someone said change debug mode to release mode and use -O3 optimization flag. no effect.
Finally I saw someone metioned libtorch need warmup step enter link description here. So I tried to add for (int i=0; i<100; i++) in the above code to test. The result shows that first run cost ~0.02-0.03s, then following 99 loops keep cost same time ~0.001-0.002s, just reach pytorch speed.
So the questions are:
it appears that libtorch c++ use auto jit mechanism to create a fused graph even in normal computation process, like arrayfile. is that so? if not, how to explain the warmup phenomenon? Above test code has just normal operations, and no torchscrpit to call torch::jit load.
why pytorch is always fast and no warmup steps? and how? is there some build options to change libtorch's inner computation behavior? after all, pytorch is also based on c++ low level.

How could just loading a dll lead to 100 CPU load in my main application?

I have a perfectly working program which connects to a video camera (an IDS uEye camera) and continuously grabs frames from it and displays them.
However, when loading a specific dll before connecting to the camera, the program runs with 100% CPU load. If I load the dll after connecting to the camera, the program runs fine.
int main()
{
INT nRet = IS_NO_SUCCESS;
// init camera (open next available camera)
m_hCam = (HIDS)0;
// (A) Uncomment this for 100% CPU load:
// HMODULE handle = LoadLibrary(L"myInnocentDll.dll");
// This is the call to the 3rdparty camera vendor's library:
nRet = is_InitCamera(&m_hCam, 0);
// (B) Uncomment this instead of (A) and the CPU load won't change
// HMODULE handle = LoadLibrary(L"myInnocentDll.dll");
if (nRet == IS_SUCCESS)
{
/*
* Please note: I have removed all lines which are not necessary for the exploit.
* Therefore this is NOT a full example of how to properly initialize an IDS camera!
*/
is_GetSensorInfo(m_hCam, &m_sInfo);
GetMaxImageSize(m_hCam, &m_s32ImageWidth, &m_s32ImageHeight);
m_nColorMode = IS_CM_BGR8_PACKED;// IS_CM_BGRA8_PACKED;
m_nBitsPerPixel = 24; // 32;
nRet |= is_SetColorMode(m_hCam, m_nColorMode);
// allocate image memory.
if (is_AllocImageMem(m_hCam, m_s32ImageWidth, m_s32ImageHeight, m_nBitsPerPixel, &m_pcImageMemory, &m_lMemoryId) != IS_SUCCESS)
{
return 1;
}
else
{
is_SetImageMem(m_hCam, m_pcImageMemory, m_lMemoryId);
}
}
else
{
return 1;
}
std::thread([&]() {
while (true) {
is_FreezeVideo(m_hCam, IS_WAIT);
/*
* Usually, the image memory would now be grabbed via is_GetImageMem().
* but as it is not needed for the exploit, I removed it as well
*/
}
}).detach();
cv::waitKey(0);
}
Independently of the actually used camera driver, in what way could loading a dll change the performance of it, occupying 100% of all available CPU cores? When using the Visual Studio Diagnostic Tools, the excess CPU time is attributed to "[External Call] SwitchToThread" and not to the myInnocentDll.
Loading just the dll without the camera initialization does not result in 100% CPU load.
I was first thinking of some static initializers in the myInnocentDll.dll configuring some threading behavior, but I did not find anything pointing in this direction. For which aspects should I look for in the code of myInnocentDll.dll?
After a lot of digging I found the answer and it is both frustratingly simple and frustrating by itself:
It is Microsoft's poor support of OpenMP. When I disabled OpenMP in my project, the camera driver runs just fine.
The reason seems to be that the Microsoft compiler uses OpenMP with busy waiting and there is also the possibility to manually configure OMP_WAIT_POLICY, but as I was not depending on OpenMP anyways, disabling was the easiest solution for me.
https://developercommunity.visualstudio.com/content/problem/589564/how-to-control-omp-wait-policy-for-openmp.html
https://support.microsoft.com/en-us/help/2689322/redistributable-package-fix-high-cpu-usage-when-you-run-a-visual-c-201
I still don't understand why the CPU only went up high when using the camera and not when running the rest of my solution, even though the camera library is pre-built and my disabling/enabling of OpenMP compilation cannot have any effect on it. And I also don't understand why they bothered to make a hotfix for VS2010 but have no real fix as of VS2019, which I am using. But the problem is averted.
You can disable CPU idle state in the IDS camera manager and then the minimum CPU load in the windows energy plans is set to 100%
I think this is worth mentioning here, even you solved your problem already.

Why does this cuda program freeze my display?

I am new to CUDA and I was just trying to make a program which utilized a lot of my GPU. The only issue was that I am also using the card for my display and this froze my screen and required me to reboot.
__global__ void cuda_burn(int* sum)
{
int x = 0;
for(int i = 0; i < 1000000000; i++)
{
x += i;
}
atomicAdd(sum, x);
}
I originally launched it like cuda_burn<<<1024, 1024>>>(sum_d); which killed my display. This makes sense to me because I have enough blocks and threads to fully utilize my gpu which leaves no time for the graphics.
Next I tried to launch it like this cuda_burn<<<1, 1024>>>(sum_d); I thought that since I was only using one block that it would not be able to fully utilize the GPU resources and not freeze my display. Unfortunately it still did. Why?
What is also strange is the the mouse does not freeze?
Also is there a better way of unfreezing the display than rebooting?
Currently, CUDA and display tasks cannot run at the same time. While a CUDA kernel is running, regardless of how much or how little it uses the GPU, display tasks will be "frozen".

Getting OpenCV Error: Insufficient memory while running OpenCV Sample Program: "stitching_detailed.cpp"

I recently starting working with OpenCV with the intent of stitching large amounts of images together to create massive panoramas. To begin my experimentation, I looked into the sample programs that come with the OpenCV files to get an idea about how to implement the OpenCV libraries. Since I was interested in image stitching, I went straight for the "stitching_detailed.cpp." The code can be found at:
https://code.ros.org/trac/opencv/browser/trunk/opencv/samples/cpp/stitching_detailed.cpp?rev=6856
Now, this program does most of what I need it to do, but I ran into something interesting. I found that for 9 out of 15 of the optional projection warpers, I receive the following error when I try to run the program:
Insufficient memory (Failed to allocate XXXXXXXXXX bytes) in unknown function,
file C:\slave\winInstallerMegaPack\src\opencv\modules\core\src\alloc.cpp,
line 52
where the "X's" mark integer that change between the different types of projection (as though different methods require different amounts of space). The full source code for "alloc.cpp" can be found at the following website:
https://code.ros.org/trac/opencv/browser/trunk/opencv/modules/core/src/alloc.cpp?rev=3060
However, the line of code that emits this error in alloc.cpp is:
static void* OutOfMemoryError(size_t size)
{
--HERE--> CV_Error_(CV_StsNoMem, ("Failed to allocate %lu bytes", (unsigned long)size));
return 0;
}
So, I am simply lost as to the possible reasons that this error may be occurring. I realize that this error would normally occur if the system is out of memory, but I when running this program with my test images I am never using more that ~3.5GB of RAM, according to my Task Manager.
Also, since the program was written as an sample of the OpenCV stitching capabilities BY OpenCV developers I find it hard to believe that there is a drastic memory error present within the source code.
Finally, the program works fine if I use some of the warping methods:
- spherical
- fisheye
- transverseMercator
- compressedPlanePortraitA2B1
- paniniPortraitA2B1
- paniniPortraitA1.5B1)
but as ask the program to use any of the others (through the command line tag
--warp [PROJECTION_NAME]):
- plane
- cylindrical
- stereographic
- compressedPlaneA2B1
- mercator
- compressedPlaneA1.5B1
- compressedPlanePortraitA1.5B1
- paniniA2B1
- paniniA1.5B1
I get the error mentioned above. I get pretty good results from the transverseMercator project warper, but I would like to test the stereographic in particular. Can anyone help me figure this out?
The pictures that I am trying to process are 1360 x 1024 in resolution and my computer has the following stats:
Model: HP Z800 Workstation
Operating System: Windows 7 enterprise 64-bit OPS
Processor: Intel Xeon 2.40GHz (12 cores)
Memory: 14GB RAM
Hard Drive: 1TB Hitachi
Video Card: ATI FirePro V4800
Any help would be greatly appreciated, thanks!
When I run OpenCV's APP traincascade, i get just the same error as you:
Insufficient memory (Failed to allocate XXXXXXXXXX bytes) in unknown function,
file C:\slave\winInstallerMegaPack\src\opencv\modules\core\src\alloc.cpp,
line 52
at the time, only about 70% pecent of my RAM(6G) was occupied. And when runnig trainscascade step by step, I found that the error would be thrown.when it use about more than 1.5G RAM space.
then, I found the are two arguments which can control how many memory should be used:
-precalcValBufSize
-precalcIdxBufSize
so i tried to set these two to 128, it run. I hope my experience can help you.
I thought this problem is nothing about memory leak, it is just relate to how many memory the OS limits a application occupy. I expect someone can check my guess.
I've recently had a similar issue with OpenCV image stitching. I've used create method to create stitcher instance and provided 5 images in vertical order to stitch method, but I've received insufficient memory error.
Panorama was successfully created after setting:
setWaveCorrection(false)
This solution will not be applicable if you need wave correction.
This may be related to the sequence of the stitching, I split a big picture into 3*3, and firstly I stitch them row by row and there is no problem, when I stitch them column by column, there is the problem same as you.

Linux & Windows: Using Large Files to save physical memory

Good afternoon, We have implemented a C++ cKeyArray class to test whether we can use the Large File API to save physical memory. During Centos Linux testing, we found that the Linux File API was just as fast as using the heap for random access processing. Here are the numbers: for a 2,700,000 row SQL database where the KeySize for each row is 62 bytes,
cKeyArray class using LINUX File API
BruteForceComparisons = 197275 BruteForceTimeElapsed = 1,763,504,445 microsecs
Each BruteForce Comparisons requires two random access, there the mean time required for each random access = 1,763,504,445 microsecs / (2 * 197275) = 4470 microsecs
Heap , no cKeyArray class
BruteForceComparisons = 197275 BruteForceTimeElapsed = 1,708,442,690microsecs
the mean time required for each random access = 4300 microsecs.
On 32 bit Windows,the numbers are,
cKeyArray class using Windows File API
BruteForceComparisons = 197275 BruteForceTimeElapsed = 9243787 millisecs
the mean time for each random access is 23.4 millisec
Heap, no cKeyArray class
BruteForceComparisons = 197275 BruteForceTimeElapsed = 2,141,941 millisecs
the mean time requires for each random access is 5.4 millisec
We are wondering why the Linux cKeyArray numbers are just as good the Linux heap numbers while on 32 bit Windows the mean heap random access time is 4 times as fast the cKeyArray Windows File API. Is there some way we can speed up the Windows cKeyArray File API?
Previouly, we received a lot of good suggestions from Stack Overflow on using the Windows Memory Mapped File API. Based on these Stack Overflow suggestions we have implemented a Memory Mapped File MRU caching class which functions properly.
Because we want to devlop a cross-platform solution, we want to do due diligence to see why the Linux File API is so fast? Thank you. We are trying to post a portion of the cKeyArray class implementation below.
#define KEYARRAY_THRESHOLD 100000000
// Use file instead of memory if requirement is above this number
cKeyArray::cKeyArray(long RecCount_,int KeySize_,int MatchCodeSize_, char* TmpFileName_) {
RecCount=RecCount_;
KeySize=KeySize_;
MatchCodeSize=MatchCodeSize_;
MemBuffer=0;
KeyBuffer=0;
MemFile=0;
MemFileName[0]='\x0';
ReturnBuffer=new char[MatchCodeSize + 1];
if (RecCount*KeySize<=KEYARRAY_THRESHOLD) {
InMemory=true;
MemBuffer=new char[RecCount*KeySize];
memset(MemBuffer,0,RecCount*KeySize);
} else {
InMemory=false;
strcpy(MemFileName,TmpFileName_);
try {
MemFile=
new cFile(MemFileName,cFile::CreateAlways,cFile::ReadWrite);
}
catch (cException e)
{
throw e;
}
try {
MemFile->SetFilePointer(
(int64_t)(RecCount*KeySize),cFile::FileBegin);
}
catch (cException e)
{
throw e;
}
if (!(MemFile->SetEndOfFile()))
throw cException(ERR_FILEOPEN,MemFileName);
KeyBuffer=new char[KeySize];
}
}
char *cKeyArray::GetKey(long Record_) {
memset(ReturnBuffer,0,MatchCodeSize + 1);
if (InMemory) {
memcpy(ReturnBuffer,MemBuffer+Record_*KeySize,MatchCodeSize);
} else {
MemFile->SetFilePointer((int64_t)(Record_*KeySize),cFile::FileBegin);
MemFile->ReadFile(KeyBuffer,KeySize);
memcpy(ReturnBuffer,KeyBuffer,MatchCodeSize);
}
return ReturnBuffer;
}
uint32_t cKeyArray::GetDupeGroup(long Record_) {
uint32_t DupeGroup(0);
if (InMemory) {
memcpy((char*)&DupeGroup,
MemBuffer+Record_*KeySize + MatchCodeSize,sizeof(uint32_t));
} else {
MemFile->SetFilePointer(
(int64_t)(Record_*KeySize + MatchCodeSize) ,cFile::FileBegin);
MemFile->ReadFile((char*)&DupeGroup,sizeof(uint32_t));
}
return DupeGroup;
}
On Linux, the OS aggressively caches file data in main memory -- so although you haven't explicitly allocated memory for the file contents, they are nevertheless stored in RAM. Here's a decent link with some more information about the page cache -- only one thing is missing from that description, which is that most Linux filesystems actually implement the standard I/O interfaces as thin wrappers around the page cache. That means that even though you haven't explicitly memory mapped the file, the system is still treating it as though it were memory mapped under-the-covers. That's why you see roughly equivalent performance with either approach.
I second the suggestion to factor the platform-specific stuff out, and use whichever appropach is fastest for each platform. Be sure to benchmark -- don't ever make assumptions about performance.
Your memory-mapped solution should be as much as 10x faster than the file solution even in Linux. That is the speed I experience in my test cases.
Each file access system call takes hundreds of CPU cycles to complete. Time which your program could be using to do real work.
One explanation for why the speeds are similar could be that your memory map has not been used before. When a memory mapped page is accessed for the first time it must be assigned to a physical page of RAM and zeroed out or if it is a disk file it must be loaded from disk into RAM. All of that takes a considerable amount of time.
If you touch (read or write a value) each 4K of RAM before using it you should see a significant speed increase in the memory map.