I recently built libtorch 1.12 from source in a CUDA 11.6.2 Docker image with Intel MKL v2020.0.166 from the official repository, using TBB as the parallel framework instead of the default OpenMP. (I also tried several times to build against pip-installed MKL 2022.1/2021.4 and oneTBB 2021.6/2021.5, but always ran into build or runtime problems.) The build succeeded and runs the libtorch example fine.
Then I hit a speed problem: libtorch runs some test code about 5-40x slower than PyTorch. In PyTorch the code below takes ~0.001-0.002 s, but the libtorch build with MKL averages ~0.02-0.03 s, sometimes > 0.04 s.
CPU: AMD 5900X, 32 GB memory, GPU: Nvidia 3070 Super.
libtorch:
#include <torch/torch.h>
#include <iostream>
#include <chrono>

int main() {
    torch::Tensor tensor = torch::randn({2708, 1433});
    torch::Tensor weight = torch::randn({1433, 16});
    auto start = std::chrono::high_resolution_clock::now();
    tensor.mm(weight);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "C++ Operation Time(s) " << std::chrono::duration<double>(end - start).count() << "s" << std::endl;
    return 0;
}
pytorch:
import torch
import torch.nn as nn
import torch.nn.functional as F
import time

tensor = torch.randn(2708, 1433)
weight = torch.randn(1433, 16)
t0 = time.time()
tensor.mm(weight)
t1 = time.time()
print("Python Operation Time(s) {:.4f}".format(t1 - t0))
Searching around, I found someone suggesting at::set_num_threads() to adjust the number of parallel threads per process. No effect.
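For reference, this is roughly how that suggestion is applied; a minimal sketch (the thread count 8 is just an illustrative value):

#include <torch/torch.h>
#include <ATen/Parallel.h>   // at::set_num_threads / at::get_num_threads
#include <iostream>

int main() {
    at::set_num_threads(8);  // illustrative value; typically the number of physical cores
    std::cout << "intra-op threads: " << at::get_num_threads() << std::endl;
    // ... run the mm() benchmark from above ...
    return 0;
}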
Someone else said to switch from a Debug to a Release build and add the -O3 optimization flag. No effect.
Finally I saw someone mention that libtorch needs a warm-up step. So I wrapped the timed operation in a for (int i = 0; i < 100; i++) loop to test (see the sketch below). The first iteration costs ~0.02-0.03 s, and the following 99 iterations each cost ~0.001-0.002 s, matching PyTorch's speed.
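For reference, a minimal sketch of the warm-up test described above; it is the same benchmark as before, just with the timed matrix multiply wrapped in a loop:

#include <torch/torch.h>
#include <iostream>
#include <chrono>

int main() {
    torch::Tensor tensor = torch::randn({2708, 1433});
    torch::Tensor weight = torch::randn({1433, 16});
    for (int i = 0; i < 100; i++) {
        auto start = std::chrono::high_resolution_clock::now();
        tensor.mm(weight);
        auto end = std::chrono::high_resolution_clock::now();
        std::cout << "iter " << i << ": "
                  << std::chrono::duration<double>(end - start).count() << "s" << std::endl;
    }
    return 0;
}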
So the questions are:
It appears that libtorch C++ uses some automatic JIT mechanism to create a fused graph even for ordinary eager computation, like ArrayFire does. Is that so? If not, how can the warm-up phenomenon be explained? The test code above contains only ordinary operations and no TorchScript call to torch::jit::load.
Why is PyTorch always fast with no warm-up step, and how? Are there build options that change libtorch's internal computation behavior? After all, PyTorch is built on the same low-level C++.
Related
I have trained three different neural network models using PyTorch and CUDA. I ported the trained models to C++ using libtorch, and everything works as it should. When I run inference on the three models using the GPU, I get different running times: the first model runs in 60 ms, the second in 40 ms and the third in 10 ms. I tried to parallelize the inference for all three networks using pthreads and OpenMP, expecting the parallel version to run in roughly the time of the largest model. However, after parallelization I only get an improvement of at most 5 ms compared with running all model inferences sequentially. I have enough space on the GPU to run all models in parallel and my computer has multiple cores. Can the inference of the three models be run in parallel in order to obtain a better running time?
Below is a code snippet with a parallelization I made using OpenMP. Am I doing something wrong when using OpenMP? I get the same result when I run with pthreads. It is worth mentioning that the functions test_model1 and test_model2 also unpack the tensors to obtain the results, so it is not just the forward pass through the network.
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            test_model1(img_1);
        }
        #pragma omp section
        {
            test_model2(img_1);
        }
        #pragma omp section
        {
            cvtColor(img_1, img_1, COLOR_BGR2RGB);
            torch::Tensor imgTensor = ToTensor(img_1);
            inputs.push_back(imgTensor);
            torch::Tensor output = model3.forward(inputs).toTensor();
            result = ToCVImage(output.argmax(1));
        }
    }
}
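One approach that is often suggested for this situation is to give each model its own CUDA stream, so kernels submitted from different host threads can overlap on the GPU. A minimal sketch, assuming TorchScript modules (model1, model2 and the tensor names are placeholders, and gains are limited if a single model already saturates the GPU):

#include <torch/torch.h>
#include <torch/script.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>
#include <thread>

// Run one model's forward pass on its own CUDA stream.
void run_on_own_stream(torch::jit::script::Module& model, torch::Tensor input) {
    c10::cuda::CUDAStream stream = c10::cuda::getStreamFromPool();
    c10::cuda::CUDAStreamGuard guard(stream);   // work below is submitted to this stream
    torch::NoGradGuard no_grad;
    auto out = model.forward({input}).toTensor();
    stream.synchronize();                       // wait for this model's kernels to finish
    // ... unpack / post-process `out` here ...
}

// Usage sketch:
// std::thread t1(run_on_own_stream, std::ref(model1), img_tensor1);
// std::thread t2(run_on_own_stream, std::ref(model2), img_tensor2);
// t1.join();
// t2.join();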
I'm currently writing some inference code using a trained TensorFlow graph on GPU machine with C++ APIs.
Here are my settings:
Platform: CentOS 7
TensorFlow Version: TensorFlow 1.5
CUDA Version: CUDA 9.0
C++ Version: C++11
There are a couple of questions that I'm struggling with.
1) First, I followed this tutorial to learn the basic template for loading a graph in C++. The example in this tutorial is quite simple, but the program takes almost 0.9 GB of RAM when I run it (on a GPU machine).
2) My graph is way more complicated than the one in that tutorial. There are approximately 20 layers and the numbers of nodes in layers vary from 300 to 5000.
My (pseudo) code snippet is here. For simplicity, I only keep the code that causes (potential) memory issue:
tensorflow::Tensor input = getDataFromSomewhere(...);
int length = input.dim_size(0);   // number of samples along the batch dimension
int g_batch_size = 50;

// 1) Create session...
// 2) Load graph...

// 3) Inference
for (int x = 0; x < length; x += g_batch_size) {
    tensorflow::Tensor out;
    auto cur_slice = input.Slice(x, std::min(x + g_batch_size, length));
    inference(cur_slice, out);
    // doSomethingWithOutput(out);
}

// 4) Close session and free session memory

// Inference helper function
// (TensorDict is an alias for std::vector<std::pair<std::string, tensorflow::Tensor>>)
tensorflow::Status inference(tensorflow::Tensor& input_tensors, tensorflow::Tensor& out) {
    // This line increases memory usage a lot
    TensorDict feed_dict = {{"IteratorGetNext:0", input_tensors}};
    std::vector<tensorflow::Tensor> outputs;
    tensorflow::Status status = session->Run(feed_dict, {"final_dense:0"}, {}, &outputs);
    // UpdateOutWithOutputs();
    return tensorflow::Status::OK();
}
After I created the session and loaded the graph, the memory cost is around 1.2 GB.
Then, as I noted in my code, when the program reached session->Run(...), the memory usage went up to more than 2 GB.
I'm not sure if this is a normal behavior of TensorFlow. I've checked this and this thread, but I don't quite know if I created redundant ops in my code.
Any comments or suggestions are appreciated! Thanks in advance!
The issue I found was that the TensorFlow dynamic libraries take about 200 MB and the CUDA dynamic libraries take more than 500 MB of memory, so just loading those libraries already accounts for a large amount of the usage.
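If part of the jump at session->Run() turns out to be the GPU memory arena rather than host libraries, TF 1.x lets you configure the allocator through the session options; a minimal sketch using the standard ConfigProto fields (values are illustrative):

#include "tensorflow/core/public/session.h"
#include "tensorflow/core/public/session_options.h"

tensorflow::Session* createSessionWithLimitedGpuMemory() {
    tensorflow::SessionOptions options;
    // Grow GPU memory on demand instead of reserving most of it up front.
    options.config.mutable_gpu_options()->set_allow_growth(true);
    // Alternatively, cap the fraction of GPU memory TF may use (illustrative value):
    // options.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(0.4);

    tensorflow::Session* session = nullptr;
    TF_CHECK_OK(tensorflow::NewSession(options, &session));
    return session;
}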
I'm building my own embedded Linux OS for the Raspberry Pi 3 using Buildroot. This OS will be used to run several applications, one of which performs object detection based on OpenCV (v3.3.0).
I started with Raspbian Jessie + Python, but it turned out that even a simple example takes a lot of time to execute, so I decided to design my own RTOS with optimized features + C++ development instead of Python.
I thought that with these optimizations, the 4 cores of the RPi + the 1 GB of RAM would handle such applications. The problem is that even so, the simplest computer vision programs take a lot of time.
PC vs. Raspberry Pi 3 comparison
This is a simple program I wrote to have an idea of the order of magnitude of execution time of each part of the program.
#include <stdio.h>
#include "opencv2/core.hpp"
#include "opencv2/imgproc.hpp"
#include "opencv2/highgui.hpp"
#include <time.h>       /* clock_t, clock, CLOCKS_PER_SEC */

using namespace cv;
using namespace std;

int main()
{
    setUseOptimized(true);
    clock_t t_access, t_proc, t_save, t_total;

    // Access time.
    t_access = clock();
    Mat img0 = imread("img0.jpg", IMREAD_COLOR);   // takes ~90ms
    t_access = clock() - t_access;

    // Processing time
    t_proc = clock();
    cvtColor(img0, img0, COLOR_BGR2GRAY);
    blur(img0, img0, Size(9, 9));                  // takes ~18ms
    t_proc = clock() - t_proc;

    // Saving time
    t_save = clock();
    imwrite("img1.jpg", img0);
    t_save = clock() - t_save;

    t_total = t_access + t_proc + t_save;
    //printf("CLOCKS_PER_SEC = %ld\n\n", (long)CLOCKS_PER_SEC);
    printf("(TEST 0) Total execution time\t %ld cycles \t= %f ms!\n", (long)t_total, ((float)t_total) * 1000. / CLOCKS_PER_SEC);
    printf("---->> Accessing in\t %ld cycles \t= %f ms.\n", (long)t_access, ((float)t_access) * 1000. / CLOCKS_PER_SEC);
    printf("---->> Processing in\t %ld cycles \t= %f ms.\n", (long)t_proc, ((float)t_proc) * 1000. / CLOCKS_PER_SEC);
    printf("---->> Saving in\t %ld cycles \t= %f ms.\n", (long)t_save, ((float)t_save) * 1000. / CLOCKS_PER_SEC);
    return 0;
}
Results of Execution on an i7 PC
Results of Execution on Raspberry PI (Generated OS from Buildroot)
As you can see there is a huge difference. I need to optimize every single detail so that this processing step runs in "near" real time, within a maximum of 15 ms instead of 44 ms. So these are my questions:
How can I optimize my OS so that it can handle computationally intensive applications, and how can I control the priorities of each part?
How can I fully use the 4 cores of the RPi 3 to fulfill the requirements?
Are there any other possibilities besides OpenCV?
Should I use C instead of C++?
Any hardware improvements you recommend?
As I understand it, you want to get about 30-40 fps. Your i7 is fast and has a ton of acceleration techniques enabled by default by Intel. The Raspberry Pi, as much as we love it, is slow, especially for image-processing programs.
How can I optimize my OS so that it can handle computationally intensive applications, and how can I control the priorities of each part?
You should include an acceleration library for ARM and recompile OpenCV with those features enabled.
How can I fully use the 4 cores of the RPi 3 to fulfill the requirements?
Parallelize your code so it can run on all 4 cores, for example with cv::parallel_for_ (a minimal sketch follows).
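A minimal sketch of what that can look like (assuming your OpenCV build has a parallel backend such as pthreads or TBB enabled; InvertBody and the pixel inversion are just illustrative per-pixel work):

#include "opencv2/core.hpp"

// Illustrative per-pixel operation, spread across cores by row ranges.
class InvertBody : public cv::ParallelLoopBody {
public:
    explicit InvertBody(cv::Mat& img) : img_(img) {}
    void operator()(const cv::Range& range) const override {
        for (int r = range.start; r < range.end; ++r) {
            uchar* row = img_.ptr<uchar>(r);
            for (int c = 0; c < img_.cols; ++c)
                row[c] = static_cast<uchar>(255 - row[c]);  // any per-pixel work goes here
        }
    }
private:
    cv::Mat& img_;
};

// Usage: assumes `gray` is a CV_8UC1 image, e.g. the output of cvtColor above.
// cv::parallel_for_(cv::Range(0, gray.rows), InvertBody(gray));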
Are there any other possibilities besides OpenCV?
Ask yourself first which features you actually need from OpenCV.
Should I use C instead of C++?
Changing the language will not help you at all; stick with C++. It is a beautiful language and very fast.
Any hardware improvements you recommend?
How about a board with a supported Mali GPU? Then you could run OpenCV code directly on the GPU, which would boost your speed a lot.
Question summary: all four cores are used when running a single-threaded programme. Why?
Details: I have written a non-parallelised programme in Xcode (C++). I was in the process of parallelising it, and wanted to see whether what I was doing was actually resulting in more cores being used. To that end I used Instruments to look at the core usage. To my surprise, while my application is single threaded, all four cores were being utilised.
To test whether it changed the performance, I dialled down the number of cores available to 1 (you can do it in Instruments, preferences) and the speed wasn't reduced at all. So (as I knew) the programme isn't parallelised in any way.
I can't find any information on what it means to use multiple cores to perform single threaded tasks. Am I reading the Instruments output wrong? Or is the single-threaded process being shunted between different cores for some reason (like changing lanes on a road instead of driving in two lanes at once - i.e. actual parallelisation)?
Thanks for any insight anyone can give on this.
EDIT with MWE (apologies for not doing this initially).
The following is C++ code that finds primes under 500,000, compiled in Xcode.
#include <iostream>
#include <ctime>

int main(int argc, const char * argv[]) {
    clock_t start, end;
    double runTime;
    start = clock();
    int i, num = 1, primes = 0;
    int num_max = 500000;

    while (num <= num_max) {
        i = 2;
        while (i <= num) {
            if (num % i == 0)
                break;
            i++;
        }
        if (i == num) {
            primes++;
            std::cout << "Prime: " << num << std::endl;
        }
        num++;
    }

    end = clock();
    runTime = (end - start) / (double) CLOCKS_PER_SEC;
    std::cout << "This machine calculated all " << primes << " under " << num_max << " in " << runTime << " seconds." << std::endl;
    return 0;
}
This runs in 36 s or thereabouts on my machine, as shown by the final output and my phone's stopwatch. When I profile it (using Instruments launched from within Xcode), it gives a run time of around 28 s. The following image shows the core usage.
Instruments showing core usage on all 4 cores (with hyper-threading)
Now I reduce the number of available cores to 1. Re-running from within the profiler (pressing the record button), it reports a run time of 29 s; a picture is shown below.
Instruments output with only 1 core available
That would accord with my theory that more cores don't improve performance for a single-threaded programme! Unfortunately, when I actually time the programme with my phone, the above run took about 1 minute 30 s, so there is a meaningful performance gain from having all cores switched on.
One thing that is really puzzling me is that if you leave the number of cores at 1, go back to Xcode and run the program, it again says it takes about 33 s, but my phone says it takes 1 minute 50 s. So changing the cores is doing something to the internal clock (perhaps).
Hopefully that describes the problem fully. I'm running on a 2015 15 inch MBP, with 2.2GHz i7 quad core processor. Xcode 7.3.1
Let me preface this by saying that your question lacks a lot of the information needed for an accurate diagnosis. Anyway, I'll try to explain what is, IMHO, the most common reason, assuming your application doesn't use third-party components that work in a multi-threaded way.
I think this could be a scheduler effect. Let me explain what I mean.
Each core of the processor takes a process from the system and executes it for a "short" amount of time. This is the most common scheduling approach in desktop operating systems.
Your process is executed on a single core for this amount of time and is then stopped so that other processes can continue. When your process is resumed, it may be executed on another core (still one core at a time, but a different one). So an imprecise task monitor with low time resolution can register utilization on all cores, even though the process never uses more than one at once.
To verify whether this is the cause, I suggest you look at the CPU % used while your application is running. For a single-threaded application the CPU usage should be about 1/numberOfCores, in your case 25%.
If it's a release build, your compiler may be vectorising or otherwise parallelising your code. Also, libraries you link against, say the standard library for example, may be threaded or vectorised.
I am trying to learn Xeon Phi, and while studying the Intel Xeon Phi Coprocessor HPC book, I tried to run the code here (from the book).
The code uses OpenMP and 2 threads.
But the results I am getting are the same as when running with 1 thread
(as if OpenMP were not used at all).
I even tried different combinations on the MIC, but still the same:
export OMP_NUM_THREADS=2
export MIC_OMP_NUM_THREADS=124
export MIC_ENV_PREFIX=MIC
It seems that somehow OpenMP is not enabled? Am I missing something here?
The code using only 1 thread is here
I compiled using:
icc -mmic -openmp -qopt-report -O3 hello.c
Thanks!
I am not sure exactly which book you are talking about, but perhaps this will help.
The code you show does not use the offload programming style and must be run natively on the coprocessor, meaning you copy the executable to the coprocessor and run it there, or you use the micnativeloadex utility to run the code from the host processor. You show that you know the code must be run natively because you compiled it with the -mmic option.
If you use micnativeloadex, then the number of omp threads on the coprocessor is set by executing "export MIC_OMP_NUM_THREADS=124" on the host. If you copy the executable to the coprocessor and then log in to run it there, the number of omp threads on the coprocessor is set by executing "export OMP_NUM_THREADS=124" on the coprocessor. If you use "export OMP_NUM_THREADS=2" on the coprocessor, you get only two threads; the MIC_OMP_NUM_THREADS environment variable is not used if you set it directly on the coprocessor.
I don't see any place in the code where it prints out the number of threads, so I don't know for sure how you determined the number of threads actually being used. I suspect you were using a tool like micsmc. However, micsmc tells you how many cores are in use, not how many threads are in use.
By default, the omp threads are laid out in order, so that the first core would run threads 0,1,2,3, the second core would run threads 4,5,6,7 and so on. If you are using only two threads, both threads would run on the first core.
So, is that what you are seeing - not that you are using only one thread but instead that you are using only one core?
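To check directly how many threads the runtime actually gives you (rather than how many cores appear busy), something like the following can be run natively on the coprocessor; a minimal sketch, compiled the same way as the original code:

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        // Only one thread prints, reporting the size of the whole team.
        #pragma omp single
        printf("OpenMP is using %d threads\n", omp_get_num_threads());
    }
    return 0;
}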
I was looking at the serial version of the code you are using. For the following lines:
for (j = 0; j < MAXFLOPS_ITERS; j++)
{
    //
    // scale 1st array and add in the 2nd array
    // example usage - y = mx + b;
    //
    for (k = 0; k < LOOP_COUNT; k++)
    {
        fa[k] = a * fa[k] + fb[k];
    }
}
I see that here you do not scan the complete array. Instead, you keep updating the first 128 (LOOP_COUNT) elements of the array fa. If you wish to compare this serial version to the parallel code you are referring to, you will have to ensure that the program does the same amount of work in both versions.
Thanks
I noticed three things in your first OpenMP program:
1) The total floating-point operation count should reflect the number of threads doing the work. Therefore:
gflops = (double)( 1.0e-9 * LOOP_COUNT * MAXFLOPS_ITERS * FLOPSPERCALC * numthreads );
2) You hard-coded the number of threads to 2. If you want to use the OMP environment variable, you should comment out the API call omp_set_num_threads(2);.
3) After transferring the binary to the coprocessor, set the OMP environment variable on the coprocessor using OMP_NUM_THREADS, not MIC_OMP_NUM_THREADS. For example, if you want 64 threads to run your program on the coprocessor:
% ssh mic0
% export OMP_NUM_THREADS=64