Maximize TensorFlow multi-GPU performance - C++

I was wondering if anybody could advise on how to get peak performance out of TensorFlow in a 4-GPU setting.
As a test I created two copies of the same network (an ~18-layer residual network with small filter banks (ranging from 16 to 128) on 32x32 inputs; batch size 512, 128 per GPU): one in MXNet and one modelled off the Inception example.
My MXNet network can train at around 7k examples a second, whereas TensorFlow is only capable of 4.2k with dummy data and 3.7k with real data.
(When running on 1 GPU the numbers are 1.2k examples a second vs 2.1k.)
Based on these experiments, I have a few questions in the hope of speeding things up.
GPU utilization seems quite low during training. I noticed that the TensorFlow white paper mentions support for running multiple streams on the same GPU. Is this possible in the public release?
Is there any way to perform multiple train operations in one execution of session.run(), or to have asynchronous execution? This would allow weight updates to happen at the same time as the next batch's forward pass. I have tried using 2 threads (both plain system threads and QueueRunners), but this only resulted in a slowdown (a rough sketch of the threaded attempt is included below). MXNet is able to increase speed by running weight updates on the CPU so that the GPUs can be used for the next batch.
Will the new distributed runtime get around some of these issues by letting me run more than one worker on a single machine?
Is there something else that can be done?
I know there are a number of similar questions here on Stack Overflow, but through my searching I couldn't find a solution to my problems that I have not already tried.
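For reference, the threaded attempt mentioned above looked roughly like this (heavily simplified; sess, train_op and num_steps stand in for my real session, training op and loop bounds):

import threading

# Two Python threads each repeatedly call session.run on the same training
# op, hoping the runtime would overlap one step's weight update with the
# next step's forward pass.
def train_loop(sess, train_op, num_steps):
    for _ in range(num_steps):
        sess.run(train_op)

threads = [threading.Thread(target=train_loop, args=(sess, train_op, num_steps))
           for _ in range(2)]
for th in threads:
    th.start()
for th in threads:
    th.join()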
Edit:
I did a little bit of CUDA profiling to see what the expensive kernels were. According to my run, 21.4% of the time is spent inside:
void Eigen::internal::EigenMetaKernel_NonVectorizable<Eigen::TensorEvaluator
<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=4, int=1, long>, int=16>,
Eigen::TensorPaddingOp<Eigen::array<std::pair<int, int>,
unsigned long=4> const, Eigen::TensorMap<Eigen::Tensor<float const,
int=4, int=1, long>, int=16> const > const > const, Eigen::GpuDevice>, long>(float, int=4)
and 20.0% of the time was spent in
void Eigen::internal::EigenMetaKernel_NonVectorizable<Eigen::TensorEvaluator
<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=4, int=1, long>, int=16>,
Eigen::TensorBroadcastingOp<Eigen::array<int, unsigned long=4>
const, Eigen::TensorMap<Eigen::Tensor<float const, int=4, int=1, long>,
int=16> const > const > const, Eigen::GpuDevice>, long>(float, int=4)
From the signatures I am not exactly sure what these kernels are doing. Do these make sense?
In addition to this, the analysis reports low kernel concurrency, 0%, as expected.
And low compute utilization: 34.9% (granted, this includes start-up time and a little bit of Python in the train loop, around 32 seconds out of 91 total; that comes out to around 50% utilization inside TensorFlow).
Edit 2:
I have attached a copy of the trimmed-down source code. In general, though, I am more concerned about questions 1-3 and don't want to take up too much of everybody's time.
In addition, I am running TensorFlow built from f07234db2f7b316b08f7df25417245274b63342a.
Edit 3:
Updated to the most recent TensorFlow (63409bd23facad471973b110df998782c0e19c06), same code, default data format (NHWC), and that seemed to speed things up a lot.
On fake data: 6.7k-6.8k examples a second on 4 GPUs (thermal dependence, I think?); 2.0k examples a second on 1 GPU.
On real data: around 4.9k examples a second on 4 GPUs; 1.7k examples a second on 1 GPU.
Edit 4:
In addition I tried switching the data format to NCHW, modelling the conversion off Soumith's benchmarks. The convolution parts were indeed faster, but batch norm appears to be messing everything up. With a naive implementation (fixing the axis, and making the weights [1,C,1,1] instead of [C,]) I am only able to get 1.2k examples a second on 4 GPUs (fake data), whereas with a transpose before and after the batch norm op I am able to get 6.2k examples a second (fake data). That is still slower than the NHWC data_format.
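For concreteness, the transpose workaround looks roughly like this (batch_norm_nhwc is a stand-in for my existing NHWC batch-norm code; only the transposes are the point here):

import tensorflow as tf

def batch_norm_nchw(x):
    # NCHW -> NHWC, run the existing NHWC batch norm, then transpose back.
    x = tf.transpose(x, [0, 2, 3, 1])
    x = batch_norm_nhwc(x)
    return tf.transpose(x, [0, 3, 1, 2])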

It's a bit hard to diagnose your program's performance problem without seeing the code. Is it possible for us to read your test code somehow?
TensorPadding showing up at the top is a bit strange; I'd expect cuDNN calls to be at the top of the profile. In any case, showing us the test code would be helpful.

Related

OpenVINO GPU performance optimization

I'm trying to speed up inference in a people-counter application. In order to use the GPU, I've set the inference engine configuration as described below:
device_name = "GPU"
ie.SetConfig({ {PluginConfigParams::KEY_CONFIG_FILE, "./cldnn_global_custom_kernels/cldnn_global_custom_kernels.xml"} }, device_name);
and when loading the network on the inference engine I've set the target device as described below:
CNNNetwork network = netReader.getNetwork();
TargetDevice t_device = InferenceEngine::TargetDevice::eGPU;
network.setTargetDevice(t_device);
const std::map<std::string, std::string> dyn_config = { { PluginConfigParams::KEY_DYN_BATCH_ENABLED, PluginConfigParams::YES } };
ie_.LoadNetwork(network,device_name, dyn_config);
but the inference engine still uses the CPU, and this slows down inference. Is there a way to use the Intel GPU at full capacity to run inference on a particular network? I'm using the person-detection-retail-0013 model.
Thanks.
Did you mean person-detection-retail-0013? I haven't found pedestrian-detection-retail-013 in the open_model_zoo repo.
It might be expected that you see a slowdown while using the GPU. The network you tested has the following layers as part of its topology: PriorBox, DetectionOutput. Those layers are executed on the CPU, as the documentation says: https://docs.openvinotoolkit.org/latest/_docs_IE_DG_supported_plugins_CL_DNN.html
My guess is that this may be the reason for the slowdown.
But to be 100% sure I would suggest running the benchmark_app tool to benchmark the model. This tool can print detailed performance information about each layer, which should help shed light on the real root cause of the slowdown. More information about benchmark_app can be found here: https://docs.openvinotoolkit.org/latest/_inference_engine_samples_benchmark_app_README.html
PS: just a piece of advice regarding usage of the IE API: setTargetDevice is a deprecated method, so the network.setTargetDevice(t_device); call is not needed. It is enough to set the device via LoadNetwork, as in your example: ie_.LoadNetwork(network, device_name, dyn_config);
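For what it's worth, the same idea in the Python API looks roughly like this (just a sketch: it assumes a recent OpenVINO release with the IECore bindings, and the model paths are placeholders):

from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="person-detection-retail-0013.xml",
                      weights="person-detection-retail-0013.bin")
# No deprecated setTargetDevice call: the device is chosen at load time.
exec_net = ie.load_network(network=net, device_name="GPU",
                           config={"DYN_BATCH_ENABLED": "YES"})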
Hope it will help.

Reasons for variations in run-time performance of identical code

When running some benchmarks of my C++ software, I obtained the following picture:
The plot shows the execution time in nanoseconds of one tick of the software.
The exact same tick is run each time (one data point is one tick).
When testing in the simulated environment of Valgrind, there is zero difference between ticks, and I don't make any syscalls other than what clock_gettime may do.
I would like to understand what can cause the two "speeds" at which the tick seems to run, and what could be causing the outlier points. Disabling Intel CPU sleep states helped a lot (before that I had 4 lines like this). The scheduler used is Linux's FIFO scheduler.
An interesting observation is that the tick times alternate between roughly 9000 and 6700 ns; here are some data points:
9022
6605
9170
6756
9126
6594
9102
6744
9016
6643
8950
6638
9047
6662
Edit:
Just doing this in a loop in my thread:
auto t0 = std::chrono::high_resolution_clock::now();
auto t1 = std::chrono::high_resolution_clock::now();
measure(std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
is enough for me to see the alternating effect, though between 100 and 150 ns.

Fine-tuning VGG raises a memory error

Hi, I'm trying to fine-tune VGG on my problem, but when I try to train the net I get this error:
OOM when allocating tensor with shape[25088,4096]
The net has this structure:
I took this pretrained TensorFlow VGG implementation from this site.
I only added this procedure to train the net:
with tf.name_scope('joint_loss'):
    joint_loss = ya_loss+yb_loss+yc_loss+yd_loss+ye_loss+yf_loss+yg_loss+yh_loss+yi_loss+yl_loss+ym_loss+yn_loss
    # Loss with weight decay
    l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
    self.joint_loss = joint_loss + self.weights_decay * l2_loss
    self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate).minimize(joint_loss)
I tried reducing the batch size to 2 but that doesn't work; I get the same error. The error is due to a big tensor that cannot be allocated in memory. I only get this error during training, because if I just feed a value without minimizing, the net works. How can I avoid this error? How can I save memory on the graphics card (Nvidia GeForce GTX 970)?
UPDATE: if I use GradientDescentOptimizer the training process starts, whereas with AdamOptimizer I get the memory error; it seems that GradientDescentOptimizer uses less memory.
Without a backward pass ("feed a value without minimizing"), TensorFlow can immediately de-allocate intermediate activations. With a backward pass, the graph has a giant U-shape, where activations from the forward pass need to be kept in memory for the backward pass. There are some tricks (such as swapping to host memory), but in general backprop means that memory usage will be higher.
Adam does keep some extra bookkeeping variables around, so it will increase memory usage proportional to the amount of memory your weight variables are already using. If your training steps take quite a while (in which case having the variable updates on the GPU isn't important), you could instead locate the optimization ops in host memory.
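As a rough sketch of that last suggestion (using the question's joint_loss and learning rate; note that whether Adam's slot variables follow this op placement depends on colocation defaults, so treat it as a starting point rather than a guaranteed fix):

import tensorflow as tf

# Pin the optimizer's update ops to host memory; the GPU then mainly holds
# the weights and the forward/backward activations.
with tf.device('/cpu:0'):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    train_op = optimizer.minimize(joint_loss)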
If you need a larger batch size and can't reduce image resolution or model size, combining gradients from multiple workers/GPUs using something like SyncReplicasOptimizer can be a good option. Looking at the paper associated with this model, it looks like they were training on 4 GPUs each with 12GB of memory.

RAM consumption with respect to cores

I am working on a Linux Kubuntu computer with 12 cores and 31 GB of available RAM.
I produce simulations which calculate functions over 4 dimensions (x,y,z,t).
I define my dimensions as arrays that I pass to numpy.meshgrid. So, for each point in time, I calculate the result for each point (x,y,z). These are heavy calculations on heavy data.
First, I learned how to use it with only one core. It works well whatever the size of my "boxes" (x,y,z). Because I work a lot with Fourier transforms, I define x,y,z,t as powers of 2: 64, 128, 256, ...
I can, without difficulty, go up to x = y = z = t = 512, even if it takes a lot of time to run (which makes sense). When I do that, I use around 20-30% of the computer's available RAM. Great.
Then I wanted to use more cores, so I implemented this code:
import multiprocessing as mp
pool = mp.Pool(processes=8)
results = [pool.apply_async(conv_green, args=(tstep, S_, )) for tstep in t]
So here I ask my script to use 8 cores, and define my results as the application of the function conv_green with the args (tstep, S_) over all of t.
It works pretty well and uses 8 cores as expected, BUT I can no longer run any simulation that uses values equal to or above 512 for x,y,z,t.
This is where my problem lies. Technically, switching from the single-core version to multi-core changed nothing in the routine of my calculations. I do not understand why I have enough RAM for 512 in single-core mode and why, suddenly, when I switch to multiple cores, the computer does not even want to launch it (the error occurs at the "results = pool.apply ..." line).
So if you know how this works and why I hit this threshold, thanks for helping me sort it out!
Best regards.
PS: this is the error which pops up when it crashes with 512 on multiple cores:
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/dist
packages/spyderlib/widgets/externalshell/sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "/home/alexis/Heat/Simu⁄Lecture Propre/Test Tkinter/Simulation N spots SCAN Tkinter.py", line 280, in
XYslice = array([p.get()[0] for p in results])
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
SystemError: NULL result without error in PyObject_Call
For multiprocessing in any language each thread will need private storage which it can write to without interference from the other threads. As soon as interference is possible the data structure has to be locked, which (in the worst case) takes us back to single threading.
It would appear that your large data structure is being copied for each of the threads, effectively multiplying your memory usage by eight when you have eight processors ... or up to 200% of your available RAM.
The best solution would be to prevent the unnecessary copying.
If that's not feasible, then all you can do is limit the number of worker processes; four should be OK in your case, but make sure your machine has lots of swap space. Swap also gives you some room for virtual memory to exceed physical RAM; if the "working set" is small enough, you may be able to exceed your physical RAM significantly given enough swap.
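To make the "prevent the copying" idea concrete, here is a sketch for Linux (where Pool forks its workers): build the big arrays once at module level before creating the pool so the workers inherit them copy-on-write, pass only the time step to each task, and keep the worker count modest. conv_green and t are the names from your question; build_S is a placeholder for whatever builds your meshgrid data.

import multiprocessing as mp

S_ = build_S()   # the heavy meshgrid data, created once before the pool forks

def conv_green_step(tstep):
    # conv_green only reads S_, so the forked workers share its pages
    # instead of each receiving a pickled copy as a task argument.
    return conv_green(tstep, S_)

if __name__ == '__main__':
    pool = mp.Pool(processes=4)   # fewer workers also caps peak memory
    results = [pool.apply_async(conv_green_step, args=(tstep,)) for tstep in t]
    XYslice = [p.get()[0] for p in results]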

Kyoto Cabinet scan_parallel not really parallel?

I just spent a day creating an abstraction layer over Kyoto Cabinet to remove global locks from my code. I was busy porting my algorithms to this new abstraction layer when I discovered that scan_parallel isn't really parallel: it only maxes out one core. For jollies I stuck a billion-int-countdown spin loop into my code (empty stubs as I port) to try to simulate some processing time; still only one core is maxed. Do I need to move to Berkeley DB or LevelDB? I thought Kyoto Cabinet was meant for internet-scale problems :/. I must be doing something wrong or missing some gotchas.
top and iostat never went above 100% / 25% (iostat: one CPU maxed = 1/number of cores * 100) :/ on a quad-core i5.
The source DB is a 10 GB corpus of protocol-buffer-encoded data (TreeDB) with the following flags (picked these up from the documentation):
index_db.tune_options(TreeDB::TLINEAR | TreeDB::TCOMPRESS);
index_db.tune_buckets(1LL * 1000);
index_db.tune_defrag(8);
index_db.tune_page(32768);
Edit
Do not remove the IR tag. Please think before you wave around the detag bat.
This IS an IR-related question: it is about creating GINORMOUS (40+ GB) inverted files ONLINE. Inverted indices are the basis of IR data access methods, and inverted index creation has a unique transactional profile. By removing the IR tag you rob me of the wisdom of IR researchers who have used a database library to create such large database files.