Utilizing multiple GPUs in my machine (Intel + Nvidia) - copying data between them - C++

My machine has one Intel graphics card and one Nvidia 1060 card.
I use the Nvidia GPU for object detection (YOLO).
Pipeline:
---Stream--->Intel gpu (decode)----> Nvidia Gpu (Yolo)---->Renderer
I want to utilize both of my GPU cards: one for decoding frames (hardware acceleration via FFmpeg) and the other for YOLO. (Nvidia restricts the number of streams you can decode at one time to 1, but I don't see such a restriction with Intel.)
Has anyone tried something like this? Any pointers on how to do inter-GPU frame transfer?
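One possible approach, sketched below under stated assumptions: FFmpeg exposes no direct Intel-to-NVIDIA copy path, so a decoded frame is normally staged through system memory - decode with a hardware decoder such as h264_qsv, pull the surface down with av_hwframe_transfer_data(), then upload it to the CUDA device that runs YOLO (cv::cuda::GpuMat::upload would do the same host-to-device copy). The function name stage_frame_to_cuda is illustrative, and the decoder is assumed to be opened elsewhere; this is not a complete pipeline.

```cpp
// Hedged sketch: move a frame decoded on the Intel GPU (e.g. FFmpeg's h264_qsv
// decoder) over to the NVIDIA GPU. `hw_frame` is assumed to be the AVFrame
// returned by avcodec_receive_frame() with a hardware pixel format.
extern "C" {
#include <libavutil/frame.h>
#include <libavutil/hwcontext.h>
}
#include <cuda_runtime.h>

// Returns a device pointer holding the NV12 planes, or nullptr on failure.
static void* stage_frame_to_cuda(AVFrame* hw_frame)
{
    AVFrame* sw_frame = av_frame_alloc();
    if (!sw_frame) return nullptr;

    // Intel GPU -> system memory; the output format is typically NV12 for QSV.
    if (av_hwframe_transfer_data(sw_frame, hw_frame, 0) < 0) {
        av_frame_free(&sw_frame);
        return nullptr;
    }

    const int w = sw_frame->width, h = sw_frame->height;
    void* d_nv12 = nullptr;
    cudaMalloc(&d_nv12, size_t(w) * h * 3 / 2);          // NV12: 1.5 bytes/pixel

    // System memory -> NVIDIA GPU, plane by plane (source strides may differ).
    cudaMemcpy2D(d_nv12, w, sw_frame->data[0], sw_frame->linesize[0],
                 w, h, cudaMemcpyHostToDevice);
    cudaMemcpy2D(static_cast<unsigned char*>(d_nv12) + size_t(w) * h, w,
                 sw_frame->data[1], sw_frame->linesize[1],
                 w, h / 2, cudaMemcpyHostToDevice);

    av_frame_free(&sw_frame);
    return d_nv12;
}
```

The round trip through host memory is the cost of splitting the pipeline across vendors; pinned host buffers and an async copy on a CUDA stream would be the usual next step if this hop becomes the bottleneck.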

Related

Screen Capturing with CUDA and Direct3D

I need to capture the screen in real time (less than 10 ms) with C++.
I tried a BitBlt/StretchBlt-based approach, but it was very slow (20+ ms), which is unacceptable.
I found the solution here: https://www.unknowncheats.me/forum/general-programming-and-reversing/422635-fastest-method-capture-screen.html
Please check LRK03's answer, which I copy/paste here:
"I currently use the DesktopDuplicationAPI and copy the result via the CUDA D3D Interoperability to GPU memory.
From there I can create an opencv cuda gpu mat (which I wanted to have as a final result)"
"The screen capture when the next screen is available took around 0.5-1 milliseconds."
Unfortunately, I do not know the DesktopDuplicationAPI, CUDA, or Direct3D, so I need everyone's help.
OS: Windows 10
PC: 3.2 GHz Intel Core i7-8700 Six-Core. 16GB DDR4
GPU: NVIDIA GeForce GTX 1080 TI (8GB GDDR5)
Size of Screen to be captured: 1280*720
Thank you in advance.
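For reference, a hedged sketch of the approach LRK03 describes: grab the desktop with the DXGI Desktop Duplication API, copy the acquired texture into a texture you own, and register that copy with CUDA via the D3D11 interop. Error handling is trimmed, and in a real capture loop you would create the duplication and register the texture once, then only acquire/map per frame. Build with the CUDA toolkit and link d3d11.lib and dxgi.lib.

```cpp
#include <d3d11.h>
#include <dxgi1_2.h>
#include <cuda_d3d11_interop.h>
#include <cuda_runtime.h>

int main()
{
    // D3D11 device on the GPU that drives the desktop.
    ID3D11Device* dev = nullptr;
    ID3D11DeviceContext* ctx = nullptr;
    D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                      nullptr, 0, D3D11_SDK_VERSION, &dev, nullptr, &ctx);

    // Duplicate output 0 of the device's adapter (the primary monitor).
    IDXGIDevice* dxgiDev = nullptr;
    dev->QueryInterface(__uuidof(IDXGIDevice), (void**)&dxgiDev);
    IDXGIAdapter* adapter = nullptr;
    dxgiDev->GetAdapter(&adapter);
    IDXGIOutput* output = nullptr;
    adapter->EnumOutputs(0, &output);
    IDXGIOutput1* output1 = nullptr;
    output->QueryInterface(__uuidof(IDXGIOutput1), (void**)&output1);
    IDXGIOutputDuplication* dupl = nullptr;
    output1->DuplicateOutput(dev, &dupl);

    // Acquire one frame (waits up to 16 ms for a new one).
    DXGI_OUTDUPL_FRAME_INFO info{};
    IDXGIResource* res = nullptr;
    if (SUCCEEDED(dupl->AcquireNextFrame(16, &info, &res))) {
        ID3D11Texture2D* frameTex = nullptr;
        res->QueryInterface(__uuidof(ID3D11Texture2D), (void**)&frameTex);

        // The duplication texture belongs to DXGI, so copy it into a
        // texture we own before handing it to CUDA.
        D3D11_TEXTURE2D_DESC desc{};
        frameTex->GetDesc(&desc);
        desc.MiscFlags = 0;
        desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
        ID3D11Texture2D* ownTex = nullptr;
        dev->CreateTexture2D(&desc, nullptr, &ownTex);
        ctx->CopyResource(ownTex, frameTex);

        // Register and map the copy with CUDA; from the mapped cudaArray
        // one can build e.g. a cv::cuda::GpuMat without touching the CPU.
        cudaGraphicsResource* cudaRes = nullptr;
        cudaGraphicsD3D11RegisterResource(&cudaRes, ownTex,
                                          cudaGraphicsRegisterFlagsNone);
        cudaGraphicsMapResources(1, &cudaRes);
        cudaArray_t gpuFrame = nullptr;   // device-side copy of the desktop
        cudaGraphicsSubResourceGetMappedArray(&gpuFrame, cudaRes, 0, 0);

        // ... process gpuFrame with CUDA here ...

        cudaGraphicsUnmapResources(1, &cudaRes);
        cudaGraphicsUnregisterResource(cudaRes);
        ownTex->Release();
        frameTex->Release();
        res->Release();
        dupl->ReleaseFrame();
    }
    return 0;
}
```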

How to find a GStreamer bottleneck?

I have a glvideomixer sink that shows 16 720p/60 Hz videos simultaneously in a 4x4 matrix. When the source of all 16 videos is 16 different "h264 main profile" files, everything runs smoothly, but when I acquire the videos from 4 grabber cards (4 x 4 HDMI input ports set to 1280x720 at 60 Hz, same as the video files), the output stutters.
The pipeline is very simple:
glvideomixer(name=vmix)
ksvideosrc(device-index=0...15)->capsfilter(video/x-raw,format=YV12,framerate=60/1,height=720,width=1280)->vmix.sink_0...15
Note: The ksvideosrc element is only available on Windows platform.
AFAIK the pipeline is GL-based, so all the video streams are implicitly uploaded to a GL context, where glvideomixer treats them as GL textures. Am I right?
But I don't understand why everything runs smoothly with 16 video files, even though that should in theory be the more complex case (the computer must decode those streams before sending them to the GPU), while the output stutters when I use the grabber cards.
I'm pretty sure the stream format of the cards is raw YV12, because I set the capsfilter element to explicitly select that format. Here is the link to the grabbers: http://www.yuan.com.tw/en/products/capture/capture_sc510n4_hdmi_spec.htm
I think the bottleneck is the PCIe bus, but I'm not sure, because the GPU is an AMD FirePro W7100 running at x16 and the 4 grabber cards are PCIe x4 cards running at x4.
It should be noted that everything runs smoothly with up to 13 video signals from the grabbers; adding one more makes the stutter show up.
So: how can I find out where the bottleneck is?
Many thanks in advance.
Edit:
The rig is:
MB: Asus x99-deluxe USB 3.1:
http://www.asus.com/Motherboards/X99DELUXEU31/
CPU: Hexacore i7-5930K Haswell 40 PCIe lanes:
http://ark.intel.com/es/products/82931/Intel-Core-i7-5930K-Processor-15M-Cache-up-to-3_70-GHz
RAM: Kingston Hyperx PC4 21300 HX426C15FBK2/8 dual channel setup:
http://www.kingston.com/dataSheets/HX426C15FBK2_8.pdf
GPU: AMD FirePro W7100, 8 GB GDDR5, 256-bit:
http://www.amd.com/en-us/products/graphics/workstation/firepro-3d/7100
HDD: Kingston SSD Now V300:
http://www.kingston.com/datasheets/sv300s3_en.pdf
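A diagnostic sketch that might help locate the slow link, under the assumption that the problem shows up as individual sources falling behind: attach a buffer probe to each source pad and print per-source frame rates once a second. The two videotestsrc elements, the src0/src1 names, and the glimagesink are only stand-ins; on the real rig the sources would be the 16 ksvideosrc device-index=N elements.

```cpp
#include <gst/gst.h>
#include <cstdio>

#define NUM_SRC 2                      // 16 on the real rig
static gint frame_count[NUM_SRC];

// Called from the streaming thread for every buffer on a source pad.
static GstPadProbeReturn on_buffer(GstPad*, GstPadProbeInfo*, gpointer user_data)
{
    g_atomic_int_inc(&frame_count[GPOINTER_TO_INT(user_data)]);
    return GST_PAD_PROBE_OK;
}

// Runs once a second in the main loop; a source stuck below 60 fps is starving.
static gboolean print_rates(gpointer)
{
    for (int i = 0; i < NUM_SRC; ++i) {
        gint n = g_atomic_int_get(&frame_count[i]);
        g_atomic_int_set(&frame_count[i], 0);       // approximate reset
        std::printf("src%d: %d fps  ", i, n);
    }
    std::printf("\n");
    return TRUE;                                    // keep the timer running
}

int main(int argc, char** argv)
{
    gst_init(&argc, &argv);

    // Stand-in pipeline; swap videotestsrc for ksvideosrc device-index=N here.
    GError* err = nullptr;
    GstElement* pipeline = gst_parse_launch(
        "glvideomixer name=vmix ! glimagesink "
        "videotestsrc is-live=true name=src0 ! video/x-raw,format=YV12,width=1280,height=720,framerate=60/1 ! vmix. "
        "videotestsrc is-live=true name=src1 ! video/x-raw,format=YV12,width=1280,height=720,framerate=60/1 ! vmix. ",
        &err);
    if (!pipeline) { std::fprintf(stderr, "parse error: %s\n", err->message); return 1; }

    for (int i = 0; i < NUM_SRC; ++i) {
        gchar* name = g_strdup_printf("src%d", i);
        GstElement* src = gst_bin_get_by_name(GST_BIN(pipeline), name);
        GstPad* pad = gst_element_get_static_pad(src, "src");
        gst_pad_add_probe(pad, GST_PAD_PROBE_TYPE_BUFFER, on_buffer,
                          GINT_TO_POINTER(i), nullptr);
        gst_object_unref(pad); gst_object_unref(src); g_free(name);
    }

    g_timeout_add_seconds(1, print_rates, nullptr);
    gst_element_set_state(pipeline, GST_STATE_PLAYING);
    GMainLoop* loop = g_main_loop_new(nullptr, FALSE);
    g_main_loop_run(loop);
    return 0;
}
```

Moving the same probes to the mixer's sink pads (or adding queue elements after each source and watching their current-level properties) would distinguish "the grabbers can't deliver 60 fps over PCIe" from "glvideomixer can't composite 16 inputs in time."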

Same Direct2D application performs better on a "slower" machine

I wrote a Direct2D application that displays a certain number of graphic elements.
When I run this application, it takes about 4 seconds to display 700,000 graphic elements on my notebook:
Intel Core i7 CPU Q 720 1.6 GHz
NVIDIA Quadro FX 880M
According to the Direct2D MSDN page:
Direct2D is a user-mode library that is built using the Direct3D 10.1
API. This means that Direct2D applications benefit from
hardware-accelerated rendering on modern mainstream GPUs.
I was expecting the same application (without any modification) to perform better on a different machine with better specs. So I tried it on a desktop computer:
Intel Xeon(R) CPU 2.27 GHz
NVIDIA GeForce GTX 960
But it took 5 seconds (1 second more) to display the same graphics (same number and type of elements).
I would like to know how this is possible and what the causes are.
It's impossible to say for sure without measuring. However, my gut tells me that melak47 is correct. There is no lack of GPU acceleration, it's a lack of bandwidth. Integrated GPUs have access to the same memory as the CPU. They can skip the step of having to transfer bitmaps and drawing commands across the bus to dedicated graphics memory for the GPU.
With a primarily 2D workload, any GPU will be spending most of its time waiting on memory. In your case, the integrated GPU has an advantage. I suspect that extra second you feel is your GeForce waiting on graphics coming across the motherboard bus.
But, you could profile and enlighten us.
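To put numbers on it, here is a minimal timing sketch, assuming an already-created ID2D1HwndRenderTarget* rt and a DrawScene() callback that issues the 700,000 primitives (both hypothetical, standing in for the asker's code). Direct2D batches commands, so the measurement must include EndDraw(), which is where the work is actually flushed to the GPU.

```cpp
#include <windows.h>
#include <d2d1.h>
#include <cstdio>

// Times a single frame in milliseconds. Link with d2d1.lib.
// `rt` and `DrawScene` are assumed to come from the existing application.
double RenderAndTimeMs(ID2D1HwndRenderTarget* rt,
                       void (*DrawScene)(ID2D1RenderTarget*))
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    D2D1_COLOR_F white = {1.0f, 1.0f, 1.0f, 1.0f};
    rt->BeginDraw();
    rt->Clear(&white);
    DrawScene(rt);                        // issue the 700,000 primitives here
    HRESULT hr = rt->EndDraw();           // Direct2D flushes the batch to the GPU here
    QueryPerformanceCounter(&t1);

    if (FAILED(hr))
        std::fprintf(stderr, "EndDraw failed: 0x%08lx\n",
                     static_cast<unsigned long>(hr));
    return 1000.0 * double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
}
```

Comparing this number on both machines, and separately timing only the loop that issues the draw calls, should show whether the time goes into command submission (CPU-bound) or into EndDraw (GPU/bus-bound).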
Some good points in the comments and other replies. (I can't add a comment yet.)
Your results don't surprise me, as there are some differences between your two setups.
Let's have a look here: http://ark.intel.com/fr/compare/47640,43122
A shame we can't see the SSE version supported by your Xeon CPU; those are often used for code optimization. Is the model I chose for the comparison even the right one?
There is no integrated GPU in that Core i7, but it has 4 cores + Hyper-Threading = 8 threads, against 2 cores with no Hyper-Threading for the Xeon.
Quadro hardware rocks when it comes to real-time rendering. As your scene seems to be quite simple, it could be well optimized for that, but just "maybe" - I'm guessing here... could someone with experience comment on that? :-)
So it's not so simple. What appears to be a better graphics card doesn't necessarily mean better performance. If you have a bottleneck somewhere else, you're screwed!
Since the difference is small, you must compare every single element of your two setups: CPU, RAM, HDD, GPU, motherboard with type of PCIe and chipset.
So again, a lot of guessing; some tests are needed :)
Have fun and good luck ;-)

Two GPU Cards, One Enabled Display, One Disabled Display: How to tell which GPU Card OpenGL is running on?

So I have two NVidia GPU cards:
Card A: GeForce GTX 560 Ti - Wired to Monitor A (Dell P2210)
Card B: GeForce 9800 GTX+ - Wired to Monitor B (ViewSonic VP20)
Setup: an Asus motherboard with an Intel Core i7 that supports SLI
In the NVidia Control Panel, I disabled Monitor A, so I only have Monitor B for all my display purposes.
I ran my program, which
simulates 10,000 particles in OpenGL and renders them (properly shown on Monitor B)
uses cudaSetDevice() to 'target' Card A to run the computationally intensive CUDA kernel.
The idea is simple - use Card B for all the OpenGL rendering work and Card A for all the CUDA kernel computation.
My Question is this:
After using GPU-Z to monitor both cards, I can see that:
Card A's GPU load increased immediately to over 60%, as expected.
However, Card B's GPU load increased only up to 2%. For 10,000 particles rendered in 3D in OpenGL, I am not sure if that is what I should have expected.
So how can I find out whether the OpenGL rendering was indeed using Card B (whose connected Monitor B is the only one that is enabled) and had nothing to do with Card A?
And an extension to the question:
Is there a way to 'force' the OpenGL rendering logic to use a particular GPU Card?
You can tell which GPU an OpenGL context is using with glGetString(GL_RENDERER); see the sketch below.
Is there a way to 'force' the OpenGL rendering logic to use a particular GPU Card?
Given the functions of the context creation APIs available at the moment: No.
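A minimal, self-contained check might look like this; the GLFW window is only there to create a context, and in your program the two glGetString() calls could simply be dropped in right after context creation. On your setup the string should name either the GeForce GTX 560 Ti or the 9800 GTX+, which answers the first question directly.

```cpp
// Minimal sketch: print which GPU the current OpenGL context is running on.
// GLFW is used only to create a throwaway context; link with glfw and opengl32.
#include <GLFW/glfw3.h>
#include <cstdio>

int main()
{
    if (!glfwInit()) return 1;
    GLFWwindow* win = glfwCreateWindow(64, 64, "probe", nullptr, nullptr);
    if (!win) { glfwTerminate(); return 1; }
    glfwMakeContextCurrent(win);

    // GL_RENDERER names the device actually driving this context,
    // GL_VENDOR names the driver vendor.
    std::printf("GL_RENDERER: %s\n", glGetString(GL_RENDERER));
    std::printf("GL_VENDOR:   %s\n", glGetString(GL_VENDOR));

    glfwDestroyWindow(win);
    glfwTerminate();
    return 0;
}
```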

Performing calculations with the ATI Mobility Radeon HD 5850 graphics card

I wish to do calculations on my graphics card, as the CPU is too slow. I know it is possible with NVidia cards (CUDA etc.), but I can't find anything about using the ATI Mobility Radeon HD 5850 graphics card in the laptop I have. I wish to perform vector addition/multiplication plus exp/log functions with floating point.
Is there any C code that can access the card and put it to work in this way?
The calculations have nothing to do with graphics, but I expect they would still be sped up a lot by the card's power compared to the CPU.
Radeon cards support OpenCL (aka ATI Stream) and Direct3D compute shaders.
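For example, here is a hedged OpenCL sketch of exactly the kind of work described (element-wise exp and log over float vectors), callable from C/C++ on a Mobility Radeon HD 5850 with the AMD/ATI Stream drivers installed. Link against the OpenCL library; error checking is trimmed, and the kernel name explog and the vector size are arbitrary choices for illustration.

```cpp
#include <CL/cl.h>
#include <vector>
#include <cstdio>

// OpenCL C kernel compiled at runtime for the GPU.
static const char* kSrc = R"(
__kernel void explog(__global const float* a, __global const float* b,
                     __global float* out)
{
    size_t i = get_global_id(0);
    out[i] = exp(a[i]) + log(b[i]);
}
)";

int main()
{
    const size_t n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), out(n);

    // First GPU of the first platform (the Radeon, on a typical laptop).
    cl_platform_id platform; cl_device_id device; cl_int err;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    // Copy the input vectors to the card, leave room for the result.
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), a.data(), &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), b.data(), &err);
    cl_mem dout = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), nullptr, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "explog", &err);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dout);

    // One work-item per vector element, then read the result back.
    size_t global = n;
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, dout, CL_TRUE, 0, n * sizeof(float), out.data(), 0, nullptr, nullptr);

    std::printf("out[0] = %f (expected exp(1)+log(2) ~= 3.411)\n", out[0]);
    return 0;
}
```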