I have written an application that launches Unity applications on an AWS EC2 instance that needs hardware acceleration to render properly. The launcher performs a pre-flight sanity check to make sure a HAL (hardware-accelerated) device is present, using Direct3D calls like:
D3DDISPLAYMODE mode;
HRESULT hr;
if (FAILED(hr = s_D3D->GetAdapterDisplayMode(g_D3DAdapter, &mode)))
{
    return false;
}
if (FAILED(hr = s_D3D->CheckDeviceType(g_D3DAdapter, D3DDEVTYPE_HAL, mode.Format, mode.Format, TRUE)))
{
    // This fails.
    return false;
}
As expected, the sanity check succeeds when I run the launcher from a Remote Desktop session, but it fails when the application is launched at boot.
If I remove the sanity check, the Unity application loads correctly on boot, but certain elements/shaders that do need hardware acceleration are not rendered properly.
I am using an AWS EC2 G4 instance with Windows 10 and an NVIDIA Tesla T4 GPU, with the latest NVIDIA and DirectX drivers installed.
For extra debugging I printed some logs to see whether the graphics adapter was even detected, and the application does seem to find it:
D3DADAPTER_IDENTIFIER9 adapter_info;
hr = s_D3D->GetAdapterIdentifier(g_D3DAdapter, 0, &adapter_info);
std::cout << "Device Name: " << adapter_info.DeviceName << std::endl;
std::cout << "Device Id: " << std::to_string(adapter_info.DeviceId) << std::endl;
std::cout << "Device Desc: " << adapter_info.Description << std::endl;
std::cout << "Device Driver: " << adapter_info.Driver << std::endl;
In case you are curious how I launch the application process on boot: I use AWS's user data scripts feature. Through the user data script I make sure that the application process runs with administrator privileges and under the correct Windows session. The following link describes the approach I am using:
https://www.codeproject.com/Articles/35773/Subverting-Vista-UAC-in-Both-32-and-64-bit-Archite
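In essence, the technique looks like the sketch below (my own condensed illustration of the article's approach, not my production code; note that WTSQueryUserToken only works from a SYSTEM process):

#include <windows.h>
#include <wtsapi32.h>
#pragma comment(lib, "wtsapi32.lib")
#pragma comment(lib, "advapi32.lib")

// Launch a command line in the active console session.
bool LaunchInConsoleSession(wchar_t* commandLine)
{
    HANDLE userToken = nullptr;
    DWORD sessionId = WTSGetActiveConsoleSessionId();
    if (sessionId == 0xFFFFFFFF || !WTSQueryUserToken(sessionId, &userToken))
        return false; // no active console session, or not running as SYSTEM

    STARTUPINFOW si = { sizeof(si) };
    wchar_t desktop[] = L"winsta0\\default"; // the interactive desktop
    si.lpDesktop = desktop;
    PROCESS_INFORMATION pi = {};
    BOOL ok = CreateProcessAsUserW(userToken, nullptr, commandLine,
                                   nullptr, nullptr, FALSE, 0,
                                   nullptr, nullptr, &si, &pi);
    if (ok)
    {
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
    }
    CloseHandle(userToken);
    return ok != FALSE;
}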
I haven't found anything close to this topic in the searches I have done so far.
I am currently using a Lenovo Yoga 510 with an AMD Radeon R5 graphics card. OpenCL works on it; however, when I run my code to query platform details, the total number of available platforms is returned, but it also reports an error that the platforms cannot be opened. Please see the error message below.
Error: Failed to open platforms key SOFTWARE\Intel\OpenCL\Boards to load board library at runtime.
Either link to the board library while compiling your host code or refer to your board vendor's documentation on how to install the board library so that it can be loaded at runtime.
Failed to close platforms key (null), ignoring
Warning: Cannot find any Intel(R) FPGA Board libraries.
No Intel(R) FPGA devices will be loaded.
Please contact your board vender or see section "Linking Your Host Application to the Khronos ICD Loader Library" of the Programming Guide to set up FCD manually.
2 PLATFORM(s) FOUND
See my code below:
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main() {
    cl_int returned;

    //SET-UP DEVICE EXECUTION ENVIRONMENT
    cl_uint no_of_platforms;
    cl_platform_id* platforms;

    //1. Query and select the vendor-specific platform
    returned = clGetPlatformIDs(0, NULL, &no_of_platforms);
    if (returned == CL_SUCCESS) {
        printf("%u PLATFORM(s) FOUND \n", no_of_platforms);
    }
    else {
        printf("No Platform Found\n");
        return EXIT_FAILURE; //Terminate program
    }
    platforms = (cl_platform_id*)malloc(sizeof(cl_platform_id) * no_of_platforms); //create enough space to put platform IDs into
    clGetPlatformIDs(no_of_platforms, platforms, NULL); //Fill in platforms with their IDs
    free(platforms);
    return 0;
}
Any idea what I may be doing wrong or have set up wrong? I am also wondering why it is looking for Intel FPGAs when I have a Radeon graphics card.
Based on what you've provided, it sounds like the OpenCL ICD (Installable Client Driver) configuration is broken. This can be caused by any of several independent factors:
Old/outdated graphics drivers
Corruption caused by a system update/Registry edit
The most reliable fix is to update (or, as a last resort, reinstall) your graphics drivers. Unless your GPU/iGPU is too old to have working OpenCL drivers, this should set everything up correctly.
Since you're using MSVC, I'd also recommend downloading the OpenCL SDK provided by Intel (or AMD, if this were an AMD CPU): not only does it ensure you have the most up-to-date headers and utilities for OpenCL, it also installs a CPU ICD, giving you an extra platform to test with.
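As a quick way to verify the ICD configuration after reinstalling, you can enumerate what the loader actually sees. A minimal sketch using only standard OpenCL calls:

#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    cl_uint count = 0;
    if (clGetPlatformIDs(0, NULL, &count) != CL_SUCCESS || count == 0) {
        printf("No OpenCL platforms visible to the ICD loader\n");
        return 1;
    }
    std::vector<cl_platform_id> platforms(count);
    clGetPlatformIDs(count, platforms.data(), NULL);
    for (cl_uint i = 0; i < count; ++i) {
        char name[256] = {0}, vendor[256] = {0};
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, sizeof(vendor), vendor, NULL);
        printf("Platform %u: %s (%s)\n", i, name, vendor);
    }
    return 0;
}

If your AMD platform shows up here by name, the ICD itself is healthy, and the FPGA warning most likely comes from a leftover Intel FPGA SDK ICD registration rather than from the Radeon driver.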
I'm trying to speed up the feature-finding stage of a stitching algorithm using OpenCL. I'm using the code of the example provided here: https://github.com/opencv/opencv/blob/master/samples/cpp/stitching_detailed.cpp
I read online that the only thing I have to do is change Mat to UMat, and I did that.
However, I am not sure my code is actually using OpenCL.
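To illustrate the kind of change I made (a minimal sketch, not the actual stitching code; UMat-backed operations are dispatched through OpenCL when it is available):

#include <opencv2/opencv.hpp>

void sketch()
{
    // Before: cv::Mat img = cv::imread("frame.jpg");
    cv::UMat img, gray;
    cv::imread("frame.jpg").copyTo(img);          // uploads the Mat into a UMat
    cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);  // runs via OpenCL if enabled
}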
First: I'm working on an Ubuntu 16.04 virtual machine using Parallels Desktop on a MacBook Pro, so the only device available to OpenCL is the CPU (no GPU). I installed the correct drivers, SDK, etc., and the CPU should work correctly; you can see the output of the clinfo command in the Ubuntu shell below. Working with a CPU, I do not expect a performance improvement. My plan is just to get things working in my virtual machine and then deploy the code on a real Ubuntu machine.
Second: the code shows no improvement (as expected, see above). Actually, the running time seems to be exactly the same. Is that right? I know I am still working on the CPU, but I expected some difference. Moreover, looking at the call graph profiled with gprof and gprof2dot, there are no differences (for those who have never heard of gprof: it is simply a code profiler that can generate a call graph showing all the calls among functions, i.e. which function calls which other function, and so on). Is that possible? Should OpenCV with and without OpenCL call exactly the same functions?
How can I be sure the code is actually using OpenCL? I read online that there were some bugs in feature finding on OpenCL, so I would like to check for myself. Moreover, I would obviously like to keep working on and editing the code; this is just the beginning.
I'm using this code to check if OpenCL is working:
void checkOpenCL() {
    if (!cv::ocl::haveOpenCL())
    {
        cout << "OpenCL is not available..." << endl;
        //return;
    }
    cv::ocl::Context context;
    if (!context.create(cv::ocl::Device::TYPE_ALL))
    {
        cout << "Failed creating the context..." << endl;
        //return;
    }
    cout << context.ndevices() << " CPU devices are detected." << endl; //This bit provides an overview of the OpenCL devices you have in your computer
    for (int i = 0; i < context.ndevices(); i++)
    {
        cv::ocl::Device device = context.device(i);
        cout << "name: " << device.name() << endl;
        cout << "available: " << device.available() << endl;
        cout << "imageSupport: " << device.imageSupport() << endl;
        cout << "OpenCL_C_Version: " << device.OpenCL_C_Version() << endl;
        cout << endl;
    }
    cv::ocl::Device(context.device(0)); //Here is where you change which GPU to use (e.g. 0 or 1)
}
And it prints:
1 CPU devices are detected.
name: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
available: 1
imageSupport: 1
OpenCL_C_Version: OpenCL C 1.2
Running clinfo in the Ubuntu shell reports:
Number of platforms 1
Platform Name Intel(R) OpenCL
Platform Vendor Intel(R) Corporation
Platform Version OpenCL 1.2 LINUX
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64
Platform Extensions function suffix INTEL
Platform Name Intel(R) OpenCL
Number of devices 1
Device Name Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Device Vendor Intel(R) Corporation
Device Vendor ID 0x8086
Device Version OpenCL 1.2 (Build 25)
Driver Version 1.2.0.25
Device OpenCL C Version OpenCL C 1.2
Device Type CPU
Device Profile FULL_PROFILE
Max compute units 4
Max clock frequency 2300MHz
Device Partition (core)
Max number of sub-devices 4
Supported partition types by counts, equally, by names (Intel)
Max work item dimensions 3
Max work item sizes 8192x8192x8192
Max work group size 8192
Preferred work group size multiple 128
Preferred / native vector sizes
char 1 / 32
short 1 / 16
int 1 / 8
long 1 / 4
half 0 / 0 (n/a)
float 1 / 8
double 1 / 4 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Address bits 64, Little-Endian
Global memory size 6103834624 (5.685GiB)
Error Correction support No
Max memory allocation 1525958656 (1.421GiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type Read/Write
Global Memory cache size 262144
Global Memory cache line 64 bytes
Image support Yes
Max number of samplers per kernel 480
Max size for 1D images from buffer 95372416 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 480
Max number of write image args 480
Local memory type Global
Local memory size 32768 (32KiB)
Max constant buffer size 131072 (128KiB)
Max number of constant args 480
Max size of kernel argument 3840 (3.75KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Local thread execution (Intel) Yes
Prefer user sync for interop No
Profiling timer resolution 1ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
SPIR versions 1.2
printf() buffer size 1048576 (1024KiB)
Built-in kernels
Device Available Yes
Compiler Available Yes
Linker Available Yes
Device Extensions cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [INTEL]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform
Compile and run facedetect.cpp from the T-API folder in OpenCV's samples directory; it will tell you whether OpenCL is ON or OFF, displayed on the rendered window.
Here is how to run the command:
./ufacedetect --cascade=/home/user/opencv/data/haarcascades/haarcascade_frontalface_alt.xml /dev/video5 --scale=1.3
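If you'd rather check from your own code than from the sample, the T-API can be toggled and queried at runtime. A minimal sketch using only public cv::ocl functions:

#include <opencv2/core/ocl.hpp>
#include <iostream>

int main()
{
    cv::ocl::setUseOpenCL(true); // explicitly ask the T-API to use OpenCL
    std::cout << "haveOpenCL: " << cv::ocl::haveOpenCL() << std::endl;
    std::cout << "useOpenCL:  " << cv::ocl::useOpenCL() << std::endl;
    return 0;
}

cv::ocl::useOpenCL() is the flag the UMat code paths consult, so if it prints 1, your UMat operations should be going through OpenCL.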
I'm in the process of writing a hardware-accelerated H.264 decoder using Media Foundation's Source Reader, but I have encountered a problem. I followed this tutorial and supported myself with the Windows SDK Media Foundation samples.
My app seems to work fine when hardware acceleration is turned off, but it doesn't provide the performance I need. When I turn acceleration on by passing an IMFDXGIDeviceManager in the IMFAttributes used to create the reader, things get complicated.
If I create the ID3D11Device with the D3D_DRIVER_TYPE_NULL driver, the app works fine and the frames are processed faster than in software mode, but judging by the CPU and GPU usage it still does the majority of the processing on the CPU.
On the other hand, when I create the ID3D11Device with the D3D_DRIVER_TYPE_HARDWARE driver and run the app, one of these four things happens:
1. I get only an unpredictable number of frames (usually 1-3) before IMFMediaBuffer::Lock returns 0x887A0005, described as "The GPU device instance has been suspended. Use GetDeviceRemovedReason to determine the appropriate action". When I call ID3D11Device::GetDeviceRemovedReason, I get 0x887A0020, described as "The driver encountered a problem and was put into the device removed state", which isn't as helpful as I wish it were.
2. The app crashes in an external DLL on the IMFMediaBuffer::Lock call. The DLL seems to depend on the GPU used: for the Intel integrated GPU it's igd10iumd32.dll, and for the Nvidia mobile GPU it's mfplat.dll. The message for this particular crash is: "Exception thrown at 0x53C6DB8C (mfplat.dll) in decoder_tester.exe: 0xC0000005: Access violation reading location 0x00000024". The addresses differ between executions, and sometimes it involves reading, sometimes writing.
3. The graphics driver stops responding, the system hangs for a short time, and then the application either crashes as in point 2 or finishes as in point 1.
4. The app works fine and processes all the frames with hardware acceleration.
Most of the time it's 1 or 2, seldom 3 or 4.
Here's what the CPU/GPU usage is like when processing without throttling in different modes on my machine (Intel Core i5-6500 with HD Graphics 530, Windows 10 Pro).
NULL - CPU: ~90%, GPU: ~15%
HARDWARE - CPU: ~15%, GPU: ~60%
SOFTWARE - CPU: ~40%, GPU: ~7%
I tested the app on three machines. All of them have Intel integrated GPUs (HD 4400, HD 4600, HD 530). One of them also has a switchable Nvidia dedicated GPU (GF 840M). The app behaves identically on all of them; the only difference is that it crashes in a different DLL when Nvidia's GPU is used.
I have no previous experience with COM or DirectX, but all of this is inconsistent and unpredictable, so it looks like memory corruption to me. Still, I don't know where I'm making the mistake. Could you please help me find what I'm doing wrong?
The minimal code example I could come up with is below. I'm using Visual Studio Professional 2015 to compile it as a C++ project. I prepared definitions to enable hardware acceleration and to select the hardware driver; comment them out to change the behavior. Also, the code expects this video file to be present in the project directory.
#include <iostream>
#include <string>
#include <atlbase.h>
#include <d3d11.h>
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <windows.h>
#pragma comment(lib, "d3d11.lib")
#pragma comment(lib, "mf.lib")
#pragma comment(lib, "mfplat.lib")
#pragma comment(lib, "mfreadwrite.lib")
#pragma comment(lib, "mfuuid.lib")
#define ENABLE_HW_ACCELERATION
#define ENABLE_HW_DRIVER
void handle_result(HRESULT hr)
{
    if (SUCCEEDED(hr))
        return;
    WCHAR message[512];
    FormatMessage(FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS, nullptr, hr,
        MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT), message, ARRAYSIZE(message), nullptr);
    printf("%ls", message);
    abort();
}

int main(int argc, char** argv)
{
    handle_result(CoInitializeEx(nullptr, COINIT_APARTMENTTHREADED | COINIT_DISABLE_OLE1DDE));
    handle_result(MFStartup(MF_VERSION));
    {
        CComPtr<IMFAttributes> attributes;
        handle_result(MFCreateAttributes(&attributes, 3));
#if defined(ENABLE_HW_ACCELERATION)
        CComPtr<ID3D11Device> device;
        D3D_FEATURE_LEVEL levels[] = { D3D_FEATURE_LEVEL_11_1, D3D_FEATURE_LEVEL_11_0 };
#if defined(ENABLE_HW_DRIVER)
        handle_result(D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, D3D11_CREATE_DEVICE_SINGLETHREADED | D3D11_CREATE_DEVICE_VIDEO_SUPPORT,
            levels, ARRAYSIZE(levels), D3D11_SDK_VERSION, &device, nullptr, nullptr));
#else
        handle_result(D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_NULL, nullptr, D3D11_CREATE_DEVICE_SINGLETHREADED,
            levels, ARRAYSIZE(levels), D3D11_SDK_VERSION, &device, nullptr, nullptr));
#endif
        UINT token;
        CComPtr<IMFDXGIDeviceManager> manager;
        handle_result(MFCreateDXGIDeviceManager(&token, &manager));
        handle_result(manager->ResetDevice(device, token));
        handle_result(attributes->SetUnknown(MF_SOURCE_READER_D3D_MANAGER, manager));
        handle_result(attributes->SetUINT32(MF_READWRITE_ENABLE_HARDWARE_TRANSFORMS, TRUE));
        handle_result(attributes->SetUINT32(MF_SOURCE_READER_ENABLE_ADVANCED_VIDEO_PROCESSING, TRUE));
#else
        handle_result(attributes->SetUINT32(MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING, TRUE));
#endif
        CComPtr<IMFSourceReader> reader;
        handle_result(MFCreateSourceReaderFromURL(L"Rogue One - A Star Wars Story - Trailer.mp4", attributes, &reader));
        CComPtr<IMFMediaType> output_type;
        handle_result(MFCreateMediaType(&output_type));
        handle_result(output_type->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video));
        handle_result(output_type->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_RGB32));
        handle_result(reader->SetCurrentMediaType(MF_SOURCE_READER_FIRST_VIDEO_STREAM, nullptr, output_type));
        unsigned int frame_count{};
        std::cout << "Started processing frames" << std::endl;
        while (true)
        {
            CComPtr<IMFSample> sample;
            DWORD flags;
            handle_result(reader->ReadSample(MF_SOURCE_READER_FIRST_VIDEO_STREAM,
                0, nullptr, &flags, nullptr, &sample));
            if (flags & MF_SOURCE_READERF_ENDOFSTREAM || sample == nullptr)
                break;
            std::cout << "Frame " << frame_count++ << std::endl;
            CComPtr<IMFMediaBuffer> buffer;
            BYTE* data;
            handle_result(sample->ConvertToContiguousBuffer(&buffer));
            handle_result(buffer->Lock(&data, nullptr, nullptr));
            // Use the frame here.
            buffer->Unlock();
        }
        std::cout << "Finished processing frames" << std::endl;
    }
    MFShutdown();
    CoUninitialize();
    return 0;
}
Your code is conceptually correct, with one remark that isn't quite obvious: the Media Foundation decoder is multithreaded, and you are feeding it a single-threaded version of the Direct3D device. You have to work around this, or you get exactly what you are getting now, access violations and freezes, that is, undefined behavior.
// NOTE: No single threading
handle_result(D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr,
    (0 * D3D11_CREATE_DEVICE_SINGLETHREADED) | D3D11_CREATE_DEVICE_VIDEO_SUPPORT,
    levels, ARRAYSIZE(levels), D3D11_SDK_VERSION, &device, nullptr, nullptr));

// NOTE: Getting ready for multi-threaded operation
const CComQIPtr<ID3D11Multithread> pMultithread = device;
pMultithread->SetMultithreadProtected(TRUE);
Also note that this straightforward code has a performance bottleneck around the lines you added to get a contiguous buffer. Apparently you do need access to the data... but the behavior by design is that the decoded data is already in video memory, so your transfer to system memory is an expensive operation. That is, you added a severe performance hit to the loop. Checking the validity of the data this way is fine, but when it comes to performance benchmarking you should comment that part out.
The output types of the H264 video decoder are listed here: https://msdn.microsoft.com/en-us/library/windows/desktop/dd797815(v=vs.85).aspx
RGB32 is not one of them. In this case your app relies on the Video Processor MFT to convert any of MFVideoFormat_I420, MFVideoFormat_IYUV, MFVideoFormat_NV12, MFVideoFormat_YUY2, or MFVideoFormat_YV12 to RGB32. I suspect it's the Video Processor MFT that acts up and causes your program to misbehave. By setting NV12 as the output subtype for the decoder, you get rid of the Video Processor MFT, and the following lines of code become unnecessary as well:
handle_result(attributes->SetUINT32(MF_SOURCE_READER_ENABLE_ADVANCED_VIDEO_PROCESSING, TRUE));
and
handle_result(attributes->SetUINT32(MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING, TRUE));
Moreover, as you noticed, NV12 is the only format that works properly. I think the reason is that it is the only one used in the hardware-accelerated scenarios by D3D and the DXGI device manager.
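Concretely, the change in your sample is just the subtype line (sketch; your frame-handling code then has to expect NV12 layout instead of RGB32):

// Request the decoder's native output so no Video Processor MFT is inserted
handle_result(output_type->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_NV12));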
I have a task to make a simple profiling tool (for Windows) for performance/timing/event analysis of OpenCL programs. Can someone give me advice on how to start?
The simplest approach, and one that works accurately on all platforms:
cl_event perfEvent;
cl_ulong start = 0, end = 0;
float t_kernel;

/* Enqueue kernel */
clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, &perfEvent);
clWaitForEvents(1, &perfEvent);

/* Get the execution time; the timestamps are in nanoseconds */
clGetEventProfilingInfo(perfEvent, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(perfEvent, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
t_kernel = (end - start) / 1000000.0f; /* ns -> ms */
std::cout << t_kernel << std::endl;
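Note that the timestamps are only available if the queue was created with profiling enabled; a sketch, assuming a context and device already exist (clCreateCommandQueue is the pre-OpenCL-2.0 entry point):

cl_int err = CL_SUCCESS;
/* Without CL_QUEUE_PROFILING_ENABLE, clGetEventProfilingInfo
   returns CL_PROFILING_INFO_NOT_AVAILABLE */
cl_command_queue commandQueue =
    clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);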
Take a look at AMD CodeXL. It's free and it might be just what you are looking for. Inside CodeXL, use the Application Timeline Trace mode (Profile -> Application Timeline Trace), which executes a program and generates a visual timeline that displays OpenCL events such as kernel dispatches and data transfer operations.