Use CPU fallback if OpenCV's CUDA extensions are not available - C++

In my code I'm trying to capitalize on the power of a possibly present CUDA-capable GPU. While this code works well on computers that have CUDA available (and where OpenCV was compiled with CUDA support), I have trouble implementing a fallback to the CPU. Even building fails, since the headers I'm including
#include "opencv2/core/cuda.hpp"
#include "opencv2/cudaimgproc.hpp"
#include "opencv2/cudaarithm.hpp"
are not found. I'm quite a novice regarding C++ program architecture. How would I need to structure my code to support such fallback functionality?

If you are implementing a fallback you probably want to switch to it at runtime. But the fact that you are getting compiler error messages suggests that you are compiling with different flags. In general, you probably want something like this:
if (HasCuda()) {
    RunCudaCode(...);
} else {
    RunCpuCode(...);
}
Alternatively, you could build two shared libraries, one with and one without CUDA, and load the one that you need based on HasCuda(). However, that approach only makes sense if your binary is huge and you're running into memory issues.
It might be necessary to have a similar block in your startup code that initializes CUDA.
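A minimal sketch of how this can look with OpenCV, assuming a C++17 compiler (for __has_include) and OpenCV's usual module layout; the ToGray helper and the APP_HAVE_CUDA macro are made-up names for illustration:

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

#if __has_include(<opencv2/cudaimgproc.hpp>)      // headers only exist in CUDA-enabled builds
  #include <opencv2/core/cuda.hpp>
  #include <opencv2/cudaimgproc.hpp>
  #define APP_HAVE_CUDA 1
#else
  #define APP_HAVE_CUDA 0
#endif

bool HasCuda() {
#if APP_HAVE_CUDA
    return cv::cuda::getCudaEnabledDeviceCount() > 0;   // runtime check: is a device actually present?
#else
    return false;                                       // built without CUDA support
#endif
}

void ToGray(const cv::Mat& src, cv::Mat& dst) {
    if (HasCuda()) {
#if APP_HAVE_CUDA
        cv::cuda::GpuMat gsrc(src), gdst;                // upload, process on the GPU, download
        cv::cuda::cvtColor(gsrc, gdst, cv::COLOR_BGR2GRAY);
        gdst.download(dst);
#endif
    } else {
        cv::GaussianBlur(src, dst, cv::Size(1, 1), 0);   // (no-op example)
        cv::cvtColor(src, dst, cv::COLOR_BGR2GRAY);      // CPU fallback
    }
}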

Related

Robust compatibility check? [closed]

I'm building an application that uses OpenCL GPU acceleration on windows, including OpenCL 2.0+ features.
On my own machine, that has compatible HW and up-to-date drivers, I get no problems running the builds.
However, I've been deploying it to other machines and have been encountering freezes/crashes for various reasons during initialization of my OpenCL kernels/programs/etc.
The other machines have either incompatible HW (no graphics card, or a graphics card not compatible with OCL 2.0+), out-of-date graphics drivers, out-of-date OpenCL drivers, etc. Simply updating them isn't a solution, since they're meant to simulate real-world user environments (i.e., the users I eventually deploy my software to are not guaranteed to have compatible systems).
I already track OpenCL-returned error codes (and stop further initialization once one is returned), but I'm still getting segmentation faults on these machines during initialization of the various OpenCL functions, or they will simply hang during OCL program initialization (in some circumstances, even when no OpenCL error codes are returned prior to running the problem functions).
How can I do a robust compatibility check on a particular machine, prior to running any OpenCL initialization functions?
I know I can query device/driver OpenCL info, but the return values are just vendor-specific strings and it seems like a fool's errand to try anticipating/parsing all possible combinations thereof (and further, it seems there's no guarantee they will even return helpful info at all). Is there a more robust way to query whether OpenCL (and specifically OpenCL 2.0 GPU-device code) can be executed on a particular machine?
There are 2 problems when people try to distribute OpenCL apps.
You want to check whether the client even has OpenCL.
You want to check whether the client has the correct version.
Solving 1 is a bit of a pain in the ass, since OpenCL apps will usually crash outright if there is no OpenCL at all. You can use CLEW, which is basically GLEW for OpenCL. This will allow you to check whether the client has OpenCL or not.
After that, all you have left is the OpenCL device/driver querying functions to check whether the client has the correct version installed.
There are several possible incompatibility issues that you may encounter:
Extensions or Optional Core Features
Core language features are described in the OpenCL Specification, and
all features from core are supposed to work on any system and with any
compiler (provided that it supports particular version of OpenCL).
There is also a set of extensions, which are optional and you need to
check that they are supported on a system.
For example, if you use the double type, you have to check that the
cl_khr_fp64 extension is supported. You can get a list of supported
extensions by calling clGetDeviceInfo(CL_DEVICE_EXTENSIONS).
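A minimal sketch of such a check using the plain OpenCL C API (error handling trimmed for brevity; a real application should check every return code):

#include <CL/cl.h>
#include <string>
#include <iostream>

bool device_supports_fp64(cl_device_id dev) {
    size_t size = 0;
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, 0, nullptr, &size);
    std::string exts(size, '\0');
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, size, &exts[0], nullptr);
    return exts.find("cl_khr_fp64") != std::string::npos;   // space-separated extension list
}

int main() {
    cl_platform_id platform;
    cl_device_id device;
    if (clGetPlatformIDs(1, &platform, nullptr) != CL_SUCCESS) return 1;
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr) != CL_SUCCESS) return 1;
    std::cout << "fp64 supported: " << device_supports_fp64(device) << "\n";
}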
Undefined behavior or any other bug
When a program runs well on your local machine, and crashes/freezes
when you deploy it, this is often an indication of a bug in the
program itself.
This could happen if you (unintentionally) relied on OpenCL driver
implementation details (e.g. how workgroups are ordered, how
work-items are executed). To avoid this, you should strictly follow
the rules of the OpenCL specification, although the specification is
not always perfect.
As an example, if you have the following code:
for (int i = 0; i < N; ++i) {
    if (get_global_id(0) < M) {
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
This code may happily run and give you correct results on your local
machine, but it is incorrect according to the OpenCL specification (you cannot
have a barrier() call in a divergent block), and it will
crash/hang/mismatch on other machines.
Compiler (or driver) bugs
Compilers try hard to optimize your program, but they sometimes fail
to do this correctly, especially in some edge cases. Probably the best
way to detect these kinds of bugs is to write a self-check tool, that
runs a unit test for the key parts of your program and checks the
result against a reference.
For example, if you have an algorithm, say Histogram computation, you
can isolate it from the rest of the program, and make sure you get the
expected results.
If this self-check tool fails, it can give you a clue on what is going
on, and you'll have a good reproducer that you can share with OpenCL
driver developers, so they can fix the problem.
Aside from that, you can apply a workaround based on a vendor id,
device type, driver version, etc. All this information can be queried
from clGetDeviceInfo, but you should not treat it as a stable
interface: names and versions can potentially change for future
releases, so it is hard to follow these changes.
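If you do go down the workaround route, here is a small sketch of querying the identification strings a workaround could key off (the device is obtained as in the earlier sketch; the string formats are vendor-specific):

#include <CL/cl.h>
#include <string>
#include <iostream>

std::string device_string(cl_device_id dev, cl_device_info what) {
    size_t size = 0;
    clGetDeviceInfo(dev, what, 0, nullptr, &size);
    std::string s(size, '\0');
    clGetDeviceInfo(dev, what, size, &s[0], nullptr);
    if (!s.empty() && s.back() == '\0') s.pop_back();   // drop the trailing NUL
    return s;
}

void print_device_ids(cl_device_id dev) {
    std::cout << "vendor : " << device_string(dev, CL_DEVICE_VENDOR)  << "\n"
              << "driver : " << device_string(dev, CL_DRIVER_VERSION) << "\n"
              << "version: " << device_string(dev, CL_DEVICE_VERSION) << "\n";
}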

List opengl extensions USED in runtime

How can I get a list of the OpenGL extensions that I am using in my program, at runtime, in C++? To clarify: I don't want the extensions available on my platform, nor the extensions I can execute; I want a list of the extensions I am actually using in my code. This is to check whether these extensions are available before execution starts. I have been looking at GLEW, but GLEW is for
determining which OpenGL extensions are supported on the target platform.
What I want is a way to find out which extensions I am using. If anyone knows a non-runtime way, please tell me, because that might be useful too.
Also, I want to know how to determine the minimum OpenGL version to use.
Actually, the standard way to handle extensions is in situ: test for availability before trying to call them.
if (GLEW_ARB_vertex_program)
{
    ... // Do something with ARB_vertex_program
}
More complex OpenGL programs will test the availability of several extensions and make decisions based on that:
if (GLEW_ARB_vertex_array_object && other stuff)
{
    renderingTechnic = VERTEX;
}
else
{
    renderingTechnic = BEGIN_END;
}
... etc
If you want the list of actually used extensions, it becomes tricky.
You have to detect cases like this:
if (GLEW_ARB_vertex_program && 0) // YOU SHALL NOT PASS
{
    ... // Do something with ARB_vertex_program
}
In this case, you will never enter the then-branch, but your tool may have difficulty detecting that.
Code coverage tools exist to perform this kind of job, but embedding one to perform the comparison against the available extensions at runtime is not an option here.
Looking at the symbol table of your output is not a solution either:
The one for your exe will contain none of the OpenGL functions.
The one for the OpenGL library will contain every possible function, whether available or used or not.
If your codebase is anything but humongous, you can simply do this with a bit of searching for "GL_EXT_", "GL_ARB_", etc. and manual inspection, plus a dash of discipline afterwards to document all the extensions you use.
For minimum version; again, this is something pretty basic that you need to know already when writing the code. But here I think using GLEW can help you. If memory serves, GLEW uses preprocessor macros that you can define to set the version of OpenGL standard you are expecting. You start by setting this to a low value (e.g. 1.1) and see if your code compiles and runs. If it does, that's probably the minimum version you need. If it doesn't, you increment the version and try again.
The usual approach is to decide which extensions you're going to use before you're starting the actual coding. And if you change it later on, you immediately put the availability tests somewhere close to initialization so that you can either terminate with a runtime error message or fall back to an alternate code path.
Of course the preferred way is to not use extensions at all and stick to a plain OpenGL version profile. Yes, anything after OpenGL-1.2 is loaded through the extension mechanism, but that doesn't make it extensions. So if you know you absolutely need OpenGL-3.0 for your program to work, but nothing else then you can just test for that and be done.
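A minimal sketch of the availability-test-at-initialization approach, assuming GLEW and that a GL context has already been created (the version and extension tested here are just examples):

#include <GL/glew.h>
#include <cstdio>
#include <cstdlib>

void checkGlRequirements() {
    // Call after context creation and a successful glewInit().
    if (!GLEW_VERSION_3_0) {
        std::fprintf(stderr, "OpenGL 3.0 is required but not available.\n");
        std::exit(EXIT_FAILURE);                 // terminate with a runtime error message
    }
    if (!GLEW_ARB_vertex_array_object) {
        // ...fall back to an alternate code path instead of aborting
    }
}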

C++ Windows Compiler for smallest executables

I want to start programming with C++. I have written some programs in VB6 and VB.NET, and now I want to gain knowledge in C++. What I want is a compiler that can compile my code to the smallest possible Windows application. For example, there is a BASIC compiler called PureBasic that can make a standalone Hello World app of 5 KB, and a simple socket program I compiled with it was only 12 KB (without any DLLs or runtime files). I find that amazing, so I want something like it for C++.
If I am wrong and there is no such Windows compiler, can someone point me to a website or book that teaches how to reduce C++ executable size, or how to use Windows API calls?
Taking the Microsoft Visual C++ compiler as an example, if you turn off linking to the C runtime (/NODEFAULTLIB) your executable can be as small as 5 KB.
There's a little problem though: you won't be able to use almost anything from the standard C or C++ libraries, nor standard features of C++ like exception handling, new and delete operators, floating point arithmetics, and more. You'll need to use only the features directly provided by WinAPI (e.g. create files with CreateFile, allocate memory with HeapAlloc, etc...).
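A minimal sketch of what such a CRT-free program can look like (assumes MSVC; link with /NODEFAULTLIB /SUBSYSTEM:CONSOLE /ENTRY:EntryPoint and kernel32.lib; the entry point name is arbitrary):

#include <windows.h>

extern "C" void EntryPoint() {
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written = 0;
    WriteFile(out, "Hello, world\r\n", 14, &written, nullptr);
    ExitProcess(0);   // no CRT, so we must end the process ourselves
}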
It's also worth noting that while it's possible to create small executables with C++ using these methods, you may not be using most of C++'s features at this point. In fact, typical C++ code has significant bloat due to heavy use of templates, polymorphism that prevents dead-code elimination, and stack-unwinding tables used for exception handling. You may be better off using something like C for this purpose.
I had to do this many years ago with VC6. It was necessary because the executable was going to be transmitted over the wire to a target computer, where it would run. Since it was likely to be sent over a modem connection, it needed to be as small as possible. To shrink the executable, I relied on two techniques:
Do not use the C or C++ runtime. Tell the compiler not to link them in. Implement all necessary functionality using a subset of the Windows API that was guaranteed to be available on all versions of Windows at the time (98, Me, NT, 2000).
Tell the linker to combine all code and data segments into one. I don't remember the switches for this and I don't know if it's still possible, especially with 64-bit executables.
The final executable size: ~2K
Reduction of the executable size for the code below from 24 KB to 1.6 KB in Visual C++:
int main() {
    return 0;
}
Linker Switches (although the safe alignment is recommended to be 512):
/FILEALIGN:16
/ALIGN:16
Link with (in the VC++ project properties):
LIBCTINY.LIB
Additional pragmas (these address Feruccio's suggestion).
However, I still see a section of ASCII(0) making up a third of the executable, and the "Rich" Windows signature. (I'm reading that the latter is not really needed for program execution.)
#ifdef NDEBUG
#pragma optimize("gsy",on)
#pragma comment(linker,"/merge:.rdata=.data")
#pragma comment(linker,"/merge:.text=.data")
#pragma comment(linker,"/merge:.reloc=.data")
#pragma comment(linker,"/OPT:NOWIN98")
#endif // NDEBUG
int main() {
    return 0;
}
I don't know why you are interested in this kind of optimization before learning the language, but anyways...
It doesn't make much difference which compiler you use, but rather how you use it. Choose a compiler such as Visual C++ or MinGW, for example, and read its documentation. You will find information on how to optimize the compilation for size or for performance (usually when you optimize for size, you lose performance, and vice versa).
In Visual Studio, for example, you can minimize the size of the executable by passing the /O1 parameter to the compiler (or Project Properties/ C-C++ /Optimization).
Also don't forget to compile in "release" mode, or your executable may be full of debugging symbols, which will increase the size of your executable.
A modern desktop PC running Windows has at least 1 GB of RAM and a huge hard drive; worrying about the size of a trivial program that is not representative of any real application is pointless.
Much of the size of a "Hello world" program in any language is fixed overhead to do with establishing an execution environment and loading and starting the code. For any non-trivial application you should be more concerned with the rate at which the code size increases as more functionality is added. And in that sense it is likely that C++ code in any compiler is pretty efficient. That is to say, your PureBasic program that does little or nothing may be smaller than an equivalent C++ program, but that is not necessarily the case by the time you have built useful functionality into the code.
@user: C++ does produce small object code, however if the code for printf() (or cout <<) is statically linked, the resulting executable may be rather larger, because printf() has a lot of functionality that is not used in a "hello world" program and so is redundant. Try using puts() for example and you may find the code is smaller.
Moreover are you sure that you are comparing apples with apples? Some execution environments rely on a dynamically linked runtime library or virtual machine that is providing functionality that might be statically linked in a C++ program.
I don't like to reply to a dead post, but since none of the responses mentions this (except Mat's response)...
Repeat after me: C++ != ( vb6 || vb.net || basic ). And I'm not only talking about syntax: C++ coding style is typically different from the VB one, as C++ programmers usually try to make things better designed than VB programmers...
P.S.: No, there is no place for copy-paste in C++ world. Sorry, had to say this...

A boot loader in C++

I have messed around a few times making a small assembly boot loader on a floppy disk, and I was wondering whether it's possible to make a boot loader in C++, and if so, where I might begin. For all I know, I'm not sure it would even use int main().
Thanks for any help.
If you're writing a boot loader, you're essentially starting from nothing: a small chunk of code is loaded into memory, and executed. You can write the majority of your boot loader in C++, but you will need to bootstrap your own C++ runtime environment first.
Assembly is really the only option for the first stage, as you need to set up a sensible environment for running anything higher-level. Doing enough to run C code is fairly straightforward -- you need:
code and data loaded in the right place;
there may be an additional part of the data area which must be zero-initialised;
you need to point the stack pointer at a suitable area of memory for the stack.
Then you can jump into the code at an appropriate point (e.g. main()) and expect that the basic language features will work. (It's possible that any features of the standard library that may have been implemented or linked in might require additional initialisation at this stage.)
Getting a suitable environment going for C++ requires more effort, as it needs more initialisation here, and also has core language features which require runtime support (again, this is before considering library features). These include:
running static constructors;
memory allocation to support new and delete;
support for run-time type information (RTTI);
support for exceptions;
probably some other things I've forgotten to mention.
None of these are required until the C environment is up and running, so the code that handles these can be written in C rather than assembler (or even in a subset of C++ that does not make use of the above features).
(The same principles apply in embedded systems, and it's not uncommon for such systems to make use of C++, but only in a limited way -- e.g. no exceptions and/or RTTI because the runtime support isn't implemented.)
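As an illustration of the "running static constructors" point above, here is a minimal sketch of how a bare-metal C++ environment typically does it by hand (assumes a GNU toolchain; the __init_array symbols come from your own linker script, and boot_entry/kernel_main are made-up names):

typedef void (*ctor_fn)();
extern "C" ctor_fn __init_array_start[], __init_array_end[];   // bounds of the constructor table

extern "C" void kernel_main();       // hypothetical high-level entry point

extern "C" void boot_entry() {       // jumped to from the assembly stub
    for (ctor_fn* f = __init_array_start; f != __init_array_end; ++f)
        (*f)();                      // run global/static constructors
    kernel_main();
}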
It's been a while since I played with writing bootloaders, so I'm going off memory.
For an x86 bootloader, you need to have a C++ compiler that can emit x86 assembly, or, at the very least, you need to write your own preamble in 16-bit assembly that will put the CPU into 32-bit protected (or 64-bit long) mode, before you can call your C++ functions.
Once you've done that, though, you should be able to make use of most, if not all, of C++'s language features, so long as you stay away from things that require an underlying libc. But statically link everything without the CRT and you're golden.
Bootloaders don't have "int main()"s, unless you write assembly code to call it.
If you are writing a stage 1 bootloader, then doing it in C++ is seriously discouraged.
Otherwise, osdev.org has great documentation on the topic.
While it is probably possible to make a bootloader in C++, remember not to link your code to any dynamic libraries, and remember that just because it is C++, that doesn't mean you can/should use the STL, etc.
Yes, it is possible. You can find elements of an answer and useful links in this question.
You can also have a look here; there is a C++ bootloader example.
The main thing to understand is that you need to create a flat binary instead of the usual fancy executable file formats (PE on Windows, or ELF on Unixes), because those file formats need an OS to load them, and in a boot loader you don't have an OS yet.
Using libraries is not a problem if you link statically (no dynamic linking, again because of the executable-format problem above). But obviously none of the OS API entry points are available...

Compile and optimize for different target architectures

Summary: I want to take advantage of compiler optimizations and processor instruction sets, but still have a portable application (running on different processors). Normally I could indeed compile 5 times and let the user choose the right one to run.
My question is: how can I automate this, so that the processor is detected at runtime and the right executable is executed without the user having to choose it?
I have an application with a lot of low level math calculations. These calculations will typically run for a long time.
I would like to take advantage of as much optimization as possible, preferably also of (not always supported) instruction sets. On the other hand I would like my application to be portable and easy to use (so I would not like to compile 5 different versions and let the user choose).
Is there a possibility to compile 5 different versions of my code and run dynamically the most optimized version that's possible at execution time? With 5 different versions I mean with different instruction sets and different optimizations for processors.
I don't care about the size of the application.
At this moment I'm using gcc on Linux (my code is in C++), but I'm also interested in this for the Intel compiler and for the MinGW compiler for compilation to Windows.
The executable doesn't have to be able to run on different OSes, but ideally it would also be possible to automatically select between 32-bit and 64-bit.
Edit: Please give clear pointers how to do it, preferably with small code examples or links to explanations. From my point of view I need a super generic solution, which is applicable on any random C++ project I have later.
Edit: I assigned the bounty to ShuggyCoUk; he had a great number of pointers to look out for. I would have liked to split it between multiple answers, but that is not possible. I haven't implemented this yet, so the question is still 'open'! Please still add and/or improve answers, even though there is no bounty to be given anymore.
Thanks everybody!
Yes, it's possible. Compile all your differently optimised versions as different dynamic libraries with a common entry point, and provide an executable stub that loads and runs the correct library at run-time, via the entry point, depending on a config file or other information.
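A minimal sketch of such a dispatcher stub on Linux (link with -ldl; the library names and the exported run() entry point are made-up, and __builtin_cpu_supports is a GCC/Clang builtin used here just as an example of the detection step):

#include <dlfcn.h>
#include <cstdio>

int main(int argc, char** argv) {
    // Pick the best variant; a real stub might also read a config file.
    const char* lib = __builtin_cpu_supports("avx") ? "./libmath_avx.so"
                                                    : "./libmath_sse2.so";
    void* handle = dlopen(lib, RTLD_NOW);
    if (!handle) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    using entry_fn = int (*)(int, char**);
    auto run = reinterpret_cast<entry_fn>(dlsym(handle, "run"));
    if (!run) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    int rc = run(argc, argv);    // hand control to the optimised library
    dlclose(handle);
    return rc;
}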
Can you use script?
You could detect the CPU using script, and dynamically load the executable that is most optimized for architecture. It can choose 32/64 bit versions too.
If you are using Linux you can query the CPU with
cat /proc/cpuinfo
You could probably do this with a bash/perl/python script or windows scripting host on windows. You probably don't want to force the user to install a script engine. One that works on the OS out of the box IMHO would be best.
In fact, on windows you probably would want to write a small C# app so you can more easily query the architecture. The C# app could just spawn whatever executable is fastest.
Alternatively you could put your different versions of code in a dll's or shared object's, then dynamically load them based on the detected architecture. As long as they have the same call signature it should work.
If you wish this to work cleanly on Windows and, on 64-bit capable platforms, to take full advantage of the additional (1) address space and (2) registers (the latter likely of more use to you), you must have, at a minimum, a separate process for the 64-bit versions.
You can achieve this by having a separate executable with the relevant PE64 header. Simply using CreateProcess will launch it with the relevant bitness (unless the executable launched is in some redirected location, there is no need to worry about WoW64 folder redirection).
Given this limitation on windows it is likely that simply 'chaining along' to the relevant executable will be the simplest option for all different options, as well as making testing an individual one simpler.
It also means your 'main' executable is free to be totally separate depending on the target operating system (as detecting the CPU/OS capabilities is, by its nature, very OS specific), and you can then do most of the rest of your code as shared objects/DLLs.
Also you can 'share' the same files for two different architectures if you currently do not feel that there is any point using the differing capabilities.
I would suggest that the main executable is capable of being forced into making a specific choice so you can see what happens with 'lesser' versions on a more capable machine (or what errors come up if you try something different).
Other possibilities given this model are:
Statically linking to different versions of the standard runtimes (for ones with/without thread safety) and using them appropriately if you are running without any SMP/SMT capabilities.
Detect if multiple cores are present and whether they are real or hyper-threading (also whether the OS knows how to schedule effectively in those cases)
Checking the performance of things like the system timer/high-performance timers and using code optimized for that behaviour, say if you do anything where you wait for a certain amount of time to expire and thus need to know your best possible granularity.
Optimizing your choice of code based on cache sizes/other load on the box. If you are using unrolled loops, then more aggressive unrolling options may depend on having a certain amount of level 1/2 cache.
Compiling conditionally to use doubles/floats depending on the architecture. Less important on Intel hardware, but if you are targeting certain ARM CPUs, some have actual floating-point hardware support and others require emulation. The optimal code would change heavily, even to the extent that you just use conditional compilation rather than relying on the optimizing compiler (1).
Making use of co-processor hardware like CUDA capable graphics cards.
Detect virtualization and alter behaviour (perhaps trying to avoid file system writes)
As to doing this check, you have a few options, the most useful one on Intel being the cpuid instruction.
Windows
Use someone else's implementation but you'll have to pay
Use a free open source one
Linux
Use the built in one
You could also look at open source software doing the same thing
Pixman does a fair amount of this and has a permissive licence.
Alternatively re-implement/update an existing one using available documentation on the features you need.
Quite a lot of separate documents to work out how to detect things:
Intel:
SSE 4.1/4.2
SSE3
MMX
A large part of what you would be paying for in the CPU-Z library is someone doing all this (and the nasty little issues involved) for you.
(1) Be careful with this - it is hard to beat decent optimizing compilers at this.
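A minimal sketch of the cpuid-based check mentioned above (x86 with GCC/Clang headers; MSVC would use the __cpuid intrinsic instead, and a complete AVX check would also verify OS support via XGETBV):

#include <cpuid.h>
#include <cstdio>

int main() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 1;   // leaf 1: feature flags
    bool sse42 = ecx & (1u << 20);   // ECX bit 20: SSE4.2
    bool avx   = ecx & (1u << 28);   // ECX bit 28: AVX
    std::printf("SSE4.2=%d AVX=%d\n", sse42 ? 1 : 0, avx ? 1 : 0);
}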
Have a look at liboil: http://liboil.freedesktop.org/wiki/ . It can dynamically select implementations of multimedia-related computations at run-time. You may find you can use liboil itself, and not just its techniques.
Since you mention you are using GCC, I'll assume your code is in C (or C++).
Neil Butterworth already suggested making separate dynamic libraries, but that requires some non-trivial cross-platform considerations (manually loading dynamic libraries is different on Linux, Windows, OSX, etc., and getting it right will likely take some time).
A cheap solution is to simply write all of your variants using unique names, and use a function pointer to select the proper one at runtime.
I suspect the extra dereference caused by the function pointer will be amortized by the actual work you are doing (but you'll want to confirm that).
Also, getting different compiler optimizations will likely require different .c/.cpp files, as well as some twiddling of your build tool. But it's probably less overall work than separate libraries (which needed this already in one form or another).
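A minimal sketch of the function-pointer approach (the variant names are made up; in a real build each variant would live in its own .cpp compiled with different optimization flags, e.g. -mavx, and __builtin_cpu_supports is GCC/Clang specific):

#include <cstddef>
#include <cstdio>

static void scale_scalar(float* d, std::size_t n) { for (std::size_t i = 0; i < n; ++i) d[i] *= 2.0f; }
static void scale_avx(float* d, std::size_t n)    { for (std::size_t i = 0; i < n; ++i) d[i] *= 2.0f; }  // AVX body in real code

using scale_fn = void (*)(float*, std::size_t);

static scale_fn pick_scale() {
    return __builtin_cpu_supports("avx") ? scale_avx : scale_scalar;   // choose once at startup
}

int main() {
    float buf[4] = {1, 2, 3, 4};
    static const scale_fn scale = pick_scale();
    scale(buf, 4);
    std::printf("%g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
}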
Since you didn't specify whether you have limits on the number of files, I propose another solution: compile 5 executables, and then create a sixth executable that launches the appropriate binary. Here is some pseudocode, for Linux
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

const char* determine_name_of_specific_version(void);   /* as in the original pseudocode */

int main(int argc, char* argv[])
{
    char target_path[PATH_MAX];
    char** new_argv;
    const char* specific_version = determine_name_of_specific_version();

    strcpy(target_path, "/usr/lib/myapp/versions/");
    strcat(target_path, specific_version);

    /* copy argv and append a terminating NULL */
    new_argv = malloc(sizeof(char*) * (argc + 1));
    memcpy(new_argv, argv, argc * sizeof(char*));
    new_argv[argc] = NULL;

    /* optionally set new_argv[0] to target_path */
    execv(target_path, new_argv);
    return 1;   /* execv only returns on failure */
}
On the plus side, this approach allows you to provide the user transparently with both 32-bit and 64-bit binaries, unlike any of the library methods that have been proposed. On the minus side, there is no execv in Win32 (but there is a good emulation in Cygwin); on Windows, you have to create a new process rather than re-execing the current one.
Lets break the problem down to its two constituent parts. 1) Creating platform dependent optimized code and 2) building on multiple platforms.
The first problem is pretty straightforward. Encapsulate the platform dependent code in a set of functions. Create a different implementation of each function for each platform. Put each implementation in its own file or set of files. It's easiest for the build system if you put each platform's code in a separate directory.
For part two I suggest you look at GNU Autotools (Automake, Autoconf, and Libtool). If you've ever downloaded and built a GNU program from source code, you know you have to run ./configure before running make. The purpose of the configure script is to 1) verify that your system has all of the required libraries and utilities needed to build and run the program, and 2) customize the Makefiles for the target platform. Autotools is the set of utilities for generating the configure script.
Using Autoconf, you can create little macros to check that the machine supports all of the CPU instructions your platform-dependent code needs. In most cases the macros already exist; you just have to copy them into your Autoconf script. Then Automake and Autoconf can set up the Makefiles to pull in the appropriate implementation.
All this is a bit much to turn into an example here. It takes a little time to learn, but the documentation is all out there; there is even a free book available online. And the process is applicable to your future projects. For multi-platform support, this is really the most robust and easiest way to go, I think. A lot of the suggestions posted in other answers are things that Autotools deals with (CPU detection, static and shared library support) without you having to think about it too much. The only wrinkle you might have to deal with is finding out whether Autotools are available for MinGW. I know they are part of Cygwin if you can go that route instead.
You mentioned the Intel compiler. That is funny, because it can do something like this by default. However, there is a catch. The Intel compiler didn't insert checks for the appropriate SSE functionality. Instead, it checked whether you had a particular Intel chip. There would still be a slow default case. As a result, AMD CPUs would not get suitable SSE-optimized versions. There are hacks floating around that will replace the Intel check with a proper SSE check.
The 32/64-bit difference will require two executables. Both the ELF and PE formats store this information in the executable's header. It's not too hard to start the 32-bit version by default, check whether you are on a 64-bit system, and then restart the 64-bit version. But it may be easier to create an appropriate symlink at installation time.