Robust compatibility check? [closed]

Robust compatibility check? [closed] - c++

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 4 years ago.
Improve this question
I'm building an application that uses OpenCL GPU acceleration on windows, including OpenCL 2.0+ features.
On my own machine, that has compatible HW and up-to-date drivers, I get no problems running the builds.
However, I've been deploying it to other machines and have been encountering freezes/crashing for various reasons during initialization of my OpenCL kernels/programs/etc.
The other machines have either incompatible HW (no gfx card or gfx card not compatible with OCL2.0+), out-of-date GFX drivers, out-of-date OpenCL drivers, etc. Simply updating them isn't a solution since they're meant to simulate real-world user environments (ie, the users I eventually deploy my software to are not guaranteed to have compatible systems).
I already track OpenCL-returned error codes (and stop further initialization once one is returned), but I'm still getting segmentation faults on these machines during initialization of the various OpenCL functions, or they will simply hang during OCL program initialization (in some circumstances, even when no OpenCL error codes are returned prior to running the problem functions).
How can I do a robust compatibility check on a particular machine, prior to running any OpenCL initialization functions?
I know I can query device/driver OpenCL info, but the return values are just vendor-specific strings and it seems like a fool's errand to try anticipating/parsing all possible combinations thereof (and further, it seems there's no guarantee they will even return helpful info at all). Is there a more robust way to query whether OpenCL (and specifically OpenCL 2.0 GPU-device code) can be executed on a particular machine?

There are 2 problems when people try to distribute OpenCL apps.
You want to check whether the client even has OpenCL.
You want to check whether the client has the correct version.
Solving 1 is a little pain in the ass since OpenCL apps would usually crash if there is no OpenCL. You can use CLEW which is basically glew for opencl. This will allow you to check if client has opencl or not.
After that all you have left is OpenCL device/driver querying functions to check if client has the correct version installed.

There are several possible incompatibility issues that you may encounter:
Extensions or Optional Core Features
Core language features are described in the OpenCL Specification, and
all features from core are supposed to work on any system and with any
compiler (provided that it supports particular version of OpenCL).
There is also a set of extensions, which are optional and you need to
check that they are supported on a system.
For example, if you use double type, you have to check that
cl_khr_fp64 extension is supported. You can get a list of supported
extensions by caling clGetDeviceInfo(CL_DEVICE_EXTENSIONS)
Undefined behavior or any other bug
When a program runs well on your local machine, and crashes/freezes
when you deploy it, this is often an indication of a bug in the
program itself.
This could happen if you (unintentionally) relied on an OpenCL driver
implementation details (e.g. how workgroups are ordered, how
work-items are executed). To avoid this, you should strictly follow
the rules of the OpenCL specification, although the specification is
not always perfect.
As an example, if you have the following code:
for (int i = 0; i < N; ++i) {
if (get_global_id() < M) {
barrier();
}
}
This code may happily run and give you correct results on your local
machine, but it is incorrect according to the OpenCL specification (you cannot
have a barrier() call in a divergent block), and it will
crash/hang/mismatch on other machine.
Compiler (or driver) bugs
Compilers try hard to optimize your program, but they sometimes fail
to do this correctly, especially in some edge cases. Probably the best
way to detect these kinds of bugs is to write a self-check tool, that
runs a unit test for the key parts of your program and checks the
result aginst a reference.
For example, if you have an algorithm, say Histogram computation, you
can isolate it from the rest of the program, and make sure you get the
expected results.
If this self-check tool fails, it can give you a clue on what is going
on, and you'll have a good reproducer that you can share with OpenCL
driver developers, so they can fix the problem.
Aside from that, you can apply a workaround based on a vendor id,
device type, driver version, etc. All this information can be queried
from clGetDeviceInfo, but you should not treat it as a stable
interface: names and versions can potentially change for future
releases, so it is hard to follow these changes.

Related

Is it possible to generate a full memory dump on a release build at will?

I'm trying to debug a particularly nasty state-based issue that has no known repro, and has only been seen in the wild on other people's machines. The issue is very specific and hard to detect (basically a 3D physics engine breaks at a random time, but doesn't actually harm program stability)
I would like to be able to take a full memory snapshot of the application at the users discretion, like a crashdump but with as much of the program's state as possible. But I'm not sure if such a thing is even possible in a release build for a C++ program. (And even if it was, a lot of data will be obfuscated...)
Is there a way to generate a full memory dump?
What other alternatives do I have?
Platform details:
Windows and Linux, Microsofts default compiler for VS2013

Use CPU fallback if OpenCV's Cuda extensions are not available

In my code I'm trying to capitalize the power of a possibly present cuda capable GPU. While this code works well on computers that have cuda available (and where OpenCV was compiled with cuda support), I have troubles implementing a fallback to CPU. Even building fails, since the imports I'm using
#include "opencv2/core/cuda.hpp"
#include "opencv2/cudaimgproc.hpp"
#include "opencv2/cudaarithm.hpp"
are not found. I'm quite a novice regarding C++ program architecture. How would I need to model my code to support such a fallback functionality?

If you are implementing a fallback you probably want to switch to it at runtime. But the fact that you are getting compiler error messages suggests that you are compiling with different flags. In general, you probably want something like this:
if (HasCuda()) {
RunCudaCode(...);
} else {
RunCpuCode(...);
}
Alternatively, you could build two shared libraries one with and one without Cuda and load the one that you need based on HasCuda(). However, that approach only makes sense if your binary is huge and you're running into memory issues.
It might be necessary to have a similar block in your startup code that initializes Cuda.

Kernel mode programming using simplistic c++?

I am about to delve into kernel land. My question relates to the programming language. I have seen most tutorials to be written in C. I currently program in C++ and Assembly. I also studied C before C++, but I didn't use it a lot. Would it be possible to program in kernel mode using simplistic C++without using any advanced constructs? Basically I am trying to avoid the minor differences that exist between the two languages(like no bool in C, no automatic returning of 0 from main, really minor differences). I won't be using templates, classes and the like. So would it be possible to program in kernel mode using simplistic C++ without any major annoyances?

Even if not officially supported, you can use C++ as the development language for Windows kernel development.
You should be aware of the following things :
you MUST define the new and delete operator to map to ExAllocatePoolWithTag and ExFreePool.
try to avoid virtual functions. It seems not possible to control the location of the vtable of the object and this may have unexpected results if it is in a pageable portion and you code is called with IRQL >= DISPATCH_LEVEL.
if you still need to use virtual methods table than lock .rdata segment before using it on IRQL >= DISPATCH_LEVEL.
Apart from these kinds of limitations, you can use C++ for your driver development.

Add two links if you want to do C++ in WDK. It's a one time setup effort.
The NT Insider:Guest Article: C++ in an NT Driver
The NT Insider:Global Relief Effort - C++ Runtime Support for the NT DDK
Have seen kernel codes use lots of auto-locks/smart-pointers; although they make the code neat, I feel it has a learning curve for beginner to fully understand, and if abused, lots of construct/destruct codes slow things down.

If you write your code carefully, knowing what exactly stands behind each definition, operator, call, etc, then there should be no problem writing kernel code in C++. The Microsoft document mentioned in the comments above is a good reading precisely because it describes situations in which C++ isn't as transparent as C or doesn't provide similar important guarantees and from that you know what to avoid.

Microsoft has written a guide. Basically they tell us to steer clear of anything but using C++'s relaxed rules of variable declarations...sigh. Anything else and you're on your own. Anyway it can't be all that bad but here are some examples of what you need to remember:
Memory allocated in the paged pool can get paged out. If you try to access it when IRQL is above PASSIVE_LEVEL you're screwed (or at least you will be every once in a while when your customer complains about your driver BSODding their system)! Test your driver on a low memory system under load!
The non-paged pool is limited, you most likely cannot allocate all your needs from it.
Stack is much smaller than in user mode ~12-24K.
Anything you do involving floating point path in the kernel must be protected by KeSaveFloatingPointState and KeRestoreFloatingPointState
C++ exceptions: No
Read the guide for more. Now if you can make sure that the generated code follows the rules, go ahead and use C++.

Edit and Continue on GDB

I know that E&C is a controversial subject and some say that it encourages a wrong approach to debugging, but still - I think we can agree that there are numerous cases when it is clearly useful - experimenting with different values of some constants, redesigning GUI parameters on-the-fly to find a good look... You name it.
My question is: Are we ever going to have E&C on GDB? I understand that it is a platform-specific feature and needs some serious cooperation with the compiler, the debugger and the OS (MSVC has this one easy as the compiler and debugger always come in one package), but... It still should be doable. I've even heard something about Apple having it implemented in their version of GCC [citation needed]. And I'd say it is indeed feasible.
Knowing all the hype about MSVC's E&C (my experience says it's the first thing MSVC users mention when asked "why not switch to Eclipse and gcc/gdb"), I'm seriously surprised that after quite some years GCC/GDB still doesn't have such feature. Are there any good reasons for that? Is someone working on it as we speak?

It is a surprisingly non-trivial amount of work, encompassing many design decisions and feature tradeoffs. Consider: you are debugging. The debugee is suspended. Its image in memory contains the object code of the source, and the binary layout of objects, the heap, the stacks. The debugger is inspecting its memory image. It has loaded debug information about the symbols, types, address mappings, pc (ip) to source correspondences. It displays the call stack, data values.
Now you want to allow a particular set of possible edits to the code and/or data, without stopping the debuggee and restarting. The simplest might be to change one line of code to another. Perhaps you recompile that file or just that function or just that line. Now you have to patch the debuggee image to execute that new line of code the next time you step over it or otherwise run through it. How does that work under the hood? What happens if the code is larger than the line of code it replaced? How does it interact with compiler optimizations? Perhaps you can only do this on a specially compiled for EnC debugging target. Perhaps you will constrain possible sites it is legal to EnC. Consider: what happens if you edit a line of code in a function suspended down in the call stack. When the code returns there does it run the original version of the function or the version with your line changed? If the original version, where does that source come from?
Can you add or remove locals? What does that do to the call stack of suspended frames? Of the current function?
Can you change function signatures? Add fields to / remove fields from objects? What about existing instances? What about pending destructors or finalizers? Etc.
There are many, many functionality details to attend to to make any kind of usuable EnC work. Then there are many cross-tools integration issues necessary to provide the infrastructure to power EnC. In particular, it helps to have some kind of repository of debug information that can make available the before- and after-edit debug information and object code to the debugger. For C++, the incrementally updatable debug information in PDBs helps. Incremental linking may help too.
Looking from the MS ecosystem over into the GCC ecosystem, it is easy to imagine the complexity and integration issues across GDB/GCC/binutils, the myriad of targets, some needed EnC specific target abstractions, and the "nice to have but inessential" nature of EnC, are why it has not appeared yet in GDB/GCC.
Happy hacking!
(p.s. It is instructive and inspiring to look at what the Smalltalk-80 interactive programming environment could do. In St80 there was no concept of "restart" -- the image and its object memory were always live, if you edited any aspect of a class you still had to keep running. In such environments object versioning was not a hypothetical.)

I'm not familiar with MSVC's E&C, but GDB has some of the things you've mentioned:
http://sourceware.org/gdb/current/onlinedocs/gdb/Altering.html#Altering
17. Altering Execution
Once you think you have found an error in your program, you might want to find out for certain whether correcting the apparent error would lead to correct results in the rest of the run. You can find the answer by experiment, using the gdb features for altering execution of the program.
For example, you can store new values into variables or memory locations, give your program a signal, restart it at a different address, or even return prematurely from a function.
Assignment: Assignment to variables
Jumping: Continuing at a different address
Signaling: Giving your program a signal
Returning: Returning from a function
Calling: Calling your program's functions
Patching: Patching your program
Compiling and Injecting Code: Compiling and injecting code in GDB

This is a pretty good reference to the old Apple implementation of "fix and continue". It also references other working implementations.
http://sources.redhat.com/ml/gdb/2003-06/msg00500.html
Here is a snippet:
Fix and continue is a feature implemented by many other debuggers,
which we added to our gdb for this release. Sun Workshop, SGI ProDev
WorkShop, Microsoft's Visual Studio, HP's wdb, and Sun's Hotspot Java
VM all provide this feature in one way or another. I based our
implementation on the HP wdb Fix and Continue feature, which they
added a few years back. Although my final implementation follows the
general outlines of the approach they took, there is almost no shared
code between them. Some of this is because of the architectual
differences (both the processor and the ABI), but even more of it is
due to implementation design differences.
Note that this capability may have been removed in a later version of their toolchain.
UPDATE: Dec-21-2012
There is a GDB Roadmap PDF presentation that includes a slide describing "Fix and Continue" among other bullet points. The presentation is dated July-9-2012 so maybe there is hope to have this added at some point. The presentation was part of the GNU Tools Cauldron 2012.
Also, I get it that adding E&C to GDB or anywhere in Linux land is a tough chore with all the different components.
But I don't see E&C as controversial. I remember using it in VB5 and VB6 and it was probably there before that. Also it's been in Office VBA since way back. And it's been in Visual Studio since VS2005. VS2003 was the only one that didn't have it and I remember devs howling about it. They intended to add it back anyway and they did with VS2005 and it's been there since. It works with C#, VB, and also C and C++. It's been in MS core tools for 20+ years, almost continuous (counting VB when it was standalone), and subtracting VS2003. But you could still say they had it in Office VBA during the VS2003 period ;)
And Jetbrains recently added it too their C# tool Rider. They bragged about it (rightly so imo) in their Rider blog.

Compile and optimize for different target architectures

Summary: I want to take advantage of compiler optimizations and processor instruction sets, but still have a portable application (running on different processors). Normally I could indeed compile 5 times and let the user choose the right one to run.
My question is: how can I can automate this, so that the processor is detected at runtime and the right executable is executed without the user having to chose it?
I have an application with a lot of low level math calculations. These calculations will typically run for a long time.
I would like to take advantage of as much optimization as possible, preferably also of (not always supported) instruction sets. On the other hand I would like my application to be portable and easy to use (so I would not like to compile 5 different versions and let the user choose).
Is there a possibility to compile 5 different versions of my code and run dynamically the most optimized version that's possible at execution time? With 5 different versions I mean with different instruction sets and different optimizations for processors.
I don't care about the size of the application.
At this moment I'm using gcc on Linux (my code is in C++), but I'm also interested in this for the Intel compiler and for the MinGW compiler for compilation to Windows.
The executable doesn't have to be able to run on different OS'es, but ideally there would be something possible with automatically selecting 32 bit and 64 bit as well.
Edit: Please give clear pointers how to do it, preferably with small code examples or links to explanations. From my point of view I need a super generic solution, which is applicable on any random C++ project I have later.
Edit I assigned the bounty to ShuggyCoUk, he had a great number of pointers to look out for. I would have liked to split it between multiple answers but that is not possible. I'm not having this implemented yet, so the question is still 'open'! Please, still add and/or improve answers, even though there is no bounty to be given anymore.
Thanks everybody!

Yes it's possible. Compile all your differently optimised versions as different dynamic libraries with a common entry point, and provide an executable stub that that loads and runs
the correct library at run-time, via the entry point, depending on config file or other information.

Can you use script?
You could detect the CPU using script, and dynamically load the executable that is most optimized for architecture. It can choose 32/64 bit versions too.
If you are using a Linux you can query the cpu with
cat /proc/cpuinfo
You could probably do this with a bash/perl/python script or windows scripting host on windows. You probably don't want to force the user to install a script engine. One that works on the OS out of the box IMHO would be best.
In fact, on windows you probably would want to write a small C# app so you can more easily query the architecture. The C# app could just spawn whatever executable is fastest.
Alternatively you could put your different versions of code in a dll's or shared object's, then dynamically load them based on the detected architecture. As long as they have the same call signature it should work.

If you wish this to cleanly work on Windows and take full advantage in 64bit capable platforms of the additional 1. Addressing space and 2. registers (likely of more use to you) you must have at a minimum a separate process for the 64bit ones.
You can achieve this by having a separate executable with the relevant PE64 header. Simply using CreateProcess will launch this as the relevant bitness (unless the executable launched is in some redirected location there is no need to worry about WoW64 folder redirection
Given this limitation on windows it is likely that simply 'chaining along' to the relevant executable will be the simplest option for all different options, as well as making testing an individual one simpler.
It also means you 'main' executable is free to be totally separate depending on the target operating system (as detecting the cpu/OS capabilities is, by it's nature, very OS specific) and then do most of the rest of your code as shared objects/dlls.
Also you can 'share' the same files for two different architectures if you currently do not feel that there is any point using the differing capabilities.
I would suggest that the main executable is capable of being forced into making a specific choice so you can see what happens with 'lesser' versions on a more capable machine (or what errors come up if you try something different).
Other possibilities given this model are:
Statically linking to different versions of the standard runtimes (for ones with/without thread safety) and using them appropriately if you are running without any SMP/SMT capabilities.
Detect if multiple cores are present and whether they are real or hyper threading (also whether the OS knows how the schedule effectively in those cases)
checking the performance of things like the system timer/high performance timers and using code optimized to this behaviour, say if you do anything where you look for a certain amount of time to expire and thus can know your best possible granularity.
If you wish to optimize you choice of code based on cache sizing/other load on the box. If you are using unrolled loops then more aggressive unrolling options may depend on having a certain amount level 1/2 cache.
Compiling conditionally to use doubles/floats depending on the architecture. Less important on intel hardware but if you are targetting certain ARM cpu's some have actual floating point hardware support and others require emulation. The optimal code would change heavily, even to the extent you just use conditional compilation rather than using the optimizing compiler(1).
Making use of co-processor hardware like CUDA capable graphics cards.
detect virtualization and alter behaviour (perhaps trying to avoid file system writes)
As to doing this check you have a few options, the most useful one on Intel being the the cpuid instruction.
Windows
Use someone else's implementation but you'll have to pay
Use a free open source one
Linux
Use the built in one
You could also look at open source software doing the same thing
Pixman does a fair amount of this and is a permissive licence.
Alternatively re-implement/update an existing one using available documentation on the features you need.
Quite a lot of separate documents to work out how to detect things:
Intel:
SSE 4.1/4.2
SSE3
MMX
A large part of what you would be paying for in the CPU-Z library is someone doing all this (and the nasty little issues involved) for you.
be careful with this - it is hard to beat decent optimizing compilers on this

Have a look at liboil: http://liboil.freedesktop.org/wiki/ . It can dynamically select implementations of multimedia-related computations at run-time. You may find you can liboil itself and not just its techniques.

Since you mention you are using GCC, I'll assume your code is in C (or C++).
Neil Butterworth already suggested making separate dynamic libraries, but that requires some non-trivial cross-platform considerations (manually loading dynamic libraries is different on Linux, Windows, OSX, etc., and getting it right will likely take some time).
A cheap solution is to simply write all of your variants using unique names, and use a function pointer to select the proper one at runtime.
I suspect the extra dereference caused by the function pointer will be amortized by the actual work you are doing (but you'll want to confirm that).
Also, getting different compiler optimizations will likely require different .c/.cpp files, as well as some twiddling of your build tool. But it's probably less overall work than separate libraries (which needed this already in one form or another).

Since you didn't specify whether you have limits on the number of files, I propose another solution: compile 5 executables, and then create a sixth executable that launches the appropriate binary. Here is some pseudocode, for Linux
int main(int argc, char* argv[])
{
char* target_path[MAXPATH];
char* new_argv[];
char* specific_version = determine_name_of_specific_version();
strcpy(target_path, "/usr/lib/myapp/versions");
strcat(target_path, specific_version);
/* append NULL to argv */
new_argv = malloc(sizeof(char*)*(argc+1));
memcpy(new_argv, argv, argc*sizeof(char*));
new_argv[argc] = 0;
/* optionally set new_argv[0] to target_path */
execv(target_path, new_argv);
}
On the plus side, this approach allows to provide the user transparently with both 32-bit and 64-bit binaries, unlike any library methods that have been proposed. On the minus side, there is no execv in Win32 (but a good emulation in cygwin); on Windows, you have to create a new process, rather than re-execing the current one.

Lets break the problem down to its two constituent parts. 1) Creating platform dependent optimized code and 2) building on multiple platforms.
The first problem is pretty straightforward. Encapsulate the platform dependent code in a set of functions. Create a different implementation of each function for each platform. Put each implementation in its own file or set of files. It's easiest for the build system if you put each platform's code in a separate directory.
For part two I suggest you look at Gnu Atuotools (Automake, AutoConf, and Libtool). If you've ever downloaded and built a GNU program from source code you know you have to run ./configure before running make. The purpose of the configure script is to 1) verify that your system has all of the required libraries and utilities need to build and run the program and 2) customize the Makefiles for the target platform. Autotools is the set of utilities for generating the configure script.
Using autoconf, you can create little macros to check that the machine supports all of the CPU instructions your platform dependent code needs. In most cases, the macros already exists, you just have to copy them into your autoconf script. Then, automake and autoconf can set up the Makefiles to pull in the appropriate implementation.
All this is a bit much for creating an example here. It takes a little time to learn. But the documentation is all out there. There is even a free book available online. And the process is applicable to your future projects. For multi-platform support, this is really the most robust and easiest way to go, I think. A lot of the suggestions posted in other answers are things that Autotools deals with (CPU detection, static & shared library support) without you have to think about it too much. The only wrinkle you might have to deal with is finding out if Autotools are available for MinGW. I know they are part of Cygwin if you can go that route instead.

You mentioned the Intel compiler. That is funny, because it can do something like this by default. However, there is a catch. The Intel compiler didn't insert checks for the approopriate SSE functionality. Instead, they checked if you had a particular Intel chip. There would still be a slow default case. As a result, AMD CPUs would not get suitable SSE-optimized versions. There are hacks floating around that will replace the Intel check with a proper SSE check.
The 32/64 bits difference will require two executables. Both the ELF and PE format store this information in the exectuables header. It's not too hard to start the 32 bits version by default, check if you are on a 64 bit system, and then restart the 64 bit version. But it may be easier to create an appropriate symlink at installation time.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js