cv::split() crashes on Xeon processor but works elsewhere - c++

I am using the pre-built OpenCV lib & DLL, version 3.4.3 Winpack (downloaded from the official site https://opencv.org/releases.html).
Until now everything worked fine, but recently my code started to crash.
It is one specific function that causes the crash: cv::split(), a common utility function for extracting channels
from a cv::Mat array. The crash occurs only on a Xeon processor running Windows Server 2012. Regardless of preceding calls or context, it crashes immediately on this call and the application simply closes.
On other processors the same .exe works without problems; the code is normally tested on Windows 10 with ordinary processors. I don't have a Xeon processor at hand to test every function, but this crash can be reproduced 100% of the time on a Xeon Gold machine, and I have used quite a lot of other library functions there without trouble, so this is the first one that crashed.
It seems that some functions' runtime code simply contains instructions that are incompatible with the Xeon processor, so it crashes there.
Question: how do I know in advance whether a certain OpenCV function will work on a Xeon processor?
For now I have removed the cv::split() calls from my code and replaced them with cv::extractChannel(), which works fine on all tested platforms. I suspect one option would be to compile a custom version of the library with specific instruction sets disabled, but that would require knowing what to disable, so frankly I am not in the mood to maintain
a custom-compiled build for what seems like a relatively 'standard' architecture (a Xeon processor).
What can you suggest to avoid these errors?
Maybe there is a list of OpenCV functions that are known to be 'special' (not Xeon-safe), so I can just avoid them?
Code example:
#include <opencv2/opencv.hpp>

int main(int argc, char* argv[])
{
    // -1 == cv::IMREAD_UNCHANGED: keep all channels, including alpha
    cv::Mat Patch = cv::imread("image.png", -1);
    cv::Mat Patch_planes[4];
    cv::split(Patch, Patch_planes);   // crashes here on the Xeon machine
    return 0;
}
Compiler command (Microsoft (R) C/C++ Optimizing Compiler Version 19.15.26732.1 for x64):
cl.exe "minim.cpp" /EHsc /W2 /I "c:\VCLIB\openCV-3.4.3" "c:\VCLIB\openCV-3.4.3\lib\opencv_world343.lib" /link /SUBSYSTEM:CONSOLE
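For reference, the workaround mentioned above looks roughly like this: a minimal sketch that replaces cv::split() with per-channel cv::extractChannel() calls (assuming a 4-channel image, as in the example):

#include <opencv2/opencv.hpp>

int main(int argc, char* argv[])
{
    cv::Mat Patch = cv::imread("image.png", -1);

    // Extract each channel individually instead of calling cv::split()
    cv::Mat Patch_planes[4];
    for (int c = 0; c < Patch.channels() && c < 4; ++c)
        cv::extractChannel(Patch, Patch_planes[c], c);

    return 0;
}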

How do I know in advance whether a certain OpenCV function will work or not on a Xeon processor?
You don't. The compiler will use whatever instructions it deems most suitable to compile any particular piece of code, subject to the constraints given on the command line.
So to be safe (assuming it is an 'illegal instruction' error), you probably do need to compile OpenCV for the least capable processor you need to support, and then check the performance hit on the other processors. Either that, or check the CPU in your installer and install a version of OpenCV tailored to that processor. Yuk, I don't envy you.
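If you want to confirm it really is an instruction-set mismatch, one hedged diagnostic is to log what the library detects at runtime; cv::checkHardwareSupport() and cv::getBuildInformation() are part of OpenCV's public API. A minimal sketch:

#include <opencv2/opencv.hpp>
#include <iostream>

int main()
{
    // Report which instruction sets OpenCV detects on this CPU
    std::cout << "SSE2:   " << cv::checkHardwareSupport(CV_CPU_SSE2)   << "\n"
              << "SSE3:   " << cv::checkHardwareSupport(CV_CPU_SSE3)   << "\n"
              << "SSE4.2: " << cv::checkHardwareSupport(CV_CPU_SSE4_2) << "\n"
              << "AVX2:   " << cv::checkHardwareSupport(CV_CPU_AVX2)   << "\n";

    // Also reports which baseline/dispatched instruction sets the
    // library itself was built with
    std::cout << cv::getBuildInformation() << std::endl;
    return 0;
}

Running this on both a working machine and the Xeon Gold box should show whether the prebuilt binaries assume something the server does not deliver.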

Related

How can we distribute compiled source code if it is specific to the hardware it was compiled on?

Suppose we take a compiled language, for example C++, and an example framework, say Qt. Qt's source code is publicly available, and it also offers binary downloads so users can simply use its API. My question is: when they compiled their code, it was compiled for their specific hardware, operating system, and so on. I understand why much software requires recompilation for different operating systems (including 32- vs 64-bit) and therefore offers multiple downloads, but why does it not go even further and become hardware-specific, eventually making the redistribution of compiled executables extremely frustrating to produce?
Code gets compiled for a target base CPU (e.g. 32-bit x86, x86_64, or ARM), but not necessarily for a specific processor like the Core i9-10900K. By default, the compiler typically generates code that runs on the widest range of processors, and Intel and AMD guarantee forward compatibility for running that code on newer processors. Compilers offer switches for optimizing for newer processors with new instruction sets, but you rarely use them, since not all your customers have that hardware. Or perhaps you build your code twice (once for older processors, and an optimized build for newer ones), as sketched below.
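A hedged sketch of that "build twice and pick at runtime" idea: the optimized variant lives in a separately compiled translation unit, and the program queries the CPU at startup. GCC/Clang's __builtin_cpu_supports is used here (MSVC offers __cpuid for the same purpose); the two process_* function names are hypothetical:

#include <iostream>

// Hypothetical pair of implementations: a portable baseline and an
// AVX2 variant that would be compiled with -mavx2 in its own file
void process_baseline() { std::cout << "baseline path\n"; }
void process_avx2()     { std::cout << "AVX2 path\n"; }

int main()
{
#if defined(__GNUC__)
    // Runtime CPUID check: only take the fast path if the CPU has AVX2
    if (__builtin_cpu_supports("avx2"))
        process_avx2();
    else
#endif
        process_baseline();
    return 0;
}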
There's also a concept called cross-compiling, where the compiler generates code for a completely different processor than the one it runs on. Such is the case when you build your iOS app on a Mac: the compiler itself is an x86_64 program, but it generates ARM instructions to run on the iPhone.
Code also gets compiled and linked against a certain set of OS APIs and external runtime libraries (including the C/C++ runtime). If you want your code to run on Windows 7 or Mac OS X Mavericks, you can't statically link to an API that only exists on Windows 10 or macOS Big Sur. The code would compile, but it wouldn't run on the older operating systems. Instead, you do a workaround or conditionally load the API if it is available. Microsoft and Apple provide forward compatibility by keeping those same runtime library APIs available on later OS releases.
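On Windows, "conditionally load the API" usually means GetProcAddress. A minimal sketch, using GetTickCount64 (absent before Vista) as a stand-in for any newer API:

#include <windows.h>
#include <iostream>

int main()
{
    // Resolve the newer API at runtime instead of importing it statically,
    // so the executable still loads on older Windows versions
    typedef ULONGLONG (WINAPI *GetTickCount64Fn)();
    HMODULE kernel32 = GetModuleHandleW(L"kernel32.dll");
    GetTickCount64Fn pGetTickCount64 =
        (GetTickCount64Fn)GetProcAddress(kernel32, "GetTickCount64");

    if (pGetTickCount64)
        std::cout << "uptime (ms): " << pGetTickCount64() << "\n";
    else
        std::cout << "API unavailable, falling back to GetTickCount\n";
    return 0;
}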
Additionally, Windows supports running 32-bit processes on 64-bit chips and OSes, and Macs can even emulate x86_64 on the new ARM-based devices coming out later this year. But I digress.
As for Qt, they actually offer several pre-built configurations for their reference binary downloads, because, at least on Windows, the MSVCRT (the C-runtime APIs from Visual Studio) is closely tied to the compiler version of Visual Studio. So they offer various downloads to match the configuration you want to build your code for (32-bit, 64-bit, VS2017, VS2019, etc.). When you put together a complete application with 3rd-party dependencies, all of these build, linkage, and CPU/OS configurations have to be accounted for.

How to transfer my code from an i7 processor to a Xeon processor on AWS?

I wrote some code in this environment:
a) my laptop with an i7 processor;
b) the Visual Studio IDE, C/C++.
Now I want to transfer the code to AWS with a Xeon E5-2670.
1) Is it possible?
2) Must I change the configuration in Visual Studio, or can I take the code and run it directly on the Xeon processor?
3) Do you have some references I could follow?
Thanks for your help and recommendations,
Alvaro
It depends on how you have set up the compilation options. If you have not enabled any options that allow the compiler to use instructions not present on the target processor, the executable will run. You can use Dependency Walker to determine which DLLs your executable requires.
The default options in VS C++ projects produce executables that run on practically any modern x86 processor. Your machine's CPU doesn't matter when compiling; only the compiler options do.
It should run directly, but it might not be as efficient as if it were tuned for the AWS system. For example, I wrote a program optimised for a 4-core, 8-thread computer, but when I ran it on my laptop with a 2-core, 4-thread processor it nearly froze the machine. I would also guess that running the program on a 6-core, 12-thread processor would not achieve full efficiency. One way around that is to query the core count at runtime, as sketched below.
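A minimal sketch of that runtime query using std::thread::hardware_concurrency() (the fallback value is an assumption for when detection fails):

#include <thread>
#include <iostream>

int main()
{
    // Number of hardware threads; returns 0 if it cannot be determined
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0)
        n = 2;  // conservative fallback

    std::cout << "sizing the worker pool to " << n << " threads\n";
    return 0;
}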
If you're talking about the runtime environment (I just remembered that), there is a chance that Visual Studio links against non-standard libraries, which you would need to download and/or install before being able to run the program. E.g. I sent my program to a friend who was missing a DLL required to run it.
EDIT (I'm new here, so not enough rep to comment): usually I just search for missing DLLs on dll-files.com. I'm not sure about Linux, though; it could be that you have to compile the libraries yourself, which I'm not that familiar with.
After trying it, the execution fails with 2 errors:
MSVCP144.dll missing
MSVCP100.dll missing

Processor optimization flags in OpenCV

I'm building an application that uses OpenCV and will run on a variety of Windows computers (Win7, Win8, Win10).
Now I have discovered that my application crashes randomly on some computers. After a lot of googling, I have realized that enabling SSE3 in OpenCV can cause Illegal Instruction crashes on processors that don't support SSE3.
http://answers.opencv.org/question/18001/illegal-instruction-when-running-any-compiled-opencv-demo-binary-sse3-flag/
https://bugs.launchpad.net/linuxmint/+bug/1258259
So this is my question: does anyone know which processor flags are "safe"? I understand what they do, but I don't know how common it is for a processor to support, for instance, SSE4.2.
In other words: which of these flags do you think I should disable when I compile OpenCV?
OCV_OPTION:
ENABLE_SSE
ENABLE_SSE2
ENABLE_SSE3
ENABLE_SSSE3
ENABLE_SSE41
ENABLE_SSE42
ENABLE_POPCNT
ENABLE_AVX
ENABLE_AVX2
ENABLE_FMA3
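(For reference, disabling these at configure time looks roughly like the command below. The option names are taken from the list above; exact names vary between OpenCV versions, and 3.x releases replaced them with CPU_BASELINE/CPU_DISPATCH.)

cmake -D ENABLE_SSE3=OFF -D ENABLE_SSSE3=OFF -D ENABLE_SSE41=OFF -D ENABLE_SSE42=OFF -D ENABLE_POPCNT=OFF -D ENABLE_AVX=OFF -D ENABLE_AVX2=OFF -D ENABLE_FMA3=OFF <path-to-opencv-source>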

Same program faster on Linux than Windows -- why?

The solution to this was found in the question Executable runs faster on Wine than Windows -- why? Glibc's floor() is probably implemented in terms of system libraries.
I have a very small C++ program (~100 lines) for a physics simulation. I have compiled it with gcc 4.6.1 on both Ubuntu Oneiric and Windows XP on the same computer. I used precisely the same command line options (same makefile).
Strangely, on Ubuntu the program finishes much faster than on Windows (~7.5 s vs. 13.5 s). At this point I thought it was a compiler difference (despite using the same version).
But even more strangely, if I run the Windows executable under wine, it's still faster than on Windows (I get 11 s "real" and 7.7 s "user" time -- and this includes wine startup.)
I'm confused. Surely if the same code is run on the same CPU, there shouldn't be a difference in the timing.
What can be the reason for this? What could I be doing wrong?
The program does minimal I/O (outputs a single line), and only uses a fixed-length vector from the STL (i.e. no system libraries should be involved). On Ubuntu I used the default gcc and on Windows the Nuwen distribution. I verified that the CPU usage is close to zero when doing the benchmarking (I closed most programs). On Linux I used time for timing. On Windows I used timethis.exe.
UPDATE
I did some more precise timings, comparing the running time for different inputs (run-time must be proportional to the input) of the gcc and msvc-compiled programs on Windows XP, Wine and Linux. All numbers are in seconds and are the minimum of at least 3 runs.
On Windows I used timethis.exe (wall time); on Linux and Wine I used time (CPU time), since timethis.exe is broken on Wine. I made sure no other programs were using the CPU, and I disabled the virus scanner.
The command line options to gcc were -march=pentium-m -Wall -O3 -fno-exceptions -fno-rtti (i.e. exceptions were disabled).
What we see from this data:
the difference is not due to process startup time, as run-times are proportional to the input
the difference between running on Wine and Windows exists only for the gcc-compiled program, not the msvc-compiled one, so it can't be caused by other programs hogging the CPU on Windows or by timethis.exe being broken.
You'd be surprised what system libraries are involved. Just run ldd on your app and see which are used (OK, not that many, but certainly glibc).
In order to completely trust your findings about execution speed, you would need to run your app several times sequentially and take the mean execution time. It might be that the OS loader is just slower (although 4 s is a long loading time).
Other very possible reasons are:
Different malloc implementation
Exception handling, if used heavily, can cause slowdowns (GCC on Windows, i.e. MinGW, might not be the star exception-handling implementation)
OS-dependent initialization: stuff that needs to be done at program startup on Windows, but not on Linux.
Most of these are easily benchmarkable ;-)
An update to your update: the only thing you can now do is profile. Stop guessing, and let a profiler tell you where time is being spent. Use gprof and the Visual Studio built-in profiler and compare time spent in different functions.
Do the benchmarking in code. Also try compiling with Visual Studio. On Windows, applications like Yahoo Messenger that install hooks can very easily slow down your application's loading times.
On Windows you have: QueryPerformanceCounter
On Linux: clock_gettime
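Both are wrapped portably by std::chrono, so an in-code benchmark can be identical on Windows and Linux; a minimal sketch:

#include <chrono>
#include <iostream>

int main()
{
    // steady_clock is monotonic, so it is safe for interval timing
    auto t0 = std::chrono::steady_clock::now();

    volatile double sum = 0.0;  // volatile keeps the loop from being optimized away
    for (long i = 0; i < 100000000L; ++i)
        sum += i * 0.5;

    auto t1 = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = t1 - t0;
    std::cout << "elapsed: " << elapsed.count() << " s\n";
    return 0;
}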
Apparently the difference is system related.
You might use strace to understand what system calls are done, eg
strace -o /tmp/yourprog.tr yourprog
and then look into /tmp/yourprog.tr
(If an equivalent of strace exists on Windows, try to use it.)
Perhaps your program is allocating memory (using the mmap system call), and perhaps the memory-related system calls are faster on Linux (or even on Wine) than on Windows. Or some other syscalls provide faster functionality on Linux than on Windows.
NB: I know nothing about Windows, as I have been using Unix systems since 1986 and Linux since 1993.

HD Photo source compile on ARM?

I've downloaded the HD Photo Device Porting Kit 1.0 and successfully compiled and ran it on an x86 PC.
I want to port the image viewer program to an ARM-based Windows Mobile smartphone, but some ARM code is missing.
First, there is no ARM equivalent of the "/image/x86/x86.h" header file. But the file is very simple, so I copied and renamed it to "arm.h" and successfully compiled and linked the source code.
But at runtime, a DWORD alignment exception occurs. On the ARM build, it seems that ARMOPT_BITIO should be defined to get properly aligned reads and writes. But with ARMOPT_BITIO, some I/O functions are missing, e.g. peekBits, getBits, flushToByte, flushBits.
I copied the x86 versions of these functions (peekBit16, flushBit16, etc.), but no luck; it does not work (I got a stack overflow error).
I can't debug the complex HD Photo source files. Please let me know where I can find the missing ARM code.
Any help would be much appreciated. Thanks!
Based on my experience of porting some Microsoft code to ARM Linux, I do not think there is an easy way around it, unless someone has ported it already. You'll have to dive into this sort of low-level debugging.
The bugs I encountered were mainly related to unaligned access and missing platform API calls. Incorrect preprocessor checks also resulted in the code thinking it was running on a big-endian platform.
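For the unaligned-access class of bugs, the usual fix is to read multi-byte values through memcpy rather than by casting the pointer. A minimal sketch:

#include <cstring>
#include <cstdint>

// Safe on ARM and x86 alike: memcpy lets the compiler emit byte-wise
// or unaligned-capable loads as appropriate for the target
uint32_t read_u32_unaligned(const unsigned char* p)
{
    uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// By contrast, *(const uint32_t*)p can raise an alignment fault on ARM
// whenever p is not 4-byte aligned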
The method I found useful for debugging in such scenarios is to build the code both for the target platform and for a platform where it is known to work, then debug/trace the two builds in parallel across a number of use cases. This will catch the most severe bugs.