CPU dispatcher for Visual Studio for AVX and SSE - C++

I work with two computers. One without AVX support and one with AVX. It would be convenient to have my code find the instruction set supported by my CPU at run-time and choose the appropriate code path.
I've followed the suggestions by Agner Fog to make a CPU dispatcher (http://www.agner.org/optimize/#vectorclass). However, on my machine without AVX, compiling and linking the code with Visual Studio with AVX enabled causes a crash when I run it.
For example, I have two source files: one with the SSE2 instruction set defined containing some SSE2 instructions, and another with the AVX instruction set defined containing some AVX instructions. Even if my main function only references the SSE2 functions, the code still crashes merely by virtue of having any source file compiled with AVX enabled that contains AVX instructions. Any clues as to how I can fix this?
Edit:
Okay, I think I isolated the problem. I'm using Agner Fog's vector class and I have defined three source files as:
//file sse2.cpp - compiled with /arch:SSE2
#include "vectorclass.h"
float func_sse2(const float* a) {
    Vec8f v1 = Vec8f().load(a);
    float sum = horizontal_add(v1);
    return sum;
}
//file avx.cpp - compiled with /arch:AVX
#include "vectorclass.h"
float func_avx(const float* a) {
    Vec8f v1 = Vec8f().load(a);
    float sum = horizontal_add(v1);
    return sum;
}
//file foo.cpp - compiled with /arch:SSE2
#include <stdio.h>
extern float func_sse2(const float* a);
extern float func_avx(const float* a);
int main() {
    float (*fp)(const float* a);
    float a[] = {1,2,3,4,5,6,7,8};
    int iset = 6;
    if(iset >= 7) {
        fp = func_avx;
    }
    else {
        fp = func_sse2;
    }
    float sum = (*fp)(a);
    printf("sum %f\n", sum);
}
This crashes. If I instead use Vec4f in func_sse2 it does not crash. I don't understand this. I can use Vec8f with SSE2 by itself as long as I don't have another source file with AVX. Agner Fog's manual says:
"There is no advantage in using the 256-bit floating point vector classes (Vec8f,
Vec4d) unless the AVX instruction set is specified, but it can be convenient to use
these classes anyway if the same source code is used with and without AVX.
Each 256-bit vector will simply be split up into two 128-bit vectors when compiling
without AVX."
However, when I have two source files with Vec8f one compiled with SSE2 and one compiled with AVX then I get a crash.
Edit2:
I can get it to work from the command line
>cl -c sse2.cpp
>cl -c /arch:AVX avx.cpp
>cl foo.cpp sse2.obj avx.obj
>foo.exe
Edit3:
This, however, crashes
>cl -c sse2.cpp
>cl -c /arch:AVX avx.cpp
>cl foo.cpp avx.obj sse2.obj
>foo.exe
Another clue: apparently, the order of linking matters. It crashes if avx.obj comes before sse2.obj, but if sse2.obj comes before avx.obj it does not crash. I'm not sure whether it chooses the correct code path (I don't have access to my AVX system right now), but at least it does not crash.

I realise that this is an old question and that the person who asked it appears to be no longer around, but I hit the same problem yesterday. Here's what I worked out.
When compiled, both your sse2.cpp and avx.cpp files produce object files that contain not only your function but also any required template functions (e.g. Vec8f::load). These template functions are also compiled using the requested instruction set.
This means that your sse2.obj and avx.obj object files will both contain definitions of Vec8f::load, each compiled using the respective instruction set.
However, since the compiler treats Vec8f::load as externally visible, it puts it in a 'COMDAT' section of the object file with a 'selectany' (aka 'pick any') label. This tells the linker that if it sees multiple definitions of this symbol, for example in 2 different object files, then it is allowed to pick any one it likes. (It does this to reduce duplicate code in the final executable, which otherwise would be inflated in size by multiple definitions of template and inline functions.)
The problem you are having is directly related to this in that the order of the object files passed to the linker is affecting which one it picks. Specifically here, it appears to be picking the first definition it sees.
If this was avx.obj then the AVX compiled version of Vec8f::load will always be used. This will crash on a machine that doesn't support that instruction set.
On the other hand if sse2.obj is first then the SSE2 compiled version will always be used. This won't crash but it will only use SSE2 instructions even if AVX is supported.
That this is the case can be seen if you look at the linker 'map' file output (produced using the /MAP option). Here are the relevant (edited) excerpts:
//
// link with sse2.obj before avx.obj
//
0001:00000080 _main foo.obj
0001:00000330 ?func_sse2@@YAMPBM@Z sse2.obj
0001:00000420 ??0Vec256fe@@QAE@XZ sse2.obj
0001:00000440 ??0Vec4f@@QAE@ABT__m128@@@Z sse2.obj
0001:00000470 ??0Vec8f@@QAE@XZ sse2.obj <-- sse2 version used
0001:00000490 ??BVec4f@@QBE?AT__m128@@XZ sse2.obj
0001:000004c0 ?get_high@Vec8f@@QBE?AVVec4f@@XZ sse2.obj
0001:000004f0 ?get_low@Vec8f@@QBE?AVVec4f@@XZ sse2.obj
0001:00000520 ?load@Vec8f@@QAEAAV1@PBM@Z sse2.obj <-- sse2 version used
0001:00000680 ?func_avx@@YAMPBM@Z avx.obj
0001:00000740 ??BVec8f@@QBE?AT__m256@@XZ avx.obj
//
// link with avx.obj before sse2.obj
//
0001:00000080 _main foo.obj
0001:00000270 ?func_avx@@YAMPBM@Z avx.obj
0001:00000330 ??0Vec8f@@QAE@XZ avx.obj <-- avx version used
0001:00000350 ??BVec8f@@QBE?AT__m256@@XZ avx.obj
0001:00000380 ?load@Vec8f@@QAEAAV1@PBM@Z avx.obj <-- avx version used
0001:00000580 ?func_sse2@@YAMPBM@Z sse2.obj
0001:00000670 ??0Vec256fe@@QAE@XZ sse2.obj
0001:00000690 ??0Vec4f@@QAE@ABT__m128@@@Z sse2.obj
0001:000006c0 ??BVec4f@@QBE?AT__m128@@XZ sse2.obj
0001:000006f0 ?get_high@Vec8f@@QBE?AVVec4f@@XZ sse2.obj
0001:00000720 ?get_low@Vec8f@@QBE?AVVec4f@@XZ sse2.obj
As for fixing it, that's another matter. In this case, the following blunt hack should work by forcing the AVX version to have its own differently named versions of the template functions. This will increase the resulting executable size, as it will contain multiple versions of the same function even when the SSE2 and AVX versions are identical.
// avx.cpp
namespace AVXWrapper {
#include "vectorclass.h"
}
using namespace AVXWrapper;
float func_avx(const float* a)
{
...
}
There are some important limitations, though:
(a) if the included file manages any form of global state it will no longer be truly global, as you will have two 'semi-global' versions, and
(b) you won't be able to pass vectorclass variables as parameters between functions defined in avx.cpp and other code.

The fact that the link order matters makes me think that there might be some kind of initialization code in the obj file. If the initialization code is communal, then only the first one is taken. I can't reproduce it, but you should be able to see it in an assembly listing (compile with /c /Fatestavx.asm)

Put the SSE and AVX functions in different cpp files and be sure to compile the SSE version without /arch:AVX.


Why does a basic unreferenced C++ function not get optimized away?

Consider this simple code:
#include <stdio.h>
extern "C"
{
    void p4nenc256v32();
    void p4ndec256v32();
}
void bigFunctionTest()
{
    p4nenc256v32();
    p4ndec256v32();
}
int main()
{
    printf("hello\n");
}
The code size of those p4nenc256v32/p4ndec256v32 functions is significant, roughly 1.5MB. The binary size when compiled with the latest VS2022 with optimizations enabled is 1.5MB. If I comment out that unused bigFunctionTest function then the resulting binary is smaller by 1.4MB. Any ideas why this clearly unused function wouldn't be eliminated by the compiler and/or linker in release builds? By default, VS2022 in release uses /Gy and /OPT:REF.
I also tried mingw64 (gcc 12.2) with -fdata-sections -ffunction-sections -Wl,--gc-sections and the results were much worse: when compiled with that dummy function the exe grew by 5.2MB. It seems the MS and GCC compilers agree that for some reason these functions cannot be removed.
I created a working sample project that shows the issue: https://github.com/pps83/TestLinker.git (make sure to pull submodules as well) and filed an issue with the VS issue tracker: Linker doesn't eliminate correctly dead code. However, I think I might get a better explanation from SO users of what the reason for the problem might be.

C++ error: intrinsic function was not declared in scope

I want to compile code that uses the intrinsic function _mm256_undefined_si256() (returns a vector of 8 packed doubleword integers). Here is a reduced snippet of the affected function from the header file:
// test.hpp
#include "immintrin.h"
namespace {
    inline __m256i foo(__m256i a, __m256i b) {
        __m256i res = _mm256_undefined_si256();
        // some inline asm stuff
        // __asm__(...);
        return res;
    }
}
Compiling via gcc -march=native -mavx2 -O3 -std=c++11 test.cpp -o app throws the following error: '_mm256_undefined_si256' was not declared in this scope.
I cannot explain why this intrinsic function is not declared, since there are other intrinsics used in the header file which work properly.
Your code works in GCC4.9 and newer (https://godbolt.org/z/bajMsKvK9). GCC4.9 was released in April 2014, close to a decade ago, and the most recent release of GCC4.8.5 was in June 2015. So it's about time to upgrade your compiler!
GCC4.8 was missing that intrinsic, and didn't even know about -march=sandybridge (let alone tuning options for Haswell which had AVX2), although it did know about the less meaningful -march=corei7-avx.
It does happen that GCC misses some of the more obscure intrinsics that Intel adds along with support for a new instruction set, so support for _mm256_add_epi32 won't always imply _mm256_undefined_si256().
e.g. it took until GCC11 for them to add _mm_load_si32(void*), the unaligned aliasing-safe movd (which I think Intel introduced around the same time as the AVX-512 stuff), so that's multiple years late. (And it took until GCC12 / 11.3 for GCC to implement it correctly, Bug 99754; it's still not aliasing-safe for _mm_load_ss(float*), Bug 84508.)
But fortunately, _mm256_undefined_si256 is supported by non-ancient versions of all the mainstream compilers.

Using inline ARM asm in Android NDK project

I'm doing a little project using the Android NDK, and I must insert some asm code for ARM architecture.
Everything apart from the asm works just fine, but the asm code gives me the error
Operand 1 should be an integer register
when compiling simple code like
asm("mov r0, r0");
So, what is the problem? Is my computer trying to compile for x86_64 instead of ARM? If so, how should I change that?
Also, I've tried the x86_64 equivalent asm("mov rax, rax"); but the error is the same.
All your C sources are compiled for each architecture mentioned in your APP_ABI, so there is no point wondering why ARM assembly is not understood by the x86_64 compiler, or vice versa.
Do not use inline assembly. It is much better to put all the assembly stuff into dedicated *.S sources, which will be processed by as (the NDK toolchains have it). Those assembly sources should be placed into appropriate folders such as arch-arm/ and arch-x86/. Then you should add them to Android.mk properly:
LOCAL_SRC_FILES := arch-$(TARGET_ARCH)/my_awesome_code.S
$(TARGET_ARCH) helps resolve the path to the appropriate source in a correct and painless way.
P.S. Standalone assembly also gives you more capabilities than inline assembly, which is one more reason to avoid inline assembly. Moreover, inline assembly syntax differs from compiler to compiler, since it is not part of the standard.

Visual Studio equivalent of GCC's __attribute__((target("sse")))

This is the problem I encountered when I tried to migrate an existing GCC project to Visual Studio.
For a given function void foo(), I hand-optimize it with SSE/AVX intrinsics, resulting in two versions, void foo_sse() and void foo_avx(), and I use cpuid to invoke the correct version at runtime. To tell GCC that void foo_sse() and void foo_avx() should be compiled with the -msse and -mavx options respectively, I add __attribute__((target("sse"))) and __attribute__((target("avx"))) to their declarations.
It works well for GCC, but I cannot find an equivalent for VS. Due to some security concerns, I have to put all the code in one single cpp file to mangle symbol names, so I cannot simply put the two functions in two different cpp files and give them different compiler options.
How can I specify compiler options on a per-function basis in VS? Thanks in advance.

How to remove unused C/C++ symbols with GCC and ld?

I need to optimize the size of my executable severely (ARM development) and
I noticed that in my current build scheme (gcc + ld) unused symbols are not getting stripped.
Using arm-strip --strip-unneeded on the resulting executables/libraries doesn't change the output size of the executable (I have no idea why; maybe it simply can't).
What would be the way (if it exists) to modify my building pipeline, so that the unused symbols are stripped from the resulting file?
I wouldn't even think of this, but my current embedded environment isn't very "powerful" and
saving even 500K out of 2M results in a very nice loading performance boost.
Update:
Unfortunately the current gcc version I use doesn't have the -dead-strip option and the -ffunction-sections... + --gc-sections for ld doesn't give any significant difference for the resulting output.
I'm shocked that this even became a problem, because I was sure that gcc + ld should automatically strip unused symbols (why do they even have to keep them?).
For GCC, this is accomplished in two stages:
First compile the data but tell the compiler to separate the code into separate sections within the translation unit. This will be done for functions, classes, and external variables by using the following two compiler flags:
-fdata-sections -ffunction-sections
Link the translation units together using the linker optimization flag (this causes the linker to discard unreferenced sections):
-Wl,--gc-sections
So if you had one file called test.cpp that had two functions declared in it, but one of them was unused, you could omit the unused one with the following command to gcc(g++):
gcc -Os -fdata-sections -ffunction-sections test.cpp -o test -Wl,--gc-sections
(Note that -Os is an additional compiler flag that tells GCC to optimize for size)
If this thread is to be believed, you need to supply -ffunction-sections and -fdata-sections to gcc, which will put each function and data object in its own section. Then you give --gc-sections to GNU ld to remove the unused sections.
You'll want to check your docs for your version of gcc & ld:
However, for me (OS X, gcc 4.0.1) I find these in the ld man page:
-dead_strip
Remove functions and data that are unreachable by the entry point or exported symbols.
-dead_strip_dylibs
Remove dylibs that are unreachable by the entry point or exported symbols. That is, it suppresses the generation of load commands for dylibs which supplied no symbols during the link. This option should not be used when linking against a dylib which is required at runtime for some indirect reason, such as the dylib having an important initializer.
And this helpful option
-why_live symbol_name
Logs a chain of references to symbol_name. Only applicable with -dead_strip. It can help debug why something that you think should be dead-stripped is not removed.
There's also a note in the gcc/g++ man that certain kinds of dead code elimination are only performed if optimization is enabled when compiling.
While these options/conditions may not hold for your compiler, I suggest you look for something similar in your docs.
Programming habits could help too; e.g. add static to functions that are not accessed outside a specific file; use shorter names for symbols (can help a bit, likely not too much); use const char x[] where possible; ... this paper, though it talks about dynamic shared objects, can contain suggestions that, if followed, can help to make your final binary output size smaller (if your target is ELF).
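The static suggestion is worth singling out, because it works even before any linker tricks: a file-local function with no callers can be dropped by the compiler itself. A small sketch (made-up file name):

```shell
# An unused function with internal linkage never reaches the linker at -O2.
cat > habits.c <<'EOF'
static int unused_helper(void) { return 123; }  /* static, never called */
int main(void) { return 0; }
EOF
gcc -O2 -c habits.c
# No symbol for unused_helper survives in the object file:
nm habits.o | grep unused_helper || echo "unused_helper already gone"
```

Without static the compiler must assume some other translation unit might call the function, so it has to emit it and leave the removal to -ffunction-sections/--gc-sections.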
The answer is -flto. You have to pass it to both your compilation and link steps, otherwise it doesn't do anything.
It actually works very well - reduced the size of a microcontroller program I wrote to less than 50% of its previous size!
Unfortunately it did seem a bit buggy - I had instances of things not being built correctly. It may have been due to the build system I'm using (QBS; it's very new), but in any case I'd recommend you only enable it for your final build if possible, and test that build thoroughly.
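The "pass it to both steps" point can be sketched as follows (hypothetical file names): an unreferenced function with external linkage survives a normal link of object files, but whole-program LTO can see that nothing in the executable references it and discard it.

```shell
cat > helper.c <<'EOF'
int helper(void) { return 1; }   /* external linkage, never called */
EOF
cat > prog.c <<'EOF'
int main(void) { return 0; }
EOF
# Normal build: the compiler must keep helper() for potential external callers.
gcc -O2 helper.c prog.c -o nolto
# LTO build: -flto on compilation AND link gives whole-program visibility.
gcc -O2 -flto helper.c prog.c -o withlto
nm nolto | grep -c 'T helper$'                            # present: 1
nm withlto | grep 'T helper$' || echo "helper removed by LTO"
```

Forgetting -flto on either the compile or the link step silently degrades to a normal build, which is why the flag "doesn't do anything" unless supplied to both.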
While not strictly about symbols, if going for size - always compile with -Os and -s flags. -Os optimizes the resulting code for minimum executable size and -s removes the symbol table and relocation information from the executable.
Sometimes - if small size is desired - playing around with different optimization flags may - or may not - have significance. For example toggling -ffast-math and/or -fomit-frame-pointer may at times save you even dozens of bytes.
It seems to me that the answer provided by Nemo is the correct one. If those instructions do not work, the issue may be related to the version of gcc/ld you're using. As an exercise, I compiled an example program using the instructions detailed here:
#include <stdio.h>
void deadcode() { printf("This is d dead codez\n"); }
int main(void) { printf("This is main\n"); return 0 ; }
Then I compiled the code using progressively more aggressive dead-code removal switches:
gcc -Os test.c -o test.elf
gcc -Os -fdata-sections -ffunction-sections test.c -o test.elf -Wl,--gc-sections
gcc -Os -fdata-sections -ffunction-sections test.c -o test.elf -Wl,--gc-sections -Wl,--strip-all
These compilation and linking parameters produced executables of size 8457, 8164 and 6160 bytes respectively, with the most substantial contribution coming from the 'strip-all' option. If you cannot produce similar reductions on your platform, then maybe your version of gcc does not support this functionality. I'm using gcc (4.5.2-8ubuntu4) and ld (2.21.0.20110327) on Linux Mint, 2.6.38-8-generic x86_64.
strip --strip-unneeded only operates on the symbol table of your executable. It doesn't actually remove any executable code.
The standard libraries achieve the result you're after by splitting all of their functions into separate object files, which are combined using ar. If you then link the resultant archive as a library (i.e. give the option -l your_library to ld), ld will only include the object files, and therefore the symbols, that are actually used.
You may also find some of the responses to this similar question of use.
I don't know if this will help with your current predicament, as this is a recent feature, but you can specify the visibility of symbols in a global manner. Passing -fvisibility=hidden -fvisibility-inlines-hidden at compilation can help the linker later get rid of unneeded symbols. If you're producing an executable (as opposed to a shared library), there's nothing more to do.
More information (and a fine-grained approach for e.g. libraries) is available on the GCC wiki.
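For a shared library, the effect of the visibility flag is easy to observe (made-up file name): a function that is not explicitly exported disappears from the dynamic symbol table, so the linker and dynamic loader no longer need to keep it reachable.

```shell
cat > vis.c <<'EOF'
int internal_fn(void) { return 1; }   /* not marked for export */
EOF
# Build a shared library with default-hidden visibility.
gcc -O2 -fPIC -fvisibility=hidden -shared vis.c -o libvis.so
# internal_fn is absent from the exported (dynamic) symbol table:
nm -D libvis.so | grep internal_fn || echo "internal_fn not exported"
```

The fine-grained counterpart is marking individual functions with __attribute__((visibility("default"))) so only the intended API surface is exported.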
From the GCC 4.2.1 manual, section -fwhole-program:
Assume that the current compilation unit represents the whole program being compiled. All public functions and variables, with the exception of main and those merged by attribute externally_visible, become static functions and in effect are optimized more aggressively by interprocedural optimizers. While this option is equivalent to proper use of the static keyword for programs consisting of a single file, in combination with the --combine option this flag can be used to compile most smaller-scale C programs, since the functions and variables become local to the whole combined compilation unit, not to the single source file itself.
You can use the strip binary on an object file (e.g. an executable) to strip all symbols from it.
Note: it changes the file itself and doesn't create a copy.