Automatic vectorization GCC - c++

I'm trying to get GCC 4.7 to automatically vectorize some parts of my code for a speed increase, but it is proving difficult.
Here is some code that I would like to vectorize:
void VideoLine::WriteOut(unsigned short * __restrict__ start_of_line, const int number_of_sub_pixels_to_write)
{
    unsigned short * __restrict__ write_pointer = (unsigned short *)__builtin_assume_aligned (start_of_line, 16);
    unsigned short * __restrict__ line = (unsigned short *)__builtin_assume_aligned (_line, 16);

    for (int i = 0; i < number_of_sub_pixels_to_write; i++)
    {
        write_pointer[i] = line[i];
    }
}
I am using the following GCC switches:
-std=c++0x \
-o3 \
-msse \
-msse2 \
-msse3 \
-msse4.1 \
-msse4.2 \
-ftree-vectorizer-verbose=5 \
-funsafe-loop-optimizations \
-march=corei7-avx \
-mavx \
-fdump-tree-vect-details \
-fdump-tree-optimized \
I'm aware that some override others.
I do not get any output from the vectorizer at all, and when looking at the .optimized dump file I can see it has not used vectorization. Can anyone point me in the right direction to get this to vectorize?
Edit: It turned out the issue was using -o3 rather than -O3.

Try to guarantee that number_of_sub_pixels_to_write is a multiple of 4, for example by masking it the way it is done here:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/ch01s04s03.html
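For illustration, a minimal sketch of that masking idea (the function and variable names are made up; it assumes the leftover tail pixels can simply be copied by a scalar loop):

void CopyLine(unsigned short * __restrict__ dst,
              const unsigned short * __restrict__ src,
              int count)
{
    const int vec_count = count & ~3;          // largest multiple of 4 <= count
    for (int i = 0; i < vec_count; i++) {      // trip count known to be a multiple of 4
        dst[i] = src[i];
    }
    for (int i = vec_count; i < count; i++) {  // scalar tail for the remainder
        dst[i] = src[i];
    }
}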

The compiler is free to do what it pleases. Therefore, if you really want to use SIMD functionality (and not rely on the compiler), you should use the intrinsic functions directly (see the manual).
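As a hedged sketch of what that would look like for the copy loop above (assuming 16-byte-aligned pointers and a count that is a multiple of 8 shorts; the function name is invented):

#include <emmintrin.h>  // SSE2 intrinsics

void CopyLineSSE2(unsigned short * __restrict__ dst,
                  const unsigned short * __restrict__ src,
                  int count)
{
    // Copy 8 unsigned shorts (128 bits) per iteration with explicit SSE2 loads and stores.
    for (int i = 0; i < count; i += 8) {
        __m128i v = _mm_load_si128(reinterpret_cast<const __m128i *>(src + i));
        _mm_store_si128(reinterpret_cast<__m128i *>(dst + i), v);
    }
}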

Related

32b multiplication in 8b uC with avr-g++ vs 32b multiplication on X86 with gcc

PROBLEM:
I'm writing a fixed-point C++ class to implement a closed-loop control system on an 8b microcontroller.
I wrote a C++ class to encapsulate the PID and tested the algorithm on an X86 desktop with a modern gcc compiler. All good.
When I compiled the same code on an 8b microcontroller with a modern avr-g++ compiler, I had weird artefacts. After some debugging, the problem was that the 16b*16b multiplication was truncated to 16b. Below is some minimal code to show what I'm trying to do.
I used -O2 optimization on the desktop system and -Os optimization on the embedded system, with no other compiler flags.
#include <cstdio>
#include <stdint.h>

#define TEST_16B true
#define TEST_32B true

int main( void )
{
    if (TEST_16B)
    {
        int16_t op1 = 9000;
        int16_t op2 = 9;
        int32_t res;
        //This operation gives the correct result on X86 gcc (81000)
        //This operation gives the wrong result on AVR avr-g++ (15464)
        res = (int32_t)0 +op1 *op2;
        printf("op1: %d | op2: %d | res: %d\n", op1, op2, res );
    }
    if (TEST_32B)
    {
        int16_t op1 = 9000;
        int16_t op2 = 9;
        int32_t res;
        //Promote first operand
        int32_t promoted_op1 = op1;
        //This operation gives the correct result on X86 gcc (81000)
        //This operation gives the correct result on AVR avr-g++ (81000)
        res = promoted_op1 *op2;
        printf("op1: %d | op2: %d | res: %d\n", promoted_op1, op2, res );
    }
    return 0;
}
SOLUTION:
Just promoting one operand to 32b with a local variable is enough to solve the problem.
My expectation was that C++ would guarantee that a math operation would be performed at the same width as the first operand, so in my mind res = (int32_t)0 +... should have told the compiler that whatever came after should be performed at int32_t resolution.
This is not what happened. The (int16_t)*(int16_t) operation got truncated to (int16_t).
gcc has an internal word width of at least 32b on an X86 machine, so that might be the reason I didn't see artefacts on my desktop.
AVR Command Line
E:\Programs\AVR\7.0\toolchain\avr8\avr8-gnu-toolchain\bin\avr-g++.exe$(QUOTE) -funsigned-char -funsigned-bitfields -DNDEBUG -I"E:\Programs\AVR\7.0\Packs\atmel\ATmega_DFP\1.3.300\include" -Os -ffunction-sections -fdata-sections -fpack-struct -fshort-enums -Wall -pedantic -mmcu=atmega4809 -B "E:\Programs\AVR\7.0\Packs\atmel\ATmega_DFP\1.3.300\gcc\dev\atmega4809" -c -std=c++11 -fno-threadsafe-statics -fkeep-inline-functions -v -MD -MP -MF "$(#:%.o=%.d)" -MT"$(#:%.o=%.d)" -MT"$(#:%.o=%.o)" -o "$#" "$<"
QUESTION:
Is this the actual expected behaviour of a compliant C++ compiler, meaning I did it wrong, or is this a quirk of the avr-g++ compiler?
UPDATE:
Debugger output of various solutions
This is expected behavior of the compiler.
When you write A + B * C, that is equivalent to A + (B * C) because of operator precedence. The B * C term is evaluated on its own, without regard to how it is going to be used later. (Otherwise, it would be really hard to look at C/C++ code and understand what is actually going to happen.)
There are integer promotion rules in the C/C++ standards that sometimes help you out by promoting B and C to be of type int or maybe unsigned int before performing the multiplication. That is why you get the expected result on x86 gcc, where an int has 32 bits. However, since an int in avr-gcc only has 16 bits, the integer promotion is not good enough for you. So you need to cast either B or C to an int32_t to ensure the result of the multiplication will be an int32_t as well. For example, you can do:
A + (int32_t)B * C
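Applied to the 16-bit case above, a minimal sketch of that fix could look like this (the (long) cast in printf is only there because int is 16 bits on AVR, so plain %d would not match an int32_t):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int16_t op1 = 9000;
    int16_t op2 = 9;

    // Cast one operand so the multiplication itself is done at 32-bit width;
    // the other operand is converted to match before the multiply.
    int32_t res = (int32_t)op1 * op2;

    printf("res: %ld\n", (long)res);   // prints 81000 on both x86 and AVR
    return 0;
}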

No Emscripten, How to Compile C++ With Standard Library to WebAssembly

I am having trouble building standalone WebAssembly with the full control I want over memory and layout. I don't want to use emscripten because, as the following post says, it doesn't give me all of the compile-time options I want (e.g. stack-size control, being able to choose to import memory in standalone mode, etc.). I've been following pages such as: How to generate standalone webassembly with emscripten
Also, emscripten is overkill.
What I've done so far:
I have a fully working llvm 9 toolchain downloaded via homebrew (I am on macOS 10.14).
I was following a mix of https://aransentin.github.io/cwasm/ and https://depth-first.com/articles/2019/10/16/compiling-c-to-webassembly-and-running-it-without-emscripten/
I used wasi to get the C standard library. Using linker flags like -Wl,-z,stack-size=$[1024 * 1024] I could control the stack size. Compilation was successful. Great!
However, I need to use C++ standard libraries to support some of my own and other third party libraries.
As far as I can tell, there doesn't seem to be any easy way to get libc++ and libc++abi.
I tried a "hack" in which I downloaded Emscripten and had it build its own libc++ and libc++abi files. Then I tried copying those files and headers into the right spot.
Then I got error messages referring to a missing threading API, which apparently were caused by not compiling with EMSCRIPTEN. So I defined the EMSCRIPTEN macro and that sort of worked. Then I thought that maybe I could remove the wasi dependency and use emscripten's version of libc to be consistent, but then there were conflicting / missing headers too.
In short, I think I got somewhat close to where I needed to be, but things just got terribly messy. I doubt I took the simplest non-emscripten approach.
Has anyone successfully created a build system for standalone webassembly that lets you use the c and c++ standard libraries?
EDIT:
This is the super hacky build script I have now (it's a heavily modified version of something I found online):
DEPS =
OBJ = library.o
STDLIBC_OBJ = $(patsubst %.cpp,%.o,$(wildcard stdlibc/*.cpp))
OUTPUT = library.wasm
DIR := ${CURDIR}

COMPILE_FLAGS = -Wall \
    --target=wasm32-unknown-wasi \
    -Os \
    -D __EMSCRIPTEN__ \
    -D _LIBCPP_HAS_NO_THREADS \
    -flto \
    --sysroot ./ \
    -std=c++17 \
    -ffunction-sections \
    -fdata-sections \
    -I./libcxx/ \
    -I./libcxx/support/xlocale \
    -I./libc/include \
    -DPRINTF_DISABLE_SUPPORT_FLOAT=1 \
    -DPRINTF_DISABLE_SUPPORT_LONG_LONG=1 \
    -DPRINTF_DISABLE_SUPPORT_PTRDIFF_T=1

$(OUTPUT): $(OBJ) $(NANOLIBC_OBJ) Makefile
	wasm-ld \
	    -o $(OUTPUT) \
	    --no-entry \
	    --export-all \
	    --initial-memory=131072 \
	    --stack-size=$[1024 * 1024] \
	    -error-limit=0 \
	    --lto-O3 \
	    -O3 \
	    -lc -lc++ -lc++abi \
	    --gc-sections \
	    -allow-undefined-file ./stdlibc/wasm.syms \
	    $(OBJ) \
	    $(LIBCXX_OBJ) \
	    $(STDLIBC_OBJ)

%.o: %.cpp $(DEPS) Makefile
	clang++ \
	    -c \
	    $(COMPILE_FLAGS) \
	    -fno-exceptions \
	    -o $# \
	    $<

library.wat: $(OUTPUT) Makefile
	~/build/wabt/wasm2wat -o library.wat $(OUTPUT)

wat: library.wat

clean:
	rm -f $(OBJ) $(STDLIBC_OBJ) $(OUTPUT) library.wat
I dropped in libc, libc++, and libc++abi from emscripten (but honestly, this is a terrible installation process).
I've been incrementally trying to fill in gaps that I guess emscripten would've normally done, but now I'm stuck again:
./libcxx/type_traits:4837:57: error: use of undeclared identifier 'byte'
constexpr typename enable_if<is_integral_v<_Integer>, byte>::type &
^
./libcxx/type_traits:4837:64: error: definition or redeclaration of 'type'
cannot name the global scope
constexpr typename enable_if<is_integral_v<_Integer>, byte>::type &
I am no longer sure if this will even work, since the system might accidentally compile in something platform-specific. Really, what I'd like is a shim that would mostly just let me use the standard containers.
This has become kind of unmanageable. What might I do next?
EDIT 2: Right, so that's missing C++17 type-trait content, and when I drop to C++14 (I still want C++17) I end up with more missing things.
Definitely stuck.
EDIT 3:
I sort of started over. The libraries are linking and I'm able to use the standard library, but I'm seeing errors like the following if I try to use e.g. std::chrono's methods (I can instantiate the object):
wasm-ld: error: /var/folders/9k/zvv02vlj007cc0pm73769y500000gn/T/library-4ff1b5.o: undefined symbol: std::__1::chrono::system_clock::now()
I'm currently using the static libc++abi library from emscripten and the static C++ standard library from my homebrew installation of llvm (I tried the emscripten one, but that didn't work either).
I'm not really sure if this is related to name mangling. I'm currently exporting all symbols from the wasm module, so malloc and co. get exported as well.
Here is my build script:
clang++ \
    --target=wasm32-unknown-wasi \
    --std=c++11 \
    -stdlib=libc++ \
    -O3 \
    -flto \
    -fno-exceptions \
    -D WASM_BUILD \
    -D _LIBCPP_HAS_NO_THREADS \
    --sysroot /usr/local/opt/wasi-libc \
    -I/usr/local/opt/wasi-libc/include \
    -I/usr/local/opt/glm/include \
    -I./libcxx/ \
    -L./ \
    -lc++ \
    -lc++abi \
    -nostartfiles \
    -Wl,-allow-undefined-file wasm.syms \
    -Wl,--import-memory \
    -Wl,--no-entry \
    -Wl,--export-all \
    -Wl,--lto-O3 \
    -Wl,-lc++, \
    -Wl,-lc++abi, \
    -Wl,-z,stack-size=$[1024 * 1024] \
    -o library.wasm \
    library.cpp
My code:
#include "common_header.h"
#include <glm/glm.hpp>
#include <unordered_map>
#include <vector>
#include <string>
#include <chrono>
template <typename T>
struct BLA {
T x;
};
template <typename T>
BLA<T> make_BLA() {
BLA<T> bla;
std::unordered_map<T, T> map;
std::vector<T> bla2;
std::string str = "WEE";
//str = str.substr(0, 2);
return bla;
}
#ifdef __cplusplus
extern "C" {
#endif
char* malloc_copy(char* input)
{
usize len = strlen(input) + 1;
char* result = (char*)malloc(len);
if (result == NULL) {
return NULL;
}
strncpy(result, input, len);
return result;
}
void malloc_free(char* input)
{
free(input);
}
float32 print_num(float val);
float32 my_sin(float32 val)
{
float32 result = sinf(val);
float32 result_times_2 = print_num(result);
print_num(result_times_2);
return result;
}
long fibonacci(unsigned n) {
if (n < 2) return n;
return fibonacci(n-1) + fibonacci(n-2);
}
void set_char(char* input)
{
input[0] = '\'';
uint8 fibonacci_series[] = { 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89 };
for (uint8 number : fibonacci_series) {
input[0] = number;
}
auto WEE = make_BLA<int>();
WEE.x = 18;
glm::vec4 v(100.0f, 200.0f, 300.0f, 1.0f);
glm::vec4 v_out = glm::mat4(1.0f) * v;
input[0] = 5 + static_cast<int>(v_out.x) * input[1];
auto start = std::chrono::system_clock::now();
long out = fibonacci(42);
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
std::time_t end_time = std::chrono::system_clock::to_time_t(end);
auto elapsed = elapsed_seconds.count();
}
#ifdef __cplusplus
}
#endif
When I tried exporting dynamically, with the visibility attribute only on the functions that contained no C++, the project compiled, but the wasm module failed to load in JavaScript, so I think the problem was still there.
This is as far as I've gotten. Might the issue be related to the fact that I'm using a different compiler from the one used to create the static libraries? (I'm using homebrew clang 9.) Hopefully not; I'd be kind of stuck then, because I couldn't find another way to get the libraries. Manual llvm compilation seemed to fail.
The excellent wasi-sdk pulls upstream llvm-project (which provides clang++) and wasi-libc as git submodules and compiles them using suitable flags (most notably disabling pthreads, which is not yet supported in wasi-libc).
You can then compile your own C++ source using the following minimal set of options:
/path/to/wasi-sdk/build/install/opt/wasi-sdk/bin/clang++ \
    -nostartfiles \
    -fno-exceptions \
    -Wl,--no-entry \
    -Wl,--strip-all \
    -Wl,--export-dynamic \
    -Wl,--import-memory \
    -fvisibility=hidden \
    --sysroot /path/to/wasi-sdk/build/install/opt/wasi-sdk/share/wasi-sysroot \
    -o out.wasm \
    source.cpp
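For what it's worth, a sketch of how exports work with that flag combination (the WASM_EXPORT macro name is my own invention): -fvisibility=hidden hides everything by default, and -Wl,--export-dynamic then exports only the symbols you explicitly give default visibility.

// source.cpp (hypothetical example)
#define WASM_EXPORT extern "C" __attribute__((visibility("default")))

WASM_EXPORT int add(int a, int b)
{
    return a + b;     // exported from the module as "add"
}

int helper(int x)
{
    return x * 2;     // hidden by -fvisibility=hidden, so not exported
}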
If you want to import functions from the runtime, I would suggest adding an additional line:
-Wl,--allow-undefined-file=wasm-import.syms \
You then can put function names separated by newlines into wasm-import.syms so that the linker won't complain about undefined functions.
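For example, if the runtime is expected to provide two hypothetical functions named print_num and js_log, wasm-import.syms would contain just:

print_num
js_log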
Note that all this is completely independent of Emscripten.

error: '_mm512_loadu_epi64' was not declared in this scope

I'm trying to create a minimal reproducer for this issue report. There seem to be some problems with AVX-512, which ships on the latest Apple machines with Skylake processors.
According to GCC6 release notes the AVX-512 gear should be available. According to the Intel Intrinsics Guide vmovdqu64 is available with AVX-512VL and AVX-512F:
$ cat test.cxx
#include <cstdint>
#include <immintrin.h>

int main(int argc, char* argv[])
{
    uint64_t x[8];
    __m512i y = _mm512_loadu_epi64(x);
    return 0;
}
And then:
$ /opt/local/bin/g++-mp-6 -mavx512f -Wa,-q test.cxx -o test.exe
test.cxx: In function 'int main(int, char**)':
test.cxx:6:37: error: '_mm512_loadu_epi64' was not declared in this scope
__m512i y = _mm512_loadu_epi64(x);
^
$ /opt/local/bin/g++-mp-6 -mavx -mavx2 -mavx512f -Wa,-q test.cxx -o test.exe
test.cxx: In function 'int main(int, char**)':
test.cxx:6:37: error: '_mm512_loadu_epi64' was not declared in this scope
__m512i y = _mm512_loadu_epi64(x);
^
$ /opt/local/bin/g++-mp-6 -msse4.1 -msse4.2 -mavx -mavx2 -mavx512f -Wa,-q test.cxx -o test.exe
test.cxx: In function 'int main(int, char**)':
test.cxx:6:37: error: '_mm512_loadu_epi64' was not declared in this scope
__m512i y = _mm512_loadu_epi64(x);
^
I walked the options back to -msse2 without success. I seem to be missing something.
What is required to engage AVX-512 for modern GCC?
According to a /opt/local/bin/g++-mp-6 -v, these are the header search paths:
#include "..." search starts here:
#include <...> search starts here:
/opt/local/include/gcc6/c++/
/opt/local/include/gcc6/c++//x86_64-apple-darwin13
/opt/local/include/gcc6/c++//backward
/opt/local/lib/gcc6/gcc/x86_64-apple-darwin13/6.5.0/include
/opt/local/include
/opt/local/lib/gcc6/gcc/x86_64-apple-darwin13/6.5.0/include-fixed
/usr/include
/System/Library/Frameworks
/Library/Frameworks
And then:
$ grep -R '_mm512_' /opt/local/lib/gcc6/ | grep avx512f | head -n 8
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:_mm512_set_epi64 (long long __A, long long __B, long long __C,
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:_mm512_set_epi32 (int __A, int __B, int __C, int __D,
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:_mm512_set_pd (double __A, double __B, double __C, double __D,
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:_mm512_set_ps (float __A, float __B, float __C, float __D,
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:#define _mm512_setr_epi64(e0,e1,e2,e3,e4,e5,e6,e7) \
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h: _mm512_set_epi64(e7,e6,e5,e4,e3,e2,e1,e0)
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:#define _mm512_setr_epi32(e0,e1,e2,e3,e4,e5,e6,e7, \
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h: _mm512_set_epi32(e15,e14,e13,e12,e11,e10,e9,e8,e7,e6,e5,e4,e3,e2,e1,e0)
...
With no masking, there's no reason for this intrinsic to exist or to ever use it instead of the equivalent _mm512_loadu_si512. It's just confusing, and could trick human readers into thinking it was a vmovq zero-extending load of a single epi64.
Intel's intrinsics finder does specify that it exists, but even current trunk gcc (on Godbolt) doesn't define it.
Almost all AVX512 instructions support merge-masking and zero-masking. Instructions that used to be purely bitwise / whole-register with no meaningful element boundaries now come in 32 and 64-bit element flavours, like vpxord and vpxorq. Or vmovdqa32 and vmovdqa64. But using either version with no masking is still just a normal vector load / store / register-copy, and it's not meaningful to specify anything about element-size for them in the C++ source with intrinsics, only the total vector width.
See also What is the difference between _mm512_load_epi32 and _mm512_load_si512?
SSE* and AVX1/2 options are irrelevant to whether GCC headers define this intrinsic in terms of gcc built-ins; -mavx512f already implies all of the Intel SSE/AVX extensions before AVX512.
It is present in clang trunk (but not 7.0 so it was only very recently added).
unaligned _mm512_loadu_si512 - supported everywhere, use this
unaligned _mm512_loadu_epi64 - clang trunk, not gcc.
aligned _mm512_load_si512 - supported everywhere, use this
aligned _mm512_load_epi64 - also supported everywhere, surprisingly.
unaligned _mm512_maskz_loadu_epi64 - supported everywhere, use this for zero-masked loads
unaligned _mm512_mask_loadu_epi64 - supported everywhere, use this for merge-mask loads.
This code compiles on gcc as early as 4.9.0, and mainline (Linux) clang as early as 3.9, both with -mavx512f. Or, if they support it, -march=skylake-avx512 or -march=knl. I haven't tested with Apple Clang.
#include <immintrin.h>
__m512i loadu_si512(void *x) { return _mm512_loadu_si512(x); }
__m512i load_epi64(void *x) { return _mm512_load_epi64(x); }
//__m512i loadu_epi64(void *x) { return _mm512_loadu_epi64(x); }
__m512i loadu_maskz(void *x) { return _mm512_maskz_loadu_epi64(0xf0, x); }
__m512i loadu_mask(void *x) { return _mm512_mask_loadu_epi64(_mm512_setzero_si512(), 0xf0, x); }
Godbolt link; you can uncomment the _mm512_loadu_epi64 and flip the compiler to clang trunk to see it work there.
_mm512_loadu_epi64 is not available in 32-bit mode. You need to compile for 64-bit mode. In general, AVX512 works best in 64-bit mode.

LLVM coverage confused by if-constexpr

I have encountered a weird problem with LLVM coverage when using constant expressions in an if-statement:
#include <cstring>      // memcpy
#include <type_traits>  // std::is_trivially_copyable

template<typename T>
int foo(const T &val)
{
    int idx = 0;
    if constexpr (std::is_trivially_copyable<T>::value && sizeof(T) <= sizeof(int))
    {
        memcpy(&idx, &val, sizeof(T));
    }
    else
    {
        //store val and assign its index to idx
    }
    return idx;
}
The instantiations executed:
int idx1 = foo<int>(10);
int idx2 = foo<long long>(10);
int idx3 = foo<std::string>(std::string("Hello"));
int idx4 = foo<std::vector<int>>(std::vector<int>{1,2,3,4,5});
In none of these is the sizeof(T) <= sizeof(int) part ever shown as executed. And yet in the first instantiation (int) the body of the first branch is indeed executed, as it should be. In none of the others is it shown as executed.
Relevant part of the compilation command line:
/usr/bin/clang++ -g -O0 -Wall -Wextra -fprofile-instr-generate -fcoverage-mapping -target x86_64-pc-linux-gnu -pipe -fexceptions -fvisibility=default -fPIC -DQT_CORE_LIB -DQT_TESTLIB_LIB -I(...) -std=c++17 -o test.o -c test.cpp
Relevant part of the linker command line:
/usr/bin/clang++ -Wl,-m,elf_x86_64,-rpath,/home/michael/Qt/5.11.2/gcc_64/lib -L/home/michael/Qt/5.11.2/gcc_64/lib -fprofile-instr-generate -fcoverage-mapping -target x86_64-pc-linux-gnu -o testd test.o -lpthread -fuse-ld=lld
When the condition is extracted into its own function, both the int and long long instantiations are shown correctly in the coverage as executing the sizeof(T) <= sizeof(int) part. What might be causing such behaviour, and how can I solve it? Is it a bug in Clang/LLVM coverage?
Any ideas?
EDIT: This seems to be a known bug in LLVM (not clear yet if LLVM-cov or Clang though):
https://bugs.llvm.org/show_bug.cgi?id=36086
https://bugs.chromium.org/p/chromium/issues/detail?id=845575
First of all, sizeof(T) <= sizeof(int) is evaluated at compile time in your code, so chances are its evaluation is simply not profiled for coverage.
Next, of the three remaining types only long long is trivially copyable, but its size is (very likely) larger than that of int, so the then-clause is not executed for them, nor even compiled. Since everything happens inside a templated function, the discarded branch is never instantiated.
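For reference, a hedged sketch of the workaround the questioner mentions, pulling the condition out into its own constexpr helper so the coverage tool gets a separately attributable region (the helper name is invented):

#include <cstring>
#include <type_traits>

// Invented helper: evaluates the same compile-time condition as before.
template<typename T>
constexpr bool fits_in_int()
{
    return std::is_trivially_copyable<T>::value && sizeof(T) <= sizeof(int);
}

template<typename T>
int foo(const T &val)
{
    int idx = 0;
    if constexpr (fits_in_int<T>())
    {
        std::memcpy(&idx, &val, sizeof(T));
    }
    else
    {
        // store val and assign its index to idx
    }
    return idx;
}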

OpenMP with restrict pointers fails with ICC while GCC/G++ succeeds

I implemented a simple matrix-vector multiplication for sparse matrices in CRS format, using an OpenMP directive on the multiplication loop.
The complete code is in GitHub: https://github.com/torbjoernk/openMP-Examples/blob/icc_gcc_problem/matxvec_sparse/matxvec_sparse.cpp
Note: It's ugly ;-)
To control the private and shared memory I'm using restrict pointers. Compiling it with GCC 4.6.3 on 64bit Linux works fine (besides two warnings about %u and unsigned int in a printf command, but that's not the point).
However, compiling it with ICC 12.1.0 on 64bit Linux fails with the error:
matxvec_sparse.cpp(79): error: "default_n_row" must be specified in a variable list at enclosing OpenMP parallel pragma
#pragma omp parallel \
^
with the definition of the variable and pointer in question
int default_n_row = 4;
int *n_row = &default_n_row;
and the openMP directive defined as
#pragma omp parallel \
    default(none) \
    shared(n_row, aval, acolind, arowpt, vval, yval) \
    private(x, y)
{
    #pragma omp for \
        schedule(static)
    for ( x = 0; x < *n_row; x++ ) {
        yval[x] = 0;
        for ( y = arowpt[x]; y < arowpt[x+1]; y++ ) {
            yval[x] += aval[y] * vval[ acolind[y] ];
        }
    }
} /* end PARALLEL */
Compiled with g++:
c++ -fopenmp -O0 -g -std=c++0x -Wall -o matxvec_sparse matxvec_sparse.cpp
Compiled with icc:
icc -openmp -O0 -g -std=c++0x -Wall -restrict -o matxvec_sparse matxvec_sparse.cpp
Is it an error in usage of GCC/ICC?
Is this a design issue in my code causing undefined behaviour?
If so, which line(s) is/are causing it?
Is it just inconsistency between ICC and GCC?
If so, what would be a good way to achieve compiler independence and compatibility?
Huh. Looking at the code, it's clear what icpc thinks the problem is, but I'm not sure without going through the specification which compiler is doing the right thing here, g++ or icpc.
The issue isn't the restrict keyword; if you take all those out and lose the -restrict option to icpc, the problem remains. The issue is that you've got in that parallel section default(none) shared(n_row...), but n_row is, at the start of the program, a pointer to default_n_row. And icpc is requiring that default_n_row also be shared (or, at least, something) in that omp parallel section.
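For completeness, a sketch of the two obvious ways to satisfy icpc (not verified against ICC 12.1, so treat it as an assumption): either list default_n_row in the data-sharing clauses as well, or dereference the pointer once before the parallel region so only a plain int is needed inside it:

int default_n_row = 4;
int *n_row = &default_n_row;
int n_rows = *n_row;   /* dereference once, outside the parallel region */

/* x and y are declared earlier, as in the original code */
#pragma omp parallel \
    default(none) \
    shared(aval, acolind, arowpt, vval, yval) \
    firstprivate(n_rows) \
    private(x, y)
{
    #pragma omp for schedule(static)
    for ( x = 0; x < n_rows; x++ ) {
        yval[x] = 0;
        for ( y = arowpt[x]; y < arowpt[x+1]; y++ ) {
            yval[x] += aval[y] * vval[ acolind[y] ];
        }
    }
} /* end PARALLEL */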