"Bad" GCC optimization performance - c++

I am trying to understand why using -O2 -march=native with GCC produces slower code than compiling without those flags.
Note that I am using MinGW (GCC 4.7.1) under Windows 7.
Here is my code :
struct.hpp :
#ifndef STRUCT_HPP
#define STRUCT_HPP
#include <iostream>
class Figure
{
public:
Figure(char *pName);
virtual ~Figure();
char *GetName();
double GetArea_mm2(int factor);
private:
char name[64];
virtual double GetAreaEx_mm2() = 0;
};
class Disk : public Figure
{
public:
Disk(char *pName, double radius_mm);
~Disk();
private:
double radius_mm;
virtual double GetAreaEx_mm2();
};
class Square : public Figure
{
public:
Square(char *pName, double side_mm);
~Square();
private:
double side_mm;
virtual double GetAreaEx_mm2();
};
#endif
struct.cpp :
#include <cstdio>
#include "struct.hpp"
Figure::Figure(char *pName)
{
sprintf(name, pName);
}
Figure::~Figure()
{
}
char *Figure::GetName()
{
return name;
}
double Figure::GetArea_mm2(int factor)
{
return (double)factor*GetAreaEx_mm2();
}
Disk::Disk(char *pName, double radius_mm_) :
Figure(pName), radius_mm(radius_mm_)
{
}
Disk::~Disk()
{
}
double Disk::GetAreaEx_mm2()
{
return 3.1415926*radius_mm*radius_mm;
}
Square::Square(char *pName, double side_mm_) :
Figure(pName), side_mm(side_mm_)
{
}
Square::~Square()
{
}
double Square::GetAreaEx_mm2()
{
return side_mm*side_mm;
}
main.cpp
#include <iostream>
#include <cstdio>
#include "struct.hpp"
double Do(int n)
{
double sum_mm2 = 0.0;
const int figuresCount = 10000;
Figure **pFigures = new Figure*[figuresCount];
for (int i = 0; i < figuresCount; ++i)
{
if (i % 2)
pFigures[i] = new Disk((char *)"-Disque", i);
else
pFigures[i] = new Square((char *)"-Carré", i);
}
for (int a = 0; a < n; ++a)
{
for (int i = 0; i < figuresCount; ++i)
{
sum_mm2 += pFigures[i]->GetArea_mm2(i);
sum_mm2 += (double)(pFigures[i]->GetName()[0] - '-');
}
}
for (int i = 0; i < figuresCount; ++i)
delete pFigures[i];
delete[] pFigures;
return sum_mm2;
}
int main()
{
double a = 0;
StartChrono(); // home made lib, working fine
a = Do(10000);
double elapsedTime_ms = StopChrono();
std::cout << "Elapsed time : " << elapsedTime_ms << " ms" << std::endl;
return (int)a % 2; // To force the optimizer to keep the Do() call
}
I compile this code twice :
1 : Without optimization
mingw32-g++.exe -Wall -fexceptions -std=c++11 -c main.cpp -o main.o
mingw32-g++.exe -Wall -fexceptions -std=c++11 -c struct.cpp -o struct.o
mingw32-g++.exe -o program.exe main.o struct.o -s
2 : With -O2 optimization
mingw32-g++.exe -Wall -fexceptions -O2 -march=native -std=c++11 -c main.cpp -o main.o
mingw32-g++.exe -Wall -fexceptions -O2 -march=native -std=c++11 -c struct.cpp -o struct.o
mingw32-g++.exe -o program.exe main.o struct.o -s
1 : Execution time :
1196 ms (1269 ms with Visual Studio 2013)
2 : Execution time :
1569 ms (403 ms with Visual Studio 2013) !!!!!!!!!!!!!
Using -O3 instead of -O2 does not improve the results.
I was, and still am, pretty convinced that GCC and Visual Studio are equivalent, so I don't understand this huge difference.
Plus, I don't understand why the optimized version is slower than the non-optimized one with GCC.
Am I missing something here?
(Note that I had the same problem with genuine GCC 4.8.2 on Ubuntu.)
Thanks for your help.

Since I can't see the assembly code, I'm going to speculate the following:
The allocation loop can be optimized (by the compiler) by removing the if clause, producing something like:
for (int i = 0; i < 10000; i += 2)
{
pFigures[i] = new Square(...);
}
for (int i = 1; i < 10000; i += 2)
{
pFigures[i] = new Disk(...);
}
Considering that the iteration count is a multiple of 4, the loop can be unrolled to be even more "efficient":
for (int i = 0; i < 10000; i += 2*4)
{
pFigures[i] = ...
pFigures[i+2] = ...
pFigures[i+4] = ...
pFigures[i+6] = ...
}
Memory-wise, this makes Disks be allocated four at a time, and likewise Squares.
Now, this means objects of the same type will end up next to each other in memory.
Next, you iterate the array 10000 times in normal order (by normal I mean index after index).
Think about where these shapes end up in memory. You will get about four times more cache misses (think of the boundary case: when 4 disks and 4 squares sit in different pages, you switch between the pages 8 times, whereas with the interleaved layout you would switch between them only once).
This sort of optimization (if done by the compiler, and on your particular code) improves the allocation time, but not the access time (which in your example is the biggest load).
Test this by removing the i % 2 and see what results you get.
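For example, a homogeneous version of the allocation loop (my sketch, not code from the question) would be:
for (int i = 0; i < figuresCount; ++i)
pFigures[i] = new Disk((char *)"-Disque", i); // only one type: objects of the same class end up contiguous
If the -O2 and non-optimized timings converge once the allocation pattern is homogeneous, that would support the cache-layout theory.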
Again, this is pure speculation; it assumes that the reason for the lower performance was a loop optimization.

I suspect that you've got an issue unique to the combination of MinGW/GCC/C runtime on Windows, because your code performs faster with optimizations on Linux, where GCC is altogether more 'at home'.
On a fairly pedestrian Linux VM using gcc 4.8.2:
$ g++ main.cpp struct.cpp
$ time a.out
real 0m2.981s
user 0m2.876s
sys 0m0.079s
$ g++ -O2 main.cpp struct.cpp
$ time a.out
real 0m1.629s
user 0m1.523s
sys 0m0.041s
...and if you really take the blinkers off the optimizer by deleting struct.cpp and moving the implementation all inline:
$ time a.out
real 0m0.550s
user 0m0.543s
sys 0m0.000s
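For reference, "moving the implementation all inline" means defining the member functions in the header so they are visible in main.cpp's translation unit. A sketch of one class (reconstructed from the question's code, not the exact code used for the timing above):
class Disk : public Figure
{
public:
// Defined in the class body, so the compiler can inline the constructor
Disk(char *pName, double radius_mm_) : Figure(pName), radius_mm(radius_mm_) {}
private:
double radius_mm;
// Body visible to the optimizer: the virtual call in the hot loop can be devirtualized and inlined
virtual double GetAreaEx_mm2() { return 3.1415926*radius_mm*radius_mm; }
};
With the bodies visible where the loop runs, GCC can inline GetArea_mm2() and devirtualize GetAreaEx_mm2(), which plausibly accounts for the further 3x speedup.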

Related

C++ in ARM MCU: Need help to set up a simple timer

I'm programming an ATSAME70 and I'm trying to implement a simple timer using the SysTick interrupt available in Cortex-M MCUs, but I don't know what is going wrong.
If I write this code in a single main.cpp file:
// main.cpp
#include <cstdint>
#include "init.h"
#include "led.hpp"
volatile uint32_t g_ticks = 0;
extern "C" {
void SysTick_Handler(void)
{
g_ticks++;
}
}
class Timer
{
private:
uint32_t start;
public:
Timer() : start(g_ticks) {}
float elapsed() const { return (g_ticks - start) / 1000.0f; }
};
int main()
{
init();
SysTick_Config(300000000 / 1000); /* Clock is running at 300 MHz */
Timer t;
while (t.elapsed() < 1.0f);
Led::on();
while (true);
}
It works: the LED lights up properly after 1 second.
But if I try to keep things clean and split the program into the following files:
// timer.hpp
#include <cstdint>
class Timer
{
private:
uint32_t start;
public:
Timer();
float elapsed() const;
};
// timer.cpp
#include "timer.hpp"
volatile uint32_t g_ticks = 0;
extern "C" {
void SysTick_Handler(void)
{
g_ticks++;
}
}
Timer::Timer() : start(g_ticks) {}
float Timer::elapsed() const
{
return (g_ticks - start) / 1000.0f;
}
// main.cpp
#include <cstdint>
#include "init.h"
#include "led.hpp"
#include "timer.hpp"
int main()
{
init();
SysTick_Config(300000000 / 1000); /* Clock is running at 300 MHz */
Timer t;
while (t.elapsed() < 1.0f);
Led::on();
while (true);
}
It doesn't work anymore: the program reaches the first while loop and then gets stuck there. I think g_ticks is being corrupted when I try to read it in t.elapsed(), but I don't know what is happening. Does anybody know where I'm wrong?
init() is just a function in which I initialize all the needed registers.
EDIT: here are the command lines used to generate the code:
$toolchain_path = "C:\Program Files (x86)\GNU Tools ARM Embedded\8 2018-q4-major\bin";
$link_file = "source\device\same70_flash.ld"
$c_files = "include\sensors\bmi088\bmi088.c " +
...
"source\utils\syscalls.c";
$cpp_files = "source\device\init.cpp " +
...
"source\main.cpp";
Invoke-Expression "& '$toolchain_path\arm-none-eabi-gcc.exe' -c -s -O3 -fdata-sections -ffunction-sections '-Wl,--gc-sections' '-Wl,--entry=Reset_Handler' -mthumb -mcpu=cortex-m7 -mfloat-abi=hard -mfpu=fpv5-d16 -Isource -Iinclude\CMSIS -D__SAME70N21__ $c_files --specs=nosys.specs"
foreach ($c_file in $c_files.split(" "))
{
if ($objects) { $objects += " "; }
$objects += ($c_file.split("\")[-1]).split(".")[0] + ".o";
}
Invoke-Expression "& '$toolchain_path\arm-none-eabi-ld.exe' -s --entry=Reset_Handler -r $objects -o drivers.o"
foreach ($object in $objects.split(" ")) { Remove-Item $object; }
Move-Item drivers.o bin\drivers.o -force
Invoke-Expression "& '$toolchain_path\arm-none-eabi-g++.exe' -s -O3 -fdata-sections -ffunction-sections '-Wl,--gc-sections' -mthumb -mcpu=cortex-m7 -mfloat-abi=hard -mfpu=fpv5-d16 '-Wl,--entry=Reset_Handler' -std=c++17 -Isource -Iinclude -Iinclude\CMSIS -D__SAME70N21__ bin/drivers.o $cpp_files --specs=nosys.specs -T $link_file -o bin\code.elf"
Invoke-Expression "& '$toolchain_path\arm-none-eabi-objcopy.exe' -O binary bin\code.elf bin\code.bin"
The script is written in PowerShell; I'll explain it a little. $c_files is just a string with every C file to be compiled, separated by a space. $objects is an array of strings containing every file listed in $c_files but with the ".c" extension replaced by ".o". I've done this to link all the compiled C files into "drivers.o". Finally, the C++ code is compiled with drivers.o as an argument, and then I generate the .bin file to upload to the MCU.
The code is compiled with the latest GNU Arm Embedded toolchain. I must have made a mistake somewhere, but I don't know where, and I don't have a debugger to inspect the code at runtime.
EDIT 2: Both variants work properly without optimizations. If I pass -O1 or higher to the compiler, the second variant stops working, and I don't understand why.
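One thing I can do without a debugger is compare the disassembly of the two builds (a standard objdump invocation, shown here as a sketch):
arm-none-eabi-objdump -d bin\code.elf > bin\code.lst # disassemble; repeat for the -O1 build and diff the two listings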

Undefined symbol when trying to link with shared library built from CUDA objects

I'm experimenting with building a simple application from a couple of .cu source files and a very simple C++ main that calls a function from one of the .cu files. I'm making a shared library (.so file) from the compiled .cu files. Everything builds without trouble, but when I try to run the application, I get a runtime "undefined symbol" error with the mangled name of the .cu function I call from main(). If I build a static library instead, my application runs just fine. Here's the makefile I've set up:
.PHONY: clean
NVCCFLAGS = -std=c++11 --compiler-options '-fPIC'
CXXFLAGS = -std=c++11
HLIB = libhello.a
SHLIB = libhello.so
CUDA_OBJECTS = bridge.o add.o
all: driver
%.o :: %.cu
nvcc -o $@ $(NVCCFLAGS) -c -I. $<
%.o :: %.cpp
c++ $(CXXFLAGS) -o $@ -c -I. $<
$(HLIB): $(CUDA_OBJECTS)
ar rcs $@ $^
$(SHLIB): $(CUDA_OBJECTS)
nvcc $(NVCCFLAGS) --shared -o $@ $^
#driver : driver.o $(HLIB)
# c++ -std=c++11 -fPIC -o $@ driver.o -L. -lhello -L/usr/local/cuda-10.1/targets/x86_64-linux/lib -lcudart
driver : driver.o $(SHLIB)
c++ -std=c++11 -fPIC -o $@ driver.o -L. -lhello
clean:
-rm -f driver *.o *.so *.a
Here are the various source files that the makefile takes as fodder.
add.cu:
__global__ void add(int n, int* a, int* b, int* c) {
int index = threadIdx.x;
int stride = blockDim.x;
for (int ii = index; ii < n; ii += stride) {
c[ii] = a[ii] + b[ii];
}
}
add.h:
extern __global__ void add(int n, int* a, int* b, int* c);
bridge.cu:
#include <iostream>
#include "add.h"
void bridge() {
int N = 1 << 16;
int blockSize = 256;
int numBlocks = (N + blockSize - 1)/blockSize;
int* a;
int* b;
int* c;
cudaMallocManaged(&a, N*sizeof(int));
cudaMallocManaged(&b, N*sizeof(int));
cudaMallocManaged(&c, N*sizeof(int));
for (int ii = 0; ii < N; ii++) {
a[ii] = ii;
b[ii] = 2*ii;
}
add<<<numBlocks, blockSize>>>(N, a, b, c);
cudaDeviceSynchronize();
for (int ii = 0; ii < N; ii++) {
std::cout << a[ii] << " + " << b[ii] << " = " << c[ii] << std::endl;
}
cudaFree(a);
cudaFree(b);
cudaFree(c);
}
bridge.h:
extern void bridge();
driver.cpp:
#include "bridge.h"
int main() {
bridge();
return 0;
}
I'm very new to CUDA, so I expect that's where I'm doing something wrong. I've played a bit with extern "C" declarations, but that just seems to move the "undefined symbol" error from run time to build time.
I'm familiar with the various ways one can end up with an undefined symbol, and the experiments I've already performed (static linking, extern "C" declarations) make me think this problem isn't addressed by the proposed duplicate question.
My unresolved symbol is _Z6bridgev.
It looks to me as though the linker should be able to resolve the symbol. If I run nm on driver.o, I see:
0000000000000000 T main
U _Z6bridgev
And if I run nm on libhello.so, I see:
0000000000006e56 T _Z6bridgev
When Robert Crovella was able to get my example to work on his machine, while I wasn't able to get his example to work on mine, I started to realize that my problem had nothing to do with CUDA or nvcc. It was the fact that with a shared library, the loader has to resolve symbols at run time, and my shared library wasn't in a "well-known location". I built a simple test case just now, purely from C++ sources, and reproduced my failure. Once I copied libhello.so to /usr/local/lib, I was able to run driver successfully. So I'm OK with closing my original question, if that's the will of the people.
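For completeness, two standard ways to make the loader find a shared library that is not in a well-known location (general practice, not specific to this thread): point LD_LIBRARY_PATH at its directory, or bake an rpath into the executable at link time:
$ LD_LIBRARY_PATH=. ./driver
$ c++ -std=c++11 -fPIC -o driver driver.o -L. -lhello -Wl,-rpath,'$ORIGIN' # '$ORIGIN' = directory containing the executable
Either avoids copying libhello.so into /usr/local/lib.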

Enabling C++ exceptions on ARM bare-metal bootloader

For learning purposes, I am trying to get full C++ support on an ARM MCU (STM32F407ZE). I am struggling to get exceptions working, hence this question:
How do I get C++ exceptions working in a bare-metal ARM bootloader?
To extend the question a bit:
I understand that an exception, like exiting a function, requires unwinding the stack. The fact that exiting a function works out of the box while exception handling does not makes me think that the compiler generates the unwinding code for function exits but cannot do it for exceptions.
So sub-question 1 is: Is this premise correct? Do I really need to implement/integrate an unwinding library to get exception handling?
In my superficial understanding of unwinding, there is a frame on the stack, and the unwinder "just" needs to call the destructor on each object in it and finally jump to the given catch.
Sub-question 2 is: How does the unwinding library perform this task? What strategy does it use? (to the extent appropriate for an SO answer)
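To make that premise concrete, here is a minimal illustration (mine, not taken from the references below) of the work the unwinder has to do when an exception propagates:
struct Guard
{
~Guard() { /* must run during unwinding, before the catch is reached */ }
};
int main()
{
try
{
Guard g;
throw 42; // the unwinder destroys g, then transfers control to the catch
}
catch (int) { }
}
A normal function exit runs ~Guard() via code emitted at the end of the scope; exception propagation has to reach the same destructor from an arbitrary throw point, which is what the unwind tables describe.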
In my searches, I found many explanations of WHAT unwinding is, but very few of HOW to get it working. The closest is:
GCC arm-none-eabi (Codesourcery) and C++ Exceptions
The project
1) The first step, not without some difficulties, was to get the MCU powered and communicating through JTAG.
This is just contextual information; please do not tag the question off-topic just because of this picture. Jump to step 2 instead.
I know there are development boards available, but this is a learning project to get a better understanding of all the "magic" behind the scenes. So I got a chip socket and a breadboard, and set up the minimal power-up circuitry:
Note: JTAG is performed through the GPIO of a raspberry-pi.
Note2: I am using OpenOCD to communicate with the chip.
2) Second step, was to make a minimal software to blink the yellow led.
Using arm-none-eabi-g++ as a compiler and linker, the c++ code was straightforward, but my understanding of the linker script is still somewhat blurry.
3) Enable exception handling (not yet working).
For this goal, the following resources were useful:
https://wiki.osdev.org/C++_Exception_Support
https://itanium-cxx-abi.github.io/cxx-abi/exceptions.pdf
https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html
However, this seems like far too much complexity for simple exception handling, and before starting to implement/integrate an unwinding library, I would like to be sure I am going in the right direction.
I would like to avoid hearing in 2 weeks: "Oh, by the way, you just need to add this "-xx" option to the compiler and it works."
main.cpp
auto reset_handler() noexcept ->void;
auto main() -> int;
int global_variable_test=50;
extern "C"
{
#include "stm32f4xx.h"
#include "stm32f4xx_rcc.h"
#include "stm32f4xx_gpio.h"
void assert_failed(uint8_t* file, uint32_t line){}
void hardFaultHandler( unsigned int * hardFaultArgs);
// vector table
#define SRAM_SIZE 128*1024
#define SRAM_END (SRAM_BASE + SRAM_SIZE)
unsigned long *vector_table[] __attribute__((section(".vector_table"))) =
{
(unsigned long *)SRAM_END, // initial stack pointer
(unsigned long *)reset_handler, // main as Reset_Handler
};
}
auto reset_handler() noexcept -> void
{
// Setup execution
// Call the main function
int ret = main();
// never finish
while(true);
}
class A
{
public:
int b;
auto cppFunc()-> void
{
throw (int)4;
}
};
auto main() -> int
{
// Initializing led GPIO
RCC_AHB1PeriphClockCmd(RCC_AHB1Periph_GPIOG, ENABLE);
GPIO_InitTypeDef GPIO_InitDef;
GPIO_InitDef.GPIO_Pin = GPIO_Pin_13 | GPIO_Pin_14;
GPIO_InitDef.GPIO_OType = GPIO_OType_PP;
GPIO_InitDef.GPIO_Mode = GPIO_Mode_OUT;
GPIO_InitDef.GPIO_PuPd = GPIO_PuPd_NOPULL;
GPIO_InitDef.GPIO_Speed = GPIO_Speed_100MHz;
GPIO_Init(GPIOG, &GPIO_InitDef);
// Testing normal blinking
int loopNum = 500000;
for (int i=0; i<5; ++i)
{
loopNum = 100000;
GPIO_SetBits(GPIOG, GPIO_Pin_13 | GPIO_Pin_14);
for (int i = 0; i < loopNum; i++) continue; //active waiting!
loopNum = 800000;
GPIO_ResetBits(GPIOG, GPIO_Pin_13 | GPIO_Pin_14);
for (int i=0; i<loopNum; i++) continue; //active waiting!
}
// Try exceptions handling
try
{
A a;
a.cppFunc();
}
catch(...){}
return 0;
}
Makefile
CPP_C = arm-none-eabi-g++
C_C = arm-none-eabi-g++
LD = arm-none-eabi-g++
COPY = arm-none-eabi-objcopy
LKR_SCRIPT = -Tstm32_minimal.ld
INCLUDE = -I. -I./stm32f4xx/CMSIS/Device/ST/STM32F4xx/Include -I./stm32f4xx/CMSIS/Include -I./stm32f4xx/STM32F4xx_StdPeriph_Driver/inc -I./stm32f4xx/Utilities/STM32_EVAL/STM3240_41_G_EVAL -I./stm32f4xx/Utilities/STM32_EVAL/Common
C_FLAGS = -c -fexceptions -fno-common -O0 -g -mcpu=cortex-m4 -mthumb -DSTM32F40XX -DUSE_FULL_ASSERT -DUSE_STDPERIPH_DRIVER $(INCLUDE)
CPP_FLAGS = -std=c++11 -c $(C_FLAGS)
LFLAGS = -specs=nosys.specs -nostartfiles -nostdlib $(LKR_SCRIPT)
CPFLAGS = -Obinary
all: main.bin
main.o: main.cpp
$(CPP_C) $(CPP_FLAGS) -o main.o main.cpp
stm32f4xx_gpio.o: stm32f4xx_gpio.c
$(C_C) $(C_FLAGS) -o stm32f4xx_gpio.o stm32f4xx_gpio.c
stm32f4xx_rcc.o: stm32f4xx_rcc.c
$(C_C) $(C_FLAGS) -o stm32f4xx_rcc.o stm32f4xx_rcc.c
main.elf: main.o stm32f4xx_gpio.o stm32f4xx_rcc.o
$(LD) $(LFLAGS) -o main.elf main.o stm32f4xx_gpio.o stm32f4xx_rcc.o
main.bin: main.elf
$(COPY) $(CPFLAGS) main.elf main.bin
clean:
rm -rf *.o *.elf *.bin
write:
./write_bin.sh main.elf
Linker script: stm32_minimal.ld
/* memory layout for an STM32F407 */
MEMORY
{
FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 512K
SRAM (rwx) : ORIGIN = 0x20000000, LENGTH = 128K
}
/* output sections */
SECTIONS
{
/* program code into FLASH */
.text :
{
*(.vector_table) /* Vector table */
*(.text) /* Program code */
*(.data)
/**(.eh_frame)*/
} >FLASH
.ARM.exidx : /* Required for unwinding the stack? */
{
__exidx_start = .;
* (.ARM.exidx* .gnu.linkonce.armexidx.*)
__exidx_end = .;
} > FLASH
PROVIDE ( end = . );
}

Rcpp - Compiling outside the structure of a package

I have a question about using C++ code with Rcpp outside of a package structure.
To clarify my doubt, consider the C++ code (test.cpp) below:
// [[Rcpp::depends(RcppGSL)]]
#include <Rcpp.h>
#include <numeric>
#include <gsl/gsl_sf_bessel.h>
#include <RcppGSL.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_blas.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector timesTwo(NumericVector x) {
return x * 2;
}
// [[Rcpp::export]]
double my_bessel(double x){
return gsl_sf_bessel_J0 (x);
}
// [[Rcpp::export]]
int tamanho(NumericVector x){
int n = x.size();
return n;
}
// [[Rcpp::export]]
double soma2(NumericVector x){
double resultado = std::accumulate(x.begin(), x.end(), .0);
return resultado;
}
// [[Rcpp::export]]
Rcpp::NumericVector colNorm(const RcppGSL::Matrix & G) {
int k = G.ncol();
Rcpp::NumericVector n(k); // to store results
for (int j = 0; j < k; j++) {
RcppGSL::VectorView colview = gsl_matrix_const_column (G, j);
n[j] = gsl_blas_dnrm2(colview);
}
return n; // return vector
}
The above code works when it is inside the structure of a package: running Rcpp::compileAttributes() creates the file RcppExports.cpp, and the exported functions become accessible in the R environment.
My interest is in using the C++ functions implemented with Rcpp outside the framework of a package. For this I compiled the C++ code with g++ as follows:
g++ -I"/usr/include/R/" -DNDEBUG -I"/home/pedro/R/x86_64-pc-linux-gnu-library/3.5/Rcpp/include" -I"/home/pedro/Dropbox/UFPB/Redes Neurais e AnĂ¡lise de Agrupamento/Rcpp" -I /home/pedro/R/x86_64-pc-linux-gnu-library/3.5/RcppGSL/include -D_FORTIFY_SOURCE=2 -fpic -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt -c test.cpp -o test.o -lgsl -lgslcblas -lm
g++ -shared -L/usr/lib64/R/lib -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -o test.so test.o -L/usr/lib64/R/lib -lR -lgsl -lgslcblas -lm
The compilation completed successfully and no warning message was issued, producing the test.o and test.so files. Then in R, using the .Call interface, I did:
dyn.load("test.so")
my_function <- function(x){
.Call("soma2",x)
}
When trying to use my_function(), an error occurs stating that soma2 is not in the load table. Is there any way to create the RcppExports.cpp file outside the framework of a package? I guess the correct approach would have been to compile RcppExports.cpp rather than test.cpp.
Thanks in advance.
If you are working outside of a package you can simply use Rcpp::sourceCpp(<file>). This will take care of compiling, linking and providing an R wrapper for you. With your file I get:
> Rcpp::sourceCpp("test.cpp")
> soma2(1:5)
[1] 15
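Note that sourceCpp() registers every // [[Rcpp::export]] function directly, so the manual dyn.load()/.Call() wrapper is unnecessary. The other exports work the same way; for example (output assumes GSL is installed, since my_bessel calls gsl_sf_bessel_J0):
> my_bessel(1)
[1] 0.7651977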

Error: 'ALIGN' undeclared (first use in this function) with ALIGN defined into macro

I get a strange error when compiling the following source:
#include <stdio.h>
#include <stdlib.h>
#include <mach/mach_time.h>
#include <mm_malloc.h>
#ifdef SSE
#include <x86intrin.h>
#define ALIGN 16
void addition_tab(int size, double *a, double *b, double *c)
{
int i;
// Main loop
for (i=size-1; i>=0; i-=2)
{
// Intrinsic SSE syntax
const __m128d x = _mm_loadu_pd(a); // Load two x elements
const __m128d y = _mm_loadu_pd(b); // Load two y elements
const __m128d sum = _mm_add_pd(x, y); // Compute two sum elements
_mm_storeu_pd(c, sum); // Store two sum elements
// Increment pointers by 2 since SSE vectorizes on 128 bits = 16 bytes = 2*sizeof(double)
a += 2;
b += 2;
c += 2;
}
}
#endif
int main(int argc, char *argv[])
{
// Array index
int i;
// Array size as argument
int size = atoi(argv[1]);
// Time elapsed
uint64_t t1, t2;
float duration;
// Two input arrays
double *tab_x;
double *tab_y;
double *tab_z;
// Get the timebase info
mach_timebase_info_data_t info;
mach_timebase_info(&info);
#ifdef NOVEC
// Allocation
tab_x = (double*) malloc(size*sizeof(double));
tab_y = (double*) malloc(size*sizeof(double));
tab_z = (double*) malloc(size*sizeof(double));
#else
// Allocation
tab_x = (double*) _mm_malloc(size*sizeof(double),ALIGN);
tab_y = (double*) _mm_malloc(size*sizeof(double),ALIGN);
tab_z = (double*) _mm_malloc(size*sizeof(double),ALIGN);
#endif
}
If I compile with:
gcc-mp-4.9 -DNOVEC -O0 main.c -o exe
compilation succeeds, but with:
gcc-mp-4.9 -DSSE -O3 -msse main.c -o exe
I get the following error:
main.c: In function 'main':
main.c:96:52: error: 'ALIGN' undeclared (first use in this function)
tab_x = (double*) _mm_malloc(size*sizeof(double),ALIGN);
However, the ALIGN macro is defined if I pass the SSE macro with gcc-mp-4.9 -DSSE, isn't it?
I found the root cause in your script: you are not isolating the novec case, so the compilation with the NOVEC macro is always done. You could isolate it like this:
if [ "$1" == "novec" ]; then
# Compile no vectorized and vectorized executables
$GCC -DNOVEC -O0 main_benchmark.c -o noVectorizedExe
$GCC -DNOVEC -O0 main_benchmark.c -S -o noVectorizedExe.s
elif [ "$1" == "sse" ]; then
# Compile with SSE
$GCC -DSSE -O3 -msse main_benchmark.c -o vectorizedExe
$GCC -DSSE -O3 -msse main_benchmark.c -S -o vectorizedExe.s
echo "Test"
elif [ "$1" == "avx" ]; then
# Compile with AVX256
$GCC -DAVX256 -O3 -mavx main_benchmark.c -o vectorizedExe
$GCC -DAVX256 -O3 -mavx main_benchmark.c -S -o vectorizedExe.s
fi
EDIT
I found it: you have a typo!
$GCC -DNOVEV -O0 main_benchmark.c -S -o noVectorizedExe.s
should be
$GCC -DNOVEC -O0 main_benchmark.c -S -o noVectorizedExe.s
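A cheap guard against this class of mistake (my suggestion, not part of the original answer) is to make the preprocessor fail loudly when none of the expected variant macros is defined, e.g. near the top of main_benchmark.c:
#if !defined(NOVEC) && !defined(SSE) && !defined(AVX256)
#error "Build with -DNOVEC, -DSSE or -DAVX256"
#endif
That way a typo like -DNOVEV stops the build immediately instead of producing the confusing 'ALIGN undeclared' error later.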