C++ in ARM MCU: Need help to set up a simple timer

I'm programming an ATSAME70 and I'm trying to implement a simple timer using the SysTick interrupt available on Cortex-M MCUs, but I don't know what is going wrong.
If I write this code in a single main.cpp file:
// main.cpp
#include <cstdint>
#include "init.h"
#include "led.hpp"

volatile uint32_t g_ticks = 0;

extern "C" {
void SysTick_Handler(void)
{
    g_ticks++;
}
}

class Timer
{
private:
    uint32_t start;

public:
    Timer() : start(g_ticks) {}
    float elapsed() const { return (g_ticks - start) / 1000.0f; }
};

int main()
{
    init();
    SysTick_Config(300000000 / 1000); /* Clock is running at 300 MHz */

    Timer t;
    while (t.elapsed() < 1.0f);
    Led::on();

    while (true);
}
It works: the LED lights up after one second.
But if I try to keep things clean and split the program into the following files:
// timer.hpp
#include <cstdint>

class Timer
{
private:
    uint32_t start;

public:
    Timer();
    float elapsed() const;
};

// timer.cpp
#include "timer.hpp"

volatile uint32_t g_ticks = 0;

extern "C" {
void SysTick_Handler(void)
{
    g_ticks++;
}
}

Timer::Timer() : start(g_ticks) {}

float Timer::elapsed() const
{
    return (g_ticks - start) / 1000.0f;
}
// main.cpp
#include <cstdint>
#include "init.h"
#include "led.hpp"
#include "timer.hpp"

int main()
{
    init();
    SysTick_Config(300000000 / 1000); /* Clock is running at 300 MHz */

    Timer t;
    while (t.elapsed() < 1.0f);
    Led::on();

    while (true);
}
It doesn't work anymore: the program reaches the first while loop and gets stuck there. I think g_ticks is being corrupted when I read it in t.elapsed(), but I don't know what is happening. Does anybody know where I'm going wrong?
init() is just a function in which I initialize all needed registers.
EDIT: here are the command lines used to generate the code:
$toolchain_path = "C:\Program Files (x86)\GNU Tools ARM Embedded\8 2018-q4-major\bin";
$link_file = "source\device\same70_flash.ld"
$c_files = "include\sensors\bmi088\bmi088.c " +
...
"source\utils\syscalls.c";
$cpp_files = "source\device\init.cpp " +
...
"source\main.cpp";
Invoke-Expression "& '$toolchain_path\arm-none-eabi-gcc.exe' -c -s -O3 -fdata-sections -ffunction-sections '-Wl,--gc-sections' '-Wl,--entry=Reset_Handler' -mthumb -mcpu=cortex-m7 -mfloat-abi=hard -mfpu=fpv5-d16 -Isource -Iinclude\CMSIS -D__SAME70N21__ $c_files --specs=nosys.specs"
foreach ($c_file in $c_files.split(" "))
{
if ($objects) { $objects += " "; }
$objects += ($c_file.split("\")[-1]).split(".")[0] + ".o";
}
Invoke-Expression "& '$toolchain_path\arm-none-eabi-ld.exe' -s --entry=Reset_Handler -r $objects -o drivers.o"
foreach ($object in $objects.split(" ")) { Remove-Item $object; }
Move-Item drivers.o bin\drivers.o -force
Invoke-Expression "& '$toolchain_path\arm-none-eabi-g++.exe' -s -O3 -fdata-sections -ffunction-sections '-Wl,--gc-sections' -mthumb -mcpu=cortex-m7 -mfloat-abi=hard -mfpu=fpv5-d16 '-Wl,--entry=Reset_Handler' -std=c++17 -Isource -Iinclude -Iinclude\CMSIS -D__SAME70N21__ bin/drivers.o $cpp_files --specs=nosys.specs -T $link_file -o bin\code.elf"
Invoke-Expression "& '$toolchain_path\arm-none-eabi-objcopy.exe' -O binary bin\code.elf bin\code.bin"
The script is written in PowerShell; let me explain it a bit. $c_files is a string with every C file to be compiled, separated by spaces. $objects holds the corresponding object file names, i.e. each file listed in $c_files with its path stripped and the ".c" extension replaced by ".o". I do this so I can link all the compiled C files into "drivers.o". Finally, the C++ code is compiled with this drivers.o as an additional input, and then I generate the .bin file to upload to the MCU.
The code is compiled with the latest GNU Arm Embedded toolchain. I must have made a mistake somewhere, but I don't know where, and I don't have a debugger to step through the code at runtime.
EDIT 2: Both variants work properly without optimizations. If I pass -O1 or higher to the compiler, the second variant stops working, and I don't understand why.

Related

Unresolved extern function error with template default parameter in CUDA9.2 and above

I am working with some c++/CUDA code that makes significant use of templates for both classes and functions. We have mostly been using CUDA 9.0 and 9.1, where everything compiles and runs fine. However, compilation fails on newer versions of CUDA (specifically 9.2 and 10).
After further investigation, it seems that trying to compile exactly the same code with CUDA version 9.2.88 and above will fail, whereas with CUDA version 8 through 9.1.85 the code compiles and runs correctly.
A minimal example of the problematic code can be written as follows:
#include <iostream>
template<typename Pt>
using Link_force = void(Pt* x, Pt* y);
template<typename Pt>
__device__ void linear_force(Pt* x, Pt* y)
{
*x += *y;
}
template<typename Pt, Link_force<Pt> force>
__global__ void link(Pt* x, Pt* y)
{
force(x, y);
}
template<typename Pt = float, Link_force<Pt> force = linear_force<Pt>>
void apply_forces(Pt* x, Pt* y)
{
link<Pt, force><<<1, 1, 0>>>(x, y);
}
int main(int argc, const char* argv[])
{
float *x, *y;
cudaMallocManaged(&x, sizeof(float));
cudaMallocManaged(&y, sizeof(float));
*x = 0.0f;
*y = 42.0f;
std::cout << "Pre :: x = " << *x << ", y = " << *y << '\n';
apply_forces(x, y);
cudaDeviceSynchronize();
std::cout << "Post :: x = " << *x << ", y = " << *y << '\n';
return 0;
}
If I compile with nvcc, as below, the eventual result is an error from ptxas:
$ nvcc --verbose -std=c++11 -arch=sm_61 minimal_example.cu
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/usr/local/cuda-9.2/bin
#$ _THERE_=/usr/local/cuda-9.2/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_SIZE_=64
#$ TOP=/usr/local/cuda-9.2/bin/..
#$ NVVMIR_LIBRARY_DIR=/usr/local/cuda-9.2/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/usr/local/cuda-9.2/bin/../lib:/usr/local/cuda-9.2/lib64:
#$ PATH=/usr/local/cuda-9.2/bin/../nvvm/bin:/usr/local/cuda-9.2/bin:/usr/local/cuda-9.2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
#$ INCLUDES="-I/usr/local/cuda-9.2/bin/..//include"
#$ LIBRARIES= "-L/usr/local/cuda-9.2/bin/..//lib64/stubs" "-L/usr/local/cuda-9.2/bin/..//lib64"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -std=c++11 -D__CUDA_ARCH__=610 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda-9.2/bin/..//include" -D"__CUDACC_VER_BUILD__=148" -D"__CUDACC_VER_MINOR__=2" -D"__CUDACC_VER_MAJOR__=9" -include "cuda_runtime.h" -m64 "minimal_example.cu" > "/tmp/tmpxft_0000119e_00000000-8_minimal_example.cpp1.ii"
#$ cicc --c++11 --gnu_version=70300 --allow_managed -arch compute_61 -m64 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0000119e_00000000-2_minimal_example.fatbin.c" -tused -nvvmir-library "/usr/local/cuda-9.2/bin/../nvvm/libdevice/libdevice.10.bc" --gen_module_id_file --module_id_file_name "/tmp/tmpxft_0000119e_00000000-3_minimal_example.module_id" --orig_src_file_name "minimal_example.cu" --gen_c_file_name "/tmp/tmpxft_0000119e_00000000-5_minimal_example.cudafe1.c" --stub_file_name "/tmp/tmpxft_0000119e_00000000-5_minimal_example.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0000119e_00000000-5_minimal_example.cudafe1.gpu" "/tmp/tmpxft_0000119e_00000000-8_minimal_example.cpp1.ii" -o "/tmp/tmpxft_0000119e_00000000-5_minimal_example.ptx"
#$ ptxas -arch=sm_61 -m64 "/tmp/tmpxft_0000119e_00000000-5_minimal_example.ptx" -o "/tmp/tmpxft_0000119e_00000000-9_minimal_example.sm_61.cubin"
ptxas fatal : Unresolved extern function '_Z12linear_forceIfEvPT_S1_'
# --error 0xff --
As far as I can tell, the error only occurs when using the default template parameter Link_force<Pt> force = linear_force<Pt> in the template definition for apply_forces. For example, explicitly specifying the template parameters in main
apply_forces<float, linear_force>(x, y);
where we call apply_forces results in everything compiling and running correctly, as does specifying the template parameters explicitly in any other way.
Is it likely that this is a problem with the nvcc toolchain? I didn't spot any changes in the CUDA release notes that would be a likely culprit, so I'm a bit stumped.
Since this was working with older versions of nvcc and now is not, I don't understand whether this is in fact an illegitimate use of default template parameters (perhaps specifically when combined with CUDA kernel functions)?
This is a bug in CUDA 9.2 and 10.0 and a fix is being worked on. Thanks for pointing it out.
One possible workaround, as you've already pointed out, would be to revert to CUDA 9.1.
Another possible workaround is to repeat the offending template instantiation in the body of the function (e.g. in a discarded statement). This has no impact on performance, it just forces the compiler to emit code for that function:
template<typename Pt = float, Link_force<Pt> force = linear_force<Pt>>
void apply_forces(Pt* x, Pt* y)
{
(void)linear_force<Pt>; // add this
link<Pt, force><<<1, 1, 0>>>(x, y);
}
I don't have further information on when a fix will be available, but it will be in a future CUDA release.

Enabling C++ exceptions on ARM bare-metal bootloader

For learning purposes, I am trying to get full C++ support on an ARM MCU (STM32F407ZE). I am struggling to get exceptions working, hence this question:
How to get C++ exceptions on a bare-metal ARM bootloader?
To expand a bit on the question:
I understand that throwing an exception, like returning from a function, requires unwinding the stack. The fact that returning from a function works out of the box while exception handling does not makes me think that the compiler generates the unwinding code for function exits but cannot do it for exceptions.
So sub-question 1 is: is this premise correct? Do I really need to implement/integrate an unwinding library to get exception handling?
In my superficial understanding of unwinding, there is a frame on the stack, and unwinding "just" needs to call the destructor of each object in it and finally jump to the given catch.
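To make that premise concrete, here is a small hosted C++ example (nothing bare-metal or project-specific assumed) of exactly the work the unwinder has to do between a throw and the matching catch:
#include <cstdio>

struct Guard
{
    const char *name;
    explicit Guard(const char *n) : name(n) {}
    ~Guard() { std::printf("unwinding: ~Guard(%s)\n", name); }
};

void inner()
{
    Guard g("inner");       // must be destroyed while the exception propagates
    throw 42;
}

int main()
{
    try
    {
        Guard g("try-block");   // also destroyed before the handler runs
        inner();
    }
    catch (int e)
    {
        std::printf("caught %d\n", e);   // reached only after both destructors ran
    }
    return 0;
}
On a hosted system the compiler plus libgcc/libstdc++ provide the runtime pieces that walk from the throw back to the catch and run those destructors; on bare metal that support is what has to be provided or linked in.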
Sub-question 2 is: how does the unwinding library perform this task? What strategy is used? (to the extent appropriate for an SO answer)
In my searches, I found many explanations of WHAT unwinding is, but very few of how to get it working. The closest is:
GCC arm-none-eabi (Codesourcery) and C++ Exceptions
The project
1) The first step, and not without difficulties, was to get the MCU powered and communicating through JTAG.
This is just contextual information; jump to step 2 if you are only interested in the exceptions part.
I know there are evaluation boards available, but this is a learning project to get a better understanding of all the "magic" behind the scenes. So I got a chip socket and a breadboard and set up the minimal power-up circuitry.
Note: JTAG is driven through the GPIOs of a Raspberry Pi.
Note 2: I am using OpenOCD to communicate with the chip.
2) The second step was to write a minimal program to blink the yellow LED.
Using arm-none-eabi-g++ as compiler and linker, the C++ code was straightforward, but my understanding of the linker script is still somewhat blurry.
3) Enable exception handling (not yet working).
For this goal, the following resources were useful:
https://wiki.osdev.org/C++_Exception_Support
https://itanium-cxx-abi.github.io/cxx-abi/exceptions.pdf
https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html
However, this seems like a lot of complexity for simple exception handling, and before starting to implement or integrate an unwinding library, I would like to be sure I am going in the right direction.
I would like to avoid hearing in two weeks: "Oh, by the way, you just need to add this -xx option to the compiler and it works."
main.cpp
auto reset_handler() noexcept ->void;
auto main() -> int;
int global_variable_test=50;
extern "C"
{
#include "stm32f4xx.h"
#include "stm32f4xx_rcc.h"
#include "stm32f4xx_gpio.h"
void assert_failed(uint8_t* file, uint32_t line){}
void hardFaultHandler( unsigned int * hardFaultArgs);
// vector table
#define SRAM_SIZE 128*1024
#define SRAM_END (SRAM_BASE + SRAM_SIZE)
unsigned long *vector_table[] __attribute__((section(".vector_table"))) =
{
(unsigned long *)SRAM_END, // initial stack pointer
(unsigned long *)reset_handler, // main as Reset_Handler
};
}
auto reset_handler() noexcept -> void
{
// Setup execution
// Call the main function
int ret = main();
// never finish
while(true);
}
class A
{
public:
int b;
auto cppFunc()-> void
{
throw (int)4;
}
};
auto main() -> int
{
// Initializing led GPIO
RCC_AHB1PeriphClockCmd(RCC_AHB1Periph_GPIOG, ENABLE);
GPIO_InitTypeDef GPIO_InitDef;
GPIO_InitDef.GPIO_Pin = GPIO_Pin_13 | GPIO_Pin_14;
GPIO_InitDef.GPIO_OType = GPIO_OType_PP;
GPIO_InitDef.GPIO_Mode = GPIO_Mode_OUT;
GPIO_InitDef.GPIO_PuPd = GPIO_PuPd_NOPULL;
GPIO_InitDef.GPIO_Speed = GPIO_Speed_100MHz;
GPIO_Init(GPIOG, &GPIO_InitDef);
// Testing normal blinking
int loopNum = 500000;
for (int i=0; i<5; ++i)
{
loopNum = 100000;
GPIO_SetBits(GPIOG, GPIO_Pin_13 | GPIO_Pin_14);
for (int i = 0; i < loopNum; i++) continue; //active waiting!
loopNum = 800000;
GPIO_ResetBits(GPIOG, GPIO_Pin_13 | GPIO_Pin_14);
for (int i=0; i<loopNum; i++) continue; //active waiting!
}
// Try exceptions handling
try
{
A a;
a.cppFunc();
}
catch(...){}
return 0;
}
Makefile
CPP_C = arm-none-eabi-g++
C_C = arm-none-eabi-g++
LD = arm-none-eabi-g++
COPY = arm-none-eabi-objcopy
LKR_SCRIPT = -Tstm32_minimal.ld
INCLUDE = -I. -I./stm32f4xx/CMSIS/Device/ST/STM32F4xx/Include -I./stm32f4xx/CMSIS/Include -I./stm32f4xx/STM32F4xx_StdPeriph_Driver/inc -I./stm32f4xx/Utilities/STM32_EVAL/STM3240_41_G_EVAL -I./stm32f4xx/Utilities/STM32_EVAL/Common
C_FLAGS = -c -fexceptions -fno-common -O0 -g -mcpu=cortex-m4 -mthumb -DSTM32F40XX -DUSE_FULL_ASSERT -DUSE_STDPERIPH_DRIVER $(INCLUDE)
CPP_FLAGS = -std=c++11 -c $(C_FLAGS)
LFLAGS = -specs=nosys.specs -nostartfiles -nostdlib $(LKR_SCRIPT)
CPFLAGS = -Obinary
all: main.bin
main.o: main.cpp
$(CPP_C) $(CPP_FLAGS) -o main.o main.cpp
stm32f4xx_gpio.o: stm32f4xx_gpio.c
$(C_C) $(C_FLAGS) -o stm32f4xx_gpio.o stm32f4xx_gpio.c
stm32f4xx_rcc.o: stm32f4xx_rcc.c
$(C_C) $(C_FLAGS) -o stm32f4xx_rcc.o stm32f4xx_rcc.c
main.elf: main.o stm32f4xx_gpio.o stm32f4xx_rcc.o
$(LD) $(LFLAGS) -o main.elf main.o stm32f4xx_gpio.o stm32f4xx_rcc.o
main.bin: main.elf
$(COPY) $(CPFLAGS) main.elf main.bin
clean:
rm -rf *.o *.elf *.bin
write:
./write_bin.sh main.elf
Linker script: stm32_minimal.ld
/* memory layout for an STM32F407 */
MEMORY
{
FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 512K
SRAM (rwx) : ORIGIN = 0x20000000, LENGTH = 128K
}
/* output sections */
SECTIONS
{
/* program code into FLASH */
.text :
{
*(.vector_table) /* Vector table */
*(.text) /* Program code */
*(.data)
/**(.eh_frame)*/
} >FLASH
.ARM.exidx : /* Required for unwinding the stack? */
{
__exidx_start = .;
* (.ARM.exidx* .gnu.linkonce.armexidx.*)
__exidx_end = .;
} > FLASH
PROVIDE ( end = . );
}

linker error: undefined symbol

What I'm trying to do
I am attempting to create two C++ classes:
One, named Agent, which will be a member of the second class.
Two, named Env, which will be exposed to Python through Boost.Python (though I suspect this detail is inconsequential to my problem).
The problem
After a successful build with my makefile, I attempt to run my Python script and I receive an import error on my extension module (the C++ code) that reads "undefined symbol: _ZN5AgentC1Effff". All the Boost.Python stuff aside, I believe this to be a simple C++ linker error.
Here are my files:
Agent.h
class Agent {
public:
float xy_pos[2];
float xy_vel[2];
float yaw;
float z_pos;
Agent(float x_pos, float y_pos, float yaw, float z_pos);
};
Agent.cpp
#include "Agent.h"
Agent::Agent(float x_pos, float y_pos, float yaw, float z_pos)
{
xy_vel[0] = 0;
xy_vel[1] = 0;
xy_pos[0] = x_pos;
xy_pos[1] = y_pos;
this->z_pos = z_pos; // the parameters shadow the members, so qualify with this->
this->yaw = yaw;
};
test_ext.cpp (where my Env class lives)
#include "Agent.h"
#include <boost/python.hpp>
#include <boost/python/numpy.hpp>
namespace p = boost::python;
namespace np = boost::python::numpy;
class Env{
public:
Agent * agent;
//some other members
Env() {
agent = new Agent(13, 10, 0, 2);
}
np::ndarray get_agent_vel() {
return np::from_data(agent->xy_vel, np::dtype::get_builtin<float>(),
p::make_tuple(2),
p::make_tuple(sizeof(float)),
p::object());
}
void set_agent_vel(np::ndarray vel) {
agent->xy_vel[0] = p::extract<float>(vel[0]);
agent->xy_vel[1] = p::extract<float>(vel[1]);
}
};
BOOST_PYTHON_MODULE(test_ext) {
using namespace boost::python;
class_<Env>("Env")
.def("set_agent_vel", &Env::set_agent_vel)
.def("get_agent_vel", &Env::get_agent_vel)
}
Makefile
PYTHON_VERSION = 3.5
PYTHON_INCLUDE = /usr/include/python$(PYTHON_VERSION)
# location of the Boost Python include files and library
BOOST_INC = /usr/local/include/boost_1_66_0
BOOST_LIB = /usr/local/include/boost_1_66_0/stage/lib/
# compile mesh classes
TARGET = test_ext
CFLAGS = --std=c++11
$(TARGET).so: $(TARGET).o
g++ -shared -Wl,--export-dynamic $(TARGET).o -L$(BOOST_LIB) -lboost_python3 -lboost_numpy3 -L/usr/lib/python3.5/config-3.5m-x86_64-linux-gnu -lpython3.5 -o $(TARGET).so
$(TARGET).o: $(TARGET).cpp Agent.o
g++ -I$(PYTHON_INCLUDE) -I$(BOOST_INC) -fPIC -c $(TARGET).cpp $(CFLAGS)
Agent.o: Agent.cpp Agent.h
g++ -c -Wall Agent.cpp $(CFLAGS)
You never link with Agent.o anywhere.
First, you need to build it the way you build test_ext.o, with the same flags. Then you need to actually link with Agent.o when creating the shared library, as sketched below.
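A sketch of the corrected rules, reusing the variables and flags from the makefile above (recipe lines must start with a tab): build Agent.o with -fPIC like test_ext.o, and add it to the link line of the shared library:
Agent.o: Agent.cpp Agent.h
	g++ -I$(PYTHON_INCLUDE) -I$(BOOST_INC) -fPIC -c Agent.cpp $(CFLAGS)

$(TARGET).so: $(TARGET).o Agent.o
	g++ -shared -Wl,--export-dynamic $(TARGET).o Agent.o -L$(BOOST_LIB) -lboost_python3 -lboost_numpy3 -L/usr/lib/python3.5/config-3.5m-x86_64-linux-gnu -lpython3.5 -o $(TARGET).so
With Agent.o on the link line, the constructor the module references (the mangled _ZN5AgentC1Effff, i.e. Agent::Agent(float, float, float, float)) is resolved inside test_ext.so instead of being left undefined.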

Error: 'ALIGN' undeclared (first use in this function) with ALIGN defined into macro

I get a strange error when compiling the following source:
#include <stdio.h>
#include <stdlib.h>
#include <mach/mach_time.h>
#include <mm_malloc.h>
#ifdef SSE
#include <x86intrin.h>
#define ALIGN 16
void addition_tab(int size, double *a, double *b, double *c)
{
int i;
// Main loop
for (i=size-1; i>=0; i-=2)
{
// Intrinsic SSE syntax
const __m128d x = _mm_loadu_pd(a); // Load two x elements
const __m128d y = _mm_loadu_pd(b); // Load two y elements
const __m128d sum = _mm_add_pd(x, y); // Compute two sum elements
_mm_storeu_pd(c, sum); // Store two sum elements
// Increment pointers by 2 since SSE vectorizes on 128 bits = 16 bytes = 2*sizeof(double)
a += 2;
b += 2;
c += 2;
}
}
#endif
int main(int argc, char *argv[])
{
// Array index
int i;
// Array size as argument
int size = atoi(argv[1]);
// Time elapsed
uint64_t t1, t2;
float duration;
// Two input arrays
double *tab_x;
double *tab_y;
double *tab_z;
// Get the timebase info
mach_timebase_info_data_t info;
mach_timebase_info(&info);
#ifdef NOVEC
// Allocation
tab_x = (double*) malloc(size*sizeof(double));
tab_y = (double*) malloc(size*sizeof(double));
tab_z = (double*) malloc(size*sizeof(double));
#else
// Allocation
tab_x = (double*) _mm_malloc(size*sizeof(double),ALIGN);
tab_y = (double*) _mm_malloc(size*sizeof(double),ALIGN);
tab_z = (double*) _mm_malloc(size*sizeof(double),ALIGN);
#endif
}
If I compile with:
gcc-mp-4.9 -DNOVEC -O0 main.c -o exe
compilation succeeds, but with:
gcc-mp-4.9 -DSSE -O3 -msse main.c -o exe
I get the following error:
main.c: In function 'main':
main.c:96:52: error: 'ALIGN' undeclared (first use in this function)
tab_x = (double*) _mm_malloc(size*sizeof(double),ALIGN);
However, the ALIGN macro is defined if I pass the SSE macro with gcc-mp-4.9 -DSSE, isn't it?
I found the root cause in your script: you are not isolating the novec case, so the compilation with the NOVEC macro is always performed. You could isolate it like this:
if [ "$1" == "novec" ]; then
# Compile no vectorized and vectorized executables
$GCC -DNOVEC -O0 main_benchmark.c -o noVectorizedExe
$GCC -DNOVEC -O0 main_benchmark.c -S -o noVectorizedExe.s
elif [ "$1" == "sse" ]; then
# Compile with SSE
$GCC -DSSE -O3 -msse main_benchmark.c -o vectorizedExe
$GCC -DSSE -O3 -msse main_benchmark.c -S -o vectorizedExe.s
echo "Test"
elif [ "$1" == "avx" ]; then
# Compile with AVX256
$GCC -DAVX256 -O3 -mavx main_benchmark.c -o vectorizedExe
$GCC -DAVX256 -O3 -mavx main_benchmark.c -S -o vectorizedExe.s
fi
EDIT
I found it: you have a typo!
$GCC -DNOVEV -O0 main_benchmark.c -S -o noVectorizedExe.s
should be
$GCC -DNOVEC -O0 main_benchmark.c -S -o noVectorizedExe.s

"Bad" GCC optimization performance

I am trying to understand why using -O2 -march=native with GCC gives slower code than not using them.
Note that I am using MinGW (GCC 4.7.1) under Windows 7.
Here is my code:
struct.hpp :
#ifndef STRUCT_HPP
#define STRUCT_HPP
#include <iostream>
class Figure
{
public:
Figure(char *pName);
virtual ~Figure();
char *GetName();
double GetArea_mm2(int factor);
private:
char name[64];
virtual double GetAreaEx_mm2() = 0;
};
class Disk : public Figure
{
public:
Disk(char *pName, double radius_mm);
~Disk();
private:
double radius_mm;
virtual double GetAreaEx_mm2();
};
class Square : public Figure
{
public:
Square(char *pName, double side_mm);
~Square();
private:
double side_mm;
virtual double GetAreaEx_mm2();
};
#endif
struct.cpp :
#include <cstdio>
#include "struct.hpp"
Figure::Figure(char *pName)
{
sprintf(name, pName);
}
Figure::~Figure()
{
}
char *Figure::GetName()
{
return name;
}
double Figure::GetArea_mm2(int factor)
{
return (double)factor*GetAreaEx_mm2();
}
Disk::Disk(char *pName, double radius_mm_) :
Figure(pName), radius_mm(radius_mm_)
{
}
Disk::~Disk()
{
}
double Disk::GetAreaEx_mm2()
{
return 3.1415926*radius_mm*radius_mm;
}
Square::Square(char *pName, double side_mm_) :
Figure(pName), side_mm(side_mm_)
{
}
Square::~Square()
{
}
double Square::GetAreaEx_mm2()
{
return side_mm*side_mm;
}
main.cpp
#include <iostream>
#include <cstdio>
#include "struct.hpp"
double Do(int n)
{
double sum_mm2 = 0.0;
const int figuresCount = 10000;
Figure **pFigures = new Figure*[figuresCount];
for (int i = 0; i < figuresCount; ++i)
{
if (i % 2)
pFigures[i] = new Disk((char *)"-Disque", i);
else
pFigures[i] = new Square((char *)"-Carré", i);
}
for (int a = 0; a < n; ++a)
{
for (int i = 0; i < figuresCount; ++i)
{
sum_mm2 += pFigures[i]->GetArea_mm2(i);
sum_mm2 += (double)(pFigures[i]->GetName()[0] - '-');
}
}
for (int i = 0; i < figuresCount; ++i)
delete pFigures[i];
delete[] pFigures;
return sum_mm2;
}
int main()
{
double a = 0;
StartChrono(); // home made lib, working fine
a = Do(10000);
double elapsedTime_ms = StopChrono();
std::cout << "Elapsed time : " << elapsedTime_ms << " ms" << std::endl;
return (int)a % 2; // To force the optimizer to keep the Do() call
}
I compile this code twice:
1: Without optimization
mingw32-g++.exe -Wall -fexceptions -std=c++11 -c main.cpp -o main.o
mingw32-g++.exe -Wall -fexceptions -std=c++11 -c struct.cpp -o struct.o
mingw32-g++.exe -o program.exe main.o struct.o -s
2: With -O2 optimization
mingw32-g++.exe -Wall -fexceptions -O2 -march=native -std=c++11 -c main.cpp -o main.o
mingw32-g++.exe -Wall -fexceptions -O2 -march=native -std=c++11 -c struct.cpp -o struct.o
mingw32-g++.exe -o program.exe main.o struct.o -s
1: Execution time:
1196 ms (1269 ms with Visual Studio 2013)
2: Execution time:
1569 ms (403 ms with Visual Studio 2013)!
Using -O3 instead of -O2 does not improve the results.
I was, and still am, pretty convinced that GCC and Visual Studio are roughly equivalent, so I don't understand this huge difference.
Also, I don't understand why the optimized version is slower than the non-optimized version with GCC.
Am I missing something here?
(Note that I had the same problem with genuine GCC 4.8.2 on Ubuntu.)
Thanks for your help.
Since I can't see the assembly code, I'm going to speculate the following:
The allocation loop can be optimized by the compiler by removing the if clause, producing something like this:
for (int i=0;i <10000 ; i+=2)
{
pFigures[i] = new Square(...);
}
for (int i=1;i <10000 ; i +=2)
{
pFigures[i] = new Disk(...);
}
Considering that the iteration count is a multiple of 4, it can be made even more "efficient":
for (int i=0;i < 10000 ;i+=2*4)
{
pFigures[i] = ...
pFigures[i+2] = ...
pFigures[i+4] = ...
pFigures[i+6] = ...
}
Memory-wise, this means Disks are allocated four at a time and Squares four at a time, so objects of the same type end up next to each other in memory.
Next, you iterate the array 10000 times in normal order (by normal I mean index after index).
Think about where these shapes end up in memory: you can get about 4 times more cache misses (think of the boundary case where 4 disks and 4 squares sit in different pages; you would switch between the pages 8 times, whereas in the normal scenario you would switch only once).
This sort of optimization (if done by the compiler, and on your particular code) speeds up the allocation, but not the access, which in your example is the biggest load.
Test this by removing the i % 2 and see what results you get; a sketch of such a test follows below.
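One way to run that test is to keep the question's Do() function but allocate the two kinds in separate passes, so the layout matches the transformation speculated above, and compare the timings (a sketch reusing Figure, Disk and Square from struct.hpp; the timing harness stays whatever you already use):
#include "struct.hpp"

// Same as Do() from the question, but with the i % 2 branch removed:
// all Squares are allocated first, then all Disks, so same-type objects
// end up grouped in memory instead of interleaved.
double DoGrouped(int n)
{
    double sum_mm2 = 0.0;
    const int figuresCount = 10000;
    Figure **pFigures = new Figure*[figuresCount];

    for (int i = 0; i < figuresCount; i += 2)
        pFigures[i] = new Square((char *)"-Carré", i);
    for (int i = 1; i < figuresCount; i += 2)
        pFigures[i] = new Disk((char *)"-Disque", i);

    for (int a = 0; a < n; ++a)
        for (int i = 0; i < figuresCount; ++i)
        {
            sum_mm2 += pFigures[i]->GetArea_mm2(i);
            sum_mm2 += (double)(pFigures[i]->GetName()[0] - '-');
        }

    for (int i = 0; i < figuresCount; ++i)
        delete pFigures[i];
    delete[] pFigures;
    return sum_mm2;
}
If the speculation is right, this version should be slower to iterate than the original interleaved allocation, regardless of the optimization level used.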
Again this is pure speculation, and it assumes that the reason for lower performance was a loop optimization.
I suspect that you've got an issue unique to the combination of mingw/gcc/glibc on Windows because your code performs faster with optimizations on Linux where gcc is altogether more 'at home'.
On a fairly pedestrian Linux VM using gcc 4.8.2:
$ g++ main.cpp struct.cpp
$ time a.out
real 0m2.981s
user 0m2.876s
sys 0m0.079s
$ g++ -O2 main.cpp struct.cpp
$ time a.out
real 0m1.629s
user 0m1.523s
sys 0m0.041s
...and if you really take the blinkers off the optimizer by deleting struct.cpp and moving the implementation all inline:
$ time a.out
real 0m0.550s
user 0m0.543s
sys 0m0.000s