I'm writing a tiny system kernel following this video series: https://www.youtube.com/watch?v=1rnA6wpF0o4&list=PLHh55M_Kq4OApWScZyPl5HhgsTJS9MZ6M.
I've run into a problem that doesn't appear in the video: execution blocks in the getDriver function while the PCI controller is identifying all devices.
//PCI controller function
void PeripheralComponentInterconnectController::selectDrivers(DriverManager* driver_manager, InterruptManager* interrupt_manager)
{
    // printf("{ void PeripheralComponentInterconnectController::selectDrivers\n");
    int function_num = 0;
    for (int bus = 0; bus < 8; ++bus) {
        for (int device = 0; device < 32; ++device) {
            function_num = deviceHasFunction(bus, device) ? 8 : 1;
            for (int function = 0; function < function_num; ++function) {
                PeripheralComponentInterconnectDeviceDescriptor dev = getDeviceDescriptor(bus, device, function);
                if (dev.vendor_id == 0 || dev.vendor_id == 0xFFFF) {
                    continue;
                }

                for (int bar_num = 0; bar_num < 6; ++bar_num) {
                    BaseAddreeRegister bar = getBaseAddressRegister(bus, device, function, bar_num);
                    if (bar.addr && (bar.type == InputOutput))
                        dev.port_base = (uint32_t)bar.addr;
                }

                // blocked in this getDriver
                Driver* driver = getDriver(dev, interrupt_manager);
                if (driver != 0) {
                    driver_manager->AddDriver(driver);
                }
                //end

                printf("FOUND DEVICE:");
                printf("PCI BUS:");
                printfHex(bus & 0xFF);
                printf(",DEVICE:");
                printfHex(device & 0xFF);
                printf(", FUNCTION:");
                printfHex(function & 0xFF);
                printf(" = VENDOR_ID:");
                printfHex((dev.vendor_id & 0xFF00) >> 8);
                printfHex(dev.vendor_id & 0xFF);
                printf(",DEVICE_ID:");
                printfHex((dev.device_id & 0xFF00) >> 8);
                printfHex(dev.device_id & 0xFF);
                printf("\n");
            }
        }
    }
}

//getDriver function:
Driver*
PeripheralComponentInterconnectController::getDriver(PeripheralComponentInterconnectDeviceDescriptor dev,
                                                     InterruptManager* interrupt_manager)
{
    return 0;
}
output:
------------------------------------------------------------------------------
As you can see, there is nothing in getDriver now, but execution still blocks there.
I downloaded the source code that accompanies the video and hit the same problem, so I'm sure it's something specific to my machine or toolchain.
I thought it might have something to do with the stack (maybe a stack overflow :). So I changed the parameter PeripheralComponentInterconnectDeviceDescriptor dev to PeripheralComponentInterconnectDeviceDescriptor* dev to reduce stack usage, and then it worked fine.
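The change that made it run looks roughly like this (a sketch of my workaround):

// workaround (sketch): take the descriptor by pointer so the struct is not
// copied onto the stack for every call; called as getDriver(&dev, interrupt_manager)
Driver* PeripheralComponentInterconnectController::getDriver(
    PeripheralComponentInterconnectDeviceDescriptor* dev,
    InterruptManager* interrupt_manager)
{
    return 0;
}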
After that I changed the parameter back to PeripheralComponentInterconnectDeviceDescriptor dev and instead tried adding a linker flag:
-Wl,-z,stack-size=4194304
I also enlarged the space behind the esp pointer so that esp would not overwrite anything, but it blocked again :(.
Can somebody tell me what is happening? Why doesn't it block in the code from the video? Thanks!
//makefile compile parameters
GCCPARAMS = -m32 -fno-use-cxa-atexit -nostdlib -fno-builtin -fno-rtti -fno-exceptions -fno-leading-underscore -Wno-write-strings -fpermissive -fno-stack-protector -Iinclude
Related
I'm trying to implement the LC-3 architecture in C++, working from a tutorial written in C.
I have all the code I need.
I built the source code, ran the executable, and gave it an obj file containing the software to run, but I get stuck on the first instruction of the obj file. This is the code that executes the first instruction:
case Registry::OP_LD:
{
    uint16_t r0 = (instr >> 9) & 0x7;
    uint16_t pc_offset = m_registry->sign_extend(instr & 0x1FF, 9);
    printf("m_memory: %p\n", m_memory);
    printf("m_registry: %p\n", m_registry);
    printf("\n\n\n\n\n\n");
    printf("m_memory[0]: %d\n", m_memory->read(0));
    printf("m_registry[0]: %d\n", m_registry->get(0));
    printf("\n\n\n\n\n\n");
    printf("R0: %d\n", r0);
    printf("pc_offset: %d\n", pc_offset);
    printf("Reg 0: %d\n", m_registry->get(r0));
    printf("mem read: %d\n", m_memory->read(m_registry->get(static_cast<uint16_t>(Registry::R_PC)) + pc_offset));
    m_registry->set(r0, m_memory->read(m_registry->get(static_cast<uint16_t>(Registry::R_PC)) + pc_offset));
    printf("Reg 0 bis: %d\n", m_registry->get(r0));
    m_registry->update_flags(r0);
}
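For reference, sign_extend is the standard helper from the C tutorial, which I ported as-is (a sketch; I assume my port matches):

// standard LC-3 sign extension (from the C tutorial; assumed identical in my port):
// widen an n-bit two's-complement value to 16 bits
uint16_t sign_extend(uint16_t x, int bit_count)
{
    if ((x >> (bit_count - 1)) & 1) // sign bit set: value is negative
        x |= (0xFFFF << bit_count); // fill the upper bits with ones
    return x;
}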
I tried to debug this by adding printf calls to see what happens. I saved the results in test (check the archive). Here is an extract as an example:
m_memory: 0x7f5c20645010
m_registry: 0x558784821ef0
m_memory[0]: 0
m_registry[0]: 0
R0: 6
pc_offset: 23
Reg 0: 61477
mem read: 61477
Reg 0 bis: 61477
It seems that, for an unknown reason, the registers and/or the memory are not set correctly. I don't really know how to debug this. If someone can help me debug it, that would be great.
The source code is here: archive
Cheers
For learning purposes, I am trying to get full C++ support on an ARM MCU (STM32F407ZE). I am struggling to get exceptions working, hence this question:
How to get C++ exceptions on a bare-metal ARM bootloader?
To extend the question a bit:
I understand that an exception, like exiting a function, requires unwinding the stack. The fact that exiting a function works out of the box while exception handling does not makes me think that the compiler emits the unwinding for function exits but cannot do it for exceptions.
So sub-question 1 is: Is this premise correct? Do I really need to implement/integrate an unwinding library for exception handling?
In my superficial understanding of unwinding, there is a frame on the stack, and the unwinder "just" needs to call the destructor of each object in it, and finally jump to the given catch.
Sub-question 2 is: How does the unwinding library perform this task? What strategy does it use? (to the extent appropriate for an SO answer)
In my searches, I found many explanations of WHAT unwinding is, but very few of how to get it working. The closest is:
GCC arm-none-eabi (Codesourcery) and C++ Exceptions
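To make sub-question 2 concrete, this is the behaviour I expect the unwinder to provide (a minimal sketch; Resource is a made-up RAII type, not from my project):

// what unwinding has to achieve: when the exception leaves f(), the
// destructor of r must run before control reaches the handler
struct Resource
{
    ~Resource() { /* release pins, buffers, ... */ }
};

void f()
{
    Resource r; // lives in f()'s stack frame
    throw 4;    // the unwinder must destroy r on its way to the handler
}

int main()
{
    try { f(); }
    catch (int) { /* r has already been destroyed at this point */ }
}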
The project
1) The first step, which already came with some difficulties, was to get the MCU powered and communicating through JTAG.
This is just contextual information, please do not tag the question off-topic just because of this picture. Jump to step 2 instead.
I know there are test boards available, but this is a learning project to get a better understanding of all the "magic" behind the scenes. So I got a chip socket and a breadboard and set up the minimal power-up circuitry:
Note: JTAG is performed through the GPIO of a raspberry-pi.
Note2: I am using OpenOCD to communicate with the chip.
2) Second step, was to make a minimal software to blink the yellow led.
Using arm-none-eabi-g++ as the compiler and linker, the C++ code was straightforward, but my understanding of the linker script is still somewhat blurry.
3) Enable exception handling (not yet working).
For this goal, the following resources were useful:
https://wiki.osdev.org/C++_Exception_Support
https://itanium-cxx-abi.github.io/cxx-abi/exceptions.pdf
https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html
However, this seems like a lot of complexity for simple exception handling, and before starting to implement/integrate an unwinding library, I would like to be sure I am going in the correct direction.
I would like to avoid hearing in two weeks: "Ohh, by the way, you just need to add this "-xx" option to the compiler and it works."
main.cpp
auto reset_handler() noexcept -> void;
auto main() -> int;

int global_variable_test = 50;

extern "C"
{
    #include "stm32f4xx.h"
    #include "stm32f4xx_rcc.h"
    #include "stm32f4xx_gpio.h"

    void assert_failed(uint8_t* file, uint32_t line) {}
    void hardFaultHandler(unsigned int* hardFaultArgs);

    // vector table
    #define SRAM_SIZE 128*1024
    #define SRAM_END (SRAM_BASE + SRAM_SIZE)

    unsigned long *vector_table[] __attribute__((section(".vector_table"))) =
    {
        (unsigned long *)SRAM_END,      // initial stack pointer
        (unsigned long *)reset_handler, // main as Reset_Handler
    };
}

auto reset_handler() noexcept -> void
{
    // Setup execution

    // Call the main function
    int ret = main();

    // never finish
    while (true);
}

class A
{
public:
    int b;
    auto cppFunc() -> void
    {
        throw (int)4;
    }
};

auto main() -> int
{
    // Initializing led GPIO
    RCC_AHB1PeriphClockCmd(RCC_AHB1Periph_GPIOG, ENABLE);
    GPIO_InitTypeDef GPIO_InitDef;
    GPIO_InitDef.GPIO_Pin = GPIO_Pin_13 | GPIO_Pin_14;
    GPIO_InitDef.GPIO_OType = GPIO_OType_PP;
    GPIO_InitDef.GPIO_Mode = GPIO_Mode_OUT;
    GPIO_InitDef.GPIO_PuPd = GPIO_PuPd_NOPULL;
    GPIO_InitDef.GPIO_Speed = GPIO_Speed_100MHz;
    GPIO_Init(GPIOG, &GPIO_InitDef);

    // Testing normal blinking
    int loopNum = 500000;
    for (int i = 0; i < 5; ++i)
    {
        loopNum = 100000;
        GPIO_SetBits(GPIOG, GPIO_Pin_13 | GPIO_Pin_14);
        for (int i = 0; i < loopNum; i++) continue; // active waiting!

        loopNum = 800000;
        GPIO_ResetBits(GPIOG, GPIO_Pin_13 | GPIO_Pin_14);
        for (int i = 0; i < loopNum; i++) continue; // active waiting!
    }

    // Try exception handling
    try
    {
        A a;
        a.cppFunc();
    }
    catch (...) {}

    return 0;
}
Makefile
CPP_C = arm-none-eabi-g++
C_C = arm-none-eabi-g++
LD = arm-none-eabi-g++
COPY = arm-none-eabi-objcopy

LKR_SCRIPT = -Tstm32_minimal.ld

INCLUDE = -I. -I./stm32f4xx/CMSIS/Device/ST/STM32F4xx/Include -I./stm32f4xx/CMSIS/Include -I./stm32f4xx/STM32F4xx_StdPeriph_Driver/inc -I./stm32f4xx/Utilities/STM32_EVAL/STM3240_41_G_EVAL -I./stm32f4xx/Utilities/STM32_EVAL/Common

C_FLAGS = -c -fexceptions -fno-common -O0 -g -mcpu=cortex-m4 -mthumb -DSTM32F40XX -DUSE_FULL_ASSERT -DUSE_STDPERIPH_DRIVER $(INCLUDE)
CPP_FLAGS = -std=c++11 -c $(C_FLAGS)
LFLAGS = -specs=nosys.specs -nostartfiles -nostdlib $(LKR_SCRIPT)
CPFLAGS = -Obinary

all: main.bin

main.o: main.cpp
	$(CPP_C) $(CPP_FLAGS) -o main.o main.cpp

stm32f4xx_gpio.o: stm32f4xx_gpio.c
	$(C_C) $(C_FLAGS) -o stm32f4xx_gpio.o stm32f4xx_gpio.c

stm32f4xx_rcc.o: stm32f4xx_rcc.c
	$(C_C) $(C_FLAGS) -o stm32f4xx_rcc.o stm32f4xx_rcc.c

main.elf: main.o stm32f4xx_gpio.o stm32f4xx_rcc.o
	$(LD) $(LFLAGS) -o main.elf main.o stm32f4xx_gpio.o stm32f4xx_rcc.o

main.bin: main.elf
	$(COPY) $(CPFLAGS) main.elf main.bin

clean:
	rm -rf *.o *.elf *.bin

write:
	./write_bin.sh main.elf
Linker script: stm32_minimal.ld
/* memory layout for an STM32F407 */
MEMORY
{
    FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 512K
    SRAM (rwx) : ORIGIN = 0x20000000, LENGTH = 128K
}

/* output sections */
SECTIONS
{
    /* program code into FLASH */
    .text :
    {
        *(.vector_table) /* Vector table */
        *(.text)         /* Program code */
        *(.data)
        /**(.eh_frame)*/
    } >FLASH

    .ARM.exidx : /* Required for unwinding the stack? */
    {
        __exidx_start = .;
        *(.ARM.exidx* .gnu.linkonce.armexidx.*)
        __exidx_end = .;
    } > FLASH

    PROVIDE ( end = . );
}
We are developing for the STM32F103 MCU. We use bare-metal C++ code with the ARM GCC toolchain. After some hours of struggling with a suspicious expression, we found that the const keyword changes the result of that expression. When testing the same piece of code with the x86 GCC toolchain, the problem does not occur.
We are using the STM32's GPIOs for debugging.
This is the code that fully reproduces the problem:
#include "stm32f10x.h"
#include "system_stm32f10x.h"
#include "stdlib.h"
#include "stdio.h"
const unsigned short RTC_FREQ = 62500;
unsigned short prescaler_1ms = RTC_FREQ/1000;
int main()
{
//********** Clock Init **********
RCC->CFGR |= RCC_CFGR_ADCPRE_0 | RCC_CFGR_ADCPRE_1; // ADC prescaler
RCC->APB2ENR |= RCC_APB2ENR_AFIOEN; // Alternate Function I/O clock enable
RCC->APB2ENR |= RCC_APB2ENR_IOPCEN; // I/O port C clock enable
RCC->APB2ENR |= RCC_APB2ENR_IOPAEN; // I/O port A clock enable
RCC->APB2ENR |= RCC_APB2ENR_ADC1EN; // ADC 1 interface clock enable
RCC->APB1ENR |= RCC_APB1ENR_TIM2EN; // Timer 2 clock enable
RCC->AHBENR = RCC_AHBENR_DMA1EN; // DMA1 clock enable
RCC->CSR = RCC_CSR_LSION; // Internal Low Speed oscillator enable
//********************************
/* GPIO Configuration */
GPIOC->CRH = GPIO_CRH_MODE12_0; //GPIO Port C Pin 12
GPIOC->CRH |= GPIO_CRH_MODE13_1 | GPIO_CRH_MODE13_0;
GPIOC->CRH |= GPIO_CRH_MODE10_0;
GPIOC->CRH |= GPIO_CRH_MODE9_0;
GPIOC->CRH |= GPIO_CRH_MODE8_0;
GPIOC->CRL = GPIO_CRL_MODE7_0;
GPIOC->CRL |= GPIO_CRL_MODE6_0;
GPIOC->CRL |= GPIO_CRL_MODE4_0;
GPIOC->CRL |= GPIO_CRL_MODE3_0;
while(1){
if(prescaler_1ms & (1<<0))GPIOC->BSRR |= GPIO_BSRR_BR13;
else GPIOC->BSRR |= GPIO_BSRR_BS13;
if(prescaler_1ms & (1<<1))GPIOC->BSRR |= GPIO_BSRR_BR12;
else GPIOC->BSRR |= GPIO_BSRR_BS12;
if(prescaler_1ms & (1<<2))GPIOC->BSRR |= GPIO_BSRR_BR10;
else GPIOC->BSRR |= GPIO_BSRR_BS10;
if(prescaler_1ms & (1<<3))GPIOC->BSRR |= GPIO_BSRR_BR9;
else GPIOC->BSRR |= GPIO_BSRR_BS9;
if(prescaler_1ms & (1<<4))GPIOC->BSRR |= GPIO_BSRR_BR8;
else GPIOC->BSRR |= GPIO_BSRR_BS8;
if(prescaler_1ms & (1<<5))GPIOC->BSRR |= GPIO_BSRR_BR7;
else GPIOC->BSRR |= GPIO_BSRR_BS7;
if(prescaler_1ms & (1<<6))GPIOC->BSRR |= GPIO_BSRR_BR6;
else GPIOC->BSRR |= GPIO_BSRR_BS6;
if(prescaler_1ms & (1<<7))GPIOC->BSRR |= GPIO_BSRR_BR4;
else GPIOC->BSRR |= GPIO_BSRR_BS4;
if(prescaler_1ms & (1<<8))GPIOC->BSRR |= GPIO_BSRR_BR3;
else GPIOC->BSRR |= GPIO_BSRR_BS3;
}
return 0;
}
When this code compiles, we expect the result 0b111110 at the GPIOs (62500/1000 = 62 = 0b111110). When we change
const unsigned short RTC_FREQ = 62500;
to
unsigned short RTC_FREQ = 62500;
we get 0b111111111 instead.
This is the Makefile that we use:
EABI_PATH=$(ROOT_DIR)"arm_toolchain/gcc-arm-none-eabi-6-2017-q2-update/arm-none-eabi/"
CMSIS_INC_PATH=$(ROOT_DIR)"STMLib/STM32F10x_StdPeriph_Lib_V3.5.0/Libraries/CMSIS/CM3/"
PROJECT_INC=$(ROOT_DIR)

CXXINCS = -I$(EABI_PATH)"include" -I$(CMSIS_INC_PATH)"CoreSupport" -I$(CMSIS_INC_PATH)"DeviceSupport/ST/STM32F10x" -I$(PROJECT_INC)"Source" -I$(PROJECT_INC)"Includes"
CXXLIBS = -L$(EABI_PATH)"lib" -L$(EABI_PATH)"6.3.1"
CXXFLAGS = --specs=nosys.specs -DSTM32F10X_MD -DVECT_TAB_FLASH -fdata-sections -ffunction-sections -fno-exceptions -mthumb -mcpu=cortex-m3 -march=armv7-m -O2
LDFLAGS = -lstdc++ -Wl,--gc-sections

CC = $(EABI_PATH)"../bin/arm-none-eabi-gcc"
CXX = $(EABI_PATH)"../bin/arm-none-eabi-g++"
LD = $(EABI_PATH)"../bin/arm-none-eabi-ld"
STRIP = $(EABI_PATH)"../bin/arm-none-eabi-strip"

all:
	$(CC) $(CXXINCS) -c $(PROJECT_INC)"Source/syscalls.c" $(PROJECT_INC)"Source/startup.c" $(CXXFLAGS)
	$(CXX) $(CXXINCS) -c $(PROJECT_INC)"Source/main.cpp" $(CMSIS_INC_PATH)"DeviceSupport/ST/STM32F10x/system_stm32f10x.c" $(CXXFLAGS)
	$(CXX) $(CXXLIBS) -o main syscalls.o main.o startup.o -T linker.ld system_stm32f10x.o $(LDFLAGS)
	$(STRIP) --strip-all main
	$(EABI_PATH)"bin/objcopy" -O binary main app
	$(EABI_PATH)"bin/objdump" -b binary -m arm_any -D app > app_disasm
	rm -f *.o main adc timer task solenoid dma startup syscalls system_stm32f10x
Does anybody have a clue what can cause a problem like this? Is it a compiler bug? Have we missed something?
Promoting my theory to an answer because it is confirmed by the startup code and LD script.
The C++ initialization code, which is supposed to copy 62 into prescaler_1ms, is never called. When you define RTC_FREQ as const, the result of the computation is known at compile time; the value 62 lives in flash and needs no runtime initialization.
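To illustrate the difference (hypothetical names, not from your project):

const unsigned short F = 62500; // constant expression
unsigned short a = F / 1000;    // static initialization: the value 62 is baked
                                // into the image at compile/link time

unsigned short g = 62500;       // no longer a constant expression
unsigned short b = g / 1000;    // dynamic initialization: a generated function
                                // referenced from .init_array must run before
                                // main() for b to become 62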
C++ initialization is performed by a number of compiler-generated functions, with names like _Z41__static_initialization_and_destruction_0ii. Pointers to these functions are collected by the compiler in the .preinit_array and .init_array sections. Before main() is called, the startup code should iterate over these pointers and call each of them. The boundaries of these pointer arrays are known to the startup code because the linker script defines these special symbols:
__preinit_array_start, __preinit_array_end
__init_array_start, __init_array_end
The distinction between .preinit_array and .init_array is not yet clear to me. The former pointers are called before the _init function and the latter after it. In my project the _init function provided by gcc does not seem to be a valid function, so I do not call it.
There is a symmetrical procedure at program termination, when the C++ destructors of global objects are called via __fini_array_start and __fini_array_end. For embedded systems, however, this is likely not relevant.
The minimal steps to make a project call C++ initialization stuff are:
Include the .init_array section in your linker script.
From the document you provided, it seems the .init_array section is already defined as:
. = ALIGN(4);
__preinit_array_start = .;
KEEP(*(.preinit_array))
__preinit_array_end = .;
. = ALIGN(4);
__init_array_start = .;
KEEP(*(SORT(.init_array.*)))
KEEP(*(.init_array))
__init_array_end = .;
Have code that calls those pointers at program startup. This part seems to be absent from your setup, and that is the actual cause of the problem.
You could add the following code (or similar) to the __Init_Data() function in startup.c:
// usually these are defined with __attribute__((weak)) but I prefer to get errors when required things are missing
extern void (*__preinit_array_start[])(void);
extern void (*__preinit_array_end[])(void);
extern void (*__init_array_start[])(void);
extern void (*__init_array_end[])(void);
void __Init_Data(void) {
    // copying initialized data from flash to RAM
    ...
    // zeroing bss segment
    ...
    // calling C++ initializers
    void (**f)(void);
    for (f = __preinit_array_start; f != __preinit_array_end; f++)
        (*f)();
    // _init(); // _init and _fini do not work for me
    for (f = __init_array_start; f != __init_array_end; f++)
        (*f)();
}
Again, I am not sure about the _init function, so it is commented out here. I may ask my own question some time later.
I'm attempting to benchmark some CUDA code using Google Benchmark. To start, I haven't written any CUDA code; I just want to make sure I can benchmark a host function compiled with nvcc. In main.cu I have:
#include <benchmark/benchmark.h>
size_t fibr(size_t n)
{
if (n == 0)
return 0;
if (n == 1)
return 1;
return fibr(n-1)+fibr(n-2);
}
static void BM_FibRecursive(benchmark::State& state)
{
size_t y;
while (state.KeepRunning())
{
benchmark::DoNotOptimize(y = fibr(state.range(0)));
}
}
BENCHMARK(BM_FibRecursive)->RangeMultiplier(2)->Range(1, 1<<5);
BENCHMARK_MAIN();
I compile with:
nvcc -g -G -Xcompiler -Wall -Wno-deprecated-gpu-targets --std=c++11 main.cu -o main.x -lbenchmark
When I run the program, I get the following error:
./main.x
main.x: malloc.c:2405: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
[1] 11358 abort (core dumped) ./main.x
I have explicitly pointed nvcc to g++-4.9 and g++-4.8 using -ccbin g++-4.x and have reproduced the problem with both versions of g++.
Is there anything obviously wrong here? How can the problem be fixed?
I'm on Ubuntu 17.04 and NVIDIA driver version 375.82, if it matters.
Update: I installed g++-5, and the core dump went away.
Almost 99% of the time, this error means that you broke something or are accessing corrupted memory (or even recursed so deeply that you went over the stack limit).
Use free tools like Valgrind or your favorite IDE's debugger to get a hint of where this is happening and why.
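For example, with the binary from the question, a first pass would be:
valgrind ./main.x
Memcheck will usually report the first invalid read/write long before malloc's internal assertion fires, which narrows down where the corruption starts.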
I am trying to understand why using -O2 -march=native with GCC gives slower code than without them.
Note that I am using MinGW (GCC 4.7.1) under Windows 7.
Here is my code :
struct.hpp:
#ifndef STRUCT_HPP
#define STRUCT_HPP

#include <iostream>

class Figure
{
public:
    Figure(char *pName);
    virtual ~Figure();
    char *GetName();
    double GetArea_mm2(int factor);
private:
    char name[64];
    virtual double GetAreaEx_mm2() = 0;
};

class Disk : public Figure
{
public:
    Disk(char *pName, double radius_mm);
    ~Disk();
private:
    double radius_mm;
    virtual double GetAreaEx_mm2();
};

class Square : public Figure
{
public:
    Square(char *pName, double side_mm);
    ~Square();
private:
    double side_mm;
    virtual double GetAreaEx_mm2();
};

#endif
struct.cpp:
#include <cstdio>
#include "struct.hpp"

Figure::Figure(char *pName)
{
    sprintf(name, pName);
}

Figure::~Figure()
{
}

char *Figure::GetName()
{
    return name;
}

double Figure::GetArea_mm2(int factor)
{
    return (double)factor*GetAreaEx_mm2();
}

Disk::Disk(char *pName, double radius_mm_) :
    Figure(pName), radius_mm(radius_mm_)
{
}

Disk::~Disk()
{
}

double Disk::GetAreaEx_mm2()
{
    return 3.1415926*radius_mm*radius_mm;
}

Square::Square(char *pName, double side_mm_) :
    Figure(pName), side_mm(side_mm_)
{
}

Square::~Square()
{
}

double Square::GetAreaEx_mm2()
{
    return side_mm*side_mm;
}
main.cpp:
#include <iostream>
#include <cstdio>
#include "struct.hpp"

double Do(int n)
{
    double sum_mm2 = 0.0;
    const int figuresCount = 10000;
    Figure **pFigures = new Figure*[figuresCount];
    for (int i = 0; i < figuresCount; ++i)
    {
        if (i % 2)
            pFigures[i] = new Disk((char *)"-Disque", i);
        else
            pFigures[i] = new Square((char *)"-Carré", i);
    }
    for (int a = 0; a < n; ++a)
    {
        for (int i = 0; i < figuresCount; ++i)
        {
            sum_mm2 += pFigures[i]->GetArea_mm2(i);
            sum_mm2 += (double)(pFigures[i]->GetName()[0] - '-');
        }
    }
    for (int i = 0; i < figuresCount; ++i)
        delete pFigures[i];
    delete[] pFigures;
    return sum_mm2;
}

int main()
{
    double a = 0;
    StartChrono(); // home made lib, working fine
    a = Do(10000);
    double elapsedTime_ms = StopChrono();
    std::cout << "Elapsed time : " << elapsedTime_ms << " ms" << std::endl;
    return (int)a % 2; // To force the optimizer to keep the Do() call
}
I compile this code twice:

1: Without optimization:
mingw32-g++.exe -Wall -fexceptions -std=c++11 -c main.cpp -o main.o
mingw32-g++.exe -Wall -fexceptions -std=c++11 -c struct.cpp -o struct.o
mingw32-g++.exe -o program.exe main.o struct.o -s

2: With -O2 optimization:
mingw32-g++.exe -Wall -fexceptions -O2 -march=native -std=c++11 -c main.cpp -o main.o
mingw32-g++.exe -Wall -fexceptions -O2 -march=native -std=c++11 -c struct.cpp -o struct.o
mingw32-g++.exe -o program.exe main.o struct.o -s

1: Execution time:
1196 ms (1269 ms with Visual Studio 2013)

2: Execution time:
1569 ms (403 ms with Visual Studio 2013) !!!!!!!!!!!!!
Using -O3 instead of -O2 does not improve the results.
I was, and still am, pretty convinced that GCC and Visual Studio are equivalent, so I don't understand this huge difference. Moreover, I don't understand why the optimized version is slower than the non-optimized one with GCC.
Am I missing something here?
(Note that I had the same problem with genuine GCC 4.8.2 on Ubuntu.)
Thanks for your help.
Since I don't see the assembly code, I'm going to speculate the following:
The allocation loop can be optimized (by the compiler) by removing the if clause, resulting in something like the following:
for (int i = 0; i < 10000; i += 2)
{
    pFigures[i] = new Square(...);
}
for (int i = 1; i < 10000; i += 2)
{
    pFigures[i] = new Disk(...);
}
Considering that the end condition is a multiple of 4, it can be made even more "efficient":
for (int i = 0; i < 10000; i += 2*4)
{
    pFigures[i]   = ...
    pFigures[i+2] = ...
    pFigures[i+4] = ...
    pFigures[i+6] = ...
}
Memory-wise, this causes Disks to be allocated four at a time and Squares four at a time.
Now, this means they will be located next to each other in memory.
Next, you iterate over the array 10000 times in normal order (by normal I mean index after index).
Think about where these shapes end up in memory. You will get roughly four times more cache misses (think of the border case where 4 disks and 4 squares land in different pages: you would switch between the pages 8 times, whereas in the normal scenario you would switch between them only once).
This sort of optimization (if done by the compiler, and on your particular code) improves allocation time, but not access time (which in your example is the biggest load).
Test this by removing the i % 2 and see what results you get, for instance with the variant sketched below.
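A quick hypothetical variant where only the allocation loop changes, so every object is the same type and the heap layout is uniform:

// hypothetical test: allocate only Disks so the layout cannot be hurt
// by interleaved allocations of two types
for (int i = 0; i < figuresCount; ++i)
    pFigures[i] = new Disk((char *)"-Disque", i);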
Again, this is pure speculation; it assumes that the reason for the lower performance was a loop optimization.
I suspect you've hit an issue unique to the combination of MinGW/GCC/glibc on Windows, because your code performs faster with optimizations on Linux, where GCC is altogether more "at home".
On a fairly pedestrian Linux VM using gcc 4.8.2:
$ g++ main.cpp struct.cpp
$ time a.out
real 0m2.981s
user 0m2.876s
sys 0m0.079s
$ g++ -O2 main.cpp struct.cpp
$ time a.out
real 0m1.629s
user 0m1.523s
sys 0m0.041s
...and if you really take the blinkers off the optimizer by deleting struct.cpp and moving the implementation all inline:
$ time a.out
real 0m0.550s
user 0m0.543s
sys 0m0.000s
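For reference, by "moving the implementation all inline" I mean defining the member functions directly in struct.hpp, so the optimizer can see the bodies at every call site. A sketch, showing only Disk:

// sketch: method bodies visible in the header give the optimizer a chance
// to inline GetAreaEx_mm2() and friends at the call site
class Disk : public Figure
{
public:
    Disk(char *pName, double radius_mm_) : Figure(pName), radius_mm(radius_mm_) {}
    ~Disk() {}
private:
    double radius_mm;
    virtual double GetAreaEx_mm2() { return 3.1415926*radius_mm*radius_mm; }
};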