storing structs in ROM on ARM device - c++

I have some constant data that I want to store in ROM since there is a fair amount of it and I'm working with a memory-constrained ARM7 embedded device. I'm trying to do this using structures that look something like this:
struct objdef
{
int x;
int y;
bool (*function_ptr)(int);
some_other_struct * const struct_array; // array of similar structures
const void* vp; // previously omitted to shorten code
};
which I then create and initialize as globals:
const objdef def_instance = { 2, 3, function, array, NULL };
However, this eats up quite a bit of RAM despite the const at the beginning. More specifically, it significantly increases the amount of RW data and eventually causes the device to lock up if enough instances are created.
I'm using uVision and the ARM compiler, along with the RTX real-time kernel.
Does anybody know why this doesn't work, or know a better way to store structured heterogeneous data in ROM?
Update
Thank you all for your answers and my apologies for not getting back to you guys earlier. So here is the score so far and some additional observations on my part.
Sadly, __attribute__ has zero effect on RAM vs ROM and the same goes for static const. I haven't had time to try the assembly route yet.
My coworkers and I have discovered some more unusual behavior, though.
First, I must note that for the sake of simplicity I did not mention that my objdef structure contains a const void* field. The field is sometimes assigned a value from a string table defined as
char const * const string_table [ROWS][COLS] =
{
{ "row1_1", "row1_2", "row1_3" },
{ "row2_1", "row2_2", "row2_3" },
...
}
const objdef def_instance = { 2, 3, function, array, NULL };//->ROM
const objdef def_instance = { 2, 3, function, array, string_table[0][0] };//->RAM
string_table is in ROM as expected. And here's the kicker: instances of objdef get put in ROM until one of the values in string_table is assigned to that const void* field. After that the struct instance is moved to RAM.
But when string_table is changed to
char const string_table [ROWS][COLS][MAX_CHARS] =
{
{ "row1_1", "row1_2", "row1_3" },
{ "row2_1", "row2_2", "row2_3" },
...
}
const objdef def_instance = { 2, 3,function, array, NULL };//->ROM
const objdef def_instance = { 2, 3, function, array, string_table[0][0] };//->ROM
those instances of objdef are placed in ROM despite that const void* assignment. I have no idea why this should matter.
I'm beginning to suspect that Dan is right and that our configuration is messed up somewhere.

I assume you have a scatterfile that separates your RAM and ROM sections. What you want to do is to specify your structure with an attribute for what section it will be placed, or to place this in its own object file and then specify that in the section you want it to be in the scatterfile.
__attribute__((section("ROM"))) const objdef def_instance = { 2, 3, function, array };
The C "const" keyword doesn't by itself force the compiler to put something in the text or const section; it mainly lets the compiler diagnose attempts to modify the object. Taking a pointer to a const object and casting away the const is valid, but actually writing through that pointer is undefined behavior when the object itself was defined const. That is what permits (but does not require) a toolchain to place such objects in read-only memory.

Your thinking is correct and reasonable. I've used Keil / uVision (this was v3, maybe 3 years ago?) and it always worked how you expected it to, i.e. it put const data in flash/ROM.
I'd suspect your linker configuration / script. I'll try to go back to my old work & see how I had it configured. I didn't have to add #pragma or __attribute__ directives, I just had it place .const & .text in flash/ROM. I set up the linker configuration / memory map quite a while ago, so unfortunately, my recall isn't very fresh.
(I'm a bit confused by people who are talking about casting & const pointers, etc... You didn't ask anything about that & you seem to understand how "const" works. You want to place the initialized data in flash/ROM to save RAM (not ROM->RAM copy at startup), not to mention a slight speedup at boot time, right? You're not asking if it's possible to change it or whatever...)
EDIT / UPDATE:
I just noticed the last field in your (const) struct is a some_other_struct * const (constant pointer to a some_other_struct). You might want to try making it a (constant) pointer to a constant some_other_struct [some_other_struct const * const] (assuming what it points to is indeed constant). In that case it might just work. I don't remember the specifics (see a theme here?), but this is starting to seem familiar. Even if your pointer target isn't a const item, and you can't eventually do this, try changing the struct definition & initializing it w/ a pointer to const and just see if that drops it into ROM. Even though you have it as a const pointer and it can't change once the structure is built, I seem to remember something where if the target isn't also const, the linker doesn't think it can be fully initialized at link time & defers the initialization to when the C runtime startup code is executed, incl. the ROM to RAM copy of initialized RW memory.
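A minimal sketch of the idea above, assuming a C++11 (or later) toolchain, which the ARM compiler in the question may not be. Marking the objects constexpr makes the compiler reject any initializer that would need run-time code, so the instances at least never require start-up initialization; whether they then land in flash still depends on the linker configuration. The body of function, the contents of array and the table dimensions are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

struct some_other_struct { char a; char b; };

struct objdef {
    int x;
    int y;
    bool (*function_ptr)(int);
    const some_other_struct* struct_array;  // pointer to const data
    const void* vp;
};

// Hypothetical callback; only its address is stored in the table.
inline bool function(int v) { return v > 0; }

// Everything the instances point at is itself constant.
constexpr some_other_struct array[] = { {4, 5}, {6, 7} };

// Strings stored inline: string_table[r][c] is an array, so using it as an
// initializer yields an address, not a value loaded from another object.
constexpr char string_table[2][3][8] = {
    { "row1_1", "row1_2", "row1_3" },
    { "row2_1", "row2_2", "row2_3" },
};

// constexpr forces constant initialization, so no start-up code is needed
// to build these objects.
constexpr objdef def_a = { 2, 3, function, array, nullptr };
constexpr objdef def_b = { 2, 3, function, array, string_table[0][0] };

static_assert(def_a.x == 2 && def_b.y == 3, "initialized at compile time");
```

If the compiler accepts both definitions, neither needs a dynamic initializer; checking the map file is still the only way to confirm where the linker actually put them.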

You could always try using assembly language.
Put in the information using DATA statements and publish (make public) the starting addresses of the data.
In my experience, large Read-Only data was declared in a source file as static const. A simple global function inside the source file would return the address of the data.
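The static const plus accessor pattern described above might look roughly like this. The table contents and the names kTable, table_data and table_size are placeholders, not anything from the original project.

```cpp
#include <cassert>
#include <cstddef>

// File-local, read-only table; the values are placeholders.
static const int kTable[] = { 10, 20, 30, 40 };

// The only symbols other files see: accessors for the address and the size.
const int* table_data() { return kTable; }

std::size_t table_size() { return sizeof(kTable) / sizeof(kTable[0]); }
```

Callers only ever get a pointer to const, so nothing outside the file can name or modify the table directly.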

If you are doing stuff on ARM you are probably using the ELF binary format. ELF files contain a number of sections, but constant data should find its way into the .rodata or .text sections of the ELF binary. You should be able to check this with the GNU utility readelf or the RVCT utility fromelf.
Now, assuming your symbols find themselves in the correct part of the ELF file, you need to find out how the RTX loader does its job. There is also no reason why the instances cannot share the same read-only memory, but this will depend on the loader. If the executable is stored in ROM, it may be run in place but may still be loaded into RAM. This also depends on the loader.

A complete example would have been best. If I take something like this:
typedef struct
{
char a;
char b;
} some_other_struct;
struct objdef
{
int x;
int y;
const some_other_struct * struct_array;
};
typedef struct
{
int x;
int y;
const some_other_struct * struct_array;
} tobjdef;
const some_other_struct def_other = {4,5};
const struct objdef def_instance = { 2, 3, &def_other};
const tobjdef tdef_instance = { 2, 3, &def_other};
unsigned int read_write=7;
And compile it with the latest codesourcery lite
arm-none-linux-gnueabi-gcc -S struct.c
I get
.arch armv5te
.fpu softvfp
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 2
.eabi_attribute 30, 6
.eabi_attribute 18, 4
.file "struct.c"
.global def_other
.section .rodata
.align 2
.type def_other, %object
.size def_other, 2
def_other:
.byte 4
.byte 5
.global def_instance
.align 2
.type def_instance, %object
.size def_instance, 12
def_instance:
.word 2
.word 3
.word def_other
.global tdef_instance
.align 2
.type tdef_instance, %object
.size tdef_instance, 12
tdef_instance:
.word 2
.word 3
.word def_other
.global read_write
.data
.align 2
.type read_write, %object
.size read_write, 4
read_write:
.word 7
.ident "GCC: (Sourcery G++ Lite 2010.09-50) 4.5.1"
.section .note.GNU-stack,"",%progbits
Everything const ends up under .section .rodata, which I would assume is what you want; it is then up to the linker script to make sure that read-only data is put in ROM. Note that the read_write variable comes after the switch from .rodata to .data, which is read/write.
So, to make this a complete binary and see whether it gets placed in ROM or RAM (.text or .data):
start.s
.globl _start
_start:
b reset
b hang
b hang
b hang
b hang
b hang
b hang
b hang
reset:
hang: b hang
Then
# arm-none-linux-gnueabi-gcc -c -o struct.o struct.c
# arm-none-linux-gnueabi-as -o start.o start.s
# arm-none-linux-gnueabi-ld -Ttext=0 -Tdata=0x1000 start.o struct.o -o struct.elf
# arm-none-linux-gnueabi-objdump -D struct.elf > struct.list
And we get
Disassembly of section .text:
00000000 <_start>:
0: ea000006 b 20 <reset>
4: ea000008 b 2c <hang>
8: ea000007 b 2c <hang>
c: ea000006 b 2c <hang>
10: ea000005 b 2c <hang>
14: ea000004 b 2c <hang>
18: ea000003 b 2c <hang>
1c: ea000002 b 2c <hang>
00000020 <reset>:
20: e59f0008 ldr r0, [pc, #8] ; 30 <hang+0x4>
24: e5901000 ldr r1, [r0]
28: e5801000 str r1, [r0]
0000002c <hang>:
2c: eafffffe b 2c <hang>
30: 00001000 andeq r1, r0, r0
Disassembly of section .data:
00001000 <read_write>:
1000: 00000007 andeq r0, r0, r7
Disassembly of section .rodata:
00000034 <def_other>:
34: 00000504 andeq r0, r0, r4, lsl #10
00000038 <def_instance>:
38: 00000002 andeq r0, r0, r2
3c: 00000003 andeq r0, r0, r3
40: 00000034 andeq r0, r0, r4, lsr r0
00000044 <tdef_instance>:
44: 00000002 andeq r0, r0, r2
48: 00000003 andeq r0, r0, r3
4c: 00000034 andeq r0, r0, r4, lsr r0
And that achieved the desired result: the read_write variable is in RAM and the structs are in ROM. You need to make sure the const declarations are in the right places, and that the compiler gives no warnings about, say, putting a const on a pointer to another structure that it cannot determine at compile time to be const. Even with all of that, getting the linker script (if you use one) to work as desired can take some effort. For example, this one seems to work:
MEMORY
{
bob(RX) : ORIGIN = 0x0000000, LENGTH = 0x8000
ted(WAIL) : ORIGIN = 0x2000000, LENGTH = 0x8000
}
SECTIONS
{
.text : { *(.text*) } > bob
.data : { *(.data*) } > ted
}


What does .word 0 mean in ARM assembly?

I'm writing a C++ state machine for Cortex-M4.
I use ARM GCC 11.2.1 (none).
I'm making a comparison between C and C++ output assembly.
I have the following C++ code (godbolt link):
struct State
{
virtual void entry(void) = 0;
virtual void exit(void) = 0;
virtual void run(void) = 0;
};
struct S1 : public State
{
void entry(void)
{
}
void exit(void)
{
}
void run(void)
{
}
};
S1 s1;
The assembly output is:
S1::entry():
bx lr
S1::exit():
bx lr
S1::run():
bx lr
vtable for S1:
.word 0
.word 0
.word S1::entry()
.word S1::exit()
.word S1::run()
s1:
.word vtable for S1+8
The only difference from C version of this code is the 2 lines .word 0:
vtable for S1:
.word 0
.word 0
What does that mean and what does it do?
Here's the C version of the code above that I wrote (godbolt link).
The C++ ABI for ARM and the GNU C++ ABI define which entries must appear in the virtual table.
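In the case of your code, the first two entries are the offset-to-top value and the typeinfo pointer. Both are zero here: the offset is zero because this vtable belongs to the start of the object, and the typeinfo pointer is presumably null because RTTI is disabled in that build. They can be non-zero in other situations (eg: a secondary vtable under multiple inheritance, or a build with RTTI enabled).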
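As a rough illustration of those two slots, the sketch below peeks at them at run time. It assumes the Itanium-style C++ ABI used by GCC and Clang (the ARM C++ ABI uses the same vtable layout), an object whose vptr is its first word, and RTTI left enabled. VtableHeader and peek_vtable_header are made-up names, and reading a vtable this way is not portable C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <typeinfo>

struct State {
    virtual void entry() = 0;
    virtual void exit() = 0;
    virtual void run() = 0;
};

struct S1 : State {
    void entry() override {}
    void exit() override {}
    void run() override {}
};

// The two slots that sit just before the first virtual function pointer.
struct VtableHeader {
    std::ptrdiff_t offset_to_top;
    const std::type_info* type_info;
};

// ABI-specific assumption: the vptr stored in the object points at the first
// function slot, with the typeinfo pointer one slot below it and the
// offset-to-top value two slots below it.
inline VtableHeader peek_vtable_header(const State& obj) {
    const char* vptr;
    std::memcpy(&vptr, static_cast<const void*>(&obj), sizeof vptr);

    VtableHeader header;
    std::memcpy(&header.type_info, vptr - sizeof(void*), sizeof(void*));
    std::memcpy(&header.offset_to_top, vptr - 2 * sizeof(void*),
                sizeof(std::ptrdiff_t));
    return header;
}
```

With RTTI enabled the second slot is expected to hold the address of typeid(S1); in a build like the one in the question it would read as zero, matching the second .word 0.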
It means the assembler should output the word number 0.
The assembler basically goes through the file from top to bottom and, for each instruction, calculates the bytes for that instruction (4 bytes in ARM state, 2 or 4 bytes in Thumb state, which is what a Cortex-M4 executes) and writes them into the output file.
.word just tells it to output a particular number as 4 bytes. In this case 0.

C++ - Method returns non-null pointer according to gdb but the variable it's assigned to is null

I have a problem where I call a method from a statically linked library and the method returns a pointer to a datastructure. According to the debugger the value that is returned is non-null. But after the method returns and the value is assigned to a local variable, the variable is null.
The screen recording below demonstrates the problem. The recording starts before the method is called, then steps into the method and back out. As you can see, the method returns a pointer to the address 0x6920ae10 but then the value stored in the local pointer variable is 0x0.
I'm at a loss here... I have been using C++ for many years but i never encountered a problem like that before.. Am I missing something stupid here? What could cause this problem?
I compiled the statically linked library (LLRP for Impinj RFID Readers) just before, directly on the machine where the code is executed and i also just recompiled the whole program on the same machine, so I don't think it's a mismatch between the binary code on the remote machine and the code in the IDE.
The same code did work before, but now it's running on a different platform (on a Raspberry Pi instead of an Alix-board and on Raspbian instead of Ubuntu).
Update:
I have been investigating this problem further today and i found that the problem occurs here (slightly changed to the code in the animation but the problem is the same):
::LLRP::CReaderEventNotificationData *p_msg_ren_d = ((::LLRP::CREADER_EVENT_NOTIFICATION *) p_msg)->getReaderEventNotificationData();
if (p_msg_ren_d == NULL)
{
delete p_connection;
delete p_msg;
this->_fail("Invalid response from reader (1).");
return;
}
This is the disassembly at the point where the method gets called (I'm compiling with -O0): (comments by me, with what i think is going on)
=> 0x001ee394 <+576>: ldr r0, [r11, #-24] ; 0xffffffe8 "Load address of p_msg into r0"
0x001ee398 <+580>: bl 0x1f0658 <LLRP::CREADER_EVENT_NOTIFICATION::getReaderEventNotificationData()> "call getReaderEventNotificationData"
0x001ee39c <+584>: str r0, [r11, #-28] ; 0xffffffe4 "store r0 on the stack at sp-28"
0x001ee3a0 <+588>: ldr r3, [r11, #-28] ; 0xffffffe4 "load sp-28 into r3"
0x001ee3a4 <+592>: cmp r3, #0 "check if rd is NULL"
Here is the c++ code and disassembly of the method that gets called (p_msg->getReaderEventNotificationData()):
inline CReaderEventNotificationData *
getReaderEventNotificationData (void)
{
return m_pReaderEventNotificationData;
}
0x001f0658 <+0>: push {r11} ; (str r11, [sp, #-4]!) "save r11"
0x001f065c <+4>: add r11, sp, #0 "save sp in r11"
0x001f0660 <+8>: sub sp, sp, #12 "decrement sp by 12"
0x001f0664 <+12>: str r0, [r11, #-8] "store r0 on the stack at sp-8"
=> 0x001f0668 <+16>: ldr r3, [r11, #-8] "load sp-8 into r3"
0x001f066c <+20>: ldr r3, [r3, #24] "load r3+24 into r3 THIS IS WRONG!"
"m_pReaderEventNotificationData is at offset 28 not 24"
0x001f0670 <+24>: mov r0, r3 "move r3 into r0 as the return value"
0x001f0674 <+28>: add sp, r11, #0 "restore sp"
0x001f0678 <+32>: pop {r11} ; (ldr r11, [sp], #4) "restore r11"
0x001f067c <+36>: bx lr "return"
If I take a look at the memory at the address p_msg, this is what it looks like:
0x69405de8: 0x002bcbf8 0x002b8774 0x00000000 0x69408200
0x69405df8: 0x69408200 0x5c5a5b1a 0x00000000 0x6940ed90
0x69405e08: 0x00000028 0x0000012d 0x694035f0 0x694007c8
So at offset 24 it's actually 0x00000000, and that's what is returned by the method. But the correct value that should be returned is actually at offset 28 (0x6940ed90).
Is this a compiler problem? Or some 64 bit thing?
This is the compiler version btw: gcc version 8.3.0 (Raspbian 8.3.0-6+rpi1)
What could cause this problem?
The most likely cause is that you've compiled your code with optimization, and are getting confused. Does the program proceed to report invalid response from reader, or does it actually continue to line 181?
If the latter, see this answer.
If the program really does go to execute line 179, then it is likely that your compiler has miscompiled your program (you'll need to disassemble the code to be sure).
In that case, trying different compiler versions, disabling optimizations for a particular function or file, changing optimization levels, etc. etc. may let you work around the compiler bug.
Update:
The program does report the invalid response from reader, so it is actually called. I spent all afternoon investigating this again and at this point i believe it's a compiler error. In the disassembly i can see that it tries to load the value of m_pReaderEventNotificationData
from the object-address+24 (ldr r3, [r3, #24]) but if i view the memory, at this offset is actually 0x000000. The real value that it should return is at offset #28 instead of #24.
This is actually a very common problem, usually stemming from an ODR violation or an incomplete rebuild.
Suppose you have two object files: foo.o and bar.o, and also define
const int NUM_X = 6;
struct Bar {
int m_x[NUM_X];
void *m_p;
void *Fn() { return m_p;}
};
Given above, Fn() will return *(this + 24), and this offset will be compiled into both object files.
Now you change NUM_X from 6 to 7, and rebuild foo.o but not bar.o. Fn inside bar.o will still return *(this +24), but it should return *(this + 28) (assuming 32-bit binary).
Similar behavior could happen if struct Bar is defined differently in foo.cc and bar.cc (ODR violation).
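To make the stale-offset effect concrete, here is a small sketch. It is an illustration only: a template parameter NumX stands in for NUM_X, and a 32-bit integer stands in for the pointer member so that the offsets match a 32-bit target on any host (it also assumes a 4-byte int32_t with no padding, which is the usual case). load_word is a made-up helper that mimics what the compiled accessor does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// NumX plays the role of NUM_X; m_p is a 32-bit stand-in for the pointer.
template <int NumX>
struct Bar {
    std::int32_t m_x[NumX];
    std::uint32_t m_p;
};

// Offset compiled into bar.o (built when NUM_X was 6) versus the offset
// compiled into foo.o (rebuilt with NUM_X == 7).
constexpr std::size_t kStaleOffset = offsetof(Bar<6>, m_p);
constexpr std::size_t kFreshOffset = offsetof(Bar<7>, m_p);
static_assert(kStaleOffset == 24 && kFreshOffset == 28, "layout moved by 4");

// What a compiled accessor effectively does: load a word at a fixed offset.
inline std::uint32_t load_word(const void* obj, std::size_t offset) {
    std::uint32_t value;
    std::memcpy(&value, static_cast<const char*>(obj) + offset, sizeof value);
    return value;
}
```

Reading a Bar<7> object at kStaleOffset returns the last element of m_x rather than m_p, which is the same pattern as the zero at offset 24 in the memory dump from the question.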
Update 2:
I deleted all traces of the library from the disk and recompiled the .a file and reinstalled the library and the headers. I also tried to recompile the program when the lib was not present and got a linker error so it's definitely not using another version of the lib that i don't know about... I also deleted the complete build of the program and fully recompiled it... But it's still the same behavior..
You should verify that both files involved see the same definition of CREADER_EVENT_NOTIFICATION. Best to capture preprocessed files and compare the definition there (this is what the compiler actually sees). Be sure to use the exact compilation commands you used to build the library and the application.
One sneaky way ODR violations can creep in is if the #defines in effect when building the library and the application are different. For example, consider this code:
#ifdef NUM_XX
const int NUM_X = NUM_XX;
#else
const int NUM_X = 6;
#endif
struct Bar {
int m_x[NUM_X];
void *m_p;
void *Fn() { return m_p;}
};
Now compile foo.cc with -DNUM_XX=7 and bar.cc without it, and you've got an ODR violation.
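One way to catch such a mismatch at run time, sketched here under the assumption that you can add a function to the library: have the library report the layout it was compiled with and let the application compare it with its own view. BarLayout, library_bar_layout and bar_layout_matches are hypothetical names, and in this single-file sketch both sides trivially agree.

```cpp
#include <cassert>
#include <cstddef>

#ifdef NUM_XX
const int NUM_X = NUM_XX;
#else
const int NUM_X = 6;
#endif

struct Bar {
    int m_x[NUM_X];
    void* m_p;
    void* Fn() { return m_p; }
};

// Hypothetical fingerprint of the layout a translation unit was compiled with.
struct BarLayout {
    std::size_t size;
    std::size_t m_p_offset;
};

// In a real project this would be defined once inside the library (bar.cc),
// so it reports the layout that the library's build saw.
BarLayout library_bar_layout() {
    return BarLayout{ sizeof(Bar), offsetof(Bar, m_p) };
}

// Called from the application: compares the library's view with the view of
// the translation unit doing the calling.
bool bar_layout_matches(const BarLayout& from_library) {
    return from_library.size == sizeof(Bar) &&
           from_library.m_p_offset == offsetof(Bar, m_p);
}
```

If foo.cc were built with -DNUM_XX=7 and the library without it, the two views would differ and the check would fail at start-up instead of producing a silently wrong pointer later.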

Which is the faster operation? [closed]

I have two variables, a and b, and I have to write an if condition on them.
This is the first approach:
if(a > 0 || b >0){
//do some things
}
This is the second approach:
if((a+b) > 0){
//do some thing
}
Update: consider a and b to be unsigned. Which will take less execution time, the logical OR (||) or the arithmetic (+) operator?
This condition will be evaluated around one million times.
Any help on this will be appreciated.
Your second condition is wrong. If a=1, b=-1000, it will evaluate to false, whereas your first condition will be evaluated to true. In general you shouldn't worry about speed at these kind of tests, the compiler optimizes the condition a lot, so a logical OR is super fast. In general, people are making bigger mistakes than optimizing such conditions... So don't try to optimize unless you really know what is going on, the compiler is in general doing a much better job than any of us.
In principle, the first expression needs two CMPs and one OR, whereas the second needs only one ADD and one CMP, so the second should be faster (the compiler does short-circuit the first case, but that cannot help 100% of the time). However, in your case the expressions are not equivalent (well, they are for positive numbers, barring overflow...).
I decided to check this for C, but identical arguments apply to C++, and similar arguments apply to Java (except that Java defines signed overflow as wrapping). The following code was tested (for C++, replace _Bool with bool).
_Bool approach1(int a, int b) {
return a > 0 || b > 0;
}
_Bool approach2(int a, int b) {
return (a + b) > 0;
}
And this was the resulting disassembly.
.file "faster.c"
.text
.p2align 4,,15
.globl approach1
.type approach1, @function
approach1:
.LFB0:
.cfi_startproc
testl %edi, %edi
setg %al
testl %esi, %esi
setg %dl
orl %edx, %eax
ret
.cfi_endproc
.LFE0:
.size approach1, .-approach1
.p2align 4,,15
.globl approach2
.type approach2, @function
approach2:
.LFB1:
.cfi_startproc
addl %esi, %edi
testl %edi, %edi
setg %al
ret
.cfi_endproc
.LFE1:
.size approach2, .-approach2
.ident "GCC: (SUSE Linux) 4.8.1 20130909 [gcc-4_8-branch revision 202388]"
.section .note.GNU-stack,"",@progbits
Those codes are quite different, even considering how clever the compilers are these days. Why is that so? Well, the reason is quite simple - they aren't identical. If a is -42 and b is 2, the first approach will return true, and second will return false.
Surely, you may think that a and b should be unsigned.
.file "faster.c"
.text
.p2align 4,,15
.globl approach1
.type approach1, @function
approach1:
.LFB0:
.cfi_startproc
orl %esi, %edi
setne %al
ret
.cfi_endproc
.LFE0:
.size approach1, .-approach1
.p2align 4,,15
.globl approach2
.type approach2, @function
approach2:
.LFB1:
.cfi_startproc
addl %esi, %edi
testl %edi, %edi
setne %al
ret
.cfi_endproc
.LFE1:
.size approach2, .-approach2
.ident "GCC: (SUSE Linux) 4.8.1 20130909 [gcc-4_8-branch revision 202388]"
.section .note.GNU-stack,"",@progbits
It's quite easy to notice that approach1 is better here, because it doesn't do the pointless addition, which is, in fact, quite wrong. The compiler even optimizes it to (a | b) != 0, which is a correct optimization.
In C, unsigned overflow is defined to wrap around, so the compiler has to handle the case where the integers are very large (try UINT_MAX and 1 for approach2: the sum wraps to 0). Even assuming you know the numbers won't overflow, it's easy to see that approach1 is faster, because it simply tests whether both variables are 0.
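The two counterexamples can be checked directly. This is a small C++ sketch; the names approach1u, approach2u and approach_or are additions for the unsigned and bitwise variants, not names from the original code.

```cpp
#include <cassert>
#include <climits>

// Signed versions, as in the tested code.
inline bool approach1(int a, int b) { return a > 0 || b > 0; }
inline bool approach2(int a, int b) { return (a + b) > 0; }

// Unsigned versions: the addition may wrap around to zero.
inline bool approach1u(unsigned a, unsigned b) { return a > 0 || b > 0; }
inline bool approach2u(unsigned a, unsigned b) { return (a + b) > 0; }

// The form the compiler produced for the unsigned approach1.
inline bool approach_or(unsigned a, unsigned b) { return (a | b) != 0; }
```

For a = -42 and b = 2 the signed versions disagree, and for UINT_MAX and 1 the unsigned sum wraps to zero, so approach2u reports false while approach1u and approach_or report true.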
Trust your compiler, it will optimize better than you, and that without small bugs that you could accidentally write. Write code instead of asking yourself whether i++ or ++i is faster, or if x >> 1 or x / 2 is faster (by the way, x >> 1 doesn't do the same thing as x / 2 for signed numbers, because of rounding behavior).
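The rounding difference between shifting and dividing is easy to see with a negative odd number. A tiny sketch, assuming an arithmetic right shift for negative values (guaranteed since C++20, implementation-defined but near-universal before that); the function names are made up.

```cpp
#include <cassert>

// Integer division truncates toward zero.
inline int half_by_division(int x) { return x / 2; }

// An arithmetic right shift rounds toward negative infinity.
inline int half_by_shift(int x) { return x >> 1; }
```

For -3 the division yields -1 while the shift yields -2; for non-negative values the two agree, which is why the compiler can only substitute one for the other when it knows the sign or adds a correction.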
If you want to optimize something, optimize the algorithms you use. Instead of using a worst case O(N^4) sorting algorithm, use a worst case O(N log N) algorithm. This will actually make the program faster, especially if you sort reasonably big arrays.
The real answer for this is always to do both and actually test which one runs faster. That's the only way to know for sure.
I would guess the second one would run faster, because an add is a quick operation but a missed branch causes pipeline clears and all sort of nasty things. It would be data dependent though. But it isn't exactly the same, if a or b is allowed to be negative or big enough for overflow then it isn't the same test.
Well, I wrote some quick code and disassembled:
public boolean method1(final int a, final int b) {
if (a > 0 || b > 0) {
return true;
}
return false;
}
public boolean method2(final int a, final int b) {
if ((a + b) > 0) {
return true;
}
return false;
}
These produce:
public boolean method1(int, int);
Code:
0: iload_1
1: ifgt 8
4: iload_2
5: ifle 10
8: iconst_1
9: ireturn
10: iconst_0
11: ireturn
public boolean method2(int, int);
Code:
0: iload_1
1: iload_2
2: iadd
3: ifle 8
6: iconst_1
7: ireturn
8: iconst_0
9: ireturn
So as you can see, they're pretty similar; the only difference is performing a > 0 test vs a + b; looks like the || got optimized away. What the JIT compiler optimizes these to, I have no clue.
If you wanted to get really picky:
Option 1: Always 1 load and 1 comparison, possible 2 loads and 2 comparisons
Option 2: Always 2 loads, 1 addition, 1 comparison
So really, which one performs better depends on what your data looks like and whether there is a pattern the branch predictor can use. If so, I could imagine the first method running faster because the processor basically "skips" the checks, and in the best case only has to perform half the operations the second option will. To be honest, though, this really seems like premature optimization, and I'm willing to bet that you're much more likely to get more improvement elsewhere in your code. I don't find basic operations to be bottlenecks most of the time.
Two things:
(a|b) > 0 is strictly better than (a+b) > 0, so replace it.
The above two only work correctly if the numbers are both unsigned.
If a and b have the potential to be negative numbers, the two choices are not equivalent, as has been pointed out in the answer by @vsoftco.
If both a and b are guaranteed to be non-negative integers, I would use
if ( (a|b) > 0 )
instead of
if ( (a+b) > 0 )
I think bitwise | is faster than integer addition.
Update
Use bitwise | instead of &.

Can someone help me understand stmdb, ldmia, and how I can go about implementing this C++ code in arm assembly language?

So I have this piece of code, where N is the size of both arrays.
int i;
for (i = 0; i < N; i++)
{
if (listA[i] < listB[i])
{
listA[i] = listB[i];
}
}
I'm trying to implement this as an ARM Assembly subroutine, but I'm completely lost with how to deal with arrays. I have this so far:
sort1:
stmdb sp!, {v1-v5, lr}
ldmia sp!, {v1-v5, pc}
I assume that I have to use cmp to compare the values, but I'm not even sure what registers to use. Anyone have any guidance?
EDIT:
Okay I now have this code:
sort1:
stmdb sp!, {v1-v5, lr} # Copy registers to stack
ldr v1, [a1], #0 # Load a1
str v1, [a2], #0 # Copy elements of a1 to a2
ldmia sp!, {v1-v5, pc} # Copy stack back into registers
This copies the first four elements of a 10 element array, so I would assume if I changed the "#0" to "#4", it would cause the next four elements to change, but it doesn't. Why?
Firstly, as you've demonstrated, the load/store multiple instructions are primarily useful for stack operations (although they can also make an efficient memcpy). Simply put, they load/store the specified registers, in order, from/to a contiguous block of memory from base address to base address + (number of registers * 4).
In the example given, stmdb sp!, {v1-v5, lr} is storing 6 registers in the "Decrement Before" addressing mode [1], so the effective base address is sp-24 - it will store the contents of v1 at sp-24, v2 at sp-20,... up to lr at sp-4. Since the ! syntax for base register writeback is present, it will then subtract 24 from sp, leaving it pointing at the stored value of v1. The ldmia is the complete reverse - "Increment After" means the effective base address is sp, so it will load the registers from sp up to sp+20, then add 24 to sp. Note that it loads the stacked lr value directly into the pc - this way you restore the registers and perform the function return in a single instruction.
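The address arithmetic can be modelled in a few lines of C++. This is a toy model only, with a 16-word stack and made-up names (StackModel, stmdb, ldmia as member functions); it ignores everything about the real instructions except which addresses are touched and how sp is written back.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model: 16 words of stack memory, addressed in bytes.
struct StackModel {
    std::array<std::uint32_t, 16> words{};
    std::uint32_t sp = 16 * 4;  // empty descending stack: sp just past the top

    std::uint32_t& at(std::uint32_t address) {
        assert(address % 4 == 0 && address / 4 < words.size());
        return words[address / 4];
    }

    // stmdb sp!, {regs}: decrement before, lowest register at lowest address.
    template <std::size_t N>
    void stmdb(const std::array<std::uint32_t, N>& regs) {
        const std::uint32_t base = sp - static_cast<std::uint32_t>(4 * N);
        for (std::size_t i = 0; i < N; ++i) {
            at(base + static_cast<std::uint32_t>(4 * i)) = regs[i];
        }
        sp = base;  // writeback requested by '!'
    }

    // ldmia sp!, {regs}: load upward from sp, then increment sp.
    template <std::size_t N>
    std::array<std::uint32_t, N> ldmia() {
        std::array<std::uint32_t, N> regs{};
        for (std::size_t i = 0; i < N; ++i) {
            regs[i] = at(sp + static_cast<std::uint32_t>(4 * i));
        }
        sp += static_cast<std::uint32_t>(4 * N);  // writeback requested by '!'
        return regs;
    }
};
```

Pushing six values moves sp down by 24 with the first value at the new sp and the last at sp+20; popping six restores both the values and sp.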
As for the regular load/store instructions, they have 3 addressing modes - offset, pre-indexed and post-indexed. ldr v1, [a1], #0 is post-indexed, meaning "load v1 from the address in a1, then add 0 to a1", hence changing #0 to #4 doesn't affect the address used, only the value written back to the base register afterwards. If you'd got as far as implementing the loop there the effect would have become clearly visible.
It may be helpful to consider how some example C expressions map to these addressing modes:
int a; // r0
int *b; // r1
a = b[1]; // ldr r0, [r1, #4] (offset)
a = *(b+1); // similarly
a = *(++b); // ldr r0, [r1, #4]! (pre-indexed)
a = *(b++); // ldr r0, [r1], #4 (post-indexed)
Bear in mind the offset value can be a register instead of an immediate, too, so there are several possible ways to implement a loop like the one given.
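For the loop in the question, one possible shape is to walk both arrays with pointers so that each access lines up with one of the addressing modes above. The instruction comments are only a plausible hand mapping, not compiler output, and raise_to_b is a made-up name.

```cpp
#include <cassert>
#include <cstddef>

// listA[i] = max(listA[i], listB[i]) for i in [0, n).
inline void raise_to_b(int* listA, const int* listB, std::size_t n) {
    while (n-- > 0) {
        const int a = *listA;    // plain load:   ldr rA, [r0]
        const int b = *listB++;  // post-indexed: ldr rB, [r1], #4
        if (a < b) {
            *listA = b;          // store only when listA[i] < listB[i]
        }
        ++listA;                 // advance to the next element (add r0, r0, #4)
    }
}
```

In assembly the same structure needs a cmp between the two loaded values, a conditional store, and a counter that is decremented and tested at the bottom of the loop.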
For the authoritative reference, I'd recommend reading through the instruction section of the ARM Architecture Reference Manual, or for a less exhaustive but more accessible introduction, the Cortex-A Series Programmer's Guide.
[1] This implies a descending stack - corresponding "Decrement After" and "Increment Before" addressing modes exist for running an ascending stack.

Static Values in Assembler Code

I have the following simple code:
#include <cmath>
struct init_sin
{
typedef double type;
static constexpr type value(int index) {
return 3*std::pow(std::sin(index * 2.0 * 3.1415 / 20.0),1.999);
}
};
int main(){
static double VALUE = init_sin::value(10);
double VALUE_NONSTAT = 3*std::pow(std::sin(10 * 2.0 * 3.1415 / 20.0),1.999);
return int(VALUE_NONSTAT);
}
I would like to find out what the meaning of the assembler code is of this given piece.
Here the link to the assembly: http://pastebin.com/211AfSYh
I thought that VALUE is compile time computed and directly as value in the assembler code
Which should be in this line if I am not mistaken:
33 .size main, .-main
34 .data
35 .align 8
36 .type _ZZ4mainE5VALUE, @object
37 .size _ZZ4mainE5VALUE, 8
38 _ZZ4mainE5VALUE:
39 0000 15143B78 .long 2017137685
40 0004 45E95B3E .long 1046210885
Why are there two values with .long? And why is the type long when it's a double? (Maybe in assembler there is only long.)
Does that mean that the value of VALUE was generated at compile time?
Where is the result for VALUE_NONSTAT? This should be computed at run time, right? I cannot quite see where.
Thanks a lot!
.long in this assembler syntax implies a 32-bit number. Because a double is 64 bits, what you're seeing there is the two 32-bit parts of VALUE, in their double representation. You'll also notice above it that it's being aligned to an 8-byte boundary (through the .align statement) and that its size is 8 (through the .size statement). Also, it's in the main .data segment, which is typically used for variables with static storage duration (globals and, as here, function-local statics) that are not initialised to zero (as a side-note, .bss is typically used for the zero-initialised ones).
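To see how the two .long values form one double, they can be joined and reinterpreted. This sketch assumes a little-endian target (so the first .long is the low half) and an IEEE 754 64-bit double; join_words and bits_to_double are made-up helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Join the two 32-bit words emitted by .long into one 64-bit pattern.
// Assumes a little-endian target, where the first word is the low half.
inline std::uint64_t join_words(std::uint32_t first, std::uint32_t second) {
    return (static_cast<std::uint64_t>(second) << 32) | first;
}

// Reinterpret a 64-bit pattern as a double (assumes IEEE 754 binary64).
inline double bits_to_double(std::uint64_t bits) {
    static_assert(sizeof(double) == sizeof(bits), "expects a 64-bit double");
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Joining 2017137685 and 1046210885 gives 4493441537811354645, the same immediate that appears in the movabsq instruction, and as a double that pattern is a small positive number of roughly 2.6e-8.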
The VALUE_NONSTAT can be seen being loaded into %rax here, which is the 64-bit version of the AX register:
V
20 0004 48B81514 movabsq $4493441537811354645, %rax
20 3B7845E9
20 5B3E
Recalling that 15 14 3B 78 45 E9 5B 3E is the little-endian byte sequence of the double holding 3*std::pow(std::sin(index * 2.0 * 3.1415 / 20.0),1.999) with index = 10 (the decimal immediate 4493441537811354645 is the same 64-bit value, 0x3E5BE945783B1415), you can see the internal value in hex starting around where I inserted a V.
Later statements then push it onto the stack (movq %rax, -8(%rbp)), then load it into an FP register (movsd -8(%rbp), %xmm0) before converting it to an integer and storing it in %eax, which is the register for return values (cvttsd2si %xmm0, %eax) and then returning from the routine, using ret.
In any case, at the optimisation level you're using (and probably below), your compiler has figured out that VALUE_NONSTAT is a constant expression, and just inlined it at compile time instead, since the value is fully known at compile time.