32 bit PPC rlwinm instruction - c++

I'm having a bit of trouble understanding the rlwinm PPC Assembly instruction (Rotate Left Word Immediate Then AND with Mask).
I am trying to reverse this part of a function
rlwinm r3, r3, 0, 28, 28
I already know what r3 is. r3 in this case is a 4 byte integer but I am not sure exactly what this instruction rlwinm is doing to it.
By the way, this is on a 32 bit machine.

Your understanding is not quite right. As per the IBM link on this instruction, the form you're seeing is:
rlwinm <target=r3>, <source=r3>, <shift=0>, <begin-mask=28>, <end-mask=28>
Hence no actual shift is involved. And the actual mask used for the AND operation is constructed from the begin and end mask positions, it's not given as an explicit argument(a).
In this case, since both positions are 28, the mask is simply a single bit, as per the linked page (slightly paraphrased):
If the begin-mask value is less than the end-mask value plus one, then the mask bits between and including the starting point and the end point are set to ones. All other bits are set to zeros.
So the instruction you're seeing is nothing more complicated than a single AND operation.
(a) There is a form that allows you to specify the actual mask (assuming it consists of contiguous one-bits) but it's the four-argument version and really just syntactic sugar that the assembler can turn into the five-argument one.
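In C terms (my paraphrase: PowerPC numbers bit 0 as the MSB of the 32-bit word, so begin = end = 28 selects value bit 3):
r3 = r3 & 0x8;   /* keep only bit 3 (counting from the LSB), clear everything else */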

#paxdiablo's answer is correct, but to add some more context:
The various rlw* instructions (rlwinm, rlwimi, etc) are designed for extracting bit fields whose size is known at compile time, for example C struct bitfields, or even just splitting a word into bytes (which may be faster with one lwz and four rlwinm instructions than several separate lbzu loads).
lwz r4, 0(r3) ; load the word at the address pointed at by r3
rlwinm r5, r4, 8, 24, 31 ; first byte in r5
rlwinm r6, r4, 16, 24, 31 ; second byte in r6
rlwinm r7, r4, 24, 24, 31 ; third byte in r7
rlwinm r8, r4, 0, 24, 31 ; fourth byte in r8, identical to andi r8, r4, 255
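For reference, a C analogue of that byte split (my sketch, not from the original answer; it follows the big-endian PowerPC convention that the "first" byte is the most significant one):
#include <stdint.h>
void split_word(uint32_t w, uint8_t out[4])
{
    out[0] = (uint8_t)(w >> 24);  /* rlwinm r5, r4, 8,  24, 31 */
    out[1] = (uint8_t)(w >> 16);  /* rlwinm r6, r4, 16, 24, 31 */
    out[2] = (uint8_t)(w >> 8);   /* rlwinm r7, r4, 24, 24, 31 */
    out[3] = (uint8_t)w;          /* rlwinm r8, r4, 0,  24, 31 */
}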
The rlwinm instructions can also be used, as in this case, as a special form of andi for contiguous sets of bits. Since instructions in PowerPC are always 32 bits, instructions taking immediate values have only 16 bits to hold those values - so if you want to mask a set of bits that crosses the high/low half-word boundary, say 23 to 8, you need to use multiple operations.
lis r4, 0x00ff ; first set bits 23 to 16 of the mask
ori r4, r4, 0xff00 ; then bits 15 to 8
and r3, r3, r4 ; then perform the actual masking
However, with the rlwinm instructions, we can perform that same operation in a single instruction:
rlwinm r3, r3, 0, 8, 23
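In plain C, that one instruction is simply:
r3 = r3 & 0x00FFFF00;   /* one contiguous mask covering bits 23 down to 8 */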
In your case, the value probably holds flags for something and this instruction is extracting one of them. The next instruction is probably a conditional branch on r3.
ETA: Peter Cordes corrected some of my mistakes, for which I am grateful, and added that it was probably unnecessary to use rlwinm in this case, and that it may simply be a peculiarity of the compiler that causes this instruction to be generated instead of andi.

Related

Split a number into several numbers, each with only one significant bit

Is there any efficient algorithm (or processor instruction) that will help split a number (32-bit or 64-bit) into several numbers, each of which has only one 1-bit set?
I want to isolate each set bit in a number. For example,
input:
01100100
output:
01000000
00100000
00000100
The only thing that comes to mind is number & mask.
Assembly or C++.
Yes, in a similar way to Brian Kernighan's algorithm for counting set bits, except instead of counting the bits we extract and use the lowest set bit in every intermediate result:
while (number) {
// extract lowest set bit in number
uint64_t m = number & -number;
/// use m
...
// remove lowest set bit from number
number &= number - 1;
}
In modern x64 assembly, number & -number may be compiled to blsi, and number &= number - 1 may be compiled to blsr which are both fast, so this would only take a couple of efficient instructions to implement.
Since m is available, resetting the lowest set bit may be done with number ^= m but that may make it harder for the compiler to see that it can use blsr, which is a better choice because it depends only directly on number so it shortens the loop carried dependency chain.
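Putting that together, a self-contained version might look like this (my example; collecting the masks into an array is just one way to "use m"):
#include <stdint.h>
/* Store every isolated bit of x into out[]; returns how many were stored. */
unsigned isolate_bits(uint64_t x, uint64_t out[64])
{
    unsigned n = 0;
    while (x) {
        uint64_t m = x & -x;   /* lowest set bit; may compile to blsi */
        out[n++] = m;
        x &= x - 1;            /* clear lowest set bit; may compile to blsr */
    }
    return n;
}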
The standard way is
while (num) {
unsigned mask = num ^ (num & (num-1)); // This will have just one bit set
...
num ^= mask;
}
for example starting with num = 2019 you will get in order
1
2
32
64
128
256
512
1024
If you are going to iterate over the single-bit-isolated masks one at a time, generating them one at a time is efficient; see #harold's answer.
But if you truly just want all the masks, x86 with AVX512F can usefully parallelize this. (At least potentially useful depending on surrounding code. More likely this is just a fun exercise in applying AVX512 and not useful for most use-cases).
The key building block is AVX512F vpcompressd: given a mask (e.g. from a SIMD compare) it will shuffle the selected dword elements to contiguous elements at the bottom of a vector.
An AVX512 ZMM / __m512i vector holds 16x 32-bit integers, so we only need 2 vectors to hold every possible single-bit mask. Our input number is a mask that selects which of those elements should be part of the output. (No need to broadcast it into a vector and vptestmd or anything like that; we can just kmov it into a mask register and use it directly.)
See also my AVX512 answer on "AVX2 what is the most efficient way to pack left based on a mask?"
#include <stdint.h>
#include <immintrin.h>
// suggest 64-byte alignment for out_array
// returns count of set bits = length stored
unsigned bit_isolate_avx512(uint32_t out_array[32], uint32_t x)
{
const __m512i bitmasks_lo = _mm512_set_epi32(
1UL << 15, 1UL << 14, 1UL << 13, 1UL << 12,
1UL << 11, 1UL << 10, 1UL << 9, 1UL << 8,
1UL << 7, 1UL << 6, 1UL << 5, 1UL << 4,
1UL << 3, 1UL << 2, 1UL << 1, 1UL << 0
);
const __m512i bitmasks_hi = _mm512_slli_epi32(bitmasks_lo, 16); // compilers actually do constprop and load another 64-byte constant, but this is more readable in the source.
__mmask16 set_lo = x;
__mmask16 set_hi = x>>16;
int count_lo = _mm_popcnt_u32(set_lo); // doesn't actually cost a kmov; __mmask16 is really just uint16_t
_mm512_mask_compressstoreu_epi32(out_array, set_lo, bitmasks_lo);
_mm512_mask_compressstoreu_epi32(out_array+count_lo, set_hi, bitmasks_hi);
return _mm_popcnt_u32(x);
}
Compiles nicely with clang on Godbolt, and with gcc other than a couple minor sub-optimal choices with mov, movzx, and popcnt, and making a frame pointer for no reason. (It also can compile with -march=knl; it doesn't depend on AVX512BW or DQ.)
# clang9.0 -O3 -march=skylake-avx512
bit_isolate_avx512(unsigned int*, unsigned int):
movzx ecx, si
popcnt eax, esi
shr esi, 16
popcnt edx, ecx
kmovd k1, ecx
vmovdqa64 zmm0, zmmword ptr [rip + .LCPI0_0] # zmm0 = [1,2,4,8,16,32,64,128,256,512,1024,2048,4096,8192,16384,32768]
vpcompressd zmmword ptr [rdi] {k1}, zmm0
kmovd k1, esi
vmovdqa64 zmm0, zmmword ptr [rip + .LCPI0_1] # zmm0 = [65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728,268435456,536870912,1073741824,2147483648]
vpcompressd zmmword ptr [rdi + 4*rdx] {k1}, zmm0
vzeroupper
ret
On Skylake-AVX512, vpcompressd zmm{k1}, zmm is 2 uops for port 5. Latency from input vector -> output is 3 cycles, but latency from input mask -> output is 6 cycles. (https://www.uops.info/table.html / https://www.uops.info/html-instr/VPCOMPRESSD_ZMM_K_ZMM.html). The memory destination version is 4 uops: 2p5 + the usual store-address and store-data uops which can't micro-fuse when part of a larger instruction.
It might be better to compress into a ZMM reg and then store, at least for the first compress, to save total uops. The 2nd should probably still take advantage of the masked-store feature of vpcompressd [mem]{k1} so the output array doesn't need padding for it to step on. IDK if that helps with cache-line splits, i.e. whether masking can avoid replaying the store uop for the part with an all-zero mask in the 2nd cache line.
On KNL, vpcompressd zmm{k1} is only a single uop. Agner Fog didn't test it with a memory destination (https://agner.org/optimize/).
This is 14 fused-domain uops for the front-end on Skylake-X for the real work (e.g. after inlining into a loop over multiple x values, so we could hoist the vmovdqa64 loads out of the loop. Otherwise that's another 2 uops). So front-end bottleneck = 14 / 4 = 3.5 cycles.
Back-end port pressure: 6 uops for port 5 (2x kmov(1) + 2x vpcompressd(2)): 1 iteration per 6 cycles. (Even on IceLake (instlatx64), vpcompressd is still 2c throughput, unfortunately, so apparently ICL's extra shuffle port doesn't handle either of those uops. And kmovw k, r32 is still 1/clock, so presumably still port 5 as well.)
(Other ports are fine: popcnt runs on port 1, and that port's vector ALU is shut down when 512-bit uops are in flight. But not its scalar ALU, the only one that handles 3-cycle latency integer instructions. movzx dword, word can't be eliminated, only movzx dword, byte can do that, but it runs on any port.)
Latency: integer result is just one popcnt (3 cycles). First part of the memory result is stored about 7 cycles after the mask is ready. (kmov -> vpcompressd). The vector source for vpcompressd is a constant so OoO exec can get it ready plenty early unless it misses in cache.
Compacting the 1<<0..15 constant by building it with a shift would be possible but probably not worth it. e.g. loading 16-byte _mm_setr_epi8(0..15) with vpmovzxbd, then using that with vpsllvd on a vector of set1(1) (which you can get from a broadcast or generate on the fly with vpternlogd+shift). But that's probably not worth it even if you're writing by hand in asm (so it's your choice instead of the compiler) since this already uses a lot of shuffles, and constant-generation would take at least 3 or 4 instructions (each of which is at least 6 bytes long; EVEX prefixes alone are 4 bytes each).
I would generate the hi part with a shift from lo, instead of loading it separately, though. Unless the surrounding code bottlenecks hard on port 0, an ALU uop isn't worse than a load uop. One 64-byte constant fills a whole cache line.
You could compress the lo constant with a vpmovzxwd load: each element fits in 16 bits. Worth considering if you can hoist that outside of a loop so it doesn't cost an extra shuffle per operation.
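A sketch of that idea (my code, not from the original answer): keep the low-half constants as 16x uint16_t (32 bytes instead of 64) and widen them on load with vpmovzxwd.
#include <stdint.h>
#include <immintrin.h>
static __m512i load_bitmasks_lo(void)
{
    static const uint16_t bit16[16] = {
        1u<<0, 1u<<1, 1u<<2,  1u<<3,  1u<<4,  1u<<5,  1u<<6,  1u<<7,
        1u<<8, 1u<<9, 1u<<10, 1u<<11, 1u<<12, 1u<<13, 1u<<14, 1u<<15
    };
    /* vpmovzxwd: zero-extend 16x u16 to 16x u32 */
    return _mm512_cvtepu16_epi32(_mm256_loadu_si256((const __m256i *)bit16));
}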
If you wanted the result in a SIMD vector instead of stored to memory, you could 2x vpcompressd into registers and maybe use count_lo to look up a shuffle control vector for vpermt2d. Possibly from a sliding-window on an array instead of 16x 64-byte vectors? But the result isn't guaranteed to fit in one vector unless you know your input had 16 or fewer bits set.
Things are much worse for 64-bit integers: 8x 64-bit elements means we need 8 vectors. So it's maybe not worth it vs. scalar, unless your inputs have lots of bits set.
You can do it in a loop, though, using vpsllq by 8 to move the bits up within the 64-bit vector elements. You'd think kshiftrq would be good, but with 4 cycle latency that's a long loop-carried dep chain. And you need a scalar popcnt of each 8-bit chunk anyway to adjust the pointer. So your loop should use shr / kmov and movzx / popcnt. (Using a counter += 8 and bzhi to feed popcnt would cost more uops).
The loop-carried dependencies are all short (and the loop only runs 8 iterations to cover mask 64 bits), so out-of-order exec should be able to nicely overlap work for multiple iterations. Especially if we unroll by 2 so the vector and mask dependencies can get ahead of the pointer update.
vector: vpsllq immediate, starting from the vector constant
mask: shr r64, 8 starting with x. (Could stop looping when this becomes 0 after shifting out all the bits. This 1-cycle dep chain is short enough for OoO exec to zip through it and hide most of the mispredict penalty, when it happens.)
pointer: lea rdi, [rdi + rax*4] where RAX holds a popcnt result.
The rest of the work is all independent across iterations. Depending on surrounding code, we probably bottleneck on port 5 with the vpcompressd shuffles and kmov.
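A rough intrinsics sketch of that 64-bit loop (my code; the unrolling and register-allocation details discussed above are not reproduced):
#include <stdint.h>
#include <immintrin.h>
unsigned bit_isolate64_avx512(uint64_t out_array[64], uint64_t x)
{
    /* element i holds 1<<i; shifted up by 8 each iteration */
    __m512i bits = _mm512_set_epi64(1ULL<<7, 1ULL<<6, 1ULL<<5, 1ULL<<4,
                                    1ULL<<3, 1ULL<<2, 1ULL<<1, 1ULL<<0);
    uint64_t *p = out_array;
    for (int i = 0; i < 8; i++) {              /* 8 iterations cover 64 mask bits */
        __mmask8 m = (__mmask8)x;              /* low 8 bits select elements */
        _mm512_mask_compressstoreu_epi64(p, m, bits);
        p += _mm_popcnt_u32(m);                /* advance by number of bits stored */
        x >>= 8;                               /* shr: next 8 mask bits */
        bits = _mm512_slli_epi64(bits, 8);     /* move the single bits up by 8 */
    }
    return (unsigned)(p - out_array);
}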

Why does ARM use two instructions to mask a value?

For the following function...
uint16_t swap(const uint16_t value)
{
return value << 8 | value >> 8;
}
...why does ARM gcc 6.3.0 with -O2 yield the following assembly?
swap(unsigned short):
lsr r3, r0, #8
orr r0, r3, r0, lsl #8
lsl r0, r0, #16 # shift left
lsr r0, r0, #16 # shift right
bx lr
It appears the compiler is using two shifts to mask off the unwanted bytes, instead of using a logical AND. Could the compiler instead use and r0, r0, #4294901760?
Older ARM assembly cannot create constants easily. Instead, they are loaded into literal pools and then read in via a memory load. The and you suggest can, I believe, only take an 8-bit literal with a shift. Your 0xFFFF0000 needs 16 significant bits, so it cannot be encoded in a single instruction.
So we can either load the constant from memory and and with it (slow),
take 2 instructions to create the value and 1 to and with it (longer),
or just shift twice cheaply and call it good.
The compiler chose the shifts and honestly, it is plenty fast.
Now for a reality check:
Worrying about a single shift, unless this is a 100%-for-sure bottleneck, is a waste of time. Even if the compiler were sub-optimal, you would almost never feel it. Worry about "hot" loops in code instead for micro-optimizations like this. Looking at this out of curiosity is awesome. Worrying about this exact code for performance in your app, not so much.
Edit:
It has been noted by others here that newer versions of the ARM specifications allow this sort of thing to be done more efficiently. This shows that it is important, when talking at this level, to specify the chip or at least the exact ARM spec we are dealing with. I was assuming ancient ARM from the lack of "newer" instructions in your output. If we are tracking compiler bugs, then this assumption may not hold, and knowing the specification is even more important. For a swap like this, there are indeed simpler instructions to handle it in later versions.
Edit 2
One thing that could be done to possibly make this faster is to have it inlined. In that case, the compiler could interleave these operations with other work. Depending on the CPU, this could double the throughput here, as many ARM CPUs have 2 integer instruction pipelines. Spread out the instructions enough so that there are no hazards, and away it goes. This has to be weighed against I-cache usage, but in a case where it mattered, you could see something better.
There is a missed-optimization here, but and isn't the missing piece. Generating a 16-bit constant isn't cheap. For a loop, yes it would be a win to generate a constant outside the loop and use just and inside the loop. (TODO: call swap in a loop over an array and see what kind of code we get.)
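A sketch of that experiment (my code, untested against ARM gcc 6.3.0):
#include <stdint.h>
void swap_array(uint16_t *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = (uint16_t)((a[i] << 8) | (a[i] >> 8));
}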
For an out-of-order CPU, it could also be worth using multiple instructions off the critical path to build a constant, then you only have one AND on the critical path instead of two shifts. But that's probably rare, and not what gcc chooses.
AFAICT (from looking at compiler output for simple functions), the ARM calling convention guarantees there's no high garbage in input registers, and doesn't allow leaving high garbage in return values. i.e. on input, it can assume that the upper 16 bits of r0 are all zero, but must leave them zero on return. The value << 8 left shift is thus a problem, but the value >> 8 isn't (it doesn't have to worry about shifting garbage down into the low 16).
(Note that x86 calling conventions aren't like this: return values are allowed to have high garbage. (Maybe because the caller can simply use the 16-bit or 8-bit partial register). So are input values, except as an undocumented part of the x86-64 System V ABI: clang depends on input values being sign/zero extended to 32-bit. GCC provides this when calling, but doesn't assume as a callee.)
ARMv6 has a rev16 instruction which byte-swaps the two 16-bit halves of a register. If the upper 16 bits are already zeroed, they don't need to be re-zeroed, so gcc -march=armv6 should compile the function to just rev16. But in fact it emits a uxth to extract and zero-extend the low half-word. (i.e. exactly the same thing as and with 0x0000FFFF, but without needing a large constant). I believe this is pure missed optimization; presumably gcc's rotate idiom, or its internal definition for using rev16 that way, doesn't include enough info to let it realize the top half stays zeroed.
swap: ## gcc6.3 -O3 -march=armv6 -marm
rev16 r0, r0
uxth r0, r0 # not needed
bx lr
For ARM pre v6, a shorter sequence is possible. GCC only finds it if we hand-hold it towards the asm we want:
// better on pre-v6, worse on ARMv6 (defeats rev16 optimization)
uint16_t swap_prev6(const uint16_t value)
{
uint32_t high = value;
high <<= 24; // knock off the high bits
high >>= 16; // and place the low8 where we want it
uint8_t low = value >> 8;
return high | low;
//return value << 8 | value >> 8;
}
swap_prev6: # gcc6.3 -O3 -marm. (Or armv7 -mthumb for thumb2)
lsl r3, r0, #24
lsr r3, r3, #16
orr r0, r3, r0, lsr #8
bx lr
But this defeats gcc's rotate-idiom recognition, so it compiles to this same code even with -march=armv6, when the simple version compiles to rev16 / uxth.
All source + asm on the Godbolt compiler explorer
ARM is a RISC machine (Advanced RISC Machine), and thus all instructions are encoded in the same size, capped at 32 bits.
Immediate values in instructions are assigned a certain number of bits, and the AND instruction simply doesn't have enough bits assigned to immediate values to express an arbitrary 16-bit value.
That's the reason for the compiler resorting to two shift instructions instead.
However, if your target CPU is ARMv6 (ARM11) or higher, the compiler takes advantage of the new REV16 instruction, and then masks the lower 16 bits with a UXTH instruction, which is unnecessary and stupid, but there is simply no conventional way to persuade the compiler not to do this.
If you think that you would be served well by the GCC intrinsic __builtin_bswap16, you are dead wrong.
uint16_t swap(const uint16_t value)
{
return __builtin_bswap16(value);
}
The function above generates exactly the same machine code that your original C code did.
Using inline assembly doesn't help either:
uint16_t swap(const uint16_t value)
{
uint16_t result;
__asm__ __volatile__ ("rev16 %[out], %[in]" : [out] "=r" (result) : [in] "r" (value));
return result;
}
Again, exactly the same. You cannot get rid of the pesky UXTH as long as you use GCC; it simply cannot tell from the context that the upper 16 bits are all zeros to start with and thus that the UXTH is unnecessary.
Write the whole function in assembly; that's the only option.
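If you want to stay in a C source file, one possibility (my sketch, relying on the GCC naked extension, and not verified on every GCC version) is a naked function, which suppresses the compiler-generated prologue/epilogue and the extra zero-extension:
#include <stdint.h>
/* Sketch only: assumes the AAPCS guarantee that the caller passes value with its
   upper 16 bits already zeroed, so rev16 alone leaves a clean 16-bit result. */
__attribute__((naked)) uint16_t swap_naked(uint16_t value)
{
    __asm__ ("rev16 r0, r0\n\t"
             "bx lr");
}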
This is the optimal solution; the AND would require at least two more instructions, possibly having to stop and wait for a load of the value to mask. So it's worse in a couple of ways.
00000000 <swap>:
0: e1a03420 lsr r3, r0, #8
4: e1830400 orr r0, r3, r0, lsl #8
8: e1a00800 lsl r0, r0, #16
c: e1a00820 lsr r0, r0, #16
10: e12fff1e bx lr
00000000 <swap>:
0: ba40 rev16 r0, r0
2: b280 uxth r0, r0
4: 4770 bx lr
The latter is armv7, but that is because they added instructions to support exactly this kind of work.
Fixed-length RISC instructions by definition have a problem with constants. MIPS chose one way, ARM chose another. Constants are a problem on CISC as well, just a different problem. It's not difficult to create something that takes advantage of ARM's barrel shifter and shows a disadvantage of the MIPS solution, and vice versa.
The solution actually has a bit of elegance to it.
Part of this as well is the overall design of the target.
unsigned short fun ( unsigned short x )
{
return(x+1);
}
0000000000000010 <fun>:
10: 8d 47 01 lea 0x1(%rdi),%eax
13: c3 retq
gcc chooses not to return the 16-bit variable you asked for; it returns a 32-bit one, so it doesn't strictly implement the function I asked for with my code. But that is okay if, when the user of the data gets that result or uses it, the masking happens there, or if on this architecture ax is used instead of eax, for example.
unsigned short fun ( unsigned short x )
{
return(x+1);
}
unsigned int fun2 ( unsigned short x )
{
return(fun(x));
}
0000000000000010 <fun>:
10: 8d 47 01 lea 0x1(%rdi),%eax
13: c3 retq
0000000000000020 <fun2>:
20: 8d 47 01 lea 0x1(%rdi),%eax
23: 0f b7 c0 movzwl %ax,%eax
26: c3 retq
A compiler design choice (likely based on the architecture), not an implementation bug.
Note that for a sufficiently sized project, it is easy to find missed optimization opportunities. There is no reason to expect an optimizer to be perfect (it isn't and can't be). It just needs to be more efficient than a human doing it all by hand for a project of that size, on average.
This is why it is commonly said that for performance tuning you don't pre-optimize or jump straight to asm: you use the high-level language and the compiler, profile your way through to find the performance problems, then hand-code those. Why hand-code them? Because we know we can at times outperform the compiler, which implies the compiler output can be improved upon.
This isn't a missed-optimization opportunity; this is instead a very elegant solution for the instruction set. Masking a byte is simpler:
unsigned char fun ( unsigned char x )
{
return((x<<4)|(x>>4));
}
00000000 <fun>:
0: e1a03220 lsr r3, r0, #4
4: e1830200 orr r0, r3, r0, lsl #4
8: e20000ff and r0, r0, #255 ; 0xff
c: e12fff1e bx lr
00000000 <fun>:
0: e1a03220 lsr r3, r0, #4
4: e1830200 orr r0, r3, r0, lsl #4
8: e6ef0070 uxtb r0, r0
c: e12fff1e bx lr
The latter is armv7; with armv7 they recognized and solved these issues. You can't expect the programmer to always use natural-sized variables; some feel the need to use less optimally sized variables, and sometimes you still have to mask to a certain size.

Unexpected cast during assignment

There is piece of code in stm32 library that's behaving strangely. This is assignment made from initializing structure to timer auto-reload register:
/* Set the Autoreload value */
TIMx->ARR = TIM_TimeBaseInitStruct->TIM_Period ;
I have TIM_Period = 1999999; however, after the assignment, TIMx->ARR = 33919. A smaller number usually points to an overflow, so I checked: (1999999 - 33919) / 65536 = 30. This would mean the number overflowed 30 times on a 16-bit data type, but both variables are 32-bit unsigned integers. Extracted from the structure declarations:
For TIMx:
__IO uint32_t ARR; /*!< TIM auto-reload register, Address offset: 0x2C */
For TIM_TimeBaseInitStruct:
uint32_t TIM_Period; /*!< Specifies the period value to be loaded into the active
Auto-Reload Register at the next update event.
This parameter must be a number between 0x0000 and 0xFFFF. */
Where __IO is defined as volatile.
This is disassembly of that assignment:
296 TIMx->ARR = TIM_TimeBaseInitStruct->TIM_Period ;
0800c37c: ldr r3, [r7, #0]
0800c37e: ldr r2, [r3, #4]
0800c380: ldr r3, [r7, #4]
0800c382: str r2, [r3, #44] ; 0x2c
What is happening here? Could it be something external causing the value to overflow? Note that I'm debugging on real hardware through ST-Link with no code optimization.
I'm going to guess that your chip has 16-bit timer registers. That is, it might still be a 32-bit register, but only have 16 useful bits in it.
Something like:
 31                16 15               0
+--------------------+------------------+
|      RESERVED      | Auto-reload value|
+--------------------+------------------+
Fact checking forthcoming (if you have a specific part number that would help me out).
Edit: Looking at some documentation [PDF link], my guess seems to be confirmed.
Edit 2: Since you mentioned which chip you were using, I found that documentation too [PDF link], which contains a handy diagram of the timer features (not reproduced here).
Per that diagram, some timers have a 32-bit auto-reload register and some don't. Which timer you've chosen will affect the behaviour you see.
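A quick way to convince yourself this is plain 16-bit truncation (a standalone snippet, my example, not STM32 library code):
#include <stdio.h>
#include <stdint.h>
int main(void)
{
    uint32_t period = 1999999;            /* 0x001E847F */
    uint16_t arr16  = (uint16_t)period;   /* what a 16-bit ARR effectively keeps */
    printf("%u\n", arr16);                /* prints 33919 (0x847F) */
    return 0;
}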

Hypothetical - about making a header for an *existing* static/dynamic library

I want to learn more about unix/linux and this question popped into my head - let's say I made a static/dynamic library (.a or .so) and lost the c/c++ source code and header file. Default nm output gives me the names of the symbols but I need to know return types and parameter count/types to make a header. Is it possible to get this extra information somehow to reverse engineer a header for a given library?
You tagged C and C++ and the answer varies slightly between the two.
For C++, the method names of classes have type information embedded in the symbol name. You just have to figure out what kind of name mangling the compiler that built the library used.
For C, there's no real clean way to do it. You could take apart the assembly and analyze which registers and stack areas are read without having been written to figure out how many parameters a function takes. This would require knowledge of the calling conventions used by whatever compiler compiled the library.
Similarly, you can look at how each parameter is used in the assembly. If you see it being used in a load instruction, it is most likely a pointer of some sort, while if you see it being used in arithmetic, it's possibly an integer of some sort.
For the return type, you can check if anything seemingly meaningful is placed in the return register before a return instruction. Again, this requires knowledge of calling conventions for your platform.
Here's an example of how I would do things in ARM assembly.
I know that parameters in ARM are passed in registers r0 to r3 and the return value is stored in register r0. With that in mind, we can begin reverse engineering. Let's take a look at the assembly for two functions and try to work out what the function prototype was.
00000000 <func1>:
0: e3510000 cmp r1, #0
4: 0a000007 beq 28 <func1+0x28>
8: e0801001 add r1, r0, r1
c: e1a03000 mov r3, r0
10: e3a00000 mov r0, #0
14: e4d32001 ldrb r2, [r3], #1
18: e1530001 cmp r3, r1
1c: e0800002 add r0, r0, r2
20: 1afffffb bne 14 <func1+0x14>
24: e12fff1e bx lr
28: e1a00001 mov r0, r1
2c: e12fff1e bx lr
If we take a look here, r0 and r1 are both read before anything is written to them. We can also see that r2 and r3 are written before they are read. We can therefore infer that func1 takes a maximum of two parameters.
We also realise that r0 is moved to r3 and then used as an address to ldrb, which is an instruction to load a byte from memory. Hence, we infer that the first parameter is a pointer. Because the instruction only loads a single byte, we can also tell it might be a pointer to some sort of one byte data type.
The second parameter in r1 never seems to be used except in compare and add instructions so it is possibly an integer.
Before each bx lr (a return-to-caller instruction), something is placed in r0 so we infer that the function returns some sort of value.
If this function were presented to me, I'd guess that the function prototype would look something like this:
int func1(unsigned char *, int);
Original:
unsigned int func1(void *, unsigned int);
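For illustration, source along these lines would produce code like func1 (a hypothetical reconstruction; only the prototype above comes from the original — it sums n bytes starting at the pointer):
unsigned int func1(void *p, unsigned int n)
{
    if (n == 0)
        return 0;                 /* the early beq / mov r0, r1 path */
    const unsigned char *q = p;
    unsigned int sum = 0;
    do {
        sum += *q++;              /* ldrb r2, [r3], #1 ; add r0, r0, r2 */
    } while (q != (const unsigned char *)p + n);
    return sum;
}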
Here's another function
00000030 <func2>:
30: e0822001 add r2, r2, r1
34: e5c02000 strb r2, [r0]
38: e12fff1e bx lr
This one is very easy.
We see that r0, r1 and r2 are all read from before being written to so we can guess that the function takes three parameters. r0 is used as an address to a strb instruction (store byte) so it is probably a pointer. Again, it only stores a byte so it is probably a pointer to a byte sized data type.
The other two are only used in an add instruction so are probably integers.
Nothing seems to be placed into r0 at the end so the function either returns the first parameter or doesn't return a value.
I would guess the prototype would be one of the following
void func2(unsigned char *, int, int);
unsigned char *func2(unsigned char *, int, int);
Original:
void func2(char *, char, char);
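And a hypothetical reconstruction matching func2 (again, only the prototype is from the original):
void func2(char *p, char a, char b)
{
    *p = (char)(a + b);   /* add r2, r2, r1 ; strb r2, [r0] */
}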
Keeping in mind that calling conventions vary across processor instruction sets, and that you are already aware of name mangling when using C and C++ libraries together, you can try the following:
gdb <executable>
....
disas <function name>
....
From the disassembly you can make a rough guess about the return type and the parameters, using the sizes of the values read from registers and written to the stack in the assembly code.

Strange behaviour of ldr [pc, #value]

I was debugging some C++ code (WinCE 6 on an ARM platform),
and I found some behaviour strange:
4277220C mov r3, #0x93, 30
42772210 str r3, [sp]
42772214 ldr r3, [pc, #0x69C]
42772218 ldr r2, [pc, #0x694]
4277221C mov r1, #0
42772220 ldr r0, [pc, #0x688]
Line 42772214 ldr r3, [pc, #0x69C] is used to get some constant from .DATA section, at least I think so.
What is strange is that, according to the code, r3 should be loaded from address pc = 0x42772214 + 0x69C = 0x427728B0, but according to the memory contents it is loaded from 0x427728B8 (8 bytes further); this happens for the other ldr usages too.
Is it a fault of the debugger, or of my understanding of ldr/pc?
Another issue I don't get: why is access to the .data section relative to the executed code? I find it a little bit strange.
And one more issue: I cannot find the syntax of the 1st mov instruction (could anyone point me to an opcode specification for the Thumb (1C2)?).
Sorry for the layman's description, but I'm just getting familiar with assembly.
This is correct. When pc is used for reading there is an 8-byte offset in ARM mode and 4-byte offset in Thumb mode.
From the ARM-ARM:
When an instruction reads the PC, the value read depends on which instruction set it comes from:
For an ARM instruction, the value read is the address of the instruction plus 8 bytes. Bits [1:0] of this value are always zero, because ARM instructions are always word-aligned.
For a Thumb instruction, the value read is the address of the instruction plus 4 bytes. Bit [0] of this value is always zero, because Thumb instructions are always halfword-aligned.
This way of reading the PC is primarily used for quick, position-independent addressing of nearby instructions and data, including position-independent branching within a program.
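Applying that to the instruction in the post: the ldr at 0x42772214 executes in ARM mode, so it reads pc = 0x42772214 + 8 = 0x4277221C, and the literal is fetched from 0x4277221C + 0x69C = 0x427728B8, which is exactly the address the debugger showed.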
There are 2 reasons for pc-relative addressing.
Position-independent code, which is in your case.
Get some complicated constants nearby which cannot be written in 1 simple instruction, e.g. mov r3, #0x12345678 is impossible to complete in 1 instruction, so the compiler may put this constant in the end of the function and use e.g. ldr r3, [pc, #0x50] to load it instead.
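In C terms, a literal-pool load typically comes from something like this (a sketch; the exact pc-relative offset is up to the compiler):
unsigned int magic(void)
{
    return 0x12345678;   /* likely compiles to: ldr r0, [pc, #offset] ... bx lr */
}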
I don't know what mov r3, #0x93, 30 means. Probably it is mov r3, #0x93, rol 30 (which gives 0xC0000024)?