I'm doing a blackjack game for my final project for an assembly course. I have an array of words that represents 52 cards in a deck. My game wont be exactly like blackjack, but I need to demonstrate the basic concept of the game.
I'm trying to loop my deal function twice, but no matter what value I put into r4 (the loop counter) it only prints the output of the deal function once. I've looked at this in GDB and after the first iteration of deal I get an error:
[Inferior 1 (process 1585) exited with code 04]
mov r4, #0
cmp r4, #4
beq display_2
add r4, r4, #1
bal deal
Whole source code:
.equ INPUT, 0
.equ OUTPUT, 1
.equ LOW, 0
.equ HIGH, 1
.equ PIN0, 0 // wipi pin 0 - bcm 17
.equ PIN1, 1 // wipi pin 1 - bcm 18
.equ PIN2, 2 // wipi pin 2 - bcm 27
.equ PIN3, 3 // wipi pin 3 - bcm 22
.equ PIN4, 4 // wipi pin 4 - bcm 23
.equ PIN5, 5 // wipi pin 5 - bcm 24
.equ PIN6, 6 // wipi pin 6 - bcm 25
.equ PIN7, 7 // wipi pin 7 - bcm 4
.global main
.data
format: .asciz "r1=%d\n"
.balign 4
// Create a deck of 52 cards
deck:
.word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10
.word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10
.word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10
.word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10
.text
// ripped from C++ compiler's assembly output
.align 2
.L3: .word 1321528399
main:
push {lr}
bl wiringPiSetup
mov r0, #PIN0
mov r1, #OUTPUT
bl pinMode
mov r0, #PIN1
mov r1, #OUTPUT
bl pinMode
mov r0, #PIN2
mov r1, #OUTPUT
bl pinMode
mov r0, #PIN3
mov r1, #OUTPUT
bl pinMode
mov r0, #PIN4
mov r1, #OUTPUT
bl pinMode
deal:
// the following code was disassembled from the C++ compiler's
// rand function
mov r0, #0
bl time
mov r3, r0
mov r0, r3
bl srand
bl rand
mov r2, r0
ldr r3, .L3
smull r1, r3, r3, r2
mov r1, r3, asr #4
mov r3, r2, asr #31
rsb r3, r3, r1
mov r1, #52
mul r3, r1, r3
rsb r3, r3, r2
// end of C++ compiler's code
ldr r0, =format
// mov the random number generated (r3) into r1 for printing
mov r1, r3
// take the same value and store it also into r7 to preserve it
mov r7, r1
bl printf
ldr r0, =format
ldr r1, =deck
// setup r9 as the increment value leading to the next index
// of the array
mov r9, #4
// multiply into r8 the random number times the increment value
// of the array (4 bytes)
mul r8, r7, r9
// r8 now holds the randomized card just dealt to the player
// add this to the players score and get the actual value
// at from the address
add r1, r1, r8
ldr r1, [r1]
// mov into r7 the players score to preserve it
mov r7, r1
bl printf
display_2:
mov r4, #0
l:
cmp r4, #2
bne deal
add r4, r4, #1
bal l
// write the players score to the led display
mov r0, r7
bl digitalWriteByte
Sample output:
pi#raspberrypi:~ $ ./6leds.out
r1=50 // the random index chosen
r1=10 // the value stored at that array index
pi#raspberrypi:~ $ ./6leds.out
r1=6
r1=7
pi#raspberrypi:~ $ ./6leds.out
r1=6
r1=7
^ I would like it to do this twice instead of once
A loop might be structured like so:
// Initialize loop counter. Note this is outside the loop body.
mov #r4, 0
loop_entry:
// Test if the loop is done. This is a "while (...) { }" or for style loop
cmp #r4, <loop_limt>
bge loop_exit
// code that you want to execute multiple times.
// In the above this would be everything between deal: and display_2:
<...>
add r4, r4, #1
bal loop_entry
loop_exit:
// Do stuff after the loop.
In the code you posted, the initialization to zero effectively happens inside of the loop. I would expect this to produce an infinite loop, not one which terminates after one execution. (E.g. I don't see how the bal l ever gets executed. You can verify this by stepping through in the debugger.)
The only simple thing I can think of that you might be getting hung up on is the difference between labels and functions. In the code you posted, the only label that identifies a function is main:. Elsewhere, the code just keeps executing past a label in a straight line. (There is of course nothing special about the main: label either, it is how it is used.) So after the second printf following deal:, the move r4, #0 under display_2: gets executed.
Related
I have been assigned to convert this program to arm assembly v8.
int power(int x, int y){
if (x == 0){
return 0;
}
else if (y < 0){
return 0;
}
else if (y == 0){
return 1;
}
else {
return x * power(x, y - 1);
}
}
Although I'm not very familiar with ARM assembly language and would like to know where to start.
I have attempted to research a bit on this but ultimately found very little on the internet about ARM.
The magic command is arm-linux-gnueabi-gcc -S -O2 -march=armv8-a power.c.
I used arm-linux-gnueabi-gcc since I work on an X86-64 machine and gcc does not have ARM targets available. If you are on an arm system, you should be able to use regular gcc instead. If not it will error, but no harm done.
-S tells gcc to output assembly.
The -O2 is optional and just helps to optimize the code slightly and reduce debug clutter from the result.
-march=armv8-a tells it to use the ARM v8 target while compiling. I chose armv8-a somewhat arbitrarily. According to the docs all of the ARM v8 are armv8-a, armv8.1-a, armv8.2-a, armv8.3-a, armv8.4-a, armv8.5-a, armv8.6-a, armv8-m.base, armv8-m.main, and armv8.1-m.main. I have no idea what the differences are so you may want to choose a different one.
power.c just tells it which file to compile. Since we don't specify an output file (Ex: -o output.asm), the assembly will be outputted to power.s.
If you are not compiling on an arm machine that has provides the desired target with regular gcc, you can use arm-linux-gnueabi-gcc instead. If you do not have it installed, you can install it with:
sudo apt-get update
sudo apt-get install gcc-arm-linux-gnueabi binutils-arm-linux-gnueabi
Output
If anyone is curious, this is the output I received when I tried it on my machine.
.arch armv8-a
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 2
.eabi_attribute 30, 2
.eabi_attribute 34, 1
.eabi_attribute 18, 4
.file "testing.c"
.text
.align 2
.global power
.syntax unified
.arm
.fpu softvfp
.type power, %function
power:
# args = 0, pretend = 0, frame = 0
# frame_needed = 0, uses_anonymous_args = 0
# link register save eliminated.
clz r2, r0
mov r3, r0
lsr r2, r2, #5
orrs r2, r2, r1, lsr #31
bne .L4
cmp r1, #0
mov r0, #1
bxeq lr
.L3:
subs r1, r1, #1
mul r0, r3, r0
bne .L3
bx lr
.L4:
mov r0, #0
bx lr
.size power, .-power
.ident "GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0"
.section .note.GNU-stack,"",%progbits
How can I get started.
Here is my thought process for how I was able to solve this problem. Looking at the problem I could tell it would probably be in two parts:
How can that code be compiled to assembly?
For the first part I found this answer which provided the command gcc -S -fverbose-asm -O2 foo.c. After testing it, I decided to remove the -fverbose-asm since only seemed to provide clutter for a program this small.
How can I set the compiler target to ARM v8?
After a quick google search I found that gcc lets you specify the target architecture with -march=xxx. My next step was to find a list of ARM architectures that I could select from. After finding gcc.gnu.org/onlinedocs/gcc/ARM-Options.html, I selected armv8-a since it sounded the most correct. When I tried it out, gcc told me that the target architecture could not be found. This was not really a surprise since I am on x86-64 and usually compilers come with the compatible targets to reduce the space required. I knew this likely meant I would need to identify the apt package which provided the arm targets so I searched around until I found this answer which filled in the rest of the information I needed.
Compiler explorer is the friend in a such simple cases.
ARMv8-a Clang Assembly with the compiler option -O1 for keeping the recursion:
# Compilation provided by Compiler Explorer at https://godbolt.org/
power(int, int): // #power(int, int)
stp x29, x30, [sp, #-32]! // 16-byte Folded Spill
str x19, [sp, #16] // 8-byte Folded Spill
mov x29, sp
mov w19, w0
mov w0, wzr
cbz w19, .LBB0_5
tbnz w1, #31, .LBB0_5
cbz w1, .LBB0_4
sub w1, w1, #1
mov w0, w19
bl power(int, int)
mul w0, w0, w19
b .LBB0_5
.LBB0_4:
mov w0, #1
.LBB0_5:
ldr x19, [sp, #16] // 8-byte Folded Reload
ldp x29, x30, [sp], #32 // 16-byte Folded Reload
ret
ARM GCC (linux) Assembly with the compiler option -O1 for keeping the recursion:
# Compilation provided by Compiler Explorer at https://godbolt.org/
power(int, int):
push {r4, lr}
mov r4, r0
clz r0, r0
lsrs r0, r0, #5
orrs r3, r0, r1, lsr #31
it ne
movne r0, #0
beq .L6
.L1:
pop {r4, pc}
.L6:
movs r0, #1
cmp r1, #0
beq .L1
subs r1, r1, #1
mov r0, r4
bl power(int, int)
mul r0, r4, r0
b .L1
ARM GCC (none) Assembly with the compiler option -O1 for keeping the recursion:
# Compilation provided by Compiler Explorer at https://godbolt.org/
power(int, int):
push {r4, lr}
mov r4, r0
rsbs r0, r0, #1
movcc r0, #0
orrs r3, r0, r1, lsr #31
movne r0, #0
beq .L6
.L1:
pop {r4, lr}
bx lr
.L6:
cmp r1, #0
moveq r0, #1
beq .L1
sub r1, r1, #1
mov r0, r4
bl power(int, int)
mul r0, r4, r0
b .L1
ARM64 GCC Assembly with the compiler option -O1 for keeping the recursion:
# Compilation provided by Compiler Explorer at https://godbolt.org/
power(int, int):
cmp w1, 0
ccmp w0, 0, 4, ge
bne .L9
mov w0, 0
ret
.L9:
stp x29, x30, [sp, -32]!
mov x29, sp
str x19, [sp, 16]
mov w19, w0
mov w0, 1
cbnz w1, .L10
.L1:
ldr x19, [sp, 16]
ldp x29, x30, [sp], 32
ret
.L10:
sub w1, w1, #1
mov w0, w19
bl power(int, int)
mul w0, w19, w0
b .L1
I argued with a friend the other day about those two snippets. Which is faster and why ?
value = 5;
if (condition) {
value = 6;
}
and:
if (condition) {
value = 6;
} else {
value = 5;
}
What if value is a matrix ?
Note: I know that value = condition ? 6 : 5; exists and I expect it to be faster, but it wasn't an option.
Edit (requested by staff since question is on hold at the moment):
please answer by considering either x86 assembly generated by mainstream compilers (say g++, clang++, vc, mingw) in both optimized and non optimized versions or MIPS assembly.
when assembly differ, explain why a version is faster and when (e.g. "better because no branching and branching has following issue blahblah")
TL;DR: In unoptimized code, if without else seems irrelevantly more efficient but with even the most basic level of optimization enabled the code is basically rewritten to value = condition + 5.
I gave it a try and generated the assembly for the following code:
int ifonly(bool condition, int value)
{
value = 5;
if (condition) {
value = 6;
}
return value;
}
int ifelse(bool condition, int value)
{
if (condition) {
value = 6;
} else {
value = 5;
}
return value;
}
On gcc 6.3 with optimizations disabled (-O0), the relevant difference is:
mov DWORD PTR [rbp-8], 5
cmp BYTE PTR [rbp-4], 0
je .L2
mov DWORD PTR [rbp-8], 6
.L2:
mov eax, DWORD PTR [rbp-8]
for ifonly, while ifelse has
cmp BYTE PTR [rbp-4], 0
je .L5
mov DWORD PTR [rbp-8], 6
jmp .L6
.L5:
mov DWORD PTR [rbp-8], 5
.L6:
mov eax, DWORD PTR [rbp-8]
The latter looks slightly less efficient because it has an extra jump but both have at least two and at most three assignments so unless you really need to squeeze every last drop of performance (hint: unless you are working on a space shuttle you don't, and even then you probably don't) the difference won't be noticeable.
However, even with the lowest optimization level (-O1) both functions reduce to the same:
test dil, dil
setne al
movzx eax, al
add eax, 5
which is basically the equivalent of
return 5 + condition;
assuming condition is zero or one.
Higher optimization levels don't really change the output, except they manage to avoid the movzx by efficiently zeroing out the EAX register at the start.
Disclaimer: You probably shouldn't write 5 + condition yourself (even though the standard guarantees that converting true to an integer type gives 1) because your intent might not be immediately obvious to people reading your code (which may include your future self). The point of this code is to show that what the compiler produces in both cases is (practically) identical. Ciprian Tomoiaga states it quite well in the comments:
a human's job is to write code for humans and let the compiler write code for the machine.
The answer from CompuChip shows that for int they both are optimized to the same assembly, so it doesn't matter.
What if value is a matrix ?
I will interpret this in a more general way, i.e. what if value is of a type whose constructions and assignments are expensive (and moves are cheap).
then
T value = init1;
if (condition)
value = init2;
is sub-optimal because in case condition is true, you do the unnecessary initialization to init1 and then you do the copy assignment.
T value;
if (condition)
value = init2;
else
value = init3;
This is better. But still sub-optimal if default construction is expensive and if copy construction is more expensive then initialization.
You have the conditional operator solution which is good:
T value = condition ? init1 : init2;
Or, if you don't like the conditional operator, you can create a helper function like this:
T create(bool condition)
{
if (condition)
return {init1};
else
return {init2};
}
T value = create(condition);
Depending on what init1 and init2 are you can also consider this:
auto final_init = condition ? init1 : init2;
T value = final_init;
But again I must emphasize that this is relevant only when construction and assignments are really expensive for the given type. And even then, only by profiling you know for sure.
In pseudo-assembly language,
li #0, r0
test r1
beq L1
li #1, r0
L1:
may or may not be faster than
test r1
beq L1
li #1, r0
bra L2
L1:
li #0, r0
L2:
depending on how sophisticated the actual CPU is. Going from simplest to fanciest:
With any CPU manufactured after roughly 1990, good performance depends on the code fitting within the instruction cache. When in doubt, therefore, minimize code size. This weighs in favor of the first example.
With a basic "in-order, five-stage pipeline" CPU, which is still roughly what you get in many microcontrollers, there is a pipeline bubble every time a branch—conditional or unconditional—is taken, so it is also important to minimize the number of branch instructions. This also weighs in favor of the first example.
Somewhat more sophisticated CPUs—fancy enough to do "out-of-order execution", but not fancy enough to use the best known implementations of that concept—may incur pipeline bubbles whenever they encounter write-after-write hazards. This weighs in favor of the second example, where r0 is written only once no matter what. These CPUs are usually fancy enough to process unconditional branches in the instruction fetcher, so you aren't just trading the write-after-write penalty for a branch penalty.
I don't know if anyone is still making this kind of CPU anymore. However, the CPUs that do use the "best known implementations" of out-of-order execution are likely to cut corners on the less frequently used instructions, so you need to be aware that this sort of thing can happen. A real example is false data dependencies on the destination registers in popcnt and lzcnt on Sandy Bridge CPUs.
At the highest end, the OOO engine will wind up issuing exactly the same sequence of internal operations for both code fragments—this is the hardware version of "don't worry about it, the compiler will generate the same machine code either way." However, code size still does matter, and now you also should be worrying about the predictability of the conditional branch. Branch prediction failures potentially cause a complete pipeline flush, which is catastrophic for performance; see Why is it faster to process a sorted array than an unsorted array? to understand how much difference this can make.
If the branch is highly unpredictable, and your CPU has conditional-set or conditional-move instructions, this is the time to use them:
li #0, r0
test r1
setne r0
or
li #0, r0
li #1, r2
test r1
movne r2, r0
The conditional-set version is also more compact than any other alternative; if that instruction is available it is practically guaranteed to be the Right Thing for this scenario, even if the branch was predictable. The conditional-move version requires an additional scratch register, and always wastes one li instruction's worth of dispatch and execute resources; if the branch was in fact predictable, the branchy version may well be faster.
In unoptimised code, the first example assigns a variable always once and sometimes twice. The second example only ever assigns a variable once. The conditional is the same on both code paths, so that shouldn't matter. In optimised code, it depends on the compiler.
As always, if you are that concerned, generate the assembly and see what the compiler is actually doing.
What would make you think any of them even the one liner is faster or slower?
unsigned int fun0 ( unsigned int condition, unsigned int value )
{
value = 5;
if (condition) {
value = 6;
}
return(value);
}
unsigned int fun1 ( unsigned int condition, unsigned int value )
{
if (condition) {
value = 6;
} else {
value = 5;
}
return(value);
}
unsigned int fun2 ( unsigned int condition, unsigned int value )
{
value = condition ? 6 : 5;
return(value);
}
More lines of code of a high level language gives the compiler more to work with so if you want to make a general rule about it give the compiler more code to work with. If the algorithm is the same like the cases above then one would expect the compiler with minimal optimization to figure that out.
00000000 <fun0>:
0: e3500000 cmp r0, #0
4: 03a00005 moveq r0, #5
8: 13a00006 movne r0, #6
c: e12fff1e bx lr
00000010 <fun1>:
10: e3500000 cmp r0, #0
14: 13a00006 movne r0, #6
18: 03a00005 moveq r0, #5
1c: e12fff1e bx lr
00000020 <fun2>:
20: e3500000 cmp r0, #0
24: 13a00006 movne r0, #6
28: 03a00005 moveq r0, #5
2c: e12fff1e bx lr
not a big surprise it did the first function in a different order, same execution time though.
0000000000000000 <fun0>:
0: 7100001f cmp w0, #0x0
4: 1a9f07e0 cset w0, ne
8: 11001400 add w0, w0, #0x5
c: d65f03c0 ret
0000000000000010 <fun1>:
10: 7100001f cmp w0, #0x0
14: 1a9f07e0 cset w0, ne
18: 11001400 add w0, w0, #0x5
1c: d65f03c0 ret
0000000000000020 <fun2>:
20: 7100001f cmp w0, #0x0
24: 1a9f07e0 cset w0, ne
28: 11001400 add w0, w0, #0x5
2c: d65f03c0 ret
Hopefully you get the idea you could have just tried this if it wasnt obvious that the different implementations were not actually different.
As far as a matrix goes, not sure how that matters,
if(condition)
{
big blob of code a
}
else
{
big blob of code b
}
just going to put the same if-then-else wrapper around the big blobs of code be they value=5 or something more complicated. Likewise the comparison even if it is a big blob of code it still has to be computed, and equal to or not equal to something is often compiled with the negative, if (condition) do something is often compiled as if not condition goto.
00000000 <fun0>:
0: 0f 93 tst r15
2: 03 24 jz $+8 ;abs 0xa
4: 3f 40 06 00 mov #6, r15 ;#0x0006
8: 30 41 ret
a: 3f 40 05 00 mov #5, r15 ;#0x0005
e: 30 41 ret
00000010 <fun1>:
10: 0f 93 tst r15
12: 03 20 jnz $+8 ;abs 0x1a
14: 3f 40 05 00 mov #5, r15 ;#0x0005
18: 30 41 ret
1a: 3f 40 06 00 mov #6, r15 ;#0x0006
1e: 30 41 ret
00000020 <fun2>:
20: 0f 93 tst r15
22: 03 20 jnz $+8 ;abs 0x2a
24: 3f 40 05 00 mov #5, r15 ;#0x0005
28: 30 41 ret
2a: 3f 40 06 00 mov #6, r15 ;#0x0006
2e: 30 41
we just went through this exercise with someone else recently on stackoverflow. this mips compiler interestingly in that case not only realized the functions were the same, but had one function simply jump to the other to save on code space. Didnt do that here though
00000000 <fun0>:
0: 0004102b sltu $2,$0,$4
4: 03e00008 jr $31
8: 24420005 addiu $2,$2,5
0000000c <fun1>:
c: 0004102b sltu $2,$0,$4
10: 03e00008 jr $31
14: 24420005 addiu $2,$2,5
00000018 <fun2>:
18: 0004102b sltu $2,$0,$4
1c: 03e00008 jr $31
20: 24420005 addiu $2,$2,5
some more targets.
00000000 <_fun0>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 0bf5 0004 tst 4(r5)
8: 0304 beq 12 <_fun0+0x12>
a: 15c0 0006 mov $6, r0
e: 1585 mov (sp)+, r5
10: 0087 rts pc
12: 15c0 0005 mov $5, r0
16: 1585 mov (sp)+, r5
18: 0087 rts pc
0000001a <_fun1>:
1a: 1166 mov r5, -(sp)
1c: 1185 mov sp, r5
1e: 0bf5 0004 tst 4(r5)
22: 0204 bne 2c <_fun1+0x12>
24: 15c0 0005 mov $5, r0
28: 1585 mov (sp)+, r5
2a: 0087 rts pc
2c: 15c0 0006 mov $6, r0
30: 1585 mov (sp)+, r5
32: 0087 rts pc
00000034 <_fun2>:
34: 1166 mov r5, -(sp)
36: 1185 mov sp, r5
38: 0bf5 0004 tst 4(r5)
3c: 0204 bne 46 <_fun2+0x12>
3e: 15c0 0005 mov $5, r0
42: 1585 mov (sp)+, r5
44: 0087 rts pc
46: 15c0 0006 mov $6, r0
4a: 1585 mov (sp)+, r5
4c: 0087 rts pc
00000000 <fun0>:
0: 00a03533 snez x10,x10
4: 0515 addi x10,x10,5
6: 8082 ret
00000008 <fun1>:
8: 00a03533 snez x10,x10
c: 0515 addi x10,x10,5
e: 8082 ret
00000010 <fun2>:
10: 00a03533 snez x10,x10
14: 0515 addi x10,x10,5
16: 8082 ret
and compilers
with this i code one would expect the different targets to match as well
define i32 #fun0(i32 %condition, i32 %value) #0 {
%1 = icmp ne i32 %condition, 0
%. = select i1 %1, i32 6, i32 5
ret i32 %.
}
; Function Attrs: norecurse nounwind readnone
define i32 #fun1(i32 %condition, i32 %value) #0 {
%1 = icmp eq i32 %condition, 0
%. = select i1 %1, i32 5, i32 6
ret i32 %.
}
; Function Attrs: norecurse nounwind readnone
define i32 #fun2(i32 %condition, i32 %value) #0 {
%1 = icmp ne i32 %condition, 0
%2 = select i1 %1, i32 6, i32 5
ret i32 %2
}
00000000 <fun0>:
0: e3a01005 mov r1, #5
4: e3500000 cmp r0, #0
8: 13a01006 movne r1, #6
c: e1a00001 mov r0, r1
10: e12fff1e bx lr
00000014 <fun1>:
14: e3a01006 mov r1, #6
18: e3500000 cmp r0, #0
1c: 03a01005 moveq r1, #5
20: e1a00001 mov r0, r1
24: e12fff1e bx lr
00000028 <fun2>:
28: e3a01005 mov r1, #5
2c: e3500000 cmp r0, #0
30: 13a01006 movne r1, #6
34: e1a00001 mov r0, r1
38: e12fff1e bx lr
fun0:
push.w r4
mov.w r1, r4
mov.w r15, r12
mov.w #6, r15
cmp.w #0, r12
jne .LBB0_2
mov.w #5, r15
.LBB0_2:
pop.w r4
ret
fun1:
push.w r4
mov.w r1, r4
mov.w r15, r12
mov.w #5, r15
cmp.w #0, r12
jeq .LBB1_2
mov.w #6, r15
.LBB1_2:
pop.w r4
ret
fun2:
push.w r4
mov.w r1, r4
mov.w r15, r12
mov.w #6, r15
cmp.w #0, r12
jne .LBB2_2
mov.w #5, r15
.LBB2_2:
pop.w r4
ret
Now technically there is a performance difference in some of these solutions, sometimes the result is 5 case has a jump over the result is 6 code, and vice versa, is a branch faster than executing through? one could argue but the execution should vary. But that is more of an if condition vs if not condition in the code resulting in the compiler doing the if this jump over else execute through. but this is not necessarily due to the coding style but the comparison and the if and the else cases in whatever syntax.
Ok, since assembly is one of the tags, I will just assume your code is pseudo code (and not necessarily c) and translate it by human into 6502 assembly.
1st Option (without else)
ldy #$00
lda #$05
dey
bmi false
lda #$06
false brk
2nd Option (with else)
ldy #$00
dey
bmi else
lda #$06
sec
bcs end
else lda #$05
end brk
Assumptions: Condition is in Y register set this to 0 or 1 on the first line of either option, result will be in accumulator.
So, after counting cycles for both possibilities of each case, we see that the 1st construct is generally faster; 9 cycles when condition is 0 and 10 cycles when condition is 1, whereas option two is also 9 cycles when condition is 0, but 13 cycles when condition is 1. (cycle counts do not include the BRK at the end).
Conclusion: If only is faster than If-Else construct.
And for completeness, here is an optimized value = condition + 5 solution:
ldy #$00
lda #$00
tya
adc #$05
brk
This cuts our time down to 8 cycles (again not including the BRK at the end).
My CPU is arm.How can I figure out the function parameter value if it's optimized out?
For example:
status_t NuPlayer::GenericSource::setDataSource(
int fd, int64_t offset, int64_t length) {
resetDataSource();
mFd = dup(fd);
mOffset = offset;
mLength = length;
Above function has 3 parameters, when I try to print the second parameter offset, I will get below result:
Thread 4 "Binder:15082_3" hit Breakpoint 1, android::NuPlayer::GenericSource::setDataSource (this=0xae63bb40, fd=8, offset=<optimized out>, length=9384436) at frameworks/av/media/libmediaplayerservice/nuplayer/GenericSource.cpp:123
123 resetDataSource();
(gdb) x/i $pc
=> 0xb02aaa80 <android::NuPlayer::GenericSource::setDataSource(int, long long, long long)+12>: blx 0xb0282454 <_ZN7android8NuPlayer13GenericSource15resetDataSourceEv#plt>
(gdb) n
125 mFd = dup(fd);
(gdb) print offset
$1 = <optimized out>
(gdb) p $eax
$2 = void
(gdb) disassemble /m
Dump of assembler code for function android::NuPlayer::GenericSource::setDataSource(int, long long, long long):
122 int fd, int64_t offset, int64_t length) {
0xb02aaa74 <+0>: push {r4, r5, r6, r7, lr}
0xb02aaa76 <+2>: sub sp, #4
0xb02aaa78 <+4>: mov r4, r3
0xb02aaa7a <+6>: mov r5, r2
0xb02aaa7c <+8>: mov r6, r1
0xb02aaa7e <+10>: mov r7, r0
123 resetDataSource();
=> 0xb02aaa80 <+12>: blx 0xb0282454 <_ZN7android8NuPlayer13GenericSource15resetDataSourceEv#plt>
124
125 mFd = dup(fd);
0xb02aaa84 <+16>: mov r0, r6
0xb02aaa86 <+18>: blx 0xb027e5d8 <dup#plt>
0xb02aaa8a <+22>: ldrd r2, r1, [sp, #24]
0xb02aaa8e <+26>: str.w r0, [r7, #224] ; 0xe0
0xb02aaa92 <+30>: movs r0, #0
126 mOffset = offset;
0xb02aaa94 <+32>: strd r5, r4, [r7, #232] ; 0xe8
127 mLength = length;
0xb02aaa98 <+36>: strd r2, r1, [r7, #240] ; 0xf0
128
129 // delay data source creation to prepareAsync() to avoid blocking
130 // the calling thread in setDataSource for any significant time.
131 return OK;
0xb02aaa9c <+40>: add sp, #4
0xb02aaa9e <+42>: pop {r4, r5, r6, r7, pc}
End of assembler dump.
(gdb)
I guess it's in some register but the result of $eax is void.
I guess it's in some register but the result of $eax is void.
There is no register called eax on ARM.
To know which register the parameter is in, you need to know calling convention.
Looks like you are using 32-bit ARM. From above link:
r0 to r3: used to hold argument values passed to a subroutine
So you should do info registers, verify that r0 == 0xae63bb40, r1 == 8 and find the offset in r2.
Sounds like example code has assigned the parameter variable to local variable already, so print that value will be exactly the same as optimized out parameters.
mOffset = offset;
mLength = length;
I have small application that compiles and runs well on my ARM Cortex M4. But when I disassemble binary file, that I flush, here is how first bytes look like:
00000000 <.data>:
0: 20020000 andcs r0, r2, r0
4: 080003b5 stmdaeq r0, {r0, r2, r4, r5, r7, r8, r9}
8: 08000345 stmdaeq r0, {r0, r2, r6, r8, r9}
c: 08000351 stmdaeq r0, {r0, r4, r6, r8, r9}
080003b5 should be the address of Reset handler (I have .word Reset_Handler there), but disassembling ELF shows that Reset handler is actually located at 080003b4, which is 1 byte before:
080003b4 <Reset_Handler>:
80003b4: 2100 movs r1, #0
80003b6: e003 b.n 80003c0 <InitData>
(It's running in THUMB mode, I have 2byte instructions).
Even if I disassemble the binary file, it's located at 080003b4:
000003b4 <.data+0x3b4>:
3b4: 2100 movs r1, #0
3b6: e003 b.n 0x3c0
My question is, why does it point 1 byte after? This code surprisingly works on actual board. Even without disassembling, shouldn't instructions be aligned by 2 byte? how can address be 0x000003b5?
Answer: ARM uses it for switching to THUMB mode.
I had worked out some code for my assignment and something tells me that I'm not doing it correctly.. Hope someone can take a look at it.
Thank you!
AREA Reset, CODE, READONLY
ENTRY
LDR r1, = 0x13579BA0
MOV r3, #0
MOV r4, #0
MOV r2, #8
Loop CMP r2, #0
BGE DONE
LDR r5, [r1, r4]
AND r5, r5, #0x00000000
ADD r3, r3, r5
ADD r4, r4, #4
SUB r2, r2, #1
B Loop
LDR r0, [r3]
DONE B DONE
END
Write an ARM assembly program that will add the hexadecimal digits in register 1 and save the sum in register 0. For example, if r1 is initialized as follows:
LDR r1, =0x120A760C
When you program has run to completion, register 0 will contain the sum of 1+2+0+A+7+6+0+C.
You will need to use the following in your solution:
· An 8-iteration loop
· Logical shift right instruction
· The AND instruction (used to force selected bits to 0)
I know that I did not even use LSR. where should I put it? I'm just getting started on Assembly hope someone makes some improvements on this code..