Iteration Vs. Recursion for Printing Integer Numbers to a Character LCD/OLED Display - c++

Question
I am looking for some input on how to optimize printing the digits of an integer number, say uint32_t num = 1234567890;, to a character display with an Arduino UNO. The main metrics to consider are memory usage and compiled size. The display is so slow that no improvement in speed would be meaningful, and minimum code length, while nice, isn't a requirement.
Currently, I am extracting the least significant digit using num%10 and then removing this digit by num/10 and so on until all the digits of num are extracted. Using recursion I can reverse the order of the printing so very few operations are needed (as explicit lines of code) to print the digits in proper order. Using for loops I need to find the number of characters used to write the number, then store them before being able to print them in the correct order, requiring an array and 3 for loops.
According to the Arduino IDE, when printing an assortment of signed and unsigned integers, recursion uses 2010/33 bytes of storage/memory and iteration uses 2200/33 bytes, versus 2474/52 bytes when using the Adafruit_CharacterOLED library that extends class Print.
Is there a way to implement this better than the functions I've written using recursion and iteration below? If not, which would you prefer and why? I feel like there might be a better way to do this with fewer resources, but maybe I'm Don Quixote fighting windmills and the code is already good enough.
Background
I'm working with a NHD-0420DZW character OLED display and have used the Newhaven datasheet and LiquidCrystal library as a guide to write my own library and the display is working great. However, to minimize code bloat, I chose to not make my display library a subclass of Print, which is a part of the Arduino core libraries. In doing this, significant savings in storage space (~400 bytes) and memory (~19 bytes) have already been realized (the ATmega328P has 32k storage with 2k RAM, so resources are scarce).
Recursion
If I use recursion, the print method is rather elegant. The number is divided by 10 until the base case of zero is achieved. Then the least significant digit of the smallest number is printed (MSD of num), and the LSD of the next smallest number (second MSD of num) and so on, causing the final print order to be reversed. This corrects for the reversed order of digit extraction using %10 and /10 operations.
// print integer type literals to display (base-10 representation)
void NewhavenDZW::print(int8_t num) {print(static_cast<int32_t>(num));}
void NewhavenDZW::print(uint8_t num) {print(static_cast<uint32_t>(num));}
void NewhavenDZW::print(int16_t num) {print(static_cast<int32_t>(num));}
void NewhavenDZW::print(uint16_t num) {print(static_cast<uint32_t>(num));}
void NewhavenDZW::print(int32_t num) {
if(num < 0) { // print negative sign if present
send('-', HIGH); // and make num positive
print(-static_cast<uint32_t>(num)); // negate as unsigned (well defined even for INT32_MIN)
} else
print(static_cast<uint32_t>(num));
}
void NewhavenDZW::print(uint32_t num) {
if(num < 10) { // print single digit numbers directly
send(num + '0', HIGH);
return;
} else // use recursion to print nums with more
recursivePrint(num); // than one digit in the correct order
}
// recursive method for printing a number "backwards"
// used to correct the reversed order of digit extraction
void NewhavenDZW::recursivePrint(uint32_t num) {
if(num) { // true if num>0, false if num==0
recursivePrint(num/10); // maximum of 11 recursive steps
send(num%10 + '0', HIGH); // for a 10 digit number
}
}
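The print order produced by the recursion can be checked off-target. Below is a minimal host-side sketch, assuming a hypothetical send() that appends to a std::string instead of writing to the display (lcd and render() are also made up for illustration; the real send() takes a second argument):

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for the display: collects characters in a string.
std::string lcd;
void send(char c) { lcd += c; }

// Same idea as recursivePrint() above: recurse on num/10 first, then emit
// num%10, so the most significant digit is sent first.
void recursivePrint(uint32_t num) {
    if (num) {
        recursivePrint(num / 10);
        send(static_cast<char>(num % 10 + '0'));
    }
}

void print(uint32_t num) {
    if (num < 10)
        send(static_cast<char>(num + '0'));  // also covers num == 0
    else
        recursivePrint(num);
}

// Convenience wrapper for inspecting the result on the host.
std::string render(uint32_t num) {
    lcd.clear();
    print(num);
    return lcd;
}
```

Calling render(1234567890u) should give the digits in reading order, and render(0) a single '0' thanks to the num < 10 shortcut.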
Iteration
Since the digit extraction method starts at the LSD, rather than the MSD, the extracted digits cannot be printed directly unless I move the cursor and tell the display to print right-to-left. So I have to store the digits as I extract them before I can write them to the display in the correct order.
void NewhavenDZW::print(uint32_t num) {
if(num < 10) {
send(num + '0', HIGH);
return;
}
uint8_t length = 0;
for(uint32_t i=num; i>0; i/=10) // determine number of characters
++length; // needed to represent number
char text[length];
for(uint8_t i=length; num>0; num/=10, --i)
text[i-1] = num%10 + '0'; // map each numerical digit to
for(uint8_t i=0; i<length; i++) // its char value and fix ordering
send(text[i], HIGH); // before printing result
}
Update
Ultimately, recursion takes the least storage space, but is likely to use the most memory.
After reviewing the code kindly provided by Igor G and darune, as well as looking at the number of instructions listed on godbolt (as discussed by darune and old_timer), I believe that Igor G's solution is the best overall. It compiles to 2076 bytes vs. 2096 bytes for darune's function (using an if statement to stop leading zeros and be able to print 0) during testing. It also requires fewer instructions (88) than darune's (273) when the necessary if statement is tacked on.
Using Pointer Variable
void NewhavenDZW::print(uint32_t num) {
char buffer[10];
char* p = buffer;
do {
*p++ = num%10 + '0';
num /= 10;
} while (num);
while (p != buffer)
send(*--p, HIGH);
}
Using Index Variable
This is what my original for loop was trying to do, but in a naive way. There is really no point in trying to minimize the size of the buffer array, as Igor G has pointed out.
void NewhavenDZW::print(uint32_t num) {
char text[10]; // signed/unsigned 32-bit ints are <= 10 digits
uint8_t i = sizeof(text); // set index one past the end of char array
do {
text[--i] = num%10 + '0'; // store each numerical digit as
num /= 10; // its associated char value
} while (num);
while (i < sizeof(text))
send(text[i++], HIGH); // print num in the correct order
}
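For the signed overloads, one detail worth checking on the host is the most negative value. Here is a rough sketch, assuming a hypothetical single-argument send() that captures output in a string; the names printUnsigned, printSigned and render are made up for illustration. The magnitude is computed in unsigned arithmetic, so nothing overflows for INT32_MIN:

```cpp
#include <cstdint>
#include <string>

std::string out;                  // hypothetical capture buffer
void send(char c) { out += c; }   // stand-in for send(c, HIGH)

void printUnsigned(uint32_t num) {
    char text[10];                // a 32-bit value has at most 10 digits
    uint8_t i = sizeof(text);     // index one past the last slot
    do {
        text[--i] = static_cast<char>(num % 10 + '0');
        num /= 10;
    } while (num);
    while (i < sizeof(text))      // i now points at the most significant digit
        send(text[i++]);
}

void printSigned(int32_t num) {
    uint32_t mag = static_cast<uint32_t>(num);
    if (num < 0) {
        send('-');
        mag = 0u - mag;           // unsigned negation wraps; no signed overflow
    }
    printUnsigned(mag);
}

std::string render(int32_t num) {
    out.clear();
    printSigned(num);
    return out;
}
```

The fixed 10-byte buffer matches the reasoning above: sizing it exactly would need an extra counting loop and saves nothing on the stack worth having.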
The Alternative
Here's darune's function with the added if statement, for those who don't want to sift through the comments. The condition pow10 == 100 is the same as pow10 == 1, but saves two iterations of the loop to print zero while having the same compile size.
void NewhavenDZW::print(uint32_t num) {
for (uint32_t pow10 = 1000000000; pow10 != 0; pow10 /= 10)
if (num >= pow10 || (num == 0 && pow10 == 100))
send((num/pow10)%10 + '0', HIGH);
}
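Since num is never modified in this loop, interior zeros survive the leading-zero test. A small host-side sketch to convince yourself, assuming a hypothetical string-capturing send() and using the pow10 == 1 form of the zero check (render() is a made-up helper):

```cpp
#include <cstdint>
#include <string>

std::string out;                  // hypothetical capture buffer
void send(char c) { out += c; }   // stand-in for send(c, HIGH)

void print(uint32_t num) {
    for (uint32_t pow10 = 1000000000; pow10 != 0; pow10 /= 10)
        // Print once num has reached this decade; for num == 0 print only
        // the final (ones) digit.
        if (num >= pow10 || (num == 0 && pow10 == 1))
            send(static_cast<char>((num / pow10) % 10 + '0'));
}

std::string render(uint32_t num) {
    out.clear();
    print(num);
    return out;
}
```

For example, render(105) keeps the middle zero because 105 >= 10 is still true when pow10 is 10.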

For a smaller footprint you can use something like this:
void Send(unsigned char);
void SmallPrintf(unsigned long val)
{
static_assert(sizeof(decltype(val)) == 4, "expected '10 digit type'");
for (unsigned long digit_pow10{1000000000}; digit_pow10 != 0; digit_pow10 /= 10)
{
Send((val / digit_pow10 % 10) + '0');
}
}
This produces around 70 instructions, which is about 14 fewer than filling a buffer and then iterating over it. (The code is also a lot simpler.)
Link to godbolt.
If leading zeros are unwanted, an if clause can suppress them fairly simply, something like:
if (val >= digit_pow10) {
Send((val / digit_pow10 % 10) + '0');
}
It will cost some extra instructions (~9), but the total is still below the buffered example. Note that this guard on its own prints nothing when val is 0.

Try this one. My avr-gcc-5.4.0 + readelf tells that the function body is only 138 bytes.
#include <limits.h> // CHAR_BIT
#include <stdint.h> // uint8_t, uint32_t
void Send(uint8_t);
void OptimizedPrintf(uint32_t val)
{
uint8_t buffer[sizeof(val) * CHAR_BIT / 3 + 1];
uint8_t* p = buffer;
do
{
*p++ = (val % 10) + '0';
val /= 10;
} while (val);
while (p != buffer)
Send(*--p);
}

Interesting experiment.
unsigned long fun ( unsigned long x )
{
return(x/10);
}
unsigned long fun2 ( unsigned long x )
{
return(x%10);
}
int main ( void )
{
return(0);
}
does/can give with an apt-got toolchain:
00000000 <fun>:
0: 2a e0 ldi r18, 0x0A ; 10
2: 30 e0 ldi r19, 0x00 ; 0
4: 40 e0 ldi r20, 0x00 ; 0
6: 50 e0 ldi r21, 0x00 ; 0
8: 0e d0 rcall .+28 ; 0x26 <__udivmodsi4>
a: 95 2f mov r25, r21
c: 84 2f mov r24, r20
e: 73 2f mov r23, r19
10: 62 2f mov r22, r18
12: 08 95 ret
00000014 <fun2>:
14: 2a e0 ldi r18, 0x0A ; 10
16: 30 e0 ldi r19, 0x00 ; 0
18: 40 e0 ldi r20, 0x00 ; 0
1a: 50 e0 ldi r21, 0x00 ; 0
1c: 04 d0 rcall .+8 ; 0x26 <__udivmodsi4>
1e: 08 95 ret
00000020 <main>:
20: 80 e0 ldi r24, 0x00 ; 0
22: 90 e0 ldi r25, 0x00 ; 0
24: 08 95 ret
00000026 <__udivmodsi4>:
26: a1 e2 ldi r26, 0x21 ; 33
28: 1a 2e mov r1, r26
2a: aa 1b sub r26, r26
2c: bb 1b sub r27, r27
2e: ea 2f mov r30, r26
30: fb 2f mov r31, r27
32: 0d c0 rjmp .+26 ; 0x4e <__udivmodsi4_ep>
00000034 <__udivmodsi4_loop>:
34: aa 1f adc r26, r26
36: bb 1f adc r27, r27
38: ee 1f adc r30, r30
3a: ff 1f adc r31, r31
3c: a2 17 cp r26, r18
3e: b3 07 cpc r27, r19
40: e4 07 cpc r30, r20
42: f5 07 cpc r31, r21
44: 20 f0 brcs .+8 ; 0x4e <__udivmodsi4_ep>
46: a2 1b sub r26, r18
48: b3 0b sbc r27, r19
4a: e4 0b sbc r30, r20
4c: f5 0b sbc r31, r21
0000004e <__udivmodsi4_ep>:
4e: 66 1f adc r22, r22
50: 77 1f adc r23, r23
52: 88 1f adc r24, r24
54: 99 1f adc r25, r25
56: 1a 94 dec r1
58: 69 f7 brne .-38 ; 0x34 <__udivmodsi4_loop>
5a: 60 95 com r22
5c: 70 95 com r23
5e: 80 95 com r24
60: 90 95 com r25
62: 26 2f mov r18, r22
64: 37 2f mov r19, r23
66: 48 2f mov r20, r24
68: 59 2f mov r21, r25
6a: 6a 2f mov r22, r26
6c: 7b 2f mov r23, r27
6e: 8e 2f mov r24, r30
70: 9f 2f mov r25, r31
72: 08 95 ret
That answers one of my questions: 39 instructions (78 bytes) for the division function. It also returns both the quotient and the remainder in one call, something that could be taken advantage of if desperate.
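One portable way to ask for quotient and remainder together is the standard ldiv() function, which avr-libc also declares in stdlib.h. The following is only a host-side sketch under assumptions: send() and render() are hypothetical helpers, the value is taken as non-negative, and whether this ends up smaller than separate / and % on AVR would have to be measured, since the compiler may already merge the two operations.

```cpp
#include <cstdlib>
#include <string>

std::string out;                  // hypothetical capture buffer
void send(char c) { out += c; }   // stand-in for the display write

// Prints a non-negative value, taking quotient and remainder from a single
// ldiv() call per digit.
void print(long num) {
    char buf[20];                 // enough digits even for a 64-bit long
    char* p = buf;
    do {
        std::ldiv_t qr = std::ldiv(num, 10L);
        *p++ = static_cast<char>(qr.rem + '0');
        num = qr.quot;
    } while (num);
    while (p != buf)
        send(*--p);
}

std::string render(long num) {
    out.clear();
    print(num);
    return out;
}
```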
unsigned int fun ( unsigned int x )
{
return(x/10);
}
unsigned int fun2 ( unsigned int x )
{
return(x%10);
}
int main ( void )
{
return(0);
}
gives
00000000 <fun>:
0: 6a e0 ldi r22, 0x0A ; 10
2: 70 e0 ldi r23, 0x00 ; 0
4: 0a d0 rcall .+20 ; 0x1a <__udivmodhi4>
6: 86 2f mov r24, r22
8: 97 2f mov r25, r23
a: 08 95 ret
0000000c <fun2>:
c: 6a e0 ldi r22, 0x0A ; 10
e: 70 e0 ldi r23, 0x00 ; 0
10: 04 d0 rcall .+8 ; 0x1a <__udivmodhi4>
12: 08 95 ret
00000014 <main>:
14: 80 e0 ldi r24, 0x00 ; 0
16: 90 e0 ldi r25, 0x00 ; 0
18: 08 95 ret
0000001a <__udivmodhi4>:
1a: aa 1b sub r26, r26
1c: bb 1b sub r27, r27
1e: 51 e1 ldi r21, 0x11 ; 17
20: 07 c0 rjmp .+14 ; 0x30 <__udivmodhi4_ep>
00000022 <__udivmodhi4_loop>:
22: aa 1f adc r26, r26
24: bb 1f adc r27, r27
26: a6 17 cp r26, r22
28: b7 07 cpc r27, r23
2a: 10 f0 brcs .+4 ; 0x30 <__udivmodhi4_ep>
2c: a6 1b sub r26, r22
2e: b7 0b sbc r27, r23
00000030 <__udivmodhi4_ep>:
30: 88 1f adc r24, r24
32: 99 1f adc r25, r25
34: 5a 95 dec r21
36: a9 f7 brne .-22 ; 0x22 <__udivmodhi4_loop>
38: 80 95 com r24
3a: 90 95 com r25
3c: 68 2f mov r22, r24
3e: 79 2f mov r23, r25
40: 8a 2f mov r24, r26
42: 9b 2f mov r25, r27
44: 08 95 ret
22 instructions, 44 bytes for the divide. Also interesting. This might be taken advantage of at the C++ level to save space (if this display loop is the only place you do a divide/modulo).
of course the optimizer knows that the library function does both result and remainder:
unsigned long fun ( unsigned long x )
{
unsigned long res;
unsigned long rem;
res=x/10;
rem=x&10; /* bitwise AND, as compiled in the listing below ("and r12, r18"); x%10 was presumably intended */
res&=0xFFFF;
rem&=0xFFFF;
return((res<<16)|rem);
}
int main ( void )
{
return(0);
}
00000000 <fun>:
0: cf 92 push r12
2: df 92 push r13
4: ef 92 push r14
6: ff 92 push r15
8: c6 2e mov r12, r22
a: d7 2e mov r13, r23
c: e8 2e mov r14, r24
e: f9 2e mov r15, r25
10: 2a e0 ldi r18, 0x0A ; 10
12: 30 e0 ldi r19, 0x00 ; 0
14: 40 e0 ldi r20, 0x00 ; 0
16: 50 e0 ldi r21, 0x00 ; 0
18: 19 d0 rcall .+50 ; 0x4c <__udivmodsi4>
1a: a2 2f mov r26, r18
1c: b3 2f mov r27, r19
1e: 99 27 eor r25, r25
20: 88 27 eor r24, r24
22: 2a e0 ldi r18, 0x0A ; 10
24: c2 22 and r12, r18
26: dd 24 eor r13, r13
28: ee 24 eor r14, r14
2a: ff 24 eor r15, r15
2c: 68 2f mov r22, r24
2e: 79 2f mov r23, r25
30: 8a 2f mov r24, r26
32: 9b 2f mov r25, r27
34: 6c 29 or r22, r12
36: 7d 29 or r23, r13
38: 8e 29 or r24, r14
3a: 9f 29 or r25, r15
3c: ff 90 pop r15
3e: ef 90 pop r14
40: df 90 pop r13
42: cf 90 pop r12
44: 08 95 ret
00000046 <main>:
46: 80 e0 ldi r24, 0x00 ; 0
48: 90 e0 ldi r25, 0x00 ; 0
4a: 08 95 ret
0000004c <__udivmodsi4>:
4c: a1 e2 ldi r26, 0x21 ; 33
4e: 1a 2e mov r1, r26
50: aa 1b sub r26, r26
52: bb 1b sub r27, r27
54: ea 2f mov r30, r26
56: fb 2f mov r31, r27
58: 0d c0 rjmp .+26 ; 0x74 <__udivmodsi4_ep>
0000005a <__udivmodsi4_loop>:
5a: aa 1f adc r26, r26
5c: bb 1f adc r27, r27
5e: ee 1f adc r30, r30
60: ff 1f adc r31, r31
62: a2 17 cp r26, r18
64: b3 07 cpc r27, r19
66: e4 07 cpc r30, r20
68: f5 07 cpc r31, r21
6a: 20 f0 brcs .+8 ; 0x74 <__udivmodsi4_ep>
6c: a2 1b sub r26, r18
6e: b3 0b sbc r27, r19
70: e4 0b sbc r30, r20
72: f5 0b sbc r31, r21
00000074 <__udivmodsi4_ep>:
74: 66 1f adc r22, r22
76: 77 1f adc r23, r23
78: 88 1f adc r24, r24
7a: 99 1f adc r25, r25
7c: 1a 94 dec r1
7e: 69 f7 brne .-38 ; 0x5a <__udivmodsi4_loop>
80: 60 95 com r22
82: 70 95 com r23
84: 80 95 com r24
86: 90 95 com r25
88: 26 2f mov r18, r22
8a: 37 2f mov r19, r23
8c: 48 2f mov r20, r24
8e: 59 2f mov r21, r25
90: 6a 2f mov r22, r26
92: 7b 2f mov r23, r27
94: 8e 2f mov r24, r30
96: 9f 2f mov r25, r31
98: 08 95 ret
Sorry for using this space to have fun with this exercise of seeing what a compiler does with this problem. Trying to use the 16-bit division starts to explode register usage, burning through the 34 bytes saved.
Because of the fixed denominator and because this is an 8 bit processor you can play optimization games with the compiler, but you may take this down an unreadable path.
I'm still pretty sure that this can be done tighter by skipping the division library function and doing it all yourself, knowing that this is an AVR. Shifts are brutal though: they need lots of registers, and once you spill over, that explodes the size of the function too. Very delicate.
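As an illustration of the do-it-yourself route, here is a sketch of a well-known shift-and-add divide by 10 (this form appears in Hacker's Delight). It is not taken from this thread, and whether it actually beats __udivmodsi4 in size on an AVR is untested here; 32-bit shifts on an 8-bit core are exactly the register-hungry part mentioned above.

```cpp
#include <cstdint>

// Approximates n * 0.8 with shifts and adds, scales down by 8, then corrects
// the estimate by at most one using the remainder.
uint32_t div10(uint32_t n) {
    uint32_t q = (n >> 1) + (n >> 2);        // q = n * 0.75
    q += q >> 4;                             // refine towards n * 0.8
    q += q >> 8;
    q += q >> 16;
    q >>= 3;                                 // q is now about n / 10 (may be one low)
    uint32_t r = n - (((q << 2) + q) << 1);  // r = n - q * 10
    return q + (r > 9);                      // fix up the estimate
}
```

The digit itself is then n - div10(n) * 10, so no separate modulo is needed.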
For the price of one Uno you could have bought a handful of Blue Pills with a lot more of everything, including 32-bit registers and a multiply, which turns a divide by 10 into a few instructions. You can still use the Arduino sandbox, and it runs way faster. (More flash, more RAM, a compiler-friendly instruction set, and likely no more need to count bytes; try compiling your project for that target and see how much of the flash is used.)
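The multiply trick referred to here can be written out by hand. A minimal sketch, assuming a target with a 32x32 to 64-bit multiply; compilers for 32-bit targets typically apply this transformation on their own, so this is only to show what happens (toDecimal is a made-up helper that renders into a string instead of a display):

```cpp
#include <cstdint>
#include <string>

// Divide by 10 via multiplication by ceil(2^35 / 10) = 0xCCCCCCCD followed
// by a shift; the result is exact for every 32-bit n.
uint32_t div10(uint32_t n) {
    return static_cast<uint32_t>((static_cast<uint64_t>(n) * 0xCCCCCCCDu) >> 35);
}

// Hypothetical helper: collects the decimal digits in a string.
std::string toDecimal(uint32_t n) {
    char buf[10];                 // a 32-bit value has at most 10 digits
    char* p = buf;
    do {
        uint32_t q = div10(n);
        *p++ = static_cast<char>('0' + (n - q * 10));  // remainder without a second divide
        n = q;
    } while (n);
    std::string s;
    while (p != buf)
        s += *--p;
    return s;
}
```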

Related

Hardware supported popcount for dynamic bitset in Boost library

How to enable the hardware supported popcount for counting set bits in the dynamic bitset from the Boost 1.64.0 library?
#include <boost/dynamic_bitset.hpp>
#include <boost/function_output_iterator.hpp>
#include <cstddef>
std::size_t fn(boost::dynamic_bitset<> const & p)
{
std::size_t acc = 0;
boost::to_block_range(p, boost::make_function_output_iterator(
[&acc](boost::dynamic_bitset<>::block_type v)
{
acc += __builtin_popcountll(v);
}
));
return acc;
}
Compiles to (g++ -O3 -march=native -c bitset.cpp -std=c++14):
30: 48 8b 77 08 mov 0x8(%rdi),%rsi
34: 48 8b 17 mov (%rdi),%rdx
37: 48 89 f0 mov %rsi,%rax
3a: 48 29 d0 sub %rdx,%rax
3d: 48 83 f8 07 cmp $0x7,%rax
41: b8 00 00 00 00 mov $0x0,%eax
46: 7e 1d jle 65 <_Z3fn3RKN5boost14dynamic_bitsetImSaImEEE+0x35>
48: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
4f: 00
50: 31 c9 xor %ecx,%ecx
52: 48 83 c2 08 add $0x8,%rdx
56: f3 48 0f b8 4a f8 popcnt -0x8(%rdx),%rcx
5c: 48 01 c8 add %rcx,%rax
5f: 48 39 d6 cmp %rdx,%rsi
62: 75 ec jne 50 <_Z3fn3RKN5boost14dynamic_bitsetImSaImEEE+0x20>
64: c3 retq
65: c3 retq
66: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
6d: 00 00 00

Why does GAS inline assembly wrapped in a function generate different instructions for the caller than a pure assembly function

I've been writing some basic functions using GCC's asm to practice for an actual application.
My functions pretty, wrap, and pure generate the same instructions to unpack a 64 bit integer into a 128 bit vector. add1 and add2 which call pretty and wrap respectively also generate the same instructions. But add3 differs by saving its xmm0 register by pushing it to the stack rather than by copying it to another xmm register. This I don't understand because the compiler can see the details of pure to know none of the other xmm registers will be clobbered.
Here is the C++
#include <immintrin.h>
__m128i pretty(long long b) { return (__m128i){b,b}; }
__m128i wrap(long long b) {
asm ("mov qword ptr [rsp-0x10], rdi\n"
"vmovddup xmm0, qword ptr [rsp-0x10]\n"
:
: "r"(b)
);
}
extern "C" __m128i pure(long long b);
asm (".text\n.global pure\n\t.type pure, #function\n"
"pure:\n\t"
"mov qword ptr [rsp-0x10], rdi\n\t"
"vmovddup xmm0, qword ptr [rsp-0x10]\n\t"
"ret\n\t"
);
__m128i add1(__m128i in, long long in2) { return in + pretty(in2);}
__m128i add2(__m128i in, long long in2) { return in + wrap(in2);}
__m128i add3(__m128i in, long long in2) { return in + pure(in2);}
Compiled with g++ -c so.cpp -march=native -masm=intel -O3 -fno-inline and disassembled with objdump -d -M intel so.o | c++filt.
so.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <pure>:
0: 48 89 7c 24 f0 mov QWORD PTR [rsp-0x10],rdi
5: c5 fb 12 44 24 f0 vmovddup xmm0,QWORD PTR [rsp-0x10]
b: c3 ret
c: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
0000000000000010 <pretty(long long)>:
10: 48 89 7c 24 f0 mov QWORD PTR [rsp-0x10],rdi
15: c5 fb 12 44 24 f0 vmovddup xmm0,QWORD PTR [rsp-0x10]
1b: c3 ret
1c: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
0000000000000020 <wrap(long long)>:
20: 48 89 7c 24 f0 mov QWORD PTR [rsp-0x10],rdi
25: c5 fb 12 44 24 f0 vmovddup xmm0,QWORD PTR [rsp-0x10]
2b: c3 ret
2c: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
0000000000000030 <add1(long long __vector(2), long long)>:
30: c5 f8 28 c8 vmovaps xmm1,xmm0
34: 48 83 ec 08 sub rsp,0x8
38: e8 00 00 00 00 call 3d <add1(long long __vector(2), long long)+0xd>
3d: 48 83 c4 08 add rsp,0x8
41: c5 f9 d4 c1 vpaddq xmm0,xmm0,xmm1
45: c3 ret
46: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
4d: 00 00 00
0000000000000050 <add2(long long __vector(2), long long)>:
50: c5 f8 28 c8 vmovaps xmm1,xmm0
54: 48 83 ec 08 sub rsp,0x8
58: e8 00 00 00 00 call 5d <add2(long long __vector(2), long long)+0xd>
5d: 48 83 c4 08 add rsp,0x8
61: c5 f9 d4 c1 vpaddq xmm0,xmm0,xmm1
65: c3 ret
66: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
6d: 00 00 00
0000000000000070 <add3(long long __vector(2), long long)>:
70: 48 83 ec 18 sub rsp,0x18
74: c5 f8 29 04 24 vmovaps XMMWORD PTR [rsp],xmm0
79: e8 00 00 00 00 call 7e <add3(long long __vector(2), long long)+0xe>
7e: c5 f9 d4 04 24 vpaddq xmm0,xmm0,XMMWORD PTR [rsp]
83: 48 83 c4 18 add rsp,0x18
87: c3 ret
GCC does not understand assembly language.
Since pure is an external function, the compiler cannot determine which registers it alters, so according to the ABI it has to assume all the xmm registers are changed.
wrap has undefined behaviour: the asm statement clobbers xmm0 and [rsp-0x10], which are not listed as clobbers or outputs (setting them to a value which may or may not depend on b), and the function has no return statement.
Edit: The ABI does not apply to inline assembly; I expect your program will not work if you remove -fno-inline from the command line.

Adding UNUSED elements to C/C++ structure speeds up and slows down code execution

I wrote the following structure for use in an Arduino software PWM library I'm making, to PWM up to 20 pins at once (on an Uno) or 70 pins at once (on a Mega).
As written, the ISR portion of the code (eRCaGuy_SoftwarePWMupdate()), processing an array of this structure, takes 133us to run. VERY strangely, however, if I uncomment the line "byte flags1;" (in the struct), though flags1 is NOT used anywhere yet, the ISR now takes 158us to run. Then, if I uncomment "byte flags2;" so that BOTH flags are now uncommented, the runtime drops back down to where it was before (133us).
Why is this happening!? And how do I fix it? (ie: I want to ensure consistently fast code, for this particular function, not code that is inexplicably fickle). Adding one byte dramatically slows down the code, yet adding two makes no change at all.
I am trying to optimize the code (and I needed to add another feature too, requiring a single byte for flags), but I don't understand why adding one unused byte slows the code down by 25us, yet adding two unused bytes doesn't change the run-time at all.
I need to understand this to ensure my optimizations consistently work.
In .h file (my original struct, using C-style typedef'ed struct):
typedef struct softPWMpin //global struct
{
//VOLATILE VARIABLES (WILL BE ACCESSED IN AND OUTSIDE OF ISRs)
//for pin write access:
volatile byte pinBitMask;
volatile byte* volatile p_PORT_out; //pointer to port output register; NB: the 1st "volatile" says the port itself (1 byte) is volatile, the 2nd "volatile" says the *pointer* itself (2 bytes, pointing to the port) is volatile.
//for PWM output:
volatile unsigned long resolution;
volatile unsigned long PWMvalue; //Note: duty cycle = PWMvalue/(resolution - 1) = PWMvalue/topValue;
//ex: if resolution is 256, topValue is 255
//if PWMvalue = 255, duty_cycle = PWMvalue/topValue = 255/255 = 1 = 100%
//if PWMvalue = 50, duty_cycle = PWMvalue/topValue = 50/255 = 0.196 = 19.6%
//byte flags1;
//byte flags2;
//NON-VOLATILE VARIABLES (WILL ONLY BE ACCESSED INSIDE AN ISR, OR OUTSIDE AN ISR, BUT NOT BOTH)
unsigned long counter; //incremented each time update() is called; goes back to zero after reaching topValue; does NOT need to be volatile, since only the update function updates this (it is read-to or written from nowhere else)
} softPWMpin_t;
In .h file (new, using C++ style struct....to see if it makes any difference, per the comments. It appears to make no difference in any way, including run-time and compiled size)
struct softPWMpin //global struct
{
//VOLATILE VARIABLES (WILL BE ACCESSED IN AND OUTSIDE OF ISRs)
//for pin write access:
volatile byte pinBitMask;
volatile byte* volatile p_PORT_out; //pointer to port output register; NB: the 1st "volatile" says the port itself (1 byte) is volatile, the 2nd "volatile" says the *pointer* itself (2 bytes, pointing to the port) is volatile.
//for PWM output:
volatile unsigned long resolution;
volatile unsigned long PWMvalue; //Note: duty cycle = PWMvalue/(resolution - 1) = PWMvalue/topValue;
//ex: if resolution is 256, topValue is 255
//if PWMvalue = 255, duty_cycle = PWMvalue/topValue = 255/255 = 1 = 100%
//if PWMvalue = 50, duty_cycle = PWMvalue/topValue = 50/255 = 0.196 = 19.6%
//byte flags1;
//byte flags2;
//NON-VOLATILE VARIABLES (WILL ONLY BE ACCESSED INSIDE AN ISR, OR OUTSIDE AN ISR, BUT NOT BOTH)
unsigned long counter; //incremented each time update() is called; goes back to zero after reaching topValue; does NOT need to be volatile, since only the update function updates this (it is read-to or written from nowhere else)
};
In .cpp file (here I am creating the array of structs, and here is the update function which is called at a fixed rate in an ISR, via timer interrupts):
//static softPWMpin_t PWMpins[MAX_NUMBER_SOFTWARE_PWM_PINS]; //C-style, old, MAX_NUMBER_SOFTWARE_PWM_PINS = 20; static to give it file scope only
static softPWMpin PWMpins[MAX_NUMBER_SOFTWARE_PWM_PINS]; //C++-style, old, MAX_NUMBER_SOFTWARE_PWM_PINS = 20; static to give it file scope only
//This function must be placed within an ISR, to be called at a fixed interval
void eRCaGuy_SoftwarePWMupdate()
{
//Forced nonatomic block (ie: interrupts *enabled*)
byte SREG_old = SREG; //[1 clock cycle]
interrupts(); //[1 clock cycle] turn interrupts ON to allow *nested interrupts* (ex: handling of time-sensitive timing, such as reading incoming PWM signals or counting Timer2 overflows)
{
//first, increment all counters of attached pins (ie: where the value != PIN_NOT_ATTACHED)
//pinMapArray
for (byte pin=0; pin<NUM_DIGITAL_PINS; pin++)
{
byte i = pinMapArray[pin]; //[2 clock cycles: 0.125us]; No need to turn off interrupts to read this volatile variable here since reading pinMapArray[pin] is an atomic operation (since it's a single byte)
if (i != PIN_NOT_ATTACHED) //if the pin IS attached, increment counter and decide what to do with pin...
{
//Read volatile variables ONE time, all at once, to optimize code (volatile variables take more time to read [I know] since their values can't be recalled from registers [I believe]).
noInterrupts(); //[1 clock cycle] turn off interrupts to read non-atomic volatile variables that could be updated simultaneously right now in another ISR, since nested interrupts are enabled here
unsigned long resolution = PWMpins[i].resolution;
unsigned long PWMvalue = PWMpins[i].PWMvalue;
volatile byte* p_PORT_out = PWMpins[i].p_PORT_out; //[0.44us raw: 5 clock cycles, 0.3125us]
interrupts(); //[1 clock cycle]
//handle edge cases FIRST (PWMvalue==0 and PWMvalue==topValue), since if an edge case exists we should NOT do the main case handling below
if (PWMvalue==0) //the PWM command is 0% duty cycle
{
fastDigitalWrite(p_PORT_out,PWMpins[i].pinBitMask,LOW); //write LOW [1.19us raw: 17 clock cycles, 1.0625us]
}
else if (PWMvalue==resolution-1) //the PWM command is 100% duty cycle
{
fastDigitalWrite(p_PORT_out,PWMpins[i].pinBitMask,HIGH); //write HIGH [0.88us raw; 12 clock cycles, 0.75us]
}
//THEN handle main cases (PWMvalue is > 0 and < topValue)
else //(0% < PWM command < 100%)
{
PWMpins[i].counter++; //not volatile
if (PWMpins[i].counter >= resolution)
{
PWMpins[i].counter = 0; //reset
fastDigitalWrite(p_PORT_out,PWMpins[i].pinBitMask,HIGH);
}
else if (PWMpins[i].counter>=PWMvalue)
{
fastDigitalWrite(p_PORT_out,PWMpins[i].pinBitMask,LOW); //write LOW [1.18us raw: 17 clock cycles, 1.0625us]
}
}
}
}
}
SREG = SREG_old; //restore interrupt enable status
}
Update (5/4/2015, 8:58pm):
I've tried changing the alignment via the aligned attribute. My compiler is gcc.
Here's how I modified the struct in the .h file to add the attribute (it's on the very last line). Note that I also changed the order of the struct members to be largest to smallest:
struct softPWMpin //C++ style
{
volatile unsigned long resolution;
volatile unsigned long PWMvalue; //Note: duty cycle = PWMvalue/(resolution - 1) = PWMvalue/topValue;
//ex: if resolution is 256, topValue is 255
//if PWMvalue = 255, duty_cycle = PWMvalue/topValue = 255/255 = 1 = 100%
//if PWMvalue = 50, duty_cycle = PWMvalue/topValue = 50/255 = 0.196 = 19.6%
unsigned long counter; //incremented each time update() is called; goes back to zero after reaching topValue; does NOT need to be volatile, since only the update function updates this (it is read-to or written from nowhere else)
volatile byte* volatile p_PORT_out; //pointer to port output register; NB: the 1st "volatile" says the port itself (1 byte) is volatile, the 2nd "volatile" says the *pointer* itself (2 bytes, pointing to the port) is volatile.
volatile byte pinBitMask;
// byte flags1;
// byte flags2;
} __attribute__ ((aligned));
Source: https://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Type-Attributes.html
Here's the results of what I've tried so far:
__attribute__ ((aligned));
__attribute__ ((aligned(1)));
__attribute__ ((aligned(2)));
__attribute__ ((aligned(4)));
__attribute__ ((aligned(8)));
None of them seems to fix the problem I see when I add one flag byte. With the flag bytes commented out, the aligned(2) through aligned(8) variants make the run-time longer than 133us, and aligned(1) makes no difference (run-time stays 133us), implying that this is what already happens when the attribute is not added at all. Additionally, even when I use the align options of 2, 4, 8, sizeof(PWMvalue) still returns the exact number of bytes in the struct, with no additional padding.
...still don't know what's going on...
Update, 11:02pm:
(see comments below)
Optimization levels definitely have an effect. When I changed the compiler optimization level from -Os to -O2, for instance, the base case remained at 133us (as before), uncommenting flags1 gave me 120us (vs 158us), and uncommenting flags1 and flags2 simultaneously gave me 132us (vs 133us). This still doesn't answer my question, but I've at least learned that optimization levels exist, and how to change them.
Summary of above paragraph:
Processing time (of eRCaGuy_SoftwarePWMupdate() function)
Optimization   No flags   w/flags1   w/flags1+flags2
Os             133us      158us      133us
O2             132us      120us      132us
Memory Use (bytes: flash/global vars SRAM/sizeof(softPWMpin)/sizeof(PWMpins))
Optimization   No flags          w/flags1          w/flags1+flags2
Os             4020/591/15/300   3950/611/16/320   4020/631/17/340
O2             4154/591/15/300   4064/611/16/320   4154/631/17/340
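The sizeof columns in the memory table can be reproduced off-target. This is only a model under assumptions: uint16_t stands in for the 2-byte AVR data pointer, the packed attribute mimics the AVR's byte alignment on a host compiler (GCC/Clang), and the struct names are made up for illustration.

```cpp
#include <cstdint>

// Host-side models of softPWMpin with 0, 1 and 2 flag bytes.
struct __attribute__((packed)) PinNoFlags {
    uint8_t  pinBitMask;
    uint16_t p_PORT_out;   // stand-in for a 2-byte AVR pointer (assumption)
    uint32_t resolution;
    uint32_t PWMvalue;
    uint32_t counter;
};

struct __attribute__((packed)) PinOneFlag {
    uint8_t  pinBitMask;
    uint16_t p_PORT_out;
    uint32_t resolution;
    uint32_t PWMvalue;
    uint8_t  flags1;
    uint32_t counter;
};

struct __attribute__((packed)) PinTwoFlags {
    uint8_t  pinBitMask;
    uint16_t p_PORT_out;
    uint32_t resolution;
    uint32_t PWMvalue;
    uint8_t  flags1;
    uint8_t  flags2;
    uint32_t counter;
};

// Element sizes match the table: 15, 16 and 17 bytes.
static_assert(sizeof(PinNoFlags) == 15, "no flags: 15 bytes");
static_assert(sizeof(PinOneFlag) == 16, "one flag: 16 bytes");
static_assert(sizeof(PinTwoFlags) == 17, "two flags: 17 bytes");
```

With MAX_NUMBER_SOFTWARE_PWM_PINS = 20 these give arrays of 300, 320 and 340 bytes, which is also the per-element stride the compiler has to multiply the index by when it computes PWMpins[i].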
Update (5/5/2015, 4:05pm):
Just updated the tables above with more detailed information.
Added resources below.
Resources:
Sources for gcc compiler optimization levels:
- https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
- https://gcc.gnu.org/onlinedocs/gnat_ugn/Optimization-Levels.html
- http://www.rapidtables.com/code/linux/gcc/gcc-o.htm
How to change compiler settings in Arduino IDE:
- http://www.instructables.com/id/Arduino-IDE-16x-compiler-optimisations-faster-code/
Info on structure packing:
- http://www.catb.org/esr/structure-packing/
Data Alignment:
- http://www.songho.ca/misc/alignment/dataalign.html
Writing efficient C code for an 8-bit Atmel AVR Microcontroller
- AVR035 Efficient C Coding for AVR - doc1497 - http://www.atmel.com/images/doc1497.pdf
- AVR4027 Tips and Tricks to Optimize Your C Code for 8-bit AVR Microcontrollers - doc8453 - http://www.atmel.com/images/doc8453.pdf
Additional info that may prove useful to help you help me with my problem:
FOR NO FLAGS (flags1 and flags2 commented out), and Os Optimization
Build Preferences (from buildprefs.txt file where Arduino spits out compiled code):
For me: "C:\Users\Gabriel\AppData\Local\Temp\build8427371380606368699.tmp"
build.arch = AVR
build.board = AVR_UNO
build.core = arduino
build.core.path = C:\Program Files (x86)\Arduino\hardware\arduino\avr\cores\arduino
build.extra_flags =
build.f_cpu = 16000000L
build.mcu = atmega328p
build.path = C:\Users\Gabriel\AppData\Local\Temp\build8427371380606368699.tmp
build.project_name = software_PWM_fade13_speed_test2.cpp
build.system.path = C:\Program Files (x86)\Arduino\hardware\arduino\avr\system
build.usb_flags = -DUSB_VID={build.vid} -DUSB_PID={build.pid} '-DUSB_MANUFACTURER={build.usb_manufacturer}' '-DUSB_PRODUCT={build.usb_product}'
build.usb_manufacturer =
build.variant = standard
build.variant.path = C:\Program Files (x86)\Arduino\hardware\arduino\avr\variants\standard
build.verbose = true
build.warn_data_percentage = 75
compiler.S.extra_flags =
compiler.S.flags = -c -g -x assembler-with-cpp
compiler.ar.cmd = avr-ar
compiler.ar.extra_flags =
compiler.ar.flags = rcs
compiler.c.cmd = avr-gcc
compiler.c.elf.cmd = avr-gcc
compiler.c.elf.extra_flags =
compiler.c.elf.flags = -w -Os -Wl,--gc-sections
compiler.c.extra_flags =
compiler.c.flags = -c -g -Os -w -ffunction-sections -fdata-sections -MMD
compiler.cpp.cmd = avr-g++
compiler.cpp.extra_flags =
compiler.cpp.flags = -c -g -Os -w -fno-exceptions -ffunction-sections -fdata-sections -fno-threadsafe-statics -MMD
compiler.elf2hex.cmd = avr-objcopy
compiler.elf2hex.extra_flags =
compiler.elf2hex.flags = -O ihex -R .eeprom
compiler.ldflags =
compiler.objcopy.cmd = avr-objcopy
compiler.objcopy.eep.extra_flags =
compiler.objcopy.eep.flags = -O ihex -j .eeprom --set-section-flags=.eeprom=alloc,load --no-change-warnings --change-section-lma .eeprom=0
compiler.path = {runtime.ide.path}/hardware/tools/avr/bin/
compiler.size.cmd = avr-size
Some of the Assembly:
(Os, no flags):
00000328 <_Z25eRCaGuy_SoftwarePWMupdatev>:
328: 8f 92 push r8
32a: 9f 92 push r9
32c: af 92 push r10
32e: bf 92 push r11
330: cf 92 push r12
332: df 92 push r13
334: ef 92 push r14
336: ff 92 push r15
338: 0f 93 push r16
33a: 1f 93 push r17
33c: cf 93 push r28
33e: df 93 push r29
340: 0f b7 in r16, 0x3f ; 63
342: 78 94 sei
344: 20 e0 ldi r18, 0x00 ; 0
346: 30 e0 ldi r19, 0x00 ; 0
348: 1f e0 ldi r17, 0x0F ; 15
34a: f9 01 movw r30, r18
34c: e8 5a subi r30, 0xA8 ; 168
34e: fe 4f sbci r31, 0xFE ; 254
350: 80 81 ld r24, Z
352: 8f 3f cpi r24, 0xFF ; 255
354: 09 f4 brne .+2 ; 0x358 <_Z25eRCaGuy_SoftwarePWMupdatev+0x30>
356: 67 c0 rjmp .+206 ; 0x426 <_Z25eRCaGuy_SoftwarePWMupdatev+0xfe>
358: f8 94 cli
35a: 90 e0 ldi r25, 0x00 ; 0
35c: 18 9f mul r17, r24
35e: f0 01 movw r30, r0
360: 19 9f mul r17, r25
362: f0 0d add r31, r0
364: 11 24 eor r1, r1
366: e4 59 subi r30, 0x94 ; 148
368: fe 4f sbci r31, 0xFE ; 254
36a: c0 80 ld r12, Z
36c: d1 80 ldd r13, Z+1 ; 0x01
36e: e2 80 ldd r14, Z+2 ; 0x02
370: f3 80 ldd r15, Z+3 ; 0x03
372: 44 81 ldd r20, Z+4 ; 0x04
374: 55 81 ldd r21, Z+5 ; 0x05
376: 66 81 ldd r22, Z+6 ; 0x06
378: 77 81 ldd r23, Z+7 ; 0x07
37a: 04 84 ldd r0, Z+12 ; 0x0c
37c: f5 85 ldd r31, Z+13 ; 0x0d
37e: e0 2d mov r30, r0
380: 78 94 sei
382: 41 15 cp r20, r1
384: 51 05 cpc r21, r1
386: 61 05 cpc r22, r1
388: 71 05 cpc r23, r1
38a: 51 f4 brne .+20 ; 0x3a0 <_Z25eRCaGuy_SoftwarePWMupdatev+0x78>
38c: 18 9f mul r17, r24
38e: d0 01 movw r26, r0
390: 19 9f mul r17, r25
392: b0 0d add r27, r0
394: 11 24 eor r1, r1
396: a4 59 subi r26, 0x94 ; 148
398: be 4f sbci r27, 0xFE ; 254
39a: 1e 96 adiw r26, 0x0e ; 14
39c: 4c 91 ld r20, X
39e: 3b c0 rjmp .+118 ; 0x416 <_Z25eRCaGuy_SoftwarePWMupdatev+0xee>
3a0: 46 01 movw r8, r12
3a2: 57 01 movw r10, r14
3a4: a1 e0 ldi r26, 0x01 ; 1
3a6: 8a 1a sub r8, r26
3a8: 91 08 sbc r9, r1
3aa: a1 08 sbc r10, r1
3ac: b1 08 sbc r11, r1
3ae: 48 15 cp r20, r8
3b0: 59 05 cpc r21, r9
3b2: 6a 05 cpc r22, r10
3b4: 7b 05 cpc r23, r11
3b6: 51 f4 brne .+20 ; 0x3cc <_Z25eRCaGuy_SoftwarePWMupdatev+0xa4>
3b8: 18 9f mul r17, r24
3ba: d0 01 movw r26, r0
3bc: 19 9f mul r17, r25
3be: b0 0d add r27, r0
3c0: 11 24 eor r1, r1
3c2: a4 59 subi r26, 0x94 ; 148
3c4: be 4f sbci r27, 0xFE ; 254
3c6: 1e 96 adiw r26, 0x0e ; 14
3c8: 9c 91 ld r25, X
3ca: 1c c0 rjmp .+56 ; 0x404 <_Z25eRCaGuy_SoftwarePWMupdatev+0xdc>
3cc: 18 9f mul r17, r24
3ce: e0 01 movw r28, r0
3d0: 19 9f mul r17, r25
3d2: d0 0d add r29, r0
3d4: 11 24 eor r1, r1
3d6: c4 59 subi r28, 0x94 ; 148
3d8: de 4f sbci r29, 0xFE ; 254
3da: 88 85 ldd r24, Y+8 ; 0x08
3dc: 99 85 ldd r25, Y+9 ; 0x09
3de: aa 85 ldd r26, Y+10 ; 0x0a
3e0: bb 85 ldd r27, Y+11 ; 0x0b
3e2: 01 96 adiw r24, 0x01 ; 1
3e4: a1 1d adc r26, r1
3e6: b1 1d adc r27, r1
3e8: 88 87 std Y+8, r24 ; 0x08
3ea: 99 87 std Y+9, r25 ; 0x09
3ec: aa 87 std Y+10, r26 ; 0x0a
3ee: bb 87 std Y+11, r27 ; 0x0b
3f0: 8c 15 cp r24, r12
3f2: 9d 05 cpc r25, r13
3f4: ae 05 cpc r26, r14
3f6: bf 05 cpc r27, r15
3f8: 40 f0 brcs .+16 ; 0x40a <_Z25eRCaGuy_SoftwarePWMupdatev+0xe2>
3fa: 18 86 std Y+8, r1 ; 0x08
3fc: 19 86 std Y+9, r1 ; 0x09
3fe: 1a 86 std Y+10, r1 ; 0x0a
400: 1b 86 std Y+11, r1 ; 0x0b
402: 9e 85 ldd r25, Y+14 ; 0x0e
404: 80 81 ld r24, Z
406: 89 2b or r24, r25
408: 0d c0 rjmp .+26 ; 0x424 <_Z25eRCaGuy_SoftwarePWMupdatev+0xfc>
40a: 84 17 cp r24, r20
40c: 95 07 cpc r25, r21
40e: a6 07 cpc r26, r22
410: b7 07 cpc r27, r23
412: 48 f0 brcs .+18 ; 0x426 <_Z25eRCaGuy_SoftwarePWMupdatev+0xfe>
414: 4e 85 ldd r20, Y+14 ; 0x0e
416: 80 81 ld r24, Z
418: 90 e0 ldi r25, 0x00 ; 0
41a: 50 e0 ldi r21, 0x00 ; 0
41c: 40 95 com r20
41e: 50 95 com r21
420: 84 23 and r24, r20
422: 95 23 and r25, r21
424: 80 83 st Z, r24
426: 2f 5f subi r18, 0xFF ; 255
428: 3f 4f sbci r19, 0xFF ; 255
42a: 24 31 cpi r18, 0x14 ; 20
42c: 31 05 cpc r19, r1
42e: 09 f0 breq .+2 ; 0x432 <_Z25eRCaGuy_SoftwarePWMupdatev+0x10a>
430: 8c cf rjmp .-232 ; 0x34a <_Z25eRCaGuy_SoftwarePWMupdatev+0x22>
432: 0f bf out 0x3f, r16 ; 63
434: df 91 pop r29
436: cf 91 pop r28
438: 1f 91 pop r17
43a: 0f 91 pop r16
43c: ff 90 pop r15
43e: ef 90 pop r14
440: df 90 pop r13
442: cf 90 pop r12
444: bf 90 pop r11
446: af 90 pop r10
448: 9f 90 pop r9
44a: 8f 90 pop r8
44c: 08 95 ret
This is almost certainly an alignment issue. Judging by the size of your struct, your compiler seems to be automatically packing it.
The LDR instruction loads a 4-byte value into a register, and operates on 4-byte boundaries. If it needs to load a memory address that isn't on a 4-byte boundary, it actually performs two loads and combines them to obtain the value at that address.
For example, if you want to load the 4-byte value at 0x02, the processor will do two loads, as 0x02 does not fall on a 4-byte boundary.
Let's say we have the following memory at address 0x00 and we want to load the 4-byte value at 0x02 into the register r0:
Address |0x00|0x01|0x02|0x03|0x04|0x05|0x06|0x07|0x08|
Value | 12 | 34 | 56 | 78 | 90 | AB | CD | EF | 12 |
------------------------------------------------------
r0: 00 00 00 00
It will first load the 4 bytes at 0x00, because that's the 4-byte segment containing 0x02, and store the 2 bytes at 0x02 and 0x03 in the register:
Address |0x00|0x01|0x02|0x03|0x04|0x05|0x06|0x07|
Value | 12 | 34 | 56 | 78 | 90 | AB | CD | EF |
Load 1 | ** ** |
------------------------------------------------------
r0: 56 78 00 00
It will then load the 4 bytes at 0x04, which is the next 4-byte segment, and store the 2 bytes at 0x04 and 0x05 in the register.
Address |0x00|0x01|0x02|0x03|0x04|0x05|0x06|0x07|
Value | 12 | 34 | 56 | 78 | 90 | AB | CD | EF |
Load 2 | ** ** |
------------------------------------------------------
r0: 56 78 90 AB
As you can see, each time you want to access the value at 0x02, the processor actually has to split your instruction into two operations. However, if you instead wanted to access the value at 0x04, the processor can do it in a single operation:
Address |0x00|0x01|0x02|0x03|0x04|0x05|0x06|0x07|
Value | 12 | 34 | 56 | 78 | 90 | AB | CD | EF |
Load 1 | ** ** ** ** |
------------------------------------------------------
r0: 90 AB CD EF
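The two-step load in the diagrams above can be modelled in a few lines of C++. This is only a sketch of the behavior as described in this answer, not of any particular core; `LoadResult` and `emulate_load` are hypothetical names introduced for illustration. The bytes are returned in memory order to match the `r0:` rows above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Result of emulating one 4-byte load: the bytes in memory order,
// plus how many aligned 4-byte fetches were needed to gather them.
struct LoadResult {
    std::array<std::uint8_t, 4> bytes;
    int aligned_fetches;
};

// Hypothetical helper: reads the 4 bytes starting at `addr`, counting one
// fetch for every aligned 4-byte word the access overlaps.
// `mem` must cover addr .. addr + 3.
LoadResult emulate_load(const std::uint8_t* mem, std::size_t addr) {
    assert(mem != nullptr);
    LoadResult r{};  // zero-initialised
    std::size_t last_word = static_cast<std::size_t>(-1);  // "no word fetched yet"
    for (std::size_t i = 0; i < 4; ++i) {
        const std::size_t word = (addr + i) / 4 * 4;  // aligned word holding this byte
        if (word != last_word) {                      // entering a new word costs a fetch
            ++r.aligned_fetches;
            last_word = word;
        }
        r.bytes[i] = mem[addr + i];
    }
    return r;
}
```

With the memory contents from the diagrams, `emulate_load(mem, 0x02)` reports two fetches and `emulate_load(mem, 0x04)` reports one.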
In your example, with both flags1 and flags2 commented out, your struct's size is 15. This means that every second struct in your array is going to be at an odd address, so none of its pointer or long members are going to be aligned correctly.
By introducing one of the flags variables, your struct's size increases to 16, which is a multiple of 4. This ensures that all of your structs begin on a 4-byte boundary, so you likely won't run into alignment issues.
There's likely a compiler flag that can help you with this, but in general, it's good to be aware of the layout of your structures. Alignment is a tricky issue to deal with, and only compilers that conform to the current standards have well-defined behavior.
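To see the effect of the struct size on element placement, here is a minimal sketch. The struct below is a hypothetical stand-in (the member names are invented, and only the 15-byte total matches your case), and it assumes a compiler that supports `#pragma pack`, as GCC, Clang and MSVC do. The 16-byte size of the unpacked variant assumes a target where `uint32_t` has 4-byte alignment; on an 8-bit AVR it would stay at 15.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in: three 4-byte members plus three single bytes,
// packed so that no padding is added (15 bytes in total).
#pragma pack(push, 1)
struct PackedEntry {
    std::uint32_t period;
    std::uint32_t high_time;
    std::uint32_t counter;
    std::uint8_t  pin_mask;
    std::uint8_t  flags1;
    std::uint8_t  state;
};
#pragma pack(pop)

// Same members with the compiler's natural alignment and tail padding.
struct NaturalEntry {
    std::uint32_t period;
    std::uint32_t high_time;
    std::uint32_t counter;
    std::uint8_t  pin_mask;
    std::uint8_t  flags1;
    std::uint8_t  state;
};

static_assert(sizeof(PackedEntry) == 15, "packed layout should have no padding");

// Byte offset of element `index` from the start of an array of T.
template <typename T>
std::size_t element_offset(std::size_t index) {
    return index * sizeof(T);
}

// True when the first (4-byte) member of element `index` lands on a
// 4-byte boundary, assuming the array itself starts on one.
template <typename T>
bool first_member_aligned(std::size_t index) {
    return element_offset<T>(index) % 4 == 0;
}
```

For the packed variant the element offsets are 0, 15, 30, 45, 60, so only every fourth element starts on a 4-byte boundary; with one extra byte the size becomes 16 and every element does.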

SSE load/store memory transactions

There are two ways to move data between memory and registers when using SSE intrinsics:
Intermediate pointers:
#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_set1_ps, _mm_add_ps

void f_sse(float *input, float *output, unsigned int n)
{
    __m128 *input_sse = reinterpret_cast<__m128*>(input);   // Input intermediate pointer
    __m128 *output_sse = reinterpret_cast<__m128*>(output); // Output intermediate pointer
    __m128 s = _mm_set1_ps(0.1f);
    auto loop_size = n/4; // number of 4-float vectors
    for(unsigned int i=0; i<loop_size; ++i)
        output_sse[i] = _mm_add_ps(input_sse[i], s);
}
Explicit fetch/store:
void f_sse(float *input, float *output, unsigned int n)
{
    __m128 input_sse, output_sse, result;
    __m128 s = _mm_set1_ps(0.1f);
    for(unsigned int i=0; i<n; i+=4) // assumes n is a multiple of 4
    {
        input_sse = _mm_load_ps(input+i);   // aligned load
        result = _mm_add_ps(input_sse, s);
        _mm_store_ps(output+i, result);     // aligned store
    }
}
What's the difference between the two approaches, and which is better in terms of performance? The input and output pointers are aligned, allocated with _mm_malloc().
Compiled with g++ at optimization level -O3, the assembly code of the inner loops (from objdump -d) is
20: 0f 28 04 07 movaps (%rdi,%rax,1),%xmm0
24: 0f 58 c1 addps %xmm1,%xmm0
27: 0f 29 04 06 movaps %xmm0,(%rsi,%rax,1)
2b: 48 83 c0 10 add $0x10,%rax
2f: 48 39 d0 cmp %rdx,%rax
32: 75 ec jne 20 <_Z5f_ssePfS_j+0x20>
and
10: 0f 28 04 07 movaps (%rdi,%rax,1),%xmm0
14: 83 c1 04 add $0x4,%ecx
17: 0f 58 c1 addps %xmm1,%xmm0
1a: 0f 29 04 06 movaps %xmm0,(%rsi,%rax,1)
1e: 48 83 c0 10 add $0x10,%rax
22: 39 ca cmp %ecx,%edx
24: 77 ea ja 10 <_Z5f_ssePfS_j+0x10>
They are pretty similar. In the first, g++ manages to use only one counter (only one add instruction), so I guess it's better.
I compiled both of your samples with g++ -O2, and the main difference I found was that the value in edx (n) is used differently, which leads to slightly different code.
First function:
0000000000000000 <_Z6f_sse2PfS_j>:
0: c1 ea 02 shr $0x2,%edx # loop_size = n / 4.
3: 85 d2 test %edx,%edx
5: 74 2d je 34 <_Z6f_sse2PfS_j+0x34>
7: 83 ea 01 sub $0x1,%edx
a: 0f 28 0d 00 00 00 00 movaps 0x0(%rip),%xmm1 # 11 <_Z6f_sse2PfS_j+0x11>
11: 48 83 c2 01 add $0x1,%rdx
15: 31 c0 xor %eax,%eax
17: 48 c1 e2 04 shl $0x4,%rdx # Adjust for loop size vs. index.
1b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
20: 0f 28 04 07 movaps (%rdi,%rax,1),%xmm0
24: 0f 58 c1 addps %xmm1,%xmm0
27: 0f 29 04 06 movaps %xmm0,(%rsi,%rax,1)
2b: 48 83 c0 10 add $0x10,%rax
2f: 48 39 d0 cmp %rdx,%rax
32: 75 ec jne 20 <_Z6f_sse2PfS_j+0x20>
34: f3 c3 repz retq
Second function:
0000000000000000 <_Z5f_ssePfS_j>:
0: 85 d2 test %edx,%edx
2: 74 22 je 26 <_Z5f_ssePfS_j+0x26>
4: 0f 28 0d 00 00 00 00 movaps 0x0(%rip),%xmm1 # b <_Z5f_ssePfS_j+0xb>
b: 31 c0 xor %eax,%eax
d: 31 c9 xor %ecx,%ecx
f: 90 nop
10: 0f 28 04 07 movaps (%rdi,%rax,1),%xmm0
14: 83 c1 04 add $0x4,%ecx
17: 0f 58 c1 addps %xmm1,%xmm0
1a: 0f 29 04 06 movaps %xmm0,(%rsi,%rax,1)
1e: 48 83 c0 10 add $0x10,%rax
22: 39 ca cmp %ecx,%edx
24: 77 ea ja 10 <_Z5f_ssePfS_j+0x10>
26: f3 c3 repz retq
I also looked at the code generated, and came up with this:
void f_sse2(float *input, float *output, unsigned int n)
{
__m128 *end = reinterpret_cast<__m128*>(&input[n]);
__m128 *input_sse = reinterpret_cast<__m128*>(input);//Input intermediate pointer
__m128 *output_sse = reinterpret_cast<__m128*>(output);//Output intermediate pointer
__m128 s = _mm_set1_ps(0.1f);
while(input_sse < end)
*output_sse++ = _mm_add_ps(*input_sse++, s);
}
which generates this code:
0000000000000000 <_Z6f_sse2PfS_j>:
0: 89 d2 mov %edx,%edx
2: 48 8d 04 97 lea (%rdi,%rdx,4),%rax
6: 48 39 c7 cmp %rax,%rdi
9: 73 23 jae 2e <_Z6f_sse2PfS_j+0x2e>
b: 0f 28 0d 00 00 00 00 movaps 0x0(%rip),%xmm1 # 12 <_Z6f_sse2PfS_j+0x12>
12: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
18: 0f 28 07 movaps (%rdi),%xmm0
1b: 48 83 c7 10 add $0x10,%rdi
1f: 0f 58 c1 addps %xmm1,%xmm0
22: 0f 29 06 movaps %xmm0,(%rsi)
25: 48 83 c6 10 add $0x10,%rsi
29: 48 39 f8 cmp %rdi,%rax
2c: 77 ea ja 18 <_Z6f_sse2PfS_j+0x18>
2e: f3 c3 repz retq
I think this may be a tiny bit more efficient, but it's probably not worth changing the code for. Still, it gave me something to do for 15 minutes.
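If you want to convince yourself that the pointer-based and the explicit load/store forms compute the same thing before comparing their speed, a small check like the following may help. It is a sketch only: the function names are hypothetical, it assumes an x86 target with SSE, and it relies on `_mm_malloc`/`_mm_free` being available through `<xmmintrin.h>`, as they are with GCC and Clang. Unlike the original second version, the loop bound here also stops cleanly when `n` is not a multiple of 4.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics, plus _mm_malloc/_mm_free

// Variant 1: index through __m128 pointers; the compiler emits the loads/stores.
void add_via_pointers(const float* input, float* output, unsigned int n) {
    const __m128* in_vec = reinterpret_cast<const __m128*>(input);
    __m128* out_vec = reinterpret_cast<__m128*>(output);
    const __m128 s = _mm_set1_ps(0.1f);
    for (unsigned int i = 0; i < n / 4; ++i)
        out_vec[i] = _mm_add_ps(in_vec[i], s);
}

// Variant 2: explicit aligned load/store intrinsics.
void add_via_load_store(const float* input, float* output, unsigned int n) {
    // Aligned loads and stores require 16-byte-aligned addresses.
    assert(reinterpret_cast<std::uintptr_t>(input) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(output) % 16 == 0);
    const __m128 s = _mm_set1_ps(0.1f);
    for (unsigned int i = 0; i + 4 <= n; i += 4)
        _mm_store_ps(output + i, _mm_add_ps(_mm_load_ps(input + i), s));
}

// Hypothetical check: runs both variants on 16-byte-aligned buffers of n
// floats (n > 0) and reports whether every output element matches exactly.
bool variants_agree(unsigned int n) {
    float* in = static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
    float* out_a = static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
    float* out_b = static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
    bool same = (in && out_a && out_b);
    if (same) {
        for (unsigned int i = 0; i < n; ++i) {
            in[i] = static_cast<float>(i);
            out_a[i] = out_b[i] = 0.0f;
        }
        add_via_pointers(in, out_a, n);
        add_via_load_store(in, out_b, n);
        for (unsigned int i = 0; i < n; ++i)
            if (out_a[i] != out_b[i]) same = false;
    }
    _mm_free(in);
    _mm_free(out_a);
    _mm_free(out_b);
    return same;
}
```

Since both variants compile down to the same movaps/addps/movaps sequence in the listings above, identical results are what one would expect; the check mainly guards against indexing mistakes when rewriting the loop.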

"call" instruction that seemingly jumps into itself

I have some C++ code
#include <cstdio>
#include <boost/bind.hpp>
#include <boost/function.hpp>
class A {
public:
void do_it() { std::printf("aaa"); }
};
void
call_it(const boost::function<void()> &f)
{
f();
}
void
func()
{
A *a = new A;
call_it(boost::bind(&A::do_it, a));
}
which gcc 4 compiles into the following assembly (from objdump):
00000030 <func()>:
30: 55 push %ebp
31: 89 e5 mov %esp,%ebp
33: 56 push %esi
34: 31 f6 xor %esi,%esi
36: 53 push %ebx
37: bb 00 00 00 00 mov $0x0,%ebx
3c: 83 ec 40 sub $0x40,%esp
3f: c7 04 24 01 00 00 00 movl $0x1,(%esp)
46: e8 fc ff ff ff call 47 <func()+0x17>
4b: 8d 55 ec lea 0xffffffec(%ebp),%edx
4e: 89 14 24 mov %edx,(%esp)
51: 89 5c 24 04 mov %ebx,0x4(%esp)
55: 89 74 24 08 mov %esi,0x8(%esp)
59: 89 44 24 0c mov %eax,0xc(%esp)
; the rest of the function is omitted
I can't understand the operand of the call instruction here: why does it call into itself, but one byte off?
The call is probably to an external function, and the address you see (FFFFFFFC) is just a placeholder for the real address, which the linker and/or loader will take care of later.
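The arithmetic behind the odd-looking target can be worked through as follows. An `e8` call takes a signed 32-bit displacement that is added to the address of the next instruction, so a placeholder of FFFFFFFC (that is, -4) at address 0x46 yields 0x46 + 5 - 4 = 0x47, which is what objdump prints. The helper names below are hypothetical and exist only to illustrate the calculation; the assumption, as above, is that this is an unlinked object file (running `objdump -dr` should then show the relocation attached to the call).

```cpp
#include <cassert>
#include <cstdint>

// Decode the four little-endian displacement bytes that follow the E8 opcode.
std::uint32_t read_le32(const std::uint8_t* p) {
    assert(p != nullptr);
    return static_cast<std::uint32_t>(p[0])
         | static_cast<std::uint32_t>(p[1]) << 8
         | static_cast<std::uint32_t>(p[2]) << 16
         | static_cast<std::uint32_t>(p[3]) << 24;
}

// Target of an x86 near call (opcode E8): the displacement is relative to the
// address of the *next* instruction, i.e. the call's own address plus 5.
std::uint32_t call_rel32_target(std::uint32_t insn_addr, std::uint32_t rel32) {
    const std::uint32_t next_insn = insn_addr + 5;  // 1 opcode byte + 4 displacement bytes
    return next_insn + rel32;  // unsigned wraparound handles negative displacements
}
```

For the bytes `fc ff ff ff` at address 0x46, `call_rel32_target(0x46, read_le32(bytes))` gives 0x47, one byte into the call instruction itself, which matches the `call 47 <func()+0x17>` in the listing.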