#include <iostream>
using namespace std;

void func(int depth) {
    if (depth == 0) return;
    int number(12345);
    cout << number;   // it does something with number
    func(--depth);
}

void func2(int depth) {
    if (depth == 0) return;
    {
        int number(12345);
        cout << number;   // it does something with number
    }
    func2(--depth);
}

int main() {
    func(10);   // will this function cost more memory?
    func2(10);  // will this function cost less memory?
}
Hi. I have two functions here. Will func2 cost less memory because its number(12345) is enclosed in braces ("{}"), so that by the time func2 makes the next recursive call, number(12345) has already gone out of scope and disappeared?
I believe func will cost more, because its number(12345) is still in scope when it makes the next recursive call.
Assuming we have AMD/Intel x86_64 architecture and our compiler is GCC.
Let's take the assembly output (-O2 -S) and analyze it:
func:
.LFB1560:
pushq %rsi
.seh_pushreg %rsi
pushq %rbx
.seh_pushreg %rbx
subq $40, %rsp
.seh_stackalloc 40
.seh_endprologue
movl %ecx, %ebx
testl %ecx, %ecx
je .L3
movq .refptr._ZSt4cout(%rip), %rsi
.p2align 4,,10
.L5:
movl $12345, %edx
movq %rsi, %rcx
call _ZNSolsEi
subl $1, %ebx
jne .L5
.L3:
addq $40, %rsp
popq %rbx
popq %rsi
ret
.seh_endproc
func2:
.LFB2046:
pushq %rsi
.seh_pushreg %rsi
pushq %rbx
.seh_pushreg %rbx
subq $40, %rsp
.seh_stackalloc 40
.seh_endprologue
movl %ecx, %ebx
testl %ecx, %ecx
je .L10
movq .refptr._ZSt4cout(%rip), %rsi
.p2align 4,,10
.L12:
movl $12345, %edx
movq %rsi, %rcx
call _ZNSolsEi
subl $1, %ebx
jne .L12
.L10:
addq $40, %rsp
popq %rbx
popq %rsi
ret
.seh_endproc
So as you can see, the two functions are completely identical.
Well, GCC 6.2.1 with -O0 produces the same assembly for both of your functions.
What closing braces do, within a single function, is determine when destructors are called. For instance consider the following:
#include <iostream>

struct Number {
    int n;
    ~Number() { std::cout << "~" << n << ", "; }
};

void func(int depth) {
    if (depth == 0) return;
    Number number{depth};
    std::cout << number.n << ", ";
    func(--depth);
}

void func2(int depth) {
    if (depth == 0) return;
    {
        Number number{depth};
        std::cout << number.n << ", ";
    }
    func2(--depth);
}

int main(int, char**) {
    std::cout << "Calling func. \n";
    func(10);
    std::cout << "\n\n";
    std::cout << "Calling func2. \n";
    func2(10);
}
This outputs
Calling func.
10, 9, 8, 7, 6, 5, 4, 3, 2, 1, ~1, ~2, ~3, ~4, ~5, ~6, ~7, ~8, ~9, ~10,
Calling func2.
10, ~10, 9, ~9, 8, ~8, 7, ~7, 6, ~6, 5, ~5, 4, ~4, 3, ~3, 2, ~2, 1, ~1,
So if number were some std::vector or maybe std::ifstream, then yes those braces would be quite necessary.
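For instance, here is a minimal sketch (not from the question) of what the inner braces buy you when the local object owns real resources, say a std::vector: each frame releases its buffer before recursing, instead of all ten frames holding their buffers alive at once.
#include <vector>

void func2(int depth) {
    if (depth == 0) return;
    {
        std::vector<int> buffer(1000000, 12345); // ~4 MB owned by this frame
        // ... do something with buffer ...
    }   // destructor runs here: the buffer is freed before the recursive call
    func2(--depth);
}
// Without the inner braces, each of the ten recursive frames would keep its
// own ~4 MB buffer alive until the whole recursion unwinds.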
Why speculate?
Since your depth is small (i.e. 10 is small), you should run your MCVE.
Just two small changes will provide info to determine your answer:
a) cout the address of number (in addition to the value).
b) instead of a constant number, fetch something that is not constant (see below).
#include <iostream>
#include <iomanip>
#include <cstdint>
#include <ctime>

class T595_t
{
public:
    T595_t() = default;
    ~T595_t() = default;

    int exec(int, char**)
    {
        std::cout << "\n func1(10)" << std::flush;
        func1(10);
        std::cout << "\n func2(10)" << std::flush;
        func2(10);
        return 0;
    }

private: // methods
    void func1(int depth)
    {
        if (depth == 0) return;
        uint64_t number(std::time(nullptr));
        std::cout << "\n " << depth
                  << " " << number << " " << &number;
        func1(--depth);
    }

    void func2(int depth)
    {
        if (depth == 0) return;
        {
            uint64_t number(std::time(nullptr));
            std::cout << "\n " << depth
                      << " " << number << " " << &number;
        }
        func2(--depth);
    }
}; // class T595_t

int main(int argc, char* argv[])
{
    T595_t t595;
    return t595.exec(argc, argv);
}
With output (on Ubuntu 17.10, using g++ version 7.2.0.)
func1(10)
10 1523379410 0x7ffd449d7d90
9 1523379410 0x7ffd449d7d60
8 1523379410 0x7ffd449d7d30
7 1523379410 0x7ffd449d7d00
6 1523379410 0x7ffd449d7cd0
5 1523379410 0x7ffd449d7ca0
4 1523379410 0x7ffd449d7c70
3 1523379410 0x7ffd449d7c40
2 1523379410 0x7ffd449d7c10
1 1523379410 0x7ffd449d7be0
func2(10)
10 1523379410 0x7ffd449d7d90
9 1523379410 0x7ffd449d7d60
8 1523379410 0x7ffd449d7d30
7 1523379410 0x7ffd449d7d00
6 1523379410 0x7ffd449d7cd0
5 1523379410 0x7ffd449d7ca0
4 1523379410 0x7ffd449d7c70
3 1523379410 0x7ffd449d7c40
2 1523379410 0x7ffd449d7c10
1 1523379410 0x7ffd449d7be0
When unoptimized, the code gen for each funcX seems to do the same thing. And since the code completes in less than a second, all the values (of number) are the same.
I have been using the memcmp function to compare two integers in my performance-critical application. I had to use it rather than the equality operator because I have to handle other data types generically. However, I suspected memcmp's performance for primitive data types and changed it to the equality operator, and the performance of the application increased.
I just did some simple tests as follows.
Using memcmp
#include <time.h>
#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <string.h>
using namespace std;
int main(int argc, char **argv)
{
    int iValue1 = atoi(argv[1]);
    int iValue2 = atoi(argv[2]);

    struct timeval start;
    gettimeofday(&start, NULL);

    for (int i = 0; i < 2000000000; i++)
    {
        // if (iValue1 == iValue2)
        if (memcmp(&iValue1, &iValue2, sizeof(int)) == 0)
        {
            cout << "Hello" << endl;
        }
    }

    struct timeval end;
    gettimeofday(&end, NULL);
    cout << "Time taken : " << ((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec)) << " us" << endl;
    return 0;
}
The output of the program was as follows.
sujith#linux-1xs7:~> g++ -m64 -O3 Main.cpp
sujith#linux-1xs7:~> ./a.out 3424 234
Time taken : 13539618 us
sujith#linux-1xs7:~> ./a.out 3424 234
Time taken : 13534932 us
sujith#linux-1xs7:~> ./a.out 3424 234
Time taken : 13599818 us
sujith#linux-1xs7:~> ./a.out 3424 234
Time taken : 13639394 us
Using equal operator
#include <time.h>
#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <string.h>
using namespace std;
int main(int argc, char **argv)
{
    int iValue1 = atoi(argv[1]);
    int iValue2 = atoi(argv[2]);

    struct timeval start;
    gettimeofday(&start, NULL);

    for (int i = 0; i < 2000000000; i++)
    {
        if (iValue1 == iValue2)
        // if (memcmp(&iValue1, &iValue2, sizeof(int)) == 0)
        {
            cout << "Hello" << endl;
        }
    }

    struct timeval end;
    gettimeofday(&end, NULL);
    cout << "Time taken : " << ((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec)) << " us" << endl;
    return 0;
}
The output of the program was as follows.
sujith#linux-1xs7:~> g++ -m64 -O3 Main.cpp
sujith#linux-1xs7:~> ./a.out 234 23423
Time taken : 9 us
sujith#linux-1xs7:~> ./a.out 234 23423
Time taken : 13 us
sujith#linux-1xs7:~> ./a.out 234 23423
Time taken : 14 us
sujith#linux-1xs7:~> ./a.out 234 23423
Time taken : 15 us
sujith#linux-1xs7:~> ./a.out 234 23423
Time taken : 16 us
Can someone please let me know whether the equality operator works faster than memcmp for primitive data types? If so, what is happening there? Doesn't the equality operator use memcmp internally?
Microbenchmarks are hard to write.
The loop in the first case compiles to (at g++ -O3):
movl $2000000000, %ebx
jmp .L3
.L2:
subl $1, %ebx
je .L7
.L3:
leaq 12(%rsp), %rsi
leaq 8(%rsp), %rdi
movl $4, %edx
call memcmp
testl %eax, %eax
jne .L2
; code to do the printing omitted
subl $1, %ebx
jne .L3
.L7:
addq $16, %rsp
xorl %eax, %eax
popq %rbx
ret
The loop in the second case compiles to
cmpl %eax, %ebp
je .L7
.L2:
addq $8, %rsp
xorl %eax, %eax
popq %rbx
popq %rbp
ret
.L7:
movl $2000000000, %ebx
.L3:
; code to do the printing omitted
subl $1, %ebx
jne .L3
jmp .L2
Note that in the first case memcmp is called 2000000000 times. In the second case, the optimizer hoisted the comparison out of the loop, so it is done only once. Moreover, in the second case the compiler placed the two variables entirely in registers, while in the first one they need to be placed on the stack because you are taking their address.
Even just looking at the comparison itself, comparing two ints takes a single cmpl instruction. Using memcmp incurs a function call, and internally memcmp is likely to perform some extra checks.
In this particular case, clang++ -O3 compiles the memcmp to a single cmpl instruction. However, it doesn't hoist the check outside the loop if you use memcmp.
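If you want the two variants measured on a more equal footing, one option (a sketch, not the original benchmark) is to force the operands to be reloaded every iteration, e.g. through volatile, so the optimizer can neither hoist the comparison out of the loop nor keep the values purely in registers:
#include <cstdlib>
#include <cstring>
#include <iostream>

int main(int argc, char **argv)
{
    volatile int v1 = std::atoi(argv[1]);
    volatile int v2 = std::atoi(argv[2]);
    long hits = 0;

    for (long i = 0; i < 2000000000L; i++)
    {
        int a = v1, b = v2;                        // forced reload each iteration
        if (std::memcmp(&a, &b, sizeof(int)) == 0) // or: if (a == b)
            ++hits;
    }

    std::cout << hits << '\n';                     // keep the result observable
    return 0;
}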
In some cases of microbenchmarking, the static code analyzer is smart enough to elide multiple function calls with the same argument values, rendering the measurement useless. Benchmarking a function f with code like this:
long s = 0;
...
for (int i = 0; i < N; ++i) {
startTimer();
s += f(M);
stopTimer();
}
...
cout << s;
can be defeated by the optimizer. I wonder whether current or near-future optimizer technology is smart enough to defeat this version:
long s = 0;
...
for (int i = 0; i < N; ++i) {
long m = lround(pow(sqrt(i), 2))/i*M;
startTimer();
s += f(m);
stopTimer();
}
...
cout << s;
To answer your title question:
Is any C++ compiler able to optimize lround(pow(sqrt(i), 2)) replacing it with i, now or in the near future?
yes, for statically known arguments: see it Live On Godbolt
All of the code in that sample program got compiled down to a single constant value! And, best of all, that's with optimizations disabled: g++-4.8 -O0 :)
#include <cmath>
constexpr int N = 100;
constexpr double M = 1.0;
constexpr int i = 4;
static constexpr double foo1(int i) { return sqrt(i); }
static constexpr auto f1 = foo1(4);
static constexpr double foo2(int i) { return pow(sqrt(i), 2); }
static constexpr auto f2 = foo2(4);
static constexpr double foo3(int i) { return pow(sqrt(i), 2)/i*M; }
static constexpr auto f3 = foo3(4);
static constexpr long foo4(int i) { return pow(sqrt(i), 2)/i*M; }
static constexpr auto f4 = foo4(4);
#include <cstdio>

int main()
{
    printf("f1 + f2 + f3 + f4: %f\n", f1 + f2 + f2 + f3);
}
gets compiled into a single, statically known constant:
.LC1:
.string "f1 + f2 + f3 + f4: %f\n"
.text
.globl main
.type main, #function
main:
.LFB225:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movabsq $4622382067542392832, %rax
vmovd %rax, %xmm0
movl $.LC1, %edi
movl $1, %eax
call printf
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
Voila. That's because the GNU standard library has constexpr versions of the math functions (except for lround) in C++11 mode.
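If you want to verify that the folding really happens at compile time rather than just trusting the assembly, a small sketch (relying on the same libstdc++ constexpr-math extension as the program above) is to demand a constant expression explicitly:
#include <cmath>

// Mirrors foo3/f3 above: if the initializer were not usable as a constant
// expression (i.e. without libstdc++'s constexpr math extension), the
// constexpr declaration itself would fail to compile.
static constexpr double folded = pow(sqrt(4), 2);
static_assert(folded == 4.0, "pow(sqrt(4), 2) folded at compile time");

int main() { return 0; }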
It's entirely thinkable that the compiler unrolls a loop like
for (int i = 0; i < 5; ++i)
    s += foo(i);
into
s += foo(0);
s += foo(1);
s += foo(2);
s += foo(3);
s += foo(4);
Though I haven't checked that yet.
It is possible, but the optimiser must be taught the semantics of library functions, which is hard and time consuming.
Then again, IEEE 754 math is tricky.
What about declaring volatile long m = M; instead?
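A minimal sketch of that volatile suggestion (f, N and M are placeholders standing in for the original benchmark's names): because m is a volatile object, the compiler must materialize and reload it, so the repeated f(m) calls cannot be folded into one even though M is known at compile time.
long f(long x) { return x * x; }   // hypothetical function under test

long benchmark(int N, long M)
{
    long s = 0;
    for (int i = 0; i < N; ++i) {
        volatile long m = M;       // the optimizer may not assume what the read yields
        // startTimer();
        s += f(m);                 // argument comes from a volatile read each iteration
        // stopTimer();
    }
    return s;                      // the caller prints s, so the loop stays live
}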
This seems a fairly large topic. For example, if you try to cast (convert) a negative float to an unsigned int, it doesn't work as you might expect. So I am now reading about two's complement, promotion and bit patterns, and how you convert/deal with negative and positive floats and integers. For example, x stays as -1 in the example below on VS 2010.
float x = -1;
unsigned int y = (unsigned int)x;
printf("y:%u", y);
So how exactly are negative integers stored in memory in terms of bit patterns? What options are there in C++ for converting them? Can you do this via bit shifting? What is the best way to do this?
So how exactly are negative integers stored in memory in terms of bit patterns
To get some better understanding of the representation of negative integer values, use the following code to play with it:
#include <iostream>
#include <bitset>
#include <cstdint>

// Prints the raw bytes of 'data' in memory order, least significant bit of
// each byte first.
void printBitWise(std::ostream& os, uint8_t* data, size_t size) {
    for (size_t i = 0; i < size; ++i) {
        for (uint8_t j = 0; j < 8; ++j) {
            if ((data[i] >> j) & 1) {
                os << '1';
            }
            else {
                os << '0';
            }
        }
    }
}

int main() {
    int x = -1;
    std::bitset<sizeof(int) * 8> bitwise1(x);
    std::cout << bitwise1.to_string() << std::endl;

    int y = -2;
    std::bitset<sizeof(int) * 8> bitwise2(y);
    std::cout << bitwise2.to_string() << std::endl;

    float a = -1;
    printBitWise(std::cout, reinterpret_cast<uint8_t*>(&a), sizeof(float));
    std::cout << std::endl;

    double b = -1;
    printBitWise(std::cout, reinterpret_cast<uint8_t*>(&b), sizeof(double));
    std::cout << std::endl;

    float c = -2;
    printBitWise(std::cout, reinterpret_cast<uint8_t*>(&c), sizeof(float));
    std::cout << std::endl;

    double d = -2;
    printBitWise(std::cout, reinterpret_cast<uint8_t*>(&d), sizeof(double));
    std::cout << std::endl;

    return 0;
}
Output:
11111111111111111111111111111111
11111111111111111111111111111110
00000000000000000000000111111101
0000000000000000000000000000000000000000000000000000111111111101
00000000000000000000000000000011
0000000000000000000000000000000000000000000000000000000000000011
The bit format of float and double values is a different story. It is described by the IEEE floating-point format, and some specific behaviors (e.g. rounding rules or operations) may be compiler/implementation specific.
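As a minimal sketch of that (assuming a 32-bit IEEE 754 float, which is not guaranteed by the standard but is what x86 compilers use), you can memcpy the value's bit pattern into an integer and split it into its sign, exponent and mantissa fields:
#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
    float f = -2.0f;
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);            // copy the raw bit pattern
    std::cout << "sign:     " << (bits >> 31)          << '\n'  // 1 (negative)
              << "exponent: " << ((bits >> 23) & 0xFF) << '\n'  // 128 (biased: 127 + 1)
              << "mantissa: " << (bits & 0x7FFFFF)     << '\n'; // 0 (implicit leading 1)
    return 0;
}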
In your program, the variable x is of float type. The machine needs to convert it to an integer type. For Intel processors, the instruction is "cvttss2si". Please check http://en.wikipedia.org/wiki/Single-precision_floating-point_format to see how a float is represented in binary format.
For the code snippet that you gave, I tested with g++ and VS 2013. Both work as expected and print "y:-1".
#include <cstdio>

int main()
{
    float x = -1;
    unsigned int y;
    y = (unsigned int)x;
    printf("y:%d", y);
    return 0;
}
However, in this program, the compiler does the float to integer conversion for us.
movl $-1, %eax
movl %eax, -12(%rbp)
movl -12(%rbp), %esi
movb $0, %al
callq _printf
The following sample program can reveal how the machine does the float to integer conversion:
#include <cstdio>

int main()
{
    float x;
    scanf("%f", &x);
    unsigned int y;
    y = (unsigned int)x;
    printf("y:%d", y);
    return 0;
}
Here is the assembly show that cvttss2si does the float to integer conversion work (http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc68.htm).
cvttss2si -8(%rbp), %rsi
movl %esi, %ecx
movl %ecx, -12(%rbp)
movl -12(%rbp), %esi
movq -24(%rbp), %rdi ## 8-byte Reload
movl %eax, -28(%rbp) ## 4-byte Spill
movb $0, %al
callq _printf
On many platforms, the sign of a number is indicated by a reserved bit.
With two's complement integers, the Most Significant Bit (MSB) indicates the sign: when it is set the value is negative, and when it is clear the value is positive. However, simply setting that bit does not correctly convert a value from positive to negative.
In many floating-point formats, there is a bit reserved to indicate the sign of the number. You'll have to research the various floating-point standard formats, especially the ones used by your platform and compiler.
The best and most portable method to convert from negative numbers to positive is to use the abs family of functions. Remember, this is for signed data types.
To convert from positive to negative, multiply by -1 or -1.0.
Negative numbers are not defined for the unsigned types.
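A small self-contained sketch of those last points (not from the original answer), using std::abs / std::fabs and multiplication by -1:
#include <cstdlib>   // std::abs for integers
#include <cmath>     // std::fabs for floating point
#include <iostream>

int main() {
    int n = -42;
    std::cout << std::abs(n) << '\n';          // 42: negative -> positive (signed int)
    std::cout << -1 * std::abs(n) << '\n';     // -42: positive -> negative

    double d = -1.5;
    std::cout << std::fabs(d) << '\n';         // 1.5: negative -> positive (floating point)
    std::cout << -1.0 * std::fabs(d) << '\n';  // -1.5

    // Unsigned types have no negative values: converting -1 wraps around.
    unsigned int u = (unsigned int)-1;
    std::cout << u << '\n';                    // largest value of unsigned int
    return 0;
}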
I'm trying to build a clock, so I'm working with ASM and an Arduino. For most parts, plain C will be fine, but for preparing the time to be output to BCD-to-decimal converters I decided to go with ASM. I wrote the following code in 8086 C++/ASM and it runs fine on my computer:
#include <iostream>
using namespace std;

int main(int argc, char **argv)
{
    for (int num = 0; num < 16; num++) {
        int bin[4];
        bin[0] = bin[1] = bin[2] = bin[3] = 0;
        asm("movl %%ebx, %%eax;"
            "andl $8, %%eax;"
            "cmp $0, %%eax;"
            "je a;"
            "movl $1, %0;"
            "jmp b;"
            "a: movl $0, %0;"
            "b: movl %%ebx, %%eax;"
            "andl $4, %%eax;"
            "cmp $0, %%eax;"
            "je c;"
            "movl $1, %1;"
            "jmp d;"
            "c: movl $0, %1;"
            "d: movl %%ebx, %%eax;"
            "andl $2, %%eax;"
            "cmp $0, %%eax;"
            "je e;"
            "movl $1, %2;"
            "jmp f;"
            "e: movl $0, %2;"
            "f: movl %%ebx, %%eax;"
            "andl $1, %%eax;"
            "cmp $0, %%eax;"
            "je g;"
            "movl $1, %3;"
            "jmp h;"
            "g: movl $0, %3;"
            "h: nop"
            : "=r" (bin[0]), "=r" (bin[1]), "=r" (bin[2]), "=b" (bin[3])
            : "b" (num)
            : "%eax"
        );
        cout << num << ": ";
        for (int i = 0; i < 4; i++) {
            cout << bin[i];
        }
        cout << endl;
    }
    return 0;
}
However, when I modified it to run on the Arduino, things stopped working entirely:
for (uint8_t num = 0; num < 16; num++) {
    uint8_t bin[4];
    bin[0] = bin[1] = bin[2] = bin[3] = 0;
    asm("mov __tmp_reg__, %[val];"
        "and __tmp_reg__, $8;"
        "cmp __tmp_reg__, $0;"
        "je a;"
        "mov %[bit8], $1;"
        "rjmp b;"
        "a: mov $0, %[bit8];"
        "b: mov %[val], __tmp_reg__;"
        "and __tmp_reg__, $4;"
        "cmp __tmp_reg__, $0;"
        "je c;"
        "mov %[bit4], $1;"
        "rjmp d;"
        "c: mov $0, %[bit4];"
        "d: mov %[val], __tmp_reg__;"
        "and __tmp_reg__, $2;"
        "cmp __tmp_reg__, $0;"
        "je e;"
        "mov %[bit2], $1;"
        "rjmp f;"
        "e: mov $0, %[bit2];"
        "f: mov %[val], __tmp_reg__;"
        "and __tmp_reg__, $1;"
        "cmp __tmp_reg__, $0;"
        "je g;"
        "mov %[bit1], $1;"
        "rjmp h;"
        "g: mov $0, %[bit1];"
        "h: nop"
        : [bit8] "=r" (bin[0]), [bit4] "=r" (bin[1]), [bit2] "=r" (bin[2]), [bit1] "=r" (bin[3])
        : [val] "r" (num)
        : "r0"
    );
The 8086 code gives the output you'd expect:
0: 0000
1: 0001
2: 0010
3: 0011
...
13: 1101
14: 1110
15: 1111
But the code run on the Arduino gives a different output:
0: 5000
1: 0000
2: 0000
3: 0000
... (zeros continue)
13: 0000
14: 0000
15: 0000
As you can imagine, the code becomes useless if it returns... five. And I'm clueless as to how it could return 5 when nothing is anywhere close to 5 in the source. I'm at a loss as to what to do here, so I could really use some help.
I'm using the Arduino Leonardo, which has an ATmega32U4 processor. I've tried disassembling the executable generated by the Arduino software (which compiles it with AVR-GCC), but I can't seem to get anywhere in my efforts to find the code I put in.
Thanks for your time, Stack Overflow.
The code you have can EASILY be written in C++, like this:
int bin[4] = {};
bin[0] = !!(num & 8);
bin[1] = !!(num & 4);
bin[2] = !!(num & 2);
bin[3] = !!(num & 1);
or:
int bin[4];
int bit = 8;
for (int i = 0; i < 4; i++)
{
    bin[i] = !!(num & bit);
    bit >>= 1;
}
If you don't like !! (which means "take the following value and turn it into 0 if it's false or 1 if it's true"), you could replace it with:
for (int i = 0; i < 4; i++)
{
    bin[i] = (num >> (3 - i)) & 1;
}
I take it you intentionally want the highest bit in the lowest bin index, rather than the usual case of highest bit in the highest index.
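Putting that together, here is a hedged sketch of the whole loop with no inline asm at all (shown as a desktop program using iostream; on the Arduino you would print through Serial instead):
#include <iostream>

int main() {
    for (unsigned num = 0; num < 16; ++num) {
        int bin[4];
        for (int i = 0; i < 4; ++i)
            bin[i] = (num >> (3 - i)) & 1;   // highest bit first, matching the question
        std::cout << num << ": ";
        for (int i = 0; i < 4; ++i)
            std::cout << bin[i];
        std::cout << '\n';
    }
    return 0;
}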
I have some critical branching code inside a loop that's run about 2^26 times. Branch prediction is not optimal because m is random. How would I remove the branching, possibly using bitwise operators?
bool m;
unsigned int a;
const unsigned int k = ...; // k >= 7
if(a == 0)
a = (m ? (a+1) : (k));
else if(a == k)
a = (m ? 0 : (a-1));
else
a = (m ? (a+1) : (a-1));
And here is the relevant assembly generated by gcc -O3:
.cfi_startproc
movl 4(%esp), %edx
movb 8(%esp), %cl
movl (%edx), %eax
testl %eax, %eax
jne L15
cmpb $1, %cl
sbbl %eax, %eax
andl $638, %eax
incl %eax
movl %eax, (%edx)
ret
L15:
cmpl $639, %eax
je L23
testb %cl, %cl
jne L24
decl %eax
movl %eax, (%edx)
ret
L23:
cmpb $1, %cl
sbbl %eax, %eax
andl $638, %eax
movl %eax, (%edx)
ret
L24:
incl %eax
movl %eax, (%edx)
ret
.cfi_endproc
The branch-free division-free modulo could have been useful, but testing shows that in practice, it isn't.
const unsigned int k = 639;

void f(bool m, unsigned int &a)
{
    a += m * 2 - 1;
    if (a == -1u)
        a = k;
    else if (a == k + 1)
        a = 0;
}
Testcase:
unsigned a = 0;
f(false, a);
assert(a == 639);
f(false, a);
assert(a == 638);
f(true, a);
assert(a == 639);
f(true, a);
assert(a == 0);
f(true, a);
assert(a == 1);
f(false, a);
assert(a == 0);
Actually timing this, using a test program:
int main()
{
    for (int i = 0; i != 10000; i++)
    {
        unsigned int a = k / 2;
        while (a != 0) f(rand() & 1, a);
    }
}
(Note: there's no srand, so results are deterministic.)
My original answer: 5.3s
The code in the question: 4.8s
Lookup table: 4.5s (static unsigned lookup[2][k+1];)
Lookup table: 4.3s (static unsigned lookup[k+1][2];)
Eric's answer: 4.2s
This version: 4.0s
The fastest I've found is now the table implementation.
Timings I got (UPDATED for new measurement code):
HVD's most recent: 9.2s
Table version: 7.4s (with k=639)
Table creation code:
unsigned int table[2 * (k + 1)];     // a ranges over 0..k, two entries per value of a
unsigned int *table_ptr = table;

for (unsigned int i = 0; i <= k; i++) {
    unsigned int a = i;
    f(0, a);
    table[i << 1] = a;
    a = i;
    f(1, a);
    table[(i << 1) + 1] = a;         // note the parentheses: << binds weaker than +
}
Table runtime loop:
void f(bool m, unsigned int &a){
    a = table_ptr[a << 1 | m];
}
With HVD's measurement code, I saw the cost of the rand() dominating the runtime, so that the runtime for a branchless version was about the same range as these solutions. I changed the measurement code to this (UPDATED to keep random branch order, and pre-computing random values to prevent rand(), etc. from trashing the cache)
int main(){
    unsigned int a = k / 2;
    int m[100000];
    for (int i = 0; i < 100000; i++){
        m[i] = rand() & 1;
    }
    for (int i = 0; i != 10000; i++)
    {
        for (int j = 0; j != 100000; j++){
            f(m[j], a);
        }
    }
}
I don't think you can remove the branches entirely, but you can reduce the number by branching on m first.
if (m){
    if (a==k) {a = 0;} else {++a;}
}
else {
    if (a==0) {a = k;} else {--a;}
}
Adding to Antimony's rewrite:
if (a==k) {a = 0;} else {++a;}
looks like an increment with wraparound. You can write this as
a = (a + 1) % (k + 1);
(the modulus is k + 1, not k, because the valid range is 0..k inclusive), which, of course, only makes sense if divisions are actually faster than branches.
Not sure about the other one; too lazy to think about what the (~0)%k will be.
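For the decrement direction, adding k before taking the modulus does the same job. A small sketch of both cases (assuming, as above, that the valid range is 0..k inclusive):
// Hedged sketch: both wraparound updates written with modulo; the names are
// illustrative, not from the answers above.
unsigned int step(bool m, unsigned int a, unsigned int k)
{
    return m ? (a + 1) % (k + 1)   // increment: k wraps to 0
             : (a + k) % (k + 1);  // decrement: 0 wraps to k (adding k == subtracting 1 mod k+1)
}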
This has no branches. Because k is constant, the compiler might be able to optimize the modulo depending on its value. And if k is 'small', a full lookup-table solution would probably be even faster.
bool m;
unsigned int a;
const unsigned int k = ...; // k >= 7
const unsigned int inc[2] = {k, 1}; // !m: add k (subtract 1 mod k+1); m: add 1
a = (a + inc[m]) % (k + 1);
If k isn't large enough to cause overflow, you could do something like this:
int a; // Note: not unsigned int

int plusMinus = 2 * m - 1;
a += plusMinus;
if (a == -1)
    a = k;
else if (a == k + 1)
    a = 0;
Still branches, but the branch prediction should be better, since the edge conditions are rarer than m-related conditions.
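If you really want no branches at all, here is a hedged, untested sketch (not from the answers above) that repairs the two wraparound cases with all-ones/all-zero masks built from the comparisons, in the spirit of the bitwise approach the question asks about:
#include <cassert>

const unsigned int k = 639;

// Compute the tentative step, then select the wrapped value with masks
// instead of branches. Whether this beats the compiler's cmov-based code
// is something you would have to measure.
void f_branchless(bool m, unsigned int &a)
{
    unsigned int next = a + 2u * m - 1u;                       // a+1 if m, a-1 (may wrap) if !m
    unsigned int low  = 0u - (unsigned int)(next == 0u - 1u);  // all ones iff we wrapped below 0
    unsigned int high = 0u - (unsigned int)(next == k + 1u);   // all ones iff we stepped past k
    a = (next & ~low & ~high) | (k & low);                     // wrapped low -> k, wrapped high -> 0
}

int main()
{
    unsigned int a = 0;
    f_branchless(false, a); assert(a == k);   // 0 - 1 wraps to k
    f_branchless(true,  a); assert(a == 0);   // k + 1 wraps to 0
    f_branchless(true,  a); assert(a == 1);
    f_branchless(false, a); assert(a == 0);
}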