Inline assembly language - Accessing an array's elements in C++ - c++

I'm upgrading a Borland C++ Builder 6 project to the latest Embarcadero C++ Builder (11.1).
It's a 32-bit Windows app.
Some legacy code, not written by me, contains an array of unsigned 32-bit integers, declared outside of a class like this:
UINT32 CRC32_TABLE[] =
{
0x000000000,0x077073096,0x0EE0E612C,0x0990951BA,
0x0076DC419,0x0706AF48F,0x0E963A535,0x09E6495A3}; // the actual array is a lot bigger
Later on there is this function:
static UINT32 GenerateCRC(const void* p_data,
int nbytes)
{
UINT32 rvalue;
__asm
{
mov esi,p_data
mov ecx,nbytes
mov ebx,0FFFFFFFFH // Initialize CRC accumulator to -1
/*--------------------------------------*/
/* Accumulate the CRC in the specified */
/* range of memory. */
/*--------------------------------------*/
loop2: // FOR i = 1 TO nbytes DO
xor eax,eax
mov al,[esi] // Get a byte to be checked
inc esi // Bump index
xor al,bl // XOR NEW BYTE WITH LOW CRC
shl eax,2 // MAKE IT A DWORD INDEX
mov edi,eax //
shr ebx,8 // SHIFT OLD CRC RIGHT 8
xor ebx,CRC32_TABLE[edi] // XOR SHIFTED CRC WITH CONSTANT
loop loop2 // ENDFOR
not ebx
mov rvalue,ebx
}
return rvalue;
}
The current compiler won't accept xor ebx,CRC32_TABLE[edi], saying "cannot use a base register with variable reference".
It is decades since I've done any assembly language work, so any pointers to fixing this would be very much appreciated.
I'm very open to replacing the asm code with C++, btw. I have no idea why this code was written in assembly language. It's not used in a time critical section but to verify logins... I'd like to think there was a better reason than 'because I could' but, judging from the C++ code, this was written by someone (sadly, no longer with us) who liked to complicate things for no reason...
Andy
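A minimal sketch of what a plain C++ replacement for the asm loop could look like. The name GenerateCRC_Cpp and the generated table kCrc32Table are illustrative assumptions; the table is built from the standard reflected CRC-32 polynomial 0xEDB88320, whose first entries match the CRC32_TABLE values quoted above, so the existing array could be indexed instead. Unlike the asm, where loop with ecx = 0 would run 2^32 times, this version simply skips the loop when nbytes is zero or negative.

```cpp
#include <array>
#include <cstdint>

// Assumption: the legacy table is the standard reflected CRC-32 table
// (polynomial 0xEDB88320); its first entries match the ones in the question.
static std::array<std::uint32_t, 256> MakeCrc32Table()
{
    std::array<std::uint32_t, 256> table{};
    for (std::uint32_t i = 0; i < 256; ++i) {
        std::uint32_t c = i;
        for (int bit = 0; bit < 8; ++bit)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        table[i] = c;
    }
    return table;
}

static const std::array<std::uint32_t, 256> kCrc32Table = MakeCrc32Table();

// Hypothetical C++ equivalent of the asm body of GenerateCRC.
static std::uint32_t GenerateCRC_Cpp(const void* p_data, int nbytes)
{
    const unsigned char* p = static_cast<const unsigned char*>(p_data);
    std::uint32_t crc = 0xFFFFFFFFu;                      // mov ebx,0FFFFFFFFH
    for (int i = 0; i < nbytes; ++i) {
        const std::uint32_t index = (crc ^ p[i]) & 0xFFu; // xor al,bl
        crc = (crc >> 8) ^ kCrc32Table[index];            // shr ebx,8 / xor ebx,table[index]
    }
    return ~crc;                                          // not ebx
}
```

Because the table lookup uses an element index, the shl eax,2 scaling step from the asm disappears; the compiler does the scaling itself.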

Related

What is the compiler doing here that allows comparison of many values to be done with few actual comparisons?

My question is about what the compiler is doing in this case that optimizes the code way more than what I would think is possible.
Given this enum:
enum MyEnum {
Entry1,
Entry2,
... // Entry3..27 are the same, omitted for size.
Entry28,
Entry29
};
And this function:
bool MyFunction(MyEnum e)
{
if (
e == MyEnum::Entry1 ||
e == MyEnum::Entry3 ||
e == MyEnum::Entry8 ||
e == MyEnum::Entry14 ||
e == MyEnum::Entry15 ||
e == MyEnum::Entry18 ||
e == MyEnum::Entry21 ||
e == MyEnum::Entry22 ||
e == MyEnum::Entry25)
{
return true;
}
return false;
}
For the function, MSVC generates this assembly when compiled with -Ox optimization flag (Godbolt):
bool MyFunction(MyEnum) PROC ; MyFunction
cmp ecx, 24
ja SHORT $LN5#MyFunction
mov eax, 20078725 ; 01326085H
bt eax, ecx
jae SHORT $LN5#MyFunction
mov al, 1
ret 0
$LN5#MyFunction:
xor al, al
ret 0
Clang generates similar (slightly better, one less jump) assembly when compiled with -O3 flag:
MyFunction(MyEnum): # #MyFunction(MyEnum)
cmp edi, 24
ja .LBB0_2
mov eax, 20078725
mov ecx, edi
shr eax, cl
and al, 1
ret
.LBB0_2:
xor eax, eax
ret
What is happening here? I see that even if I add more enum comparisons to the function, the assembly that is generated does not actually become "more", it's only this magic number (20078725) that changes. That number depends on how many enum comparisons are happening in the function. I do not understand what is happening here.
The reason why I am looking at this is that I was wondering if it is good to write the function as above, or alternatively like this, with bitwise comparisons:
bool MyFunction2(MyEnum e)
{
if (
e == MyEnum::Entry1 |
e == MyEnum::Entry3 |
e == MyEnum::Entry8 |
e == MyEnum::Entry14 |
e == MyEnum::Entry15 |
e == MyEnum::Entry18 |
e == MyEnum::Entry21 |
e == MyEnum::Entry22 |
e == MyEnum::Entry25)
{
return true;
}
return false;
}
This results in this generated assembly with MSVC:
bool MyFunction2(MyEnum) PROC ; MyFunction2
xor edx, edx
mov r9d, 1
cmp ecx, 24
mov eax, edx
mov r8d, edx
sete r8b
cmp ecx, 21
sete al
or r8d, eax
mov eax, edx
cmp ecx, 20
cmove r8d, r9d
cmp ecx, 17
sete al
or r8d, eax
mov eax, edx
cmp ecx, 14
cmove r8d, r9d
cmp ecx, 13
sete al
or r8d, eax
cmp ecx, 7
cmove r8d, r9d
cmp ecx, 2
sete dl
or r8d, edx
test ecx, ecx
cmove r8d, r9d
test r8d, r8d
setne al
ret 0
Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.
Quite smart! The first comparison with 24 is to do a rough range check - if it's more than 24 or less than 0 it will bail out; this is important as the instructions that follow that operate on the magic number have a hard cap to [0, 31] for operand range.
For the rest, the magic number is just a bitmask, with the bits corresponding to the "good" values set.
>>> bin(20078725)
'0b1001100100110000010000101'
It's easy to spot the first and third bits (counting from 1 and from right) set, the 8th, 14th, 15th, ...
MSVC checks it "directly" using the BT (bit test) instruction and branching; clang instead shifts it by the appropriate amount (to get the relevant bit into the lowest-order position) and keeps just that bit by ANDing it with 1 (avoiding a branch).
The C code corresponding to the clang version would be something like:
bool MyFunction(MyEnum e) {
if(unsigned(e) > 24) return false;
return (20078725 >> e) & 1;
}
as for the MSVC version, it's more like
inline bool bit_test(unsigned val, int bit) {
return val & (1<<bit);
}
bool MyFunction(MyEnum e) {
if(unsigned(e) > 24) return false;
return bit_test(20078725, e);
}
(I kept the bit_test function separate to emphasize that it's actually a single instruction in assembly; that val & (1<<bit) thing has no correspondence to the original assembly.)
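To see where the magic number comes from, here is a small sketch that rebuilds it by setting one bit per accepted value. It uses plain zero-based integers instead of the enum (Entry1 is 0, Entry3 is 2, and so on); kAccepted and IsAccepted are illustrative names, not something the compilers emit.

```cpp
// One bit per accepted enumerator, using zero-based values:
// Entry1=0, Entry3=2, Entry8=7, Entry14=13, Entry15=14,
// Entry18=17, Entry21=20, Entry22=21, Entry25=24.
constexpr unsigned kAccepted =
    (1u << 0)  | (1u << 2)  | (1u << 7)  |
    (1u << 13) | (1u << 14) | (1u << 17) |
    (1u << 20) | (1u << 21) | (1u << 24);

// The compiler-generated constant is exactly this bitmap.
static_assert(kAccepted == 20078725u, "bitmap should match the magic number");

// Range check first, then test the bit selected by e.
bool IsAccepted(unsigned e)
{
    return e <= 24u && ((kAccepted >> e) & 1u) != 0;
}
```

Adding another accepted value only sets one more bit in the constant, which is why the generated code does not grow.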
As for the if-based code, it's quite bad - it uses a lot of CMOV and ORs the results together, which is both longer code, and will probably serialize execution. I suspect the corresponding clang code will be better. OTOH, you wrote this code using bitwise OR (|) instead of the more semantically correct logical OR (||), and the compiler is strictly following your orders (typical of MSVC).
Another possibility to try instead could be a switch - but I don't think there's much to gain compared to the code already generated for the first snippet, which looks pretty good to me.
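For reference, a sketch of what the switch variant could look like, again using zero-based integer values in place of the enumerators (MyFunctionSwitch is an illustrative name). What each compiler generates for it is not guaranteed.

```cpp
// Switch-based variant; the case labels are the zero-based values of
// Entry1, Entry3, Entry8, Entry14, Entry15, Entry18, Entry21, Entry22, Entry25.
bool MyFunctionSwitch(unsigned e)
{
    switch (e) {
    case 0:
    case 2:
    case 7:
    case 13:
    case 14:
    case 17:
    case 20:
    case 21:
    case 24:
        return true;
    default:
        return false;
    }
}
```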
Ok, doing a quick test with all the versions against all compilers, we can see that:
the C translation of the CLang output above results in pretty much that same code (= to the clang output) in all compilers; similarly for the MSVC translation;
the bitwise or version is the same as the logical or version (= good) in both CLang and gcc;
in general, gcc does essentially the same thing as CLang except for the switch case;
switch results are varied:
CLang does best, by generating the exact same code;
both gcc and MSVC generate jump-table based code, which in this case is less good; however:
gcc prefers to emit a table of QWORDs, trading size for simplicity of the setup code;
MSVC instead emits a table of BYTEs, paying it in setup code size; I couldn't get gcc to emit similar code even changing -O3 to -Os (optimize for size).
Ah, the old immediate bitmap trick.
GCC does this too, at least for a switch.
See "x86 asm casetable implementation". Unfortunately GCC9 has a regression for some cases: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026#c3 ; GCC8 and earlier do a better job.
Another example of using it, this time for code-golf (fewest bytes of code, in this case x86 machine code) to detect certain letters: User Appreciation Challenge #1: Dennis ♦
The basic idea is to use the input as an index into a bitmap of true/false results.
First you have to range-check because the bitmap is fixed-width, and x86 shifts wrap the shift count. We don't want high inputs to alias into the range where there are some that should return true. That is what cmp edi, 24 / ja is doing.
(If the range between the lowest and highest true values was from 120 to 140, for example, it might start with a sub edi,120 to range-shift everything before the cmp.)
Then you use bitmap & (1<<e) (the bt instruction), or (bitmap >> e) & 1 (shr / and) to check the bit in the bitmap that tells you whether that e value should return true or false.
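A hand-written sketch of the range-shift idea, under the assumption of a hypothetical accepted set {120, 125, 131, 140}. Subtracting the lowest value first lets a single unsigned compare reject everything below 120 (the subtraction wraps to a huge number) as well as everything above 140.

```cpp
// Hypothetical accepted values: 120, 125, 131, 140.
bool IsSpecial(unsigned v)
{
    // Range-shift so the lowest accepted value maps to bit 0
    // (the sub edi,120 step). Values below 120 wrap around to large numbers.
    const unsigned shifted = v - 120u;

    // One unsigned compare covers both "too low" and "too high".
    if (shifted > 20u)
        return false;

    // Bits 0, 5, 11 and 20 stand for 120, 125, 131 and 140.
    constexpr unsigned bitmap =
        (1u << 0) | (1u << 5) | (1u << 11) | (1u << 20);
    return ((bitmap >> shifted) & 1u) != 0;
}
```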
There are many ways to implement that check, logically equivalent but with performance differences.
If the range was wider than 32, it would have to use 64-bit operand-size. If it was wider than 64, the compiler might not attempt this optimization at all. Or might still do it for some of the conditions that are in a narrow range.
Using an even larger bitmap (in .rodata memory) would be possible but probably not something most compilers will invent for you. Either with bt [mem],reg (inefficient) or manually indexing a dword and checking that the same way this code checks the immediate bitmap. If you had a lot of high-entropy ranges it might be worth checking 2x 64-bit immediate bitmap, branchy or branchless...
Clang/LLVM has other tricks up its sleeve for efficiently comparing against multiple values (when it doesn't matter which one is hit), e.g. broadcast a value into a SIMD register and use a packed compare. That isn't dependent on the values being in a dense range. (Clang generates worse code for 7 comparisons than for 8 comparisons)
that optimizes the code way more than what I would think is possible.
These kinds of optimizations come from smart human compiler developers that notice common patterns in source code and think of clever ways to implement them. Then get compilers to recognize those patterns and transform their internal representation of the program logic to use the trick.
Turns out that switch and switch-like if() statements are common, and aggressive optimizations are common.
Compilers are far from perfect, but sometimes they do come close to living up to what people often claim; that compilers will optimize your code for you so you can write it in a human-readable way and still have it run near-optimally. This is sometimes true over the small scale.
Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.
The immediate bitmap is vastly more efficient. There's no data memory access in either one so no cache miss loads. The only "expensive" instruction is a variable-count shift (3 uops on mainstream Intel, because of x86's annoying FLAGS-setting semantics; BMI2 shrx is only 1 uop and avoids having to mov the number to ecx.) https://agner.org/optimize. And see other performance analysis links in https://stackoverflow.com/tags/x86/info.
Each instruction in the cmp/cmov chain is at least 1 uop, and there's a pretty long dependency chain through each cmov because MSVC didn't bother to break it into 2 or more parallel chains. But regardless it's just a lot of uops, far more than the bitmap version, so worse for throughput (ability for out-of-order exec to overlap the work with surrounding code) as well as latency.
bt is also cheap: 1 uop on modern AMD and Intel. (bts, btr, btc are 2 on AMD, still 1 on Intel).
The branch in the immediate-bitmap version could have been a setna / and to make it branchless, but especially for this enum definition the compiler expected that it would be in range. It could have increased branch predictability by only requiring e <= 31, not e <= 24.
Since the enum only goes up to 29, and IIRC it's UB to have out-of-range enum values, it could actually optimize it away entirely.
Even if the e>24 branch doesn't predict very well, it's still probably better overall. Given current compilers, we only get a choice between the nasty chain of cmp/cmov or branch + bitmap. Unless we turn the asm logic back into C to hand-hold compilers into making the asm we want; then we can maybe get branchless with an AND or CMOV to make it always zero for out-of-range e.
But if we're lucky, profile-guided optimization might let some compilers make the bitmap range check branchless. (In asm the behaviour of shl reg, cl with cl > 31 or 63 is well-defined: on x86 it simply masks the count. In a C equivalent, you could use bitmap >> (e&31) which can still optimize to a shr; compilers know that x86 shr masks the count so they can optimize that away. But not for other ISAs that saturate the shift count...)
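A sketch of how the branchless form might be written by hand in C++, masking the shift count with e & 31 so the shift is always well-defined and ANDing the result with the range check. MyFunctionBranchless is an illustrative name, and whether a given compiler really emits branch-free code for it is an assumption to verify on Godbolt.

```cpp
// Branchless-style bitmap test: no early return for out-of-range inputs.
bool MyFunctionBranchless(unsigned e)
{
    constexpr unsigned bitmap = 20078725u;   // same immediate as above

    // 1 when e is inside the bitmap's meaningful range, else 0.
    const unsigned in_range = (e <= 24u) ? 1u : 0u;

    // Masking the count keeps the C++ shift defined for any e and matches
    // what the x86 shift instruction does with its count anyway.
    const unsigned bit = (bitmap >> (e & 31u)) & 1u;

    // Out-of-range inputs that alias onto a set bit are zeroed here.
    return (in_range & bit) != 0;
}
```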
There are lots of ways to implement the bitmap check that are pretty much equivalent. e.g. you could even use the CF output of shr, set according to the last bit shifted out. At least if you make sure CF has a known state ahead of time for the cl=0 case.
When you want an integer bool result, right-shifting seems to make more sense than bt / setcc, but with shr costing 3 uops on Intel it might actually be best to use bt reg,reg / setc al. Especially if you only need a bool, and can use EAX as your bitmap destination so the previous value of EAX is definitely ready before setcc. (Avoiding a false dependency on some unrelated earlier dep chain.)
BTW, MSVC has other silliness: as What is the best way to set a register to zero in x86 assembly: xor, mov or and? explains, xor al,al is totally stupid compared to xor eax,eax when you want to zero AL. If you don't need to leave the upper bytes of RAX unmodified, zero the full register with a zeroing idiom.
And of course branching just to return 0 or return 1 makes little sense, unless you expect it to be very predictable and want to break the data dependency. I'd expect that setc al would make more sense to read the CF result of bt.

How to measure the number of increments per second

I want to measure the speed in which my PC can increment a counter N times (e.g., for N = 10^9).
I tried the following code:
using namespace std;
auto start = chrono::steady_clock::now();
for (int i = 0; i < N; ++i)
{
}
auto end = chrono::steady_clock::now();
However, the compiler is smart enough to simply set i=N, and I get that start==end regardless of the value of N.
How can I change the code to measure the increment speed? (adding costly operations in the loop would dominate the runtime and would not allow the measurement to be correct).
I use Windows 10 and Visual Studio 15.9.7.
A bit of motivation: my code takes about 2 seconds for N=10^9. I'm wondering if there's any "meat" left in optimizing it further (e.g., could it possibly go down to 1 sec? or would the loop itself require more?)
This question doesn't really make sense in C or C++. The compiler aims to generate the fastest code that meets the constraints defined by your source code. In your question, you do not define a constraint that the compiler must do a loop at all. Because the loop has no effect, the optimizer will remove it.
Gabriel Staple's answer is probably the nearest thing you can get to a sensible answer to your question, but it is also not quite right because it defines too many constraints that limit the compiler's freedom to implement optimal code. Volatile often forces the compiler to write the result back to memory each time the variable is modified.
eg, this code:
void foo(int N) {
for (volatile int i = 0; i < N; ++i)
{
}
}
Becomes this assembly (on an x64 compiler I tried):
mov DWORD PTR [rsp-4], 0
mov eax, DWORD PTR [rsp-4]
cmp edi, eax
jle .L1
.L3:
mov eax, DWORD PTR [rsp-4] # Read i from mem
add eax, 1 # i++
mov DWORD PTR [rsp-4], eax # Write i to mem
mov eax, DWORD PTR [rsp-4] # Read it back again before
# evaluating the loop condition.
cmp eax, edi # Is i < N?
jl .L3 # Jump back to .L3 if so.
.L1:
It sounds like your real question is more like how fast is:
L1: add eax, 1
jmp L1
Even the answer to that is complex and requires an understanding of the internals of your CPU's pipelines.
I recommend playing with Godbolt to understand more about what the compiler is doing. eg https://godbolt.org/z/59XUSu
You can directly measure the speed of the "empty loop", but it is not easy to convince a C++ compiler to emit it. GCC and Clang can be tricked with asm volatile("") but MSVC inline assembly has always been different and is disabled completely for 64bit programs.
It is possible to use MASM to side-step that restriction:
.MODEL FLAT
.CODE
_testfun PROC
sub ecx, 1
jnz _testfun
ret
_testfun ENDP
END
Import it into your code with extern "C" void testfun(unsigned N);.
Try volatile int i = 0 in your for loop. The volatile keyword tells the compiler this variable could change at any time, due to outside events or threads, and therefore it can't make the same assumptions about what the variable might be in the future.
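A minimal timing sketch combining the volatile counter with the steady_clock calls from the question. TimeVolatileIncrements is an illustrative name. As the other answer points out, volatile makes the loop go through memory on every iteration, so this measures a load/add/store loop rather than a pure register increment.

```cpp
#include <chrono>

// Returns the wall-clock seconds spent incrementing a volatile counter n times.
double TimeVolatileIncrements(int n)
{
    const auto start = std::chrono::steady_clock::now();
    // The volatile qualifier forces every read and write of i to happen,
    // so the optimizer cannot collapse the loop into "i = n".
    for (volatile int i = 0; i < n; i = i + 1) {
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

Dividing n by the returned value gives an increments-per-second figure for this particular loop shape.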

Intel DRNG only giving 4 bytes of data instead of 8

I am trying to implement Intel DRNG in c++.
According to its guide to generate a 64 bit unsigned long long the code should be:
int rdrand64_step (unsigned long long *rand)
{
unsigned char ok;
asm volatile ("rdrand %0; setc %1"
: "=r" (*rand), "=qm" (ok));
return (int) ok;
}
However, the value this function stores in rand is only 32 bits wide, as shown:
bd4a749d
d461c2a8
8f666eee
d1d5bcc4
c6f4a412
Any reason why this is happening?
More info: the IDE I'm using is Code::Blocks.
Use int _rdrand64_step (unsigned __int64* val) from immintrin.h instead of writing inline asm. You don't need it, and there are many reasons (including this one) to avoid it: https://gcc.gnu.org/wiki/DontUseInlineAsm
In this case, the problem is that you're probably compiling 32-bit code, so of course 64-bit rdrand is not encodeable. But the way you used inline-asm ended up giving you a 32-bit rdrand, and storing garbage from another register for the high half.
gcc -Wall -O3 -m32 -march=ivybridge (and similar for clang) produces (on Godbolt):
In function 'rdrand64_step':
7 : <source>:7:1: warning: unsupported size for integer register
rdrand64_step:
push ebx
rdrand ecx; setc al
mov edx, DWORD PTR [esp+8] # load the pointer arg
movzx eax, al
mov DWORD PTR [edx], ecx
mov DWORD PTR [edx+4], ebx # store garbage in the high half of *rand
pop ebx
ret
I guess you called this function with a caller that happened to have ebx=0. Or else you used a different compiler that did something different. Maybe something else happens after inlining. If you looked at disassembly of what you actually compiled, you could explain exactly what's going on.
If you'd used the intrinsic, you would have gotten error: '_rdrand64_step' was not declared in this scope, because immintrin.h only declares it in 64-bit mode (and with a -march setting that implies rdrand support. Or -mrdrnd. Best option: use -march=native if you're building on the target machine).
You'd also get significantly more efficient code for a retry loop, at least with clang:
unsigned long long use_intrinsic(void) {
unsigned long long rand;
while(!_rdrand64_step(&rand)); // TODO: retry limit in case RNG is broken.
return rand;
}
use_intrinsic: # #use_intrinsic
.LBB2_1: # =>This Inner Loop Header: Depth=1
rdrand rax
jae .LBB2_1
ret
That avoids setcc and then testing that, which is of course redundant. gcc6 has syntax for returning flag results from inline asm. You can also use asm goto and put a jcc inside the asm, jumping to a label: return 1; target or falling through to a return 0. (The inline-asm docs have an example of doing this. https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html. See also the inline-assembly tag wiki.)
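The retry limit mentioned in the TODO above could be sketched as a small wrapper. GenerateWithRetry is a hypothetical helper, not part of any Intel or compiler API; step is assumed to be any callable shaped like _rdrand64_step (nonzero return on success, value written through the pointer), and the choice of max_tries is left to the caller.

```cpp
// Hypothetical helper: call `step` until it reports success or the
// attempt budget runs out. `step` is assumed to behave like
// _rdrand64_step: it returns nonzero on success and stores the value in *out.
template <typename Step>
bool GenerateWithRetry(Step step, unsigned long long* out, int max_tries)
{
    for (int attempt = 0; attempt < max_tries; ++attempt) {
        if (step(out))
            return true;   // got a valid random value
    }
    return false;          // RNG kept failing; the caller decides what to do
}
```

In a 64-bit build with rdrand enabled, the intrinsic itself would be passed as step; a stub that fails a few times is enough to exercise the loop.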
Using your inline-asm, clang (in 64-bit mode) compiles it to:
use_asm:
.LBB1_1:
rdrand rax
setb byte ptr [rsp - 1]
cmp byte ptr [rsp - 1], 0
je .LBB1_1
ret
(clang makes bad decisions for constraints with multiple options that include memory.)
gcc7.2 and ICC17 actually end up with better code from the asm than from the intrinsic. With the intrinsic, they use cmovc to get a 0 or 1 and then test that. It's pretty dumb. But that's a gcc/ICC missed optimization that will hopefully be fixed.

How do you convert __asm blocks to string-based asm (compatible with G++)?

I recently found a C++ library that would be perfect for my project, but it defines two functions that use __asm blocks that can only be compiled in VC++:
// upper 32-bit result of 32x32-bit product
inline unsigned Product_64(unsigned l, unsigned c)
{
_asm {
mov eax,l
mul c
mov eax,edx
}
} // return value in register EAX
// division of 64-bit (after scaling) by a 32-bit number
inline unsigned Division_64(unsigned dvh, unsigned dvr)
{
_asm {
xor eax,eax
not eax
mov edx,dvh
div dvr
}
} // return value in register EAX
The problem with this is that I need to compile the project with G++, which uses string based asm blocks. Given that I don't have the time to properly learn assembly, is there:
A script that reliably converts to G++'s string asm
A way of rewriting this function in C++ to the same effect
A list of simple instructions on how to convert between the two formats
Thanks!
You're requesting two things here:
conversion between Intel ASM syntax and AT&T syntax. See for example http://www.delorie.com/djgpp/v2faq/faq17_2.html
converting between calling conventions. VC++ always assumes the return value to be in EAX, for example. In G++, you can specify more intricate behaviour between C++ and asm.
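Rewriting these two functions in C++ (untested, assuming a compiler where 'unsigned long long' is 64 bits and 'unsigned' is 32 bits). The asm sets EAX to all ones (xor followed by not), so the low half of the dividend is 0xFFFFFFFF:
Product_64: return (unsigned)(((unsigned long long)l * c) >> 32);
Division_64: return (unsigned)(((((unsigned long long)dvh) << 32) | 0xFFFFFFFFu) / dvr);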
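Written out as complete functions, a sketch could look like the following. It uses the fixed-width types from <cstdint> instead of relying on the width of unsigned, and it mirrors the asm: mul leaves the high half of the product in EDX, and div divides EDX:EAX, with EDX = dvh and EAX = 0xFFFFFFFF, by dvr. One difference to be aware of: the div instruction faults when the quotient does not fit in 32 bits, whereas this version silently truncates it.

```cpp
#include <cstdint>

// Upper 32 bits of the 64-bit product l * c (what mul leaves in EDX).
std::uint32_t Product_64(std::uint32_t l, std::uint32_t c)
{
    const std::uint64_t product = static_cast<std::uint64_t>(l) * c;
    return static_cast<std::uint32_t>(product >> 32);
}

// Divides the 64-bit value dvh:0xFFFFFFFF by dvr, like the asm's
// xor eax,eax / not eax / mov edx,dvh / div dvr sequence.
// dvr must be nonzero; an oversized quotient is truncated here,
// while the div instruction would raise a divide error.
std::uint32_t Division_64(std::uint32_t dvh, std::uint32_t dvr)
{
    const std::uint64_t dividend =
        (static_cast<std::uint64_t>(dvh) << 32) | 0xFFFFFFFFu;
    return static_cast<std::uint32_t>(dividend / dvr);
}
```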

Why use xor with a literal instead of inversion (bitwise not)

I have come across this CRC32 code and was curious why the author would choose to use
crc = crc ^ ~0U;
instead of
crc = ~crc;
As far as I can tell, they are equivalent.
I have even disassembled the two versions in Visual Studio 2010.
Not optimized build:
crc = crc ^ ~0U;
009D13F4 mov eax,dword ptr [crc]
009D13F7 xor eax,0FFFFFFFFh
009D13FA mov dword ptr [crc],eax
crc = ~crc;
011C13F4 mov eax,dword ptr [crc]
011C13F7 not eax
011C13F9 mov dword ptr [crc],eax
I also cannot justify the code by thinking about the number of cycles that each instruction takes since both should be taking 1 cycle to complete. In fact, the xor might have a penalty by having to load the literal from somewhere, though I am not certain of this.
So I'm left thinking that it is possibly just a preferred way to describe the algorithm, rather than an optimization... Would that be correct?
Edit 1:
Since I just realized that the type of the crc variable is probably important to mention I am including the whole code (less the lookup table, way too big) here so you don't have to follow the link.
uint32_t crc32(uint32_t crc, const void *buf, size_t size)
{
const uint8_t *p;
p = buf;
crc = crc ^ ~0U;
while (size--)
{
crc = crc32_tab[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
}
return crc ^ ~0U;
}
Edit 2:
Since someone has brought up the fact that an optimized build would be of interest, I have made one and included it below.
Optimized build:
Do note that the whole function (included in Edit 1 above) was inlined.
// crc = crc ^ ~0U;
zeroCrc = 0;
zeroCrc = crc32(zeroCrc, zeroBufferSmall, sizeof(zeroBufferSmall));
00971148 mov ecx,14h
0097114D lea edx,[ebp-40h]
00971150 or eax,0FFFFFFFFh
00971153 movzx esi,byte ptr [edx]
00971156 xor esi,eax
00971158 and esi,0FFh
0097115E shr eax,8
00971161 xor eax,dword ptr ___defaultmatherr+4 (973018h)[esi*4]
00971168 add edx,ebx
0097116A sub ecx,ebx
0097116C jne main+153h (971153h)
0097116E not eax
00971170 mov ebx,eax
// crc = ~crc;
zeroCrc = 0;
zeroCrc = crc32(zeroCrc, zeroBufferSmall, sizeof(zeroBufferSmall));
01251148 mov ecx,14h
0125114D lea edx,[ebp-40h]
01251150 or eax,0FFFFFFFFh
01251153 movzx esi,byte ptr [edx]
01251156 xor esi,eax
01251158 and esi,0FFh
0125115E shr eax,8
01251161 xor eax,dword ptr ___defaultmatherr+4 (1253018h)[esi*4]
01251168 add edx,ebx
0125116A sub ecx,ebx
0125116C jne main+153h (1251153h)
0125116E not eax
01251170 mov ebx,eax
Something nobody's mentioned yet; if this code is being compiled on a machine with 16 bit unsigned int then these two code snippets are different.
crc is specified as a 32-bit unsigned integral type. ~crc will invert all bits, but if unsigned int is 16bit then crc = crc ^ ~0U will only invert the lower 16 bits.
I don't know enough about the CRC algorithm to know whether this is intentional or a bug, perhaps hivert can clarify; although looking at the sample code posted by OP, it certainly does make a difference to the loop that follows.
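The difference can be simulated on an ordinary platform by using a 16-bit all-ones constant in place of ~0U. This is only a sketch of the effect; the function names are illustrative.

```cpp
#include <cstdint>

// Simulates `crc ^ ~0U` on a platform where unsigned int is 16 bits wide:
// there, ~0U is 0xFFFF, so only the low 16 bits of crc get flipped.
std::uint32_t XorWith16BitAllOnes(std::uint32_t crc)
{
    const std::uint16_t all_ones_16 = 0xFFFFu;  // what ~0U would be there
    return crc ^ all_ones_16;
}

// `~crc` always inverts all 32 bits because it works on crc's own type.
std::uint32_t Complement(std::uint32_t crc)
{
    return ~crc;
}
```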
NB. Sorry for posting this as an "answer" because it isn't an answer, but it's too big to just fit in a comment :)
The short answer is: because it allows a uniform algorithm for all CRCs.
The reason is the following: there are many variants of CRC. Each one depends on a polynomial over Z/2Z which is used for a Euclidean division. It is usually implemented using the algorithm described in this paper by Aram Perez. Depending on the polynomial you are using, there is a final XOR at the end of the algorithm, whose value depends on the polynomial and whose goal is to eliminate some corner cases. It happens that for CRC32 this is the same as a global NOT, but this is not true for all CRCs. As evidence, on this web page you can read (emphasis mine):
Consider a message that begins with some number of zero bits. The remainder will never contain anything other than zero until the first one in the message is shifted into it. That's a dangerous situation, since packets beginning with one or more zeros may be completely legitimate and a dropped or added zero would not be noticed by the CRC. (In some applications, even a packet of all zeros may be legitimate!) The simple way to eliminate this weakness is to start with a nonzero remainder. The parameter called initial remainder tells you what value to use for a particular CRC standard. And only one small change is required to the crcSlow() and crcFast() functions:
crc remainder = INITIAL_REMAINDER;
The final XOR value exists for a similar reason. To implement this capability, simply change the value that's returned by crcSlow() and crcFast() as follows:
return (remainder ^ FINAL_XOR_VALUE);
If the final XOR value consists of all ones (as it does in the CRC-32 standard), this extra step will have the same effect as complementing the final remainder. However, implementing it this way allows any possible value to be used in your specific application.
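A sketch of what such a parameterised routine could look like, using a bit-at-a-time loop instead of a lookup table to keep it short. CrcParams and ComputeCrc are illustrative names, and the routine assumes a reflected (LSB-first) 32-bit CRC such as CRC-32; it is not the crcSlow()/crcFast() code from the quoted page.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical parameter set for a reflected 32-bit CRC.
struct CrcParams {
    std::uint32_t reflected_poly;     // e.g. 0xEDB88320 for CRC-32
    std::uint32_t initial_remainder;  // INITIAL_REMAINDER in the quoted text
    std::uint32_t final_xor_value;    // FINAL_XOR_VALUE in the quoted text
};

// Bitwise CRC: the same loop serves every parameter set.
std::uint32_t ComputeCrc(const CrcParams& params, const void* data, std::size_t size)
{
    const unsigned char* p = static_cast<const unsigned char*>(data);
    std::uint32_t remainder = params.initial_remainder;
    for (std::size_t i = 0; i < size; ++i) {
        remainder ^= p[i];
        for (int bit = 0; bit < 8; ++bit) {
            remainder = (remainder & 1u)
                            ? (remainder >> 1) ^ params.reflected_poly
                            : (remainder >> 1);
        }
    }
    // For CRC-32 the final XOR value is all ones, which equals ~remainder.
    return remainder ^ params.final_xor_value;
}
```

With both the initial remainder and the final XOR set to 0xFFFFFFFF this reproduces standard CRC-32; changing only the final XOR value gives a different CRC variant without touching the loop.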
Just to add my own guess to the mix, x ^ 0x0001 flips the last bit and keeps the others; to turn off the last bit use x & 0xFFFE or x & ~0x0001; to turn on the last bit unconditionally use x | 0x0001. I.e., if you are doing lots of bit-twiddling, your fingers probably know those idioms and just roll them out without much thinking.
I doubt there's any deep reason. Maybe that's how the author thought about it ("I'll just xor with all ones"), or perhaps how it was expressed in the algorithm definition.
I think it is for the same reason that some write
const int zero = 0;
and others write
const int zero = 0x00000000;
Different people think different ways. Even about a fundamental operation.