I'm trying to manually translate some assembly into C/C++ in order to migrate a code base to x64, as Visual Studio does not allow __asm blocks to compile when targeting x64.
I've translated some parts, however I am stuck on the following (it's an extract from the full function but should be self-contained):
void Foo()
{
    char cTemp = ...;
    int iVal = ...;
    __asm
    {
        mov ebx, iVal
        mov dl, cTemp
        mov al, dl
        MOV CL, 3
        CLC
        MOV AX, BX
        ROR AH, CL
        XOR DL, AH
        SAR DL, 1
        RCL BX, 1
        RCL DL, 1
    }
}
The parts I'm struggling with are:
RCL BX, 1
RCL DL, 1
From what I understand this is the equivalent of the following:
short v = ...;
v = (v << 1) + (CLEAR_FLAG ? 1 : 0);
Which, from my understanding, means: if the value of (v << 1) overflows, then add 1, otherwise add 0 (I may be misunderstanding, though, so please correct me if so).
What I'm struggling to do is detect the overflow in C/C++ when carrying out the shift operation. I've looked around, and the only thing I can find is detecting addition/subtraction overflow before it happens, but nothing with regard to bit shifting.
Is it possible at all to translate such assembly to C/C++?
RCL is a "rotate left with carry" operation. So you need to take into account the previous instruction, SAR, that sets the carry flag (CF).
Note that SAR is a signed right shift, so will need a signed operand. It important to use proper data types that match the instruction precisely in bitness and signedness.
A 1-to-1 translation could look something like this:
int8_t dl = /* ... */;
uint16_t bx = /* ... */;
// SAR DL,1 -- the bit shifted out becomes the carry
int8_t carry_1 = dl & 1;
dl >>= 1; // arithmetic shift, because dl is signed (matching SAR)
// RCL BX,1 -- the old top bit becomes the new carry
uint16_t carry_2 = bx >> 15;
bx = (bx << 1) | carry_1;
// RCL DL,1
dl = (dl << 1) | carry_2;
There is probably a way to simplify this further. There are tools that can do that; they produce a somewhat readable C++ equivalent of a decompiled function.
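Putting the pieces together, the whole extract might translate like this. This is a minimal sketch, assuming cTemp and iVal arrive as parameters (the question elides their initializers); the dead instructions (MOV AL,DL, which MOV AX,BX overwrites, and CLC, whose flag SAR overrides) are dropped:
#include <cstdint>

// Parameters stand in for the initializers the question elides.
void Foo(char cTemp, int iVal)
{
    uint16_t bx = static_cast<uint16_t>(iVal); // MOV EBX,iVal / MOV AX,BX keeps only the low 16 bits
    int8_t dl = static_cast<int8_t>(cTemp);    // MOV DL,cTemp

    // ROR AH,CL with CL = 3: rotate the high byte of AX right by 3
    uint8_t ah = static_cast<uint8_t>(bx >> 8);
    ah = static_cast<uint8_t>((ah >> 3) | (ah << 5));

    dl ^= static_cast<int8_t>(ah);             // XOR DL,AH

    uint8_t carry = dl & 1;                    // SAR DL,1: the bit shifted out becomes CF
    dl >>= 1;

    uint8_t carry2 = static_cast<uint8_t>(bx >> 15); // RCL BX,1: old bit 15 becomes the new CF
    bx = static_cast<uint16_t>((bx << 1) | carry);

    dl = static_cast<int8_t>((dl << 1) | carry2);    // RCL DL,1

    (void)bx; (void)dl; // the extract doesn't show how the results are used
}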
Related
I've been struggling to convert this assembly code to C++.
It's a function from an old game that takes pixel data Stmp, and I believe it writes it to the destination void* dest.
void Function(int x, int y, int yl, void* Stmp, void* dest)
{
    unsigned long size = 1280 * 2;
    unsigned long j = yl;
    void* Dtmp = (void*)((char*)dest + y * size + (x * 2));
    _asm
    {
        push es;
        push ds;
        pop es;
        mov edx, Dtmp;
        mov esi, Stmp;
        mov ebx, j;
        xor eax, eax;
        xor ecx, ecx;
    loop_1:
        or bx, bx;
        jz exit_1;
        mov edi, edx;
    loop_2:
        cmp word ptr [esi], 0xffff;
        jz exit_2;
        mov ax, [esi];
        add edi, eax;
        mov cx, [esi+2];
        add esi, 4;
        shr ecx, 2;
        jnc Next2;
        movsw;
    Next2:
        rep movsd;
        jmp loop_2;
    exit_2:
        add esi, 2;
        add edx, size;
        dec bx;
        jmp loop_1;
    exit_1:
        pop es;
    };
}
This is as far as I've gotten (not sure if it's even correct):
while (j > 0)
{
    if (*stmp != 0xffff)
    {
    }
    ++stmp;
    dtmp += size;
    --j;
}
Any help is greatly appreciated. Thank you.
It saves / restores ES around setting it equal to DS so rep movsd will use the same addresses for load and store. That instruction is basically memcpy(edi, esi, 4 * ecx), incrementing the pointers in EDI and ESI (by 4 * ecx) as it goes. https://www.felixcloutier.com/x86/movs:movsb:movsw:movsd:movsq
In a flat memory model, you can totally ignore that. This code looks like it might have been written to run in 16-bit unreal mode, or possibly even real mode, hence the use of 16-bit registers all over the place.
Looks like it's loading some kind of records that tell it how many bytes to copy, and reading until the end of the record, at which point it looks for the next record there. There's an outer loop around that, looping through records.
The records look like this I think:
struct sprite_line {
    uint16_t skip_dstbytes, src_bytes;
    uint16_t src_data[]; // flexible array member; actual size unlimited but assumed to be a multiple of 2
};
The inner loop is this:
;; char *dstp;             // in EDI
;; struct sprite_line *p;  // in ESI
loop_2:
    cmp word ptr [esi], 0xffff  ; while( p->skip_dstbytes != (uint16_t)-1 ) {
    jz exit_2;
    mov ax, [esi]     ; EAX was xor-zeroed earlier; some old CPUs maybe had slow movzx loads
    add edi, eax      ; dstp += p->skip_dstbytes;
    mov cx, [esi+2]   ; bytelen = p->src_bytes;
    add esi, 4        ; esi = &p->src_data[0]
    shr ecx, 2        ; length in dwords = bytelen >> 2
    jnc Next2;
    movsw             ; one 16-bit (word) copy if bytelen >> 1 is odd, i.e. if the last bit shifted out was a 1.
                      ; The first bit shifted out isn't checked, so the size is assumed to be a multiple of 2.
Next2:
    rep movsd         ; copy in 4-byte chunks
On old CPUs (before Ivy Bridge), rep movsd was faster than rep movsb; otherwise this code could just have used that.
or bx,bx;
jz exit_1;
That's an obsolete idiom that comes from 8080 for test bx,bx / jz, i.e. jump if BX was zero. So it's a while( bx != 0 ) {} loop, with dec bx in it. It's an inefficient way to write a while (--bx) loop; a compiler would put a dec/jnz at the bottom of the loop, with a test once outside the loop in case it needs to run zero times. Why are loops always compiled into "do...while" style (tail jump)?
Some people would say that's what a while loop looks like in asm, if they're picturing totally naive translation from C to asm.
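Putting that together, a hedged C sketch of the whole routine might look like the following. The record layout and field names are my guesses from above, unaligned halfword loads are done through memcpy, and the ES/DS juggling is irrelevant in a flat memory model:
#include <stdint.h>
#include <string.h>

void Function(int x, int y, int yl, void *Stmp, void *dest)
{
    const unsigned long size = 1280 * 2;            /* bytes per destination line */
    const uint8_t *src = (const uint8_t *)Stmp;
    uint8_t *dstline = (uint8_t *)dest + y * size + x * 2;

    /* loop_1: one iteration per line; note the asm only ever tests BX,
       the low 16 bits of j */
    for (unsigned long j = yl; j != 0; --j) {
        uint8_t *dst = dstline;
        for (;;) {                                  /* loop_2: records until 0xFFFF */
            uint16_t skip, len;
            memcpy(&skip, src, 2);                  /* word ptr [esi] */
            if (skip == 0xFFFF)
                break;                              /* exit_2 */
            memcpy(&len, src + 2, 2);               /* word ptr [esi+2] */
            src += 4;
            dst += skip;                            /* add edi,eax */
            len = (uint16_t)(len & ~1u);            /* bit 0 is never copied (see above) */
            memcpy(dst, src, len);                  /* movsw + rep movsd */
            dst += len;
            src += len;
        }
        src += 2;                                   /* skip the 0xFFFF terminator */
        dstline += size;                            /* add edx,size */
    }
}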
The arithmetic mean of two unsigned integers is defined as:
mean = (a+b)/2
Directly implementing this in C/C++ may overflow and produce a wrong result. A correct implementation would avoid this. One way of coding it could be:
mean = a/2 + b/2 + (a%2 + b%2)/2
But this produces rather a lot of code with typical compilers. In assembler, this usually can be done much more efficiently. For example, the x86 can do this in the following way (assembler pseudo code, I hope you get the point):
ADD a,b ; addition, leaving the overflow condition in the carry bit
RCR a,1 ; rotate right through carry, effectively a division by 2
After those two instructions, the result is in a, and the remainder of the division is in the carry bit. If correct rounding is desired, a third ADC instruction would have to add the carry into the result.
Note that the RCR instruction is used, which rotates a register through the carry. In our case, it is a rotate by one position, so that the previous carry becomes the most significant bit in the register, and the new carry holds the previous LSB from the register. It seems that MSVC doesn't even offer an intrinsic for this instruction.
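For reference, on compilers with GNU-style extended asm (not MSVC), the two-instruction sequence can be written directly. A minimal sketch, assuming x86-64 and unsigned 64-bit operands:
#include <stdint.h>

/* Sketch: mean of two uint64_t via ADD + RCR; the carry produced by the
   addition is rotated back in as the top bit of the result. */
static uint64_t mean_add_rcr(uint64_t a, uint64_t b)
{
    __asm__("add %1, %0\n\t"   /* sum; CF holds the 65th bit */
            "rcr $1, %0"       /* rotate CF back in: (a + b) / 2 */
            : "+r"(a)
            : "r"(b)
            : "cc");
    return a;
}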
Is there a known C/C++ pattern that can be expected to be recognized by an optimizing compiler so that it produces such efficient code? Or, more generally, is there a rational way how to program in C/C++ source level so that the carry bit is being used by the compiler to optimize the generated code?
EDIT:
A 1-hour lecture about std::midpoint: https://www.youtube.com/watch?v=sBtAGxBh-XI
Wow!
EDIT2: Great discussion on Microsoft blog
The following method avoids overflow and should result in fairly efficient assembly (example) without depending on non-standard features:
mean = (a&b) + (a^b)/2;
This works because a + b == 2*(a & b) + (a ^ b): the bits the operands have in common are counted twice and the differing bits once, so halving the XOR term first can never overflow.
There are three typical methods to compute the average without overflow, plus a variant of the second for when the operand order is not known; the last method, which widens the operands, is limited to uint32_t on 64-bit architectures.
// average "SWAR" / Montgomery
uint32_t avg(uint32_t a, uint32_t b) {
return (a & b) + ((a ^ b) >> 1);
}
// in case the relative magnitudes are known
uint32_t avg2(uint32_t min, uint32_t max) {
return min + (max - min) / 2;
}
// in case the relative magnitudes are not known
uint32_t avg2_constrained(uint32_t a, uint32_t b) {
return a + (int32_t)(b - a) / 2;
}
// average increase width (not applicable to uint64_t)
uint32_t avg3(uint32_t a, uint32_t b) {
return ((uint64_t)a + b) >> 1;
}
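As a quick sanity check (my own test values, not from the answer), all variants agree where their preconditions hold, even when a 32-bit a + b would wrap; this assumes the definitions above are in scope:
#include <assert.h>
#include <stdint.h>

int main(void) {
    uint32_t a = 0xFFFFFFF0u, b = 0xFFFFFFFEu;  /* a + b wraps in 32 bits */
    assert(avg(a, b) == 0xFFFFFFF7u);
    assert(avg2(a, b) == 0xFFFFFFF7u);          /* a <= b, as avg2 requires */
    assert(avg2_constrained(a, b) == 0xFFFFFFF7u);
    assert(avg3(a, b) == 0xFFFFFFF7u);
    return 0;
}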
The corresponding assembler sequences from clang on x86-64 and ARM64 are:
avg(unsigned int, unsigned int):
    mov eax, esi
    and eax, edi
    xor esi, edi
    shr esi
    add eax, esi
avg2(unsigned int, unsigned int):
    sub esi, edi
    shr esi
    lea eax, [rsi + rdi]
avg3(unsigned int, unsigned int):
    mov ecx, edi
    mov eax, esi
    add rax, rcx
    shr rax
vs.
avg(unsigned int, unsigned int):
    and w8, w1, w0
    eor w9, w1, w0
    add w0, w8, w9, lsr #1
    ret
avg2(unsigned int, unsigned int):
    sub w8, w1, w0
    add w0, w0, w8, lsr #1
    ret
avg3(unsigned int, unsigned int):
    mov w8, w1
    add x8, x8, w0, uxtw
    lsr x0, x8, #1
    ret
Out of these versions, avg2 performs as well on ARM64 as the optimal sequence using the carry flag would. It's likely that avg3 performs equally well too: the mov w8, w1 only clears the top 32 bits, which may be unnecessary when the compiler knows they were already cleared by whatever instruction produced the value.
A similar statement can be made about the Intel version of avg3, which in the optimal case would compile to just the two meaningful instructions:
add rax, rcx
shr rax
See https://godbolt.org/z/5TMd3zr81 for online comparison.
The "SWAR"/Montgomery version is typically only justified, when trying to compute multiple averages packed to a single (large) integer in which case the full formula contains masking with the bit positions of the highest bits: return (a & b) + (((a ^ b) >> 1) & ~kH;.
I have two doubles, a and b, which are both in [0,1]. I want the min/max of a and b without branching for performance reasons.
Given that a and b are both positive, and below 1, is there an efficient way of getting the min/max of the two? Ideally, I want no branching.
Yes, there is a way to calculate the maximum or minimum of two doubles without any branches. The C++ code to do so looks like this:
#include <algorithm>

double FindMinimum(double a, double b)
{
    return std::min(a, b);
}

double FindMaximum(double a, double b)
{
    return std::max(a, b);
}
I bet you've seen this before. In case you don't believe that this is branchless, check out the disassembly:
FindMinimum(double, double):
    minsd xmm1, xmm0
    movapd xmm0, xmm1
    ret
FindMaximum(double, double):
    maxsd xmm1, xmm0
    movapd xmm0, xmm1
    ret
That's what you get from all popular compilers targeting x86. The SSE2 instruction set is used, specifically the minsd/maxsd instructions, which branchlessly evaluate the minimum/maximum value of two double-precision floating-point values.
All 64-bit x86 processors support SSE2; it is required by the AMD64 extensions. Even most x86 processors without 64-bit support have SSE2, which was released in 2000. You'd have to go back a long way to find a processor that didn't support SSE2. But what if you did? Well, even there, you get branchless code on most popular compilers:
FindMinimum(double, double):
    fld QWORD PTR [esp + 12]
    fld QWORD PTR [esp + 4]
    fucomi st(1)
    fcmovnbe st(0), st(1)
    fstp st(1)
    ret
FindMaximum(double, double):
    fld QWORD PTR [esp + 4]
    fld QWORD PTR [esp + 12]
    fucomi st(1)
    fxch st(1)
    fcmovnbe st(0), st(1)
    fstp st(1)
    ret
The fucomi instruction performs a comparison, setting flags, and then the fcmovnbe instruction performs a conditional move, based on the value of those flags. This is all completely branchless, and relies on instructions introduced to the x86 ISA with the Pentium Pro back in 1995, supported on all x86 chips since the Pentium II.
The only compiler that won't generate branchless code here is MSVC, because it doesn't take advantage of the FCMOVxx instruction. Instead, you get:
double FindMinimum(double, double) PROC
    fld QWORD PTR [a]
    fld QWORD PTR [b]
    fcom st(1)     ; compare "b" to "a"
    fnstsw ax      ; transfer FPU status word to AX register
    test ah, 5     ; check C0 and C2 flags
    jp Alt
    fstp st(1)     ; return "b"
    ret
Alt:
    fstp st(0)     ; return "a"
    ret
double FindMinimum(double, double) ENDP

double FindMaximum(double, double) PROC
    fld QWORD PTR [b]
    fld QWORD PTR [a]
    fcom st(1)     ; compare "b" to "a"
    fnstsw ax      ; transfer FPU status word to AX register
    test ah, 5     ; check C0 and C2 flags
    jp Alt
    fstp st(0)     ; return "b"
    ret
Alt:
    fstp st(1)     ; return "a"
    ret
double FindMaximum(double, double) ENDP
Notice the branching JP instruction (jump if parity bit set). The FCOM instruction is used to do the comparison, which is part of the base x87 FPU instruction set. Unfortunately, this sets flags in the FPU status word, so in order to branch on those flags, they need to be extracted. That's the purpose of the FNSTSW instruction, which stores the x87 FPU status word to the general-purpose AX register (it could also store to memory, but…why?). The code then TESTs the appropriate bits and branches accordingly to ensure that the correct value is returned. In addition to the branch, retrieving the FPU status word is also relatively slow. This is why the Pentium Pro introduced the FCOMI instructions.
However, it is unlikely that you would be able to improve upon the speed of any of this code by using bit-twiddling operations to determine min/max. There are two basic reasons:
The only compiler generating inefficient code is MSVC, and there's no good way to force it to generate the instructions you want it to. Although inline assembly is supported in MSVC for 32-bit x86 targets, it is a fool's errand when seeking performance improvements. I'll also quote myself:
Inline assembly disrupts the optimizer in rather significant ways, so unless you're writing significant swaths of code in inline assembly, there is unlikely to be a substantial net performance gain. Furthermore, Microsoft's inline assembly syntax is extremely limited. It trades flexibility for simplicity in a big way. In particular, there is no way to specify input values, so you're stuck loading the input from memory into a register, and the caller is forced to spill the input from a register to memory in preparation. This creates a phenomenon I like to call "a whole lotta shufflin' goin' on", or for short, "slow code". You don't drop to inline assembly in cases where slow code is acceptable. Thus, it is always preferable (at least on MSVC) to figure out how to write C/C++ source code that persuades the compiler to emit the object code you want. Even if you can only get close to the ideal output, that's still considerably better than the penalty you pay for using inline assembly.
In order to get access to the raw bits of a floating-point value, you'd have to do a domain transition, from floating-point to integer, and then back to floating-point. That's slow, especially without SSE2, because the only way to get a value from the x87 FPU to the general-purpose integer registers in the ALU is indirectly via memory.
If you wanted to pursue this strategy anyway—say, to benchmark it—you could take advantage of the fact that floating-point values are lexicographically ordered in terms of their IEEE 754 representations, except for the sign bit. So, since you are assuming that both values are positive:
FindMinimumOfTwoPositiveDoubles(double a, double b):
    mov rax, QWORD PTR [a]
    mov rdx, QWORD PTR [b]
    sub rax, rdx    ; subtract the bitwise representations of the two values
    shr rax, 63     ; isolate the sign bit to see if the result was negative
    ret
FindMaximumOfTwoPositiveDoubles(double a, double b):
    mov rax, QWORD PTR [b]  ; \ reverse order of parameters
    mov rdx, QWORD PTR [a]  ; / for the SUB operation
    sub rax, rdx
    shr rax, 63
    ret
Or, to avoid inline assembly:
#include <cstdint>
#include <climits>

bool FindMinimumOfTwoPositiveDoubles(double a, double b)
{
    static_assert(sizeof(a) == sizeof(uint64_t),
                  "A double must be the same size as a uint64_t for this bit manipulation to work.");
    const uint64_t aBits = *(reinterpret_cast<uint64_t*>(&a));
    const uint64_t bBits = *(reinterpret_cast<uint64_t*>(&b));
    return ((aBits - bBits) >> ((sizeof(uint64_t) * CHAR_BIT) - 1));
}

bool FindMaximumOfTwoPositiveDoubles(double a, double b)
{
    static_assert(sizeof(a) == sizeof(uint64_t),
                  "A double must be the same size as a uint64_t for this bit manipulation to work.");
    const uint64_t aBits = *(reinterpret_cast<uint64_t*>(&a));
    const uint64_t bBits = *(reinterpret_cast<uint64_t*>(&b));
    return ((bBits - aBits) >> ((sizeof(uint64_t) * CHAR_BIT) - 1));
}
Note that there are severe caveats to this implementation. In particular, it will break if the two floating-point values have different signs, or if both values are negative. If both values are negative, then the code can be modified to flip their signs, do the comparison, and then return the opposite value. To handle the case where the two values have different signs, code can be added to check the sign bit.
// ...
// Enforce two's-complement lexicographic ordering. This assumes the bit
// patterns were loaded into signed int64_t variables, so the sign tests
// below are meaningful (note that a plain (1 << 63) would overflow an int).
if (aBits < 0)
{
    aBits = INT64_MIN - aBits;
}
if (bBits < 0)
{
    bBits = INT64_MIN - bBits;
}
// ...
Dealing with negative zero will also be a problem. IEEE 754 says that +0.0 is equal to −0.0, so your comparison function will have to decide if it wants to treat these values as different, or add special code to the comparison routines that ensures negative and positive zero are treated as equivalent.
Adding all of this special-case code will certainly reduce performance to the point that we will break even with a naïve floating-point comparison, and will very likely end up being slower.
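One side note on the type punning above: reading a double through a uint64_t* violates strict aliasing. If you do experiment with this approach, a well-defined way to get the bits is memcpy (or std::bit_cast since C++20); a minimal sketch, with the helper name being mine:
#include <cstdint>
#include <cstring>

// Well-defined replacement for *reinterpret_cast<uint64_t*>(&d).
inline uint64_t BitsOf(double d)
{
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);  // compilers optimize this to a single move
    return bits;
}

bool FindMinimumOfTwoPositiveDoubles(double a, double b)
{
    return (BitsOf(a) - BitsOf(b)) >> 63;  // same trick, no aliasing violation
}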
I have a binary flag f, equal to either zero or one.
If equal to one, I would like to convert to 0xFF, otherwise, to 0.
Current solution is f*0xFF, but I would rather use bit twiddling to achieve this.
How about just:
(unsigned char)-f
or alternately:
0xFF & -f
If f is already a char, then you just need -f.
This approach works because -0 == 0 and -1 == 0xFFFFF..., so the negation gets you what you want directly, perhaps with some extra high bits set if f is wider than a char (you didn't say).
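As a quick demonstration (the function name is mine, not from the question):
#include <cstdio>

// Expand a 0/1 flag to 0x00/0xFF by negation.
static unsigned char flag_to_mask(int f) {
    return static_cast<unsigned char>(-f);  // -0 == 0x00, -1 truncates to 0xFF
}

int main() {
    std::printf("%02X %02X\n", flag_to_mask(0), flag_to_mask(1));  // prints "00 FF"
}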
Remember, though, that compilers are smart. I tried all of the following solutions, and all compiled down to 3 instructions or fewer, and none had a branch (even the solution with a conditional):
Conditional
int remap_cond(int f) {
    return f ? 0xFF : 0;
}
Compiles to:
remap_cond:
    test edi, edi
    mov eax, 255
    cmove eax, edi
    ret
So even the "obvious" conditional works well, in three instructions and a latency of 2 or 3 cycles on most modern x86 hardware, depending on cmov performance.
Multiplication
Your original solution of:
int remap_mul(int f) {
    return f * 0xFF;
}
Actually compiles into nice code that avoids the multiplication entirely, replacing it with a shift and subtract:
remap_mul:
    mov eax, edi
    sal eax, 8
    sub eax, edi
    ret
This will generally take two cycles on machines with mov-elimination, and the mov would often be removed by inlining anyway.
Subtraction
As corn3lius pointed out, you can do some subtraction from 0x100 and a mask, like so:
int remap_shift_sub(int f) {
    return 0xFF & (0x100 - f);
}
This compiles to1:
remap_shift_sub:
    neg edi
    movzx eax, dil
    ret
So that's the best so far I think - a latency of 2 cycles on most hosts, and the movzx can often be eliminated by inlining2 - e.g., since it could use the 8-bit register in a subsequent consuming instruction.
Note that the compiler has smartly eliminated both the masking operation (you could perhaps argue the movzx accounts for it), and the use of the 0x100 constant, because it understands that a simple negation does the same thing here (in particular, all the bits that differ between -f and 0x100 - f are masked away by the 0xFF & ... operation).
That leads directly to the following C code:
int remap_neg_mask(int f) {
    return -f;
}
which compiles down to the exact same thing.
You can play with all of this on godbolt.
1 Except on clang, which inserts an extra mov to get the result in eax rather than generating it there in the first place.
2 Note that by "inlining" I mean both real inlining the compiler does if you actually write this as a function, but also what happens if you just do the remapping operation directly at the place you need it without a function.
value = 0xFF & ((1 << 8) - f);
If f is one, subtract it from 0x100, giving you 0xFF; otherwise subtract 0, and the bitmask with 0xFF gives 0.
Too obvious?
value = ( f == 1 ) ? 0xFF : 0;
In decompiled code generated by IDA I see expressions like:
malloc(20 * c | -(20 * (unsigned __int64)(unsigned int)c >> 32 != 0))
malloc(6 * n | -(3 * (unsigned __int64)(unsigned int)(2 * n) >> 32 != 0))
Can someone explain the purpose of these calculations?
c and n are int (signed integer) values.
Update.
Original C++ code was compiled with MSVC for 32-bit platform.
Here's assembly code for second line of decompiled C-code above (malloc(6 * ..)):
mov ecx, [ebp+pThis]
mov [ecx+4], eax
mov eax, [ebp+pThis]
mov eax, [eax]
shl eax, 1
xor ecx, ecx
mov edx, 3
mul edx
seto cl
neg ecx
or ecx, eax
mov esi, esp
push ecx ; Size
call dword ptr ds:__imp__malloc
I'm guessing that the original source code used the C++ new operator to allocate an array and was compiled with Visual C++. As user3528438's answer indicates, this code is meant to prevent overflows. Specifically, it's a 32-bit unsigned saturating multiply: if the result of the multiplication would be greater than 4,294,967,295, the maximum value of a 32-bit unsigned number, the result is clamped or "saturated" to that maximum.
Since Visual Studio 2005, Microsoft's C++ compiler has generated code to protect against overflows. For example, I can generate assembly code that could be decompiled into your examples by compiling the following with Visual C++:
#include <stdlib.h>

void *
operator new[](size_t n) {
    return malloc(n);
}

struct S {
    char a[20];
};

struct T {
    char a[6];
};

void
foo(int n, S **s, T **t) {
    *s = new S[n];
    *t = new T[n * 2];
}
Which, with Visual Studio 2015's compiler generates the following assembly code:
mov esi, DWORD PTR _n$[esp]
xor ecx, ecx
mov eax, esi
mov edx, 20 ; 00000014H
mul edx
seto cl
neg ecx
or ecx, eax
push ecx
call _malloc
mov ecx, DWORD PTR _s$[esp+4]
; Line 19
mov edx, 6
mov DWORD PTR [ecx], eax
xor ecx, ecx
lea eax, DWORD PTR [esi+esi]
mul edx
seto cl
neg ecx
or ecx, eax
push ecx
call _malloc
Most of the decompiled expression is actually meant to handle just one assembly statement. The assembly instruction seto cl sets CL to 1 if the previous MUL instruction overflows, otherwise it sets CL to 0. Similarly the expression 20 * (unsigned __int64)(unsigned int)c >> 32 != 0 evaluates to 1 if the result of 20 * c overflows, and evaluates to 0 otherwise.
If this overflow protection wasn't there and the result of 20 * c did actually overflow then the call to malloc would probably succeed, but allocate much less memory than the program intended. The program would then likely write past the end of the memory actually allocated and trash other bits of memory. This would amount to a buffer overrun, one that could be potentially exploited by hackers.
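In C, the check the compiler emits is roughly equivalent to this sketch (the function name and the wrapper are mine; the compiler inlines the equivalent directly at the new[] call site):
#include <stdint.h>
#include <stdlib.h>

/* Widen to 64 bits, multiply, and saturate the 32-bit size to 0xFFFFFFFF
   on overflow, so the subsequent malloc is guaranteed to fail. */
static void *alloc_array32(uint32_t count, uint32_t elem_size)
{
    uint64_t total = (uint64_t)count * elem_size;
    uint32_t size = (uint32_t)total | -(uint32_t)(total >> 32 != 0);
    return malloc(size);
}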
Since this code is decompiled from ASM, we can only guess what it actually does.
Let's first format it to figure out the precedence:
malloc(20 * c | -(20 * (unsigned __int64)(unsigned int)c >> 32 != 0))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
//this is first evaluated, promoting c to
//64 bit unsigned int without doing sign
//extension, regardless the type of c
malloc(20 * c | -(20 * (uint64_t)c >> 32 != 0))
^^^^^^^^^^^^^^^^
//then, multiply by 20, with uint64 result
malloc(20 * c | -(20 * (uint64_t)c >> 32 != 0))
^^^^^^^^^^^^^^^^^^^^^^^^^^^
//if 20c is greater than 2^32-1, then result is true,
//use -1 to generate a mask of 0xffffffff,
//bitwise operator | then masks 20c to 0xffffffff
//(2^32-1, the maximum of size_t, input type to malloc)
//regardless what 20c actually is
//if 20c is smaller than 2^32-1, then result is false,
//the mask is 0, bitwise operator | keeps the final
//input to malloc as 20c untouched
What are 20 and 6?
Those probably come from the common usage of malloc(sizeof(Something) * count). Those two calls to malloc were probably made with sizeof(Something) and sizeof(SomethingElse) evaluating to 20 and 6 at compile time.
So what this code actually does:
My guess is that it's trying to prevent sizeof(Something)*count from overflowing, which would make the malloc succeed with a too-small size and cause a buffer overflow when the memory is used.
By evaluating the product as a 64-bit unsigned int and testing against 2^32-1, when the size is greater than 2^32-1 the input to malloc is set to a very large value that is guaranteed to fail (no 32-bit system can allocate 2^32-1 bytes of memory).
Can someone explain the purpose of these calculations?
It is important to understand that compiling changes the semantic meaning of code. Much unspecified behavior of the original code becomes specified by the compilation process.
IDA has no idea whether things the generated assembly code just happens to do are important or not. To be safe, it tries to perfectly replicate the behavior of the assembly code, even in cases that cannot possibly happen given the way the code is used.
Here, IDA is probably replicating the overflow characteristics that the conversion of types just happens to have on this platform. It can't just replicate the original C code because the original C code likely had unspecified behavior for some values of c or n, likely negative ones.
For example, say I write this C code: int f(unsigned j) { return j; }. My compiler will likely turn that into very simple assembly code, giving whatever behavior for values of j that don't fit in an int my platform just happens to give.
But if you decompile the generated assembly, you cannot decompile it to int f(unsigned j) { return j; }, because that will not behave the same as my assembly code did on platforms with different conversion behavior. It could compile to code (on other platforms) that returns different values than my assembly code does for values of j that don't fit in an int.
So it is often literally impossible (in fact, incorrect) to decompile C code into the original code, it will often have these kinds of "portably replicate this platform's behavior" oddities.
it's rounding up to the nearest block size.
Forgive me. What it's actually doing is calculating a multiple of c while simultaneously checking for overflow (a negative value):
#include <iostream>
#include <cstdint>
#include <cstddef>

size_t foo(char c)
{
    return 20 * c | -(20 * (std::uint64_t)(unsigned int)c >> 32 != 0);
}

int main()
{
    using namespace std;
    for (char i = -4; i < 4; ++i)
    {
        cout << "input is: " << int(i) << ", result is " << foo(i) << endl;
    }
    return 0;
}
results:
input is: -4, result is 18446744073709551615
input is: -3, result is 18446744073709551615
input is: -2, result is 18446744073709551615
input is: -1, result is 18446744073709551615
input is: 0, result is 0
input is: 1, result is 20
input is: 2, result is 40
input is: 3, result is 60
To me the number 18446744073709551615 doesn't mean much at a glance. Only after seeing it expressed in hex did I go "ah". – Jongware
adding << hex:
input is: -1, result is ffffffffffffffff