C++: How can this out-of-range access inside a struct go wrong?

#include <iostream>
#include <random>
using namespace std;

struct TradeMsg {
    int64_t timestamp;        // 0->7
    char exchange;            // 8
    char symbol[17];          // 9->25
    char sale_condition[4];   // 26->29
    char source_of_trade;     // 30
    uint8_t trade_correction; // 31
    int64_t trade_volume;     // 32->39
    int64_t trade_price;      // 40->47
};
static_assert(sizeof(TradeMsg) == 48);

char buffer[1000000];

template<class T, size_t N=1>
int someFunc(char* buffer, T* output, int& cursor) {
    // Read + process data from buffer. Return data in output.
    // Set cursor to the last byte read + 1.
    return cursor + (rand() % 20) + 1; // dummy code
}

void parseData(TradeMsg* msg) {
    int cursor = 0;
    cursor = someFunc<int64_t>(buffer, &msg->timestamp, cursor);
    cursor = someFunc<char>(buffer, &msg->exchange, cursor);
    cursor++;
    int i = 0;
    // i is GUARANTEED to be <= 17 after this loop;
    // edit: the input data in buffer[] guarantee that fact.
    while (buffer[cursor + i] != ',') {
        msg->symbol[i] = buffer[cursor + i];
        i++;
    }
    msg->symbol[i] = '\n'; // might access symbol[17].
    cursor = cursor + i + 1;
    for (i = 0; i < 4; i++) msg->sale_condition[i] = buffer[cursor + i];
    cursor += 5;
    //cursor = someFunc...
}

int main()
{
    TradeMsg a;
    a.symbol[17] = '\0';
    return 0;
}
I have this struct that is guaranteed to have a predictable size. In the code, there is a case where the program tries to assign a value to an array element past its size: msg->symbol[17] = ....
However, in that case, the assignment does no harm as long as:
It is done before the next struct member (sale_condition) is assigned (no unexpected code reordering).
It does not modify any previous members (timestamp, exchange).
It does not access any memory outside the struct.
I read that this is undefined behavior. But what kind of compiler optimization/code generation can make this go wrong? symbol[17] is pretty deep inside the middle of the struct, so I don't see how the compiler could generate an access outside it. Assume the platform is x86-64 only.

Various folks have pointed out debug-mode checks that will fire on access outside the bounds of an array member of a struct, with options like gcc -fsanitize=undefined. Separate from that, it's also legal for a compiler to use the assumption of non-overlap between member accesses to reorder two assignments which actually do alias:
@Peter in comments points out that the compiler is allowed to assume that accesses to msg->symbol[i] don't affect other struct members, and could potentially delay msg->symbol[i] = '\n'; until after the loop that writes msg->sale_condition[i] (i.e. sink that store to the bottom of the function).
There isn't a good reason you'd expect a compiler to want to do that in this function alone, but perhaps after inlining into some caller that also stored something there, it could be relevant. Or just because it's a DeathStation 9000 that exists in this thought experiment to break your code.
You could write this safely, although GCC compiles that worse
Since char* is allowed to alias any other object, you can offset a char* relative to the start of the whole struct, rather than to the start of the member array. Use offsetof to find the right start point, like this:
#include <cstddef>
...
((char*)msg + offsetof(TradeMsg, symbol))[i] = '\n'; // might access symbol[17].
That's exactly equivalent to *((char*)msg + offsetof(...) + i) = '\n'; by definition of C++'s [] operator, even though it lets you use [i] to index relative to the same position.
However, that does compile to less efficient asm with GCC11.2 -O2. (Godbolt), mostly because int i, cursor are narrower than pointer-width. The "safe" version that redoes indexing from the start of the struct does more indexing work in asm, not using the msg+offsetof(symbol) pointer that it was already using as the base register in the loop.
# original version, with UB if `i` goes past the buffer.
# gcc11.2 -O2 -march=haswell.  -O3 fully unrolls into a chain of copy/branch
    ... partially peeled first iteration
.L3:                                          # do{
    mov     BYTE PTR [rbx+8+rax], dl          #   store into msg->symbol[i]
    movsx   rdi, eax                          #   not read inside the loop
    lea     ecx, [r8+rax]
    inc     rax
    movzx   edx, BYTE PTR buffer[rsi+1+rax]   #   load from buffer
    cmp     dl, 44
    jne     .L3                               # }while(buffer[cursor+i] != ',')
## End of copy-and-search loop.
# Loops are identical up to this point except for MOVSX here vs. MOV in the no-UB version.
    movsx   rcx, ecx      # just redo sign extension of this calculation that was done repeatedly inside the loop just for this, apparently.
.L2:
    mov     BYTE PTR [rbx+9+rdi], 10          # store a newline
    mov     eax, 1                            # set up for next loop

# offsetof version, without UB
# same loop, but with RDI and RSI usage switched.
# And with mov esi, eax zero extension instead of movsx rdi, eax sign extension
    cmp     dl, 44
    jne     .L3                               # }while(buffer[cursor+i] != ',')
    add     esi, 9        # offsetof(TradeMsg, symbol)
    movsx   rcx, ecx      # more stuff getting sign extended.
    movsx   rsi, esi      # including something used in the newline store
.L2:
    mov     BYTE PTR [rbx+rsi], 10
    mov     eax, 1        # set up for next loop
The RCX calculation seems to just be for use by the next loop, setting sale_conditions.
BTW, the copy-and-search loop is like strcpy but with a ',' terminator. Unfortunately gcc/clang don't know how to optimize that; they compile to a slow byte-at-a-time loop, not e.g. an AVX512BW masked store using mask-1 from a vec == set1_epi8(',') compare, to get a mask selecting the bytes-before-',' instead of the comma element. (Probably needs a bithack to isolate that lowest-set-bit as the only set bit, though, unless it's safe to always copy 16 or 17 bytes separate from finding the ',' position, which could be done efficiently without masked stores or branching.)
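For the curious, here is a rough intrinsics sketch of that masked-store idea (untested; it assumes AVX512BW+VL, that a ',' is present within the first 16 bytes, and the helper name is made up for illustration):
#include <immintrin.h>

// Copy the bytes before the first ',' in src[0..15] into dst with one
// masked store; returns the index of the ',' (= number of bytes copied).
static inline unsigned copy_until_comma_avx512(char* dst, const char* src)
{
    __m128i v = _mm_loadu_si128((const __m128i*)src);
    __mmask16 comma = _mm_cmpeq_epi8_mask(v, _mm_set1_epi8(','));
    // Bithack: bits strictly below the lowest set bit of `comma`
    // select exactly the bytes before the ',' element.
    __mmask16 before = (__mmask16)((comma - 1) & ~comma);
    _mm_mask_storeu_epi8(dst, before, v);
    return _tzcnt_u32(comma);   // position of the ','
}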
Another option might be a union between a char[21] and struct{ char sym[17], sale[4];}, if you use a C++ implementation that allows C99-style union type-punning. (It's a GNU extension, and also supported by MSVC, but not necessarily literally every x86 compiler.)
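For illustration, such a union might look like this (a sketch relying on that non-standard type-punning guarantee; the names are made up):
union SymbolAndSale {
    char flat[21];   // symbol + sale_condition viewed as one array
    struct { char sym[17]; char sale[4]; } parts;
};
Writing flat[17] is then in-bounds for the flat view, and under the GNU/MSVC rules a later read of parts.sale[0] sees that same byte.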
Also, style-wise, shadowing int i = 0; with for( int i=0 ; i<4 ; i++ ) is poor style. Pick a different var name for that loop, like j. (Or if there is anything meaningful, a better name for i which has to survive across multiple loops.)

In a few cases:
When the access is instrumented with a guard, e.g. UndefinedBehaviorSanitizer: https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
In a C++ interpreter (yes, they exist): https://root.cern/cling/

Your symbol array has a size of 17, yet you are trying to assign a value to the 18th element with a.symbol[17] = '\0';.
Remember, index values start at 0, not 1.
So you have two places that can go wrong: i can equal 17, which will cause an error, and that last line I showed above will cause an error.


C++ std::countr_zero() in SIMD 128/256/512 (find position of least significant 1 bit in 128/256/512-bit number) [duplicate]

I have a huge memory block (bit-vector) with size N bits within one memory page; consider N on average to be 5000, i.e. 5k bits to store some flags information.
At certain points in time (super-frequent, critical) I need to find the first set bit in this whole big bit-vector. Now I do it per-64-bit word, i.e. with the help of __builtin_ctzll(). But when N grows and the search algorithm can't be improved, there may be some possibility to scale this search through widening the memory accesses. This is the main problem, in a few words.
There is a single assembly instruction called BSF that gives the position of the lowest set bit (GCC's __builtin_ctzll()).
So on x86-64 I can find the lowest set bit cheaply in 64-bit words.
But what about scaling through memory width?
E.g. is there a way to do it efficiently with 128/256/512-bit registers?
Basically I'm interested in some C API function to achieve this, but I also want to know what the method is based on.
UPD: As for the CPU, I'm interested in this optimization supporting the following CPU lineups:
Intel Xeon E3-12XX, Intel Xeon E5-22XX/26XX/E56XX, Intel Core i3-5XX/4XXX/8XXX, Intel Core i5-7XX, Intel Celeron G18XX/G49XX (optional for Intel Atom N2600, Intel Celeron N2807, Cortex-A53/72)
P.S. In the mentioned algorithm, before the final bit scan I need to AND k (on average 20-40) N-bit vectors (the AND result is just a preparatory stage for the bit scan). This is also desirable to do with memory-width scaling (i.e. more efficiently than per-64-bit-word AND).
Read also: Find first set
This answer is in a different vein, but if you know in advance that you're going to be maintaining a collection of B bits and need to be able to efficiently set and clear bits while also figuring out which bit is the first bit set, you may want to use a data structure like a van Emde Boas tree or a y-fast trie. These data structures are designed to store integers in a small range, so instead of setting or clearing individual bits, you could add or remove the index of the bit you want to set/clear. They're quite fast - you can add or remove items in time O(log log B), and they let you find the smallest item in time O(1). Figure that if B ≈ 50000, then log log B is about 4.
I'm aware this doesn't directly address how to find the first set bit in a huge bitvector. If your setup is such that you have to work with bitvectors, the other answers may be more helpful. But if you have the option to reframe the problem in a way that doesn't involve bitvector searching, these other data structures might be a better fit.
The best way to find the first set bit within a whole vector (AFAIK) involves finding the first non-zero SIMD element (e.g. a byte or dword), then using a bit-scan on that (__builtin_ctz / bsf / tzcnt / ffs-1). As such, ctz(vector) is not itself a useful building block for searching an array, only for after the loop.
Instead you want to loop over the array searching for a non-zero vector, using a whole-vector check involving SSE4.1 ptest xmm0,xmm0 / jz .loop (3 uops), or with SSE2 pcmpeqd v, zero / pmovmskb / cmp eax, 0xffff / je .loop (3 uops after cmp/jcc macro-fusion). https://uops.info/
Once you do find a non-zero vector, pcmpeqd / movmskps / bsf on that to find a dword index, then load that dword and bsf it. Add the start-bit position (CHAR_BIT*4*dword_idx) to the bsf bit-position within that element. This is a fairly long dependency chain for latency, including an integer L1d load latency. But since you just loaded the vector, at least you can be fairly confident you'll hit in cache when you load it again with integer. (If the vector was generated on the fly, then it's probably still best to store / reload it and let store-forwarding work, instead of trying to generate a shuffle control for vpermilps/movd or SSSE3 pshufb/movd/movzx ecx, al.)
The loop problem is very much like strlen or memchr, except we're rejecting a single value (0) and looking for anything else. Still, we can take inspiration from hand-optimized asm strlen / memchr implementations like glibc's, for example loading multiple vectors and doing one check to see if any of them have what they're looking for. (For strlen, combine with pminub to get a 0 if any element is 0. For pcmpeqb compare results, OR for memchr). For our purposes, the reduction operation we want is OR - any non-zero input will make the output non-zero, and bitwise boolean ops can run on any vector ALU port.
(If the expected first-bit-position isn't very high, it's not worth being too aggressive with this: if the first set bit is in the first vector, sorting things out between 2 vectors you've loaded will be slower. 5000 bits is only 625 bytes, or 19.5 AVX2 __m256i vectors. And the first set bit is probably not always right at the end)
AVX2 version:
This checks pairs of 32-byte vectors (i.e. whole cache lines) for non-zero, and if found then sorts that out into one 64-bit bitmap for a single CTZ operation. That extra shift/OR costs latency in the critical path, but the hope is that we get to the first 1 bit sooner.
Combining 2 vectors down to one with OR means it's not super useful to know which element of the OR result was non-zero. We basically redo the work inside the if. That's the price we pay for keeping the amount of uops low for the actual search part.
(The if body ends with a return, so in the asm it's actually like an if() break, or really an if() goto out of the loop, since it goes to a different place than the not-found return -1 from falling out of the loop.)
// untested, especially the pointer end condition, but compiles to asm that looks good
// Assumes len is a multiple of 64 bytes
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// aliasing-safe: p can point to any C data type
int bitscan_avx2(const char *p, size_t len /* in bytes */)
{
    //assert(len % 64 == 0);
    // optimal if p is 64-byte aligned, so we're checking single cache lines
    const char *p_init = p;
    const char *endp = p + len - 64;
    do {
        __m256i v1 = _mm256_loadu_si256((const __m256i*)p);
        __m256i v2 = _mm256_loadu_si256((const __m256i*)(p+32));
        __m256i vor = _mm256_or_si256(v1, v2);  // `or` is a reserved token in C++, so pick another name
        if (!_mm256_testz_si256(vor, vor)) {    // find the first non-zero cache line
            __m256i v1z = _mm256_cmpeq_epi32(v1, _mm256_setzero_si256());
            __m256i v2z = _mm256_cmpeq_epi32(v2, _mm256_setzero_si256());
            uint32_t zero_map = _mm256_movemask_ps(_mm256_castsi256_ps(v1z));
            zero_map |= _mm256_movemask_ps(_mm256_castsi256_ps(v2z)) << 8;
            unsigned idx = __builtin_ctz(~zero_map);  // Use ctzll for GCC, because GCC is dumb and won't optimize away a movsx
            uint32_t nonzero_chunk;
            memcpy(&nonzero_chunk, p+4*idx, sizeof(nonzero_chunk));  // aliasing / alignment-safe load
            return (p-p_init + 4*idx)*8 + __builtin_ctz(nonzero_chunk);
        }
        p += 64;
    } while(p < endp);
    return -1;
}
On Godbolt with clang 12 -O3 -march=haswell:
bitscan_avx2:
        lea     rax, [rdi + rsi]
        add     rax, -64                # endp
        xor     ecx, ecx
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        vmovdqu ymm1, ymmword ptr [rdi] # do {
        vmovdqu ymm0, ymmword ptr [rdi + 32]
        vpor    ymm2, ymm0, ymm1
        vptest  ymm2, ymm2
        jne     .LBB0_2                 # if() goto out of the inner loop
        add     ecx, 512                # bit-counter incremented in the loop, for (p-p_init) * 8
        add     rdi, 64
        cmp     rdi, rax
        jb      .LBB0_1                 # }while(p<endp)
        mov     eax, -1                 # not-found return path
        vzeroupper
        ret
.LBB0_2:
        vpxor   xmm2, xmm2, xmm2
        vpcmpeqd ymm1, ymm1, ymm2
        vmovmskps eax, ymm1
        vpcmpeqd ymm0, ymm0, ymm2
        vmovmskps edx, ymm0
        shl     edx, 8
        or      edx, eax                # mov ah,dl would be interesting, but compilers won't do it.
        not     edx                     # one_positions = ~zero_positions
        xor     eax, eax                # break false dependency
        tzcnt   eax, edx                # dword_idx
        xor     edx, edx
        tzcnt   edx, dword ptr [rdi + 4*rax]  # p[dword_idx]
        shl     eax, 5                  # dword_idx * 4 * CHAR_BIT
        add     eax, edx
        add     eax, ecx
        vzeroupper
        ret
This is probably not optimal for all CPUs, e.g. maybe we could use a memory-source vpcmpeqd for at least one of the inputs, and not cost any extra front-end uops, only back-end. As long as compilers keep using pointer-increments, not indexed addressing modes that would un-laminate. That would reduce the amount of work needed after the branch (which probably mispredicts).
To still use vptest, you might have to take advantage of the CF result from the CF = (~dst & src == 0) operation against a vector of all-ones, so we could check that all elements matched (i.e. the input was all zeros). Unfortunately, Can PTEST be used to test if two registers are both zero or some other condition? - no, I don't think we can usefully use vptest without a vpor.
Clang decided not to actually subtract pointers after the loop, instead to do more work in the search loop. :/ The loop is 9 uops (after macro-fusion of cmp/jb), so unfortunately it can only run a bit less than 1 iteration per 2 cycles. So it's only managing less than half of L1d cache bandwidth.
But apparently a single array isn't your real problem.
Without AVX
16-byte vectors mean we don't have to deal with the "in-lane" behaviour of AVX2 shuffles. So instead of OR, we can combine with packssdw or packsswb. Any set bits in the high half of a pack input will signed-saturate the result to 0x80 or 0x7f. (So signed saturation is key, not unsigned packuswb which will saturate signed-negative inputs to 0.)
However, shuffles only run on port 5 on Intel CPUs, so beware of throughput limits. ptest on Skylake for example is 2 uops, p5 and p0, so using packsswb + ptest + jz would limit to one iteration per 2 clocks. But pcmpeqd + pmovmskb don't.
Unfortunately, using pcmpeq on each input separately before packing / combining would cost more uops. But would reduce the amount of work left for the cleanup, and if the loop-exit usually involves a branch mispredict, that might reduce overall latency.
2x pcmpeqd => packssdw => pmovmskb => not => bsf would give you a number you have to multiply by 2 to use as a byte offset to get to the non-zero dword. e.g. memcpy(&tmp_u32, p + (2*idx), sizeof(tmp_u32));. i.e. bsf eax, [rdi + rdx*2].
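To make that concrete, here is a hypothetical cleanup helper in intrinsics (untested sketch; v1/v2 are the two 16-byte vectors from the non-zero 32-byte chunk at p, and p_init is the start of the array as in the AVX2 version):
#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

static inline int first_set_bit_sse2_cleanup(const char *p, const char *p_init,
                                             __m128i v1, __m128i v2)
{
    __m128i z  = _mm_setzero_si128();
    __m128i c1 = _mm_cmpeq_epi32(v1, z);        // 0xFFFFFFFF for each zero dword
    __m128i c2 = _mm_cmpeq_epi32(v2, z);
    __m128i packed = _mm_packs_epi32(c1, c2);   // signed-saturating pack: dword masks -> word masks
    unsigned zero_mask = (unsigned)_mm_movemask_epi8(packed); // 2 bits per source dword
    unsigned bit = __builtin_ctz(~zero_mask & 0xFFFF);        // 2 * index of first non-zero dword
    uint32_t chunk;
    memcpy(&chunk, p + 2*bit, sizeof(chunk));   // 2*bit = byte offset of that dword
    return (int)((p - p_init + 2*bit)*8 + __builtin_ctz(chunk));
}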
With AVX-512:
You mentioned 512-bit vectors, but none of the CPUs you listed support AVX-512. Even if they did, you might want to avoid 512-bit vectors because SIMD instructions at that width lower the CPU frequency, unless your program spends a lot of time doing this and your data is hot in L1d cache so you can truly benefit instead of still bottlenecking on L2 cache bandwidth. But even with 256-bit vectors, AVX-512 has new instructions that are useful for this:
integer compares (vpcmpb/w/d/q) have a choice of predicate, so you can do not-equal instead of having to invert later with NOT. Or even test-into-register vptestmd so you don't need a zeroed vector to compare against.
compare-into-mask is sort of like pcmpeq + movmsk, except the result is in a k register, still need a kmovq rax, k0 before you can tzcnt.
kortest - set FLAGS according to the OR of two mask registers being non-zero. So the search loop could do vpcmpd k0, ymm0, [rdi] / vpcmpd k1, ymm0, [rdi+32] / kortestw k0, k1
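In intrinsics form, that search step might look roughly like this (untested sketch; assumes AVX512F+VL at 256-bit width; whether the compiler actually emits kortest for the mask OR is up to it):
#include <immintrin.h>

// Test one 64-byte block: returns non-zero if any dword in it is non-zero.
static inline int block_nonzero_avx512vl(const char *p)
{
    __m256i v1 = _mm256_loadu_si256((const __m256i*)p);
    __m256i v2 = _mm256_loadu_si256((const __m256i*)(p + 32));
    __mmask8 k0 = _mm256_test_epi32_mask(v1, v1);  // 1 where dword non-zero
    __mmask8 k1 = _mm256_test_epi32_mask(v2, v2);
    return (k0 | k1) != 0;   // a kortest candidate for the compiler
}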
ANDing multiple input arrays
You mention your real problem is that you have up to 20 arrays of bits, and you want to intersect them with AND and find the first set bit in the intersection.
You may want to do this in blocks of a few vectors, optimistically hoping that there will be a set bit somewhere early.
AND groups of 4 or 8 inputs, accumulating across results with OR so you can tell if there were any 1s in this block of maybe 4 vectors from each input. (If there weren't any 1 bits, do another block of 4 vectors, 64 or 128 bytes, while you still have the pointers loaded, because the intersection would definitely be empty if you moved on to the other inputs now.) Tuning these chunk sizes depends on how sparse your 1s are, e.g. maybe always work in chunks of 6 or 8 vectors. Power-of-2 numbers are nice, though, because you can pad your allocations out to a multiple of 64 or 128 bytes so you don't have to worry about stopping early. (See the sketch below.)
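As a rough illustration of that grouping (a sketch under assumptions: AVX2, exactly 4 inputs, and a made-up helper name):
#include <immintrin.h>
#include <stddef.h>

// AND one 32-byte block from each of 4 inputs. The caller ORs successive
// results into an accumulator and tests it with _mm256_testz_si256 to
// decide whether this group of blocks can still contain a set bit.
static inline __m256i and4_block(const char *a, const char *b,
                                 const char *c, const char *d, size_t off)
{
    __m256i va = _mm256_loadu_si256((const __m256i*)(a + off));
    __m256i vb = _mm256_loadu_si256((const __m256i*)(b + off));
    __m256i vc = _mm256_loadu_si256((const __m256i*)(c + off));
    __m256i vd = _mm256_loadu_si256((const __m256i*)(d + off));
    return _mm256_and_si256(_mm256_and_si256(va, vb),
                            _mm256_and_si256(vc, vd));
}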
(For odd numbers of inputs, maybe pass the same pointer twice to a function expecting 4 inputs, instead of dispatching to special versions of the loop for every possible number.)
L1d cache is 8-way associative (12-way since Ice Lake), and a limited number of integer/pointer registers can make it a bad idea to try to read too many streams at once. You probably don't want a level of indirection that makes the compiler loop over an actual array of pointers in memory, either.
You may try this function; your compiler should optimize this code for your CPU. It's not perfect, but it should be relatively quick and mostly portable.
PS: length should be divisible by 8 for max speed.
#include <stdio.h>
#include <stdint.h>

/* Returns the index position of the most significant bit, starting with index 0. */
/* Return value is between 0 and 64 times length. */
/* When the return value is exactly 64 times length, no set bit was found, i.e. bf is 0. */
uint32_t offset_fsb(const uint64_t *bf, const register uint16_t length){
    register uint16_t i = 0;
    uint16_t remainder = length % 8;
    switch(remainder){
    case 0 : /* 512-bit compare */
        while(i < length){
            if(bf[i] | bf[i+1] | bf[i+2] | bf[i+3] | bf[i+4] | bf[i+5] | bf[i+6] | bf[i+7]) break;
            i += 8;
        }
        /* fall through */
    case 4 : /* 256-bit compare */
        while(i < length){
            if(bf[i] | bf[i+1] | bf[i+2] | bf[i+3]) break;
            i += 4;
        }
        /* fall through */
    case 6 : /* 128-bit compare */
        /* fall through */
    case 2 : /* 128-bit compare */
        while(i < length){
            if(bf[i] | bf[i+1]) break;
            i += 2;
        }
        /* fall through */
    default : /* 64-bit compare */
        while(i < length){
            if(bf[i]) break;
            i++;
        }
    }
    register uint32_t offset_fsb = i * 64;
    /* Scan within the stopping word, guarding against the not-found case (i == length). */
    if(i < length && bf[i]){
        register uint64_t s = bf[i];
        offset_fsb += 63;
        while(s >>= 1) offset_fsb--;
    }
    return offset_fsb;
}

int main(int argc, char *argv[]){
    uint64_t test[16] = {0};
    test[15] = 1;
    printf("offset_fsb = %u\n", offset_fsb(test, 16));
    return 0;
}

Translating C++ x86 Inline assembly code to C++

I've been struggling to convert this assembly code to C++ code.
It's a function from an old game that takes pixel data Stmp, and I believe it places it at the destination void* dest.
void Function(int x, int y, int yl, void* Stmp, void* dest)
{
    unsigned long size = 1280 * 2;
    unsigned long j = yl;
    void* Dtmp = (void*)((char*)dest + y * size + (x * 2));
    _asm
    {
        push es;
        push ds;
        pop es;
        mov edx, Dtmp;
        mov esi, Stmp;
        mov ebx, j;
        xor eax, eax;
        xor ecx, ecx;
    loop_1:
        or bx, bx;
        jz exit_1;
        mov edi, edx;
    loop_2:
        cmp word ptr [esi], 0xffff;
        jz exit_2;
        mov ax, [esi];
        add edi, eax;
        mov cx, [esi+2];
        add esi, 4;
        shr ecx, 2;
        jnc Next2;
        movsw;
    Next2:
        rep movsd;
        jmp loop_2;
    exit_2:
        add esi, 2;
        add edx, size;
        dec bx;
        jmp loop_1;
    exit_1:
        pop es;
    };
}
This is as far as I've gotten (not sure if it's even correct):
while (j > 0)
{
    if (*stmp != 0xffff)
    {
    }
    ++stmp;
    dtmp += size;
    --j;
}
Any help is greatly appreciated. Thank you.
It saves / restores ES around setting it equal to DS so rep movsd will use the same addresses for load and store. That instruction is basically memcpy(edi, esi, 4*ecx), but also incrementing the pointers in EDI and ESI (by 4 * ecx). https://www.felixcloutier.com/x86/movs:movsb:movsw:movsd:movsq
In a flat memory model, you can totally ignore that. This code looks like it might have been written to run in 16-bit unreal mode, or possibly even real mode, hence the use of 16-bit registers all over the place.
It looks like it's loading some kind of records that tell it how many bytes to copy, and reading until the end of the record, at which point it looks for the next record there. There's an outer loop around that, looping through records.
The records look like this I think:
struct sprite_line {
    uint16_t skip_dstbytes, src_bytes;
    uint16_t src_data[]; // flexible array member; actual size unlimited but assumed to be a multiple of 2.
};
The inner loop is this:
;; char *dstp;              // in EDI
;; struct sprite_line *p;   // in ESI
loop_2:
    cmp word ptr [esi], 0xffff  ; while( p->skip_dstbytes != (uint16_t)-1 ) {
    jz exit_2
    mov ax, [esi]               ; EAX was xor-zeroed earlier; some old CPUs maybe had slow movzx loads
    add edi, eax                ; dstp += p->skip_dstbytes;
    mov cx, [esi+2]             ; bytelen = p->src_bytes;
    add esi, 4                  ; advance p past the header to p->src_data
    shr ecx, 2                  ; length in dwords = bytelen >> 2
    jnc Next2
    movsw                       ; one 16-bit (word) copy if bytelen >> 1 is odd, i.e. if the last bit shifted out was a 1.
                                ; The first bit shifted out isn't checked, so the size is assumed to be a multiple of 2.
Next2:
    rep movsd                   ; copy in 4-byte chunks
Old CPUs (before IvyBridge) had rep movsd faster than rep movsb, otherwise this code could just have done that.
or bx,bx;
jz exit_1;
That's an obsolete idiom that comes from 8080: or bx,bx sets FLAGS the same way test bx,bx would, and the jz jumps if BX was zero. So it's a while( bx != 0 ) {} loop, with a dec bx in it. It's an inefficient way to write a while (--bx) loop; a compiler would put a dec/jnz .top_of_loop at the bottom, with a test once outside the loop in case it needs to run zero times. See: Why are loops always compiled into "do...while" style (tail jump)?
Some people would say that's what a while loop looks like in asm, if they're picturing a totally naive translation from C to asm.
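Putting those pieces together, one possible C++ translation of the whole routine might look like this (an untested sketch: it assumes the sprite_line record layout guessed above, a flat memory model, and even src_bytes; memcpy is used for the unaligned uint16_t reads):
#include <cstdint>
#include <cstring>

void Function(int x, int y, int yl, void* Stmp, void* dest)
{
    const unsigned long size = 1280 * 2;
    char* dstline = (char*)dest + y * size + x * 2;   // Dtmp
    const unsigned char* src = (const unsigned char*)Stmp;

    for (uint16_t j = (uint16_t)yl; j != 0; --j) {    // dec bx only sees the low 16 bits
        char* dstp = dstline;                         // mov edi, edx
        uint16_t skip;
        memcpy(&skip, src, 2);
        while (skip != 0xFFFF) {                      // 0xFFFF ends this line's records
            uint16_t bytes;
            memcpy(&bytes, src + 2, 2);
            src += 4;                                 // step past the record header
            dstp += skip;                             // skip_dstbytes
            unsigned n = bytes & ~1u;                 // movsw + rep movsd copy a multiple of 2
            memcpy(dstp, src, n);
            dstp += n;
            src += n;
            memcpy(&skip, src, 2);
        }
        src += 2;          // skip the 0xFFFF terminator
        dstline += size;   // next destination row
    }
}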

C++ inline assembly trying to copy a char from a std::string into a register

I have an assignment in C++ to read a file into a string variable which contains digits (no spaces), and using inline assembly, the program needs to sum the digits of the string. For this I want to loop until the end of the string (the NUL terminator), and every iteration copy one char (which is one digit) into a register so I can use compare and subtract on it. The problem is that every time, instead of copying the char into the register, it copies some random value.
I'm using Visual Studio for debugging. The variable y is the string, and every loop iteration I'm trying to copy the current char into register AL.
// read from txt file
string y;
cout << "\n" << "the text is \n";
ifstream infile;
infile.open("1.txt");
getline(infile, y);
cout << y;
infile.close();

// inline assembly
_asm
{
    mov edx, 0   // counter
    mov ebx, 0
    mov eax, 0
loop1:
    movzx AL, y[ebx]
    cmp AL, 0x00
    jz finished
    sub AL, 48   // convert ASCII to number, assuming digit
    add edx, eax // add digit to counter
    add ebx, 1   // move pointer to the next byte
    loop loop1
finished:
    mov i, edx
}
For example, assuming y is "123" and it's the first iteration of the loop, EBX is 0. I expect y[ebx] to point to the value 49 ('1'), and indeed in debug I see y[ebx]'s value is 49. I want to copy said value into a register, so when I use the instruction:
movzx AL, y[ebx]
I expect register AL to change to 49 ('1'), but the value changes to something random instead. For instance, in the last debug session it changed to 192 ('À').
y is the std::string object's control block. You want to access its C string data.
MSVC inline asm syntax is pretty crap, so there's no way to just ask for a pointer to that in a register. I think you have to create a new C++ variable like const char *ystr = y.c_str();
That C++ variable is a pointer which you need to load into a register with mov ecx, [ystr]. Accessing the bytes of ystr's object representation directly would give you the bytes of the pointer.
Also, your current code is using the loop instruction, which is slow and equivalent to dec ecx/jnz. But you didn't initialize ECX, and your loop termination condition is based on the zero terminator, not a counter that you know ahead of the first iteration. (Unless you also ask the std::string for its length instead).
There is zero reason to use the loop instruction here. Put a test al,al / jnz loop1 at the bottom of your loop like a normal person.
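A hedged sketch of how the fixed loop might look (untested; it assumes the int i counter variable from the question's surrounding code, and note that movzx needs a 16- or 32-bit destination, not AL):
const char* ystr = y.c_str();  // pointer to the actual characters
_asm
{
    mov esi, ystr       // load the data pointer, not the string object
    xor edx, edx        // digit sum
    xor ebx, ebx        // index
loop1:
    movzx eax, byte ptr [esi + ebx]  // current char, zero-extended
    test al, al
    jz finished         // stop at the NUL terminator
    sub eax, 48         // convert ASCII digit to its value
    add edx, eax
    inc ebx
    jmp loop1           // plain jmp; no reason to use the slow `loop` instruction
finished:
    mov i, edx
}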

What is the correct way to obtain (-1)^n?

Many algorithms require computing (-1)^n (both integers), usually as a factor in a series. That is, a factor that is -1 for odd n and 1 for even n. In a C++ environment, one often sees:
#include <iostream>
#include <cmath>

int main(){
    int n = 13;
    std::cout << std::pow(-1, n) << std::endl;
}
What is better, or the usual convention? (Or something else?)
std::pow(-1, n)
std::pow(-1, n%2)
(n%2?-1:1)
(1-2*(n%2)) // (gives incorrect value for negative n)
EDIT:
In addition, user @SeverinPappadeux proposed another alternative based on (global?) array lookups. My version of it is:
const int res[] {-1, 1, -1}; // three elements are needed for negative modulo results
const int* const m1pow = res + 1;
...
m1pow[n%2]
This is probably not going to settle the question but, by looking at the emitted code, we can discard some options.
First without optimization, the final contenders are:
1 - ((n & 1) << 1);
(7 operations, no memory access)
mov eax, DWORD PTR [rbp-20]
add eax, eax
and eax, 2
mov edx, 1
sub edx, eax
mov eax, edx
mov DWORD PTR [rbp-16], eax
and
retvals[n&1];
(5 operations, memory --registers?-- access)
mov eax, DWORD PTR [rbp-20]
and eax, 1
cdqe
mov eax, DWORD PTR main::retvals[0+rax*4]
mov DWORD PTR [rbp-8], eax
Now with optimization (-O3)
1 - ((n & 1) << 1);
(4 operations, no memory access)
add edx, edx
mov ebp, 1
and edx, 2
sub ebp, edx
.
retvals[n&1];
(4 operations, memory --registers?-- access)
mov eax, edx
and eax, 1
movsx rcx, eax
mov r12d, DWORD PTR main::retvals[0+rcx*4]
.
n%2?-1:1;
(4 operations, no memory access)
cmp eax, 1
sbb ebx, ebx
and ebx, 2
sub ebx, 1
The tests are here. I had to do some acrobatics to get meaningful code that doesn't elide the operations altogether.
Conclusion (for now)
So in the end it depends on the level of optimization and expressiveness:
1 - ((n & 1) << 1); is always good but not very expressive.
retvals[n&1]; pays a price for memory access.
n%2?-1:1; is expressive and good but only with optimization.
You can use (n & 1) instead of n % 2 and << 1 instead of * 2 if you want to be super-pedantic, er I mean optimized.
So the fastest way to compute it on an 8086 processor is:
1 - ((n & 1) << 1)
I just want to clarify where this answer is coming from. The original poster alfC did an excellent job of posting a lot of different ways to compute (-1)^n, some being faster than others.
Nowadays, with processors being as fast as they are and optimizing compilers being as good as they are, we usually value readability over the slight (even negligible) improvement from shaving a few CPU cycles off an operation.
There was a time when one-pass compilers ruled the earth and MUL operations were new and decadent; in those days a power-of-2 operation was an invitation for gratuitous optimization.
Usually you don't actually calculate (-1)^n; instead you track the current sign (as a number that is either -1 or 1) and flip it every operation (sign = -sign). Do this as you handle your n in order and you will get the same result.
EDIT: Note that part of the reason I recommend this is because there is rarely actual semantic value in the representation (-1)^n; it is merely a convenient method of flipping the sign between iterations.
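For instance, a minimal sketch of that pattern in a series loop (term() here is just a stand-in for whatever the series computes per iteration):
#include <cstddef>

double term(std::size_t n) { return 1.0 / (n + 1); } // e.g. harmonic terms

double alternating_sum(std::size_t count)
{
    double sum = 0.0;
    double sign = 1.0;          // plays the role of (-1)^n
    for (std::size_t n = 0; n < count; ++n) {
        sum += sign * term(n);
        sign = -sign;           // flip instead of computing a power
    }
    return sum;
}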
First of all, the fastest isOdd test I know of (as an inline method):
/**
 * Return true if the value is odd
 * @param value the value to check
 */
inline bool isOdd(int value)
{
    return (value & 1);
}
Then make use of this test to return -1 if odd, 1 otherwise (which is the actual output of (-1)^N):
/**
 * Return the computation of (-1)^N
 * @param n the N factor
 */
inline int minusOneToN(int n)
{
    return isOdd(n) ? -1 : 1;
}
Lastly, as suggested by @Guvante, you can spare a multiplication by just flipping the sign of a value (avoiding the minusOneToN function):
/**
 * Example of the general usage. Avoids a useless multiplication.
 * @param value the value to flip if it is odd
 */
inline int flipSignIfOdd(int value)
{
    return isOdd(value) ? -value : value;
}
Many algorithms require to compute (-1)^n (both integer), usually as a
factor in a series. That is, a factor that is -1 for odd n and 1 for
even n.
Consider evaluating the series as a function of -x instead.
If it's speed you need, here goes ...
int inline minus_1_pow(int n) {
    static const int retvals[] {1, -1};
    return retvals[n & 1];
}
The Visual C++ compiler with optimization turned up to 11 compiles this down to two machine instructions, neither of which is a branch. It optimizes away the retvals array too, so there are no cache misses.
What about
(1 - (n%2)) - (n%2)
n%2 will most likely be computed only once.
UPDATE
Actually, the simplest and most correct way would be using a table:
const int res[] {-1, 1, -1};
return res[n%2 + 1];
Well, if we are performing the calculation in a series, why not handle the calculation in a positive loop and a negative loop, skipping the evaluation completely?
The Taylor series expansion to approximate the natural log of (1+x) is a perfect example of this type of problem. Each term has (-1)^(n+1), or equivalently (-1)^(n-1). There is no need to calculate this factor. You can "slice" the problem by either executing one loop for every two terms, or two loops, one for the odd terms and one for the even terms.
Of course, since the calculation, by its nature, is over the domain of real numbers, you will be using a floating-point processor to evaluate the individual terms anyway. Once you have decided to do that, you should just use the library implementation for the natural logarithm. But if for some reason you decide not to, it will certainly be faster, though not by much, to avoid wasting cycles calculating the value of -1 to the nth power.
Perhaps each can even be done in separate threads. Maybe the problem can be vectorized, even.
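As an illustration only, a rough sketch of the two-terms-per-iteration idea for ln(1+x) = x - x^2/2 + x^3/3 - ... (assumes an even term count and |x| < 1; in real code you'd just call std::log1p):
double ln1p_taylor(double x, int terms /* assumed even */)
{
    double sum = 0.0;
    double xp = x;               // holds x^n for the current term
    for (int n = 1; n <= terms; n += 2) {
        sum += xp / n;           // + x^n / n        (odd n, positive term)
        xp *= x;
        sum -= xp / (n + 1);     // - x^(n+1)/(n+1)  (even n+1, negative term)
        xp *= x;
    }
    return sum;
}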

Pointer/memory arithmetic in Assembly

I'm trying to get the hang of assembly, but there's one probably very simple thing I don't understand.
Consider the following simple example:
long long * values = new long long[2];
values[0] = 10;
values[1] = 20;
int j = -1;
values[j+2] = 15; // xxxxxxx
Now, the last line (marked with xxxxxxx) disassembles to:
000A6604 mov eax,dword ptr [j]
000A6607 mov ecx,dword ptr [values]
000A660A mov dword ptr [ecx+eax*8+10h],0Fh
First question: What is actually stored in eax and ecx? Is it the actual values (i.e. -1 for j, and the two long long values 10 and 20 for values), or is it merely a memory address (e.g. something like &j, &values) pointing to some place where the values are stored?
Second question: I know what the third line is supposed to do, but I'm not quite sure why it actually works.
So my understanding is, it copies the value 0x0F into the specified memory location. The memory location is basically:
- the location of the first element, stored in ecx
- plus the size of long long in bytes (= 8) * the value of eax (which equals j, so -1)
- plus the generic offset of 16 bytes (2 times the size of long long).
What I don't get is: in this expression, ecx seems to be a memory address, while eax seems to be a value (-1). How is this possible? Seeing as they were defined in pretty much the same way, shouldn't eax and ecx either both contain memory addresses, or both contain values?
Thanks.
eax and ecx are registers -- the first two instructions load those registers with the values used in the calculation, i.e. j and values (where values means the base address of the array by that name).
I know what the third line is supposed to do, but I'm not quite sure why this actually works
The instruction mov dword ptr [ecx+eax*8+10h],0Fh means move the value 0Fh (i.e. 15 decimal) into the location ecx+eax*8+10h. To figure that out, consider each piece:
- ecx is the base address of the values array
- eax is the value of j, i.e. -1
- eax*8 is j converted to an offset in bytes (the size of a long long is 8 bytes)
- eax*8+10h: 10h is 16 decimal, i.e. 2*8, so this is j+2 converted to a byte offset
- ecx+eax*8+10h adds that final offset to the base address of the array to determine the location in which to store the value 15
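To tie it back to the C++ source, here is a small sanity check one could compile (hand-computing the same address the instruction forms):
#include <cassert>

int main()
{
    long long* values = new long long[2]{10, 20};
    int j = -1;
    // base (ecx) + j*8 (eax*8) + 16 (10h) is exactly &values[j+2]
    long long* addr = (long long*)((char*)values + j * 8 + 16);
    assert(addr == &values[j + 2]);
    *addr = 15;   // same effect as values[j+2] = 15;
    delete[] values;
}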