Why is there no Z80-like LDIR functionality in C/C++/rtl?

In Z80 machine code, there is a cheap technique to initialize a buffer to a fixed value, say all blanks. A chunk of code might look something like this:
LD HL, DESTINATION ; point to the source
LD DE, DESTINATION + 1 ; point to the destination
LD BC, DESTINATION_SIZE - 1 ; copying this many bytes
LD (HL), 0X20 ; put a seed space in the first position
LDIR ; move 1 to 2, 2 to 3...
The result being that the chunk of memory at DESTINATION is completely blank filled.
I have experimented with memmove and memcpy, and can't replicate this behavior. I expected memmove to be able to do it correctly.
Why do memmove and memcpy behave this way?
Is there any reasonable way to do this sort of array initialization?
I am already aware of char array[size] = {0} for array initialization
I am already aware that memset will do the job for single characters.
What other approaches are there to this issue?
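For concreteness, the effect I'm after is just the forward overlapping copy, which is easy to state as a plain byte loop (a sketch; the name is mine, and it assumes size >= 1):

#include <stddef.h>

void ldir_fill(char *dst, size_t size, char seed)
{
    dst[0] = seed;               /* LD (HL), 0x20 */
    for (size_t i = 1; i < size; i++)
        dst[i] = dst[i - 1];     /* move 1 to 2, 2 to 3, ... */
}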

There was a quicker way of blanking an area of memory using the stack. Although the use of LDI and LDIR was very common, David Webb (who pushed the ZX Spectrum in all sorts of ways, like full-screen number countdowns including the border) came up with this technique, which is 4 times faster. It:
saves the Stack Pointer and then moves it to the end of the screen;
LOADs the HL register pair with zero;
goes into a massive loop PUSHing HL onto the Stack.
The Stack moves up the screen and down through memory and, in the process, clears the screen.
The explanation above was taken from a review of David Webb's game Starion.
The Z80 routine might look a little like this:
DI ; disable interrupts which would write to the stack.
LD HL, 0
ADD HL, SP ; save stack pointer
EX DE, HL ; in DE register
LD HL, 0
LD C, 0x18 ; Screen size in pages
LD SP, 0x5800 ; one past the end of the screen bitmap (0x4000-0x57FF)
PAGE_LOOP:
LD B, 128 ; inner loop iterates 128 times
LOOP:
PUSH HL ; effectively *--SP = 0; *--SP = 0;
DJNZ LOOP ; loop for 256 bytes
DEC C
JP NZ,PAGE_LOOP
EX DE, HL
LD SP, HL ; restore stack pointer
EI ; re-enable interrupts
However, that routine is a little under twice as fast. LDIR copies one byte every 21 cycles. The inner loop copies two bytes every 24 cycles -- 11 cycles for PUSH HL and 13 for DJNZ LOOP. To get nearly 4 times as fast simply unroll the inner loop:
LOOP:
PUSH HL
PUSH HL
...
PUSH HL ; repeat 128 times
DEC C
JP NZ,LOOP
That is very nearly 11 cycles every two bytes which is about 3.8 times faster than the 21 cycles per byte of LDIR.
Undoubtedly the technique has been reinvented many times. For example, it appeared earlier in subLOGIC's Flight Simulator 1 for the TRS-80 in 1980.

memmove and memcpy don't work that way because filling memory isn't a useful semantic for functions that move or copy memory. It's handy on the Z80 to be able to fill memory this way, but why would you expect a function named "memmove" to fill memory with a single byte? It's for moving blocks of memory around, and it's implemented to get the right answer (the source bytes are moved to the destination) regardless of how the blocks overlap.
If you want to fill memory, use memset, which is designed to do just what you want.
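For example (a sketch; the size is illustrative):

#include <string.h>

#define DESTINATION_SIZE 256 /* illustrative, mirroring the question */

void blank_fill(void)
{
    char buffer[DESTINATION_SIZE];
    memset(buffer, ' ', sizeof buffer); /* 0x20 in every byte */
}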

I believe this goes to the design philosophy of C and C++. As Bjarne Stroustrup once said, one of the major guiding principles of the design of C++ is "What you don’t use, you don’t pay for". And while Dennis Ritchie may not have said it in exactly those words, I believe it was a guiding principle of his design of C (and of the design of C by subsequent people) as well. Now, you may think that if you allocate memory it should automatically be initialized to zeros, and I'd tend to agree with you. But that takes machine cycles, and if you're coding in a situation where every cycle is critical, that may not be an acceptable trade-off. Basically, C and C++ try to stay out of your way -- hence if you want something initialized, you have to do it yourself.

The Z80 sequence you show was the fastest way to do that - in 1978. That was 30 years ago. Processors have progressed a lot since then, and today that's just about the slowest way to do it.
Memmove is specified to work when the source and destination ranges overlap, so you can move a chunk of memory up by one byte; that is part of its contract in the C and C++ standards. Memcpy's behavior on overlapping ranges is undefined: it might work identically to memmove, or it might not, depending on how your library implements it. That freedom lets memcpy use a method more efficient than memmove's.
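For example, the overlapping shift-up-by-one is exactly what memmove guarantees to handle (a sketch):

#include <string.h>

void shift_up(char *buf, size_t n)
{
    memmove(buf + 1, buf, n - 1); /* source and destination overlap: well-defined */
}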

Why do memmove and memcpy behave this way?
Probably because there’s no specific, modern C++ compiler that targets the Z80 hardware? Write one. ;-)
The languages don't specify how a given piece of hardware implements anything. This is entirely up to the writers of the compiler and libraries. Of course, writing your own highly specialized version for every imaginable hardware configuration is a lot of work. That’ll be the reason.
Is there any reasonable way to do this sort of array initialization?
Well, if all else fails you could always use inline assembly. Other than that, I expect std::fill to perform best in a good STL implementation. And yes, I’m fully aware that my expectations are too high and that std::memset often performs better in practice.
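A std::fill sketch for the blank-fill case from the question (the array size is illustrative):

#include <algorithm>
#include <iterator>

void blank_fill(char (&buf)[256])
{
    std::fill(std::begin(buf), std::end(buf), ' '); // every element becomes a space
}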

If you're fiddling at the hardware level, then some CPUs have DMA controllers that can fill blocks of memory exceedingly quickly (much faster than the CPU could ever do). I've done this on a Freescale i.MX21 CPU.

This can be accomplished in x86 assembly just as easily. In fact, it boils down to nearly identical code to your example.
mov esi, source ; set esi to be the source
lea edi, [esi + 1] ; set edi to be the source + 1
mov byte [esi], 0 ; initialize the first byte with the "seed"
mov ecx, 100h ; number of bytes to copy (buffer size - 1, as in the Z80 version)
rep movsb ; do the fill
However, it is simply more efficient to set more than one byte at a time if you can.
Finally, memcpy/memmove aren't what you are looking for; those are for making copies of blocks of memory from one area to another (memmove allows source and dest to be part of the same buffer). memset fills a block with a byte of your choosing.

There's also calloc that allocates and initializes the memory to 0 before returning the pointer. Of course, calloc only initializes to 0, not something the user specifies.
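A quick sketch (the element count is illustrative):

#include <stdlib.h>

void demo(void)
{
    size_t n = 100;                       /* illustrative count */
    int *p = (int *)calloc(n, sizeof *p); /* n ints, all zero-initialized */
    /* ... use p ... */
    free(p);
}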

If this is the most efficient way to set a block of memory to a given value on the Z80, then it's quite possible that memset() might be implemented as you describe on a compiler that targets Z80s.
memcpy() might also use a similar sequence on that compiler.
But why would compilers targeting CPUs with completely different instruction sets from the Z80 be expected to use a Z80 idiom for these types of things?
Remember that the x86 architecture has a similar set of instructions that can be given a REP prefix to make them execute repeatedly, to do things like copy, fill, or compare blocks of memory. However, by the time Intel came out with the 386 (or maybe it was the 486), the CPU would actually run those instructions slower than simpler instructions in a loop. So compilers often stopped using the REP-prefixed instructions.

Seriously, if you're writing C/C++, just write a simple for-loop and let the compiler worry about it for you. As an example, here's some code VS2005 generated for this exact case (using a templated size):
template <int S>
class A
{
    char s_[S];
public:
    A()
    {
        for (int i = 0; i < S; ++i)
        {
            s_[i] = 'A';
        }
    }

    int MaxLength() const
    {
        return S;
    }
};

extern void useA(A<5> &a, int n); // fool the optimizer into generating any code at all

void test()
{
    A<5> a5;
    useA(a5, a5.MaxLength());
}
The assembler output is the following:
test PROC
[snip]
; 25 : A<5> a5;
mov eax, 41414141H ;"AAAA"
mov DWORD PTR a5[esp+40], eax
mov BYTE PTR a5[esp+44], al
; 26 : useA(a5, a5.MaxLength());
lea eax, DWORD PTR a5[esp+40]
push 5 ; MaxLength()
push eax
call useA
It does not get any more efficient than that. Stop worrying and trust your compiler or at least have a look at what your compiler produces before trying to find ways to optimize. For comparison I also compiled the code using std::fill(s_, s_ + S, 'A') and std::memset(s_, 'A', S) instead of the for-loop and the compiler produced the identical output.

If you're on the PowerPC, there's _dcbz() (data cache block zero), which zeroes a whole cache line at a time.

There are a number of situations where it would be useful to have a "memspread" function whose defined behavior was to copy the starting portion of a memory range throughout the whole thing. Although memset() does just fine if the goal is to spread a single byte value, there are times when e.g. one may want to fill an array of integers with the same value. On many processor implementations, copying a byte at a time from the source to the destination would be a pretty crummy way to implement it, but a well-designed function could yield good results. For example, start by seeing if the amount of data is less than 32 bytes or so; if so, just do a bytewise copy; otherwise check the source and destination alignment; if they are aligned, round the size down to the nearest word (if necessary), then copy the first word everywhere it goes, copy the next word everywhere it goes, etc.
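A minimal C sketch of such a function, using a doubling strategy with memcpy rather than the word-at-a-time scheme just described (the name memspread and the precondition 1 <= patlen <= total are my own):

#include <string.h>

/* Replicate the first patlen bytes of dst throughout total bytes.
   Precondition: 1 <= patlen <= total. */
void memspread(void *dst, size_t patlen, size_t total)
{
    char *p = (char *)dst;
    size_t filled = patlen;
    /* Each pass copies the already-filled prefix just past itself,
       doubling it: O(log(total/patlen)) memcpy calls in all. */
    while (filled * 2 <= total) {
        memcpy(p + filled, p, filled);
        filled *= 2;
    }
    memcpy(p + filled, p, total - filled); /* whatever remains */
}

So filling an int array with one value becomes: store arr[0], then memspread(arr, sizeof arr[0], sizeof arr).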
I too have at times wished for a function that was specified to work as a bottom-up memcpy, intended for use with overlapping ranges. As to why there isn't a standard one, I guess nobody thought it important.

memcpy() might happen to have that behavior, since its behavior on overlapping buffers is undefined. memmove() doesn't, by design: if the blocks of memory overlap, it copies starting from the far ends of the buffers to avoid exactly that sort of behavior. But to fill a buffer with a specific value you should be using memset() in C or std::fill() in C++, which most modern compilers will optimize to the appropriate block-fill instruction (such as REP STOSB on x86 architectures).

As said before, memset() offers the desired functionality.
memcpy() is for copying blocks of memory where the source and destination buffers do not overlap (a typical forward implementation also happens to work when dest < source, but the standard doesn't promise that).
memmove() additionally handles overlapping buffers, including the dest > source case.
On x86 architectures, good compilers replace memset calls directly with inline assembly that sets the destination buffer's memory very effectively, even applying further optimizations such as filling with 4-byte stores for as long as possible (if the following code isn't totally syntactically correct, blame it on my not having used x86 assembly for a long time):
lea edi,dest
;copy the fill byte to all 4 bytes of eax
mov al,fill
mov ah,al
mov dx,ax
shl eax,16
mov ax,dx
mov ecx,count
mov edx,ecx
shr ecx,2
cld
rep stosd
test edx,2
jz moveByte
stosw
moveByte:
test edx,1
jz fillDone
stosb
fillDone:
Actually this code is far more efficient than your Z80 version, as it doesn't do memory-to-memory moves, only register-to-memory moves. Your Z80 code is in fact quite a hack, as it relies on each copy operation having filled the source of the subsequent copy.
If the compiler is halfway good, it might be able to detect more complicated C++ code that can be broken down to memset (see the post below), but I doubt that this actually happens for nested loops, probably even invoking initialization functions.

Related

Get the address of an intrinsic function-generated instruction

I have a function that uses the compiler intrinsic __movsq to copy some data from a global buffer into another global buffer upon every call of the function. I'm trying to nop out those instructions once a flag has been set globally and the same function is called again. Example code:
// compiler: MSVC++ VS 2022 in C++ mode; x64
void DispatchOptimizationLoop()
{
    __movsq(g_table, g_localimport, 23);
    // hopefully create a nop after movsq?
    static unsigned char* ptr = (unsigned char*)(&__nop);
    if (!InterlockedExchange8(g_Reduce, 1))
    {
        // point to movsq in memory
        ptr -= 3;
        // nop it out
        ...
    }
    // rest of function here
    ...
}
Basically the function places a nop after the movsq, and then tries to get the address of the placed nop and backtrack by the size of the movsq, so that a pointer points to the start of the movsq; then I can simply cover it with 3 0x90s. I am aware that the line (unsigned char*)(&__nop) is not actually creating a nop because I'm not calling the intrinsic; I'm just trying to show what I want to do.
Is this possible, or is there a better way to store the address of the instructions that need to be nop'ed out in the future?
It's not useful to have the address of a 0x90 NOP somewhere else, all you need is the address of machine code inside your function. Nothing you've written comes remotely close to helping you find that. As you say, &__nop doesn't lead to there being a NOP in your function's machine code which you could offset relative to.
If you want to hard-code offsets that could break with different optimization settings, you could take the address of the start of the function and offset it.
Or you could write the whole function in asm so you can put a label on the address you want to modify. That would actually let you do this safely.
You might get something that happens to work with GNU C labels as values, where you can take the address of C goto labels like &&label. Like put a mylabel: before the intrinsic, and maybe after for good measure so you can check that the difference is the expected 3 bytes. If you're lucky, the compiler won't put any other instructions between your labels.
So you can memset((void*)&&mylabel, 0x90, 3) (after an assert on &&mylabel_end - &&mylabel == 3). But I don't think MSVC supports that GNU extension or anything equivalent.
But you can't actually use memset if another thread could be running this at the same time.
And for efficiency, you want a single 3-byte NOP anyway.
And of course you'd have to VirtualProtect the page of machine code containing that instruction to make it writeable. (Assuming the function is 16-byte aligned, it's hopefully impossible for that one instruction near the start to be split across two pages.)
And if other threads could be running this function at the same time, you'd better use an atomic RMW (on the containing dword or qword) to replace the 3-byte instruction with a single 3-byte NOP, otherwise another thread could fetch and decode the first NOP but then fetch a byte of the movsq machine code that hasn't been replaced yet.
Actually a plain mov store would be atomic if it's 4 bytes not crossing an 8-byte boundary. Since there are no other writers of different data, it's fine to load / AND/OR / store to later store the same surrounding bytes you loaded earlier. Normally a non-atomic load+store is not thread-safe, but no other threads could have written a different value in the meantime.
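A heavily hedged sketch of that dword-store approach on Windows (my own helper, not tested; it assumes all 3 bytes to patch sit inside one 4-byte-aligned dword, i.e. the instruction starts at offset 0 or 1 within it, and it ignores the cross-modifying-code caveats below):

#include <windows.h>
#include <stdint.h>

void patch3_to_nop(unsigned char *target)   // target = first byte of the 3-byte movsq
{
    uint32_t *dw = (uint32_t *)((uintptr_t)target & ~(uintptr_t)3);
    size_t off = (size_t)(target - (unsigned char *)dw);  // assumed to be 0 or 1
    DWORD old;
    VirtualProtect(dw, sizeof *dw, PAGE_EXECUTE_READWRITE, &old);
    uint32_t v = *dw;                         // load the surrounding bytes
    unsigned char *b = (unsigned char *)&v;
    b[off] = 0x0F; b[off + 1] = 0x1F; b[off + 2] = 0x00;  // canonical 3-byte NOP
    *(volatile uint32_t *)dw = v;             // one aligned 4-byte store
    VirtualProtect(dw, sizeof *dw, old, &old);
    FlushInstructionCache(GetCurrentProcess(), dw, sizeof *dw);
}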
Atomicity of cross-modifying code
I think cross-modifying code has atomicity rules similar to data. But if the instruction spans a 16-byte boundary, code-fetch in another core might have pulled in the first 1 or 2 bytes of it before you atomically replace all 3. The 2nd and 3rd bytes would then get treated as either the start of an instruction or as the 2nd and 3rd bytes of a long NOP. Since long NOPs generally start with the 0F 1F escape sequence, if that's not how __movsq's code starts, decode could desync.
So if cross-modifying code doesn't trigger a pipeline nuke on the other core, it's not safe to do it while another thread might be running the code. Code fetch is usually done in 16-byte chunks but that's not guaranteed. And it's not guaranteed that they're aligned 16-byte chunks.
So you should probably make sure no other threads are running this function while you change the machine code, unless you're very sure of the safety of what you're doing and check each build to make sure the instruction starts at a safe offset -- where "safe" accounts for everything that could possibly go wrong.

Crash with icc: can the compiler invent writes where none existed in the abstract machine?

Consider the following simple program:
#include <cstring>
#include <cstdio>
#include <cstdlib>
void replace(char *str, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (str[i] == '/') {
            str[i] = '_';
        }
    }
}

const char *global_str = "the quick brown fox jumps over the lazy dog";

int main(int argc, char **argv) {
    const char *str = argc > 1 ? argv[1] : global_str;
    replace(const_cast<char *>(str), std::strlen(str));
    puts(str);
    return EXIT_SUCCESS;
}
It takes an (optional) string on the command line and prints it, with / characters replaced by _. This replacement functionality is implemented by the replace function¹. For example, a.out foo/bar prints:
foo_bar
Elementary stuff so far, right?
If you don't specify a string, it conveniently uses the global string the quick brown fox jumps over the lazy dog, which doesn't contain any / characters, and so doesn't undergo any replacement.
Of course, string constants are const char[], so I need to cast away the constness first - that's the const_cast you see. Since the string is never actually modified, I am under the impression this is legal.
gcc and clang compile a binary that has the expected behavior, with or without a string passed on the command line. icc, however, crashes when you don't provide a string:
icc -xcore-avx2 char_replace.cpp && ./a.out
Segmentation fault (core dumped)
The underlying cause is the main loop of replace, which looks like this:
400c0c: vmovdqu ymm2,YMMWORD PTR [rsi]
400c10: add rbx,0x20
400c14: vpcmpeqb ymm3,ymm0,ymm2
400c18: vpblendvb ymm4,ymm2,ymm1,ymm3
400c1e: vmovdqu YMMWORD PTR [rsi],ymm4
400c22: add rsi,0x20
400c26: cmp rbx,rcx
400c29: jb 400c0c <main+0xfc>
It is a vectorized loop. The basic idea is that 32 bytes are loaded, and then compared against the / character, forming a mask value with a byte set for each byte that matched, and then the existing string is blended against a vector containing 32 _ characters, effectively replacing only the / characters. Finally, the updated register is written back to the string, with the vmovdqu YMMWORD PTR [rsi],ymm4 instruction.
This final store crashes, because the string is read-only and allocated in the .rodata section of the binary, which is loaded using read-only pages. Of course, the store was a logical "no op", writing back the same characters it read, but the CPU doesn't care!
Is my code legal C++ and therefore I should blame icc for miscompiling this, or I am wading into the UB swamp somewhere?
¹ The same crash from the same issue occurs with std::replace on a std::string rather than my "C-like" code, but I wanted to simplify the analysis as much as possible and make it entirely self-contained.
Your program is well-formed and free of undefined behaviour, as far as I can tell. The C++ abstract machine never actually assigns to a const object. A not-taken if() is sufficient to "hide" / "protect" things that would be UB if they executed. The only thing an if(false) can't save you from is an ill-formed program, e.g. syntax errors or trying to use extensions that don't exist on this compiler or target arch.
Compilers aren't in general allowed to invent writes with if-conversion to branchless code.
Casting away const is legal, as long as you don't actually assign through it. e.g. for passing a pointer to a function that isn't const-correct, and takes a read-only input with a non-const pointer. The answer you linked on Is it allowed to cast away const on a const-defined object as long as it is not actually modified? is correct.
ICC's behaviour here is not evidence for UB in ISO C++ or C. I think your reasoning is sound, and this is well-defined. You've found an ICC bug. If anyone cares, report it on their forums: https://software.intel.com/en-us/forums/intel-c-compiler. Existing bug reports in that section of their forum have been accepted by developers, e.g. this one.
We can construct an example where it auto-vectorizes the same way (with unconditional and non-atomic read/maybe-modify/rewrite) where it's clearly illegal, because the read / rewrite is happening on a 2nd string that the C abstract machine doesn't even read.
Thus, we can't trust ICC's code-gen to tell us anything about when we've caused UB, because it will make crashing code even in clearly legal cases.
Godbolt: ICC19.0.1 -O2 -march=skylake (Older ICC only understood options like -xcore-avx2, but modern ICC understands the same -march as GCC/clang.)
#include <stddef.h>
void replace(const char *str1, char *str2, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (str1[i] == '/') {
            str2[i] = '_';
        }
    }
}
It checks for overlap between str1[0..len-1] and str2[0..len-1], but for large enough len and no overlap it will use this inner loop:
..B1.15: # Preds ..B1.15 ..B1.14 //do{
vmovdqu ymm2, YMMWORD PTR [rsi+r8] #6.13 // load from str2
vpcmpeqb ymm3, ymm0, YMMWORD PTR [rdi+r8] #5.24 // compare vs. str1
vpblendvb ymm4, ymm2, ymm1, ymm3 #6.13 // blend
vmovdqu YMMWORD PTR [r8+rsi], ymm4 #6.13 // store to str2
add r8, 32 #4.5 // i+=32
cmp r8, rax #4.5
jb ..B1.15 # Prob 82% #4.5 // }while(i<len);
For thread-safety, it's well known that inventing writes via non-atomic read/rewrite is unsafe.
The C++ abstract machine never touches str2 at all, so that invalidates any arguments for the one-string version about data-race UB being impossible because reading str at the same time another thread is writing it was already UB. Even C++20 std::atomic_ref doesn't change that, because we're reading through a non-atomic pointer.
But even worse than that, str2 can be nullptr. Or it can point close to the end of an object (which happens to be stored near the end of a page), with str1 containing chars such that no writes past the end of str2 / the page would happen. We could even arrange for only the very last byte (str2[len-1]) to be in a new page, so that it's one-past-the-end of a valid object. It's even legal to construct such a pointer (as long as you don't deref it). But it would be legal to pass str2=nullptr; code behind an if() that doesn't run doesn't cause UB.
Or another thread is running the same search/replace function in parallel, with a different key/replacement that will only write different elements of str2. The non-atomic load/store of unmodified values will step on modified values from the other thread. It's definitely allowed, according to the C++11 memory model, for different threads to simultaneously touch different elements of the same array. C++ memory model and race conditions on char arrays. (This is why char must be as large as the smallest unit of memory the target machine can write without a non-atomic RMW. An internal atomic RMW for a byte stores into cache is fine, though, and doesn't stop byte-store instructions from being useful.)
(This example is only legal with the separate str1/str2 version, because reading every element means the threads would be reading array elements the other thread could be in the middle of writing, which is data-race UB.)
As Herb Sutter mentioned in atomic<> Weapons: The C++ Memory Model and Modern Hardware Part 2: Restrictions on compilers and hardware (incl. common bugs); code generation and performance on x86/x64, IA64, POWER, ARM, and more; relaxed atomics; volatile: weeding out non-atomic RMW code-gen has been an ongoing issue for compilers after C++11 was standardized. We're most of the way there, but highly-aggressive and less-mainstream compilers like ICC clearly still have bugs.
(However, I'm pretty confident that Intel compiler devs would consider this a bug.)
Some less-plausible (to see in a real program) examples that this would also break:
Besides nullptr, you could pass a pointer to (an array of) std::atomic<T> or a mutex where a non-atomic read/rewrite breaks things by inventing writes. (char* can alias anything).
Or str2 points to a buffer that you've carved up for dynamic allocation, and the early part of str1 will have some matches, but later parts of str1 won't have any matches, and that part of str2 is being used by other threads. (And for some reason you can't easily calculate a length that stops the loop short).
For future readers: If you want to let compilers auto-vectorize this way:
You can write source like str2[i] = x ? replacement : str2[i]; that always writes the string in the C++ abstract machine. IIRC, that lets gcc/clang vectorize the way ICC does after doing its unsafe if-conversion to blend.
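A sketch of that source-level transformation applied to the two-string version above; the only change is that the store is now unconditional in the abstract machine:

#include <stddef.h>

void replace(const char *str1, char *str2, size_t len) {
    for (size_t i = 0; i < len; i++)
        str2[i] = (str1[i] == '/') ? '_' : str2[i]; // always stores, so vectorizing with a blend is legal
}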
In theory an optimizing compiler can turn it back into a conditional branch in the scalar cleanup or whatever to avoid dirtying memory unnecessarily. (Or if targeting an ISA like ARM32 where a predicated store is possible, instead of only ALU select operations like x86 cmov, PowerPC isel, or AArch64 csel. ARM32 predicated instructions are architecturally a NOP if the predicate is false).
Or if an x86 compiler chose to use AVX512 masked stores, that would also make it safe to vectorize the way ICC does: masked stores do fault suppression, and never actually store to elements where the mask is false. (When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements?).
vpcmpeqb k1, zmm0, [rdi] ; compare from memory into mask
vmovdqu8 [rsi]{k1}, zmm1 ; masked store that only writes elements where the mask is true
ICC19 actually does basically this (but with indexed addressing modes) with -march=skylake-avx512. But with ymm vectors because 512-bit lowers max turbo too much to be worth it unless your whole program is heavily using AVX512, on Skylake Xeons anyway.
So I think ICC19 is safe when vectorizing this with AVX512, but not AVX2. Unless there are problems in its cleanup code where it does something more complicated with vpcmpuq and kshift / kor, a zero-masked load, and a masked compare into another mask reg.
AVX1 has masked stores (vmaskmovps/pd) with fault-suppression and everything, but until AVX512BW there's no granularity narrower than 32 bits. The AVX2 integer versions are only available in dword/qword granularity, vpmaskmovd/q.

Is copying in a loop less efficient than memcpy()?

I've started studying IT, and I'm discussing with a friend right now whether this code is inefficient or not.
// const char *pName
// char *m_pName = nullptr;
for (int i = 0; i < strlen(pName); i++)
    m_pName[i] = pName[i];
He claims that, for example, memcpy would do just the same as the for loop above. I wonder if that's true; I don't believe it.
If there are more efficient ways or if this is inefficient, please tell me why!
Thanks in advance!
I took a look at actual g++ -O3 output for your code, to see just how bad it was.
char* can alias anything, so even the __restrict__ GNU C++ extension can't help the compiler hoist the strlen out of the loop.
I was thinking it would be hoisted, and expecting that the major inefficiency here was just the byte-at-a-time copy loop. But no, it's really as bad as the other answers suggest. m_pName even has to be re-loaded every time, because the aliasing rules allow m_pName[i] to alias this->m_pName. The compiler can't assume that storing to m_pName[i] won't change class member variables, or the src string, or anything else.
#include <string.h>
class foo {
    char *__restrict__ m_pName = nullptr;
    void set_name(const char *__restrict__ pName);
    void alloc_name(size_t sz) { m_pName = new char[sz]; }
};

// g++ will only emit a non-inline copy of the function if there's a non-inline definition.
void foo::set_name(const char * __restrict__ pName)
{
    // char* can alias anything, including &m_pName, so the loop has to reload the pointer every time
    //char *__restrict__ dst = m_pName; // a local avoids the reload of m_pName, but still can't hoist strlen
    #define dst m_pName
    for (unsigned int i = 0; i < strlen(pName); i++)
        dst[i] = pName[i];
}
Compiles to this asm (g++ -O3 for x86-64, SysV ABI):
...
.L7:
movzx edx, BYTE PTR [rbp+0+rbx] ; byte load from src. clang uses mov al, byte ..., instead of movzx. The difference is debatable.
mov rax, QWORD PTR [r12] ; reload this->m_pName
mov BYTE PTR [rax+rbx], dl ; byte store
add rbx, 1
.L3: ; first iteration entry point
mov rdi, rbp ; function arg for strlen
call strlen
cmp rbx, rax
jb .L7 ; compare-and-branch (unsigned)
Using an unsigned int loop counter introduces an extra mov ebx, ebp copy of the loop counter, which you don't get with either int i or size_t i, in both clang and gcc. Presumably they have a harder time accounting for the fact that unsigned i could produce an infinite loop.
So obviously this is horrible:
a strlen call for every byte copied
copying one byte at a time
reloading m_pName every time through the loop (can be avoided by loading it into a local).
Using strcpy avoids all these problems, because strcpy is allowed to assume that its src and dst don't overlap. Don't use strlen + memcpy unless you need the length yourself. If the most efficient implementation of strcpy is strlen + memcpy, the library function will do that internally. Otherwise, it will do something even more efficient, like glibc's hand-written SSE2 strcpy for x86-64. (There is an SSSE3 version, but it's actually slower on Intel SnB, and glibc is smart enough not to use it.) Even the SSE2 version may be unrolled more than it should be (great in microbenchmarks, but it pollutes the instruction cache, uop cache, and branch-predictor caches when used as a small part of real code). The bulk of the copying is done in 16B chunks, with 64-bit, 32-bit, and smaller chunks in the startup/cleanup sections.
Using strcpy of course also avoids bugs like forgetting to store a trailing '\0' character in the destination. If your input strings are potentially gigantic, using int for the loop counter (instead of size_t) is also a bug. Using strncpy is generally better, since you often know the size of the dest buffer, but not the size of the src.
memcpy can be more efficient than strcpy, since rep movs is highly optimized on Intel CPUs, esp. IvB and later. However, scanning the string to find the right length first will always cost more than the difference. Use memcpy when you already know the length of your data.
At best it's somewhat inefficient. At worst, it's quite inefficient.
In the good case, the compiler recognizes that it can hoist the call to strlen out of the loop. In this case, you end up traversing the input string once to compute the length, and then again to copy to the destination.
In the bad case, the compiler calls strlen every iteration of the loop, in which case the complexity becomes quadratic instead of linear.
As far as how to do it efficiently, I'd tend to do something like this:
char *dest = m_pName;
for (char const *in = pName; *in; ++in)
    *dest++ = *in;
*dest++ = '\0';
This traverses the input only once, so it's potentially about twice as fast as the first, even in the better case (and in the quadratic case, it can be many times faster, depending on the length of the string).
Of course, this is doing pretty much the same thing strcpy would. That may or may not be more efficient still -- I've certainly seen cases where it was. Since you'd normally assume strcpy is going to be used quite a lot, it can be worthwhile to spend more time optimizing it than some random guy on the internet spends typing in an answer in a couple of minutes.
Yes, your code is inefficient. Your code takes what is called "O(n^2)" time. Why? You have the strlen() call in your loop, so your code recalculates the length of the string on every single iteration. You can make it faster by doing this:
size_t len = strlen(pName);
for (size_t i = 0; i < len; i++)
    m_pName[i] = pName[i];
Now, you calculate the string length only once, so this code takes "O(n)" time, which is much faster than O(n^2). This is now about as efficient as you can get. However, A memcpy call would still be 4-8 times faster, because this code copies 1 byte at a time, whereas memcpy will use your system's word length.
Depends on interpretation of efficiency. I'd claim using memcpy() or strcpy() more efficient, because you don't write such loops every time you need a copy.
He claims that, for example, memcpy would do the same as the for loop above.
Well, not exactly the same. Probably not, because memcpy() takes the size once, while strlen(pName) is potentially called on every loop iteration. So from performance-efficiency considerations, memcpy() would be better.
BTW from your commented code:
// char *m_pName = nullptr;
Initializing like that and then writing through the pointer leads to undefined behavior unless you allocate memory for m_pName first:
char *m_pName = new char[strlen(pName) + 1];
Why the +1? Because you have to consider putting a '\0' indicating the end of the c-style string.
Yes, it's inefficient, not because you're using a loop instead of memcpy but because you're calling strlen on each iteration. strlen loops over the entire array until it finds the terminating zero byte.
Also, it's very unlikely that the strlen will be optimized out of the loop condition, see In C++, should I bother to cache variables, or let the compiler do the optimization? (Aliasing).
So memcpy(m_pName, pName, strlen(pName)) would indeed be faster.
Even faster would be strcpy, because it avoids the strlen loop:
strcpy(m_pName, pName);
strcpy does the same as the loop in Jerry Coffin's answer.
For simple operations like that you should almost always say what you mean and nothing more.
In this instance if you had meant strcpy() then you should have said that, because strcpy() will copy the terminating NUL character, whereas that loop will not.
Neither one of you can win the debate. A modern compiler has seen a thousand different memcpy() implementations and there's a good chance it's just going to recognise yours and replace your code either with a call to memcpy() or with its own inlined implementation of the same.
It knows which one is best for your situation. Or at least it probably knows better than you do. When you second-guess that you run the risk of the compiler failing to recognise it and your version being worse than the collected clever tricks the compiler and/or library knows.
Here are a few considerations that you have to get right if you want to run your own code instead of the library code:
What's the largest read/write chunk size that is efficient (it's rarely bytes).
For what range of loop lengths is it worth the trouble of pre-aligning reads and writes so that larger chunks can be copied?
Is it better to align reads, align writes, do nothing, or to align both and perform permutations in arithmetic to compensate?
What about using SIMD registers? Are they faster?
How many reads should be performed before the first write? How much register file needs to be used for the most efficient burst accesses?
Should a prefetch instruction be included?
How far ahead?
How often?
Does the loop need extra complexity to avoid preloading over the end?
How many of these decisions can be resolved at run-time without causing too much overhead? Will the tests cause branch prediction failures?
Would inlining help, or is that just wasting icache?
Does the loop code benefit from cache line alignment? Does it need to be packed tightly into a single cache line? Are there constraints on other instructions within the same cache line?
Does the target CPU have dedicated instructions like rep movsb which perform better? Does it have them but they perform worse?
Going further; because memcpy() is such a fundamental operation it's possible that even the hardware will recognise what the compiler's trying to do and implement its own shortcuts that even the compiler doesn't know about.
Don't worry about the superfluous calls to strlen(). Compiler probably knows about that, too. (The compiler should know in some instances, but it doesn't seem to care.) Compiler sees all. Compiler knows all. Compiler watches over you while you sleep. Trust the compiler.
Oh, except the compiler might not catch that null pointer reference. Stupid compiler!
This code is confused in various ways.
Just do m_pName = pName; because you're not actually copying the string.
You're just pointing to the one you've already got.
If you want to copy the string m_pName = strdup(pName); would do it.
If you already have storage, strcpy or memcpy would do it.
In any case, get strlen out of the loop.
This is the wrong time to worry about performance.
First get it right.
If you insist on worrying about performance, it's hard to beat strcpy.
What's more, you don't have to worry about it being right.
As a matter of fact, why do you need to copy at all??? (either with the loop or memcpy)
If you want to duplicate a memory block, that's a different question; but since it's a pointer, all you need is &pName[0] (which is the address of the first location of the array) and sizeof pName... that's it. You can reference any object in the array by incrementing the address of the first byte, and you know the limit using the size value. Why have all these pointers??? (Let me know if there is more to this than theoretical debate.)

Interchange 2 variables in C++ with asm code

I have a huge function that sorts a very large amount of int data. The code works fine except that it's slower than it should be. My first step toward solving this is to place some asm code inside the C++. How can I interchange 2 variables using asm? I've tried this:
_asm{ push a[x]; push a[y]; pop a[x]; pop a[y];}
and this:
_asm(mov eax, a[x];mov ebx,a[y]; mov a[x],ebx; mov a[y],eax;}
but both crash. How can I save some time on these interchanges? I use VS 2010.
In general, it is very difficult to do better than your compiler with simple code like this.
A compiler, when faced with a swap operation on integers, will typically issue code like this:
mov eax, [x]
mov ebx, [y]
mov [x], ebx
mov [y], eax
Before you try to override, first check what the compiler is actually generating. If it's something like this, don't bother going any further; you won't be able to do better than this. Moreover, if you leave it to the compiler, it may, if these variables are used immediately thereafter, choose to reuse one of these registers to save on variable loads/stores as well. This is impossible with hand-coded assembly; the compiler must reload the variables after the black box that is hand-coded asm.
Note that the push/push/pop/pop sequence is likely to be much slower; not only does it add an additional four memory operations to the stack, it also introduces dependencies on the stack pointer, eliminating any possibility of pipelining. With the simple mov sequence, it is at least possible to run the pair of reads and pair of writes in parallel if they are on different memory banks, or one is in cache, etc. It also does not introduce stalls on the stack pointer in later code.
As such, you should not try to micro-optimize the cost of an interchange; instead, reduce the number of interchanges performed. There are many sorting algorithms available, each with slightly different characteristics. You may find some are better (cause fewer swaps) on your dataset than others.
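For instance, before hand-tuning individual swaps, it's worth benchmarking against the standard library's sort (a sketch; a and n stand in for your array and element count):

#include <algorithm>
#include <cstddef>

void sort_ints(int *a, std::size_t n)
{
    std::sort(a, a + n); // introsort: O(n log n), with well-tuned element moves
}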
What makes you think you can produce faster assembly than an optimizing compiler?
Even if you'll get it to work properly, all you're likely to achieve is to confuse the optimizer to produce even slower code.
When you do inline assembly, you can change things so that assumptions the compiler has made about register contents are no longer true. Oftentimes EAX is used to pass a parameter or return a value, so trashing EAX might not have much effect, but you clobbered EBX and didn't put it back, and that could cause problems. Try pushing EBX before you use it, then pop it when you are done.
You can use variable names, function names, and labels in assembly code as symbols, but note that things like a[x] are not valid symbols.
Writing more efficient code takes skill and knowledge; using asm does not necessarily help you there.
You can compare the assembly code your compiler produces for the function with and without the inline assembler, to see where you broke it.

What are the differences between using array offsets vs pointer incrementation?

Given 2 functions, which should be faster, if there is any difference at all? Assume that the input data is very large
void iterate1(const char* pIn, int Size)
{
    for (int offset = 0; offset < Size; ++offset)
    {
        doSomething( pIn[offset] );
    }
}
vs
void iterate2(const char* pIn, int Size)
{
    const char* pEnd = pIn + Size;
    while (pIn != pEnd)
    {
        doSomething( *pIn++ );
    }
}
Are there other issues to be considered with either approach?
Chances are, your compiler's optimizer will create a loop induction variable for the first case to turn it into the second. I'd expect no difference after optimizations so I tend to prefer the first style because I find it clearer to read.
Boojum is correct - IF your compiler has a good optimizer and you have it enabled. If that's not the case, or your use of arrays isn't sequential and liable to optimization, using array offsets can be far, far slower.
Here's an example. Back about 1988, we were implementing a window with a simple teletype interface on a Mac II. This consisted of 24 lines of 80 characters. When you got a new line in from the ticker, you scrolled up the top 23 lines and displayed the new one on the bottom. When there was something on the teletype, which wasn't all the time, it came in at 300 baud, which with the serial protocol overhead was about 30 characters per second. So we're not talking something that should have taxed a 16 MHz 68020 at all!
But the guy who wrote this did it like:
char screen[24][80];
and used 2-D array offsets to scroll the characters like this:
int i, j;
for (i = 0; i < 23; i++)
    for (j = 0; j < 80; j++)
        screen[i][j] = screen[i+1][j];
Six windows like this brought the machine to its knees!
Why? Because compilers were stupid in those days, so in machine language, every instance of the inner loop assignment, screen[i][j] = screen[i+1][j], looked kind of like this (Ax and Dx are CPU registers):
Fetch the base address of screen from memory into the A1 register
Fetch i from stack memory into the D1 register
Multiply D1 by a constant 80
Fetch j from stack memory and add it to D1
Add D1 to A1
Fetch the base address of screen from memory into the A2 register
Fetch i from stack memory into the D1 register
Add 1 to D1
Multiply D1 by a constant 80
Fetch j from stack memory and add it to D1
Add D1 to A2
Fetch the value from the memory address pointed to by A2 into D1
Store the value in D1 into the memory address pointed to by A1
So we're talking 13 machine language instructions for each of the 23x80=1840 inner loop iterations, for a total of 23920 instructions, including 3680 CPU-intensive integer multiplies.
We made a few changes to the C source code, so then it looked like this:
int i, j;
register char *a, *b;
for (i = 0; i < 23; i++)
{
    a = screen[i];
    b = screen[i+1];
    for (j = 0; j < 80; j++)
        *a++ = *b++;
}
There are still two machine-language multiplies, but they're in the outer loop, so there are only 46 integer multiplies instead of 3680. And the inner loop *a++ = *b++ statement only consisted of two machine-language operations.
Fetch the value from the memory address pointed to by A2 into D1, and post-increment A2
Store the value in D1 into the memory address pointed to by A1, and post-increment A1.
Given there are 1840 inner loop iterations, that's a total of 3680 CPU-cheap instructions - 6.5 times fewer - and NO integer multiplies. After this, instead of dying at six teletype windows, we never were able to pull up enough to bog the machine down - we ran out of teletype data sources first. And there are ways to optimize this much, much further, as well.
Now, modern compilers will do that kind of optimization for you - IF you ask them to do it, and IF your code is structured in a way that permits it.
But there are still circumstances where compilers can't do that for you - for instance, if you're doing non-sequential operations in the array.
So I've found it's served me well to use pointers instead of array references whenever possible. The performance is certainly never worse, and frequently much, much better.
With a modern compiler there shouldn't be any difference in performance between the two, especially in such simplistic, easily recognizable examples. Moreover, even if the compiler does not recognize their equivalence, i.e. translates each version "literally", there still shouldn't be any noticeable performance difference on a typical modern hardware platform. (Of course, there might be more specialized platforms out there where the difference is noticeable.)
As for other considerations... Conceptually, when you implement an algorithm using index access you impose a random-access requirement on the underlying data structure. When you use pointer ("iterator") access, you impose only a sequential-access requirement. Random access is a stronger requirement than sequential access. For this reason, in my code I prefer to stick to pointer access whenever possible, and use index access only when necessary.
More generally, if an algorithm can be implemented efficiently through sequential access, it is better to do it that way, without involving the unnecessary stronger requirement of random-access. This might prove useful in the future, should a need arise to refactor the code or to change the algorithm.
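To illustrate, iterate2 from the question rewritten against a generic iterator pair expresses exactly the sequential-access requirement, so it works unchanged for pointers, list iterators, or stream iterators (a sketch; doSomething is the question's placeholder):

void doSomething(char c); // the question's placeholder

template <class InputIt>
void iterate3(InputIt first, InputIt last)
{
    for (; first != last; ++first)
        doSomething(*first);
}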
They are almost identical. Both solutions involve a temporary variable, a word-sized increment (of an int or a pointer), and a logical check, each of which should take one assembly instruction. The only difference I see is that the array lookup arr[idx] may require pointer arithmetic before the fetch, while the dereference *ptr requires just the fetch.
My advice is that if it really matters, implement both and see if there's any savings.
To be sure, you must profile in your intended target environment.
That said, my guess is that any modern compiler is going to optimize them both down to very similar (if not identical) code.
If you didn't have an optimizer, the second has a chance of being faster, because you aren't re-computing the pointer on every iteration. But unless Size is a VERY large number (or the routine is called quite often), the difference isn't going to matter to your program's overall execution speed.
The pointer op used to be much faster. Now it's a bit faster, but the compiler may optimize it for you
Historically it was much faster to iterate via *p++ than p[i]; that was part of the motivation for having pointers in the language.
Plus, p[i] often requires a slower multiply op or at least a shift, so the optimization of replacing multiplies in a loop with adds to a pointer was sufficiently important to have a specific name: strength reduction. The subscript also tended to produce bigger code.
However, two things have changed: one is that compilers are much more sophisticated and are generally capable of doing this optimization for you.
The other is that the relative cost of an ALU op versus a memory access has increased. When *p++ was invented, memory and CPU op times were similar. Today, a random desktop machine can do 3 billion integer ops per second, but only about 10 or 20 million random DRAM reads. Cache accesses are faster, and the system will prefetch and stream sequential memory accesses as you step through an array, but it still costs a lot to hit memory, and a bit of subscript fiddling isn't such a big deal.
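A minimal illustration of that strength reduction (with optimizations on, a modern compiler will typically turn the first form into the second by itself):

// subscript form: conceptually computes p + i afresh each iteration
long sum_index(const int *p, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += p[i];
    return s;
}

// strength-reduced form: the scaled index becomes a pointer add
long sum_pointer(const int *p, int n)
{
    long s = 0;
    for (const int *q = p, *end = p + n; q != end; ++q)
        s += *q;
    return s;
}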
Several years ago I asked this exact question. Someone in an interview was failing a candidate for picking the array notation because it was supposedly obviously slower. At that point I compiled both versions and looked at the disassembly. There was one opcode extra in the array notation. This was with Visual C++ (.net?). Based on what I saw I concluded that there is no appreciable difference.
Doing this again, here is what I found:
iterate1(arr, 400); // array notation
011C1027 mov edi,dword ptr [__imp__printf (11C20A0h)]
011C102D add esp,0Ch
011C1030 xor esi,esi
011C1032 movsx ecx,byte ptr [esp+esi+8] <-- Loop starts here
011C1037 push ecx
011C1038 push offset string "%c" (11C20F4h)
011C103D call edi
011C103F inc esi
011C1040 add esp,8
011C1043 cmp esi,190h
011C1049 jl main+32h (11C1032h)
iterate2(arr, 400); // pointer offset notation
011C104B lea esi,[esp+8]
011C104F nop
011C1050 movsx edx,byte ptr [esi] <-- Loop starts here
011C1053 push edx
011C1054 push offset string "%c" (11C20F4h)
011C1059 call edi
011C105B inc esi
011C105C lea eax,[esp+1A0h]
011C1063 add esp,8
011C1066 cmp esi,eax
011C1068 jne main+50h (11C1050h)
Why don't you try both and time them? My guess would be that they are optimized by the compiler into basically the same code. Just remember to turn on optimizations when comparing (-O3).
In the "other considerations" column, I'd say approach one is more clear. That's just my opinion though.
You're asking the wrong question. Should a developer aim for readability or performance first?
The first version is idiomatic for processing array, and your intent will be clear to anyone who has worked with arrays before, whereas the second relies heavily on the equivalence between array names and pointers, forcing someone reading the code to switch metaphors several times.
Cue the comments saying that the second version is crystal clear to any developer worth his keyboard.
If you wrote your program, and it's running slow, and you have profiled to the point where you have identified this loop as the bottleneck, then it would make sense to pop the hood and look at which of these is faster. But get something clear up and running first using well-known idiomatic language constructs.
Performance questions aside, it strikes me that the while loop variant has potential maintainability issues, as a programmer coming along to add some new bells and whistles has to remember to put the array increment in the right place, whereas the for loop variant puts it safely out of the body of the loop.