Is copying in a loop less efficient than memcpy()? - c++

I started to study IT and I am discussing with a friend right now whether this code is inefficient or not.
// const char *pName
// char *m_pName = nullptr;
for (int i = 0; i < strlen(pName); i++)
m_pName[i] = pName[i];
He is claiming that, for example, memcopy would do the same as the for loop above. I wonder if that's true; I don't believe it.
If there are more efficient ways or if this is inefficient, please tell me why!
Thanks in advance!

I took a look at actual g++ -O3 output for your code, to see just how bad it was.
char* can alias anything, so even the __restrict__ GNU C++ extension can't help the compiler hoist the strlen out of the loop.
I was thinking it would be hoisted, and expecting that the major inefficiency here was just the byte-at-a-time copy loop. But no, it's really as bad as the other answers suggest. m_pName even has to be re-loaded every time, because the aliasing rules allow m_pName[i] to alias this->m_pName. The compiler can't assume that storing to m_pName[i] won't change class member variables, or the src string, or anything else.
#include <string.h>
class foo {
char *__restrict__ m_pName = nullptr;
void set_name(const char *__restrict__ pName);
void alloc_name(size_t sz) { m_pName = new char[sz]; }
};
// g++ will only emit a non-inline copy of the function if there's a non-inline definition.
void foo::set_name(const char * __restrict__ pName)
{
// char* can alias anything, including &m_pName, so the loop has to reload the pointer every time
//char *__restrict__ dst = m_pName; // a local avoids the reload of m_pName, but still can't hoist strlen
#define dst m_pName
for (unsigned int i = 0; i < strlen(pName); i++)
dst[i] = pName[i];
}
Compiles to this asm (g++ -O3 for x86-64, SysV ABI):
...
.L7:
movzx edx, BYTE PTR [rbp+0+rbx] ; byte load from src. clang uses mov al, byte ..., instead of movzx. The difference is debatable.
mov rax, QWORD PTR [r12] ; reload this->m_pName
mov BYTE PTR [rax+rbx], dl ; byte store
add rbx, 1
.L3: ; first iteration entry point
mov rdi, rbp ; function arg for strlen
call strlen
cmp rbx, rax
jb .L7 ; compare-and-branch (unsigned)
Using an unsigned int loop counter introduces an extra mov ebx, ebp copy of the loop counter, which you don't get with either int i or size_t i, in both clang and gcc. Presumably they have a harder time accounting for the fact that unsigned i could produce an infinite loop.
So obviously this is horrible:
a strlen call for every byte copied
copying one byte at a time
reloading m_pName every time through the loop (can be avoided by loading it into a local).
Using strcpy avoids all these problems, because strcpy is allowed to assume that its src and dst don't overlap. Don't use strlen + memcpy unless you want to know the length yourself. If the most efficient implementation of strcpy is strlen + memcpy, the library function will do that internally. Otherwise, it will do something even more efficient, like glibc's hand-written SSE2 strcpy for x86-64. (There is an SSSE3 version, but it's actually slower on Intel SnB, and glibc is smart enough not to use it.) Even the SSE2 version may be unrolled more than it should be (great on microbenchmarks, but pollutes the instruction cache, uop-cache, and branch-predictor caches when used as a small part of real code). The bulk of the copying is done in 16B chunks, with 64bit, 32bit, and smaller chunks in the startup/cleanup sections.
Using strcpy of course also avoids bugs like forgetting to store a trailing '\0' character in the destination. If your input strings are potentially gigantic, using int for the loop counter (instead of size_t) is also a bug. Using strncpy is generally better, since you often know the size of the dest buffer, but not the size of the src.
memcpy can be more efficient than strcpy, since rep movs is highly optimized on Intel CPUs, esp. IvB and later. However, scanning the string to find the right length first will always cost more than the difference. Use memcpy when you already know the length of your data.
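To make that concrete, here is a sketch (mine, not the answer's benchmarked code) of both recommended forms, reusing alloc_name from the snippet above: strcpy when you don't need the length, strlen once + memcpy when you do.
void foo::set_name(const char *pName)
{
    // simplest: one pass by the library; m_pName must already point at a big-enough buffer
    // strcpy(m_pName, pName);

    // or, if you need the length anyway (e.g. to allocate), scan once and bulk-copy:
    size_t len = strlen(pName);
    alloc_name(len + 1);              // +1 for the terminating '\0'
    memcpy(m_pName, pName, len + 1);  // copies the '\0' too
}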

At best it's somewhat inefficient. At worst, it's quite inefficient.
In the good case, the compiler recognizes that it can hoist the call to strlen out of the loop. In this case, you end up traversing the input string once to compute the length, and then again to copy to the destination.
In the bad case, the compiler calls strlen every iteration of the loop, in which case the complexity becomes quadratic instead of linear.
As far as how to do it efficiently, I'd tend to do something like this:
char *dest = m_pName;
for (char const *in = pName; *in; ++in)
*dest++ = *in;
*dest++ = '\0';
This traverses the input only once, so it's potentially about twice as fast as the first, even in the better case (and in the quadratic case, it can be many times faster, depending on the length of the string).
Of course, this is doing pretty much the same thing as strcpy would. That may or may not be more efficient still--I've certainly seen cases where it was. Since you'd normally assume strcpy is going to be used quite a lot, it can be worthwhile to spend more time optimizing it than some random guy on the internet typing in an answer in a couple minutes.

Yes, your code is inefficient. Your code takes what is called "O(n^2)" time. Why? You have the strlen() call in your loop, so your code is recalculating the length of the string every single loop. You can make it faster by doing this:
size_t len = strlen(pName);
for (size_t i = 0; i < len; i++)
m_pName[i] = pName[i];
Now, you calculate the string length only once, so this code takes "O(n)" time, which is much faster than O(n^2). This is about as efficient as a byte-at-a-time loop can get. However, a memcpy call would still be 4-8 times faster, because this code copies 1 byte at a time, whereas memcpy copies a machine word at a time.

Depends on the interpretation of efficiency. I'd claim using memcpy() or strcpy() is more efficient, because you don't write such loops every time you need a copy.
He is claiming that, for example, memcopy would do the same as the for loop above.
Well, not exactly the same. memcpy() takes the size once, while strlen(pName) might be called on every loop iteration. So purely from a performance standpoint, memcpy() would be better.
BTW from your commented code:
// char *m_pName = nullptr;
Initializing it like that and then writing through it leads to undefined behavior; you need to allocate memory for m_pName first:
char *m_pName = new char[strlen(pName) + 1];
Why the +1? Because you need room for the '\0' that marks the end of the C-style string.

Yes, it's inefficient, not because you're using a loop instead of memcpy but because you're calling strlen on each iteration. strlen loops over the entire array until it finds the terminating zero byte.
Also, it's very unlikely that the strlen will be optimized out of the loop condition, see In C++, should I bother to cache variables, or let the compiler do the optimization? (Aliasing).
So memcpy(m_pName, pName, strlen(pName)) would indeed be faster.
Even faster would be strcpy, because it avoids the strlen loop:
strcpy(m_pName, pName);
strcpy does the same as the loop in #JerryCoffin's answer.

For simple operations like that you should almost always say what you mean and nothing more.
In this instance if you had meant strcpy() then you should have said that, because strcpy() will copy the terminating NUL character, whereas that loop will not.
Neither one of you can win the debate. A modern compiler has seen a thousand different memcpy() implementations and there's a good chance it's just going to recognise yours and replace your code either with a call to memcpy() or with its own inlined implementation of the same.
It knows which one is best for your situation. Or at least it probably knows better than you do. When you second-guess that you run the risk of the compiler failing to recognise it and your version being worse than the collected clever tricks the compiler and/or library knows.
Here are a few considerations that you have to get right if you want to run your own code instead of the library code:
What's the largest read/write chunk size that is efficient (it's rarely bytes).
For what range of loop lengths is it worth the trouble of pre-aligning reads and writes so that larger chunks can be copied?
Is it better to align reads, align writes, do nothing, or to align both and perform permutations in arithmetic to compensate?
What about using SIMD registers? Are they faster?
How many reads should be performed before the first write? How much register file needs to be used for the most efficient burst accesses?
Should a prefetch instruction be included?
How far ahead?
How often?
Does the loop need extra complexity to avoid preloading over the end?
How many of these decisions can be resolved at run-time without causing too much overhead? Will the tests cause branch prediction failures?
Would inlining help, or is that just wasting icache?
Does the loop code benefit from cache line alignment? Does it need to be packed tightly into a single cache line? Are there constraints on other instructions within the same cache line?
Does the target CPU have dedicated instructions like rep movsb which perform better? Does it have them but they perform worse?
Going further; because memcpy() is such a fundamental operation it's possible that even the hardware will recognise what the compiler's trying to do and implement its own shortcuts that even the compiler doesn't know about.
Don't worry about the superfluous calls to strlen(). Compiler probably knows about that, too. (Compiler should know in some instances, but it doesn't seem to care) Compiler sees all. Compiler knows all. Compiler watches over you while you sleep. Trust the compiler.
Oh, except the compiler might not catch that null pointer reference. Stupid compiler!

This code is confused in various ways.
Just do m_pName = pName; because you're not actually copying the string.
You're just pointing to the one you've already got.
If you want to copy the string m_pName = strdup(pName); would do it.
If you already have storage, strcpy or memcpy would do it.
In any case, get strlen out of the loop.
This is the wrong time to worry about performance.
First get it right.
If you insist on worrying about performance, it's hard to beat strcpy.
What's more, you don't have to worry about it being right.
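As a quick sketch of those three options (strdup is POSIX; the first option needs the member to be pointer-to-const, or a const_cast as shown):
// 1. No copy at all: just alias the caller's string (you don't own it; it must outlive you)
m_pName = const_cast<char *>(pName);
// 2. Duplicate it: strdup allocates and copies in one call (free() it later)
m_pName = strdup(pName);
// 3. Copy into storage you already own (must be at least strlen(pName) + 1 bytes)
strcpy(m_pName, pName);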

As a matter of fact, why do you need to copy at all ??? (either with the loop or memcpy)
If you want to duplicate a memory block, that's a different question, but since it's a pointer, all you need is &pName[0] (which is the address of the first element of the array) and the buffer's size. That's it: you can reference any element of the array by incrementing the address of the first byte, and you know the limit from the size value. Why have all these pointers? (Let me know if there is more to this than a theoretical debate.)

Related

Crash with icc: can the compiler invent writes where none existed in the abstract machine?

Consider the following simple program:
#include <cstring>
#include <cstdio>
#include <cstdlib>
void replace(char *str, size_t len) {
for (size_t i = 0; i < len; i++) {
if (str[i] == '/') {
str[i] = '_';
}
}
}
const char *global_str = "the quick brown fox jumps over the lazy dog";
int main(int argc, char **argv) {
const char *str = argc > 1 ? argv[1] : global_str;
replace(const_cast<char *>(str), std::strlen(str));
puts(str);
return EXIT_SUCCESS;
}
It takes an (optional) string on the command line and prints it, with / characters replaced by _. This replacement functionality is implemented by the replace function1. For example, a.out foo/bar prints:
foo_bar
Elementary stuff so far, right?
If you don't specify a string, it conveniently uses the global string the quick brown fox jumps over the lazy dog, which doesn't contain any / characters, and so doesn't undergo any replacement.
Of course, string constants are const char[], so I need to cast away the constness first - that's the const_cast you see. Since the string is never actually modified, I am under the impression this is legal.
gcc and clang compile a binary that has the expected behavior, with or without passing a string on the command line. icc crashes, when you don't provide a string, however:
icc -xcore-avx2 char_replace.cpp && ./a.out
Segmentation fault (core dumped)
The underlying cause is the main loop for replace, which looks like this:
400c0c: vmovdqu ymm2,YMMWORD PTR [rsi]
400c10: add rbx,0x20
400c14: vpcmpeqb ymm3,ymm0,ymm2
400c18: vpblendvb ymm4,ymm2,ymm1,ymm3
400c1e: vmovdqu YMMWORD PTR [rsi],ymm4
400c22: add rsi,0x20
400c26: cmp rbx,rcx
400c29: jb 400c0c <main+0xfc>
It is a vectorized loop. The basic idea is that 32 bytes are loaded, and then compared against the / character, forming a mask value with a byte set for each byte that matched, and then the existing string is blended against a vector containing 32 _ characters, effectively replacing only the / characters. Finally, the updated register is written back to the string, with the vmovdqu YMMWORD PTR [rsi],ymm4 instruction.
This final store crashes, because the string is read-only and allocated in the .rodata section of the binary, which is loaded using read-only pages. Of course, the store was a logical "no op", writing back the same characters it read, but the CPU doesn't care!
Is my code legal C++ and therefore I should blame icc for miscompiling this, or I am wading into the UB swamp somewhere?
1 The same crash from the same issue occurs with std::replace on a std::string rather than my "C-like" code, but I wanted to simplify the analysis as much as possible and make it entirely self-contained.
Your program is well-formed and free of undefined behaviour, as far as I can tell. The C++ abstract machine never actually assigns to a const object. A not-taken if() is sufficient to "hide" / "protect" things that would be UB if they executed. The only thing an if(false) can't save you from is an ill-formed program, e.g. syntax errors or trying to use extensions that don't exist on this compiler or target arch.
Compilers aren't in general allowed to invent writes with if-conversion to branchless code.
Casting away const is legal, as long as you don't actually assign through it. e.g. for passing a pointer to a function that isn't const-correct, and takes a read-only input with a non-const pointer. The answer you linked on Is it allowed to cast away const on a const-defined object as long as it is not actually modified? is correct.
ICC's behaviour here is not evidence for UB in ISO C++ or C. I think your reasoning is sound, and this is well-defined. You've found an ICC bug. If anyone cares, report it on their forums: https://software.intel.com/en-us/forums/intel-c-compiler. Existing bug reports in that section of their forum have been accepted by developers, e.g. this one.
We can construct an example where it auto-vectorizes the same way (with unconditional and non-atomic read/maybe-modify/rewrite) where it's clearly illegal, because the read / rewrite is happening on a 2nd string that the C abstract machine doesn't even read.
Thus, we can't trust ICC's code-gen to tell us anything about when we've caused UB, because it will make crashing code even in clearly legal cases.
Godbolt: ICC19.0.1 -O2 -march=skylake (Older ICC only understood options like -xcore-avx2, but modern ICC understands the same -march as GCC/clang.)
#include <stddef.h>
void replace(const char *str1, char *str2, size_t len) {
for (size_t i = 0; i < len; i++) {
if (str1[i] == '/') {
str2[i] = '_';
}
}
}
It checks for overlap between str1[0..len-1] and str2[0..len-1], but for large enough len and no overlap it will use this inner loop:
..B1.15: # Preds ..B1.15 ..B1.14 //do{
vmovdqu ymm2, YMMWORD PTR [rsi+r8] #6.13 // load from str2
vpcmpeqb ymm3, ymm0, YMMWORD PTR [rdi+r8] #5.24 // compare vs. str1
vpblendvb ymm4, ymm2, ymm1, ymm3 #6.13 // blend
vmovdqu YMMWORD PTR [r8+rsi], ymm4 #6.13 // store to str2
add r8, 32 #4.5 // i+=32
cmp r8, rax #4.5
jb ..B1.15 # Prob 82% #4.5 // }while(i<len);
For thread-safety, it's well known that inventing writes via a non-atomic read/rewrite is unsafe.
The C++ abstract machine never touches str2 at all, so that invalidates any arguments for the one-string version about data-race UB being impossible because reading str at the same time another thread is writing it was already UB. Even C++20 std::atomic_ref doesn't change that, because we're reading through a non-atomic pointer.
But even worse than that, str2 can be nullptr. Or pointing to close to the end of an object (which happens to be stored near the end of a page), with str1 containing chars such that no writes past the end of str2 / the page will happen. We could even arrange for only the very last byte (str2[len-1]) to be in a new page, so that it's one-past-the-end of a valid object. It's even legal to construct such a pointer (as long as you don't deref). But it would be legal to pass str2=nullptr; code behind an if() that doesn't run doesn't cause UB.
Or another thread is running the same search/replace function in parallel, with a different key/replacement that will only write different elements of str2. The non-atomic load/store of unmodified values will step on modified values from the other thread. It's definitely allowed, according to the C++11 memory model, for different threads to simultaneously touch different elements of the same array. C++ memory model and race conditions on char arrays. (This is why char must be as large as the smallest unit of memory the target machine can write without a non-atomic RMW. A byte store implemented internally as an atomic RMW into cache is fine, though, and doesn't stop byte-store instructions from being useful.)
(This example is only legal with the separate str1/str2 version, because reading every element means the threads would be reading array elements the other thread could be in the middle of writing, which is data-race UB.)
As Herb Sutter mentioned in atomic<> Weapons: The C++ Memory Model and Modern Hardware Part 2: Restrictions on compilers and hardware (incl. common bugs); code generation and performance on x86/x64, IA64, POWER, ARM, and more; relaxed atomics; volatile: weeding out non-atomic RMW code-gen has been an ongoing issue for compilers after C++11 was standardized. We're most of the way there, but highly-aggressive and less-mainstream compilers like ICC clearly still have bugs.
(However, I'm pretty confident that Intel compiler devs would consider this a bug.)
Some less-plausible (to see in a real program) examples that this would also break:
Besides nullptr, you could pass a pointer to (an array of) std::atomic<T> or a mutex where a non-atomic read/rewrite breaks things by inventing writes. (char* can alias anything).
Or str2 points to a buffer that you've carved up for dynamic allocation, and the early part of str1 will have some matches, but later parts of str1 won't have any matches, and that part of str2 is being used by other threads. (And for some reason you can't easily calculate a length that stops the loop short).
For future readers: If you want to let compilers auto-vectorize this way:
You can write source like str2[i] = x ? replacement : str2[i]; that always writes the string in the C++ abstract machine. IIRC, that lets gcc/clang vectorize the way ICC does after doing its unsafe if-conversion to blend.
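A sketch of that source form, applied to the two-pointer replace from above: the store happens on every iteration in the abstract machine, so a compiler can use plain vector stores without inventing writes.
#include <stddef.h>
void replace_always_store(const char *str1, char *str2, size_t len) {
    for (size_t i = 0; i < len; i++) {
        // unconditionally store either the replacement or the old value
        str2[i] = (str1[i] == '/') ? '_' : str2[i];
    }
}
Of course this only helps when str2 really is writable for all len bytes; it deliberately gives up the nullptr / read-only cases discussed above.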
In theory an optimizing compiler can turn it back into a conditional branch in the scalar cleanup or whatever to avoid dirtying memory unnecessarily. (Or if targeting an ISA like ARM32 where a predicated store is possible, instead of only ALU select operations like x86 cmov, PowerPC isel, or AArch64 csel. ARM32 predicated instructions are architecturally a NOP if the predicate is false).
Or if an x86 compiler chose to use AVX512 masked stores, that would also make it safe to vectorize the way ICC does: masked stores do fault suppression, and never actually store to elements where the mask is false. (When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements?).
vpcmpeqb k1, zmm0, [rdi] ; compare from memory into mask
vmovdqu8 [rsi]{k1}, zmm1 ; masked store that only writes elements where the mask is true
ICC19 actually does basically this (but with indexed addressing modes) with -march=skylake-avx512. But with ymm vectors because 512-bit lowers max turbo too much to be worth it unless your whole program is heavily using AVX512, on Skylake Xeons anyway.
So I think ICC19 is safe when vectorizing this with AVX512, but not AVX2. Unless there are problems in its cleanup code where it does something more complicated with vpcmpuq and kshift / kor, a zero-masked load, and a masked compare into another mask reg.
AVX1 has masked stores (vmaskmovps/pd) with fault-suppression and everything, but until AVX512BW there's no granularity narrower than 32 bits. The AVX2 integer versions are only available in dword/qword granularity, vpmaskmovd/q.

embed a function's assembly code in a struct

I've a rather special question: is it possible in C/C++ (I am sure the question is the same in both languages) to specify a function's location? Why? I have a very large list of function pointers, and I want to eliminate them.
(Currently) it looks like this (repeated over, like, a million times, stored in the user's RAM):
struct {
int i;
void(* funptr)();
} test;
Because I know that in most assembly languages, functions are just "goto" directives, I had the following idea. Is it possible to optimize the above construct so that it looks like this?
struct {
int i;
// embed the assembler of the function here
// so that all the functions
// instructions are located here
// like this: mov rax, rbx
// jmp _start ; just demo code
} test2;
In the end, the thing should look like this in memory: an int holding any value, followed by the function's assembly code, referenced by test2. I should be able to call these functions like this: ((void(*)()) (&pointerToTheStruct + sizeof(int)))();
You might think that I'm insane to optimize the app that way, and I cannot disclose any more details on its function, but if anyone has some pointers on how to solve this problem, I would appreciate it.
I do not think that there is a standard way to this, so any hacky way to do this via inline assembler/other crazy things is also appreciated!
The only thing you really have to do is make the compiler aware of the (constant) value of the function pointer you want in the struct. The compiler will then (presumably/hopefully) inline that function call wherever it sees it called through that function pointer:
template<void(*FPtr)()>
struct function_struct {
int i;
static constexpr auto funptr = FPtr;
};
void testFunc()
{
volatile int x = 0;
}
using test = function_struct<testFunc>;
int main()
{
test::funptr();
}
Demo - no call or jmp after optimization.
It remains unclear what the point of the int i is. Note that the code is not technically "directly after the i" here, but it is even more unclear how you'd expect instances of the struct to look like (is the code in them or is it "static" in a way? I feel there is some misunderstanding here on your part what compilers actually produce...). But consider the ways that compiler inlining can help you and you might find the solution you need. If you're worried about executable size after inlining, tell the compiler and it will compromise between speed and size.
This sounds like a terrible idea for a lot of reasons: it probably won't save memory, and it will hurt performance by diluting the L1I cache with data and the L1D cache with code. And it's worse if you ever modify or copy objects: self-modifying code stalls.
But yes, this would be possible in C99/C11 with a flexible array member at the end of the struct, which you cast to a function pointer.
struct int_with_code {
int i;
char code[]; // C99 flexible array member. GNU extension in C++
// Store machine code here
// you can't get the compiler to do this for you. Good Luck!
};
void foo(struct int_with_code *p) {
// explicit C-style cast compiles as both C and C++
void (*funcp)(void) = ( void (*)(void) ) p->code;
funcp();
}
Compiler output from clang7.0, on the Godbolt compiler explorer is the same when compiled as either C or C++. This is targeting the x86-64 System V ABI, where the first function arg is passed in RDI.
# this is the code that *uses* such an object, not the code that goes in its code[]
# This proves that it compiles,
# without showing any way to get compiler-generated code into code[]
foo: # #foo
add rdi, 4 # move the pointer 4 bytes forward, to point at code[]
jmp rdi # TAILCALL
(If you leave out the (void) arg-type declaration in C, the compiler will zero AL first in the x86-64 SysV calling convention, in case it's actually a variadic function, because it's passing no FP args in registers.)
You'd have to allocate your objects in memory that was executable (normally not done unless they're const with static storage), e.g. compile with gcc -zexecstack. Or use a custom mmap/mprotect or VirtualAlloc/VirtualProtect on POSIX or Windows.
Or if your objects are all statically allocated, it might be possible to massage compiler output to turn functions in the .text section into objects by adding an int member right before each one. Maybe with some .section and linker tricks, and maybe a linker script, you could even somehow automate it.
But unless they're all the same length (e.g. with padding like char code[60]), that won't form an array you can index, so you'll need some way of referencing all these variable-length objects.
There are potentially huge performance downsides if you ever modify an object before calling its function: on x86 you'll get self-modifying-code pipeline nuke for executing code near a just-written memory location.
Or if you copied an object before calling its function: x86 pipeline flush, or on other ISAs you need to manually flush caches to get the I-cache in sync with D-cache (so the newly-written bytes can be executed). But you can't copy such objects because their size isn't stored anywhere. You can't search the machine code for a ret instruction, because a 0xc3 byte might appear somewhere that's not the start of an x86 instruction. Or on any ISA, the function might have multiple ret instructions (tail duplication optimization). Or end with a jmp instead of a ret (tailcall).
Storing a size would start to defeat the purpose of saving size, eating up at least an extra byte in each object.
Writing code to an object at runtime, then casting to a function pointer, is undefined behaviour in ISO C and C++. On GNU C/C++, make sure you call __builtin___clear_cache on it to sync caches or whatever else is necessary. Yes, this is needed even on x86 to disable dead-store elimination optimizations: see this test case. On x86 it's just a compile-time thing, no extra asm. It doesn't actually clear any caches.
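A minimal GNU C++ / Linux x86-64 sketch of that runtime path (everything here is outside ISO C++: mmap, the PROT flags, and the cast from a data pointer to a function pointer are platform/GNU-specific, and some hardened systems refuse writable+executable mappings):
#include <sys/mman.h>
#include <cstring>

int main() {
    // machine code for: mov eax, 42 ; ret   (x86-64)
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };
    void *buf = mmap(nullptr, sizeof(code), PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;
    std::memcpy(buf, code, sizeof(code));
    // sync I-cache with D-cache and stop the compiler treating the memcpy as a dead store
    __builtin___clear_cache(static_cast<char *>(buf), static_cast<char *>(buf) + sizeof(code));
    auto fn = reinterpret_cast<int (*)()>(buf);   // conditionally-supported conversion
    return fn();                                  // exits with status 42
}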
If you do copy at runtime startup, maybe allocate a big chunk of memory and carve out variable-length chunks of it, while copying. If you malloc each separately, you're wasting memory-management overhead on it.
This idea will not save you memory unless you have about as many functions as you have objects
Normally you have a fairly limited number of actual functions, with many objects having copies of the same function pointer. (You've kind of hand-rolled C++ virtual functions, but with only one function you just have a function pointer directly instead of a vtable pointer to a table of pointers for that class type. One fewer levels of indirection, and apparently you're not passing the object's own address to the function.)
One of the several benefits of this level of indirection is that one pointer is usually significantly smaller than the entire code for a function. For that to not be the case, your functions would have to be tiny.
Example: with 10 different functions of 32 bytes each, and 1000 objects with function pointers, you have a total of 320 bytes of code (which will stay hot in I-cache), and 8000 bytes of function pointers. (And in your objects, another 4 bytes per object wasted on padding to align the pointer, making the total size 16 instead of 12 bytes per object.) Anyway, that's 16320 bytes total for entire structs + code. If you allocated each object separately, there's per-object bookkeeping.
With inlining machine code into each object, and no padding, that's 1000 * (4+32) = 36000 bytes, over twice the total size.
x86-64 is probably a best-case scenario, where a pointer is 8 bytes and x86-64 machine code uses a (famously complex) variable-length instruction encoding which allows for high code density in some cases, especially when optimizing for code-size. (e.g. code-golfing. https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code). But unless your functions are mostly something trivial like lea eax, [rdi + rdi*2] (3 bytes=opcode + ModRM + SIB) / ret (1 byte), they're still going to take more than 8 bytes. (That's return x*3; for a function that takes a 32-bit integer x arg, in the x86-64 System V ABI.)
If they're wrappers for larger functions, a normal call rel32 instruction is 5 bytes. A load of static data is at least 6 bytes (opcode + modrm + rel32 for a RIP-relative addressing mode, or loading EAX specifically can use the special no-modrm encoding for an absolute address. But in x86-64 that's a 64-bit absolute unless you use an address-size prefix too, potentially causing an LCP stall in the decoders on Intel. mov eax, [32 bit absolute address] = addr32 (0x67) + opcode + abs32 = 6 bytes again, so this is worse for no benefit).
Your function-pointer type doesn't have any args (assuming this is C++ where foo() means foo(void) in a declaration, not like old C where an empty arg list is somewhat similar to (...)). Thus we can assume you're not passing args, so to do anything useful the functions are probably accessing some static data or making another call.
Ideas that make more sense:
Use an ILP32 ABI like Linux x32, where the CPU runs in 64-bit mode but your code uses 32-bit pointers. This would make each of your objects only 8 bytes instead of 16. Avoiding pointer-bloat is a classic use-case for x32 or ILP32 ABIs in general.
Or (yuck) compile your code as 32-bit. But then you have obsolete 32-bit calling conventions that pass args on the stack instead of registers, and less than half the registers, and much higher overhead for position-independent code. (No EIP/RIP-relative addressing.)
Store an unsigned int table index to a table of function pointers. If you have 100 functions but 10k objects, the table is only 100 pointers long. In asm you could index an array of code directly (computed goto style) if all the functions were padded to the same length, but in C++ you can't do that. An extra level of indirection with a table of function pointers is probably your best bet.
e.g.
void (*const fptrs[])(void) = {
func1, func2, func3, ...
};
struct int_with_func {
int i;
unsigned f;
};
void bar(struct int_with_func *p) {
fptrs[p->f] ();
}
clang/gcc -O3 output:
bar(int_with_func*):
mov eax, dword ptr [rdi + 4] # load p->f
jmp qword ptr [8*rax + fptrs] # TAILCALL # index the global table with it for a memory-indirect jmp
If you were compiling a shared library, PIE executable, or not targeting Linux, the compiler couldn't use a 32-bit absolute address to index a static array with one instruction. So there'd be a RIP-relative LEA in there and something like jmp [rcx+rax*8].
This is an extra level of indirection vs. storing a function pointer in each object, but it lets you shrink each object to 8 bytes, down from 16, like using 32-bit pointers. Or to 5 or 6 bytes, if you use an unsigned short or uint8_t and pack the structs with __attribute__((packed)) in GNU C.
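A sketch of that smallest variant (the names and the empty functions are placeholders; __attribute__((packed)) is the GNU extension mentioned above, and the size check assumes a 4-byte int):
#include <cstdint>

void func1() {} void func2() {} void func3() {}   // placeholder functions for the table

void (*const fptrs[])() = { func1, func2, func3 };

struct __attribute__((packed)) int_with_func {
    int i;
    std::uint8_t f;                 // index into fptrs: up to 256 distinct functions
};

static_assert(sizeof(int_with_func) == 5, "4-byte int + 1-byte index, no padding");

void bar(const int_with_func *p) {
    fptrs[p->f]();                  // one table lookup, then an indirect call/jmp
}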
No, not really.
The way to specify a function's location is to use a function pointer, which you're already doing.
You could make different types which have their own different member functions, but then you're back to the original problem.
I have in the past experimented with auto-generating (as a pre-build step, using Python) a function with a long switch statement that does the work of mapping int i to a normal function call. This gets rid of the function pointers, at the expense of branching. I don't remember whether it ended up being worthwhile in my case and, even if I did, that wouldn't tell us whether it's worthwhile in your case.
Because I know that in most assembly languages, functions are just "goto" directives
Well, it's perhaps a little more complicated than that…
You might think that I'm insane to optimize the app that way
Perhaps. Trying to eliminate indirection is not, in itself, a bad thing, so I don't think you're wrong to try to improve this. I just don't think that you necessarily can.
but if anyone has some pointers
lol
I don't understand the goal of this "optimization". Is it about saving memory?
I might be misunderstanding the question, but if you just replace your function pointer with a regular member function, then your struct contains only the int as data; the function's address is inserted by the compiler wherever you call it, instead of being stored in memory.
So just do
struct {
int i;
void func();
} test;
Then sizeof(test)==sizeof(int) should hold true if you set alignment/packing to be tight.
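On mainstream compilers you can check that claim directly; a non-virtual member function takes no per-object storage, so the struct is the size of its one data member:
struct test2 {
    int i;
    void func();                    // declared, but never stored in the object
};
static_assert(sizeof(test2) == sizeof(int), "only the int occupies storage");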

Assignment vs memcpy - which will be faster in this case

Which of the two is faster?
1.
char* _pos ..;
short value = ..;
*((short*)_pos) = value;
2.
char* _pos ..;
short value = ..;
memcpy(_pos, &value, sizeof(short));
As with all "which is faster?" questions, you should benchmark it to see for yourself. And if it matters, then ask why and pick which you want.
In any case, your first example is technically undefined behavior since you are violating strict-aliasing. So if you had to choose without benchmarking, go with the second one.
To answer the actual question, which is faster will probably depend on the alignment of pos. If it's aligned properly, then 1 will probably be faster. If not, then 2 might be faster depending on how it's optimized by the compiler. (1 might even crash if the hardware doesn't support misaligned access.)
But this is all guess-work. You really need to benchmark it to know for sure.
At the very least, you should look at the compiled assembly:
: *(short *)_pos = value;
mov WORD PTR [rcx], dx
vs.
: memcpy(_pos, &value, sizeof(short));
mov WORD PTR [rcx], dx
Which in this case (in MSVC) shows the exact same assembly with default optimizations. So you can expect the performance to be the same.
With gcc at an optimization level of -O1 or higher, the following two functions compile to exactly the same machine code on x86:
void foo(char *_pos, short value)
{
memcpy(_pos, &value, sizeof(short));
}
void bar(char *_pos, short value)
{
*(short *)_pos = value;
}
The compiler might implement them both the same way.
If it does it naively, assignment will be faster.
For any practical purpose, they'll both be done in no time, and you don't need to worry.
Also note that you may have alignment problems (_pos may not be aligned on 2 bytes, which may crash on some processors), and type punning problems (the compiler may assume that what _pos points to isn't changed, because you wrote through a short *).
Does it matter? It might be that the first case will save you some cycles (depends on the compiler's sophistication and optimizations). But is it worth the readability and maintainability hit?
Many bugs are introduced because of premature optimization. You should first identify the bottleneck, and if this assignment is that bottleneck - benchmark each of the options (taking care of alignment and other issues mentioned here by others already).
The question is implementation-dependent. In practice, for doing nothing but copying sizeof(short) bytes, if one is going to be slower, it's going to be memcpy. For considerably larger data sets, if one is going to be faster, it's generally going to be memcpy.
As pointed out, #1 invokes undefined behavior.
We can see that simple assignment is certainly easier to read and write and less error prone than both. Clarity and correctness should come first, even in performance-critical areas for the simple reason that it's easier to optimize correct code than it is to fix optimized, incorrect code. If this is really a C++ question, the need for such code (casts or memcpy that bulldoze over the type system to x-ray and copy around bits) should be very, very rare.
If you are certain that there won't be an alignment issue, and you really find this is a bottleneck situation then go ahead and do the first.
If you are unhappy calling memcpy then do something like:
*pos = static_cast<char>(value & 0xff );
*(pos+1) = static_cast<char>(value >> 8 );
although if you are going to do that then use unsigned values.
The above code ensures you get little-endian too. (Obviously reverse the order of the assignments if you want big-endian). You might want a consistent endian-ness if the data is passed around as some kind of binary blob, which is, I assume, what you are trying to create.
You might wish to use something like google protocol buffers if you want to create binary blobs. There is also boost::serialize which includes binary serialization.
You can avoid breaking aliasing rules and calling a function by using a union:
union {
char* c;
short* s;
} _pos;
_pos.c = ...; // point the union at the destination buffer
short value = ...
*_pos.s = value;

What are the differences between using array offsets vs pointer incrementation?

Given 2 functions, which should be faster, if there is any difference at all? Assume that the input data is very large
void iterate1(const char* pIn, int Size)
{
for ( int offset = 0; offset < Size; ++offset )
{
doSomething( pIn[offset] );
}
}
vs
void iterate2(const char* pIn, int Size)
{
const char* pEnd = pIn+Size;
while(pIn != pEnd)
{
doSomething( *pIn++ );
}
}
Are there other issues to be considered with either approach?
Chances are, your compiler's optimizer will create a loop induction variable for the first case to turn it into the second. I'd expect no difference after optimizations so I tend to prefer the first style because I find it clearer to read.
Boojum is correct - IF your compiler has a good optimizer and you have it enabled. If that's not the case, or your use of arrays isn't sequential and liable to optimization, using array offsets can be far, far slower.
Here's an example. Back about 1988, we were implementing a window with a simple teletype interface on a Mac II. This consisted of 24 lines of 80 characters. When you got a new line in from the ticker, you scrolled up the top 23 lines and displayed the new one on the bottom. When there was something on the teletype, which wasn't all the time, it came in at 300 baud, which with the serial protocol overhead was about 30 characters per second. So we're not talking something that should have taxed a 16 MHz 68020 at all!
But the guy who wrote this did it like:
char screen[24][80];
and used 2-D array offsets to scroll the characters like this:
int i, j;
for (i = 0; i < 23; i++)
for (j = 0; j < 80; j++)
screen[i][j] = screen[i+1][j];
Six windows like this brought the machine to its knees!
Why? Because compilers were stupid in those days, so in machine language, every instance of the inner loop assignment, screen[i][j] = screen[i+1][j], looked kind of like this (Ax and Dx are CPU registers):
Fetch the base address of screen from memory into the A1 register
Fetch i from stack memory into the D1 register
Multiply D1 by a constant 80
Fetch j from stack memory and add it to D1
Add D1 to A1
Fetch the base address of screen from memory into the A2 register
Fetch i from stack memory into the D1 register
Add 1 to D1
Multiply D1 by a constant 80
Fetch j from stack memory and add it to D1
Add D1 to A2
Fetch the value from the memory address pointed to by A2 into D1
Store the value in D1 into the memory address pointed to by A1
So we're talking 13 machine language instructions for each of the 23x80=1840 inner loop iterations, for a total of 23920 instructions, including 3680 CPU-intensive integer multiplies.
We made a few changes to the C source code, so then it looked like this:
int i, j;
register char *a, *b;
for (i = 0; i < 23; i++)
{
a = screen[i];
b = screen[i+1];
for (j = 0; j < 80; j++)
*a++ = *b++;
}
There are still two machine-language multiplies, but they're in the outer loop, so there are only 46 integer multiplies instead of 3680. And the inner loop *a++ = *b++ statement only consisted of two machine-language operations.
Fetch the value from the memory address pointed to by A2 into D1, and post-increment A2
Store the value in D1 into the memory address pointed to by A1, and post-increment A1.
Given there are 1840 inner loop iterations, that's a total of 3680 CPU-cheap instructions - 6.5 times fewer - and NO integer multiplies. After this, instead of dying at six teletype windows, we never were able to pull up enough to bog the machine down - we ran out of teletype data sources first. And there are ways to optimize this much, much further, as well.
Now, modern compilers will do that kind of optimization for you - IF you ask them to do it, and IF your code is structured in a way that permits it.
But there are still circumstances where compilers can't do that for you - for instance, if you're doing non-sequential operations in the array.
So I've found it's served me well to use pointers instead of array references whenever possible. The performance is certainly never worse, and frequently much, much better.
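For comparison, a modern sketch of that same scroll, assuming the 24x80 screen layout from the story: a single overlapping bulk copy, which the library routine performs in large chunks rather than byte by byte.
#include <cstring>

char screen[24][80];

void scroll_up() {
    std::memmove(screen[0], screen[1], 23 * 80);   // move rows 1..23 up by one row (ranges overlap)
}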
With a modern compiler there shouldn't be any difference in performance between the two, especially in such simplistic, easily recognizable examples. Moreover, even if the compiler does not recognize their equivalence, i.e. translates each version "literally", there still shouldn't be any noticeable performance difference on a typical modern hardware platform. (Of course, there might be more specialized platforms out there where the difference might be noticeable.)
As for other considerations... Conceptually, when you implement an algorithm using the index access you impose a random-access requirement on the underlying data structure. When you use a pointer ("iterator") access, you only impose a sequential-access requirement on the underlying data structure. Random-access is a stronger requirement than sequential-access. For this reason I, for one, in my code prefer to stick to pointer access whenever possible, and use index access only when necessary.
More generally, if an algorithm can be implemented efficiently through sequential access, it is better to do it that way, without involving the unnecessary stronger requirement of random-access. This might prove useful in the future, should a need arise to refactor the code or to change the algorithm.
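One way to make that requirement explicit in C++ is to write the loop against iterators rather than indices, so the same code works for a raw array, a std::vector, or a purely sequential container like std::list (this is essentially what std::for_each does):
template <typename FwdIt, typename Fn>
void iterate(FwdIt first, FwdIt last, Fn doSomething) {
    for (; first != last; ++first)      // only sequential access is required of FwdIt
        doSomething(*first);
}
// usage: iterate(pIn, pIn + Size, doSomething);  or  iterate(myList.begin(), myList.end(), f);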
They are almost identical. Both solutions involve a temporary variable, an increment of a word on your system (int or ptr), and a logical check which should take one assembly instruction.
The only difference I see is the array lookup
arr[idx]
might require pointer arithmetic then a fetch while the dereference:
*ptr
just requires a fetch
My advice is that if it really matters, implement both and see if there's any savings.
To be sure, you must profile in your intended target environment.
That said, my guess is that any modern compiler is going to optimize them both down to very similar (if not identical) code.
If you didn't have an optimizer, the second has a chance of being faster, because you aren't re-computing the pointer on every iteration. But unless Size is a VERY large number (or the routine is called quite often), the difference isn't going to matter to your program's overall execution speed.
The pointer op used to be much faster. Now it's a bit faster, but the compiler may optimize it for you
Historically it was much faster to iterate via *p++ than p[i]; that was part of the motivation for having pointers in the language.
Plus, p[i] often requires a slower multiply op or at least a shift, so the optimization of replacing multiplies in a loop with adds to a pointer was sufficiently important to have a specific name: strength reduction. The subscript also tended to produce bigger code.
However, two things have changed: one is that compilers are much more sophisticated and are generally capable of doing this optimization for you.
The other is that the relative difference between an op and a memory access has increased. When *p++ was invented memory and cpu op times were similar. Today, a random desktop machine can do 3 billion integer ops / second, but only about 10 or 20 million random DRAM reads. Cache accesses are faster, and the system will prefetch and stream sequential memory accesses as you step through an array, but it still costs a lot to hit memory, and a bit of subscript fiddling isn't such a big deal.
Several years ago I asked this exact question. Someone in an interview was failing a candidate for picking the array notation because it was supposedly obviously slower. At that point I compiled both versions and looked at the disassembly. There was one opcode extra in the array notation. This was with Visual C++ (.net?). Based on what I saw I concluded that there is no appreciable difference.
Doing this again, here is what I found:
iterate1(arr, 400); // array notation
011C1027 mov edi,dword ptr [__imp__printf (11C20A0h)]
011C102D add esp,0Ch
011C1030 xor esi,esi
011C1032 movsx ecx,byte ptr [esp+esi+8] <-- Loop starts here
011C1037 push ecx
011C1038 push offset string "%c" (11C20F4h)
011C103D call edi
011C103F inc esi
011C1040 add esp,8
011C1043 cmp esi,190h
011C1049 jl main+32h (11C1032h)
iterate2(arr, 400); // pointer offset notation
011C104B lea esi,[esp+8]
011C104F nop
011C1050 movsx edx,byte ptr [esi] <-- Loop starts here
011C1053 push edx
011C1054 push offset string "%c" (11C20F4h)
011C1059 call edi
011C105B inc esi
011C105C lea eax,[esp+1A0h]
011C1063 add esp,8
011C1066 cmp esi,eax
011C1068 jne main+50h (11C1050h)
Why don't you try both and time them? My guess would be that they are optimized by the compiler into basically the same code. Just remember to turn on optimizations when comparing (-O3).
In the "other considerations" column, I'd say approach one is more clear. That's just my opinion though.
You're asking the wrong question. Should a developer aim for readability or performance first?
The first version is idiomatic for processing array, and your intent will be clear to anyone who has worked with arrays before, whereas the second relies heavily on the equivalence between array names and pointers, forcing someone reading the code to switch metaphors several times.
Cue the comments saying that the second version is crystal clear to any developer worth his keyboard.
If you wrote your program, and it's running slow, and you have profiled to the point where you have identified this loop as the bottleneck, then it would make sense to pop the hood and look at which of these is faster. But get something clear up and running first using well-known idiomatic language constructs.
Performance questions aside, it strikes me that the while loop variant has potential maintainability issues, as a programmer coming along to add some new bells and whistles has to remember to put the array increment in the right place, whereas the for loop variant puts it safely out of the body of the loop.

Why is there no Z80 like LDIR functionality in C/C++/rtl?

In Z80 machine code, there is a cheap technique to initialize a buffer to a fixed value, say all blanks. So a chunk of code might look something like this.
LD HL, DESTINATION ; point to the source
LD DE, DESTINATION + 1 ; point to the destination
LD BC, DESTINATION_SIZE - 1 ; copying this many bytes
LD (HL), 0X20 ; put a seed space in the first position
LDIR ; move 1 to 2, 2 to 3...
The result being that the chunk of memory at DESTINATION is completely blank filled.
I have experimented with memmove, and memcpy, and can't replicate this behavior. I expected memmove to be able to do it correctly.
Why do memmove and memcpy behave this way?
Is there any reasonable way to do this sort of array initialization?
I am already aware of char array[size] = {0} for array initialization
I am already aware that memset will do the job for single characters.
What other approaches are there to this issue?
There was a quicker way of blanking an area of memory using the stack. Although the use of LDI and LDIR was very common, David Webb (who pushed the ZX Spectrum in all sorts of ways like full screen number countdowns including the border) came up with this technique which is 4 times faster:
It saves the Stack Pointer and then moves it to the end of the screen, LOADs the HL register pair with zero, then goes into a massive loop PUSHing HL onto the Stack. The Stack moves up the screen and down through memory and, in the process, clears the screen.
The explanation above was taken from the review of David Webb's game Starion.
The Z80 routine might look a little like this:
DI ; disable interrupts which would write to the stack.
LD HL, 0
ADD HL, SP ; save stack pointer
EX DE, HL ; in DE register
LD HL, 0
LD C, 0x18 ; Screen size in pages
LD SP, 0x4000 ; End of screen
PAGE_LOOP:
LD B, 128 ; inner loop iterates 128 times
LOOP:
PUSH HL ; effectively *--SP = 0; *--SP = 0;
DJNZ LOOP ; loop for 256 bytes
DEC C
JP NZ,PAGE_LOOP
EX DE, HL
LD SP, HL ; restore stack pointer
EI ; re-enable interrupts
However, that routine is a little under twice as fast. LDIR copies one byte every 21 cycles. The inner loop copies two bytes every 24 cycles -- 11 cycles for PUSH HL and 13 for DJNZ LOOP. To get nearly 4 times as fast simply unroll the inner loop:
LOOP:
PUSH HL
PUSH HL
...
PUSH HL ; repeat 128 times
DEC C
JP NZ,LOOP
That is very nearly 11 cycles every two bytes which is about 3.8 times faster than the 21 cycles per byte of LDIR.
Undoubtedly the technique has been reinvented many times. For example, it appeared earlier in sub-Logic's Flight Simulator 1 for the TRS-80 in 1980.
memmove and memcpy don't work that way because it's not a useful semantic for moving or copying memory. It's handy on the Z80 to be able to fill memory this way, but why would you expect a function named "memmove" to fill memory with a single byte? It's for moving blocks of memory around, and it's implemented to get the right answer (the source bytes are moved to the destination) regardless of how the blocks overlap, because getting the right answer is what's useful for moving memory blocks.
If you want to fill memory, use memset, which is designed to do just what you want.
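In C or C++ the blank-fill from the question is a one-liner; a sketch, with DESTINATION_SIZE standing in for the constant from the Z80 listing:
#include <cstring>

constexpr std::size_t DESTINATION_SIZE = 1024;      // stand-in value for the Z80 listing's constant
char destination[DESTINATION_SIZE];

void blank_fill() {
    std::memset(destination, ' ', DESTINATION_SIZE); // 0x20 in every byte, like the LDIR trick
}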
I believe this goes to the design philosophy of C and C++. As Bjarne Stroustrup once said, one of the major guiding principles of the design of C++ is "What you don’t use, you don’t pay for". And while Dennis Ritchie may not have said it in exactly those same words, I believe that was a guiding principle informing his design of C (and the design of C by subsequent people) as well. Now you may think that if you allocate memory it should automatically be initialized to NULL's and I'd tend to agree with you. But that takes machine cycles and if you're coding in a situation where every cycle is critical, that may not be an acceptable trade-off. Basically C and C++ try to stay out of your way--hence if you want something initialized you have to do it yourself.
The Z80 sequence you show was the fastest way to do that - in 1978. That was 30 years ago. Processors have progressed a lot since then, and today that's just about the slowest way to do it.
Memmove is designed to work when the source and destination ranges overlap, so you can move a chunk of memory up by one byte. That's part of its behavior as specified by the C and C++ standards. Memcpy's behavior on overlapping ranges is undefined; it might work identically to memmove, or it might not, depending on how your compiler decides to implement it. The compiler is free to choose a method that is more efficient than memmove.
Why do memmove and memcpy behave this way?
Probably because there’s no specific, modern C++ compiler that targets the Z80 hardware? Write one. ;-)
The languages don't specify how a given piece of hardware implements anything. This is entirely up to the programmers of the compiler and libraries. Of course, writing your own, highly specialized version for every imaginable hardware configuration is a lot of work. That'll be the reason.
Is there any reasonable way to do this sort of array initialization?
Well, if all else fails you could always use inline assembly. Other than that, I expect std::fill to perform best in a good STL implementation. And yes, I’m fully aware that my expectations are too high and that std::memset often performs better in practice.
If you're fiddling at the hardware level, then some CPUs have DMA controllers that can fill blocks of memory exceedingly quickly (much faster than the CPU could ever do). I've done this on a Freescale i.MX21 CPU.
This can be accomplished in x86 assembly just as easily. In fact, it boils down to nearly identical code to your example.
mov esi, source ; set esi to be the source
lea edi, [esi + 1] ; set edi to be the source + 1
mov byte [esi], 0 ; initialize the first byte with the "seed"
mov ecx, 100h ; set ecx to the size of the buffer
rep movsb ; do the fill
However, it is simply more efficient to set more than one byte at a time if you can.
Finally, memcpy/memmove aren't what you are looking for: those are for making copies of blocks of memory from one area to another (memmove allows source and dest to be part of the same buffer). memset fills a block with a byte of your choosing.
There's also calloc that allocates and initializes the memory to 0 before returning the pointer. Of course, calloc only initializes to 0, not something the user specifies.
If this is the most efficient way to set a block of memory to a given value on the Z80, then it's quite possible that memset() might be implemented as you describe on a compiler that targets Z80s.
It might be that memcpy() might also use a similar sequence on that compiler.
But why would compilers targeting CPUs with completely different instruction sets from the Z80 be expected to use a Z80 idiom for these types of things?
Remember that the x86 architecture has a similar set of instructions that could be prefixed with a REP opcode to have them execute repeatedly to do things like copy, fill or compare blocks of memory. However, by the time Intel came out with the 386 (or maybe it was the 486) the CPU would actually run those instructions slower than simpler instructions in a loop. So compilers often stopped using the REP-oriented instructions.
Seriously, if you're writing C/C++, just write a simple for-loop and let the compiler bother for you. As an example, here's some code VS2005 generated for this exact case (using templated size):
template <int S>
class A
{
char s_[S];
public:
A()
{
for(int i = 0; i < S; ++i)
{
s_[i] = 'A';
}
}
int MaxLength() const
{
return S;
}
};
extern void useA(A<5> &a, int n); // fool the optimizer into generating any code at all
void test()
{
A<5> a5;
useA(a5, a5.MaxLength());
}
The assembler output is the following:
test PROC
[snip]
; 25 : A<5> a5;
mov eax, 41414141H ;"AAAA"
mov DWORD PTR a5[esp+40], eax
mov BYTE PTR a5[esp+44], al
; 26 : useA(a5, a5.MaxLength());
lea eax, DWORD PTR a5[esp+40]
push 5 ; MaxLength()
push eax
call useA
It does not get any more efficient than that. Stop worrying and trust your compiler or at least have a look at what your compiler produces before trying to find ways to optimize. For comparison I also compiled the code using std::fill(s_, s_ + S, 'A') and std::memset(s_, 'A', S) instead of the for-loop and the compiler produced the identical output.
If you're on the PowerPC, _dcbz().
There are a number of situations where it would be useful to have a "memspread" function whose defined behavior was to copy the starting portion of a memory range throughout the whole thing. Although memset() does just fine if the goal is to spread a single byte value, there are times when e.g. one may want to fill an array of integers with the same value. On many processor implementations, copying a byte at a time from the source to the destination would be a pretty crummy way to implement it, but a well-designed function could yield good results. For example, start by seeing if the amount of data is less than 32 bytes or so; if so, just do a bytewise copy; otherwise check the source and destination alignment; if they are aligned, round the size down to the nearest word (if necessary), then copy the first word everywhere it goes, copy the next word everywhere it goes, etc.
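For illustration, here is one simple way to get such a "memspread" (the name and the doubling approach are mine, not a standard API, and not the word-at-a-time scheme sketched above): seed the first part bytes, then repeatedly double the filled region with non-overlapping memcpy calls so most of the work happens in large blocks.
#include <cstring>

// Replicate the first `part` bytes of `dst` across `total` bytes (requires 0 < part <= total).
void memspread(void *dst, std::size_t part, std::size_t total) {
    char *p = static_cast<char *>(dst);
    std::size_t filled = part;
    while (filled * 2 <= total) {
        std::memcpy(p + filled, p, filled);           // source and destination never overlap
        filled *= 2;
    }
    if (filled < total)
        std::memcpy(p + filled, p, total - filled);   // copy the remaining tail
}
// e.g. int a[256]; a[0] = 42; memspread(a, sizeof(int), sizeof a);  // fills a[] with 42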
I too have at times wished for a function that was specified to work as a bottom-up memcpy, intended for use with overlapping ranges. As to why there isn't a standard one, I guess nobody thought it important.
memcpy() might happen to behave that way, since its behavior on overlapping buffers is undefined. memmove() doesn't by design: if the blocks of memory overlap, it copies the contents starting at the ends of the buffers to avoid that sort of behavior. But to fill a buffer with a specific value you should be using memset() in C or std::fill() in C++, which most modern compilers will optimize to the appropriate block-fill instruction (such as REP STOSB on x86 architectures).
As said before, memset() offers the desired functionality.
memcpy() is for moving around blocks of memory in all cases where the source and destination buffers do not overlap.
memmove() additionally handles the case of overlapping buffers.
On x86 architectures, good compilers directly replace memset calls with inline assembly instructions very effectively setting the destination buffer's memory, even applying further optimizations like using 4-byte values to fill as long as possible (if the following code isn't totally syntactically correct blame it on my not using X86 assembly code for a long time):
lea edi,dest
;copy the fill byte to all 4 bytes of eax
mov al,fill
mov ah,al
mov dx,ax
shl eax,16
mov ax,dx
mov ecx,count
mov edx,ecx
shr ecx,2
cld
rep stosd
test edx,2
jz moveByte
stosw
moveByte:
test edx,1
jz fillDone
stosb
fillDone:
Actually this code is far more efficient than your Z80 version, as it doesn't do memory to memory, but only register to memory moves. Your Z80 code is in fact quite a hack as it relies on each copy operation having filled the source of the subsequent copy.
If the compiler is halfway good, it might be able to detect more complicated C++ code that can be broken down to memset (see the post below), but I doubt that this actually happens for nested loops, probably even invoking initialization functions.