VirtualAlloc C++ , injected dll, asm - c++

I want to reserve space for my codecave in application.
I use VirtualAlloc function to reserve this space.
I have X questions.
What parameters (sllocation type and protection) should I use to allocate memory for code-cave?
As return value I get address of my codecave. In other part of the program I want to JMP to that codecave. How to do it? I know (correct me if I'm wrong) that JMP takes as agument nuber that is offset from current location. But I want to JMP to ma codecave. How to calculate this offset.

Just stumbled across. To clearify this topic for the rest of us: Calculating the relative JMP offset to a codecave patch works by subtracting your patch address with your current programm counter address:
uint32_t patch_address = (uint32_t) VirtualAlloc(...);
uint32_t jmp_offset = patch_address - (current_offset + current_len);
Note: current_len is the number of bytes your JMP instruction takes. This depends on the fact if its a short jmp (EB) or a long jump (E9). In your example 2 bytes, but a regular JMP (E8 0x12345678) takes 5 bytes.
So here we see that your example wont work easily, because you would have to override the next bytes that belong to the following MOV and even the CALL instruction(s). This relies on the fact that your codecave has a greater distance to the current instruction offset because it is allocated in a different region in the address space.
So what you can do is to copy the overwritten 7 Bytes into your cave. That can only work if you dont mess with EDI register in your patch (because of the "MOV ECX, EDI"). And you would have to correct the CALLs address you are overwriting. So this is probably not the best location to place a codecave, but its doable.
i wrote my own hooking library that cares for generic register arguments, stack cleanup and overwritten asm paddings, but i suggest to use the above mentioned frameworks.
regards,
michael

Subtracting the address of your jump target from the address of the instruction after the jump will give you the jump offset.

If you don't get such things, use a library like MS Detours, N-CodeHook, or something else.

Related

Get the address of an intrinsic function-generated instruction

I have a function that uses the compiler intrinsic __movsq to copy some data from a global buffer into another global buffer upon every call of the function. I'm trying to nop out those instructions once a flag has been set globally and the same function is called again. Example code:
// compiler: MSVC++ VS 2022 in C++ mode; x64
void DispatchOptimizationLoop()
{
__movsq(g_table, g_localimport, 23);
// hopefully create a nop after movsq?
static unsigned char* ptr = (unsigned char*)(&__nop);
if (!InterlockedExchange8(g_Reduce, 1))
{
// point to movsq in memory
ptr -= 3;
// nop it out
...
}
// rest of function here
...
}
Basically the function places a nop after the movsq, and then tries to get the address of the placed nop then backtrack by the size of the movsq so that a pointer is pointing to the start of movsq, so then I can simply cover it with 3 0x90s. I am aware that the line (unsigned char*)(&__nop) is not actaully creating a nop because I'm not calling the intrinsic, I'm just trying to show what I want to do.
Is this possible, or is there a better way to store the address of the instructions that need to be nop'ed out in the future?
It's not useful to have the address of a 0x90 NOP somewhere else, all you need is the address of machine code inside your function. Nothing you've written comes remotely close to helping you find that. As you say, &__nop doesn't lead to there being a NOP in your function's machine code which you could offset relative to.
If you want to hard-code offsets that could break with different optimization settings, you could take the address of the start of the function and offset it.
Or you could write the whole function in asm so you can put a label on the address you want to modify. That would actually let you do this safely.
You might get something that happens to work with GNU C labels as values, where you can take the address of C goto labels like &&label. Like put a mylabel: before the intrinsic, and maybe after for good measure so you can check that the difference is the expected 3 bytes. If you're lucky, the compiler won't put any other instructions between your labels.
So you can memset((void*)&&mylabel, 0x90, 3) (after an assert on &&mylabel_end - &&mylabel == 3). But I don't think MSVC supports that GNU extension or anything equivalent.
But you can't actually use memset if another thread could be running this at the same time.
And for efficiency, you want a single 3-byte NOP anyway.
And of course you'd have to VirtualProtect the page of machine code containing that instruction to make it writeable. (Assuming the function is 16-byte aligned, it's hopefully impossible for that one instruction near the start to be split across two pages.)
And if other threads could be running this function at the same time, you'd better use an atomic RMW (on the containing dword or qword) to replace the 3-byte instruction with a single 3-byte NOP, otherwise you could have another thread fetch and decode the first NOP, but then fetch a byte of of the movsq machine code not replaced yet.
Actually a plain mov store would be atomic if it's 4 bytes not crossing an 8-byte boundary. Since there are no other writers of different data, it's fine to load / AND/OR / store to later store the same surrounding bytes you loaded earlier. Normally a non-atomic load+store is not thread-safe, but no other threads could have written a different value in the meantime.
Atomicity of cross-modifying code
I think cross-modifying code has atomicity rules similar to data. But if the instruction spans a 16-byte boundary, code-fetch in another core might have pulled in the first 1 or 2 bytes of it before you atomically replace all 3. So the 2nd and 3rd byte get treated as either the start of an instruction, or the 2nd + 3rd bytes of a long-NOP. Since long-NOPs generally start with 0F 1F with an escape byte, if that's not how __movsq starts then it could desync.
So if cross-modifying code doesn't trigger a pipeline nuke on the other core, it's not safe to do it while another thread might be running the code. Code fetch is usually done in 16-byte chunks but that's not guaranteed. And it's not guaranteed that they're aligned 16-byte chunks.
So you should probably make sure no other threads are running this function while you change the machine code. Unless you're very sure of the safety of what you're doing and check each build to make sure the instruction starts at a safe offset, where safe is defined according to any possibility or anything that could go wrong.

segmentation fault after linking c++ file with asm file [duplicate]

I am currently learning x86 assembly. Something is not clear to me still however when using the stack for function calls. I understand that the call instruction will involve pushing the return address on the stack and then load the program counter with the address of the function to call. The ret instruction will load this address back to the program counter.
My confusion is, does it matter when the ret instruction is called within the procedure/function? Will it always find the correct return address stored on the stack, or must the stack pointer be currently pointing to where the return address was stored? If that's the case, can't we just use push and pop instead of call and ret?
For example, the code below could be the first on entering the function , if we push different registers on the stack, must the ret instruction only be called after the registers are popped in the reverse order so that after the pop %ebp instruction , the stack pointer will point to the correct place on the stack where the return address is, or will it still find it regardless where it is called? Thanks in advance
push %ebp
mov %ebp, %esp
//push other registers
...
//pop other registers
mov %esp, %ebp
(could ret instruction go here for example and still pop the correct return address?)
pop %ebp
ret
You must leave the stack and non-volatile registers as you found them. The calling function has no clue what you might have done with them otherwise - the calling function will simply continue to its next instruction after ret. Only ret after you're done cleaning up.
ret will always look to the top of the stack for its return address and will pop it into EIP. If the ret is a "far" return then it will also pop the code segment into the CS register (which would also have been pushed by call for a "far" call). Since these are the first things pushed by call, they must be the last things popped by ret. Otherwise you'll end up reting somewhere undefined.
The CPU has no idea what is function/etc... The ret instruction will fetch value from memory pointed to by esp a jump there. For example you can do things like (to illustrate the CPU is not interested into how you structurally organize your source code):
; slow alternative to "jmp continue_there_address"
push continue_there_address
ret
continue_there_address:
...
Also you don't need to restore the registers from stack, (not even restore them to the original registers), as long as esp points to the return address when ret is executed, it will be used:
call SomeFunction
...
SomeFunction:
push eax
push ebx
push ecx
add esp,8 ; forget about last 2 push
pop ecx ; ecx = original eax
ret ; returns back after call
If your function should be interoperable from other parts of code, you may still want to store/restore the registers as required by the calling convention of the platform you are programming for, so from the caller point of view you will not modify some register value which should be preserved, etc... but none of that bothers CPU and executing instruction ret, the CPU just loads value from stack ([esp]), and jumps there.
Also when the return address is stored to stack, it does not differ from other values pushed to stack in any way, all of them are just values written in memory, so the ret has no chance to somehow find "return address" in stack and skip "values", for CPU the values in memory look the same, each 32 bit value is that, 32 bit value. Whether it was stored by call, push, mov, or something else, doesn't matter, that information (origin of value) is not stored, only value.
If that's the case, can't we just use push and pop instead of call and ret?
You can certainly push preferred return address into stack (my first example). But you can't do pop eip, there's no such instruction. Actually that's what ret does, so pop eip is effectively the same thing, but no x86 assembly programmer use such mnemonics, and the opcode differs from other pop instructions. You can of course pop the return address into different register, like eax, and then do jmp eax, to have slow ret alternative (modifying also eax).
That said, the complex modern x86 CPUs do keep some track of call/ret pairings (to predict where the next ret will return, so it can prefetch the code ahead quickly), so if you will use one of those alternative non-standard ways, at some point the CPU will realize it's prediction system for return address is off the real state, and it will have to drop all those caches/preloads and re-fetch everything from real eip value, so you may pay performance penalty for confusing it.
In the example code, if the return was done before pop %ebp, it would attempt to return to the "address" that was in ebp at the start of the function, which would be the wrong address to return to.

Why would a compiler generate this assembly?

While stepping through some Qt code I came across the following. The function QMainWindowLayout::invalidate() has the following implementation:
void QMainWindowLayout::invalidate()
{
QLayout::invalidate()
minSize = szHint = QSize();
}
It is compiled to this:
<invalidate()> push %rbx
<invalidate()+1> mov %rdi,%rbx
<invalidate()+4> callq 0x7ffff4fd9090 <QLayout::invalidate()>
<invalidate()+9> movl $0xffffffff,0x564(%rbx)
<invalidate()+19> movl $0xffffffff,0x568(%rbx)
<invalidate()+29> mov 0x564(%rbx),%rax
<invalidate()+36> mov %rax,0x56c(%rbx)
<invalidate()+43> pop %rbx
<invalidate()+44> retq
The assembly from invalidate+9 to invalidate+36 seems stupid. First the code writes -1 to %rbx+0x564 and %rbx+0x568, but then it loads that -1 from %rbx+0x564 back into a register just to write it out to %rbx+0x56c. This seems like something the compiler should easily be able to optimize into just another move immediate.
So is this stupid code (and if so, why wouldn't the compiler optimize it?) or is this somehow very clever and faster than using just another move immediate?
(Note: This code is from the normal release library build shipped by ubuntu, so it was presumably compiled by GCC in optimize mode. The minSize and szHint variables are normal variables of type QSize.)
Not sure you're correct when you're saying it's stupid. I think the compiler might be trying to optimize the code size here. There is no 64-bit immediate to memory mov instruction. So the compiler has to generate 2 mov instructions just like it did above. Each of them would be 10 bytes, the 2 moves generated are 14 bytes. It's been written to so there is most likely no memory latency so I do not think you'll take any performance hit here.
The code is "less than perfect".
For code size, those 4 instructions add up to 34 bytes. A much smaller sequence (19 bytes) is possible:
00000000 31C0 xor eax,eax
00000002 48F7D0 not rax
00000005 48898364050000 mov [rbx+0x564],rax
0000000C 4889836C050000 mov [rbx+0x56c],rax
;Note: XOR above clears RAX due to zero extension
For performance things aren't so simple. The CPU wants to do many instructions at the same time, and the code above breaks that. For example:
xor eax,eax
not rax ;Must wait until previous instruction finishes
mov [rbx+0x564],rax ;Must wait until previous instruction finishes
mov [rbx+0x56c],rax ;Must wait until "not" finishes
For performance you want to do this:
00000000 48C7C0FFFFFFFF mov rax,0xffffffff
00000007 C78364050000FFFFFFFF mov dword [rbx+0x564],0xffffffff
00000011 C78368050000FFFFFFFF mov dword [rbx+0x568],0xffffffff
0000001B C7836C050000FFFFFFFF mov dword [rbx+0x56c],0xffffffff
00000025 C78370050000FFFFFFFF mov dword [rbx+0x570],0xffffffff
;Note: first MOV sets RAX to 0xFFFFFFFFFFFFFFFF due to sign extension
This allows all of the instructions to be executed in parallel, with no dependencies anywhere. Sadly, it's also much larger (45 bytes).
If you try to get a balance between code size and performance; then you could hope that the first instruction (that sets the value in RAX) completes before the last instruction/s needs to know the value in RAX. This might be something like this:
mov rax,-1
mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov dword [rbx+0x56c],rax
This is 34 bytes (the same size as the original code). This is likely to be a good compromise between code size and performance.
Now; let's look at the original code and see why it is bad:
mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov rax,[rbx+0x564] ;Massive problem
mov [rbx+0x56C],rax ;Depends on previous instruction
Modern CPUs do have something called "store forwarding", where writes are stored in a buffer and future reads can get the value from this buffer to avoid reading the value from cache. Ironically, this only works if the size of the read is smaller than or equal to the size of the write. The "store forwarding" will not work for this code as there are 2 writes and the read is larger than both of them. This means that the third instruction has to wait until the first 2 instructions have written to cache and then has to read the value from cache; which could easily add up to a penalty of about 30 cycles or more. Then the fourth instruction must wait for the third instruction (and can't happen in parallel with anything) so that's another problem.
I'd break down the lines as this (think several have comment same steps)
These two lines comes from the inline definition of QSize() http://qt.gitorious.org/qt/qt/blobs/4.7/src/corelib/tools/qsize.h
which set each field separately. Also, my guess is that 0x564(%rbx) is the address of szHint which is also set at the same time.
<invalidate()+9> movl $0xffffffff,0x564(%rbx)
<invalidate()+19> movl $0xffffffff,0x568(%rbx)
These lines are finally setting minSize using 64bit operations because the compiler now know the size of a QSize object. And the address of minSize is 0x56c(%rbx)
<invalidate()+29> mov 0x564(%rbx),%rax
<invalidate()+36> mov %rax,0x56c(%rbx)
Note. First part is setting two separate fields, and next part is copying a QSize object (regardless content). The question then is, should the compiler be smart enough to build a compound 64bit value because it saw preset values just earlier? Not sure about that...
In addition to Guillaume's answer, the 64 bit load/store is not aligned. But according to the Intel optimization guide (p 3-62)
Misaligned data access can incur significant performance penalties.
This is particularly true for cache line splits. The size of a cache
line is 64 bytes in the Pentium 4 and other recent Intel processors,
including processors based on Intel Core microarchitecture.
An access to data unaligned on 64-byte boundary leads to two memory
accesses and requires several μops to be executed (instead of one).
Accesses that span 64-byte boundaries are likely to incur a large
performance penalty, the cost of each stall generally are greater on
machines with longer pipelines.
Which imo implies that an unaligned load/store that does not cross a cache line boundary is cheap. In this case the base pointer in the process I was debugging was 0x10f9bb0, so the two variables are 20 and 28 bytes into the cacheline.
Normally Intel processors use store to load forwarding, so a load of a value that was just stored doesn't even need to touch the cache. But the same guide also states that a large load of several smaller stores does not store-load-forward but stalls: (p 3-66, p 3-68)
Assembly/Compiler Coding Rule 49. (H impact, M generality) The data of
a load which is forwarded from a store must be completely contained
within the store data.
; A. Large load stall
mov mem, eax ; Store dword to address “MEM"
mov mem + 4, ebx ; Store dword to address “MEM + 4"
fld mem ; Load qword at address “MEM", stalls
So the code in question probably causes a stall, and therefore I'm inclined to believe it is not optimal. I wouldn't be very surprised if GCC does not take such limitations fully into account. Does anyone know if/how much modelling of store-to-load forwarding limitations GCC does?
EDIT: some experimenting with adding filler values before the minSize/szHint fields shows that GCC does not care at all where the cache line boundaries are, and neither does clang.

Function Prologue and Epilogue in C

I know data in nested function calls go to the Stack.The stack itself implements a step-by-step method for storing and retrieving data from the stack as the functions get called or returns.The name of these methods is most known as Prologue and Epilogue.
I tried with no success to search material on this topic. Do you guys know any resource ( site,video, article ) about how function prologue and epilogue works generally in C ? Or if you can explain would be even better.
P.S : I just want some general view, not too detailed.
There are lots of resources out there that explain this:
Function prologue (Wikipedia)
x86 Disassembly/Calling Conventions (WikiBooks)
Considerations for Writing Prolog/Epilog Code (MSDN)
to name a few.
Basically, as you somewhat described, "the stack" serves several purposes in the execution of a program:
Keeping track of where to return to, when calling a function
Storage of local variables in the context of a function call
Passing arguments from calling function to callee.
The prolouge is what happens at the beginning of a function. Its responsibility is to set up the stack frame of the called function. The epilog is the exact opposite: it is what happens last in a function, and its purpose is to restore the stack frame of the calling (parent) function.
In IA-32 (x86) cdecl, the ebp register is used by the language to keep track of the function's stack frame. The esp register is used by the processor to point to the most recent addition (the top value) on the stack. (In optimized code, using ebp as a frame pointer is optional; other ways of unwinding the stack for exceptions are possible, so there's no actual requirement to spend instructions setting it up.)
The call instruction does two things: First it pushes the return address onto the stack, then it jumps to the function being called. Immediately after the call, esp points to the return address on the stack. (So on function entry, things are set up so a ret could execute to pop that return address back into EIP. The prologue points ESP somewhere else, which is part of why we need an epilogue.)
Then the prologue is executed:
push ebp ; Save the stack-frame base pointer (of the calling function).
mov ebp, esp ; Set the stack-frame base pointer to be the current
; location on the stack.
sub esp, N ; Grow the stack by N bytes to reserve space for local variables
At this point, we have:
...
ebp + 4: Return address
ebp + 0: Calling function's old ebp value
ebp - 4: (local variables)
...
The epilog:
mov esp, ebp ; Put the stack pointer back where it was when this function
; was called.
pop ebp ; Restore the calling function's stack frame.
ret ; Return to the calling function.
C Function Call Conventions and the Stack explains well the concept of a call stack
Function prologue briefly explains the assembly code and the hows and whys.
The gen on function perilogues
I am quite late to the party & I am sure that in the last 7 years since the question was asked, you'd have gotten a way clearer understanding of things, that is of course if you chose to pursue the question any further. However, I thought I would still give a shot at especially the why part of the prolog & the epilog.
Also, the accepted answer elegantly & quite simply explains the how of the epilog & the prolog, with good references. I only intend to supplement that answer with the why (at least the logical why) part.
I will quote the below from the accepted answer & try to extend it's explanation.
In IA-32 (x86) cdecl, the ebp register is used by the language to keep
track of the function's stack frame. The esp register is used by the
processor to point to the most recent addition (the top value) on the
stack.
The call instruction does two things: First it pushes the return
address onto the stack, then it jumps to the function being called.
Immediately after the call, esp points to the return address on the
stack.
The last line in the quote above says immediately after the call, esp points to the return address on the stack.
Why's that?
So let's say that our code that's getting currently executed has the following situation, as shown in the (really badly drawn) diagram below
So our next instruction to be executed is, say at the address 2. This is where the EIP is pointing. The current instruction has a function call (that would internally translate to the assembly call instruction).
Now ideally, because the EIP is pointing to the very next instruction, that would indeed be the next instruction to get executed. But since there's sort of a diversion from the current execution flow path, (that is now expected because of the call) the EIP's value would change. Why? Because now another instruction, that may be somewhere else, say at the address 1234 (or whatever), may need to get executed. But in order to complete the execution flow of the program as was intended by the programmer, after the diversion activities are done, the control must return back to the address 2 as that is what should have been executed next should the diversion have not happened. Let us call this address 2 as the return address in the context of the call that is being made.
Problem 1
So, before the diversion actually happens, the return address, 2, would need to be stored somewhere temporarily.
There could have been many choices of storing it in any of the available registers, or some memory location etc. But for (I believe good reason) it was decided that the return address would be stored onto the stack.
So what needs to be done now is increment the ESP (the stack pointer) such that the top of the stack now points at the next address on the stack. So TOS' (TOS before the increment) which was pointing to the address, say 292, now gets incremented & starts pointing to the address 293. That is where we put our return address 2. So something like this:
So it looks like now we have achieved our goal of temporarily storing the return address somewhere. We should now just go about making the diversion call. And we could. But there's a small problem. During the execution of the called function, the stack pointer, along with the other register values, could be manipulated multiple times.
Problem 2
So, although the return address of ours, is still stored on the stack, at location 293, after the called function finishes off executing, how would the execution flow know that it should now goto 293 & that's where it would find the return address?
So (I believe for good reason again) one of the ways of solving the above problem could be to store the stack address 293 (where the return address is) in a (designated) register called EBP. But then what about the contents of EBP? Would that not be overwritten? Sure, that's a valid point. So let's store the current contents of EBP on to the stack & then store this stack address into EBP. Something like this:
The stack pointer is incremented. The current value of EBP (denoted as EBP'), which is say xxx, is stored onto the top of the stack, i.e. at the address 294. Now that we have taken a backup of the current contents of EBP, we can safely put any other value onto the EBP. So we put the current address of the top of the stack, that is the address 294, in EBP.
With the above strategy in place, we solve for the Problem 2 discussed above. How? So now when the execution flow wants to know where from should it fetch the return address, it would :
first get the value from EBP out and point the ESP to that value. In our case, this would make TOS (top of stack) point to the address 294 (since that is what is stored in EBP).
Then it would restore the previous value of EBP. To do this it would simply take the value at 294 (the TOS), which is xxx (which was actually the older value of EBP), & put it back to EBP.
Then it would decrement the stack pointer to go to the next lower address in the stack which is 293 in our case. Thus finally reaching 293 (see that's what our problem 2 was). That's where it would find the return address, which is 2.
It will finally pop this 2 out into the EIP, that's the instruction that should have ideally been executed should the diversion have not happened, remember.
And the steps that we just saw being performed, with all the jugglery, to store the return address temporarily & then retrieve it is exactly what gets done with the function prolog (before the function call) & the epilog (before the function ret). The how was already answered, we just answered the why as well.
Just an end note: For the sake of brevity, I have not taken care of the fact that the stack addresses may grow the other way round.
Every function has an identical prologue(The starting of function code) and epilogue ( The ending of a function).
Prologue: The structure of Prologue is look like:
push ebp
mov esp,ebp
Epilogue: The structure of Prologue is look like:
leave
ret
More in detail : what is Prologue and Epilogue

Why is there no Z80 like LDIR functionality in C/C++/rtl?

In Z80 machine code, a cheap technique to initialize a buffer to a fixed value, say all blanks. So a chunk of code might look something like this.
LD HL, DESTINATION ; point to the source
LD DE, DESTINATION + 1 ; point to the destination
LD BC, DESTINATION_SIZE - 1 ; copying this many bytes
LD (HL), 0X20 ; put a seed space in the first position
LDIR ; move 1 to 2, 2 to 3...
The result being that the chunk of memory at DESTINATION is completely blank filled.
I have experimented with memmove, and memcpy, and can't replicate this behavior. I expected memmove to be able to do it correctly.
Why do memmove and memcpy behave this way?
Is there any reasonable way to do this sort of array initialization?
I am already aware of char array[size] = {0} for array initialization
I am already aware that memset will do the job for single characters.
What other approaches are there to this issue?
There was a quicker way of blanking an area of memory using the stack. Although the use of LDI and LDIR was very common, David Webb (who pushed the ZX Spectrum in all sorts of ways like full screen number countdowns including the border) came up with this technique which is 4 times faster:
saves the Stack Pointer and then
moves it to the end of the screen.
LOADs the HL register pair with
zero,
goes into a massive loop
PUSHing HL onto the Stack.
The Stack moves up the screen and down
through memory and in the process,
clears the screen.
The explanation above was taken from the review of David Webbs game Starion.
The Z80 routine might look a little like this:
DI ; disable interrupts which would write to the stack.
LD HL, 0
ADD HL, SP ; save stack pointer
EX DE, HL ; in DE register
LD HL, 0
LD C, 0x18 ; Screen size in pages
LD SP, 0x4000 ; End of screen
PAGE_LOOP:
LD B, 128 ; inner loop iterates 128 times
LOOP:
PUSH HL ; effectively *--SP = 0; *--SP = 0;
DJNZ LOOP ; loop for 256 bytes
DEC C
JP NZ,PAGE_LOOP
EX DE, HL
LD SP, HL ; restore stack pointer
EI ; re-enable interrupts
However, that routine is a little under twice as fast. LDIR copies one byte every 21 cycles. The inner loop copies two bytes every 24 cycles -- 11 cycles for PUSH HL and 13 for DJNZ LOOP. To get nearly 4 times as fast simply unroll the inner loop:
LOOP:
PUSH HL
PUSH HL
...
PUSH HL ; repeat 128 times
DEC C
JP NZ,LOOP
That is very nearly 11 cycles every two bytes which is about 3.8 times faster than the 21 cycles per byte of LDIR.
Undoubtedly the technique has been reinvented many times. For example, it appeared earlier in sub-Logic's Flight Simulator 1 for the TRS-80 in 1980.
memmove and memcpy don't work that way because it's not a useful semantic for moving or copying memory. It's handy in the Z80 to do be able to fill memory, but why would you expect a function named "memmove" to fill memory with a single byte? It's for moving blocks of memory around. It's implemented to get the right answer (the source bytes are moved to the destination) regardless of how the blocks overlap. It's useful for it to get the right answer for moving memory blocks.
If you want to fill memory, use memset, which is designed to do just what you want.
I believe this goes to the design philosophy of C and C++. As Bjarne Stroustrup once said, one of the major guiding principles of the design of C++ is "What you don’t use, you don’t pay for". And while Dennis Ritchie may not have said it in exactly those same words, I believe that was a guiding principle informing his design of C (and the design of C by subsequent people) as well. Now you may think that if you allocate memory it should automatically be initialized to NULL's and I'd tend to agree with you. But that takes machine cycles and if you're coding in a situation where every cycle is critical, that may not be an acceptable trade-off. Basically C and C++ try to stay out of your way--hence if you want something initialized you have to do it yourself.
The Z80 sequence you show was the fastest way to do that - in 1978. That was 30 years ago. Processors have progressed a lot since then, and today that's just about the slowest way to do it.
Memmove is designed to work when the source and destination ranges overlap, so you can move a chunk of memory up by one byte. That's part of its specified behavior by the C and C++ standards. Memcpy is unspecified; it might work identically to memmove, or it might be different, depending on how your compiler decides to implement it. The compiler is free to choose a method that is more efficient than memmove.
Why do memmove and memcpy behave this way?
Probably because there’s no specific, modern C++ compiler that targets the Z80 hardware? Write one. ;-)
The languages don't specify how a given hardware implements anything. This is entirely up to the programmers of the compiler and libraries. Of course, writing an own, highly specified version for every imaginable hardware configuration is a lot of work. That’ll be the reason.
Is there any reasonable way to do this sort of array initialization?Is there any reasonable way to do this sort of array initialization?
Well, if all else fails you could always use inline assembly. Other than that, I expect std::fill to perform best in a good STL implementation. And yes, I’m fully aware that my expectations are too high and that std::memset often performs better in practice.
If you're fiddling at the hardware level, then some CPUs have DMA controllers that can fill blocks of memory exceedingly quickly (much faster than the CPU could ever do). I've done this on a Freescale i.MX21 CPU.
This be accomplished in x86 assembly just as easily. In fact, it boils down to nearly identical code to your example.
mov esi, source ; set esi to be the source
lea edi, [esi + 1] ; set edi to be the source + 1
mov byte [esi], 0 ; initialize the first byte with the "seed"
mov ecx, 100h ; set ecx to the size of the buffer
rep movsb ; do the fill
However, it is simply more efficient to set more than one byte at a time if you can.
Finally, memcpy/memmove aren't what you are looking for, those are for making copies of blocks of memory from from area to another (memmove allows source and dest to be part of the same buffer). memset fills a block with a byte of your choosing.
There's also calloc that allocates and initializes the memory to 0 before returning the pointer. Of course, calloc only initializes to 0, not something the user specifies.
If this is the most efficient way to set a block of memory to a given value on the Z80, then it's quite possible that memset() might be implemented as you describe on a compiler that targets Z80s.
It might be that memcpy() might also use a similar sequence on that compiler.
But why would compilers targeting CPUs with completely different instruction sets from the Z80 be expected to use a Z80 idiom for these types of things?
Remember that the x86 architecture has a similar set of instructions that could be prefixed with a REP opcode to have them execute repeatedly to do things like copy, fill or compare blocks of memory. However, by the time Intel came out with the 386 (or maybe it was the 486) the CPU would actually run those instructions slower than simpler instructions in a loop. So compilers often stopped using the REP-oriented instructions.
Seriously, if you're writing C/C++, just write a simple for-loop and let the compiler bother for you. As an example, here's some code VS2005 generated for this exact case (using templated size):
template <int S>
class A
{
char s_[S];
public:
A()
{
for(int i = 0; i < S; ++i)
{
s_[i] = 'A';
}
}
int MaxLength() const
{
return S;
}
};
extern void useA(A<5> &a, int n); // fool the optimizer into generating any code at all
void test()
{
A<5> a5;
useA(a5, a5.MaxLength());
}
The assembler output is the following:
test PROC
[snip]
; 25 : A<5> a5;
mov eax, 41414141H ;"AAAA"
mov DWORD PTR a5[esp+40], eax
mov BYTE PTR a5[esp+44], al
; 26 : useA(a5, a5.MaxLength());
lea eax, DWORD PTR a5[esp+40]
push 5 ; MaxLength()
push eax
call useA
It does not get any more efficient than that. Stop worrying and trust your compiler or at least have a look at what your compiler produces before trying to find ways to optimize. For comparison I also compiled the code using std::fill(s_, s_ + S, 'A') and std::memset(s_, 'A', S) instead of the for-loop and the compiler produced the identical output.
If you're on the PowerPC, _dcbz().
There are a number of situations where it would be useful to have a "memspread" function whose defined behavior was to copy the starting portion of a memory range throughout the whole thing. Although memset() does just fine if the goal is to spread a single byte value, there are times when e.g. one may want to fill an array of integers with the same value. On many processor implementations, copying a byte at a time from the source to the destination would be a pretty crummy way to implement it, but a well-designed function could yield good results. For example, start by seeing if the amount of data is less than 32 bytes or so; if so, just do a bytewise copy; otherwise check the source and destination alignment; if they are aligned, round the size down to the nearest word (if necessary), then copy the first word everywhere it goes, copy the next word everywhere it goes, etc.
I too have at times wished for a function that was specified to work as a bottom-up memcpy, intended for use with overlapping ranges. As to why there isn't a standard one, I guess nobody thought it important.
memcpy() should have that behavior. memmove() doesn't by design, if the blocks of memory overlap, it copies the contents starting at the ends of the buffers to avoid that sort of behavior. But to fill a buffer with a specific value you should be using memset() in C or std::fill() in C++, which most modern compilers will optimize to the appropriate block fill instruction (such as REP STOSB on x86 architectures).
As said before, memset() offers the desired functionality.
memcpy() is for moving around blocks of memory in all cases where the source and destination buffers do not overlap, or where dest < source.
memmove() solves the case of buffers overlapping and dest > source.
On x86 architectures, good compilers directly replace memset calls with inline assembly instructions very effectively setting the destination buffer's memory, even applying further optimizations like using 4-byte values to fill as long as possible (if the following code isn't totally syntactically correct blame it on my not using X86 assembly code for a long time):
lea edi,dest
;copy the fill byte to all 4 bytes of eax
mov al,fill
mov ah,al
mov dx,ax
shl eax,16
mov ax,dx
mov ecx,count
mov edx,ecx
shr ecx,2
cld
rep stosd
test edx,2
jz moveByte
stosw
moveByte:
test edx,1
jz fillDone
stosb
fillDone:
Actually this code is far more efficient than your Z80 version, as it doesn't do memory to memory, but only register to memory moves. Your Z80 code is in fact quite a hack as it relies on each copy operation having filled the source of the subsequent copy.
If the compiler is halfway good, it might be able to detect more complicated C++ code that can be broken down to memset (see the post below), but I doubt that this actually happens for nested loops, probably even invoking initialization functions.