I have been giving a metamorphic engine a try. I started by trying to analyze the opcode bytes of a function's assembly instructions, but the analysis does not seem to find anything. The instruction I am looking for in the function is MOV. Why does it not report any MOV instructions even though they are in the function?
#include <iostream>
#include <Windows.h>
using namespace std;
struct OPCODE
{
unsigned short usSize;
PBYTE pbOpCode;
bool bRelative;
bool bMutated;
};
namespace MOVRegisters
{
enum MovRegisters
{
EAX = 0xB8,
ECX,
EDX,
EBX,
ESP,
EBP,
ESI,
EDI
};
}
bool __fastcall bIsMOV(PBYTE pInstruction)
{
if (*pInstruction == MOVRegisters::EAX || *pInstruction == MOVRegisters::ECX || *pInstruction == MOVRegisters::EDX || *pInstruction == MOVRegisters::EBX ||
*pInstruction == MOVRegisters::ESP || *pInstruction == MOVRegisters::EBP || *pInstruction == MOVRegisters::ESI || *pInstruction == MOVRegisters::EDI)
return true;
else
return false;
}
void pCheckByte(PVOID pFunction, PBYTE pFirstFive)
{
if (*pFirstFive == 0x0)
memcpy(pFirstFive, pFunction, 5);
else
memcpy(pFunction, pFirstFive, 5);
PBYTE pCurrentByte = (PBYTE)pFunction;
while (*pCurrentByte != 0xC3 && *pCurrentByte != 0xC2 && *pCurrentByte != 0xCB && *pCurrentByte != 0xCA)
{
OPCODE* pNewOp = new OPCODE();
pNewOp->pbOpCode = pCurrentByte;
if (bIsMOV(pCurrentByte))
{
cout << "mov instr.\n";
}
}
}
void function()
{
int eaxVal;
__asm
{
mov eax, 5
add eax, 6
mov eaxVal, eax
}
printf("Testing %d\n", eaxVal);
}
int main()
{
PBYTE pFirstFive = (PBYTE)malloc(5);
RtlZeroMemory(pFirstFive, 5);
while (true)
{
pCheckByte(function, pFirstFive);
system("pause");
}
return 0;
}
Did you look at the disassembly of function()? The first instruction probably won't be mov eax, 5, since MSVC probably makes a stack frame in functions with inline asm. (push ebp / mov ebp, esp).
Does your code actually loop over the bytes of the function? You have a loop, but it never advances pCurrentByte, and it leaks memory every iteration. The only occurrence of pNewOp is the following, so it's write-only:
OPCODE* pNewOp = new OPCODE();
pNewOp->pbOpCode = pCurrentByte;
Note that looping over all the bytes will give false positives, because 0xb3 or whatever can occur as a non-opcode byte. (e.g. a ModR/M or SIB byte, or immediate data.) Similarly, you could have false positives on your 0xC3, ... scan for ret instructions. Again, look at disassembly with the raw machine code.
Writing your own code for parsing x86 machine code seems like a lot of unnecessary work; there are many tools and libraries that already do this.
Also, single-step through your C++ code in a debugger to see what it does.
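To illustrate the two points above, here is a minimal sketch of what the scanning loop could look like if it advanced through the function one byte at a time and did not allocate anything it never frees. The byte-wise advance is an assumption on my part, and, as noted above, it will still misclassify ModR/M, SIB, and immediate bytes as opcodes:

PBYTE pCurrentByte = (PBYTE)pFunction;
while (*pCurrentByte != 0xC3 && *pCurrentByte != 0xC2 &&
       *pCurrentByte != 0xCB && *pCurrentByte != 0xCA)
{
    if (bIsMOV(pCurrentByte))
        cout << "mov instr. at offset " << (pCurrentByte - (PBYTE)pFunction) << "\n";
    ++pCurrentByte; // advance, otherwise the loop never terminates
}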
Related
I have a program in which a simple function is called a large number of times. I have added some simple logging code and find that this significantly affects performance, even when the logging code is not actually called. A complete (but simplified) test case is shown below:
#include <chrono>
#include <iostream>
#include <random>
#include <sstream>
using namespace std::chrono;
std::mt19937 rng;
uint32_t getValue()
{
// Just some pointless work, helps stop this function from getting inlined.
for (int x = 0; x < 100; x++)
{
rng();
}
// Get a value, which happens never to be zero
uint32_t value = rng();
// This (by chance) is never true
if (value == 0)
{
value++; // This if statement won't get optimized away when the printing below is commented out.
std::stringstream ss;
ss << "This never gets printed, but commenting out these three lines improves performance." << std::endl;
std::cout << ss.str();
}
return value;
}
int main(int argc, char* argv[])
{
// Just for timing
high_resolution_clock::time_point start = high_resolution_clock::now();
uint32_t sum = 0;
for (uint32_t i = 0; i < 10000000; i++)
{
sum += getValue();
}
milliseconds elapsed = duration_cast<milliseconds>(high_resolution_clock::now() - start);
// Use (print) the sum to make sure it doesn't get optimized away.
std::cout << "Sum = " << sum << ", Elapsed = " << elapsed.count() << "ms" << std::endl;
return 0;
}
Note that the code contains stringstream and cout but these are never actually called. However, the presence of these three lines of code increases the run time from 2.9 to 3.3 seconds. This is in release mode on VS2013. Curiously, if I build in GCC using the '-O3' flag the extra three lines of code actually decrease the runtime by half a second or so.
I understand that the extra code could impact the resulting executable in a number of ways, such as by preventing inlining or causing more cache misses. The real question is whether there is anything I can do to improve on this situation? Switching to sprintf()/printf() doesn't seem to make a difference. Do I need to simply accept that adding such logging code to small functions will affect performance even if not called?
Note: For completeness, my real/full scenario is that I use a wrapper macro to throw exceptions, and I like to log when such an exception is thrown. So when I call THROW_EXCEPT(...) it inserts code similar to that shown above and then throws. This is then hurting performance when I throw exceptions from inside a small function. Any better alternatives here?
Edit: Here is a VS2013 solution for quick testing, and so compiler settings can be checked: https://drive.google.com/file/d/0B7b4UnjhhIiEamFyS0hjSnVzbGM/view?usp=sharing
So I initially thought that this was due to branch prediction and branches being optimised out, so I took a look at the annotated assembly for when the code is commented out:
if (value == 0)
00E21371 mov ecx,1
00E21376 cmove eax,ecx
{
value++;
Here we see that the compiler has helpfully optimised out our branch, so what if we put in a more complex statement to prevent it from doing so:
if (value == 0)
00AE1371 jne getValue+99h (0AE1379h)
{
value /= value;
00AE1373 xor edx,edx
00AE1375 xor ecx,ecx
00AE1377 div eax,ecx
Here the branch is left in, but when running this it runs about as fast as the previous example with the following lines commented out. So let's have a look at the assembly for having those lines left in:
if (value == 0)
008F13A0 jne getValue+20Bh (08F14EBh)
{
value++;
std::stringstream ss;
008F13A6 lea ecx,[ebp-58h]
008F13A9 mov dword ptr [ss],8F32B4h
008F13B3 mov dword ptr [ebp-0B0h],8F32F4h
008F13BD call dword ptr ds:[8F30A4h]
008F13C3 push 0
008F13C5 lea eax,[ebp-0A8h]
008F13CB mov dword ptr [ebp-4],0
008F13D2 push eax
008F13D3 lea ecx,[ss]
008F13D9 mov dword ptr [ebp-10h],1
008F13E0 call dword ptr ds:[8F30A0h]
008F13E6 mov dword ptr [ebp-4],1
008F13ED mov eax,dword ptr [ss]
008F13F3 mov eax,dword ptr [eax+4]
008F13F6 mov dword ptr ss[eax],8F32B0h
008F1401 mov eax,dword ptr [ss]
008F1407 mov ecx,dword ptr [eax+4]
008F140A lea eax,[ecx-68h]
008F140D mov dword ptr [ebp+ecx-0C4h],eax
008F1414 lea ecx,[ebp-0A8h]
008F141A call dword ptr ds:[8F30B0h]
008F1420 mov dword ptr [ebp-4],0FFFFFFFFh
That's a lot of instructions if that branch is ever hit. So what if we try something else?
if (value == 0)
011F1371 jne getValue+0A6h (011F1386h)
{
value++;
printf("This never gets printed, but commenting out these three lines improves performance.");
011F1373 push 11F31D0h
011F1378 call dword ptr ds:[11F30ECh]
011F137E add esp,4
Here we have far fewer instructions and once again it runs as quickly as with all lines commented out.
So I'm not sure I can say for certain exactly what is happening here but I feel at the moment it is a combination of branch prediction and CPU instruction cache misses.
In order to solve this problem you could move the logging into a function like so:
void log()
{
std::stringstream ss;
ss << "This never gets printed, but commenting out these three lines improves performance." << std::endl;
std::cout << ss.str();
}
and
if (value == 0)
{
value++;
log();
Then it runs as fast as before, with all those instructions replaced with a single call to log (011C12E0h).
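If the compiler later decides to inline that helper back into the hot function, one option is to force it to stay out of line. This is only a sketch on top of the log() shown above, assuming MSVC and GCC respectively; it is not something tested against the original project:

#ifdef _MSC_VER
__declspec(noinline)
#else
__attribute__((noinline))
#endif
void log()
{
    std::stringstream ss;
    ss << "This never gets printed, but commenting out these three lines improves performance." << std::endl;
    std::cout << ss.str();
}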
So I've asked this before but with significantly less detail. The question title accurately describes the problem: I have a method in C++ that I am trying to call from assembly (x86) that has both parameters and a return value. I have a rough understanding, at best, of assembly and a fairly solid understanding of C++ (otherwise I would not have undertaken this problem). Here's what I have as far as code goes:
// methodAddr is a pointer to the method address
void* methodAddr = method->Address;
// buffer is an int array of parameter values. The parameters can be anything (of any type)
// but are copied into an int array so they can be pushed onto the stack in reverse order
// 4 bytes at a time (as in push (int)). I know there's an issue here that is irrelevant to my baseline testing, in that if any parameter is over 4 bytes it will be broken and
// reversed (which is not good) but for basic testing this isn't an issue, so overlook this.
for (int index = bufferElementCount - 1; index >= 0; index--)
{
int val = buffer[index];
__asm
{
push val
}
}
int returnValueCount = 0;
// if there is a return value, allocate some space for it and push that onto the stack after
// the parameters have been pushed on
if (method->HasReturnValue)
{
*returnSize = method->ReturnValueSize;
outVal = new char[*returnSize];
returnValueCount = (*returnSize / 4) + (*returnSize % 4 != 0 ? 1 : 0);
memset(outVal, 0, *returnSize);
for (int index = returnValueCount - 1; index >= 0; index--)
{
char* addr = ((char*)outVal) + (index * 4);
__asm
{
push addr
}
}
}
// calculate the stack pointer offset so after the call we can pop the parameters and return value
int espOffset = (bufferElementCount + returnValueCount) * 4;
// call the method
__asm
{
call methodAddr;
add esp, espOffset
};
For my basic testing I am using a method with the following signature:
Person MyMethod3( int, char, int );
The problem is this: when I omit the return value from the method signature, all of the parameter values are properly passed. But when I leave the method as is, the parameter data that is passed is incorrect but the value returned is correct. So my question, obviously, is what is wrong? I've tried pushing the return value space onto the stack before the parameters. The Person class is as follows:
class Person
{
public:
Text Name;
int Age;
float Cash;
ICollection<Person*>* Friends;
};
Any help would be greatly appreciated. Thanks!
I'm using Visual Studio 2013 with the November 2013 CTP compiler for C++, targeting x86.
As it relates to disassembly, this is the straight method call:
int one = 876;
char two = 'X';
int three = 9738;
Person p = MyMethod3(one, two, three);
And here is the disassembly for that:
00CB0A20 mov dword ptr [one],36Ch
char two = 'X';
00CB0A27 mov byte ptr [two],58h
int three = 9738;
00CB0A2B mov dword ptr [three],260Ah
Person p = MyMethod3(one, two, three);
00CB0A32 push 10h
00CB0A34 lea ecx,[p]
00CB0A37 call Person::__autoclassinit2 (0C6AA2Ch)
00CB0A3C mov eax,dword ptr [three]
00CB0A3F push eax
00CB0A40 movzx ecx,byte ptr [two]
00CB0A44 push ecx
00CB0A45 mov edx,dword ptr [one]
00CB0A48 push edx
00CB0A49 lea eax,[p]
00CB0A4C push eax
00CB0A4D call MyMethod3 (0C6B783h)
00CB0A52 add esp,10h
00CB0A55 mov dword ptr [ebp-4],0
My interpretation of this is as follows:
Execute the assignments to the local variables. Then create the output register. Then put the parameters in a particular register (the order here happens to be eax, ecx, and edx, which makes sense (eax and ebx are for one, ecx is for two, and edx and some other register for the last parameter?)). Then call LEA (load-effective address) which I don't understand but have understood to be a MOV. Then it calls the method with an address as the parameter? And then moves the stack pointer to pop the parameters and return value.
Any further explanation is appreciated, as I'm sure my understanding here is somewhat flawed.
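For what it's worth, the disassembly above is consistent with how by-value class returns are usually handled: the caller pushes the address of p as an extra, hidden argument (it is pushed last, so it arrives as the leftmost parameter), and the callee constructs the result there. A rough C-level sketch of what the call effectively looks like; the signature and name are my own illustration, and the exact convention depends on the compiler and on the class type:

// Hypothetical equivalent of the generated call sequence above (not the real signature):
Person* MyMethod3_impl(Person* resultSlot, int one, char two, int three);

void caller()
{
    Person p;                            // caller reserves space for the result
    MyMethod3_impl(&p, 876, 'X', 9738);  // &p is pushed last, i.e. the hidden first argument
}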
As part of updating the toolchain for a legacy codebase, we would like to move from the Borland C++ 5.02 compiler to the Microsoft compiler (VS2008 or later). This is an embedded environment where the stack address space is predefined and fairly limited. It turns out that we have a function with a large switch statement which causes a much larger stack allocation under the MS compiler than with Borland's and, in fact, results in a stack overflow.
The form of the code is something like this:
#ifdef PKTS
#define RETURN_TYPE SPacket
typedef struct
{
int a;
int b;
int c;
int d;
int e;
int f;
} SPacket;
SPacket error = {0,0,0,0,0,0};
#else
#define RETURN_TYPE int
int error = 0;
#endif
extern RETURN_TYPE pickone(int key);
void findresult(int key, RETURN_TYPE* result)
{
switch(key)
{
case 1 : *result = pickone(5 ); break;
case 2 : *result = pickone(6 ); break;
case 3 : *result = pickone(7 ); break;
case 4 : *result = pickone(8 ); break;
case 5 : *result = pickone(9 ); break;
case 6 : *result = pickone(10); break;
case 7 : *result = pickone(11); break;
case 8 : *result = pickone(12); break;
case 9 : *result = pickone(13); break;
case 10 : *result = pickone(14); break;
case 11 : *result = pickone(15); break;
default : *result = error; break;
}
}
When compiled with cl /O2 /FAs /c /DPKTS stack_alloc.cpp, a portion of the listing file looks like this:
_TEXT SEGMENT
$T2592 = -264 ; size = 24
$T2582 = -240 ; size = 24
$T2594 = -216 ; size = 24
$T2586 = -192 ; size = 24
$T2596 = -168 ; size = 24
$T2590 = -144 ; size = 24
$T2598 = -120 ; size = 24
$T2588 = -96 ; size = 24
$T2600 = -72 ; size = 24
$T2584 = -48 ; size = 24
$T2602 = -24 ; size = 24
_key$ = 8 ; size = 4
_result$ = 12 ; size = 4
?findresult@@YAXHPAUSPacket@@@Z PROC ; findresult, COMDAT
; 27 : switch(key)
mov eax, DWORD PTR _key$[esp-4]
dec eax
sub esp, 264 ; 00000108H
...
$LN11@findresult:
; 30 : case 2 : *result = pickone(6 ); break;
push 6
lea ecx, DWORD PTR $T2584[esp+268]
push ecx
jmp SHORT $LN17@findresult
$LN10@findresult:
; 31 : case 3 : *result = pickone(7 ); break;
push 7
lea ecx, DWORD PTR $T2586[esp+268]
push ecx
jmp SHORT $LN17@findresult
$LN17@findresult:
call ?pickone@@YA?AUSPacket@@H@Z ; pickone
mov edx, DWORD PTR [eax]
mov ecx, DWORD PTR _result$[esp+268]
mov DWORD PTR [ecx], edx
mov edx, DWORD PTR [eax+4]
mov DWORD PTR [ecx+4], edx
mov edx, DWORD PTR [eax+8]
mov DWORD PTR [ecx+8], edx
mov edx, DWORD PTR [eax+12]
mov DWORD PTR [ecx+12], edx
mov edx, DWORD PTR [eax+16]
mov DWORD PTR [ecx+16], edx
mov eax, DWORD PTR [eax+20]
add esp, 8
mov DWORD PTR [ecx+20], eax
; 41 : }
; 42 : }
add esp, 264 ; 00000108H
ret 0
The allocated stack space includes dedicated locations for each case to temporarily store the structure returned from pickone(), though in the end, only one value will be copied to the result structure. As you can imagine, with larger structures, more cases, and recursive calls in this function, the available stack space is consumed rapidly.
If the return type is POD, as when the above is compiled without the /DPKTS directive, each case copies directly to result, and stack usage is more efficient:
$LN10@findresult:
; 31 : case 3 : *result = pickone(7 ); break;
push 7
call ?pickone@@YAHH@Z ; pickone
mov ecx, DWORD PTR _result$[esp]
add esp, 4
mov DWORD PTR [ecx], eax
; 41 : }
; 42 : }
ret 0
Can anyone explain why the compiler takes this approach and whether there's a way to convince it to do otherwise? I have limited freedom to re-architect the code, so pragmas and the like are the more desirable solutions. So far, I have not found any combination of optimization, debug, etc. arguments that make a difference.
Thank you!
EDIT
I understand that findresult() needs to allocate space for the return value of pickone(). What I don't understand is why the compiler allocates additional space for each possible case in the switch. It seems that space for one temporary would be sufficient. This is, in fact, how gcc handles the same code. Borland, on the other hand, appears to use RVO, passing the pointer all the way down and avoiding use of a temporary. The MS C++ compiler is the only one of the three that reserves space for each case in the switch.
I know that it's difficult to suggest refactoring options when you don't know which portions of the test code can change -- that's why my first question is why does the compiler behave this way in the test case. I'm hoping that if I can understand that, I can choose the best refactoring/pragma/command-line option to fix it.
Why not just
void findresult(int key, RETURN_TYPE* result)
{
if (key >= 1 && key <= 11)
*result = pickone(4+key);
else
*result = error;
}
Assuming this counts as a smaller change, I just remembered an old question about scope, specifically related to embedded compilers. Does the optimizer do any better if you wrap each case in braces to explicitly limit the temporary scope?
switch(key)
{
case 1 : { *result = pickone(5 ); break; }
Another scope-changing option:
void findresult(int key, RETURN_TYPE* result)
{
RETURN_TYPE tmp;
switch(key)
{
case 1 : tmp = pickone(5 ); break;
...
}
*result = tmp;
}
This is all a bit hand-wavy, because we're just trying to guess which input will coax a sensible response from this unfortunate optimizer.
I'm going to assume that rewriting that function is allowed, as long as the changes don't "leak" outside the function. I'm also assuming that (as mentioned in the comments) you actually have a number of separate functions to call (but that they all receive the same type of input and return the same result type).
For such a case, I'd probably change the function to something like:
RETURN_TYPE func1(int) { /* ... */ }
RETURN_TYPE func2(int) { /* ... */ }
// ...
void findresult(int key, RETURN_TYPE *result) {
typedef RETURN_TYPE (*f)(int);
f funcs[] = { func1, func2, func3, func4, func5, /* ... */ };
if (in_range(key))
*result = funcs[key](key+4);
else
*result = error;
}
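For completeness, a small sketch of the pieces the snippet above leaves open, under the assumption that valid keys are 1 through 11 and that the table is zero-based (both assumptions on my part):

// Hypothetical helper; the snippet above leaves in_range undefined.
static bool in_range(int key) { return key >= 1 && key <= 11; }

// With a zero-based table, key 1 maps to funcs[0], so the lookup would be:
//     *result = funcs[key - 1](key + 4);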
I've run into an issue porting a codebase from linux (gcc) to windows (msvc). It seems like the C99 function vsscanf isn't available and has no obvious replacement.
I've read about a solution using the internal function _input_l and linking statically to the crt runtime, but unfortunately I cannot link statically since it would mess with all the plugins (as dlls) being loaded by the application.
So is there any replacement or a way to write a wrapper for vsscanf?
Update 2016-02-24:
When this was first asked there was no native replacement but since then MSVC has implemented support for this and much more.
VS2013 and later implements vsscanf and friends.
C++11 includes support as well.
A hack that should work:
int vsscanf(const char *s, const char *fmt, va_list ap)
{
void *a[20];
int i;
for (i=0; i<sizeof(a)/sizeof(a[0]); i++) a[i] = va_arg(ap, void *);
return sscanf(s, fmt, a[0], a[1], a[2], a[3], a[4], a[5], a[6], /* etc... */);
}
Replace 20 with the max number of args you think you might need. This code isn't terribly portable but it's only intended to be used on one particular broken system missing vsscanf so that shouldn't matter so much.
A quick search turned up several suggestions, including http://www.flipcode.net/archives/vsscanf_for_Win32.shtml
As this is tagged C++ have you considered just biting the bullet and moving away from the scanf line of functions completely? The C++ idiomatic way would be to use a std::istringstream. Rewriting to make use of that instead of looking for a vsscanf replacement would possibly be easier and more portable, not to mention having much greater type safety.
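As a rough illustration of that approach (my own example, not taken from the original code):

#include <sstream>
#include <string>

// Parse "<id> <name>" from a line; returns false if either extraction fails.
bool parsePair(const std::string& line, int& id, std::string& name)
{
    std::istringstream in(line);
    return static_cast<bool>(in >> id >> name);  // type-safe, no format string needed
}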
Funny it never came up for me before today. I could've sworn I'd used the function in the past. But anyway, here's a solution that works and is as safe as your arguments and format string:
template < size_t _NumArgs >
int VSSCANF_S(LPCTSTR strSrc, LPCTSTR ptcFmt, INT_PTR (&arr)[_NumArgs]) {
class vaArgs
{
vaArgs() {}
INT_PTR* m_args[_NumArgs];
public:
vaArgs(INT_PTR (&arr)[_NumArgs])
{
for(size_t nIndex=0;nIndex<_NumArgs;++nIndex)
m_args[nIndex] = &arr[nIndex];
}
};
return sscanf_s(strSrc, ptcFmt, vaArgs(arr));
}
///////////////////////////////////////////////////////////////////////////////
int _tmain(int, LPCTSTR argv[])
{
INT_PTR args[3];
int nScanned = VSSCANF_S(_T("-52 Hello 456 #"), _T("%d Hello %u %c"), args);
return printf(_T("Arg1 = %d, arg2 = %u, arg3 = %c\n"), args[0], args[1], args[2]);
}
Out:
Arg1 = -52, arg2 = 456, arg3 = #
Press any key to continue . . .
Well I can't get the formatting right but you get the idea.
if you want to wrap sscanf and you are using C++11, you can do this:
template<typename... Args>
int mysscanf(const char* str, const char* fmt, Args... args) {
//...
return sscanf(str, fmt, args...);
}
to make this work on msvc, you need to download this update:
http://www.microsoft.com/en-us/download/details.aspx?id=35515
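Usage then looks just like sscanf itself, for example (illustrative only):

int year = 0;
char month[16] = {};
mysscanf("2014 March", "%d %15s", &year, month);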
modified from :
http://www.gamedev.net/topic/310888-no-vfscanf-in-visual-studio/
#if defined(_WIN32) && (_MSC_VER <= 1500)
static int vsscanf(
const char *buffer,
const char *format,
va_list argPtr
)
{
// Get an upper bound for the # of args
size_t count = 0;
const char* p = format;
while(1)
{
char c = *(p++);
if (c == 0)
break;
if (c == '%' && (p[0] != '*' && p[0] != '%'))
++count;
}
if (count <= 0)
return 0;
int result;
// copy stack pointer
_asm
{
mov esi, esp;
}
// push variable parameters pointers on stack
for (int i = count - 1; i >= 0; --i)
{
_asm
{
mov eax, dword ptr[i];
mov ecx, dword ptr [argPtr];
mov edx, dword ptr [ecx+eax*4];
push edx;
}
}
int stackAdvance = (2 + count) * 4;
_asm
{
// now push on the fixed params
mov eax, dword ptr [format];
push eax;
mov eax, dword ptr [buffer];
push eax;
// call sscanf, and move the result into result
call dword ptr [sscanf];
mov result, eax;
// restore stack pointer
mov eax, dword ptr[stackAdvance];
add esp, eax;
}
return result;
}
#endif // _WIN32 / _MSC_VER <= 1500
tested only on Visual Studio 2008
I have a C++ app that uses large arrays of data, and have noticed while testing that it is running out of memory, while there is still plenty of memory available. I have reduced the code to a sample test case as follows;
void MemTest()
{
size_t Size = 500*1024*1024; // 512mb
if (Size > _HEAP_MAXREQ)
TRACE("Invalid Size");
void * mem = malloc(Size);
if (mem == NULL)
TRACE("allocation failed");
}
If I create a new MFC project, include this function, and run it from InitInstance, it works fine in debug mode (memory allocated as expected), yet fails in release mode (malloc returns NULL). Single stepping through the release build into the C runtime, my function gets inlined and I get the following
// malloc.c
void * __cdecl _malloc_base (size_t size)
{
void *res = _nh_malloc_base(size, _newmode);
RTCCALLBACK(_RTC_Allocate_hook, (res, size, 0));
return res;
}
Calling _nh_malloc_base
void * __cdecl _nh_malloc_base (size_t size, int nhFlag)
{
void * pvReturn;
// validate size
if (size > _HEAP_MAXREQ)
return NULL;
'
'
And (size > _HEAP_MAXREQ) returns true and hence my memory doesn't get allocated. Putting a watch on size comes back with the expected 512MB, which suggests the program is linking into a different run-time library with a much smaller _HEAP_MAXREQ. Grepping the VC++ folders for _HEAP_MAXREQ shows the expected 0xFFFFFFE0, so I can't figure out what is happening here. Anyone know of any CRT changes or versions that would cause this problem, or am I missing something way more obvious?
Edit: As suggested by Andreas, looking at this in the assembly view shows the following:
--- f:\vs70builds\3077\vc\crtbld\crt\src\malloc.c ------------------------------
_heap_alloc:
0040B0E5 push 0Ch
0040B0E7 push 4280B0h
0040B0EC call __SEH_prolog (40CFF8h)
0040B0F1 mov esi,dword ptr [size]
0040B0F4 cmp dword ptr [___active_heap (434660h)],3
0040B0FB jne $L19917+7 (40B12Bh)
0040B0FD cmp esi,dword ptr [___sbh_threshold (43464Ch)]
0040B103 ja $L19917+7 (40B12Bh)
0040B105 push 4
0040B107 call _lock (40DE73h)
0040B10C pop ecx
0040B10D and dword ptr [ebp-4],0
0040B111 push esi
0040B112 call __sbh_alloc_block (40E736h)
0040B117 pop ecx
0040B118 mov dword ptr [pvReturn],eax
0040B11B or dword ptr [ebp-4],0FFFFFFFFh
0040B11F call $L19916 (40B157h)
$L19917:
0040B124 mov eax,dword ptr [pvReturn]
0040B127 test eax,eax
0040B129 jne $L19917+2Ah (40B14Eh)
0040B12B test esi,esi
0040B12D jne $L19917+0Ch (40B130h)
0040B12F inc esi
0040B130 cmp dword ptr [___active_heap (434660h)],1
0040B137 je $L19917+1Bh (40B13Fh)
0040B139 add esi,0Fh
0040B13C and esi,0FFFFFFF0h
0040B13F push esi
0040B140 push 0
0040B142 push dword ptr [__crtheap (43465Ch)]
0040B148 call dword ptr [__imp__HeapAlloc@12 (425144h)]
0040B14E call __SEH_epilog (40D033h)
0040B153 ret
$L19914:
0040B154 mov esi,dword ptr [ebp+8]
$L19916:
0040B157 push 4
0040B159 call _unlock (40DDBEh)
0040B15E pop ecx
$L19929:
0040B15F ret
_nh_malloc:
0040B160 cmp dword ptr [esp+4],0FFFFFFE0h
0040B165 ja _nh_malloc+29h (40B189h)
With the registers as follows;
EAX = 009C8AF0 EBX = FFFFFFFF ECX = 009C8A88 EDX = 00747365 ESI = 00430F80
EDI = 00430F80 EIP = 0040B160 ESP = 0013FDF4 EBP = 0013FFC0 EFL = 00000206
So the compare does appear to be against the correct constant, i.e. 0040B160 cmp dword ptr [esp+4],0FFFFFFE0h; also esp+4 = 0013FDF8, which holds 1F400000 (my 512mb)
Second edit: Problem was actually in HeapAlloc, as per Andreas' post. Changing to a new separate heap for large objects, using HeapCreate & HeapAlloc, did not help alleviate the problem, nor did an attempt to use VirtualAlloc with various parameters. Some further experimentation has shown that where allocation of one large section of contiguous memory fails, two smaller blocks yielding the same total memory are ok. e.g. where a 300MB malloc fails, 2 x 150MB mallocs work ok. So it looks like I'll need a new array class that can live in a number of biggish memory fragments rather than a single contiguous block. Not a major problem, but I would have expected a bit more out of Win32 in this day and age.
Last edit: The following yielded 1.875GB of space, albeit non-contiguous
#define TenMB 1024*1024*10
void SmallerAllocs()
{
size_t Total = 0;
LPVOID p[200];
for (int i = 0; i < 200; i++)
{
p[i] = malloc(TenMB);
if (p[i])
Total += TenMB; else
break;
}
CString Msg;
Msg.Format("Allocated %0.3lfGB",Total/(1024.0*1024.0*1024.0));
AfxMessageBox(Msg,MB_OK);
}
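As a very rough sketch of the "array in several biggish fragments" idea mentioned above (the class name and chunk size are mine, and error handling and bounds checks are omitted):

#include <vector>
#include <cstdlib>

// An array that lives in several ~10 MB fragments instead of one contiguous block.
class ChunkedBuffer
{
    static const size_t ChunkSize = 10 * 1024 * 1024;
    std::vector<char*> m_chunks;
public:
    explicit ChunkedBuffer(size_t totalBytes)
    {
        size_t chunkCount = (totalBytes + ChunkSize - 1) / ChunkSize;
        for (size_t i = 0; i < chunkCount; i++)
            m_chunks.push_back((char*)malloc(ChunkSize)); // error handling omitted
    }
    ~ChunkedBuffer()
    {
        for (size_t i = 0; i < m_chunks.size(); i++)
            free(m_chunks[i]);
    }
    // Translate a flat byte index into chunk + offset.
    char& operator[](size_t index)
    {
        return m_chunks[index / ChunkSize][index % ChunkSize];
    }
};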
May it be the case that the debugger is playing a trick on you in release mode? Neither single stepping nor the values of variables are reliable in release mode.
I tried your example in VS2003 in release mode, and when single stepping it does at first look like the code is landing on the return NULL line, but when I continue stepping it eventually continues into HeapAlloc. I would guess that it's this function that's failing; looking at the disassembly of if (size > _HEAP_MAXREQ) reveals the following:
00401078 cmp dword ptr [esp+4],0FFFFFFE0h
so I don't think it's a problem with _HEAP_MAXREQ.