Calculating offset for hotpatching/inline function hooking - c++

From http://lastfrag.com/hotpatching-and-inline-hooking-explained/,
Q1) Does code proceed from high memory to low memory or vice versa?
Q2) More importantly, during the calculation of the replacement offset, why is it that you have to minus the function preamble? Is it because the offset starts from the end of the instruction and not the beginning?
DWORD ReplacementAddressOffset = ReplacementAddress - OriginalAddress - 5;
Full Code:
void HookAPI(wchar_t *Module, char *API, DWORD Function)
{
HMODULE hModule = LoadLibrary(Module);
DWORD OriginalAddress = (DWORD)GetProcAddress(hModule, API);
DWORD ReplacementAddress = (DWORD)Function;
DWORD ReplacementAddressOffset = ReplacementAddress - OriginalAddress - 5;
LPBYTE pOriginalAddress = (LPBYTE)OriginalAddress;
LPBYTE pReplacementAddressOffset = (LPBYTE)(&ReplacementAddressOffset);
DWORD OldProtect = 0;
DWORD NewProtect = PAGE_EXECUTE_READWRITE;
VirtualProtect((PVOID)OriginalAddress, 5, NewProtect, &OldProtect);
for (int i = 0; i < 5; i++)
Store[i] = pOriginalAddress[i];
pOriginalAddress[0] = (BYTE)0xE9;
for (int i = 0; i < 4; i++)
pOriginalAddress[i + 1] = pReplacementAddressOffset[i];
VirtualProtect((PVOID)OriginalAddress, 5, OldProtect, &NewProtect);
FlushInstructionCache(GetCurrentProcess(), NULL, NULL);
FreeLibrary(hModule);
}
Q3) In this code, the relative address of a jmp instruction is being replaced; relAddrSet is a pointer to the original destination; to is a pointer to the new destination. I don't understand the calculation of the to address, why is it that you have to add the original destination to the functionForHook + opcodeOffset?
DWORD *relAddrSet = (DWORD *)(currentOpcode + 1);
DWORD_PTR to = (*relAddrSet) + ((DWORD_PTR)functionForHook + opcodeOffset);
*relAddrSet = (DWORD)(to - ((DWORD_PTR)originalFunction + opcodeOffset));

Yes, the relative address is the offset measured from the end of the jump instruction, which is why you have to subtract 5.
But, in my opinion, you should just forget the idea of the relative jump and try absolute jump.
Why ? Because it is a lot easier and x86-64 compatible (relative jumps are limited to +/-2GB).
The absolute jump is (x64) :
48 b8 ef cd ab 89 67 45 23 01 mov rax, 0x0123456789abcdef
ff e0 jmp rax
And for x86 :
b8 67 45 23 01 mov eax, 0x01234567
ff e0 jmp eax
Here is the modified code (the loader is now 7 bytes instead of 5):
void HookAPI(wchar_t *Module, char *API, DWORD Function)
{
HMODULE hModule = LoadLibrary(Module);
DWORD OriginalAddress = (DWORD)GetProcAddress(hModule, API);
DWORD ReplacementAddress = (DWORD)Function;
DWORD OldProtect = 0;
DWORD NewProtect = PAGE_EXECUTE_READWRITE;
VirtualProtect((PVOID)OriginalAddress, 7, NewProtect, &OldProtect);
memcpy(Store, (PVOID)OriginalAddress, 7); // save the 7 bytes that are about to be overwritten
memcpy((PVOID)OriginalAddress, "\xb8\x00\x00\x00\x00\xff\xe0", 7); // mov eax, imm32 ; jmp eax
memcpy((PVOID)(OriginalAddress + 1), &ReplacementAddress, sizeof(void*)); // fill in imm32
VirtualProtect((PVOID)OriginalAddress, 7, OldProtect, &NewProtect);
FlushInstructionCache(GetCurrentProcess(), NULL, NULL);
FreeLibrary(hModule);
}
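The code is the same for x64, but you have to add 2 nops (90) at the beginning or the end in order to match the size of the following instructions, so the loader is "\x48\xb8<8-byte addr>\xff\xe0\x90\x90" (14 bytes). On x64 the addresses also have to be held in a pointer-sized type such as DWORD_PTR rather than DWORD.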
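As a minimal sketch of the two loaders described above, the following builds the byte sequences in ordinary arrays instead of patching live code. The function names are made up for illustration; only the opcodes and sizes come from the answer.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// x86: mov eax, imm32 ; jmp eax   (5 + 2 = 7 bytes)
std::array<std::uint8_t, 7> make_abs_jump_x86(std::uint32_t target) {
    std::array<std::uint8_t, 7> stub{{0xB8, 0, 0, 0, 0, 0xFF, 0xE0}};
    for (int i = 0; i < 4; ++i)
        stub[1 + i] = static_cast<std::uint8_t>(target >> (8 * i)); // little-endian imm32
    return stub;
}

// x64: mov rax, imm64 ; jmp rax ; nop ; nop   (10 + 2 + 2 = 14 bytes)
std::array<std::uint8_t, 14> make_abs_jump_x64(std::uint64_t target) {
    std::array<std::uint8_t, 14> stub{{0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0,
                                       0xFF, 0xE0, 0x90, 0x90}};
    for (int i = 0; i < 8; ++i)
        stub[2 + i] = static_cast<std::uint8_t>(target >> (8 * i)); // little-endian imm64
    return stub;
}
```

Calling make_abs_jump_x86(0x01234567) reproduces the byte listing shown earlier (b8 67 45 23 01 ff e0). Unlike the relative jump, no subtraction is involved because the target is stored as an absolute address.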

Q1) The program runs from lower to higher addresses (i.e. the program counter is increased by the size of each instruction, except for jumps, calls or ret). But I am probably missing the point of the question.
Q2) Yes, on x86 the jump is taken after the program counter has been increased by the size of the jump instruction (5 bytes); when the CPU adds the jump offset to the program counter to calculate the target address, the program counter has already been increased by 5.
Q3) This code is quite weird, but it may work. I suppose that *relAddrSet initially contains a jump offset to originalFunction (i.e. *relAddrSet==originalFunction-relativeOffset). If this is true, the final result is that *relAddrSet contains a jump offset to functionForHook. Indeed the last instruction becomes:
*relAddrSet=(originalFunction-relativeOffset)+functionForHook-originalFunction
== functionForHook-relativeOffset

Yes, code runs "forward" if I understand this question correctly. One instruction is executed after another if it is not branching.
An instruction that does a relative jump (JMP, CALL) does the jump relative to the start of the next instruction. That's why you have to subtract the length of the instruction (here: 5) from the difference.
I can't answer your third question. Please give some context and what the code is supposed to do.
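The offset rule for relative jumps discussed in these answers can be sketched with plain integers standing in for addresses. This is only an illustration: the helper names are invented here, and nothing is written to executable memory.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Size of an x86 "jmp rel32": one opcode byte (0xE9) plus a 4-byte offset.
constexpr std::uint32_t kJmpRel32Size = 5;

// Offset to store after 0xE9 so that a jump located at `from` lands on `to`.
// Unsigned arithmetic wraps modulo 2^32, so backward jumps work as well.
std::uint32_t rel32_for_jump(std::uint32_t from, std::uint32_t to) {
    return to - from - kJmpRel32Size;
}

// Encode the jump exactly as it would be written at address `from`.
std::array<std::uint8_t, 5> encode_jmp(std::uint32_t from, std::uint32_t to) {
    std::array<std::uint8_t, 5> code{};
    code[0] = 0xE9;
    const std::uint32_t rel = rel32_for_jump(from, to);
    for (int i = 0; i < 4; ++i)
        code[1 + i] = static_cast<std::uint8_t>(rel >> (8 * i)); // little-endian
    return code;
}

// What the CPU computes: target = address of the NEXT instruction + rel32.
std::uint32_t decode_jmp_target(std::uint32_t from,
                                const std::array<std::uint8_t, 5>& code) {
    assert(code[0] == 0xE9);
    std::uint32_t rel = 0;
    for (int i = 0; i < 4; ++i)
        rel |= static_cast<std::uint32_t>(code[1 + i]) << (8 * i);
    return from + kJmpRel32Size + rel;
}
```

A jump at 0x1000 that targets 0x1005 (the very next instruction) gets an offset of 0, which is the easiest way to see why the 5 has to be subtracted.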

Related

Trying to understand a detour (hooking) function

Hi I'm trying to understand a function, it's about Windows API hooking. I'm trying to hook LoadLibraryA to see if any cheats are trying to inject into my game. For that I'm trying to intercept any calls to LoadLibraryA.
I tried to write comments to explain what I think is going on, but I'm unsure about the latter parts
// src = address of LoadLibraryA in kernel32.dll,
// dst = my function prototype of LoadLibraryA
// len = 5, as we allocate a JMP instruction (0xE9)
PVOID Detour(BYTE* src, const BYTE* dst, const int len)
{
BYTE* jmp = (BYTE*)malloc(len + 5); // allocate 10 bytes
DWORD oldProtection; // change protection of 5 bytes starting from LoadLibraryA in kernel32.dll
VirtualProtect(src, len, PAGE_EXECUTE_READWRITE, &oldProtection); // Changes the protection on a region of committed pages in the virtual address space of the calling process.
memcpy(jmp, src, len); // save 5 first bytes of the start of LoadLibraryA in kernel32.dll from src to jmp
jmp += len; // start from byte 6
jmp[0] = 0xE9; // insert jump from byte 6 - 10:
// jmp looks like this currently: [8BFF] = 2 bytes [55] = 1 byte [8BEC] = 2 bytes [0xE9] = 5 bytes
// ??
*(DWORD*)(jmp + 1) = (DWORD)(src + len - jmp) - 5; // ?
// ??
src[0] = 0xE9;
*(DWORD*)(src + 1) = (DWORD)(dst - src) - 5; // ?
// Set the same memory protection as before.
VirtualProtect(src, len, oldProtection, &oldProtection);
// ??
return (jmp - len);
}
Below is the representation before and after the hook (the Before and After screenshots are not included here).
The function works fine; I just need help understanding what's going on in the latter part of the function. I'm unsure what happens from jmp += len; onwards.
First thing I notice is your code is doing the detour and the jmp back all in one go, which is different than I usually see people do it.
memcpy(jmp, src, len);
You're copying the stolen bytes to the location of your shellcode
jmp is the address you're jumping to
jmp += len;
len is the number of stolen bytes, i.e. the bytes you overwrite in src, which are copied to the area you jmp to because they must still be executed. So you're advancing to the byte directly following the stolen bytes in your shellcode, which is where the jump back is written.
jmp[0] = 0xE9;
You're writing the relative jump instruction
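*(DWORD*)(jmp + 1) = (DWORD)(src + len - jmp) - 5;
jmp + 1 = the address right after the 0xE9 opcode byte, where you need to place the relative address
(src + len - jmp) - 5 is the equation required to get the relative address: the target (src + len, the first instruction after the stolen bytes) minus the address of the jump, minus the 5 bytes of the jump instruction itself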
src[0] = 0xE9;
*(DWORD*)(src + 1) = (DWORD)(dst - src) - 5;
You're doing the same thing you did inside your shellcode except you're just creating the detour to it in this case.
return (jmp - len);
You're returning the address of the shellcode (this is kinda weird but you have to do this because your code did jmp +=len)
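To make the two offset calculations concrete, here is a small simulation that uses made-up 32-bit addresses instead of real memory. plan_detour, write_jmp and jmp_target are hypothetical helpers, not part of the original Detour function, and the addresses in the usage note are arbitrary.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct DetourPlan {
    std::vector<std::uint8_t> trampoline; // stolen bytes followed by a jump back
    std::vector<std::uint8_t> patch;      // the 5-byte jump written over src
};

// Writes "E9 rel32" for a jump located at `from` that should land on `to`.
void write_jmp(std::uint8_t* out, std::uint32_t from, std::uint32_t to) {
    const std::uint32_t rel = to - from - 5;
    out[0] = 0xE9;
    for (int i = 0; i < 4; ++i)
        out[1 + i] = static_cast<std::uint8_t>(rel >> (8 * i));
}

// Resolves where an "E9 rel32" located at `from` would jump to.
std::uint32_t jmp_target(const std::uint8_t* code, std::uint32_t from) {
    assert(code[0] == 0xE9);
    std::uint32_t rel = 0;
    for (int i = 0; i < 4; ++i)
        rel |= static_cast<std::uint32_t>(code[1 + i]) << (8 * i);
    return from + 5 + rel;
}

// src: hooked function, dst: replacement, tramp: where the trampoline will live.
DetourPlan plan_detour(const std::vector<std::uint8_t>& stolen,
                       std::uint32_t src, std::uint32_t dst, std::uint32_t tramp) {
    const std::uint32_t len = static_cast<std::uint32_t>(stolen.size());
    assert(len >= 5);
    DetourPlan plan;
    plan.trampoline = stolen;                 // memcpy(jmp, src, len)
    plan.trampoline.resize(len + 5);
    write_jmp(&plan.trampoline[len], tramp + len, src + len); // jump back
    plan.patch.resize(5);
    write_jmp(plan.patch.data(), src, dst);   // detour at the start of src
    return plan;
}
```

With the stolen bytes 8B FF 55 8B EC, src = 0x76001000, dst = 0x00402000 and tramp = 0x00A30000, the trampoline's jump resolves to 0x76001005 and the patch resolves to 0x00402000. One caveat worth checking in the real code: the trampoline there comes from malloc, and heap memory is usually not executable when DEP is enabled, so an executable allocation is typically needed.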

Get current Image Base address

I made some computations to get a relative virtual address(RVA).
I compute a correct RVA (according to the .map file) and I want to translate it to a callable function pointer.
Is there a way to translate it to a physical address?
I have thought of taking the image base address and add it. According to this thread it should be possible via GetModuleHandle(NULL), but the result is "wrong". I only have a good result when I subtract a pointer from a function from its RVA defined in the .map file.
Is there a WINAPI to either convert a RVA to a physical address, or get the image base address, or get the RVA of a physical address?
Here's some example code:
#include <stdio.h>
#include <Windows.h>
#include <WinNT.h>
static int count = 0;
void reference() // According to .map RVA = 0x00401000
{
}
void toCall() // According to .map RVA = 0x00401030
{
printf("Called %d\n", ++count);
}
int main()
{
typedef void (*fnct_t)();
fnct_t fnct;
fnct = (fnct_t) (0x00401030 + (((int) reference) - 0x00401000));
fnct(); // works
fnct = (fnct_t) (0x00401030 + ((int) GetModuleHandle(NULL)) - 0x00400000);
fnct(); // often works
return 0;
}
My main concern is that it seems that sometimes (maybe in threaded contexts) GetModuleHandle(NULL) isn't correct.
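The arithmetic used in main() can be written as one helper. This sketch assumes, as the example code suggests, that the values taken from the .map file (0x00401000, 0x00401030) still include the preferred base 0x00400000, so the true RVA is the difference between the two. The name rebase is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// mapAddress:    address as listed in the .map file (preferred base included)
// preferredBase: base address the image was linked for
// actualBase:    base address the image is really loaded at
std::uintptr_t rebase(std::uintptr_t mapAddress,
                      std::uintptr_t preferredBase,
                      std::uintptr_t actualBase) {
    assert(mapAddress >= preferredBase);
    const std::uintptr_t rva = mapAddress - preferredBase; // the actual RVA
    return actualBase + rva;
}
```

In the question's program, actualBase would be the value returned by GetModuleHandle(NULL) cast to an integer, and the result would be cast to fnct_t before calling it.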
To get the image base without it being known at compile time, you can do a simple search: align the current VA down (the code below masks with 0FFFF0000h, i.e. a 64 KB boundary) and step backwards until the 'MZ' signature of the DOS header is found at the start of the block.
This assumes the base address is not relocated into another PE image. I've prepared a function for you:
DWORD FindImageBase() {
DWORD* VA = (DWORD *) &FindImageBase, ImageBase;
__asm {
mov eax, VA
and eax, 0FFFF0000h
search:
cmp word ptr [eax], 0x5a4d
je stop
sub eax, 00010000h
jmp search
stop:
mov [ImageBase], 0
mov [ImageBase], eax
}
return ImageBase;
}
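The same backward search can be tried out safely on a byte buffer that stands in for the address space. This is a portable sketch with invented names (find_image_base, kNotFound), not a replacement for the inline assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kGranularity = 0x10000;               // 64 KB, as in the 0FFFF0000h mask
constexpr std::size_t kNotFound = static_cast<std::size_t>(-1);

// Returns the offset of the nearest 64 KB boundary at or below `offset`
// whose first two bytes are 'M','Z', or kNotFound if there is none.
std::size_t find_image_base(const std::vector<std::uint8_t>& memory,
                            std::size_t offset) {
    assert(offset < memory.size());
    std::size_t pos = offset & ~(kGranularity - 1);         // align down
    for (;;) {
        if (pos + 1 < memory.size() && memory[pos] == 'M' && memory[pos + 1] == 'Z')
            return pos;
        if (pos == 0)
            return kNotFound;                                // ran out of "memory"
        pos -= kGranularity;                                 // step back one block
    }
}
```

The simulation stops at offset 0, whereas the assembly loop has no lower bound, so it presumably relies on a header always being found before an unreadable page is touched.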

Calling C++ Method from Assembly with Parameters and Return Value

So I've asked this before but with significantly less detail. The question title accurately describes the problem: I have a method in C++ that I am trying to call from assembly (x86) that has both parameters and a return value. I have a rough understanding, at best, of assembly and a fairly solid understanding of C++ (otherwise I would not have undertaken this problem). Here's what I have as far as code goes:
// methodAddr is a pointer to the method address
void* methodAddr = method->Address;
// buffer is an int array of parameter values. The parameters can be anything (of any type)
// but are copied into an int array so they can be pushed onto the stack in reverse order
// 4 bytes at a time (as in push (int)). I know there's an issue here that is irrelevant to my baseline testing, in that if any parameter is over 4 bytes it will be broken and
// reversed (which is not good) but for basic testing this isn't an issue, so overlook this.
for (int index = bufferElementCount - 1; index >= 0; index--)
{
int val = buffer[index];
__asm
{
push val
}
}
int returnValueCount = 0;
// if there is a return value, allocate some space for it and push that onto the stack after
// the parameters have been pushed on
if (method->HasReturnValue)
{
*returnSize = method->ReturnValueSize;
outVal = new char[*returnSize];
returnValueCount = (*returnSize / 4) + (*returnSize % 4 != 0 ? 1 : 0);
memset(outVal, 0, *returnSize);
for (int index = returnValueCount - 1; index >= 0; index--)
{
char* addr = ((char*)outVal) + (index * 4);
__asm
{
push addr
}
}
}
// calculate the stack pointer offset so after the call we can pop the parameters and return value
int espOffset = (bufferElementCount + returnValueCount) * 4;
// call the method
__asm
{
call methodAddr;
add esp, espOffset
};
For my basic testing I am using a method with the following signature:
Person MyMethod3( int, char, int );
The problem is this: when I omit the return value from the method signature, all of the parameter values are properly passed. But when I leave the method as is, the parameter data that is passed is incorrect, although the value returned is correct. So my question, obviously, is what is wrong? I've also tried pushing the return value space onto the stack before the parameters. The Person structure is as follows:
class Person
{
public:
Text Name;
int Age;
float Cash;
ICollection<Person*>* Friends;
};
Any help would be greatly appreciated. Thanks!
I'm using Visual Studio 2013 with the November 2013 CTP compiler for C++, targeting x86.
As it relates to disassembly, this is the straight method call:
int one = 876;
char two = 'X';
int three = 9738;
Person p = MyMethod3(one, two, three);
And here is the disassembly for that:
00CB0A20 mov dword ptr [one],36Ch
char two = 'X';
00CB0A27 mov byte ptr [two],58h
int three = 9738;
00CB0A2B mov dword ptr [three],260Ah
Person p = MyMethod3(one, two, three);
00CB0A32 push 10h
00CB0A34 lea ecx,[p]
00CB0A37 call Person::__autoclassinit2 (0C6AA2Ch)
00CB0A3C mov eax,dword ptr [three]
00CB0A3F push eax
00CB0A40 movzx ecx,byte ptr [two]
00CB0A44 push ecx
00CB0A45 mov edx,dword ptr [one]
00CB0A48 push edx
00CB0A49 lea eax,[p]
00CB0A4C push eax
00CB0A4D call MyMethod3 (0C6B783h)
00CB0A52 add esp,10h
00CB0A55 mov dword ptr [ebp-4],0
My interpretation of this is as follows:
Execute the assignments to the local variables. Then create the output register. Then put the parameters in a particular register (the order here happens to be eax, ecx, and edx, which makes sense (eax and ebx are for one, ecx is for two, and edx and some other register for the last parameter?)). Then call LEA (load-effective address) which I don't understand but have understood to be a MOV. Then it calls the method with an address as the parameter? And then moves the stack pointer to pop the parameters and return value.
Any further explanation is appreciated, as I'm sure my understanding here is somewhat flawed.
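One reading of the disassembly above: the three arguments are pushed right to left (three, two, one), then a single pointer to p is pushed last, and add esp,10h removes four 4-byte slots. In other words, the struct is returned through one hidden pointer that acts as an extra first argument. The sketch below expresses that in plain C++. The Person here is a simplified 16-byte stand-in and MyMethod3_lowered is a made-up name for what the compiler generates.

```cpp
#include <cassert>

// Simplified stand-in for the Person class in the question.
struct Person {
    const char* Name;
    int Age;
    float Cash;
    void* Friends;
};

// What the source code says.
Person MyMethod3(int a, char b, int c) {
    Person p{};
    p.Age = a + c;
    p.Cash = b;
    return p;
}

// Roughly what the x86 call looks like after lowering: the caller supplies
// the storage for the result, and the callee hands the same pointer back
// (in eax on x86).
Person* MyMethod3_lowered(Person* ret, int a, char b, int c) {
    *ret = MyMethod3(a, b, c);
    return ret;
}
```

If this reading is right, the loop that pushes one address per 4 bytes of return value puts extra pointers on the stack. The callee would still find a valid buffer pointer in the first slot but would read its parameters from shifted positions, which matches the reported symptom. Pushing a single pointer to outVal after the parameters would mirror what the compiler does.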

How can I prevent MSVC++ from over-allocating stack space for a switch statement?

As part of updating the toolchain for a legacy codebase, we would like to move from the Borland C++ 5.02 compiler to the Microsoft compiler (VS2008 or later). This is an embedded environment where the stack address space is predefined and fairly limited. It turns out that we have a function with a large switch statement which causes a much larger stack allocation under the MS compiler than with Borland's and, in fact, results in a stack overflow.
The form of the code is something like this:
#ifdef PKTS
#define RETURN_TYPE SPacket
typedef struct
{
int a;
int b;
int c;
int d;
int e;
int f;
} SPacket;
SPacket error = {0,0,0,0,0,0};
#else
#define RETURN_TYPE int
int error = 0;
#endif
extern RETURN_TYPE pickone(int key);
void findresult(int key, RETURN_TYPE* result)
{
switch(key)
{
case 1 : *result = pickone(5 ); break;
case 2 : *result = pickone(6 ); break;
case 3 : *result = pickone(7 ); break;
case 4 : *result = pickone(8 ); break;
case 5 : *result = pickone(9 ); break;
case 6 : *result = pickone(10); break;
case 7 : *result = pickone(11); break;
case 8 : *result = pickone(12); break;
case 9 : *result = pickone(13); break;
case 10 : *result = pickone(14); break;
case 11 : *result = pickone(15); break;
default : *result = error; break;
}
}
When compiled with cl /O2 /FAs /c /DPKTS stack_alloc.cpp, a portion of the listing file looks like this:
_TEXT SEGMENT
$T2592 = -264 ; size = 24
$T2582 = -240 ; size = 24
$T2594 = -216 ; size = 24
$T2586 = -192 ; size = 24
$T2596 = -168 ; size = 24
$T2590 = -144 ; size = 24
$T2598 = -120 ; size = 24
$T2588 = -96 ; size = 24
$T2600 = -72 ; size = 24
$T2584 = -48 ; size = 24
$T2602 = -24 ; size = 24
_key$ = 8 ; size = 4
_result$ = 12 ; size = 4
?findresult@@YAXHPAUSPacket@@@Z PROC ; findresult, COMDAT
; 27 : switch(key)
mov eax, DWORD PTR _key$[esp-4]
dec eax
sub esp, 264 ; 00000108H
...
$LN11@findresult:
; 30 : case 2 : *result = pickone(6 ); break;
push 6
lea ecx, DWORD PTR $T2584[esp+268]
push ecx
jmp SHORT $LN17@findresult
$LN10@findresult:
; 31 : case 3 : *result = pickone(7 ); break;
push 7
lea ecx, DWORD PTR $T2586[esp+268]
push ecx
jmp SHORT $LN17@findresult
$LN17@findresult:
call ?pickone@@YA?AUSPacket@@H@Z ; pickone
mov edx, DWORD PTR [eax]
mov ecx, DWORD PTR _result$[esp+268]
mov DWORD PTR [ecx], edx
mov edx, DWORD PTR [eax+4]
mov DWORD PTR [ecx+4], edx
mov edx, DWORD PTR [eax+8]
mov DWORD PTR [ecx+8], edx
mov edx, DWORD PTR [eax+12]
mov DWORD PTR [ecx+12], edx
mov edx, DWORD PTR [eax+16]
mov DWORD PTR [ecx+16], edx
mov eax, DWORD PTR [eax+20]
add esp, 8
mov DWORD PTR [ecx+20], eax
; 41 : }
; 42 : }
add esp, 264 ; 00000108H
ret 0
The allocated stack space includes dedicated locations for each case to temporarily store the structure returned from pickone(), though in the end, only one value will be copied to the result structure. As you can imagine, with larger structures, more cases, and recursive calls in this function, the available stack space is consumed rapidly.
If the return type is POD, as when the above is compiled without the /DPKTS directive, each case copies directly to result, and stack usage is more efficient:
$LN10@findresult:
; 31 : case 3 : *result = pickone(7 ); break;
push 7
call ?pickone@@YAHH@Z ; pickone
mov ecx, DWORD PTR _result$[esp]
add esp, 4
mov DWORD PTR [ecx], eax
; 41 : }
; 42 : }
ret 0
Can anyone explain why the compiler takes this approach and whether there's a way to convince it to do otherwise? I have limited freedom to re-architect the code, so pragmas and the like are the more desirable solutions. So far, I have not found any combination of optimization, debug, etc. arguments that make a difference.
Thank you!
EDIT
I understand that findresult() needs to allocate space for the return value of pickone(). What I don't understand is why the compiler allocates additional space for each possible case in the switch. It seems that space for one temporary would be sufficient. This is, in fact, how gcc handles the same code. Borland, on the other hand, appears to use RVO, passing the pointer all the way down and avoiding use of a temporary. The MS C++ compiler is the only one of the three that reserves space for each case in the switch.
I know that it's difficult to suggest refactoring options when you don't know which portions of the test code can change -- that's why my first question is why does the compiler behave this way in the test case. I'm hoping that if I can understand that, I can choose the best refactoring/pragma/command-line option to fix it.
Why not just
void findresult(int key, RETURN_TYPE* result)
{
if (key >= 1 && key <= 11)
*result = pickone(4+key);
else
*result = error;
}
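Before replacing the switch with the range check, it may be worth confirming that both forms agree for every key. The following sketch does that with a hypothetical stand-in for pickone(); the real function is external and unknown here, and error_packet is the question's error object under a different name.

```cpp
#include <cassert>

struct SPacket { int a, b, c, d, e, f; };
static const SPacket error_packet = {0, 0, 0, 0, 0, 0};

// Hypothetical stand-in for the extern pickone() in the question.
SPacket pickone(int key) {
    SPacket p = {key, key * 2, key * 3, 0, 0, 0};
    return p;
}

// The switch from the question.
void findresult_switch(int key, SPacket* result) {
    switch (key) {
    case 1:  *result = pickone(5);  break;
    case 2:  *result = pickone(6);  break;
    case 3:  *result = pickone(7);  break;
    case 4:  *result = pickone(8);  break;
    case 5:  *result = pickone(9);  break;
    case 6:  *result = pickone(10); break;
    case 7:  *result = pickone(11); break;
    case 8:  *result = pickone(12); break;
    case 9:  *result = pickone(13); break;
    case 10: *result = pickone(14); break;
    case 11: *result = pickone(15); break;
    default: *result = error_packet; break;
    }
}

// The range-check form: a single call site, so at most one temporary.
void findresult_range(int key, SPacket* result) {
    if (key >= 1 && key <= 11)
        *result = pickone(4 + key);
    else
        *result = error_packet;
}

bool same(const SPacket& x, const SPacket& y) {
    return x.a == y.a && x.b == y.b && x.c == y.c &&
           x.d == y.d && x.e == y.e && x.f == y.f;
}

// True if both versions produce the same packet for every key in [lo, hi].
bool agree_on(int lo, int hi) {
    for (int key = lo; key <= hi; ++key) {
        SPacket a, b;
        findresult_switch(key, &a);
        findresult_range(key, &b);
        if (!same(a, b)) return false;
    }
    return true;
}
```

This only applies if the real cases follow the key + 4 pattern of the test code. If the production switch calls different functions per case, the function-table approach is the closer match.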
Assuming this counts as a smaller change, I just remembered an old question about scope, specifically related to embedded compilers. Does the optimizer do any better if you wrap each case in braces to explicitly limit the temporary scope?
switch(key)
{
case 1 : { *result = pickone(5 ); break; }
Another scope-changing option:
void findresult(int key, RETURN_TYPE* result)
{
RETURN_TYPE tmp;
switch(key)
{
case 1 : tmp = pickone(5 ); break;
...
}
*result = tmp;
}
This is all a bit hand-wavy, because we're just trying to guess which input will coax a sensible response from this unfortunate optimizer.
I'm going to assume that rewriting that function is allowed, as long as the changes don't "leak" outside the function. I'm also assuming that (as mentioned in the comments) you actually have a number of separate functions to call (but that they all receive the same type of input and return the same result type).
For such a case, I'd probably change the function to something like:
RETURN_TYPE func1(int) { /* ... */ }
RETURN_TYPE func2(int) { /* ... */ }
// ...
void findresult(int key, RETURN_TYPE *result) {
typedef RETURN_TYPE (*f)(int);
f funcs[] = {func1, func2, func3, func4, func5, /* ... */ };
if (in_range(key))
*result = funcs[key](key+4);
else
*result = error;
}

Unusual heap size limitations in VS2003 C++

I have a C++ app that uses large arrays of data, and have noticed while testing that it is running out of memory, while there is still plenty of memory available. I have reduced the code to a sample test case as follows;
void MemTest()
{
size_t Size = 500*1024*1024; // 500 MB (0x1F400000 bytes)
if (Size > _HEAP_MAXREQ)
TRACE("Invalid Size");
void * mem = malloc(Size);
if (mem == NULL)
TRACE("allocation failed");
}
If I create a new MFC project, include this function, and run it from InitInstance, it works fine in debug mode (memory allocated as expected), yet fails in release mode (malloc returns NULL). Single stepping through the release build into the C run-time (my function gets inlined), I get the following:
// malloc.c
void * __cdecl _malloc_base (size_t size)
{
void *res = _nh_malloc_base(size, _newmode);
RTCCALLBACK(_RTC_Allocate_hook, (res, size, 0));
return res;
}
Calling _nh_malloc_base
void * __cdecl _nh_malloc_base (size_t size, int nhFlag)
{
void * pvReturn;
// validate size
if (size > _HEAP_MAXREQ)
return NULL;
// ...
And (size > _HEAP_MAXREQ) returns true and hence my memory doesn't get allocated. Putting a watch on size comes back with the expected 500MB, which suggests the program is linking into a different run-time library with a much smaller _HEAP_MAXREQ. Grepping the VC++ folders for _HEAP_MAXREQ shows the expected 0xFFFFFFE0, so I can't figure out what is happening here. Anyone know of any CRT changes or versions that would cause this problem, or am I missing something way more obvious?
Edit: As suggested by Andreas, looking at this under this assembly view shows the following;
--- f:\vs70builds\3077\vc\crtbld\crt\src\malloc.c ------------------------------
_heap_alloc:
0040B0E5 push 0Ch
0040B0E7 push 4280B0h
0040B0EC call __SEH_prolog (40CFF8h)
0040B0F1 mov esi,dword ptr [size]
0040B0F4 cmp dword ptr [___active_heap (434660h)],3
0040B0FB jne $L19917+7 (40B12Bh)
0040B0FD cmp esi,dword ptr [___sbh_threshold (43464Ch)]
0040B103 ja $L19917+7 (40B12Bh)
0040B105 push 4
0040B107 call _lock (40DE73h)
0040B10C pop ecx
0040B10D and dword ptr [ebp-4],0
0040B111 push esi
0040B112 call __sbh_alloc_block (40E736h)
0040B117 pop ecx
0040B118 mov dword ptr [pvReturn],eax
0040B11B or dword ptr [ebp-4],0FFFFFFFFh
0040B11F call $L19916 (40B157h)
$L19917:
0040B124 mov eax,dword ptr [pvReturn]
0040B127 test eax,eax
0040B129 jne $L19917+2Ah (40B14Eh)
0040B12B test esi,esi
0040B12D jne $L19917+0Ch (40B130h)
0040B12F inc esi
0040B130 cmp dword ptr [___active_heap (434660h)],1
0040B137 je $L19917+1Bh (40B13Fh)
0040B139 add esi,0Fh
0040B13C and esi,0FFFFFFF0h
0040B13F push esi
0040B140 push 0
0040B142 push dword ptr [__crtheap (43465Ch)]
0040B148 call dword ptr [__imp__HeapAlloc@12 (425144h)]
0040B14E call __SEH_epilog (40D033h)
0040B153 ret
$L19914:
0040B154 mov esi,dword ptr [ebp+8]
$L19916:
0040B157 push 4
0040B159 call _unlock (40DDBEh)
0040B15E pop ecx
$L19929:
0040B15F ret
_nh_malloc:
0040B160 cmp dword ptr [esp+4],0FFFFFFE0h
0040B165 ja _nh_malloc+29h (40B189h)
With the registers as follows;
EAX = 009C8AF0 EBX = FFFFFFFF ECX = 009C8A88 EDX = 00747365 ESI = 00430F80
EDI = 00430F80 EIP = 0040B160 ESP = 0013FDF4 EBP = 0013FFC0 EFL = 00000206
So the compare does appear to be against the correct constant, i.e. at 0040B160 cmp dword ptr [esp+4],0FFFFFFE0h; also, the value at esp+4 (address 0013FDF8) is 1F400000 (my 500MB).
Second edit: The problem was actually in HeapAlloc, as per Andreas' post. Changing to a new separate heap for large objects, using HeapCreate & HeapAlloc, did not help alleviate the problem, nor did an attempt to use VirtualAlloc with various parameters. Some further experimentation has shown that where allocating one large section of contiguous memory fails, two smaller blocks yielding the same total memory are OK, e.g. where a 300MB malloc fails, 2 x 150MB mallocs work. So it looks like I'll need a new array class that can live in a number of biggish memory fragments rather than a single contiguous block. Not a major problem, but I would have expected a bit more out of Win32 in this day and age.
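An array class that lives in several fragments could look roughly like the sketch below. ChunkedArray and its interface are invented for illustration, and it uses std::unique_ptr, so it needs a newer compiler than VS2003; the same layout can be written with raw pointers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Fixed-size array stored as several separately allocated chunks, so no
// single allocation needs one huge contiguous range of address space.
template <typename T>
class ChunkedArray {
public:
    ChunkedArray(std::size_t size, std::size_t chunkElems)
        : size_(size), chunkElems_(chunkElems) {
        assert(chunkElems > 0);
        const std::size_t chunks = (size + chunkElems - 1) / chunkElems;
        chunks_.reserve(chunks);
        for (std::size_t i = 0; i < chunks; ++i) {
            // The last chunk may be shorter than the others.
            const std::size_t n = std::min(chunkElems, size - i * chunkElems);
            chunks_.push_back(std::unique_ptr<T[]>(new T[n]())); // zero-initialized
        }
    }

    T& operator[](std::size_t i) {
        assert(i < size_);
        return chunks_[i / chunkElems_][i % chunkElems_];
    }
    const T& operator[](std::size_t i) const {
        assert(i < size_);
        return chunks_[i / chunkElems_][i % chunkElems_];
    }

    std::size_t size() const { return size_; }
    std::size_t chunk_count() const { return chunks_.size(); }

private:
    std::size_t size_;
    std::size_t chunkElems_;
    std::vector<std::unique_ptr<T[]>> chunks_;
};
```

For example, ChunkedArray<char> data(300u * 1024 * 1024, 10u * 1024 * 1024) would request thirty 10MB blocks instead of one 300MB block. Indexing costs one division and one modulo per access; choosing a power-of-two chunk size lets the compiler reduce those to a shift and a mask.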
Last edit: The following yielded 1.875GB of space, albeit non-contiguous
#define TenMB 1024*1024*10
void SmallerAllocs()
{
size_t Total = 0;
LPVOID p[200];
for (int i = 0; i < 200; i++)
{
p[i] = malloc(TenMB);
if (p[i])
Total += TenMB;
else
break;
}
CString Msg;
Msg.Format("Allocated %0.3lfGB",Total/(1024.0*1024.0*1024.0));
AfxMessageBox(Msg,MB_OK);
}
Could it be the case that the debugger is playing a trick on you in release mode? Neither single stepping nor the values of variables are reliable in release mode.
I tried your example in VS2003 in release mode, and when single stepping it does at first look like the code is landing on the return NULL line, but when I continue stepping it eventually continues into HeapAlloc. I would guess that it's this function that's failing. Looking at the disassembly of if (size > _HEAP_MAXREQ) reveals the following:
00401078 cmp dword ptr [esp+4],0FFFFFFE0h
so I don't think it's a problem with _HEAP_MAXREQ.
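The numbers back this up. As a quick check, using the constant quoted in the question and a made-up helper name:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kHeapMaxReq = 0xFFFFFFE0u;           // _HEAP_MAXREQ as quoted above
constexpr std::uint32_t kRequest = 500u * 1024u * 1024u;     // size used in MemTest()

// Mirrors the CRT check: the request is rejected only if size > _HEAP_MAXREQ.
bool passes_size_check(std::uint32_t size) {
    return !(size > kHeapMaxReq);
}
```

kRequest is 0x1F400000, the same value seen at esp+4, and it is far below the limit, so the size check cannot be what returns NULL.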