MSVC 2013: crash at addpd xmm6,xmmword ptr [rax+rbx*8] - c++

I switched to MSVC 2013 a few days ago, and since then my application crashes when executing the following code (sparse matrix multiplied by a vector; pseudocode: A = this * pVector):
complex<double> x = (A.getValue(lRow) + (mValues[lIdx] * pVectorB->getValue(lCol)));
Before that I used MSVC 2005, and the application ran fine.
The following exception is thrown: First-chance exception at 0x000000014075D1D2 in psc64.exe: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF.
I tracked the crash down to this instruction:
addpd xmm6, xmmword ptr [rax+rbx*8]
It crashes only with optimization /O2 (maximize speed), not with optimization disabled (/Od).
I can also avoid the crash by adding code (cout<<"bla bla") to the method pVectorB->getValue(lCol).
I suspected uninitialized variables but could not find any, so I looked at the disassembly.
I checked XMM6 and the value at [rax+rbx*8]. They are the same in the non-crashing build (with cout<<"bla bla") and in the crashing build.
Is there anything else I should look at other than XMM6 and the value at [rax+rbx*8]?
I have been hunting this for quite some time but could not find any hint that leads me to the line of code I have to correct.
Any help is highly appreciated. Thank you.
The code for getValue:
template <class T> class Vector
{ const T& getValue(const int pIdx) const
{
if(false == checkBounds(pIdx)){
throw MathException(__FILE__, __LINE__, "T& Vector<class T>::getValue(const pIdx): checkBounds fails pIdx = %i", pIdx);
}
return mVal[pIdx];
}
bool checkBounds(const int pIdx)const
{
bool ret = true;
if(pIdx >= mMaxSize){
DBG_SEVERE2("pIdx >= mMaxSize, pIdx = %i, mMaxSize = %i", pIdx, mMaxSize);
ret = false;
}
if(pIdx < 0){
DBG_SEVERE1("pIdx < 0, pIdx = %i", pIdx);
ret = false;
}
return ret;
}
};
The allocation of mVal:
void* lTmp= calloc((4 * sizeof(complex<double>))+4, 1);
((char*)lTmp)[0] = 0xC;
((char*)lTmp)[1] = 0xC;
((char*)lTmp)[(4 * sizeof(complex<double>)) + 2] = 0xC;
((char*)lTmp)[(4 * sizeof(complex<double>)) + 3] = 0xC;
mVal= (void*)(((char*)lTmp) + 2);
SOLUTION:
As suggested, it works without the 2 bytes in front of and behind the desired array (mVal). It also works when the padding before and after the array is a multiple of 16 bytes.

When a memory operand is used directly in an SSE instruction such as addpd xmm6, xmmword ptr [rax+rbx*8], the memory operand must be 16-byte aligned. There are unaligned load instructions: you could load from [rax+rbx*8] with movdqu (or movupd for doubles) and then operate on the register. But if you use the memory form, alignment matters. The optimization flags probably changed the alignment of your array, or the optimizer folded the load into a memory operand (which is faster in some cases) and caused the problem that way.
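The alignment requirement above can be checked directly. The following is a minimal sketch, not the asker's code: gBlock, arrayStart and isAligned are hypothetical names, and the block is assumed to start on a 16-byte boundary. It shows why a 2-byte guard in front of the array breaks 16-byte alignment while a 16-byte guard does not.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when p is a multiple of `alignment` (which must be a power of two).
inline bool isAligned(const void* p, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical stand-in for the allocated block: forced to a 16-byte boundary.
alignas(16) static unsigned char gBlock[96];

// Mimics "guard bytes + array": the array starts `padding` bytes into the block.
inline const void* arrayStart(std::size_t padding)
{
    return gBlock + padding;
}
```

With this setup, arrayStart(2) is not 16-byte aligned, while arrayStart(0) and arrayStart(16) are, which matches the observation in the SOLUTION note.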

OpenMP Strange Behavior - differences in performance

I want to speed up image processing code using OpenMP, and I found some strange behavior in my code. I'm using Visual Studio 2019, and I also tried the Intel C++ compiler with the same result.
I'm not sure why the code with OpenMP is much slower in some situations than in others. Examples are the function divideImageDataWithParam(), or the difference between copyFirstPixelOnRow() and copyFirstPixelOnRowUsingTSize(), which takes the image size as a struct TSize parameter. Why is the performance of boxFilterRow() and boxFilterRow_OpenMP() so different, and why isn't it so with a different radius size in the program?
I created a GitHub repository for this little test project:
https://github.com/Tb45/OpenMP-Strange-Behavior
Here are all the results, summarized:
https://github.com/Tb45/OpenMP-Strange-Behavior/blob/master/resuts.txt
I haven't found any explanation for why this is happening or what I am doing wrong.
Thanks for your help.
I'm working on a faster box filter and other image processing algorithms.
typedef intptr_t int_t;
struct TSize
{
int_t width;
int_t height;
};
void divideImageDataWithParam(
const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param)
{
for (int_t y = 0; y < size.height; y++)
{
for (int_t x = 0; x < size.width; x++)
{
dst[y*dstStep + x] = src[y*srcStep + x]/param;
}
}
}
void divideImageDataWithParam_OpenMP(
const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param, bool parallel)
{
#pragma omp parallel for if(parallel)
for (int_t y = 0; y < size.height; y++)
{
for (int_t x = 0; x < size.width; x++)
{
dst[y*dstStep + x] = src[y*srcStep + x]/param;
}
}
}
Results of divideImageDataWithParam():
generateRandomImageData :: 3840x2160
numberOfIterations = 100
With Visual C++ 2019:
32bit        64bit
336.906ms    344.251ms    divideImageDataWithParam
1832.120ms   6395.861ms   divideImageDataWithParam_OpenMP single-thread parallel=false
387.152ms    1204.302ms   divideImageDataWithParam_OpenMP multi-threaded parallel=true
With Intel C++ 19:
32bit        64bit
15.162ms     8.927ms      divideImageDataWithParam
266.646ms    294.134ms    divideImageDataWithParam_OpenMP single-threaded parallel=false
239.564ms    1195.556ms   divideImageDataWithParam_OpenMP multi-threaded parallel=true
Screenshot from Intel VTune Amplifier, in which divideImageDataWithParam_OpenMP() with parallel=false spends most of its time in the mov instruction that writes to dst memory.
648trindade is right; it has to do with optimizations that cannot be done with OpenMP. But it's not loop unrolling or vectorization; it's inlining, which allows for a smart substitution.
Let me explain: integer divisions are incredibly slow (64-bit IDIV: ~40-100 cycles). So whenever possible, people (and compilers) try to avoid divisions. One trick is to substitute a division with a multiplication and a shift. That only works if the divisor is known at compile time. This is the case here because your function divideImageDataWithParam is inlined and param is known. You can verify this by prepending __declspec(noinline) to it: you will then get the timings that you expected.
The OpenMP parallelization does not allow this trick because the function cannot be inlined; param is therefore not known at compile time, and an expensive IDIV instruction is generated.
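To illustrate the multiply-and-shift substitution described above, here is a small sketch for one fixed divisor. The divisor 10 and the constants 205 and 11 are chosen for this example only (they are not taken from the question or from the compiler output), and the function names are hypothetical. The constants work because 205 / 2^11 is slightly larger than 1/10, and the error is too small to matter for 8-bit inputs.

```cpp
#include <cassert>
#include <cstdint>

// Divides an 8-bit value by the constant 10 without a division instruction.
// 205 / 2048 is slightly above 1/10; for x in 0..255 the accumulated error
// (x / 10240) is too small to change the truncated result.
inline std::uint8_t divideBy10(std::uint8_t x)
{
    return static_cast<std::uint8_t>((static_cast<std::uint32_t>(x) * 205u) >> 11);
}

// Checks the trick against ordinary division for every 8-bit input.
inline bool divideBy10MatchesDivision()
{
    for (unsigned x = 0; x < 256; ++x) {
        unsigned viaShift = divideBy10(static_cast<std::uint8_t>(x));
        if (viaShift != x / 10) {
            return false;
        }
    }
    return true;
}
```

A compiler picks such constants itself once it can see the divisor, which is why inlining matters here.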
Compiler output of divideImageDataWithParam (WIN10, MSVC2017, x64):
0x7ff67d151480 <+ 336> movzx ecx,byte ptr [r10+r8]
0x7ff67d151485 <+ 341> mov rax,r12
0x7ff67d151488 <+ 344> mul rax,rcx <------- multiply
0x7ff67d15148b <+ 347> shr rdx,3 <------- shift
0x7ff67d15148f <+ 351> mov byte ptr [r8],dl
0x7ff67d151492 <+ 354> lea r8,[r8+1]
0x7ff67d151496 <+ 358> sub r9,1
0x7ff67d15149a <+ 362> jne test!main+0x150 (00007ff6`7d151480)
And the openmp-version:
0x7ff67d151210 <+ 192> movzx eax,byte ptr [r10+rcx]
0x7ff67d151215 <+ 197> lea rcx,[rcx+1]
0x7ff67d151219 <+ 201> cqo
0x7ff67d15121b <+ 203> idiv rax,rbp <------- idiv
0x7ff67d15121e <+ 206> mov byte ptr [rcx-1],al
0x7ff67d151221 <+ 209> lea rax,[r8+rcx]
0x7ff67d151225 <+ 213> mov rdx,qword ptr [rbx]
0x7ff67d151228 <+ 216> cmp rax,rdx
0x7ff67d15122b <+ 219> jl test!divideImageDataWithParam$omp$1+0xc0 (00007ff6`7d151210)
Note 1) If you try out the compiler explorer (https://godbolt.org/) you will see that some compilers do the substitution for the openmp version too.
Note 2) As soon as the parameter is not known at compile time this optimization cannot be done anyway. So if you put your function into a library it will be slow. I'd do something like precomputing the division for all possible values and then do a lookup. This is even faster because the lookup table fits into 4-5 cache lines and L1 latency is only 3-4 cycles.
void divideImageDataWithParam(
const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param)
{
uint8_t tbl[256];
for(int i = 0; i < 256; i++) {
tbl[i] = i / param;
}
for (int_t y = 0; y < size.height; y++)
{
for (int_t x = 0; x < size.width; x++)
{
dst[y*dstStep + x] = tbl[src[y*srcStep + x]];
}
}
}
Also thanks for the interesting question, I learned a thing or two along the way! ;-)
This behavior is explained by compiler optimizations: when they are enabled, the sequential divideImageDataWithParam code is subjected to a series of optimizations (loop unrolling, vectorization, etc.) that the parallel divideImageDataWithParam_OpenMP code probably is not, since it almost certainly looks quite different to the optimizer after the compiler has outlined the parallel region.
If you compile this same code without optimizations, you will find that the runtime of the sequential version is very similar to that of the parallel version with only one thread.
The maximum speedup of the parallel version is therefore limited by dividing the workload of the original, unoptimized code. In this case the optimizations need to be written manually.

Calling C++ Method from Assembly with Parameters and Return Value

So I've asked this before but with significantly less detail. The question title accurately describes the problem: I have a method in C++ that I am trying to call from assembly (x86) that has both parameters and a return value. I have a rough understanding, at best, of assembly and a fairly solid understanding of C++ (otherwise I would not have undertaken this problem). Here's what I have as far as code goes:
// methodAddr is a pointer to the method address
void* methodAddr = method->Address;
// buffer is an int array of parameter values. The parameters can be anything (of any type)
// but are copied into an int array so they can be pushed onto the stack in reverse order
// 4 bytes at a time (as in push (int)). I know there's an issue here that is irrelevant to my baseline testing, in that if any parameter is over 4 bytes it will be broken and
// reversed (which is not good) but for basic testing this isn't an issue, so overlook this.
for (int index = bufferElementCount - 1; index >= 0; index--)
{
int val = buffer[index];
__asm
{
push val
}
}
int returnValueCount = 0;
// if there is a return value, allocate some space for it and push that onto the stack after
// the parameters have been pushed on
if (method->HasReturnValue)
{
*returnSize = method->ReturnValueSize;
outVal = new char[*returnSize];
returnValueCount = (*returnSize / 4) + (*returnSize % 4 != 0 ? 1 : 0);
memset(outVal, 0, *returnSize);
for (int index = returnValueCount - 1; index >= 0; index--)
{
char* addr = ((char*)outVal) + (index * 4);
__asm
{
push addr
}
}
}
// calculate the stack pointer offset so after the call we can pop the parameters and return value
int espOffset = (bufferElementCount + returnValueCount) * 4;
// call the method
__asm
{
call methodAddr;
add esp, espOffset
};
For my basic testing I am using a method with the following signature:
Person MyMethod3( int, char, int );
The problem is this: when I omit the return value from the method signature, all of the parameter values are passed properly. But when I leave the method as is, the parameter data that is passed is incorrect, although the value returned is correct. So my question, obviously, is: what is wrong? I've also tried pushing the return value space onto the stack before the parameters. The Person structure is as follows:
class Person
{
public:
Text Name;
int Age;
float Cash;
ICollection<Person*>* Friends;
};
Any help would be greatly appreciated. Thanks!
I'm using Visual Studio 2013 with the November 2013 CTP compiler for C++, targeting x86.
As it relates to disassembly, this is the straight method call:
int one = 876;
char two = 'X';
int three = 9738;
Person p = MyMethod3(one, two, three);
And here is the disassembly for that:
00CB0A20 mov dword ptr [one],36Ch
char two = 'X';
00CB0A27 mov byte ptr [two],58h
int three = 9738;
00CB0A2B mov dword ptr [three],260Ah
Person p = MyMethod3(one, two, three);
00CB0A32 push 10h
00CB0A34 lea ecx,[p]
00CB0A37 call Person::__autoclassinit2 (0C6AA2Ch)
00CB0A3C mov eax,dword ptr [three]
00CB0A3F push eax
00CB0A40 movzx ecx,byte ptr [two]
00CB0A44 push ecx
00CB0A45 mov edx,dword ptr [one]
00CB0A48 push edx
00CB0A49 lea eax,[p]
00CB0A4C push eax
00CB0A4D call MyMethod3 (0C6B783h)
00CB0A52 add esp,10h
00CB0A55 mov dword ptr [ebp-4],0
My interpretation of this is as follows:
Execute the assignments to the local variables. Then create the output register. Then put the parameters in a particular register (the order here happens to be eax, ecx, and edx, which makes sense (eax and ebx are for one, ecx is for two, and edx and some other register for the last parameter?)). Then call LEA (load-effective address) which I don't understand but have understood to be a MOV. Then it calls the method with an address as the parameter? And then moves the stack pointer to pop the parameters and return value.
Any further explanation is appreciated, as I'm sure my understanding here is somewhat flawed.
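One way to read the disassembly above: the three pushes place three, two and one on the stack (right to left), the final lea/push pair passes the address of p as an extra first argument, and add esp,10h then removes four 4-byte slots. A minimal sketch of that lowering in plain C++ follows, under the assumption that the compiler returns Person through such a caller-provided pointer. PersonLike, MyMethod3Lowered and callLowered are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the question's Person class.
struct PersonLike
{
    int   Age;
    float Cash;
};

// Roughly what "PersonLike MyMethod3(int, char, int)" looks like after lowering:
// the caller supplies storage for the result as an extra first argument,
// and the function hands that same address back.
inline PersonLike* MyMethod3Lowered(PersonLike* result, int one, char two, int three)
{
    result->Age  = one + three;              // arbitrary use of the arguments
    result->Cash = static_cast<float>(two);
    return result;
}

// Caller side: reserve space for p, then pass its address ahead of the
// declared parameters (it is pushed last, so it sits first on the stack).
inline PersonLike callLowered()
{
    PersonLike p;
    MyMethod3Lowered(&p, 876, 'X', 9738);
    return p;
}
```

Seen this way, a single pointer to the result storage is passed, rather than one push per 4 bytes of the return value.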

Extremely bizarre code generation in Visual C++, for nearly identical code; 3x speed difference

The code below (reduced from my larger code, after my astonishment at how its speed paled in comparison with that of std::vector) has two peculiar features:
1. It runs more than three times faster when I make a very tiny modification to the source code (always compiling it with /O2 with Visual C++ 2010).
Note: To make this a little more fun, I put a hint for the modification at the end, so you can spend some time figuring out the change yourself. The original code was ~500 lines, so it took me a heck of a lot longer to pin it down, since the fix looks pretty irrelevant to the performance.
2. It runs about 20% faster with /MTd than with /MT, even though the output loop looks the same!!!
The difference in the assembly code for the tiny-modification case is:
Loop without the modification (~300 ms):
00403383 mov esi,dword ptr [esp+10h]
00403387 mov edx,dword ptr [esp+0Ch]
0040338B mov dword ptr [edx+esi*4],eax
0040338E add dword ptr [esp+10h],ecx
00403392 add eax,ecx
00403394 cmp eax,4000000h
00403399 jl main+43h (403383h)
Loop with /MTd (looks identical! but ~270 ms):
00407D73 mov esi,dword ptr [esp+10h]
00407D77 mov edx,dword ptr [esp+0Ch]
00407D7B mov dword ptr [edx+esi*4],eax
00407D7E add dword ptr [esp+10h],ecx
00407D82 add eax,ecx
00407D84 cmp eax,4000000h
00407D89 jl main+43h (407D73h)
Loop with the modification (~100 ms!!):
00403361 mov dword ptr [esi+eax*4],eax
00403364 inc eax
00403365 cmp eax,4000000h
0040336A jl main+21h (403361h)
Now my question is, why should the changes above have the effects they do? It's completely bizarre!
Especially the first one -- it shouldn't affect anything at all (once you see the difference in the code), and yet it lowers the speed dramatically.
Is there an explanation for this?
#include <cstdio>
#include <ctime>
#include <algorithm>
#include <memory>
template<class T, class Allocator = std::allocator<T> >
struct vector : Allocator
{
T *p;
size_t n;
struct scoped
{
T *p_;
size_t n_;
Allocator &a_;
~scoped() { if (p_) { a_.deallocate(p_, n_); } }
scoped(Allocator &a, size_t n) : a_(a), n_(n), p_(a.allocate(n, 0)) { }
void swap(T *&p, size_t &n)
{
std::swap(p_, p);
std::swap(n_, n);
}
};
vector(size_t n) : n(0), p(0) { scoped(*this, n).swap(p, n); }
void push_back(T const &value) { p[n++] = value; }
};
int main()
{
int const COUNT = 1 << 26;
vector<int> vect(COUNT);
clock_t start = clock();
for (int i = 0; i < COUNT; i++) { vect.push_back(i); }
printf("time: %d\n", (clock() - start) * 1000 / CLOCKS_PER_SEC);
}
Hint:
It has to do with the allocator.
Answer:
Change Allocator &a_ to Allocator a_.
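For reference, here is a sketch of the class with that one-line change applied. Two details differ from the question's code and are assumptions of this sketch: the constructor parameter is renamed to capacity (the original shadows the member n), and allocate(n) is called without the hint argument, because the hinted overload of std::allocator::allocate is deprecated in C++17 and removed in C++20. Like the original, it never frees the buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>

template<class T, class Allocator = std::allocator<T> >
struct vector : Allocator
{
    T *p;
    std::size_t n;
    struct scoped
    {
        T *p_;
        std::size_t n_;
        Allocator a_;  // the modification: a copy instead of "Allocator &a_"
        ~scoped() { if (p_) { a_.deallocate(p_, n_); } }
        scoped(Allocator const &a, std::size_t n) : p_(0), n_(n), a_(a)
        {
            p_ = a_.allocate(n);  // allocate through the stored copy
        }
        void swap(T *&p, std::size_t &n)
        {
            std::swap(p_, p);
            std::swap(n_, n);
        }
    };
    // Hands the freshly allocated buffer to p; the member n stays 0.
    vector(std::size_t capacity) : p(0), n(0) { scoped(*this, capacity).swap(p, capacity); }
    void push_back(T const &value) { p[n++] = value; }
};
```

The benchmark loop from the question can be used unchanged with this version.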
For what it's worth, my speculation for the difference between /MT and /MTd is that the /MTd heap allocation will paint the heap memory for debugging purposes making it more likely to be paged in - that occurs before you start the clock.
If you 'pre-heat' the vector allocation, you get the same numbers for /MT and /MTd:
vector<int> vect(COUNT);
// make sure vect's memory is warmed up
for (int i = 0; i < COUNT; i++) { vect.push_back(i); }
vect.n = 0; // clear the vector
clock_t start = clock();
for (int i = 0; i < COUNT; i++) { vect.push_back(i); }
printf("time: %d\n", (clock() - start) * 1000 / CLOCKS_PER_SEC);
It's strange that Allocator& will break the alias chain while Allocator will not.
You can try
for(int i=vect.n; i<COUNT;++i){
...
}
to ensure that i and n stay synchronized.
This makes the loop much easier for VC to optimize.
Hmm... It seems that the "fastest" code
00403361 mov dword ptr [esi+eax*4],eax
00403364 inc eax
00403365 cmp eax,4000000h
0040336A jl main+21h (403361h)
is somewhat over-optimized. In this loop, vect.n is ignored entirely...
If an exception occurred in the loop, vect.n would not be updated correctly.
So the answer might be: when you use Allocator, VC figures out that vect.n will never be used again,
so it can be ignored. It's amazing, but in general it's not very useful, and it's dangerous.

What's the reason behind applying two explicit type casts in a row?

What's the reason behind applying two explicit type casts as below?
if (unlikely(val != (long)(char)val)) {
Code taken from lxml.etree.c source file from lxml's source package.
That's a cheap way to check whether there's any junk in the high bits. The char cast chops off the upper 8, 24 or 56 bits (depending on sizeof(val)) and then promotes it back. If char is signed, it will sign extend as well.
A better test might be:
if (unlikely(val & ~0xff)) {
or
if (unlikely(val & ~0x7f)) {
depending on whether this test cares about bit 7.
Just for grins and completeness, I wrote the following test code:
void RegularTest(long val)
{
if (val != ((int)(char)val)) {
printf("Regular = not equal.");
}
else {
printf("Regular = equal.");
}
}
void MaskTest(long val)
{
if (val & ~0xff) {
printf("Mask = not equal.");
}
else {
printf("Mask = equal.");
}
}
And here's what the cast code turns into in debug in Visual Studio 2010:
movsx eax, BYTE PTR _val$[ebp]
cmp DWORD PTR _val$[ebp], eax
je SHORT $LN2@RegularTes
This is the mask code:
mov eax, DWORD PTR _val$[ebp]
and eax, -256 ; ffffff00H
je SHORT $LN2@MaskTest
In release, I get this for the cast code:
movsx ecx, al
cmp eax, ecx
je SHORT $LN2@RegularTes
In release, I get this for the mask code:
test DWORD PTR _val$[ebp], -256 ; ffffff00H
je SHORT $LN2@MaskTest
So what's going on? In the cast case it's doing a byte mov with sign extension (ha! bug - the code is not the same because chars are signed) and then a compare and to be totally sneaky, the compiler/linker has also made this function use register passing for the argument. In the mask code in release, it has folded everything up into a single test instruction.
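To make the signedness point above concrete, here is a small sketch comparing the two checks. It uses signed char explicitly (the signedness of plain char is implementation-defined), the function names are made up for illustration, and it assumes the usual two's-complement behavior where the narrowing cast wraps the value.

```cpp
#include <cassert>

// Cast test: true when val survives a round trip through a signed 8-bit type,
// i.e. when val is in [-128, 127].
inline bool survivesSignedCharRoundTrip(long val)
{
    return val == static_cast<long>(static_cast<signed char>(val));
}

// Mask test: true when no bit above bit 7 is set, i.e. when val is in [0, 255].
inline bool noBitsAbove8(long val)
{
    return (val & ~0xffL) == 0;
}
```

The two predicates agree for 100 and for 300 but disagree for -1 and for 200, so the mask is only a drop-in replacement if the original intent was an unsigned range check.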
Which is faster? Beats me - and frankly unless you're running this kind of test on a VERY slow CPU or are running it several billion times, it won't matter. Not in the least.
So the answer in this case, is to write code that is clear about its intent. I would expect a C/C++ jockey to look at the mask code and understand its intent, but if you don't like that, you should opt for something like this instead:
#define BitsAbove8AreSet(x) ((x) & ~0xff)
#define BitsAbove7AreSet(x) ((x) & ~0x7f)
or:
inline bool BitsAbove8AreSet(long t) { return (t & ~0xff) != 0; } // make it a bool to be nice
inline bool BitsAbove7AreSet(long t) { return (t & ~0x7f) != 0; }
And use the predicates instead of the actual code.
In general, I think "is it cheap?" is not a particularly good question to ask about this unless you're working in some very specific problem domains. For example, I work in image processing and when I have some kind of operation going from one image to another, I often have code that looks like this:
BYTE *srcPixel = PixelOffset(src, x, y, srcrowstride, srcdepth);
int srcAdvance = PixelAdvance(srcrowstride, right, srcdepth);
BYTE *dstPixel = PixelOffset(dst, x, y, dstrowstride, dstdepth);
int dstAdvance = PixelAdvance(dstrowstride, right, dstdepth);
for (y = top; y < bottom; y++) {
for (x=left; x < right; x++) {
ProcessOnePixel(srcPixel, srcdepth, dstPixel, dstdepth);
srcPixel += srcdepth;
dstPixel += dstdepth;
}
srcPixel += srcAdvance;
dstPixel += dstAdvance;
}
And in this case, assume that ProcessOnePixel() is actually a chunk of inline code that will be executed billions and billions of times. In this case, I care a whole lot about not doing function calls, not doing redundant work, not rechecking values, ensuring that the computational flow will translate into something that will use registers wisely, etc. But my actual primary concern is that the code can be read by the next poor schmuck (probably me) who has to look at it.
And in our current coding world, it is FAR FAR CHEAPER for nearly every problem domain to spend a little time up front ensuring that your code is easy to read and maintain than it is to worry about performance out of the gate.
Speculations:
cast to char: to mask the 8 low bits,
cast to long: to bring the value back to signed (if char is unsigned).
If val is a long then the (char) will strip off all but the bottom 8 bits. The (long) casts it back for the comparison.

Unusual heap size limitations in VS2003 C++

I have a C++ app that uses large arrays of data, and have noticed while testing that it is running out of memory, while there is still plenty of memory available. I have reduced the code to a sample test case as follows;
void MemTest()
{
size_t Size = 500*1024*1024; // 500 MB
if (Size > _HEAP_MAXREQ)
TRACE("Invalid Size");
void * mem = malloc(Size);
if (mem == NULL)
TRACE("allocation failed");
}
If I create a new MFC project, include this function, and run it from InitInstance, it works fine in debug mode (memory allocated as expected), yet fails in release mode (malloc returns NULL). Single-stepping through the release build into the C runtime (my function gets inlined), I get the following:
// malloc.c
void * __cdecl _malloc_base (size_t size)
{
void *res = _nh_malloc_base(size, _newmode);
RTCCALLBACK(_RTC_Allocate_hook, (res, size, 0));
return res;
}
Calling _nh_malloc_base
void * __cdecl _nh_malloc_base (size_t size, int nhFlag)
{
void * pvReturn;
// validate size
if (size > _HEAP_MAXREQ)
return NULL;
'
'
And (size > _HEAP_MAXREQ) returns true and hence my memory doesn't get allocated. Putting a watch on size comes back with the expected 500 MB, which suggests the program is linking against a different runtime library with a much smaller _HEAP_MAXREQ. Grepping the VC++ folders for _HEAP_MAXREQ shows the expected 0xFFFFFFE0, so I can't figure out what is happening here. Does anyone know of any CRT changes or versions that would cause this problem, or am I missing something way more obvious?
Edit: As suggested by Andreas, looking at this in the assembly view shows the following:
--- f:\vs70builds\3077\vc\crtbld\crt\src\malloc.c ------------------------------
_heap_alloc:
0040B0E5 push 0Ch
0040B0E7 push 4280B0h
0040B0EC call __SEH_prolog (40CFF8h)
0040B0F1 mov esi,dword ptr [size]
0040B0F4 cmp dword ptr [___active_heap (434660h)],3
0040B0FB jne $L19917+7 (40B12Bh)
0040B0FD cmp esi,dword ptr [___sbh_threshold (43464Ch)]
0040B103 ja $L19917+7 (40B12Bh)
0040B105 push 4
0040B107 call _lock (40DE73h)
0040B10C pop ecx
0040B10D and dword ptr [ebp-4],0
0040B111 push esi
0040B112 call __sbh_alloc_block (40E736h)
0040B117 pop ecx
0040B118 mov dword ptr [pvReturn],eax
0040B11B or dword ptr [ebp-4],0FFFFFFFFh
0040B11F call $L19916 (40B157h)
$L19917:
0040B124 mov eax,dword ptr [pvReturn]
0040B127 test eax,eax
0040B129 jne $L19917+2Ah (40B14Eh)
0040B12B test esi,esi
0040B12D jne $L19917+0Ch (40B130h)
0040B12F inc esi
0040B130 cmp dword ptr [___active_heap (434660h)],1
0040B137 je $L19917+1Bh (40B13Fh)
0040B139 add esi,0Fh
0040B13C and esi,0FFFFFFF0h
0040B13F push esi
0040B140 push 0
0040B142 push dword ptr [__crtheap (43465Ch)]
0040B148 call dword ptr [__imp__HeapAlloc@12 (425144h)]
0040B14E call __SEH_epilog (40D033h)
0040B153 ret
$L19914:
0040B154 mov esi,dword ptr [ebp+8]
$L19916:
0040B157 push 4
0040B159 call _unlock (40DDBEh)
0040B15E pop ecx
$L19929:
0040B15F ret
_nh_malloc:
0040B160 cmp dword ptr [esp+4],0FFFFFFE0h
0040B165 ja _nh_malloc+29h (40B189h)
With the registers as follows;
EAX = 009C8AF0 EBX = FFFFFFFF ECX = 009C8A88 EDX = 00747365 ESI = 00430F80
EDI = 00430F80 EIP = 0040B160 ESP = 0013FDF4 EBP = 0013FFC0 EFL = 00000206
So the compare does appear to be against the correct constant, i.e. 0040B160 cmp dword ptr [esp+4],0FFFFFFE0h; also, the value at esp+4 (address 0013FDF8) is 1F400000 (my 500 MB).
Second edit: The problem was actually in HeapAlloc, as per Andreas' post. Changing to a new separate heap for large objects, using HeapCreate & HeapAlloc, did not help alleviate the problem, nor did an attempt to use VirtualAlloc with various parameters. Some further experimentation has shown that where allocating one large section of contiguous memory fails, two smaller blocks yielding the same total memory are OK, e.g. where a 300MB malloc fails, 2 x 150MB mallocs work. So it looks like I'll need a new array class that can live in a number of biggish memory fragments rather than a single contiguous block. Not a major problem, but I would have expected a bit more out of Win32 in this day and age.
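An array class of the kind mentioned above might look roughly like the following. This is only a sketch under assumptions: ChunkedArray and its members are hypothetical names, and it is written in C++11 or later (std::unique_ptr), so it would need adapting for VS2003.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Fixed-size array stored as several smaller heap blocks instead of one
// contiguous allocation. Throws std::bad_alloc if a block cannot be allocated.
template <class T>
class ChunkedArray
{
public:
    ChunkedArray(std::size_t size, std::size_t chunkSize)
        : mSize(size), mChunkSize(chunkSize)
    {
        assert(chunkSize > 0);
        for (std::size_t done = 0; done < size; done += chunkSize) {
            // The last chunk may be shorter than chunkSize.
            std::size_t count = (size - done < chunkSize) ? size - done : chunkSize;
            mChunks.push_back(std::unique_ptr<T[]>(new T[count]()));  // value-initialized
        }
    }

    std::size_t size() const { return mSize; }
    std::size_t chunkCount() const { return mChunks.size(); }

    // Maps a flat index to (chunk, offset within chunk).
    T& operator[](std::size_t i)
    {
        assert(i < mSize);
        return mChunks[i / mChunkSize][i % mChunkSize];
    }
    const T& operator[](std::size_t i) const
    {
        assert(i < mSize);
        return mChunks[i / mChunkSize][i % mChunkSize];
    }

private:
    std::size_t mSize;
    std::size_t mChunkSize;
    std::vector<std::unique_ptr<T[]> > mChunks;
};
```

Element access costs one division and one modulo per lookup; choosing a power-of-two chunk size lets the compiler turn both into a shift and a mask.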
Last edit: The following yielded 1.875GB of space, albeit non-contiguous
#define TenMB (1024*1024*10)
void SmallerAllocs()
{
size_t Total = 0;
LPVOID p[200];
for (int i = 0; i < 200; i++)
{
p[i] = malloc(TenMB);
if (p[i])
Total += TenMB;
else
break;
}
CString Msg;
Msg.Format("Allocated %0.3lfGB",Total/(1024.0*1024.0*1024.0));
AfxMessageBox(Msg,MB_OK);
}
Could it be that the debugger is playing a trick on you in release mode? Neither single-stepping nor the values of variables are reliable in release mode.
I tried your example in VS2003 in release mode, and when single-stepping it does at first look like the code lands on the return NULL line, but when I continue stepping it eventually continues into HeapAlloc. I would guess that it's this function that's failing. Looking at the disassembly, if (size > _HEAP_MAXREQ) compiles to the following:
00401078 cmp dword ptr [esp+4],0FFFFFFE0h
so I don't think it's a problem with _HEAP_MAXREQ.