Memory issues in a native C++ app on Windows

I'm investigating bad_alloc crashes in a multithreaded native C++ app. From WinDbg it's clearly happening while allocating a large object on the heap (mostly a basic_string ctor or an array allocation with the new operator). From !address -summary and the memory analysis in DebugDiag, it seems the app's memory usage is very high, but the heap size is still very small (around 70 MB).
LFH Key : 0x233116ff
Termination on corruption : ENABLED
Heap     Flags     Reserv  Commit   Virt   Free   List    UCR  Virt  Lock   Fast
                     (k)     (k)     (k)    (k)  length  blocks cont. heap
-----------------------------------------------------------------------------
05f60000 00000002   68964   56804  68964   8411   37570     13    2  29701
    External fragmentation  14 % (37570 free blocks)
072a0000 00001002      60       4     60      2       1      1    0      0
096f0000 00001002      60       4     60      2       1      1    0      0
1f430000 00001002      60       4     60      2       1      1    0      0
-----------------------------------------------------------------------------
I want to dig deeper into the memory usage shown in that table and find the cause of the high memory consumption. Any suggestions on how to proceed?


Valgrind error with new array [duplicate]

I am getting an invalid read error when the source string ends with \n; the error disappears when I remove the \n:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main (void)
{
    char *txt = strdup ("this is a not socket terminated message\n");
    printf ("%d: %s\n", strlen (txt), txt);
    free (txt);
    return 0;
}
valgrind output:
==18929== HEAP SUMMARY:
==18929== in use at exit: 0 bytes in 0 blocks
==18929== total heap usage: 2 allocs, 2 frees, 84 bytes allocated
==18929==
==18929== All heap blocks were freed -- no leaks are possible
==18929==
==18929== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==18929==
==18929== 1 errors in context 1 of 1:
==18929== Invalid read of size 4
==18929== at 0x804847E: main (in /tmp/test)
==18929== Address 0x4204050 is 40 bytes inside a block of size 41 alloc'd
==18929== at 0x402A17C: malloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==18929== by 0x8048415: main (in /tmp/test)
==18929==
==18929== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
How to fix this without sacrificing the new line character?
It's not about the newline character, nor the printf format specifier. You've found what is arguably a bug in strlen(), and I can tell you must be using gcc.
Your program code is perfectly fine. The printf format specifier could be a little better, but it won't cause the valgrind error you are seeing. Let's look at that valgrind error:
==18929== Invalid read of size 4
==18929== at 0x804847E: main (in /tmp/test)
==18929== Address 0x4204050 is 40 bytes inside a block of size 41 alloc'd
==18929== at 0x402A17C: malloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==18929== by 0x8048415: main (in /tmp/test)
"Invalid read of size 4" is the first message we must understand. It means that the processor ran an instruction which would load 4 consecutive bytes from memory. The next line indicates that the address attempted to be read was "Address 0x4204050 is 40 bytes inside a block of size 41 alloc'd."
With this information, we can figure it out. First, if you replace that '\n' with a '$', or any other character, the same error will be produced. Try it.
Secondly, we can see that your string has 40 characters in it. Adding the \0 termination character brings the total bytes used to represent the string to 41.
Because we have the message "Address 0x4204050 is 40 bytes inside a block of size 41 alloc'd," we now know everything about what is going wrong.
strdup() allocated the correct amount of memory, 41 bytes.
strlen() attempted to read 4 bytes starting at offset 40, a read that extends 3 bytes past the end of the 41-byte block.
valgrind caught the problem
This is a glibc bug. Once upon a time, a project called Tiny C Compiler (TCC) was starting to take off. Coincidentally, glibc was overhauled so that the plain string functions such as strlen() no longer existed in their naive form; they were replaced with optimized versions that read memory in various ways, such as four bytes at a time. gcc was changed at the same time to generate calls to the appropriate implementation depending on the alignment of the input pointer, the hardware being compiled for, and so on. The TCC project was abandoned when this change to the GNU environment made it so difficult to produce a new C compiler, by taking away the ability to use glibc for the standard library.
If you report the bug, the glibc maintainers probably won't fix it. The reason is that under practical use this will likely never cause an actual crash. strlen reads 4 bytes at a time because it sees that the address is 4-byte aligned, and it is always possible to read 4 bytes from a 4-byte-aligned address without faulting, given that reading 1 byte from that address would succeed. So the warning from valgrind doesn't reveal a potential crash, just a mismatch in assumptions about how to program. I consider valgrind technically correct, but I think there is zero chance the glibc maintainers will do anything to squelch the warning.
The error message seems to indicate that it's strlen that read past the malloced buffer allocated by strdup. On a 32-bit platform, an optimal strlen implementation could read 4 bytes at a time into a 32-bit register and do some bit-twiddling to see if there's a null byte in there. If near the end of the string, there are less than 4 bytes left, but 4 bytes are still read to perform the null byte check, then I could see this error getting printed. In that case, presumably the strlen implementer would know if it's "safe" to do this on the particular platform, in which case the valgrind error is a false positive.
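For illustration, here is a minimal sketch of the word-at-a-time trick such an optimized strlen might use. The function name and constants below are mine, and real glibc code is far more elaborate (alignment handling, 64-bit words, SIMD variants); this is only meant to show where the over-read comes from:

#include <stdint.h>
#include <string.h>

/* Word-at-a-time strlen sketch, assuming s is 4-byte aligned.
   Illustrative only; not the actual glibc implementation. */
size_t strlen_word_at_a_time(const char *s)
{
    const uint32_t *w = (const uint32_t *)s;
    for (;;) {
        uint32_t v = *w++;                       /* always loads 4 bytes, possibly past the '\0' */
        /* non-zero iff one of the 4 bytes in v is zero (classic bit trick) */
        if ((v - 0x01010101u) & ~v & 0x80808080u) {
            const char *p = (const char *)(w - 1);
            while (*p)
                ++p;
            return (size_t)(p - s);
        }
    }
}

The final whole-word load at the end of the string is exactly the read valgrind flags: it touches bytes beyond the terminator, even though those bytes never influence the result.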

How to allocate a large amount of memory for an array?

The machine has 2 GB or more of free memory.
I assumed the maximum number of elements of the array is limited only by the capabilities of the OS / computer; that is, with at least 2 GB of memory the array could have up to 2^32 - 1 elements.
But the compiler won't accept it. What if I really want an array with 2^32 elements? :) I tried 2^31 - 1, but it does not work either...
The OS is 64-bit with 6 GB of memory; this is a Win32 console app.
char * buffer = new char[2147483647]; //Microsoft C++ exception: std::bad_alloc at memory location 0x004FF998.
You CANNOT get a single 2 GB allocation in a 32-bit Windows application. You'd think that you could, since a /LARGEADDRESSAWARE process gets a 3 or 4 GB address space. But you can't.
The Windows OS maps some trap pages right around the 2 GB mark. I assumed they were for catching certain kinds of programming errors, but having read up on it I was wrong: they were put there to make the Alpha AXP port easier. Either way, it means your 2 GB array has nowhere it can fit.
So yeah, build your app as a 64-bit application.
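For what it's worth, a minimal sketch of the 64-bit route; the size is the 2^31 - 1 from the question, and std::vector is used instead of a raw new[] only so the memory releases itself:

#include <cstddef>
#include <cstdio>
#include <new>
#include <vector>

int main()
{
    const std::size_t n = 2147483647;      // 2^31 - 1 bytes, as in the question
    try {
        std::vector<char> buffer(n);       // routine on a 64-bit build, if the system can back it
        std::printf("allocated %zu bytes\n", buffer.size());
    } catch (const std::bad_alloc&) {
        std::printf("allocation failed\n");
    }
    return 0;
}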

std::map in class : trade off between execution speed and memory usage

My question concerns the trade off between execution speed and memory usage when designing a class that will be instantiated thousands or millions of times and used differently in different contexts.
So I have a class that contains a bunch of numerical properties (stored in int and double). A simple example would be
class MyObject
{
public:
    double property1;
    double property2;
    ...
    double property14;
    int property15;
    int property16;
    ...
    int property25;

    MyObject();
    ~MyObject();
};
This class is used by different programs that instantiate
std::vector<MyObject> SetOfMyObjects;
that may contain as many as a few million elements. The thing is that, depending on the context, some or many of the properties may remain unused (we do not need to compute them in that context), which means memory for millions of useless ints and doubles gets allocated. As I said, which properties are useful depends on the context, and I would like to avoid writing a different class for each specific context.
So I was thinking of using std::map to allocate memory only for the properties I actually use. For example
class MyObject
{
public:
    std::map<std::string, double> properties_double;
    std::map<std::string, int> properties_int;

    MyObject();
    ~MyObject();
};
such that if "property1" has to be computed, it would stored as
MyObject myobject;
myobject.properties_double["property1"] = the_value;
Obviously, I would define proper "set" and "get" methods.
I understand that accessing elements in a std::map goes as the logarithm of its size, but since the number of properties is quite small (about 25), I suppose that this should not slow down the execution of the code too much.
Am I overthinking this? Do you think using std::map is a good idea? Any suggestion from more seasoned programmers would be appreciated.
I don't think this is your best option: with only about 25 elements, you will not benefit much from a map's lookup performance. It also depends on what kind of properties you are going to have. If it is a fixed set of properties, as in your example, string lookup would be a waste of memory and CPU cycles. You could use an enum of all the properties (or just an integer) and keep the properties each element actually has in a sequential container. With such a small number of possible properties, lookup will be faster than in a map thanks to cache friendliness and integer comparisons, and memory usage will be lower too. For such a small set of properties this solution is marginally better.
Then there is the problem that an int is usually half the size of a double, and they are different types, so it is not directly possible to store both in a single container. But you could reserve enough space for a double in each element and either use a union or read/write an int from/to the address of the double when the property "index" is larger than 14.
So you can have something as simple as:
struct Property {
    int type;
    union {
        int d_int;
        double d_double;
    };
};

class MyObject {
    std::vector<Property> properties;
};
And for type 1 - 14 you read the d_double field, for type 15 - 25 the d_int field.
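As a rough sketch of how that could be used (the accessor names and the "return 0.0 when not set" policy are mine, not something from the question):

#include <vector>

struct Property {
    int type;                // 1-14: double-valued, 15-25: int-valued
    union {
        int d_int;
        double d_double;
    };
};

class MyObject {
public:
    std::vector<Property> properties;

    // A linear scan over ~25 contiguous 16-byte entries touches one or two
    // cache lines and only does integer comparisons.
    double getDouble(int type) const {
        for (const Property& p : properties)
            if (p.type == type)
                return p.d_double;
        return 0.0;          // the "not set" policy is up to you
    }

    void setDouble(int type, double value) {
        for (Property& p : properties)
            if (p.type == type) { p.d_double = value; return; }
        Property p;
        p.type = type;
        p.d_double = value;
        properties.push_back(p);
    }
};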
BENCHMARKS!!!
Out of curiosity I did some testing: I created 250k objects, each with 5 int and 5 double properties, using a vector, a map, and a hash for the properties. I measured memory usage and the time taken to set and get the properties, ran each test 3 times in a row to see the impact of caching, and computed a checksum on the getters to verify consistency. Here are the results:
vector  | iteration | memory usage MB | time msec | checksum
setting |     0     |       32        |    54     |
setting |     1     |       32        |    13     |
setting |     2     |       32        |    13     |
getting |     0     |       32        |    77     | 3750000
getting |     1     |       32        |    77     | 3750000
getting |     2     |       32        |    77     | 3750000

map     | iteration | memory usage MB | time msec | checksum
setting |     0     |      132        |   872     |
setting |     1     |      132        |   800     |
setting |     2     |      132        |   800     |
getting |     0     |      132        |   800     | 3750000
getting |     1     |      132        |   799     | 3750000
getting |     2     |      132        |   799     | 3750000

hash    | iteration | memory usage MB | time msec | checksum
setting |     0     |      155        |   797     |
setting |     1     |      155        |   702     |
setting |     2     |      155        |   702     |
getting |     0     |      155        |   705     | 3750000
getting |     1     |      155        |   705     | 3750000
getting |     2     |      155        |   706     | 3750000
As expected, the vector solution is by far the fastest and most memory-efficient. It is also the one most affected by a cold cache, but even running cold it is far faster than the map or hash implementations.
On a cold run, the vector implementation is 16.15 times faster than map and 14.75 times faster than hash. On a warm run it is even faster - 61 times faster and 54 times faster respectively.
As for memory usage, the vector solution is far more efficient as well, using over 4 times less memory than the map solution and almost 5 times less than the hash solution.
As I said, it is marginally better.
To clarify, the "cold run" is not only the first run but also the one inserting the actual values in the properties, so it is fairly illustrative of the insert operations overhead. None of the containers used preallocation so they used their default policies of expanding. As for the memory usage, it is possible it doesn't accurately reflect actual memory usage 100% accurately, since I use the entire working set for the executable, and there is usually some preallocation taking place on OS level as well, it will most likely be more conservative as the working set increases. Last but not least, the map and hash solutions are implemented using a string lookup as the OP originally intended, which is why they are so inefficient. Using integers as keys in the map and hash produces far more competitive results:
vector  | iteration | memory usage MB | time msec | checksum
setting |     0     |       32        |    55     |
setting |     1     |       32        |    13     |
setting |     2     |       32        |    13     |
getting |     0     |       32        |    77     | 3750000
getting |     1     |       32        |    77     | 3750000
getting |     2     |       32        |    77     | 3750000

map     | iteration | memory usage MB | time msec | checksum
setting |     0     |       47        |    95     |
setting |     1     |       47        |    11     |
setting |     2     |       47        |    11     |
getting |     0     |       47        |    12     | 3750000
getting |     1     |       47        |    12     | 3750000
getting |     2     |       47        |    12     | 3750000

hash    | iteration | memory usage MB | time msec | checksum
setting |     0     |       68        |    98     |
setting |     1     |       68        |    19     |
setting |     2     |       68        |    19     |
getting |     0     |       68        |    21     | 3750000
getting |     1     |       68        |    21     | 3750000
getting |     2     |       68        |    21     | 3750000
Memory usage is much lower for the hash and map, though still higher than the vector. But in terms of performance the tables are turned: the vector solution still wins at inserts, while at reading and writing the map solution takes the trophy. So there's the trade-off.
As for how much memory is saved compared to having all the properties as object members, by just a rough calculation, it would take about 80 MB of RAM to have 250k such objects in a sequential container. So you save like 50 MB for the vector solution and almost nothing for the hash solution. And it goes without saying - direct member access would be much faster.
TL;DR: it's not worth it.
From carpenters we get: measure twice, cut once. Apply it.
Your 25 ints and doubles will occupy, on an x86_64 processor:
14 double: 112 bytes (14 * 8)
11 int: 44 bytes (11 * 4)
for a total of 156 bytes.
A std::pair<std::string, double> will, on most implementations, consume:
24 bytes for the string
8 bytes for the double
and a node in the std::map<std::string, double> will add at least 3 pointers (1 parent, 2 children) and a red-black flag for another 24 bytes.
That's at least 56 bytes per property.
Even with a 0-overhead allocator, any time you store 3 elements or more in this map you use more than 156 bytes...
A compressed (type, property) pair will occupy:
8 bytes for the property (double is the worst case)
8 bytes for the type (you can choose a smaller type, but alignment kicks in)
for a total of 16 bytes per pair. Much better than map.
Stored in a vector, this will mean:
24 bytes of overhead for the vector
16 bytes per property
Even with a 0-overhead allocator, any time you store 9 elements or more in this vector you use more than 156 bytes.
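For what it's worth, the arithmetic above is easy to check against your own implementation; this little program prints the sizes being compared (the exact numbers are implementation dependent, and the 156-byte figure assumes a typical x86_64 ABI):

#include <cstdio>
#include <string>
#include <utility>

struct Packed { int type; union { int i; double d; }; };   // the compressed (type, property) pair

int main()
{
    std::printf("14 double + 11 int as plain members : %zu bytes\n", 14 * sizeof(double) + 11 * sizeof(int));
    std::printf("std::string                         : %zu bytes\n", sizeof(std::string));
    std::printf("std::pair<std::string, double>      : %zu bytes\n", sizeof(std::pair<std::string, double>));
    std::printf("compressed (type, value) pair       : %zu bytes\n", sizeof(Packed));
    return 0;
}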
You know the solution: split that object.
You're looking up properties by name, and you already know at compile time which names will be there. So let that lookup happen at compile time, as direct member access, rather than at run time.
I understand that accessing elements in a std::map goes as the logarithm of its size, but since the number of properties is quite small (about 25), I suppose that this should not slow down the execution of the code too much.
You will slow your program down by more than an order of magnitude. A map lookup may be O(log N), but it is really O(log N) * C, and C is huge compared to direct member access (thousands of times slower).
implying that the memory for millions of useless int and double is allocated
A std::string is at least 24 bytes on all the implementations I can think of, assuming you keep the names of the properties short (google 'short string optimisation' for details).
Unless around 60% of your properties are unpopulated, a map keyed by string gives you no saving at all.
With so many objects, each holding a small map, you may hit another problem: memory fragmentation. It could be preferable to hold a std::vector of std::pair<key, value> instead and do the lookup yourself (binary search should be sufficient, but depending on your situation it could be cheaper to do a linear lookup and not sort the vector at all). For the property key I would use an enum instead of a string, unless the latter is dictated by an interface (which you did not show).
Just an idea (not compiled/tested):
#include <map>      // std::map
#include <utility>  // std::pair

struct property_type
{
    enum { kind_int, kind_double } k;
    union { int i; double d; };
};

enum prop : unsigned char { height, width, };

typedef std::map< std::pair< int /*data index*/, prop /*property index*/ >, property_type > map_type;

class data_type
{
    map_type m;
public:
    double& get_double( int i, prop p )
    {
        // invariants...
        return m[ std::pair<int,prop>(i,p) ].d;
    }
};
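A possible usage of that sketch, assuming the definitions above are in scope (the element index 0 and the height property are only illustrative):

#include <cstdio>

void example(data_type& data)
{
    data.get_double(0, height) = 1.75;                  // set property 'height' of element 0
    std::printf("%f\n", data.get_double(0, height));    // read it back
}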
Millions of ints and doubles is still only hundreds of megabytes of data. On a modern computer that may not be a huge issue.
The map route looks like a waste of time, but there is an alternative that saves memory while retaining decent performance characteristics: store the rarely-used details in separate vectors, and in your main data type store an index into those vectors (or -1 for unassigned). Your description doesn't really say what the property usage looks like, but I'm going to guess you can subdivide the properties into groups that are always (or usually) set together, plus some that are needed for every node. Say you subdivide into four sets A, B, C and D, where the As are needed for every node and B, C and D are rarely set but each group is typically filled in together. Then modify the struct you're storing like so:
struct myData {
    int A1;
    double A2;
    int B_lookup = -1;
    int C_lookup = -1;
    int D_lookup = -1;
};

struct myData_B {
    int B1;
    double B2;
    //etc.
};
// and for C and D
and then store 4 vectors in your main class. When a property in the Bs is accessed, you add a new myData_B to the vector of Bs (actually a deque might be a better choice, retaining fast access but without the same memory fragmentation issues) and set the B_lookup value in the original myData to the index of the new myData_B. And the same for the Cs and Ds.
Whether this is worth doing depends on how few of the properties you actually access and how they are accessed together, but you should be able to adapt the idea to your needs.
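Here is a hedged sketch of how the owning class could look under that scheme; the class and member names are illustrative, not taken from the question:

#include <cstddef>
#include <deque>
#include <vector>

struct myData_B { int B1; double B2; };

struct myData {
    int A1 = 0;
    double A2 = 0.0;
    int B_lookup = -1;                   // index into the B deque; -1 means "no B block yet"
};

class DataSet {
    std::vector<myData> nodes;
    std::deque<myData_B> b_parts;        // a deque keeps existing elements stable in memory
public:
    std::size_t addNode() { nodes.push_back(myData{}); return nodes.size() - 1; }

    myData_B& getB(std::size_t node) {
        myData& d = nodes[node];
        if (d.B_lookup < 0) {            // lazily allocate the B block on first access
            d.B_lookup = static_cast<int>(b_parts.size());
            b_parts.push_back(myData_B{});
        }
        return b_parts[static_cast<std::size_t>(d.B_lookup)];
    }
};

Nodes that never touch a B property cost only the 4-byte index, while nodes that do pay for the full myData_B exactly once.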

Does toluapp create a memory leak when tolua_pushusertype_and_takeownership is used?

This question might be for Lua and tolua experts.
I'm using tolua++ 1.0.93 and Lua 5.1.4 (CEGUI 0.84 dependencies).
I have been tracking a nasty memory leak for a couple of hours, and I've found that tolua++ creates a tolua_gc table in the Lua registry, and this table seems to grow without bound.
When I push my object to Lua using tolua_pushusertype_and_takeownership, I want Lua's GC to delete my object. And it does, but tolua_pushusertype_and_takeownership calls tolua_register_gc, which stores the object's metatable in this "global" tolua_gc table with the object as the key.
When the tolua_gc_event function calls the collector function (which invokes the delete operator), it then sets a nil value in the tolua_gc table under the just-deleted object as the key. So that should work, right?
Well, no.
Maybe I have misunderstood something, but this seems to have no effect on the size of the tolua_gc table.
I have also tried calling tolua.releaseownership(object) manually from Lua, and it worked: it decreased the memory used by Lua (LUA_GCCOUNT). But since it disconnects the collector from the object, operator delete is never called, which creates a memory leak in C++.
This is really strange behavior, because all tolua.releaseownership does is set a nil value in the tolua_gc table under the passed object as the key.
So why does tolua.releaseownership decrease the memory used by Lua while tolua_gc_event does not?
The only difference is that tolua.releaseownership calls the garbage collector before it sets nil in the tolua_gc table, whereas tolua_gc_event is called by the garbage collector (the opposite situation).
Why do we need that global tolua_gc table at all? Can't we just take the metatable from the object directly at the moment of collection?
The memory this process can use is really limited (8 MB), and after some time the tolua_gc table seems to occupy about 90% of it.
How can I fix this?
Thank you.
EDIT:
These are code samples:
extern unsigned int TestSSCount;

class TestSS
{
public:
    double d_double;

    TestSS()
    {
        // TestSSCount++;
        // fprintf(stderr, "c(%d)\n", TestSSCount);
    }

    TestSS(const TestSS& other)
    {
        d_double = other.d_double * 0.5;
        // TestSSCount++;
        // fprintf(stderr, "cc(%d)\n", TestSSCount);
    }

    ~TestSS()
    {
        // TestSSCount--;
        // fprintf(stderr, "d(%d)\n", TestSSCount);
    }
};

class App
{
    ...
    TestSS doNothing()
    {
        TestSS t;
        t.d_double = 13.89;
        return t;
    }

    void callGC()
    {
        int kbs_before = lua_gc(d_state, LUA_GCCOUNT, 0);
        lua_gc(d_state, LUA_GCCOLLECT, 0);
        int kbs_after = lua_gc(d_state, LUA_GCCOUNT, 0);
        printf("GC changed memory usage from %d kB to %d kB, difference %d kB",
               kbs_before, kbs_after, kbs_before - kbs_after);
    }
    ...
};
This is the .pkg file:
class TestSS
{
public:
    double d_double;
};

class App
{
    TestSS doNothing();
    void callGC();
};
Now the complete Lua code (app and rootWindow are C++ objects provided to Lua as constants):
function handleCharacterKey(e_)
    local key = CEGUI.toKeyEventArgs(e_).scancode
    if key == CEGUI.Key.One then
        for i = 1,10000,1 do
            -- this makes GC clear all memory from the Lua heap but does not call the destructor in C++
            -- tolua.releaseownership(app:doNothing())
            -- this makes GC call destructors in C++ but somehow makes the Lua heap grow constantly
            app:doNothing()
        end
    elseif key == CEGUI.Key.Zero then
        app:callGC()
    end
end
rootWindow:subscribeEvent("KeyUp", "handleCharacterKey")
This is the output I get when pressing 0 1 0 1 0 1 0:
This is when I use tolua.releaseownership:
GC changed memory usage from 294 kB to 228 kB, difference 66 kB
GC changed memory usage from 228 kB to 228 kB, difference 0 kB
GC changed memory usage from 228 kB to 228 kB, difference 0 kB
GC changed memory usage from 228 kB to 228 kB, difference 0 kB
This is without tolua.releaseownership:
GC changed memory usage from 294 kB to 228 kB, difference 66 kB
GC changed memory usage from 605 kB to 604 kB, difference 1 kB
GC changed memory usage from 982 kB to 861 kB, difference 121 kB
GC changed memory usage from 1142 kB to 1141 kB, difference 1 kB
And this is without releaseownership, but the sequence I press on the keyboard is 0 1 0 1 0 1 0 0 0 0 (three extra calls to GC at the end):
GC changed memory usage from 294 kB to 228 kB, difference 66 kB
GC changed memory usage from 603 kB to 602 kB, difference 1 kB
GC changed memory usage from 982 kB to 871 kB, difference 111 kB
GC changed memory usage from 1142 kB to 1141 kB, difference 1 kB
GC changed memory usage from 1141 kB to 868 kB, difference 273 kB <- this is after first additional GC call
GC changed memory usage from 868 kB to 868 kB, difference 0 kB
GC changed memory usage from 868 kB to 868 kB, difference 0 kB
The problem is not a bug or a memory leak, although you could call it a memory leak if your memory really is limited. The problem is that tolua_gc, being an ordinary Lua table, does not rehash when you remove elements by setting them to nil.
Although I suspected this could be the problem, I didn't actually check whether it was true at first. The garbage collector cannot force a table to rehash and shrink, so the table keeps growing until some insert happens to trigger a rehash.
Read this: http://www.lua.org/gems/sample.pdf
So in the end I removed the tolua_gc table and stored the metatables (which tolua used to keep in the tolua_gc table with the lightuserdata as the key) in a special field of the userdata object itself.
Instead of looking those metatables up in the tolua_gc table, I now get them from the object itself. Everything else is the same, and it seems to work.

How to calculate the achieved bandwidth of a CUDA kernel

I want a measure of how much of the peak memory bandwidth my kernel achieves.
Say I have an NVIDIA Tesla C1060, which has a maximum bandwidth of 102.4 GB/s. In my kernel I have the following accesses to global memory:
...
float result;
for (int k = 0; k < 4000; k++) {
    result = (in_data[index] - loc_mem[k]) * (in_data[index] - loc_mem[k]);
    ....
}
out_data[index] = result;
out_data2[index] = sqrt(result);
...
I count 4000*2+2 accesses to global memory per thread. With 1,000,000 threads, all accesses being floats, that is ~32 GB of global memory traffic (reads and writes added together). Since my kernel only takes 0.1 s, I would be achieving ~320 GB/s, which is higher than the maximum bandwidth, so there must be an error in my calculations / assumptions. I assume CUDA does some caching, so not all memory accesses count. Now my questions:
What is my error?
What accesses to global memory are cached and which are not?
Is it correct that I don't count access to registers, local, shared and constant memory?
Can I use the CUDA profiler for easier and more accurate results? Which counters would I need to use? How would I need to interpret them?
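For reference, the back-of-the-envelope estimate above, spelled out as code (all figures are the question's own):

#include <cstdio>

int main()
{
    const double accesses_per_thread = 4000.0 * 2 + 2;  // reads in the loop plus the 2 writes
    const double threads             = 1000000.0;
    const double bytes_per_access    = 4.0;             // float
    const double kernel_time_s       = 0.1;

    const double total_bytes = accesses_per_thread * threads * bytes_per_access;
    std::printf("total traffic : %.1f GB\n", total_bytes / 1e9);                    // ~32 GB
    std::printf("apparent rate : %.1f GB/s\n", total_bytes / kernel_time_s / 1e9);  // ~320 GB/s
    return 0;
}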
Profiler output:
method        gputime    cputime  occupancy  instruction  warp_serial  memtransfer
memcpyHtoD       10.944       17                                             16384
fill             64.32        93         1        14556             0
fill             64.224       83         1        14556             0
memcpyHtoD       10.656       11                                             16384
fill             64.064       82         1        14556             0
memcpyHtoD     1172.96      1309                                           4194304
memcpyHtoD       10.688       12                                             16384
cu_more_regT  93223.906    93241         1     40716656             0
memcpyDtoH     1276.672     1974                                           4194304
memcpyDtoH     1291.072     2019                                           4194304
memcpyDtoH     1278.72      2003                                           4194304
memcpyDtoH     1840         3172                                           4194304
New question:
- When 4194304 bytes = 4 bytes * 1024 * 1024 data points = 4 MB and gpu_time ~= 0.1 s, I achieve a bandwidth of 10 * 40 MB/s = 400 MB/s. That seems very low. Where is the error?
p.s. Tell me if you need other counters for your answer.
sister question: How to calculate Gflops of a kernel
You do not really have 1,000,000 threads running at once. You do ~32 GB of global memory accesses, but the bandwidth is determined by the threads currently running (reading) in the SMs and by the size of the data being read.
All global memory accesses are cached in L1 and L2, unless you tell the compiler to use uncached accesses.
I think so. Achieved bandwidth is related to global memory.
I recommend using the Visual Profiler to look at the read/write global memory bandwidth. It would be interesting if you posted your results :).
The default counters in the Visual Profiler give you enough information to get an idea about your kernel (memory bandwidth, shared memory bank conflicts, instructions executed...).
Regarding your question, to calculate the achieved global memory throughput:
Compute Visual Profiler. DU-05162-001_v02 | October 2010. User Guide. Page 56, Table 7. Supported Derived Statistics.
Global memory read throughput in gigabytes per second.
For compute capability < 2.0 this is calculated as (((gld_32*32) + (gld_64*64) + (gld_128*128)) * TPC) / gputime.
For compute capability >= 2.0 this is calculated as ((DRAM reads) * 32) / gputime.
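As a rough illustration of the first formula, here is one way to turn those counters into a GB/s figure. The counter values are made up, and the assumptions that gputime is in microseconds and that the TPC factor simply scales the sampled counters are mine; treat this as a sketch, not the profiler's exact definition:

#include <cstdio>

int main()
{
    const double gld_32     = 0;           // 32-byte global load transactions (illustrative)
    const double gld_64     = 0;           // 64-byte global load transactions (illustrative)
    const double gld_128    = 1000000;     // 128-byte global load transactions (illustrative)
    const double TPC        = 10;          // scaling factor from the formula
    const double gputime_us = 93223.906;   // kernel time from the trace above, assumed microseconds

    const double bytes_read = (gld_32 * 32 + gld_64 * 64 + gld_128 * 128) * TPC;
    const double seconds    = gputime_us * 1e-6;
    std::printf("global read throughput: %.2f GB/s\n", bytes_read / seconds / 1e9);
    return 0;
}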
Hope this helps.