Profiling memory usage in Mathematica - profiling

Is there any way to profile the mathkernel memory usage (down to individual variables) other than paying $$$ for their Eclipse plugin (mathematica workbench, iirc)?
Right now I finish execution of a program that takes multiple GB's of ram, but the only things that are stored should be ~50MB of data at most, yet mathkernel.exe tends to hold onto ~1.5GB (basically, as much as Windows will give it). Is there any better way to get around this, other than saving the data I need and quitting the kernel every time?
EDIT: I've just learned of the ByteCount function (which shows some disturbing results on basic datatypes, but that's beside the point), but even the sum over all my variables is nowhere near the amount taken by mathkernel. What gives?

One thing a lot of users don't realize is that it takes memory to store all your inputs and outputs in the In and Out symbols, regardless of whether or not you assign an output to a variable. Out is also aliased as %, where % is the previous output, %% is the second-to-last, etc. %123 is equivalent to Out[123].
If you don't have a habit of using %, or only use it to a few levels deep, set $HistoryLength to 0 or a small positive integer, to keep only the last few (or no) outputs around in Out.
You might also want to look at the functions MaxMemoryUsed and MemoryInUse.
Of course, the $HistoryLength issue may or may not be your problem, but you haven't shared what your actual evaluation is.
If you're able to post it, perhaps someone will be able to shed more light on why it's so memory-intensive.

Here is my solution for profiling of memory usage:
myByteCount[symbolName_String] :=
  Replace[ToHeldExpression[symbolName],
   Hold[x__] :>
    If[MemberQ[Attributes[x], Protected | ReadProtected],
     Sequence @@ {}, {ByteCount[
       Through[{OwnValues, DownValues, UpValues, SubValues,
          DefaultValues, FormatValues, NValues}[Unevaluated@x,
         Sort -> False]]], symbolName}]];
With[{listing = myByteCount /@ Names[]},
 Labeled[Grid[Reverse@Take[Sort[listing], -100], Frame -> True,
   Alignment -> Left],
  Column[{Style[
     "ByteCount for symbols without attributes Protected and \
ReadProtected in all contexts", 16, FontFamily -> "Times"],
    Style[Row@{"Total: ", Total[listing[[All, 1]]], " bytes for ",
       Length[listing], " symbols"}, Bold]}, Center, 1.5], Top]]
Evaluating the above gives the following table:

Michael Pilat's answer is a good one, and MemoryInUse and MaxMemoryUsed are probably the best tools you have. ByteCount is rarely all that helpful: it can hugely overestimate because it ignores shared subexpressions, and it often misses memory that isn't directly accessible through Mathematica functions, which is often a major component of memory usage.
One thing you can do in some circumstances is use the Share function, which forces subexpressions to be shared when possible. In some circumstances, this can save you tens or even hundreds of megabytes. You can tell how well it's working by using MemoryInUse before and after you use Share.
Also, some innocuous-seeming things can cause Mathematica to use a whole lot more memory than you expect. Contiguous arrays of machine reals (and only machine reals) can be allocated as so-called "packed" arrays, much the way they would be allocated by C or Fortran. However, if you have a mix of machine reals and other structures (including symbols) in an array, everything has to be "boxed", and the array becomes an array of pointers, which can add a lot of overhead.

One way is to automate restarting the kernel when it runs out of memory. You can execute your memory-consuming code in a slave kernel, while the master kernel only receives the result of the computation and monitors memory usage.


What are common values for uninitialized memory for debugging?

A long time ago I learned about filling unused / uninitialized memory with 0xDEADBEEF so that in a debugger or a crash report if I ever see that value I know I'm looking at uninitialized memory. I saw from a crash report iOS uses 0xBBADBEEF.
What other creative values have people used? Do any particular values have any kind of specific benefit?
The most obvious benefit of values that turn into words is that, at least for most people, words in their own language stick out easily, whereas a strictly numeric value is less likely to stick out.
But maybe there are other reasons to pick particular numbers? For example, an odd value might crash some processors (the 68000, for example) on certain memory accesses, so it's probably better to pick 0x0BADBEEF over 0xBADBEEF0. Are there any other values (maybe processor-specific) that have a concrete benefit when used for uninitialized memory?
Generally speaking, you want a value which is unlikely to happen to "work" when interpreted as either an integer, a pointer, or a string. So, here are a few constraints:
Don't use a value that's a multiple of the smallest "usual" alignment on your target architecture. For x86, that's 4 (bytes), so no values that are divisible by 4. This ensures that if the value is interpreted as a pointer, it'll be obviously-incorrect. If you're on a non-x86 architecture, you might even be able to use a value that will cause an alignment trap if used as a pointer.
Don't use a value which could reasonably be a small (positive or negative) integer. Your typical "int" variable in a C program never gets larger than 1,000 or so, so don't use small numbers as your empty data fill.
Don't use a value which is composed entirely of valid ASCII characters. Make sure there's at least one byte in there with the high bit set. These days, you'd want to make sure they weren't valid UTF-8 or possibly UTF-16 values, either.
Don't have any zero bytes in the value. There are too many cases where this would work out to be "helpful" to keeping the program from crashing - terminating a string, giving a non-int field a reasonable-looking value, etc.
Don't use a single (or two) byte values, repeated over and over. Having a full-word length pattern can make it easier to determine how your wild pointer ended up pointing where it is, at least narrowing down which operations offset it from the start of the pattern.
Don't use a value that maps to a valid address for a "typical" process. If the highest bits are set, it'll typically take a whole lot of malloc() before your process will grow large enough to make that a valid address.
Perhaps unsurprisingly, patterns like 0xDEADBEEF meet basically all of these requirements.
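The constraints above can be turned into a small compile-time checker. This is only a sketch: PoisonReport and check_poison are hypothetical names, it assumes a 32-bit fill word, and the "small integer" cut-off of 65536 is an arbitrary choice made for illustration.

```cpp
#include <cstdint>

// Hypothetical helper (not from any library): scores a 32-bit fill pattern
// against the rules of thumb listed above.
struct PoisonReport {
    bool not_pointer_aligned;  // not a multiple of 4, so an obviously bad pointer
    bool not_small_integer;    // far from zero in both directions (assumed cut-off: 65536)
    bool has_high_bit_byte;    // at least one byte is not 7-bit ASCII
    bool no_zero_byte;         // cannot act as a string terminator
    bool not_short_repeat;     // not a 1- or 2-byte pattern repeated
    bool top_bit_set;          // highest address bit set

    constexpr bool ok() const {
        return not_pointer_aligned && not_small_integer && has_high_bit_byte &&
               no_zero_byte && not_short_repeat && top_bit_set;
    }
};

constexpr PoisonReport check_poison(std::uint32_t v) {
    return PoisonReport{
        (v % 4u) != 0u,
        v > 0xFFFFu && v < 0xFFFF0000u,
        (v & 0x80808080u) != 0u,
        (v & 0xFFu) != 0u && (v & 0xFF00u) != 0u &&
            (v & 0xFF0000u) != 0u && (v & 0xFF000000u) != 0u,
        (v >> 16) != (v & 0xFFFFu),
        (v & 0x80000000u) != 0u
    };
}

static_assert(check_poison(0xDEADBEEFu).ok(), "0xDEADBEEF passes every check");
static_assert(!check_poison(0xBADBEEF0u).not_pointer_aligned, "0xBADBEEF0 is a multiple of 4");
static_assert(!check_poison(0x0BADBEEFu).top_bit_set, "0x0BADBEEF has a clear top bit");
```

Because everything is constexpr, a candidate value can be vetted with a static_assert next to the place where the fill pattern is defined.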
One technical term for values like this is "poison value".
Hex numbers that form English words are called Hexspeak. Wikipedia's Hexspeak article pretty much answers this question, cataloguing many known constants in use for various things, including several that are used as poison values / canaries / sanity checks, as well as other uses like error codes or IPv6 addresses.
I seem to recall some variation of 0xBADF00D. (maybe with a repeated letter like your 2nd example).
There's also 0xDEADC0DE. (Googling for where I've seen this used found the wikipedia article linked above).
Other English words in hex I've seen: Java .class files use 0xCAFEBABE as the magic number (first 4 bytes of the file). As a play on this, I guess, the Jikes JVM uses 0xDEADBABE as a sanity check constant.
Apparently Java wasn't the first user of 0xCAFEBABE. Wikipedia says "It was originally created by NeXTSTEP developers as a reference to the baristas at Peet's Coffee & Tea", and was used by the people developing Java before they thought of the name "Java". So it didn't come out of Java -> coffee (if anything the other way around), it's just plain old non-feminist tech culture. :(
re: update: Choosing a good value. For a poison value (not an error code), you want all the bytes to be different and not 0x00 or 0xFF, since those are probably the most likely values for an errant single-byte store. This applies especially for things like stack canaries (to detect buffer overruns), or other cases where detecting that it didn't get overwritten is important.
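A rough way to see why 0x00 and 0xFF bytes weaken a canary: simulate a stray single-byte store and check whether the word changes. The helper names below (store_byte, detects_stray_store, bytes_distinct) are made up for this sketch, and a 32-bit canary is assumed.

```cpp
#include <cstdint>

// Word that results from overwriting byte `index` (0 = lowest) of `word` with `byte`.
constexpr std::uint32_t store_byte(std::uint32_t word, unsigned index, std::uint32_t byte) {
    return (word & ~(std::uint32_t{0xFF} << (8u * index))) |
           ((byte & 0xFFu) << (8u * index));
}

// True if an errant 0x00 or 0xFF written to any single byte alters the canary,
// i.e. the overwrite would be noticed when the canary is compared later.
constexpr bool detects_stray_store(std::uint32_t canary) {
    for (unsigned i = 0; i < 4; ++i) {
        if (store_byte(canary, i, 0x00u) == canary) return false;
        if (store_byte(canary, i, 0xFFu) == canary) return false;
    }
    return true;
}

// True if all four bytes of the word differ from each other.
constexpr bool bytes_distinct(std::uint32_t v) {
    for (unsigned i = 0; i < 4; ++i)
        for (unsigned j = i + 1; j < 4; ++j)
            if (((v >> (8u * i)) & 0xFFu) == ((v >> (8u * j)) & 0xFFu)) return false;
    return true;
}

static_assert(detects_stray_store(0xDEADBEEFu) && bytes_distinct(0xDEADBEEFu),
              "no byte of 0xDEADBEEF is 0x00 or 0xFF, and all four differ");
static_assert(!detects_stray_store(0xDEADBE00u),
              "a stray 0x00 over the low byte would go unnoticed");
```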
Your speculation about picking an odd value makes a lot of sense. Not being a valid memory address in the virtual memory layout of typical processes is a big advantage. Failing noisily as early as possible is optimal for debugging. Anyway, this probably means that having the high bit set is a good idea, so 0x0... is probably not a good idea.

When should I use CUDA's built-in warpSize, as opposed to my own proper constant?

nvcc device code has access to a built-in value, warpSize, which is set to the warp size of the device executing the kernel (i.e. 32 for the foreseeable future). Usually you can't tell it apart from a constant - but if you try to declare an array of length warpSize you get a complaint about it being non-const... (with CUDA 7.5)
So, at least for that purpose you are motivated to have something like (edit):
enum : unsigned int { warp_size = 32 };
somewhere in your headers. But now - which should I prefer, and when? : warpSize, or warp_size?
Edit: warpSize is apparently a compile-time constant in PTX. Still, the question stands.
Let's get a couple of points straight. The warp size isn't a compile time constant and shouldn't be treated as one. It is an architecture specific runtime immediate constant (and its value just happens to be 32 for all architectures to date). Once upon a time, the old Open64 compiler did emit a constant into PTX, however that changed at least 6 years ago if my memory doesn't fail me.
The value is available:
In CUDA C via warpSize, where it is not a compile time constant (the PTX WARP_SZ variable is emitted by the compiler in such cases).
In PTX assembler via WARP_SZ, where it is a runtime immediate constant
From the runtime API as a device property
Don't declare your own constant for the warp size, that is just asking for trouble. The normal use case for an in-kernel array dimensioned to be some multiple of the warp size would be to use dynamically allocated shared memory. You can read the warp size from the host API at runtime to get it. If you have a statically declared in-kernel array that you need to dimension from the warp size, use templates and select the correct instance at runtime. The latter might seem like unnecessary theatre, but it is the right thing to do for a use case that almost never arises in practice. The choice is yours.
Contrary to talonmies's answer I find the warp_size constant perfectly acceptable. The only reason to use warpSize is to make the code forward-compatible with possible future hardware that may have warps of a different size. However, when such hardware arrives, the kernel code will most likely require other alterations as well in order to remain efficient. CUDA is not a hardware-agnostic language - on the contrary, it is still quite a low-level programming language. Production code uses various intrinsic functions that come and go over time (e.g. __umul24).
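The "use templates and select the correct instance at runtime" idea can be sketched in plain host-side C++. This is not CUDA code: scratch_elements stands in for a kernel template that needs a compile-time array bound, scratch_elements_for is a hypothetical dispatcher, and the 64 case is purely illustrative. In a real program the run-time value would come from the device properties.

```cpp
#include <cstddef>
#include <stdexcept>

// Stand-in for a kernel template: the warp size is a template parameter, so it
// can dimension a statically sized array.
template <int WarpSize>
constexpr std::size_t scratch_elements() {
    static_assert(WarpSize > 0, "warp size must be positive");
    int scratch[2 * WarpSize] = {};  // array bound is a constant expression
    return sizeof(scratch) / sizeof(scratch[0]);
}

// Picks the instantiation that matches a warp size known only at run time.
// It is constexpr only so that it can also be checked at compile time.
constexpr std::size_t scratch_elements_for(int runtime_warp_size) {
    switch (runtime_warp_size) {
        case 32: return scratch_elements<32>();
        case 64: return scratch_elements<64>();  // hypothetical future size
        default: throw std::invalid_argument("unsupported warp size");
    }
}

static_assert(scratch_elements_for(32) == 64, "32-lane instance uses 64 elements");
```

Every supported size needs its own case, which is the boilerplate cost of this approach.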
The day we get a different warp size (e.g. 64) many things will change:
The warpSize will have to be adjusted obviously
Many warp-level intrinsics will need their signatures adjusted, or a new version produced, e.g. int __ballot; while int does not need to be 32-bit, it most commonly is!
Iterative operations, such as warp-level reductions, will need their number of iterations adjusted. I have never seen anyone writing:
for (int i = 0; i < log2(warpSize); ++i) ...
that would be overly complex in something that is usually a time-critical piece of code.
warpIdx and laneIdx computation out of threadIdx would need to be adjusted. Currently, the most typical code I see for it is:
warpIdx = threadIdx.x/32;
laneIdx = threadIdx.x%32;
which reduces to simple right-shift and mask operations. However, if you replace 32 with warpSize this suddenly becomes a quite expensive operation!
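To illustrate the last two points, here is a host-side C++ sketch (not device code) with the warp size fixed at compile time. The names kWarpSize, log2_pow2, warp_index, lane_index and simulated_warp_sum are made up for the example. With a compile-time power of two, the index split is a shift and a mask, and a tree reduction takes log2(32) = 5 halving steps.

```cpp
constexpr unsigned kWarpSize = 32;  // assumption: known at compile time

// Number of halving steps for a power-of-two size (5 for 32).
constexpr unsigned log2_pow2(unsigned n) {
    unsigned shift = 0;
    while (n > 1) { n >>= 1; ++shift; }
    return shift;
}
constexpr unsigned kWarpShift = log2_pow2(kWarpSize);

// Same results as tid / 32 and tid % 32 for unsigned tid.
constexpr unsigned warp_index(unsigned tid) { return tid >> kWarpShift; }
constexpr unsigned lane_index(unsigned tid) { return tid & (kWarpSize - 1u); }

// Simulates a warp-level tree reduction over lane values 1..32 on the host.
constexpr int simulated_warp_sum() {
    int lane[kWarpSize] = {};
    for (unsigned i = 0; i < kWarpSize; ++i) lane[i] = static_cast<int>(i) + 1;
    for (unsigned offset = kWarpSize / 2; offset > 0; offset /= 2)  // 16, 8, 4, 2, 1
        for (unsigned i = 0; i < offset; ++i) lane[i] += lane[i + offset];
    return lane[0];
}

static_assert(kWarpShift == 5, "32 lanes need 5 reduction steps");
static_assert(warp_index(70) == 70 / 32 && lane_index(70) == 70 % 32,
              "shift and mask match divide and modulo");
static_assert(simulated_warp_sum() == 528, "1 + 2 + ... + 32");
```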
At the same time, using warpSize in the code prevents optimization, since formally it is not a compile-time known constant.
Also, if the amount of shared memory depends on the warpSize this forces you to use the dynamically allocated shmem (as per talonmies's answer). However, the syntax for that is inconvenient to use, especially when you have several arrays -- this forces you to do pointer arithmetic yourself and manually compute the sum of all memory usage.
Using templates for that warp_size is a partial solution, but adds a layer of syntactic complexity needed at every function call:
deviceFunction<warp_size>(params)
This obfuscates the code. The more boilerplate, the harder the code is to read and maintain.
My suggestion would be to have a single header that controls all the model-specific constants, e.g.
#if __CUDA_ARCH__ <= 600
//all devices of compute capability <= 6.0
static const int warp_size = 32;
#endif
Now the rest of your CUDA code can use it without any syntactic overhead. The day you decide to add support for newer architecture, you just need to alter this one piece of code.

Strange C++ Memory Allocation

I created a simple class, Storer, in C++, playing with memory allocation. It contains six field variables, all of which are assigned in the constructor:
int x;
int y;
int z;
char c;
long l;
double d;
I was interested in how these variables were being stored, so I wrote the following code:
Storer *s=new Storer(5,4,3,'a',5280,1.5465);
cout<<(long)s<<endl<<endl;
cout<<(long)&(s->x)<<endl;
cout<<(long)&(s->y)<<endl;
cout<<(long)&(s->z)<<endl;
cout<<(long)&(s->c)<<endl;
cout<<(long)&(s->l)<<endl;
cout<<(long)&(s->d)<<endl;
I was very interested in the output:
33386512
33386512
33386516
33386520
33386524
33386528
33386536
Why is the char c taking up four bytes? sizeof(char) returns, of course, 1, so why is the program allocating more memory than it needs? The following code confirms that more memory is being allocated than the members require:
cout<<sizeof(s->c)<<endl;
cout<<sizeof(Storer)<<endl;
cout<<sizeof(int)+sizeof(int)+sizeof(int)+sizeof(char)+sizeof(long)+sizeof(double)<<endl;
which prints:
1
32
29
confirming that, indeed, 3 bytes are being allocated needlessly. Can anyone explain to me why this is happening? Thanks.
Data alignment and compiler padding say hi!
The CPU has no notion of type; what it gets in its 32-bit (or 64-bit, or 128-bit (SSE), or 256-bit (AVX) - let's keep it simple at 32) registers needs to be properly aligned in order to be processed correctly and efficiently. Imagine a simple scenario, where you have a char, followed by an int. In a 32-bit architecture, that's 1 byte for a char and 4 bytes for an integer.
A 32-bit register would have to break on its boundary, only taking in 3 bytes of the integer and leaving the 4th byte for "a second run". It cannot process the data properly that way, so the compiler will add padding in order to make sure all the stuff is processed efficiently. And that means adding a certain amount of padding depending on the type in question.
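The char-followed-by-int scenario can be checked directly. In this sketch CharThenInt is a made-up struct; the exact padding is implementation-defined, with 3 bytes being what common ABIs with a 4-byte, 4-byte-aligned int produce.

```cpp
#include <cstddef>

struct CharThenInt {
    char c;  // 1 byte at offset 0
    int i;   // placed at the next offset that is a multiple of alignof(int)
};

// Bytes the compiler added beyond the members themselves.
constexpr std::size_t kPaddingBytes =
    sizeof(CharThenInt) - (sizeof(char) + sizeof(int));

static_assert(offsetof(CharThenInt, i) % alignof(int) == 0,
              "the int member starts on an int boundary");
static_assert(sizeof(CharThenInt) >= sizeof(char) + sizeof(int),
              "padding can only make the struct larger");
```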
Why is misalignment a problem?
The computer is not human; it can't just pick the bytes out with a pair of eyes and a brain. It has to be very deterministic and cautious about how it goes about doing things. First it loads one block which contains n bytes of the given information and shifts it around to prune out unrelated information, then it loads another block and again shifts out the unnecessary bytes which have nothing to do with the operation at hand, and only then can it do the necessary operations. And usually you have two operands; that's just one of them complete. Only when all that work is done can you actually process the data. That is way too much performance overhead when you can simply align the data properly (and most of the time, compilers do it for you, if you're not doing anything fancy).
Could you visualize it?
Visually - the first green byte is the mentioned char, and the three green bytes plus the first red one of the second block is the 4-byte int, colorcoded on a 4-byte access boundary (we're talking about a 32-bit register). The "instead part" at the bottom shows an ideal setup where the int hits the register properly (the char getting padded into obedience somewhere off image):
Read more on data alignment, which comes in quite handy when you're dealing with fancy extensions of the instruction set like SSE (128-bit registers) or AVX (256-bit registers), where special care must be taken so that the optimizations of vectorization are not defeated (aligning on a 16-byte boundary for SSE: 16*8 = 128 bits).
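In standard C++11 and later, a 16-byte boundary like the one mentioned for SSE can be requested with alignas. A minimal sketch, where Vec4f and is_aligned_16 are illustrative names:

```cpp
#include <cstddef>
#include <cstdint>

// Four floats that always start on a 16-byte boundary.
struct alignas(16) Vec4f {
    float v[4];
};

static_assert(alignof(Vec4f) == 16, "type carries the requested alignment");
static_assert(sizeof(Vec4f) % 16 == 0, "array elements stay 16-byte aligned");

// Run-time check for pointers whose origin is not known statically.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

Objects of type Vec4f with automatic or static storage get this alignment from the compiler; memory from older allocation paths may still need the run-time check.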
Additional remarks on user defined alignment
phonetagger made a valid point in the comments that there are pragma directives that force the compiler to align the data in the way the programmer specifies. But such directives, like #pragma pack(...), are a statement to the compiler that you know what you're doing and what's best for you. Be sure that you do, because if you fail to accommodate your environment, you might experience various penalties, the most obvious being clashes with external libraries you didn't write yourself that pack data differently.
Things simply explode when they clash. It is best to be cautious in such cases and to be really intimate with the issue at hand. If you're not sure, leave it to the defaults. If you are not sure but have to use something like SSE, where alignment is king (and neither default nor simple by a long shot), consult various resources online or ask another question here.
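For reference, this is roughly what #pragma pack does to a layout. The push/pop form used here is accepted by GCC, Clang and MSVC, but it is a compiler extension rather than standard C++, and both struct names are made up for the example.

```cpp
#include <cstddef>

struct DefaultLayout {
    char c;
    int i;  // compiler inserts padding before this member
};

#pragma pack(push, 1)  // cap member alignment at 1 byte
struct PackedLayout {
    char c;
    int i;  // now directly follows c; access may be misaligned
};
#pragma pack(pop)      // restore the previous packing for later code

static_assert(sizeof(PackedLayout) == sizeof(char) + sizeof(int),
              "no padding in the packed struct");
static_assert(offsetof(PackedLayout, i) == 1, "int starts right after the char");
static_assert(sizeof(DefaultLayout) >= sizeof(PackedLayout),
              "default layout is never smaller");
```

The two structs are different types with different layouts, which is exactly the kind of mismatch that causes trouble when a header is compiled with one packing and a library with another.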
I will make an analogy to help you understand.
Assume there is a long loaf of bread and you have a cutting machine that can cut it into slices of equal thickness. Then you give these slices out to, let's say, children. Every child takes their slice and is free to do what they want with it (put Nutella on it and eat it, etc.). They can even cut thinner slices out of it and use those.
If one child comes up to you and says that he does not want that slice everyone is getting, but a thinner slice instead, then you will have difficulties, because your cutting machine is optimized to cut at least a minimum amount, which makes everyone happy. But when one child asks for a thinner slice, then you have to reinvent the machine or put additional complexity to it like introducing two cutting modes. You don't want that. Eventually you give up and just give him a big slice anyway.
This is the same reason why it happens. Hope you could relate to the analogy.
Data alignment is why the char appears to be allocated 4 bytes.
char does not take up four bytes: it takes up a single byte as usual. You can check it by printing sizeof(char). The other three bytes are padding that the compiler inserts to optimize access to other members of your class. Depending on hardware, it is often much faster to access multi-byte types, say, 4-byte integers, when they are located at an address divisible by four. A compiler may insert up to three bytes of padding before an int member to align it with a good memory address for faster access.
If you would like to experiment with class layouts, you can use a handy macro called offsetof (declared in <cstddef>). It takes two parameters, the name of the class and the name of the member, and it returns the number of bytes from the base address of your struct to the position of the member in memory.
cout << offsetof(Storer, x) << endl;
cout << offsetof(Storer, y) << endl;
cout << offsetof(Storer, z) << endl;
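A fuller version of the same experiment, as a sketch. Storer is redeclared here as a plain struct with the six members from the question (the original class definition was not shown), and the concrete numbers in the comments are the ones reported in the question, not guarantees.

```cpp
#include <cstddef>

// Same member order as in the question.
struct Storer {
    int x;
    int y;
    int z;
    char c;
    long l;
    double d;
};

// Sum of the member sizes versus what the struct actually occupies.
constexpr std::size_t kMemberBytes =
    3 * sizeof(int) + sizeof(char) + sizeof(long) + sizeof(double);
constexpr std::size_t kPaddingBytes = sizeof(Storer) - kMemberBytes;

static_assert(offsetof(Storer, x) == 0, "first member sits at the base address");
static_assert(offsetof(Storer, c) == 3 * sizeof(int), "c directly follows the ints");
static_assert(offsetof(Storer, l) % alignof(long) == 0, "l starts on a long boundary");
static_assert(offsetof(Storer, d) % alignof(double) == 0, "d starts on a double boundary");
// On the asker's platform the offsets were 0, 4, 8, 12, 16, 24 with
// sizeof(Storer) == 32, i.e. 3 padding bytes between c and l.
```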
Structure members are aligned in particular ways. In general, if you want the most compact representation, list the members in decreasing order of size.
http://en.wikipedia.org/wiki/Data_structure_alignment#Typical_alignment_of_C_structs_on_x86
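The effect of member order is easy to see with two made-up structs that hold the same members. The sizes in the comments are what a typical x86-64 ABI produces and are an assumption, not something the language guarantees.

```cpp
#include <cstddef>

// Small members on either side of a large one: padding after each char.
struct Interleaved {
    char a;
    double b;
    char c;
};  // typically 24 bytes on x86-64

// Largest member first: the two chars share the tail.
struct LargestFirst {
    double b;
    char a;
    char c;
};  // typically 16 bytes on x86-64

static_assert(sizeof(LargestFirst) <= sizeof(Interleaved),
              "ordering by decreasing size never costs space here");
```

For the Storer class from the question, reordering would not change the size on the asker's platform: 29 bytes of members still round up to 32 because of the 8-byte alignment of long and double.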

Profiling a simple, one cycle length operation

We have an assignment where we need to profile a 'simple instruction' (addition or bit-wise and for example). This means performing the same operation a large number of times (100K+) and measuring the average time in microseconds. The result should be presented in cycle-lengths: (totalTime/iterations)*cphMHz.
So, results may vary but all in all we were told that we should get a result close to 1 cycle-length. Actual result doesn't matter as long as programming is correct.
My question is: what is a good operation to profile?
There are two points I need to consider:
I use loop unrolling to be a bit more accurate, so in each iteration I perform 10 simple instructions. This means I have to choose an operation that the compiler's optimizer won't collapse into a single execution (we can't use the -O0 flag, as the school staff does not).
Bad example: var = i; - the compiler would only perform the last assignment.
What is a real 'simple instruction'? How do I know the number of operations that are actually performed? I tried reading the assembly output, but I couldn't understand it.
Hope I was clear enough, any idea would be great.
Thanks anyway
P.S don't know if it matters but I write in CPP
1) This sounds (to me) like an impossible task, if optimizations are (or might be) enabled. You can never be sure on what the compiler will do during optimizations. I'd definitely do something like reusing the previous result. If allowed to/possible, I'd try to include a raw assembler snippet to be profiled (so you can be sure there's no additional overhead; although it still could be optimized).
2) As for instructions: One assembler command is one instruction. E.g. a += i will - depending on available instruction set and stuff - most likely result in 4 instructions: read a, read i, add, write a. Reading assembly is pretty much straightforward. Depending on the instruction set/processor, there might be different "directions" for reading (i.e. "from -> to"). x86 assemblers (and those for most other common processors) will prefer instruction target, source, while DSPs prefer to use instruction source, target. Just important to know: moving data has to happen through registers. So even a single assignment like a = b will result in two instructions (b to register and register to a).
In general, if this answer goes into the wrong direction, try to elaborate a bit more on your specific task and its requirements (e.g. which compiler is to be used) and drop me a short comment.
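The "reuse the previous result" idea, together with the cycle formula from the question, could be sketched like this. All names are hypothetical, the CPU frequency has to be supplied by the caller, and alternating add and xor is only an assumption about what a typical optimizer will not merge: inspect the generated assembly before trusting the numbers.

```cpp
#include <chrono>
#include <cstdint>

// (totalTime / operations) * MHz, with totalTime in microseconds.
constexpr double cycles_per_op(double total_us, std::uint64_t ops, double cpu_mhz) {
    return (total_us / static_cast<double>(ops)) * cpu_mhz;
}

struct Timing {
    double total_us;     // elapsed wall-clock time in microseconds
    std::uint64_t ops;   // number of simple operations executed
    std::uint64_t sink;  // final value; use it so the loop is not dead code
};

inline Timing time_dependent_ops(std::uint64_t iterations) {
    volatile std::uint64_t seed = 0x9E3779B97F4A7C15ull;  // value hidden from the optimizer
    const std::uint64_t step = seed;
    std::uint64_t acc = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        // 10 operations, each consuming the previous result.
        acc += step; acc ^= step; acc += step; acc ^= step; acc += step;
        acc ^= step; acc += step; acc ^= step; acc += step; acc ^= step;
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return Timing{elapsed.count(), iterations * 10, acc};
}
```

A caller would run time_dependent_ops with a large iteration count, print sink, and pass total_us and ops to cycles_per_op along with the machine's clock rate in MHz. Loop overhead and timer resolution are included in the measurement, so the result is an approximation.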

possible sets of output-preserving code manipulations

This is a theoretical question, so expect that many details here are not computable in practice or even in theory.
Let's say I have a string s that I want to compress. The result should be a self-extracting binary (can be x86 asm but can also be some other hypothetical Turing-complete low level language) which outputs s.
Now, we can easily iterate through all possible such binaries/programs, ordered by size. Let B_s be the sub-list of these binaries that output s (of course B_s is uncomputable).
As every set of positive integers must have a minimum, there must be a smallest program b_min_s in B_s
From s, I can also construct a canonical program b_cano_s which just outputs s in a trivial way. I.e. the size of b_cano_s will be in O(#s) -- if we think of ELF with data segments, we will even have #b_cano_s ~ #s.
Is there a set A of possible operations on the binaries which:
1 . Will preserve the output.
2a . Given b_cano_s, we can arrive somehow by operations from A at b_min_s.
(2b . Given b_cano_s, we can arrive at all programs in B_s.)
for all possible strings s.
The conditions 1+2a are weaker than the conditions 1+2b. Maybe, if there is such a set A, we will automatically have both, though. (Is that so?)
Does such a set A exist? I was thinking about some obvious operations, like searching for repeated strings and shortening them, or some of the other common compression methods. However, that is probably not enough to arrive at all programs in B_s, and my intuition says not necessarily at b_min_s either, for the same reason.
If it exists, can we express it, i.e. is it computable?
You should link your related previous questions.
2a. As noted, you can not determine b_min_s, because that results in a paradox. As a result, I don't think you can prove the operations in A are sufficient to reduce to it.
2b. You can brute force B_s, but this is an infinite set, and the procedure is non-terminating. However, for each program in B_s, you can calculate a manipulation from b_cano_s to B_s. However, that does not imply these possible operations will be meaningful. It seems operations like "delete characters in this range", "insert character at this position" qualify.