Julia memory allocation in simple size statement - profiling

I am running the following routine with --track-allocation=user. The routine is called in a for loop. Still, I am surprised by the allocation reported for the first line: I would expect numqps to be allocated on the stack and thus not to contribute to the final allocation count.
        - function buildpoints{T}(cell::Cell{T}, uv)
        -
  2488320     numqps::Int = size(uv,2)
 12607488     mps = Array(Point{T}, numqps)
        0     for i in 1 : numqps
        0         mps[i] = buildpoint(cell, Vector2(uv[1,i], uv[2,i]))
        -     end
        -
        0     return mps
        -
        - end
EDIT: A bit further along in the memory profiling output I find:
1262976 numcells(m::Mesh) = size(m.faces,2)
Does this mean the size function on Arrays is not implemented very efficiently?

Apparently I was calling size on a variable declared as
type MyType{T}
    A::Array{T}
end
So the type of A was only partially declared, i.e. only the element type was supplied, not the number of dimensions. I noticed similar allocation overheads when accessing elements (A[i,j]). The allocation disappeared when I instead declared
type MyType{T}
    A::Array{T,2}
end

In interpreting the results, there are a few important details. Under
the user setting, the first line of any function directly called from
the REPL will exhibit allocation due to events that happen in the REPL
code itself. More significantly, JIT-compilation also adds to
allocation counts, because much of Julia’s compiler is written in
Julia (and compilation usually requires memory allocation). The
recommended procedure is to force compilation by executing all the
commands you want to analyze, then call Profile.clear_malloc_data()
to reset all allocation counters. Finally, execute the
desired commands and quit Julia to trigger the generation of the .mem
files.

Related

Segmentation fault for array, but only if a component of a derived type

Pretty simple setup, using gfortran 4.8.5 on Linux (Red Hat):
I get a segfault if my array of reals (inside a derived type) has size > 2,000,000. This seems to be a standard stack/heap issue, as my stack size is 8 MB when I check with ulimit.
There is no problem if the array is NOT inside a derived type.
Note that, as #francescalus guesses, removing the initial value "= 0.0" eliminates the problem.
Edit to add: Note that I have posted a followup question Segmentation fault related to component of derived type that represents a more realistic use case and further narrows down the conditions under which this seems to occur.
program main
   call sub1   ! seg fault if col size > 2,100,000
   call sub2   ! works fine at col size = 100,000,000
end program main

subroutine sub1
   type table
      real :: col(2100000) = 0.0   ! works if "= 0.0" removed
   end type table
   type(table) :: table1
   table1%col = 1.0
end subroutine sub1

subroutine sub2
   real :: col(100000000) = 0.0
   col = 1.0
end subroutine sub2
Some obvious questions here:
Is this expected behavior, or some bug that was fixed in newer versions of gfortran?
Am I following standard Fortran practice here, or doing something wrong?
What is the recommended way to avoid this (please assume that I am unable to update to a newer version of gfortran in the near term)? I will almost certainly solve with an allocatable array component for reasons not specific to this question, but that might not be an ideal general solution and I would like to know of all good options I have here.
In particular, is initializing the components of a derived type bad practice?
This is likely to be a runtime issue due to insufficient stack, rather than a bug with gfortran.
Gfortran uses the stack to store automatic arrays and other initialization data. When code does not create problems when one such array is small, but segfaults when the size of the array increases, a possible reason is running out of stack.
The issue seems to be the same in more recent versions of gfortran. I compiled and ran your program with gfortran 4.8.4, 4.9.3, 5.5.0, 6.4.0, 7.3.0 and 8.2.0. In all cases I obtained a segmentation fault with the default stack size, but no error when the stack size was slightly increased.
$ ./sfa
Segmentation fault
$ ulimit -s
8192
$ ulimit -s 8256
$ ./sfa && echo "DONE"
DONE
Your problem may be solved by running
$ ulimit -s unlimited
before executing your binary. I am not aware of any particular penalty for doing this, but programmers more aware of the fine details of memory management, such as compiler developers, may think otherwise.
Initializing the components of a derived type is not bad practice, but as you can see, it can create problems with the stack if the component is a big array, whether due to the storage of the component itself or to the temporary storage needed for the right-hand side of the assignment. If the component is made allocatable and allocated in a subroutine, the array is stored on the heap rather than on the stack, and this issue is usually avoided. In this case, that means actually setting the values of the array at run time in a subroutine rather than at compile time. It may be less elegant, but I think it's worth it, since it's the typical example of code development work that prevents avoidable, environment-related errors when executing the binary.
Your code above is standards compliant. As explained in the comments, lack of explicit interfaces for subroutines is not good practice, but for these simple subroutines it's not against the rules.
Some compilers have flags that allow you to change where some objects are allocated in memory. While it may fix a particular issue, flags are compiler dependent, and usually not equivalent when comparing different compilers. Using dynamic memory via allocatables is a more robust solution, according to my experience.
Finally, note that, if you are using OpenMP, the ulimit command above only affects the master thread - you need to set the stack size of each of the other threads via the environment variable OMP_STACKSIZE, which cannot be unlimited. And bear in mind that non-master threads running out of stack are a problem much more difficult to diagnose, since the binary may stop without a proper Segmentation fault error.
These are not necessarily useful solutions, but below are some conditions under which the seg fault disappears. A couple of people mentioned the lack of an explicit interface (as bad practice though not technically incorrect), and it seems that this might be one key here as either of these two changes to the code gets rid of the seg fault, although it's not quite that simple, as I'll explain:
Put everything in main, with no subroutine calls
Put the type definition table in a module
Let me expand on #2 briefly. Simply taking the example in the OP and then giving it an explicit interface by putting the subroutine in a module does NOT work. However, if I put the type definition in a module and then use it (as shown below) the segfault does not occur:
program main
   use table_mod
   type(table) :: table1
   table1%col = 1.0
end program main

Segmentation fault - why and how does it work?

In both of the functions defined below, it tries to allocate 10M of memory on the stack, but the segmentation fault happens only in the second case and not in the first, and I am trying to understand why.
Function definition 1:
a(int *i)
{
    char iptr[50000000];
    *i = 1;
}
Function definition 2:
a()
{
    char c;
    char iptr[5000000];
    printf("&c = 0x%lx, iptr = 0x%x ... ", &c, iptr);
    fflush(stdout);
    c = iptr[0];
    printf("ok\n");
}
According to my understanding, local variables that are not allocated dynamically are stored in the stack section of the program. So I suppose that at compile time the compiler checks whether the variable fits in the stack or not.
Hence, if the above is true, a segmentation fault should occur in both cases (i.e. also in case 1).
The website (http://web.eecs.utk.edu/courses/spring2012/cs360/360/notes/Memory/lecture.html) from which I took this states that the segfault happens in function 2 when the code attempts to push iptr on the stack for the printf call. This is because the stack pointer is pointing into the void. Had we not referenced anything at the stack pointer, our program should have worked.
I need help understanding this last statement and my earlier doubt related to this.
So I suppose, during compile time itself the compiler checks if the variable fits in the stack or not.
No, that cannot be done. When compiling a function, the compiler does not know what the call stack will be when the function is called, so it will assume that you know what you are doing (which might or might not be the case). Also note that the amount of stack space may be affected by both compile-time and runtime restrictions (in Linux you can set the stack size with ulimit in the shell that starts the process).
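To illustrate that the stack limit is a runtime property of the process rather than something the compiler verifies, here is a minimal C++ sketch (my own illustration, assuming a POSIX system where getrlimit and RLIMIT_STACK are available) that prints the soft stack limit that ulimit -s controls:

#include <cstdio>
#include <sys/resource.h>

int main()
{
    // RLIMIT_STACK is the same per-process limit that `ulimit -s` adjusts in the shell.
    rlimit rl{};
    if (getrlimit(RLIMIT_STACK, &rl) == 0)
    {
        if (rl.rlim_cur == RLIM_INFINITY)
            std::printf("soft stack limit: unlimited\n");
        else
            std::printf("soft stack limit: %llu bytes\n",
                        static_cast<unsigned long long>(rl.rlim_cur));
    }
    return 0;
}

Nothing in the compiled function changes with this limit; only whether the process faults at run time does.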
I need help understanding this last statement and my earlier doubt related to this.
I would not read too much into that statement; it is not standard, but rather based on knowledge of a particular implementation that is not even described there, and thus it is built on some assumptions that are not necessarily true.
It assumes that the act of allocating the array does not 'touch' the allocated memory (in some debug builds in some implementations that is false), and thus that whether you attempt to allocate 1 byte or 100M, if the data is not touched by your program the allocation is fine. This need not be the case.
It also assumes that the arguments of the function printf are passed on the stack (this is actually the case in all implementations I know of, due to the variadic nature of the function). Under the previous assumption, the array would overflow the stack (assuming a stack of < 10M) but would not crash, as the memory is not accessed; however, to be able to call printf, the value of the argument would be pushed onto the stack beyond the array. That push writes to memory beyond the space allocated for the stack, and crashes.
Again, all this is implementation, not defined by the language.
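Since none of this is defined by the language, the portable way to sidestep the problem is to not place a huge buffer on the stack at all. Here is a minimal C++ sketch of that idea (mine, not from the original lecture notes), using heap storage:

#include <cstdio>
#include <vector>

void a()
{
    // The buffer lives on the heap, so its size is limited by available
    // memory rather than by the stack size (typically only a few MB).
    std::vector<char> iptr(5000000);
    iptr[0] = 'x';           // touching the memory is fine here
    std::printf("ok\n");
}

int main()
{
    a();
    return 0;
}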
The error in your code is being thrown by the following code:
; Find next lower page and probe
cs20:
        sub     eax, _PAGESIZE_        ; decrease by PAGESIZE
        test    dword ptr [eax], eax   ; probe page. **This line throws the error**
        jmp     short cs10

_chkstk endp
        end
This comes from the chkstk.asm file, which provides stack checking on procedure entry, and which explicitly defines:
_PAGESIZE_ equ 1000h
Now, as an explanation of your problem, this question tells you everything you need, as mentioned by Shafik Yaghmour:
Your printf format string assumes that pointers, ints (%x), and longs (%lx) are all the same size; this may be false on your platform, leading to undefined behavior. Use %p instead. I intended to make this a comment, but can't yet.
I am surprised no one noticed that the first function allocates 10 times the space of the second function: there are seven zeros after the 5 in the first function, whereas the second function has six zeros after the 5 :-)
I compiled it with gcc 4.6.3 and got a segmentation fault on the first function but not on the second. After I removed the additional zero in the first function, the seg fault went away; adding a zero in the second function introduced it. So at least in my case, the reason for this seg fault is that the program could not allocate the required space on the stack (50,000,000 bytes is roughly 48 MB, well beyond a common 8 MB default stack, while 5,000,000 bytes is under 5 MB and fits). I would be happy to hear about observations that differ from the above.

Limit recursive calls in C++ (about 5000)?

In order to find the limit of recursive calls in C++, I tried this function:
#include <cstdio>

void recurse(int count) // Each call gets its own count
{
    printf("%d\n", count);
    // It is not necessary to increment count since each function's
    // variables are separate (so each count will be initialized one greater)
    recurse(count + 1);
}
This program halts when count equals 4716, so the limit is just 4716!
I'm a little confused: why does the program stop executing when count reaches 4716?
PS: executed under Visual Studio 2010.
Thanks
The limit on recursive calls depends on the size of the stack. The C++ language does not limit this (from memory, there is a lower bound on how many nested function calls a standards-conforming implementation must support, and it's a pretty small value).
And yes, recursing "infinitely" will stop at some point or another. I'm not entirely sure what else you expect.
It is worth noting that designing software to do "boundless" recursion (or recursion that runs in to the hundreds or thousands) is a very bad idea. There is no (standard) way to find out the limit of the stack, and you can't recover from a stack overflow crash.
You will also find that if you add an array or some other data structure [and use it, so it doesn't get optimized out], the recursion limit goes lower, because each stack-frame uses more space on the stack.
Edit: I would actually expect a higher limit; I suspect you are compiling your code in debug mode. If you compile it in release mode, I expect you get several thousand more, possibly even endless recursion, because the compiler can convert your tail call into a loop.
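To make the tail-call point concrete, here is a rough C++ sketch (not part of the original answer, with a limit parameter added so it terminates) of the loop the optimizer can effectively turn the function into; the iterative form uses constant stack space at any optimization level:

#include <cstdio>

// Iterative equivalent of recurse(): the tail call recurse(count + 1)
// becomes an update of count inside a loop, so no new stack frame is
// created per step and the stack cannot overflow.
void recurse_iterative(int count, int limit)
{
    while (count < limit)
    {
        std::printf("%d\n", count);
        ++count;
    }
}

Calling recurse_iterative(0, 1000000) runs to completion where the recursive version would overflow long before.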
The stack size is dependent on your environment.
In *NIX for instance, you can modify the stack size in the environment, then run your program and the result will be different.
In Windows, you can change it this way (source):
$ editbin /STACK:reserve[,commit] program.exe
You've probably run out of stack space.
Every time you call the recursive function, it needs to push a return address on the stack so it knows where to return to after the function call.
It crashes at 4716 because it just happens to run out of stack space after about 4716 calls. (With the 1 MB stack that MSVC reserves by default, that works out to roughly 220 bytes per call frame, which is plausible for an unoptimized build.)

Memory error: Dereference null pointer/ SSE misalignment

I'm compiling a program on a remote Linux server. The program compiles; however, when I run it, the program ends abruptly. So I debugged the program using DDT. It spits out the following error:
Process 0:
Memory error detected in ClassName::function (filename.cpp:6462).
Thread 1 attempted to dereference a null pointer or execute an SSE instruction with an
incorrectly aligned memory address (the latter may sometimes occur spuriously if guard
pages are enabled)
Tip: Use the stack list and the local variables to explore your program's current
state and identify the source of the error.
Can anyone please tell me what exactly this error means?
The line where the program stops looks like this:
SumUtility = ParaEst[0] + hhincome * ParaEst[71] + IsBlack * ParaEst[61] + IsBachAss * (ParaEst[55]);
This is within a switch case.
These are the variable types:
vector<double> ParaEst;
double hhincome;
int IsBlack, IsBachAss;
Thanks for the help!
It means one of the following:
- ParaEst is NULL or a bad pointer.
- ParaEst's individual array values are not aligned to 16-byte boundaries, as required for SSE.
- hhincome, IsBlack, or IsBachAss are not aligned to 16-byte boundaries and are SSE-type values.
- SumUtility is not aligned to 16 bytes and is an SSE-type field.
If you could post the assembly code of the exact line that failed, along with the register values at that line, we could tell you exactly which of the above conditions has failed. It would also help to see the types of each variable shown, to help narrow down the root cause.
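As a quick way to check the alignment and bounds hypotheses without reading assembly, here is a small hypothetical C++ sketch (the names mirror the question; none of this comes from the original code) that tests whether the data behind ParaEst is 16-byte aligned and whether index 71 is in range:

#include <cstdint>
#include <cstdio>
#include <vector>

// Returns true if the address is aligned to a 16-byte boundary,
// which aligned SSE loads (e.g. movaps) require.
bool isAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

int main()
{
    std::vector<double> ParaEst(72, 0.0);   // hypothetical stand-in for the real data

    std::printf("size ok for index 71: %s\n", ParaEst.size() > 71 ? "yes" : "no");
    std::printf("data 16-byte aligned: %s\n", isAligned16(ParaEst.data()) ? "yes" : "no");
    return 0;
}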
Ok... The problem finally got fixed.
The issue was that the expression where the code was breaking down sat in a newly defined function. For some weird reason, however, running the makefile did not incorporate these changes and the build was still using the previously compiled .o file. This resulted in garbage values being assigned to the variables within this new function. To top things off, the program calls this function as a first step, hence the systematic breakdown. The technical aspect of this was what Michael alluded to.
After this, I would always recommend having a make clean target in the makefile. Why running the makefile failed to recompile the modified source file is an issue that definitely warrants further discussion.
Thanks for the responses!!

Profiling memory usage in Mathematica

Is there any way to profile the mathkernel memory usage (down to individual variables) other than paying $$$ for their Eclipse plugin (Mathematica Workbench, IIRC)?
Right now I finish execution of a program that takes multiple GBs of RAM, but the only things that should be stored amount to ~50 MB of data at most, yet mathkernel.exe tends to hold onto ~1.5 GB (basically, as much as Windows will give it). Is there any better way to get around this, other than saving the data I need and quitting the kernel every time?
EDIT: I've just learned of the ByteCount function (which shows some disturbing results on basic data types, but that's beside the point), but even the sum over all my variables is nowhere near the amount taken by mathkernel. What gives?
One thing a lot of users don't realize is that it takes memory to store all your inputs and outputs in the In and Out symbols, regardless of whether or not you assign an output to a variable. Out is also aliased as %, where % is the previous output, %% is the second-to-last, etc. %123 is equivalent to Out[123].
If you don't have a habit of using %, or only use it to a few levels deep, set $HistoryLength to 0 or a small positive integer, to keep only the last few (or no) outputs around in Out.
You might also want to look at the functions MaxMemoryUsed and MemoryInUse.
Of course, the $HistoryLength issue may or may not be your problem, but you haven't shared what your actual evaluation is.
If you're able to post it, perhaps someone will be able to shed more light on why it's so memory-intensive.
Here is my solution for profiling memory usage:
myByteCount[symbolName_String] :=
  Replace[ToHeldExpression[symbolName],
   Hold[x__] :>
    If[MemberQ[Attributes[x], Protected | ReadProtected],
     Sequence @@ {},
     {ByteCount[
        Through[{OwnValues, DownValues, UpValues, SubValues,
           DefaultValues, FormatValues, NValues}[Unevaluated@x,
          Sort -> False]]], symbolName}]];

With[{listing = myByteCount /@ Names[]},
 Labeled[
  Grid[Reverse@Take[Sort[listing], -100], Frame -> True,
   Alignment -> Left],
  Column[{Style[
     "ByteCount for symbols without attributes Protected and \
ReadProtected in all contexts", 16, FontFamily -> "Times"],
    Style[Row@{"Total: ", Total[listing[[All, 1]]], " bytes for ",
       Length[listing], " symbols"}, Bold]}, Center, 1.5], Top]]
Evaluating the above gives a table of the 100 largest symbols by byte count.
Michael Pilat's answer is a good one, and MemoryInUse and MaxMemoryUsed are probably the best tools you have. ByteCount is rarely all that helpful because what it measures can be a huge overestimate, since it ignores shared subexpressions, and it often ignores memory that isn't directly accessible through Mathematica functions, which is often a major component of memory usage.
One thing you can do in some circumstances is use the Share function, which forces subexpressions to be shared when possible. In some circumstances, this can save you tens or even hundreds of megabytes. You can tell how well it's working by using MemoryInUse before and after you use Share.
Also, some innocuous-seeming things can cause Mathematica to use a whole lot more memory than you expect. Contiguous arrays of machine reals (and only machine reals) can be allocated as so-called "packed" arrays, much the way they would be allocated by C or Fortran. However, if you have a mix of machine reals and other structures (including symbols) in an array, everything has to be "boxed", and the array becomes an array of pointers, which can add a lot of overhead.
One way is to automate restarting the kernel when it runs out of memory. You can execute your memory-consuming code in a slave kernel, while the master kernel only takes the result of the computation and controls memory usage.