Memory overflow in Contiki - c++

I am working on an application that requires msp430 math functions. When I use such functions, e.g. powf, sqrt, etc., a memory (ROM) overflow occurs. One such case: my code works when I use the float variable i locally, without static.
#include "contiki.h"
#include <stdio.h> /* For printf() */
#include <math.h>
#define DEBUG DEBUG_NONE
/*---------------------------------------------------------------------------*/
PROCESS(hello_world_process, "Hello world process");
AUTOSTART_PROCESSES(&hello_world_process);
/*---------------------------------------------------------------------------*/
PROCESS_THREAD(hello_world_process, ev, data)
{
  PROCESS_BEGIN();

  float i;
  i = 2.1;
  printf("Hello, world\n");
  printf("%i\n", (int)powf(10, i));

  PROCESS_END();
}
/*---------------------------------------------------------------------------*/
However, in the second case it doesn't work:
#include "contiki.h"
#include <stdio.h> /* For printf() */
#include <math.h>
#define DEBUG DEBUG_NONE
static float i;
/*---------------------------------------------------------------------------*/
PROCESS(hello_world_process, "Hello world process");
AUTOSTART_PROCESSES(&hello_world_process);
/*---------------------------------------------------------------------------*/
PROCESS_THREAD(hello_world_process, ev, data)
{
  PROCESS_BEGIN();

  i = 2.1;
  printf("Hello, world\n");
  printf("%i\n", (int)powf(10, i));

  PROCESS_END();
}
/*---------------------------------------------------------------------------*/
The suggested answer is to upgrade msp430-gcc, but that may lead to system instability. Are there any other suggestions for efficiently handling memory overflows?
What methodology can be followed for efficiently managing memory in embedded systems?

In the first case, the symbol i is local (it lives in the function's stack frame), so the compiler can see its value, optimize the function call away, and compute powf(10, 2.1) at compile time. In the second case, the symbol i is defined outside the function.
The optimizer cannot prove that it is never modified by some other code, external to the main process. Hence it does not optimize the powf call away, and you end up linking in floating-point support. Since the msp430 does not support floating point in hardware, the linker ends up adding a lot of soft-float code to the executable. The executable gets too big and linking fails.
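As a minimal sketch (my own illustration, not from the original thread): keeping the exponent visible to the optimizer, e.g. as a macro or local constant, lets the call be folded away, although whether that folding actually happens depends on your compiler version and optimization level.
#include "contiki.h"
#include <stdio.h>
#include <math.h>

/* Hypothetical sketch: the exponent is a compile-time constant, so an
 * optimizing compiler can evaluate powf(10, EXPONENT) at build time and
 * avoid pulling in the soft-float powf implementation. */
#define EXPONENT 2.1f

PROCESS(hello_world_process, "Hello world process");
AUTOSTART_PROCESSES(&hello_world_process);

PROCESS_THREAD(hello_world_process, ev, data)
{
  PROCESS_BEGIN();

  printf("Hello, world\n");
  printf("%i\n", (int)powf(10, EXPONENT)); /* may be folded to a constant */

  PROCESS_END();
}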
Upgrading the compiler will not magically solve the problem. You need to free up some memory. Follow the Contiki configuration guidelines: https://github.com/contiki-os/contiki/wiki/Reducing-Contiki-OS-firmware-size

If you need to save RAM, you might consider reducing:
QUEUEBUF_CONF_NUM: the number of packets in the link-layer queue. 4 is probably a lower bound for reasonable operation. As the traffic load increases, e.g. more frequent traffic or larger datagrams, you will need to increase this parameter.
NBR_TABLE_CONF_MAX_NEIGHBORS: the number of entries in the neighbor table. A value greater than the maximum network density is safe. A value lower than that will also work, as the neighbor table will automatically focus on relevant neighbors. But too low values will result in degraded performance.
NETSTACK_MAX_ROUTE_ENTRIES: the number of routing entries, i.e., in RPL non-storing mode, the number of links in the routing graph, and in storing mode, the number of routing table elements. At the network root, this must be set to the maximum network size. In non-storing mode, other nodes can set this parameter to 0. In storing mode, it is recommended for all nodes to also provision enough entries for each node in the network.
UIP_CONF_BUFFER_SIZE: the size of the IPv6 buffer. The minimum value for interoperability is 1280. In closed systems, where no large datagrams are used, lowering this to e.g. 140 may be sensible.
SICSLOWPAN_CONF_FRAG: Enables/disables 6LoWPAN fragmentation. Disable this if all your traffic fits a single link-layer packet. Note that this will also save some significant ROM.
If you need to save ROM, you can consider the following:
UIP_CONF_TCP: Enables/disables TCP. Make sure this is disabled when TCP is unused.
UIP_CONF_UDP: Enables/disables UDP. Make sure this is disabled when UDP is unused.
SICSLOWPAN_CONF_FRAG: As mentioned above. Disable if no fragmentation is needed.
LOG_CONF_LEVEL_*: Logs consume a large amount of ROM. Reduce log levels to save some more.
There are many other parameters that affect RAM/ROM usage. You can inspect os/contiki-default-conf.h as well as platform-specific contiki-conf.h files for inspiration. Or use .flashprof and .ramprof to identify the hotspots.
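As an illustrative sketch (the values below are placeholders of my own, not recommendations; check your platform's defaults and your traffic requirements), several of these parameters can be overridden from a project-conf.h:
/* project-conf.h -- illustrative values only, tune for your own network */
#define QUEUEBUF_CONF_NUM             4    /* link-layer queue packets */
#define NBR_TABLE_CONF_MAX_NEIGHBORS  8    /* neighbor table entries */
#define NETSTACK_MAX_ROUTE_ENTRIES    8    /* routing entries (root: network size) */
#define UIP_CONF_BUFFER_SIZE          140  /* only if no large datagrams are used */
#define SICSLOWPAN_CONF_FRAG          0    /* disable 6LoWPAN fragmentation */
#define UIP_CONF_TCP                  0    /* disable TCP when unused */
#define UIP_CONF_UDP                  1    /* keep UDP if you use it */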
*Answered on the Contiki wiki, in "Tutorial: RAM and ROM usage", by George Oikonomou.

Related

Maximum recursive function calls in C/C++ before stack is full and gives a segmentation fault?

I was solving a problem where I used a recursive function to create a segment tree. For larger values it started giving a segmentation fault. At first I thought it might be an array index going out of bounds, but later I realized it was probably the program stack growing too big.
I wrote this code to count the maximum number of recursive calls allowed before the system gives a segfault.
#include <iostream>
using namespace std;

void recur(long long int);

int main()
{
    recur(0);
    return 0;
}

void recur(long long int v)
{
    v++;
    cout << v << endl;
    recur(v);
}
After running the above code I got values of v of 261926, 261893 and 261816 before getting a segmentation fault, and all runs were close to these.
Now I know this depends on the machine and on the stack size used by the function being called, but can someone explain the basics of how to keep safe from segfaults, and what soft limit one can keep in mind?
The number of recursion levels you can do depends on the call-stack size combined with the size of the local variables and arguments that are placed on that stack. Aside from "how the code is written", just like many other memory-related things, this is very much dependent on the system you're running on, what compiler you are using, the optimisation level [1], and so on. On some embedded systems I've worked on the stack was a few hundred bytes, my first home computer had 256 bytes of stack, whereas modern desktops have megabytes of stack (and you can adjust it, but eventually you will run out).
Doing recursion at unlimited depth is not a good idea, and you should look at changing your code so that "it doesn't do that". You need to understand the algorithm, understand to what depth it will recurse, and whether that is acceptable on your system. There is unfortunately nothing anyone can do at the time the stack runs out (at best your program crashes; at worst it doesn't, but instead causes something ELSE to go wrong, such as the stack or heap of some other part of the application getting messed up!)
On a desktop machine, I'd think a recursion depth of a few hundred to a few thousand is acceptable, but not much more than this - and that is if each call uses only a small amount of stack. If each call uses up kilobytes of stack, you should limit the call depth even further, or reduce the need for stack space.
If you need more recursion depth than that, you need to rearrange the code - for example using a software stack to store the state, and a loop in the code itself, as sketched below.
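As a rough sketch (my own example, not from the original question; I added an arbitrary termination condition, which the original recursion lacked), the same pattern looks like this with an explicit stack:
#include <iostream>
#include <stack>

// Sketch: replace the recursive call with an explicit stack plus a loop,
// so the depth is limited by heap size rather than by the call stack.
void recur_iterative(long long start)
{
    std::stack<long long> work;
    work.push(start);
    while (!work.empty())
    {
        long long v = work.top();
        work.pop();
        ++v;
        std::cout << v << '\n';
        if (v < 1000000)          // arbitrary stop; the original never terminates
            work.push(v);
    }
}

int main()
{
    recur_iterative(0);
}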
[1] Using g++ -O2 on your posted code, I got to 50 million and counting, and I expect that if I leave it long enough it will wrap around to zero, because it keeps going forever - g++ detects that this recursion can be converted into a loop, and does so. The same program compiled with -O0 or -O1 does indeed stop at a little over 200000. With clang++ -O1 it also just keeps going; the clang-compiled code was still running, at 185 million "recursions", as I finished writing the rest of this answer.
There is (AFAIK) no well-established limit. (I am answering from a Linux desktop point of view.)
On desktops and laptops the default stack size is a few megabytes (in 2015). On Linux you can use setrlimit(2) to change it (to a reasonable figure; don't expect to be able to set it to a gigabyte these days), and you can use getrlimit(2) or parse /proc/self/limits (see proc(5)) to query it. On embedded microcontrollers - or inside the Linux kernel - the entire stack may be much more limited (to a few kilobytes in total).
When you create a thread using pthread_create(3) you can pass an explicit pthread_attr_t and use pthread_attr_setstack(3) to set the stack space; see the sketch below.
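A small sketch (my own, assuming only standard POSIX threads) of giving one thread a larger stack, here with pthread_attr_setstacksize; pthread_attr_setstack additionally lets you supply the stack memory yourself:
#include <pthread.h>
#include <cstdio>

// Sketch: start a worker thread with an 8 MiB stack.
static void* worker(void*) { /* deep recursion could go here */ return nullptr; }

int main()
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 8 * 1024 * 1024);  // in bytes, >= PTHREAD_STACK_MIN

    pthread_t tid;
    if (pthread_create(&tid, &attr, worker, nullptr) != 0)
        std::perror("pthread_create");
    else
        pthread_join(tid, nullptr);
    pthread_attr_destroy(&attr);
}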
BTW, with recent GCC you might compile all your software (including the standard C library) with split stacks (pass -fsplit-stack to gcc or g++).
Finally, your example is a tail call, and GCC can optimize that (into a jump with arguments). I checked that if you compile with g++ -O2 (using GCC 4.9.2 on Linux/x86-64/Debian) the recursion is transformed into a genuine loop and no stack allocation grows indefinitely (your program ran for nearly 40 million calls to recur in a minute, then I interrupted it). In languages like Scheme or OCaml there is a guarantee that tail calls are compiled iteratively (the tail-recursive call then becomes the usual - or even the only - looping construct).
CyberSpok is being excessive in his comment (hinting to avoid recursion). Recursion is very useful, but you should limit it to a reasonable depth (e.g. a few thousand), and you should take care that call frames on the call stack are small (less than a kilobyte each), so practically allocate and deallocate most of the data in the C heap. The GCC -fstack-usage option is really useful for reporting the stack usage of every compiled function. See this and that answer.
Notice that continuation-passing style is a canonical way to transform recursion into iteration (you then trade stack frames for dynamically allocated closures).
Some clever algorithms replace recursion with fancy modifying iterations, e.g. the Deutsch-Schorr-Waite graph marking algorithm.
For Linux-based applications, we can use the getrlimit and setrlimit APIs to query various kernel resource limits, such as the size of core files, CPU time, stack size, nice values, the maximum number of processes, etc. RLIMIT_STACK is the resource name for the stack defined in the Linux kernel. Below is a simple program to retrieve the stack size limit:
#include <iostream>
#include <sys/time.h>
#include <sys/resource.h>
#include <errno.h>

using namespace std;

int main()
{
    struct rlimit sl;
    int returnVal = getrlimit(RLIMIT_STACK, &sl);
    if (returnVal == -1)
    {
        cout << "Error. errno: " << errno << endl;
    }
    else if (returnVal == 0)
    {
        cout << "stackLimit soft - max : " << sl.rlim_cur << " - " << sl.rlim_max << endl;
    }
}
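For completeness, a small sketch (my own; the 64 MiB figure is arbitrary) of raising the soft stack limit with setrlimit, which can never exceed the hard limit rlim_max:
#include <iostream>
#include <sys/resource.h>
#include <errno.h>

// Sketch: raise the soft stack limit to 64 MiB if the hard limit allows it.
int main()
{
    struct rlimit sl;
    if (getrlimit(RLIMIT_STACK, &sl) != 0)
    {
        std::cerr << "getrlimit failed, errno: " << errno << std::endl;
        return 1;
    }
    rlim_t wanted = 64UL * 1024 * 1024;
    if (wanted > sl.rlim_cur && wanted <= sl.rlim_max)
    {
        sl.rlim_cur = wanted;
        if (setrlimit(RLIMIT_STACK, &sl) != 0)
            std::cerr << "setrlimit failed, errno: " << errno << std::endl;
    }
    std::cout << "soft stack limit is now " << sl.rlim_cur << " bytes" << std::endl;
}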

Implementing an Interrupt driven Sampling Profiler

I am trying to create a sampling profiler that works on Linux. I am unsure how to send an interrupt, or how to get the program counter (PC), so I can find out where the program is when I interrupt it.
I have tried using signal(SIGUSR1, Foo*) and calling backtrace, but I get the stack of the thread I am in when I call raise(SIGUSR1), rather than the thread the program is running on.
I am not really sure if this is even the right way of going about it...
Any advice?
If you must write a profiler, let me suggest you use a good one (Zoom) as your model, not a bad one (gprof).
These are its characteristics.
There are two phases. First is the data-gathering phase:
When it takes a sample, it reads the whole call stack, not just the program counter.
It can take samples even when the process is blocked due to I/O, sleep, or anything else.
You can turn sampling on/off, so as to only take samples during times you care about. For example, while waiting for the user to type something, it is pointless to be sampling.
Second is the data-presentation phase.
What you have is a collection of stack samples, where a stack sample is a vector of memory addresses, which are almost all return addresses.
Each return address indicates a line of code in a function, unless it's in some system routine you don't have symbolic information for.
The key piece of useful information is residency fraction (usually expressed as a percent).
If there are a total of m stack samples, and line of code L is present anywhere on n of the samples, then its residency fraction is n/m.
This is true even if L appears more than once on a sample; that is still just one sample it appears on.
The importance of residency fraction is it directly indicates what fraction of time statement L is responsible for.
If you have taken m=1000 samples, and L appears on n=300 of them, then L's residency fraction is 300/1000 or 30%.
This means that if L could be removed, total time would decrease by 30%.
It is typically known as inclusive percent.
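For illustration (my own sketch, not part of the original answer), computing the residency fraction of one address over a collection of stack samples is simple:
#include <vector>
#include <set>
#include <cstdint>
#include <cstddef>

// Sketch: each sample is the vector of return addresses captured at one
// interrupt; an address counts at most once per sample, even if it appears
// several times on that sample (e.g. due to recursion).
double residency_fraction(const std::vector<std::vector<std::uintptr_t>>& samples,
                          std::uintptr_t addr)
{
    std::size_t n = 0;
    for (const auto& sample : samples)
    {
        std::set<std::uintptr_t> unique(sample.begin(), sample.end());
        if (unique.count(addr))
            ++n;                      // present anywhere on this sample
    }
    return samples.empty() ? 0.0 : static_cast<double>(n) / samples.size();
}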
You can determine residency fraction not just for lines of code, but for anything else you can describe. For example, line of code L is inside some function F.
So you can determine the residency fraction for functions, as opposed to lines of code.
That would give you inclusive percent by function.
You could look at function pairs, like on what fraction of samples do you see function F calling function G.
That would give you the information that makes up call graphs.
There are all kinds of information you can get out of the stack samples.
One that is often seen is a "butterfly view", where you have a "focus" on one line L or function F, and on one side you show all the lines or functions immediately above it in the stack samples, and on the other side all the lines or functions immediately below it.
On each of these, you can show the residency fraction.
You can click around in this to try to find lines of code with high residency fraction that you can find a way to eliminate or reduce.
That's how you speed up the code.
Whatever you do for output, I think it is very important to allow the user to actually examine a small number of them, randomly selected.
They convey far more insight than can be gotten from any method that condenses the information.
As important as it is to know what the profiler should do, it is also important to know what not to do, even if lots of other profilers do them:
self time. A useless number. Look at some reasonable-size programs and you will see why.
invocation counts. Of no help in finding code with high residency fraction, and you can't get it with samples alone anyway.
high-frequency sampling. It's amazing how many people, certainly profiler builders, think it is important to get lots of samples. Suppose line L is on 30% of 1000 samples. Then its true inclusive percent is 30 +/- 1.4 percent. On the other hand, if it is on 30% of 10 samples, its inclusive percent is 30 +/- 14 percent. It's still pretty big - big enough to fix. What happens in most profilers is people think they need "numerical precision", so they take lots of samples and accumulate what they call "statistics", and then throw away the samples. That's like digging up diamonds, weighing them, and throwing them away. The real value is in the samples themselves, because they tell you what the problem is.
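A small sketch of the arithmetic behind those error bars (my own illustration, using the usual binomial standard error sqrt(p*(1-p)/m)):
#include <cmath>
#include <cstdio>

// Sketch: standard error of an inclusive fraction p measured from m samples.
double stderr_of_fraction(double p, int m)
{
    return std::sqrt(p * (1.0 - p) / m);
}

int main()
{
    std::printf("m=1000: +/- %.1f%%\n", 100.0 * stderr_of_fraction(0.30, 1000)); // ~1.4%
    std::printf("m=10:   +/- %.1f%%\n", 100.0 * stderr_of_fraction(0.30, 10));   // ~14%
}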
You can send a signal to a specific thread using pthread_kill and the tid (gettid()) of the target thread.
The right way to create a simple profiler is to use setitimer, which can send a periodic signal (SIGALRM or SIGPROF), for example every 10 ms; or POSIX timers (timer_create, timer_settime, or timerfd), with no need for a separate thread to send profiling signals. Check the sources of google-perftools (gperftools): they use setitimer or POSIX timers and collect profiles with backtraces.
gprof also uses setitimer for implementing CPU-time profiling (9.1 Implementation of Profiling: "Linux 2.0 ... arrangements are made for the kernel to periodically deliver a signal to the process (typically via setitimer())").
For example, here is a result of a code search for setitimer in gperftools's sources: https://code.google.com/p/gperftools/codesearch#search/&q=setitimer&sq=package:gperftools&type=cs
void ProfileHandler::StartTimer() {
  if (!allowed_) {
    return;
  }
  struct itimerval timer;
  timer.it_interval.tv_sec = 0;
  timer.it_interval.tv_usec = 1000000 / frequency_;
  timer.it_value = timer.it_interval;
  setitimer(timer_type_, &timer, 0);
}
You should know that setitimer has problems with fork and clone, and historically it did not work well with multithreaded applications. There was an attempt to create a helper wrapper: http://sam.zoy.org/writings/programming/gprof.html (a flawed one), but I don't remember whether it works correctly (setitimer usually sends a process-wide signal, not a thread-wide one). UPD: it seems that since Linux kernel 2.6.12, setitimer's signal is directed to the process as a whole (any thread may get it).
To direct the signal from timer_create to a specific thread, you need gettid() (#include <sys/syscall.h>, syscall(__NR_gettid)) and the SIGEV_THREAD_ID flag. I haven't checked how to create a periodic POSIX timer with timer_create (probably with timer_settime and a non-zero it_interval).
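A bare-bones sketch of the setitimer approach (my own, not taken from gperftools): install a SIGPROF handler and arm ITIMER_PROF; the handler then runs in whichever thread consumed the CPU time. Note that backtrace() is not strictly async-signal-safe (it may allocate on first use), which is why real profilers ship their own unwinders; treat this only as a starting point.
#include <signal.h>
#include <sys/time.h>
#include <execinfo.h>
#include <unistd.h>
#include <string.h>

// Sketch: dump the sampled call stack to stderr on every SIGPROF.
static void prof_handler(int, siginfo_t*, void*)
{
    void* frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);  // writes directly, no malloc
}

static void start_sampling(int hz)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = prof_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, nullptr);

    struct itimerval timer;
    timer.it_interval.tv_sec = 0;
    timer.it_interval.tv_usec = 1000000 / hz;        // hz = 100 -> a sample every 10 ms
    timer.it_value = timer.it_interval;
    setitimer(ITIMER_PROF, &timer, nullptr);         // SIGPROF fires on consumed CPU time
}
A real profiler would record the raw addresses into a lock-free buffer inside the handler and symbolize them later, which is roughly what gperftools does.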
PS: there is some overview of profiling in wikibooks: http://en.wikibooks.org/wiki/Introduction_to_Software_Engineering/Tools/Profiling

C++ get Processor ID

This related thread is OK: How to get Processor and Motherboard Id?
I want to get the processor ID using C++ code, not using WMI or any third-party library.
Or anything on a computer that turns out to be unique.
One option is the Ethernet (MAC) ID, but that is removable on some machines. I want to use this mostly for licensing purposes.
Is the processor ID unique and available on all major processors?
I had a similar problem lately and I did the following. First I gained some unique system identification values:
GetVolumeInformation for HDD serial number
GetComputerName (this of course is not unique, but our system was using the computer names to identify clients on a LAN, so it was good for me)
__cpuid (and specifically the PSN - processor serial number field)
GetAdaptersInfo for MAC addresses
I took these values and combined them in an arbitrary but deterministic way (read the update below!): adding, xoring, dividing and keeping the remainder, etc. Iterate over the values as if they were strings and be creative. In the end you will get a byte literal which you can transform into the ASCII range of letters and numbers to get a unique, "readable" code that doesn't look like noise.
Another approach can be simply concatenating these values and then "covering them up" by xoring something over them (and maybe transforming to letters again).
I'm saying it's unique because at least one of the inputs is supposed to be unique (the MAC address). Of course you need some understanding of number theory not to blow away this uniqueness, but it should be good enough anyway.
Important update: Since this post I have learned a few things about cryptography, and I am of the opinion that making up an arbitrary combination (essentially your own hash) is almost certainly a bad idea. Hash functions used in practice are constructed to be well-behaved (as in a low probability of collisions) and hard to break (it is hard to construct a value that has the same hash value as another). Constructing such a function is a very hard computer-science problem, and unless you are qualified you shouldn't attempt it. The correct approach is to concatenate whatever information you can collect about the hardware (i.e. the items I listed above) and use a cryptographic hash or digital signature to get a verifiable and secure output. Do not implement the cryptographic algorithms yourself either; there are lots of vulnerability pitfalls that take a lot of knowledge to avoid. Use a well-known and trusted library for the implementation of the algorithms.
If you're using Visual Studio, Microsoft provides the __cpuid intrinsic in the <intrin.h> header. There is an example on the linked MSDN page.
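A minimal sketch of calling it (my own example, Windows/MSVC only; note that the PSN field is disabled on most modern CPUs, so CPUID data is not a reliable unique ID on its own):
#include <intrin.h>
#include <cstdio>
#include <cstring>

// Sketch: read the CPU vendor string and the processor signature via __cpuid.
int main()
{
    int regs[4] = {0};

    __cpuid(regs, 0);                      // leaf 0: vendor string in EBX, EDX, ECX
    char vendor[13] = {0};
    std::memcpy(vendor,     &regs[1], 4);  // EBX
    std::memcpy(vendor + 4, &regs[3], 4);  // EDX
    std::memcpy(vendor + 8, &regs[2], 4);  // ECX
    std::printf("vendor:    %s\n", vendor);

    __cpuid(regs, 1);                      // leaf 1: signature (family/model/stepping) in EAX
    std::printf("signature: %08x\n", regs[0]);
}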
Hm...
There are special libraries that generate a unique ID based on the hardware installed (so for a given computer this ID will always be the same). Most of them take the motherboard ID + HDD ID + CPU ID and mix these values.
Why reinvent the wheel? Why not use one of these libraries? Any serious reason?
You can use the command line:
wmic cpu list full
wmic baseboard list full
Or WMI interface
#include <wmi.hpp>
#include <wmiexception.hpp>
#include <wmiresult.hpp>
#include <../src/wmi.cpp>
#include <../src/wmiresult.cpp> // used
#pragma comment(lib, "wbemuuid.lib")
struct Win32_WmiCpu
{
    void setProperties(const WmiResult& result, std::size_t index)
    {
        // Example: extracting a property into the class
        result.extract(index, "ProcessorId", (*this).m_cpuID);
    }

    static std::string getWmiClassName()
    {
        return "Win32_Processor";
    }

    std::string m_cpuID;
    // All the other properties you wish to read from WMI
}; // end struct Win32_WmiCpu

struct Win32_WmiMotherBoard
{
    void setProperties(const WmiResult& result, std::size_t index)
    {
        // Example: extracting a property into the class
        result.extract(index, "SerialNumber", (*this).m_mBId);
    }

    static std::string getWmiClassName()
    {
        return "Win32_BaseBoard";
    }

    std::string m_mBId;
}; // end struct Win32_WmiMotherBoard
// "ret" below is the caller's own result structure with m_cpu / m_mb char buffers.
try
{
    const Win32_WmiCpu cpu = Wmi::retrieveWmi<Win32_WmiCpu>();
    strncpy_s(ret.m_cpu, cpu.m_cpuID.c_str(), _TRUNCATE);
}
catch (const Wmi::WmiException&)
{
}

try
{
    const Win32_WmiMotherBoard mb = Wmi::retrieveWmi<Win32_WmiMotherBoard>();
    strncpy_s(ret.m_mb, mb.m_mBId.c_str(), _TRUNCATE);
}
catch (const Wmi::WmiException&)
{
}

How much is 32 kB of compiled code

I am planning to use an Arduino programmable board. Those have quite limited flash memories, ranging between 16 and 128 kB, to store compiled C or C++ code.
Are there ways to estimate how much (standard) code that will represent?
I suppose this is very vague, but I'm only looking for an order of magnitude.
The output of the size command is a good starting place, but does not give you all of the information you need.
$ avr-size program.elf
text data bss dec hex filename
The size of your image is usually a little bit more than the sum of the text and data sections. The bss section takes essentially no space in the image because it is all 0s (it only has to be zeroed out in RAM at startup). There may be other relevant sections which aren't listed by size.
If your build system is set up like the ones I've used before for AVR microcontrollers, then you will end up with an *.elf file as well as a *.bin file, and possibly a *.hex file. The *.bin file is the actual image that would be stored in the program flash of the processor, so you can examine its size to determine how your program grows as you edit it. The *.bin file is extracted from the *.elf file with the objcopy command and some flags which I can't remember right now.
If you want to guesstimate how much code your C or C++ will produce when compiled, that is a lot more difficult. I have observed a 10x blowup in a function when I tried to use a uint64_t rather than a uint32_t when all I was doing was incrementing it (this was about 5 times more code than I thought it would be). This was mostly due to gcc's AVR optimizations not being the best, but smaller code-size surprises can creep in from seemingly innocent code.
This will likely be amplified with the use of C++, which tends to hide more things that turn into code than C does. Chief among the things C++ hides are destructor calls and lots of pointer dereferencing which has to do with the this pointer in objects as well as a secret pointer many objects have to their virtual function table and class static variables.
On AVR all of this pointer stuff is likely to really add up because pointers are twice as big as registers and take multiple instructions to load. Also AVR has only a few register pairs that can be used as pointers, which results in lots of moving things into and out of those registers.
Some tips for small programs on AVR:
Use uint8_t and int8_t instead of int whenever you can. You could also use uint_fast8_t and int_fast8_t if you want your code to be portable. This can lead to many operations taking up only half as much code, because int is two bytes on AVR (see the sketch after these tips).
Be very aware of things like string and struct constants and literals and how/where they are stored.
If you're not scared of it, read the AVR assembly manual. You can get an idea of the types of instructions, and from that the type of C code that easily maps to those instructions. Use that kind of C code.
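As a tiny sketch of the first tip (my own illustration; toggle_led is a hypothetical helper, not from the original post), an 8-bit loop counter avoids 16-bit int arithmetic in the loop body:
#include <stdint.h>

// Sketch: on AVR, an 8-bit counter can noticeably shrink the generated code
// compared with a plain int (16-bit) counter for simple loops like this.
void blink_n_times(uint8_t n)
{
    for (uint8_t i = 0; i < n; ++i)   // uint8_t instead of int
    {
        // toggle_led();              // hypothetical helper
    }
}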
You can't really tell from the source alone. The length of the uncompiled code has little to do with the size of the compiled code. For example:
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <iterator>   // for std::ostream_iterator

int main()
{
    std::vector<std::string> strings;
    strings.push_back("Hello");
    strings.push_back("World");
    std::sort(strings.begin(), strings.end());
    std::copy(strings.begin(), strings.end(),
              std::ostream_iterator<std::string>(std::cout, ""));
}
vs
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>

int main()
{
    std::vector<std::string> strings;
    strings.push_back("Hello");
    strings.push_back("World");
    for (int idx = 0; idx < strings.size(); idx++)
        std::cout << strings[idx];
}
Both are essentially the same length and produce the same output, but the first example involves an instantiation of std::sort, which is probably an order of magnitude more code than all the rest of the code here.
If you absolutely need to count number of bytes used in the program, use assembler.
Download the Arduino IDE and 'Verify' (compile) some of your existing code, or look at the sample sketches. It will tell you how many bytes that code is, which will give you an idea of how much more you can fit into a given device. Picking a couple of the examples at random, the web server example is 5816 bytes, and the LCD hello world is 2616. Both use external libraries.
Try creating a simplified version of your app, focusing on the most valuable feature first, then start adding up the 'nice (and cool) stuff to have'. Keep an eye on the byte usage shown in the Arduino IDE when you verify your code.
As a rough indication, my first app (an LED flasher controlled by a push button) requires 1092 bytes. That's roughly 1 kB out of 32 kB. Pretty small footprint for C++ code!
What worries me most is the limited amount of RAM (1 kB). If the CPU stack takes some of it, then there isn't much left for creating data structures.
I've only had my Arduino for 48 hours, so there is still a lot to learn to use it effectively ;-) But it's a lot of fun to use :).
That's quite a bit for a reasonably complex piece of software, but you will start bumping into the limit if you want a lot of different functionality. Also, if you want to store a lot of static strings and data, it can eat into that quite quickly. But 32 kB is a decent amount for embedded applications. It tends to be RAM that you run into problems with first!
Also, quite often the C++ compilers for embedded systems are a lot worse than the C compilers.
That is, they are nowhere near as good as the C++ compilers for the common desktop OSes (in terms of producing efficient machine code for the target platform).
On a Linux system you can do some experiments with statically compiled example programs. E.g.:
$ size `which busybox `
text data bss dec hex filename
1830468 4448 25650 1860566 1c63d6 /bin/busybox
The sizes are given in bytes. This output is independent of the executable file format, since it reports the sizes of the different sections inside that format. The text section contains the machine code and const stuff. The data section contains data for static initialization of variables. The bss size is the size of uninitialized data (of course, uninitialized data does not need to be stored in the executable file).
Well, busybox contains a lot of functionality (like all the common shell commands, a shell, etc.).
If you link your own examples with gcc -static, keep in mind that the libc you use may dramatically increase the program size, and that using an embedded libc may be much more space-efficient.
To test that, you can check out diet libc or uClibc and link against one of them. Actually, busybox is usually linked against uClibc.
Note that the sizes you get this way only give you an order of magnitude. For example, your workstation probably uses a different CPU architecture than the Arduino board, and the machine code of different architectures may differ, more or less, in size (because of operand sizes, available instructions, opcode encoding and so on).
To continue the rough order-of-magnitude reasoning: busybox contains roughly 309 tools (including an FTP daemon and similar), i.e. the average code size of a busybox tool is roughly 6 kB.

Profiling memory usage in Mathematica

Is there any way to profile MathKernel memory usage (down to individual variables) other than paying $$$ for the Eclipse plugin (Mathematica Workbench, IIRC)?
Right now I finish execution of a program that takes multiple GBs of RAM, but the only things that should be stored are ~50 MB of data at most, yet MathKernel.exe tends to hold onto ~1.5 GB (basically, as much as Windows will give it). Is there any better way around this, other than saving the data I need and quitting the kernel every time?
EDIT: I've just learned of the ByteCount function (which shows some disturbing results for basic data types, but that's beside the point), but even the sum over all my variables is nowhere near the amount taken by MathKernel. What gives?
One thing a lot of users don't realize is that it takes memory to store all your inputs and outputs in the In and Out symbols, regardless of whether or not you assign an output to a variable. Out is also aliased as %, where % is the previous output, %% is the second-to-last, etc. %123 is equivalent to Out[123].
If you don't have a habit of using %, or only use it to a few levels deep, set $HistoryLength to 0 or a small positive integer, to keep only the last few (or no) outputs around in Out.
You might also want to look at the functions MaxMemoryUsed and MemoryInUse.
Of course, the $HistoryLength issue may or may not be your problem, but you haven't shared what your actual evaluation is.
If you're able to post it, perhaps someone will be able to shed more light on why it's so memory-intensive.
Here is my solution for profiling memory usage:
myByteCount[symbolName_String] :=
  Replace[ToHeldExpression[symbolName],
   Hold[x__] :>
    If[MemberQ[Attributes[x], Protected | ReadProtected],
     Sequence @@ {}, {ByteCount[
       Through[{OwnValues, DownValues, UpValues, SubValues,
          DefaultValues, FormatValues, NValues}[Unevaluated@x,
         Sort -> False]]], symbolName}]];

With[{listing = myByteCount /@ Names[]},
 Labeled[Grid[Reverse@Take[Sort[listing], -100], Frame -> True,
   Alignment -> Left],
  Column[{Style[
     "ByteCount for symbols without attributes Protected and \
ReadProtected in all contexts", 16, FontFamily -> "Times"],
    Style[Row@{"Total: ", Total[listing[[All, 1]]], " bytes for ",
      Length[listing], " symbols"}, Bold]}, Center, 1.5], Top]]
Evaluating the above gives a table of the 100 largest symbols by ByteCount.
Michael Pilat's answer is a good one, and MemoryInUse and MaxMemoryUsed are probably the best tools you have. ByteCount is rarely all that helpful because what it measures can be a huge overestimate because it ignores shared subexpressions, and it often ignores memory that isn't directly accessible through Mathematica functions, which is often a major component of memory usage.
One thing you can do in some circumstances is use the Share function, which forces subexpressions to be shared when possible. In some circumstances this can save you tens or even hundreds of megabytes. You can tell how well it's working by comparing MemoryInUse before and after you use Share.
Also, some innocuous-seeming things can cause Mathematica to use a whole lot more memory than you expect. Contiguous arrays of machine reals (and only machine reals) can be allocated as so-called "packed" arrays, much the way they would be allocated by C or Fortran. However, if you have a mix of machine reals and other structures (including symbols) in an array, everything has to be "boxed", and the array becomes an array of pointers, which can add a lot of overhead.
One approach is to automate restarting the kernel when it runs out of memory. You can execute your memory-consuming code in a slave kernel, while the master kernel only takes the result of the computation and controls memory usage.