Why is deallocation slow? - c++

I have a question for which I am unable to find answer on the net...
I have a set declared like this:
std::set<unsigned int> MySet;
I am inserting a million random numbers generated with a Mersenne Twister. Random generation and insertion are really fast (around a second for a million numbers), but deallocation is extremely slow (a minute and a half).
Why is deallocation so slow? I am not using any custom destructors for the set.

Compile your code in release mode.
This does two things.
It turns on the optimizations which definitely help.
Also the memory management libraries are different for debug and release.
The debug version of the library is built to allow for debugging and maintains extra information (like marking de-allocated memory). All this extra processing has a real cost.
The objectives of the two versions of the library are completely different: the release version is optimized for speed, the debug version for recovery and debugging.
Note this information is about DevStudio.

Probably because it makes more sense to optimize allocation at the expense of deallocation, because many applications allocate without deallocating, but never vice versa. I've seen a similar pattern myself, in an application that mixed calls to malloc and free (as opposed to allocating and deallocating all at once).
I've never written a heap allocator, so I don't know if there's a deeper technical reason than that. When deallocating, adjacent free blocks must be found and coalesced. So the work is just fundamentally different.
90 seconds for 1 million small free()'s sounds quite slow. I've never really programmed Windows so I can't say if that's abnormal, but the system should be able to do much better.
The solution to your problem may be simply to skip freeing the objects before the program exits. You might try deriving a custom allocator from std::allocator< unsigned int > which makes deallocate a no-op.
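To make that last suggestion concrete, here is a minimal sketch of an allocator whose deallocate() is a no-op (written as a standalone minimal allocator rather than one derived from std::allocator; the name LeakyAllocator and the details are illustrative). Elements erased during the run are simply leaked, and the OS reclaims everything at exit:

#include <cstdlib>
#include <random>
#include <set>

// Sketch: deallocate() does nothing, so the set's destructor never touches the heap.
template <typename T>
struct LeakyAllocator {
    using value_type = T;

    LeakyAllocator() = default;
    template <typename U> LeakyAllocator(const LeakyAllocator<U>&) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    void deallocate(T*, std::size_t) {
        // Intentionally empty: skip per-node frees entirely.
    }
};

template <typename T, typename U>
bool operator==(const LeakyAllocator<T>&, const LeakyAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const LeakyAllocator<T>&, const LeakyAllocator<U>&) { return false; }

int main() {
    std::set<unsigned int, std::less<unsigned int>, LeakyAllocator<unsigned int>> mySet;
    std::mt19937 rng(12345);
    for (int i = 0; i < 1000000; ++i)
        mySet.insert(static_cast<unsigned int>(rng()));
}   // destruction still walks the tree, but frees nothing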


How do I overload new/STL to make unknown objects faster?

In my question profiling: deque is 23% of my runtime I have a problem with 'new' being a large % of my runtime. The problems are:
I have to use the new keyword a lot, and on many different classes/structs (I have >200 of them, and it's by design). I use lots of STL objects, iterators and strings. I use strdup and other allocation (or free) functions.
I have one function that is called >2 million times. All it did was create STL iterators and it took up >20% of the time (though from what I remember STL is optimized pretty nicely and debug builds make it magnitudes slower).
But keeping in mind that I need to allocate and free these iterators >2m times, along with other functions that are called often: how do I optimize the new and malloc keyword/function? Especially for all these classes/structs, including the classes/structs I didn't write (STL and others).
In any case, profiling says I (and the STL?) use the new keyword more than anything else.
Look for opportunities to avoid the allocation/freeing, either by adding your own management layer to recycle memory and objects that have already been allocated, or modifying their allocators. There are plenty of articles on STL Allocators:
http://www.codeguru.com/cpp/cpp/cpp_mfc/stl/article.php/c4079
http://bmagic.sourceforge.net/memalloc.html
http://www.codeproject.com/KB/stl/blockallocator.aspx
I have seen large multimap code go from unusably slow to very fast simply by replacing the default allocator.
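As a rough idea of what such a replacement can look like, here is a hedged sketch of a free-list allocator that recycles node-sized blocks instead of calling malloc/free on every insert and erase (FreeListAllocator is an illustrative name, not one of the libraries linked above):

#include <cstdlib>
#include <map>

// Node-sized blocks are recycled through a per-container free list; blocks on
// the list are only released when the process exits.
template <typename T>
class FreeListAllocator {
    struct Link { Link* next; };
    Link* freeList_ = nullptr;
public:
    using value_type = T;

    FreeListAllocator() = default;
    FreeListAllocator(const FreeListAllocator&) {}               // copies do not share the list
    FreeListAllocator& operator=(const FreeListAllocator&) { return *this; }
    template <typename U> FreeListAllocator(const FreeListAllocator<U>&) {}

    T* allocate(std::size_t n) {
        if (n == 1 && freeList_) {                               // reuse a recycled block
            Link* link = freeList_;
            freeList_ = link->next;
            return reinterpret_cast<T*>(link);
        }
        std::size_t bytes = n * sizeof(T);
        if (bytes < sizeof(Link)) bytes = sizeof(Link);          // room for the link when recycled
        return static_cast<T*>(std::malloc(bytes));
    }
    void deallocate(T* p, std::size_t n) {
        if (n == 1) {                                            // push onto the free list
            Link* link = reinterpret_cast<Link*>(p);
            link->next = freeList_;
            freeList_ = link;
        } else {
            std::free(p);
        }
    }
};

template <typename T, typename U>
bool operator==(const FreeListAllocator<T>&, const FreeListAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const FreeListAllocator<T>&, const FreeListAllocator<U>&) { return false; }

// Usage: std::multimap<int, int, std::less<int>,
//                      FreeListAllocator<std::pair<const int, int>>> m;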
You can't make malloc faster. You might be able to make new faster, but I bet you can find ways not to call them so much.
One way to find excess calls to anything is to peruse the code looking for them, but that's slow and error-prone, and they're not always visible.
A simple and foolproof way to find them is to pause the program a few times and look at the stack.
Notice you don't really need to measure anything. If something takes a large fraction of the run time, that fraction is roughly the probability you will catch it in the act on each pause, and the goal is to find it.
You do get a rough measurement, but that's only a by-product of finding the problem.
Here's an example where this was done in a series of stages, resulting in large speedup factors.

Which memory allocation algorithm suits best for performance and time critical c++ applications?

I ask this question to determine which memory allocation algorithm gives better results for performance-critical applications, like game engines or embedded applications. The results really depend on the percentage of fragmented memory and the time-determinism of memory requests.
There are several algorithms in the textbooks (e.g. buddy memory allocation), but there are also others like TLSF. So, among the memory allocation algorithms available, which of them is fastest and causes the least fragmentation? BTW, garbage collectors should not be included.
Please also note that this question is not about profiling; it just aims to find the optimum algorithm for the given requirements.
It all depends on the application. Server applications which can clear out all memory relating to a particular request at defined moments will have a different memory access pattern than video games, for instance.
If there was one memory allocation algorithm that was always best for performance and fragmentation, wouldn't the people implementing malloc and new always choose that algorithm?
Nowadays, it's usually best to assume that the people who wrote your operating system and runtime libraries weren't brain dead; and unless you have some unusual memory access pattern don't try to beat them.
Instead, try to reduce the number of allocations (or reallocations) you make. For instance, I often use a std::vector, but if I know ahead of time how many elements it will have, I can reserve that all in one go. This is much more efficient than letting it grow "naturally" through several calls to push_back().
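For example (a trivial sketch, with the element count picked arbitrarily):

#include <vector>

// One up-front allocation instead of repeated grow-and-copy cycles.
int main() {
    std::vector<int> values;
    values.reserve(1000000);            // single allocation
    for (int i = 0; i < 1000000; ++i)
        values.push_back(i);            // no reallocations from here on
}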
Many people coming from languages where new just means "gimme an object" will allocate things for no good reason. If you don't have to put it on the heap, don't call new.
As for fragmentation: it still depends. Unfortunately I can't find the link now, but I remember a blog post from somebody at Microsoft who had worked on a C++ server application that suffered from memory fragmentation. The team solved the problem by allocating memory from two regions. Memory for all requests would come from region A until it was full (requests would free memory as normal). When region A was full, all memory would be allocated from region B. By the time region B was full, region A was completely empty again. This solved their fragmentation problem.
Will it solve yours? I have no idea. Are you working on a project which services several independent requests? Are you working on a game?
As for determinism: it still depends. What is your deadline? What happens when you miss the deadline (astronauts lost in space? the music being played back starts to sound like garbage?)? There are real time allocators, but remember: "real time" means "makes a promise about meeting a deadline," not necessarily "fast."
I did just come across a post describing various things Facebook has done to both speed up and reduce fragmentation in jemalloc. You may find that discussion interesting.
Barış:
Your question is very general, but here's my answer/guidance:
I don't know about game engines, but for embedded and real-time applications, the general goals of an allocation algorithm are:
1- Bounded execution time: You have to know in advance the worst case allocation time so you can plan your real time tasks accordingly.
2- Fast execution: Well, the faster the better, obviously
3- Always allocate: Especially for real-time, security critical applications, all requests must be satisfied. If you request some memory space and get a null pointer: trouble!
4- Reduce fragmentation: Although this depends on the algorithm used, generally, less fragmented allocations provide better performance, due to a number of reasons, including caching effects.
In most critical systems, you are not allowed to dynamically allocate any memory to begin with. You analyze your requirements and determine your maximum memory use and allocate a large chunk of memory as soon as your application starts. If you can't, then the application does not even start, if it does start, no new memory blocks are allocated during execution.
If speed is a concern, I'd recommend following a similar approach. You can implement a memory pool which manages your memory. The pool could initialize a "sufficient" block of memory at the start of your application and serve your memory requests from this block. If you require more memory, the pool can do another - probably large - allocation (in anticipation of more memory requests), and your application can start using this newly allocated memory. There are various memory pooling schemes around as well, and managing these pools is a whole topic of its own.
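A minimal sketch of such a pool, assuming the simplest possible scheme - one big block obtained up front and handed out by bumping an offset, with no individual frees (the class name and sizes are illustrative; a real pool would grow or fall back to another large allocation when the block runs out):

#include <cstddef>
#include <cstdlib>
#include <new>

class MemoryPool {
    unsigned char* base_;
    std::size_t    capacity_;
    std::size_t    offset_ = 0;
public:
    explicit MemoryPool(std::size_t capacity)
        : base_(static_cast<unsigned char*>(std::malloc(capacity))),
          capacity_(capacity) {
        if (!base_) throw std::bad_alloc();
    }
    ~MemoryPool() { std::free(base_); }

    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        std::size_t aligned = (offset_ + align - 1) & ~(align - 1);
        if (aligned + size > capacity_)
            return nullptr;             // a real pool would grab another large block here
        offset_ = aligned + size;
        return base_ + aligned;
    }
    void reset() { offset_ = 0; }       // "free" everything in O(1)
};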
As for some examples: VxWorks RTOS used to employ a first-fit allocation algorithm where the algorithm analyzed a linked list to find a big enough free block. In VxWorks 6, they're using a best-fit algorithm, where the free space is kept in a tree and allocations traverse the tree for a big enough free block. There's a white paper titled Memory Allocation in VxWorks 6.0, by Zoltan Laszlo, which you can find by Googling, that has more detail.
Going back to your question about speed/fragmentation: It really depends on your application. Things to consider are:
Are you going to make lots of very small allocations, or relatively larger ones?
Will the allocations come in bursts, or spread equally throughout the application?
What is the lifetime of the allocations?
If you're asking this question because you're going to implement your own allocator, you should probably design it in such a way that you can change the underlying allocation/deallocation algorithm, because if the speed/fragmentation is really that critical in your application, you're going to want to experiment with different allocators. If I were to recommend something without knowing any of your requirements, I'd start with TLSF, since it has good overall characteristics.
As others have already written, there is no "optimum algorithm" for every possible application. It has been proven that for any allocation algorithm you can find an allocation sequence that will cause fragmentation.
Below I write a few hints from my game development experience:
Avoid allocations if you can
A common practice in the game development field was (and to a certain extent still is) to solve dynamic memory allocation performance issues by avoiding memory allocation like the plague. It is quite often possible to use stack-based memory instead - even for dynamic arrays you can often come up with an estimate that will cover 99% of cases, and you only need to allocate when you go over that boundary. Another commonly used approach is "preallocation": estimate how much memory you will need in some function or for some object, create a kind of small and simplistic "local heap" you allocate up front, and perform the individual allocations from this heap only.
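As a rough illustration of the stack-first idea (SmallBuffer is an illustrative name; T is assumed default-constructible and copy-assignable, and the doubling growth policy is just one choice):

#include <cstddef>

// A fixed inline buffer covers the estimated common case; the heap is touched
// only when the element count exceeds the estimate.
template <typename T, std::size_t InlineCapacity>
class SmallBuffer {
    T           inline_[InlineCapacity];
    T*          data_ = inline_;
    std::size_t size_ = 0;
    std::size_t cap_  = InlineCapacity;
public:
    SmallBuffer() = default;
    SmallBuffer(const SmallBuffer&) = delete;
    SmallBuffer& operator=(const SmallBuffer&) = delete;
    ~SmallBuffer() { if (data_ != inline_) delete[] data_; }

    void push_back(const T& value) {
        if (size_ == cap_) grow();          // rare: estimate exceeded
        data_[size_++] = value;
    }
    T&          operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const              { return size_; }

private:
    void grow() {
        std::size_t newCap = cap_ * 2;
        T* p = new T[newCap];               // heap fallback
        for (std::size_t i = 0; i < size_; ++i)
            p[i] = data_[i];
        if (data_ != inline_) delete[] data_;
        data_ = p;
        cap_  = newCap;
    }
};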
Memory allocator libraries
Another option is to use one of the memory allocation libraries - they are usually created by experts in the field to fit special requirements, and if you have similar requirements, they may be a good fit.
Multithreading
There is one particular case in which you will find the "default" OS/CRT allocator performs badly, and that is multithreading. If you are targeting Windows, be aware that both the OS and CRT allocators provided by Microsoft (including the otherwise excellent Low Fragmentation Heap) are currently blocking. If you want to do significant threading, you need either to reduce allocation as much as possible, or to use one of the alternatives. See Can multithreading speed up memory allocation?
The best practice is to use whatever lets you get the thing done in time (in your case, the default allocator). If the whole thing is very complex, write tests and samples that emulate parts of it. Then run performance tests and benchmarks to find the bottlenecks (they will probably have nothing to do with memory allocation :).
From that point you will see what exactly slows your code down and why. Only with such precise knowledge can you ever optimize something and choose one algorithm over another. Without tests it's just a waste of time, since you can't even measure how much your optimization will speed up your app (in fact such "premature" optimizations can actually slow it down).
Memory allocation is a very complex thing and it really depends on many factors. For example, an allocator like the following is simple and damn fast, but can be used only in a limited number of situations:
char pool[MAX_MEMORY_REQUIRED_TO_RENDER_FRAME];  // one static per-frame arena
char *poolHead = pool;
// Bump allocation: no bookkeeping, no per-allocation bounds check.
void *alloc(size_t sz) { char *p = poolHead; poolHead += sz; return p; }
// "Free" everything at once by resetting the arena at the end of the frame.
void free() { poolHead = pool; }
So there is no "the best algorithm ever".
One constraint that's worth mentioning, which has not been mentioned yet, is multi-threading: Standard allocators must be implemented to support several threads, all allocating/deallocating concurrently, and passing objects from one thread to another so that it gets deallocated by a different thread.
As you may have guessed from that description, it is a tricky task to implement an allocator that handles all of this well, and it does cost performance: it is impossible to satisfy all these constraints without inter-thread communication (i.e. atomic variables and locks), which is quite costly.
As such, if you can avoid concurrency in your allocations, you stand a good chance to implement your own allocator that significantly outperforms the standard allocators: I once did this myself, and it saved me roughly 250 CPU cycles per allocation with a fairly simple allocator that's based on a number of fixed sized memory pools for small objects, stacking free objects with an intrusive linked list.
Of course, avoiding concurrency is likely a no-go for you, but if you don't use it anyway, exploiting that fact might be something worth thinking about.

How does a memory leak improve performance

I'm building a large RTree (spatial index) full of nodes. It needs to be able to handle many queries AND updates. Objects are continuously being created and destroyed. The basic test I'm running is to see the performance of the tree as the number of objects in the tree increases. I insert from 100 to 20000 uniformly sized, randomly located objects in increments of 100. Searching and updating are irrelevant to the issue I am currently faced with.
Now, when there is NO memory leak, the "insert into tree" performance is all over the place. It goes anywhere from 10.5 seconds with ~15000 objects to 1.5 with ~18000. There is no pattern whatsoever.
When I deliberately add in a leak, as simple as putting in "new int;" (not assigned to anything, just a line on its own), the performance instantly falls onto a nice gentle curve sloping from (roughly) 0 seconds for 100 objects to 1.5 for the full 20k.
Very, very lost at this point. If you want source code I can include it, but it's huuugggeeee and literally the only line that makes a difference is "new int;".
Thanks in advance!
-nick
I'm not sure how you came up with this new int test, but it's not a very good way to fix things :) Run your code under a profiler and find out where the real delays are. Then concentrate on fixing the hot spots.
g++ has it built in - just compile with -pg
Without more information it's impossible to be sure.
However, I wonder if this has to do with heap fragmentation. By creating and freeing many blocks of memory you'll likely be creating a whole load of small fragments of memory linked together. The memory manager needs to keep track of them all so it can allocate them again if needed.
Some memory managers, when you free a block, try to "merge" it with surrounding blocks of memory, and on a highly fragmented heap this can be very slow as it tries to find the surrounding blocks. Not only that, but if you have limited physical memory it can "touch" many physical pages of memory as it follows the chain of memory blocks, which can cause a whole load of extremely slow page faults that will be very variable in speed depending on exactly how much physical memory the OS decides to give that process.
By leaving some un-freed memory you will be changing this pattern of access which might make a large difference to the speed. You might for example be forcing the run time library to allocate new block of memory each time rather than having to track down a suitably sized existing block to reuse.
I have no evidence this is the case in your program, but I do know that memory fragmentation is often the cause of slow programs when a lot of new and free is performed.
A possible explanation (a theory):
The compiler did not remove the unused new int.
The new int is in one of the inner loops or somewhere in your recursive traversal where it gets executed a large number of times.
The overall RSS of the process increases, and eventually so does the total memory being used by the process.
Page faults start happening because of this.
Because of the page faults, the process becomes I/O bound instead of CPU bound.
End result: you see a drop in throughput. It would help if you could mention the compiler being used and the options you are using to build the code.
I am taking a stab in the dark here, but the problem could be the way the heap gets fragmented. You said that you are creating and destroying large numbers of objects. I will assume that the objects are all of different sizes.
When one allocates memory on the heap, a cell of the size needed is broken off from the heap. When the memory is freed, the cell is added to a freelist. When one does a new allocation, the allocator walks the heap until a cell that is big enough is found. When doing large numbers of allocations, the free list can get rather long, and walking the list can take a non-trivial amount of time.
Now, an int is rather small. So when you do your new int, it may well eat up all the small heap cells on the free list and thus dramatically speed up larger allocations.
The chances are, however, that you are allocating and freeing similar-sized objects. If you use your own freelists, you will save yourself many heap walks and may dramatically improve performance. This is exactly what the STL allocators do to improve performance.
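A hedged sketch of what "use your own freelists" can look like for a node class, using a class-specific operator new/delete (RTreeNode and its members are purely illustrative, not taken from the question's code; requires C++17 for the inline static member):

#include <cstddef>
#include <cstdlib>

struct RTreeNode {
    double     bounds[4];
    RTreeNode* children[8];

    static void* operator new(std::size_t size) {
        if (size == sizeof(RTreeNode) && freeList) {
            void* p = freeList;
            freeList = *static_cast<void**>(p);   // pop a recycled node
            return p;
        }
        return std::malloc(size);                 // sketch: a real version would check for null
    }
    static void operator delete(void* p, std::size_t size) {
        if (p && size == sizeof(RTreeNode)) {
            *static_cast<void**>(p) = freeList;   // push onto the free list
            freeList = p;
        } else {
            std::free(p);
        }
    }
private:
    static inline void* freeList = nullptr;       // blocks are kept for reuse, never returned
};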
Solution: Do not run from Visual Studio. Actually run the .exe file directly. Figured this out because that's what the profilers were doing and the numbers were magically dropping. Checked memory usage, and the version run this way (the one giving me EXCEPTIONAL times) was not blowing up to excessively huge sizes.
Solution to why the hell Visual Studio does ridiculous crap like this: No clue.

Will Garbage Collected C be Faster Than C++?

I had been wondering for quite some time how to manage memory in my next project, which is writing a DSL in C/C++.
It can be done in any of the three ways.
Reference counted C or C++.
Garbage collected C.
In C++, copying classes and structures from stack to stack and managing strings separately with some kind of GC.
The community probably already has a lot of experience on each of these methods. Which one will be faster? What are the pros and cons for each?
A related side question. Will malloc/free be slower than allocating a big chunk at the beginning of the program and running my own memory manager over it? .NET seems to do it. But I am confused why we can't count on OS to do this job better and faster than what we can do ourselves.
It all depends! That's a pretty open question. It needs an essay to answer it!
Hey.. here's one somebody prepared earlier:
http://lambda-the-ultimate.org/node/2552
http://www.hpl.hp.com/personal/Hans_Boehm/gc/issues.html
It depends how big your objects are, how many of them there are, how fast they're being allocated and discarded, and how much time you want to invest in optimizing and tweaking. If you know the limits of how much memory you need, then for fast performance I would think you can't really beat grabbing all the memory you need from the OS up front and then managing it yourself.
The reason it can be slow to allocate memory from the OS is that it deals with lots of processes and memory on disk and in RAM, so to get memory it's got to decide if there is enough. Possibly, it might have to page another process's memory out from RAM to disk so it can give you enough. There's lots going on. So managing it yourself (or with a GC-collected heap) can be far quicker than going to the OS for each request. Also, the OS usually deals with bigger chunks of memory, so it might round up the size of the requests you make, meaning you could waste memory.
Have you got a real hard requirement for going super quick? A lot of DSL applications don't need raw performance. I'd suggest going with whatever's simplest to code. You could spend a lifetime writing memory management systems and worrying about which is best.
Why would garbage collected C be faster than C++? The only garbage collectors available for C are pretty inefficient things, more designed to plug memory leaks than to actually improve the quality of your code.
In any case, C++ has the potential for reaching better performance with less code (note that it's only a potential. It's also very possible to write C++ code that is far slower than the equivalent C).
Considering the current state of both languages, GC's are not currently going to improve performance in your code. GC's can be made very efficient in languages designed for it. C/C++ are not among those. ;)
Apart from that, it's impossible to say. Languages don't have a speed. It doesn't make sense to ask which language is faster. It depends on 1) the specific code, 2) the compiler that compiles it, and 3) the system it's running on (hardware as well as OS).
malloc is a fairly slow operation, far slower than the .NET equivalents, so yes, if you are performing a lot of small allocations, you may be better off allocating a large pool of memory once, and then using chunks of that.
The reason is that the OS has to find a free chunk of memory, basically by following a linked list of all free memory areas. In .NET, a new() call is basically nothing more than moving the heap pointer as many bytes as required by the allocation.
uh ... It depends how you write the garbage collection system for your DSL. Neither C nor C++ comes with a garbage collection facility built in, but either could be used to write a very efficient or a very inefficient garbage collector. Writing such a thing, by the way, is a non-trivial task.
DSLs are often written in higher level languages such as Ruby or Python specifically because the language writer can leverage the garbage collection and other facilities of the language. C and C++ are great for writing full, industrial strength languages but you certainly need to know what you are doing to use them - knowledge of yacc and lex is especially useful here but a good understanding of dynamic memory management is important also, as you say. You could also check out keykit, an open source music DSL written in C, if you still like the idea of a DSL in C/C++.
With most garbage collection implementations, allocation can see a speed improvement, but then you have the additional cost of the collection phase which can be triggered at any point in your program's execution, leading to a sudden (seemingly random) delay.
As for your second question, it depends on your memory management algorithms. You'd be safe sticking with your library's default malloc implementation, but there are alternatives which boast better performance.
A related side question. Will malloc/free be slower than allocating a big chunk at the beginning of the program and running my own memory manager over it? .NET seems to do it. But I am confused why we can't count on the OS to do this job better and faster than what we can do ourselves.
The problem with letting the OS handle memory allocation is that it introduces indeterministic behaviour. There's no way for the programmer to know how long the OS will take to return a new chunk of memory - an allocation may be quite costly if memory has to be paged out to disk.
Preallocating therefore might be a good idea, especially when using a copying garbage collector. It'll increase memory consumption, but allocation will be fast because in most cases it'll just be a pointer increment.
As people have pointed out - GC is faster to allocate (because it just gives you the next block on its list), but slower overall (because it has to compact the heap regularly, in order for allocs to be fast).
so - go for the compromise solution (which is actually pretty damn good):
You create your own heaps, one for each size of object you generally allocate (or 4-byte, 8-byte, 16-byte, 32-byte, etc.). Then, when you want a new piece of memory, you grab the last 'block' on the appropriate heap. Because you pre-allocate from these heaps, all you need to do when allocating is grab the next free block. This works better than the standard allocator because you are happily wasting memory - if you want to allocate 12 bytes, you'll give up a whole 16-byte block from the 16-byte heap. You keep a bitmap of used vs. free blocks so you can allocate quickly without wasting loads of memory or needing to compact.
Also, because you're running several heaps, highly parallel systems work much better, as you don't need to lock as often (i.e. you have a separate lock per heap, so you don't get contention nearly as much).
Try it - we used it to replace the standard heap on a very intensive application, performance went up by quite a lot.
BTW, the reason the standard allocators are slow is that they try not to waste memory - so if you allocate 5 bytes, 7 bytes and 32 bytes from the standard heap, it'll keep those 'boundaries'. Next time you need to allocate, it'll walk through those looking for enough space to give you what you asked for. That worked well for low-memory systems, but you only have to look at how much memory most apps use today to see that GC systems go the other way, and try to make allocations as fast as possible while caring nothing for how much memory is wasted.
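For illustration, a hedged sketch of the per-size-class scheme described above, reduced to its core (SizeClassHeap is an illustrative name; a production version would carve whole slabs per class, keep the used/free bitmap mentioned above, and run one instance per thread to get the locking benefit):

#include <cstddef>
#include <cstdlib>

class SizeClassHeap {
    static constexpr std::size_t kClassSizes[5] = {8, 16, 32, 64, 128};
    void* freeLists_[5] = {nullptr, nullptr, nullptr, nullptr, nullptr};

    static int classFor(std::size_t size) {
        for (int i = 0; i < 5; ++i)
            if (size <= kClassSizes[i]) return i;
        return -1;                                   // too large: plain malloc
    }
public:
    void* allocate(std::size_t size) {
        int c = classFor(size);
        if (c < 0) return std::malloc(size);
        if (void* p = freeLists_[c]) {
            freeLists_[c] = *static_cast<void**>(p); // pop from this class's list
            return p;
        }
        return std::malloc(kClassSizes[c]);          // sketch: a real heap carves a slab here
    }
    void deallocate(void* p, std::size_t size) {
        int c = classFor(size);
        if (c < 0) { std::free(p); return; }
        *static_cast<void**>(p) = freeLists_[c];     // push back; never returned to the OS here
        freeLists_[c] = p;
    }
};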
The problem has a lot of variables, but if your application is written with garbage collection in mind, and if you exploit the special features of the Boehm collector, such as different allocation calls for blocks that don't contain pointers, then as a general rule your application
- Will have simpler interfaces
- Will run somewhat faster
- Will require from 1.2x to 2x the space
than a similar application using explicit memory management.
For documentation and evidence supporting these claims, you can see the information on Boehm's web site, and also Ben Zorn's several papers on the measured cost of conservative garbage collection.
Most importantly you'll save a ton of effort and won't have to worry about a significant class of memory-management bugs.
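For reference, the Boehm collector's C API is used from C or C++ roughly like this (a sketch; GC_MALLOC_ATOMIC is the "block that contains no pointers" call mentioned above, and build details such as linking against the gc library vary by platform):

#include <gc.h>     // Boehm-Demers-Weiser collector

int main() {
    GC_INIT();
    // Pointer-containing block: the collector scans it for references.
    int** table = static_cast<int**>(GC_MALLOC(100 * sizeof(int*)));
    for (int i = 0; i < 100; ++i)
        table[i] = static_cast<int*>(GC_MALLOC_ATOMIC(sizeof(int)));   // pointer-free payload
    // No explicit frees: unreachable blocks are reclaimed by the collector.
}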
The issue of C vs C++ is orthogonal, but GC will definitely be faster than reference counting, especially when there's no compiler support for reference counting.
Neither C nor C++ will give you garbage collection for free. What they will give you is memory allocation libraries (which provide malloc/free, etc.). There are many online resources on algorithms for writing garbage collection libraries.
Most non-GC languages allocate and de-allocate memory as needed and no longer needed. GC'd languages usually allocate large chunks of memory beforehand and only free the memory when idle, not in the middle of an intensive task, so I am going to say yes, if the GC kicks in at the correct time.
The D programming language is a garbage-collected language, ABI compatible with C and partly ABI compatible with C++. This page shows some benchmarks comparing string performance in C++ and D.
I suggest that if you have written a program where memory allocation and deallocation (explicitly or GC'ed) is the bottleneck, then you should re-think your architecture, design and implementation.
If you don't want to explicitly manage memory, don't use C/C++. There are plenty of languages with either reference counting or compiler-supported garbage collectors that will probably work much better for you.
C/C++ are designed in an environment where the programmer manages their own memory. Trying to retrofit GC or ref counting onto them may help some, but you'll find that you either have to compromise the performance of the GC (because it doesn't have any compiler hinting as to where pointers might be), or you'll find new and fascinating ways that you can screw up the reference counts or the GC or whatever.
I know it sounds like a good idea, but really, you should just grab a language more suited to the task.

Is it acceptable not to deallocate memory

I'm working on a project that is supposed to be used from the command line with the following syntax:
program-name input-file
The program is supposed to process the input, compute some stuff and spit out results on stdout.
My language of choice is C++ for several reasons I'm not willing to debate. The computation phase will be highly symbolic (think compiler) and will use pretty complex dynamically allocated data structures. In particular, it's not amenable to RAII style programming.
I'm wondering if it is acceptable to forget about freeing memory, given that I expect the entire computation to consume less than the available memory, and that the OS is free to reclaim all the memory in one step after the program finishes (assume the program terminates in seconds). What are your feelings about this?
As a backup plan, if my project ever needs to run as a server or interactively, I figured that I can always refit a garbage collector into the source code. Does anyone have experience using garbage collectors for C++? Do they work well?
It shouldn't cause any problems in the specific situation described the question.
However, it's not exactly normal. Static analysis tools will complain about it. Most importantly, it builds bad habits.
Sometimes not deallocating memory is the right thing to do.
I used to write compilers. After building the parse tree and traversing it to write the intermediate code, we would simply just exit. Deallocating the tree would have
added a bit of slowness to the compiler, which we wanted of course to be as fast as possible.
taken up code space
taken time to code and test the deallocators
violated the "no code executes better than 'no code'" dictum.
HTH! FWIW, this was "back in the day" when memory was non-virtual and minimal, the boxes were much slower, and the first two were non-trivial considerations.
My feeling would be something like "WTF!!!"
Look at it this way:
You chose a programming language that does not include a garbage collector; we are not allowed to ask why.
You are basically stating that you are too lazy to care about freeing the memory.
Well, WTF again. Laziness isn't a good reason for anything, least of all playing around with memory without freeing it.
Just free the memory - it's a bad practice not to, the scenario may change, and then there can be a million reasons you need that memory freed, while the only reason for not doing it is laziness. Don't pick up bad habits; get used to doing things right, and you'll tend to do them right in the future!!
Not deallocating memory should not be a problem, but it is a bad practice.
Joel Coehoorn is right:
It shouldn't cause any problems. However, it's not exactly normal. Static analysis tools will complain about it. Most importantly, it builds bad habits.
I'd also like to add that thinking about deallocation as you write the code is probably a lot easier than trying to retrofit it afterwards. So I would probably make it deallocate memory; you don't know how your program might be used in future.
If you want a really simple way to free memory, look at the "pools" concept that Apache uses.
Well, I think that it's not acceptable. You've already alluded to potential future problems yourself. Don't think they're necessarily easy to solve.
Things like “… given that I expect the entire computation to consume less …” are famous last words. Similarly, refitting code with some feature is one of those things everyone talks about and never does.
Not deallocating memory might sound good in the short run but can potentially create a huge load of problems in the long run. Personally, I just don't think that's worth it.
There are two strategies. Either you build the GC design in from the very beginning. It's more work, but it will pay off. For a lot of small objects it might pay to use a pool allocator and just keep track of the memory pool. That way, you can keep track of the memory consumption and simply avoid a lot of problems that similar code without an allocation pool would create.
Or you use smart pointers throughout the program from the beginning. I actually prefer this method even though it clutters the code. One solution is to rely heavily on templates, which takes out a lot of redundancy when referring to types.
Take a look at projects such as WebKit. Their computation phase resembles yours since they build parse trees for HTML. They use smart pointers throughout their program.
Finally: “It’s a question of style … Sloppy work tends to be habit-forming.”
– Silk in Castle of Wizardry by David Eddings.
will use pretty complex dynamically allocated data structures. In particular, it's not amenable to RAII style programming.
I'm almost sure that's an excuse for lazy programming. Why can't you use RAII? Is it because you don't want to keep track of your allocations - there's no pointer to them that you keep? If so, how do you expect to use the allocated memory? There's always a pointer to it that contains some data.
Is it because you don't know when it should be released? Leave the memory in RAII objects, each one referenced by something, and they'll all trickle down and free each other when the containing object gets freed. This is particularly important if you want to run it as a server one day: each iteration of the server effectively runs off a 'master' object that holds all the others, so you can just delete it and all the memory disappears. It also saves you from having to retrofit a GC.
Is it because all your memory is allocated and kept in-use all the time, and only freed at the end? If so see above.
If you really, really cannot come up with a design that does not leak memory, at least have the decency to use a private heap. Destroy that heap before you quit and you'll have a better design already, if a little 'hacky'.
There are instances where memory leaks are OK - static variables, globally initialised data, things like that. These aren't generally large, though.
Reference counting smart pointers like shared_ptr in boost and TR1 could also help you manage your memory in a simple manner.
The drawback is that you have to wrap every pointer to these objects.
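A minimal sketch of that style, shown here with the C++11 std::shared_ptr rather than the boost/TR1 spelling (Node is an illustrative type, not the asker's):

#include <memory>
#include <vector>

// Nodes own their children through shared_ptr, so the whole structure frees
// itself when the last owner disappears; there is no explicit delete anywhere.
struct Node {
    int value = 0;
    std::vector<std::shared_ptr<Node>> children;
};

int main() {
    auto root = std::make_shared<Node>();
    root->children.push_back(std::make_shared<Node>());
    root->children.push_back(std::make_shared<Node>());
}   // 'root' goes out of scope here and the entire tree is released

Note that cyclic references keep the reference counts from ever reaching zero; breaking cycles with weak_ptr (or avoiding them) is the usual price of this approach.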
I've done this before, only to find that, much later, I needed the program to be able to process several inputs without separate commands, or that the guts of the program were so useful that they needed to be turned into a library routine that could be called many times from within another program that was not expected to terminate. It was much harder to go back later and re-engineer the program than it would have been to make it leak-less from the start.
So, while it's technically safe as you've described the requirements, I advise against the practice since it's likely that your requirements may someday change.
If the run time of your program is very short, it should not be a problem. However, being too lazy to free what you allocate and losing track of what you allocate are two entirely different things. If you have simply lost track, it's time to ask yourself if you actually know what your code is doing to the computer.
If you are just in a hurry or lazy and the lifetime of your program is small in relation to what it actually allocates (i.e. allocating 10 MB per second is not small if it runs for 30 seconds), then you should be OK.
The only 'noble' argument regarding freeing allocated memory comes in when a program exits: should one free everything to keep valgrind from complaining about leaks, or just let the OS do it? That depends entirely on the OS and on whether your code might become a library rather than a short-running executable.
Leaks during run time are generally bad, unless you know your program will run for a short amount of time and will not cause other programs (far more important than yours as far as the OS is concerned) to skid into dirty paging.
What are your feelings about this?
Some OSes might not reclaim the memory, but I guess you're not intending to run on those OSes.
As a backup plan, if my project ever needs to run as a server or interactively, I figured that I can always refit a garbage collector into the source code.
Instead, I figure you can spawn a child process to do the dirty work, grab the output from the child process, let the child process die as soon as possible after that and then expect the O/S to do the garbage collection.
I have not personally used this, but since you are starting from scratch you may wish to consider the Boehm-Demers-Weiser conservative garbage collector
The answer really depends on how large your program will be and what performance characteristics it needs to exhibit. If you never deallocate memory, your process's memory footprint will be much larger than it would otherwise be. Depending on the system, this could cause a lot of paging and slow down the performance for you or for other applications on the system.
Beyond that, what everyone above says is correct. It probably won't cause harm in the short term, but it's a bad practice that you should avoid. You'll never be able to reuse the code. Trying to retrofit a GC afterwards will be a nightmare - just think about going to every place you allocate memory and trying to retrofit it without breaking anything.
One more reason to avoid doing this: reputation. If you fail to deallocate, everyone who maintains the code will curse your name and your rep in the company will take a hit. "Can you believe how dumb he was? Look at this code."
If it is non-trivial for you to determine where to deallocate the memory, I would be concerned that other aspects of the data structure manipulation may not be fully understood either.
Apart from the fact that the OS (kernel and/or C/C++ runtime) may choose not to free the memory when execution ends, your application should always free allocated memory properly as a matter of good practice. Why? Suppose you decide to extend that application or reuse the code; you'll quickly get in trouble if the code you wrote earlier hogs memory unnecessarily after finishing its job. It's a recipe for memory leaks.
In general, I agree it's a bad practice.
For a one-shot program it can be OK, but it kind of looks like you don't know what you are doing.
There is one solution to your problem though: use a custom allocator which preallocates larger blocks from malloc, and then, after the computation phase, instead of freeing all the little blocks from your custom allocator, just release the larger preallocated blocks of memory. Then you don't need to keep track of all the objects you need to deallocate and when. One guy who wrote a compiler explained this approach to me many years ago, so if it worked for him, it will probably work for you as well.
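A hedged sketch of that approach - small allocations carved out of large chunks, with only the chunks freed at the end (Arena is an illustrative name; the chunk size and alignment are arbitrary choices, and malloc failures are not handled here):

#include <cstddef>
#include <cstdlib>
#include <vector>

class Arena {
    std::vector<void*> chunks_;
    unsigned char*     cur_  = nullptr;
    std::size_t        left_ = 0;
    static constexpr std::size_t kChunkSize = 1 << 20;     // 1 MiB per chunk
public:
    void* allocate(std::size_t size) {
        size = (size + 15) & ~std::size_t(15);              // keep 16-byte alignment
        if (size > left_) {
            std::size_t chunk = size > kChunkSize ? size : kChunkSize;
            cur_ = static_cast<unsigned char*>(std::malloc(chunk));
            chunks_.push_back(cur_);
            left_ = chunk;
        }
        void* p = cur_;
        cur_  += size;
        left_ -= size;
        return p;                                           // no per-object free exists
    }
    ~Arena() {                                              // the "release the larger blocks" step
        for (void* chunk : chunks_)
            std::free(chunk);
    }
};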
Try to use automatic variables in methods so that they will be freed automatically from the stack.
The only useful reason not to free heap memory is to save the tiny amount of computational power used in free(). You might lose any advantage if page faults become an issue due to large virtual memory needs with small physical memory resources. Some factors to consider are:
If you are allocating a few huge chunks of memory or many small chunks.
Is the memory going to need to be locked into physical memory.
Are you absolutely positive the code and memory needed will fit into 2GB, for a Win32 system, including memory holes and padding.
That's generally a bad idea. You might encounter cases where the program tries to consume more memory than is available. Plus you risk being unable to run several copies of the program.
You can still do this if you don't care of the mentioned issues.
When you exit from a program, the memory allocated is automatically returned to the system, so you may get away with not deallocating the memory you allocated.
But deallocation becomes necessary when you move to bigger programs such as an OS or embedded systems, where the program is meant to run forever, and hence even a small memory leak can be disastrous.
Hence it is always recommended to deallocate the memory you have allocated.