Allocate large blocks of contiguous memory - do or don't?

Allocate large blocks of contiguous memory - do or don't? - c++

I have always been convinced that it is not a good practice to allocate large blocks of contiguous memory. It is clear that you are likely to run into trouble if memory fragmentation comes into play, which in most cases cannot be excluded for sure (especially in large projects designed as services or the like).
Recently I came accross the ITK image processing library and realized, that they (virtually) always allocate image data (even 3D - which might be huge) as one contiguous block. I was told that this should not be a problem, at least for 64 bit processes. However, I don't see a systematic difference between 64 bit and 32 bit processes besides the fact that memory problems might occurr delayed due to the larger virtual address space.
To come to the point: I wonder what is good practice when dealing with large amounts of data: Simply allocate it as one big block, or better split it up into smaller pieces for allocation?
As the question is of course rather system specific I would like to restrict it to native (unmanaged, no CLR) C++ especially under windows. However, I would be also interested in any more general comments - if possible.

The question almost seems nonsensical... let me rephrase it to illustrate:
If you need a large block of memory and are worried about fragmentation, should you just fragment it yourself?
You don't gain anything by fragmenting it yourself rather than letting the system memory manager fragment it for you. The system is extremely good at this, and you are not likely to do it better.
That being said, if all things being equal you can do the same task but broken into sensible fragments, it may be worth profiling to see if you can gain anything. But in general, you won't gain anything in a reasonable sense -- you won't be able to outperform the OS.

Related

Memory defragmentation/heap compaction - commonplace in managed languages, but not in C++. Why?

I've been reading up a little on zero-pause garbage collectors for managed languages. From what I understand, one of the most difficult things to do without stop-the-world pauses is heap compaction. Only very few collectors (eg Azul C4, ZGC) seem to be doing, or at least approaching, this.
So, most GCs introduce dreaded stop-the-world pauses the compact the heap (bad!). Not doing this seems extremely difficult, and does come with a performance/throughput penalty. So either way, this step seems rather problematic.
And yet - as far as I know, most if not all GCs still do compact the heap occasionally. I've yet to see a modern GC that doesn't do this by default. Which leads me to believe: It has to be really, really important. If it wasn't, surely, the tradeoff wouldn't be worth it.
At the same time, I have never seen anyone do memory defragmentation in C++. I'm sure some people somewhere do, but - correct me if I am wrong - it does not at all seem to be a common concern.
I could of course imagine static memory somewhat lessens this, but surely, most codebases would do a fair amount of dynamic allocations?!
So I'm curious, why is that?
Are my assumptions (very important in managed languages; rarely done in C++) even correct? If yes, is there any explanation I'm missing?

Garbage collection can compact the heap because it knows where all of the pointers are. After all, it just finished tracing them. That means that it can move objects around and adjust the pointers (references) to the new location.
However, C++ cannot do that, because it doesn't know where all the pointers are. If the memory allocation library moved things around, there could be dangling pointers to the old locations.
Oh, and for long running processes, C++ can indeed suffer from memory fragmentation. This was more of a problem on 32-bit systems because it could fail to allocate memory from the OS, because it might have used up all of the available 1 MB memory blocks. In 64-bit it is almost impossible to create so many memory mappings that there is nowhere to put a new one. However, if you ended up with a 16 byte memory allocation in each 4K memory page, that's a lot of wasted space.
C and C++ applications solve that by using storage pools. For a web server, for example, it would start a pool with a new request. At the end of that web request, everything in the pool gets destroyed. The pool makes a nice, constant sized block of RAM that gets reused over and over without fragmentation.
Garbage collection tends to use recycling pools as well, because it avoids the strain of running a big GC trace and reclaim at the end of a connection.
One method some old operating systems like Apple OS 9 used before virtual memory was a thing is handles. Instead of a memory pointer, allocation returned a handle. That handle was a pointer to the real object in memory. When the operating system needed to compact memory or swap it to disk it would change the handle.
I have actually implemented a similar system in C++ using an array of handles into a shared memory map psuedo-database. When the map was compacted then the handle table was scanned for affected entries and updated.

Generic memory compaction is not generally useful nor desirable because of its costs.
What may be desirable is to have no wasted/fragmented memory and that can be achieved by other methods than memory compaction.
In C++ one can come up with a different allocation approach for objects that do cause fragmentation in their specific application, e.g. double-pointers or double-indexes to allow for object relocation; object pools or arenas that prevent or minimize fragmentation. Such solutions for specific object types is superior to generic garbage collection because they employ application/business specific knowledge which allows to minimize the scope/cost of object storage maintenance and also happen at most appropriate times.
A research found that garbage collected languages require 5 times more memory to achieve performance of non-GC equivalent programs. Memory fragmentation is more severe in GC languages.

How to allocate a large dynamic array in C++?

So I am currently trying to allocate dynamically a large array of elements in C++ (using "new"). Obviously, when "large" becomes too large (>4GB), my program crashes with a "bad_alloc" exception because it can't find such a large chunk of memory available.
I could allocate each element of my array separately and then store the pointers to these elements in a separate array. However, time is critical in my application so I would like to avoid as much cache misses as I can. I could also group some of these elements into blocks but what would be the best size for such a block?
My question is then: what is the best way (timewise) to allocate dynamically a large array of elements such that elements do not have to be stored contiguously but they must be accessible by index (using [])? This array is never going to be resized, no elements is going to be inserted or deleted of it.
I thought I could use std::deque for this purpose, knowing that the elements of an std::deque might or might not be stored contiguously in memory but I read there are concerns about the extra memory this container takes?
Thank you for your help on this!

If your problem is such that you actually run out of memory allocating fairly small blocks (as is done by deque) is not going to help, the overhead of tracking the allocations will only make the situation worse. You need to re-think your implementation such that you can deal with it in blocks that will still fit in memory. For such problems, if using x86 or x64 based hardware I would suggest blocks of at least 2 megabytes (the large page size).

Obviously, when "large" becomes too large (>4GB), my program crashes
with a "bad_alloc" exception because it can't find such a large chunk
of memory available.
You should be using 64-bit CPU and OS at this point, allocating huge contiguous chunk of memory should not be a problem, unless you are actually running out of memory. It is possible that you are building 32-bit program. In this case you won't be able to allocate more than 4 GB. You should build 64-bit application.
If you want something better than plain operator new, then your question is OS-specific. Look at API provided by your OS: on POSIX system you should look for mmap and for VirtualAlloc on Windows.
There are multiple problems with large allocations:
For security reasons OS kernel never gives you pages filled with garbage values, instead all new memory will be zero initialized. This means you don't have to initialize that memory as long as zeroes are exactly what you want.
OS gives you real memory lazily on first access. If you are processing large array, you might waste a lot of time taking page faults. To avoid this you can use MAP_POPULATE on Linux. On Windows you can try PrefetchVirtualMemory (but I am not sure if it can do the job). This should make init allocation slower, but should decrease total time spent in kernel.
Working with large chunks of memory wastes slots in Translation Lookaside Buffer (TLB). Depending on you memory access pattern, this can cause noticeable slowdown. To avoid this you can try using large pages (mmap with MAP_HUGETLB, MAP_HUGE_2MB, MAP_HUGE_1GB on Linux, VirtualAlloc and MEM_LARGE_PAGES). Using large pages is not easy, as they are usually not available by default. They also cannot be swapped out (always "locked in memory"), so using them requires privileges.
If you don't want to use OS-specific functions, the best you can find in C++ is std::calloc. Unlike std::malloc or operator new it returns zero initialized memory so you can probably avoid wasting time initializing that memory. Other than that, there is nothing special about that function. But this is the closest you can get while staying withing standard C++.
There are no standard containers designed to handle large allocations, moreover, all standard container are really really bad at handling those situations.
Some OSes (like Linux) overcommit memory, others (like Windows) do not. Windows might refuse to give you memory if it knows it won't be able to satisfy your request later. To avoid this you might want to increase your page file. Windows needs to reserve that space on disk beforehand, but it does not mean it will use it (start swapping). As actual memory is given to programs lazily, there are might be a lot of memory reserved for applications that will never be actually given to them.
If increasing page file is too inconvenient, you can try creating large file and map it into memory. That file will serve as a "page file" for your memory. See CreateFileMapping and MapViewOfFile.

The answer to this question is extremely application, and platform, dependent. These days if you just need a small integer factor greater than 4GB, you use a 64-bit machine, if possible. Sometimes reducing the size of the element in the array is possible as well. (E.g. using 16-bit fixed-point of half-float instead of 32-bit float.)
Beyond this, you are either looking at sparse arrays or out-of-core techniques. Sparse arrays are used when you are not actually storing elements at all locations in the array. There are many possible implementations and which is best depends on both the distribution of the data and the access pattern of the algorithm. See Eigen for example.
Out-of-core involves explicitly reading and writing parts of the array to/from disk. This used to be fairly common, but people work pretty hard to avoid doing this now. Applications that really require such are often built on top of a database or similar to handle the data management. In scientific computing, one ends up needing to distribute the compute as well as the data storage so there's a lot of complexity around that as well. For important problems the entire design may be driven by having good locality of reference.
Any sparse data structure will have overhead in how much space it takes. This can be fairly low, but it means you have to be careful if you actually have a dense array and are simply looking to avoid memory fragmentation.
If your problem can be broken into smaller pieces that only access part of the array at a time and the main issue is memory fragmentation making it hard to allocate one large block, then breaking the array in to pieces, effectively adding an outer vector of pointers, is a good bet. If you have random access to an array larger than 4 gigabytes and no way to localize the accesses, 64-bit is the way to go.

Depending on what you need the memory for and your speed concerns, and if you're using Linux, you can always try using mmap and simulate a sort of swap. It might be slower, but you can map very large sizes. See Mmap() an entire large file

C++ Memory Counting in OpenCV

I have an application written in OpenCV. It consists of two threads that each perform an OpenCV function. How can i determine how much memory each thread is generating?
I'm using libdispatch, Grand Central Dispatch design pattern. It is at a stage where i can have multiple tasks running at once. How can i manage memory in such a situation? With some opencv processes and enough concurrent tasks, i can easily hit my RAM ceiling. How to manage this?
What strategies can be employed in C++?
If each thread had a memory limit, how could this be handled?
Regards,
Daniel

I'm not familiar with the dispatching library/pattern you're using, but I've had a quick glance over what it aims to do. I've done a fair amount of work in the image processing/video processing domain, so hopefully my answer isn't a completely useless wall-of-text ;)
My suspicion is that you're firing off whole image buffers to different threads to run the same processing on them. If this is the case, then you're quickly going to hit RAM limits. If a task (thread) uses N image buffers in its internal functions, and your RAM is M, then you may start running out of legs at M / N tasks (threads). If this is the case, then you may need to resort to firing off chunks of images to the threads instead (see the hints further down on using dependency graphs for processing).
You should also consider the possibility that performance in your particular algorithm is memory bound and not CPU bound. So it may be pointless firing off more threads even though you have extra cores, and perhaps in this case you're better off focusing on CPU SIMD things like SSE/MMX.
Profile first, Ask (Memory Allocator) Questions Later
Using hand-rolled memory allocators that cater for concurrent environments and your specific memory requirements can make a big difference to performance. However, they're unlikely to reduce the amount of memory you use unless you're working with many small objects, where you may be able to do a better job with memory layout when allocating and reclaiming them than the default malloc/free implementations. As you're working with image processing algorithms, the latter is unlikely. You've typically got huge image buffers allocated on the heap as opposed to many small-ish structs.
I'll add a few tips on where to begin reading on rolling your own allocators at the end of my answer, but in general my advice would be to first profile and figure out where the memory is being used. Having written the code you may have a good hunch about where it's going already, but if not tools like valgrind's massif (complicated beast) can be a big help.
After having profiled the code, figure out how you can reduce the memory use. There are many, many things you can do here, depending on what's using the memory. For example:
Free up any memory you don't need as soon as you're done with it. RAII can come in handy here.
Don't copy memory unless you need to.
Share memory between threads and processes where appropriate. It will make it more difficult than working with immutable/copied data, because you'll have to synchronise read/write access, but depending on your problem case it may make a big difference.
If you're using memory caches, and you don't want to cache the data to disk due to performance reasons, then consider using in-memory compression (e.g. zipping some of the cache) when it's falling to the bottom of your least-recently-used cache, for example.
Instead of loading a whole dataset and having each method operate on the whole of it, see if you can chunk it up and only operate on a subset of it. This is particularly relevant when dealing with large data sets.
See if you can get away with using less resolution or accuracy, e.g. quarter-size instead of full size images, or 32 bit floats instead of 64 bit floats (or even custom libraries for 16 bit floats), or perhaps using only one channel of image data at a time (just red, or just blue, or just green, or greyscale instead of RGB).
As you're working with OpenCV, I'm guessing you're either working on image processing or video processing. These can easily gobble up masses of memory. In my experience, initial R&D implementations typically process a whole image buffer in one method before passing it over to the next. This often results in multiple full image buffers being used, which is hugely expensive in terms of memory consumption. Reducing the use of any temporary buffers can be a big win here.
Another approach to alleviate this is to see if you can figure out the data dependencies (e.g. by looking at the ROIs required for low-pass filters, for example), and then processing smaller chunks of the images and joining them up again later, and to avoid temporary duplicate buffers as much as possible. Reducing the memory footprint in this way can be a big win, as you're also typically reducing the chances of a cache miss. Such approaches often hugely complicate the implementation, and unless you have a graph-based framework in place that already supports it, it's probably not something you should attempt before exhausting other options. Intel have a number of great resources pertaining to optimisation of threaded image processing applications.
Tips on Memory Allocators
If you still think playing with memory allocators is going to be useful, here are some tips.
For example, on Linux, you could use
malloc hooks, or
just override them in your main compilation unit (main.cpp), or a library that you statically link, or a shared libary that you LD_PRELOAD, for example.
There are several excellent malloc/free replacements available that you could study for ideas, e.g.
dlmalloc
tcmalloc
If you're dealing with specific C++ objects, then you can override their new and delete operators. See this link, for example.
Lastly, if I did manage to guess wrong regarding where memory is being used, and you do, in fact, have loads of small objects, then search the web for 'small memory allocators'. Alexander Alexandrescu wrote a couple of great articles on this, e.g. here and here.

Which memory allocation algorithm suits best for performance and time critical c++ applications?

I ask this question to determine which memory allocation algorithm gives better results with performance critical applications, like game engines, or embedded applications. Results are actually depends percentage of memory fragmented and time-determinism of memory request.
There are several algorithms in the text books (e.g. Buddy memory allocation), but also there are others like TLSF. Therefore, regarding memory allocation algorithms available, which one of them is fastest and cause less fragmentation. BTW, Garbage collectors should be not included.
Please also, note that this question is not about profiling, it just aims to find out optimum algorithm for given requirements.

It all depends on the application. Server applications which can clear out all memory relating to a particular request at defined moments will have a different memory access pattern than video games, for instance.
If there was one memory allocation algorithm that was always best for performance and fragmentation, wouldn't the people implementing malloc and new always choose that algorithm?
Nowadays, it's usually best to assume that the people who wrote your operating system and runtime libraries weren't brain dead; and unless you have some unusual memory access pattern don't try to beat them.
Instead, try to reduce the number of allocations (or reallocations) you make. For instance, I often use a std::vector, but if I know ahead of time how many elements it will have, I can reserve that all in one go. This is much more efficient than letting it grow "naturally" through several calls to push_back().
Many people coming from languages where new just means "gimme an object" will allocate things for no good reason. If you don't have to put it on the heap, don't call new.
As for fragmentation: it still depends. Unfortunately I can't find the link now, but I remember a blog post from somebody at Microsoft who had worked on a C++ server application that suffered from memory fragmentation. The team solved the problem by allocating memory from two regions. Memory for all requests would come from region A until it was full (requests would free memory as normal). When region A was full, all memory would be allocated from region B. By the time region B was full, region A was completely empty again. This solved their fragmentation problem.
Will it solve yours? I have no idea. Are you working on a project which services several independent requests? Are you working on a game?
As for determinism: it still depends. What is your deadline? What happens when you miss the deadline (astronauts lost in space? the music being played back starts to sound like garbage?)? There are real time allocators, but remember: "real time" means "makes a promise about meeting a deadline," not necessarily "fast."
I did just come across a post describing various things Facebook has done to both speed up and reduce fragmentation in jemalloc. You may find that discussion interesting.

Barış:
Your question is very general, but here's my answer/guidance:
I don't know about game engines, but for embedded and real time applications, The general goals of an allocation algorithm are:
1- Bounded execution time: You have to know in advance the worst case allocation time so you can plan your real time tasks accordingly.
2- Fast execution: Well, the faster the better, obviously
3- Always allocate: Especially for real-time, security critical applications, all requests must be satisfied. If you request some memory space and get a null pointer: trouble!
4- Reduce fragmentation: Although this depends on the algorithm used, generally, less fragmented allocations provide better performance, due to a number of reasons, including caching effects.
In most critical systems, you are not allowed to dynamically allocate any memory to begin with. You analyze your requirements and determine your maximum memory use and allocate a large chunk of memory as soon as your application starts. If you can't, then the application does not even start, if it does start, no new memory blocks are allocated during execution.
If speed is a concern, I'd recommend following a similar approach. You can implement a memory pool which manages your memory. The pool could initialize a "sufficient" block of memory in the start of your application and serve your memory requests from this block. If you require more memory, the pool can do another -probably large- allocation (in anticipation of more memory requests), and your application can start using this newly allocated memory. There are various memory pooling schemes around as well, and managing these pools is another whole topic.
As for some examples: VxWorks RTOS used to employ a first-fit allocation algorithm where the algorithm analyzed a linked list to find a big enough free block. In VxWorks 6, they're using a best-fit algorithm, where the free space is kept in a tree and allocations traverse the tree for a big enough free block. There's a white paper titled Memory Allocation in VxWorks 6.0, by Zoltan Laszlo, which you can find by Googling, that has more detail.
Going back to your question about speed/fragmentation: It really depends on your application. Things to consider are:
Are you going to make lots of very small allocations, or relatively larger ones?
Will the allocations come in bursts, or spread equally throughout the application?
What is the lifetime of the allocations?
If you're asking this question because you're going to implement your own allocator, you should probably design it in such a way that you can change the underlying allocation/deallocation algorithm, because if the speed/fragmentation is really that critical in your application, you're going to want to experiment with different allocators. If I were to recommend something without knowing any of your requirements, I'd start with TLSF, since it has good overall characteristics.

As other already wrote, there is no "optimum algorithm" for each possible application. It was already proven that for any possible algorithm you can find an allocation sequence which will cause a fragmentation.
Below I write a few hints from my game development experience:
Avoid allocations if you can
A common practices in the game development field was (and to certain extent still is) to solve the dynamic memory allocation performance issues by avoiding the memory allocations like a plague. It is quite often possible to use stack based memory instead - even for dynamic arrays you can often come with an estimate which will cover 99 % of cases for you and you need to allocate only when you are over this boundary. Another commonly used approach is "preallocation": estimate how much memory you will need in some function or for some object, create a kind of small and simplistic "local heap" you allocate up front and perform the individual allocations from this heap only.
Memory allocator libraries
Another option is to use some of the memory allocation libraries - they are usually created by experts in the field to fit some special requirements, and if you have similar requiremens, they may fit your requirements.
Multithreading
There is one particular case in which you will find the "default" OS/CRT allocator performs badly, and that is multithreading. If you are targeting Windows, by aware both OS and CRT allocators provided by Microsoft (including the otherwise excellent Low Fragmentation Heap) are currently blocking. If you want to perform significant threading, you need either to reduce the allocation as much as possible, or to use some of the alternatives. See Can multithreading speed up memory allocation?

The best practice is - use whatever you can use to make the thing done in time (in your case - default allocator). If the whole thing is very complex - write tests and samples that will emulate parts of the whole thing. Then, run performance tests and benchmarks to find bottle necks (probably they will nothing to do with memory allocation :).
From this point you will see what exactly slowdowns your code and why. Only based on such precise knowledge you can ever optimize something and choose one algorithm over another. Without tests its just a waste of time since you can't even measure how much your optimization will speedup your app (in fact such "premature" optimizations can really slowdown it).
Memory allocation is a very complex thing and it really depends on many factors. For example, such allocator is simple and damn fast but can be used only in limited number of situations:
char pool[MAX_MEMORY_REQUIRED_TO_RENDER_FRAME];
char *poolHead = pool;
void *alloc(size_t sz) { char *p = poolHead; poolHead += sz; return p; }
void free() { poolHead = pool; }
So there is no "the best algorithm ever".

One constraint that's worth mentioning, which has not been mentioned yet, is multi-threading: Standard allocators must be implemented to support several threads, all allocating/deallocating concurrently, and passing objects from one thread to another so that it gets deallocated by a different thread.
As you may have guessed from that description, it is a tricky task to implement an allocator that handles all of this well. And it does cost performance as it is impossible to satisfy all these constrains without inter-thread communication (= use of atomic variables and locks) which is quite costly.
As such, if you can avoid concurrency in your allocations, you stand a good chance to implement your own allocator that significantly outperforms the standard allocators: I once did this myself, and it saved me roughly 250 CPU cycles per allocation with a fairly simple allocator that's based on a number of fixed sized memory pools for small objects, stacking free objects with an intrusive linked list.
Of course, avoiding concurrency is likely a no-go for you, but if you don't use it anyway, exploiting that fact might be something worth thinking about.

Will Garbage Collected C be Faster Than C++?

I had been wondering for quite some time on how to manager memory in my next project. Which is writing a DSL in C/C++.
It can be done in any of the three ways.
Reference counted C or C++.
Garbage collected C.
In C++, copying class and structures from stack to stack and managing strings separately with some kind of GC.
The community probably already has a lot of experience on each of these methods. Which one will be faster? What are the pros and cons for each?
A related side question. Will malloc/free be slower than allocating a big chunk at the beginning of the program and running my own memory manager over it? .NET seems to do it. But I am confused why we can't count on OS to do this job better and faster than what we can do ourselves.

It all depends! That's a pretty open question. It needs an essay to answer it!
Hey.. here's one somebody prepared earlier:
http://lambda-the-ultimate.org/node/2552
http://www.hpl.hp.com/personal/Hans_Boehm/gc/issues.html
It depends how big your objects are, how many of them there are, how fast they're being allocated and discarded, how much time you want to invest optimizing and tweaking to make optimizations. If you know the limits of how much memory you need, for fast performance, I would think you can't really beat grabbing all the memory you need from the OS up front, and then managing it yourself.
The reason it can be slow allocating memory from the OS is that it deals with lots of processes and memory on disk and in ram, so to get memory it's got to decide if there is enough. Possibly, it might have to page another processes memory out from ram to disk so it can give you enough. There's lots going on. So managing it yourself (or with a GC collected heap) can be far quicker than going to the OS for each request. Also, the OS usually deals with bigger chunks of memory, so it might round up the size of requests you make meaning you could waste memory.
Have you got a real hard requirement for going super quick? A lot of DSL applications don't need raw performance. I'd suggest going with whatever's simplest to code. You could spend a lifetime writing memory management systems and worrying which is best.

Why would garbage collected C be faster than C++? The only garbage collectors available for C are pretty inefficient things, more designed to plug memory leaks than to actually improve the quality of your code.
In any case, C++ has the potential for reaching better performance with less code (note that it's only a potential. It's also very possible to write C++ code that is far slower than the equivalent C).
Considering the current state of both languages, GC's are not currently going to improve performance in your code. GC's can be made very efficient in languages designed for it. C/C++ are not among those. ;)
Apart from that, it's impossible to say. Languages don't have a speed. It doesn't make sense to ask which language is faster. It depends on 1) the specific code, 2) the compiler that compiles it, and 3) the system it's running on (hardware as well as OS).
malloc is a fairly slow operation, far slower than the .NET equivalents, so yes, if you are performing a lot of small allocations, you may be better off allocating a large pool of memory once, and then using chunks of that.
The reason is that the OS has to find a free chunk of memory, basically by following a linked list of all free memory areas. In .NET, a new() call is basically nothing more than moving the heap pointer as many bytes as required by the allocation.

uh ... It depends how you write the garbage collection system for your DSL. Neither C or C++ comes with a garbage collection facility built-in but either could be used to write a very efficient or a very inefficient garbage collector. Writing such a thing, by the way, is a non-trivial task.
DSLs are often written in higher level languages such as Ruby or Python specifically because the language writer can leverage the garbage collection and other facilities of the language. C and C++ are great for writing full, industrial strength languages but you certainly need to know what you are doing to use them - knowledge of yacc and lex is especially useful here but a good understanding of dynamic memory management is important also, as you say. You could also check out keykit, an open source music DSL written in C, if you still like the idea of a DSL in C/C++.

With most garbage collection implementations, allocation can see a speed improvement, but then you have the additional cost of the collection phase which can be triggered at any point in your program's execution, leading to a sudden (seemingly random) delay.
As for your second question, it depends on your memory management algorithms. You'd be safe sticking with your library's default malloc implementation, but there are alternatives which boast better performance.

A related side question. Will malloc/free be slower than allocating a big chuck at the begining of the program and running my own memory manager over it? .NET seems to do it. But I am confused why we can't count on OS to do this job better and faster than what we can do ourselves.
The problem with letting the OS handle memory allocation is that it introduces indeterministic behaviour. There's no way for the programmer to know how long the OS will take to return a new chunk of memory - an allocation may be quite costly if memory has to be paged out to disk.
Preallocating therefore might be a good idea, especially when using a copying garbage collector. It'll increase memory consumption, but allocation will be fast because in most cases it'll just be a pointer increment.

As people have pointed out - GC is faster to allocate (because it just gives you the next block on its list), but slower overall (because it has to compact the heap regularly, in order for allocs to be fast).
so - go for the compromise solution (which is actually pretty damn good):
You create your own heaps, one for each size of object you generally allocate (or 4-byte, 8 byte, 16-byte, 32-byte, etc) then, when you want a new piece of memory you grab the last 'block' on the appropriate heap. Because you pre-allocate from these heaps, all you need to do when allocating is grab the next free block. This works better than the standard allocator because you are happily wasting memory - if you want to allocate 12 bytes, you'll give up a whole 16 byte block from the 16-byte heap. You keep a bitmap of used v free blocks so you can allocate quickly without wasting loads of memory or needing to compact.
Also, because you're running several heaps, highly-parallel systems work much better as you don't need to lock so often (ie you have multiple locks for each heap so you don't get contention nearly as much)
Try it - we used it to replace the standard heap on a very intensive application, performance went up by quite a lot.
BTW. the reason the standard allocators are slow is that they try not to waste memory - so if you allocate a 5 byte, 7 byte and 32 bytes from the standard heap, it'll keep those 'boundaries'. Next time you need to allocate, it'll walk through those looking for enough space to give you what you asked for. That worked well for low-memory systems, but you only have to look at how much memory most apps use today to see that GC systems go the other way, and try to make allocations as fast as possible whilst caring nothing for how much memory is wasted.

The problem has a lot of variables, but if your application is written with garbage collection in mind, and if you exploit the special features of the Boehm collector, such as different allocation calls for blocks that don't contain pointers, then as a general rule your application
- Will have simpler interfaces
- Will run somewhat faster
- Will require from 1.2x to 2x the space
than a similar application using explicit memory management.
For documentation and evidence supporting these claims, you can see the information on Boehm's web site, and also Ben Zorn's several papers on the measured cost of conservative garbage collection.
Most importantly you'll save a ton of effort and won't have to worry about a significant class of memory-management bugs.
The issue of C vs C++ is orthogonal, but GC will definitely be faster than reference counting, especially when there's no compiler support for reference counting.

Neither C nor C++ will give you garbage for free. What they will give you is memory allocation libraries (which provide malloc/free, etc). There are many online resources to algorithms for writing garbage collection libraries. A good start is link text

Most non GC languages will allocate and de-allocate the memory as needed and no longer needed. GC'd languages usually allocate large chunks of memory before hand and only free the memory when idle and not in the middle of a intensive task so I am going to yes if GC kicks in at correct time.
The D programming language is a garbage collected language and ABI compatible with C and partly ABI compatible with C++. This Page shows some benchmarks between string performance in C++ and D.

I suggest that if you have written a program where memory allocation and deallocation (explicitly or GC'ed) is the bottleneck, then you should re-think your architecture, design and implementation.

If you don't want to explicitly manage memory, don't use C/C++. There are plenty of languages with either reference counting or compiler-supported garbage collectors that will probably work much better for you.
C/C++ are designed in an environment where the programmer manages their own memory. Trying to retrofit GC or ref counting onto them may help some, but you'll find that you either have to compromise the performance of the GC (because it doesn't have any compiler hinting as to where pointers might be), or you'll find new and fascinating ways that you can screw up the reference counts or the GC or whatever.
I know it sounds like a good idea, but really, you should just grab a language more suited to the task.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js