C++ classes with dynamic allocation in cuda?


I have a basic question about porting C++ classes to CUDA, and I cannot find a direct, clear answer about what seems to be a real pain point.
I think one would agree that C++ code for the host very often uses the new/delete operators in constructors and destructors. On the subject of easily porting C++ code to CUDA, a few postings claim that it is 'easy', or at least getting easier, and the main evidence given is examples with __host__ __device__ decorators. It is also not difficult to find postings saying that dynamic allocation on the device usually implies a serious performance penalty. So, what is one supposed to do with C++ classes in CUDA?
Adding decorators is not going to change the dynamic memory allocation that happens at the core of the constructors and destructors. It seems one needs to rewrite the C++ classes without new/delete. In my experience it was really striking how badly a class using new/delete behaves compared with static allocation. The reasons are obvious, but it is really bad, like going back to a processor from 20 years ago ... So, what do people who have ported C++ applications with dynamic allocation do? (For more than a handful of doubles in an array, that is.)

The standard approach is to change the scope and life cycle of objects within the code so that it isn't necessary to continuously create and destroy objects as part of computations on the device. Memory allocation in most distributed memory architectures (CUDA, HPC clusters, etc) is expensive, and the usual solution is to use it as sparingly as possible and amortise the cost of the operation by extending the lifetime of objects.
Ideally, create all the objects you need at the beginning of the program, even if that means pre-allocating a pool of objects which will be consumed as the program runs. That is more efficient than ad-hoc memory allocation and deallocation. It also avoids problems with memory fragmentation, which can become an issue on GPU hardware where page sizes are rather large.
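As a rough host-side illustration of "allocate once, reuse many times", here is a minimal sketch. The Workspace class and its method names are made up for this example and are not from any library. On the device the analogous move would be a single cudaMalloc at start-up instead of new/delete inside every constructor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical workspace: allocates its scratch storage once, up front,
// and reuses it for every call instead of calling new/delete per call.
class Workspace {
public:
    explicit Workspace(std::size_t max_elems) : scratch_(max_elems) {}

    // Uses the pre-allocated scratch buffer; performs no allocation.
    double sum_of_squares(const std::vector<double>& input) {
        assert(input.size() <= scratch_.size());  // stay within the budget
        for (std::size_t i = 0; i < input.size(); ++i) {
            scratch_[i] = input[i] * input[i];
        }
        double total = 0.0;
        for (std::size_t i = 0; i < input.size(); ++i) {
            total += scratch_[i];
        }
        return total;
    }

    // Exposed only so a caller can check that the storage is never replaced.
    const double* storage() const { return scratch_.data(); }

private:
    std::vector<double> scratch_;  // lives as long as the Workspace
};
```

The object is created once and handed to every computation, so the cost of the allocation is paid a single time no matter how many calls follow.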

Related

Why do C++ standard library containers use memory pools, if apparently the malloc/free pair does the same job?

I've read that repetitive calls to malloc/free can be expensive and for this reason C++ standard library containers use memory pools rather than calling free in their destructors. Also, I've read, this means that the performance of C++ standard library containers can be higher than manually allocating and deallocating all necessary C-style arrays.
However, I'm confused about this, since now I'm reading in the C FAQ: ( http://c-faq.com/malloc/freetoOS.html )
Most implementations of malloc/free do not return freed memory to the operating system, but merely make it available for future malloc calls within the same program.
This means that essentially the malloc/free functions try to do the very same job as the C++ standard library containers: They try to optimize repetitive claiming/reclaiming memory by keeping memory in a pool and then giving the program pieces of this pool on request. While I can see the benefits of such an optimization if performed once, my intuition tells me that if we start doing this on a few different layers of abstraction simultaneously the performance is likely to actually decrease - as we will be duplicating the same work.
What am I misunderstanding here?

Some implementations of the standard library use memory pools.
In general, when you know the memory needs of a particular container, you might be able to do a better job of managing its memory than a general-purpose memory manager that doesn't know your container's specific needs.
For example, if you're using std::list<int> every node in the list is the same size, and having the container maintain a list of unused nodes (just two pointer assignments to add or remove a node to/from the free list) may be faster than releasing unused nodes back to the more general but more complex general-purpose memory manager used by new/delete (malloc/free).
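To make the free-list idea concrete, here is a minimal sketch of an allocator that recycles list nodes. FreeListAllocator, g_reuse_count and count_reused_nodes are hypothetical names invented for this example; the sketch is single-threaded, never returns cached nodes to the system, and assumes the element type needs no stricter alignment than operator new provides.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <new>

// Demo-only counter: how many requests were served from the free list.
static std::size_t g_reuse_count = 0;

template <typename T>
struct FreeListAllocator {
    using value_type = T;

    FreeListAllocator() noexcept {}
    template <typename U>
    FreeListAllocator(const FreeListAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n == 1 && head() != nullptr) {
            FreeNode* node = head();   // pop a cached block
            head() = node->next;
            ++g_reuse_count;
            return reinterpret_cast<T*>(node);
        }
        return static_cast<T*>(::operator new(n * block_size()));
    }

    void deallocate(T* p, std::size_t n) noexcept {
        if (n == 1) {
            FreeNode* previous = head();  // push the block onto the free list
            head() = ::new (static_cast<void*>(p)) FreeNode{previous};
        } else {
            ::operator delete(p);
        }
    }

private:
    struct FreeNode { FreeNode* next; };

    // Every block is large enough to hold the free-list link.
    static std::size_t block_size() {
        return sizeof(T) > sizeof(FreeNode) ? sizeof(T) : sizeof(FreeNode);
    }
    // One free list per allocated type (e.g. per list node type).
    static FreeNode*& head() {
        static FreeNode* h = nullptr;
        return h;
    }
};

template <typename T, typename U>
bool operator==(const FreeListAllocator<T>&, const FreeListAllocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const FreeListAllocator<T>&, const FreeListAllocator<U>&) noexcept { return false; }

// Fills a list, empties it, fills it again, and reports how many nodes
// of the second fill came from the free list.
std::size_t count_reused_nodes() {
    std::list<int, FreeListAllocator<int>> xs;
    for (int i = 0; i < 100; ++i) xs.push_back(i);
    xs.clear();
    const std::size_t before = g_reuse_count;
    for (int i = 0; i < 100; ++i) xs.push_back(i);
    return g_reuse_count - before;
}
```

Adding or removing a node from the cache is the "two pointer assignments" mentioned above; the general-purpose allocator is only involved the first time a node is needed.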

The general memory management utility called malloc is generally optimized for common case scenarios.
Since the system should support multiple processes, each behaving differently, this optimization might be excellent for some applications and not that good for others.
A general purpose allocator tries to consider the following generic guidelines:
Maximizing Compatibility: An allocator should be plug-compatible with others; in particular it should obey ANSI/POSIX conventions.
Maximizing Portability: Reliance on as few system-dependent features (such as system calls) as possible, while still providing optional support for other useful features found only on some systems; conformance to all known system constraints on alignment and addressing rules.
Minimizing Space: The allocator should not waste space: It should obtain as little memory from the system as possible, and should maintain memory in ways that minimize fragmentation -- "holes" in contiguous chunks of memory that are not used by the program.
Minimizing Time: The malloc(), free() and realloc routines should be as fast as possible in the average case.
Maximizing Tunability: Optional features and behavior should be controllable by users either statically (via #define and the like) or dynamically (via control commands such as mallopt).
Maximizing Locality: Allocating chunks of memory that are typically used together near each other. This helps minimize page and cache misses during program execution.
Maximizing Error Detection: It does not seem possible for a general-purpose allocator to also serve as general-purpose memory error testing tool such as Purify. However, allocators should provide some means for detecting corruption due to overwriting memory, multiple frees, and so on.
Minimizing Anomalies
This snippet was taken from a great document written by Doug Lea about what is called Doug Lea's malloc, which was the de facto memory management algorithm for many years, and I think every programmer should read this.
In contrast, when a container is created, many factors are known at compile time, and even more can be predicted at run time; for example, we know the size of the objects we are going to hold.
Using this knowledge, standard containers were written to work well with general purpose allocators.

C++ memory management paradigms

I'm moving from C to C++11 and trying to figure out the memory management paradigm for C++11 programs (or any modern languages with built-in exceptions). Specifically, I'm having a crack at game development where running out of memory is a real concern.
In C, I'm used to checking the return value of malloc, and I generally use custom allocators.
With C++, I'm quite confused; though I like how the STL containers were built allowing custom allocators. Since the STL containers all manage their own memory, simply adding an element to a vector could throw a std::bad_alloc. How do I guard against such things? I have heard wrapping all throwing calls in try/catch blocks can be prohibitively expensive.
However, letting the exception travel up the call stack would leave a bunch of functions that did not fully execute, and would lead to some really tricky code. For example, if A->B->C->D is a call stack, D throws, and A catches, then B, C, and D could have created some weird problems by not being able to finish execution normally.
Additionally, the nothrow argument seems to allow very C-like code; though I now don't see the benefit over a plain malloc.
What are some best practices out there for writing exception-safe C++ code that guards against out-of-memory issues?
edit: A relevant answer on programmers.stackexchange argues for exception-less C++ design on consoles. Not sure if these arguments still apply to the 8th generation consoles.

My answer will be geared more towards game development since that is my background and that is part of what you are interested in. Different types of applications will have different requirements.
Games generally allocate all dynamic memory up front and stick within that budget. Consoles in particular have hard memory limits and most games will want to use all of it.
There's a few reasons for allocating everything up front.
One, performance. Memory allocation is slow. You want to avoid it at all costs. If you allocate everything up front, you can then write custom, high performance memory allocators like Pool Allocators, Stack Allocators, etc, that just grab memory from your pre-allocated buffer. It's important to choose the best allocator for the task at hand.
Two, you'll know quickly if there isn't enough memory for your game. In development, you'll crash if you run out of memory and will need to adjust usage, but the final release shouldn't crash because you've allocated up front and stuck within your memory budgets.
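For illustration, a stack (bump) allocator over a pre-allocated buffer might look roughly like the sketch below. StackArena and its methods are names invented for this example; a real engine allocator would add debugging aids, thread-safety decisions and so on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stack allocator: one up-front buffer, allocation is a pointer bump.
class StackArena {
public:
    explicit StackArena(std::size_t capacity) : buffer_(capacity), offset_(0) {}

    // Returns nullptr when the budget is exhausted.
    // 'alignment' must be a power of two.
    void* allocate(std::size_t size,
                   std::size_t alignment = alignof(std::max_align_t)) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uintptr_t base =
            reinterpret_cast<std::uintptr_t>(buffer_.data());
        const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
        const std::uintptr_t aligned = (base + offset_ + mask) & ~mask;
        const std::size_t start = static_cast<std::size_t>(aligned - base);
        if (start > buffer_.size() || size > buffer_.size() - start) {
            return nullptr;  // over budget
        }
        offset_ = start + size;
        return reinterpret_cast<void*>(aligned);
    }

    // Remember the current top of the stack...
    std::size_t mark() const { return offset_; }
    // ...and release everything allocated after that point in one step.
    void rewind(std::size_t marker) {
        assert(marker <= offset_);
        offset_ = marker;
    }

    std::size_t used() const { return offset_; }

private:
    std::vector<unsigned char> buffer_;  // the single up-front allocation
    std::size_t offset_;                 // current top of the stack
};
```

A typical pattern is to take a mark at the start of a frame or a level load, allocate freely, and rewind at the end. Running out of budget shows up immediately as a null return instead of a slow failure later.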
For exceptions, many (but not all) games disable exceptions, again for performance reasons. In fact some console compilers don't even support exceptions. Then you will either need to use an STL library with no exceptions, or implement your own containers. Many game teams choose to implement their own for performance reasons as well as to better integrate them with custom memory allocators.
That said, dynamic memory allocation, STL, and exceptions are probably perfectly fine for smaller personal projects/games, but keep in mind what will be necessary for large, high performance, real-time games.
For exception safety, I would definitely use RAII. That is its purpose. Also I would recommend using smart pointers like std::unique_ptr and std::shared_ptr for memory management. Coupled with RAII, if your constructors throw, memory will be freed.
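A small sketch of that last point, with made-up types (Tracked, Level) used only to observe what happens: the members are std::unique_ptr, so when the constructor throws, the already-constructed members are destroyed and their memory is released without any explicit cleanup code.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical resource that counts how many instances are alive.
struct Tracked {
    static int live;
    Tracked() { ++live; }
    ~Tracked() { --live; }
};
int Tracked::live = 0;

// Hypothetical owner: both members are smart pointers, so there is
// nothing to delete by hand, even on the failure path.
class Level {
public:
    explicit Level(bool fail)
        : first_(new Tracked), second_(new Tracked) {
        if (fail) {
            // Members constructed so far are destroyed automatically.
            throw std::runtime_error("load failed");
        }
    }

private:
    std::unique_ptr<Tracked> first_;
    std::unique_ptr<Tracked> second_;
};

// Returns the number of Tracked objects still alive after a failed construction.
int live_after_failed_construction() {
    try {
        Level level(true);
    } catch (const std::runtime_error&) {
        // Nothing to clean up here.
    }
    return Tracked::live;
}
```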

Use destructors to clean up automatically when scopes are exited. That's called RAII, Resource Acquisition Is Initialization, although the acronym is not exactly the best one could devise. All the standard containers etc. clean up automatically.
In languages like C# and Java which are based on garbage collection, you instead pepper the code with try blocks and “using” statements. Java just got that (part of the try syntax IIRC); it's been in C# from the beginning (keyword using); in Python it's called with; and C++ doesn't have it and doesn't need it. I once created a WITH macro for C++, a clever little hack, thinking I would be using it all the time, but I haven't used it once except just to try it out right after creating it: in C++ RAII does it all.
Summing up: use RAII, i.e., use destructors, and just let those exceptions propagate.
Regarding memory exhaustion, ordinarily that's just regarded as “we're done for”, nothing to do except terminate as orderly as possible.
But it doesn't hurt to set aside a little buffer which can be deallocated when memory is exhausted, so as to have some working memory for cleanup work.
C++ doesn't differentiate between hard exceptions (fatal like e.g. memory exhaustion) and soft exceptions (general failures that are not fatal).
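The reserve-buffer idea can be sketched with std::set_new_handler, which is standard C++. Everything else here (g_reserve, install_reserve, release_reserve_handler, reserve_available) is a hypothetical name for this example, and the sketch assumes a single thread.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>

// Hypothetical emergency buffer, set aside at start-up.
static std::unique_ptr<char[]> g_reserve;

// Called by operator new when an allocation fails.
// First failure: give the reserve back so the retry can succeed.
// Later failures: nothing left to free, so report exhaustion.
void release_reserve_handler() {
    if (g_reserve) {
        g_reserve.reset();
        return;  // operator new retries the allocation
    }
    throw std::bad_alloc();
}

void install_reserve(std::size_t bytes) {
    g_reserve.reset(new char[bytes]);
    std::set_new_handler(release_reserve_handler);
}

bool reserve_available() { return g_reserve != nullptr; }
```

Once reserve_available() turns false, the program is living on borrowed memory and should head for an orderly shutdown, as described above.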

Implementing a memory manager in multithreaded C/C++ with dynamically sized memory pool?

Background: I'm developing a multiplatform framework of sorts that will be used as a base for both game and util/tool creation. The basic idea is to have a pool of workers, each executing in its own thread. (Furthermore, workers will also be able to spawn at runtime.) Each thread will have its own memory manager.
I have long thought about creating my own memory management system, and I think this project will be perfect to finally give it a try. I find such a system fitting because the kinds of use this framework targets will often require memory allocations in real time (games and texture editing tools).
Problems:
No generally applicable solution(?) - The framework will be used for both games/visualization (not AAA, but indie/play) and tool/application creation. My understanding is that for game development it is usual (at least for console games) to allocate a big chunk of memory only once in the initialization, and then use this memory internally in the memory manager. But is this technique applicable in a more general application?
In a game you could theoretically know how much memory your scenes and resources will need, but for example, a photo editing application will load resources of all different sizes... So in the latter case a more dynamic memory "chunk size" would be needed? Which leads me to the next problem:
Moving already allocated data and keeping valid pointers - Normally when allocating on the heap, you will acquire a simple pointer to the memory chunk. In a custom memory manager, as far as I understand it, a similar approach is then to return a pointer to somewhere free in the pre-allocated chunk. But what happens if the pre-allocated chunk is too small and needs to be resized or even defragmented? The data would need to be moved around in memory and the old pointers would be invalid. Is there a way to transparently wrap these pointers in some way, but still use them "outside" the memory management as if they were normal C++ pointers?
Third party libraries - If there is no way to transparently use a custom memory management system for all memory allocation in the application, every third party library I'm linking with, will still use the "old" OS memory allocations internally. I have learned that it is common for libraries to expose functions to set custom allocation functions that the library will use, but it is not guaranteed every library I will use will have this ability.
Questions: Is it possible and feasible to implement a memory manager that can use a dynamically sized memory chunk pool? If so, how would defragmentation and memory resize work, without breaking currently in-use pointers? And finally, how is such a system best implemented to work with third party libraries?
I'm also thankful for any related reading material, papers, articles and whatnot! :-)

As someone who has written many memory managers and heap implementations for AAA games over the last few generations of consoles, let me tell you it's simply not worth it anymore.
Your information is old - back in the gamecube era [circa 2003] we used to do what you said- allocate a large chunk and carve out that chunk manually using custom algorithms tweaked for each game.
Once virtual memory came along (Xbox era) and games got more complicated [and so made more allocations and became multithreaded], address fragmentation made this untenable. So we switched to custom allocators to handle certain types of requests only - for instance physical memory, lock-free small-block low-fragmentation heaps, or thread-local caches of recently used blocks.
As built-in memory managers become better, it gets harder to do better than them - certainly in the general case, and it is a close thing for specific use cases. The Doug Lea allocator [or whatever the mainstream C++ Linux compilers come with now] and the latest Windows low fragmentation heaps are really very good, and you'd do far better investing your time elsewhere.
I've got spreadsheets at work measuring all kinds of metrics for a whole load of allocators - all the big name ones and a fair few I've collected over the years. And basically, whilst the specialist allocators can win on a few metrics [lowest overhead per alloc, spatial proximity, lowest fragmentation, etc.], for overall metrics the mainstream ones are simply the best.
As a user of your library, my personal preferred option is you just allocate memory when you need it. Use operator new/the new operator and I can use the standard C++ mechanisms to replace those and use my custom heap (if I indeed have one), or alternatively I can use platform specific ways of replacing your allocations (e.g. XMemAlloc on Xbox). I don't need tagging [capturing callstacks is far superior which I can do if I want]. Lower down that list comes you giving me an interface that you'll call when you need to allocate memory - this is just a pain for you to implement and I'll probably just pass it onto operator new anyway. The worst thing you can do is 'know best' and create your own custom heaps. If memory allocation performance is a problem, I'd much rather you share the solution the whole game uses than roll your own.
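For reference, "the standard C++ mechanisms to replace those" means providing your own global operator new and operator delete. Below is a minimal sketch that forwards to malloc/free and merely counts allocations; g_allocation_count and count_allocations_in_library_call are hypothetical names for this example, and a real replacement would route to the game's heap instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Demo-only counter; not thread-safe.
static std::size_t g_allocation_count = 0;

// Replacement for the global allocation function. Code that allocates
// through operator new (directly or via new-expressions) ends up here.
void* operator new(std::size_t size) {
    ++g_allocation_count;
    if (size == 0) size = 1;  // always return a unique, non-null pointer
    if (void* p = std::malloc(size)) {
        return p;
    }
    throw std::bad_alloc();
}

// Matching deallocation functions (unsized and sized forms).
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Stands in for a library that "just allocates when it needs to".
std::size_t count_allocations_in_library_call() {
    const std::size_t before = g_allocation_count;
    void* block = ::operator new(64);
    ::operator delete(block);
    return g_allocation_count - before;
}
```

Because the library does nothing special, the application gets to decide where the memory comes from, which is the preference expressed above.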

If you're looking to write your own malloc()/free(), etc., you probably should start by checking out the source code for existing systems such as dlmalloc. This is a hard problem, though, for what it's worth. Writing your own malloc library is Hard. Beating existing general purpose malloc libraries will be Even Harder.

And now, here is the correct answer: DON'T IMPLEMENT YET ANOTHER MEMORY MANAGER.
It is incredibly hard to implement a memory manager that does not fail under different kinds of usage patterns and events. You may be able to build a specific manager that works well under YOUR usage patterns, but to write one which works well for MANY users is a full-time job that almost no one has really done well. Worse, it is fantastically easy to implement a memory manager that works great 99% of the time and then, 1% of the time, crashes or suddenly consumes most or all available memory on your system due to unexpected heap fragmentation.
I say this as someone who has written multiple memory managers, watched multiple people write their own memory managers, and watched even more people attempt to write memory managers and fail. This problem is deceptively difficult, not because it's hard to write templated allocators and generic types with inheritance and such, but because the other solutions given in this thread tend to fail under corner types of load behavior. Once you start supporting byte alignments (as all real-world allocators must) then heap fragmentation rears its ugly head. Cute heuristics that work great for small test programs, fail miserably when subjected to large, real-world programs.
And once you get it working, someone else will need: cookies to verify against memory stomps; heap usage reporting; memory pools; pools of pools; memory leak tracking and reporting; heap auditing; chunk splitting and coalescing; thread-local storage; lookasides; CPU and process-level page faulting and protection; setting and checking and clearing "free-memory" patterns aka 0xdeadbeef; and whatever else I can't think of off the top of my head.
Writing yet another memory manager falls squarely under the heading of Premature Optimization. Since there are multiple free, good, memory managers with thousands of hours of development and testing behind them, you have to justify spending the cost of your own time in such a way that the result would provide some sort of measurable improvement over what other people have done, and you can use, for free.
If you are SURE you want to implement your own memory manager (and hopefully you are NOT sure after reading this message), read through the dlmalloc sources in detail, then read through the tcmalloc sources in detail as well, THEN make sure you understand the performance trade-offs in implementing a thread-safe versus a thread-unsafe memory manager, and why the naive implementations tend to give poor performance results.

Prepare more than one solution and let the user of the framework adopt any particular one. Policy classes to the generic allocator you develop would do this nicely.
A nice way to get around this is to wrap up pointers in a class with overloaded * operator. Make the internal data of that class only an index to the memory pool. Now, you can just change the index quickly after a background thread copies the data over.
Most good C++ libraries support allocators and you should implement one. You can also overload the global new so your version gets used. And keep in mind that you generally won't need to think about a library allocating or deallocating a large amount of data, which is generally a responsibility of client code.
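A minimal sketch of the pointer-wrapper idea from this answer: the handle stores only a pool pointer and an index, so the pool is free to move its storage. Pool, Handle, create and grow are hypothetical names; the sketch is single-threaded and only relocates the whole block. Compaction that reorders objects would need an extra index table, and a background thread would need synchronization.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

template <typename T>
class Pool;

// Hypothetical handle: behaves like a pointer but holds an index,
// so it stays valid when the pool relocates its storage.
template <typename T>
class Handle {
public:
    Handle(Pool<T>* pool, std::size_t index) : pool_(pool), index_(index) {}

    T& operator*() const { return pool_->at(index_); }
    T* operator->() const { return &pool_->at(index_); }

private:
    Pool<T>* pool_;
    std::size_t index_;
};

template <typename T>
class Pool {
public:
    // Stores a copy of 'value' and returns a handle to it.
    Handle<T> create(const T& value) {
        storage_.push_back(value);
        return Handle<T>(this, storage_.size() - 1);
    }

    T& at(std::size_t index) {
        assert(index < storage_.size());
        return storage_[index];
    }

    // Growing may move every object to a new block; raw pointers would
    // dangle, but handles keep working because they store indices.
    void grow(std::size_t new_capacity) { storage_.reserve(new_capacity); }

private:
    std::vector<T> storage_;
};
```

The price is one extra indirection on every access, and the pool must outlive all of its handles.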

To GC or Not To GC

I've recently seen two really nice and educational talks about languages:
This first one by Herb Sutter, presents all the nice and cool features of C++0x, why C++'s future seems brighter than ever, and how M$ is said to be a good guy in this game. The talk revolves around efficiency and how minimizing heap activity very often improves performance.
This other one, by Andrei Alexandrescu, motivates a transition from C/C++ to his new game-changer D. Most of D's stuff seems really well motivated and designed. One thing, however, surprised me, namely that D pushes for garbage collection and that all classes are created solely by reference. Even more confusing, the book The D Programming Language Ref Manual specifically in the section about Resource Management states the following, quote:
Garbage collection eliminates the tedious, error prone memory allocation tracking code
necessary in C and C++. This not only means much faster development time and lower
maintenance costs, but the resulting program frequently runs faster!
This conflicts with Sutter's constant talk about minimizing heap activity. I strongly respect both Sutter's and Alexandrescu's insights, so I feel a bit confused about these two key questions:
Doesn't creating class instances solely by reference result in a lot of unnecessary heap activity?
In which cases can we use Garbage Collection without sacrificing run-time performance?

To directly answer your two questions:
1. Yes, creating class instances by reference does result in a lot of heap activity, but:
   a. In D, you have struct as well as class. A struct has value semantics and can do everything a class can, except polymorphism.
   b. Polymorphism and value semantics have never worked well together due to the slicing problem.
   c. In D, if you really need to allocate a class instance on the stack in some performance-critical code and don't care about the loss of safety, you can do so without unreasonable hassle via the scoped function.
2. GC can be comparable to or faster than manual memory management if:
   a. You still allocate on the stack where possible (as you typically do in D) instead of relying on the heap for everything (as you often do in other GC'd languages).
   b. You have a top-of-the-line garbage collector (D's current GC implementation is admittedly somewhat naive, though it has seen some major optimizations in the past few releases, so it's not as bad as it was).
   c. You're allocating mostly small objects. If you allocate mostly large arrays and performance ends up being a problem, you may want to switch a few of these to the C heap (you have access to C's malloc and free in D) or, if it has a scoped lifetime, some other allocator like RegionAllocator. (RegionAllocator is currently being discussed and refined for eventual inclusion in D's standard library).
   d. You don't care that much about space efficiency. If you make the GC run too frequently to keep the memory footprint ultra-low, performance will suffer.

The reason creating an object on the heap is slower than creating it on the stack is that the memory allocation methods need to deal with things like heap fragmentation. Allocating memory on the stack is as simple as incrementing the stack pointer (a constant-time operation).
Yet, with a compacting garbage collector, you don't have to worry about heap fragmentation, so heap allocations can be as fast as stack allocations. The Garbage Collection page for the D Programming Language explains this in more detail.
The assertion that GC'd languages run faster is probably assuming that many programs allocate memory on the heap much more often than on the stack. Assuming that heap allocation could be faster in a GC'd language, then it follows that you have just optimized a huge part of most programs (heap allocation).

An answer to 1):
As long as your heap is contiguous, allocating on it is just as cheap as allocating on the stack.
On top of that, while you allocate objects that lie next to each other, your memory caching performance will be great.
As long as you don't have to run the garbage collector, no performance is lost, and the heap stays contiguous.
That's the good news :)
Answer to 2):
GC technology has advanced greatly; they even come in real-time flavors nowadays. That means that guaranteeing contiguous memory is a policy-driven, implementation-dependent issue.
So if
you can afford a real-time gc
there are enough allocation-pauses in your application
it can keep your free-list a free-block
You may end up with better performance.
Answer to unasked question:
If developers are freed from memory-management issues, they may have more time to spend on real performance and scalability aspects in their code. That's a non-technical factor coming into play, too.

It's not either "garbage collection" or "tedious error prone" handwritten code. Smart pointers that are truly smart can give you stack semantics and mean you never type "delete" but you aren't paying for garbage collection. Here's another video by Herb that makes the point - safe and fast - that's what we want.

Another point to consider is the 80:20 rule. It is likely that the vast majority of the places you allocate are irrelevant and you won't gain much over a GC even if you could push the cost there to zero. If you accept that, then the simplicity you can gain by using a GC can offset the cost of using it. This is particularly true if you can avoid doing copies. What D provides is a GC for the 80% cases and access to stack allocation and malloc for the 20%.

Even if you had an ideal garbage collector, it would still be slower than creating things on the stack. So you need a language that allows both at the same time. Furthermore, the only way to achieve the same performance with a garbage collector as with manually managed memory allocation (done the right way) is to make it do the same things with memory as an experienced developer would have done, and in many cases that would require the garbage collector's decisions to be made at compile time and executed at run time. Usually, garbage collection makes things slower, languages that work with dynamic memory only are slower, and the predictability of execution of programs written in those languages is low while latency is higher. Frankly, I personally don't see why one would need a garbage collector. Managing memory manually is not hard. At least not in C++. Of course, I wouldn't mind the compiler generating code that cleans up everything for me as I would have done, but this doesn't seem possible at the moment.

In many cases a compiler can optimize heap-allocation back to stack allocation. This is the case if your object doesn't escape the local scope.
A decent compiler will almost certainly make x stack-allocated in the following example:
void f() {
    Foo* x = new Foo();
    x->doStuff(); // Assuming doStuff doesn't assign 'this' anywhere
    // delete x or assume the GC gets it
}
What the compiler does is called escape analysis.
Also, D could in theory have a moving GC, which means potential performance improvements by improved cache usage when the GC compacts your heap objects together. It also combats heap fragmentation as explained in Jack Edmonds' answer. Similar things can be done with manual memory management, but it's extra work.

An incremental low-priority GC will collect garbage when high-priority tasks are not running. The high-priority threads will run faster since they do no memory deallocation themselves.
This is the idea of Henriksson's RT Java GC see http://www.oracle.com/technetwork/articles/javase/index-138577.html

Garbage collection does in fact slow code down. It adds extra functionality to the program that has to run in addition to your code. There are other problems with it as well, for example the GC not running until memory is actually needed. This can result in small memory leaks. Another issue is that if a reference is not removed properly, the GC will not pick it up, which once again results in a leak. My other issue with GC is that it kind of promotes laziness in programmers. I'm an advocate of learning the low-level concepts of memory management before jumping to a higher level. It's like mathematics. You learn how to solve for the roots of a quadratic, or how to take a derivative by hand first, then you learn how to do it on the calculator. Use these things as tools, not crutches.
If you don't want to hurt your performance, be smart about the GC and your heap vs stack usage.

My point is that GC is inferior to malloc when you do normal procedural programming. You just go from procedure to procedure, allocate and free, use global variables, and declare some functions _inline or _register. This is C style.
But once you go to a higher abstraction layer, you need at least reference counting. So you can pass by reference, count the references, and free once the counter reaches zero. This is good, and superior to malloc once the number and hierarchy of objects become too difficult to manage manually. This is C++ style. You will define constructors and destructors to increment and decrement counters, and you will copy-on-modify, so that a shared object splits in two once some part of it is modified by one party while another party still needs the original value. So you can pass huge amounts of data from function to function without thinking about whether you need to copy data here or just send a pointer there. The ref-counting makes those decisions for you.
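A minimal sketch of reference counting with copy-on-modify, using std::shared_ptr for the counter. SharedBuffer is a hypothetical class written for this example, and the use_count() check assumes no other thread touches the buffer at the same time.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical copy-on-modify buffer: copies share storage until one writes.
class SharedBuffer {
public:
    explicit SharedBuffer(std::size_t n, int value = 0)
        : data_(std::make_shared<std::vector<int>>(n, value)) {}

    int get(std::size_t i) const { return (*data_)[i]; }

    void set(std::size_t i, int value) {
        if (data_.use_count() > 1) {
            // Someone else still needs the original: split off a private copy.
            data_ = std::make_shared<std::vector<int>>(*data_);
        }
        (*data_)[i] = value;
    }

    bool shares_storage_with(const SharedBuffer& other) const {
        return data_ == other.data_;
    }

private:
    // The reference count lives in the shared_ptr control block; the
    // storage is freed when the last owner goes away.
    std::shared_ptr<std::vector<int>> data_;
};
```

Passing a SharedBuffer by value costs one counter increment, and the actual copy of the data only happens on the first write to a shared instance.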
Then comes a whole new world: closures, functional programming, duck typing, circular references, asynchronous execution. Code and data start mixing, and you find yourself passing functions as parameters more often than normal data. You realize that metaprogramming can be done without macros or templates. Your code starts to soar into the sky and lose solid ground, because you are executing something inside callbacks of callbacks of callbacks, data becomes unrooted, things become asynchronous, and you get addicted to closure variables. So this is where a timer-based, memory-walking GC is the only possible solution; otherwise closures and circular references are not possible at all. This is the JavaScript way.
You mentioned D, but D is still an improved C++, so malloc or ref counting in constructors, stack allocations, and global variables (even if they are complicated trees of entities of all kinds) are probably what you will choose.

Will Garbage Collected C be Faster Than C++?

I have been wondering for quite some time how to manage memory in my next project, which is writing a DSL in C/C++.
It can be done in any of the three ways.
Reference counted C or C++.
Garbage collected C.
In C++, copying classes and structures from stack to stack and managing strings separately with some kind of GC.
The community probably already has a lot of experience on each of these methods. Which one will be faster? What are the pros and cons for each?
A related side question. Will malloc/free be slower than allocating a big chunk at the beginning of the program and running my own memory manager over it? .NET seems to do it. But I am confused why we can't count on OS to do this job better and faster than what we can do ourselves.

It all depends! That's a pretty open question. It needs an essay to answer it!
Hey.. here's one somebody prepared earlier:
http://lambda-the-ultimate.org/node/2552
http://www.hpl.hp.com/personal/Hans_Boehm/gc/issues.html
It depends how big your objects are, how many of them there are, how fast they're being allocated and discarded, how much time you want to invest optimizing and tweaking to make optimizations. If you know the limits of how much memory you need, for fast performance, I would think you can't really beat grabbing all the memory you need from the OS up front, and then managing it yourself.
The reason it can be slow to allocate memory from the OS is that it deals with lots of processes and with memory both on disk and in RAM, so to give you memory it has to decide whether there is enough. Possibly, it might have to page another process's memory out from RAM to disk so it can give you enough. There's lots going on. So managing it yourself (or with a garbage-collected heap) can be far quicker than going to the OS for each request. Also, the OS usually deals with bigger chunks of memory, so it might round up the size of the requests you make, meaning you could waste memory.
Have you got a real hard requirement for going super quick? A lot of DSL applications don't need raw performance. I'd suggest going with whatever's simplest to code. You could spend a lifetime writing memory management systems and worrying which is best.

Why would garbage collected C be faster than C++? The only garbage collectors available for C are pretty inefficient things, more designed to plug memory leaks than to actually improve the quality of your code.
In any case, C++ has the potential for reaching better performance with less code (note that it's only a potential. It's also very possible to write C++ code that is far slower than the equivalent C).
Considering the current state of both languages, GCs are not currently going to improve performance in your code. GCs can be made very efficient in languages designed for them. C/C++ are not among those. ;)
Apart from that, it's impossible to say. Languages don't have a speed. It doesn't make sense to ask which language is faster. It depends on 1) the specific code, 2) the compiler that compiles it, and 3) the system it's running on (hardware as well as OS).
malloc is a fairly slow operation, far slower than the .NET equivalents, so yes, if you are performing a lot of small allocations, you may be better off allocating a large pool of memory once, and then using chunks of that.
The reason is that malloc has to find a free chunk of memory that is large enough, typically by searching lists of free memory areas. In .NET, a new call is basically nothing more than moving the heap pointer by as many bytes as the allocation requires.
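To make the "moving the heap pointer" idea concrete, here is a minimal sketch of a bump allocator in C++. The class name `BumpArena` and its interface are made up for illustration; this is not how .NET or any particular runtime implements it. It grabs one big chunk with malloc up front and then hands out pieces by advancing an offset, assuming single-threaded use and that everything is released together.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical bump ("arena") allocator: one malloc up front,
// after which each allocation only advances an offset.
class BumpArena {
public:
    explicit BumpArena(std::size_t capacity)
        : base_(static_cast<unsigned char*>(std::malloc(capacity))),
          capacity_(base_ ? capacity : 0),
          offset_(0) {}

    ~BumpArena() { std::free(base_); }

    BumpArena(const BumpArena&) = delete;
    BumpArena& operator=(const BumpArena&) = delete;

    // Returns nullptr when the arena is exhausted.
    // Assumption: align is a power of two no larger than
    // alignof(std::max_align_t), which is what malloc guarantees for base_.
    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        std::size_t start = (offset_ + align - 1) & ~(align - 1);
        if (start > capacity_ || size > capacity_ - start) return nullptr;
        offset_ = start + size;  // the "pointer increment"
        return base_ + start;
    }

    // There is no per-object free: everything is released at once.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    unsigned char* base_;
    std::size_t capacity_;
    std::size_t offset_;
};
```

The trade-off is that individual objects cannot be freed, so this only fits workloads where a whole batch of allocations dies together (for example, everything built while parsing one DSL statement).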

uh ... It depends how you write the garbage collection system for your DSL. Neither C nor C++ comes with a built-in garbage collection facility, but either could be used to write a very efficient or a very inefficient garbage collector. Writing such a thing, by the way, is a non-trivial task.
DSLs are often written in higher level languages such as Ruby or Python specifically because the language writer can leverage the garbage collection and other facilities of the language. C and C++ are great for writing full, industrial strength languages but you certainly need to know what you are doing to use them - knowledge of yacc and lex is especially useful here but a good understanding of dynamic memory management is important also, as you say. You could also check out keykit, an open source music DSL written in C, if you still like the idea of a DSL in C/C++.

With most garbage collection implementations, allocation can see a speed improvement, but then you have the additional cost of the collection phase which can be triggered at any point in your program's execution, leading to a sudden (seemingly random) delay.
As for your second question, it depends on your memory management algorithms. You'd be safe sticking with your library's default malloc implementation, but there are alternatives which boast better performance.

A related side question. Will malloc/free be slower than allocating a big chunk at the beginning of the program and running my own memory manager over it? .NET seems to do it. But I am confused why we can't count on OS to do this job better and faster than what we can do ourselves.
The problem with letting the OS handle memory allocation is that it introduces nondeterministic behaviour. There's no way for the programmer to know how long the OS will take to return a new chunk of memory; an allocation may be quite costly if memory has to be paged out to disk.
Preallocating therefore might be a good idea, especially when using a copying garbage collector. It'll increase memory consumption, but allocation will be fast because in most cases it'll just be a pointer increment.

As people have pointed out - GC is faster to allocate (because it just gives you the next block on its list), but slower overall (because it has to compact the heap regularly, in order for allocs to be fast).
So go for the compromise solution (which is actually pretty damn good):
You create your own heaps, one for each size of object you generally allocate (or one per size class: 4-byte, 8-byte, 16-byte, 32-byte, etc.). Then, when you want a new piece of memory, you grab the next free block on the appropriate heap. Because the heaps are pre-allocated, that is all an allocation has to do. This works better than the standard allocator because you are happily wasting memory: if you want to allocate 12 bytes, you give up a whole 16-byte block from the 16-byte heap. You keep a bitmap of used vs. free blocks, so you can allocate quickly without wasting loads of memory or needing to compact.
Also, because you're running several heaps, highly parallel systems work much better, as you don't need to wait on a lock so often (i.e. each heap has its own lock, so you don't get nearly as much contention).
Try it - we used it to replace the standard heap on a very intensive application, performance went up by quite a lot.
BTW, the reason the standard allocators are slow is that they try not to waste memory: if you allocate 5 bytes, 7 bytes and 32 bytes from the standard heap, it keeps those 'boundaries'. Next time you need to allocate, it walks through them looking for enough space to give you what you asked for. That worked well for low-memory systems, but you only have to look at how much memory most apps use today to see that GC systems go the other way and try to make allocations as fast as possible while caring nothing for how much memory is wasted.
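A rough sketch of the per-size heaps described in this answer could look like the following. All names (`FixedPool`, `SizeClassAllocator`) and the choice of 16/32/64-byte classes are assumptions for illustration, not the code the answer's author used. `std::vector<bool>` stands in for the used/free bitmap, and the per-heap locks are left out, so this version is single-threaded only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// One pre-allocated heap of equally sized blocks, with a used/free bitmap.
class FixedPool {
public:
    FixedPool(std::size_t block_size, std::size_t block_count)
        : block_size_(block_size),
          storage_(static_cast<unsigned char*>(
              std::malloc(block_size * block_count))),
          used_(storage_ ? block_count : 0, false) {}

    ~FixedPool() { std::free(storage_); }

    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // First-fit scan over the bitmap; nullptr when the pool is full.
    void* allocate() {
        for (std::size_t i = 0; i < used_.size(); ++i) {
            if (!used_[i]) {
                used_[i] = true;
                return storage_ + i * block_size_;
            }
        }
        return nullptr;
    }

    // True if p points into this pool's storage.
    bool owns(const void* p) const {
        auto addr = reinterpret_cast<std::uintptr_t>(p);
        auto begin = reinterpret_cast<std::uintptr_t>(storage_);
        return storage_ != nullptr && addr >= begin &&
               addr < begin + block_size_ * used_.size();
    }

    // Marks the block containing p as free again (p must be owned).
    void deallocate(void* p) {
        std::size_t index = static_cast<std::size_t>(
            static_cast<unsigned char*>(p) - storage_) / block_size_;
        used_[index] = false;
    }

private:
    std::size_t block_size_;
    unsigned char* storage_;
    std::vector<bool> used_;  // the bitmap of used vs. free blocks
};

// Rounds each request up to a size class; oversized requests, or requests
// that hit a full pool, fall back to malloc/free.
class SizeClassAllocator {
public:
    explicit SizeClassAllocator(std::size_t blocks_per_class)
        : pool16_(16, blocks_per_class),
          pool32_(32, blocks_per_class),
          pool64_(64, blocks_per_class) {}

    void* allocate(std::size_t size) {
        FixedPool* pool = size <= 16 ? &pool16_
                        : size <= 32 ? &pool32_
                        : size <= 64 ? &pool64_
                                     : nullptr;
        if (pool != nullptr) {
            if (void* p = pool->allocate()) return p;
        }
        return std::malloc(size);
    }

    void deallocate(void* p) {
        if (pool16_.owns(p)) pool16_.deallocate(p);
        else if (pool32_.owns(p)) pool32_.deallocate(p);
        else if (pool64_.owns(p)) pool64_.deallocate(p);
        else std::free(p);
    }

private:
    FixedPool pool16_, pool32_, pool64_;
};
```

A 12-byte request lands in the 16-byte pool and wastes 4 bytes, which is exactly the trade the answer describes. A production version would likely scan the bitmap a word at a time rather than bit by bit, and add one lock per pool for multithreaded use.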

The problem has a lot of variables, but if your application is written with garbage collection in mind, and if you exploit the special features of the Boehm collector, such as different allocation calls for blocks that don't contain pointers, then as a general rule your application
- Will have simpler interfaces
- Will run somewhat faster
- Will require from 1.2x to 2x the space
than a similar application using explicit memory management.
For documentation and evidence supporting these claims, you can see the information on Boehm's web site, and also Ben Zorn's several papers on the measured cost of conservative garbage collection.
Most importantly you'll save a ton of effort and won't have to worry about a significant class of memory-management bugs.
The issue of C vs C++ is orthogonal, but GC will definitely be faster than reference counting, especially when there's no compiler support for reference counting.

Neither C nor C++ will give you garbage collection for free. What they will give you is memory allocation libraries (which provide malloc/free, etc). There are many online resources on algorithms for writing garbage collection libraries. A good start is link text

Most non-GC languages allocate memory when it is needed and deallocate it as soon as it is no longer needed. GC'd languages usually allocate large chunks of memory beforehand and only free memory when idle, not in the middle of an intensive task, so I am going to say yes, if the GC kicks in at the right time.
The D programming language is a garbage collected language and ABI compatible with C and partly ABI compatible with C++. This Page shows some benchmarks between string performance in C++ and D.

I suggest that if you have written a program where memory allocation and deallocation (explicitly or GC'ed) is the bottleneck, then you should re-think your architecture, design and implementation.

If you don't want to explicitly manage memory, don't use C/C++. There are plenty of languages with either reference counting or compiler-supported garbage collectors that will probably work much better for you.
C/C++ are designed in an environment where the programmer manages their own memory. Trying to retrofit GC or ref counting onto them may help some, but you'll find that you either have to compromise the performance of the GC (because it doesn't have any compiler hinting as to where pointers might be), or you'll find new and fascinating ways that you can screw up the reference counts or the GC or whatever.
I know it sounds like a good idea, but really, you should just grab a language more suited to the task.
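One concrete example of reference counting going wrong, sketched here with `std::shared_ptr` (the answer does not name a specific mechanism, so this choice and the `Node` type are assumptions for illustration): two objects that hold strong references to each other keep each other's count above zero, so neither is ever destroyed unless the cycle is broken by hand or one edge is made a `std::weak_ptr`.

```cpp
#include <cassert>
#include <memory>

// Hypothetical node type that counts how many instances are alive.
struct Node {
    static int live;
    std::shared_ptr<Node> strong_next;  // owning edge
    std::weak_ptr<Node> weak_next;      // non-owning edge
    Node() { ++live; }
    ~Node() { --live; }
};
int Node::live = 0;

// Builds a two-node cycle of owning edges and reports how many nodes
// are still alive after the local handles are gone.
int nodes_leaked_by_strong_cycle() {
    int before = Node::live;
    std::weak_ptr<Node> observer;
    {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->strong_next = b;
        b->strong_next = a;  // a and b now keep each other alive
        observer = a;
    }
    int leaked = Node::live - before;
    // Break the cycle by hand so this demo does not really leak.
    if (auto a = observer.lock()) a->strong_next.reset();
    return leaked;
}

// Same shape, but the back edge is non-owning, so nothing is left behind.
int nodes_leaked_with_weak_back_edge() {
    int before = Node::live;
    {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->strong_next = b;
        b->weak_next = a;  // does not contribute to a's reference count
    }
    return Node::live - before;
}
```

A tracing garbage collector would reclaim the cycle without any of this; with reference counting, the programmer has to decide which edges own and which do not.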
