Compacting garbage collector implementation in C++0x - c++

I'm implementing a compacting garbage collector for my own personal use in C++0x, and I've got a question. Obviously the mechanics of the collector depend upon moving objects, and I've been wondering how to implement this in terms of the smart pointer types that point to them. I've been thinking about two options: store a pointer-to-pointer in the smart pointer type itself, or have the collector maintain a list of the pointers that refer to each object so that they can be updated on a move. The latter removes the need for a double dereference when accessing the object, but adds extra work during collection and additional memory overhead. What's the best way to go here?
Edit: My primary concern is for speedy allocation and access. I'm not concerned with particularly efficient collections or other maintenance, because that's not really what the GC is intended for.

There's nothing straightforward about grafting a GC onto C++, let alone a compacting one. It isn't clear exactly what you're trying to do and how it will interact with the rest of the C++ code.
I have actually written a GC in C++ which works with existing C++ code, and it had a compactor at one stage (though I dropped it because it was too slow). But there are many nasty semantic problems. I mentioned to Bjarne only a few weeks ago that C++ lacks the operator required to do it properly, and the situation is that it is unlikely to ever exist because it has limited utility.
What you actually need is a "re-address-me" operator. What happens is that you do not actually move objects around; you just use mmap to change the object's address. This is much faster and, in effect, uses the VM features to provide handles.
Without this facility you have to have a way to perform an overlapping move of an object, which you cannot do efficiently in C++: you'd have to move to a temporary first. In C it is much easier: you can use memmove. At some stage all the pointers to or into the moved objects have to be adjusted.
Using handles does not solve this problem; it just reduces it from arbitrarily sized objects to constant-sized ones. These are easier to manage in an array, but the same problem exists: you have to manage the storage. If you remove lots of handles from the array at random, you still have a problem with fragmentation.
So don't bother with handles, they don't work.
This is what I did in Felix: you call new(shape, collector) T(args). Here the shape is a descriptor of the type, including a list of offsets which contain (GC) pointers, and the address of a routine to finalise the object (by default, it calls the destructor).
It also contains a flag saying if the object can be moved with memmove. If the object is big or immobile, it is allocated by malloc. If the object is small and mobile, it is allocated in an arena, provided there is space in the arena.
The arena is compacted by moving all the objects in it, and using the shape information to globally adjust all the pointers to or into these objects. Compaction can be done incrementally.
The downside for a C++ programmer is the need to construct a correct shape object to pass. This doesn't bother me because I'm implementing a language which can generate the shape information automatically.
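For illustration, here is a minimal sketch of what such a shape descriptor might look like in C++. The names and layout are hypothetical, not the actual Felix runtime API:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical type descriptor passed alongside an allocation request.
// The offsets tell the collector where GC pointers live inside the object,
// so it can trace them precisely and patch them when the object is moved.
struct gc_shape {
    std::size_t size;                   // sizeof(T)
    std::vector<std::size_t> offsets;   // byte offsets of GC pointers within T
    void (*finalise)(void*);            // by default, calls T's destructor
    bool movable;                       // safe to relocate with memmove?
};

template <typename T>
void default_finalise(void* p) { static_cast<T*>(p)->~T(); }
```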
Now, the key point: to do compaction, you must use a precise collector. Compaction cannot work with a conservative collector. This is very important. It is fine to allow some leakage if you see a value that looks like a pointer but happens to be an integer: some object won't be collected, but this is usually no big deal. But for compaction you have to adjust the pointers, and you'd better not change that integer: so you have to know for sure when something is a pointer, so your collector has to be precise: the shape must be known.
In OCaml this is relatively simple: everything is either a pointer or an integer, and the low bit is used at run time to tell which. Objects pointed at carry a tag code telling the type, and there are only a few kinds: either a scalar (don't scan it) or an aggregate (scan it; it contains only integers or pointers).
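A rough sketch of that tagging idea in C++ (illustrative only, not OCaml's actual runtime representation):

```cpp
#include <cstdint>

// A machine word is either a pointer (low bit 0, from an aligned allocation)
// or a small integer shifted left with the low bit set to 1.
using word = std::uintptr_t;

inline bool is_int(word w)    { return (w & 1u) != 0; }
inline word box_int(long v)   { return (static_cast<word>(v) << 1) | 1u; }
inline long unbox_int(word w) { return static_cast<long>(static_cast<std::intptr_t>(w) >> 1); }
```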

This is a pretty straight-forward question so here's a straight-forward answer:
Mark-and-sweep (occasionally combined with mark-and-compact to avoid heap fragmentation) is the fastest when it comes to allocation and access (it avoids double dereferences). It's also very easy to implement. Since you're not worried about collection performance (mark-and-sweep tends to freeze the process nondeterministically), this should be the way to go.
Implementation details found at:
http://www.brpreiss.com/books/opus5/html/page424.html#secgarbagemarksweep
http://www.brpreiss.com/books/opus5/html/page428.html

A nursery generation will give you the best possible allocation performance because it is just a pointer bump.
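For illustration, a minimal sketch of a nursery with pointer-bump allocation (assuming a fixed-size buffer and a minor collection triggered when it fills up; names are made up):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class nursery {
    std::vector<std::uint8_t> buf;
    std::size_t top = 0;
public:
    explicit nursery(std::size_t bytes) : buf(bytes) {}

    // Allocation is just an aligned pointer bump; returns nullptr when the
    // nursery is full, which is the signal to run a minor collection.
    void* allocate(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        std::size_t p = (top + align - 1) & ~(align - 1);
        if (p + n > buf.size()) return nullptr;
        top = p + n;
        return buf.data() + p;
    }

    void reset() { top = 0; }  // after surviving objects have been evacuated
};
```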
You could implement pointer updates without double indirection by using techniques like a shadow stack, but this will be slow and very error prone if you're writing the C++ code by hand.
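For completeness, a bare-bones sketch of the shadow-stack idea: each GC-visible local registers the address of its pointer slot so the collector can find and rewrite it after a move. This only illustrates the bookkeeping involved (and why it is error prone by hand); gc_alloc below is a hypothetical allocation call:

```cpp
#include <vector>

// Global list of addresses of root pointer slots; the collector walks this
// both to find roots and to update them when objects move.
std::vector<void**> g_shadow_stack;

struct root {
    explicit root(void** slot) { g_shadow_stack.push_back(slot); }
    ~root() { g_shadow_stack.pop_back(); }
};

// Usage sketch: the local must stay registered for as long as it may hold a
// GC pointer.
// void f() {
//     void* obj = gc_alloc(/* ... */);   // hypothetical allocation call
//     root r(&obj);                      // collector can now find and update 'obj'
//     // ...
// }
```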

Related

Realloc creates dangling pointers?

I wrote some array code that allocates memory, where each value in the array is an object of some type. I then have another array that holds pointers into the first array, used as references.
Both arrays can grow, and they use realloc. Because the 2nd array contains pointers into the first, those pointers are not updated when the first array changes (I don't do it manually and there is no GC). Surely all the pointers in the 2nd array are then invalid! (They point to memory that was freed by realloc.)
This is the case right?
This seems like it would make persistent pointers to blocks of memory that may move very dangerous?
What is the standard solution? Don't use "global" pointers? Use pointers to pointers to pointers? I think I could make the 2nd array use **'s and could probably get things to work.
In an MT environment, things are even worse. The memory a local pointer refers to may be moved in the middle of an access, then changed, and the local pointer is now wrong. (Which, of course, might be solved by preventing the moves with a lock, etc...)
Go with functional programming?
Yes, realloc can invalidate your references. If there is no contiguous space to grow in place, your array will be moved.
Consider using a container such as std::deque.
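One reason std::deque helps here: unlike std::vector, push_back and push_front never invalidate references to existing elements (only insertion in the middle does), so pointers into it survive growth at the ends. A small demonstration:

```cpp
#include <deque>
#include <cassert>

int main() {
    std::deque<int> d;
    d.push_back(1);
    int* p = &d.front();        // keep a pointer to the first element
    for (int i = 0; i < 100000; ++i)
        d.push_back(i);         // growth at the ends does not move existing elements
    assert(*p == 1);            // still valid; a std::vector would likely have reallocated
}
```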
1) This is the case right?
Yes.
2) This seems like it would make persistent pointers to blocks of memory that may move very dangerous?
Yes.
3) What is the standard solution?
You design your application so that the lifetimes of your objects are well defined and you never refer to them after their lifetime has ended.
4) In an MT environment, things are even worse. The memory a local pointer refers to may be moved in the middle of an access, then changed, and the local pointer is now wrong. (Which, of course, might be solved by preventing the moves with a lock, etc...)
Obviously you should never use a pointer that no longer points to its resource. Managing shared resources in an MT environment is non-trivial, and there are a whole bunch of tools and techniques to achieve it.
5) Go with functional programming?
It is always advisable to avoid pointers if you can.
Without a specific problem it is hard to give a specific solution. But in order to avoid pointing at resources that have disappeared, we have various tools to employ: smart pointers, containers, and value semantics. We need to understand how to use all of these, but we also need to design with object lifetime in mind as a major consideration.
Object lifetime should always be an important factor. Some languages (like Java, for instance) mitigate bad design by providing a "safer" environment. C++, on the other hand, is rather less forgiving. However, it does have a whole bunch of sophisticated tools for the task. That means a steeper learning curve, but more efficiency and better control.

Should I free long-lived memory that would normally be freed at the very end of the program?

I am currently writing a library that parses some structured binary data into a set of objects. These objects are expected to outlive any user code, and would normally be freed at the end of, or after, the main function.
I am using shared (and weak) pointers to manage the memory of each object, but it is causing a lot of added complexity to the program, and raises structural issues that I will not get into in this particular question.
Considering that:
traversing the entirety of the binary data is expensive and I cannot afford to do it more than one time,
each visited entry is used to build an object, which then gets registered (i.e. added into the set),
entries in the binary data may rely on other entries that appear later; such an entry gets parsed immediately and registered when it is visited again,
duplicate entries may appear at any moment, but I need to merge those duplicates into one instance (and update any pointer referencing those duplicates to the new merged entry) before registration,
every single one of those objects is guaranteed to be of one of many POD types deriving from a common class, so nothing except memory needs to be cleaned up,
the resulting program will run on a modern OS (or, in any case, one that reclaims memory from dead processes),
I am very tempted to just use raw pointers, never free the memory taken by those objects and let the OS do its cleanup after the process exits.
What would be the best course of action?
If you're writing reusable code, you need to at least provide the option of cleaning up. What if some program uses your library for one operation, and then continues running? It's not safe to assume that the process exits immediately after your library's task is complete.
The other answers cover the general and standard approach: in an ideal world, yes, you'd clean up your memory, because it makes the code more generic and more reusable and helps with tooling. As others have said, std::unique_ptr for owning pointers and raw pointers for non-owning pointers should work well.
There are a couple of more specialized approaches that may or may not be useful:
Use a pool allocator (such as Boost.Pool, or roll your own) to allocate a bunch of memory up front and then dole out pieces of it for your objects. You can then free every object at once by deleting the pool; a rough sketch follows after this list.
Intentionally not freeing memory is occasionally a valid technique. See, e.g., "Increasing Compiler Performance by Over 75%", by Walter Bright. Of course, a compiler is a specialized problem domain, and Walter Bright is probably one of the top compiler developers alive, so techniques that work for his problem domain shouldn't be blindly applied elsewhere.
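As a sketch of the pool/arena idea mentioned above (Boost.Pool provides a ready-made version; this hand-rolled monotonic arena just illustrates freeing everything in one shot, and assumes the objects need no destructors):

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Monotonic arena: hands out raw chunks from large blocks and releases them
// all at once when the arena is destroyed. Suitable only for trivially
// destructible objects (or if you run destructors yourself before teardown).
class arena {
    std::vector<std::unique_ptr<std::uint8_t[]>> blocks;
    std::size_t used = 0, cap = 0;
    static constexpr std::size_t block_size = 1 << 20;  // 1 MiB per block
public:
    // Note: this sketch assumes n (plus alignment slack) fits in one block.
    void* allocate(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        used = (used + align - 1) & ~(align - 1);
        if (blocks.empty() || used + n > cap) {
            blocks.emplace_back(new std::uint8_t[block_size]);
            used = 0;
            cap = block_size;
        }
        void* p = blocks.back().get() + used;
        used += n;
        return p;
    }
};
```

Objects would then be constructed into the returned storage with placement new; when the arena goes out of scope, every block is released at once.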
the resulting program will run on a modern OS (or, in any case, one that reclaims memory from dead processes)
I am very tempted to just use raw pointers, never free the memory taken by those objects and let the OS do its cleanup after the process exits.
If you take this approach, then anyone who uses your library and then uses valgrind to try to detect memory leaks in their program will report massive leaks coming from your library and complain to you about it, so if I were you I definitely would not do this.
If you are writing a library then you should provide a cleanup function that frees all memory that you allocated.
A practical example of why this is useful is if a Windows DLL uses your library. When the library is loaded, static data is initialized. When the library is unloaded, static data is cleared. If your library has some global pointers to memory that is never freed, then load-unload cycles of the DLL will leak memory.
If the objects are all of the same type, then rather than allocating each one independently, you could just put them all into a vector and have them refer to each other by index number instead of using pointers. The vector's built-in memory management takes care of allocating space as needed, and when you're done with the objects, you can just destroy the vector to deallocate them all at once. (Note that vector::clear() doesn't actually free the memory, though it does make it available to store a new set of objects in the vector.)
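A small illustration of the index-based referencing idea, using a made-up entry type:

```cpp
#include <cstddef>
#include <vector>

struct entry {
    int value;
    std::size_t parent;  // index into the same vector instead of an entry*
};

int main() {
    std::vector<entry> entries;
    entries.push_back({42, 0});   // element 0 is its own parent (root)
    entries.push_back({7, 0});    // refers to element 0 by index
    // Indices stay valid even when push_back reallocates the vector's storage,
    // which would have invalidated raw pointers to the elements.
    int root_value = entries[entries[1].parent].value;  // == 42
    (void)root_value;
}
```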
If your objects aren't all the same type, you'll want to look into the more general concept of region-based memory management. As above, the idea is that you can allocate all your objects in a relatively small number of memory chunks (possibly just one), which can be freed later without having to track all the individual objects allocated within.
If your ownership and lifetimes are clear I suggest you use unique_ptr for the owning pointers and raw pointers for the non-owning pointers. It should be less complex than shared_ptr and weak_ptr whilst still managing memory automatically.
I don't think not managing memory at all is an option. But I think using smart pointers to express ownership is not just about good memory management it also makes code easier to reason about.
Try to think of future maintenance work. Suppose your code needs to be broken up, or other work needs to happen after it; in that case you're opening yourself up to leaks, or to being a resource hog, later down the line.
Cleaning up (or being able to do so) is good. It may seem obvious now that an application should work with a single structured binary dataset throughout its entire lifetime, but you'll start kicking yourself once you realize you need to write an application that needs to reset half-way through and start over with another dataset.
(a related thing that's easy to overlook is that an application may need to work with two completely independent datasets at the same time, so try not to design your library to exclude that use case!)
That said, I think you may be focusing too much on the extremes. Code that shouldn't participate in memory management can use raw pointers, and this is reasonable when there is no risk of these pointers outliving your structured dataset in memory.
However, that doesn't mean that code that does participate in memory management needs to use raw pointers too. You can use smart pointers to manage your data structures even if you are passing raw pointers out to the user.
That aside, keep in mind that, in my experience, pointers are usually the wrong semantics. Most use cases are most natural with reference or value semantics, which means you should be passing around raw references, or passing around lightweight wrapper classes that have reference or value semantics but are implemented as containing a pointer to the actual data (or even a copy of the actual data, if appropriate).

Why use new and delete at all?

I'm new to C++ and I'm wondering why I should even bother using new and delete? It can cause problems (memory leaks) and I don't get why I shouldn't just initialize a variable without the new operator. Can someone explain it to me? It's hard to google that specific question.
For historical and efficiency reasons, C++ (and C) memory management is explicit and manual.
Sometimes, you might allocate on the call stack (e.g. by using VLAs or alloca(3)). However, that is not always possible, because
stack size is limited (depending on the platform, to a few kilobytes or a few megabytes).
memory needs are not always FIFO or LIFO. It does happen that you need to allocate memory which will be freed (or become useless) much later during execution, in particular because it might be the result of some function (and the caller, or its caller, would then release that memory).
You definitely should read about garbage collection and dynamic memory allocation. In some languages (Java, OCaml, Haskell, Lisp, ...) or systems, a GC is provided and is in charge of releasing the memory of useless (more precisely, unreachable) data. Read also about weak references. Notice that most GCs need to scan the call stack for local pointers.
Notice that it is possible, but difficult, to have quite efficient garbage collectors (though usually not in C++). For some programs, OCaml (with a generational copying GC) is faster than the equivalent C++ code with explicit memory management.
Managing memory explicitly has the advantage (important in C++) that you don't pay for something you don't need. It has the inconvenience of putting more burden on the programmer.
In C or C++ you might sometimes consider using Boehm's conservative garbage collector. With C++ you might sometimes need to use your own allocator instead of the default std::allocator. Read also about smart pointers, reference counting, std::shared_ptr, std::unique_ptr, std::weak_ptr, the RAII idiom, and the rule of three (in C++11, becoming the rule of five). The recent wisdom is to avoid explicit new and delete (e.g. by using standard containers and smart pointers).
Be aware that the most difficult situations in managing memory are arbitrary, possibly circular, graphs of references.
On Linux and some other systems, valgrind is a useful tool to hunt memory leaks.
The alternative, allocating on the stack, will cause you trouble, as stack sizes are often limited to megabyte magnitudes and you'll get lots of value copies. You'll also have problems sharing stack-allocated data between function calls.
There are alternatives: using std::shared_ptr (C++11 onwards) will do the delete for you once the shared pointer is no longer being used. A technique referred to by the hideous acronym RAII is exploited by the shared pointer implementation; I mention it explicitly since most resource-cleanup idioms are RAII-based. You can also make use of the comprehensive data structures available in the C++ Standard Template Library, which eliminate the need to get your hands too dirty with explicit memory management.
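A minimal example of shared_ptr doing the delete for you (Widget is just a placeholder type):

```cpp
#include <memory>
#include <vector>

struct Widget { int data[100]; };

int main() {
    auto w = std::make_shared<Widget>();      // heap allocation, no explicit new/delete
    std::vector<std::shared_ptr<Widget>> v;
    v.push_back(w);                           // ownership is now shared with the vector
    v.clear();                                // one owner (w) remains
}                                             // last owner goes out of scope: Widget is deleted
```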
But formally, every new must be balanced with a delete. Similarly for new[] and delete[].
Indeed, in many cases new and delete are not needed; you can just use standard containers instead and leave the allocation/deallocation management to them.
One of the reasons for which you may need to use allocation explicitly is for objects where the identity is important (i.e. they are not just values that can be copied around).
For example, if you have a GUI "window" object, then making copies probably doesn't make sense, and thus you're more or less ruling out all standard containers (they're designed for objects that can be copied and assigned). In this case, if the object needs to survive the function that creates it, the simplest solution is probably to allocate it explicitly on the heap, possibly using a smart pointer to avoid leaks or use-after-delete.
In other cases it may be important to avoid copies not because they're illegal, but because they're not very efficient (big objects), and explicitly handling the instance's lifetime may be a better (faster) solution.
Another case where explicit allocation/deallocation may be the best option are complex data structures that cannot be represented by the standard library (for example a tree in which each node is also part of a doubly-linked list).
Modern C++ styles often frown on explicit calls to new and delete outside of specialized resource management code.
This is not because the stack/automatic storage is sufficient, but rather because RAII-based smart resource owners (be they containers, shared pointers, or something else) make almost all direct memory wrangling unnecessary. And as the problem of memory management is often error prone, this makes your code more robust, easier to read, and sometimes faster (as the fancy resource owners can use techniques you might not bother with everywhere).
This is exemplified by the rule of zero: write no destructor, no copy/move assignment operators, and no copy/move constructors. Store state in smart storage, and have it handle everything for you.
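For example, a class written to the rule of zero: it declares none of the special member functions, because its members already manage their own memory:

```cpp
#include <string>
#include <vector>

// Rule of zero: all state lives in types that already own their resources,
// so the compiler-generated destructor, copy and move operations are correct.
class Person {
    std::string name;
    std::vector<int> scores;
public:
    Person(std::string n, std::vector<int> s)
        : name(std::move(n)), scores(std::move(s)) {}
    // no ~Person, no copy/move constructors or assignment operators needed
};
```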
None of the above applies when you yourself are writing smart memory owning classes. This is a rare thing to need to do, however. It also requires C++14 (for make_unique) to get rid of the penultimate excuse to call new.
Now, the free store is still used under the above style, just not directly. The free store (aka the heap) is needed because automatic storage (aka the stack) only supports really simple object-lifetime rules (scope-based, compile-time deterministic size and count, FILO order). As runtime-sized and runtime-counted data is common, and object lifetime is often not that simple, the free store is used by most programs. Sometimes copying an object around on the stack is enough to make the simple lifetime less of a problem, but at other times identity is important.
The final reason is stack overflow. On some C++ implementations the stack/automatic storage is seriously constrained in size. What's more, there is rarely if ever a reliable failure mode when you put too much stuff in it. By storing large data on the free store, we can reduce the chance that the stack will overflow.
First, if you don't need dynamic allocation, don't use it.
The most frequent reason for needing dynamic allocation is that the object will have a lifetime which is determined by the program logic rather than lexical scope. The new and delete operators are designed to support explicitly managed lifetimes.
Another common reason is that the size or structure of the "object" is determined at runtime. For simple cases (arrays, etc.) there are standard classes (std::vector) which will handle this for you, but for more complicated structures (e.g. graphs and trees), you'll have to do this yourself. (The usual technique here is to create a class representing the graph or tree, and have it manage the memory.)
And there is the case where the object must be polymorphic, and the actual type won't be known until runtime. (There are some tricky ways of handling this without dynamic allocation in the simplest cases, but in general, you'll need dynamic allocation.) In this case, std::unique_ptr might be appropriate to handle the delete, or, if the object must be shared, std::shared_ptr (although usually, objects which must be shared fall into the first category above, and so smart pointers aren't appropriate).
There are probably other reasons as well, but these are the three that I've encountered the most often.
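A small sketch of that third (polymorphic) case handled with std::unique_ptr; the Shape/Circle types are illustrative:

```cpp
#include <memory>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Circle : Shape {
    double r;
    explicit Circle(double r) : r(r) {}
    double area() const override { return 3.141592653589793 * r * r; }
};

// The concrete type is chosen at runtime, so the object must live on the free
// store; unique_ptr takes care of the matching delete.
std::unique_ptr<Shape> make_shape(bool want_circle) {
    if (want_circle)
        return std::unique_ptr<Shape>(new Circle(1.0));
    return nullptr;
}
```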
Only in simple programs can you know beforehand how much memory you'll use; in general you cannot foresee it.
However, with modern C++11 you generally rely on standard library types like vector and map for memory allocation, and the use of smart pointers helps you avoid memory leaks, so you don't really need to use new and delete explicitly by hand.
When you use new, your object is stored on the heap, and it remains there until you manually delete it. Without new, your object goes on the stack and is destroyed automatically when it goes out of scope.
The stack is set to a fixed size, so if there is no room left for a new object, a stack overflow occurs. This often happens when a lot of nested functions are being called, or if there is an infinite recursive call. If the current size of the heap is too small to accommodate new memory, more memory can be added to the heap by the operating system.
Another reason may be if you are explicitly calling an external library or API with a C-style interface. Setting up a callback in such cases often means context data must be supplied to and returned in the callback, and such an interface usually provides only a 'simple' void* or int*. Allocating an object or struct with new is appropriate for such cases (you can delete it later in the callback, should you need to).
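A typical shape of that pattern, with a hypothetical C-style registration function (not a real library API):

```cpp
// Hypothetical C-style API (normally declared in the library's header): it
// stores user_data and later passes it back to the callback unchanged.
void register_callback(void (*cb)(int event, void* user_data), void* user_data);

struct Context {
    int request_id;
    // ... whatever the callback needs ...
};

void on_event(int event, void* user_data) {
    Context* ctx = static_cast<Context*>(user_data);
    // ... use ctx ...
    if (event == 0)      // assume 0 means "finished" for this sketch
        delete ctx;      // free the context once the callback chain is done
}

void start() {
    Context* ctx = new Context{42};   // must outlive the current scope
    register_callback(&on_event, ctx);
}
```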

Is it a bad idea to store large vectors in the stack?

I've been working on a bunch of image processing programs.. nothing fancy, mostly experimenting quick and dirty. The image data is stored in vectors which are declared on the stack (I try to avoid having to use pointers when I don't need to pass data around). I've noticed that some of my functions have been behaving very strangely despite countless amounts of debugging and stepping. Sometimes the debugger would give me an error that it cannot evaluate a certain variable, among other things. Things generally just do not make sense, and past experience tells me that when this happens this means that there is some kind of overflow or memory corruption going on. The first thing that came to mind was that it was probably due to me storing lots of image data into vectors.
However, I was under the impression that vectors store their actual data in the heap, and so I thought it wouldn't hurt to have a few of these large vectors on the stack. Am I wrong in thinking this? Should I be allocating my vectors and storing them in the heap rather than the stack?
Thanks,
[...]vectors store their actual data in the heap
vector, like all other containers, uses an allocator object for memory management. Typically, if you don't specify anything as the template's second parameter, the default allocator -- std::allocator from <memory> -- is used. It is the allocator's responsibility to reserve memory. It is free to do so either from the free-store or on the stack.
Most implementations typically use the pimpl idiom and store a pointer within the vector object which points to the actual memory on the free-store.
I've noticed that some of my functions have been behaving very strangely despite countless amounts of debugging and stepping
You may want to check that you are using your vectors properly. Look up the standard for the guarantees you get with each member function, what conditions must be satisfied by the contained types, and when your iterators get invalidated. That should be a good start.
std::vector does not store its memory within itself. It allocates memory from the heap (or wherever your allocator gets it from). So whether the vector itself is on the stack is irrelevant.
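You can see this by comparing the size of the vector object itself with the size of the data it manages:

```cpp
#include <iostream>
#include <vector>

int main() {
    std::vector<double> v(1000000);   // ~8 MB of element data
    // The vector object itself is just a few pointers (typically 24 bytes on a
    // 64-bit platform); the elements live on the heap.
    std::cout << "sizeof(v) = " << sizeof v << '\n';
    std::cout << "data size = " << v.size() * sizeof(double) << " bytes\n";
}
```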
I would be willing to say that 99.9% of vector implementations store all of their data on the heap. Maybe somebody out there has made a stack-based implementation, but you're probably not dealing with that. If random, intermittent failures are occurring, a corner case not getting checked in pointer arithmetic is more likely the cause. Either way, we can't know without you posting code.

Why don't some languages allow declaration of pointers?

I'm programming in C++ right now, and I love using pointers. But it seems that other, newer languages like Java, C#, and Python don't allow you to explicitly declare pointers. In other words, you can't write both int x and int * y, and have x be a value while y is a pointer, in any of those languages. What is the reasoning behind this?
Pointers aren't bad, they are just easy to get wrong. In newer languages they have found ways of doing the same things, but with less risk of shooting yourself in the foot.
There is nothing wrong with pointers though. Go ahead and love them.
Regarding your example: why would you want both x and y pointing to the same memory? Why not just always call it x?
One more point: pointers mean that you have to manage the memory lifetime yourself. Newer languages prefer to use garbage collection to manage memory, and allowing pointers would make that task quite difficult.
I'll start with one of my favorite Scott Meyers quotes:
When I give talks on exception handling, I teach people two things:
POINTERS ARE YOUR ENEMIES, because they lead to the kinds of problems that auto_ptr is designed to eliminate.
POINTERS ARE YOUR FRIENDS, because operations on pointers can't throw.
Then I tell them to have a nice day :-)
The point is that pointers are extremely useful and it's certainly necessary to understand them when programming in C++. You can't understand the C++ memory model without understanding pointers. When you are implementing a resource-owning class (like a smart pointer, for example), you need to use pointers, and you can take advantage of their no-throw guarantee to write exception-safe resource owning classes.
However, in well-written C++ application code, you should never have to work with raw pointers. Never. You should always use some layer of abstraction instead of working directly with pointers:
Use references instead of pointers wherever possible. References cannot be null and they make code easier to understand, easier to write, and easier to code review.
Use smart pointers to manage any pointers that you do use. Smart pointers like shared_ptr, auto_ptr, and unique_ptr help to ensure that you don't leak resources or free resources prematurely.
Use containers like those found in the standard library for storing collections of objects instead of allocating arrays yourself. By using containers like vector and map, you can ensure that your code is exception safe (meaning that even when an exception is thrown, you won't leak resources).
Use iterators when working with containers. It's far easier to use iterators correctly than it is to use pointers correctly, and many library implementations provide debug support for helping you to find where you are using them incorrectly.
When you are working with legacy or third-party APIs and you absolutely must use raw pointers, write a class to encapsulate usage of that API.
C++ has automatic resource management in the form of Scope-Bound Resource Management (SBRM, also called Resource Acquisition is Initialization, or RAII). Use it. If you aren't using it, you're doing it wrong.
Pointers can be abused, and managed languages prefer to protect you from potential pitfalls. However, pointers are certainly not bad - they're an integral feature of the C and C++ languages, and writing C/C++ code without them is both tricky and cumbersome.
A true "pointer" has two characteristics.
It holds the address of another object (or primitive)
and exposes the numeric nature of that address so you can do arithmetic.
Typically the arithmetic operations defined for pointers are:
Adding an integer to a pointer into an array, which returns the address of another element.
Subtracting two pointers into the same array, which returns the number of elements in-between (inclusive of one end).
Comparing two pointers into the same array, which indicates which element is closer to the head of the array.
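For example, all three operations on a small array:

```cpp
#include <cstddef>

int main() {
    int a[5] = {10, 20, 30, 40, 50};
    int* p = a;                   // points at a[0]
    int* q = p + 3;               // adding an integer: now points at a[3]
    std::ptrdiff_t n = q - p;     // subtracting two pointers: 3 elements apart
    bool closer = p < q;          // comparing: p is closer to the head of the array
    (void)n; (void)closer;
}
```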
Managed languages generally lead you down the road of "references" instead of pointers. A reference also holds the address of another object (or primitive), but arithmetic is disallowed.
Among other things, this means you can't use pointer arithmetic to walk off the end of an array and treat some other data using the wrong type. The other way of forming an invalid pointer is taken care of in such environments by using garbage collection.
Together this ensures type-safety, but at a terrible loss of generality.
I try to answer OP's question directly:
In other words, you can't write both int x and int * y, and have x be a value while y is a pointer, in any of those languages. What is the reasoning behind this?
The reason behind this is the managed memory model in these languages. In C# (or Python, or Java, ...) the lifetime of resources, and thus the usage of memory, is managed automatically by the underlying runtime, or by its garbage collector, to be precise. Briefly: the application has no control over the location of a resource in memory. It is not specified, and is not even guaranteed to stay constant during the lifetime of a resource. Hence, the notion of a pointer as 'a location of something in virtual or physical memory' is completely irrelevant.
As someone already mentioned, pointers can, and actually will, go wrong if you have a massive application. This is one of the reasons we sometimes see Windows having problems due to NULL pointers! I personally don't like pointers because they can cause terrible memory leaks, and no matter how well you manage your memory, they will eventually hunt you down in some way. I experienced this a lot with OpenCV when working on image processing applications. Having lots of pointers floating around, putting them in a list and then retrieving them later caused problems for me. But again, there are good sides to using pointers, and they are often a good way to tune up your code. It all depends on what you are doing, what specs you have to meet, etc.
Pointers aren't bad. Pointers are difficult. The reason you can't have int x and int * y in Java, C#, etc., is that such languages want to keep you away from coding problems of this kind (which eventually become subtle mistakes of yours), and want to bring you closer to producing solutions for your project. They want to give you productivity.
As I said before, pointers aren't bad, they're just difficult. In a Hello World program pointers seem to be a piece of cake. Nevertheless, when the program starts growing, the complexity of managing pointers, passing the correct values, counting the objects, deleting the pointers, etc. starts to pile up.
Now, from the point of view of the programmer (the user of the language, in this case you), this leads to another problem that becomes evident over time: the longer you go without reading the code, the more difficult it becomes to understand it again (i.e. projects from past years, months or even days). Add to this the fact that sometimes pointers use more than one level of indirection (TheClass ** ptr).
Last but not least, there are areas where pointers are very useful. As I mentioned before, they aren't bad. In the algorithms field they are very useful, because in a mathematical context you can refer to a specific value using a simple pointer (i.e. int *) and change its value without creating another object, and ironically it becomes easier to implement the algorithms with pointers than without them.
Reminder: each time you ask why pointers or anything else are bad or good, try instead to think of the historical context in which the topic or technology emerged and the problem it tried to solve. In this case, pointers were needed to access the memory of the PDP computers at Bell Laboratories.