"Thinking in C++": operation of "Stash" library - c++

Stash library in "Thinking in C++" by Bruce Eckel:
Basically he seems to be setting up an array-index-addressable interface (via fetch) to a set of entities that are actually stored at random memory locations, without actually copying those data entities, in order to simulate for the user the existence of an actual contiguous-memory data block. In short, a contiguous, indexed address map. Do I have this right? Also, his mapping is on a byte-by-byte basis; if it were not for this requirement (and I am unsure of its importance), I believe that there may be simpler ways to generate such a data structure in C++. I looked into memcpy, but do not see how to actually copy data on a byte-by-byte basis to create such an indexed structure.
Prior posting:
This library appears to create a pointer assemblage, not a data-storage assemblage.
Is this true? (Applies to both the C and C++ versions.) Thus the name "stash" might be a little misleading, as nothing but pointers to data stashed elsewhere is put into a "stash," and Eckel states that "the data is copied."
Background: Looking at “add” and “inflate,” the so-called “copying” is equating pointers to other pointers (“storage” to “e” in “add” and “b” to “storage” in “inflate”). The use of “new” in this case is strange to me, because storage for data is indeed allocated but “b” is set to the address of the data, and no data assignments seem to take place in the entire library. So I am not sure what the point of the “allocation” by “new” is when the allocated space is apparently never written into or read from in the library. The “element” to be added exists elsewhere in memory already, and seemingly all we are doing is creating a sequential pointer structure to each byte of every “element” desired to be reference-able through CStash. Do I understand this library correctly?
Similarly, it looks as though “stack” in the section “Nested structures” appears actually to work only with addresses of data, not with data. I wrote my own linked-list stack successfully, which actually stores data in the stack nodes.

Related

C++ garbage collection

There are a number of garbage collection libraries for C++.
I am kind of confused how the pointer tracking works.
In particular, suppose we have a base pointer P and a list of other pointers who are computed as offsets from P using an array.
Ex,
P2 = P+offset[0]
How does the garbage collector know P2 is still in scope? It has no direct reference but it's still accessible.
Probably the most popular C++ gc is
https://en.m.wikipedia.org/wiki/Boehm_garbage_collector
But following their example syntax it seems very easy to break so I must not be understanding something.
This question cannot be answered in general. There are different systems that may be regarded as garbage collection for C++; for example, Herb Sutter's deferred_ptr is basically a garbage collecting smart pointer. I've personally implemented another version of this idea, similar to Sutter's but less fancy.
I can answer about Boehm, however. How the Boehm garbage collector recognizes pointers when it does its "mark" phase, is basically by scanning memory and assuming that things that look like pointers are pointers.
The garbage collector knows all the areas of memory where user data is and it knows all of the pointers that it has allocated and how big those allocations were. It just looks for chains of pointers starting from "root segments" defined as below, where by "look" we mean explicitly scanning memory for 64 bit values that are the same as one of the GC allocations it has done.
From here:
Since it cannot generally tell where pointer variables are located, it
scans the following root segments for pointers:
The registers. Depending on the architecture, this may be done using assembly code, or by calling a setjmp-like function which
saves register contents on the stack.
The stack(s). In the case of a single-threaded application, on most platforms this is done by scanning the memory between (an
approximation of) the current stack pointer and GC_stackbottom. (For
Itanium, the register stack scanned separately.) The GC_stackbottom
variable is set in a highly platform-specific way depending on the
appropriate configuration information in gcconfig.h. Note that the
currently active stack needs to be scanned carefully, since
callee-save registers of client code may appear inside collector
stack frames, which may change during the mark process. This is
addressed by scanning some sections of the stack "eagerly",
effectively capturing a snapshot at one point in time.
Static data region(s). In the simplest case, this is the region between DATASTART and DATAEND, as defined in gcconfig.h. However, in
most cases, this will also involve static data regions associated
with dynamic libraries. These are identified by the mostly
platform-specific code in dyn_load.c.
The address space for 64-bit pointers is huge so false positives will be rare, but even if they occur, false positives would just be leaks, that last as long as there happens to be some other variable in the memory the mark phase scans that is exactly the same value as some 64-bit pointer that was allocated by the garbage collector.

What does HeapValidate in windows do?

I have been reading up on HeapValidate in an existing code and trying to figure out what does it do. The documentation says that it checks whether the heap control structures are in a consistent state. What does that mean?
The heap is a data structure like any other, and in its metadata-state-variables there are certain conditions that should always be true. As a made-up example, the number of children in a heap-tree-node should always be a non-negative number; so if HeapValidate() reads a child-count variable and sees that it is negative, it knows something has gone badly wrong and can flag that block as broken.
You might wonder, assuming Microsoft’s heap code does not have any bugs, how the heap’s metadata might get in to an invalid/“impossible” state like that in the first place. Since the heap’s metadata structures live in the same address space that the user-code has access to, it’s usually the result of buggy user code writing some other data via an invalid pointer that happens to point at a memory location where the heap’s metadata field happens to reside, silently overwriting/corrupting the metadata.

C++ - Managing References in Disk Based Vector

I am developing a set of vector classes that all derived from an abstract vector. I am doing this so that in our software that makes use of these vectors, we can quickly switch between the vectors without any code breaking (or at least minimize failures, but my goal is full compatibility). All of the vectors match.
I am working on a Disk Based Vector that mostly conforms to match the STL Vector implementation. I am doing this because we need to handle large out of memory files that contain various formats of data. The Disk Vector handles data read/write to disk by using template specialization/polymorphism of serialization and deserialization classes. The data serialization and deserialization has been tested, and it works (up to now). My problem occurs when dealing with references to the data.
For example,
Given a DiskVector dv, a call to dv[10] would get a point to a spot on disk, then seek there, read out the char stream. This stream gets passed to a deserializor which converts the byte stream into the appropriate data type. Once I have the value, I my return it.
This is where I run into a problem. In the STL, they return it as a reference, so in order to match their style, I need to return a reference. What I do it store the value in an unordered_map with the given index (in this example, 10). Then I return a reference to the value in the unordered_map.
If this continues without cleanup, then the purpose of the DiskVector is lost because all the data just gets loaded into memory, which is bad due to data size. So I clean up this map by deleting the indexes later on when other calls are made. Unfortunately, if a user decided to store this reference for a long time, and then it gets deleted in the DiskVector, we have a problem.
So my questions
Is there a way to see if any other references to a certain instance are in use?
Is there a better way to solve this while still maintaining the polymorphic style for reasons described at the beginning?
Is it possible to construct a special class that would behave as a reference, but handle the disk IO dynamically so I could just return that instead?
Any other ideas?
So a better solution at what I was trying to do is to use SQLite as the backend for the database. Use BLOBs as the column types for key and value columns. This is the approach I am taking now. That said, in order to get it to work well, I need to use what cdhowie posted in the comments to my question.

Why are empty classes 8 bytes and larger classes always > 8 bytes?

class foo { }
writeln(foo.classinfo.init.length); // = 8 bytes
class foo { char d; }
writeln(foo.classinfo.init.length); // = 9 bytes
Is d actually storing anything in those 8 bytes, and if so, what? It seems like a huge waste, If I'm just wrapping a few value types then the the class significantly bloats the program, specifically if I am using a lot of them. A char becomes 8 times larger while an int becomes 3 times as large.
A struct's minimum size is 1 byte.
In D, object have a header containing 2 pointer (so it may be 8bytes or 16 depending on your architecture).
The first pointer is the virtual method table. This is an array that is generated by the compiler filled with function pointer, so virtual dispatch is possible. All instances of the same class share the same virtual method table.
The second pointer is the monitor. It is used for synchronization. It is not sure that this field stay here forever, because D emphasis local storage and immutability, which make synchronization on many objects useless. As this field is older than these features, it is still here and can be used. However, it may disapear in the future.
Such header on object is very common, you'll find the same in Java or C# for instance. You can look here for more information : http://dlang.org/abi.html
D uses two machine words in each class instance for:
A pointer to the virtual function table. This contains the addresses of virtual methods. The first entry points towards the class's classinfo, which is also used by dynamic casts.
The monitor, which allows the synchronized(obj) syntax, documented here.
These fields are described in the D documentation here (scroll down to "Class Properties") and here (scroll down to "Classes").
I don't know the particulars of D, but in both Java and .net, every class object contains information about its type, and also holds information about whether it's the target of any monitor locks, whether it's eligible for finalization cleanup, and various other things. Having a standard means by which all objects store such information can make many things more convenient for both users and implementers of the language and/or framework. Incidentally, in 32-bit versions of .net, the overhead for each object is 8 bytes except that there is a 12-byte minimum object size. This minimum stems from the fact that when the garbage-collector moves objects around, it needs to temporarily store in the old location a reference to the new one as well as some sort of linked data structure that will permit it to examine arbitrarily-deep nested references without needing an arbitrarily-large stack.
Edit
If you want to use a class because you need to be able to persist references to data items, space is at a premium, and your usage patterns are such that you'll know when data items are still useful and when they become obsolete, you may be able to define an array of structures, and then pass around indices to the array elements. It's possible to write code to handle this very efficiently with essentially zero overhead, provided that the structure of your program allows you to ensure that every item that gets allocated is released exactly once and things are not used once they are released.
If you would not be able to readily determine when the last reference to an object is going to go out of scope, eight bytes would be a very reasonable level of overhead. I would expect that most frameworks would force objects to be aligned on 32-bit boundaries (so I'm surprised that adding a byte would push the size to nine rather than twelve). If a system is going have a garbage collector that works better than a Commodore 64(*), it would need to have an absolute minimum of a bit of overhead per object to indicate which things are used and which aren't. Further, unless one wants to have separate heaps for objects which can contain supplemental information and those which can't, one will every object to either include space for a supplemental-information pointer, or include space for all the supplemental information (locking, abandonment notification requests, etc.). While it might be beneficial in some cases to have separate heaps for the two categories of objects, I doubt the benefits would very often justify the added complexity.
(*) The Commodore 64 garbage collector worked by allocating strings from the top of memory downward, while variables (which are not GC'ed) were allocated bottom-up. When memory got full, the system would scan all variables to find the reference to the string that was stored at the highest address. That string would then be moved to the very top of memory and all references to it would be updated. The system would then scan all variables to find the reference to the string at the highest address below the one it just moved and update all references to that. The process would repeat until it didn't find any more strings to move. This algorithm didn't require any extra data to be stored with strings in memory, but it was of course dog slow. The Commodore 128 garbage collector stored with each string in GC space a pointer to the variable that holds a reference and a length byte that could be used to find the next lower string in GC space; it could thus check each string in order to find out whether it was still used, relocating it to the top of memory if so. Much faster, but at the cost of three bytes' overhead per string.
You should look into the storage requirements for various types. Every instruction, storage allocation (ie:variable/object, etc) uses up a specific amount of space. In c# an Int32 type integer object should store integer information to the tune of 4 bytes (32bit). It might have other information, too, because it is an object, but your character data type probably only requires 1 byte of information. If you have constructs like for or while in your class, those things will take up space, too, because each of those things is telling your class to do something. The class itself requires a number of instructions to be created in memory, which would account for the 8 initial bytes.
Take an assembler language course. You'll learn all you ever wanted to know and then some about why your programs use however much memory or take up however much storage when compiled.

mmap-loadable data structure library for C++ (or C)

I have a some large data structure (N > 10,000) that usually only needs to be created once (at runtime), and can be reused many times afterwards, but it needs to be loaded very quickly. (It is used for user input processing on iPhoneOS.) mmap-ing a file seems to be the best choice.
Are there any data structure libraries for C++ (or C)? Something along the line
ReadOnlyHashTable<char, int> table ("filename.hash");
// mmap(...) inside the c'tor
...
int freq = table.get('a');
...
// munmap(...); inside the d'tor.
Thank you!
Details:
I've written a similar class for hash table myself but I find it pretty hard to maintain, so I would like to see if there's existing solutions already. The library should
Contain a creation routine that serialize the data structure into file. This part doesn't need to be fast.
Contain a loading routine that mmap a file into read-only (or read-write) data structure that can be usable within O(1) steps of processing.
Use O(N) amount of disk/memory space with a small constant factor. (The device has serious memory constraint.)
Small time overhead to accessors. (i.e. the complexity isn't modified.)
Assumptions:
Bit representation of data (e.g. endianness, encoding of float, etc.) does not matter since it is only used locally.
So far the possible types of data I need are integers, strings, and struct's of them. Pointers do not appear.
P.S. Can Boost.intrusive help?
You could try to create a memory mapped file and then create the STL map structure with a customer allocator. Your customer allocator then simply takes the beginning of the memory of the memory mapped file, and then increments its pointer according to the requested size.
In the end all the allocated memory should be within the memory of the memory mapped file and should be reloadable later.
You will have to check if memory is free'd by the STL map. If it is, your customer allocator will lose some memory of the memory mapped file but if this is limited you can probably live with it.
Sounds like maybe you could use one of the "perfect hash" utilities out there. These spend some time opimising the hash function for the particular data, so there are no hash collisions and (for minimal perfect hash functions) so that there are no (or at least few) empty gaps in the hash table. Obviously, this is intended to be generated rarely but used frequently.
CMPH claims to cope with large numbers of keys. However, I have never used it.
There's a good chance it only generates the hash function, leaving you to use that to generate the data structure. That shouldn't be especially hard, but it possibly still leaves you where you are now - maintaining at least some of the code yourself.
Just thought of another option - Datadraw. Again, I haven't used this, so no guarantees, but it does claim to be a fast persistent database code generator.
WRT boost.intrusive, I've just been having a look. It's interesting. And annoying, as it makes one of my own libraries look a bit pointless.
I thought this section looked particularly relevant.
If you can use "smart pointers" for links, presumably the smart pointer type can be implemented using a simple offset-from-base-address integer (and I think that's the point of the example). An array subscript might be equally valid.
There's certainly unordered set/multiset support (C++ code for hash tables).
Using cmph would work. It does have the serialization machinery for the hash function itself, but you still need to serialize the keys and the data, besides adding a layer of collision resolution on top of it if your query set universe is not known before hand. If you know all keys before hand, then it is the way to go since you don't need to store the keys and will save a lot of space. If not, for such a small set, I would say it is overkill.
Probably the best option is to use google's sparse_hash_map. It has very low overhead and also has the serialization hooks that you need.
http://google-sparsehash.googlecode.com/svn/trunk/doc/sparse_hash_map.html#io
GVDB (GVariant Database), the core of Dconf is exactly this.
See git.gnome.org/browse/gvdb, dconf and bv
and developer.gnome.org/glib/2.30/glib-GVariant.html