C++ - Map-like data structure with structural sharing/immutability

Functional programming languages often work on immutable data structures but stay efficient through structural sharing. For example, when you insert an element into a map, you do not modify the existing map but create a new, updated version. To avoid massive copying and memory usage, the two versions share as much of the unchanged data as possible.
I would like to know whether a template library exists that provides such a map-like data structure for C++. I searched a bit and found nothing besides internal classes in LLVM.

A copy-on-write B+ tree sounds like what you're looking for. It basically creates a new snapshot of itself every time it gets modified, but it shares unmodified nodes between versions. Most of the implementations I've seen tend to be baked into append-only database log files. CouchDB has a very nice write-up on them. They are, however, "relatively easy" to implement, as far as map data structures go.
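The sharing idea can be sketched in a few lines of C++17. This is only an illustration under simplifying assumptions: it uses a plain unbalanced binary search tree rather than a B+ tree, and the names PersistentMap and insert_node are made up for this example. Each insert copies only the nodes on the search path; every other subtree is shared between the old and the new version through shared_ptr.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical persistent map: an unbalanced BST with path copying.
template <typename K, typename V>
class PersistentMap {
    struct Node {
        K key;
        V value;
        std::shared_ptr<const Node> left, right;
    };
    using Ptr = std::shared_ptr<const Node>;
    Ptr root_;

    explicit PersistentMap(Ptr root) : root_(std::move(root)) {}

    // Returns the root of a new tree; untouched subtrees are shared, not copied.
    static Ptr insert_node(const Ptr& n, const K& k, const V& v) {
        if (!n) return std::make_shared<Node>(Node{k, v, nullptr, nullptr});
        if (k < n->key)
            return std::make_shared<Node>(
                Node{n->key, n->value, insert_node(n->left, k, v), n->right});
        if (n->key < k)
            return std::make_shared<Node>(
                Node{n->key, n->value, n->left, insert_node(n->right, k, v)});
        return std::make_shared<Node>(Node{k, v, n->left, n->right});  // replace value
    }

public:
    PersistentMap() = default;

    // The existing map is left unchanged; the result is the updated version.
    PersistentMap insert(const K& k, const V& v) const {
        return PersistentMap(insert_node(root_, k, v));
    }

    std::optional<V> find(const K& k) const {
        const Node* n = root_.get();
        while (n) {
            if (k < n->key) n = n->left.get();
            else if (n->key < k) n = n->right.get();
            else return n->value;
        }
        return std::nullopt;
    }
};

// Small usage example: the old version does not see the later insert.
inline void persistent_map_example() {
    auto v1 = PersistentMap<int, std::string>().insert(1, "one");
    auto v2 = v1.insert(2, "two");
    assert(!v1.find(2).has_value());
    assert(v2.find(1).value() == "one");
}
```

A production version would need balancing (or wide B+ tree nodes) to keep the copied path short, plus erase; both follow the same copy-the-path pattern.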

You can use an ordinary map and mark every element with a timestamp or "map version number". If you want to remove elements too, use two marks (first and last valid version). If you might reinsert removed elements, then you need a list of values and pairs of marks per element.
For example, you search for the key "foo", and you find that it had the value 5 in versions 0 to 3 (included), then it was "removed", and then it had the value -8 in versions 9 to current.
This eats a lot of memory and time, though.
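A rough sketch of this version-marking scheme, assuming string keys and int values. The class name VersionedMap and its members are invented for the example; version 0 is the empty map and every set or erase creates a new version.

```cpp
#include <cassert>
#include <limits>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical versioned map: each key keeps a list of (value, first, last) spans.
class VersionedMap {
    static constexpr unsigned kOpen = std::numeric_limits<unsigned>::max();
    struct Span { int value; unsigned first; unsigned last; };  // inclusive range
    std::map<std::string, std::vector<Span>> spans_;
    unsigned current_ = 0;

    // Ends the still-open span of a key at the previous version, if there is one.
    void close(std::vector<Span>& s) {
        if (!s.empty() && s.back().last == kOpen) s.back().last = current_ - 1;
    }

public:
    unsigned version() const { return current_; }

    // Sets a value and returns the number of the version this creates.
    unsigned set(const std::string& key, int value) {
        ++current_;
        auto& s = spans_[key];
        close(s);
        s.push_back({value, current_, kOpen});
        return current_;
    }

    // Removes a key from the new version only; older versions still see it.
    unsigned erase(const std::string& key) {
        ++current_;
        auto it = spans_.find(key);
        if (it != spans_.end()) close(it->second);
        return current_;
    }

    // Looks a key up as it was in the given version.
    std::optional<int> get(const std::string& key, unsigned version) const {
        assert(version <= current_);
        auto it = spans_.find(key);
        if (it == spans_.end()) return std::nullopt;
        for (const Span& s : it->second)
            if (s.first <= version && version <= s.last) return s.value;
        return std::nullopt;
    }
};
```

The linear scan over a key's spans could be replaced by a binary search, since spans are stored in increasing version order. Nothing is ever freed here, which is the memory cost mentioned above.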

Related

data structure advice on c++

I am looking for a data structure in C++ and I need some advice.
I have nodes, every node has unique_id and group_id:
1 1.1.1.1
2 1.1.1.2
3 1.1.1.3
4 1.1.2.1
5 1.1.2.2
6 1.1.2.3
7 2.1.1.1
8 2.1.1.2
I need a data structure to answer those questions:
what is the group_id of node 4
give me list (probably vector) of unique_id's that belong to group 1.1.1
give me list (probably vector) of unique_id's that belong to group 1.1
give me list (probably vector) of unique_id's that belong to group 1
Is there a data structure that can answer those questions (and what is the time complexity of inserting and answering)? Or should I implement it myself?
I would appreciate an example.
EDIT:
At the beginning, I need to build this data structure. Most of the activity is reading by group id. Insertion will happen, but less often than reading.
Time complexity is more important than memory space.

To me, hierarchical data like the group ID calls for a tree structure. (I assume that for 500 elements this is not really necessary, but it seems natural and scales well.)
Each element in the first two levels of the tree would just hold vectors (if they come ordered) or maps (if they come un-ordered) of sub-IDs.
The third level in the tree hierarchy would hold pointers to leaves, again in a vector or map, which contain the fourth group ID part and the unique ID.
Questions 2-4 are easily and quickly answered by navigating the tree.
For question 1 one needs an additional map from unique IDs to leaves in the tree; each element inserted into the tree also has a pointer to it inserted into the map.

First of all, if you are going to have only a small number of nodes then it would probably make sense not to mess with advanced data structuring. Simple linear search could be sufficient.
Next, it looks like a good job for SQL, so maybe it's a good idea to incorporate the SQLite library into your app. But even if you really want to do it without SQL, it's still a good hint: what you need are two index trees to support quick searching through your array. The complexity (if using balanced trees) will be logarithmic for all operations.
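The "two index trees" idea can be sketched with two standard ordered containers. This is one possible reading, not the only one: it assumes group ids are dot-separated strings as in the question, and the class name NodeIndex is made up. Because the group index is sorted, all ids under a group prefix sit in one contiguous range that starts at lower_bound(prefix).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical container with one index per access path.
class NodeIndex {
    std::map<int, std::string> group_of_;           // unique_id -> group_id
    std::multimap<std::string, int> ids_by_group_;  // group_id  -> unique_id

public:
    void insert(int id, const std::string& group) {
        bool added = group_of_.emplace(id, group).second;
        assert(added && "unique_id must be unique");
        (void)added;
        ids_by_group_.emplace(group, id);
    }

    // Question 1: O(log n). Throws std::out_of_range for an unknown id.
    const std::string& group_of(int id) const { return group_of_.at(id); }

    // Questions 2-4: ids whose group_id equals `prefix` or starts with `prefix` + ".".
    std::vector<int> ids_in_group(const std::string& prefix) const {
        std::vector<int> out;
        for (auto it = ids_by_group_.lower_bound(prefix);
             it != ids_by_group_.end(); ++it) {
            const std::string& g = it->first;
            if (g.compare(0, prefix.size(), prefix) != 0) break;  // left the prefix range
            // Skip e.g. "1.10" when asked for "1.1": the prefix must end at a dot.
            if (g.size() == prefix.size() || g[prefix.size()] == '.')
                out.push_back(it->second);
        }
        return out;
    }
};
```

Insertion is O(log n). A group query costs O(log n) to find the start of the range plus a linear walk over the entries that share the prefix.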

Depends...
How often do you insert? Or do you mostly read?
How often do you access by Id or GroupId?
With a max of 500 nodes I would put them in a simple vector where the Id is the offset into the array (if the Ids are indeed as shown). The group search can then be implemented by iterating over the array and comparing the partial group ids.
If this is too expensive and you really access the structure a lot and need very high performance, or you do a lot of inserts, I would implement a tree with a HashMap for the Ids.
If the data is stored in a database, you may use a SELECT / CONNECT BY if your system supports that and query the information directly from the DB.
Sorry for not providing a clear answer, but the solution depends on too many factors ;-)

Sounds like you need a container with two separate indexes on unique_id and group_id. Question 1 will be handled by the first index, Questions 2-4 will be handled by the second.
Maybe take a look at Boost Multi-index Containers Library

I am not sure of the perfect data structure for this, but I would make use of a map.
With std::map, question 1, insertion and deletion are all O(log n) (a hash map such as std::unordered_map would answer question 1 in O(1) on average). The issue comes with questions 2, 3 and 4, where the cost will be O(n), n being the number of nodes.

Storing named data, where the 'name' is larger than the 'data'?

I'm writing the logic portion of a game, and want to create, retrieve, and store values (integers) to keep track of progress. For instance, a door would create the pair ("location.room.doorlock", 0) in an std::map, and unlocking this door would set that value to 1. Anytime the player wants to go through this door, it would retrieve the value by that keyname to see if it's passable. (Just an example, but it's important that this information exist outside of the "door" object itself, as characters or other events might retrieve this data and act on it.)
The problem though is that the name (or map key) itself is far larger than the data it's referring to, which seems wasteful, and feels 'wrong' as a result.
Is there a commonly used or best approach for storing this type of data, one where the key isn't so much larger than the data itself?
It is possible to know how much space to allocate at compile time for the progress data itself, if it's important. It need not use std::map either, so long as I don't have to use raw array indices to get or store data.

It seems like you have two options, if you really want to diminish the size of the string (although the string length does not seem to be that bad at all).
You can either just change your naming conventions or implement hashing. Hashing can be implemented in the form of a hashmap (also known as an unordered map) or by hand (you can create a small program that hashes your names to an int, then use that as a pair). Hashmaps/unordered maps are probably your best bet, as there is a lot of support code out there for it and you don't run the risk of having to deal with bugs in your own programs.
http://www.cplusplus.com/reference/unordered_map/unordered_map/
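A small sketch of the "hash the names to an int" option, assuming C++17. The hash is 32-bit FNV-1a; ProgressStore, kDoorLock and door_example are hypothetical names for this example. Since the hash function is constexpr, a key written as a literal is reduced to an integer at compile time and the string itself does not need to be stored in the map.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <unordered_map>

// 32-bit FNV-1a hash, usable in constant expressions.
constexpr std::uint32_t fnv1a(std::string_view s) {
    std::uint32_t h = 2166136261u;              // FNV offset basis
    for (char c : s) {
        h ^= static_cast<unsigned char>(c);
        h *= 16777619u;                         // FNV prime
    }
    return h;
}

// Hypothetical progress store keyed by the hashed name instead of the name.
class ProgressStore {
    std::unordered_map<std::uint32_t, int> values_;

public:
    void set(std::uint32_t key, int value) { values_[key] = value; }

    int get(std::uint32_t key, int fallback = 0) const {
        auto it = values_.find(key);
        return it == values_.end() ? fallback : it->second;
    }
};

// Named constants keep call sites readable without keeping the strings around.
constexpr std::uint32_t kDoorLock = fnv1a("location.room.doorlock");

inline void door_example() {
    ProgressStore progress;
    progress.set(kDoorLock, 0);                 // door starts locked
    assert(progress.get(kDoorLock) == 0);
    progress.set(kDoorLock, 1);                 // unlocking the door
    assert(progress.get(kDoorLock) == 1);
}
```

The trade-off is that two different names can hash to the same 32-bit value. With a fixed set of names this can be checked once (for example in a unit test), but the sketch above does not detect collisions by itself.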

Hash table with two keys

I have a large amount of data that I want to be able to access in two different ways. I would like constant-time lookup based on either key, constant-time insertion with one key, and constant-time deletion with the other. Is there such a data structure, and can I construct one using the data structures in tr1 and maybe boost?

Use two parallel hash-tables. Make sure that the keys are stored inside the element value, because you'll need all the keys during deletion.

Have you looked at Bloom Filters? They aren't O(1), but I think they perform better than hash tables in terms of both time and space required to do lookups.

It's hard to see why you need to do this, but as someone said, try using two different hash tables.
Just pseudocode here:
Hashtable inHash;
Hashtable outHash;
// example object
myObj.inKey="one";
myObj.outKey=1;
myObj.data="blahblah...";
// adding stuff
inHash.store(myObj.inKey,myObj.outKey);
outHash.store(myObj.outKey,myObj);
// deleting stuff (each table is addressed by its own key)
inHash.del(myObj.inKey);
outHash.del(myObj.outKey);
// finding stuff
// straight
myObj=outHash.get(1);
// the other way; still constant time
key=inHash.get("one");
myObj=outHash.get(key);
Not sure if that's what you're looking for.
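The same two-table idea as a compilable C++ sketch, using std::unordered_map (the standardized successor of the tr1 container). Item and TwoKeyTable are hypothetical names. All operations are constant time on average, and both keys are stored in the element so that deleting through one key can also clean up the other table.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

struct Item {             // hypothetical element type
    std::string inKey;    // both keys are kept inside the element
    int outKey;
    std::string data;
};

class TwoKeyTable {
    std::unordered_map<std::string, int> byIn_;  // inKey  -> outKey
    std::unordered_map<int, Item> byOut_;        // outKey -> element

public:
    // Returns false if either key is already in use.
    bool insert(Item item) {
        if (byIn_.count(item.inKey) || byOut_.count(item.outKey)) return false;
        byIn_.emplace(item.inKey, item.outKey);
        int key = item.outKey;                   // read before moving the item
        byOut_.emplace(key, std::move(item));
        return true;
    }

    const Item* findByOut(int outKey) const {
        auto it = byOut_.find(outKey);
        return it == byOut_.end() ? nullptr : &it->second;
    }

    // Two hash lookups, so still constant time on average.
    const Item* findByIn(const std::string& inKey) const {
        auto it = byIn_.find(inKey);
        return it == byIn_.end() ? nullptr : findByOut(it->second);
    }

    bool eraseByOut(int outKey) {
        auto it = byOut_.find(outKey);
        if (it == byOut_.end()) return false;
        std::size_t removed = byIn_.erase(it->second.inKey);  // stored key cleans the other table
        assert(removed == 1);
        (void)removed;
        byOut_.erase(it);
        return true;
    }
};
```

The pointers returned by the find functions stay valid until the element itself is erased, because unordered_map does not move its elements when it rehashes.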

This is one of the limits of the design of standard containers: a container in a sense "owns" the contained data and expects to be the only owner... containers are not merely "indexes".
For your case a simple, but not 100% effective, solution is to have two std::maps with "Node *" as value and storing both keys in the Node structure (so you have each key stored twice). With this approach you can update your data structure with reasonable overhead (you will do some extra map search but that should be fast enough).
A possibly "correct" solution however would IMO be something like
struct Node
{
Key key1;
Key key2;
Payload data;
Node *Collision1Prev, *Collision1Next;
Node *Collision2Prev, *Collision2Next;
};
basically having each node in two different hash tables at the same time.
Standard containers cannot be combined this way. Other examples I coded by hand in the past are a hash table where all nodes are also in a doubly-linked list, or a tree where all nodes are also in an array.
For very complex data structures (e.g. network of structures where each one is both the "owner" of several chains and part of several other chains simultaneously) I even resorted sometimes to code generation (i.e. scripts that generate correct pointer-handling code given a description of the data structure).

Erlang persistent data structures

As I've understood, when you create a new list with expression like the following, Erlang doesn't copy L1, it just copies H.
L2 = [H|L1]
Does Erlang have a persistent data structure (see Persistent data structure) for dict, that is, when you add/remove/modify nodes in the tree only a few elements are copied (like in Clojure)?

You have misunderstood the situation when you build a list using [H|T]. It is as you say that T is not copied but neither is H. All that happens is that a new list cell is prepended to T with a reference to H as its head (its tail is T). When working with lists the only bits which are created are the actual list cells and never the data in each cell.
The same happens when working with dict. When you modify (add/delete elements) in the dict only the actual dict structure is modified and not the actual data in the dict. Also it is smart so as to only copy as little of the dict structure as is necessary to make the modification.
So, yes, Erlang has persistent data structures. In that respect Clojure is like Erlang (we were around long before it).
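As an analogy only (this is C++, not a description of how the Erlang VM is implemented), the [H|T] behaviour described above can be modelled with reference-counted list cells. Term, Cell, List, cons and shares_tail are names made up for this sketch. Prepending allocates exactly one new cell that refers to the existing head value and the existing tail; neither is copied.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

using Term = std::shared_ptr<const std::string>;  // an immutable value
struct Cell;
using List = std::shared_ptr<const Cell>;         // nullptr plays the role of []

struct Cell {
    Term head;  // reference to the element, not a copy of it
    List tail;  // the existing list, shared as-is
};

// Models L2 = [H|L1]: one new cell, nothing else is allocated.
inline List cons(Term head, List tail) {
    return std::make_shared<Cell>(Cell{std::move(head), std::move(tail)});
}

// True if `longer` is exactly one cell prepended to `shorter`.
inline bool shares_tail(const List& longer, const List& shorter) {
    assert(longer != nullptr);
    return longer->tail == shorter;
}
```

Comparing the pointers shows the sharing: after building l2 with cons(h, l1), l2->tail and l1 refer to the very same cells, and l2->head and h refer to the same value.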

In my experience, the data structures for the library module do not degrade in performance or memory pressure when they get larger.
For a dict, it uses a dynamic hash table as internal data structure and work is done essentially only on the bucket where the modification is done.
I also looked in the gb_trees module where I found the comment:
Behaviour is logaritmic (as it should be).
And gb_trees are generally pretty fast, so I'm quite sure not much copying is going on.
Generally, if you implement data structures like these in a language like Erlang, you take care of copying issues, so there is no need to worry about it for the general library functions.
I reread the article about persistent data structures: in the sense of this article, Erlang's data structures are fully persistent and also confluently persistent.

compressed vector/array class with random data access

I would like to make a "compressed array"/"compressed vector" class (details below) that allows random data access in more or less constant time.
"more or less constant time" means that although element access time isn't constant, it shouldn't keep increasing when I get closer to certain point of the array. I.e. container shouldn't do significantly more calculations (like "decompress everything once again to get last element", and "do almost nothing to get the first") to get one element. Can be probably achieved by splitting array into chunks of compressed data. I.e. accessing one element should take "averageTime" +- some deviation. I could say that I want best-case access time and worst-case access time to be relatively close to average access time.
What are my options (suitable algorithms/already available containers - if there are any)?
Container details:
Container acts as a linear array of identical elements (such as std::vector)
Once container is initialized, data is constant and never changes. Container needs to provide read-only access.
Container should behave like array/std::vector - i.e. values accessed via operator[], there is .size(), etc.
It would be nice if I could make it as template class.
Access to data should be more or less constant-time. I don't need same access time for every element, but I shouldn't have to decompress everything to get last element.
Usage example:
Binary search on data.
Data details:
1. Data is structs mostly consisting of floats and a few ints. There are more floats than ints. No strings.
2. It is unlikely that there are many identical elements in the array, so simply indexing the data won't be possible.
3. Size of one element is less than 100 bytes.
4. Total data size per container is between a few kilobytes and a few megabytes.
5. Data is not sparse - it is continuous block of elements, all of them are assigned, there are no "empty slots".
The goal of compression is to reduce amount of ram the block takes when compared to uncompressed representation as array, while keeping somewhat reasonable read access performance, and allowing to randomly access elements as array. I.e. data should be stored in compressed form internally, and I should be able to access it (read-only) as if it is a std::vector or similar container.
Ideas/Opinions?

I take it that you want an array whose elements are not stored vanilla, but compressed, to minimize memory usage.
Concerning compression, you have no exceptional insight about the structure of your data, so you're fine with some kind of standard entropy encoding. Ideally, you would like to run GZIP on your whole array and be done with it, but that would lose O(1) access, which is crucial to you.
A solution is to use Huffman coding together with an index table.
Huffman coding works by replacing each input symbol (for instance, an ASCII byte) with another symbol of variable bit length, depending on its frequency of occurrence in the whole stream. For instance, the character 'E' appears very often, so it gets a short bit sequence, while 'W' is rare and gets a long bit sequence.
E -> 0b10
W -> 0b11110
Now, compress your whole array with this method. Unfortunately, since the output symbols have variable length, you can no longer index your data as before: item number 15 is no longer at stream[15*sizeof(item)].
Fortunately, this problem can be solved by using an additional index table, index, that stores where each item starts in the compressed stream. In other words, the compressed data for item 15 can be found at stream[index[15]]; the index table accumulates the variable output lengths.
So, to get item 15, you simply start decompressing the bytes at stream[index[15]]. This works because Huffman coding doesn't do anything fancy to the output; it just concatenates the new code words, and you can start decoding inside the stream without having to decode all previous items.
Of course, the index table adds some overhead; you may want to tweak the granularity so that compressed data + index table is still smaller than original data.
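The index-table mechanism can be shown without writing a Huffman coder. The sketch below deliberately swaps in a simpler variable-length code (7-bit varints for unsigned 32-bit integers) so that it stays short; the class name VarintArray and the stride parameter are assumptions of this example. The index stores the byte offset of every stride-th item, which is the granularity knob mentioned above: a larger stride means a smaller index but more items to skip per access.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical read-only array of variable-length codes plus an offset index.
class VarintArray {
    std::vector<std::uint8_t> stream_;  // concatenated variable-length codes
    std::vector<std::size_t> index_;    // byte offset of items 0, stride, 2*stride, ...
    std::size_t stride_;
    std::size_t size_;

public:
    explicit VarintArray(const std::vector<std::uint32_t>& values,
                         std::size_t stride = 16)
        : stride_(stride), size_(values.size()) {
        assert(stride > 0);
        for (std::size_t i = 0; i < values.size(); ++i) {
            if (i % stride_ == 0) index_.push_back(stream_.size());
            std::uint32_t v = values[i];
            while (v >= 0x80) {  // 7 payload bits per byte, high bit means "more follows"
                stream_.push_back(static_cast<std::uint8_t>(v | 0x80));
                v >>= 7;
            }
            stream_.push_back(static_cast<std::uint8_t>(v));
        }
    }

    std::size_t size() const { return size_; }

    // Decodes at most `stride` items, never the whole stream.
    std::uint32_t operator[](std::size_t i) const {
        assert(i < size_);
        std::size_t pos = index_[i / stride_];
        std::uint32_t v = 0;
        for (std::size_t left = i % stride_ + 1; left > 0; --left) {
            v = 0;
            for (int shift = 0;; shift += 7) {
                std::uint8_t b = stream_[pos++];
                v |= static_cast<std::uint32_t>(b & 0x7F) << shift;
                if (!(b & 0x80)) break;
            }
        }
        return v;
    }

    // Payload plus index, for comparing against the uncompressed size.
    std::size_t compressed_bytes() const {
        return stream_.size() + index_.size() * sizeof(std::size_t);
    }
};
```

Access time is bounded by the stride, so it does not grow toward the end of the array. For the float-heavy structs in the question a real entropy coder would be needed in place of the varint code; only the index logic carries over unchanged.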

Are you coding for an embedded system and/or do you have hundreds or thousands of these containers? If not, while I think this is an interesting theoretical question (+1), I suspect that the slowdown as a result of doing the decompression will be non-trivial and that it would be better to use a std::vector.
Next, are you sure that the data you're storing is sufficiently redundant that smaller blocks of it will actually be compressible? Have you tried saving off blocks of different sizes (powers of 2 perhaps) and tried running them through gzip as an exercise? It may be that any extra data needed to help the decompression algorithm (depending on approach) would reduce the space benefits of doing this sort of compressed container.
If you decide that it's still reasonable to do the compression, then there are at least a couple possibilities, none pre-written though. You could compress each individual element, storing a pointer to the compressed data chunk. Then index access is still constant, just needing to decompress the actual data. Possibly using a proxy object would make doing the actual data decompression easier and more transparent (and maybe even allow you to use std::vector as the underlying container).
Alternately, std::deque stores its data in chunks already, so you could use a similar approach here. For example std::vector<compressed_data_chunk> where each chunk holds say 10 items compressed together as your underlying container. Then you can still directly index the chunk you need, decompress it, and return the item from the decompressed data. If you want, your containing object (that holds the vector) could even cache the most recently decompressed chunk or two for added performance on consecutive access (although this wouldn't help binary search very much at all).

I've been thinking about this for a while now. From a theoretical point of view I identified 2 possibilities:
Flyweight, because repetition can be lessened with this pattern.
Serialization (compression is some form of serialization)
The first is purely object oriented and fits well I think in general, it doesn't have the disadvantage of messing up pointers for example.
The second seems better adapted here, although it does have a slight disadvantage in general: pointer invalidation + issues with pointer encoding / decoding, virtual tables, etc... Notably it doesn't work if the items refer to each others using pointers instead of indices.
I have seen a few "Huffman coding" solutions, however this means that for each structure one needs to provide a compressing algorithm. It's not easy to generalize.
So I'd rather go the other way and use a compressing library like 'zlib', picking up a fast algorithm like lzo for example.
B* tree (or a variant) with large number of items per node (since it doesn't move) like say 1001. Each node contains a compressed representation of the array of items. Indices are not compressed.
Possibly: a cache_view to access the container while storing the last 5 (or so) decompressed nodes or something. Another variant is to implement reference counting and keep the data uncompressed as long as someone has a handle to one of the items in the node.
Some remarks:
if you store a large number of items/keys per node you have near constant access time; for example with 1001 it means that there are only 2 levels of indirection as long as you store less than a million items, 3 levels of indirection for a billion, etc.
you can build a readable/writable container with such a structure. I would make it so that I only recompress once I am done writing the node.

Okay, from the best of my understanding, what you want is some kind of accessor template. Basically, create a template adapter that has as its argument one of your element types which it accesses internally via whatever, a pointer, an index into your blob, etc. Make the adapter pointer-like:
const T *operator->() const;
etc. since it's easier to create a pointer adapter than it is a reference adapter (though see vector if you want to know how to write one of those). Notice, I made this accessor constant as per your guidelines. Then, pre-compute your offsets when the blob is loaded / compressed and populate the vector with your templated adapter class. Does this make sense? If you need more details, I will be happy to provide.
As for the compression algorithm, I suggest you simply do a frequency analysis of bytes in your blob and then run your uncompressed blob through a hard-coded Huffman encoding (as was more or less suggested earlier), capturing the offset of each element and storing it in your proxy adapter which in turn are the elements of your array. Indeed, you could make this all part of some compression class that compresses and generates elements that can be copy-back-inserted into your vector from the beginning. Again, reply if you need sample code.

Can some of the answers to "What is the best compression algorithm that allows random reads/writes in a file?" be adapted to your in-memory data?
