I am looking for a data structure in C++ and I need some advice.
I have nodes, and every node has a unique_id and a group_id:
1 1.1.1.1
2 1.1.1.2
3 1.1.1.3
4 1.1.2.1
5 1.1.2.2
6 1.1.2.3
7 2.1.1.1
8 2.1.1.2
I need a data structure to answer these questions:
what is the group_id of node 4?
give me a list (probably a vector) of the unique_ids that belong to group 1.1.1
give me a list (probably a vector) of the unique_ids that belong to group 1.1
give me a list (probably a vector) of the unique_ids that belong to group 1
Is there a data structure that can answer these questions (and what is the time complexity of insertion and querying), or should I implement one myself?
I would appreciate an example.
EDIT:
At the beginning I need to build this data structure. Most of the activity is reading by group_id; insertions will happen, but less often than reads.
Time complexity is more important than memory usage.
To me, hierarchical data like the group ID calls for a tree structure. (I assume that for 500 elements this is not really necessary, but it seems natural and scales well.)
Each element in the first two levels of the tree would just hold vectors (if they come ordered) or maps (if they come un-ordered) of sub-IDs.
The third level in the tree hierarchy would hold pointers to leaves, again in a vector or map, which contain the fourth group ID part and the unique ID.
Questions 2-4 are easily and quickly answered by navigating the tree.
For question 1 one needs an additional map from unique IDs to leaves in the tree; each element inserted into the tree also has a pointer to it inserted into the map.
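A minimal sketch of this design (my illustration, not necessarily the answerer's exact layout), assuming un-ordered input (so maps at every level) and integer ID parts:

```cpp
#include <map>
#include <vector>

struct Leaf {
    int part4;     // last component of the group ID
    int unique_id;
};

// part1 -> part2 -> part3 -> part4 -> leaf
using Tree = std::map<int, std::map<int, std::map<int, std::map<int, Leaf>>>>;

struct Index {
    Tree tree;
    std::map<int, Leaf*> by_id; // question 1: unique_id -> leaf
    // (std::map never invalidates references on insert, so the pointers stay valid)

    void insert(int p1, int p2, int p3, int p4, int unique_id) {
        Leaf& leaf = tree[p1][p2][p3][p4];
        leaf = Leaf{p4, unique_id};
        by_id[unique_id] = &leaf;
    }

    // Question 3: all unique_ids in group p1.p2 (questions 2 and 4 are analogous).
    std::vector<int> group(int p1, int p2) const {
        std::vector<int> out;
        auto it1 = tree.find(p1);
        if (it1 == tree.end()) return out;
        auto it2 = it1->second.find(p2);
        if (it2 == it1->second.end()) return out;
        for (const auto& [p3, leaves] : it2->second)
            for (const auto& [p4, leaf] : leaves)
                out.push_back(leaf.unique_id);
        return out;
    }
};
```

Insertion and the question-1 lookup are O(log n); listing a group is linear in the size of the group.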
First of all, if you are going to have only a small number of nodes, then it would probably make sense not to mess with advanced data structures; a simple linear search could be sufficient.
Next, it looks like a good job for SQL, so it may be a good idea to incorporate the SQLite library into your app. But even if you really want to do it without SQL, it's still a good hint: what you need are two index trees to support quick searching through your array. The complexity (if using balanced trees) will be logarithmic for all operations.
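A rough sketch of the two-index idea with standard containers (my illustration, not this answer's code): one tree keyed by unique_id, one keyed by the group string.

```cpp
#include <map>
#include <string>
#include <vector>

std::map<int, std::string> group_of;          // unique_id -> "1.1.1.2"
std::multimap<std::string, int> ids_by_group; // "1.1.1.2" -> unique_id

void insert(int id, const std::string& group) {
    group_of[id] = group;
    ids_by_group.emplace(group, id);
}

// All IDs whose group starts with e.g. "1.1."; everything with that prefix
// sorts in [prefix, prefix + '/'), because '/' is the character right
// after '.' in ASCII.
std::vector<int> ids_with_prefix(const std::string& prefix) {
    std::vector<int> out;
    auto first = ids_by_group.lower_bound(prefix);
    auto last  = ids_by_group.lower_bound(prefix + '/');
    for (auto it = first; it != last; ++it)
        out.push_back(it->second);
    return out;
}
```

Query group 1.1 with `ids_with_prefix("1.1.")` (the trailing dot avoids accidentally matching a hypothetical group "1.10"). Both lookups are logarithmic, matching the claim above.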
Depends...
How often do you insert? Or do you mostly read?
How often do you access by Id or GroupId?
With a max of 500 nodes I would put them in a simple vector where the ID is the offset into the array (if the IDs are indeed as shown). The group search can then be implemented by iterating over the array and comparing the partial group IDs.
If this is too expensive, you really access the structure a lot and need very high performance, or you do a lot of inserts, I would implement a tree with a hash map for the IDs.
If the data is stored in a database you may use a SELECT ... CONNECT BY, if your system supports that, and query the information directly from the DB.
Sorry for not providing a clear answer, but the solution depends on too many factors ;-)
Sounds like you need a container with two separate indexes on unique_id and group_id. Question 1 will be handled by the first index, Questions 2-4 will be handled by the second.
Maybe take a look at Boost Multi-index Containers Library
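For illustration, a hedged sketch of such a two-index container (the layout is mine; see the Boost.MultiIndex docs for details):

```cpp
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <string>

struct Node {
    int unique_id;
    std::string group_id; // e.g. "1.1.1.2"
};

namespace bmi = boost::multi_index;

using NodeContainer = bmi::multi_index_container<
    Node,
    bmi::indexed_by<
        // question 1: lookup by unique_id
        bmi::ordered_unique<bmi::member<Node, int, &Node::unique_id>>,
        // questions 2-4: ordered by group_id, so prefixes form contiguous ranges
        bmi::ordered_non_unique<bmi::member<Node, std::string, &Node::group_id>>
    >
>;

// nodes.get<0>().find(4)->group_id answers question 1 in O(log n);
// questions 2-4 are prefix ranges on nodes.get<1>(), e.g. ["1.1.", "1.1/").
```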
I am not sure of the perfect data structure for this, but I would make use of a map.
A std::map gives you O(log n) lookup for question 1 (a std::unordered_map would give O(1) on average) and O(log n) insertion and deletion. The issue comes with questions 2, 3 and 4, where the efficiency will be O(n), n being the number of nodes.
Related
I am designing a graph in C++ using a hash table for its elements. The hash table uses open addressing and the graph has no more than 50,000 edges. I also implemented Prim's algorithm to find the minimum spanning tree of the graph. My implementation creates storage for the following data:
A table named Q to put there all the nodes in the beginning. In every loop, a node is visited and in the end of the loop, it's deleted from Q.
A table named Key, one for each node. The key is changed when necessary (at least one time per loop).
A table named Parent, one for each node. In each loop, a new element is inserted in this table.
A table named A. The program stores here the final edges of the minimum spanning tree. It's the table that is returned.
What would be the most efficient data structure to use for creating these tables, assuming the graph has 50,000 edges?
Can I use arrays?
I fear that the elements of every array will be way too many. I don't even consider using linked lists, of course, because accessing each element would take too much time. Could I use hash tables?
But again, the elements are way too many. My algorithm works well for graphs consisting of a few nodes (10 or 20), but I am sceptical about graphs consisting of 40,000 nodes. Any suggestion is much appreciated.
(Since comments were getting a bit long): The only part of the problem that seems to get ugly at very large size is that every node not yet selected has a cost, and you need to find the one with the lowest cost at each step, yet executing each step reduces the cost of a few effectively random nodes.
A priority queue is perfect when you want to keep track of lowest cost. It is efficient for removing the lowest cost node (which you do at each step). It is efficient for adding a few newly reachable nodes, as you might on any step. But in the basic design, it does not handle reducing the cost of a few nodes that were already reachable at high cost.
So (having frequent need for a more functional priority queue), I typically create a heap of pointers to objects and in each object have an index of its heap position. The heap methods all do a callback into the object to inform it whenever its index changes. The heap also has some external calls into methods that might normally be internal only, such as the one that is perfect for efficiently fixing the heap when an existing element has its cost reduced.
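A minimal sketch of that indexed-heap idea (my reconstruction of the technique, not this author's code): the heap stores pointers and writes each object's position back into it, so decrease-key becomes an O(log n) sift-up.

```cpp
#include <cstddef>
#include <vector>

struct HeapItem {
    double cost;
    int heap_index = -1; // maintained by the heap (the "callback" into the object)
};

class IndexedHeap {
    std::vector<HeapItem*> data_;

    void place_(std::size_t i, HeapItem* item) {
        data_[i] = item;
        item->heap_index = static_cast<int>(i); // tell the object where it lives
    }
    void sift_up_(std::size_t i) {
        while (i > 0) {
            std::size_t parent = (i - 1) / 2;
            if (data_[parent]->cost <= data_[i]->cost) break;
            HeapItem* tmp = data_[parent];
            place_(parent, data_[i]);
            place_(i, tmp);
            i = parent;
        }
    }
    void sift_down_(std::size_t i) {
        for (;;) {
            std::size_t child = 2 * i + 1;
            if (child >= data_.size()) break;
            if (child + 1 < data_.size() && data_[child + 1]->cost < data_[child]->cost)
                ++child;
            if (data_[i]->cost <= data_[child]->cost) break;
            HeapItem* tmp = data_[i];
            place_(i, data_[child]);
            place_(child, tmp);
            i = child;
        }
    }

public:
    bool empty() const { return data_.empty(); }

    void push(HeapItem* item) {
        data_.push_back(nullptr);
        place_(data_.size() - 1, item);
        sift_up_(data_.size() - 1);
    }
    HeapItem* pop_min() { // precondition: !empty()
        HeapItem* top = data_.front();
        place_(0, data_.back());
        data_.pop_back();
        if (!data_.empty()) sift_down_(0);
        top->heap_index = -1;
        return top;
    }
    // The normally-internal fix-up, exposed so Prim can reduce a node's cost:
    void decrease_key(HeapItem* item, double new_cost) {
        item->cost = new_cost;
        sift_up_(static_cast<std::size_t>(item->heap_index));
    }
};
```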
I just reviewed the documentation for the std one
http://en.cppreference.com/w/cpp/container/priority_queue
to see if the features I always want to add were there in some form I hadn't noticed before (or had been added in some recent C++ version). So far as I can tell, NO. Most real world uses of priority queue (certainly all of mine) need minor extra features that I have no clue how to tack onto the standard version. So I have needed to rewrite it from scratch including the extra features. But that isn't actually hard.
The method I use has been reinvented by many people (I was doing this in C in the '70s, and wasn't first). A quick Google search found one of many places where my approach is described in more detail than I have described it here.
http://users.encs.concordia.ca/~chvatal/notes/pq.html#heap
pros, I need some performance-opinions with the following:
1st Question:
I want to store objects in a 3D grid structure; overall it will be ~33% filled, i.e. 2 out of 3 grid points will be empty.
Short image to illustrate:
Maybe Option A)
vector<vector<vector<deque<Obj>>>> grid; // (SizeX, SizeY, SizeZ)
grid[x][y][z].push_back(someObj);
This way I'd have a lot of empty deques, but accessing one of them would be fast, wouldn't it?
The Other Option B) would be
std::unordered_map<Pos3D, deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;
where I add & delete deques as data is added/deleted. Probably less memory used, but maybe slower? What do you think?
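For reference, one possible shape for Option B's key and functor types (the names come from the question; the hash combine below is just one common choice, not the only one):

```cpp
#include <cstddef>
#include <deque>
#include <unordered_map>

struct Obj {}; // placeholder for the stored object type

struct Pos3D { int x, y, z; };

struct Pos3DEqual {
    bool operator()(const Pos3D& a, const Pos3D& b) const {
        return a.x == b.x && a.y == b.y && a.z == b.z;
    }
};

struct Pos3DHash {
    std::size_t operator()(const Pos3D& p) const {
        std::size_t h = static_cast<std::size_t>(p.x);
        h = h * 1000003u + static_cast<std::size_t>(p.y); // simple polynomial combine
        return h * 1000003u + static_cast<std::size_t>(p.z);
    }
};

std::unordered_map<Pos3D, std::deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;
```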
2nd Question (follow up)
What if I had multiple containers at each position? Say 3 buckets for 3 different entities (object types ObjA, ObjB, ObjC) per grid point; then my data essentially becomes 4D?
Another illustration:
Using Option 1B I could just extend Pos3D to include the bucket number to account for even more sparse data.
Possible queries I want to optimize for:
Give me all Objects out of ObjA-buckets from the entire structure
Give me all Objects out of ObjB-buckets for a set of grid-positions
Which is the nearest non-empty ObjC-bucket to position x,y,z?
PS:
I had also thought about a tree-based data structure before, reading about nearest-neighbour approaches. Since my data is so regular, I thought I'd save all the tree-building subdivision of cells into smaller pieces and just make a static 3D grid of the final leaves. That's how I came to ask about the best way to store this grid here.
A question associated with this: if I have a map<int, Obj>, is there a fast way to ask for "all objects with keys between 780 and 790"? Or is the fastest way building the above-mentioned tree?
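(For the side question: std::map is ordered, so a key range costs O(log n + k) with lower_bound/upper_bound and no extra tree is needed. A sketch:)

```cpp
#include <map>

struct Obj {};

void keys_between(const std::map<int, Obj>& m) {
    auto first = m.lower_bound(780); // first key >= 780
    auto last  = m.upper_bound(790); // first key >  790
    for (auto it = first; it != last; ++it) {
        // it->first is in [780, 790]; it->second is the Obj
    }
}
```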
EDIT
I ended up going with a 3D boost::multi_array with Fortran ordering. It's a little bit like the chunks that games like Minecraft use, which in turn is a little like using a kd-tree with a fixed leaf size and a fixed number of leaves. It works pretty fast now, so I'm happy with this approach.
Answer to 1st question
As @Joachim pointed out, this depends on whether you prefer fast access or small data. Roughly, this corresponds to your options A and B.
A) If you want fast access, go with a multidimensional std::vector, or an array if you will. std::vector brings easier maintenance at minimal overhead, so I'd prefer that. It consumes O(N^3) space, where N is the number of grid points along one dimension. To get the best performance when iterating over the data, remember to resolve the indices in the reverse order to how you defined them: innermost first, outermost last.
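For instance, assuming the grid[x][y][z] layout from the question, the last index should vary in the innermost loop (a sketch):

```cpp
#include <cstddef>
#include <deque>
#include <vector>

struct Obj {};

void visit_all(std::vector<std::vector<std::vector<std::deque<Obj>>>>& grid) {
    // The innermost loop varies the innermost index, so consecutive
    // accesses touch neighbouring parts of the structure.
    for (std::size_t x = 0; x < grid.size(); ++x)
        for (std::size_t y = 0; y < grid[x].size(); ++y)
            for (std::size_t z = 0; z < grid[x][y].size(); ++z)
                for (Obj& obj : grid[x][y][z]) { (void)obj; /* ... */ }
}
```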
B) If you instead wish to keep things as small as possible, use a hash map, and pick one that is optimized for space. That would result in O(N) space, with N being the number of elements. Here is a benchmark comparing several hash maps. I have had good experiences with google::sparse_hash_map, which has the smallest constant overhead I have seen so far. Plus, it is easy to add to your build system.
If you need a mixture of speed and small data or don't know the size of each dimension in advance, use a hash map as well.
Answer to 2nd question
I'd say your data is 4D if you have a variable number of elements along the 4th dimension, or a fixed, large number of elements. With option 1B) you'd indeed add the bucket index; for 1A) you'd add another vector.
Which is the nearest non-empty ObjC-bucket to position x,y,z?
This operation is commonly called nearest neighbor search. You want a KDTree for that. There is libkdtree++, if you prefer small libraries. Otherwise, FLANN might be an option. It is a part of the Point Cloud Library which accomplishes a lot of tasks on multidimensional data and could be worth a look as well.
Functional programming languages often work on immutable data structures but stay efficient through structural sharing. E.g. you work on some map of information; if you insert an element, you will not modify the existing map but create a new, updated version. To avoid massive copying and memory usage, the map will share (as far as possible) the unchanged data between both instances.
I would be interested to know whether there exists some template library providing such a map-like data structure for C++. I searched a bit and found nothing, besides internal classes in LLVM.
A copy-on-write B+tree sounds like what you're looking for. It basically creates a new snapshot of itself every time it gets modified, but it shares unmodified leaf nodes between versions. Most of the implementations I've seen tend to be baked into append-only database log files. CouchDB has a very nice write-up on them. They are, however, "relatively easy", as far as map data structures go, to implement.
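To make structural sharing concrete, here is a minimal sketch using a persistent binary search tree (simpler than a B+tree, same idea): insert() returns a new root and shares every untouched subtree with the old version.

```cpp
#include <memory>
#include <string>

struct PNode;
using PTree = std::shared_ptr<const PNode>;

struct PNode {
    PTree left, right;
    int key;
    std::string value;
};

PTree insert(const PTree& t, int key, const std::string& value) {
    if (!t)
        return std::make_shared<PNode>(PNode{nullptr, nullptr, key, value});
    if (key < t->key)  // copy this node, share the whole right subtree
        return std::make_shared<PNode>(
            PNode{insert(t->left, key, value), t->right, t->key, t->value});
    if (t->key < key)  // copy this node, share the whole left subtree
        return std::make_shared<PNode>(
            PNode{t->left, insert(t->right, key, value), t->key, t->value});
    // equal key: replace the value, share both subtrees
    return std::make_shared<PNode>(PNode{t->left, t->right, key, value});
}
```

Only the nodes on the path from the root to the insertion point are copied; every other subtree is shared, and the old root remains a valid, unchanged map.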
You can use an ordinary map, but mark every element with a timestamp or "map version number". If you want to remove elements too, use two marks. If you might reinsert removed elements, then you need a list of values and pairs of marks per element.
For example, you search for the key "foo" and find that it had the value 5 in versions 0 to 3 (inclusive), then it was "removed", and then it had the value -8 from version 9 to the current one.
This eats a lot of memory and time, though.
I am not at all an expert in database design, so I will put my need in plain words before I try to translate it into CS terms: I am trying to find the right way to iterate quickly over large subsets (say ~100 MB of doubles) of data in a potentially very large dataset (say several GB).
I have objects that basically consist of 4 integers (keys) and the value, a simple struct (1 double 1 short).
Since my keys can take only a small number of values (a couple hundred), I thought it would make sense to save my data as a tree (one level per key; the values are the leaves, much like XML's XPath, in my naive view at least).
I want to be able to iterate through subsets of leaves based on key values / a function of those key values. Which key combination to filter upon will vary. I think this is called a transversal search?
So to avoid comparing the same keys n times, ideally I would need the data structure to be indexed by each permutation of the keys (12 possibilities: 4!/2!). This seems to be what boost::multi_index is for but, unless I'm overlooking something, the way this would be done would be actually constructing those 12 tree structures, storing pointers to my value nodes as leaves. I guess this would be extremely space-inefficient considering the small size of my values compared to the keys.
Any suggestions regarding the design / data structure I should use, or pointers to concise educational materials regarding these topics would be very appreciated.
With Boost.MultiIndex, you don't need as many as 12 indices (BTW, the number of permutations of 4 elements is 4!=24, not 12) to cover all queries comprising a particular subset of 4 keys: thanks to the use of composite keys, and with a little ingenuity, 6 indices suffice.
By some happy coincidence, I provided in my blog some years ago an example showing how to do this in a manner that almost exactly matches your particular scenario:
Multiattribute querying with Boost.MultiIndex
Source code is provided that you can hopefully use with little modification to suit your needs. The theoretical justification of the construct is also provided in a series of articles in the same blog:
A combinatory theorem
Generating permutation covers: part I
Generating permutation covers: part II
Multicolumn querying
The maths behind this is not trivial and you might want to safely ignore it: if you need assistance understanding it, though, do not hesitate to comment on the blog articles.
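To make the composite-key idea concrete, here is a hedged sketch of what one of those six indices might look like (the field names are invented; the blog articles linked above show the actual covering construction):

```cpp
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>

struct Entry {
    int k0, k1, k2, k3; // the four keys
    double value;
    short flag;
};

namespace bmi = boost::multi_index;

using EntrySet = bmi::multi_index_container<
    Entry,
    bmi::indexed_by<
        // One composite-key index; prefix lookups on (k0), (k0,k1),
        // (k0,k1,k2) and (k0,k1,k2,k3) all use this single index.
        bmi::ordered_non_unique<
            bmi::composite_key<
                Entry,
                bmi::member<Entry, int, &Entry::k0>,
                bmi::member<Entry, int, &Entry::k1>,
                bmi::member<Entry, int, &Entry::k2>,
                bmi::member<Entry, int, &Entry::k3>>>
        // ...five more composite-key indices in other key orders
        // complete the permutation cover...
    >
>;

// e.g. es.get<0>().equal_range(boost::make_tuple(1, 2)) finds all
// entries with k0 == 1 && k1 == 2.
```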
How much memory does this container use? In a typical 32-bit computer, the size of your objects is 4*sizeof(int)+sizeof(double)+sizeof(short)+padding, which typically yields 32 bytes (checked with Visual Studio on Win32). To this Boost.MultiIndex adds an overhead of 3 words (12 bytes) per index, so for each element of the container you've got
32+6*12 = 104 bytes + padding.
Again, I checked with Visual Studio on Win32 and the size obtained was 128 bytes per element. If you have 1 billion (10^9) elements, then 32 bits is not enough: going to a 64-bit OS will most likely double the size of objects, so the memory needed would amount to 256 GB, which is quite a powerful beast (I don't know whether you are using something as huge as this).
B-Tree indexes and bitmap indexes are two of the major index types used, but they aren't the only ones. You should explore them. Something to get you started:
Article evaluating when to use B-Tree and when to use Bitmap
It depends on the algorithm accessing it, honestly. If this structure needs to be resident, and you can afford the memory consumption, then just do it. multi_index is fine, though it will destroy your compile times if it's in a header.
If you just need a one time traversal, then building the structure will be kind of a waste. Something like next_permutation may be a good place to start.
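E.g., as a hedged sketch of that one-time traversal idea:

```cpp
#include <algorithm>
#include <array>

void for_each_key_order() {
    std::array<int, 4> order = {0, 1, 2, 3}; // must start sorted to see all 24 orders
    do {
        // visit the data comparing keys in the order order[0..3] here
    } while (std::next_permutation(order.begin(), order.end()));
}
```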
I need to store data grouping nodes of a graph partition, something like:
[node1, node2] [node3] [node4, node5, node6]
My first idea was to have just a simple vector or array of ints, where the position in the array denotes the node_id and its value is some kind of group_id.
The problem is many partition algorithms rely on operating on pairs of nodes within a group. With this method, I think I would waste a lot of computation searching through the vector to find out which nodes belong to the same group.
I could also store it as an STL set of sets, which seems closer to the mathematical definition of a partition, but I am getting the impression that nested sets are not advised or are unnecessary, and I would need to modify the inner sets, which I am not sure is possible.
Any suggestions?
Depending on what exactly you want to do with the sets, you could try a disjoint set data structure. In this structure, each element has a method find that returns the "representative" of the set it belongs to.
A C++ implementation is available in Boost.
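For illustration, a minimal hand-rolled version of the same idea (Boost's implementation is the production option): path compression plus union by rank makes both operations nearly constant.

```cpp
#include <numeric>
#include <utility>
#include <vector>

class DisjointSets {
    std::vector<int> parent_, rank_;
public:
    explicit DisjointSets(int n) : parent_(n), rank_(n, 0) {
        std::iota(parent_.begin(), parent_.end(), 0); // each node is its own set
    }
    int find(int x) { // returns the set's representative
        if (parent_[x] != x) parent_[x] = find(parent_[x]); // path compression
        return parent_[x];
    }
    void merge(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return;
        if (rank_[a] < rank_[b]) std::swap(a, b); // union by rank
        parent_[b] = a;
        if (rank_[a] == rank_[b]) ++rank_[a];
    }
};
```

Two nodes are in the same group exactly when find() returns the same representative for both.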
There are two good data structures that come to mind.
The first data structure (and one that's been mentioned here before) is the disjoint-set forest, which gives extraordinarily efficient implementations of "merge these two sets" and "what set is x in?". However, it does not support the operation of splitting groups apart from one another.
The other structure I'd recommend is a link/cut tree. This structure lets you build up partitions of a graph that can be joined together into trees. Unlike the disjoint set forest, the tree describing the partition can be cut into smaller trees, allowing you to break partitions into smaller groups. This structure is a bit less efficient than the union/find structure, but it still supports all operations in amortized O(lg n).