I need to store data grouping nodes of a graph partition, something like:
[node1, node2] [node3] [node4, node5, node6]
My first idea was to have just a simple vector or array of ints, where the position in the array denoted the node_id and it's value is some kind of group_id
The problem is many partition algorithms rely on operating on pairs of nodes within a group. With this method, I think I would waste a lot of computation searching through the vector to find out which nodes belong to the same group.
I could also store as a stl set of sets, which seems closer to the mathematical definition of a partition, but I am getting the impression nested sets are not advised or unnecessary, and I would need to modify the inner sets which I am not sure is possible.
Any suggestions?
Depending on what exactly you want to do with the sets, you could try a disjoint set data structure. In this structure, each element has a method find that returns the "representative" of the set it belongs to.
A C++ implementation is available in Boost.
There are two good data structures that come to mind.
The first data structure (and one that's been mentioned here before) is the disjoint-set forest, which gives extraordinarily efficient implementations of "merge these two sets" and "what set is x in?". However, it does not support the operation of splitting groups apart from one another.
The other structure I'd recommend is a link/cut tree. This structure lets you build up partitions of a graph that can be joined together into trees. Unlike the disjoint set forest, the tree describing the partition can be cut into smaller trees, allowing you to break partitions into smaller groups. This structure is a bit less efficient than the union/find structure, but it still supports all operations in amortized O(lg n).
Related
I am trying to sort a large collection of objects into a series of groups, which represent some kind of commonality between them.
There seems to be two ways I can go about this:
1) I can manage everything by hand, sorting out all the objects into a vector of vectors. However, this means that I have to iterate over all the upper level vectors every time I want to try and find an existing group for an ungrouped object. I imagine this will become very computationally expensive very quickly as the number of disjoint groups increases.
2) I can use the identifiers of each object that I'm using to classify them as a key for an std::map, where the value is a vector. At that point, all I have to do is iterate over all the input objects once, calling myMap[object.identifier].push_back(object) each time. The map will sort everything out into the appropriate vector, and then I can just iterate over the resulting values afterwards.
My question is...
Which method would be best to use? It seems like a vector of vectors would be faster initially, but it's going to slow down as more and more groups are created. AFAIK, std::map uses RB trees internally, which means that finding the appropriate vector to add the object to should be faster, but you're going to pay for that when the tree inevitably needs to be rebalanced.
The additional memory consumption from an std::map doesn't matter. I'm dealing with anywhere from 12000 to 80000 individual objects that need to be grouped together, and I expect there to be anywhere from 12000 to 20000 groups once everything is said and done.
Instead of using either of your mentioned approaches directly, I suggest you evaluate the use of std::unordered_map (docs here) for your use case. It uses maps with buckets and hashed values internally and has average constant complexity for search, insertion and removal.
I am designing a Graph in c++ using a hash table for its elements. The hashtable is using open addressing and the Graph has no more than 50.000 edges. I also designed a PRIM algorithm to find the minimum spanning tree of the graph. My PRIM algorithm creates storage for the following data:
A table named Q to put there all the nodes in the beginning. In every loop, a node is visited and in the end of the loop, it's deleted from Q.
A table named Key, one for each node. The key is changed when necessary (at least one time per loop).
A table named Parent, one for each node. In each loop, a new element is inserted in this table.
A table named A. The program stores here the final edges of the minimum spanning tree. It's the table that is returned.
What would be the most efficient data structure to use for creating these tables, assuming the graph has 50.000 edges?
Can I use arrays?
I fear that the elements for every array will be way too many. I don't even consider using linked lists, of course, because the accessing of each element will take to much time. Could I use hash tables?
But again, the elements are way to many. My algorithm works well for Graphs consisting of a few nodes (10 or 20) but I am sceptical about the situation where the Graphs consist of 40.000 nodes. Any suggestion is much appreciated.
(Since comments were getting a bit long): The only part of the problem that seems to get ugly for very large size, is that every node not yet selected has a cost and you need to find the one with lowest cost at each step, but executing each step reduces the cost of a few effectively random nodes.
A priority queue is perfect when you want to keep track of lowest cost. It is efficient for removing the lowest cost node (which you do at each step). It is efficient for adding a few newly reachable nodes, as you might on any step. But in the basic design, it does not handle reducing the cost of a few nodes that were already reachable at high cost.
So (having frequent need for a more functional priority queue), I typically create a heap of pointers to objects and in each object have an index of its heap position. The heap methods all do a callback into the object to inform it whenever its index changes. The heap also has some external calls into methods that might normally be internal only, such as the one that is perfect for efficiently fixing the heap when an existing element has its cost reduced.
I just reviewed the documentation for the std one
http://en.cppreference.com/w/cpp/container/priority_queue
to see if the features I always want to add were there in some form I hadn't noticed before (or had been added in some recent C++ version). So far as I can tell, NO. Most real world uses of priority queue (certainly all of mine) need minor extra features that I have no clue how to tack onto the standard version. So I have needed to rewrite it from scratch including the extra features. But that isn't actually hard.
The method I use has been reinvented by many people (I was doing this in C in the 70's, and wasn't first). A quick google search found one of many places my approach is described in more detail than I have described it.
http://users.encs.concordia.ca/~chvatal/notes/pq.html#heap
I am looking for data structure in c++ and I need an advice.
I have nodes, every node has unique_id and group_id:
1 1.1.1.1
2 1.1.1.2
3 1.1.1.3
4 1.1.2.1
5 1.1.2.2
6 1.1.2.3
7 2.1.1.1
8 2.1.1.2
I need a data structure to answer those questions:
what is the group_id of node 4
give me list (probably vector) of unique_id's that belong to group 1.1.1
give me list (probably vector) of unique_id's that belong to group 1.1
give me list (probably vector) of unique_id's that belong to group 1
Is there a data structure that can answer those questions (what is the complexity time of inserting and answering)? or should I implement it?
I would appreciate an example.
EDIT:
at the beginning, I need to build this data structure. most of the action is reading by group id. insertion will happen but less then reading.
the time complexity is more important than memory space
To me, hierarchical data like the group ID calls for a tree structure. (I assume that for 500 elements this is not really necessary, but it seems natural and scales well.)
Each element in the first two levels of the tree would just hold vectors (if they come ordered) or maps (if they come un-ordered) of sub-IDs.
The third level in the tree hierarchy would hold pointers to leaves, again in a vector or map, which contain the fourth group ID part and the unique ID.
Questions 2-4 are easily and quickly answered by navigating the tree.
For question 1 one needs an additional map from unique IDs to leaves in the tree; each element inserted into the tree also has a pointer to it inserted into the map.
First of all, if you are going to have only a small number of nodes then it would probably make sense not to mess with advanced data structuring. Simple linear search could be sufficient.
Next, it looks like a good job for SQL. So may be it's a good idea to incorporate into your app SQLite library. But even if you really want to do it without SQL it's still a good hint: what you need are two index trees to support quick searching through your array. The complexity (if using balanced trees) will be logarithmic for all operations.
Depends...
How often do you insert? Or do you mostly read?
How often do you access by Id or GroupId?
With a max of 500 nodes I would put them in a simple Vector where the Id is the offset into the array (if the Ids are indeed as shown). The group-search can than be implemented by iterating over the array and comparing the partial gtroup-ids.
If this is too expensive and you really access the strcuture a lot and need very high performance, or you do a lot of inserts I would implement a tree with a HashMap for the Id's.
If the data is stored in a database you may use a SELECT/ CONNECT BY if your systems supports that and query the information directly from the DB.
Sorry for not providing a clear answer, but the solution depends on too many factors ;-)
Sounds like you need a container with two separate indexes on unique_id and group_id. Question 1 will be handled by the first index, Questions 2-4 will be handled by the second.
Maybe take a look at Boost Multi-index Containers Library
I am not sure of the perfect DS for this. But I would like to make use of a map.
It will give you O(1) efficiency for question 1 and for insertion O(logn) and deletion. The issue comes for question 2,3,4 where your efficiency will be O(n) where n is the number of nodes.
I have a large amount of data the I want to be able to access in two different ways. I would like constant time look up based on either key, constant time insertion with one key, and constant time deletion with the other. Is there such a data structure and can I construct one using the data structures in tr1 and maybe boost?
Use two parallel hash-tables. Make sure that the keys are stored inside the element value, because you'll need all the keys during deletion.
Have you looked at Bloom Filters? They aren't O(1), but I think they perform better than hash tables in terms of both time and space required to do lookups.
Hard to find why you need to do this but as someone said try using 2 different hashtables.
Just pseudocode in here:
Hashtable inHash;
Hashtable outHash;
//Hello myObj example!!
myObj.inKey="one";
myObj.outKey=1;
myObj.data="blahblah...";
//adding stuff
inHash.store(myObj.inKey,myObj.outKey);
outHash.store(myObj.outKey,myObj);
//deleting stuff
inHash.del(myObj.inKey,myObj.outKey);
outHash.del(myObj.outKey,myObj);
//findin stuff
//straight
myObj=outHash.get(1);
//the other way; still constant time
key=inHash.get("one");
myObj=outHash.get(key);
Not sure, thats what you're looking for.
This is one of the limits of the design of standard containers: a container in a sense "own" the contained data and expects to be the only owner... containers are not merely "indexes".
For your case a simple, but not 100% effective, solution is to have two std::maps with "Node *" as value and storing both keys in the Node structure (so you have each key stored twice). With this approach you can update your data structure with reasonable overhead (you will do some extra map search but that should be fast enough).
A possibly "correct" solution however would IMO be something like
struct Node
{
Key key1;
Key key2;
Payload data;
Node *Collision1Prev, *Collision1Next;
Node *Collision2Prev, *Collision2Next;
};
basically having each node in two different hash tables at the same time.
Standard containers cannot be combined this way. Other examples I coded by hand in the past are for example an hash table where all nodes are also in a doubly-linked list, or a tree where all nodes are also in an array.
For very complex data structures (e.g. network of structures where each one is both the "owner" of several chains and part of several other chains simultaneously) I even resorted sometimes to code generation (i.e. scripts that generate correct pointer-handling code given a description of the data structure).
Consider a sequence of n positive real numbers, (ai), and its partial sum sequence, (si). Given a number x ∊ (0, sn], we have to find i such that si−1 < x ≤ si. Also we want to be able to change one of the ai’s without having to update all partial sums. Both can be done in O(log n) time by using a binary tree with the ai’s as leaf node values, and the values of the non-leaf nodes being the sum of the values of the respective children. If n is known and fixed, the tree doesn’t have to be self-balancing and can be stored efficiently in a linear array. Furthermore, if n is a power of two, only 2 n − 1 array elements are required. See Blue et al., Phys. Rev. E 51 (1995), pp. R867–R868 for an application. Given the genericity of the problem and the simplicity of the solution, I wonder whether this data structure has a specific name and whether there are existing implementations (preferably in C++). I’ve already implemented it myself, but writing data structures from scratch always seems like reinventing the wheel to me—I’d be surprised if nobody had done it before.
This is known as a finger tree in functional programming but apparently there are implementations in imperative languages. In the articles there is a link to a blog post explaining an implementation of this data structure in C# which could be useful to you.
Fenwick tree (aka Binary indexed tree) is a data structure that maintains a sequence of elements, and is able to compute cumulative sum of any range of consecutive elements in O(logn) time. Changing value of any single element needs O(logn) time as well.