Data structure for topic list of clients - C++

I need a data structure with O(1) add, find, and delete operations for topics, each subscribed to by a list of clients.
Some of the functions it needs to support are: isTopicExists, isClientExists, getClientsForTopic, addClientForTopic, removeClientForTopic, and getTopicsForClient.
Given a topic name, a client id that we can assume to be unique, and a client pointer, what is the best data structure to use? What implementations are available?

A hash map is not a bad idea. Its expected complexity is O(1), but a pessimistic scenario with many collisions can degrade to O(n), depending on how chaining is implemented. A guaranteed logarithmic search will be hard to beat here, so I would go for a self-balancing binary search tree, even std::map (a red-black tree in most standard library implementations). The only way to make it more efficient is to use a vector (an array), but only as long as your IDs are small, or offset but close to each other. You can't beat the maths here.
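For concreteness, here is a minimal sketch of the hash-based variant under discussion (all names are hypothetical): two unordered maps index the subscriptions from both directions, giving expected O(1) per operation, with the collision caveat noted above.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

struct Client;  // opaque client type; we only store pointers to it

class Registry {
    // topic -> client ids, plus the reverse index for getTopicsForClient
    std::unordered_map<std::string, std::unordered_set<int>> clientsOfTopic;
    std::unordered_map<int, std::unordered_set<std::string>> topicsOfClient;
    std::unordered_map<int, Client*> clientById;
public:
    bool isTopicExists(const std::string& t) const { return clientsOfTopic.count(t) != 0; }
    bool isClientExists(int id) const { return clientById.count(id) != 0; }

    void addClientForTopic(const std::string& t, int id, Client* c) {
        clientsOfTopic[t].insert(id);
        topicsOfClient[id].insert(t);
        clientById[id] = c;
    }
    void removeClientForTopic(const std::string& t, int id) {
        auto it = clientsOfTopic.find(t);
        if (it != clientsOfTopic.end()) it->second.erase(id);
        auto jt = topicsOfClient.find(id);
        if (jt != topicsOfClient.end()) jt->second.erase(t);
    }
    const std::unordered_set<int>* getClientsForTopic(const std::string& t) const {
        auto it = clientsOfTopic.find(t);
        return it == clientsOfTopic.end() ? nullptr : &it->second;
    }
    const std::unordered_set<std::string>* getTopicsForClient(int id) const {
        auto it = topicsOfClient.find(id);
        return it == topicsOfClient.end() ? nullptr : &it->second;
    }
};
```

Swapping std::unordered_map/std::unordered_set for std::map/std::set in the same layout gives the guaranteed O(log n) variant recommended above.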

Related

Can a map be used as a tree?

e.g. std::map<Item, std::vector<Item>>.
Would that be able to serve as a "quick-and-dirty" tree structure (with some helper functions on top, and given that operator< is implemented for Item), given that there's none in std/Boost?
Would a std::unordered_map be better suited/more useful/beneficial? It requires a hash instead of a comparison, though, which can be harder to implement.
I can see one issue, though: finding a node's parent/owner requires a brute-force scan through the entire map (although that might be best stored in a separate structure if needed).
Another thing I'm not so fond of is the sort of dual meaning of a map entry whose Item has an empty child list.
Can a map be used as a tree? The situation is the inverse: std::map is internally implemented using a tree, so a tree can be (and is) used as a map.
Neither map nor unordered_map is useful for implementing a general tree structure. They would only be useful if your intention is to use the tree as a map (because they are maps, which is what was desired in this scenario).
You can absolutely represent a tree in this manner; whether it's a good idea in a given situation depends entirely on which operations you need to be fast, which operations you're okay with being slow, and what your space requirements are.
(And of course in many applications the answer to all of the above may be "I don't care," in which case any implementation is probably fine.)
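For illustration, a minimal sketch of the map-as-tree idea from the question (Item is a stand-in type here; only operator< is required). Note how findParent exhibits exactly the brute-force scan the question worries about:

```cpp
#include <map>
#include <string>
#include <vector>

using Item = std::string;                      // stand-in; only needs operator<
using Tree = std::map<Item, std::vector<Item>>;

void addChild(Tree& t, const Item& parent, const Item& child) {
    t[parent].push_back(child);
    t[child];  // ensure the child exists as a node, even with no children
}

// Finding a parent is a brute-force O(n) scan over the whole map.
const Item* findParent(const Tree& t, const Item& node) {
    for (const auto& [item, children] : t)
        for (const auto& c : children)
            if (c == node) return &item;
    return nullptr;
}
```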

A*: what's the best data structure for the open set?

I'm implementing A* for the first time, and I was using a priority_queue for the open set, until I realized you need to check whether nodes are in the open set too, not just the closed one.
The thing is, you can't iterate over a priority queue. So why does everyone recommend a priority queue for the open set? Is it still the best option? I think the only way to iterate over it is to make a copy so I can pop everything from it (an enormous cost).
What's the best data structure to use for A*?
A priority queue (PQ) is an abstract data type (ADT). There are many possible implementations. Unfortunately, the priority_queue supplied with the C++ standard library is rather limited, and other implementations are suited a lot better to implementing A*. Spoiler: you can use std::set/std::multiset instead of std::priority_queue. But read on:
What you need from the priority queue to implement A* is:
1. Get the node with the lowest cost
2. Decrease the cost of arbitrary elements
Any priority queue can do 1, but for 2 you need a "mutable" priority queue. The standard-library one cannot do this. You also need an easy way to find entries in the priority queue, to know where to decrease the keys (for when A* finds a better path to an already-opened node). There are two basic approaches: you store a handle to the priority-queue element within your graph node (or use a map to store those handles for each graph node), or you insert the graph nodes themselves.
For the first case, where you store handles for each node, you can use std::multiset for your priority queue. std::multiset::begin() will always point to your lowest-cost entry, and you can decrease a key by removing the element from the set, changing the value, re-inserting it, and updating the handle. Alternatively, you can use the mutable priority queues from Boost.Heap, which directly support decrease-key.
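A minimal sketch of that handle-based approach, assuming the key is an f-cost stored as a double (all names here are hypothetical):

```cpp
#include <set>
#include <utility>

struct Node;                              // a pathfinding graph node
using Entry   = std::pair<double, Node*>; // (f-cost, node)
using OpenSet = std::multiset<Entry>;     // ordered by f-cost

struct Node {
    double f = 0.0;
    OpenSet::iterator handle;             // this node's position in the open set
    bool inOpen = false;
};

// Insert a node and remember its handle.
void push(OpenSet& open, Node* n) {
    n->handle = open.insert({n->f, n});
    n->inOpen = true;
}

// Extract the node with the lowest f-cost: begin() is always the minimum.
Node* popMin(OpenSet& open) {
    Node* n = open.begin()->second;
    open.erase(open.begin());
    n->inOpen = false;
    return n;
}

// Decrease-key: erase via the stored handle, update the cost, re-insert.
void decreaseKey(OpenSet& open, Node* n, double newF) {
    open.erase(n->handle);
    n->f = newF;
    n->handle = open.insert({n->f, n});
}
```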
For the second case, you need some kind of "intrusive" binary tree, since your pathfinding nodes themselves need to live in the priority queue. If you don't want to roll your own, see the ordered associative containers in Boost.Intrusive.
The subject is very large; I suggest reading this page if you want to know the different possibilities and get a good understanding of which data structure suits your situation:
http://theory.stanford.edu/~amitp/GameProgramming/ImplementationNotes.html#set-representation
In my case, the binary heap was a good balance between implementation difficulty and performance, which was exactly what I was looking for. But maybe you are looking for something different?
The rest of the document is a very good reference on A* for game development:
http://theory.stanford.edu/~amitp/GameProgramming/index.html
They mean a priority queue, not necessarily the std::priority_queue class that comes with the language. If the built-in one doesn't do what you need, write your own, or find another.

C++ (Hashmap style) Data Structure Ideal For This Scenario?

People have asked similar questions about the efficiency of various data structures, but none I have read is totally applicable to my scenario, so I wondered if people had suggestions for one tailored to satisfy the following criteria efficiently:
Each element will have a unique key. There will be no possibility of collisions because each element hashes to a different key. EDIT: *The key is a 32-bit uint.*
The elements are all unique and therefore can be thought of as a set.
The only operations required are adding and getting; no deletion. These need to be quick, as they will be used several hundred thousand times in a typical run!
The order in which elements are kept is irrelevant.
Speed is more important than memory consumption... though it can't be too greedy!
I am developing for a company that will use the program commercially, so any third-party data structures must come without restrictive licensing. But if the STL has a data structure that will do the job efficiently, that would be perfect.
I know there are countless hashmap/dictionary-style C++ data structures with implementations built to satisfy different criteria, so if someone can suggest one ideal for this situation, that would be greatly appreciated.
Many thanks
Edit:
I found this passage on SO that seems to suggest unordered_map would be good:
hash_map and unordered_map are generally implemented with hash tables. Thus the order is not maintained. unordered_map insert/delete/query will be O(1) (constant time), where map will be O(log n), where n is the number of items in the data structure. So unordered_map is faster, and if you don't care about the order of the items it should be preferred over map. Sometimes you want to maintain order (ordered by the key), and for that map would be the choice.
It looks like a prefix tree (with the element stored at each terminal node) also fits this scenario. It's damn fast, even faster than a hash map, because no hash value calculation is done, and getting a value is purely O(n), where n is the key length. It's a bit memory-hungry, but common prefixes of keys are shared in the same node path.
EDIT: I assumed the keys are strings, not simple values like integers.
As for built-in solutions, I'd recommend google::dense_hash_map. It is really fast, especially for numeric keys. You'll have to decide on a specific key value that will be reserved as the "empty_key". Moreover, here is a really nice comparison of different hash-map implementations.
An excerpt:
Library          Linux int, CPU (sec)   Linux str, CPU (sec)   Linux peak mem (MB)
glib             3.490                  4.720                  24.968
ghthash          3.260                  3.460                  61.232
CC’s hashtable   3.040                  4.050                  129.020
TR1              1.750                  3.300                  28.648
STL hash_set     2.070                  3.430                  25.764
google-sparse    2.560                  6.930                  5.42/8.54
google-dense     0.550                  2.820                  24.7/49.3
khash (C++)      1.100                  2.900                  6.88/13.1
khash (C)        1.140                  2.940                  6.91/13.1
STL set (RB)     7.840                  18.620                 29.388
kbtree (C)       4.260                  17.620                 4.86/9.59
NP’s splaytree   11.180                 27.610                 19.024
However, by also setting a "deleted_key", this map can perform deletions as well. So maybe it would be possible to create a custom solution that is even more efficient. But apart from that minor point, any hash map should suit your needs exactly (note that std::map is an ordered tree-map and thus slower).
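For reference, a small usage sketch of google::dense_hash_map as described above (the header path is from the sparsehash distribution; the reserved key values are arbitrary choices that must never collide with real keys):

```cpp
#include <cstdint>
#include <sparsehash/dense_hash_map>  // google-sparsehash library

int main() {
    google::dense_hash_map<std::uint32_t, int> table;
    table.set_empty_key(0xFFFFFFFFu);    // required before any insertion
    table.set_deleted_key(0xFFFFFFFEu);  // only needed if erase() is used

    table[42] = 7;        // add
    int v = table[42];    // get
    table.erase(42);      // delete, possible once deleted_key is set
    return v == 7 ? 0 : 1;
}
```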
What you need definitely sounds like a hash set, C++ has this as either std::tr1::unordered_set or in Boost.Unordered.
P.S. Note, however, that TR1 is not yet standard, and you'll probably need to get Boost for the implementation.
It sounds like std::unordered_set would fit the bill, but without knowing more about the key, it's difficult to say. I'm curious about how you can guarantee that there will be no possibility of collisions: this implies a small (less than the size of the table), finite set of keys. If this is the case, it may be more efficient to map the keys to small ints and use std::vector (with empty slots for the entries not present).
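A sketch of that direct-indexing idea (hypothetical names; it uses C++17 std::optional for brevity, where the TR1-era discussion would have used a sentinel value or a parallel bool vector):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// If the keys can be mapped to small ints in [0, n), a plain vector beats
// any hash table: no hashing, one index, perfect locality.
template <typename Value>
class DirectMap {
    std::vector<std::optional<Value>> slots;  // empty slot = key not present
public:
    explicit DirectMap(std::size_t n) : slots(n) {}
    void add(std::uint32_t key, Value v) { slots[key] = std::move(v); }
    const Value* get(std::uint32_t key) const {
        return slots[key] ? &*slots[key] : nullptr;  // O(1), no hashing
    }
};
```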
What you're looking for is an unordered_set. You can find one in Boost, TR1, or C++0x. If you're hoping to associate the key with a value, then unordered_map does just that, also in Boost/TR1/C++0x.

Why is std::map implemented as a red-black tree?

Why is std::map implemented as a red-black tree?
There are several balanced binary search trees (BSTs) out there. What were the design trade-offs in choosing a red-black tree?
Probably the two most common self-balancing tree algorithms are red-black trees and AVL trees. To balance the tree after an insertion/update, both algorithms use the notion of rotations, where the nodes of the tree are rotated to perform the re-balancing.
While in both algorithms the insert/delete operations are O(log n), in the case of a red-black tree the re-balancing rotation is an O(1) operation, while with AVL it is O(log n). This makes the red-black tree more efficient in this aspect of the re-balancing stage, and it is one of the possible reasons it is more commonly used.
Red-black trees are used in most collection libraries, including the offerings from Java and the Microsoft .NET Framework.
It really depends on the usage. An AVL tree usually needs more rebalancing rotations. So if your application doesn't have too many insertion and deletion operations, but weighs heavily on searching, then an AVL tree is probably a good choice.
std::map uses a red-black tree because it gives a reasonable trade-off between the speed of node insertion/deletion and searching.
The previous answers only address tree alternatives and red black probably only remains for historical reasons.
Why not a hash table?
A type only requires the < operator (comparison) to be used as a key in a tree. However, hash tables require that each key type have a hash function defined. Keeping type requirements to a minimum is very important for generic programming, so you can use the container with a wide variety of types and algorithms.
Designing a good hash table requires intimate knowledge of the context in which it will be used. Should it use open addressing, or linked chaining? What load levels should it accept before resizing? Should it use an expensive hash that avoids collisions, or one that is rough and fast?
Since the STL can't anticipate which is the best choice for your application, the default needs to be more flexible. Trees "just work" and scale nicely.
(C++11 did add hash tables with unordered_map. You can see from the documentation that it requires setting policies to configure many of these options.)
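For illustration, a small sketch of those knobs as exposed by std::unordered_map (the chosen values are arbitrary):

```cpp
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> m;
    m.max_load_factor(0.5f);  // resize earlier: fewer collisions, more memory
    m.reserve(1000);          // pre-allocate buckets for ~1000 elements
    m["answer"] = 42;

    // Inspection hooks into the hashing policy:
    std::size_t buckets = m.bucket_count();  // current number of buckets
    float load = m.load_factor();            // elements / buckets
    return buckets > 0 && load > 0.0f ? 0 : 1;
}
```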
What about other trees?
Red-black trees offer fast lookup and are self-balancing, unlike plain BSTs. Another user pointed out their advantages over the self-balancing AVL tree.
Alexander Stepanov (the creator of the STL) said that he would use a B* tree instead of a red-black tree if he wrote std::map again, because it is friendlier to modern memory caches.
One of the biggest changes since then has been the growth of caches. Cache misses are very costly, so locality of reference is much more important now. Node-based data structures, which have low locality of reference, make much less sense. If I were designing STL today, I would have a different set of containers. For example, an in-memory B*-tree is a far better choice than a red-black tree for implementing an associative container. - Alexander Stepanov
Should maps always use trees?
Another possible map implementation would be a sorted vector (insertion sort) and binary search. This works well for containers which aren't modified often but are queried frequently. I often do this in C, as qsort and bsearch are built in.
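A minimal sketch of that sorted-vector map, with hypothetical put/get helpers built on std::lower_bound:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, int>;  // (key, value), kept sorted by key

// Insert while keeping the vector sorted (an O(n) shift, like insertion sort).
void put(std::vector<Pair>& v, Pair p) {
    auto it = std::lower_bound(v.begin(), v.end(), p,
        [](const Pair& a, const Pair& b) { return a.first < b.first; });
    v.insert(it, std::move(p));
}

// O(log n) lookup over contiguous, cache-friendly storage.
const int* get(const std::vector<Pair>& v, const std::string& key) {
    auto it = std::lower_bound(v.begin(), v.end(), key,
        [](const Pair& a, const std::string& k) { return a.first < k; });
    return (it != v.end() && it->first == key) ? &it->second : nullptr;
}
```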
Do I even need to use a map?
Cache considerations mean it rarely makes sense to use std::list or std::deque over std::vector, even in those situations we were taught about in school (such as removing an element from the middle of the list).
Applying that same reasoning, using a for loop to linearly search a list is often more efficient and cleaner than building a map for a few lookups.
Of course choosing a readable container is usually more important than performance.
AVL trees have a maximum height of 1.44 log n, while RB trees have a maximum of 2 log n. Inserting an element into an AVL tree may imply a rebalance at one point in the tree; the rebalancing finishes the insertion. After insertion of a new leaf, the ancestors of that leaf have to be updated up to the root, or up to a point where the two subtrees are of equal depth. The probability of having to update k nodes is 1/3^k. Rebalancing is O(1). Removing an element may imply more than one rebalancing (up to half the depth of the tree).
RB trees are B-trees of order 4 represented as binary search trees. A 4-node in the B-tree results in two levels in the equivalent BST. In the worst case, all the nodes of the tree are 2-nodes, with only one chain of 3-nodes down to a leaf; that leaf will be at a distance of 2 log n from the root.
Going down from the root to the insertion point, one has to change 4-nodes into 2-nodes to make sure any insertion will not saturate a leaf. Coming back up from the insertion, all these nodes have to be analysed to make sure they correctly represent 4-nodes. This can also be done going down the tree; the global cost is the same. There is no free lunch! Removing an element from the tree is of the same order.
All these trees require that nodes carry information on height, weight, color, etc. Only splay trees are free of such additional info. But most people are afraid of splay trees because of the randomness of their structure!
Finally, trees can also carry weight information in the nodes, permitting weight balancing. Various schemes can be applied; one should rebalance when a subtree contains more than 3 times the number of elements of the other subtree. Rebalancing is again done through either a single or a double rotation. This means a worst case of 2.4 log n. One can get away with 2 times instead of 3, a much better ratio, but it may mean leaving a little less than 1% of the subtrees unbalanced here and there. Tricky!
Which type of tree is the best? AVL, for sure. They are the simplest to code, and have their worst-case height nearest to log n. For a tree of 1,000,000 elements, an AVL will be at most of height 29, an RB tree 40, and a weight-balanced tree 36 or 50 depending on the ratio.
There are a lot of other variables: randomness, ratio of adds, deletes, searches, etc.
It is just the choice of your implementation: they could be implemented as any balanced tree. The various choices are all comparable, with minor differences, so any is as good as any other.
Update 2017-06-14: webbertiger edited their answer after I commented. I should point out that their answer is now a lot better in my eyes, but I kept my answer just as additional information...
Because I think the first answer is wrong (correction: not both anymore) and the third has a wrong affirmation, I feel I have to clarify things...
The 2 most popular trees are AVL and red-black (RB). The main difference lies in the intended usage:
AVL: better if the ratio of consultation (reads) to manipulation (modification) is high. The memory footprint is a little smaller than RB's (due to the bit required for coloring).
RB: better in general cases, where there is a balance between consultation (reads) and manipulation (modification), or more modification than consultation. It has a slightly bigger memory footprint due to the storage of the red-black flag.
The main difference comes from the coloring. You have fewer rebalance actions in an RB tree than in an AVL tree, because the coloring sometimes lets you skip or shorten rebalance actions, which have a relatively high cost. Because of the coloring, an RB tree also has more levels of nodes, because it can accept red nodes between black ones (giving roughly 2x more levels in the worst case), making search (reads) a little less efficient... but because it is a constant factor (2x), it stays O(log n).
If you consider the performance hit for a modification of a tree (significant) versus the performance hit for consulting a tree (almost insignificant), it becomes natural to prefer RB over AVL for the general case.

Data structure for storing an array of strings in memory

I'm considering a data structure for storing a large array of strings in memory. The strings will be inserted at the beginning of the program and will not be added or deleted while the program is running. The crucial point is that the search procedure should be as fast as it can be. Saving memory is not important. I'm inclined towards the standard structure hash_set from the standard library, which allows elements to be searched for in roughly constant time. But it's not guaranteed that this time will be short. Can anyone suggest a better standard solution?
Many thanks!
Try a prefix tree.
A trie is better than a binary search tree for searching elements. Compared against a hash table, see this question.
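To make the suggestion concrete, here is a minimal trie sketch (hypothetical, restricted to lowercase ASCII keys). Lookup cost depends only on the key length, not on how many strings are stored:

```cpp
#include <array>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child{};  // one slot per letter
    bool terminal = false;                              // a stored key ends here
};

void insert(TrieNode& root, const std::string& s) {
    TrieNode* n = &root;
    for (char c : s) {
        auto& slot = n->child[c - 'a'];
        if (!slot) slot = std::make_unique<TrieNode>();
        n = slot.get();
    }
    n->terminal = true;
}

bool contains(const TrieNode& root, const std::string& s) {
    const TrieNode* n = &root;
    for (char c : s) {
        n = n->child[c - 'a'].get();
        if (!n) return false;  // path breaks: key not stored
    }
    return n->terminal;
}
```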
If lookup time really is the only important thing, then at startup, once you have all the strings, you could compute a perfect hash over them and use that as the hashing function for a hashtable.
The problem is how you'd execute the hash: any kind of byte-code-based computation is probably going to be slower than using a fixed hash and dealing with collisions. But if all you care about is lookup speed, then you can require that your process has the necessary privileges to load and execute code. Write the code for the perfect hash, run it through a compiler, load it. Test at runtime whether it's actually faster for these strings than your best known data-agnostic structure (which might be a trie, a hashtable, a Judy array or a splay tree, depending on implementation details and your typical access patterns), and if not, fall back to that. Slow setup, fast lookup.
It's almost never truly the case that speed is the only crucial point.
There is, for example, google-sparsehash.
It includes a dense hash set/map (re)implementation that may perform better than the standard library hash set/map.
See its performance numbers. Make sure that you are using a good hash function. (My subjective vote: MurmurHash2.)
Strings will be inserted at the beginning of the program and will not be added or deleted while the program is running.
If the strings are immutable, so that insertion/deletion is "infrequent", so to speak, another option is to build a Directed Acyclic Word Graph (DAWG) or a Compact Directed Acyclic Word Graph, which might* be faster than a hash table and has a better worst-case guarantee.
*Standard disclaimer applies: depending on the use case, implementations, data set, phase of the moon, etc. Theoretical expectations may differ from observed results because of factors not accounted for (e.g. cache and memory latency, the time complexity of certain machine instructions, etc.).
A hash_set with a suitable number of buckets would be ideal; alternatively, a vector with the strings in dictionary order, searched using binary search, would be great too.
The two standard data structures for fast string lookup are hash tables and tries, particularly Patricia tries. A good hash implementation and a good trie implementation should give similar performance, as long as the hash implementation is good enough to limit the number of collisions. Since you never modify the set of strings, you could try to build a perfect hash. If performance is more important than development time, try all solutions and benchmark them.
A complementary technique that could save lookups in the string table is to use atoms: each time you read a string that you know you're going to look up in the table, look it up immediately, and store a pointer to it (or an index into the data structure) instead of storing the string. That way, testing the equality of two strings is a simple pointer or integer comparison (and you also save memory by storing each string only once).
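A small sketch of that atom (string-interning) technique, using a hypothetical AtomTable. std::unordered_set keeps pointers to its elements valid across rehashes, so the returned atom pointers are stable:

```cpp
#include <string>
#include <unordered_set>

class AtomTable {
    std::unordered_set<std::string> pool;  // owns exactly one copy per string
public:
    // Returns a stable pointer to the single stored copy of s.
    const std::string* intern(const std::string& s) {
        return &*pool.insert(s).first;     // existing copy, or freshly added
    }
};

// Usage sketch: equality of atoms is a pointer compare, not a string compare.
// AtomTable atoms;
// const std::string* a = atoms.intern("hello");
// const std::string* b = atoms.intern("hello");
// // a == b holds: both point at the same interned string
```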
Your best bet would be as follows:
Building your structure:
Insert all your strings (char*s) into an array.
Sort the array lexicographically.
Lookup:
Use a binary search on your array.
This maintains cache locality, allows for efficient lookup (it will search a space of ~4 billion strings with 32 comparisons), and is dead simple to implement. There's no need to get fancy with tries, because they are complicated, and slower than they appear (especially if you have long strings).
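A minimal sketch of that recipe, using std::sort and std::binary_search over an array of C strings:

```cpp
#include <algorithm>
#include <cstring>
#include <vector>

bool lessCStr(const char* a, const char* b) { return std::strcmp(a, b) < 0; }

int main() {
    std::vector<const char*> strings = {"pear", "apple", "orange", "banana"};

    // Build: sort once, lexicographically.
    std::sort(strings.begin(), strings.end(), lessCStr);

    // Lookup: O(log n) comparisons over a cache-friendly contiguous array.
    bool found = std::binary_search(strings.begin(), strings.end(),
                                    "orange", lessCStr);
    return found ? 0 : 1;
}
```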
Random sidenote: Combined with http://blogs.msdn.com/b/oldnewthing/archive/2005/05/19/420038.aspx, you'll be unstoppable!
Well, assuming you truly want an array and not an associative container as you mentioned, the allocation strategy described in Raymond Chen's blog would be efficient.