Fast, sorted data structure with random access? - c++

I want to add several objects in a structure which allows this:
Insertion of objects, immediately ordering the entire structure on add so I have a descending ordering of an int;
Being able to change the int by which the objects are ordered (for example, if object number 2 now has an int of 5, the structure is reordered);
Fast structure, because it will be completely iterated 60 times a second;
Being able to directly access the objects by position;
Only needs to be iterated from top to bottom: higher INT to lower INT
No deletion required, but could become useful later on.
Some indications on how to use the structure would be great, since I don't know much about the C++ standard libraries.

All of the operations that you've listed (except for lookup by index) can be supported by a standard binary search tree, keyed by integer values. This gives you the ability to iterate over the elements in sorted order and to keep the objects sorted during any insertion. As @njr mentioned, you can also update priorities by removing objects from the binary search tree, changing their priority, then reinserting them into the binary search tree.
To support random access by index, you should consider looking into order statistic trees, a variant on binary search trees that in addition to all other operations supports very fast (O(log n)) lookup of an element by its index. That is, you could very efficiently query for the 15th element in the sorted sequence, or the 17th, etc. Order statistic trees aren't part of the C++ standard libraries, but this older question contains answers that can link you to an implementation.

Use a set or a map
For requirement 1 - provide a custom sorting function
For 2 - remove the item and add it again (or provide a wrapper that does that)
3 doesn't make sense (How big is the list, how fast is the processor/ram)
For 4 - Are you sure you need that? It seems to be kind of weird to try to access it by position when the position can change suddenly (some item was added or removed)
5 - same as 1
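A minimal sketch of the set-based suggestion above, assuming a hypothetical Item type with an id and a score (neither name comes from the question). The comparator gives the descending order, and a score change is done by erasing and reinserting the element. Note that std::set does not offer access by position, so requirement 4 is not covered here.

```cpp
#include <set>
#include <vector>

// Hypothetical object type: `id` identifies it, `score` is the int to sort by.
struct Item {
    int id;
    int score;
};

// Orders by descending score; ties are broken by id so that distinct
// objects with equal scores can coexist in the set.
struct ByScoreDesc {
    bool operator()(const Item& a, const Item& b) const {
        if (a.score != b.score) return a.score > b.score;
        return a.id < b.id;
    }
};

using ItemSet = std::set<Item, ByScoreDesc>;

// Changes the score of an element by removing and reinserting it,
// because elements of a std::set must not be modified in place.
// Returns false if `old_item` was not in the set.
bool update_score(ItemSet& items, const Item& old_item, int new_score) {
    auto it = items.find(old_item);
    if (it == items.end()) return false;
    Item changed = *it;
    items.erase(it);
    changed.score = new_score;
    items.insert(changed);
    return true;
}

// Collects the ids from highest to lowest score (the per-frame iteration).
std::vector<int> ids_in_order(const ItemSet& items) {
    std::vector<int> ids;
    for (const Item& item : items) ids.push_back(item.id);
    return ids;
}
```

Each insert, erase and find is O(log n), and iterating the whole set is linear, which is usually cheap enough for a per-frame pass unless the set is very large.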

Related

Rank-Preserving Data Structure other than std::vector?

I am faced with an application where I have to design a container that has random access (or at least better than O(n)), has inexpensive (O(1)) insert and removal, and stores the data according to the order (rank) specified at insertion.
For example if I have the following array:
[2, 9, 10, 3, 4, 6]
I can call the remove on index 2 to remove 10 and I can also call the insert on index 1 by inserting 13.
After those two operations I would have:
[2, 13, 9, 3, 4, 6]
The numbers are stored in a sequence and insert/remove operations require an index parameter to specify where the number should be inserted or which number should be removed.
My question is, what kind of data structures, besides a Linked List and a vector, could maintain something like this? I am leaning towards a Heap that prioritizes on the next available index. But I have been seeing something about a Fusion Tree being useful (but more in a theoretical sense).
What kind of Data structures would give me the most optimal running time while still keeping memory consumption down? I have been playing around with an insertion order preserving hash table, but it has been unsuccessful so far.
The reason I am ruling out a plain std::vector is that I must construct something that outperforms a vector in terms of these basic operations. The size of the container has the potential to grow to hundreds of thousands of elements, so committing to shifts in a std::vector is out of the question. The same problem lies with a linked list (even if doubly linked): traversing it to a given index would take, in the worst case, O(n/2), which is O(n).
I was thinking of a doubly linked list that contained a Head, Tail, and Middle pointer, but I felt that it wouldn't be much better.
In a basic usage, to be able to insert and delete at arbitrary position, you can use linked lists. They allow for O(1) insert/remove, but only provided that you have already located the position in the list where to insert. You can insert "after a given element" (that is, given a pointer to an element), but you can not as efficiently insert "at given index".
To be able to insert and remove an element given its index, you will need a more advanced data structure. There exist at least two such structures that I am aware of.
One is a rope structure, which is available in some C++ extensions (SGI STL, or in GCC via #include <ext/rope>). It allows for O(log N) insert/remove at arbitrary position.
Another structure allowing for O(log N) insert/remove is an implicit treap (aka implicit cartesian tree), you can find some information at http://codeforces.com/blog/entry/3767, Treap with implicit keys or https://codereview.stackexchange.com/questions/70456/treap-with-implicit-keys.
An implicit treap can also be modified to find the minimal value in it (and to support many more operations). Not sure whether a rope can handle this.
UPD: In fact, I guess that you can adapt any O(log N) binary search tree (such as AVL or red-black tree) for your request by converting it to "implicit key" scheme. A general outline is as follows.
Imagine a binary search tree which, at each given moment, stores the consecutive numbers 1, 2, ..., N as its keys (N being the number of nodes in the tree). Every time we change the tree (insert or remove a node) we recalculate all the stored keys so that they still run from 1 to the new value of N. This allows insert/remove at an arbitrary position, as the key is now the position, but updating all the keys takes too much time.
To avoid this, we will not store keys in the tree explicitly. Instead, for each node, we will store the number of nodes in its subtree. As a result, any time we go from the tree root down, we can keep track of the index (position) of the current node: we just need to sum the sizes of the subtrees that we have to our left. This allows us, given k, to locate the node that has index k (that is, the k-th node in the standard order of the binary search tree) in O(log N) time. After this, we can perform an insert or delete at this position using the standard binary tree procedure; we just need to update the subtree sizes of all the nodes changed during the update, but this is easily done in O(1) time per changed node, so the total insert or remove time will be O(log N), as in the original binary search tree.
So this approach allows inserting, removing and accessing nodes at a given position in O(log N) time using any O(log N) binary search tree as a basis. You can of course store the additional information ("values") you need in the nodes, and you can even calculate the minimum of these values in the tree just by keeping the minimum value of each node's subtree.
However, the aforementioned treap and rope are more advanced as they allow also for split and merge operations (taking a substring/subarray and concatenating two strings/arrays).
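The implicit-key scheme can be sketched with a treap built on split and merge. This is an illustrative sketch, not the code from the linked pages: the class name ImplicitTreap, its member names and the fixed random seed are assumptions made for the example, and it stores plain ints with 0-based positions.

```cpp
#include <cstddef>
#include <memory>
#include <random>
#include <stdexcept>
#include <utility>
#include <vector>

// Sequence container sketch: a treap whose keys are implicit positions.
class ImplicitTreap {
    struct Node {
        int value;
        unsigned priority;
        std::size_t size = 1;  // number of nodes in this subtree
        std::unique_ptr<Node> left, right;
        Node(int v, unsigned p) : value(v), priority(p) {}
    };
    using Ptr = std::unique_ptr<Node>;

    Ptr root_;
    std::mt19937 rng_{12345};  // fixed seed: an arbitrary choice for reproducibility

    static std::size_t size_of(const Ptr& n) { return n ? n->size : 0; }
    static void refresh(Node& n) { n.size = 1 + size_of(n.left) + size_of(n.right); }

    // Splits t into its first k elements (a) and the remaining ones (b).
    static void split(Ptr t, std::size_t k, Ptr& a, Ptr& b) {
        if (!t) { a.reset(); b.reset(); return; }
        if (size_of(t->left) >= k) {
            Ptr left_rest;
            split(std::move(t->left), k, a, left_rest);
            t->left = std::move(left_rest);
            refresh(*t);
            b = std::move(t);
        } else {
            Ptr right_first;
            split(std::move(t->right), k - size_of(t->left) - 1, right_first, b);
            t->right = std::move(right_first);
            refresh(*t);
            a = std::move(t);
        }
    }

    // Concatenates a and b, keeping every element of a before those of b.
    static Ptr merge(Ptr a, Ptr b) {
        if (!a) return b;
        if (!b) return a;
        if (a->priority > b->priority) {
            a->right = merge(std::move(a->right), std::move(b));
            refresh(*a);
            return a;
        }
        b->left = merge(std::move(a), std::move(b->left));
        refresh(*b);
        return b;
    }

public:
    std::size_t size() const { return size_of(root_); }

    // Inserts value so that it ends up at index pos (requires pos <= size()).
    void insert(std::size_t pos, int value) {
        Ptr a, b;
        split(std::move(root_), pos, a, b);
        Ptr node(new Node(value, static_cast<unsigned>(rng_())));
        root_ = merge(merge(std::move(a), std::move(node)), std::move(b));
    }

    // Removes the element at index pos (requires pos < size()).
    void erase(std::size_t pos) {
        Ptr a, rest, removed, b;
        split(std::move(root_), pos, a, rest);
        split(std::move(rest), 1, removed, b);
        root_ = merge(std::move(a), std::move(b));
    }

    // Finds the element at index pos by walking down using subtree sizes.
    int at(std::size_t pos) const {
        const Node* n = root_.get();
        while (n) {
            std::size_t left = size_of(n->left);
            if (pos < left) n = n->left.get();
            else if (pos == left) return n->value;
            else { pos -= left + 1; n = n->right.get(); }
        }
        throw std::out_of_range("index out of range");
    }

    // Copies the sequence out in order (O(n log n) here, for simplicity).
    std::vector<int> to_vector() const {
        std::vector<int> out;
        for (std::size_t i = 0; i < size(); ++i) out.push_back(at(i));
        return out;
    }
};
```

Starting from the question's sequence [2, 9, 10, 3, 4, 6], calling erase(2) and then insert(1, 13) yields [2, 13, 9, 3, 4, 6]. All three operations take expected O(log N) time, because the random priorities keep the tree balanced in expectation.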
Consider a skip list, which can implement logarithmic time rank operations in its "indexable" variation.
For algorithms (pseudocode), see A Skip List Cookbook, by Pugh.
It may be that the "implicit key" binary search tree method outlined by @Petr above is easier to get to, and may even perform better.

Performance specification for handling duplicate keys in a binary search tree

I was going through the book Introduction to Algorithms looking for the best ways to handle duplicate keys in a binary search tree.
There are several ways mentioned for this use case:
Keep a boolean flag x.b at node x, and set x to either x.left or x.right based on the value of x.b, which alternates between FALSE and TRUE each time we visit x while inserting a node with the same key as x.
Keep a list of nodes with equal keys at x, and insert z into the list.
Randomly set x to either x.left or x.right.
I understand each implementation has its own performance hits/misses, and STL may implement it differently from Boost Containers.
Does the C++11 specification state a bound on the worst-case performance of handling duplicate keys, say for multimap?
In terms of insertion/deletion time 2 is always better because it wouldn't increase the size of the tree and wouldn't require elaborate structure changes when you insert or delete a duplicate.
Option 3 is space optimal if there is a small number of duplicates.
Option 1 will require storing 1 extra bit of information (which, in most implementations, takes at least one byte), but the height of the tree will be better than with option 3.
TL;DR: Implementing 2 is slightly difficult, but worthwhile if number of duplicates is large. Otherwise use 3. I wouldn't use 1.
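Option 2 can be approximated with standard containers by keeping one tree node per distinct key and a list of values under it. The sketch below assumes int keys and std::string values, and the name GroupedMultimap is made up for the example. It illustrates the idea only and says nothing about how std::multimap is implemented.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One tree node per distinct key; values with equal keys share a list.
class GroupedMultimap {
    std::map<int, std::vector<std::string>> groups_;

public:
    // Appends the value to the list stored under `key`.
    void insert(int key, const std::string& value) {
        groups_[key].push_back(value);  // creates the list on first use
    }

    // Number of values stored under `key`.
    std::size_t count(int key) const {
        auto it = groups_.find(key);
        return it == groups_.end() ? 0 : it->second.size();
    }

    // Number of tree nodes, i.e. distinct keys.
    std::size_t distinct_keys() const { return groups_.size(); }
};
```

Inserting a duplicate only appends to a vector, so the tree neither grows nor rebalances, which is the advantage claimed for option 2 above.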

How to improve linked list searching. C++

I have a simple method in C++ which searches for a string in a linked list. That works well but I need to make it faster. Is it possible? Maybe I need to insert items into the list in alphabetical order? But I don't think that would help with searching the list. There are about 300 000 items (words) in the list.
int GetItemPosition(const char* stringToFind)
{
    int i = 0;
    MyList* Tmp = FistListItem;
    while (Tmp) {
        if (!strcmp(Tmp->Value, stringToFind))
        {
            return i;
        }
        Tmp = Tmp->NextItem;
        i++;
    }
    return -1;
}
Method returns the position number if item found, otherwise returns -1.
Any suggestion will be helpful.
Thanks for answers, I can change structure. I have only one constraint. Code must implement the following interface:
int Count(void);
int AddItem(const char* StringValue, int WordOccurrence);
int GetItemPosition(const char* StringValue);
char* GetString(int Index);
int GetOccurrenceNum(int Index);
void SetInteger(int Index, int WordOccurrence);
So which structure would be the most suitable in your opinion?
Searching a linked list is linear: you need to iterate from the beginning one element at a time, so it is O(n). Linked lists are not the best choice if you need to search; you can use more suitable data structures such as binary trees.
Ordering the elements does not help much, because you still need to visit the elements one by one.
Wikipedia article says:
In an unordered list, one simple heuristic for decreasing average search time is the move-to-front heuristic, which simply moves an element to the beginning of the list once it is found. This scheme, handy for creating simple caches, ensures that the most recently used items are also the quickest to find again.
Another common approach is to "index" a linked list using a more efficient external data structure. For example, one can build a red-black tree or hash table whose elements are references to the linked list nodes. Multiple such indexes can be built on a single list. The disadvantage is that these indexes may need to be updated each time a node is added or removed (or at least, before that index is used again).
So in the first case you can slightly improve (by statistical assumptions) your search performance by moving items found previously closer to the beginning of the list. This assumes that previously found elements will be searched more frequently.
The second method requires additional data structures.
If using linked lists is not a hard requirement, consider using hash tables, sorted arrays (random access) or balanced trees.
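The move-to-front heuristic quoted above can be sketched with std::list, assuming the words are stored as std::string. The function name find_move_to_front is made up for the example.

```cpp
#include <list>
#include <string>

// Searches `words` linearly; on a hit, moves the found node to the front
// so that repeated lookups of the same word get cheaper.
// Returns true if the word was found.
bool find_move_to_front(std::list<std::string>& words, const std::string& target) {
    for (auto it = words.begin(); it != words.end(); ++it) {
        if (*it == target) {
            words.splice(words.begin(), words, it);  // relinks the node, no copy
            return true;
        }
    }
    return false;
}
```

Because this reorders the list, positions are not stable, so it does not combine well with an interface that hands out indexes such as GetItemPosition. The worst case also remains O(n).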
Consider using array or std::vector as a storage instead of linked list, and use binary search to find particular string, or even better, std::set, if you don't need a numerical index. If for some reasons it is not possible to use other containers, there is not much possible to do - you may want to speed up the process of comparison by storing hash of the string along with it in node.
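A minimal sketch of the sorted std::vector plus binary search suggestion, assuming the words are kept as std::string in ascending order. The function name get_item_position is an assumption that mirrors the question's GetItemPosition.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the index of `target` in `sorted_words`, or -1 if it is absent.
// Precondition: `sorted_words` is sorted in ascending order.
int get_item_position(const std::vector<std::string>& sorted_words,
                      const std::string& target) {
    auto it = std::lower_bound(sorted_words.begin(), sorted_words.end(), target);
    if (it == sorted_words.end() || *it != target) return -1;
    return static_cast<int>(it - sorted_words.begin());
}
```

A lookup needs O(log n) string comparisons, roughly 18 for 300 000 words. The cost moves to insertion, which has to shift elements to keep the vector sorted.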
I suggest hashing.
Since you've already got a linked list of your own, you can try chaining with linked lists for collision resolution.
Rather than using a linear linked list, you may want to use a binary search tree, or a red/black tree. These trees are designed to minimize the traversals needed to find an item.
You could also store "short cut links". For example, if the list is of strings, you could have an array of links of where to start searching based on the first letter.
For example, shortcut['B'] would return a pointer to the first link to start searching for strings starting with 'B'.
The answer is no, you cannot improve the search without changing your data-structure.
As it stands, sorting the list will not give you a faster search for any random item.
It will only allow you to quickly decide if the given item is in the list by testing against the first item (which will be either the smallest or the largest entry) and this improvement is not likely to make a big difference.
So can you please edit your question and explain to us your constraints?
Can you use a completely different data structure, like an array or a tree? (as others have suggested)
If not, can you modify the way your linked list is linked?
If not, we will be unlikely to help you...
The best option is to use faster data structure for storing strings:
std::map - red-black tree behind the scenes. Has O(logn) for search/insert/delete operations. Suitable if you want to store additional values with strings (for example - positions).
std::set - basically the same tree but without values. Best for case when you need only contains operation.
std::unordered_map - hash table. O(1) access on average.
std::unordered_set - hash set. Also O(1) access on average.
Note. But in all of these cases there is a catch. Complexity is calculated only based on n (count of strings). In reality string comparison is not free. So, O(1) becomes O(m), O(logn) becomes O(mlogn) (where m is maximal length of string). This does not matter in case of relatively short strings. But if this is not true consider using Trie. In practice trie can be even faster than hash table - each character of query string is accessed only once instead of multiple times. For hash table/set it's at least once for hash calculation and at least once for actual string comparison (depending on collision resolution strategy - not sure how it is implemented in C++).
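One way to satisfy the interface from the question is a std::vector for index-based access plus a std::unordered_map from word to position. This is a sketch under assumptions: the class name WordTable is made up, the meaning of AddItem's return value is not stated in the question (here it returns the word's position), and GetString returns const char* rather than char* so that callers cannot modify the stored text.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

class WordTable {
    struct Entry {
        std::string text;
        int occurrences;
    };
    std::vector<Entry> entries_;                        // position -> entry
    std::unordered_map<std::string, int> position_of_;  // word -> position

public:
    int Count() const { return static_cast<int>(entries_.size()); }

    // Adds a new word, or overwrites the count of an existing one.
    // Returns the position of the word (assumed meaning of the result).
    int AddItem(const char* StringValue, int WordOccurrence) {
        auto found = position_of_.find(StringValue);
        if (found != position_of_.end()) {
            entries_[found->second].occurrences = WordOccurrence;
            return found->second;
        }
        int pos = Count();
        entries_.push_back(Entry{StringValue, WordOccurrence});
        position_of_.emplace(StringValue, pos);
        return pos;
    }

    // Average O(1) hash lookup; returns -1 if the word is unknown.
    int GetItemPosition(const char* StringValue) const {
        auto found = position_of_.find(StringValue);
        return found == position_of_.end() ? -1 : found->second;
    }

    // The accessors below throw std::out_of_range for an invalid Index.
    const char* GetString(int Index) const { return entries_.at(Index).text.c_str(); }
    int GetOccurrenceNum(int Index) const { return entries_.at(Index).occurrences; }
    void SetInteger(int Index, int WordOccurrence) {
        entries_.at(Index).occurrences = WordOccurrence;
    }
};
```

Positions stay stable because words are only appended, so the map never needs to be rewritten. Supporting deletion would require updating the stored positions.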

Implementation of a locally ordered set or priority queue?

I have a rather large set of objects that represent numbers and I want to select such numbers according to a custom ordering. This ordering includes several criteria such as the type of their representation (some numbers are represented by an interval), their integrality and ultimately their value. These numbers are shared throughout the program (shared pointers) and there is nothing I can do about this.
However, the elements' properties can change at any time such that the order changes while I can't notify the container about this. For example, some operations require a refinement of a number that is represented by an interval and during this refinement, the exact value can be found. Thereby, the number changes from the interval representation to a rational number, possibly even an integer. This change, due to the shared instance, immediately propagates to the number in the container and breaks the ordering (and I don't even know which number changed). This totally breaks std::set.
So what I'd like to have is a container that tries to be sorted, but does not rely on this. Whenever an operation detects an incorrect ordering, this ordering should be corrected locally. For example insert would insert the element (using binary search) and always check if the ordering of the current element (w.r.t. the neighbors) is correct.
I'd be willing to accept that "give me the smallest element" would then be only "give me a small element" and that find or remove would have linear complexity: I only need front, insert and remove_front to be particularly efficient.
Is there any implementation that does something like this?
How would you implement this?
If you are looking for an algorithm in the standard library, you should take a look at:
std::make_heap
std::pop_heap
std::push_heap
In <algorithm>. They might fit your need, and even if they don't I'm quite sure you will find what you are looking for in some kind of heap structure. Which one will probably depend on how your code is structured, and how often you expect a value to change etc.
In short:
A heap is a data structure in which it is fast to find and extract the smallest (or largest) element. For most heaps it is also possible to restructure the heap in linear time or better. You could start out from this page on Wikipedia if you want to learn more about heaps.
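A minimal sketch of how the three <algorithm> functions fit together, assuming plain int elements. The class name MinHeap and its members are made up for the example, and raw() exists only to simulate a value changing behind the container's back.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Min-heap over ints built on the <algorithm> heap functions.
// std::greater makes the smallest element the heap's front.
class MinHeap {
    std::vector<int> data_;

public:
    void insert(int value) {
        data_.push_back(value);
        std::push_heap(data_.begin(), data_.end(), std::greater<int>());
    }

    int front() const { return data_.front(); }  // precondition: !empty()

    void remove_front() {                        // precondition: !empty()
        std::pop_heap(data_.begin(), data_.end(), std::greater<int>());
        data_.pop_back();
    }

    // Rebuilds the heap in linear time, e.g. after values changed externally.
    void restore() {
        std::make_heap(data_.begin(), data_.end(), std::greater<int>());
    }

    bool empty() const { return data_.empty(); }

    // Exposed only to simulate outside modification of stored values.
    std::vector<int>& raw() { return data_; }
};
```

insert and remove_front are O(log n) and front is O(1). If elements may change without notice, calling restore() before reading the front re-establishes the heap property, at O(n) per call.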

How can we benefit from vs2010 hash_map's less?

See this if you don't know that the vs2010 hash_map actually requires a total ordering, and hence requires a user-defined less.
One of the answers said it can be used for binary search, but I don't think so, because:
The hash function should be uniform, and it is better that load factor less than 1, it means, in most case, one element per hash slot. i.e. no need binary search.
Obviously, it will slow down insertion because of locating the appropriate position.
How does hash-map benefit from this design? and how do we utilize this design?
thanks
The hash function should be uniform, and it is better that load factor less than 1, it means, in most case, one element per hash slot. i.e. no need binary search.
There won't be at most one element per hash slot. Some buckets will have to keep more than one key. Unless the input is only from a pre-determined restricted set of values (i.e. perfect hashing), the hash function will have to deal with more inputs than the outputs that it can produce. There will be collisions; this is unavoidable in an implementation as generic as this one. However, good hash functions should produce well-distributed hashes and that makes the number of elements per hash slot stay low.
Obviously, it will slow down insertion because of locating the appropriate position.
Assuming a good hash function and non-degenerate input (degenerate input being input designed so that many elements result in the same hash), there will always be only a few keys per bucket. Inserting into such a binary search tree won't be that big of a cost, and that little cost may bring benefits elsewhere (searches may be faster than on an implementation with a linked list). And in case of degenerate input, the hash map will degenerate into a binary search tree, which is much better than a simple linked list.
Your question is largely irrelevant in practice, because C++ now supplies unordered_map etc. which use an Equal predicate rather than a less-than comparator.
However, consider a hash_map<string, ...>. Clearly, the value space of string is larger than that of size_t, so for any hash function there will be values that have the same hash and so are placed in the same bucket. In the pathological situation where all the items in the hash table are placed in the same bucket, exploiting ordering among keys will result in improved speed of access, insertion and removal.
Note that search on an ordered sequence with random access (or on a balanced binary tree) is O(log n) as opposed to O(n).
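The benefit of ordered buckets can be illustrated with a toy hash set whose buckets are std::set instances. This is a sketch of the idea only, not the vs2010 hash_map implementation: OrderedBucketSet and ConstantHash are names made up for the example.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Toy hash set whose buckets are ordered (std::set uses operator< here).
// `Hash` is a template parameter so a deliberately bad hash can be plugged in.
template <class Hash = std::hash<std::string>>
class OrderedBucketSet {
    std::vector<std::set<std::string>> buckets_;
    Hash hash_;

public:
    // Precondition: bucket_count > 0.
    explicit OrderedBucketSet(std::size_t bucket_count = 16) : buckets_(bucket_count) {}

    // Returns true if the key was newly inserted.
    bool insert(const std::string& key) {
        return buckets_[hash_(key) % buckets_.size()].insert(key).second;
    }

    // Lookup inside a bucket is O(log k) for k keys in that bucket.
    bool contains(const std::string& key) const {
        const std::set<std::string>& bucket = buckets_[hash_(key) % buckets_.size()];
        return bucket.find(key) != bucket.end();
    }

    // Size of the fullest bucket, to observe how keys are distributed.
    std::size_t largest_bucket() const {
        std::size_t largest = 0;
        for (const auto& bucket : buckets_)
            if (bucket.size() > largest) largest = bucket.size();
        return largest;
    }
};

// Degenerate hash used for the pathological case: every key collides.
struct ConstantHash {
    std::size_t operator()(const std::string&) const { return 0; }
};
```

With ConstantHash every key lands in one bucket, yet contains still costs O(log n) comparisons instead of the O(n) of a linked-list bucket. With a well-distributed hash the buckets stay small and the ordering makes little difference.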