Instant sort when putting a new value into an array in C++

I have a dynamically allocated array containing structs with a key-value pair. I need to write an update(key, value) function that puts a new struct into the array, or, if a struct with the same key is already in the array, updates its value. Insert and update are combined in one function.
The problem is:
Before adding a struct, I need to check whether a struct with this key already exists.
I can go through all elements of the array and compare keys (very slow).
Or I can use binary search, but (!) the array must be sorted.
So I tried sorting the array on each update (slow), or sorting it when calling the binary search function... which happens on every update anyway.
Finally, I figured there must be a way of inserting a struct into the array so that it lands in the right place and the array stays sorted.
However, I couldn't come up with an algorithm like that, so I came here to ask for help, because Google refuses to read my mind.
I need to make my code faster because my array accepts more than 50,000 structs and I'm using bubble sort (because I'm dumb).

Take a look at red-black trees: http://en.wikipedia.org/wiki/Red%E2%80%93black_tree
They keep the data sorted at all times, and inserts are O(log n).
A binary heap will not suffice: a heap does not guarantee a full sort order; the only guarantee is that the top element is the minimum (or maximum).

One possible approach is to use a different data structure. There is no genuine need to keep the structs ordered; you only need to detect whether a struct with the same key exists, so the cost of maintaining order in a balanced tree (for instance via std::map) is excessive. A more suitable data structure is a hash table. C++11 provides one in the standard library under the somewhat obscure name std::unordered_map (http://en.cppreference.com/w/cpp/container/unordered_map).
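With a hash table, the whole insert-or-update collapses into a single operation. A minimal sketch, assuming int keys and values (the question does not show the actual struct layout):

#include <unordered_map>

std::unordered_map<int, int> table;   // key -> value; int key/value are assumptions

void update(int key, int value) {
    // operator[] inserts a default-constructed value if the key is absent,
    // then the assignment sets or overwrites it.
    table[key] = value;
}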
If you insist on using an array, a possible approach might be to combine these algorithms:
Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter)
Partial sort (http://en.cppreference.com/w/cpp/algorithm/partial_sort)
Binary search
Maintain two ranges in the array: first a range that is already sorted, then a range that is not yet sorted. When you insert a struct, first check with the Bloom filter whether a matching struct might already exist. If the Bloom filter gives a negative answer, just insert the struct at the end of the array. After that, the sorted range is unchanged and the unsorted range grows by one.
If the Bloom filter gives a positive answer, apply the partial sort algorithm to make the entire array sorted, then use binary search to check whether such an object actually exists. If so, replace that element. After that, the sorted range is the entire array and the unsorted range is empty.
If the binary search shows that the Bloom filter was wrong and the matching struct is not there, just put the new struct at the end of the array. After that, the sorted range is the entire array minus one element, and the unsorted range is the last element.
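A rough sketch of that scheme, assuming int keys, a single hash function for a (crude) Bloom filter, and std::sort plus std::inplace_merge standing in for the partial sort step:

#include <algorithm>
#include <bitset>
#include <functional>
#include <vector>

struct Rec { int key; int value; };

struct FilteredArray {
    std::vector<Rec> data;
    std::size_t sorted = 0;                 // data[0..sorted) is sorted by key
    std::bitset<1 << 20> bloom;             // crude one-hash Bloom filter

    std::size_t slot(int key) const { return std::hash<int>()(key) % bloom.size(); }

    void update(int key, int value) {
        if (!bloom.test(slot(key))) {       // definitely absent: append to tail
            bloom.set(slot(key));
            data.push_back({key, value});
            return;
        }
        auto byKey = [](const Rec& a, const Rec& b) { return a.key < b.key; };
        std::sort(data.begin() + sorted, data.end(), byKey);           // sort the tail
        std::inplace_merge(data.begin(), data.begin() + sorted, data.end(), byKey);
        sorted = data.size();               // the whole array is now sorted
        auto it = std::lower_bound(data.begin(), data.end(), key,
                                   [](const Rec& r, int k) { return r.key < k; });
        if (it != data.end() && it->key == key)
            it->value = value;              // the match really exists: update it
        else
            data.push_back({key, value});   // false positive: append to the tail
    }
};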

Each time you insert an element, binary search to find whether it already exists. If it doesn't, the binary search gives you the index at which to insert it, so the array stays sorted.
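A minimal sketch of that idea, assuming the structs live in a std::vector kept sorted by an int key, using std::lower_bound:

#include <algorithm>
#include <vector>

struct Item { int key; int value; };

void update(std::vector<Item>& items, int key, int value) {
    auto it = std::lower_bound(items.begin(), items.end(), key,
                               [](const Item& a, int k) { return a.key < k; });
    if (it != items.end() && it->key == key)
        it->value = value;                   // key exists: update in place
    else
        items.insert(it, Item{key, value});  // absent: insert at the sorted spot
}

The search is O(log n), though inserting into an array still shifts the tail, so each insert is O(n) in the worst case; that is still far cheaper than re-sorting on every update.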

You could use std::set, which does not allow duplicate elements and places elements in sorted position. This assumes that you are storing the key and value in a struct, and not separately. In order for the sorting to work properly, you will need to define a comparison function for the structs.
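A sketch of that approach, assuming an int key and C++17 (for the structured bindings); the comparator looks only at the key, so two structs with equal keys count as the same element:

#include <set>

struct Item { int key; int value; };

struct ByKey {
    bool operator()(const Item& a, const Item& b) const { return a.key < b.key; }
};

std::set<Item, ByKey> items;

void update(int key, int value) {
    auto [it, inserted] = items.insert(Item{key, value});
    if (!inserted) {
        // Same key already present. Set elements are const, so erase and
        // re-insert rather than mutating the stored value in place.
        items.erase(it);
        items.insert(Item{key, value});
    }
}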

Related

Count of previously smaller elements encountered in an input stream of integers?

Given an input stream of numbers ranging from 1 to 10^5 (non-repeating) we need to be able to tell at each point how many numbers smaller than this have been previously encountered.
I tried using a std::set to maintain the elements already encountered, taking upper_bound on the set for the current number. But upper_bound gives me an iterator to the element, and then I again have to iterate through the set or use std::distance, which is linear in time.
Can I maintain some other data structure or follow some other algorithm in order to achieve this task more efficiently?
EDIT: Found an older question related to Fenwick trees that is helpful here. By the way, I have now solved this problem using segment trees, taking hints from @doynax's comment.
How to use Binary Indexed tree to count the number of elements that is smaller than the value at index?
Regardless of the container you use, it is a very good idea to keep the elements as a sorted set, so that at any point you can get an element's index (or iterator) and thus know how many elements come before it.
You need to implement your own binary search tree algorithm. Each node should store counters with the total number of nodes in its left and right subtrees.
Insertion into the binary tree takes O(log n). During the insertion, the counters of all ancestors of the new element are updated, also O(log n).
The number of elements smaller than the new element can be derived from the stored counters in O(log n).
So the total running time is O(n log n).
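Since the question's edit mentions Fenwick trees: a minimal sketch of that alternative to hand-rolling an augmented BST, assuming the stated value range of 1 to 10^5. For each incoming x, the prefix sum over [1, x-1] counts the previously seen smaller numbers:

#include <cstdio>
#include <vector>

struct Fenwick {
    std::vector<int> tree;
    explicit Fenwick(int n) : tree(n + 1, 0) {}
    void add(int i) {                        // mark value i as seen
        for (; i < (int)tree.size(); i += i & -i) tree[i]++;
    }
    int prefix(int i) const {                // count of seen values <= i
        int s = 0;
        for (; i > 0; i -= i & -i) s += tree[i];
        return s;
    }
};

int main() {
    Fenwick fw(100000);
    int x;
    while (std::scanf("%d", &x) == 1) {
        std::printf("%d\n", fw.prefix(x - 1));  // smaller numbers seen so far
        fw.add(x);
    }
}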
Keep your table sorted at each step and use binary search. When you search for the number just given by the input stream, the binary search finds either the next greater or the next smaller number; from that comparison you get the index where the current input belongs, and that index is the count of numbers less than it. The binary search itself is O(log n), but keeping an array sorted means shifting elements on every insert, so the whole algorithm takes O(n^2) time.
What if you used insertion sort to store each number into a linked list? Then you can count the number of elements less than the new one when finding where to put it in the list.
It depends on whether you want to use std or not. In certain situations, some parts of std are inefficient. (For example, std::vector can be considered inefficient in some cases due to the amount of dynamic allocation that occurs.) It's a case-by-case type of thing.
One possible solution here might be to use a skip list (a relative of linked lists), as it is easier and more efficient to insert an element into a skip list than into an array.
The skip list structure is what makes binary-search-style insertion of each new element possible (one cannot binary search a normal linked list). If you track the length with an accumulator, returning the number of larger elements is as simple as length - index.
One more consideration: std::set::insert() is already O(log n) without a hint, so whether a custom skip list is worth the effort is an open question.

How to improve linked list searching in C++

I have a simple method in C++ which searches for a string in a linked list. It works well, but I need to make it faster. Is it possible? Maybe I need to insert the items into the list in alphabetical order? But I don't think that alone could help with searching the list. The list holds about 300,000 items (words).
int GetItemPosition(const char* stringToFind)
{
    int i = 0;
    MyList* Tmp = FistListItem;
    while (Tmp) {
        if (!strcmp(Tmp->Value, stringToFind))
        {
            return i;
        }
        Tmp = Tmp->NextItem;
        i++;
    }
    return -1;
}
The method returns the position number if the item is found; otherwise it returns -1.
Any suggestion will be helpful.
Thanks for the answers. I can change the structure; I have only one constraint: the code must implement the following interface:
int Count(void);
int AddItem(const char* StringValue, int WordOccurrence);
int GetItemPosition(const char* StringValue);
char* GetString(int Index);
int GetOccurrenceNum(int Index);
void SetInteger(int Index, int WordOccurrence);
So which structure would, in your opinion, be the most suitable?
Searching a linked list is linear, so you need to iterate from the beginning one element at a time; it is O(n). Linked lists are not the best choice for searching; you can utilize more suitable data structures, such as binary trees.
Ordering the elements does not help much, because you still need to iterate over the elements anyway.
The Wikipedia article says:
In an unordered list, one simple heuristic for decreasing average search time is the move-to-front heuristic, which simply moves an element to the beginning of the list once it is found. This scheme, handy for creating simple caches, ensures that the most recently used items are also the quickest to find again.
Another common approach is to "index" a linked list using a more efficient external data structure. For example, one can build a red-black tree or hash table whose elements are references to the linked list nodes. Multiple such indexes can be built on a single list. The disadvantage is that these indexes may need to be updated each time a node is added or removed (or at least, before that index is used again).
So in the first case you can slightly improve (under statistical assumptions) your search performance by moving previously found items closer to the beginning of the list. This assumes that previously found elements will be searched for more frequently.
The second method requires the use of other data structures.
If using linked lists is not a hard requirement, consider using hash tables, sorted arrays (random access) or balanced trees.
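For instance, since the interface above needs both index-based access and lookup by string, one possible arrangement (a sketch, assuming C++17; all names besides the interface functions are invented) pairs a std::vector for the index-based calls with a std::unordered_map for the by-string lookup:

#include <string>
#include <unordered_map>
#include <vector>

class WordTable {
    std::vector<std::pair<std::string, int>> items;   // (word, occurrence), by index
    std::unordered_map<std::string, int> positions;   // word -> index in items

public:
    int Count(void) { return (int)items.size(); }

    // Returns the word's index; if the word is new, appends it. Whether a
    // repeated AddItem should update the occurrence is an open design choice.
    int AddItem(const char* StringValue, int WordOccurrence) {
        auto [it, inserted] = positions.try_emplace(StringValue, (int)items.size());
        if (inserted) items.emplace_back(StringValue, WordOccurrence);
        return it->second;
    }

    int GetItemPosition(const char* StringValue) {    // O(1) on average now
        auto it = positions.find(StringValue);
        return it == positions.end() ? -1 : it->second;
    }

    char* GetString(int Index) { return items[Index].first.data(); }
    int GetOccurrenceNum(int Index) { return items[Index].second; }
    void SetInteger(int Index, int WordOccurrence) { items[Index].second = WordOccurrence; }
};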
Consider using an array or std::vector as storage instead of a linked list, and use binary search to find a particular string, or, even better, std::set if you don't need a numerical index. If for some reason other containers are not possible, there is not much you can do; you may want to speed up the comparisons by storing a hash of the string along with it in each node.
I suggest hashing.
Since you've already got a linked list of your own, you can try chaining with linked lists for collision resolution.
Rather than using a linear linked list, you may want to use a binary search tree or a red-black tree. These trees are designed to minimize the number of traversals needed to find an item.
You could also store "shortcut links". For example, if the list is of strings, you could have an array of links telling you where to start searching, based on the first letter.
For example, shortcut['B'] would return a pointer to the first link to start searching for strings starting with 'B'.
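A hypothetical sketch of such a shortcut table, reusing the question's MyList type and assuming the list is kept sorted alphabetically (the table must be rebuilt whenever the list changes):

MyList* shortcut[256] = {};                    // shortcut[c] = first node whose
                                               // string starts with character c
void BuildShortcuts(MyList* head) {
    for (MyList* p = head; p; p = p->NextItem) {
        unsigned char c = (unsigned char)p->Value[0];
        if (!shortcut[c]) shortcut[c] = p;     // remember the first node per letter
    }
}

// A search for "Banana" would then start at shortcut['B'] instead of the head.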
The answer is no: you cannot improve the search without changing your data structure.
As it stands, sorting the list will not give you a faster search for any random item.
It will only let you quickly decide whether a given item falls outside the list's range, by testing it against the first item (which will be either the smallest or the largest entry), and that improvement is not likely to make a big difference.
So can you please edit your question and explain to us your constraints?
Can you use a completely different data structure, like an array or a tree? (as others have suggested)
If not, can you modify the way your linked list is linked?
If not, we will be unlikely to help you...
The best option is to use a faster data structure for storing the strings:
std::map - a red-black tree behind the scenes. O(log n) search/insert/delete. Suitable if you want to store additional values with the strings (for example, positions).
std::set - basically the same tree, but without values. Best when you only need a contains operation.
std::unordered_map - a hash table. O(1) expected access.
std::unordered_set - a hash set. Also O(1) expected access.
Note: in all of these cases there is a catch. The complexity is calculated only in terms of n (the number of strings). In reality string comparison is not free, so O(1) becomes O(m) and O(log n) becomes O(m log n), where m is the maximal string length. This does not matter for relatively short strings; if it does, consider using a trie. In practice a trie can be even faster than a hash table: each character of the query string is accessed only once, instead of multiple times. For a hash table/set it is at least once for the hash calculation and at least once for the actual string comparison (depending on the collision resolution strategy; not sure how it is implemented in C++).
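A minimal trie sketch, assuming lowercase ASCII words ('a' to 'z') and C++11 or later; each lookup touches every character of the query once, regardless of how many words are stored:

#include <array>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> next;
    int index = -1;                        // position of the word, -1 if none ends here
};

struct Trie {
    TrieNode root;

    void Insert(const std::string& word, int index) {
        TrieNode* n = &root;
        for (char c : word) {
            auto& child = n->next[c - 'a'];
            if (!child) child.reset(new TrieNode);  // create the path lazily
            n = child.get();
        }
        n->index = index;
    }

    int Find(const std::string& word) const {       // -1 if the word is absent
        const TrieNode* n = &root;
        for (char c : word) {
            n = n->next[c - 'a'].get();
            if (!n) return -1;
        }
        return n->index;
    }
};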

Efficient frequency counter

I have 15,000,000 std::vectors of 6 integers each.
Those 15M vectors contain duplicates.
Duplicate example:
(4,3,2,0,4,23)
(4,3,2,0,4,23)
I need to obtain a list of unique sequence with their associated count. (A sequence that is only present once would have a 1 count)
Is there an algorithm in standard C++ (C++11 is fine) that does this in one shot?
Windows, 4GB RAM, 30+GB hdd
There is no such algorithm in the standard library which does exactly this; however, it's very easy with a single loop and the proper choice of data structure.
For this you want to use std::unordered_map, which is typically a hash map. It has expected constant time per access (insert and look-up) and is thus the first choice for huge data sets.
The following access-and-increment trick will automatically insert a new entry in the counter map if it's not yet there; then it will increment and write back the count. Note that the standard library provides no std::hash specialization for std::vector<int>, so a small hash functor has to be supplied:
typedef std::vector<int> VectorType;   // Please consider std::array<int, 6>!

// std::hash has no specialization for std::vector<int>, so supply one:
struct VectorHash {
    std::size_t operator()(const VectorType& v) const {
        std::size_t seed = v.size();
        for (int x : v)                // boost::hash_combine-style mixing
            seed ^= std::hash<int>()(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

std::unordered_map<VectorType, int, VectorHash> counters;
for (const VectorType& vec : vectors) {
    counters[vec]++;
}
For further processing, you most probably want to sort the entries by the number of occurrences. For this, either write them out into a vector of pairs (which encapsulates the number vector and the occurrence count), or into an (ordered) map with key and value swapped, so it is automatically ordered by the counter.
In order to reduce the memory footprint of this solution, try this:
If you don't need to get the keys back from this hash map, you can use a hash map which doesn't store the keys but only their hashes. For this, use size_t as the key type, an identity function as the internal hash, and access it with a manual call to the vector hash functor defined above.
// There is no std::identity hash in C++11, so a pass-through functor is used:
struct IdentityHash {
    std::size_t operator()(std::size_t h) const { return h; }
};

std::unordered_map<std::size_t, int, IdentityHash> counters;
VectorHash hashFunc;                   // the functor defined above
for (const VectorType& vec : vectors) {
    counters[hashFunc(vec)]++;
}
This reduces memory but requires additional effort to interpret the results, as you have to loop over the original data structure a second time in order to find the original vectors (and then look them up in your hash map by hashing them again).
Yes: first std::sort the list (std::vector compares lexicographically; the first element is the most significant), then loop with std::adjacent_find to find duplicates. When a duplicate is found, use std::adjacent_find again, but with an inverted comparator, to find the first non-duplicate.
Alternately, you could use std::unique with a custom comparator that flags when a duplicate is found, and maintains a count through the successive calls. This also gives you a deduplicated list.
The advantage of these approaches over std::unordered_map is space complexity proportional to the number of duplicates. You don't have to copy the entire original dataset or add a seldom-used field for dup-count.
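One way to realize the sort-then-scan idea (a sketch; it sorts the data in place and emits one (sequence, count) pair per unique sequence):

#include <algorithm>
#include <utility>
#include <vector>

typedef std::vector<int> VectorType;

std::vector<std::pair<VectorType, int>> CountUnique(std::vector<VectorType>& vectors) {
    std::vector<std::pair<VectorType, int>> result;
    std::sort(vectors.begin(), vectors.end());      // equal vectors become adjacent
    for (auto it = vectors.begin(); it != vectors.end(); ) {
        auto run = std::find_if(it, vectors.end(),
                                [&](const VectorType& v) { return v != *it; });
        result.emplace_back(*it, (int)(run - it));  // run length = occurrence count
        it = run;
    }
    return result;
}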
You could convert each vector to a string, element by element, like "4,3,2,0,4,23".
Then add the strings to a new string vector, checking for existence with find().
If you need the original vectors, convert the string vector back to integer sequence vectors.
If you do not, delete the duplicated elements while building the string vector.

Data structure that returns index and stores count of strings in c++

I am building an xlsx builder, and I have a series of strings to save in a spreadsheet (xml file). There may be duplication, so I want to store the strings in a map and increment their counts. Then instead of storing the strings I can store the index they are at in the map, and store the strings in another xml file. But retrieving the index of a given string is O(n) with std::map. Is there a data structure that can accomplish this faster?
Unless your "separate file" needs to be in lexicographic order don't use the index in the map, store the index explicitly.
So for example a map<string, gubbins>, with struct gubbins { size_t count; size_t index; }.
Whenever you insert a new key to the map, give its index the "next" value of an incrementing counter.
The range of index values used is contiguous, unless you later come along and decrement the count, removing entries from the map when it hits zero. In that case you can "defragment" the indexes, but of course not if you've already used the indexes to identify the strings elsewhere.
The operation to write the strings file requires sorting by index first. You can do that in linear time -- create a big enough array and then run through the map, storing each string at the correct index. Or you can build the strings file as you go, adding each string when it's added to the map.
It's probably possible to do the whole thing with the right boost::multi_index container.
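A sketch of the explicit-index scheme described above, assuming C++17 (for try_emplace and structured bindings); the names follow the answer's "gubbins" example:

#include <map>
#include <string>

struct gubbins { std::size_t count; std::size_t index; };

class StringTable {
    std::map<std::string, gubbins> table;
    std::size_t next = 0;                 // the incrementing index counter

public:
    // Returns the stable index of s, inserting it with the "next" index if new.
    std::size_t Add(const std::string& s) {
        auto [it, inserted] = table.try_emplace(s, gubbins{0, next});
        if (inserted) ++next;
        ++it->second.count;               // bump the occurrence count
        return it->second.index;
    }
};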
If you need to store the strings in sorted order, you might want to look into the order statistic tree data structure, which is a balanced binary search tree augmented with extra information that makes it possible to determine the nth element in the tree efficiently (in O(log n) time). This gives you all of the original functionality of the std::map, plus random access.
There isn't a standard implementation of order statistic trees in the C++ standard libraries, but a quick Google search should turn some up.
Hope this helps!

Merging two arrays of a class

I have two arrays of class Record. Class Record is defined like this:
class Record {
    char* string;   // the word itself
    int count;      // frequency with which the word appears
};
And these are the two arrays (already initialized):
Record* recordarray1 = new Record[9000000]; // contains 9,000,000 unsorted Records
Record* recordarray2 = new Record[8000000]; // contains 8,000,000 unsorted Records
The purpose is to find strings that match between the two arrays and add them to a new array with their counts added together; a string present in only one array is simply added to the new array. To do this I tried sorting the two arrays first (alphabetically by string), then walking them together: if the strings match, advance recordarray2's index; otherwise advance recordarray1's index until a match is found, and if none is found, add the record to the new array.
Unfortunately this method is WAY too slow; the sorting alone takes 20+ seconds with STL sort. Is there a quicker standard method of sorting that I'm missing?
If I've understood correctly, your algorithm should take O(n log n + m log m [sorting both arrays] + n + m [walking through the arrays and comparing]).
It may not be much of an optimization, but you could try to sort just one of the arrays and use binary search to check whether the elements of the other array are present or not. It should then take O(n [to copy one array as the new array] + n log n [to sort it] + m log n [to binary search the elements of the second array in the sorted new one]).
HTH
Sorting the objects might be expensive, so I would try to avoid it.
One faster way might be to create an index for each array using a hash map (std::unordered_map in C++11; the old nonstandard hash_map), with the string as the hash key and the array index as the value. You get two containers that can be iterated over at the same time. The iterator of the lesser element is advanced until you find a match or the other iterator points to a lesser value. This gives you a predictable iteration count.
A possible solution is to use unordered_map. The algorithm would be as follows:
Put the first array into the map, using the strings as keys and the counts as values.
For each record in the second array, check whether its string is contained in the map.
If it is there:
    Put the record into the new array, combining the counts.
    Remove the entry from the map.
Else:
    Put the record into the new array.
Iterate through the remaining records in the map and put them into the new array.
The complexity of this algorithm is approximately O(n + m).
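A sketch of that map-based merge, assuming Record's members are accessible (e.g. declared public) and that string points to a NUL-terminated word; every name other than Record is invented for illustration:

#include <string>
#include <unordered_map>
#include <vector>

std::vector<Record> Merge(Record* arr1, std::size_t n, Record* arr2, std::size_t m) {
    std::unordered_map<std::string, Record*> pending;   // word -> record in arr1
    pending.reserve(n);
    for (std::size_t i = 0; i < n; ++i)                 // 1. first array into the map
        pending[arr1[i].string] = &arr1[i];

    std::vector<Record> result;
    result.reserve(n + m);
    for (std::size_t i = 0; i < m; ++i) {               // 2. check each arr2 record
        Record r = arr2[i];
        auto it = pending.find(r.string);
        if (it != pending.end()) {                      //    match: combine counts
            r.count += it->second->count;
            pending.erase(it);                          //    and drop it from the map
        }
        result.push_back(r);
    }
    for (auto& kv : pending)                            // 3. records only in arr1
        result.push_back(*kv.second);
    return result;
}

Note that step 3 emits the leftover records in unspecified (hash) order; sort the result afterwards if the output order matters.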
I feel that sorting is not needed. You can use the following algorithm:
1. For each element of recordarray1: put it into the new array.
2. Search for its string among the elements of recordarray2. If the element is found, increment the count in the new array; also set recordarray2[N].count to a negative value so that it will not be checked again in step 3.
3. Put all the elements from recordarray2 which don't have their count set to a negative value into the new array. If a negative count is encountered, simply change it back to positive.
Note: this algorithm doesn't handle the case where similar string elements appear within the same array. Also, don't use string as a variable name, since it is also a type name, as in std::string.