Merging two arrays of a class - C++

I have two arrays of class Record. Class Record is defined like this:
class Record {
    char* string; // the word
    int count;    // frequency the word appears with
};
And these are the two arrays, already declared and initialized:
Record* recordarray1 = new Record[9000000]; // contains 9,000,000 unsorted Records
Record* recordarray2 = new Record[8000000]; // contains 8,000,000 unsorted Records
The purpose is to find strings that match between the two arrays and add them to a new array with their counts summed; if a string appears in only one array, it is simply copied to the new array. To do this I tried sorting both arrays first (alphabetically by string) and then walking them in parallel: if the strings match, combine the records and advance both indices; otherwise advance the index of the array whose current string compares smaller, copying its record to the new array.
Unfortunately this method is WAY too slow: the sorting alone takes 20+ seconds with STL sort. Is there a quicker standard method of sorting that I'm missing?

If I've understood correctly, your algorithm should take O(n log n + m log m) to sort both arrays, plus O(n + m) to walk through them and compare.
It may not be much of an optimization, but you could sort just one of the arrays and use binary search to check whether each element of the other array is present. Then it should take O(n) to copy one array as the new array, O(n log n) to sort it, and O(m log n) to binary search the elements of the second array in the sorted new one.
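A minimal sketch of that idea, assuming Record's members are accessible (e.g. it is a struct) and that n1/n2 hold the array sizes:

#include <algorithm>
#include <cstring>

bool byString(const Record& a, const Record& b) {
    return std::strcmp(a.string, b.string) < 0;
}

// Sort one array, then binary-search it for every record of the other.
std::sort(recordarray1, recordarray1 + n1, byString);
for (std::size_t i = 0; i < n2; ++i) {
    Record* hit = std::lower_bound(recordarray1, recordarray1 + n1,
                                   recordarray2[i], byString);
    if (hit != recordarray1 + n1 &&
        std::strcmp(hit->string, recordarray2[i].string) == 0) {
        // match: combine the counts here
    } else {
        // no match: copy recordarray2[i] into the new array
    }
}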
HTH

Sorting the objects might be expensive, so I would try to avoid it.
One faster way might be to create an index for each array using a std::hash_map, with the string as the hash key and the array index as the value. You get two containers that can be iterated over at the same time: the iterator pointing at the lesser value is advanced until you find a match or the other iterator points at a lesser value. This gives you a predictable iteration count.

A possible solution is to use unordered_map. The algorithm would be as follows:
1. Put the first array into the map, using the strings as keys and the counts as values.
2. For each record in the second array, check whether its string is contained in the map.
3. If it is there, put the record into the new array with the counts combined, and remove the entry from the map.
4. Otherwise, put the record into the new array as-is.
5. Finally, iterate through the remaining entries in the map and put them into the new array.
The complexity of this algorithm is approximately O(n + m).
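A minimal sketch of those steps (std::string keys are used for brevity; n1, n2, and the merged vector are assumptions, not names from the question):

#include <string>
#include <unordered_map>
#include <vector>

std::vector<Record> merged;
std::unordered_map<std::string, int> counts;

// Step 1: load the first array into the map.
for (std::size_t i = 0; i < n1; ++i)
    counts[recordarray1[i].string] += recordarray1[i].count;

// Steps 2-4: match each record of the second array against the map.
for (std::size_t i = 0; i < n2; ++i) {
    Record r = recordarray2[i];
    auto it = counts.find(r.string);
    if (it != counts.end()) { // string matched: combine counts, drop the entry
        r.count += it->second;
        counts.erase(it);
    }
    merged.push_back(r);
}

// Step 5: everything still in the map appeared only in the first array,
// so walk `counts` and append a record for each remaining entry.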

I feel that sorting is not needed. You can use the following algorithm (a sketch follows after the note below):
1. Start with the first element of recordarray1 and put it into the new array.
2. Search for its string among the elements of recordarray2.
3. If the element is found, increment the count in the new array. Also set that recordarray2 element's count to a negative value, so that it will not be matched again in step 3. (Repeat steps 1-3 for every element of recordarray1.)
4. Put all the elements from recordarray2 that do not have a negative count into the new array. If a negative count is encountered, simply change it back to positive.
Note: this algorithm doesn't handle duplicate strings within the same array. Also, don't use string as a variable name, since it is also a type name (std::string).
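A rough sketch of that scheme; it is quadratic in the worst case, and n1, n2, and the merged vector are assumptions:

#include <cstring>
#include <vector>

std::vector<Record> merged;

for (std::size_t i = 0; i < n1; ++i) {
    Record r = recordarray1[i];
    for (std::size_t j = 0; j < n2; ++j) { // linear search (steps 2-3)
        if (recordarray2[j].count > 0 &&
            std::strcmp(r.string, recordarray2[j].string) == 0) {
            r.count += recordarray2[j].count;
            recordarray2[j].count = -recordarray2[j].count; // mark as merged
            break;
        }
    }
    merged.push_back(r);
}

for (std::size_t j = 0; j < n2; ++j) { // step 4
    if (recordarray2[j].count < 0)
        recordarray2[j].count = -recordarray2[j].count; // matched: just restore
    else
        merged.push_back(recordarray2[j]);              // unmatched: keep it
}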

Related

Is there any faster way to insert elements into a std::vector

I'm trying to insert a string in the following vector: std::vector<std::string> fileVec.
The fileVec already has many elements in it (up to 1 million strings) before I call these lines:
int index = 5;
// there is some code here to find the index where I want to insert the text (say it turns out to be 5)
fileVec.insert(fileVec.begin() + index, "add this text");
The problem I have is that it takes a lot of time to insert the text (especially if index is a small number).
Is there any faster way to add elements to a big vector (without deleting other elements)?
fileVec.insert will not be called many times: around 15 times in total.
std::vector is not designed for frequent insertion of elements in the middle, especially if it is very large (one million elements is huge). Consider using std::list, a doubly linked list where inserting elements in the middle is very fast, because all you have to do is change a few pointers. In a std::vector, all the elements after the insertion point have to be moved over, which of course causes a lot of overhead.
But at the same time, accessing individual elements of a std::list is slow, because you need to traverse the list from the front until you find the one you want.
So pick your poison, but I highly suggest using std::list for this case.
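A minimal sketch of the std::list variant (fileVec is the vector from the question; note that reaching the position by index is still a linear traversal):

#include <iterator>
#include <list>
#include <string>

std::list<std::string> fileList(fileVec.begin(), fileVec.end());

int index = 5;
std::list<std::string>::iterator pos = fileList.begin();
std::advance(pos, index);              // O(index) walk to the position
fileList.insert(pos, "add this text"); // O(1) once the position is known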
I assume that your vector is sorted (otherwise you could just add the new strings at the end).
What are you doing with this vector? Searching? In that case you could keep those 15 new strings in another vector, search that vector first, and only if the string is not found there, search the original one.
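A sketch of that two-vector lookup, assuming both vectors are kept sorted so that std::binary_search applies:

#include <algorithm>
#include <string>
#include <vector>

std::vector<std::string> overflow; // holds the ~15 late insertions, kept sorted

void insertString(const std::string& s) {
    overflow.insert(std::upper_bound(overflow.begin(), overflow.end(), s), s);
}

bool contains(const std::vector<std::string>& fileVec, const std::string& s) {
    return std::binary_search(overflow.begin(), overflow.end(), s)
        || std::binary_search(fileVec.begin(), fileVec.end(), s);
}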

Instant sort when putting a new value into an array - C++

I have a dynamically allocated array containing structs with a key/value pair. I need to write an update(key, value) function that puts a new struct into the array, or, if a struct with the same key is already in the array, updates its value. Insert and update are combined in one function.
The problem is:
Before adding a struct I need to check whether a struct with this key already exists.
I can go through all elements of the array and compare keys (very slow).
Or I can use binary search, but (!) the array must be sorted.
So I tried sorting the array on each update (sloooow), or sorting it whenever the binary search function is called... which is on each update.
Finally, I thought that there must be a way of inserting a struct into the array so that it lands in the right place and the array always stays sorted.
However, I couldn't think of such an algorithm, so I came here to ask for help, because Google refuses to read my mind.
I need to make my code faster, because my array holds more than 50,000 structs and I'm using bubble sort (because I'm dumb).
Take a look at Red Black Trees: http://en.wikipedia.org/wiki/Red%E2%80%93black_tree
They will ensure the data is always sorted, and they have O(log n) complexity for inserts.
A binary heap will not suffice, as a binary heap does not have guaranteed sort order, your only guarantee is that the top element is either min or max.
One possible approach is to use a different data structure. There is no genuine need to keep the structs ordered; you only need to detect whether a struct with the same key exists, so the cost of maintaining order in a balanced tree (for instance by using std::map) is excessive. A more suitable data structure is a hash table, which C++11 provides in the standard library under the somewhat obscure name std::unordered_map (http://en.cppreference.com/w/cpp/container/unordered_map).
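A minimal sketch of update() on top of std::unordered_map (the std::string/int key and value types are placeholders for the members of the question's struct):

#include <string>
#include <unordered_map>

std::unordered_map<std::string, int> table;

// Insert-or-update in one call: operator[] creates the entry if it is
// missing, then the assignment sets (or overwrites) the value. Expected O(1).
void update(const std::string& key, int value) {
    table[key] = value;
}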
If you insist on using an array, a possible approach might be to combine these algorithms:
Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter)
Partial sort (http://en.cppreference.com/w/cpp/algorithm/partial_sort)
Binary search
Maintain two ranges in the array: first comes a range that is already sorted, then a range that is not yet sorted. When you insert a struct, first check with the bloom filter whether a matching struct might already exist. If the bloom filter gives a negative answer, then just insert the struct at the end of the array. After that the sorted range does not change, and the unsorted range grows by one.
If the bloom filter gives a positive answer, then apply partial sort algorithm to make the entire array sorted and then use binary search to check if such an object actually exists. If so, replace this element. After that the sorted range is the entire array, and the unsorted range is empty.
If the binary search has shown that the bloom filter was wrong, and the matching struct is not there, then you just put the new struct at the end of the array. After that the sorted range is entire array minus one, and the unsorted range is the last element in the array.
Each time you insert an element, binary search to find if it exists. If it doesn't exist, the binary search will give you the index at which you can insert it.
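That is exactly what std::lower_bound gives you; a sketch with a sorted std::vector, where Entry stands in for the question's struct:

#include <algorithm>
#include <vector>

struct Entry { int key; int value; };

std::vector<Entry> entries; // kept sorted by key at all times

void update(int key, int value) {
    auto it = std::lower_bound(entries.begin(), entries.end(), key,
                               [](const Entry& e, int k) { return e.key < k; });
    if (it != entries.end() && it->key == key)
        it->value = value;                     // key exists: update in place
    else
        entries.insert(it, Entry{key, value}); // insert here keeps the order
}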
You could use std::set, which does not allow duplicate elements and places elements in sorted position. This assumes that you are storing the key and value in a struct, and not separately. In order for the sorting to work properly, you will need to define a comparison function for the structs.
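A sketch of the std::set variant; since set elements are immutable, the value member is declared mutable here (it does not participate in the ordering, so this is safe):

#include <set>

struct Item {
    int key;
    mutable int value; // not part of the ordering, so mutation is safe
    bool operator<(const Item& other) const { return key < other.key; }
};

std::set<Item> items;

void update(int key, int value) {
    auto r = items.insert(Item{key, value});
    if (!r.second)              // the key was already present
        r.first->value = value; // update through the iterator
}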

Efficient frequency counter

I have 15,000,000 std::vectors of 6 integers each.
Those 15M vectors contain duplicates.
Duplicate example:
(4,3,2,0,4,23)
(4,3,2,0,4,23)
I need to obtain a list of the unique sequences with their associated counts. (A sequence that is present only once would have a count of 1.)
Is there an algorithm in standard C++ (C++11 is fine) that does that in one shot?
Windows, 4GB RAM, 30+GB HDD.
There is no such algorithm in the standard library that does exactly this, but it's very easy with a single loop and the proper data structure.
For this you want to use std::unordered_map, which is typically a hash map. It has expected constant time per access (insert and look-up) and is thus the first choice for huge data sets.
The following access-and-increment trick automatically inserts a new entry into the counter map if it's not there yet; it then increments the count and writes it back.
typedef std::vector<int> VectorType; // Please consider std::array<int, 6>!

// NOTE: std::hash has no specialization for std::vector<int>, so a custom
// hash functor (sketched below) must be supplied for this to compile.
std::unordered_map<VectorType, int, VectorHash> counters;
for (const VectorType& vec : vectors) { // iterate by reference to avoid copies
    counters[vec]++;
}
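A possible VectorHash (an assumption, not a standard component): it folds std::hash over the elements with the usual boost-style mixing constant.

#include <cstddef>
#include <functional>
#include <vector>

struct VectorHash {
    std::size_t operator()(const std::vector<int>& v) const {
        std::size_t seed = v.size();
        for (int x : v) // boost::hash_combine-style mixing
            seed ^= std::hash<int>()(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};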
For further processing, you most probably want to sort the entries by the number of occurrences. For this, either write them out into a vector of pairs (which encapsulates the number vector and the occurrence count), or into an (ordered) std::multimap with key and value swapped, so it's automatically ordered by the counter (a multimap, because several sequences can share the same count).
In order to reduce the memory footprint of this solution, try this:
If you don't need to get the keys back from the hash map, you can use a hash map that doesn't store the keys but only their hashes. For this, use size_t as the key type, an identity (pass-through) functor as the internal hash function, and compute the keys with a manual call to the vector hash function (VectorHash above).
struct IdentityHash { // the keys are already hashes: pass them through
    std::size_t operator()(std::size_t v) const { return v; }
};

std::unordered_map<std::size_t, int, IdentityHash> counters;
VectorHash hashFunc;
for (const VectorType& vec : vectors) {
    counters[hashFunc(vec)]++;
}
This reduces memory but requires additional effort to interpret the results: you have to loop over the original data structure a second time in order to find the original vectors (looking them up in your hash map by hashing them again). Strictly speaking, two different vectors could also collide on the same hash.
Yes: first std::sort the list (std::vector uses lexicographic ordering, the first element is the most significant), then loop with std::adjacent_find to find duplicates. When a duplicate is found, use std::adjacent_find again but with an inverted comparator to find the first non-duplicate.
Alternatively, you could use std::unique with a custom comparator that flags when a duplicate is found and maintains a count through the successive calls. This also gives you a deduplicated list.
The advantage of these approaches over std::unordered_map is space complexity proportional to the number of duplicates. You don't have to copy the entire original dataset or add a seldom-used field for dup-count.
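A sketch of the sort-based counting; instead of chaining adjacent_find calls, one pass over the sorted data counts each run of equal vectors:

#include <algorithm>
#include <utility>
#include <vector>

std::vector<std::pair<std::vector<int>, int>>
countUnique(std::vector<std::vector<int>> vectors) { // by value: sorts a copy
    std::sort(vectors.begin(), vectors.end());       // lexicographic order
    std::vector<std::pair<std::vector<int>, int>> result;
    for (const std::vector<int>& v : vectors) {
        if (!result.empty() && result.back().first == v)
            ++result.back().second;   // same run: bump the count
        else
            result.push_back({v, 1}); // a new unique sequence
    }
    return result;
}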
You could convert each vector to a string, element by element, like this: "4,3,2,0,4,23".
Then add the strings to a new string vector, checking for existence with find().
If you need the original vectors back, convert the string vector into integer-sequence vectors again.
If you do not need them, just drop the duplicated elements while building the string vector.

Data structure that returns index and stores count of strings in c++

I am building an xlsx builder, and I have a series of strings to save in a spreadsheet (xml file). There may be duplication, so I want to store the strings in a map and increment their counts. Then instead of storing the strings I can store the index they are at in the map, and store the strings in another xml file. But retrieving the index of a given string is O(n) with std::map. Is there a data structure that can accomplish this faster?
Unless your "separate file" needs to be in lexicographic order don't use the index in the map, store the index explicitly.
So for example a map<string, gubbins>, with struct gubbins { size_t count; size_t index; }.
Whenever you insert a new key to the map, give its index the "next" value of an incrementing counter.
The range of index values used is contiguous unless you later come along and decrement the refcount then remove entries from the map when it hits zero. In that case you can "defragment" the indexes, but of course not if you've already used the indexes to identify the strings elsewhere.
The operation to write the strings file requires sorting by index first. You can do that in linear time -- create a big enough array and then run through the map, storing each string at the correct index. Or you can build the strings file as you go, adding each string when it's added to the map.
It's probably possible to do the whole thing with the right boost::multi_index container.
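A minimal sketch of that scheme (gubbins as named above; writing the strings file as you go is left as a comment):

#include <cstddef>
#include <map>
#include <string>

struct gubbins { std::size_t count; std::size_t index; };

std::map<std::string, gubbins> strings;
std::size_t nextIndex = 0;

// Returns the stable index for s, inserting it on first sight.
std::size_t internString(const std::string& s) {
    auto it = strings.find(s);
    if (it == strings.end()) {
        it = strings.insert({s, gubbins{0, nextIndex++}}).first;
        // ... this is also the point to append s to the strings file ...
    }
    ++it->second.count;
    return it->second.index;
}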
If you need to store the strings in sorted order, you might want to look into the order statistic tree data structure, which is a balanced binary search tree augmented with extra information that makes it possible to determine the nth element in the tree efficiently (in O(log n) time). This gives you all of the original functionality of the std::map, plus random access.
There isn't a standard implementation of order statistic trees in the C++ standard libraries, but a quick Google search should turn some up.
Hope this helps!

Inserting and removing elements from an array while keeping the array sorted

I'm wondering whether somebody can help me with this problem. I'm using C/C++ to program and I need to do the following:
I am given a sorted array P (biggest first) containing floats. It usually has a very big size; sometimes it holds correlation values from 10-megapixel images. I need to iterate through the array until it is empty, and within the loop there is additional processing taking place.
The gist of the problem is that at the start of the loop I need to remove the elements with the maximum value from the array, check certain conditions, and if they hold, reinsert the elements into the array after decreasing their value. However, I want the array to be efficiently sorted after the reinsertion.
Can somebody point me towards a way of doing this? I have tried the naive approach of re-sorting every time I insert, but that seems really wasteful.
Change the data structure. Repeatedly accessing the largest element, and then quickly inserting new values, in such a way that you can still efficiently repeatedly access the largest element, is a job for a heap, which may be fairly easily created from your array in C++.
BTW, please don't talk about "C/C++". There is no such language. You're instead making vague implications about the style in which you're writing things, most of which will strike experienced programmers as bad.
I would look into std::priority_queue (http://www.cplusplus.com/reference/stl/priority_queue/), as it is designed to do just this.
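A sketch of the priority_queue approach (n, shouldReinsert, and adjustment are stand-ins for the question's sizes and "certain conditions"):

#include <queue>

std::priority_queue<float> pq(P, P + n); // max-heap built from the array P

while (!pq.empty()) {
    float value = pq.top(); // the current maximum
    pq.pop();
    // ... additional processing ...
    if (shouldReinsert(value))       // hypothetical condition from the question
        pq.push(value - adjustment); // reinsert the decreased value in O(log n)
}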
You could use a binary search to determine where to insert the changed value after you removed it from the array. Note that inserting or removing at the front or somewhere in the middle is not very efficient either, as it requires moving all items with a higher index up or down, respectively.
ISTM that you should rather put your changed items into a new array and sort that once, after you finished iterating over the original array. If memory is a problem, and you really have to do things in place, change the values in place and only sort once.
I can't think of a better way to do this. Keeping the array sorted all the time seems rather inefficient.
Since the array is already sorted, you can use a binary search to determine the location at which to insert the updated value. C++ provides std::lower_bound and std::upper_bound for this purpose; C provides bsearch. Just shift all the existing values up by one location in the array and store the new value in the newly cleared spot.
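A sketch of that insertion; because the question's array is sorted biggest-first, std::greater is passed as the comparator (P, count, and newVal are assumptions):

#include <algorithm>
#include <functional>

// P[0..count) is sorted descending and has room for one more element.
float* pos = std::lower_bound(P, P + count, newVal, std::greater<float>());
std::copy_backward(pos, P + count, P + count + 1); // shift the tail up by one
*pos = newVal;
++count;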
Here's some pseudocode that may work decently if you aren't decreasing the removed values by much:
For example, say you're processing the element with the maximum value in the array, and say the array is sorted in descending order (largest first).
Remove array[0].
Let newVal = array[0] - adjustment, where adjustment is the amount you're decreasing the value by.
Now loop through, adjusting only the values you need to:
Pseudocode:
i = 0
while (i + 1 < length && newVal < array[i + 1]) {
    array[i] = array[i + 1]; // shift the larger values toward the front
    i++;
}
array[i] = newVal;
Again, if you're not decreasing the removed values by a large amount (relative to the values in the array), this could work fairly efficiently.
Of course, the generally better alternative is to use a more appropriate data structure, such as a heap.
Maybe using another temporary array could help.
That way you can first sort the "changed" elements alone.
After that, just do a regular O(n) merge of the two sorted sub-arrays into the temp array, and copy everything back to the original array.
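A sketch of that merge with std::merge; std::greater keeps the biggest-first order from the question (the container names are assumptions):

#include <algorithm>
#include <functional>
#include <vector>

// `original` is still sorted descending; `changed` holds the decreased values.
std::vector<float> mergeBack(std::vector<float> original,
                             std::vector<float> changed) {
    std::sort(changed.begin(), changed.end(), std::greater<float>());
    std::vector<float> merged(original.size() + changed.size());
    std::merge(original.begin(), original.end(),
               changed.begin(), changed.end(),
               merged.begin(), std::greater<float>()); // O(n) descending merge
    return merged;
}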