Efficient frequency counter - c++

I have 15,000,000 std::vectors of 6 integers.
Those 15M vectors contain duplicates.
Duplicate example:
(4,3,2,0,4,23)
(4,3,2,0,4,23)
I need to obtain a list of unique sequences with their associated counts. (A sequence that is only present once would have a count of 1.)
Is there an algorithm in the standard C++ library (C++11 is fine) that does that in one shot?
Windows, 4GB RAM, 30+GB hdd

There is no such algorithm in the standard library which does exactly this, however it's very easy with a single loop and by choosing the proper data structure.
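For this you want to use std::unordered_map, which is typically a hash map. It has expected constant time per access (insert and look-up) and is thus the first choice for huge data sets. Note that the standard library provides no std::hash for std::vector<int>, so you have to supply your own hash functor.
The following access-and-increment trick will automatically insert a new entry in the counter map if it's not yet there; then it will increment and write back the count.
typedef std::vector<int> VectorType; // Please consider std::array<int,6>!
struct VectorHash { // required: there is no std::hash specialization for std::vector<int>
    std::size_t operator()(const VectorType& v) const {
        std::size_t h = 0;
        for (int x : v) h ^= std::hash<int>()(x) + 0x9e3779b9 + (h << 6) + (h >> 2); // hash_combine-style mixing
        return h;
    }
};
std::unordered_map<VectorType, int, VectorHash> counters;
for (const VectorType& vec : vectors) { counters[vec]++; } // by reference, to avoid copying each vector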
For further processing, you most probably want to sort the entries by the number of occurrences. For this, either write them out in a vector of pairs (each pair holding the number vector and its occurrence count) and sort that, or into an (ordered) multimap which has key and value swapped, so it's automatically ordered by the count.
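As an illustration of the "vector of pairs" route, here is a minimal sketch. The names Sequence and sortedByCount are made up for this example, and it assumes the counter container holds (sequence, int) entries, as the map above does.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

using Sequence = std::vector<int>;  // hypothetical alias for one 6-integer sequence

// Copies the (sequence, count) entries out of any map-like container
// and orders them by descending count.
template <typename CounterMap>
std::vector<std::pair<Sequence, int>> sortedByCount(const CounterMap& counters) {
    std::vector<std::pair<Sequence, int>> entries(counters.begin(), counters.end());
    std::sort(entries.begin(), entries.end(),
              [](const std::pair<Sequence, int>& a, const std::pair<Sequence, int>& b) {
                  return a.second > b.second;  // most frequent first
              });
    return entries;
}
```

Calling sortedByCount(counters) leaves the map untouched; the copy costs extra memory proportional to the number of unique sequences.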
In order to reduce the memory footprint of this solution, try this:
If you don't need to get the keys back from this hash map, you can use a hash map which doesn't store the keys but only their hashes. For this, use std::size_t for the key type, leave the map's own hash at its default std::hash<std::size_t> (which is cheap for integers), and access the map with a manual call to your hash function for VectorType (here a user-supplied functor named VectorHash, since the standard library provides no std::hash for std::vector<int>).
std::unordered_map<std::size_t, int> counters; // keys are precomputed hashes
VectorHash hashFunc; // user-supplied hash functor for VectorType
for (const VectorType& vec : vectors) {
    counters[hashFunc(vec)]++;
}
This reduces memory but requires an additional effort to interpret the results, as you have to loop over the original data structure a second time in order to find the original vectors (then look them up in your hash map by hashing them again). Note also that two different vectors whose hashes collide are counted as one entry.

Yes: first std::sort the list (std::vector uses lexicographic ordering, the first element is the most significant), then loop with std::adjacent_find to find duplicates. When a duplicate is found, use std::adjacent_find again but with an inverted comparator to find the first non-duplicate.
Alternately, you could use std::unique with a custom comparator that flags when a duplicate is found, and maintains a count through the successive calls. This also gives you a deduplicated list.
The advantage of these approaches over std::unordered_map is space complexity proportional to the number of duplicates. You don't have to copy the entire original dataset or add a seldom-used field for dup-count.
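A minimal sketch of the sort-then-scan idea follows. It is a simplified variant rather than the exact recipe above: it uses std::find_if to locate the end of each run instead of two std::adjacent_find passes, and it assumes the sequences are stored as std::array<int, 6>. The names Sequence and countRuns are made up for the example.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

using Sequence = std::array<int, 6>;  // assumption: fixed-size sequences

// Sorts the input in place, then records each distinct sequence
// together with the length of its run of equal neighbours.
std::vector<std::pair<Sequence, std::size_t>> countRuns(std::vector<Sequence>& data) {
    std::sort(data.begin(), data.end());  // lexicographic order
    std::vector<std::pair<Sequence, std::size_t>> result;
    auto runStart = data.begin();
    while (runStart != data.end()) {
        // First element that differs from the start of the current run.
        auto runEnd = std::find_if(runStart, data.end(),
                                   [&](const Sequence& s) { return s != *runStart; });
        result.emplace_back(*runStart, static_cast<std::size_t>(runEnd - runStart));
        runStart = runEnd;
    }
    return result;
}
```

The sort reorders the caller's data; the only extra memory is the result list, one entry per unique sequence.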

You could convert each vector to a string, one by one, like this: "4,3,2,0,4,23".
Then add them to a new vector of strings, checking whether each one already exists with the find() function.
If you need the original vectors, convert the string vector back into a vector of integer sequences.
If you do not, delete the duplicated elements while building the string vector.

Related

C++ efficient way to store and update sorted items

I have an operation that continuously generates random solutions (std::vector<float>). I evaluate the solutions against a mathematical function to see their usefulness (float). I would like to store the top 10 solutions at all times. What would be the most efficient way to do this in C++?
I need to store both the solutions(std::vector) and their usefulness (float). I am performing several hundred thousands of evaluations and hence I am in need of an efficient solution.
Edit:
I am aware of sorting methods. I am looking for methods other than sorting and storing the values. Looking for better data structures if any.
You evaluate the float score() function for current std::vector<T> solution, store them in a std::pair<vector<T>, float>.
You use a std::priority_queue< pair<vector<T>, float> > to store the 10 best solutions together with their scores. std::priority_queue is a heap, so it gives you cheap access to one extreme element according to a compare function that you supply. To keep the 10 best, you want the worst of the kept solutions on top, so set the comparator up to compare score_a > score_b (a min-heap on the score).
Store the first 10 pairs, then compare each new one with the top of the heap, which is the current 10th best. If score(new) > score(10th), call p.pop() to get rid of the old 10th element and p.push(new) to insert the new one into the priority_queue p.
You keep doing this inside a loop until you run out of vector<T> solutions.
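The heap-based approach could look roughly like the sketch below. The class name TopK and its methods are invented for the example; it stores the score first in the pair so that the default pair ordering compares by score, and it uses std::greater to get a min-heap.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Solution = std::vector<float>;
using Scored = std::pair<float, Solution>;  // score first, so pairs order by score

class TopK {
public:
    explicit TopK(std::size_t k) : k_(k) {}

    // Offers a candidate; it is kept only if it ranks among the k best so far.
    void offer(float score, Solution solution) {
        if (k_ == 0) return;
        if (heap_.size() < k_) {
            heap_.emplace(score, std::move(solution));
        } else if (score > heap_.top().first) {
            heap_.pop();  // drop the current worst of the kept solutions
            heap_.emplace(score, std::move(solution));
        }
    }

    // Empties the heap and returns the kept entries, best first.
    std::vector<Scored> drainBestFirst() {
        std::vector<Scored> out;
        while (!heap_.empty()) {
            out.push_back(heap_.top());
            heap_.pop();
        }
        return std::vector<Scored>(out.rbegin(), out.rend());
    }

private:
    std::size_t k_;
    // std::greater turns the default max-heap into a min-heap.
    std::priority_queue<Scored, std::vector<Scored>, std::greater<Scored>> heap_;
};
```

With TopK top(10), most candidates are rejected after a single comparison against the top of the heap; only an accepted candidate pays the O(log k) cost of a pop and a push.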
Have a vector of pairs, where each pair has the solution as one element and its usefulness as the other. Then write a custom comparator to compare elements in the vector.
Add the new element at the end, then sort this vector and remove the last element.
As @user4581301 mentioned in the comments, for 10 elements you don't need to sort. Just traverse the vector every time, or perform an ordered insert into the vector.
Here are some links to help you:
https://www.geeksforgeeks.org/sorting-vector-of-pairs-in-c-set-1-sort-by-first-and-second/
Comparator for vector<pair<int,int>>

C++ : How can I push values into a vector only if that value isn't stored in the vector already?

If for example, I was just pushing 200 random numbers into a vector, how can I ensure that duplicates will not be pushed in?
Seems like a map could be a more helpful structure than a vector.
If you must stick to a vector then you need to divide your task into two parts: duplication detection and then insertion. Again, you could insert into a map and then read that out into the vector.
In either case the problem is - intrinsically - two problems. Good luck!
You need to check if the vector already contains the value, and if not, then push the new value, i.e.
std::vector<int>::iterator it;
it = std::find(myvector.begin(), myvector.end(), newvalue); // std::find is declared in <algorithm>
if (it == myvector.end()) {
    // newvalue is not found, so it is safe to add it
    myvector.push_back(newvalue);
}
But this could be costly, since std::find checks every value inside myvector, making each insertion O(n).
Using a set or map data structure instead can be more efficient.
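One way to combine both ideas is to keep a set next to the vector purely for the duplicate check. The sketch below is only an illustration; the class name UniqueVector and its methods are invented, and it uses std::unordered_set as the set type.

```cpp
#include <unordered_set>
#include <vector>

// Keeps values in first-seen order and ignores repeats.
class UniqueVector {
public:
    // Returns true if the value was new and has been appended.
    bool pushBackUnique(int value) {
        if (!seen_.insert(value).second) return false;  // already present
        values_.push_back(value);
        return true;
    }

    const std::vector<int>& values() const { return values_; }

private:
    std::unordered_set<int> seen_;  // used only for the membership test
    std::vector<int> values_;
};
```

This trades extra memory for an expected constant-time check per insertion instead of a linear scan.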
If the random numbers are integer and within a relatively small range, you can try this:
You want N unique random numbers from M possible values whereby M >= N
create a container containing one of each of the unique random number
shuffle the container
take the first N from the container and insert to your vector
If M is much bigger than N (like between 0 and RAND_MAX), then you should just check for repetition before each insert and repeat until your container size reaches 200. If using a vector is not mandatory, I would suggest using std::set instead, since it ensures unique values by default.
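For the small-range case, the shuffle steps above can be sketched as follows. The function name pickUnique is made up, the possible values are assumed to be 0..m-1, and std::mt19937 is just one choice of random engine.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Returns n distinct values drawn from 0..m-1 (at most m values if n > m).
std::vector<int> pickUnique(std::size_t n, std::size_t m, std::mt19937& rng) {
    std::vector<int> pool(m);
    std::iota(pool.begin(), pool.end(), 0);       // one of each possible value
    std::shuffle(pool.begin(), pool.end(), rng);  // random order
    pool.resize(std::min(n, m));                  // keep the first n
    return pool;
}
```

This needs memory for all m candidates, which is why it only suits a relatively small range.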

Instant sort when put new value in array C++

I have a dynamically allocated array containing structs, each holding a key-value pair. I need to write an update(key,value) function that puts a new struct into the array or, if a struct with the same key is already in the array, updates its value. Insert and update are combined in one function.
The problem is:
Before adding a struct I need to check if struct with this key already existing.
I can go through all elements of array and compare key (very slow)
Or I can use binary search, but (!) array must be sorted.
So I tried to sort the array with each update (sloooow) or to sort it when calling the binary search function.....which happens on each update
Finally, I thought that there must be a way of inserting a struct into array so it would be placed in a right place and be always sorted.
However, I couldn't think of an algorithm like that so I came here to ask for some help because google refuses to read my mind.
I need to make my code faster because my array accepts more than 50 000 structs and I'm using bubble sort (because I'm dumb).
Take a look at Red Black Trees: http://en.wikipedia.org/wiki/Red%E2%80%93black_tree
They will ensure the data is always sorted, and it has a complexity of O ( log n ) for inserts.
A binary heap will not suffice, as a binary heap does not have guaranteed sort order, your only guarantee is that the top element is either min or max.
One possible approach is to use a different data structure. As there is no genuine need to keep the structs ordered, only a need to detect whether a struct with the same key exists, the costs of maintaining order in a balanced tree (for instance by using std::map) are excessive. A more suitable data structure would be a hash table. C++11 provides one in the standard library under the obscure name std::unordered_map (http://en.cppreference.com/w/cpp/container/unordered_map).
If you insist on using an array, a possible approach might be to combine these algorithms:
Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter)
Partial sort (http://en.cppreference.com/w/cpp/algorithm/partial_sort)
Binary search
Maintain two ranges in the array -- first goes a range that is already sorted, then goes a range that is not yet. When you insert a struct, first check with the bloom filter if a matching struct might already exist. If the bloom filter gives a negative answer, then just insert the struct at the end of the array. After that the sorted range does not change, the unsorted range grows by one.
If the bloom filter gives a positive answer, then apply partial sort algorithm to make the entire array sorted and then use binary search to check if such an object actually exists. If so, replace this element. After that the sorted range is the entire array, and the unsorted range is empty.
If the binary search has shown that the bloom filter was wrong, and the matching struct is not there, then you just put the new struct at the end of the array. After that the sorted range is entire array minus one, and the unsorted range is the last element in the array.
Each time you insert an element, binary search to find if it exists. If it doesn't exist, the binary search will give you the index at which you can insert it.
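A minimal sketch of that binary-search insert, using std::lower_bound on a std::vector. The struct Entry with int key and value is an assumption made for the example; the question does not show the real struct.

```cpp
#include <algorithm>
#include <vector>

struct Entry {  // hypothetical layout of the key-value struct
    int key;
    int value;
};

// Inserts (key, value) at its sorted position, or overwrites the value
// if an entry with that key already exists. Keeps entries sorted by key.
void update(std::vector<Entry>& entries, int key, int value) {
    auto pos = std::lower_bound(entries.begin(), entries.end(), key,
                                [](const Entry& e, int k) { return e.key < k; });
    if (pos != entries.end() && pos->key == key) {
        pos->value = value;
    } else {
        entries.insert(pos, Entry{key, value});
    }
}
```

The search is O(log n), but inserting in the middle still shifts the following elements, so a single insert is O(n) in the worst case. That is still far cheaper than re-sorting on every update.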
You could use std::set, which does not allow duplicate elements and places elements in sorted position. This assumes that you are storing the key and value in a struct, and not separately. In order for the sorting to work properly, you will need to define a comparison function for the structs.

Data structure that returns index and stores count of strings in c++

I am building an xlsx builder, and I have a series of strings to save in a spreadsheet (xml file). There may be duplication, so I want to store the strings in a map and increment their counts. Then instead of storing the strings I can store the index they are at in the map, and store the strings in another xml file. But retrieving the index of a given string is O(n) with std::map. Is there a data structure that can accomplish this faster?
Unless your "separate file" needs to be in lexicographic order don't use the index in the map, store the index explicitly.
So for example a map<string, gubbins>, with struct gubbins { size_t count; size_t index; }.
Whenever you insert a new key to the map, give its index the "next" value of an incrementing counter.
The range of index values used is contiguous unless you later come along and decrement the refcount then remove entries from the map when it hits zero. In that case you can "defragment" the indexes, but of course not if you've already used the indexes to identify the strings elsewhere.
The operation to write the strings file requires sorting by index first. You can do that in linear time -- create a big enough array and then run through the map, storing each string at the correct index. Or you can build the strings file as you go, adding each string when it's added to the map.
It's probably possible to do the whole thing with the right boost::multi_index container.
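The explicit-index idea might be sketched like this. The struct name follows the gubbins example above; the class StringTable and its methods are invented for illustration, and the sketch builds the index-ordered list as it goes rather than sorting afterwards.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Gubbins {
    std::size_t count;
    std::size_t index;
};

class StringTable {
public:
    // Records one use of s and returns its stable index.
    std::size_t add(const std::string& s) {
        auto it = entries_.find(s);
        if (it == entries_.end()) {
            // New key: its index is the next value of an incrementing counter.
            it = entries_.emplace(s, Gubbins{0, ordered_.size()}).first;
            ordered_.push_back(s);  // the strings list is built as we go
        }
        ++it->second.count;
        return it->second.index;
    }

    // Number of times s has been added (0 if never).
    std::size_t count(const std::string& s) const {
        auto it = entries_.find(s);
        return it == entries_.end() ? 0 : it->second.count;
    }

    const std::vector<std::string>& inIndexOrder() const { return ordered_; }

private:
    std::map<std::string, Gubbins> entries_;
    std::vector<std::string> ordered_;
};
```

Looking up the index of a string is then a single O(log n) map lookup, and inIndexOrder() is already in the order needed for the strings file. The price is that each string is stored twice.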
If you need to store the strings in sorted order, you might want to look into the order statistic tree data structure, which is a balanced binary search tree augmented with extra information that makes it possible to determine the nth element in the tree efficiently (in O(log n) time). This gives you all of the original functionality of the std::map, plus random access.
There isn't a standard implementation of order statistic trees in the C++ standard libraries, but a quick Google search should turn some up.
Hope this helps!

How to associate to a number another number without using array

Let's say we have read these values:
3
1241
124515
5322353
341
43262267234
1241
1241
3213131
And I have an array like this (with the elements above):
a[0]=1241
a[1]=124515
a[2]=43262267234
a[3]=3
...
The thing is that the elements' order in the array is not constant (I have to change it somewhere else in my program).
How can I know at which positions an element appears in the read document?
Note that I can not do:
vector <int> a[1000000000000];
a[number].push_back(all_positions);
Because a will be too large (there's a memory restriction). (Let's say I have only 3000 elements, but their values are from 0 to 2^32.)
So, in the example above, I would want to know all the positions 1241 is appearing on without iterating again through all the read elements.
In other words, how can I associate to the number "1241" the positions "1,6,7" so I can simply access them in O(1) (where 1 actually is the number of positions the element appears)
If there's no O(1) I want to know what's the optimal one ...
I don't know if I've made myself clear. If not, just say it and I'll update my question :)
You need to use some sort of dynamic array, like a vector (std::vector) or other similar containers (std::list, maybe, it depends on your needs).
Such data structures are safer and easier to use than C-style array, since they take care of memory management.
If you also need to look for an element in O(1) you should consider using some structures that will associate both an index to an item and an item to an index. I don't think STL provides any, but boost should have something like that.
If O(log n) is a cost you can afford, also consider std::map
You can use what is commonly referred to as a multimap. That is, it stores a key and multiple values. Look-up time is O(log n).
If you're working with Visual Studio, it provides its own hash_multimap; otherwise, may I suggest using boost::unordered_map with a list as your value?
You don't need a sparse array of 1000000000000 elements; use an std::map to map positions to values.
If you want bi-directional lookup (that is, you sometimes want "what are the indexes for this value?" and sometimes "what value is at this index?") then you can use a boost::bimap.
Things get further complicated as you have values appearing more than once. You can sacrifice the bi-directional lookup and use a std::multimap.
You could use a map for that. Like:
std::map<int, std::vector<int>> MyMap;
So every time you encounter a value while reading the file, you append its position to the map. Say X is the value you read and Y is the position, then you just do
MyMap[X].push_back( Y );
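Put together, a small sketch of that map could look like this. The names PositionIndex and buildPositionIndex are made up; it uses std::uint64_t for the values because some of the sample numbers do not fit in 32 bits, and it records 0-based positions, which matches the "1,6,7" given for 1241 in the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using PositionIndex = std::map<std::uint64_t, std::vector<std::size_t>>;

// Maps each value to the 0-based positions at which it was read.
PositionIndex buildPositionIndex(const std::vector<std::uint64_t>& readValues) {
    PositionIndex index;
    for (std::size_t i = 0; i < readValues.size(); ++i) {
        index[readValues[i]].push_back(i);
    }
    return index;
}
```

A lookup costs O(log n) in the number of distinct values, after which all positions of that value are available in one vector. Memory grows with the number of elements read, not with the range of their values.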
Instead of your array, use
std::map<int, vector<int> > a;
You need an associative collection, but you might want to associate each key with multiple values.
You can use std::multimap< int, int >
or
you can use std::map< int, std::set< int > >
I have found in practice the latter is easier for removing items if you just need to remove one element. It is unique on key-value combinations but not on key or value alone.
If you need higher performance then you may wish to use a hash_map instead of map. For the inner collection, though, you will not gain much performance by using a hash, as you will have very few duplicates, and it is better to use std::set.
There are many implementations of hash_map, and it is in the new standard. If you don't have the new standard, go for boost.
It seems you need a std::map<int,int>. You can store the mapping such as 1241->0 124515->1 etc. Then perform a look up on this map to get the array index.
Besides the std::map solution offered by others here (O(log n)), there's the approach of a hash map (implemented as boost::unordered_map or std::unordered_map in C++0x, supported by modern compilers).
It would give you O(1) lookup on average, which often is faster than a tree-based std::map. Try for yourself.
You can use a std::multimap to store both a key (e.g. 1241) and multiple values (e.g. 1, 6 and 7).
An insert has logarithmic complexity, but you can speed it up if you give the insert method a hint where it can insert the item.
For O(1) lookup you could hash the number to find its entry (key) in a hash map (boost::unordered_map, a dictionary, stdext::hash_map etc.)
The value could be a vector of indices where the number occurs or a 3000 bit array (375 bytes) where the bit number for each respective index where the number (key) occurs is set.
boost::unordered_map<unsigned long, std::vector<unsigned long>> myMap;
for(unsigned long i = 0; i < sizeof(a)/sizeof(*a); ++i)
{
myMap[a[i]].push_back(i);
}
Instead of storing an array of integer, you could store an array of structure containing the integer value and all its positions in an array or vector.