How to associate a number with another number without using an array - C++

Let's say we have read these values:
3
1241
124515
5322353
341
43262267234
1241
1241
3213131
And I have an array like this (with the elements above):
a[0]=1241
a[1]=124515
a[2]=43262267234
a[3]=3
...
The thing is that the elements' order in the array is not constant (I have to change it somewhere else in my program).
How can I know at which positions an element appears in the read input?
Note that I can not do:
vector <int> a[1000000000000];
a[number].push_back(all_positions);
Because a will be too large (there's a memory restriction). (Let's say I have only 3000 elements, but their values are from 0 to 2^32.)
So, in the example above, I would want to know all the positions 1241 appears at without iterating through all the read elements again.
In other words, how can I associate with the number "1241" the positions "1, 6, 7" so I can simply access them in O(1) (where 1 is actually the number of positions the element appears at)?
If there's no O(1) way, I'd like to know what the optimal one is.
I don't know if I've made myself clear. If not, just say it and I'll update my question :)

You need to use some sort of dynamic array, like a vector (std::vector) or other similar containers (std::list, maybe, it depends on your needs).
Such data structures are safer and easier to use than C-style arrays, since they take care of memory management.
If you also need to look for an element in O(1) you should consider using some structures that will associate both an index to an item and an item to an index. I don't think STL provides any, but boost should have something like that.
If O(log n) is a cost you can afford, also consider std::map

You can use what is commonly referred to as a multimap. That is, it stores a key with multiple values, and lookup is O(log n).
If you're working with Visual Studio, it provides its own hash_multimap; otherwise, may I suggest using boost::unordered_map with a list as your value type?

You don't need a sparse array of 1000000000000 elements; use an std::map to map positions to values.
If you want bi-directional lookup (that is, you sometimes want "what are the indexes for this value?" and sometimes "what value is at this index?") then you can use a boost::bimap.
Things get further complicated as you have values appearing more than once. You can sacrifice the bi-directional lookup and use a std::multimap.

You could use a map for that. Like:
std::map<int, std::vector<int>> MyMap;
So every time you encounter a value while reading the file, you append its position to the map. Say X is the value you read and Y is its position; then you just do
MyMap[X].push_back( Y );
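Putting this together, here is a minimal sketch of the idea (assuming the values are read from std::cin and positions are counted from 0; long long is used for the key since the question says values can exceed the int range):
#include <iostream>
#include <map>
#include <vector>

int main() {
    // Key: the value read; mapped: every position it appeared at.
    std::map<long long, std::vector<int>> positions;

    long long x;
    for (int pos = 0; std::cin >> x; ++pos) {
        positions[x].push_back(pos);
    }

    // All positions of 1241, in reading order:
    for (int p : positions[1241]) {
        std::cout << p << ' ';
    }
}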

Instead of your array, use
std::map<int, std::vector<int> > a;

You need an associative collection, but you might want to associate a key with multiple values.
You can use std::multimap< int, int >
or
you can use std::map< int, std::set< int > >
I have found in practice that the latter is easier for removing items if you just need to remove one element. It is unique on key-value combinations but not on key or value alone.
If you need higher performance then you may wish to use a hash map instead of std::map. For the inner collection, though, you will not gain much from using a hash, as you will have very few duplicates and it is better to stick with std::set.
There are many implementations of hash maps, and one (unordered_map) is in the new standard. If you don't have the new standard, go for Boost.
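A minimal sketch of the second option (a set of positions per value; the record/remove helper names are illustrative):
#include <map>
#include <set>

std::map<int, std::set<int>> positions;

// Record that `value` appears at position `pos`.
void record(int value, int pos) {
    positions[value].insert(pos);
}

// Remove a single (value, position) pair; erase the key when its set becomes empty.
void remove(int value, int pos) {
    auto it = positions.find(value);
    if (it == positions.end()) return;
    it->second.erase(pos);
    if (it->second.empty()) positions.erase(it);
}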

It seems you need a std::map<int,int>. You can store the mapping such as 1241->0, 124515->1, etc. Then perform a lookup on this map to get the array index.

Besides the std::map solution offered by others here (O(log n)), there's the approach of a hash map (implemented as boost::unordered_map or std::unordered_map in C++0x, supported by modern compilers).
It would give you O(1) lookup on average, which often is faster than a tree-based std::map. Try for yourself.

You can use a std::multimap to store both a key (e.g. 1241) and multiple values (e.g. 1, 6 and 7).
An insert has logarithmic complexity, but you can speed it up if you give the insert method a hint where it can insert the item.
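A brief sketch of the multimap approach (the record/printPositions helper names are illustrative; the hinted insert overload mentioned above only pays off when the hint actually points just past the insertion spot):
#include <iostream>
#include <map>

std::multimap<int, int> positions;   // value -> one position per entry

void record(int value, int pos) {
    // Plain insert is O(log n); insert(hint, {value, pos}) is the hinted overload.
    positions.insert({value, pos});
}

void printPositions(int value) {
    // equal_range returns the range of all entries stored under `value`.
    auto range = positions.equal_range(value);
    for (auto it = range.first; it != range.second; ++it)
        std::cout << it->second << ' ';
    std::cout << '\n';
}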

For O(1) lookup you could hash the number to find its entry (key) in a hash map (boost::unordered_map, a dictionary, stdex::hash_map, etc.).
The value could be a vector of the indices at which the number occurs, or a 3000-bit array (375 bytes) in which the bit for each index where the number (key) occurs is set.
boost::unordered_map<unsigned long, std::vector<unsigned long>> myMap;

// a is the original C array of values; record the index of every occurrence.
for (unsigned long i = 0; i < sizeof(a) / sizeof(*a); ++i)
{
    myMap[a[i]].push_back(i);
}

Instead of storing an array of integer, you could store an array of structure containing the integer value and all its positions in an array or vector.

Related

C++ efficient way to store and update sorted items

I have an operation that continuously generates random solutions (std::vector<float>). I evaluate the solutions against a mathematical function to see their usefulness (a float). I would like to store the top 10 solutions at all times. What would be the most efficient way to do this in C++?
I need to store both the solutions (std::vector) and their usefulness (float). I am performing several hundred thousand evaluations and hence I am in need of an efficient solution.
Edit:
I am aware of sorting methods. I am looking for methods other than sorting and storing the values. Looking for better data structures if any.
You evaluate the float score() function for the current std::vector<T> solution and store them together in a std::pair<vector<T>, float>.
You use a std::priority_queue<pair<vector<T>, float>> to store the 10 best solutions together with their scores. std::priority_queue is a heap, so it gives you access to an extreme element according to a compare function; set the comparator up so that the top of the heap is the lowest score among the kept solutions (i.e. compare score_a > score_b).
Store the first 10 pairs, then for each new one compare it with the top of the heap: if score(new) > score(current worst of the 10), push(new) into the priority_queue p and p.pop() to get rid of the old worst element.
You keep doing this inside a loop until you run out of vector<T> solutions.
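A minimal sketch of this scheme (the consider() helper and the cap of 10 are illustrative; the score is assumed to be computed by the caller):
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Solution = std::vector<float>;
using Scored   = std::pair<float, Solution>;   // score first, so pairs compare by score

// Min-heap on score: top() is the worst of the kept solutions.
std::priority_queue<Scored, std::vector<Scored>, std::greater<Scored>> best;

void consider(const Solution& s, float score) {
    if (best.size() < 10) {
        best.emplace(score, s);
    } else if (score > best.top().first) {
        best.pop();                 // drop the current worst
        best.emplace(score, s);
    }
}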
Have a vector of pairs, where each pair holds the solution as one element and its usefulness as the other. Then write a custom comparator to compare elements in the vector.
Add the new element at the end, then sort this vector and remove the last element.
As #user4581301 mentioned in the comments, for 10 elements you don't need to sort. Just traverse the vector every time, or you can also perform an ordered insert into the vector.
Here are some links to help you:
https://www.geeksforgeeks.org/sorting-vector-of-pairs-in-c-set-1-sort-by-first-and-second/
Comparator for vector<pair<int,int>>

Returning zero when the key does not exist in unordered_map

I have the following container:
std::unordered_map<uint8_t,int> um;
um is assumed to have keys between 0 and 255, but not all of them. So, at a certain point in time I want to ask it to give me the value of the key 13, for example. If it is there, I want its value (which is guaranteed to be non-zero). If not, I want it to return 0.
What is the best way (performance point of view) to implement this?
What I tried till now: use find and return 0 if it was not found or the value if it was found.
P.S. Changing to a std::vector<int> that contains 256 items is not an option. I cannot afford the space to always store 256 values.
EDIT:
My problem is a histogram computation problem: keys are colors (0-255) and values are frequencies (int is enough). I will not be satisfied just knowing whether some key exists or not. I also need the value (the frequency).
Additional information:
I will never erase any item.
I will add items sometimes (at most 256 items), usually fewer than 10.
I will query by key very many times.
Usually queries and insertions come in no specific order.
You have a trade-off between memory and speed.
Your unordered_map should have the lowest time complexity.
Using std::vector<std::pair<uint8_t, int>> would be more compact (and more cache friendly).
std::pair<std::vector<uint8_t>, std::vector<int>> would be even more compact (no padding between uint8_t and int).
You can do even better by sharing the size/capacity bookkeeping between the two vectors, but that is no longer available in std::.
With a vector, you then have another trade-off between the complexity of searching and of adding a key (see the sketch after this list):
unsorted vector: constant add, linear search
sorted vector: linear add (due to inserting the value in the middle of the vector), logarithmic search.
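As a concrete illustration of the unsorted variant, here is a minimal sketch (the lookup() helper is illustrative; it returns 0 when the key is absent, as the question requires):
#include <cstdint>
#include <utility>
#include <vector>

// Linear search over the unsorted (key, count) pairs; 0 means "not present".
int lookup(const std::vector<std::pair<std::uint8_t, int>>& data, std::uint8_t key) {
    for (const auto& kv : data) {
        if (kv.first == key) return kv.second;
    }
    return 0;
}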
I might use a vector for space compactness.
It is tempting to keep it sorted for logarithmic search performance. But since the expected number of elements is less than 10, I might just leave it unsorted and use linear search.
So
vector<pair<uint8_t, int>> data;
If the number of expected elements is large, then having a sorted vector might help.
Boost offers a map-like interface with vector-like layout. See boost flat_map at http://www.boost.org/doc/libs/1_48_0/doc/html/container/non_standard_containers.html#container.non_standard_containers.flat_xxx
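If Boost.Container is available, a minimal sketch of the flat_map variant could look like this (the add/count helper names are illustrative; the find-based query returns 0 for missing keys without inserting them):
#include <cstdint>
#include <boost/container/flat_map.hpp>

boost::container::flat_map<std::uint8_t, int> hist;

// Increment during insertion; operator[] value-initializes missing counts to 0.
void add(std::uint8_t color) { ++hist[color]; }

// Query without inserting: return 0 when the key is absent.
int count(std::uint8_t color) {
    auto it = hist.find(color);
    return it == hist.end() ? 0 : it->second;
}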

C++ Fixed Size Container to Store Most Recent Values

I would like to know what the most suitable data structure is for the following problem in C++
I want to store 100 floats ordered by recency. So when I add (push) a new item, the other elements are moved up one position. Every time an event is triggered I receive a value and then add it to my data structure.
When the number of elements reaches 100, I would like to remove (pop) the item at the end (the oldest).
I want to able to iterate over all the elements and perform some mathematical operations on them.
I have looked at all the standard C++ containers but none of them fulfill all my needs. What's the easiest way to achieve this with standard C++ code?
You want a circular buffer. You can use Boost's implementation or make your own by allocating an array, and keeping track of the beginning and end of the used range. This boils down to doing indexing modulo 100.
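A minimal hand-rolled sketch of that idea (illustrative only; Boost's circular_buffer is the ready-made option):
#include <array>
#include <cstddef>

struct Recent100 {
    std::array<float, 100> buf{};
    std::size_t head = 0;   // index of the slot that will be written next
    std::size_t count = 0;  // number of valid elements (<= 100)

    void push(float v) {
        buf[head] = v;
        head = (head + 1) % buf.size();   // indexing modulo 100
        if (count < buf.size()) ++count;  // oldest value is overwritten once full
    }

    // Visit the stored elements from oldest to newest.
    template <class F>
    void for_each(F f) const {
        std::size_t start = (head + buf.size() - count) % buf.size();
        for (std::size_t i = 0; i < count; ++i)
            f(buf[(start + i) % buf.size()]);
    }
};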
Without creating your own or using a library, std::vector is the most efficient standard data structure for this. Once it has reached its maximum size, there will be no more dynamic memory allocations. The cost of moving up 100 floats is trivial compared to the cost of dynamic memory allocations. (This is why std::list is a slow data structure for this.) There is no push_front function for vector; instead you have to use v.insert(v.begin(), f) and then pop_back() once the size exceeds 100.
Of course this assumes what you are doing is performance-critical, which it probably isn't. In that case I would use std::deque for more convenient usage.
Just saw that you need to iterate over them. Use a list.
Your basic function would look something like this:
#include <list>

std::list<int> list100;

void addToList(int value) {
    list100.push_back(value);
    if (list100.size() > 100) {
        list100.pop_front();   // drop the oldest element
    }
}
Iterating over them is easy as well:
int sum = 0;
for (int val : list100) {
    sum += val;
}
// Average, or whatever you need to do
Obviously, if you're using something besides int, you'll need to change that. Although this adds a little bit more functionality than you need, it's very efficient since it's a doubly linked list.
http://www.cplusplus.com/reference/list/list/
You can use either std::array, std::deque, std::list or std::priority_queue.
A map (std::map) should be able to solve your requirement. Use the value as the key and the current push number nPushCount, which gets incremented whenever you add an element to the map, as the mapped value.
When adding a new element, if you have fewer than 100 elements, just add the number as the key and nPushCount as the value.
If you already have 100 elements, check whether the number exists in the map and do the following:
If the number already exists in the map, update its entry to the new nPushCount;
If it doesn't, delete the entry with the lowest nPushCount and then add the new number with the updated nPushCount.

Efficient frequency counter

I have 15,000,000 std::vectors of 6 integers.
Those 15M vectors contain duplicates.
Duplicate example:
(4,3,2,0,4,23)
(4,3,2,0,4,23)
I need to obtain a list of unique sequence with their associated count. (A sequence that is only present once would have a 1 count)
Is there an algorithm in standard C++ (C++11 is fine) that does that in one shot?
Windows, 4GB RAM, 30+GB hdd
There is no such algorithm in the standard library which does exactly this, however it's very easy with a single loop and by choosing the proper data structure.
For this you want to use std::unordered_map which is typically a hash map. It has expected constant time per access (insert and look-up) and thus the first choice for huge data sets.
The following access-and-increment trick will automatically insert a new entry in the counter map if it's not yet there; then it will increment and write back the count.
typedef std::vector<int> VectorType; // Please consider std::array<int,6>!

// Note: std::hash has no specialization for std::vector<int>, so a custom hash
// functor (here called VectorHash, sketched below) must be supplied.
std::unordered_map<VectorType, int, VectorHash> counters;

for (const VectorType& vec : vectors) {   // reference avoids copying each 6-int vector
    counters[vec]++;
}
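VectorHash above is not part of the standard library; a minimal hand-written version might look like this (a sketch using the usual hash-combining idiom, similar in spirit to boost::hash_combine):
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical VectorHash: combines the hashes of the elements so that
// std::vector<int> can be used as an unordered_map key.
struct VectorHash {
    std::size_t operator()(const std::vector<int>& v) const {
        std::size_t seed = v.size();
        for (int x : v) {
            seed ^= std::hash<int>()(x) + 0x9e3779b9u + (seed << 6) + (seed >> 2);
        }
        return seed;
    }
};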
For further processing, you most probably want to sort the entries by the number of occurrences. For this, either write them out into a vector of pairs (each encapsulating the number vector and its occurrence count), or into an (ordered) multimap with key and value swapped, so it's automatically ordered by the counter.
In order to reduce the memory footprint of this solution, try this:
If you don't need to get the keys back from this hash map, you can use a hash map which doesn't store the keys but only their hashes. For this, use size_t as the key type, an identity functor as the internal hash function (the standard library has no ready-made identity hash, so you have to write this trivial functor yourself), and access it with a manual call to the vector hash (VectorHash above).
std::unordered_map<std::size_t, int, IdentityHash> counters; // IdentityHash: hypothetical functor returning its size_t argument unchanged
VectorHash hashFunc;
for (const VectorType& vec : vectors) {
    counters[hashFunc(vec)]++;
}
This reduces memory but requires additional effort to interpret the results, as you have to loop over the original data structure a second time in order to find the original vectors (then look them up in your hash map by hashing them again).
Yes: first std::sort the list (std::vector uses lexicographic ordering, the first element is the most significant), then loop with std::adjacent_find to find duplicates. When a duplicate is found, use std::adjacent_find again but with an inverted comparator to find the first non-duplicate.
Alternately, you could use std::unique with a custom comparator that flags when a duplicate is found, and maintains a count through the successive calls. This also gives you a deduplicated list.
The advantage of these approaches over std::unordered_map is space complexity proportional to the number of duplicates. You don't have to copy the entire original dataset or add a seldom-used field for dup-count.
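A minimal sketch of the sort-then-count idea (using std::array<int,6> as the element type, as suggested above, which also avoids a heap allocation per sequence):
#include <algorithm>
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

using Seq = std::array<int, 6>;

// Sort, then count each run of equal sequences in a single pass.
std::vector<std::pair<Seq, std::size_t>> countUnique(std::vector<Seq> data) {
    std::sort(data.begin(), data.end());           // lexicographic order
    std::vector<std::pair<Seq, std::size_t>> result;
    for (auto it = data.begin(); it != data.end(); ) {
        auto next = std::upper_bound(it, data.end(), *it);   // end of this run
        result.emplace_back(*it, static_cast<std::size_t>(next - it));
        it = next;
    }
    return result;
}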
You could convert each vector element to a string, one by one, like "4,3,2,0,4,23".
Then add them into a new string vector, checking for existence with the find() function.
If you need the original vectors, convert the string vector back to integer-sequence vectors.
If you do not need them, you can simply drop duplicated elements while building the string vector.

C++ search function

This question refers to C++.
Say I have 10 million records of data, where each piece of data is a 6-digit number. Numbers will be input that need to be matched against this data.
It boils down to two questions:
What would be the best way to store this data? An array?
What would be the best way to search or match this data?
I'm looking for performance more than anything else, memory usage is not a problem. I was looking into hash functions but I'm not sure if that's what I should even be looking for.
For fast lookup, there are basically two options: std::map, which has O(log n) lookup, or std::unordered_map, which has expected O(1) lookup (but possibly worse).
If your key type is literally an integer (which by the sound of it is the case), you have perfect hashing for free, so an unordered map would be available with minimal additional cost, so I'd try that one.
But just make a typedef and try both and compare!
#include <map>
#include <unordered_map>
typedef unsigned int key_type; // fine, has < , ==, and std::hash
typedef std::map<key_type, some_value_type> my_map;
// typedef std::unordered_map<key_type, some_value_type> my_map;
my_map m; // populate
my_map::const_iterator it = m.find(<some random key>);
If you don't actually need to associate any data to the keys, i.e. if you don't need a value type, then replace "map" by "set" everywhere. If you need multiple records with the same key, replace "map" by "multimap" everywhere.
With only a 6 digit number to look up, you could keep an array of 1 million elements and do the lookup directly.
If you know right off the bat how many records you're going to have, you can preallocate an array to that size and then start storing the data. Otherwise, some other data structure such as a vector would be better.
For searching, use a binary search. It will significantly cut down on your search time.
Basically, here is what will happen (the data needs to be sorted, by the way):
You'll jump to the middle element of the data structure and see if your input is higher or lower. If it's higher, you go to the upper half of the structure and repeat this process recursively. If it's lower, you go to the lower half and do the same. You do this until you find your matching data.
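A minimal sketch of that lookup using the standard library's binary search (the contains() helper name is illustrative):
#include <algorithm>
#include <vector>

// `records` must already be sorted (e.g. with std::sort) before calling this.
bool contains(const std::vector<int>& records, int key) {
    auto it = std::lower_bound(records.begin(), records.end(), key);
    return it != records.end() && *it == key;
}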
Assuming memory is not an issue, why don't you store the data in an STL map or set? Lookups on those should be among the fastest options.