Ordered and unordered map - c++

I am trying to understand ordered and unordered maps.
I understand that an ordered map keeps its elements sorted and an unordered map does not (based on this site).
But why does this code print its elements in a different order than I inserted them?
std::unordered_map<std::string, int> map;
map.insert(std::make_pair("hello", 1));
map.insert(std::make_pair("world", 2));
map.insert(std::make_pair("call", 3));
map.insert(std::make_pair("ZZ", 4));
for(auto it = map.begin(); it != map.end();it++)
{
std::cout << it->first << std::endl;
}
result:
world
hello
call
ZZ
I don't get it. It is supposed to be hello, then world, then call, then ZZ. How did world come first if an unordered container is not supposed to sort its elements?

std::hash<std::string> produces hashes for your input keys. Then a transformation is applied to these hashes to get a valid index to a bucket in your map (the most basic one you can think of is a simple modulo). These operations take constant time and give you the location of your items. To retrieve the right item from a transformed hash, the item has to be stored at the matching place. That is why you observe an apparently random ordering of your inputs.
Now there is NO assumption you can make about the order of the computed indices. Suppose you add some more strings to your map: a rehash might be needed, and your input strings could then appear in yet another order.
This has nothing to do with the order of insertion, which has no reason to be kept. The purpose of the data structure is to place the items at the appropriate location so that their retrieval takes constant time on average (here by placing them at the indices given by the transformed hashes).

Probably the best answer to this question is: "study a little about data structures."
An unordered map stores its data in a structure designed to be efficient, without any restriction on storage order, and is therefore "unordered".
Note that keeping the order of insertion would itself be a restriction on the order.
Check out this data structure: hash table.

What they mean by "unordered" is that the elements are returned in an unspecified order (determined by the element hashes), not in sorted order and not in insertion order.

Related

C++ - Sort map based on values, if values same sort based on key

I came across a problem where I needed to store two values, an id and its influence, and the id should be randomly accessible. The data should also be sorted by influence and, if two influences are the same, by id. With these things in mind I used a map, but is there a way to actually do it?
I tried the comparator and map below, but it gives an error:
struct cmp
{
bool comparator()(const pair<int,int>a,const pair<int,int>b)
{
if(a.second==b.second) return a.first<b.first;
else return a.second<b.second;
}
};
unordered_map<int,int,cmp>m;
From what I understand, you want a collection sorted by one value but quickly indexable by another. These two points are in contradiction. Sorting a collection by a key value makes it quicker to index by that key value. There is no easy way to make a collection quickly indexable in two different ways at the same time. Note that I say "quickly indexable" instead of talking about random access. That's yet a different concept.
In short, you need two collections. You can have a main one that stores influence-ID pairs and is sorted by influences, and a secondary one that stores IDs and maps them to the main collection. There are many options for that; here is one:
std::set<std::pair<int,int>> influences;
std::unordered_map<int, decltype(influences)::iterator> ids;
When inserting an influence-ID pair, you can insert into influences first, then take the iterator to that new element and store it in ids under the ID.
Another solution is to keep a main collection of influence-ID pairs, but this time sorted by IDs (and binary search it when you need to get the influence from an ID), and maintain an array of permutation indices that tells you at what index an element of the main collection would be if it were sorted by influence. Something like this:
std::vector<std::pair<int,int>> influences;
// insert all elements sorted by ID
std::vector<decltype(influences)::size_type> sorted_indices;
// now sort by influences but modifying `sorted_indices` instead.
Relevant question
If the IDs are all from 0 to N, you can even just have influences indexed by ID:
std::vector<int> influences; // influences[id] is the influence value corresponding to id
This gives you random access.
The "correct" choice depends on the many other possible constraints you may have.

Accessing adjacent elements of a map in c++

Suppose I have a float-integer map m:
m[1.23] = 3
m[1.25] = 34
m[2.65] = 54
m[3.12] = 51
Imagine that I know that there's a mapping between 2.65 and 54, but I don't know about any other mappings.
Is there any way to visit the adjacent mappings without iterating from the beginning or searching using the find function?
In other words: can I directly access the adjacent values by just knowing about a single mapping...such as m[2.65]=54?
UPDATE Perhaps a more important "point" than my answer, brought up by @MattMcNabb:
Floating point keys in std::map
Can I directly access the adjacent values by just knowing about a single mapping (m[2.65]=54)
Yes. std::map is an ordered collection; which is to say that if an operator< exists (more generally, std::less) for the key type you can expect it to have sorted access. In fact--you won't be able to make a map for a key type if it doesn't have this comparison operator available (unless you pass in a predicate function to perform this comparison in the template invocation)
Note there is also a std::unordered_map which is often preferable for cases where you don't need this property of being able to navigate quickly between "adjacent" map entries. However you will need to have std::hash defined in that case. You can still iterate it, but adjacency of items in the iteration won't have anything to do with the sort order of the keys.
UPDATE also due to @MattMcNabb
Is there any way to visit the adjacent mappings without iterating from the beginning or searching using the find function?
You allude to array notation, and the general answer here would be "not really". Which is to say there is no way of saying:
if (not m[2.65][-2]) {
std::cout << "no element 2 steps prior to m[2.65]";
} else {
std::cout << "the element 2 before m[2.65] is " << *m[2.65][-2];
}
While no such notational means exist, the beauty (and perhaps the horror) of C++ is that you could write an augmentation of map that did that. Though people would come after you with torches and pitchforks. Or maybe they'd give you cult status and put your book on the best seller list. It's a fine line--but before you even try, count the letters and sequential consonants in your last name and make sure it's a large number.
What you need to access the ordering is an iterator. And find will get you one; and all the flexibility that it affords.
If you only use the array notation to read or write from a std::map, it's essentially a less-capable convenience layer built above iterators. So unless you build your own class derived from map, you're going to be stuck with the limits of that layer. The notation provides no way to get information about adjacent values...nor does it let you test for whether a key is in the map or not. (With find you can do this by comparing the result of a lookup to end(m) if m is your map.)
Technically speaking, find gives you the same effect as you could get by walking through the iterators front-to-back or back-to-front and comparing, as they are sorted. But that would be slower if you're seeking arbitrary elements. All the containers have a kind of algorithmic complexity guarantee that you can read up on.
When dereferencing an iterator, you will receive a pair whose first element is the key and second element is the value. The value will be mutable, but the key is constant. So you cannot find an element, then navigate to an adjacent element, and alter its key directly...just its value.

Efficient frequency counter

I have 15,000,000 std::vectors of 6 integers.
Those 15M vectors contain duplicates.
Duplicate example:
(4,3,2,0,4,23)
(4,3,2,0,4,23)
I need to obtain a list of unique sequences with their associated counts. (A sequence that is present only once would have a count of 1.)
Is there an algorithm in the standard C++ library (C++11 is fine) that does that in one shot?
Windows, 4GB RAM, 30+GB hdd
There is no such algorithm in the standard library which does exactly this, however it's very easy with a single loop and by choosing the proper data structure.
For this you want to use std::unordered_map which is typically a hash map. It has expected constant time per access (insert and look-up) and thus the first choice for huge data sets.
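The following access-and-increment trick will automatically insert a new entry in the counter map if it's not yet there; then it will increment and write back the count.
typedef std::vector<int> VectorType; // Please consider std::array<int,6>!
// The standard library has no std::hash for std::vector<int>, so the map needs
// a hash functor; VectorHash stands for one you supply yourself.
std::unordered_map<VectorType, int, VectorHash> counters;
for (const VectorType& vec : vectors) { // iterate by reference to avoid copying
    ++counters[vec];
}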
For further processing, you most probably want to sort the entries by the number of occurrence. For this, either write them out in a vector of pairs (which encapsulates the number vector and the occurrence count), or in an (ordered) map which has key and value swapped, so it's automatically ordered by the counter.
In order to reduce the memory footprint of this solution, try this:
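If you don't need to get the keys back from this hash map, you can use a hash map which doesn't store the keys but only their hashes. For this, use std::size_t for the key type, a pass-through (identity) functor as the map's internal hash function, and compute each key with a manual call to your vector hash function. Note that two different vectors that happen to share a hash value would then also share one counter.
// IdentityHash: user-supplied functor returning its std::size_t argument unchanged.
// VectorHash: user-supplied hash functor for VectorType.
// (The standard library provides neither for this purpose.)
std::unordered_map<std::size_t, int, IdentityHash> counters;
VectorHash hashFunc;
for (const VectorType& vec : vectors) {
    ++counters[hashFunc(vec)];
}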
This reduces memory but requires additional effort to interpret the results, as you have to loop over the original data structure a second time in order to find the original vectors (then look them up in your hash map by hashing them again).
Yes: first std::sort the list (std::vector uses lexicographic ordering, the first element is the most significant), then loop with std::adjacent_find to find duplicates. When a duplicate is found, use std::adjacent_find again but with an inverted comparator to find the first non-duplicate.
Alternately, you could use std::unique with a custom comparator that flags when a duplicate is found, and maintains a count through the successive calls. This also gives you a deduplicated list.
The advantage of these approaches over std::unordered_map is space complexity proportional to the number of duplicates. You don't have to copy the entire original dataset or add a seldom-used field for dup-count.
You could convert each vector to a string, element by element, like this: "4,3,2,0,4,23".
Then add the strings to a new string vector, checking for their existence with the find() function.
If you need the original vectors, convert the string vector back to a vector of integer sequences.
If you do not need the duplicated elements, drop them while building the string vector.

What is the difference between std::set and std::vector?

I am learning the STL now. I read about the set container. My question is: when would you want to use a set? After reading its description, it looks useless because we can substitute a vector for it. Could you give the pros and cons of vector vs. set containers? Thanks
A set is ordered. It is guaranteed to remain in a specific ordering, according to a functor that you provide. No matter what elements you add or remove (unless you add a duplicate, which is not allowed in a set), it will always be ordered.
A vector has exactly and only the ordering you explicitly give it. Items in a vector are where you put them. If you put them in out of order, then they're out of order; you now need to sort the container to put them back in order.
Admittedly, set has relatively limited use. With proper discipline, one could insert items into a vector and keep it ordered. However, if you are constantly inserting and removing items from the container, vector will run into many issues. It will be doing a lot of copying/moving of elements and so forth, since it is effectively just an array.
The time it takes to insert an item into the middle of a vector is proportional to the number of items already in the vector. The time it takes to insert an item into a set is proportional to the log₂ of the number of items. If the number of items is large, that's a huge difference. log₂(100,000) is ~17; that's a major speed improvement. The same goes for removal.
However, if you do all of your insertions at once, at initialization time, then there's no problem. You can insert everything into the vector, sort it (paying that price once), and then use standard algorithms for sorted vectors to find elements and iterate over the sorted list. And while iteration over the elements of a set isn't exactly slow, iterating over a vector is faster.
So there are cases where a sorted vector beats a set. That being said, you really shouldn't bother with the expense of this kind of optimization unless you know that it is necessary. So use a set unless you have experience with the kind of system you're writing (and thus know that you need that performance) or have profiling data in hand that tells you that you need a vector and not a set.
They are different things: you decide how vectors are ordered, and you can also put as many equal things into a vector as you please. Sets are ordered in accordance to that set's internal rules (you may set the rules, but the set will deal with the ordering), and you cannot put multiple equal items into a set.
Of course you could maintain a vector of unique items, but your performance would suffer a lot when you do set-oriented operations. For example, assume that you have a set of 10000 items and a vector of 10000 distinct unordered items. Now suppose that you need to check if a value X is among the values in the set (or among the values in the vector). When X is not among the items, searching the vector would be some 100 times slower. You would see similar performance differences on calculating set unions and intersections.
To summarize, sets and vectors have different purposes. You can use a vector instead of a set, but it would require more work, and would likely hurt the performance rather severely.
From cplusplus.com:
set:
Sets are containers that store unique elements following a specific order.
So a set is ordered AND its items are unique,
while vector:
Vectors are sequence containers representing arrays that can change in size.
So a vector keeps the order in which you fill it AND can hold multiple identical items.
Prefer set:
if you wish to filter out multiple identical values
if you wish to traverse items in a specified order (doing this with a vector requires sorting it explicitly).
Prefer vector:
if you want to keep identical values
if you wish to traverse items in the same order as you pushed them (assuming you don't reorder the vector)
The simple difference is that set can contain only unique values, and it is sorted. So you can use it for the cases where you need to continuously sort the values after every insert / delete.
set<int> a;
vector<int> b;
for (int i = 0; i < 10; ++i)
{
int val = rand() % 10;
a.insert(val);
b.push_back(val);
}
cout << "--SET---\n"; for (auto i : a) cout << i << ","; cout << endl;
cout << "--VEC---\n"; for (auto j : b) cout << j << ","; cout << endl;
The output is
--SET---
0,1,2,4,7,8,9,
--VEC---
1,7,4,0,9,4,8,8,2,4,
It is faster to search for an item in a set than in an unsorted vector (O(log n) vs. O(n)). To search a vector, you may need to look at every item, whereas a set is typically implemented as a red-black tree, so only a few items are examined to find a match.
The set is ordered: you can only iterate over it from the smallest element to the largest, or in reverse.
The vector, by contrast, is not sorted: you traverse it in insertion order.

How to associate to a number another number without using array

Let's say we have read these values:
3
1241
124515
5322353
341
43262267234
1241
1241
3213131
And I have an array like this (with the elements above):
a[0]=1241
a[1]=124515
a[2]=43262267234
a[3]=3
...
The thing is that the elements' order in the array is not constant (I have to change it somewhere else in my program).
How can I know at which positions an element appears in the document that was read?
Note that I can not do:
vector <int> a[1000000000000];
a[number].push_back(all_positions);
Because a would be too large (there's a memory restriction). (Let's say I have only 3000 elements, but their values range from 0 to 2^32.)
So, in the example above, I would want to know all the positions 1241 is appearing on without iterating again through all the read elements.
In other words, how can I associate to the number "1241" the positions "1,6,7" so I can simply access them in O(1) (where 1 actually is the number of positions the element appears)
If there's no O(1) I want to know what's the optimal one ...
I don't know if I've made myself clear. If not, just say it and I'll update my question :)
You need to use some sort of dynamic array, like a vector (std::vector) or other similar containers (std::list, maybe, it depends on your needs).
Such data structures are safer and easier to use than C-style array, since they take care of memory management.
If you also need to look for an element in O(1) you should consider using some structures that will associate both an index to an item and an item to an index. I don't think STL provides any, but boost should have something like that.
If O(log n) is a cost you can afford, also consider std::map
You can use what is commonly referred to as a multimap. That is, it stores a key and multiple values. Lookup time is O(log n).
If you're working with Visual Studio, it provides its own hash_multimap; otherwise, may I suggest using boost::unordered_map with a list as your value?
You don't need a sparse array of 1000000000000 elements; use an std::map to map positions to values.
If you want bi-directional lookup (that is, you sometimes want "what are the indexes for this value?" and sometimes "what value is at this index?") then you can use a boost::bimap.
Things get further complicated as you have values appearing more than once. You can sacrifice the bi-directional lookup and use a std::multimap.
You could use a map for that. Like:
std::map<int, std::vector<int>> MyMap;
So every time you encounter a value while reading the file, you append its position to the map. Say X is the value you read and Y is the position; then you just do
MyMap[X].push_back( Y );
Instead of your array, use
std::map<int, vector<int> > a;
You need an associative collection, but you might want to associate a key with multiple values.
You can use std::multimap< int, int >
or
you can use std::map< int, std::set< int > >
I have found in practice the latter is easier for removing items if you just need to remove one element. It is unique on key-value combinations but not on key or value alone.
If you need higher performance, you may wish to use a hash_map instead of a map. For the inner collection, though, you will not gain much from hashing, as you will have very few duplicates, and it is better to use std::set.
There are many implementations of hash_map, and it is in the new standard. If you don't have the new standard, go for boost.
It seems you need a std::map<int,int>. You can store the mapping such as 1241->0 124515->1 etc. Then perform a look up on this map to get the array index.
Besides the std::map solution offered by others here (O(log n)), there's the approach of a hash map (implemented as boost::unordered_map or std::unordered_map in C++0x, supported by modern compilers).
It would give you O(1) lookup on average, which often is faster than a tree-based std::map. Try for yourself.
You can use a std::multimap to store both a key (e.g. 1241) and multiple values (e.g. 1, 6 and 7).
An insert has logarithmic complexity, but you can speed it up if you give the insert method a hint where it can insert the item.
For O(1) lookup you could hash the number to find its entry (key) in a hash map (boost::unordered_map, dictionary, stdex::hash_map etc)
The value could be a vector of indices where the number occurs or a 3000 bit array (375 bytes) where the bit number for each respective index where the number (key) occurs is set.
boost::unordered_map<unsigned long, std::vector<unsigned long>> myMap;
for(unsigned long i = 0; i < sizeof(a)/sizeof(*a); ++i)
{
myMap[a[i]].push_back(i);
}
Instead of storing an array of integers, you could store an array of structures, each containing the integer value and all of its positions in an array or vector.