Is this usage of unordered map efficient/right way? - c++

I want to learn about mapping functions in c/c++ in general so this is a basic program on unordered mapping. I use unordered mapping because my input data are not sorted and I read that unordered_map is very efficient. Here I've an array with which I'm creating the hash table and use the lookup function to find if the elements in another array are in the hash table or not. I've several questions regarding this implementation:
#include <stdio.h>
#include <unordered_map>
using namespace std;
typedef std::unordered_map<int,int> Mymap;
int main()
{
int x,z,l=0;
int samplearray[5] = {0,6,4,3,8};
int testarray[10] = {6,3,8,67,78,54,64,74,22,77};
Mymap c1;
for ( x=0;x< sizeof(samplearray)/sizeof(int);x++)
c1.insert(Mymap::value_type(samplearray[x], x));
for ( z=0;z< sizeof(testarray)/sizeof(int);z++)
if((c1.find(testarray[z]) != c1.end()) == true)
l++;
printf("The number of elements equal are : %d\n",l);
printf("the size of samplearray and testarray are : %zu\t%zu\n",sizeof(samplearray)/sizeof(int),sizeof(testarray)/sizeof(int));
}
First of all, is this the right way to implement it? I'm getting the right answers, but it seems that I use too many for loops.
This seems fairly okay with very small data, but what if I'm dealing with files larger than 500MB? It seems that if I create a hash table for a 500MB file, the hash table itself will be twice as large, i.e. 1000MB. Is this always the case?
What is the difference between std::unordered_map and boost::unordered_map?
Finally, a small request. I'm new to C/C++ so if you are giving suggestions like using some other typedef/libraries, I'd highly appreciate if you could use a small example or implement it on my code. Thanks

You're starting off on the wrong foot. A map (ordered or otherwise) is intended to store a key along with some associated data. In your case, you're only storing a number (twice, as both the key and the data). For this situation, you want a set (again, ordered or otherwise) instead of a map.
I'd also avoid at least the first for loop, and use std::copy instead:
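// There are better ways to do this, but it'll work for now:
#define end(array) ((array) + (sizeof(array)/sizeof((array)[0])))
std::set<int> myset;
// std::inserter takes the container and an insertion position:
std::copy(samplearray,
end(samplearray),
std::inserter(myset, myset.end()));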
If you only need to count how many items are common between the two sets, your for loop is fairly reasonable. If you need/want to actually know what items are common between them, you might want to consider using std::set_intersection:
// needs <algorithm>, <iostream>, <iterator> and <set>
std::set<int> myset, test_set, common;
std::copy(samplearray, end(samplearray), std::inserter(myset, myset.end()));
std::copy(testarray, end(testarray), std::inserter(test_set, test_set.end()));
std::set_intersection(myset.begin(), myset.end(),
test_set.begin(), test_set.end(),
std::inserter(common, common.end()));
// show the common elements (including a count):
std::cout << common.size() << " common elements:\t";
std::copy(common.begin(), common.end(),
std::ostream_iterator<int>(std::cout, "\t"));
Note that you don't need to have an actual set to use set_intersection -- all you need is a sorted collection of items, so if you preferred to you could just sort your two arrays, then use set_intersection on them directly. Likewise, the result could go in some other collection (e.g., a vector) if you prefer.
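The sorted-collection variant described above could look roughly like this. It is only a sketch: the function name common_elements is made up for illustration, and it takes std::vector copies of the data rather than the raw arrays from the question.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical helper: returns the values present in both inputs.
// The parameters are taken by value so the caller's data stays unsorted.
std::vector<int> common_elements(std::vector<int> a, std::vector<int> b)
{
    // set_intersection requires both ranges to be sorted.
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    std::vector<int> common;
    std::set_intersection(a.begin(), a.end(),
                          b.begin(), b.end(),
                          std::back_inserter(common));
    return common;
}
```

Calling common_elements with the contents of samplearray and testarray should give a vector whose size() is the number of matches.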

As mentioned by Jerry, you could use a for loop for the search if you only need to know the number of matches. If that is the case, I would recommend using an unordered_set since you don't need the elements to be sorted.
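A minimal sketch of that counting approach with std::unordered_set, assuming the data has been put into std::vector (the name count_matches is invented for this example):

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical helper: counts how many elements of `test` also occur in `sample`.
std::size_t count_matches(const std::vector<int>& sample, const std::vector<int>& test)
{
    // Build the hash set once from the sample data.
    std::unordered_set<int> seen(sample.begin(), sample.end());

    std::size_t matches = 0;
    for (int value : test) {
        // count() is 1 if the value is in the set, 0 otherwise.
        if (seen.count(value) != 0) {
            ++matches;
        }
    }
    return matches;
}
```

Only the keys are stored here, so there is no unused mapped value as there is in the unordered_map version from the question.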

Related

vector iterating over itself

In my project I have a vector with some relational data (a struct that holds two similar objects which represent a relationship between them) and I need to check for relationship combinations between all data in the vector.
What I am doing is iterating over the vector and inside the first for loop I am iterating again to look for relationships between data.
This is a simplified model of what I am doing
for(a=0; a<vec.size(); a++)
{
for(b=0; b<vec.size(); b++)
{
if(vec[a].something==vec[b].something) {...}
}
}
My collection has 2800 elements which means that I will be iterating 2800*2800 times...
What kind of data structure is more suitable for this kind of operation?
Would using for_each be any faster than traversing the vector like this?
Thanks in advance!
vec has two structs which are made up of two integers and nothing is ordered.
No, for_each still does the same thing.
Using a hash map could make your problem better. Start with an empty hash and iterate through the list. For each element, see if it's in the hash. If it's not, add it. If it is, then you have a duplicate and you run your code.
In C++, you can use std::unordered_map (a hash table) or std::map (a balanced tree with O(log n) lookup). In C, there is no built-in map data structure, so you'd have to make your own.
The high-level pseudo code would look something like this
foreach (element in array)
if map.has_key(element)
do_stuff(element)
else
map.add_key(element)
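In C++ the pseudocode above might be sketched as follows. The Item struct and the find_repeats name are assumptions made for illustration, and since only the key matters here, a std::unordered_set stands in for the map.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical element type; only `something` is used as the key.
struct Item {
    int something;
    int other;
};

// Returns the indices of the items whose `something` value already
// appeared earlier in the vector.
std::vector<std::size_t> find_repeats(const std::vector<Item>& vec)
{
    std::unordered_set<int> seen;
    std::vector<std::size_t> repeats;
    for (std::size_t i = 0; i < vec.size(); ++i) {
        // insert() returns a pair; .second is false if the key was already present.
        if (!seen.insert(vec[i].something).second) {
            repeats.push_back(i);
        }
    }
    return repeats;
}
```

This makes a single pass over the vector with expected constant-time set operations, instead of comparing every pair.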
The easiest way to improve the efficiency of this operation would be to sort the vector and then look for duplicates. If sorting the vector isn't an option, you could create another vector of pointers to the elements of this vector and sort that. Both of those will take you from an N**2 complexity to an N*log(N) complexity (assuming, of course, that you use an N*log(N) sort). This does mean using more space, but often using a bit of space for significant time improvements is very reasonable.
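A rough sketch of the sort-the-pointers variant, assuming a hypothetical Record struct with an integer key named something; the original vector is left untouched and equal keys end up next to each other:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical element type for illustration.
struct Record {
    int something;
    int other;
};

// Counts adjacent pairs with equal keys after sorting pointers to the elements.
std::size_t count_duplicate_keys(const std::vector<Record>& vec)
{
    // Collect pointers so the original order of `vec` is preserved.
    std::vector<const Record*> order;
    order.reserve(vec.size());
    for (const Record& r : vec) {
        order.push_back(&r);
    }

    // O(N log N) sort by key.
    std::sort(order.begin(), order.end(),
              [](const Record* lhs, const Record* rhs) {
                  return lhs->something < rhs->something;
              });

    // One linear pass: equal keys are now neighbours.
    std::size_t duplicates = 0;
    for (std::size_t i = 1; i < order.size(); ++i) {
        if (order[i]->something == order[i - 1]->something) {
            ++duplicates;
        }
    }
    return duplicates;
}
```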
Assuming your vector contains a "relation" structure like:
class Entity;
struct Relation {
Entity* something;
Entity* relative;
};
and you have a vector of "relations":
std::vector<Relation> ties;
So if I understood it correctly, you want to segment ties and have a list of Relations for each Entity. This may be represented by a map:
std::map<Entity*,std::vector<Relation*>> entityTiesIndex;
Then you could just scan once through all ties and collect the relations for each entity:
for (std::size_t i=0; i < ties.size(); ++i ) {
Relation* relation = &ties[i];
entityTiesIndex[relation->something].push_back(relation);
}
Mind the usual caveat about pointers to container elements: they are invalidated if the vector reallocates (for example, when elements are added), so build the index only once ties is fully populated.
Hope this makes sense.

How to make a fast search for an object with a particular value in a vector of structs or classes? c++

If I have thousands of struct or class objects in a vector, how to find those that are needed, in a fast way?
For example:
I'm making a game, and I need the fastest way of doing collision detection. Each tile is a struct, and there are many tiles in the vector map, each with the values x and y.
So basically I do:
For(i=0;i<end of vector list;i++)
{
//searching if x= 100 and y =200
}
So maybe there is a different way, like smart pointers or something, to search for particular objects faster?
You should sort your vector and then use the standard library algorithms like binary_search, lower_bound, or upper_bound.
This gives you O(log n) lookups, which is better than the O(n) you get by walking through the entire vector or by using the standard library algorithm find.
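A sketch of that idea for the tiles from the question. The Tile struct and the helper names are assumptions, as is the choice to order tiles by x first and then by y.

```cpp
#include <algorithm>
#include <tuple>
#include <vector>

// Hypothetical tile type with the two coordinates from the question.
struct Tile {
    int x;
    int y;
};

// Orders tiles by x, then by y.
bool tile_less(const Tile& a, const Tile& b)
{
    return std::tie(a.x, a.y) < std::tie(b.x, b.y);
}

// Sort once, after the vector has been filled.
void sort_tiles(std::vector<Tile>& tiles)
{
    std::sort(tiles.begin(), tiles.end(), tile_less);
}

// Requires `tiles` to be sorted with tile_less.
// Returns a pointer to the matching tile, or nullptr if there is none.
const Tile* find_tile(const std::vector<Tile>& tiles, int x, int y)
{
    const Tile key{x, y};
    auto it = std::lower_bound(tiles.begin(), tiles.end(), key, tile_less);
    if (it != tiles.end() && it->x == x && it->y == y) {
        return &*it;
    }
    return nullptr;
}
```

The sort has to be repeated (or the new tile inserted at its lower_bound position) whenever tiles are added, otherwise the search is no longer valid.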
I think you need to look deeper than a simple search for a value in a group of structs, even more so if you plan on searching among a large number of them.
How are the structs generated, how are they collected, and how do you keep track of them? Is there a common key that you can use to order them as you create them?
You should focus on keeping them sorted as you add them to the structure; that way you avoid a massive burst of computation every time you have to perform a search. Choose a good data structure (for example, an AVL tree); that way you get O(log(n)) insertion, deletion, and search.
A vector is just an unordered collection of objects. There is not really any way to do what you are asking unless you start sorting your vector in specific ways (e.g. if it is sorted you can jump to the middle of the vector and potentially split your search time in half)
You may be better off picking a different data structure (either instead of the vector or in combination with it)
For example:
for_each(v.begin(),v.end(), [](int e)
{
if (e%2==1)//vector elements that are not divided by 2 without remainder
cout<<e<<endl;
});

STL Map with a Vector for the Key

I'm working with some binary data that I have stored in arbitrarily long arrays of unsigned ints. I've found that I have some duplication of data, and am looking to ignore duplicates in the short term and remove whatever bug is causing them in the long term.
I'm looking at inserting each dataset into a map before storing it, but only if it was not found in the map to start with. My initial thought was to have a map of strings and use memcpy as a hammer to force the ints into a character array, and then copy that into a string and store the string. This failed because a good deal of my data contains multiple bytes of 0 (aka NUL) at the front of the relevant data, so a majority of very real data got thrown out.
My next attempt is planned to be std::map<std::vector<unsigned char>,int>, but I'm realizing that I don't know if the map insert function will work.
Is this doable, even if ill advised, or is there a better way to approach this problem?
Edit
So it's been remarked that I didn't make clear what I'm doing, so here's a hopefully better description.
I'm working on generating a minimum spanning tree, given that I have a number of trees containing the actual end nodes I'm working with. The goal is to come up with the selection of trees that has the shortest length and that covers all of the end nodes, where the chosen trees share at most one node with each other and are all connected. I'm basing my approach off of a binary decision tree, but making a few changes to hopefully allow for greater parallelism.
Rather than taking the binary tree approach I've opted to make a bit vector out of unsigned integers for each dataset, where a 1 in a bit position indicates the inclusion of the corresponding tree.
For example if just tree 0 were included in a 5 tree dataset I would start with
00001
From here I can generate:
00011
00101
01001
10001
Each of these can then be processed in parallel, since none of them depend on each other. I do this for all of the single trees (00010, 00100, etc..) and should, I haven't taken the time to formally prove it, be able to generate all values in the range (0,2^n) once and only once.
I started to notice that many datasets were taking far longer to complete than I thought they should, and enabled a debugging output to look at all of the generated results, and a quick Perl script later it was confirmed that I had multiple processes generating the same output. Since then I've been trying to resolve where the duplicates are coming from with very little success, and I'm hoping that this will work well enough to let me verify the results that are being generated without the, sometimes, 3 day wait on computations.
You will not have problems with that, as std::vector provides you the "==", "<" and ">" operators:
http://en.cppreference.com/w/cpp/container/vector/operator_cmp
The requirements for being a key in std::map are satisfied by std::vector, so yes you can do that. Sounds like a good temporary solution (easy to code, minimum of hassle) -- but you know what they say: "there is nothing more permanent than the temporary".
That should work, as Renan Greinert points out, vector<> meets the requirements to be used as a map key.
You also say:
I'm looking at inserting each dataset into a map before storing it,
but only if it was not found in the map to start with.
That's usually not what you want to do, as that would involve doing a find() on the map, and if not found, then doing an insert() operation. Those two operations would essentially have to do a find twice. It is better just to try and insert the items into the map. If the key is already there, the operation will fail by definition. So your code would look like this:
#include <vector>
#include <map>
#include <utility>
// typedefs help a lot to shorten the verbose C++ code
typedef std::map<std::vector<unsigned char>, int> MyMapType;
MyMapType myMap;
std::vector<unsigned char> v = ...; // initialize this somehow
std::pair<MyMapType::iterator, bool> result = myMap.insert(std::make_pair(v, 42));
if (result.second)
{
// the insertion worked and result.first points to the newly
// inserted pair
}
else
{
// the insertion failed and result.first points to the pair that
// was already in the map
}
Why do you need a std::map for that? Maybe I'm missing something, but what about using a std::vector together with the find algorithm as explained here?
This means, that you append your unsigned ints to the vector and later search for it, e.g.
std::vector<unsigned int> collector; // vector that is substituting your std::map
for(unsigned int i=0; i<myInts.size(); ++i) { // myInts are the long ints you have
if(std::find(collector.begin(), collector.end(), myInts.at(i)) == collector.end()) { // needs <algorithm>
collector.push_back(myInts.at(i));
}
}

C++ search function

This question refers to C++.
Say I have 10 million records of data, each of which is a 6-digit number, and numbers will be input that need to be matched against this data.
It boils down to two questions:
What would be the best way to store this data? An array?
What would be the best way to search or match this data?
I'm looking for performance more than anything else, memory usage is not a problem. I was looking into hash functions but I'm not sure if that's what I should even be looking for.
For fast lookup, there are basically two options: std::map, which has O(log n) lookup, or std::unordered_map, which has expected O(1) lookup (but possibly worse).
If your key type is literally an integer (which by the sound of it is the case), you have perfect hashing for free, so an unordered map would be available with minimal additional cost, so I'd try that one.
But just make a typedef and try both and compare!
#include <map>
#include <unordered_map>
typedef unsigned int key_type; // fine, has < , ==, and std::hash
typedef std::map<key_type, some_value_type> my_map;
// typedef std::unordered_map<key_type, some_value_type> my_map;
my_map m; // populate
my_map::const_iterator it = m.find(<some random key>);
If you don't actually need to associate any data to the keys, i.e. if you don't need a value type, then replace "map" by "set" everywhere. If you need multiple records with the same key, replace "map" by "multimap" everywhere.
With only a 6 digit number to look up, you could keep an array of 1 million elements and do the lookup directly.
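A minimal sketch of such a direct lookup table, assuming only membership matters (no data attached to each number). The class name SixDigitSet is made up for this example.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical direct-address table: one flag per possible 6-digit value.
class SixDigitSet {
public:
    // 1,000,000 slots cover every value from 000000 to 999999.
    SixDigitSet() : present_(1000000, false) {}

    // Marks a number as present; values outside the range are ignored.
    void add(std::uint32_t n)
    {
        if (n < present_.size()) {
            present_[n] = true;
        }
    }

    // Lookup is a single index operation.
    bool contains(std::uint32_t n) const
    {
        return n < present_.size() && present_[n];
    }

private:
    std::vector<bool> present_;
};
```

If each record carries extra data, the same layout could hold a value (or a list of record indices) per slot instead of a flag.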
If you know right off the bat how many records you're going to have, you can preallocate an array to that size and then start storing the data. Otherwise, some other data structure such as a vector would be better.
For searching, use a binary search. It will significantly cut down on your search time.
Here is what happens (the data needs to be sorted, by the way):
You'll jump to the middle element of the data structure and see if your input is higher or lower. If it's higher, you go to the upper half of the structure and repeat this process recursively. If it's lower, you go to the lower half and do the same. You do this until you find your matching data.
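The halving procedure just described can be written as a loop. This is a sketch with an invented function name; the standard library's std::binary_search in <algorithm> does the same job on a sorted range.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: `data` must be sorted in ascending order.
bool contains_sorted(const std::vector<int>& data, int wanted)
{
    std::size_t lo = 0;
    std::size_t hi = data.size();   // the search range is [lo, hi)
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (data[mid] == wanted) {
            return true;            // found the matching record
        }
        if (data[mid] < wanted) {
            lo = mid + 1;           // continue in the upper half
        } else {
            hi = mid;               // continue in the lower half
        }
    }
    return false;                   // range is empty: not present
}
```

Each step halves the remaining range, so 10 million records need at most about 24 comparisons.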
Assuming memory is not an issue, why don't you store the data in an STL map or set? Lookups in those containers are fast.

How to associate to a number another number without using array

Let's say we have read these values:
3
1241
124515
5322353
341
43262267234
1241
1241
3213131
And I have an array like this (with the elements above):
a[0]=1241
a[1]=124515
a[2]=43262267234
a[3]=3
...
The thing is that the elements' order in the array is not constant (I have to change it somewhere else in my program).
How can I know on which position does one element appear in the read document.
Note that I can not do:
vector <int> a[1000000000000];
a[number].push_back(all_positions);
Because a will be too large (there's a memory restriction). (let's say I have only 3000 elements, but their values are from 0 to 2^32)
So, in the example above, I would want to know all the positions 1241 is appearing on without iterating again through all the read elements.
In other words, how can I associate to the number "1241" the positions "1,6,7" so I can simply access them in O(1) (where 1 actually is the number of positions the element appears)
If there's no O(1) I want to know what's the optimal one ...
I don't know if I've made myself clear. If not, just say it and I'll update my question :)
You need to use some sort of dynamic array, like a vector (std::vector) or other similar containers (std::list, maybe, it depends on your needs).
Such data structures are safer and easier to use than C-style array, since they take care of memory management.
If you also need to look for an element in O(1) you should consider using some structures that will associate both an index to an item and an item to an index. I don't think STL provides any, but boost should have something like that.
If O(log n) is a cost you can afford, also consider std::map
You can use what is commonly referred to as a multimap. That is, it stores a key and multiple values. This has O(log n) lookup time.
If you're working with Visual Studio, it provides its own hash_multimap; otherwise, may I suggest using boost::unordered_map with a list as your value?
You don't need a sparse array of 1000000000000 elements; use an std::map to map positions to values.
If you want bi-directional lookup (that is, you sometimes want "what are the indexes for this value?" and sometimes "what value is at this index?") then you can use a boost::bimap.
Things get further complicated as you have values appearing more than once. You can sacrifice the bi-directional lookup and use a std::multimap.
You could use a map for that. Like:
std::map<int, std::vector<int>> MyMap;
So every time you encounter a value while reading the file, you append its position to the map. Say X is the value you read and Y is the position then you just do
MyMap[X].push_back( Y );
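Put together, that could look like the sketch below. The names PositionIndex and index_positions are invented, positions are counted from 0, and long long is used for the values because some of the sample numbers do not fit in a 32-bit int.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Maps each value to the list of positions at which it was read.
using PositionIndex = std::map<long long, std::vector<std::size_t>>;

// Hypothetical helper: builds the index in one pass over the input.
PositionIndex index_positions(const std::vector<long long>& values)
{
    PositionIndex index;
    for (std::size_t pos = 0; pos < values.size(); ++pos) {
        // operator[] creates an empty vector the first time a value is seen.
        index[values[pos]].push_back(pos);
    }
    return index;
}
```

For the sample input from the question, looking up 1241 yields the positions 1, 6 and 7. Swapping std::map for std::unordered_map keeps the code the same and gives O(1) average lookup instead of O(log n).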
Instead of your array, use
std::map<int, vector<int> > a;
You need an associative collection, but one that can associate each key with multiple values.
You can use std::multimap< int, int >
or
you can use std::map< int, std::set< int > >
I have found in practice the latter is easier for removing items if you just need to remove one element. It is unique on key-value combinations but not on key or value alone.
If you need higher performance then you may wish to use a hash_map instead of map. For the inner collection though you will not gain much performance by using a hash, as you will have very few duplicates, and it is better to use std::set.
There are many implementations of hash_map, and it is in the new standard. If you don't have the new standard, go for boost.
It seems you need a std::map<int,int>. You can store the mapping such as 1241->0 124515->1 etc. Then perform a look up on this map to get the array index.
Besides the std::map solution offered by others here (O(log n)), there's the approach of a hash map (implemented as boost::unordered_map or std::unordered_map in C++0x, supported by modern compilers).
It would give you O(1) lookup on average, which often is faster than a tree-based std::map. Try for yourself.
You can use a std::multimap to store both a key (e.g. 1241) and multiple values (e.g. 1, 6 and 7).
An insert has logarithmic complexity, but you can speed it up if you give the insert method a hint where it can insert the item.
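A sketch of the multimap variant, with invented helper names and 0-based positions. It does not use insertion hints; it only shows how the positions for one key are read back with equal_range.

```cpp
#include <cstddef>
#include <map>
#include <vector>

using PositionMultimap = std::multimap<long long, std::size_t>;

// Hypothetical helper: stores one (value, position) entry per element read.
PositionMultimap build_index(const std::vector<long long>& values)
{
    PositionMultimap index;
    for (std::size_t pos = 0; pos < values.size(); ++pos) {
        index.emplace(values[pos], pos);
    }
    return index;
}

// Collects every position stored under `value`.
std::vector<std::size_t> positions_of(const PositionMultimap& index, long long value)
{
    std::vector<std::size_t> result;
    // equal_range returns the [first, second) range of entries with this key.
    auto range = index.equal_range(value);
    for (auto it = range.first; it != range.second; ++it) {
        result.push_back(it->second);
    }
    return result;
}
```

Since C++11, entries with equal keys keep their insertion order, so the positions come back in the order they were read.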
For O(1) lookup you could hash the number to find its entry (key) in a hash map (boost::unordered_map, dictionary, stdext::hash_map etc)
The value could be a vector of indices where the number occurs or a 3000 bit array (375 bytes) where the bit number for each respective index where the number (key) occurs is set.
boost::unordered_map<unsigned long, std::vector<unsigned long>> myMap;
for(unsigned long i = 0; i < sizeof(a)/sizeof(*a); ++i)
{
myMap[a[i]].push_back(i);
}
Instead of storing an array of integer, you could store an array of structure containing the integer value and all its positions in an array or vector.