STL Map with a Vector for the Key - c++

I'm working with some binary data that I have stored in arbitrarily long arrays of unsigned ints. I've found that I have some duplication of data, and am looking to ignore duplicates in the short term and remove whatever bug is causing them in the long term.
I'm looking at inserting each dataset into a map before storing it, but only if it was not found in the map to start with. My initial thought was to have a map of strings and use memcpy as a hammer to force the ints into a character array, and then copy that into a string and store the string. This failed because a good deal of my data contains multiple bytes of 0 (aka NULL) at the front of the relevant data, so a majority of very real data got thrown out.
My next attempt is planned to be std::map<std::vector<unsigned char>,int>, but I'm realizing that I don't know if the map insert function will work.
Is this doable, even if ill advised, or is there a better way to approach this problem?
Edit
So it's been remarked that I didn't make clear what I'm doing, so here's a hopefully better description.
I'm working on generating a minimum spanning tree, given that I have a number of trees containing the actual end nodes I'm working with. The goal is to come up with the selection of trees that has the shortest length and that covers all of the end nodes, where the chosen trees share at most one node with each other and are all connected. I'm basing my approach off of a binary decision tree, but making a few changes to hopefully allow for greater parallelism.
Rather than taking the binary tree approach I've opted to make a bit vector out of unsigned integers for each dataset, where a 1 in a bit position indicates the inclusion of the corresponding tree.
For example if just tree 0 were included in a 5 tree dataset I would start with
00001
From here I can generate:
00011
00101
01001
10001
Each of these can then be processed in parallel, since none of them depend on each other. I do this for all of the single trees (00010, 00100, etc..) and should, I haven't taken the time to formally prove it, be able to generate all values in the range (0,2^n) once and only once.
I started to notice that many datasets were taking far longer to complete than I thought they should, and enabled a debugging output to look at all of the generated results, and a quick Perl script later it was confirmed that I had multiple processes generating the same output. Since then I've been trying to resolve where the duplicates are coming from with very little success, and I'm hoping that this will work well enough to let me verify the results that are being generated without the, sometimes, 3 day wait on computations.

You will not have problems with that, as std::vector provides you the "==", "<" and ">" operators:
http://en.cppreference.com/w/cpp/container/vector/operator_cmp

The requirements for being a key in std::map are satisfied by std::vector, so yes you can do that. Sounds like a good temporary solution (easy to code, minimum of hassle) -- but you know what they say: "there is nothing more permanent than the temporary".

That should work, as Renan Greinert points out, vector<> meets the requirements to be used as a map key.
You also say:
I'm looking at inserting each dataset into a map before storing it,
but only if it was not found in the map to start with.
That's usually not what you want to do, as that would involve doing a find() on the map, and if not found, then doing an insert() operation. Those two operations would essentially have to do a find twice. It is better just to try and insert the items into the map. If the key is already there, the operation will fail by definition. So your code would look like this:
#include <vector>
#include <map>
#include <utility>
// typedefs help a lot to shorten the verbose C++ code
typedef std::map<std::vector<unsigned char>, int> MyMapType;
std::vector<unsigned char> v = ...; // initialize this somehow
std::pair<MyMapType::iterator, bool> result = myMap.insert(std::make_pair(v, 42));
if (result.second)
{
// the insertion worked and result.first points to the newly
// inserted pair
}
else
{
// the insertion failed and result.first points to the pair that
// was already in the map
}

Why do you need a std::map for that? Maybe I miss some point but what about using an std::vector together with the find algorithm as examplained here?
This means, that you append your unsigned ints to the vector and later search for it, e.g.
std::vector<unsigned int> collector; // vector that is substituting your std::map
for(unsigned int i=0; i<myInts.size(); ++i) { // myInts are the long ints you have
if(find(collector.begin(), collector.end(), myInts.at(i)==collector.end()) {
collector.push_back(myInts.at(i));
}
}

Related

C++ unordered_map or unordered_set : What to use if I wish to keep an "isVisited" data structure

I want to keep a data structure for storing all the elements that I have seen till now. Considering that keeping an array for this is out of question as elements can be of the order of 10^9, what data structure should I use for achieving this : unordered_map or unordered_set in C++ ?
Maximum elements that will be visited in worst case : 10^5
-10^9 <= element <= 10^9
As #MikeCAT said in the comments, a map would only make sense if you wanted to store additional information about the element or the visitation. But if you wanted only to store the truth value of whether the element has been visited or not, the map would look something like this:
// if your elements were strings
std::unordered_map<std::string, bool> isVisited;
and then this would just be a waste of space. Storing the truth value is redundant, if the mere presence of the string within the map already indicates that it has been visited. Let's see a comparison:
std::unordered_map<std::string, bool> isVisitedMap;
std::unordered_set<std::string> isVisitedSet;
// Visit some places
isVisitedMap["madrid"] = true;
isVisitedMap["london"] = true;
isVisitedSet.insert("madrid");
isVisitedSet.insert("london");
// Maybe the information expires so you want to remove them
isVisitedMap["london"] = false;
isVisitedSet.erase("london");
Now the elements stored in each structure will be:
For the map:
{{"london", false}, {"madrid", true}} <--- 4 elements
{"madrid"} <--- 1 element. Much better
In a project in which I had a binary tree converted to a binary DAG for optimization purposes (GRAPHGEN) I passed the exploration function a map from node pointers to bool:
std::map<BinaryDrag<conact>::node*, bool> &visited_fl
The map kept track of the pointers in order not to go through the same nodes again when doing multiple passes.
You could use a std::unordered_map<Value, bool>.
I want to keep a data structure for storing all the elements that I have seen till now.
A way to re-phrase that is to say "I want a data structure to store the set of all elements that I've seen till now". The clue is in the name. Without more information, std::unordered_set seems like a reasonable choice to represent a set.
That said, in practice it depends on details like what you're planning to do with this set. Array can be a good choice as well (yes, even for billions of elements), other set implementations may be better and maps can be useful in some use cases.

Unordered map vs vector

I'm building a little 2d game engine. Now I need to store the prototypes of the game objects (all type of informations). A container that will have at most I guess few thousand elements all with unique key and no elements will be deleted or added after a first load. The key value is a string.
Various threads will run, and I need to send to everyone a key(or index) and with that access other information(like a texture for the render process or sound for the mixer process) available only to those threads.
Normally I use vectors because they are way faster to accessing a known element. But I see that unordered map also usually have a constant speed if I use the ::at element access. It would make the code much cleaner and also easier to maintain because I will deal with much more understandable man made strings.
So the question is, the difference in speed between a access to a vector[n] compared to a unorderedmap.at("string") is negligible compared to his benefits?
From what I understand accessing various maps in different part of the program, with different threads running just with a "name" for me is a big deal and the speed difference isn't that great. But I'm too inexperienced to be sure of this. Although I found informations about it seem I can't really understand if I'm right or wrong.
Thank you for your time.
As an alternative, you could consider using an ordered vector because the vector itself will not be modified. You can easily write an implementation yourself with STL lower_bound etc, or use an implementation from libraries ( boost::flat_map).
There is a blog post from Scott Meyers about container performance in this case. He did some benchmarks and the conclusion would be that an unordered_mapis probably a very good choice with high chances that it will be the fastest option. If you have a restricted set of keys, you can also compute a minimal optimal hash function, e.g. with gperf
However, for these kind of problems the first rule is to measure yourself.
My problem was to find a record on a container by a given std::string type as Key access. Considering Keys that only EXISTS(not finding them was not a option) and the elements of this container are generated only at the beginning of the program and never touched thereafter.
I had huge fears unordered map was not fast enough. So I tested it, and I want to share the results hoping I've not mistaken everything.
I just hope that can help others like me and to get some feedback because in the end I'm beginner.
So, given a struct of record filled randomly like this:
struct The_Mess
{
std::string A_string;
long double A_ldouble;
char C[10];
int* intPointer;
std::vector<unsigned int> A_vector;
std::string Another_String;
}
I made a undordered map, give that A_string contain the key of the record:
std::unordered_map<std::string, The_Mess> The_UnOrdMap;
and a vector I sort by the A_string value(which contain the key):
std::vector<The_Mess> The_Vector;
with also a index vector sorted, and used to access as 3thrd way:
std::vector<std::string> index;
The key will be a random string of 0-20 characters in lenght(I wanted the worst possible scenario) containing letter both capital and normal and numbers or spaces.
So, in short our contendents are:
Unordered map I measure the time the program get to execute:
record = The_UnOrdMap.at( key ); record is just a The_Mess struct.
Sorted Vector measured statements:
low = std::lower_bound (The_Vector.begin(), The_Vector.end(), key, compare);
record = *low;
Sorted Index vector:
low2 = std::lower_bound( index.begin(), index.end(), key);
indice = low2 - index.begin();
record = The_Vector[indice];
The time is in nanoseconds and is a arithmetic average of 200 iterations. I have a vector that I shuffle at every iteration containing all the keys, and at every iteration I cycle through it and look for the key I have here in the three ways.
So this are my results:
I think the initials spikes are a fault of my testing logic(the table I iterate contains only the keys generated so far, so it only has 1-n elements). So 200 iterations of 1 key search for the first time. 200 iterations of 2 keys search the second time etc...
Anyway, it seem that in the end the best option is the unordered map, considering that is a lot less code, it's easier to implement and will make the whole program way easier to read and probably maintain/modify.
You have to think about caching as well. In case of std::vector you'll have very good cache performance when accessing the elements - when accessing one element in RAM, CPU will cache nearby memory values and this will include nearby portions of your std::vector.
When you use std::map (or std::unordered_map) this is no longer true. Maps are usually implemented as self balancing binary-search trees, and in this case values can be scattered around the RAM. This imposes great hit on cache performance, especially as maps get bigger and bigger as CPU just cannot cache the memory that you're about to access.
You'll have to run some tests and measure performance, but cache misses can greatly hurt the performance of your program.
You are most likely to get the same performance (the difference will not be measurable).
Contrary to what some people seem to believe, unordered_map is not a binary tree. The underlying data structure is a vector. As a result, cache locality does not matter here - it is the same as for vector. Granted, you are going to suffer if you have collissions due to your hashing function being bad. But if your key is a simple integer, this is not going to happen. As a result, access to to element in hash map will be exactly the same as access to the element in the vector with time spent on getting hash value for integer, which is really non-measurable.

C++ search function

This question refers to C++.
Say I have 10 million records of data, each piece of data is a 6 digit number, which I will have numbers being inputted that need to be matched to this data.
It boils down to two questions:
What would be the best way to store this data? An array?
What would be the best way to search or match this data?
I'm looking for performance more than anything else, memory usage is not a problem. I was looking into hash functions but I'm not sure if that's what I should even be looking for.
For fast lookup, there are basically two options: std::map, which has O(log n) lookup, or std::unordered_map, which has expected O(1) lookup (but possibly worse).
If your key type is literally an integer (which by the sound of it is the case), you have perfect hashing for free, so an unordered map would be available with minimal additional cost, so I'd try that one.
But just make a typedef and try both and compare!
#include <map>
#include <unordered_map>
typedef unsigned int key_type; // fine, has < , ==, and std::hash
typedef std::map<key_type, some_value_type> my_map;
// typedef std::unordered_map<key_type, some_value_type> my_map;
my_map m; // populate
my_map::const_iterator it = m.find(<some random key>);
If you don't actually need to associate any data to the keys, i.e. if you don't need a value type, then replace "map" by "set" everywhere. If you need multiple records with the same key, replace "map" by "multimap" everywhere.
With only a 6 digit number to look up, you could keep an array of 1 million elements and do the lookup directly.
If you know right off the bat how many records you're going to have, you can preallocate an array to that size and then start storing the data. Otherwise, some other data structure such as a vector would be better.
For searching, use a binary search. It will significantly cut down on your search time.
Basically, what will happen...(the data needs to be sorted btw)..
You'll jump to the middle element of the data structure and see if your input is higher or lower. If it's higher, you go to the upper half of the structure and repeat this process recursively. If it's lower, you go to the lower half and do the same. You do this until you find your matching data.
Assuming memory is not an issue, why don't you store data into map or set in STL? Search must be one of the fastest.

Is this usage of unordered map efficient/right way?

I want to learn about mapping functions in c/c++ in general so this is a basic program on unordered mapping. I use unordered mapping because my input data are not sorted and I read that unordered_map is very efficient. Here I've an array with which I'm creating the hash table and use the lookup function to find if the elements in another array are in the hash table or not. I've several questions regarding this implementation:
#include <stdio.h>
#include <unordered_map>
using namespace std;
typedef std::unordered_map<int,int> Mymap;
int main()
{
int x,z,l=0;
int samplearray[5] = {0,6,4,3,8};
int testarray[10] = {6,3,8,67,78,54,64,74,22,77};
Mymap c1;
for ( x=0;x< sizeof(samplearray)/sizeof(int);x++)
c1.insert(Mymap::value_type(samplearray[x], x));
for ( z=0;z< sizeof(testarray)/sizeof(int);z++)
if((c1.find(testarray[z]) != c1.end()) == true)
l++;
printf("The number of elements equal are : %d\n",l);
printf("the size of samplearray and testarray are : %d\t%d\n",sizeof(samplearray)/sizeof(int),sizeof(testarray)/sizeof(int));
}
First of all, is this a right way to
implement it? I'm getting the
answers right but seems that I use
too much of for loop.
This seems fairly okay with very small data but if I'm dealing with files of size > 500MB then this seems that, if I create a hash table for a 500MB file then the size of the hash table itself will be twice as much which is 1000MB. Is this always the case?
What is the difference between std::unordered map and boost::unordered map?
Finally, a small request. I'm new to C/C++ so if you are giving suggestions like using some other typedef/libraries, I'd highly appreciate if you could use a small example or implement it on my code. Thanks
You're starting off on the wrong foot. A map (ordered or otherwise) is intended to store a key along with some associated data. In your case, you're only storing a number (twice, as both the key and the data). For this situation, you want a set (again, ordered or otherwise) instead of a map.
I'd also avoid at least the first for loop, and use std::copy instead:
// There are better ways to do this, but it'll work for now:
#define end(array) ((array) + (sizeof(array)/sizeof(array[0]))
std::copy(samplearray,
end(samplearray),
std::inserter(Myset));
If you only need to count how many items are common between the two sets, your for loop is fairly reasonable. If you need/want to actually know what items are common between them, you might want to consider using std::set_intersection:
std::set<int> myset, test_set, common;
std::copy(samplearray, end(samplearray), std::inserter(myset));
std::copy(testarray, end(testarray), std::inserter(test_set));
std::set_intersection(myset.begin(), myset.end(),
test_set.begin(), test_set.end(),
std::inserter(common));
// show the common elements (including a count):
std::cout <<common.size() << " common elements:\t";
std::copy(common.begin(), common.end(),
std::ostream_iterator<int>(std::cout, "\t");
Note that you don't need to have an actual set to use set_intersection -- all you need is a sorted collection of items, so if you preferred to you could just sort your two arrays, then use set_intersection on them directly. Likewise, the result could go in some other collection (e.g., a vector) if you prefer.
As mentioned by Jerry, you could use a for loop for the search if you only need to know the number of matches. If that is the case, I would recommend using an unordered_set since you don't need the elements to be sorted.

Searching a C++ Vector<custom_class> for the first/last occurence of a value

I'm trying to work out the best method to search a vector of type "Tracklet" (a class I have built myself) to find the first and last occurrence of a given value for one of its variables. For example, I have the following classes (simplified for this example):
class Tracklet {
TimePoint *start;
TimePoint *end;
int angle;
public:
Tracklet(CvPoint*, CvPoint*, int, int);
}
class TimePoint {
int x, y, t;
public:
TimePoint(int, int, int);
TimePoint(CvPoint*, int);
// Relevant getters and setters exist here
};
I have a vector "vector<Tracklet> tracklets" and I need to search for any tracklets with a given value of "t" for the end timepoint. The vector is ordered in terms of end time (i.e. tracklet.end->t).
I'm happy to code up a search algorithm, but am unsure of which route to take with it. I'm not sure binary search would be suitable, as I seem to remember it won't necessarily find the first. I was thinking of a method where I use binary search to find an index of an element with the correct time, then iterate back to find the first and forward to find the last. I'm sure there's a better way than that, since it wastes binary searches O(log n) by iterating.
Hopefully that makes sense: I struggled to explain it a bit!
Cheers!
If the vector is sorted and contains the value, std::lower_bound will give you an iterator to the first element with a given value and std::upper_bound will give you an iterator to one element past the last one containing the value. Compare the value with the returned element to see if it existed in the vector. Both these functions use binary search, so time is O(logN).
To compare on tracklet.end->t, use:
bool compareTracklets(const Tracklet &tr1, const Tracklet &tr2) {
return (tr1.end->t < tr2.end->t);
}
and pass compareTracklets as the fourth argument to lower_bound or upper_bound
I'd just use find and find_end, and then do something more complicated only if testing showed it to be too slow.
If you're really concerned about lookup performance, you might consider a different data structure, like a map with timestamp as the key and a vector or list of elements as the value.
A binary search seems like your best option here, as long as your vector remains sorted. It's essentially identical, performance-wise, to performing a lookup in a binary tree-structure.
dirkgently referred to a sweet optimization comparative. But I would in fact not use a std::vector for this.
Usually, when deciding to use a STL container, I don't really consider the performance aspect, but I do consider its interface regarding the type of operation I wish to use.
std::set<T>::find
std::set<T>::lower_bound
std::set<T>::upper_bound
std::set<T>::equal_range
Really, if you want an ordered sequence, outside of a key/value setup, std::set is just easier to use than any other.
You don't have to worry about inserting at a 'bad' position
You don't have problems of iterators invalidation when adding / removing an element
You have built-in methods for searching
Of course, you also want your Comparison Predicate to really shine (hopes the compiler inlines the operator() implementation), in every case.
But really, if you are not convinced, try a build with a std::vector and manual insertion / searching (using the <algorithm> header) and try another build using std::set.
Compare the size of the implementations (number of lines of code), compare the number of bugs, compare the speed, and then decide.
Most often, the 'optimization' you aim for is actually a pessimization, and in those rares times it's not, it's just so complicated that it's not worth it.
Optimization:
Don't
Expert only: Don't, we mean it
The vector is ordered in terms of time
The start time or the end time?
What is wrong with a naive O(n) search? Remember you are only searching and not sorting. You could use a sorted container as well (if that doesn't go against the basic design).