How can I create a histogram of a vector<vector<long>>? - C++

I have allocated a vector<vector<long>>. What is the right way to create a histogram, or to use std::find across all the inner vectors, without relocating the data?
thanks

By histogram I understand a map value -> occurrences, which with your data means a map<long, int>; I do not understand how std::find kicks in. That said, I would go for something like this:
// assuming vect is an existing vector<vector<long>>
std::map<long, int> histogram;
for (const auto &v1 : vect)
    for (auto value : v1)
    {
        auto it = histogram.find(value);
        if (it == histogram.end())
            histogram[value] = 1;   // first occurrence of this value
        else
            ++it->second;           // already seen, bump the count
    }

Based on the comments, what you need is a way to gather all the values in the vectors you are collecting at runtime and keep track of how many of each there are. Luckily there are several standard algorithms and containers that can handle this task efficiently.
std::unordered_map<long, unsigned int> histogram;
std::for_each(data.begin(), data.end(), [&histogram](const std::vector<long>& inner_vec)
{
    for (long val : inner_vec)
    {
        ++histogram[val]; // operator[] value-initializes the count to 0 on first access
    }
});
You did mention that the number of vectors you're dealing with could be large. This problem could quite easily lend itself to a multithreaded solution.
An initial attempt might be to give each thread its own std::unordered_map<long, unsigned int> to track the counts for a subset of the larger dataset, then merge the results back into a larger histogram once all the threads finish, as in the sketch below.
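A minimal sketch of that idea, assuming the outer container is named data as above (the strided partitioning and the merge step are illustrative choices, not part of the original answer):
#include <algorithm>
#include <thread>
#include <unordered_map>
#include <vector>

using Histogram = std::unordered_map<long, unsigned int>;

Histogram parallel_histogram(const std::vector<std::vector<long>>& data)
{
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<Histogram> partial(nthreads);   // one histogram per thread, no sharing
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            // each thread counts a strided subset of the outer vectors
            for (std::size_t i = t; i < data.size(); i += nthreads)
                for (long val : data[i])
                    ++partial[t][val];
        });
    }
    for (auto& w : workers)
        w.join();

    // merge the per-thread histograms into one
    Histogram result;
    for (const auto& h : partial)
        for (const auto& kv : h)
            result[kv.first] += kv.second;
    return result;
}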
Here's a live demo demonstrating each of the methods described above. Be sure to compile all of these with optimizations enabled to get a true sense of how fast they can be, and measure to see whether the multithreaded solution even offers a benefit.

Related

How to group a vector pair by the second value efficiently?

I am trying to group a vector<pair<int,int>> by the second value of each pair. For example, if the vector is v0 : (0,1),(1,1),(3,2),(4,2),(5,1), I want to get two outputs. The first is the unique set of second elements, which is
vector<int> v2={1,2};
The second is groups of the first elements, which could be
vector<vector<int>>v1;
v1[0]={0,1,5};
v1[1]={3,4};
How can I achieve this efficiently? Do I need to sort v0 by the second element before grouping? Would std::map be faster? I am concerned not only with the method but also with the speed, because my v0 is a very long, unsorted list of triangle-mesh vertex indices. Any suggestion would be appreciated.
Update: I found one solution similar to the one linked. It works on unsorted input; I have no idea about its speed.
map<int, vector<int>> vpmap;
for (auto it = v0.begin(); it != v0.end(); ++it) {
    vpmap[it->second].push_back(it->first);
}
Here the map's keys correspond to v2, and its mapped values correspond to v1.
What you have is a reasonably performant way of getting the exact data structures you're looking for. Be sure to pre-allocate the vectors, since you know the sizes, and use move iterators to avoid unnecessary copying:
std::vector<int> v2;
std::vector<std::vector<int>> v1;

v2.reserve(vpmap.size());
std::transform(vpmap.begin(), vpmap.end(), std::back_inserter(v2),
               [](const auto& p) { return p.first; });

v1.reserve(vpmap.size());
std::transform(std::make_move_iterator(vpmap.begin()), std::make_move_iterator(vpmap.end()),
               std::back_inserter(v1),
               [](auto p) { return std::move(p.second); });
If you can loosen your constraints, do think about big-picture optimizations like "do I need to transform all this data?"
But once you have something reasonable, stop worrying about the fastest techniques or containers and start measuring with a profiler. Sometimes the things you worry about turn out to be non-issues, and there are non-obvious costs that stem from your problem domain, your input data, and the accumulation of code.

Performance issues in joining threads

I wrote the following parallel code for examining all elements in a vector of vectors. I store only those elements from the vector<vector<int>> which satisfy a given condition. However, some of the vectors within the vector<vector<int>> are pretty large while others are pretty small, and as a result my code takes a long time in thread.join(). Can someone please suggest how I can improve the performance of my code?
#include <algorithm>
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

bool some_check_condition(int value); // defined elsewhere

std::atomic<size_t> numPassed{0};

void check_if_condition(std::vector<int>& a, std::vector<int>& satisfyingElements)
{
    for (std::vector<int>::iterator i1 = a.begin(), l1 = a.end(); i1 != l1; ++i1)
        if (some_check_condition(*i1))
            satisfyingElements.push_back(*i1);
}

void doWork(std::vector<std::vector<int>>& myVec, std::vector<std::vector<int>>& results,
            size_t current, size_t end)
{
    end = std::min(end, myVec.size());
    for (; current < end; ++current) {
        std::vector<int> satisfyingElements;
        check_if_condition(myVec[current], satisfyingElements);
        if (!satisfyingElements.empty()) {
            numPassed += satisfyingElements.size();
            results[current] = std::move(satisfyingElements);
        }
    }
}

int main()
{
    std::vector<std::vector<int>> myVec(1000000);
    std::vector<std::vector<int>> results(myVec.size());
    unsigned numParallelThreads = std::thread::hardware_concurrency();

    std::vector<std::thread> parallelThreads;
    auto blockSize = myVec.size() / numParallelThreads;
    for (size_t i = 0; i < numParallelThreads - 1; ++i) {
        parallelThreads.emplace_back(doWork, std::ref(myVec), std::ref(results),
                                     i * blockSize, (i + 1) * blockSize);
    }
    // also do work in this thread
    doWork(myVec, results, (numParallelThreads - 1) * blockSize, myVec.size());

    for (auto& thread : parallelThreads)
        thread.join();

    std::vector<int> storage;
    storage.reserve(numPassed.load());
    for (auto itRes = results.begin(), endRes = results.end(); itRes != endRes; ++itRes) {
        if (!itRes->empty())
            storage.insert(storage.end(), itRes->begin(), itRes->end()); // append at the end
    }
    std::cout << "Done" << std::endl;
}
It would be nice if you could give some sense of the scale of those 'large' inner vectors, just to see how bad the problem is.
I think, however, that your problem is this:
for (auto& thread : parallelThreads)
    thread.join();
This loop goes through the threads sequentially and waits for each one to finish before looking at the next. For a thread pool, you want to wait until every thread is done. This can be done by having each thread notify a condition_variable when it finishes, and waiting on that, roughly as sketched below.
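A minimal sketch of that notification pattern (numThreads, remaining, and the surrounding scaffolding are illustrative names, not from the original code):
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

int main()
{
    const unsigned numThreads = 4;        // illustrative
    std::mutex m;
    std::condition_variable cv;
    unsigned remaining = numThreads;      // workers still running

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < numThreads; ++i) {
        workers.emplace_back([&] {
            // ... do the actual work here ...
            std::lock_guard<std::mutex> lock(m);
            --remaining;
            cv.notify_one();              // wake the waiter to re-check the predicate
        });
    }

    // wait for all workers at once instead of joining them one by one
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return remaining == 0; });
    }

    for (auto& w : workers)
        w.join();                         // still join to reclaim thread resources
    return 0;
}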
Looking at your implementation, the bigger issue here is that your worker threads are not balanced in their consumption.
To get a more balanced load across all of your threads, you need to flatten your data structure so that the different worker threads can process similarly sized chunks of data. I am not sure where your data is coming from, but having a vector of vectors in an application dealing with large datasets doesn't sound like a great idea. Either process the existing vector of vectors into a single one, or read the data in that way if possible. If you need the row number for your processing, keep a vector of start-end ranges from which you can recover the row number.
Once you have a single big vector, you can break it into equal-sized chunks to feed the worker threads. Second, you don't want to build vectors on the stack and push them into another vector, because chances are you are running into memory-allocation contention while your threads are working. Allocating memory is a global state change and as such requires some level of locking (though with proper address partitioning it could be avoided). As a rule of thumb, whenever you are looking for performance, remove dynamic allocation from performance-critical parts.
In this case, your threads would perhaps rather 'mark' elements as satisfying the condition than build vectors of the satisfying elements. Once that's done, you can iterate through only the good ones without pushing and copying anything. Such a solution would be less wasteful.
In fact, if I were you, I would first try to solve this issue on a single thread, following the suggestions above. If you get rid of the vector-of-vectors structure and iterate through elements conditionally (this might be as simple as using one of the ..._if algorithms the C++11 standard library provides, as in the sketch below), you could end up with decent enough performance. Only at that point is it worth looking at delegating chunks of the work to worker threads; as your code stands, there is very little justification for using worker threads just to filter elements. Do as little writing and moving as you can, and you gain a lot of performance. Parallelization only works well in certain circumstances.
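A minimal single-threaded sketch of that approach, assuming the some_check_condition predicate from the question (flatten once, then filter in one pass with std::copy_if):
#include <algorithm>
#include <iterator>
#include <vector>

bool some_check_condition(int value); // the predicate from the question, defined elsewhere

std::vector<int> filterAll(const std::vector<std::vector<int>>& myVec)
{
    // flatten the vector-of-vectors once...
    std::vector<int> flat;
    for (const auto& inner : myVec)
        flat.insert(flat.end(), inner.begin(), inner.end());

    // ...then filter in a single contiguous pass
    std::vector<int> storage;
    std::copy_if(flat.begin(), flat.end(), std::back_inserter(storage),
                 some_check_condition);
    return storage;
}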

Why is inserting into a set<vector<string>> so slow?

For a class project we are making a simple compiler / relational database. Mine produces the correct answers, but too slowly on large queries. I ran Visual Studio's performance analysis and my program is spending 80% of its time inserting my tuples (rows in a table) into a set. The function is part of computing a cross product, so the result has lots and lots of rows, but I need suggestions on a faster way to insert my tuples into the set.
for (set<vector<string>>::iterator it = tuples.begin(); it != tuples.end(); ++it)
{
    for (set<vector<string>>::iterator it2 = tuples2.begin(); it2 != tuples2.end(); ++it2)
    {
        vector<string> f(*it);
        f.insert(f.end(), it2->begin(), it2->end());
        newTuples.insert(f); // This is the line that takes all the processing time
    }
}
You are copying a big vector by value for no reason. You should move it instead: newTuples.insert(std::move(f));
A set might be the wrong container. A set is ordered and keeps only unique elements, so many string comparisons may happen when you insert a new vector.
Use a list or a vector instead (if you can), and avoid needless copying, as SergeyA already pointed out in his answer. A sketch of the vector alternative follows.
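A hedged sketch of that vector-based alternative, assuming tuples and tuples2 as in the question (the sort-and-unique step at the end is only needed if you still require uniqueness):
#include <algorithm>
#include <set>
#include <string>
#include <vector>

using Tuple = std::vector<std::string>;

std::vector<Tuple> crossProduct(const std::set<Tuple>& tuples, const std::set<Tuple>& tuples2)
{
    std::vector<Tuple> newTuples;
    newTuples.reserve(tuples.size() * tuples2.size()); // one slot per result row

    for (const auto& t1 : tuples)
        for (const auto& t2 : tuples2) {
            Tuple f;
            f.reserve(t1.size() + t2.size());
            f.insert(f.end(), t1.begin(), t1.end());
            f.insert(f.end(), t2.begin(), t2.end());
            newTuples.push_back(std::move(f)); // amortized O(1), no tree rebalancing
        }

    // only if uniqueness is still required:
    std::sort(newTuples.begin(), newTuples.end());
    newTuples.erase(std::unique(newTuples.begin(), newTuples.end()), newTuples.end());
    return newTuples;
}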
We might as well go C++11 (totally untested code). Note that the elements of a set are const, so the concatenated vector has to be built before it is inserted; it cannot be appended to in place.
for (const auto& t1 : tuples)
{
    for (const auto& t2 : tuples2)
    {
        vector<string> f;
        f.reserve(t1.size() + t2.size()); // size the buffer once, up front
        f.insert(f.end(), t1.begin(), t1.end());
        f.insert(f.end(), t2.begin(), t2.end());
        auto where = newTuples.insert(std::move(f)); // returns {iterator, inserted?}
    }
}
A note on collisions: since a set keeps only unique keys, some tuples disappear from the result; is that really what you want?
You are using the whole vector as the key; will there ever be a collision? Add
if (!where.second) {
    ; // collision: an equal tuple was already in the set
}
to check.
This should remove the duplicated copying work (if the compiler doesn't optimize it away anyway).

C++ std::map creation taking too long?

UPDATED:
I am working on a program whose performance is very critical. I have a vector of structs that are NOT sorted. I need to perform many search operations in this vector. So I decided to cache the vector data into a map like this:
std::map<long, int> myMap;
for (int i = 0; i < myVector.size(); ++i)
{
    const Type& theType = myVector[i];
    myMap[theType.key] = i;
}
When I search the map, the rest of the program runs much faster. However, the remaining bottleneck is the creation of the map itself (it takes about 0.8 milliseconds on average to insert about 1,500 elements). I need to figure out a way to trim this time down. I am simply inserting a long as the key and an int as the value. I don't understand why it is taking this long.
Another idea I had was to create a copy of the vector (I can't touch the original one) and somehow perform a faster sort than std::sort (which takes way too long).
Edit:
Sorry everyone. I meant to say that I am creating a std::map where the key is a long and the value is an int. The long value is the struct's key value and the int is the index of the corresponding element in the vector.
Also, I did some more debugging and realized that the vector is not sorted at all. It's completely random. So doing something like a stable_sort isn't going to work out.
ANOTHER UPDATE:
Thanks everyone for the responses. I ended up creating a vector of pairs (std::vector of std::pair(long, int)). Then I sorted the vector by the long value. I created a custom comparator that only looked at the first part of the pair. Then I used lower_bound to search for the pair. Here's how I did it all:
typedef std::pair<long, int> Key2VectorIndexPairT;
typedef std::vector<Key2VectorIndexPairT> Key2VectorIndexPairVectorT;

bool Key2VectorIndexPairComparator(const Key2VectorIndexPairT& pair1, const Key2VectorIndexPairT& pair2)
{
    return pair1.first < pair2.first;
}
...
Key2VectorIndexPairVectorT sortedVector;
sortedVector.reserve(originalVector.size());

// Assume the "original" vector contains unsorted elements.
for (int i = 0; i < originalVector.size(); ++i)
{
    const TheStruct& theStruct = originalVector[i];
    sortedVector.push_back(Key2VectorIndexPairT(theStruct.key, i));
}

std::sort(sortedVector.begin(), sortedVector.end(), Key2VectorIndexPairComparator);
...
const long keyToSearchFor = 20;
const Key2VectorIndexPairVectorT::const_iterator cItorKey2VectorIndexPairVector =
    std::lower_bound(sortedVector.begin(), sortedVector.end(),
                     Key2VectorIndexPairT(keyToSearchFor, 0 /* dummy index value for the search */),
                     Key2VectorIndexPairComparator);

if (cItorKey2VectorIndexPairVector != sortedVector.end() &&
    cItorKey2VectorIndexPairVector->first == keyToSearchFor)
{
    const int vectorIndex = cItorKey2VectorIndexPairVector->second;
    const TheStruct& theStruct = originalVector[vectorIndex];
    // Now do whatever you want...
}
else
{
    // Could not find the element...
}
This yielded a modest performance gain for me. Before, the total time for my calculations was 3.75 milliseconds; now it is down to 2.5 milliseconds.
Both std::map and std::set are built on a binary tree, so adding items does dynamic memory allocation. If your map is largely static (i.e. initialized once at the start and then rarely or never has new items added or removed), you'd probably be better off using a sorted vector and std::lower_bound to look up items with a binary search.
Maps take a lot of time for two reasons:
- You need to do a lot of memory allocation for your data storage
- You need to perform O(n lg n) comparisons for the sort
If you are just creating this map as one batch and then throwing the whole thing out, using a custom pool allocator may be a good idea here, e.g. Boost's pool_alloc. Custom allocators can also apply optimizations such as not actually deallocating any memory until the map is completely destroyed, etc.
Since your keys are integers, you may want to consider writing your own container based on a radix tree (on the bits of the key) as well. This may give you significantly improved performance, but since there is no STL implementation, you may need to write your own.
If you don't need to sort the data, use a hash table, such as std::unordered_map; these avoid the significant overhead needed for sorting data, and also can reduce the amount of memory allocation needed.
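A minimal sketch of that idea, assuming a struct with a long key member as in the question (the pre-sizing reserve() call is the part that cuts down on allocation):
#include <cstddef>
#include <unordered_map>
#include <vector>

struct Type { long key; /* ... */ }; // assumed shape, per the question

std::unordered_map<long, int> buildIndex(const std::vector<Type>& myVector)
{
    std::unordered_map<long, int> myMap;
    myMap.reserve(myVector.size()); // pre-allocate buckets, avoids rehashing mid-insert
    for (std::size_t i = 0; i < myVector.size(); ++i)
        myMap.emplace(myVector[i].key, static_cast<int>(i));
    return myMap;
}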
Finally, depending on the overall design of the program, it may be helpful to simply reuse the same map instead of recreating it over and over. Just delete and add keys as needed, rather than building a new vector, then building a new map. Again, this may not be possible in the context of your program, but if it is, it would definitely help you.
I suspect it's the memory management and tree rebalancing that's costing you here.
Obviously profiling may be able to help you pinpoint the issue.
I would suggest as a general idea to just copy the long/int data you need into another vector and since you said it's almost sorted, use stable_sort on it to finish the ordering. Then use lower_bound to locate the items in the sorted vector.
std::find is a linear scan (it has to be, since it works on unsorted data). If you can sort the data (std::sort guarantees O(n log n) behavior), then you can use std::binary_search to get O(log n) searches. But as others have pointed out, it may be that copy time is the problem.
If keys are solid and short, perhaps try std::hash_map instead. From MSDN's page on hash_map Class:
The main advantage of hashing over sorting is greater efficiency; a successful hashing performs insertions, deletions, and finds in constant average time as compared with a time proportional to the logarithm of the number of elements in the container for sorting techniques.
Map creation can be a performance bottleneck (in the sense that it takes a measurable amount of time) if you're creating a large map and you're copying large chunks of data into it. You're also using the obvious (but suboptimal) way of inserting elements into a std::map - if you use something like:
myMap.insert(std::make_pair(theType.key, theType));
this should improve the insertion speed, but it will result in a slight change in behaviour if you encounter duplicate keys: with insert, the first value for a duplicate key wins and later ones are dropped, whereas with your method (operator[]), the last element with the duplicate key ends up in the map.
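A tiny illustration of that difference (the key and values are arbitrary):
#include <map>
#include <utility>

int main()
{
    std::map<long, int> m;
    m[42] = 1;                        // operator[]: inserts, then assigns
    m[42] = 2;                        // overwrites: m[42] == 2
    m.insert(std::make_pair(42L, 3)); // no-op for an existing key: m[42] is still 2
}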
I would also look into avoiding making a copy of the data (for example by storing a pointer to it instead) if your profiling determines that the copying of the element is what's expensive. But for that you'll have to profile the code; in my experience, guesstimates tend to be wrong...
Also, as a side note, you might want to look into storing the data in a std::set with a custom comparator, as your type contains the key already. That however will not really result in a big speedup, as constructing a set in this case is likely to be as expensive as inserting into a map.
I'm not a C++ expert, but it seems that your problem stems from copying the Type instances, instead of a reference/pointer to the Type instances.
std::map<Type> myMap; // <-- this is wrong, since std::map requires two template parameters, not one
If you add elements to the map and they're not pointers, then I believe the copy constructor is invoked and that will certainly cause delays with a large data structure. Use the pointer instead:
std::map<KeyType, ObjectType*> myMap;
Furthermore, your example is a little confusing since you "insert" a value of type int in the map when you're expecting a value of type Type. I think you should assign the reference to the item, not the index.
myMap[theType.key] = &myVector[i];
Update:
The more I look at your example, the more confused I get. If you're using the std::map, then it should take two template types:
map<T1,T2> aMap;
So what are you REALLY mapping? map<Type, int> or something else?
It seems that you're using the Type.key member field as a key to the map (it's a valid idea), but unless key is of the same type as Type, then you can't use it as the key to the map. So is key an instance of Type??
Furthermore, you're mapping the current vector index to the key in the map, which indicates that you just want the index into the vector so you can later access that location quickly. Is that what you want to do?
Update 2.0:
After reading your answer it seems that you're using std::map<long,int> and in that case there is no copying of the structure involved. Furthermore, you don't need to make a local reference to the object in the vector. If you just need to access the key, then access it by calling myVector[i].key.
You're building a copy of the table from the broken example you gave, not just a reference.
Why Can't I store references in an STL map in C++?
Whatever you store in the map it relies on you not changing the vector.
Try a lookup map only.
typedef std::vector<Type> Stuff;
Stuff myVector;

typedef std::map<long, Type*> LookupMap;
LookupMap myMap;

LookupMap::iterator hint = myMap.begin();
for (Stuff::iterator it = myVector.begin(); myVector.end() != it; ++it)
{
    hint = myMap.insert(hint, std::make_pair(it->key, &*it));
}
Or perhaps drop the vector and just store it in the map??
Since your vector is already partially ordered, you may want to instead create an auxiliary array referencing (indices of) the elements in your original vector. Then you can sort the auxiliary array using Timsort which has good performance for partially sorted data (such as yours).
I think you've got some other problem. Creating a vector of 1500 <long, int> pairs, and sorting it based on the longs should take considerably less than 0.8 milliseconds (at least assuming we're talking about a reasonably modern, desktop/server type processor).
To try to get an idea of what we should see here, I did a quick bit of test code:
#include <vector>
#include <algorithm>
#include <cstdlib>
#include <time.h>
#include <iostream>

int main() {
    const int size = 1500;
    const int reps = 100;

    std::vector<std::pair<long, int> > init;
    long total = 0;

    // Generate the "original" array
    for (int i = 0; i < size; i++)
        init.push_back(std::make_pair(rand(), i));

    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        // copy the original array
        std::vector<std::pair<long, int> > data(init.begin(), init.end());

        // sort the copy
        std::sort(data.begin(), data.end());

        // use data that depends on the sort to prevent it being optimized away
        total += data[10].first;
        total += data[size - 10].first;
    }
    clock_t stop = clock();

    std::cout << "Ignore: " << total << "\n";

    clock_t ticks = stop - start;
    double seconds = ticks / (double)CLOCKS_PER_SEC;
    double ms = seconds * 1000.0;
    double ms_p_iter = ms / reps;
    std::cout << ms_p_iter << " ms/iteration.";
    return 0;
}
Running this on my somewhat "trailing edge" (~5 year-old) machine, I'm getting times around 0.1 ms/iteration. I'd expect searching in this (using std::lower_bound or std::upper_bound) to be somewhat faster than searching in an std::map as well (since the data in the vector is allocated contiguously, we can expect better locality of reference, leading to better cache usage).

std::map and performance, intersecting sets

I'm intersecting some sets of numbers, and doing this by storing a count of each time I see a number in a map.
I'm finding the performance to be very slow.
Details:
- One of the sets has 150,000 numbers in it
- The intersection of that set and another set takes about 300ms the first time, and about 5000ms the second time
- I haven't done any profiling yet, but every time I break in the debugger while doing the intersection, it's in malloc.c!
So, how can I improve this performance? Switch to a different data structure? Some how improve the memory allocation performance of map?
Update:
Is there any way to ask std::map or boost::unordered_map to pre-allocate some space?
Or, are there any tips for using these efficiently?
Update2:
See Fast C++ container like the C# HashSet<T> and Dictionary<K,V>?
Update3:
I benchmarked set_intersection and got horrible results:
(set_intersection) Found 313 values in the intersection, in 11345ms
(set_intersection) Found 309 values in the intersection, in 12332ms
Code:
int runIntersectionTestAlgo()
{
    set<int> set1;
    set<int> set2;
    set<int> intersection;

    // Create 100,000 values for set1
    for (int i = 0; i < 100000; i++)
    {
        int value = 1000000000 + i;
        set1.insert(value);
    }

    // Create 1,000 values for set2
    for (int i = 0; i < 1000; i++)
    {
        int random = rand() % 200000 + 1;
        random *= 10;
        int value = 1000000000 + random;
        set2.insert(value);
    }

    set_intersection(set1.begin(), set1.end(), set2.begin(), set2.end(),
                     inserter(intersection, intersection.end()));

    return intersection.size();
}
You should definitely be using preallocated vectors which are way faster. The problem with doing set intersection with stl sets is that each time you move to the next element you're chasing a dynamically allocated pointer, which could easily not be in your CPU caches. With a vector the next element will often be in your cache because it's physically close to the previous element.
The trick with vectors, is that if you don't preallocate the memory for a task like this, it'll perform EVEN WORSE because it'll go on reallocating memory as it resizes itself during your initialization step.
Try something like this instead; it'll be way faster.
int runIntersectionTestAlgo() {
    vector<int> vector1; vector1.reserve(100000);
    vector<int> vector2; vector2.reserve(1000);

    // Create 100,000 values for vector1
    for (int i = 0; i < 100000; i++) {
        int value = 1000000000 + i;
        vector1.push_back(value);
    }
    sort(vector1.begin(), vector1.end());

    // Create 1,000 values for vector2
    for (int i = 0; i < 1000; i++) {
        int random = rand() % 200000 + 1;
        random *= 10;
        int value = 1000000000 + random;
        vector2.push_back(value);
    }
    sort(vector2.begin(), vector2.end());

    // Reserve at most 1,000 spots for the intersection
    vector<int> intersection; intersection.reserve(min(vector1.size(), vector2.size()));
    set_intersection(vector1.begin(), vector1.end(), vector2.begin(), vector2.end(),
                     back_inserter(intersection));

    return intersection.size();
}
Without knowing any more about your problem, "check with a good profiler" is the best general advise I can give. Beyond that...
If memory allocation is your problem, switch to some sort of pooled allocator that reduces calls to malloc. Boost has a number of custom allocators that should be compatible with std::allocator<T>. In fact, you may even try this before profiling, if you've already noticed debug-break samples always ending up in malloc.
If your number-space is known to be dense, you can switch to using a vector- or bitset-based implementation, using your numbers as indexes in the vector.
If your number-space is mostly sparse but has some natural clustering (this is a big if), you may switch to a map-of-vectors. Use higher-order bits for map indexing, and lower-order bits for vector indexing. This is functionally very similar to simply using a pooled allocator, but it is likely to give you better caching behavior. This makes sense, since you are providing more information to the machine (clustering is explicit and cache-friendly, rather than a random distribution you'd expect from pool allocation).
I would second the suggestion to sort them. There are already STL set algorithms that operate on sorted ranges, like set_intersection, set_union, etc.
I don't understand why you have to use a map to do intersection. Like people have said, you could put the sets in std::set's, and then use std::set_intersection().
Or you can put them into hash_set's. But then you would have to implement intersection manually: technically you only need to put one of the sets into a hash_set, and then loop through the other one, and test if each element is contained in the hash_set.
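A minimal sketch of that manual intersection, using std::unordered_set (the modern standard equivalent of the pre-standard hash_set):
#include <unordered_set>
#include <vector>

// build a hash set from one side, then probe it with the other
std::vector<int> intersect(const std::vector<int>& a, const std::vector<int>& b)
{
    std::unordered_set<int> lookup(a.begin(), a.end());
    std::vector<int> result;
    for (int v : b)
        if (lookup.count(v))   // average O(1) membership test
            result.push_back(v);
    return result;
}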
Intersection with maps is slow; try a hash_map (however, this is not provided in all STL implementations).
Alternatively, sort both maps and do it in a merge-sort-like way.
What is your intersection algorithm? Maybe there are some improvements to be made?
Here is an alternate method
I do not know it to be faster or slower, but it could be something to try. Before doing so, I also recommend using a profiler to ensure you really are working on the hotspot. Change the sets of numbers you are intersecting to use std::set<int> instead. Then iterate through the smallest one looking at each value you find. For each value in the smallest set, use the find method to see if the number is present in each of the other sets (for performance, search from smallest to largest).
This is optimised in the case that the number is not found in all of the sets, so if the intersection is relatively small, it may be fast.
Then, store the intersection in std::vector<int> instead - insertion using push_back is also very fast.
Here is another alternate method
Change the sets of numbers to std::vector<int> and use std::sort to sort them from smallest to largest. Then use std::binary_search to find the values, using roughly the same method as above. This may be faster than searching a std::set, since the array is more tightly packed in memory. Actually, never mind that: you can then just iterate through the two vectors in lock-step, looking at the positions with the same value and incrementing only the iterators whose values are smaller when the values differ; see the sketch below.
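A minimal sketch of that lock-step walk over two sorted vectors (this is essentially what std::set_intersection does internally):
#include <cstddef>
#include <vector>

std::vector<int> intersectSorted(const std::vector<int>& a, const std::vector<int>& b)
{
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j])
            ++i;                 // advance the side with the smaller value
        else if (b[j] < a[i])
            ++j;
        else {
            out.push_back(a[i]); // equal: part of the intersection
            ++i;
            ++j;
        }
    }
    return out;
}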
It might be your algorithm. As I understand it, you are spinning over each set (which I'm hoping is a standard set) and throwing them into yet another map. This does a lot of work you don't need to do, since the keys of a standard set are already in sorted order. Instead, take a merge-sort-like approach: spin over the iterators, dereferencing to find the minimum; count how many sets have that minimum and increment those iterators. If the count was N, add the value to the intersection. Repeat until the first map hits its end. (If you compare the sizes before starting, you won't have to check every map's end each time.)
Responding to update: There do exist faculties to speed up memory allocation by pre-reserving space, like boost::pool_alloc. Something like:
std::map<int, int, std::less<int>, boost::pool_allocator< std::pair<int const, int> > > m;
But honestly, malloc is pretty good at what it does; I'd profile before doing anything too extreme.
Look at your algorithms, then choose the proper data type. If you want set-like behaviour and need intersections and the like, std::set is the container to use.
Since its elements are stored in sorted order, insertion may cost you O(log N), but intersection with another (sorted!) std::set can be done in linear time.
I figured something out: if I attach the debugger to either RELEASE or DEBUG builds (e.g. hit F5 in the IDE), then I get horrible times.