Stream compaction with Thrust; best practices and fastest way? - c++

I am interested in porting some existing code to use thrust to see if I can speed it up on the GPU with relative ease.
What I'm looking to accomplish is a stream compaction operation, where only nonzero elements will be kept. I have this mostly working, per the example code below. The part that I am unsure of how to tackle is dealing with all the extra fill space that is in d_res and thus h_res, after the compaction happens.
The example just uses a 0-99 sequence with all the even entries set to zero. This is just an example, and the real problem will be a general sparse array.
This answer helped me greatly, although in it the output size is simply known to be constant when reading out the data:
How to quickly compact a sparse array with CUDA C?
I suspect that I can work around this by counting the number of nonzero elements in d_src and allocating d_res to be only that size, or by doing the count after the compaction and copying only that many elements. Is that really the right way to do it?
I get the sense that there will be some easy fix for this, via clever use of iterators or some other feature of thrust.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

#include <iostream>

// Predicate functor
struct is_not_zero
{
    __host__ __device__
    bool operator()(const int x) const
    {
        return (x != 0);
    }
};

using namespace std;

int main(void)
{
    size_t N = 100;

    // Host vector
    thrust::host_vector<int> h_src(N);

    // Fill with some zero and some nonzero data, as an example
    for (int i = 0; i < N; i++){
        if (i % 2 == 0){
            h_src[i] = 0;
        }
        else{
            h_src[i] = i;
        }
    }

    // Print out source data
    cout << "Source:" << endl;
    for (int i = 0; i < N; i++){
        cout << h_src[i] << " ";
    }
    cout << endl;

    // Copy to device
    thrust::device_vector<int> d_src = h_src;

    // Result vector
    thrust::device_vector<int> d_res(d_src.size());

    // Copy nonzero elements from d_src to d_res
    thrust::copy_if(d_src.begin(), d_src.end(), d_res.begin(), is_not_zero());

    // Copy back to host
    thrust::host_vector<int> h_res(d_res.begin(), d_res.end());
    //thrust::host_vector<int> h_res = d_res; // Or just this?

    // Show results
    cout << "h_res size is " << h_res.size() << endl;
    cout << "Result after remove:" << endl;
    for (int i = 0; i < h_res.size(); i++){
        cout << h_res[i] << " ";
    }
    cout << endl;

    return 0;
}
Also, I am a novice with thrust, so if the above code has any obvious flaws that go against recommended practices for using thrust, please let me know.
Similarly, speed is always of interest. Reading some of the various thrust tutorials, it seems like little changes here and there can be big speed savers or wasters. So, please let me know if there is a smart way to speed this up.

What you appear to have overlooked is that copy_if returns an iterator pointing to the end of the data copied by the stream compaction operation. So all that is required is this:
//copies to device
thrust::device_vector<int> d_src = h_src;
//Result vector
thrust::device_vector<int> d_res(d_src.size());
//Copy non-zero elements from d_src to d_res
auto result_end = thrust::copy_if(d_src.begin(), d_src.end(), d_res.begin(), is_not_zero());
//Copy back to host
thrust::host_vector<int> h_res(d_res.begin(), result_end);
Doing this sizes h_res to hold only the nonzero elements, and copies only the nonzero portion of the stream compaction's output. No extra computation is required.

Related

how to store a certain set of integer value in C++

Actually, I am working with tensor storage whose dimensions are 61872578*33. I am trying to store these integer values in a vector, but unfortunately after a certain period Code::Blocks shows a std::bad_alloc message. My question is: how can I solve this problem? Is there any solution? Here is my code.
#include <iostream>
#include <vector>
#include <stdio.h>
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <sstream>

using namespace std;

int main(){
    ofstream bcrs_tensor;
    bcrs_tensor.open("bcrs_tensor_Binary", ios::out | ios::binary);

    int X,Y,Z,M;
    printf("Enter size of 1st dimension X= ");
    scanf("%d",&X);
    printf("\n Enter size of 2nd dimension Y= ");
    scanf("%d",&Y);
    printf("\n Enter size of 3rd dimension Z= ");
    scanf("%d",&Z);
    printf("\n Enter size of 4th dimension M= ");
    scanf("%d",&M);
    printf("\n");

    int new_dimension_1,new_dimension_2,new_x_1,new_x_2;
    new_dimension_1=X*Z;
    new_dimension_2=Y*M;

    int* new_A = new int[ new_dimension_1*new_dimension_2 ];

    /* // fill up tensor with zero
    for(int i =0; i<new_dimension_1; i++){
        for(int j= 0; j< new_dimension_2; j++){
            *(new_A + i*new_dimension_2 + j)=0;
        }
    }
    */

    //read tensor value from file
    ifstream read_tensor("Chicago_fourToTwo_d.txt");
    int row,col,val;
    if(read_tensor.is_open()){
        while(read_tensor >> row >> col >> val){
            *(new_A + row*new_dimension_2 + col)=val;
        }
    }

    int x,block_ROW,block_COL;
    for(x=11; x<=new_dimension_1; x++){
        if(new_dimension_1%x == 0){
            block_ROW=x;
            printf("block ROW %d\n",block_ROW);
            break;
        }
    }
    for(x=13; x<=new_dimension_2; x++){
        if(new_dimension_2%x == 0){
            block_COL=x;
            printf("block COL %d\n",block_COL);
            break;
        }
    }
    cout<<"here"<<endl;

    int a,b,c,d,e,f,non_zero;
    vector<int> block_value,CO_BCRS,RO_BCRS;
    int NZB=0;
    RO_BCRS.push_back(0);
    for(a=0; a<new_dimension_1; a=a+block_ROW){
        for(b=0; b<new_dimension_2; b=b+block_COL){
            non_zero=0;
            for(c=a; c<a+block_ROW; c++){
                for(d=b; d<b+block_COL; d++){
                    printf("[%d][%d]\n",c,d);
                    if(*(new_A + c*new_dimension_2 + d)!=0){
                        non_zero++;
                    }
                }
            }
            if(non_zero!=0){
                for(e=a; e<a+block_ROW; e++){
                    for(f=b; f<b+block_COL; f++){
                        block_value.push_back(*(new_A + e*new_dimension_2 + f));
                    }
                }
                CO_BCRS.push_back(b);
                NZB++;
            }
        }
        RO_BCRS.push_back(NZB);
    }

    cout<<"Block value"<<endl;
    for(vector<int>::iterator itr=block_value.begin(); itr!=block_value.end(); ++itr){
        cout<< " " << *itr;
    }
    cout<<endl;
    cout<<"CO_BCRS"<<endl;
    for(vector<int>::iterator itr=CO_BCRS.begin(); itr!=CO_BCRS.end(); ++itr){
        cout<< " " << *itr;
    }
    cout<<endl;
    cout<<"RO_BCRS"<<endl;
    for(vector<int>::iterator itr=RO_BCRS.begin(); itr!=RO_BCRS.end(); ++itr){
        cout<< " " << *itr;
    }
    cout<<endl;

    //block_value
    int block_value_S=block_value.size();
    cout<<"block_value_S "<< block_value_S <<endl;
    int block_value_val;
    for(int i=0; i<block_value_S; i++){
        block_value_val = block_value[i];
        bcrs_tensor.write((char *) &block_value_val, sizeof(int));
    }

    //CO_BCRS
    int CO_BCRS_S=CO_BCRS.size();
    cout<<"CO_BCRS_S "<< CO_BCRS_S <<endl;
    int CO_BCRS_val;
    for(int i=0; i<CO_BCRS_S; i++){
        CO_BCRS_val = CO_BCRS[i];
        bcrs_tensor.write((char *) &CO_BCRS_val, sizeof(int));
    }

    //RO_BCRS
    int RO_BCRS_S=RO_BCRS.size();
    cout<<"RO_BCRS_S "<< RO_BCRS_S <<endl;
    int RO_BCRS_val;
    for(int i=0; i<RO_BCRS_S; i++){
        RO_BCRS_val = RO_BCRS[i];
        bcrs_tensor.write((char *) &RO_BCRS_val, sizeof(int));
    }

    bcrs_tensor.close();
    return 0;
}
The theoretical limit for size of a vector is given by std::vector::max_size(). However, this value just takes into account limitations of the implementation. Much sooner you will run out of memory.
Vectors typically increase their capacity by factors of 2 (3 or other factors can be used as well). Hence if you call push_back N-times, the memory footprint can be as big as 2*N. And as vectors store their elements in contiguous memory, you might get a bad_alloc on reallocation, even though you still have enough memory for the elements you want to store. You can ask for only the capacity you actually need by calling std::vector::reserve.
Other than that you cannot keep more in memory than you have memory. So you basically have only two options: Get more ram or find a way to store less elements in the vector.
If you have a tensor of 61872578 * 33 int values, then on most architectures you need 61872578*33*4 bytes to store it - that's 7.6 GB of RAM just for one tensor.
You have four choices:
Change your requirements so that you no longer need a tensor that big.
Install enough RAM to store, manipulate, and reallocate all this in memory.
Set up a swap partition or a swap file with enough capacity so that the operating system will be able to put parts of the tensor on the disk. Unless your algorithms traverse the data more-or-less sequentially (or chunks of millions of sequential elements at a time) this will be unreasonably slow. It may be so slow that you think your computer or your program is stuck.
Rewrite your algorithm, so that it will work on parts of your tensor at a time:
If you want to run a convolution for a CNN then you can create a sliding window of K rows (assuming the height of your weights-tensor is K), and get rid of an old row right before reading a new one. You will have to write the result tensor as you go instead of having it all in RAM. A similar approach will work with pooling.
If you want to multiply your tensor by another one then break up your tensor into many smaller tensors, then do the multiplication on the smaller tensors. You can find a rough idea at https://en.wikipedia.org/wiki/Block_matrix .
Adding two tensors requires almost no data to be stored in RAM, just sequentially read the numbers from both tensors, add them up, and immediately store the result to disk. The lower layers (C library and OS) will employ reasonable buffering in order to make this run OK.
Note that options 2 and 3 are acceptable only if a runtime of minutes is acceptable. If you want sub-second performance then only options 0 and 1 are relevant. I may be overly pessimistic regarding the speed of option 3 - your mileage may vary depending on the throughput of your disk (mostly for option 3) and/or its latency (possibly option 2).

creating mxn 2D array in c++ without using any external library

I am a beginner to C++ syntax. Now, I need to create an m×n 2D array in C++ to use in another project. I have looked at other answers which involve using tools like vector, etc. Many tools are not working in my Visual Studio 15; e.g. I cannot define a vector with std::vector without getting a message like "vector is not in std". So, I have written the following code:
#include "stdafx.h"
#include <iostream>
using namespace std;

int main()
{
    int i; int j; int row[5][10] = {};
    for (int j = 0; j < 10;)
        for (int i = 0; i < 5;)
        {
            row[i][j] = 500;
            int printf(row[i][j]);
            i++;
            j++;
            cout << "Array:" << row[i][j] << endl;
        }
    return 0;
}
Surely, this is not the correct syntax. So the output is beyond my expectation. I want to create an m*n array with all the elements being the same integer; 500 in this case. That is, if m=3, n=2, I should get
500 500 500
500 500 500
There are a couple of things wrong with your current code.
The first for loop is missing curly brackets.
You're redefining int i and int j in your for loops. Not a compilation issue, but still an issue.
You're using printf incorrectly. printf is used to output formatted strings to the console. The correct line would be printf("%d", row[i][j]);
If you want to use a vector, you have to include it with #include <vector>. You can use a vector much like an array, but you don't have to worry about its size.
You seem to be learning, so I made only minimal corrections to get it working. I suggest you make modifications as per your needs.
#include <iostream>
using namespace std;

int main()
{
    int row[5][10] = {};
    for (int j = 0; j < 10; j++) {
        for (int i = 0; i < 5; i++) {
            row[i][j] = 500;
            cout << row[i][j] << " ";
        }
        cout << endl;
    }
    return 0;
}
Care and feeding of std::vector using OP's program as an example.
#include <iostream>
#include <vector> // needed to get the code that makes the vector work

int main()
{
    int m, n; // declare m and n to hold the dimensions of the vector

    if (std::cin >> m >> n) // get m and n from user
    {
        // m and n are good, so we can continue. Well, sort of good. The user
        // could type insane numbers that will explode the vector, but at
        // least they are numbers.
        // Always test ALL user input before you use it. Users are malicious
        // and incompetent <expletive deleted>s, so never trust them.
        // That said, input validation is a long topic and out of scope for
        // this answer, so I'm going to let trapping bad numbers pass in the
        // interest of keeping it simple.

        // declare row as a vector of vectors
        std::vector<std::vector<int>> row(m, std::vector<int>(n, 500));
        // breaking this down:
        //   std::vector<std::vector<int>> row
        //     row is a vector of vectors of ints
        //   row(m, std::vector<int>(n, 500));
        //     row's outer vector holds m std::vector<int>s, each constructed
        //     with n ints all set to 500

        for (int j = 0; j < n; j++) // note: j++ has been moved here. This is
                                    // exactly what the third part of a for
                                    // statement is for. Fewer surprises for
                                    // everyone this way.
        // note to purists: I'm ignoring the possible advantages of ++j
        // because explaining them would muddy the answer.
        // Another note: this will output the transpose of row because of the
        // ordering of i and j.
        {
            for (int i = 0; i < m; i++) // ditto i++ here
            {
                // there is no need to assign anything here. The vector did
                // it for us.
                std::cout << " " << row[i][j]; // moved the line ending so
                                               // that the line isn't ended
                                               // with every column
            }
            std::cout << '\n'; // end the line at the end of a row
            // Note: I also changed the type of line ending. endl ends the
            // line AND writes the contents of the output stream to whatever
            // media the stream represents (in this case the console) rather
            // than buffering the stream and writing at a more opportune
            // time. Too much endl can be a performance killer, so use it
            // sparingly and almost certainly not in a loop.
        }
        std::cout << std::endl; // ending the line again to demonstrate a
                                // better placement of endl. The stream is
                                // only forced to flush once, right at the
                                // end of the program. Even this may be
                                // redundant, as the stream will flush when
                                // the program exits, assuming the program
                                // does not crash on exit.
    }
    else
    {
        // let the user know the input was not accepted. Prompt feedback is
        // good, otherwise the user may assume everything worked, or in the
        // case of a long program, assume that it crashed or is misbehaving
        // and terminate the program.
        std::cout << "Bad input. Program exiting" << std::endl;
    }
    return 0;
}
One performance note: a vector of vectors does not provide one long block of memory. It provides M+1 blocks of memory that may be anywhere in storage. Normally when a modern CPU reads a value from memory, it also reads the values around it, on the assumption that if you want the item at location X, you'll probably want the value at location X+1 shortly after. This allows the CPU to load up, or "cache", many values at once. That doesn't work if you have to jump around through memory. This means the CPU may find itself spending more time retrieving parts of a vector of vectors than it does processing them. The typical solution is to fake a 2D data structure with a 1D structure and perform the 2D-to-1D mapping yourself.
So:
std::vector<int> row(m*n, 500);
Much nicer looking, yes? Access looks a bit uglier, though
std::cout << " " << row[i * n + j];
Fun thing is, the work done behind the scenes converting row[i][j] to a memory address is almost identical to that for row[i*n+j], so even though you write out more of the work, it doesn't take any longer. Add to this the benefits you get from the CPU successfully predicting and reading ahead, and your program is often vastly faster.

Why is the auto-vectorized version of this program fragment slower than the simple version

In a larger numerical computation, I have to perform the trivial task of summing up the products of the elements of two vectors. Since this task needs to be done very often, I tried to make use of the auto vectorization capabilities of my compiler (VC2015). I introduced a temporary vector, where the products are saved in in a first loop and then performed the summation in a second loop. Optimization was set to full and fast code was preferred. This way, the first loop got vectorized by the compiler (I know this from the compiler output).
The result was surprising. The vectorized code performed 3 times slower on my machine (core i5-4570 3.20 GHz) than the simple code. Could anybody explain why and what might improve the performance? I've put both versions of the algorithm fragment into a minimal running example, which I used myself for testing:
#include "stdafx.h"
#include <vector>
#include <Windows.h>
#include <iostream>
using namespace std;

int main()
{
    // Prepare timer
    LARGE_INTEGER freq, c_start, c_stop;
    QueryPerformanceFrequency(&freq);

    int size = 20000000; // size of data
    double v = 0;

    // Some data vectors. The data inside doesn't matter
    vector<double> vv(size);
    vector<double> tt(size);
    vector<float> dd(size);

    // Put random values into the vectors
    for (int i = 0; i < size; i++)
    {
        tt[i] = rand();
        dd[i] = rand();
    }

    // The simple version of the algorithm fragment
    QueryPerformanceCounter(&c_start); // start timer
    for (int p = 0; p < size; p++)
    {
        v += tt[p] * dd[p];
    }
    QueryPerformanceCounter(&c_stop); // stop timer
    cout << "Simple version took: " << ((double)(c_stop.QuadPart - c_start.QuadPart)) / ((double)freq.QuadPart) << " s" << endl;
    cout << v << endl; // We use v once. This avoids its calculation being optimized away.

    // The version that is auto-vectorized
    for (int i = 0; i < size; i++)
    {
        tt[i] = rand();
        dd[i] = rand();
    }
    v = 0;
    QueryPerformanceCounter(&c_start); // start timer
    for (int p = 0; p < size; p++) // This loop is vectorized according to compiler output
    {
        vv[p] = tt[p] * dd[p];
    }
    for (int p = 0; p < size; p++)
    {
        v += vv[p];
    }
    QueryPerformanceCounter(&c_stop); // stop timer
    cout << "Vectorized version took: " << ((double)(c_stop.QuadPart - c_start.QuadPart)) / ((double)freq.QuadPart) << " s" << endl;
    cout << v << endl; // We use v once. This avoids its calculation being optimized away.

    cin.ignore();
    return 0;
}
You added a large amount of work by storing the products in a temporary vector.
For such a simple computation on large data, the CPU time that you expect to save by vectorization doesn't matter. Only memory references matter.
You added memory references, so it runs slower.
I would have expected the compiler to optimize the original version of that loop. I doubt the optimization would affect the execution time (because it is dominated by memory access regardless). But it should be visible in the generated code. If you wanted to hand optimize code like that, a temporary vector is always the wrong way to go. The right direction is the following (for simplicity, I assumed size is even):
double v1 = 0; // second accumulator, declared before the loop
for (int p = 0; p < size; p += 2)
{
    v  += tt[p]   * dd[p];
    v1 += tt[p+1] * dd[p+1];
}
v += v1;
Note that your data is large enough and operation simple enough, that NO optimization should be able to improve on the simplest version. That includes my sample hand optimization. But I assume your test is not exactly representative of what you are really trying to do or understand. So with smaller data or a more complicated operation, the approach I showed may help.
Also notice my version relies on addition being associative. For real numbers, addition is associative; in floating point, it isn't. The answer is likely to differ by an amount too tiny for you to care about. But that is data dependent. If you have large values of opposite sign in odd/even positions canceling each other early in the original sequence, then by segregating the even and odd positions my "optimization" could totally destroy the answer. (Of course, the opposite can also be true: if all the even positions were tiny and the odd ones included large values canceling each other, then the original sequence produced garbage and the changed sequence would be more correct.)

Multiply values from two arrays (Artificial Neuron)

////////////////////MAKE INPUT VALUES////////////////////
double *NumOfInputsPointer = NULL;
std::cout << "How many inputs?" << std::endl;
int NumOfInputs;
std::cin >> NumOfInputs;
NumOfInputsPointer = new double[NumOfInputs];
std::cout << std::endl;
double InputVal;
for(int a = 0; a < NumOfInputs; a++)
{
    std::cout << "What is the value for input " << a << std::endl;
    a+1;
    std::cin >> InputVal;
    *(NumOfInputsPointer + a) = InputVal;
}
std::cout << std::endl;

////////////////////MAKE WEIGHTS////////////////////
double *NumOfWeightsPointer = NULL;
int NumOfWeights;
NumOfWeightsPointer = new double[NumOfWeights];
double WightVal;
for(int a = 0; a < NumOfInputs; a++)
{
    *(NumOfWeightsPointer + a) = 0.5;
}

////////////////////Multiplication BRAIN BROKE!!!!!////////////////////
double *MultiplyPointer = NULL;
MultiplyPointer = NumOfInputsPointer;
for(int a = 0; a < NumOfInputs; a++)
{
    //Stuff to do things
}
The code above is meant to make a single artificial neuron. I already have it built to make an array with the user's desired number of inputs, and to automatically set every input's weight to 0.5.
The wall I have hit is multiplying the input-values array by the weights array, then saving those products in another array so they can be added together later and passed through a modifier.
My struggle is with the multiplication and saving the results into an array. I hope I explained my problem well enough.
There are many problems with this code. I would highly recommend using std::vector instead of arrays. If every input has a constant weight of 0.5, then what's the point of creating an array where all elements are 0.5? Just create a constant variable representing the 0.5 weight and apply it to each input. The second array is unnecessary from what I can tell. Creating the last array (again, this would be easier with a vector) would be similar to the first one because the size is going to be the same. It is based on the number of inputs. So just create an array of the same size, loop through each element in the first array, do the multiplication using the constant I described above, and then store the result into the new array.
Just new it like you did with the others, and store the result of the multiplication there.
MultiplyPointer = new double[NumOfInputs];
for (int a = 0; a < NumOfInputs; a++) {
    MultiplyPointer[a] = NumOfWeightsPointer[a] * NumOfInputsPointer[a];
}
That being said, there are better ways to go about solving your problem. std::vector has been mentioned, which makes the memory management and looping bits easier. I would go a step further and incorporate a library with the notions of a matrix and matrix expressions, such as OpenCV or dlib.
Example using Mat from OpenCV:
cv::Mat input(NumOfInputs, 1, CV_64F, NumOfInputsPointer);
cv::Mat weights(NumOfInputs, 1, CV_64F, cv::Scalar(0.5));
cv::Mat result = input.mul(weights);
If the weights vector is not to be modified and reused, just skip the whole thing:
cv::Mat result = input.mul(cv::Scalar(0.5));

Vector of Maps fill - Compiler fail?

I'm new to C++ programming and am trying to write some sparse matrix and vector code as practice.
The sparse matrix is built from a vector of maps, where the vector indexes the rows and each map holds the sparse entries of the columns.
What I was trying to do is to fill a diagonal dominant sparse matrix with an equation system for a Poisson equation.
Now when filling the matrix in test cases I was able to provoke the following very weird problem, which I broke down to the essential operations.
#include <vector>
#include <iterator>
#include <iostream>
#include <map>
#include <ctime>

int main()
{
    unsigned int nDim = 100000;
    double clock1;

    // alternative: std::map<unsigned int, std::map<unsigned int, double> > mat;
    std::vector<std::map<unsigned int, double> > mat;
    mat.resize(nDim);

    // if clause and number set
    clock1 = double(clock())/CLOCKS_PER_SEC;
    for(unsigned int rowIter = 0; rowIter < nDim; rowIter++)
    {
        for(unsigned int colIter = 0; colIter < nDim; colIter++)
        {
            if(rowIter == colIter)
            {
                mat[rowIter][colIter] = 1.;
            }
        }
    }
    std::cout << "time for diagonal fill: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;

    // if clause and number insert
    clock1 = double(clock())/CLOCKS_PER_SEC;
    for(unsigned int rowIter = 0; rowIter < nDim; rowIter++)
    {
        for(unsigned int colIter = 0; colIter < nDim; colIter++)
        {
            if(rowIter == colIter)
            {
                mat[rowIter].insert(std::pair<unsigned int, double>(colIter,1.));
            }
        }
    }
    std::cout << "time for insert diagonal fill: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;

    // only number set
    clock1 = double(clock())/CLOCKS_PER_SEC;
    for(unsigned int rowIter = 0; rowIter < nDim; rowIter++)
    {
        mat[rowIter][rowIter] += 1.;
    }
    std::cout << "time for easy diagonal fill: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;

    // only if clause
    clock1 = double(clock())/CLOCKS_PER_SEC;
    for(unsigned int rowIter = 0; rowIter < nDim; rowIter++)
    {
        for(unsigned int colIter = 0; colIter < nDim; colIter++)
        {
            if(rowIter == colIter)
            {
            }
        }
    }
    std::cout << "time for if clause: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;

    return 0;
}
Running this in gcc (newest version, 4.8.1 I think) the following times appear:
time for diagonal fill: 26317ms
time for insert diagonal: 8783ms
time for easy diagonal fill: 10ms !!!!!!!
time for if clause: 0ms
I included the loop containing only the if clause to be sure that the loop itself is not responsible for the lack of speed.
Optimization level is O3, but the problem also appears on other levels.
So I thought let's try the Visual Studio (2012 Express).
It is a little bit faster, but still as slow as ketchup:
time for diagonal fill: 9408ms
time for insert diagonal: 8860ms
time for easy diagonal fill: 11ms !!!!!!!
time for if clause: 0ms
So MSVSC++ fails, too.
It will probably not even be necessary to use this combination of if clause and matrix fill, but if it is... I'm screwed.
Does anybody know where this huge performance gap is coming from and how I could deal with it?
Is it some optimization problem caused by the fact, that the if-clause is inside the loop? Do I maybe just need another compiler flag?
I would also be interested, if it occurs with other systems/compilers, too. I might run it on the Xeon E5 machine at work and see what this baby makes with this devil piece of code :).
EDIT:
I ran it on the Xeon machine: Much faster, still slow.
Times with gcc:
2778ms
2684ms
1ms
0ms
The most obvious performance issue would be allocations within your map. Each time you assign/insert a new item in a map, it's got to allocate space for it and sort the tree appropriately. Doing that thousands of times is bound to be slow.
It's also very significant that you're not clearing the maps after your first loop. That means your subsequent loops don't have to do as much work, so your performance comparisons are not equivalent.
Finally, the nested loops are obviously going to be doing many orders of magnitude more iterations than your single loop. From a strict algorithm-analysis standpoint, they may be doing the same amount of actual work on the data. However, the program still has to run through all those extra iterations because that's what you've told it to do. The compiler can only optimise them out if there is literally nothing being processed/modified in the loop body.
In the first loop, the runtime system is doing loads of memory allocation, so it takes a lot of time on memory management.
The other loops don't have that overhead; you didn't release the allocation done by the first loop, so they don't have to repeat the memory allocation and it doesn't take anywhere near as long.
The last loop is optimized out by the compiler; it has no side effects, so it doesn't get included in the program.
Morals:
memory allocation has a cost.
benchmarking is hard.