I was experimenting with a well-known technique that aims to reduce the number of comparisons when searching for an element in an unsorted array. The algorithm appends a sentinel to the back of the array, which lets us write a loop that uses only one comparison per element instead of two. It's important to note that the overall Big O complexity does not change; it is still O(n). However, counting comparisons, the standard find algorithm is, so to speak, O(2n) while the sentinel algorithm is O(n).
The standard find algorithm from the C++ library works like this:
template<class InputIt, class T>
InputIt find(InputIt first, InputIt last, const T& value)
{
for (; first != last; ++first) {
if (*first == value) {
return first;
}
}
return last;
}
We can see two comparisons per iteration there, plus one increment.
In the sentinel algorithm, the loop looks like this:
while (a[i] != key)
++i;
There is only one comparison and one increment.
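Put together, the whole sentinel search I have in mind looks roughly like the sketch below (sentinel_find is just my name for it; it assumes the container is non-empty and mutable, since the last element is temporarily overwritten):
#include <cstddef>
#include <vector>
// Sketch: returns the index of key in a, or a.size() if it is absent.
template <class T>
std::size_t sentinel_find(std::vector<T>& a, const T& key)
{
    const std::size_t last = a.size() - 1;
    const T saved = a[last];   // remember the original last element
    a[last] = key;             // plant the sentinel

    std::size_t i = 0;
    while (a[i] != key)        // single comparison per element
        ++i;

    a[last] = saved;           // restore the original value
    if (i == last && saved != key)
        return a.size();       // we only hit the sentinel: not found
    return i;
}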
I ran some experiments and measured times, but the results were different on every computer. Unfortunately I didn't have access to any serious machine; I only had my laptop running Ubuntu inside VirtualBox, where I compiled and ran the code, and I ran into problems with the amount of memory. I tried online compilers like Wandbox and Ideone, but their time and memory limits didn't allow for reliable experiments. Every time I ran my code, changing the number of elements in my vector or the number of test repetitions, I saw different results: sometimes the times were comparable, sometimes std::find was significantly faster, and sometimes the sentinel algorithm was.
I was surprised, because logic says the sentinel version should be faster every time. Do you have any explanation for this? Do you have any experience with this kind of algorithm? Is it worth the effort to use it in production code when performance is crucial, the array cannot be sorted, and other mechanisms to solve the problem (hash maps, indexing, etc.) cannot be used?
Here's my testing code. It's not beautiful, in fact it is ugly, but beauty wasn't my goal here. Maybe something is wrong with my code?
#include <iostream>
#include <algorithm>
#include <chrono>
#include <vector>
using namespace std::chrono;
using namespace std;
const unsigned long long N = 300000000U;
static void find_with_sentinel()
{
vector<char> a(N);
char key = 1;
a[N - 2] = key; // make sure the searched element is in the array at the last but one index
unsigned long long high = N - 1;
auto tmp = a[high];
// put a sentinel at the end of the array
a[high] = key;
unsigned long long i = 0;
while (a[i] != key)
++i;
// restore original value
a[high] = tmp;
if (i == high && key != tmp)
cout << "find with sentinel, not found" << endl;
else
cout << "find with sentinel, found" << endl;
}
static void find_with_std_find()
{
vector<char> a(N);
char key = 1;
a[N - 2] = key; // make sure the searched element is in the array at the last but one index
auto pos = find(begin(a), end(a), key);
if (pos != end(a))
cout << "find with std::find, found" << endl;
else
cout << "find with sentinel, not found" << endl;
}
int main()
{
const int times = 10;
high_resolution_clock::time_point t1 = high_resolution_clock::now();
for (auto i = 0; i < times; ++i)
find_with_std_find();
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(t2 - t1).count();
cout << "std::find time = " << duration << endl;
t1 = high_resolution_clock::now();
for (auto i = 0; i < times; ++i)
find_with_sentinel();
t2 = high_resolution_clock::now();
duration = duration_cast<milliseconds>(t2 - t1).count();
cout << "sentinel time = " << duration << endl;
}
Move the memory allocation (vector construction) outside the measured functions (e.g. pass the vector as argument).
Increase times to a few thousand.
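A rough sketch of what the harness could look like after those two changes; the sizes below are illustrative (smaller than the original 300M so that thousands of repetitions stay practical), and the searched value sits near the end as in the original code:
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>
int main()
{
    const std::size_t n = 1000000;
    const int times = 5000;

    std::vector<char> a(n);        // allocated once, outside the timed region
    a[n - 2] = 1;                  // the element we will search for

    unsigned long long sink = 0;   // keeps the result "used"
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < times; ++i)
        sink += std::find(a.begin(), a.end(), static_cast<char>(1)) - a.begin();
    auto t2 = std::chrono::high_resolution_clock::now();

    std::cout << "std::find time = "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
              << " ms (" << sink << ")\n";
    // the sentinel version would be timed the same way, taking the vector by reference
}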
You're doing a whole lot of time-consuming work in your functions. That work is hiding the differences in the timings. Consider your find_with_sentinel function:
static void find_with_sentinel()
{
// ***************************
vector<char> a(N);
char key = 1;
a[N - 2] = key; // make sure the searched element is in the array at the last but one index
// ***************************
unsigned long long high = N - 1;
auto tmp = a[high];
// put a sentinel at the end of the array
a[high] = key;
unsigned long long i = 0;
while (a[i] != key)
++i;
// restore original value
a[high] = tmp;
// ***************************************
if (i == high && key != tmp)
cout << "find with sentinel, not found" << endl;
else
cout << "find with sentinel, found" << endl;
// **************************************
}
The three lines at the top and the four lines at the bottom are essentially identical in both functions, and they're fairly expensive to run. The top contains a memory allocation and the bottom contains an expensive output operation. These are going to mask the time it takes to do the real work of the function.
You need to move the allocation and the output out of the function. Change the function signature to:
static int find_with_sentinel(vector<char> a, char key);
In other words, make it behave like std::find: take the data and the key, and return a result instead of printing. If you do that, you don't have to wrap std::find, and you get a more realistic view of how your function will perform in a typical situation.
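Filled in, that signature could look something like the sketch below; returning an index with -1 for "not found" is my own choice here, the important part being that there is no allocation and no output inside the measured function:
// Sketch: returns the index of key in a, or -1 if it is not there.
// Note: taking the vector by value copies it on every call; a non-const
// reference would avoid that copy but temporarily modifies the caller's data.
static int find_with_sentinel(vector<char> a, char key)
{
    if (a.empty())
        return -1;

    const unsigned long long high = a.size() - 1;
    const char tmp = a[high];    // remember the original last element
    a[high] = key;               // plant the sentinel

    unsigned long long i = 0;
    while (a[i] != key)          // one comparison per element
        ++i;

    a[high] = tmp;               // restore
    return (i == high && tmp != key) ? -1 : static_cast<int>(i);
}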
It's quite possible that the sentinel find function will be faster. However, it comes with some drawbacks. The first is that you can't use it with immutable lists. The second is that it's not safe to use in a multi-threaded program due to the potential of one thread overwriting the sentinel that the other thread is using. It also might not be "faster enough" to justify replacing std::find.
It's hard to describe, so I will just show the code:
#include <bits/stdc++.h>
using namespace std;
int main()
{
clock_t start, end;
unordered_map<int, int> m;
long test=0;
int size = 9999999;
for (int i=0; i<size/3; i++) {
m[i] = 1;
}
start = clock();
for (int i=0; i<size; i++) {
//if (m.find(i) != m.end())
test += m[i];
}
end = clock();
double time_taken = double(end - start) / double(CLOCKS_PER_SEC);
cout << "Time taken by program is : " << fixed
<< time_taken << setprecision(5);
cout << " sec " << endl;
return 0;
}
The result (3 times):
Without if (m.find(i) != m.end()):
Time taken by program is : 3.508257 sec
Time taken by program is : 3.554726 sec
Time taken by program is : 3.520102 sec
With if (m.find(i) != m.end()):
Time taken by program is : 1.734134 sec
Time taken by program is : 1.663341 sec
Time taken by program is : 1.736100 sec
Can anyone explain why? What really happens inside m[i] when the key is not present?
In this line
test += m[i];
the operator[] does two things: first it tries to find the entry for the given key; then, if the entry does not exist, it creates a new one.
On the other hand here:
if (m.find(i) != m.end())
test += m[i];
the operator[] does only one thing: it finds the element with the given key (and because you checked beforehand that it exists, no new entry has to be constructed).
As the map contains only keys up to size/3, your results suggest that creating the element outweighs the overhead of first checking whether the element exists.
In the first case there are size elements in the map while in the second there are only size/3 elements in the map.
Note that calling operator[] can get more expensive the more elements are in the map: average case constant, worst case linear in the size. The same holds for find. However, when calling the methods many times, the worst case should amortize and you are left with the average constant cost.
Thanks to Aconcagua for pointing out that you did not reserve space in the map. In the first case you add many elements, which requires allocating space, while in the second the size of the map stays constant during the part you measure. Try calling reserve before the loop. Naively, I would expect the two loops to be very similar in that case.
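A sketch of that suggestion applied to the code in the question; std::unordered_map::reserve pre-allocates buckets, so operator[] no longer triggers rehashing while the timed loop inserts the missing keys:
unordered_map<int, int> m;
int size = 9999999;
m.reserve(size);               // room for `size` elements up front
for (int i = 0; i < size / 3; i++) {
    m[i] = 1;
}
// the timed loop stays exactly as in the question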
The difference with and without the if is down to you having only populated the first third of the map.
If you do a find, the program will go and look for the element; if it exists, it will then do the operator[], which finds it again (not terribly efficient), sees that it exists, and returns the value.
Without the if, when you do the operator[], it will try to find the element, fail, create the element (with the default value for an int, which is 0), and return it.
So without the if, you are populating the whole map, which will increase the runtime.
If you wanted to be more efficient, you could use the result of the find to fetch the value
auto iter = m.find(i);
if (iter != m.end())
{
test += iter->second;
}
I've been using std::vector mostly and was wondering if I should use std::map for a key lookup to improve performance.
And here's my full test code.
#include <iostream>
#include <algorithm>
#include <string>
#include <map>
#include <vector>
#include <ctime>
#include <chrono>
using namespace std;
vector<string> myStrings = {"aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii", "jjj", "kkk", "lll", "mmm", "nnn", "ooo", "ppp", "qqq", "rrr", "sss", "ttt", "uuu", "vvv", "www", "xxx", "yyy", "zzz"};
struct MyData {
string key;
int value;
};
int findStringPosFromVec(const vector<MyData> &myVec, const string &str) {
auto it = std::find_if(begin(myVec), end(myVec),
[&str](const MyData& data){return data.key == str;});
if (it == end(myVec))
return -1;
return static_cast<int>(it - begin(myVec));
}
int main(int argc, const char * argv[]) {
const int testInstance = 10000; //HOW MANY TIMES TO PERFORM THE TEST
//----------------------------std::map-------------------------------
clock_t map_cputime = std::clock(); //START MEASURING THE CPU TIME
for (int i=0; i<testInstance; ++i) {
map<string, int> myMap;
//insert unique keys
for (int i=0; i<myStrings.size(); ++i) {
myMap[myStrings[i]] = i;
}
//iterate again, if key exists, replace value;
for (int i=0; i<myStrings.size(); ++i) {
if (myMap.find(myStrings[i]) != myMap.end())
myMap[myStrings[i]] = i * 100;
}
}
//FINISH MEASURING THE CPU TIME
double map_cpu = (std::clock() - map_cputime) / (double)CLOCKS_PER_SEC;
cout << "Map Finished in " << map_cpu << " seconds [CPU Clock] " << endl;
//----------------------------std::vector-------------------------------
clock_t vec_cputime = std::clock(); //START MEASURING THE CPU TIME
for (int i=0; i<testInstance; ++i) {
vector<MyData> myVec;
//insert unique keys
for (int i=0; i<myStrings.size(); ++i) {
const int pos = findStringPosFromVec(myVec, myStrings[i]);
if (pos == -1)
myVec.push_back({myStrings[i], i});
}
//iterate again, if key exists, replace value;
for (int i=0; i<myStrings.size(); ++i) {
const int pos = findStringPosFromVec(myVec, myStrings[i]);
if (pos != -1)
myVec[pos].value = i * 100;
}
}
//FINISH MEASURING THE CPU TIME
double vec_cpu = (std::clock() - vec_cputime) / (double)CLOCKS_PER_SEC;
cout << "Vector Finished in " << vec_cpu << " seconds [CPU Clock] " << endl;
return 0;
}
And this is the result I got.
Map Finished in 0.38121 seconds [CPU Clock]
Vector Finished in 0.346863 seconds [CPU Clock]
Program ended with exit code: 0
I mostly store less than 30 elements in a container.
Does this mean it is better to use std::vector instead of std::map in my case?
EDIT: when I move map<string, int> myMap; before the loop, std::map was faster than std::vector.
Map Finished in 0.278136 seconds [CPU Clock]
Vector Finished in 0.328548 seconds [CPU Clock]
Program ended with exit code: 0
So if this is the proper test, I guess std::map is faster.
But if I reduce the number of elements to 10, std::vector was faster, so I guess it really depends on the number of elements.
I would say that in general, it's possible for a vector to perform better than a map for lookups, but only for a tiny amount of data, e.g. the fewer than 30 elements you've mentioned.
The reason is that a linear search through a contiguous chunk of memory is the cheapest way to access memory. A map keeps its data at scattered memory locations, so each access is a little more expensive. With a tiny number of elements this can play a role. In real life, with hundreds or thousands of elements, the algorithmic complexity of the lookup operation will dominate this performance gain.
BUT! You are benchmarking completely different things:
You are populating a map. In case of a vector, you don't do this
Your code could perform TWO map lookups: first, find to check existence, then the [] operator to locate the element to modify. These are relatively heavy operations. You can modify an element with a single find (see the sketch after this list).
Within each test iteration, you are performing additional heavy operations, like memory allocation for each map/vector. It means that your tests are measuring not only lookup performance but something else.
Benchmarking is a difficult problem; don't do it all yourself. For example, there are side effects like cache warming that you have to deal with. Use something like Celero, hayai or google benchmark.
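For the second point, a single-lookup version of the map update in the question could look like this sketch, using the iterator returned by std::map::find to modify the value in place:
// one lookup instead of find followed by operator[]
for (int i = 0; i < (int)myStrings.size(); ++i) {
    auto it = myMap.find(myStrings[i]);
    if (it != myMap.end())
        it->second = i * 100;
}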
Your vector has constant content, so the compiler optimizes most of your code away anyway.
There is little use in measuring such small counts, and no use in measuring hard-coded values.
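One common way to keep the measured work alive is to accumulate the results into a volatile variable; a sketch that slots into the question's loop (sink is just an illustrative name):
volatile long long sink = 0;   // writes to a volatile cannot be optimized away
for (int i = 0; i < testInstance; ++i) {
    // ... build and update myVec / myMap exactly as before ...
    for (int j = 0; j < (int)myStrings.size(); ++j)
        sink = sink + findStringPosFromVec(myVec, myStrings[j]);
}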
get_number() returns an integer. I'm going to call it 30 times and count the number of distinct integers returned. My plan is to put these numbers into an std::array<int,30>, sort it and then use std::unique.
Is that a good solution? Is there a better one? This piece of code will be the bottleneck of my program.
I'm thinking there should be a hash-based solution, but maybe its overhead would be too much when I've only got 30 elements?
Edit: I changed unique to distinct. Example:
{1,1,1,1} => 1
{1,2,3,4} => 4
{1,3,3,1} => 2
I would use std::set<int> as it's simpler:
std::set<int> s;
for (int i = 0; i < 30; ++i) // loop 30 times
{
s.insert(get_number());
}
std::cout << s.size() << std::endl; // You get count of unique numbers
If you want to count how many times each distinct number was returned, I'd suggest a map:
std::map<int, int> s;
for(int i=0; i<30; i++)
{
s[get_number()]++;
}
cout << s.size() << std::endl; // total count of distinct numbers returned
for (auto it : s)
{
cout << it.first << " " << it.second<< std::endl; // each number and return counts
}
The simplest solution would be to use a std::map:
std::map<int, size_t> counters;
for (size_t i = 0; i != 30; ++i) {
counters[getNumber()] += 1;
}
std::vector<int> uniques;
for (auto const& pair: counters) {
if (pair.second == 1) { uniques.push_back(pair.first); }
}
// uniques now contains the items that only appeared once.
Using a std::map, std::set or the std::sort algorithm will give you O(n*log(n)) complexity. For a small to large number of elements that is perfectly fine. But you use a known integer range, and this opens the door to a lot of optimizations.
Since you say (in a comment) that the range of your integers is known and short, [0..99], I would recommend implementing a modified counting sort. See: http://en.wikipedia.org/wiki/Counting_sort
You can count the number of distinct items while doing the sort itself, removing the need for the std::unique call. The whole complexity would be O(n). Another advantage is that the memory needed is independent of the number of input items. If you had 30.000.000.000 integers to sort, it would not need a single supplementary byte to count the distinct items.
Even if the range of allowed integer values is large, say [0..10,000,000], the memory consumed would be quite low. Indeed, an optimized version could consume as little as 1 bit per allowed integer value. That is less than 2 MB of memory, or about 1/1000th of a laptop's RAM.
Here is a short example program:
#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <vector>
// A function returning an integer between [0..99]
int get_number()
{
return rand() % 100;
}
int main(int argc, char* argv[])
{
// reserves one bucket for each possible integer
// and initialize to 0
std::vector<int> cnt_buckets(100, 0);
int nb_distincts = 0;
// Get 30 numbers and count distincts
for(int i=0; i<30; ++i)
{
int number = get_number();
std::cout << number << std::endl;
if(0 == cnt_buckets[number])
++ nb_distincts;
// We could optimize by doing this only the first time
++ cnt_buckets[number];
}
std::cerr << "Total distincts numbers: " << nb_distincts << std::endl;
}
You can see it working:
$ ./main | sort | uniq | wc -l
Total distinct numbers: 26
26
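The 1-bit-per-value variant mentioned above could look like this sketch, using a std::bitset of 100 flags instead of the vector of counters (same assumed range [0..99] as in the example):
#include <bitset>
#include <cstdlib>
#include <iostream>
int get_number() { return std::rand() % 100; }  // same toy generator as above
int main()
{
    std::bitset<100> seen;   // one bit per possible value
    int distinct = 0;
    for (int i = 0; i < 30; ++i) {
        int n = get_number();
        if (!seen[n]) {      // first time we see this value
            seen[n] = true;
            ++distinct;
        }
    }
    std::cout << "Total distinct numbers: " << distinct << std::endl;
}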
The simplest way is just to use std::set.
std::set<int> s;
int uniqueCount = 30; // decremented once for every duplicate seen
for( int i = 0; i < 30; ++i )
{
int n = get_number();
if( s.find(n) != s.end() ) {
--uniqueCount;
continue;
}
s.insert( n );
}
// now s contains the distinct numbers
// and uniqueCount (equal to s.size()) is how many distinct integers were returned
Using an array and sort seems good, but unique may be a bit overkill if you just need to count distinct values. The following function should return number of distinct values in a sorted range.
template<typename ForwardIterator>
size_t distinct(ForwardIterator begin, ForwardIterator end) {
if (begin == end) return 0;
size_t count = 1;
ForwardIterator prior = begin;
while (++begin != end)
{
if (*prior != *begin)
++count;
prior = begin;
}
return count;
}
In contrast to the set- or map-based approaches, this one needs no heap allocation and the elements are stored contiguously in memory, so it should be much faster. The asymptotic time complexity is O(N log N), which is the same as when using an associative container. I bet that even your original solution of std::sort followed by std::unique would be much faster than using std::set.
Try a set, try an unordered set, try sort and unique, try something else that seems fun.
Then MEASURE each one. If you want the fastest implementation, there is no substitute for trying out real code and seeing what it really does.
Your particular platform and compiler and other particulars will surely matter, so test in an environment as close as possible to where it will be running in production.
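For completeness, the original plan from the question (std::array, std::sort, std::unique) is only a few lines; a sketch, assuming get_number() is defined elsewhere:
#include <algorithm>
#include <array>
int get_number();   // assumed to be defined elsewhere
int count_distinct()
{
    std::array<int, 30> a;
    for (int& x : a)
        x = get_number();

    std::sort(a.begin(), a.end());
    // std::unique moves duplicates to the back and returns the new logical end
    return static_cast<int>(std::unique(a.begin(), a.end()) - a.begin());
}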
This question is about the best strategy for implementing the following simulation in C++.
I'm trying to make a simulation as part of a physics research project, which basically tracks the dynamics of a chain of nodes in space. Each node contains a position together with certain parameters (local curvature, velocity, distance to neighbors, etc.) which all evolve through time.
Each time step can be broken down to these four parts:
Calculate local parameters. The values are dependent on the nearest neighbors in the chain.
Calculate global parameters.
Evolving. The position of each node is moved a small amount, depending on global and local parameters, and some random force fields.
Padding. New nodes are inserted if the distance between two consecutive nodes reach a critical value.
(In addition, nodes can get stuck, which make them inactive for the rest of the simulation. The local parameters of inactive nodes with inactive neighbors, will not change, and does not need any more calculation.)
Each node takes ~60 bytes, I have ~100,000 nodes in the chain, and I need to evolve the chain for about ~1,000,000 time steps. I would, however, like to maximize these numbers, as that would increase the accuracy of my simulation, but under the restriction that the simulation finishes in reasonable time (~hours). (~30% of the nodes will be inactive.)
I have started to implement this simulation as a doubly linked list in C++. This seems natural, as I need to insert new nodes between existing ones, and because the local parameters depend on the nearest neighbors. (I added an extra pointer to the next active node to avoid unnecessary calculation whenever I loop over the whole chain.)
I'm no expert when it comes to parallelization (or coding, for that matter), but I have played around with OpenMP, and I really like how I can speed up loops of independent operations with two lines of code. I don't know how to make my linked list do things in parallel, or whether that even works. So I had the idea of working with an STL vector: instead of keeping pointers to the nearest neighbors, I could store the indices of the neighbors and access them by ordinary lookup. I could also sort the vector by position along the chain (every x'th time step) to get better locality in memory. This approach would allow looping the OpenMP way.
I'm also drawn to this idea because I wouldn't have to deal with memory management myself, and I guess the STL vector implementation is far better than my simple 'new' and 'delete' way of dealing with nodes in the list. I know I could have done the same with std::list, but I don't like having to access the nearest neighbors with iterators.
So I ask you, 1337 h4x0rs and skilled programmers, what would be a better design for my simulation? Is the vector approach sketched above a good idea? Or are there tricks to make linked lists work with OpenMP? Or should I consider a totally different approach?
The simulation is going to run on a computer with 8 cores and 48G RAM, so I guess I can trade a lot of memory for speed.
Thanks in advance
Edit:
I need to add 1-2 % new nodes each time step, so storing them as a vector without indices to nearest neighbors won't work unless I sort the vector every time step.
This is a classic tradeoff question. Using an array or std::vector will make the calculations faster and the insertions slower; using a doubly linked list or std::list will make the insertions faster and the calculations slower.
The only way to judge tradeoff questions is empirically: which will work faster for your particular application? All you can really do is try it both ways and see. The more intense the computation and the shorter the stencil (that is, the higher the computational intensity, how many flops you do per amount of memory you have to bring in), the less important a standard array will be. But basically you should mock up your basic computation both ways and see if it matters. I've hacked together a very crude go at something with both std::vector and std::list; it is probably wrong in any of a number of ways, but you can give it a go, play with some of the parameters, and see which wins for you. On my system, for the sizes and amount of computation given, list is faster, but it can go either way pretty easily.
W/rt OpenMP, yes, if that's the way you're going to go, your hands are somewhat tied; you'll almost certainly have to go with the vector structure, but first you should make sure that the extra cost of the insertions won't blow away any benefit of multiple cores.
#include <iostream>
#include <list>
#include <vector>
#include <cmath>
#include <sys/time.h>
using namespace std;
struct node {
bool stuck;
double x[2];
double loccurve;
double disttoprev;
};
void tick(struct timeval *t) {
gettimeofday(t, NULL);
}
/* returns time in seconds from now to time described by t */
double tock(struct timeval *t) {
struct timeval now;
gettimeofday(&now, NULL);
return (double)(now.tv_sec - t->tv_sec) +
((double)(now.tv_usec - t->tv_usec)/1000000.);
}
int main()
{
const int nstart = 100;
const int niters = 100;
const int nevery = 30;
const bool doPrint = false;
list<struct node> nodelist;
vector<struct node> nodevect;
// Note - vector is *much* faster if you know ahead of time
// maximum size of vector
nodevect.reserve(nstart*30);
// Initialize
for (int i = 0; i < nstart; i++) {
struct node mynode;            // automatic storage; the previous 'new' leaked
mynode.stuck = false;
mynode.x[0] = i; mynode.x[1] = 2.*i;
mynode.loccurve = -1;
mynode.disttoprev = -1;
nodelist.push_back( mynode );
nodevect.push_back( mynode );
}
const double EPSILON = 1.e-6;
struct timeval listclock;
double listtime;
tick(&listclock);
for (int i=0; i<niters; i++) {
// Calculate local curvature, distance
list<struct node>::iterator prev, next, cur;
double dx1, dx2, dy1, dy2;
next = cur = prev = nodelist.begin();
cur++;
next++; next++;
dx1 = prev->x[0]-cur->x[0];
dy1 = prev->x[1]-cur->x[1];
while (next != nodelist.end()) {
dx2 = cur->x[0]-next->x[0];
dy2 = cur->x[1]-next->x[1];
double slope1 = (dy1/(dx1+EPSILON));
double slope2 = (dy2/(dx2+EPSILON));
cur->disttoprev = sqrt(dx1*dx1 + dx2*dx2 );
cur->loccurve = ( slope1*slope2*(dy1+dy2) +
slope2*(prev->x[0]+cur->x[0]) -
slope1*(cur->x[0] +next->x[0]) ) /
(2.*(slope2-slope1) + EPSILON);
next++;
cur++;
prev++;
}
// Insert interpolated pt every neveryth pt
int count = 1;
next = cur = nodelist.begin();
next++;
while (next != nodelist.end()) {
if (count % nevery == 0) {
struct node mynode;
mynode.x[0] = (cur->x[0]+next->x[0])/2.;
mynode.x[1] = (cur->x[1]+next->x[1])/2.;
mynode.stuck = false;
mynode.loccurve = -1;
mynode.disttoprev = -1;
nodelist.insert(next,mynode);
}
next++;
cur++;
count++;
}
}
listtime = tock(&listclock);
struct timeval vectclock;
double vecttime;
tick(&vectclock);
for (int i=0; i<niters; i++) {
int nelem = nodevect.size();
double dx1, dy1, dx2, dy2;
dx1 = nodevect[0].x[0]-nodevect[1].x[0];
dy1 = nodevect[0].x[1]-nodevect[1].x[1];
for (int elem=1; elem<nelem-1; elem++) {
dx2 = nodevect[elem].x[0]-nodevect[elem+1].x[0];
dy2 = nodevect[elem].x[1]-nodevect[elem+1].x[1];
double slope1 = (dy1/(dx1+EPSILON));
double slope2 = (dy2/(dx2+EPSILON));
nodevect[elem].disttoprev = sqrt(dx1*dx1 + dx2*dx2 );
nodevect[elem].loccurve = ( slope1*slope2*(dy1+dy2) +
slope2*(nodevect[elem-1].x[0] +
nodevect[elem].x[0]) -
slope1*(nodevect[elem].x[0] +
nodevect[elem+1].x[0]) ) /
(2.*(slope2-slope1) + EPSILON);
}
// Insert interpolated pt every neveryth pt
int count = 1;
vector<struct node>::iterator next, cur;
next = cur = nodevect.begin();
next++;
while (next != nodevect.end()) {
if (count % nevery == 0) {
struct node mynode;
mynode.x[0] = (cur->x[0]+next->x[0])/2.;
mynode.x[1] = (cur->x[1]+next->x[1])/2.;
mynode.stuck = false;
mynode.loccurve = -1;
mynode.disttoprev = -1;
// caution: vector::insert invalidates iterators at/after the insertion point;
// this relies on the reserve() above and typical implementation behaviour
nodevect.insert(next,mynode);
}
next++;
cur++;
count++;
}
}
vecttime = tock(&vectclock);
cout << "Time for list: " << listtime << endl;
cout << "Time for vect: " << vecttime << endl;
vector<struct node>::iterator v;
list <struct node>::iterator l;
if (doPrint) {
cout << "Vector: " << endl;
for (v=nodevect.begin(); v!=nodevect.end(); ++v) {
cout << "[ (" << v->x[0] << "," << v->x[1] << "), " << v->disttoprev << ", " << v->loccurve << "] " << endl;
}
cout << endl << "List: " << endl;
for (l=nodelist.begin(); l!=nodelist.end(); ++l) {
cout << "[ (" << l->x[0] << "," << l->x[1] << "), " << l->disttoprev << ", " << l->loccurve << "] " << endl;
}
}
cout << "List size is " << nodelist.size() << endl;
}
Assuming that creation of new elements happens relatively infrequently, I would take the sorted vector approach, for all the reasons you've listed:
No wasting time following pointers/indices around
Take advantage of spatial locality
Much easier to parallelise
Of course, for this to work, you'd have to make sure that the vector was always sorted, not simply every k-th timestep.
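For reference, a padding insertion that keeps the vector ordered is a one-liner, but it shifts every element behind the insertion point, which is exactly the O(n) cost being traded for the faster, parallel-friendly loops (Node is an illustrative stand-in for the real node type):
#include <cstddef>
#include <vector>
struct Node { double pos[2]; /* other per-node fields */ };
// Insert a padded node right after index i, keeping the chain order intact.
void insert_after(std::vector<Node>& chain, std::size_t i, const Node& padded)
{
    chain.insert(chain.begin() + i + 1, padded);
}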
This looks like a nice exercise for parallel programming students.
You seem to have a data structure that lends itself naturally to distribution: a chain. You can do quite a bit of work over subchains that are (semi-)statically assigned to different threads. You might want to deal with the N-1 boundary cases separately, but if the subchain lengths are > 3, those are isolated from each other.
Sure, between each step you'll have to update global variables, but variables such as chain length are simple parallel additions. Just calculate the length of each subchain and then add those up. If your subchains are 100000/8 long, the single-threaded piece of work is the addition of those 8 subchain lengths between steps.
If the growth of nodes is highly non-uniform, you might want to rebalance the subchain lengths every so often.
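A minimal sketch of what the index-based vector layout with an OpenMP loop could look like; the field names (pos, dist_to_prev, active, prev, next) are illustrative, not taken from the question:
#include <cmath>
#include <vector>
struct Node {
    double pos[2];
    double dist_to_prev;
    bool   active;
    int    prev, next;   // indices of the neighbours, -1 at the chain ends
};
// Local step: each node only reads its neighbours, so iterations are independent
// and can be split across threads; the chain length is a simple reduction.
double local_step(std::vector<Node>& chain)
{
    double total_length = 0.0;
    #pragma omp parallel for reduction(+ : total_length)
    for (int i = 0; i < (int)chain.size(); ++i) {
        Node& n = chain[i];
        if (!n.active || n.prev < 0)
            continue;
        const Node& p = chain[n.prev];
        const double dx = n.pos[0] - p.pos[0];
        const double dy = n.pos[1] - p.pos[1];
        n.dist_to_prev = std::sqrt(dx * dx + dy * dy);
        total_length += n.dist_to_prev;
    }
    return total_length;
}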
In the following example a std::map is filled with 26 entries, keys 'a' - 'z' and values 0 - 25. The time taken (on my system) to look up the last entry (10000000 times) is roughly 250 ms for the vector and 125 ms for the map. (I compiled using release mode, with the O3 option turned on, for g++ 4.4.)
But if for some odd reason I wanted better performance than the std::map, what data structures and functions would I need to consider using?
I apologize if the answer seems obvious to you, but I haven't had much experience in the performance critical aspects of C++ programming.
#include <ctime>
#include <map>
#include <vector>
#include <iostream>
struct mystruct
{
char key;
int value;
mystruct(char k = 0, int v = 0) : key(k), value(v) { }
};
int find(const std::vector<mystruct>& ref, char key)
{
for (std::vector<mystruct>::const_iterator i = ref.begin(); i != ref.end(); ++i)
if (i->key == key) return i->value;
return -1;
}
int main()
{
std::map<char, int> mymap;
std::vector<mystruct> myvec;
for (int i = 'a'; i < 'a' + 26; ++i)
{
mymap[i] = i - 'a';
myvec.push_back(mystruct(i, i - 'a'));
}
int pre = clock();
for (int i = 0; i < 10000000; ++i)
{
find(myvec, 'z');
}
std::cout << "linear scan: milli " << clock() - pre << "\n";
pre = clock();
for (int i = 0; i < 10000000; ++i)
{
mymap['z'];
}
std::cout << "map scan: milli " << clock() - pre << "\n";
return 0;
}
For your example, use int value(char x) { return x - 'a'; }
More generally, since the "keys" are contiguous and dense, use an array (or vector) to guarantee Θ(1) access time.
If you don't need the keys to be sorted, use unordered_map, which should improve most operations from O(log n) to amortized O(1).
(Sometimes, esp. for small data sets, linear search is faster than hash table (unordered_map) / balanced binary trees (map) because the former has a much simpler algorithm, thus reducing the hidden constant in big-O. Profile, profile, profile.)
For starters, you should probably use std::map::find if you want to compare the search times; operator[] has additional functionality over and above the regular find.
Also, your data set is pretty small, which means that the whole vector will easily fit into the processor cache; a lot of modern processors are optimised for this sort of brute-force search so you'd end up getting fairly good performance. The map, while theoretically having better performance (O(log n) rather than O(n)) can't really exploit its advantage of the smaller number of comparisons because there aren't that many keys to compare against and the overhead of its data layout works against it.
TBH for data structures this small, the additional performance gain from not using a vector is often negligible. The "smarter" data structures like std::map come into play when you're dealing with larger amounts of data and a well distributed set of data that you are searching for.
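A sketch of the timing loop using std::map::find, as suggested above; it slots into the question's program in place of the mymap['z'] loop:
pre = clock();
int found = -1;
for (int i = 0; i < 10000000; ++i)
{
    std::map<char, int>::const_iterator it = mymap.find('z');
    if (it != mymap.end())
        found = it->second;   // use the iterator, no second lookup
}
std::cout << "map find: milli " << clock() - pre << " (last value " << found << ")\n";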
If you really just have values for all entries from A to Z, why don't you use letter (properly adjusted) as the index into a vector?:
std::vector<int> direct_map;
direct_map.resize(26);
for (int i = 'a'; i < 'a' + 26; ++i)
{
direct_map[i - 'a']= i - 'a';
}
// ...
int find(const std::vector<int> &direct_map, char key)
{
int index= key - 'a';
if (index>=0 && index<direct_map.size())
return direct_map[index];
return -1;
}