Optimizing time performance of unordered_map in C++ - c++

I'm stuck on an optimization problem. I have a huge database (about 16M entries) that represents ratings given by different users to different items. From this database I have to evaluate a correlation measure between different users (i.e. I have to implement a similarity matrix). Fortunately this correlation matrix is symmetric, so I only have to calculate half of it.
Let me focus for example on the first column of the matrix: there are 135k users in total, so I keep one user fixed and find all the items rated in common between this user and all the other ones (with a for loop). The timing problem appears even if I compare the single user with 20k other users instead of 135k.
My approach is the following: first I query the DB to obtain, for example, all the data of the first 20k users (this takes time even with indexes in place, but it doesn't bother me since I do it only once) and store everything in an unordered_map using the userID as key; the mapped value is another unordered_map that stores all the ratings given by that user, this time keyed by itemID.
Then, in order to find the set of items that both have rated, I iterate over the user who has rated fewer items, checking whether the other one has rated the same items. The fastest data structures I know are hashmaps, but for a single complete column my algorithm takes 30s (just for 20k entries), which translates into WEEKS for the complete matrix.
The code is the following:
void similarity_matrix(sqlite3 *db, sqlite3 *db_avg, sqlite3 *similarity, long int tot_users, long int interval) {
    long int n = 1;
    double sim;
    string temp_s;
    vector<string> insert_query;
    sqlite3_stmt *stmt;
    std::cout << "Starting creating similarity matrix..." << std::endl;
    string query_string = "SELECT * from usersratings where usersratings.user <= 20000;";
    unordered_map<int, unordered_map<int, int>> users_map = db_query(query_string.c_str(), db);
    std::cout << "Query time: " << duration_ << " s." << std::endl;
    unordered_map<int, int> u1_map = users_map[1];
    string select_avg = "SELECT * from averages;";
    unordered_map<int, double> avg_map = avg_value(select_avg.c_str(), db_avg);
    for (int i = 2; i <= tot_users; i++)
    {
        unordered_map<int, int> user;
        int compare_id;
        if (users_map[i].size() <= u1_map.size()) {
            user = users_map[i];
            compare_id = 1;
        }
        else {
            user = u1_map;
            compare_id = i;
        }
        int matches = 0;
        double newnum = 0;
        double newden1 = 0;
        double newden2 = 0;
        unordered_map<int, int> item_map = users_map[compare_id];
        for (unordered_map<int, int>::iterator it = user.begin(); it != user.end(); ++it)
        {
            if (item_map.size() != 0) {
                int rating = item_map[it->first];
                if (rating != 0) {
                    double diff1 = (it->second - avg_map[1]);
                    double diff2 = (rating - avg_map[i]);
                    newnum += diff1 * diff2;
                    newden1 += pow(diff1, 2);
                    newden2 += pow(diff2, 2);
                }
            }
        }
        sim = newnum / (sqrt(newden1) * sqrt(newden2));
    }
    std::cout << "Execution time for first column: " << duration << " s." << std::endl;
    std::cout << "First column finished..." << std::endl;
}

This stands out to me as an immediate potential performance trap:
unordered_map<int, unordered_map<int, int>> users_map = db_query(query_string.c_str(), db);
If the size of each sub-map for each user is anywhere close to the number of users, then you have a quadratic-complexity algorithm, which is going to get drastically slower the more users you have.
unordered_map does offer constant-time search on average, but it's still a search. The number of instructions required to do it is going to dwarf, say, the cost of indexing an array, especially if there are many collisions, which imply inner loops each time you search the map. It also isn't necessarily represented in a way that allows the fastest sequential iteration. So if you can just use std::vector for at least the sub-lists and avg_map, like so, that should help a lot for starters:
typedef pair<int, int> ItemRating;
typedef vector<ItemRating> ItemRatings;
unordered_map<int, ItemRatings> users_map = ...;
vector<double> avg_map = ...;
Even the outer users_map could be a vector, unless it's sparse and not all indices are used. If it is sparse but the range of user IDs still fits into a reasonable range (not an astronomically large integer), you could construct two vectors: one stores the user data and has a size proportional to the number of users, while the other is proportional to the valid range of user IDs and stores nothing but indices into the first vector. That translates a user ID to an index with a simple array lookup whenever you need to access user data through a user ID.
// User data array.
vector<ItemRatings> user_data(num_users);
// Array that translates sparse user ID integers to indices into the
// above dense array. A value of -1 indicates that a user ID is not used.
// To fetch user data for a particular user ID, we do:
// const ItemRatings& ratings = user_data[user_id_to_index[user_id]];
vector<int> user_id_to_index(biggest_user_index+1, -1);
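A sketch of how that translation table might be filled; `user_ids` and `build_user_index` are names invented for this example (the original code gets its IDs from the SQL query):

```cpp
#include <vector>

// Builds the sparse-user-ID -> dense-index table described above.
// Any ID that never appears stays at -1.
std::vector<int> build_user_index(const std::vector<int>& user_ids,
                                  int biggest_user_index) {
    std::vector<int> user_id_to_index(biggest_user_index + 1, -1);
    for (std::size_t k = 0; k < user_ids.size(); ++k)
        user_id_to_index[user_ids[k]] = static_cast<int>(k);
    return user_id_to_index;
}
```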
You're also copying those unordered_maps needlessly on each iteration of the outer loop. While I don't think that's the biggest bottleneck, it would help to avoid deep-copying data structures you don't even modify, by using references or pointers:
// Shallow copy, don't deep copy big stuff needlessly.
const unordered_map<int, int>& user = users_map[i].size() <= u1_map.size() ?
users_map[i]: u1_map;
const int compare_id = users_map[i].size() <= u1_map.size() ? 1: i;
const unordered_map<int, int>& item_map = users_map[compare_id];
...
You also don't need to check whether item_map is empty inside the inner loop; that check should be hoisted outside. It's a micro-optimization that is unlikely to help much, but it still eliminates blatant waste.
The final code after this first pass would be something like this:
vector<ItemRatings> user_data = ...;
vector<double> avg_map = ...; // averages indexed like user_data (first user at 0)
// Fill `rating_values` with the ratings from the first user.
vector<int> rating_values(item_range, 0);
const ItemRatings& ratings1 = user_data[0];
for (auto it = ratings1.begin(); it != ratings1.end(); ++it)
{
    const int item = it->first;
    const int rating = it->second;
    rating_values[item] = rating;
}
// For each user starting from the second user:
for (int i = 1; i < tot_users; ++i)
{
    double newnum = 0;
    double newden1 = 0;
    double newden2 = 0;
    const ItemRatings& ratings2 = user_data[i];
    for (auto it = ratings2.begin(); it != ratings2.end(); ++it)
    {
        // Rating the first user gave to this item (0 if unrated).
        const int rating1 = rating_values[it->first];
        if (rating1 != 0) {
            const int rating2 = it->second;
            double diff1 = rating1 - avg_map[0];
            double diff2 = rating2 - avg_map[i];
            newnum += diff1 * diff2;
            newden1 += pow(diff1, 2);
            newden2 += pow(diff2, 2);
        }
    }
    double sim = newnum / (sqrt(newden1) * sqrt(newden2));
}
The biggest difference in the above code is that we eliminated all searches through unordered_map and replaced them with simple indexed access of an array. We also eliminated a lot of redundant copying of data structures.

Related

What C++ container should I choose for frequency graph?

I have a little project. I want to make a frequency graph something like this:
a-axis
3|          x
2|       x  x
1|    x  x  x
0| x  x  x  x
  _____________
   2  4  6  7  9  b-axis
What container should I use, and how do I go about implementing that in the simplest way?
The container should also record zero value (frequency).
I find it difficult to implement the container correctly. Is there any sample code I can look at? Or can anybody write a sample here for me to study?
Update:
I was thinking of making std::map<int,std::vector<int>> keyAndMap
How do I access the vector that is inside keyAndMap?
Because I somehow can't do keyAndMap.insert(std::make_pair<int,std::vector<int>>(bAxis,bVec.push_back(bValue)));
The main problem, I think, is how to generate a unique vector for each key in the std::map.
const int tries = 21;
std::vector<int> values;
std::map<int, std::vector<int>> keyAndMap;
for (int i = 1; i < tries; i++)
{
    int n = i;
    int cycle = calculate(n);
    /*This next line is an error... why?*/
    keyAndMap.insert(std::make_pair<int,std::vector<int>>(cycle,values.push_back(i)));
}
Although, if I somehow got past that line, my logic here is flawed because the vector is not unique to each key.
Each 'x' in the little graph I drew is gained with each pass of the loop.
So each b-axis value (2, 4, 6, 7, and 9) should have its own std::vector.
What container should I use and how do go about implementing that in the simplest way?
A possible container would be a std::map<key, frequency>, both of type int, with the additional advantage that your values are sorted. Then you could simply read values with something like:
std::map<int, int> data;
int key;
while (std::cin >> key)
{
    ++data[key]; // increment the frequency of a given key*
}
To access the data you could use iterators, like so:
for (auto it = data.begin(); it != data.end(); ++it)
{
    // it->first is the key, it->second its frequency
    std::cout << it->first << ": " << it->second << "\n";
}
Which will print a two-column table of your data.
Second, even simpler approach:
You could use a std::vector<frequency>, where frequency is of type int, and your data values would be the indexes of the std::vector. However, you should know the range** of your data so that you can use it as the initial vector size:
i.e. std::vector<int> data(5, 0) could hold values with range 5.
Then if, for example, your data consists of the integers [0, 4], you could have:
std::vector<int> data(5, 0); // five elements with initial value 0
int key;
while (std::cin >> key) // assuming key >= 0 && key < 5
{
    data[key] += 1;
}
and then print with:
for (int i = 0; i < data.size(); ++i)
{
    std::cout << i << ": " << data[i] << "\n";
}
* The default initial frequency value of a distinct key is: 0, so you can simply use the key to access and increment an element.
**range - difference between the lowest and highest values.
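As an aside on the update in the previous question: the `keyAndMap.insert` line fails to compile because `push_back` returns `void`, so it cannot be used as the second member of the pair. `operator[]` sidesteps this by default-constructing an empty vector the first time a key is seen, which also makes the vector unique to each key. A minimal sketch (`record` is a name invented for this example):

```cpp
#include <map>
#include <vector>

std::map<int, std::vector<int>> keyAndMap;

// operator[] creates an empty vector for a new key, so each key gets its
// own vector to append to.
void record(int cycle, int i) {
    keyAndMap[cycle].push_back(i);
}
```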

How to erase items which are values of map

I have a map like this:
std::map<unsigned, std::pair<string, timestamp>> themap;
But I need to manage the size of the map by retaining only the highest 1000 timestamps in the map. I am wondering what is the most efficient way to handle this?
Should I somehow make a copy of the map above into a
std::map<timestamp, std::pair<string, unsigned>>, erase the elements not in the top 1000, then massage this map back into the original?
Or some other way?
Here is my code.
/* map of callid (a number) to a carid (a hex string) and a timestamp (just using unsigned here).
The map will grow through time and could potentially grow to some massive size that would use up all
the computer's memory. So the max size of the map can be kept to 1000 (safe to do so). I want to remove
items based on timestamp.
*/
#include <vector>
#include <string>
#include <iostream>
#include <algorithm>
#include <map>
typedef unsigned callid;
typedef unsigned timestamp;
typedef std::string carid;
typedef std::pair<carid,timestamp> caridtime;
typedef std::map<callid,caridtime> callid2caridmap;
int main() {
    //map of callid -> (carid,timestamp)
    callid2caridmap cmap;
    //test data below
    const std::string startstring("4559023584c8");
    std::vector<carid> caridvec;
    caridvec.reserve(1000);
    for(int i = 1; i < 2001; ++i) {
        char buff[20] = {0};
        sprintf(buff, "%04u", i);
        std::string s(startstring);
        s += buff;
        caridvec.push_back(s);
    }
    //generate some made up callids
    std::vector<callid> tsvec;
    for(int i = 9999; i < 12000; ++i) {
        tsvec.push_back(i);
    }
    //populate map
    for(unsigned i = 0; i < 2000; ++i)
        cmap[tsvec[i]] = std::make_pair(caridvec[i], i+1);
    //expiry handling
    static const int MAXNUMBER = 1000;
    // what I want to do is retain the top 1000 with the highest timestamps and remove all other entries.
    // But of course the map is ordered by the key.
    // What is the best approach? std::transform?
    // Just iterate each one? But then I don't know my criterion for erasing until I have
    // found the largest 1000 items.
    // std::for_each(cmap.begin(), cmap.end(), cleaner);
    // nth_element seems appropriate. Do I reverse the map and have the key as timestamp, use nth_element
    // to work out which part to erase, then un-reverse the map as before with 1000 elements?
    // std::nth_element(coll.begin(), coll.begin()+MAXNUMBER, coll.end());
    // erase from coll.begin()+MAXNUMBER to coll.end()
    return 0;
}
UPDATE:
Here is a solution which I am playing with.
// as the map is populated, also fill a queue with the timestamps
std::deque<timestamp> tsq;
for(unsigned i = 0; i < 2000; ++i) {
    cmap[tsvec[i]] = std::make_pair(caridvec[i], i+1);
    tsq.push_back(tsvec[i]);
}
std::cout << "initial cmap size = " << cmap.size() << std::endl;
// expire old entries
static const int MAXNUMBER = 1000;
while(tsq.size() > MAXNUMBER) {
    callid2caridmap::iterator it = cmap.find(tsq.front());
    if(it != cmap.end())
        cmap.erase(it);
    tsq.pop_front();
}
std::cout << "cmap size now = " << cmap.size() << std::endl;
But still interested in any possible alternatives.
Make a heap of timestamp -> iterator to the object in the map.
The heap will hold <= 1000 items; to retain the 1000 highest timestamps, keep the smallest retained timestamp on top (i.e. a min-heap over the kept items).
When inserting, check that you either have < 1000 items, or that the new timestamp is greater than the heap's top; in the latter case pop the top and erase the corresponding map entry, if all this makes sense.
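A rough sketch of that idea using std::priority_queue with std::greater as a min-heap. `insert_bounded` is a name invented for this example, and it assumes each callid is inserted only once, so no stale heap entries arise:

```cpp
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

typedef unsigned callid;
typedef unsigned timestamp;
typedef std::pair<std::string, timestamp> caridtime;
typedef std::map<callid, caridtime> callid2caridmap;
typedef std::pair<timestamp, callid> heap_entry;
typedef std::priority_queue<heap_entry, std::vector<heap_entry>,
                            std::greater<heap_entry> > min_heap;

// Insert an entry and, if the map now exceeds maxnumber, evict the entry
// with the lowest timestamp (the heap's top) in O(log maxnumber).
void insert_bounded(callid2caridmap& cmap, min_heap& heap,
                    std::size_t maxnumber, callid id,
                    const std::string& car, timestamp ts) {
    cmap[id] = std::make_pair(car, ts);
    heap.push(std::make_pair(ts, id));
    if (heap.size() > maxnumber) {
        cmap.erase(heap.top().second);
        heap.pop();
    }
}
```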

Is there a more efficient way to calculate percentages?

I am using boost/random.hpp to fill a std::map with random numbers on the interval [1,3], and I have thrown together something that gives me the % of each count relative to the total number of generated values, but I was looking for a more efficient way of doing it. I have been trying to find something relevant in the Boost library, but without luck; is there something in Boost that can use my map (I don't want to change my map's types) to calculate the percentages, or anything else I should consider?
#include <ctime>
#include <iostream>
#include <map>
#include <boost/random.hpp>

int main()
{
    std::map<int, long> results;
    int current;
    long one = 0;
    long two = 0;
    long three = 0;
    long total = 0;
    boost::random::mt19937 rng;
    rng.seed(static_cast<boost::uint32_t>(std::time(0)));
    boost::random::uniform_int_distribution<int> random(1, 3);
    for (int n = 0; n < 1000000; ++n)
    {
        current = random(rng);
        ++total;
        switch (current)
        {
        case 1:
            ++one;
            break;
        case 2:
            ++two;
            break;
        case 3:
            ++three;
            break;
        }
    }
    results[1] = one;
    results[2] = two;
    results[3] = three;
    std::cout << (double) results[1]/total*100 << std::endl; // etc.
}
edit: I don't want to change the map container in any way.
You say you don't want to change the map type, but I don't see much reason to use a map at all for this job. It seems like the obvious choice would be a vector:
static const unsigned total = 1000000;
std::vector<unsigned> values(3);
for (unsigned i = 0; i < total; i++)
    ++values[random(rng) - 1];
for (unsigned i = 0; i < values.size(); i++)
    std::cout << (values[i] * 100.0) / total;
Why don't you profile it? There is no point in optimizing the percentage part until you know how it impacts the speed of the program as a whole. For example, if the percentage part takes only 1% of the time of the program (most being spent in random number generation) then even doubling the efficiency would increase the speed by only .5%.
Efficient? Discard the map and declare results as an array of 4 elements: int results[4] = {0};. Then instead of the switch/case you can directly do ++results[current];.
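A minimal sketch of that array-based tally; std::rand stands in for the Boost generator here, which is an assumption of this example, and `draw_percentages` is a name invented for it:

```cpp
#include <array>
#include <cstdlib>

// Tally draws of 1..3 in a plain array (index 0 unused) and convert the
// counts to percentages; no map lookups or switch needed.
std::array<double, 4> draw_percentages(int draws) {
    long results[4] = {0};
    for (int n = 0; n < draws; ++n)
        ++results[1 + std::rand() % 3];   // direct indexed increment
    std::array<double, 4> pct = {};
    for (int k = 1; k <= 3; ++k)
        pct[k] = 100.0 * results[k] / draws;
    return pct;
}
```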

Optimized way to find M largest elements in an NxN array using C++

I need a blazing fast way to find the 2D positions and values of the M largest elements in an NxN array.
right now I'm doing this:
struct SourcePoint {
    Point point;
    float value;
};

SourcePoint* maxValues = new SourcePoint[M];

for (int j = 0; j < rows; j++) {
    for (int i = 0; i < cols; i++) {
        float sample = arr[i][j];
        if (sample > maxValues[0].value) {
            int q = 1;
            while (q < M && sample > maxValues[q].value) {
                maxValues[q-1] = maxValues[q]; // shuffle the values back
                q++;
            }
            maxValues[q-1].value = sample;
            maxValues[q-1].point = Point(i,j);
        }
    }
}
A Point struct is just two ints - x and y.
This code basically does an insertion sort on the values coming in. maxValues[0] always contains the SourcePoint with the lowest value that still keeps it within the top M values encountered so far. This gives us a quick and easy bailout: if sample <= maxValues[0].value, we don't do anything. The issue I'm having is the shuffling every time a new better value is found. It works its way down maxValues until it finds its spot, shuffling all the elements in maxValues to make room for itself.
I'm getting to the point where I'm ready to look into SIMD solutions, or cache optimisations, since it looks like there's a fair bit of cache thrashing happening. Cutting the cost of this operation down will dramatically affect the performance of my overall algorithm since this is called many many times and accounts for 60-80% of my overall cost.
I've tried using a std::vector and make_heap, but I think the overhead for creating the heap outweighed the savings of the heap operations. This is likely because M and N generally aren't large. M is typically 10-20 and N 10-30 (NxN 100 - 900). The issue is this operation is called repeatedly, and it can't be precomputed.
I just had a thought to pre-load the first M elements of maxValues which may provide some small savings. In the current algorithm, the first M elements are guaranteed to shuffle themselves all the way down just to initially fill maxValues.
Any help from optimization gurus would be much appreciated :)
A few ideas you can try. In some quick tests with N=100 and M=15, I was able to get it around 25% faster in VC++ 2010, but test it yourself to see whether any of them help in your case. Some of these changes may have no effect, or even a negative one, depending on the actual usage/data and compiler optimizations.
Don't allocate a new maxValues array each time unless you need to. Using a stack variable instead of dynamic allocation gets me +5%.
Changing g_Source[i][j] to g_Source[j][i] gains you a very little bit (not as much as I thought it would).
Using the structure SourcePoint1 listed at the bottom gets me another few percent.
The biggest gain of around +15% was to replace the local variable sample with g_Source[j][i]. The compiler is likely smart enough to optimize out the multiple reads to the array which it can't do if you use a local variable.
Trying a simple binary search netted me a small loss of a few percent. For larger M/Ns you'd likely see a benefit.
If possible, try to keep the source data in arr[][] sorted, even if only partially. Ideally you'd want to generate maxValues[] at the same time the source data is created.
Looking at how the data is created/stored/organized may give you patterns or information that reduce the time needed to generate your maxValues[] array. For example, in the best case you could come up with a formula that gives you the top M coordinates without needing to iterate and sort.
Code for above:
struct SourcePoint1 {
int x;
int y;
float value;
int test; //Play with manual/compiler padding if needed
};
If you want to go into micro-optimizations at this point, a simple first step would be to get rid of the Points and just pack both dimensions into a single int. That reduces the amount of data you need to shift around, and gets SourcePoint down to a power-of-two size, which simplifies indexing into it.
Also, are you sure that keeping the list sorted is better than simply recomputing which element is the new lowest after each time you shift the old lowest out?
(Updated 22:37 UTC 2011-08-20)
I propose a binary min-heap of fixed size holding the M largest elements (but still in min-heap order!). It probably won't be faster in practice, as I think the OP's insertion sort probably has decent real-world performance (at least when the recommendations of the other posters in this thread are taken into account).
Look-up in the case of failure should be constant time: If the current element is less than the minimum element of the heap (containing the max M elements) we can reject it outright.
If it turns out that we have an element bigger than the current minimum of the heap (the Mth biggest element) we extract (discard) the previous min and insert the new element.
If the elements are needed in sorted order the heap can be sorted afterwards.
First attempt at a minimal C++ implementation:
template<unsigned size, typename T>
class m_heap {
private:
    T nodes[size];
    static const unsigned last = size - 1;
    // With a 0-based array, parent(i) = (i - 1) / 2 implies children at
    // 2i + 1 and 2i + 2 (the original i * 2 / i * 2 + 1 was off by one).
    static unsigned parent(unsigned i) { return (i - 1) / 2; }
    static unsigned left(unsigned i) { return i * 2 + 1; }
    static unsigned right(unsigned i) { return i * 2 + 2; }
    void bubble_down(unsigned i) {
        for (;;) {
            unsigned j = i;
            if (left(i) < size && nodes[left(i)] < nodes[i])
                j = left(i);
            if (right(i) < size && nodes[right(i)] < nodes[j])
                j = right(i);
            if (i != j) {
                swap(nodes[i], nodes[j]);
                i = j;
            } else {
                break;
            }
        }
    }
    void bubble_up(unsigned i) {
        while (i > 0 && nodes[i] < nodes[parent(i)]) {
            swap(nodes[parent(i)], nodes[i]);
            i = parent(i);
        }
    }
public:
    m_heap() {
        for (unsigned i = 0; i < size; i++) {
            // lowest() rather than min(): for floating-point types min()
            // is the smallest *positive* value.
            nodes[i] = numeric_limits<T>::lowest();
        }
    }
    void add(const T& x) {
        if (x < nodes[0]) {
            // reject outright
            return;
        }
        // replace the current minimum and restore the heap property
        nodes[0] = x;
        bubble_down(0);
    }
    T* get() { return nodes; }
};
Small test/usage case:
#include <iostream>
#include <limits>
#include <algorithm>
#include <vector>
#include <stdlib.h>
#include <assert.h>
#include <math.h>
using namespace std;
// INCLUDE TEMPLATED CLASS FROM ABOVE
typedef vector<float> vf;
bool compare(float a, float b) { return a > b; }
int main()
{
    int N = 2000;
    vf v;
    for (int i = 0; i < N; i++) v.push_back( rand()*1e6 / RAND_MAX );
    static const int M = 50;
    m_heap<M, float> h;
    for (int i = 0; i < N; i++) h.add( v[i] );
    sort(v.begin(), v.end(), compare);
    vf heap(h.get(), h.get() + M); // assume public in m_heap: T* get() { return nodes; }
    sort(heap.begin(), heap.end(), compare);
    cout << "Real\tFake" << endl;
    for (int i = 0; i < M; i++) {
        cout << v[i] << "\t" << heap[i] << endl;
        if (fabs(v[i] - heap[i]) > 1e-5) abort();
    }
}
You're looking for a priority queue:
template < class T, class Container = vector<T>,
class Compare = less<typename Container::value_type> >
class priority_queue;
You'll need to figure out the best underlying container to use, and probably define a Compare function to deal with your Point type.
If you want to optimize it, you could run a queue on each row of your matrix in its own worker thread, then run an algorithm to pick the largest item of the queue fronts until you have your M elements.
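A sketch of that approach with a min-heap std::priority_queue, so the smallest retained value sits on top and rejections are a single comparison. `top_m` is a name invented for this example, and it keeps values only, without the Point bookkeeping:

```cpp
#include <functional>
#include <queue>
#include <vector>

// Keep the M largest values: reject anything not larger than the smallest
// retained value (the heap top), otherwise evict the top and insert.
std::vector<float> top_m(const std::vector<float>& data, std::size_t m) {
    std::priority_queue<float, std::vector<float>, std::greater<float> > q;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (q.size() < m) {
            q.push(data[i]);
        } else if (data[i] > q.top()) {
            q.pop();
            q.push(data[i]);
        }
    }
    std::vector<float> out;
    while (!q.empty()) { out.push_back(q.top()); q.pop(); }
    return out; // ascending: smallest of the top M first
}
```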
A quick optimization would be to add a sentinel value to your maxValues array. If you have maxValues[M].value equal to std::numeric_limits<float>::max(), then you can eliminate the q < M test in your while loop condition.
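A sketch of the sentinel version, assuming the array is allocated with M+1 slots and maxValues[M].value holds the float maximum; `insert_sample` is a name invented for this example:

```cpp
#include <limits>

struct SourcePoint {
    int x;
    int y;
    float value;
};

// Because maxValues[M].value == std::numeric_limits<float>::max(), the scan
// always stops before walking past the sentinel, so no `q < M` check is needed.
void insert_sample(SourcePoint* maxValues, int i, int j, float sample) {
    if (sample > maxValues[0].value) {
        int q = 1;
        while (sample > maxValues[q].value) {
            maxValues[q - 1] = maxValues[q]; // shuffle smaller values back
            ++q;
        }
        maxValues[q - 1].x = i;
        maxValues[q - 1].y = j;
        maxValues[q - 1].value = sample;
    }
}
```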
One idea would be to use the std::partial_sort algorithm on a plain one-dimensional sequence of references into your NxN array. You could probably also cache this sequence of references for subsequent calls. I don't know how well it performs, but it's worth a try - if it works good enough, you don't have as much "magic". In particular, you don't resort to micro optimizations.
Consider this showcase:
#include <algorithm>
#include <iostream>
#include <vector>
#include <stddef.h>
#include <string.h> // for memset
static const int M = 15;
static const int N = 20;
// Represents a reference to a sample of some two-dimensional array
class Sample
{
public:
    Sample( float *arr, size_t row, size_t col )
        : m_arr( arr ),
          m_row( row ),
          m_col( col )
    {
    }
    inline operator float() const {
        return m_arr[m_row * N + m_col];
    }
    bool operator<( const Sample &rhs ) const {
        // sort descending, so the largest samples come first
        return (float)rhs < (float)*this;
    }
    int row() const {
        return m_row;
    }
    int col() const {
        return m_col;
    }
private:
    float *m_arr;
    size_t m_row;
    size_t m_col;
};
int main()
{
    // Setup a demo array
    float arr[N][N];
    memset( arr, 0, sizeof( arr ) );
    // Put in some sample values
    arr[2][1] = 5.0;
    arr[9][11] = 2.0;
    arr[5][4] = 4.0;
    arr[15][7] = 3.0;
    arr[12][19] = 1.0;
    // Setup the sequence of references into this array; you could keep
    // a copy of this sequence around to reuse it later, I think.
    std::vector<Sample> samples;
    samples.reserve( N * N );
    for ( size_t row = 0; row < N; ++row ) {
        for ( size_t col = 0; col < N; ++col ) {
            samples.push_back( Sample( (float *)arr, row, col ) );
        }
    }
    // Let partial_sort find the M largest entries
    std::partial_sort( samples.begin(), samples.begin() + M, samples.end() );
    // Print out the row/column of the M largest entries.
    for ( std::vector<Sample>::size_type i = 0; i < M; ++i ) {
        std::cout << "#" << (i + 1) << " is " << (float)samples[i] << " at " << samples[i].row() << "/" << samples[i].col() << std::endl;
    }
}
First of all, you are marching through the array in the wrong order!
You always, always, always want to scan through memory linearly. That means the last index of your array needs to be changing fastest. So instead of this:
for (int j = 0; j < rows; j++) {
    for (int i = 0; i < cols; i++) {
        float sample = arr[i][j];
Try this:
for (int i = 0; i < cols; i++) {
    for (int j = 0; j < rows; j++) {
        float sample = arr[i][j];
I predict this will make a bigger difference than any other single change.
Next, I would use a heap instead of a sorted array. The standard <algorithm> header already has push_heap and pop_heap functions to use a vector as a heap. (This will probably not help all that much, though, unless M is fairly large. For small M and a randomized array, you do not wind up doing all that many insertions on average... Something like O(log N) I believe.)
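A sketch of those &lt;algorithm&gt; heap calls, keeping a vector as a bounded min-heap of the best M values (std::greater makes the front the smallest retained value); `keep_top_m` is a name invented for this example:

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Offer one value to a bounded min-heap of size m: grow until full, then
// replace the front (the current minimum) only when the new value beats it.
void keep_top_m(std::vector<float>& heap, std::size_t m, float v) {
    if (heap.size() < m) {
        heap.push_back(v);
        std::push_heap(heap.begin(), heap.end(), std::greater<float>());
    } else if (v > heap.front()) {
        std::pop_heap(heap.begin(), heap.end(), std::greater<float>()); // min to back
        heap.back() = v;
        std::push_heap(heap.begin(), heap.end(), std::greater<float>());
    }
}
```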
Next after that is to use SSE2. But that is peanuts compared to marching through memory in the right order.
You should be able to get nearly linear speedup with parallel processing.
With N CPUs, you can process a band of rows/N rows (and all columns) with each CPU, finding the top M entries in each band. And then do a selection sort to find the overall top M.
You could probably do that with SIMD as well (but here you'd divide up the task by interleaving columns instead of banding the rows). Don't try to make SIMD do your insertion sort faster, make it do more insertion sorts at once, which you combine at the end using a single very fast step.
Naturally you could do both multi-threading and SIMD, but on a problem which is only 30x30, that's not likely to be worthwhile.
I tried replacing float by double, and interestingly that gave me a speed improvement of about 20% (using VC++ 2008). That's a bit counterintuitive, but it seems modern processors or compilers are optimized for double value processing.
Use a linked list to store the best M values so far. You'll still have to iterate over it to find the right spot, but the insertion itself is O(1). It would probably even beat binary search plus insertion into an array: O(N) search + O(1) insert vs. O(lg N) search + O(N) insert.
Interchange the fors, so you're not accessing every Nth element in memory and thrashing the cache.
LE: Throwing another idea that might work for uniformly distributed values.
Find the min, max in 3/2*O(N^2) comparisons.
Create anywhere from N to N^2 uniformly distributed buckets, preferably closer to N^2 than N.
For every element in the NxN matrix, place it in bucket[(int)((value - min) / range * (num_buckets - 1))], where range = max - min.
Finally, create a set starting from the highest bucket down to the lowest, adding whole buckets while |current set| + |next bucket| <= M.
If you get M elements, you're done.
Otherwise you'll end up with fewer elements than M, say P.
Apply your original algorithm to the next remaining bucket and take the biggest M-P elements out of it.
If the elements are uniform and you use N^2 buckets, its complexity is about 3.5*(N^2), vs your current solution, which is about O(N^2)*ln(M).
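A rough sketch of that bucketing scheme on a flattened array, assuming roughly uniform data; `top_m_buckets` is a name invented for this example, and each visited bucket is sorted, which stays cheap when the many buckets each hold only a few elements:

```cpp
#include <algorithm>
#include <vector>

// Spread values into buckets by value, then collect from the highest
// bucket down until m elements have been gathered.
std::vector<float> top_m_buckets(const std::vector<float>& v, std::size_t m) {
    float min = *std::min_element(v.begin(), v.end());
    float max = *std::max_element(v.begin(), v.end());
    float range = (max - min) + 1e-6f;        // epsilon keeps the index in bounds
    std::size_t nb = v.size();                // ~N^2 buckets for an NxN array
    std::vector<std::vector<float> > buckets(nb);
    for (std::size_t i = 0; i < v.size(); ++i)
        buckets[(std::size_t)((v[i] - min) / range * (nb - 1))].push_back(v[i]);
    std::vector<float> out;
    for (std::size_t b = nb; b-- > 0 && out.size() < m; ) {
        std::sort(buckets[b].begin(), buckets[b].end(), std::greater<float>());
        for (std::size_t k = 0; k < buckets[b].size() && out.size() < m; ++k)
            out.push_back(buckets[b][k]);
    }
    return out; // descending
}
```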

What is the better implementation strategy?

This question is about the best strategy for implementing the following simulation in C++.
I'm trying to make a simulation as part of a physics research project, which basically tracks the dynamics of a chain of nodes in space. Each node contains a position together with certain parameters (local curvature, velocity, distance to neighbors, etc…) which all evolve through time.
Each time step can be broken down to these four parts:
Calculate local parameters. The values are dependent on the nearest neighbors in the chain.
Calculate global parameters.
Evolving. The position of each node is moved a small amount, depending on global and local parameters, and some random force fields.
Padding. New nodes are inserted if the distance between two consecutive nodes reach a critical value.
(In addition, nodes can get stuck, which make them inactive for the rest of the simulation. The local parameters of inactive nodes with inactive neighbors, will not change, and does not need any more calculation.)
Each node occupies ~60 bytes, I have ~100,000 nodes in the chain, and I need to evolve the chain for about ~1,000,000 time steps. I would, however, like to maximize these numbers, as that would increase the accuracy of my simulation, under the restriction that the simulation finishes in reasonable time (~hours). (~30% of the nodes will be inactive.)
I have started to implement this simulation as a doubly linked list in C++. This seems natural, as I need to insert new nodes in between existing ones, and because the local parameters depends on the nearest neighbors. (I added an extra pointer to the next active node, to avoid unnecessary calculation, whenever I loop over the whole chain).
I'm no expert when it comes to parallelization (or coding, for that matter), but I have played around with OpenMP, and I really like how I can speed up for loops of independent operations with two lines of code. I do not know how to make my linked list do stuff in parallel, or whether it even works (?). So I had this idea of working with std::vector, where instead of having pointers to the nearest neighbors, I could store the indices of the neighbors and access them by standard lookup. I could also sort the vector by position along the chain (every x'th timestep) to get better locality in memory. This approach would allow looping the OpenMP way.
I'm kind of drawn to the idea, as I wouldn't have to deal with memory management. And I guess that the std::vector implementation is way better than my simple 'new' and 'delete' way of dealing with Nodes in the list. I know I could have done the same with std::list, but I don't like the way I have to access the nearest neighbors with iterators.
So I ask you, 1337 h4x0rs and skilled programmers: what would be a better design for my simulation? Is the vector approach sketched above a good idea? Or are there tricks to make a linked list work with OpenMP? Or should I consider a totally different approach?
The simulation is going to run on a computer with 8 cores and 48G RAM, so I guess I can trade a lot of memory for speed.
Thanks in advance
Edit:
I need to add 1-2 % new nodes each time step, so storing them as a vector without indices to nearest neighbors won't work unless I sort the vector every time step.
This is a classic tradeoff question. Using an array or std::vector will make the calculations faster and the insertions slower; using a doubly linked list or std::list will make the insertions faster and the calculations slower.
The only way to judge tradeoff questions is empirically: which will work faster for your particular application? All you can really do is try it both ways and see. The more intense the computation and the shorter the stencil (that is, the higher the computational intensity: how many flops you do per amount of memory you bring in), the less a standard array will matter. But basically you should mock up your basic computation both ways and see if it matters. I've hacked together a very crude go at something with both std::vector and std::list; it is probably wrong in any number of ways, but you can give it a go, play with some of the parameters, and see which wins for you. On my system, for the sizes and amount of computation given, list is faster, but it could easily go either way.
W/rt OpenMP: yes, if that's the way you're going to go, your hands are somewhat tied; you'll almost certainly have to go with the vector structure. But first you should make sure that the extra cost of the insertions won't blow away the benefit of multiple cores.
#include <iostream>
#include <list>
#include <vector>
#include <cmath>
#include <sys/time.h>
using namespace std;
struct node {
bool stuck;
double x[2];
double loccurve;
double disttoprev;
};
void tick(struct timeval *t) {
gettimeofday(t, NULL);
}
/* returns time in seconds from now to time described by t */
double tock(struct timeval *t) {
struct timeval now;
gettimeofday(&now, NULL);
return (double)(now.tv_sec - t->tv_sec) +
((double)(now.tv_usec - t->tv_usec)/1000000.);
}
int main()
{
const int nstart = 100;
const int niters = 100;
const int nevery = 30;
const bool doPrint = false;
list<struct node> nodelist;
vector<struct node> nodevect;
// Note - vector is *much* faster if you know ahead of time
// maximum size of vector
nodevect.reserve(nstart*30);
// Initialize
for (int i = 0; i < nstart; i++) {
struct node mynode;   // stack object; push_back copies it, so no leak
mynode.stuck = false;
mynode.x[0] = i; mynode.x[1] = 2.*i;
mynode.loccurve = -1;
mynode.disttoprev = -1;
nodelist.push_back( mynode );
nodevect.push_back( mynode );
}
const double EPSILON = 1.e-6;
struct timeval listclock;
double listtime;
tick(&listclock);
for (int i=0; i<niters; i++) {
// Calculate local curvature, distance
list<struct node>::iterator prev, next, cur;
double dx1, dx2, dy1, dy2;
next = cur = prev = nodelist.begin();
cur++;
next++; next++;
dx1 = prev->x[0]-cur->x[0];
dy1 = prev->x[1]-cur->x[1];
while (next != nodelist.end()) {
dx2 = cur->x[0]-next->x[0];
dy2 = cur->x[1]-next->x[1];
double slope1 = (dy1/(dx1+EPSILON));
double slope2 = (dy2/(dx2+EPSILON));
cur->disttoprev = sqrt(dx1*dx1 + dy1*dy1);  // Euclidean distance to prev
cur->loccurve = ( slope1*slope2*(dy1+dy2) +
slope2*(prev->x[0]+cur->x[0]) -
slope1*(cur->x[0] +next->x[0]) ) /
(2.*(slope2-slope1) + EPSILON);
next++;
cur++;
prev++;
dx1 = dx2; dy1 = dy2;  // carry this segment's differences to the next node
}
// Insert interpolated pt every neveryth pt
int count = 1;
next = cur = nodelist.begin();
next++;
while (next != nodelist.end()) {
if (count % nevery == 0) {
struct node mynode;   // stack object; insert copies it, so no leak
mynode.x[0] = (cur->x[0]+next->x[0])/2.;
mynode.x[1] = (cur->x[1]+next->x[1])/2.;
mynode.stuck = false;
mynode.loccurve = -1;
mynode.disttoprev = -1;
nodelist.insert(next, mynode);
}
next++;
cur++;
count++;
}
}
listtime = tock(&listclock);
struct timeval vectclock;
double vecttime;
tick(&vectclock);
for (int i=0; i<niters; i++) {
int nelem = nodevect.size();
double dx1, dy1, dx2, dy2;
dx1 = nodevect[0].x[0]-nodevect[1].x[0];
dy1 = nodevect[0].x[1]-nodevect[1].x[1];
for (int elem=1; elem<nelem-1; elem++) {
dx2 = nodevect[elem].x[0]-nodevect[elem+1].x[0];
dy2 = nodevect[elem].x[1]-nodevect[elem+1].x[1];
double slope1 = (dy1/(dx1+EPSILON));
double slope2 = (dy2/(dx2+EPSILON));
nodevect[elem].disttoprev = sqrt(dx1*dx1 + dy1*dy1);  // Euclidean distance to prev
nodevect[elem].loccurve = ( slope1*slope2*(dy1+dy2) +
slope2*(nodevect[elem-1].x[0] +
nodevect[elem].x[0]) -
slope1*(nodevect[elem].x[0] +
nodevect[elem+1].x[0]) ) /
(2.*(slope2-slope1) + EPSILON);
dx1 = dx2; dy1 = dy2;  // carry this segment's differences to the next element
}
// Insert interpolated pt every neveryth pt
int count = 1;
vector<struct node>::iterator next, cur;
next = cur = nodevect.begin();
next++;
while (next != nodevect.end()) {
if (count % nevery == 0) {
struct node mynode;
mynode.x[0] = (cur->x[0]+next->x[0])/2.;
mynode.x[1] = (cur->x[1]+next->x[1])/2.;
mynode.stuck = false;
mynode.loccurve = -1;
mynode.disttoprev = -1;
// vector::insert invalidates iterators; revalidate from the returned one
next = nodevect.insert(next, mynode);
cur = next - 1;
next++;
}
next++;
cur++;
count++;
}
}
vecttime = tock(&vectclock);
cout << "Time for list: " << listtime << endl;
cout << "Time for vect: " << vecttime << endl;
vector<struct node>::iterator v;
list <struct node>::iterator l;
if (doPrint) {
cout << "Vector: " << endl;
for (v=nodevect.begin(); v!=nodevect.end(); ++v) {
cout << "[ (" << v->x[0] << "," << v->x[1] << "), " << v->disttoprev << ", " << v->loccurve << "] " << endl;
}
cout << endl << "List: " << endl;
for (l=nodelist.begin(); l!=nodelist.end(); ++l) {
cout << "[ (" << l->x[0] << "," << l->x[1] << "), " << l->disttoprev << ", " << l->loccurve << "] " << endl;
}
}
cout << "List size is " << nodelist.size() << endl;
}
Assuming that creation of new elements happens relatively infrequently, I would take the sorted vector approach, for all the reasons you've listed:
No wasting time following pointers/indices around
Take advantage of spatial locality
Much easier to parallelise
Of course, for this to work, you'd have to make sure that the vector was always sorted, not simply every k-th timestep.
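To illustrate the "easier to parallelise" point, here's a minimal sketch of the kind of pass that OpenMP splits trivially once the nodes are contiguous and sorted by chain position (`Pt` is a simplified stand-in for the node struct, not code from the question; compile with -fopenmp, otherwise the pragma is ignored and the loop runs serially with the same result):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Simplified node for illustration only.
struct Pt { double x, y, dist; };

// Each element depends only on its immediate neighbor, so every
// iteration is independent and the loop can be split across threads.
void distances(std::vector<Pt>& v) {
    int n = (int)v.size();
    #pragma omp parallel for
    for (int i = 1; i < n; i++) {
        double dx = v[i].x - v[i-1].x;
        double dy = v[i].y - v[i-1].y;
        v[i].dist = std::sqrt(dx*dx + dy*dy);  // reads neighbors, writes self
    }
}
```

The curvature pass from the code above has the same shape (reads i-1, i, i+1; writes i), so it parallelises the same way.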
This looks like a nice exercise for parallel programming students.
You seem to have a data structure that lends itself naturally to distribution: a chain. You can do quite a bit of work over subchains that are (semi-)statically assigned to different threads. You might want to deal with the N-1 boundary cases separately, but if the subchain lengths are >3 then those are isolated from each other.
Sure, between each step you'll have to update global variables, but variables such as chain length are simple parallel additions. Just calculate the length of each subchain and then add those up. If your subchains are 100000/8 long, the single-threaded piece of work is the addition of those 8 subchain lengths between steps.
If the growth of nodes is highly non-uniform, you might want to rebalance the subchain lengths every so often.
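As a sketch of that parallel addition (the function name and the subchain lengths are made up for illustration): with -fopenmp each thread sums its share of subchains and the reduction combines the partial sums; without the flag the pragma is ignored and you get the same serial sum.

```cpp
#include <cassert>
#include <vector>

// Combine per-subchain lengths into the global chain length.
long total_length(const std::vector<long>& subchain_len) {
    long total = 0;
    int n = (int)subchain_len.size();
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += subchain_len[i];  // each thread accumulates privately
    return total;
}
```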