This question is about the best strategy for implementing the following simulation in C++.
I'm trying to make a simulation as part of a physics research project, which basically tracks the dynamics of a chain of nodes in space. Each node contains a position together with certain parameters (local curvature, velocity, distance to neighbors, etc.) which all evolve through time.
Each time step can be broken down into these four parts:
Calculate local parameters. The values are dependent on the nearest neighbors in the chain.
Calculate global parameters.
Evolving. The position of each node is moved a small amount, depending on global and local parameters, and some random force fields.
Padding. New nodes are inserted if the distance between two consecutive nodes reaches a critical value.
(In addition, nodes can get stuck, which makes them inactive for the rest of the simulation. The local parameters of inactive nodes with inactive neighbors will not change and need no further calculation.)
Each node contains ~60 bytes; I have ~100 000 nodes in the chain, and I need to evolve the chain for about ~1 000 000 time steps. I would like to maximize these numbers, as that would increase the accuracy of my simulation, but under the restriction that the simulation finishes in reasonable time (~hours). (~30 % of the nodes will be inactive.)
I have started to implement this simulation as a doubly linked list in C++. This seems natural, as I need to insert new nodes between existing ones, and because the local parameters depend on the nearest neighbors. (I added an extra pointer to the next active node, to avoid unnecessary calculation whenever I loop over the whole chain.)
I'm no expert when it comes to parallelization (or coding, for that matter), but I have played around with OpenMP, and I really like how I can speed up loops of independent operations with two lines of code. I don't know how to make my linked list do stuff in parallel, or whether that even works. So I had this idea of working with an STL vector, where instead of having pointers to the nearest neighbors, I would store the indices of the neighbors and access them by standard lookup. I could also sort the vector by position in the chain (every x-th time step) to get better locality in memory. This approach would allow looping the OpenMP way.
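Something like the following is what I have in mind (just a rough sketch; the names are made up):

#include <cmath>
#include <vector>

struct Node {
    double x[2];
    double loccurve, disttoprev;  // local parameters
    bool active;
    int prev, next;               // indices of the neighbors, -1 at the chain ends
};

// The local pass: each node reads only itself and its neighbors, so the
// iterations are independent and OpenMP can split them across cores.
void computeLocal(std::vector<Node>& chain) {
    #pragma omp parallel for
    for (int i = 0; i < (int)chain.size(); i++) {
        Node& n = chain[i];
        if (!n.active || n.prev < 0 || n.next < 0) continue;
        const Node& p = chain[n.prev];
        const double dx = n.x[0] - p.x[0];
        const double dy = n.x[1] - p.x[1];
        n.disttoprev = std::sqrt(dx * dx + dy * dy);
        // ... local curvature from chain[n.prev] and chain[n.next] ...
    }
}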
I'm also drawn to the idea because I wouldn't have to deal with memory management myself, and I guess the STL vector implementation is way better than my simple 'new' and 'delete' way of dealing with nodes in the list. I know I could have done the same with STL lists, but I don't like the way I'd have to access the nearest neighbors with iterators.
So I ask you, 1337 h4x0rs and skilled programmers: what would be a better design for my simulation? Is the vector approach sketched above a good idea? Or are there tricks to make linked lists work with OpenMP? Or should I consider a totally different approach?
The simulation is going to run on a computer with 8 cores and 48 GB of RAM, so I guess I can trade a lot of memory for speed.
Thanks in advance
Edit:
I need to add 1-2 % new nodes each time step, so storing them as a vector without indices to the nearest neighbors won't work unless I sort the vector every time step.
This is a classic tradeoff question. Using an array or std::vector will make the calculations faster and the insertions slower; using a doubly linked list or std::list will make the insertions faster and the calculations slower.
The only way to judge tradeoff questions is empirically: which will work faster for your particular application? All you can really do is try it both ways and see. The more intense the computation and the shorter the stencil (e.g., the higher the computational intensity, meaning how many flops you have to do per amount of memory you have to bring in), the less a contiguous array will matter. But basically, you should mock up an implementation of your basic computation both ways and see if it matters. I've hacked together a very crude go at something with both std::vector and std::list; it is probably wrong in any of a number of ways, but you can give it a go, play with some of the parameters, and see which wins for you. On my system, for the sizes and amount of computation given, list is faster, but it can go either way pretty easily.
With regard to OpenMP: yes, if that's the way you're going to go, your hands are somewhat tied; you'll almost certainly have to go with the vector structure. But first you should make sure that the extra cost of the insertions won't blow away any benefit of multiple cores.
#include <iostream>
#include <list>
#include <vector>
#include <cmath>
#include <sys/time.h>
using namespace std;
struct node {
bool stuck;
double x[2];
double loccurve;
double disttoprev;
};
void tick(struct timeval *t) {
gettimeofday(t, NULL);
}
/* returns time in seconds from now to time described by t */
double tock(struct timeval *t) {
struct timeval now;
gettimeofday(&now, NULL);
return (double)(now.tv_sec - t->tv_sec) +
((double)(now.tv_usec - t->tv_usec)/1000000.);
}
int main()
{
const int nstart = 100;
const int niters = 100;
const int nevery = 30;
const bool doPrint = false;
list<struct node> nodelist;
vector<struct node> nodevect;
// Note - vector is *much* faster if you know ahead of time
// maximum size of vector
nodevect.reserve(nstart*30);
// Initialize
for (int i = 0; i < nstart; i++) {
    struct node mynode;   // no need for new/delete; the containers take copies
    mynode.stuck = false;
    mynode.x[0] = i; mynode.x[1] = 2.*i;
    mynode.loccurve = -1;
    mynode.disttoprev = -1;
    nodelist.push_back( mynode );
    nodevect.push_back( mynode );
}
const double EPSILON = 1.e-6;
struct timeval listclock;
double listtime;
tick(&listclock);
for (int i=0; i<niters; i++) {
// Calculate local curvature, distance
list<struct node>::iterator prev, next, cur;
double dx1, dx2, dy1, dy2;
next = cur = prev = nodelist.begin();
cur++;
next++; next++;
dx1 = prev->x[0]-cur->x[0];
dy1 = prev->x[1]-cur->x[1];
while (next != nodelist.end()) {
dx2 = cur->x[0]-next->x[0];
dy2 = cur->x[1]-next->x[1];
double slope1 = (dy1/(dx1+EPSILON));
double slope2 = (dy2/(dx2+EPSILON));
    cur->disttoprev = sqrt(dx1*dx1 + dy1*dy1);
    cur->loccurve = ( slope1*slope2*(dy1+dy2) +
                      slope2*(prev->x[0]+cur->x[0]) -
                      slope1*(cur->x[0] +next->x[0]) ) /
                    (2.*(slope2-slope1) + EPSILON);
    dx1 = dx2; dy1 = dy2;  // carry the segment differences forward to the next node
    next++;
    cur++;
    prev++;
}
// Insert interpolated pt every neveryth pt
int count = 1;
next = cur = nodelist.begin();
next++;
while (next != nodelist.end()) {
    if (count % nevery == 0) {
        struct node mynode;
        mynode.x[0] = (cur->x[0]+next->x[0])/2.;
        mynode.x[1] = (cur->x[1]+next->x[1])/2.;
        mynode.stuck = false;
        mynode.loccurve = -1;
        mynode.disttoprev = -1;
        nodelist.insert(next, mynode);  // the list copies the node; nothing leaks
    }
next++;
cur++;
count++;
}
}
listtime = tock(&listclock);
struct timeval vectclock;
double vecttime;
tick(&vectclock);
for (int i=0; i<niters; i++) {
int nelem = nodevect.size();
double dx1, dy1, dx2, dy2;
dx1 = nodevect[0].x[0]-nodevect[1].x[0];
dy1 = nodevect[0].x[1]-nodevect[1].x[1];
for (int elem=1; elem<nelem-1; elem++) {
dx2 = nodevect[elem].x[0]-nodevect[elem+1].x[0];
dy2 = nodevect[elem].x[1]-nodevect[elem+1].x[1];
double slope1 = (dy1/(dx1+EPSILON));
double slope2 = (dy2/(dx2+EPSILON));
    nodevect[elem].disttoprev = sqrt(dx1*dx1 + dy1*dy1);
    nodevect[elem].loccurve = ( slope1*slope2*(dy1+dy2) +
                                slope2*(nodevect[elem-1].x[0] +
                                        nodevect[elem].x[0]) -
                                slope1*(nodevect[elem].x[0] +
                                        nodevect[elem+1].x[0]) ) /
                              (2.*(slope2-slope1) + EPSILON);
    dx1 = dx2; dy1 = dy2;  // carry the segment differences forward to the next node
}
// Insert interpolated pt every neveryth pt
int count = 1;
vector<struct node>::iterator next, cur;
next = cur = nodevect.begin();
next++;
while (next != nodevect.end()) {
    if (count % nevery == 0) {
        struct node mynode;
        mynode.x[0] = (cur->x[0]+next->x[0])/2.;
        mynode.x[1] = (cur->x[1]+next->x[1])/2.;
        mynode.stuck = false;
        mynode.loccurve = -1;
        mynode.disttoprev = -1;
        // vector::insert invalidates iterators; rebuild cur/next from the
        // iterator it returns instead of reusing the old ones
        vector<struct node>::iterator ins = nodevect.insert(next, mynode);
        cur = ins - 1;
        next = ins + 1;
    }
    next++;
    cur++;
    count++;
}
}
vecttime = tock(&vectclock);
cout << "Time for list: " << listtime << endl;
cout << "Time for vect: " << vecttime << endl;
vector<struct node>::iterator v;
list <struct node>::iterator l;
if (doPrint) {
cout << "Vector: " << endl;
for (v=nodevect.begin(); v!=nodevect.end(); ++v) {
cout << "[ (" << v->x[0] << "," << v->x[1] << "), " << v->disttoprev << ", " << v->loccurve << "] " << endl;
}
cout << endl << "List: " << endl;
for (l=nodelist.begin(); l!=nodelist.end(); ++l) {
cout << "[ (" << l->x[0] << "," << l->x[1] << "), " << l->disttoprev << ", " << l->loccurve << "] " << endl;
}
}
cout << "List size is " << nodelist.size() << endl;
}
Assuming that creation of new elements happens relatively infrequently, I would take the sorted vector approach, for all the reasons you've listed:
No wasting time following pointers/indices around
Take advantage of spatial locality
Much easier to parallelise
Of course, for this to work, you'd have to make sure that the vector is always sorted, not just every k-th time step.
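For instance, a rough sketch (assuming the vector is kept in chain order, so a node's neighbours are simply the adjacent elements; Node and updateLocal are made-up stand-ins for your own types):

#include <cmath>
#include <vector>

struct Node { double x[2]; double loccurve, disttoprev; bool stuck; };

// Placeholder local update: distance to the previous node; curvature omitted.
void updateLocal(const Node& prev, Node& cur, const Node& next) {
    double dx = cur.x[0] - prev.x[0], dy = cur.x[1] - prev.x[1];
    cur.disttoprev = std::sqrt(dx*dx + dy*dy);
    (void)next;
}

void localPass(std::vector<Node>& chain) {
    // Chain order == memory order: no index chasing, good locality,
    // and the loop parallelises directly with OpenMP.
    #pragma omp parallel for
    for (int i = 1; i < (int)chain.size() - 1; i++)
        updateLocal(chain[i-1], chain[i], chain[i+1]);
}

An insertion then goes directly at the right spot (chain.insert(chain.begin() + i + 1, midpoint)), which preserves the order at the cost of shifting the tail of the vector.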
This looks like a nice exercise for parallel programming students.
You seem to have a data structure that lends itself naturally to distribution: a chain. You can do quite a bit of work over subchains that are (semi-)statically assigned to different threads. You might want to deal with the N-1 boundary cases separately, but if the subchain lengths are > 3, those are isolated from each other.
Sure, between steps you'll have to update global variables, but variables such as chain length are simple parallel additions: calculate the length of each subchain and then add those up. If your subchains are 100000/8 long, the only single-threaded piece of work is the addition of those 8 subchain lengths between steps (see the reduction sketch below).
If the growth of nodes is highly non-uniform, you might want to rebalance the subchain lengths every so often.
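A sketch of such a global update as an OpenMP reduction (assuming the per-node distances are available in an array, as in the vector design discussed above):

#include <vector>

// Chain length as a parallel sum: each thread sums a contiguous block
// and OpenMP combines the per-thread partial sums at the end.
double totalChainLength(const std::vector<double>& disttoprev) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < (int)disttoprev.size(); i++)
        total += disttoprev[i];
    return total;
}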
I was experimenting with a known algorithm that aims to reduce the number of comparisons in an operation of finding an element in an unsorted array. The algorithm uses a sentinel which is added to the back of the array, which allows writing a loop with only one comparison per element instead of two. It's important to note that the overall big-O computational complexity is not changed; it is still O(n). However, counting comparisons, the standard find algorithm performs roughly 2n of them while the sentinel algorithm performs roughly n.
The standard find algorithm from the C++ library works like this:
template<class InputIt, class T>
InputIt find(InputIt first, InputIt last, const T& value)
{
for (; first != last; ++first) {
if (*first == value) {
return first;
}
}
return last;
}
We can see two comparisons there and one increment.
In the algorithm with sentinel the loop looks like this:
while (a[i] != key)
++i;
There is only one comparison and one increment.
I made some experiments and measured time, but on every computer the results were different. Unfortunately I didn't have access to any serious machine, only my laptop running Ubuntu in VirtualBox, under which I compiled and ran the code. I had a problem with the amount of memory. I tried using online compilers like Wandbox and Ideone, but the time and memory limits didn't allow me to make reliable experiments. And every time I ran my code, changing the number of elements in my vector or the number of test repetitions, I saw different results: sometimes the times were comparable, sometimes std::find was significantly faster, sometimes the sentinel algorithm was.
I was surprised, because logic says the sentinel version should indeed be faster every time. Do you have any explanation for this? Do you have any experience with this kind of algorithm? Is it worth the effort to even try to use it in production code when performance is crucial, the array cannot be sorted, and any other mechanism to solve this problem (like hash maps, indexing, etc.) cannot be used?
Here's my code of testing this. It's not beautiful, in fact it is ugly, but the beauty wasn't my goal here. Maybe something is wrong with my code?
#include <iostream>
#include <algorithm>
#include <chrono>
#include <vector>
using namespace std::chrono;
using namespace std;
const unsigned long long N = 300000000U;
static void find_with_sentinel()
{
vector<char> a(N);
char key = 1;
a[N - 2] = key; // make sure the searched element is in the array at the last but one index
unsigned long long high = N - 1;
auto tmp = a[high];
// put a sentinel at the end of the array
a[high] = key;
unsigned long long i = 0;
while (a[i] != key)
++i;
// restore original value
a[high] = tmp;
if (i == high && key != tmp)
cout << "find with sentinel, not found" << endl;
else
cout << "find with sentinel, found" << endl;
}
static void find_with_std_find()
{
vector<char> a(N);
int key = 1;
a[N - 2] = key; // make sure the searched element is in the array at the last but one index
auto pos = find(begin(a), end(a), key);
if (pos != end(a))
cout << "find with std::find, found" << endl;
else
cout << "find with sentinel, not found" << endl;
}
int main()
{
const int times = 10;
high_resolution_clock::time_point t1 = high_resolution_clock::now();
for (auto i = 0; i < times; ++i)
find_with_std_find();
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(t2 - t1).count();
cout << "std::find time = " << duration << endl;
t1 = high_resolution_clock::now();
for (auto i = 0; i < times; ++i)
find_with_sentinel();
t2 = high_resolution_clock::now();
duration = duration_cast<milliseconds>(t2 - t1).count();
cout << "sentinel time = " << duration << endl;
}
Move the memory allocation (vector construction) outside the measured functions (e.g. pass the vector as an argument).
Increase times to a few thousand.
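For example, a sketch of a reshaped harness for the std::find side (the sizes and counts here are placeholders, not recommendations):

#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>
using namespace std::chrono;

int main() {
    std::vector<char> a(10000000);   // allocated once, outside the timing
    a[a.size() - 2] = 1;             // element to find, near the end
    const int times = 2000;
    long long hits = 0;              // consume results so the loop isn't optimized away
    auto t1 = high_resolution_clock::now();
    for (int i = 0; i < times; ++i)
        hits += std::find(a.begin(), a.end(), (char)1) - a.begin();
    auto t2 = high_resolution_clock::now();
    std::cout << hits << "\nstd::find time = "
              << duration_cast<milliseconds>(t2 - t1).count() << " ms" << std::endl;
}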
You're doing a whole lot of time-consuming work in your functions. That work is hiding the differences in the timings. Consider your find_with_sentinel function:
static void find_with_sentinel()
{
// ***************************
vector<char> a(N);
char key = 1;
a[N - 2] = key; // make sure the searched element is in the array at the last but one index
// ***************************
unsigned long long high = N - 1;
auto tmp = a[high];
// put a sentinel at the end of the array
a[high] = key;
unsigned long long i = 0;
while (a[i] != key)
++i;
// restore original value
a[high] = tmp;
// ***************************************
if (i == high && key != tmp)
cout << "find with sentinel, not found" << endl;
else
cout << "find with sentinel, found" << endl;
// **************************************
}
The three lines at the top and the four lines at the bottom are identical in both functions, and they're fairly expensive to run. The top contains a memory allocation and the bottom contains an expensive output operation. These are going to mask the time it takes to do the real work of the function.
You need to move the allocation and the output out of the function. Change the function signature to:
static int find_with_sentinel(vector<char>& a, char key);
In other words, make it essentially the same shape as std::find. The vector goes by reference because the sentinel variant has to write to (and restore) the buffer, and copying 300M elements per call would swamp the measurement. If you do that, then you don't have to wrap std::find, and you get a more realistic view of how your function will perform in a typical situation.
It's quite possible that the sentinel find function will be faster. However, it comes with some drawbacks. The first is that you can't use it with immutable lists. The second is that it's not safe to use in a multi-threaded program due to the potential of one thread overwriting the sentinel that the other thread is using. It also might not be "faster enough" to justify replacing std::find.
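A sketch of the refactored function under that signature (returning the index, or -1 when only the sentinel matched; this is my own reading of the suggestion, not the only way to shape it):

#include <vector>
using std::vector;

// Sentinel search: takes the buffer by reference because it must plant
// (and afterwards restore) the sentinel in the last element.
static int find_with_sentinel(vector<char>& a, char key) {
    const int high = (int)a.size() - 1;
    const char tmp = a[high];
    a[high] = key;               // plant the sentinel
    int i = 0;
    while (a[i] != key) ++i;     // one comparison per element
    a[high] = tmp;               // restore the original value
    return (i == high && tmp != key) ? -1 : i;
}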
I'm stuck on an optimization problem. I have a huge database (about 16M entries) which represents ratings given by different users to different items. From this database I have to evaluate a correlation measure between different users (i.e., I have to implement a similarity matrix). Fortunately this correlation matrix is symmetric, so I only have to calculate half of it.
Let me focus, for example, on the first column of the matrix: there are 135k users in total, so I keep one user fixed and find all the commonly rated items between this user and all the other ones (with a for loop). The time problem appears even if I compare the single user with 20k other users instead of 135k.
My approach is the following: first I query the DB to obtain, for example, all the data of the first 20k users (this takes time even with indexes in place, but it doesn't bother me since I do it only once), and I store everything in an unordered_map using the userID as key; the mapped value is another unordered_map which stores all the ratings given by that user, this time keyed by itemID.
Then, in order to find the set of items that both have rated, I iterate over the user who has rated fewer items, checking whether the other one has also rated the same items. The fastest data structures I know are hash maps, but for a single complete column my algorithm takes 30 s (just for 20k entries), which translates into WEEKS for the complete matrix.
The code is the following:
void similarity_matrix(sqlite3 *db, sqlite3 *db_avg, sqlite3 *similarity, long int tot_users, long int interval) {
long int n = 1;
double sim;
string temp_s;
vector<string> insert_query;
sqlite3_stmt *stmt;
std::cout << "Starting creating similarity matrix..." << std::endl;
string query_string = "SELECT * from usersratings where usersratings.user <= 20000;";
unordered_map<int, unordered_map<int, int>> users_map = db_query(query_string.c_str(), db);
std::cout << "Query time: " << duration_ << " s." << std::endl;
unordered_map<int, int> u1_map = users_map[1];
string select_avg = "SELECT * from averages;";
unordered_map<int, double> avg_map = avg_value(select_avg.c_str(), db_avg);
for (int i = 2; i <= tot_users; i++)
{
unordered_map<int, int> user;
int compare_id;
if (users_map[i].size() <= u1_map.size()) {
user = users_map[i];
compare_id = 1;
}
else {
user = u1_map;
compare_id = i;
}
int matches = 0;
double newnum = 0;
double newden1 = 0;
double newden2 = 0;
unordered_map<int, int> item_map = users_map[compare_id];
for (unordered_map<int, int>::iterator it = user.begin(); it != user.end(); ++it)
{
if (item_map.size() != 0) {
int rating = item_map[it->first];
if (rating != 0) {
double diff1 = (it->second - avg_map[1]);
double diff2 = (rating - avg_map[i]);
newnum += diff1 * diff2;
newden1 += pow(diff1, 2);
newden2 += pow(diff2, 2);
}
}
}
sim = newnum / (sqrt(newden1) * sqrt(newden2));
}
std::cout << "Execution time for first column: " << duration << " s." << std::endl;
std::cout << "First column finished..." << std::endl;
}
This stands out to me as an immediate potential performance trap:
unordered_map<int, unordered_map<int, int>> users_map = db_query(query_string.c_str(), db);
If the size of each sub-map for each user is anywhere close to the number of users, then you have a quadratic-complexity algorithm, which is going to slow down very quickly as the number of users grows.
unordered_map does offer constant-time search, but it's still a search. The number of instructions required to do it will dwarf, say, the cost of indexing an array, especially if there are many collisions, which imply inner loops on every lookup. It also isn't necessarily represented in a way that allows for the fastest sequential iteration. So if you can just use std::vector for at least the sub-lists and avg_map, like so, that should help a lot for starters:
typedef pair<int, int> ItemRating;
typedef vector<ItemRating> ItemRatings;
unordered_map<int, ItemRatings> users_map = ...;
vector<double> avg_map = ...;
Even the outer users_map could be a vector, unless it's sparse and not all indices are used. If it is sparse but the range of user IDs still fits into a reasonable range (not an astronomically large integer), you could construct two vectors: one stores the user data and has a size proportional to the number of users, while the other is proportional to the valid range of user IDs and stores nothing but indices into the first. That lets you translate a user ID to an index with a simple array lookup whenever you need to access user data through a user ID.
// User data array.
vector<ItemRatings> user_data(num_users);
// Array that translates sparse user ID integers to indices into the
// above dense array. A value of -1 indicates that a user ID is not used.
// To fetch user data for a particular user ID, we do:
// const ItemRatings& ratings = user_data[user_id_to_index[user_id]];
vector<int> user_id_to_index(biggest_user_index+1, -1);
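Filling the translation table might look like this (a sketch; Row and rows stand in for whatever your DB query actually returns):

// Build both arrays in one pass over the query results.
int next_index = 0;
for (const Row& r : rows) {                    // Row{user, item, rating} is hypothetical
    int& idx = user_id_to_index[r.user];
    if (idx == -1) idx = next_index++;         // first occurrence of this user
    user_data[idx].push_back(ItemRating(r.item, r.rating));
}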
You're also copying those unordered_maps needlessly on each iteration of the outer loop. While I don't think that's the source of the biggest bottleneck, it would help to avoid deep-copying data structures you don't even modify, by using references or pointers:
// Shallow copy, don't deep copy big stuff needlessly.
const unordered_map<int, int>& user = users_map[i].size() <= u1_map.size() ?
users_map[i]: u1_map;
const int compare_id = users_map[i].size() <= u1_map.size() ? 1: i;
const unordered_map<int, int>& item_map = users_map[compare_id];
...
You also don't need to check whether item_map is empty inside the inner loop; that check should be hoisted outside. It's a micro-optimization which is unlikely to help much at all, but it still eliminates blatant waste.
The final code after this first pass would be something like this:
vector<ItemRatings> user_data = ..;
vector<double> avg_map = ...;
// Fill `rating_values` with the values from the first user.
vector<int> rating_values(item_range, 0);
const ItemRatings& ratings1 = user_data[0];
for (auto it = ratings1.begin(); it != ratings1.end(); ++it)
{
const int item = it->first;
const int rating = it->second;
rating_values[item] += rating;
}
// For each user starting from the second user:
for (int i=1; i < tot_users; ++i)
{
double newnum = 0;
double newden1 = 0;
double newden2 = 0;
const ItemRatings& ratings2 = user_data[i];
for (auto it = ratings2.begin(); it != ratings2.end(); ++it)
{
const int item = it->first;
const int rating1 = rating_values[item];   // first user's rating; 0 means unrated
if (rating1 != 0) {
const int rating2 = it->second;            // user i's rating for the same item
double diff1 = rating1 - avg_map[0];       // deviation of the first user
double diff2 = rating2 - avg_map[i];       // deviation of user i
newnum += diff1 * diff2;
newden1 += pow(diff1, 2);
newden2 += pow(diff2, 2);
}
}
sim = newnum / (sqrt(newden1) * sqrt(newden2));
}
The biggest difference in the above code is that we eliminated all searches through unordered_map and replaced them with simple indexed access of an array. We also eliminated a lot of redundant copying of data structures.
I've been using Mac OS gcc 4.2.1 and Eclipse to write a program that sorts numbers using a simple merge sort. I've tested the sort extensively and I know it works, and I thought, maybe somewhat naively, that because of the way the algorithm divides up the list, I could simply have a thread sort one half and the main thread sort the other half, and then it would take half the time. Unfortunately, it doesn't seem to be working.
Here's the main code:
float x = clock(); //timing
int half = (int)size/2; // size is the length of the list
status = pthread_create(thready,NULL,voidSort,(void *)datay); //start the thread sorting
sortStep(testArray,tempList,half,0,half); //sort using the main thread
int join = pthread_join(*thready,&someptr); //wait for the thread to finish
mergeStep(testArray,tempList,0,half,half-1); //merge the two sublists
if (status != 0) { std::cout << "Could not create thread.\nError: " << status << "\n"; }
if (join != 0) { std::cout << "Could not join thread.\nError: " << join << "\n"; }
float y = clock() - x; //timing
sortStep is the main sorting function, mergeStep is used to merge two sublists within one array (it uses a placeholder array to switch the numbers around), and voidSort is a function I use to pass a struct containing all the arguments for sortStep to the thread. I feel like maybe the main thread is waiting until the new thread is done, but I'm not sure how to overcome that. I'm extremely, unimaginably grateful for any and all help. Thank you in advance!
EDIT:
Here's the merge step
void mergeStep (int *array,int *tempList,int start, int lengthOne, int lengthTwo) //the merge step of a merge sort
{
int i = start;
int j = i+lengthOne;
int k = 0; // index for the entire templist
while (k < lengthOne+lengthTwo) // a C++ while loop
{
if (i - start == lengthOne)
{ //list one exhausted
for (int n = 0; n+j < lengthTwo+lengthOne+start;n++ ) //add the rest
{
tempList[k++] = array[j+n];
}
break;
}
if (j-(lengthOne+lengthTwo)-start == 0)
{//list two exhausted
for (int n = i; n < start+lengthOne;n++ ) //add the rest
{
tempList[k++] = array[n];
}
break;
}
if (array[i] > array[j]) // figure out which variable should go first
{
tempList[k] = array[j++];
}
else
{
tempList[k] = array[i++];
}
k++;
}
for (int s = 0; s < lengthOne+lengthTwo;s++) // add the templist into the original
{
array[start+s] = tempList[s];
}
}
-Will
The overhead of creating threads is quite large, so unless you have a large amount (to be determined) of data to sort, you're better off sorting it in the main thread.
The mergeStep also counts against the part of the code that can't be parallelized; remember Amdahl's law.
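(Amdahl's law: with a fraction p of the work parallelized across n threads, the best possible speedup is 1 / ((1 - p) + p/n). For example, if the serial merge is 10% of the total, two threads give at most 1 / (0.1 + 0.9/2) ≈ 1.8x, not 2x.)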
If you don't have a coarsening step as the last part of your sortStep, much of your performance will go up in function-call overhead once you get below 8-16 elements. The coarsening step should be done by a simpler sort: insertion sort or a sorting network.
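A sketch of what that coarsening could look like (I don't know your sortStep's exact signature, so this uses its own recursive driver on top of your mergeStep):

void mergeStep(int *array, int *tempList, int start, int lengthOne, int lengthTwo); // from your code

// Sorts a[lo..hi). Below the cutoff, insertion sort avoids the
// function-call overhead of recursing all the way down.
static void insertionSort(int *a, int lo, int hi) {
    for (int i = lo + 1; i < hi; ++i) {
        int v = a[i], j = i - 1;
        while (j >= lo && a[j] > v) { a[j + 1] = a[j]; --j; }
        a[j + 1] = v;
    }
}

static void coarsenedSort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo <= 16) { insertionSort(a, lo, hi); return; }
    int mid = lo + (hi - lo) / 2;
    coarsenedSort(a, tmp, lo, mid);
    coarsenedSort(a, tmp, mid, hi);
    mergeStep(a, tmp, lo, mid - lo, hi - mid);
}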
Unless the input is large enough, the actual timing differences could drown in measurement uncertainty.
I am interested in porting some existing code to use thrust to see if I can speed it up on the GPU with relative ease.
What I'm looking to accomplish is a stream compaction operation, where only nonzero elements will be kept. I have this mostly working, per the example code below. The part that I'm unsure how to tackle is dealing with all the extra fill space that is left in d_res (and thus h_res) after the compaction happens.
The example just uses a 0-99 sequence with all the even entries set to zero. This is just an example, and the real problem will be a general sparse array.
This answer helped me greatly, although there the size of the data to read out is simply known to be constant:
How to quickly compact a sparse array with CUDA C?
I suspect that I can work around this by counting the number of zeros in d_src and then allocating d_res to be only that size, or by doing the count after the compaction and copying only that many elements. Is that really the right way to do it?
I get the sense that there will be some easy fix for this, via clever use of iterators or some other feature of thrust.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
//Predicate functor
struct is_not_zero
{
__host__ __device__
bool operator()(const int x)
{
return (x != 0);
}
};
using namespace std;
int main(void)
{
size_t N = 100;
//Host Vector
thrust::host_vector<int> h_src(N);
//Fill with some zero and some nonzero data, as an example
for (int i = 0; i < N; i++){
if (i % 2 == 0){
h_src[i] = 0;
}
else{
h_src[i] = i;
}
}
//Print out source data
cout << "Source:" << endl;
for (int i = 0; i < N; i++){
cout << h_src[i] << " ";
}
cout << endl;
//copies to device
thrust::device_vector<int> d_src = h_src;
//Result vector
thrust::device_vector<int> d_res(d_src.size());
//Copy non-zero elements from d_src to d_res
thrust::copy_if(d_src.begin(), d_src.end(), d_res.begin(), is_not_zero());
//Copy back to host
thrust::host_vector<int> h_res(d_res.begin(), d_res.end());
//thrust::host_vector<int> h_res = d_res; //Or just this?
//Show results
cout << "h_res size is " << h_res.size() << endl;
cout << "Result after remove:" << endl;
for (int i = 0; i < h_res.size(); i++){
cout << h_res[i] << " ";
}
cout << endl;
return 0;
}
Also, I am a novice with thrust, so if the above code has any obvious flaws that go against recommended practices for using thrust, please let me know.
Similarly, speed is always of interest. Reading some of the various thrust tutorials, it seems like little changes here and there can be big speed savers or wasters. So, please let me know if there is a smart way to speed this up.
What you appear to have overlooked is that copy_if returns an iterator which points to the end of the data copied by the stream-compaction operation. So all that is required is this:
//copies to device
thrust::device_vector<int> d_src = h_src;
//Result vector
thrust::device_vector<int> d_res(d_src.size());
//Copy non-zero elements from d_src to d_res
auto result_end = thrust::copy_if(d_src.begin(), d_src.end(), d_res.begin(), is_not_zero());
//Copy back to host
thrust::host_vector<int> h_res(d_res.begin(), result_end);
Doing this sizes h_res to hold only the non-zeroes, and copies only the non-zeroes from the output of the stream compaction. No extra computation is required.
I'm new to C++ programming and am trying to write some sparse matrix and vector stuff as practice.
The sparse matrix is built from a vector of maps, where the vector index accesses the rows and the map is used for the sparse entries in the columns.
What I was trying to do is fill a diagonally dominant sparse matrix with an equation system for a Poisson equation.
Now, when filling the matrix in test cases, I was able to provoke the following very weird problem, which I have broken down to the essential operations:
#include <vector>
#include <iterator>
#include <iostream>
#include <map>
#include <ctime>
int main()
{
unsigned int nDim = 100000;
double clock1;
// alternative std::map<unsigned int, std::map<unsigned int, double> > mat;
std::vector<std::map<unsigned int, double> > mat;
mat.resize(nDim);
// if clause and number set
clock1 = double(clock())/CLOCKS_PER_SEC;
for(unsigned int rowIter = 0; rowIter < nDim; rowIter++)
{
for(unsigned int colIter = 0; colIter < nDim; colIter++)
{
if(rowIter == colIter)
{
mat[rowIter][colIter] = 1.;
}
}
}
std::cout << "time for diagonal fill: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;
// if clause and number insert
clock1 = double(clock())/CLOCKS_PER_SEC;
for(unsigned int rowIter = 0; rowIter < nDim; rowIter++)
{
for(unsigned int colIter = 0; colIter < nDim; colIter++)
{
if(rowIter == colIter)
{
mat[rowIter].insert(std::pair<unsigned int, double>(colIter,1.));
}
}
}
std::cout << "time for insert diagonal fill: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;
// only number set
clock1 = double(clock())/CLOCKS_PER_SEC;
for(unsigned int rowIter = 0; rowIter < nDim; rowIter++)
{
mat[rowIter][rowIter] += 1.;
}
std::cout << "time for easy diagonal fill: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;
// only if clause
clock1 = double(clock())/CLOCKS_PER_SEC;
for(unsigned int rowIter = 0; rowIter < nDim; rowIter++)
{
for(unsigned int colIter = 0; colIter < nDim; colIter++)
{
if(rowIter == colIter)
{
}
}
}
std::cout << "time for if clause: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;
return 0;
}
Running this with gcc (newest version, 4.8.1 I think) produces the following times:
time for diagonal fill: 26317 ms
time for insert diagonal fill: 8783 ms
time for easy diagonal fill: 10 ms !!!!!!!
time for if clause: 0 ms
I only used the loop for the if clause to be sure that the loop itself is not responsible for the lack of speed.
The optimization level is O3, but the problem also appears at other levels.
So I thought, let's try Visual Studio (2012 Express).
It is a little bit faster, but still as slow as ketchup:
time for diagonal fill: 9408 ms
time for insert diagonal fill: 8860 ms
time for easy diagonal fill: 11 ms !!!!!!!
time for if clause: 0 ms
So MSVC++ fails, too.
It will probably not even be necessary to use this combination of if clause and matrix fill, but if it is... I'm screwed.
Does anybody know where this huge performance gap is coming from and how I could deal with it?
Is it some optimization problem caused by the fact that the if clause is inside the loop? Do I maybe just need another compiler flag?
I would also be interested to know whether it occurs on other systems/compilers, too. I might run it on the Xeon E5 machine at work and see what that baby makes of this devil piece of code :).
EDIT:
I ran it on the Xeon machine: much faster, still slow.
Times with gcc:
2778ms
2684ms
1ms
0ms
The most obvious performance issue would be allocations within your map. Each time you assign/insert a new item into a map, it has to allocate space for it and rebalance the tree appropriately. Doing that thousands of times is bound to be slow.
It's also very significant that you're not clearing the maps after your first loop. That means your subsequent loops don't have to do as much work, so your performance comparisons are not equivalent.
Finally, the nested loops are obviously going to do orders of magnitude more iterations than your single loop. From a strict algorithm-analysis standpoint, they may do the same amount of actual work on the data. However, the program still has to run through all those extra iterations, because that's what you've told it to do. The compiler can only optimize them out if there is literally nothing being processed/modified in the loop body.
In the first loop, the runtime system is doing loads of memory allocation, so it takes a lot of time on memory management.
The other loops don't have that overhead: you didn't release the allocations done by the first loop, so they don't have to repeat the memory allocation, and they don't take anywhere near as long.
The last loop is optimized out by the compiler; it has no side effects, so it doesn't get included in the program.
Morals:
Memory allocation has a cost.
Benchmarking is hard.
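A sketch of a fairer harness: give every strategy a fresh matrix so none of them inherits the allocations of an earlier run (timeFillMs and the lambdas are my own names, and the O(n^2) if-clause loop is left out here to isolate the allocation cost):

#include <ctime>
#include <iostream>
#include <map>
#include <vector>

typedef std::vector<std::map<unsigned int, double> > SparseMat;

// Time one fill strategy against freshly allocated storage.
template <class Fill>
double timeFillMs(unsigned int nDim, Fill fill) {
    SparseMat mat(nDim);                               // fresh every run
    double t0 = double(clock()) / CLOCKS_PER_SEC;
    fill(mat);
    return 1e3 * (double(clock()) / CLOCKS_PER_SEC - t0);
}

int main() {
    const unsigned int nDim = 100000;
    std::cout << "operator[] fill: " << timeFillMs(nDim, [](SparseMat& m) {
        for (unsigned int r = 0; r < m.size(); ++r) m[r][r] = 1.;
    }) << " ms\n";
    std::cout << "insert fill:     " << timeFillMs(nDim, [](SparseMat& m) {
        for (unsigned int r = 0; r < m.size(); ++r)
            m[r].insert(std::make_pair(r, 1.));
    }) << " ms\n";
}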