I`m new in C++ programming and try to write some sparse matrix and vector stuff I as a practice.
The sparse matrix is build of a vector of maps, where the vector accesses the rows and the map is used for the sparse entries in the columns.
What I was trying to do is to fill a diagonal dominant sparse matrix with an equation system for a Poisson equation.
Now when filling the matrix in test cases I was able to provoke the following very weird problem, which I broke down to the essential operations.
#include <vector>
#include <iterator>
#include <iostream>
#include <map>
#include <ctime>
int main()
{
unsigned int nDim = 100000;
double clock1;
// alternative std::map<unsigned int, std::map<unsigned int, double> > mat;
std::vector<std::map<unsigned int, double> > mat;
mat.resize(nDim);
// if clause and number set
clock1 = double(clock())/CLOCKS_PER_SEC;
for(unsigned int rowIter = 0; rowIter < nDim; rowIter++)
{
for(unsigned int colIter = 0; colIter < nDim; colIter++)
{
if(rowIter == colIter)
{
mat[rowIter][colIter] = 1.;
}
}
}
std::cout << "time for diagonal fill: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;
// if clause and number insert
clock1 = double(clock())/CLOCKS_PER_SEC;
for(unsigned int rowIter = 0; rowIter < nDim; rowIter++)
{
for(unsigned int colIter = 0; colIter < nDim; colIter++)
{
if(rowIter == colIter)
{
mat[rowIter].insert(std::pair<unsigned int, double>(colIter,1.));
}
}
}
std::cout << "time for insert diagonal fill: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;
// only number set
clock1 = double(clock())/CLOCKS_PER_SEC;
for(unsigned int rowIter = 0; rowIter < nDim; rowIter++)
{
mat[rowIter][rowIter] += 1.;
}
std::cout << "time for easy diagonal fill: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;
// only if clause
clock1 = double(clock())/CLOCKS_PER_SEC;
for(unsigned int rowIter = 0; rowIter < nDim; rowIter++)
{
for(unsigned int colIter = 0; colIter < nDim; colIter++)
{
if(rowIter == colIter)
{
}
}
}
std::cout << "time for if clause: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;
return 0;
}
Running this in gcc (newest version, 4.8.1 I think) the following times appear:
time for diagonal fill: 26317ms
time for insert diagonal: 8783ms
time for easy diagonal fill: 10ms !!!!!!!
time for if clause: 0ms
I only used the loop for the if clause to be sure the it is not responsible for the speed lack.
Optimization level is O3, but the problem also appears on other levels.
So I thought let's try the Visual Studio (2012 Express).
It is a little bit faster, but still as slow as ketchup:
time for diagonal fill: 9408ms
time for insert diagonal: 8860ms
time for easy diagonal fill: 11ms !!!!!!!
time for if clause: 0ms
So MSVSC++ fails, too.
It will probably not even be necessary to used this combination of if-clause and matrix fill, but if... I'm screwed.
Does anybody know where this huge performance gap is coming from and how I could deal with it?
Is it some optimization problem caused by the fact, that the if-clause is inside the loop? Do I maybe just need another compiler flag?
I would also be interested, if it occurs with other systems/compilers, too. I might run it on the Xeon E5 machine at work and see what this baby makes with this devil piece of code :).
EDIT:
I ran it on the Xeon machine: Much faster, still slow.
Times with gcc:
2778ms
2684ms
1ms
0ms
The most obvious performance issue would be allocations within your map. Each time you assign/insert a new item in a map, it's got to allocate space for it and sort the tree appropriately. Doing that thousands of times is bound to be slow.
It's also very significant that you're not clearing the maps after your first loop. That means your subsequent loops don't have to do as much work, so your performance comparisons are not equivalent.
Finally, the nested loops are obviously going to be doing an order of magnitude more iterations than your single loop. From a strict algorithm analysis standpoint, it may be doing the same amount of actual work on the data. However, the program still has to run through all those extra iterations because that's what you've told it to do. The compiler can only optimise it out if there is literally nothing being processed/modified in the loop body.
In the first loop, the runtime system is doing loads of memory allocation, so it takes a lot of time on memory management.
The other loops don't have that overhead; you didn't release the allocation done by the first loop, so they don't have to repeat the memory allocation and it doesn't take anywhere near as long.
The last loop is optimized out by the compiler; it has no side effects, so it doesn't get included in the program.
Morals:
memory allocation has a cost.
benchmarking is hard.
Related
I am currently studying different search algorithms, and I have made a little program to see the difference in the efficiency. Binary search should be faster than linear search, but the time mesures show otherwise. Did I made some mistake in the code or is this some special case?
#include <chrono>
#include <unistd.h>
using namespace std;
const int n=1001;
int a[n];
void assign() {
for (int i=0; i<n; i++) {
a[i]=i;
}
}
void print() {
for (int i=0; i<n; i++) {
cout << a[i] << endl;
}
}
bool find1 (int x) {
for (int i=0; i<n; i++) {
if (x==a[i]){
return true;
}
} return false;
}
bool binsearch(int x) {
int l=0,m;
int r=n-1;
while (l<=r) {
m = ((l+r)/2);
if (a[m]==x) return true;
if (a[m]<x) l=m+1;
if (a[m]>x) r=m-1;
}
return false;
}
int main() {
assign();
//print();
auto start1 = chrono::steady_clock::now();
cout << binsearch(500) << endl;
auto end1 = chrono::steady_clock::now();
auto start2 = chrono::steady_clock::now();
cout << find1(500) << endl;
auto end2 = chrono::steady_clock::now();
cout << "binsearch: " << chrono::duration_cast<chrono::nanoseconds>(end1 - start1).count()
<< " ns " << endl;
cout << "linsearch: " << chrono::duration_cast<chrono::nanoseconds>(end2 - start2).count()
<< " ns " << endl;
return 0;
}
Your test dataset is too small (1001 integers). It will fit entirely in the fastest (L1) cache when you fill it; consequently, you're bound by branch complexity, not memory.
The binary search version exhibits more branch mispredictions, resulting in more pipeline stalls than a simple linear pass.
I increased n to 1000001 and also increased the number of test passes:
auto start1 = chrono::steady_clock::now();
for (int i = 0; i < n; i += n/13) {
if (!binsearch(i%n)) {
std::cerr << i << std::endl;
}
}
auto end1 = chrono::steady_clock::now();
auto start2 = chrono::steady_clock::now();
for (int i = 0; i < n; i += n / 13) {
if (!find1(i%n)) {
std::cerr << i << std::endl;
}
}
auto end2 = chrono::steady_clock::now();
and I'm getting different results:
binsearch: 10300 ns
linsearch: 3129600 ns
Note also that you should not call cout in a timed loop, but you do need to use the result of the find in order for it to not get optimized away.
To my mind N=1001 is enough to notice that binary search has a better performance. Specific realizations of linear search could be faster only for small N (approximately < 100). However, in your case the reason of such strange results is incorrect profiling measurements. All your data has been successfully cached during calculations of the first algorithm (binary search), which dramatically improved performance of the second (linear search).
If you just swap their calls, you will get an opposite result:
binsearch: 6019 ns
linsearch: 77587 ns
For precise measurements you should use special frameworks (google benchmark, for example), which ensures the 'fair conditions' for both algorithms.
Other online benchmarking tool (it runs the testing code on a pool of many AWS machines whose load is unknown and returns average result) gives these charts for your code without changes (with the same n=1001 as well):
Get the best of both!
Do a binary search down to some level, then switch to linear. Think of it this way, a binary search has a bunch of bookkeeping; a linear search is faster because it is 'simpler'.
When I first experimented with this (back in the 1970's) in assembly language, I deduced that doing binary searches down to about 4 items, then doing linear, was about optimal. However YMMV; It depends on the hardware, the complexity of comparing two items (float / int / string / whatever), etc.
Tip: Count the number of operations in your code. I see about twice as many operations are needed for each step in your binsearch() routine versus the linear scan.
I've been using std::vector mostly and was wondering if I should use std::map for a key lookup to improve performance.
And here's my full test code.
#include <iostream>
#include <string>
#include <map>
#include <vector>
#include <ctime>
#include <chrono>
using namespace std;
vector<string> myStrings = {"aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii", "jjj", "kkk", "lll", "mmm", "nnn", "ooo", "ppp", "qqq", "rrr", "sss", "ttt", "uuu", "vvv", "www", "xxx", "yyy", "zzz"};
struct MyData {
string key;
int value;
};
int findStringPosFromVec(const vector<MyData> &myVec, const string &str) {
auto it = std::find_if(begin(myVec), end(myVec),
[&str](const MyData& data){return data.key == str;});
if (it == end(myVec))
return -1;
return static_cast<int>(it - begin(myVec));
}
int main(int argc, const char * argv[]) {
const int testInstance = 10000; //HOW MANY TIMES TO PERFORM THE TEST
//----------------------------std::map-------------------------------
clock_t map_cputime = std::clock(); //START MEASURING THE CPU TIME
for (int i=0; i<testInstance; ++i) {
map<string, int> myMap;
//insert unique keys
for (int i=0; i<myStrings.size(); ++i) {
myMap[myStrings[i]] = i;
}
//iterate again, if key exists, replace value;
for (int i=0; i<myStrings.size(); ++i) {
if (myMap.find(myStrings[i]) != myMap.end())
myMap[myStrings[i]] = i * 100;
}
}
//FINISH MEASURING THE CPU TIME
double map_cpu = (std::clock() - map_cputime) / (double)CLOCKS_PER_SEC;
cout << "Map Finished in " << map_cpu << " seconds [CPU Clock] " << endl;
//----------------------------std::vector-------------------------------
clock_t vec_cputime = std::clock(); //START MEASURING THE CPU TIME
for (int i=0; i<testInstance; ++i) {
vector<MyData> myVec;
//insert unique keys
for (int i=0; i<myStrings.size(); ++i) {
const int pos = findStringPosFromVec(myVec, myStrings[i]);
if (pos == -1)
myVec.push_back({myStrings[i], i});
}
//iterate again, if key exists, replace value;
for (int i=0; i<myStrings.size(); ++i) {
const int pos = findStringPosFromVec(myVec, myStrings[i]);
if (pos != -1)
myVec[pos].value = i * 100;
}
}
//FINISH MEASURING THE CPU TIME
double vec_cpu = (std::clock() - vec_cputime) / (double)CLOCKS_PER_SEC;
cout << "Vector Finished in " << vec_cpu << " seconds [CPU Clock] " << endl;
return 0;
}
And this is the result I got.
Map Finished in 0.38121 seconds [CPU Clock]
Vector Finished in 0.346863 seconds [CPU Clock]
Program ended with exit code: 0
I mostly store less than 30 elements in a container.
Does this mean it is better to use std::vector instead of std::map in my case?
EDIT: when I move map<string, int> myMap; before the loop, std::map was faster than std::vector.
Map Finished in 0.278136 seconds [CPU Clock]
Vector Finished in 0.328548 seconds [CPU Clock]
Program ended with exit code: 0
So If this is the proper test, I guess std::map is faster.
But, If I reduce the amount of elements to 10, std::vector was faster so I guess it really depends on the number of elements.
I would say that in general, it's possible that a vector performs better than a map for lookups, but for a tiny amount of data only, e.g. you've mentioned less than 30 elements.
The reason is that linear search through continuous memory chunk is the cheapest way to access memory. A map keeps data at random memory locations, so it's a little bit more expensive to access them. In case of a tiny number of elements, this might play a role. In real life with hundreds and thousands of elements, algorithmic complexity of a lookup operation will dominate this performance gain.
BUT! You are benchmarking completely different things:
You are populating a map. In case of a vector, you don't do this
Your code could perform TWO map lookups: first, find to check existence, second [] operator to find an element to modify. These are relatively heavy operations. You can modify an element just with single find (figure this out yourself, check references!)
Within each test iteration, you are performing additional heavy operations, like memory allocation for each map/vector. It means that your tests are measuring not only lookup performance but something else.
Benchmarking is a difficult problem, don't do this yourself. For example, there are side effects like cache heating and you have to deal with them. Use something like Celero, hayai or google benchmark
Your vector has constant content, so the compiler optimizes most of your code away anyway.
There is little use in measuring for such small counts, and no use measuring for hard coded values.
In the constructor I have piece of code which does the following:
code snippet 1
for (unsigned i=0; i<n; ++i) {
auto xIdIt = _originIdSet.find(xIds[i]);
auto yIdIt = _originIdSet.find(yIds[i]);
if (xIdIt == _originIdSet.end()) { _originIdSet.insert(xIds[i]); }
if (yIdIt == _originIdSet.end()) { _originIdSet.insert(yIds[i]); }
}
_originIdSet is of the type std::unordered_set<uint32_t>xIds and yIds is of the type std::vector<uint32_t>
xIds and yIds contain lots of duplicate entries, e.g.
xIds = {1,2,1,2,3,5,....}
yIds = {2,3,4,1,4,1,....}
xIds[i] never equals yIds[i]
I'm compiling with gcc 5.30 as follows: g++ -g -Wall -m64 -O3 -std=c++11
I've been profiling how much time this piece of code (i.e. code snippet 1) takes when n equals 10k^2 and I've found that if I change the code to:
code snippet 2
for (unsigned i=0; i<n; ++i) {
// We cannot insert duplicates in a set so why check anyway
_originIdSet.insert(xIds[i]);
_originIdSet.insert(yIds[i]);
}
It will be around 5 seconds slower (total run code-snippet 1 takes around 15 seconds).
I'm wondering what is the underlying reason for the performance decrease. My first guess would be that this is caused due branch optimization (excellent explanation here), however I believe it doesn't make sense that in this situation branch optimization would only be applied if an if/else conditions is used. Hopefully somebody can clarify what is happening here.
Here is two samples:
#include <unordered_set>
#include <vector>
#include <chrono>
#include <iostream>
using namespace std;
int main() {
unsigned n = 10000000;
unordered_set<uint32_t> _originIdSet;
vector<uint32_t> xIds, yIds;
for (unsigned i = 0; i < n; ++i) {
xIds.push_back(rand() % n);
yIds.push_back(rand() % n);
}
auto start = chrono::steady_clock::now();
for (unsigned i = 0; i < n; ++i) {
auto xIdIt = _originIdSet.find(xIds[i]);
auto yIdIt = _originIdSet.find(yIds[i]);
if (xIdIt == _originIdSet.end()) { _originIdSet.insert(xIds[i]); }
if (yIdIt == _originIdSet.end()) { _originIdSet.insert(yIds[i]); }
}
auto end = chrono::steady_clock::now();
std::cout << "inserted " << _originIdSet.size() << " to " << chrono::duration_cast<chrono::milliseconds>(end-start).count() << " milliseconds" << std::endl;
}
And without any checking:
for (unsigned i = 0; i < n; ++i) {
_originIdSet.insert(xIds[i]);
_originIdSet.insert(yIds[i]);
}
I see ~5% of time difference 3200ms vs 3400ms.
But if i replace two array xIds & yIds to one array that contains all elements from both array I get same time for both cases and it works slowly than two array.
I think, that general reason, it is because method find() doesn't throw any exception and it is constant.
Modern cpu is able to do more than 1 instruction in one tact for each thread, than you get un-obviously parallel running of two find().
Or, second possibilities, gcc is able to generate SSE code for calculation of 2 hash values and search 2 position in set in same moment for two calls of find().
In a larger numerical computation, I have to perform the trivial task of summing up the products of the elements of two vectors. Since this task needs to be done very often, I tried to make use of the auto vectorization capabilities of my compiler (VC2015). I introduced a temporary vector, where the products are saved in in a first loop and then performed the summation in a second loop. Optimization was set to full and fast code was preferred. This way, the first loop got vectorized by the compiler (I know this from the compiler output).
The result was surprising. The vectorized code performed 3 times slower on my machine (core i5-4570 3.20 GHz) than the simple code. Could anybody explain why and what might improve the performance? I've put both versions of the algorithm fragment into a minimal running example, which I used myself for testing:
#include "stdafx.h"
#include <vector>
#include <Windows.h>
#include <iostream>
using namespace std;
int main()
{
// Prepare timer
LARGE_INTEGER freq,c_start,c_stop;
QueryPerformanceFrequency(&freq);
int size = 20000000; // size of data
double v = 0;
// Some data vectors. The data inside doesn't matter
vector<double> vv(size);
vector<double> tt(size);
vector<float> dd(size);
// Put random values into the vectors
for (int i = 0; i < size; i++)
{
tt[i] = rand();
dd[i] = rand();
}
// The simple version of the algorithm fragment
QueryPerformanceCounter(&c_start); // start timer
for (int p = 0; p < size; p++)
{
v += tt[p] * dd[p];
}
QueryPerformanceCounter(&c_stop); // Stop timer
cout << "Simple version took: " << ((double)(c_stop.QuadPart - c_start.QuadPart)) / ((double)freq.QuadPart) << " s" << endl;
cout << v << endl; // We use v once. This avoids its calculation to be optimized away.
// The version that is auto-vectorized
for (int i = 0; i < size; i++)
{
tt[i] = rand();
dd[i] = rand();
}
v = 0;
QueryPerformanceCounter(&c_start); // start timer
for (int p = 0; p < size; p++) // This loop is vectorized according to compiler output
{
vv[p] = tt[p] * dd[p];
}
for (int p = 0; p < size; p++)
{
v += vv[p];
}
QueryPerformanceCounter(&c_stop); // Stop timer
cout << "Vectorized version took: " << ((double)(c_stop.QuadPart - c_start.QuadPart)) / ((double)freq.QuadPart) << " s" << endl;
cout << v << endl; // We use v once. This avoids its calculation to be optimized away.
cin.ignore();
return 0;
}
You added a large amount of work by storing the products in a temporary vector.
For such a simple computation on large data, the CPU time that you expect to save by vectorization doesn't matter. Only memory references matter.
You added memory references, so it runs slower.
I would have expected the compiler to optimize the original version of that loop. I doubt the optimization would affect the execution time (because it is dominated by memory access regardless). But it should be visible in the generated code. If you wanted to hand optimize code like that, a temporary vector is always the wrong way to go. The right direction is the following (for simplicity, I assumed size is even):
for (int p = 0; p < size; p+=2)
{
v += tt[p] * dd[p];
v1 += tt[p+1] * dd[p+1];
}
v += v1;
Note that your data is large enough and operation simple enough, that NO optimization should be able to improve on the simplest version. That includes my sample hand optimization. But I assume your test is not exactly representative of what you are really trying to do or understand. So with smaller data or a more complicated operation, the approach I showed may help.
Also notice my version relies on addition being commutative. For real numbers, addition is commutative. But in floating point, it isn't. The answer is likely to be different by an amount too tiny for you to care. But that is data dependent. If you have large values of opposite sign in odd/even positions canceling each other early in the original sequence, then by segregating the even and odd positions my "optimization" would totally destroy the answer. (Of course, the opposite can also be true. For example, if all the even positions were tiny and the odds included large values canceling each other, then the original sequence produced garbage and the changed sequence would be more correct).
I am interested in porting some existing code to use thrust to see if I can speed it up on the GPU with relative ease.
What I'm looking to accomplish is a stream compaction operation, where only nonzero elements will be kept. I have this mostly working, per the example code below. The part that I am unsure of how to tackle is dealing with all the extra fill space that is in d_res and thus h_res, after the compaction happens.
The example just uses a 0-99 sequence with all the even entries set to zero. This is just an example, and the real problem will be a general sparse array.
This answer here helped me greatly, although when it comes to reading out the data, the size is just known to be constant:
How to quickly compact a sparse array with CUDA C?
I suspect that I can work around this by counting the number of 0's in d_src, and then only allocating d_res to be that size, or doing the count after the compaction, and only copying that many element. Is that really the right way to do it?
I get the sense that there will be some easy fix for this, via clever use of iterators or some other feature of thrust.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
//Predicate functor
struct is_not_zero
{
__host__ __device__
bool operator()(const int x)
{
return (x != 0);
}
};
using namespace std;
int main(void)
{
size_t N = 100;
//Host Vector
thrust::host_vector<int> h_src(N);
//Fill with some zero and some nonzero data, as an example
for (int i = 0; i < N; i++){
if (i % 2 == 0){
h_src[i] = 0;
}
else{
h_src[i] = i;
}
}
//Print out source data
cout << "Source:" << endl;
for (int i = 0; i < N; i++){
cout << h_src[i] << " ";
}
cout << endl;
//copies to device
thrust::device_vector<int> d_src = h_src;
//Result vector
thrust::device_vector<int> d_res(d_src.size());
//Copy non-zero elements from d_src to d_res
thrust::copy_if(d_src.begin(), d_src.end(), d_res.begin(), is_not_zero());
//Copy back to host
thrust::host_vector<int> h_res(d_res.begin(), d_res.end());
//thrust::host_vector<int> h_res = d_res; //Or just this?
//Show results
cout << "h_res size is " << h_res.size() << endl;
cout << "Result after remove:" << endl;
for (int i = 0; i < h_res.size(); i++){
cout << h_res[i] << " ";
}
cout << endl;
return 0;
}
Also, I am a novice with thrust, so if the above code has any obvious flaws that go against recommended practices for using thrust, please let me know.
Similarly, speed is always of interest. Reading some of the various thrust tutorials, it seems like little changes here and there can be big speed savers or wasters. So, please let me know if there is a smart way to speed this up.
What you have appeared to have overlooked is that copy_if returns an iterator which points to the end of the copied data from the stream compaction operation. So all that is required is this:
//copies to device
thrust::device_vector<int> d_src = h_src;
//Result vector
thrust::device_vector<int> d_res(d_src.size());
//Copy non-zero elements from d_src to d_res
auto result_end = thrust::copy_if(d_src.begin(), d_src.end(), d_res.begin(), is_not_zero());
//Copy back to host
thrust::host_vector<int> h_res(d_res.begin(), result_end);
Doing this sizes h_res to only hold the non zeroes and only copies the non zeroes from the output of the stream compaction. No extra computation is required.