I have the following recursive program which I would like to parallelize using OpenMP:
#include <iostream>
#include <cmath>
#include <numeric>
#include <vector>
#include <algorithm>
#include <thread>
#include <omp.h>
// Determines if a point of dimension point.size() is within the sphere
bool isPointWithinSphere(std::vector<int> point, const double &radius) {
    // Since we know that the sphere is centered at the origin, we can simply
    // find the Euclidean distance (square root of the sum of squares) and check
    // whether it is less than or equal to the length of the radius.

    // Square each element inside the point vector.
    std::transform(point.begin(), point.end(), point.begin(), [](auto &x){return std::pow(x,2);});

    // Find the square root of the sum of squares and check if it is less than or equal to the radius.
    return std::sqrt(std::accumulate(point.begin(), point.end(), 0, std::plus<int>())) <= radius;
}
// Counts the number of lattice points inside the sphere (all points (x1 ... xn) such that xi is an integer).
// The algorithm: If the radius is a floating point value, first find the floor of the radius and cast it to
// an integer. For example, if the radius is 2.43 then the only integer points we must check are those between
// -2 and 2. We generate these points by simulating n nested loops using recursion and passing each point
// in to the boolean function isPointWithinSphere(...); if the function returns true, we add one to the count
// (we have found a lattice point in the sphere).
int countLatticePoints(std::vector<int> point, const double radius, const int dimension, int count = 0) {
    const int R = static_cast<int>(std::floor(radius));
    #pragma omp parallel for
    for(int i = -R; i <= R; i++) {
        point.push_back(i);
        if(point.size() == dimension) {
            if(isPointWithinSphere(point, radius)) count++;
        } else count = countLatticePoints(point, radius, dimension, count);
        point.pop_back();
    }
    return count;
}
int main(int argc, char ** argv) {
    std::vector<int> vec;
    #pragma omp parallel
    std::cout << countLatticePoints(vec, 5, 7) << std::endl;
    return 0;
}
I have tried adding a parallel region within the main function as well as parallelizing the for loop within countLatticePoints yet I see hardly any improvement gained from parallelizing vs running the algorithm sequentially.
Any help / advice would be appreciated in terms of other OpenMP strategies that I can use.
I'll take the advice route. Before trying to make your program faster using threads, you first want to make it faster in the single threaded case. There are several improvements you can make. You're making a lot of copies of your point vectors, which incurs a lot of expensive memory allocations.
Pass point into isPointWithinSphere as a reference. Then, rather than two loops, use one loop to square and accumulate each element in point. Then, when checking the radius, compare the squared distance against the squared radius rather than taking the distance itself. This avoids the sqrt call and replaces it with a single multiplication.
countLatticePoints should also take point by reference. Rather than calling point.size(), subtract 1 from dimension each time you recurse, then just check for dimension == 1 instead of computing the size.
With all that, if you still want/need to introduce threading, you'll need to make some adjustments due to passing point by reference. countLatticePoints will need two variants: the initial call that has the OpenMP directive in it, and the recursive one that does not.
The #pragma omp parallel in main won't do anything because there is only one block of code to execute.
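Putting those changes together, a minimal sketch of the restructured program might look like this (untested; the reduction(+:total) clause and the per-thread point vector are my additions to keep the parallel loop race-free):

#include <cmath>
#include <vector>
#include <omp.h>

bool isPointWithinSphere(const std::vector<int>& point, double radiusSquared) {
    long long sum = 0;
    for (int x : point) sum += (long long)x * x;  // square and accumulate in one pass
    return sum <= radiusSquared;                  // compare squared values; no sqrt needed
}

int countLatticePointsRecursive(std::vector<int>& point, double radiusSquared, int R, int dimension) {
    int count = 0;
    for (int i = -R; i <= R; i++) {
        point.push_back(i);
        if (dimension == 1) {
            if (isPointWithinSphere(point, radiusSquared)) count++;
        } else {
            count += countLatticePointsRecursive(point, radiusSquared, R, dimension - 1);
        }
        point.pop_back();
    }
    return count;
}

int countLatticePoints(double radius, int dimension) {
    const int R = (int)std::floor(radius);
    const double radiusSquared = radius * radius;
    int total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = -R; i <= R; i++) {
        std::vector<int> point{ i };  // each thread works on its own vector
        if (dimension == 1) {
            if (isPointWithinSphere(point, radiusSquared)) total++;
        } else {
            total += countLatticePointsRecursive(point, radiusSquared, R, dimension - 1);
        }
    }
    return total;
}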
I've written a program that approximates Pi using the Monte Carlo method. It is working fine, but I wonder if I can make it work better and faster, because when inserting something like n = 100000000 and bigger, it takes some time to do the calculations and print the result.
I've imagined I could approximate it better by taking the median of n results, but considering how slow my algorithm is for big numbers, I decided against doing so.
Basically, the question is: How can I make this function work faster?
Here is the code that I've got so far:
double estimate_pi(double n)
{
    int i;
    double x, y, distance;
    srand( (unsigned)time(0) );
    double num_point_circle = 0;
    double num_point_total = 0;
    double final;
    for (i=0; i<n; i++)
    {
        x = (double)rand()/RAND_MAX;
        y = (double)rand()/RAND_MAX;
        distance = sqrt(x*x + y*y);
        if (distance <= 1)
        {
            num_point_circle += 1;
        }
        num_point_total += 1;
    }
    final = ((4 * num_point_circle) / num_point_total);
    return final;
}
An obvious (small) speedup is to get rid of the square root operation.
sqrt(x*x + y*y) is smaller than 1 exactly when x*x + y*y is smaller than 1.
So you can just write:
double distance2 = x*x + y*y;
if (distance2 <= 1) {
...
}
That might gain something, as computing the square root is an expensive operation and takes much more time (~4-20 times slower than one addition, depending on CPU, architecture, optimization level, ...).
You should also use integer values for variables where you count, like num_point_circle, n and num_point_total. Use int, long or long long for them.
The Monte Carlo algorithm is also embarrassingly parallel. So an idea to speed it up would be to run the algorithm with multiple threads, each one processes a fraction of n, and then at the end sum up the number of points that are inside the circle.
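A minimal sketch of that idea with OpenMP (rand_r is POSIX; the per-thread seeding is illustrative):

#include <omp.h>
#include <stdlib.h>

double estimate_pi_parallel(long long n)
{
    long long inside = 0;
    #pragma omp parallel reduction(+:inside)
    {
        unsigned int seed = 12345u + 17u * omp_get_thread_num(); // per-thread seed
        #pragma omp for
        for (long long i = 0; i < n; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0) inside++;   // each thread counts its own hits
        }
    }
    return 4.0 * (double)inside / (double)n;      // per-thread counts summed by the reduction
}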
Additionally you could try to improve the speed with SIMD instructions.
And as last, there are much more efficient ways of computing Pi.
With the Monte Carlo method you can do millions of iterations and only get an accuracy of a couple of digits. With better algorithms it's possible to compute thousands, millions or billions of digits.
E.g. you could read up on the website https://www.i4cy.com/pi/
First of all, you should modify the code to consider how much the estimate changes after a certain number of iterations and stop when it has reached a satisfying accuracy.
For your code as is, with a fixed number of iterations, there isn't much to optimize that the compiler cannot do better than you. One minor thing you can improve is to drop computing the square root in every iteration. You don't need it, because sqrt(x) <= 1 is true exactly when x <= 1 (for non-negative x).
Very roughly speaking you can expect a good estimate from your Monte Carlo method once your x,y points are uniformly distributed in the quarter circle. Considering that, there is a much simpler way to get points uniformly distributed without randomness: Use a uniform grid and count how many points are inside the circle and how many are outside.
I would expect this to converge much better (when decreasing the grid spacing) than Monte Carlo. However, you can also try to use it to speed up your Monte Carlo code by starting from counts of num_point_circle and num_point_total obtained from points on a uniform grid and then incrementing the counters by continuing with randomly distributed points.
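A minimal sketch of such a grid count (sampling the center of each cell of an N x N grid over the unit square; N is illustrative):

double estimate_pi_grid(int N)
{
    long long inside = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            // sample the center of each grid cell in the unit square
            double x = (i + 0.5) / N;
            double y = (j + 0.5) / N;
            if (x * x + y * y <= 1.0) inside++;
        }
    }
    return 4.0 * (double)inside / ((double)N * N);
}

With N = 10000 this evaluates 10^8 points deterministically, with no random number generator in the loop.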
As the error decreases as the inverse of the square root of the number of samples, to gain a single digit you need one hundred times more samples. No micro-optimization will help significantly. Just consider that this method is of no use to effectively compute π.
If you insist: https://en.wikipedia.org/wiki/Variance_reduction.
A first level improvement would be to sample on a regular grid, using a brute-force algorithm, as suggested by 463035818_is_not_a_number.
The next level improvement would be to "draw" just a semicircle for each 'x', counting how many points must be below y = sqrt(radius*radius - x*x).
This reduces the complexity from O(N^2) to O(N).
With a grid size == radius of 10, 100, 1000, 10000, etc., one should gain about one digit at each step.
One of the problems with the rand() function is that the numbers soon begin to repeat. With a regular grid and the problem turned into O(N), we can even simulate a grid of size 2^32 in quite a reasonable time.
With ideas from Bresenham's circle drawing algorithm we can even quickly evaluate whether the candidate (x+1, y_previous) or (x+1, y_previous-1) should be selected, as only one of those is inside the circle for the first octant. For the second octant some other hacks are needed to avoid sqrt.
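A minimal sketch of that O(N) counting scheme (using floating-point sqrt, which is fine for moderate R; for R near 2^32 an integer square root would be needed to avoid rounding issues):

#include <stdint.h>
#include <math.h>

double estimate_pi_quartercircle(int64_t R)
{
    int64_t inside = 0;
    for (int64_t x = 0; x < R; x++) {
        // number of lattice points with 0 <= y < sqrt(R*R - x*x) in this column
        inside += (int64_t)sqrt((double)(R*R - x*x));
    }
    // area of the quarter circle is pi*R*R/4
    return 4.0 * (double)inside / ((double)R * (double)R);
}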
The question was "How can I make this function work faster?", not how to make it more precise!
Since Monte Carlo is an estimate anyway and the final multiplication is by 2, 4, 8 and so on, you can also use bit operations. Any if statement makes it slow, so try to get rid of it. Since we increase by either 1 or nothing (0) anyway, you can drop the statement and reduce it to simple math, which should be faster. And when i is initialised before the loop and we are counting up inside the loop, it can also be a while loop.
#include <stdint.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

double estimate_alt_pi(double n){
    uint64_t num_point_total = 0;
    double x, y;
    srand( (unsigned)time(0) );
    uint64_t num_point_circle = 0;
    while (num_point_total < n) {
        x = (double)rand()/RAND_MAX;
        y = (double)rand()/RAND_MAX;
        num_point_circle += (sqrt(x*x + y*y) <= 1);
        num_point_total++; // +1.0 we don't need 'i'
    }
    return ((num_point_circle << 2) / (double)num_point_total); // x<<2 == x * 4
}
Benchmarked with Xcode on a Mac, it looks like this:
extern uint64_t dispatch_benchmark(size_t count, void (^block)(void));

int main(int argc, const char * argv[]) {
    size_t count = 1000000;
    double input = 1222.52764523423;
    __block double resultA;
    __block double resultB;
    uint64_t t = dispatch_benchmark(count, ^{
        resultA = estimate_pi(input);
    });
    uint64_t t2 = dispatch_benchmark(count, ^{
        resultB = estimate_alt_pi(input);
    });
    fprintf(stderr, "estimate_____pi=%f time used=%llu\n", resultA, t);
    fprintf(stderr, "estimate_alt_pi=%f time used=%llu\n", resultB, t2);
    return 0;
}
~1.35 times faster, i.e. it takes ~73% of the original time,
but with significantly less difference when the given number is lower.
Also, the whole algorithm only works properly up to inputs with at most 7 digits, which is because of the data types used. Below 2 digits it is even slower. So the whole Monte Carlo algorithm is indeed not really worth digging deeper into when it is only about speed while keeping some kind of reliability.
Literally nothing will be faster than using a #define or a static with a fixed value for Pi, but that was not your question.
Your code is very bad from a performance aspect, as:
x = (double)rand()/RAND_MAX;
y = (double)rand()/RAND_MAX;
performs a conversion between int and double plus a division ... Why not use Random() instead? Also this:
for (i=0; i<n; i++)
is a bad idea, as n is double and i is int, so either store n in an int variable at the start or change the header to int n. Btw, why are you computing num_point_total when you already have n? Also:
num_point_circle += (sqrt(x*x + y*y) <= 1);
is a bad idea. Why sqrt? You know 1^2 = 1, so you can simply do:
num_point_circle += (x*x + y*y <= 1);
Why not do continuous computing? ... What you need to implement is:
load of state at app start
computation either in a timer or an OnIdle event, so in each iteration/event you will do N iterations of Pi (adding to some global sum and count)
save of state at app exit
Beware: Monte Carlo Pi computation converges very slowly, and you will hit floating point accuracy problems once the sum grows too big.
Here small example I did a long time ago doing continuous Monte Carlo...
Form cpp:
//$$---- Form CPP ----
//---------------------------------------------------------------------------
#include <vcl.h>
#include <math.h>
#pragma hdrstop
#include "Unit1.h"
#include "performance.h"
//---------------------------------------------------------------------------
#pragma package(smart_init)
#pragma resource "*.dfm"
TForm1 *Form1;
//---------------------------------------------------------------------------
int i=0,n=0,n0=0;
//---------------------------------------------------------------------------
void __fastcall TForm1::Idleloop(TObject *Sender, bool &Done)
{
    int j;
    double x, y;
    for (j=0; j<10000; j++, n++)
    {
        x=Random(); x*=x;
        y=Random(); y*=y;
        if (x+y<=1.0) i++;
    }
}
//---------------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner):TForm(Owner)
{
    tbeg();
    Randomize();
    Application->OnIdle = Idleloop;
}
//-------------------------------------------------------------------------
void __fastcall TForm1::Timer1Timer(TObject *Sender)
{
    double dt;
    AnsiString txt;
    txt ="ref = 3.1415926535897932384626433832795\r\n";
    if (n) txt+=AnsiString().sprintf("Pi  = %.20lf\r\n",4.0*double(i)/double(n));
    txt+=AnsiString().sprintf("i/n = %i / %i\r\n",i,n);
    dt=tend();
    if (dt>1e-100) txt+=AnsiString().sprintf("IPS = %8.0lf\r\n",double(n-n0)*1000.0/dt);
    tbeg(); n0=n;
    mm_log->Text=txt;
}
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
Form h:
//$$---- Form HDR ----
//---------------------------------------------------------------------------
#ifndef Unit1H
#define Unit1H
//---------------------------------------------------------------------------
#include <Classes.hpp>
#include <Controls.hpp>
#include <StdCtrls.hpp>
#include <Forms.hpp>
#include <ExtCtrls.hpp>
//---------------------------------------------------------------------------
class TForm1 : public TForm
{
__published: // IDE-managed Components
    TMemo *mm_log;
    TTimer *Timer1;
    void __fastcall Timer1Timer(TObject *Sender);
private: // User declarations
public:  // User declarations
    __fastcall TForm1(TComponent* Owner);
    void __fastcall Idleloop(TObject *Sender, bool &Done);
};
//---------------------------------------------------------------------------
extern PACKAGE TForm1 *Form1;
//---------------------------------------------------------------------------
#endif
Form dfm:
object Form1: TForm1
  Left = 0
  Top = 0
  Caption = 'Project Euler'
  ClientHeight = 362
  ClientWidth = 619
  Color = clBtnFace
  Font.Charset = OEM_CHARSET
  Font.Color = clWindowText
  Font.Height = 14
  Font.Name = 'System'
  Font.Pitch = fpFixed
  Font.Style = [fsBold]
  OldCreateOrder = False
  PixelsPerInch = 96
  TextHeight = 14
  object mm_log: TMemo
    Left = 0
    Top = 0
    Width = 619
    Height = 362
    Align = alClient
    ScrollBars = ssBoth
    TabOrder = 0
  end
  object Timer1: TTimer
    Interval = 100
    OnTimer = Timer1Timer
    Left = 12
    Top = 10
  end
end
So you should add the save/load of state ...
As mentioned there are much much better ways of obtaining Pi like BBP
Also the code above use my time measurement heder so here it is:
//---------------------------------------------------------------------------
//--- Performance counter time measurement: 2.01 ----------------------------
//---------------------------------------------------------------------------
#ifndef _performance_h
#define _performance_h
//---------------------------------------------------------------------------
const int performance_max=64;           // stack depth for nested measurements
double performance_Tms=-1.0,            // counter period [ms]
       performance_tms=0.0,             // measured time after tend [ms]
       performance_t0[performance_max]; // measured start times [ms]
int    performance_ix=-1;               // index of the current time
//---------------------------------------------------------------------------
void tbeg(double *t0=NULL) // measure start time
{
double t;
LARGE_INTEGER i;
if (performance_Tms<=0.0)
{
for (int j=0;j<performance_max;j++) performance_t0[j]=0.0;
QueryPerformanceFrequency(&i); performance_Tms=1000.0/double(i.QuadPart);
}
QueryPerformanceCounter(&i); t=double(i.QuadPart); t*=performance_Tms;
if (t0) { t0[0]=t; return; }
performance_ix++;
if ((performance_ix>=0)&&(performance_ix<performance_max)) performance_t0[performance_ix]=t;
}
//---------------------------------------------------------------------------
void tpause(double *t0=NULL) // stop counting time between tbeg()..tend() calls
{
double t;
LARGE_INTEGER i;
QueryPerformanceCounter(&i); t=double(i.QuadPart); t*=performance_Tms;
if (t0) { t0[0]=t-t0[0]; return; }
if ((performance_ix>=0)&&(performance_ix<performance_max)) performance_t0[performance_ix]=t-performance_t0[performance_ix];
}
//---------------------------------------------------------------------------
void tresume(double *t0=NULL) // resume counting time between tbeg()..tend() calls
{
double t;
LARGE_INTEGER i;
QueryPerformanceCounter(&i); t=double(i.QuadPart); t*=performance_Tms;
if (t0) { t0[0]=t-t0[0]; return; }
if ((performance_ix>=0)&&(performance_ix<performance_max)) performance_t0[performance_ix]=t-performance_t0[performance_ix];
}
//---------------------------------------------------------------------------
double tend(double *t0=NULL) // return duration [ms] between matching tbeg()..tend() calls
{
double t;
LARGE_INTEGER i;
QueryPerformanceCounter(&i); t=double(i.QuadPart); t*=performance_Tms;
if (t0) { t-=t0[0]; performance_tms=t; return t; }
if ((performance_ix>=0)&&(performance_ix<performance_max)) t-=performance_t0[performance_ix]; else t=0.0;
performance_ix--;
performance_tms=t;
return t;
}
//---------------------------------------------------------------------------
double tper(double *t0=NULL) // return duration [ms] between tper() calls
{
double t,tt;
LARGE_INTEGER i;
if (performance_Tms<=0.0)
{
for (int j=0;j<performance_max;j++) performance_t0[j]=0.0;
QueryPerformanceFrequency(&i); performance_Tms=1000.0/double(i.QuadPart);
}
QueryPerformanceCounter(&i); t=double(i.QuadPart); t*=performance_Tms;
if (t0) { tt=t-t0[0]; t0[0]=t; performance_tms=tt; return tt; }
performance_ix++;
if ((performance_ix>=0)&&(performance_ix<performance_max))
{
tt=t-performance_t0[performance_ix];
performance_t0[performance_ix]=t;
}
else { t=0.0; tt=0.0; };
performance_ix--;
performance_tms=tt;
return tt;
}
//---------------------------------------------------------------------------
AnsiString tstr()
{
AnsiString s;
s=s.sprintf("%8.3lf",performance_tms); while (s.Length()<8) s=" "+s; s="["+s+" ms]";
return s;
}
//---------------------------------------------------------------------------
AnsiString tstr(int N)
{
AnsiString s;
s=s.sprintf("%8.3lf",performance_tms/double(N)); while (s.Length()<8) s=" "+s; s="["+s+" ms]";
return s;
}
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
#endif
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
And here is the result after a few seconds:
where IPS is iterations per second, i and n are the global variables holding the actual state of the computation, Pi is the current approximation and ref is a reference value of Pi for comparison. As I am computing in OnIdle and showing results in OnTimer, there is no need for any locks as it is all a single thread. For more speed you could launch more computing threads, however then you would need to add multithreading lock mechanisms and make the global state volatile.
In case you are doing a console app (no forms), you can still do this: just convert your code to an infinite loop, recompute and output the Pi value once every ??? iterations, and stop on some key hit like escape.
I am implementing Latent Dirichlet Allocation (LDA) in Rcpp. In LDA, we need to deal with a huge sparse matrix (e.g. 50 x 3000).
I decided to use SparseMatrix in Eigen. However, since I need access to each cell, the computationally expensive .coeffRef slows down my function a lot.
Is there any way to use SparseMatrix while keeping the speed?
What I want to do has four steps,
I know which cell (i,j) I want to access.
I want to know whether the cell (i,j) is 0 or not.
If the cell (i,j) is not 0, I want to know its value.
After doing some analysis with the values from steps 2 and 3, I want to update the cell (i,j). In this step, I might need to update a cell (i,j) which originally was 0.
#include <iostream>
#include <Eigen/Dense>
#include <Eigen/Sparse>
using namespace std;
using namespace Eigen;
typedef Eigen::Triplet<double> T;

int main(){
    Eigen::SparseMatrix<double> spmat;
    // Insert in spmat
    vector<T> tripletList;
    int value = 0;
    tripletList.push_back(T(0,1,1));
    tripletList.push_back(T(0,3,2));
    tripletList.push_back(T(1,5,3));
    tripletList.push_back(T(2,4,4));
    tripletList.push_back(T(4,1,5));
    tripletList.push_back(T(4,5,6));
    spmat.resize(5,7); // define size
    spmat.setFromTriplets(tripletList.begin(), tripletList.end());
    for(int i=0; i<5; i++){ // I am accessing all cells just to clarify I need to access cells
        for(int j=0; j<7; j++){
            // Check if (i,j) is 0
            if(spmat.coeffRef(i,j) != 0){
                // Some analysis
                value = spmat.coeffRef(i,j)*2; // just an example, more complex in the model
            }
            spmat.coeffRef(i,j) += value; // update (i,j)
        }
    }
    cout << spmat << endl;
    return 0;
}
Since the number of rows is much smaller than the number of columns, I considered accessing a column and then checking the row values, but I couldn't handle SparseMatrix<double>::InnerIterator it(spmat, colid).
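For reference, a minimal sketch of how InnerIterator is typically used to visit only the stored nonzeros of one column of a (column-major) SparseMatrix:

#include <Eigen/Sparse>
#include <iostream>

void printColumn(Eigen::SparseMatrix<double>& spmat, int colid)
{
    // Visits only the stored (nonzero) entries of column colid.
    for (Eigen::SparseMatrix<double>::InnerIterator it(spmat, colid); it; ++it) {
        std::cout << "(" << it.row() << "," << it.col() << ") = " << it.value() << "\n";
    }
}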
What would be the most efficient way to compute the fewest hops it takes to get from x1, y1 to x2, y2 on an unbounded/infinite chess board? Assume that from x1, y1 we can always generate a set of legal moves.
This sounds tailor made for BFS and I have implemented one successfully. But its space and time complexity seem atrocious if x2, y2 is arbitrarily large.
I have been looking at various other algorithms like A*, Bidirectional search, iterative deepening DFS etc but so far I am clueless as to which approach would yield the most optimal (and complete) solution. Is there some insight I am missing?
If the set of legal moves is independent of the current space, then this seems ideal as an integer linear programming (ILP) problem. You'd basically solve for the number of each type of move, such that the total number of moves is minimized. For instance, for a knight constrained to only move up and to the right (so that each move is either x+=1, y+=2 or x+=2, y+=1), you'd minimize a1+a2, subject to 2*a1+a2 == x2-x1, a1+2*a2 == y2-y1, a1 >= 0, a2 >= 0. While ILPs in general are NP-complete, I'd expect a standard hill-climbing algorithm to be able to solve it quite efficiently.
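For that restricted two-move example, the system can even be solved in closed form; a minimal sketch (my illustration, names are made up):

#include <optional>
#include <utility>

// Moves allowed: a1 copies of (+2,+1) and a2 copies of (+1,+2).
// From 2*a1 + a2 == dx and a1 + 2*a2 == dy it follows that
// a1 = (2*dx - dy)/3 and a2 = (2*dy - dx)/3.
std::optional<std::pair<long, long>> solveTwoMoveKnight(long dx, long dy)
{
    long n1 = 2 * dx - dy;   // 3*a1
    long n2 = 2 * dy - dx;   // 3*a2
    if (n1 < 0 || n2 < 0 || n1 % 3 != 0 || n2 % 3 != 0)
        return std::nullopt;                // not reachable with only these two moves
    return std::make_pair(n1 / 3, n2 / 3);  // (a1, a2); total moves = a1 + a2
}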
I don't have a complete proof yet, but I believe that if x1,y1 and x2,y2 are far away in both directions, then any optimal solution will have a lot of moves that move directly toward x2 and directly toward y2 (2 possible L-shaped moves that move in this direction). If the current position x is close to x2 but the current position y is far away from y2 for example, then alternate between the two moves that move two squares toward y2. And similarly if y is close to y2 and x and x2 are far away. Then, as soon as both the vertical and horizontal distance to x2,y2 are less than some rather small threshold (probably like 5 or 10), then you have to solve the problem with BFS or whatever to get the optimal solution, and the solution you get should be guaranteed to be optimal. I'll update my answer when I have a proof but I am almost certain this is true. If so, it means that no matter how far away x1,y1 and x2,y2 are from each other, you basically only have to solve a problem where the horizontal and vertical distances are like 5 or 10, which can be done quickly.
To expand on the discussion in comments, an uninformed search like breadth-first search (BFS) will find the optimal solution (the shortest path). However, it only considers the cost-so-far g(n) for a node n, and the number of nodes it expands grows quadratically with the distance from source to target. To tame the cost of the search whilst still ensuring that it finds the optimal solution, you need to add some information to the search algorithm via a heuristic, h(n).
Your case is a good fit for A* search, where the heuristic is a measure of distance from a node to the target (x2, y2). You could use the Euclidean distance "as the crow flies", but as you're considering a Knight then Manhattan distance might be more appropriate. Whatever measure you choose, it has to be less than (or equal to) the actual distance from the node to the target for the search to find the optimal solution (in this case the heuristic is known as "admissible"). Note that you need to divide each distance by a constant in order to get it to underestimate moves: divide by 3 for the Manhattan distance (a single knight move covers a Manhattan distance of 3), and by sqrt(5) for the Euclidean distance (sqrt(5) is the length of the diagonal of a 2 by 1 rectangle, i.e. of one knight move).
When you're running the algorithm you estimate the total distance f(n) from any node n that we've got to already as the distance so far plus the heuristic distance. I.e. f(n) = g(n) + h(n) where g(n) is the distance from (x1,y1) to node n and h(n) is the estimated heuristic distance from node n to (x2,y2). Given the nodes you've got to, you always choose the node n with the lowest f(n). I like the way you put it:
maintain a priority queue of nodes to be checked out ordered by g(n) + h(n).
If the heuristic is admissible then the algorithm finds the optimal solution because a suboptimal path can never be at the front of the priority queue: any fragment of the optimal path will always have a lower total distance (where, again, total distance is incurred distance plus heuristic distance).
The distance measure we've chosen here is monotonic (i.e. increases as the path lengthens rather than going up or down). In this case it's possible to show that it's efficient. As usual, see wikipedia or other sources on the web for more details. The Colorado state university page is particularly good and has nice discussions on optimality and efficiency.
Taking an example of going from (-200,-100) to (0,0), which is equivalent to your example of (0,0) to (200,100), in my implementation what we see with a Manhattan heuristic is as follows
The implementation does too much searching because, with the heuristic h = Manhattan distance, taking steps of 1 across and 2 up seems just as good as the optimal steps of 2 across and 1 up, i.e. the f() values don't distinguish the two. However the algorithm still finds the optimal solution of 100 moves. It takes 2118 steps, which is still a lot better than breadth-first search, which spreads out like an ink blot (I estimate it might take 20000 to 30000 steps).
How does it do if you choose h = Euclidean distance?
This is a lot better! It only takes 104 steps, and it does so well because it incorporates our intuition that you need to head in roughly the right direction. But before we congratulate ourselves, let's try another example, from (-200,0) to (0,0). Both heuristics find an optimal path of length 100. The Euclidean heuristic takes 12171 steps to find an optimal path, as shown below.
Whereas the Manhattan heuristic takes 16077 steps.
Leaving aside the fact that the Manhattan heuristic does worse, again, I believe the real problem here is that there are multiple optimal paths. This isn't so strange: a re-ordering of an optimal path is still optimal. This fact is automatically taken into account by recasting the problem in a mathematical form along the lines of @Sneftel's answer.
In summary, A* with an admissible heuristic produces an optimal solution more efficiently than BFS does, but it is likely that there are more efficient solutions out there. A* is a good default algorithm in cases where you can easily come up with a distance heuristic, and although in this case it isn't going to be the best solution, it's possible to learn a lot about the problem by implementing it.
Code below in C++ as you requested.
#include <memory>
using std::shared_ptr;
#include <vector>
using std::vector;
#include <queue>
using std::priority_queue;
#include <map>
using std::map;
using std::pair;
#include <math.h>
#include <iostream>
using std::cout;
#include <fstream>
using std::ofstream;
struct Point
{
    short x;
    short y;
    Point(short _x, short _y) { x = _x; y = _y; }
    bool IsOrigin() { return x == 0 && y == 0; }
    bool operator<(const Point& p) const {
        return pair<short, short>(x, y) < pair<short, short>(p.x, p.y);
    }
};

class Path
{
    Point m_end;
    shared_ptr<Path> m_prev;
    int m_length; // cached
public:
    Path(const Point& start)
        : m_end(start)
    { m_length = 0; }
    Path(const Point& start, shared_ptr<Path> prev)
        : m_end(start)
        , m_prev(prev)
    { m_length = m_prev->m_length + 1; }
    Point GetEnd() const { return m_end; }
    int GetLength() const { return m_length; }
    vector<Point> GetPoints() const
    {
        vector<Point> points;
        for (const Path* curr = this; curr; curr = curr->m_prev.get()) {
            points.push_back(curr->m_end);
        }
        return points;
    }
    double g() const { return m_length; }
    //double h() const { return (abs(m_end.x) + abs(m_end.y)) / 3.0; } // Manhattan
    double h() const { return sqrt((m_end.x*m_end.x + m_end.y*m_end.y)/5); } // Euclidean
    double f() const { return g() + h(); }
};

bool operator<(const shared_ptr<Path>& p1, const shared_ptr<Path>& p2)
{
    return 1/p1->f() < 1/p2->f(); // priority_queue pops the largest element, so invert f() to pop the smallest f() first
}
int main()
{
    const Point source(-200, 0);
    const Point target(0, 0);
    priority_queue<shared_ptr<Path>> q;
    q.push(shared_ptr<Path>(new Path(source)));
    map<Point, short> endPath2PathLength;
    endPath2PathLength.insert(map<Point, short>::value_type(source, 0));
    int pointsExpanded = 0;
    shared_ptr<Path> path;
    while (!(path = q.top())->GetEnd().IsOrigin())
    {
        q.pop();
        const short newLength = path->GetLength() + 1;
        for (short dx = -2; dx <= 2; ++dx){
            for (short dy = -2; dy <= 2; ++dy){
                if (abs(dx) + abs(dy) == 3){
                    const Point newEnd(path->GetEnd().x + dx, path->GetEnd().y + dy);
                    auto existingEndPath = endPath2PathLength.find(newEnd);
                    if (existingEndPath == endPath2PathLength.end() ||
                        existingEndPath->second > newLength) {
                        q.push(shared_ptr<Path>(new Path(newEnd, path)));
                        endPath2PathLength[newEnd] = newLength;
                    }
                }
            }
        }
        pointsExpanded++;
    }
    cout << "Path length " << path->GetLength()
         << " (points expanded = " << pointsExpanded << ")\n";
    ofstream fout("Points.csv");
    for (auto i : endPath2PathLength) {
        fout << i.first.x << "," << i.first.y << "," << i.second << "\n";
    }
    vector<Point> points = path->GetPoints();
    ofstream fout2("OptimalPoints.csv");
    for (auto i : points) {
        fout2 << i.x << "," << i.y << "\n";
    }
    return 0;
}
Note this isn't very well tested so there may well be bugs but I hope the general idea is clear.
I started a similar question on another thread, but then I was focusing on how to use OpenCV. Having failed to achieve what I originally wanted, I will ask here exactly what I want.
I have two matrices. Matrix a is 2782x128 and Matrix b is 4000x128, both unsigned char values. The values are stored in a single array. For each vector in a, I need the index of the vector in b with the closest Euclidean distance.
Ok, now my code to achieve this:
#include <windows.h>
#include <stdlib.h>
#include <stdio.h>
#include <cstdio>
#include <math.h>
#include <time.h>
#include <sys/timeb.h>
#include <limits.h>
#include <iostream>
#include <fstream>
#include "main.h"
using namespace std;

int main(int argc, char* argv[])
{
    int a_size;
    unsigned char* a = NULL;
    read_matrix(&a, a_size, "matrixa");
    int b_size;
    unsigned char* b = NULL;
    read_matrix(&b, b_size, "matrixb");

    LARGE_INTEGER liStart;
    LARGE_INTEGER liEnd;
    LARGE_INTEGER liPerfFreq;
    QueryPerformanceFrequency( &liPerfFreq );
    QueryPerformanceCounter( &liStart );

    int* indexes = NULL;
    min_distance_loop(&indexes, b, b_size, a, a_size);

    QueryPerformanceCounter( &liEnd );
    cout << "loop time: " << (liEnd.QuadPart - liStart.QuadPart) / (long double)liPerfFreq.QuadPart << "s." << endl;

    if (a)
        delete[] a;
    if (b)
        delete[] b;
    if (indexes)
        free(indexes); // allocated with malloc in min_distance_loop
    return 0;
}
void read_matrix(unsigned char** matrix, int& matrix_size, const char* matrixPath)
{
    FILE* pFile = fopen(matrixPath, "r");
    fscanf(pFile, "%d", &matrix_size);
    *matrix = new unsigned char[matrix_size*128];
    for (int i=0; i<matrix_size*128; ++i)
    {
        unsigned int matPtr;
        fscanf(pFile, "%u", &matPtr);
        (*matrix)[i] = (unsigned char)matPtr; // note: *matrix, not matrix; the original indexed the wrong pointer
    }
    fclose(pFile);
}
void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
{
    const int descrSize = 128;
    *indexes = (int*)malloc(a_size*sizeof(int));
    int dataIndex = 0;
    int vocIndex = 0;
    int min_distance;
    int distance;
    int multiply;
    unsigned char* dataPtr;
    unsigned char* vocPtr;
    for (int i=0; i<a_size; ++i)
    {
        min_distance = INT_MAX; // was LONG_MAX; INT_MAX matches the int type
        for (int j=0; j<b_size; ++j)
        {
            distance = 0;
            dataPtr = &a[dataIndex];
            vocPtr = &b[vocIndex];
            for (int k=0; k<descrSize; ++k)
            {
                multiply = *dataPtr++ - *vocPtr++;
                distance += multiply*multiply;
                // If the distance is greater than the previously calculated, exit
                if (distance > min_distance)
                    break;
            }
            // if distance smaller
            if (distance < min_distance)
            {
                min_distance = distance;
                (*indexes)[i] = j;
            }
            vocIndex += descrSize;
        }
        dataIndex += descrSize;
        vocIndex = 0;
    }
}
And attached are the files with sample matrices.
matrixa
matrixb
I am using windows.h just to measure the elapsed time, so if you want to test the code on a platform other than Windows, just change the windows.h header and change the way of measuring time.
This code on my computer takes about 0.5 seconds. The problem is that I have another code in Matlab that does this same thing in 0.05 seconds. In my experiments, I am receiving several matrices like matrix a every second, so 0.5 seconds is too much.
Now the matlab code to calculate this:
aa=sum(a.*a,2); bb=sum(b.*b,2); ab=a*b';
d = sqrt(abs(repmat(aa,[1 size(bb,1)]) + repmat(bb',[size(aa,1) 1]) - 2*ab));
[minz index]=min(d,[],2);
Ok. The Matlab code is using the identity (x-a)^2 = x^2 + a^2 - 2*x*a.
So my next attempt was to do the same thing. I deleted my own code and made the same calculations, but it took approximately 1.2 seconds.
Then, I tried to use different external libraries. The first attempt was Eigen:
const int descrSize = 128;
MatrixXi a(a_size, descrSize);
MatrixXi b(b_size, descrSize);
MatrixXi ab(a_size, b_size);
unsigned char* dataPtr = matrixa;
for (int i=0; i<nframes; ++i)
{
    for (int j=0; j<descrSize; ++j)
    {
        a(i,j) = (int)*dataPtr++;
    }
}
unsigned char* vocPtr = matrixb;
for (int i=0; i<vocabulary_size; ++i)
{
    for (int j=0; j<descrSize; ++j)
    {
        b(i,j) = (int)*vocPtr++;
    }
}
ab = a*b.transpose();
a = a.cwiseProduct(a); // note: cwiseProduct returns an expression; the original discarded the result
b = b.cwiseProduct(b);
MatrixXi aa = a.rowwise().sum();
MatrixXi bb = b.rowwise().sum();
MatrixXi d = (aa.replicate(1,vocabulary_size) + bb.transpose().replicate(nframes,1) - 2*ab).cwiseAbs(); // the sqrt is unnecessary for the argmin
int* index = (int*)malloc(nframes*sizeof(int));
for (int i=0; i<nframes; ++i)
{
    d.row(i).minCoeff(&index[i]);
}
This Eigen code costs approximately 1.2 seconds just for the line that says ab = a*b.transpose();
A similar code using OpenCV was also tried, and the cost of the ab = a*b.transpose(); was 0.65 seconds.
So it is really annoying that Matlab is able to do this same thing so quickly and I am not able to in C++! Of course being able to run my experiment would be great, but I think the lack of knowledge is what is really annoying me. How can I achieve at least the same performance as in Matlab? Any kind of solution is welcome. I mean, any external library (free if possible), loop unrolling, templates, SSE instructions (I know they exist), cache tricks. As I said, my main purpose is to increase my knowledge to be able to code things like this with faster performance.
Thanks in advance
EDIT: more code suggested by David Hammen. I casted the arrays to int before making any calculations. Here is the code:
void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
{
    const int descrSize = 128;
    int* a_int;
    int* b_int;
    LARGE_INTEGER liStart;
    LARGE_INTEGER liEnd;
    LARGE_INTEGER liPerfFreq;
    QueryPerformanceFrequency( &liPerfFreq );
    QueryPerformanceCounter( &liStart );

    a_int = (int*)malloc(a_size*descrSize*sizeof(int));
    b_int = (int*)malloc(b_size*descrSize*sizeof(int));
    for (int i=0; i<descrSize*a_size; ++i)
        a_int[i] = (int)a[i];
    for (int i=0; i<descrSize*b_size; ++i)
        b_int[i] = (int)b[i];

    QueryPerformanceCounter( &liEnd );
    cout << "Casting time: " << (liEnd.QuadPart - liStart.QuadPart) / (long double)liPerfFreq.QuadPart << "s." << endl;

    *indexes = (int*)malloc(a_size*sizeof(int));
    int dataIndex = 0;
    int vocIndex = 0;
    int min_distance;
    int distance;
    int multiply;
    /*unsigned char* dataPtr;
    unsigned char* vocPtr;*/
    int* dataPtr;
    int* vocPtr;
    for (int i=0; i<a_size; ++i)
    {
        min_distance = INT_MAX;
        for (int j=0; j<b_size; ++j)
        {
            distance = 0;
            dataPtr = &a_int[dataIndex];
            vocPtr = &b_int[vocIndex];
            for (int k=0; k<descrSize; ++k)
            {
                multiply = *dataPtr++ - *vocPtr++;
                distance += multiply*multiply;
                // If the distance is greater than the previously calculated, exit
                if (distance > min_distance)
                    break;
            }
            // if distance smaller
            if (distance < min_distance)
            {
                min_distance = distance;
                (*indexes)[i] = j;
            }
            vocIndex += descrSize;
        }
        dataIndex += descrSize;
        vocIndex = 0;
    }
}
The entire process now takes 0.6 seconds, and the casting loops at the beginning take 0.001 seconds. Maybe I did something wrong?
EDIT2: Anything about Eigen? When I look for external libs they always talk about Eigen and its speed. Did I do something wrong? Here is a simple code using Eigen that shows it is not so fast. Maybe I am missing some config or some flag, or ...
MatrixXd A = MatrixXd::Random(1000, 1000);
MatrixXd B = MatrixXd::Random(1000, 500);
MatrixXd X;
X = A * B; // presumably the line being timed; it was missing from the snippet
This code takes about 0.9 seconds.
As you observed, your code is dominated by the matrix product, which represents about 2.8e9 arithmetic operations. You say that Matlab (or rather the highly optimized MKL) computes it in about 0.05s. This represents a rate of 57 GFLOPS, showing that it is not only using vectorization but also multi-threading. With Eigen, you can enable multi-threading by compiling with OpenMP enabled (-fopenmp with gcc). On my 5 year old computer (2.66GHz Core2), using floats and 4 threads, your product takes about 0.053s, and 0.16s without OpenMP, so there must be something wrong with your compilation flags. To summarize, to get the best of Eigen (a small timing sketch follows the list):
compile in 64bits mode
use floats (doubles are twice as slow owing to vectorization)
enable OpenMP
if your CPU has hyper-threading, then either disable it or define the OMP_NUM_THREADS environment variable to the number of physical cores (this is very important, otherwise the performance will be very bad!)
if you have other task running, it might be a good idea to reduce OMP_NUM_THREADS to nb_cores-1
use the most recent compiler that you can, GCC, clang and ICC are best, MSVC is usually slower.
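For reference, a minimal timing sketch of just the product under those settings (matrix sizes from the question; the timing approach is illustrative):

#include <Eigen/Core>
#include <chrono>
#include <iostream>

int main()
{
    Eigen::MatrixXf a = Eigen::MatrixXf::Random(2782, 128);
    Eigen::MatrixXf b = Eigen::MatrixXf::Random(4000, 128);
    auto t0 = std::chrono::steady_clock::now();
    Eigen::MatrixXf ab = a * b.transpose();   // the dominant ~2.8e9-flop product
    auto t1 = std::chrono::steady_clock::now();
    // print a value so the compiler cannot optimize the product away
    std::cout << ab.sum() << " computed in "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    return 0;
}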
One thing that is definitely hurting you in your C++ code is that it has a boatload of char to int conversions. By boatload, I mean up to 2*2782*4000*128 char to int conversions. Those char to int conversions are slow, very slow.
You can reduce this to (2782+4000)*128 such conversions by allocating a pair of int arrays, one 2782*128 and the other 4000*128, to contain the cast-to-integer contents of your char* a and char* b arrays. Work with these int* arrays rather than your char* arrays.
Another problem might be your use of int versus long. I don't work on windows, so this might not be applicable. On the machines I work on, int is 32 bits and long is now 64 bits. 32 bits is more than enough because 255*255*128 < 256*256*128 = 2^23.
That obviously isn't the problem.
What's striking is that the code in question is not calculating that huge 2782 by 4000 array that the Matlab code is creating. What's even more striking is that Matlab is most likely doing this with doubles rather than ints -- and it's still beating the pants off the C/C++ code.
One big problem is cache. That 4000*128 array is far too big for level 1 cache, and you are iterating over that big array 2782 times. Your code is doing far too much waiting on memory. To overcome this problem, work with smaller chunks of the b array so that your code works with level 1 cache for as long as possible.
Another problem is the optimization if (distance>min_distance) break;. I suspect that this is actually a dis-optimization. Having if tests inside your innermost loop is oftentimes a bad idea. Blast through that inner product as fast as possible. Other than wasted computations, there is no harm in getting rid of this test. Sometimes it is better to make apparently unneeded computations if doing so can remove a branch in an innermost loop. This is one of those cases. You might be able to solve your problem just by eliminating this test. Try doing that.
Getting back to the cache problem, you need to get rid of this branch so that you can split the operations over the a and b matrices into smaller chunks, chunks of no more than 256 rows at a time. That's how many rows of 128 unsigned chars fit into the L1 cache of a modern Intel chip. Since 250 divides 4000, look into logically splitting that b matrix into 16 chunks. You may well want to form that big 2782 by 4000 array of inner products, but do so in small chunks. You can add the if (distance>min_distance) break; back in, but do so at a chunk level rather than at the byte by byte level.
You should be able to beat Matlab because it almost certainly is working with doubles, but you can work with unsigned chars and ints.
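To make the chunking concrete, here is a minimal sketch of the blocked loop (block size and bookkeeping are illustrative, not tuned):

#include <algorithm>
#include <climits>
#include <vector>

void min_distance_blocked(std::vector<int>& indexes,
                          const int* a_int, int a_size,
                          const int* b_int, int b_size)
{
    const int descrSize = 128;
    const int blockRows = 250;               // 250 divides 4000; one block of b stays cache-resident
    indexes.assign(a_size, -1);
    std::vector<int> best(a_size, INT_MAX);  // per-row running minimum distance
    for (int jb = 0; jb < b_size; jb += blockRows) {
        const int jend = std::min(jb + blockRows, b_size);
        for (int i = 0; i < a_size; ++i) {   // scan all of a against this block of b
            const int* ai = &a_int[i * descrSize];
            for (int j = jb; j < jend; ++j) {
                const int* bj = &b_int[j * descrSize];
                int distance = 0;
                for (int k = 0; k < descrSize; ++k) { // branch-free inner loop
                    const int diff = ai[k] - bj[k];
                    distance += diff * diff;
                }
                if (distance < best[i]) {
                    best[i] = distance;
                    indexes[i] = j;
                }
            }
        }
    }
}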
Matrix multiply generally uses the worst possible cache access pattern for one of the two matrices, and the solution is to transpose one of the matrices and use a specialized multiply algorithm that works on data stored that way.
Your matrix already IS stored transposed. By transposing it into the normal order and then using a normal matrix multiply, you are absolutely killing performance.
Write your own matrix multiply loop that inverts the order of indices to the second matrix (which has the effect of transposing it, without actually moving anything around and breaking cache behavior). And pass your compiler whatever options it has for enabling auto-vectorization.
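A minimal sketch of that loop for the data above, where each 128-element row of a and b is one descriptor, so the product against the "transposed" b is just row-times-row dot products with sequential access in both arrays:

// Computes ab(i,j) = dot(row i of a, row j of b) without ever transposing b.
// A compiler with auto-vectorization enabled (e.g. -O3) can vectorize the inner loop.
void rowwise_products(const int* a, int a_size, const int* b, int b_size,
                      int descrSize, int* ab /* a_size * b_size */)
{
    for (int i = 0; i < a_size; ++i) {
        for (int j = 0; j < b_size; ++j) {
            const int* ai = a + i * descrSize;
            const int* bj = b + j * descrSize;
            int sum = 0;
            for (int k = 0; k < descrSize; ++k)  // sequential access in both inputs
                sum += ai[k] * bj[k];
            ab[i * b_size + j] = sum;
        }
    }
}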
I have a ~3000x3000 covariance-alike matrix on which I compute the eigenvalue-eigenvector decomposition (it's a OpenCV matrix, and I use cv::eigen() to get the job done).
However, I actually only need, say, the first 30 eigenvalues/vectors; I don't care about the rest. Theoretically, this should allow the computation to be sped up significantly, right? I mean, that means it has 2970 fewer eigenvectors to compute.
Which C++ library will allow me to do that? Please note that OpenCV's eigen() method does have the parameters for that, but the documentation says they are ignored, and I tested it myself, they are indeed ignored :D
UPDATE:
I managed to do it with ARPACK. I managed to compile it for windows, and even to use it. The results look promising, an illustration can be seen in this toy example:
#include "ardsmat.h"
#include "ardssym.h"
int n = 3; // Dimension of the problem.
double* EigVal = NULL; // Eigenvalues.
double* EigVec = NULL; // Eigenvectors stored sequentially.
int lowerHalfElementCount = (n*n+n) / 2;
//whole matrix:
/*
2 3 8
3 9 -7
8 -7 19
*/
double* lower = new double[lowerHalfElementCount]; //lower half of the matrix
//to be filled with COLUMN major (i.e. one column after the other, always starting from the diagonal element)
lower[0] = 2; lower[1] = 3; lower[2] = 8; lower[3] = 9; lower[4] = -7; lower[5] = 19;
//params: dimensions (i.e. width/height), array with values of the lower or upper half (sequentially, row major), 'L' or 'U' for upper or lower
ARdsSymMatrix<double> mat(n, lower, 'L');
// Defining the eigenvalue problem.
int noOfEigVecValues = 2;
//int maxIterations = 50000000;
//ARluSymStdEig<double> dprob(noOfEigVecValues, mat, "LM", 0, 0.5, maxIterations);
ARluSymStdEig<double> dprob(noOfEigVecValues, mat);
// Finding eigenvalues and eigenvectors.
int converged = dprob.EigenValVectors(EigVec, EigVal);
for (int eigValIdx = 0; eigValIdx < noOfEigVecValues; eigValIdx++) {
std::cout << "Eigenvalue: " << EigVal[eigValIdx] << "\nEigenvector: ";
for (int i = 0; i < n; i++) {
int idx = n*eigValIdx+i;
std::cout << EigVec[idx] << " ";
}
std::cout << std::endl;
}
The results are:
9.4298, 24.24059
for the eigenvalues, and
-0.523207, -0.83446237, -0.17299346
0.273269, -0.356554, 0.893416
for the 2 eigenvectors respectively (one eigenvector per row)
The code fails to find 3 eigenvectors (it can only find 1 or 2 in this case; an assert() makes sure of that, but well, that's not a problem).
In this article, Simon Funk shows a simple, effective way to estimate a singular value decomposition (SVD) of a very large matrix. In his case, the matrix is sparse, with dimensions 17,000 x 500,000.
Now, the description here shows how eigenvalue decomposition is closely related to SVD. Thus, you might benefit from considering a modified version of Simon Funk's approach, especially if your matrix is sparse. Furthermore, your matrix is not only square but also symmetric (if that is what you mean by covariance-like), which likely leads to additional simplification.
... Just an idea :)
It seems that Spectra will do the job with good performance.
Here is an example from their documentation that computes the 3 largest eigenvalues of a dense symmetric matrix M (like your covariance matrix):
#include <Eigen/Core>
#include <Spectra/SymEigsSolver.h>
// <Spectra/MatOp/DenseSymMatProd.h> is implicitly included
#include <iostream>

using namespace Spectra;

int main()
{
    // We are going to calculate the eigenvalues of M
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(10, 10);
    Eigen::MatrixXd M = A + A.transpose();

    // Construct matrix operation object using the wrapper class DenseSymMatProd
    DenseSymMatProd<double> op(M);

    // Construct eigen solver object, requesting the largest three eigenvalues
    SymEigsSolver< double, LARGEST_ALGE, DenseSymMatProd<double> > eigs(&op, 3, 6);

    // Initialize and compute
    eigs.init();
    int nconv = eigs.compute();

    // Retrieve results
    Eigen::VectorXd evalues;
    if(eigs.info() == SUCCESSFUL)
        evalues = eigs.eigenvalues();

    std::cout << "Eigenvalues found:\n" << evalues << std::endl;
    return 0;
}