I have the following piece of C++ code. The scale of the problem is N and M. Running the code takes about two minutes on my machine. (after g++ -O3 compilation). Is there anyway to further accelerate it, on the same machine? Any kind of option, choosing a better data structure, library, GPU or parallelism, etc, is on the table.
void demo() {
int N = 1000000;
int M=3000;
vector<vector<int> > res(M);
for (int i =0; i <N;i++) {
for (int j=1; j < M; j++){
res[j].push_back(i);
}
}
}
int main() {
demo();
return 0;
}
An additional info: The second loop above for (int j=1; j < M; j++) is a simplified version of the real problem. In fact, j could be in a different range for each i (of the outer loop), but the number of iterations is about 3000.
With the exact code as shown when writing this answer, you could create the inner vector once, with the specific size, and call iota to initialize it. Then just pass this vector along to the outer vector constructor to use it for each element.
Then you don't need any explicit loops at all, and instead use the (highly optimized, hopefully) standard library to do all the work for you.
Perhaps something like this:
void demo()
{
static int const N = 1000000;
static int const M = 3000;
std::vector<int> data(N);
std::iota(begin(data), end(data), 0);
std::vector<std::vector<int>> res(M, data);
}
Alternatively you could try to initialize just one vector with that elements, and then create the other vectors just by copying that part of the memory using std::memcpy or std::copy.
Another optimization would be to allocate the memory in advance (e.g. array.reserve(3000)).
Also if you're sure that all the members of the vector are similar vectors, you could do a hack by just creating a single vector with 3000 elements, and in the other res just put the same reference of that 3000-element vector million times.
On my machine which has enough memory to avoid swapping your original code took 86 seconds.
Adding reserve:
for (auto& v : res)
{
v.reserve(N);
}
made basically no difference (85 seconds but I only ran each version once).
Swapping the loop order:
for (int j = 1; j < M; j++) {
for (int i = 0; i < N; i++) {
res[j].push_back(i);
}
}
reduced the time to 10 seconds, this is likely due to a combination of allowing the compiler to use SIMD optimisations and improving cache coherency by accessing memory in sequential order.
Creating one vector and copying it into the others:
for (int i = 0; i < N; i++) {
res[1].push_back(i);
}
for (int j = 2; j < M; j++) {
res[j] = res[1];
}
reduced the time to 4 seconds.
Using a single vector:
void demo() {
size_t N = 1000000;
size_t M = 3000;
vector<int> res(M*N);
size_t offset = N;
for (size_t i = 0; i < N; i++) {
res[offset++] = i;
}
for (size_t j = 2; j < M; j++) {
std::copy(res.begin() + N, res.begin() + N * 2, res.begin() + offset);
offset += N;
}
}
also took 4 seconds, there probably isn't much improvement because you have 3,000 4 MB vectors, there would likely be more difference if N was smaller or M was larger.
Related
So I have a vector of vectors type double. I basically need to be able to set 360 numbers to cosY, and then put those 360 numbers into cosineY[0], then get another 360 numbers that are calculated with a different a now, and put them into cosineY[1].Technically my vector is going to be cosineYa I then need to be able to take out just cosY for a that I specify...
My code is saying this:
for (int a = 0; a < 8; a++)
{
for int n=0; n <= 360; n++
{
cosY[n] = cos(a*vectorOfY[n]);
}
cosineY.push_back(cosY);
}
which I hope is the correct way of actually setting it.
But then I need to take cosY for a that I specify, and calculate another another 360 vector, which will be stored in another vector again as a vector of vectors.
Right now I've got:
for (int a = 0; a < 8; a++
{
for (int n = 0; n <= 360; n++)
{
cosProductPt[n] = (VectorOfY[n]*cosY[n]);
}
CosProductY.push_back(cosProductPt);
}
The VectorOfY is besically the amplitude of an input wave. What I am doing is trying to create a cosine wave with different frequencies (a). I am then calculation the product of the input and cosine wave at each frequency. I need to be able to access these 360 points for each frequency later on in the program, and right now also I need to calculate the addition of all elements in cosProductPt, for every frequency (stored in cosProductY), and store it in a vector dotProductCos[a].
I've been trying to work it out but I don't know how to access all the elements in a vector of vectors to add them. I've been trying to do this for the whole day without any results. Right now I know so little that I don't even know how I would display or access a vector inside a vector, but I need to use that access point for the addition.
Thank you for your help.
for (int a = 0; a < 8; a++)
{
for int n=0; n < 360; n++) // note traded in <= for <. I think you had an off by one
// error here.
{
cosY[n] = cos(a*vectorOfY[n]);
}
cosineY.push_back(cosY);
}
Is sound so long as cosY has been pre-allocated to contain at least 360 elements. You could
std::vector<std::vector<double>> cosineY;
std::vector<double> cosY(360); // strongly consider replacing the 360 with a well-named
// constant
for (int a = 0; a < 8; a++) // same with that 8
{
for int n=0; n < 360; n++)
{
cosY[n] = cos(a*vectorOfY[n]);
}
cosineY.push_back(cosY);
}
for example, but this hangs on to cosY longer than you need to and could cause problems later, so I'd probably scope cosY by throwing the above code into a function.
std::vector<std::vector<double>> buildStageOne(std::vector<double> &vectorOfY)
{
std::vector<std::vector<double>> cosineY;
std::vector<double> cosY(NumDegrees);
for (int a = 0; a < NumVectors; a++)
{
for int n=0; n < NumDegrees; n++)
{
cosY[n] = cos(a*vectorOfY[n]); // take radians into account if needed.
}
cosineY.push_back(cosY);
}
return cosineY;
}
This looks horrible, returning the vector by value, but the vast majority of compilers will take advantage of Copy Elision or some other sneaky optimization to eliminate the copying.
Then I'd do almost the exact same thing for the second step.
std::vector<std::vector<double>> buildStageTwo(std::vector<double> &vectorOfY,
std::vector<std::vector<double>> &cosineY)
{
std::vector<std::vector<double>> CosProductY;
for (int a = 0; a < numVectors; a++)
{
for (int n = 0; n < NumDegrees; n++)
{
cosProductPt[n] = (VectorOfY[n]*cosineY[a][n]);
}
CosProductY.push_back(cosProductPt);
}
return CosProductY;
}
But we can make a couple optimizations
std::vector<std::vector<double>> buildStageTwo(std::vector<double> &vectorOfY,
std::vector<std::vector<double>> &cosineY)
{
std::vector<std::vector<double>> CosProductY;
for (int a = 0; a < numVectors; a++)
{
// why risk constantly looking up cosineY[a]? grab it once and cache it
std::vector<double> & cosY = cosineY[a]; // note the reference
for (int n = 0; n < numDegrees; n++)
{
cosProductPt[n] = (VectorOfY[n]*cosY[n]);
}
CosProductY.push_back(cosProductPt);
}
return CosProductY;
}
And the next is kind of an extension of the first:
std::vector<std::vector<double>> buildStageTwo(std::vector<double> &vectorOfY,
std::vector<std::vector<double>> &cosineY)
{
std::vector<std::vector<double>> CosProductY;
std::vector<double> cosProductPt(360);
for (std::vector<double> & cosY: cosineY) // range based for. Gets rid of
{
for (int n = 0; n < NumDegrees; n++)
{
cosProductPt[n] = (VectorOfY[n]*cosY[n]);
}
CosProductY.push_back(cosProductPt);
}
return CosProductY;
}
We could do the same range-based for trick for the for (int n = 0; n < NumDegrees; n++), but since we are iterating multiple arrays here it's not all that helpful.
I have a struct:
struct xyz{
int x,y,z;
};
and I initialize a struct xyz type vector:
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
{
for (int k = 0; k < N; k++)
{
v.x=i;
v.y=j;
v.z=k;
vect.push_back(v);
}
}
}
then I want to transform that vector to array because array is 2 time faster than vector to manipulate, so I do
xyz arr[vect.size()];
std::copy(vect.begin(), vect.end(), arr);
when I run this program it shows me segmentation fault which I think is because vect.size() is too large.
So I am wondering is there any way to convert that large size vector to array without that problem.
I appreciate for any help
My overly pedantic comment got too big, so instead I'll try to make this a somewhat roundabout answer. The short answer is probably just to stick with vector but make sure to use reserve; oh, and benchmark.
You didn't say what compiler or C++ version you're using, so I'll just go with my current gcc.godbolt.org default of gcc 4.9.2, C++14. I'm also assuming that you really want this as a 1-dimension array, rather than the more natural (for your example) 3.
If you know N at compile time, you could do something like this (assuming I got the array offset calculation correct):
#include <array>
...
std::array<xyz, N*N*N> xyzs;
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
for (int k = 0; k < N; k++) {
xyzs[i*N*N+j*N+k] = {i, j, k};
}
}
}
The biggest downsides, IMO:
error-prone offset calculation
depending on N, where the code is run, etc, this can blow the stack
On the compilers I tried this on, the optimizers seem to understand that we're moving through the array in contiguous order, and the generated machine code is more sensible, but it could also be written like so, if you prefer:
#include <array>
...
std::array<xyz, N*N*N> xyzs;
auto p = xyzs.data();
for (int i = 0; i < N; ++i) {
for (int j = 0; j < N; ++j) {
for (int k = 0; k < N; ++k) {
(*p++) = {i, j, k};
}
}
}
Of course, if you actually know N at compile time, and it won't blow the stack, you might consider a 3-dimensional array xyz xyzs[N][N][N]; since this might be more natural for the way these things are being ultimately being used.
As pointed out in comments, variable length arrays aren't legal C++, but they are legal in C99; if you don't know N at compile time you should be allocating off the heap.
A vector and an array will wind up being identical in terms memory layout; they differ in that vector allocates memory from the heap, and the array (as you are writing it) would be on the stack. The only recommendation I'd make is to call reserve before entering your loop:
vect.reserve(N*N*N);
This means you'll only be doing a single memory allocation up front, rather than grow-and-copy mechanism that you'll get from a default constructed vector.
Assuming xyz is as simple as you declare here, you could also do something like the second example above:
std::vector<xyz> xyzs{N*N*N};
auto p = xyzs.data();
for (int i = 0; i < N; ++i) {
for (int j = 0; j < N; ++j) {
for (int k = 0; k < N; ++k) {
(*p++) = {i, j, k};
}
}
}
You lose the safety of push_back, and it is less efficient if xyz default constructor needs to do anything (like if xyz members were changed to have default values).
Having said all that, you really should benchmark. But then, you should probably be benchmarking the code that ultimately uses this array, rather than the code to construct it; I'd have other concerns if construction was dominating usage.
I need to compute a product vector-matrix as efficiently as possible. Specifically, given a vector s and a matrix A, I need to compute s * A. I have a class Vector which wraps a std::vector and a class Matrix which also wraps a std::vector (for efficiency).
The naive approach (the one that I am using at the moment) is to have something like
Vector<T> timesMatrix(Matrix<T>& matrix)
{
Vector<unsigned int> result(matrix.columns());
// constructor that does a resize on the underlying std::vector
for(unsigned int i = 0 ; i < vector.size() ; ++i)
{
for(unsigned int j = 0 ; j < matrix.columns() ; ++j)
{
result[j] += (vector[i] * matrix.getElementAt(i, j));
// getElementAt accesses the appropriate entry
// of the underlying std::vector
}
}
return result;
}
It works fine and takes nearly 12000 microseconds. Note that the vector s has 499 elements, while A is 499 x 15500.
The next step was trying to parallelize the computation: if I have N threads then I can give each thread a part of the vector s and the "corresponding" rows of the matrix A. Each thread will compute a 499-sized Vector and the final result will be their entry-wise sum.
First of all, in the class Matrix I added a method to extract some rows from a Matrix and build a smaller one:
Matrix<T> extractSomeRows(unsigned int start, unsigned int end)
{
unsigned int rowsToExtract = end - start + 1;
std::vector<T> tmp;
tmp.reserve(rowsToExtract * numColumns);
for(unsigned int i = start * numColumns ; i < (end+1) * numColumns ; ++i)
{
tmp.push_back(matrix[i]);
}
return Matrix<T>(rowsToExtract, numColumns, tmp);
}
Then I defined a thread routine
void timesMatrixThreadRoutine
(Matrix<T>& matrix, unsigned int start, unsigned int end, Vector<T>& newRow)
{
// newRow is supposed to contain the partial result
// computed by a thread
newRow.resize(matrix.columns());
for(unsigned int i = start ; i < end + 1 ; ++i)
{
for(unsigned int j = 0 ; j < matrix.columns() ; ++j)
{
newRow[j] += vector[i] * matrix.getElementAt(i - start, j);
}
}
}
And finally I modified the code of the timesMatrix method that I showed above:
Vector<T> timesMatrix(Matrix<T>& matrix)
{
static const unsigned int NUM_THREADS = 4;
unsigned int matRows = matrix.rows();
unsigned int matColumns = matrix.columns();
unsigned int rowsEachThread = vector.size()/NUM_THREADS;
std::thread threads[NUM_THREADS];
Vector<T> tmp[NUM_THREADS];
unsigned int start, end;
// all but the last thread
for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
{
start = i*rowsEachThread;
end = (i+1)*rowsEachThread - 1;
threads[i] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
matrix.extractSomeRows(start, end), start, end, std::ref(tmp[i]));
}
// last thread
start = (NUM_THREADS-1)*rowsEachThread;
end = matRows - 1;
threads[NUM_THREADS - 1] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
matrix.extractSomeRows(start, end), start, end, std::ref(tmp[NUM_THREADS-1]));
for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
{
threads[i].join();
}
Vector<unsigned int> result(matColumns);
for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
{
result = result + tmp[i]; // the operator+ is overloaded
}
return result;
}
It still works but now it takes nearly 30000 microseconds, which is almost three times as much as before.
Am I doing something wrong? Do you think there is a better approach?
EDIT - using a "lightweight" VirtualMatrix
Following Ilya Ovodov's suggestion, I defined a class VirtualMatrix that wraps a T* matrixData, which is initialized in the constructor as
VirtualMatrix(Matrix<T>& m)
{
numRows = m.rows();
numColumns = m.columns();
matrixData = m.pointerToData();
// pointerToData() returns underlyingVector.data();
}
Then there is a method to retrieve a specific entry of the matrix:
inline T getElementAt(unsigned int row, unsigned int column)
{
return *(matrixData + row*numColumns + column);
}
Now the execution time is better (approximately 8000 microseconds) but maybe there are some improvements to be made. In particular the thread routine is now
void timesMatrixThreadRoutine
(VirtualMatrix<T>& matrix, unsigned int startRow, unsigned int endRow, Vector<T>& newRow)
{
unsigned int matColumns = matrix.columns();
newRow.resize(matColumns);
for(unsigned int i = startRow ; i < endRow + 1 ; ++i)
{
for(unsigned int j = 0 ; j < matColumns ; ++j)
{
newRow[j] += (vector[i] * matrix.getElementAt(i, j));
}
}
}
and the really slow part is the one with the nested for loops. If I remove it, the result is obviously wrong but is "computed" in less than 500 microseconds. This to say that now passing the arguments takes almost no time and the heavy part is really the computation.
According to you, is there any way to make it even faster?
Actually you make a partial copy of matrix for each thread in extractSomeRows. It takes a lot of time.
Redesign it so that "some rows" become virtual matrix pointing at data located in original matrix.
Use vectorized assembly instructions for an architecture by making it more explicit that you want to multiply in 4's, i.e. for the x86-64 SSE2+ and possibly ARM'S NEON.
C++ compilers can often unroll the loop into vectorized code if you explicitly make an operation happen in contingent elements:
Simple and fast matrix-vector multiplication in C / C++
There is also the option of using libraries specifically made for matrix multipication. For larger matrices, it may be more efficient to use special implementations based on the Fast Fourier Transform, alternate algorithms like Strassen's Algorithm, etc. In fact, your best bet would be to use a C library like this, and then wrap it in an interface that looks similar to a C++ vector.
Here is code to find determinant of matrix n x n.
#include <iostream>
using namespace std;
int determinant(int *matrix[], int size);
void ijMinor(int *matrix[], int *minorMatrix[], int size, int row, int column);
int main()
{
int size;
cout << "What is the size of the matrix for which you want to find the determinant?:\t";
cin >> size;
int **matrix;
matrix = new int*[size];
for (int i = 0 ; i < size ; i++)
matrix[i] = new int[size];
cout << "\nEnter the values of the matrix seperated by spaces:\n\n";
for(int i = 0; i < size; i++)
for(int j = 0; j < size; j++)
cin >> matrix[i][j];
cout << "\nThe determinant of the matrix is:\t" << determinant(matrix, size) << endl;
return 0;
}
int determinant(int *matrix[], int size){
if(size==1)return matrix[0][0];
else{
int result=0, sign=-1;
for(int j = 0; j < size; j++){
int **minorMatrix;
minorMatrix = new int*[size-1];
for (int k = 0 ; k < size-1 ; k++)
minorMatrix[k] = new int[size-1];
ijMinor(matrix, minorMatrix, size, 0, j);
sign*=-1;
result+=sign*matrix[0][j]*determinant(minorMatrix, size-1);
for(int i = 0; i < size-1; i++){
delete minorMatrix[i];
}
}
return result;
}
}
void ijMinor(int *matrix[], int *minorMatrix[], int size, int row, int column){
for(int i = 0; i < size; i++){
for(int j = 0; j < size; j++){
if(i < row){
if(j < column)minorMatrix[i][j] = matrix[i][j];
else if(j == column)continue;
else minorMatrix[i][j-1] = matrix[i][j];
}
else if(i == row)continue;
else{
if(j < column)minorMatrix[i-1][j] = matrix[i][j];
else if(j == column)continue;
else minorMatrix[i-1][j-1] = matrix[i][j];
}
}
}
}
After adding OpenMP pragmas, I've changed the determinant function and now it looks like this:
int determinant(int *matrix[], int size){
if(size==1)return matrix[0][0];
else{
int result=0, sign=-1;
#pragma omp parallel for default(none) shared(size,matrix,sign) private(j,k) reduction(+ : result)
for(int j = 0; j < size; j++){
int **minorMatrix;
minorMatrix = new int*[size-1];
for (int k = 0 ; k < size-1 ; k++)
minorMatrix[k] = new int[size-1];
ijMinor(matrix, minorMatrix, size, 0, j);
sign*=-1;
result+=sign*matrix[0][j]*determinant(minorMatrix, size-1);
for(int i = 0; i < size-1; i++){
delete minorMatrix[i];
}
}
return result;
delete [] matrix;
}
}
My problem is that the result is every time different. Sometimes it gives correct value, but most often it is wrong. I think it's because of the sign variable. I am following the formula:
As you can see, in every iteration of my for loop there should be different sign but when I use OpenMP, something is wrong. How can I make this program to run with OpenMP?
Finally, my second issue is that using OpenMP does not make the program run quicker than without OpenMP. I also tried to make a 100,000 x 100,000 matrix, but my program reports an error about allocating memory. How can I run this program with very large matrices?
Your issues as I see it are as follows:
1) As noted by Hristo, your threads are stomping over each other's data with respect to the sign variable. It should be private to each thread so that they have full read/write access to it without having to worry about race conditions. Then, you simply need an algorithm to compute whether sign is plus or minus 1 depending on the iteration j independently from the other iterations. With a little thinking, you'll see that Hristo's suggestion is correct: sign = (j % 2) ? -1 : 1; should do the trick.
2) Your determinant() function is recursive. As is, that means that every iteration of the loop, after forming your minors, you then call your function again on that minor. Therefore, a single thread is going to be performing its iteration, enter the recursive function, and then try to split itself up into nthreads more threads. You can see now how you are oversubscribing your system by launching many more threads than you physically have cores. Two easy solutions:
Call your original serial function from within the omp parallel code. This is the fastest way to do it because this would avoid any OpenMP-startup overhead.
Turn off nested parallelism by calling omp_set_nested(0); before your first call to determinant().
Add an if clause to your parallel for directive: if(omp_in_parallel())
3) Your memory issues are because every iteration of your recursion, you are allocating more memory. If you fix problem #2, then you should be using comparable amounts of memory in the serial case as the parallel case. That being said, it would be much better to allocate all the memory you want before entering your algorithm. Allocating large chunks of memory (and then freeing it!), especially in parallel, is a terrible bottleneck in your code.
Compute the amount of memory you would need (on paper) before entering the first loop and allocate it all at once. I would also strongly suggest you consider allocating your memory contiguously (aka in 1D) to take better advantage of caching as well. Remember that each thread should have its own separate area to work with. Then, change your function to:
int determinant(int *matrix, int *startOfMyWorkspace, int size).
Instead of allocating a new (size-1)x(size-1) matrix inside of your loop, you would simply utilize the next (size-1)*(size-1) integers of your workspace, update what startOfMyWorkspace would be for the next recursive call, and continue along.
in my previous question
Shared vectors in OpenMP
it was stated that one can let diferent threads read and write on a shared vector as long as
the different threads access different elements of the vector.
What if different threads have to read all the (so sometimes the same) elements of a vector, like in the following case ?
#include <vector>
int main(){
vector<double> numbers;
vector<double> results(10);
double x;
//write 25 values in vector numbers
for (int i =0; i<25; i++){
numbers.push_back(cos(i));
}
#pragma omp parallel for default(none) \
shared(numbers, results) \
private(x)
for (int j = 0; j < 10; j++){
for(int k = 0; k < 25; k++){
x += 2 * numbers[j] * numbers[k] + 5 * numbers[j * k / 25];
}
results[j] = x;
}
return 0;
}
Will this parallelization be slow because only one thread at a time can read any element of the vector or is this not the case? Could I resolve the problem with the clause firstprivate(numbers)?
Would it make sense to create an array of vectors so every thread gets his own vector ?
For instance:
vector<double> numbersx[**-number of threads-**];
Reading elements of the same vector from multiple threads is not a problem. There is no synchronization in your code, so they will be accessed concurrently.
With the size of vectors that you are working with, you will not have any cache problems either, although for bigger vectors you may get some slow-downs due to the cache access pattern. In that case, separate copies of the numbers data would improve performance.
better approach:
#include <vector>
int main(){
vector<double> numbers;
vector<double> results(10);
//write 25 values in vector numbers
for (int i =0; i<25; i++){
numbers.push_back(cos(i));
}
#pragma omp parallel for
for (int j = 0; j < 10; j++){
double x = 0; // make x local var
for(int k = 0; k < 25; k++){
x += 2 * numbers[j] * numbers[k] + 5 * numbers[j * k / 25];
}
results[j] = x; // no race here
}
return 0;
}
it will be slow kinda due to fact that there isn't much work to share