This is my first time using multithreading to speed up a heavy calculation.
Background: The idea is to calculate a Kernel Covariance matrix, by reading a list of 3D points x_test and calculating the corresponding matrix, which has dimensions x_test.size() x x_test.size().
I already sped up the calculation by computing only the lower triangular matrix. Since all the calculations are independent of each other, I tried to speed up the process (x_test.size() = 27000 in my case) by splitting the matrix entries row-wise and assigning a range of rows to each thread.
On a single core the calculation took about 280 seconds each time; on 4 cores it took 270-290 seconds.
main.cpp
int main(int argc, char *argv[]) {
double sigma0sq = 1;
double lengthScale [] = {0.7633, 0.6937, 3.3307e+07};
const std::vector<std::vector<double>> x_test = parse2DCsvFile(inputPath);
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size()*x_test.size()/2;
const int numThreads = 4;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for(std::size_t i=1; i<x_test.size()+1; ++i){
int prod = i*(i+1)/2 - j*(j+1)/2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if(indices.size() == numThreads-1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
/* Spreading calculations to multiple threads */
std::vector<std::thread> threads;
for(std::size_t i = 1; i < indices.size(); ++i){
threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices.at(i-1), indices.at(i)));
}
for(auto & th: threads){
th.join();
}
return 0;
}
As you can see, each thread performs the following calculations on the data assigned to it:
void calculateKMatrixCpp(const std::vector<std::vector<double>> xtest, double lengthScale[], double sigma0sq, int threadCounter, int start, int stop){
char buffer[8192];
std::ofstream out("lower_half_matrix_" + std::to_string(threadCounter) +".csv");
out.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
for(int i = start; i < stop; ++i){
for(int j = 0; j < i+1; ++j){
double kij = seKernel(xtest.at(i), xtest.at(j), lengthScale, sigma0sq);
if (j!=0)
out << ',';
out << kij;
}
if(i!=xtest.size()-1 )
out << '\n';
}
out.close();
}
and
double seKernel(const std::vector<double> x1,const std::vector<double> x2, double lengthScale[], double sigma0sq) {
double sum(0);
for(std::size_t i=0; i<x1.size();i++){
sum += pow((x1.at(i)-x2.at(i))/lengthScale[i],2);
}
return sigma0sq*exp(-0.5*sum);
}
Aspects I considered
locking by simultaneous access to data vector -> I don't pass a reference to the threads, but a copy of the data. I know this is not optimal in terms of RAM usage, but as far as I know this should prevent simultaneous data access since every thread has its own copy
Output -> every thread writes its part of the lower triangular matrix to its own file. My task manager doesn't indicate a full SSD utilization in the slightest
Compiler and machine
Windows 11
GNU GCC Compiler
Code::Blocks (although I don't think that should be of importance)
There are many details that can be improved in your code, but I think the two biggest issues are:
using vectors of vectors, which leads to fragmented data;
writing each piece of data to file as soon as its value is computed.
The first point is easy to fix: use something like std::vector<std::array<double, 3>>. In the code below I use an alias to make it more readable:
using Point3D = std::array<double, 3>;
std::vector<Point3D> x_test;
The second point is slightly harder to address. I assume you wanted to write to the disk inside each thread because you couldn't manage to write to a shared buffer that you could then write to a file.
Here is a way to do exactly that:
void calculateKMatrixCpp(
std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq,
int threadCounter, int start, int stop, std::vector<double>& kMatrix
) {
// ...
double& kij = kMatrix[i * xtest.size() + j];
kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
// ...
}
// ...
threads.push_back(std::thread(
calculateKMatrixCpp, x_test, lengthScale, sigma0sq,
i, indices[i-1], indices[i], std::ref(kMatrix)
));
Here, kMatrix is the shared buffer and represents the whole matrix you are trying to compute. You need to pass it to the thread via std::ref. Each thread will write to a different location in that buffer, so there is no need for any mutex or other synchronization.
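The no-mutex claim rests on each thread owning a disjoint range of rows. Here is a minimal sketch of that pattern with a toy fill function; fillRows, fillLowerTriangle and the even row split are illustrative and not part of the code above:

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical worker: fills rows [start, stop) of the lower triangle of an
// n x n row-major matrix. Rows owned by different threads never overlap,
// so no mutex is needed.
void fillRows(std::vector<double>& m, std::size_t n, std::size_t start, std::size_t stop) {
    for (std::size_t i = start; i < stop; ++i)
        for (std::size_t j = 0; j <= i; ++j)
            m[i * n + j] = static_cast<double>(i + j);
}

// Splits the rows into contiguous chunks and runs one thread per chunk.
std::vector<double> fillLowerTriangle(std::size_t n, std::size_t numThreads) {
    std::vector<double> m(n * n, 0.0);   // allocated once, before any thread starts
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        const std::size_t start = n * t / numThreads;
        const std::size_t stop = n * (t + 1) / numThreads;
        threads.emplace_back(fillRows, std::ref(m), n, start, stop);
    }
    for (auto& th : threads) th.join();
    return m;
}
```

The buffer must not be resized while the threads run, since that would invalidate the storage they are writing to.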
Once you make these changes and try to write kMatrix to the disk, you will realize that this is the part that takes the most time, by far.
Below is the full code I tried on my machine, and the computation time was about 2 seconds whereas the writing-to-file part took 300 seconds! No amount of multithreading can speed that up.
If you truly want to write all that data to the disk, you may have some luck with file mapping. Computing the exact size needed should be easy enough if all values have the same number of digits, and it looks like you could write the values with multithreading. I have never done anything like that, so I can't really say much more about it, but it looks to me like the fastest way to write multiple gigabytes of memory to the disk.
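If the file does not have to be human-readable, a different technique from the file mapping mentioned above is to dump the doubles as raw bytes, which skips number formatting entirely. This is only a sketch under that assumption; writeBinary and readBinary are made-up helper names:

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Writes the values as raw bytes. The file is not portable across machines
// with different endianness or floating-point formats.
bool writeBinary(const std::string& path, const std::vector<double>& values) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(double)));
    return static_cast<bool>(out);
}

// Reads back `count` doubles written by writeBinary.
std::vector<double> readBinary(const std::string& path, std::size_t count) {
    std::vector<double> values(count);
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(values.data()),
            static_cast<std::streamsize>(count * sizeof(double)));
    if (!in) values.clear();   // signal failure with an empty vector
    return values;
}
```

Whether this is acceptable depends on what consumes the file; a CSV reader obviously cannot read it.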
#include <vector>
#include <thread>
#include <iostream>
#include <string>
#include <cmath>
#include <array>
#include <random>
#include <fstream>
#include <chrono>
using Point3D = std::array<double, 3>;
auto generateSampleData() -> std::vector<Point3D> {
static std::minstd_rand g(std::random_device{}());
std::uniform_real_distribution<> d(-1.0, 1.0);
std::vector<Point3D> data;
data.reserve(27000);
for (auto i = 0; i < 27000; ++i) {
data.push_back({ d(g), d(g), d(g) });
}
return data;
}
double seKernel(Point3D const& x1, Point3D const& x2, Point3D const& lengthScale, double sigma0sq) {
double sum = 0.0;
for (auto i = 0u; i < 3u; ++i) {
double distance = (x1[i] - x2[i]) / lengthScale[i];
sum += distance*distance;
}
return sigma0sq * std::exp(-0.5*sum);
}
void calculateKMatrixCpp(std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq, int threadCounter, int start, int stop, std::vector<double>& kMatrix) {
std::cout << "start of thread " << threadCounter << "\n" << std::flush;
for(int i = start; i < stop; ++i) {
for(int j = 0; j < i+1; ++j) {
double& kij = kMatrix[i * xtest.size() + j];
kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
}
}
std::cout << "end of thread " << threadCounter << "\n" << std::flush;
}
int main() {
double sigma0sq = 1;
Point3D lengthScale = {0.7633, 0.6937, 3.3307e+07};
const std::vector<Point3D> x_test = generateSampleData();
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size()*x_test.size()/2;
const int numThreads = 4;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for(std::size_t i = 1; i < x_test.size()+1; ++i){
int prod = i*(i+1)/2 - j*(j+1)/2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if(indices.size() == numThreads-1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
auto start = std::chrono::system_clock::now();
std::vector<double> kMatrix(x_test.size() * x_test.size(), 0.0);
std::vector<std::thread> threads;
for (std::size_t i = 1; i < indices.size(); ++i) {
threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices[i - 1], indices[i], std::ref(kMatrix)));
}
for (auto& t : threads) {
t.join();
}
auto end = std::chrono::system_clock::now();
auto elapsed_seconds = std::chrono::duration<double>(end - start).count();
std::cout << "computation time: " << elapsed_seconds << "s" << std::endl;
start = std::chrono::system_clock::now();
constexpr int buffer_size = 131072;
char buffer[buffer_size];
std::ofstream out("matrix.csv");
out.rdbuf()->pubsetbuf(buffer, buffer_size);
for (int i = 0; i < x_test.size(); ++i) {
for (int j = 0; j < i + 1; ++j) {
if (j != 0) {
out << ',';
}
out << kMatrix[i * x_test.size() + j];
}
if (i != x_test.size() - 1) {
out << '\n';
}
}
end = std::chrono::system_clock::now();
elapsed_seconds = std::chrono::duration<double>(end - start).count();
std::cout << "writing time: " << elapsed_seconds << "s" << std::endl;
}
OK, I've written an implementation with optimized formatting.
Using @Nelfeal's code, the run took around 250 seconds on my system, with the write taking by far the most time. Or rather, the std::ofstream formatting took most of the time.
I've written a C++20 version using std::format_to/std::format. It is a multi-threaded version that takes around 25-40 seconds to complete all the computations, formatting, and writing. Run in a single thread, it takes around 70 seconds on my system. The same performance should be achievable with the fmt library in C++11/14/17.
Here is the code:
import <vector>;
import <thread>;
import <iostream>;
import <string>;
import <cmath>;
import <array>;
import <random>;
import <fstream>;
import <chrono>;
import <format>;
import <filesystem>;
using Point3D = std::array<double, 3>;
auto generateSampleData(Point3D scale) -> std::vector<Point3D>
{
static std::minstd_rand g(std::random_device{}());
std::uniform_real_distribution<> d(-1.0, 1.0);
std::vector<Point3D> data;
data.reserve(27000);
for (auto i = 0; i < 27000; ++i)
{
data.push_back({ d(g)* scale[0], d(g)* scale[1], d(g)* scale[2] });
}
return data;
}
double seKernel(Point3D const& x1, Point3D const& x2, Point3D const& lengthScale, double sigma0sq) {
double sum = 0.0;
for (auto i = 0u; i < 3u; ++i) {
double distance = (x1[i] - x2[i]) / lengthScale[i];
sum += distance * distance;
}
return sigma0sq * std::exp(-0.5 * sum);
}
void calculateKMatrixCpp(std::vector<Point3D> const& xtest, Point3D lengthScale, double sigma0sq, int threadCounter, int start, int stop, std::filesystem::path localPath)
{
using namespace std::string_view_literals;
std::vector<char> buffer;
buffer.reserve(15'000);
std::ofstream out(localPath);
std::cout << std::format("starting thread {}: from {} to {}\n"sv, threadCounter, start, stop);
for (int i = start; i < stop; ++i)
{
for (int j = 0; j < i; ++j)
{
double kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
std::format_to(std::back_inserter(buffer), "{:.6g}, "sv, kij);
}
double kii = seKernel(xtest[i], xtest[i], lengthScale, sigma0sq);
std::format_to(std::back_inserter(buffer), "{:.6g}\n"sv, kii);
out.write(buffer.data(), buffer.size());
buffer.clear();
}
}
int main() {
double sigma0sq = 1;
Point3D lengthScale = { 0.7633, 0.6937, 3.3307e+07 };
const std::vector<Point3D> x_test = generateSampleData(lengthScale);
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size() * (x_test.size()+1) / 2;
const int numThreads = 3;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for (std::size_t i = 1; i < x_test.size() + 1; ++i) {
int prod = i * (i + 1) / 2 - j * (j + 1) / 2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if (indices.size() == numThreads - 1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
auto start = std::chrono::system_clock::now();
std::vector<std::thread> threads;
using namespace std::string_view_literals;
for (std::size_t i = 1; i < indices.size(); ++i)
{
threads.push_back(std::thread(calculateKMatrixCpp, std::ref(x_test), lengthScale, sigma0sq, i, indices[i - 1], indices[i], std::format("./matrix_{}.csv"sv, i-1)));
}
for (auto& t : threads)
{
t.join();
}
auto end = std::chrono::system_clock::now();
auto elapsed_seconds = std::chrono::duration<double>(end - start);
std::cout << std::format("total elapsed time: {}"sv, elapsed_seconds);
return 0;
}
Note: I used 6 digits of precision here as it is the default for std::ofstream. More digits means more writing time to disk and lower performance.
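For compilers without <format>, the same idea of formatting a whole row into a buffer and writing it once can be approximated with std::snprintf instead of the fmt library mentioned above. This is a rough sketch; formatRow is a made-up helper, and I make no claim that it matches the std::format timings:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Formats one matrix row as "v0, v1, ..., vn\n" with 6 significant digits,
// so the caller can hand the whole row to std::ofstream::write in one call.
std::string formatRow(const std::vector<double>& row) {
    std::string line;
    line.reserve(row.size() * 12);
    char cell[32];   // large enough for any double printed with %.6g
    for (std::size_t j = 0; j < row.size(); ++j) {
        const int len = std::snprintf(cell, sizeof(cell), "%.6g", row[j]);
        if (len > 0) line.append(cell, static_cast<std::size_t>(len));
        line += (j + 1 < row.size()) ? ", " : "\n";
    }
    return line;
}
```

Each thread would call it once per row and then do out.write(line.data(), line.size()).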
I'm looking to sort a large 3D array along the z-axis.
Example array is X x Y x Z (1000x1000x5)
I'd like to sort along the z-axis, so I'd perform 1000x1000 sorts of 5 elements each.
Edit/update: I made an attempt using Thrust, shown below. It works, and I could store the output back, but it is very slow since I'm sorting 5 elements at a time per (x,y) location:
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <thrust/gather.h>
#include <thrust/iterator/counting_iterator.h>
int main(){
int x = 1000, y = 1000, z = 5;
float*** unsorted_cube = new float** [x];
for (int i = 0; i < x; i++)
{
// Allocate memory blocks for
// rows of each 2D array
unsorted_cube[i] = new float* [y];
for (int j = 0; j < y; j++)
{
// Allocate memory blocks for
// columns of each 2D array
unsorted_cube[i][j] = new float[z];
}
}
for (int i = 0; i < x; i++)
{
for (int j = 0; j < y; j++)
{
unsorted_cube[i][j][0] = 4.0f;
unsorted_cube[i][j][1] = 3.0f;
unsorted_cube[i][j][2] = 1.0f;
unsorted_cube[i][j][3] = 5.0f;
unsorted_cube[i][j][4] = 2.0f;
}
}
for (int i = 0; i < 5; i++)
{
printf("unsorted_cube first 5 elements to sort at (0,0): %f\n", unsorted_cube[0][0][i]);
}
float temp_input[5];
float temp_output[5];
float* raw_ptr;
float raw_ptr_out[5];
cudaMalloc((void**)&raw_ptr, z * sizeof(float));
for (int i = 0; i < x; i++)
{
for (int j = 0; j < y; j++)
{
temp_input[0] = unsorted_cube[i][j][0];
temp_input[1] = unsorted_cube[i][j][1];
temp_input[2] = unsorted_cube[i][j][2];
temp_input[3] = unsorted_cube[i][j][3];
temp_input[4] = unsorted_cube[i][j][4];
cudaMemcpy(raw_ptr, temp_input, 5 * sizeof(float), cudaMemcpyHostToDevice);
thrust::device_ptr<float> dev_ptr = thrust::device_pointer_cast(raw_ptr);
thrust::sort(dev_ptr, dev_ptr + 5);
thrust::host_vector<float> host_vec(5);
thrust::copy(dev_ptr, dev_ptr + 5, raw_ptr_out);
if (i == 0 && j == 0)
{
for (int i = 0; i < 5; i++)
{
temp_output[i] = raw_ptr_out[i];
}
printf("sorted_cube[0,0,0] : %f\n", temp_output[0]);
printf("sorted_cube[0,0,1] : %f\n", temp_output[1]);
printf("sorted_cube[0,0,2] : %f\n", temp_output[2]);
printf("sorted_cube[0,0,3] : %f\n", temp_output[3]);
printf("sorted_cube[0,0,4] : %f\n", temp_output[4]);
}
}
}
}
Assuming that the data is in a format where the values in each xy-plane are consecutive in memory: data[((z * y_length) + y) * x_length + x] (which is also best for coalescing memory accesses on the GPU)
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sort.h>
#include <thrust/tuple.h>
void sort_in_z_dir(thrust::device_vector<float> &data,
int x_length, int y_length) { // z_length == 5
auto z_stride = x_length * y_length;
thrust::for_each(
thrust::make_zip_iterator(thrust::make_tuple(
data.begin(),
data.begin() + z_stride,
data.begin() + 2 * z_stride,
data.begin() + 3 * z_stride,
data.begin() + 4 * z_stride)),
thrust::make_zip_iterator(thrust::make_tuple(
data.begin() + z_stride,
data.begin() + 2 * z_stride,
data.begin() + 3 * z_stride,
data.begin() + 4 * z_stride,
data.begin() + 5 * z_stride)),
[] __host__ __device__
(thrust::tuple<float, float, float, float, float> &values) {
float local_data[5] = {thrust::get<0>(values),
thrust::get<1>(values),
thrust::get<2>(values),
thrust::get<3>(values),
thrust::get<4>(values)};
thrust::sort(thrust::seq, local_data, local_data + 5);
thrust::get<0>(values) = local_data[0];
thrust::get<1>(values) = local_data[1];
thrust::get<2>(values) = local_data[2];
thrust::get<3>(values) = local_data[3];
thrust::get<4>(values) = local_data[4];
});
}
This solution is certainly very ugly in terms of hardcoding z_length. One can use some C++ template-"magic" to make z_length into a template parameter, but this seemed to be overkill for this answer about Thrust.
See Convert std::tuple to std::array C++11 and How to convert std::array to std::tuple? for examples on interfacing between arrays and tuples.
The good thing about this solution is that, apart from the sorting algorithm itself, it should be pretty much optimal performance-wise. I don't know if thrust::sort is optimized for such small input arrays, but you can replace it with any self-written sorting algorithm, as I proposed in the comments.
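As one example of a self-written replacement for such tiny arrays, an insertion sort over a fixed-size local array is easy to get right. The sketch below is plain C++ so that it can be tested on the host; using it inside the device lambda would additionally need __host__ __device__ qualifiers, and I have not measured whether it beats thrust::sort. The name insertion_sort is illustrative:

```cpp
#include <cstddef>

// Sorts a small fixed-size array in ascending order by insertion.
// N is a compile-time constant, so the compiler is free to unroll the loops.
template <std::size_t N>
void insertion_sort(float (&a)[N]) {
    for (std::size_t i = 1; i < N; ++i) {
        const float key = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];   // shift larger elements one slot to the right
            --j;
        }
        a[j] = key;
    }
}
```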
If you want to be able to use different z_length without all this hassle, you might prefer this solution, which sorts in global memory, which is far from optimal, and feels a bit hacky because it uses Thrust pretty much only to launch a kernel. Here you want to have the data ordered the other way around: data[((x * y_length) + y) * z_length + z]
#include <thrust/iterator/counting_iterator.h>
#include <thrust/sort.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
void sort_in_z_dir_alternative(thrust::device_vector<float> &data,
int x_length, int y_length, int z_length) {
int n_threads = x_length * y_length;
thrust::for_each(
thrust::make_counting_iterator(0),
thrust::make_counting_iterator(n_threads),
[ddata = thrust::raw_pointer_cast(data.data()), z_length] __host__ __device__ (int idx) {
thrust::sort(thrust::seq,
ddata + z_length * idx,
ddata + z_length * (idx + 1));
});
}
If you are ok with z_length being a template parameter, this might be a solution that combines the best from both worlds (data format like in the first example):
#include <thrust/iterator/counting_iterator.h>
#include <thrust/sort.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
template <int z_length>
void sort_in_z_dir_middle_ground(thrust::device_vector<float> &data,
int x_length, int y_length) {
int n_threads = x_length * y_length; // == z_stride
thrust::for_each(
thrust::make_counting_iterator(0),
thrust::make_counting_iterator(n_threads),
[ddata = thrust::raw_pointer_cast(data.data()),
n_threads] __host__ __device__ (int idx) {
float local_data[z_length];
#pragma unroll
for (int i = 0; i < z_length; ++i) {
local_data[i] = ddata[idx + i * n_threads];
}
thrust::sort(thrust::seq,
local_data,
local_data + z_length);
#pragma unroll
for (int i = 0; i < z_length; ++i) {
ddata[idx + i * n_threads] = local_data[i];
}
});
}
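All three variants expect one flat buffer rather than the float*** cube from the question. Here is a host-side sketch of packing a nested cube into the z-major layout data[((z * y_length) + y) * x_length + x] used by the first and third variants. The name pack_z_major is illustrative, and std::vector stands in for the nested new[] allocations:

```cpp
#include <cstddef>
#include <vector>

using Cube = std::vector<std::vector<std::vector<float>>>;   // indexed [x][y][z]

// Copies cube[x][y][z] into a flat buffer where each xy-plane is contiguous.
std::vector<float> pack_z_major(const Cube& cube) {
    const std::size_t x_length = cube.size();
    const std::size_t y_length = x_length ? cube[0].size() : 0;
    const std::size_t z_length = y_length ? cube[0][0].size() : 0;
    std::vector<float> flat(x_length * y_length * z_length);
    for (std::size_t x = 0; x < x_length; ++x)
        for (std::size_t y = 0; y < y_length; ++y)
            for (std::size_t z = 0; z < z_length; ++z)
                flat[(z * y_length + y) * x_length + x] = cube[x][y][z];
    return flat;
}
```

The result can then be copied to the GPU by constructing a thrust::device_vector<float> from flat.begin() and flat.end().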
I'm trying to learn about matrix multiplication and came across this code comparing Strassen multiplication with standard matrix multiplication, so I tried to implement it. However, the code uses so much memory that the program gets killed once the matrix is big enough. Also, because it uses so much memory, it takes longer to run.
I'm not comfortable changing the code much, since I don't fully understand complex memory management, and I would really like to learn about this topic.
The code has a built-in cut parameter, and I found that setting it to 320 makes it run faster and seems to improve memory usage.
EDIT: I've implemented a copy constructor, a destructor, and a function to track memory usage. That fixed the memory leaks, but the big jump in Strassen run time between dimension 1900 and 2100 is still there.
matrix.h
#ifndef MATRIX_H
#define MATRIX_H
#include <vector>
using namespace std;
class matrix
{
public:
matrix(int dim, bool random, bool strassen);
matrix(const matrix& old_m);
inline int dim() {
return dim_;
}
inline int& operator()(unsigned row, unsigned col) {
return data_[dim_ * row + col];
}
inline int operator()(unsigned row, unsigned col) const {
return data_[dim_ * row + col];
}
void print();
matrix operator+(matrix b);
matrix operator-(matrix b);
~matrix();
private:
int dim_;
int* data_;
};
#endif
Matrix.cpp
#include <iostream>
#include <vector>
#include <stdlib.h>
#include <time.h>
#include "SAMmatrix.h"
using namespace std;
matrix::matrix(int dim, bool random, bool strassen) : dim_(dim) {
if (strassen) {
int dim2 = 2;
while (dim2 < dim)
dim2 *= 2;
dim_ = dim2;
}
data_ = new int[dim_ * dim_](); // value-initialize to zero so that += in mult_std starts from 0
if (!random) return;
for (int i = 0; i < dim_ * dim_; i++)
data_[i] = rand() % 10;
}
matrix::matrix(const matrix& old_m){
dim_ = old_m.dim_;
data_ = new int[dim_ * dim_];
for (int i = 0; i < dim_ * dim_; i++)
data_[i] = old_m.data_[i];
}
void matrix::print() {
for (int i = 0; i < dim_; i++) {
for (int j = 0; j < dim_; j++)
cout << (*this)(i, j) << " ";
cout << "\n";
}
cout << "\n";
}
matrix matrix::operator+(matrix b) {
matrix c(dim_, false, false);
for (int i = 0; i < dim_; i++)
for (int j = 0; j < dim_; j++)
c(i, j) = (*this)(i, j) + b(i, j);
return c;
}
matrix matrix::operator-(matrix b) {
matrix c(dim_, false, false);
for (int i = 0; i < dim_; i++)
for (int j = 0; j < dim_; j++)
c(i, j) = (*this)(i, j) - b(i, j);
return c;
}
matrix::~matrix()
{
delete [] data_;
}
Matrix main
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include "SAMmatrix.h"
#include "stdlib.h"
#include "stdio.h"
#include "string.h"
typedef pair<matrix, long> result;
int cut = 64;
matrix mult_std(matrix a, matrix b)
{
matrix c(a.dim(), false, false);
for (int i = 0; i < a.dim(); i++)
for (int k = 0; k < a.dim(); k++)
for (int j = 0; j < a.dim(); j++)
c(i, j) += a(i, k) * b(k, j);
return c;
}
matrix get_part(int pi, int pj, matrix m)
{
matrix p(m.dim() / 2, false, true);
pi = pi * p.dim();
pj = pj * p.dim();
for (int i = 0; i < p.dim(); i++)
for (int j = 0; j < p.dim(); j++)
p(i, j) = m(i + pi, j + pj);
return p;
}
void set_part(int pi, int pj, matrix* m, matrix p)
{
pi = pi * p.dim();
pj = pj * p.dim();
for (int i = 0; i < p.dim(); i++)
for (int j = 0; j < p.dim(); j++)
(*m)(i + pi, j + pj) = p(i, j);
}
matrix mult_strassen(matrix a, matrix b)
{
if (a.dim() <= cut)
return mult_std(a, b);
matrix a11 = get_part(0, 0, a);
matrix a12 = get_part(0, 1, a);
matrix a21 = get_part(1, 0, a);
matrix a22 = get_part(1, 1, a);
matrix b11 = get_part(0, 0, b);
matrix b12 = get_part(0, 1, b);
matrix b21 = get_part(1, 0, b);
matrix b22 = get_part(1, 1, b);
matrix m1 = mult_strassen(a11 + a22, b11 + b22);
matrix m2 = mult_strassen(a21 + a22, b11);
matrix m3 = mult_strassen(a11, b12 - b22);
matrix m4 = mult_strassen(a22, b21 - b11);
matrix m5 = mult_strassen(a11 + a12, b22);
matrix m6 = mult_strassen(a21 - a11, b11 + b12);
matrix m7 = mult_strassen(a12 - a22, b21 + b22);
matrix c(a.dim(), false, true);
set_part(0, 0, &c, m1 + m4 - m5 + m7);
set_part(0, 1, &c, m3 + m5);
set_part(1, 0, &c, m2 + m4);
set_part(1, 1, &c, m1 - m2 + m3 + m6);
return c;
}
pair<matrix, long> run(matrix(*f)(matrix, matrix), matrix a, matrix b)
{
struct timeval start, end;
gettimeofday(&start, NULL);
matrix c = f(a, b);
gettimeofday(&end, NULL);
long e = (end.tv_sec * 1000 + end.tv_usec / 1000);
long s = (start.tv_sec * 1000 + start.tv_usec / 1000);
return pair<matrix, long>(c, e - s);
}
int parseLine(char* line){ /* overflow*/
// This assumes that a digit will be found and the line ends in " Kb".
int i = strlen(line);
const char* p = line;
while (*p <'0' || *p > '9') p++;
line[i-3] = '\0';
i = atoi(p);
return i;
}
int getValue(){ //Note: this value is in KB!
FILE* file = fopen("/proc/self/status", "r");
int result = -1;
char line[128];
while (fgets(line, 128, file) != NULL){
if (strncmp(line, "VmSize:", 7) == 0){
result = parseLine(line);
break;
}
}
fclose(file);
return result;
}
int main()
{
/* test cut of for strassen
/*
for (cut = 2; cut <= 512; cut++) {
matrix a(512, true, true);
matrix b(512, true, true);
result r = run(mult_strassen, a, b);
cout << cut << " " << r.second << "\n";
}
*/
/* performance test: standard and strassen */
/*1024 going up by 64*/
for (int dim = 1500; dim <= 2300; dim += 200)
{
double space = getValue() / 1024.0; // VmSize is reported in kB
cout << "Space before: " << space << "Mb" << "\n";
matrix a(dim, true, false);
matrix b(dim, true, false);
result std = run(mult_std, a, b);
matrix c(dim, true, true);
matrix d(dim, true, true);
result strassen = run(mult_strassen, c, d);
cout << "Dim " << " Std " << " Strassen" << endl;
cout << dim << " " << std.second << "ms " << strassen.second << "ms " << "\n";
double spaceA = getValue() / 1024.0; // VmSize is reported in kB
cout << "Space: " << spaceA << "Mb" << "\n";
cout << " " << endl;
}
}
I set it to go from 1500 to 2300 in steps of 200, and the program is "killed" before finishing:
1500 2406 4250
1700 3463 4252
1900 4819 4247
2100 6487 30023
Killed
Also, the run time shouldn't jump like that when the dimension goes from 1900 to 2100.
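One thing the timings are consistent with: when strassen is true, the matrix constructor rounds dim up to the next power of two, so 1500, 1700 and 1900 all become 2048 while 2100 becomes 4096. A small sketch that isolates this rounding rule (padded_dim and padded_bytes are illustrative helpers that mirror the loop in the constructor):

```cpp
#include <cstddef>

// Mirrors the loop in matrix::matrix: smallest power of two >= dim (minimum 2).
int padded_dim(int dim) {
    int dim2 = 2;
    while (dim2 < dim)
        dim2 *= 2;
    return dim2;
}

// Bytes needed for the data of one padded int matrix.
std::size_t padded_bytes(int dim) {
    const std::size_t d = static_cast<std::size_t>(padded_dim(dim));
    return d * d * sizeof(int);
}
```

Under that rule a 2100 x 2100 input is multiplied as 4096 x 4096, which needs four times the storage per matrix of the 2048 case and adds one more level of recursion, roughly seven times the multiplications. In addition, the functions above take their matrix arguments by value, so each call makes further full copies through the copy constructor.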
I am studying computer architecture at university.
I have a homework assignment to make a convolution faster using parallelism (OpenMP).
So far I have written the convolution code (your_convolution) with OpenMP, but it did not get any faster!
I'm using Visual Studio 2012.
How can I make it faster?
Here is the whole convolution code.
Any help would be appreciated.
#include <intrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <vector>
#include <assert.h>
#include <omp.h>
using namespace std;
void convolution(float* output, float* input, float* filter, int width, int height, int r)
{
assert(output!=NULL && input!=NULL && filter!=NULL && width>0 && height>0 && r>0);
int w1=width-1;
int h1=height-1;
int fwidth=2*r+1;
int i, j, di, dj, ii, jj;
float sum;
for (i=0;i<height;++i)
{
for (j=0;j<width;++j)
{
sum=0;
for (di=-r;di<=r;++di)
{
ii=i+di;
ii=max(min(ii,h1),0);
for (dj=-r;dj<=r;++dj)
{
jj=j+dj;
jj=max(min(jj,w1),0);
sum+=filter[dj+r+(di+r)*fwidth]*input[jj+ii*width];
}
}
output[j+i*width]=sum;
}
}
}
void your_convolution(float* output, float* input, float* filter, int width, int height, int r)
{
// write your code here //
assert(output != NULL && input != NULL && filter != NULL && width>0 && height>0 && r>0);
int w1 = width - 1;
int h1 = height - 1;
int fwidth = 2 * r + 1;
int i, j, di, dj, ii, jj;
float sum;
omp_set_num_threads(4);
#pragma omp parallel
{
for (i = 0; i<height; ++i)
{
for (j = 0; j<width; ++j)
{
sum = 0;
for (di = -r; di <= r; ++di)
{
ii = i + di;
ii = max(min(ii, h1), 0);
#pragma omp parallel for
for (dj = -r; dj <= r; ++dj)
{
jj = j + dj;
jj = max(min(jj, w1), 0);
sum += filter[dj + r + (di + r)*fwidth] * input[jj + ii*width];
}
}
output[j + i*width] = sum;
}
}
}
}
int main()
{
// load the image
int width=1920; // width of the image
int height=1080; // height of the image
int len=width*height; // pixels in the image
int i, j, ii, jj, i2;
float* data=(float*)malloc(sizeof(float)*len); // buffer to load the image
float* output=(float*)malloc(sizeof(float)*len); // output buffer
FILE* fp=fopen("../image.dat", "rb"); // open the image, assume that the bld directory is a subdirectory to the src directory
fread(data, sizeof(float), width*height, fp); // load the float values, the image is gray.
fclose(fp);
// set the filter
int radius=3; // filter radius
float sigma=(float)(radius/3.0); // standard deviation of the Gaussian filter
float beta=(float)(-0.5/(sigma*sigma)); // coefficient exp(beta*x*x)
int fwidth=2*radius+1; // width of the filter
int flen=fwidth*fwidth; // number of elements in the filter
float* filter=(float*)malloc(sizeof(float)*flen); // filter buffer
float sum_weight=0; // we want to normalize the filter weights
for (i=-radius;i<=radius;++i)
{
ii=(i+radius)*fwidth;
i2=i*i;
for (j=-radius;j<=radius;++j)
{
jj=j+radius+ii;
filter[jj]=exp(beta*(i2+j*j));
sum_weight+=filter[jj];
}
}
sum_weight=(float)(1.0/sum_weight);
for (i=0;i<flen;++i)
filter[i]*=sum_weight; // now the weights are normalized to sum to 1
clock_t start=clock();
convolution(output, data, filter, width, height, radius);
clock_t finish=clock();
double duration = (double)(finish - start) / CLOCKS_PER_SEC;
printf( "convolution naive: %2.3f seconds\n", duration );
float* output2=(float*)malloc(sizeof(float)*len); // output buffer
start=clock();
your_convolution(output2, data, filter, width, height, radius);
finish=clock();
double duration2 = (double)(finish - start) / CLOCKS_PER_SEC;
printf( "your convolution: %2.3f seconds\n", duration2 );
double sum=0;
for (i=0;i<len;++i)
sum+=fabs(output[i]-output2[i]);
printf("difference of the outputs=%lf\n", sum);
printf( "The performance of your convolve is %2.1f times higher than convolution naive.\n", duration/duration2);
free(data);
free(filter);
free(output);
return 0;
}
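A common way to parallelize this kind of loop nest, sketched here as an assumption rather than as the intended homework solution, is to put a single parallel for on the outermost row loop and to declare all temporaries inside the loop body so that each thread gets its own copy. The name convolution_omp is illustrative. The code needs OpenMP enabled in the compiler (/openmp in Visual Studio, -fopenmp in GCC) and otherwise simply runs serially:

```cpp
#include <algorithm>

// Same clamped-border convolution as above, parallelized over rows.
void convolution_omp(float* output, const float* input, const float* filter,
                     int width, int height, int r) {
    const int w1 = width - 1;
    const int h1 = height - 1;
    const int fwidth = 2 * r + 1;
    // Each iteration of i writes a distinct output row, so rows can run
    // concurrently. A signed int loop index keeps this valid for OpenMP 2.0.
    #pragma omp parallel for
    for (int i = 0; i < height; ++i) {
        for (int j = 0; j < width; ++j) {
            float sum = 0.0f;   // declared inside the loop, hence private per thread
            for (int di = -r; di <= r; ++di) {
                const int ii = std::max(std::min(i + di, h1), 0);
                for (int dj = -r; dj <= r; ++dj) {
                    const int jj = std::max(std::min(j + dj, w1), 0);
                    sum += filter[dj + r + (di + r) * fwidth] * input[jj + ii * width];
                }
            }
            output[j + i * width] = sum;
        }
    }
}
```

When timing the result, note that clock() may report CPU time summed over all threads on some platforms; omp_get_wtime() measures wall-clock time.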
I am trying to write efficient code to perform a circular shift, which I need to apply many times to big matrices during my data processing.
On my first attempt the program throws an exception; it seems that I may be accessing a matrix element outside its bounds, and I have no idea what is going wrong.
1) I am also using the Armadillo library, which provides the "mat" type.
2) I intend to shift by row and/or column.
Here is my attempt:
#include "stdafx.h"
#include <vector>
#include <iostream>
#include "C:\Users\kumar\Documents\Visual Studio 2012\UserLibs\armadillo-3-910-0\include\armadillo"
#include <stdlib.h> /* srand, rand */
using namespace arma;
template<class ty>
void circshift(ty *out, const ty *in, int xdim, int ydim, int xshift, int yshift)
{
int iOutputInd, iInputInd, ii, jj;
for (int i =0; i < xdim; i++)
{
ii = (i + xshift) % xdim;
for (int j = 0; j < ydim; j++)
{
jj = (j + yshift) % ydim;
iOutputInd = ii * ydim + jj;
iInputInd = i * ydim + j;
std::cout << " iOutputInd --> " << iOutputInd << " ; iInputInd -->" << iInputInd << "\n";
out[iOutputInd] = in[iInputInd]; // EXCEPTION BEING THROWN HERE
}
}
}
int _tmain(int argc, _TCHAR* argv[])
{
//a = [1 2 3; 4 5 6; 7 8 9];
mat a, a_out; // "mat" defined in C++ lib Armadillo
a << 1 << 2 << 3 << endr
<< 4 << 5 << 6 << endr
<< 7 << 8 << 9 <<endr;
a.reshape(3,3);
//a.print();
a_out = a;
int xdim = 3; int ydim = 3; int xshift = 1; int yshift = 0;
circshift(&a_out, &a, xdim, ydim, xshift, yshift);
a_out.print();
return 0;
}
It compiles fine. However, when I try to run, Visual studio throws following error:
Unhandled exception at 0x3FF00000 in Circshift_Example.exe: 0xC0000005: Access violation (parameters: 0x00000008).
I get another error in visual studio console, which complains:
error: Mat::init(): requested size is too large
Update: FINAL SOLUTION
I am posting my code as it may be useful for some users.
Please note that I am using the Armadillo library to create the matrix. One can replace the Armadillo "mat" class with one's own matrix class.
Please up-vote if you use this code.
#include "stdafx.h"
#include "armadillo-3-910-0\include\armadillo"
using namespace arma;
template<class ty>
void circshift(ty& out, const ty& in, int xshift, int yshift)
{
int iOutputInd, iInputInd, ii, jj;
int ydim = in.n_cols;
int xdim = in.n_rows;
for (int j =0; j < ydim; j++)
{
jj = (j + yshift) % ydim;
if (jj <0) jj = jj + ydim;
for (int i = 0; i < xdim; i++)
{
ii = (i + xshift) % xdim;
if (ii <0) ii = ii + xdim;
out[jj * xdim + ii] = in[j * xdim + i];
}
}
}
int _tmain(int argc, _TCHAR* argv[])
{
//a = [1 2 3; 4 5 6; 7 8 9];
mat a, a_out;
a << 1 << 2 << 3 << endr
<< 4 << 5 << 6 << endr
<< 7 << 8 << 9 <<endr;
a.reshape(3,3);
a_out = a;
int xshift = 1; int yshift = 0;
circshift(a_out, a, xshift, yshift);
a_out.print();
xshift = 1; yshift = -1;
circshift(a_out, a, xshift, yshift);
a_out.print();
return 0;
}
The main error here is that you pass pointers to mat objects to the circshift() function (the out and in arguments), but then use these arguments as if they were the matrices themselves. The following line is not interpreted as you think:
out[iOutputInd] = in[iInputInd];
because out and in are not mat objects. They are pointers to mat objects, so the compiler treats in and out as pointers to arrays of mat and indexes those arrays, copying a nonexistent mat from in[...] to another nonexistent location.
One simple way to fix that is to use references instead of pointers to pass the mat objects, i.e.:
template<class ty> void circshift(ty& out, const ty& in, int xdim, int ydim, int xshift, int yshift)
{
...
}
and call it in _tmain using:
circshift(a_out, a, xdim, ydim, xshift, yshift);
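For readers without Armadillo, the same wrap-around indexing can be sketched on a plain column-major std::vector, which is the storage order Armadillo uses. The name circshift_colmajor and the explicit dimension parameters are illustrative:

```cpp
#include <cstddef>
#include <vector>

// Circularly shifts an n_rows x n_cols column-major matrix by
// (row_shift, col_shift); negative shifts wrap around as well.
std::vector<double> circshift_colmajor(const std::vector<double>& in,
                                       int n_rows, int n_cols,
                                       int row_shift, int col_shift) {
    std::vector<double> out(in.size());
    for (int j = 0; j < n_cols; ++j) {
        // ((a % n) + n) % n maps any integer a into [0, n)
        const int jj = ((j + col_shift) % n_cols + n_cols) % n_cols;
        for (int i = 0; i < n_rows; ++i) {
            const int ii = ((i + row_shift) % n_rows + n_rows) % n_rows;
            out[static_cast<std::size_t>(jj) * n_rows + ii] =
                in[static_cast<std::size_t>(j) * n_rows + i];
        }
    }
    return out;
}
```

Because the index is normalized with a double modulo, shifts smaller than minus the dimension are handled too.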