This is an optimized implementation of matrix multiplication; the routine performs the operation
C := C + A * B (where A, B, and C are n-by-n matrices stored in column-major format).
On exit, A and B retain their input values.
void matmul_optimized(int n, int *A, int *B, int *C)
{
    // Replace arithmetic with bitwise operations:
    // "multiplication" of each pair of terms uses &, "addition" uses ^
    int i, j, k;
    int cij;
    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            cij = C[i + j * n]; // accumulate into a separate variable so C is only written once
            for (k = 0; k < n; ++k) {
                cij ^= A[i + k * n] & B[k + j * n];
            }
            C[i + j * n] = cij; // write the final result back into C
        }
    }
}
How can I speed up this matrix multiplication further, starting from the function above?
The function has been tested on matrices up to 2048 by 2048.
matmul_optimized is checked against the reference implementation (matmul_reference) using the harness below.
#include <stdio.h>
#include <stdlib.h>
#include "cpucycles.c"
#include "helper_functions.c"
#include "matmul_reference.c"
#include "matmul_optimized.c"
int main()
{
int i, j;
int n = 1024; // Number of rows or columns in the square matrices
int *A, *B; // Input matrices
int *C1, *C2; // Output matrices from the reference and optimized implementations
// Performance and correctness measurement declarations
long int CLOCK_start, CLOCK_end, CLOCK_total, CLOCK_ref, CLOCK_opt;
long int COUNTER, REPEAT = 5;
int difference;
float speedup;
// Allocate memory for the matrices
A = malloc(n * n * sizeof(int));
B = malloc(n * n * sizeof(int));
C1 = malloc(n * n * sizeof(int));
C2 = malloc(n * n * sizeof(int));
// Fill bits in A, B, C1
fill(A, n * n);
fill(B, n * n);
fill(C1, n * n);
// Initialize C2 = C1
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
C2[i * n + j] = C1[i * n + j];
// Measure performance of the reference implementation
CLOCK_total = 0;
for (COUNTER = 0; COUNTER < REPEAT; COUNTER++)
{
CLOCK_start = cpucycles();
matmul_reference(n, A, B, C1);
CLOCK_end = cpucycles();
CLOCK_total = CLOCK_total + CLOCK_end - CLOCK_start;
}
CLOCK_ref = CLOCK_total / REPEAT;
printf("n=%d Avg cycle count for reference implementation = %ld\n", n, CLOCK_ref);
// Measure performance of the optimized implementation
CLOCK_total = 0;
for (COUNTER = 0; COUNTER < REPEAT; COUNTER++)
{
CLOCK_start = cpucycles();
matmul_optimized(n, A, B, C2);
CLOCK_end = cpucycles();
CLOCK_total = CLOCK_total + CLOCK_end - CLOCK_start;
}
CLOCK_opt = CLOCK_total / REPEAT;
printf("n=%d Avg cycle count for optimized implementation = %ld\n", n, CLOCK_opt);
speedup = (float)CLOCK_ref / (float)CLOCK_opt;
// Check correctness by comparing C1 and C2
// (count mismatched entries so positive and negative differences cannot cancel)
difference = 0;
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        if (C1[i * n + j] != C2[i * n + j])
            difference++;
if (difference == 0)
    printf("Speedup factor = %.2f\n", speedup);
else
    printf("Reference and optimized implementations do not match\n");
//print(C2, n);
free(A);
free(B);
free(C1);
free(C2);
return 0;
}
You can try an algorithm like Strassen or Coppersmith-Winograd.
Or try parallel computing, for example with std::async/std::future or std::thread.
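For instance, here is a minimal std::thread sketch (my illustration, not a tested implementation) that splits the columns of C across threads, assuming the same column-major bitwise kernel as above:
#include <thread>
#include <vector>
// Each thread computes a disjoint range of columns of C, so no locking is needed.
void matmul_threaded(int n, const int *A, const int *B, int *C, int nthreads)
{
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([=] {
            for (int j = t * n / nthreads; j < (t + 1) * n / nthreads; ++j)
                for (int k = 0; k < n; ++k) {
                    int b = B[k + j * n];
                    for (int i = 0; i < n; ++i)
                        C[i + j * n] ^= A[i + k * n] & b;
                }
        });
    }
    for (auto &th : pool)
        th.join();
}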
Optimizing matrix-matrix multiplication requires careful attention to be paid to a number of issues:
First, you need to be able to use vector instructions. Only vector instructions can access parallelism inherent in the architecture. So, either your compiler needs to be able to automatically map to vector instructions, or you have to do so by hand, for example by calling the vector intrinsic library for AVX-2 instructions (for x86 architectures).
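For example, a hand-vectorized sketch of the bitwise kernel from the question using AVX2 intrinsics might look like this (a minimal, untested sketch, assuming n is a multiple of 8 and the same column-major layout):
#include <immintrin.h>
void matmul_gf2_avx2(int n, const int *A, const int *B, int *C)
{
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k) {
            __m256i b = _mm256_set1_epi32(B[k + j * n]); // broadcast B[k][j] to all 8 lanes
            for (int i = 0; i < n; i += 8) {
                __m256i a = _mm256_loadu_si256((const __m256i *)&A[i + k * n]);
                __m256i c = _mm256_loadu_si256((const __m256i *)&C[i + j * n]);
                c = _mm256_xor_si256(c, _mm256_and_si256(a, b)); // C ^= A & B, 8 ints at a time
                _mm256_storeu_si256((__m256i *)&C[i + j * n], c);
            }
        }
}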
Next, you need to pay careful attention to the memory hierarchy. Your performance can easily drop to less than 5% of peak if you don't do this.
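As an illustration of the memory-hierarchy point, here is a simple cache-blocking (tiling) sketch of the same kernel; BLOCK is a tunable tile size, and 64 is only a starting guess, not a recommendation:
#define BLOCK 64
void matmul_blocked(int n, const int *A, const int *B, int *C)
{
    for (int jj = 0; jj < n; jj += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int j = jj; j < jj + BLOCK && j < n; ++j)
                for (int k = kk; k < kk + BLOCK && k < n; ++k) {
                    int b = B[k + j * n]; // reuse this element across a whole column of A
                    for (int i = 0; i < n; ++i)
                        C[i + j * n] ^= A[i + k * n] & b;
                }
}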
Once you do this right, you will hopefully have broken the computation up into small enough computational chunks that you can also parallelize via OpenMP or pthreads.
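For example, since each column of C is written by exactly one j iteration, the outer loop of the kernel above can be parallelized with a single OpenMP pragma (a sketch, assuming the same layout):
#pragma omp parallel for schedule(static)
for (int j = 0; j < n; ++j)
    for (int k = 0; k < n; ++k) {
        int b = B[k + j * n];
        for (int i = 0; i < n; ++i)
            C[i + j * n] ^= A[i + k * n] & b;
    }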
A document that carefully steps through what is required can be found at http://www.cs.utexas.edu/users/flame/laff/pfhp/LAFF-On-PfHP.html. (This is very much a work in progress.) At the end of it all, you will have an implementation that gets close to the performance attained by high-performance libraries like Intel's Math Kernel Library (MKL) or the BLAS-like Library Instantiation Software (BLIS).
(And, actually, you CAN then also effectively incorporate Strassen's algorithm. But that is another story, told in Unit 3.5.3 of these notes.)
You may find the following thread relevant: How does BLAS get such extreme performance?
I am trying to learn GPU programming. My system environment is as follows:
OS: Windows 10 Pro
GPU: NVIDIA GTX 1080 Ti (the display does not run on this; there is another GPU for that)
CUDA toolkit: v9.1
I wrote this simple program using CUDA to calculate an FFT from scratch on the GPU. The algorithm follows the Wikipedia example of the Cooley-Tukey algorithm, and the code uses recursive kernel launches to compute the FFT of an array of complex values.
#include <iostream>
#include <string>
#include "conio.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/complex.h>
#include <cstdio>
#include <fstream>
using namespace std;
#define winSize 2048
#define winShift 1024
#define M_PI 3.14159265358979323846
__device__ void separate(thrust::complex<double>* a, int n)
{
    thrust::complex<double>* b = new thrust::complex<double>[n / 2]; // get temp heap storage
    for (int i = 0; i < n / 2; i++) // copy all odd elements to heap storage
        b[i] = a[i * 2 + 1];
    for (int i = 0; i < n / 2; i++) // copy all even elements to lower-half of a[]
        a[i] = a[i * 2];
    for (int i = 0; i < n / 2; i++) // copy all odd (from heap) to upper-half of a[]
        a[i + n / 2] = b[i];
    delete[] b; // memory from device-side new[] must be released with delete[], not cudaFree
}
// N must be a power-of-2, or bad things will happen.
// Currently no check for this condition.
//
// N input samples in X[] are FFT'd and results left in X[].
// Because of Nyquist theorem, N samples means
// only first N/2 FFT results in X[] are the answer.
// (upper half of X[] is a reflection with no new information).
__global__ void fft2(thrust::complex<double>* X, int N)
{
if (N < 2)
{
// bottom of recursion.
// Do nothing here, because already X[0] = x[0]
}
else
{
separate(X, N); // all evens to lower half, all odds to upper half
fft2<<<1, 1>>>(X, N / 2); // recurse even items
fft2<<<1, 1>>>(X + N / 2, N / 2); // recurse odd items
// combine results of two half recursions
for (int k = 0; k<N / 2; k++)
{
thrust::complex<double> e = X[k]; // even
thrust::complex<double> o = X[k + N / 2]; // odd
// w is the "twiddle-factor"
thrust::complex<double> w = exp(thrust::complex<double>(0, -2.*M_PI*k / N));
X[k] = e + w * o;
X[k + N / 2] = e - w * o;
}
}
}
int main()
{
const int nSamples = 64;
double nSeconds = 0.02; // total time for sampling
double sampleRate = nSamples / nSeconds; // n Hz = n / second
double freqResolution = sampleRate / nSamples; // freq step in FFT result
thrust::complex<double> x[nSamples]; // storage for sample data
thrust::complex<double> X[nSamples]; // storage for FFT answer
thrust::complex<double> *d_arr1;
const int nFreqs = 5;
double freq[nFreqs] = { 2,4,8,32,72 }; // known freqs for testing
size_t n_byte = nSamples * sizeof(thrust::complex<double>);
// generate samples for testing
for (int i = 0; i<nSamples; i++)
{
x[i] = thrust::complex<double>(0., 0.);
// sum several known sinusoids into x[]
for (int j = 0; j < nFreqs; j++)
x[i] += sin(2 * M_PI*freq[j] * i); // / nSamples);
X[i] = x[i]; // copy into X[] for FFT work & result
}
// compute fft for this data
cudaMalloc((void**)&d_arr1, n_byte);
cudaMemcpy(d_arr1, X, n_byte, cudaMemcpyHostToDevice);
//launchKernel<<<1, 1>>>(d_arr1, nSamples);
fft2<<<1, 1>>>(d_arr1, nSamples);
cudaMemcpy(X, d_arr1, n_byte, cudaMemcpyDeviceToHost);
printf(" n\tx[]\tX[]\tf\n"); // header line
// loop to print values
for (int i = 0; i<nSamples; i++)
{
printf("% 3d\t%+.3f\t%+.3f\t%g\n",
i, x[i].real(), abs(X[i]), i*freqResolution);
}
ofstream myfile("example_cuda.txt");
printf("I am trying to write to file\n");
if (myfile.is_open())
{
for (int count = 0; count < nSamples; count++)
{
myfile << x[count].real() << "," << abs(X[count]) << "," << count*freqResolution << "\n";
}
myfile.close();
}
}
I used the following command to compile the code using VS2015 command prompt:
nvcc -o fft_Wiki2.exe -c -arch=compute_35 -rdc=true
--expt-relaxed-constexpr --machine 64 -Xcompiler "/wd4819" fftWiki_2.cu
The compilation itself doesn't show any errors or warnings, but the executable does not run. When I try to run
fft_Wiki2.exe
it says that the version of this executable is incompatible with 64-bit Windows and cannot execute, even though I am passing the --machine 64 option to force a 64-bit build.
How do I get this program to execute?
How do I get this program to execute?
It isn't a program you are trying to run, it is an object file.
In your compilation command you pass -c:
nvcc -o fft_Wiki2.exe -c -arch=compute_35 -rdc=true --expt-relaxed-constexpr --machine 64 -Xcompiler "/wd4819" fftWiki_2.cu
which means only compilation and no linking. What you would need to do is something like this:
nvcc -o fft_Wiki2.obj -c -arch=compute_35 -rdc=true --expt-relaxed-constexpr --machine 64 -Xcompiler "/wd4819" fftWiki_2.cu
nvcc -o fft_Wiki2.exe -arch=compute_35 --expt-relaxed-constexpr --machine 64 -Xcompiler "/wd4819" fftWiki_2.obj
[Note I don't have access to a Windows development platform to check the accuracy of the commands]
The first command compiles and emits an object file. The second performs both host and device code linking and emits an executable, which you should be able to run.
I am trying to run the C++ FFT code from this web page:
https://www.nayuki.io/page/free-small-fft-in-multiple-languages
I am pretty new to C++, so I don't know how to run it. Essentially, I want to pass a REAL vector and an IMAG vector to the program and get REAL and IMAG vectors back as output.
Say my REAL_VEC = {1, 2, 3, 4, 5}
and my IMAG_VEC = {0, 1, 0, 1, 0}
I am pasting the code that I have, and it compiles. But where do I give the input, and how do I get the output (for the above vectors)?
//FftRealPairTest.cpp
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <random>
#include <vector>
#include "FftRealPair.hpp"
using std::cout;
using std::endl;
using std::vector;
// Private function prototypes
static void testFft(int n);
static vector<double> randomReals(int n);
// Mutable global variable
static double maxLogError = -INFINITY;
// Random number generation
std::default_random_engine randGen((std::random_device())());
int main() {
// Test diverse size FFTs
for (int i = 0, prev = 0; i <= 4; i++) {
int n = static_cast<int>(std::lround(std::pow(1500.0, i / 100.0)));
if (n > prev) {
testFft(n);
prev = n;
}
}
cout << endl;
cout << "Max log err = " << std::setprecision(3) << maxLogError << endl;
cout << "Test " << (maxLogError < -10 ? "passed" : "failed") << endl;
return EXIT_SUCCESS;
}
static void testFft(int n) {
vector<double> inputreal(randomReals(n));
vector<double> inputimag(randomReals(n));
vector<double> actualoutreal(inputreal);
vector<double> actualoutimag(inputimag);
Fft::transform(actualoutreal, actualoutimag);
}
static vector<double> randomReals(int n) {
std::uniform_real_distribution<double> valueDist(-1.0, 1.0);
vector<double> result;
for (int i = 0; i < n; i++)
result.push_back(valueDist(randGen));
return result;
}
/////////////////
//FftRealPair.cpp
/*
* Free FFT and convolution (C++)
*
* Copyright (c) 2017 Project Nayuki. (MIT License)
* https://www.nayuki.io/page/free-small-fft-in-multiple-languages
*
* Permission is hereby granted, free of charge, to any person obtaining a copy of
* this software and associated documentation files (the "Software"), to deal in
* the Software without restriction, including without limitation the rights to
* use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
* the Software, and to permit persons to whom the Software is furnished to do so,
* subject to the following conditions:
* - The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
* - The Software is provided "as is", without warranty of any kind, express or
* implied, including but not limited to the warranties of merchantability,
* fitness for a particular purpose and noninfringement. In no event shall the
* authors or copyright holders be liable for any claim, damages or other
* liability, whether in an action of contract, tort or otherwise, arising from,
* out of or in connection with the Software or the use or other dealings in the
* Software.
*/
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include "FftRealPair.hpp"
using std::size_t;
using std::vector;
// Private function prototypes
static size_t reverseBits(size_t x, int n);
void Fft::transform(vector<double> &real, vector<double> &imag) {
size_t n = real.size();
if (n != imag.size())
throw "Mismatched lengths";
if (n == 0)
return;
else if ((n & (n - 1)) == 0) // Is power of 2
transformRadix2(real, imag);
else // More complicated algorithm for arbitrary sizes
transformBluestein(real, imag);
}
void Fft::inverseTransform(vector<double> &real, vector<double> &imag) {
transform(imag, real);
}
void Fft::transformRadix2(vector<double> &real, vector<double> &imag) {
// Length variables
size_t n = real.size();
if (n != imag.size())
throw "Mismatched lengths";
int levels = 0; // Compute levels = floor(log2(n))
for (size_t temp = n; temp > 1U; temp >>= 1)
levels++;
if (static_cast<size_t>(1U) << levels != n)
throw "Length is not a power of 2";
// Trigonometric tables
vector<double> cosTable(n / 2);
vector<double> sinTable(n / 2);
for (size_t i = 0; i < n / 2; i++) {
cosTable[i] = std::cos(2 * M_PI * i / n);
sinTable[i] = std::sin(2 * M_PI * i / n);
}
// Bit-reversed addressing permutation
for (size_t i = 0; i < n; i++) {
size_t j = reverseBits(i, levels);
if (j > i) {
std::swap(real[i], real[j]);
std::swap(imag[i], imag[j]);
}
}
// Cooley-Tukey decimation-in-time radix-2 FFT
for (size_t size = 2; size <= n; size *= 2) {
size_t halfsize = size / 2;
size_t tablestep = n / size;
for (size_t i = 0; i < n; i += size) {
for (size_t j = i, k = 0; j < i + halfsize; j++, k += tablestep) {
size_t l = j + halfsize;
double tpre = real[l] * cosTable[k] + imag[l] * sinTable[k];
double tpim = -real[l] * sinTable[k] + imag[l] * cosTable[k];
real[l] = real[j] - tpre;
imag[l] = imag[j] - tpim;
real[j] += tpre;
imag[j] += tpim;
}
}
if (size == n) // Prevent overflow in 'size *= 2'
break;
}
}
void Fft::transformBluestein(vector<double> &real, vector<double> &imag) {
// Find a power-of-2 convolution length m such that m >= n * 2 + 1
size_t n = real.size();
if (n != imag.size())
throw "Mismatched lengths";
size_t m = 1;
while (m / 2 <= n) {
if (m > SIZE_MAX / 2)
throw "Vector too large";
m *= 2;
}
// Trigonometric tables
vector<double> cosTable(n), sinTable(n);
for (size_t i = 0; i < n; i++) {
unsigned long long temp = static_cast<unsigned long long>(i) * i;
temp %= static_cast<unsigned long long>(n) * 2;
double angle = M_PI * temp / n;
// Less accurate alternative if long long is unavailable: double angle = M_PI * i * i / n;
cosTable[i] = std::cos(angle);
sinTable[i] = std::sin(angle);
}
// Temporary vectors and preprocessing
vector<double> areal(m), aimag(m);
for (size_t i = 0; i < n; i++) {
areal[i] = real[i] * cosTable[i] + imag[i] * sinTable[i];
aimag[i] = -real[i] * sinTable[i] + imag[i] * cosTable[i];
}
vector<double> breal(m), bimag(m);
breal[0] = cosTable[0];
bimag[0] = sinTable[0];
for (size_t i = 1; i < n; i++) {
breal[i] = breal[m - i] = cosTable[i];
bimag[i] = bimag[m - i] = sinTable[i];
}
// Convolution
vector<double> creal(m), cimag(m);
convolve(areal, aimag, breal, bimag, creal, cimag);
// Postprocessing
for (size_t i = 0; i < n; i++) {
real[i] = creal[i] * cosTable[i] + cimag[i] * sinTable[i];
imag[i] = -creal[i] * sinTable[i] + cimag[i] * cosTable[i];
}
}
void Fft::convolve(const vector<double> &x, const vector<double> &y, vector<double> &out) {
size_t n = x.size();
if (n != y.size() || n != out.size())
throw "Mismatched lengths";
vector<double> outimag(n);
convolve(x, vector<double>(n), y, vector<double>(n), out, outimag);
}
void Fft::convolve(
const vector<double> &xreal, const vector<double> &ximag,
const vector<double> &yreal, const vector<double> &yimag,
vector<double> &outreal, vector<double> &outimag) {
size_t n = xreal.size();
if (n != ximag.size() || n != yreal.size() || n != yimag.size()
|| n != outreal.size() || n != outimag.size())
throw "Mismatched lengths";
vector<double> xr(xreal);
vector<double> xi(ximag);
vector<double> yr(yreal);
vector<double> yi(yimag);
transform(xr, xi);
transform(yr, yi);
for (size_t i = 0; i < n; i++) {
double temp = xr[i] * yr[i] - xi[i] * yi[i];
xi[i] = xi[i] * yr[i] + xr[i] * yi[i];
xr[i] = temp;
}
inverseTransform(xr, xi);
for (size_t i = 0; i < n; i++) { // Scaling (because this FFT implementation omits it)
outreal[i] = xr[i] / n;
outimag[i] = xi[i] / n;
}
}
static size_t reverseBits(size_t x, int n) {
size_t result = 0;
for (int i = 0; i < n; i++, x >>= 1)
result = (result << 1) | (x & 1U);
return result;
}
///////////
//FftRealPair.hpp
/*
* Free FFT and convolution (C++)
*
* Copyright (c) 2017 Project Nayuki. (MIT License)
* https://www.nayuki.io/page/free-small-fft-in-multiple-languages
*
* Permission is hereby granted, free of charge, to any person obtaining a copy of
* this software and associated documentation files (the "Software"), to deal in
* the Software without restriction, including without limitation the rights to
* use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
* the Software, and to permit persons to whom the Software is furnished to do so,
* subject to the following conditions:
* - The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
* - The Software is provided "as is", without warranty of any kind, express or
* implied, including but not limited to the warranties of merchantability,
* fitness for a particular purpose and noninfringement. In no event shall the
* authors or copyright holders be liable for any claim, damages or other
* liability, whether in an action of contract, tort or otherwise, arising from,
* out of or in connection with the Software or the use or other dealings in the
* Software.
*/
#pragma once
#include <vector>
namespace Fft {
/*
* Computes the discrete Fourier transform (DFT) of the given complex vector, storing the result back into the vector.
* The vector can have any length. This is a wrapper function.
*/
void transform(std::vector<double> &real, std::vector<double> &imag);
/*
* Computes the inverse discrete Fourier transform (IDFT) of the given complex vector, storing the result back into the vector.
* The vector can have any length. This is a wrapper function. This transform does not perform scaling, so the inverse is not a true inverse.
*/
void inverseTransform(std::vector<double> &real, std::vector<double> &imag);
/*
* Computes the discrete Fourier transform (DFT) of the given complex vector, storing the result back into the vector.
* The vector's length must be a power of 2. Uses the Cooley-Tukey decimation-in-time radix-2 algorithm.
*/
void transformRadix2(std::vector<double> &real, std::vector<double> &imag);
/*
* Computes the discrete Fourier transform (DFT) of the given complex vector, storing the result back into the vector.
* The vector can have any length. This requires the convolution function, which in turn requires the radix-2 FFT function.
* Uses Bluestein's chirp z-transform algorithm.
*/
void transformBluestein(std::vector<double> &real, std::vector<double> &imag);
/*
* Computes the circular convolution of the given real vectors. Each vector's length must be the same.
*/
void convolve(const std::vector<double> &x, const std::vector<double> &y, std::vector<double> &out);
/*
* Computes the circular convolution of the given complex vectors. Each vector's length must be the same.
*/
void convolve(
const std::vector<double> &xreal, const std::vector<double> &ximag,
const std::vector<double> &yreal, const std::vector<double> &yimag,
std::vector<double> &outreal, std::vector<double> &outimag);
}
If you look at the .hpp file that you posted, the first function, transform(), takes two inputs: your real and imaginary vectors. The FFT is done in place, so the result is returned in the same vectors.
If you want to give it a try, you may look at testFft() and initialize
inputreal and inputimag with your data. The vectors are then copied into actualoutreal and actualoutimag (to avoid overwriting the original data) and passed to transform().
After that you should have your output in the same vectors (actualoutreal and actualoutimag).
This code does precisely what you want (requires C++11):
#include <cstddef>
#include <iostream>
#include <vector>
#include "FftRealPair.hpp"
int main() {
// Declare input
std::vector<double> real{1, 2, 3, 4, 5};
std::vector<double> imag{0, 1, 0, 1, 0};
// Do FFT
Fft::transform(real, imag);
// Print result
for (std::size_t i = 0; i < real.size(); i++) {
std::cout << real[i] << " " << imag[i] << std::endl;
}
return 0;
}
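For completeness, one way to build and run this answer's snippet together with the posted FFT source (the file names here are my assumption; adjust to yours):
g++ -std=c++11 main.cpp FftRealPair.cpp -o fft
./fft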
I hesitate to ask this question because there is probably something wrong with my C++ template program, but this problem has been bugging me for the past couple of hours. I am running the exact same program on the Visual C++ and MinGW g++ compilers, but only VC2010 is giving me the expected results. I am not a proficient C++ programmer by any means, so not getting any error messages from either compiler is even more frustrating.
Edit: I did mingw-get upgrade after failing to resolve the error. I was running g++ 4.5.2 and now I have version 4.7.2, but the problem persists.
Late update: I did a complete uninstall of the MinGW platform, manually removed every folder, and then installed TDM-GCC, but the problem persists. Maybe there is some conflict with my Windows installation. I have installed Cygwin and g++ 4.5.3 for the time being (it is working), as OS reinstallation isn't really an option right now. Thanks for all the help.
Here is my code. (Header File itertest.h)
#ifndef ITERTEST_H
#define ITERTEST_H
#include <iostream>
#include <cmath>
#include <vector>
#include <string>
#include <algorithm>
using namespace std;
template <typename T>
class fft_data{
public:
vector<T> re;
vector<T> im;
};
template <typename T>
void inline twiddle(fft_data<T> &vec,int N,int radix){
// Calculates twiddle factors for radix-2
T PI2 = (T) 6.28318530717958647692528676655900577;
T theta = (T) PI2/N;
vec.re.resize(N/radix,(T) 0.0);
vec.im.resize(N/radix,(T) 0.0);
vec.re[0] = (T) 1.0;
for (int K = 1; K < N/radix; K++) {
vec.re[K] = (T) cos(theta * K);
vec.im[K] = (T) sin(theta * K);
}
}
template <typename T>
void inline sh_radix5_dif(fft_data<T> &x,fft_data<T> &wl, int q, int sgn) {
int n = x.re.size();
int L = (int) pow(5.0, (double)q);
int Ls = L / 5;
int r = n / L;
T c1 = 0.30901699437;
T c2 = -0.80901699437;
T s1 = 0.95105651629;
T s2 = 0.58778525229;
T tau0r,tau0i,tau1r,tau1i,tau2r,tau2i,tau3r,tau3i;
T tau4r,tau4i,tau5r,tau5i;
T br,bi,cr,ci,dr,di,er,ei;
fft_data<T> y = x;
T wlr,wli,wl2r,wl2i,wl3r,wl3i,wl4r,wl4i;
int lsr = Ls*r;
for (int j = 0; j < Ls; j++) {
int ind = j*r;
wlr = wl.re[ind];
wli = wl.im[ind];
wl2r = wlr*wlr - wli*wli;
wl2i = 2.0*wlr*wli;
wl3r = wl2r*wlr - wli*wl2i;
wl3i= wl2r*wli + wl2i*wlr;
wl4r = wl2r*wl2r - wl2i*wl2i;
wl4i = 2.0*wl2r*wl2i;
for (int k =0; k < r; k++) {
int index = k*L+j;
int index1 = index+Ls;
int index2 = index1+Ls;
int index3 = index2+Ls;
int index4 = index3+Ls;
tau0r = y.re[index1] + y.re[index4];
tau0i = y.im[index1] + y.im[index4];
tau1r = y.re[index2] + y.re[index3];
tau1i = y.im[index2] + y.im[index3];
tau2r = y.re[index1] - y.re[index4];
tau2i = y.im[index1] - y.im[index4];
tau3r = y.re[index2] - y.re[index3];
tau3i = y.im[index2] - y.im[index3];
tau4r = c1 * tau0r + c2 * tau1r;
tau4i = c1 * tau0i + c2 * tau1i;
tau5r = sgn * ( s1 * tau2r + s2 * tau3r);
tau5i = sgn * ( s1 * tau2i + s2 * tau3i);
br = y.re[index] + tau4r + tau5i;
bi = y.im[index] + tau4i - tau5r;
er = y.re[index] + tau4r - tau5i;
ei = y.im[index] + tau4i + tau5r;
tau4r = c2 * tau0r + c1 * tau1r;
tau4i = c2 * tau0i + c1 * tau1i;
tau5r = sgn * ( s2 * tau2r - s1 * tau3r);
tau5i = sgn * ( s2 * tau2i - s1 * tau3i);
cr = y.re[index] + tau4r + tau5i;
ci = y.im[index] + tau4i - tau5r;
dr = y.re[index] + tau4r - tau5i;
di = y.im[index] + tau4i + tau5r;
int indexo = k*Ls+j;
int indexo1 = indexo+lsr;
int indexo2 = indexo1+lsr;
int indexo3 = indexo2+lsr;
int indexo4 = indexo3+lsr;
x.re[indexo]= y.re[index] + tau0r + tau1r;
x.im[indexo]= y.im[index] + tau0i + tau1i;
x.re[indexo1] = wlr*br - wli*bi;
x.im[indexo1] = wlr*bi + wli*br;
x.re[indexo2] = wl2r*cr - wl2i*ci;
x.im[indexo2] = wl2r*ci + wl2i*cr;
x.re[indexo3] = wl3r*dr - wl3i*di;
x.im[indexo3] = wl3r*di + wl3i*dr;
x.re[indexo4] = wl4r*er - wl4i*ei;
x.im[indexo4] = wl4r*ei + wl4i*er;
}
}
}
template <typename T>
void inline fftsh_radix5_dif(fft_data<T> &data,int sgn, unsigned int N) {
//unsigned int len = data.re.size();
int num = (int) ceil(log10(static_cast<double>(N))/log10(5.0));
//indrev(data,index);
fft_data<T> twi;
twiddle(twi,N,5);
if (sgn == 1) {
transform(twi.im.begin(), twi.im.end(),twi.im.begin(),bind1st(multiplies<T>(),(T) -1.0));
}
for (int i=num; i > 0; i--) {
sh_radix5_dif(data,twi,i,sgn);
}
}
#endif
main.cpp
#include "itertest.h"
using namespace std;
int main(int argc, char **argv)
{
int N = 25;
//vector<complex<double> > sig1;
fft_data<double> sig1;
for (int i =0; i < N; i++){
//sig1.push_back(complex<double>((double)1.0, 0.0));
//sig2.re.push_back((double) i);
//sig2.im.push_back((double) i+2);
sig1.re.push_back((double) 1);
sig1.im.push_back((double) 0);
}
fftsh_radix5_dif(sig1,1,N);
for (int i =0; i < N; i++){
cout << sig1.re[i] << " " << sig1.im[i] << endl;
}
cin.get();
return 0;
}
The expected output (which I am getting from VC2010):
25 0
4.56267e-016 -2.50835e-016
2.27501e-016 -3.58484e-016
1.80101e-017 -2.86262e-016
... rest 21 rows same as the last three rows ( < 1e-015)
The output from MinGW g++:
20 0
4.94068e-016 -2.10581e-016
2.65385e-016 -3.91346e-016
-5.76751e-017 -2.93654e-016
5 0
-1.54508 -4.75528
-3.23032e-017 1.85061e-017
-4.68253e-017 -1.18421e-016
-6.32003e-017 -2.05833e-016
1.11022e-016 0
4.04508 -2.93893
8.17138e-017 6.82799e-018
3.5246e-017 9.06767e-017
-6.59101e-017 -1.62762e-016
1.11022e-016 0
4.04508 2.93893
-6.28467e-017 6.40636e-017
1.79807e-016 3.34411e-017
-6.94919e-017 -1.05831e-016
1.11022e-016 0
-1.54508 4.75528
5.70402e-017 -1.68674e-017
-1.36169e-016 -8.30473e-017
-9.75639e-017 3.40359e-016
1.11022e-016 0
There must be something wrong with your MinGW installation. You might have an out-of-date, buggy version of GCC. The unofficial TDM-GCC distribution usually has a more up-to-date version: http://tdm-gcc.tdragon.net/
When I compile your code with GCC 4.6.3 on Ubuntu, it produces the output below, which appears to match the VC2010 output exactly (but I can't verify this, since you didn't provide it in full). Adding the options -O3 -ffast-math -march=native doesn't seem to change anything.
Note that I had to fix an obvious typo in fftsh_radix5_dif (a missing closing angle bracket in the template argument list of multiplies), but I assume you do not have it in your code, since it wouldn't compile at all.
25 0
4.56267e-16 -2.50835e-16
2.27501e-16 -3.58484e-16
1.80101e-17 -2.86262e-16
-5.76751e-17 -1.22566e-16
8.88178e-16 0
9.45774e-17 1.19479e-17
1.27413e-16 -5.04465e-17
7.97139e-17 -9.63575e-17
1.35142e-17 -7.08438e-17
8.88178e-16 0
4.84283e-17 4.54772e-17
1.02473e-16 2.63107e-17
1.02473e-16 -2.63107e-17
4.84283e-17 -4.54772e-17
8.88178e-16 0
1.35142e-17 7.08438e-17
7.97139e-17 9.63575e-17
1.27413e-16 5.04465e-17
9.45774e-17 -1.19479e-17
8.88178e-16 0
-5.76751e-17 1.22566e-16
1.80101e-17 2.86262e-16
2.27501e-16 3.58484e-16
4.56267e-16 2.50835e-16
Check the creation date of the executable you're running.
You may be running an earlier draft of your program.
I'm writing a sparse matrix solver using the Gauss-Seidel method. By profiling, I've determined that about half of my program's time is spent inside the solver. The performance-critical part is as follows:
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
All arrays involved are of float type. Actually, they are not arrays but objects with an overloaded [] operator, which (I think) should be optimized away, but is defined as follows:
inline float &operator[](size_t i) { return d_cells[i]; }
inline float const &operator[](size_t i) const { return d_cells[i]; }
For d_nx = d_ny = 128, this can be run about 3500 times per second on an Intel i7 920. This means that the inner loop body runs 3500 * 128 * 128 = 57 million times per second. Since only some simple arithmetic is involved, that strikes me as a low number for a 2.66 GHz processor.
Maybe it's not limited by CPU power, but by memory bandwidth? Well, one 128 * 128 float array eats 65 kB, so all 6 arrays should easily fit into the CPU's L3 cache (which is 8 MB). Assuming that nothing is cached in registers, I count 15 memory accesses in the inner loop body. On a 64-bit system this is 120 bytes per iteration, so 57 million * 120 bytes = 6.8 GB/s. The L3 cache runs at 2.66 GHz, so it's the same order of magnitude. My guess is that memory is indeed the bottleneck.
To speed this up, I've attempted the following:
Compile with g++ -O3. (Well, I'd been doing this from the beginning.)
Parallelizing over 4 cores using OpenMP pragmas. I have to change to the Jacobi algorithm to avoid reads from and writes to the same array. This requires that I do twice as many iterations, leading to a net result of about the same speed.
Fiddling with implementation details of the loop body, such as using pointers instead of indices. No effect.
What's the best approach to speed this guy up? Would it help to rewrite the inner body in assembly (I'd have to learn that first)? Should I run this on the GPU instead (which I know how to do, but it's such a hassle)? Any other bright ideas?
(N.B. I do take "no" for an answer, as in: "it can't be done significantly faster, because...")
Update: as requested, here's a full program:
#include <iostream>
#include <cstdlib>
#include <cstring>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;
void step() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void solve(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step();
}
}
void clear(float *a) {
memset(a, 0, d_nx * d_ny * sizeof(float));
}
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d_b = new float[n]; clear(d_b);
d_w = new float[n]; clear(d_w);
d_e = new float[n]; clear(d_e);
d_s = new float[n]; clear(d_s);
d_n = new float[n]; clear(d_n);
solve(atoi(argv[1]));
cout << d_x[0] << endl; // prevent the thing from being optimized away
}
I compile and run it as follows:
$ g++ -o gstest -O3 gstest.cpp
$ time ./gstest 8000
0
real 0m1.052s
user 0m1.050s
sys 0m0.010s
(It does 8000 instead of 3500 iterations per second because my "real" program does a lot of other stuff too. But it's representative.)
Update 2: I've been told that uninitialized values may not be representative, because NaN and Inf values may slow things down. The example code now clears the memory. It makes no difference in execution speed for me, though.
Couple of ideas:
Use SIMD. You could load 4 floats at a time from each array into a SIMD register (e.g. SSE on Intel, VMX on PowerPC). The disadvantage of this is that some of the d_x values will be "stale" so your convergence rate will suffer (but not as bad as a jacobi iteration); it's hard to say whether the speedup offsets it.
Use SOR. It's simple, doesn't add much computation, and can improve your convergence rate quite well, even for a relatively conservative relaxation value (say 1.5).
Use conjugate gradient. If this is for the projection step of a fluid simulation (i.e. enforcing incompressibility), you should be able to apply CG and get a much better convergence rate. A good preconditioner helps even more.
Use a specialized solver. If the linear system arises from the Poisson equation, you can do even better than conjugate gradient using FFT-based methods.
If you can explain more about what the system you're trying to solve looks like, I can probably give some more advice on #3 and #4.
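To illustrate idea #2 above, here is a hedged sketch of what the SOR variant of the question's step() might look like (omega is the relaxation factor; omega = 1.0 reduces to plain Gauss-Seidel):
void step_sor(float omega) {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            // plain Gauss-Seidel value for this cell
            float gs = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
            // blend with the old value; omega in (1, 2) over-relaxes
            d_x[ic] = (1.0f - omega) * d_x[ic] + omega * gs;
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}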
I think I've managed to optimize it. Here's the code; create a new project in VC++, add this code, and simply compile under "Release".
#include <iostream>
#include <cstdlib>
#include <cstring>
#define _WIN32_WINNT 0x0400
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <conio.h>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;
void step_original() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void step_new() {
//size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
float
*d_b_ic,
*d_w_ic,
*d_e_ic,
*d_x_ic,
*d_x_iw,
*d_x_ie,
*d_x_is,
*d_x_in,
*d_n_ic,
*d_s_ic;
d_b_ic = d_b;
d_w_ic = d_w;
d_e_ic = d_e;
d_x_ic = d_x;
d_x_iw = d_x;
d_x_ie = d_x;
d_x_is = d_x;
d_x_in = d_x;
d_n_ic = d_n;
d_s_ic = d_s;
for (size_t y = 1; y < d_ny - 1; ++y)
{
for (size_t x = 1; x < d_nx - 1; ++x)
{
/*d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];*/
*d_x_ic = *d_b_ic
- *d_w_ic * *d_x_iw - *d_e_ic * *d_x_ie
- *d_s_ic * *d_x_is - *d_n_ic * *d_x_in;
//++ic; ++iw; ++ie; ++is; ++in;
d_b_ic++;
d_w_ic++;
d_e_ic++;
d_x_ic++;
d_x_iw++;
d_x_ie++;
d_x_is++;
d_x_in++;
d_n_ic++;
d_s_ic++;
}
//ic += 2; iw += 2; ie += 2; is += 2; in += 2;
d_b_ic += 2;
d_w_ic += 2;
d_e_ic += 2;
d_x_ic += 2;
d_x_iw += 2;
d_x_ie += 2;
d_x_is += 2;
d_x_in += 2;
d_n_ic += 2;
d_s_ic += 2;
}
}
void solve_original(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step_original();
}
}
void solve_new(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step_new();
}
}
void clear(float *a) {
memset(a, 0, d_nx * d_ny * sizeof(float));
}
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d_b = new float[n]; clear(d_b);
d_w = new float[n]; clear(d_w);
d_e = new float[n]; clear(d_e);
d_s = new float[n]; clear(d_s);
d_n = new float[n]; clear(d_n);
if(argc < 3)
{
printf("app.exe (x)iters (o/n)algo\n");
return 1; // exit before touching argv[2]
}
bool bOriginalStep = (argv[2][0] == 'o');
size_t iters = atoi(argv[1]);
/*printf("Press any key to start!");
_getch();
printf(" Running speed test..\n");*/
__int64 freq, start, end, diff;
if(!::QueryPerformanceFrequency((LARGE_INTEGER*)&freq))
throw "Not supported!";
freq /= 1000000; // microseconds!
{
::QueryPerformanceCounter((LARGE_INTEGER*)&start);
if(bOriginalStep)
solve_original(iters);
else
solve_new(iters);
::QueryPerformanceCounter((LARGE_INTEGER*)&end);
diff = (end - start) / freq;
}
printf("Speed (%s)\t\t: %u\n", (bOriginalStep ? "original" : "new"), diff);
//_getch();
//cout << d_x[0] << endl; // prevent the thing from being optimized away
}
Run it like this:
app.exe 10000 o
app.exe 10000 n
"o" means old code, yours.
"n" is mine, the new one.
My results:
Speed (original):
1515028
1523171
1495988
Speed (new):
966012
984110
1006045
Improvement of about 30%.
The logic behind it: you've been using index counters to access and manipulate the arrays, while I use pointers.
While running, set a breakpoint at one of the calculation lines in VC++'s debugger and press F8. You'll get the disassembly window, where you can see the generated opcodes (assembly code).
Anyway, look:
int *x = ...;
x[3] = 123;
This tells the PC to put the pointer x in a register (say EAX), then add 3 * sizeof(int) to it, and only then store the value 123.
The pointer approach is much better, as you can understand, because we cut out the repeated address computation: we handle the stepping ourselves, and are thus able to optimize as needed.
I hope this helps.
Sidenote to stackoverflow.com's staff: great website, I wish I'd heard of it long ago!
For one thing, there seems to be a pipelining issue here. The loop reads from the value in d_x that has just been written to, but apparently it has to wait for that write to complete. Just rearranging the order of the computation, doing something useful while it's waiting, makes it almost twice as fast:
d_x[ic] = d_b[ic]
- d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in]
- d_w[ic] * d_x[iw] /* d_x[iw] has just been written to, process this last */;
It was Eamon Nerbonne who figured this out. Many upvotes to him! I would never have guessed.
Poni's answer looks like the right one to me.
I just want to point out that in this type of problem, you often gain benefits from memory locality. Right now, the b,w,e,s,n arrays are all at separate locations in memory. If you could not fit the problem in L3 cache (mostly in L2), then this would be bad, and a solution of this sort would be helpful:
size_t d_nx = 128, d_ny = 128;
float *d_x;
struct D { float b,w,e,s,n; };
D *d;
void step() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d[ic].b
- d[ic].w * d_x[iw] - d[ic].e * d_x[ie]
- d[ic].s * d_x[is] - d[ic].n * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void solve(size_t iters) { for (size_t i = 0; i < iters; ++i) step(); }
void clear(float *a) { memset(a, 0, d_nx * d_ny * sizeof(float)); }
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d = new D[n]; memset(d,0,n * sizeof(D));
solve(atoi(argv[1]));
cout << d_x[0] << endl; // prevent the thing from being optimized away
}
For example, this solution at 1280x1280 is a little less than 2x faster than Poni's solution (13s vs 23s in my test--your original implementation is then 22s), while at 128x128 it's 30% slower (7s vs. 10s--your original is 10s).
(Iterations were scaled up to 80000 for the base case, and 800 for the 100x larger case of 1280x1280.)
I think you're right about memory being a bottleneck. It's a pretty simple loop with just some simple arithmetic per iteration. The ic, iw, ie, is, and in indices seem to be on opposite sides of the matrix, so I'm guessing that there's a bunch of cache misses there.
I'm no expert on the subject, but I've seen that there are several academic papers on improving the cache usage of the Gauss-Seidel method.
Another possible optimization is the use of the red-black variant, where points are updated in two sweeps in a chessboard-like pattern. In this way, all updates in a sweep are independent and can be parallelized.
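Here is a minimal sketch of that red-black idea applied to the question's step() (my illustration; it assumes the square d_nx == d_ny grid from the question, with row stride d_nx):
void step_redblack() {
    for (int color = 0; color < 2; ++color) {
        // cells of one color only read cells of the other color,
        // so every update within this sweep is independent
        #pragma omp parallel for
        for (int y = 1; y < (int)d_ny - 1; ++y) {
            for (int x = 1 + ((y + color) & 1); x < (int)d_nx - 1; x += 2) {
                size_t ic = (size_t)y * d_nx + x;
                d_x[ic] = d_b[ic]
                    - d_w[ic] * d_x[ic - 1]    - d_e[ic] * d_x[ic + 1]
                    - d_s[ic] * d_x[ic - d_nx] - d_n[ic] * d_x[ic + d_nx];
            }
        }
    }
}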
I suggest putting in some prefetch statements and also researching "data oriented design":
void step_original() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
float db_ic, dw_ic, de_ic, ds_ic, dn_ic;
float dx_iw, dx_ie, dx_is, dx_in;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
// Load the operands into local temporaries up front.
// Grouping these loads by array may increase speed,
// though grouping by index name may increase speed too.
db_ic = d_b[ic];
dw_ic = d_w[ic];
dx_iw = d_x[iw];
de_ic = d_e[ic];
dx_ie = d_x[ie];
ds_ic = d_s[ic];
dx_is = d_x[is];
dn_ic = d_n[ic];
dx_in = d_x[in];
// Calculate
d_x[ic] = db_ic
- dw_ic * dx_iw - de_ic * dx_ie
- ds_ic * dx_is - dn_ic * dx_in;
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
This differs from your second method since the values are copied to local temporary variables before the calculation is performed.