In place real to complex FFT with cufft - c++

I am trying to perform an inplace real to complex FFT with cufft.
I am aware of the similar question How to perform a Real to Complex Transformation with cuFFT. However I have issues trying to reproduce the same method.
An out-of-place transform works without problems, but as soon as I do it in place, the FFT values are wrong (checked against Python, using binary files as the exchange format). No errors are reported; the values are simply incorrect.
Here is my code:
void fftCuda2d(mat3d* scene)
{
cufftResult resultStatus;
cudaError_t cuda_status;
cufftHandle plan_forward;
resultStatus = cufftPlan2d(&plan_forward, scene->_height, scene->_width, CUFFT_R2C);
cout << "Creating plan forward: " << _cudaGetErrorEnum(resultStatus) << endl;
cufftComplex *d_fft, *d_scene, *h_fft;
size_t size_fft = (int(scene->_width/2)+1)*scene->_height;
cudaMalloc((void**)&d_scene, sizeof(cufftComplex)*size_fft);
cudaMalloc((void**)&d_fft, sizeof(cufftComplex)*size_fft);
h_fft = (cufftComplex*) malloc(sizeof(cufftComplex)*size_fft);
cuda_status = cudaMemcpy(d_scene, scene->_pData, sizeof(cufftReal) * scene->_height * scene->_width, cudaMemcpyHostToDevice);
resultStatus = cufftExecR2C(plan_forward, (cufftReal*) d_scene, d_scene);
cuda_status = cudaMemcpy(h_fft, d_scene, sizeof(cufftReal)*scene->_height*scene->_width, cudaMemcpyDeviceToHost);
FILE *pFileTemp;
pFileTemp = fopen("temp.bin", "wb");
size_t check = fwrite(h_fft, sizeof(cufftComplex), size_fft, pFileTemp);
fclose(pFileTemp);
}
If I use resultStatus = cufftExecR2C(plan_forward, (cufftReal*) d_scene, d_fft); and save the output of d_fft I have the correct result.
Do you see any mistake of mine here?
P.S. mat3d is a struct where _width and _height hold the size of the matrix and _pData is the pointer to the data; there is no issue with that part.

(It seems like this should be a duplicate question but I was not able to locate the duplicate.)
Your input data needs to be organized differently (padded) when using an in-place transform. This is particularly noticeable in the 2D case, because each row of data must be padded.
In the non-inplace R2C transform, the input data is real-valued and of size height*width (for an example R=4, C=4 case):
X X X X
X X X X
X X X X
X X X X
The above data would occupy exactly 16*sizeof(cufftReal) (assuming float input data, dimension R = 4, C = 4), and it would be organized that way in memory, linearly, with no gaps. However, when we switch to an in-place transform, the size of the input buffer changes. And this change in size has ramifications for data arrangement. Specifically, the size of the input buffer is R*(C/2 + 1)*sizeof(cufftComplex). For the R=4, C=4 example case, that is 12*sizeof(cufftComplex) or 24*sizeof(cufftReal), but it is still organized as 4 rows of data. Each row, therefore, is of length 6 (if measured in cufftReal) or 3 (if measured in cufftComplex). Considering it as cufftReal, then when we create our input data, we must organize it like this:
X X X X P P
X X X X P P
X X X X P P
X X X X P P
where the P locations are "padding" data, not your input data. If we view this linearly in memory, it looks like:
X X X X P P X X X X P P X X X X P P X X X X P P
That is the expectation/requirement of CUFFT (and I believe it is the same for FFTW). However since you made no changes to the way you deposited your data, you provided data that looks like this:
X X X X X X X X X X X X X X X X P P P P P P P P
and the difference in those 2 patterns is what accounts for the difference in the result output. There are a variety of ways to fix this. I'll choose to demonstrate using cudaMemcpy2D to populate the device input buffer in the in-place case, which will give us the desired pattern. This may not be the best/fastest way, depending on your application needs.
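To make the padded pattern concrete, here is a minimal host-only sketch (no CUDA calls) that builds the row-padded layout described above. The helpers padded_row_length and pad_rows are hypothetical names introduced for illustration and are not part of cuFFT; the padding is zero-filled only for readability, since its contents are not part of the input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row length, measured in float (cufftReal-sized) elements, of an in-place
// R2C buffer: each row holds (cols/2 + 1) complex values = 2*(cols/2 + 1) floats.
inline std::size_t padded_row_length(std::size_t cols)
{
    return 2 * (cols / 2 + 1);
}

// Hypothetical helper (not a cuFFT function): copies a dense rows x cols real
// array into the row-padded layout. Padding elements are set to zero.
std::vector<float> pad_rows(const std::vector<float>& dense,
                            std::size_t rows, std::size_t cols)
{
    assert(dense.size() == rows * cols);
    const std::size_t stride = padded_row_length(cols);
    std::vector<float> padded(rows * stride, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            // Same row and column, but rows are now `stride` floats apart.
            padded[r * stride + c] = dense[r * cols + c];
        }
    }
    return padded;
}
```

For the R=4, C=4 case, pad_rows(dense, 4, 4) returns 24 floats arranged as X X X X P P per row. A buffer prepared this way on the host is already in the padded layout, so it could be uploaded with a single contiguous copy instead of a strided one.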
You were also not copying the correct size of the result data from device back to host.
Here is a fixed example:
$ cat t1589.cu
#include <cufft.h>
#include <iostream>
#include <cstdlib>
struct mat3d{
int _width;
int _height;
cufftReal *_pData;
};
void fftCuda2d(mat3d* scene)
{
cufftResult resultStatus;
cudaError_t cuda_status;
cufftHandle plan_forward;
resultStatus = cufftPlan2d(&plan_forward, scene->_height, scene->_width, CUFFT_R2C);
std::cout << "Creating plan forward: " << (int)resultStatus << std::endl;
cufftComplex *d_fft, *d_scene, *h_fft;
size_t size_fft = (int(scene->_width/2)+1)*scene->_height;
cudaMalloc((void**)&d_scene, sizeof(cufftComplex)*size_fft);
cudaMalloc((void**)&d_fft, sizeof(cufftComplex)*size_fft);
h_fft = (cufftComplex*) malloc(sizeof(cufftComplex)*size_fft);
#ifdef USE_IP
cuda_status = cudaMemcpy2D(d_scene, ((scene->_width/2)+1)*sizeof(cufftComplex), scene->_pData, (scene->_width)*sizeof(cufftReal), sizeof(cufftReal) * scene->_width, scene->_height, cudaMemcpyHostToDevice);
resultStatus = cufftExecR2C(plan_forward, (cufftReal*) d_scene, d_scene);
cuda_status = cudaMemcpy(h_fft, d_scene, sizeof(cufftComplex)*size_fft, cudaMemcpyDeviceToHost);
#else
cuda_status = cudaMemcpy(d_scene, scene->_pData, sizeof(cufftReal) * scene->_height * scene->_width, cudaMemcpyHostToDevice);
resultStatus = cufftExecR2C(plan_forward, (cufftReal*) d_scene, d_fft);
cuda_status = cudaMemcpy(h_fft, d_fft, sizeof(cufftComplex)*size_fft, cudaMemcpyDeviceToHost);
#endif
std::cout << "exec: " << (int)resultStatus << std::endl;
for (int i = 0; i < size_fft; i++)
std::cout << h_fft[i].x << " " << h_fft[i].y << ",";
std::cout << std::endl;
}
const int dim = 4;
int main(){
mat3d myScene;
myScene._pData = new cufftReal[dim*dim];
myScene._width = dim;
myScene._height = dim;
for (int i = 0; i < dim*dim; i++) myScene._pData[i] = rand()/(float)RAND_MAX;
fftCuda2d(&myScene);
std::cout << cudaGetErrorString(cudaGetLastError()) << std::endl;
}
$ nvcc -lineinfo -o t1589 t1589.cu -lcufft
t1589.cu(15): warning: variable "cuda_status" was set but never used
$ ./t1589
Creating plan forward: 0
exec: 0
9.71338 0,-0.153554 1.45243,0.171302 0,0.878097 0.533959,0.424595 -0.834714,0.858133 -0.393671,-0.205139 0,-0.131513 -0.494514,-0.165712 0,0.878097 -0.533959,0.0888268 1.49303,0.858133 0.393671,
no error
$ nvcc -lineinfo -o t1589 t1589.cu -lcufft -DUSE_IP
t1589.cu(15): warning: variable "cuda_status" was set but never used
$ ./t1589
Creating plan forward: 0
exec: 0
9.71338 0,-0.153554 1.45243,0.171302 0,0.878097 0.533959,0.424595 -0.834714,0.858133 -0.393671,-0.205139 0,-0.131513 -0.494514,-0.165712 0,0.878097 -0.533959,0.0888268 1.49303,0.858133 0.393671,
no error
$

Replace pointer arithmetic with std::span

I have the following code that uses pointer arithmetic and would like to replace it using std::span (or, I suppose, gsl::span). The code iterates over a number of pixels, each represented by 4 contiguous bytes, and updates their blue and green colours.
auto* row = (uint8_t*)buffer->data;
for (auto y = 0; y < buffer->height; ++y) {
auto* pixel = (uint32_t*)row;
for (auto x = 0; x < buffer->width; ++x) {
auto blue = x + blueOffset;
auto green = y + greenOffset;
*pixel++ = ((green << 8) | blue);
}
row += buffer->pitch;
}
buffer->data is a void* returned from a call to Windows VirtualAlloc(...) function.
How can this code be written to use safe, modern C++, such as std::span, as suggested by the C++ Core Guidelines and Modern C++?
This compiles with each of C++20, gsl and gsl-lite. As you didn't provide a reproducible example I didn't test or benchmark this solution.
auto const my_span = span{
reinterpret_cast<uint8_t *>(buffer->data),
buffer->height * buffer->pitch};
constexpr auto size_ratio = sizeof(uint32_t) / sizeof(uint8_t); // 4
auto const uint8_width = size_ratio * buffer->width;
auto row_offset = 0UL;
for (auto y = 0; y < buffer->height; ++y) {
// Take this row's bytes at the top of the loop, so that no subspan is
// requested past the end of the buffer after the last row.
auto const row = my_span.subspan(row_offset, uint8_width);
auto const pixels = span{
reinterpret_cast<uint32_t *>(row.data()),
buffer->width};
auto pixel = pixels.begin();
for (auto x = 0; x < buffer->width; ++x) {
auto const blue = x + blueOffset;
auto const green = y + greenOffset;
*pixel++ = ((green << 8) | blue);
}
row_offset += buffer->pitch;
}
Obviously using spans here doesn't necessarily make the code easier to read due to interpreting the same memory both as uint8_t and uint32_t. The spans should still give more safety.
The code could be made easier to read by providing more members or member functions in the struct which is pointed at by buffer. E.g. member functions could provide you with the needed subspans or at least with uint8_width. As this wasn't asked, I didn't touch that struct here. It might be given by a library, but one could still write a wrapper in that case.

Using vertices as indices in C++ graphs: why are we wasting space?

I have a question about the vertices of graphs in C++. Suppose I want a graph with vertices 100, 200, 300 and 400. How they are connected is not important, but when we create an adjacency-list graph, what we do is:
adj[u].push_back(v);
adj[v].push_back(u);
If 400 is connected to 200, we index adj[400] and so create a large array of vectors, when all we needed was an array of size 4, since there are only four vertices. Can someone explain this? Is it assumed in graphs that all vertices are consecutive and start from some small number? The code works fine when the vertices are 1, 2, 3, 4, 5. But because we use the vertices as indices, the storage can be far larger than what we actually need, depending on the vertex values.
An adjacency list stores a list of the connected vertices for each vertex in the graph. For example, given this graph:
1---2
|\ |
| \ |
| \|
3---4
You would store:
1: 2, 3, 4
2: 1, 4
3: 1, 4
4: 1, 2, 3
This can be done with a std::vector<std::vector<int>>. Note that you do not need to use the values of the graph as the indexes into these vectors. If the values of the graph were instead 100, 200, 300, 400 you could use a separate map container to convert from vertex value to an index into the adjacency list (std::unordered_map<ValueType, IndexType>). You could also store a Vertex structure such as this:
struct Vertex {
int index; // 0, 1, 2, 3, 4, 5, etc.
int value; // 100, 200, or whatever value you want
};
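The mapping idea above can be sketched as follows. This is only an illustration under assumptions: LabeledGraph and its member names are hypothetical, vertex labels are taken to be int, and the graph is undirected as in the question's push_back pair.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical wrapper: maps arbitrary vertex labels (100, 200, ...) to
// compact indices 0, 1, 2, ... so the adjacency list has one slot per vertex.
class LabeledGraph {
public:
    // Returns the compact index for `label`, creating the vertex if it is new.
    std::size_t index_of(int label)
    {
        auto it = index_.find(label);
        if (it != index_.end()) return it->second;
        const std::size_t idx = adj_.size();
        index_.emplace(label, idx);
        labels_.push_back(label);
        adj_.emplace_back();
        return idx;
    }

    // Adds an undirected edge between the vertices labelled u and v.
    void add_edge(int u, int v)
    {
        const std::size_t iu = index_of(u);
        const std::size_t iv = index_of(v);
        adj_[iu].push_back(iv);
        adj_[iv].push_back(iu);
    }

    std::size_t vertex_count() const { return adj_.size(); }

    // Neighbours of `label`, as compact indices (throws if the label is unknown).
    const std::vector<std::size_t>& neighbours(int label) const
    {
        return adj_[index_.at(label)];
    }

    // Converts a compact index back to the original label.
    int label_of(std::size_t idx) const
    {
        assert(idx < labels_.size());
        return labels_[idx];
    }

private:
    std::unordered_map<int, std::size_t> index_;   // label -> compact index
    std::vector<int> labels_;                      // compact index -> label
    std::vector<std::vector<std::size_t>> adj_;    // adjacency list by index
};
```

With edges 400-200, 100-300 and 100-400, the adjacency list holds exactly four inner vectors, regardless of how large the labels are.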
I'm not sure exactly what the problem is, but I guess it is about speed. The simplest fix is to use a "memory layout" like the one in a pixel buffer, where the index is an implicit value defined by position, since every segment has the same size:
-------------------------------------------------------------------...
| float, float, float, float | float, float, float, float | float,
-------------------------------------------------------------------...
| index 0 | index 1 | index 2
-------------------------------------------------------------------...
As you didn't give sample code, the example assumes a lot of things, but it basically implements the layout idea. Using raw arrays is not required, it is just my preference; a vector would cost almost no performance, the biggest penalty being resizing. Some of the choices are not intuitive, such as why an index computation into one array is faster than an array inside an array; it just is, because memory is slower than the CPU.
A small note: because all the "small arrays" are really one big array, you need to watch out for overflows and underflows, or add a check. If some vertex groups are smaller than the chunk size, just waste the space; in most cases the time needed to compact and uncompact the data costs more than the padding.
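#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
template <typename VAL>
struct Ver_Map {
VAL * base_ptr;
uint32_t map_size;
uint32_t vertex_len;
void alloc_map(uint32_t elem, uint32_t ver_len, VAL in){
// Allocate one flat block, then fill every element with `in`
// (a braced initializer would only set the first element).
base_ptr = new VAL[elem * ver_len];
std::fill_n(base_ptr, elem * ver_len, in);
vertex_len = ver_len;
map_size = elem;
}
void free_map(){
delete[] base_ptr; // memory obtained with new[] must be released with delete[]
}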
VAL * operator()(uint32_t object){
return &base_ptr[(object * vertex_len)];
}
VAL & operator()(uint32_t object, uint32_t vertex){
return base_ptr[(object * vertex_len) + vertex];
}
};
int main (void) {
const uint32_t map_len = 10000;
Ver_Map<float> ver_map;
ver_map.alloc_map(map_len, 4, 0.0f);
// Use case
ver_map(0, 2) = 0.5f;
std::cout << ver_map(0)[1] << std::endl;
std::cout << ver_map(0)[2] << std::endl;
std::cout << ver_map(0, 2) << std::endl;
// Size in memory: every object stores vertex_len floats
std::cout << "Size of struct -> "
<< (map_len * ver_map.vertex_len * sizeof(float)) + sizeof(Ver_Map<float>)
<< " bytes" << std::endl;
// Time to fully overwrite the map
auto start = std::chrono::steady_clock::now();
for(uint32_t x=0; x < map_len; x++){
for(uint32_t y=0; y < ver_map.vertex_len; y++){
ver_map(x, y) = 1.0f;
}
}
std::cout << "Full write time -> "
<< (uint32_t)std::chrono::duration_cast<std::chrono::microseconds>
(std::chrono::steady_clock::now() - start).count()
<< " microseconds" << std::endl;
ver_map.free_map();
return 0;
}

Eigen: Obtain the kernel of a sparse matrix

Given a sparse matrix A and a vector b, I would like to obtain a solution x to the equation A * x = b as well as the kernel of A.
One possibility is to convert A to a dense representation.
#include <iostream>
#include <Eigen/Dense>
#include <Eigen/SparseQR>
int main()
{
// This is a toy problem. My actual matrix
// is of course bigger and sparser.
Eigen::SparseMatrix<double> A(2,2);
A.insert(0,0) = 1;
A.insert(0,1) = 2;
A.insert(1,0) = 4;
A.insert(1,1) = 8;
A.makeCompressed();
Eigen::Vector2d b;
b << 3, 12;
Eigen::SparseQR<Eigen::SparseMatrix<double>,
Eigen::COLAMDOrdering<int> > solver;
solver.compute(A);
std::cout << "Solution:\n" << solver.solve(b) << std::endl;
Eigen::Matrix2d A_dense(A);
std::cout << "Kernel:\n" << A_dense.fullPivLu().kernel() << std::endl;
return 0;
}
Is it possible to do the same directly in the sparse representation? I could not find a function kernel() anywhere except in FullPivLu.
I think @chtz's answer is almost correct, except we need to take the last A.cols() - qr.rank() columns. Here is a mathematical derivation.
Say we do a QR decomposition of your matrix Aᵀ as
Aᵀ * P = [Q₁ Q₂] * [R; 0] = Q₁ * R
where P is the permutation matrix, thus
Aᵀ = Q₁ * R * P⁻¹.
We can see that Range(Aᵀ) = Range(Q₁ * R * P⁻¹) = Range(Q₁) (because both P and R are invertible).
Since Aᵀ and Q₁ have the same range space, this implies that A and Q₁ᵀ will also have the same null space, namely Null(A) = Null(Q₁ᵀ). (Here we use the property that Range(M) and Null(Mᵀ) are complements to each other for any matrix M, hence Null(A) = complement(Range(Aᵀ)) = complement(Range(Q₁)) = Null(Q₁ᵀ)).
On the other hand, since the matrix [Q₁ Q₂] is orthonormal, Null(Q₁ᵀ) = Range(Q₂), thus Null(A) = Range(Q₂), i.e., the columns of Q₂ form a basis of kernel(A).
Since Q₂ is the right A.cols() - qr.rank() columns, you could call rightCols(A.cols() - qr.rank()) to retrieve the kernel of A.
For more information on the kernel (null space), you could refer to https://en.wikipedia.org/wiki/Kernel_(linear_algebra)
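As a sanity check of the derivation, here is the toy matrix from the question worked by hand. This is a hand calculation, not Eigen output, and Q₁ and Q₂ are only determined up to sign, so a solver may return the negated vectors.

```latex
A = \begin{pmatrix} 1 & 2 \\ 4 & 8 \end{pmatrix}, \qquad
A^{\mathsf T} = \begin{pmatrix} 1 & 4 \\ 2 & 8 \end{pmatrix}, \qquad
\operatorname{rank}(A) = 1

\operatorname{Range}(A^{\mathsf T})
  = \operatorname{span}\left\{ \begin{pmatrix} 1 \\ 2 \end{pmatrix} \right\}
\;\Longrightarrow\;
Q_1 = \frac{1}{\sqrt{5}} \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad
Q_2 = \frac{1}{\sqrt{5}} \begin{pmatrix} 2 \\ -1 \end{pmatrix}

A \, Q_2 = \frac{1}{\sqrt{5}}
  \begin{pmatrix} 1 \cdot 2 + 2 \cdot (-1) \\ 4 \cdot 2 + 8 \cdot (-1) \end{pmatrix}
  = \begin{pmatrix} 0 \\ 0 \end{pmatrix}
```

Here A.cols() - rank = 2 - 1 = 1, so the kernel is spanned by the single right-most column Q₂, in agreement with the rule above.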

Sparse random projection (aka Johnson Lindenstrauss transform) not preserving distance between 2 points

I'm trying to use the method of random projection (basically, reduce dimensions while preserving Euclidean distance between 2 points), and recently I found some code online (mex file for matlab):
/*
* sjlt.c - Sparse Johnson-Lindenstrauss Transform
*
* Creates a random sparse Johnson-Lindenstrauss projection matrix.
* The columns are independent and each column has exactly s non-zero
* entries. All non-zero entries are independent Rademacher random
* variables. Details can be found in [1].
*
* The calling syntax is:
*
* projection = sjlt(rows, columns, sparsity)
*
* This is a MEX file for MATLAB.
*
* Depending on your compiler, you can compile the function using
* one of the following calls:
* $ mex CXXFLAGS='$CXXFLAGS -std=c++0x' COPTIMFLAGS='-O3 -DNDEBUG' -largeArrayDims sjlt.cpp
* or
* $ mex CXXFLAGS='$CXXFLAGS -std=c++11' COPTIMFLAGS='-O3 -DNDEBUG' -largeArrayDims sjlt.cpp
*
* Author: Tobias Pohlen <tobias.pohlen#rwth-aachen.de>
*
* References:
* [1] Jean Bourgain, Sjoerd Dirksen, and Jelani Nelson. "Toward a Unified
* Theory of Sparse Dimensionality Reduction in Euclidean Space",
* Symposium on Theory of Computing, 2015.
*/
#include "mex.h"
#include <random>
std::random_device rd;
std::mt19937 g(rd());
// We use this in order to generate rademacher random variables
std::uniform_int_distribution<int> rademacherDist(0, 1);
inline int rademacher()
{
return 2*rademacherDist(g) - 1;
}
/* Tries to extract an integer from arg */
mwSize getIntegerScalar(const mxArray* arg)
{
if (mxGetNumberOfElements(arg) == 1)
{
return mxGetScalar(arg);
}
else
{
mexErrMsgTxt("Integer scalar is not of size == [1 1].\n");
}
}
/* Returns an integer from arg or 0 if the integer is negative */
mwSize getNonNegativeIntegerScalar(const mxArray* arg)
{
int res = getIntegerScalar(arg);
if (res < 0)
{
return 0;
}
else
{
return res;
}
}
/* Shuffles the array randomly */
void shuffle(
mwSize* array,
mwSize size,
std::uniform_int_distribution<mwSize> & indexDistribution)
{
for (mwSize i = 0; i < size; i++)
{
std::swap(array[i], array[indexDistribution(g)]);
}
}
/* Creates a sparse Johnson Lindenstrauss Transform of size numRows x numCols
* of sparsity.
*/
void createSJLT(
mwSize sparsity,
mwSize numRows,
mwSize numCols,
double *entries,
mwSize* rowIndices,
mwSize* colIndices)
{
// Create an array of row indices to shuffle. We use this in order
// to draw random rows without replacement
std::uniform_int_distribution<mwSize> rowDist(0, numRows-1);
mwSize* rowCache = (mwSize*) malloc(numRows*sizeof(mwSize));
for (mwSize i = 0; i < numRows; i++)
{
rowCache[i] = i;
}
// Fill the column indices and the entries (remember that the entries are
// just independent rademacher random variables)
mwSize colOffset = 0;
for (mwSize c = 0; c < numCols; c++)
{
// Shuffle the row indices
shuffle(rowCache, sparsity, rowDist);
for (mwSize s = 0; s < sparsity; s++)
{
entries[colOffset+s] = rademacher();
rowIndices[colOffset+s] = rowCache[s];
}
colIndices[c] = c*sparsity;
colOffset += sparsity;
}
colIndices[numCols] = numCols*sparsity;
free(rowCache);
}
/*
* This is the function called by MATLAB.
*/
void mexFunction(
int numLeftHandSide,
mxArray *pointerLeftHandSide[],
int numRightHandSide,
const mxArray *pointerRightHandSide[])
{
// Inputs:
// 1. number of rows
// 2. number of columns
// 3. sparsity (number of non-zeros per column)
if(numRightHandSide != 3)
{
mexErrMsgIdAndTxt(
"arrayProduct:numRightHandSide",
"Three inputs required.");
}
// Outputs:
// 1. SJLT matrix
if (numLeftHandSide != 1)
{
mexErrMsgIdAndTxt(
"arrayProduct:numLeftHandSide",
"One output required.");
}
// Read the inputs
int numRows = getNonNegativeIntegerScalar(pointerRightHandSide[0]);
int numCols = getNonNegativeIntegerScalar(pointerRightHandSide[1]);
int sparsity = getNonNegativeIntegerScalar(pointerRightHandSide[2]);
// The sparsity cannot be higher than the number of rows
if (sparsity > numRows)
{
sparsity = numRows;
}
// Create the outputs
pointerLeftHandSide[0] = mxCreateSparse(numRows,numCols,numCols*sparsity,mxREAL);
// Create the transformation
createSJLT(
sparsity,
numRows,
numCols,
mxGetPr(pointerLeftHandSide[0]),
mxGetIr(pointerLeftHandSide[0]),
mxGetJc(pointerLeftHandSide[0]));
}
The equations on which this method is based can be found here: http://web.stanford.edu/~hastie/Papers/Ping/KDD06_rp.pdf
I understand the variable "s" to be the number of non-zero entries in each column. Anyway, I have written a matlab script to test if this piece of code does indeed preserve distance between 2 points:
>> mex CXXFLAGS='$CXXFLAGS -std=c++0x' COPTIMFLAGS='-O3 -DNDEBUG' -largeArrayDims sjlt.cpp
Building with 'g++'.
Warning: You are using gcc version '5.4.0'. The version of gcc is not supported. The version currently
supported with MEX is '4.7.x'. For a list of currently supported compilers see:
http://www.mathworks.com/support/compilers/current_release.
MEX completed successfully.
>> rng('default');
>> rng(1);
>> nObservations = 100;
>> nFeatures = 10000;
>> X = randn(nObservations, nFeatures);
>> X1 = X(1,:);
>> X2 = X(2,:);
>> dist = sqrt(sum((X1 - X2) .^ 2));
>> dist
dist =
142.1365
>> nFeatures_new = 3947; % This number was taken from: http://scikit-learn.org/stable/modules/random_projection.html
>> sparsity = 1;
>> R = sjlt(nFeatures, nFeatures_new,sparsity);
>> Y = X*R;
>> Y = (sqrt(sparsity) / sqrt(nFeatures_new)) * Y;
>> Y1 = Y(1,:);
>> Y2 = Y(2,:);
>> dist_transformed = sqrt(sum((Y1 - Y2) .^ 2));
>> dist_transformed
dist_transformed =
1.4397
Strangely the distance was not preserved! There must be something wrong, either with the code, or with the way I compiled the .cpp file, since there was a warning (I'm using ubuntu 16.04, 64 bit version). Can anyone help me? Thank you in advance!
My code did not preserve the Euclidean distance because I misunderstood the variable "s" to be the number of non-zero entries in each column. It is actually related to that number by 1/s = sparsity / D, where sparsity is the number of non-zero entries per column.
Here is the working code:
rng('default');
rng(1);
n = 100; D = 10000; k = 3947;
s = round(log(D)) + 1;
sparsity = D / s;
X = randn(n,D);
X1 = X(1,:);
X2 = X(2,:);
dist = sqrt(sum((X1 - X2) .^ 2));
Y = X * sjlt(D,k,sparsity);
Y = Y .* (sqrt(s) / sqrt(k));
Y1 = Y(1,:); Y2 = Y(2,:);
dist_transformed = sqrt(sum((Y1 - Y2) .^ 2));
dist
dist_transformed
Note that the first 2 lines do not guarantee repeatable results, since there is also randomization within the mex file; the value of "dist_transformed" will therefore differ on every run (but "dist" stays unchanged).
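The scaling used in the working code can also be checked outside MATLAB. The following is a minimal C++ sketch under stated assumptions, not a port of the MEX file: sparse_project and euclidean_norm are hypothetical names, the distinct rows of each column are drawn with a partial Fisher-Yates shuffle (my substitution for the MEX file's shuffle), and the projection is applied to a single vector, which by linearity can be the difference X1 - X2.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Projects x (length D) to k dimensions with a sparse +-1 matrix that has
// `nonzeros` entries per column, then rescales by sqrt(s / k), s = D / nonzeros.
std::vector<double> sparse_project(const std::vector<double>& x, std::size_t k,
                                   std::size_t nonzeros, std::mt19937& gen)
{
    const std::size_t D = x.size();
    assert(nonzeros >= 1 && nonzeros <= D);
    const double s = static_cast<double>(D) / static_cast<double>(nonzeros);
    const double scale = std::sqrt(s / static_cast<double>(k));

    std::vector<std::size_t> rows(D);
    std::iota(rows.begin(), rows.end(), std::size_t{0});

    std::vector<double> y(k, 0.0);
    for (std::size_t col = 0; col < k; ++col) {
        double acc = 0.0;
        // Partial Fisher-Yates shuffle: draws `nonzeros` distinct row indices.
        for (std::size_t i = 0; i < nonzeros; ++i) {
            const std::size_t j = i + gen() % (D - i);
            std::swap(rows[i], rows[j]);
            const double sign = (gen() & 1u) ? 1.0 : -1.0; // Rademacher entry
            acc += sign * x[rows[i]];
        }
        y[col] = scale * acc;
    }
    return y;
}

// Euclidean (L2) norm of a vector.
double euclidean_norm(const std::vector<double>& v)
{
    return std::sqrt(std::inner_product(v.begin(), v.end(), v.begin(), 0.0));
}
```

With the sqrt(s / k) factor, the ratio euclidean_norm(y) / euclidean_norm(x) should come out close to 1 for reasonably large k; with the sqrt(sparsity) / sqrt(k) factor from the question, the projected norm would be off by a constant factor instead.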

SSE: reinterpret_cast<__m128*> instead of _mm_load_ps

I am in the process of coding up a simple convolution function in C++, starting from the very basic "sliding-window" convolution with regular products (no FFT stuff for now), up to SSE, AVX and possibly OpenCL. I ran into a problem with SSE though. My code looks like this:
for (x = 0; x < SIZEX - KSIZEX + 1; ++x)
{
for (y = 0; y < SIZEY - KSIZEY + 1; ++y)
{
tmp = 0.0f;
float fDPtmp = 0.0f;
float *Kp = &K[0];
for (xi = 0; xi < KSIZEX; ++xi, Kp=Kp+4)
{
float *Cp = &C[(x+xi)*SIZEY + y];
__m128 *KpSSE = reinterpret_cast<__m128*>(&K);
__m128 *CpSSE = reinterpret_cast<__m128*>(&C[(x + xi)*SIZEY + y]);
__m128 DPtmp = _mm_dp_ps(*KpSSE, *CpSSE, 0xFF);
_mm_store_ss(&fDPtmp, DPtmp);
tmp += fDPtmp;
}
R[k] = tmp;
++k;
}
}
The necessary matrices are initialized like this (their sizes are considered OK because the simpler implementations work just fine):
__declspec(align(16)) float *C = ReadMatrix("E:\\Code\\conv\\C.bin");
__declspec(align(16)) float *K = ReadMatrix("E:\\Code\\conv\\K.bin");
__declspec(align(16)) float *R = new float[CSIZEX*CSIZEY];
The code crashes at y=1 so I feel there might be a mistake with the way I handle the pointers. The interesting thing is that if I replace the reinterpret_casts with _mm_set_ps, i.e.
__m128 KpSSE = _mm_set_ps(Kp[0], Kp[1], Kp[2], Kp[3]);
__m128 CpSSE = _mm_set_ps(Cp[0], Cp[1], Cp[2], Cp[3]);
__m128 DPtmp = _mm_dp_ps(KpSSE, CpSSE, 0xFF);
_mm_store_ss(&fDPtmp, DPtmp);
the whole thing works just fine although slower, which I blame on all the copy operations.
Can anybody please point me to what exactly I am doing wrong here?
Thank you very much
Pat
Update: Ok, so as pointed out by Paul the problem lies with ReadMatrix (or another solution would be to use _mm_loadu_ps). As for ReadMatrix(), it looks like this:
__declspec(align(16)) float* ReadMatrix(string path)
{
streampos size;
ifstream file(path, ios::in | ios::binary | ios::ate);
if (file.is_open())
{
size = file.tellg();
__declspec(align(16)) float *C = new float[size];
file.seekg(0, ios::beg);
file.read(reinterpret_cast<char*>(&C[0]), size);
file.close();
return C;
}
else cout << "Unable to open file" << endl;
}
It does not do the trick. Is there any other way of doing this elegantly rather than being forced to read the file piece by piece and perform memcpy, which I assume should work?!
Update:
Still does not seem to want to work after
__declspec(align(16)) float* ReadMatrix(string path)
{
streampos size;
ifstream file(path, ios::in | ios::binary | ios::ate);
if (file.is_open())
{
size = file.tellg();
__declspec(align(16)) float *C = static_cast<__declspec(align(16)) float*>(_aligned_malloc(size * sizeof(*C), 16));
file.seekg(0, ios::beg);
file.read(reinterpret_cast<char*>(&C[0]), size);
file.close();
return C;
}
else cout << "Unable to open file" << endl;
}
I added the static_cast up there since it seemed necessary to get Paul's code to compile (i.e. _aligned_malloc returns a void pointer). I am getting close to just reading chunks of the file with fread and memcpy-ing them into an aligned array. :/ Yet again I find myself asking for advice. Thank you very much all.
Pat
PS: Non-SSE code works fine with these data structures. _mm_loadu_ps is slower than using the non-SSE code.
This doesn't do what you think it does:
__declspec(align(16)) float *C = ReadMatrix("E:\\Code\\conv\\C.bin");
All that the alignment directive achieves here is to align the pointer itself (i.e. C) to a 16 byte boundary, not the contents of the pointer.
You either need to fix ReadMatrix so that it returns suitably aligned data, or use _mm_loadu_ps, as others have already suggested.
Do not use _mm_set_ps as this will tend to generate a lot of instructions under the hood, unlike _mm_loadu_ps, which maps to a single instruction.
UPDATE
You have repeated the same mistake in ReadMatrix:
__declspec(align(16)) float *C = new float[size];
again this does not guarantee the alignment of the data, only of the pointer C itself. To fix this allocation you can use _mm_malloc or _aligned_malloc:
float *C = static_cast<float*>(_mm_malloc(size * sizeof(*C), 16));
or
float *C = static_cast<float*>(_aligned_malloc(size * sizeof(*C), 16));
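The difference between aligning the pointer variable and aligning the data it points to can be checked directly. Below is a minimal, portable sketch that avoids compiler-specific allocators: is_aligned and AlignedFloats are hypothetical names, and std::align over an over-allocated byte buffer is used here as a stand-in for _mm_malloc / _aligned_malloc, not as a replacement recommended by the answer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// True if the address stored in p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical owner of a float buffer whose DATA (not merely the pointer
// variable) starts on an `alignment`-byte boundary. `alignment` must be a
// power of two.
class AlignedFloats {
public:
    AlignedFloats(std::size_t count, std::size_t alignment)
        : storage_(count * sizeof(float) + alignment), data_(nullptr)
    {
        void* p = storage_.data();
        std::size_t space = storage_.size();
        // std::align moves p forward to the next suitably aligned address
        // inside the over-allocated byte buffer.
        void* aligned = std::align(alignment, count * sizeof(float), p, space);
        assert(aligned != nullptr);
        data_ = static_cast<float*>(aligned);
    }

    // Copying would leave data_ pointing into the other object's storage.
    AlignedFloats(const AlignedFloats&) = delete;
    AlignedFloats& operator=(const AlignedFloats&) = delete;

    float* data() { return data_; }

private:
    std::vector<unsigned char> storage_; // raw bytes, with slack for alignment
    float* data_;                        // aligned start inside storage_
};
```

A check such as is_aligned(C, 16) on the pointer returned by ReadMatrix would show immediately whether an aligned load is safe for that buffer.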
In ReadMatrix, you have no guarantee whatsoever that the new expression returns a properly aligned pointer. It doesn't matter that you assign to an aligned pointer (and I'm not even sure if your syntax means the pointer itself is aligned, or what it points to).
You need to use _mm_malloc, _aligned_malloc, or some other aligned allocation facility.
You can't use reinterpret_cast here, and I understand _mm_loadu_ps is slow. But there is another way. Unroll your loop, read in aligned data, and shift and mask in the new value before you perform operations on it. This will be fast and correct. That is, you can do something like this in your inner loop:
// Aligned loads return __m128; cast to __m128i for the byte-wise shifts
__m128i x = _mm_castps_si128(_mm_load_ps(p));
__m128i y = _mm_castps_si128(_mm_load_ps(p + sizeof(float)));
__m128i w; // current window; use _mm_castsi128_ps(w) for float operations
// do your operation on x 1st time this iteration here
// window at offset 1: x1 x2 x3 y0 (x itself is left unmodified)
w = _mm_or_si128(_mm_srli_si128(x, sizeof(float)),
_mm_slli_si128(y, sizeof(float) * 3));
// do your operation on w 2nd time this iteration here
// window at offset 2: x2 x3 y0 y1
w = _mm_or_si128(_mm_srli_si128(x, sizeof(float) * 2),
_mm_slli_si128(y, sizeof(float) * 2));
// do your operation on w 3rd time this iteration here
// window at offset 3: x3 y0 y1 y2
w = _mm_or_si128(_mm_srli_si128(x, sizeof(float) * 3),
_mm_slli_si128(y, sizeof(float)));
// do your operation on w 4th time this iteration here
x = y; // don't need to read in x next iteration, only y
loopCounter += 4 * sizeof(float);