Reorganizing a vector in c++ - c++

I'd like to preface this question by saying that I am very inexperienced when it comes to coding, so the solution to this problem could be much easier than what I have been trying. I have a vector 'phas' defined as vector<float> phase; that has 7987200 elements, and I want to rearrange it into 133120 vectors of 60 elements (called line2, defined as vector<long double> line2;). Each vector of 60 should then be placed one after the other in a vector of vectors 'RFlines2', defined as vector< vector<long double> > RFlines2; and RFlines2.resize(7987200);. I want to fill each of the 60-element vectors with elements of 'phas' separated by 128. For example, the first vector of 60 elements would be filled with phas[0], phas[128], phas[256], ... phas[7680]. The second vector of 60 would then be filled with phas[1], phas[129], phas[257], ... phas[7681], and so on. My current code is as follows:
for(int x = 0; x<133120; x++){
if((x == 128 || x == 7680+128 || x == (7680*a)+128)){
x = 7680*a;
a = a + 1;
}
int j = x;
for(int i = 0; i<60;i++){
line2.pushback(i);
line2[i] = phas[j];
j = j + 128;
}
cout<<"This is x: "<<x<<endl;
RFlines2[x] = line2;
line2.clear();
}
However, after 128 iterations of the outer loop (128 vectors of 60 have been created and 7680 elements from phas have been used), I need the x value to jump to 7680 to avoid putting elements of phas that have already been used into the next vector of 60: when x = 128, the first element of the next vector of 60 would be phas[128], which was already used as the 2nd element of the first vector of 60. Then, after another 128 x iterations, I need the x value to jump to 15,360, and so on. The code above is my latest attempt, but when I try to run FFTW on each vector of 60 in RFlines2 as follows:
int c = 0;
for(int x = 0; x < 133120; x++){
//cout<<x<<endl;
fftw_plan p2;
inter = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * W);
outter = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * W);
/* cast elements in line to type fftw_complex */
for (int i = 0; i <60; i++) {
//cout<<i<<endl;
//inter[i][0] = phas[i];
//inter[x][0] = zlines[x];
inter[i][0] = RFlines2[x][i];
inter[i][1] = 0;
}
p2 = fftw_plan_dft_1d(60, inter, outter, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(p2);
//inter[x][0].clear();
for(int u = 0; u<60;u++){
if(u == 0){
cout<<' '<<outter[0][0]<<' '<<c++<<endl;
}
}
fftw_free(inter);
fftw_free(outter);
fftw_destroy_plan((p2));
}
The program crashes after displaying outter[0][0] 128 times. Any ideas how to fix this? Also, let me know if anything that I said doesn't make sense and I'll try to clarify. Thanks in advance!
-Mike

I don't know why your code crashes, because I can't see the whole code here. But I'm going to suggest a way to scatter your data and manage your vectors.
(There is an important caveat though: you should not be using vectors (at least not vectors of vectors) for this task; you are better off using 1D vectors and managing the 2D indexing yourself. But this is a performance thing, and does not impact correctness.)
This is how I suggest you fill your RFLines2: (I have not tried this code, so it may not work.)
// first, build the memory for RFLines2...
vector<vector<long double>> RFLines2 (133120, vector<long double>(60));
// assuming a "phase" vector...
for (unsigned i = 0; i < 7987200; ++i)
{
    // each block of 128 * 60 input elements fills 128 consecutive rows
    unsigned const row = (i / (128 * 60)) * 128 + (i % (128 * 60)) % 128;
    unsigned const col = (i % (128 * 60)) / 128;
    RFLines2[row][col] = phase[i];
}
You won't need the line2 intermediate this way.
The rest of the code "should" work. (BTW, I don't understand the inner for loop on u at all. What were you trying to do there?)
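To check the index arithmetic above without allocating all 7,987,200 elements, here is a minimal self-contained sketch. The function name scatter_lines and its parameters stride (128 in the question) and line_len (60 in the question) are illustrative assumptions, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: splits `phase` into lines of `line_len` samples taken
// `stride` elements apart. Each block of stride * line_len input elements
// yields `stride` output lines.
std::vector<std::vector<long double>> scatter_lines(
    const std::vector<float>& phase, std::size_t stride, std::size_t line_len)
{
    const std::size_t block = stride * line_len;
    assert(block > 0 && phase.size() % block == 0);

    std::vector<std::vector<long double>> lines(
        phase.size() / line_len, std::vector<long double>(line_len));

    for (std::size_t i = 0; i < phase.size(); ++i) {
        // same row/col formula as above, with 128 and 60 as parameters
        const std::size_t row = (i / block) * stride + (i % block) % stride;
        const std::size_t col = (i % block) / stride;
        lines[row][col] = phase[i];
    }
    return lines;
}
```

With stride 128 and line_len 60, line 0 should hold phase[0], phase[128], ..., phase[59 * 128], and line 128 should start at phase[7680], which is the jump the question asks for.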

Related

C++ performance optimization for linear combination of large matrices?

I have a large tensor of floating point data with the dimensions 35k(rows) x 45(cols) x 150(slices) which I have stored in an armadillo cube container. I need to linearly combine all the 150 slices together in under 35 ms (a must for my application). The linear combination floating point weights are also stored in an armadillo container. My fastest implementation so far takes 70 ms, averaged over a window of 30 frames, and I don't seem to be able to beat that. Please note I'm allowed CPU parallel computations but not GPU.
I have tried multiple different ways of performing this linear combination but the following code seems to be the fastest I can get (70 ms) as I believe I'm maximizing the cache hit chances by fetching the largest possible contiguous memory chunk at each iteration.
Please note that Armadillo stores data in column major format. So in a tensor, it first stores the columns of the first channel, then the columns of the second channel, then third and so forth.
typedef std::chrono::system_clock Timer;
typedef std::chrono::duration<double> Duration;
int rows = 35000;
int cols = 45;
int slices = 150;
arma::fcube tensor(rows, cols, slices, arma::fill::randu);
arma::fvec w(slices, arma::fill::randu);
double overallTime = 0;
int window = 30;
for (int n = 0; n < window; n++) {
Timer::time_point start = Timer::now();
arma::fmat result(rows, cols, arma::fill::zeros);
for (int i = 0; i < slices; i++)
result += tensor.slice(i) * w(i);
Timer::time_point end = Timer::now();
Duration span = end - start;
double t = span.count();
overallTime += t;
cout << "n = " << n << " --> t = " << t * 1000.0 << " ms" << endl;
}
cout << endl << "average time = " << overallTime * 1000.0 / window << " ms" << endl;
I need to optimize this code by at least 2x and I would very much appreciate any suggestions.
First of all, I should admit that I'm not familiar with the Armadillo framework or its memory layout, nor do I know whether the expression result += slice(i) * weight is evaluated lazily.
In any case, the primary problem and its solution lie in the memory layout and the memory-to-arithmetic ratio.
The statement a += b*c is problematic because it reads b and a, writes a, and performs at most two arithmetic operations (two if the architecture does not fuse the multiplication and the accumulation).
If the memory layout is of form float tensor[rows][columns][channels], the problem is converted to making rows * columns dot products of length channels and should be expressed as such.
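If it's float tensor[c][h][w], it's better to unroll the loop to result += slice(i) * w(i) + slice(i+1) * w(i+1) + .... Reading four slices at a time reduces the memory transfers by 50%, because result is read and written once per four slices instead of once per slice.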
It might even be better to process the results in chunks of 4*N results (reading from all the 150 channels/slices) where N<16, so that the accumulators can be allocated explicitly or implicitly by the compiler to SIMD registers.
There's a possibility of a minor improvement by padding the slice count to multiples of 4 or 8, by compiling with -ffast-math to enable fused multiply accumulate (if available) and with multithreading.
The constraints indicate the need to perform 13.5GFlops, which is a reasonable number in terms of arithmetic (for many modern architectures) but also it means at least 54 Gb/s memory bandwidth, which could be relaxed with fp16 or 16-bit fixed point arithmetic.
EDIT
Knowing the memory order to be float tensor[150][45][35000] or float tensor[kSlices][kRows * kCols == kCols * kRows] suggests to me to try first unrolling the outer loop by 4 (or maybe even 5, as 150 is not divisible by 4 requiring special case for the excess) streams.
void blend(int kCols, int kRows, float const *tensor, float *result, float const *w) {
// ensure that the cols*rows is a multiple of 4 (pad if necessary)
// - allows the auto vectorizer to skip handling the 'excess' code where the data
// length mod simd width != 0
// one could try even SIMD width of 16*4, as clang 14
// can further unroll the inner loop to 4 ymm registers
auto const stride = (kCols * kRows + 3) & ~3;
// try also s+=6, s+=3, or s+=4, which would require a dedicated inner loop (for s+=2)
for (int s = 0; s < 150; s+=5) {
auto src0 = tensor + s * stride;
auto src1 = src0 + stride;
auto src2 = src1 + stride;
auto src3 = src2 + stride;
auto src4 = src3 + stride;
auto dst = result;
for (int x = 0; x < stride; x++) {
// clang should be able to optimize caching the weights
// to registers outside the innerloop
auto add = src0[x] * w[s] +
src1[x] * w[s+1] +
src2[x] * w[s+2] +
src3[x] * w[s+3] +
src4[x] * w[s+4];
// clang should be able to optimize this comparison
// out of the loop, generating two inner kernels
if (s == 0) {
dst[x] = add;
} else {
dst[x] += add;
}
}
}
}
EDIT 2
Another starting point (before adding multithreading) would be to consider changing the layout to
float tensor[kCols][kRows][kSlices + kPadding]; // padding is optional
The downside now is that with kSlices = 150 the weights no longer all fit in registers (and kSlices is not a multiple of 4 or 8). Furthermore, the final reduction needs to be horizontal.
The upside is that reduction no longer needs to go through memory, which is a big thing with the added multithreading.
void blendHWC(float const *tensor, float const *w, float *dst, int n, int c) {
    // each thread will read from 4 positions in order
    // to share the weights -- finding the best distance
    // might need some iterations
    auto src0 = tensor;
    auto src1 = src0 + c;
    auto src2 = src1 + c;
    auto src3 = src2 + c;
    // assumes n is a multiple of 4 and c (slices plus padding) a multiple of 8
    for (int i = 0; i < n/4; i++) {
        vec8 acc0(0.0f), acc1(0.0f), acc2(0.0f), acc3(0.0f);
        // #pragma unroll?
        for (int j = 0; j < c; j += 8) {
            vec8 wj(w + j);  // named wj to avoid shadowing the parameter w
            acc0 += wj * vec8(src0 + j);
            acc1 += wj * vec8(src1 + j);
            acc2 += wj * vec8(src2 + j);
            acc3 += wj * vec8(src3 + j);
        }
        vec4 sum = horizontal_reduct(acc0,acc1,acc2,acc3);
        sum.store(dst); dst+=4;
        // advance to the next group of 4 positions
        src0 += 4 * c; src1 += 4 * c; src2 += 4 * c; src3 += 4 * c;
    }
}
These vec4 and vec8 are custom SIMD classes, which map to SIMD instructions either through intrinsics or by relying on the compiler to turn a declaration such as using vec4 = float __attribute__((vector_size(16))); into efficient SIMD code.
As @hbrerkere suggested in the comment section, compiling with the -O3 flag and making the following changes brought the run time down to roughly 65% of the original: the code now runs in 45 ms as opposed to the initial 70 ms.
int lastStep = (slices / 4 - 1) * 4;
int i = 0;
while (i <= lastStep) {
result += tensor.slice(i) * w_id(i) + tensor.slice(i + 1) * w_id(i + 1) + tensor.slice(i + 2) * w_id(i + 2) + tensor.slice(i + 3) * w_id(i + 3);
i += 4;
}
while (i < slices) {
result += tensor.slice(i) * w_id(i);
i++;
}
Without having the actual code, I'm guessing that
+= tensor.slice(i) * w_id(i)
creates a temporary object and then adds it to the lhs. Yes, overloaded operators look nice, but I would write a function
addto( lhs, slice1, w1, slice2, w2, ....unroll to 4... )
which translates to pure loops over the elements:
for (i=....)
    for (j=...)
        lhs[i][j] += slice1[i][j]*w1 + slice2[i][j]*w2 + ... // and so on for the remaining slices
It would surprise me if that doesn't buy you an extra factor.
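As a rough illustration of the "pure loops" idea, here is a minimal sketch that works on plain contiguous buffers rather than Armadillo types. The function name weighted_sum and the assumed layout (each slice stored contiguously, slice after slice) are assumptions made for this example; it has not been benchmarked against the 35 ms target.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: result[k] = sum over s of w[s] * tensor[s * n + k],
// where n is the number of elements per slice (rows * cols).
// Four slices are combined per pass, so `result` is read and written once
// per four slices instead of once per slice.
std::vector<float> weighted_sum(const std::vector<float>& tensor,
                                const std::vector<float>& w, std::size_t n)
{
    const std::size_t slices = w.size();
    assert(tensor.size() == slices * n);

    std::vector<float> result(n, 0.0f);
    std::size_t s = 0;
    for (; s + 4 <= slices; s += 4) {
        const float* a = tensor.data() + s * n;
        const float* b = a + n;
        const float* c = b + n;
        const float* d = c + n;
        for (std::size_t k = 0; k < n; ++k)
            result[k] += a[k] * w[s] + b[k] * w[s + 1] +
                         c[k] * w[s + 2] + d[k] * w[s + 3];
    }
    for (; s < slices; ++s) {  // leftover slices when slices % 4 != 0
        const float* a = tensor.data() + s * n;
        for (std::size_t k = 0; k < n; ++k)
            result[k] += a[k] * w[s];
    }
    return result;
}
```

No temporaries are created per slice here, and the inner loop is a simple candidate for auto-vectorization at -O3. Whether it beats the Armadillo expression depends on the compiler and hardware.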

Getting the variance of a vector of long doubles

I'm trying to calculate the variance of a vector of long doubles. I've tried implementing other code I've seen, but it doesn't return the correct value.
long double variance = 0;
for (int x = 0; x < (v.size() - 1); x++) {
variance += (v.at(x) - mean) * (v.at(x) - mean);
}
variance /= v.size();
For example, if my vector is {1,2,3,4,5}, the above code gives me 2.25. To my understanding the correct answer is 2.
Any help is appreciated, I'm not sure what I'm missing.
x < (v.size() - 1)? This skips the last element. Use <= or omit the - 1.
The index of the last element is v.size() - 1, and since x must be less than that, the loop ends before the last element is processed.
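For reference, a minimal sketch of a population variance that visits every element. The function name variance is an assumption for this example, and it computes the mean itself rather than relying on an external mean variable as the question's snippet does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Population variance: the mean of squared deviations from the mean.
long double variance(const std::vector<long double>& v)
{
    assert(!v.empty());
    long double mean = 0;
    for (long double x : v) mean += x;
    mean /= v.size();

    long double sum_sq = 0;
    for (std::size_t i = 0; i < v.size(); ++i)  // i < size(): includes the last element
        sum_sq += (v[i] - mean) * (v[i] - mean);
    return sum_sq / v.size();
}
```

For {1,2,3,4,5} the mean is 3, the squared deviations sum to 10, and the result is 10 / 5 = 2.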

Put a multidimensional array into a one-dimensional array

I've got a question. I'm writing a simple application in C++ and I have the following problem:
I want to use a two-dimensional array to specify the position of an object (x and y coordinates). But when I created such an array, I got many access violations when I accessed it. I'm not quite sure where those violations came from, but I think my stack is not big enough and I should use pointers. But when I searched for a way to put a multidimensional array on the heap and point to it, the solutions were too complicated for me.
So I remembered there's a way to use a "normal" one-dimensional array as a multidimensional array. But I do not remember exactly how to access it the right way. I declared it this way:
char array [SCREEN_HEIGHT * SCREEN_WIDTH];
Then I tried to fill it this way:
for(int y = 0; y < SCREEN_HEIGHT; y++) {
for(int x = 0; x < SCREEN_WIDTH; x++) {
array [y + x * y] = ' ';
}
}
But this is not right, because the char at position y + x * y is not uniquely specified (different x and y pairs point to the same position).
But I am pretty sure, there was a way to do this. Maybe I am wrong, so tell it to me :D
In this case, a solution to use multidimensional array would be great!
You don't want y + x*y, you want y * SCREEN_WIDTH + x. That said, a 2D array declared as:
char array[SCREEN_HEIGHT][SCREEN_WIDTH];
Has exactly the same memory layout, and you could just access it directly the way you want:
array[y][x] = ' ';
char array2D[ROW_COUNT][COL_COUNT] = { {...} };
char array1D[ROW_COUNT * COL_COUNT];
for (int row = 0; row < ROW_COUNT; row++)
{
for (int col = 0; col < COL_COUNT; col++)
{
array1D[row * COL_COUNT + col] = array2D[row][col];
}
}
You access the correct element for your 1D array by taking "current row * total columns + current column," or vice-versa if you're looping through columns first.
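Since the question also worries about stack size, one possible way to package the row * total columns + column rule is a small wrapper around a heap-allocated std::vector. The class name Grid and its members are hypothetical, chosen only for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper: a width * height grid stored row by row in one
// heap-allocated buffer, so it does not consume stack space.
class Grid {
public:
    Grid(std::size_t width, std::size_t height, char fill = ' ')
        : width_(width), height_(height), cells_(width * height, fill) {}

    char& at(std::size_t x, std::size_t y)
    {
        assert(x < width_ && y < height_);
        return cells_[y * width_ + x];  // row-major: skip y full rows, then x columns
    }

    std::size_t size() const { return cells_.size(); }

private:
    std::size_t width_, height_;
    std::vector<char> cells_;
};
```

A call such as Grid screen(SCREEN_WIDTH, SCREEN_HEIGHT); followed by screen.at(x, y) = ' '; would then replace the manual index computation.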

Generating incomplete iterated function systems

I am doing this assignment for fun.
http://groups.csail.mit.edu/graphics/classes/6.837/F04/assignments/assignment0/
There are sample outputs on the site if you want to see how it is supposed to look. It involves iterated function systems, whose algorithm according to the assignment is:
for "lots" of random points (x0, y0)
    for k=0 to num_iters
        pick a random transform fi
        (xk+1, yk+1) = fi(xk, yk)
    display a dot at (xk, yk)
I am running into trouble with my implementation, which is:
void IFS::render(Image& img, int numPoints, int numIterations){
Vec3f color(0,1,0);
float x,y;
float u,v;
Vec2f myVector;
for(int i = 0; i < numPoints; i++){
x = (float)(rand()%img.Width())/img.Width();
y = (float)(rand()%img.Height())/img.Height();
myVector.Set(x,y);
for(int j = 0; j < numIterations;j++){
float randomPercent = (float)(rand()%100)/100;
for(int k = 0; k < num_transforms; k++){
if(randomPercent < range[k]){
matrices[k].Transform(myVector);
}
}
}
u = myVector.x()*img.Width();
v = myVector.y()*img.Height();
img.SetPixel(u,v,color);
}
}
This is how I pick a random transform from the input matrices:
fscanf(input,"%d",&num_transforms);
matrices = new Matrix[num_transforms];
probablility = new float[num_transforms];
range = new float[num_transforms+1];
for (int i = 0; i < num_transforms; i++) {
fscanf (input,"%f",&probablility[i]);
matrices[i].Read3x3(input);
if(i == 0) range[i] = probablility[i];
else range[i] = probablility[i] + range[i-1];
}
My output shows only the beginnings of a Sierpinski triangle (1000 points, 1000 iterations):
My dragon is better, but still needs some work (1000 points, 1000 iterations):
If you have RAND_MAX=4 and a picture width of 3, an evenly distributed sequence like [0,1,2,3,4] from rand() will be mapped to [0,1,2,0,1] by your modulo code, i.e. some numbers will occur more often. You need to discard the values at or above the highest multiple of the target range that does not exceed RAND_MAX, i.e. at or above ((RAND_MAX / 3) * 3). Just check for this limit and call rand() again.
Since you have to fix that error in several places, consider writing a utility function. Then, reduce the scope of your variables. The u,v declaration makes it hard to see that these two are just used in three lines of code. Declare them as "unsigned const u = ..." to make this clear and additionally get the compiler to check that you don't accidentally modify them afterwards.
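The utility function suggested above could look roughly like the following sketch. The name uniform_below is an assumption for this example; it uses only std::rand and RAND_MAX from the standard library.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical utility: returns a value in [0, n) without modulo bias by
// rejecting draws at or above the largest multiple of n that fits in RAND_MAX.
int uniform_below(int n)
{
    assert(n > 0 && n <= RAND_MAX);
    const int limit = (RAND_MAX / n) * n;
    int r;
    do {
        r = std::rand();  // draw again if the value falls in the biased tail
    } while (r >= limit);
    return r % n;
}
```

In the render function, a starting coordinate could then be drawn as (float)uniform_below(img.Width()) / img.Width() instead of using rand() % img.Width() directly.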

Iterate through 2D Array block by block in C++

I'm working on a homework assignment for an image shrinking program in C++. My picture is represented by a 2D array of pixels; each pixel is an object with members "red", "green" and "blue." To solve the problem I am trying to access the 2D array one block at a time and then call a function which finds the average RGB value of each block and adds a new pixel to a smaller image array. The size of each block (or scale factor) is input by the user.
As an example, imagine a 100-item 2D array like myArray[10][10]. If the user input a shrink factor of 3, I would need to break out mini 2D arrays of size 3 by 3. I do not have to account for overflow, so in this example I can ignore the last row and the last column.
I have most of the program written, including the function to find the average color. I am confused about how to traverse the 2D array. I know how to cycle through a 2D array sequentially (one row at a time), but I'm not sure how to get little squares within an array.
Any help would be greatly appreciated!
Something like this should work:
for(size_t bx = 0; bx + block_width <= width; bx += block_width)
    for(size_t by = 0; by + block_height <= height; by += block_height) {
        float sum = 0;
        for(size_t x = 0; x < block_width; ++x)
            for(size_t y = 0; y < block_height; ++y) {
                sum += array[bx + x][by + y];
            }
        float average = sum / (block_width * block_height);
        // one output pixel per block, so divide the block origin by the block size
        new_array[bx / block_width][by / block_height] = average;
    }
Here width is the whole width, and block_width is the side length of the blue squares in your diagram.
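For a pixel type with separate colour members, the same traversal might be sketched as follows. The Pixel struct, the Image alias, and the function name shrink are assumptions made for this example (the question does not show its own types), and a single square block size is used for both directions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Pixel { int red, green, blue; };  // assumed layout of the pixel object

using Image = std::vector<std::vector<Pixel>>;  // image[row][col]

// Hypothetical helper: averages each block x block square into one pixel.
// Rows and columns that do not fill a whole block are ignored.
Image shrink(const Image& src, std::size_t block)
{
    assert(block > 0);
    const std::size_t out_h = src.size() / block;
    const std::size_t out_w = src.empty() ? 0 : src[0].size() / block;
    Image dst(out_h, std::vector<Pixel>(out_w));

    for (std::size_t by = 0; by < out_h; ++by)
        for (std::size_t bx = 0; bx < out_w; ++bx) {
            long r = 0, g = 0, b = 0;
            for (std::size_t y = 0; y < block; ++y)
                for (std::size_t x = 0; x < block; ++x) {
                    const Pixel& p = src[by * block + y][bx * block + x];
                    r += p.red; g += p.green; b += p.blue;
                }
            const long n = static_cast<long>(block * block);
            dst[by][bx] = Pixel{static_cast<int>(r / n),
                                static_cast<int>(g / n),
                                static_cast<int>(b / n)};
        }
    return dst;
}
```

For the 10 by 10 example with a shrink factor of 3, this produces a 3 by 3 image and ignores the last row and column.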
This is how you traverse an array in C++:
for(i=0; i < m; i++) {
for(j=0; j < n; j++) {
// do something with myArray[i][j] where i represents the row and j the column
}
}
I'll leave figuring out how to cycle through the array in different ways as an exercise to the reader.
You could use two nested loops, one for x and one for y, and move the start point of those loops across the image. As this is homework I won't put any code up, but you should be able to work it out.