Need help optimizing code (minimum image convention) - c++

I have written some simulation code and am using the "randomly break in GDB" method of debugging. I am finding that 99.9% of my program's time is spent in this routine (it's the minimum image convention):
inline double distanceSqPeriodic(double const * const position1, double const * const position2, double boxWidth) {
double xhw, yhw, zhw, x, y, z;
xhw = boxWidth / 2.0;
yhw = xhw;
zhw = xhw;
x = position2[0] - position1[0];
if (x > xhw)
x -= boxWidth;
else if (x < -xhw)
x += boxWidth;
y = position2[1] - position1[1];
if (y > yhw)
y -= boxWidth;
else if (y < -yhw)
y += boxWidth;
z = position2[2] - position1[2];
if (z > zhw)
z -= boxWidth;
else if (z < -zhw)
z += boxWidth;
return x * x + y * y + z * z;
}
The optimizations I have performed so far (maybe not very significant ones):
Return the square of the distance instead of the square root
Inline it
Const what I can
No standard library bloat
Compiling with every g++ optimization flag I can think of
I am running out of things I can do with this. Maybe I could use floats instead of doubles but I would prefer that be a last resort. And maybe I could somehow use SIMD on this, but I've never done that so I imagine that's a lot of work. Any ideas?
Thanks

First, you're not using the right algorithm. What if the two points are greater than boxWidth apart? Second, if you have multiple particles, calling a single function that does all of the distance calculations and places the results in an output buffer is going to be significantly more efficient. Inlining helps reduce some of this, but not all. Any of the precalculation -- like dividing the box length by 2 in your algorithm -- is going to be repeated when it doesn't need to be.
Here is some SIMD code to do the calculation. You need to compile with -msse4. Using -O3, on my machine (macbook pro, llvm-gcc-4.2), I get a speed up of about 2x. This does require using 32bit floats instead of double precision arithmetic.
SSE really isn't that complicated, it just looks terrible. e.g. instead of writing a*b, you have to write the clunky _mm_mul_ps(a,b).
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <smmintrin.h>
// you can compile this code with -DDOUBLE to try using doubles vs. floats
// in the unoptimized code. The SSE code uses only floats.
#ifdef DOUBLE
typedef double real;
#else
typedef float real;
#endif
static inline __m128 loadFloat3(const float* value) {
// Load (x,y,z) into a SSE register, leaving the last entry
// set to zero.
__m128 x = _mm_load_ss(&value[0]);
__m128 y = _mm_load_ss(&value[1]);
__m128 z = _mm_load_ss(&value[2]);
__m128 xy = _mm_movelh_ps(x, y);
return _mm_shuffle_ps(xy, z, _MM_SHUFFLE(2, 0, 2, 0));
}
int fdistanceSqPeriodic(float* position1, float* position2, const float boxWidth,
float* out, const int n_points) {
int i;
__m128 r1, r2, r12, s12, r12_2, s, box, invBox;
box = _mm_set1_ps(boxWidth);
invBox = _mm_div_ps(_mm_set1_ps(1.0f), box);
for (i = 0; i < n_points; i++) {
r1 = loadFloat3(position1);
r2 = loadFloat3(position2);
r12 = _mm_sub_ps(r1, r2);
s12 = _mm_mul_ps(r12, invBox);
s12 = _mm_sub_ps(s12, _mm_round_ps(s12, _MM_FROUND_TO_NEAREST_INT));
r12 = _mm_mul_ps(box, s12);
r12_2 = _mm_mul_ps(r12, r12);
// double horizontal add instruction accumulates the sum of
// all four elements into each of the elements
// (e.g. s.x = s.y = s.z = s.w = r12_2.x + r12_2.y + r12_2.z + r12_2.w)
s = _mm_hadd_ps(r12_2, r12_2);
s = _mm_hadd_ps(s, s);
_mm_store_ss(out++, s);
position1 += 3;
position2 += 3;
}
return 1;
}
inline real distanceSqPeriodic(real const * const position1, real const * const position2, real boxWidth) {
real xhw, yhw, zhw, x, y, z;
xhw = boxWidth / 2.0;
yhw = xhw;
zhw = xhw;
x = position2[0] - position1[0];
if (x > xhw)
x -= boxWidth;
else if (x < -xhw)
x += boxWidth;
y = position2[1] - position1[1];
if (y > yhw)
y -= boxWidth;
else if (y < -yhw)
y += boxWidth;
z = position2[2] - position1[2];
if (z > zhw)
z -= boxWidth;
else if (z < -zhw)
z += boxWidth;
return x * x + y * y + z * z;
}
int main(void) {
real* position1;
real* position2;
real* output;
int n_runs = 10000000;
posix_memalign((void**) &position1, 16, n_runs*3*sizeof(real));
posix_memalign((void**) &position2, 16, n_runs*3*sizeof(real));
posix_memalign((void**) &output, 16, n_runs*sizeof(real));
real boxWidth = 1.8;
real result = 0;
int i;
clock_t t;
#ifdef OPT
printf("Timing optimized SSE implementation\n");
#else
printf("Timing original implementation\n");
#endif
#ifdef DOUBLE
printf("Using double precision\n");
#else
printf("Using single precision\n");
#endif
t = clock();
#ifdef OPT
fdistanceSqPeriodic(position1, position2, boxWidth, output, n_runs);
#else
for (i = 0; i < n_runs; i++) {
*output = distanceSqPeriodic(position1, position2, boxWidth);
position1 += 3;
position2 += 3;
output++;
}
#endif
t = clock() - t;
printf("It took me %d clicks (%f seconds).\n", (int) t, ((float)t)/CLOCKS_PER_SEC);
}
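For comparison, here is a scalar sketch of the same wrap-by-rounding idea the SSE loop uses, kept in double precision. The function name distanceSqPeriodicBatch and the packed x,y,z layout are assumptions made for illustration, not part of the code above. It handles separations larger than boxWidth and hoists the reciprocal out of the loop.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (name and layout are assumptions): writes the squared
// minimum-image distance of n point pairs into out. Positions are packed as
// x0,y0,z0,x1,y1,z1,...
void distanceSqPeriodicBatch(const double* p1, const double* p2,
                             double boxWidth, double* out, std::size_t n) {
    const double invBox = 1.0 / boxWidth;  // computed once, not per pair
    for (std::size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = 0; k < 3; ++k) {
            double d = p2[3 * i + k] - p1[3 * i + k];
            // Subtract the nearest whole number of box widths, so the
            // result lies in [-boxWidth/2, boxWidth/2] for any separation.
            d -= boxWidth * std::nearbyint(d * invBox);
            sum += d * d;
        }
        out[i] = sum;
    }
}
```

Whether this beats the branching version depends on the compiler and target; it is worth timing both.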

You may want to use fabs (standardized in ISO C90), since it can usually be reduced to a single non-branching instruction.

Return the square of the distance instead of the square root
That's a good idea as long as you are comparing squares to squares.
Inline it
This is sometimes a counter-optimization: Inlined code takes up space in the execution pipeline/cache, whether it is branched to or not.
Often it makes no difference because the compiler has the final word on whether to inline or not.
Const what I can
Normally no difference at all.
No standard library bloat
What bloat?
Compiling with every g++ optimization flag I can think of
That's good: leave most optimizations to the compiler. Only invest money in hand-optimizing once you have measured your real bottleneck and determined that it is significant.
What you could try is to make your code branch-free. Without using bitmasks, this may look like this:
//if (z > zhw)
// z -= boxWidths[2];
//else if (z < -zhw)
// z += boxWidths[2];
const double z_a[] = {
z,
z - boxWidths[2]
};
z = z_a[z>zhw];
...
or
z -= (z>zhw) * boxWidths[2];
However, there is no guarantee that this is faster. Your compiler may now have a harder time identifying SIMD opportunities in your code, or the branch target buffer may already do a good job because most of the time you take the same code paths through your function.
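A sketch of the multiplication variant extended to both directions, assuming the difference never exceeds one box width in magnitude. The helper name wrapBranchFree is made up for this example.

```cpp
// Wraps one coordinate difference d into [-halfWidth, halfWidth], assuming
// |d| <= boxWidth. Each comparison evaluates to 0 or 1, so the source has no
// if/else (the compiler may still choose to emit a branch).
inline double wrapBranchFree(double d, double boxWidth) {
    const double halfWidth = 0.5 * boxWidth;
    return d - boxWidth * ((d > halfWidth) - (d < -halfWidth));
}
```

With boxWidth = 2.0, a difference of 1.5 wraps to -0.5 and a difference of -1.5 wraps to 0.5; values already inside the half width pass through unchanged.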

You need to get rid of the comparisons, as those are hard to predict.
The function to be implemented is:
/ / /\ /\
/ / / \/ \
----0----- or ------------ , as (-x)^2 == x^2
/ /
/ /
The latter is a result of two abs statements:
x = abs(half-abs(diff))-half;
The code
double tst(double a[4], double b[4], double half)
{
double sum=0.0,t;
int i;
for (i=0;i<3;i++) { t=fabs(fabs(b[i]-a[i])-half)-half; sum+=t*t;}
return sum;
}
beats the original implementation by a factor of four (+some) -- and at this point there's not even full parallelism: only the lower half of xmm registers are used.
With parallel processing of x && y, there's a theoretical gain of about 50% to be achieved. Using floats instead of doubles could in theory make it still about 3x faster.
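To see why the two-fabs form gives the same squared distance as the branching code, here is a small sketch for differences with |d| <= boxWidth. The names minImageAbs and distanceSqAbs are assumptions for illustration; the sign is flipped relative to tst() above so that the per-axis result is a non-negative distance, which does not change the square.

```cpp
#include <cmath>

// Magnitude of the minimum-image separation along one axis, assuming
// |d| <= boxWidth. No comparison is needed: folding |d| around half the
// box width maps [0, half] to itself and (half, boxWidth] back down.
inline double minImageAbs(double d, double boxWidth) {
    const double half = 0.5 * boxWidth;
    return half - std::fabs(std::fabs(d) - half);
}

// Squared minimum-image distance between two 3D points.
inline double distanceSqAbs(const double* a, const double* b, double boxWidth) {
    double sum = 0.0;
    for (int i = 0; i < 3; ++i) {
        const double t = minImageAbs(b[i] - a[i], boxWidth);
        sum += t * t;
    }
    return sum;
}
```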

Related

Ineffective "Peel/Remainder" Loop in my code

I have this function:
bool interpolate(const Mat &im, float ofsx, float ofsy, float a11, float a12, float a21, float a22, Mat &res)
{
bool ret = false;
// input size (-1 for the safe bilinear interpolation)
const int width = im.cols-1;
const int height = im.rows-1;
// output size
const int halfWidth = res.cols >> 1;
const int halfHeight = res.rows >> 1;
float *out = res.ptr<float>(0);
const float *imptr = im.ptr<float>(0);
for (int j=-halfHeight; j<=halfHeight; ++j)
{
const float rx = ofsx + j * a12;
const float ry = ofsy + j * a22;
#pragma omp simd
for(int i=-halfWidth; i<=halfWidth; ++i, out++)
{
float wx = rx + i * a11;
float wy = ry + i * a21;
const int x = (int) floor(wx);
const int y = (int) floor(wy);
if (x >= 0 && y >= 0 && x < width && y < height)
{
// compute weights
wx -= x; wy -= y;
int rowOffset = y*im.cols;
int rowOffset1 = (y+1)*im.cols;
// bilinear interpolation
*out =
(1.0f - wy) * ((1.0f - wx) * imptr[rowOffset+x] + wx * imptr[rowOffset+x+1]) +
( wy) * ((1.0f - wx) * imptr[rowOffset1+x] + wx * imptr[rowOffset1+x+1]);
} else {
*out = 0;
ret = true; // touching boundary of the input
}
}
}
return ret;
}
halfWidth is very random: it can be 9, 84, 20, 95, 111...I'm only trying to optimize this code, I don't understand it in details.
As you can see, the inner for has been already vectorized, but Intel Advisor suggests this:
And this is the Trip Count analysis result:
To my understanding this means that:
Vector length is 8, so 8 floats can be processed at the same time in each loop iteration. This would mean (if I'm not wrong) that the data is 32-byte aligned (even though, as I explain here, it seems that the compiler thinks the data is not aligned).
On average, 2 cycles are totally vectorized, while 3 cycles are remainder loops. The same goes for Min and Max. Otherwise I don't understand what ; means.
Now my question is: how can I follow Intel Advisor's first suggestion? It says to "increase the size of objects and add iterations so the trip count is a multiple of vector length"... OK, so it's simply saying "hey man, do this so that halfWidth*2+1 (since it goes from -halfWidth to +halfWidth) is a multiple of 8". But how can I do this? If I add random cycles, this would obviously break the algorithm!
The only solution that came to my mind is to add "fake" iterations like this:
const int vectorLength = 8;
const int iterations = halfWidth*2+1;
const int remainder = iterations%vectorLength;
for(int i=0; i<iterations+vectorLength-remainder; i++){
//this iteration was not supposed to exist, skip it!
if(i>halfWidth)
continue;
}
Of course this code would not work since it goes from -halfWidth to halfWidth, but it's to make you understand my strategy of "fake" iterations.
About the second option ("Increase the size of static and automatic objects, and use a compiler option to add data padding") I have no idea how to implement this.
First, check the Vector Advisor Efficiency metric as well as the relative time spent in the Loop Remainder compared to the Loop Body (see the hotspots list in Advisor). If efficiency is close to 100% (or the time spent in the Remainder is very small), then it is not worth the effort (and money, as MSalters mentioned in the comments).
If it is << 100% (and there are no other penalties reported by the tool), then you can either refactor the code to "add fake iterations" (few users can afford that) or try #pragma loop_count for the most typical iteration counts (depending on the typical halfWidth value).
If halfWidth is totally random (no common or average values), then there is nothing you can really do about this issue.
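If you do go the "fake iterations" route, one way to keep the algorithm intact is to pad the buffers rather than skip iterations inside the loop. The sketch below uses a made-up kernel (scaleRowPadded) rather than the interpolation code, and assumes a vector length of 8; the extra iterations only ever touch padding.

```cpp
#include <cstddef>
#include <vector>

// Rounds n up to the next multiple of vectorLength (vectorLength > 0).
inline std::size_t roundUpToMultiple(std::size_t n, std::size_t vectorLength) {
    return (n + vectorLength - 1) / vectorLength * vectorLength;
}

// Hypothetical example kernel: scales a row of floats. The working buffers
// are padded so the loop trip count is a multiple of vectorLength; the
// extra "fake" iterations only touch padding that the caller never reads.
std::vector<float> scaleRowPadded(const std::vector<float>& row, float factor,
                                  std::size_t vectorLength = 8) {
    const std::size_t padded = roundUpToMultiple(row.size(), vectorLength);
    std::vector<float> in(row);
    in.resize(padded, 0.0f);  // zero-filled padding
    std::vector<float> out(padded);
    for (std::size_t i = 0; i < padded; ++i) {
        out[i] = in[i] * factor;
    }
    return out;  // only the first row.size() elements are meaningful
}
```

For the interpolation loop this would mean allocating each output row with a padded width, which costs memory and only pays off if the remainder loop really dominates.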

OpenACC present clause update data

I am trying to do OpenACC optimizations for many-body simulations. Currently, I am facing a problem which leads to the memory errors below:
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution
srun: error: jrc0017: task 0: Exited with exit code 1
I am using pgc++ compiler and my compiler flags are -acc -Minfo=accel -ta=tesla -fast -std=c++11 and I don't want to use -ta=tesla:managed because I want to organise memory by myself.
#pragma acc kernels present(sim.part.rx, sim.part.ry, sim.part.rz, sim.part.vx, sim.part.vy, sim.part.vz)
{
for(int idx = 0; idx < sim.num; ++idx) { // Loop over target particle
float
prx = sim.part.rx[idx], // my position
pry = sim.part.ry[idx],
prz = sim.part.rz[idx];
float Fx = 0.f, Fy = 0.f, Fz = 0.f; // Force
#pragma acc loop
for(int jdx = 0; jdx < sim.num; ++jdx) { // Loop over interaction partners
if(idx != jdx) { // No self-force
const float dx = prx - sim.part.rx[jdx]; // Distance to partner
const float dy = pry - sim.part.ry[jdx];
const float dz = prz - sim.part.rz[jdx];
const float h = 1.f/sqrt(dx*dx + dy*dy + dz*dz + eps);
const float h3 = h*h*h;
Fx += dx*h3; // Sum up force
Fy += dy*h3;
Fz += dz*h3;
}
}
sim.part.vx[idx] += sim.mass*dt*Fx; // update velocity
sim.part.vy[idx] += sim.mass*dt*Fy;
sim.part.vz[idx] += sim.mass*dt*Fz;
}
}
If I delete the code below
sim.part.vx[idx] += sim.mass*dt*Fx; // update velocity
sim.part.vy[idx] += sim.mass*dt*Fy;
sim.part.vz[idx] += sim.mass*dt*Fz;
my code is able to run without problems. But I get the memory error if I un-comment them. It seems that sim.part.vx is being updated but the compiler doesn't know about it, which leads to the memory problem.
Does anyone know how to fix this problem?
I suspect the problem is that sim and sim.part are not on the device (or the compiler doesn't realize that they're on the device). As a workaround, can you try introducing pointers to those arrays directly?
float *rx = sim.part.rx, *ry = sim.part.ry, *rz = sim.part.rz,
*vx = sim.part.vx, *vy = sim.part.vy, *vz = sim.part.vz;
#pragma acc kernels present(rx, ry, rz, vx, vy, vz)
{
for(int idx = 0; idx < sim.num; ++idx) { // Loop over target particle
float
prx = rx[idx], // my position
pry = ry[idx],
prz = rz[idx];
float Fx = 0.f, Fy = 0.f, Fz = 0.f; // Force
#pragma acc loop
for(int jdx = 0; jdx < sim.num; ++jdx) { // Loop over interaction partners
if(idx != jdx) { // No self-force
const float dx = prx - rx[jdx]; // Distance to partner
const float dy = pry - ry[jdx];
const float dz = prz - rz[jdx];
const float h = 1.f/sqrt(dx*dx + dy*dy + dz*dz + eps);
const float h3 = h*h*h;
Fx += dx*h3; // Sum up force
Fy += dy*h3;
Fz += dz*h3;
}
}
vx[idx] += sim.mass*dt*Fx; // update velocity
vy[idx] += sim.mass*dt*Fy;
vz[idx] += sim.mass*dt*Fz;
}
}
How are sim and sim.part allocated? It's possible to use unstructured data directives in the constructor and destructor to make sure that sim and sim.part are on the device too. If you've already done this, then another possible solution is to add present(sim, sim.part) to your existing present clause so the compiler knows that you've already taken care of those data structures too.
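A minimal sketch of the constructor/destructor approach, assuming a container with a single array member. The Particles type and its member names are made up for illustration and are not the asker's actual classes. Without an OpenACC compiler the pragmas are ignored and this is plain host code.

```cpp
#include <cstddef>

// Hypothetical particle container (type and member names are assumptions).
struct Particles {
    std::size_t num;
    float* rx;

    explicit Particles(std::size_t n) : num(n), rx(new float[n]()) {
        // Copy the object itself (including its pointer member) to the
        // device, then create device storage for the array it points to.
        #pragma acc enter data copyin(this)
        #pragma acc enter data create(rx[0:num])
    }

    ~Particles() {
        // Release device copies in the reverse order, then the host array.
        #pragma acc exit data delete(rx[0:num])
        #pragma acc exit data delete(this)
        delete[] rx;
    }

    Particles(const Particles&) = delete;
    Particles& operator=(const Particles&) = delete;

    // Push host values to the device before a kernel reads them.
    void updateDevice() {
        #pragma acc update device(rx[0:num])
    }

    // Pull device results back to the host after a kernel writes them.
    void updateHost() {
        #pragma acc update self(rx[0:num])
    }
};
```

With the data managed this way, a compute region can list the object and its arrays in a present clause. I have not run this on a device, so treat the exact clause placement as something to check against your compiler's -Minfo=accel output.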

Drawing circle, OpenGL style

I have a 13 x 13 array of pixels, and I am using a function to draw a circle onto them. (The screen is 13 * 13, which may seem strange, but it's an array of LEDs, so that explains it.)
unsigned char matrix[13][13];
const unsigned char ON = 0x01;
const unsigned char OFF = 0x00;
Here is the first implementation I thought up. (It's inefficient, which is a particular problem as this is an embedded systems project, 80 MHz processor.)
// Draw a circle
// mode is 'ON' or 'OFF'
inline void drawCircle(float rad, unsigned char mode)
{
for(int ix = 0; ix < 13; ++ ix)
{
for(int jx = 0; jx < 13; ++ jx)
{
float r; // Radial
float s; // Angular ("theta")
matrix_to_polar(ix, jx, &r, &s); // Converts index coordinates
// specified by ix and jx to polar
// coordinates specified by r and s,
// where s is the angle.
// This function just translates by 6.0
// and then converts from cartesian.
if(r < rad)
{
matrix[ix][jx] = mode; // Turn pixel in matrix 'ON' or 'OFF'
}
}
}
}
I hope that's clear. It's pretty simple, but then I programmed it so I know how it's supposed to work. If you'd like more info / explanation then I can add some more code / comments.
Drawing several circles, e.g. 4 to 6, is very slow... Hence I'm asking for advice on a more efficient algorithm to draw the circles.
EDIT: Managed to double the performance by making the following modification:
The function calling the drawing used to look like this:
for(;;)
{
clearAll(); // Clear matrix
for(int ix = 0; ix < 6; ++ ix)
{
rad[ix] += rad_incr_step;
drawRing(rad[ix], rad[ix] - rad_width);
}
if(rad[5] >= 7.0)
{
for(int ix = 0; ix < 6; ++ ix)
{
rad[ix] = rad_space_step * (float)(-ix);
}
}
writeAll(); // Write
}
I added the following check:
if(rad[ix] - rad_width < 7.0)
drawRing(rad[ix], rad[ix] - rad_width);
This increased the performance by a factor of about 2, but ideally I'd like to make the circle drawing more efficient to increase it further. This checks to see if the ring is completely outside of the screen.
EDIT 2: Similarly adding the reverse check increased performance further.
if(rad[ix] >= 0.0)
drawRing(rad[ix], rad[ix] - rad_width);
Performance is now pretty good, but again I have made no modifications to the actual drawing code of the circles and this is what I was intending to focus on with this question.
Edit 3: Matrix to polar:
inline void matrix_to_polar(int i, int j, float* r, float* s)
{
float x, y;
matrix_to_cartesian(i, j, &x, &y);
calcPolar(x, y, r, s);
}
inline void matrix_to_cartesian(int i, int j, float* x, float* y)
{
*x = getX(i);
*y = getY(j);
}
inline void calcPolar(float x, float y, float* r, float* s)
{
*r = sqrt(x * x + y * y);
*s = atan2(y, x);
}
inline float getX(int xc)
{
return (float(xc) - 6.0);
}
inline float getY(int yc)
{
return (float(yc) - 6.0);
}
In response to Clifford that's actually a lot of function calls if they are not inlined.
Edit 4: drawRing just draws 2 circles, firstly an outer circle with mode ON and then an inner circle with mode OFF. I am fairly confident that there is a more efficient method of drawing such a shape too, but that distracts from the question.
You're doing a lot of calculations that aren't really needed. For example, you're calculating the angle of the polar coordinates, but never use it. The square root can also easily be avoided by comparing the square of the values.
Without doing anything fancy, something like this should be a good start:
int intRad = (int)rad;
int intRadSqr = (int)(rad * rad);
for (int ix = 0; ix <= intRad; ++ix)
{
for (int jx = 0; jx <= intRad; ++jx)
{
if (ix * ix + jx * jx <= intRadSqr)
{
matrix[6 - ix][6 - jx] = mode;
matrix[6 - ix][6 + jx] = mode;
matrix[6 + ix][6 - jx] = mode;
matrix[6 + ix][6 + jx] = mode;
}
}
}
This does all the math in integer format, and takes advantage of the circle symmetry.
Variation of the above, based on feedback in the comments:
int intRad = (int)rad;
int intRadSqr = (int)(rad * rad);
for (int ix = 0; ix <= intRad; ++ix)
{
for (int jx = 0; ix * ix + jx * jx <= intRadSqr; ++jx)
{
matrix[6 - ix][6 - jx] = mode;
matrix[6 - ix][6 + jx] = mode;
matrix[6 + ix][6 - jx] = mode;
matrix[6 + ix][6 + jx] = mode;
}
}
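Note that both loops above index outside the 13 x 13 array once rad reaches 7. A self-contained sketch of the same idea with the offsets clamped to the display follows; the names drawDisc, kSize and kCentre are assumptions for this example.

```cpp
#include <algorithm>

constexpr int kSize = 13;    // 13 x 13 LED matrix, as in the question
constexpr int kCentre = 6;   // centre index of a 13-wide row
unsigned char matrix[kSize][kSize];

// Fills a disc of radius rad centred on (kCentre, kCentre) using integer
// comparisons only, mirroring one quadrant into the other three.
void drawDisc(float rad, unsigned char mode) {
    if (rad < 0.0f) {
        return;  // nothing to draw
    }
    const int radSqr = static_cast<int>(rad * rad);
    // Never step further than the edge of the display.
    const int maxOffset = std::min(static_cast<int>(rad), kCentre);
    for (int ix = 0; ix <= maxOffset; ++ix) {
        for (int jx = 0; jx <= maxOffset && ix * ix + jx * jx <= radSqr; ++jx) {
            matrix[kCentre - ix][kCentre - jx] = mode;
            matrix[kCentre - ix][kCentre + jx] = mode;
            matrix[kCentre + ix][kCentre - jx] = mode;
            matrix[kCentre + ix][kCentre + jx] = mode;
        }
    }
}
```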
Don't underestimate the cost of even basic arithmetic using floating point on a processor with no FPU. It seems unlikely that floating point is necessary, but the details of its use are hidden in your matrix_to_polar() implementation.
Your current implementation considers every pixel as a candidate - that is also unnecessary.
Using the equation y = cy ± √[rad² - (x-cx)²] where cx, cy is the centre (6, 6 for a zero-based 13 x 13 array), and a suitable integer square root implementation, the circle can be drawn thus:
void drawCircle( int rad, unsigned char mode )
{
int r2 = rad * rad ;
for( int x = 6 - rad; x <= 6 + rad; x++ )
{
int dx = x - 6 ;
int dy = isqrt( r2 - dx * dx ) ;
matrix[x][6 - dy] = mode ;
matrix[x][6 + dy] = mode ;
}
}
In my test I used the isqrt() below based on code from here, but given that the maximum r2 necessary is 169 (13²), you could implement a 16 or even 8 bit optimised version if necessary. If your processor is 32 bit, this is probably fine.
uint32_t isqrt(uint32_t n)
{
uint32_t root = 0, bit, trial;
bit = (n >= 0x10000) ? 1<<30 : 1<<14;
do
{
trial = root+bit;
if (n >= trial)
{
n -= trial;
root = trial+bit;
}
root >>= 1;
bit >>= 2;
} while (bit);
return root;
}
All that said, on such a low resolution device, you will probably get better quality circles and faster performance by hand generating bitmap lookup tables for each radius required. If memory is an issue, then a single circle needs only 7 bytes to describe a 7 x 7 quadrant that you can reflect to the other three quadrants, or for greater performance you could use 7 x 16 bit words to describe a semi-circle (since reversing bit order is more expensive than reversing array access - unless you are using an ARM Cortex-M with bit-banding). Using semi-circle look-ups, 13 circles would need 13 x 7 x 2 bytes (182 bytes); quadrant look-ups would need 7 x 8 bits x 13 (91 bytes) - you may find that is fewer bytes than the code space required to calculate the circles.
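A sketch of the quadrant look-up idea, assuming one byte per row with bit j meaning "column offset j from the centre is lit". The table kQuadrantR3 is a made-up example for a filled radius-3 disc, worked out from ix*ix + jx*jx <= 9; a real table would be tuned by hand for appearance.

```cpp
#include <cstdint>

constexpr int kCentre = 6;
unsigned char leds[13][13];

// Example quadrant for a filled disc of radius 3 (values are illustrative).
// Row index is the row offset from the centre; bit j is column offset j.
const std::uint8_t kQuadrantR3[7] = {
    0x0F,  // row offset 0: column offsets 0..3
    0x07,  // row offset 1: column offsets 0..2
    0x07,  // row offset 2: column offsets 0..2
    0x01,  // row offset 3: column offset 0 only
    0x00, 0x00, 0x00
};

// Draws a 7-byte quadrant table, reflected into all four quadrants.
void drawQuadrantTable(const std::uint8_t (&rows)[7], unsigned char mode) {
    for (int ix = 0; ix < 7; ++ix) {
        for (int jx = 0; jx < 7; ++jx) {
            if ((rows[ix] >> jx) & 1U) {
                leds[kCentre - ix][kCentre - jx] = mode;
                leds[kCentre - ix][kCentre + jx] = mode;
                leds[kCentre + ix][kCentre - jx] = mode;
                leds[kCentre + ix][kCentre + jx] = mode;
            }
        }
    }
}
```

No arithmetic beyond shifts and masks happens at draw time, which is the point on a processor without an FPU.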
For a slow embedded device with only a 13x13 element display, you should really just make a look-up table. For example:
struct ComputedCircle
{
float rMax;
char col[13][2];
};
Where the draw routine uses rMax to determine which LUT element to use. For example, if you have 2 elements with one rMax = 1.4f, the other = 1.7f, then any radius between 1.4f and 1.7f will use that entry.
The column elements would specify zero, one, or two line segments per row, which can be encoded in the lower and upper 4 bits of each char. -1 can be used as a sentinel value for nothing-at-this-row. It is up to you how many look-up table entries to use, but with a 13x13 grid you should be able to encode every possible outcome of pixels with well under 100 entries, and a reasonable approximation using only 10 or so. You can also trade off compression for draw speed as well, e.g. putting the col[13][2] matrix in a flat list and encoding the number of rows defined.
I would accept MooseBoy's answer if only he explained the method he proposes better. Here's my take on the lookup table approach.
Solve it with a lookup table
The 13x13 display is quite small, and if you only need circles which are fully visible within this pixel count, you can get by with quite a small table. Even if you need larger circles, it should still be better than any algorithmic approach if you need it to be fast (and have the ROM to store it).
How to do it
You basically need to define what each possible circle looks like on the 13x13 display. It is not sufficient to just produce snapshots for the 13x13 display, as you will likely want to plot the circles at arbitrary positions. My take for a table entry would look like this:
struct circle_entry_s{
unsigned int diameter;
unsigned int offset;
};
The entry would map a given diameter in pixels to offsets in a large byte table containing the shape of the circles. For example for diameter 9, the byte sequence would look like this:
0x1CU, 0x00U, /* 000111000 */
0x63U, 0x00U, /* 011000110 */
0x41U, 0x00U, /* 010000010 */
0x80U, 0x80U, /* 100000001 */
0x80U, 0x80U, /* 100000001 */
0x80U, 0x80U, /* 100000001 */
0x41U, 0x00U, /* 010000010 */
0x63U, 0x00U, /* 011000110 */
0x1CU, 0x00U, /* 000111000 */
The diameter specifies how many bytes of the table belong to the circle: one row of pixels is generated from (diameter + 7) >> 3 bytes, and the number of rows corresponds to the diameter. The output code for these can be made quite fast, while the lookup table is compact enough to hold even circles larger than the 13x13 display if needed.
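As a sketch of what the output code could look like, the function below plots the diameter-9 byte sequence at an arbitrary top-left position. It assumes the bits are stored most significant bit first, which is what the comments in the table suggest. The names plotCircle, kCircle9 and display are invented for this example, and the offset lookup through circle_entry_s is left out.

```cpp
#include <cstdint>

unsigned char display[13][13];

// The diameter-9 outline from the table above: two bytes per row,
// most significant bit first (an assumption based on the comments).
const std::uint8_t kCircle9[] = {
    0x1CU, 0x00U, 0x63U, 0x00U, 0x41U, 0x00U,
    0x80U, 0x80U, 0x80U, 0x80U, 0x80U, 0x80U,
    0x41U, 0x00U, 0x63U, 0x00U, 0x1CU, 0x00U
};

// Plots a circle bitmap with its top-left corner at (row0, col0),
// clipping any pixels that fall outside the 13 x 13 display.
void plotCircle(const std::uint8_t* bits, unsigned diameter,
                int row0, int col0, unsigned char mode) {
    const unsigned bytesPerRow = (diameter + 7U) >> 3;
    for (unsigned r = 0; r < diameter; ++r) {
        for (unsigned c = 0; c < diameter; ++c) {
            const std::uint8_t byte = bits[r * bytesPerRow + (c >> 3)];
            const bool on = ((byte >> (7U - (c & 7U))) & 1U) != 0;
            const int y = row0 + static_cast<int>(r);
            const int x = col0 + static_cast<int>(c);
            if (on && y >= 0 && y < 13 && x >= 0 && x < 13) {
                display[y][x] = mode;
            }
        }
    }
}
```

Calling plotCircle(kCircle9, 9, 2, 2, 1) centres the outline on the display.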
Note that defining circles this way for odd and even diameters may or may not appeal to you when they are output by a centre location. The odd diameter circles will appear to have a centre in the "middle" of a pixel, while the even diameter circles will appear to have their centre on the "corner" of a pixel.
You may also find it nice later to refine the overall method to have multiple circles of different apparent sizes but the same pixel radius. It depends on what your goal is: if you want some kind of smooth animation, you may get there eventually.
I think algorithmic solutions will mostly perform poorly here, since with this limited display surface every pixel's state really counts for the appearance.

Getting NAN when only dealing with integer and float

I am working on an opengl assignment where I have to make a creature (I chose a snowman) move around some terrain. I am trying to make it move around, and I am getting the strangest errors. After printing the numbers out, I frequently get "-1.#QNAN0" as a number. I don't even know what that means. Below is the snowman's update function, constructor, and the header file. I am trying to get 2 numbers to use as velocity and add them to the position while it is set to animate (randomly changing), but I don't understand what errors are causing me to not get numbers out of rand().
Each time that the probability check succeeds, it prints out:
DEBUG: probability check succeeded
-1.#QNAN0 0.000000
or
DEBUG: probability check succeeded
0.000000 0.000000
with about 50% chance of each.
From Snowman.cpp
void Snowman::update(canvas_t texture){
//randomly toggle the walking variable
int probability = rand() % 100;
//printf("DEBUG: probability = %d\n", probability);
if(probability <= 10){
printf("DEBUG: probability check succeeded\n");
walking = !walking;
dx = static_cast<float>(( (rand() % 10) - 5));
dy = static_cast<float>(( (rand() % 10) - 5));
printf("%f %f\n", dx, dy);
}
//code to control movement
if(walking){
animate = true;
x += dx;
y += dy;
constrain(x, 0, texture.width);
constrain(y, 0, texture.height);
}else{
animate = false;
}
//set the height after x and y are resolved
z = getHeight(texture);
}
Snowman::Snowman(canvas_t terrain)
{
wireFrame = false;
animate = false;
armSegments = 2;
animationFrameNumber = 0;
manualUserOffset = 0;
//set its initial position
x = rand() % terrain.width;
y = rand() % terrain.height;
dx = 0;
dy = 0;
}
From Snowman.h
class Snowman
{
public:
Snowman(canvas_t);
~Snowman(void);
void setWireframe(bool);
void toggleWireframe(void);
void setAnimate(bool);
void toggleAnimate(void);
void setArmSegments(int);
void addArmSegment(void);
void subtractArmSegment(void);
void update(canvas_t);
void draw(void);
private:
bool wireFrame;
bool animate;
bool walking;
int armSegments;
int animationFrameNumber;
float manualUserOffset;
float x, y, z;
int dx, dy;
inline float f(void);
inline void drawMouth(int headRadius);
inline void drawFace(int headRadius);
void drawArm(int remainingSegments);
inline void drawBody();
inline float getHeight(canvas_t);
};
dx and dy are ints, but your format specifier %f requires a double or a float. So you have undefined behaviour.

can anyone look over some simple gradient descent code?

I'm trying to implement a very simple 1-dimensional gradient descent algorithm. The code I have does not work at all. Basically depending on my alpha value, the end parameters will either be wildly huge (like ~70 digits), or basically zero (~ 0.000). I feel like a gradient descent should not be nearly this sensitive in alpha (I'm generating small data in [0.0,1.0], but I think the gradient itself should account for the scale of the data, no?).
Here's the code:
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <vector>
using namespace std;
double a, b;
double theta0 = 0.0, theta1 = 0.0;
double myrand() {
return double(rand()) / RAND_MAX;
}
double f(double x) {
double y = a * x + b;
y *= 0.1 * (myrand() - 0.5); // +/- 5% noise
return y;
}
double h(double x) {
return theta1 * x + theta0;
}
int main() {
srand(time(NULL));
a = myrand();
b = myrand();
printf("set parameters: a = %lf, b = %lf\n", a, b);
int N = 100;
vector<double> xs(N);
vector<double> ys(N);
for (int i = 0; i < N; ++i) {
xs[i] = myrand();
ys[i] = f(xs[i]);
}
double sensitivity = 0.008;
double d0, d1;
for (int n = 0; n < 100; ++n) {
d0 = d1 = 0.0;
for (int i = 0; i < N; ++i) {
d0 += h(xs[i]) - ys[i];
d1 += (h(xs[i]) - ys[i]) * xs[i];
}
theta0 -= sensitivity * d0;
theta1 -= sensitivity * d1;
printf("theta0: %lf, theta1: %lf\n", theta0, theta1);
}
return 0;
}
Changing the value of alpha can cause the algorithm to diverge, so that may be one of the causes of what is happening. You can check by computing the error in each iteration and seeing whether it is increasing or decreasing.
In addition, it is recommended to set the values of theta randomly at the beginning instead of initializing them to zero.
Apart from that, you should divide by N when you update the value of theta, as follows:
theta0 -= sensitivity * d0/N;
theta1 -= sensitivity * d1/N;
I had a quick look at your implementation and it looks fine to me.
The code I have does not work at all.
I wouldn't say that. It seems to behave correctly for small enough values of sensitivity, which is a value that you just have to "guess", and that is how the gradient descent is supposed to work.
I feel like a gradient descent should not be nearly this sensitive in alpha
If you struggle to visualize that, remember that you are using gradient descent to find the minimum of the cost function of linear regression, which is a quadratic function. If you plot the cost function you will see why the learning rate is so sensitive in these cases: intuitively, if the parabola is narrow, the algorithm will converge more quickly, which is good, but then the learning rate is more "sensitive" and the algorithm can easily diverge if you are not careful.
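To put a number on that sensitivity, here is the standard stability bound for gradient descent on a quadratic cost, written for this problem. The symbols X (the N x 2 matrix with rows (1, x_i)), H and lambda_max are notation introduced here, and the numeric values are rough expectations for x drawn uniformly from [0, 1], not measurements.

```latex
J(\theta) = \frac{1}{2N}\sum_{i=1}^{N}\bigl(\theta_0 + \theta_1 x_i - y_i\bigr)^2,
\qquad
\nabla J(\theta) = \frac{1}{N} X^{\mathsf T}(X\theta - y),
\qquad
H = \frac{1}{N} X^{\mathsf T} X .

% The error with respect to the minimiser \theta^* evolves linearly:
\theta^{(k+1)} - \theta^* = (I - \alpha H)\,\bigl(\theta^{(k)} - \theta^*\bigr),

% so the iteration converges exactly when |1 - \alpha\lambda_j| < 1
% for every eigenvalue \lambda_j of H, that is
0 < \alpha < \frac{2}{\lambda_{\max}(H)} .
```

The code in the question omits the 1/N factor, so its effective matrix is X^T X and the bound is N times tighter. For N = 100 and x uniform in [0, 1], X^T X is roughly [[100, 50], [50, 33]], whose largest eigenvalue is about 127, giving a threshold near 0.016. That is why values of sensitivity only a little above 0.008 already diverge.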