I have unexpectedly run into a strange issue.
Here is an example of the LOGICAL implementation of a trivial linear (Lagrange) interpolation:
unsigned char mix(unsigned char x0, unsigned char x1, float position){
// LOGICALLY must be something like (real implementation should be
// with casts)...
return x0 + (x1 - x0) * position;
}
Arguments x0, x1 are always in range 0 - 255.
Argument position is always in range 0.0f - 1.0f.
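For reference, one way to write the interpolation with every conversion made explicit is sketched below. It assumes that rounding to nearest and clamping to 0 - 255 are wanted (neither is stated above), and the name mixExplicit is only illustrative.

```cpp
#include <cmath>

// Hypothetical cast-explicit variant of mix(). The rounding and clamping
// policy are assumptions, not part of the original function.
unsigned char mixExplicit(unsigned char x0, unsigned char x1, float position) {
    const float a = static_cast<float>(x0);
    const float b = static_cast<float>(x1);
    float v = a + (b - a) * position;  // done in float, so x1 < x0 cannot wrap
    if (v < 0.0f) v = 0.0f;            // clamp before narrowing
    if (v > 255.0f) v = 255.0f;
    return static_cast<unsigned char>(std::lround(v));
}
```

With these assumptions, the midpoint between 0 and 255 rounds up to 128.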
I have tried a huge number of implementations (with different casts, etc.), but none of them works in my case! It returns incorrect results (it looks like a variable overflow or something similar). After searching the internet for a solution for a whole week, I decided to ask. Maybe someone has faced a similar issue.
I'm using the MSVC 2017 compiler (most parameters are default, except the language level).
OS: Windows 10 x64, little-endian.
What am I doing wrong, and what is a possible source of the issue?
UPDATED:
It looks like this issue goes deeper than I expected (thanks for your responses).
Here is a link to a tiny GitHub project that demonstrates my issue:
https://github.com/elRadiance/altitudeMapVisualiser
The output BMP file should contain a smooth altitude map. Instead, it contains garbage. If I return just x0 or x1 from the interpolation function (that is, without interpolating), it works. With actual interpolation it produces garbage.
Desired result (as here, but in interpolated colors, smooth)
Actual result (updated, best result achieved)
Main class to run it:
#include "Visualiser.h"
int main() {
unsigned int width = 512;
unsigned int height = 512;
float* altitudes = new float[width * height];
float c;
for (int w = 0; w < width; w++) {
c = (2.0f * w / width) - 1.0f;
for (int h = 0; h < height; h++) {
altitudes[w*height + h] = c;
}
}
Visualiser::visualiseAltitudeMap("gggggggggg.bmp", width, height, altitudes);
delete[] altitudes; // memory from new[] must be released with delete[]
}
Thank you in advance!
SOLVED, thanks to #wololo. The mistake in my project was not in the calculations.
I should have opened the file with the "binary" option:
file.open("test.bin", std::ios_base::out | std::ios_base::trunc | std::ios_base::binary);
Without it, a byte with the value 10 can occur at some point in the data.
On Windows, a text-mode stream treats that byte as a line feed and writes it as the pair 13, 10 (CR LF), which corrupts the file.
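A minimal sketch of the fix, assuming the pixel data is already in a byte vector. The helper names writeBinary and readBinary are illustrative and not taken from the project.

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Writes the bytes unchanged. std::ios_base::binary disables the text-mode
// newline translation that would otherwise expand byte 10 on Windows.
bool writeBinary(const std::string& path, const std::vector<unsigned char>& bytes) {
    std::ofstream file(path, std::ios_base::out | std::ios_base::trunc | std::ios_base::binary);
    if (!file) return false;
    file.write(reinterpret_cast<const char*>(bytes.data()),
               static_cast<std::streamsize>(bytes.size()));
    return file.good();
}

// Reads the file back in binary mode, e.g. to check that nothing was altered.
std::vector<unsigned char> readBinary(const std::string& path) {
    std::ifstream file(path, std::ios_base::in | std::ios_base::binary);
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(file),
                                      std::istreambuf_iterator<char>());
}
```

Writing a buffer that contains the value 10 and reading it back should give a byte-for-byte identical result on any platform.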
I'm following the book Ray Tracing in One Weekend, in which the author builds a small ray tracer in plain C++ and the result is a PPM image.
The author's code
Which produces this PPM image.
As an exercise, the author suggests making the program produce a JPG image via the stb_image library. So far I have tried changing the original code like this:
#include <fstream>
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"
struct RGB{
unsigned char R;
unsigned char G;
unsigned char B;
};
int main(){
int nx = 200;
int ny = 100;
struct RGB data[nx][ny];
for(int j = ny - 1 ; j >= 0 ; j-- ){
for(int i = 0; i < nx ; i++){
float r = float(i) / float(nx);
float g = float(j) / float(ny);
float b = 0.2;
int ir = int(255.99 * r);
int ig = int(255.99 * g);
int ib = int(255.99 * b);
data[i][j].R = ir;
data[i][j].G = ig;
data[i][j].B = ib;
}
}
stbi_write_jpg("image.jpg", nx, ny, 3, data, 100);
}
And this is the result:
As you can see, my result is different and I don't know why.
The main problems are:
Black shows at the top left of the image, and in general the colors don't appear in the correct order left to right, top to bottom.
The image is "split" in half: the result is actually the author's original image, but produced twice.
I'm probably misunderstanding something about the way STB_IMAGE_WRITE is supposed to be used, so if anyone experienced with this library can tell me what's going on I would be grateful.
EDIT 1: I implemented the changes suggested by #1201ProgramAlarm in the comments and also changed struct RGB data[nx][ny] to struct RGB data[ny][nx], so the result is now this.
The library should be working as intended. The problem is how you give the data and what you should do is invert the y axis. So, when you are at index 4 from the start, you should give the color of the index 4 from the end.
Taking the result from your edit, just change the line:
float g = float(j) / float(ny);
to
float g = float(ny - 1 - j) / float(ny);
You've got your indexes wrong for data. The inner loop variable should be the second subscript (data[j][i]).
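Putting both suggestions together, here is a sketch of filling the buffer in the order stb_image_write documents (left to right, top to bottom). It uses a flat std::vector instead of the two-dimensional array, and makeGradient is a hypothetical helper name.

```cpp
#include <cstddef>
#include <vector>

// Fills a tightly packed RGB buffer: rows from top to bottom, pixels from
// left to right inside each row, three bytes per pixel.
std::vector<unsigned char> makeGradient(int nx, int ny) {
    std::vector<unsigned char> data(static_cast<std::size_t>(nx) * ny * 3);
    for (int row = 0; row < ny; row++) {          // row 0 is the top of the image
        for (int col = 0; col < nx; col++) {
            float r = float(col) / float(nx);
            float g = float(ny - 1 - row) / float(ny);  // flipped y axis
            float b = 0.2f;
            std::size_t i = (static_cast<std::size_t>(row) * nx + col) * 3;
            data[i + 0] = static_cast<unsigned char>(int(255.99 * r));
            data[i + 1] = static_cast<unsigned char>(int(255.99 * g));
            data[i + 2] = static_cast<unsigned char>(int(255.99 * b));
        }
    }
    return data;
}
```

The buffer can then be passed as stbi_write_jpg("image.jpg", nx, ny, 3, data.data(), 100), where data is the returned vector.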
tl;dr: double b=a-(size_t)(a) faster than double b=a-trunc(a)
I am implementing a rotation function for an image and I noticed that the trunc function seems to be awfully slow.
Here is the loop over the image. The actual assignment of the pixels is commented out for the performance test, so I don't even access the pixels.
double sina(sin(angle)), cosa(cos(angle));
int h = (int) (_in->h*cosa + _in->w*sina);
int w = (int) (_in->w*cosa + _in->h*sina);
int offsetx = (int)(_in->h*sina);
SDL_Surface* out = SDL_CreateARGBSurface(w, h); //wrapper over SDL_CreateRGBSurface
SDL_FillRect(out, NULL, 0x0);//transparent black
for (int y = 0; y < _in->h; y++)
for (int x = 0; x < _in->w; x++){
//calculate the new position
const double destY = y*cosa + x*sina;
const double destX = x*cosa - y*sina + offsetx;
So here is the code using trunc
size_t tDestX = (size_t) trunc(destX);
size_t tDestY = (size_t) trunc(destY);
double left = destX - trunc(destX);
double top = destY - trunc(destY);
And here is the faster equivalent
size_t tDestX = (size_t)(destX);
size_t tDestY = (size_t)(destY);
double left = destX - tDestX;
double top = destY - tDestY;
The answers suggest not to use trunc when converting back to integral so I also tried that case:
size_t tDestX = (size_t) (destX);
size_t tDestY = (size_t) (destY);
double left = destX - trunc(destX);
double top = destY - trunc(destY);
The fast version seems to take an average of 30ms to go through the full image (2048x1200) while the slow version using trunc takes about 135ms for the same image. The version with only two calls to trunc is still much slower than the one without (about 100ms).
As far as I understand the C++ rules, both expressions should always return the same thing. Am I missing something here? destX and destY are declared const, so only one call to trunc should be needed for each, and even then that would not by itself explain a slowdown of more than three times.
I'm compiling with Visual Studio 2013 with optimizations (/O2). Is there any reason to use the trunc function at all? Even for getting the fractional part using an integer seems to be faster.
The way you're using it, there's no reason for you to use the trunc function at all. It transforms a double into a double, which you then cast into an integral and throw away. The fact that the alternative is faster, is not that surprising.
On modern x86 CPUs, int <-> float conversions are quite fast - typically inline SSE code is generated for the conversion and the cost is of the order of a few instruction cycles [1].
For trunc however a function call is required, and the function call overhead alone is almost certainly greater than the cost of an inline float -> int conversion. Furthermore, the trunc function itself may be relatively costly - it has to be fully IEEE-754 compliant, so the full range of floating point values has to be dealt with correctly, as do edge cases such as NaN, INF, denorms, values which are out of range, etc. So overall I would expect the cost of trunc to be of the order of tens of instruction cycles, i.e. an order of magnitude or so greater than the cost of an inline float -> int conversion.
[1] Note that float <-> int conversions are not always inexpensive - other CPU families, and even older x86 CPUs, may not have ISA support for such conversions, in which case a library function will normally be used, and the cost of this would be similar to that of trunc. Modern x86 CPUs are a special case in this regard.
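The cast-based split from the question can be wrapped as below. This is only a sketch: it matches trunc solely for non-negative values that fit in size_t, which is an assumption the caller has to guarantee, and the names Split and splitNonNegative are made up for illustration.

```cpp
#include <cstddef>

// Integer and fractional part of a non-negative double.
struct Split {
    std::size_t whole;  // truncated value
    double frac;        // value minus the truncated value, in [0, 1)
};

// Assumes 0 <= v and v small enough to fit in std::size_t; outside that
// range the conversion is undefined, unlike trunc().
inline Split splitNonNegative(double v) {
    const std::size_t whole = static_cast<std::size_t>(v);
    return Split{whole, v - static_cast<double>(whole)};
}
```

In the rotation loop, destX can be negative for some angles, so the offset has to keep the coordinates non-negative before this shortcut is safe.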
I'm using the following algorithm to perform nearest-neighbor resizing. Is there any way to optimize its speed? Input and output buffers are in ARGB format, though the images are known to always be opaque. Thank you.
void resizeNearestNeighbor(const uint8_t* input, uint8_t* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
const int y_ratio = (int)((sourceHeight << 16) / targetHeight) ;
const int colors = 4;
for (int y = 0; y < targetHeight; y++)
{
int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
int i_xdest = y * targetWidth;
for (int x = 0; x < targetWidth; x++)
{
int x2 = ((x * x_ratio) >> 16) ;
int y2_x2_colors = (y2_xsource + x2) * colors;
int i_x_colors = (i_xdest + x) * colors;
output[i_x_colors] = input[y2_x2_colors];
output[i_x_colors + 1] = input[y2_x2_colors + 1];
output[i_x_colors + 2] = input[y2_x2_colors + 2];
output[i_x_colors + 3] = input[y2_x2_colors + 3];
}
}
}
The restrict keyword will help a lot, assuming no aliasing.
Another improvement is to declare additional pointers, pointerToOutput and pointerToInput, as uint32_t*, so that the four 8-bit copy assignments can be combined into one 32-bit assignment, assuming the pointers are 32-bit aligned.
There's little that you can do to speed this up, as you already arranged the loops in the right order and cleverly used fixed-point arithmetic. As others suggested, try to move the 32 bits in a single go (hoping that the compiler didn't see that yet).
In case of significant enlargement, there is a possibility: you can determine how many times every source pixel needs to be replicated (you'll need to work on the properties of the relation Xd=Wd.Xs/Ws in integers), and perform a single pixel read for k writes. This also works on the y's, and you can memcpy the identical rows instead of recomputing them. You can precompute and tabulate the mappings of the X's and Y's using run-length coding.
But there is a barrier that you will not pass: you need to fill the destination image.
If you are desperately looking for speedup, there could remain the option of using vector operations (SSE or AVX) to handle several pixels at a time. Shuffle instructions are available that might make it possible to control the replication (or decimation) of the pixels. But due to the complicated replication pattern combined with the fixed structure of the vector registers, you will probably need to integrate a complex decision table.
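The tabulation and row-copy ideas above can be sketched as follows. This is an illustration under assumptions, not a tuned implementation: it works on 32-bit pixels, uses exact integer division instead of the 16.16 fixed-point ratio from the question (so results can differ by one source pixel at some positions), and the name resizeNearestTabulated is made up.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

void resizeNearestTabulated(const std::uint32_t* input, std::uint32_t* output,
                            int sourceWidth, int sourceHeight,
                            int targetWidth, int targetHeight)
{
    // The x mapping is the same for every row, so compute it once.
    std::vector<int> xmap(static_cast<std::size_t>(targetWidth));
    for (int x = 0; x < targetWidth; x++)
        xmap[x] = static_cast<int>(static_cast<std::int64_t>(x) * sourceWidth / targetWidth);

    int previousSourceRow = -1;
    for (int y = 0; y < targetHeight; y++) {
        const int sourceRow =
            static_cast<int>(static_cast<std::int64_t>(y) * sourceHeight / targetHeight);
        std::uint32_t* outLine = output + static_cast<std::size_t>(y) * targetWidth;
        if (sourceRow == previousSourceRow) {
            // Same source row as the previous output row: copy it wholesale.
            std::memcpy(outLine, outLine - targetWidth,
                        static_cast<std::size_t>(targetWidth) * sizeof(std::uint32_t));
            continue;
        }
        const std::uint32_t* inLine = input + static_cast<std::size_t>(sourceRow) * sourceWidth;
        for (int x = 0; x < targetWidth; x++)
            outLine[x] = inLine[xmap[x]];
        previousSourceRow = sourceRow;
    }
}
```

The row copy only pays off when enlarging vertically; when shrinking, every output row maps to a different source row and the memcpy branch is never taken.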
The algorithm is fine, but you can utilize massive parallelization by submitting your image to the GPU. If you use OpenGL, simply creating a context of the new size and providing a properly sized quad can give you inherent nearest-neighbor calculations. OpenGL could also give you access to other resize sampling techniques by simply changing the properties of the texture you read from (which would amount to a single GL command, and could easily be a parameter to your resize function).
Also later in development, you could simply swap out a shader for other blending techniques which also keeps you utilizing your wonderful GPU processor of image processing glory.
Also, since you aren't using any fancy geometry it can become almost trivial to write the program. It would be a little more involved than your algorithm, but it could perform magnitudes faster depending on image size.
I hope I didn't break anything. This combines some of the suggestions posted thus far and is about 30% faster. I'm amazed that is all we got. I did not actually check the destination image to see if it was right.
Changes:
- remove multiplies from inner loop (10% improvement)
- uint32_t instead of uint8_t (10% improvement)
- __restrict keyword (1% improvement)
This was on an i7 x64 machine running Windows, compiled with MSVC 2013. You will have to change the __restrict keyword for other compilers.
void resizeNearestNeighbor2_32(const uint8_t* __restrict input, uint8_t* __restrict output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
const uint32_t* input32 = (const uint32_t*)input;
uint32_t* output32 = (uint32_t*)output;
const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
const int y_ratio = (int)((sourceHeight << 16) / targetHeight);
int x_ratio_with_color = x_ratio;
for (int y = 0; y < targetHeight; y++)
{
int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
int i_xdest = y * targetWidth;
int source_x_offset = 0;
int startingOffset = y2_xsource;
const uint32_t * inputLine = input32 + startingOffset;
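for (int x = 0; x < targetWidth; x++)
{
// Read the source index before advancing, so the first pixel maps to offset 0.
int sourceOffset = source_x_offset >> 16;
// Write through the 32-bit pointer; output is the 8-bit one.
output32[i_xdest] = inputLine[sourceOffset];
i_xdest += 1;
source_x_offset += x_ratio_with_color;
}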
}
}
I have a 13 x 13 array of pixels, and I am using a function to draw a circle onto them. (The screen is 13 x 13, which may seem strange, but it's an array of LEDs, so that explains it.)
unsigned char matrix[13][13];
const unsigned char ON = 0x01;
const unsigned char OFF = 0x00;
Here is the first implementation I thought up. (It's inefficient, which is a particular problem as this is an embedded systems project, 80 MHz processor.)
// Draw a circle
// mode is 'ON' or 'OFF'
inline void drawCircle(float rad, unsigned char mode)
{
for(int ix = 0; ix < 13; ++ ix)
{
for(int jx = 0; jx < 13; ++ jx)
{
float r; // Radial
float s; // Angular ("theta")
matrix_to_polar(ix, jx, &r, &s); // Converts index coordinates
// specified by ix and jx to polar
// coordinates r and s, where s is
// the angle. This function just
// translates by 6.0 to get cartesian
// coordinates and then converts to polar.
if(r < rad)
{
matrix[ix][jx] = mode; // Turn pixel in matrix 'ON' or 'OFF'
}
}
}
}
I hope that's clear. It's pretty simple, but then I programmed it so I know how it's supposed to work. If you'd like more info / explanation then I can add some more code / comments.
Drawing several circles, e.g. 4 to 6, is very slow, hence I'm asking for advice on a more efficient algorithm to draw the circles.
EDIT: Managed to double the performance by making the following modification:
The function calling the drawing used to look like this:
for(;;)
{
clearAll(); // Clear matrix
for(int ix = 0; ix < 6; ++ ix)
{
rad[ix] += rad_incr_step;
drawRing(rad[ix], rad[ix] - rad_width);
}
if(rad[5] >= 7.0)
{
for(int ix = 0; ix < 6; ++ ix)
{
rad[ix] = rad_space_step * (float)(-ix);
}
}
writeAll(); // Write
}
I added the following check:
if(rad[ix] - rad_width < 7.0)
drawRing(rad[ix], rad[ix] - rad_width);
This checks whether the ring is completely outside the screen. It increased the performance by a factor of about 2, but ideally I'd like to make the circle drawing itself more efficient to increase it further.
EDIT 2: Similarly adding the reverse check increased performance further.
if(rad[ix] >= 0.0)
drawRing(rad[ix], rad[ix] - rad_width);
Performance is now pretty good, but again I have made no modifications to the actual drawing code of the circles and this is what I was intending to focus on with this question.
Edit 3: Matrix to polar:
inline void matrix_to_polar(int i, int j, float* r, float* s)
{
float x, y;
matrix_to_cartesian(i, j, &x, &y);
calcPolar(x, y, r, s);
}
inline void matrix_to_cartesian(int i, int j, float* x, float* y)
{
*x = getX(i);
*y = getY(j);
}
inline void calcPolar(float x, float y, float* r, float* s)
{
*r = sqrt(x * x + y * y);
*s = atan2(y, x);
}
inline float getX(int xc)
{
return (float(xc) - 6.0);
}
inline float getY(int yc)
{
return (float(yc) - 6.0);
}
In response to Clifford that's actually a lot of function calls if they are not inlined.
Edit 4: drawRing just draws 2 circles, firstly an outer circle with mode ON and then an inner circle with mode OFF. I am fairly confident that there is a more efficient method of drawing such a shape too, but that distracts from the question.
You're doing a lot of calculations that aren't really needed. For example, you're calculating the angle of the polar coordinates, but never use it. The square root can also easily be avoided by comparing the square of the values.
Without doing anything fancy, something like this should be a good start:
int intRad = (int)rad;
int intRadSqr = (int)(rad * rad);
for (int ix = 0; ix <= intRad; ++ix)
{
for (int jx = 0; jx <= intRad; ++jx)
{
if (ix * ix + jx * jx <= intRadSqr)
{
matrix[6 - ix][6 - jx] = mode;
matrix[6 - ix][6 + jx] = mode;
matrix[6 + ix][6 - jx] = mode;
matrix[6 + ix][6 + jx] = mode;
}
}
}
This does all the math in integer format, and takes advantage of the circle symmetry.
Variation of the above, based on feedback in the comments:
int intRad = (int)rad;
int intRadSqr = (int)(rad * rad);
for (int ix = 0; ix <= intRad; ++ix)
{
for (int jx = 0; ix * ix + jx * jx <= intRadSqr; ++jx)
{
matrix[6 - ix][6 - jx] = mode;
matrix[6 - ix][6 + jx] = mode;
matrix[6 + ix][6 - jx] = mode;
matrix[6 + ix][6 + jx] = mode;
}
}
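Note that both versions above index matrix[6 + ix] and matrix[6 - ix] without a bounds check, so a radius of 7 or more would write outside the 13 x 13 array. A bounds-safe sketch of the same squared-distance test is shown below. The clamp to 6 and the name drawDisc are my own additions, and the comparison is kept in float so that fractional radii behave like the original r < rad test as closely as possible.

```cpp
unsigned char matrix[13][13];

// Filled circle centred on (6, 6) using squared distances, with no sqrt or
// atan2. The scan range is clamped so every index stays inside 0..12.
void drawDisc(float rad, unsigned char mode)
{
    if (rad < 0.0f) return;            // nothing to draw
    int limit = (int)rad;
    if (limit > 6) limit = 6;          // clamp to the display
    const float radSqr = rad * rad;
    for (int dx = -limit; dx <= limit; ++dx)
    {
        for (int dy = -limit; dy <= limit; ++dy)
        {
            if ((float)(dx * dx + dy * dy) <= radSqr)
            {
                matrix[6 + dx][6 + dy] = mode;
            }
        }
    }
}
```

This scans the full bounding square rather than one quadrant, which costs a few more comparisons but keeps the sketch short.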
Don't underestimate the cost of even basic arithmetic using floating point on a processor with no FPU. It seems unlikely that floating point is necessary, but the details of its use are hidden in your matrix_to_polar() implementation.
Your current implementation considers every pixel as a candidate - that is also unnecessary.
Using the equation $y = c_y \pm \sqrt{rad^2 - (x - c_x)^2}$, where $(c_x, c_y)$ is the centre (7, 7 in this case), and a suitable integer square root implementation, the circle can be drawn thus:
void drawCircle( int rad, unsigned char mode )
{
int r2 = rad * rad ;
for( int x = 7 - rad; x <= 7 + rad; x++ )
{
int dx = x - 7 ;
int dy = isqrt( r2 - dx * dx ) ;
matrix[x][7 - dy] = mode ;
matrix[x][7 + dy] = mode ;
}
}
In my test I used the isqrt() below based on code from here, but given that the maximum r2 necessary is 169 (13²), you could implement a 16 or even 8 bit optimised version if necessary. If your processor is 32 bit, this is probably fine.
uint32_t isqrt(uint32_t n)
{
uint32_t root = 0, bit, trial;
bit = (n >= 0x10000) ? 1<<30 : 1<<14;
do
{
trial = root+bit;
if (n >= trial)
{
n -= trial;
root = trial+bit;
}
root >>= 1;
bit >>= 2;
} while (bit);
return root;
}
All that said, on such a low-resolution device, you will probably get better quality circles and faster performance by hand-generating bitmap lookup tables for each radius required. If memory is an issue, then a single circle needs only 7 bytes to describe a 7 x 7 quadrant that you can reflect to the other three quadrants, or for greater performance you could use 7 x 16-bit words to describe a semi-circle (since reversing bit order is more expensive than reversing array access, unless you are using an ARM Cortex-M with bit-banding). Using semi-circle look-ups, 13 circles would need 13 x 7 x 2 bytes (182 bytes); quadrant look-ups would need 7 x 8 x 13 bits (91 bytes). You may find that is fewer bytes than the code space required to calculate the circles.
For a slow embedded device with only a 13x13 element display, you should really just make a look-up table. For example:
struct ComputedCircle
{
float rMax;
char col[13][2];
};
Where the draw routine uses rMax to determine which LUT element to use. For example, if you have 2 elements with one rMax = 1.4f, the other = 1.7f, then any radius between 1.4f and 1.7f will use that entry.
The column elements would specify zero, one, or two line segments per row, which can be encoded in the lower and upper 4 bits of each char. -1 can be used as a sentinel value for nothing-at-this-row. It is up to you how many look-up table entries to use, but with a 13x13 grid you should be able to encode every possible outcome of pixels with well under 100 entries, and a reasonable approximation using only 10 or so. You can also trade off compression for draw speed as well, e.g. putting the col[13][2] matrix in a flat list and encoding the number of rows defined.
I would accept MooseBoy's answer if only he explained the method he proposes better. Here's my take on the lookup table approach.
Solve it with a lookup table
The 13x13 display is quite small, and if you only need circles which are fully visible within this pixel count, you will get by with quite a small table. Even if you need larger circles, it should still be better than any algorithmic approach if you need it to be fast (and have the ROM to store it).
How to do it
You basically need to define what each possible circle looks like on the 13x13 display. It is not sufficient to just produce snapshots for the 13x13 display, as it is likely you would want to plot the circles at arbitrary positions. My take on a table entry would look like this:
struct circle_entry_s{
unsigned int diameter;
unsigned int offset;
};
The entry would map a given diameter in pixels to offsets in a large byte table containing the shape of the circles. For example for diameter 9, the byte sequence would look like this:
0x1CU, 0x00U, /* 000111000 */
0x63U, 0x00U, /* 011000110 */
0x41U, 0x00U, /* 010000010 */
0x80U, 0x80U, /* 100000001 */
0x80U, 0x80U, /* 100000001 */
0x80U, 0x80U, /* 100000001 */
0x41U, 0x00U, /* 010000010 */
0x63U, 0x00U, /* 011000110 */
0x1CU, 0x00U, /* 000111000 */
The diameter specifies how many bytes of the table belong to the circle: one row of pixels are generated from (diameter + 7) >> 3 bytes, and the number of rows correspond to the diameter. The output code of these can be made quite fast, while the lookup table is sufficiently compact to get even larger than the 13x13 display circles defined in it if needed.
Note that defining circles this way for odd and even diameters may or may not appeal to you when they are output by centre location. The odd-diameter circles will appear to have their centre in the "middle" of a pixel, while the even-diameter circles will appear to have their centre on the "corner" of a pixel.
You may also find it useful later to refine the overall method so that you have multiple circles of different apparent sizes but the same pixel radius. It depends on your goal: if you want some kind of smooth animation, you may get there eventually.
Algorithmic solutions I think mostly will perform poorly here, since with this limited display surface really every pixel's state counts for the appearance.
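To make the table layout concrete, here is a sketch of an output routine for the diameter 9 entry shown above. It assumes the bits are stored most significant bit first, as in the comments next to the byte sequence, and it clips against a 13x13 matrix. The names circle9 and plotCircleBitmap are illustrative.

```cpp
#include <cstdint>

unsigned char matrix[13][13];

// Shape of the diameter 9 circle, two bytes per row, MSB first.
static const std::uint8_t circle9[] = {
    0x1CU, 0x00U, 0x63U, 0x00U, 0x41U, 0x00U,
    0x80U, 0x80U, 0x80U, 0x80U, 0x80U, 0x80U,
    0x41U, 0x00U, 0x63U, 0x00U, 0x1CU, 0x00U,
};

// Plots a bitmap circle with its top-left corner at (top, left).
// Pixels that fall outside the display are skipped.
void plotCircleBitmap(const std::uint8_t* rows, unsigned diameter,
                      int top, int left, unsigned char mode)
{
    const unsigned bytesPerRow = (diameter + 7) >> 3;
    for (unsigned r = 0; r < diameter; ++r) {
        for (unsigned c = 0; c < diameter; ++c) {
            const std::uint8_t byte = rows[r * bytesPerRow + (c >> 3)];
            const bool set = ((byte >> (7 - (c & 7))) & 1U) != 0;
            const int y = top + static_cast<int>(r);
            const int x = left + static_cast<int>(c);
            if (set && y >= 0 && y < 13 && x >= 0 && x < 13)
                matrix[y][x] = mode;
        }
    }
}
```

A call such as plotCircleBitmap(circle9, 9, 2, 2, 1) would centre this circle on the display. A faster version could shift whole bytes into place instead of testing one bit at a time.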
I've been working on implementing black-body radiation according to Planck's law with the following:
double BlackBody(double T, double wavelength) {
wavelength /= 1e9; // pre-scale wavelength to meters
static const double h = 6.62606957e-34; // Planck constant
static const double c = 299792458.0; // speed of light in vacuum
static const double k = 1.3806488e-23; // Boltzmann constant
double exparg = h*c / (k*wavelength*T);
double exppart = std::exp(exparg) - 1.0;
double constpart = (2.0*h*c*c);
double powpart = pow(wavelength, -5.0);
double v = constpart * powpart / exppart;
return v;
}
I have a float[max-min+1] array, where static const int max=780, static const int min = 380. I simply iterate over the array, and put in what the BlackBody gives for the wavelength (wavelength = array-index + min). The IntensitySpectrum::BlackBody performs this iteration, while both min and max are static member vars, and the array is also inside IntensitySpectrum.
IntensitySpectrum spectrum;
Vec3 rgb = spectrum.ToRGB();
rgb /= std::max(rgb.x, std::max(rgb.y, rgb.z));
for (int xc = 0; xc < grapher.GetWidth(); xc++) {
if (xc % 10 == 0) {
spectrum.BlackBody(200.f + xc * 200.f);
spectrum.Scale(1.0f / 1e+14f);
rgb = spectrum.ToRGB();
rgb /= std::max(rgb.x, std::max(rgb.y, rgb.z));
}
for (int yc = 20; yc < 40; yc++) {
grapher(xc, yc) = grapher.FloatToUint(rgb.x, rgb.y, rgb.z);
}
}
The problem is that the line spectrum.BlackBody() sets the 0th element of the array to NaN, and only the 0th. It also does not happen on the very first iteration, only on all the following ones where xc >= 10.
The text from the VS debugger:
spectrum = {intensity=0x009bec50 {-1.#IND0000, 520718784., 537559104., 554832896., 572547904., 590712128., 609333504., ...} }
I tracked the error down: exppart in the ::BlackBody() function becomes NaN. Basically exp() returns NaN, even though its argument is near 2.0, so definitely not an overflow. But this happens only for array index 0. It magically starts working for the remaining 400 indices.
I know memory overruns might cause things like that. That's why I double checked my memory handling.
I'm linking Vec3 from another self-made library, which is much bigger, and might contain errors, but what I use from Vec3 has nothing to do with memory.
After many hours I'm completely clueless. What else can cause this? Is the optimizer or WINAPI fooling me...? (Uhm, yes, the program creates a window, with WINAPI, and uses a nearly empty WndProc that calls my code on WM_PAINT.)
Thanks for your help in advance.
Sorry for making it unclear. This is the layout:
// member
class IntensitySpectrum {
public:
void BlackBody(float temperature) {
// ...
this->intensity[i] = ::BlackBody(temperature, wavelength(i));
// ...
}
private:
static const int min = 380;
static const int max = 780;
float intensity[max-min+1];
};
// global
double BlackBody(double T, double wavelength);
If you happen to be using MSVC 2013, one possible explanation is that you have some code somewhere that is trying to convert a float infinity to int. A bug in MSVC 2013 causes an unbalanced push on the x87 FPU stack when this happens. Trigger that bug 8 times and your FPU stack is totally full, and any subsequent attempt to push a value (such as calling 'exp()') will result in an 'invalid operation' and return an indefinite (like 1.#IND). Note that even if you are compiling with SSE2 floating point instructions, this bug can still bite because the calling convention dictates that floating point return values are returned on the top of the FPU stack.
To check if this is your issue, have a look at your FPU registers just prior to the bad call to 'exp()'. If your TAGS register is all zero, then your FPU stack is full.
http://connect.microsoft.com/VisualStudio/feedback/details/806362/vc12-pollutes-the-floating-point-stack-when-casting-infinity-nan-to-unsigned-long
MS claims this will be fixed in update 2 for MSVC 2013.
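Until a fixed compiler is available, one workaround is to never let a non-finite value reach a float-to-int conversion. The helper below is a sketch of that idea. It does not fix the compiler bug itself, and the name floatToIntChecked and the fallback parameter are assumptions of mine.

```cpp
#include <climits>
#include <cmath>

// Converts value to int only when the result is representable.
// Infinity, NaN and out-of-range values return fallback instead.
int floatToIntChecked(float value, int fallback)
{
    if (!std::isfinite(value))
        return fallback;
    if (value >= static_cast<float>(INT_MAX) || value <= static_cast<float>(INT_MIN))
        return fallback;
    return static_cast<int>(value);
}
```

In the code from the question, a likely place for such a check would be wherever the RGB floats are turned into integer channel values, for example inside FloatToUint, because dividing by a zero maximum component yields infinity or NaN.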
The following function call only has 1 parameter:
spectrum.BlackBody(200.f + xc * 200.f);
So it cannot be calling the function you defined as
double BlackBody(double T, double wavelength)
If you look at the ::BlackBody implementation, I'm betting you have a divide by 0 error somewhere.