Optimizing custom IDCT calculation

Optimizing custom IDCT calculation - c++

For an image format I have to decode a jpeg like compression. The steps are pretty standard, however the color coding and the produced IDCT values seem to be slightly different (I think, I am no expert). This is why using a standard JPEG library produced invalid results it seems and I had to implement it myself.
Now I am not really happy with the performance. After optimizing the naive approach by doing some caching of data that does not change I need ~150ms to decode a 1024x1024 image. Of that 120ms is spent in the idct function which looks like this:
float idctHelper(const int16_t *inBlock, int32_t u, int32_t v, int32_t blockWidth, int32_t blockHeight) {
glm::vec<4, float, glm::packed_lowp> vec1{};
glm::vec<4, float, glm::packed_lowp> vec2{};
glm::vec<4, float, glm::packed_lowp> vec3{};
float result = 0.0f;
for (auto y = 0; y < blockHeight; ++y) {
for (auto x = 0; x < blockWidth; x += 4) {
const auto idx = (v * 8 + u) * 64 + y * 8 + x;
vec1 = glm::vec<4, float, glm::packed_lowp>(inBlock[y * blockWidth + x], inBlock[y * blockWidth + x + 1], inBlock[y * blockWidth + x + 2], inBlock[y * blockWidth + x + 3]);
vec2 = glm::vec<4, float, glm::packed_lowp>(idctLookup[idx], idctLookup[idx + 1], idctLookup[idx + 2], idctLookup[idx + 3]);
vec3 = vec1 * vec2;
result += vec3.x + vec3.y + vec3.z + vec3.w;
}
}
return result;
}
template<typename T, typename U = T>
U clamp(T value, T min, T max) {
return static_cast<U>(std::min<T>(std::max<T>(value, min), max));
}
void idct(int16_t *outBlock, int16_t *inBlock, bool isLuminance, int32_t blockWidth = 8, int32_t blockHeight = 8) {
for (auto y = 0; y < blockHeight; ++y) {
for (auto x = 0; x < blockWidth; ++x) {
auto value = static_cast<int16_t>(std::round(
0.25f * idctHelper(inBlock, x, y, blockWidth, blockHeight)));
if (isLuminance) {
value = clamp<int16_t>(static_cast<int16_t>(value + 128), 0, 255);
} else {
value = clamp<int16_t>(value, -256, 255);
}
outBlock[y * blockWidth + x] = value;
}
}
}
As you can see I have tried to use (potentially) SIMD and low precision operations to speed up the idctHelper but now I am kinda lost on where I can continue optimizing this. I would really appreciate any idea or hint.

Related

SSE mean filter in c++ and OpenCV

I would like to modify the code for an OpenCV mean filter to use Intel intrinsics. I'm an SSE newbie and I really don't know where to start from. I checked a lot of resources on the web, but I didn't have a lot of success.
This is the program:
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
using namespace cv;
using namespace std;
int main()
{
int A[3][3] = { { 1, 1, 1 }, { 1, 1, 1 }, { 1, 1, 1 } };
int c = 0;
int d = 0;
Mat var1 = imread("images.jpg", 1);
Mat var2(var1.rows, var1.cols, CV_8UC3, Scalar(0, 0, 0));
for (int i = 0; i < var1.rows; i++)
{
var2.at<Vec3b>(i, 0) = var1.at<Vec3b>(i, 0);
var2.at<Vec3b>(i, var1.cols - 1) = var1.at<Vec3b>(i, var1.cols - 1);
}
for (int i = 0; i < var1.cols; i++)
{
var2.at<Vec3b>(0, i) = var1.at<Vec3b>(0, i);
var2.at<Vec3b>(var1.rows - 1, i) = var1.at<Vec3b>(var1.rows - 1, i);
}
for (int i = 0; i < var1.rows; i++) {
for (int j = 0; j < var1.cols; j++)
{
c = 0;
for (int m = i; m < var1.rows; m++, c++)
{
if (c < 3)
{
d = 0;
for (int n = j; n < var1.cols; n++, d++)
{
if (d < 3)
{
if ((i + 1) < var1.rows && (j + 1) < var1.cols)
{
var2.at<Vec3b>(i + 1, j + 1)[0] += var1.at<Vec3b>(m, n)[0] * A[m - i][n - j] / 9;
var2.at<Vec3b>(i + 1, j + 1)[1] += var1.at<Vec3b>(m, n)[1] * A[m - i][n - j] / 9;
var2.at<Vec3b>(i + 1, j + 1)[2] += var1.at<Vec3b>(m, n)[2] * A[m - i][n - j] / 9;
}
}
}
}
}
}
}
imshow("window1", var1);
imshow("window2", var2);
waitKey(0);
return(0);
}
The part that I find difficult is understanding how to convert the innermost 2 loops, where the mean value is computed. Any help will be greatly appreciated.

Just for fun, I thought it might be interesting to start with a naive implementation of a 3x3 mean filter and then optimise this incrementally, ending up with a SIMD (SSE) implementation, measuring the throughput improvement at each stage.
1 - Mean_3_3_ref - reference implementation
This is just a simple scalar implementation which we'll use as a baseline for throughput and for validating further implementations:
void Mean_3_3_ref(const Mat &image_in, Mat &image_out)
{
for (int y = 1; y < image_in.rows - 1; ++y)
{
for (int x = 1; x < image_in.cols - 1; ++x)
{
for (int c = 0; c < 3; ++c)
{
image_out.at<Vec3b>(y, x)[c] = (image_in.at<Vec3b>(y - 1, x - 1)[c] +
image_in.at<Vec3b>(y - 1, x )[c] +
image_in.at<Vec3b>(y - 1, x + 1)[c] +
image_in.at<Vec3b>(y , x - 1)[c] +
image_in.at<Vec3b>(y , x )[c] +
image_in.at<Vec3b>(y , x + 1)[c] +
image_in.at<Vec3b>(y + 1, x - 1)[c] +
image_in.at<Vec3b>(y + 1, x )[c] +
image_in.at<Vec3b>(y + 1, x + 1)[c] + 4) / 9;
}
}
}
}
2 - Mean_3_3_scalar - somewhat optimised scalar implementation
Exploit the redundancy in summing successive columns - we save the last two column sums so that we only need to calculate one new column sum (per channel) on each iteration:
void Mean_3_3_scalar(const Mat &image_in, Mat &image_out)
{
for (int y = 1; y < image_in.rows - 1; ++y)
{
int r_1, g_1, b_1;
int r0, g0, b0;
int r1, g1, b1;
r_1 = g_1 = b_1 = 0;
r0 = g0 = b0 = 0;
for (int yy = y - 1; yy <= y + 1; ++yy)
{
r_1 += image_in.at<Vec3b>(yy, 0)[0];
g_1 += image_in.at<Vec3b>(yy, 0)[1];
b_1 += image_in.at<Vec3b>(yy, 0)[2];
r0 += image_in.at<Vec3b>(yy, 1)[0];
g0 += image_in.at<Vec3b>(yy, 1)[1];
b0 += image_in.at<Vec3b>(yy, 1)[2];
}
for (int x = 1; x < image_in.cols - 1; ++x)
{
r1 = g1 = b1 = 0;
for (int yy = y - 1; yy <= y + 1; ++yy)
{
r1 += image_in.at<Vec3b>(yy, x + 1)[0];
g1 += image_in.at<Vec3b>(yy, x + 1)[1];
b1 += image_in.at<Vec3b>(yy, x + 1)[2];
}
image_out.at<Vec3b>(y, x)[0] = (r_1 + r0 + r1 + 4) / 9;
image_out.at<Vec3b>(y, x)[1] = (g_1 + g0 + g1 + 4) / 9;
image_out.at<Vec3b>(y, x)[2] = (b_1 + b0 + b1 + 4) / 9;
r_1 = r0;
g_1 = g0;
b_1 = b0;
r0 = r1;
g0 = g1;
b0 = b1;
}
}
}
3 - Mean_3_3_scalar_opt - further optimised scalar implementation
As per Mean_3_3_scalar, but also remove OpenCV overheads by caching pointers to each row that we are working on:
void Mean_3_3_scalar_opt(const Mat &image_in, Mat &image_out)
{
for (int y = 1; y < image_in.rows - 1; ++y)
{
const uint8_t * const input_1 = image_in.ptr(y - 1);
const uint8_t * const input0 = image_in.ptr(y);
const uint8_t * const input1 = image_in.ptr(y + 1);
uint8_t * const output = image_out.ptr(y);
int r_1 = input_1[0] + input0[0] + input1[0];
int g_1 = input_1[1] + input0[1] + input1[1];
int b_1 = input_1[2] + input0[2] + input1[2];
int r0 = input_1[3] + input0[3] + input1[3];
int g0 = input_1[4] + input0[4] + input1[4];
int b0 = input_1[5] + input0[5] + input1[5];
for (int x = 1; x < image_in.cols - 1; ++x)
{
int r1 = input_1[x * 3 + 3] + input0[x * 3 + 3] + input1[x * 3 + 3];
int g1 = input_1[x * 3 + 4] + input0[x * 3 + 4] + input1[x * 3 + 4];
int b1 = input_1[x * 3 + 5] + input0[x * 3 + 5] + input1[x * 3 + 5];
output[x * 3 ] = (r_1 + r0 + r1 + 4) / 9;
output[x * 3 + 1] = (g_1 + g0 + g1 + 4) / 9;
output[x * 3 + 2] = (b_1 + b0 + b1 + 4) / 9;
r_1 = r0;
g_1 = g0;
b_1 = b0;
r0 = r1;
g0 = g1;
b0 = b1;
}
}
}
4 - Mean_3_3_blur - leverage OpenCV's blur function
OpenCV has a function called blur, which is based on the function boxFilter, which is just another name for a mean filter. Since OpenCV code has been quite heavily optimised over the years (using SIMD in many cases), let's see if this makes a big improvement over our scalar code:
void Mean_3_3_blur(const Mat &image_in, Mat &image_out)
{
blur(image_in, image_out, Size(3, 3));
}
5 - Mean_3_3_SSE - SSE implementation
This a reasonably efficient SIMD implementation. It uses the same techniques as the scalar code above in order to eliminate redundancy in processing successive pixels:
#include <tmmintrin.h> // Note: requires SSSE3 (aka MNI)
inline void Load2(const ssize_t offset, const uint8_t* const src, __m128i& vh, __m128i& vl)
{
const __m128i v = _mm_loadu_si128((__m128i *)(src + offset));
vh = _mm_unpacklo_epi8(v, _mm_setzero_si128());
vl = _mm_unpackhi_epi8(v, _mm_setzero_si128());
}
inline void Store2(const ssize_t offset, uint8_t* const dest, const __m128i vh, const __m128i vl)
{
__m128i v = _mm_packus_epi16(vh, vl);
_mm_storeu_si128((__m128i *)(dest + offset), v);
}
template <int SHIFT> __m128i ShiftL(const __m128i v0, const __m128i v1) { return _mm_alignr_epi8(v1, v0, SHIFT * sizeof(short)); }
template <int SHIFT> __m128i ShiftR(const __m128i v0, const __m128i v1) { return _mm_alignr_epi8(v1, v0, 16 - SHIFT * sizeof(short)); }
template <int CHANNELS> void Mean_3_3_SSE_Impl(const Mat &image_in, Mat &image_out)
{
const int nx = image_in.cols;
const int ny = image_in.rows;
const int kx = 3 / 2; // x, y borders
const int ky = 3 / 2;
const int kScale = 3 * 3; // scale factor = total number of pixels in sum
const __m128i vkScale = _mm_set1_epi16((32768 + kScale / 2) / kScale);
const int nx0 = ((nx + kx) * CHANNELS + 15) & ~15; // round up total width to multiple of 16
int x, y;
for (y = ky; y < ny - ky; ++y)
{
const uint8_t * const input_1 = image_in.ptr(y - 1);
const uint8_t * const input0 = image_in.ptr(y);
const uint8_t * const input1 = image_in.ptr(y + 1);
uint8_t * const output = image_out.ptr(y);
__m128i vsuml_1, vsumh0, vsuml0;
__m128i vh, vl;
vsuml_1 = _mm_set1_epi16(0);
Load2(0, input_1, vsumh0, vsuml0);
Load2(0, input0, vh, vl);
vsumh0 = _mm_add_epi16(vsumh0, vh);
vsuml0 = _mm_add_epi16(vsuml0, vl);
Load2(0, input1, vh, vl);
vsumh0 = _mm_add_epi16(vsumh0, vh);
vsuml0 = _mm_add_epi16(vsuml0, vl);
for (x = 0; x < nx0; x += 16)
{
__m128i vsumh1, vsuml1, vsumh, vsuml;
Load2((x + 16), input_1, vsumh1, vsuml1);
Load2((x + 16), input0, vh, vl);
vsumh1 = _mm_add_epi16(vsumh1, vh);
vsuml1 = _mm_add_epi16(vsuml1, vl);
Load2((x + 16), input1, vh, vl);
vsumh1 = _mm_add_epi16(vsumh1, vh);
vsuml1 = _mm_add_epi16(vsuml1, vl);
vsumh = _mm_add_epi16(vsumh0, ShiftR<CHANNELS>(vsuml_1, vsumh0));
vsuml = _mm_add_epi16(vsuml0, ShiftR<CHANNELS>(vsumh0, vsuml0));
vsumh = _mm_add_epi16(vsumh, ShiftL<CHANNELS>(vsumh0, vsuml0));
vsuml = _mm_add_epi16(vsuml, ShiftL<CHANNELS>(vsuml0, vsumh1));
// round mean
vsumh = _mm_mulhrs_epi16(vsumh, vkScale);
vsuml = _mm_mulhrs_epi16(vsuml, vkScale);
Store2(x, output, vsumh, vsuml);
vsuml_1 = vsuml0;
vsumh0 = vsumh1;
vsuml0 = vsuml1;
}
}
}
void Mean_3_3_SSE(const Mat &image_in, Mat &image_out)
{
const int channels = image_in.channels();
switch (channels)
{
case 1:
Mean_3_3_SSE_Impl<1>(image_in, image_out);
break;
case 3:
Mean_3_3_SSE_Impl<3>(image_in, image_out);
break;
default:
throw("Unsupported format.");
break;
}
}
Results
I benchmarked all of the above implementations on an 8th gen Core i9 (MacBook Pro 16,1) at 2.4 GHz, with an image size of 2337 rows x 3180 cols. The compiler was Apple clang version 12.0.5 (clang-1205.0.22.9) and the only optimisation switch was -O3. OpenCV version was 4.5.0 (via Homebrew). (Note: I verified that for Mean_3_3_blur the cv::blur function was dispatched to an AVX2 implementation.) The results:
Mean_3_3_ref 62153 µs
Mean_3_3_scalar 41144 µs = 1.51062x
Mean_3_3_scalar_opt 26238 µs = 2.36882x
Mean_3_3_blur 20121 µs = 3.08896x
Mean_3_3_SSE 4838 µs = 12.84680x
Notes
I have ignored the border pixels in all implementations - if required these can either be filled with pixels from the original image or using some other form of edge pixel processing.
The code is not "industrial strength" - it was only written for benchmarking purposes.
There are a few further possible optimisations, e.g. use wider SIMD (AVX2, AVX512), exploit the redundancy between successive rows, etc - these are left as an exercise for the reader.
The SSE implementation is fastest, but this comes at the cost of increased complexity, decreased mantainability and reduced portability.
The OpenCV blur function gives the second best performance, and should probably be the preferred solution if it meets throughput requirements - it's the simplest solution, and simple is good.

Why are my values created in a for loop not working inside the body of the loop?

just had a quick couple questions as to why my I keep getting an error saying that x and y are not assigned a value at computeBarycentri2d(x, y, t.v) when in the if(!insideTriangle(x, y, t.v)) the values are assigned as an int value. The error I keep receiving for each value is "identifier x is not defined"
The other issue I am running into is that the continue statement in the if(zp >= depth_buf[y * width + x]) wont work giving me the error "a continue statement can only be used in a loop".
Any sort of help on how to fix these errors is greatly appreciated
void rst::rasterizer::rasterize_triangle(const Triangle& t, const std::array<Eigen::Vector3f, 3>& view_pos)
{
std::array v = t.toVector4();
float trix[3] = {t.v[0][0], t.v[1][0], t.v[2][0] };
float triy[3] = {t.v[0][1], t.v[1][1], t.v[2][1] };
std::pair<float*, float*> xrange = std::minmax_element(trix, trix + 3);
std::pair<float*, float*> yrange = std::minmax_element(triy, triy + 3);
for (int x = std::floor(*xrange.first); x < std::ceil(*xrange.second); ++x)
for(int y = std::floor(*yrange.first); y < std::ceil(*yrange.second); ++y)
if(!insideTriangle(x, y, t.v)) continue;
auto[alpha, beta, gamma] = computeBarycentric2D(x, y, t.v);
float Z = 1.0 / (alpha / v[0].w() + beta / v[1].w() + gamma / v[2].w());
float zp = alpha * v[0].z() / v[0].w() + beta * v[1].z() / v[1].w() + gamma * v[2].z() / v[2].w();
zp *= Z;
if(zp >= depth_buf[y * width + x]) continue;
depth_buf[y * width + x] = zp;
}

So from the help of #1201ProgramAlarm I found out that all i needed to do was add in brackets to each for loop which got rid of all of my errors!
Here is the code updated:
void rst::rasterizer::rasterize_triangle(const Triangle& t, const std::array<Eigen::Vector3f, 3>& view_pos)
{
std::array v = t.toVector4();
float trix[3] = {t.v[0][0], t.v[1][0], t.v[2][0] };
float triy[3] = {t.v[0][1], t.v[1][1], t.v[2][1] };
std::pair<float*, float*> xrange = std::minmax_element(trix, trix + 3);
std::pair<float*, float*> yrange = std::minmax_element(triy, triy + 3);
for (int x = std::floor(*xrange.first); x < std::ceil(*xrange.second); ++x)
{
for(int y = std::floor(*yrange.first); y < std::ceil(*yrange.second); ++y)
{
if(!insideTriangle(x, y, t.v)) continue;
auto[alpha, beta, gamma] = computeBarycentric2D(x, y, t.v);
float Z = 1.0 / (alpha / v[0].w() + beta / v[1].w() + gamma / v[2].w());
float zp = alpha * v[0].z() / v[0].w() + beta * v[1].z() / v[1].w() + gamma * v[2].z() / v[2].w();
zp *= Z;
if(zp >= depth_buf[y * width + x]) continue;
depth_buf[y * width + x] = zp;
}
}
}

Fast implementation of covariance of two 8-bit arrays

I need to compare a big amount of similar images of small size (up to 200x200).
So I try to implement SSIM (Structural similarity see https://en.wikipedia.org/wiki/Structural_similarity ) algorithm.
SSIM requires calculation of covariance of two 8-bit gray images.
A trivial implementation look like:
float SigmaXY(const uint8_t * x, const uint8_t * y, size_t size, float averageX, float averageY)
{
float sum = 0;
for(size_t i = 0; i < size; ++i)
sum += (x[i] - averageX) * (y[i] - averageY);
return sum / size;
}
But it has poor performance.
So I hope to improve it with using SIMD or CUDA (I heard that it can be done).
Unfortunately I have no experience to do this.
How it will look? And where I have to go?

I have another nice solution!
At first I want to mention some mathematical formulas:
averageX = Sum(x[i])/size;
averageY = Sum(y[i])/size;
And therefore:
Sum((x[i] - averageX)*(y[i] - averageY))/size =
Sum(x[i]*y[i])/size - Sum(x[i]*averageY)/size -
Sum(averageX*y[i])/size + Sum(averageX*averageY)/size =
Sum(x[i]*y[i])/size - averageY*Sum(x[i])/size -
averageX*Sum(y[i])/size + averageX*averageY*Sum(1)/size =
Sum(x[i]*y[i])/size - averageY*averageX -
averageX*averageY + averageX*averageY =
Sum(x[i]*y[i])/size - averageY*averageX;
It allows to modify our algorithm:
float SigmaXY(const uint8_t * x, const uint8_t * y, size_t size, float averageX, float averageY)
{
uint32_t sum = 0; // If images will have size greater then 256x256 than you have to use uint64_t.
for(size_t i = 0; i < size; ++i)
sum += x[i]*y[i];
return sum / size - averageY*averageX;
}
And only after that we can use SIMD (I used SSE2):
#include <emmintrin.h>
inline __m128i SigmaXY(__m128i x, __m128i y)
{
__m128i lo = _mm_madd_epi16(_mm_unpacklo_epi8(x, _mm_setzero_si128()), _mm_unpacklo_epi8(y, _mm_setzero_si128()));
__m128i hi = _mm_madd_epi16(_mm_unpackhi_epi8(y, _mm_setzero_si128()), _mm_unpackhi_epi8(y, _mm_setzero_si128()));
return _mm_add_epi32(lo, hi);
}
float SigmaXY(const uint8_t * x, const uint8_t * y, size_t size, float averageX, float averageY)
{
uint32_t sum = 0;
size_t i = 0, alignedSize = size/16*16;
if(size >= 16)
{
__m128i sums = _mm_setzero_si128();
for(; i < alignedSize; i += 16)
{
__m128i _x = _mm_loadu_si128((__m128i*)(x + i));
__m128i _y = _mm_loadu_si128((__m128i*)(y + i));
sums = _mm_add_epi32(sums, SigmaXY(_x, _y));
}
uint32_t _sums[4];
_mm_storeu_si128(_sums, sums);
sum = _sums[0] + _sums[1] + _sums[2] + _sums[3];
}
for(; i < size; ++i)
sum += x[i]*y[i];
return sum / size - averageY*averageX;
}

There is a SIMD implementation of the algorithm (I used SSE4.1):
#include <smmintrin.h>
template <int shift> inline __m128 SigmaXY(const __m128i & x, const __m128i & y, __m128 & averageX, __m128 & averageY)
{
__m128 _x = _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(x, shift)));
__m128 _y = _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(y, shift)));
return _mm_mul_ps(_mm_sub_ps(_x, averageX), _mm_sub_ps(_y, averageY))
}
float SigmaXY(const uint8_t * x, const uint8_t * y, size_t size, float averageX, float averageY)
{
float sum = 0;
size_t i = 0, alignedSize = size/16*16;
if(size >= 16)
{
__m128 sums = _mm_setzero_ps();
__m128 avgX = _mm_set1_ps(averageX);
__m128 avgY = _mm_set1_ps(averageY);
for(; i < alignedSize; i += 16)
{
__m128i _x = _mm_loadu_si128((__m128i*)(x + i));
__m128i _y = _mm_loadu_si128((__m128i*)(y + i));
sums = _mm_add_ps(sums, SigmaXY<0>(_x, _y, avgX, avgY);
sums = _mm_add_ps(sums, SigmaXY<4>(_x, _y, avgX, avgY);
sums = _mm_add_ps(sums, SigmaXY<8>(_x, _y, avgX, avgY);
sums = _mm_add_ps(sums, SigmaXY<12>(_x, _y, avgX, avgY);
}
float _sums[4];
_mm_storeu_ps(_sums, sums);
sum = _sums[0] + _sums[1] + _sums[2] + _sums[3];
}
for(; i < size; ++i)
sum += (x[i] - averageX) * (y[i] - averageY);
return sum / size;
}
I hope that it will useful for you.

How to speed up my bilinear interpolation?

I have spent countless hours trying to speed up my bilinear interpolation up, with no avail. I even tried a SSE version (a double version and a float version), but that was even slower than this version.
Does anyone have any ideas?
template <typename T>
__forceinline void interp2_mx(const T& x, const T& y,
const T* z,
const int32_t& n,
const int32_t& mm2,
const int32_t& nm2,
T& val,
const T& extrapval = T(0))
{
int64_t xp = (int64_t)(x) - 1; // adjust for MATLAB indexing
int64_t yp = (int64_t)(y) - 1;
if (xp < 0 || xp > nm2 || yp < 0 || yp > mm2)
{
val = extrapval;
}
else
{
const T* line = z + yp * n + xp;
T xf = x - (int64_t)(x);
T yf = y - (int64_t)(y);
T x1mf = (T)1 - xf;
T y1mf = (T)1 - yf;
T v00 = x1mf * y1mf * (*(line));
T v01 = xf * y1mf * (*(line + 1));
T v10 = x1mf * yf * (*(line + n));
T v11 = xf * yf * (*(line + n + 1));
val = v00 + v01 + v10 + v11;
}
}
template <typename T>
void interp2(const T* z,
const int32_t& mz, const int32_t& nz,
const T* xi, const T* yi,
const int32_t& mi, const int32_t& ni,
T* zi,
const T& extrapval = T(0))
{
const int32_t nzm2 = nz - 2;
const int32_t mzm2 = mz - 2;
#pragma omp parallel for
for (int m = 0; m < mi; ++m)
{
T* line_zi = zi + m * ni;
const T* x = xi + m * ni;
const T* y = yi + m * ni;
for (int n = 0; n < ni; ++n, ++x, ++y, ++line_zi)
{
interp2_mx((*x), (*y), z, nz, mzm2, nzm2, (*line_zi));
}
}
}

Your calculation of xf does a float-to-int64_t-to-float conversion. I assume you know the value is in range, otherwise this would be Undefined Behavior (and mathematically pointless). std::modf() may be the better function as it directly calculates the desired value.
I also think that adjacent pixels have rather related xf & x1mf values, yet you recalculate them. I'm not sure about this as your coordinates seem to be stored indirectly ((*x), (*y)). It may very well be more efficient to calculate those on the fly. Since these pointers may alias the output, they can't be prefetched and the reads will be blocking the memory bus.

My perlin noise looks like wrong, almost like grey t-shirt material (heather). Why?

I tried a quick and dirty translation of the code here.
However, my version outputs noise comparable to grey t-shirt material, or heather if it please you:
#include <fstream>
#include "perlin.h"
double Perlin::cos_Interp(double a, double b, double x)
{
ft = x * 3.1415927;
f = (1 - cos(ft)) * .5;
return a * (1 - f) + b * f;
}
double Perlin::noise_2D(double x, double y)
{
/*
int n = (int)x + (int)y * 57;
n = (n << 13) ^ n;
int nn = (n * (n * n * 60493 + 19990303) + 1376312589) & 0x7fffffff;
return 1.0 - ((double)nn / 1073741824.0);
*/
int n = (int)x + (int)y * 57;
n = (n<<13) ^ n;
return ( 1.0 - ( (n * (n * n * 15731 + 789221) + 1376312589) & 0x7fffffff) / 1073741824.0);
}
double Perlin::smooth_2D(double x, double y)
{
corners = ( noise_2D(x - 1, y - 1) + noise_2D(x + 1, y - 1) + noise_2D(x - 1, y + 1) + noise_2D(x + 1, y + 1) ) / 16;
sides = ( noise_2D(x - 1, y) + noise_2D(x + 1, y) + noise_2D(x, y - 1) + noise_2D(x, y + 1) ) / 8;
center = noise_2D(x, y) / 4;
return corners + sides + center;
}
double Perlin::interp(double x, double y)
{
int x_i = int(x);
double x_left = x - x_i;
int y_i = int(y);
double y_left = y - y_i;
double v1 = smooth_2D(x_i, y_i);
double v2 = smooth_2D(x_i + 1, y_i);
double v3 = smooth_2D(x_i, y_i + 1);
double v4 = smooth_2D(x_i + 1, y_i + 1);
double i1 = cos_Interp(v1, v2, x_left);
double i2 = cos_Interp(v3, v4, x_left);
return cos_Interp(i1, i2, y_left);
}
double Perlin::perlin_2D(double x, double y)
{
double total = 0;
double p = .25;
int n = 1;
for(int i = 0; i < n; ++i)
{
double freq = pow(2, i);
double amp = pow(p, i);
total = total + interp(x * freq, y * freq) * amp;
}
return total;
}
int main()
{
Perlin perl;
ofstream ofs("./noise2D.ppm", ios_base::binary);
ofs << "P6\n" << 512 << " " << 512 << "\n255\n";
for(int i = 0; i < 512; ++i)
{
for(int j = 0; j < 512; ++j)
{
double n = perl.perlin_2D(i, j);
n = floor((n + 1.0) / 2.0 * 255);
unsigned char c = n;
ofs << c << c << c;
}
}
ofs.close();
return 0;
}
I don't believe that I strayed too far from the aforementioned site's directions aside from adding in the ppm image generation code, but then again I'll admit to not fully grasping what is going on in the code.
As you'll see by the commented section, I tried two (similar) ways of generating pseudorandom numbers for noise. I also tried different ways of scaling the numbers returned by perlin_2D to RGB color values. These two ways of editing the code have just yielded different looking t-shirt material. So, I'm forced to believe that there's something bigger going on that I am unable to recognize.
Also, I'm compiling with g++ and the c++11 standard.
EDIT: Here's an example: http://imgur.com/Sh17QjK

To convert a double in the range of [-1.0, 1.0] to an integer in range [0, 255]:
n = floor((n + 1.0) / 2.0 * 255.99);
To write it as a binary value to the PPM file:
ofstream ofs("./noise2D.ppm", ios_base::binary);
...
unsigned char c = n;
ofs << c << c << c;

Is this a direct copy of your code? You assigned an integer to what should be the Y fractional value - it's a typo and it will throw the entire noise algorithm off if you don't fix:
double Perlin::interp(double x, double y)
{
int x_i = int(x);
double x_left = x - x_i;
int y_i = int(y);
double y_left = y = y_i; //This Should have a minus, not an "=" like the line above
.....
}
My guess is if you're successfully generating the bitmap with the proper color computation, you're getting vertical bars or something along those lines?
You also need to remember that the Perlin generator usually spits out numbers in the range of -1 to 1 and you need to multiply the resultant value as such:
value * 127 + 128 = {R, G, B}
to get a good grayscale image.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Optimizing custom IDCT calculation - c++

Related

SSE mean filter in c++ and OpenCV

Why are my values created in a for loop not working inside the body of the loop?

Fast implementation of covariance of two 8-bit arrays

How to speed up my bilinear interpolation?

My perlin noise looks like wrong, almost like grey t-shirt material (heather). Why?

Categories

Resources