Procedural Landscape Generation from Data (BP, C++)

In UE4 we want to create procedural landscapes from freely available geological and height data. We've been following the book "Unreal Engine 4 Scripting with C++ Cookbook", which is a little older. The code, adapted accordingly, works well until it comes to updating the landscape. It crashes at:
int32 numHeights = (rect.Width() + 1) * (rect.Height() + 1);
TArray<uint16> Data;
Data.Init(0, numHeights);
for (int i = 0; i < Data.Num(); i++) {
    float nx = (float)(i % cols) / cols; // normalized x value (the cast matters: plain int division always yields 0)
    float ny = (float)(i / cols) / rows; // normalized y value
    Data[i] = GeoHeightData(nx, ny, 16, 4, 4);
}
LandscapeEditorUtils::SetHeightmapData(landscape, Data);
The function
LandscapeEditorUtils::SetHeightmapData( landscape, Data );
no longer exists. In LandscapeEdit.h you can find
LandscapeEdit::SetHeightData
which is called like this:
SetHeightData(InMinX, InMinY, InMaxX, InMaxY, (uint16*)ImportHeightData->GetData(), 0, false, nullptr);
Is this function the equivalent of SetHeightmapData? The engine crashes with this approach too.
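For reference, here is roughly how we are calling it now. This is editor-only code and partly guesswork on our side: the FLandscapeEditDataInterface usage and the GetLandscapeExtent call are what we pieced together from LandscapeEdit.h, so treat it as a sketch rather than a known-good recipe:
#if WITH_EDITOR
#include "LandscapeEdit.h" // FLandscapeEditDataInterface

void UpdateLandscapeHeights(ALandscape* Landscape, const TArray<uint16>& Data)
{
    ULandscapeInfo* Info = Landscape->GetLandscapeInfo();
    if (!Info) return; // a crash here would suggest the landscape info object is missing

    int32 MinX, MinY, MaxX, MaxY;
    if (!Info->GetLandscapeExtent(MinX, MinY, MaxX, MaxY)) return;

    // Data must contain exactly (MaxX - MinX + 1) * (MaxY - MinY + 1) samples,
    // matching the numHeights formula above.
    FLandscapeEditDataInterface LandscapeEdit(Info);
    LandscapeEdit.SetHeightData(MinX, MinY, MaxX, MaxY, Data.GetData(), 0, false, nullptr);
    LandscapeEdit.Flush();
}
#endif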
Do you have any suggestions or workarounds for creating procedural landscapes, either from Blueprint or code? We also checked out the approach by Christian Sparks (https://hippowombat.tumblr.com/post/...-ue4-420#notes), which is cool, but we need the landscape for runtime applications.
Thx!
reiti


Matrix Multiplication using SIMD vectors in C++

I am currently reading an article on GitHub about performance optimisation using Clang's extended vector syntax. The author describes the code snippet below as follows:
The templated code below implements the innermost loops that calculate a patch of size regA x regB in matrix C. The code loads regA scalars from matrixA and regB SIMD-width vectors from matrix B. The program uses Clang's extended vector syntax.
/// Compute a RAxRB block of C using a vectorized dot product, where RA is the
/// number of registers to load from matrix A, and RB is the number of registers
/// to load from matrix B.
template <unsigned regsA, unsigned regsB>
void matmul_dot_inner(int k, const float *a, int lda, const float *b, int ldb,
                      float *c, int ldc) {
  float8 csum[regsA][regsB] = {{0.0}};
  for (int p = 0; p < k; p++) {
    // Perform the DOT product.
    for (int bi = 0; bi < regsB; bi++) {
      float8 bb = LoadFloat8(&B(p, bi * 8));
      for (int ai = 0; ai < regsA; ai++) {
        float8 aa = BroadcastFloat8(A(ai, p));
        csum[ai][bi] += aa * bb;
      }
    }
  }
  // Accumulate the results into C.
  for (int ai = 0; ai < regsA; ai++) {
    for (int bi = 0; bi < regsB; bi++) {
      AdduFloat8(&C(ai, bi * 8), csum[ai][bi]);
    }
  }
}
The part quoted below confuses me the most. I read the full article and understood the logic behind using blocking and calculating a small patch, but I can't entirely understand what this bit means:
// Perform the DOT product.
for (int bi = 0; bi < regsB; bi++) {
  float8 bb = LoadFloat8(&B(p, bi * 8)); // the pointer to the range of values?
  for (int ai = 0; ai < regsA; ai++) {
    float8 aa = BroadcastFloat8(A(ai, p));
    csum[ai][bi] += aa * bb;
  }
}
Can anyone elaborate on what's going on here?
The article can be found here.
The 2nd comment on the article links to https://github.com/pytorch/glow/blob/405e632ef138f1d49db9c3181182f7efd837bccc/lib/Backends/CPU/libjit/libjit_defs.h#L26, which defines the float8 type as
typedef float float8 __attribute__((ext_vector_type(8)));
(similar to how immintrin.h defines __m256). It also defines load/broadcast functions similar to _mm256_load_ps and _mm256_set1_ps. With that header, you should be able to compile the code in the article.
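For reference, a minimal sketch of what those helpers look like. This is reconstructed from memory of libjit_defs.h rather than copied verbatim, and the A/B/C row-major accessor macros are my assumption about the article's setup:
// Clang-only: ext_vector_type gives a 32-byte vector with overloaded operators.
typedef float float8 __attribute__((ext_vector_type(8)));

// Load 8 contiguous floats into one vector register.
inline float8 LoadFloat8(const float *ptr) {
  return *(const float8 *)ptr;
}

// Splat a single scalar across all 8 lanes (like _mm256_set1_ps).
inline float8 BroadcastFloat8(float v) {
  float8 r = {v, v, v, v, v, v, v, v};
  return r;
}

// *ptr += val, lane by lane.
inline void AdduFloat8(float *ptr, float8 val) {
  *(float8 *)ptr += val;
}

// Assumed row-major accessors; lda/ldb/ldc are the row strides.
#define A(i, j) a[(i) * lda + (j)]
#define B(i, j) b[(i) * ldb + (j)]
#define C(i, j) c[(i) * ldc + (j)]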
See Clang's native vector documentation. GNU C native vector syntax is a nice way to get an overloaded * operator. I don't know what clang's ext_vector_type does that GCC/clang/ICC float __attribute__((vector_size(32))) (32-byte width) wouldn't.
The article could have added one small section to explain that, but it seems it was more focused on the performance details and wasn't really interested in explaining how to use the syntax.
Most of the discussion in the article is about how to manually vectorize matmul for cache efficiency with SIMD vectors. That part looks good from the quick skim I gave it.
You can do those things with any of several ways to manually vectorize: GNU C native vectors, clang's very similar "extended" vectors, or portable Intel intrinsics.
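To make the intrinsics option concrete, here is the accumulate step of the article's inner loop (csum[ai][bi] += aa * bb) written with AVX intrinsics. The helper name dot_step is mine, purely for illustration:
#include <immintrin.h>

// One accumulation step: load 8 floats from a row of B, broadcast one
// scalar from A, then multiply and add into the accumulator.
inline void dot_step(__m256 &acc, const float *b_row, float a_scalar) {
  __m256 bb = _mm256_loadu_ps(b_row);   // plays the role of LoadFloat8
  __m256 aa = _mm256_set1_ps(a_scalar); // plays the role of BroadcastFloat8
  acc = _mm256_add_ps(acc, _mm256_mul_ps(aa, bb)); // acc += aa * bb
}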

Image Processing (vector subscript out of range)

I have spent probably too many hours looking for tutorials on image processing (WITHOUT the use of external libraries) with no real success. If anyone knows of any good tutorials that could help here, I'd really appreciate it.
I am pretty new to coding (this is my first year in college), and the assignment our professor has set requires original code to transform 24-bit bitmap images.
I found a question on Stack Exchange that shows rotation of an image without the use of external libraries:
My code rotates a bmp picture correctly but only if the number of pixels is a multiple of 4... can anyone see what's wrong?
Using that code (together with the starter project we were given to build upon), I wrote the following. Byte is defined as a typedef of unsigned char.
void BMPImage::RotateImage()
{
    vector<byte> newBMP(m_BIH.biWidth * m_BIH.biHeight);
    long newHeight = m_BIH.biWidth; /* Preserving the original width */
    m_BIH.biWidth = m_BIH.biHeight; /* Setting the width as the height */
    m_BIH.biHeight = newHeight;     /* Using the value of the original width, we set it as the new height */
    for (int r = 0; r < m_BIH.biHeight; r++)
    {
        for (int c = 0; c < m_BIH.biWidth; c++)
        {
            long y = c + (r * m_BIH.biHeight);
            long x = c + (r * m_BIH.biWidth - r - 1) + (m_BIH.biHeight * c);
            newBMP[y] = m_ImageData[x];
        }
    }
    m_ImageData = newBMP;
}
This code doesn't show any red squigglies, but when I try to execute the rotation, I get a vector subscript out of range error message pop-up. I've only used vectors in one assignment before, so I don't know where the issue is. Help please!
I think the issue is in your index calculations. Assume your image has width = 1 and height = 2; then
vector<byte> newBMP(m_BIH.biWidth * m_BIH.biHeight);
would be an array of size 2 with the valid index range [0, 1]. Your index calculation
long y = c + (r * m_BIH.biHeight);
would be 2 for c = 0 and r = 1. But 2 is not a valid index for your vector, and with
newBMP[y] = ...
you access an element that is not part of the vector. Your index x would be -1 for this example.
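For completeness, here is a sketch of a 90-degree clockwise rotation with in-bounds indexing. Like your code, it assumes one byte per pixel; a real 24-bit BMP additionally has 3 bytes per pixel and rows padded to multiples of 4 bytes (the cause of the "multiple of 4" symptom in the question you linked), which you'd still need to handle:
void BMPImage::RotateImage()
{
    long oldWidth  = m_BIH.biWidth;
    long oldHeight = m_BIH.biHeight;
    vector<byte> newBMP(oldWidth * oldHeight);

    // After a 90-degree rotation the dimensions swap.
    m_BIH.biWidth  = oldHeight;
    m_BIH.biHeight = oldWidth;

    for (long r = 0; r < m_BIH.biHeight; r++)     // rows of the new image
    {
        for (long c = 0; c < m_BIH.biWidth; c++)  // columns of the new image
        {
            // New pixel (r, c) comes from old pixel (oldHeight - 1 - c, r).
            long dst = r * m_BIH.biWidth + c;
            long src = (oldHeight - 1 - c) * oldWidth + r;
            newBMP[dst] = m_ImageData[src];
        }
    }
    m_ImageData = newBMP;
}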

Efficient 2D FFT of fixed length real input data in C/C++

I'm developing an algorithm that calls an FFT function several times. I have several time constraints (real-time desired), so I need to minimize the time spent in every FFT call.
I'm working with the OpenCV library and I have already implemented my code with two different approaches:
Using the FFTW library: data/memory management + FFT (8 ms) = 14 ms (on average, FFTW_MEASURE flag).
Using the OpenCV fft function: data/memory management + FFT (21 ms) = 23 ms (on average).
As my input data is always a fixed-size real image of 512x512 pixels, do you think that if I implement the FFT algorithm myself, based on the mathematical definition of the DFT and storing sine/cosine tables, I can achieve better performance, or is the FFTW library really that optimized? Any better ideas?
All ideas and suggestions will be really appreciated. For now, I'm not considering parallelization or a GPU implementation.
Thank you
Update:
System: Intel Xeon 5130 2.0GHz CPU on Windows 7, Visual Studio 2010, FFTW 3.3.3 (compiled following the instructions on the site), OpenCV 2.4.3.
Code example for the FFT call with FFTW (input: OpenCV Mat CV_32F (1 channel, float type); output: OpenCV Mat CV_32FC2 (2 channels, float type)):
float *im_data;
fftwf_complex *data_in;
fftwf_complex *fft;
fftwf_plan plan_f;
int i, j, k;
int height = I.rows;
int width = I.cols;
int N = height * width;
float *outdata = new float[2 * N];
im_data = (float *) I.data;
data_in = (fftwf_complex *) fftwf_malloc(sizeof(fftwf_complex) * N);
fft     = (fftwf_complex *) fftwf_malloc(sizeof(fftwf_complex) * N);
plan_f = fftwf_plan_dft_2d(height, width, data_in, fft, FFTW_FORWARD, FFTW_MEASURE);
for (int i = 0, k = 0; i < height; ++i) {
    float *row = I.ptr<float>(i);
    for (int j = 0; j < width; j++) {
        data_in[k][0] = (float) row[j];
        data_in[k][1] = (float) 0.0;
        k++;
    }
}
fftwf_execute(plan_f);
int width2 = 2 * width;
// writing output matrix: RealFFT[0], ImaginaryFFT[0], RealFFT[1], ImaginaryFFT[1], ...
for (i = 0, k = 0; i < height; i++) {
    for (j = 0; j < width2; j++) {
        outdata[i * width2 + j]     = (float) fft[k][0];
        outdata[i * width2 + j + 1] = (float) fft[k][1];
        j++;
        k++;
    }
}
Mat fft_I(height, width, CV_32FC2, outdata);
fftwf_destroy_plan(plan_f);
fftwf_free(data_in);
fftwf_free(fft);
return fft_I;
Your FFT time with FFTW seems very high. To get the best out of FFTW with fixed-size FFTs you should generate a plan using the FFTW_PATIENT flag and then ideally save the generated "wisdom" for subsequent re-use. You can generate wisdom either from your own code or using the fftw-wisdom tool.
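For example (the file name is just illustrative, and data_in/fft are the buffers from your code above):
#include <fftw3.h>

// Load any previously saved wisdom; returns 0 if the file doesn't exist yet.
fftwf_import_wisdom_from_filename("fftw_512x512.wisdom");

// With matching wisdom available, planning is nearly instant; otherwise
// FFTW_PATIENT spends a while benchmarking candidate algorithms once.
fftwf_plan plan_f = fftwf_plan_dft_2d(512, 512, data_in, fft,
                                      FFTW_FORWARD, FFTW_PATIENT);

// Persist what the planner learned so the next run skips the measurement.
fftwf_export_wisdom_to_filename("fftw_512x512.wisdom");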
The FFT from the Intel Math Kernel Library (separate from the Intel compiler) is faster than FFTW most of the time. I don't know if it will be enough of an improvement in your case to justify the price though.
I will agree with the others that rolling your own FFT is probably not a good use of your time (unless you want to learn how to do it). The available FFT implementations (FFTW, MKL) have been finely tuned over many years. I'm not saying that you can't do better, but it would probably be a lot of work and time for marginal gains.
Believe me, FFTW is really very optimized; there is very little chance that you can do better.
Which compiler did you use for compiling FFTW? Sometimes the compiler from Intel gives better performance than gcc.

OpenCV Foreground Detection slow

I am trying to implement the codebook foreground detection algorithm outlined in the book Learning OpenCV.
The algorithm only describes a codebook-based approach for each pixel of the picture. So I took the simplest approach that came to mind: an array of codebooks, one for each pixel, much like the matrix structure underlying IplImage. The length of the array is equal to the number of pixels in the image.
I wrote the following two loops to learn the background and segment the foreground. It uses my limited understanding of the matrix structure inside the src image, and uses pointer arithmetic to traverse the pixels.
void foreground(IplImage* src, IplImage* dst, codeBook* c, int* minMod, int* maxMod){
    int height = src->height;
    int width = src->width;
    uchar* srcCurrent = (uchar*) src->imageData;
    uchar* srcRowHead = srcCurrent;
    int srcChannels = src->nChannels;
    int srcRowWidth = src->widthStep;
    uchar* dstCurrent = (uchar*) dst->imageData;
    uchar* dstRowHead = dstCurrent;
    // dst has 1 channel
    int dstRowWidth = dst->widthStep;
    for(int row = 0; row < height; row++){
        for(int column = 0; column < width; column++){
            (*dstCurrent) = find_foreground(srcCurrent, (*c), srcChannels, minMod, maxMod);
            dstCurrent++;
            c++;
            srcCurrent += srcChannels;
        }
        srcCurrent = srcRowHead + srcRowWidth;
        srcRowHead = srcCurrent;
        dstCurrent = dstRowHead + dstRowWidth;
        dstRowHead = dstCurrent;
    }
}
void background(IplImage* src, codeBook* c, unsigned* learnBounds){
    int height = src->height;
    int width = src->width;
    uchar* srcCurrent = (uchar*) src->imageData;
    uchar* srcRowHead = srcCurrent;
    int srcChannels = src->nChannels;
    int srcRowWidth = src->widthStep;
    for(int row = 0; row < height; row++){
        for(int column = 0; column < width; column++){
            // was c[row*column], which indexes the same codebook for many pixels
            update_codebook(srcCurrent, c[row * width + column], learnBounds, srcChannels);
            srcCurrent += srcChannels;
        }
        srcCurrent = srcRowHead + srcRowWidth;
        srcRowHead = srcCurrent;
    }
}
The program works, but is very sluggish. Is there something obvious that is slowing it down? Or is it an inherent problem with this simple implementation? Is there anything I can do to speed it up? Each codebook is sorted in no specific order, so processing each pixel takes linear time. If I double the background samples, the per-pixel processing takes twice as long, which is then magnified by the number of pixels. But as the implementation stands, I don't see any clear, logical way to sort the code element entries.
I am aware that there is an example implementation of the same algorithm in the OpenCV samples. However, that structure seems much more complex. I am looking more to understand the reasoning behind this method; I am aware that I can just modify the sample for real-life applications.
Thanks
Operating on every pixel in an image is going to be slow, regardless of how you implement it.
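That said, one generic trick for the linear scan you describe (not from the book, just a common heuristic) is to keep each codebook in most-recently-matched order, so that in mostly static scenes the match is usually found on the first comparison. A sketch, where the codeBook/codeElement layout is hypothetical:
struct codeElement { /* per-codeword min/max bounds, learning stats, ... */ };
struct codeBook {
    codeElement *elements;
    int numElements;
};

// After a scan matches the element at index `hit`, move it to the front so
// the next frame's scan for this pixel likely stops at the first entry.
inline void moveToFront(codeBook &cb, int hit) {
    codeElement matched = cb.elements[hit];
    for (int i = hit; i > 0; --i)
        cb.elements[i] = cb.elements[i - 1]; // shift earlier entries down one
    cb.elements[0] = matched;
}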

exchanging 2 memory positions

I am working with OpenCV and Qt. OpenCV uses BGR while Qt uses RGB, so I have to swap those 2 bytes for very big images.
Is there a better way of doing the following?
I can't think of anything faster, but this looks so simple and lame...
int width = iplImage->width;
int height = iplImage->height;
uchar *iplImagePtr = (uchar *) iplImage->imageData;
uchar buf;
int limit = height * width;
for (int y = 0; y < limit; ++y) {
    buf = iplImagePtr[2];
    iplImagePtr[2] = iplImagePtr[0];
    iplImagePtr[0] = buf;
    iplImagePtr += 3;
}
QImage img((uchar *) iplImage->imageData, width, height,
           QImage::Format_RGB888);
We are currently dealing with this issue in a Qt application. We've found the Intel Performance Primitives to be the fastest way to do this; they have extremely optimized code. In the HTML help files at Intel (ippiSwapChannels documentation) there is an example of exactly what you are looking for.
There are a couple of downsides:
One is the size of the library, but you can statically link just the routines you need.
The other is running on AMD CPUs: Intel libs run VERY slowly by default on AMD. Check out www.agner.org/optimize/asmlib.zip for details on how to work around this.
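A sketch of the call itself; the output buffer name is mine, and you should double-check the signature against your IPP version:
#include <ipp.h>

// dstOrder says which source channel feeds each destination channel,
// so {2, 1, 0} turns BGR into RGB.
const int dstOrder[3] = { 2, 1, 0 };
IppiSize roi = { width, height };
ippiSwapChannels_8u_C3R((const Ipp8u *) iplImage->imageData, iplImage->widthStep,
                        (Ipp8u *) rgbBuffer, width * 3, // rgbBuffer: your RGB output
                        roi, dstOrder);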
I think this looks absolutely fine. That the code is simple is not something negative. If you want to make it shorter you could use std::swap:
std::swap(iplImagePtr[0], iplImagePtr[2]);
You could also do the following:
uchar* end = iplImagePtr + height * width * 3;
for ( ; iplImagePtr != end; iplImagePtr += 3) {
    std::swap(iplImagePtr[0], iplImagePtr[2]);
}
There's cvConvertImage to do the whole thing in one line, but I doubt it's any faster either.
Couldn't you use one of the following methods?
void QImage::invertPixels ( InvertMode mode = InvertRgb )
or
QImage QImage::rgbSwapped () const
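rgbSwapped() is the one that matches this case (invertPixels() inverts the channel values rather than reordering them). A minimal usage sketch:
// Wrap the OpenCV data (BGR order) without copying, then get an RGB copy.
QImage wrapped((uchar *) iplImage->imageData, width, height,
               QImage::Format_RGB888);
QImage rgb = wrapped.rgbSwapped(); // returns a swapped copy; wrapped is untouched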
Hope this helps a bit!
I would be inclined to do something like the following, working on the basis of the RGB data being in three-byte blocks.
int i = 0;
int limit = width * height * 3; // total bytes, three per pixel
uchar buf;
while (i != limit)
{
    buf = iplImagePtr[i];                // the blue colour byte
    iplImagePtr[i] = iplImagePtr[i + 2]; // put the red colour byte in the blue slot
    iplImagePtr[i + 2] = buf;            // put the blue colour byte into what was the red slot
    i += 3;
}
I doubt it is any 'faster', but at the end of the day you just have to go through the entire image, pixel by pixel.
You could always do this:
int width = iplImage->width;
int height = iplImage->height;
uchar *start = (uchar *) iplImage->imageData;
uchar *end = start + width * height * 3; // three bytes per pixel
for (uchar *p = start; p < end; p += 3)
{
    uchar buf = *p;
    *p = *(p + 2);
    *(p + 2) = buf;
}
but a decent compiler would do this anyway.
Your biggest overhead in these sorts of operations is going to be memory bandwidth.
If you're using Windows then you can probably do this conversion using BitBlt and two appropriately set-up DIBs. If you're really lucky then this could be done in the graphics hardware.
I hate to ruin anyone's day, but if you don't want to go the IPP route (see photo_tom) or pull in an optimized library, you might get better performance from the following (modifying Andreas' answer):
uchar *iplImagePtr = (uchar *) iplImage->imageData;
size_t limit = height * width;
for (size_t y = 0; y < limit; ++y) {
    std::swap(iplImagePtr[y * 3], iplImagePtr[y * 3 + 2]);
}
Now hold on, folks, I hear you yelling "but all those extra multiplies and adds!" The thing is, this form of the loop is far easier for a compiler to optimize, especially if it gets smart enough to multithread this sort of algorithm, because each pass through the loop is independent of those before or after. In the other form, the value of iplImagePtr in each pass was dependent on its value in the previous pass. In this form, it is constant throughout the whole loop; only y changes, and that is in a very, very common "count from 0 to N-1" loop construct, so it's easier for an optimizer to digest.
Or maybe it doesn't make a difference these days because optimizers are insanely smart (are they?). I wonder what a benchmark would say...
P.S. If you actually benchmark this, I'd also like to see how well the following performs:
uchar *iplImagePtr = (uchar *) iplImage->imageData;
size_t limit = height * width;
for (size_t y = 0; y < limit; ++y) {
    uchar *pixel = iplImagePtr + y * 3;
    std::swap(pixel[0], pixel[2]);
}
Again, pixel is defined in the loop to limit its scope and keep the optimizer from thinking there's a cycle-to-cycle dependency. If the compiler increments and decrements the stack pointer each time through the loop to "create" and "destroy" pixel, well, it's stupid and I'll apologize for wasting your time.
cvCvtColor(iplImage, iplImage, CV_BGR2RGB);