I have implemented a pixel mask class used for checking for perfect collision. I am using SFML so the implementation is fairly straight forward:
Loop through each pixel of the image and decide whether its true or false based on its transparency value. Here is the code I have used:
// Create an Image from the given texture
sf::Image image(texture.copyToImage());
// measure the time this function takes
sf::Clock clock;
sf::Time time = sf::Time::Zero;
clock.restart();
// Reserve memory for the pixelMask vector to avoid repeating allocation
pixelMask.reserve(image.getSize().x);
// Loop through every pixel of the texture
for (unsigned int i = 0; i < image.getSize().x; i++)
{
// Create the mask for one line
std::vector<bool> tempMask;
// Reserve memory for the pixelMask vector to avoid repeating allocation
tempMask.reserve(image.getSize().y);
for (unsigned int j = 0; j < image.getSize().y; j++)
{
// If the pixel is not transparrent
if (image.getPixel(i, j).a > 0)
// Some part of the texture is there --> push back true
tempMask.push_back(true);
else
// The user can't see this part of the texture --> push back false
tempMask.push_back(false);
}
pixelMask.push_back(tempMask);
}
time = clock.restart();
std::cout << std::endl << "The creation of the pixel mask took: " << time.asMicroseconds() << " microseconds (" << time.asSeconds() << ")";
I have used the an instance of the sf::Clock to meassure time.
My problem is that this function takes ages (e.g. 15 seconds) for larger images(e.g. 1280x720). Interestingly, only in debug mode. When compiling the release version the same texture/image only takes 0.1 seconds or less.
I have tried to reduce memory allocations by using the resize() method but it didn't change much. I know that looping through almost 1 million pixels is slow but it should not be 15 seconds slow should it?
Since I want to test my code in debug mode (for obvious reasons) and I don't want to wait 5 min till all the pixel masks have been created, what I am looking for is basically a way to:
Either optimise the code / have I missed somthing obvious?
Or get something similar to the release performance in debug mode
Thanks for your help!
Optimizing For Debug
Optimizing for debug builds is generally a very counter-productive idea. It could even have you optimize for debug in a way that not only makes maintaining code more difficult, but may even slow down release builds. Debug builds in general are going to be much slower to run. Even with the flattest kind of C code I write which doesn't pose much for an optimizer to do beyond reasonable register allocation and instruction selection, it's normal for the debug build to take 20 times longer to finish an operation. That's just something to accept rather than change too much.
That said, I can understand the temptation to do so at times. Sometimes you want to debug a certain part of code only for the other operations in the software to takes ages, requiring you to wait a long time before you can even get to the code you are interested in tracing through. I find in those cases that it's helpful, if you can, to separate debug mode input sizes from release mode (ex: having the debug mode only work with an input that is 1/10th of the original size). That does cause discrepancies between release and debug as a negative, but the positives sometimes outweigh the negatives from a productivity standpoint. Another strategy is to build parts of your code in release and just debug the parts you're interested in, like building a plugin in debug against a host application in release.
Approach at Your Own Peril
With that aside, if you really want to make your debug builds run faster and accept all the risks associated, then the main way is to just pose less work for your compiler to optimize away. That's going to be flatter code typically with more plain old data types, less function calls, and so forth.
First and foremost, you might be spending a lot of time on debug mode assertions for safety. See things like checked iterators and how to disable them:
https://msdn.microsoft.com/en-us/library/aa985965.aspx
For your case, you can easily flatten your nested loop into a single loop. There's no need to create these pixel masks with separate containers per scanline, since you can always get at your scanline data with some basic arithmetic (y*image_width or y*image_stride). So initially I'd flatten the loop. That might even help modestly for release mode. I don't know the SFML API so I'll illustrate with pseudocode.
const int num_pixels = image.w * image.h;
vector<bool> pixelMask(num_pixels);
for (int j=0; j < num_pixels; ++j)
pixelMask[j] = image.pixelAlpha(j) > 0;
Just that already might help a lot. Hopefully SFML lets you access pixels with a single index without having to specify column and row (x and y). If you want to go even further, it might help to grab the pointer to the array of pixels from SFML (also hopefully possible) and use that:
vector<bool> pixelMask(image.w * image.h);
const unsigned int* pixels = image.getPixels();
for (int j=0; j < num_pixels; ++j)
{
// Assuming 32-bit pixels (should probably use uint32_t).
// Note that no right shift is necessary when you just want
// to check for non-zero values.
const unsigned int alpha = pixels[j] & 0xff000000;
pixelMask[j] = alpha > 0;
}
Also vector<bool> stores each boolean as a single bit. That saves memory but translates to some more instructions for random-access. Sometimes you can get a speed up even in release by just using more memory. I'd test both release and debug and time carefully, but you can try this:
vector<char> pixelMask(image.w * image.h);
const unsigned int* pixels = image.getPixels();
char* pixelUsed = &pixelMask[0];
for (int j=0; j < num_pixels; ++j)
{
const unsigned int alpha = pixels[j] & 0xff000000;
pixelUsed[j] = alpha > 0;
}
Loops are faster if working with costants:
1. for (unsigned int i = 0; i < image.getSize().x; i++) get this image.getSize() before the loop.
2. get the mask for one line out of the loop and reuse it. Lines are of the same length I assume. std::vector tempMask;
This shall speed you up a bit.
Note that the compilation for debugging gives way more different machine code.
Related
I wrote a neural network program in C++ to test something, and I found that my program gets slower as computation proceeds. Since this kind of phenomenon is somewhat I've never seen before, I checked possible causes. Memory used by program did not change when it got slower. RAM and CPU status were fine when I ran the program.
Fortunately, the previous version of the program did not have such problem. So I finally found that a single statement that makes the program slow. The program does not get slower when I use this statement:
dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi;
However, the program gets slower and slower as soon as I replace above statement with:
dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi - lambda*w[k][i][j];
To solve this problem, I did my best to find and remove the cause but I failed... The below is the simple code structure. For the case that this is not the problem that is related to local statement, I uploaded my code to google drive. The URL is located at the end of this question.
MLP.h
class MLP
{
private:
...
double lambda;
double ***w;
double ***dw;
neuron **hidden;
...
MLP.cpp
...
for(k = n_depth - 1; k > 0; k--)
{
if(k == n_depth - 1)
...
else
{
...
for(j = 1; n_neuron > j; j++)
{
for(i = 0; n_neuron > i; i++)
{
//dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi;
dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi - lambda*w[k][i][j];
}
}
}
}
...
Full source code: https://drive.google.com/open?id=1A8Uw0hNDADp3-3VWAgO4eTtj4sVk_LZh
I'm not sure exactly why it gets slower and slower, but I do see where you can gain some performance.
Two and higher dimensional arrays are still stored in one dimensional
memory. This means (for C/C++ arrays) array[i][j] and array[i][j+1]
are adjacent to each other, whereas array[i][j] and array[i+1][j] may
be arbitrarily far apart.
Accessing data in a more-or-less sequential fashion, as stored in
physical memory, can dramatically speed up your code (sometimes by an
order of magnitude, or more)!
When modern CPUs load data from main memory into processor cache,
they fetch more than a single value. Instead they fetch a block of
memory containing the requested data and adjacent data (a cache line
). This means after array[i][j] is in the CPU cache, array[i][j+1] has
a good chance of already being in cache, whereas array[i+1][j] is
likely to still be in main memory.
Source: https://people.cs.clemson.edu/~dhouse/courses/405/papers/optimize.pdf
With your current code, w[k][i][j] will be read, and on the next iteration, w[k][i+1][j] will be read. You should invert i and j so that w is read in sequential order:
for(j = 1; n_neuron > j; ++j)
{
for(i = 0; n_neuron > i; ++i)
{
dw[k][j][i] = hidden[k-1][j].y * hidden[k][i].phi - lambda*w[k][j][i];
}
}
Also note that ++x should be slightly faster than x++, since x++ has to create a temporary containing the old value of x as the expression result. The compiler might optimize it when the value is unused though, but do not count on it.
I encountered this weird bug in a c++ multithread program on linux. The multithreaded part basically executes a loop. One single iteration first loads a sift file containing some features. And then it queries these features against a tree. Since I have a lot of images, I used multiple threads to do this querying. Here is the code snippets.
struct MultiMatchParam
{
int thread_id;
float *scores;
double *scores_d;
int *perm;
size_t db_image_num;
std::vector<std::string> *query_filenames;
int start_id;
int num_query;
int dim;
VocabTree *tree;
FILE *file;
};
// multi-thread will do normalization anyway
void MultiMatch(MultiMatchParam ¶m)
{
// Clear scores
for(size_t t = param.start_id; t < param.start_id + param.num_query; t++)
{
for (size_t i = 0; i < param.db_image_num; i++)
param.scores[i] = 0.0;
DTYPE *keys;
int num_keys;
keys = ReadKeys_sfm((*param.query_filenames)[t].c_str(), param.dim, num_keys);
int normalize = true;
double mag = param.tree->MultiScoreQueryKeys(num_keys, normalize, keys, param.scores);
delete [] keys;
}
}
I run this on a 8-core cpu. At first it runs perfectly and the cpu usage is nearly 100% on all 8 cores. After each thread has queried several images (about 20 images), all of a sudden the performance (cpu usage) drops drastically, down to about 30% across all eight cores.
I doubt the key to this bug is concerned with this line of code.
double mag = param.tree->MultiScoreQueryKeys(num_keys, normalize, keys, param.scores);
Since if I replace it with another costly operations (e.g., a large for-loop containing sqrt). The cpu usage is always nearly 100%. This MultiScoreQueryKeys function does a complex operation on a tree. Since all eight cores may read the same tree (no write operation to this tree), I wonder whether the read operation has some kind of blocking effect. But it shouldn't have this effect because I don't have write operations in this function. Also the operations in the loop are basically the same. If it were to block the cpu usage, it would happen in the first few iterations. If you need to see the details of this function or other part of this project, please let me know.
Use std::async() instead of zeta::SimpleLock lock
In a function that updates all particles I have the following code:
for (int i = 0; i < _maxParticles; i++)
{
// check if active
if (_particles[i].lifeTime > 0.0f)
{
_particles[i].lifeTime -= _decayRate * deltaTime;
}
}
This decreases the lifetime of the particle based on the time that passed.
It gets calculated every loop, so if I've 10000 particles, that wouldn't be very efficient because it doesn't need to(it doesn't get changed anyways).
So I came up with this:
float lifeMin = _decayRate * deltaTime;
for (int i = 0; i < _maxParticles; i++)
{
// check if active
if (_particles[i].lifeTime > 0.0f)
{
_particles[i].lifeTime -= lifeMin;
}
}
This calculates it once and sets it to a variable that gets called every loop, so the CPU doesn't have to calculate it every loop, which would theoretically increase performance.
Would it run faster than the old code? Or does the release compiler do optimizations like this?
I wrote a program that compares both methods:
#include <time.h>
#include <iostream>
const unsigned int MAX = 1000000000;
int main()
{
float deltaTime = 20;
float decayRate = 200;
float foo = 2041.234f;
unsigned int start = clock();
for (unsigned int i = 0; i < MAX; i++)
{
foo -= decayRate * deltaTime;
}
std::cout << "Method 1 took " << clock() - start << "ms\n";
start = clock();
float calced = decayRate * deltaTime;
for (unsigned int i = 0; i < MAX; i++)
{
foo -= calced;
}
std::cout << "Method 2 took " << clock() - start << "ms\n";
int n;
std::cin >> n;
return 0;
}
Result in debug mode:
Method 1 took 2470ms
Method 2 took 2410ms
Result in release mode:
Method 1 took 0ms
Method 2 took 0ms
But that doesn't work. I know it doesn't do exactly the same, but it gives an idea.
In debug mode, they take roughly the same time. Sometimes Method 1 is faster than Method 2(especially at fewer numbers), sometimes Method 2 is faster.
In release mode, it takes 0 ms. A little weird.
I tried measuring it in the game itself, but there aren't enough particles to get a clear result.
EDIT
I tried to disable optimizations, and let the variables be user inputs using std::cin.
Here are the results:
Method 1 took 2430ms
Method 2 took 2410ms
It will almost certainly make no difference what so ever, at least if
you compile with optimization (and of course, if you're concerned with
performance, you are compiling with optimization). The opimization in
question is called loop invariant code motion, and is universally
implemented (and has been for about 40 years).
On the other hand, it may make sense to use the separate variable
anyway, to make the code clearer. This depends on the application, but
in many cases, giving a name to the results of an expression can make
code clearer. (In other cases, of course, throwing in a lot of extra
variables can make it less clear. It's all depends on the application.)
In any case, for such things, write the code as clearly as possible
first, and then, if (and only if) there is a performance problem,
profile to see where it is, and fix that.
EDIT:
Just to be perfectly clear: I'm talking about this sort of code optimization in general. In the exact case you show, since you don't use foo, the compiler will probably remove it (and the loops) completely.
In theory, yes. But your loop is extremely simple and thus likeley to be heavily optimized.
Try the -O0 option to disable all compiler optimizations.
The release runtime might be caused by the compiler statically computing the result.
I am pretty confident that any decent compiler will replace your loops with the following code:
foo -= MAX * decayRate * deltaTime;
and
foo -= MAX * calced ;
You can make the MAX size depending on some kind of input (e.g. command line parameter) to avoid that.
Let me first preface this with the fact that I know these kind of micro-optimisations are rarely cost-effective. I'm curious about how stuff works though. For all cacheline numbers etc, I am thinking in terms of an x86-64 i5 Intel CPU. The numbers would obviously differ for different CPUs.
I've often been under the impression that walking an array forwards is faster than walking it backwards. This is, I believed, due to the fact that pulling in large amounts of data is done in a forward-facing manner - that is, if I read byte 0x128, then the cacheline (assuming 64bytes in length) will read in bytes 0x128-0x191 inclusive. Consequently, if the next byte I wanted to access was at 0x129, it would already be in the cache.
However, after reading a bit, I'm now under the impression that it actually wouldn't matter? Because cache line alignment will pick the starting point at the closest 64-divisible boundary, then if I pick byte 0x127 to start with, I will load 0x64-0x127 inclusive, and consequently will have the data in the cache for my backwards walk. I will suffer a cachemiss when transitioning from 0x128 to 0x127, but that's a consequence of where I've picked the addresses for this example more than any real-world consideration.
I am aware that the cachelines are read in as 8-byte chunks, and as such the full cacheline would have to be loaded before the first operation could begin if we were walking backwards, but I doubt it would make a hugely significant difference.
Could somebody clear up if I'm right here, and old me is wrong? I've searched for a full day and still not been able to get a final answer on this.
tl;dr : Is the direction in which we walk an array really that important? Does it actually make a difference? Did it make a difference in the past? (To 15 years back or so)
I have tested with the following basic code, and see the same results forwards and backwards:
#include <windows.h>
#include <iostream>
// Size of dataset
#define SIZE_OF_ARRAY 1024*1024*256
// Are we walking forwards or backwards?
#define FORWARDS 1
int main()
{
// Timer setup
LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
LARGE_INTEGER Frequency;
int* intArray = new int[SIZE_OF_ARRAY];
// Memset - shouldn't affect the test because my cache isn't 256MB!
memset(intArray, 0, SIZE_OF_ARRAY);
// Arbitrary numbers for break points
intArray[SIZE_OF_ARRAY - 1] = 55;
intArray[0] = 15;
int* backwardsPtr = &intArray[SIZE_OF_ARRAY - 1];
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
// Actual code
if (FORWARDS)
{
while (true)
{
if (*(intArray++) == 55)
break;
}
}
else
{
while (true)
{
if (*(backwardsPtr--) == 15)
break;
}
}
// Cleanup
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
std::cout << ElapsedMicroseconds.QuadPart << std::endl;
// So I can read the output
char a;
std::cin >> a;
return 0;
}
I apologise for A) Windows code, and B) Hacky implementation. It's thrown together to test a hypothesis, but doesn't prove the reasoning.
Any information about how the walking direction could make a difference, not just with cache but also other aspects, would be greatly appreciated!
Just as your experimentation shows, there is no difference. Unlike the interface between the processor and L1 cache, the memory system transacts on full cachelines, not bytes. As #user657267 pointed out, processor specific prefetchers exist. These might preference forward vs. backward, but I heavily doubt it. All modern prefetchers detect direction rather than assuming them. Furthermore, they detect stride as well. They involve incredibly complex logic and something as easy as direction isn't going to be their downfall.
Short answer: go in either direction you want and enjoy the same performance for both!
I'm trying to optimize some C++ code for speed, and not concerned about memory usage. If I have some function that, for example, tells me if a character is a letter:
bool letterQ ( char letter ) {
return (lchar>=65 && lchar<=90) ||
(lchar>=97 && lchar<=122);
}
Would it be faster to just create a lookup table, i.e.
int lookupTable[128];
for (i = 0 ; i < 128 ; i++) {
lookupTable[i] = // some int value that tells what it is
}
and then modifying the letterQ function above to be
bool letterQ ( char letter ) {
return lookupTable[letter]==LETTER_VALUE;
}
I'm trying to optimize for speed in this simple region, because these functions are called a lot, so even a small increase in speed would accumulate into long-term gain.
EDIT:
I did some testing, and it seems like a lookup array performs significantly better than a lookup function if the lookup array is cached. I tested this by trying
for (int i = 0 ; i < size ; i++) {
if ( lookupfunction( text[i] ) )
// do something
}
against
bool lookuptable[128];
for (int i = 0 ; i < 128 ; i++) {
lookuptable[i] = lookupfunction( (char)i );
}
for (int i = 0 ; i < size ; i++) {
if (lookuptable[(int)text[i]])
// do something
}
Turns out that the second one is considerably faster - about a 3:1 speedup.
About the only possible answer is "maybe" -- and you can find out by running a profiler or something else to time the code. At one time, it would have been pretty easy to give "yes" as the answer with little or no qualification. Now, given how much faster CPUs have gotten than memory, it's a lot less certain -- you can do a lot of computation in the time it takes to fill one cache line from main memory.
Edit: I should add that in either C or C++, it's probably best to at least start with the functions (or macros) built into the standard library. These are often fairly carefully optimized for the target and (more importantly for most people) support things like switching locales, so you won't be stuck trying to explain to your German users that 'ß' isn't really a letter (and I doubt many will be much amused by "but that's really two letters, not one!)
First, I assume you have profiled the code and verified that this particular function is consuming a noticeable amount of CPU time over the runtime of the program?
I wouldn't create a vector as you're dealing with a very fixed data size. In fact, you could just create a regular C++ array and initialize is at program startup. With a really modern compiler that supports array initializers you even can do something like this:
bool lookUpTable[128] = { false, false, false, ..., true, true, ... };
Admittedly I'd probably write a small script that generates out the code rather then doing it all manually.
For a simple calculation like this, the memory access (caused by a lookup table) is going to be more expensive than just doing the calculation every time.