GPU Programming Strategy - C++

I am trying to program a type of neural network in C with CUDA. I have one basic question. For the programming, I can either use one big array or a different-naming strategy. For example, for the weights I can put all the weights in one big array, or use different arrays with different names for different layers, such as weight1 for layer one, weight2 for layer two, and so on. The first strategy is a little bit troublesome, while the second one is easier for me. However, I am wondering: if I use the different-naming strategy, does it make the program slower to run on the GPU?

As long as all the arrays are allocated only once and not resized, the difference in performance should be negligible.
If you are constantly reallocating memory and resizing arrays holding the weights, then there might be a performance benefit in managing your own memory within the big array.
However, that is very implementation specific; if you don't know what you are doing, managing your own memory/arrays could make your code slower and less robust. Also, if your NN is huge, you might have trouble finding a contiguous block of memory large enough to hold your big array.
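To make that concrete, here is a minimal host-side sketch of the two strategies using the CUDA runtime API; the layer sizes and the names dWeightsAll / dWeights are invented for illustration, and error checking is omitted:

#include <cuda_runtime.h>
#include <cstddef>
#include <numeric>
#include <vector>

int main() {
    // Hypothetical per-layer weight counts.
    std::vector<std::size_t> layerSizes = {784 * 256, 256 * 128, 128 * 10};

    // Strategy 1: one big allocation; each layer is addressed by an offset.
    std::size_t total = std::accumulate(layerSizes.begin(), layerSizes.end(), std::size_t{0});
    float* dWeightsAll = nullptr;
    cudaMalloc((void**)&dWeightsAll, total * sizeof(float));
    // A kernel for layer i would simply receive dWeightsAll plus that layer's precomputed offset.

    // Strategy 2: one allocation per layer, each with its own pointer/name.
    std::vector<float*> dWeights(layerSizes.size(), nullptr);
    for (std::size_t i = 0; i < layerSizes.size(); ++i)
        cudaMalloc((void**)&dWeights[i], layerSizes[i] * sizeof(float));

    // Either way a kernel just sees a float*. As long as the buffers are
    // allocated once up front and reused, the per-launch cost is the same.

    for (float* p : dWeights) cudaFree(p);
    cudaFree(dWeightsAll);
    return 0;
}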

This is my 2 cents.
The drawbacks of having 1 very large array:
Harder to resize; if you intend to resize individual layers, a single large block is a poor fit.
As Daniel said, it might be hard to find a contiguous block of memory (keep in mind that something might feel large but isn't from a technical/hardware perspective).
The drawbacks of separate arrays or containers:
If you have a very granular, unpredictable access pattern, access times can be slower when it takes multiple steps to find a single location. For example, if you have a list of pointers to a list of pointers to a list of pointers, you have to take three (slightly expensive) steps every time. This can be avoided with proper coding (see the sketch after this answer).
In general I would be in favor of splitting up.
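To illustrate the difference between chasing several pointers and computing one flat index, here is a small sketch; the dimensions are made up:

#include <cstdio>
#include <vector>

int main() {
    const int imax = 2, jmax = 4, kmax = 8;

    // Three levels of indirection: each access walks three pointers, and
    // every hop is a dependent load (and a potential cache miss).
    std::vector<std::vector<std::vector<int>>> nested(
        imax, std::vector<std::vector<int>>(jmax, std::vector<int>(kmax, 0)));
    int a = nested[1][2][3];

    // One contiguous block: the offset is a couple of integer ops followed
    // by a single load from linearly laid-out memory.
    std::vector<int> flat(imax * jmax * kmax, 0);
    int b = flat[(1 * jmax + 2) * kmax + 3];

    std::printf("%d %d\n", a, b);
    return 0;
}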

Related

Being cache efficient with data, mainly arrays

I have recently started to look into being cache efficient by trying to avoid cache misses in C++. So far I have taken away the following:
Try to avoid linked-list objects where possible when processing. Instead, use them to point to contiguous data that you can store in cache and perform operations on.
Be careful of holding state in classes as it makes the above potentially more difficult.
Use structs when allocating on the heap, as this helps in localising data.
Try to use 1D arrays where possible for lists of data.
So my question is broken into two parts:
Is the above correct? Have I made any fundamental misunderstandings?
When dealing with 2D arrays I have seen other users recommend the use of Hilbert curves. I do not understand how this provides a speed increase over using division and modulus operators on an index to simulate a 2D array, as that is surely fewer instructions, which is good for speed and instruction cache usage?
Thanks for reading.
P.S. I do not have a CompSci background therefore, if you notice anything that I have said that is incorrect I would appreciate it if you could alert me so that I can read around that topic.
Your approach is flawed for at least one reason: you are willing to sacrifice everything to avoid cache misses. How do you know if that (cache misses) is the major performance factor in your code?
For example, there are MANY cases where the use of a linked list is better than a contiguous array, specifically where you frequently insert / delete items. You would pay greatly for compacting or expanding an array.
So the answer to your first question is: yes, you will improve data locality using those four principles. But possibly at a cost greater than the savings.
For the second question, I suggest you read about Hilbert curves. You don't need them if you are processing your 2D array in order, row by row. They will help a lot (with data locality) if you process some area of your 2D array, because the distance between elements in the same column but different rows is much smaller that way.
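For instance, with a flattened 2D array (sizes are arbitrary here), the row-by-row loop is the cache-friendly one, while the column-wise loop shows the strided pattern where space-filling curves start to pay off:

#include <cstddef>
#include <vector>

int main() {
    const std::size_t rows = 1024, cols = 1024;
    std::vector<float> grid(rows * cols, 0.0f);

    // Row-major traversal: the inner loop touches adjacent addresses, so
    // every cache line that is fetched gets fully used.
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            grid[r * cols + c] += 1.0f;

    // Column-wise traversal of the same data jumps 'cols' floats per step,
    // so nearly every access lands on a different cache line. Orderings like
    // Hilbert curves only pay off when you process 2D neighbourhoods rather
    // than whole rows.
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            grid[r * cols + c] += 1.0f;

    return 0;
}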

Most efficient way to grow array C++

Apologies if this has been asked before, I can't find a question that fully answers what I want to know. They mention ways to do this, but don't compare approaches.
I am writing a program in C++ to solve a PDE to steady state. I don't know how many time steps this will take. Therefore I don't know how long my time arrays will be. This will have a maximum time of 100,000s, but the time step could be as small as .001, so it could be as many as 1e8 doubles in length in the worst case (not necessarily a rare case either).
What is the most efficient way to implement this in terms of memory allocated and running time?
Options I've looked at:
Dynamically allocating an array with 1e8 elements, most of which won't ever be used.
Allocating a smaller array initially, creating a larger array when needed and copying elements over
Using std::vector and its size-increasing functionality
Are there any other options?
I'm primarily concerned with speed, but I want to know what memory considerations come into it as well.
If you are concerned about speed just allocate 1e8 doubles and be done with it.
In most cases vector should work just fine. Remember that amortized it's O(1) for the append.
Unless you are running on something very weird, the OS memory management should take care of most fragmentation issues, including the supposed difficulty of finding an 800 MB free memory block.
As noted in the comments, if you are careful using vector, you can actually reserve the capacity to store the maximum input size in advance (1e8 doubles) without paging in any memory.
For this you want to avoid the fill constructor and methods like resize (which would end up accessing all the memory) and use reserve and push_back to fill it and only touch memory as needed. That will allow most operating systems to simply page in chunks of your accessed vector at a time instead of the entire contents all at once.
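A minimal sketch of that reserve-and-push_back approach (whether untouched pages are actually committed lazily depends on the OS and allocator):

#include <cstddef>
#include <vector>

int main() {
    const std::size_t maxSteps = 100000000;  // 1e8 doubles, roughly 800 MB

    std::vector<double> times;
    times.reserve(maxSteps);  // allocates address space but writes nothing,
                              // so untouched pages need not be backed yet

    // push_back only touches memory as the simulation actually advances.
    for (std::size_t step = 0; step < 1000; ++step)
        times.push_back(step * 0.001);

    // By contrast, times.resize(maxSteps) or std::vector<double>(maxSteps)
    // would value-initialize every element and touch all ~800 MB up front.
    return 0;
}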
Yet I tend to avoid this solution for the most part at these kinds of input scales, for a few simple reasons:
A possibly-paranoid portability fear that I may encounter an operating system which doesn't have this kind of page-on-demand behavior.
A possibly-paranoid fear that the allocation may fail to find a contiguous set of unused pages and face out of memory errors (this is a grey zone -- I tend to worry about this for arrays which span gigabytes, hundreds of megabytes is borderline).
Just a totally subjective and possibly dumb/old bias towards not leaning too heavily on the operating system's behavior for paging in allocated memory, and preferring to have a data structure which simply allocates on demand.
Debugging.
Among the four, the first two could simply be paranoia. The third might just be plain dumb. Yet at least on operating systems like Windows, when using a debug build, the memory is initialized in its entirety early, and we end up mapping the allocated pages to DRAM immediately on reserving capacity for such a vector. That can lead to a slight startup delay and a task manager showing 800 megabytes of memory usage for a debug build even before we've done anything.
While generally the efficiency of a debug build should be a minor concern, when the potential discrepancy between release and debug is enormous, it can start to render production code almost incapable of being effectively debugged. So when the differences are potentially vast like this, my preference is to "chunk it up".
The strategy I like to apply here is to allocate smaller chunks -- smaller arrays of N elements, where N might be, say, 512 doubles (just snug enough to fit a common denominator page size of 4 kilobytes -- possibly minus a couple of doubles for chunk metadata). We fill them up with elements, and when they get full, create another chunk.
With these chunks, we can aggregate them together by either linking them (forming an unrolled list) or storing a vector of pointers to them in a separate aggregate depending on whether random-access is needed or merely sequential access will suffice. For the random-access case, this incurs a slight overhead, yet one I've tended to find relatively small at these input scales which often have times dominated by the upper levels of the memory hierarchy rather than register and instruction level.
This might be overkill for your case and a careful use of vector may be the best bet. Yet if that doesn't suffice and you have similar concerns/needs as I do, this kind of chunky solution might help.
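As a rough illustration of this "chunk it up" idea (not a production container; the chunk size and bookkeeping are simplified, and the name ChunkedDoubles is invented):

#include <array>
#include <cstddef>
#include <memory>
#include <vector>

// Grows in fixed-size chunks: no single allocation is ever huge and nothing
// already stored is copied when the structure grows.
class ChunkedDoubles {
    static constexpr std::size_t N = 512;  // doubles per chunk (~4 KB)
    std::vector<std::unique_ptr<std::array<double, N>>> chunks;
    std::size_t count = 0;
public:
    void push_back(double v) {
        if (count == chunks.size() * N)    // current chunk is full
            chunks.push_back(std::make_unique<std::array<double, N>>());
        (*chunks[count / N])[count % N] = v;
        ++count;
    }
    double operator[](std::size_t i) const {  // random access costs one extra hop
        return (*chunks[i / N])[i % N];
    }
    std::size_t size() const { return count; }
};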
The only way to know which option is 'most efficient' on your machine is to try a few different options and profile. I'd probably start with the following:
std::vector constructed with the maximum possible size.
std::vector constructed with a conservative ballpark size and push_back.
std::deque and push_back.
The std::vector vs std::deque debate is ongoing. In my experience, when the number of elements is unknown and not too large, std::deque is almost never faster than std::vector (even if the std::vector needs multiple reallocations) but may end up using less memory. When the number of elements is unknown and very large, std::deque memory consumption seems to explode and std::vector is the clear winner.
If after profiling, none of these options offers satisfactory performance, then you may want to consider writing a custom allocator.
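A bare-bones harness for comparing the candidates above on your own machine might look like this; timings only mean something with optimizations enabled, and the element count is scaled down from the worst case:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <deque>
#include <vector>

template <typename Container>
double fillSeconds(std::size_t n) {
    const auto t0 = std::chrono::steady_clock::now();
    Container c;
    for (std::size_t i = 0; i < n; ++i)
        c.push_back(i * 0.001);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const std::size_t n = 10000000;  // 1e7 for a quick run; scale up as needed
    std::printf("vector: %f s\n", fillSeconds<std::vector<double>>(n));
    std::printf("deque:  %f s\n", fillSeconds<std::deque<double>>(n));
    return 0;
}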

Understanding the efficiency of an std::string

I'm trying to learn a little bit more about C++ strings.
consider
const char* cstring = "hello";
std::string string(cstring);
and
std::string string("hello");
Am I correct in assuming that both store "hello" in the .data section of an application and the bytes are then copied to another area on the heap where the pointer managed by the std::string can access them?
How could I efficiently store a really really long string? I'm kind of thinking about an application that reads in data from a socket stream. I fear concatenating many times. I could imagine using a linked list and traverse this list.
Strings have intimidated me for far too long!
Any links, tips, explanations, further details, would be extremely helpful.
I have stored strings in the tens or hundreds of MB range without issue. Naturally, it will be primarily limited by your available (contiguous) memory / address space.
If you are going to be appending / concatenating, there are a few things that may help efficiency-wise: if possible, try to use the reserve() member function to pre-allocate space -- even if you only have a rough idea of how big the final size might be, it will save unnecessary re-allocations as the string grows.
Additionally, many string implementations use "exponential growth", meaning that they grow by some percentage rather than by a fixed byte size. For example, an implementation might simply double the capacity any time additional space is needed. Growing exponentially makes performing lots of concatenations much more efficient. (The exact details will depend on your standard library implementation.)
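For example, when accumulating a large string (the reserve figure, chunk size, and the collect() name below are placeholders):

#include <string>

std::string collect() {
    std::string result;
    result.reserve(10 * 1024 * 1024);   // rough upper bound, if you have one

    // Appends stay cheap: the buffer was allocated once up front, and if the
    // guess is exceeded the string still grows geometrically.
    for (int i = 0; i < 1000; ++i) {
        std::string chunk(4096, 'x');   // stand-in for data read elsewhere
        result += chunk;
    }
    return result;
}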
Finally, another option (if your library supports it) is to use the rope<> template: ropes are similar to strings, except that they are much more efficient when performing operations on very large strings. In particular, "ropes are allocated in small chunks, significantly reducing memory fragmentation problems introduced by large blocks". There are some additional details in SGI's STL guide.
Since you're reading the string from a socket, you can reuse the same packet buffers and chain them together to represent the huge string. This will avoid any needless copying and is probably the most efficient solution possible. I seem to remember that the ACE library provides such a mechanism. I'll try to find it.
EDIT: ACE has ACE_Message_Block that allows you to store large messages in a linked-list fashion. You almost need to read the C++ Network Programming books to make sense of this colossal library. The free tutorials on the ACE website really suck.
I bet Boost.Asio must be capable of doing the same thing as ACE's message blocks. Boost.Asio now seems to have a larger mindshare than ACE, so I suggest looking for a solution within Boost.Asio first. If anyone can enlighten us about a Boost.Asio solution, that would be great!
It's about time I try writing a simple client-server app using Boost.Asio to see what all the fuss is about.
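I can't sketch the ACE or Boost.Asio APIs from memory, but the underlying idea can be shown with plain standard containers: keep each received buffer as-is, chain them, and only stitch them together if some API really needs contiguous data.

#include <cstddef>
#include <deque>
#include <string>
#include <vector>

int main() {
    // Each element holds one received buffer, unchanged and uncopied.
    std::deque<std::vector<char>> packets;

    // Stand-in for a read loop; real code would fill these from a socket.
    for (int i = 0; i < 3; ++i)
        packets.push_back(std::vector<char>(1500, static_cast<char>('a' + i)));

    // Much processing can walk the chain directly. Only build one big
    // contiguous string if something actually requires it.
    std::size_t total = 0;
    for (const auto& p : packets) total += p.size();

    std::string whole;
    whole.reserve(total);
    for (const auto& p : packets) whole.append(p.data(), p.size());
    return 0;
}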
I don't think efficiency should be the issue. Both will perform well enough.
The deciding factor here is encapsulation. std::string is a far better abstraction than char * could ever be. Encapsulating pointer arithmetic is a good thing.
A lot of people thought long and hard to come up with std::string. I think failing to use it for unfounded efficiency reasons is foolish. Stick to the better abstraction and encapsulation.
As you probably know, an std::string is really just another name for basic_string<char>.
That said, they are a sequence container and memory will be allocated sequentially. It's possible to get an exception from an std::string if you try to make one bigger than the available contiguous memory that you can allocate. This threshold is typically considerably less than the total available memory due to memory fragmentation.
I've seen problems allocating contiguous memory when trying to allocate, for instance, large contiguous 3D buffers for images. But in my experience those issues don't start happening until somewhere on the order of 100 MB or so, on Windows XP Pro for instance.
Are your strings this big?

Higher dimensional array vs 1-D array efficiency in C++

I'm curious about the efficiency of using a higher dimensional array vs a one dimensional array. Do you lose anything when defining, and iterating through an array like this:
array[i][j][k];
or defining and iterating through an array like this:
array[k + j*jmax + i*imax];
My inclination is that there wouldn't be a difference, but I'm still learning about high efficiency programming (I've never had to care about this kind of thing before).
Thanks!
The only way to know for sure is to benchmark both ways (with optimization flags on in the compiler, of course). The one thing you lose for sure with the second method is clarity of reading.
The former way and the latter way to access arrays are identical once you compile it. Keep in mind that accessing memory locations that are close to one another does make a difference in performance, as they're going to be cached differently. Thus, if you're storing a high-dimensional matrix, ensure that you store rows one after the other if you're going to be accessing them that way.
In general, CPU caches optimize for temporal and spatial locality. That is, if you access memory address X, the odds of you accessing X+1 soon are higher. It's much more efficient to operate on values within the same cache line.
Check out this article on CPU caches for more information on how different storage policies affect performance: http://en.wikipedia.org/wiki/CPU_cache
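To spell out the two forms from the question, here is a sketch of the flat version; the dimensions are arbitrary and the index formula assumes k varies fastest:

#include <cstddef>
#include <vector>

int main() {
    const std::size_t imax = 64, jmax = 64, kmax = 64;

    // One contiguous block with manual index arithmetic (k varies fastest).
    std::vector<double> a(imax * jmax * kmax, 0.0);
    auto at = [&](std::size_t i, std::size_t j, std::size_t k) -> double& {
        return a[(i * jmax + j) * kmax + k];
    };

    // Keeping k in the innermost loop walks memory sequentially, which is
    // what the cache cares about; a contiguous built-in 3D array compiles to
    // the same address arithmetic.
    for (std::size_t i = 0; i < imax; ++i)
        for (std::size_t j = 0; j < jmax; ++j)
            for (std::size_t k = 0; k < kmax; ++k)
                at(i, j, k) = 1.0;
    return 0;
}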
If you can rewrite the indexing, so can the compiler. I wouldn't worry about that.
Trust your compiler(tm)!
It probably depends on the implementation, but I'd say it more or less amounts to the same code as for a one-dimensional array.
Do yourself a favor and care about such things after profiling the code. It is very unlikely that something like this will affect the performance of the application as a whole. Using the correct algorithms is much more important.
And even if it does matter, it is most certainly only a single inner loop that needs attention.

Faster to malloc multiple small times or few large times?

When using malloc to allocate memory, is it generally quicker to do multiple mallocs of smaller chunks of data, or fewer mallocs of larger chunks of data?

For example, say you are working with an image file that has black pixels and white pixels. You are iterating through the pixels and want to save the x and y position of each black pixel in a new structure that also has a pointer to the next and previous pixel's x and y values.

Would it be generally faster to iterate through the pixels, allocating a new structure for each black pixel's x and y values with the pointers, or would it be faster to get a count of the number of black pixels by iterating through once, then allocate a large chunk of memory using a structure containing just the x and y values but no pointers, and iterate through again, saving the x and y values into that array? I'm assuming certain platforms might be different from others as to which is faster, but what does everyone think would generally be faster?
It depends:
Multiple small allocations mean multiple calls to malloc, which is slower.
There may be a special/fast implementation for small allocations.
If I cared, I'd measure it! If I really cared a lot, and couldn't guess, then I might implement both, and measure at run-time on the target machine, and adapt accordingly.
In general I'd assume that fewer is better: but there are sizes and run-time library implementations such that a (sufficiently) large allocation will be delegated to the (relatively slow) O/S, whereas a (sufficiently) small allocation will be served from a (relatively quick) already-allocated heap.
Allocating large blocks is more efficient; additionally, since you are using larger contiguous blocks, you have greater locality of reference, and traversing your in-memory structure once you've generated it should also be more efficient! Further, allocating large blocks should help to reduce memory fragmentation.
Generally speaking, allocating larger chunks of memory fewer times will be faster. There's overhead involved each time a call to malloc() is made.
Besides speed issues, there is also the memory fragmentation problem.
Allocating memory is work. The amount of work done when allocating a block of memory is typically independent of the size of the block. You work it out from here.
It's faster not to allocate in performance-sensitive code at all. Allocate the memory you're going to need once in advance, and then use and reuse that as much as you like.
Memory allocation is a relatively slow operation in general, so don't do it more often than necessary.
In general malloc is expensive. It has to find an appropriate memory chunk from which to allocate memory and keep track of non-contiguous memory blocks. In several libraries you will find small memory allocators that try to minimize the impact by allocating a large block and managing the memory in the allocator.
Alexandrescu deals with the problem in 'Modern C++ Design' and in the Loki library, if you want to take a look at one such library.
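The gist of such a small-block allocator, heavily simplified (a single fixed node size, one up-front block, and a free list threaded through it; real allocators like Loki's handle far more, and the FixedPool name is invented):

#include <cstddef>
#include <vector>

// Hands out fixed-size nodes carved from one up-front allocation, so the
// general-purpose allocator is only hit once. nodeBytes is assumed to be at
// least sizeof(void*) and suitably aligned for the objects stored in it.
class FixedPool {
    struct Node { Node* next; };
    std::vector<char> block;      // the single big allocation
    Node* freeList = nullptr;
public:
    FixedPool(std::size_t nodeBytes, std::size_t count)
        : block(nodeBytes * count) {
        // Thread a free list through the block.
        for (std::size_t i = 0; i < count; ++i) {
            Node* n = reinterpret_cast<Node*>(block.data() + i * nodeBytes);
            n->next = freeList;
            freeList = n;
        }
    }
    void* allocate() {                  // O(1): pop from the free list
        if (!freeList) return nullptr;  // exhausted; a real pool would grow
        Node* n = freeList;
        freeList = n->next;
        return n;
    }
    void deallocate(void* p) {          // O(1): push back onto the free list
        Node* n = static_cast<Node*>(p);
        n->next = freeList;
        freeList = n;
    }
};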
This question is one of pragmatism, I'm afraid; that is to say, it depends.
If you have a LOT of pixels, only a few of which are black, then counting them might be the highest cost.
If you're using C++, which your tags suggest you are, I would strongly suggest using the STL, something like std::vector.
The implementation of vector, if I remember correctly, uses a pragmatic approach to allocation. There are a few heuristics for allocation strategies; an informative one is this:
#include <cstdlib>

class SampleVector {
    int N, used;
    int *data;
public:
    // Start with capacity for one element.
    SampleVector() : N(1), used(0) {
        data = static_cast<int*>(std::malloc(N * sizeof(int)));
    }
    ~SampleVector() { std::free(data); }
    void push_back(int i)
    {
        if (used >= N)
        {
            // Double the capacity, so reallocations become rarer as it grows.
            // (Error checking of realloc is omitted for brevity.)
            N *= 2;
            data = static_cast<int*>(std::realloc(data, N * sizeof(int)));
        }
        data[used++] = i;
    }
};
In this case, you DOUBLE the amount of memory allocated every time you realloc.
This means that reallocations progressively halve in frequency.
Your STL implementation will have been well-tuned, so if you can use that, do!
Another point to consider is how this interacts with threading. Using malloc many times in a threaded concurrent application is a major drag on performance. In that environment you are better off with a scalable allocator like the one used in Intel's Thread Building Blocks or Hoard. The major limitation with malloc is that there is a single global lock that all the threads contend for. It can be so bad that adding another thread dramatically slows down your application.
As already mentioned, malloc is costly, so fewer calls will probably be faster.
Also, working with the pixels in one contiguous block will, on most platforms, incur fewer cache misses and be faster.
However, there is no guarantee on every platform.
Next to the allocation overhead itself, allocating multiple small chunks may result in lots of cache misses, while if you can iterate through a contiguous block, chances are better.
The scenario you describe asks for preallocation of a large block, imho.
Although allocating large blocks is faster per byte of allocated memory, it will probably not be faster if you artificially increase the allocation size only to chop it up yourself. You are just duplicating the memory management.
Do an iteration over the pixels to count the number of them to be stored.
Then allocate an array for the exact number of items. This is the most efficient solution.
You can use std::vector for easier memory management (see the std::vector::reserve member function). Note: reserve will probably allocate a little (possibly up to 2 times) more memory than necessary.
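Applied to the pixel example, the count-then-allocate approach might look roughly like this; the image representation (one byte per pixel, 0 = black) and the function name are assumptions for the sketch:

#include <cstddef>
#include <vector>

struct Pixel { int x, y; };   // no next/prev pointers needed

// Assumes one byte per pixel, with 0 meaning black.
std::vector<Pixel> collectBlackPixels(const unsigned char* image,
                                      std::size_t width, std::size_t height) {
    // Pass 1: count, so the storage can be sized exactly once.
    std::size_t count = 0;
    for (std::size_t i = 0; i < width * height; ++i)
        if (image[i] == 0) ++count;

    // Pass 2: fill a single contiguous, exactly-sized block.
    std::vector<Pixel> black;
    black.reserve(count);
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            if (image[y * width + x] == 0)
                black.push_back({static_cast<int>(x), static_cast<int>(y)});
    return black;
}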
"I can allocate-it-all" (really, I can!)
We can philosophize about special implementations that speed up small allocations considerably ... yes! But in general this holds:
malloc must be general. It must handle all the different kinds of allocations. That is the reason it is considerably slow! It might be that you use some special super-duper library that speeds things up, but even those cannot do wonders, since they still have to implement malloc in its full spectrum.
The rule is, when you have more specialized allocation coding, you are always faster then the broad "I can allocate-it-all" routine "malloc".
So when you are able to allocate the memory in bigger blocks in your coding (and it does not cost you to much) you can speed up things considerably. Also - as mentioned by others - you will get lot less fragmentation of memory, that also speeds things up and can cost less memory. You must also see, that malloc needs additional memory for every chunk of memory it returns to you (yes, special routines can reduce this ... but you don't know! what it does really unless you implemented it yourself or bought some wonder-library).