tile_static dynamically indexed arrays; should I even bother? - c++

I'm going to great lengths to try and store frequently accessed data in tile_static memory to take advantage of the boundless performance nirvana which will ensue.
However, I've just read that only certain hardware/drivers can actually dynamically index tile_static arrays, and that the operation might just spill over to global memory anyway.
In an ideal world I'd just do it and profile, but this is turning out to be a major operation and I'd like to get an indication as to whether or not I'm wasting my time here:
tile_static int staticArray[128];
int resultFast = staticArray[0]; // constant index: this is super fast
// but what about this:
int i = /* dynamically derived value! */;
int resultNotSoFast = staticArray[i]; // is this faster than getting it from global memory?
How can I find out whether my GPU/driver supports dynamic indexing of static arrays?

Dynamic Indexing of Local Memory
So I did some digging on this because I wanted to understand this too.
I believe you are referring to dynamic indexing of local memory, not tile_static memory (or, in CUDA parlance, "shared memory"). In your example above, staticArray would then be declared as:
int staticArray[128]; // not tile_static
This cannot be dynamically indexed because an array such as int staticArray[128] is actually stored in 128 registers, and registers cannot be dynamically addressed. Allocating large arrays like this is problematic anyway because it uses up a large number of registers, which are a limited resource on the GPU. Use too many registers per thread and your application will be unable to exploit all the available parallelism, because fewer threads can be resident on each multiprocessor at once.
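Purely as an illustration of that point (CUDA flavour, to match the "local memory" terminology; the kernel and names are made up, not code from the question): a per-thread array indexed with values that aren't compile-time constants generally cannot stay in registers. Compiling with nvcc -Xptxas -v reports per-thread register and local-memory usage, so you can see where such an array actually ended up.
__global__ void perThreadArray(const int* in, int* out, int n) {
    int scratch[128];                            // per-thread ("local") array
    for (int k = 0; k < 128; ++k)
        scratch[k] = in[k] * (k + 1);            // fill with data-dependent values
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        int i = in[gid] & 127;                   // index not known at compile time
        out[gid] = scratch[i];                   // dynamic indexing -> usually spilled to local memory
    }
}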
In the case of C++ AMP I'm not even sure this applies; the level of abstraction provided by DX11 may make it somewhat irrelevant. I'm not enough of an expert on DX11 to know.
There's a great explanation of this here: In a CUDA kernel, how do I store an array in "local thread memory"?
Bank Conflicts
Tile static memory is divided into a number of modules referred to as banks. Tile static memory typically consists of 16, 32, or 64 banks, each of which is 32 bits wide. This is specific to the particular GPU hardware and might change in the future. Tile static memory is interleaved across these banks. This means that for a GPU whose tile static memory is implemented with 32 banks, if arr is an array<float, 1>, then arr[1] and arr[33] are in the same bank because each float occupies a single 32-bit bank location. This is the key point to understand when it comes to dealing with bank conflicts.
Each bank can service one address per cycle. For best performance, threads in a warp should either access data in different banks or all read the same data in a single bank, a pattern typically optimized by the hardware. When these access patterns are followed, your application can maximize the available tile static memory bandwidth. In the worst case, multiple threads in the same warp access data from the same bank. This causes these accesses to be serialized, which might result in a significant degradation in performance.
I think the key point of confusion (based on some of your other questions) might be that a memory bank is 32 bits wide but is responsible for access to all of the memory within the bank, which will be 1/16, 1/32, or 1/64 of the total tile static memory.
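As a tiny illustration of that interleaving, assuming the 32-bank, 32-bit-word layout from the quote above (the helper is just for exposition):
constexpr int kNumBanks = 32;
constexpr int bankOf(int elementIndex) { return elementIndex % kNumBanks; } // 4-byte elements map to bank index % 32
static_assert(bankOf(1) == bankOf(33), "arr[1] and arr[33] share a bank");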
You can read more about bank conflicts here: What is a bank conflict? (Doing Cuda/OpenCL programming)
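Coming back to the original question, here is a minimal C++ AMP sketch of the kind of kernel you could time to settle it empirically: each tile cooperatively stages data into tile_static memory and then reads it back through a dynamically derived index. The function name, sizes, the index permutation, and the assumption that the extent is a multiple of the 128-thread tile are all illustrative; comparing it against a variant that reads straight from the global array_view gives the baseline.
#include <amp.h>
#include <vector>
using namespace concurrency;

void dynamicTileStaticRead(const std::vector<int>& in, std::vector<int>& out) {
    array_view<const int, 1> input((int)in.size(), in);
    array_view<int, 1> output((int)out.size(), out);
    output.discard_data();
    parallel_for_each(output.extent.tile<128>(),
        [=](tiled_index<128> tidx) restrict(amp) {
            tile_static int staged[128];
            staged[tidx.local[0]] = input[tidx.global[0]]; // cooperative load into tile_static
            tidx.barrier.wait();
            int i = (tidx.local[0] * 7 + 3) % 128;         // dynamically derived index
            output[tidx.global[0]] = staged[i];            // dynamic tile_static read
        });
    output.synchronize();
}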

Related

How much overhead is there per single object memory allocation? [closed]

Say, if I call malloc(sizeof(int)), requesting 4 bytes, how much extra will be added by the system (or the standard library?) to support the memory-management infrastructure? I believe there should be some; otherwise, how would the system know how many bytes to dispose of when I call free(ptr)?
UPDATE 1: It may sound like a 'too broad' question, and obviously it is C/C++ library specific, but what I am interested in is the minimum extra memory needed to support a single allocation, independent of any particular system or implementation. For example, a binary tree node requires two pointers, to the left and right children, and there is no way to squeeze that down.
UPDATE 2:
I decided to check it for myself on Windows 64.
#include <stdio.h>
#include <stdlib.h>   /* atoi, malloc */
#include <conio.h>
#include <windows.h>
#include <psapi.h>
int main(int argc, char *argv[])
{
    int m = (argc > 1) ? atoi(argv[1]) : 1;   /* size of each allocation */
    int n = (argc > 2) ? atoi(argv[2]) : 0;   /* number of allocations */
    for (int i = 0; i < n; i++)
        malloc(m);
    size_t peakKb(0);
    PROCESS_MEMORY_COUNTERS pmc;
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
        peakKb = pmc.PeakWorkingSetSize >> 10;
    printf("requested : %d kb, total: %zu kb\n", (m * n) >> 10, peakKb);
    _getch();
    return 0;
}
0 allocations (baseline):
requested : 0 kb, total: 2080 kb
1 byte:
requested : 976 kb, total: 17788 kb
extra: 17788 - 2080 - 976 = 14732 (+1410%)
2 bytes:
requested : 1953 kb, total: 17784 kb
extra: 17784 - 2080 - 1953 = 13751 (+605%)
4 bytes:
requested : 3906 kb, total: 17796 kb
extra: 17796 - 2080 - 3906 = 11810 (+177%)
8 bytes:
requested : 7812 kb, total: 17784 kb
extra: 17784 - 2080 - 7812 = (0%)
UPDATE 3: THIS IS THE ANSWER TO MY QUESTION I’VE BEEN LOOKING FOR: In addition to being slow, the genericity of the default C++ allocator makes it very space inefficient for small objects. The default allocator manages a pool of memory, and such management often requires some extra memory. Usually, the bookkeeping memory amounts to a few extra bytes (4 to 32) for each block allocated with new. If you allocate 1024-byte blocks, the per-block space overhead is insignificant (0.4% to 3%). If you allocate 8-byte objects, the per-object overhead becomes 50% to 400%, a figure big enough to make you worry if you allocate many such small objects.
For allocated objects, no additional metadata is theoretically required. A conforming implementation of malloc could round up all allocation requests to a fixed maximum object size, for example. So for malloc (25), you would actually receive a 256-byte buffer, and malloc (257) would fail and return a null pointer.
More realistically, some malloc implementations encode the allocation size in the pointer itself, either directly using bit patterns corresponding to specific fixed sized classes, or indirectly using a hash table or a multi-level trie. If I recall correctly, the internal malloc for Address Sanitizer is of this type. For such mallocs, at least part of the immediate allocation overhead does not come from the addition of metadata for heap management, but from rounding the allocation size up to a supported size class.
Other mallocs have a per-allocation header of a single word (dlmalloc and its derivatives are popular examples). The actual per-allocation overhead is usually slightly larger because, due to the header word, you get weird supported allocation sizes (such as 24, 40, 56, … bytes with 16-byte alignment on a 64-bit system).
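If you want to observe this rounding on your own allocator, glibc exposes malloc_usable_size() (MSVC's rough counterpart is _msize()); a small, glibc-specific sketch:
#include <cstdio>
#include <cstdlib>
#include <malloc.h>  // malloc_usable_size (glibc-specific)

int main() {
    const size_t requests[] = {1, 4, 8, 16, 24, 25, 32, 100};
    for (size_t r : requests) {
        void* p = std::malloc(r);
        // usable size shows the size class the request was rounded up to
        std::printf("requested %3zu bytes, usable %3zu bytes\n", r, malloc_usable_size(p));
        std::free(p);
    }
    return 0;
}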
One thing to keep in mind is that many malloc implementations put a lot of data inside deallocated objects (which have not yet been returned to the operating system kernel), so that malloc (the function) can quickly find an unused memory region of the appropriate size. Particularly for dlmalloc-style allocators, this also puts a constraint on the minimum object size. The use of deallocated objects for heap management contributes to malloc overhead too, but its impact on individual allocations is difficult to quantify.
Say, if I call malloc(sizeof(int)), requesting 4 bytes, how much extra will be added by the system (or the standard library?) to support the memory-management infrastructure? I believe there should be some; otherwise, how would the system know how many bytes to dispose of when I call free(ptr)?
This is entirely library specific. The answer could be anything from zero to whatever. Your library could add data to the front of the block. Some add data to the front and back of the block to track overwrites. The amount of overhead added varies among libraries.
The length could be tracked within the library itself with a table. In that case, there may not be a hidden field added to the allocated memory.
The library might only allocate blocks in fixed sizes. The amount you ask for gets rounded up to the next block size.
The pointer itself is essentially overhead and can be a dominant driver of memory use in some programs.
The theoretical minimum overhead might be sizeof(void*) for some theoretical system and use, but that combination of CPU, memory, and usage pattern is so unlikely to exist as to be absolutely worthless for consideration. The standard requires that memory returned by malloc be suitably aligned for any data type, therefore there will always be some overhead, in the form of unused memory between the end of one allocated block and the beginning of the next, except in the rare cases where all memory usage is sized to a multiple of the block size.
The minimum implementation of malloc/free/realloc assumes the heap manager has one contiguous block of memory at its disposal, located somewhere in system memory. The pointer that said heap manager uses to reference that original block is overhead (again, sizeof(void*)). One could imagine a highly contrived application that requested that entire block of memory, thus avoiding the need for additional tracking data. At this point we have 2 * sizeof(void*) worth of overhead: one pointer internal to the heap manager, plus the returned pointer to the one allocated block (the theoretical minimum). Such a conforming heap manager is unlikely to exist, as it must also have some means of allocating more than one block from its pool, and that implies, at a minimum, tracking which blocks within its pool are in use.
One scheme to avoid overhead involves using pointer sizes that are larger than the physical or logical memory available to the application. One can store some information in those unused bits, but they would also count as overhead if their number exceeds a processor word size. Generally, only a handful of bits are used, and those identify which of the memory manager's internal pools the memory comes from. The latter implies the additional overhead of pointers to pools. This brings us to real-world systems, where the heap manager implementation is tuned to the OS, hardware architecture, and typical usage patterns.
Most hosted implementations (hosted == runs on an OS) request one or more memory blocks from the operating system in the C runtime initialization phase. OS memory-management calls are expensive in time and space. The OS has its own memory manager, with its own overhead, driven by its own design criteria and usage patterns. So C runtime heap managers attempt to limit the number of calls into the OS memory manager, in order to reduce the latency of the average call to malloc() and free(). Most request the first block from the OS when malloc is first called, but this usually happens at some point in the C runtime initialization code. This first block is usually a low multiple of the system page size, which can be one or more orders of magnitude larger than the size requested in the initial malloc() call.
At this point it is obvious that heap manager overhead is extremely fluid and difficult to quantify. On a typical modern system, the heap manager must track multiple blocks of memory allocated from the OS, how many bytes are currently allocated to the application in each of those blocks and potentially, how much time has passed since a block went to zero. Then there's the overhead of tracking allocations from within each of those blocks.
Generally malloc rounds up to a minimum alignment boundary, and often this is not special-cased for small allocations, as applications are expected to aggregate many of these into a single allocation. The minimum alignment is often based on the largest alignment required by a load instruction on the architecture the code is running on, so with 128-bit SIMD (e.g. SSE or NEON) the minimum is 16 bytes. In practice there is a header as well, which causes the minimum cost in size to be doubled. As SIMD register widths have increased, malloc hasn't increased its guaranteed alignment.
As was pointed out, the minimum possible overhead is 0, though the pointer itself should probably be counted in any reasonable analysis. In a garbage-collector design, at least one pointer to the data has to be present. In a straight non-GC design, one has to have a pointer to call free, but there's no ironclad requirement that it has to be called. One could theoretically compress a bunch of pointers together into less space as well, but now we're into an analysis of the entropy of the bits in the pointers. The point being, you likely need to specify some more constraints to get a really solid answer.
By way of illustration, if one needs arbitrary allocation and deallocation of just int size, one can allocate a large block and create a linked list of indexes using each int to hold the index of the next. Allocation pulls an item off the list and deallocation adds one back. There is a constraint that each allocation is exactly an int. (And that the block is small enough that the maximum index fits in an int.) Multiple sizes can be handled by having different blocks and searching for which block the pointer is in when deallocation happens. Some malloc implementations do something like this for small fixed sizes such as 4, 8, and 16 bytes.
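A bare-bones sketch of that scheme (the class name and interface are illustrative): the block is an array of ints, and every free slot stores the index of the next free slot, so no memory beyond the block and the pool object itself is needed for tracking.
#include <vector>

class IntPool {
public:
    explicit IntPool(int capacity) : slots_(capacity), head_(0) {
        for (int i = 0; i < capacity; ++i)
            slots_[i] = i + 1;               // each free slot holds the index of the next free slot
    }
    int* allocate() {                        // returns one int, or nullptr if exhausted
        if (head_ == (int)slots_.size()) return nullptr;
        int idx = head_;
        head_ = slots_[idx];                 // pop the free list
        return &slots_[idx];
    }
    void deallocate(int* p) {
        int idx = (int)(p - slots_.data());
        slots_[idx] = head_;                 // push the slot back onto the free list
        head_ = idx;
    }
private:
    std::vector<int> slots_;                 // the block; free slots double as "next" links
    int head_;                               // index of the first free slot
};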
This approach doesn't hit zero overhead, as one needs to maintain some data structure to keep track of the blocks. This is illustrated by considering the case of one-byte allocations: a block can hold at most 256 allocations, since that is the largest number of indexes that fit in a one-byte slot. If we want to allow more allocations than this, we need at least one pointer per block, which is e.g. 4 or 8 bytes of overhead per 256 bytes.
One can also use bitmaps, which amortize to one bit per some granularity plus the quantization of that granularity. Whether this is low overhead or not depends on the specifics. E.g. one bit per byte has no quantization but eats one eighth the allocation size in the free map. Generally this will all require storing the size of the allocation.
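And a bare-bones sketch of the bitmap variant, for fixed-size slots (the linear scan and the template parameters are purely illustrative; a real allocator would keep a search hint and store the allocation size somewhere):
#include <bitset>
#include <cstddef>

template <std::size_t SlotSize, std::size_t SlotCount>
class BitmapPool {
public:
    void* allocate() {
        for (std::size_t i = 0; i < SlotCount; ++i)
            if (!used_[i]) { used_[i] = true; return storage_ + i * SlotSize; }
        return nullptr;                          // pool exhausted
    }
    void deallocate(void* p) {
        std::size_t i = (static_cast<char*>(p) - storage_) / SlotSize;
        used_[i] = false;
    }
private:
    alignas(std::max_align_t) char storage_[SlotSize * SlotCount];
    std::bitset<SlotCount> used_;                // one bit of free-map per slot
};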
In practice allocator design is hard because the trade-off space between size overhead, runtime cost, and fragmentation overhead is complicated, often has large cost differences, and is allocation pattern dependent.

CUDA programming: Memory access speed and memory usage: thread-local variables vs. shared-memory variables vs. numeric literals? [duplicate]

This question already has answers here:
Using constants with CUDA (2 answers)
Suppose I have an array with several fixed numerical values that would be accessed multiple times by multiple threads within the same block, what are some pros and cons in terms of access speed and memory usage if I store these values in:
thread-local memory: double x[3] = {1,2,3};
shared memory: __shared__ double x[3] = {1,2,3};
numeric literals: directly hardcode these values in the expression where they appear
Thanks!
TL;DR
use __constant__ double x[3]; // ... initialization ...
First, know where a variable actually resides
In your question:
thread-local memory: double x[3] = {1,2,3};
This is imprecise. Depending on how your code accesses x[], x[] can reside in either registers or local memory.
Since there are no type qualifiers, the compiler will try its best to put it in registers:
An automatic variable declared in device code without any of the __device__, __shared__ and __constant__ qualifiers described in this section generally resides in a register. However in some cases the compiler might choose to place it in local memory,
but when it can't, it will put them in local memory:
Arrays for which it cannot determine that they are indexed with constant quantities,
Large structures or arrays that would consume too much register space,
Any variable if the kernel uses more registers than available (this is also known as register spilling).
You really don't want x to be in local memory; it's slow. In your situation,
an array with several fixed numerical values that would be accessed multiple times by multiple threads within the same block
Both __constant__ and __shared__ can be a good choice.
For a complete description on this topic, check: CUDA Toolkit Documentation: variable-type-qualifiers
Then, consider speed & availability
Hardcode
The number will be embedded in instructions. You may expect some performance improvement. Better benchmark your program before and after doing this.
Register
It's fast, but scarce. Consider a block with 16x16 threads: with a maximum of 64K registers per block, each thread can use 256 registers. (Well, maybe not that scarce; that should be enough for most kernels.)
Local Memory
It's slow. However, a thread can use up to 512KB local memory.
The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses...
Shared Memory
It's fast, but scarce. Typically 48KB per block (less than registers!).
Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory.
Constant Memory
It's fast in a different way (see below), which highly depends on cache, and cache is scarce. Typically 8KB ~ 10KB cache per multiprocessor.
The constant memory space resides in device memory and is cached in the constant cache mentioned in Compute Capability 2.x.
A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
read: CUDA Toolkit Documentation: device-memory-accesses
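A minimal sketch of the TL;DR above (compiled as a .cu file; the kernel, names, and launch configuration are illustrative): the three fixed values live in __constant__ memory, and since every thread reads the same addresses, the constant cache can broadcast them.
#include <cuda_runtime.h>

__constant__ double coef[3];

__global__ void evalPoly(const double* x, double* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = coef[0] + coef[1] * x[i] + coef[2] * x[i] * x[i]; // broadcast reads from the constant cache
}

void launch(const double* d_x, double* d_y, int n) {
    const double host_coef[3] = {1.0, 2.0, 3.0};
    cudaMemcpyToSymbol(coef, host_coef, sizeof(host_coef));      // one-time initialization from the host
    evalPoly<<<(n + 255) / 256, 256>>>(d_x, d_y, n);
}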

C++ pool allocator vs static allocation, cache performance

Given that I have two parallel and identically sized arrays of the following structs:
struct Matrix
{
float data[16];
};
struct Vec4
{
float data[4];
};
//Matrix arrM[256]; //for illustration
//Vec4 arrV[256];
Let's say I wish to iterate over the two arrays sequentially as fast as possible. Let's say the function is something like:
for (int i = 0; i < 256; ++i)
{
    readonlyfunc(arrMPtr[i].data);
    readonlyfunc(arrVPtr[i].data);
}
Assume that my allocations are aligned for each array, whether statically allocated or on the heap, and that my cache line size is 64 bytes.
Would I achieve the same cache locality and performance if I were to store my data as:
A)
//aligned
static Matrix arrM[256];
static Vec4 arrV[256];
Matrix* arrMPtr = &arrM[0];
Vec4* arrVPtr = &arrV[0];
vs
B)
//aligned
char* ptr = (char*) malloc(256*sizeof(Matrix) + 256*sizeof(Vec4));
Matrix* arrMPtr = (Matrix*) ptr;
Vec4* arrVPtr = (Vec4*) (ptr + 256*sizeof(Matrix));
How the memory is allocated (heap or statically allocated) makes no difference to the memory's ability to be cached. Since both of these arrays are fairly large (roughly 16 KB and 4 KB, respectively), the exact alignment of the first and last elements probably doesn't matter either (but it does matter if you are using SSE instructions to access the content!).
Whether the memory is close together or not won't make a huge difference, as long as the allocation is small enough to easily fit in the cache, but big enough to take up multiple cache-lines.
You may find that using a structure with 20 float values works out better, if you are working sequentially through both arrays. But that only works if you don't ever need to do other things with the data where having a single array makes more sense.
There may be a difference in the compiler's ability to translate the code to avoid an extra memory access. This will clearly depend on the actual code (e.g. whether the compiler inlines the function containing the for loop, whether it inlines the readonlyfunc code, and so on). If so, the static allocation can be translated from the pointer variant (which loads the address of the pointer to get the address of the data) into a constant address calculation. It probably doesn't make a huge difference in a loop as large as this.
When it comes to performance, small things can make big differences, so if this is really important, do some experiments, using YOUR compiler and YOUR actual code. We can only give relatively speculative advice based on our experience. Different compilers do different things with the same code, and different processors do different things with the same machine code (different instruction set architectures, such as ARM vs x86, and different implementations of an architecture, such as AMD Opteron vs Intel Atom or ARM Cortex-A15 vs Cortex-M3). Memory configuration on your particular system will also affect things: how big the caches are, and so on.
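A rough sketch of such an experiment with std::chrono (readonlyfunc here is a stand-in for the real work, and the repetition count is arbitrary); the malloc-based layout from the question can be dropped into the same harness for comparison:
#include <chrono>
#include <cstdio>

struct Matrix { float data[16]; };
struct Vec4   { float data[4];  };

static float sink = 0.0f;
void readonlyfunc(const float* p) { sink += p[0]; }   // stand-in for the real read-only work

template <typename F>
long long microsecondsFor(F&& body) {
    auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 100000; ++rep) body();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main() {
    static Matrix arrM[256];
    static Vec4   arrV[256];
    long long us = microsecondsFor([&] {
        for (int i = 0; i < 256; ++i) {
            readonlyfunc(arrM[i].data);
            readonlyfunc(arrV[i].data);
        }
    });
    std::printf("static arrays: %lld us (sink=%f)\n", us, sink);
    return 0;
}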
It's impossible to say without knowing more about what you're doing and testing. It might be more efficient to use a struct of arrays.
struct MatrixVec
{
float m[16];
float v[4];
};
One important point is that malloc allocates memory from the heap, whereas static arrays live in the program's data segment and plain local arrays live on the stack. The stack is likely already in the L1 cache, whereas memory freshly obtained from the heap may have to be read in. A lesser-known function for dynamic memory allocation, alloca, allocates memory on the stack instead. In your case you could try
char* ptr = (char*) alloca(256*sizeof(Matrix) + 256*sizeof(Vec4));
See Agner Fog's Optimizing software in C++, section "9.6 Dynamic memory allocation". Here are the advantages he lists for alloca compared to malloc:
There is very little overhead to the allocation process because the microprocessor has hardware support for the stack.
The memory space never becomes fragmented thanks to the first-in-last-out nature of the stack.
Deallocation has no cost because it goes automatically when the function returns. There is no need for garbage collection.
The allocated memory is contiguous with other objects on the stack, which makes data caching very efficient.

Memory calculation of objects inaccurate?

I'm creating a small cache daemon, and I want to limit its memory usage to approximately a specified amount. However, there seems to be an issue just trying to calculate how much memory is in use.
Every time a CacheEntry object is created, it adds the size of a CacheEntry object (apparently 64 bytes) plus the number of bytes used in internal arrays to the counter for how many bytes are in use. When the CacheEntry object is deleted, it subtracts that amount. I can confirm that the math, at least, is correct.
However, when run inside NetBeans, the memory profiler reports vastly different numbers. Roughly twice as high, to be specific. It is not a memory leak, and it is specifically related to the amount of CacheEntry objects currently in existence. Increasing the amount of data stored in the internal arrays actually brings the numbers closer together (as opposed to further apart, if that were being improperly calculated); from this, I have concluded that the overhead of having a CacheEntry object in memory is almost twice what sizeof() is reporting. It does not rise in steps or "chunks".
Is there some common reason why this might happen?
UPDATE: Just to check, I ran my tests without a profiler in place. Linux reports the same VmHWM/VmRSS either way, so the memory profiler is definitely not affecting the calculations.
Perhaps the profiler is adding reference objects to track the objects? Do you see the same results when you run the application in release vs Debug?
Is there some common reason why this might happen?
Yeah, that could be internal fragmentation and overhead of the memory manager. If your data type is small (e.g. sizeof(CacheEntry) is 8 bytes), newing such a data type might produce a bigger chunk of memory. It is partly used for malloc's internal bookkeeping (it usually stores the size of the block somewhere) and partly for the padding needed to align your data type on its natural boundary (e.g. 8 bytes of data + 4 bytes of bookkeeping + 4 bytes of padding to align the whole thing on an 8-byte boundary).
You can solve it by allocating from a single contiguous array of CacheEntry (e.g. CacheEntry array[1000] takes exactly 1000*sizeof(CacheEntry) bytes). You'd have to track the usage of the individual elements in the array, but that should be doable without additional memory (e.g. by running a free list of entries in place of the free entries).
This memory bloat is caused by use of new, specifically on relatively small objects. On Windows, dynamically allocated memory incurs a 16- or 24-byte overhead each time; I haven't found the exact numbers for Linux, but it's roughly the same. This is because each allocated chunk needs to record its location and size (possibly more than once) so that it can be accurately freed later.
As far as I'm aware, the running program also does not know exactly how much overhead is involved in this, at least in any way accessible to the programmer.
Generally speaking, large quantities of small objects should use a memory pool, both for speed and memory conservation.

Access cost of dynamically created objects with dynamically allocated members

I'm building an application which will have dynamic allocated objects of type A each with a dynamically allocated member (v) similar to the below class
class A {
int a;
int b;
int* v;
};
where:
The memory for v will be allocated in the constructor.
v will be allocated once when an object of type A is created and will never need to be resized.
The size of v will vary across all instances of A.
The application will potentially have a huge number of such objects and mostly need to stream a large number of these objects through the CPU but only need to perform very simple computations on the members variables.
Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?
What tools and techniques can be used to test whether this fragmentation is a performance bottleneck?
If such fragmentation is a performance issue, are there any techniques that could allow A and v to be allocated in a contiguous region of memory?
Or are there any techniques to aid memory access, such as a prefetching scheme? For example, get an object of type A and operate on the other member variables whilst prefetching v.
If the size of v, or an acceptable maximum size, could be known at compile time, would replacing v with a fixed-size array like int v[max_length] lead to better performance?
The target platforms are standard desktop machines with x86/AMD64 processors, Windows or Linux OSes and compiled using either GCC or MSVC compilers.
If you have a good reason to care about performance...
Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?
If they are both allocated with 'new', then it is likely that they will be near one another. However, the current state of memory can drastically affect this outcome, it depends significantly on what you've been doing with memory. If you just allocate a thousand of these things one after another, then the later ones will almost certainly be "nearly contiguous".
If the A instance is on the stack, it is highly unlikely that its 'v' will be nearby.
If such fragmentation is a performance issue, are there any techniques that could allow A and v to be allocated in a contiguous region of memory?
Allocate space for both, then placement new them into that space. It's dirty, but it should typically work:
char* p = reinterpret_cast<char*>(malloc(sizeof(A) + n * sizeof(int))); // n = number of ints in v
int* v = reinterpret_cast<int*>(p + sizeof(A));
A* a = new (p) A(v);  // placement new; requires <new>
// time passes
a->~A();
free(p);
Or are there any techniques to aid memory access, such as a prefetching scheme?
Prefetching is compiler and platform specific, but many compilers have intrinsics available to do it. Mind you, it won't help much if you're going to access that data right away; for prefetching to be of any value, you often need to do it hundreds of cycles before you want the data. That said, it can be a huge boost to speed. The intrinsic would look something like __pf(my_a->v);
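For reference, the real intrinsics are __builtin_prefetch on GCC/Clang and _mm_prefetch (in <xmmintrin.h>) on MSVC; a hedged sketch, where the distance of 8 objects ahead is just a starting guess to be tuned:
#include <cstddef>
#include <vector>

struct A { int a; int b; int* v; };

long long process(const std::vector<A>& items) {
    long long sum = 0;
    const std::size_t n = items.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 8 < n)
            __builtin_prefetch(items[i + 8].v);          // ask for a later object's v early (GCC/Clang)
        sum += items[i].a + items[i].b + items[i].v[0];  // the "very simple computation"
    }
    return sum;
}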
If the size of v or an acceptable maximum size could be known at compile time
would replacing v with a fixed sized array like int v[max_length] lead to better
performance?
Maybe. If the fixed size buffer is usually close to the size you'll need, then it could be a huge boost in speed. It will always be faster to access one A instance in this way, but if the buffer is unnecessarily gigantic and largely unused, you'll lose the opportunity for more objects to fit into the cache. I.e. it's better to have more smaller objects in the cache than it is to have a lot of unused data filling the cache up.
The specifics depend on what your design and performance goals are. For an interesting discussion about this, with a "real-world" specific problem on a specific bit of hardware with a specific compiler, see The Pitfalls of Object Oriented Programming (that's a Google Docs link for a PDF; the PDF itself can be found here).
Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?
Yes, that is likely.
What tools and techniques can be used to test if this fragmentation is a performance bottleneck?
cachegrind, shark.
If such fragmentation is a performance issue, are there any techniques that could allow A and v to be allocated in a contiguous region of memory?
Yes, you could allocate them together, but you should probably see if it's an issue first. You could use arena allocation, for example, or write your own allocators.
Or are there any techniques to aid memory access, such as a prefetching scheme? For example, get an object of type A and operate on the other member variables whilst prefetching v.
Yes, you could do this. The best thing to do would be to allocate regions of memory used together near each other.
If the size of v, or an acceptable maximum size, could be known at compile time, would replacing v with a fixed-size array like int v[max_length] lead to better performance?
It might or might not. It would at least make v local with the struct members.
Write code.
Profile.
Optimize.
If you need to stream a large number of these through the CPU and do very little calculation on each one, as you say, why are we doing all this memory allocation?
Could you just have one copy of the structure, and one (big) buffer for v, read your data into it (in binary, for speed), do your very little calculation, and move on to the next one?
The program should spend almost 100% of time in I/O.
If you pause it several times while it's running, you should see it almost every time in the process of calling a system routine like FileRead. Some profilers might give you this information, except they tend to be allergic to I/O time.
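By way of illustration, a rough sketch of that streaming approach (the record layout, file name, and the trivial per-record calculation are all assumptions made for the example):
#include <cstdio>
#include <vector>

struct RecordHeader { int a; int b; int count; };   // fixed part; 'count' ints of v follow in the file

int main() {
    std::FILE* f = std::fopen("objects.bin", "rb");
    if (!f) return 1;
    std::vector<int> v;                 // one reusable buffer for every record's v
    RecordHeader h;
    long long sum = 0;
    while (std::fread(&h, sizeof(h), 1, f) == 1 && h.count >= 0) {
        v.resize(h.count);
        if (h.count > 0 && std::fread(v.data(), sizeof(int), h.count, f) != (std::size_t)h.count)
            break;
        for (int x : v) sum += x;       // the "very little calculation" per object
    }
    std::fclose(f);
    std::printf("sum = %lld\n", sum);
    return 0;
}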