If I query the maximum compute shader shared memory size with:
GLint maximum_shared_mem_size;
glGetIntegerv(GL_MAX_COMPUTE_SHARED_MEMORY_SIZE, &maximum_shared_mem_size);
I get 48KB as a result. However, according to this whitepaper:
https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
on page 13, it is stated that for my GPU (an RTX 2080 Ti):
The Turing L1 can be as large as 64 KB in size, combined with a 32 KB per SM shared memory allocation, or it can reduce to 32 KB, allowing 64 KB of allocation to be used for shared memory. Turing’s L2 cache capacity has also been increased.
So, I expect OpenGL to return 64KB for the maximum shared memory size.
Is this a wrong assumption? If so, why?
It looks like the 48KB is the expected result, as documented in the Turing Tuning Guide for CUDA:
Turing allows a single thread block to address the full 64 KB of shared memory. To maintain architectural compatibility, static shared memory allocations remain limited to 48 KB, and an explicit opt-in is also required to enable dynamic allocations above this limit. See the CUDA C Programming Guide for details.
It seems you can either accept the default 48KB, or use CUDA's explicit opt-in to gain control over the shared memory/L1 carveout configuration.
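As a hedged sketch of that opt-in path (the kernel name and sizes here are illustrative, but cudaFuncSetAttribute with cudaFuncAttributeMaxDynamicSharedMemorySize is the documented mechanism):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void big_smem_kernel(float *out)
{
    extern __shared__ float tile[];          // dynamic shared memory
    tile[threadIdx.x] = (float)threadIdx.x;  // placeholder work
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}

int main()
{
    const int smem_bytes = 64 * 1024;        // 64 KB, above the 48 KB default
    // Explicit opt-in for dynamic shared memory allocations above 48 KB:
    cudaFuncSetAttribute(big_smem_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         smem_bytes);
    float *d_out;
    cudaMalloc(&d_out, 1024 * sizeof(float));
    big_smem_kernel<<<1, 1024, smem_bytes>>>(d_out);
    cudaDeviceSynchronize();
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_out);
    return 0;
}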
Related
I have a kernel which is called multiple times. In each call, constant data of around 240 KB is shared and processed by the threads. The threads work independently, like a map function. The stall time of the threads is considerable; the reason may be bank conflicts on memory reads. How can I handle this? (I have a GTX 1080 Ti.)
Can "const global" of opencl handle this? (because constant memory in cuda is limited to 64 kb)
In CUDA, I believe the best recommendation would be to make use of the so-called "read-only" cache. This has at least two possible benefits over the __constant__ memory/constant cache system:
It is not limited to 64 KB like __constant__ memory is.
It does not expect or require "uniform access" like the constant cache does, to deliver full access bandwidth/best performance. Uniform access refers to the idea that all threads in a warp are accessing the same location or same constant memory value (per read cycle/instruction).
The read-only cache is documented in the CUDA programming guide. Possibly the easiest way to use it is to decorate your pointers passed to the CUDA kernel with __restrict__ (assuming you are not aliasing between pointers) and to decorate the pointer that refers to the large constant data with const ... __restrict__. This will allow the compiler to generate appropriate LDG instructions for access to constant data, pulling it through the read-only cache mechanism.
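A minimal sketch of that decoration (the kernel and names are illustrative; the point is the const ... __restrict__ qualifiers on the large read-only data):

__global__ void map_kernel(const float * __restrict__ table,  // large read-only data
                           const float * __restrict__ in,
                           float       * __restrict__ out,
                           int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // const + __restrict__ lets the compiler emit LDG loads that pull
        // the table through the read-only cache.
        out[i] = in[i] * table[i % 1024];
}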
This read-only cache mechanism is only supported on cc 3.5 or higher GPUs, but that covers some GPUs in the Kepler generation and all GPUs in the Maxwell, Pascal (including your GTX 1080 Ti), Volta, and Turing generations.
If you have a GPU below cc 3.5, the best suggestion for similar benefits (larger than __constant__ memory, not needing uniform access) would be to use texture memory. This is also documented elsewhere in the programming guide; there are various CUDA sample codes that demonstrate the use of texture memory, and plenty of questions under the SO cuda tag covering it as well.
Constant memory that doesn't fit in the hardware's constant buffer will typically "spill" into global memory on OpenCL. Bank conflicts are usually an issue with local memory, however, so that's probably not it. I'm assuming CUDA's 64 KiB constant limit reflects NVIDIA's hardware, so OpenCL isn't going to magically perform better here.
Reading global memory without a predictable pattern can of course be slow, however, especially if you don't have sufficient thread occupancy and arithmetic to mask the memory latency.
Without knowing anything further about your problem space, this also brings me to the directions you could take for further optimisation, assuming your global memory reads are the issue:
Reduce the amount of constant/global data you need, for example by using more efficient types, other compression mechanisms, or computing some of the values on the fly (possibly storing them in local memory for all threads in a group to share).
Cluster the most frequently used data in a small constant buffer, and explicitly place the more rarely used constants in a global buffer. This may help the runtime lay it out more efficiently in the hardware. If that doesn't help, try to copy the frequently used data into local memory, and make your thread groups large and comparatively long-running to hide the copying hit (a sketch of this staging pattern follows the list).
Check if thread occupancy could be improved. It usually can, and this tends to give you substantial performance improvements in almost any situation (except when your code is already extremely ALU/FPU-bound).
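In CUDA terms (shared memory being the analogue of OpenCL local memory), the staging pattern from the second bullet might look like this sketch; the names and sizes are illustrative:

#define HOT_WORDS 1024                        // illustrative size of the hot data

__global__ void staged_kernel(const float * __restrict__ hot_global,
                              const float *in, float *out, int n)
{
    __shared__ float hot[HOT_WORDS];          // on-chip copy shared by the block
    // Cooperative, coalesced copy of the frequently used data.
    for (int i = threadIdx.x; i < HOT_WORDS; i += blockDim.x)
        hot[i] = hot_global[i];
    __syncthreads();

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = in[gid] * hot[gid % HOT_WORDS];  // placeholder work
}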
Occupancy in CUDA is defined as
occupancy = active_warps / maximum_active_warps
What is the difference between a resident CUDA warp and an active one?
From my research on the web it seems that a block is resident (i.e. allocated along with its register/shared memory files) on a SM for the entire duration of its execution. Is there a difference with "being active"?
If I have a kernel which uses very few registers and shared memory.. does it mean that I can have maximum_active_warps resident blocks and achieve 100% occupancy since occupancy just depends on the amount of register/shared memory used?
What is the difference between a resident CUDA warp and an active one?
In this context presumably nothing.
From my research on the web it seems that a block is resident (i.e. allocated along with its register/shared memory files) on a SM for the entire duration of its execution. Is there a difference with "being active"?
Now you have switched from asking about warps to asking about blocks. But again, in this context no, you could consider them to be the same.
If I have a kernel which uses very few registers and shared memory.. does it mean that I can have maximum_active_warps resident blocks and achieve 100% occupancy since occupancy just depends on the amount of register/shared memory used?
No, because a warp and a block are not the same thing. As you yourself have quoted, occupancy is defined in terms of warps, not blocks. The maximum number of warps per SM is fixed at 48 or 64 depending on your hardware, while the maximum number of blocks per SM is fixed at 8, 16 or 32, again depending on hardware. These are two independent limits, and both can influence the effective occupancy a given kernel can achieve.
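If it helps, the runtime can report these limits for a concrete kernel. A minimal sketch (the kernel and block size are illustrative; cudaOccupancyMaxActiveBlocksPerMultiprocessor is the documented API):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel(float *p) { p[threadIdx.x] += 1.0f; }

int main()
{
    int threads_per_block = 128, max_blocks = 0;
    // How many blocks of this kernel can be resident on one SM at once:
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks, my_kernel, threads_per_block, 0 /* dynamic smem */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int active_warps = max_blocks * threads_per_block / prop.warpSize;
    int max_warps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("occupancy = %d/%d warps per SM\n", active_warps, max_warps);
    return 0;
}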
I have been developing my program using malloc() to allocate memory. However, my investigations made me think that I am facing a memory fragmentation problem.
My program needs 5 memory allocations of ~70 MB each. When I run my program using 4 threads, I need 5x4 memory allocations of ~70 MB each (and I cannot use less memory). In the end, I want to be able to use the 8 cores of my i7, that is, 5x8 memory allocations.
If I do 5x2 malloc()s, the program works; for 5x3 malloc()s it does not.
I have been reading about std::vector and std::deque. I believe that std::deque is my solution for this problem, as std::vector allocates a big chunk of consecutive memory as malloc() does.
Are there any other solutions to explore, or is std::deque my only option?
EDIT
OS: Windows 8.1 (x64)
RAM: 8 GB (5 GB of free space)
I detect malloc() errors by checking errno == ENOMEM
NOTE: ERROR_MEM_ALLOC_FAILED is one of the errors I generate when memory allocation fails.
A debug trace for the program with 4 threads (i.e. 5x4 malloc()s):
Start
Thread 01
(+53.40576 MB) Total allocated 53.4/4095 total MB
(+53.40576 MB) Total allocated 106.8/4095 total MB
(+0.00008 MB) Total allocated 106.8/4095 total MB
(+0.00008 MB) Total allocated 106.8/4095 total MB
Tried to allocate 267 MB
ERROR_MEM_ALLOC_FAILED
Thread 02
(+53.40576 MB) Total allocated 160.2/4095 total MB
(+53.40576 MB) Total allocated 213.6/4095 total MB
(+0.00008 MB) Total allocated 213.6/4095 total MB
(+0.00008 MB) Total allocated 213.6/4095 total MB
Tried to allocate 267 MB
ERROR_MEM_ALLOC_FAILED
Thread 03
(+53.40576 MB) Total allocated 267.0/4095 total MB
Tried to allocate 53 MB
ERROR_MEM_ALLOC_FAILED
Thread 04
Tried to allocate 53 MB
ERROR_MEM_ALLOC_FAILED
End of program
I tried to run the same thing but changing the order of the memory allocations, but no memory was allocated.
Start
Thread 01
Tried to allocate 267 MB
ERROR_MEM_ALLOC_FAILED
Thread 02
Tried to allocate 267 MB
ERROR_MEM_ALLOC_FAILED
Thread 03
Tried to allocate 267 MB
ERROR_MEM_ALLOC_FAILED
Thread 04
Tried to allocate 267 MB
ERROR_MEM_ALLOC_FAILED
End of program
SOLUTION
The solution was to compile the application as a 64-bit application. Hence, it probably was not a fragmentation problem after all.
Why do you believe it's a memory fragmentation problem? Fragmentation is typically caused by allocating and deleting a large number of blocks of varying sizes, resulting in holes of available memory in between allocations that are not usable or useful sizes. It does not sound at all like the pattern of memory access you describe.
Also, this amount of memory is not large by today's standards, though it depends on your hardware and operating system. How much physical memory does your machine have? What OS are you running? Is it built as a 32-bit or 64-bit app? How do you know malloc is failing - is it returning null? Have you tried memory profiling?
Heap usage: 8 threads * 5 blocks * 70MB per block = 2800MB total
On Windows, the default per-process address-space limit is 2GB for a 32-bit program, so you are quite likely hitting this limit. Probably the best solution is to build your app as a 64-bit application; then you can allocate huge amounts of (virtual) RAM.
I have been reading about std::vector and std::deque. I believe that std::deque is my solution for this problem, as std::vector allocates a big chunk of consecutive memory as malloc() does.
No, using std::vector or std::deque won't necessarily solve your problem if it is either fragmentation or overallocation (most likely). They will both use new/malloc in their implementation to allocate memory anyway, so if you already know the bounds of your allocations, you might as well request the full amount up front as you are doing.
Are there any other solutions to explore, or is std::deque my only option?
A deque is not a solution
Analyse your memory requirements, access patterns and reduce usage
If you can't get usage well below 2GB, switch to a 64-bit build and OS (a quick sanity check is sketched below)
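A trivial sanity check for the build bitness (a sketch; a 32-bit build prints 4, a 64-bit build prints 8):

#include <cstdio>

int main()
{
    // 4 => 32-bit build (~2 GB usable on Windows by default); 8 => 64-bit.
    std::printf("pointer size: %zu bytes\n", sizeof(void *));
    return 0;
}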
It depends on how much RAM you have. You need 5 * 70MB * 8 = 2800MB. There are some cases:
If you have much more than that, it shouldn't be a problem to find it, even in contiguous blocks. I suppose you don't have so much.
If, on the other hand, you don't have that much memory, no container will suit your needs, and there's nothing you can really do other than adding RAM or modifying your program to use fewer cores.
In the intermediate case, that is, when your memory is not less than that but not much more either, switching to another container might work, but there are still problems. Keep in mind that a vector is very space-efficient, as it is contiguous; any kind of linked list needs to store pointers to the next elements, and those pointers can take significant space, so you might end up needing more than 2800MB, although not in contiguous chunks. A std::list, from this point of view, would be terrible, because it needs a pointer for every element. So if your vectors hold a few large items, switching to a list gives you only a little overhead from those few pointers, but if they hold a lot of small values, the list forces you to waste a lot of space on pointers. In this sense, a deque should be what you need, as internally it is usually implemented as a group of arrays, so you don't need a pointer to every element.
To conclude: yes, a deque is what you are looking for. It will require a little more memory than vectors, but that memory won't have to be contiguous, so you shouldn't have any more address-space fragmentation problems.
On a compute capability 2.x device, how would I make sure that the GPU uses coalesced memory access when using mapped pinned memory, assuming that normally, when using global memory, the 2D data would require padding?
I can't seem to find information about this anywhere, perhaps I should be looking better or perhaps I am missing something. Any pointers in the right direction are welcome...
The coalescing approach should be applied when using zero-copy memory. Quoting the CUDA C Best Practices Guide:
Because the data is not cached on the GPU, mapped pinned memory should be read or written only once, and the global loads and stores that read and write the memory should be coalesced.
Quoting the "CUDA Programming" book, by S. Cook
If you think about what happens with access to global memory, an entire cache line is brought in from memory on compute 2.x hardware. Even on compute 1.x hardware the same 128 bytes, potentially reduced to 64 or 32, is fetched from global memory.
NVIDIA does not publish the size of the PCI-E transfers it uses, or details on how zero copy is actually implemented. However, the coalescing approach used for global memory can be applied to PCI-E transfers as well. The warp memory latency hiding model equally applies to PCI-E transfers, provided there is enough arithmetic density to hide the latency of the PCI-E transfers.
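As a hedged illustration of that guidance (the names and sizes are mine; cudaHostAlloc with cudaHostAllocMapped and cudaHostGetDevicePointer are the documented zero-copy APIs): consecutive threads read consecutive elements, and each element is touched exactly once.

#include <cuda_runtime.h>

__global__ void copy_once(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // coalesced indexing
    if (i < n)
        dst[i] = src[i];                            // read once, write once
    // For 2D data, keep each row start aligned by padding the row pitch
    // yourself; cudaMallocPitch does not apply to host allocations.
}

int main()
{
    const int n = 1 << 20;
    float *h_src, *d_src, *d_dst;
    cudaSetDeviceFlags(cudaDeviceMapHost);          // enable host mapping
    cudaHostAlloc(&h_src, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_src, h_src, 0);     // device view of host mem
    cudaMalloc(&d_dst, n * sizeof(float));
    copy_once<<<(n + 255) / 256, 256>>>(d_src, d_dst, n);
    cudaDeviceSynchronize();
    cudaFree(d_dst);
    cudaFreeHost(h_src);
    return 0;
}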
With the very large amounts of RAM available these days, I was wondering: is it possible to allocate a single chunk of memory that is larger than 4GB? Or would I need to allocate a bunch of smaller chunks and handle switching between them?
Why???
I'm working on processing some OpenStreetMap XML data, and these files are huge. I'm currently streaming them in since I can't load them all in one chunk, but I just got curious about the upper limits on malloc or new.
Short answer: Not likely
In order for this to work, you absolutely would have to use a 64-bit processor.
Secondly, it would depend on operating system support for allocating more than 4GB of RAM to a single process.
In theory, it would be possible, but you would have to read the documentation for the memory allocator. You would also be more susceptible to memory fragmentation issues.
There is good information on Windows memory management.
A Primer on physical and virtual memory layouts
You would need a 64-bit CPU and O/S build and almost certainly enough memory to avoid thrashing your working set. A bit of background:
A 32 bit machine (by and large) has registers that can store one of 2^32 (4,294,967,296) unique values. This means that a 32-bit pointer can address any one of 2^32 unique memory locations, which is where the magic 4GB limit comes from.
Some 32-bit systems, such as SPARCv8 or Xeon, have MMUs that pull a trick to allow more physical memory. This allows multiple processes to take up memory totalling more than 4GB in aggregate, but each process is limited to its own 32-bit virtual address space. For a single process looking at a virtual address space, only 2^32 distinct physical locations can be mapped by a 32-bit pointer.
I won't go into the details, but this presentation (warning: PowerPoint) describes how this works. Some operating systems have facilities (such as those described here - thanks to FP above) to manipulate the MMU and swap different physical locations into the virtual address space under user-level control.
The operating system and memory-mapped I/O will take up some of the virtual address space, so not all of that 4GB is necessarily available to the process. As an example, Windows defaults to taking 2GB of this, but can be set to take only 1GB if the /3GB switch is specified at boot. This means that a single process on a 32-bit architecture of this sort can only build a contiguous data structure of somewhat less than 4GB in memory.
This means you would have to explicitly use the PAE facilities on Windows, or equivalent facilities on Linux, to manually swap in the overlays. This is not necessarily that hard, but it will take some time to get working.
Alternatively you can get a 64-bit box with lots of memory and these problems more or less go away. A 64 bit architecture with 64 bit pointers can build a contiguous data structure with as many as 2^64 (18,446,744,073,709,551,616) unique addresses, at least in theory. This allows larger contiguous data structures to be built and managed.
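To make that concrete, a small sketch (the 5 GB figure is illustrative): built as 64-bit, this is just an ordinary allocation; on a 32-bit build, size_t cannot even represent the request.

#include <cstdlib>
#include <cstdio>

int main()
{
    static_assert(sizeof(size_t) >= 8, "needs a 64-bit build");  // guard
    size_t bytes = 5ULL * 1024 * 1024 * 1024;    // 5 GB, > 2^32
    char *p = static_cast<char *>(std::malloc(bytes));
    if (!p) {
        std::perror("malloc");   // still needs enough virtual memory to commit
        return 1;
    }
    p[0] = 1;
    p[bytes - 1] = 1;            // touch both ends of the chunk
    std::free(p);
    return 0;
}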
The advantage of memory-mapped files is that you can open a file much bigger than 4GB (almost infinite on NTFS!) and have multiple <4GB memory windows into it.
It's much more efficient than opening a file and reading it into memory; on most operating systems it uses the built-in paging support.
This shouldn't be a problem with a 64-bit OS (and a machine that has that much memory).
If malloc can't cope then the OS will certainly provide APIs that allow you to allocate memory directly. Under Windows you can use the VirtualAlloc API.
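A minimal sketch of that route (assuming a 64-bit build; the size is illustrative):

#include <windows.h>
#include <cstdio>

int main()
{
    SIZE_T bytes = 5ULL * 1024 * 1024 * 1024;   // 5 GB
    // Reserve and commit the region directly from the OS, bypassing the CRT heap.
    void *p = VirtualAlloc(nullptr, bytes,
                           MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (!p) {
        std::printf("VirtualAlloc failed: %lu\n", GetLastError());
        return 1;
    }
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}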
It depends on which C compiler you're using, and on what platform (of course), but there's no fundamental reason why you cannot allocate the largest chunk of contiguously available memory - which may be less than you need. And of course you may have to be using a 64-bit system to address that much RAM...
see Malloc for history and details
call HeapMax in alloc.h to get the largest available block size
Have you considered using memory mapped files? Since you are loading in really huge files, it would seem that this might be the best way to go.
It depends on whether the OS will give you virtual address space that allows addressing memory above 4GB and whether the compiler supports allocating it using new/malloc.
For 32-bit Windows you won't be able to get a single chunk bigger than 4GB, as the pointer size is 32-bit, thus limiting your virtual address space to 4GB. (You could use Physical Address Extension to get more than 4GB of memory; however, I believe you would have to map that memory into the 4GB virtual address space yourself.)
For 64-bit Windows, the VC++ compiler supports 64-bit pointers with theoretical limit of the virtual address space to 8TB.
I suspect the same applies for Linux/gcc - 32-bit does not allow it, whereas 64-bit does.
As Rob pointed out, VirtualAlloc for Windows is a good option for this, as is an anonymous file mapping. However, specifically with respect to your question of whether C or C++ can make the allocation, the answer is NO, THIS IS NOT SUPPORTED, EVEN ON WIN7 RC 64.
In the PE/COFF specification for exe files, the field which specifies the heap reserve and heap commit is a 32-bit quantity. This is in line with the physical size limitations of the current heap implementation in the Windows CRT, which is just short of 4GB. So there is no way to allocate more than 4GB from C/C++ (technically the OS support facilities of CreateFileMapping and VirtualAlloc/VirtualAllocExNuma etc. are not C or C++).
Also, BE AWARE that there are underlying x86/amd64 ABI constructs known as page tables. These will in effect do what you are concerned about, allocating smaller chunks for your larger request; even though this happens in kernel memory, there is an effect on the overall system, as these tables are finite.
If you are allocating memory in such grandiose proportions, you would be well advised to allocate based on the allocation granularity (which VirtualAlloc enforces) and also to identify optional flags or methods to enable larger pages.
4 KB pages were the initial page size for the 386; the Pentium subsequently added 4 MB pages. Today, AMD64 processors (see the Software Optimization Guide for AMD Family 10h Processors) have a maximum page table entry size of 1 GB. This means that for your case here, say you just did 4GB, it would require only 4 unique entries in the kernel's directory to locate, assign, and permission your process's memory.
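A hedged sketch of that larger-pages route on Windows (the caller needs the "Lock pages in memory" privilege, SeLockMemoryPrivilege, and the size must be a multiple of GetLargePageMinimum(); the 64x multiplier is illustrative):

#include <windows.h>
#include <cstdio>

int main()
{
    SIZE_T large = GetLargePageMinimum();        // 0 if unsupported
    if (large == 0) return 1;
    SIZE_T bytes = 64 * large;                   // illustrative size
    void *p = VirtualAlloc(nullptr, bytes,
                           MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                           PAGE_READWRITE);
    if (!p) {
        std::printf("large-page alloc failed: %lu\n", GetLastError());
        return 1;
    }
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}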
Microsoft has also released a manual that articulates some of the finer points of application memory and its use, for the Vista/2008 platform and newer.
(Its table of contents spans the Memory Manager, virtual address space, kernel-mode stacks, ASLR, I/O bandwidth and SuperFetch, coordination with the cache manager, large-file management, NUMA support, paging and scalability, large pages, virtual machines, system integrity, and recommendations for hardware manufacturers, driver developers, application developers, and system administrators.)
If size_t is greater than 32 bits on your system, you've cleared the first hurdle. But the C and C++ standards aren't responsible for determining whether any particular call to new or malloc succeeds (except malloc with a 0 size). That depends entirely on the OS and the current state of the heap.
Like everyone else said, getting a 64-bit machine is the way to go. But even on a 32-bit Intel machine, you can address more than 4GB of memory if your OS and your CPU support PAE. Unfortunately, 32-bit WinXP does not do this (does 32-bit Vista?). Linux lets you do this by default, but you will be limited to 4GB regions, even with mmap(), since pointers are still 32-bit.
What you should do though, is let the operating system take care of the memory management for you. Get in an environment that can handle that much RAM, then read the XML file(s) into (a) data structure(s), and let it allocate the space for you. Then operate on the data structure in memory, instead of operating on the XML file itself.
Even on 64-bit systems, though, you're not going to have a lot of control over which portions of your program actually sit in RAM, in cache, or are paged to disk, at least in most instances, since the OS and the MMU handle this themselves.