How to extend ESP32 heap size? - c++

I'm writing a code about play gif from SDCard on TFT Screen, so I create a array to put the gif file. (Using Nodemcu-32s 4MB)
#include<TFT_eSPI.h>
#include<SPI.h>
#include<AnimatedGIF.h>
TFT_eSPI tft;
AnimatedGIF gif;
uint8_t *gifArray;
int gifArrayLen;
void setup(){
tft.init();
tft.setRotation(2);
tft.fillScreen(TFT_BLACK);
File fgif=SD.open("/test.gif",FILE_READ);
gifArrayLen=fgif.size();
gifArray=(uint8_t *)malloc(gifArrayLen);
for(int i=0;i<gifArrayLen;i++) gifArray[i]=fgif.read();
fgif.close();
}
void loop(){
tft.startWrite();
gif.open(gifArray,gifArrayLen*sizeof(uint8_t),GIFDraw);
while(gif.playFrame(true,NULL)) yield();
gif.close();
tft.endWrite();
}
But if gif size > 131KB, it will trigger fatal error like this.
Guru Meditation Error: Core 1 panic'ed (StoreProhibited). Exception was unhandled.
After malloc the array, when I set a value on it, it triggered.
I found some forum says it's because it over the FreeRTOS heap size.
Can I extend the heap size or use another storage method to replace it?

The "4MB" in NodeMCU refers to the size of flash, the size of RAM on ESP32 is fixed at 512KB, roughly 200KB of which is used by IRAM cache/code sections, leaving around 320KB for program memory, half of which is available for dynamic allocation.
From documentation Heap Memory - Available Heap:
Due to a technical limitation, the maximum statically allocated DRAM
usage is 160KB. The remaining 160KB (for a total of 320KB of DRAM) can
only be allocated at runtime as heap.
There is a way to connect more RAM via SPI, but it's going to be very slow. You might want to look into a larger SoC with more RAM instead.

The heap is the pool of memory available to your program to allocate storage.
You're using an ESP32, which is a small CPU that has a small amount of RAM, roughly 400KB.
You cannot extend the heap on the ESP32. Unlike Linux which has virtual memory, there's nothing to extend it with.
However, you can add additional storage in the form of external PSRAM - additional RAM that is accessed via SPI. PSRAM will be slower than internal RAM and will have some restrictions on its use (for instance, call stacks cannot live in PSRAM and certain DMA buffers cannot be located there).
The ESP32's manufacturer, Espressif, has documented how to use PSRAM with the ESP32.
You'll generally (maybe exclusively) find PSRAM available built onto ESP32 boards, so if you're using an ESP32 module that doesn't already have it you'll need a new ESP32 module that does.
And as other people have pointed out, you can artificially cause less memory to be available in the heap by over-allocating and freeing memory, fragmenting the heap so that there are no large pieces still available. This is called heap fragmentation and has been written about extensively here on Stack Overflow and other web sites.

You can use IRAM to add more space to heap. Usually, the DRAM comes from SRAM1 (128 KB) and SRAM2 (200 KB) with a total of 328 KB. But because ROM uses 8 KB for their function in RAM, the total available DRAM is only 320 KB. IRAM (from SRAM0) is usually used for code not data. But since not all space is used for code, the remaining byte is available to data. If you use arduino framework, usually there will be about 72 KB available to use as heap. But the access must be 32-bit aligned, such as array of int or pointer. See this links for further information.
https://demo-dijiudu.readthedocs.io/en/stable/api-reference/system/mem_alloc.html
https://kenny-peng.com/2021/08/23/esp32_iram.html

Related

Can I create an array exceeding RAM, if I have enough swap memory?

Let's say I have 8 Gigabytes of RAM and 16 Gigabytes of swap memory. Can I allocate a 20 Gigabyte array there in C? If yes, how is it possible? What would that memory layout look like?
[linux] Can I create an array exceeding RAM, if I have enough swap memory?
Yes, you can. Note that accessing swap is veerry slooww.
how is it possible
Allocate dynamic memory. The operating system handles the rest.
How would that memory layout look like?
On an amd64 system, you can have 256 TiB of address space. You can easily fit a contiguous block of 8 GiB in that space. The operating system divides the virtual memory into pages and copies the pages between physical memory and swap space as needed.
Modern operating systems use virtual memory. In Linux and most other OSes rach process has it's own address space according to the abilities of the architecture. You can check the size of the virtual address space in /proc/cpuinfo. For example you may see:
address sizes : 43 bits physical, 48 bits virtual
This means that virtual addresses use 48 bit. Half of that is reserved for the kernel so you only can use 47 bit, or 128TiB. Any memory you allocate will be placed somewhere in those 128 TiB of address space as if you actually had that much memory.
Linux uses demand page loading and per default over commits memory. When you say
char *mem = (char*)malloc(1'000'000'000'000);
what happens is that Linux picks a suitable address and just records that you have allocated 1'000'000'000'000 (rounded up to the nearest page) of memory starting at that point. (It does some sanity check that the amount isn't totally bonkers depending on the amount of physical memory that is free, the amount of swap that is free and the overcommit setting. Per default you can allocate a lot more than you have memory and swap.)
Note that at this point no physical memory and no swap space is connected to your allocated block at all. This changes when you first write to the memory:
mem[4096] = 0;
At this point the program will page fault. Linux checks the address is actually something your program is allowed to write to, finds a physical page and map it to &mem[4096]. Then it lets the program retry to write there and everything continues.
If Linux can't find a physical page it will try to swap something out to make a physical page available for your programm. If that also fails your program will receive a SIGSEGV and likely die.
As a result you can allocate basically unlimited amounts of memory as long as you never write to more than the physical memory and swap and support. On the other hand if you initialize the memory (explicitly or implicitly using calloc()) the system will quickly notice if you try to use more than available.
You can, but not with a simple malloc. It's platform-dependent.
It requires an OS call to allocate swapable memory (it's VirtualAlloc on Windows, for example, on Linux it should be mmap and related functions).
Once it's done, the allocated memory is divided into pages, contiguous blocks of fixed size. You can lock a page, therefore it will be loaded in RAM and you can read and modify it freely. For old dinosaurs like me, it's exactly how EMS memory worked under DOS... You address your swappable memory with a kind of segment:offset method: first, you divide your linear address by the page size to find which page is needed, then you use the remainder to get the offset within this page.
Once unlocked, the page remains in memory until the OS needs memory: then, an unlocked page will be flushed to disk, in swap, and discarded in RAM... Until you lock (and load...) it again, but this operation may requires to free RAM, therefore another process may have its unlocked pages swapped BEFORE your own page is loaded again. And this is damnly SLOOOOOOW... Even on a SSD!
So, it's not always a good thing to use swap. A better way is to use memory mapped files - perfect for reading very big files mostly sequentially, with few random accesses - if it can suits your needs.

Allocating aligned memory for larger arrays

In my program I want to allocate 32 byte aligned memory to use SSE/AVX. The amount I want to allocate is somewhere around 2000*1300*17*17*4(large data set). I tried using functions _aligned_malloc() and _mm_malloc but for larger sizes it doesn't allocate memory and results in a access violation exception. If the amount allocated is small like around 512*320*4*17*17(small data set) then the code work fine.
Here these functions return a null pointer when allocation is done for large data set.But works fine when input data size is small. Also here if I just use unaligned memory allocation using new then code works fine for large data set too.
Finally Can someone tell me Is there any significant performance gains in using aligned memory for AVX.
Edit: After some research according to this post it says that new allocate memory from free store and malloc() allocate memory from heap. Here I am exceeding maximum heap size as _aligned_malloc() return errno 12 which means ENOMEM in that case Can someone tell me a work around for this.
On memory allocation:
I seems you are actually trying to alocate 2000*1300*17*17*4 32 bytes elements. This is means you are trying to allocate 96 GB while your system has only 12 GB memory.
Since new is working but malloc not it seems your local implementation of new seems to be able to allocate huge amounts of virtual memory. Malloc allocates from the heap which means it is usally limited to the physical amount of memory you've got. That's the reason it fails.
As the dataset is bigger than your main memory you might want to allocate the memory using mmap which maps a file into virtual memory making it accessable as if it was in physical memory (but it will only partially be cached in memory). I'm not sure if it's guaranteed but mmap usally aligns on optimal page size boundary (almost always 4096 byte).
Anyway you will have a huge performance loss due to the fact that your disk is way slower than your RAM. This is so serious that using AVX will probably not speed up anything at all.
On the performance loss of using unaligned memory:
On modern hardware (say Intel's Haswell onwards I think) this depends on your access patterns. Unaligned access should have almost no performance overhead on iterating over the array in memory order (each cache line will still be loaded only once). If you access it in random order than you will often cross the 64 byte cache line boundry. This means your processor will have to load 2 lines into cache and remove 2 lines from the cache instead of only one. While this might be a serious problem for some situations in your case the disk will slows things down so much that you will barely notice this.
Addtional tips (or a shot in the dark):
The way you gave the size of the array (2000*1300*17*17*4) suggests that you are using a multidimensional array (e.g. auto x = new __m256[2000][1300][17][17][4]). So some tipps on that:
Iterate through it mostly sequential
Check if it is sparse (meaning some of the memory will never be accessed) and shrink it if possible.
You could try to flatten the array and do more complex index calculation yourself in order to reduce the amount of memory need. If you get it to fit completely into your RAM you can start to optimise your code (using AVX and/or aligned memory).
"Total paging file size for all drives is 15247MB" suggests that you actually using only parts of that 96 GB so there might be a way to further reduce your usage.
In that case you might also want to ask another question on how to reduce the memory usage with more info on what you are doing.

Apparent CUDA magic

I'm using CUDA (in reality I'm using pyCUDA if the difference matters) and performing some computation over arrays. I'm launching a kernel with a grid of 320*600 threads. Inside the kernel I'm declaring two linear arrays of 20000 components using:
float test[20000]
float test2[20000]
With these arrays I perform simple calculations, like for example filling them with constant values. The point is that the kernel executes normally and perform correctly the computations (you can see this filling an array with a random component of test and sending that array to host from device).
The problem is that my NVIDIA card has only 2GB of memory and the total amount of memory to allocate the arrays test and test2 is 320*600*20000*4 bytes that is much more than 2GB.
Where is this memory coming from? and how can CUDA perform the computation in every thread?
Thank you for your time
The actual sizing of the local/stack memory requirements is not as you suppose (for the entire grid, all at once) but is actually based on a formula described by #njuffa here.
Basically, the local/stack memory require is sized based on the maximum instantaneous capacity of the device you are running on, rather than the size of the grid.
Based on the information provided by njuffa, the available stack size limit (per thread) is the lesser of:
The maximum local memory size (512KB for cc2.x and higher)
available GPU memory/(#of SMs)/(max threads per SM)
For your first case:
float test[20000];
float test2[20000];
That total is 160KB (per thread) so we are under the maximum limit of 512KB per thread. What about the 2nd limit?
GTX 650m has 2 cc 3.0 (kepler) SMs (each Kepler SM has 192 cores). Therefore, the second limit above gives, if all the GPU memory were available:
2GB/2/2048 = 512KB
(kepler has 2048 max threads per multiprocessor)
so it is the same limit in this case. But this assumes all the GPU memory is available.
Since you're suggesting in the comments that this configuration fails:
float test[40000];
float test2[40000];
i.e. 320KB, I would conclude that your actual available GPU memory is at the point of this bulk allocation attempt is somewhere above (160/512)*100% i.e. above 31% but below (320/512)*100% i.e. below 62.5% of 2GB, so I would conclude that your available GPU memory at the time of this bulk allocation request for the stack frame would be something less than 1.25GB.
You could try to see if this is the case by calling cudaGetMemInfo right before the kernel launch in question (although I don't know how to do this in pycuda). Even though your GPU starts out with 2GB, if you are running the display from it, you are likely starting with a number closer to 1.5GB. And dynamic (e.g. cudaMalloc) and or static (e.g. __device__) allocations that occur prior to this bulk allocation request at kernel launch, will all impact available memory.
This is all to explain some of the specifics. The general answer to your question is that the "magic" arises due to the fact that the GPU does not necessarily allocate the stack frame and local memory for all threads in the grid, all at once. It need only allocate what is required for the maximum instantaneous capacity of the device (i.e. SMs * max threads per SM), which may be a number that is significantly less than what would be required for the whole grid.

Create too large array in C++, how to solve?

Recently, I work in C++ and I have to create a array[60.000][60.000]. However, i cannot create this array because it's too large. I tried float **array or even static float array but nothing is good. Does anyone have an ideas?
Thanks for your helps!
A matrix of size 60,000 x 60,000 has 3,600,000,000 elements.
You're using type float so it becomes:
60,000 x 60,000 * 4 bytes = 14,400,000,000 bytes ~= 13.4 GB
Do you even have that much memory in your machine?
Note that the issue of stack vs heap doesn't even matter unless you have enough memory to begin with.
Here's a list of possible problems:
You don't have enough memory.
If the matrix is declared globally, you'll exceed the maximum size of the binary.
If the matrix is declared as a local array, then you will blow your stack.
If you're compiling for 32-bit, you have far exceeded the 2GB/4GB addressing limit.
Does "60.000" actually mean "60000"? If so, the size of the required memory is 60000 * 60000 * sizeof(float), which is roughly 13.4 GB. A typical 32-bit process is limited to only 2 GB, so it is clear why it doesn't fit.
On the other hand, I don't see why you shouldn't be able to fit that into a 64-bit process, assuming your machine has enough RAM.
Allocate the memory at runtime -- consider using a memory mapped file as the backing. Like everyone says, 14 gigs is a lot of memory. But it's not unreasonable to find a computer with 14GB of memory, nor is it unreasonable to page the memory as necessary.
With a matrix of this size, you will likely become very curious about memory access performance. Remember to consider the cache grain of your target architecture and if your target has a TLB you may be able to use larger pages to relieve some TLB pressure. Then again, if you don't have enough memory you'll likely care only about how fast your storage I/O is.
If it's not already obvious, you'll need an architecture that supports a 64-bit address space in order to access this memory directly/conveniently.
To initialise the 2D array of floats that you want, you will need:
60000 * 60000 * 4 bytes = 14400000000 bytes
Which is approximately 14GB of memory. That's a LOT of memory. To even hold that theoretically, you will need to be running a 64bit machine, not to mention one with quite a bit of RAM installed.
Furthermore, allocating this much memory is almost never necessary in most situations, are you sure no optimisations could be made here?
EDIT:
In light of new information from your comments on other answers: You only have 4GB memory (RAM). Your operating system is hence going to have to page at least 9GB on the Hard Drive, in reality probably more. But you also only have 20GB of Hard Drive space. This is barely enough to page all that data, especially if the disk is fragmented. Finally, (I could be wrong because you haven't stated explicitly) it is quite possible that you're running a 32bit machine. This isn't really capable of handling more than 4GB of memory at a time.
I had this problem too. I did a workaround where I chopped the array into sections (my biggest allowed array was float A_sub_matrix_20[62944560]). When I declared just one of these in main(), it seems to be put in RAM as I got a runtime exception as soon as main() starts. I was able to declare 20 buffers of that size as global variables which works (looks like in global form they are stored on the HDD - when I added A_sub_matrix_20[n] to the watch list in VisualStudio it gave a message "reading from file").

what could be reason for virtual bytes to grow 2x private bytes?

An application's virtual bytes grow 2-times the private bytes.
does this indicate memory leak? bad application design?
OS is 32Bit
any thoughts are welcome.
application is stream database.
Fragmentation.
If you allocate the following chunks of memory:
16KB
8KB
16KB
and you then free the chunk of 8KB, your application will have 32 KB of private bytes, but 40 KB bytes of virtual memory, which is actually the highest virtual memory address that has ever been in use by your process (ignoring the other memory parts for sake of simplicity).
Consider (if possible) using another memory manager. Some alternatives are:
The Windows Low-fragementation heap (see http://msdn.microsoft.com/en-us/library/aa366750%28VS.85%29.aspx for more info)
The Doug-Lea open source memory manager
Commercial alternatives like Hoard
A fourth alternative is to write your own memory manager. It's not that easy, but if done right, it can have quite some benefits. Especially for certain niche or special applications, writing your own memory manager can be useful.
An application's virtual bytes grow 2-times the private bytes.
If application allocates only heap, then to me it would be the sign that application allocates lots of memory but never actually touches it. For example:
void *p = malloc( 16u<<20 );
would eat up 16MB of virtual memory. But as long as application doesn't perform any actions with the memory block, OS wouldn't even attempt to map the virtual memory to the RAM. Simplest way to force the actual allocation of private memory is to memset() it:
void *p = malloc( 16u<<20 );
memset( p, 0, 16u<<20 );
does this indicate memory leak? bad application design?
Or both. Or neither.
The longer variant of the response: unknown, depends on what memory application allocates, what other resources application uses, OS, h/w platform, etc.
If unsure, use a memory leak analysis tools to investigate, e.g. valgrind. Read up SO for more information on memory leak analysis in C++.
Memory allocation has overhead to store management information about what was allocated. If you're allocating very small buffers the extra information can be a significant percentage of the total. That might be what you're seeing.
One possibility is if you set a large stack reserve size for your threads with linker option /STACK:reserve_bytes and then you start a lot of threads.
For example, if you have an ATL service, it automatically starts 4*numberOfCores apartment message dispatching threads by default. Compile and link such a service with /STACK:12000000 (12 megabytes), then run it on a 16-core server and it will start 64 threads, each with a 12MB stack, immediately consuming 768MB of virtual address space, although the actual committed memory may be much lower.