I am running FreeRTOS on an STM32F103 and using IAR Embedded Workbench. I am trying to understand the relationship between the general stack size set by the linker and the stack size of each task in FreeRTOS. For instance, when FreeRTOS creates a task, does it use the stack defined by the linker, or does it allocate its own stack from free RAM? I am trying to determine the stack sizes for the project. I understand that I can use the high water mark function (uxTaskGetStackHighWaterMark) to check the stack usage of each task. Thoughts?
First of all, you have to understand that when you create a task in FreeRTOS, the memory for that task (its TCB and stack) is allocated on the FreeRTOS heap, whose size is defined in FreeRTOSConfig.h (configTOTAL_HEAP_SIZE).
The linker heap is the C library heap, not the FreeRTOS heap. The linker stack is normally only used for your startup code, and in some ports, for the interrupt stack. It is NOT used by any of the tasks.
For example, imagine that you have a FreeRTOS-based system with one custom thread called DEMO. Your heap layout might look like the diagram below. The most important lesson from this picture is that each task's stack is unrelated to the stack defined in the linker. Task stacks are allocated on the FreeRTOS heap, which, as already mentioned, is unrelated to the heap defined in the linker.
Example Heap Layout:
+-------------------+ <----------+
|                   |            |
| FREE HEAP MEMORY  |        FREE SPACE
|                   |            |
+-------------------+ <----------+
| TIMER TASK TCB    |            |
+-------------------+            |
| TIMER TASK STACK  |            |
+-------------------+            |
| IDLE TASK TCB     |            |
+-------------------+            |
| IDLE TASK STACK   |      ALLOCATED SPACE
+-------------------+            |
| DEMO TASK TCB     |            |
+-------------------+            |
| DEMO TASK STACK   |            |
+-------------------+            |
| MUTEXES, SETS ETC.|            |
+-------------------+ <----------+
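To make the sizing question concrete, here is a rough budgeting sketch. It is host-side C++, not FreeRTOS code. It assumes a 32-bit Cortex-M port where one stack word is 4 bytes, and it uses a made-up TCB size: kAssumedTcbBytes, task_heap_bytes and tasks_fit_in_heap are hypothetical names, and the real TCB size depends on the FreeRTOS version and configuration. The one firm point it relies on is that the stack depth passed to xTaskCreate() is counted in words, not bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Assumption: 32-bit Cortex-M port, where one stack word (StackType_t) is 4 bytes.
constexpr std::size_t kStackWordBytes = 4;

// Hypothetical per-task overhead for the TCB. The real size depends on the
// FreeRTOS version and on the options enabled in FreeRTOSConfig.h.
constexpr std::size_t kAssumedTcbBytes = 96;

// Rough number of FreeRTOS heap bytes consumed by one dynamically created task.
// stack_depth_words is the value that would be passed to xTaskCreate(), which
// is counted in words. Allocator bookkeeping is ignored.
constexpr std::size_t task_heap_bytes(std::size_t stack_depth_words) {
    return stack_depth_words * kStackWordBytes + kAssumedTcbBytes;
}

// Sums the estimate over several tasks and compares it with the FreeRTOS heap
// size (configTOTAL_HEAP_SIZE). The linker stack plays no part in this sum.
inline bool tasks_fit_in_heap(const std::size_t* depths_words, std::size_t count,
                              std::size_t total_heap_bytes) {
    assert(depths_words != nullptr || count == 0);
    std::size_t needed = 0;
    for (std::size_t i = 0; i < count; ++i) {
        needed += task_heap_bytes(depths_words[i]);
    }
    std::printf("estimated task memory: %zu of %zu bytes\n", needed, total_heap_bytes);
    return needed <= total_heap_bytes;
}
```

With these assumptions, three tasks with depths of 128, 128 and 256 words come to an estimated 2336 bytes, which has to fit in the FreeRTOS heap alongside queues, mutexes and so on. At run time, uxTaskGetStackHighWaterMark() can then show how much of each task's stack was never used.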
I am implementing CNN inference on Vulkan and would like some insights and pointers to terminology or papers to study for the best parallelism. (Some open-source frameworks exist, such as NCNN, but it has not been optimised for Vulkan, with which I hope to harness even a fraction of the theoretical computing power of modern mobile GPUs.)
The parallelism or dependencies can be modelled on a high level as
                   Layers 1..50
A--------------+   A1------+   A50-----+
| Input Image  |-->|       |-->|output |
|              |   |       |   +-------+
|              |   +-------+
+--------------+
B----------+   B1-----+   B50--+
|downscaled|-->|      |-->|    |
|image / 2 |   +------+   +----+
+----------+
C----+
|/16 |-->...-> [C50] = output
+----+
I could probably make a command buffer containing the sequence
Convolve(A1), Convolve(B1), Convolve(C1), Barrier,
Convolve(A2), Convolve(B2), Convolve(C2), Barrier,
...
MaxPool(A3), MaxPool(B3), MaxPool(C3), Barrier, ...
where there would be a barrier or forced synchronisation point after every layer. The different input scales would execute in parallel, but would probably not benefit that much from the parallelism, i.e. most GPU resources would probably be idle.
Case 2 -- more detailed parallelism
+-A----------------------+   +-C-----------------+
| Layer N, rows 0..5     |-->| Layer N+1         |   C has dependency on A
|                        |   | rows 0..3         |   D has dependency on A & B
|                        |\  +-D-----------------+   E has dependency on B
+-B----------------------+ ->| Layer N+1         |
| Layer N, rows 6..11    |/  | rows 4..7         |   G has dependency on F
|                        |   +-E-----------------+
|                        |-->| rows 8..11        |
+------------------------+   +-------------------+
+-F-----------+    +-G-----------+
| layer N,    |--->| layer N+1   |
| scale /2    |    | scale/2     |
+-------------+    +-------------+
The reason for the dependency is that most of the convolutional layers are Nx3x3xC convolutions, which, in order to compute row y, require rows y-1..y+1 of the previous layer to have been computed. (Ultimately the dependencies could also be split horizontally, down to each pixel x,y,C on layer N requiring the pixels x-1..x+1, y-1..y+1, 0..C-1 of the previous layer N-1, but that would most likely result in more synchronisation than computation...)
There is of course another dimension of parallelism (if the Nx3x3xC convolution were split into N parallel tasks), but it coexists with the spatial xy-domain, and exploiting it requires a planar memory layout, float tensor[N][H][W], instead of float tensor[H][W][N] or a variation of that such as vecKf tensor[H][W][N].
In my C++ implementation, each of the subsections/consumers contains a std::atomic<T> counter initialised to the number of its producers; each producer atomically decrements the counters of all its consumers and, when a counter reaches zero, spawns a task in a thread pool to process that consumer. I suspect that in Vulkan (1.1) I have no such opportunity, and that I would instead need to create a list of binary semaphores to signal and a list of binary semaphores to wait on.
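For reference, the CPU-side counter scheme can be sketched roughly as follows. This is a minimal, single-threaded illustration of the bookkeeping only, not the actual implementation: Tile, make_graph and finish are hypothetical names, and where a real version would enqueue a ready consumer on a thread pool, this sketch simply returns it.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One unit of work, e.g. a band of rows in a layer. Illustrative names only.
struct Tile {
    std::atomic<int> pending{0};         // producers that have not finished yet
    std::vector<std::size_t> consumers;  // indices of tiles that depend on this one
};

// Builds the dependency graph from (producer, consumer) edges.
inline std::vector<Tile> make_graph(
        std::size_t n,
        const std::vector<std::pair<std::size_t, std::size_t>>& edges) {
    std::vector<Tile> tiles(n);
    for (const auto& e : edges) {
        tiles[e.first].consumers.push_back(e.second);
        tiles[e.second].pending.fetch_add(1, std::memory_order_relaxed);
    }
    return tiles;
}

// Called when tile `done` has been computed. Decrements the counter of each
// consumer and returns the consumers that just became runnable.
inline std::vector<std::size_t> finish(std::vector<Tile>& tiles, std::size_t done) {
    std::vector<std::size_t> ready;
    for (std::size_t c : tiles[done].consumers) {
        // fetch_sub returns the previous value: 1 means this was the last producer.
        if (tiles[c].pending.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            ready.push_back(c);
        }
    }
    return ready;
}
```

Using the Case 2 graph with A..E numbered 0..4 and edges A->C, A->D, B->D, B->E, finishing A releases only C, and finishing B then releases D and E.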
So the question: are these valid approaches, and what are the actual primitives to call in Vulkan compute shaders? (Most of the tutorials I've seen are concerned with graphics pipelines.) I would not want to rely on any new features in Vulkan 1.2 or later.
If you have X11 running on a GPU like so:
Fri Aug  2 23:52:39 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.30       Driver Version: 430.30       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   28C    P8    14W / 150W |    141MiB /  7618MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3255      G   /usr/lib/xorg/Xorg                            57MiB |
|    0      3286      G   /usr/bin/gnome-shell                          81MiB |
+-----------------------------------------------------------------------------+
If you run XShmGetImage(), does it give you a pointer to a memory address in GPU memory or host memory?
If it is GPU memory, I assume I can run other operations on the NVIDIA card against it, such as H.264-encoding that data.
Is there a way to copy the memory from one GPU memory block to a different GPU memory block?
I am using NVENC libraries.
Reading the MIT Shared Memory extension's documentation:
The next step is to create the shared memory segment. This is best
done after the creation of the XImage, since you need to make use of
the information in that XImage to know how much memory to allocate. To
create the segment, you need a call like:
shminfo.shmid = shmget(IPC_PRIVATE, image->bytes_per_line * image->height, IPC_CREAT|0777);
This implies the extension regards "shared memory" as "that which is returned by shmget or equivalent". Since shmget is incapable of allocating GPU memory, my answer is that the XImage is in host memory, not device memory.
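A small sketch of that reasoning, without any X11 involved: it creates a System V segment with the same kind of shmget call, attaches it, and touches it with ordinary CPU stores and loads. The function name shm_segment_is_host_memory is hypothetical, and the 0600 permission mode is my choice rather than the 0777 used in the documentation. It only illustrates that such a segment is plain CPU-addressable memory; it says nothing about what the X server does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <sys/ipc.h>
#include <sys/shm.h>

// Creates a private System V shared memory segment, attaches it, and checks
// that it can be written and read with normal CPU instructions.
// Returns true if every step succeeded.
inline bool shm_segment_is_host_memory(std::size_t bytes) {
    assert(bytes > 0);
    int shmid = shmget(IPC_PRIVATE, bytes, IPC_CREAT | 0600);
    if (shmid == -1) return false;

    void* addr = shmat(shmid, nullptr, 0);
    // Mark the segment for removal now; it is destroyed after the last detach.
    shmctl(shmid, IPC_RMID, nullptr);
    if (addr == reinterpret_cast<void*>(-1)) return false;

    // Plain memset and a plain load: no GPU API is needed to reach this memory.
    std::memset(addr, 0xAB, bytes);
    bool ok = static_cast<unsigned char*>(addr)[bytes - 1] == 0xAB;

    shmdt(addr);
    return ok;
}
```

In the MIT-SHM flow, the attached address is what ends up in shminfo.shmaddr and image->data, which is why the pixels returned by XShmGetImage() are read through a normal host pointer.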
I'm trying to achieve the results of the following command that lists all programmable wake devices, or those that can be set/reset to wake the system:
powercfg -devicequery wake_programmable
I need to do the same from a C++ service. I'm using code similar to the following, but it gives me a smaller list. Here's how I call DevicePowerEnumDevices:
if (DevicePowerEnumDevices(index,
        DEVICEPOWER_FILTER_DEVICES_PRESENT,
        PDCAP_WAKE_FROM_D0_SUPPORTED |
        PDCAP_WAKE_FROM_D1_SUPPORTED |
        PDCAP_WAKE_FROM_D2_SUPPORTED |
        PDCAP_WAKE_FROM_D3_SUPPORTED |
        PDCAP_WAKE_FROM_S0_SUPPORTED |
        PDCAP_WAKE_FROM_S1_SUPPORTED |
        PDCAP_WAKE_FROM_S2_SUPPORTED |
        PDCAP_WAKE_FROM_S3_SUPPORTED,
        buff, &dwBuffSize))
{
    // Got it
}
What flags am I missing for wake_programmable?
Suppose I need to read many distinct, independent chunks of data from the same file saved on disk.
Is it possible to multi-thread this upload?
Related: Do all threads on the same processor use the same IO device to read from disk? In this case, multi-threading would not speed up the upload at all - the threads would just be waiting in line.
(I am currently multi-threading with OpenMP.)
Yes, it is possible. However:
Do all threads on the same processor use the same IO device to read from disk?
Yes. The read head on the disk. As an example, try copying two files in parallel as opposed to in series. It will take significantly longer in parallel, because the OS uses scheduling algorithms to make sure the IO rate is "fair," or equal between the two threads/processes. Because of this, the read head will jump back and forth between different parts of the disk, slowing the process down A LOT. The time to actually read the data is pretty small compared to the time to seek to it, and when you're reading two different parts of the disk at once, you spend most of the time seeking.
Note that all of this assumes you're using a hard disk. On an SSD, parallel reads will not be slower; according to the comments, they are actually faster. With RAID the situation becomes more complicated and (obviously) depends on what kind of RAID you're using.
This is what it looks like (I've unwrapped the circular disk into a rectangle because ascii circles are hard, and simplified the data layout to make it easier to read):
Assume the files are separated by some space on the platter like so:
|         |
A series read will look like (* indicates reading)
space ----->
|        *|  t
|        *|  i
|        *|  m
|        *|  e
|        *|  |
|       / |  |
|     /   |  |
|   /     |  V
| /       |
|*        |
|*        |
|*        |
|*        |
While a parallel read will look like
|       \ |
|        *|
|       / |
|     /   |
|   /     |
| /       |
|*        |
| \       |
|   \     |
|     \   |
|       \ |
|        *|
|       / |
|     /   |
|   /     |
| /       |
|*        |
| \       |
|   \     |
|     \   |
|       \ |
|        *|
etc
If you're doing this on Windows you might want to look into the ReadFileScatter function. It reads a contiguous range of a file into multiple buffers in a single asynchronous call; issuing several such overlapped requests lets the OS better manage the file I/O bottleneck and, hopefully, optimize the reads.
The matching write call on Windows would be WriteFileGather.
For UNIX you're looking at readv and writev to do the same thing.
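As a minimal sketch of the UNIX side, the function below uses readv to fill two buffers with one system call. Note that readv scatters one contiguous stretch of the file, starting at the current file offset, across the buffers in order; it does not jump between distant offsets. The name read_into_two is hypothetical, and the sizes of the two strings passed in decide where the split falls.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

// Reads first.size() + second.size() bytes from the start of the file and
// scatters them into the two strings. Returns the total number of bytes read,
// or -1 on error.
inline ssize_t read_into_two(const char* path, std::string& first, std::string& second) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;

    iovec iov[2];
    iov[0].iov_base = &first[0];
    iov[0].iov_len = first.size();
    iov[1].iov_base = &second[0];
    iov[1].iov_len = second.size();

    // One system call fills both buffers, in array order.
    ssize_t n = readv(fd, iov, 2);
    close(fd);
    return n;
}
```

For chunks that live at unrelated offsets, each chunk would still need its own positioned request (for example one preadv call per chunk).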
As mentioned in the other answers, a parallel read may be slower depending on how the file is physically stored on disk: if the head has to move a significant distance, it can cause an actual slowdown. That said, there are storage systems that support multiple simultaneous reads and writes efficiently. The simplest one I can think of is an SSD. I myself have worked with magnificent storage systems from IBM that could perform simultaneous reads and writes with no slowdown.
So let's assume you have such a file system and physical storage which will not slow down on parallel reads.
In that case parallel reads are very logical. In general there are two ways to achieve that:
1. If you want to use the standard C/C++ library to perform the I/O, then the only option you have is to keep one open file handle (descriptor) per thread. This is because the file pointer (which marks where in the file the next read or write happens) is kept per handle. So if you try to read simultaneously from the same file handle, you will have no way of knowing what you are actually reading.
2. Use a platform-specific API to perform asynchronous I/O. On Windows you use the WinAPI functions with what is called OVERLAPPED I/O. On Unix/Linux you have POSIX AIO, although I understand that its use is discouraged; I haven't seen a satisfactory explanation as to why.
I myself have implemented the fd-per-thread approach on both Linux and Windows, and the OVERLAPPED approach on Windows. Both work great.
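The handle-per-thread approach can be sketched with the standard library alone. This is an illustrative outline under simple assumptions (one thread per chunk, no error reporting beyond an empty result); Chunk and read_chunks are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

struct Chunk {
    std::streamoff offset;  // where the chunk starts in the file
    std::size_t length;     // how many bytes to read
};

// Reads every chunk on its own thread. Each thread opens its own stream, so
// each has a private file pointer and the seeks cannot interfere.
inline std::vector<std::string> read_chunks(const std::string& path,
                                            const std::vector<Chunk>& chunks) {
    std::vector<std::string> out(chunks.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        workers.emplace_back([&path, &chunks, &out, i] {
            std::ifstream in(path, std::ios::binary);
            if (!in) return;  // out[i] stays empty if the file cannot be opened
            in.seekg(chunks[i].offset);
            std::string buf(chunks[i].length, '\0');
            in.read(&buf[0], static_cast<std::streamsize>(buf.size()));
            buf.resize(static_cast<std::size_t>(in.gcount()));
            out[i] = std::move(buf);  // each thread writes only its own slot
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

Whether this is any faster than a sequential loop depends on the storage, as discussed above; on a single spinning disk the extra seeks may well make it slower.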
You won't be able to speed up the pure reading from or writing to disk. If you're calculating at the same time as you're writing, parallelizing will help. But the pure writing will be limited by the bandwidth of the link between processor and hard drive and, more notably, by the hard drive itself (my hard drive does 30 MB/s; I've heard about RAID setups serving 120 MB/s over the network, but don't rely on that).
Multiple reads from a disk should be thread-safe by design of the operating system: if you use the standard system functions, there's no need to lock manually. Open the files read-only, though, otherwise you'll get file access errors.
By the way, in practice you are not necessarily reading from the disk; the operating system decides where to serve you from. It typically prefetches the reads and serves them from memory.
I have an application that interacts with external devices using serial communication. There are two versions of the device differing in their implementations.
-->One is developed and tested by my team
-->The other version by a different team.
Since the other team has left, our team is looking after its maintenance. The other day, while testing the application, I noticed that it takes up 60 MB of memory at startup and, to my horror, its memory usage then grows in 200 KB chunks; in 60 hours it shoots up to 295 MB, though there is no slowdown in the responsiveness or usage of the application. I tested it again and again, and the same memory usage pattern repeated.
The application is written in C++ with Qt 4.2.1 on RHEL4.
I used mtrace to check for any memory leaks and it shows no such leaks. I then used valgrind memcheck tool, but the messages it gives are cryptic and not very conclusive, it shows leaks in graphical elements of Qt, which on scrutiny can be straightaway rejected.
I am in a fix as to what other tools/methodologies can be adopted to pinpoint the source of these memory leaks if any.
-->Also, in a larger context, how can we detect and debug presence of memory leaks in a C++ Qt application?
-->How can we check, how much memory a process uses in Linux?
I had used gnome-system-monitor and top command to check for memory used by the application, but I have heard that results given by above mentioned tools are not absolute.
EDIT:
I used ccmalloc for detecting memory leaks and this is the error report I got after I closed the application. During application execution, there were no error messages.
|ccmalloc report|
=======================================================
| total # of|   allocated | deallocated |     garbage |
+-----------+-------------+-------------+-------------+
|      bytes|   387325257 |   386229435 |     1095822 |
+-----------+-------------+-------------+-------------+
|allocations|     1232496 |     1201351 |       31145 |
+-----------------------------------------------------+
| number of checks: 1 |
| number of counts: 2434332 |
| retrieving function names for addresses ... done. |
| reading file info from gdb ... done. |
| sorting by number of not reclaimed bytes ... done. |
| number of call chains: 3 |
| number of ignored call chains: 0 |
| number of reported call chains: 3 |
| number of internal call chains: 3 |
| number of library call chains: 1 |
=======================================================
|
| 3.1% = 33.6 KB of garbage allocated in 47 allocations
| |
| | 0x???????? in
| |
| | 0x081ef2b6 in
| | at src/wrapper.c:489
| |
| | 0x081ef169 in <_realloc>
| | at src/wrapper.c:435
| |
| `-----> 0x081ef05c in
| at src/wrapper.c:318
|
| 0.8% = 8722 Bytes of garbage allocated in 35 allocations
| |
| | 0x???????? in
| |
| | 0x081ef134 in
| | at src/wrapper.c:422
| |
| `-----> 0x081ef05c in
| at src/wrapper.c:318
|
| 0.1% = 1144 Bytes of garbage allocated in 5 allocations
| |
| | 0x???????? in
| |
| | 0x081ef1cb in
| | at src/wrapper.c:455
| |
| `-----> 0x081ef05c in
| at src/wrapper.c:318
|
`------------------------------------------------------
free(0x09cb650c) after reporting
(This can happen with static destructors.
When linking put `ccmalloc.o' at the end (for gcc) or
in front of the list of object files.)
free(0x09cb68f4) after reporting
free(0x09cb68a4) after reporting
free(0x09cb6834) after reporting
free(0x09cb6814) after reporting
free(0x09cb67a4) after reporting
free(0x09cb6784) after reporting
free(0x09cb66cc) after reporting
free(0x09cb66ac) after reporting
free(0x09cb65e4) after reporting
free(0x09cb65c4) after reporting
free(0x09cb653c) after reporting
ccmalloc_report() called in non valid state
I have no clue what this means; it doesn't seem to indicate any memory leaks to me, though I may be wrong. Has anyone come across such a scenario?
Valgrind can be a bitch if you don't really read the manuals or whatever documentation is actually available (man page for starters) - but they are worth it.
Basically, you could start by running valgrind on your application with --gen-suppressions=all, then create a suppression for each block that originates from Qt itself, and use that suppression file to block those errors; you should then be left only with errors in your own code.
Also, you could try using valgrind through the Alleyoop front end if that makes things easier for you.
There are also bunch of other tools that can be used to detect memory leaks and Linux Journal has article about those here: http://www.linuxjournal.com/article/6556
And last, in some cases, static analysis tools can spot memory errors too.
I'd like to make the minor point that just because the memory used by a process is increasing, it does not follow that you have a memory leak. Take a word processor as an example: as you write text, the memory usage increases, but there is no leak. Most processes in fact increase their memory usage as they run, often until they reach some sort of near steady state, where objects being created are balanced by old objects being destroyed.
You said you tried Valgrind's memcheck tool; you should also try the massif tool, which should be able to graph the heap usage over time, and tell you where the memory was allocated from.
One of the reasons why top isn't too useful for measuring memory usage is that it doesn't take into account that memory is often shared between processes. For the best overview of where the process has allocated memory, I recommend using a recent Linux kernel and checking /proc/<pid>/smaps for your process. This shows what memory is mapped into that process and from where. For example, here's a snippet from konqueror on my system.
b732a000-b7a20000 r-xp 00000000 fd:05 205437     /usr/lib/qt3/lib/libqt-mt.so.3.3.8
Size:               7128 kB
Rss:                3456 kB
Pss:                 347 kB
Shared_Clean:       3452 kB
Shared_Dirty:          0 kB
Private_Clean:         4 kB
Private_Dirty:         0 kB
Referenced:         3452 kB
The important thing here is that, although the resident set resulting from the load of libqt-mt.so.3.3.8 is 3456kB, all but 4kB of that is shared between all processes which loaded the library, so it's a one-off system-wide cost. top doesn't expose this information, so just reading the RSS from top is misleading.
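As a rough sketch of how these per-mapping numbers can be totalled for a whole process, the function below sums one field across smaps-formatted text. The name sum_smaps_field_kb is hypothetical; the kernel writes the field names with underscores (for example Private_Dirty) and reports the values in kB.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Sums one field (e.g. "Rss", "Pss" or "Private_Dirty") over every mapping in
// smaps-formatted text and returns the total in kB.
inline long sum_smaps_field_kb(std::istream& smaps, const std::string& field) {
    const std::string prefix = field + ":";
    long total_kb = 0;
    std::string line;
    while (std::getline(smaps, line)) {
        // Field lines look like "Pss:                 347 kB".
        if (line.compare(0, prefix.size(), prefix) != 0) continue;
        std::istringstream rest(line.substr(prefix.size()));
        long kb = 0;
        if (rest >> kb) total_kb += kb;
    }
    return total_kb;
}
```

Passing it a std::ifstream opened on /proc/<pid>/smaps and comparing the totals for Rss and Pss gives a feel for how much of the resident set is really attributable to the process and how much is shared.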