FreeRTOS scheduling configurations for tasks

I have FreeRTOS currently running on my MicroZed board. I am using the Xilinx SDK as the software platform, and so far I have been able to create tasks and assign priorities.
I was curious to know whether it is possible to assign a fixed time slot to each of my tasks, so that, for example, after 100 milliseconds the scheduler would switch to the next task. In other words, is it possible to set a fixed execution time for each of my tasks? As far as I can tell there is no method for this, but if there is any way to implement it using the utilities of FreeRTOS, please let me know.

By default FreeRTOS will time-slice tasks of equal priority (see http://www.freertos.org/a00110.html#configUSE_TIME_SLICING), but there is nothing to guarantee that each task gets an equal share of the CPU. For example, interrupts use an unknown amount of processing time during each time slice, and higher-priority tasks can use part or all of a time slice.
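For reference, time slicing is controlled from FreeRTOSConfig.h, and if what you actually need is for each task to run once per fixed period (say every 100 ms) rather than to be pre-empted after a fixed execution time, the usual FreeRTOS pattern is to let each task do its work and then block with vTaskDelayUntil(). A minimal sketch (the 100 ms period and the task name are just examples):

/* FreeRTOSConfig.h - relevant settings (time slicing is on by default) */
#define configUSE_PREEMPTION    1
#define configUSE_TIME_SLICING  1

/* A task that does its work and then blocks until the next 100 ms boundary,
   leaving the CPU to the other tasks for the rest of the period. */
#include "FreeRTOS.h"
#include "task.h"

static void vPeriodicTask( void *pvParameters )
{
    TickType_t xLastWakeTime = xTaskGetTickCount();
    const TickType_t xPeriod = pdMS_TO_TICKS( 100 );

    for( ;; )
    {
        /* Do this task's work here; it has to finish within the 100 ms period. */

        /* Block until exactly one period after the previous wake time. */
        vTaskDelayUntil( &xLastWakeTime, xPeriod );
    }
}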
Question for you though - why would you want the behaviour you requested? Maybe if you said what you were trying to achieve, rather than asking whether a feature exists, people would be able to make helpful suggestions.


How do I measure GPU time on Metal?

I want to see programmatically how much GPU time a part of my application consumes on macOS and iOS. On OpenGL and D3D I can use GPU timer query objects. I searched and couldn't find anything similar for Metal. How do I measure GPU time on Metal without using Instruments etc.? I'm using Objective-C.
There are a couple of problems with measuring GPU time by taking CPU-side timestamps around a command buffer's execution:
1) Most of the time what you really want to know is the GPU-side latency within a command buffer, not the round trip to the CPU. This is better measured as the time difference between running 20 instances of the shader and 10 instances of the shader. However, that approach can add noise, since the error is the sum of the errors associated with the two measurements.
2) Waiting for completion causes the GPU to clock down when it stops executing. When it starts back up again, the clock is in a low-power state and may take quite a while to come up again, skewing your results. This can be a serious problem and may understate your benchmark performance relative to actual performance by a factor of two or more.
3) If you start the clock on scheduled and stop on completed, but the GPU is busy running other work, then your elapsed time includes time spent on the other workload. If the GPU is not busy, then you get the clock-down problems described in (2).
This problem is considerably harder to do right than most benchmarking cases I've worked with, and I have done a lot of performance measurement.
The best way to measure these things is to use on-device performance monitor counters, as they are a direct measure of what is going on, using the machine's own notion of time. I favor ones that report cycles over wall clock time because that tends to weed out clock slewing, but there is not universal agreement about that. (Not all parts of the hardware run at the same frequency, etc.) I would look to the developer tools for methods to measure based on PMCs, and if you don't find them, ask for them.
You can add scheduled and completed handler blocks to a command buffer. You can take timestamps in each and compare. There's some latency, since the blocks are executed on the CPU, but it should get you close.
With Metal 2.1, Metal now provides "events", which are more like fences in other APIs. (The name MTLFence was already used for synchronizing shared heap stuff.) In particular, with MTLSharedEvent, you can encode commands to modify the event's value at particular points in the command buffer(s). Then, you can either wait for the event to reach that value or ask for a block to be executed asynchronously when the event reaches a target value.
That still has problems with latency, etc. (as Ian Ollmann described), but is more fine grained than command buffer scheduling and completion. In particular, as Klaas mentions in a comment, a command buffer being scheduled does not indicate that it has started executing. You could put commands to set an event's value at the beginning and (with a different value) at the end of a sequence of commands, and those would only notify at actual execution time.
Finally, on iOS 10.3+ but not macOS, MTLCommandBuffer has two properties, GPUStartTime and GPUEndTime, with which you can determine how much time a command buffer took to execute on the GPU. This should not be subject to latency in the same way as the other techniques.
As an addition to Ken's comment above, GPUStartTime and GPUEndTime are now available on macOS too (10.15+):
https://developer.apple.com/documentation/metal/mtlcommandbuffer/1639926-gpuendtime?language=objc

What is the proper way to calculate latency in OMNeT++?

I have written a simulation module. For measuring latency, I am using this:
simTime().dbl() - tempLinkLayerFrame->getCreationTime().dbl();
Is this the proper way? If not, then please suggest an alternative; a sample code snippet would be very helpful.
Also, is the simTime() latency the actual latency in microseconds, which I can report in my research paper, or do I need to scale it up?
Also, I found that the channel data rate and channel delay have no impact on the link latency; instead, if I vary the trigger duration, the latency varies. For example:
timer = new cMessage("SelfTimer");
scheduleAt(simTime() + 0.000000000249, timer);
If this is not the proper way to trigger a simple module recursively, then please suggest one.
Assuming both simTime and getCreationTime use the OMNeT++ class for representing time, you can operate on them directly, because that class overloads the relevant operators. Going with what the manual says, I'd recommend using a signal for the measurements (e.g., emit(latencySignal, simTime() - tempLinkLayerFrame->getCreationTime());).
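For example (a minimal sketch; the module name, the frame handling, and the signal name "latency" are placeholders, and the namespace line is only needed on OMNeT++ 5 and later):

// C++ side of the module
#include <omnetpp.h>
using namespace omnetpp;   // omit on OMNeT++ 4.x

class LinkLayer : public cSimpleModule
{
  protected:
    simsignal_t latencySignal;

    virtual void initialize() {
        latencySignal = registerSignal("latency");
    }

    virtual void handleMessage(cMessage *msg) {
        // ... once the frame has been received/delivered:
        emit(latencySignal, simTime() - msg->getCreationTime());
        delete msg;
    }
};

Define_Module(LinkLayer);

// NED side, so the signal is recorded automatically:
// @signal[latency](type=simtime_t);
// @statistic[latency](title="link latency"; record=mean,max,vector);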
simTime() is in seconds, not microseconds.
Regarding your last question, this code will have problems if you use it for all nodes, and you start all those nodes at the same time in the simulation. In that case you'll have perfect synchronization of all nodes, meaning you'll only see collisions in the first transmission. Therefore, it's probably a good idea to add a random jitter to every newly scheduled message at the start of your simulation.
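A minimal sketch of that jitter idea (assuming the period comes from a module parameter; the names are placeholders):

// members of the module: cMessage *timer; double period;

// initialize(): start each node's periodic timer with a random offset
period = par("triggerPeriod").doubleValue();          // e.g. 0.000000000249 in your case
timer = new cMessage("SelfTimer");
scheduleAt(simTime() + uniform(0, period), timer);    // random jitter breaks the lock-step start

// handleMessage(): subsequent firings keep the fixed period
if (msg == timer)
    scheduleAt(simTime() + period, timer);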

How to use the QNX Momentics Application Profiler?

I'd like to profile my (multi-threaded) application in terms of timing. Certain threads are supposed to be re-activated frequently, i.e. a thread executes its main job once every fixed time interval. In other words, there's a fixed time slice in which all the threads are getting re-activated.
More precisely, I expect certain threads to get activated every 2ms (since this is the cycle period). I made some simplified measurements which confirmed the 2ms to be indeed effective.
For the purpose of profiling my app more accurately it seemed suitable to use Momentics' tool "Application Profiler".
However, when I do so, I can't make sense of the timing figures it reports. I would be interested in the average as well as the min and max time it takes before a certain thread is re-activated. So far it seems the tool only lets me monitor the time certain functions occupy. However, even that does not really seem to be the case. E.g. I've got two lines of code that sit literally next to each other:
if (var1 && var2 && var3) var5=1; takes 1ms (avg)
if (var4) var5=0; takes 5ms (avg)
What is that supposed to tell me?
Another thing confuses me - the parent thread "takes" up 33ms on avg, 2ms on max and 1ms on min. Aside from the fact that the avg shouldn't be bigger than the max (in fact I expect the avg to be no bigger than 2ms, since this is the cycle time), the avg actually keeps increasing the longer I run the profiling tool. So, if I ran the tool for half an hour, the 33ms would actually be something like 120s. It therefore seems that avg is actually the total amount of time the thread occupies the CPU.
If that is the case, I would expect to be able to offset it against the total time using the count figure, but that doesn't work either, mostly because the figure is almost never available - there is only a separate list entry (for every parent thread) which does not represent a specific process scope.
So, I read the QNX community wiki about the "Application Profiler", incl. the manual about "New IDE Application Profiler Enhancements", as well as the official manual articles about how to use the profiler tool, but I couldn't figure out how to use the tool to serve my interest.
Bottom line: I'm pretty sure I'm misinterpreting and misusing the tool for what it was intended to be used. Thus my question - how would I interpret the numbers or use the tool's feedback properly to get my 2ms cycle time confirmed?
Additional information
CPU: single core
QNX SDP 6.5 / Momentics 4.7.0
Profiling Method: Sampling and Call Count Instrumentation
Profiling Scope: Single Application
I enabled "Build for Profiling (Sampling and Call Count Instrumentation)" in the Build Options1
The System Profiler should give you what you are looking for. It hooks into the microkernel and lets you see the state of all threads on the system. I used it in a similar setup to find out why our system was getting unexpected time-outs. (The cause turned out to be Page Waits on critical threads.)
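As a low-tech cross-check of the 2ms re-activation period, independent of any profiler, you can also timestamp every activation of the thread yourself and keep running min/max/average figures. A minimal sketch using plain POSIX clock_gettime(), which is available on QNX (the print interval of 1000 cycles is arbitrary):

#include <stdio.h>
#include <time.h>

/* Call this once per activation of the 2ms thread. */
static void record_activation(void)
{
    static double last_ms = 0.0, min_ms = 1e9, max_ms = 0.0, sum_ms = 0.0;
    static long count = 0;

    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    double now_ms = ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;

    if (last_ms != 0.0) {
        double period = now_ms - last_ms;
        if (period < min_ms) min_ms = period;
        if (period > max_ms) max_ms = period;
        sum_ms += period;
        if (++count % 1000 == 0)   /* roughly every 2 seconds */
            printf("period avg=%.3f min=%.3f max=%.3f ms\n",
                   sum_ms / count, min_ms, max_ms);
    }
    last_ms = now_ms;
}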

How to set up a ZeroMQ architecture to deal with workers of different speeds

[As a small bit of context: I am new to networking and ZeroMQ, but I did spend quite a bit of time on the guide and the examples.]
I have the following challenge (done in C++, but irrelevant to the question). I have a single source that generates tasks. I have multiple engines that need to process those tasks, and send back the result.
First attempt:
I created a client with a ZMQ_PUSH socket. The engines have a ZMQ_PULL socket. To get the answers back to the client, I created the reverse: a ZMQ_PUSH on the workers and a ZMQ_PULL on the client. It worked out of the box, only for me to find out that after some time the client ran out of memory, since I was pushing far more requests than the workers could process. I need some backpressure.
Second attempt:
I added a counter on the client that took care of only pushing when no more than, say, 1000 tasks were 'in progress'. The out-of-memory issue was solved, since I never had more than 1000 'in progress' tasks. But ... some workers were slower than others. Since PUSH distributes messages round-robin, the amount of work queued for the slow worker kept increasing and increasing, until the slowest worker had all 1000 requests queued and the others were starved. I was not using my workers effectively.
Now, what architecture could I use that solves the issue of 'workers with different speeds'? Is the 'count the number of in-progress tasks' approach a good way of balancing the number of pushed requests? Or is there a way I can PUSH tasks to the workers and have the push block at a predefined point? Can I do that with the HWM?
I am sure this problem is of such a generic nature that I should be able to easily deal with this. Can anyone point me in the right direction?
Thanks!
We used the Paranoid Pirate Protocol http://rfc.zeromq.org/spec:6, but in the case of many very small jobs, where the overhead of communication might be high, a credit-based flow control pattern might be more efficient: http://unprotocols.org/blog:15
In both cases it is necessary for the requester to directly assign jobs to individual workers. This is abstracted away, of course, and depending on the use case it could be made available as a synchronous call which returns when all tasks have been processed.
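To make the "requester directly assigns jobs to individual workers" part concrete, below is a stripped-down sketch of the underlying request-for-work pattern (effectively credit-based flow control with a credit of one task per worker), using the current cppzmq API. It leaves out the heartbeating and retries that the Paranoid Pirate Protocol adds; the endpoint and the process() function are placeholders:

#include <zmq.hpp>
#include <queue>
#include <string>

// Task source / client: ROUTER socket, hands a task only to a worker that asked for one.
void client()
{
    zmq::context_t ctx(1);
    zmq::socket_t frontend(ctx, zmq::socket_type::router);
    frontend.bind("tcp://*:5555");

    std::queue<std::string> tasks;                      // filled by your task generator
    for (int i = 0; i < 10000; ++i) tasks.push("task " + std::to_string(i));

    while (!tasks.empty()) {
        // A REQ worker shows up as [identity][empty][payload];
        // the payload is "READY" or the result of its previous task.
        zmq::message_t identity, delimiter, payload;
        auto r1 = frontend.recv(identity,  zmq::recv_flags::none);
        auto r2 = frontend.recv(delimiter, zmq::recv_flags::none);
        auto r3 = frontend.recv(payload,   zmq::recv_flags::none);
        (void)r1; (void)r2; (void)r3;

        // Reply with the next task, addressed to exactly that worker.
        zmq::message_t empty;
        frontend.send(identity, zmq::send_flags::sndmore);
        frontend.send(empty,    zmq::send_flags::sndmore);
        frontend.send(zmq::buffer(tasks.front()), zmq::send_flags::none);
        tasks.pop();
    }
}

// Worker / engine: REQ socket, asks for work, processes it, returns the result.
std::string process(const std::string& task) { return "result of " + task; } // stand-in

void worker()
{
    zmq::context_t ctx(1);
    zmq::socket_t sock(ctx, zmq::socket_type::req);
    sock.connect("tcp://localhost:5555");

    std::string reply = "READY";                        // first request just announces readiness
    for (;;) {
        sock.send(zmq::buffer(reply), zmq::send_flags::none);
        zmq::message_t task;
        auto r = sock.recv(task, zmq::recv_flags::none);
        (void)r;
        reply = process(task.to_string());
    }
}

A slow worker simply asks for work less often, so it can never hog the queue; in a real setup you would add a shutdown message for the workers and, as noted above, heartbeating so that a dead worker does not silently swallow a task.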

Platform-independent parallelization without changing the framework?

I hope the title did not mislead you.
My problem is the following: I am currently trying to speed up a raytracer, and this is done with the help of the graphics card. It works fine, despite the fact that it actually got slower. :)
This is caused by the fact that I trace one ray at a time against the whole geometry on the graphics card (my "tracing server") and then fetch the result, which is awfully slow, so I have to gather some rays, calculate them together, and fetch the results in one go to speed this up.
The next problem is that I am not allowed to rewrite the surrounding framework, which should know nothing, or as little as possible, about this parallelization.
So here is my approach:
I thought about using several threads, where each one gets a ray and asks my "tracing server" to calculate the intersections. Then the thread is blocked until enough rays have been gathered to calculate the intersections on the graphics card and fetch the results back efficiently. This means that each thread waits until its results have been fetched.
You see, I already have some plan, but I do not know the following:
Which threading framework should I use to stay platform-independent?
Should I use a thread pool of fixed size, or create threads as needed?
Can any given thread library handle at least 1000 waiting threads (because that is the number I would need to gather for my fetch to be efficient)?
But I could also imagine doing this with one thread that dumps its load (a new ray) to the "tracing server" and fetches the next load until there is enough gathered to fetch the results.
Then the thread would take the results one by one, do the further calculations until all results are processed, and then go back to step one until all rays are done.
Also, if you have a better idea of how to parallelize this, please tell me about it.
Regards,
Nobody
PS: In case you need this information: the two platforms I want to target are Linux and Windows.
Use either Intel Threading Building Blocks or boost::thread.
http://www.boost.org/doc/libs/1_46_0/doc/html/thread.html
http://threadingbuildingblocks.org/
As far as thread pool vs. on-demand threads goes - a thread pool is generally the better idea, as it avoids thread creation overhead.
The number of waiting threads you can have depends on the underlying system more than anything else:
Maximum number of threads per process in Linux?
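Regarding the "1000 waiting threads" part: whichever framework you pick, one way to keep a fixed-size pool and still batch the GPU work is to let the rendering threads hand their rays to a shared collector that blocks them until a whole batch has been traced. A rough sketch using the standard <thread> primitives (boost::thread offers an equivalent interface); Ray, Hit and traceBatch() are stand-ins for your own types and for the call into the "tracing server":

#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

struct Ray { /* origin, direction, ... */ };
struct Hit { /* intersection data     */ };

// Stand-in for the round trip to the "tracing server": traces a whole batch on the GPU.
std::vector<Hit> traceBatch(const std::vector<Ray>& rays);

// Collects rays from many rendering threads and sends them to the GPU in batches.
class BatchingTracer {
    struct Batch {
        std::vector<Ray> rays;
        std::vector<Hit> hits;
        bool done = false;
    };

public:
    explicit BatchingTracer(std::size_t batchSize)
        : batchSize_(batchSize), current_(std::make_shared<Batch>()) {}

    // Called from any rendering thread. Blocks until the batch containing this
    // ray has been traced, then returns the intersection result for this ray.
    Hit trace(const Ray& ray) {
        std::unique_lock<std::mutex> lock(mutex_);
        std::shared_ptr<Batch> myBatch = current_;
        const std::size_t myIndex = myBatch->rays.size();
        myBatch->rays.push_back(ray);

        if (myBatch->rays.size() == batchSize_) {
            current_ = std::make_shared<Batch>();               // later rays start a new batch
            lock.unlock();
            std::vector<Hit> hits = traceBatch(myBatch->rays);  // one GPU round trip
            lock.lock();
            myBatch->hits = std::move(hits);
            myBatch->done = true;
            done_.notify_all();                                 // wake the threads of this batch
        } else {
            done_.wait(lock, [&] { return myBatch->done; });
        }
        return myBatch->hits[myIndex];
    }

private:
    const std::size_t batchSize_;
    std::shared_ptr<Batch> current_;
    std::mutex mutex_;
    std::condition_variable done_;
};

A real version also needs a way to flush the last, partially filled batch (for example a timeout or an explicit flush() at the end of a frame), but the structure stays the same.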