Simultaneous tasks with 8051 - C++

Is there any way to run two tasks on the 8051 μC simultaneously? For example:
Task 1:
    Delay 1 sec
    P2.B2 = 1
    Delay 1 sec
    P2.B2 = 0
Task 2:
    If P1.B0 = 1
        P2.B3 = 1
So at any time: while the switch connected to P1.0 reads 1, the LED at P2.3 is ON, and the LED at P2.2 keeps blinking.

A task is something that is typically provided by the underlying OS. If you are running on a bare-metal system without any OS, you have no tasks in the first place.
But your application can build its own tasks. The job is more or less straightforward: you have to build a scheduler, typically triggered by a hardware timer, for task switching, create a stack for each task, and keep some control structures for maintaining the tasks. As there is no MMU and no memory protection on bare-metal systems like the 8051, you can simply modify stack pointers to do the task switching.
That is exactly what a library like FreeRTOS can do for you. As far as I know, a port for the 8051 is available; searching the web returns a lot of links for "8051 FreeRTOS". There may be other libraries offering tasks as well.
But often the overhead of scheduling and all the administrative effort is much too high. Running an endless loop which does some jobs by reading queues or flags is much easier and often the more efficient solution. Running some jobs in interrupt service routines also fits bare-metal requirements well.

I assume you are running on bare metal with no battery-saving requirements, and that you can already write a program, load it onto your device, and run it. What I suggest is roughly this.
This program should have a main loop, which at its simplest would look something like this:

MAX_TIME is the largest possible value of the system clock; it should never be reached
task_table is a table with:
    next execution time as system clock time (MAX_TIME means disabled)
    function pointer

initialize task_table with the three tasks below
forever:
    for each task with time 0:
        set task time to MAX_TIME (disable)
        call task function (the task probably re-enables itself or another task)
    find the task with the lowest non-zero time in task_table
    if that task's time is in the past or now:
        set task time to MAX_TIME (disable)
        call task function (the task probably re-enables itself or another task)

Time-0 tasks are checked separately from timed tasks so that the time-0 tasks don't block each other, or the timed tasks, from ever being called. The same could be achieved in different ways; this is just an example.
Then your requirements really call for 3 "tasks":

task_p2_b2_0:
    P2.B2 = 0
    enable task task_p2_b2_1 at current_time + 1 second

task_p2_b2_1:
    P2.B2 = 1
    enable task task_p2_b2_0 at current_time + 1 second

task_p1_b0_poll:
    if P1.B0 = 1
        P2.B3 = 1
    enable task task_p1_b0_poll at time 0 (or current_time + 10 ms or whatever)
Future development: the above is for a small number of static tasks. Iterating over a 5-10 item table is so fast that there is no point in trying to optimize it. Once you have more tasks than that, you should consider using a priority heap to store the tasks. Then you could also consider making the main loop sleep when it has nothing to do, and use an interrupt to wake it up (timer interrupt, serial port interrupt, pin activation interrupt, etc.). You could also have different task types, such as tasks that are activated when there is some IO (button press, byte from serial port, whatever). And so on. At the upper end, adding features like this eventually amounts to a complete operating system, but for simple things what I wrote above is really enough.
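For concreteness, here is a minimal sketch of that loop, collapsed into a single pass over the table. The current_time() tick source, the ONE_SECOND constant and the set_P2_B2()/set_P2_B3()/get_P1_B0() port accessors are placeholders for whatever your compiler and board provide, and clock wrap-around is ignored for brevity:

#include <stdint.h>

#define MAX_TIME   0xFFFFFFFFUL   /* "disabled" marker, should never be reached */
#define ONE_SECOND 1000UL         /* placeholder: ticks per second of your clock */

typedef void (*task_fn)(void);

struct task {
    uint32_t next_time;           /* next execution time, MAX_TIME = disabled */
    task_fn  fn;
};

/* Placeholders: wire these to your timer tick and port bits. */
extern uint32_t current_time(void);
extern void set_P2_B2(uint8_t v);
extern void set_P2_B3(uint8_t v);
extern uint8_t get_P1_B0(void);

static struct task task_table[3];

static void enable_task(uint8_t i, uint32_t t) { task_table[i].next_time = t; }

static void task_p2_b2_0(void)    /* LED off, schedule "on" in 1 second */
{
    set_P2_B2(0);
    enable_task(1, current_time() + ONE_SECOND);
}

static void task_p2_b2_1(void)    /* LED on, schedule "off" in 1 second */
{
    set_P2_B2(1);
    enable_task(0, current_time() + ONE_SECOND);
}

static void task_p1_b0_poll(void) /* set P2.3 whenever the switch reads 1 */
{
    if (get_P1_B0())
        set_P2_B3(1);
    enable_task(2, 0);            /* run again on the next loop pass */
}

void main_loop(void)
{
    uint8_t i;
    task_table[0].next_time = 0;        task_table[0].fn = task_p2_b2_0;
    task_table[1].next_time = MAX_TIME; task_table[1].fn = task_p2_b2_1;
    task_table[2].next_time = 0;        task_table[2].fn = task_p1_b0_poll;

    for (;;) {
        for (i = 0; i < 3; ++i) {
            /* run every task whose time is 0, now, or already in the past */
            if (task_table[i].next_time <= current_time()) {
                task_table[i].next_time = MAX_TIME;  /* disable before calling */
                task_table[i].fn();                  /* the task re-enables itself */
            }
        }
    }
}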

Related

Dynamically Evaluate load and create Threads depending on machine performance

Hi, I have started to work on a project where I use parallel computing to split job loads, such as hashing and other forms of mathematical calculations, among multiple machines. I'm using C++.
It runs on a master/slave (or server/client, if you prefer) model where every client connects to the server and waits for a job. The server can then take a job and split it depending on the number of clients:
1000 jobs --> 3 clients
e.g.: Client 1 --> calculate(0 to 333)
Client 2 --> calculate(334 to 666)
Client 3 --> calculate(667 to 999)
I wanted to further enhance the speed by creating multiple threads on every running client. But since the machines are almost certainly not going to have the same hardware, I cannot arbitrarily decide on a number of threads to run on every client.
I would like to know if any of you knows a way to evaluate the load a thread puts on the CPU and extrapolate the number of threads that can be run concurrently on the machine.
Here are the ways I see of doing this:
I start threads one by one, evaluating the CPU load every time, and stop when I reach a certain fixed ceiling (50%, 75%, etc.), but this has the flaw that I'll have to stop and re-split the job every time I start a new thread.
(And this is the more complex option:)
Run some kind of test thread, calculate its impact on the CPU base load, extrapolate the number of threads that can be run on the machine, and then start the threads and split the jobs accordingly.
Any ideas or pointers are welcome, thanks in advance!
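One simple starting point for the thread count is std::thread::hardware_concurrency(); a minimal sketch, where calculate() and the client's range are placeholders for the real job:

#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// Placeholder for the real per-item work (hashing, math, ...).
static void calculate(std::uint64_t value) { (void)value; }

int main() {
    // Range assigned to this client by the server, e.g. 0..333.
    const std::uint64_t first = 0, last = 333;

    // Use the number of hardware threads as a first guess;
    // hardware_concurrency() may return 0 if unknown, so fall back to 2.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 2;

    const std::uint64_t total = last - first + 1;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        const std::uint64_t begin = first + total * t / n;
        const std::uint64_t end   = first + total * (t + 1) / n;
        workers.emplace_back([begin, end] {
            for (std::uint64_t v = begin; v < end; ++v)
                calculate(v);
        });
    }
    for (auto& w : workers)
        w.join();
    std::cout << "client range done\n";
}

This does not measure actual CPU load, but for CPU-bound work one thread per hardware thread is usually close to the optimum, and it avoids having to stop and re-split the job while it runs.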

Using timers with performance-critical software (Qt)

I am developing an application that is responsible for moving and managing robots over a UDP connection.
The application needs to:
Read joystick/user input using SDL.
Generate and send a control packet to the robot every 20 milliseconds (UDP)
Receive and decode response packets from the robot (~20 msecs). This was implemented with the signal/slot mechanism and does not require a timer.
Receive and process robot messages for debugging reasons. This is not time-regulated.
Update the UI regularly to keep the user notified about the status of the robot (e.g. battery voltage). For most cases, I have also used Qt's signal/slot mechanism.
Use a watchdog that disables the robot if no response is received after 1 second. The watchdog is reset when the application receives a robot packet (~20 msecs)
For the moment, I have implemented all of the above. However, the application fails to send the packets regularly when the watchdog is activated or when two or more QTimer objects are used. The application generally works, but I would not consider it "production ready". I have tried to use the precision flags of the timers (Qt::PreciseTimer, Qt::CoarseTimer and Qt::VeryCoarseTimer), but I still experienced problems.
Notes:
The code is generally well organized, there are no "god objects" in the code base (most source files are less than 150 lines long and only create the necessary dependencies).
Most of the time, I use QTimer::singleShot() (e.g. I will only send the next packet once the current packet has been sent).
Where we use timers:
To read joystick input (~50 msecs, precise timer)
To send robot packets (~20 msecs, precise timer)
To update some aspects of the UI (~500 msecs, coarse timer)
To update the elapsed time since the robot was enabled (~100 msecs, precise timer)
To implement a watchdog (put the application and robot in safe state if 1000 msecs have passed without a robot response)
Note: the watchdog is fed when we receive a response packet from the robot (~20 msecs)
Do you have any recommendations for using QTimer objects with performance-critical code? (Any idea is welcome.) Note that I have also tried to use different threads, but that caused me more problems, since the application would not be in "sync", thus failing to effectively control the robots that we have tested.
Actually, I seem to have underestimated Qt's timer and event loop performance. On my system I get on average around 20k nanoseconds for an event loop cycle, plus the overhead from scheduling a queued function call, and a timer with a 1 millisecond interval is rarely late; most of the timeouts are a few thousand nanoseconds short of a millisecond. But it is a high-end system; on embedded hardware it may be a lot worse.
You should take the time to profile your target system and Qt build to determine whether it can indeed run snappily enough, and, based on those measurements, adjust your timings to compensate for the system delays so that your events get scheduled closer to on time.
You should definitely keep the timer thread as free as possible, because if you block it by IO or extensive computation, your timer will not be accurate. Use a dedicated thread to schedule work and extra worker threads to do the actual work. You may also try playing with thread priorities a bit.
Worst case scenario, look for 3rd-party high-performance event loop implementations, or create your own, and potentially a faster signaling mechanism as well. As I already mentioned in the comments, Qt's inter-thread queued signals are very slow, at least compared to something like indirect function calls.
Last but not least, if you want to do task X every N units of time, that is only possible if task X takes N units of time or less on your system. You need to make this consideration for each task, and for all tasks running concurrently. To get accurate scheduling, you should measure how long task X took; if it took less than its period, schedule the next execution in the time remaining, otherwise execute it immediately.
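To illustrate that last point, here is a minimal sketch of a compensated 20 ms send cycle, assuming Qt 5.4+ for the functor overload of QTimer::singleShot(); sendControlPacket() is a stand-in for the real UDP send:

#include <QtCore/QCoreApplication>
#include <QtCore/QElapsedTimer>
#include <QtCore/QTimer>
#include <QtCore/QDebug>

// Stand-in for the real UDP send.
static void sendControlPacket() { qDebug() << "packet sent"; }

// Do the work, then schedule the next shot with whatever is left of the 20 ms
// period, so the cycle does not drift by the duration of the work itself.
static void sendAndReschedule() {
    QElapsedTimer timer;
    timer.start();
    sendControlPacket();
    const int spent = static_cast<int>(timer.elapsed());  // ms used by the work
    const int wait = qMax(0, 20 - spent);                  // remainder of the period
    QTimer::singleShot(wait, Qt::PreciseTimer, &sendAndReschedule);
}

int main(int argc, char** argv) {
    QCoreApplication app(argc, argv);
    sendAndReschedule();   // start the 20 ms cycle
    return app.exec();
}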

Unbalanced load (v2.0) using MPI

(the problem is embarrassingly parallel)
Consider an array of 12 cells:
|__|__|__|__|__|__|__|__|__|__|__|__|
and four (4) CPUs.
Naively, I would run 4 parallel jobs, feeding 3 cells to each CPU.
|__|__|__|__|__|__|__|__|__|__|__|__|
=========|========|========|========|
1 CPU 2 CPU 3 CPU 4 CPU
BUT it appears that each cell has a different evaluation time; some cells are evaluated very quickly, and some are not.
So, instead of letting a "relaxed CPU" go to waste, I am thinking of feeding each CPU ONE cell at a time and continuing until the entire job is done.
Namely:
at the beginning:
|____|____|____|____|____|____|____|____|____|____|____|____|
1cpu 2cpu 3cpu 4cpu
If 2cpu finishes its job at cell "2", it can jump to the first empty cell "5" and continue working:
|____|done|____|____|____|____|____|____|____|____|____|____|
1cpu 3cpu 4cpu 2cpu
|-------------->
If 1cpu finishes, it can take the sixth cell:
|done|done|____|____|____|____|____|____|____|____|____|____|
3cpu 4cpu 2cpu 1cpu
|------------------------>
and so on, until the full array is done.
QUESTION:
I do not know a priori which cell is "quick" and which cell is "slow", so I cannot distribute CPUs according to the load (more CPUs for slow cells, fewer for quick ones).
How can one implement such an algorithm for dynamic evaluation with MPI?
Thanks!!!!!
UPDATE
I use a very simple approach to divide the entire job into chunks, using MPI-IO:
given: array[NNN] and nprocs - number of available working units:
for (int i = 0; i < NNN/nprocs; ++i)
{
    do_what_I_need(start + i);
}
MPI_File_write(...);
where "start" corresponds to particular rank number. In simple words, I divide the entire NNN array into fixed size chunk according to the number of available CPU and each CPU performs its chunk, writes the result to (common) output and relaxes.
IS IT POSSIBLE to change the code (Not to completely re-write in terms of Master/Slave paradigm) in such a way, that each CPU will get only ONE iteration (and not NNN/nprocs) and after it completes its job and writes its part to the file, will Continue to the next cell and not to relax.
Thanks!
There is a well known parallel programming pattern, known under many names, some of which are: bag of tasks, master / worker, task farm, work pool, etc. The idea is to have a single master process, which distributes cells to the other processes (workers). Each worker runs an infinite loop in which it waits for a message from the master, computes something and then returns the result. The loop is terminated by having the master send a message with a special tag. The wildcard tag value MPI_ANY_TAG can be used by the worker to receive messages with different tags.
The master is more complex. It also runs a loop, but until all cells have been processed. Initially it sends each worker a cell and then starts a loop. In this loop it receives a message from any worker using the wildcard source value MPI_ANY_SOURCE and, if there are more cells to be processed, sends one of them to the same worker that has returned the result. Otherwise it sends a message with the tag set to the termination value.
There are many, many readily available implementations of this model on the Internet, and even some on Stack Overflow (for example this one). Mind that this scheme requires one additional MPI process that often does very little work. If this is unacceptable, one can run a worker loop in a separate thread.
You want to implement a kind of client-server architecture where you have workers asking the server for work whenever they are out of work.
Depending on the size of the chunks and the speed of your communication between workers and server, you may want to adjust the size of the chunks sent to workers.
To answer your updated question:
Under the master/slave (or worker pool, if that's how you prefer it to be labelled) model, you will basically need a task scheduler. The master should have information about what work has been done and what still needs to be done. The master gives each process some work to do, then sits and waits until a process completes (using nonblocking receives and a wait-all). Once a process completes, have it send the data to the master and then wait for the master to respond with more work. Continue this until the work is done.
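For illustration, a compact sketch of that master/worker loop; here do_what_I_need() returns a double purely as a placeholder for the real per-cell computation, and the MPI-IO write and error handling are left out:

#include <mpi.h>
#include <vector>

// Placeholder for the real per-cell computation from the question.
static double do_what_I_need(int cell) { return cell * 2.0; }

static const int TAG_WORK = 1;
static const int TAG_STOP = 2;

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int NNN = 12;  // number of cells

    if (rank == 0) {     // master: hand out one cell at a time
        int next = 0, active = 0, dummy = 0;
        for (int w = 1; w < nprocs; ++w) {
            if (next < NNN) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                ++next;
                ++active;
            } else {
                MPI_Send(&dummy, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        std::vector<double> result(NNN);
        while (active > 0) {
            double value;
            MPI_Status st;
            MPI_Recv(&value, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            result[st.MPI_TAG] = value;   // reply tag carries the finished cell index
            if (next < NNN) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                ++next;
            } else {
                MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                --active;
            }
        }
    } else {             // worker: receive a cell, compute, return, repeat
        for (;;) {
            int cell;
            MPI_Status st;
            MPI_Recv(&cell, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            double value = do_what_I_need(cell);
            MPI_Send(&value, 1, MPI_DOUBLE, 0, cell, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}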

Can I set a single thread's priority above 15 for a normal priority process?

I have a data acquisition application running on Windows 7, using VC2010 in C++. One thread is a heartbeat which sends out a change every 0.2 seconds to keep alive some hardware which has a timeout of about 0.9 seconds. Typically the heartbeat call takes 10-20 ms and the thread spends the rest of the time sleeping.
Occasionally however there will be a delay of 1-2 seconds and the hardware will shut down momentarily. The heartbeat thread is running at THREAD_PRIORITY_TIME_CRITICAL which is 15 for a normal priority process. My other threads are running at normal priority, although I use a DLL to control some other hardware and have noticed with Process Explorer that it starts several threads running at level 15.
I can't track down the source of the slowdown, but other threads in my application are seeing the same kind of delays when this happens. I have made several optimizations to the heartbeat code even though it is quite simple, but the occasional failures are still happening. Now I wonder if I can increase the priority of this thread beyond 15 without specifying REALTIME_PRIORITY_CLASS for the entire process. If not, are there any downsides I should be aware of to using REALTIME_PRIORITY_CLASS? (Other than this heartbeat thread, the rest of the application doesn't have real-time timing needs.)
(Or does anyone have any ideas about how to track down these slowdowns...not sure if the source could be in my app or somewhere else on the system).
Update: I hadn't actually tried passing 31 into my AfxBeginThread call; it turns out it ignores that value and sets the thread to normal priority, instead of the 15 that I get with THREAD_PRIORITY_TIME_CRITICAL.
Update: It turns out running the Disk Defragmenter is a good way to cause lots of thread delays. Even running the process at REALTIME_PRIORITY_CLASS and the heartbeat thread at THREAD_PRIORITY_TIME_CRITICAL (level 31) doesn't seem to help. The next thing to try is calling AvSetMmThreadCharacteristics("Pro Audio").
Update: Scheduling the heartbeat thread as "Pro Audio" does work to increase the thread's priority beyond 15 (Base=1, Dynamic=24), but it doesn't seem to make any real difference when defrag is running. I've been able to correlate many of the slowdowns with the disk defragmenter, so I turned off the weekly scan. I still can't explain some delays, so we're going to increase to a 5-10 second watchdog timeout.
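For reference, registering a thread with MMCSS looks roughly like this (a sketch, assuming avrt.h and Avrt.lib are available on the build machine):

#include <windows.h>
#include <avrt.h>
#include <cstdio>
#pragma comment(lib, "Avrt.lib")

int main() {
    DWORD taskIndex = 0;
    // Register the calling thread with MMCSS under the "Pro Audio" task class.
    HANDLE mmcss = AvSetMmThreadCharacteristics(TEXT("Pro Audio"), &taskIndex);
    if (!mmcss) {
        std::printf("AvSetMmThreadCharacteristics failed: %lu\n", GetLastError());
        return 1;
    }

    // ... heartbeat loop would run here at the boosted priority ...

    AvRevertMmThreadCharacteristics(mmcss);   // restore normal characteristics
    return 0;
}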
Even if you could, increasing the priority will not help. The highest priority runnable thread gets the processor at all times.
Most likely there is some extended interrupt processing occurring while interrupts are disabled. Interrupts effectively work at a higher priority than any thread.
It could be video, network, disk, serial, USB, etc. It will take some insight to selectively disable drivers, or substitute alternate ones, to see whether the system hesitation is affected. Once you find the culprit, figuring out a way to prevent it might range from trivial to impossible, depending on what it is.
Without more knowledge about the system, it is hard to say. Have you tried running it on a different PC?
Officially you can't use REALTIME threads in a process which does not have the REALTIME_PRIORITY_CLASS.
Unofficially, you could play with the undocumented NtSetInformationThread.
see:
http://undocumented.ntinternals.net/UserMode/Undocumented%20Functions/NT%20Objects/Thread/NtSetInformationThread.html
But since I have not tried it, I don't have any more info about this.
On the other hand, as was said before, you can never be sure that the OS will not take its time when your thread's quantum expires. Certain poorly written drivers are often the cause of such latency.
Otherwise, there is software that can tell you whether you have misbehaving kernel parts:
http://www.thesycon.de/deu/latency_check.shtml
I would try using CreateWaitableTimer() & SetWaitableTimer() and see if they are subject to the same preemption problems.
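A minimal sketch of that idea: a periodic waitable timer that fires every 200 ms, with SendHeartbeat() standing in for the real keep-alive call:

#include <windows.h>
#include <cstdio>

// Stand-in for the real keep-alive call to the hardware.
static void SendHeartbeat() { std::puts("heartbeat"); }

int main() {
    // Auto-reset (synchronization) timer: manual reset = FALSE.
    HANDLE timer = CreateWaitableTimer(nullptr, FALSE, nullptr);
    if (!timer)
        return 1;

    // First due time 200 ms from now (negative = relative, in 100 ns units),
    // then a 200 ms period.
    LARGE_INTEGER due;
    due.QuadPart = -2000000LL;
    if (!SetWaitableTimer(timer, &due, 200 /* ms */, nullptr, nullptr, FALSE)) {
        CloseHandle(timer);
        return 1;
    }

    for (int i = 0; i < 10; ++i) {   // bounded loop just for the example
        WaitForSingleObject(timer, INFINITE);
        SendHeartbeat();
    }

    CloseHandle(timer);
    return 0;
}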

What could delay pre-emption of a VxWorks task?

In my current project I have two levels of tasking in a VxWorks system: a higher-priority (100) task for number crunching and other work, and a lower-priority (200) task for background data logging to on-board flash memory. Logging is done using the fwrite() call, to a file stored on a TFFS file system. The high-priority task runs at a periodic rate and then sleeps to allow background logging to be done.
My expectation was that the background logging task would run when the high priority task sleeps and be preempted as soon as the high priority task wakes.
What appears to be happening is a significant delay in suspending the background logging task once the high priority task is ready to run again, when there is sufficient data to keep the logging task continuously occupied.
What could delay the pre-emption of a lower priority task under VxWorks 6.8 on a Power PC architecture?
You didn't quantify significant, so the following is just speculation...
You mention writing to flash. One issue is that writing to flash typically requires the driver to poll the status of the hardware to make sure the operation completes successfully.
It is possible that during certain operations the file system temporarily disables preemption to ensure that no corruption occurs; coupled with having to wait for the hardware to complete, this might account for the delay.
If you have access to the System Viewer tool, that would go a long way towards identifying the cause of the delay.
I second the suggestion of using the System Viewer. It'll show all the tasks involved in the TFFS stack, and you may be surprised how many layers there are. If you're making an fwrite() with a large block of data, the flash access may be large (and slow, as Benoit said). You may try a bunch of smaller fwrite() calls. I suggest doing a test to see how long fwrite() takes for various sizes; you may see differences from test to test with the same size as you cross flash block boundaries.
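A rough sketch of such a timing test, assuming the POSIX clock_gettime() interface that VxWorks provides (CLOCK_REALTIME or tickGet() can stand in if CLOCK_MONOTONIC is unavailable); the /tffs0 path and the buffer sizes are placeholders:

#include <cstdio>
#include <time.h>
#include <vector>

// Time one fwrite() of 'size' bytes and return the elapsed time in microseconds.
static long timed_fwrite(FILE* f, const char* buf, size_t size) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    std::fwrite(buf, 1, size, f);
    std::fflush(f);   // push the data toward the TFFS driver before stopping the clock
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_nsec - t0.tv_nsec) / 1000L;
}

int main() {
    FILE* f = std::fopen("/tffs0/timing.bin", "wb");   // placeholder TFFS path
    if (!f)
        return 1;

    const size_t sizes[] = { 512, 4096, 32768, 131072 };
    const int nsizes = sizeof(sizes) / sizeof(sizes[0]);
    std::vector<char> buf(131072, 0x55);

    // Repeat each size a few times; crossing flash block boundaries may show up
    // as run-to-run differences for the same size.
    for (int s = 0; s < nsizes; ++s)
        for (int run = 0; run < 4; ++run)
            std::printf("fwrite of %u bytes took %ld us\n",
                        (unsigned)sizes[s], timed_fwrite(f, &buf[0], sizes[s]));

    std::fclose(f);
    return 0;
}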