What is the definition of realtime, near realtime and batch? Give examples of each? - web-services

I'm trying to get a good definition of realtime, near realtime and batch? I am not talking about sync and async although to me, they are different dimensions. Here is what I'm thinking
Realtime is sync web services or async web services.
Near realtime could be JMS or messaging systems or most event driven systems.
Batch to me is more of an timed system that is processing when it wakes up.
Give examples of each and feel free to fix my assumptions.

https://stackoverflow.com/tags/real-time/info
Real-Time
Real-time means that the time of an activity's completion is part of its functional correctness. For example, the sqrt() function's correctness is something like
The sqrt() function is implemented
correctly if, for all x >=0, sqrt(x) =
y implies y^2 == x.
In this setting, the time it takes to execute the sqrt() procedure is not part of its functional correctness. A faster algorithm may be better in some qualitative sense, but no more or less correct.
Suppose we have a mythical function called sqrtrt(), a real-time version of square root. Imagine, for instance, we need to compute the square root of velocity in order to properly execute the next brake application in an anti-lock braking system. In this setting, we might say instead:
The sqrtrt() function is implemented
correctly if
for all x >=0, sqrtrt(x) =
y implies y^2 == x and
sqrtrt() returns a result in <= 275 microseconds.
In this case, the time constraint is not merely a performance parameter. If sqrtrt() fails to complete in 275 microseconds, you may be late applying the brakes, triggering either a skid or reduced braking efficiency, possibly resulting in an accident. The time constraint is part of the functional correctness of the routine. Lift this up a few layers, and you get a real-time system as one (at least partially) composed of activities that have timeliness as part of their functional correctness conditions.
Near Real-Time
A near real-time system is one in which activities completion times, responsiveness, or perceived latency when measured against wall clock time are important aspects of system quality. The canonical example of this is a stock ticker system -- you want to get quotes reasonably quickly after the price changes. For most of us non-high-speed-traders, what this means is that the perceived delay between data being available and our seeing it is negligible.
The difference between "real-time" and "near real-time" is both a difference in precision and magnitude. Real-time systems have time constraints that range from microseconds to hours, but those time constraints tend to be fairly precise. Near-real-time usually implies a narrower range of magnitudes -- within human perception tolerances -- but typically aren't articulated precisely.
I would claim that near-real-time systems could be called real-time systems, but that their time constraints are merely probabilistic:
The stock price will be displayed to the user within 500ms of its change at the exchange, with
probability p > 0.75.
Batch
Batch operations are those which are perceived to be large blocks of computing tasks with only macroscopic, human- or process-induced deadlines. The specific context of computation is typically not important, and a batch computation is usually a self-contained computational task. Real-time and near-real-time tasks are often strongly coupled to the physical world, and their time constraints emerge from demands from physical/real-world interactions. Batch operations, by contrast, could be computed at any time and at any place; their outputs are solely defined by the inputs provided when the batch is defined.
Original Post
I would say that real-time means that the time (rather than merely the correct output) to complete an operation is part of its correctness.
Near real-time is weasel words for wanting the same thing as real-time but not wanting to go to the discipline/effort/cost to guarantee it.
Batch is "near real-time" where you are even more tolerant of long response times.
Often these terms are used (badly, IMHO) to distinguish among human perceptions of latency/performance. People think real-time is real-fast, e.g., milliseconds or something. Near real-time is often seconds or milliseconds. Batch is a latency of seconds, minutes, hours, or even days. But I think those aren't particularly useful distinctions. If you care about timeliness, there are disciplines to help you get that.

I'm curious for feedback myself on this. Real-time and batch are well defined and covered by others (though be warned that they are terms-of-art with very specific technical meanings in some contexts). However, "near real-time" seems a lot fuzzier to me.
I favor (and have been using) "near real-time" to describe a signal-processing system which can 'keep up' on average, but lags sometimes. Think of a system processing events which only happen sporadically... Assuming it has sufficient buffering capacity and the time it takes to process an event is less than the average time between events, it can keep up.
In a signal processing context:
- Real-time seems to imply a system where processing is guaranteed to complete with a specified (short) delay after the signal has been received. A minimal buffer is needed.
- Near real-time (as I have been using it) means a system where the delay between receiving and completion of processing may get relatively large on occasion, but the system will not (except under pathological conditions) fall behind so far that the buffer gets filled up.
- Batch implies post-processing to me. The incoming signal is just saved (maybe with a bit of real-time pre-processing) and then analyzed later.
This gives the nice framework of real-time and near real-time being systems where they can (in theory) run forever while new data is being acquired... processing happens in parallel with acquisition. Batch processing happens after all the data has been collected.
Anyway, I could be conflicting with some technical definitions I'm unaware of... and I assume someone here will gleefully correct me if needed.

There are issues with all of these answers in that the definitions are flawed. For instance, "batch" simply means that transactions are grouped and sent together. Real Time implies transactional, but may also have other implications. So when you combine batch in the same attribute as real time and near real time, clarity in purpose for that attribute is lost. The definition becomes less cohesive, less clear. This would make any application created with the data more fragile. I would guess that practitioners would be better off w/ a clearly modeled taxonomy such as:
Attribute1: Batched (grouped) or individual transactions.
Attribute2: Scheduled (time-driven), event-driven.
Attribute3: Speed per transaction. For batch that would be the average speed/transaction.
Attribute4: Protocol/Technology: SOAP, REST, combination, FTP, SFTP, etc. for data movement.
Attributex: Whatever.
Attribute4 is more related to something I am doing right now, so you could throw that out or expand the list for what you are trying to achieve. For each of these attribute values, there would likely be additional, specific attributes. But to bring the information together, we need to think about what is needed to make the collective data useful. For instance, what do we need to know between batched & transactional flows, to make them useful together. For instance, you may consider attributes for each to provide the ability to understand total throughput for a given time period. Seems funny how we may create conceptual, logical, and physical data models (hopefully) for our business clients, but we don't always apply that kind of thought to how we define terminology in our discussions.

Any system in which time at which output is produced is significant. This is usually because the input corresponding to some movement in the physical environment or world and the output has to relate to the same movement. The lag from input to output time must be sufficiently small for acceptable timelines.

Related

Compiling an application for use in highly radioactive environments

We are compiling an embedded C++ application that is deployed in a shielded device in an environment bombarded with ionizing radiation. We are using GCC and cross-compiling for ARM. When deployed, our application generates some erroneous data and crashes more often than we would like. The hardware is designed for this environment, and our application has run on this platform for several years.
Are there changes we can make to our code, or compile-time improvements that can be made to identify/correct soft errors and memory-corruption caused by single event upsets? Have any other developers had success in reducing the harmful effects of soft errors on a long-running application?
Working for about 4-5 years with software/firmware development and environment testing of miniaturized satellites*, I would like to share my experience here.
*(miniaturized satellites are a lot more prone to single event upsets than bigger satellites due to its relatively small, limited sizes for its electronic components)
To be very concise and direct: there is no mechanism to recover from detectable, erroneous
situation by the software/firmware itself without, at least, one
copy of minimum working version of the software/firmware somewhere for recovery purpose - and with the hardware supporting the recovery (functional).
Now, this situation is normally handled both in the hardware and software level. Here, as you request, I will share what we can do in the software level.
...recovery purpose.... Provide ability to update/recompile/reflash your software/firmware in real environment. This is an almost must-have feature for any software/firmware in highly ionized environment. Without this, you could have redundant software/hardware as many as you want but at one point, they are all going to blow up. So, prepare this feature!
...minimum working version... Have responsive, multiple copies, minimum version of the software/firmware in your code. This is like Safe mode in Windows. Instead of having only one, fully functional version of your software, have multiple copies of the minimum version of your software/firmware. The minimum copy will usually having much less size than the full copy and almost always have only the following two or three features:
capable of listening to command from external system,
capable of updating the current software/firmware,
capable of monitoring the basic operation's housekeeping data.
...copy... somewhere... Have redundant software/firmware somewhere.
You could, with or without redundant hardware, try to have redundant software/firmware in your ARM uC. This is normally done by having two or more identical software/firmware in separate addresses which sending heartbeat to each other - but only one will be active at a time. If one or more software/firmware is known to be unresponsive, switch to the other software/firmware. The benefit of using this approach is we can have functional replacement immediately after an error occurs - without any contact with whatever external system/party who is responsible to detect and to repair the error (in satellite case, it is usually the Mission Control Centre (MCC)).
Strictly speaking, without redundant hardware, the disadvantage of doing this is you actually cannot eliminate all single point of failures. At the very least, you will still have one single point of failure, which is the switch itself (or often the beginning of the code). Nevertheless, for a device limited by size in a highly ionized environment (such as pico/femto satellites), the reduction of the single point of failures to one point without additional hardware will still be worth considering. Somemore, the piece of code for the switching would certainly be much less than the code for the whole program - significantly reducing the risk of getting Single Event in it.
But if you are not doing this, you should have at least one copy in your external system which can come in contact with the device and update the software/firmware (in the satellite case, it is again the mission control centre).
You could also have the copy in your permanent memory storage in your device which can be triggered to restore the running system's software/firmware
...detectable erroneous situation.. The error must be detectable, usually by the hardware error correction/detection circuit or by a small piece of code for error correction/detection. It is best to put such code small, multiple, and independent from the main software/firmware. Its main task is only for checking/correcting. If the hardware circuit/firmware is reliable (such as it is more radiation hardened than the rests - or having multiple circuits/logics), then you might consider making error-correction with it. But if it is not, it is better to make it as error-detection. The correction can be by external system/device. For the error correction, you could consider making use of a basic error correction algorithm like Hamming/Golay23, because they can be implemented more easily both in the circuit/software. But it ultimately depends on your team's capability. For error detection, normally CRC is used.
...hardware supporting the recovery Now, comes to the most difficult aspect on this issue. Ultimately, the recovery requires the hardware which is responsible for the recovery to be at least functional. If the hardware is permanently broken (normally happen after its Total ionizing dose reaches certain level), then there is (sadly) no way for the software to help in recovery. Thus, hardware is rightly the utmost importance concern for a device exposed to high radiation level (such as satellite).
In addition to the suggestion for above anticipating firmware's error due to single event upset, I would also like to suggest you to have:
Error detection and/or error correction algorithm in the inter-subsystem communication protocol. This is another almost must have in order to avoid incomplete/wrong signals received from other system
Filter in your ADC reading. Do not use the ADC reading directly. Filter it by median filter, mean filter, or any other filters - never trust single reading value. Sample more, not less - reasonably.
NASA has a paper on radiation-hardened software. It describes three main tasks:
Regular monitoring of memory for errors then scrubbing out those errors,
robust error recovery mechanisms, and
the ability to reconfigure if something no longer works.
Note that the memory scan rate should be frequent enough that multi-bit errors rarely occur, as most ECC memory can recover from single-bit errors, not multi-bit errors.
Robust error recovery includes control flow transfer (typically restarting a process at a point before the error), resource release, and data restoration.
Their main recommendation for data restoration is to avoid the need for it, through having intermediate data be treated as temporary, so that restarting before the error also rolls back the data to a reliable state. This sounds similar to the concept of "transactions" in databases.
They discuss techniques particularly suitable for object-oriented languages such as C++. For example
Software-based ECCs for contiguous memory objects
Programming by Contract: verifying preconditions and postconditions, then checking the object to verify it is still in a valid state.
And, it just so happens, NASA has used C++ for major projects such as the Mars Rover.
C++ class abstraction and encapsulation enabled rapid development and testing among multiple projects and developers.
They avoided certain C++ features that could create problems:
Exceptions
Templates
Iostream (no console)
Multiple inheritance
Operator overloading (other than new and delete)
Dynamic allocation (used a dedicated memory pool and placement new to avoid the possibility of system heap corruption).
Here are some thoughts and ideas:
Use ROM more creatively.
Store anything you can in ROM. Instead of calculating things, store look-up tables in ROM. (Make sure your compiler is outputting your look-up tables to the read-only section! Print out memory addresses at runtime to check!) Store your interrupt vector table in ROM. Of course, run some tests to see how reliable your ROM is compared to your RAM.
Use your best RAM for the stack.
SEUs in the stack are probably the most likely source of crashes, because it is where things like index variables, status variables, return addresses, and pointers of various sorts typically live.
Implement timer-tick and watchdog timer routines.
You can run a "sanity check" routine every timer tick, as well as a watchdog routine to handle the system locking up. Your main code could also periodically increment a counter to indicate progress, and the sanity-check routine could ensure this has occurred.
Implement error-correcting-codes in software.
You can add redundancy to your data to be able to detect and/or correct errors. This will add processing time, potentially leaving the processor exposed to radiation for a longer time, thus increasing the chance of errors, so you must consider the trade-off.
Remember the caches.
Check the sizes of your CPU caches. Data that you have accessed or modified recently will probably be within a cache. I believe you can disable at least some of the caches (at a big performance cost); you should try this to see how susceptible the caches are to SEUs. If the caches are hardier than RAM then you could regularly read and re-write critical data to make sure it stays in cache and bring RAM back into line.
Use page-fault handlers cleverly.
If you mark a memory page as not-present, the CPU will issue a page fault when you try to access it. You can create a page-fault handler that does some checking before servicing the read request. (PC operating systems use this to transparently load pages that have been swapped to disk.)
Use assembly language for critical things (which could be everything).
With assembly language, you know what is in registers and what is in RAM; you know what special RAM tables the CPU is using, and you can design things in a roundabout way to keep your risk down.
Use objdump to actually look at the generated assembly language, and work out how much code each of your routines takes up.
If you are using a big OS like Linux then you are asking for trouble; there is just so much complexity and so many things to go wrong.
Remember it is a game of probabilities.
A commenter said
Every routine you write to catch errors will be subject to failing itself from the same cause.
While this is true, the chances of errors in the (say) 100 bytes of code and data required for a check routine to function correctly is much smaller than the chance of errors elsewhere. If your ROM is pretty reliable and almost all the code/data is actually in ROM then your odds are even better.
Use redundant hardware.
Use 2 or more identical hardware setups with identical code. If the results differ, a reset should be triggered. With 3 or more devices you can use a "voting" system to try to identify which one has been compromised.
You may also be interested in the rich literature on the subject of algorithmic fault tolerance. This includes the old assignment: Write a sort that correctly sorts its input when a constant number of comparisons will fail (or, the slightly more evil version, when the asymptotic number of failed comparisons scales as log(n) for n comparisons).
A place to start reading is Huang and Abraham's 1984 paper "Algorithm-Based Fault Tolerance for Matrix Operations". Their idea is vaguely similar to homomorphic encrypted computation (but it is not really the same, since they are attempting error detection/correction at the operation level).
A more recent descendant of that paper is Bosilca, Delmas, Dongarra, and Langou's "Algorithm-based fault tolerance applied to high performance computing".
Writing code for radioactive environments is not really any different than writing code for any mission-critical application.
In addition to what has already been mentioned, here are some miscellaneous tips:
Use everyday "bread & butter" safety measures that should be present on any semi-professional embedded system: internal watchdog, internal low-voltage detect, internal clock monitor. These things shouldn't even need to be mentioned in the year 2016 and they are standard on pretty much every modern microcontroller.
If you have a safety and/or automotive-oriented MCU, it will have certain watchdog features, such as a given time window, inside which you need to refresh the watchdog. This is preferred if you have a mission-critical real-time system.
In general, use a MCU suitable for these kind of systems, and not some generic mainstream fluff you received in a packet of corn flakes. Almost every MCU manufacturer nowadays have specialized MCUs designed for safety applications (TI, Freescale, Renesas, ST, Infineon etc etc). These have lots of built-in safety features, including lock-step cores: meaning that there are 2 CPU cores executing the same code, and they must agree with each other.
IMPORTANT: You must ensure the integrity of internal MCU registers. All control & status registers of hardware peripherals that are writeable may be located in RAM memory, and are therefore vulnerable.
To protect yourself against register corruptions, preferably pick a microcontroller with built-in "write-once" features of registers. In addition, you need to store default values of all hardware registers in NVM and copy-down those values to your registers at regular intervals. You can ensure the integrity of important variables in the same manner.
Note: always use defensive programming. Meaning that you have to setup all registers in the MCU and not just the ones used by the application. You don't want some random hardware peripheral to suddenly wake up.
There are all kinds of methods to check for errors in RAM or NVM: checksums, "walking patterns", software ECC etc etc. The best solution nowadays is to not use any of these, but to use a MCU with built-in ECC and similar checks. Because doing this in software is complex, and the error check in itself could therefore introduce bugs and unexpected problems.
Use redundancy. You could store both volatile and non-volatile memory in two identical "mirror" segments, that must always be equivalent. Each segment could have a CRC checksum attached.
Avoid using external memories outside the MCU.
Implement a default interrupt service routine / default exception handler for all possible interrupts/exceptions. Even the ones you are not using. The default routine should do nothing except shutting off its own interrupt source.
Understand and embrace the concept of defensive programming. This means that your program needs to handle all possible cases, even those that cannot occur in theory. Examples.
High quality mission-critical firmware detects as many errors as possible, and then handles or ignores them in a safe manner.
Never write programs that rely on poorly-specified behavior. It is likely that such behavior might change drastically with unexpected hardware changes caused by radiation or EMI. The best way to ensure that your program is free from such crap is to use a coding standard like MISRA, together with a static analyser tool. This will also help with defensive programming and with weeding out bugs (why would you not want to detect bugs in any kind of application?).
IMPORTANT: Don't implement any reliance of the default values of static storage duration variables. That is, don't trust the default contents of the .data or .bss. There could be any amount of time between the point of initialization to the point where the variable is actually used, there could have been plenty of time for the RAM to get corrupted. Instead, write the program so that all such variables are set from NVM in run-time, just before the time when such a variable is used for the first time.
In practice this means that if a variable is declared at file scope or as static, you should never use = to initialize it (or you could, but it is pointless, because you cannot rely on the value anyhow). Always set it in run-time, just before use. If it is possible to repeatedly update such variables from NVM, then do so.
Similarly in C++, don't rely on constructors for static storage duration variables. Have the constructor(s) call a public "set-up" routine, which you can also call later on in run-time, straight from the caller application.
If possible, remove the "copy-down" start-up code that initializes .data and .bss (and calls C++ constructors) entirely, so that you get linker errors if you write code relying on such. Many compilers have the option to skip this, usually called "minimal/fast start-up" or similar.
This means that any external libraries have to be checked so that they don't contain any such reliance.
Implement and define a safe state for the program, to where you will revert in case of critical errors.
Implementing an error report/error log system is always helpful.
It may be possible to use C to write programs that behave robustly in such environments, but only if most forms of compiler optimization are disabled. Optimizing compilers are designed to replace many seemingly-redundant coding patterns with "more efficient" ones, and may have no clue that the reason the programmer is testing x==42 when the compiler knows there's no way x could possibly hold anything else is because the programmer wants to prevent the execution of certain code with x holding some other value--even in cases where the only way it could hold that value would be if the system received some kind of electrical glitch.
Declaring variables as volatile is often helpful, but may not be a panacea.
Of particular importance, note that safe coding often requires that dangerous
operations have hardware interlocks that require multiple steps to activate,
and that code be written using the pattern:
... code that checks system state
if (system_state_favors_activation)
{
prepare_for_activation();
... code that checks system state again
if (system_state_is_valid)
{
if (system_state_favors_activation)
trigger_activation();
}
else
perform_safety_shutdown_and_restart();
}
cancel_preparations();
If a compiler translates the code in relatively literal fashion, and if all
the checks for system state are repeated after the prepare_for_activation(),
the system may be robust against almost any plausible single glitch event,
even those which would arbitrarily corrupt the program counter and stack. If
a glitch occurs just after a call to prepare_for_activation(), that would imply
that activation would have been appropriate (since there's no other reason
prepare_for_activation() would have been called before the glitch). If the
glitch causes code to reach prepare_for_activation() inappropriately, but there
are no subsequent glitch events, there would be no way for code to subsequently
reach trigger_activation() without having passed through the validation check or calling cancel_preparations first [if the stack glitches, execution might proceed to a spot just before trigger_activation() after the context that called prepare_for_activation() returns, but the call to cancel_preparations() would have occurred between the calls to prepare_for_activation() and trigger_activation(), thus rendering the latter call harmless.
Such code may be safe in traditional C, but not with modern C compilers. Such compilers can be very dangerous in that sort of environment because aggressive they strive to only include code which will be relevant in situations that could come about via some well-defined mechanism and whose resulting consequences would also be well defined. Code whose purpose would be to detect and clean up after failures may, in some cases, end up making things worse. If the compiler determines that the attempted recovery would in some cases invoke undefined behavior, it may infer that the conditions that would necessitate such recovery in such cases cannot possibly occur, thus eliminating the code that would have checked for them.
This is an extremely broad subject. Basically, you can't really recover from memory corruption, but you can at least try to fail promptly. Here are a few techniques you could use:
checksum constant data. If you have any configuration data which stays constant for a long time (including hardware registers you have configured), compute its checksum on initialization and verify it periodically. When you see a mismatch, it's time to re-initialize or reset.
store variables with redundancy. If you have an important variable x, write its value in x1, x2 and x3 and read it as (x1 == x2) ? x2 : x3.
implement program flow monitoring. XOR a global flag with a unique value in important functions/branches called from the main loop. Running the program in a radiation-free environment with near-100% test coverage should give you the list of acceptable values of the flag at the end of the cycle. Reset if you see deviations.
monitor the stack pointer. In the beginning of the main loop, compare the stack pointer with its expected value. Reset on deviation.
What could help you is a watchdog. Watchdogs were used extensively in industrial computing in the 1980s. Hardware failures were much more common then - another answer also refers to that period.
A watchdog is a combined hardware/software feature. The hardware is a simple counter that counts down from a number (say 1023) to zero. TTL or other logic could be used.
The software has been designed as such that one routine monitors the correct operation of all essential systems. If this routine completes correctly = finds the computer running fine, it sets the counter back to 1023.
The overall design is so that under normal circumstances, the software prevents that the hardware counter will reach zero. In case the counter reaches zero, the hardware of the counter performs its one-and-only task and resets the entire system. From a counter perspective, zero equals 1024 and the counter continues counting down again.
This watchdog ensures that the attached computer is restarted in a many, many cases of failure. I must admit that I'm not familiar with hardware that is able to perform such a function on today's computers. Interfaces to external hardware are now a lot more complex than they used to be.
An inherent disadvantage of the watchdog is that the system is not available from the time it fails until the watchdog counter reaches zero + reboot time. While that time is generally much shorter than any external or human intervention, the supported equipment will need to be able to proceed without computer control for that timeframe.
This answer assumes you are concerned with having a system that works correctly, over and above having a system that is minimum cost or fast; most people playing with radioactive things value correctness / safety over speed / cost
Several people have suggested hardware changes you can make (fine - there's lots of good stuff here in answers already and I don't intend repeating all of it), and others have suggested redundancy (great in principle), but I don't think anyone has suggested how that redundancy might work in practice. How do you fail over? How do you know when something has 'gone wrong'? Many technologies work on the basis everything will work, and failure is thus a tricky thing to deal with. However, some distributed computing technologies designed for scale expect failure (after all with enough scale, failure of one node of many is inevitable with any MTBF for a single node); you can harness this for your environment.
Here are some ideas:
Ensure that your entire hardware is replicated n times (where n is greater than 2, and preferably odd), and that each hardware element can communicate with each other hardware element. Ethernet is one obvious way to do that, but there are many other far simpler routes that would give better protection (e.g. CAN). Minimise common components (even power supplies). This may mean sampling ADC inputs in multiple places for instance.
Ensure your application state is in a single place, e.g. in a finite state machine. This can be entirely RAM based, though does not preclude stable storage. It will thus be stored in several place.
Adopt a quorum protocol for changes of state. See RAFT for example. As you are working in C++, there are well known libraries for this. Changes to the FSM would only get made when a majority of nodes agree. Use a known good library for the protocol stack and the quorum protocol rather than rolling one yourself, or all your good work on redundancy will be wasted when the quorum protocol hangs up.
Ensure you checksum (e.g. CRC/SHA) your FSM, and store the CRC/SHA in the FSM itself (as well as transmitting in the message, and checksumming the messages themselves). Get the nodes to check their FSM regularly against these checksum, checksum incoming messages, and check their checksum matches the checksum of the quorum.
Build as many other internal checks into your system as possible, making nodes that detect their own failure reboot (this is better than carrying on half working provided you have enough nodes). Attempt to let them cleanly remove themselves from the quorum during rebooting in case they don't come up again. On reboot have them checksum the software image (and anything else they load) and do a full RAM test before reintroducing themselves to the quorum.
Use hardware to support you, but do so carefully. You can get ECC RAM, for instance, and regularly read/write through it to correct ECC errors (and panic if the error is uncorrectable). However (from memory) static RAM is far more tolerant of ionizing radiation than DRAM is in the first place, so it may be better to use static DRAM instead. See the first point under 'things I would not do' as well.
Let's say you have an 1% chance of failure of any given node within one day, and let's pretend you can make failures entirely independent. With 5 nodes, you'll need three to fail within one day, which is a .00001% chance. With more, well, you get the idea.
Things I would not do:
Underestimate the value of not having the problem to start off with. Unless weight is a concern, a large block of metal around your device is going to be a far cheaper and more reliable solution than a team of programmers can come up with. Ditto optical coupling of inputs of EMI is an issue, etc. Whatever, attempt when sourcing your components to source those rated best against ionizing radiation.
Roll your own algorithms. People have done this stuff before. Use their work. Fault tolerance and distributed algorithms are hard. Use other people's work where possible.
Use complicated compiler settings in the naive hope you detect more failures. If you are lucky, you may detect more failures. More likely, you will use a code-path within the compiler which has been less tested, particularly if you rolled it yourself.
Use techniques which are untested in your environment. Most people writing high availability software have to simulate failure modes to check their HA works correctly, and miss many failure modes as a result. You are in the 'fortunate' position of having frequent failures on demand. So test each technique, and ensure its application actual improves MTBF by an amount that exceeds the complexity to introduce it (with complexity comes bugs). Especially apply this to my advice re quorum algorithms etc.
Since you specifically ask for software solutions, and you are using C++, why not use operator overloading to make your own, safe datatypes? For example:
Instead of using uint32_t (and double, int64_t etc), make your own SAFE_uint32_t which contains a multiple (minimum of 3) of uint32_t. Overload all of the operations you want (* + - / << >> = == != etc) to perform, and make the overloaded operations perform independently on each internal value, ie don't do it once and copy the result. Both before and after, check that all of the internal values match. If values don't match, you can update the wrong one to the value with the most common one. If there is no most-common value, you can safely notify that there is an error.
This way it doesn't matter if corruption occurs in the ALU, registers, RAM, or on a bus, you will still have multiple attempts and a very good chance of catching errors. Note however though that this only works for the variables you can replace - your stack pointer for example will still be susceptible.
A side story: I ran into a similar issue, also on an old ARM chip. It turned out to be a toolchain which used an old version of GCC that, together with the specific chip we used, triggered a bug in certain edge cases that would (sometimes) corrupt values being passed into functions. Make sure your device doesn't have any problems before blaming it on radio-activity, and yes, sometimes it is a compiler bug =)
Disclaimer: I'm not a radioactivity professional nor worked for this kind of application. But I worked on soft errors and redundancy for long term archival of critical data, which is somewhat linked (same problem, different goals).
The main problem with radioactivity in my opinion is that radioactivity can switch bits, thus radioactivity can/will tamper any digital memory. These errors are usually called soft errors, bit rot, etc.
The question is then: how to compute reliably when your memory is unreliable?
To significantly reduce the rate of soft errors (at the expense of computational overhead since it will mostly be software-based solutions), you can either:
rely on the good old redundancy scheme, and more specifically the more efficient error correcting codes (same purpose, but cleverer algorithms so that you can recover more bits with less redundancy). This is sometimes (wrongly) also called checksumming. With this kind of solution, you will have to store the full state of your program at any moment in a master variable/class (or a struct?), compute an ECC, and check that the ECC is correct before doing anything, and if not, repair the fields. This solution however does not guarantee that your software can work (simply that it will work correctly when it can, or stops working if not, because ECC can tell you if something is wrong, and in this case you can stop your software so that you don't get fake results).
or you can use resilient algorithmic data structures, which guarantee, up to a some bound, that your program will still give correct results even in the presence of soft errors. These algorithms can be seen as a mix of common algorithmic structures with ECC schemes natively mixed in, but this is much more resilient than that, because the resiliency scheme is tightly bounded to the structure, so that you don't need to encode additional procedures to check the ECC, and usually they are a lot faster. These structures provide a way to ensure that your program will work under any condition, up to the theoretical bound of soft errors. You can also mix these resilient structures with the redundancy/ECC scheme for additional security (or encode your most important data structures as resilient, and the rest, the expendable data that you can recompute from the main data structures, as normal data structures with a bit of ECC or a parity check which is very fast to compute).
If you are interested in resilient data structures (which is a recent, but exciting, new field in algorithmics and redundancy engineering), I advise you to read the following documents:
Resilient algorithms data structures intro by Giuseppe F.Italiano, Universita di Roma "Tor Vergata"
Christiano, P., Demaine, E. D., & Kishore, S. (2011). Lossless fault-tolerant data structures with additive overhead. In Algorithms and Data Structures (pp. 243-254). Springer Berlin Heidelberg.
Ferraro-Petrillo, U., Grandoni, F., & Italiano, G. F. (2013). Data structures resilient to memory faults: an experimental study of dictionaries. Journal of Experimental Algorithmics (JEA), 18, 1-6.
Italiano, G. F. (2010). Resilient algorithms and data structures. In Algorithms and Complexity (pp. 13-24). Springer Berlin Heidelberg.
If you are interested in knowing more about the field of resilient data structures, you can checkout the works of Giuseppe F. Italiano (and work your way through the refs) and the Faulty-RAM model (introduced in Finocchi et al. 2005; Finocchi and Italiano 2008).
/EDIT: I illustrated the prevention/recovery from soft-errors mainly for RAM memory and data storage, but I didn't talk about computation (CPU) errors. Other answers already pointed at using atomic transactions like in databases, so I will propose another, simpler scheme: redundancy and majority vote.
The idea is that you simply do x times the same computation for each computation you need to do, and store the result in x different variables (with x >= 3). You can then compare your x variables:
if they all agree, then there's no computation error at all.
if they disagree, then you can use a majority vote to get the correct value, and since this means the computation was partially corrupted, you can also trigger a system/program state scan to check that the rest is ok.
if the majority vote cannot determine a winner (all x values are different), then it's a perfect signal for you to trigger the failsafe procedure (reboot, raise an alert to user, etc.).
This redundancy scheme is very fast compared to ECC (practically O(1)) and it provides you with a clear signal when you need to failsafe. The majority vote is also (almost) guaranteed to never produce corrupted output and also to recover from minor computation errors, because the probability that x computations give the same output is infinitesimal (because there is a huge amount of possible outputs, it's almost impossible to randomly get 3 times the same, even less chances if x > 3).
So with majority vote you are safe from corrupted output, and with redundancy x == 3, you can recover 1 error (with x == 4 it will be 2 errors recoverable, etc. -- the exact equation is nb_error_recoverable == (x-2) where x is the number of calculation repetitions because you need at least 2 agreeing calculations to recover using the majority vote).
The drawback is that you need to compute x times instead of once, so you have an additional computation cost, but's linear complexity so asymptotically you don't lose much for the benefits you gain. A fast way to do a majority vote is to compute the mode on an array, but you can also use a median filter.
Also, if you want to make extra sure the calculations are conducted correctly, if you can make your own hardware you can construct your device with x CPUs, and wire the system so that calculations are automatically duplicated across the x CPUs with a majority vote done mechanically at the end (using AND/OR gates for example). This is often implemented in airplanes and mission-critical devices (see triple modular redundancy). This way, you would not have any computational overhead (since the additional calculations will be done in parallel), and you have another layer of protection from soft errors (since the calculation duplication and majority vote will be managed directly by the hardware and not by software -- which can more easily get corrupted since a program is simply bits stored in memory...).
One point no-one seems to have mentioned. You say you're developing in GCC and cross-compiling onto ARM. How do you know that you don't have code which makes assumptions about free RAM, integer size, pointer size, how long it takes to do a certain operation, how long the system will run for continuously, or various stuff like that? This is a very common problem.
The answer is usually automated unit testing. Write test harnesses which exercise the code on the development system, then run the same test harnesses on the target system. Look for differences!
Also check for errata on your embedded device. You may find there's something about "don't do this because it'll crash, so enable that compiler option and the compiler will work around it".
In short, your most likely source of crashes is bugs in your code. Until you've made pretty damn sure this isn't the case, don't worry (yet) about more esoteric failure modes.
You want 3+ slave machines with a master outside the radiation environment. All I/O passes through the master which contains a vote and/or retry mechanism. The slaves must have a hardware watchdog each and the call to bump them should be surrounded by CRCs or the like to reduce the probability of involuntary bumping. Bumping should be controlled by the master, so lost connection with master equals reboot within a few seconds.
One advantage of this solution is that you can use the same API to the master as to the slaves, so redundancy becomes a transparent feature.
Edit: From the comments I feel the need to clarify the "CRC idea." The possibilty of the slave bumping it's own watchdog is close to zero if you surround the bump with CRC or digest checks on random data from the master. That random data is only sent from master when the slave under scrutiny is aligned with the others. The random data and CRC/digest are immediately cleared after each bump. The master-slave bump frequency should be more than double the watchdog timeout. The data sent from the master is uniquely generated every time.
How about running many instances of your application. If crashes are due to random memory bit changes, chances are some of your app instances will make it through and produce accurate results. It's probably quite easy (for someone with statistical background) to calculate how many instances do you need given bit flop probability to achieve as tiny overall error as you wish.
What you ask is quite complex topic - not easily answerable. Other answers are ok, but they covered just a small part of all the things you need to do.
As seen in comments, it is not possible to fix hardware problems 100%, however it is possible with high probabily to reduce or catch them using various techniques.
If I was you, I would create the software of the highest Safety integrity level level (SIL-4). Get the IEC 61513 document (for the nuclear industry) and follow it.
Someone mentioned using slower chips to prevent ions from flipping bits as easily. In a similar fashion perhaps use a specialized cpu/ram that actually uses multiple bits to store a single bit. Thus providing a hardware fault tolerance because it would be very unlikely that all of the bits would get flipped. So 1 = 1111 but would need to get hit 4 times to actually flipped. (4 might be a bad number since if 2 bits get flipped its already ambiguous). So if you go with 8, you get 8 times less ram and some fraction slower access time but a much more reliable data representation. You could probably do this both on the software level with a specialized compiler(allocate x amount more space for everything) or language implementation (write wrappers for data structures that allocate things this way). Or specialized hardware that has the same logical structure but does this in the firmware.
Perhaps it would help to know does it mean for the hardware to be "designed for this environment". How does it correct and/or indicates the presence of SEU errors ?
At one space exploration related project, we had a custom MCU, which would raise an exception/interrupt on SEU errors, but with some delay, i.e. some cycles may pass/instructions be executed after the one insn which caused the SEU exception.
Particularly vulnerable was the data cache, so a handler would invalidate the offending cache line and restart the program. Only that, due to the imprecise nature of the exception, the sequence of insns headed by the exception raising insn may not be restartable.
We identified the hazardous (not restartable) sequences (like lw $3, 0x0($2), followed by an insn, which modifies $2 and is not data-dependent on $3), and I made modifications to GCC, so such sequences do not occur (e.g. as a last resort, separating the two insns by a nop).
Just something to consider ...
If your hardware fails then you can use mechanical storage to recover it. If your code base is small and have some physical space then you can use a mechanical data store.
There will be a surface of material which will not be affected by radiation. Multiple gears will be there. A mechanical reader will run on all the gears and will be flexible to move up and down. Down means it is 0 and up means it is 1. From 0 and 1 you can generate your code base.
Use a cyclic scheduler. This gives you the ability to add regular maintenance times to check the correctness of critical data. The problem most often encountered is corruption of the stack. If your software is cyclical you can reinitialize the stack between cycles. Do not reuse the stacks for interrupt calls, setup a separate stack of each important interrupt call.
Similar to the Watchdog concept is deadline timers. Start a hardware timer before calling a function. If the function does not return before the deadline timer interrupts then reload the stack and try again. If it still fails after 3/5 tries you need reload from ROM.
Split your software into parts and isolate these parts to use separate memory areas and execution times (Especially in a control environment). Example: signal acquisition, prepossessing data, main algorithm and result implementation/transmission. This means a failure in one part will not cause failures through the rest of the program. So while we are repairing the signal acquisition the rest of tasks continues on stale data.
Everything needs CRCs. If you execute out of RAM even your .text needs a CRC. Check the CRCs regularly if you using a cyclical scheduler. Some compilers (not GCC) can generate CRCs for each section and some processors have dedicated hardware to do CRC calculations, but I guess that would fall out side of the scope of your question. Checking CRCs also prompts the ECC controller on the memory to repair single bit errors before it becomes a problem.
Use watchdogs for bootup no just once operational. You need hardware help if your bootup ran into trouble.
Firstly, design your application around failure. Ensure that as part of normal flow operation, it expects to reset (depending on your application and the type of failure either soft or hard). This is hard to get perfect: critical operations that require some degree of transactionality may need to be checked and tweaked at an assembly level so that an interruption at a key point cannot result in inconsistent external commands.
Fail fast as soon as any unrecoverable memory corruption or control flow deviation is detected. Log failures if possible.
Secondly, where possible, correct corruption and continue. This means checksumming and fixing constant tables (and program code if you can) often; perhaps before each major operation or on a timed interrupt, and storing variables in structures that autocorrect (again before each major op or on a timed interrupt take a majority vote from 3 and correct if is a single deviation). Log corrections if possible.
Thirdly, test failure. Set up a repeatable test environment that flips bits in memory psuedo-randomly. This will allow you to replicate corruption situations and help design your application around them.
Given supercat's comments, the tendencies of modern compilers, and other things, I'd be tempted to go back to the ancient days and write the whole code in assembly and static memory allocations everywhere. For this kind of utter reliability I think assembly no longer incurs a large percentage difference of the cost.
Here are huge amount of replies, but I'll try to sum up my ideas about this.
Something crashes or does not work correctly could be result of your own mistakes - then it should be easily to fix when you locate the problem. But there is also possibility of hardware failures - and that's difficult if not impossible to fix in overall.
I would recommend first to try to catch the problematic situation by logging (stack, registers, function calls) - either by logging them somewhere into file, or transmitting them somehow directly ("oh no - I'm crashing").
Recovery from such error situation is either reboot (if software is still alive and kicking) or hardware reset (e.g. hw watchdogs). Easier to start from first one.
If problem is hardware related - then logging should help you to identify in which function call problem occurs and that can give you inside knowledge of what is not working and where.
Also if code is relatively complex - it makes sense to "divide and conquer" it - meaning you remove / disable some function calls where you suspect problem is - typically disabling half of code and enabling another half - you can get "does work" / "does not work" kind of decision after which you can focus into another half of code. (Where problem is)
If problem occurs after some time - then stack overflow can be suspected - then it's better to monitor stack point registers - if they constantly grows.
And if you manage to fully minimize your code until "hello world" kind of application - and it's still failing randomly - then hardware problems are expected - and there needs to be "hardware upgrade" - meaning invent such cpu / ram / ... -hardware combination which would tolerate radiation better.
Most important thing is probably how you get your logs back if machine fully stopped / resetted / does not work - probably first thing bootstap should do - is a head back home if problematic situation is entcovered.
If it's possible in your environment also to transmit a signal and receive response - you could try out to construct some sort of online remote debugging environment, but then you must have at least of communication media working and some processor/ some ram in working state. And by remote debugging I mean either GDB / gdb stub kind of approach or your own implementation of what you need to get back from your application (e.g. download log files, download call stack, download ram, restart)
I've really read a lot of great answers!
Here is my 2 cent: build a statistical model of the memory/register abnormality, by writing a software to check the memory or to perform frequent register comparisons. Further, create an emulator, in the style of a virtual machine where you can experiment with the issue. I guess if you vary junction size, clock frequency, vendor, casing, etc would observe a different behavior.
Even our desktop PC memory has a certain rate of failure, which however doesn't impair the day to day work.

Does "real-time" constraints prevent the use of a task scheduler?

By task scheduler I mean any implementation of worker threads pool which distribute work to the threads according to whatever algorithm they are designed with. (like Intel TBB)
I know that "real-time" constraints imply that work gets done in predictable time (I'm not talking about speed). So my guess is that using a task scheduler, which, as far as I know, can't guarantee that some task will be executed before a given time, makes the application impossible to use in these constraints.
Or am I missing something? Is there a way to have both? Maybe by forcing assumptions on the quantity of data that can be processed? Or maybe there are predictable task schedulers?
I'm talking about "hard" real time constraints, not soft real time (like video games).
To clarify:
It is known that there are features in C++ that are not possible to use in this kind of context: new, delete, throw, dynamic_cast. They are not predictable (you don't know how much time can be spent on one of these operations, it depends on too much parameters that are not even known before execution).
You can't really use them in real time contexts.
What I ask is, does task schedulers have the same unpredictability that would make them unusable in real-time applications?
Yes, it can be done, but no it's not trivial, and yes there are limits.
You can write the scheduler to guarantee (for example) that an interrupt handler, exception handler (etc.) is guaranteed to be invoked without a fixed period of time from when it occurs. You can guarantee that any given thread will (for example) get at least X milliseconds of CPU time out of any given second (or suitable fraction of a second).
To enforce the latter, you generally need admittance criteria -- an ability for the scheduler to say "sorry, but I can't schedule this as a real-time thread, because the CPU is already under too much load.
In other cases, all it does is guarantee that at least (say) 99% of CPU time will be given the real-time tasks (if any exist) and it's up to whomever designs the system on top of that to schedule few enough real-time tasks that this will ensure they all finish quickly enough.
I feel obliged to add that the "hardness" of real-time requirements is almost entirely orthogonal to the response speed needed. Rather, it's almost entirely about the seriousness of the consequences of being late.
Just for example, consider a nuclear power plant. For a lot of what happens, you're dealing with time periods on the order of minutes, or in some cases even hours. Filling a particular chamber with, say, half a million gallons of water just isn't going to happen in microseconds or milliseconds.
At the same time, the consequences of a later answer can be huge -- quite possibly causing not just a few deaths like hospital equipment could, but potentially hundreds or even thousands of deaths, hundreds of millions in damage, etc. As such, it's about as "hard" as real-time requirements get, even though the deadlines are unusually "loose" by most typical standards.
In the other direction, digital audio playback has much tighter limits. Delays or dropouts can be quite audible down to a fraction of a millisecond in some cases. At the same time, unless you're providing sound processing for a large concert (or something on that order) the consequences of a dropout will generally be a moment's minor annoyance on the part of a user.
Of course, it's also possible to combine the two -- for an obvious example, in high-frequency trading, deadlines may well be in the order of microseconds (or so) and the loss from missing a deadline could easily be millions or tens of millions of (dollars|euros|pounds|etc.)
The term real-time is quite flexible. "Hard real-time" tends to mean things where a few tens of microseconds make the difference between "works right" and "doesn't work right. Not all "real-time" systems require that sort of real-time-ness.
I once worked on a radio-base-station for mobile phones. One of the devices on the board had an interrupt that fired every 2-something milliseconds. For correct operation (not losing calls), we had to deal with the interrupt, that is, do the work inside the interrupt and write the hardware registers with the new values, within 100 microseconds - if we missed, there would be dropped calls. If the interrupt wasn't taken after 160 microseconds, the system would reboot. That is "hard real-time", especially as the processor was just running at a few tens of MHz.
If you produce a video-player, it requires real-time in the a few milliseconds range.
A "display stock prices" probably can be within the 100ms range.
For a webserver it is probably acceptable to respond within 1-2seconds without any big problems.
Also, there is a difference between "worst case worse than X means failure" (like the case above with 100 microseconds or dropped calls - that's bad if it happens more than once every few weeks - and even a few times a year is really something that should be fixed). This is called "Hard real-time".
But other systems, missing your deadline means "Oh, well, we have to do that over again" or "a frame of video flickered a bit", as long as it doesn't happen very often, it's probably OK. This is called "soft real-time".
A lot of modern hardware will make "hard real-time" (the 10s or 100 microsecond range) difficult, because the graphics processor will simply stop the processor from accessing memory, or if the processor gets hot, the stopclk pin is pulled for 100 microseconds...
Most modern OS's, such as Linux and Windows, aren't really meant to be "hard real-time". There are sections of code that does disable interrupt for longer than 100 microseconds in some parts of these OS's.
You can almost certainly get some good "soft real-time" (that is, missing the deadline isn't a failure, just a minor annoyance) out of a mainstream modern OS with modern hardware. It'll probably require either modifications to the OS or a dedicated real-time OS (and perhaps suitable special hardware) to make the system do hard real-time.
But only a few things in the world requires that sort of hard real-time. Often the hard real-time requirements are dealt with by hardware - for example, the next generation of radio-base-stations that I described above, had more clever hardware, so you just needed to give it the new values within the next 2-something milliseconds, and you didn't have the "mad rush to get it done in a few tens of microseconds". In a modern mobile phone, the GSM or UMTS protocol is largely dealt with by a dedicated DSP (digital signal processor).
A "hard real-time" requirement is where the system is really failing if a particular deadline (or set of deadlines) can't be met, even if the failure to meet deadlines happens only once. However, different systems have different systems have different sensitivity to the actual time that the deadline is at (as Jerry Coffin mentions). It is almost certainly possible to find cases where a commercially available general purpose OS is perfectly adequate in dealing with the real-time requirements of a hard real-time system. It is also absolutely sure that there are other cases where such hard real-time requirements are NOT possible to meet without a specialized system.
I would say that if you want sub-millisecond guarantees from the OS, then Desktop Windows or Linux are not for you. This is really down to the overall philosophy of the OS and scheduler design, and to build a hard real-time OS requires a lot of thought about locking and potential for one thread to block another thread, from running, etc.
I don't think there is ONE answer that applies to your question. Yes, you can certainly use thread-pools in a system that has hard real-time requirements. You probably can't do it on a sub-millisecond basis unless there is specific support for this in the OS. And you may need to have dedicated threads and processes to deal with the highest priority real-time behaviour, which is not part of the thread-pool itself.
Sorry if this isn't saying "Yes" or "No" to your answer, but I think you will need to do some research into the actual behaviour of the OS, and see what sort of guarantees it can give (worst case). You will also have to decide what is the worse case scenario, and what happens if you miss a deadline - are (lots of) people dying (plane falling out of the sky), or are some banker going to lose millions, is the green lights going to come on at the same time on two directions on a road crossing or is it some bad sound coming out of a speaker?
"Real time" doesn't just mean "fast", it means that the system can respond to meet deadlines in the real world. Those deadlines depend on what you're dealing with in the real world.
Whether or not a task finishes in a particular timeframe is a characteristic of the task, not the scheduler. The scheduler might decide which task gets resources, and if a task hasn't finished by a deadline it might be stopped or have its resource usage constrained so that other tasks can meet their deadlines.
So, the answer to your question is that you need to consider the workload, deadlines and the scheduler together, and construct your system to meet your requirements. There is no magic scheduler that will take arbitrary tasks and make them complete in a predictable time.
Update:
A task scheduler can be used in real-time systems if it provides the guarantees you need. As others have said, there are task schedulers that provide those guarantees.
On the comments: The issue is the upper bound on time taken.
You can use new and delete if you overload them to have the performance characteristics you are after; the problem isn't new and delete, it is dynamic memory allocation. There is no requirement that new and delete use a general-purpose dynamic allocator, you can use them to allocate out of a statically allocated pool sized appropriately for your workload with deterministic behaviour.
On dynamic_cast: I tend not to use it, but I don't think it's performance is non-deterministic to the extent that it should be banned in real-time code. This is an example of the same issue: understanding worst-case performance is important.

Processing instrument capture data

I have an instrument that produces a stream of data; my code accesses this data though a callback onDataAcquisitionEvent(const InstrumentOutput &data). The data processing algorithm is potentially much slower than the rate of data arrival, so I cannot hope to process every single piece of data (and I don't have to), but would like to process as many as possible. Thank of the instrument as an environmental sensor with the rate of data acquisition that I don't control. InstrumentOutput could for example be a class that contains three simultaneous pressure measurements in different locations.
I also need to keep some short history of data. Assume for example that I can reasonably hope to process a sample of data every 200ms or so. Most of the time I would be happy processing just a single last sample, but occasionally I would need to look at a couple of seconds worth of data that arrived prior to that latest sample, depending on whether abnormal readings are present in the last sample.
The other requirement is to get out of the onDataAcquisitionEvent() callback as soon as possible, to avoid data loss in the sensor.
Data acquisition library (third party) collects the instrument data on a separate thread.
I thought of the following design; have single producer/single consumer queue and push the data tokens into the synchronized queue in the onDataAcquisitionEvent() callback.
On the receiving end, there is a loop that pops the data from the queue. The loop will almost never sleep because of the high rate of data arrival. On each iteration, the following happens:
Pop all the available data from the queue,
The popped data is copied into a circular buffer (I used boost circular buffer), this way some history is always available,
Process the last element in the buffer (and potentially look at the prior ones),
Repeat the loop.
Questions:
Is this design sound, and what are the pitfalls? and
What could be a better design?
Edit: One problem I thought of is when the size of the circular buffer is not large enough to hold the needed history; currently I simply reallocate the circular buffer, doubling its size. I hope I would only need to do that once or twice.
I have a bit of experience with data acquisition, and I can tell you a lot of developers have problems with premature feature creep. Because it sounds easy to simply capture data from the instrument into a log, folks tend to add unessential components to the system before verifying that logging is actually robust. This is a big mistake.
The other requirement is to get out of the onDataAcquisitionEvent() callback as soon as possible, to avoid data loss in the sensor.
That's the only requirement until that part of the product is working 110% under all field conditions.
Most of the time I would be happy processing just a single last sample, but occasionally I would need to look at a couple of seconds worth of data that arrived prior to that latest sample, depending on whether abnormal readings are present in the last sample.
"Most of the time" doesn't matter. Code for the worst case, because onDataAcquisitionEvent() can't be spending its time thinking about contingencies.
It sounds like you're falling into the pitfall of designing it to work with the best data that might be available, and leaving open what might happen if it's not available or if providing the best data to the monitor is ultimately too expensive.
Decimate the data at the source. Specify how many samples will be needed for the abnormal case processing, and attempt to provide that many, at a constant sample rate, plus a margin of maybe 20%.
There should certainly be no loops that never sleep. A circular buffer is fine, but just populate it with whatever minimum you need, and analyze it only as frequently as necessary.
The quality of the system is determined by its stability and determinism, not trying to go an extra mile and provide as much as possible.
Your producer/consumer design is exactly the right design. In real-time systems we often also give different run-time priorities to the consuming threads, not sure this applies in your case.
Use a data structure that's basically a doubly-linked-list, so that if it grows you don't need to re-allocate everything, and you also have O(1) access to the samples you need.
If your memory isn't large enough to hold your several seconds worth of data (which it should -- one sample every 200ms? 5 samples per second.) then you need to see whether you can stand reading from auxiliary memory, but that's throughput and in your case has nothing to do with your design and requirement for "Getting out of the callback as soon as possible".
Consider an implementation of the queue that does not need locking (remember: single reader and single writer only!), so that your callback doesn't stall.
If your callback is really quick, consider disabling interrupts/giving it a high priority. May not be necessary if it can never block and has the right priority set.
Questions, (1) is this design sound, and what are the pitfalls, and (2) what could be a better design. Thanks.
Yes, it is sound. But for performance reasons, you should design the code so that it processes an array of input samples at each processing stage, instead of just a single sample each. This results in much more optimal code for current state of the art CPUs.
The length of such a an array (=a chunk of data) is either fixed (simpler code) or variable (flexible, but some processing may become more complicated).
As a second design choice, you probably should ignore the history at this architectural level, and relegate that feature...
Most of the time I would be happy processing just a single last sample, but occasionally I would need to look at a couple of seconds worth of data [...]
Maybe, tracking a history should be implemented in just that special part of the code, that occasionally requires access to it. Maybe, that should not be part of the "overall architecture". If so, it simplifies processing at all.

Predict C++ program running time

How to predict C++ program running time, if program executes different functions (working with database, reading files, parsing xml and others)? How installers do it?
They do not predict the time. They calculate the number of operations to be done on a total of operations.
You can predict the time by using measurement and estimation. Of course the quality of the predictions will differ. And BTW: The word "predict" is correct.
You split the workload into small tasks, and create an estimation rule for each task, e.g.: if copying files one to ten took 10s, then the remaining 90 files may take another 90s. Measure the time that these tasks take at runtime, and update your estimations.
Each new measurement will make the prediction a bit more precise.
There really is no way to do this in any sort of reliable way, since it depends on thousands of factors.
Progress bars typically measure this in one of two ways:
Overall progress - I have n number of bytes/files/whatever to transfer, and so far I have done m.
Overall work divided by current speed - I have n bytes to transfer, and so far I have done m and it took t seconds, so if things continue at this rate it will take u seconds to complete.
Short answer:
No you can't. For progress bars and such, most applications simply increase the bar length with a percentage based on the overall tasks done. Some psuedo-code:
for(int i=0; i<num_files_to_load; ++i){
files.push_back(File(filepath[i]));
SetProgressBarLength((float)i/((float)num_files_to_load) - 1.0f);
}
This is a very simplified example. Making a for-loop like this would surely block the window system's event/message queue. You would probably add a timed event or something similar instead.
Longer answer:
Given N known parameters, the problem finding whether a program completes at all is undecidable. This is called the Halting problem. You can however, find the time it takes to execute a single instruction. Some very old games actually depended on exact cycle timings, and failed to execute correctly on newer computers due to race conditions that occur because of subtle differences in runtime. Also, on architectures with data and instruction caches, the cycles the instructions consume is not constant anymore. So cache makes cycle-counting unpredictable.
Raymond Chen discussed this issue in his blog.
Why does the copy dialog give such
horrible estimates?
Because the copy dialog is just
guessing. It can't predict the future,
but it is forced to try. And at the
very beginning of the copy, when there
is very little history to go by, the
prediction can be really bad.
In general it is impossible to predict the running time of a program. It is even impossible to predict whether a program will even halt at all. This is undecidable.
http://en.wikipedia.org/wiki/Halting_problem
As others have said, you can't predict the time. Approaches suggested by Partial and rmn are valid solutions.
What you can do more is assign weights to certain operations (for instance, if you know a db call takes roughly twice as long as some processing step, you can adjust accordingly).
A cool installer compiler would execute a faux install, time each op, then save this to disk for the future.
I used such a technique for a 3D application once, which had a pretty dead-on progress bar for loading and mashing data, after you've run it a few times. It wasn't that hard, and it made development much nicer. (Since we had to see that bar 10-15 times/day, startup was 10-20 secs)
You can't predict it entirely.
What you can do is wait until a fraction of the work is done, say 1%, and estimate the remaining time by that - just time how long it takes for 1% and multiply by 100, for example. That is easily done if you can enumerate all that you have to do in advance, or have some kind of a loop going on..
As I mentioned in a previous answer, it is impossible in general to predict the running time.
However, empirically it may be possible to predict with good accuracy.
Typically all of these programs are approximatelyh linear in some input.
But if you wanted a more sophisticated approach, you could define a large number of features (database size, file size, OS, etc. etc.) and input those feature values + running time into a neural network. If you had millions of examples (obviously you would have an automated method for gathering data, e.g. some discovery programs) you might come up with a very flexible and intelligent prediction algorithm.
Of course this would only be worth doing for fun, as I'm sure the value to your company over some crude guessing algorithm will probably be nil :)
You should make estimation of time needed for different phases of the program. For example: reading files - 50, working with database - 30, working with network - 20. In ideal it would be good if you make some progress callback during all of those phases, but it requires coding the progress calculation into the iterations of algorithm.

BerkeleyDB Concurrency

What's the optimal level of concurrency that the C++ implementation of BerkeleyDB can reasonably support?
How many threads can I have hammering away at the DB before throughput starts to suffer because of resource contention?
I've read the manual and know how to set the number of locks, lockers, database page size, etc. but I'd just like some advice from someone who has real-world experience with BDB concurrency.
My application is pretty simple, I'll be doing gets and puts of records that are about 1KB each. No cursors, no deleting.
It depends on what kind of application you are building. Create a representative test scenario, and start hammering away. Then you will know the definitive answer.
Besides your use case, it also depends on CPU, memory, front-side bus, operating system, cache settings, etcetera.
Seriously, just test your own scenario.
If you need some numbers (that actually may mean nothing in your scenario):
Oracle Berkeley DB:
Performance Metrics and
Benchmarks
Performance Metrics
& Benchmarks:
Berkeley DB
I strongly agree with Daan's point: create a test program, and make sure the way in which it accesses data mimics as closely as possible the patterns you expect your application to have. This is extremely important with BDB because different access patterns yield very different throughput.
Other than that, these are general factors I found to be of major impact on throughput:
Access method (which in your case i guess is BTREE).
Level of persistency with which you configured DBD (for example, in my case the 'DB_TXN_WRITE_NOSYNC' environment flag improved write performance by an order of magnitude, but it compromises persistency)
Does the working set fit in cache?
Number of Reads Vs. Writes.
How spread out your access is (remember that BTREE has a page level locking - so accessing different pages with different threads is a big advantage).
Access pattern - meanig how likely are threads to lock one another, or even deadlock, and what is your deadlock resolution policy (this one may be a killer).
Hardware (disk & memory for cache).
This amounts to the following point:
Scaling a solution based on DBD so that it offers greater concurrency has two key ways of going about it; either minimize the number of locks in your design or add more hardware.
Doesn't this depend on the hardware as well as number of threads and stuff?
I would make a simple test and run it with increasing amounts of threads hammering and see what seems best.
What I did when working against a database of unknown performance was to measure turnaround time on my queries. I kept upping the thread count until turn-around time dropped, and dropping the thread count until turn-around time improved (well, it was processes in my environment, but whatever).
There were moving averages and all sorts of metrics involved, but the take-away lesson was: just adapt to how things are working at the moment. You never know when the DBAs will improve performance or hardware will be upgraded, or perhaps another process will come along to load down the system while you're running. So adapt.
Oh, and another thing: avoid process switches if you can - batch things up.
Oh, I should make this clear: this all happened at run time, not during development.
The way I understand things, Samba created tdb to allow "multiple concurrent writers" for any particular database file. So if your workload has multiple writers your performance may be bad (as in, the Samba project chose to write its own system, apparently because it wasn't happy with Berkeley DB's performance in this case).
On the other hand, if your workload has lots of readers, then the question is how well your operating system handles multiple readers.