SystemVerilog register design race avoidance

SystemVerilog register design race avoidance - scheduling

While doing digital design in systemverilog, I ran into an issue regarding racing conditions.
The test-bench (which I cannot modify) that drives my design, drives the inputs in such a way that certain registers in the design do not function properly due to a race condition.
Here is an eda-playground example which illustrates what is happening (input changes "before" the clock does, at time 15ns):
http://www.edaplayground.com/x/rWJ
Is there a way to make the design (a simple register in this case) resistant to this particular problem? What I need is a statement like "out_data <= preponed(in_data);" or something similar which will make the order of input signal changes irrelevant.
I have read about #1step in the SystemVerilog LRM but I am unsure how to use it, nor if it can help me with this particular problem at all.

Your test bench is essentially creating stimulus that is in a setup violation to your register. You have two options:
Convince the testbench writer of their erroneous ways and get them to fix it.
Insert a layer of hierarchy around the design that delays the clock to eliminate the race.

Related

Compiling an application for use in highly radioactive environments

We are compiling an embedded C++ application that is deployed in a shielded device in an environment bombarded with ionizing radiation. We are using GCC and cross-compiling for ARM. When deployed, our application generates some erroneous data and crashes more often than we would like. The hardware is designed for this environment, and our application has run on this platform for several years.
Are there changes we can make to our code, or compile-time improvements that can be made to identify/correct soft errors and memory-corruption caused by single event upsets? Have any other developers had success in reducing the harmful effects of soft errors on a long-running application?

Working for about 4-5 years with software/firmware development and environment testing of miniaturized satellites*, I would like to share my experience here.
*(miniaturized satellites are a lot more prone to single event upsets than bigger satellites due to its relatively small, limited sizes for its electronic components)
To be very concise and direct: there is no mechanism to recover from detectable, erroneous
situation by the software/firmware itself without, at least, one
copy of minimum working version of the software/firmware somewhere for recovery purpose - and with the hardware supporting the recovery (functional).
Now, this situation is normally handled both in the hardware and software level. Here, as you request, I will share what we can do in the software level.
...recovery purpose.... Provide ability to update/recompile/reflash your software/firmware in real environment. This is an almost must-have feature for any software/firmware in highly ionized environment. Without this, you could have redundant software/hardware as many as you want but at one point, they are all going to blow up. So, prepare this feature!
...minimum working version... Have responsive, multiple copies, minimum version of the software/firmware in your code. This is like Safe mode in Windows. Instead of having only one, fully functional version of your software, have multiple copies of the minimum version of your software/firmware. The minimum copy will usually having much less size than the full copy and almost always have only the following two or three features:
capable of listening to command from external system,
capable of updating the current software/firmware,
capable of monitoring the basic operation's housekeeping data.
...copy... somewhere... Have redundant software/firmware somewhere.
You could, with or without redundant hardware, try to have redundant software/firmware in your ARM uC. This is normally done by having two or more identical software/firmware in separate addresses which sending heartbeat to each other - but only one will be active at a time. If one or more software/firmware is known to be unresponsive, switch to the other software/firmware. The benefit of using this approach is we can have functional replacement immediately after an error occurs - without any contact with whatever external system/party who is responsible to detect and to repair the error (in satellite case, it is usually the Mission Control Centre (MCC)).
Strictly speaking, without redundant hardware, the disadvantage of doing this is you actually cannot eliminate all single point of failures. At the very least, you will still have one single point of failure, which is the switch itself (or often the beginning of the code). Nevertheless, for a device limited by size in a highly ionized environment (such as pico/femto satellites), the reduction of the single point of failures to one point without additional hardware will still be worth considering. Somemore, the piece of code for the switching would certainly be much less than the code for the whole program - significantly reducing the risk of getting Single Event in it.
But if you are not doing this, you should have at least one copy in your external system which can come in contact with the device and update the software/firmware (in the satellite case, it is again the mission control centre).
You could also have the copy in your permanent memory storage in your device which can be triggered to restore the running system's software/firmware
...detectable erroneous situation.. The error must be detectable, usually by the hardware error correction/detection circuit or by a small piece of code for error correction/detection. It is best to put such code small, multiple, and independent from the main software/firmware. Its main task is only for checking/correcting. If the hardware circuit/firmware is reliable (such as it is more radiation hardened than the rests - or having multiple circuits/logics), then you might consider making error-correction with it. But if it is not, it is better to make it as error-detection. The correction can be by external system/device. For the error correction, you could consider making use of a basic error correction algorithm like Hamming/Golay23, because they can be implemented more easily both in the circuit/software. But it ultimately depends on your team's capability. For error detection, normally CRC is used.
...hardware supporting the recovery Now, comes to the most difficult aspect on this issue. Ultimately, the recovery requires the hardware which is responsible for the recovery to be at least functional. If the hardware is permanently broken (normally happen after its Total ionizing dose reaches certain level), then there is (sadly) no way for the software to help in recovery. Thus, hardware is rightly the utmost importance concern for a device exposed to high radiation level (such as satellite).
In addition to the suggestion for above anticipating firmware's error due to single event upset, I would also like to suggest you to have:
Error detection and/or error correction algorithm in the inter-subsystem communication protocol. This is another almost must have in order to avoid incomplete/wrong signals received from other system
Filter in your ADC reading. Do not use the ADC reading directly. Filter it by median filter, mean filter, or any other filters - never trust single reading value. Sample more, not less - reasonably.

NASA has a paper on radiation-hardened software. It describes three main tasks:
Regular monitoring of memory for errors then scrubbing out those errors,
robust error recovery mechanisms, and
the ability to reconfigure if something no longer works.
Note that the memory scan rate should be frequent enough that multi-bit errors rarely occur, as most ECC memory can recover from single-bit errors, not multi-bit errors.
Robust error recovery includes control flow transfer (typically restarting a process at a point before the error), resource release, and data restoration.
Their main recommendation for data restoration is to avoid the need for it, through having intermediate data be treated as temporary, so that restarting before the error also rolls back the data to a reliable state. This sounds similar to the concept of "transactions" in databases.
They discuss techniques particularly suitable for object-oriented languages such as C++. For example
Software-based ECCs for contiguous memory objects
Programming by Contract: verifying preconditions and postconditions, then checking the object to verify it is still in a valid state.
And, it just so happens, NASA has used C++ for major projects such as the Mars Rover.
C++ class abstraction and encapsulation enabled rapid development and testing among multiple projects and developers.
They avoided certain C++ features that could create problems:
Exceptions
Templates
Iostream (no console)
Multiple inheritance
Operator overloading (other than new and delete)
Dynamic allocation (used a dedicated memory pool and placement new to avoid the possibility of system heap corruption).

Here are some thoughts and ideas:
Use ROM more creatively.
Store anything you can in ROM. Instead of calculating things, store look-up tables in ROM. (Make sure your compiler is outputting your look-up tables to the read-only section! Print out memory addresses at runtime to check!) Store your interrupt vector table in ROM. Of course, run some tests to see how reliable your ROM is compared to your RAM.
Use your best RAM for the stack.
SEUs in the stack are probably the most likely source of crashes, because it is where things like index variables, status variables, return addresses, and pointers of various sorts typically live.
Implement timer-tick and watchdog timer routines.
You can run a "sanity check" routine every timer tick, as well as a watchdog routine to handle the system locking up. Your main code could also periodically increment a counter to indicate progress, and the sanity-check routine could ensure this has occurred.
Implement error-correcting-codes in software.
You can add redundancy to your data to be able to detect and/or correct errors. This will add processing time, potentially leaving the processor exposed to radiation for a longer time, thus increasing the chance of errors, so you must consider the trade-off.
Remember the caches.
Check the sizes of your CPU caches. Data that you have accessed or modified recently will probably be within a cache. I believe you can disable at least some of the caches (at a big performance cost); you should try this to see how susceptible the caches are to SEUs. If the caches are hardier than RAM then you could regularly read and re-write critical data to make sure it stays in cache and bring RAM back into line.
Use page-fault handlers cleverly.
If you mark a memory page as not-present, the CPU will issue a page fault when you try to access it. You can create a page-fault handler that does some checking before servicing the read request. (PC operating systems use this to transparently load pages that have been swapped to disk.)
Use assembly language for critical things (which could be everything).
With assembly language, you know what is in registers and what is in RAM; you know what special RAM tables the CPU is using, and you can design things in a roundabout way to keep your risk down.
Use objdump to actually look at the generated assembly language, and work out how much code each of your routines takes up.
If you are using a big OS like Linux then you are asking for trouble; there is just so much complexity and so many things to go wrong.
Remember it is a game of probabilities.
A commenter said
Every routine you write to catch errors will be subject to failing itself from the same cause.
While this is true, the chances of errors in the (say) 100 bytes of code and data required for a check routine to function correctly is much smaller than the chance of errors elsewhere. If your ROM is pretty reliable and almost all the code/data is actually in ROM then your odds are even better.
Use redundant hardware.
Use 2 or more identical hardware setups with identical code. If the results differ, a reset should be triggered. With 3 or more devices you can use a "voting" system to try to identify which one has been compromised.

You may also be interested in the rich literature on the subject of algorithmic fault tolerance. This includes the old assignment: Write a sort that correctly sorts its input when a constant number of comparisons will fail (or, the slightly more evil version, when the asymptotic number of failed comparisons scales as log(n) for n comparisons).
A place to start reading is Huang and Abraham's 1984 paper "Algorithm-Based Fault Tolerance for Matrix Operations". Their idea is vaguely similar to homomorphic encrypted computation (but it is not really the same, since they are attempting error detection/correction at the operation level).
A more recent descendant of that paper is Bosilca, Delmas, Dongarra, and Langou's "Algorithm-based fault tolerance applied to high performance computing".

Writing code for radioactive environments is not really any different than writing code for any mission-critical application.
In addition to what has already been mentioned, here are some miscellaneous tips:
Use everyday "bread & butter" safety measures that should be present on any semi-professional embedded system: internal watchdog, internal low-voltage detect, internal clock monitor. These things shouldn't even need to be mentioned in the year 2016 and they are standard on pretty much every modern microcontroller.
If you have a safety and/or automotive-oriented MCU, it will have certain watchdog features, such as a given time window, inside which you need to refresh the watchdog. This is preferred if you have a mission-critical real-time system.
In general, use a MCU suitable for these kind of systems, and not some generic mainstream fluff you received in a packet of corn flakes. Almost every MCU manufacturer nowadays have specialized MCUs designed for safety applications (TI, Freescale, Renesas, ST, Infineon etc etc). These have lots of built-in safety features, including lock-step cores: meaning that there are 2 CPU cores executing the same code, and they must agree with each other.
IMPORTANT: You must ensure the integrity of internal MCU registers. All control & status registers of hardware peripherals that are writeable may be located in RAM memory, and are therefore vulnerable.
To protect yourself against register corruptions, preferably pick a microcontroller with built-in "write-once" features of registers. In addition, you need to store default values of all hardware registers in NVM and copy-down those values to your registers at regular intervals. You can ensure the integrity of important variables in the same manner.
Note: always use defensive programming. Meaning that you have to setup all registers in the MCU and not just the ones used by the application. You don't want some random hardware peripheral to suddenly wake up.
There are all kinds of methods to check for errors in RAM or NVM: checksums, "walking patterns", software ECC etc etc. The best solution nowadays is to not use any of these, but to use a MCU with built-in ECC and similar checks. Because doing this in software is complex, and the error check in itself could therefore introduce bugs and unexpected problems.
Use redundancy. You could store both volatile and non-volatile memory in two identical "mirror" segments, that must always be equivalent. Each segment could have a CRC checksum attached.
Avoid using external memories outside the MCU.
Implement a default interrupt service routine / default exception handler for all possible interrupts/exceptions. Even the ones you are not using. The default routine should do nothing except shutting off its own interrupt source.
Understand and embrace the concept of defensive programming. This means that your program needs to handle all possible cases, even those that cannot occur in theory. Examples.
High quality mission-critical firmware detects as many errors as possible, and then handles or ignores them in a safe manner.
Never write programs that rely on poorly-specified behavior. It is likely that such behavior might change drastically with unexpected hardware changes caused by radiation or EMI. The best way to ensure that your program is free from such crap is to use a coding standard like MISRA, together with a static analyser tool. This will also help with defensive programming and with weeding out bugs (why would you not want to detect bugs in any kind of application?).
IMPORTANT: Don't implement any reliance of the default values of static storage duration variables. That is, don't trust the default contents of the .data or .bss. There could be any amount of time between the point of initialization to the point where the variable is actually used, there could have been plenty of time for the RAM to get corrupted. Instead, write the program so that all such variables are set from NVM in run-time, just before the time when such a variable is used for the first time.
In practice this means that if a variable is declared at file scope or as static, you should never use = to initialize it (or you could, but it is pointless, because you cannot rely on the value anyhow). Always set it in run-time, just before use. If it is possible to repeatedly update such variables from NVM, then do so.
Similarly in C++, don't rely on constructors for static storage duration variables. Have the constructor(s) call a public "set-up" routine, which you can also call later on in run-time, straight from the caller application.
If possible, remove the "copy-down" start-up code that initializes .data and .bss (and calls C++ constructors) entirely, so that you get linker errors if you write code relying on such. Many compilers have the option to skip this, usually called "minimal/fast start-up" or similar.
This means that any external libraries have to be checked so that they don't contain any such reliance.
Implement and define a safe state for the program, to where you will revert in case of critical errors.
Implementing an error report/error log system is always helpful.

It may be possible to use C to write programs that behave robustly in such environments, but only if most forms of compiler optimization are disabled. Optimizing compilers are designed to replace many seemingly-redundant coding patterns with "more efficient" ones, and may have no clue that the reason the programmer is testing x==42 when the compiler knows there's no way x could possibly hold anything else is because the programmer wants to prevent the execution of certain code with x holding some other value--even in cases where the only way it could hold that value would be if the system received some kind of electrical glitch.
Declaring variables as volatile is often helpful, but may not be a panacea.
Of particular importance, note that safe coding often requires that dangerous
operations have hardware interlocks that require multiple steps to activate,
and that code be written using the pattern:
... code that checks system state
if (system_state_favors_activation)
{
prepare_for_activation();
... code that checks system state again
if (system_state_is_valid)
{
if (system_state_favors_activation)
trigger_activation();
}
else
perform_safety_shutdown_and_restart();
}
cancel_preparations();
If a compiler translates the code in relatively literal fashion, and if all
the checks for system state are repeated after the prepare_for_activation(),
the system may be robust against almost any plausible single glitch event,
even those which would arbitrarily corrupt the program counter and stack. If
a glitch occurs just after a call to prepare_for_activation(), that would imply
that activation would have been appropriate (since there's no other reason
prepare_for_activation() would have been called before the glitch). If the
glitch causes code to reach prepare_for_activation() inappropriately, but there
are no subsequent glitch events, there would be no way for code to subsequently
reach trigger_activation() without having passed through the validation check or calling cancel_preparations first [if the stack glitches, execution might proceed to a spot just before trigger_activation() after the context that called prepare_for_activation() returns, but the call to cancel_preparations() would have occurred between the calls to prepare_for_activation() and trigger_activation(), thus rendering the latter call harmless.
Such code may be safe in traditional C, but not with modern C compilers. Such compilers can be very dangerous in that sort of environment because aggressive they strive to only include code which will be relevant in situations that could come about via some well-defined mechanism and whose resulting consequences would also be well defined. Code whose purpose would be to detect and clean up after failures may, in some cases, end up making things worse. If the compiler determines that the attempted recovery would in some cases invoke undefined behavior, it may infer that the conditions that would necessitate such recovery in such cases cannot possibly occur, thus eliminating the code that would have checked for them.

This is an extremely broad subject. Basically, you can't really recover from memory corruption, but you can at least try to fail promptly. Here are a few techniques you could use:
checksum constant data. If you have any configuration data which stays constant for a long time (including hardware registers you have configured), compute its checksum on initialization and verify it periodically. When you see a mismatch, it's time to re-initialize or reset.
store variables with redundancy. If you have an important variable x, write its value in x1, x2 and x3 and read it as (x1 == x2) ? x2 : x3.
implement program flow monitoring. XOR a global flag with a unique value in important functions/branches called from the main loop. Running the program in a radiation-free environment with near-100% test coverage should give you the list of acceptable values of the flag at the end of the cycle. Reset if you see deviations.
monitor the stack pointer. In the beginning of the main loop, compare the stack pointer with its expected value. Reset on deviation.

What could help you is a watchdog. Watchdogs were used extensively in industrial computing in the 1980s. Hardware failures were much more common then - another answer also refers to that period.
A watchdog is a combined hardware/software feature. The hardware is a simple counter that counts down from a number (say 1023) to zero. TTL or other logic could be used.
The software has been designed as such that one routine monitors the correct operation of all essential systems. If this routine completes correctly = finds the computer running fine, it sets the counter back to 1023.
The overall design is so that under normal circumstances, the software prevents that the hardware counter will reach zero. In case the counter reaches zero, the hardware of the counter performs its one-and-only task and resets the entire system. From a counter perspective, zero equals 1024 and the counter continues counting down again.
This watchdog ensures that the attached computer is restarted in a many, many cases of failure. I must admit that I'm not familiar with hardware that is able to perform such a function on today's computers. Interfaces to external hardware are now a lot more complex than they used to be.
An inherent disadvantage of the watchdog is that the system is not available from the time it fails until the watchdog counter reaches zero + reboot time. While that time is generally much shorter than any external or human intervention, the supported equipment will need to be able to proceed without computer control for that timeframe.

This answer assumes you are concerned with having a system that works correctly, over and above having a system that is minimum cost or fast; most people playing with radioactive things value correctness / safety over speed / cost
Several people have suggested hardware changes you can make (fine - there's lots of good stuff here in answers already and I don't intend repeating all of it), and others have suggested redundancy (great in principle), but I don't think anyone has suggested how that redundancy might work in practice. How do you fail over? How do you know when something has 'gone wrong'? Many technologies work on the basis everything will work, and failure is thus a tricky thing to deal with. However, some distributed computing technologies designed for scale expect failure (after all with enough scale, failure of one node of many is inevitable with any MTBF for a single node); you can harness this for your environment.
Here are some ideas:
Ensure that your entire hardware is replicated n times (where n is greater than 2, and preferably odd), and that each hardware element can communicate with each other hardware element. Ethernet is one obvious way to do that, but there are many other far simpler routes that would give better protection (e.g. CAN). Minimise common components (even power supplies). This may mean sampling ADC inputs in multiple places for instance.
Ensure your application state is in a single place, e.g. in a finite state machine. This can be entirely RAM based, though does not preclude stable storage. It will thus be stored in several place.
Adopt a quorum protocol for changes of state. See RAFT for example. As you are working in C++, there are well known libraries for this. Changes to the FSM would only get made when a majority of nodes agree. Use a known good library for the protocol stack and the quorum protocol rather than rolling one yourself, or all your good work on redundancy will be wasted when the quorum protocol hangs up.
Ensure you checksum (e.g. CRC/SHA) your FSM, and store the CRC/SHA in the FSM itself (as well as transmitting in the message, and checksumming the messages themselves). Get the nodes to check their FSM regularly against these checksum, checksum incoming messages, and check their checksum matches the checksum of the quorum.
Build as many other internal checks into your system as possible, making nodes that detect their own failure reboot (this is better than carrying on half working provided you have enough nodes). Attempt to let them cleanly remove themselves from the quorum during rebooting in case they don't come up again. On reboot have them checksum the software image (and anything else they load) and do a full RAM test before reintroducing themselves to the quorum.
Use hardware to support you, but do so carefully. You can get ECC RAM, for instance, and regularly read/write through it to correct ECC errors (and panic if the error is uncorrectable). However (from memory) static RAM is far more tolerant of ionizing radiation than DRAM is in the first place, so it may be better to use static DRAM instead. See the first point under 'things I would not do' as well.
Let's say you have an 1% chance of failure of any given node within one day, and let's pretend you can make failures entirely independent. With 5 nodes, you'll need three to fail within one day, which is a .00001% chance. With more, well, you get the idea.
Things I would not do:
Underestimate the value of not having the problem to start off with. Unless weight is a concern, a large block of metal around your device is going to be a far cheaper and more reliable solution than a team of programmers can come up with. Ditto optical coupling of inputs of EMI is an issue, etc. Whatever, attempt when sourcing your components to source those rated best against ionizing radiation.
Roll your own algorithms. People have done this stuff before. Use their work. Fault tolerance and distributed algorithms are hard. Use other people's work where possible.
Use complicated compiler settings in the naive hope you detect more failures. If you are lucky, you may detect more failures. More likely, you will use a code-path within the compiler which has been less tested, particularly if you rolled it yourself.
Use techniques which are untested in your environment. Most people writing high availability software have to simulate failure modes to check their HA works correctly, and miss many failure modes as a result. You are in the 'fortunate' position of having frequent failures on demand. So test each technique, and ensure its application actual improves MTBF by an amount that exceeds the complexity to introduce it (with complexity comes bugs). Especially apply this to my advice re quorum algorithms etc.

Since you specifically ask for software solutions, and you are using C++, why not use operator overloading to make your own, safe datatypes? For example:
Instead of using uint32_t (and double, int64_t etc), make your own SAFE_uint32_t which contains a multiple (minimum of 3) of uint32_t. Overload all of the operations you want (* + - / << >> = == != etc) to perform, and make the overloaded operations perform independently on each internal value, ie don't do it once and copy the result. Both before and after, check that all of the internal values match. If values don't match, you can update the wrong one to the value with the most common one. If there is no most-common value, you can safely notify that there is an error.
This way it doesn't matter if corruption occurs in the ALU, registers, RAM, or on a bus, you will still have multiple attempts and a very good chance of catching errors. Note however though that this only works for the variables you can replace - your stack pointer for example will still be susceptible.
A side story: I ran into a similar issue, also on an old ARM chip. It turned out to be a toolchain which used an old version of GCC that, together with the specific chip we used, triggered a bug in certain edge cases that would (sometimes) corrupt values being passed into functions. Make sure your device doesn't have any problems before blaming it on radio-activity, and yes, sometimes it is a compiler bug =)

Disclaimer: I'm not a radioactivity professional nor worked for this kind of application. But I worked on soft errors and redundancy for long term archival of critical data, which is somewhat linked (same problem, different goals).
The main problem with radioactivity in my opinion is that radioactivity can switch bits, thus radioactivity can/will tamper any digital memory. These errors are usually called soft errors, bit rot, etc.
The question is then: how to compute reliably when your memory is unreliable?
To significantly reduce the rate of soft errors (at the expense of computational overhead since it will mostly be software-based solutions), you can either:
rely on the good old redundancy scheme, and more specifically the more efficient error correcting codes (same purpose, but cleverer algorithms so that you can recover more bits with less redundancy). This is sometimes (wrongly) also called checksumming. With this kind of solution, you will have to store the full state of your program at any moment in a master variable/class (or a struct?), compute an ECC, and check that the ECC is correct before doing anything, and if not, repair the fields. This solution however does not guarantee that your software can work (simply that it will work correctly when it can, or stops working if not, because ECC can tell you if something is wrong, and in this case you can stop your software so that you don't get fake results).
or you can use resilient algorithmic data structures, which guarantee, up to a some bound, that your program will still give correct results even in the presence of soft errors. These algorithms can be seen as a mix of common algorithmic structures with ECC schemes natively mixed in, but this is much more resilient than that, because the resiliency scheme is tightly bounded to the structure, so that you don't need to encode additional procedures to check the ECC, and usually they are a lot faster. These structures provide a way to ensure that your program will work under any condition, up to the theoretical bound of soft errors. You can also mix these resilient structures with the redundancy/ECC scheme for additional security (or encode your most important data structures as resilient, and the rest, the expendable data that you can recompute from the main data structures, as normal data structures with a bit of ECC or a parity check which is very fast to compute).
If you are interested in resilient data structures (which is a recent, but exciting, new field in algorithmics and redundancy engineering), I advise you to read the following documents:
Resilient algorithms data structures intro by Giuseppe F.Italiano, Universita di Roma "Tor Vergata"
Christiano, P., Demaine, E. D., & Kishore, S. (2011). Lossless fault-tolerant data structures with additive overhead. In Algorithms and Data Structures (pp. 243-254). Springer Berlin Heidelberg.
Ferraro-Petrillo, U., Grandoni, F., & Italiano, G. F. (2013). Data structures resilient to memory faults: an experimental study of dictionaries. Journal of Experimental Algorithmics (JEA), 18, 1-6.
Italiano, G. F. (2010). Resilient algorithms and data structures. In Algorithms and Complexity (pp. 13-24). Springer Berlin Heidelberg.
If you are interested in knowing more about the field of resilient data structures, you can checkout the works of Giuseppe F. Italiano (and work your way through the refs) and the Faulty-RAM model (introduced in Finocchi et al. 2005; Finocchi and Italiano 2008).
/EDIT: I illustrated the prevention/recovery from soft-errors mainly for RAM memory and data storage, but I didn't talk about computation (CPU) errors. Other answers already pointed at using atomic transactions like in databases, so I will propose another, simpler scheme: redundancy and majority vote.
The idea is that you simply do x times the same computation for each computation you need to do, and store the result in x different variables (with x >= 3). You can then compare your x variables:
if they all agree, then there's no computation error at all.
if they disagree, then you can use a majority vote to get the correct value, and since this means the computation was partially corrupted, you can also trigger a system/program state scan to check that the rest is ok.
if the majority vote cannot determine a winner (all x values are different), then it's a perfect signal for you to trigger the failsafe procedure (reboot, raise an alert to user, etc.).
This redundancy scheme is very fast compared to ECC (practically O(1)) and it provides you with a clear signal when you need to failsafe. The majority vote is also (almost) guaranteed to never produce corrupted output and also to recover from minor computation errors, because the probability that x computations give the same output is infinitesimal (because there is a huge amount of possible outputs, it's almost impossible to randomly get 3 times the same, even less chances if x > 3).
So with majority vote you are safe from corrupted output, and with redundancy x == 3, you can recover 1 error (with x == 4 it will be 2 errors recoverable, etc. -- the exact equation is nb_error_recoverable == (x-2) where x is the number of calculation repetitions because you need at least 2 agreeing calculations to recover using the majority vote).
The drawback is that you need to compute x times instead of once, so you have an additional computation cost, but's linear complexity so asymptotically you don't lose much for the benefits you gain. A fast way to do a majority vote is to compute the mode on an array, but you can also use a median filter.
Also, if you want to make extra sure the calculations are conducted correctly, if you can make your own hardware you can construct your device with x CPUs, and wire the system so that calculations are automatically duplicated across the x CPUs with a majority vote done mechanically at the end (using AND/OR gates for example). This is often implemented in airplanes and mission-critical devices (see triple modular redundancy). This way, you would not have any computational overhead (since the additional calculations will be done in parallel), and you have another layer of protection from soft errors (since the calculation duplication and majority vote will be managed directly by the hardware and not by software -- which can more easily get corrupted since a program is simply bits stored in memory...).

One point no-one seems to have mentioned. You say you're developing in GCC and cross-compiling onto ARM. How do you know that you don't have code which makes assumptions about free RAM, integer size, pointer size, how long it takes to do a certain operation, how long the system will run for continuously, or various stuff like that? This is a very common problem.
The answer is usually automated unit testing. Write test harnesses which exercise the code on the development system, then run the same test harnesses on the target system. Look for differences!
Also check for errata on your embedded device. You may find there's something about "don't do this because it'll crash, so enable that compiler option and the compiler will work around it".
In short, your most likely source of crashes is bugs in your code. Until you've made pretty damn sure this isn't the case, don't worry (yet) about more esoteric failure modes.

You want 3+ slave machines with a master outside the radiation environment. All I/O passes through the master which contains a vote and/or retry mechanism. The slaves must have a hardware watchdog each and the call to bump them should be surrounded by CRCs or the like to reduce the probability of involuntary bumping. Bumping should be controlled by the master, so lost connection with master equals reboot within a few seconds.
One advantage of this solution is that you can use the same API to the master as to the slaves, so redundancy becomes a transparent feature.
Edit: From the comments I feel the need to clarify the "CRC idea." The possibilty of the slave bumping it's own watchdog is close to zero if you surround the bump with CRC or digest checks on random data from the master. That random data is only sent from master when the slave under scrutiny is aligned with the others. The random data and CRC/digest are immediately cleared after each bump. The master-slave bump frequency should be more than double the watchdog timeout. The data sent from the master is uniquely generated every time.

How about running many instances of your application. If crashes are due to random memory bit changes, chances are some of your app instances will make it through and produce accurate results. It's probably quite easy (for someone with statistical background) to calculate how many instances do you need given bit flop probability to achieve as tiny overall error as you wish.

What you ask is quite complex topic - not easily answerable. Other answers are ok, but they covered just a small part of all the things you need to do.
As seen in comments, it is not possible to fix hardware problems 100%, however it is possible with high probabily to reduce or catch them using various techniques.
If I was you, I would create the software of the highest Safety integrity level level (SIL-4). Get the IEC 61513 document (for the nuclear industry) and follow it.

Someone mentioned using slower chips to prevent ions from flipping bits as easily. In a similar fashion perhaps use a specialized cpu/ram that actually uses multiple bits to store a single bit. Thus providing a hardware fault tolerance because it would be very unlikely that all of the bits would get flipped. So 1 = 1111 but would need to get hit 4 times to actually flipped. (4 might be a bad number since if 2 bits get flipped its already ambiguous). So if you go with 8, you get 8 times less ram and some fraction slower access time but a much more reliable data representation. You could probably do this both on the software level with a specialized compiler(allocate x amount more space for everything) or language implementation (write wrappers for data structures that allocate things this way). Or specialized hardware that has the same logical structure but does this in the firmware.

Perhaps it would help to know does it mean for the hardware to be "designed for this environment". How does it correct and/or indicates the presence of SEU errors ?
At one space exploration related project, we had a custom MCU, which would raise an exception/interrupt on SEU errors, but with some delay, i.e. some cycles may pass/instructions be executed after the one insn which caused the SEU exception.
Particularly vulnerable was the data cache, so a handler would invalidate the offending cache line and restart the program. Only that, due to the imprecise nature of the exception, the sequence of insns headed by the exception raising insn may not be restartable.
We identified the hazardous (not restartable) sequences (like lw $3, 0x0($2), followed by an insn, which modifies $2 and is not data-dependent on $3), and I made modifications to GCC, so such sequences do not occur (e.g. as a last resort, separating the two insns by a nop).
Just something to consider ...

If your hardware fails then you can use mechanical storage to recover it. If your code base is small and have some physical space then you can use a mechanical data store.
There will be a surface of material which will not be affected by radiation. Multiple gears will be there. A mechanical reader will run on all the gears and will be flexible to move up and down. Down means it is 0 and up means it is 1. From 0 and 1 you can generate your code base.

Use a cyclic scheduler. This gives you the ability to add regular maintenance times to check the correctness of critical data. The problem most often encountered is corruption of the stack. If your software is cyclical you can reinitialize the stack between cycles. Do not reuse the stacks for interrupt calls, setup a separate stack of each important interrupt call.
Similar to the Watchdog concept is deadline timers. Start a hardware timer before calling a function. If the function does not return before the deadline timer interrupts then reload the stack and try again. If it still fails after 3/5 tries you need reload from ROM.
Split your software into parts and isolate these parts to use separate memory areas and execution times (Especially in a control environment). Example: signal acquisition, prepossessing data, main algorithm and result implementation/transmission. This means a failure in one part will not cause failures through the rest of the program. So while we are repairing the signal acquisition the rest of tasks continues on stale data.
Everything needs CRCs. If you execute out of RAM even your .text needs a CRC. Check the CRCs regularly if you using a cyclical scheduler. Some compilers (not GCC) can generate CRCs for each section and some processors have dedicated hardware to do CRC calculations, but I guess that would fall out side of the scope of your question. Checking CRCs also prompts the ECC controller on the memory to repair single bit errors before it becomes a problem.
Use watchdogs for bootup no just once operational. You need hardware help if your bootup ran into trouble.

Firstly, design your application around failure. Ensure that as part of normal flow operation, it expects to reset (depending on your application and the type of failure either soft or hard). This is hard to get perfect: critical operations that require some degree of transactionality may need to be checked and tweaked at an assembly level so that an interruption at a key point cannot result in inconsistent external commands.
Fail fast as soon as any unrecoverable memory corruption or control flow deviation is detected. Log failures if possible.
Secondly, where possible, correct corruption and continue. This means checksumming and fixing constant tables (and program code if you can) often; perhaps before each major operation or on a timed interrupt, and storing variables in structures that autocorrect (again before each major op or on a timed interrupt take a majority vote from 3 and correct if is a single deviation). Log corrections if possible.
Thirdly, test failure. Set up a repeatable test environment that flips bits in memory psuedo-randomly. This will allow you to replicate corruption situations and help design your application around them.

Given supercat's comments, the tendencies of modern compilers, and other things, I'd be tempted to go back to the ancient days and write the whole code in assembly and static memory allocations everywhere. For this kind of utter reliability I think assembly no longer incurs a large percentage difference of the cost.

Here are huge amount of replies, but I'll try to sum up my ideas about this.
Something crashes or does not work correctly could be result of your own mistakes - then it should be easily to fix when you locate the problem. But there is also possibility of hardware failures - and that's difficult if not impossible to fix in overall.
I would recommend first to try to catch the problematic situation by logging (stack, registers, function calls) - either by logging them somewhere into file, or transmitting them somehow directly ("oh no - I'm crashing").
Recovery from such error situation is either reboot (if software is still alive and kicking) or hardware reset (e.g. hw watchdogs). Easier to start from first one.
If problem is hardware related - then logging should help you to identify in which function call problem occurs and that can give you inside knowledge of what is not working and where.
Also if code is relatively complex - it makes sense to "divide and conquer" it - meaning you remove / disable some function calls where you suspect problem is - typically disabling half of code and enabling another half - you can get "does work" / "does not work" kind of decision after which you can focus into another half of code. (Where problem is)
If problem occurs after some time - then stack overflow can be suspected - then it's better to monitor stack point registers - if they constantly grows.
And if you manage to fully minimize your code until "hello world" kind of application - and it's still failing randomly - then hardware problems are expected - and there needs to be "hardware upgrade" - meaning invent such cpu / ram / ... -hardware combination which would tolerate radiation better.
Most important thing is probably how you get your logs back if machine fully stopped / resetted / does not work - probably first thing bootstap should do - is a head back home if problematic situation is entcovered.
If it's possible in your environment also to transmit a signal and receive response - you could try out to construct some sort of online remote debugging environment, but then you must have at least of communication media working and some processor/ some ram in working state. And by remote debugging I mean either GDB / gdb stub kind of approach or your own implementation of what you need to get back from your application (e.g. download log files, download call stack, download ram, restart)

I've really read a lot of great answers!
Here is my 2 cent: build a statistical model of the memory/register abnormality, by writing a software to check the memory or to perform frequent register comparisons. Further, create an emulator, in the style of a virtual machine where you can experiment with the issue. I guess if you vary junction size, clock frequency, vendor, casing, etc would observe a different behavior.
Even our desktop PC memory has a certain rate of failure, which however doesn't impair the day to day work.

How do I detect memory access violation and/or memory race conditions?

I have a target platform reporting when memory is read from or written to as well as when locks(think mutex for example) are taken/freed. It reports the program counter, data address and read/write flag. I am writing a program to use this information on a separate host machine where the reports are received so it does not interfere with the target. The target already reports this data so I am not changing the target code at all.
Are there any references or already available algorithms that do this kind of detection? For example, some way of detecting race conditions when multiple threads try to write to a global variable without protecting it first.
I am currently brewing my own but I convince myself there is definitely some code out there that does this already. Or at least some proven algorithm of how to go about it.
Note This is not to detect memory leaks.
Note Implementation language is C++
I am trying to make the detection code I write platform agnostic so I am using STL and just Standard C++ with libraries like boost, poco, loki.
Any leads will help
thanks.

It is probably too late to talk you out of this, but this does not work. Threading races are caused by subtle timing issues between threads. You can never diagnose timing related problems with logging. Heisenbergian, just logging alters the timing of a thread. Especially the kind you are contemplating. Infamously, there's plenty of software that shipped with logging kept turned on because it would nosedive with it turned off.
Flushing out threading bugs is hard. The kind of tool that works is one that intentionally injects random delays in code. Microsoft CHESS is an example, works on native code too.

To address only part of your question, race conditions are extremely nasty precisely because there is no good way to test for them. By definition they're unpredictable sequences of events that are quite difficult to diagnose. Detection code depends on the fact that the race condition is actually happening, and in that case it's likely that you'll see errant behavior anyway. Any test code you add may make them more or less likely to appear, or possibly even change the timing such that they never appear at all.
Instead of trying to detect race conditions, what about attempting program design that helps make you more resilient to having them in the first place?
For example if your global variable were simply encapsulated in an object that knows all the proper protection that needs to happen on access, then it's impossible for threads to concurrently write to it, because such a interface doesn't exist. Programmatically preventing race conditions is going to be easier than trying to detect them algorithmically (chances are you'll still catch some during unit/subsystem testing).

Testing approach for multi-threaded software

I have a piece of mature geospatial software that has recently had areas rewritten to take better advantage of the multiple processors available in modern PCs. Specifically, display, GUI, spatial searching, and main processing have all been hived off to seperate threads. The software has a pretty sizeable GUI automation suite for functional regression, and another smaller one for performance regression. While all automated tests are passing, I'm not convinced that they provide nearly enough coverage in terms of finding bugs relating race conditions, deadlocks, and other nasties associated with multi-threading. What techniques would you use to see if such bugs exist? What techniques would you advocate for rooting them out, assuming there are some in there to root out?
What I'm doing so far is running the GUI functional automation on the app running under a debugger, such that I can break out of deadlocks and catch crashes, and plan to make a bounds checker build and repeat the tests against that version. I've also carried out a static analysis of the source via PC-Lint with the hope of locating potential dead locks, but not had any worthwhile results.
The application is C++, MFC, mulitple document/view, with a number of threads per doc. The locking mechanism I'm using is based on an object that includes a pointer to a CMutex, which is locked in the ctor and freed in the dtor. I use local variables of this object to lock various bits of code as required, and my mutex has a time out that fires my a warning if the timeout is reached. I avoid locking where possible, using resource copies where possible instead.
What other tests would you carry out?
Edit: I have cross posted this question on a number of different testing and programming forums, as I'm keen to see how the different mind-sets and schools of thought would approach this issue. So apologies if you see it cross-posted elsewhere. I'll provide a summary links to responses after a week or so

Some suggestions:
Utilize the law of large numbers and perform the operation under test not only once, but many times.
Stress-test your code by exaggerating the scenarios. E.g. to test your mutex-holding class, use scenarios where the mutex-protected code:
is very short and fast (a single instruction)
is time-consuming (Sleep with a large value)
contains explicit context switches (Sleep (0))
Run your test on various different architectures. (Even if your software is Windows-only, test it on single- and multicore processors with and without hyperthreading, and a wide range of clock speeds)
Try to design your code such that most of it is not exposed to multithreading issues. E.g. instead of accessing shared data (which requires locking or very carefully designed lock-avoidance techniques), let your worker threads operate on copies of the data, and communicate with them using queues. Then you only have to test your queue class for thread-safety
Run your tests when the system is idle as well as when it is under load from other tasks (e.g. our build server frequently runs multiple builds in parallel. This alone revealed many multithreading bugs that happened when the system was under load.)
Avoid asserting on timeouts. If such an assert fails, you don't know whether the code is broken or whether the timeout was too short. Instead, use a very generous timeout (just to ensure that the test eventually fails). If you want to test that an operation doesn't take longer than a certain time, measure the duration, but don't use a timeout for this.

Whilst I agree with #rstevens answer in that there's currently no way to unit test threading issues with 100% certainty there are some things that I've found useful.
Firstly whatever tests you have make sure you run them on lots of different spec boxes. I have several build machines, all different, multi-core, single core, fast, slow, etc. The good thing about how diverse they are is that different ones will throw up different threading issues. I've regularly been surprised to add a new build machine to my farm and suddenly have a new threading bug exposed; and I'm talking about a new bug being exposed in code that has run 10000s of times on the other build machines and which shows up 1 in 10 on the new one...
Secondly most of the unit testing that you do on your code needn't involve threading at all. The threading is, generally, orthogonal. So step one is to tease the code apart so that you can test the actual code that does the work without worrying too much about the threaded nature. This usually means creating an interface that the threading code uses to drive the real code. You can then test the real code in isolation.
Thridly you can test where the threaded code interacts with the main body of code. This means writing a mock for the interface that you developed to separate the two blocks of code. By now the threading code is likely much simpler and you can then often place synchronisation objects in the mock that you've made so that you can control the code under test. So, you'd spin up your thread and wait for it to set an event by calling into your mock and then have it block on another event which your test code controls. The test code can then step the threaded code from one point in your interface to the next.
Finally (if you've decoupled things enough that you can do the earlier stuff then this is easy) you can then run larger pieces of the multi-threaded parts of the app under test and make sure you get the results that you expect; you can play with the priority of the threads and maybe even add a couple of test threads that simply eat CPU to stir things up a bit.
Now you run all of these tests many many times on different hardware...
I've also found that running the tests (or the app) under something like DevPartner BoundsChecker can help a lot as it messes with the thread scheduling such that it sometimes shakes out hard to find bugs. I also wrote a deadlock detection tool which checks for lock inversions during program execution but I only use that rarely.
You can see an example of how I test multi-threaded C++ code here: http://www.lenholgate.com/blog/2004/05/practical-testing.html

Not really an answer:
Testing multithreaded bugs is very difficult. Most bugs only show up if two (or more) threads go to specific places in code in a specific order.
If and when this condition is met may depend on the timing of the process running. This timing may change due to one of the following pre-conditions:
Type of processor
Processor speed
Number of processors/cores
Optimization level
Running inside or outside the debugger
Operating system
There are for sure more pre-conditions that I forgot.
Because MT-bugs so highly depend on the exact timing of the code running Heisenberg's "Uncertainty principle" comes in here: If you want to test for MT bugs you change the timing by your "measures" which may prevent the bug from occurring...
The timing thing is what makes MT bugs so highly non-deterministic.
In other words: You may have a software that runs for months and then crashes some day and after that may run for years. If you don't have some debug logs/core dumps etc. you may never know why it crashes.
So my conclusion is: There is no really good way to Unit-Test for thread-safety. You always have to keep your eyes open when programming.
To make this clear I will give you a (simplified) example from real life (I encountered this when changing my employer and looking on the existing code there):
Imagine you have a class. You want that class to automatically deleted if no-one uses it anymore. So you build a reference-counter into that class:
(I know it is a bad style to delete an instance of a class in one of it's methods. This is because of the simplification of the real code which uses a Ref class to handle counted references.)
class A {
private:
int refcount;
public:
A() : refcount(0) {
}
void Ref() {
refcount++;
}
void Release() {
refcount--;
if (refcount == 0) {
delete this;
}
}
};
This seams pretty simple and nothing to worry about. But this is not thread-safe!
It's because "refcount++" and "refcount--" are not atomic operations but both are three operations:
read refcount from memory to register
increment/decrement register
write refcount from register to memory
Each of those operations can be interrupted and another thread may, at the same time manipulate the same refcount. So if for example two threads want to incremenet refcount the following COULD happen:
Thread A: read refcount from memory to register (refcount: 8)
Thread A: increment register
CONTEXT CHANGE -
Thread B: read refcount from memory to register (refcount: 8)
Thread B: increment register
Thread B: write refcount from register to memory (refcount: 9)
CONTEXT CHANGE -
Thread A: write refcount from register to memory (refcount: 9)
So the result is: refcount = 9 but it should have been 10!
This can only be solved by using atomic operations (i.e. InterlockedIncrement() & InterlockedDecrement() on Windows).
This bug is simply untestable! The reason is that it is so highly unlikely that there are two threads at the same time trying to modify the refcount of the same instance and that there are context switches in between that code.
But it can happen! (The probability increases if you have a multi-processor or multi-core system because there is no context switch needed to make it happen).
It will happen in some days, weeks or months!

Looks like you are using Microsoft tools. There's a group at Microsoft Research that has been working on a tool specifically designed to shake out concurrency bugz. Check out CHESS. Other research projects, in their early stages, are Cuzz and Featherlite.
VS2010 includes a very good looking concurrency profiler, video is available here.

As Len Holgate mentions, I would suggest refactoring (if needed) and creating interfaces for the parts of the code where different threads interact with objects carrying a state. These parts of the code can then be tested separate from the code containing the actual functionality. To verify such a unit test, I would consider using a code coverage tool (I use gcov and lcov for this) to verify that everything in the thread safe interface is covered.
I think this is a pretty convenient way of verifying that new code is covered in the tests.
The next step is then to follow the advice of the other answers regarding how to run the tests.

Firstly, many thanks for the responses. For the responses posted across different forumes see;
http://www.sqaforums.com/showflat.php?Cat=0&Number=617621&an=0&page=0#Post617621
Testing approach for multi-threaded software
http://www.softwaretestingclub.com/forum/topics/testing-approach-for?xg_source=activity
and the following mailing list; software-testing#yahoogroups.com
The testing took significantly longer than expected, hence this late reply, leading me to the conclusion that adding multi-threading to existing apps is liable to be very expensive in terms of testing, even if the coding is quite straightforward. This could prove interesting for the SQA community, as there is increasingly more multi-threaded development going on out there.
As per Joe Strazzere's advice, I found the most effective way of hitting bugs was via automation with varied input. I ended up doing this on three PCs, which have ran a bank of tests repeatedly with varied input over about six weeks. Initially, I was seeing crashes one or two times per PC per day. As I tracked these down, it ended up with one or two per week between the three PCs, and we haven't had any further problems for the last two weeks. For the last two weeks we have also had a version with users beta testing, and are using the software in-house.
In addition to varying the input under automation, I also got good results from the following;
Adding a test option that allowed mutex time-outs to be read from a configuration file, which in turn could be controlled by my automation.
Extending mutex time-outs beyond the typical time expected to execute a section of thread code, and firing a debug exception on time-out.
Running the automation in conjunction with a debugger (VS2008) such that when a problem occurred there was a better chance of tracking it down.
Running without a debugger to ensure that the debugger was not hiding other timing related bugs.
Running the automation against normal release, debug, and fully optimised build. FWIW, the optimised build threw up errors not reproducible in the other builds.
The type of bugs uncovered tended to be serious in nature, e.g. dereferencing invalid pointers, and even under the debugger took quite a bit of tracking down. As has been discussed elsewhere, the SuspendThread and ResumeThread functions ended up being major culprits, and all use of these functions were replaced by mutexes. Similarly all critical sections were removed due to lack of time-outs. Closing documents and exiting the program were also a bug source, where in one instance a document was destroyed with a worker thread still active. To overcome this a single mutex was added per thread to control the life of the thread, and aquired by the document destructor to ensure the thread had terminated as expected.
Once again, many thanks for the all the detailed and varied responses. Next time I take on this type of activity, I'll be better prepared.

How best to test a Mutex implementation?

What is the best way to test an implementation of a mutex is indeed correct? (It is necessary to implement a mutex, reuse is not a viable option)
The best I have come up with is to have many (N) concurrent threads iteratively attempting to access the protected region (I) times, which has a side effect (e.g. update to a global) so that the number of accesses + writes can be counted to ensure that the number of updates to the global is exactly (N)*(I).
Any other suggestions?

Formal proof is better than testing for this kind of thing.
Testing will show you that -- as long as you're not unlucky -- it all worked. But a test is a blunt instrument; it can fail to execute the exact right sequence to cause failure.
It's too hard to test every possible sequence of operations available in the hardware to be sure your mutex works under all circumstances.
Testing is not without value; it shows that you didn't make any obvious coding errors.
But you really need a more formal code inspection to demonstrate that it does the right things at the right times so that one client will atomically seize the lock resource required for proper mutex. In many platforms, there are special instructions to implement this, and if you're using one of those, you have a fighting chance of getting it right.
Similarly, you have to show that the release is atomic.

With something like a mutex, we get back to the old rule that testing can only demonstrate the presence of bugs, not the absence. A year of testing will probably tell you less than simply putting the code up for inspection, and asking if anybody sees a problem with it.

I'm with everyone else that this is incredibly difficult to prove conclusively, I have no idea how to do it - not helpful I know!
When you say implement a mutex and that reuse is not an option, is that for technical reasons like there is no Mutex implementation on the platform/OS that you are using or some other reason? Is wrapping some form of OS level 'lock' and calling it your Mutex implementation an option, eg CriticalSection on windoze, posix Condition Variables? If you can wrap a lower level OS lock then your chance of getting it right are much higher.
In case you have not already done so go and read Herb Sutter's Effective Concurrency articles. There should be some stuff in these worth something to you.
Anyway, some things to consider in your tests:
If your mutex is recursive (ie can the same thread lock it multiple
times) then ensure you do some
reference counting tests.
With your global variable that you
modify it would be best if this was a
varible that can't be written to
atomically. For example if you are
on an 8 bit platform then use a 16 or
32 bit variable that requires
multiple assembly instructions to write.
Carefully examine the assembly listing. Depending on the hardware platform though the assembly does not translate directly to how the code might be optimised...
Get someone else, who didn't write the code, to also write some tests.
test on as many different machines with different specs as you can (assuming that this is 'general purpose' and not for a specific hardware setup)
Good luck!

This reminds me of this question about FIFO semaphore test. In a nutshell my answer was:
Even if you have a specification, maybe it doesn't convey your intention exactly
You can prove that the algorithm fulfills the specification, but not the code (D. Knuth)
Test reveal only the presence of bug, not their absence (Dijkstra)
So your proposition seem reasonably the best to do. If you want to increase your confidence, use fuzzing to randomize scheduling and input.

If the proof stuff doesn't work out for you, then go with the testing one. Be sure to test all the possible use cases. Find out how exactly this thing will be used, who will be using it and, again, how it will be used. When you go the testing route be sure to run each test for each scenario a ton of times (millions, billions, as many as you can possibly get in with the testing time you have).
Try to be random because randomness will give you the best chance to cover all scenarios in a limited number of tests. Be sure to use data that will be used and data that may not be used but could be used and make sure the data doesn't mess up the locks.
BTW, unless you know a ton about mathematics and formal methods you will have no chance of actually coming up with a proof.

Advice for converting a large monolithic singlethreaded application to a multithreaded architecture?

My company's main product is a large monolithic C++ application, used for scientific data processing and visualisation. Its codebase goes back maybe 12 or 13 years, and while we have put work into upgrading and maintaining it (use of STL and Boost - when I joined most containers were custom, for example - fully upgraded to Unicode and the 2010 VCL, etc) there's one remaining, very significant problem: it's fully singlethreaded. Given it's a data processing and visualisation program, this is becoming more and more of a handicap.
I'm both a developer and the project manager for the next release where we want to tackle this, and this is going to be a difficult job in both areas. I'm seeking concrete, practical, and architectural advice on how to tackle the problem.
The program's data flow might go something like this:
a window needs to draw data
In the paint method, it will call a GetData method, often hundreds of times for hundreds of bits of data in one paint operation
This will go and calculate or read from file or whatever else is required (often quite a complex data flow - think of this as data flowing through a complex graph, each node of which performs operations)
Ie, the paint message handler will block while processing is done, and if the data hasn't already been calculated and cached, this can be a long time. Sometimes this is minutes. Similar paths occur for other parts of the program that perform lengthy processing operations - the program is unresponsive for the entire time, sometimes hours.
I'm seeking advice on how to approach changing this. Practical ideas. Perhaps things like:
design patterns for asynchronously requesting data?
storing large collections of objects such that threads can read and write safely?
handling invalidation of data sets while something is trying to read it?
are there patterns and techniques for this sort of problem?
what should I be asking that I haven't thought of?
I haven't done any multithreaded programming since my Uni days a few years ago, and I think the rest of my team is in a similar position. What I knew was academic, not practical, and is nowhere near enough to have confidence approaching this.
The ultimate objective is to have a fully responsive program, where all calculations and data generation is done in other threads and the UI is always responsive. We might not get there in a single development cycle :)
Edit: I thought I should add a couple more details about the app:
It's a 32-bit desktop application for Windows. Each copy is licensed. We plan to keep it a desktop, locally-running app
We use Embarcadero (formerly Borland) C++ Builder 2010 for development. This affects the parallel libraries we can use, since most seem (?) to be written for GCC or MSVC only. Luckily they're actively developing it and its C++ standards support is much better than it used to be. The compiler supports these Boost components.
Its architecture is not as clean as it should be and components are often too tightly coupled. This is another problem :)
Edit #2: Thanks for the replies so far!
I'm surprised so many people have recommended a multi-process architecture (it's the top-voted answer at the moment), not multithreading. My impression is that's a very Unix-ish program structure, and I don't know anything about how it's designed or works. Are there good resources available about it, on Windows? Is it really that common on Windows?
In terms of concrete approaches to some of the multithreading suggestions, are there design patterns for asynchronous request and consuming of data, or threadaware or asynchronous MVP systems, or how to design a task-oriented system, or articles and books and post-release deconstructions illustrating things that work and things that don't work? We can develop all this architecture ourselves, of course, but it's good to work from what others have done before and know what mistakes and pitfalls to avoid.
One aspect that isn't touched on in any answers is project managing this. My impression is estimating how long this will take and keeping good control of the project when doing something as uncertain as this may be hard. That's one reason I'm after recipes or practical coding advice, I guess, to guide and restrict coding direction as much as possible.
I haven't yet marked an answer for this question - this is not because of the quality of the answers, which is great (and thankyou) but simply that because of the scope of this I'm hoping for more answers or discussion. Thankyou to those who have already replied!

You have a big challenge ahead of you. I had a similar challenge ahead of me -- 15 year old monolithic single threaded code base, not taking advantage of multicore, etc. We expended a great deal of effort in trying to find a design and solution that was workable and would work.
Bad news first. It will be somewhere between impractical and impossible to make your single-threaded app multithreaded. A single threaded app relies on it's singlethreaded-ness is ways both subtle and gross. One example is if the computation portion requires input from the GUI portion. The GUI must run in the main thread. If you try to get this data directly from the computation engine, you will likely run in to deadlock and race conditions that will require major redesigns to fix. Many of these reliances will not crop up during the design phase, or even during the development phase, but only after a release build is put in a harsh environment.
More bad news. Programming multithreaded applications is exceptionally hard. It might seem fairly straightforward to just lock stuff and do what you have to do, but it is not. First of all if you lock everything in sight you end up serializing your application, negating every benefit of mutithreading in the first place while still adding in all the complexity. Even if you get beyond this, writing a defect-free MP application is hard enough, but writing a highly-performant MP application is that much more difficult. You could learn on the job in a kind of baptismal by fire. But if you are doing this with production code, especially legacy production code, you put your buisness at risk.
Now the good news. You do have options that don't involve refactoring your whole app and will give you most of what you seek. One option in particular is easy to implement (in relative terms), and much less prone to defects than making your app fully MP.
You could instantiate multiple copies of your application. Make one of them visible, and all the others invisible. Use the visible application as the presentation layer, but don't do the computational work there. Instead, send messages (perhaps via sockets) to the invisible copies of your application which do the work and send the results back to the presentation layer.
This might seem like a hack. And maybe it is. But it will get you what you need without putting the stability and performance of your system at such great risk. Plus there are hidden benefits. One is that the invisible engine copies of your app will have access to their own virtual memory space, making it easier to leverage all the resources of the system. It also scales nicely. If you are running on a 2-core box, you could spin off 2 copies of your engine. 32 cores? 32 copies. You get the idea.

So, there's a hint in your description of the algorithm as to how to proceed:
often quite a complex data flow - think of this as data flowing through a complex graph, each node of which performs operations
I'd look into making that data-flow graph be literally the structure that does the work. The links in the graph can be thread-safe queues, the algorithms at each node can stay pretty much unchanged, except wrapped in a thread that picks up work items from a queue and deposits results on one. You could go a step further and use sockets and processes rather than queues and threads; this will let you spread across multiple machines if there is a performance benefit in doing this.
Then your paint and other GUI methods need split in two: one half to queue the work, and the other half to draw or use the results as they come out of the pipeline.
This may not be practical if the app presumes that data is global. But if it is well contained in classes, as your description suggests it may be, then this could be the simplest way to get it parallelised.

Don't attempt to multithread everything in the old app. Multithreading for the sake of saying it's multithreaded is a waste of time and money. You're building an app that does something, not a monument to yourself.
Profile and study your execution flows to figure out where the app spends most of its time. A profiler is a great tool for this, but so is just stepping through the code in the debugger. You find the most interesting things in random walks.
Decouple the UI from long-running computations. Use cross-thread communications techniques to send updates to the UI from the computation thread.
As a side-effect of #3: think carefully about reentrancy: now that the compute is running in the background and the user can smurf around in the UI, what things in the UI should be disabled to prevent conflicts with the background operation? Allowing the user to delete a dataset while a computation is running on that data is probably a bad idea. (Mitigation: computation makes a local snapshot of the data) Does it make sense for the user to spool up multiple compute operations concurrently? If handled well, this could be a new feature and help rationalize the app rework effort. If ignored, it will be a disaster.
Identify specific operations that are candidates to be shoved into a background thread. The ideal candidate is usually a single function or class that does a lot of work (requires a "lot of time" to complete - more than a few seconds) with well defined inputs and outputs, that makes use of no global resources, and does not touch the UI directly. Evaluate and prioritize candidates based on how much work would be required to retrofit to this ideal.
In terms of project management, take things one step at a time. If you have multiple operations that are strong candidates to be moved to a background thread, and they have no interaction with each other, these might be implemented in parallel by multiple developers. However, it would be a good exercise to have everybody participate in one conversion first so that everyone understands what to look for and to establish your patterns for UI interaction, etc. Hold an extended whiteboard meeting to discuss the design and process of extracting the one function into a background thread. Go implement that (together or dole out pieces to individuals), then reconvene to put it all together and discuss discoveries and pain points.
Multithreading is a headache and requires more careful thought than straight up coding, but splitting the app into multiple processes creates far more headaches, IMO. Threading support and available primitives are good in Windows, perhaps better than some other platforms. Use them.
In general, don't do any more than what is needed. It's easy to severely over implement and over complicate an issue by throwing more patterns and standard libraries at it.
If nobody on your team has done multithreading work before, budget time to make an expert or funds to hire one as a consultant.

The main thing you have to do is to disconnect your UI from your data set. I'd suggest that the way to do that is to put a layer in between.
You will need to design a data structure of data cooked-for-display. This will most likely contain copies of some of your back-end data, but "cooked" to be easy to draw from. The key idea here is that this is quick and easy to paint from. You may even have this data structure contain calculated screen positions of bits of data so that it's quick to draw from.
Whenever you get a WM_PAINT message you should get the most recent complete version of this structure and draw from it. If you do this properly, you should be able to handle multiple WM_PAINT messages per second because the paint code never refers to your back end data at all. It's just spinning through the cooked structure. The idea here is that its better to paint stale data quickly than to hang your UI.
Meanwhile...
You should have 2 complete copies of this cooked-for-display structure. One is what the WM_PAINT message looks at. (call it cfd_A) The other is what you hand to your CookDataForDisplay() function. (call it cfd_B). Your CookDataForDisplay() function runs in a separate thread, and works on building/updating cfd_B in the background. This function can take as long as it wants because it isn't interacting with the display in any way. Once the call returns cfd_B will be the most up-to-date version of the structure.
Now swap cfd_A and cfd_B and InvalidateRect on your application window.
A simplistic way to do this is to have your cooked-for-display structure be a bitmap, and that might be a good way to go to get the ball rolling, but I'm sure with a bit of thought you can do a much better job with a more sophisticated structure.
So, referring back to your example.
In the paint method, it will call a GetData method, often hundreds of times for hundreds of bits of data in one paint operation
This is now 2 threads, the paint method refers to cfd_A and runs on the UI thread. Meanwhile cfd_B is being built by a background thread using GetData calls.
The quick-and-dirty way to do this is
Take your current WM_PAINT code, stick it into a function called PaintIntoBitmap().
Create a bitmap and a Memory DC, this is cfd_B.
Create a thread and pass it cfd_B and have it call PaintIntoBitmap()
When this thread completes, swap cfd_B and cfd_A
Now your new WM_PAINT method just takes the pre-rendered bitmap in cfd_A and draws it to the screen. Your UI is now disconnnected from your backend GetData() function.
Now the real work begins, because the quick-and-dirty way doesn't handle window resizing very well. You can go from there to refine what your cfd_A and cfd_B structures are a little at a time until you reach a point where you are satisfied with the result.

You might just start out breaking the the UI and the work task into separate threads.
In your paint method instead of calling getData() directly, it puts the request in a thread-safe queue. getData() is run in another thread that reads its data from the queue. When the getData thread is done, it signals the main thread to redraw the visualisation area with its result data using thread syncronization to pass the data.
While all this is going on you of course have a progress bar saying reticulating splines so the user knows something is going on.
This would keep your UI snappy without the significant pain of multithreading your work routines (which can be akin to a total rewrite)

It sounds like you have several different issues that parallelism can address, but in different ways.
Performance increases through utilizing multicore CPU Architecutres
You're not taking advantage of the multi-core CPU architetures that are becoming so common. Parallelization allow you to divide work amongst multiple cores. You can write that code through standard C++ divide and conquer techniques using a "functional" style of programming where you pass work to separate threads at the divide stage. Google's MapReduce pattern is an example of that technique. Intel has the new CILK library to give you C++ compiler support for such techniques.
Greater GUI responsiveness through asynchronous document-view
By separating the GUI operations from the document operations and placing them on different threads, you can increase the apparent responsiveness of your application. The standard Model-View-Controller or Model-View-Presenter design patterns are a good place to start. You need to parallelize them by having the model inform the view of updates rather than have the view provide the thread on which the document computes itself. The View would call a method on the model asking it to compute a particular view of the data, and the model would inform the presenter/controller as information is changed or new data becomes available, which would get passed to the view to update itself.
Opportunistic caching and pre-calculation
It sounds like your application has a fixed base of data, but many possible compute-intensive views on the data. If you did a statistical analysis on which views were most commonly requested in what situations, you could create background worker threads to pre-calculate the likely-requested values. It may be useful to put these operations on low-priority threads so that they don't interfere with the main application processing.
Obviously, you'll need to use mutexes (or critical sections), events, and probably semaphores to implement this. You may find some of the new synchronization objects in Vista useful, like the slim reader-writer lock, condition variables, or the new thread pool API. See Joe Duffy's book on concurrency for how to use these basic techniques.

There is something that no-one has talked about yet, but which is quite interesting.
It's called futures. A future is the promise of a result... let's see with an example.
future<int> leftVal = computeLeftValue(treeNode); // [1]
int rightVal = computeRightValue(treeNode); // [2]
result = leftVal + rightVal; // [3]
It's pretty simple:
You spin off a thread that starts computing leftVal, taking it from a pool for example to avoid the initialization problem.
While leftVal is being computed, you compute rightVal.
You add the two, this may block if leftVal is not computed yet and wait for the computation to end.
The great benefit here is that it's straightforward: each time you have one computation followed by another that is independent and you then join the result, you can use this pattern.
See Herb Sutter's article on futures, they will be available in the upcoming C++0x but there are already libraries available today even if the syntax is perhaps not as pretty as I would make you believe ;)

If it was my development dollars I was spending, I would start with the big picture:
What do I hope to accomplish, and how much will I spend to accomplish this, and how will I be further ahead? (If the answer to this is, my app will run 10% better on quadcore PCs, and I could have achieved the same result by spending $1000 more per customer PC , and spending $100,000 less this year on R&D, then, I would skip the whole effort).
Why am I doing multi-threaded instead of massively parallel distributed? Do I really think threads are better than processes? Multi-core systems also run distributed apps pretty well. And there are some advantages to message-passing process based systems that go beyond the benefits (and the costs!) of threading. Should I consider a process-based approach? SHould I consider a background running entirely as a service, and a foreground GUI? Since my product is node-locked and licensed, I think services would suit me (vendor) quite well. Also, separating stuff into two processes (background service and foreground) just might force the kind of rewrite and rearchitecting to occur that I might not be forced to do, if I was to just add threading into my mix.
This is just to get you thinking: What if you were to rewrite it as a service (background app) and a GUI, because that would actually be easier than adding threading, without also adding crashes, deadlocks, and race conditions?
Consider the idea that for your needs, perhaps threading is evil. Develop your religion, and stick with that. Unless you have a real good reason to go the other way. For many years, I religiously avoided threading. Because one thread per process is good enough for me.
I don't see any really solid reasons in your list why you need threading, except ones that could be more inexpensively solved by more expensive target computer hardware. If your app is "too slow" adding in threads might not even speed it up.
I use threads for background serial communications, but I would not consider threading merely for computationally heavy applications, unless my algorithms were so inherently parallel as to make the benefits clear, and the drawbacks minimal.
I wonder if the "design" problems that this C++Builder app has are like my Delphi "RAD Spaghetti" application disease. I have found that a wholesale refactor/rewrite (over a year per major app that I have done this to), was a minimum amount of time for me to get a handle on application "accidental complexity". And that was without throwing a "threads where possible" idea. I tend to write my apps with threads for serial communication and network socket handling, only. And maybe the odd "worker-thread-queue".
If there is a place in your app you can add ONE thread, to test the waters, I would look for the main "work queue" and I would create an experimental version control branch, and I would learn about how my code works by breaking it in the experimental branch. Add that thread. And see where you spend your first day of debugging. Then I might just abandon that branch and go back to my trunk until the pain in my temporal lobe subsides.
Warren

Here's what I would do...
I would start by profiling your and seeing:
1) what is slow and what the hot paths are
2) which calls are reentrant or deeply nested
you can use 1) to determine where the opportunity is for speedups and where to start looking for parallelization.
you can use 2) to find out where the shared state is likely to be and get a deeper sense of how much things are tangled up.
I would use a good system profiler and a good sampling profiler (like the windows perforamnce toolkit or the concurrency views of the profiler in Visual Studio 2010 Beta2 - these are both 'free' right now).
Then I would figure out what the goal is and how to separate things gradually to a cleaner design that is more responsive (moving work off the UI thread) and more performant (parallelizing computationally intensive portions). I would focus on the highest priority and most noticable items first.
If you don't have a good refactoring tool like VisualAssist, invest in one - it's worth it. If you're not familiar with Michael Feathers or Kent Beck's refactoring books, consider borrowing them. I would ensure my refactorings are well covered by unit tests.
You can't move to VS (I would recommend the products I work on the Asynchronous Agents Library & Parallel Pattern Library, you can also use TBB or OpenMP).
In boost, I would look carefully at boost::thread, the asio library and the signals library.
I would ask for help / guidance / a listening ear when I got stuck.
-Rick

You can also look at this article from Herb Sutter You have a mass of existing code and want to add concurrency. Where do you start?

Well, I think you're expecting a lot based on your comments here. You're not going to go from minutes to milliseconds by multithreading. The most you can hope for is the current amount of time divided by the number of cores. That being said, you're in a bit of luck with C++. I've written high performance multiprocessor scientific apps, and what you want to look for is the most embarrassingly parallel loop you can find. In my scientific code, the heaviest piece is calculating somewhere between 100 and 1000 data points. However, all of the data points can be calculated independently of the others. You can then split the loop using openmp. This is the easiest and most efficient way to go. If you're compiler doesn't support openmp, then you will have a very hard time porting existing code. With openmp (if you're lucky), you may only have to add a couple of #pragmas to get 4-8x the performance. Here's an example StochFit

I hope this will help you in understanding and converting your monolithic single threaded app to multi thread easily. Sorry it is for another programming language but never the less the principles explained are the same all over.
http://www.freevbcode.com/ShowCode.Asp?ID=1287
Hope this helps.

The first thing you must do is to separate your GUI from your data, the second is to create a multithreaded class.
STEP 1 - Responsive GUI
We can assume that the image you are producing is contained in the canvas of a TImage. You can put a simple TTimer in you form and you can write code like this:
if (CurrenData.LastUpdate>CurrentUpdate)
{
Image1->Canvas->Draw(0,0,CurrenData.Bitmap);
CurrentUpdate=Now();
}
OK! I know! Is a little bit dirty, but it's fast and is simple.The point is that:
You need an Object that is created in the main thread
The object is copied in the Form you need, only when is needed and in a safe way (ok, a better protection for the Bitmap may be is needed, but for semplicity...)
The object CurrentData is your actual project, single threaded, that produces an image
Now you have a fast and responsive GUI. If your algorithm as slow, the refresh is slow, but your user will never think that your program is freezed.
STEP 2 - Multithread
I suggest you to implement a class like the following:
SimpleThread.h
typedef void (__closure *TThreadFunction)(void* Data);
class TSimpleThread : public TThread
{
public:
TSimpleThread( TThreadFunction _Action,void* _Data = NULL, bool RunNow = true );
void AbortThread();
__property Terminated;
protected:
TThreadFunction ThreadFunction;
void* Data;
private:
virtual void __fastcall Execute() { ThreadFunction(Data); };
};
SimpleThread.c
TSimpleThread::TSimpleThread( TThreadFunction _Action,void* _Data, bool RunNow)
: TThread(true), // initialize suspended
ThreadFunction(_Action), Data(_Data)
{
FreeOnTerminate = false;
if (RunNow) Resume();
}
void TSimpleThread::AbortThread()
{
Suspend(); // Can't kill a running thread
Free(); // Kills thread
}
Let's explain. Now, in your simple threaded class you can create an object like this:
TSimpleThread *ST;
ST=new TSimpleThread( RefreshFunction,NULL,true);
ST->Resume();
Let's explain better: now, in your own monolithic class, you have created a thread. More: you bring a function (ie: RefreshFunction) in a separate thread. The scope of your funcion is the same, the class is the same, the execution is separate.

My number one suggestion, although it's very late (sorry for reviving old thread, it's interesting!) is seek out homogeneous transform loops where each iteration of the loop is mutating a completely independent piece of data from the other iterations.
Instead of thinking about how to turn this old codebase into an asynchronous one running all kinds of operations in parallel (which could be asking for all kinds of trouble from worse than single-threaded performance from poor locking patterns or exponentially worse, race conditions/deadlocks by trying to do this in hindsight to code you can't fully comprehend), stick to the sequential mindset for the overall application design for now but identify or extract simple, homogeneous transform loops. Don't go from intrusive broad design-level multithreading and then try to drill into details. Work from non-intrusive multithreading of fine implementation details and specific hotspots first.
What I mean by homogeneous loops is basically one that transforms data in a very straightforward way, like:
for each pixel in image:
make it brighter
That is very simple to reason about and you can safely parallelize this loop without any problems whatsoever using OMP or TBB or whatever and without getting tangled up in thread synchronization. It only takes one glance at this code to fully comprehend its side effects.
Try to find as many hotspots as you can which fit this type of simple homogeneous transform loop and if you have complex loops which update many different types of data with complex control flows that trigger complex side effects, then seek to refactor towards these homogeneous loops. Often a complex loop which causes 3 disparate side effects to 3 different types of data can be turned into 3 simple homogeneous loops which each trigger just one kind of side effect to one type of data with a simpler control flow. Doing multiple loops instead of one might seem a tad wasteful, but the loops become simpler, the homogeneity will often lead to more cache-friendly sequential memory access patterns vs. sporadic random-access patterns, and you then tend to find much more opportunities to safely parallelize (as well as vectorize) the code in a straightforward way.
First you have to thoroughly understand the side effects of any code you attempt to parallelize (and I mean thoroughly!!!), so seeking out these homogeneous loops gives you isolated areas of the codebase you can easily reason about in terms of the side effects to the point where you can confidently and safely parallelize those hotspots. It'll also improve the maintainability of the code by making it very easy to reason about the state changes going on in that particular piece of code. Save the dream of the uber multithreaded application running everything in parallel for later. For now, focus on identifying/extracting performance-critical, homogeneous loops with simple control flows and simple side effects. Those are your priority targets for parallelization with simple parallelized loops.
Now admittedly I somewhat dodged your questions, but most of them don't need apply if you do what I suggest, at least until you've kind of worked your way out to the point where you're thinking more about multithreading designs as opposed to simply parallelizing implementation details. And you might not even need to go that far to have a very competitive product in terms of performance. If you have beefy work to do in a single loop, you can devote the hardware resources to making that loop go faster instead of making many operations run simultaneously. If you have to resort to more async methods like if your hotspots are more I/O bound, seek an async/wait approach where you fire off an async task but do some things in the meantime and then wait on the async task(s) to complete. Even if that's not absolutely necessary, the idea is to section off isolated areas of your codebase where you can, with 100% confidence (or at least 99.9999999%) say that the multithreaded code is correct.
You don't ever want to gamble with race conditions. There's nothing more demoralizing than finding some obscure race condition that only occurs once in a full moon on some random user's machine while your entire QA team is unable to reproduce it, only to, 3 months later, run into it yourself except during that one time you ran a release build without debugging info available while you then toss and turn in your sleep knowing your codebase can flake out at any given moment but in ways that no one will ever be able to consistently reproduce. So take it easy with multithreading legacy codebases, at least for now, and stick to multithreading isolated but critical sections of the codebase where the side effects are dead simple to reason about. And test the crap out of it -- ideally apply a TDD approach where you write a test for the code you're going to multithread to ensure it gives the correct output after you finish... though race conditions are the types of things that easily fly under the radar of unit and integration testing, so again you absolutely need to be able to comprehend the entirety of the side effects that go on in a given piece of code before you attempt to multithread it. The best way to do that is to make the side effects as easy to comprehend as possible with the simplest control flows causing just one type of side effect for an entire loop.

It is hard to give you proper guidelines. But...
The easiest way out according to me is to convert your application to ActiveX EXE as COM has support for Threading, etc. built right into it your program will automatically become Multi Threading application. Of course you will have to make quite a few changes to your code. But this is the shortest and safest way to go.
I am not sure but probably RichClient Toolset lib may do the trick for you. On the site the author has written:
It also offers registration free Loading/Instancing-capabilities
for ActiveX-Dlls and new, easy to use Threading-approach,
which works with Named-Pipes under the
hood and works therefore also
cross-process.
Please check it out. Who knows it may be the right solution for your requirements.
As for Project management I think you can continue using what is provided in your choice IDE by integrating it with SVN through plugins.
I forgot to mention that we have completed an application for Share market that automatically trades (buys and sells based on lows and highs) into those scripts that are in user portfolio based on an algorithm that we have developed.
While developing this software we were facing the same kind of problem as you have illustrated here. To solve it we converted out application in ActiveX EXE and we converted all those parts that need to execute parallely into ActiveX DLLs. We have not used any third party libs for this!
HTH

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js