At which level does one unit test lock-free code? - unit-testing

Can LLVM, QEMU, GDB, Bochs, OpenStack or the like be used to unit test lock-free concurrent code on an open-source platform? Has anyone achieved this?
If you answer by recommending software, I don't mind, but I mention LLVM, QEMU and the others because these operate at different levels. I should like to learn at which level practical success has been found in interleaving threads under unit-test control.
I am aware of SPIN/Promela, incidentally. That is fine software but one cannot compile C++, Rust, etc., onto a SPIN/Promela target as far as I know.
Examples of existing, open-source unit tests of lock-free concurrent code would be gladly received, if you know any. (I would fetch the source and study it if I knew where to look.)
(See also these questions and their answers.)
EXAMPLE
My question does not require an example as far as I know, so you can ignore this one. However, in case an example of testable lock-free code were helpful for purpose of discussion, here is a relatively brief toy example in C++. I have no unit test for it.
#include <atomic>
#include <thread>
#include <cstdlib>
#include <iostream>

const int threshold = 0x100;
const int large_integer = 0x1000;

// Gradually increase the integer to which q points until it reaches the
// threshold. Then, release.
void inflate(std::atomic_bool *const p_atom, int *const q)
{
    while (*q < threshold) ++*q;
    p_atom->store(true, std::memory_order_release);
}

int main()
{
    std::atomic_bool atom{false};
    int n{0};
    // Dispatch the inflator, letting it begin gradually, in the background, to
    // inflate the integer n.
    std::thread inflator(inflate, &atom, &n);
    // Waste some time....
    for (int i = large_integer; i; --i) {}
    // Spin until the inflator has released.
    {
        int no_of_tries = 0;
        while (!atom.load(std::memory_order_acquire)) ++no_of_tries;
        std::cout << "tried " << no_of_tries << " times" << std::endl;
    }
    // Verify that the integer n has reached the threshold.
    if (n == threshold) {
        std::cout << "succeeded" << std::endl;
    }
    else {
        std::cout << "failed" << std::endl;
        std::cerr << "error" << std::endl;
        std::exit(1);
    }
    inflator.join();
    return 0;
}
CLARIFICATION BY PETER CORDES
@PeterCordes precisely clarifies my question:
There can be cases where some source compiles to safe x86 asm with any reasonable compiler, but unsafe for weakly-ordered ISAs, which are also usually capable of performing an atomic RMW without a full seq-cst memory barrier (for run-time reordering; compile-time is still up to the compiler). So then you have two separate questions: Is the source portable to arbitrary C++11 systems, and is your code actually safe on x86 (if that's all you care about for now).
Both questions are interesting to me, but I had arbitrary C++11 systems in mind.
Usually you want to write code that's portably correct, because it usually doesn't cost any more when compiled for x86.
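To make the distinction concrete, here is a small sketch of my own (not Peter's code) of a publish pattern that a reasonable compiler will usually turn into x86 asm that happens to work, yet which is not portable C++11 because the orderings are too weak:

#include <atomic>
#include <thread>

int payload = 0;                      // ordinary, non-atomic data
std::atomic<bool> ready{false};

void writer()
{
    payload = 42;
    // Too weak: publishing needs memory_order_release here.
    ready.store(true, std::memory_order_relaxed);
}

void reader()
{
    // Too weak: consuming needs memory_order_acquire here.
    while (!ready.load(std::memory_order_relaxed)) {}
    (void)payload;                    // data race under the C++11 model
}

int main()
{
    std::thread w(writer), r(reader);
    w.join();
    r.join();
}

On x86 these typically compile to plain MOVs, and the hardware's strong ordering papers over the mistake (unless the compiler itself reorders, which relaxed permits); on ARM or POWER the same source can fail. A test that only runs on x86 therefore answers the second of Peter's questions but not the first.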
Reference: the draft C++17 standard, n4659 (6 MB PDF), well explains the C++11 concurrency model to which Peter refers. See sect. 4.7.1.
INQUIRY BY DIRK HERRMANN
@DirkHerrmann asks a pertinent question:
You ask about how to unit-test your code, but I am not sure that what you describe is truly a unit-testing scenario. Which does not mean you could not use any of the so-called unit-testing frameworks (which can in fact be used for all kinds of tests, not just unit-tests). Could you please explain what the goal of your tests would be, that is, which properties of the code you want to check?
Your point is well taken. The goal of my test would be to flunk bad code reliably for all possible timings the C++11 concurrency model supports. If I know that the code is bad, then I should be able to compose a unit test to flunk it. My trouble is this:
Unthreaded. I can normally compose a unit test to flunk bad code if the code is unthreaded.
Threaded. To flunk bad, threaded code is harder, but as long as mutexes coordinate the threading, at least the code runs similarly on divergent hardware.
Lock-free. To flunk bad, lock-free code might be impossible on particular hardware. What if my bad, lock-free code fails once in a billion runs on your hardware and never fails on mine? How can one unit test such code?
I don't know what I need, really. Insofar as my x86 CPU does not provide a true C++11 concurrency model, maybe I need an emulator for a nonexistent CPU that does provide a true C++11 concurrency model. I am not sure.
If I did have an emulator for a nonexistent CPU that provided a true C++11 concurrency model, then my unit test would (as far as I know) need to try my code under all possible, legal timings.
This is not an easy problem. I wonder whether anyone has solved it.
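For lack of such an emulator, the closest thing I can do today with nothing but the standard library is a brute-force stress test that repeats the scenario many times, hoping to sample a bad interleaving. A minimal sketch of my own (decidedly not exhaustive over the legal timings):

#include <atomic>
#include <cassert>
#include <thread>

// Repeat the release/acquire hand-off many times, hoping to hit a bad
// interleaving. This only samples timings on the host CPU; it cannot
// prove absence of errors under the C++11 memory model.
int main()
{
    for (int run = 0; run < 10000; ++run) {
        std::atomic_bool atom{false};
        int n{0};
        std::thread inflator([&] {
            while (n < 0x100) ++n;
            atom.store(true, std::memory_order_release);
        });
        while (!atom.load(std::memory_order_acquire)) {}
        assert(n == 0x100);
        inflator.join();
    }
}

Such a test can only ever sample the timings my particular CPU happens to produce, which is exactly the inadequacy described above.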
UPDATE: CDSCHECKER AND RELACY
The discussion has led me to investigate various sources, including
CDSChecker, open-source software by Norris and Demsky; and
Relacy Race Detector, open-source software by Vyukov, earlier discussed here.
At this writing, I do not know whether these answer my question but they look promising. I link them here for reference and further investigation.
For completeness of reference, I also add
SPIN/Promela,
already linked above.

Interesting question!
At which level does one unit test lock-free code?
The unsatisfying answer is: You cannot really test "lock-free concurrent code" as you called it.
No wait, of course you can: Test it by using a single piece of paper and a pen. Try to prove it is correct. The design level is the correct level to test multithreaded code.
Of course you can write unit tests for your code, and you really should do that, but there are virtually no means to achieve 100% coverage for all possible concurrent execution scenarios.
You can (and should) try to torment your code and run it on different architectures (e.g. x86 is so strongly ordered that it will hide many concurrency problems, so run it on ARM in addition). And you will still fail to find all the errors.
The fundamental rule is: you cannot use testing to assure any quality level of multithreaded code (lock-free or lock-based). The only way to assure correctness 100% is to formally prove the correctness of your code, and this usually means you have a very simple thread design, one so obvious that everybody understands it within 5 minutes. And then you write your code accordingly.
Don't get me wrong: Testing is useful. But it gets you nowhere with multithreading.
Why is this? Well, first of all the unit test approach does not work: Mutexes do not compose.
When you combine 100% correctly working multithreaded subsystems A and B, the result is not guaranteed to work at all. Mutexes do not compose. Condition variables do not compose. Invariants between threads do not compose. There are only a few, very limited primitives, such as thread-safe queues, that compose. But testing aspects in isolation in unit tests assumes that things compose, like functions or classes.
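As a sketch of what "mutexes do not compose" means in practice, take a hypothetical BankAccount (illustrative names, not real code from any project): each member function is correctly locked on its own, yet composing two of them is still racy.

#include <functional>
#include <mutex>
#include <thread>

// Each member function is thread-safe in isolation, but callers who
// compose them re-introduce races.
class BankAccount {
    mutable std::mutex m_;
    int balance_ = 0;
public:
    int balance() const       { std::lock_guard<std::mutex> l(m_); return balance_; }
    void deposit(int amount)  { std::lock_guard<std::mutex> l(m_); balance_ += amount; }
    void withdraw(int amount) { std::lock_guard<std::mutex> l(m_); balance_ -= amount; }
};

void pay_if_covered(BankAccount& a, int amount)
{
    if (a.balance() >= amount)   // check...
        a.withdraw(amount);      // ...then act: another thread may withdraw
                                 // in between, so the invariant "never go
                                 // negative" does not survive composition.
}

int main()
{
    BankAccount a;
    a.deposit(10);
    std::thread t1(pay_if_covered, std::ref(a), 10);
    std::thread t2(pay_if_covered, std::ref(a), 10);
    t1.join();
    t2.join();
    // a.balance() may now be -10: both threads can pass the check first.
}

The fix is to widen the lock around the whole check-then-act sequence or to redesign the interface, and that is design-level reasoning, not something a per-class unit test establishes.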

Related

Is there a good test for C++ optimizing compilers?

I'm evaluating the Visual C++ 10 optimizing compiler on trivial code samples to see how good the emitted machine code is, and I'm out of creative use cases so far.
Is there some sample codebase that is typically used to evaluate how good an optimizing C++ compiler is?
The only valid benchmark is one that simulates the type of code you're developing. Optimizers react differently to different applications and different coding styles, and the only one that really counts is the code that you are going to be compiling with the compiler.
Try benchmarking such libraries as Eigen (http://eigen.tuxfamily.org/index.php?title=Main_Page).
Quite a few benchmarks use SciMark: http://math.nist.gov/scimark2/download_c.html. However, you should be selective in what you test (test in isolation): one benchmark might fail only because of poor loop unrolling even though the rest of its generated code was excellent, while another might do better only because of its loop unrolling even though the rest of its generated code was sub-par.
As has already been said, you really need to measure optimisation within the context of typical use cases for your own applications, in typical target environments. I include timers in my own automated regression suite for this reason, and have found some quite unusual results, as documented in a previous question. FWIW, I'm finding VS2010 SP1 creates code about 8% faster on average than VS2008 on my own application, and about 13% faster with whole-program optimization. This is not spread evenly across use cases. I also tend to see significant variations between long test runs, which are not visible when profiling much smaller test cases. I haven't carried out platform comparisons yet, e.g. whether many of the gains are platform- or hardware-specific.
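The timers themselves are nothing exotic; a rough sketch of the kind of regression timer I mean (the workload here is a placeholder and the names are illustrative):

#include <chrono>
#include <iostream>

// Placeholder standing in for a real use case from the application.
static long workload()
{
    long sum = 0;
    for (long i = 0; i < 10000000; ++i) sum += i % 7;
    return sum;
}

int main()
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    const long result = workload();
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::milliseconds>(clock::now() - start);
    std::cout << "result " << result << " in " << elapsed.count() << " ms\n";
    // A regression suite would compare elapsed against a stored baseline
    // for the same use case, per compiler and per platform.
}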
I would imagine that many optimisers will be fine tuned to give best results against well known benchmark suites, which could imply in turn that these are not the best pieces of code against which to test the benefits of optimisation. (Speculation of course)

Is optimizing a class for a unit test good practice, or is it premature?

I've seen (and searched for) a lot of questions on StackOverflow about premature optimization - word on the street is, it is the root of all evil. :P I confess that I'm often guilty of this; I don't really optimize for speed at the cost of code legibility, but I will rewrite my code in logical manners using datatypes and methods that seem more appropriate for the task (e.g. in Actionscript 3, using a typed Vector instead of an untyped Array for iteration) and if I can make my code more elegant, I will do so. This generally helps me understand my code, and I generally know why I'm making these changes.
At any rate, I was thinking today - in OOP, we promote encapsulation, attempting to hide the implementation and promote the interface, so that classes are loosely coupled. The idea is to make something that works without having to know what's going on internally - the black box idea.
As such, here's my question - is it wise to attempt to do deep optimization of code at the class level, since OOP promotes modularity? Or does this fall into the category of premature optimization? I'm thinking that, if you use a language that easily supports unit testing, you can test, benchmark, and optimize the class because it in itself is a module that takes input and generates output. But, as a single guy writing code, I don't know if it's wiser to wait until a project is fully finished to begin optimization.
For reference: I've never worked in a team before, so something that's obvious to developers who have this experience might be foreign to me.
Hope this question is appropriate for StackOverflow - I didn't find another one that directly answered my query.
Thanks!
Edit: Thinking about the question, I realize that "profiling" may have been the correct term instead of "unit test"; unit-testing checks that the module works as it should, while profiling checks performance. Additionally, a part of the question I should have asked before - does profiling individual modules after you've created them not reduce time profiling after the application is complete?
My question stems from the game development I'm trying to do - I have to create modules, such as a graphics engine, that should perform optimally (whether they will is a different story :D ). In an application where performance was less important, I probably wouldn't worry about this.
I don't really optimize for speed at the cost of code legibility, but I will rewrite my code in logical manners using datatypes and methods that seem more appropriate for the task [...] and if I can make my code more elegant, I will do so. This generally helps me understand my code
This isn't really optimization, rather refactoring for cleaner code and better design*. As such, it is a Good Thing, and it should indeed be practiced continuously, in small increments. Uncle Bob Martin (in his book Clean Code) popularized the Boy Scout Rule, adapted to software development: Leave the code cleaner than you found it.
So to answer your title question rephrased, yes, refactoring code to make it unit testable is a good practice. One "extreme" of this is Test Driven Development, where one writes the test first, then adds the code which makes the test pass. This way the code is created unit testable from the very beginning.
*Not to be nitpicky, just it is useful to clarify common terminology and make sure that we use the same terms in the same meaning.
True, optimization I believe should be left as a final task (although it's good to be cognizant of where you might need to go back and optimize while writing your first draft). That's not to say that you shouldn't refactor things iteratively in order to maintain order and cleanliness in the code. It is to say that if something currently serves the purpose and isn't botching a requirement of the application, then the requirements should be addressed first, as ultimately they are what you're responsible for delivering (unless the requirements include specifics on maximum request times or something along those lines). I agree with Korin's methodology as well: build for function first, and if time permits, optimize to your heart's content (or to the theoretical limit, whichever comes first).
The reason that premature optimization is a bad thing is this: it can take a lot of time and you don't know in advance where the best use of your time is likely to be.
For example, you could spend a lot of time optimizing a class, only to find that the bottleneck in your application is network latency or similar factor that is far more expensive in terms of execution time. Because at the outset you don't have a complete picture, premature optimization leads to a less than optimal use of your time. In this case, you'd probably have preferred to fix the latency issue than optimize class code.
I strongly believe that you should never reduce your code readability and good design because of performance optimizations.
If you are writing code where performance is critical it may be OK to lower the style and clarity of your code, but this does not hold true for the average enterprise application. Hardware evolves quickly and gets cheaper everyday. In the end you are writing code that is going to be read by other developers, so you'd better do a good job at it!
It's always beautiful to read code that has been carefully crafted, where every path has a test that helps you understand how it should be used. I don't really care if it is 50 ms slower than the spaghetti alternative which does lots of crazy stuff.
Yes you should skip optimizing for the unit test. Optimization when required usually makes the code more complex. Aim for simplicity. If you optimize for the unit test you may actually de-optimize for production.
If performance is really bad in the unit test, you may need to look at your design. Test in the application to see if performance is equally bad before optimizing.
EDIT: De-optimization is likely to occur when the data being handled varies in size. This is most likely to occur with classes that work with sets of data. One solution's response may be linear but start out slow, compared to another that is geometric but starts out fast. If the unit test uses a small set of data, then the geometric solution may be chosen for the unit test. When production hits the class with a large set of data, performance tanks.
Sorting algorithms are a classic case for this kind of behavior and resulting de-optimizations. Many other algorithms have similar characteristics.
EDIT2: My most successful optimization was the sort routine for a report where data was stored on disk in a memory-mapped file. The sort times were reasonable with moderate data sizes which did not require disk access. With larger data sets it could take days to process the data. Initial timings of the report showed: data selection 3 minutes, data sorting 3 days, and reporting 3 minutes. Investigation of the sort showed that it was a fully unoptimized bubble sort (n-1 full passes for a data set of size n), roughly O(n²). Changing the sorting algorithm reduced the sort time for this report to 3 minutes. I would not have expected a unit test to cover this case, and the original code was as simple (and fast) as you could get for small sets. The replacement was much more complex and slower for very small sets, but handled large sets faster with a more linear curve, O(n log n). (Note: no optimization was attempted until we had metrics.)
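To illustrate the shape of that de-optimization with a toy example (bubble sort versus std::sort; my own sketch, not the original report's code): both look acceptable on the small inputs a unit test would use, and only the large case exposes the roughly O(n²) behaviour.

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

// Deliberately naive bubble sort, roughly O(n^2): fine for tiny inputs,
// ruinous for large ones.
static void bubble_sort(std::vector<int>& v)
{
    for (std::size_t pass = 1; pass < v.size(); ++pass)
        for (std::size_t i = 0; i + 1 < v.size(); ++i)
            if (v[i + 1] < v[i]) std::swap(v[i], v[i + 1]);
}

static std::vector<int> random_data(std::size_t n)
{
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 1000000);
    std::vector<int> v(n);
    for (int& x : v) x = dist(gen);
    return v;
}

int main()
{
    for (std::size_t n : {1000, 10000, 50000}) {
        auto a = random_data(n), b = a;
        using clock = std::chrono::steady_clock;

        auto t0 = clock::now();
        bubble_sort(a);                       // ~n^2
        auto t1 = clock::now();
        std::sort(b.begin(), b.end());        // ~n log n
        auto t2 = clock::now();

        std::cout << n << " elements: bubble "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
                  << " ms, std::sort "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
                  << " ms\n";
    }
}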
In practice, I aim for a ten-fold improvement of a routine which takes at least 50% of the module run-time. Achieving this level of optimization for a routine using 55% of the run-time will save 50% of the total run-time.

Performance penalty for "if error then fail fast" in C++?

Is there any performance difference (in C++) between the two styles of writing if-else, as shown below (logically equivalent code) for the likely1 == likely2 == true path (likely1 and likely2 are meant here as placeholders for some more elaborate conditions)?
// Case (1):
if (likely1) {
    Foo();
    if (likely2) {
        Bar();
    } else {
        Alarm(2);
    }
} else {
    Alarm(1);
}

vs.

// Case (2):
if (!likely1) {
    Alarm(1);
    return;
}
Foo();
if (!likely2) {
    Alarm(2);
    return;
}
Bar();
I'd be very grateful for information on as many compilers and platforms as possible (but with gcc/x86 highlighted).
Please note I'm not interested in readability opinions on those two styles, neither in any "premature optimisation" claims.
EDIT: In other words, I'd like to ask if the two styles are at some point considered fully-totally-100% equivalent/transparent by a compiler (e.g. bit-by-bit equivalent AST at some point in a particular compiler), and if not, then what are the differences? For any (with a preference towards "modern" and gcc) compiler you know.
And, to make it more clear, I too don't suppose that it's going to give me much of a performance improvement, and that it usually would be premature optimization, but I am interested in whether and how much it can improve/degrade anything?
It depends greatly on the compiler, and the optimization settings. If the difference is crucial - implement both, and either analyze the assembly, or do benchmarks.
I have no answers for specific platforms, but I can make a few general points:
The traditional answer on non-modern processors without branch prediction, is that the first is likely to be more efficient since in the common case it takes fewer branches. But you seem interested in modern compilers and processors.
On modern processors, generally speaking, short forward branches are not expensive, whereas mispredicted branches may be expensive. By "expensive" I of course mean a few cycles.
Quite aside from this, the compiler is entitled to order basic blocks however it likes provided it doesn't change the logic. So when you write if (blah) {foo();} else {bar();}, the compiler is entitled to emit code like:
    evaluate condition blah
    jump_if_true else_label
    bar()
    jump endif_label
else_label:
    foo()
endif_label:
On the whole, gcc tends to emit things in roughly the order you write them, all else being equal. There are various things which make all else not equal, for example if you have the logical equivalent of bar(); return in two different places in your function, gcc might well coalesce those blocks, emit only one call to bar() followed by return, and jump or fall through to that from two different places.
There are two kinds of branch prediction - static and dynamic. Static means that the CPU instructions for the branch specify whether the condition is "likely", so that the CPU can optimize for the common case. Compilers might emit static branch predictions on some platforms, and if you're optimizing for that platform you might write code to take account of that. You can take account of it either by knowing how your compiler treats the various control structures, or by using compiler extensions. Personally I don't think it's consistent enough to generalize about what compilers will do. Look at the disassembly.
Dynamic branch prediction means that in hot code, the CPU will keep statistics for itself on how likely branches are to be taken, and optimize for the common case. Modern processors use various different dynamic branch prediction techniques: http://en.wikipedia.org/wiki/Branch_predictor. Performance-critical code pretty much is hot code, and as long as the dynamic branch prediction strategy works, it very rapidly optimizes hot code. There might be certain pathological cases that confuse particular strategies, but in general you can say that anything in a tight loop where there's a bias towards taken/not taken will be correctly predicted most of the time.
Sometimes it doesn't even matter whether the branch is correctly predicted or not, since some CPUs in some cases will include both possibilities in the instruction pipeline while it's waiting for the condition to be evaluated, and ditch the unnecessary option. Modern CPUs get complicated. Even much simpler CPU designs have ways of avoiding the cost of branching, though, such as conditional instructions on ARM.
Calls out of line to other functions will upset all such guesswork anyway. So in your example there may be small differences, and those differences may depend on the actual code in Foo, Bar and Alarm. Unfortunately it's not possible to distinguish between significant and insignificant differences, or to account for details of those functions, without getting into the "premature optimization" accusations you're not interested in.
It's almost always premature to micro-optimize code that isn't written yet. It's very hard to predict the performance of functions named Foo and Bar. Presumably the purpose of the question is to discern whether there's any common gotcha that should inform coding style. To which the answer is that, thanks to dynamic branch prediction, there is not. In hot code it makes very little difference how your conditions are arranged, and where it does make a difference that difference isn't as easily predictable as "it's faster to take / not take the branch in an if condition".
If this question was intended to apply to just one single program with this code proven to be hot, then of course it can be tested, there's no need to generalize.
It is compiler dependent. Check out the gcc documentation on __builtin_expect. Your compiler may have something similar. Note that you really should be concerned about premature optimization.
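For reference, with gcc the hint looks roughly like this; the LIKELY/UNLIKELY macro names are just a common convention (only the builtin itself is gcc's), and C++20 later added the [[likely]] and [[unlikely]] attributes for the same purpose.

#include <cstdio>

// Conventional wrapper macros around gcc's __builtin_expect; the macro
// names are illustrative.
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int process(int fd)
{
    if (UNLIKELY(fd < 0)) {      // tell gcc this branch is the cold path
        std::fprintf(stderr, "bad descriptor\n");
        return -1;
    }
    // ... hot path ...
    return 0;
}

int main() { return process(3) == 0 ? 0 : 1; }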
The answer depends a lot on the type of "likely". If it is an integer constant expression, the compiler can optimize it and both cases will be equivalent. Otherwise, it will be evaluated at runtime and can't be optimized much.
Thus, case 2 is generally more efficient than case 1.
As input from real-time embedded systems, which I work with, your "case 2" is often the norm for code that is safety- and/or performance critical. Style guides for safety-critical embedded systems often allow this syntax so a function can quit quickly upon errors.
Generally, style guides will frown upon the "case 2" syntax, but make an exception to allow several returns in one function either if
1) the function needs to quit quickly and handle the error, or
2) if one single return at the end of the function leads to less readable code, which is often the case for various protocol and data parsers.
If you are this concerned about performance, I assume you are using profile guided optimization.
If you are using profile guided optimization, the two variants you have proposed are exactly the same.
In any event, the performance of what you are asking about is completely overshadowed by performance characteristics of things not evident in your code samples, so we really can not answer this. You have to test the performance of both.
Though I'm with everyone else here insofar as optimizing a branch makes no sense without having profiled and actually having found a bottleneck... if anything, it makes sense to optimize for the likely case.
Both likely1 and likely2 are likely, as their names suggest. Thus, handling the (also likely) combination of both being true first would likely be fastest:
if (likely1 && likely2)
{
    ... // happens most of the time
}
else
{
    if (likely1)
        ...
    if (likely2)
        ...
    else if (!likely1 && !likely2) // happens almost never
        ...
}
Note that the second else is probably not necessary, a decent compiler will figure out that the last if clause cannot possibly be true if the previous one was, even if you don't explicitly tell it.

Multithreaded paranoia

This is a complex question, please consider carefully before answering.
Consider this situation. Two threads (a reader and a writer) access a single global int. Is this safe? Normally, I would respond without thought, yes!
However, it seems to me that Herb Sutter doesn't think so. In his articles on effective concurrency he discusses a flawed lock-free queue and the corrected version.
At the end of the first article and the beginning of the second he discusses a rarely considered trait of variables: write ordering. Ints are atomic, good, but ints aren't necessarily ordered, which could destroy any lock-free algorithm, including my above scenario. I fully agree that the only way to guarantee correct multithreaded behavior on all platforms present and future is to use atomics (a.k.a. memory barriers) or mutexes.
My question: is write re-ordering ever a problem on real hardware? Or is the multithreaded paranoia just being pedantic?
What about classic uniprocessor systems?
What about simpler RISC processors like an embedded PowerPC?
Clarification: I'm more interested in what Mr. Sutter said about the hardware (processor/cache) reordering variable writes. I can stop the optimizer from breaking code with compiler switches or hand inspection of the assembly post-compilation. However, I'd like to know if the hardware can still mess up the code in practice.
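To make "the hardware can still mess up the code" concrete, here is the classic store-buffering litmus test, sketched with C++11 relaxed atomics (an illustrative rewrite added for reference, not code from Sutter's articles):

#include <atomic>
#include <cstdio>
#include <thread>

// Classic store-buffering litmus test. With relaxed ordering (and, at the
// hardware level, with plain x86 stores and loads) the outcome
// r1 == r2 == 0 is possible: each CPU's store is still in its store
// buffer when the other CPU performs its load.
std::atomic<int> x{0}, y{0};
int r1, r2;

int main()
{
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    if (r1 == 0 && r2 == 0)
        std::puts("both loads saw 0: the stores were reordered after the loads");
    // With memory_order_seq_cst (or explicit fences) this outcome is forbidden.
}

Even on x86, where these compile to plain stores and loads, both loads can observe 0, because each core's store still sits in its store buffer while the other core reads; that is exactly the hardware write re-ordering the question is about.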
Your idea of inspecting the assembly is not good enough; the reordering can happen at the hardware level.
To answer your question "is this ever a problem on real hardware?": yes! In fact I've run into that problem myself.
Is it OK to skirt the issue with uniprocessor systems or other special-case situations? I would argue "no" because five years from now you might need to run on multi-core after all, and then finding all these locations will be tricky (impossible?).
One exception: software designed for embedded hardware applications where you indeed have complete control over the hardware. In fact I have "cheated" like this in those situations, e.g. on an ARM processor.
Yup - use memory barriers to prevent instruction reordering where needed. In some C++ compilers, the volatile keyword has been expanded to insert implicit memory barriers for every read and write - but this isn't a portable solution. (Likewise with the Interlocked* win32 APIs). Vista even adds some new finer-grained Interlocked APIs which let you specify read or write semantics.
Unfortunately, C++ has such a loose memory model that any kind of code like this is going to be non-portable to some extent and you'll have to write different versions for different platforms.
Like you said, because of reordering done at cache or processor level, you actually do need some sort of memory barrier to ensure proper synchronisation, especially for multi-processors (and especially on non-x86 platforms). (I am given to believe that single-processor systems don't have these issues, but don't quote me on this---I'm certainly more inclined to play safe and do the synchronised access anyway.)
We have run into the problem, albeit on Itanium processors where the instruction reordering is more aggressive than x86/x64.
The fix was to use an Interlocked instruction, since there was (at the time) no way of telling the compiler to simply put a write barrier after the assignment.
We really need a language extension to deal with this cleanly. Use of volatile (if supported by the compiler) is too coarse-grained for the cases where you are trying to squeeze as much performance out of a piece of code as possible.
is this ever a problem on real hardware?
Absolutely, particularly now with the move to multiple cores for current and future CPUs. If you depend on ordered atomicity to implement features in your application, and you are unable to guarantee this requirement under all conditions (e.g. a customer moves from a single-core CPU to a multi-core CPU) via your chosen platform or the use of synchronization primitives, then you are just waiting for a problem to occur.
Quoting from the referred to Herb Sutter article (second one)
Ordered atomic variables are spelled in different ways on popular platforms and environments. For example:
volatile in C#/.NET, as in volatile int.
volatile or Atomic* in Java, as in volatile int, AtomicInteger.
atomic<T> in C++0x, the forthcoming ISO C++ Standard, as in atomic<int>.
I have not seen how C++0x implements ordered atomicity so I'm unable to specify whether the upcoming language feature is a pure library implementation or relies on changes to the language also. You could review the proposal to see if it can be incorporated as a non-standard extension to your current tool chain until the new standard is available, it may even be available already for your situation.
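For later reference, the proposal did become std::atomic<T> in C++11: the memory model is part of the core language, while atomic<T> is the library interface to it. A minimal sketch of an ordered atomic, whose operations default to sequentially consistent ordering:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> ordered{0};   // operations default to memory_order_seq_cst
int payload = 0;

int main()
{
    std::thread writer([] {
        payload = 42;
        ordered.store(1);              // seq_cst store: publishes payload
    });
    std::thread reader([] {
        while (ordered.load() == 0) {} // seq_cst load: spins until published
        assert(payload == 42);         // guaranteed visible here
    });
    writer.join();
    reader.join();
}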
It is a problem on real hardware. A friend of mine works for IBM and makes his living primarily by sussing out this kind of problem in customers' codes.
If you want to see how bad things can get, search for academic papers on the Java Memory Model (and also now the C++ memory model). Given the reordering that real hardware can do, trying to figure out what's safe in a high-level language is a nightmare.
No, this isn't safe, and there is real hardware available that exhibits this problem; for example, the memory model of the PowerPC chip in the Xbox 360 allows writes to be reordered. This is exacerbated by the lack of barriers in the intrinsics; see this article on MSDN for more details.
The answer to the question "is it safe?" is inherently ambiguous.
It's always safe, even for doubles, in the sense that your computer won't catch fire.
It's safe, in the sense that you always will get a value that the int held at some time in the past,
It's not safe, in the sense that you may get a value which is/will be updated by another thread.
"Atomic" means that you get the second guarantee. Since double usually is not atomic, you could get 32 old and 32 new bits. That's clearly unsafe.
When I asked the question I was most interested in uniprocessor PowerPC. In one of the comments InSciTek Jeff mentioned the PowerPC SYNC and ISYNC instructions. Those were the key to a definitive answer. I found it here on IBM's site.
The article is large and pretty dense, but the takeaway is: no, it is not safe. On older PowerPCs the memory optimizers were not sophisticated enough to cause problems on a uniprocessor. However, the newer ones are much more aggressive and can break even simple access to a global int.

Self Testing Systems

I had an idea I was mulling over with some colleagues. None of us knew whether or not it exists currently.
The Basic Premise is to have a system that has 100% uptime but can become more efficient dynamically.
Here is the scenario:
* So we hash out a system quickly to a specified set of interfaces. It has zero optimizations, yet we are confident that it is 100% stable (dubious, but for the sake of this scenario please play along).
* We then profile the original classes, and start to program replacements for the bottlenecks.
* The original and the replacement are initiated simultaneously and synchronized.
* An original is allowed to run to completion: if a replacement hasn't completed, it is vetoed by the system as a replacement for the original.
* A replacement must always return the same value as the original, for a specified number of times, and for a specific range of values, before it is adopted as a replacement for the original.
* If an exception occurs after a replacement is adopted, the system automatically tries the same operation with a class which was superseded by it.
Have you seen a similar concept in practise? Critique Please ...
Below are comments written after the initial question in regards to posts:
* The system demonstrates a Darwinian approach to system evolution.
* The original and replacement would run in parallel, not in series.
* Race conditions are an inherent issue in multi-threaded apps and I acknowledge them.
I believe this idea to be an interesting theoretical debate, but not very practical for the following reasons:
To make sure the new version of the code works well, you need to have superb automatic tests, which is a goal that is very hard to achieve and one that many companies fail to develop. You can only go on with implementing the system after such automatic tests are in place.
The whole point of this system is performance tuning, that is, a specific version of the code is replaced by a version that supersedes it in performance. For most applications today, performance is of minor importance. Meaning, the overall performance of most applications is adequate. Just think about it: you probably rarely find yourself complaining that "this application is excruciatingly slow"; instead you usually find yourself complaining about the lack of a specific feature, stability issues, UI issues, etc. Even when you do complain about slowness, it's usually the overall slowness of your system and not just a specific application (there are exceptions, of course).
For applications or modules where performance is a big issue, the way to improve them is usually to identify the bottlenecks, write a new version and test it independently of the system first, using some kind of benchmarking. Benchmarking the new version of the entire application might also be necessary, of course, but in general I think this process would only take place a very small number of times (following the 20%-80% rule). Doing this process "manually" in these cases is probably easier and more cost-effective than the described system.
What happens when you add features, fix non-performance related bugs etc.? You don't get any benefit from the system.
Running the two versions in conjunction to compare their performance has far more problems than you might think: not only might you have race conditions, but if the input is not an appropriate benchmark, you might get the wrong result (e.g. if you happen to get loads of small data packets while 90% of the time the input is large data packets). Furthermore, it might just be impossible (for example, if the actual code changes the data, you can't run them in conjunction).
The only "environment" where this sounds useful and actually "a must" is a "genetic" system that generates new versions of the code by itself, but that's a whole different story and not really widely applicable...
A system that runs performance benchmarks while operating is going to be slower than one that doesn't. If the goal is to optimise speed, why wouldn't you benchmark independently and import the fastest routines once they are proven to be faster?
And your idea of starting routines simultaneously could introduce race conditions.
Also, if a goal is to ensure 100% uptime you would not want to introduce untested routines since they might generate uncatchable exceptions.
Perhaps your ideas have merit as a harness for benchmarking rather than an operational system?
Have I seen a similar concept in practice? No. But I'll propose an approach anyway.
It seems like most of your objectives would be met by some sort of super source control system, which could be implemented with CruiseControl.
CruiseControl can run unit tests to ensure correctness of the new version.
You'd have to write a CruiseControl builder plugin that would execute the new version of your system against a series of existing benchmarks to ensure that the new version is an improvement.
If the CruiseControl build loop passes, then the new version would be accepted. Such a process would take considerable effort to implement, but I think it feasible. The unit tests and benchmark builder would have to be pretty slick.
I think an Inversion of Control Container like OSGi or Spring could do most of what you are talking about. (dynamic loading by name)
You could build on top of their stuff. Then implement your code to:
* divide work units into discrete modules / classes (strategy pattern)
* identify each module by a unique name and associate a capability with it
* when a module is requested, it is requested by capability, and one of the modules with that capability is used at random
* keep performance stats (get the system tick before and after execution and store the result)
* if an exception occurs, mark that module as "do not use" and log the exception.
If the modules do their work by message passing you can store the message until the operation completes successfully and redo with another module if an exception occurs.
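A rough sketch of the module/capability idea in C++ (all names are hypothetical, and a plain map stands in for the OSGi/Spring machinery): modules register under a capability, one is picked at random, its run time is recorded, and a throwing module is disabled.

#include <chrono>
#include <functional>
#include <iostream>
#include <map>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

// A capability maps to several interchangeable module implementations
// (strategy pattern); we pick one at random, time it, and disable it if
// it throws.
struct Module {
    std::string name;
    std::function<int(int)> run;
    long long total_micros;
    bool usable;
    Module(std::string n, std::function<int(int)> f)
        : name(std::move(n)), run(std::move(f)), total_micros(0), usable(true) {}
};

class Registry {
    std::map<std::string, std::vector<Module>> byCapability_;
    std::mt19937 gen_{std::random_device{}()};
public:
    void add(const std::string& cap, Module m) { byCapability_[cap].push_back(std::move(m)); }

    int request(const std::string& cap, int input)
    {
        auto& mods = byCapability_.at(cap);
        std::uniform_int_distribution<std::size_t> pick(0, mods.size() - 1);
        for (;;) {
            Module& m = mods[pick(gen_)];
            if (!m.usable) continue;               // skip disabled modules
            const auto start = std::chrono::steady_clock::now();
            try {
                int result = m.run(input);
                m.total_micros += std::chrono::duration_cast<std::chrono::microseconds>(
                                      std::chrono::steady_clock::now() - start).count();
                return result;
            } catch (const std::exception& e) {
                m.usable = false;                  // mark "do not use" and log
                std::cerr << m.name << " failed: " << e.what() << '\n';
            }
        }
    }
};

int main()
{
    Registry r;
    r.add("square", {"naive",  [](int x) { return x * x; }});
    r.add("square", {"faulty", [](int) -> int { throw std::runtime_error("bug"); }});
    std::cout << r.request("square", 7) << '\n';   // 49, possibly after one retry
}

The parts the IoC container actually contributes (dynamic loading by name, lifecycle) are omitted here; this only shows the capability lookup, random selection, timing and exception handling described in the list above.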
For design ideas for high availability systems, check out Erlang.
I don't think code will learn to be better by itself. However, some runtime parameters can easily be adjusted to optimal values, but that would be just regular programming, right?
About the on-the-fly change, I've wondered the same and would build it on top of Lua, or a similar dynamic language. One could have parts that are loaded and, if they are replaced, reloaded into use. No rocket science in that, either. If the "old code" is still running, that's perfectly all right, since unlike with DLLs, the file is needed only when reading it in, not while executing code that came from there.
Usefulness? Naa...