Debugging or mapping out a large state machine? - c++

I'm trying to debug a chunk of code that's mostly a straightforward 16-state state machine, although some of the transitions are not so simple (the data the state changes operate on is about 200 bytes spread across a couple of C++ classes).
We're finding the machine ending up in a "final" state much earlier than expected. Since I'm not yet intimately familiar with the code, I'm hoping to map out the different states and transitions in a way that will make it easier for me to quickly identify and debug the different transition paths.
Are there any useful tools or techniques for mapping out a state machine like this?
It might be worth noting that I'm doing this from a reverse-engineering standpoint, so there is no planning documentation for the system available to me.

You can look into formal model checking tools, such as UPPAAL. This tool can be used for modelling and verification of any system that can be modelled as networks of timed automata - this includes state machines. I have used it previously to verify e.g. invariants and reachability of possible states.


How to write a CANopen stack?

I have a similar problem to this: how to program a simple CANopen layer.
I read the answers, but I have to program a CANopen layer on my own; I cannot get a commercial one. So are there any basics of writing a CANopen stack (or layer, I'm not certain about the difference)? I don't even know where to start.
If it's required here's some information :
My master device is a BeagleBone Black running QNX. QNX has a generic CAN library, I think, but nothing specific to CANopen. My slave is a militarized brushless motor controller. I'm writing in C++.
I have a documentation about the general requirements of my system.
There are 2 RPDOs and 4 TPDOs, transmission is synchronous, there is no stopped mode (so no heartbeat or node guarding), and all message information is specified (size, format, related node IDs, etc.).
There are actually at least 4 open source projects that implement CANopen:
CanFestival is the oldest and might be the most mature solution. License: LGPLv2.
CANopenNode is aimed at micro-controllers. License: GPLv2.
Lely CANopen is a library for implementing CANopen masters and slaves. License: Apache version 2.
openCANopen is a master that runs on Linux. License: ISC. Note: I am the author of this project.
I would have posted links, but apparently I don't have enough "reputation".
openCANopen also includes some utilities such as a daemon for forwarding traffic over TCP and a program that interprets and dumps CANopen traffic to standard output.
Lely CANopen is actually of pretty decent code quality and I might have used it if it'd been available when I started writing my own implementation. However, I have not tried using it, so I can't really say which implementation is "better". I can only say that they are different and one or the other may suit your needs better.
Now, I doubt that any of those implementations will work straight out of the box on QNX. They will either have to be adapted or you can copy individual parts of the code into your own implementations. At least that should save you some time.
The quick and dirty work-around is to only implement the bare minimum (just don't market it as CANopen or claim CANopen compliance):
Support for those specific RPDOs/TPDOs that the other node will send/expect to receive. Use fixed COB-IDs (CAN identifiers). Forget about PDO mapping and PDO configuration; use fixed settings.
Implement an NMT bootup message.
Implement NMT state transitions between pre-operational and operational (your node needs to respond to these from the NMT master).
Implement some means to set the node id. Easiest might be to hard code it as a program constant.
If you are lucky, this is all that is needed. If you are unlucky, there will be SDO communication, meaning you will have to implement the SDO protocol and also the whole Object Dictionary. Apart from that, the above is fairly straightforward and not that much work.
If you need the Object Dictionary, there might be no way around getting a full-blown protocol stack. You'll also need to apply for a vendor ID from CAN-in-Automation, but it's a one-time fee (no royalties).
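To make the bare-minimum approach concrete, here is a rough C++ sketch. The frame type and can_send() function are placeholders for whatever CAN API your QNX driver exposes; the COB-IDs used are the standard CANopen defaults (NMT on 0x000, boot-up on 0x700 + node ID, TPDO1 on 0x180 + node ID).

    #include <cstdint>

    // Hypothetical frame type and send function; adapt to the CAN API your
    // QNX driver actually provides.
    struct can_frame {
        std::uint32_t id;
        std::uint8_t  len;
        std::uint8_t  data[8];
    };
    void can_send(const can_frame& frame);   // assumed to exist elsewhere

    // Bare-minimum CANopen slave: fixed default COB-IDs, hard-coded node id,
    // boot-up message, and pre-operational/operational NMT handling only.
    class MinimalCanopenNode {
    public:
        explicit MinimalCanopenNode(std::uint8_t nodeId) : nodeId_(nodeId) {}

        void sendBootup() {
            can_frame f{};
            f.id = 0x700u + nodeId_;     // NMT error control COB-ID
            f.len = 1;
            f.data[0] = 0x00;            // boot-up value
            can_send(f);
        }

        void onNmtCommand(const can_frame& f) {
            // NMT master sends on COB-ID 0x000: [command, node id (0 = all)]
            if (f.id != 0x000u || (f.data[1] != 0 && f.data[1] != nodeId_)) return;
            if (f.data[0] == 0x01) operational_ = true;    // start remote node
            if (f.data[0] == 0x80) operational_ = false;   // enter pre-operational
        }

        void sendTpdo1(const std::uint8_t* payload, std::uint8_t len) {
            if (!operational_) return;   // PDOs are only valid in operational state
            can_frame f{};
            f.id = 0x180u + nodeId_;     // default TPDO1 COB-ID
            f.len = len;
            for (std::uint8_t i = 0; i < len && i < 8; ++i) f.data[i] = payload[i];
            can_send(f);
        }

    private:
        std::uint8_t nodeId_;
        bool operational_ = false;
    };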
I'm from Embedded Office and want to add my two cents to your search, even if it's late. First I want to mention that the reason we didn't put drivers into the canopen-stack repository is the complexity of embedded software development on multiple targets with multiple compilers, and my goal of providing running software wherever possible. With just a library, it is hard to identify problems during usage.
The good news: I have set up an environment that makes the different targets and compilers manageable by a single maintainer (me). The canopen-stack is developed with LLVM on host machines, and a first demo is provided for STM32F7xx microcontrollers. More is coming, so stay tuned :-)

How to make a GameBoy / GameBoy Advance Emulator? [duplicate]

How do emulators work? When I see NES/SNES or C64 emulators, it astounds me.
Do you have to emulate the processor of those machines by interpreting its particular assembly instructions? What else goes into it? How are they typically designed?
Can you give any advice for someone interested in writing an emulator (particularly a game system)?
Emulation is a multi-faceted area. Here are the basic ideas and functional components. I'm going to break it into pieces and then fill in the details via edits. Many of the things I'm going to describe will require knowledge of the inner workings of processors -- assembly knowledge is necessary. If I'm a bit too vague on certain things, please ask questions so I can continue to improve this answer.
Basic idea:
Emulation works by handling the behavior of the processor and the individual components. You build each individual piece of the system and then connect the pieces much like wires do in hardware.
Processor emulation:
There are three ways of handling processor emulation:
Interpretation
Dynamic recompilation
Static recompilation
With all of these paths, you have the same overall goal: execute a piece of code to modify processor state and interact with 'hardware'. Processor state is a conglomeration of the processor registers, interrupt handlers, etc for a given processor target. For the 6502, you'd have a number of 8-bit integers representing registers: A, X, Y, P, and S; you'd also have a 16-bit PC register.
With interpretation, you start at the IP (instruction pointer -- also called PC, program counter) and read the instruction from memory. Your code parses this instruction and uses this information to alter processor state as specified by your processor. The core problem with interpretation is that it's very slow; each time you handle a given instruction, you have to decode it and perform the requisite operation.
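To make the interpretation loop concrete, here is a minimal C++ sketch of a fetch/decode/execute core in the 6502 style described above. It handles only two opcodes (LDA immediate and INX) and ignores addressing modes, cycle counts, and most flags; a real interpreter covers the full instruction set.

    #include <cstdint>
    #include <vector>

    // Minimal sketch of an interpreting core in the 6502 style described above.
    // Only two opcodes are decoded; a real core handles them all.
    struct Cpu6502 {
        uint8_t  a = 0, x = 0, y = 0, p = 0, s = 0xFD;   // registers
        uint16_t pc = 0;
        std::vector<uint8_t> memory = std::vector<uint8_t>(0x10000);

        void step() {
            uint8_t opcode = memory[pc++];   // fetch
            switch (opcode) {                // decode and execute
                case 0xA9:                   // LDA #imm: load accumulator
                    a = memory[pc++];
                    setZeroAndNegative(a);
                    break;
                case 0xE8:                   // INX: increment X
                    ++x;
                    setZeroAndNegative(x);
                    break;
                default:
                    break;                   // unhandled opcode in this sketch
            }
        }

        void setZeroAndNegative(uint8_t value) {
            // Z is bit 1, N is bit 7 of the status register P
            p = (p & ~0x82) | (value == 0 ? 0x02 : 0x00) | (value & 0x80);
        }
    };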
With dynamic recompilation, you iterate over the code much like interpretation, but instead of just executing opcodes, you build up a list of operations. Once you reach a branch instruction, you compile this list of operations to machine code for your host platform, then you cache this compiled code and execute it. Then when you hit a given instruction group again, you only have to execute the code from the cache. (BTW, most people don't actually make a list of instructions but compile them to machine code on the fly -- this makes it more difficult to optimize, but that's out of the scope of this answer, unless enough people are interested)
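And a rough sketch of the "list of operations" flavour of dynamic recompilation: decode until a branch, cache the block keyed by its start address, and replay the cached block on later visits. Here std::function stands in for emitted host machine code, so this shows the structure rather than the performance win; the opcode handling is deliberately stubbed.

    #include <cstdint>
    #include <functional>
    #include <unordered_map>
    #include <vector>

    struct DynarecCpu {
        uint16_t pc = 0;
        std::vector<uint8_t> memory = std::vector<uint8_t>(0x10000);

        using Op    = std::function<void(DynarecCpu&)>;
        using Block = std::vector<Op>;
        std::unordered_map<uint16_t, Block> blockCache;

        void runBlockAt(uint16_t start) {
            auto it = blockCache.find(start);
            if (it == blockCache.end())
                it = blockCache.emplace(start, translateBlock(start)).first;
            for (auto& op : it->second)    // execute the cached operations
                op(*this);
        }

        Block translateBlock(uint16_t addr) {
            Block block;
            for (;;) {
                uint8_t opcode = memory[addr++];
                block.push_back(translateOne(opcode));
                if (isBranch(opcode))      // stop translating at a branch
                    break;
            }
            return block;
        }

        // Placeholders: a real recompiler decodes the full instruction set.
        static bool isBranch(uint8_t opcode) { return opcode == 0x4C; /* JMP abs */ }
        static Op translateOne(uint8_t) {
            return [](DynarecCpu&) { /* perform the decoded operation */ };
        }
    };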
With static recompilation, you do the same as in dynamic recompilation, but you follow branches. You end up building a chunk of code that represents all of the code in the program, which can then be executed with no further interference. This would be a great mechanism if it weren't for the following problems:
Code that isn't in the program to begin with (e.g. compressed, encrypted, generated/modified at runtime, etc) won't be recompiled, so it won't run
It's been proven that finding all the code in a given binary is equivalent to the Halting problem
These combine to make static recompilation completely infeasible in 99% of cases. For more information, Michael Steil has done some great research into static recompilation -- the best I've seen.
The other side to processor emulation is the way in which you interact with hardware. This really has two sides:
Processor timing
Interrupt handling
Processor timing:
Certain platforms -- especially older consoles like the NES, SNES, etc. -- require your emulator to have strict timing to be completely compatible. With the NES, you have the PPU (picture processing unit), which requires that the CPU put pixels into its memory at precise moments. If you use interpretation, you can easily count cycles and emulate proper timing; with dynamic/static recompilation, things are a lot more complex.
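As a small illustration of the cycle-counting approach (Cpu, Ppu and Apu are hypothetical classes used only for this sketch; the factor of 3 reflects the NES PPU running three dots per CPU cycle):

    // Sketch of cycle counting with an interpreter: after each CPU instruction,
    // advance the other components by the cycles it consumed.
    void runFrame(Cpu& cpu, Ppu& ppu, Apu& apu) {
        while (!ppu.frameComplete()) {
            int cycles = cpu.step();    // execute one instruction, return its cost
            ppu.step(cycles * 3);       // keep the PPU in lock-step with the CPU
            apu.step(cycles);
        }
    }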
Interrupt handling:
Interrupts are the primary mechanism by which the CPU communicates with hardware. Generally, your hardware components will tell the CPU what interrupts it cares about. This is pretty straightforward -- when your code raises a given interrupt, you look at the interrupt handler table and call the proper callback.
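A minimal sketch of such an interrupt handler table might look like this (the class and names are illustrative, not taken from any particular emulator):

    #include <cstdint>
    #include <functional>
    #include <unordered_map>

    // Components register a callback for the interrupt lines they care about;
    // raising an interrupt just looks up and calls the handler.
    class InterruptController {
    public:
        void registerHandler(uint8_t irq, std::function<void()> handler) {
            handlers_[irq] = std::move(handler);
        }
        void raise(uint8_t irq) {
            auto it = handlers_.find(irq);
            if (it != handlers_.end()) it->second();   // dispatch to the callback
        }
    private:
        std::unordered_map<uint8_t, std::function<void()>> handlers_;
    };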
Hardware emulation:
There are two sides to emulating a given hardware device:
Emulating the functionality of the device
Emulating the actual device interfaces
Take the case of a hard-drive. The functionality is emulated by creating the backing storage, read/write/format routines, etc. This part is generally very straightforward.
The actual interface of the device is a bit more complex. This is generally some combination of memory-mapped registers (e.g. parts of memory that the device watches for changes, to do signaling) and interrupts. For a hard-drive, you may have a memory-mapped area where you place read commands, writes, etc., and then read this data back.
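For illustration, here is a small sketch of memory-mapped registers on an emulated bus: accesses that fall inside the device's address window are routed to the device instead of plain RAM. The DiskController class, its register layout, and the address range are all invented for this example.

    #include <cstdint>
    #include <vector>

    struct DiskController {
        uint8_t command = 0, status = 0x01;   // 0x01 means "ready" in this made-up layout
        void write(uint16_t reg, uint8_t value) {
            if (reg == 0) { command = value; /* kick off the emulated operation */ }
        }
        uint8_t read(uint16_t reg) const { return reg == 1 ? status : 0; }
    };

    class Bus {
    public:
        uint8_t read(uint16_t addr) const {
            if (addr >= 0x4000 && addr < 0x4010) return disk_.read(addr - 0x4000);
            return ram_[addr];
        }
        void write(uint16_t addr, uint8_t value) {
            if (addr >= 0x4000 && addr < 0x4010) { disk_.write(addr - 0x4000, value); return; }
            ram_[addr] = value;
        }
    private:
        std::vector<uint8_t> ram_ = std::vector<uint8_t>(0x10000);
        DiskController disk_;
    };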
I'd go into more detail, but there are a million ways you can go with it. If you have any specific questions here, feel free to ask and I'll add the info.
Resources:
I think I've given a pretty good intro here, but there are a ton of additional areas. I'm more than happy to help with any questions; I've been very vague in most of this simply due to the immense complexity.
Obligatory Wikipedia links:
Emulator
Dynamic recompilation
General emulation resources:
Zophar -- This is where I got my start with emulation, first downloading emulators and eventually plundering their immense archives of documentation. This is the absolute best resource you can possibly have.
NGEmu -- Not many direct resources, but their forums are unbeatable.
RomHacking.net -- The documents section contains resources regarding machine architecture for popular consoles
Emulator projects to reference:
IronBabel -- This is an emulation platform for .NET, written in Nemerle and recompiles code to C# on the fly. Disclaimer: This is my project, so pardon the shameless plug.
BSnes -- An awesome SNES emulator with the goal of cycle-perfect accuracy.
MAME -- The arcade emulator. Great reference.
6502asm.com -- This is a JavaScript 6502 emulator with a cool little forum.
dynarec'd 6502asm -- This is a little hack I did over a day or two. I took the existing emulator from 6502asm.com and changed it to dynamically recompile the code to JavaScript for massive speed increases.
Processor recompilation references:
The research into static recompilation done by Michael Steil (referenced above) culminated in this paper and you can find source and such here.
Addendum:
It's been well over a year since this answer was submitted and with all the attention it's been getting, I figured it's time to update some things.
Perhaps the most exciting thing in emulation right now is libcpu, started by the aforementioned Michael Steil. It's a library intended to support a large number of CPU cores, which use LLVM for recompilation (static and dynamic!). It's got huge potential, and I think it'll do great things for emulation.
emu-docs has also been brought to my attention, which houses a great repository of system documentation, which is very useful for emulation purposes. I haven't spent much time there, but it looks like they have a lot of great resources.
I'm glad this post has been helpful, and I'm hoping I can get off my arse and finish up my book on the subject by the end of the year/early next year.
A guy named Victor Moya del Barrio wrote his thesis on this topic. A lot of good information across 152 pages. You can download the PDF here.
If you don't want to register with scribd, you can google for the PDF title, "Study of the techniques for emulation programming". There are a couple of different sources for the PDF.
Emulation may seem daunting, but it is actually quite a bit easier than simulating.
Any processor typically has a well-written specification that describes states, interactions, etc.
If you did not care about performance at all, then you could easily emulate most older processors using very elegant object oriented programs. For example, an X86 processor would need something to maintain the state of registers (easy), something to maintain the state of memory (easy), and something that would take each incoming command and apply it to the current state of the machine. If you really wanted accuracy, you would also emulate memory translations, caching, etc., but that is doable.
In fact, many microchip and CPU manufacturers test programs against an emulator of the chip and then against the chip itself, which helps them find out whether there are issues in the specification of the chip or in the actual implementation of the chip in hardware. For example, it is possible to write a chip specification that would result in deadlocks, and when a deadlock occurs in the hardware it's important to see if it can be reproduced in the specification, since that indicates a greater problem than something in the chip implementation.
Of course, emulators for video games usually care about performance so they don't use naive implementations, and they also include code that interfaces with the host system's OS, for example to use drawing and sound.
Considering the very slow performance of old video games (NES/SNES, etc.), emulation is quite easy on modern systems. In fact, it's even more amazing that you could just download a set of every SNES game ever or any Atari 2600 game ever, considering that when these systems were popular having free access to every cartridge would have been a dream come true.
I know that this question is a bit old, but I would like to add something to the discussion. Most of the answers here center around emulators interpreting the machine instructions of the systems they emulate.
However, there is a very well-known exception to this called "UltraHLE" (Wikipedia article). UltraHLE, one of the most famous emulators ever created, emulated commercial Nintendo 64 games (with decent performance on home computers) at a time when it was widely considered impossible to do so. As a matter of fact, Nintendo was still producing new titles for the Nintendo 64 when UltraHLE was created!
For the first time, I saw articles about emulators in print magazines where before, I had only seen them discussed on the web.
The concept of UltraHLE was to make the impossible possible by emulating C library calls instead of machine-level calls.
Something worth taking a look at is Imran Nazar's attempt at writing a Gameboy emulator in JavaScript.
Having created my own emulator of the BBC Microcomputer of the 80s (type VBeeb into Google), there are a number of things to know.
You're not emulating the real thing as such, that would be a replica. Instead, you're emulating State. A good example is a calculator, the real thing has buttons, screen, case etc. But to emulate a calculator you only need to emulate whether buttons are up or down, which segments of LCD are on, etc. Basically, a set of numbers representing all the possible combinations of things that can change in a calculator.
You only need the interface of the emulator to appear and behave like the real thing. The more convincing this is the closer the emulation is. What goes on behind the scenes can be anything you like. But, for ease of writing an emulator, there is a mental mapping that happens between the real system, i.e. chips, displays, keyboards, circuit boards, and the abstract computer code.
To emulate a computer system, it's easiest to break it up into smaller chunks and emulate those chunks individually. Then string the whole lot together for the finished product. Much like a set of black boxes with inputs and outputs, which lends itself beautifully to object oriented programming. You can further subdivide these chunks to make life easier.
Practically speaking, you're generally looking to write for speed and fidelity of emulation. This is because software on the target system will (may) run more slowly than the original hardware on the source system. That may constrain the choice of programming language, compilers, target system etc.
Further to that, you have to circumscribe what you're prepared to emulate. For example, it's not necessary to emulate the voltage state of transistors in a microprocessor, but it's probably necessary to emulate the state of the register set of the microprocessor.
Generally speaking, the finer the level of detail of the emulation, the more fidelity you'll get to the original system.
Finally, information for older systems may be incomplete or non-existent. So getting hold of original equipment is essential, or at least prising apart another good emulator that someone else has written!
Yes, you have to interpret the whole binary machine code mess "by hand". Not only that, most of the time you also have to simulate some exotic hardware that doesn't have an equivalent on the target machine.
The simple approach is to interpret the instructions one-by-one. That works well, but it's slow. A faster approach is recompilation - translating the source machine code to target machine code. This is more complicated, as most instructions will not map one-to-one. Instead you will have to make elaborate work-arounds that involve additional code. But in the end it's much faster. Most modern emulators do this.
When you develop an emulator you are interpreting the processor assembly that the system is working on (Z80, 8080, PS CPU, etc.).
You also need to emulate all peripherals that the system has (video output, controller).
You should start by writing emulators for simple systems like the good old Game Boy (which uses a Z80-like processor, if I'm not mistaken) or the C64.
Emulators are very hard to create since there are many hacks (as in unusual effects), timing issues, etc. that you need to simulate.
For an example of this, see http://queue.acm.org/detail.cfm?id=1755886.
That will also show you why you ‘need’ a multi-GHz CPU for emulating a 1MHz one.
Also check out Darek Mihocka's Emulators.com for great advice on instruction-level optimization for JITs, and many other goodies on building efficient emulators.
I've never done anything so fancy as to emulate a game console, but I did take a course once where the assignment was to write an emulator for the machine described in Andrew Tanenbaum's Structured Computer Organization. That was fun and gave me a lot of aha moments. You might want to pick that book up before diving into writing a real emulator.
Advice on emulating a real system or your own thing?
I can say that emulators work by emulating the ENTIRE hardware. Maybe not down to the circuit level (i.e., not moving bits around exactly like the hardware would; moving the byte is the end result, so copying the byte is fine). Emulators are very hard to create since there are many hacks (as in unusual effects), timing issues, etc. that you need to simulate. If one (input) piece is wrong, the entire system can go down or at best have a bug/glitch.
The Shared Source Device Emulator contains buildable source code to a PocketPC/Smartphone emulator (Requires Visual Studio, runs on Windows). I worked on V1 and V2 of the binary release.
It tackles many emulation issues:
- efficient address translation from guest virtual to guest physical to host virtual
- JIT compilation of guest code
- simulation of peripheral devices such as network adapters, touchscreen and audio
- UI integration, for host keyboard and mouse
- save/restore of state, for simulation of resume from low-power mode
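As a very rough illustration of the first point, guest-virtual to guest-physical to host-virtual translation is often cached per page in a small software TLB; the sketch below is invented for illustration and is not the emulator's actual code.

    #include <cstdint>
    #include <unordered_map>

    // Two-step address translation (guest virtual -> guest physical -> host
    // virtual), with a tiny software TLB caching the final mapping per page.
    class AddressTranslator {
    public:
        static constexpr uint32_t kPageSize = 4096;

        uint8_t* toHost(uint32_t guestVirtual) {
            uint32_t page   = guestVirtual & ~(kPageSize - 1);
            uint32_t offset = guestVirtual & (kPageSize - 1);
            auto it = tlb_.find(page);
            if (it == tlb_.end()) {
                uint32_t guestPhysical = walkGuestPageTables(page);   // slow path
                it = tlb_.emplace(page, hostBase_ + guestPhysical).first;
            }
            return it->second + offset;    // fast path on later accesses to the page
        }

    private:
        // Stub: a real implementation walks the guest's page tables.
        uint32_t walkGuestPageTables(uint32_t guestVirtualPage) { return guestVirtualPage; }

        std::unordered_map<uint32_t, uint8_t*> tlb_;
        uint8_t* hostBase_ = nullptr;      // start of the block backing guest RAM
    };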
To add to the answer provided by Cody Brocious:
In the context of virtualization, where you are emulating a new system (CPU, I/O, etc.) for a virtual machine, we can see the following categories of emulators.
Interpretation: Bochs is an example of an interpreter. It is an x86 PC emulator; it takes each instruction from the guest system and translates it into another set of instructions (of the host ISA) to produce the intended effect. Yes, it is very slow; it doesn't cache anything, so every instruction goes through the same cycle.
Dynamic emulator: QEMU is a dynamic emulator. It does on-the-fly translation of guest instructions and also caches the results. The best part is that it executes as many instructions as possible directly on the host system, so emulation is faster. Also, as mentioned by Cody, it divides the code into blocks (a single flow of execution).
Static emulator: As far as I know there are no static emulators that are helpful in virtualization.
How I would start with emulation:
1. Get books on low-level programming; you'll need it for the "pretend" operating system of the Nintendo... Game Boy...
2. Get books on emulation specifically, and maybe OS development (you won't be making an OS, but it's the closest thing to it).
3. Look at some open-source emulators, especially ones for the system you want to make an emulator for.
4. Copy snippets of the more complex code into your IDE/compiler. This will save you writing out long code. This is what I do for OS development; use a distro of Linux.
I wrote an article about emulating the Chip-8 system in JavaScript.
It's a great place to start as the system isn't very complicated, but you still learn how opcodes, the stack, registers, etc work.
I will be writing a longer guide soon for the NES.

Tool for model checking large, distributed C++ projects such as KDE?

Is there a tool which can handle model checking large, real-world, mostly-C++, distributed systems, such as KDE?
(KDE is a distributed system in the sense that it uses IPC, although typically all of the processes are on the same machine. Yes, by the way, this is a valid usage of "distributed system" - check Wikipedia.)
The tool would need to be able to deal with intraprocess events and inter-process messages.
(Let's assume that if the tool supports C++, but doesn't support other stuff that KDE uses such as moc, we can hack something together to workaround that.)
I will happily accept less general (e.g. static analysers specialised for finding specific classes of bugs) or more general static analysis alternatives, in lieu of actual model checkers. But I am only interested in tools that can actually handle projects of the size and complexity of KDE.
You're obviously looking for a static analysis tool that can
parse C++ on scale
locate code fragments of interest
extract a model
pass that model to a model checker
report that result to you
A significant problem is that everybody has a different idea about what model they'd like to check. That alone likely kills your chance of finding exactly what you want, because each model extraction tool has generally made a choice as to what it wants to capture as a model, and the chances that it matches what you want precisely are IMHO close to zero.
You aren't clear on what specifically you want to model, but I presume you want to find the communication primitives and model the process interactions to check for something like deadlock?
The commercial static analysis tool vendors seem like a logical place to look, but I don't think they are there, yet. Coverity would seem like the best bet, but it appears they only have some kind of dynamic analysis for Java threading issues.
This paper claims to do this, but I have not looked at it in any detail: Compositional analysis of C/C++ programs with VeriSoft. Related is [PDF] Computer-Assisted Assume/Guarantee Reasoning with VeriSoft. It appears you have to hand-annotate the source code to indicate the modelling elements of interest. The VeriSoft tool itself appears to be proprietary to Bell Labs and is likely hard to obtain.
Similarly this one: Distributed Verification of Multi-threaded C++ Programs.
This paper also makes interesting claims, but doesn't process C++ in spite of the title: Runtime Model Checking of Multithreaded C/C++ Programs.
While all the parts of this are difficult, an issue they all share is parsing C++ (as exemplified by the previously quoted paper) and finding the code patterns that provide the raw information for the model. You also need to parse the specific dialect of C++ you are using; it's unfortunate that the C++ compilers all accept slightly different languages. And, as you have observed, processing large C++ code bases is necessary. Model checkers (SPIN and friends) are relatively easy to find.
Our DMS Software Reengineering Toolkit provides for general-purpose parsing, with customizable pattern matching and fact extraction, and has a robust C++ front end that handles many dialects of C++ (EDIT Feb 2019: including C++17 in ANSI, GCC and MS flavors). It could likely be configured to find and extract the facts that correspond to the model you care about. But it doesn't do this off the shelf.
DMS with its C front end has been used to process extremely large C applications (19,000 compilation units!). The C++ front end has been used in anger on a variety of large-scale C++ projects (EDIT Feb 2019: including large-scale refactoring of APIs across 3000+ compilation units). Given DMS's general capability, I think it is likely capable of handling fairly big chunks of code. YMMV.
Static code analyzers, when used against a large code base for the first time, usually produce so many warnings and alerts that you won't be able to analyze all of them in a reasonable amount of time. It is hard to single out real problems from code that just looks suspicious to a tool.
You can try to use automatic invariant discovery tools like Daikon that capture perceived invariants at run time. You can later validate whether the discovered invariants (for example, the relation "a == b+1" between variables) make sense, and then insert permanent asserts into your code. This way, when an invariant is violated as a result of your change, you will get a warning that perhaps you broke something. This method helps you avoid restructuring or changing your code to add tests and mocks.
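For example, a discovered invariant can be hardened into a plain assert (hypothetical code, just to show the idea):

    #include <cassert>

    // Suppose Daikon reported "a == b + 1" at this program point; asserting it
    // makes a later change that breaks the relationship fail loudly.
    void processBatch(int a, int b) {
        assert(a == b + 1 && "run-time-discovered invariant no longer holds");
        // ... original logic ...
    }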
The usual way of applying formal techniques to large systems is to modularise them and write specifications for the interfaces of each module. Then you can verify each module independently (while verifying a module, you import the specifications - but not the code - of the other modules it calls). This approach makes verification scalable.

Does three-tier architecture ever work?

We have been building three-tier architectures for over a decade now. Dividing the presentation, logic, and data tiers is supposed to allow us to exchange each layer individually, should the need ever arise, be it through changed requirements or new technologies.
I have never seen it working in practice...
Mostly because (at least) one of the following reasons:
The three-tier concept was only visible in the source code (e.g. package naming in Java), which was then deployed as one tied-together package.
The code representing each layer was nicely bundled in its own deployable format but then thrown into the same process (e.g. an "enterprise container").
Each layer ran in its own process, sometimes even on different machines, but because of the static nature of their connections, replacing one of them meant breaking all of them.
Thus what you usually end up with is a monolithic, tightly coupled system that does not deliver what its architecture promised.
I therefore think "three-tier architecture" is a total misnomer. The true benefit it brings is that the code is logically sound. But that's at "write time", not at "run time". A better name would be something like "layered by responsibility". In any case, the word "architecture" is misleading.
What are your thoughts on this? How could a working three-tier architecture be achieved? By that I mean one which keeps its promise: allowing one layer to be swapped out without affecting the others. The system should survive that and be in a well-defined state afterwards.
Thanks!
The true purpose of layered architectures (both logical and physical tiers) isn't to make it easy to replace a layer (which is quite rare), but to make it easy to make changes within a layer without affecting the others (and as Ben notes, to facilitate scalability, consistency, and security) - which works all the time all around us.
One example of a 3-tier architecture is a typical database-driven web application:
End-user's web browser
Server-side web application logic
Database engine
In every system, there is the nice, elegant architecture dreamed up at the beginning, and then the hairy mess when it's finally in production, full of hundreds of bug fixes, special-case handlers, and other typical nasty changes made to address specific issues not foreseen during the design.
I don't think the problems you've described are specific to three-tier architecture at all.
If you haven't seen it working, you may just have bad luck. I've worked on projects that serve several UIs (presentation) from one web service (logic). In addition, we swapped data providers via configuration (data) so we could use a low-cost database while developing and Oracle in higher environments.
Sure, there's always some duplication - maybe you add validation in the UI for responsiveness and then validate again in the logic layer - but overall, a clean separation is possible and nice to work with.
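A minimal sketch of that kind of separation, with invented names, might look like this: the logic layer talks only to an interface, and configuration decides which concrete data provider gets constructed.

    #include <memory>
    #include <string>

    // The logic layer depends only on this interface.
    struct IDataProvider {
        virtual ~IDataProvider() = default;
        virtual std::string loadCustomer(int id) = 0;
    };

    struct SqliteProvider : IDataProvider {     // cheap local database for development
        std::string loadCustomer(int id) override { return "sqlite:" + std::to_string(id); }
    };

    struct OracleProvider : IDataProvider {     // used in higher environments
        std::string loadCustomer(int id) override { return "oracle:" + std::to_string(id); }
    };

    // Configuration picks the concrete provider; the rest of the code never changes.
    std::unique_ptr<IDataProvider> makeProvider(const std::string& configured) {
        if (configured == "oracle") return std::make_unique<OracleProvider>();
        return std::make_unique<SqliteProvider>();
    }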
Once you accept that n-tier's major benefits--namely scalability, logical consistency, security--could not easily be achieved through other means, the question of whether or not any of the tiers can be replaced outright without breaking the others becomes more like asking whether there's any icing on the cake.
Any operating system will have a similar kind of architecture, or else it won't work. The presentation layer is independent of the hardware layer, which is abstracted into drivers that implement a certain interface. The data is handled using logic that changes depending on the type of data being read (think NTFS vs. FAT32 vs. EXT3 vs. CD-ROM). Linux can run on just about any hardware you can throw at it and it will still look and behave the same because the abstractions between the layers insulate each other from changes within a single layer.
One of the biggest practical benefits of the 3-tier approach is that it makes it easy to split up work. You can easily have a DBA and a business analyst or two building a data layer, a traditional programmer building the server-side app code, and a graphic designer/web designer building the UI. The three teams still need to communicate, of course, but this allows for much smoother development in most cases. In this regard, I see the 3-tier approach working reliably every day, and this is enough for me, even if I cannot count on "interchangeable parts", so to speak.

Self Testing Systems

I had an idea I was mulling over with some colleagues. None of us knew whether or not it exists currently.
The Basic Premise is to have a system that has 100% uptime but can become more efficient dynamically.
Here is the scenario:
* We hash out a system quickly to a specified set of interfaces; it has zero optimizations, yet we are confident that it is 100% stable (dubious, but for the sake of this scenario please play along).
* We then profile the original classes and start to program replacements for the bottlenecks.
* The original and the replacement are initiated simultaneously and synchronized.
* An original is allowed to run to completion: if a replacement hasn't completed, it is vetoed by the system as a replacement for the original.
* A replacement must always return the same value as the original, for a specified number of times and for a specific range of values, before it is adopted as a replacement for the original.
* If an exception occurs after a replacement is adopted, the system automatically retries the same operation with the class that it superseded.
Have you seen a similar concept in practice? Critique please...
Below are comments written after the initial question in response to posts:
* The system demonstrates a Darwinian approach to system evolution.
* The original and replacement would run in parallel, not in series.
* Race conditions are an inherent issue in multi-threaded apps, and I acknowledge them.
I believe this idea to be an interesting theoretical debate, but not very practical for the following reasons:
To make sure the new version of the code works well, you need to have superb automatic tests, which is a goal that is very hard to achieve and one that many companies fail to develop. You can only go on with implementing the system after such automatic tests are in place.
The whole point of this system is performance tuning, that is, a specific version of the code is replaced by a version that supersedes it in performance. For most applications today, performance is of minor importance. Meaning, the overall performance of most applications is adequate - just think about it, you probably rarely find yourself complaining that "this application is excruciatingly slow"; instead you usually find yourself complaining about the lack of a specific feature, stability issues, UI issues, etc. Even when you do complain about slowness, it's usually the overall slowness of your system and not just a specific application (there are exceptions, of course).
For applications or modules where performance is a big issue, the way to improve them is usually to identify the bottlenecks, write a new version, and test it independently of the system first, using some kind of benchmarking. Benchmarking the new version of the entire application might also be necessary, of course, but in general I think this process would only take place a very small number of times (following the 20%-80% rule). Doing this process "manually" in these cases is probably easier and more cost-effective than the described system.
What happens when you add features, fix non-performance related bugs etc.? You don't get any benefit from the system.
Running the two versions in conjunction to compare their performance has far more problems than you might think - not only might you have race conditions, but if the input is not an appropriate benchmark, you might get the wrong result (e.g. if during the comparison you get loads of small data packets while 90% of the time the real input is large data packets). Furthermore, it might just be impossible (for example, if the actual code changes the data, you can't run the two in conjunction).
The only "environment" where this sounds useful and actually "a must" is a "genetic" system that generates new versions of the code by itself, but that's a whole different story and not really widely applicable...
A system that runs performance benchmarks while operating is going to be slower than one that doesn't. If the goal is to optimise speed, why wouldn't you benchmark independently and import the fastest routines once they are proven to be faster?
And your idea of starting routines simultaneously could introduce race conditions.
Also, if a goal is to ensure 100% uptime you would not want to introduce untested routines since they might generate uncatchable exceptions.
Perhaps your ideas have merit as a harness for benchmarking rather than an operational system?
Have I seen a similar concept in practice? No. But I'll propose an approach anyway.
It seems like most of your objectives would be met by some sort of super source control system, which could be implemented with CruiseControl.
CruiseControl can run unit tests to ensure correctness of the new version.
You'd have to write a CruiseControl builder plugin that would execute the new version of your system against a series of existing benchmarks to ensure that the new version is an improvement.
If the CruiseControl build loop passes, then the new version would be accepted. Such a process would take considerable effort to implement, but I think it feasible. The unit tests and benchmark builder would have to be pretty slick.
I think an Inversion of Control container like OSGi or Spring could do most of what you are talking about (dynamic loading by name).
You could build on top of their stuff. Then implement your code to:
* divide work units into discrete modules/classes (strategy pattern)
* identify each module by a unique name and associate a capability with it
* when a module is requested, request it by capability and use one of the modules with that capability at random
* keep performance stats (get the system tick before and after execution and store the result)
* if an exception occurs, mark that module as "do not use" and log the exception
If the modules do their work by message passing, you can store the message until the operation completes successfully and redo it with another module if an exception occurs.
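A rough sketch of the capability registry described above, with invented names: modules register under a capability, one is picked at random per request, execution time is recorded, and a module that throws is marked "do not use".

    #include <chrono>
    #include <functional>
    #include <map>
    #include <random>
    #include <string>
    #include <vector>

    struct ModuleEntry {
        std::function<void()> run;
        long long totalMicros = 0;   // accumulated execution time
        int runs = 0;
        bool disabled = false;       // set when the module has thrown
    };

    class CapabilityRegistry {
    public:
        void add(const std::string& capability, std::function<void()> module) {
            ModuleEntry entry;
            entry.run = std::move(module);
            modules_[capability].push_back(std::move(entry));
        }

        void invoke(const std::string& capability) {
            auto& candidates = modules_.at(capability);
            std::uniform_int_distribution<std::size_t> pick(0, candidates.size() - 1);
            ModuleEntry& m = candidates[pick(rng_)];
            if (m.disabled) return;                // a fuller version would re-pick
            auto start = std::chrono::steady_clock::now();
            try {
                m.run();
                auto end = std::chrono::steady_clock::now();
                m.totalMicros += std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
                ++m.runs;                          // per-module performance stats
            } catch (...) {
                m.disabled = true;                 // mark as "do not use"; log in a real system
            }
        }

    private:
        std::map<std::string, std::vector<ModuleEntry>> modules_;
        std::mt19937 rng_{std::random_device{}()};
    };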
For design ideas for high availability systems, check out Erlang.
I don't think code will learn to be better by itself. However, some runtime parameters can easily be adjusted to optimal values, but that would just be regular programming, right?
As for the on-the-fly change, I've wondered about the same thing, and I would build it on top of Lua or a similar dynamic language. One could have parts that are loaded and, if they are replaced, reloaded into use. No rocket science in that, either. If the "old code" is still running, that's perfectly all right, since unlike with DLLs, the file is needed only when reading it in, not while executing code that came from it.
Usefulness? Naa...