Simplifying algorithm testing for researchers.

Simplifying algorithm testing for researchers. - c++

I work in a group that does a large mix of research development and full shipping code.
Half the time I develop processes that run on our real time system ( somewhere between soft real-time & hard real-time, medium real-time? )
The other half I write or optimize processes for our researchers who don't necessarily care about the code at all.
Currently I'm working on a process which I have to fork into two different branches.
There is a research version for one group, and a production version that will need to occasionally be merged with the research code to get the latest and greatest into production.
To test these processes you need to setup a semi complicated testing environment that will send the data we analyze to the process at the correct time (real time system).
I was thinking about how I could make the:
Idea
Implement
Test
GOTO #1
Cycle as easy, fast and pain free as possible for my colleagues.
One Idea I had was to embed a scripting language inside these long running processes.
So as the process run's they could tweak the actual algorithm & it's parameters.
Off the bat I looked at embedding:
Lua (useful guide)
Python (useful guide)
These both seem doable and might actually fully solve the given problem.
Any other bright idea's out there?
Recompiling after a 1-2 line change, redeploying to the test environment and restarting just sucks.
The system is fairly complicated and hopefully I explained it half decently.

If you can change enough of the program through a script to be useful, without a full recompile, maybe you should think about breaking the system up into smaller parts. You could have a "server" that handles data loading etc and then the client code that does the actual processing. Each time the system loads new data, it could check and see if the client code has been re-compiled and then use it if that's the case.
I think there would be a couple of advantages here, the largest of which would be that the whole system would be much less complex. Now you're working in one language instead of two. There is less of a chance that people can mess things up when moving from python or lua mode to c++ mode in their heads. By embedding some other language in the system you also run the risk of becoming dependent on it. If you use python or lua to tweek the program, those languages either become a dependency when it becomes time to deploy, or you need to back things out to C++. If you choose to port things to C++ theres another chance for bugs to crop up during the switch.

Embedding Lua is much easier than embedding Python.
Lua was designed from the start to be embedded; Python's embeddability was grafted on after the fact.
Lua is about 20x smaller and simpler than Python.
You don't say much about your build process, but building and testing can be simplified significantly by using a really powerful version of make. I use Andrew Hume's mk, but you would probably be even better off investing the time to master Glenn Fowler's nmake, which can add dependencies on the fly and eliminate the need for a separate configuration step. I don't ordinarily recommend nmake because it is somewhat complicated, but it is very clear that Fowler and his group have built into nmake solutions for lots of scaling and portability problems. For your particular situation, it may be worth the effort to master it.

Not sure I understand your system, but if the build and deployment is too complicated, maybe you could automate it? If deployment is completely automatic, would that solve the problem?
I don't understand how a scripting language would solve the problem? If you change your algorithm, you still need to restart calculation from the beginning, don't you?

It kind of sounds like what you need is CruiseControl or something similar; every time hyou touch the baseline code, it rebuilds and reruns tests.

Related

Speed - embedding python in c++ or extending python with c++

I have some big mysql databases with data for calculations and some parts where I need to get data from external websites.
I used python to do the whole thing until now, but what shall I say: its not a speedster.
Now I'm thinking about mixing Python with C++ using Boost::Python and Python C API.
The question I've got now is: what is the better way to get some speed.
Shall I extend python with some c++ code or shall I embedd python code into a c++ programm?
I will get fore sure some speed increment using c++ code for the calculating parts and I think that calling the Python interpreter inside of an C-application will not be better, because the python interpreter will run the whole time. And I must wrap things python-libraries like mysqldb or urllib3 to have a nice way to work inside c++.
So what whould you suggest is the better way to go: extending or embedding?
( I love the python language, but I'm also familiar with c++ and respect it for speed )
Update:
So I switched some parts from python to c++ and used multi threading (real one) in my c modules and my programm now needs instead of 7 hours 30 minutes :))))

In principle, I agree with the first two answers. Anything coming from disk or across a network connection is likely to be a bigger bottleneck than the application.
All the research of the last 50 years indicates that people often have inaccurate intuition about system performance issues. So IMHO, you really need to gather some evidence, by measuring what is actually happening, then chose a solution based on that evidence.
To try to confirm what is causing the slow performance, measure the system and user time of your application (e.g time python prog.py), and measure the load on the machine.
It the application is maxing-out the CPU, and most of that time is spent in the application (user time), then there may be a case for using a more effective technology for the application.
But if the CPU is not maxed, or the application spends most of its time in the system (system time), and not in the application (user time), then it is unlikely that changing the application programming technology will help significantly. (This is an example of Amdahl's Law http://en.wikipedia.org/wiki/Amdahl%27s_law)
You may also need to measure the performance of your database server, and maybe network connection, to identify the source of the bottle neck, but start with the easiest part.

In my opinion, in your case it makes no sense to embed Python in C++, while the reverse could be beneficial.
In most of programs, the performance problems are very localized, which means that you should rewrite the problematic code in C++ only where it makes sense, leaving Python for the rest.
This gives you the best of both world: the speed of C++ where you need it, the ease of use and flexibility of Python everywhere else. What is also great is that you can do this process step by step, replacing the slow code paths by the by, leaving you always with the whole application in an usable (and testable!) state.
The reverse wouldn't make sense: you'd have to rewrite almost all the code, sacrificing the flexibility of the Python structure.
Still, as always when talking about performance, before acting measure: if your bottleneck is not CPU/memory bound switching to C++ isn't likely to produce much advantages.

Version Control for code yet to be unit tested

My team has a dozen engineers, some of whom work on modules that will take 2-3 weeks to complete.
Now we integrate each module to the main branch of CVS only after unit testing is completed.
The problem with this is for a good 2-3 weeks the code sits only on an engineer's computer and is not under version control.
The programming language used is C.
Is there any elegant way to manage non unit tested code under version control.
Thanks
James

Your process of 2-3 weeks before check-in to the "main" branch is not out-of-the-norm, as would be a similar re-design "root canal" effort (serious restructuring work that is sometimes necessary).
However, I would tend to get pretty nervous about that much time outside of version control.
Don't code angry.
Don't code drunk.
Don't code for a long time without version control
"check-point".
A strong recommendation is that the local developer use Mercurial or Git for local version control for that 2-3 weeks, and then you can check the "finalized" project back into the (main) CVS branch. They are really built for exactly that scenario.
That's what we do -- it works, and it makes diffs-and-patches and collaboration between individual developers quite trivial.
(For us, Mercurial is local, Subversion is the "main" version control system.)

...for a good 2-3 weeks the code sits only on an engineer's computer and is not under version control...
Hm well to me programming outside of version control is like driving with reverse gear: technically doable but generally quite er counterproductive.
In that sense, I would say any other approach that somehow allows your developers to continuously keep their work under VC would be more elegant than nothing at all. For that, there are many known ways - googling for version control branching strategy shows plenty resources explaining your options and criteria how to choose.
Without experimenting it's rather hard to tell which of these options is a better fit for your project. When studying resources I refer to above I'd recommend to check details of what is usually called Feature Branch. This strategy rather closely matches the case you describe "modules that will take 2-3 weeks to complete" - although I wouldn't bet on that it's the best fit for your team.
Note also that at least for "internal" developers needs you have an option to use version control system other than old-fashioned and inconvenient CVS.

If your company policy requires all your code to be unit tested before checking in, I think it is pretty good policy, and you should do just that: write your unit tests, perhaps even before writing code.
But if I misunderstood you and it is just that there should be large testing session when everything is done, well, then it is too bad. You will definitely encounter pesky integration problems. If you cannot change that policy, at least have your local VCS. Also, you could have config with "featureX_Enabled" switches, and try not to forget to set it to '0' when checking in.
Anyway, switch to Git or Mercurial, they would be so much less painful to use.

What is the best approach for coding in a slow compilation environment?

I used to be coding in C# in a TDD style - write/or change a small chunk of code, re-compile in 10 seconds the whole solution, re-run the tests and again. Easy...
That development methodology worked very well for me for a few years, until a last year when I had to go back to C++ coding and it really feels that my productivity has dramatically decreased since. The C++ as a language is not a problem - I had quite a lot of C++ dev experience... but in the past.
My productivity is still OK for a small projects, but it gets worse when with the increase of the project size and once compilation time hits 10+ minutes it gets really bad. And if I find the error I have to start compilation again, etc. That is just purely frustrating.
Thus I concluded that in a small chunks (as before) is not acceptable - any recommendations how can I get myself into the old gone habit of coding for an hour or so, when reviewing the code manually (without relying on a fast C# compiler), and only recompiling/re-running unit tests once in a couple of hours.
With a C# and TDD it was very easy to write a code in a evolutionary way - after a dozen of iterations whatever crap I started with was ending up in a good code, but it just does not work for me anymore (in a slow compilation environment).
Would really appreciate your inputs and recos.
p.s. not sure how to tag the question - anyone is welcome to re-tag the question appropriately.
Cheers.

I've found that recompiling and testing sort of pulls me out of the "zone", so in order to have the benefits of TDD, I commit fairly often into a git repository, and run a background process that checks out any new commit, runs the full test suite and annotates the commit object in git with the result. When I get around to it (usually in the evening), I then go back to the test results, fix any issues and "rewrite history", then re-run the tests on the new history. This way I don't have to interrupt my work even for the short times it takes to recompile (most of) my projects.

Sometimes you can avoid the long compile. Aside from improving the quality of your build files/process, you may be able to pick just a small thing to build. If the file you're working on is a .cpp file, just compile that one TU and unit-test it in isolation from the rest of the project. If it's a header (perhaps containing inline functions and templates), do the same with a small number of TUs that between them reference most of the functionality (if no such set of TUs exists, write unit tests for the header file and use those). This lets you quickly detect obvious stupid errors (like typos) that don't compile, and runs the subset of tests you believe to be relevant to the changes you're making. Once you have something that might vaguely work, do a proper build/test of the project to ensure you haven't broken anything you didn't realise was relevant.
Where a long compile/test cycle is unavoidable, I work on two things at once. For this to be efficient, one of them needs to be simple enough that it can just be dropped when the main task is ready to be resumed, and picked up again immediately when the main task's compile/test cycle is finished. This takes a bit of planning. And of course the secondary task has its own build/test cycle, so sometimes you want to work in separate checked-out copies of the source so that errors in one don't block the other.
The secondary task could for example be, "speed up the partial compilation time of the main task by reducing inter-component dependencies". Even so you may have hit a hard limit once it's taking 10 minutes just to link your program's executable, since splitting the thing into multiple dlls just as a development hack probably isn't a good idea. The key thing to avoid is for the secondary task to be, "hit SO", or this.

Since a simple change triggers a 10 minutes recompilation, that means you have a bad build system. Your build should recompile only changed files and files depending on the changed files.
Other then that, there are other techniques to speed up the build time (For example, try to remove unneeded includes. Then instead of including a header, use forward declaration. etc ), but the speed up of these things is not that important as what is recompiled on a change.

I don't see why you can't use TDD with C++. I used CppUnit back in 2001, so I assume it's still in place.
You don't say what IDE or build tool you're using, so I can't comment on how those affect your pace. But small, incremental compiles and running unit tests are both still possible.
Perhaps looking into Cruise Control, Team City, or another hands-off build and test process would be your cup of tea. You can just check in as fast as you can and let the automated build happen on another server.

How should I convert legacy ColdFusion code to a framework?

We have a medium sized ColdFusion code base for our Intranet and Website. For most of the history of the code we have used hard coded links in the cfm's for where to go and for what 'save' code to tun.
In the last few years we've began using cfc's to handle more of the "navigational" code as well as more automated save code (implicitly calling the save process for a given cfc on init)
Assuming that it makes sense to begin using a framework, is it better to begin using it for newer projects or attempt a full scale conversion?
EDIT
To avoid confusion, I'm sensing that by moving to more cfc based code we are going down the path of accidentally creating our own framework. It seems to me that taking a proactive step toward using a proper framework and allowing the cfc's to process data is probably a wiser choice.

I'd only put the effort into a conversion if you were spending more than 10-20% of your time maintaining the project. (Your threshold my be lower or higher.) Other than that, use it just for new projects.
Why? I think the conversion is going to be painful, laborious and potentially a waste of valuable time.

"Assuming that it makes sense to begin
using a framework, is it better to
begin using it for newer projects or
attempt a full scale conversion?"
I would say that the biggest criteria for whether you should move to a framework are:
Am I spending a good amount of time maintaining current code, and is it difficult?
Am I repeating a lot of code? Do you find yourself writing a lot of the same thing over and over when adding to the current project?
Depending on how large the application is, it might be worth it to convert a current application to a framework if it saves you more time down the road by making maintenance easier and reducing code repetition for future additions to the current project. If you rarely maintenance the application except for a few tweaks here and there, then I would say leave it alone and use a framework only for new applications.

Frameworks have a short term cost and a long term gain.
When we start out without one we usually start building one over time indirectly to increase re-use of code, and make things more structured..
I have been a big fan of Fusebox, probably because I've just used it for so long.
What I have done in the past is, if I know the site will never be updated in into any real website functions, I just roll my own cfswitch to navigate between actions. each action I simply break down into the dsp act qry type files fusebox likes.
If ever I need to put it into fusebox, most of my circuits and actions are already done. The path forward is a bit easier.
On the other hand, if I know the client may want more in the future, I will just put it in a framework and leave it at that.
On a sidenote, I have also been checking out the very impressive ColdBox -- which seems to have some fantastic support, scalability and very well documented is cfc intensive... check them out too..

Have you considered using a framework such as fusebox instead of rolling your own? If you begin using a framework on new projects you might then find it easier to apply what you've learnt to existing projects.

Self Testing Systems

I had an idea I was mulling over with some colleagues. None of us knew whether or not it exists currently.
The Basic Premise is to have a system that has 100% uptime but can become more efficient dynamically.
Here is the scenario: * So we hash out a system quickly to a
specified set of interfaces, it has
zero optimizations, yet we are
confident that it is 100% stable
though (dubious, but for the sake of
this scenario please play
along) * We then profile
the original classes, and start to
program replacements for the
bottlenecks.
* The original and the replacement are initiated simultaneously and
synchronized.
* An original is allowed to run to completion: if a replacement hasn´t
completed it is vetoed by the system
as a replacement for the
original.
* A replacement must always return the same value as the original, for a
specified number of times, and for a
specific range of values, before it is
adopted as a replacement for the
original.
* If exception occurs after a replacement is adopted, the system
automatically tries the same operation
with a class which was superseded by
it.
Have you seen a similar concept in practise? Critique Please ...
Below are comments written after the initial question in regards to
posts:
* The system demonstrates a Darwinian approach to system evolution.
* The original and replacement would run in parallel not in series.
* Race-conditions are an inherent issue to multi-threaded apps and I
acknowledge them.

I believe this idea to be an interesting theoretical debate, but not very practical for the following reasons:
To make sure the new version of the code works well, you need to have superb automatic tests, which is a goal that is very hard to achieve and one that many companies fail to develop. You can only go on with implementing the system after such automatic tests are in place.
The whole point of this system is performance tuning, that is - a specific version of the code is replaced by a version that supersedes it in performance. For most applications today, performance is of minor importance. Meaning, the overall performance of most applications is adequate - just think about it, you probably rarely find yourself complaining that "this application is excruciatingly slow", instead you usually find yourself complaining on the lack of specific feature, stability issues, UI issues etc. Even when you do complain about slowness, it's usually an overall slowness of your system and not just a specific applications (there are exceptions, of course).
For applications or modules where performance is a big issue, the way to improve them is usually to identify the bottlenecks, write a new version and test is independently of the system first, using some kind of benchmarking. Benchmarking the new version of the entire application might also be necessary of course, but in general I think this process would only take place a very small number of times (following the 20%-80% rule). Doing this process "manually" in these cases is probably easier and more cost-effective than the described system.
What happens when you add features, fix non-performance related bugs etc.? You don't get any benefit from the system.
Running the two versions in conjunction to compare their performance has far more problems than you might think - not only you might have race conditions, but if the input is not an appropriate benchmark, you might get the wrong result (e.g. if you get loads of small data packets and that is in 90% of the time the input is large data packets). Furthermore, it might just be impossible (for example, if the actual code changes the data, you can't run them in conjunction).
The only "environment" where this sounds useful and actually "a must" is a "genetic" system that generates new versions of the code by itself, but that's a whole different story and not really widely applicable...

A system that runs performance benchmarks while operating is going to be slower than one that doesn't. If the goal is to optimise speed, why wouldn't you benchmark independently and import the fastest routines once they are proven to be faster?
And your idea of starting routines simultaneously could introduce race conditions.
Also, if a goal is to ensure 100% uptime you would not want to introduce untested routines since they might generate uncatchable exceptions.
Perhaps your ideas have merit as a harness for benchmarking rather than an operational system?

Have I seen a similar concept in practice? No. But I'll propose an approach anyway.
It seems like most of your objectives would be meet by some sort of super source control system, which could be implemented with CruiseControl.
CruiseControl can run unit tests to ensure correctness of the new version.
You'd have to write a CruiseControl builder pluggin that would execute the new version of your system against a series of existing benchmarks to ensure that the new version is an improvement.
If the CruiseControl build loop passes, then the new version would be accepted. Such a process would take considerable effort to implement, but I think it feasible. The unit tests and benchmark builder would have to be pretty slick.

I think an Inversion of Control Container like OSGi or Spring could do most of what you are talking about. (dynamic loading by name)
You could build on top of their stuff. Then implement your code to
divide work units into discrete modules / classes (strategy pattern)
identify each module by unique name and associate a capability with it
when a module is requested it is requested by capability and at random one of the modules with that capability is used.
keep performance stats (get system tick before and after execution and store the result)
if an exception occurs mark that module as do not use and log the exception.
If the modules do their work by message passing you can store the message until the operation completes successfully and redo with another module if an exception occurs.

For design ideas for high availability systems, check out Erlang.

I don't think code will learn to be better, by itself. However, some runtime parameters can easily adjust onto optimal values, but that would be just regular programming, right?
About the on-the-fly change, I've shared the wondering and would be building it on top of Lua, or similar dynamic language. One could have parts that are loaded, and if they are replaced, reloaded into use. No rocket science in that, either. If the "old code" is still running, it's perfectly all right, since unlike with DLL's, the file is needed only when reading it in, not while executing code that came from there.
Usefulness? Naa...

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js