Optimizing console programs - c++

I have a console application that is doing intense calculations, and it takes several hours to complete. I can run as many as I want at a given moment, and they all run at exactly the same speed. That means my computer has the ability to go faster, so why isn't it? I want my computer to take 100% of its processing power and dedicate it to this one program.
I can easily cut time in half by putting half of the work in one program and half of the work in another, and have them communicate through a txt file or something. Is there any way I can make it go faster without doing that? Task Manager's priority doesn't change anything.

Well... There's always multithreading:
Stack Overflow, for example.
But it's always a complicated process, especially for C++. I'm afraid you'll have to google some tutorials for your particular case.
Here are some: tutorialspoint, codebase, plenty more around. There are a lot of methods, achieved by many distinct libraries, and none of them could be considered simple.

Related

Profiling a multiprocess system

I have a system that I need to profile.
It is comprised of tens of processes, mostly C++, some comprised of several threads, that communicate with the network and with one another through various system calls.
I know there are performance bottlenecks sometimes, but no one has put in the time/effort to check where they are: they may be in userspace code, inefficient use of syscalls, or something else.
What would be the best way to approach profiling a system like this?
I have thought of the following strategy:
Manually logging the roundtrip times of various code sequences (for example processing an incoming packet or a cli command) and seeing which process takes the largest time. After that, profiling that process, fixing the problem and repeating.
This method seems sorta hacky and guess-worky. I don't like it.
How would you suggest to approach this problem?
Are there tools that would help me out (multi-process profiler?)?
What I'm looking for is more of a strategy than just specific tools.
Should I profile every process separately and look for problems? If so, how do I approach this?
Do I try to isolate the problematic processes and go from there? If so, how do I isolate them?
Are there other options?
I don't think there is a single answer to this sort of question. And every type of issue has its own problems and solutions.
Generally, the first step is to figure out WHERE in the big system is the time spent. Is it CPU-bound or I/O-bound?
If the problem is CPU-bound, a system-wide profiling tool can be useful to determine where in the system the time is spent - the next question is of course whether that time is actually necessary or not, and no automated tool can tell the difference between a badly written piece of code that does a million completely useless processing steps, and one that does a matrix multiplication with a million elements very efficiently - it takes the same amount of CPU-time to do both, but one isn't actually achieving anything. However, knowing which program takes most of the time in a multiprogram system can be a good starting point for figuring out IF that code is well written, or can be improved.
If the system is I/O-bound, such as on network or disk I/O, then there are tools for analysing disk and network traffic that can help. But again, expecting the tool to point out what packet response or disk access time you should expect is a different matter: whether you contact Google to search for "kerflerp" or contact your local webserver that is a meter away will have a dramatic impact on what counts as a reasonable response time.
There are lots of other issues - running two pieces of code in parallel that use LOTS of memory can cause both to run slower than if they are run in sequence - because the high memory usage causes swapping, or because the OS isn't able to use spare memory for caching file-I/O, for example.
On the other hand, two or more simple processes that use very little memory will benefit quite a lot from running in parallel on a multiprocessor system.
Adding logging to your applications such that you can see WHERE it is spending time is another method that works reasonably well. Particularly if you KNOW what the use-case is where it takes time.
If you have a use-case where you know "this should take no more than X seconds", running regular pre- or post-commit tests to check that the code is behaving as expected, and that no-one added a lot of code to slow it down, would also be a useful thing.

profiler for c++ code, very sleepy

I'm a newbie with profiling. I'd like to optimize my code to satisfy timing constraints. I use Visual C++ 08 Express and thus had to download a profiler, for me it's Very Sleepy. I did some search but found no decent tutorial on Sleepy, and here my question:
How do I use it properly? I grasped the general idea of profiling, so I sorted according to %exclusive to find my bottlenecks. Firstly, at the top of this list I have ZwWaitForSingleObject, RtlEnterCriticalSection, operator new, RtlLeaveCriticalSection, printf, some iterators ... and after they take something like 60%, there comes my first function, in first position by Child Calls. Can someone explain to me why the above-mentioned functions show up, what they mean, and how I can optimize my code if I have no access to this critical 60%? (for "source file": unknown...).
Also, for my function I'd think I get time for each line, but it's not the case, e.g. arithmetics or some functions have no timing (not nested in unused "if" clauses).
AND last thing: how to find out that some line can execute superfast, but is called thousands times, being the actual bottleneck?
Finally, is Sleepy good? Or some free alternative for my platform?
Help very appreciated!
cheers!
UPDATE - - - - -
I have found another version of the profiler, called plain Sleepy. It shows how many times some snippet was called plus the line number (I guess it points to the critical one). So in my case.. KiFastSystemCallRet takes 50%! It means that it waits for some data, right? How can I improve that? Is there maybe a decent approach to trace what causes these multiple calls and eventually remove/change it?
I'd like to optimize my code to satisfy timing constraints
You're running smack into a persistent issue in this business.
You want to find ways to make your code take less time, and you (and many people) assume (and have been taught) the only way to do that is by taking various sorts of measurements.
There's a minority view, and the only thing it has to recommend it is actual significant results (plus an ironclad theory behind it).
If you've got a "bottleneck" (and you do, probably several), it's taking some fraction of time, like 30%.
Just treat it as a bug to be found.
Randomly halt the program with the pause button, and look carefully to see what the program is doing and why it's doing it.
Ask if it's something that could be gotten rid of.
Do this 10 times. On average you will see the problem on 3 of the pauses.
Any activity you see more than once, if it's not truly necessary, is a speed bug.
This does not tell you precisely how much the problem costs, but it does tell you precisely what the problem is, and that it's worth fixing.
You'll see things this way that no profiler can find, because profilers are only programs, and cannot be broad-minded about what constitutes an opportunity.
Some folks are risk-averse, thinking it might not give enough speedup to be worth it.
Granted, there is a small chance of a low payoff, but it's like investing.
The theory says on average it will be worthwhile, and there's also a small chance of a high payoff.
In any case, if you're worried about the risks, a few more samples will settle your fears.
After you fix the problem, the remaining bottlenecks each take a larger percent, because they didn't get smaller but the overall program did.
So they will be easier to find when you repeat the whole process.
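To make the arithmetic concrete, assume (purely for illustration) that the program takes time T, the problem just fixed took 30% of it, and a second bottleneck takes 20%. After the fix:

```latex
\[
T' = T - 0.30\,T = 0.70\,T,
\qquad
\frac{0.20\,T}{0.70\,T} \approx 28.6\%,
\qquad
\text{speedup} = \frac{T}{T'} = \frac{1}{0.70} \approx 1.43\times
\]
```

The second bottleneck costs the same absolute time as before, but it now shows up on about 29% of the pauses instead of 20%.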
There's lots of literature about profiling, but very little that actually says how much speedup it achieves in practice.
Here's a concrete example with almost 3 orders of magnitude speedup.
I've used GlowCode (commercial product, similar to Sleepy) for profiling native C++ code. You run the instrumenting process, then execute your program, then look at the data produced by the tool. The instrumenting step injects a little trace function at every method's entry and exit points, and simply measures how much time it takes for each function to run through to completion.
Using the call graph profiling tool, I listed the methods sorted from "most time used" to "least time used", and the tool also displays a call count. Simply drilling into the highest percentage routine showed me which methods were using the most time. I could see that some methods were very slow, but drilling into them I discovered they were waiting for user input, or for a service to respond. And some took a long time because they were calling some internal routines thousands of times each invocation. We found someone made a coding error and was walking a large linked list repeatedly for each item in the list, when they really only needed to walk it once.
If you sort by "most frequently called" to "least called", you can see some of the tiny functions that get called from everywhere (iterator methods like next(), etc.) Something to check for is to make sure the functions that are called the most often are really clean. Saving a millisecond in a routine called 500 times to paint a screen will speed that screen up by half a second. This helps you decide which are the most important places to spend your efforts.
I've seen two common approaches to using profiling. One is to do some "general" profiling, running through a suite of "normal" operations, and discovering which methods are slowing the app down the most. The other is to do specific profiling, focusing on specific user complaints about performance, and running through those functions to reveal their issues.
One thing I would caution you about is to limit your changes to those that will measurably impact the users' experience or system throughput. Shaving one millisecond off a mouse click won't make a difference to the average user, because human reaction time simply isn't that fast. Typical human reaction times are on the order of a couple of hundred milliseconds, and even highly trained people such as race car drivers or elite twitch gamers remain far above a single millisecond. The benefits would be negligible.
Making twenty 1-millisecond improvements or one 20-millisecond change will make the system a lot more responsive. It's cheaper and better if you can do the single big improvement over the many small improvements.
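Similarly, shaving one millisecond off a service that handles 100 users per second (10 milliseconds per request, assuming requests are handled one at a time) will make a 10% improvement in per-request time, meaning you could improve the service to handle roughly 110 users per second.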
The reason for concern is that coding changes strictly to improve performance often negatively impact your code's structure by adding complexity. Let's say you decided to improve a call to a database by caching results. How do you know when the cache goes invalid? Do you add a cache cleaning mechanism? Consider a financial transaction where looping through all the line items to produce a running total is slow, so you decide to keep a runningTotal accumulator to answer faster. You now have to modify the runningTotal for all kinds of situations like line voids, reversals, deletions, modifications, quantity changes, etc. It makes the code more complex and more error-prone.

running time of two programs run separately and then together

I was recently asked this question in an interview, and while I did alright on the first two parts [I am assuming] I struggled a bit on the third. Here's the question:
You have two Linux programs, A and B. When run separately, A and B each take one minute to complete on a system that has just been restarted. [ie: fresh system: you reboot it, log in, get a shell prompt, run the program.]
What can you tell me about the programs if:
a) when run together, they take 2 minutes
b) when run together, they take 1 minute
c) when run together, they take 30 seconds
I said for a) that if they take exactly double the time when run together, they share no mutual exclusion and are vying for all the same resources, probably don't share any sort of cache data or instructions [and thus don't help each other out from a cache perspective] and each program needs the full utilization of said resource to complete such that the OS cannot parallelize them.
For b), I said that if they can run just as fast together, they probably share some spatial/temporal locality in the cache, and may lend themselves to being properly pipelined in such a way that while program A is waiting on something, program B can run in between those stages, and vice versa, effectively running them both in 1 minute.
For c), I was a bit stuck. In retrospect, I probably should have said that perhaps program A and B were both doing a common task, where two of them running at once could complete said task faster than one running alone-- such as a garbage collector. But the best that I could come up with was that perhaps they loaded out of the same sector on the hard disk, and that helped them both together run quickly.
I am just looking for some input from some of the smarties here on things I probably missed. The position was for a platforms/systems position that requires a good understanding of hardware/software and operating systems, and namely interactions between them, which is why [I'm assuming] the question was asked.
I was also trying to think of examples that I could apply to each part to help show my knowledge of the question's real-life applications, but on the spot I was coming up short.
Together they take 2 minutes to complete
In this case, I think that each program is fully CPU-bound and can saturate 100% of the CPUs available on the machine. Therefore when the programs run together, each runs at half speed.
It's also possible that this would be the observed behavior if both programs were able and willing to saturate some other resource apart from the CPU, for example some I/O device. However, since in practice, usually the performance of I/O devices does not decrease linearly with the load applied to them if they are oversaturated, I would consider that a less likely scenario and go with CPU-bound as a first guess.
Together they take 1 minute to complete
The two programs do not contest the same resources, or there are ample resources in the system to satisfy the demands of both. Therefore, they end up not interfering with each other.
Together they take half a minute to complete
The programs operate on the same input, and both can tell when all input is used up, so each ends up doing half the work it would do if launched alone at half the running time. Also, the system obviously has the capacity to supply double the amount of whatever resource these programs are constrained by.
Since in this case the running time decreases linearly with the number of processes (perfect scaling), it seems more likely that the resource constraining the programs is CPU for the same reasons explained in the "2 minutes" scenario. This also fits in well with the "common input" assumption, as the input would not be very likely to be coming from one source if there were e.g. different I/O devices supplying it.
Therefore, the first guess in this case is that each program is CPU-bound and written such that it consumes at most half the CPU resources in the system.
For A, they're programs that are in competition for a mutually exclusive resource.
For B, they're independent programs that don't really interact.
For C, which is the one you're struggling with, it seems they both have the same work to pick from. For example, there's a queue of tasks to do, both programs are capable of doing the tasks, and they know what tasks have been done. So if they both run at the same time (assuming multi core machine, but even then not necessarily, all that's important is that they don't have a resource bottleneck) they get the work done in half the time.
See Performance in multithreaded Java application for another possible reason why processes can run faster when you have more than one.
Although I admit that the queue of tasks that can be performed concurrently is a much simpler explanation for this reduced running time.

Beyond Stack Sampling: C++ Profilers

A Hacker's Tale
The date is 12/02/10. The days before Christmas are dripping away and I've pretty much hit a major road block as a windows programmer. I've been using AQTime, I've tried sleepy, shiny, and very sleepy, and as we speak, VTune is installing. I've tried to use the VS2008 profiler, and it's been positively punishing as well as often insensible. I've used the random pause technique. I've examined call-trees. I've fired off function traces. But the sad painful fact of the matter is that the app I'm working with is over a million lines of code, with probably another million lines worth of third-party apps.
I need better tools. I've read the other topics. I've tried out each profiler listed in each topic. There simply has to be something better than these junky and expensive options, or ludicrous amounts of work for almost no gain. To further complicate matters, our code is heavily threaded, and runs a number of Qt Event loops, some of which are so fragile that they crash under heavy instrumentation due to timing delays. Don't ask me why we're running multiple event loops. No one can tell me.
Are there any options more along the lines of Valgrind in a windows environment?
Is there anything better than the long swath of broken tools I've already tried?
Is there anything designed to integrate with Qt, perhaps with a useful display of events in queue?
A full list of the tools I tried, with the ones that were really useful in italics:
AQTime: Rather good! Has some trouble with deep recursion, but the call graph is correct in these cases, and can be used to clear up any confusion you might have. Not a perfect tool, but worth trying out. It might suit your needs, and it certainly was good enough for me most of the time.
Random Pause attack in debug mode: Not enough information enough of the time. A good tool but not a complete solution.
Parallel Studios: The nuclear option. Obtrusive, weird, and crazily powerful. I think you should hit up the 30 day evaluation, and figure out if it's a good fit. It's just darn cool, too.
AMD Codeanalyst: Wonderful, easy to use, very crash-prone, but I think that's an environment thing. I'd recommend trying it, as it is free.
Luke Stackwalker: Works fine on small projects, it's a bit trying to get it working on ours. Some good results though, and it definitely replaces Sleepy for my personal tasks.
PurifyPlus: No support for Win-x64 environments, most prominently Windows 7. Otherwise excellent. A number of my colleagues in other departments swear by it.
VS2008 Profiler: Produces output in the 100+gigs range in function trace mode at the required resolution. On the plus side, produces solid results.
GProf: Requires GCC to be even moderately effective.
VTune: VTune's W7 support borders on criminal. Otherwise excellent.
PIN: I'd need to hack up my own tool, so this is sort of a last resort.
Sleepy\VerySleepy: Useful for smaller apps, but failing me here.
EasyProfiler: Not bad if you don't mind a bit of manually injected code to indicate where to instrument.
Valgrind: *nix only, but very good when you're in that environment.
OProfile: Linux only.
Proffy: They shoot wild horses.
Suggested tools that I haven't tried:
XPerf:
Glowcode:
Devpartner:
Notes:
Intel environment at the moment. VS2008, boost libraries. Qt 4+. And the wretched humdinger of them all: Qt/MFC integration via trolltech.
Now: Almost two weeks later, it looks like my issue is resolved. Thanks to a variety of tools, including almost everything on the list and a couple of my personal tricks, we found the primary bottlenecks. However, I'm going to keep testing, exploring, and trying out new profilers as well as new tech. Why? Because I owe it to you guys, because you guys rock. It does slow the timeline down a little, but I'm still very excited to keep trying out new tools.
Synopsis
Among many other problems, a number of components had recently been switched to the incorrect threading model, causing serious hang-ups due to the fact that the code underneath us was suddenly no longer multithreaded. I can't say more because it violates my NDA, but I can tell you that this would never have been found by casual inspection or even by normal code review. Without profilers, callgraphs, and random pausing in conjunction, we'd still be screaming our fury at the beautiful blue arc of the sky. Thankfully, I work with some of the best hackers I've ever met, and I have access to an amazing 'verse full of great tools and great people.
Gentlefolk, I appreciate this tremendously, and only regret that I don't have enough rep to reward each of you with a bounty. I still think this is an important question to get a better answer to than the ones we've got so far on SO.
As a result, each week for the next three weeks, I'll be putting up the biggest bounty I can afford, and awarding it to the answer with the nicest tool that I think isn't common knowledge. After three weeks, we'll hopefully have accumulated a definitive profile of the profilers, if you'll pardon my punning.
Take-away
Use a profiler. They're good enough for Ritchie, Kernighan, Bentley, and Knuth. I don't care who you think you are. Use a profiler. If the one you've got doesn't work, find another. If you can't find one, code one. If you can't code one, or it's a small hang up, or you're just stuck, use random pausing. If all else fails, hire some grad students to bang out a profiler.
A Longer View
So, I thought it might be nice to write up a bit of a retrospective. I opted to work extensively with Parallel Studios, in part because it is actually built on top of the PIN Tool. Having had academic dealings with some of the researchers involved, I felt that this was probably a mark of some quality. Thankfully, I was right. While the GUI is a bit dreadful, I found IPS to be incredibly useful, though I can't comfortably recommend it for everyone. Critically, there's no obvious way to get line-level hit counts, something that AQT and a number of other profilers provide, and I've found very useful for examining rate of branch-selection among other things. In net, I've enjoyed using AQTime as well, and I've found their support to be really responsive. Again, I have to qualify my recommendation: A lot of their features don't work that well, and some of them are downright crash-prone on Win7x64. XPerf also performed admirably, but is agonizingly slow for the sampling detail required to get good reads on certain kinds of applications.
Right now, I'd have to say that I don't think there's a definitive option for profiling C++ code in a W7x64 environment, but there are certainly options that simply fail to perform any useful service.
First:
Time sampling profilers are more robust than CPU sampling profilers. I'm not extremely familiar with Windows development tools so I can't say which ones are which. Most profilers are CPU sampling.
A CPU sampling profiler grabs a stack trace every N instructions.
This technique will reveal portions of your code that are CPU bound. Which is awesome if that is the bottleneck in your application. Not so great if your application threads spend most of their time fighting over a mutex.
A time sampling profiler grabs a stack trace every N microseconds.
This technique will zero in on "slow" code, whether the cause is CPU-bound, blocking-IO-bound, mutex-bound, or cache-thrashing sections of code. In short, whatever piece of code is slowing your application will stand out.
So use a time sampling profiler if at all possible especially when profiling threaded code.
Second:
Sampling profilers generate gobs of data. The data is extremely useful, but there is often too much to be easily useful. A profile data visualizer helps tremendously here. The best tool I've found for profile data visualization is gprof2dot. Don't let the name fool you, it handles all kinds of sampling profiler output (AQtime, Sleepy, XPerf, etc). Once the visualization has pointed out the offending function(s), jump back to the raw profile data to get better hints on what the real cause is.
The gprof2dot tool generates a dot graph description that you then feed into a graphviz tool. The output is basically a callgraph with functions color coded by their impact on the application.
A few hints to get gprof2dot to generate nice output.
I use a --skew of 0.001 on my graphs so I can easily see the hot code paths. Otherwise the int main() dominates the graph.
If you're doing anything crazy with C++ templates you'll probably want to add --strip. This is especially true with Boost.
I use OProfile to generate my sampling data. To get good output I need to configure it to load the debug symbols from my 3rd party and system libraries. Be sure to do the same, otherwise you'll see that CRT is taking 20% of your application's time when what's really going on is malloc is trashing the heap and eating up 15%.
What happened when you tried random pausing? I use it all the time on a monster app. You said it did not give enough information, and you've suggested you need high resolution. Sometimes people need a little help in understanding how to use it.
What I do, under VS, is configure the stack display so it doesn't show me the function arguments, because that makes the stack display totally unreadable, IMO.
Then I take about 10 samples by hitting "pause" during the time it's making me wait. I use ^A, ^C, and ^V to copy them into notepad, for reference. Then I study each one, to try to figure out what it was in the process of trying to accomplish at that time.
If it was trying to accomplish something on 2 or more samples, and that thing is not strictly necessary, then I've found a live problem, and I know roughly how much fixing it will save.
There are things you don't really need to know, like precise percents are not important, and what goes on inside 3rd-party code is not important, because you can't do anything about those. What you can do something about is the rich set of call-points in code you can modify displayed on each stack sample. That's your happy hunting ground.
Examples of the kinds of things I find:
During startup, it can be about 30 layers deep, in the process of trying to extract internationalized character strings from DLL resources. If the actual strings are examined, it can easily turn out that the strings don't really need to be internationalized, like they are strings the user never actually sees.
During normal usage, some code innocently sets a Modified property in some object. That object comes from a super-class that captures the change and triggers notifications that ripple throughout the entire data structure, manipulating the UI, creating and destroying objects in ways hard to foresee. This can happen a lot - the unexpected consequences of notifications.
Filling in a worksheet row-by-row, cell-by-cell. It turns out if you build the row all at once, from an array of values, it's a lot faster.
P.S. If you're multi-threaded, when you pause it, all threads pause. Take a look at the call stack of each thread. Chances are, only one of them is the real culprit, and the others are idling.
I've had some success with AMD CodeAnalyst.
Do you have an MFC OnIdle function? In the past I had a near real-time app I had to fix that was dropping serial packets when set at 19.2K speed, which a Pentium D should have been able to keep up with. The OnIdle function was what was killing things. I'm not sure if Qt has that concept, but I'd check for that too.
Re the VS Profiler -- if it's generating such large files, perhaps you are sampling too frequently? Try lowering the sampling rate, as you probably have enough samples anyway.
And ideally, make sure you're not collecting samples until you're actually exercising the problem area. So start with collection paused, get your program to do its "slow activity", then start collection. You only need at most 20 seconds of collection. Stop collection after this.
This should help reduce your sample file sizes, and only capture what is necessary for your analysis.
I have successfully used PurifyPlus for Windows. Although it is not cheap, IBM provides a trial version that is slightly crippled. All you need for profiling with quantify are pdb files and linking with /FIXED:NO. Only drawback: No support for Win7/64.
Easyprofiler - I haven't seen it mentioned here yet so not sure if you've looked at it already. It takes a slightly different approach in how it gathers metric data. A drawback to using its compile-time profile approach is you have to make changes to the code-base. Thus you'll need to have some idea of where the slow code might be and insert profiling code there.
Going by your latest comments though, it sounds like you're at least making some headway. Perhaps this tool might provide some useful metrics for you. If nothing else it has some really purdy charts and pictures :P
Two more tool suggestions.
Luke Stackwalker has a cute name (even if it's trying a bit hard for my taste), it won't cost you anything, and you get the source code. It claims to support multi threaded programs, too. So it is surely worth a spin.
http://lukestackwalker.sourceforge.net/
Also Glowcode, which I've had pointed out to me as worth using:
http://www.glowcode.com/
Unfortunately I haven't done any PC work for a while, so I haven't tried either of these. I hope the suggestions are of help anyway.
Check out XPerf
This is a free, non-invasive and extensible profiler offered by MS. It was developed by Microsoft to profile Windows.
If you're suspicious of the event loop, could you override QCoreApplication::notify() and do some manual profiling (one or two maps of senders/events to counts/time)?
I'm thinking that you first log the frequency of event types, then examine those events more carefully (which object sends it, what does it contain, etc). Signals across threads are queued implicitly, so they end up in the event loop (as do explicitly queued connections, obviously).
We've done it to trap and report exceptions in our event handlers, so really, every event goes through there.
Just an idea.
Edit: I see now you mentioned this in your first post. Dammit, I never thought I'd be that guy.
You can use Pin to instrument your code with finer granularity. I think Pin would let you create a tool to count how many times you enter a function or how many clockticks you spend there, roughly emulating something like VTune or CodeAnalyst. Then you could strip down which functions get instrumented until your timing issues go away.
I can tell you what I use everyday.
a) AMD Code Analyst
It is easy, and it will give you a quick overview of what is happening. It will be ok for most of the time.
With AMD CPUs, it will tell you info about the CPU pipeline, but you only need this if you have heavy loops, like in graphics engines, video codecs, etc.
b) VTune.
It is very well integrated in VS2008.
After you know the hotspots, you need to sample not only time, but other things like cache misses and memory usage. This is very important. Set up a sampling session, and edit the properties. I always sample for time, memory read/write, and cache misses (three different runs).
But more than the tool, you need to get experience with profiling. And that means understanding how the CPU/Memory/PCI works... so, this is my 3rd option
c) Unit testing
This is very important if you are developing a big application that needs huge performance. If you cannot split the app into pieces, it will be difficult to track CPU usage. I don't test all the cases and classes, but I have hardcoded executions and input files with important features.
My advice is using random sampling in several small tests, and try to standardise a profile strategy.
I use xperf/ETW for all of my profiling needs. It has a steep learning curve but is incredibly powerful. If you are profiling on Windows then you must know xperf. I frequently use this profiler to find performance problems in my code and in other people's code.
In the configuration that I use it:
xperf grabs CPU samples from every core that is executing code every ms. The sampling rate can be increased to 8 KHz and the samples include user-mode and kernel code. This allows finding out what a thread is doing while it is running
xperf records every context switch (allowing for perfect reconstruction of how much time each thread uses), plus call stacks for when threads are switched in, plus call stacks for what thread readied another thread, allowing tracing of wait chains and finding out why a thread is not running
xperf records all file I/O from all processes
xperf records all disk I/O from all processes
xperf records what window is active, the CPU frequency, CPU power state, UI delays, etc.
xperf can also record all heap allocations from one process, all virtual allocations from all processes, and much more.
That's a lot of data, all on one timeline, for all processes. No other profiler on Windows can do that.
I have blogged extensively about how to use xperf/ETW. These blog posts, and some professional-quality training videos, can be found here:
http://randomascii.wordpress.com/2014/08/19/etw-training-videos-available-now/
If you want to find out what might happen if you don't use xperf read these blog posts:
http://randomascii.wordpress.com/category/investigative-reporting/
These are tales of performance problems I have found in other people's code that should have been found by the developers. This includes mshtml.dll being loaded into the VC++ compiler, a denial of service in VC++'s find-in-files, thermal throttling in a surprising number of customer machines, slow single-stepping in Visual Studio, a 4 GB allocation in a hard-disk driver, a PowerPoint performance bug, and more.
I just finished the first usable version of CxxProf, a portable manual instrumented profiling library for C++.
It fulfills the following goals:
Easy integration
Easily remove the lib during compile time
Easily remove the lib during runtime
Support for multithreaded applications
Support for distributed systems
Keep impact on a minimum
These points were ripped from the project wiki, have a look there for more details.
Disclaimer: I'm the main developer of CxxProf.
Just to throw it out, even though it's not a full-blown profiler: if all you're after is hung event loops that take long processing an event, an ad-hoc tool is a simple matter in Qt. That approach could be easily expanded to keep track of how long each event took to process, what those events were, and so on. It's not a universal profiler, but an event-loop-centric one.
In Qt, all cross-thread signal-slot calls are delivered via the event loop, as are timers, network and serial port notifications, and all user interaction. Thus, observing the event loops is a big step towards understanding where the application is spending its time.
DevPartner, originally developed by NuMega and now distributed by MicroFocus, was once the solution of choice for profiling and code analysis (memory and resource leaks for example).
I haven't tried it recently, so I cannot assure you it will help; but I once had excellent results with it, so it is an alternative I am considering reinstalling in our code quality process (they provide a 14-day trial).
Even though your OS is Win7, can't the program run under XP?
How about profiling it under XP? The result should be a hint for Win7.
There are lots of profilers listed here and I've tried a few of them myself - however I ended up writing my own based on this:
http://code.google.com/p/high-performance-cplusplus-profiler/
It does of course require that you modify the code base, but it's perfect for narrowing down bottlenecks and should work on all x86s. It uses rdtsc, which could be a problem on multi-core boxes, but this is purely for indicative timing anyway, so I find it's sufficient for my needs.
I use Orbit profiler: easy, open source and powerful! https://orbitprofiler.com/
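For anyone wondering what this kind of manual instrumentation looks like, here is a small sketch. It is not the linked profiler's API: ScopedTimer, ScopeTotals and profileRegistry are made-up names, and it uses std::chrono::steady_clock rather than rdtsc so that it stays portable.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Accumulated call count and wall-clock time for one instrumented scope.
struct ScopeTotals {
    long calls = 0;
    std::chrono::nanoseconds time{0};
};

// Hypothetical global registry; not thread-safe, for illustration only.
inline std::map<std::string, ScopeTotals>& profileRegistry() {
    static std::map<std::string, ScopeTotals> registry;
    return registry;
}

// RAII timer: records the elapsed time of the enclosing scope on destruction.
class ScopedTimer {
public:
    explicit ScopedTimer(std::string name)
        : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        ScopeTotals& t = profileRegistry()[name_];
        ++t.calls;
        t.time += std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed);
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};

// Example of an instrumented function.
inline long sumTo(long n) {
    ScopedTimer timer("sumTo");
    long total = 0;
    for (long i = 1; i <= n; ++i) total += i;
    return total;
}
```

The cost is one line per function you care about, which is why this style works best once you already have a rough idea where the bottleneck is.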

Advice for converting a large monolithic singlethreaded application to a multithreaded architecture?

My company's main product is a large monolithic C++ application, used for scientific data processing and visualisation. Its codebase goes back maybe 12 or 13 years, and while we have put work into upgrading and maintaining it (use of STL and Boost - when I joined most containers were custom, for example - fully upgraded to Unicode and the 2010 VCL, etc) there's one remaining, very significant problem: it's fully singlethreaded. Given it's a data processing and visualisation program, this is becoming more and more of a handicap.
I'm both a developer and the project manager for the next release where we want to tackle this, and this is going to be a difficult job in both areas. I'm seeking concrete, practical, and architectural advice on how to tackle the problem.
The program's data flow might go something like this:
a window needs to draw data
In the paint method, it will call a GetData method, often hundreds of times for hundreds of bits of data in one paint operation
This will go and calculate or read from file or whatever else is required (often quite a complex data flow - think of this as data flowing through a complex graph, each node of which performs operations)
Ie, the paint message handler will block while processing is done, and if the data hasn't already been calculated and cached, this can be a long time. Sometimes this is minutes. Similar paths occur for other parts of the program that perform lengthy processing operations - the program is unresponsive for the entire time, sometimes hours.
I'm seeking advice on how to approach changing this. Practical ideas. Perhaps things like:
design patterns for asynchronously requesting data?
storing large collections of objects such that threads can read and write safely?
handling invalidation of data sets while something is trying to read it?
are there patterns and techniques for this sort of problem?
what should I be asking that I haven't thought of?
I haven't done any multithreaded programming since my Uni days a few years ago, and I think the rest of my team is in a similar position. What I knew was academic, not practical, and is nowhere near enough to have confidence approaching this.
The ultimate objective is to have a fully responsive program, where all calculations and data generation is done in other threads and the UI is always responsive. We might not get there in a single development cycle :)
Edit: I thought I should add a couple more details about the app:
It's a 32-bit desktop application for Windows. Each copy is licensed. We plan to keep it a desktop, locally-running app
We use Embarcadero (formerly Borland) C++ Builder 2010 for development. This affects the parallel libraries we can use, since most seem (?) to be written for GCC or MSVC only. Luckily they're actively developing it and its C++ standards support is much better than it used to be. The compiler supports these Boost components.
Its architecture is not as clean as it should be and components are often too tightly coupled. This is another problem :)
Edit #2: Thanks for the replies so far!
I'm surprised so many people have recommended a multi-process architecture (it's the top-voted answer at the moment), not multithreading. My impression is that's a very Unix-ish program structure, and I don't know anything about how it's designed or works. Are there good resources available about it, on Windows? Is it really that common on Windows?
In terms of concrete approaches to some of the multithreading suggestions, are there design patterns for asynchronous request and consuming of data, or threadaware or asynchronous MVP systems, or how to design a task-oriented system, or articles and books and post-release deconstructions illustrating things that work and things that don't work? We can develop all this architecture ourselves, of course, but it's good to work from what others have done before and know what mistakes and pitfalls to avoid.
One aspect that isn't touched on in any answers is project managing this. My impression is estimating how long this will take and keeping good control of the project when doing something as uncertain as this may be hard. That's one reason I'm after recipes or practical coding advice, I guess, to guide and restrict coding direction as much as possible.
I haven't yet marked an answer for this question - this is not because of the quality of the answers, which is great (and thank you) but simply that because of the scope of this I'm hoping for more answers or discussion. Thank you to those who have already replied!
You have a big challenge ahead of you. I had a similar challenge ahead of me -- 15 year old monolithic single threaded code base, not taking advantage of multicore, etc. We expended a great deal of effort in trying to find a design and solution that was workable and would work.
Bad news first. It will be somewhere between impractical and impossible to make your single-threaded app multithreaded. A single-threaded app relies on its single-threadedness in ways both subtle and gross. One example is if the computation portion requires input from the GUI portion. The GUI must run in the main thread. If you try to get this data directly from the computation engine, you will likely run into deadlocks and race conditions that will require major redesigns to fix. Many of these reliances will not crop up during the design phase, or even during the development phase, but only after a release build is put in a harsh environment.
More bad news. Programming multithreaded applications is exceptionally hard. It might seem fairly straightforward to just lock stuff and do what you have to do, but it is not. First of all, if you lock everything in sight you end up serializing your application, negating every benefit of multithreading in the first place while still adding in all the complexity. Even if you get beyond this, writing a defect-free MP application is hard enough, but writing a highly-performant MP application is that much more difficult. You could learn on the job in a kind of baptism by fire. But if you are doing this with production code, especially legacy production code, you put your business at risk.
Now the good news. You do have options that don't involve refactoring your whole app and will give you most of what you seek. One option in particular is easy to implement (in relative terms), and much less prone to defects than making your app fully MP.
You could instantiate multiple copies of your application. Make one of them visible, and all the others invisible. Use the visible application as the presentation layer, but don't do the computational work there. Instead, send messages (perhaps via sockets) to the invisible copies of your application which do the work and send the results back to the presentation layer.
This might seem like a hack. And maybe it is. But it will get you what you need without putting the stability and performance of your system at such great risk. Plus there are hidden benefits. One is that the invisible engine copies of your app will have access to their own virtual memory space, making it easier to leverage all the resources of the system. It also scales nicely. If you are running on a 2-core box, you could spin off 2 copies of your engine. 32 cores? 32 copies. You get the idea.
So, there's a hint in your description of the algorithm as to how to proceed:
often quite a complex data flow - think of this as data flowing through a complex graph, each node of which performs operations
I'd look into making that data-flow graph be literally the structure that does the work. The links in the graph can be thread-safe queues, the algorithms at each node can stay pretty much unchanged, except wrapped in a thread that picks up work items from a queue and deposits results on one. You could go a step further and use sockets and processes rather than queues and threads; this will let you spread across multiple machines if there is a performance benefit in doing this.
Then your paint and other GUI methods need to be split in two: one half to queue the work, and the other half to draw or use the results as they come out of the pipeline.
This may not be practical if the app presumes that data is global. But if it is well contained in classes, as your description suggests it may be, then this could be the simplest way to get it parallelised.
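A rough sketch of that structure, assuming C++17 and nothing beyond the standard library. BlockingQueue and runPipeline are hypothetical names, and the two "nodes" (double the value, then add one) are placeholders for the real algorithms at each node of the graph.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal blocking queue; close() lets consumers drain and then stop.
template <typename T>
class BlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        ready_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Blocks until an item is available; returns nullopt once closed and empty.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return std::nullopt;
        T value = std::move(items_.front());
        items_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

// Two-node graph: node 1 doubles each input, node 2 adds one.
inline std::vector<int> runPipeline(const std::vector<int>& inputs) {
    BlockingQueue<int> source, middle, sink;
    std::thread doubler([&] {
        while (auto v = source.pop()) middle.push(*v * 2);
        middle.close();
    });
    std::thread incrementer([&] {
        while (auto v = middle.pop()) sink.push(*v + 1);
        sink.close();
    });
    for (int v : inputs) source.push(v);
    source.close();

    std::vector<int> results;
    while (auto v = sink.pop()) results.push_back(*v);
    doubler.join();
    incrementer.join();
    return results;
}
```

Because each node has exactly one thread here, results come out in input order. With several workers per node that guarantee goes away, so work items would need to carry an identifier.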
1. Don't attempt to multithread everything in the old app. Multithreading for the sake of saying it's multithreaded is a waste of time and money. You're building an app that does something, not a monument to yourself.
2. Profile and study your execution flows to figure out where the app spends most of its time. A profiler is a great tool for this, but so is just stepping through the code in the debugger. You find the most interesting things in random walks.
3. Decouple the UI from long-running computations. Use cross-thread communications techniques to send updates to the UI from the computation thread.
4. As a side-effect of #3: think carefully about reentrancy: now that the compute is running in the background and the user can smurf around in the UI, what things in the UI should be disabled to prevent conflicts with the background operation? Allowing the user to delete a dataset while a computation is running on that data is probably a bad idea. (Mitigation: computation makes a local snapshot of the data) Does it make sense for the user to spool up multiple compute operations concurrently? If handled well, this could be a new feature and help rationalize the app rework effort. If ignored, it will be a disaster.
5. Identify specific operations that are candidates to be shoved into a background thread. The ideal candidate is usually a single function or class that does a lot of work (requires a "lot of time" to complete - more than a few seconds) with well defined inputs and outputs, that makes use of no global resources, and does not touch the UI directly. Evaluate and prioritize candidates based on how much work would be required to retrofit to this ideal.
6. In terms of project management, take things one step at a time. If you have multiple operations that are strong candidates to be moved to a background thread, and they have no interaction with each other, these might be implemented in parallel by multiple developers. However, it would be a good exercise to have everybody participate in one conversion first so that everyone understands what to look for and to establish your patterns for UI interaction, etc. Hold an extended whiteboard meeting to discuss the design and process of extracting the one function into a background thread. Go implement that (together or dole out pieces to individuals), then reconvene to put it all together and discuss discoveries and pain points.
7. Multithreading is a headache and requires more careful thought than straight up coding, but splitting the app into multiple processes creates far more headaches, IMO. Threading support and available primitives are good in Windows, perhaps better than some other platforms. Use them.
8. In general, don't do any more than what is needed. It's easy to severely over implement and over complicate an issue by throwing more patterns and standard libraries at it.
9. If nobody on your team has done multithreading work before, budget time to make an expert or funds to hire one as a consultant.
The main thing you have to do is to disconnect your UI from your data set. I'd suggest that the way to do that is to put a layer in between.
You will need to design a data structure of data cooked-for-display. This will most likely contain copies of some of your back-end data, but "cooked" to be easy to draw from. The key idea here is that this is quick and easy to paint from. You may even have this data structure contain calculated screen positions of bits of data so that it's quick to draw from.
Whenever you get a WM_PAINT message you should get the most recent complete version of this structure and draw from it. If you do this properly, you should be able to handle multiple WM_PAINT messages per second because the paint code never refers to your back end data at all. It's just spinning through the cooked structure. The idea here is that it's better to paint stale data quickly than to hang your UI.
Meanwhile...
You should have 2 complete copies of this cooked-for-display structure. One is what the WM_PAINT message looks at. (call it cfd_A) The other is what you hand to your CookDataForDisplay() function. (call it cfd_B). Your CookDataForDisplay() function runs in a separate thread, and works on building/updating cfd_B in the background. This function can take as long as it wants because it isn't interacting with the display in any way. Once the call returns cfd_B will be the most up-to-date version of the structure.
Now swap cfd_A and cfd_B and InvalidateRect on your application window.
A simplistic way to do this is to have your cooked-for-display structure be a bitmap, and that might be a good way to go to get the ball rolling, but I'm sure with a bit of thought you can do a much better job with a more sophisticated structure.
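Here is one way the hand-over between the two copies might look in standard C++. CookedFrame and FrameExchange are hypothetical names, and cookDataForDisplay() is only a stand-in for the real function. Note that this sketch publishes an immutable snapshot through a std::shared_ptr instead of literally swapping two fixed buffers; that is a substitution on my part, chosen so the background thread can never overwrite a frame the paint code is still reading.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical "cooked for display" data: whatever paint needs, precomputed.
struct CookedFrame {
    int version = 0;
    std::vector<std::string> labels;
};

// Holds the frame the UI paints from and lets a worker publish a newer one.
class FrameExchange {
public:
    // Called by the background thread once its frame is complete.
    void publish(std::shared_ptr<const CookedFrame> frame) {
        std::lock_guard<std::mutex> lock(mutex_);
        front_ = std::move(frame);
    }
    // Called by the paint handler; cheap, never waits on computation.
    std::shared_ptr<const CookedFrame> latest() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return front_;
    }

private:
    mutable std::mutex mutex_;
    std::shared_ptr<const CookedFrame> front_ = std::make_shared<CookedFrame>();
};

// Stand-in for the slow CookDataForDisplay() step; runs on the worker thread.
inline std::shared_ptr<const CookedFrame> cookDataForDisplay(int version) {
    auto frame = std::make_shared<CookedFrame>();
    frame->version = version;
    frame->labels = {"series A", "series B"};
    return frame;
}
```

The worker would call publish(cookDataForDisplay(...)) and then ask the window to repaint; the paint handler calls latest() and draws whatever it gets. The lock is only held long enough to copy a pointer.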
So, referring back to your example.
In the paint method, it will call a GetData method, often hundreds of times for hundreds of bits of data in one paint operation
This is now 2 threads, the paint method refers to cfd_A and runs on the UI thread. Meanwhile cfd_B is being built by a background thread using GetData calls.
The quick-and-dirty way to do this is
Take your current WM_PAINT code, stick it into a function called PaintIntoBitmap().
Create a bitmap and a Memory DC, this is cfd_B.
Create a thread and pass it cfd_B and have it call PaintIntoBitmap()
When this thread completes, swap cfd_B and cfd_A
Now your new WM_PAINT method just takes the pre-rendered bitmap in cfd_A and draws it to the screen. Your UI is now disconnected from your backend GetData() function.
Now the real work begins, because the quick-and-dirty way doesn't handle window resizing very well. You can go from there to refine what your cfd_A and cfd_B structures are a little at a time until you reach a point where you are satisfied with the result.
You might just start out breaking the UI and the work task into separate threads.
In your paint method, instead of calling getData() directly, it puts the request in a thread-safe queue. getData() is run in another thread that reads its data from the queue. When the getData thread is done, it signals the main thread to redraw the visualisation area with its result data, using thread synchronization to pass the data.
While all this is going on you of course have a progress bar saying "reticulating splines" so the user knows something is going on.
This would keep your UI snappy without the significant pain of multithreading your work routines (which can be akin to a total rewrite)
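As a small illustration of the "paint never blocks" half of this, here is a sketch that uses std::async and a polled std::future in place of a hand-written request queue (a deliberate simplification, not what the text above literally describes). PendingData and this getData() are hypothetical, and C++17 is assumed for std::optional.

```cpp
#include <chrono>
#include <future>
#include <optional>
#include <string>

// Stand-in for the slow getData() call; hypothetical.
inline std::string getData(int key) {
    return "data for " + std::to_string(key);
}

// Tracks one outstanding request so the paint handler never blocks on it.
class PendingData {
public:
    // Starts getData(key) on another thread. Calling this again while a
    // request is still running waits for the old one to finish first.
    void request(int key) {
        result_ = std::async(std::launch::async, getData, key);
    }
    // Returns the data if it has arrived, otherwise nullopt
    // (the caller draws a placeholder or progress bar instead).
    std::optional<std::string> tryTake() {
        if (!result_.valid()) return std::nullopt;
        if (result_.wait_for(std::chrono::seconds(0)) != std::future_status::ready)
            return std::nullopt;
        return result_.get();
    }
    // Blocks until the outstanding request finishes; for shutdown or tests.
    void wait() {
        if (result_.valid()) result_.wait();
    }

private:
    std::future<std::string> result_;
};
```

In a real UI you would rather have the worker post a "data ready" message to the window than poll, but the shape is the same: the paint path only ever asks "is it there yet?".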
It sounds like you have several different issues that parallelism can address, but in different ways.
Performance increases through utilizing multicore CPU architectures
You're not taking advantage of the multi-core CPU architectures that are becoming so common. Parallelization allows you to divide work amongst multiple cores. You can write that code through standard C++ divide and conquer techniques using a "functional" style of programming where you pass work to separate threads at the divide stage. Google's MapReduce pattern is an example of that technique. Intel has the new Cilk library to give you C++ compiler support for such techniques.
Greater GUI responsiveness through asynchronous document-view
By separating the GUI operations from the document operations and placing them on different threads, you can increase the apparent responsiveness of your application. The standard Model-View-Controller or Model-View-Presenter design patterns are a good place to start. You need to parallelize them by having the model inform the view of updates rather than have the view provide the thread on which the document computes itself. The View would call a method on the model asking it to compute a particular view of the data, and the model would inform the presenter/controller as information is changed or new data becomes available, which would get passed to the view to update itself.
Opportunistic caching and pre-calculation
It sounds like your application has a fixed base of data, but many possible compute-intensive views on the data. If you did a statistical analysis on which views were most commonly requested in what situations, you could create background worker threads to pre-calculate the likely-requested values. It may be useful to put these operations on low-priority threads so that they don't interfere with the main application processing.
Obviously, you'll need to use mutexes (or critical sections), events, and probably semaphores to implement this. You may find some of the new synchronization objects in Vista useful, like the slim reader-writer lock, condition variables, or the new thread pool API. See Joe Duffy's book on concurrency for how to use these basic techniques.
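For a portable flavour of the reader-writer idea, C++17's std::shared_mutex plays a role analogous to the slim reader-writer lock mentioned above. The sketch below assumes a C++17 compiler; ResultCache is a hypothetical name for a cache of pre-calculated values that many threads read and a few threads update.

```cpp
#include <map>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>

// Hypothetical cache of computed values: many readers, occasional writers.
class ResultCache {
public:
    std::optional<double> find(const std::string& key) const {
        // Shared lock: any number of readers may hold it at the same time.
        std::shared_lock<std::shared_mutex> lock(mutex_);
        auto it = values_.find(key);
        if (it == values_.end()) return std::nullopt;
        return it->second;
    }
    void store(const std::string& key, double value) {
        // Exclusive lock: waits for readers and other writers to leave.
        std::unique_lock<std::shared_mutex> lock(mutex_);
        values_[key] = value;
    }
    void invalidate(const std::string& key) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        values_.erase(key);
    }

private:
    mutable std::shared_mutex mutex_;
    std::map<std::string, double> values_;
};
```

find() returns a copy rather than a reference, so a value cannot be invalidated underneath a reader that has already left the lock.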
There is something that no-one has talked about yet, but which is quite interesting.
It's called futures. A future is the promise of a result... let's see with an example.
future<int> leftVal = computeLeftValue(treeNode); // [1]
int rightVal = computeRightValue(treeNode); // [2]
result = leftVal + rightVal; // [3]
It's pretty simple:
You spin off a thread that starts computing leftVal, taking it from a pool for example to avoid the initialization problem.
While leftVal is being computed, you compute rightVal.
You add the two, this may block if leftVal is not computed yet and wait for the computation to end.
The great benefit here is that it's straightforward: each time you have one computation followed by another that is independent and you then join the result, you can use this pattern.
See Herb Sutter's article on futures, they will be available in the upcoming C++0x but there are already libraries available today even if the syntax is perhaps not as pretty as I would make you believe ;)
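With a C++11 or later compiler, the three numbered steps map fairly directly onto std::async and std::future. The sketch below is only an illustration: TreeNode, sumTree and parallelSum are hypothetical, and std::launch::async starts a thread per call rather than taking one from a pool.

```cpp
#include <future>

// Hypothetical tree node, just enough to make the example compile.
struct TreeNode {
    int value = 0;
    const TreeNode* left = nullptr;
    const TreeNode* right = nullptr;
};

// Sequential sum of a subtree.
inline int sumTree(const TreeNode* node) {
    if (node == nullptr) return 0;
    return node->value + sumTree(node->left) + sumTree(node->right);
}

// Computes the left subtree on another thread while this thread does the right.
inline int parallelSum(const TreeNode& root) {
    std::future<int> leftVal =
        std::async(std::launch::async, sumTree, root.left);  // [1] runs elsewhere
    int rightVal = sumTree(root.right);                      // [2] runs here
    return root.value + leftVal.get() + rightVal;            // [3] get() may block
}
```

The only synchronization point is the get() call, which is what makes the pattern easy to reason about.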
If it was my development dollars I was spending, I would start with the big picture:
What do I hope to accomplish, and how much will I spend to accomplish this, and how will I be further ahead? (If the answer to this is, my app will run 10% better on quadcore PCs, and I could have achieved the same result by spending $1000 more per customer PC , and spending $100,000 less this year on R&D, then, I would skip the whole effort).
Why am I doing multi-threaded instead of massively parallel distributed? Do I really think threads are better than processes? Multi-core systems also run distributed apps pretty well. And there are some advantages to message-passing process based systems that go beyond the benefits (and the costs!) of threading. Should I consider a process-based approach? Should I consider a background running entirely as a service, and a foreground GUI? Since my product is node-locked and licensed, I think services would suit me (vendor) quite well. Also, separating stuff into two processes (background service and foreground) just might force the kind of rewrite and rearchitecting to occur that I might not be forced to do, if I was to just add threading into my mix.
This is just to get you thinking: What if you were to rewrite it as a service (background app) and a GUI, because that would actually be easier than adding threading, without also adding crashes, deadlocks, and race conditions?
Consider the idea that for your needs, perhaps threading is evil. Develop your religion, and stick with that. Unless you have a real good reason to go the other way. For many years, I religiously avoided threading. Because one thread per process is good enough for me.
I don't see any really solid reasons in your list why you need threading, except ones that could be more inexpensively solved by more expensive target computer hardware. If your app is "too slow" adding in threads might not even speed it up.
I use threads for background serial communications, but I would not consider threading merely for computationally heavy applications, unless my algorithms were so inherently parallel as to make the benefits clear, and the drawbacks minimal.
I wonder if the "design" problems that this C++Builder app has are like my Delphi "RAD Spaghetti" application disease. I have found that a wholesale refactor/rewrite (over a year per major app that I have done this to), was a minimum amount of time for me to get a handle on application "accidental complexity". And that was without throwing a "threads where possible" idea. I tend to write my apps with threads for serial communication and network socket handling, only. And maybe the odd "worker-thread-queue".
If there is a place in your app you can add ONE thread, to test the waters, I would look for the main "work queue" and I would create an experimental version control branch, and I would learn about how my code works by breaking it in the experimental branch. Add that thread. And see where you spend your first day of debugging. Then I might just abandon that branch and go back to my trunk until the pain in my temporal lobe subsides.
Warren
Here's what I would do...
I would start by profiling your app and seeing:
1) what is slow and what the hot paths are
2) which calls are reentrant or deeply nested
you can use 1) to determine where the opportunity is for speedups and where to start looking for parallelization.
you can use 2) to find out where the shared state is likely to be and get a deeper sense of how much things are tangled up.
I would use a good system profiler and a good sampling profiler (like the Windows Performance Toolkit or the concurrency views of the profiler in Visual Studio 2010 Beta2 - these are both 'free' right now).
Then I would figure out what the goal is and how to separate things gradually to a cleaner design that is more responsive (moving work off the UI thread) and more performant (parallelizing computationally intensive portions). I would focus on the highest priority and most noticeable items first.
If you don't have a good refactoring tool like VisualAssist, invest in one - it's worth it. If you're not familiar with Michael Feathers or Kent Beck's refactoring books, consider borrowing them. I would ensure my refactorings are well covered by unit tests.
You can't move to VS (otherwise I would recommend the products I work on, the Asynchronous Agents Library and the Parallel Patterns Library; you can also use TBB or OpenMP).
In boost, I would look carefully at boost::thread, the asio library and the signals library.
I would ask for help / guidance / a listening ear when I got stuck.
-Rick
You can also look at this article from Herb Sutter: "You have a mass of existing code and want to add concurrency. Where do you start?"
Well, I think you're expecting a lot based on your comments here. You're not going to go from minutes to milliseconds by multithreading. The most you can hope for is the current amount of time divided by the number of cores. That being said, you're in a bit of luck with C++. I've written high performance multiprocessor scientific apps, and what you want to look for is the most embarrassingly parallel loop you can find. In my scientific code, the heaviest piece is calculating somewhere between 100 and 1000 data points. However, all of the data points can be calculated independently of the others. You can then split the loop using OpenMP. This is the easiest and most efficient way to go. If your compiler doesn't support OpenMP, then you will have a very hard time porting existing code. With OpenMP (if you're lucky), you may only have to add a couple of #pragmas to get 4-8x the performance. Here's an example: StochFit
I hope this will help you in understanding and converting your monolithic single-threaded app to multithreaded easily. Sorry, it is for another programming language, but nevertheless the principles explained are the same all over.
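To show what "a couple of #pragmas" means in practice, here is a minimal sketch. evaluatePoint is a made-up stand-in for the real per-point calculation. It assumes the compiler is invoked with OpenMP enabled (for example -fopenmp on GCC); without that flag the pragma is ignored and the loop simply runs sequentially with the same result.

```cpp
#include <cmath>
#include <vector>

// Hypothetical per-point computation; each point is independent of the others.
inline double evaluatePoint(double x) {
    return std::sqrt(x) * 2.0;
}

// Each iteration writes only results[i], so iterations can run in any order
// and on any thread without locking.
inline std::vector<double> evaluateAll(const std::vector<double>& points) {
    std::vector<double> results(points.size());
    const long n = static_cast<long>(points.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        results[i] = evaluatePoint(points[i]);
    }
    return results;
}
```

The important property is in the loop body, not the pragma: no iteration reads or writes anything another iteration touches.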
http://www.freevbcode.com/ShowCode.Asp?ID=1287
Hope this helps.
The first thing you must do is to separate your GUI from your data, the second is to create a multithreaded class.
STEP 1 - Responsive GUI
We can assume that the image you are producing is contained in the canvas of a TImage. You can put a simple TTimer in your form and write code like this:
    if (CurrentData.LastUpdate > CurrentUpdate)
    {
        Image1->Canvas->Draw(0, 0, CurrentData.Bitmap);
        CurrentUpdate = Now();
    }
OK, I know, it's a little bit dirty, but it's fast and simple. The point is that:
You need an object that is created in the main thread
The object is copied into the form you need, only when needed and in a safe way (OK, better protection for the Bitmap may be needed, but for simplicity...)
The object CurrentData is your actual project, single-threaded, that produces an image
Now you have a fast and responsive GUI. If your algorithm is slow, the refresh is slow, but your user will never think that your program has frozen.
STEP 2 - Multithread
I suggest you to implement a class like the following:
SimpleThread.h
    typedef void (__closure *TThreadFunction)(void* Data);
    class TSimpleThread : public TThread
    {
    public:
        TSimpleThread( TThreadFunction _Action, void* _Data = NULL, bool RunNow = true );
        void AbortThread();
        __property Terminated;
    protected:
        TThreadFunction ThreadFunction;
        void* Data;
    private:
        // Runs the supplied function on the new thread.
        virtual void __fastcall Execute() { ThreadFunction(Data); };
    };
SimpleThread.cpp
    TSimpleThread::TSimpleThread( TThreadFunction _Action, void* _Data, bool RunNow)
        : TThread(true), // initialize suspended
          ThreadFunction(_Action), Data(_Data)
    {
        FreeOnTerminate = false;
        if (RunNow) Resume();
    }
    void TSimpleThread::AbortThread()
    {
        Suspend(); // Can't kill a running thread
        Free(); // Kills thread
    }
Let's explain. Now, in your single-threaded class you can create an object like this:
    TSimpleThread *ST;
    ST = new TSimpleThread( RefreshFunction, NULL, true );
    ST->Resume(); // redundant here: RunNow = true already resumed the thread
Let's explain better: now, in your own monolithic class, you have created a thread. More: you have moved a function (i.e. RefreshFunction) into a separate thread. The scope of your function is the same, the class is the same, the execution is separate.
My number one suggestion, although it's very late (sorry for reviving an old thread, it's interesting!), is to seek out homogeneous transform loops where each iteration of the loop is mutating a completely independent piece of data from the other iterations.
Instead of thinking about how to turn this old codebase into an asynchronous one running all kinds of operations in parallel (which could be asking for all kinds of trouble from worse than single-threaded performance from poor locking patterns or exponentially worse, race conditions/deadlocks by trying to do this in hindsight to code you can't fully comprehend), stick to the sequential mindset for the overall application design for now but identify or extract simple, homogeneous transform loops. Don't go from intrusive broad design-level multithreading and then try to drill into details. Work from non-intrusive multithreading of fine implementation details and specific hotspots first.
What I mean by homogeneous loops is basically one that transforms data in a very straightforward way, like:
    for each pixel in image:
        make it brighter
That is very simple to reason about and you can safely parallelize this loop without any problems whatsoever using OMP or TBB or whatever and without getting tangled up in thread synchronization. It only takes one glance at this code to fully comprehend its side effects.
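Here is roughly what the brighten loop looks like when split by hand across std::thread workers, as a sketch of the "or whatever" option. brightenRange and brighten are hypothetical names, the image is assumed to be a flat array of 8-bit values, and amount is assumed to be non-negative.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Brightens one contiguous slice; touches no data outside [begin, end).
inline void brightenRange(std::vector<std::uint8_t>& pixels,
                          std::size_t begin, std::size_t end, int amount) {
    for (std::size_t i = begin; i < end; ++i) {
        // Clamp at 255 so bright pixels saturate instead of wrapping around.
        pixels[i] = static_cast<std::uint8_t>(std::min(255, pixels[i] + amount));
    }
}

// Splits the image into disjoint slices, one per thread. No locking is needed
// because no two threads ever write the same element.
inline void brighten(std::vector<std::uint8_t>& pixels, int amount,
                     unsigned threadCount) {
    threadCount = std::max(1u, threadCount);
    const std::size_t chunk = (pixels.size() + threadCount - 1) / threadCount;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t begin = std::min(pixels.size(), t * chunk);
        const std::size_t end = std::min(pixels.size(), begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        workers.emplace_back(brightenRange, std::ref(pixels), begin, end, amount);
    }
    for (std::thread& w : workers) w.join();
}
```

The whole correctness argument fits in one sentence (the slices do not overlap), which is exactly the property that makes this kind of loop a safe first target.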
Try to find as many hotspots as you can which fit this type of simple homogeneous transform loop and if you have complex loops which update many different types of data with complex control flows that trigger complex side effects, then seek to refactor towards these homogeneous loops. Often a complex loop which causes 3 disparate side effects to 3 different types of data can be turned into 3 simple homogeneous loops which each trigger just one kind of side effect to one type of data with a simpler control flow. Doing multiple loops instead of one might seem a tad wasteful, but the loops become simpler, the homogeneity will often lead to more cache-friendly sequential memory access patterns vs. sporadic random-access patterns, and you then tend to find much more opportunities to safely parallelize (as well as vectorize) the code in a straightforward way.
First you have to thoroughly understand the side effects of any code you attempt to parallelize (and I mean thoroughly!!!), so seeking out these homogeneous loops gives you isolated areas of the codebase you can easily reason about in terms of the side effects to the point where you can confidently and safely parallelize those hotspots. It'll also improve the maintainability of the code by making it very easy to reason about the state changes going on in that particular piece of code. Save the dream of the uber multithreaded application running everything in parallel for later. For now, focus on identifying/extracting performance-critical, homogeneous loops with simple control flows and simple side effects. Those are your priority targets for parallelization with simple parallelized loops.
Now admittedly I somewhat dodged your questions, but most of them don't apply if you do what I suggest, at least until you've worked your way out to the point where you're thinking more about multithreading designs as opposed to simply parallelizing implementation details. And you might not even need to go that far to have a very competitive product in terms of performance. If you have beefy work to do in a single loop, you can devote the hardware resources to making that loop go faster instead of making many operations run simultaneously. If you have to resort to more asynchronous methods, for example because your hotspots are more I/O bound, seek an async/wait approach: fire off an async task, do some other things in the meantime, and then wait on the async task(s) to complete. Even if that's not absolutely necessary, the idea is to section off isolated areas of your codebase where you can say, with 100% confidence (or at least 99.9999999%), that the multithreaded code is correct.
You don't ever want to gamble with race conditions. There's nothing more demoralizing than an obscure race condition that only occurs once in a blue moon on some random user's machine while your entire QA team is unable to reproduce it. Three months later you run into it yourself, but of course during the one time you ran a release build without debugging info, and then you toss and turn in your sleep knowing your codebase can flake out at any given moment in ways that no one will ever be able to consistently reproduce. So take it easy with multithreading legacy codebases, at least for now, and stick to multithreading isolated but critical sections of the codebase where the side effects are dead simple to reason about. And test the crap out of it: ideally apply a TDD approach where you write a test for the code you're going to multithread to ensure it still gives the correct output after you finish. Race conditions are the types of things that easily fly under the radar of unit and integration testing, though, so again you absolutely need to be able to comprehend the entirety of the side effects that go on in a given piece of code before you attempt to multithread it. The best way to do that is to make the side effects as easy to comprehend as possible, with the simplest control flows causing just one type of side effect for an entire loop.
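The async/wait pattern mentioned above could look roughly like this with the standard library's `std::async`. Note that `load_samples` is a hypothetical stand-in for whatever slow, I/O-bound call you actually have, and `process` is likewise just an illustrative name:

```cpp
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a slow, I/O-bound operation (file, socket, ...).
std::vector<int> load_samples()
{
    return {1, 2, 3, 4};
}

int process()
{
    // Fire off the slow task on another thread.
    std::future<std::vector<int>> pending =
        std::async(std::launch::async, load_samples);

    // Do unrelated work in the meantime. It shares no data with the task,
    // so no synchronization is needed here.
    int local_work = 0;
    for (int i = 1; i <= 10; ++i)
        local_work += i;

    // Wait on the task: get() blocks until the result is ready.
    const std::vector<int> samples = pending.get();
    return local_work + std::accumulate(samples.begin(), samples.end(), 0);
}
```

The only point where the two threads meet is `pending.get()`, which keeps the side effects easy to reason about. On some toolchains you may need to link the threading library explicitly (for example `-pthread`).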
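One way to approach the testing advice, sketched under the assumption that you keep the original sequential routine around as a reference and compare the parallelized version against it. All names here are hypothetical, and the OpenMP pragma is only one possible way to parallelize:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The original sequential code, kept as the trusted reference.
std::vector<int> square_all_reference(const std::vector<int>& in)
{
    std::vector<int> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] * in[i];
    return out;
}

// The parallelized version under test. Each iteration writes only out[i].
std::vector<int> square_all_parallel(const std::vector<int>& in)
{
    std::vector<int> out(in.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(in.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i)
        out[i] = in[i] * in[i];
    return out;
}

// Returns true when both versions agree on an input of n elements.
bool outputs_match(std::size_t n)
{
    std::vector<int> input(n);
    for (std::size_t i = 0; i < n; ++i)
        input[i] = static_cast<int>(i);
    return square_all_reference(input) == square_all_parallel(input);
}
```

A check like `assert(outputs_match(1000));` guards against plain logic mistakes made while restructuring the loop. It does not prove the absence of races, which is why understanding the side effects still has to come first.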
It is hard to give you proper guidelines. But...
The easiest way out, in my opinion, is to convert your application to an ActiveX EXE. COM has support for threading and the like built right into it, so your program will automatically become a multithreaded application. Of course you will have to make quite a few changes to your code, but this is the shortest and safest way to go.
I am not sure, but the RichClient Toolset lib may do the trick for you. On the site, the author has written:
It also offers registration free Loading/Instancing-capabilities for ActiveX-Dlls and new, easy to use Threading-approach, which works with Named-Pipes under the hood and works therefore also cross-process.
Please check it out. Who knows, it may be the right solution for your requirements.
As for project management, I think you can continue using what your IDE of choice provides, integrating it with SVN through plugins.
I forgot to mention that we have completed an application for the share market that automatically trades (buys and sells based on lows and highs) the scrips in the user's portfolio, based on an algorithm that we have developed.
While developing this software we faced the same kind of problem you have described here. To solve it, we converted our application into an ActiveX EXE, and we converted all the parts that need to execute in parallel into ActiveX DLLs. We have not used any third-party libs for this!
HTH