Do you know some way to avoid race conditions when you write a makefile that should be run in parallel? Are there some kind of barriers? How can I run some parts sequentially? Or, how can I pause the other jobs until some critical object is created?
Thanks!
There are a few things you need to consider:
Don't generate files which are not specified as a target. For instance, if you have two rules that coincidentally both produce "fred.tmp" (and maybe even delete it afterwards), and they run at the same time, bad things will happen.
Keep your build rules as short as possible. Preferably one command.
Don't do recursive builds. Even if they don't break, the -j gets passed to each child make, resulting in an exponential rise in CPU usage.
There is no way in standard make to indicate 2 rules cannot be run in parallel.
You may wish to consider investigating some other build tools as well (such as SCons) to see if they solve your problem better.
I am parsing a folder structure that is quite heavy (in terms of the number of folders and files). I have to go through all the folders and parse any files I come across. The files themselves are small (1000-2000 characters, although a few are bigger). I have two options:
Go through all the folders and files and parse any that I come
across in one big recursive loop.
Go through all the folders and store the paths of all the files
that I come across. The in another loop, parse the files by referring
to the stored file paths.
Which option would be better and maybe faster? (The speed will most likely be I/O bound, so it will most likely not make a difference, but I thought I'd ask anyway.)
Pick the option that gives you the most readable and understandable code, especially since the two options you provide are functionally identical. Seriously, you want others, and your future self, to be able to look at the code and have some clue as to what it does.
"The most readable and the most understandable" almost always means "the simplest and the easiest way." (Although some code is inherently complex. That's still not an excuse to write unreadable code.) Option 1 sounds easier to implement in my opinion, but try it for yourself. Profile for bottlenecks if it isn't fast enough.
Most likely, the actual disk I/O will take much longer than the total processor cycles or memory accesses needed for either option, so which option you take might not even be relevant. But the only way to know for sure how fast your programs are running and whether you need improvements is by profiling.
How about one thread that creates the list of file names to process, and another thread that reads through that list of files and uses one of a handful of worker threads to do the processing?
I don't know how many directories there are, but I'd guess that's not the big time sink. I'd say you'd get the best performance by having a thread pool, with each thread in the pool parsing a file (once you have the list of them). Because that work is going to be so I/O bound, the threading will probably make things far more efficient.
The options seem to be functionally identical. I would say the consideration should be readability and maintainability - which is easier to understand and change later on when you need to expand it or fix bugs.
It is also worth considering breaking the functionality into separate objects - one is performing the search while the other is parsing the files found. Then you can run them concurrently and achieve better CPU utilization.
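For concreteness, here is a minimal sketch of that split, using C++17 threads and <filesystem>; the parse_file body is just a stand-in for the asker's real parser:

#include <condition_variable>
#include <filesystem>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

namespace fs = std::filesystem;

std::queue<fs::path> paths;      // files found by the search loop
std::mutex m;
std::condition_variable cv;
bool done = false;

void parse_file(const fs::path& p) {  // stand-in for the real parser
    std::cout << "parsing " << p << '\n';
}

void worker() {                  // each pool thread pulls paths and parses
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return done || !paths.empty(); });
        if (paths.empty()) return;       // search finished, queue drained
        fs::path p = paths.front();
        paths.pop();
        lock.unlock();
        parse_file(p);
    }
}

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker);

    // Producer: walk the tree and queue every regular file.
    for (const auto& entry : fs::recursive_directory_iterator(".")) {
        if (!entry.is_regular_file()) continue;
        { std::lock_guard<std::mutex> lock(m); paths.push(entry.path()); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_all();
    for (auto& t : pool) t.join();
}

The searcher and the parsers overlap, so the CPU-bound parsing can proceed while the directory walk is still waiting on the disk.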
It depends a lot on how deep the folder structure will be and how much data you'll have to hold in memory (including number of files/filenames).
If you have an extremely deep structure, the recursive option could run into a stack overflow; however, given path length limits, that's not very likely. With the second option you will have to store all the file names in memory, which will probably be a pain, but probably won't actually be a problem.
Assuming the functions are reasonably simple, it will likely be easiest to simply call the recursive search function for each directory you find and the file parser for each valid file, all in a single loop (sketched here in C++17 <filesystem> terms):
#include <filesystem>

void search_folder(const std::filesystem::path& dir)
{
    for (const auto& item : std::filesystem::directory_iterator(dir)) {
        if (item.is_regular_file())
            parse_file(item.path());     // parse each file as we find it
        else if (item.is_directory())
            search_folder(item.path());  // recurse into subfolders
    }
}
That gives you a relatively simple and very readable structure, at the cost of potentially deep recursion. Caching filenames and going through them later involves a lot more code and will likely be less readable, and (assuming you handle directories the same way) will have the same amount of recursion.
I'd go with #1, since it seems the more flexible and elegant solution.
I have a target platform that reports when memory is read from or written to, as well as when locks (think mutexes, for example) are taken/freed. It reports the program counter, the data address, and a read/write flag. I am writing a program that uses this information on a separate host machine, where the reports are received, so that it does not interfere with the target. The target already reports this data, so I am not changing the target code at all.
Are there any references or already available algorithms that do this kind of detection? For example, some way of detecting race conditions when multiple threads try to write to a global variable without protecting it first.
I am currently brewing my own, but I keep telling myself there must be code out there that does this already, or at least some proven algorithm for how to go about it.
Note: this is not about detecting memory leaks.
Note: the implementation language is C++.
I am trying to make the detection code I write platform agnostic, so I am using the STL and plain standard C++, with libraries like Boost, POCO, and Loki.
Any leads will help.
Thanks!
It is probably too late to talk you out of this, but this does not work. Threading races are caused by subtle timing issues between threads, and you can never diagnose timing-related problems with logging. It's Heisenbergian: just logging alters the timing of a thread, especially the kind of logging you are contemplating. Infamously, there's plenty of software that shipped with logging kept turned on because it would nosedive with it turned off.
Flushing out threading bugs is hard. The kind of tool that works is one that intentionally injects random delays into the code; Microsoft CHESS is an example, and it works on native code too.
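The crude version of the delay-injection idea is a wrapper that sleeps for a random interval before every acquisition, so each run interleaves differently. A toy sketch of the principle only (the class name is hypothetical, and this is not how CHESS itself works):

#include <chrono>
#include <mutex>
#include <random>
#include <thread>

// Toy delay-injection wrapper: randomizing the moment each thread
// acquires the lock shakes up interleavings, so latent races surface
// more often under test.
class jitter_mutex {
    std::mutex m_;
public:
    void lock() {
        thread_local std::mt19937 rng(std::random_device{}());
        std::uniform_int_distribution<int> jitter(0, 100);  // microseconds
        std::this_thread::sleep_for(std::chrono::microseconds(jitter(rng)));
        m_.lock();
    }
    void unlock() { m_.unlock(); }
};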
To address only part of your question, race conditions are extremely nasty precisely because there is no good way to test for them. By definition they're unpredictable sequences of events that are quite difficult to diagnose. Detection code depends on the fact that the race condition is actually happening, and in that case it's likely that you'll see errant behavior anyway. Any test code you add may make them more or less likely to appear, or possibly even change the timing such that they never appear at all.
Instead of trying to detect race conditions, what about attempting program design that helps make you more resilient to having them in the first place?
For example, if your global variable were simply encapsulated in an object that knows all the proper protection that needs to happen on access, then it would be impossible for threads to concurrently write to it, because such an interface wouldn't exist. Programmatically preventing race conditions is going to be easier than trying to detect them algorithmically (and chances are you'll still catch some during unit/subsystem testing).
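A minimal sketch of that kind of encapsulation, with a hypothetical class name; because the only access paths take the lock, an unprotected concurrent write cannot even be expressed:

#include <mutex>

class guarded_counter {   // hypothetical encapsulation of a shared global
    std::mutex m_;
    long value_ = 0;
public:
    void add(long n) {    // every write goes through the lock
        std::lock_guard<std::mutex> lock(m_);
        value_ += n;
    }
    long get() {          // reads are protected too
        std::lock_guard<std::mutex> lock(m_);
        return value_;
    }
};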
What is the best way to test that an implementation of a mutex is indeed correct? (It is necessary to implement the mutex; reuse is not a viable option.)
The best I have come up with is to have many (N) concurrent threads each iteratively attempt to access the protected region (I) times, where each access has a side effect (e.g. an update to a global variable), so that the updates can be counted to ensure that the number of updates to the global is exactly N*I.
Any other suggestions?
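For reference, a minimal sketch of the counter test described above, assuming the mutex under test is exposed as a hypothetical my_mutex with lock()/unlock():

#include <cassert>
#include <thread>
#include <vector>

my_mutex mtx;               // hypothetical: the implementation under test
volatile long counter = 0;  // deliberately not atomic

int main() {
    const int N = 8, I = 100000;
    std::vector<std::thread> threads;
    for (int t = 0; t < N; ++t)
        threads.emplace_back([&] {
            for (int i = 0; i < I; ++i) {
                mtx.lock();
                counter = counter + 1;  // non-atomic read-modify-write
                mtx.unlock();
            }
        });
    for (auto& t : threads) t.join();
    assert(counter == (long)N * I);  // a broken mutex loses updates
}

If the mutex ever fails to exclude, two threads interleave their read-modify-write and an update is lost, so the final assertion fires.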
Formal proof is better than testing for this kind of thing.
Testing will show you that -- as long as you're not unlucky -- it all worked. But a test is a blunt instrument; it can fail to execute the exact right sequence to cause failure.
It's too hard to test every possible sequence of operations available in the hardware to be sure your mutex works under all circumstances.
Testing is not without value; it shows that you didn't make any obvious coding errors.
But you really need a more formal code inspection to demonstrate that it does the right things at the right times, so that one client will atomically seize the lock resource required for proper mutual exclusion. On many platforms there are special instructions to implement this, and if you're using one of those, you have a fighting chance of getting it right.
Similarly, you have to show that the release is atomic.
With something like a mutex, we get back to the old rule that testing can only demonstrate the presence of bugs, not the absence. A year of testing will probably tell you less than simply putting the code up for inspection, and asking if anybody sees a problem with it.
I'm with everyone else that this is incredibly difficult to prove conclusively; I have no idea how to do it - not helpful, I know!
When you say you must implement a mutex and that reuse is not an option, is that for technical reasons, like there being no mutex implementation on the platform/OS you are using, or for some other reason? Is wrapping some form of OS-level 'lock' and calling it your mutex implementation an option, e.g. a CRITICAL_SECTION on Windows or POSIX condition variables? If you can wrap a lower-level OS lock, then your chances of getting it right are much higher.
If you have not already done so, go and read Herb Sutter's Effective Concurrency articles. There should be some material in them worth something to you.
Anyway, some things to consider in your tests:
- If your mutex is recursive (i.e. the same thread can lock it multiple times), then ensure you do some reference-counting tests (see the sketch after this list).
- With the global variable that you modify, it would be best if it were a variable that can't be written to atomically. For example, if you are on an 8-bit platform, use a 16- or 32-bit variable that requires multiple assembly instructions to write.
- Carefully examine the assembly listing. Depending on the hardware platform, though, the assembly does not translate directly to how the code might be optimised...
- Get someone else, who didn't write the code, to also write some tests.
- Test on as many different machines with different specs as you can (assuming that this is 'general purpose' and not for a specific hardware setup).
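As a sketch of the reference-counting test mentioned in the first point, assuming a hypothetical my_recursive_mutex with lock()/unlock()/try_lock():

#include <cassert>
#include <thread>

my_recursive_mutex mtx;  // hypothetical: the recursive mutex under test

int main() {
    mtx.lock();
    mtx.lock();      // second acquisition by the owner must not deadlock
    mtx.unlock();    // count drops 2 -> 1: still held

    bool grabbed = false;
    std::thread t([&] { grabbed = mtx.try_lock(); });
    t.join();
    assert(!grabbed);  // other threads must still be locked out

    mtx.unlock();      // count 1 -> 0: actually released
    std::thread t2([&] {
        grabbed = mtx.try_lock();
        if (grabbed) mtx.unlock();
    });
    t2.join();
    assert(grabbed);   // now acquisition must succeed
}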
Good luck!
This reminds me of a similar question about testing a FIFO semaphore. In a nutshell, my answer was:
Even if you have a specification, maybe it doesn't convey your intention exactly.
You can prove that the algorithm fulfills the specification, but not the code (D. Knuth).
Tests reveal only the presence of bugs, not their absence (Dijkstra).
So your proposal seems like a reasonable best effort. If you want to increase your confidence, use fuzzing to randomize the scheduling and the input.
If the proof stuff doesn't work out for you, then go with testing. Be sure to test all the possible use cases: find out how exactly this thing will be used, who will be using it, and, again, how it will be used. When you go the testing route, be sure to run each test for each scenario a ton of times (millions, billions - as many as you can possibly fit into the testing time you have).
Try to be random, because randomness will give you the best chance of covering all scenarios in a limited number of tests. Be sure to use data that will be used, as well as data that may not normally be used but could be, and make sure the data doesn't mess up the locks.
BTW, unless you know a ton about mathematics and formal methods, you will have no chance of actually coming up with a proof.
Does MSVC automatically parallelize computation on a dual-core architecture?
void Func()
{
    Computation1();
    Computation2();
}
Given two computations with no relation to each other in a function like the one above, does the Visual Studio compiler automatically parallelize them and allocate them to different cores?
Don't quote me on it but I doubt it. The OpenMP pragmas are the closest thing to what you're trying to do here, but even then you have to tell the compiler to use OpenMP and delineate the tasks.
Barring linking to libraries which are inherently multi-threaded, if you want to use both cores you have to set up threads and divide the work you want done intelligently.
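For reference, a sketch of roughly what the OpenMP version of the question's Func could look like (built with /openmp in MSVC); note that you still delineate the two tasks yourself:

void Computation1();  // the question's functions, defined elsewhere
void Computation2();

void Func()
{
    #pragma omp parallel sections
    {
        #pragma omp section
        Computation1();   // may run on one core...

        #pragma omp section
        Computation2();   // ...while this runs on another
    }
}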
No. It is up to you to create threads (or fibers) and specify what code runs on each one. The function as defined will run sequentially. It may switch to another core during execution (thanks, Drew), but it will still be sequential. In order for two functions to run concurrently on two different cores, they must first be running in two separate threads.
As greyfade points out, the compiler is unable to detect whether it is possible. In fact, I suspect that this is in the class of NP-Complete problems. If I am wrong, I am sure one of the compiler gurus will let me know.
There's no reliable way for the compiler to detect that the two functions are completely independent and that they share no state. Therefore, there's no way for the compiler to know that it's safe to break them out into separate threads of execution. In fact, threads weren't even part of the C++ standard before C++11, and even there they aren't an implicit feature - you must use them explicitly to benefit from them.
If you want your two functions to run in independent threads, then create independent threads for them to execute in. Check out boost::thread (which is also available in the std::tr1 namespace if your compiler has it). It's easy to use and works perfectly for your use case.
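A minimal sketch of that, reusing the Computation1/Computation2 from the question:

#include <boost/thread.hpp>

void Computation1();  // the question's functions, defined elsewhere
void Computation2();

void Func()
{
    boost::thread t1(Computation1);  // launch both computations concurrently
    boost::thread t2(Computation2);
    t1.join();                       // wait for both to finish
    t2.join();
}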
No. Madness would ensue if compilers did such a thing behind your back; what if Computation2 depended on side effects of Computation1?
If you're using VC10, look into the Concurrency Runtime (ConcRT or "concert") and its partner, the Parallel Patterns Library (PPL).
Similar solutions include OpenMP (kind of old and busted IMO, but widely supported) and Intel's Threading Building Blocks (TBB).
The compiler can't tell if it's a good idea.
First, of course, the compiler must be able to prove that it would be a safe optimization: that the functions can safely be executed in parallel. In general, that's an NP-complete problem, but in many simple cases the compiler can figure it out (it already does a lot of dependency analysis).
Some bigger problems are:
it might turn out to be slower. Creating threads is a fairly expensive operation. The cost of that may just outweigh the gain from parallelizing the code.
it has to work well regardless of the number of CPU cores. The compiler doesn't know how many cores will be available when you run the program, so it'd have to insert some kind of optional forking code: if a core is available, follow this code path and branch out into a separate thread; otherwise follow this other code path. And again, more code and more conditionals have an effect on performance. Will the result still be worth it? Perhaps, but how is the compiler supposed to know that?
it might not be what the programmer expects. What if I already create precisely two CPU-heavy threads on a dual-core system? I expect them both to be running 99% of the time. Suddenly the compiler decides to create more threads under the hood, and suddenly I have three CPU-heavy threads, meaning that mine get less execution time than I'd expected.
How many times should it do this? If you run the code in a loop, should it spawn a new thread in every iteration? Sooner or later the added memory usage starts to hurt.
Overall, it's just not worth it. There are too many cases where it might backfire. Added to the fact that the compiler could only safely apply the optimization in fairly simple cases in the first place, it's just not worth the bother.
I do some C++ programming related to mapping software and mathematical modeling.
Some programs take anywhere from one to five hours to run and output a result; however, they only consume 50% of my Core Duo. I tried the code on another dual-processor machine, with the same result.
Is there a way to force a program to use all available processor resources and memory?
Note: I'm using Ubuntu and g++.
A thread can only run on one core at a time. If you want to use both cores, you need to find a way to do half the work in another thread.
Whether this is possible, and if so how to divide the work between threads, is completely dependent on the specific work you're doing.
To actually create a new thread, see the Boost.Thread docs, or the pthreads docs, or the Win32 API docs.
[Edit: other people have suggested using libraries to handle the threads for you. The reason I didn't mention these is because I have no experience of them, not because I don't think they're a good idea. They probably are, but it all depends on your algorithm and your platform. Threads are almost universal, but beware that multithreaded programming is often difficult: you create a lot of problems for yourself.]
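If the work happens to be one big loop over independent items, the simplest split is to give each thread half of the index range. A sketch, with process standing in for the real per-item work:

#include <thread>

void process(int i) { /* stand-in for the real per-item work */ }

void run_range(int begin, int end) {
    for (int i = begin; i < end; ++i)
        process(i);
}

int main() {
    const int n = 1000000;
    std::thread half(run_range, 0, n / 2);  // first half on a second thread
    run_range(n / 2, n);                    // second half on this thread
    half.join();                            // wait for the other half
}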
The quickest method would be to read up on OpenMP and use it to parallelise your program.
Compile with g++ -fopenmp, provided that your g++ version is >= 4.
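What that usually amounts to is a single pragma on a loop whose iterations are independent of each other. A minimal sketch (process is a hypothetical stand-in for one unit of the real work):

void process(int i) { /* stand-in for one independent iteration's work */ }

int main() {
    const int n = 1000000;
    #pragma omp parallel for   // g++ -fopenmp splits iterations across cores
    for (int i = 0; i < n; ++i)
        process(i);
}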
You need to have as many threads running as there are CPU cores available in order to be able to potentially use all the processor time. (You can still be pre-empted by other tasks, though.)
There are many ways to do this, and it depends completely on what you're processing. You may be able to use OpenMP or a library like TBB to do it almost transparently, however.
You're right that you'll need to use a threaded approach to use more than one core. Boost has a threading library, but that's not the whole problem: you also need to change your algorithm to work in a threaded environment.
There are some algorithms that simply cannot run in parallel -- for example, SHA-1 makes a number of "passes" over its data, but they cannot be threaded because each pass relies on the output of the one before it.
In order to parallelize your program, you'll need to be sure your algorithm can "divide and conquer" the problem into independent chunks, which it can then process in parallel before combining them into a full result.
Whatever you do, be very careful to verify the correctness of your answer. Save the single-threaded code, so you can compare its output to that of your multi-threaded code; threading is notoriously hard to do, and full of potential errors.
It may be more worth your time to avoid threading entirely, and try profiling your code instead: you may be able to get dramatic speed improvements by optimizing the most frequently-executed code, without getting near the challenges of threading.
To make full use of a multicore processor, you need to make the program multithreaded.
An alternative to multithreading is to use more than one process. You would still need to divide and conquer your problem into multiple independent chunks.
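A sketch of the multi-process route on POSIX (the asker is on Ubuntu); process_chunk is a hypothetical stand-in for one independent piece of the problem:

#include <sys/wait.h>
#include <unistd.h>

void process_chunk(int i) { /* hypothetical: work on chunk i of the problem */ }

int main() {
    const int kChunks = 2;                // e.g. one process per core
    for (int i = 0; i < kChunks; ++i) {
        if (fork() == 0) {                // child process
            process_chunk(i);
            _exit(0);                     // child exits; don't fall through
        }
    }
    for (int i = 0; i < kChunks; ++i)
        wait(nullptr);                    // parent reaps each child
}

Separate processes share nothing by default, which sidesteps locking entirely, at the cost of having to combine the results afterwards.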
By 50%, do you mean just one core?
If the application isn't either multi-process or multi-threaded, there's no way it can use both cores at once.
Add a while(1) { } somewhere in main()?
Or, to echo the real advice: either launch multiple processes or rewrite the code to use threads. I'd recommend running multiple processes, since that is easier, although it doesn't really help if you need to speed up a single run.
To get each thread to 100%, you will need to, in each thread:
- Eliminate all secondary storage I/O (disk reads/writes)
- Eliminate all display I/O (screen writes/prints)
- Eliminate all locking mechanisms (mutexes, semaphores)
- Eliminate all primary storage I/O (operate strictly out of registers and cache, not DRAM)
Good luck on your rewrite!