How to convert OpenMP sections to fork()

How to convert OpenMP sections to fork() - c++

I am new to pthreads and would like to ask how to express something like:
while(imhappy())
{
#pragma omp sections
{
#pragma omp section
{
dothis();
}
#pragma omp section
{
dothat();
}
}
}
in an equivalent construct using fork() or vfork() ?
Thanks ahead!
PS: I included the while around the sections in case it is somehow more clever to fork before the entering the loop due to some resource cloning.

OpenMP does lots of other things behind the scenes than merely spawning threads. It also distributes code segments and synchronises the different threads. You have tagged your question as pthreads though you are asking about implementation with fork() which is confusing. In Linux fork() is very heavyweight as it creates new processes and rather clone() is used to create threads.
Nevertheless, the rough equivalent of an OpenMP sections construct with two threads would be to fork, afterwards an if construct has to follow and the master process would execute the dothis() path while the child will execute the dothat() path. The return value from fork() is different in the parent and in the child processes and can be used to make the decision for the branch. The parent process will then wait for the child to finish with waitpid() which will be analogous to the implicit barrier synchronisation at the end of the omp sections region.
One caveat though - fork() is implemented using COW (copy-on-write) pages. What it means is that although at the beginning the memory content of the child is equal to the memory content of the parent, any changes made are private - the child will not see what the parent modifies in its own memory and vice versa. Memory has to be explicitly shared between the two using either SysV shared memory primitives or shared file mappings.
You might really want to look into using POSIX threads API instead.
vfork() is a syscall designed for completely different purpose and is not suitable for process cloning at all.

I do not think that there is a straight simulation of sections using the fork.
However, you can theoretically simulate it using a messgae passing mechanism where the shared variables are kept with a root machine. All the openMP flush will happen using the root. (Remember that openMP uses weaker than weak consistency model).

Related

Using unique locks while in DPC

Currently working on a light weight filter in the NDIS stack. I'm trying to inject a packet which set in a global variable as an NBL. During receive NBL, if an injected NBL is pending, than a lock is taken by the thread before picking the injected NBL up to process it. Originally I was looking at using a spin lock or FAST_MUTEX. But according to the documentation for FAST_MUTEX, any other threads attempting to take the lock will wait for the lock to release before continuing.
The problem is, that receive NBL is running in DPC mode. This would cause a DPC running thread to pause and wait for the lock to release. Additionally, I'd like to be able to assert ownership of a thread's ownership over a lock.
My question is, does windows kernel support unique mutex locks in the kernel, can these locks be taken in DPC mode and how expensive is assertion of ownership in the lock. I'm fairly new to C++ so forgive any syntax errors.
I attempted to define a mutex in the LWF object
// Header file
#pragma once
#include <mutex.h>
class LWFobject
{
public:
LWFobject()
std::mutex ExampleMutex;
std::unique_lock ExampleLock;
}
//=============================================
// CPP file
#include "LWFobject.h"
LWFobject::LWFObject()
{
ExmapleMutex = CreateMutex(
NULL,
FALSE,
NULL);
ExampleLock(ExampleMutex, std::defer_lock);
}
Is the use of unique_locks supported in the kernel? When I attempt to compile it, it throws hundreds of compilation errors when attempting to use mutex.h. I'd like to use try_lock and owns_lock.

You can't use standard ISO C++ synchronization mechanisms while inside a Windows kernel.
A Windows kernel is a whole other world in itself, and requires you to live by its rules (which are vast - see for example these two 700-page books: 1, 2).
Processing inside a Windows kernel is largely asynchronous and event-based; you handle events and schedule deferred calls or use other synchronization techniques for work that needs to be done later.
Having said that, it is possible to have a mutex in the traditional sense inside a Windows driver. It's called a Fast Mutex and requires raising IRQL to APC_LEVEL. Then you can use calls like ExAcquireFastMutex, ExTryToAcquireFastMutex and ExReleaseFastMutex to lock/try-lock/release it.

A fundamental property of a lock is which priority (IRQL) it's synchronized at. A lock can be acquired from lower priorities, but can never be acquired from a higher priority.
(Why? Imagine how the lock is implemented. The lock must raise the current task priority up to the lock's natural priority. If it didn't do this, then a task running at a low priority could grab the lock, get pre-empted by a higher priority task, which would then deadlock if it tried to acquire the same lock. So every lock has a documented natural IRQL, and the lock will first raise the current thread to that IRQL before attempting to acquire exclusivity.)
The NDIS datapath can run at any IRQL between PASSIVE_LEVEL and DISPATCH_LEVEL, inclusive. This means that anything on the datapath must only ever use locks that are synchronized at DISPATCH_LEVEL (or higher). This really limits your choices: you can use KSPIN_LOCKs, NDIS_RW_LOCKs, and a handful of other uncommon ones.
This gets viral: if you have one function that can sometimes run at DISPATCH_LEVEL (like the datapath), it forces the lock to be synchronized at DISPATCH_LEVEL, which forces any other functions that hold the lock to also run at DISPATCH_LEVEL. That can be inconvenient, for example you might want to hold the locks while reading from the registry too.
There are various approaches to design your driver:
* Use spinlocks everywhere. When reading from the registry, read into temporary variables, then grab a spinlock and copy the temporary variables into global state.
* Use mutexes (or better yet: pushlocks) everywhere. Quarantine the datapath into a component that runs at dispatch level, and carefully copy any configuration state into this component's private state.
* Somehow avoid having your datapath interact with the rest of your driver, so there's no shared state, and thus no shared locks.
* Have the datapath rush to PASSIVE_LEVEL by queuing all packets to a worker thread.

OpenMP: how to explicitly divide code into different threads

Let's say I have a Writer class that generates some data, and a Reader class that consumes it. I want them to run all the time under different threads. How can I do that with OpenMP?
This is want I would like to have:
class Reader
{
public:
void run();
};
class Writer
{
public:
void run();
};
int main()
{
Reader reader;
Writer writer;
reader.run(); // starts asynchronously
writer.run(); // starts asynchronously
wait_until_finished();
}
I guess the first answers will point to separate each operation into a section, but sections does not guarantee that code blocks will be given to different threads.
Can tasks do it? As far as I understood after reading about task is that each code block is executed just once, but the assigned thread can change.
Any other solution?
I would like to know this to know if a code I have inherited that uses pthreads, which explicitly creates several threads, could be written with OpenMP. The issue is that some threads were not smartly written and contain active waiting loops. In that situation, if two objects with active waiting are assigned to the same OpenMP thread (and hence are executed sequentially), they can reach a deadlock. At least, I think that could happen with sections, but I am not sure about tasks.

Serialisation could also happen with tasks. One horrible solution would be to reimplement sections on your own with guarantee that each section would run in a separate thread:
#pragma omp parallel num_threads(3)
{
switch (omp_get_thread_num())
{
case 0: wait_until_finished(); break;
case 1: reader.run(); break;
case 2: writer.run(); break;
}
}
This code assumes that you would like wait_until_finished() to execute in parallel with reader.run() and writer.run(). This is necessary since in OpenMP only the scope of the parallel construct is where the program executes in parallel and there is no way to put things in the background, so to say.

If you're rewriting the code anyway, you might be better moving to Threading Building Blocks (TBB; http://www.threadingbuildingblocks.org).
TBB has explicit support for pipeline style operation (or more complicated task graphs), while maintaing cache-locality and independence of the underlying number of threads.

Boost thread - Out of scope possibility

I was curious about the accuracy of the following code
for(int i=0 ; i<5 ; i++)
{
SomeClass* ptrinst = new SomeClass()
boost::thread t( boost::bind (&SomeClass::SomeMethod,ptrinst));
......
}
What would happen to the running thread when t runs out of scope ?

Since the main thread does not call t.join(), the main thread will continue to run its loop, spawning additional threads and then continue onwards. So the answer is, under your current coding, the child threads will not interact with your parent thread (at least not directly).
Also note that the thread class is a strange beast - the only thing that happens when you fall out of scope is, your main thread no longer has a handle to call t.join() on. The fact that it falls out of scope of the parent thread has zero impact on the child thread. Once you spawn your child thread by instantiating it, the child is, essentially, decoupled from the parent (well, the globals/dynamically allocated memory that were visible in the parent are also visible to the child, but you will need mutexes if you want to modify/mutate those globals). As I mentioned later in the post, you need to gain a solid understanding of memory visibility and ownership within a threading context. Just reading my comments here probably will not help you.
If you want the main thread to wait on the completion of the child threads, you need to store those threads in a std::vector<boost::thread> v; outside of your loop and then in a second loop, call join on all those instances.
Your current code looks a bit suspect as you are invoking an instance method through bind - that's fine, but I wouldn't normally expect that instance method to call delete this; which means it's up to the parent thread to clean up (the parent thread shouldn't clean up until the child threads are done). However, there is no way for it to clean up at the right time without some kind of thread synchronization. Hence, a memory leak or some kind of nasty race condition is almost assured (suppose you put a delete ptrinst; in the ... portion of your main thread in an attempt to clean-up. Without some kind of synchronization, you may delete the pointer before the child threads are done using it).
Also, you may want to use std::thread and std::bind in place of the boost versions.
One last note: I suspect you are still experimenting with the use of threads. If this is true, it may be a good idea to read up and experiment a lot more with simpler examples until you try to fix this code. Otherwise, you may be setting yourself for a world of hurt (debugging hell including race conditions, weird memory synchronization issues, etc...).
Try to build a more solid understanding of what happens with memory and threads: what memory is visible to what threads and what memory can and cannot be shared.

passing "this" to a thread c++

What is the best way of performing the following in C++. Whilst my current method works I'm not sure it's the best way to go:
1) I have a master class that has some function in it
2) I have a thread that takes some instructions on a socket and then runs one of the functions in the master class
3) There are a number of threads that access various functions in the master class
I create the master class and then create instances of the thread classes from the master. The constructor for the thread class gets passed the "this" pointer for the master. I can then run functions from the master class inside the threads - i.e. I get a command to do something which runs a function in the master class from the thread. I have mutex's etc to prevent race problems.
Am I going about this the wrong way - It kinda seems like the thread classes should inherit the master class or another approach would be to not have separate thread classes but just have them as functions of the master class but that gets ugly.

Sounds good to me. In my servers, it is called 'SCB' - ServerControlBlock - and provides access to services like the IOCPbuffer/socket pools, logger, UI access for status/error messages and anything else that needs to be common to all the handler threads. Works fine and I don't see it as a hack.
I create the SCB, (and ensure in the ctor that all services accessed through it are started and ready for use), before creating the thread pool that uses the SCB - no nasty singletonny stuff.
Rgds,
Martin

Separate thread classes is pretty normal, especially if they have specific functionality. I wouldn't inherit from the main thread.

Passing the this pointer to threads is not, in itself, bad. What you do with it can be.
The this pointer is just like any other POD-ish data type. It's just a chunk of bits. The stuff that is in this might be more than PODs however, and passing what is in effect a pointer to it's members can be dangerous for all the usual reasons. Any time you share anything across threads, it introduces potential race conditions and deadlocks. The elementary means to resolve those conflicts is, of course, to introduce synchronization in the form of mutexes, semaphores, etc, but this can have the suprising effect of serializing your application.
Say you have one thread reading data from a socket and storing it to a synchronized command buffer, and another thread which reads from that command buffer. Both threads use the same mutex, which protects the buffer. All is well, right?
Well, maybe not. Your threads could become serialized if you're not very careful with how you lock the buffer. Presumably you created separate threads for the buffer-insert and buffer-remove codes so that they could run in parallel. But if you lock the buffer with each insert & each remove, then only one of those operations can be executing at a time. As long as your writing to the buffer, you can't read from it and vice versa.
You can try to fine-tune the locks so that they are as brief as possible, but so long as you have shared, synchronized data, you will have some degree of serialization.
Another approach is to hand data off to another thread explicitly, and remove as much data sharing as possible. Instead of writing to and reading from a buffer as in the above, for example, your socket code might create some kind of Command object on the heap (eg Command* cmd = new Command(...);) and pass that off to the other thread. (One way to do this in Windows is via the QueueUserAPC mechanism).
There are pros & cons to both approaches. The synchronization method has the benefit of being somewhat simpler to understand and implement at the surface, but the potential drawback of being much more difficult to debug if you mess something up. The hand-off method can make many of the problems inherent with synchronization impossible (thereby actually making it simpler), but it takes time to allocate memory on the heap.

How to detect circular calls?

I've been looking for causes for deadlocks and strategies/tools to avoid and detect them.
Another potential cause for deadlocks is to have blocking functions calling other blocking functions in a circular way, so that eventually a call never returns.
Sometimes this is hard to discover, specially in very large projects.
So, are there any tools/libraries/techiques that allow to automate the detection of circular calls in a program?
EDIT:
I code mostly in C and C++ so, if possible, give any information about the topic that is applicable to those languages.
Nevertheless, it seems this topic is scarcely covered in SO, so answers for other languages are ok too. although maybe those deserve a topic of its own if someone finds it relevant
Thanks.

Circular (or recursive) calls that try to acquire the same non-reentrant lock are one of the easiest to debug blocking scenarios: locking is deterministic, and can be easily checked. When the application locks, fire up the debugger and look at the stack trace to understand what locks are held and why.
As to general solutions for the problem of locking... you can look into some libraries that provide mutex ordering, and detect when you are trying to lock on a mutex out of order. This type of solutions might be complex to implement correctly, but once in place it ensures that you cannot enter a deadlock condition, as it forces all processes to obtain the locks in the same order (i.e. if process A holds lock La, and it tries to acquire lock Lb for which the ordering is correct, then it can either succeed or lock, but whichever process is holding lock Lb cannot try to lock La as the ordering constraint would not be met).

If you are on Linux there 2 Valgrind tools for detecting deadlocks and race conditions: Helgrind, DRD. They both complement each other and it's worth to check for thread errors by both of them.

In linux you can use valgrind to detect deadlocks, use --tool=helgrind.

Best way to detect deadlocks (IMO) is to make a test program that calls all the functions in a random order in like 30 different threads 10000s of times.
If you get a deadlock you can use VS2010 "Parallel Stacks" window. Debug->Windows->Parallel Stacks
This window will show you all the stacks, so you can find the methods that are deadlocking.
A simple strategy I use to write thread-safe objects:
A thread safe object should be safe when its public methods are called, so you don't get deadlocks when it is used.
So, the idea is to just lock all the public methods that access the object's data.
Besides that you need to insure that within the class' code you never call a public method. If you need to use one of the public methods, then make that method private, and wrap the private method with a public method that locks and then calls it.
If you want better lock granularity you could just create objects for each part that has its own lock, and lock it like I suggested. Then use encapsulation to combine those classes to the one class.
Example:
class Blah {
MyData data;
Lock lock;
public:
DataItem GetData(int index)
{
ReadLock read(lock);
return LocalGetData(index);
}
DataItem FindData(string key)
{
ReadLock read(lock);
DataItem item;
//find the item, can use LocalGetData() to get the item without deadlocking
return item;
}
void PutData(DataItem item)
{
ReadLock write(lock);
//put item in database
}
private:
DataItem LocalGetData(int index)
{
return data[index];
}
}

You could find a tool that builds a call graph, and check the graph for cycles.
Otherwise, there are a number of strategies for detecting deadlocks or other circularities, but they all depend on having some sort of supporting infrastructure in place.
There are deadlock avoidance strategies, having to do with assigning lock priorities and ordering the locks according to priority. These require code changes and enforcing the standards, though.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js