Why does OCaml have a Mutex module?

As far as I know, OCaml offers concurrency but not parallelism (see Why OCaml's threading is considered as `not enough`?).
Then why does OCaml still offer a Mutex module and provide locks?
If no two threads can run simultaneously, why do we still need locks?

In general, code that modifies data shared between threads contains critical regions during which that data is temporarily in an inconsistent state. This is precisely the same problem as with simultaneously executing processes. As @nlucaroni points out, a context switch in the middle of a critical region must not allow another thread into the same critical region. For example:
(* f should count the number of times it's called *)
let f =
  let x = ref 0 in
  fun () ->
    x := !x + 1;
    !x
A context switch after the lookup of x but before the store can clearly result in a miscount. This is fixed with a mutex.
(* f should count the number of times it's called *)
let f =
  let x = ref 0 in
  let m = Mutex.create () in
  fun () ->
    Mutex.lock m;
    x := !x + 1;
    let ret = !x in
    Mutex.unlock m;
    ret

Because a mutex is a concurrency primitive; it is not specific to parallelism. It makes the execution of a piece of code atomic from the point of view of other concurrent entities, and it provides exclusive access to a specific portion of data while that code runs. For example, it ensures that only a single thread of execution modifies the data at a time: that thread may break invariants in the process, but it restores them before releasing the mutex, so other threads see consistent data when they get access to it.


How does atomic synchronization work on a single thread when it gets migrated to another core

I am asking this question with pseudocode, and targeting both Rust and C++, since their memory model concepts are the same.
SomeFunc() {
    x = counter.load(Ordering::Relaxed)      // #1
    counter.store(x + 1, Ordering::Relaxed)  // #2
    y = counter.load(Ordering::Relaxed)      // #3
}
Question: imagine SomeFunc is being executed by a thread, and between #2 and #3 the thread gets interrupted so that #3 executes on a different core. In this case, does the counter variable get synchronized with the last value written on core 1 when the thread resumes on core 2 (there is no explicit release/acquire)? I suppose the entire cache line plus thread-local storage gets saved and reloaded when the thread briefly goes to sleep and comes back running on a different core?
First of all, it should be noted that atomic instructions add synchronization, and do not remove it.
Would you expect:
unsigned func(unsigned* counter) {
    auto x = *counter;
    *counter = x + 1;
    auto y = *counter;
    return y;
}
to return anything other than the original value of *counter + 1?
Yet, similarly, the thread could be moved between cores in-between two statements!
The above code executes correctly even when the thread is moved to another core, because during the switch the OS takes care to synchronize appropriately between cores so that user-space program order is preserved.
So, what happens when using atomics on a single thread?
Well, you add a bit of processing overhead -- more synchronization -- and the OS still takes care during the switch to appropriately synchronize.
Hence the effect is strictly the same.
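To make the comparison concrete, here is a minimal C++ sketch of steps #1 to #3 using relaxed atomics. The function name bump_relaxed is made up for illustration, and the sketch assumes no other thread writes to the counter in the meantime.

```cpp
#include <atomic>

// Hypothetical helper mirroring steps #1-#3 of the question.
// Assumption: no other thread modifies `counter` while this runs.
unsigned bump_relaxed(std::atomic<unsigned>& counter) {
    unsigned x = counter.load(std::memory_order_relaxed);  // #1
    counter.store(x + 1, std::memory_order_relaxed);       // #2
    unsigned y = counter.load(std::memory_order_relaxed);  // #3
    // Within one thread, a load always sees that thread's own earlier store,
    // whichever core the thread happens to be running on.
    return y;
}
```

Called on a counter holding 41, this should return 42, exactly as the plain-pointer version would.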

When is a Haskell thread joined?

As a former C++ programmer, I find the behaviour of Haskell threads confusing. Refer to the following Haskell code snippet:
import Control.Concurrent
import Control.Concurrent.MVar
import Data.Functor.Compose
import System.Random

randomTill0 :: MVar () -> IO () -- Roll a die until 0 comes up
randomTill0 mV = do
    x <- randomRIO (0,65535) :: IO Int
    if 0 == x
        then putMVar mV ()
        else randomTill0 mV

main :: IO ()
main = do
    n <- getNumCapabilities
    mV <- newEmptyMVar
    sequence (replicate n (forkIO (randomTill0 mV)))
    readMVar mV
    putStrLn "Execution complete."
As far as I know, Haskell's forkIO is roughly equivalent to C++'s std::async. In C++, I store a std::future which is returned by std::async, then std::future::wait for it, then the thread will be std::thread::joined.
(Regarding the tiny delay before the message, I don't think any laziness is involved here.)
My question is:
In the above Haskell code snippet, when are the threads resulting from forkIOs joined? Is it upon readMVar mV, or the end of main?
Is there a Haskell equivalent of std::thread::detach?
As far as I understand, threads are never joined, and the program ends when main ends. This is special to main; in general, threads are not associated with each other in any kind of hierarchy. The concept of joining a thread does not exist: a thread runs until its action is finished or until it is explicitly killed with killThread, and then it evaporates (thanks, garbage collector). If you want to wait for a thread to complete you have to do it yourself, probably with an MVar.
It follows that there is no analogue of detach: all threads are automatically detached.
Another thing worth mentioning is that there is not a 1:1 correspondence between OS threads and Haskell threads. The Haskell runtime system has its own scheduler that can run multiple Haskell threads on a single OS thread; and in general a Haskell thread will bounce around between different OS threads over the course of its lifetime. There is a concept of bound threads which are tied to an OS thread, but really the only reason to use that is if you are interfacing with code in other languages that distinguishes between OS threads.
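For readers coming from C++, a rough analogue of "every thread is detached, and you wait by hand" might look like the sketch below. It is only an analogy, not how the Haskell runtime works; the function name run_detached_and_wait and the value 7 are made up, and std::promise stands in for the MVar.

```cpp
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper: start a detached worker and wait for it manually.
int run_detached_and_wait() {
    std::promise<int> done;                       // plays the role of the MVar
    std::future<int> result = done.get_future();

    std::thread worker([p = std::move(done)]() mutable {
        p.set_value(7);                           // like putMVar: signal completion
    });
    worker.detach();                              // nobody will ever join this thread

    return result.get();                          // like readMVar: block until signaled
}
```

The caller never joins the worker; it only blocks on the value the worker publishes, which is roughly what readMVar mV does in the question.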

C++ : std::atomic<bool> and volatile bool

I'm just reading the C++ concurrency in action book by Anthony Williams.
There is this classic example with two threads, one producing data and the other consuming it, and A.W. wrote the code pretty clearly:
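std::vector<int> data;
std::atomic<bool> data_ready(false);
void reader_thread()
{
    while (!data_ready) {} // spin until the writer sets the flag
    std::cout << "The answer=" << data[0] << "\n";
}
void writer_thread()
{
    data.push_back(42); // fill in the data first
    data_ready = true;  // then publish it
}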
And I really don't understand why this code differs from one where I'd use a classic volatile bool instead of the atomic one.
If someone could open my mind on the subject, I'd be grateful.
A "classic" bool, as you put it, would not work reliably (if at all). One reason for this is that the compiler could (and most likely does, at least with optimizations enabled) load data_ready only once from memory, because there is no indication that it ever changes in the context of reader_thread.
You could work around this problem by using volatile bool to enforce loading it every time (which would probably seem to work), but this would still be undefined behavior according to the C++ standard because the access to the variable is neither synchronized nor atomic.
You could enforce synchronization using the locking facilities from the mutex header, but this would introduce (in your example) unnecessary overhead (hence std::atomic).
The problem with volatile is that it only guarantees that accesses to the variable are not omitted and that their order relative to other volatile accesses is preserved. volatile does not guarantee a memory barrier to enforce cache coherence. What this means is that writer_thread on processor A can write the value to its cache (and maybe even to main memory) without reader_thread on processor B seeing it, because the cache of processor B is not consistent with the cache of processor A. For a more thorough explanation see memory barrier and cache coherence on Wikipedia.
There can be additional problems with more complex expressions than x = y (e.g. x += y) that would require synchronization through a lock (or in this simple case an atomic +=) to ensure the value of x does not change during processing.
x += y for example is actually:
read x
compute x + y
write result back to x
If a context switch to another thread occurs during the computation this can result in something like this (2 threads, both doing x += 2; assuming x = 0):
Thread A                    Thread B
------------------------    ------------------------
read x (0)
compute x (0) + 2
                <context switch>
                            read x (0)
                            compute x (0) + 2
                            write x (2)
                <context switch>
write x (2)
Now x = 2 even though there were two += 2 computations. This effect is known as a lost update.
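As a sketch of the atomic += alternative mentioned above, the following uses std::atomic<int> so that each read-modify-write happens as one indivisible step. The function name add_twice_concurrently and the rounds parameter are made up for illustration.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: two threads each add 2 to a shared counter `rounds` times.
int add_twice_concurrently(int rounds) {
    std::atomic<int> x{0};
    auto work = [&x, rounds] {
        for (int i = 0; i < rounds; ++i) {
            x += 2;  // atomic read-modify-write: no update can be lost
        }
    };
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    return x.load();  // 4 * rounds, regardless of how the threads interleave
}
```

With a plain int in place of std::atomic<int>, the same function would contain a data race and could return less than 4 * rounds.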
The big difference is that this code is correct, while the version with bool instead of atomic<bool> has undefined behavior.
These two lines of code create a race condition (formally, a conflict) because they read from and write to the same variable. In the reader:
while (!data_ready)
And in the writer:
data_ready = true;
And a race condition on a normal variable causes undefined behavior, according to the C++11 memory model.
The rules are found in section 1.10 of the Standard, the most relevant being:
Two actions are potentially concurrent if
they are performed by different threads, or
they are unsequenced, and at least one is performed by a signal handler.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
You can see that whether the variable is atomic<bool> makes a very big difference to this rule.
Ben Voigt's answer is completely correct, still a little theoretical, and as I've been asked by a colleague "what does this mean for me", I decided to try my luck with a little more practical answer.
With your sample, the "simplest" optimization problem that could occur is the following:
According to the Standard, an optimized execution order may not change the functionality of a program. Problem is, this is only true for single threaded programs, or single threads in multithreaded programs.
So, for writer_thread and a (volatile) bool,
data.push_back(42);
data_ready = true;
and
data_ready = true;
data.push_back(42);
are equivalent.
The result is that
std::cout << "The answer=" << data[0] << "\n";
can be executed without any value having been pushed into data.
An atomic bool does prevent this kind of optimization because, with the default (sequentially consistent) ordering, other memory operations may not be reordered across it. There are flags for atomic operations which allow statements to be moved in front of the operation but not behind it, and vice versa, but those require really advanced knowledge of your program's structure and the problems it can cause...
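The ordering flags mentioned above can be sketched as follows, using a release store in the writer and an acquire load in the reader. This is only one possible arrangement; the wrapper function run_handoff and the local variable seen are made up so that the example returns what the reader observed.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: run one writer and one reader, return what the reader saw.
int run_handoff() {
    std::vector<int> data;
    std::atomic<bool> data_ready{false};
    int seen = 0;

    std::thread reader([&] {
        while (!data_ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();  // wait until the flag is published
        }
        seen = data[0];  // safe: ordered after the writer's push_back
    });
    std::thread writer([&] {
        data.push_back(42);
        // Release: the push_back above may not be moved after this store.
        data_ready.store(true, std::memory_order_release);
    });

    reader.join();
    writer.join();
    return seen;
}
```

The release store keeps the push_back before the flag, and the acquire load keeps the read of data[0] after it, so the reader can only ever see 42.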

C++ memory model - does this example contain a data race?

I was reading Bjarne Stroustrup's C++11 FAQ and I'm having trouble understanding an example in the memory model section.
He gives the following code snippet:
// start with x==0 and y==0
if (x) y = 1; // thread 1
if (y) x = 1; // thread 2
The FAQ says there is not a data race here. I don't understand. The memory location x is read by thread 1 and written to by thread 2 without any synchronization (and the same goes for y). That's two accesses, one of which is a write. Isn't that the definition of a data race?
Further, it says that "every current C++ compiler (that I know of) gives the one right answer." What is this one right answer? Couldn't the answer vary depending on whether one thread's comparison happens before or after the other thread's write (or if the other thread's write is even visible to the reading thread)?
// start with x==0 and y==0
if (x) y = 1; // thread 1
if (y) x = 1; // thread 2
Since neither x nor y is true, the other won't be set to true either. No matter the order in which the instructions are executed, the (correct) result is always that x remains 0 and y remains 0.
The memory location x is ... written to by thread 2
Is it really? Why do you say so?
If y is 0 then x is not written to by thread 2. And y starts out 0. Similarly, x cannot be non-zero unless somehow y is non-zero "before" thread 1 runs, and that cannot happen. The general point here is that conditional writes that don't execute don't cause a data race.
This is a non-trivial fact of the memory model, though, because a compiler that is not aware of threading would be permitted (assuming y is not volatile) to transform the code if (x) y = 1; to int tmp = y; y = 1; if (!x) y = tmp;. Then there would be a data race. I can't imagine why it would want to do that exact transformation, but that doesn't matter, the point is that optimizers for non-threaded environments can do things that would violate the threaded memory model. So when Stroustrup says that every compiler he knows of gives the right answer (right under C++11's threading model, that is), that's a non-trivial statement about the readiness of those compilers for C++11 threading.
A more realistic transformation of if (x) y = 1 would be y = x ? 1 : y;. I believe that this would cause a data race in your example, and that there is no special treatment in the standard for the assignment y = y that makes it safe to execute unsequenced with respect to a read of y in another thread. You might find it hard to imagine hardware on which it doesn't work, and anyway I may be wrong, which is why I used a different example above that's less realistic but has a blatant data race.
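The transformations discussed above can be written out side by side. This is a sketch with made-up function names; y is passed by reference to stand in for the shared variable. In a single thread all three leave y with the same final value, which is why an optimizer unaware of threads might treat them as interchangeable, but the last two store to y even when x is 0.

```cpp
// The statement as written in the source: stores to y only when x is non-zero.
void original(int x, int& y) {
    if (x) y = 1;
}

// Hypothetical rewrite: store first, then undo. Always stores to y.
void speculative(int x, int& y) {
    int tmp = y;
    y = 1;
    if (!x) y = tmp;
}

// Hypothetical rewrite: unconditional store of a selected value.
void select_store(int x, int& y) {
    y = x ? 1 : y;
}
```

It is those extra stores to y, invisible in a single-threaded run, that would conflict with the other thread's read of y.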
There has to be a total ordering of the writes, because of the fact that no thread can write to the variable x or y until some other thread has first written a 1 to either variable. In other words you have basically three different scenarios:
thread 1 gets to write to y because x was written to at some point before the if statement, and then if thread 2 comes later, it writes the same value of 1 to x and doesn't change its previous value of 1.
thread 2 gets to write to x because y was changed at some point before the if statement, and then thread 1, if it comes later, will write the same value of 1 to y.
If there are only two threads, then the if statements are jumped over because x and y remain 0.
Neither of the writes occurs, so there is no race. Both x and y remain zero.
(This is talking about the problem of phantom writes. Suppose one thread speculatively did the write before checking the condition, then attempted to correct things after. That would break the other thread, so it isn't allowed.)
A memory model sets the supported sizes of the code and data areas. Before compiling and linking the source code, we need to specify the memory model so that the size limits of the data and code can be set.

How to make something lwt supported?

I am trying to understand the term lwt supported.
So assume I have a piece of code which connects to a database and writes some data: Db.write conn data. It has nothing to do with lwt yet, and each write costs 10 sec.
Now, I would like to use lwt. Can I directly code like below?
let write_all data_list = Lwt_list.iter (Db.write conn) data_list
let _ = Lwt_main.run(write_all my_data_list)
Suppose there are 5 data items in my_data_list; will all 5 items be written into the database sequentially or in parallel?
Also, in the Lwt manual or http://ocsigen.org/tutorial/application, they say
Using Lwt is very easy and does not cause troubles, provided you never
use blocking functions (non cooperative functions). Blocking functions
can cause the entire server to hang!
I don't quite understand how to avoid using blocking functions. For each of my own functions, can I just use Lwt.return to make it lwt supported?
Yes, your code is correct. The principle of lwt supported is that everything that can potentially take time in your code should return an Lwt value.
About Lwt_list.iter, you can choose whether you want the treatment to be parallel or sequential, by choosing between iter_p and iter_s :
In iter_s f l, iter_s will call f on each element
of l, waiting for completion between each element. On the
contrary, in iter_p f l, iter_p will call f on all
elements of l, then wait for all the threads to terminate.
About the non-blocking functions, the principle of the Light Weight Threads is that they keep running until they reach a "cooperation point", i.e. a point where the thread can be safely interrupted or has nothing to do, like in a sleep.
But you have to declare you enter a "cooperation point" before actually doing the sleep. This is why the whole Unix library has been wrapped, so that when you want to do an operation that takes time (e.g. a write), a cooperation point is automatically reached.
For your own functions, if you use I/O operations from Unix, you should use the Lwt version instead (Lwt_unix.sleep instead of Unix.sleep).