How to use Hazelcast's CPSubsystem with fewer than 3 nodes?

I see that Hazelcast 3.12 has introduced the CPSubsystem() for systems with 3-7 nodes. I understand the reasoning. However, if I am trying to design a solution that can run with anywhere between 1-n nodes, do I need to use different logic to validate if the CPSubsystem is enabled? How do I even check that?
I would have thought/hoped that simply calling hazelcastInstance.getCPSubsystem().getLock(...) would work no matter the number of nodes, but if there are fewer than 3 nodes, it throws an exception. And I can't find any method that allows me to check whether the CPSubsystem is enabled or not.
My current implementation uses the deprecated method getLock() to get a distributed lock:
LOG.debug("Creating a distributed lock on username for a maximum of 5 minutes {}", username);
ILock usernameLock = hazelcastInstance.getLock(this.getClass().getName() + ":" + username);
try {
    if (usernameLock.tryLock(5, TimeUnit.MINUTES)) {
        // ...
    }
} catch (InterruptedException e) {
    LOG.warn("Exception locking on : {} ", username, e);
    LOG.warn("Invoking clearUserData without synchronization : {}", username);
} finally {
    // ...
}
How can I get a lock with Hazelcast without knowing this? The hazelcastInstance.getLock() is marked as deprecated and targeted for removal in HC4.

As you already know, CPSubsystem is a CP system in terms of the CAP theorem. It has to be enabled explicitly, because it has some limitations and prerequisites. One of them is that at least 3 Hazelcast members must exist in the cluster. Actually, 2 members would be sufficient, but Hazelcast's CPSubsystem refuses to work with 2 members because the majority of 2 members is again 2, so it would become unavailable as soon as one of the members crashes.
HazelcastInstance.getLock() uses async replication of Hazelcast and cannot provide CP guarantees under failures. This is fine for some systems/applications but not for all. That's why choosing between a best-effort locking mechanism and a CP-based locking mechanism should be explicit, and applications relying on the lock should be designed depending on this choice. See Daniel Abadi's post "The dangers of conditional consistency guarantees" on this choice. That's why CPSubsystem().getLock() does not fall back to a best-effort/unsafe locking mechanism when the cluster size is below 3.
HazelcastInstance.getLock() is deprecated in 3.12 and will be removed in 4.0. But Hazelcast will provide an unsafe (development) mode for CP data structures, which will work with any number of members and will be based on async replication similar to Hazelcast AP data structures.

HazelcastInstance hz1 = Hazelcast.newHazelcastInstance(config);
HazelcastInstance hz2 = Hazelcast.newHazelcastInstance(config);
HazelcastInstance hz3 = Hazelcast.newHazelcastInstance(config);
You can start 3 members in the same JVM/node/instance. You should be able to run CPSubsystem without three physical nodes, instances, or JVMs.


Synchronization with "versioning" in c++

Please consider the following synchronization problem:
version = 0 // atomic variable
data = 0 // normal variable (there could be many)
Thread A:
    version++
    data = 3
Thread B:
    d = data
    v = version
    assert(d != 3 || v == 1)
Basically, if thread B sees data = 3 then it must also see version++.
What's the weakest memory order and synchronization we must impose so that the assertion in thread B is always satisfied?
If I understand C++ memory_order correctly, the release-acquire ordering won't do because that guarantees that operations BEFORE version++, in thread A, will be seen by the operations AFTER v = version, in thread B.
Acquire and release fences also work in the same directions, but are more general.
As I said, I need the other direction: B sees data = 3 implies B sees version = 1.
I'm using this "versioning approach" to avoid locks as much as possible in a data structure I'm designing. When I see something has changed, I step back, read the new version and try again.
I'm trying to keep my code as portable as possible, but I'm targeting x86-64 CPUs.
You might be looking for a SeqLock, as long as your data doesn't include pointers. (If it does, then you might need something more like RCU to protect readers that might load a pointer, stall / sleep for a while, then deref that pointer much later.)
You can use the SeqLock sequence counter as the version number. (version = tmp_counter >> 1 since you need two increments per write of the payload to let readers detect tearing when reading the non-atomic data. And to make sure they see the data that goes with this sequence number. Make sure you don't read the atomic counter a 3rd time; use the local tmp that you read it into to verify match before/after copying data.)
Readers will have to retry if they happen to attempt a read while data is being modified. But data is non-atomic, so "thread B sees data = 3" can never be part of what creates synchronization; it can only be something you observe as a result of synchronizing with the writer through the version number.
Implementing 64 bit atomic counter with 32 bit atomics - my attempt at a SeqLock in C++, with lots of comments. It's a bit of a hack because ISO C++'s data-race UB rules are overly strict; a SeqLock relies on detecting possible tearing and not using torn data, rather than avoiding concurrent access entirely. That's fine on a machine without hardware race detection so that doesn't fault (like all real CPUs), but C++ still calls that UB, even with volatile (although that puts it more into implementation-defined territory). In practice it's fine.
GCC reordering up across load with `memory_order_seq_cst`. Is this allowed? - A GCC bug fixed in 8.1 that could break a seqlock implementation.
If you have multiple writers, you can use the sequence-counter itself as a spinlock for mutual exclusion between writers. e.g. using an atomic_fetch_or or CAS to attempt to set the low bit to make the counter odd. (tmp = seq.fetch_or(1, std::memory_order_acq_rel);, hopefully compiling to x86 lock bts). If it previously didn't have the low bit set, this writer won the race, but if it did then you have to try again.
But with only a single writer, you don't need to RMW the atomic sequence counter, just store new values (ordered with writes to the payload), so you can either keep a local copy of it, or just do a relaxed load of it, and store tmp+1 and tmp+2.
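A minimal single-writer sketch of the idea above, under some assumptions: the names SeqLock and Payload are made up for illustration, and the payload fields are stored as relaxed atomics rather than plain variables, which sidesteps the data-race UB mentioned earlier at the cost of differing slightly from a "plain data" seqlock. The version reported to the reader is the sequence counter shifted right by one.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical payload: two plain ints, no pointers.
struct Payload {
    int a;
    int b;
};

class SeqLock {
public:
    // Single writer only: make the counter odd, write, make it even again.
    void write(const Payload& p) {
        const std::uint32_t s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);         // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);  // odd store stays before the payload stores
        a_.store(p.a, std::memory_order_relaxed);
        b_.store(p.b, std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);         // even: payload published
    }

    // Retries until it gets a copy that was not torn by a concurrent write.
    // On return, version holds the number of completed writes seen.
    Payload read(std::uint32_t& version) const {
        for (;;) {
            const std::uint32_t s0 = seq_.load(std::memory_order_acquire);
            if (s0 & 1u) continue;                             // writer is mid-update
            Payload p;
            p.a = a_.load(std::memory_order_relaxed);
            p.b = b_.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            const std::uint32_t s1 = seq_.load(std::memory_order_relaxed);
            if (s0 == s1) {                                    // compare against the local copy
                version = s0 >> 1;
                return p;
            }
        }
    }

private:
    std::atomic<std::uint32_t> seq_{0};
    std::atomic<int> a_{0};
    std::atomic<int> b_{0};
};
```

With this shape, a reader that returns a payload containing the new value also returns a version of at least 1, which is the direction the question asks for. Multiple writers would additionally need the fetch_or/CAS step described above.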

Can I make OpenMP to revert to ideal # of threads after using omp_set_num_threads?

Is there a way to make OpenMP revert the number of threads (for the next time it's used) back to the default after the application has already called omp_set_num_threads() with a specific number?
For example, is there a special code (e.g. 0 or -1) I supply to omp_set_num_threads?
Or should I just try doing something like omp_set_num_threads(omp_get_max_threads())?
I am making the assumption that the default number is whatever the implementation of OpenMP deems as "optimal". But I don't know what, if anything, the default is guaranteed to be or even what it should be. All I know is that I have an application that calls omp_set_num_threads(4) for one specific OpenMP block which I must not edit (for now). But I'd like to prevent that one setting from affecting other OpenMP blocks in my code.
I've had this problem before. (Disclaimer: I work with MSVC, which currently only implements the OpenMP 2.0 standard). To the best of my knowledge, there is nothing in the OpenMP 2.0 standard that allows you to find out this default value. omp_get_max_threads() is not required to return it (all subsequent emphasis mine):
The omp_get_max_threads function returns an integer that is guaranteed to be at least as large as the number of threads that would be used to form a team if a parallel region without a num_threads clause were to be encountered at that point in the code.
In other words, it might return a number that is larger than the currently set (or default) value.
There is no special value for omp_set_num_threads either:
The omp_set_num_threads function sets the default number of threads to use for subsequent parallel regions that do not specify a num_threads clause. [...] The value of the parameter num_threads must be a positive integer.
And if you get it wrong, it's up to the implementation what will happen:
If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads requested for the parallel region exceeds the number that the run-time system can supply, the behavior of the program is implementation-defined. An implementation may, for example, interrupt the execution of the program, or it may serialize the parallel region.
You might find more precise (and less unsettling) information in the documentation of your OpenMP implementation. However, in the case of MSVC, that documentation is just a verbatim copy of the OpenMP 2.0 standard...
Since you are in the business of modifying the number of threads this way, I would like to preemptively caution about the interaction of omp_set_dynamic with omp_get_num_threads within MSVC:
Why does omp_set_dynamic(1) never adjust the number of threads (in Visual C++)?
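One workaround, sketched here under assumptions rather than as something the standard promises: record omp_get_max_threads() before the block you can't edit runs, and put that value back afterwards. As quoted above, OpenMP 2.0 only guarantees "at least as large", so on such an implementation the saved value is an upper bound and not necessarily the exact default; as far as I know, later versions of the specification tie the return value to the same internal setting that omp_set_num_threads() changes. ThreadCountRestorer and legacy_block are hypothetical names, and the stub functions exist only so the sketch builds without OpenMP enabled.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#else
// Stand-in stubs so the sketch still builds when OpenMP is not enabled.
static int stub_thread_count = 1;
static int omp_get_max_threads() { return stub_thread_count; }
static void omp_set_num_threads(int n) { stub_thread_count = n; }
#endif

// Hypothetical helper: remembers the thread count in effect at construction
// and restores it when it goes out of scope.
class ThreadCountRestorer {
public:
    ThreadCountRestorer() : saved_(omp_get_max_threads()) {
        assert(saved_ > 0);  // omp_set_num_threads requires a positive integer
    }
    ~ThreadCountRestorer() { omp_set_num_threads(saved_); }
    ThreadCountRestorer(const ThreadCountRestorer&) = delete;
    ThreadCountRestorer& operator=(const ThreadCountRestorer&) = delete;

private:
    int saved_;
};

// Stand-in for the block that must not be edited.
static void legacy_block() {
    omp_set_num_threads(4);
    // the untouchable parallel region would run here
}
```

Wrapping the call as { ThreadCountRestorer keep; legacy_block(); } leaves later parallel regions with whatever setting was in effect before the call.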

Zero MQ with C++ bindings and Open MP blocking issue. Why?

I wrote a test for ZeroMQ to convince myself that it manages to map replies to the client independent from processing order, which would prove it thread safe.
It is a multi-threaded server, which just throws the received messages back at the sender. The client sends some messages from several threads and checks, if it receives the same message back. For multi-threading I use OpenMP.
That test worked fine and I wanted to move on and re-implement it with C++ bindings for ZeroMQ. And now it doesn't work in the same way anymore.
Here's the code with ZMQPP:
#include <gtest/gtest.h>
#include <zmqpp/zmqpp.hpp>
#include <zmqpp/proxy.hpp>
TEST(zmqomp, order) {
    zmqpp::context ctx;
    std::thread proxy([&ctx] {
        zmqpp::socket dealer(ctx, zmqpp::socket_type::xrequest);
        zmqpp::socket router(ctx, zmqpp::socket_type::xreply);
        // ...
        zmqpp::proxy(router, dealer);
    });
    std::thread worker_starter([&ctx] {
#pragma omp parallel
        {
            zmqpp::socket in(ctx, zmqpp::socket_type::reply);
            // ...
#pragma omp for
            for (int i = 0; i < 1000; i++) {
                std::string request;
                // ...
            }
        }
    });
    std::thread client([&ctx] {
#pragma omp parallel
        {
            zmqpp::socket out(ctx, zmqpp::socket_type::request);
            // ...
#pragma omp for
            for (int i = 0; i < 1000; i++) {
                std::string msg("Request " + std::to_string(i));
                // ...
                std::string reply;
                // ...
                EXPECT_EQ(reply, msg);
            }
        }
    });
    // ...
}
The test blocks and doesn't get executed to the end. I played around with #pragmas a little bit and found out that only one change can "fix" it:
//#pragma omp parallel for
for (int i = 0; i < 250; i++) {
The code is still getting executed parallel in that case, but I have to divide the loop executions number by a number of my physical cores.
Does anybody have a clue what's going on here?
Prologue: ZeroMQ is by-definition and by-design NOT Thread-Safe.
This normally does not matter as there are some safe-guarding design practices, but situation here goes even worse, once following the proposed TEST(){...} design.
Having spent some time with ZeroMQ, your proposal headbangs due to violations on several principal things, that otherwise help distributed architectures to work smarter, than a pure SEQ of monolithic code.
ZeroMQ convinces in ( almost ) every third paragraph to avoid sharing of resources. Zero-sharing is one of the ZeroMQ's fabulous scalable performance and minimised latency maxims, so to say in short.
So one has better to avoid sharing zmq.Context() instance at all ( unless one knows pretty well, why and how the things work under the hood ).
Thus an attempt to fire 1000-times ( almost ) in parallel ( well, not a true PAR ) some flow of events onto a shared instance of zmq.Context ( the less once it was instantiated with default parameters and having none performance tuning adaptations ) will certainly suffer from doing the very opposite from what is, performance-wise and design-wise, recommended to do.
What are some of the constraints, not to headbang into?
1) Each zmq.Context() instance has a limited number of I/O-threads, which were created during the instantiation process. Once a fair design needs some performance-tuning, it is possible to increase the number of I/O-threads and the data-pumps will work that much better ( sure, no amount of data-pumps will salvage a poor, let alone a disastrous, design / architecture of a distributed computing system. This is granted. ).
2) Each zmq.Socket() instance has an { implicit | explicit } mapping onto a respective I/O-thread ( Ref. 1) ). Once a fair design needs some increased robustness against sluggish event-loop handlings or against other adverse effects arisen from data-flow storms ( or load-balancing or you name it ), there are chances to benefit from a divide-and-conquer approach to use .setsockopt( zmq.AFFINITY, ... ) method to directly map each zmq.Socket() instance onto a respective I/O-thread, and remain thus in control of what buffering and internal queues are fighting for which resources during the real operations. In any case, where a total amount of threads goes over the localhost number of cores, the just-CONCURRENT scheduling is obvious ( so a dream of a true PAR execution is principally and inadvertently lost. This is granted. ).
3) Each zmq.Socket() has also a pair of "Hidden Queue Devastators", called High-Watermarks. These get set either { implicitly | explicitly }, the latter being for sure a wiser manner for performance tuning. Why Devastators? Because these stabilise and protect the distributed computing systems from overflows and are permitted to simply discard each and every message above the HWM level(s) so as to protect the systems capability to run forever, even under heavy storms, spurious blasts of crippled packets or DDoS-types of attack. There are many tools for tuning this domain of ZeroMQ Context()-instance's behaviour, which go beyond the scope of this answer ( Ref.: other my posts on ZeroMQ AFFINITY benefits or the ZeroMQ API specifications used in .setsockopt() method ).
4) Each tcp:// transport-class based zmq.Socket() instance has also inherited some O/S dependent heritage. Some O/S demonstrate this risk by extended accumulation of ip-packets ( outside of any ZeroMQ control ) until some threshold got passed and thus a due design care ought be taken for such cases to avoid adverse effects on the intended application signalling / messaging dynamics and robustness against such uncontrollable ( exosystem ) buffering habits.
5) Each .recv() and .send() method call is by-definition blocking, a thing a massively distributed computing system ought never risk to enter into. Never ever. Even in a school-book example. Rather use non-blocking form of these calls. Always. This is granted.
6) Each zmq.Socket() instance ought undertake a set of careful and graceful termination steps. A preventive step of .setsockopt( zmq.LINGER, 0 ) + an explicit .close() methods are fair to be required to be included in every use-case ( and made robust to get executed irrespective of any exceptions that may get appeared. ). A poor { self- | team- }-discipline in this practice is a sure ticket into hanging up the whole application infrastructure due to just not paying due care on a mandatory resources management policy. This is a must-have part of any serious distributed computing Project. Even the school-book examples ought have this. No exceptions. No excuse. This is granted.

Akka testkit: what is the timeFactor?

There are several methods on the Akka TestProbe that say they are "correctly treating the timeFactor." What does that mean?
For example, see the second version of expectMsg.
Using the setting akka.test.timefactor (1 by default) you may increase or decrease all timeouts in your tests (it may be useful when running tests on very different machines in terms of performance). These methods use as max duration the following value: akka.test.single-expect-default (3 seconds by default) multiplied by the akka.test.timefactor mentioned above.

How/why do functional languages (specifically Erlang) scale well?

I have been watching the growing visibility of functional programming languages and features for a while. I looked into them and didn't see the reason for the appeal.
Then, recently I attended Kevin Smith's "Basics of Erlang" presentation at Codemash.
I enjoyed the presentation and learned that a lot of the attributes of functional programming make it much easier to avoid threading/concurrency issues. I understand the lack of state and mutability makes it impossible for multiple threads to alter the same data, but Kevin said (if I understood correctly) all communication takes place through messages and the messages are processed synchronously (again avoiding concurrency issues).
But I have read that Erlang is used in highly scalable applications (the whole reason Ericsson created it in the first place). How can it be efficient handling thousands of requests per second if everything is handled as a synchronously processed message? Isn't that why we started moving towards asynchronous processing - so we can take advantage of running multiple threads of operation at the same time and achieve scalability? It seems like this architecture, while safer, is a step backwards in terms of scalability. What am I missing?
I understand the creators of Erlang intentionally avoided supporting threading to avoid concurrency problems, but I thought multi-threading was necessary to achieve scalability.
How can functional programming languages be inherently thread-safe, yet still scale?
A functional language doesn't (in general) rely on mutating a variable. Because of this, we don't have to protect the "shared state" of a variable, because the value is fixed. This in turn avoids the majority of the hoop jumping that traditional languages have to go through to implement an algorithm across processors or machines.
Erlang takes it further than traditional functional languages by baking in a message passing system that allows everything to operate on an event based system where a piece of code only worries about receiving messages and sending messages, not worrying about a bigger picture.
What this means is that the programmer is (nominally) unconcerned that the message will be handled on another processor or machine: simply sending the message is good enough for it to continue. If it cares about a response, it will wait for it as another message.
The end result of this is that each snippet is independent of every other snippet. No shared code, no shared state and all interactions coming from a message system that can be distributed among many pieces of hardware (or not).
Contrast this with a traditional system: we have to place mutexes and semaphores around "protected" variables and code execution. We have tight binding in a function call via the stack (waiting for the return to occur). All of this creates bottlenecks that are less of a problem in a shared nothing system like Erlang.
EDIT: I should also point out that Erlang is asynchronous. You send your message and maybe/someday another message arrives back. Or not.
Spencer's point about out of order execution is also important and well answered.
The message queue system is cool because it effectively produces a "fire-and-wait-for-result" effect which is the synchronous part you're reading about. What makes this incredibly awesome is that it means lines do not need to be executed sequentially. Consider the following code:
r = methodWithALotOfDiskProcessing();
x = r + 1;
y = methodWithALotOfNetworkProcessing();
w = x * y
Consider for a moment that methodWithALotOfDiskProcessing() takes about 2 seconds to complete and that methodWithALotOfNetworkProcessing() takes about 1 second to complete. In a procedural language this code would take about 3 seconds to run because the lines would be executed sequentially. We're wasting time waiting for one method to complete that could run concurrently with the other without competing for a single resource. In a functional language lines of code don't dictate when the processor will attempt them. A functional language would try something like the following:
Execute line 1 ... wait.
Execute line 2 ... wait for r value.
Execute line 3 ... wait.
Execute line 4 ... wait for x and y value.
Line 3 returned ... y value set, message line 4.
Line 1 returned ... r value set, message line 2.
Line 2 returned ... x value set, message line 4.
Line 4 returned ... done.
How cool is that? By going ahead with the code and only waiting where necessary we've reduced the waiting time to two seconds automagically! :D So yes, while the code is synchronous it tends to have a different meaning than in procedural languages.
Once you grasp this concept in conjunction with Godeke's post it's easy to imagine how simple it becomes to take advantage of multiple processors, server farms, redundant data stores and who knows what else.
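For comparison, here is a rough C++ sketch of the same four lines where the overlap is requested explicitly with std::async, since a language like C++ will not do it on its own. The two method bodies are hypothetical stand-ins (short sleeps and made-up return values) for the disk and network calls in the example above.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-ins for the two slow calls; the sleeps and the
// returned numbers are invented for illustration.
inline int methodWithALotOfDiskProcessing() {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    return 41;
}

inline int methodWithALotOfNetworkProcessing() {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    return 2;
}

inline int compute() {
    // Start both slow calls; std::launch::async runs each on its own thread.
    auto r = std::async(std::launch::async, methodWithALotOfDiskProcessing);
    auto y = std::async(std::launch::async, methodWithALotOfNetworkProcessing);

    const int x = r.get() + 1;  // waits only for the disk call
    return x * y.get();         // waits only for the network call
}
```

The total wait is then roughly that of the slower call instead of the sum of both, which is the effect the walkthrough above describes.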
It's likely that you're mixing up synchronous with sequential.
The body of a function in erlang is being processed sequentially.
So what Spencer said about this "automagical effect" doesn't hold true for erlang. You could model this behaviour with erlang though.
For example you could spawn a process that calculates the number of words in a line.
As we're having several lines, we spawn one such process for each line and receive the answers to calculate a sum from it.
That way, we spawn processes that do the "heavy" computations (utilizing additional cores if available) and later we collect the results.
-module(countwords).
-export([count_words_in_lines/1]).

count_words_in_lines(Lines) ->
    % For each line in lines run spawn_summarizer with the process id (pid)
    % and a line to work on as arguments.
    % This is a list comprehension and spawn_summarizer will return the pid
    % of the process that was created. So the variable Pids will hold a list
    % of process ids.
    Pids = [spawn_summarizer(self(), Line) || Line <- Lines],
    % For each pid receive the answer. This will happen in the same order in
    % which the processes were created, because we saved [pid1, pid2, ...] in
    % the variable Pids and now we consume this list.
    Results = [receive_result(Pid) || Pid <- Pids],
    % Sum up the results.
    WordCount = lists:sum(Results),
    io:format("We've got ~p words, Sir!~n", [WordCount]).

spawn_summarizer(S, Line) ->
    % Create an anonymous function and save it in the variable F.
    F = fun() ->
        % Split line into words.
        ListOfWords = string:tokens(Line, " "),
        Length = length(ListOfWords),
        io:format("process ~p calculated ~p words~n", [self(), Length]),
        % Send a tuple containing our pid and Length to S.
        S ! {self(), Length}
    end,
    % There is no return in erlang, instead the last value in a function is
    % returned implicitly.
    % Spawn the anonymous function and return the pid of the new process.
    spawn(F).

% The Variable Pid gets bound in the function head.
% In erlang, you can only assign to a variable once.
receive_result(Pid) ->
    receive
        % Pattern-matching: the block behind "->" will execute only if we receive
        % a tuple that matches the one below. The variable Pid is already bound,
        % so we are waiting here for the answer of a specific process.
        % N is unbound so we accept any value.
        {Pid, N} ->
            io:format("Received \"~p\" from process ~p~n", [N, Pid]),
            N
    end.
And this is what it looks like, when we run this in the shell:
Eshell V5.6.5 (abort with ^G)
1> Lines = ["This is a string of text", "and this is another", "and yet another", "it's getting boring now"].
["This is a string of text","and this is another",
"and yet another","it's getting boring now"]
2> c(countwords).
3> countwords:count_words_in_lines(Lines).
process <0.39.0> calculated 6 words
process <0.40.0> calculated 4 words
process <0.41.0> calculated 3 words
process <0.42.0> calculated 4 words
Received "6" from process <0.39.0>
Received "4" from process <0.40.0>
Received "3" from process <0.41.0>
Received "4" from process <0.42.0>
We've got 17 words, Sir!
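The same spawn-then-collect shape can be sketched in C++ with std::async, purely for comparison. This is not how Erlang does it internally: the tasks below are operating system threads that read the lines through shared memory, whereas Erlang processes receive copies via messages. The function names mirror the Erlang ones but are otherwise made up.

```cpp
#include <future>
#include <sstream>
#include <string>
#include <vector>

// Counts whitespace-separated words in one line.
inline int count_words(const std::string& line) {
    std::istringstream in(line);
    std::string word;
    int n = 0;
    while (in >> word) ++n;
    return n;
}

inline int count_words_in_lines(const std::vector<std::string>& lines) {
    // Start one asynchronous task per line (the "spawn" step).
    std::vector<std::future<int>> pending;
    pending.reserve(lines.size());
    for (const auto& line : lines) {
        pending.push_back(std::async(std::launch::async,
                                     [&line] { return count_words(line); }));
    }
    // Collect the results in the order the tasks were started (the "receive" step).
    int total = 0;
    for (auto& f : pending) total += f.get();
    return total;
}
```

Called with the four lines from the shell session above, it adds up the same per-line counts (6, 4, 3 and 4).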
The key thing that enables Erlang to scale is related to concurrency.
An operating system provides concurrency by two mechanisms:
operating system processes
operating system threads
Processes don't share state – one process can't crash another by design.
Threads share state – one thread can crash another by design – that's your problem.
With Erlang – one operating system process is used by the virtual machine and the VM provides concurrency to Erlang programme not by using operating system threads but by providing Erlang processes – that is Erlang implements its own timeslicer.
These Erlang processes talk to each other by sending messages (handled by the Erlang VM, not the operating system). The Erlang processes address each other using a process ID (PID) which has a three-part address <<N3.N2.N1>>:
process no N1 on
VM N2 on
physical machine N3
Two processes on the same VM, on different VM's on the same machine or two machines communicate in the same way – your scaling is therefore independent of the number of physical machines you deploy your application on (in the first approximation).
Erlang is only threadsafe in a trivial sense – it doesn't have threads. (The language that is, the SMP/multi-core VM uses one operating system thread per core).
You may have a misunderstanding of how Erlang works. The Erlang runtime minimizes context-switching on a CPU, but if there are multiple CPUs available, then all are used to process messages. You don't have "threads" in the sense that you do in other languages, but you can have a lot of messages being processed concurrently.
Erlang messages are purely asynchronous; if you want a synchronous reply to your message you need to explicitly code for that. What was possibly said was that the messages in a process's message box are processed sequentially. Any message sent to a process sits in that process's message box, and the process gets to pick one message from that box, process it, and then move on to the next one, in the order it sees fit. This is a very sequential act and the receive block does exactly that.
Looks like you have mixed up synchronous and sequential as chris mentioned.
Referential transparency: See
In a purely functional language, order of evaluation doesn't matter - in a function application fn(arg1, .. argn), the n arguments can be evaluated in parallel. That guarantees a high level of (automatic) parallelism.
Erlang uses a process model where a process can run in the same virtual machine, or on a different processor -- there is no way to tell. That is only possible because messages are copied between processes; there is no shared (mutable) state. Multi-processor parallelism goes a lot farther than multi-threading: since threads depend upon shared memory, there can only be 8 threads running in parallel on an 8-core CPU, while multi-processing can scale to thousands of parallel processes.