Akka Cluster manual join

I'm trying to find a workaround for the following limitation: when starting an Akka Cluster from scratch, one has to make sure that the first seed node is started. That is a problem for me, because if I have to restart my whole system from scratch in an emergency, who knows whether the one machine everything relies on will be up and running properly? And I might not have the luxury of taking time to change the system configuration. Hence my attempt to create the cluster manually, without relying on a static seed node list.
Now it's easy for me to have all Akka systems register themselves somewhere (e.g. on a network filesystem, by touching a file periodically). Therefore, when starting up, a new system could:
1. Look up the list of all systems that are supposedly alive (i.e. that touched the file system recently).
2. a. If there are none, the new system joins itself, i.e. starts the cluster alone.
   b. Otherwise it tries to join the cluster with Cluster(system).joinSeedNodes, using all the other supposedly alive systems as seeds.
3. If 2. b. doesn't succeed in reasonable time, the new system tries again, starting from 1. (looking up the list of supposedly alive systems again, as it might have changed in the meantime; in particular, all other systems might have died and we'd ultimately fall into 2. a.).
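The lookup-and-retry logic described above can be sketched independently of Akka. The following is a minimal, language-neutral illustration in Python, not Akka code: the heartbeat map (address to last-touch timestamp), the 30-second freshness window, and the `try_join` callback (which stands in for the actual join attempt plus its timeout) are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


def alive_peers(heartbeats: Dict[str, float], self_addr: str,
                now: float, max_age: float = 30.0) -> List[str]:
    """Step 1: other systems whose last heartbeat is recent enough."""
    return sorted(
        addr for addr, touched_at in heartbeats.items()
        if addr != self_addr and now - touched_at <= max_age
    )


def plan_join(heartbeats: Dict[str, float], self_addr: str,
              now: float, max_age: float = 30.0) -> Tuple[str, List[str]]:
    """Step 2: join ourselves if nobody is alive, else use the others as seeds."""
    peers = alive_peers(heartbeats, self_addr, now, max_age)
    if not peers:
        return ("join-self", [self_addr])   # case 2a
    return ("join-seeds", peers)            # case 2b


def join_with_retry(read_heartbeats: Callable[[], Dict[str, float]],
                    try_join: Callable[[str, List[str]], bool],
                    self_addr: str,
                    max_attempts: int = 5,
                    clock: Callable[[], float] = time.time
                    ) -> Optional[Tuple[str, int]]:
    """Step 3: re-read the heartbeat list before every new attempt.

    try_join is a hypothetical callback that performs the join and returns
    True once membership is confirmed, or False after its own timeout.
    """
    for attempt in range(1, max_attempts + 1):
        action, seeds = plan_join(read_heartbeats(), self_addr, clock())
        if try_join(action, seeds):
            return (action, attempt)
    return None  # gave up after max_attempts
```

Note that nothing in this sketch prevents two systems that start at the same moment from both taking the 2a branch and forming two separate clusters.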
I'm unsure how to implement 3.: How do I know whether joining has succeeded or failed? (Do I need to subscribe to cluster events?) And is it possible, in case of failure, to call Cluster(system).joinSeedNodes again? The official documentation is not very explicit on this point and I'm not 100% sure how to interpret the following in my case (can I make several attempts, using different seeds?):
An actor system can only join a cluster once. Additional attempts will
be ignored. When it has successfully joined it must be restarted to be
able to join another cluster or to join the same cluster again.
Finally, let me clarify that I'm building a small cluster (it's just 10 systems for the moment and it won't grow very big) and that it has to be restarted from scratch now and then (I cannot assume the cluster will be alive forever).
Thx

I'm answering my own question to let people know how I sorted out my issues in the end. Michal Borowiecki's answer mentioned the ConstructR project and I built my answer on their code.
How do I know whether joining has succeeded or failed? After issuing Cluster(system).joinSeedNodes I subscribe to cluster events and start a timeout:
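    import scala.concurrent.duration._
    import akka.cluster.Cluster
    import akka.cluster.ClusterEvent.{InitialStateAsEvents, MemberLeft, MemberUp}
    private case object JoinTimeout
    ...
    Cluster(context.system).subscribe(self, InitialStateAsEvents, classOf[MemberUp], classOf[MemberLeft])
    // scheduleOnce also needs an implicit ExecutionContext in scope (e.g. import context.dispatcher)
    system.scheduler.scheduleOnce(15.seconds, self, JoinTimeout)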
The receive is:
    val address = Cluster(system).selfAddress
    ...
    case MemberUp(member) if member.address == address =>
      // Hooray, I joined the cluster!
    case JoinTimeout =>
      // Oops, couldn't join
      system.terminate()
Is it possible in case of failure to call Cluster(system).joinSeedNodes again? Maybe, maybe not. But actually I simply terminate the actor system if joining didn't succeed and restart it for another try (so it's a "let it crash" pattern at the actor system level).

You don't need seed nodes. They are only required if you want the cluster to start up automatically.
You can start your individual applications and then have them join the cluster "manually" at any point in time. For example, if you have HTTP enabled, you can use the akka-management library (or implement a subset of it yourself; it only wraps basic cluster library functions nicely).
I strongly discourage the touch approach. How do you synchronize reading and writing the touched files between nodes? What if one node reads a transient state while another node is writing it?
I'd say either go fully automatic (with multiple seed nodes), or go fully "manual" and have another system be in charge of forming the cluster from your nodes. By that I mean you start them up individually, and they join the cluster only when ordered to do so by the external supervisor (which is also very helpful for managing split-brain situations).

We've started using the ConstructR extension instead of a static list of seed nodes:
https://github.com/hseeberger/constructr
This doesn't have the limitation that a statically configured first seed node has to be up after a full cluster restart.
Instead, it relies on a highly available lookup service. ConstructR supports etcd natively, and there are extensions for (at least) ZooKeeper and Consul. Since we already have a ZooKeeper cluster for Kafka, we went for ZooKeeper:
https://github.com/typesafehub/constructr-zookeeper

Related

Cycle Cloud Run Instance Manually

Occasionally my instances get into a corrupted state (especially since min-instance=1). I would like to restart one manually. Is this possible?
I know I can go through the console to create a new version, but this messes up my Terraform state. I would like to keep with the current version and just cycle the instance, a classic IT procedure called "Turning it on and off again" to fix my short term issue while I figure out the larger issue.

No, you can't. However, if you have a routine that can detect the corruption, you can exit the container (the instance is stopped and a new one is created). There are two options:
1. Either you have an internal check that automatically detects the state of the container and exits in case of corruption (works for max-instance >= 1).
2. Or you expose two different endpoints (works only for max-instance=1):
   a. One tells you the state of the container (OK or KO).
   b. In case of KO, you call an endpoint in your app that stops the instance. (If your container is public, this is dangerous because anyone can restart your container!)

Writing scaleable code

Can someone describe in very simple terms how you would scale up a service (let's assume the service is very simple and is the function X())?
To make this scalable, would you just fire off a new node (up to a maximum depending on your hardware) for each client who wants to run X?
So if I had four hardware boxes, I might fire up to four nodes to run service X(); on the 5th client request I would just run X() on the first node, for the 6th client on the second node, etc.?
Following on from this, I know how to spawn processes locally, but how would you get both the 1st and 5th clients to use the same Node 1? Would it be by spawning a process remotely on the node for the client each time?
Any simple examples are most welcome!

This depends very much on what X is. If X is fully independent, for instance x() -> 37. then you don't even need to connect your nodes. Simply place some standard Load Balancer in front of your system (HAProxy, Varnish, etc) and then forget about any kind of distributed communication. In fact, there is no need to use Erlang for that. Replace Erlang with some other language of your choice. It is equally good.
Where Erlang shines is when several X functions depend on each other's results and when an X might live on another physical machine. In that case Erlang can communicate with the other X seamlessly, even if it lives on a different node.
If you want to implement a round-robin scheme in Erlang, the easiest way is to have a single point of entry and then let it forward the requests out to multiple nodes. But this is bad if there is a pattern where a certain node ends up with all the long-running processes. You need to build a mechanism of feedback so you know how to weight the round-robin queue.
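To make the feedback idea concrete, here is a minimal sketch in Python rather than Erlang. It is only one possible feedback mechanism, chosen for the example: the single point of entry counts the requests still in flight on each node and rotates round-robin among the least-loaded nodes only. The class name, the node names, and the in-flight counter are assumptions of the sketch.

```python
from typing import Dict, List


class FeedbackBalancer:
    """Round-robin over the nodes that currently have the fewest open requests."""

    def __init__(self, nodes: List[str]) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        self._order = list(nodes)
        self._next = 0  # where the round-robin scan starts
        self.in_flight: Dict[str, int] = {node: 0 for node in nodes}

    def pick(self) -> str:
        """Choose a node for a new request and count it as in flight."""
        least = min(self.in_flight.values())
        count = len(self._order)
        for offset in range(count):
            idx = (self._next + offset) % count
            node = self._order[idx]
            if self.in_flight[node] == least:
                self._next = (idx + 1) % count
                self.in_flight[node] += 1
                return node
        raise RuntimeError("unreachable: some node always has the minimum load")

    def done(self, node: str) -> None:
        """Feedback from a node: one of its requests has finished."""
        self.in_flight[node] -= 1
```

With idle nodes this behaves like plain round-robin; a node that is stuck with long-running requests keeps a high counter and is skipped until it reports completions through `done`.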

How to smooth restart a c++ program without shut down the running program?

I have a server program which should run around the clock. If I want to change some of its parameters, is there any way to do so other than shutting it down and restarting it?

There are quite a few ways of doing this, including, but almost certainly not limited to:
1. You can maintain the parameters in a separate file so that the program will periodically check that file and update its internal information.
2. Similar to (1) but you can send some sort of signal to the application to get it to immediately re-read the file.
3. You can do either (1) or (2) but using shared memory rather than a configuration file.
4. You can have your program sit at the server end of an IPC conversation, so that a client can open up a connection to it to provide new parameters. Anything from a simple message queue to a full-blown HTTP server and associated pages.
Of course, all of these tend to need a fair amount of work in your program to get it to look for the new information.
You should take that into account when making your decision. By far the quickest solution to implement is to just (cleanly) kill off the process at something like 11:55pm then immediately restart it. It's simpler because your code probably already has the ability to load the information on startup, so this could be a simple cron one-liner.
Some people speak of laziness as a bad thing but that's not always the case :-)
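As an illustration of the first option above (a parameter file that the program re-checks periodically), here is a minimal sketch. It is written in Python for brevity rather than C++, and the JSON file format, the class name, and the use of modification time plus size as a change detector are assumptions of the example.

```python
import json
import os
from typing import Any, Dict, Optional, Tuple


class ReloadableConfig:
    """Holds parameters loaded from a JSON file and reloads them on change."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.values: Dict[str, Any] = {}
        self._stamp: Optional[Tuple[int, int]] = None
        self.reload_if_changed()

    def reload_if_changed(self) -> bool:
        """Re-read the file only if its mtime or size differs from last time."""
        stat = os.stat(self.path)
        stamp = (stat.st_mtime_ns, stat.st_size)
        if stamp == self._stamp:
            return False  # nothing changed, keep the current values
        with open(self.path, encoding="utf-8") as handle:
            self.values = json.load(handle)
        self._stamp = stamp
        return True
```

The server's main loop would call `reload_if_changed()` every so often, for example once per iteration or from a timer. A writer should replace the file atomically (write a temporary file, then rename it) so that the server never reads a half-written version.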

If the server maintains many live connections from clients, restarting the server process should be your last resort. Besides reloading configuration files, inserting a proxy process between the clients and the server is another option.
The proxy process is responsible for two things:
a. Maintaining the connections from clients and forwarding packets to the server for handling.
b. Judging whether the current server process (Server A) is alive and, if it is not, switching to another server (Server B) automatically.
Then you can change parameters by restarting a server without worrying about interrupting clients, since there are always two (or more) servers running.

How to implement a master machine controlling several slave machines via Linux C++

Could anyone give some advice on how to implement a master machine controlling some slave machines in C++?
I am trying to implement a simple program that can distribute tasks from a master to slaves. It is easy to implement one master plus one slave machine. However, when there is more than one slave machine, I don't know how to design it.
If the solution can be used for both Linux and Windows, it would be much better.

You should use a framework rather than make your own. What you need to search for is cluster computing. One that might work easily is Boost.MPI.

With n machines, you need to keep track of which ones are free. If none are free, look at the load on each slave (i.e. how many tasks have been queued up at each) and queue on the least loaded machine, or on whichever one your algorithm deems best (for example, better hardware may mean that some slaves perform better than others). I'd start with a simple distribution algorithm, and then tweak it once it's working...
More interesting problems will arise in exceptional circumstances (i.e. slaves dying, and various such issues.)
I would use an existing messaging bus to make your life easier (rather than reinventing one); the real intelligence is in the distribution algorithm and the management of failed nodes.

We need to know more, but basically you just need to make sure the slaves don't block each other. Details of doing that in C++ will get involved, but the first thing to do is ask yourself what the algorithm is. The simplest case is going to be if you don't care about waiting for the slaves, in which case you have
    while still tasks to do
        launch a task on a slave
If you have to have just one job running on a slave then you'll need something like an array of flags, one per slave
    slaves : array 0 to (number of slaves - 1)
    initialize slaves to all FALSE
    while not done
        find the first FALSE slave -- it's not in use
        set that slave to TRUE
        launch a job on that slave
        check for slaves that are done
            set that slave to FALSE
Now, if you have multiple threads, you can make that into two threads
    -- thread 1: hand out jobs
    while not done
        find the first FALSE slave -- it's not in use
        set that slave to TRUE
        launch a job on that slave

    -- thread 2: collect finished slaves
    while not done
        check for slaves that are done
            set that slave to FALSE
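The flag-array pseudocode above can be turned into a small runnable sketch. This one is in Python rather than C++, and it simulates each slave with a local thread. The `run_on_slave` callback is a hypothetical stand-in for whatever remote call would really launch a job, and the condition variable replaces the busy polling of the "check for slaves that are done" loop.

```python
import threading
from typing import Any, Callable, Iterable, List, Tuple


def run_master(tasks: Iterable[Any], num_slaves: int,
               run_on_slave: Callable[[int, Any], Any]) -> List[Tuple[Any, Any]]:
    """Distribute tasks so that each slave runs at most one job at a time."""
    if num_slaves < 1:
        raise ValueError("need at least one slave")
    busy = [False] * num_slaves      # one flag per slave, FALSE means free
    cond = threading.Condition()     # protects busy and results
    results: List[Tuple[Any, Any]] = []

    def run_job(slave_id: int, task: Any) -> None:
        try:
            outcome = run_on_slave(slave_id, task)
            with cond:
                results.append((task, outcome))
        finally:
            with cond:
                busy[slave_id] = False   # the slave is done, free it again
                cond.notify()

    threads = []
    for task in tasks:
        with cond:
            while all(busy):             # wait until some slave is free
                cond.wait()
            slave_id = busy.index(False)  # first FALSE slave
            busy[slave_id] = True
        thread = threading.Thread(target=run_job, args=(slave_id, task))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return results
```

A failing job still frees its slave because the flag is reset in the `finally` block, but the sketch does not retry the task or handle a slave that never answers.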

Web application background processes, newbie design question

I'm building my first web application after many years of desktop application development (I'm using Django/Python but maybe this is a completely generic question, I'm not sure). So please beware - this may be an ultra-newbie question...
One of my user processes involves heavy processing on the server (i.e. the user inputs something and the server needs ~10 minutes to process it). In a desktop application, what I would do is throw the user input into a queue protected by a mutex, and have a dedicated low-priority background thread blocking on the queue using that mutex.
However in the web application everything seems to be oriented towards synchronization with the HTTP requests.
Assuming I will use the database as my queue, what is best practice architecture for running a background process?

There are two schools of thought on this (at least).
1. Throw the work on a queue and have something else outside your web-stack handle it.
2. Throw the work on a queue and have something else in your web-stack handle it.
In either case, you create work units in a queue somewhere (e.g. a database table) and let some process take care of them.
I typically work with number 1, where I have a dedicated Windows service that takes care of these things. You could also do this with SQL jobs or something similar.
The advantage to item 2 is that you can more easily keep all your code in one place--in the web tier. You'd still need something that triggers the execution (e.g. loading the web page that processes work units with a sufficiently high timeout), but that could be easily accomplished with various mechanisms.
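A database table used as a work queue could look roughly like the following. This is a minimal sketch using Python's built-in sqlite3 module rather than Django's ORM; the table layout, the status values, and the function names are assumptions made for the example.

```python
import sqlite3
from typing import Callable, Optional, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'queued',
    result  TEXT
)
"""


def enqueue(conn: sqlite3.Connection, payload: str) -> int:
    """Called from the web request: store the work unit and return its id."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    return cursor.lastrowid


def claim_next(conn: sqlite3.Connection) -> Optional[Tuple[int, str]]:
    """Mark the oldest queued job as running and return (id, payload)."""
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE status = 'queued' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        claimed = conn.execute(
            "UPDATE jobs SET status = 'running' WHERE id = ? AND status = 'queued'",
            (row[0],),
        )
        if claimed.rowcount != 1:
            return None  # another worker took the job first
    return row


def work_once(conn: sqlite3.Connection, handler: Callable[[str], str]) -> bool:
    """Process at most one job; return False when the queue is empty."""
    job = claim_next(conn)
    if job is None:
        return False
    job_id, payload = job
    outcome = handler(payload)
    with conn:
        conn.execute(
            "UPDATE jobs SET status = 'done', result = ? WHERE id = ?",
            (outcome, job_id),
        )
    return True
```

The web tier would call `enqueue` and return immediately, while a separate worker process loops over `work_once` and sleeps briefly whenever it returns False. The status column also gives web pages something to poll for progress.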

Since:
1) This is a common problem,
2) You're new to your platform
-- I suggest that you look in the contributed libraries for your platform to find a solution to handle the task. In addition to queuing and processing the jobs, you'll also want to consider:
1) status communications between the worker and the web-stack. This will enable web pages that show the percentage complete number for the job, assure the human that the job is progressing, etc.
2) How to ensure that the worker process does not die.
3) If a job has an error, will the worker process automatically retry it periodically?
Will you or an operations person be notified if a job fails?
4) As the number of jobs increases, can additional workers be added to gain parallelism?
Or, even better, can workers be added on other servers?
If you can't find a good solution in Django/Python, you can also consider porting a solution from another platform to yours. I use delayed_job for Ruby on Rails. The worker process is managed by runit.
Regards,
Larry

Speaking generally, I'd look at running background processes on a different server, especially if your web server has any kind of load.

Running long processes in Django: http://iraniweb.com/blog/?p=56
