How can I sync the complete `local_dir` (`~/ray_results`) with the head node? - ray

I am creating custom result files in my jobs and want to sync them from the worker nodes to the head node (to rsync them down to my local computer later on). I tried writing them all into the local_dir, e.g. ~/ray_results, but unfortunately it seems that Ray Tune only syncs the individual trial folders in the local_dir. Is this correct?

Yes; try writing them to self.logdir for each trainable.
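For example, with the class-based Trainable API, something along these lines should keep custom files inside the trial folder that Tune syncs back to the head node. This is a minimal sketch: MyTrainable and the file name are made up, and details such as the tune.run signature and the self.logdir / self.iteration properties may differ between Ray versions.

import os
from ray import tune

class MyTrainable(tune.Trainable):
    def step(self):
        score = self.iteration * 0.1  # stand-in for a real training step
        # Anything written under self.logdir lands in the trial folder,
        # which Tune syncs from the workers to the head node.
        with open(os.path.join(self.logdir, "custom_results.csv"), "a") as f:
            f.write(f"{self.iteration},{score}\n")
        return {"score": score}

tune.run(MyTrainable, stop={"training_iteration": 3}, local_dir="~/ray_results")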

Related

Geth Node Not Syncing To The Blockchain Anymore

I'm running a node of an Ethereum side-chain. A few days ago I started getting only "peer connected on snap without compatible eth support" error messages in the log. It would not download any new blocks. The last block in my local chain was 5 days old. I thought maybe it had something to do with the merge.
The node runs inside a docker container and I don't know how to do anything with docker. My only option is interacting with the node.
First I tried using debug_setHead over RPC. I set the head back approx. 100k blocks before the last block in my chain. But when it reached that same block I would again only get those error messages. What's weird is that the log message that came right before, both times (when it first happened and after setting back the head), was "Deep froze chain segment", and after that I only got "peer connected on snap without compatible eth support".
Because setting back the head didn't work, the next thing I tried was pruning the node. According to the documentation pruning should only take 1 to 2 hours for this side-chain (It's on an SSD). But even after running it overnight I would never get the log message "State pruning successful".
Not knowing what to do, I started my node and read the log.
The end of the log says:
WARNING!
The clean trie cache is not found. Please delete it by yourself after the pruning. Remember don't start the Geth without deleting the clean trie cache otherwise the entire database may be damaged!
Check the command description "geth snapshot prune-state --help" for more details.
INFO [09-16|18:14:45.182] Pruning state data nodes=1 size=115.00B elapsed=13m3.752s eta=14m13.881s
INFO [09-16|18:14:53.188] Pruning state data nodes=2,264,671 size=676.51MiB elapsed=13m11.758s eta=14m7.433s
INFO [09-16|18:15:01.198] Pruning state data nodes=4,284,801 size=1.25GiB elapsed=13m19.768s eta=14m2.59s
After that it would just stop logging. It never attempts to connect to the chain and download any blocks. I'm not sure if starting the node could have damaged the chain, because after all it never downloaded any new chain data. Also I have no idea how to delete the clean trie cache.
The last thing I tried was removing all docker containers. I ran docker system prune and it removed all containers, images and volumes. But after reinstalling the node nothing changed. I still get the same log as shown above (without downloading any blocks), because apparently it didn't delete any chain data.
Also the RPC endpoint does not work anymore when starting the node.
I'm completely lost. I don't know what caused this problem in the first place or how to fix it. What can I do to get my node up and running again?
UPDATE:
I have now also tried deleting chain data with geth removedb but I still get the exact same log warning and nothing happens after that. Maybe deleting the clean cache can help getting at least one step further, but I don't know how to do that in a docker container.
UPDATE 2:
While geth removedb did not delete the database, it must have deleted something, because after starting the node, the pruning was successfully completed. But as expected, it did not solve my original problem. I still get an endless stream of
ERROR[09-16|20:50:27.777] Snapshot extension registration failed peer=eec7c316 err="peer connected on snap without compatible eth support"
error logs. And my node is still stuck on the same old block. Mind you that this error stream only starts at a certain block and is not a general problem with my node. If I set the head to a prior block with debug_setHead, the node will successfully sync up to the block I'm stuck at.
You can try to run the command like this:
sudo -u eth1 geth --datadir /path/to/chaindata removedb
This makes the command run under the user "eth1" assigned to geth. Depending on your setup, that may be the only user that can access the chaindata.
The command then asks which data to remove. You can keep the ancient data. Then start geth again and it should start syncing.
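For reference, the debug_setHead call mentioned in the question can be issued over JSON-RPC roughly like this. This is a hedged Python sketch: it assumes the container exposes HTTP-RPC with the debug API enabled, and both RPC_URL and the block height are placeholders to adjust for your setup.

import requests

RPC_URL = "http://127.0.0.1:8545"  # adjust to however the container exposes HTTP-RPC

def set_head(block_number: int) -> None:
    payload = {
        "jsonrpc": "2.0",
        "method": "debug_setHead",
        "params": [hex(block_number)],  # geth expects the block number as a hex string
        "id": 1,
    }
    resp = requests.post(RPC_URL, json=payload, timeout=10)
    resp.raise_for_status()
    print(resp.json())

set_head(12_345_678 - 100_000)  # placeholder height, roughly 100k blocks back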

AWS Batch jobs stuck in PENDING when they `dependsOn`

I have an issue chaining AWS Batch jobs.
There are 3 Compute environments (CE_A, CE_B, CE_C) and they have associated one Job queue each (JQ_A, JQ_B, JQ_C).
There are 6 Job definitions (JD_1, JD_2, ..., JD_6).
Let <jqce>-<jd>-<name> be a Job launched on job queue (or compute environment) <jqce> and with job definition <jd>. Example: A-1-a, C-6-z.
I want to execute sequentially about 20 jobs (launched with different environment variables): A-1-a, A-1-b, B-2-c, A-3-d, A-3-e, A-3-f, ...
For each job I specify the dependency on the previous job with:
params.dependsOn = [{ "jobId": "xxxxx-xxxx-xxxx-xxxxxx"}] in Batch.submitJob(params).
The first two jobs, A-1-a and A-1-b, execute successfully after waiting a few minutes for resource allocation.
The third job, B-2-c, also executes successfully, after some minutes of waiting for the compute environment CE_B to come up.
Meanwhile, compute environment CE_A is turned off since no job has arrived for it.
HERE IS THE PROBLEM:
I expect at this point that CE_B goes down and CE_A goes up. CE_A is not going up.
Job A-3-d is never executed; 16 hours later it is still in PENDING status.
The dependsOn is OK; its dependency ended a long time ago.
Without dependsOn, the batch runs OK with the same environment variables and config.
QUESTIONS
Did you face similar problems with AWS Batch and dependsOn?
Is it possible to chain batches from different Job Queues?
Is it possible to chain batches from different Compute Environments?
Does params.dependsOn = [{ "jobId": "xxx-xxx-xxx-xxx" }] seem OK to you? It seems I do not have to set the type attribute (see array jobs).
Does params.dependsOn = [{ "jobId": "xxx-xxx-xxx-xxx" }] seem OK to you? It seems I do not have to set the type attribute (see array jobs).
Yes, type is only required when it's defined as an array job. And is the jobId you're providing the one that was returned when you submitted that specific job?
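For comparison, here is a minimal sketch of the same chaining using boto3 in Python (the question uses the JavaScript SDK's Batch.submitJob instead; the job, queue and definition names below are the hypothetical ones from the question):

import boto3

batch = boto3.client("batch")

first = batch.submit_job(
    jobName="A-1-a",
    jobQueue="JQ_A",
    jobDefinition="JD_1",
    containerOverrides={"environment": [{"name": "STEP", "value": "a"}]},
)

# A plain jobId dependency; the "type" field is only needed for array jobs.
second = batch.submit_job(
    jobName="A-3-d",
    jobQueue="JQ_A",
    jobDefinition="JD_3",
    dependsOn=[{"jobId": first["jobId"]}],
)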
Is it possible to chain batches from different Job Queues?
Is it possible to chain batches from different Compute Environments?
You should be able to do it but I've never done that.
Meanwhile, compute environment CE_A is turned off since no job has arrived for it.
So CE_A was already up and had run A-1-a and A-1-b?
As I recall, AWS checks every 10 minutes for certain statuses, and people have run into cases where the system seems stuck.
You could set CE_A to always keep a minimum of 1 vCPU so it doesn't disappear or take a long time to come back up.
Can you simplify for testing purposes? Shorter actions, fewer queues, etc.
Consider checking the AWS forum on Batch. Not much activity there, but worth an additional set of eyes.
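If you want to try the minimum-capacity suggestion above, a hedged boto3 sketch would look like this (it assumes "minimum of 1 CPU" means the compute environment's minvCpus setting; CE_A is the name from the question):

import boto3

batch = boto3.client("batch")

# Keep CE_A from scaling to zero between dependent jobs.
batch.update_compute_environment(
    computeEnvironment="CE_A",
    computeResources={"minvCpus": 1},
)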

Akka Cluster manual join

I'm trying to find a workaround to the following limitation: when starting an Akka cluster from scratch, one has to make sure that the first seed node is started. That's a problem for me because, if I ever have to restart my whole system from scratch in an emergency, who knows whether the one machine everything relies on will be up and running properly? And I might not have the luxury of taking time to change the system configuration. Hence my attempt to create the cluster manually, without relying on a static seed node list.
Now it's easy for me to have all Akka systems register themselves somewhere (e.g. on a network filesystem, by touching a file periodically). Therefore, when starting up, a new system could:
1. Look up the list of all systems that are supposedly alive (i.e. those that touched the file system recently).
2. a. If there is none, then the new system joins itself, i.e. starts the cluster alone.
   b. Otherwise it tries to join the cluster with Cluster(system).joinSeedNodes using all the other supposedly alive systems as seeds.
3. If 2.b. doesn't succeed in a reasonable time, the new system tries again, starting from 1. (looking up again the list of supposedly alive systems, as it might have changed in the meantime; in particular all other systems might have died and we'd ultimately fall into 2.a.).
I'm unsure how to implement 3.: How do I know whether joining has succeeded or failed? (Do I need to subscribe to cluster events?) And is it possible, in case of failure, to call Cluster(system).joinSeedNodes again? The official documentation is not very explicit on this point and I'm not 100% sure how to interpret the following in my case (can I make several attempts, using different seeds?):
An actor system can only join a cluster once. Additional attempts will
be ignored. When it has successfully joined it must be restarted to be
able to join another cluster or to join the same cluster again.
Finally, let me point out that I'm building a small cluster (it's just 10 systems for the moment and it won't grow very big) and that it has to be restarted from scratch now and then (I cannot assume the cluster will be alive forever).
Thx
I'm answering my own question to let people know how I sorted out my issues in the end. Michal Borowiecki's answer mentioned the ConstructR project and I built my answer on their code.
How do I know whether joining has succeeded or failed? After issuing Cluster(system).joinSeedNodes I subscribe to cluster events and start a timeout:
// assumes: import akka.cluster.Cluster, akka.cluster.ClusterEvent._,
// scala.concurrent.duration._, and an implicit ExecutionContext (e.g. context.dispatcher)
private case object JoinTimeout
...
// Inside the actor: subscribe to membership events and arm a join timeout.
Cluster(context.system).subscribe(self, InitialStateAsEvents, classOf[MemberUp], classOf[MemberLeft])
context.system.scheduler.scheduleOnce(15.seconds, self, JoinTimeout)
The receive is:
val address = Cluster(context.system).selfAddress
...
case MemberUp(member) if member.address == address =>
  // Hooray, I joined the cluster!
case JoinTimeout =>
  // Oops, couldn't join
  context.system.terminate()
Is it possible in case of failure to call Cluster(system).joinSeedNodes again? Maybe, maybe not. But actually I simply terminate the actor system if joining didn't succeed and restart it for another try (so it's a "let it crash" pattern at the actor system level).
You don't strictly need seed nodes; you only need them if you want the cluster to bootstrap itself automatically.
You can start your individual applications and then have them "manually" join the cluster at any point in time. For example, if you have HTTP enabled, you can use the akka-management library (or implement a subset of it yourself; they are all basic cluster library functions, just nicely wrapped).
I strongly discourage the touch approach. How do you sync on the touch reading / writing between nodes? What if someone reads a transient state (while someone else is writing it) ?
I'd say either go full auto (with multiple seed-nodes), or go full "manual" and have another system be in charge of managing the clusterization of your nodes. By that I mean you start them up individually, and they join the cluster only when ordered to do so by the external supervisor (also very helpful to manage split-brains).
We've started using the ConstructR extension instead of the static list of seed nodes:
https://github.com/hseeberger/constructr
This doesn't have the limitation of a statically configured first seed node having to be up after a full cluster restart.
Instead, it relies on a highly available lookup service. ConstructR supports etcd natively and there are extensions for (at least) ZooKeeper and Consul available. Since we already have a ZooKeeper cluster for Kafka, we went for ZooKeeper:
https://github.com/typesafehub/constructr-zookeeper

let slurmctld "think" that nodes are idle~ like after "SuspendProgram", but in fact they are down when it starts

Is there a way to start the slurmctld daemon with the execution nodes off, but make it believe that it has requested the suspend for these nodes (e.g. as if it had called SuspendProgram)?
I am setting up a virtual cluster, so SuspendProgram and ResumeProgram terminate and instantiate virtual machines. In this way I could power on only the master node, and it would fire up nodes only when requested.
The problem is that, for the moment, when I start slurmctld I need the nodes to come up, tell it that they exist, and wait for it to shut them down. This adds unwanted costs, because I need to power on all the "supposed" instances.
I would like to instantiate the master (the one running slurmctld) and let it think that the nodes are idle~, like after SuspendProgram.
Cheers
What you can try is setting the nodes to state POWER_DOWN in slurm.conf, so that at startup slurmctld will see those nodes as powered down by SuspendProgram:
NodeName=... Sockets=... CoresPerSocket=... [etc] State=POWER_DOWN

How to implement a master machine controlling several slave machines via Linux C++

Could anyone give some advice on how to implement a master machine controlling several slave machines in C++?
I am trying to implement a simple program that can distribute tasks from a master to slaves. It is easy to implement with one master + one slave machine. However, when there is more than one slave machine, I don't know how to design it.
If the solution can be used for both Linux and Windows, it would be much better.
You should use a framework rather than making your own. What you need to search for is cluster computing. One that might work easily is Boost.MPI.
With n machines, you need to keep track of which ones are free and, if none are, of the load across your slaves (i.e. how many tasks have been queued up at each), and then queue on the least loaded machine (or whichever your algorithm deems best; say, better hardware means that some slaves perform better than others, etc.). I'd start with a simple distribution algorithm and then tweak once it's working (see the sketch below).
More interesting problems will arise in exceptional circumstances (e.g. slaves dying, and various such issues).
I would use an existing messaging bus to make your life easier (rather than re-inventing one); the real intelligence is in the distribution algorithm and the management of failed nodes.
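As a toy illustration of the "queue on the least loaded machine" idea (Python here just to keep it short; the slave names and queue depths are made up):

# Hypothetical queue depths reported by each slave.
queued = {"slave-1": 3, "slave-2": 0, "slave-3": 5}

# Dispatch the next task to the slave with the fewest queued tasks.
target = min(queued, key=queued.get)
queued[target] += 1
print("dispatch next task to", target)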
We need to know more, but basically you just need to make sure the slaves don't block each other. The details of doing that in C++ will get involved, but the first thing to do is ask yourself what the algorithm is. The simplest case is going to be if you don't care about waiting for the slaves, in which case you have:
while still tasks to do
    launch a task on a slave
If you have to have just one job running on a slave at a time, then you'll need something like an array of flags, one per slave:
slaves : array 0 to (number of slaves - 1)
initialize slaves to all FALSE
while not done
    find the first FALSE slave -- it's not in use
    set that slave to TRUE
    launch a job on that slave
    check for slaves that are done
    set that slave to FALSE
Now, if you have multiple threads, you can split that into two threads:
-- dispatcher thread
while not done
    find the first FALSE slave -- it's not in use
    set that slave to TRUE
    launch a job on that slave

-- completion thread
while not done
    check for slaves that are done
    set that slave to FALSE
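The same bookkeeping falls out almost for free if you let a worker pool play the role of the flag array. Here is a sketch in Python rather than C++, just to show the shape of the dispatcher/completion split; run_task and the pool size are stand-ins for whatever actually ships work to a slave machine (e.g. MPI or a message bus), not a real remote transport.

import concurrent.futures as cf

def run_task(task):
    # Stand-in for sending work to a slave and waiting for its result.
    return f"done: {task}"

def main():
    tasks = [f"task-{i}" for i in range(20)]
    num_slaves = 4  # one worker slot per slave in this toy model

    # An idle worker slot corresponds to a FALSE (free) slave above.
    with cf.ProcessPoolExecutor(max_workers=num_slaves) as pool:
        futures = {pool.submit(run_task, t): t for t in tasks}
        # The "check for slaves that are done" loop:
        for fut in cf.as_completed(futures):
            print(futures[fut], "->", fut.result())

if __name__ == "__main__":
    main()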