I'm running a node of an Ethereum side-chain. A few days ago the log started showing nothing but "peer connected on snap without compatible eth support" error messages, and the node stopped downloading new blocks. The last block in my local chain was 5 days old. I thought it might have something to do with the merge.
The node runs inside a docker container and I don't know how to do anything with docker. My only option is interacting with the node itself.
First I tried using debug_setHead over RPC and set the head back roughly 100k blocks before the last block in my chain. But once it synced back up to that same block, I again only got those error messages. What's odd is that both times (when the problem first appeared and after setting back the head) the log message right before it was "Deep froze chain segment", and after that I only got "peer connected on snap without compatible eth support".
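The call I made looked roughly like this (the block number is just an example, and it assumes the debug API is exposed on the default HTTP port):
curl -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"debug_setHead","params":["0x1234567"],"id":1}' \
  http://localhost:8545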
Because setting back the head didn't work, the next thing I tried was pruning the node. According to the documentation, pruning should only take 1 to 2 hours for this side-chain (it's on an SSD). But even after running it overnight I never got the log message "State pruning successful".
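The pruning command I ran was essentially this (the datadir path is a placeholder, and a side-chain fork of geth may differ slightly):
geth snapshot prune-state --datadir /path/to/chaindata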
Not knowing what to do, I started my node and read the log.
The end of the log says:
WARNING!
The clean trie cache is not found. Please delete it by yourself after the pruning. Remember don't start the Geth without deleting the clean trie cache otherwise the entire database may be damaged!
Check the command description "geth snapshot prune-state --help" for more details.
INFO [09-16|18:14:45.182] Pruning state data nodes=1 size=115.00B elapsed=13m3.752s eta=14m13.881s
INFO [09-16|18:14:53.188] Pruning state data nodes=2,264,671 size=676.51MiB elapsed=13m11.758s eta=14m7.433s
INFO [09-16|18:15:01.198] Pruning state data nodes=4,284,801 size=1.25GiB elapsed=13m19.768s eta=14m2.59s
After that it just stopped logging. It never attempted to connect to the chain and download any blocks. I'm not sure whether starting the node could have damaged the chain, because after all it never downloaded any new chain data. Also, I have no idea how to delete the clean trie cache.
The last thing I tried was removing all docker containers. I ran docker system prune and it removed all containers, images and volumes. But after reinstalling the node nothing changed: I still get the same log as shown above (without downloading any blocks), because apparently it didn't delete any chain data.
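In hindsight, docker system prune apparently does not remove named volumes unless you pass --volumes, which would explain why the chain data survived. If it lives in a named volume, I assume something like this would find and remove it (the volume name is a placeholder):
docker volume ls
docker volume rm <volume-name>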
Also the RPC endpoint does not work anymore when starting the node.
I'm completely lost. I don't know what caused this problem in the first place or how to fix it. What can I do to get my node up and running again?
UPDATE:
I have now also tried deleting the chain data with geth removedb, but I still get the exact same log warning and nothing happens after that. Maybe deleting the clean cache would help me get at least one step further, but I don't know how to do that in a docker container.
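If the clean trie cache is just a directory inside the data dir, I assume something along these lines would delete it from inside the container, but the container name, the path and the directory name ("triecache" is only my guess from what I've read about geth) all depend on the setup:
docker exec -it <container-name> ls /path/to/chaindata/geth
docker exec -it <container-name> rm -rf /path/to/chaindata/geth/triecache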
UPDATE 2:
While geth removedb did not delete the database, it must have deleted something, because after starting the node, the pruning was successfully completed. But as expected, it did not solve my original problem. I still get an endless stream of
ERROR[09-16|20:50:27.777] Snapshot extension registration failed peer=eec7c316 err="peer connected on snap without compatible eth support"
error logs. And my node is still stuck on the same old block. Mind you, this error stream only starts at a certain block; it is not a general problem with my node. If I set the head to an earlier block with debug_setHead, the node will successfully sync up to the block I'm stuck at.
You can try to run the command like this:
sudo -u eth1 geth --datadir /path/to/chaindata removedb
This runs the command as the user "eth1" that geth is assigned to. Depending on your setup, that may be the only user that can access the chain data.
The command then asks which data to remove; you can keep the ancient data. Then start geth again and it should start syncing.
I have a job in interruptible sleep state (S) that has been hanging for a few hours.
I can't use gdb (gdb hangs when attaching to the PID).
I can't use strace either, because strace resumes the hanging job =(
The WCHAN field shows the PID is waiting on ptlrpc. From some searching online, this looks like a Lustre operation. The program's printed output also revealed that it is stuck reading data from Lustre. Any ideas or suggestions on how to proceed with the diagnosis, or possible reasons why the hang happens?
You can check /proc/$PID/stack on the client to see the whole stack of the process, which would give you some more information about what the process is doing (ptlrpc_set_wait() is just the generic "wait for RPC completion" function).
That said, what is more likely to be useful is to check the kernel console error messages (dmesg and/or /var/log/messages) to see what is going on. Lustre is definitely not shy about logging errors when there is a problem.
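For example, on the client (the PID is a placeholder and the log file location depends on your distribution):
cat /proc/12345/stack                          # kernel stack of the hung process
dmesg | grep -iE 'lustre|ptlrpc' | tail -n 50  # recent Lustre-related kernel messages
grep -i lustre /var/log/messages | tail -n 50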
Very likely this will show that the client is waiting on a server to complete the RPC, so you'll also have to check dmesg and/or /var/log/messages on the server to see what the problem is there. There are several existing docs that go into detail about how to debug Lustre issues:
https://wiki.lustre.org/Diagnostic_and_Debugging_Tools
https://cug.org/5-publications/proceedings_attendee_lists/CUG11CD/pages/1-program/final_program/Wednesday/12A-Spitz-Paper.pdf
At that point, you are probably best off checking for existing Lustre bugs at https://jira.whamcloud.com/, searching for the first error messages that are reported, or maybe a stack trace. It is very likely (depending on what error is being hit) that there is already a fix available, and upgrading to the latest maintenance release (2.12.7 currently), or applying a patch (if the bug was fixed recently), will solve your problem.
I am creating custom result files in my jobs and want to sync them from the worker nodes to the head node (to rsync them down to my local computer later on). I tried to write them all into the local_dir, e.g. ~/ray_results, but unfortunately it seems that Ray Tune only syncs the individual trial folders inside the local_dir. Is this correct?
Yes; try writing them to self.logdir for each trainable.
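A minimal sketch of what that could look like (the class and file names are made up, and depending on your Ray version the method may be _train instead of step):
import os
from ray import tune

class MyTrainable(tune.Trainable):
    def step(self):
        # Files written into self.logdir end up in the trial folder,
        # which is what Tune syncs back to the head node.
        with open(os.path.join(self.logdir, "custom_results.txt"), "a") as f:
            f.write("one line of custom results\n")
        return {"score": 0.0}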
When I work with Akka, I have problems while debugging: the nodes can't exchange heartbeats while one of them is paused in the debugger, and then they stop seeing each other.
As far as I can tell, if a node doesn't receive a heartbeat within about 10 seconds, it is removed from the cluster and the nodes can no longer see each other. I want to extend this period, for example so that a node only leaves the cluster after it has failed to heartbeat for 5 minutes. I'm looking for a solution like this.
Turn off auto-downing. That way the node you are debugging might get marked as unreachable, but it will never be removed from the cluster. (And as soon as you release it from being frozen the other nodes will mark it as reachable again.)
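A rough sketch of the relevant settings in application.conf (the values are just examples, and auto-downing is already off unless you enabled it):
akka.cluster {
  # keep unreachable nodes in the cluster instead of auto-removing them
  auto-down-unreachable-after = off
  # tolerate longer heartbeat pauses, e.g. while sitting in a breakpoint
  failure-detector.acceptable-heartbeat-pause = 60 s
}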
I'm trying to find a workaround to the following limitation: When starting an Akka Cluster from scratch, one has to make sure that the first seed node is started. It's a problem to me, because if I have an emergency to restart all my system from scratch, who knows if the one machine everything relies on will be up and running properly? And I might not have the luxury to take time changing the system configuration. Hence my attempt to create the cluster manually, without relying on a static seed node list.
Now it's easy for me to have all Akka systems register themselves somewhere (e.g. on a network filesystem, by touching a file periodically). Therefore, when starting up, a new system could:
1. Look up the list of all systems that are supposedly alive (i.e. that touched the file system recently).
2. a. If there is none, the new system joins itself, i.e. starts the cluster alone. b. Otherwise it tries to join the cluster with Cluster(system).joinSeedNodes, using all the other supposedly alive systems as seeds.
3. If 2.b. doesn't succeed in reasonable time, the new system tries again, starting from 1. (looking up the list of supposedly alive systems again, as it might have changed in the meantime; in particular all other systems might have died and we'd ultimately fall into 2.a.).
I'm unsure how to implement step 3: how do I know whether joining has succeeded or failed? (Do I need to subscribe to cluster events?) And is it possible, in case of failure, to call Cluster(system).joinSeedNodes again? The official documentation is not very explicit on this point and I'm not 100% sure how to interpret the following in my case (can I make several attempts, using different seeds?):
An actor system can only join a cluster once. Additional attempts will
be ignored. When it has successfully joined it must be restarted to be
able to join another cluster or to join the same cluster again.
Finally, let me point out that I'm building a small cluster (it's just 10 systems for the moment and it won't grow very big) and it has to be restarted from scratch now and then (I cannot assume the cluster will stay alive forever).
Thx
I'm answering my own question to let people know how I sorted out my issues in the end. Michal Borowiecki's answer mentioned the ConstructR project and I built my answer on their code.
How do I know whether joining has succeeded or failed? After issuing Cluster(system).joinSeedNodes I subscribe to cluster events and start a timeout:
private case object JoinTimeout
...
Cluster(context.system).subscribe(self, InitialStateAsEvents, classOf[MemberUp], classOf[MemberLeft])
system.scheduler.scheduleOnce(15.seconds, self, JoinTimeout)
The receive is:
val address = Cluster(system).selfAddress
...
case MemberUp(member) if member.address == address =>
// Hooray, I joined the cluster!
case JoinTimeout =>
// Oops, couldn't join
system.terminate()
Is it possible in case of failure to call Cluster(system).joinSeedNodes again? Maybe, maybe not. But actually I simply terminate the actor system if joining didn't succeed and restart it for another try (so it's a "let it crash" pattern at the actor system level).
You don't need seed nodes. Seed nodes are only needed if you want the cluster to start up automatically.
You can start your individual applications and then have them "manually" join the cluster at any point in time. For example, if you have HTTP enabled, you can use the akka-management library (or implement a subset of it yourself; they are all basic cluster library functions, just nicely wrapped).
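For example, a node can be told to join programmatically (the system name, host and port here are made up):
import akka.actor.{ActorSystem, Address}
import akka.cluster.Cluster

val system = ActorSystem("MySystem")
// join an already-running node, or the node's own address to bootstrap a new cluster
Cluster(system).join(Address("akka.tcp", "MySystem", "10.0.0.1", 2552))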
I strongly discourage the touch-file approach. How do you synchronize the touch reads and writes between nodes? What if someone reads a transient state while someone else is writing it?
I'd say either go full auto (with multiple seed nodes), or go full "manual" and have another system be in charge of managing the clustering of your nodes. By that I mean you start them up individually, and they join the cluster only when ordered to do so by the external supervisor (this is also very helpful for managing split-brains).
We've started using Constructr extension instead of the static list of seed-nodes:
https://github.com/hseeberger/constructr
This doesn't have the limitation of a statically-configured 1st seed-node having to be up after a full cluster restart.
Instead, it relies on a highly-available lookup service. Constructr supports etcd natively and there are extensions for (at least) zookeeper and consul available. Since we already have a zookeeper cluster for kafka, we went for zookeeper:
https://github.com/typesafehub/constructr-zookeeper
We're currently updating from Akka 2.0.4 to 2.4.2 (I know, quite a big leap, but nobody thought to do it incrementally).
Anyway, in the old codebase, our master node is connected to some remote slave nodes that sometimes fail (the "why" is still to be investigated). When a slave dies, the master receives a RemoteClientShutdown event from which we can extract getRemoteAddress and process it accordingly (e.g. inform the admin by email, pointing to the failed node's address).
In version 2.4.2 the RemoteClientShutdown class is replaced (at least I suppose so) by RemotingShutdownEvent which, being an object, doesn't carry any specific information as to the source of the event.
I've checked the migration guides as well as the current documentation but couldn't find info on how to solve this problem. According to the Event Bus documentation, the only way to extract such information is by providing it in the message ("Please note that the EventBus does not preserve the sender of the published messages. If you need a reference to the original sender you have to provide it inside the message").
Should I somehow override the message sent on the remote system shutdown? Or is there any other recommended way to solve it? I hope this question is not too newbie, I'm still quite new to Akka.
Ok, solved it using DisassociatedEvent which actually contains the address and other useful information. Turns out I was misled by the name of RemotingShutdownEvent which is actually received "when the remoting subsystem has been shut down" (docs) and not when a remote actor has been shut down.
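For reference, the subscription ended up looking roughly like this (the actor name and log message are made up):
import akka.actor.{Actor, ActorLogging}
import akka.remote.DisassociatedEvent

class RemoteWatcher extends Actor with ActorLogging {
  override def preStart(): Unit =
    context.system.eventStream.subscribe(self, classOf[DisassociatedEvent])

  def receive = {
    case DisassociatedEvent(local, remote, inbound) =>
      // 'remote' is the address of the node we lost the connection to
      log.warning(s"Disassociated from $remote")
  }
}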