I'm trying to set up a Spark cluster in my AWS (Amazon) account. I followed this tutorial to set it up. I ran into some problems regarding ports, but I finally got it to work until... it shut down after 10 seconds, giving me nothing more than this error:
16/05/12 12:52:46 INFO client.AppClient$ClientActor: Connecting to master spark://ip-to-my-machine:7077...
16/05/12 12:53:06 INFO client.AppClient$ClientActor: Connecting to master spark://ip-to-my-machine:7077...
16/05/12 12:53:26 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
16/05/12 12:53:26 ERROR scheduler.TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
This was the bash command I ran to make it work:
bin/spark-shell --master spark://ip-to-my-machine:7077
I opened TCP port 7077, so what could the problem be?
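One way to narrow this down (a hedged sketch, not part of the original post) is to check that the machine running spark-shell can actually reach the master on 7077, and that the spark:// URL matches exactly what the master advertises on its standalone web UI (port 8080 by default):
nc -zv ip-to-my-machine 7077    # does anything answer on the master port from the machine running spark-shell?
curl -s http://ip-to-my-machine:8080    # the standalone master's web UI shows the exact spark:// URL it is listening on
A mismatch between the hostname/IP the master bound to and the one passed to --master is a common cause of "All masters are unresponsive".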
I'm running a Heroku Redis instance that has always worked fine, but after the upgrade, when I try to run Celery workers, I get the following error:
[2021-07-23 11:06:08,135: ERROR/MainProcess] consumer: Cannot connect to redis://:**@ec2-54-***-***-*.eu-west-1.compute.amazonaws.com:*****//: Error while reading from socket: (104, 'Connection reset by peer').
Trying again in 12.00 seconds... (6/100)
Everything seems to be up to date, and I can't figure out how to get it running again. I tried dropping the Redis instance and creating it from scratch, but nothing changed.
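A connection reset right at startup after a Heroku Redis upgrade often means the client is speaking plaintext to a TLS-only instance. Here is a hedged sketch of how one might check and adjust the broker URL; CELERY_BROKER_URL is a hypothetical config var name for this app, and the ssl_cert_reqs query parameter assumes Celery/kombu's Redis transport is in use:
heroku redis:info    # shows the add-on's plan and Redis version (requires the Heroku CLI's Redis plugin)
heroku config:get REDIS_URL    # the URL the workers are actually connecting with
# if the new instance is TLS-only, the broker URL generally needs the rediss:// scheme
# plus an explicit certificate policy, for example:
heroku config:set CELERY_BROKER_URL="rediss://:<password>@<host>:<port>/0?ssl_cert_reqs=CERT_NONE"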
I am working on a web-based tool (named cloudcopasi) which takes jobs from a user and submits them to Bosco resources (compute nodes). I am using Bosco (Condor 8.8.12) on Linux CentOS 7. The web interface allows a user to add a Bosco pool, which the user can then use to submit jobs. However, when I try to submit a job, it fails. I tried to test the pool as well by using the following command:
bosco_cluster --test
It gives me the following GAHP error:
…..
Testing bosco submission...Passed!
Submission and log files for this job are in /home/cloudcopasi/bosco/local.bosco/bosco-test/boscotest.LTA07r
Waiting for jobmanager to accept job...Passed
Checking for submission to remote slurm cluster (could take ~30 seconds)...Failed
Showing last 5 lines of logs:
01/06/21 13:34:03 [3800] Gahp Server (pid=3815) exited with status 1 unexpectedly
01/06/21 13:34:08 [3800] gahp server not up yet, delaying ping
01/06/21 13:34:08 [3800] No jobs left, shutting down
01/06/21 13:34:08 [3800] Got SIGTERM. Performing graceful shutdown.
01/06/21 13:34:08 [3800] **** condor_gridmanager (condor_GRIDMANAGER) pid 3800 EXITING WITH STATUS 0
I am not sure what I am missing, and I don't understand how to solve this "GAHP server" issue.
Any help is highly appreciated.
Thank you.
This is probably an SSH failure (network, authentication, or authorization). Bosco runs the following command to access the remote cluster's submit host:
<sbin>/remote_gahp <user>@<hostname> batch_gahp
You can run it on the command line to get more details about what's going wrong. remote_gahp is a bash script, so you can dig in further, if necessary.
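For example (a sketch, with <user>, <hostname>, and <sbin> as placeholders for your own installation), you could first confirm that password-less SSH works and then invoke the same command Bosco uses:
ssh <user>@<hostname> 'echo ssh ok'    # should succeed without prompting for a password
<sbin>/remote_gahp <user>@<hostname> batch_gahp    # the command the gridmanager runs; its errors are printed directly
If the plain ssh step fails, fix keys and known_hosts first; if only remote_gahp fails, the problem is on the remote side (for example, the batch GAHP missing from the remote Bosco installation).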
I have started to receive a 402 error when accessing my CoreOS cluster. It had been working fine up until a day ago. Does anybody have any idea why I'm receiving this error? I am using the stable channel on EC2.
$ fleetctl list-machines
E0929 09:43:14.823081 00979 fleetctl.go:151] error attempting to check latest fleet version in Registry: 402: Standby Internal Error () [0]
Error retrieving list of active machines: 402: Standby Internal Error () [0]
In this case, etcd does not currently have quorum. The "Standby Internal Error" signifies that the node is attempting to act as a standby but is failing to redirect you to the active node. Repairing the etcd issue will fix the problem. Check on the status of etcd by running:
journalctl -u etcd.service
on each of the nodes; that should give you the information you need to repair etcd.
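Beyond the journal, a quick sanity check (a hedged sketch; the exact endpoints depend on the etcd version your CoreOS release ships) is to query etcd's HTTP API on each node:
curl -s http://127.0.0.1:4001/version    # confirms the local etcd is answering at all
curl -s http://127.0.0.1:4001/v2/stats/leader    # only returns leader statistics once the cluster has an active leader
If no node reports a leader, fleet's registry has nothing to talk to, which matches the 402 above.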
While debugging, I realised that confd isn't picking up the keys, and my journal looks like this:
Sep 18 18:31:50 ip-10-171-54-76.ec2.internal docker[24891]: [nginx] waiting for confd to refresh nginx.conf
Sep 18 18:31:56 ip-10-171-54-76.ec2.internal docker[24891]: 2014-09-18T18:31:56Z 9122c7a54edc confd[9572]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
I used nsenter to log in to the running container to run some experiments for debugging purposes. I ran this command:
confd -onetime -node 172.17.42.1:4001 -config-file /etc/confd/conf.d/nginx.toml
Then I received this error, the same as above:
confd[12894]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
I am totally clueless at this point. I am using EC2 with the stable version of CoreOS and I am sure that etcd is running on the host. Also, I can ping the host from inside the container successfully.
Any ideas on what's wrong?
Assistance will be much appreciated.
This error indicates that your etcd cluster isn't operating correctly, so confd has nothing to watch. It has probably lost quorum. The logs (journalctl -u etcd) should indicate what happened.
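As a concrete check (a sketch reusing the address from the confd command above), you can query etcd directly from inside the container; if these fail while ping succeeds, the etcd daemon itself, not the network, is the problem:
curl -s http://172.17.42.1:4001/version    # is etcd answering on the docker bridge address?
curl -s http://172.17.42.1:4001/v2/machines    # on the etcd versions of that era, this lists the peers the node knows about
Run journalctl -u etcd on the host at the same time to see why the cluster lost quorum.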
I am trying to get Django-Celery running on my Django app. I cannot get the worker server to run. When I try, I get the message: No connection could be made because the target machine actively refused it
Here is what I have done so far. First, I installed the django-celery package: http://pypi.python.org/pypi/django-celery
I can load it into Python without problems. I also installed the RabbitMQ server per the Windows install instructions: http://www.rabbitmq.com/install.html#windows
Starting the Python tutorials on the RabbitMQ site, I saw the need to install pika: http://pypi.python.org/pypi/pika. It imports without any problems.
From there I start the RabbitMQ server by running this at the command line: rabbitmq-service start
I get the message back that Service RabbitMQ started
Here is where I start to have problems.
I attempted the first steps in django-celery: http://packages.python.org/django-celery/getting-started/first-steps-with-django.html and the "hello world" example on the rabbitMQ site: http://www.rabbitmq.com/tutorials/tutorial-one-python.html
In both cases I get the message: No Connection could be made because the target machine actively refused it
My first thought was that this sounded like a firewall problem. So I went into the Windows 7 firewall and added inbound and outbound rules to open the local and remote ports 5672 and 5673 to the TCP protocol, but I still get the same error message.
When I run rabbitmqctl status I get the message:
Error: unable to connect to node 'rabbit@hostname': nodedown
diagnostics:
- nodes and their ports on hostname: [{rabbitmqctl18856, 505031}]
Does that mean that it is trying to operate on those ports? What about the default 5672?
Any suggestions?
UPDATE: This was actually a problem resulting from several failed RabbitMQ installs conflicting with the latest installation. If you have to remove RabbitMQ, use the 'rabbitmq-service remove' command and not SC DELETE, which caused a lot of problems for me; I had to go in and clean up my Windows registry.
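For reference, a clean re-registration of the Windows service (a sketch using the standard rabbitmq-service.bat verbs, run from the RabbitMQ sbin directory in an administrator prompt) looks like this:
rabbitmq-service stop
rabbitmq-service remove    # unregister the old, conflicting service entry
rabbitmq-service install    # re-register the service against the current installation
rabbitmq-service start
rabbitmqctl status    # should now report a running node instead of nodedown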
The nodedown error indicated by rabbitmqctl suggests that the server isn't running on that machine.
Try going through the steps in RabbitMQ's troubleshooting guide. In particular, pay close attention to the logs. Has the server crashed for some reason? Could you post the logs somewhere?
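If the service claims to have started but rabbitmqctl still reports nodedown, the broker's own log usually says why it exited. A hedged sketch, assuming the default Windows locations (RABBITMQ_BASE defaults to %APPDATA%\RabbitMQ) and with <hostname> as a placeholder for your node name:
rem list the default log directory of a Windows RabbitMQ install
dir "%APPDATA%\RabbitMQ\log"
rem the main broker log is named after the node, e.g. rabbit@<hostname>.log
type "%APPDATA%\RabbitMQ\log\rabbit@<hostname>.log"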