Gahp server (failure issues) exited with status 1 unexpectedly - CentOS 7

I am working on a web-based tool (named cloudcopasi) which takes jobs from a user and submits them to Bosco resources (compute nodes). I am using Bosco (Condor 8.8.12) on Linux CentOS 7. The web interface allows a user to add a Bosco pool, which the user can then use to submit jobs. However, when I try to submit a job, it fails. I also tried to test the pool by using the following command:
bosco_cluster --test
It gives me the following GAHP error:
…..
Testing bosco submission...Passed!
Submission and log files for this job are in /home/cloudcopasi/bosco/local.bosco/bosco-test/boscotest.LTA07r
Waiting for jobmanager to accept job...Passed
Checking for submission to remote slurm cluster (could take ~30 seconds)...Failed
Showing last 5 lines of logs:
01/06/21 13:34:03 [3800] Gahp Server (pid=3815) exited with status 1 unexpectedly
01/06/21 13:34:08 [3800] gahp server not up yet, delaying ping
01/06/21 13:34:08 [3800] No jobs left, shutting down
01/06/21 13:34:08 [3800] Got SIGTERM. Performing graceful shutdown.
01/06/21 13:34:08 [3800] **** condor_gridmanager (condor_GRIDMANAGER) pid 3800 EXITING WITH STATUS 0
I am not sure what I am missing, and I don't understand how to solve this "Gahp server" issue.
Any help is highly appreciated.
Thank you.

This is probably an ssh failure (network, authentication, or authorization). Bosco runs the following command to access the remote cluster submit host:
<sbin>/remote_gahp <user>#<hostname> batch_gahp
You can run it on the command line to get more details about what's going wrong. remote_gahp is a bash script, so you can dig in further, if necessary.
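For example (a sketch, assuming Bosco is installed under ~/bosco; the sbin path and the remote login are placeholders to replace with your own values):
~/bosco/sbin/remote_gahp <user>#<hostname> batch_gahp
If that fails, test the underlying ssh connection with the key that bosco_cluster set up (~/.ssh/bosco_key.rsa is the usual default, but check your installation):
ssh -i ~/.ssh/bosco_key.rsa <user>@<hostname>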

Related

Eris blockchain - Monax getting error while deploying smart contracts

I am new to blockchain platforms and Eris. I am trying to get a private blockchain up and running on my Mac from the guide here:
https://monax.io/docs/tutorials/getting-started/
Everything went fine until the deployment of a smart contract. While performing the "eris pkgs do" command, I get the error below.
Performing action. This can sometimes take a wee while
Error connecting to node (tcp://chain:46657) to get chain id: Post http://chain:46657: dial tcp 198.105.244.228:46657: getsockopt: connection refused
Could not perform pkg action service: Could not perform pkg action: Container interactive-73d789a8-4693-4a4c-bcf2-ae2005a12d23 exited with status 1
Update:
I am now able to get past this error. I followed the Monax tutorial for Docker machines.
This took me to the compiler error (error scenario 6 in the getting started section).
Now I am getting the error below. The IP address is taken from the active machine by issuing the command "docker-machine ls".
GinguVjs-MacBook-Pro:idi ginguvj$ eris pkgs do --chain simplechain --address $addr --compiler 192.168.99.101:2376
Performing action. This can sometimes take a wee while
Executing Job defaultAddr
Executing Job setStorageBase
Executing Job deployStorageK
failed to send HTTP request Post 192.168.99.101:2376: unsupported protocol scheme ""
Error compiling contracts: Compilers error:
Post 192.168.99.101:2376: unsupported protocol scheme ""
Could not perform pkg action service: Could not perform pkg action: Container interactive-671e81dc-4a1b-4e1e-b1ad-b51d955297b1 exited with status 1
GinguVjs-MacBook-Pro:idi ginguvj$
The compilation issue was resolved by adding "-z" to the end:
eris pkgs do --chain simplechain --address $addr -z

Spark shuts down after 10 seconds of running

I'm trying to set up clusters in my AWS account (Amazon). I followed this tutorial to set it up. I've run into some problems regarding ports, but I finally got it to work until... it shut down after 10 seconds, giving me nothing more than this error:
16/05/12 12:52:46 INFO client.AppClient$ClientActor: Connecting to master spark://ip-to-my-machine:7077...
16/05/12 12:53:06 INFO client.AppClient$ClientActor: Connecting to master spark://ip-to-my-machine:7077...
16/05/12 12:53:26 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
16/05/12 12:53:26 ERROR scheduler.TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
This was the command I ran to make it work:
bin/spark-shell --master spark://ip-to-my-machine:7077
I opened TCP port 7077. What could be the problem?
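As a quick sanity check (a sketch, assuming the master really is at ip-to-my-machine), verify that something is listening on 7077 and reachable from the machine running the shell:
nc -zv ip-to-my-machine 7077
and, on the master itself, confirm the port and the bind address:
netstat -tlnp | grep 7077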

Spark cluster fails after running and no exception thrown

I'm trying to run a stand-alone Spark application on EC2 under YARN, from the command line. I'm submitting the following spark-submit script:
./bin/spark-submit --class PageRankGraphX --master yarn-cluster --properties-file spark-defaults.conf.2 --executor-memory 2G --total-executor-cores 5 ./SparkPageRank-assembly-1.0.jar s3://linkfilefull/full/links_small.txt s3://conansoutputbucket/smalloutput.txt 10 0.15 2
This is the output - there is no exception or error thrown, the job simply fails after running:
15/04/15 21:27:03 INFO yarn.Client: Application report from ASM:
application identifier: application_1429126831428_0027
appId: 27
clientToAMToken: null
appDiagnostics:
appMasterHost: ip-172-31-1-67.eu-west-1.compute.internal
appQueue: default
appMasterRpcPort: 0
appStartTime: 1429133214320
yarnAppState: RUNNING
distributedFinalState: UNDEFINED
appTrackingUrl: http://172.31.10.227:9046/proxy/application_1429126831428_0027/
appUser: hadoop
15/04/15 21:27:04 INFO yarn.Client: Application report from ASM:
application identifier: application_1429126831428_0027
appId: 27
clientToAMToken: null
appDiagnostics:
appMasterHost: ip-172-31-1-67.eu-west-1.compute.internal
appQueue: default
appMasterRpcPort: 0
appStartTime: 1429133214320
yarnAppState: FINISHED
distributedFinalState: FAILED
appTrackingUrl: http://172.31.10.227:9046/proxy/application_1429126831428_0027/A
appUser: hadoop
Does anyone know what could be causing this or how I could investigate? When I try to access the yarn logs, it says logs are disabled or not ready.
Check out Amazon's documentation on enabling access to the web UI of Hadoop. Once in the UI, you can check the stderr output for the application, where the exception will most likely be. As others mentioned, this log will also be available on S3.
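If log aggregation is enabled on the cluster, the application logs can also be pulled from the command line (a sketch, reusing the application ID from the report above):
yarn logs -applicationId application_1429126831428_0027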

Confd error: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

While debugging, I realised that confd doesn't pick up the keys, and my journal looks like this:
Sep 18 18:31:50 ip-10-171-54-76.ec2.internal docker[24891]: [nginx] waiting for confd to refresh nginx.conf
Sep 18 18:31:56 ip-10-171-54-76.ec2.internal docker[24891]: 2014-09-18T18:31:56Z 9122c7a54edc confd[9572]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
I used nsenter to log in to the running container to run some experiments for debugging purposes. I ran this command:
confd -onetime -node 172.17.42.1:4001 -config-file /etc/confd/conf.d/nginx.toml
and received the same error as above:
confd[12894]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
I am totally clueless at this point. I am using EC2 with the stable version of CoreOS and I am sure that etcd is running on the host. Also, I can ping the host from inside the container successfully.
Any ideas on what's wrong?
Assistance will be much appreciated.
This error indicates that your etcd cluster isn't operating correctly, so confd has nothing to watch. It has probably lost quorum. The logs (journalctl -u etcd) should indicate what happened.
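A quick way to confirm that (a sketch, assuming etcd should be answering on the docker bridge address used above):
curl -L http://172.17.42.1:4001/version
curl -L http://172.17.42.1:4001/v2/keys/
If either call hangs or is refused, check on the CoreOS host why the cluster is unhealthy:
journalctl -u etcd -n 50 --no-pager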

Problems getting RabbitMQ and Django-Celery Running: Target Machine actively refused connection

I am trying to get Django-Celery running on my Django app. I cannot get the worker server to run. When I try, I get the message: No Connection could be made because the target machine actively refused it
Here is what I have done so far. First, I installed the django celery package: http://pypi.python.org/pypi/django-celery
I can load it into Python without problems. I also installed the RabbitMQ server per the Windows install instructions: http://www.rabbitmq.com/install.html#windows
Starting with the Python tutorials on the RabbitMQ site, I saw the need to install pika: http://pypi.python.org/pypi/pika. It imports without any problems.
From there I start the RabbitMQ server by running this at the command line: rabbitmq-service start
I get the message back that Service RabbitMQ started
Here is where I start to have problems.
I attempted the first steps in django-celery: http://packages.python.org/django-celery/getting-started/first-steps-with-django.html and the "hello world" example on the RabbitMQ site: http://www.rabbitmq.com/tutorials/tutorial-one-python.html
In both cases I get the message: No Connection could be made because the target machine actively refused it
My first thought was that this sounded like a firewall problem, so I went into the Windows 7 firewall and added inbound and outbound rules to open the local and remote ports 5672 and 5673 for the TCP protocol, but I still get the same error message.
When I run rabbitmqctl status I get the message:
Error: unable to connect to node 'rabbit@hostname': nodedown
diagnostics:
- nodes and their ports on hostname: [{rabbitmqctl18856, 505031}]
Does that mean that it is trying to operate on those ports? What about the default 5672?
Any suggestions?
UPDATE: This was actually a problem resulting from several failed RabbitMQ installs conflicting with the latest installation. If you have to remove RabbitMQ, use the 'rabbitmq-service remove' command and not SC DELETE, which caused a lot of problems for me; I had to go in and clean up my Windows registry.
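For reference, the clean reinstall sequence looks roughly like this (a sketch, run from an elevated command prompt with rabbitmq-service.bat on the PATH):
rabbitmq-service stop
rabbitmq-service remove
rabbitmq-service install
rabbitmq-service start
rabbitmqctl status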
The nodedown error indicated by rabbitmqctl suggests that the server isn't running on that machine.
Try going through the steps in RabbitMQ's troubleshooting guide. In particular, pay close attention to the logs. Has the server crashed for some reason? Could you post the logs somewhere?
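On a default Windows install the logs usually live under %APPDATA%\RabbitMQ\log (the exact file name includes your node name), for example:
dir %APPDATA%\RabbitMQ\log
type "%APPDATA%\RabbitMQ\log\rabbit@<hostname>.log"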