Google AI Platform: The replica master 0 exited with a non-zero status of 127 - google-cloud-ml

There's a similar SO question: Tensorflow on ML Engine: The replica master 0 exited with a non-zero status of 1
But here, I'm encountering error "127" instead. As in that question, I launched a PyTorch custom training container on AI Platform (previously ML Engine), and after about 2 minutes I got the error message "The replica master 0 exited with a non-zero status of 127".
The documentation here doesn't quite say what "127" means: https://cloud.google.com/ai-platform/training/docs/troubleshooting#understanding_training_application_return_codes
Does anyone have an idea?

In my case, the problem was that I was using CMD instead of ENTRYPOINT in the Dockerfile.
Use ENTRYPOINT instead, as in this document: Train an ML model with custom containers
#CMD ["python", "trainer/mnist.py"]
# failed -> the replica master 0 exited with a non-zero status of 127
# Try ENTRYPOINT!
ENTRYPOINT ["python", "trainer/mnist.py"]
This may not be the cause in your case, though. It is a good idea to check whether the Dockerfile is the cause 🙂
It may help to compare the sample Dockerfile in the link above with your own Dockerfile.
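One possible explanation, offered as an assumption and not something the answer states: Docker replaces CMD with any arguments supplied at run time, while ENTRYPOINT is kept and the arguments are appended to it. If the training service passes job arguments to the container, a CMD-only image would try to execute the first argument as a program, and Docker reports a command that cannot be found with exit status 127. The sketch below models that rule in Python; `effective_command` and the `--epochs=5` argument are hypothetical names used only for illustration.

```python
from typing import List, Optional


def effective_command(
    entrypoint: Optional[List[str]],
    cmd: Optional[List[str]],
    run_args: Optional[List[str]] = None,
) -> List[str]:
    """Model how Docker combines exec-form ENTRYPOINT and CMD with run-time arguments.

    Run-time arguments replace CMD; ENTRYPOINT is always kept as the prefix.
    """
    tail = run_args if run_args else (cmd or [])
    return list(entrypoint or []) + list(tail)


# Hypothetical job argument, not taken from the question.
job_args = ["--epochs=5"]

with_cmd = effective_command(None, ["python", "trainer/mnist.py"], job_args)
with_entrypoint = effective_command(["python", "trainer/mnist.py"], None, job_args)

print(with_cmd)         # the argument itself becomes the program to execute
print(with_entrypoint)  # the script still runs and receives the argument
```

With CMD the container would try to run `--epochs=5` as a program, which cannot be found; with ENTRYPOINT the training script still starts and receives the argument.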

Related

Why is gcloud builds submit failing after creating the image?

I am learning to deploy a Pub/Sub service on Cloud Run by following the guidelines given here.
Steps I followed are:
Created a new project folder "myProject" on my local machine
Added the files below:
app.js, index.js, Dockerfile
Executed the command below to ship the code:
gcloud builds submit --tag gcr.io/Project-ID/pubsub
It's mentioned in the tutorial document that
Upon success, you should see a SUCCESS message containing the ID, creation time, and image name. The image is stored in Container Registry and can be re-used if desired.
But in my case it returns an error (ref: screenshot).
I have checked the build logs, and they report success.
So I thought to ignore this error and proceed with the next step to deploy the app by running the command:
gcloud run deploy sks-pubsub-cloudrun --image gcr.io/Project-ID/pubsub --no-allow-unauthenticated
When I run this command, it immediately asks me to specify the region from the list (I chose 26).
Next it fails with error:
Deploying container to Cloud Run service [sks-pubsub-cloudrun] in project [Project-ID] region [us-central1]
Deploying new service... Cloud Run error: The user-provided container failed to start and listen on the port defined provided by the PORT=8080 environment variable.
Logs for this revision might contain more information.
As I am new to GCP and to Dockerizing services, I do not understand this issue and have been unable to fix it. I have searched many blogs and articles but found no proper solution for this error.
Any help will be appreciated.
I also tried to run the container locally, and it fails with an error.
I'm using VS Code IDE, and "Cloud Code: Debug on Cloud Run Emulator" to debug the code.
Starting to debug the app using configuration `Cloud Run: Run/Debug Locally` from .vscode/launch.json
To view more detailed logs, go to Output channel : "Cloud Run: Run/Debug Locally - Detailed"
Dependency check started
Dependency check succeeded
Unpausing minikube
The minikube profile 'cloud-run-dev-internal' has been scheduled to stop automatically after exiting Cloud Code. To disable this on future deployments, set autoStop to false in your launch configuration d:\POC\promo_run_pubsub\.vscode\launch.json
Configuring minikube gcp-auth addon
Using GCP project 'Project-Id' with minikube gcp-auth
Failed to configure minikube gcp-auth addon. Your app might not be able to authenticate Google or GCP APIs it calls. The addon has been disabled. More details can be found in the detailed logs.
Update initiated
Deploy started
Deploy completed
Status check started
Resource pod/promo-run-pubsub-5d4cd64bf9-8pf4q status updated to In Progress
Resource deployment/promo-run-pubsub status updated to In Progress
Resource pod/promo-run-pubsub-5d4cd64bf9-8pf4q status updated to In Progress
Resource deployment/promo-run-pubsub status failed with waiting for rollout to finish: 0 of 1 updated replicas are available...
Status check failed
Update failed with error code STATUSCHECK_CONTAINER_TERMINATED
1/1 deployment(s) failed
Skaffold exited with code 1.
Cleaning up...
Finished clean up.
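The deploy error above says the container must listen on the port given in the PORT environment variable. A minimal sketch of that contract follows, written in Python instead of the question's Node.js and not taken from the tutorial. `resolve_listen_address`, `HealthHandler`, and `make_server` are hypothetical names, and binding to 0.0.0.0 is assumed here so that traffic from outside the container can reach the process.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Mapping, Tuple


def resolve_listen_address(env: Mapping[str, str]) -> Tuple[str, int]:
    """Return the (host, port) the container should listen on.

    Falls back to 8080 when PORT is unset.
    """
    return "0.0.0.0", int(env.get("PORT", "8080"))


class HealthHandler(BaseHTTPRequestHandler):
    """Answer every GET with a small 200 response."""

    def do_GET(self) -> None:
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(env: Mapping[str, str] = os.environ) -> HTTPServer:
    """Create (but do not start) an HTTP server bound to the resolved address."""
    return HTTPServer(resolve_listen_address(env), HealthHandler)
```

Calling `make_server().serve_forever()` from the container's start command would keep the process listening. If the application binds to a hard-coded port or exits during startup, the deploy would be expected to fail in the way shown above.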

kubeflow deploy gcp endpoints controller fails

I am trying to deploy kubeflow on gcp using official guidelines https://www.kubeflow.org/docs/distributions/gke/deploy/deploy-cli/
I tried three times, but there seems to be a problem with the endpoints controller. Checking with: kubectl -n kubeflow get all
All pods are running except cloud-endpoints-controller:
NAME                                         READY   STATUS             RESTARTS   AGE
pod/admission-webhook-deployment-667bd68d94  1/1     Running
pod/cache-deployer-deployment-75ccdc98b4     2/2     Running
pod/cache-server-56f78bf64b                  2/2     Running
pod/centraldashboard-5fdbd9b744              1/1     Running
pod/cloud-endpoints-controller-5f7dbc6fc8    0/1     ImagePullBackOff
The pod description says that it failed to resolve reference "gcr.io/cloud-solutions-group/cloud-endpoints-controller:0.2.1": unexpected status code [manifests 0.2.1]: 403 Forbidden
I am new to kubeflow but despite retrying this three times it always results in the same issue.
You can clone the repo and build the image yourself and push it to your container registry.
This is one workaround to fix this until the official image is back.
git clone https://github.com/jlewi/cloud-endpoints-controller.git
cd cloud-endpoints-controller
git checkout 0.2.1
docker build . -t <YOUR DOCKER REGISTRY>/cloud-endpoints-controller:0.2.1
docker push <YOUR DOCKER REGISTRY>/cloud-endpoints-controller:0.2.1
Then use the new image in your pod spec.
An urgent release has been made: https://github.com/kubeflow/gcp-blueprints/releases/tag/v1.4.1. You can now use the v1.4.1 tag for deployment.
---- original -----
Thank you for posting this issue! I have posted a mitigation solution here in https://github.com/kubeflow/gcp-blueprints/issues/343#issuecomment-1028488756. I am planning to fix this issue in the coming release.

DB Migration on a Load-Balanced Cloud Foundry Deployment

I'm deploying an app on cloud foundry. I also run a db migration before the deployment. To do this, my launch command looked like:
./run_migration && ./run_app
That was working well on 1 instance, but now I have 2 instances, so the launch command was changed to:
[ $CF_INSTANCE_INDEX != 0 ] || ./run_migration && ./run_app
This way the migration runs only on instance 0, and this works as well. However, on one occasion the migration failed:
2019-02-12T13:56:45.27+0100 [APP/PROC/WEB/0]OUT Exit status 1
2019-02-12T13:56:45.28+0100 [CELL/SSHD/0]OUT Exit status 0
OK
requested state: started
instances: 2/2
     state      since                    cpu     memory        disk
#0   starting   2019-02-12 01:56:36 PM   0.0%    0 of 1G       0 of 1G
#1   running    2019-02-12 01:56:39 PM   15.8%   93.3M of 1G   249.4M of 1G
So as far as I understand, the push is considered healthy even though only one instance managed to start.
Is there a way to fail the push when not all instances manage to start?
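A note on why instance 0 does not start when the migration fails: in POSIX shells `||` and `&&` have equal precedence and are evaluated left to right, so the launch command groups as `( [ $CF_INSTANCE_INDEX != 0 ] || ./run_migration ) && ./run_app`. The sketch below traces that evaluation in Python. `launch_sequence` is a hypothetical helper, and the migration result is passed in as a flag instead of running anything.

```python
from typing import List


def launch_sequence(instance_index: int, migration_succeeds: bool) -> List[str]:
    """Trace `[ $CF_INSTANCE_INDEX != 0 ] || ./run_migration && ./run_app`.

    The shell evaluates the list left to right as (test || migration) && app.
    """
    ran: List[str] = []

    # Left operand of `||`: the test succeeds on every instance except 0.
    status_ok = instance_index != 0

    # `||` runs its right-hand side only when the left-hand side failed.
    if not status_ok:
        ran.append("./run_migration")
        status_ok = migration_succeeds

    # `&&` runs its right-hand side only when everything so far succeeded.
    if status_ok:
        ran.append("./run_app")

    return ran


print(launch_sequence(0, migration_succeeds=True))   # ['./run_migration', './run_app']
print(launch_sequence(0, migration_succeeds=False))  # ['./run_migration']
print(launch_sequence(1, migration_succeeds=False))  # ['./run_app']
```

On instance 0 a failed migration short-circuits the `&&`, so `./run_app` is never launched there, which matches the `Exit status 1` logged for `APP/PROC/WEB/0`.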
I don't know of a way to do that, but you could always follow up and check after cf push completes.
Run cf app <app> | grep 'instances:' and you should see the number of running instances and the total requested. If they don't match, something's up.
If you're just trying to make your deployment script fail, doing a check like that, while a little more work, should suffice. If there's some other reason, you'll need to add some background to your question so we can all understand the use case better.
Hope that helps!
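The follow-up check can be scripted. Note that in the output quoted in the question the summary line already reads `instances: 2/2` while instance #0 is still `starting`, so the sketch below inspects the per-instance rows instead of the summary. It assumes rows shaped like the ones in the question (`#<index> <state> ...`); `instance_states` and `all_running` are hypothetical helper names, and the `cf app` output format may differ between CLI versions.

```python
import re
from typing import Dict

# A per-instance row starts with "#<index>" followed by the state column.
ROW = re.compile(r"^\s*#(\d+)\s+(\S+)")


def instance_states(cf_app_output: str) -> Dict[int, str]:
    """Map each instance index to its reported state."""
    states: Dict[int, str] = {}
    for line in cf_app_output.splitlines():
        match = ROW.match(line)
        if match:
            states[int(match.group(1))] = match.group(2)
    return states


def all_running(cf_app_output: str) -> bool:
    """True only when at least one instance is listed and all are running."""
    states = instance_states(cf_app_output)
    return bool(states) and all(state == "running" for state in states.values())


# Output copied from the question.
sample = """\
instances: 2/2
state since cpu memory disk
#0 starting 2019-02-12 01:56:36 PM 0.0% 0 of 1G 0 of 1G
#1 running 2019-02-12 01:56:39 PM 15.8% 93.3M of 1G 249.4M of 1G
"""

print(instance_states(sample))  # {0: 'starting', 1: 'running'}
print(all_running(sample))      # False
```

A deployment script could feed the captured text of `cf app <app>` into `all_running` after the push and exit with a non-zero status when it returns False.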

AWS Code Deploy Deployment Failed for shell scripts

I am trying to create a CodeDeploy deployment group using a CloudFormation stack. Every time I run the stack, I get script errors such as "bad interpreter", "rm/ll: command not found", and \r\n errors. I tried converting the shell script files with dos2unix, then zipping them and uploading them to CodeDeploy, but with no success.
Following is the error statement I get in logs:
2018-09-01 10:41:45 INFO [codedeploy-agent(2681)]: [Aws::CodeDeployCommand::Client 200 0.037239 0 retries] put_host_command_complete(command_status:"Failed",diagnostics:{format:"JSON",payload:"{\"error_code\":4,\"script_name\":\"BeforeInstall.sh\",\"message\":\"Script at specified location: BeforeInstall.sh run as user root failed with exit code 127\",\"log\":\"LifecycleEvent - BeforeInstall\\nScript - BeforeInstall.sh\\n[stderr]/usr/bin/env: bash\\r: No such file or directory\\n\"}"},host_command_identifier:"WyJjb20uYW1hem9uLmFwb2xsby5kZXBsb3ljb250cm9sLmRvbWFpbi5Ib3N0Q29tbWFuZElkZW50aWZpZXIiLHsiZGVwbG95bWVudElkIjoiQ29kZURlcGxveS91cy1lYXN0LTEvUHJvZC9hcm46YXdzOnNkczp1cy1lYXN0LTE6OTkzNzM1NTM2Nzc4OmRlcGxveW1lbnQvZC05V0kzWk5DNlYiLCJob3N0SWQiOiJhcm46YXdzOmVjMjp1cy1lYXN0LTE6OTkzNzM1NTM2Nzc4Omluc3RhbmNlL2ktMDk1NGJlNjk4OTMzMzY5MjgiLCJjb21tYW5kTmFtZSI6IkJlZm9yZUluc3RhbGwiLCJjb21tYW5kUG9zaXRpb24iOjMsImNvbW1hbmRBdHRlbXB0IjoxfV0=")
2018-09-01 10:41:45 ERROR [codedeploy-agent(2681)]: InstanceAgent::Plugins::CodeDeployPlugin::CommandPoller: Error during perform: InstanceAgent::Plugins::CodeDeployPlugin::ScriptError - Script at specified location: BeforeInstall.sh run as user root failed with exit code 127 - /opt/codedeploy-agent/lib/instance_agent/plugins/codedeploy/hook_executor.rb:173:in `execute_script'
......
......
2018-09-01 10:41:45 INFO [codedeploy-agent(2681)]: [Aws::CodeDeployCommand::Client 200 0.018288 0 retries] put_host_command_complete(command_status:"Failed",diagnostics:{format:"JSON",payload:"{\"error_code\":5,\"script_name\":\"\",\"message\":\"Script at specified location: BeforeInstall.sh run as user root failed with exit code 127\",\"log\":\"\"}"},host_command_identifier:"WyJjb20uYW1hem9uLmFwb2xsby5kZXBsb3ljb250cm9sLmRvbWFpbi5Ib3N0Q29tbWFuZElkZW50aWZpZXIiLHsiZGVwbG95bWVudElkIjoiQ29kZURlcGxveS91cy1lYXN0LTEvUHJvZC9hcm46YXdzOnNkczp1cy1lYXN0LTE6OTkzNzM1NTM2Nzc4OmRlcGxveW1lbnQvZC05V0kzWk5DNlYiLCJob3N0SWQiOiJhcm46YXdzOmVjMjp1cy1lYXN0LTE6OTkzNzM1NTM2Nzc4Omluc3RhbmNlL2ktMDk1NGJlNjk4OTMzMzY5MjgiLCJjb21tYW5kTmFtZSI6IkJlZm9yZUluc3RhbGwiLCJjb21tYW5kUG9zaXRpb24iOjMsImNvbW1hbmRBdHRlbXB0IjoxfV0=")
What can be the possible reason for failing?
The logs indicate that there is some problem with your scripts, specifically BeforeInstall.sh. Something in that script is failing with an exit code of 127. I would recommend adding logs to that script to see where it's actually failing. Once you identify the command that's failing, you can see what exit code 127 means for that particular command.
If you want help debugging that particular script, you should open another question and provide the script, including the logs from when it gets run.
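The stderr captured in the log above reads `/usr/bin/env: bash\r: No such file or directory`. The trailing `\r` suggests that the shebang line of BeforeInstall.sh ends with a Windows-style CRLF, so `env` looks for a program literally named `bash\r` and exits with 127, the conventional "command not found" status. The sketch below detects and normalises such line endings before the bundle is zipped. The helper names and the sample script are hypothetical, and this is one possible cause, not a confirmed diagnosis, since the question says dos2unix was already tried.

```python
def has_crlf_shebang(script: bytes) -> bool:
    """True when the first line is a shebang that ends with a carriage return."""
    first_line = script.split(b"\n", 1)[0]
    return first_line.startswith(b"#!") and first_line.endswith(b"\r")


def to_unix_line_endings(script: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving other bytes untouched."""
    return script.replace(b"\r\n", b"\n")


# Hypothetical script content with Windows line endings.
windows_script = b"#!/usr/bin/env bash\r\nrm -f /tmp/example\r\n"

print(has_crlf_shebang(windows_script))                        # True
print(has_crlf_shebang(to_unix_line_endings(windows_script)))  # False
```

If the check still reports True for the file inside the uploaded archive, the conversion may be getting undone somewhere between the working copy and the zip, for example by a version-control line-ending setting.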
A note on CodeDeploy lifecycle hooks
In your case, your BeforeInstall script is failing, which is the script that gets deployed with your application. However, if it had been your ApplicationStop script that was failing, it's important to understand that ApplicationStop uses scripts from the last successful deployment, so if the last successful deployment had a faulty script, it can cause future deployments to fail until these steps are followed.

hyperledger fabric 1.0 setup

I'm trying to set up Hyperledger Fabric 1.0 as per the document at https://hyperledger-fabric.readthedocs.io/en/latest/getting_started.html on a Mac.
I have completed everything up to the step "Generate the channel configuration artifact". After setting the ARCH_TAG environment variable, I run the following command:
CHANNEL_NAME=<CHANNEL_NAME> docker-compose -f docker-compose-no-tls.yaml up
with the "CHANNEL_NAME" substituted with my channel name.
I get the following error :
After 5 attempts, Peer 0 has failed to join the channel. Failed to execute End-2 End Scenario.
Can you please help me with this?