Exit code non-zero and unable to see output logs - google-cloud-ml

How do I view the stdout/stderr output logs for Cloud ML? I've tried using gcloud beta logging read and also gcloud beta ml jobs stream-logs, and nothing... all I see are the INFO-level logs generated by the system, e.g. "Tearing down TensorFlow".
Also, in the case where I get an error saying the Docker container exited with a non-zero code, it links me to a GUI page that shows the same stuff as gcloud beta ml jobs stream-logs. Nothing that shows me the actual console output my job produced...
Help please??

It may be that the Cloud ML service account does not have permission to write to your project's Stackdriver Logs, or that the Logging API is not enabled on your project.
First check whether the Stackdriver Logging API is enabled for the project by going to the API manager: https://console.cloud.google.com/apis/api/logging.googleapis.com/overview?project=[YOUR-PROJECT-ID]
The Cloud ML service account should automatically be added as an Editor on the project, which allows it to write to the project's logs, but if you have changed your project permissions it may have lost that role. If so, check that you've manually given the Cloud ML service account the Logs Writer (roles/logging.logWriter) role.
If you are unsure of the service account used by Cloud ML, this page has instructions on how to find it: https://cloud.google.com/ml/docs/how-tos/using-external-buckets
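If the role is missing, re-granting it from the command line is a one-liner; here is a minimal sketch, where the service account address is a placeholder (substitute the one you find via the page above):
# Grant the Logs Writer role to the Cloud ML service account (address is a placeholder):
gcloud projects add-iam-policy-binding YOUR-PROJECT-ID \
    --member="serviceAccount:CLOUD-ML-SERVICE-ACCOUNT@YOUR-PROJECT-ID.iam.gserviceaccount.com" \
    --role="roles/logging.logWriter"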

Related

GCP | How can I see all the running virtual machines on a project?

I wrote some code to automate the training procedure on our company's VM instances.
You probably know that sometimes GCP can't provide you with a machine at that moment - the 'out of resource' exception.
So, I'd like to monitor which of my machines successfully turned on and which did not.
If there is some way to show it in BigQuery, that would be great.
Thanks.
Using the Cloud Monitoring (Stackdriver) functionality is a good way to monitor all your VMs.
Here is a detailed guide to implementing Monitoring on a Compute Engine instance.
Hope you find it useful.
You can use Google Cloud's activity logs too:
Activity logging is enabled by default for all Compute Engine projects.
You can see your project's activity logs through the Logs Viewer in the Google Cloud Console:
In the Cloud Console, go to the Logging page.
When in the Logs Viewer, select and filter your resource type from the first drop-down list. From the All logs drop-down list, select compute.googleapis.com/activity_log to see Compute Engine activity logs.
Here is the Official documentation.
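As a rough sketch of how you might check which instances actually came up from the CLI, and optionally route the activity logs into BigQuery for later analysis (the sink and dataset names below are placeholders):
# List the instances that are currently running:
gcloud compute instances list --filter="status=RUNNING"
# Export Compute Engine activity logs to a BigQuery dataset via a log sink
# (create the dataset first and grant the sink's writer identity access to it):
gcloud logging sinks create vm-activity-to-bq \
    bigquery.googleapis.com/projects/YOUR_PROJECT_ID/datasets/vm_activity \
    --log-filter='resource.type="gce_instance"'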

Running Cloud Build trigger via GCP Console returns 'build.service_account' field cannot be set for triggered builds

I am currently using Cloud Build for my Dataflow Flex template to kick off jobs.
Here's my current command:
gcloud beta builds submit --config run.yaml --substitutions _REGION=$REGION \
--substitutions _FMPKEY=$FMPKEY --no-source
Currently this is running fine from Cloud Shell.
But now I want the build to be kicked off based on a trigger.
So I created a Cloud Build trigger that runs this file when a message is published to a topic:
https://github.com/mmistroni/GCP_Experiments/blob/master/dataflow/pipeline/run.yaml
However, after publishing a message to the selected topic, all my builds fail with the following error:
Your build failed to run: generic::invalid_argument:generic::invalid_argument:
'build.service_account' field cannot be set for triggered builds
I cannot see any logs or details, so it's not clear to me what is going on.
I am guessing it has something to do with the last line in my run.yaml?
options:
  logging: CLOUD_LOGGING_ONLY
# Use the Compute Engine default service account to launch the job.
serviceAccount: projects/$PROJECT_ID/serviceAccounts/$PROJECT_NUMBER-compute@developer.gserviceaccount.com
However, I see no option for selecting the service account in Cloud Build. Do I need to set some permissions in IAM?
You are correct with your guess and this is working as intended.
Cloud Build has a default service account to execute builds on your behalf. While GCP allows you to configure user-specific accounts for additional control, it doesn't apply when you're using build triggers. Build triggers only use the default service account to execute builds.
This is documented in GCP docs:
Build triggers use Cloud Build service account to execute builds. This could provide elevated build-time permissions to users who use triggers to start a build. Keep the following security implications in mind when using build triggers ...
Also as part of limitation:
User-specified service accounts only work with manual builds; they don't work with build triggers.
Therefore, you must pass a config yaml without serviceAccount if you plan on using build triggers.
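As a rough sketch (this is not the linked run.yaml, just an illustration of the shape), the trigger-friendly config keeps the options block but has no serviceAccount field at all, so the triggered build falls back to the default [PROJECT_NUMBER]@cloudbuild.gserviceaccount.com account:
options:
  logging: CLOUD_LOGGING_ONLY
# No serviceAccount field here: triggered builds always run as the default Cloud Build service account.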

gcloud builds submit fails while docker push + gcloud run deploy work just fine?

EDIT: The so-called duplicate question was way off since 1. I could push another image, and 2. I could not push a built image. Finally, point #3 is that the solution was totally different and ONLY related to pushing build images via Cloud Build. i.e. I beg to differ; this question WAS different.
Running into some more Google Cloud security stuff. We currently deploy to Cloud Run like so:
docker build . --tag gcr.io/myproject/authservice
docker push gcr.io/myproject/authservice
gcloud run deploy staging-admin --region us-west1 --image gcr.io/myproject/authservice --platform managed
I did the quickstart for Cloud Build but I am getting permission errors. The quickstart I followed is here:
https://cloud.google.com/cloud-build/docs/quickstart-build
The command I ran was
gcloud builds submit --tag gcr.io/myproject/quickstart-image
This is all in the same project, but submitting builds gets this same error over and over and over (I am not sure why it doesn't just exit on the first error).
The push refers to repository [gcr.io/myproject/quickstart-image]
e3831abe9997: Preparing
60664c29ef5a: Preparing
denied: Token exchange failed for project 'myproject'. Caller does not have permission 'storage.buckets.get'. To configure permissions, follow instructions at: https://cloud.google.com/container-registry/docs/access-control
Any ideas how to fix so I can use google cloud build?
Complementing the previous answer: as mentioned in this document, the "Storage Admin" role is necessary to perform actions in Container Registry.
Do you have the "roles/storage.admin" role? If not, add it and try again.
The Cloud Build service account has the format [project_number]@cloudbuild.gserviceaccount.com; please add the role "roles/storage.admin" by following these steps:
Open the Cloud IAM page.
Select your Cloud project.
In the permissions table, locate the row with the email address ending with @cloudbuild.gserviceaccount.com. This is your Cloud Build service account.
Click on the pencil icon.
Select the role you wish to grant to the Cloud Build service account.
Click Save.
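Equivalently from the CLI, granting the role would look something like this (substitute your own project ID and project number):
# Grant Storage Admin to the Cloud Build service account:
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:PROJECT_NUMBER@cloudbuild.gserviceaccount.com" \
    --role="roles/storage.admin"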
BE WARNED: I read the duplicate question post, but in my case:
I can push items;
only the build one is failing, AND the solution I found is different from any of the other question's answers.
This was a VERY weird issue. The storage permission MUST be a red herring, because these permissions fixed the issue.
I found some documentation somewhere (that I can't seem to find again) on a Google GitHub repo about adding these permissions, AND a document on the TWO @cloudbuild.gserviceaccount.com accounts, AND you must add the permissions to the correct one!!!! One is owned by Google and you should not touch it.
In my case, the permission / token exchange failed error was caused by having the storage bucket used by Google Container Registry inside a VPC Service Perimeter.
This can be checked / confirmed via the VPC Service Controls logs - accessible easily from the troubleshooting page.
There is a (very clunky) way to get Cloud Build working to push images to a registry inside a VPC perimeter. It involves running a build worker pool and applying appropriate config + permissions to the perimeter etc.
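If you go that route, creating the private worker pool starts with something like the command below; treat it as a sketch, since the peered network has to be prepared beforehand and the VPC Service Controls ingress/egress rules still need to be configured separately:
# Create a private Cloud Build worker pool peered to your VPC network:
gcloud builds worker-pools create my-private-pool \
    --region=us-west1 \
    --peered-network=projects/myproject/global/networks/my-vpc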

Kubernetes Engine unable to pull image from non-private / GCR repository

I was happily deploying to Kubernetes Engine for a while, but while working on an integrated cloud container builder pipeline, I started getting into trouble.
I don't know what changed. I can not deploy to kubernetes anymore, even in ways I did before without cloud builder.
The pod rollout process gives an error indicating that it is unable to pull from the registry, which seems weird because the images exist (I can pull them using the CLI) and I granted all possibly related permissions to my user and the Cloud Build service account.
I get the error ImagePullBackOff and see this in the pod events:
Failed to pull image
"gcr.io/my-project/backend:f4711979-eaab-4de1-afd8-d2e37eaeb988":
rpc error: code = Unknown desc = unauthorized: authentication required
What's going on? Who needs authorization, and for what?
In my case, my cluster didn't have the Storage read permission, which is necessary for GKE to pull an image from GCR.
My cluster didn't have proper permissions because I created the cluster through terraform and didn't include the node_config.oauth_scopes block. When creating a cluster through the console, the Storage read permission is added by default.
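To see what your nodes are actually allowed to do, you can inspect the cluster's OAuth scopes from the CLI; the read-only storage scope in the comment below is the one GKE needs to pull from GCR (in Terraform, the fix is to include it in node_config.oauth_scopes). The field path is from memory, so treat this as a sketch:
# Show the OAuth scopes configured on the cluster's default node config:
gcloud container clusters describe YOUR_CLUSTER --zone YOUR_ZONE \
    --format="value(nodeConfig.oauthScopes)"
# The output should include:
#   https://www.googleapis.com/auth/devstorage.read_only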
The credentials in my project somehow got messed up. I solved the problem by re-initializing a few APIs including Kubernetes Engine, Deployment Manager and Container Builder.
The first time I tried this I didn't succeed, because to disable something you first have to disable all the APIs that depend on it. If you do this via the GCloud web UI, you'll likely see a list of services that are not all available for disabling in the UI.
I learned that using the gcloud CLI you can list all APIs of your project and disable everything properly.
Things worked after that.
The reason I knew things were messed up, is because I had a copy of the same things as a production environment, and there these problems did not exist. The development environment had a lot of iterations and messing around with credentials, so somewhere things got corrupted.
These are some examples of useful commands:
gcloud projects get-iam-policy $PROJECT_ID
gcloud services disable container.googleapis.com --verbosity=debug
gcloud services enable container.googleapis.com
More info here, including how to restore service account credentials.

Google Compute firewall rules disappear later

I am trying to create some firewall rules in Google Compute Engine. Everything goes well, but some time later they just disappear.
I tried adding rules on the default network, and also on a custom-created one; in both cases the result is the same.
I tried both ways: through the web UI and through the gcloud tool.
If you believe that someone or something is reverting your firewall changes, you can take multiple approaches to verify that:
inspect the Cloud Console Activity logs;
do the same using the CLI: gcloud beta logging read "resource.type=gce_firewall_rule"
check the GCE Operations section in the Cloud Console;
check GCE API requests in Cloud Console Logging, using this advanced filter:
resource.type="gce_firewall_rule"
jsonPayload.event_subtype:"compute.firewalls"
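For example, the same filter can be run from the CLI in one go (the --limit value is just illustrative):
# Read the last 20 firewall-related log entries:
gcloud logging read 'resource.type="gce_firewall_rule" AND jsonPayload.event_subtype:"compute.firewalls"' --limit=20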