Cloud Datalab fails to launch - google-cloud-platform

I keep getting the following error when trying to launch Cloud Datalab. I tried deleting all of the listed VMs in the project, but it still does not work.
Oct 27 10:50:31 datalab-deploy-main-20151027-10-41-31 startupscript: 10:50 AM Checking if updated app version is serving.
Oct 27 10:50:31 datalab-deploy-main-20151027-10-41-31 startupscript: 10:50 AM Not enough VMs ready (0/1 ready, 1 still deploying). Version: datalab:main.388142264345574525
Oct 27 10:50:31 datalab-deploy-main-20151027-10-41-31 startupscript: ERROR: Not enough VMs ready (0/1 ready, 1 still deploying). Version: datalab:main.388142264345574525
Oct 27 10:50:31 datalab-deploy-main-20151027-10-41-31 startupscript: 10:50 AM Rolling back the update. This can sometimes take a while since a VM version is being rolled back.
Oct 27 10:50:32 datalab-deploy-main-20151027-10-41-31 startupscript: Could not start serving the given version.
Oct 27 10:50:32 datalab-deploy-main-20151027-10-41-31 startupscript: ERROR: (gcloud.preview.app.deploy) Command failed with error code [1]
Oct 27 10:50:32 datalab-deploy-main-20151027-10-41-31 startupscript: Step deploy datalab module failed.

Is this the only Managed VM deployed in your project? If so, try cleaning up the Managed VM storage buckets and give it another try: go to the Developers Console, open Storage > Browser, and find the two buckets whose names start with "vm-config" and "vm-containers". Delete those buckets and try deploying Datalab again.
Sometimes these two buckets are created with permission problems. When that happens, the Managed VM deployment fails because it cannot pull images from Google Cloud Storage.

It could be caused by a permissions issue that may have been fixed recently, so please give it another try. If it still fails with the same error, try the following:
Go to the Developers Console, open Permissions, and under "Service Accounts" find the account [project-id]@appspot.gserviceaccount.com.
Copy the account id somewhere else, since we'll use it later. Then remove the account from the list, make sure it disappears from the list, and add it back with "Can edit" permission.
Wait a few minutes and try deploying Datalab again.
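The bucket cleanup can also be scripted. A minimal sketch, assuming gsutil is installed; the function only prints the delete commands so you can review them before removing anything:

```shell
# Sketch: find the Managed VM buckets ("vm-config..." / "vm-containers...")
# in a project and print a delete command for each, rather than deleting
# immediately. "my-project" below is a placeholder project id.
print_managed_vm_bucket_cleanup() {
  project_id="$1"
  gsutil ls -p "$project_id" | while read -r bucket; do
    case "$bucket" in
      gs://vm-config*|gs://vm-containers*)
        printf 'gsutil -m rm -r %s\n' "$bucket" ;;
    esac
  done
}

# Review the output, then run the printed commands to actually delete:
# print_managed_vm_bucket_cleanup my-project
```

Printing instead of deleting is deliberate: bucket removal is irreversible, so it is worth eyeballing the matches first.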
Let me know how it goes!

Related

"gcloud auth print-access-token" to get refresh token runs slow on my mac os-x

For our project, we use Google Container Registry (gcr.io) to push all our container images.
Our build system pulls the base images from the container registry.
To pull a container image from the registry we use the OAuth2 access-token mechanism, and the build script runs the "gcloud auth print-access-token" command to get the access token.
The following is a sample run of gcloud --verbosity=debug auth print-access-token:
$ date;gcloud --verbosity=debug auth print-access-token;date
Fri Jul 17 10:23:57 PDT 2020
DEBUG: Running [gcloud.auth.print-access-token] with arguments: [--verbosity: "debug"]
< -- Get stuck here for 2 minutes -- >
DEBUG: Making request: POST https://oauth2.googleapis.com/token
INFO: Display format: "value(token)"
<Output Token Here>
Fri Jul 17 10:25:58 PDT 2020
Output from the gcloud config list
[core]
account = <email address>
disable_usage_reporting = False
log_http_redact_token = False
project = <project-id>
Your active configuration is: [default]
After looking at the code for the Google Cloud SDK, I found that it tries to make an HTTP call to http://metadata.google.internal every 10 minutes, and it gets stuck waiting on those calls, since that hostname only resolves from inside Google Compute Engine instances.
Questions:
Is it expected that the gcloud tool makes calls to Google's internal DNS name when I run the utility from my MacBook? (I am new to GCP, so I am happy to share more information about my config if needed.)
Is there a way to avoid the calls to Google's internal DNS for gcloud auth print-access-token commands?
If there is no way to avoid the calls (even though they fail from my Mac), is there a way to reduce the timeout for those calls, or a way to not make them every 10 minutes?
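One workaround often suggested for this class of problem is the NO_GCE_CHECK environment variable, which tells the Google auth libraries to skip the GCE metadata-server probe. Whether the installed gcloud version honors it is an assumption worth testing; a sketch:

```shell
# Sketch: wrap the token fetch so the GCE-detection probe is skipped.
# NO_GCE_CHECK=true is honored by the Google auth libraries; verify that
# your gcloud version respects it by timing both variants.
fetch_token() {
  NO_GCE_CHECK=true gcloud auth print-access-token "$@"
}

# Usage: time fetch_token
# Compare against: time gcloud auth print-access-token
```

If the wrapped call returns in seconds while the plain call hangs for minutes, the metadata probe is confirmed as the culprit.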

How to Connect Secure Shell App to a Google Cloud VM Instance

I would like to connect to a Google Cloud VM instance using Secure Shell App (SSA). I assumed this would be easy, as these are both Google products, and I had no problem connecting SSA to a Digital Ocean Droplet before. I found Google's own documentation for this here, and it looked easy enough to follow. However, the following link in the instructions, Providing public SSH keys to instances, leads down a rabbit hole of confusing and seemingly self-contradictory information. I tried to follow it as best I could but kept running into errors. I have searched in vain for better instructions and am still astounded that Google has made it so hard to connect its own products. Is it really this hard to make this work? Are there any better instructions out there? If not, would someone be willing to write up clear and simple instructions?
Please follow this step by step instruction:
create a new VM instance-1
connect to it with gcloud compute ssh instance-1 (as mentioned by @John Hanley)
check ~/.ssh folder
$ ls -l ~/.ssh
-rw------- 1 user usergroup 1856 Dec 9 17:12 google_compute_engine
-rw-r--r-- 1 user usergroup 417 Dec 9 17:12 google_compute_engine.pub
copy keys
cp ~/.ssh/google_compute_engine.pub mykey.pub
cp ~/.ssh/google_compute_engine mykey
follow instructions from step 7 - create connection and import identity
(optional) if you don't find your mykey in the Identity list, try to connect anyway (it will end with an error, as expected), then restart Secure Shell App and check the Identity menu again (the keys should be there without redoing the import)
After that, I successfully connected to my VM via Secure Shell App.
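The copy steps above can be wrapped in a small helper that fails loudly if gcloud hasn't generated the keypair yet. A sketch; the source paths are the defaults gcloud uses, and "mykey" is just the name used in the steps above:

```shell
# Sketch: copy the gcloud-generated keypair out under a name you can
# import into Secure Shell App. Errors out if the keys don't exist yet
# (run `gcloud compute ssh instance-1` first to generate them).
export_gce_keys() {
  src="${1:-$HOME/.ssh}"
  dst="${2:-.}"
  for f in google_compute_engine google_compute_engine.pub; do
    [ -f "$src/$f" ] || { echo "missing $src/$f" >&2; return 1; }
  done
  cp "$src/google_compute_engine"     "$dst/mykey"
  cp "$src/google_compute_engine.pub" "$dst/mykey.pub"
}
```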

AWS authentication issue while using deployment script

Starting today, I began getting an issue while running deployment scripts from my local VM. I am not sure if it's a known issue or some setup-related thing that is missing on my VM.
Authentication failed while running deploy_one_off:
[vvaibhav@gld2vm40 debesys (topic/Add_FIX_IBDC_Session_Subtype_info_to_pub-DEB-107734)]$ /opt/virtualenv/devws/bin/python2 deploy/chef/scripts/deploy_one_off.py -s gla2vm178 --email vagesh.vaibhav@trade.tt -c dropcopyclientnode -r 5fff2fe00e5082b39fd5a978af7bf38770a95ef9 --request-build --run-chef --override-oneoff
Enter your INTAD username:vvaibhav
Enter password for INTAD user vvaibhav:
Checking if dropcopyclientnode cookbook has a build target...DONE
Oops: Failed to authenticate and retrieve AWS Keys from AWS, the package check will
fail. It is safe to try re-running, but if the problem persists please notify
Deployment and/or ELS. Exception:
Just FYI: this issue was resolved as I worked with Tom this morning. It turned out that some aws-keys files had become stale (zero-length), and we had to delete them.
$ ls -ltr ~/.aws-keys*
-rw-r--r-- 1 vvaibhav sysadmins 0 Jan 7 07:42 /home/vvaibhav/.aws-keys_deb_read
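Since the stale file in the listing above is zero bytes, the cleanup can be limited to empty .aws-keys* files. A sketch; the filename pattern comes from the ls output, and the function takes a directory argument so it can be tried somewhere safe first:

```shell
# Sketch: delete only zero-length .aws-keys* files in the given directory
# (default: $HOME), leaving any non-empty key files alone.
clean_stale_aws_keys() {
  dir="${1:-$HOME}"
  find "$dir" -maxdepth 1 -name '.aws-keys*' -type f -size 0 -delete
}
```

Restricting the match to size 0 means a populated key file is never touched, so re-running after a fresh authentication is harmless.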

Instances created recently by AWS ECS do not have ssh authorized_keys configured

We are using Amazon Elastic Container Service (ECS) to spin up a cluster with autoscaling groups. Until very recently this has been working fine, and generally it is still working fine, except that we are no longer able to connect to the underlying EC2 instances using SSH with our keypair. We get SSH "permission denied" errors, which is relatively new (a matter of weeks), and we have changed nothing. By contrast, we can spin up an EC2 instance directly and have no problem using SSH with the same keypair.
What I have done to investigate:
Drained the ECS cluster, detached the instance from it, and stopped it.
Detached the instance's root volume and attached it to a different EC2 instance.
Observed that /home/ec2-user/.ssh does not exist.
Found the following error in the instance's /var/log/cloud-init.log:
Oct 30 23:23:09 cloud-init[3195]: handlers.py[DEBUG]: start: init-network/config-ssh: running config-ssh with frequency once-per-instance
Oct 30 23:23:09 cloud-init[3195]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-0e13e9da194d2624a/sem/config_ssh - wb: [644] 20 bytes
Oct 30 23:23:09 cloud-init[3195]: helpers.py[DEBUG]: Running config-ssh using lock (<FileLock using file '/var/lib/cloud/instances/i-0e13e9da194d2624a/sem/config_ssh'>)
Oct 30 23:23:09 cloud-init[3195]: util.py[WARNING]: Applying ssh credentials failed!
Oct 30 23:23:09 cloud-init[3195]: util.py[DEBUG]: Applying ssh credentials failed!
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/cloudinit/config/cc_ssh.py", line 184, in handle
ssh_util.DISABLE_USER_OPTS)
AttributeError: 'module' object has no attribute 'DISABLE_USER_OPTS'
Oct 30 23:23:09 cloud-init[3195]: handlers.py[DEBUG]: finish: init-network/config-ssh: SUCCESS: config-ssh ran successfully
Examined the Python source code for /usr/lib/python2.7/site-packages/cloudinit. It looks OK to me; I see the reference in config/cc_ssh.py to ssh_util.DISABLE_USER_OPTS and it looks like ssh_util.py does indeed contain DISABLE_USER_OPTS as a file-level variable. (But I am not a master Python programmer, so I might be missing something subtle.)
Curiously, the compiled versions of ssh_util.py and cc_ssh.py date from October 16, which raises all sorts of red flags, because we had not seen any problems with ssh until recently. But I loaded uncompyle6 and decompiled those files, and the decompiled versions seem to be OK, too.
Looking at cloud-init, it's pretty clear that if the reference to ssh_util.DISABLE_USER_OPTS throws an exception, the .ssh directory won't be configured for ec2-user, so I understand what's happening.
What I don't understand is why? Has anyone else experienced issues with cloud-init with recently-created EC2 instances under ECS, and found a workaround?
For reference, we are using AMI amzn2-ami-ecs-hvm-2.0.20190815-x86_64-ebs (ami-0b16d80945b1a9c7d)
in us-east-1, and we had certainly not seen these issues as far back as August 15. I assume that some cloud-init change that the instance gets via a yum update explains the new behavior and the changed write dates of the compiled Python modules in cloud-init.
I should also add that the EC2 instance I spun up to mount the root volume of the ECS-created instance has subtly-different cloud-init code. In particular, the cc_ssh.py module doesn't refer to ssh_util.DISABLE_USER_OPTS but rather to a local DISABLE_ROOT_OPTS variable. So this is all suspicious.
I have diagnosed this problem in a specific AWS deployment on an Amazon Linux 2 AMI. The root cause is running yum update, which updates cloud-init itself, from user_data that is executed by cloud-init during EC2 instance startup.
The user_data associated with an ECS launch configuration is executed by cloud-init, and our user_data initialization code included a "yum update". Amazon has deployed a new version of cloud-init, 18.5-2amzn2, which is not yet baked into the AMI images (they ship cloud-init 18.2-72-amzn2.07), so the yum update upgrades cloud-init to 18.5-2amzn2. Analysis of the Python code for 18.5-2amzn2 shows that it includes a commit (https://github.com/number5/cloud-init/commit/757247f9ff2df57e792e29d8656ac415364e914d) which adds to ssh_util an attribute not present in the prior version. Ordinarily yum would produce a consistent cloud-init installation, as verified on a standalone EC2 instance. However, because the update happens while cloud-init is already running, the result is inconsistent: the ssh_util module already loaded by the running cloud-init is not updated, so it cannot provide the DISABLE_USER_OPTS value added in that commit.
So, the problem was indeed the yum-update command invoked from within cloud-init, which was updating cloud-init itself while in use.
I should point out that we were using Amazon EFS on our nodes, and were following the exact instructions that Amazon specifies on their help page for using EFS with ECS, which include the yum-update call in the user data script.
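Given that diagnosis, one mitigation (my sketch, not an Amazon-documented fix) is to keep the EFS setup in user_data but exclude cloud-init from the in-flight update, so the package providing the currently executing code is never replaced mid-run; --exclude is a standard yum option:

```shell
# Sketch: update packages from user_data without replacing the cloud-init
# that is executing this very script. Everything else still gets updated;
# cloud-init itself can be updated later, outside of boot.
safe_boot_update() {
  yum update -y --exclude=cloud-init "$@"
}
```

cloud-init then stays at the AMI's baked-in version for the lifetime of the boot, which is exactly the consistency the failing instances were missing.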

google cloud aspnetcore default builder yaml missing

Starting Friday afternoon last week, I'm suddenly unable to deploy to GCP for my project and I receive the following error:
...
Building and pushing image for service [myService]
ERROR: (gcloud.beta.app.deploy) Could not read [<googlecloudsdk.api_lib.s
storage_util.ObjectReference object at 0x045FD130>]: HttpError accessing
//www.googleapis.com/storage/v1/b/runtime-builders/o/gs%3A%2F%2Fruntime-b
%2Faspnetcore-default-builder-20170524113403.yaml?alt=media>: response: <
s': '404', 'content-length': '9', 'expires': 'Mon, 05 Jun 2017 14:33:42 G
ary': 'Origin, X-Origin', 'server': 'UploadServer', 'x-guploader-uploadid
B2UpOw2hMicKUV6j5FRap9x4UKxxZsb04j9JxWA_kc27S_AIPf0QZQ40H6OZgZcLJxCnnx5m4
8x3JV3p9kvZZy-A', 'cache-control': 'private, max-age=0', 'date': 'Mon, 05
17 14:33:42 GMT', 'alt-svc': 'quic=":443"; ma=2592000; v="38,37,36,35"',
t-type': 'text/html; charset=UTF-8'}>, content <Not Found>. Please retry.
I tried again this morning and even updated my gcloud components to version 157. I continue to see this error.
Item of note: the 20170524113403 value in that YAML filename is, I think, a match for the first successful deploy to .NET App Flex for my project. I had since deleted that version using the Google Cloud Explorer, with a more recent version 'published' early Friday morning. My publish worked Friday morning; now it doesn't. I don't see any logs that help me understand why that file is even needed, and an Agent Ransack search of my entire drive doesn't reveal where that filename is coming from, so I can't point it to a more recent version.
I'm doing this through both Google Cloud Tools integrated into my Visual Studio 2017 (Publish to Google Cloud...) as well as running the command lines:
dotnet restore
dotnet publish -c Release
copy app.yaml -> destination location
gcloud beta app deploy .\app.yaml in destination location
Not sure if this was "fixed" by Google or not, but four days later the problem went away. As additional information, I was able to locate the logs on my machine from the publish phase and saw something interesting.
When Working:
2017-05-25 09:36:48,821 DEBUG root Calculated builder definition using legacy version [gs://runtime-builders/aspnetcore-default-builder-20170524113403.yaml]
Later when it stopped working:
2017-06-02 15:25:15,312 DEBUG root Resolved runtime [aspnetcore] as build configuration [gs://runtime-builders/gs://runtime-builders/aspnetcore-default-builder-20170524113403.yaml]
What I noticed was the duplicated "gs://runtime-builders/gs://runtime-builders/..." prefix,
which disappeared this morning; no, I didn't change a thing besides waiting until today.
2017-06-07 08:11:37,042 INFO root Using runtime builder [gs://runtime-builders/aspnetcore-default-builder-20170524113403.yaml]
You'll see that double "gs://runtime-builders/gs://runtime-builders/" is GONE.
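The doubled prefix in the broken log line looks like a classic path-join bug: a base prefix prepended to a value that is already a full gs:// URL. This sketch is my illustration of that bug class, not the actual SDK code; it shows the defensive version of the join:

```shell
# Sketch: join a builder reference onto the runtime-builders bucket, but
# only when the reference isn't already a full gs:// URL. A naive
# unconditional join would produce the doubled
# "gs://runtime-builders/gs://runtime-builders/..." path seen in the
# 2017-06-02 log line.
join_builder_path() {
  base="gs://runtime-builders"
  ref="$1"
  case "$ref" in
    gs://*) printf '%s\n' "$ref" ;;           # already absolute: leave it
    *)      printf '%s/%s\n' "$base" "$ref" ;; # bare filename: prepend base
  esac
}
```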