When I try to do a gcloud components update from my local machine, the process seems to hang. How do I update the Google Cloud SDK? - google-cloud-platform

I tried installing a new version, however that also takes me down the components update route.
The components update route seems to hang forever at "creating update staging area".

I just waited and it took ~15 minutes. Maybe take a break and come back. Good luck.

If you run 'gcloud components update' while any of the SDK's files are in use,
the update will not finish and gcloud will never complete.
Make sure you don't have other windows open.

Try turning Docker off for the duration of the update.
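For example, a minimal sequence might look like this (a sketch assuming Docker runs as a systemd service on Linux; on macOS/Windows quit Docker Desktop instead):

    # Close all other terminals/editors that may hold SDK files open, then:
    sudo systemctl stop docker     # stop Docker for the duration of the update
    gcloud components update       # run the update from a single fresh shell
    sudo systemctl start docker    # bring Docker back afterwards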

Related

AWS Sagemaker: Jupyter Notebook kernel keeps dying

I get disconnected every now and then when running a piece of code in Jupyter Notebooks on SageMaker. I usually just restart my notebook and run all the cells again. However, I want to know if there is a way to reconnect to my instance without losing my progress. At the moment, it shows "No Kernel" in the bottom bar, but my file seems active in the kernel sessions tab. Can I recover my notebook's variables and contents? Also, is there a way to prevent future kernel disconnections?
Note that I reverted back to tornado = 5.1.1, which seems to decrease the number of disconnections, but it still happens every now and then.
Often, disconnections are caused by inactivity when a job runs for a long time with no user input. If it's the pre-processing that's taking a long time, you could increase the instance size of the processing job so that it executes faster, or increase the instance count. If you're using EMR, since December 2021 you can run a Spark query directly on the EMR cluster from SageMaker:
https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sagemaker-studio-data-notebook-integration-emr/
There's a blog post here, https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/, which is helpful in getting you up and running.
Please let me know if you need more information, or vote for the answer if it's useful. :-)
For me, a quick solution was to open a Terminal instead, save the notebook file as a Python file, and run it from the terminal within SageMaker.
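A minimal sketch of that workflow, assuming the notebook is called my_notebook.ipynb and jupyter is available in the SageMaker terminal:

    # Convert the notebook to a plain Python script
    jupyter nbconvert --to script my_notebook.ipynb
    # Run it detached so a kernel/browser disconnect doesn't kill it; output goes to a log file
    nohup python my_notebook.py > my_notebook.log 2>&1 &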

Why is my cloud run deploy hanging at the "Deploying..." step?

Up until today, my deploy process has worked fine. Today when I go to deploy a new revision, I get stuck at the "Deploying..." text with a spinning indicator, and it says "One or more of the referenced revisions does not yet exist or is deleted." I've tried a number of different images and flags -- all with the same result.
See Viewing the list of revisions for a service, in order to undo whatever you may have done.
You probably have the wrong project selected if it doesn't recognize any of the revisions.
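For example, you can confirm the active project and check which revisions Cloud Run actually knows about (a hedged sketch; substitute your own service name and region):

    gcloud config get-value project                  # confirm the expected project is selected
    gcloud run revisions list \
        --service=my-service --region=us-central1 \
        --platform=managed                           # list the revisions for the service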
I know I provided scant information, but just to follow up with an answer: it looks like the issue was that I was deploying a revision and then immediately trying to tag it with gcloud alpha run services update-traffic <service_name> --set-tags, which seems to have caused some sort of race where it complained that the revision was not yet deployed and hung indefinitely. Moving the --set-tags step into the gcloud alpha run deploy command seemed to fix it.
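A hedged sketch of what that combined command could look like (the service name, image, region, and tag are placeholders, and depending on your gcloud version the --tag flag may only be available on the alpha/beta track):

    # Deploy and tag the new revision in a single step instead of a separate update-traffic call
    gcloud alpha run deploy my-service \
        --image=gcr.io/my-project/my-image:latest \
        --region=us-central1 \
        --platform=managed \
        --tag=candidate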

What's the gcloud equivalent of appcfg rollback?

The GCP command appcfg has been deprecated. appcfg had an appcfg rollback command to be used when a deployment failed.
What is its equivalent for gcloud (the new command)? I can't find it in the Google Cloud documentation.
More context:
Rolling back in appcfg was not meant for managing the traffic and going back to the previous version. It was used to remove the lock on your deploy.
If you had an unsuccessful deployment, you were not able to deploy anymore. appcfg rollback was used to remove that lock so you could deploy again.
I think there is no direct gcloud equivalent of appcfg rollback. However, I would highly recommend that you consider the "Splitting the traffic" option.
This will let you redirect the traffic from one of your versions to another, even between old versions of your service.
Let's imagine this:
You have version 1 of your service and it works just fine.
A couple of weeks later you decide to deploy a new version: version 2.
However, the deploy fails and your app is completely down. You are losing users and money. Everything is on fire.
You can easily switch the traffic to the trusty version 1 by redirecting 100% of the traffic to it.
Version 2 is out of the game until you deploy a new version.
The advantage of this is that you don't have to wait until a rollback is done; the traffic is automatically redirected to an old version. Additionally, there is the gcloud app services set-traffic command so you can do it via the CLI (sketched below).
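A quick sketch of that CLI route (the service name "default" and the version IDs are placeholders):

    # Send 100% of traffic for the service back to version 1
    gcloud app services set-traffic default --splits=v1=1
    # Or split traffic 90/10 between the old and the new version
    gcloud app services set-traffic default --splits=v1=0.9,v2=0.1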
Hope this is helpful!

Cloud Composer throwing InvalidToken after adding another node

I recently added a few new DAGs to production Airflow and, as a result, decided to scale up the number of nodes in the Composer pool. After doing so I got the error: Can't decrypt _val for key=<KEY>, invalid token or value. This now happens for every single DAG that uses variables. It's not the same key either; it depends on which variables the DAG needs.
I immediately scaled Composer back down to 3 nodes and the problem persisted.
I have tried re-saving all of the Variables, recreating them in the UI (which says they are all valid), and recreating them in the CLI (which lists every single one as invalid).
I have also tried updating the configuration to try to reboot the server, and manually stopping the VM instances.
Composer also doesn't seem to allow updating the Fernet key, so I can't try using a new one. For some reason the permanent key Composer assigned now appears to be invalid.
Is there anything else that can be done to remedy that problem short of recreating the environment?
I managed to fix this problem by adding a new Python package. It seems that adding a package is the only way to really "reboot" the environment. The reboot invalidated all of my variables and connections when it finished, but I was able to just add those back in rather than having to recreate the entire environment.
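For reference, a hedged sketch of how adding (or pinning) a package from the CLI might look; the environment name, location, and package are placeholders:

    # Updating the PyPI package list forces Composer to rebuild the environment image
    gcloud composer environments update my-composer-env \
        --location=us-central1 \
        --update-pypi-package=six==1.12.0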
Heard back about this issue: according to Google, Composer creates a custom image for the environment and passes one to each node, and if that image gets corrupted during scaling, the only way to fix it is to add a new Python package so the image is rebuilt. Incidentally, version 1.3.0 of Composer is much better, as the scheduler is restarted every 10 minutes, which should solve some of the later issues I experienced.

Exporting VM/vAPP in vCloud environment

We have a customised vCloud environment. We are trying to download the vApp image as an OVF file in order to migrate it to another environment. I am following this procedure:
1. Stop the VM.
2. Click the download button in settings.
3. It asks for the download location and the image type (OVA/OVF).
4. It initiates the download.
Now my problem lies in the 4th step. When I click download, it initiates the download and I can see "enabling download" while it happens. After some unpredictable amount of time (maybe 1, 2, 3, or 4 hours) the process fails. I have to repeat the process multiple times (at least 3 to 5 times) before the actual download starts and the VM image is actually copied to disk.
I am not able to predict the actual VM download time, nor why the process fails many times before the actual export starts.
Can someone answer the questions below?
1. Does vCloud enable the download functionality before it allows us to download the VM? If it does, how much time does it take for this functionality to be enabled?
2. Can we enable this functionality beforehand, so that vCloud starts the VM download instantly once I shut down the machine and start the export process?
3. Do you think using a CLI tool like ovftool would make the process faster and prevent it from failing, so that I would know the actual VM download time and we could plan the migration?
From my limited understanding of working with the API and SDK, I do not think #1 is possible... and if it is, it's not straightforward, at least to me.
As for #3: if you are not using the CLI for scripting and automation purposes, yes, it would definitely help.
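As a rough sketch of what an ovftool export could look like (the host, org, vDC, vApp name, and output path are all placeholders, and the exact vcloud:// locator syntax depends on your ovftool and vCloud Director versions):

    # Export a powered-off vApp from vCloud Director straight to a local OVA
    ovftool "vcloud://admin@vcloud.example.com:443?org=MyOrg&vdc=MyVDC&vapp=MyVApp" /backups/myvapp.ova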