Cloud Composer throwing InvalidToken after adding another node

I recently added a few new DAGs to production Airflow and, as a result, decided to scale up the number of nodes in the Composer pool. After doing so I got the error: Can't decrypt _val for key=<KEY>, invalid token or value. This now happens for every single DAG that uses variables. It's not the same key either; it depends on which variables the DAG needs.
I immediately scaled Composer back down to 3 nodes and the problem persisted.
I have tried re-saving all of the Variables, recreating them in the UI (which says they are all valid), and recreating them via the CLI (which lists every single one as invalid).
I have also tried updating the configuration to try to reboot the server, and manually stopping the VM instances.
Composer also seems to remove the ability to update the Fernet key, so I can't try using a new one. For some reason the permanent key Composer has assigned now appears to be invalid.
Is there anything else that can be done to remedy that problem short of recreating the environment?

I managed to fix this problem by adding a new Python package. It seems that adding a package is the only way to really "reboot" the environment. When the reboot finished, all of my variables and connections had been invalidated, but I was able to just add those back in rather than having to recreate the entire environment.
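For reference, a package change that triggers the environment rebuild can be made from the CLI with something along these lines; the environment name, location, and package are placeholders, and the exact flag syntax may differ between gcloud versions:
$ gcloud composer environments update my-environment \
    --location us-central1 \
    --update-pypi-package "tzlocal>=1.0"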
Heard back about this issue: according to Google, Composer creates a custom image for the environment and passes it to each node, and if that image gets corrupted during scaling, the only way to fix it is to add a new Python package so the image is rebuilt. Incidentally, version 1.3.0 of Composer is much better, as the scheduler is restarted every 10 minutes, which should solve some of the other issues I experienced.

Related

Can I configure Google DataFlow to keep nodes up when I drain a pipeline

I am deploying a pipeline to Google Cloud DataFlow using Apache Beam. When I want to deploy a change to the pipeline, I drain the running pipeline and redeploy it. I would like to make this faster. It appears from the logs that on each deploy DataFlow builds up new worker nodes from scratch: I see Linux boot messages going by.
Is it possible to drain the pipeline without tearing down the worker nodes so the next deployment can reuse them?
Rewriting Inigo's answer here:
Answering the original question: no, there's no way to do that. Updating should be the way to go. I was not aware it was marked as experimental (probably we should change that), but the update approach has not changed in the 3 years I have been using Dataflow. As for the special cases where update doesn't work: supposing your feature existed, the workers would still need the new code, so there's not really much to save, and update should work in most of the other cases.
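For context, the update path mentioned above is normally triggered by re-running the (streaming) pipeline with the update flag and the same job name. A rough sketch with the Beam Python SDK, where every name and path is a placeholder:
$ python my_pipeline.py \
    --runner DataflowRunner \
    --project my-project \
    --region us-central1 \
    --temp_location gs://my-bucket/temp \
    --job_name my-streaming-job \
    --update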

GCP cloud run send a request to all running instances

I have a REST API running on Cloud Run that implements a cache, which needs to be cleared maybe once a week when I update a certain property in the database. Is there any way to send an HTTP request to all running instances of my application? Right now my understanding is that even if I send multiple requests and there are 5 instances, they could all go to one instance. So is there a way to do this?
Let's go back to basics:
Cloud Run instances start based on a revision/image.
If you have the above use case, where you have 5 instances running and suddenly need to restart them because restarting resolves your problem (such as clearing/rebuilding the cache), what you need to do is:
Trigger a change in the service/config, so a new revision gets created.
This will automatically replace all your instances on the fly, stopping and relaunching them.
You have a couple of options here, choose which is suitable for you:
If you have your services defined as YAML files, the easiest option is to run the replace service command:
gcloud beta run services replace myservice.yaml
Otherwise, add an environment variable, such as a date that you increment; a change in environment variables means a new config and therefore a new revision (a concrete example follows after this list):
gcloud run services update SERVICE --update-env-vars KEY1=VALUE1,KEY2=VALUE2
As these operations are executed, you will see a new revision created, and your active instances will be replaced on their next request with fresh new instances that will build the new cache.
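As a concrete (hypothetical) example of the second option, passing a timestamp as the value guarantees a new revision on every run; the service name and region below are placeholders:
$ gcloud run services update my-service \
    --region us-central1 \
    --update-env-vars "CACHE_BUST=$(date +%s)"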
You can't directly reach all the active instances; that's the magic (and the tradeoff) of serverless: you don't really know what is running! If you implement a cache on Cloud Run, you need a way to invalidate it.
Either based on duration: when it expires, refresh it.
Or by explicit invalidation. But you can't do that on Cloud Run.
The other way to see this use case is that you have a cache shared between all your instances, and thus you need a shared cache, something like Memorystore. A single Cloud Run instance can invalidate and recreate it, and all the other instances will use it.
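If you go the shared-cache route, the Redis instance itself can be created with something like the command below (the name, size, and region are placeholders); note that Cloud Run typically needs a Serverless VPC Access connector to reach Memorystore's private IP:
$ gcloud redis instances create shared-cache \
    --size=1 \
    --region=us-central1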

Cloud Function build error - failed to get OS from config file for image

I'm seeing this Cloud Build error when I try to deploy a Cloud Function:
"Step #2 - "analyzer": [31;1mERROR: [0mfailed to initialize cache: failed to create image cache: accessing cache image "us.gcr.io/MY_PROJECT/gcf/us-central1/SOME_KEY/cache:latest": failed to get OS from config file for image 'us.gcr.io/MY_PROJECT/gcf/us-central1/SOME_KEY/cache:latest'"
I'm able to build and emulate the cloud function locally, but I can't deploy it due to this error. I was able to deploy just fine until now. I've looked everywhere and I can't find any discussion about this. Anyone know what's going on here?
UPDATE: I deployed a new function 3 days ago and now I can't seem to deploy an update to it. I get the same error. I'm fairly sure this is happening due to the lifecycle rule I set up to ensure I don't keep storing images of functions: Firebase storage artifacts is huge and keeps increasing. This rule is important to keep around because I don't want to pay for unnecessary storage, but it seems like it might be the source of our problem here. Can someone from Google look into this?
I got the same error, even for code that deployed successfully before.
A workaround is to delete the Docker images for the failing Firebase functions inside Container Registry and re-deploying the functions. (The images will be re-created upon deploying.)
The error still occurs sporadically, so I suspect this may be a bug introduced in Firebase's deployment process. Thankfully for now, the workaround above resolves the issue every time the error comes up.
I also encountered the same problem, and solved it by deleting the images in the Container Registry of the Firebase project.
I made a script at the time, and I'll put it here; a rough sketch of what such a script might do is shown after these steps. The usage is as follows. Please use it if you like.
Install the Google Cloud SDK.
Download the Script
Edit CONTAINER_REGISTRY to your registry name. For example: CONTAINER_REGISTRY=asia.gcr.io/project-name/gcf/asia-northeast1
Grant execute permission. - $ chmod +x script.sh
Execute it. - $ sh script.sh
Deploy your functions.
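This is not the original script, but a minimal sketch of the kind of cleanup it performs, assuming the same CONTAINER_REGISTRY placeholder as above:
#!/bin/bash
# Delete every cached Cloud Functions image under the given registry path.
CONTAINER_REGISTRY=asia.gcr.io/project-name/gcf/asia-northeast1

for image in $(gcloud container images list --repository="$CONTAINER_REGISTRY" --format="value(name)"); do
  # Remove each digest of the image along with all of its tags.
  for digest in $(gcloud container images list-tags "$image" --format="value(digest)"); do
    gcloud container images delete "${image}@${digest}" --force-delete-tags --quiet
  done
done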
I've been having the same problem for the last few days and am in contact with support. I had the same log, and in my case it wasn't connected to the artifacts, because the artifacts rebuild themselves automatically on deploy (read below about a subtle case related to the artifacts and how to fix it), but deleting the functions and redeploying solved it for me.
Artifacts auto cleanup
Note that if the artifacts bucket is empty, then the problem is somewhere else.
But if it's not empty, what you can do to resolve any possible problems related to the artifacts auto cleanup is to manually delete the whole "containers" folder in the artifacts bucket, which should solve it. Then just redeploy.
Make sure not to delete the artifacts bucket itself!
Doug from Firebase confirmed in the question you're referring to that removing the artifacts content is safe.
So, here is how to delete it:
Go to the Google Cloud console, select your project -> Storage -> Browser: https://console.cloud.google.com/storage/browser
Select the "artifacts" bucket
Choose "containers" and delete it
If the problem was here, it should work fine after that.
This happens because the deletion rule you refer to in your question checks the "last updated" timestamp of each file, while on redeploy only some files are updated. So the next day the rule deletes some of the files while leaving the others, which leads to an inconsistent state of the bucket. So you just remove everything manually.
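If you prefer the command line to the console steps above, the same folder can be removed with gsutil; the bucket name below is a placeholder (artifacts buckets are usually named along the lines of us.artifacts.PROJECT_ID.appspot.com):
$ gsutil -m rm -r "gs://us.artifacts.my-project.appspot.com/containers"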

Why is my cloud run deploy hanging at the "Deploying..." step?

Up until today, my deploy process has worked fine. Today when I go to deploy a new revision, I get stuck at the Deploying... text with a spinning indicator, and it says One or more of the referenced revisions does not yet exist or is deleted. I've tried a number of different images and flags -- all the same.
See Viewing the list of revisions for a service, in order to undo whatever you may have done.
You probably have the wrong project selected if it does not recognize any of the revisions.
I know I provided scant information, but just to follow up with an answer: it looks like the issue was that I was deploying a revision and then immediately trying to tag it using gcloud alpha run services update-traffic <service_name> --set-tags, which seems to have caused some sort of race: it complained that the revision was not yet deployed and would hang indefinitely. Moving the tag into the gcloud alpha run deploy command seemed to fix it.
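For reference, and assuming the --tag flag is available on the deploy command in your gcloud version, the combined form looks roughly like this (the service, image, region, and tag names are placeholders):
$ gcloud alpha run deploy my-service \
    --image gcr.io/my-project/my-image:latest \
    --region us-central1 \
    --tag candidate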

Cloud9 EC2 instance loses configuration when auto hibernating

I began putting together a project yesterday and decided that I'd like to use Cloud 9 as the development IDE. When I was setting up my dev environment, I selected that I wanted to create a new EC2 instance for the environment (t2.micro) and I put the cost-savings settings as 30 minutes (so that the environment will auto-hibernate after inactivity). I then proceeded to use Cloud 9 as I had in the past, which included some changes such as upgrading the version of Node.js and installing Django. Everything worked great until I went to bed. When I woke up and opened my environment again this morning, the instance was relaunched and none of the changes I made persisted, so I needed to do the updates/installations all over again.
Is there a way I can avoid this without having to turn off auto-hibernate (or is the root issue something else, and if so, how can I address it)? I don't particularly want to waste a bunch of compute time having my instance just sitting there idly, but it's really annoying having to spend a chunk of my morning re-configuring everything that I did yesterday.
Are you setting the default Node version with nvm? If you manually set a Node version with the terminal but don't set it to default, the Node version will only apply to the one terminal session (it won't even persist to a new terminal tab).
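If that is the cause, a minimal example of pinning a default so that new terminal sessions pick it up (the version number here is just an example):
$ nvm install 16
$ nvm alias default 16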