Why can Cloud Composer 2 not connect to Redis during environment creation? - google-cloud-platform

I have tried multiple times to create a new Cloud Composer 2 environment. Creating a Cloud Composer 1 environment worked completely fine, but this keeps failing. From what I can see in the logs, it appears that airflow-worker cannot connect to Redis. I have already made sure the Cloud Composer service account has sufficient rights. What else could be the cause of this?

Related

Can't create Cloud Composer environment on GCP

I'm trying to create a Cloud Composer environment to run Airflow on GCP. However, I keep getting this error:
CREATE operation on this environment failed x minutes ago with the following error message:
Composer Backend timed out. Currently running tasks are [stage: CP_COMPOSER_AGENT_RUNNING
description: "No agent response published."
response_timestamp {
seconds: 1631717057
nanos: 229000000
}
].
Does anybody know how to solve it?
This is a known insufficient permission issue:
When creating a Cloud Composer environment, you specify a service account that runs the environment's GKE nodes. If this service account does not have enough permissions for the requested operation, Cloud Composer outputs that same error message.
The solution is to assign roles both to your account and to the service account of your environment, as described in Access control.
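For reference, a minimal sketch of those bindings with the gcloud CLI; the project ID, service account email and user email below are placeholders for your own values:
# Composer Worker role for the environment's service account
gcloud projects add-iam-policy-binding example-project \
    --member="serviceAccount:composer-env-sa@example-project.iam.gserviceaccount.com" \
    --role="roles/composer.worker"
# Environment management role for your own account
gcloud projects add-iam-policy-binding example-project \
    --member="user:you@example.com" \
    --role="roles/composer.environmentAndStorageObjectAdmin"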

Pods can't pull image from GCR after configuring google cloud sql proxy

I have a simple application (REST APIs based on Python and Flask) that works well on Google Kubernetes Engine (GKE). My CI/CD setup creates a Docker image, pushes it to Google Container Registry (GCR) and then deploys it to GKE. Everything works well.
Now, I added a database. It will be hosted on Google Cloud SQL. To access the database from Kubernetes, I'm using the Google Cloud SQL proxy (as a sidecar) and Workload Identity, as recommended by Google.
My problem is, after configuring cloud sql proxy, I'm getting this error:
ImagePullBackOff: Cannot pull image 'gcr.io/xxx-project/xxx-image:xxx-tag' from the registry.
The Cloud SQL proxy image is pulled correctly (I think because it's hosted in a public registry), but not my image, so the pod keeps crashing.
Did I miss something? Should I add Docker credentials? It's weird because it was working before setting up the Cloud SQL proxy!
Many thanks for your help,
Best regards
I think there's something important to understand here: Autopilot doesn't use Workload Identity or anything to do with the pod's permissions to pull images. It uses the default compute service account for your project.
It is the nodes that need permission to pull images, not the pods. See this note from the GCP documentation on Workload Identity.
Note: Even with Workload Identity enabled, GKE still uses the configured Google Service Account for the node pool to pull container images from the image registry. If you encounter ImagePullBackOff or ErrImagePull errors, check the troubleshooting documentation.
I had the same thing happen to me, and it turned out that the default compute service account had been deleted. I restored it (using these instructions: Deleted Compute Engine default service account), gave it storage.admin permissions, and that resolved the issue.
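For illustration, the binding would look roughly like this; the project ID and project number are placeholders, and roles/storage.objectViewer is the narrower option if all the nodes need is image pulls:
# The default compute SA is PROJECT_NUMBER-compute@developer.gserviceaccount.com
gcloud projects add-iam-policy-binding example-project \
    --member="serviceAccount:123456789012-compute@developer.gserviceaccount.com" \
    --role="roles/storage.admin"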

Where to keep the Dataflow and Cloud Composer Python code?

It is probably a silly question. In my project we'll be using Dataflow and Cloud Composer. For that I asked for permission to create a VM instance in the GCP project to keep both the Dataflow and Cloud Composer Python programs. But the client asked me the reason for creating a VM instance and told me that you can run Dataflow without a VM instance.
Is that possible? If yes, how can I achieve it? Can anyone please explain it? It'll be really helpful to me.
You can run Dataflow pipelines or manage Composer environments from your own computer once your credentials are authenticated and you have both the Google Cloud SDK and the Dataflow Python library (Apache Beam) installed. However, this depends on how you want to manage your resources. I prefer to use a VM instance so that all the resources I use are in the cloud, where it is easier to set up VPC networking between the different services. Also, saving data from a VM instance into GCS buckets is usually faster than from an on-premise computer/server.
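As a rough sketch of the local setup (the project, region and bucket names below are placeholders), assuming the Google Cloud SDK is already installed:
# Install the Apache Beam SDK with the GCP extras
pip install "apache-beam[gcp]"
# Authenticate your local machine
gcloud auth application-default login
# Launch the bundled wordcount example on the Dataflow service
python -m apache_beam.examples.wordcount \
    --runner DataflowRunner \
    --project example-project \
    --region us-central1 \
    --temp_location gs://example-bucket/tmp/ \
    --output gs://example-bucket/results/output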

Terraform Google Cloud: Executing a Remote Script on a VM

I'm trying to execute a script on a Google VM through Terraform.
First I tried it via Google startup scripts. But since the metadata is visible in the Google Console (startup scripts count as metadata), anybody with read access can see that script, which is not acceptable.
So I tried to get the script from a storage bucket. But for that I need to attach a service account to the VM so the VM has the rights to access the bucket. Now people that have access to the VM also have access to my script as long as the service account is attached to the VM. In order to "detach" the service account I would have to stop the VM. Also, if I don't want to keep the service account permanently attached, I would have to attach it via a script, which requires another stop and start of the VM. This is probably not possible and also really ugly.
I also don't understand how the remote-exec provisioner works on GCP VMs, because I have to specify a user and a password to connect to the VM and then execute the script. But the Windows password needs to be set manually via the Google Console, so I can't specify those things at this point in time.
So does anybody know how I can execute a script via Terraform without everybody having access to it?
Greetings :) and Thanks in advance
I ended up just running a gcloud script in which I removed the metadata from the VM after the Terraform apply had finished. In my GitLab pipeline I just called the script in the after_script section. Unfortunately the credentials are visible for approximately 3 minutes.
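For illustration, that cleanup step can be a single gcloud call; the VM name, zone and metadata key below are placeholders (the key is startup-script for Linux VMs or windows-startup-script-ps1 for Windows VMs):
gcloud compute instances remove-metadata example-vm \
    --zone europe-west1-b \
    --keys startup-script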

Using a plugin in Google Composer makes it crash

I wrote a small plugin for Apache Airflow, which runs fine on my local deployment. However, when I use Google Composer, the user interface hangs and becomes unresponsive. Is there any way to restart the webserver in Google Composer?
(Note: this answer is more a set of suggestions than a finalized solution.)
As far as restarting the webserver goes...
What doesn't work:
I reviewed the Airflow Web Interface page in the docs, which describes using the webserver but not accessing it from a CLI or restarting it.
While you can also run Airflow CLI commands on Composer, I don't see a command for restarting the webserver in the Airflow CLI today.
I checked the gcloud CLI in the Google Cloud SDK but didn't find a restart-related command.
Here are a few ideas that may work for restarting the Airflow webserver on Composer:
In the gcloud CLI, there's an update command to change environment properties. I would assume that it restarts the scheduler and webserver (in new containers) after you change one of these to apply the new setting. You could set an arbitrary environment variable to check, but just running the update command with no changes may work.
gcloud beta composer environments update ...
Alternatively, you can update environment properties excluding environment variables in the GCP Console.
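For the CLI route, a fuller (hedged) version of that update command, with a placeholder environment name and location and an arbitrary variable set just to force a change:
gcloud composer environments update example-environment \
    --location us-central1 \
    --update-env-variables=RESTART_MARKER=$(date +%s)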
I think re-running the import plugins command would cause a scheduler/webserver restart as well.
gcloud beta composer environments storage plugins import ...
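Spelled out with placeholder environment, location and plugin file names, that would look something like:
gcloud composer environments storage plugins import \
    --environment example-environment \
    --location us-central1 \
    --source ./my_plugin.py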
In a more advanced setup, Composer supports deploying a self-managed Airflow webserver. Following the linked guide, you can connect to your Composer environment's GKE cluster, create Deployment and Service Kubernetes configuration files for the webserver, and deploy both with kubectl create. Then you could run kubectl replace or kubectl delete on the pod to trigger a fresh start.
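A sketch of that last option, assuming you labelled your self-managed webserver Deployment app=airflow-webserver and have already fetched cluster credentials with gcloud container clusters get-credentials:
# Deleting the pod makes the Deployment spin up a fresh webserver
kubectl get pods -l app=airflow-webserver
kubectl delete pod -l app=airflow-webserver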
This all feels like a bit much, so hopefully documentation or a simpler way to achieve webserver restarts emerges to supersede these workarounds.