How to restart webserver in Cloud Composer - google-cloud-platform

We recently met a known issue on airflow:
Airflow "This DAG isnt available in the webserver DagBag object "
Now we used a temporary solution to restart whole environment by changing configurations but this is not an efficient method.
The best workaround now we think is to restart webservers on cloud composer, but we didn't find any command to restart webserver. Is it a possible action?
Thanks!

For those who wander and find this thread: currently the for versions >= 1.13.1 Composer has a preview for a web server restart

Only certain types of updates will cause the webserver container to be restarted, like adding, removing, or upgrading one of the PyPI packages or like changing an Airflow setting.
You can do for example:
# Set some arbitrary Airflow config value to force a webserver rebuild.
gcloud composer environments update ${ENVIRONMENT_NAME} \
--location=${ENV_LOCATION} \
--update-airflow-configs=dummy=true
# Remove the previously set config value.
gcloud composer environments update ${ENVIRONMENT_NAME} \
--location=${ENV_LOCATION} \
--remove-airflow-configs=dummy

From Google Cloud Docs:
gcloud beta composer environments restart-web-server ENVIRONMENT_NAME --location=LOCATION

I finally found an alternative solution!
Based on this document:
https://cloud.google.com/composer/docs/how-to/managing/deploy-webserver
We can build airflow webservers on kubernetes (Yes, please throw away built-in webserver). So we can kill webserver pods to force restart =)

From console dags can be retrieved, we can list all dag present. There other commands too.

Related

My GKE pods stoped with error "no command specified: CreateContainerError"

Everything was Ok and nodes were fine for months, but suddenly some pods stopped with an error
I tried to delete pods and nodes but same issues.
Try below possible solutions to resolve your issue:
Solution 1 :
Check a malformed character in your Dockerfile and cause it to crash.
When you encounter CreateContainerError is to check that you have a valid ENTRYPOINT in the Dockerfile used to build your container image. However, if you don’t have access to the Dockerfile, you can configure your pod object by using a valid command in the command attribute of the object.
So workaround is to not specify any workerConfig explicitly which makes the workers inherit all configs from the master.
Refer to Troubleshooting the container runtime, similar SO1, SO2 & Also check this similar github link for more information.
Solution 2 :
Kubectl describe pod podname command provides detailed information about each of the pods that provide Kubernetes infrastructure. With the help of this you can check for clues, if Insufficient CPU follows the solution below.
The solution is to either:
1)Upgrade the boot disk: If using a pd-standard disk, it's recommended to upgrade to pd-balanced or pd-ssd.
2)Increase the disk size.
3)Use node pool with machine type with more CPU cores.
See Adjust worker, scheduler, triggerer and web server scale and performance parameters for more information.
If you still have the issue, you can then update the GKE version for your cluster Manually upgrading the control planeto one of the fixed versions.
Also check whether you have updated it in the last year to use the new kubectl authentication coming in the GKE v1.26 plugin?
Solution 3 :
If you're having a pipeline on GitLab that deploys an image to a GKE cluster: Check the version of the Gitlab runner that handles the jobs of your pipeline .
Because it turns out that every image built through a Gitlab runner running on an old version causes this issue at the container start. Simply deactivate them and only let Gitlab runners running last version in the pool, replay all pipelines.
Check the gitlab CI script using an old docker image like docker:19.03.5-dind, update to docker:dind helps the kubernetes to start the pod again.

How to increase the cloud build timeout when using ```gcloud run deploy```?

When attempting to deploy to Cloud Run using the gcloud run deploy I am hitting the 10m Cloud Build timeout limit. gcloud run deploy is working well as long as the build step does not exceed 10m. When the build step exceeds 10m, the build fails with the "Timed out" status as shown in below screenshot. AFAIK there are no arguments to gcloud run deploy that can set the Cloud Build timeout limit. gcloud run deploy docs are here: https://cloud.google.com/sdk/gcloud/reference/run/deploy
I've attempted to increase the Cloud Build timeout limit using gcloud config set builds/timeout 20m and gcloud config set container/build_timeout 20m, but these settings are not reflected in the execution details of the cloud build process when using gcloud run deploy.
In the GUI, this is the setting I want to change:
Is it possible to increase the Cloud Build timeout limit using gcloud run deploy?
How about splitting the command into (more easily configured) constituents?
[I've not tried this]
Build the container image specifying the timeout
:
gcloud builds submit --source=.... --timeout=...
Then reference the image that results when you gcloud run deploy:
gcloud run deploy ... --image=...
I know this is answered and confirmed, but #DazWikin's solution was the harder way to solve this problem than #SimonKarman's solution.
For those who do not have the cloudbuild.yml file like myself, this solution still is a valid one, you just need to edit the one created by google itself. You can find it under builds > triggers > Desired Trigger (Edit)
Then when you open the editor you can apply the timeout. If you want other changes to the yaml file you can also checkout the schema here:
https://cloud.google.com/build/docs/build-config-file-schema#yaml
Note: I am using cloudrun and this worked for me and therefore I am not 100% if it works with all builds generated by google
Hope it will be helpful for someone else in future :)
If you're using a --source such as the cloudbuild.yaml you can add the following property to alter the timeout in seconds:
...
timeout: "1800s"
...
You can find this in the documentation

Google Cloud Platform: cloudshell - is there any way to "keep" gcloud init configs?

Does anyone know of a way to persist configurations done using "gcloud init" commands inside cloudshell, so they don't vanish each time you disconnect?
I figured out how to persist python pip installs using the --user
example: pip install --user pandas
But, when I create a new configuration using gcloud init, use it for a bit, close cloudshell (or cloudshell times out on me), then reconnect later, the configurations are gone.
Not a big deal, I bounce between projects/etc so it's nice to have the configs saved so I can simply run
gcloud config configurations activate config-name
Thanks...Rich Murnane
Google Cloud Shell only persists data in your $HOME directory. Commands like gcloud init modify the environment variables and store configuration files in /tmp which is deleted when the VM is restarted. The VM is terminated after being idle for 20 minutes or 60 minutes depending on which document you read.
Google Cloud Shell is a Docker container. You can modify the docker image to customize to fit your needs. This method will allow you to install packages, tools, etc that are not located in your $HOME directory.
You can also store your files and configuration scripts on Google Cloud Storage. Modify .bashrc to download your cloud files and run your configuration script.
Either method will allow you to create a persistent environment.
This StackOverflow answer covers in detail what gcloud init does and how to basically emulate the same thing via script or command line.
gcloud init details
this isn't exactly what I wanted, but since my
account (userid) isn't changing, I'm simply going to
do the command
gcloud config set project second-project-name
good enough, thanks...Rich

Cloud Composer GKE Node upgrade results in Airflow task randomly failing

The problem:
I have a managed Cloud composer environment, under a 1.9.7-gke.6 Kubernetes cluster master.
I tried to upgrade it (as well as the default-pool nodes) to 1.10.7-gke.1, since an upgrade was available.
Since then, Airflow has been acting randomly. Tasks that were working properly are failing for no given reason. This makes Airflow unusable, since the scheduling becomes unreliable.
Here is an example of a task that runs every 15 minutes and for which the behavior is very visible right after the upgrade:
airflow_tree_view
On hover on a failing task, it only shows an Operator: null message (null_operator). Also, there is no log at all for that task.
I have been able to reproduce the situation with another Composer environment in order to ensure that the upgrade is the cause of the dysfunction.
What I have tried so far :
I assumed the upgrade might have screwed up either the scheduler or Celery (Cloud composer defaults to CeleryExecutor).
I tried restarting the scheduler with the following command:
kubectl get deployment airflow-scheduler -o yaml | kubectl replace --force -f -
I also tried to restart Celery from inside the workers, with
kubectl exec -it airflow-worker-799dc94759-7vck4 -- sudo celery multi restart 1
Celery restarts, but it doesn't fix the issue.
So I tried to restart the airflow completely the same way I did with airflow-scheduler.
None of these fixed the issue.
Side note, I can't access Flower to monitor Celery when following this tutorial (Google Cloud - Connecting to Flower). Connecting to localhost:5555 stay in 'waiting' state forever. I don't know if it is related.
Let me know if I'm missing something!
1.10.7-gke.2 is available now [1]. Can you further upgrade to 1.10.7-gke.2 to see if the issue persists?
[1] https://cloud.google.com/kubernetes-engine/release-notes

Cannot run Cloudfoundry Task - Unexpected Response 404

After my app is successfully pushed via cf I usually need do manually ssh-log into the container and execute a couple of PHP scripts to clear and warmup my cache, potentially execute some DB schema updates etc.
Today I found out about Cloudfoundry Tasks which seems to offer a pretty way to do exactly this kind of things and I wanted to test it whether I can integrate it into my build&deploy script.
So used cf login, got successfully connected to the right org and space, app has been pushed and is running and I tried this command:
cf run-task MYAPP "bin/console doctrine:schema:update --dump-sql --env=prod" --name dumpsql
(tried it with a couple of folder changes like app/bin/console etc.)
and this was the output:
Creating task for app MYAPP in org MYORG / space MYSPACE as me#myemail...
Unexpected Response
Response Code: 404
FAILED
Uses CF CLI: 6.32.0
cf logs ArcticTenTestBackend --recent does not output anything (this might be the case because I have enabled an ELK instance for logging - as I wanted to service-connect to ELK to look up the logs I found out that the service-connector cf plugin is gone for which I will open a new ticket).
Created new Issue for that: https://github.com/cloudfoundry/cli/issues/1242
This is not a CF CLI issue. Swisscom Application Cloud does not yet support the Cloud Foundry tasks. This explains the 404 you are currently receiving. We will expose this feature of Cloud Foundry in an upcoming release of Swisscom Application Cloud.
In the meantime, maybe you can find a way to execute your one-off tasks (cache warming, DB migrations) at application startup.
As mentioned by #Mathis Kretz Swisscom has gotten around to enable cf run-task since this question was posted. They send out e-mails on 22. November 2018 to announce the feature.
As discussed on your linked documentation you use the following commands to manage tasks:
cf tasks [APP_NAME]
cf run-task [APP_NAME] [COMMAND]
cf terminate-task [APP_NAME] [TASK_ID]