We are running Airflow 1.10.3 via Google Cloud Composer.
Our DAGs are distributed over several folders that we collect via instances of DagBag (as described here: https://medium.com/@xnuinside/how-to-load-use-several-dag-folders-airflow-dagbags-b93e4ef4663c).
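For reference, the collector module in our main dag folder looks roughly like this (a simplified sketch of the pattern from that article; the extra folder path is just an example):

import os

from airflow.models import DagBag

# Simplified sketch: load DAGs from extra folders and expose them in this
# module's globals() so the scheduler and webserver can pick them up.
# The folder path below is only an example.
extra_dag_folders = ["/home/airflow/gcs/dags/extra_dags"]

for folder in extra_dag_folders:
    bag = DagBag(os.path.expanduser(folder))
    for dag_id, dag in bag.dags.items():
        globals()[dag_id] = dag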
However, the web UI apparently can't find any DAGs that are not in the main DAG folder (the one configured in airflow.cfg).
This seems to be because airflow.www.views has only one global dagbag variable.
Is that really the problem? What could be a workaround?
Additional info:
airflow list_dags shows all DAGs.
The DAGs are also listed in the web UI and seem to get scheduled, but clicking a DAG in the web UI only yields the error " does not seem to be in dagbag".
I'm curious to hear your thoughts, since I'm pretty lost here.
According to the Cloud Composer documentation, the dags_folder parameter is blocked and can't be overridden (you're only allowed to use the GCS bucket created by the Cloud Composer environment). This ensures that Cloud Composer can upload DAGs and that the DAGs folder remains in the Google Cloud Storage bucket.
Since the DagBag configuration can't be modified, and because Apache Airflow does not provide strong DAG isolation, it's recommended that you keep your DAGs in separate environments to prevent DAG interference.
I ran some tests in my Composer environment, creating multiple folders to separate my DAGs.
In all cases my DAGs were recognized and ran as expected, even in sub-folders:
$ gcloud beta composer environments storage dags list --environment=$ENVIRONMENT --location=us-east1
NAME
dags/
dags/airflow_monitoring.py
dags/dev/
dags/dev/airflow_monitoring_dev.py
dags/qa/
dags/qa/airflow_monitoring_qa.py
dags/qa/qa_test1/
dags/qa/qa_test1/airflow_monitoring_qa_test1.py
If recreating your folders inside the DAGs folder created by Composer is not feasible for you, I recommend synchronizing the content of your own bucket with the Composer DAGs folder; with the gsutil rsync command you can mirror both buckets.
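For example, a mirror of your own bucket into the Composer DAGs folder could look like this (both bucket names are placeholders):

gsutil -m rsync -r gs://my-own-dag-bucket gs://us-east1-my-environment-bucket/dags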
Related
I was working with GitHub and GCP (Cloud Build for deployments) and it was working well. Below are the steps:
Created multiple Cloud Functions using the same GitHub repository.
Created a separate Cloud Build trigger for each Cloud Function, with a separate cloudbuild.yml in each Cloud Function's folder in the repository.
Each trigger runs when there are changes in the respective Cloud Function's scripts.
Now I need to integrate Cloud Build with GitLab.
I have gone through the documentation but found that a webhook is the only option, and the trigger fires on changes to the whole repository. It would require a separate repository for each Cloud Function or Cloud Run service. There is no option to select the repository itself.
Can experts guide me on how I can do this integration? We are planning to have one repo with multiple services/applications stored in it, and we want CI to run in the GCP environment itself.
Personally, I found GitLab to be the worst of the three, compared to GitHub and Bitbucket, in terms of integration with GCP Cloud Build (to run the deployment within GCP).
I don't know of an ideal solution, but I have two ideas. Neither of them is good from my point of view.
1/ Mirror the GitLab repository into a GCP repository as described here - Mirroring GitLab repositories to Cloud Source Repositories. One of the biggest drawbacks from my point of view is that the integration is based on personal credentials, and a person is needed to keep it working:
Mirroring stops working if the Google Account is closed or loses access rights to the Git repository in Cloud Source Repositories
Once mirroring is in place, you can probably work with the GCP-based repository in the ordinary way and trigger Cloud Build jobs as usual. A separate question is how to provide deployment logs to those who initiated the deployment...
2/ Use webhooks. That does not depend on any personal accounts, but it is not very granular - as you mentioned, it fires on a push at the whole-repository level. To overcome that limitation, there might be a very tricky (inline) YAML file executed by a Cloud Build trigger. In that YAML file, we should not only fetch the code, but also parse all changes (all commits) in that push to find out which subdirectories (and thus which separate components - Cloud Functions) were potentially modified. Then, for each affected (modified) subdirectory, we can trigger (asynchronously) some other Cloud Build job (with its own YAML file located inside that subdirectory).
An obvious drawback - it is not clear who should get the logs from all those deployments and how, especially if something goes wrong, and developing (and managing) such a deployment process might be time- and effort-consuming and not easy.
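To make idea 2/ a bit more concrete, the core of that inline step could be a small script along these lines (just a sketch: the commit SHAs are assumed to come from the webhook payload, the repository is assumed to be already cloned, and each component folder is assumed to contain its own cloudbuild.yml):

import os
import subprocess
import sys

# Sketch: figure out which top-level component folders were touched by a push
# and asynchronously submit a Cloud Build job for each of them.
# base_sha and head_sha are assumed to be passed in from the webhook payload.
base_sha, head_sha = sys.argv[1], sys.argv[2]

changed_files = subprocess.check_output(
    ["git", "diff", "--name-only", base_sha, head_sha]
).decode().splitlines()

# Only keep files inside a subdirectory, then deduplicate the top-level folder names.
changed_dirs = sorted({path.split("/")[0] for path in changed_files if "/" in path})

for directory in changed_dirs:
    config = os.path.join(directory, "cloudbuild.yml")
    if os.path.exists(config):
        # --async returns immediately instead of streaming the build logs.
        subprocess.check_call(
            ["gcloud", "builds", "submit", "--config", config, "--async", directory]
        )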
I am trying to build an app where the user is able to upload a file to Cloud Storage. This would then trigger a model training process (and prediction later on). Initially I thought I could do this with Cloud Functions/Pub/Sub and Cloud ML, but it seems that Cloud Functions are not able to run gsutil commands, which are needed for Cloud ML.
Is my only option to enable Cloud Composer, attach GPUs to a Kubernetes node, and create a Cloud Function that triggers a DAG to boot up a pod on the GPU node and mount the bucket with the data? That seems a bit excessive, but I can't think of another way currently.
You're correct. As of now, there's no way to execute a gsutil command from a Google Cloud Function:
Cloud Functions can be written in Node.js, Python, Go, and Java, and are executed in language-specific runtimes.
I really like your second approach with triggering the DAG.
Another idea that comes to mind is to interact with GCP virtual machines from within Cloud Composer through the PythonOperator, using the Compute Engine Python API. You can find more information on automating infrastructure and a deeper technical dive into the core features of Cloud Composer here.
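To make that a bit more concrete, a PythonOperator task could start a pre-existing training VM through the Compute Engine API roughly like this (a sketch only; the project, zone, and instance names are placeholders, it assumes the google-api-python-client package is available in your environment, and it assumes the Composer service account is allowed to manage the instance):

from datetime import datetime

import googleapiclient.discovery
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Placeholders - replace with your own project, zone and instance.
PROJECT = "my-project"
ZONE = "us-east1-b"
INSTANCE = "training-vm"


def start_training_vm(**kwargs):
    """Start the (pre-existing) training VM via the Compute Engine API."""
    compute = googleapiclient.discovery.build("compute", "v1")
    compute.instances().start(
        project=PROJECT, zone=ZONE, instance=INSTANCE
    ).execute()


with DAG(
    dag_id="start_training_vm",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,  # triggered externally, e.g. by a Cloud Function
) as dag:
    start_vm = PythonOperator(
        task_id="start_vm",
        python_callable=start_training_vm,
        provide_context=True,
    )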
Another solution you can think of is Kubeflow, which aims to make running ML workloads on Kubernetes simple. Kubeflow adds some resources to your cluster to assist with a variety of tasks, including training and serving models and running Jupyter Notebooks. Please have a look at the Codelabs tutorial.
I hope you find the above pieces of information useful.
I am new to Google Cloud Composer. I have some code on a Google Cloud Compute Engine instance,
e.g. test.py
Currently I am using Jenkins as my scheduler, and I'm running the code like below:
echo "cd /home/user/src/digital_platform && /home/user/venvs/bdp/bin/python -m test.test.test" | ssh user@instance-dp
I want to run the same code from Google Cloud Composer.
How can I do that?
Basically I need to SSH to an instance in Google Cloud and run the code in an automated way using Cloud Composer.
It seems that SSHOperator might work for you. This operator is an Airflow feature, not a Cloud Composer feature per se.
The other operator that you might want to take a look at before making your final decision is BashOperator.
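For example, an SSHOperator-based DAG for your command could look roughly like this (a sketch; it assumes you have created an SSH connection in the Airflow UI under Admin > Connections, here called ssh_dp_instance, pointing at your instance):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator

with DAG(
    dag_id="run_test_on_dp_instance",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # "ssh_dp_instance" is a placeholder Airflow connection id for your VM.
    run_test = SSHOperator(
        task_id="run_test",
        ssh_conn_id="ssh_dp_instance",
        command=(
            "cd /home/user/src/digital_platform && "
            "/home/user/venvs/bdp/bin/python -m test.test.test"
        ),
    )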
You need to create a DAG (workflow); Cloud Composer schedules only the DAGs that are in the DAGs folder in the environment's Cloud Storage bucket. Each Cloud Composer environment has a web server that runs the Airflow web interface, which you can use to manage DAGs.
BashOperator is useful for running command-line programs. I suggest you follow the Cloud Composer Quickstart, which shows how to create a Cloud Composer environment in the Google Cloud Console and run a simple Apache Airflow DAG.
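If you prefer to stay close to your current Jenkins approach, a BashOperator could run the same ssh command (again only a sketch; it assumes the Composer workers can already SSH to the instance, i.e. keys and network access are configured):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="run_test_via_bash_ssh",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_test = BashOperator(
        task_id="run_test",
        bash_command=(
            'ssh user@instance-dp '
            '"cd /home/user/src/digital_platform && '
            '/home/user/venvs/bdp/bin/python -m test.test.test"'
        ),
    )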
When you create an Airflow environment on GCP Composer, a DAG named airflow_monitoring is automatically created, and it comes back even when deleted.
Why? How should I handle it? Should I copy this file into my DAG folder and resign myself to making it part of my code? I noticed that each time I upload my code, it stops the execution of this DAG because it can't be found in the DAG folder, until it magically reappears.
I have already tried deleting it from the DAG folder, deleting the logs, deleting it from the UI, all of this at the same time, etc.
The airflow_monitoring DAG is a per-environment liveness prober/healthcheck that is used to populate the Cloud Composer monitoring metric environment/healthy. It is an indicator of the general overall health of your environment, or more specifically, its ability to schedule DAGs and run tasks. This allows you to use Google Cloud Monitoring features such as metric graphs, or to set alerts for when your environment becomes unhealthy.
You can find more information about the metric on the GCP Metrics List, and can explore the metric in Cloud Monitoring under the following:
Resource type: Cloud Composer Environment
Metric: Healthy
This is a Composer-managed DAG and uses very minimal resources from your environment. Ideally, you should leave it untouched, as it has little to no effect on anything else running in your environment.
I wrote a small plugin for Apache Airflow, which runs fine on my local deployment. However, when I use Google Composer, the user interface hangs and becomes unresponsive. Is there any way to restart the webserver in Google Composer?
(Note: This answer is currently more suggestive than finalized.)
As far as restarting the webserver goes...
What doesn't work:
I reviewed Airflow Web Interface in the docs, which describes using the webserver but not accessing it from a CLI or restarting it.
While you can also run Airflow CLI commands on Composer, I don't see a command for restarting the webserver in the Airflow CLI today.
I checked the gcloud CLI in the Google Cloud SDK but didn't find a restart related command.
Here are a few ideas that may work for restarting the Airflow webserver on Composer:
In the gcloud CLI, there's an update command to change environment properties. I would assume that it restarts the scheduler and webserver (in new containers) after you change one of these to apply the new setting. You could set an arbitrary environment variable to check, but just running the update command with no changes may work.
gcloud beta composer environments update ...
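For example, setting a throwaway environment variable (the environment name, location, and variable below are all placeholders) should force new scheduler and webserver containers:

gcloud beta composer environments update my-environment --location us-east1 --update-env-variables=DUMMY_RESTART=1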
Alternatively, you can update environment properties excluding environment variables in the GCP Console.
I think re-running the import plugins command would cause a scheduler/webserver restart as well.
gcloud beta composer environments storage plugins import ...
In a more advanced setup, Composer supports deploying a self-managed Airflow web server. Following the linked guide, you can connect to your Composer instance's GKE cluster, create Deployment and Service Kubernetes configuration files for the webserver, and deploy both with kubectl create. Then you could run kubectl replace or kubectl delete on the pod to trigger a fresh start.
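For example, once the self-managed webserver is running, something like the following should make the Deployment recreate the pod (assuming you gave your webserver Deployment the label app=airflow-webserver; the label is a placeholder from your own configuration files):

kubectl delete pod -l app=airflow-webserver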
This all feels like a bit much, so hopefully documentation or a simpler way to achieve webserver restarts emerges to supersede these workarounds.