I'm new to GCP. I've been going over various documents on GCP Composer and Cloud Shell, but I can't find anywhere that explains how to connect the Cloud Shell environment to the Composer DAG folder.
Right now I'm creating the Python script outside Cloud Shell (on my local system) and uploading it manually to the DAG folder, but I want to do all of this from Cloud Shell. Can anyone point me in the right direction?
Also, when I try to use import airflow in my Python file on Cloud Shell, I get a "module not found" error. How do I install that too?
Take a look at this GCP documentation:
Adding and Updating DAGs (workflows)
Among many other entries, you will find information like this:
Determining the storage bucket name
To determine the name of the storage bucket associated with your environment:
gcloud composer environments describe ENVIRONMENT_NAME \
--location LOCATION \
--format="get(config.dagGcsPrefix)"
where:
ENVIRONMENT_NAME is the name of the environment.
LOCATION is the Compute Engine region where the environment is located.
--format is an option to specify only the dagGcsPrefix property instead of all environment details.
The dagGcsPrefix property shows the bucket name:
gs://region-environment_name-random_id-bucket/
Adding or updating a DAG
To add or update a DAG, move the Python .py file for the DAG to the environment's dags folder in Cloud Storage.
gcloud composer environments storage dags import \
--environment ENVIRONMENT_NAME \
--location LOCATION \
--source LOCAL_FILE_TO_UPLOAD
where:
ENVIRONMENT_NAME is the name of the environment.
LOCATION is the Compute Engine region where the environment is located.
LOCAL_FILE_TO_UPLOAD is the DAG to upload.
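To do all of this from Cloud Shell, a minimal sketch could look like the following; my-composer-env, us-central1 and my_dag.py are placeholders, not values from the question:
# Run directly in Cloud Shell: find the DAGs bucket, then push the file to it.
gcloud composer environments describe my-composer-env \
  --location us-central1 \
  --format="get(config.dagGcsPrefix)"
# Edit the DAG with the Cloud Shell editor (or vim/nano), then import it.
gcloud composer environments storage dags import \
  --environment my-composer-env \
  --location us-central1 \
  --source my_dag.py
As for the "module not found" error: Cloud Shell does not come with Airflow preinstalled, so the import fails locally even though the DAG will run fine inside Composer. If you only want the import to resolve for editing and testing, installing the package in a virtual environment with pip install apache-airflow should be enough.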
Related
I'm trying to automatically run the Airflow webserver and scheduler on a VM at boot using startup scripts, following the documentation here: https://cloud.google.com/compute/docs/instances/startup-scripts/linux . Here is my script:
export AIRFLOW_HOME=/home/name/airflow
cd /home/name/airflow
nohup airflow scheduler >> scheduler.log &
nohup airflow webserver -p 8080 >> webserver.log &
The .log files are created, which means the script is being executed, but the webserver and the scheduler don't start.
Any apparent reason?
I have tried replicating the Airflow webserver startup script on a GCP VM using the documentation.
Steps followed to run the Airflow webserver startup script on a GCP VM:
Create a Service Account. Give minimum access to BigQuery with the role of BigQuery Job User and to Dataflow with the role of Dataflow Worker. Click Add Key / Create new key / Done. This will download a JSON file.
Create a Compute Engine instance. Select the Service Account created.
Install Airflow libraries. Create a virtual environment using miniconda.
Initialize your metadata database and register at least one admin user using the commands:
airflow db init
airflow users create -r Admin -u username -p mypassword -e example@mail.com -f yourname -l lastname
Whitelist your IP for port 8080: create a firewall rule (see the example command below) and add it to the GCP VM instance. Now go to the terminal and start the webserver using the command:
airflow webserver -p 8080
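For the firewall step, a rule along these lines should work; the rule name, target tag, and source range below are placeholders:
gcloud compute firewall-rules create allow-airflow-webserver \
  --allow=tcp:8080 \
  --target-tags=airflow-web \
  --source-ranges=203.0.113.10/32
The VM instance then needs the matching network tag (airflow-web in this sketch).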
Open another terminal and start the Scheduler.
export AIRFLOW_HOME=/home/acachuan/airflow-medium
cd airflow-medium
conda activate airflow-medium
airflow db init
airflow scheduler
We want Airflow to start immediately after the Compute Engine instance starts, so we can create a Cloud Storage bucket, create a startup script, upload the file, and keep it there as a backup.
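A sketch of what such a startup script could contain, based on the commands above; the user name, miniconda path, and environment name are assumptions. Note that Compute Engine startup scripts run as root, so the script switches to the Airflow user explicitly:
#!/bin/bash
# Hypothetical startup script: run scheduler and webserver as the Airflow user.
su - acachuan -c '
  export AIRFLOW_HOME=/home/acachuan/airflow-medium
  source ~/miniconda3/etc/profile.d/conda.sh   # assumed miniconda location
  conda activate airflow-medium
  cd $AIRFLOW_HOME
  nohup airflow scheduler >> scheduler.log 2>&1 &
  nohup airflow webserver -p 8080 >> webserver.log 2>&1 &
'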
Now pass the Linux startup script from Cloud Storage to the VM; refer to "Passing a startup script that is stored in Cloud Storage to an existing VM" in the documentation.
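Attaching the script stored in Cloud Storage to an existing VM can be done through instance metadata, for example (the VM name, zone, and bucket path are placeholders):
gcloud compute instances add-metadata my-airflow-vm \
  --zone us-central1-a \
  --metadata startup-script-url=gs://my-startup-bucket/airflow-startup.sh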
Note: "PermissionDenied desc = The caller does not have permission" means you don't have sufficient permissions; request access from your project, folder, or organization admin, depending on the assets you are trying to access. To access files created by the root user, you need read, write, or execute permissions; refer to the File permissions documentation.
I want to change the time zone of the Airflow webserver created by Cloud Composer from UTC to JST (Asia/Tokyo).
However, even after setting webserver-default_ui_timezone = 'JST' as an Airflow configuration override, the webserver time does not change.
Even after changing the time zone of the VM (GKE node) used by Airflow from UTC to JST (Asia/Tokyo), there was no change in the webserver.
How can I change the display time and time zone of the webserver and DAGs to JST (Asia/Tokyo)?
default_ui_timezone is the correct Airflow setting to achieve what you are looking for, so the problem is likely related to the way you are trying to set the value. Also, this setting is not included in the blocked Airflow configurations list.
From the gcloud CLI, try this command, replacing the environment and region names with your own:
gcloud composer environments update test-environment \
--location your-region-1 \
--update-airflow-configs=webserver-default_ui_timezone=JST
From the Cloud Composer docs:
gcloud composer environments update ENVIRONMENT_NAME \
--location LOCATION \
--update-airflow-configs=KEY=VALUE,KEY=VALUE,...
ENVIRONMENT_NAME with the name of the environment.
LOCATION with the Compute Engine region where the environment is located.
KEY with the configuration section and the option name separated by a hyphen, for example, core-print_stats_interval.
VALUE with the corresponding value for an option.
In case you need it, the docs linked above also include examples of how to set Airflow configuration values from the Console or the API. Good luck!
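To double-check that the override was actually applied, you could describe the environment and inspect the overrides; this is a sketch that assumes the field path config.softwareConfig.airflowConfigOverrides and reuses the example names from above:
gcloud composer environments describe test-environment \
  --location your-region-1 \
  --format="get(config.softwareConfig.airflowConfigOverrides)"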
I would like to deploy an application (as a container image) to Google Cloud Run. I am following the documentation as below:
https://cloud.google.com/run/docs/quickstarts/build-and-deploy
I would like to set the region as Tokyo (asia-northeast1) for the following commands:
gcloud builds submit
gcloud run deploy
The reason is that Cloud Run and Cloud Storage costs depend on the region, so I would like to set the location for both Cloud Storage and Cloud Run.
When creating a service in the Cloud Run Console, there is a region dropdown in the service settings.
You can also use the gcloud command to specify the region:
gcloud run deploy --image gcr.io/PROJECT-ID/DOCKER --platform managed --region=asia-northeast1
Setting the Cloud Run deployment region prior to deployment with gcloud is covered in the documentation:
Optionally, set your platform and default Cloud Run region with the
gcloud properties to avoid prompts from the command line:
gcloud config set run/platform managed
gcloud config set run/region REGION
replacing REGION with the default region you want to use.
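Putting it together for Tokyo, a sketch of the full sequence might look like this; PROJECT-ID and the image/service names are placeholders:
gcloud config set run/platform managed
gcloud config set run/region asia-northeast1
# Build the container image, then deploy it to Cloud Run in asia-northeast1.
gcloud builds submit --tag gcr.io/PROJECT-ID/my-image
gcloud run deploy my-service --image gcr.io/PROJECT-ID/my-image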
I use the Bizzflow.net ETL template within my GCP project. While working on my extractor configuration (extractor.json), I uploaded an invalid configuration to my repo. After running the git_pull DAG, my extractor-related DAGs were removed, including the git_pull DAG itself. How can I repair it?
This is a very common issue. The current release of Bizzflow does not correctly validate the configuration during the git_pull DAG run, so when you push an invalid configuration to the master branch of your project repository and run git_pull, all DAGs are removed from the Airflow UI.
The fix is easy. Just repair your broken configuration, push it to the master branch of your project repo, and run the git pull command directly on the vm-airflow machine. To do that, simply log in to the vm-airflow machine using
gcloud auth login
gcloud compute ssh admin@vm-airflow --project <your project id> --zone <your zone id>
and run the git pull command in the project repository:
cd /home/admin/project
git pull
After 2-3 minutes, all your DAGs will be back.
Of course, you need the appropriate permissions to do this. Typically this fix is for a project administrator with the GCP Owner role assigned.
I have created a Slurm cluster following this tutorial. I have also created a data bucket that stores some data that needs to be accessed in the compute nodes. Since the compute nodes share the home directory of the login node, I mounted the bucket on my login node using gcsfuse. However, if I execute a simple script test.py that prints the contents of the mounted directory, it is just empty. The folder is there, as is the Python file.
Is there something I have to specify in the YAML configuration file to enable access to the mounted directory?
I have written down the steps I took to mount the directory:
When creating the Slurm cluster using
gcloud deployment-manager deployments create google1 --config slurm-cluster.yaml
it is important that the node that should mount the storage directory has sufficient permissions.
Uncomment/add the following in the slurm-cluster.yaml file if your login node should mount the data (do the same for the controller node instead if you prefer):
login_node_scopes:
  - https://www.googleapis.com/auth/devstorage.read_write
Next, log in to the login node and install gcsfuse. After installing gcsfuse, you can mount the bucket using the following command:
gcsfuse --implicit-dirs <BUCKET-NAME> target/folder/
Note that the service account attached to your VM has to have access rights on the bucket. You can find the name of the service account in the details of your VM in the Cloud Console, or by running the following command on the VM:
gcloud auth list
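If the service account turns out not to have access, granting it permissions on the bucket could look like this; the service account email and bucket name are placeholders:
gsutil iam ch \
  serviceAccount:my-sa@my-project.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://my-data-bucket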
I've just got a similar setup working. I don't have a definite answer to why yours isn't working, but a few notes:
gcsfuse is installed by default, so there is no need to install it explicitly.
You need to wait for the Slurm install to fully finish before the bucket is available.
The "devstorage.read_write" scope appears to be needed.
I have the following under login_machine_type in the YAML file:
network_storage:
  - server_ip: none
    remote_mount: mybucket
    local_mount: /data
    fs_type: gcsfuse
    mount_options: file_mode=664,dir_mode=775,allow_other
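Once the cluster is up, a quick sanity check from the login node (assuming the /data mount point above) might be:
mount | grep gcsfuse
ls -la /data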