Debugging broken DAGs in GCP Composer - google-cloud-platform

I have read the question for vanilla Airflow.
How can broken DAGs be debugged effectively in Google Cloud Composer?
How can I see the full logs of a broken DAG?
Right now I can only see one line of trace in Airflow UI main page.
EDIT:
The answers seem to be misunderstanding my question.
I am asking about fixing broken DAGs, i.e. the DAG does not even appear in the DAGs list, so of course there are no tasks running and no task logs to view.

As hexacynide pointed out, you can look at the task logs - there are details in the Composer docs about doing that specifically, found here. You can also use Stackdriver logging, which is enabled by default in Composer projects. In Stackdriver, you can filter your logs on many variables, including by time, by pod (airflow-worker, airflow-webserver, airflow-scheduler, etc.), and by whatever keywords you suspect might appear in the logs.
EDIT: Adding screenshots and more clarity in response to question update
In Airflow, when there's a broken DAG, there is usually some form of error message at the top of the UI. (Yes, I know this particular error message is already helpful and I don't strictly need to debug further, but I'm going to anyway just to show how.)
In the message, I can see that my DAG bq_copy_across_locations is broken.
To debug, I go to Stackdriver, and search for the name of my DAG. I limit the results to the logs from this Composer environment. You can also limit the time frame if needed.
I looked through the error logs and found the Traceback error for the broken DAG.
Alternatively, if you know you only want to search for the stack traceback, you can run an advanced filter looking for your DAG name and the word "traceback". To do so, click the arrow at the right side of the Stackdriver logging bar and hit "Convert to advanced filter".
Then enter your advanced filter
resource.type="cloud_composer_environment"
resource.labels.location="YOUR-COMPOSER-REGION"
resource.labels.environment_name="YOUR-ENV-NAME"
("BROKEN-DAG-NAME" AND
"Traceback")
This is what my advanced search looked like
The only logs returned will be the stack traceback logs for that DAG.
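If you prefer to pull those entries programmatically rather than through the console, the Cloud Logging (Stackdriver) Python client can run the same filter. This is only a sketch; the project ID, region, environment name, and DAG name are placeholders you would replace.
# Sketch: query the same Stackdriver/Cloud Logging filter with the Python client.
# Requires `pip install google-cloud-logging`; all names below are placeholders.
from google.cloud import logging

client = logging.Client(project="my-gcp-project")

log_filter = """
resource.type="cloud_composer_environment"
resource.labels.location="YOUR-COMPOSER-REGION"
resource.labels.environment_name="YOUR-ENV-NAME"
("BROKEN-DAG-NAME" AND "Traceback")
"""

# Print each matching entry's timestamp and payload (the traceback text).
for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.payload)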

To determine run-time issues that occur when a DAG is triggered, you can always look at task logs as you would for any typical Airflow installation. These can be found using the web UI, or by looking in the associated logs folder in your Cloud Composer environment's associated Cloud Storage bucket.
To identify issues at parse time, you can execute Airflow commands using gcloud composer. For example, to run airflow list_dags, the gcloud CLI equivalent would be:
$ gcloud composer environments run $ENV_NAME --location=$REGION list_dags -- --report
Note that the second -- is intentional. This is so that the command argument parser can differentiate between arguments to gcloud and arguments to be passed through to the Airflow subcommand (in this case list_dags).
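Another way to surface the same parse-time failures is to load the DagBag yourself and inspect its import_errors, which is essentially what the scheduler does when it decides a DAG is broken. A minimal sketch, assuming you can run Python with access to a copy of your dags folder (locally or synced down from the Composer bucket):
# Sketch: reproduce the scheduler's parse step to surface broken-DAG tracebacks.
# The dags folder path below is a placeholder; point it at your own copy.
from airflow.models import DagBag

dag_bag = DagBag(dag_folder="/path/to/dags", include_examples=False)

# import_errors maps each failing DAG file to the full traceback string.
for dag_file, traceback_text in dag_bag.import_errors.items():
    print(f"Broken DAG file: {dag_file}")
    print(traceback_text)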

Related

Dataflow pipeline got stuck

Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
I am using a service account with all of the required IAM roles.
Generally, "The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h" can be caused by worker setup taking too long. To solve this issue, you can try to increase worker resources (via the --machine_type parameter).
For example, installing several dependencies that require building wheels (pystan, fbprophet) can take more than an hour on the minimal machine (n1-standard-1, with 1 vCPU and 3.75 GB RAM). Using a more powerful instance (n1-standard-4, which has four times the resources) will solve the problem.
You can debug this by looking at the worker startup logs in Cloud Logging. You are likely to see pip issues with installing dependencies.
Do you have any error logs showing that Dataflow Workers are crashing when trying to start?
If not, maybe worker VMs are started but they can't reach the Dataflow service, which is often related to network connectivity.
Please note that by default, Dataflow creates jobs using the network and subnetwork named default (please check that they exist in your project), and you can switch to a specific one by specifying --subnetwork. Check https://cloud.google.com/dataflow/docs/guides/specifying-networks for more information.
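To illustrate the two suggestions above, here is a rough sketch of how the worker machine type and subnetwork can be set when launching a Beam/Dataflow pipeline from Python; the project, bucket, region, and subnetwork values are placeholders.
# Sketch: pass --machine_type and --subnetwork as Dataflow pipeline options.
# All project/bucket/network values below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    machine_type="n1-standard-4",  # larger worker so dependency setup finishes in time
    subnetwork="regions/us-central1/subnetworks/my-subnet",  # only needed if not using the default network
)

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["hello", "world"])
     | "Print" >> beam.Map(print))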

How to restrict access to airflow.models?

I have an Airflow instance with many tenants that have DAGs. They want to extract metadata about their DAG runs, like DagRun.end_date. However, I want to restrict each tenant so they can only access data related to their own DAG runs and cannot access data from other tenants' DAG runs. How can this be done?
This is what I imagine the DAG to look like
# custom macro function
def get_last_dag_run(dag):
    last_dag_run = dag.get_last_dagrun()
    return last_dag_run.end_date
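For context, a sketch of how such a macro might be wired into a DAG via user_defined_macros; the DAG id, schedule, and the BashOperator command are just illustrative placeholders, and the import path assumes Airflow 1.x.
# Sketch: registering the macro with user_defined_macros so templates can call it.
# DAG id, schedule, and the bash command are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

def get_last_dag_run(dag):
    last_dag_run = dag.get_last_dagrun()
    return last_dag_run.end_date if last_dag_run else None

with DAG(
    dag_id="tenant_a_report",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    user_defined_macros={"get_last_dag_run": get_last_dag_run},
    catchup=False,
) as dag:
    BashOperator(
        task_id="print_last_run",
        bash_command="echo 'last end_date: {{ get_last_dag_run(dag) }}'",
    )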
I found these resources which explain how to extract data but not how to restrict it.
Getting the date of the most recent successful DAG execution
Apache airflow macro to get last dag run execution time
How to get last two successful execution dates of Airflow job?
how to get latest execution time of a dag run in airflow
How to find the start date and end date of a particular task in dag in airflow?
How to get dag status like running or success or failure
NB: I am a contributor to Airflow.
This is not possible with the current Airflow architecture.
We are slowly working to make Airflow multi-tenant capable, but for now we are only halfway there, and I believe it will take several major releases to finish.
Currently the only way to isolate tenants is to give every tenant a separate Airflow instance, which is not as bad as you might initially think. If you run them in separate namespaces on the same auto-scaling Kubernetes cluster, add KEDA autoscaling, and use the same database server (but give each tenant a separate schema), this can be rather efficient (especially if you use Terraform to set up/tear down such Airflow instances, for example).

How does Airflow load/update the DagBag from the dags home folder on Google Cloud Platform?

Please do not downvote my question. If needed, I will update and correct my wording. I have done my homework and research; I am a little new to this, so I am trying to understand it.
I would like to understand how Airflow on Google Cloud Platform picks up changes from the DAGs home folder and reflects them in the UI. Also, please help me with my DAGs setup script. I have read many answers along with books; the book link is here.
I tried figuring out my answer from page 69, which says:
3.11 Scheduling & Triggers: The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been met. Behind the scenes, it monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) inspects active tasks to see whether they can be triggered.
My understanding from this book is that the scheduler regularly picks up changes from the DAGs home folder. (Is that correct?)
I also read multiple answers on Stack Overflow and found this one useful: Link
But that answer still does not describe the process that creates/updates the DagBag from script.py in the DAG home folder, or how changes are sensed.
Please help me with my dags setup script.
We have created a generic Python script that dynamically creates DAGs by reading/iterating over config files.
Below is the directory structure:
/dags/workflow/
/dags/workflow/config/dag_a.json
/dags/workflow/config/dag_b.json
/dags/workflow/task_a_with_single_operator.py
/dags/workflow/task_b_with_single_operator.py
/dags/dag_creater.py
The execution flow of dag_creater.py is as follows (a rough sketch is shown after this list):
1. Iterate over the dags/workflow/config folder, get each config JSON file, and read the variable dag_id.
2. Create Parent_dag = DAG(dag_id=dag_id, start_date=start_date, schedule_interval=schedule_interval, default_args=default_args, catchup=False).
3. Read the tasks and their dependencies for that dag_id from the config JSON file (example: [[a,[]],[b,[a]],[c,[b]]]) and code them as task_a >> task_b >> task_c.
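Roughly, the pattern looks like the sketch below; the file names, config keys, and operator are made up for illustration only, and the import paths assume Airflow 1.x.
# dag_creater.py -- a minimal sketch of the dynamic-DAG pattern described above.
# File names, config keys, and the operator used are placeholders, not my real code.
import json
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator  # Airflow 1.x import path

CONFIG_DIR = os.path.join(os.path.dirname(__file__), "workflow", "config")

default_args = {"owner": "airflow"}

for config_file in os.listdir(CONFIG_DIR):
    if not config_file.endswith(".json"):
        continue
    with open(os.path.join(CONFIG_DIR, config_file)) as f:
        config = json.load(f)

    dag = DAG(
        dag_id=config["dag_id"],
        start_date=datetime(2021, 1, 1),
        schedule_interval=config.get("schedule_interval"),
        default_args=default_args,
        catchup=False,
    )

    # Build tasks first, then wire dependencies from entries like [["a", []], ["b", ["a"]]].
    tasks = {}
    for task_id, upstream_ids in config["tasks"]:
        tasks[task_id] = DummyOperator(task_id=task_id, dag=dag)
    for task_id, upstream_ids in config["tasks"]:
        for upstream_id in upstream_ids:
            tasks[upstream_id] >> tasks[task_id]

    # Expose the DAG at module level so the DagBag loader can discover it.
    globals()[config["dag_id"]] = dag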
This way the DAGs are created. All works fine; the DAGs are visible in the UI and running fine.
The problem is that my DAG-creation script runs every time. Even in each task's logs I see logs of all the DAGs. I expect this script to run once, just to fill the entries in the metadata database. I am unable to understand why it is running every time.
Please help me understand the process.
I know airflow initdb is run once, when we first set up the metadata database, so that is not what is doing this update all the time.
Is it the scheduler heartbeat updating everything?
Is my setup correct?
Please note: I can't paste real code, as that is a restriction from my organization. However, if asked, I will provide more information.
The Airflow scheduler runs continuously in the Airflow runtime environment and is the main component responsible for monitoring changes in the DAG folder and triggering the relevant DAG tasks residing in that folder. The main settings for the Airflow scheduler service can be found in the airflow.cfg file, essentially the heartbeat intervals, which effectively affect general DAG task maintenance.
However, how a particular task is executed is determined by the executor model in the Airflow configuration.
To make DAGs available to the Airflow runtime environment, GCP Composer uses Cloud Storage with a specific folder structure: any object with a *.py extension arriving in the /dags folder is synchronized and checked for a DAG definition.
If you expect to run a DAG-generating script within the Airflow runtime, then in this particular use case I would advise you to look at PythonOperator, using it in a separate DAG to invoke and execute your custom generic Python code in a way that guarantees it is scheduled only once at a time. You can check out this Stack Overflow thread for implementation details.
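A minimal sketch of such a separate DAG, assuming Airflow 1.x import paths; the DAG id, task id, and callable body are placeholders.
# Sketch: run the generator logic in its own DAG instead of at parse time.
# DAG id, task id, and the callable body are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

def generate_dag_configs():
    # Put the expensive config processing / metadata population here,
    # so it runs inside a scheduled task rather than every time the file is parsed.
    pass

with DAG(
    dag_id="dag_config_generator",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@once",  # a single scheduled run
    catchup=False,
) as dag:
    PythonOperator(task_id="generate_configs", python_callable=generate_dag_configs)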

Logging jobs on a Google Cloud VM

I am using a Google Cloud virtual machine to run several Python scripts scheduled on a cron, and I am looking for some way to check that they ran.
When I look in my logs I see nothing, so I guess simply running a .py file is not logged? Is there a way to turn on logging at this level? What are the usual approaches for such things?
The technology for recording log information in GCP is called Stackdriver. You have a couple of choices for how to log from within your application. The first is to instrument your code with the Stackdriver APIs, which explicitly write data to the Stackdriver subsystem. Here are the docs for that, and here is a further recipe.
A second option is to install the Stackdriver Logging Agent on your Compute Engine instance. This will then allow you to tap into other sources of logging output, such as local syslog.
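For example, a cron-driven Python script can write its own entries with the Stackdriver (Cloud Logging) client library; this is just a sketch, and the logger name is a placeholder.
# Sketch: emit log entries from a cron-driven Python script so each run is visible in Stackdriver.
# Requires `pip install google-cloud-logging`; the logger name is a placeholder.
from google.cloud import logging

client = logging.Client()
logger = client.logger("nightly-cron-job")

logger.log_text("nightly job started")
# ... the actual work of the script goes here ...
logger.log_text("nightly job finished successfully")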

Export / Import tool with Google Spanner

I have several questions regarding the Google Spanner Export / Import tool. Apparently the tool creates a Dataflow job.
Can an import/export Dataflow job be re-run after it has run successfully from the tool? If so, will it use the current timestamp?
How to schedule a daily backup (export) of Spanner DBs?
How to get notified of new enhancements within the GCP platform? I was browsing the web for something else and I noticed that the export / import tool for GCP Spanner had been released 4 days earlier.
I am still browsing through the documentation for Dataflow jobs, templates, etc. Any suggestions on the above would be greatly appreciated.
Thx
My response is based on limited experience with the Spanner Export tool.
I have not seen a way to do this. There is no option in the GCP console, though that does not mean it cannot be done.
There is no built-in scheduling capability. Perhaps this can be done via Google's managed Airflow service, Cloud Composer (https://console.cloud.google.com/composer)? I have yet to try this, but it is my next step, as I have similar needs.
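If Cloud Composer does turn out to fit, my rough idea is a daily DAG that launches Google's Spanner-to-Avro Dataflow template. This is untested on my side: the operator import path is the Airflow 1.x contrib one, and the template name and its parameters (instanceId, databaseId, outputDir) should be double-checked against the current Dataflow template docs; project, instance, database, and bucket names are placeholders.
# Untested sketch: schedule a daily Spanner export by launching the
# Cloud_Spanner_to_GCS_Avro Dataflow template from Composer.
# Verify the template path and parameter names against the docs before using.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator  # Airflow 1.x

with DAG(
    dag_id="daily_spanner_export",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    DataflowTemplateOperator(
        task_id="export_spanner_to_gcs",
        template="gs://dataflow-templates/latest/Cloud_Spanner_to_GCS_Avro",
        parameters={
            "instanceId": "my-spanner-instance",
            "databaseId": "my-database",
            "outputDir": "gs://my-backup-bucket/spanner/{{ ds }}",  # one folder per day
        },
        dataflow_default_options={
            "project": "my-gcp-project",
            "region": "us-central1",
            "tempLocation": "gs://my-backup-bucket/tmp",
        },
    )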
I've made this request to Google several times. I have yet to get a response. My best recommendation is to read the change logs when updating the gcloud CLI.
Finally, there is an outstanding issue with the Export tool that causes it to fail if you export a table with 0 rows. I have filed a case with Google (Case #16454353) and they confirmed this issue. Specifically:
After running into a similar error message during my reproduction of the issue, I drilled down into the error message and discovered that there is something odd with the file path for the Cloud Storage folder [1]. There seems to be an issue with the Java File class viewing ‘gs://’ as having a redundant ‘/’ and that causes the ‘No such file or directory’ error message.
Fortunately for us, there is an ongoing internal investigation on this issue, and it seems like there is a fix being worked on. I have indicated your interest in a fix as well; however, I do not have any ETAs or guarantees of when a working fix will be rolled out.