GCP ComposerV2 missing log files - google-cloud-platform

We deployed GCP ComposerV2 with the most recent airflow version. It works perfectly. But from time to time "airflow_monitoring" predefined DAG crashes.
Here are the logs of the issue:
*** Log file is not found: gs://********/logs/airflow_monitoring/echo/2021-12-14T12:36:55+00:00/1.log. The task might not have been executed or worker executing it might have finished abnormally (e.g. was evicted)
*** 404 GET https://storage.googleapis.com/download/storage/v1/b/********/o/logs%2Fairflow_monitoring%2Fecho%2F2021-12-14T12%3A36%3A55%2B00%3A00%2F1.log?alt=media: No such object: ********/logs/airflow_monitoring/echo/2021-12-14T12:36:55+00:00/1.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
We don't change anything, this issue has happened randomly.
Here is the code of "airflow_monitoring" predefined DAG:
"""A liveness prober dag for monitoring composer.googleapis.com/environment/healthy."""
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
default_args = {
'start_date': airflow.utils.dates.days_ago(0),
'retries': 1,
'retry_delay': timedelta(minutes=5)
}
dag = DAG(
'airflow_monitoring',
default_args=default_args,
description='liveness monitoring dag',
schedule_interval=None,
dagrun_timeout=timedelta(minutes=20))
# priority_weight has type int in Airflow DB, uses the maximum.
t1 = BashOperator(
task_id='echo',
bash_command='echo test',
dag=dag,
depends_on_past=False,
priority_weight=2**31-1)

I think the log says everything:
*** Log file is not found: gs://********/logs/airflow_monitoring/echo/2021-12-14T12:36:55+00:00/1.log. The task might not have been executed or worker executing it might have finished abnormally (e.g. was evicted)
Kubernetes environment might from time to time evict a running task (for example when it fails over to another node because disk crashed or because the machines need to be restarted.
I think you should set retry to 2 and it should automatically retry in such case.

Related

After retrying, task marked success in Airflow but it should marked failed

I am not sure where I am doing wrong. I am running a DAG and I have set retries to 5. My expectation is that, If a task got failed, it will retry 5 times and if it doesn't get pass it will mark it as failed but contrary to that, the task marked as success and triggered the upstream task, despite it has not done all the updates required in that particular task to get complete. Am I missing any required parameter? Thanks.
Note- I am using dataproc cluster and updating records in BQ table and it has limitation of 1500 updates/table/day. That limit got exceeded, but airflow didn't bother to mark it failed, rather it has triggered upstream task. Any fix which we can use? so that the task would have failed even if there is limit/quota exceed issue?
issue- "write to table failed: Reason: 403 limit error due to append/update limit exceeded"- After this, It started running the other codes available. But how can I tell Airflow that it's an error and should stop the task?
writing_to_bq(df=ent1, table, retry_number=3)
def writing_to_bq(df,table,retry):
for _ in range(retry+1):
try:
df.to_gbq(table,if_exists=“append")
break
except
Exception as e:
log.info("Trying again”)
Normally if you set args globally (for all your tasks) or only in the task, it should work as expected :
dag_default_args={
'owner' : 'Airflow',
'retries': 5,
'retry_delay': timedelta(minutes=2),
...
}
with airflow.DAG(
"your_dag_id",
default_args=dag_default_args,
schedule_interval=None) as dag:
....
In this example, I used a retry_delay to 2 minutes.
After discussed together, you indicated the BigQueryInsertJobOperator operator doesn't mark the task as failed if you receive a 403 status for a quota limit exceeded.
You can also use a PythonOperator with Python BigQuery client, in this case normally the client will throw an error for a 403 status. If it's not the case, you can have more flexibility to apply logic based on the result :
def execute_bq_query():
from google.cloud import bigquery
client = bigquery.Client()
QUERY = 'SELECT ....'
query_job = client.query(QUERY)
rows = query_job.result()
PythonOperator(
task_id="task_id",
python_callable=execute_bq_query
)

Apache Airflow - Dag doesn't start even with start_date and schedule_interval defined

I am new at Airflow but I've defined a Dag to send a basic email every day at 9am. My DAG is the following one:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator
from airflow.operators.email_operator import EmailOperator
from airflow.utils.dates import days_ago
date_log = str(datetime.today())
my_email = ''
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': days_ago(0),
'email': ['my_email'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
'concurrency': 1,
'max_active_runs': 1
}
with DAG('TEST', default_args=default_args, schedule_interval='0 9 * * *',max_active_runs=1, catchup=False) as dag:
t_teste = EmailOperator(dag=dag, task_id='successful_notification',
to='my_email',
subject='Airflow Dag ' + date_log,
html_content="""""")
t_teste
I've all the configurations as I needed, and I have the webserver and scheduler running. Also, I have my Dag active on UI. My problem is that my DAG seems to be doing nothing. It hasn't run for two days, and even if it passes the scheduled time, it doesn't run as expected. I have already tested and run my trigger manually, and it ran successfully. But if I wait for the trigger time, it does nothing.
Do you know how what I am doing wrong?
Thanks!
Your DAG will never be scheduled. Airflow schedule calculates state_date + schedule_interval and schedule the DAG at the END of the interval.
>>> import airflow
>>> from airflow.utils.dates import days_ago
>>> print(days_ago(0))
2021-06-26 00:00:00+00:00
Calculating 2021-06-26 (today) + schedule_interval it means that the DAG will run on 2021-06-27 09:00 however when we reach 2021-06-27 the calculation will produce 2021-06-28 09:00 and so on resulting in DAG never actually runs.
The conclusion is: never use dynamic values in start_date!
To solve your issue simply change:
'start_date': days_ago(0) to some static value like: 'start_date': datetime(2021,6,25)
note that if you are running older versions of Airflow you might also need to change the dag_id.

How to run BigQuery after Dataflow job completed successfully

I am trying to run a query in BigQuery right after a dataflow job completes successfully. I have defined 3 different functions in main.py.
The first one is for running the dataflow job. The second one checks the dataflow jobs status. And the last one runs the query in BigQuery.
The trouble is the second function checks the dataflow job status multiple times for a period of time and after the dataflow job completes successfully, it does not stop checking the status.
And then function deployment fails due to 'function load attempt timed out' error.
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials
import os
import re
import config
from google.cloud import bigquery
import time
global flag
def trigger_job(gcs_path, body):
credentials = GoogleCredentials.get_application_default()
service = build('dataflow', 'v1b3', credentials=credentials, cache_discovery=False)
request = service.projects().templates().launch(projectId=config.project_id, gcsPath=gcs_path, body=body)
response = request.execute()
def get_job_status(location, flag):
credentials=GoogleCredentials.get_application_default()
dataflow=build('dataflow', 'v1b3', credentials=credentials, cache_discovery=False)
result=dataflow.projects().jobs().list(projectId=config.project_id, location=location).execute()
for job in result['jobs']:
if re.findall(r'' + re.escape(config.job_name) + '', job['name']):
while flag==0:
if job['currentState'] != "JOB_STATE_DONE":
print('NOT DONE')
else:
flag=1
print('DONE')
break
def bq(sql):
client = bigquery.Client()
query_job = client.query(sql, location='US')
gcs_path = config.gcs_path
body=config.body
trigger_job(gcs_path,body)
flag=0
location='us-central1'
get_job_status(location,flag)
sql= """CREATE OR REPLACE TABLE 'table' AS SELECT * FROM 'table'"""
bq(SQL)
Cloud Function timeout is set to 540 seconds but deployment fails in 3-4 minutes.
Any help is very appreciated.
It appears from the code snippet provided that your HTTP-triggered cloud function is not returning a HTTP response.
All HTTP-based cloud functions must return a HTTP response for proper termination. From the google documentation Ensure HTTP functions send an HTTP response (Emphasis - mine):
If your function is HTTP-triggered, remember to send an HTTP response,
as shown below. Failing to do so can result in your function executing
until timeout. If this occurs, you will be charged for the entire
timeout time. Timeouts may also cause unpredictable behavior or cold
starts on subsequent invocations, resulting in unpredictable behavior
or additional latency.
Thus, you must have a function that in your main.py that returns some sort of value, ideally a value that can be coerced into a Flask http response.

How to skip a task in airflow without skipping its downstream tasks?

Let’s say this is my dag:
A >> B >> C
If task B raises an exception, I want to skip the task instead of failing it. However, I don’t want to skip task C. I looked into AirflowSkipException and the soft_fail sensor but they both forcibly skip downstream tasks as well. Does anyone have a way to make this work?
Thanks!
Currently posted answers touch on different topic or does not seem to be fully correct.
Adding trigger rule all_failed to Task-C won't work for OP's example DAG: A >> B >> C unless Task-A ends in failed state, which most probably is not desirable.
OP was, in fact, very close because expected behavior can be achieved with mix of AirflowSkipException and none_failed trigger rule:
from datetime import datetime
from airflow.exceptions import AirflowSkipException
from airflow.models import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
with DAG(
dag_id="mydag",
start_date=datetime(2022, 1, 18),
schedule_interval="#once"
) as dag:
def task_b():
raise AirflowSkipException
A = DummyOperator(task_id="A")
B = PythonOperator(task_id="B", python_callable=task_b)
C = DummyOperator(task_id="C", trigger_rule="none_failed")
A >> B >> C
which Airflow executes as follows:
What this rule mean?
Trigger Rules
none_failed: All upstream tasks have not failed or upstream_failed -
that is, all upstream tasks have succeeded or been skipped
So basically we can catch the actual exception in our code and raise mentioned Airflow exception which "force" task state change from failed to skipped.
However, without the trigger_rule argument to Task-C we would end up with Task-B downstream marked as skipped.
You can refer to the Airflow documentation on trigger_rule.
trigger_rule allows you to configure the task's execution dependency. Generally, a task is executed when all upstream tasks succeed. You can change that to other trigger rules provided in Airflow. The all_failed trigger rule only executes a task when all upstream tasks fail, which would accomplish what you outlined.
from datetime import datetime
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.trigger_rule import TriggerRule
with DAG(
dag_id="my_dag",
start_date=datetime(2021, 4, 5),
schedule_interval='#once',
) as dag:
p = PythonOperator(
task_id='fail_task',
python_callable=lambda x: 1,
)
t = PythonOperator(
task_id='run_task',
python_callable=lambda: 1,
trigger_rule=TriggerRule.ALL_FAILED
)
p >> t
You can change the trigger_rule in your task declaration.
task = BashOperator(
task_id="task_C",
bash_command="echo hello world",
trigger_rule="all_done",
dag=dag
)

Dataflow stops streaming to BigQuery without errors

We started using Dataflow to read from PubSub and Stream to BigQuery.
Dataflow should work 24/7, because pubsub is constantly updated with analytics data of multiple websites around the world.
Code looks like this:
from __future__ import absolute_import
import argparse
import json
import logging
import apache_beam as beam
from apache_beam.io import ReadFromPubSub, WriteToBigQuery
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
logger = logging.getLogger()
TABLE_IDS = {
'table_1': 0,
'table_2': 1,
'table_3': 2,
'table_4': 3,
'table_5': 4,
'table_6': 5,
'table_7': 6,
'table_8': 7,
'table_9': 8,
'table_10': 9,
'table_11': 10,
'table_12': 11,
'table_13': 12
}
def separate_by_table(element, num):
return TABLE_IDS[element.get('meta_type')]
class ExtractingDoFn(beam.DoFn):
def process(self, element):
yield json.loads(element)
def run(argv=None):
"""Main entry point; defines and runs the wordcount pipeline."""
logger.info('STARTED!')
parser = argparse.ArgumentParser()
parser.add_argument('--topic',
dest='topic',
default='projects/PROJECT_NAME/topics/TOPICNAME',
help='Gloud topic in form "projects/<project>/topics/<topic>"')
parser.add_argument('--table',
dest='table',
default='PROJECTNAME:DATASET_NAME.event_%s',
help='Gloud topic in form "PROJECT:DATASET.TABLE"')
known_args, pipeline_args = parser.parse_known_args(argv)
# We use the save_main_session option because one or more DoFn's in this
# workflow rely on global context (e.g., a module imported at module level).
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = True
p = beam.Pipeline(options=pipeline_options)
lines = p | ReadFromPubSub(known_args.topic)
datas = lines | beam.ParDo(ExtractingDoFn())
by_table = datas | beam.Partition(separate_by_table, 13)
# Create a stream for each table
for table, id in TABLE_IDS.items():
by_table[id] | 'write to %s' % table >> WriteToBigQuery(known_args.table % table)
result = p.run()
result.wait_until_finish()
if __name__ == '__main__':
logger.setLevel(logging.INFO)
run()
It works fine but after some time (2-3 days) it stops streaming for some reason.
When I check job status, it contains no errors in the logs section (you know, ones marked with red "!" in dataflow's job details). If I cancel the job and run it again - it starts working again, as usual.
If I check Stackdriver for additional logs, here's all Errors that happened:
Here's some warnings that occur periodically while job executes:
Details of one of them:
{
insertId: "397122810208336921:865794:0:479132535"
jsonPayload: {
exception: "java.lang.IllegalStateException: Cannot be called on unstarted operation.
at com.google.cloud.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.getElementsSent(RemoteGrpcPortWriteOperation.java:111)
at com.google.cloud.dataflow.worker.fn.control.BeamFnMapTaskExecutor$SingularProcessBundleProgressTracker.updateProgress(BeamFnMapTaskExecutor.java:293)
at com.google.cloud.dataflow.worker.fn.control.BeamFnMapTaskExecutor$SingularProcessBundleProgressTracker.periodicProgressUpdate(BeamFnMapTaskExecutor.java:280)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
"
job: "2018-11-30_10_35_19-13557985235326353911"
logger: "com.google.cloud.dataflow.worker.fn.control.BeamFnMapTaskExecutor"
message: "Progress updating failed 4 times. Following exception safely handled."
stage: "S0"
thread: "62"
work: "c-8756541438010208464"
worker: "beamapp-vitar-1130183512--11301035-mdna-harness-lft7"
}
labels: {
compute.googleapis.com/resource_id: "397122810208336921"
compute.googleapis.com/resource_name: "beamapp-vitar-1130183512--11301035-mdna-harness-lft7"
compute.googleapis.com/resource_type: "instance"
dataflow.googleapis.com/job_id: "2018-11-30_10_35_19-13557985235326353911"
dataflow.googleapis.com/job_name: "beamapp-vitar-1130183512-742054"
dataflow.googleapis.com/region: "europe-west1"
}
logName: "projects/PROJECTNAME/logs/dataflow.googleapis.com%2Fharness"
receiveTimestamp: "2018-12-03T20:33:00.444208704Z"
resource: {
labels: {
job_id: "2018-11-30_10_35_19-13557985235326353911"
job_name: "beamapp-vitar-1130183512-742054"
project_id: PROJECTNAME
region: "europe-west1"
step_id: ""
}
type: "dataflow_step"
}
severity: "WARNING"
timestamp: "2018-12-03T20:32:59.442Z"
}
Here's the moment when it seems to start having problems:
Additional info messages that may help:
According to these messages, we don't run out of memory/processing power etc. The job is run with these parameters:
python -m start --streaming True --runner DataflowRunner --project PROJECTNAME --temp_location gs://BUCKETNAME/tmp/ --region europe-west1 --disk_size_gb 30 --machine_type n1-standard-1 --use_public_ips false --num_workers 1 --max_num_workers 1 --autoscaling_algorithm NONE
What could be the problem here?
This isn't really an answer, more helping identify the cause: so far, all streaming Dataflow jobs I've launched using python SDK have stopped that way after some days, whether they use BigQuery as sink or not. So the cause rather seems to be the general fact that streaming jobs with the python SDK are still in beta.
My personal solution: use the Dataflow templates to stream from Pub/Sub to BigQuery (thus avoiding the python SDK), then schedule queries in BigQuery to periodically treat the data. Unfortunately that might not be appropriate for your use cases.
in my company we are experiencing the same and identical problem, as described by the OP, with a similar use case.
Unfortunately the problem is real, concrete and apparently with a random occurrence.
As a workaround, we are considering rewriting our pipeline using the java SDK.
I had a similar issue to this and found that the warning logs contained python Stack trace hidden in the java logs advising of errors.
These errors were continually re-tried by workers causing them to crash and completely freeze the pipeline. I initially thought the No. of workers was too low, so scaled up the number of workers, but the pipeline just took longer to freeze.
I ran the pipeline locally and exported the pubsub messages as text and identified they contained dirty data(messages that did not match the BQ table schema) and as I had no exception handling, that seemed to be the cause of the pipeline to freeze.
Adding a function only accept a record where the first key matches the expected column of your BQ Schema fixed my issue and the Dataflow Job has been running with no issues ongoing.
def bad_records(row):
if 'key1' in row:
yield row
else:
print('bad row',row)
|'exclude bad records' >> beam.ParDo(bad_records)