Cloud Tasks Conditional Execution - google-cloud-platform

I am using Cloud Tasks. I need to trigger the execution of Task C only when Task A and Task B have been completed successfully. So I need some way of reading / being notified of the statuses of the triggered tasks, but I see no way of doing this in GCP's documentation. I'm using the Node.js SDK to create tasks and Cloud Functions as task handlers, if that helps at all.
Edit:
As requested, here is more info on what we are doing:
Tasks 1 - 10 each make HTTP requests, fetch data, update individual collections in Firestore based on this data. These 10 tasks can run in parallel and in no particular order as they don't have any dependency on each other. All of these tasks are actually implemented inside GCF.
Task 11 actually depends on the Firestore collection data updated by Tasks 1 - 10. So it can only run after Tasks 1 - 10 are completed successfully.
We do issue a RunID as a common identifier to group a particular run of all tasks (1 - 11).

Cloud Tasks only triggers tasks; the only condition you can define is a schedule time. You have to code the check yourself when Task C runs.
Here is an example of the process:
Task A runs and, at the end, writes to Firestore that it has completed.
Task B runs and, at the end, writes to Firestore that it has completed.
Task C starts and checks in Firestore whether A and B have completed.
If not, the task exits with an error.
If yes, it continues the process.
You have to configure the queue of Task C to retry the task in case of error (see the sketch below).
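A minimal sketch of that check, assuming an HTTP-triggered Cloud Function in Python and a hypothetical runs/{run_id} Firestore document where Tasks A and B record their completion (collection and field names are illustrative, not from the original post):

# Hypothetical sketch: the Cloud Function behind Task C. Tasks A and B are
# assumed to set taskA_done / taskB_done on runs/{run_id} when they finish.
from google.cloud import firestore

db = firestore.Client()

def task_c_handler(request):
    run_id = request.get_json()["run_id"]      # shared RunID passed in the task payload
    snapshot = db.collection("runs").document(run_id).get()
    data = snapshot.to_dict() or {}

    if not (data.get("taskA_done") and data.get("taskB_done")):
        # A non-2xx response makes Cloud Tasks retry this task according to
        # the queue's retry configuration.
        return ("dependencies not finished yet", 503)

    # ... Task C's real work goes here ...
    return ("ok", 200)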
Another, more expensive, solution is to use Cloud Composer to handle this workflow.
There is no other solution for workflow management on GCP for now.

Cloud Tasks is not the tool you want to use in this case. Take a look at Cloud Composer, which is built on top of Apache Airflow for GCP.
Edit: You could create a GCF to handle the states of those requests:
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

################ TASK A
taskA_list = [
    "https://via.placeholder.com/400",
    "https://via.placeholder.com/410",
    "https://via.placeholder.com/420",
    "https://via.placeholder.com/430",
    "https://via.placeholder.com/440",
    "https://via.placeholder.com/450",
    "https://via.placeholder.com/460",
    "https://via.placeholder.com/470",
    "https://via.placeholder.com/480",
    "https://via.placeholder.com/490",
]

def call2TaskA(url):
    html = requests.get(url, stream=True)
    return (url, html.status_code)

processes = []
results = []
with ThreadPoolExecutor(max_workers=10) as executor:
    for url in taskA_list:
        processes.append(executor.submit(call2TaskA, url))

isOkayToDoTaskB = True
for taskA in as_completed(processes):
    result = taskA.result()
    if result[1] != 200:  # your validation on taskA
        isOkayToDoTaskB = False
    results.append(result)

if not isOkayToDoTaskB:
    raise ValueError('Problems: {}'.format(results))

################ TASK B
def doTaskB():
    pass

doTaskB()

Related

Managed Workflows with Apache Airflow (MWAA) - how to disable task run dependency on previous run

I have an Apache Airflow managed environment running in which a number of DAGs are defined and enabled. Some DAGs are scheduled, running on a 15 minute schedule, while others are not scheduled. All the DAGs are single-task DAGs. The DAGs are structured in the following way:
level 2 DAGs -> (triggers) level 1 DAG -> (triggers) level 0 DAG
The scheduled DAGs are the level 2 DAGs, while the level 1 and level 0 DAGs are unscheduled. The level 0 DAG uses ECSOperator to call a pre-defined Elastic Container Service (ECS) task, which runs a Python ETL script inside a Docker container defined in the ECS task. The level 2 DAGs wait on the level 1 DAG to complete, which in turn waits on the level 0 DAG to complete. The full Python logs produced by the ETL scripts are visible in the CloudWatch logs from the ECS task runs, while the Airflow task logs only show high-level logging.
The singular tasks in the scheduled DAGs (level 2) have depends_on_past set to False, and I expected that as a result successive scheduled runs of a level 2 DAG would not depend on each other, i.e. that if a particular run failed it would not prevent the next scheduled run from occurring. But what is happening is that Airflow is overriding this and I can clearly see in the UI that a failure of a particular level 2 DAG run is preventing the next run from being selected by the scheduler - the next scheduled run state is being set to None, and I have to manually clear the failed DAG run state before the scheduler can schedule it again.
Why does this happen? As far as I know, there is no Airflow configuration option that should override the task-level setting of False for depends_on_past in the level 2 DAG tasks. Any pointers would be greatly appreciated.
Answering the question "why is this happening?": I understand that the behavior you are observing is explained by the fact that the tasks are being defined with wait_for_downstream = True. The docs state the following about it:
wait_for_downstream (bool) -- when set to true, an instance of task X will wait for tasks immediately downstream of the previous instance of task X to finish successfully or be skipped before it runs. This is useful if the different instances of a task X alter the same asset, and this asset is used by tasks downstream of task X. Note that depends_on_past is forced to True wherever wait_for_downstream is used. Also note that only tasks immediately downstream of the previous task instance are waited for; the statuses of any tasks further downstream are ignored.
Keep in mind that the term previous instances of task X refers to the task_instance of the last scheduled dag_run, not the upstream Task (in a DAG with a daily schedule, that would be the task_instance from "yesterday").
This also explains why your tasks are executed once you clear the state of the previous DAG run.
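For reference, a minimal sketch of a level 2 style task definition that keeps successive runs independent (DAG id, schedule, and task id are illustrative, and this assumes the Airflow 2.x API used by MWAA):

# Minimal sketch: wait_for_downstream=True forces depends_on_past=True, so
# keeping both False leaves successive scheduled runs independent of each other.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="level2_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="*/15 * * * *",   # 15-minute schedule, as in the question
    catchup=False,
) as dag:
    trigger_level1 = DummyOperator(
        task_id="trigger_level1",
        depends_on_past=False,
        wait_for_downstream=False,      # do not wait on the previous run's downstream tasks
    )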
I hope this helps clarify things!

Google AI Platform training - wait for the job to finish

I've built an AI Platform pipeline with a lot of parallel processes. Each process launches a training job on the AI Platform, like this:
gcloud ai-platform jobs submit training ...
Then it has to wait for the job to finish to pass to the next step. For doing this, I've tried to add the parameter --stream-logs to the above command. In this way, it streams all the logs until the job is done.
The problem is, with so many parallel processes, I run out of requests for getting logs:
Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute'
of service 'logging.googleapis.com'
But I do not need to actually stream the logs, I just need a way to tell the process to "wait" until the training job is done. Is there a smarter and simpler way of doing this?
I've just found that I can use the Python API to launch and monitor the job:
import time

from googleapiclient import discovery

training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'n1-standard-8',
    # ...
}

job_name = 'your_job_name'
job_spec = {'jobId': job_name, 'trainingInput': training_inputs}
project_name = 'your-project'
project_id = 'projects/{}'.format(project_name)

cloudml = discovery.build('ml', 'v1')
request = cloudml.projects().jobs().create(
    body=job_spec,
    parent=project_id
)
response = request.execute()
Now I can set up a loop that checks the job state every 60 seconds
state = 'RUNNING'
while state == 'RUNNING':
    time.sleep(60)
    status_req = cloudml.projects().jobs().get(name=f'{project_id}/jobs/{job_name}')
    state = status_req.execute()['state']
    print(state)
Regarding the error message you are experiencing: you are indeed hitting a quota limit for Cloud Logging, and what you can do is request a quota increase.
On the other hand, as for a smarter way to check the status of a job without streaming logs, you can check the status once in a while by running gcloud ai-platform jobs describe <job_name>, or create a Python script to check the status; this is explained in the following documentation.
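For reference, a minimal Python sketch of that "check once in a while" approach, reusing the googleapiclient handle from the snippet above and assuming the standard AI Platform terminal job states (SUCCEEDED, FAILED, CANCELLED):

# Sketch: block until the training job reaches a terminal state, without
# streaming any logs. Assumes cloudml, project_id and job_name are defined
# as in the snippet above.
import time

TERMINAL_STATES = {'SUCCEEDED', 'FAILED', 'CANCELLED'}

def wait_for_job(cloudml, project_id, job_name, poll_seconds=60):
    job_resource = '{}/jobs/{}'.format(project_id, job_name)
    while True:
        job = cloudml.projects().jobs().get(name=job_resource).execute()
        state = job['state']
        print('Job {} is {}'.format(job_name, state))
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)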

Best practice to run Prefect flow serverless in Google Cloud

I have started using Prefect for various projects and now I need to decide which deployment strategy on GCP would work best. Preferably I would like to work serverless. Comparing Cloud Run, Cloud Functions and App Engine, I am inclined to go for the latter since it doesn't have a timeout limit, while the other two have timeouts of 9 and 15 minutes respectively.
Am interested to hear how people have deployed Prefect flows serverlessly, such that Flows are scheduled/triggered for batch processing, whilst the agent is automatically scaled down when not used.
Alternatively, a more classic approach would be to deploy Prefect on Compute Engine and schedule this via Cloud Scheduler. But I feel this is somewhat outdated and doesn't do justice to the functionality of Prefect and flexibility for future development.
Am interested to hear how people have deployed Prefect flows serverlessly, such that Flows are scheduled/triggered for batch processing, whilst the agent is automatically scaled down when not used.
Prefect has a blog post on serverless deployment with AWS Lambda which is a good blueprint for doing the same with GCP. The challenge here is the agent scaling - agents work by polling the backend (whether a self-hosted Prefect Server or the hosted Prefect Cloud) on a regular basis (every ~10 seconds). One possibility that comes to mind would be to use a Cloud Function to spin up an agent in-process, triggered by whatever batch processing/scheduling event you're thinking of. You can also use the --max-polls CLI argument (or the max_polls kwarg) when spinning up the agent to look for runs; it'll tear itself down if it doesn't find anything after however many polling attempts you specify. Details on that are here or on any of the specific agent pages.
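A rough sketch of that idea, assuming Prefect 1.x's LocalAgent with its max_polls option and an HTTP-triggered Cloud Function (the entry point name is illustrative):

# Hypothetical sketch: start a Prefect agent in-process from a Cloud Function.
# With max_polls set, the agent polls the backend a bounded number of times
# and then shuts down, so the function invocation can return.
from prefect.agent.local import LocalAgent

def run_prefect_agent(request):
    agent = LocalAgent(labels=["gcp-serverless"], max_polls=5)
    agent.start()  # returns once the polling budget is exhausted
    return "agent finished polling"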
However, this could be inefficient for long-running flows and you might hit resource caps; it might be worthwhile to look at triggering an auto-scaling Dask cluster deployment if the workloads are high enough. Prefect supports that natively with Kubernetes, and has a Kubernetes agent to interact with your cluster. I think this would be the most elegant and scalable solution without having to go the classic Compute Engine route, which I agree is somewhat dated and doesn't provide great auto-scaling or first-class management.
Better support of serverless execution is on the roadmap, specifically a serverless agent is in the works but I don't have an ETA on when that'll be released.
Hopefully that helps! :)
Recently added to Prefect is the Vertex Agent, which uses GCP Vertex AI, the successor to AI Platform. Vertex AI has a highly configurable serverless execution environment and no timeouts.
The full explanation is here: https://jerryan.medium.com/hacking-ways-to-run-prefect-flow-serverless-in-google-cloud-function-bc6b249126e4.
Basically, there are two hacky ways to solve the problem:
Use Google Cloud Storage to persist task states automatically
Publish the previous execution results of a Cloud Function to its subsequent execution
Caching and Persisting Data
By default, Prefect Core stores all data, results, and cached states in memory within the Python process running the flow. However, they can be persisted to and retrieved from external locations if the necessary hooks are configured.
Prefect has a notion of "checkpointing" that ensures that every time a task runs successfully, its return value is written to persistent storage, based on the result object and target configured for the task.
@task(result=LocalResult(dir="~/.prefect"), target="task.txt")
def func_task():
    return 99
The complete code example is shown below. Here we write to and read from a Google Cloud Storage bucket by using GCSResult.
import os
os.environ["PREFECT__FLOWS__CHECKPOINTING"] = "true"

from prefect import task, Flow
from prefect.engine.results import LocalResult, GCSResult

@task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task1():
    print("Task 1")
    return "Task 1"

@task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task2():
    print("Task 2")
    return "Task 2"

@task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task3():
    print("Task 3")
    return "Task 3"

@task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task4():
    print("Task 4")

@task
def task5():
    print("Task 5")

@task
def task6():
    print("Task 6")

@task
def task7():
    print("Task 7")

@task
def task8():
    print("Task 8")

# with Flow("This is My First Flow", result=LocalResult(dir="~/prefect")) as flow:
with Flow("this is my first flow", result=GCSResult(bucket="prefect")) as flow:
    t1, t2 = task1(), task2()
    t3 = task3(upstream_tasks=[t1, t2])
    t4 = task4(upstream_tasks=[t3])
    t5 = task5(upstream_tasks=[t4])
    t6 = task6(upstream_tasks=[t4])
    t7 = task7(upstream_tasks=[t2, t6])
    t8 = task8(upstream_tasks=[t2, t3])

# run the whole flow
flow_state = flow.run()

# visualize the flow
flow.visualize(flow_state)

# print the state of the flow
print(flow_state.result)
Publish Execution Results
Another hacky solution is to publish the previous execution results of a Google Cloud Function to its subsequent execution. Here, we assume there is no data input/output dependency between tasks.
Some modifications are needed to make this happen:
Set custom state handlers for the tasks
Manually change task states before publishing
Encode/decode the task states
First, we know that flow.run finishes once all tasks have entered a finished state, whether success or failure. However, we don't want all tasks to run inside a single call of the Google Cloud Function, because the total run time may exceed 540 seconds.
So a custom state handler is used for the tasks. Every time a task finishes, we emit an ENDRUN signal to the Prefect framework, which then sets the state of the remaining tasks to Cancelled.
from prefect import task, Flow, Task
from prefect.engine.runner import ENDRUN
from prefect.engine.state import State, Cancelled

num_finished = 0

def my_state_handler(obj, old_state, new_state):
    global num_finished

    if num_finished >= 1:
        raise ENDRUN(state=Cancelled("Flow run is cancelled"))

    if new_state.is_finished():
        num_finished += 1

    return new_state
Second, to make tasks with canceled status execute correctly next time, we must manually change their status to pending.
from typing import Dict

from prefect.engine.state import Pending

def run(task_state_dict: Dict[Task, State]) -> Dict[Task, State]:
    flow_state = flow.run(task_states=task_state_dict)
    task_states = flow_state.result

    # change task states before the next publish
    for t in task_states:
        if isinstance(task_states[t], Cancelled):
            task_states[t] = Pending("Mocked pending")

    # TODO: reset global counter
    global num_finished
    num_finished = 0

    # task states for the next run
    return task_states
Third, there are two essential functions: encode_data and decode_data. The former serializes the task states so they are ready to be published, and the latter deserializes the published data back into task states for the flow object.
# encoding: utf-8
from typing import List, Dict, Any

from prefect.engine.state import State
from prefect import Flow, Task

def decode_data(flow: Flow, data: List[Dict[str, Any]]) -> Dict[Task, State]:
    # data as follows:
    # [
    #     {
    #         "task": {
    #             "slug": "task1"
    #         },
    #         "state": {
    #             "type": "Success",
    #             "message": "Task run succeeded (manually set)"
    #         }
    #     }
    # ]
    task_states = {}
    for d in data:
        tasks_found = flow.get_tasks(d['task']['slug'])
        if len(tasks_found) != 1:  # skip if the slug does not match exactly one task
            continue
        state = State.deserialize(
            {"message": d['state']['message'],
             "type": d['state']['type']
             }
        )
        task_states[tasks_found[0]] = state
    return task_states

def encode_data(task_states: Dict[Task, State]) -> List[Dict[str, Any]]:
    data = []
    for task, state in task_states.items():
        data.append({
            "task": task.serialize(),
            "state": state.serialize()
        })
    return data
Last but not least, the orchestration connects all the parts above.
import sys
from collections import defaultdict

def main(data: List[Dict[str, Any]], *args, **kargs) -> List[Dict[str, Any]]:
    task_states = decode_data(flow, data)
    task_states = run(task_states)
    return encode_data(task_states)

if __name__ == "__main__":
    evt = []
    while True:
        data = main(evt)

        states = defaultdict(set)
        for task in data:
            task_type, slug = task['state']['type'], task['task']['slug']
            states[task_type].add(slug)

        if len(states['Pending']) == 0:
            sys.exit(0)

        evt = data
        # send pubsub message here
        # GooglePubsub().publish(evt)
        # sys.exit(0)

How do I poll a web service from a GAE service in short intervals?

I'm developing a client app that relies on a GAE service. This service needs to get updates by polling a remote web service on a less than 1 minute interval so cron jobs are probably not the way to go here.
From the GAE service I need to poll the web service in intervals of a couple of seconds and then update the client app. So to break it down:
GAE service polls the remote web service in 5 sec intervals.
If a change is made, update the client app instantly.
Step 2 is solved already, but I'm struggling to find a good way on a polling of this sort. I have no control over the remote web service so I can't make any changes on that end.
I've looked at the Task queue API but the documentation specifically says that it is unsuitable for interactive applications where a user is waiting for the result
What would be the best way to solve this issue?
Use cron to schedule a bunch of taskqueue tasks with staggered etas
from google.appengine.ext import deferred

def cron_job():  # scheduled to run every 5 minutes
    for i in xrange(0, 60 * 5, 5):
        deferred.defer(poll_web_service, _countdown=i)

def poll_web_service():
    # do stuff
    pass
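A rough idea of what poll_web_service could look like on the classic GAE Python 2 runtime, using memcache to remember the last response seen (the URL and cache key are illustrative):

# Sketch only: fetch the remote service, compare with the last payload we saw,
# and react when it changes.
from google.appengine.api import memcache, urlfetch

REMOTE_URL = 'https://example.com/status'   # hypothetical remote web service
CACHE_KEY = 'last_remote_payload'

def poll_web_service():
    result = urlfetch.fetch(REMOTE_URL, deadline=10)
    if result.status_code != 200:
        return  # let the next scheduled poll try again

    previous = memcache.get(CACHE_KEY)
    if previous != result.content:
        memcache.set(CACHE_KEY, result.content)
        # change detected: push the update to the client app here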
Alternatively, with this level of frequency, you might as well have a dedicated instance for this. You can do this with a manual-scaling microservice and have the request handler for /_ah/start/ never return, which will let it run forever (aside from periodic restarts). See this: https://cloud.google.com/appengine/docs/standard/python/how-instances-are-managed#instance_scaling
import time

import webapp2
from webapp2_extras.routes import RedirectRoute
from google.appengine.api import taskqueue

def on_change_detected(params):
    queue = taskqueue.Queue('default')
    task = taskqueue.Task(
        url='/some-url-on-your-default-service/',
        countdown=0,
        target='default',
        params={'params': params})
    queue.add(task)

class Start(webapp2.RequestHandler):
    def get(self):
        while True:
            time.sleep(5)
            if change_detected:  # YOUR LOGIC TO DETECT A CHANGE GOES HERE
                on_change_detected()

_routes = [
    RedirectRoute('/_ah/start', Start, name='start'),
]

for r in _routes:
    app.router.add(r)

Airflow : need advices when running a lot of instances per task

This is my 1st post on Stack and it is about Airflow. I need to implement a DAG which will:
1/ Download files from an API
2/ Upload them into Google Cloud Storage
3/ Insert them into BigQuery
The thing is that step 1 involves about 170 accounts to be called. If any error is raised during the download, I want my DAG to automatically retry it from the failed step. Therefore I implemented a loop around my tasks, such as:
dag = DAG('my_dag', default_args=DEFAULT_ARGS)

for account in accounts:
    t1 = PythonOperator(task_id='download_file_' + account['id'],
                        python_callable=download_files(account),
                        dag=my_dag)
    t2 = FileToGoogleCloudStorageOperator(task_id='upload_file_' + account['id'],
                                          google_cloud_storage_conn_id='gcs_my_conn',
                                          src='file_' + account['id'] + '.json',
                                          bucket='my_bucket',
                                          dag=my_dag)
    t3 = GoogleCloudStorageToBigQueryOperator(task_id='insert_bq',
                                              bucket='my_bucket',
                                              google_cloud_storage_conn_id='gcs_my_conn',
                                              bigquery_conn_id='bq_my_conn',
                                              src='file_' + account['id'],
                                              destination_project_dataset_table='my-project:my-dataset.my-table',
                                              source_format='NEWLINE_DELIMITED_JSON',
                                              dag=my_dag)
    t2.set_upstream(t1)
    t3.set_upstream(t2)
So at the UI level, I have about 170 instances of each task displayed. When I run the DAG manually, Airflow is just doing nothing as far as I can see. The DAG doesn't initialize or queue any task instances. I guess this is due to the number of instances involved, but I don't know how I can work around this.
How should I manage so many task instances ?
Thanks,
Alex
How are you running airflow currently? Are you sure the airflow scheduler is running?
You can also run airflow list_dags to ensure the dag can be compiled. If you are running airflow using Celery you should take care that your dag shows up using list_dags on all nodes running airflow.
Alex, it would be easier to post here. I saw you have DEFAULT_ARGS with retries, which applies at the DAG level; you can also set up retries at the task level. retries is defined in BaseOperator, and since all operators inherit from BaseOperator you can use it on any task. You can find more detail here: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/python_operator.py and https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L1864. If you check BaseOperator in models, it has retries and retry_delay, so you can do something like this:
t1 = PythonOperator(task_id='download_file_' + account['id'],
                    python_callable=download_files(account),
                    retries=3,
                    retry_delay=timedelta(seconds=300),
                    dag=my_dag)