Airflow: need advice when running a lot of instances per task - directed-acyclic-graphs

This is my first post on Stack, and it is about Airflow. I need to implement a DAG which will:
1/ Download files from an API
2/ Upload them into Google Cloud Storage
3/ Insert them into BigQuery
The thing is that step 1 involves calling about 170 accounts. If any error is raised during the download, I want my DAG to automatically retry from the failed step. Therefore I implemented a loop around my tasks, such as:
dag = DAG('my_dag', default_args=DEFAULT_ARGS)
for account in accounts:
    t1 = PythonOperator(task_id='download_file_' + account['id'],
                        python_callable=download_files(account),
                        dag=my_dag)
    t2 = FileToGoogleCloudStorageOperator(task_id='upload_file_' + account['id'],
                                          google_cloud_storage_conn_id = 'gcs_my_conn',
                                          src = 'file_' + account['id'] + '.json',
                                          bucket = 'my_bucket',
                                          dag=my_dag)
    t3 = GoogleCloudStorageToBigQueryOperator(task_id='insert_bq',
                                              bucket = 'my_bucket',
                                              google_cloud_storage_conn_id = 'gcs_my_conn',
                                              bigquery_conn_id = 'bq_my_conn',
                                              src = 'file_' + account['id'],
                                              destination_project_dataset_table = 'my-project:my-dataset.my-table',
                                              source_format = 'NEWLINE_DELIMITED_JSON',
                                              dag=my_dag)
    t2.set_upstream(t1)
    t3.set_upstream(t2)
So in the UI, I have about 170 instances of each task displayed. When I run the DAG manually, Airflow does nothing as far as I can see: the DAG does not initialize or queue any task instance. I guess this is due to the number of instances involved, but I don't know how to work around it.
How should I manage so many task instances?
Thanks,
Alex

How are you running Airflow currently? Are you sure the Airflow scheduler is running?
You can also run airflow list_dags to ensure the DAG can be compiled. If you are running Airflow with Celery, make sure your DAG shows up in list_dags on all nodes running Airflow.

Alex, it would be easier to post here. I saw you have DEFAULT_ARGS with retries, which applies at the DAG level; you can also set retries at the task level. The setting lives in BaseOperator, and since every operator inherits from BaseOperator, you can use it on any of them. You can find more detail here: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/python_operator.py and https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L1864. If you check BaseOperator in models.py, it has retries and retry_delay, so you can do something like this:
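from datetime import timedelta
t1 = PythonOperator(task_id='download_file_' + account['id'],
                    python_callable=download_files,  # pass the function itself, not its result
                    op_kwargs={'account': account},  # arguments handed to the callable at run time
                    retries=3,
                    retry_delay=timedelta(seconds=300),
                    dag=my_dag)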
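Two details in the question's loop may be worth checking. First, python_callable=download_files(account) calls download_files while the DAG file is being parsed and hands its return value to the operator, so the downloads would run on every parse instead of inside a task. Second, task_id='insert_bq' is identical on every iteration, while task ids need to be unique within a DAG. The sketch below uses plain Python only (no Airflow import) with made-up account data to illustrate the deferred form; the dictionary keys simply mirror the PythonOperator arguments and are not an Airflow API.

```python
# Stand-in for the question's download function. It records each call so the
# difference between parse time and run time is visible.
calls = []

def download_files(account):
    calls.append(account["id"])
    return "file_" + account["id"] + ".json"

accounts = [{"id": "a1"}, {"id": "a2"}]  # hypothetical sample accounts

# Deferred form: keep the function and its arguments side by side, the way
# python_callable plus op_kwargs would on a PythonOperator.
task_specs = [
    {
        "task_id": "download_file_" + account["id"],
        "python_callable": download_files,
        "op_kwargs": {"account": account},
    }
    for account in accounts
]
parse_time_calls = list(calls)  # still empty: nothing ran while building the specs

# Later, a worker runs each task by calling the stored function.
results = [spec["python_callable"](**spec["op_kwargs"]) for spec in task_specs]

# Every task id carries the account id, so no two tasks share an id.
task_ids = [spec["task_id"] for spec in task_specs]
print(parse_time_calls, results, task_ids)
```

The same idea applies to the third task: appending account['id'] to its task_id would keep the 170 instances distinct.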

Related

How to automatically stop SageMaker notebook instances if they are idle?

I have been looking for a script to automatically stop SageMaker notebook instances that someone forgot to shut down or that are idle. A few scripts I found don't work very well (e.g. link: it only checks whether an .ipynb file is live, and I'm not using .ipynb files; or they use the last-updated time, which never changes until you shut down or open the instance).
Is there a resource or script you can recommend?
You can use the following script to find idle instances. You can modify the script to stop an instance if it has been idle for more than 5 minutes, or run it from a cron job to stop the instance.
import boto3
from datetime import datetime, timezone
last_modified_threshold = 5 * 60  # idle threshold in seconds
sm_client = boto3.client('sagemaker')
response = sm_client.list_notebook_instances()
now = datetime.now(timezone.utc)
for item in response['NotebookInstances']:
    # LastModifiedTime is a timezone-aware datetime; measure how long ago it was
    idle_seconds = (now - item['LastModifiedTime']).total_seconds()
    print(idle_seconds / 60)
    if idle_seconds > last_modified_threshold:
        print('Notebook {0} has been idle for more than {1} minutes'.format(item['NotebookInstanceName'], last_modified_threshold / 60))
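To turn the report into the "stop the instance" step the answer mentions, the selection logic can be separated from the AWS calls. The following is a minimal sketch, assuming the items have the same shape as response['NotebookInstances']; the function name notebooks_to_stop and the sample data are made up for illustration. As the question points out, time since LastModifiedTime is only a rough proxy for idleness.

```python
from datetime import datetime, timedelta, timezone

def notebooks_to_stop(instances, threshold_seconds, now=None):
    """Return names of in-service notebooks last modified more than threshold_seconds ago."""
    now = now or datetime.now(timezone.utc)
    names = []
    for item in instances:
        if item.get("NotebookInstanceStatus") != "InService":
            continue  # skip instances that are not currently running
        age_seconds = (now - item["LastModifiedTime"]).total_seconds()
        if age_seconds > threshold_seconds:
            names.append(item["NotebookInstanceName"])
    return names

# Hypothetical sample shaped like response['NotebookInstances'].
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample = [
    {"NotebookInstanceName": "old", "NotebookInstanceStatus": "InService",
     "LastModifiedTime": now - timedelta(minutes=30)},
    {"NotebookInstanceName": "fresh", "NotebookInstanceStatus": "InService",
     "LastModifiedTime": now - timedelta(minutes=1)},
    {"NotebookInstanceName": "stopped", "NotebookInstanceStatus": "Stopped",
     "LastModifiedTime": now - timedelta(hours=5)},
]
stale = notebooks_to_stop(sample, threshold_seconds=5 * 60, now=now)
print(stale)

# With a real client, each name could then be passed to
# sm_client.stop_notebook_instance(NotebookInstanceName=name).
```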

Schedule Twilio messages on Google Cloud

What I want to achieve is this: once I receive a message via Twilio, I want to schedule a reply to it after exactly 5 minutes. I am using Google Cloud Functions to generate the replies, but I'm not sure how to schedule them. I have gone through Cloud Tasks, Pub/Sub and Scheduler, but I'm still confused about how to achieve it. I am using Python.
What I am thinking of is the following workflow: Twilio -> a cloud function receives the message and sets a task for 5 minutes later -> another cloud function is invoked after 5 minutes. I am stuck on how to schedule it after 5 minutes.
In AWS you would use SQS in combination with delay queues, which makes this very convenient.
Google Cloud Pub/Sub, the equivalent of AWS SQS, doesn't support any sort of delay, so you would need to use Google Cloud Tasks.
When creating a task you can specify a schedule time which identifies the time at which the task should be executed:
scheduleTime string (Timestamp format)
The time when the task is scheduled to be attempted or retried.
Quick example code copy & pasted from the Google documentation leaving out non-relevant bits and pieces:
from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2
import datetime
[...]
client = tasks_v2.CloudTasksClient()
parent = client.queue_path(project, location, queue)
in_seconds = 5*60 # After 5 minutes...
d = datetime.datetime.utcnow() + datetime.timedelta(seconds=in_seconds)
timestamp = timestamp_pb2.Timestamp()
timestamp.FromDatetime(d)
task = {
    "http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "url": url,
    },
    # schedule_time is a field of the task itself, not of http_request
    "schedule_time": timestamp,
}
# Need to add payload, headers and task name as necessary here...
[...]
response = client.create_task(request={"parent": parent, "task": task})
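For the "payload, headers" part that the snippet leaves out, one possible shape is sketched here with the standard library only. It assumes the reply data is JSON; the helper name add_json_payload, the URL, and the payload fields are made up for illustration, while body and headers are fields of the task's http_request.

```python
import json

def add_json_payload(task, payload):
    """Attach a JSON body and a matching header to a task dict's http_request."""
    http_request = task["http_request"]
    http_request["headers"] = {"Content-Type": "application/json"}
    http_request["body"] = json.dumps(payload).encode()  # the body is sent as bytes
    return task

# Stand-in task dict; the real one also carries http_method and schedule_time.
task = {"http_request": {"url": "https://example.com/reply"}}
add_json_payload(task, {"to": "+15550100", "text": "Thanks, got it"})
print(task["http_request"]["body"])
```

The second cloud function would then decode the request body as JSON and send the reply through Twilio.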

Terminate spot instance after each successful job using Gitlab CI

I have been working with Gitlab CI for a couple of days. I have set up an EC2 ASG as the runner, with spot instances.
I wonder whether there is a way to delete the spot instance right after the job succeeds.
Following is the Gitlab runner configuration.
concurrent = 2
check_interval = 3
[session_server]
session_timeout = 1800
[[runners]]
name = "shell-runner"
url = "https://gitlab.com/"
token = "xxxx-xxxx"
executor = "shell"
limit = 1
[runners.custom_build_dir]
[runners.cache]
[runners.cache.s3]
[runners.cache.gcs]
[runners.cache.azure]
[[runners]]
name = "docker-machine-runner"
url = "https://gitlab.com/"
token = "xxxx-xyxyxyxy"
executor = "docker+machine"
limit = 1
[runners.custom_build_dir]
[runners.cache]
[runners.cache.s3]
[runners.cache.gcs]
[runners.cache.azure]
[runners.docker]
tls_verify = false
image = "docker:latest"
privileged = false
disable_entrypoint_overwrite = false
oom_kill_disable = false
disable_cache = false
volumes = ["/cache"]
shm_size = 0
[runners.machine]
IdleCount = 0
IdleTime = 1800
MaxBuilds = 100
MachineDriver = "amazonec2"
MachineName = "gitlab-docker-machine-%s"
MachineOptions = [
"amazonec2-region=us-west-2",
"amazonec2-ssh-user=ubuntu",
"amazonec2-vpc-id=vpc-xxxx",
"amazonec2-subnet-id=subnet-xxx",
"amazonec2-use-private-address=true",
"amazonec2-instance-type=t3a.medium",
"amazonec2-ami=ami-xxx",
"amazonec2-zone=a",
"amazonec2-security-group=gitlab-runner-sg",
"amazonec2-request-spot-instance=true",
"amazonec2-spot-price=0.025"
]
I have two runners in the above configuration, i.e. shell and docker-machine.
Currently, it's not deleting the spot fleet at all, and if I set the amazonec2-block-duration-minutes=20 flag, I guess it keeps the spot instance for 20 minutes and removes it after that.
I'm looking for a solution such that the spot instance gets deleted after each successful job, and/or waits for some time for other jobs before being terminated.
In the above docker-machine-runner, what configuration change is required to achieve this?
Or is there any other automation that can make it happen?
Let me know if you need more information.
Thanks in advance.
I answered a similar question yesterday with how I monitor my runners platform. The full post is here: Start build on a windows ec2 with gitlab runner
What I do is have a small app running somewhere that continuously polls various Gitlab APIs to retrieve the number of pending jobs and the number of available runners. Then, based on some thresholds I've defined, I either increase or decrease the number of runners, up to a threshold.
For example, let's say I want a max of 5 runners at a time, and I want no more than 5 jobs in the queue before I increase the number of runners (up to 5). Here's what I'd do (copied from linked post):
Hit the Runners API to get number of runners.
Hit the Projects API to get all projects and filter it to only keep projects that have pipelines/CI enabled. You can store this number too so you don't have to make this call each time.
Hit Pipelines API for each project to get all pending/in progress Pipelines.
Hit Jobs API for each pipeline to get all pending jobs.
If there are no runners but there are pending jobs, add 1 runner.
If there's more than 1 runner but fewer than the max runners, and more pending jobs than the jobs threshold, add 1 runner, up to the max.
If there's 1 or more runners and no pending jobs, destroy a sleeping runner (and deregister it from Gitlab so the Runners API call remains clean).
Sleep for a minute or two, then loop from the top.
The various APIs used are:
Projects API: https://docs.gitlab.com/ee/api/projects.html
Pipelines API: https://docs.gitlab.com/ee/api/pipelines.html
Jobs API: https://docs.gitlab.com/ee/api/jobs.html
Runners API: https://docs.gitlab.com/ee/api/runners.html
Whatever you need to create and destroy instances in your cloud provider.
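The decision part of that loop can be sketched as a small pure function, leaving the Gitlab API calls and the cloud provider calls out. This is only an illustration: the function name scaling_action and the return values are made up, the defaults follow the example of 5 runners and 5 queued jobs, and the second rule is read here as "at least one runner", which is an assumption.

```python
def scaling_action(runners, pending_jobs, max_runners=5, jobs_threshold=5):
    """Return 'add', 'remove', or 'hold' for one pass of the polling loop."""
    if runners == 0:
        # No runners: start one only if something is waiting.
        return "add" if pending_jobs > 0 else "hold"
    if pending_jobs == 0:
        # Runners exist but nothing is queued: retire one.
        return "remove"
    if runners < max_runners and pending_jobs > jobs_threshold:
        # The queue is longer than the threshold and there is room to grow.
        return "add"
    return "hold"

# Walk through a few snapshots of (runners, pending_jobs).
for runners, pending in [(0, 3), (2, 6), (5, 10), (2, 0)]:
    print(runners, pending, scaling_action(runners, pending))
```

The surrounding app would call this once per iteration, act on the result, then sleep before polling again.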

Google AI Platform training - wait for the job to finish

I've built an AI Platform pipeline with a lot of parallel processes. Each process launches a training job on the AI Platform, like this:
gcloud ai-platform jobs submit training ...
Then it has to wait for the job to finish before moving on to the next step. To do this, I've tried adding the --stream-logs parameter to the above command, so that it streams all the logs until the job is done.
The problem is, with so many parallel processes, I run out of requests for getting logs:
Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute'
of service 'logging.googleapis.com'
But I do not need to actually stream the logs, I just need a way to tell the process to "wait" until the training job is done. Is there a smarter and simpler way of doing this?
I've just found that I can use the Python API to launch and monitor the job:
from googleapiclient import discovery
training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'n1-standard-8',
    ...
}
job_spec = {'jobId': 'your_job_name', 'trainingInput': training_inputs}
project_name = 'your-project'
project_id = 'projects/{}'.format(project_name)
cloudml = discovery.build('ml', 'v1')
request = cloudml.projects().jobs().create(
    body=job_spec,
    parent=project_id
)
response = request.execute()
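Now I can set up a loop that checks the job state every 60 seconds:
import time
job_name = job_spec['jobId']
state = 'QUEUED'
# Keep polling until the job reaches a terminal state, not only while it is RUNNING
while state not in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
    time.sleep(60)
    status_req = cloudml.projects().jobs().get(name=f'{project_id}/jobs/{job_name}')
    state = status_req.execute()['state']
    print(state)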
Regarding the error message, you are indeed exceeding a Cloud Logging quota; you can request a quota increase.
As for a smarter way to check the status of a job without streaming logs, you can poll the status once in a while by running gcloud ai-platform jobs describe <job_name>, or create a Python script to check the status; this is explained in the following documentation.
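A status-checking script along those lines could look like the sketch below. It is deliberately independent of any Google client: get_state is a caller-supplied function (an assumption of this sketch) that returns the current job state string, for example by wrapping the jobs().get(...) call or the gcloud describe command. The helper name wait_for_job and the timeout option are also made up for illustration.

```python
import time

# Terminal job states; QUEUED, PREPARING, RUNNING and CANCELLING mean "keep waiting".
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}

def wait_for_job(get_state, poll_seconds=60, timeout_seconds=None, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state, then return that state."""
    waited = 0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if timeout_seconds is not None and waited >= timeout_seconds:
            raise TimeoutError(f"job still {state} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds

# Dry run with a scripted sequence of states instead of a real API call;
# the sleep is replaced by a no-op so the example finishes immediately.
states = iter(["QUEUED", "PREPARING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job(lambda: next(states), poll_seconds=60, sleep=lambda s: None)
print(final_state)
```

One status request per minute per job should put far less load on the API than streaming logs, and the caller can branch on the returned state to decide whether the next pipeline step should run.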

Cloud Tasks Conditional Execution

I am using Cloud Tasks. I need to trigger the execution of Task C only when Task A and Task B have completed successfully, so I need some way of reading, or being notified of, the statuses of the tasks that were triggered. But I see no way of doing this in GCP's documentation. I'm using the Node.js SDK to create tasks and Cloud Functions as task handlers, if that helps at all.
Edit:
As requested, here is more info on what we are doing:
Tasks 1 - 10 each make HTTP requests, fetch data, and update individual collections in Firestore based on this data. These 10 tasks can run in parallel and in no particular order, as they have no dependency on each other. All of these tasks are implemented inside GCF.
Task 11 actually depends on the Firestore collection data updated by Tasks 1 - 10. So it can only run after Tasks 1 - 10 are completed successfully.
We do issue a RunID as a common identifier to group a particular run of all tasks (1 - 11).
Cloud Tasks only triggers tasks; the only condition you can define is a time. You have to code the check manually for when Task C runs.
Here is an example process:
Task A runs and, at the end, writes to Firestore that it has completed
Task B runs and, at the end, writes to Firestore that it has completed
Task C starts and checks in Firestore whether A and B have completed.
If not, the task exits with an error
If yes, it continues the process
You have to configure Task C's queue to retry the task in case of error.
Another, more expensive, solution is to use Cloud Composer to handle this workflow.
There is no other solution for workflow management for now.
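The check inside Task C's handler can be sketched as follows. This is an illustration only: the task names, the function name handle_task_c, and the plain dict standing in for a per-run Firestore document (keyed by the RunID, for example) are all assumptions. The one behaviour it relies on is that Cloud Tasks treats a non-2xx HTTP response as a failed attempt and retries according to the queue's retry settings.

```python
REQUIRED_TASKS = ("task_a", "task_b")  # hypothetical names of the upstream tasks

def handle_task_c(run_status):
    """Decide whether Task C may run.

    run_status maps task name -> True once that task has recorded completion,
    as it might be read from a per-run Firestore document.
    Returns an (http_status, message) pair for the task handler to send back.
    """
    missing = [name for name in REQUIRED_TASKS if not run_status.get(name)]
    if missing:
        # A non-2xx response marks the attempt as failed, so the queue retries it later.
        return 503, "waiting for: " + ", ".join(missing)
    # ... Task C's real work would go here ...
    return 200, "task C done"

print(handle_task_c({"task_a": True}))
print(handle_task_c({"task_a": True, "task_b": True}))
```

For the 10-task case in the question, REQUIRED_TASKS would list all ten upstream tasks, and the queue's retry limits would bound how long Task 11 keeps waiting.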
Cloud Tasks is not the tool you want to use in this case. Take a look at Cloud Composer, which is built on top of Apache Airflow for GCP.
Edit: You could create a GCF to handle the states of those requests:
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
################ TASK A
taskA_list = [
    "https://via.placeholder.com/400",
    "https://via.placeholder.com/410",
    "https://via.placeholder.com/420",
    "https://via.placeholder.com/430",
    "https://via.placeholder.com/440",
    "https://via.placeholder.com/450",
    "https://via.placeholder.com/460",
    "https://via.placeholder.com/470",
    "https://via.placeholder.com/480",
    "https://via.placeholder.com/490",
]
def call2TaskA(url):
    html = requests.get(url, stream=True)
    return (url, html.status_code)
processes = []
results = []
with ThreadPoolExecutor(max_workers=10) as executor:
    for url in taskA_list:
        processes.append(executor.submit(call2TaskA, url))
isOkayToDoTaskB = True
for taskA in as_completed(processes):
    result = taskA.result()
    if result[1] != 200: # your validation on taskA
        isOkayToDoTaskB = False
    results.append(result)
if not isOkayToDoTaskB:
    raise ValueError('Problems: {}'.format(results))
################ TASK B
def doTaskB():
    pass
doTaskB()