I have been working with GitLab CI for a couple of days. I have set up an EC2 auto scaling group as a runner backed by spot instances.
I am wondering whether there is any way to delete the spot instance right after a job succeeds.
Following is the gitlab runner configuration.
concurrent = 2
check_interval = 3

[session_server]
  session_timeout = 1800

[[runners]]
  name = "shell-runner"
  url = "https://gitlab.com/"
  token = "xxxx-xxxx"
  executor = "shell"
  limit = 1
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]

[[runners]]
  name = "docker-machine-runner"
  url = "https://gitlab.com/"
  token = "xxxx-xyxyxyxy"
  executor = "docker+machine"
  limit = 1
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "docker:latest"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.machine]
    IdleCount = 0
    IdleTime = 1800
    MaxBuilds = 100
    MachineDriver = "amazonec2"
    MachineName = "gitlab-docker-machine-%s"
    MachineOptions = [
      "amazonec2-region=us-west-2",
      "amazonec2-ssh-user=ubuntu",
      "amazonec2-vpc-id=vpc-xxxx",
      "amazonec2-subnet-id=subnet-xxx",
      "amazonec2-use-private-address=true",
      "amazonec2-instance-type=t3a.medium",
      "amazonec2-ami=ami-xxx",
      "amazonec2-zone=a",
      "amazonec2-security-group=gitlab-runner-sg",
      "amazonec2-request-spot-instance=true",
      "amazonec2-spot-price=0.025"
    ]
I have two runners in the above configuration, i.e., shell and docker-machine.
Currently, it is not deleting the spot instances at all. If I set the amazonec2-block-duration-minutes=20 flag, I believe it keeps the spot instance for 20 minutes and removes it after that.
I'm looking for a solution where the spot instance gets deleted after each job succeeds, or waits for a while for other jobs and then gets terminated.
In the docker-machine-runner above, what configuration change is required to achieve this?
Or is there any other automation that could make it happen?
Let me know if more information is required.
Thanks in advance.
I answered a similar question yesterday describing how I monitor my runner platform. The full post is here: Start build on a windows ec2 with gitlab runner
What I do is run a small app somewhere that continuously polls various GitLab APIs to retrieve the number of pending jobs and the number of available runners. Then, based on some thresholds I've defined, I either increase or decrease the number of runners, up to a maximum.
For example, let's say I want a max of 5 runners at a time, and I want no more than 5 jobs in the queue before I increase the number of runners (up to 5). Here's what I'd do (copied from linked post):
Hit the Runners API to get number of runners.
Hit the Projects API to get all projects and filter it to only keep projects that have pipelines/CI enabled. You can store this number too so you don't have to make this call each time.
Hit Pipelines API for each project to get all pending/in progress Pipelines.
Hit Jobs API for each pipeline to get all pending jobs.
If there are no runners but there are pending jobs, add 1 runner.
If there's more than 1 runner but fewer than the max runners, and more pending jobs than the jobs threshold, add 1 runner, up to the max.
If there's 1 or more runners and no pending jobs, destroy a sleeping runner (and deregister it from Gitlab so the Runners API call remains clean).
sleep for a minute or two, then loop from the top.
The various APIs used are listed below (a rough sketch of the polling loop follows the list):
Projects API: https://docs.gitlab.com/ee/api/projects.html
Pipelines API: https://docs.gitlab.com/ee/api/pipelines.html
Jobs API: https://docs.gitlab.com/ee/api/jobs.html
Runners API: https://docs.gitlab.com/ee/api/runners.html
Whatever you need to create and destroy instances in your cloud provider.
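If it helps, here is a rough, untested Python sketch of that polling loop. It assumes a GitLab access token in a GITLAB_TOKEN environment variable, a list of CI-enabled project IDs you have already collected, and scale_up()/scale_down() placeholders that you would implement against your cloud provider (they are not real APIs):
import os
import time
import requests

GITLAB = "https://gitlab.com/api/v4"
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}
MAX_RUNNERS = 5
JOB_THRESHOLD = 5

def count_online_runners():
    # Runners API: runners registered to the token's owner.
    resp = requests.get(f"{GITLAB}/runners", headers=HEADERS)
    resp.raise_for_status()
    return sum(1 for r in resp.json() if r.get("online"))

def count_pending_jobs(project_ids):
    # Jobs API, per CI-enabled project, filtered to pending jobs.
    pending = 0
    for pid in project_ids:
        resp = requests.get(f"{GITLAB}/projects/{pid}/jobs",
                            headers=HEADERS, params={"scope": "pending"})
        resp.raise_for_status()
        pending += len(resp.json())
    return pending

def monitor(project_ids):
    while True:
        runners = count_online_runners()
        pending = count_pending_jobs(project_ids)
        if runners == 0 and pending > 0:
            scale_up()        # placeholder: create 1 instance and register a runner
        elif runners < MAX_RUNNERS and pending > JOB_THRESHOLD:
            scale_up()
        elif runners > 0 and pending == 0:
            scale_down()      # placeholder: destroy an idle runner and deregister it
        time.sleep(120)       # sleep a minute or two, then loop from the top
The exact thresholds and sleep interval are whatever makes sense for your workload.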
Related
I have an MWAA Airflow environment in my AWS account. The DAG I am setting up is supposed to read massive data from S3 bucket A, filter what I want, and dump the filtered results into S3 bucket B. It needs to read every minute since the data is coming in every minute. Every run processes about 200 MB of JSON data.
My initial setup used environment class mw1.small with 10 worker machines. If I run the task only once in this setting, each run takes about 8 minutes to finish, but when I start the every-minute schedule, most runs cannot finish, start to take much longer (around 18 minutes), and display the error message:
[2021-09-25 20:33:16,472] {{local_task_job.py:102}} INFO - Task exited with return code Negsignal.SIGKILL
I tried expanding the environment class to mw1.large with 15 workers; more jobs were able to complete before the error showed up, but it still could not keep up with the every-minute ingestion. The Negsignal.SIGKILL error would still appear before even reaching the worker machine maximum.
At this point, what should I do to scale this? I can imagine opening another Airflow environment, but that does not really make sense. There must be a way to do it within one environment.
I've found the solution to this. For MWAA, edit the environment and, under Airflow configuration options, set these configs:
celery.sync_parallelism = 1
celery.worker_autoscale = 1,1
This makes sure each worker machine runs 1 job at a time, preventing multiple jobs from sharing a worker, which saves memory and reduces runtime.
I've built an AI Platform pipeline with a lot of parallel processes. Each process launches a training job on the AI Platform, like this:
gcloud ai-platform jobs submit training ...
Then it has to wait for the job to finish before passing to the next step. To do this, I tried adding the --stream-logs parameter to the above command. This way, it streams all the logs until the job is done.
The problem is, with so many parallel processes, I run out of requests for getting logs:
Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute'
of service 'logging.googleapis.com'
But I do not need to actually stream the logs, I just need a way to tell the process to "wait" until the training job is done. Is there a smarter and simpler way of doing this?
I've just found that I can use the Python API to launch and monitor the job:
from googleapiclient import discovery
import time

job_name = 'your_job_name'
project_name = 'your-project'
project_id = 'projects/{}'.format(project_name)

training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'n1-standard-8',
    ...
}
job_spec = {'jobId': job_name, 'trainingInput': training_inputs}

cloudml = discovery.build('ml', 'v1')
request = cloudml.projects().jobs().create(
    body=job_spec,
    parent=project_id
)
response = request.execute()
Now I can set up a loop that checks the job state every 60 seconds:
state = 'RUNNING'
while state == 'RUNNING':
    time.sleep(60)
    status_req = cloudml.projects().jobs().get(name=f'{project_id}/jobs/{job_name}')
    state = status_req.execute()['state']
    print(state)
Regarding the error message you are experiencing, you are indeed exceeding a Cloud Logging quota; what you can do is request a quota increase.
On the other hand, regarding a smarter way to check the status of a job without streaming logs, you can check the status once in a while by running gcloud ai-platform jobs describe <job_name>, or create a Python script to check the status; this is explained in the following documentation.
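For instance, a minimal polling sketch (assuming the google-api-python-client library, the same one used in the answer above) that waits for the job to reach a terminal state rather than only checking for RUNNING could look like this:
from googleapiclient import discovery
import time

def wait_for_job(project_name, job_name, poll_seconds=60):
    # AI Platform jobs can sit in QUEUED or PREPARING before RUNNING,
    # so wait until the job reaches one of the terminal states.
    terminal_states = {'SUCCEEDED', 'FAILED', 'CANCELLED'}
    cloudml = discovery.build('ml', 'v1')
    job_id = 'projects/{}/jobs/{}'.format(project_name, job_name)
    while True:
        job = cloudml.projects().jobs().get(name=job_id).execute()
        if job['state'] in terminal_states:
            return job['state']
        time.sleep(poll_seconds)

# e.g. final_state = wait_for_job('your-project', 'your_job_name')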
I have started using Prefect for various projects and now I need to decide which deployment strategy on GCP would work best. Preferably I would like to work serverless. Comparing Cloud Run, Cloud Functions and App Engine, I am inclined to go for the latter since it doesn't have a timeout limit, while the other two have timeouts of 9 and 15 minutes respectively.
Am interested to hear how people have deployed Prefect flows serverlessly, such that Flows are scheduled/triggered for batch processing, whilst the agent is automatically scaled down when not used.
Alternatively, a more classic approach would be to deploy Prefect on Compute Engine and schedule this via Cloud Scheduler. But I feel this is somewhat outdated and doesn't do justice to the functionality of Prefect and flexibility for future development.
Am interested to hear how people have deployed Prefect flows serverlessly, such that Flows are scheduled/triggered for batch processing, whilst the agent is automatically scaled down when not used.
Prefect has a blog post on serverless deployment with AWS Lambda which is a good blueprint for doing the same with GCP. The challenge here is agent scaling - agents work by polling the backend (whether a self-hosted deployment of Prefect Server or the hosted Prefect Cloud) on a regular basis (every ~10 secs). One possibility that comes to mind would be to use a Cloud Function to spin up an agent in-process, triggered by whatever batch processing/scheduling event you're thinking of. You can also use the --max-polls CLI argument or kwarg when spinning up the agent to look for runs; it'll tear itself down if it doesn't find anything after however many polling attempts you specify. Details on that are here or on any of the specific agent pages.
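As a rough illustration only (assuming Prefect 1.x and that the function's runtime is already authenticated against Prefect Cloud or Server), an in-process agent inside a Cloud Function might look something like this:
from prefect.agent.local import LocalAgent

def run_agent(request):
    # Hypothetical Cloud Function entry point. The agent polls for
    # scheduled flow runs and shuts itself down after max_polls polls
    # with nothing to do, so the function doesn't run indefinitely.
    agent = LocalAgent(labels=["serverless"], max_polls=5)
    agent.start()
    return "agent exited after polling"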
However, this could be inefficient for long-running flows and you might hit resource caps; it might be worthwhile to look at triggering an auto-scaling Dask cluster deployment if the workloads are high enough. Prefect supports that natively with Kubernetes, and has a Kubernetes agent to interact with your cluster. I think this would be the most elegant and scalable solution without having to go the classic Compute Engine route, which I agree is somewhat dated and doesn't provide great auto-scaling or first-class management.
Better support for serverless execution is on the roadmap; specifically, a serverless agent is in the works, but I don't have an ETA on when that'll be released.
Hopefully that helps! :)
Recently added to Prefect is the Vertex Agent, which uses GCP Vertex AI, the successor to AI Platform. Vertex has a highly configurable serverless execution environment and no timeouts.
The full explanation is here: https://jerryan.medium.com/hacking-ways-to-run-prefect-flow-serverless-in-google-cloud-function-bc6b249126e4.
Basically, there are two hacky ways to solve the problem:
Use Google Cloud Storage to persist task states automatically
Publish the previous execution results of a Cloud Function to its subsequent execution.
Caching and Persisting Data
By default, Prefect Core stores all data, results, and cached states in memory within the Python process running the flow. However, they can be persisted to and retrieved from external locations if the necessary hooks are configured.
Prefect has a notion of "checkpointing" that ensures that every time a task runs successfully, its return value is written to persistent storage, based on the result object and target configured for the task.
@task(result=LocalResult(dir="~/.prefect"), target="task.txt")
def func_task():
    return 99
The complete code example is shown below. Here we write to and read from a Google Cloud Bucket by using GCSResult.
import os
os.environ["PREFECT__FLOWS__CHECKPOINTING"] = "true"

from prefect import task, Flow
from prefect.engine.results import LocalResult, GCSResult

@task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task1():
    print("Task 1")
    return "Task 1"

@task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task2():
    print("Task 2")
    return "Task 2"

@task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task3():
    print("Task 3")
    return "Task 3"

@task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task4():
    print("Task 4")

@task
def task5():
    print("Task 5")

@task
def task6():
    print("Task 6")

@task
def task7():
    print("Task 7")

@task
def task8():
    print("Task 8")

# with Flow("This is My First Flow", result=LocalResult(dir="~/prefect")) as flow:
with Flow("this is my first flow", result=GCSResult(bucket="prefect")) as flow:
    t1, t2 = task1(), task2()
    t3 = task3(upstream_tasks=[t1, t2])
    t4 = task4(upstream_tasks=[t3])
    t5 = task5(upstream_tasks=[t4])
    t6 = task6(upstream_tasks=[t4])
    t7 = task7(upstream_tasks=[t2, t6])
    t8 = task8(upstream_tasks=[t2, t3])

# run the whole flow
flow_state = flow.run()

# visualize the flow
flow.visualize(flow_state)

# print the state of the flow
print(flow_state.result)
Publish Execution Results
Another hacky solution is to publish the previous execution results of a Google Cloud Function to its subsequent execution. Here, we assume there is no data input/output dependence between tasks.
Some modifications are needed to make it happen.
Add custom state handlers for the tasks
Manually change the task state before publishing
Encode/decode the task state
First, we know the flow.run function finishes after all the tasks have entered a finished state, whether success or failure. However, we don't want all tasks to run inside a single call of the Google Cloud Function because the total run time may exceed 540 seconds.
So a custom state handler for the tasks is used. Every time a task finishes, we emit an ENDRUN signal to the Prefect framework, which then sets the state of the remaining tasks to Cancelled.
from prefect import task, Flow, Task
from prefect.engine.runner import ENDRUN
from prefect.engine.state import State, Cancelled

num_finished = 0

def my_state_handler(obj, old_state, new_state):
    global num_finished

    if num_finished >= 1:
        raise ENDRUN(state=Cancelled("Flow run is cancelled"))

    if new_state.is_finished():
        num_finished += 1

    return new_state
Second, to make tasks with Cancelled status execute correctly the next time, we must manually change their status to Pending.
from typing import Dict
from prefect.engine.state import Pending

def run(task_state_dict: Dict[Task, State]) -> Dict[Task, State]:
    flow_state = flow.run(task_states=task_state_dict)
    task_states = flow_state.result

    # change task state before the next publish
    for t in task_states:
        if isinstance(task_states[t], Cancelled):
            task_states[t] = Pending("Mocked pending")

    # TODO: reset global counter
    global num_finished
    num_finished = 0

    # task states for the next run
    return task_states
Third, there are two essential functions: encode_data and decode_data. The former serializes the task states so they are ready to be published, and the latter deserializes them back into task states for the flow object.
# encoding: utf-8
from typing import List, Dict, Any

from prefect.engine.state import State
from prefect import Flow, Task

def decode_data(flow: Flow, data: List[Dict[str, Any]]) -> Dict[Task, State]:
    # data looks like this:
    # [
    #     {
    #         "task": {
    #             "slug": "task1"
    #         },
    #         "state": {
    #             "type": "Success",
    #             "message": "Task run succeeded (manually set)"
    #         }
    #     }
    # ]
    task_states = {}
    for d in data:
        tasks_found = flow.get_tasks(d['task']['slug'])
        if len(tasks_found) != 1:  # skip if the slug does not match exactly one task
            continue
        state = State.deserialize(
            {"message": d['state']['message'],
             "type": d['state']['type']}
        )
        task_states[tasks_found[0]] = state
    return task_states

def encode_data(task_states: Dict[Task, State]) -> List[Dict[str, Any]]:
    data = []
    for task, state in task_states.items():
        data.append({
            "task": task.serialize(),
            "state": state.serialize()
        })
    return data
Last but not least, the orchestration connects all the parts above.
import sys
from collections import defaultdict
from typing import List, Dict, Any

def main(data: List[Dict[str, Any]], *args, **kwargs) -> List[Dict[str, Any]]:
    task_states = decode_data(flow, data)
    task_states = run(task_states)
    return encode_data(task_states)

if __name__ == "__main__":
    evt = []
    while True:
        data = main(evt)

        states = defaultdict(set)
        for task in data:
            task_type, slug = task['state']['type'], task['task']['slug']
            states[task_type].add(slug)

        if len(states['Pending']) == 0:
            sys.exit(0)

        evt = data
        # send a pubsub message here
        # GooglePubsub().publish(evt)
        # sys.exit(0)
I am using Cloud Tasks. I need to trigger the execution of Task C only when Task A and Task B have completed successfully. So I need some way of reading, or being notified of, the statuses of the tasks I trigger, but I see no way of doing this in GCP's documentation. I'm using the Node.js SDK to create tasks and Cloud Functions as task handlers, if that helps.
Edit:
As requested, here is more info on what we are doing:
Tasks 1 - 10 each make HTTP requests, fetch data, update individual collections in Firestore based on this data. These 10 tasks can run in parallel and in no particular order as they don't have any dependency on each other. All of these tasks are actually implemented inside GCF.
Task 11 actually depends on the Firestore collection data updated by Tasks 1 - 10. So it can only run after Tasks 1 - 10 are completed successfully.
We do issue a RunID as a common identifier to group a particular run of all tasks (1 - 11).
Cloud Tasks only triggers tasks; you can only define time conditions. You have to code the check manually when task C runs.
Here is an example of the process (see the sketch after this list):
Task A runs; at the end, the task writes to Firestore that it has completed
Task B runs; at the end, the task writes to Firestore that it has completed
Task C starts and checks in Firestore whether A and B have completed.
If not, the task exits with an error
If yes, it continues the process
You have to configure your task C queue to retry the task in case of error.
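As a rough sketch only (the task_runs collection, the field names, and the handler name are hypothetical, and it assumes the google-cloud-firestore Python client), the task C handler could look something like this:
from google.cloud import firestore

db = firestore.Client()

def task_c_handler(request):
    # Hypothetical HTTP handler for task C; expects the RunID in the request body.
    run_id = request.get_json()["run_id"]
    status = db.collection("task_runs").document(run_id).get().to_dict() or {}

    if not (status.get("taskA") == "completed" and status.get("taskB") == "completed"):
        # Return a non-2xx status so Cloud Tasks retries task C later,
        # according to the queue's retry configuration.
        return "upstream tasks not finished yet", 500

    # ... task C's actual work goes here ...
    return "ok", 200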
Another, more expensive, solution is to use Cloud Composer to handle this workflow.
There is no other workflow-management solution for now.
Cloud Tasks is not the tool you want to use in this case. Take a look at Cloud Composer, which is built on top of Apache Airflow for GCP.
Edit: You could create a GCF to handle the states of those requests
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

################ TASK A
taskA_list = [
    "https://via.placeholder.com/400",
    "https://via.placeholder.com/410",
    "https://via.placeholder.com/420",
    "https://via.placeholder.com/430",
    "https://via.placeholder.com/440",
    "https://via.placeholder.com/450",
    "https://via.placeholder.com/460",
    "https://via.placeholder.com/470",
    "https://via.placeholder.com/480",
    "https://via.placeholder.com/490",
]

def call2TaskA(url):
    html = requests.get(url, stream=True)
    return (url, html.status_code)

processes = []
results = []
with ThreadPoolExecutor(max_workers=10) as executor:
    for url in taskA_list:
        processes.append(executor.submit(call2TaskA, url))

isOkayToDoTaskB = True
for taskA in as_completed(processes):
    result = taskA.result()
    if result[1] != 200:  # your validation on taskA
        isOkayToDoTaskB = False
    results.append(result)

if not isOkayToDoTaskB:
    raise ValueError('Problems: {}'.format(results))

################ TASK B
def doTaskB():
    pass

doTaskB()
This is my first post on Stack and it is about Airflow. I need to implement a DAG which will:
1/ Download files from an API
2/ Upload them into Google Cloud Storage
3/ Insert them into BigQuery
The thing is that step 1 involves about 170 accounts to be called. If any error is raised during the download, I want my DAG to automatically retry from the failed step. Therefore I implemented a loop over my tasks, such as:
my_dag = DAG('my_dag', default_args=DEFAULT_ARGS)

for account in accounts:
    t1 = PythonOperator(task_id='download_file_' + account['id'],
                        python_callable=download_files(account),
                        dag=my_dag)
    t2 = FileToGoogleCloudStorageOperator(task_id='upload_file_' + account['id'],
                                          google_cloud_storage_conn_id='gcs_my_conn',
                                          src='file_' + account['id'] + '.json',
                                          bucket='my_bucket',
                                          dag=my_dag)
    t3 = GoogleCloudStorageToBigQueryOperator(task_id='insert_bq',
                                              bucket='my_bucket',
                                              google_cloud_storage_conn_id='gcs_my_conn',
                                              bigquery_conn_id='bq_my_conn',
                                              src='file_' + account['id'],
                                              destination_project_dataset_table='my-project:my-dataset.my-table',
                                              source_format='NEWLINE_DELIMITED_JSON',
                                              dag=my_dag)
    t2.set_upstream(t1)
    t3.set_upstream(t2)
So at the UI level, I have about 170 instances of each task displayed. When I run the DAG manually, Airflow is just doing nothing as far as I can see. The DAG doesn't initialize or queue any task instances. I guess this is due to the number of instances involved, but I don't know how I can work around this.
How should I manage so many task instances ?
Thanks,
Alex
How are you running airflow currently? Are you sure the airflow scheduler is running?
You can also run airflow list_dags to ensure the DAG can be compiled. If you are running Airflow using Celery, you should make sure that your DAG shows up in list_dags on all nodes running Airflow.
Alex, it would be easier to post here. I saw you have DEFAULT_ARGS with retries, which is at the DAG level; you can also set up retries at the task level. It is in BaseOperator, and since all operators inherit from BaseOperator, you can use it there. You can find more detail here: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/python_operator.py and https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L1864. If you check BaseOperator in models, it has retries and retry_delay, and you can do something like this:
from datetime import timedelta

t1 = PythonOperator(task_id='download_file_' + account['id'],
                    python_callable=download_files(account),
                    retries=3,
                    retry_delay=timedelta(seconds=300),
                    dag=my_dag)