AWS Win Tasks aren't running after server restart via Lambda - amazon-web-services

I have written a simple scheduled task that turns the AWS Windows server off, and a simple Lambda function to turn the stopped instances back on; see the code:
import time
import json
import boto3

def lambda_handler(event, context):
    # boto3 clients
    client = boto3.client('ec2')
    ssm = boto3.client('ssm')
    # getting instance information
    describeInstance = client.describe_instances()
    # print("eddie", describeInstance["Reservations"][0]["Instances"][0]["State"]["Name"])
    InstanceId = []
    # fetching the instance ids of the stopped instances
    for i in describeInstance['Reservations']:
        for instance in i['Instances']:
            if instance["State"]["Name"] == "stopped":
                InstanceId.append(instance['InstanceId'])
    if InstanceId:  # start_instances fails on an empty list
        client.start_instances(InstanceIds=InstanceId)
    """looping through instance ids
    for instanceid in InstanceId:
        # command to be executed on instance
        client.start_instances(InstanceIds=InstanceId)
        print(output)"""
    return {
        'statusCode': 200,
        'body': json.dumps('Thanks from Srce Cde!')
    }
I have several tasks in Task Scheduler that run Python scripts, but after the instance is started this way those tasks aren't running, even though the server is running.
Note: I have tried setting "Run whether user is logged on or not" on all of them, but received (0x1) errors related to privileges on those tasks.
Does anyone know how to solve this? Is there another way to turn an EC2 Windows server off and on at night to save on billing?
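One possible direction for the scheduled-task part: since the Lambda already creates an SSM client, the tasks could be kicked off explicitly via SSM Run Command once the instance is running, instead of relying on the boot-time trigger. A minimal sketch, assuming the instance runs the SSM agent, the Lambda role is allowed ssm:SendCommand, and "MyPythonJob" is a placeholder task name:
import boto3

ssm = boto3.client('ssm')

def run_scheduled_task(instance_id, task_name='MyPythonJob'):
    # AWS-RunPowerShellScript is the stock SSM document for running PowerShell on Windows
    return ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName='AWS-RunPowerShellScript',
        Parameters={'commands': [f'schtasks /Run /TN "{task_name}"']}
    )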

Related

Unable to limit number of Cloud Run instances spun up by Pub/Sub

Situation:
I'm trying to have a single message in Pub/Sub processed by exactly 1 instance of Cloud Run. Additional messages will be processed by another instance of Cloud Run. Each message triggers a heavy computation that runs for around 100s in the Cloud Run instance.
Currently, Cloud Run is configured with max concurrency requests = 1, and min/max instances of 0/5. Subscription is set to allow for 600s Ack deadline.
Issue:
Each message seems to be triggering multiple instances of Cloud Run to be spun up. I believe this is due to high CPU utilization, which causes Cloud Run to spin up additional instances to help process the load. Unfortunately, these new instances attempt to process the same exact message, causing unintended results.
Question:
Is there a way to force Cloud Run to only have 1 instance process a single message, regardless of CPU utilization and other potential factors?
Relevant Code Snippet:
import base64
import json
from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.post("/")
async def handleMessage(request: Request):
    envelope = await request.json()
    # Basic data validation
    if not envelope:
        msg = "no Pub/Sub message received"
        print(f"error: {msg}")
        return Response(content=msg, status_code=400)
    if not isinstance(envelope, dict) or "message" not in envelope:
        msg = "invalid Pub/Sub message format"
        print(f"error: {msg}")
        return Response(content=msg, status_code=400)
    message = envelope["message"]
    if isinstance(message, dict) and "data" in message:
        data = json.loads(base64.b64decode(message["data"]).decode("utf-8").strip())
    try:
        # Do computationally heavy operations here
        # Will run for about 100s
        return Response(status_code=204)
    except Exception as e:
        print(e)
Thanks!
I've found the issue.
Apparently, Pub/Sub guarantees "at least once" delivery, which means it is possible for it to deliver a message to a subscriber more than once. The onus is therefore on the subscriber, which in my case is Cloud Run, to handle such scenarios (idempotency) gracefully.
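A minimal sketch of what that idempotent handling could look like, assuming the Pub/Sub messageId is used as the dedupe key and Firestore as the shared claim store (the collection name is a placeholder; any shared store works):
from google.api_core.exceptions import AlreadyExists
from google.cloud import firestore

db = firestore.Client()

def claim_message(message_id: str) -> bool:
    """Return True only for the first delivery of a given messageId."""
    try:
        # create() fails if another instance already claimed this id
        db.collection("processed_messages").document(message_id).create({"claimed": True})
        return True
    except AlreadyExists:
        return False

# Inside the request handler, before the heavy computation:
#     message_id = message.get("messageId")
#     if message_id and not claim_message(message_id):
#         return Response(status_code=204)  # duplicate delivery: ack and skip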

Best practice to run Prefect flow serverless in Google Cloud

I have started using Prefect for various projects and now I need to decide which deployment strategy on GCP would work best. Preferably I would like to work serverless. Comparing Cloud Run, Cloud Functions and App Engine, I am inclined to go for the latter, since it doesn't have a timeout limit, while the other two have limits of 9 and 15 minutes respectively.
Am interested to hear how people have deployed Prefect flows serverlessly, such that Flows are scheduled/triggered for batch processing, whilst the agent is automatically scaled down when not used.
Alternatively, a more classic approach would be to deploy Prefect on Compute Engine and schedule this via Cloud Scheduler. But I feel this is somewhat outdated and doesn't do justice to the functionality of Prefect and flexibility for future development.
Am interested to hear how people have deployed Prefect flows serverlessly, such that Flows are scheduled/triggered for batch processing, whilst the agent is automatically scaled down when not used.
Prefect has a blog post on serverless deployment with AWS Lambda which is a good blueprint for doing the same with GCP. The challenge here is the agent scaling - agents work by polling the backend (whether a self-hosted deployment of Prefect Server or the hosted Prefect Cloud) on a regular basis (every ~10 secs). One possibility that comes to mind would be to use a Cloud Function to spin up an agent in-process, triggered by whatever batch processing/scheduling event you're thinking of. You can also use the --max-polls CLI argument (or the corresponding kwarg) when spinning up the agent to look for runs; it'll tear itself down if it doesn't find anything after however many polling attempts you specify. Details on that here or on any of the specific agent pages.
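A rough sketch of that in-process agent idea, assuming Prefect Core / Prefect 1.x, where LocalAgent accepts a max_polls kwarg mirroring the CLI flag (the entry-point name and label are placeholders):
from prefect.agent.local import LocalAgent

def start_agent(request):
    """HTTP-triggered Cloud Function entry point (placeholder name)."""
    # Poll the Prefect backend a handful of times, pick up any scheduled
    # flow runs, then exit so the function tears itself down.
    agent = LocalAgent(labels=["gcp-serverless"], max_polls=5)
    agent.start()
    return "agent finished polling"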
However, this could be inefficient for long-running flows and you might hit resource caps; it might be worthwhile to look at triggering an auto-scaling Dask cluster deployment if the workloads are high enough. Prefect supports that natively with Kubernetes, and has a Kubernetes agent to interact with your cluster. I think this would be the most elegant and scalable solution without having to go the classic Compute Engine route, which I agree is somewhat dated and doesn't provide great auto-scaling or first-class management.
Better support of serverless execution is on the roadmap, specifically a serverless agent is in the works but I don't have an ETA on when that'll be released.
Hopefully that helps! :)
Recently added to Prefect is the Vertex Agent, which uses GCP Vertex AI, the successor to AI Platform. Vertex has a highly configurable serverless execution environment and no timeouts.
The full explanation is here: https://jerryan.medium.com/hacking-ways-to-run-prefect-flow-serverless-in-google-cloud-function-bc6b249126e4.
Basically, there are two hacky ways to solve the problem:
Use Google Cloud Storage to persist task states automatically
Publish the previous execution's results of a Cloud Function to its subsequent execution
Caching and Persisting Data
By default, Prefect Core stores all data, results, and cached states in memory within the Python process running the flow. However, they can be persisted to and retrieved from external locations if the necessary hooks are configured.
Prefect has a notion of "checkpointing" that ensures that every time a task runs successfully, its return value is written to persistent storage, based on the result object and target configured for the task.
@task(result=LocalResult(dir="~/.prefect"), target="task.txt")
def func_task():
    return 99
The complete code example is shown below. Here we write to and read from a Google Cloud Storage bucket by using GCSResult.
import os
os.environ["PREFECT__FLOWS__CHECKPOINTING"] = "true"

from prefect import task, Flow
from prefect.engine.results import LocalResult, GCSResult

@task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task1():
    print("Task 1")
    return "Task 1"

@task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task2():
    print("Task 2")
    return "Task 2"

@task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task3():
    print("Task 3")
    return "Task 3"

@task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task4():
    print("Task 4")

@task
def task5():
    print("Task 5")

@task
def task6():
    print("Task 6")

@task
def task7():
    print("Task 7")

@task
def task8():
    print("Task 8")

# with Flow("This is My First Flow", result=LocalResult(dir="~/prefect")) as flow:
with Flow("this is my first flow", result=GCSResult(bucket="prefect")) as flow:
    t1, t2 = task1(), task2()
    t3 = task3(upstream_tasks=[t1, t2])
    t4 = task4(upstream_tasks=[t3])
    t5 = task5(upstream_tasks=[t4])
    t6 = task6(upstream_tasks=[t4])
    t7 = task7(upstream_tasks=[t2, t6])
    t8 = task8(upstream_tasks=[t2, t3])

# run the whole flow
flow_state = flow.run()

# visualize the flow
flow.visualize(flow_state)

# print the state of the flow
print(flow_state.result)
Publish Execution Results
Another hacky solution is to publish the previous execution's results of a Google Cloud Function to its subsequent execution. Here, we assume there is no data input/output dependence between tasks.
Some modifications are needed to make it happen.
Add custom state handlers for tasks
Manually change task state before publishing
Encode/Decode the task state
First, we know the flow.run function finishes only after all tasks have entered a finished state, whether success or failure. However, we don't want all tasks to run inside a single call of the Google Cloud Function, because the total run time may exceed 540 seconds.
So a custom state handler is used for each task. Every time a task finishes, we emit an ENDRUN signal to the Prefect framework, which then sets the state of the remaining tasks to Cancelled.
from prefect import task, Flow, Task
from prefect.engine.runner import ENDRUN
from prefect.engine.state import State, Cancelled

num_finished = 0

def my_state_handler(obj, old_state, new_state):
    global num_finished
    if num_finished >= 1:
        raise ENDRUN(state=Cancelled("Flow run is cancelled"))
    if new_state.is_finished():
        num_finished += 1
    return new_state
Second, to make tasks with Cancelled status execute correctly next time, we must manually change their status to Pending.
from typing import Dict

from prefect import Task
from prefect.engine.state import State, Cancelled, Pending

def run(task_state_dict: Dict[Task, State]) -> Dict[Task, State]:
    flow_state = flow.run(task_states=task_state_dict)
    task_states = flow_state.result
    # change task state before next publish
    for t in task_states:
        if isinstance(task_states[t], Cancelled):
            task_states[t] = Pending("Mocked pending")
    # reset the global counter
    global num_finished
    num_finished = 0
    # task state for next run
    return task_states
Third, there are two essential functions: encode_data and decode_data. The former serializes the task states so they are ready to be published, and the latter deserializes the task states for the flow object.
# encoding: utf-8
from typing import List, Dict, Any
from prefect.engine.state import State
from prefect import Flow, Task

def decode_data(flow: Flow, data: List[Dict[str, Any]]) -> Dict[Task, State]:
    # data looks like:
    # [
    #     {
    #         "task": {
    #             "slug": "task1"
    #         },
    #         "state": {
    #             "type": "Success",
    #             "message": "Task run succeeded (manually set)"
    #         }
    #     }
    # ]
    task_states = {}
    for d in data:
        tasks_found = flow.get_tasks(d['task']['slug'])
        if len(tasks_found) != 1:  # skip slugs that do not match exactly one task
            continue
        state = State.deserialize(
            {"message": d['state']['message'],
             "type": d['state']['type']}
        )
        task_states[tasks_found[0]] = state
    return task_states

def encode_data(task_states: Dict[Task, State]) -> List[Dict[str, Any]]:
    data = []
    for task, state in task_states.items():
        data.append({
            "task": task.serialize(),
            "state": state.serialize()
        })
    return data
Last but not least, the orchestration connects all the parts above.
import sys
from collections import defaultdict
from typing import List, Dict, Any

def main(data: List[Dict[str, Any]], *args, **kwargs) -> List[Dict[str, Any]]:
    task_states = decode_data(flow, data)
    task_states = run(task_states)
    return encode_data(task_states)

if __name__ == "__main__":
    evt = []
    while True:
        data = main(evt)
        states = defaultdict(set)
        for task in data:
            task_type, slug = task['state']['type'], task['task']['slug']
            states[task_type].add(slug)
        if len(states['Pending']) == 0:
            sys.exit(0)
        evt = data
        # send pubsub message here
        # GooglePubsub().publish(evt)
        # sys.exit(0)

How to retrieve current workers count for job in GCP dataflow using API

Does anyone know if there is a possibility to get current workers count for active job that is running in GCP Dataflow?
I wasn't able to do it using the API provided by Google.
One thing that I was able to get is CurrentVcpuCount but it is not what I need.
Thanks in advance!
The current number of workers in a Dataflow job is displayed in the message logs, under autoscaling. For example, I ran a quick job and got the following messages when displaying the job logs in my Cloud Shell:
INFO:root:2019-01-28T16:42:33.173Z: JOB_MESSAGE_DETAILED: Autoscaling: Raised the number of workers to 0 based on the rate of progress in the currently running step(s).
INFO:root:2019-01-28T16:43:02.166Z: JOB_MESSAGE_DETAILED: Autoscaling: Raised the number of workers to 1 based on the rate of progress in the currently running step(s).
INFO:root:2019-01-28T16:43:05.385Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
INFO:root:2019-01-28T16:43:05.433Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
Now, you can query these messages by using the projects.jobs.messages.list method in the Dataflow API, setting the minimumImportance parameter to JOB_MESSAGE_BASIC.
You will get a response similar to the following:
...
"autoscalingEvents": [
{...} //other events
{
"currentNumWorkers": "1",
"eventType": "CURRENT_NUM_WORKERS_CHANGED",
"description": {
"messageText": "(fcfef6769cff802b): Worker pool started.",
"messageKey": "POOL_STARTUP_COMPLETED"
},
"time": "2019-01-28T16:43:02.130129051Z",
"workerPool": "Regular"
},
To extend this, you could create a Python script that parses the response and reads only the currentNumWorkers parameter from the last element in the autoscalingEvents list, to get the latest (hence the current) number of workers in the job.
Note that if this parameter is not present, it means that the number of workers is zero.
Edit:
I wrote a quick Python script that retrieves the current number of workers from the message logs, using the API I mentioned above:
from google.oauth2 import service_account
import googleapiclient.discovery

credentials = service_account.Credentials.from_service_account_file(
    filename='PATH-TO-SERVICE-ACCOUNT-KEY/key.json',
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

service = googleapiclient.discovery.build(
    'dataflow', 'v1b3', credentials=credentials)

project_id = "MY-PROJECT-ID"
job_id = "DATAFLOW-JOB-ID"

messages = service.projects().jobs().messages().list(
    projectId=project_id,
    jobId=job_id
).execute()

try:
    print("Current number of workers is " + messages['autoscalingEvents'][-1]['currentNumWorkers'])
except:
    print("Current number of workers is 0")
A couple of notes:
The scopes are the permissions needed by the service account key you are referencing (in the from_service_account_file function) in order to call the API. This line is needed to authenticate to the API. You can use any one of this list; to make it easy on my side, I just used a service account key with project/owner permissions.
If you want to read more about the Python API Client Libraries, check this documentation and these samples.

Restart EC2 instance on Website unavailability

I have a website hosted on an EC2 server. I want to monitor the website endpoint and restart the EC2 instance if the website is unavailable for a certain time frame (say 60 seconds).
What tools do I use in AWS and how do I accomplish this?
This is not a recommended approach.
Firstly, if a website is unavailable, you would probably want to investigate the cause rather than just restarting the instance. Your goal should be to run a stable system by removing root causes of problems rather than just ignoring the problem by restarting all the time.
The recommended design would be to run in a Highly Available configuration with:
The application running on at least two servers across at least two Availability Zones (in case of failure of an AZ). This is not necessarily more expensive because each server can be smaller than a single, large server.
A load balancer in front of the instances, distributing the traffic to the instances. The load balancer also performs continuous health checks and stops sending requests to servers that fail the health check
An Auto Scaling group that can terminate unhealthy instances and automatically launch replacement servers. This also works well if an Availability Zone should fail.
In this design, an unhealthy instance would be terminated (stopped and destroyed) and a new instance created with a pre-defined disk image and startup script. Alternatively, you might choose to move bad instances out of the Auto Scaling group for investigation of the problem, with a new instance being launched to take its place.
If your application requires a database, the database should be external to the instances so that all instances can connect to the database and replacing application instances does not cause any data loss.
As to the speed of noticing problems on a server, the load balancer can perform checks every few seconds. Amazon CloudWatch, on the other hand, would need at least a minute to detect problems (probably longer since metrics are calculated over a period rather than being "now" metrics).
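As a minimal sketch of one key piece of that design, assuming an existing Auto Scaling group (the name here is a placeholder) registered behind a load balancer, switching the group to ELB health checks is what makes it replace instances that fail the HTTP health check:
import boto3

autoscaling = boto3.client('autoscaling')

# Trust the load balancer / target group health check rather than only the
# EC2 status checks, so instances failing the HTTP check get replaced.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName='my-web-asg',   # placeholder group name
    HealthCheckType='ELB',
    HealthCheckGracePeriod=300,          # seconds to wait after launch before checking
)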
John's approach is the correct one, but at its simplest:
Write a lambda function that can query your website and see if it is running or not and if not have that lambda function restart the instance.
Setup a cloudwatch event rule that runs on a frequency you determine to call the lambda function
I'll leave to you the work of writing the code that determines if the website is functional and restarting the server - but that is pretty straightforward. You can use python, java, node, go or .net core in your lambda function - I would think python would be the easiest in this case, but that is an opinion.
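A minimal sketch of such a Lambda, assuming a placeholder health URL and instance id, and treating a timeout or non-2xx response as unhealthy:
import urllib.request

import boto3

URL = 'https://example.com/health'      # placeholder website endpoint
INSTANCE_ID = 'i-0123456789abcdef0'     # placeholder instance id

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    try:
        # Healthy if the site answers with a 2xx within 10 seconds
        with urllib.request.urlopen(URL, timeout=10) as resp:
            if 200 <= resp.status < 300:
                return {'healthy': True}
    except Exception as exc:
        print(f'Health check failed: {exc}')
    # Unhealthy: reboot the instance (a full stop/start would also work)
    ec2.reboot_instances(InstanceIds=[INSTANCE_ID])
    return {'healthy': False, 'action': 'reboot'}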
It is clear that this is not a best practice in AWS, but it can make sense in some cases - e.g. you are running a small personal web server with low demand, where availability is less of an issue than cost.
At least that was my reason why I built automation for it.
[architecture diagram]
Lambda code:
import json
import os
import time

import boto3

env_vars = [
    'ALARM_NAME',
    'REGION',
    'INSTANCE_ID',
    'OUTPUT_SNS_ARN'
]

ENV = {}
for env_var in env_vars:
    ENV[env_var] = os.environ.get(env_var, None)
    if not ENV[env_var]:
        raise Exception(f"Environment variable {env_var} must be set!")

def reboot_instance(instanceID, regionName) -> "instanceID":
    """
    Stop and start an instance.

    instanceID - ID of instance
    regionName - name of region
    return instanceID, or raise an exception if the instance cannot be stopped
    """
    ec2 = boto3.resource('ec2', region_name=regionName)
    instance = ec2.Instance(instanceID)
    try:
        instance.stop()
        time.sleep(30)
        instance.stop(Force=True)
    except:
        pass
    for i in range(180):  # wait up to 3 minutes for the "stopped" state (code 80)
        instance = ec2.Instance(instanceID)
        if instance.state['Code'] == 80:
            break
        time.sleep(1)
    else:
        raise Exception('Unable to stop instance')
    instance.start()
    return instanceID

def notify_about_reboot(instanceID, snsarn) -> True:
    """
    Publish an SNS message about the reboot to snsarn.
    """
    client = boto3.client('sns', region_name='us-east-1')
    client.publish(TopicArn=snsarn, Message=f'EC2 instance {instanceID} was rebooted!')
    return True

def lambda_handler(event, context) -> "status about reboot":
    """
    event: see events/event.json
    """
    print('EVENT:')
    print(event)
    for record in event.get('Records', None):
        sns = record.get('Sns', None)
        message = json.loads(sns.get('Message', None))
        msgalarm = message.get('AlarmName', None)
        msgstatus = message.get('NewStateValue', None)
        if not all([sns, message, msgalarm, msgstatus]):
            continue
        if (msgalarm == ENV['ALARM_NAME']) and (msgstatus == 'ALARM'):
            notify_about_reboot(reboot_instance(ENV['INSTANCE_ID'], ENV['REGION']), ENV['OUTPUT_SNS_ARN'])
            return 'rebooting'
        else:
            return 'nothing to do'
    return 'no sns record found'
I have released whole tested automation with SAM template and installation instructions also on https://github.com/koss822/misc/tree/master/Aws/route53-healthcheck-instance-reboot

Post a Message to Elastic Beanstalk Worker Environment via SQS

I have a docker application on elastic beanstalk with a web-server and worker environment.
The worker environment currently runs scheduled jobs via cron.
I'm trying to connect the server to the worker to achieve the following:
Client sends a request to the server (/trigger_job)
Server offloads the job to the worker by sending a JSON message to SQS queue (/perform_job)
Worker performs the job by reading the message from SQS
I haven't been able to find documentation on what the JSON message should look like. There are some HTTP headers mentioned in the official documentation, but there's no mention of a header to specify the desired endpoint in the worker environment.
# server.py
from bottle import post, HTTPResponse

@post('/trigger_job')
def trigger_worker_job():
    # should send a JSON message to SQS to trigger '/perform_job'
    # Need help with what the JSON message looks like
    return HTTPResponse(status=200, body={'Msg': 'Sent message'})

# worker.py
from bottle import post, HTTPResponse

@post('/perform_job')
def perform_job():
    # job is performed in the worker environment
    return HTTPResponse(status=200, body={'Msg': 'Success'})
In Python, you can see how this works in the Python sample application that you can find in step 4 of this AWS doc: Deploy a New Application Version.
You can configure the SQS queue in the Beanstalk worker environment console: Configuration > Worker > Select a Worker Queue.
# for example:
environ['HTTP_X_AWS_SQSD_TASKNAME']
environ['HTTP_X_AWS_SQSD_SCHEDULED_AT']
logger.info("environ X-Aws-Sqsd-Queue %s" % environ['HTTP_X_AWS_SQSD_QUEUE'])
# Regarding your message attributes: for example, if the attribute name is Email,
# you can extract it via environ['HTTP_X_AWS_SQSD_ATTR_EMAIL'].
# Make sure that the attribute name is all uppercase.
logger.info("environ X-Aws-Sqsd-Attr Email %s" % environ['HTTP_X_AWS_SQSD_ATTR_EMAIL'])
The message will contain headers like the ones shown above. You can read more in AWS Elastic Beanstalk Worker Environments.
I found from some of my failed cron jobs in the dead letter queue that there were the following three attributes:
beanstalk.sqsd.path
beanstalk.sqsd.task_name
beanstalk.sqsd.scheduled_time
which are similar to the attributes set on the cron job.
You can manually create a new message in the SQS queue for the worker environment and set those attributes to match the cron job you wish to execute, though that is tedious and I much prefer to do so by code.
I'm using the boto3 framework in python, though hopefully it is a similar process in other languages.
from datetime import datetime, timezone

import boto3

def send_elastic_beanstalk_message(name, path, message, queue_url):
    client = boto3.client('sqs')
    return client.send_message(
        QueueUrl=queue_url,
        MessageBody=message,
        MessageAttributes={
            "beanstalk.sqsd.path": {
                "DataType": "String",
                "StringValue": path
            },
            "beanstalk.sqsd.task_name": {
                "DataType": "String",
                "StringValue": name
            },
            "beanstalk.sqsd.scheduled_time": {
                "DataType": "String",
                "StringValue": datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S UTC')
            }
        }
    )
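For example, to hit the /perform_job route from the question, the call might look like this (the queue URL is a placeholder; an empty JSON body works since the body isn't forwarded anyway):
send_elastic_beanstalk_message(
    name='perform_job',
    path='/perform_job',
    message='{}',
    queue_url='https://sqs.us-east-1.amazonaws.com/123456789012/my-worker-queue'
)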
This will create a new message in the SQS queue that will be parsed by the worker daemon and trigger a cron job.
Unfortunately, a message body does not seem to be included when the message is parsed, so triggering the job is the only action I was able to get working.