Alternative to Cronjob using AWS Lambda + Cloudwatch

I'm a developer at a startup and right now we are using around 30 cronjobs. Some of them run every minute, others run once per day, and others run on specific days. The problem is the ones that run every minute, when most of the time that is not necessary.
This somewhat increases our expenses because they still run during the night, when most of the time nobody is online and they do not need to run.
We have been talking about using AWS to replace those cronjobs with something event-based, yet I cannot find a solution. Here's an example of one of our cronjobs:
A customer starts a registration and has 8 minutes to complete it. Right now, we have a cronjob that runs every minute to check whether they completed it and, if not, to "delete" it.
I thought I could replace this with an SNS + Lambda event. Basically, when a user starts registration, send a message to SNS, which would trigger a Lambda function. Yet it should only run after 8 minutes, not instantly.
I've seen on SNS that we can delay up to 15 minutes, but we have another service that sends an email after a few hours, for which that would not work.
Does anyone have a clue how I can do this?
Thanks

You can use AWS Step Functions to implement the workflow and add a Wait state to delay before invoking the Lambda function.
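As a rough illustration of that answer, here is a minimal sketch that builds a state machine definition (Amazon States Language) with a Wait state followed by a Task that invokes a Lambda function. The state names, the function name and the ARN are made up for the example; only the `Wait`/`Task` structure comes from Step Functions itself.

```python
import json


def build_delayed_check_definition(lambda_arn: str, delay_seconds: int) -> dict:
    """Return a state machine definition that waits, then invokes a Lambda."""
    return {
        "Comment": "Wait for the registration deadline, then run the check",
        "StartAt": "WaitForDeadline",
        "States": {
            # Pauses the execution; nothing runs (or is polled) meanwhile.
            "WaitForDeadline": {
                "Type": "Wait",
                "Seconds": delay_seconds,
                "Next": "CheckRegistration",
            },
            # Invokes the Lambda that validates or "deletes" the registration.
            "CheckRegistration": {
                "Type": "Task",
                "Resource": lambda_arn,
                "End": True,
            },
        },
    }


# Hypothetical ARN, used only for illustration.
definition = build_delayed_check_definition(
    "arn:aws:lambda:us-east-1:123456789012:function:check-registration",
    delay_seconds=8 * 60,
)
print(json.dumps(definition, indent=2))
```

One execution would be started per registration, with the registration ID as input. The same shape should also fit the "email after a few hours" case by changing `delay_seconds`.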

Related

Lambda timed out due to container refreshed

I have gone through the site but have been unable to find the root cause of my issue.
We have a Lambda that runs every 50 seconds. The first run of the Lambda is a cold start, during which all the necessary dependencies for the Lambda (all the interfaces) are prepared. The Lambda handler has its own code to interact with SQS and SWF. From the CloudWatch logs it is clear that during the first run it reads the base file to get all the services, and then the Lambda handler starts. From the second run onward, only the Lambda handler is invoked, every 50 seconds. So far everything goes smoothly.
All of a sudden we noticed that the Lambda took more than 50 seconds (in general it finishes in under 10 s). The log shows that the Lambda timed out and then started initializing all the dependencies again from scratch.
This gives us no clue, because after the timeout the subsequent runs work smoothly. It's not good to see the Lambda time out. The Lambda code is definitely free of errors.
Could this be a container issue? Does the container have a time period for which it keeps data active before it expires?
Can we access the container object to find out more information? We have two or more dev environments, and this behavior differs between environments: for some it happens every 3 days, and sometimes it happens three times in a day.
If we want to understand the properties of the container object, how can we do it? Is it a grey zone that only AWS can access? The Lambda code is written in C# using .NET Core App 2.0. I thought of checking the CloudTrail log for this Lambda during the invocation, but there too I could not find the reason behind the timeout.
We have more than 20 Lambdas for dev and 10 for test, each in different regions. It is not clear to us which Lambda will time out.
Any suggestions or ideas will help me a lot.
Thank you.
Lambda containers will not live indefinitely. If you are seeing occasional "cold starts" then that is normal behavior. If you're running only 1 invocation at a time (i.e. you only have a single lambda instance) you can still expect to see the container recycled every few hours. In general, I understand AWS is trying to give us fewer cold starts but you can still expect to get a new container and new cold start from time to time.

Is there an AWS / Pagerduty service that will alert me if it's NOT notified

We've got a little Java scheduler running on AWS ECS. It's doing what cron used to do on our old monolith. It fires up (Fargate) tasks in Docker containers. We've got a task that runs every hour and it's quite important to us. I want to know if it crashes or fails to run for any reason (e.g. the Java scheduler fails, or someone turns the task off).
I'm looking for a service that will alert me if it's not notified. I want to call the notification system every time the script runs successfully. Then if the alert system doesn't get the "OK" notification as expected, it shoots off an alert.
I figure this kind of service must exist, and I don't want to re-invent the wheel trying to build it myself. I guess my question is, what's it called? And where can I go to get that kind of thing? (We're using AWS, obviously, and we've got a PagerDuty account.)
We use this approach for these types of problems. First, the task has to write a timestamp to a file in S3 or EFS. This file is the external evidence that the task ran to completion. Then you need an HTTP-based service that reads that file and checks whether the timestamp is valid, i.e. has been updated in the last hour. This could be a simple PHP or Node.js script. This process is exposed to the public web, e.g. https://example.com/heartbeat.php. The script returns an HTTP response code of 200 if the timestamp file is present and valid, or 500 if not. Then we use StatusCake to monitor the URL and notify us via its PagerDuty integration if there is an incident. We usually include a message in the response so a human can see the nature of the error.
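The core of that heartbeat check is small enough to sketch. The answer suggests PHP or Node.js; the Python below only illustrates the decision logic (fresh timestamp gives 200, missing or stale gives 500). The function name and the one-hour default are assumptions, and reading the timestamp from S3 or EFS and serving it over HTTP are left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Assumed freshness window; the task in the question runs hourly.
MAX_AGE = timedelta(hours=1)


def heartbeat_status(
    last_run: Optional[datetime],
    now: datetime,
    max_age: timedelta = MAX_AGE,
) -> Tuple[int, str]:
    """Map the last recorded run time to an HTTP status and a message."""
    if last_run is None:
        # The timestamp file does not exist: the task never completed.
        return 500, "heartbeat timestamp is missing"
    age = now - last_run
    if age > max_age:
        # The file exists but was not refreshed in time.
        return 500, f"last successful run was {age} ago"
    return 200, "OK"


now = datetime.now(timezone.utc)
print(heartbeat_status(now - timedelta(minutes=10), now))
print(heartbeat_status(now - timedelta(hours=3), now))
```

In practice `max_age` would probably be set a little above the task interval, so that a run finishing a few minutes late does not page anyone.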
This may seem tedious, but it is foolproof. Any failure anywhere along the line will be immediately notified. StatusCake has a great free service level. This approach can be used to monitor any critical task in same way. We've learned the hard way that critical cron type tasks and processes can fail for any number of reasons, and you want to know before it becomes customer critical. 24x7x365 monitoring of these types of tasks is necessary, and helps us sleep better at night.
Note: We always have a daily system test event that triggers a PagerDuty notification at 9am each day. For the truly paranoid, this assures that PagerDuty itself has not failed in some way, e.g. through misconfiguration. Our support team knows that if they don't get a test alert each day, there is a problem in the notification system itself. The tech on duty has to acknowledge the incident as per SOP. If they do not acknowledge it, it escalates to the next tier, and we know we have to have a talk about response times. It keeps people on their toes. This is the final piece to ensure you have a robust monitoring infrastructure.
OpsGenie has a heartbeat service which is basically a watchdog timer. You can configure it to call you if you don't ping them in x number of minutes.
Unfortunately I would not recommend them. I have been using them for 4 years and they have changed their account system twice and left my paid account orphaned silently. I have to find a new vendor as soon as I have some free time.

Is there a way to set a walltime on AWS Batch jobs?

Is there a way to set a maximum running time for AWS Batch jobs (or queues)? This is a standard setting in most batch managers, which avoids wasting resources when a job hangs for whatever reason.
As of April, 2018, AWS Batch now supports setting a Job Timeout when submitting a Job, or in the job definition.
https://aws.amazon.com/about-aws/whats-new/2018/04/aws-batch-adds-support-for-automatic-termination-with-job-execution-timeout/
You specify an attemptDurationSeconds parameter, which must be at least 60 seconds, either in your job definition, or when you submit the job. When this number of seconds has passed following the job attempt's startedAt timestamp, AWS Batch terminates the job. On the compute resource, your job's container receives a SIGTERM signal to give your application a chance to shut down gracefully; if the container is still running after 30 seconds, a SIGKILL signal is sent to forcefully shut down the container.
Source: https://docs.aws.amazon.com/batch/latest/userguide/job_timeouts.html
POST /v1/submitjob HTTP/1.1
Content-type: application/json

{
   ...
   "timeout": {
      "attemptDurationSeconds": number
   }
}
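For anyone submitting jobs from Python, the same request can be sketched with boto3's Batch client, whose `submit_job` call accepts a `timeout` argument. The helper names and the job/queue names are invented for the example, and the client is passed in so that nothing here talks to AWS on import.

```python
MIN_TIMEOUT_SECONDS = 60  # Batch rejects attempt timeouts below 60 seconds.


def build_submit_job_args(job_name, job_queue, job_definition, timeout_seconds):
    """Build keyword arguments for submit_job with an attempt timeout."""
    if timeout_seconds < MIN_TIMEOUT_SECONDS:
        raise ValueError("attemptDurationSeconds must be at least 60")
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "timeout": {"attemptDurationSeconds": timeout_seconds},
    }


def submit_with_timeout(batch_client, job_name, job_queue, job_definition,
                        timeout_seconds):
    """Submit a job; batch_client is expected to be boto3.client("batch")."""
    args = build_submit_job_args(job_name, job_queue, job_definition,
                                 timeout_seconds)
    return batch_client.submit_job(**args)


# Hypothetical names; a two-hour limit per attempt.
print(build_submit_job_args("nightly-sim", "default-queue", "sim-job:1", 7200))
```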
AFAIK there is no feature to do this. However, a workaround was suggested in the forum for a similar question.
One idea is to call Batch as an Activity from Step Functions and ping
back on a schedule (e.g. every minute) from that job. If it stops
responding then you can detect that situation as a Timeout in the
activity and act accordingly (terminate the job etc.). Not an ideal
solution (especially if the job continues to ping back as a "zombie"),
but it's a start. You'd also likely have to store activity tokens in a
database to trace them to Batch job id.
Alternatively, you can split that setup into 2 steps, and schedule a Batch
job from a Lambda in the first state, then pass the Batch job id to
the second step which then polls Batch (from another Lambda) for its
state with Retry and IntervalSeconds (e.g. once every minute, or even
with exponential backoff), and MaxAttempts calculated based on your
timeout. This way, you don't need any external state storage
mechanism, long polling or even a "ping back" from the job (it CAN be
a zombie), but the downside is more steps.
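The two-step variant could look roughly like the state machine definition below. This is only a sketch: it assumes the polling Lambda raises a custom error (called `JobStillRunning` here, a made-up name) while the job is still running, and that three Lambda ARNs are supplied by the caller. `MaxAttempts` is derived from the timeout as the quoted text suggests.

```python
import math


def build_batch_watchdog_definition(submit_arn, poll_arn, terminate_arn,
                                    timeout_seconds, interval_seconds=60):
    """State machine: submit a job, poll it, terminate it on timeout."""
    # Number of retries that roughly fit into the timeout.
    max_attempts = math.ceil(timeout_seconds / interval_seconds)
    return {
        "StartAt": "SubmitJob",
        "States": {
            # First state: a Lambda submits the job and returns its id.
            "SubmitJob": {"Type": "Task", "Resource": submit_arn,
                          "Next": "PollJob"},
            # Second state: a Lambda checks the job and raises
            # JobStillRunning (hypothetical) until it has finished.
            "PollJob": {
                "Type": "Task",
                "Resource": poll_arn,
                "Retry": [{
                    "ErrorEquals": ["JobStillRunning"],
                    "IntervalSeconds": interval_seconds,
                    "MaxAttempts": max_attempts,
                    "BackoffRate": 1.0,  # constant interval, no backoff
                }],
                # Retries exhausted: keep the input (job id) and clean up.
                "Catch": [{
                    "ErrorEquals": ["JobStillRunning"],
                    "ResultPath": "$.error",
                    "Next": "TerminateJob",
                }],
                "End": True,
            },
            "TerminateJob": {"Type": "Task", "Resource": terminate_arn,
                             "End": True},
        },
    }


definition = build_batch_watchdog_definition(
    "arn:submit", "arn:poll", "arn:terminate", timeout_seconds=3600)
print(definition["States"]["PollJob"]["Retry"][0]["MaxAttempts"])
```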
There is no option to set a timeout on a Batch job, but you can set up a Lambda function that triggers every hour or so and terminates jobs created more than, say, 24 hours ago.
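A minimal sketch of such a cleanup function, assuming boto3's Batch client (`list_jobs` and `terminate_job`). The function names are illustrative, the client is injected, and only the first page of results is examined to keep the example short.

```python
import time


def find_stale_job_ids(job_summaries, now_ms, max_age_hours=24):
    """Return ids of jobs created more than max_age_hours ago.

    Batch reports createdAt as a Unix timestamp in milliseconds.
    """
    cutoff_ms = now_ms - max_age_hours * 3600 * 1000
    return [job["jobId"] for job in job_summaries
            if job["createdAt"] < cutoff_ms]


def terminate_stale_jobs(batch_client, job_queue, max_age_hours=24,
                         now_ms=None):
    """Terminate old RUNNING jobs in one queue and return their ids."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Only the first page is read; a real job would follow nextToken.
    response = batch_client.list_jobs(jobQueue=job_queue, jobStatus="RUNNING")
    stale = find_stale_job_ids(response["jobSummaryList"], now_ms,
                               max_age_hours)
    for job_id in stale:
        batch_client.terminate_job(jobId=job_id,
                                   reason="Exceeded maximum allowed age")
    return stale
```

The Lambda handler would simply call `terminate_stale_jobs(boto3.client("batch"), "my-queue")` on each scheduled invocation, where `"my-queue"` stands in for the real queue name.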
I have been working with AWS for some time now and could not find a way to set a maximum running time for Batch jobs.
However, there are some alternative approaches you could use.
AWS Forum
Sadly there is no way to limit the execution time on AWS Batch.
One solution may be to edit the Docker entry point to enforce the execution time limit.

SQS - Delivery Delay of 30 minutes

According to the SQS documentation, the maximum delay we can configure to hide a message from its consumers is 15 minutes - http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-delay-queues.html
Suppose I need to hide the messages for a day; what is the pattern?
For example, I want to mimic a daily cron for doing some action.
Thanks
The simplest way to do this is as follows:
    SQS.push_to_queue({perform_message_at: "Thursday November 2022"}, delay: 15 mins)
Inside your worker:
    message = SQS.poll_messages
    if message.perform_message_at > Time.now
      SQS.push_to_queue({perform_message_at: "Thursday November 2022"}, delay: 15 mins)
    else
      process_message(message)
    end
Basically push the message back to the queue with the maximum delay and only process it when its processing time is less than the current time.
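The pseudocode above might translate to something like the following Python sketch, assuming a standard queue, a JSON message body with a `perform_message_at` field holding a Unix timestamp, and a boto3 SQS client passed in by the caller. The helper names are invented. One small change from the pseudocode: the delay is capped at the remaining time, so the last hop does not overshoot.

```python
import json
import math

MAX_SQS_DELAY_SECONDS = 900  # SQS allows at most 15 minutes of delay.


def next_delay_seconds(perform_at_epoch, now_epoch):
    """Seconds to delay the next hop: 0 means the message is due now."""
    remaining = math.ceil(perform_at_epoch - now_epoch)
    return max(0, min(MAX_SQS_DELAY_SECONDS, remaining))


def handle_message(sqs_client, queue_url, body, now_epoch, process):
    """Re-queue a message that is not due yet, otherwise process it.

    The caller is still responsible for deleting the received message.
    """
    payload = json.loads(body)
    delay = next_delay_seconds(payload["perform_message_at"], now_epoch)
    if delay > 0:
        # Not due yet: push a copy back with the largest useful delay.
        sqs_client.send_message(QueueUrl=queue_url, MessageBody=body,
                                DelaySeconds=delay)
        return "requeued"
    process(payload)
    return "processed"
```

A message scheduled a day ahead would make roughly 96 hops through the queue before it is processed, which is part of why other answers consider this approach hacky.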
HTH.
Visibility timeout can go up to 12 hours. I think you can hack something together where you process a message but don't delete it, so the next time it is processed it's been 12 hours. So a queue with one message and a visibility timeout of 12 hours gets you a 12-hour cron.
CloudWatch is likely a better way to do it. You can create a scheduled CloudWatch Events rule (the PutRule API with a schedule expression) and have it trigger either a Lambda function or an API call to whatever comes next.
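A minimal sketch of a daily schedule using boto3's `events` client (`put_rule` and `put_targets`). The rule name, target id and ARN are placeholders, and the client is injected so the example stays self-contained.

```python
# Runs at 09:00 UTC every day. Fields: minute hour day-of-month month
# day-of-week year.
DAILY_AT_9_UTC = "cron(0 9 * * ? *)"


def schedule_daily(events_client, rule_name, lambda_arn,
                   expression=DAILY_AT_9_UTC):
    """Create (or update) a scheduled rule and point it at a Lambda."""
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=expression,
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "daily-target", "Arn": lambda_arn}],
    )
    return rule["RuleArn"]
```

With a real client (`boto3.client("events")`), the Lambda function would also need a resource policy that allows the rule to invoke it; that step is omitted here.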
Another way to do it is to use the Wait state in an AWS Step Functions state machine.
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-wait-state.html
In any case, unless you are extremely sure you will never need anything more than 15 minutes, the SQS backdoor to add the delay seems hacky.
You can do this by adding a DLQ to the first queue with maxReceiveCount set to 1.
Add a simple Lambda on the first queue and fail the message via the Lambda, so the message is moved to the DLQ automatically, and then you can consume from the DLQ.
Both the primary queue and the DLQ can have a maximum delay of 15 minutes, so in total you get a 30-minute delay.
Your consumer app then receives the message after 30 minutes, without any custom logic of its own.
Two thoughts.
1) Untested: perhaps publish to an SNS topic that has no SQS queues subscribed. When delivery needs to happen, subscribe the queue to the topic. (I've not done this, and I'm not sure it would work as expected.)
2) Push messages as files to a central store (like S3). Create a worker that looks at the creation timestamp and decides whether to publish them to a queue or not. If created >= 1d ago, publish.
This was a challenge for us as well and I never found a perfect solution so I ended up building a service to address it. Obviously self promotion here but the system allows you to work around the DelaySeconds limitation and set arbitrary date/times at scale.
https://anticipated.io
One of the challenges of working with Step Functions is the scale of registered state machines (if your system has that requirement). If you use EventBridge to fire them, you run out of allowable rules (the limit is 200 as of this posting). Example: if you need to set 150,000 arbitrary events a month, you run into limits quickly.

AWS Elastic Beanstalk Worker timing out after inactivity during long computation

I am trying to use Amazon Elastic Beanstalk to run a very long numerical simulation - up to 20 hours. The code works beautifully when I tell it to do a short, 20 second simulation. However, when running a longer one, I get the error "The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own)".
After browsing the web, it seems to me that the issue is that Elastic Beanstalk allows worker processes to run for 30 minutes at most, and then they time out because the instance has not responded (i.e. finished the simulation). The solution some have proposed is to send a message every 30 seconds or so that "pings" Elastic Beanstalk, letting it know that the simulation is going well so it doesn't time out, which would let me run a long worker process. So I have a few questions:
Is this the correct approach?
If so, what code or configuration would I add to the project to make it stop terminating early?
If not, how can I smoothly run a 12+ hour simulation on AWS or more generally, the cloud?
Add on information
Thank you for the feedback, Rohit. To give some more information, I'm using Python with Flask.
• I am indeed using an Elastic Beanstalk worker tier with SQS queues
• In my code, I'm running a simulation of variable length - from as short as 20 seconds to as long as 20 hours. 99% of the work that Elastic Beanstalk does is running the simulation. The other 1% involves saving results, sending emails, etc.
• The simulation itself involves generating many random numbers and working with objects that I defined. I use numpy heavily here.
Let me know if I can provide any more information. I really appreciate the help :)
After talking to a friend who's more in the know about this stuff than me, I solved the problem. It's a little sketchy, but got the job done. For future reference, here is an outline of what I did:
1) Wrote a main script that used Amazon's boto library to connect to my SQS queue. Wrote an infinite while loop to poll the queue every 60 seconds. When there's a message on the queue, run a simulation and then continue through with the loop
2) Borrowed a beautiful /etc/init.d/ template to run my script as a daemon (http://blog.scphillips.com/2013/07/getting-a-python-script-to-run-in-the-background-as-a-service-on-boot/)
3) Made my main script and the script in (2) executable
4) Set up a cron job to make sure the script would start back up if it failed.
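Step (1) could be sketched along these lines. This is not the author's script: it assumes boto3 rather than the older boto library, takes the SQS client and the simulation function as parameters, and deletes each message before running the simulation because a 20-hour run would outlive the 12-hour maximum visibility timeout. The `max_polls` and `sleep` parameters exist only so the loop can be exercised without waiting.

```python
import time


def poll_queue(sqs_client, queue_url, run_simulation, poll_interval=60,
               max_polls=None, sleep=time.sleep):
    """Poll an SQS queue and run one simulation per message.

    With max_polls=None this loops forever, like a daemon would.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        response = sqs_client.receive_message(QueueUrl=queue_url,
                                              MaxNumberOfMessages=1)
        messages = response.get("Messages", [])
        if not messages:
            # Nothing to do: wait before asking again.
            sleep(poll_interval)
            continue
        message = messages[0]
        # Delete first so the message does not reappear mid-simulation.
        # The trade-off: a crash during the run loses the request.
        sqs_client.delete_message(QueueUrl=queue_url,
                                  ReceiptHandle=message["ReceiptHandle"])
        run_simulation(message["Body"])
```

The daemon wrapper from step (2) would call something like `poll_queue(boto3.client("sqs"), queue_url, run_simulation)` once at start-up.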
Once again, thank you Rohit for taking the time to help me out. I'm glad I still got to use Amazon even though Elastic Beanstalk wasn't the right tool for the job
From your question it seems you are running into launches timing out because some commands during launch that run on your instance take more than 30 minutes.
As explained here, you can adjust the Timeout option in the aws:elasticbeanstalk:command namespace. This can have values between 1 and 1800. This means that if your commands finish within 30 minutes you won't see this error. The commands might eventually finish, as the error message says, but since Elastic Beanstalk has not received a response within the specified period it does not know what is going on on your instance.
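As one way to apply that option, here is a small sketch using boto3's Elastic Beanstalk client and its `update_environment` call. The helper names and the environment name are placeholders, and the range check simply mirrors the 1 to 1800 limit mentioned above.

```python
def command_timeout_setting(timeout_seconds):
    """Build the option setting for the command timeout (in seconds)."""
    if not 1 <= timeout_seconds <= 1800:
        raise ValueError("Timeout must be between 1 and 1800 seconds")
    return [{
        "Namespace": "aws:elasticbeanstalk:command",
        "OptionName": "Timeout",
        "Value": str(timeout_seconds),
    }]


def set_command_timeout(eb_client, environment_name, timeout_seconds):
    """Apply the timeout; eb_client would be boto3.client("elasticbeanstalk")."""
    return eb_client.update_environment(
        EnvironmentName=environment_name,
        OptionSettings=command_timeout_setting(timeout_seconds),
    )


print(command_timeout_setting(1800))
```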
It would be helpful if you could add more details about your use case. What commands are you running during startup? Apparently you are using ebextensions to launch commands which take a long time. Is it possible to run those commands in the background, or do you need these commands to run during server startup?
If you are running a Tomcat web app you could also use something like servlet init method to run app bootstrapping code. This code can take however long it needs without giving you this error message.
Unfortunately, there is no way to 'process a message' from an SQS queue for more than 12 hours (see the description of ChangeVisibilityTimeout).
With that being the case, this approach doesn't fit your application well. I have run into the same problem.
The correct way to do this: I don't know. However, I would suggest an alternate approach where you grab a message off of your queue, spin off a thread or process to run your long running simulation, and then delete the message (signaling successful processing). In this approach, be careful of spinning off too many threads on one machine and also be wary of machines shutting down before the simulation has ended, because the queue message has already been deleted.
Final note: your question is excellently worded and sufficiently detailed :)
For those looking to run jobs shorter than 10 hours, it should be mentioned that the current inactivity timeout limit is 36000 seconds, i.e. exactly 10 hours, and no longer 30 minutes as mentioned in posts all over the web (which led me to think a workaround like the one described above was needed).
Check out the docs: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html
A very nice write-up can be found here: https://dev.to/rizasaputra/understanding-aws-elastic-beanstalk-worker-timeout-42hi