Monitoring DB trends in AWS - amazon-web-services

I was wondering about what the best workflows/tools are for the following scenario.
Imagine you receive data from N restaurants, on a daily basis, like how many drinks, dishes of certain type, total order count etc etc, a restaurant made. All these entries go into a postgres DB, described best by the following fields {ID, datetime, restaurant, type_record, count}. Number of restaurants is in the 100's, so I need something that does not need to be updated with a CONFIG file every time a restaurant is added to the system.
Now I want to run a daily script that:
Runs basic queries against the DB.
Makes some basic calculations.
Catches something like number of drinks for today for restaurant X is 15% higher than its daily average`.
If step 3 is beyond a certain threshold, push an alert to slack or pagerduty.
The question is: with which aws service should I perform step 3?
All I can think of is to run this code on a simple lambda function. This implementation would mostly suffice but I was wondering if there are smarter/better ways to achieve this.
Details:
Latency of the query (steps 1 and 2) are not a problem, nor step 4.
The main problem is how to have such a trend monitoring system on the DB that is as simple as possible (easy to maintain).
Any ideas/thoughts?

Either Lambda or EC2 would work. Those are the 2 compute resources AWS provides.
This type of monitoring would normally run periodically eg once per day at noon. For that type of monitoring, Lambda would be perfect, as it can be invoked only when needed.
You can also launch a Ec2 instance periodically, through a scheduled event. But there is the overhead managing the server: install software and manage an AMI.
Either would work. I suggest you try a prototype in Lambda. Lambda can simplify application development and deploy at a lower cost than developing on traditional EC2 instances.

Related

Triggering an AWS Lambda Function based on multiple (1k-10k) schedules

We have an AWS Lambda function which queries some data for a client from our DB and sends a report to the client. Some clients want daily reports, some might need weekly or monthly reports. The number of clients can go up ~1000 and each client might have ~10 such reports.
So we are looking for a way to trigger the Lambda function with different parameters based on schedules set by each client.
For Example:
Client A wants daily report of their data to be sent to abc#clienta.com and Client B wants a weekly report of their data to be sent to xyz#clientb.com. So the Lambda function will be invoked twice on Sunday 12 AM (for both clients) and once on Monday-Saturday 12 AM (for Client A).
We found the following solutions on AWS, but both have some limitations.
Approach 1: Use CloudWatch Events
We can create a CloudWatch Events Rule for each client and each report that could trigger our Lambda function on each schedule.
Pros:
Simple setup, easy to implement.
Cons:
There is a limitation of 100 Event Rules per AWS Account. It's mentioned that we can contact AWS to get it increased, but we are not sure if it can be increased to the number we are looking for (Currently it is ~10k, but we would prefer a solution in which there is no such limit). Also, a limit of 100 per account gives an indication that this is not a suitable solution for such a use case.
Approach 2: Using Step Functions
For each client and each report, we can create one AWS State Machine. We can use the Iterator pattern in Step Functions to wait for a day/week/month and then re-invoke the Lambda Function.
Pros:
No limitations on number of State Machines, so this enables us to scale easily.
Cons:
Step Functions have a limitation that they can run for a year, at maximum. This will be a problem in our case because the users will need to get the reports for a much longer period. There is a way to overcome this in Step Functions. Just before it's about to reach the 1-year limit, we can cancel the execution and start a fresh execution. So overall, this solution looks complex.
Can someone suggest a better solution for this on AWS?
Do you really need a CloudWatch for each client? Why not do something like the following architecture.
Have cloudwatch kick off a lambda that checks schedules for all clients each day (or whatever the most frequent report schedule you allow). You don't want this to take a long time so you just have this check a database (i.e. DynamoDB) of schedules and drop metadata about any reports that need to be generated onto an SQS queue (i.e. type of report, client information, destination email). Worst case, this execute and finds nothing to schedule but this should only takes seconds so the cost is very low to just run this everyday.
Then you have a lambda that actually does the report generator and email that consumes the queue. This report generator lambda will scale and spin up as many instances it needs to handle the messages on the queue. You can set the concurrency limit for the report generator lambda to ensure it doesn't spin up too many at a time if that is a concern once you are having 1000s of clients.
The definition and deployment of all these components can easily be automated via an AWS SAM.
Hope this alternate approach gives you a few more ideas.
You can combine both approach, to get the best result.
step 1: Use stepfunction to run your lambdas.
step 2: Trigger your stepfunction from cloudwatch, based on stepfunction event(SUCCESS,FAILED ETC).
In this way when step 1 fails or completes 1 year run. Cloudwatch event can trigger it back on, based on the json input you pass.

Cron Jobs vs Task Scheduler table for scheduled emails

Preamble: I have a web app, the backend is based on the serverless architecture. It's basically an amplify app hosted on AWS with a dynamoDB database. I've learnt is possible to create a task scheduling system of sorts more here. A quick summary of the article is "Its possible to create a task scheduling table taking advantage of TTL and dynamoDB streams to execute lambda function at specific times. The TTL specifies a set time for an record to be deleted, we can capture this delete event in a dynamoDB stream and run some tasks based on information from the stream"
Problem:
The goal is to send a series of emails to users who sign up for our service. Each user that signs up gets a series of "Getting Started" emails. The first of the emails is sent 24 hours after a user signs up, the second 3 days later and the third exactly 7 days after sign up.
I see how a cron job would be suitable here, but it just seems a bit inefficient to me. I would basically have to search the users table for users whose sign up time falls between a specific 24 hour period and send the email to the users whereas with a Task scheduler table I could add a task to the table ( something like send first email to user300 with a TTL of when I want it to be sent ) and listen for delete events to run the task. No need to run a cron job daily, just a function that handles each task as it comes.
I think this is more like a performance vs storage problem. Having a task scheduler table would take up space, if we add all the emails to be sent to a user as tasks on the table (each email to be sent to a specific user is it's own task) each time a user signs up then I see the task scheduler table growing 3n records for every n user signed up. But this may not really be a problem as tasks are deleted after they are run. I do not know the performance cost of using a cron job for this particular task hence I'm here. I also may be wrong and the cost of running and updating this task scheduler table may be more than that of the cron job.
I initially thought of setting up a dummy user table and running both the cron and the task scheduler and documenting cost of running both, but you can imagine how much time and effort that would take.
So I guess my question is which is a more efficient solution in terms of performance and cost?
There is no perfect solution here. Keep in mind that Dynamodb TTL takes up to 48h to invoke, so it's probably unacceptable. CRON Jobs with Lambda are cheap, and it's easy to set. You coul also use SQS and populate it with daily CRON. Yan Cui wrote great article about this problem https://theburningmonk.com/2019/03/dynamodb-ttl-as-an-ad-hoc-scheduling-mechanism/
This may not exactly be an answer. Based on the medium article you linked the guy had a plausible reason why the TTL and dynamoDB streams would be better than a cron job which you reiterated. Setting up a cron job is easier and cheaper (free) and I doubt the performance will be that much worse unless the database is huge. I don't have any experience doing something like this so I wouldn't know how large the database would have to be for it to make sense to switch over. Alternatively, you can have as many cron jobs as you want so I don't see how you couldn't just set up a user specific cron job whenever someone signs up.
You can setup a CloudWatch Event to fire a Lambda function on a regular schedule. The Lambda function can search a database for an applicable result set and perform other actions - send an email, a text message, etc.
Here is an AWS tutorial that covers a very similar use case with step by step instructions. This tutorial is implemented by using the AWS Java API (but you can implement it using other supported programming languages).
https://github.com/awsdocs/aws-doc-sdk-examples/tree/master/javav2/usecases/creating_scheduled_events
From a Cost perspective - Lambda allows 1M free requests per month. Details are here - https://aws.amazon.com/lambda/pricing/

Should I run forecast predictive model with AWS lambda or sagemaker?

I've been reading some articles regarding this topic and have preliminary thoughts as what I should do with it, but still want to see if anyone can share comments if you have more experience with running machine learning on AWS. I was doing a project for a professor at school, and we decided to use AWS. I need to find a cost-effective and efficient way to deploy a forecasting model on it.
What we want to achieve is:
read the data from S3 bucket monthly (there will be new data coming in every month),
run a few python files (.py) for custom-built packages and install dependencies (including the files, no more than 30kb),
produce predicted results into a file back in S3 (JSON or CSV works), or push to other endpoints (most likely to be some BI tools - tableau etc.) - but really this step can be flexible (not web for sure)
First thought I have is AWS sagemaker. However, we'll be using "fb prophet" model to predict the results, and we built a customized package to use in the model, therefore, I don't think the notebook instance is gonna help us. (Please correct me if I'm wrong) My understanding is that sagemaker is a environment to build and train the model, but we already built and trained the model. Plus, we won't be using AWS pre-built models anyways.
Another thing is if we want to use custom-built package, we will need to create container image, and I've never done that before, not sure about the efforts to do that.
2nd option is to create multiple lambda functions
one that triggers to run the python scripts from S3 bucket (2-3 .py files) every time a new file is imported into S3 bucket, which will happen monthly.
one that trigger after the python scripts are done running and produce results and save into S3 bucket.
3rd option will combine both options:
- Use lambda function to trigger the implementation on the python scripts in S3 bucket when the new file comes in.
- Push the result using sagemaker endpoint, which means we host the model on sagemaker and deploy from there.
I am still not entirely sure how to put pre-built model and python scripts onto sagemaker instance and host from there.
I'm hoping whoever has more experience with AWS service can help give me some guidance, in terms of more cost-effective and efficient way to run model.
Thank you!!
I would say it all depends on how heavy your model is / how much data you're running through it. You're right to identify that Lambda will likely be less work. It's quite easy to get a lambda up and running to do the things that you need, and Lambda has a very generous free tier. The problem is:
Lambda functions are fundamentally limited in their processing capacity (they timeout after max 15 minutes).
Your model might be expensive to load.
If you have a lot of data to run through your model, you will need multiple lambdas. Multiple lambdas means you have to load your model multiple times, and that's wasted work. If you're working with "big data" this will get expensive once you get through the free tier.
If you don't have much data, Lambda will work just fine. I would eyeball it as follows: assuming your data processing step is dominated by your model step, and if all your model interactions (loading the model + evaluating all your data) take less than 15min, you're definitely fine. If they take more, you'll need to do a back-of-the-envelope calculation to figure out whether you'd leave the Lambda free tier.
Regarding Lambda: You can literally copy-paste code in to setup a prototype. If your execution takes more than 15min for all your data, you'll need a method of splitting your data up between multiple Lambdas. Consider Step Functions for this.
SageMaker is a set of services that each is responsible for a different part of the Machine Learning process. What you might want to use is the hosted version of Jupyter notebooks in SageMaker. You get a lot of freedom in the size of the instance that you are using (CPU/GPU, memory, and disk), and you can install various packages on that instance (such as FB Prophet). If you need it once a month, you can stop and start the notebook instances between these times and "Run all" the cells in your notebooks on this instance. It will only cost you the minutes of execution.
regarding the other alternatives, it is not trivial to run FB Prophet in Lambda due to the size limit of the libraries that you can install on Lambda (to avoid too long cold start). You can also use ECS (container Service) where you can have much larger images, but you need to know how to build a Docker image of your code and endpoint to be able to call it.

Display real time data on website that scales?

I am starting a project where I want to create a website which will display LIVE flight information and status. We all have seen this at airport. An example is given here - http://www.computronics.biz/productimages/prodairport4.jpg. As you can see this information changes continuously. The website will talk to a backend api and the this backend api will talk to database. Now the important part is that the flight information in the database will be updated by the airline itself. There could be several airlines and they will update their data respectively. I have drawn a diagram and uploaded here - https://imgur.com/a/ssw1S.
Now those airlines will obviously have an interface (website talking to some backend API) through which they will update the database.
Now here is my attempt to solve it. We need to have some sort of trigger such that if any airline updates a flight detail in the database between current time - 1 hour to current + 4 hours (website will only display few hours of flights), we need to call the web api and then send the update to the website in the real time. The user must not refresh the page at all. At the same time the website needs to scale well i.e. if 1 million users are on the website, and there is an update in the database in the correct time range, all 1 million user's website should get updated within a decent amount of time.
I did some research and it looks like we need to have an event based approach. For example - we need to create a function (AWS lambda or Azure function) that should be called whenever there is an update in the database (Dynamo DB for example) within the correct time range. This function then should call an API which should then update the website through web socket technology for example.
I am not looking for any code but just some alternative suggestions on how this can be solved in a scalable way. Also how do we test scalability?
Dont use serverless functions(Lambda/Azure functions)
Although I am a huge fan of serverless functions, and currently running a full web app in Lambda, I don't think its needed for your use case and doesn't make sense economically. As you've answered in the comments, each airline will not write directly to the database, they'll push to an API, meaning you are explicitly told when flights have changed. When an airline has sent you new data you can simply propagate this to all the browser endpoints via websockets. This keeps the design very simple. There is no need to artificially create a database event that then triggers a function that will then tell you a flight has been updated. Thats like removing your doorbell and replacing it with a motion detector that triggers a doorbell :)
Cost
Money always deserves its own section. Lambda is more of an economic break through than a technological one. You have to know when its cost effective. You pay per request so if your dealing with a process that handles 10,000 operations a month, or something that only fires 1,000 times a day, than lambda is dirt cheap and practically free. You also pay for the length of time the function is executing and the memory consumed while executing. Generally, it makes sense to use lambda functions where a dedicated server would be sitting idle for most of the time. So instead of a whole EC2 instance, AWS provides you with a container on demand. There are points at which high requests rates and constantly running processes makes lambda more expensive than EC2. This article discusses how generally its cheaper to use lambda up to a point -> https://www.trek10.com/blog/lambda-cost/ The same applies to Azure functions and googles equivalent. They are all just containers offered on demand.
If you're dealing with flight information I would imagine you will have thousands of flights being updated every minute so your lambda functions will be firing constantly as if you were running an EC2 instance. You will end up paying a lot more than EC2. When you have a service that needs to stay up 24/7 and run 24/7 with high activity that is most certainly a valid use case for a dedicated server or servers.
Proposed Solution
These are the components I would use below:
Message Queue of some sort (RabbitMQ or AWS SQS with SNS perhaps)
Web Socket Backend (The choice will depend on programming language)
Airline input API (REST,GraphQL, or maybe AWS Kinesis Data Firehose)
The airlines publish their data to a back-end api. The updates are stored on a message queue and the web applicaton that actually displays the results to users, via websockets, reads from the queue.
Scalability
For scalability you can run the websocket application on multiple EC2 instances (all reading from the same queuing service) in an autoscaling group, so with extra load more instances will be created automatically hence the name "autoscaling". And those instances can sit behind an elastic load balancer. Lots of AWS documentation on how to do this and its their flagship design pattern. If you use AWS SQS you don't have to manage the scalability details yourself, aws handles that. The only real components to scale are your websocket application and the flight data input endpoint. You can run the flight api in an autoscaling group as well but AWS does offer an additional tool for high traffic data processing. I detail that below.
Testing Scalability
It would be fairly easy to have a mock airline blast your service with thousands and thousands of fake updates and on the other end you can easily run multiple threads of selenium tests simulating browser clicks and validating that the UI is still operational.
Additional tools
If it ends up being large amounts of data, rather than using a conventional REST api for your flight update service you could consider a service AWS offers specifically for dealing with large amounts of real time updates (Kinessis Data Firehose) https://aws.amazon.com/kinesis/data-firehose/ But I've never used it.
First, please don't over think this. This is a trivial problem to solve and doesn't require any special techniques, technologies or trendy patterns & frameworks.
You actually have three functional areas you can address almost separately.
Ingestion - Collection and normalization of the data from the various sources. For this, you'll need a process and transformation engine, LogicApps or such.
Your databases. You'll quickly learn that not all flights are the same ;). While it might seem so, the amount of data isn't that much. Instances of MySQL/SQL Server tuned for a particular function will work just fine. Hint, you don't need to have data for every movement ready to present all the time.
Presentation. The data API and UIs. This, really, is the easy part. I would suggest you use basic polling at first. For reasons you will never have any control over, the SLA for flight data is ~5 minutes so a real-time client notification system is time you should spend elsewhere at first.

Using any of the Amazon Web Services, how could I schedule something to happen 1 year from now?

I'd like to be able to create a "job" that will execute in an arbitrary time from now... Let's say 1 year from now. I'm trying to come up with a stable, distributed system that doesn't rely on me maintaining a server and scheduling code. (Obviously, I'll have to maintain the servers to execute the job).
I realize I can poll simpleDB every few seconds and check to see if there's anything that needs to be executed, but this seems very inefficient. Ideally I could create an Amazon SNS topic that would fire off at the appropriate time, but I don't think it's possible.
Alternatively, I could create a message in the Amazon SQS that would not be visible for 1 year. After 1 year, it becomes visible and my polling code picks up on it and executes it.
It would seem this is a topic like Singletons or Inversion Control that Phd's have discussed and come up with best practices for. I can't find the articles if there any.
Any ideas?
Cheers!
The easiest way for most people to do this would be to run at least an EC2 server with a cron job on the EC2 server to trigger an action. However, the cost of running an EC2 server 24 hours a day for a year just to trigger an action would be around $170 at the cheapest (8G t1.micro with Heavy Utilization Reserved Instance). Plus, you have to monitor that server and recover from failures.
I have sketched out a different approach to running jobs on a schedule that uses AWS resources completely. It's a bit more work, but does not have the expense or maintenance issues with running an EC2 instance.
You can set up an Auto Scaling schedule (cron format) to start an instance at some point in the future, or on a recurring schedule (e.g., nightly). When you set this up, you specify the job to be run in a user-data script for the launch configuration.
I've written out sample commands in the following article, along with special settings you need to take care of for this to work with Auto Scaling:
Running EC2 Instances on a Recurring Schedule with Auto Scaling
http://alestic.com/2011/11/ec2-schedule-instance
With this approach, you only pay for the EC2 instance hours when the job is actually running and the server can shut itself down afterwards.
This wouldn't be a reasonable way to schedule tens of thousands of emails with an individual timer for each, but it can make a lot of sense for large, infrequent jobs (a few times a day to once per year).
I think it really depends on what kind of job you want to execute in 1 year and if that value (1 year) is actually hypothetical. There are many ways to schedule a task, windows and linux both offer a service to schedule tasks. Windows being Task Scheduler, linux being crontab. In addition to those operating system specific solutions you can use Maintenance tasks on MSSQL server and I'm sure many of the larger db's have similar features.
Without knowing more about what you plan on doing its kind of hard to suggest any more alternatives since I think many of the other solutions would be specific to the technologies and platforms you plan on using. If you want to provide some more insight on what you're going to be doing with these tasks then I'd be more than happy to expand my answer to be more helpful.