Using AWS To Process Large Amounts Of Data With Serverless

I have about 300,000 transactions for each user in my DynamoDB database.
I would like to calculate taxes based on those transactions in a serverless manner, if that is the cheapest way.
My thought process was to use AWS Step Functions to grab all of the transactions, store them in Amazon S3 as a CSV file, then use AWS Step Functions to iterate over each row in the CSV. The problem is that once I read a row, I would have to keep it in memory so that I can use it for later calculations. If the Lambda function runs out of time, I have no way to save that state, so this route is not feasible.
Another route, which would be expensive, is to keep two copies of each transaction in DynamoDB and perform the operations on the copy table, keeping the original data untouched. The problem with this is that DynamoDB is eventually consistent, so there could be a scenario where I read a stale item.

Serverless is ideal for event-driven processing, but for your batch use case it is probably easier to use an Amazon EC2 instance.
An Amazon EC2 t2.nano instance costs under 1c/hour, as does a t2.micro instance at spot pricing, and both are billed per second.

There really isn't enough detail here to make a good suggestion. For example, how is the data organized in your DynamoDB table? How often do you plan on running this job? How quickly do you need the job to complete?
You mentioned price, so I'm assuming that is the biggest factor for you.
Lambda tends to be cheapest for event-driven processing. The idea is that with any EC2/ECS event-driven system you would need to over-provision by some amount to handle spikes in traffic. The over-provisioned compute power is idle most of the time, but you still pay for it. With Lambda, you pay a little more per unit of compute, but you save money by needing less of it, since you don't need to over-provision.
Batch processing systems tend to lend themselves nicely to EC2, since they typically use 100% of the compute power for the duration of the job. At the end of the job you shut down all of the instances and stop paying for them. Also, if you use spot pricing, you can push the price of your compute power down considerably.
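As a rough illustration of the batch-on-EC2 route, here is a minimal boto3 sketch that launches a spot-priced instance which runs a job and powers itself off when done, so billing stops. The AMI ID, bucket name, and script name are placeholders, not anything from the question.

```python
# Minimal sketch: launch a spot-priced EC2 instance for a one-off batch job
# and let the job shut the instance down when it finishes (placeholder names).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

user_data = """#!/bin/bash
# Download the batch job, run it, then power off so billing stops.
aws s3 cp s3://my-example-bucket/run_tax_batch.sh /tmp/run_tax_batch.sh
bash /tmp/run_tax_batch.sh
shutdown -h now
"""

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",               # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={"MarketType": "spot"},   # request spot pricing
    InstanceInitiatedShutdownBehavior="terminate",  # terminate when the job powers off
    UserData=user_data,
)
print(response["Instances"][0]["InstanceId"])
```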

Calling Lambda functions programmatically every minute at scale

While I have worked with AWS for a bit, I'm stuck on how to correctly approach the following use case.
We want to design an uptime monitor for up to 10K websites.
The monitor should run from multiple AWS regions, ping websites to check whether they are available, and measure the response time. With a Lambda function, I can ping the site, pass the result to an SQS queue and process it. So far, so good.
However, I want to run this function every minute. I also want the ability to add and delete monitors. So if I no longer want to monitor website "A" from region "us-west-1", I would like to be able to remove it, or the other way round, add a website to a region.
Ideally, all of this would run serverless and be deployable to specific regions with CloudFormation.
What services should I go with?
I have been thinking about EventBridge, where I would create custom events for every website in every region and then send the result over SNS to a central processing Lambda. But I'm not sure this is the way to go.
Alternatively, I thought about building a scheduler Lambda that fetches the websites it has to check from a database and then invokes the fetcher Lambda. But I was not sure about the delay, since I want the functions triggered every minute. The architecture should monitor 10K websites, and even more if possible.
Feel free to give me any advice you have :)
Kind regards.
In my opinion, Lambda is not the correct solution for this problem. Your costs will be very high and it may not scale to what you ultimately want to do.
A c5.9xlarge EC2 instance costs about USD $1.53/hour and has a 10 Gbit network. With 36 vCPUs, a threaded program could take care of a large percentage - maybe all 10K - of your load. It could still be run in multiple regions on demand and push to an SQS queue. That's around $1,100/month/region without pre-purchasing EC2 time.
A Lambda function invoked 10,000 times per minute, running for 5 seconds each time with only 128 MB of memory, would cost around USD $4,600/month/region.
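As a quick back-of-the-envelope check of those two estimates (using the prices quoted above and Lambda's public per-GB-second and per-request rates, which vary by region and over time):

```python
# Back-of-the-envelope check of the two estimates above.
HOURS_PER_MONTH = 24 * 30

# EC2: one c5.9xlarge at ~$1.53/hour
ec2_monthly = 1.53 * HOURS_PER_MONTH
print(f"c5.9xlarge: ~${ec2_monthly:,.0f}/month")                    # ~ $1,102

# Lambda: 10,000 invocations/minute, 5 s each, 128 MB of memory
invocations = 10_000 * 60 * HOURS_PER_MONTH                         # per month
gb_seconds = invocations * 5 * (128 / 1024)
lambda_compute = gb_seconds * 0.0000166667                          # $ per GB-second
lambda_requests = invocations / 1_000_000 * 0.20                    # $ per million requests
print(f"Lambda: ~${lambda_compute + lambda_requests:,.0f}/month")   # ~ $4,600
```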
Coupled with the management interface you're alluding to, the EC2 instance could handle pretty much everything you want to do. Of course, you'd want to scale and likely have at least two instances for failover, but even with two of them you're still at less than half the cost of the Lambda approach. If you later scale to 100,000 websites, it's just a matter of adding machines.
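For a sense of what the EC2 route could look like, here is a minimal sketch of a threaded checker that tests a batch of sites once a minute and pushes the results to SQS. The queue URL and site list are placeholders; in practice the ~10K URLs would be loaded from a database.

```python
# Minimal sketch of the threaded EC2 approach (requires: boto3, requests).
# Queue URL and site list are placeholders.
import json
import time
from concurrent.futures import ThreadPoolExecutor

import boto3
import requests

QUEUE_URL = "https://sqs.us-west-1.amazonaws.com/123456789012/uptime-results"
SITES = ["https://example.com", "https://example.org"]   # load ~10K from a DB in practice

sqs = boto3.client("sqs")

def check(url):
    start = time.time()
    try:
        status = requests.get(url, timeout=10).status_code
    except requests.RequestException:
        status = None
    return {"url": url, "status": status, "ms": int((time.time() - start) * 1000)}

def run_once():
    with ThreadPoolExecutor(max_workers=200) as pool:
        for result in pool.map(check, SITES):
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(result))

while True:
    run_once()       # in practice, align each run to the start of the minute
    time.sleep(60)
```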
There are a ton of other choices, but understand that serverless does not mean cost-efficient in all use cases.

Avoid throttling DynamoDB

I am new to cloud computing, but I had a question about whether a mechanism like the one I am about to describe exists, or is possible to create.
DynamoDB has provisioned throughput (e.g. 100 writes/second). Of course, in a real-world application the actual throughput is very dynamic and will almost never match your provisioned 100 writes/second. I was thinking it would be great to have some type of queue for DynamoDB. For example, my table during peak hours may receive 500 write requests per second (5 times what I have provisioned) and would return errors. Is there some queue I can put between the client and the database, so that client requests go to the queue, the client is acknowledged that its request has been accepted, and the queue then feeds requests to DynamoDB at exactly 100 writes per second? That way no errors are returned and I don't need to raise the provisioned throughput, which would raise my costs.
Putting AWS SQS in front of DynamoDB would solve this problem for you, and is not an uncommon design pattern. SQS is already well suited to scale as big as it needs to, and to ingest a large number of messages with unpredictable flow patterns.
You could either put all the messages into SQS first, or use SQS as an overflow buffer when you exceed the designed throughput on your DynamoDB table.
One or more worker instances can then read messages from the SQS queue and put them into DynamoDB at exactly the pace you decide.
If the order of the messages coming in is extremely important, Kinesis is another option for you to ingest the incoming messages and then insert them into DynamoDB, in the same order they arrived, at a pace you define.
IMO, SQS will be easier to work with, but Kinesis will give you more flexibility if your needs are more complicated.
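A minimal sketch of such a worker, assuming the SQS message body is a JSON object that matches the table's schema (queue URL, table name, and rate are placeholders):

```python
# Minimal sketch: drain an SQS queue and write to DynamoDB at a fixed pace
# (here ~100 writes/second). Queue URL and table name are placeholders.
import json
import time

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/write-buffer"
TABLE_NAME = "my-table"
WRITES_PER_SECOND = 100

sqs = boto3.client("sqs")
table = boto3.resource("dynamodb").Table(TABLE_NAME)

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)     # long polling
    for msg in resp.get("Messages", []):
        table.put_item(Item=json.loads(msg["Body"]))   # assumes body is a valid item
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        time.sleep(1.0 / WRITES_PER_SECOND)            # crude rate limit
```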
This cannot be accomplished using DynamoDB alone. DynamoDB is designed for uniform, scalable, predictable workloads. If you want to put a queue in front of DynamoDB, you have to do that yourself.
DynamoDB does have a little tolerance in the form of burst capacity, but that is not meant for sustained use. You should read the best practices section Consider Workload Uniformity When Adjusting Provisioned Throughput, but here are a few paragraphs I think are important, with a few things emphasized by me:
For applications that are designed for use with uniform workloads, DynamoDB's partition allocation activity is not noticeable. A temporary non-uniformity in a workload can generally be absorbed by the bursting allowance, as described in Use Burst Capacity Sparingly. However, if your application must accommodate non-uniform workloads on a regular basis, you should design your table with DynamoDB's partitioning behavior in mind (see Understand Partition Behavior), and be mindful when increasing and decreasing provisioned throughput on that table.
If you reduce the amount of provisioned throughput for your table, DynamoDB will not decrease the number of partitions. Suppose that you created a table with a much larger amount of provisioned throughput than your application actually needed, and then decreased the provisioned throughput later. In this scenario, the provisioned throughput per partition would be less than it would have been if you had initially created the table with less throughput.
There are tools that help with auto-scaling DynamoDB, such as sebdah/dynamic-dynamodb which may be worth looking into.
One update for those seeing this more recently: for burst capacity, DynamoDB launched the On-Demand capacity mode in 2018.
You don't need to decide on capacity beforehand; it will scale read and write capacity to follow demand.
See:
https://aws.amazon.com/blogs/aws/amazon-dynamodb-on-demand-no-capacity-planning-and-pay-per-request-pricing/
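A minimal sketch of creating a table in on-demand mode with boto3 (table and key names are placeholders); an existing table can be switched the same way via update_table:

```python
# Minimal sketch: create a table in on-demand (pay-per-request) mode, so no
# read/write capacity has to be provisioned up front. Names are placeholders.
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="my-table",
    AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",   # on-demand mode, no ProvisionedThroughput block
)
```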

Using AWS DynamoDB or Redshift to store analytics data

I would like to ask which service would suit me best. For example, a Facebook-like mobile app where I need to track every movement of a user, such as the pages visited or links clicked.
I am thinking of using DynamoDB and creating multiple tables to track the different activities. When I run my analytics app, it will query all the data in each table (same hash key but different range key, so I can query all the data) and compute the result in the app. So the main cost is the read throughput, which can easily be 250 reads/s (~$28/month) for each table. The storage for each table has no limit, so is it free?
For Redshift, I will be paying for the storage size on a 100%-utilized-per-month basis for 160 GB. That will cost me about $14.62/month. Although it looks cheaper, I am not familiar with Redshift, so I am not sure what the other hidden costs are.
Thanks in advance!
Pricing for Amazon DynamoDB has several components:
Provisioned Throughput Capacity (the speed of the tables)
Indexed Data Storage (the cost of storing data)
Data Transfer (for data going from AWS to the Internet)
For example, 100GB of data storage would cost around $25.
If you want 250 reads/second, read capacity costs $0.0065 per hour for every 50 units, which is $0.0065 * 5 blocks of 50 units * 24 hours * 30 days = $23.40 (plus whatever write capacity units you need).
Pricing for Amazon Redshift is based upon the number and type of nodes. A 160GB dc1.large node would cost 25c/hour * 24 hours * 30 days = $180 per node (but only one node is probably required for your situation).
Amazon Redshift therefore comes out as more expensive, but it is also a more-featured system. You can run complex SQL against Amazon Redshift, whereas you would have to write an application to retrieve, join and compute information from DynamoDB. Think of DynamoDB as a storage service, while Redshift is also a querying service.
The real decision, however, should be based on how you are going to use the data. If you can create an application that will work with DynamoDB, then use it. However, many people find the simplicity of using SQL on Redshift to be much easier.
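To illustrate the difference, here is a small sketch with hypothetical table, key, and attribute names: with DynamoDB you fetch the items and aggregate in your own code, whereas in Redshift the same aggregation is a single SQL statement.

```python
# Illustration only (hypothetical table and attribute names): DynamoDB returns
# raw items that you aggregate in application code; Redshift runs the SQL for you.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("page_views")

# DynamoDB: query one user's activity, then count page views in the app.
# (A real query would also follow LastEvaluatedKey pagination.)
counts = {}
resp = table.query(KeyConditionExpression=Key("user_id").eq("user-123"))
for item in resp["Items"]:
    counts[item["page"]] = counts.get(item["page"], 0) + 1
print(counts)

# Redshift: the equivalent aggregation is just SQL, e.g.
#   SELECT page, COUNT(*) FROM page_views
#   WHERE user_id = 'user-123' GROUP BY page;
```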

Amazon Kinesis Vs EC2

Sorry for the silly question; I am new to cloud development. I am trying to develop a real-time processing app in the cloud that can process data from a sensor in real time. The data stream has a very low data rate, <50 Kbps per sensor, and probably <10 sensors will be running at once.
I am confused about what the use of Amazon Kinesis would be for this application. I can use EC2 directly to receive my stream and process it. Why do I need Kinesis?
Why do I need Kinesis?
Short answer, you don't.
Yes, you can use EC2 - and probably dozens of other technologies.
Here is the first two sentences of the Kinesis product page:
Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. You can configure hundreds of thousands of data producers to continuously put data into an Amazon Kinesis stream.
So, if you want to manage the stack yourself, and/or you don't need massive scale, and/or you don't need the ability to scale this processing to hundreds of thousands of simultaneous producers, then Kinesis may be overkill.
On the other hand, if the ingestion of this data is mission critical, and you don't have the time, skills or ability to manage the underlying infrastructure - or there is a chance the scale of your application will grow exponentially, then maybe Kinesis is the right choice - only you can decide based on your requirements.
Along with what E.J. Brennan just said, there are many other ways to solve your problem, as the rate of data is very low.
As far as I know, Amazon Kinesis runs on EC2 under the hood, so maybe your question is really why to use Kinesis as a streaming solution at all.
For scalability reasons, you might need the streaming solution in the future, as your volume of data grows, the cost of maintaining your own resources increases, and the focus shifts from application development to administration.
So Kinesis, for that matter, would provide a pay-per-use model instead of you having to worry about growing and shrinking your resource stack.
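If you do end up going with Kinesis, the producer side is only a few lines. A minimal sketch, with a hypothetical stream name and payload:

```python
# Minimal sketch of the producer side with Kinesis: push each sensor reading
# into a stream. Stream name and payload are placeholders.
import json
import boto3

kinesis = boto3.client("kinesis")

def publish_reading(sensor_id, value):
    kinesis.put_record(
        StreamName="sensor-stream",
        Data=json.dumps({"sensor_id": sensor_id, "value": value}).encode("utf-8"),
        PartitionKey=sensor_id,   # readings from one sensor stay in order
    )

publish_reading("sensor-01", 21.7)
```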

Do I need to set up a backup data pipeline for AWS DynamoDB on a daily basis?

I am considering using AWS DynamoDB for an application we are building. I understand that setting up a backup job that exports data from DynamoDB to S3 involves a data pipeline with EMR. But my question is: do I need to worry about having a backup job set up on day 1? What are the chances of data loss happening?
There are multiple use cases for copying DynamoDB table data elsewhere:
(1) Create a backup in S3 on a daily basis, in order to restore in case of accidental deletion of data or, worse yet, a dropped table (code bugs?).
(2) Create a backup in S3 to become the starting point of your analytics workflows. Once this data is backed up in S3, you can combine it with, say, your RDBMS system (RDS or on-premise) or other S3 data from log files. Data integration workflows could involve EMR jobs whose output is ultimately loaded into Redshift (ETL) for BI queries, or load the data directly into Redshift for a more ELT style, so the transforms happen within Redshift.
(3) Copy (the whole set or a subset of) data from one table to another, either within the same region or another region, so the old table can be garbage-collected for controlled growth and cost containment. This table-to-table copy (a minimal sketch appears at the end of this answer) could also serve as a readily consumable backup table in case of, say, region-specific availability issues, or be used to copy data from one region to another to serve it from an endpoint closer to the DynamoDB client application that is using it.
(4) Periodic restore of data from S3, possibly as a way to load post-analytics data back into DynamoDB for serving it in online applications with high-concurrency, low-latency requirements.
AWS Data Pipeline helps schedule all these scenarios with flexible data transfer solutions (using EMR underneath).
One caveat when using these solutions: this is not a point-in-time backup, so any changes to the underlying table that happen during the backup might leave it inconsistent.
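As a minimal sketch of the table-to-table copy in (3), for tables small enough to copy from a single process (table names are placeholders; for large tables this is the kind of work the Data Pipeline/EMR jobs do at scale):

```python
# Minimal sketch: copy every item from one DynamoDB table to another by
# scanning the source and batch-writing into the destination (placeholder names).
import boto3

dynamodb = boto3.resource("dynamodb")
source = dynamodb.Table("transactions")
destination = dynamodb.Table("transactions-copy")

scan_kwargs = {}
with destination.batch_writer() as batch:
    while True:
        page = source.scan(**scan_kwargs)
        for item in page["Items"]:
            batch.put_item(Item=item)
        if "LastEvaluatedKey" not in page:                    # no more pages
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```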
This is really subjective. IMO, you shouldn't worry about it 'now'.
You can also use simpler solutions than Data Pipeline. Perhaps that would be a good place to start.
After running DynamoDB as our main production database for more than a year, I can say it has been a great experience. No data loss and no downtime. The only things we have to care about are the SDK occasionally misbehaving and tweaking provisioned throughput.
Data Pipeline is only available in a limited set of regions.
https://docs.aws.amazon.com/general/latest/gr/rande.html#datapipeline_region
I would recommend setting up a Data Pipeline to back up to an S3 bucket on a daily basis, if you want to be really safe.
DynamoDB itself might be very reliable, but nobody can protect you from your own accidental deletions (what if you or a colleague deleted a table from the console by mistake?). So I would suggest setting up a daily backup; it doesn't cost that much in any case.
You can tell the pipeline to only consume, say, 25% of the read capacity while the backup is running so that your real users don't see any delay. Every backup is "full" (not incremental), so at some periodic interval you can delete old backups if you are concerned about storage.