How can I aggregate data from multiple Lambdas in AWS? - amazon-web-services

I have an SNS topic which triggers 50 Lambdas in multiple accounts.
Each Lambda produces some output in JSON format.
I want to aggregate all those individual JSON outputs into one list and then pass that to another SNS topic.
What's the best way to aggregate the data?

There are a couple of architectural solutions you can use to solve this. There is probably no single "right one"; it will depend on the volume of data, the frequency of triggers, and your budget.
You will need some shared storage where your 50 Lambda functions can temporarily store their results, and another component, most probably another Lambda function, in charge of the aggregation to produce the final result.
Depending on the volume of data to handle, I would first consider a shared Amazon S3 bucket where all your 50 functions can drop their piece of JSON, and the aggregation function could read and assemble all the pieces. Other services that can act as a shared storage are Amazon DynamoDB and Amazon Kinesis.
The difficulty will be to detect when all the pieces are available to start the final aggregation. If 50 is a fixed number, that will be easy, otherwise you will need to think about a mechanism to tell the aggregation function it can start to work...
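For the fixed-count case, the completeness check and the merge can be sketched as a pure function. This is only an illustration: the name `aggregate_if_complete`, the producer ids, and the idea of passing the stored pieces in as a dictionary are assumptions, and reading the pieces from S3 or DynamoDB is left out.

```python
import json
from typing import Dict, Optional


def aggregate_if_complete(pieces: Dict[str, str], expected_count: int) -> Optional[str]:
    """Return one JSON list once every expected piece has arrived, else None.

    `pieces` maps a (hypothetical) producer id to the raw JSON text that
    producer wrote to the shared storage.
    """
    if len(pieces) < expected_count:
        return None  # still waiting for some producers
    # Sort by producer id so the output order is deterministic.
    combined = [json.loads(pieces[producer_id]) for producer_id in sorted(pieces)]
    return json.dumps(combined)
```

The aggregation function would call this each time it is triggered and publish to the second SNS topic only when the result is not `None`.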

The scenario you describe does not really match the architectural pattern you are choosing. If you know upfront that you'll have to deal with state (aggregating means keeping track of state), SNS & SQS is not the right solution, and neither is Lambda.
What is not mentioned in the other posts is that one of your 50 processes could fail, and you'll have to take that into account too. Handling all of these cases shouldn't be your focus, since there are tools that do it for you.
I recommend you take a look at Amazon Kinesis: https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html
Also, AWS Step Functions provides a solution:
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-parallel-state.html

I would suggest looking at DynamoDB for aggregating the information, if the data being stored lends itself to that.
The various components can drop their data in asynchronously, then the aggregator can perform a single query to pull in the whole result set.
Although it's described as a database, it can be viewed as a simple object store or lookup engine, so you do not really have to think about data keys, only a way to distinguish each contribution from the others.
So you might store each record under "lambda-id + timestamp", which ensures each record is distinct, and then you can just retrieve all records. Don't forget to have a way to retire records, so the system does not fill up!
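A minimal sketch of such a record, assuming a composite string key and an expiry attribute: the attribute names `record_id`, `payload` and `expires_at` and the one-hour retention are made up for illustration. DynamoDB's TTL feature expects the expiry as a Number holding Unix epoch seconds, which is why `expires_at` is an integer.

```python
def make_record(lambda_id: str, payload: dict, now: float, retain_seconds: int = 3600) -> dict:
    """Build one contribution keyed by "lambda-id + timestamp".

    `now` is a Unix timestamp in seconds (for example from time.time()).
    """
    timestamp_ms = int(now * 1000)
    return {
        # Distinct per contribution as long as one Lambda does not write
        # twice within the same millisecond.
        "record_id": f"{lambda_id}#{timestamp_ms}",
        "payload": payload,
        # Epoch seconds after which the record may be retired.
        "expires_at": int(now) + retain_seconds,
    }
```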

Related

Data Storage and Analytics on AWS

I have one data analytics requirement on AWS. I have limited knowledge on Big Data processing, but based on my analysis, I have figured out some options.
The requirement is to collect data by calling a Provider API every 30 mins. (data ingestion)
The data is mainly structured.
This data needs to be stored somewhere (an S3 data lake or Redshift, not sure), and various aggregations/dimensions of this data are to be provided through a REST API.
There is a future requirement to run ML algorithms on the original data, and hence the storage needs to be decided accordingly. So based on this, can you suggest:
How to ingest data (Lambda to run at a scheduled interval and pull data, store in the storage OR any better way to pull data in AWS)
How to store (store in S3 or RedShift)
Data Analytics (currently some monthly, weekly aggregations), what tools can be used? What tools to use if I am storing data in S3.
Expose the analytics results through an API. (Hope I can use Lambda to query the Analytics engine in the previous step)
Ingestion is simple. If the retrieval is relatively quick, then scheduling an AWS Lambda function is a good idea.
However, all the answers to your other questions really depend upon how you are going to use the data, and then work backwards.
For Storage, Amazon S3 makes sense at least for the initial storage of the retrieved data, but might (or might not) be appropriate for the API and Analytics.
If you are going to provide an API, then you will need to consider how the API code (e.g. using Amazon API Gateway) will retrieve the data. For example, is it identical to the blob of data originally retrieved, or are complex transformations required, or perhaps the combining of data from other locations and time intervals? This will help determine how the data should be stored so that it is easily retrieved.
Data Analytics needs will also drive how your data is stored. Consider whether an SQL database is sufficient. If there are millions or billions of rows, you could consider using Amazon Redshift. If the data is kept in Amazon S3, then you might be able to use Amazon Athena. The correct answer depends completely upon how you intend to access and process the data.
Bottom line: Consider first how you will use the data, then determine the most appropriate place to store it. There is no generic answer that we can provide.

Can I aggregate data from a stream on AWS?

I have data coming from multiple machines, I would like to aggregate it by user. I'm thinking of producing batches of 1000 "rows", or 10 seconds of data (whichever comes first), by user.
I do have some experience with AWS Kinesis and Lambdas, but in my experience we don't have much control over how the aggregation is done. All machines would send the data through Kinesis, with the user id as the partition key. Then AWS will call our Lambda with small batches. This has been great for some other use cases, but here, if I receive 100 records, I don't know what to do (I would like to "wait" to receive more, or wait until 10 seconds have elapsed since the timestamp of the first record).
Also, I'm not sure how the aggregation "by user id" would work. So far, in a Lambda, I would have split the records in the batch by user id, but if I get called with a batch of 100 records, even though there is a partition key on the user id, there is no guarantee that those 100 records would be for 1 user. Maybe I will get 100 records from 100 different users, and there is no "aggregation" help at all.
Any idea whether Kinesis + Lambda is suited for this? I did look at the AWS documentation but I don't see my scenario. It looks like they also have a tool called "Data Streams", but it's hard for me to understand whether it would work for my case.
Thanks!
Your understanding is correct. AWS Lambda + Kinesis alone will not be sufficient for aggregation. The AWS Lambda programming model is stateless, so you can only aggregate over the batch of records received in that particular invocation (GetRecords API call). Furthermore, the batch size configured for the function does not guarantee that you will get that number of records. It is merely the maximum number of records you can get (MaxRecords) per invocation.
What you need is some kind of windowing mechanism, either row-based or time-based. Kinesis Analytics would be the easiest and fastest way to get started with this. You can use either SQL or Flink with Kinesis Analytics. You can even send your output to AWS Lambda for post-processing.
Another option would be to use a Spark Streaming job (you can use Amazon EMR) and use windowing in your application.
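To make the "1000 rows or 10 seconds, whichever comes first" window concrete, here is a rough in-memory sketch of the logic a windowing engine applies per user. It is not how Kinesis Analytics, Flink or Spark implement it; the class name `UserBatcher` and its methods are hypothetical, and because it keeps state in memory it only makes sense inside a long-running process, not a stateless Lambda.

```python
from typing import Dict, List, Tuple


class UserBatcher:
    """Collects rows per user and releases a batch on size or age."""

    def __init__(self, max_rows: int = 1000, max_age_seconds: float = 10.0):
        self.max_rows = max_rows
        self.max_age_seconds = max_age_seconds
        self._rows: Dict[str, List[dict]] = {}
        self._first_seen: Dict[str, float] = {}

    def add(self, user_id: str, row: dict, now: float) -> List[Tuple[str, List[dict]]]:
        """Store one row and return any batches that are now ready."""
        if user_id not in self._rows:
            self._rows[user_id] = []
            self._first_seen[user_id] = now  # window starts at the first row
        self._rows[user_id].append(row)
        return self.flush_ready(now)

    def flush_ready(self, now: float) -> List[Tuple[str, List[dict]]]:
        """Release every user's batch that is full or older than the limit."""
        ready = []
        for user_id in list(self._rows):
            full = len(self._rows[user_id]) >= self.max_rows
            expired = now - self._first_seen[user_id] >= self.max_age_seconds
            if full or expired:
                ready.append((user_id, self._rows.pop(user_id)))
                del self._first_seen[user_id]
        return ready
```

A timer would also need to call `flush_ready` periodically, so that a quiet user's partial batch is still released after 10 seconds.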

How to efficiently implement a page view counter in DynamoDB?

I am essentially trying to build a website where members can post blog entries, and I want to record unique and overall page views for the different posts, in absolute terms as well as over different time frames, e.g. last 24h, last week, etc.
My initial approach was to use the date as the primary key and the blogPostId as the secondary key; I could then add all the posts visited during a given day. If I then include the userIds as an attribute, I should be able to get a) unique page views and b) overall page views (which might include duplicate visits by a specific user) for a given day. Finally, I would pull the primary key for, let's say, the last 7 days and extract the most popular post.
As far as I can tell, this should work fine as long as there aren't too many entries; however, I'm sceptical whether this will scale. More specifically, if the number of blog posts increases a lot for a given interval, or if I want to find the all-time most viewed post, I'd essentially have to read the whole table.
Has anyone an idea how i could implement this more efficiently?
DynamoDB will almost certainly work for you, and if you need an excuse to use it, by all means give it a try. If you get a ton of traffic it might end up being expensive.
Personally, I would consider using Redis for what you are asking to do, and here is a pretty good/detailed question/answer on how you might implement it:
Scalable way of logging page request data from a PHP application?
DynamoDB can be used to iterate and create this feature quickly.
Nonetheless, this is a use case for Amazon Kinesis Data Streams, which will let you ingest data and then manipulate it to your needs.
Be aware that Kinesis can be expensive if you are trying to be as frugal as possible.
But if you start receiving a lot of traffic, Kinesis will work as a queue and let you manipulate the data before ingesting it into DynamoDB (or another data store), which will be cheaper than sending all those write requests.
Another limitation you should check is that DynamoDB will only return up to 1 MB per Query.
Amazon recommends you use Redshift to handle all those operations, as it is better suited to performing aggregations and calculations across data warehouses.
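In practice, the 1 MB limit means a Query has to be read page by page: when more data remains, the response carries a `LastEvaluatedKey`, which is passed back as `ExclusiveStartKey` on the next call. A minimal sketch of that loop follows. The helper name `query_all_pages` is hypothetical, and `query_fn` is assumed to be something like the `query` method of a boto3 DynamoDB client or table, injected here so the loop can be tried without AWS.

```python
from typing import Any, Callable, Dict, List


def query_all_pages(query_fn: Callable[..., Dict[str, Any]], **query_kwargs: Any) -> List[Any]:
    """Call `query_fn` repeatedly until no LastEvaluatedKey is returned."""
    items: List[Any] = []
    kwargs = dict(query_kwargs)
    while True:
        page = query_fn(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items  # no more pages
        # Resume the next request where this page stopped.
        kwargs["ExclusiveStartKey"] = last_key
```

Every page still consumes read capacity, so this does not remove the cost concern raised in the question about reading the whole table.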

AWS hosted data storage for storing simple entities

I need to choose a data store for a simple system. The main purpose of the system is storing events: simple entities with a timestamp, user id and type. No joins. Just a single table.
Stored data will be fetched rarely (compared with writes). I expect the following read operations:
get latest events for a list of users
get latest events of a type for a list of users
I expect about 0.5-1 million writes a day. Data older than 2 years can be removed.
I'm looking for the best-fitting service provided by AWS. I wonder if using Redshift is like taking a sledgehammer to crack a nut?
For your requirement you can use Amazon DynamoDB and also define TTL values to remove older items automatically. You get the following advantages:
Fully managed data storage
Able to scale with the need for write throughput (Though it can be costly)
Use sort key with timestamp to query latest items.
I would also check out Amazon SimpleDB, as at first glance it looks like a better fit for your requirements.
Please refer to this article, which describes some practical user experience:
http://www.masonzhang.com/2013/06/2-reasons-why-we-select-simpledb.html
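For the DynamoDB option, "get latest events for a list of users" could be served by one Query per user, with the timestamp as sort key and results read in descending order. The sketch below only builds the request parameters. The table name `events`, the partition key name `user_id` and the helper name are assumptions, and the dictionaries are meant to be passed to a low-level client call such as `boto3.client("dynamodb").query(**params)`.

```python
from typing import Any, Dict, List


def latest_events_queries(table_name: str, user_ids: List[str], limit: int) -> List[Dict[str, Any]]:
    """Build one Query request per user, newest events first."""
    return [
        {
            "TableName": table_name,
            # A Query targets a single partition key value, hence one per user.
            "KeyConditionExpression": "user_id = :u",
            "ExpressionAttributeValues": {":u": {"S": user_id}},
            # False reads the sort key (the timestamp) in descending order.
            "ScanIndexForward": False,
            "Limit": limit,
        }
        for user_id in user_ids
    ]
```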

Orphan management with AWS DynamoDB & S3 data spread across multiple items & buckets?

DynamoDB items are currently limited to a 400KB maximum size. When storing items longer than this limit, Amazon suggests a few options, including splitting long items into multiple items, splitting items across tables, and/or storing large data in S3.
Sounds OK if nothing ever failed. But what's a recommended approach to deal with making updates and deletes consistent across multiple DynamoDB items plus, just to make things interesting, S3 buckets too?
For a concrete example, imagine an email app with:
EmailHeader table in DynamoDB
EmailBodyChunk table in DynamoDB
EmailAttachment table in DynamoDB that points to email attachments stored in S3 buckets
Let's say I want to delete an email. What's a good approach to make sure that orphan data will get cleaned up if something goes wrong during the delete operation and data is only partially deleted? (Ideally, it'd be a solution that won't add additional operational complexity like having to temporarily increase the provisioned read limit to run a garbage-collector script.)
There are a couple of alternatives for your use case:
Use the DynamoDB transactions library that:
enables Java developers to easily perform atomic writes and isolated reads across multiple items and tables when building high scale applications on Amazon DynamoDB.
It is important to note that it requires 7N+4 writes, which'll be costly. So, go this route only if you require strong ACID properties, such as for banking or other monetary applications.
If you are okay with the DB being inconsistent for a short duration, you can perform the required operations one by one and mark the entire thing complete only at the end.
You could manage your deletion events with an SQS queue that supports exactly-once semantics and use that queue to start a Step workflow that deletes the corresponding header, body chunk and attachment. In retrospect, the queue does not even need to be exactly once, as you can just stop a workflow in case the header does not exist.
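The "stop if the header does not exist" idea relies on deleting the header last, so that it serves as the record of unfinished work and a retried or duplicated delete event is harmless. A minimal in-memory sketch of that ordering follows. The dictionaries stand in for the EmailHeader, EmailBodyChunk and EmailAttachment tables and the S3 bucket, and the function name `delete_email` is hypothetical.

```python
from typing import Any, Dict, List


def delete_email(
    email_id: str,
    headers: Dict[str, Any],
    body_chunks: Dict[str, List[str]],
    attachments: Dict[str, List[str]],
    blobs: Dict[str, bytes],
) -> bool:
    """Delete one email; return False when there is nothing left to do."""
    if email_id not in headers:
        return False  # duplicate event, or a previous run already finished
    # Remove the stored objects before the rows that point at them, so a
    # crash here leaves pointers behind for the retry to follow.
    for blob_key in attachments.get(email_id, []):
        blobs.pop(blob_key, None)
    attachments.pop(email_id, None)
    body_chunks.pop(email_id, None)
    # Deleting the header last marks the whole operation as complete.
    del headers[email_id]
    return True
```

If the process dies at any step, the header is still present, so re-running the same event picks up the remaining pieces instead of leaving orphans.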