Ordering of streaming data with Kinesis Streams and Firehose

I have an architecture dilemma for my current project, which is near-realtime processing of a large amount of data. So here is a diagram of the current architecture:
Here is an explanation of my idea which led me to that picture:
When the API Gateway receives a request, it is put into the stream (because of the "fire and forget" nature of my application); that's how I came to this conclusion. The input data is separated into shards based on a specific request attribute, which guarantees the correct order.
Then I have a Lambda which handles validating the input and anomaly detection, so it is an abstraction that keeps the data clean for the next layer: the data enrichment. This Lambda sends the data to a Kinesis Firehose delivery stream, because Firehose can back up the "raw" data (something I definitely want to have) and also attach a transformation Lambda which will do the enrichment, so I won't have to care about saving the data in S3; it comes out of the box. Everything is great until the moment where I need the received data to keep its order (the enricher is doing sessionization), which is lost in Firehose, because there is no data separation there as there is in Kinesis streams.
The only thing I could think of is to move the sessionization into the first Lambda, which would break my abstraction, because it would start caring about data enrichment; the bigger drawback is that the backup would then contain enriched data, which also breaks the architecture. And all this happens because of the missing sharding concept in Firehose.
So can someone think of a solution to this problem without losing the out-of-the-box features AWS provides?
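(For reference, the per-shard ordering comes from the partition key on the producer side; a minimal boto3 sketch, with the stream name and the ordering attribute made up:)

```
import json
import boto3

kinesis = boto3.client("kinesis")

def put_event(event: dict) -> None:
    # Records with the same partition key land on the same shard,
    # so they are read back in the order they were written.
    kinesis.put_record(
        StreamName="ingest-stream",                  # illustrative name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["session_key"]),      # illustrative ordering attribute
    )
```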

I think that sessionization and data enrichment are two different abstractions and will need to be split between the Lambdas.
A session is a time-bound, strictly ordered flow of events bounded by a purpose or task. You only have that information at the first Lambda stage (from the Kinesis stream categorization), so you should label flows with session context at the source, where sessions can be bounded.
If storing session information in the backup is a problem, it may be that the definition of a session is not well specified or is subject to redefinition. If sessions are subject to future recasting, the session data already calculated can be ignored, provided enough additional detail has also been recorded to inform whatever future notion of a session may be needed.
Additional enrichment providing business context (aka externally identifiable data) should process the sessions transactionally within the previously recorded boundaries.
If sessions aren't transactional at the business level, then the definition of a session is over- or under-specified. If that is the case, you are out of the stream-processing business and into batch processing, where you will need to scale state to the number of possible simultaneously interleaved sessions and their maximum durations, querying the entire corpus of events to bracket sessions of hopefully manageable duration.
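As a rough illustration of that labeling, the first Lambda could stamp each event with session context taken from the Kinesis record before forwarding the otherwise raw payload to Firehose; a sketch assuming boto3, with the delivery stream name and field names made up:

```
import base64
import json
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    records = []
    for record in event["Records"]:  # Kinesis trigger
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Attach session context derived from the partition attribute,
        # without enriching the payload itself.
        payload["session_key"] = record["kinesis"]["partitionKey"]
        payload["sequence_number"] = record["kinesis"]["sequenceNumber"]

        records.append({"Data": (json.dumps(payload) + "\n").encode("utf-8")})

    if records:
        firehose.put_record_batch(
            DeliveryStreamName="enrichment-stream",  # illustrative name
            Records=records,
        )
```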

Related

Feeding real-time data into AWS Personalize

I want to feed real-time data into AWS Personalize to build a recommendation engine. I've read online resources, and in those guides the training user-interaction data, user data and item data are provided up front, while creating the recommendation engine.
However, I have an app, I will gather data in that app, and I want to feed that real-time data into AWS Personalize. I want to know if it is possible to build the recommendation engine without providing any data at first, and then stream real-time data from my app later with the PutEvents, PutItems and PutUsers APIs from the aws-sdk? I'm quite new to this, so I'm quite confused by this initial step.
I want to know if it is possible to build the recommendation engine without providing any data at first, and then stream real-time data from my app later with the PutEvents, PutItems and PutUsers APIs from the aws-sdk?
Yes, it is possible. You just need to adjust the sequence of creating resources.
Interaction data is required for all Personalize recipes before a recommender can be created that provides recommendations. However, if you don't have interaction data (or enough data; see quotas and limits) to start with, you can create a dataset group and an interactions dataset, feed interactions to the dataset using the PutEvents API (see recording events page), and then create a domain recommender or custom solution when enough data has been ingested.
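A minimal sketch of streaming those interactions with boto3, once a dataset group, interactions dataset and event tracker exist; the tracking ID, session handling and event fields below are placeholders:

```
import time
import boto3

personalize_events = boto3.client("personalize-events")

def record_interaction(user_id: str, item_id: str, event_type: str) -> None:
    # Stream a single interaction into the interactions dataset.
    personalize_events.put_events(
        trackingId="YOUR_EVENT_TRACKER_ID",   # from CreateEventTracker
        userId=user_id,
        sessionId=f"session-{user_id}",       # illustrative session handling
        eventList=[
            {
                "eventType": event_type,      # e.g. "Watch" or "Purchase"
                "itemId": item_id,
                "sentAt": int(time.time()),
            }
        ],
    )
```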
The minimum amount of interaction data (and potentially item metadata) required before you can train a model/recommender depends on the recipe that you select. Generally speaking, you will need 1000 interactions across 25 distinct users where each of those users has 2+ interactions. The domain recommenders also require specific event types. Check the docs linked above. The quality and relevance of recommendations will improve as you collect more data and retrain.

Data Storage and Analytics on AWS

I have a data analytics requirement on AWS. I have limited knowledge of big data processing, but based on my analysis I have figured out some options.
The requirement is to collect data by calling a provider API every 30 minutes (data ingestion).
The data is mainly structured.
This data needs to be stored in storage (an S3 data lake or Redshift, I'm not sure), and various aggregations/dimensions of this data are to be provided through a REST API.
There is a future requirement to run ML algorithms on the original data, and hence the storage needs to be decided accordingly. So based on this, can you suggest:
How to ingest the data (a Lambda run at a scheduled interval to pull the data and store it, OR any better way to pull data into AWS)?
How to store it (S3 or Redshift)?
Data analytics (currently some monthly and weekly aggregations): what tools can be used? What tools should I use if I am storing the data in S3?
How to expose the analytics results through an API (I hope I can use Lambda to query the analytics engine from the previous step)?
Ingestion is simple. If the retrieval is relatively quick, then scheduling an AWS Lambda function is a good idea.
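As a rough sketch of that approach, a Lambda triggered every 30 minutes by an EventBridge (CloudWatch Events) rule could pull from the provider and drop the raw response into S3; the endpoint and bucket name below are made up:

```
import urllib.request
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Pull the latest data from the provider API (illustrative URL).
    with urllib.request.urlopen("https://provider.example.com/api/data") as resp:
        payload = resp.read()

    # Store the raw response in S3, partitioned by retrieval time.
    now = datetime.now(timezone.utc)
    key = f"raw/{now:%Y/%m/%d}/{now:%H%M%S}.json"
    s3.put_object(Bucket="my-data-lake-bucket", Key=key, Body=payload)
    return {"stored": key}
```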
However, the answers to your other questions really depend upon how you are going to use the data; work backwards from that.
For Storage, Amazon S3 makes sense at least for the initial storage of the retrieved data, but might (or might not) be appropriate for the API and Analytics.
If you are going to provide an API, then you will need to consider how the API code (e.g. behind AWS API Gateway) will retrieve the data. For example, is it identical to the blob of data originally retrieved, or are complex transformations required, or perhaps a combining of data from other locations and time intervals? This will help determine how the data should be stored so that it is easily retrieved.
Data analytics needs will also drive how your data is stored. Consider whether an SQL database is sufficient. If there are millions or billions of rows, you could consider using Amazon Redshift. If the data is kept in Amazon S3, then you might be able to use Amazon Athena. The correct answer depends completely upon how you intend to access and process the data.
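For example, if the raw JSON lands in S3 and is registered as an external table, a weekly aggregation can be a single Athena query; a sketch with boto3, where the database, table and column names are illustrative:

```
import boto3

athena = boto3.client("athena")

# Weekly aggregation over the raw data in S3 (names are illustrative).
QUERY = """
SELECT date_trunc('week', from_iso8601_timestamp(event_time)) AS week,
       count(*)                                               AS events,
       sum(amount)                                            AS total_amount
FROM   data_lake.raw_events
GROUP  BY 1
ORDER  BY 1
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```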
Bottom line: Consider first how you will use the data, then determine the most appropriate place to store it. There is no generic answer that we can provide.

How can I aggregate data from multiple Lambdas in AWS

I have an SNS topic which triggers 50 Lambdas in multiple accounts.
Now each Lambda produces some output in JSON format.
I want to aggregate all those individual JSON outputs into one list and then pass that into another SNS topic.
What's the best way to aggregate the data?
There are a couple of architecture solutions you can use to solve this. There is probably not a "right one"; it will depend on the volume of data, the frequency of triggers and your budget.
You will need some shared storage where your 50 Lambda functions can temporarily store their results, and another component, most probably another Lambda function, in charge of the aggregation to produce the final result.
Depending on the volume of data to handle, I would first consider a shared Amazon S3 bucket where all 50 functions can drop their piece of JSON, and the aggregation function could read and assemble all the pieces. Other services that can act as shared storage are Amazon DynamoDB and Amazon Kinesis.
The difficulty will be detecting when all the pieces are available so the final aggregation can start. If 50 is a fixed number, that will be easy; otherwise you will need to think about a mechanism to tell the aggregation function it can start to work...
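As a rough sketch of the S3 approach, the aggregation function could look like the following, assuming each contributor writes its piece under a shared prefix and that the bucket, prefix and topic ARN are made up:

```
import json
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

EXPECTED_PIECES = 50  # assumes the number of contributors is fixed

def handler(event, context):
    # List all pieces dropped by the contributing functions.
    listing = s3.list_objects_v2(Bucket="aggregation-bucket", Prefix="run-123/")
    keys = [obj["Key"] for obj in listing.get("Contents", [])]

    if len(keys) < EXPECTED_PIECES:
        return {"status": "waiting", "received": len(keys)}

    # Read and assemble every piece into a single list.
    results = []
    for key in keys:
        body = s3.get_object(Bucket="aggregation-bucket", Key=key)["Body"].read()
        results.append(json.loads(body))

    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:aggregated-results",
        Message=json.dumps(results),
    )
    return {"status": "published", "pieces": len(results)}
```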
The scenario you describe does not really match the architectural pattern you are choosing. If you know upfront that you'll have to deal with state (the aggregator keeps track of state), SNS and SQS are not the right solution, and neither is Lambda.
What is not mentioned in the other posts is that you'll have to manage the possibility that one of your 50 processes fails, and you'll have to take that into account too. Handling all of these cases shouldn't be your focus, since there are tools that do it for you.
I recommend you to take a look at AWS Kinesis: https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html
Also, AWS Step Functions provides a solution:
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-parallel-state.html
I would suggest looking at DynamoDB for aggregating the information, if the data being stored lends itself to that.
The various components can drop their data in asynchronously, then the aggregator can perform a single query to pull in the whole result set.
Although it's described as a database, it can be viewed as a simple object store or lookup engine, so you do not really have to think about data keys, only a way to distinguish each contribution from the others.
So you might store under "lambda-id + timestamp", which ensures each record is distinct, and then you can simply retrieve all the records. Don't forget to have a way to retire records so the system does not fill up!
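A rough sketch of that pattern, where each contributor writes one item and the aggregator pulls them back with a single query; the table name and key layout are assumptions, and a TTL attribute stands in for the record retirement mentioned above:

```
import json
import time

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("lambda-results")  # illustrative table name

def record_result(run_id: str, lambda_id: str, output: dict) -> None:
    # Each contributor drops its piece under run_id + "lambda_id#timestamp",
    # which keeps every contribution distinct.
    table.put_item(
        Item={
            "run_id": run_id,                                   # partition key
            "contribution": f"{lambda_id}#{int(time.time())}",  # sort key
            "payload": json.dumps(output),
            "expires_at": int(time.time()) + 24 * 3600,         # TTL, to retire records
        }
    )

def aggregate(run_id: str) -> list:
    # The aggregator pulls the whole result set for a run with a single query.
    response = table.query(KeyConditionExpression=Key("run_id").eq(run_id))
    return [json.loads(item["payload"]) for item in response["Items"]]
```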

How do you handle Amazon Kinesis Record duplicates?

According to the Amazon Kinesis Streams documentation, a record can be delivered multiple times.
The only way to be sure to process every record just once is to temporarily store records in a database that supports integrity checks (e.g. DynamoDB, ElastiCache or MySQL/PostgreSQL), or to checkpoint the RecordId for each Kinesis shard.
Do you know a better / more efficient way of handling duplicates?
We had exactly that problem when building a telemetry system for a mobile app. In our case we were also unsure that producers were sending each message exactly once, so for each received record we calculated its MD5 on the fly and checked whether it was already present in some form of persistent storage; but indeed, what storage to use is the trickiest bit.
Firstly, we tried a trivial relational database, but it quickly became a major bottleneck of the whole system, as this is not just a read-heavy but also a write-heavy case, since the volume of data going through Kinesis was quite significant.
We ended up with a DynamoDB table storing the MD5 of each unique message. The issue we had was that it wasn't so easy to delete the messages: even though our table had partition and sort keys, DynamoDB does not allow you to drop all records with a given partition key; we had to query all of them just to get the sort key values (which wastes time and capacity). Unfortunately, we had to simply drop the whole table once in a while. Another, similarly suboptimal, solution is to regularly rotate the DynamoDB tables which store the message identifiers.
However, DynamoDB recently introduced a very handy feature, Time To Live, which means that we can now control the size of the table by enabling auto-expiry on a per-record basis. In that sense DynamoDB is quite similar to ElastiCache; however, ElastiCache (at least a Memcached cluster) is much less durable: there is no redundancy, and all data residing on terminated nodes is lost in case of a scale-in operation or failure.
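As a rough sketch of that dedup check, assuming a DynamoDB table keyed on the digest with TTL enabled on the expires_at attribute (names are illustrative):

```
import hashlib
import time

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("processed-records")  # illustrative name

def seen_before(payload: bytes) -> bool:
    """Return True if this record's digest was already stored (i.e. a duplicate)."""
    digest = hashlib.md5(payload).hexdigest()
    try:
        table.put_item(
            Item={
                "digest": digest,                        # partition key
                "expires_at": int(time.time()) + 86400,  # TTL attribute
            },
            # Fails if an item with this digest already exists.
            ConditionExpression="attribute_not_exists(digest)",
        )
        return False
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True
        raise
```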
The thing you mention is a general problem of all queue systems that take an "at least once" approach. And it's not just the queue systems: both producers and consumers may process the same message multiple times (due to ReadTimeout errors, etc.). Kinesis and Kafka both use that paradigm. Unfortunately, there is no easy answer for this.
You may also try an "exactly-once" message queue with a stricter transactional approach; for example, AWS SQS offers FIFO queues: https://aws.amazon.com/about-aws/whats-new/2016/11/amazon-sqs-introduces-fifo-queues-with-exactly-once-processing-and-lower-prices-for-standard-queues/ . Be aware that SQS throughput is far smaller than that of Kinesis.
To solve your problem, you should be aware of your application domain and try to solve it internally, as you suggested (database checks). Especially when you communicate with an external service (say, an email server), you should be able to recover the operation state in order to prevent double processing (because double sending, in the email server example, may result in multiple copies of the same message in the recipient's mailbox).
See also the following concepts;
At-least-once Delivery: http://www.cloudcomputingpatterns.org/at_least_once_delivery/
Exactly-once Delivery: http://www.cloudcomputingpatterns.org/exactly_once_delivery/
Idempotent Processor: http://www.cloudcomputingpatterns.org/idempotent_processor/

Orphan management with AWS DynamoDB & S3 data spread across multiple items & buckets?

DynamoDB items are currently limited to a maximum size of 400 KB. When storing items larger than this limit, Amazon suggests a few options, including splitting long items into multiple items, splitting items across tables, and/or storing large data in S3.
Sounds OK if nothing ever failed. But what's a recommended approach to deal with making updates and deletes consistent across multiple DynamoDB items plus, just to make things interesting, S3 buckets too?
For a concrete example, imagine an email app with:
EmailHeader table in DynamoDB
EmailBodyChunk table in DynamoDB
EmailAttachment table in DynamoDB that points to email attachments stored in S3 buckets
Let's say I want to delete an email. What's a good approach to make sure that orphan data will get cleaned up if something goes wrong during the delete operation and data is only partially deleted? (Ideally, it'd be a solution that won't add additional operational complexity like having to temporarily increase the provisioned read limit to run a garbage-collector script.)
There are a couple of alternatives for your use case:
Use the DynamoDB transactions library, which:
enables Java developers to easily perform atomic writes and isolated reads across multiple items and tables when building high scale applications on Amazon DynamoDB.
It is important to note that it requires 7N+4 writes, which will be costly. So go this route only if you require strong ACID properties, such as for banking or other monetary applications.
If you are okay with the DB being inconsistent for a short duration, you can perform the required operations one by one and mark the entire thing as complete only at the end.
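Note that, beyond the Java client library, DynamoDB now also offers native transactions (TransactWriteItems), which cover the same all-or-nothing write case from any SDK; a sketch with boto3, using table and key names made up for the email example:

```
import boto3

dynamodb = boto3.client("dynamodb")

# Delete the header and a body chunk atomically: either both items go, or neither does.
# Table and key names follow the email example and are illustrative.
dynamodb.transact_write_items(
    TransactItems=[
        {
            "Delete": {
                "TableName": "EmailHeader",
                "Key": {"EmailId": {"S": "email-123"}},
            }
        },
        {
            "Delete": {
                "TableName": "EmailBodyChunk",
                "Key": {
                    "EmailId": {"S": "email-123"},
                    "ChunkId": {"N": "0"},
                },
            }
        },
    ]
)
```

S3 objects cannot take part in the transaction, so the attachment cleanup still needs its own idempotent step, as in the workflow suggestion below.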
You could manage your deletion events with an SQS queue that supports exactly-once semantics and use that queue to start a Step Functions workflow that deletes the corresponding header, body chunks and attachments. In retrospect, the queue does not even need to be exactly-once, as you can simply stop the workflow if the header does not exist.
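A rough sketch of such an idempotent cleanup step (whether it runs as a Step Functions task or a queue consumer); the table, bucket and attribute names echo the email example and are assumptions:

```
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

def delete_email(email_id: str) -> None:
    # Idempotent: safe to retry from the top, because deleting an
    # already-deleted item or object simply succeeds again.

    # 1. Attachments: look up the S3 pointers, delete the objects, then the items.
    attachments = dynamodb.Table("EmailAttachment").query(
        KeyConditionExpression=Key("EmailId").eq(email_id)
    )["Items"]
    for att in attachments:
        s3.delete_object(Bucket=att["Bucket"], Key=att["S3Key"])  # assumed attributes
        dynamodb.Table("EmailAttachment").delete_item(
            Key={"EmailId": email_id, "AttachmentId": att["AttachmentId"]}
        )

    # 2. Body chunks.
    chunks = dynamodb.Table("EmailBodyChunk").query(
        KeyConditionExpression=Key("EmailId").eq(email_id)
    )["Items"]
    for chunk in chunks:
        dynamodb.Table("EmailBodyChunk").delete_item(
            Key={"EmailId": email_id, "ChunkId": chunk["ChunkId"]}
        )

    # 3. Header last, so a crash mid-way leaves it behind as a marker
    #    that the cleanup still has to be retried.
    dynamodb.Table("EmailHeader").delete_item(Key={"EmailId": email_id})
```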