How do you handle Amazon Kinesis Record duplicates?

According to the Amazon Kinesis Streams documentation, a record can be delivered multiple times.
The only way to be sure to process every record just once is to temporarily store them in a database that supports integrity checks (e.g. DynamoDB, ElastiCache or MySQL/PostgreSQL), or to checkpoint the RecordId for each Kinesis shard.
Do you know a better / more efficient way of handling duplicates?

We had exactly that problem when building a telemetry system for a mobile app. In our case we were also unsure whether producers were sending each message exactly once, so for each received record we calculated its MD5 on the fly and checked whether it was already present in some form of persistent storage. Which storage to use is indeed the trickiest bit.
Firstly, we tried a plain relational database, but it quickly became a major bottleneck of the whole system, as this is not just a read-heavy but also a write-heavy case: the volume of data going through Kinesis was quite significant.
We ended up with a DynamoDB table storing an MD5 hash for each unique message. The issue we had was that deleting the messages wasn't easy: even though our table had partition and sort keys, DynamoDB does not let you drop all items with a given partition key in one operation, so we had to query all of them first just to obtain the sort key values (which wastes time and capacity). In the end we simply dropped the whole table once in a while. Another, similarly suboptimal, solution is to regularly rotate the DynamoDB tables that store the message identifiers.
However, DynamoDB has since introduced a very handy feature, Time To Live, which means we can now control the size of the table by enabling auto-expiry on a per-item basis. In that sense DynamoDB becomes quite similar to ElastiCache; however, ElastiCache (at least a Memcached cluster) is much less durable: there is no redundancy, and all data residing on terminated nodes is lost during a scale-in operation or a failure.
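As an illustration, here is a minimal sketch of that dedup check with boto3, assuming a hypothetical message_hashes table with partition key md5 and TTL enabled on an expires_at attribute:

    import hashlib
    import time

    import boto3
    from botocore.exceptions import ClientError

    dynamodb = boto3.resource("dynamodb")
    # Hypothetical table: partition key "md5" (string), TTL enabled on "expires_at"
    table = dynamodb.Table("message_hashes")

    def is_duplicate(record_data: bytes, ttl_seconds: int = 86400) -> bool:
        """Return True if this payload was already seen; otherwise remember it."""
        digest = hashlib.md5(record_data).hexdigest()
        try:
            table.put_item(
                Item={"md5": digest, "expires_at": int(time.time()) + ttl_seconds},
                ConditionExpression="attribute_not_exists(md5)",
            )
            return False  # first time we see this payload
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return True  # hash already stored, so this delivery is a duplicate
            raise

The conditional write makes the check-and-store a single atomic operation, so two consumers racing on the same payload cannot both conclude it is new.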

What you mention is a general problem of all queue systems with an "at least once" approach. And it is not just the queue systems: both producers and consumers may process the same message multiple times (due to ReadTimeout errors and the like). Kinesis and Kafka both use that paradigm, and unfortunately there is no easy answer.
You may also try an "exactly-once" message queue with a stricter transactional approach. For example, AWS SQS does that: https://aws.amazon.com/about-aws/whats-new/2016/11/amazon-sqs-introduces-fifo-queues-with-exactly-once-processing-and-lower-prices-for-standard-queues/ . Be aware that SQS throughput is far lower than Kinesis.
To solve your problem, you should be aware of your application domain and try to handle it internally, as you suggested (database checks). This matters especially when you communicate with an external service (say, an email server): you should be able to recover the operation state in order to prevent double processing (double sending, in the email server example, may result in multiple copies of the same message in the recipient's mailbox).
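A rough sketch of that idea for the email example, assuming a hypothetical processed_operations table keyed by an operation_id that the producer assigns to each logical send:

    import boto3
    from botocore.exceptions import ClientError

    dynamodb = boto3.resource("dynamodb")
    # Hypothetical table with partition key "operation_id"
    state_table = dynamodb.Table("processed_operations")

    def send_email_once(operation_id, message, send_email):
        """Claim the operation before touching the external service, so a
        redelivered record is detected and skipped instead of sent twice."""
        try:
            state_table.put_item(
                Item={"operation_id": operation_id, "status": "in_progress"},
                ConditionExpression="attribute_not_exists(operation_id)",
            )
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return  # already claimed by an earlier delivery of the same record
            raise
        send_email(message)  # the external side effect (e.g. SES or SMTP)
        state_table.update_item(
            Key={"operation_id": operation_id},
            UpdateExpression="SET #s = :done",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":done": "done"},
        )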
See also the following concepts:
At-least-once Delivery: http://www.cloudcomputingpatterns.org/at_least_once_delivery/
Exactly-once Delivery: http://www.cloudcomputingpatterns.org/exactly_once_delivery/
Idempotent Processor: http://www.cloudcomputingpatterns.org/idempotent_processor/

Related

Amazon DynamoDB read latency while writing

I have an Amazon DynamoDB table which is used for both read and write operations. Write operations are performed only when a batch job runs at certain intervals, whereas read operations happen consistently throughout the day.
I am facing a problem of increased read latency when a significant amount of write operations is happening due to the batch jobs. I looked into having a separate read replica for DynamoDB but found nothing much of use. Global tables are not an option, because that's not what they are for.
Any ideas how to solve this?
Going by the Dynamo paper, the concept of a read replica for a record or a table does not exist in Dynamo. Within the same region you will have N copies of a record, where N is the replication factor; reads and writes use quorums R and W chosen such that R + W > N (for example N = 3 with R = 2 and W = 2). When the client reads, one of those copies is returned, depending on the cluster health.
Depending on how the coordinator node is chosen, either in the client library or in the cluster, the client can only ask for a record (get) or send a record (put) to either the cluster coordinator (one extra hop) or to the node assigned to the record (a single hop). There is simply no way for the client to say 'give me a read replica from another node'. The replicas are there for fault tolerance: if the node holding the master copy of the record dies, a replica will be used.
I am researching the same problem in the context of hot keys. Every record gets assigned to a node in Dynamo, so a million reads on the same record lead to hot keys, dropped reads/writes, etc. How do you deal with this? A read replica would work great, because I could then manage the hot keys at the application level and move all extra reads to the read replica(s). But this again is fraught with issues.

How can I aggregate data from multiple Lambdas in AWS

I have an SNS topic which triggers 50 Lambdas across multiple accounts.
Each Lambda produces some output in JSON format.
I want to aggregate all those individual JSON outputs into one list and then pass that into another SNS topic.
What's the best way to achieve that aggregation?
There are a couple of architecture solutions you can use to solve this. There is probably not a single "right" one; it will depend on the volume of data, the frequency of triggers and your budget.
You will need some shared storage where your 50 Lambda functions can temporarily store their results, and another component, most probably another Lambda function, in charge of the aggregation to produce the final result.
Depending on the volume of data to handle, I would first consider a shared Amazon S3 bucket where all 50 functions can drop their piece of JSON, and the aggregation function could read and assemble all the pieces. Other services that can act as shared storage are Amazon DynamoDB and Amazon Kinesis.
The difficulty will be to detect when all the pieces are available so the final aggregation can start. If 50 is a fixed number, that is easy; otherwise you will need to think about a mechanism to tell the aggregation function it can start to work.
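For the S3 variant, a sketch of the two sides (bucket name, prefix and the fixed count of 50 are assumptions):

    import json

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "lambda-aggregation-results"  # hypothetical shared bucket
    PREFIX = "run-0001/"                   # hypothetical prefix identifying one aggregation run

    def write_piece(worker_id, payload):
        """Called by each of the 50 worker functions to drop its piece of JSON."""
        s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}{worker_id}.json",
                      Body=json.dumps(payload).encode("utf-8"))

    def aggregate(expected_pieces=50):
        """Called by the aggregation function; returns None until all pieces exist."""
        keys = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if len(keys) < expected_pieces:
            return None  # not everything has arrived yet
        return [json.loads(s3.get_object(Bucket=BUCKET, Key=k)["Body"].read())
                for k in keys]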
The scenario you describe does not really match the architectural pattern you are choosing. If you know upfront that you'll have to deal with state (the aggregation step keeps track of state), then SNS and SQS are not the right solution, and neither is Lambda on its own.
What is not mentioned in the other posts is that you'll have to handle the possibility that one of your 50 processes fails; you'll have to take that into account too. Handling all of these cases shouldn't be your focus, since there are tools that do it for you.
I recommend you take a look at AWS Kinesis: https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html
Also, AWS Step Functions provides a solution:
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-parallel-state.html
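If you go the Step Functions route, the Parallel state hands the next state an array containing the output of every branch, so the final aggregation can be a small Lambda along these lines (the topic ARN is a placeholder):

    import json

    import boto3

    sns = boto3.client("sns")
    RESULT_TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:aggregated-results"  # placeholder

    def handler(event, context):
        """Receives the list of branch outputs produced by a Parallel state and
        publishes the combined list to the downstream SNS topic."""
        combined = list(event)  # one entry per branch
        sns.publish(TopicArn=RESULT_TOPIC_ARN, Message=json.dumps(combined))
        return combined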
I would suggest looking at DynamoDB for aggregating the information, if the data being stored lends itself to that.
The various components can drop their data in asynchronously, then the aggregator can perform a single query to pull in the whole result set.
Although it's described as a database, it can be viewed as a simple object store or lookup engine, so you do not really have to think about data keys, only a way to distinguish each contribution from the others.
So you might store under "lambda-id + timestamp", which ensures each record is distinct, and then you can just retrieve all records. Don't forget to have a way to retire records, so the system does not fill up!
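A sketch of that layout with boto3, assuming a hypothetical aggregation_results table with partition key run_id and sort key contributor, so the aggregator really can fetch everything with one query:

    import json
    import time

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("aggregation_results")  # hypothetical table

    def store_contribution(run_id, lambda_id, payload):
        """Each worker writes its result under a key that is unique per contributor."""
        table.put_item(Item={
            "run_id": run_id,
            "contributor": f"{lambda_id}#{int(time.time() * 1000)}",  # lambda-id + timestamp
            "payload": json.dumps(payload),
        })

    def collect(run_id):
        """The aggregator pulls the whole result set with a single query."""
        response = table.query(KeyConditionExpression=Key("run_id").eq(run_id))
        return [json.loads(item["payload"]) for item in response["Items"]]

Records could also be retired automatically with DynamoDB's Time To Live feature instead of a manual cleanup.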

Ordering of streaming data with Kinesis stream and Firehose

I have an architecture dilemma for my current project, which is about near-realtime processing of a large amount of data. Here is a diagram of the current architecture:
Here is an explanation of my idea which led me to that picture:
When the API gateway receives a request, it is put into the stream (this is because of the "fire and forget" nature of my application); that's how I came to that conclusion. The input data is separated into shards based on a specific request attribute, which guarantees the correct order.
Then I have a Lambda which takes care of validating the input and anomaly detection, so it's an abstraction that keeps the data clean for the next layer, the data enrichment. This Lambda sends the data to a Kinesis Firehose, because Firehose can back up the "raw" data (something I definitely want to have) and also attach a transformation Lambda which will do the enrichment, so I don't have to care about saving the data to S3 myself; it comes out of the box. Everything is great up to the point where I need the received data to keep its order (the enricher is doing sessionization), and that ordering is lost in Firehose, because there is no data separation there like there is with Kinesis stream shards.
The only thing I could think of is to move the sessionization into the first Lambda, which would break my abstraction, because it would start caring about data enrichment, and the bigger drawback is that the backup would then contain enriched data, which also breaks the architecture. All of this happens because of the missing sharding concept in Firehose.
So, can someone think of a solution to that problem without losing the out-of-the-box features that AWS provides?
I think that sessionization and data enrichment are two different abstractions and will need to be split between the Lambdas.
A session is a time-bound, strictly ordered flow of events bounded by a purpose or task. You only have that information at the first Lambda stage (from the Kinesis stream categorization), so you should label flows with session context at the source, where sessions can still be bounded.
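A sketch of what that labelling could look like in the first Lambda, assuming a hypothetical Firehose delivery stream name; the session id and sequence number travel with the record, so the enricher can re-establish order even though Firehose itself gives no ordering guarantee:

    import base64
    import json

    import boto3

    firehose = boto3.client("firehose")
    DELIVERY_STREAM = "enrichment-stream"  # hypothetical Firehose delivery stream

    def handler(event, context):
        """First-stage Lambda: label each Kinesis record with session context and
        forward it to Firehose (validation and anomaly checks omitted here)."""
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            # same attribute that drove the shard assignment
            payload["session_id"] = record["kinesis"]["partitionKey"]
            # per-shard sequence number preserves the original order downstream
            payload["session_seq"] = record["kinesis"]["sequenceNumber"]
            firehose.put_record(
                DeliveryStreamName=DELIVERY_STREAM,
                Record={"Data": (json.dumps(payload) + "\n").encode("utf-8")},
            )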
If storing session information in a backup is a problem, it may be that the definition of a session is not well specified or is subject to redefinition. If sessions are subject to future recasting, the session data already calculated can be ignored, provided enough additional data has also been recorded, in enough detail, to support whatever future notion of a session may emerge.
Additional enrichment providing business context (aka externally identifiable data) should process the sessions transactionally within the previously recorded boundaries.
If sessions aren't transactional at the business level, then the definition of a session is over- or under-specified. If that is the case, you are out of the stream-processing business and into batch processing, where you will need to scale state to the number of possible simultaneous interleaved sessions and their maximum durations, querying the entire corpus of events to bracket sessions of hopefully manageable durations.

IoT and DynamoDB

We have to work on an IoT system. Basically sensors sending data to the cloud, and users being able to access the data belonging to them.
The amount of data can be pretty substantial, so we need something that covers both security and heavy load.
The shape of the data is pretty straightforward: basically a measure and its value at a specified time.
The idea was to use DynamoDB for this, having a table with :
[id of sensor-array]
[id of sensor]
[type of measure]
[value of measure]
[date of measure]
The idea was for the IoT system to put the data directly into the database (in Python).
Our questions are:
In terms of performance:
will DynamoDB be able to handle a lot of insertions on a daily basis (we may be talking about hundreds of thousands of insertions per minute)?
will querying the table by the id of the sensor array and a minimum date allow us to retrieve the data in an efficient fashion?
In terms of security, is it okay to proceed this way?
We are used to NoSQL databases like MongoDB, so we're finding it hard to apply our notions to DynamoDB, where the data seems to be arranged in a pretty simple fashion.
Thanks.
will DynamoDB be able to handle a lot of insertions on a daily basis (we may be talking about hundreds of thousands of insertions per minute)?
Yes, DynamoDB will sustain (and charge for) all the write throughput you provision for the table. As your items are small, aggregating writes into batches (BatchWriteItem) is probably more cost-efficient than individual writes.
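A sketch of batched writes with boto3's batch_writer, which buffers items and issues BatchWriteItem calls of up to 25 items (table and attribute names are assumptions; numeric values should be Decimal or string, since boto3 does not accept floats):

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("sensor_measures")  # hypothetical table name

    def write_measurements(measurements):
        """Write a batch of sensor readings; batch_writer handles chunking and retries."""
        with table.batch_writer() as batch:
            for m in measurements:
                batch.put_item(Item={
                    "sensor_array_id": m["sensor_array_id"],  # partition (hash) key
                    "measure_date": m["date"],                # sort (range) key, e.g. ISO 8601 string
                    "sensor_id": m["sensor_id"],
                    "measure_type": m["type"],
                    "measure_value": m["value"],              # Decimal or string, not float
                })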
will querying the table by the id of the sensor array and a minimum date allow us to retrieve the data in an efficient fashion?
Yes, queries by hash key (id) and range key (date) are very efficient. You may need secondary indexes for more complex queries, though.
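For example, a range query against the same hypothetical table:

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("sensor_measures")  # hypothetical, same table as above

    def measures_since(sensor_array_id, start_date):
        """Hash-key equality plus a range condition on the sort key: an efficient read."""
        response = table.query(
            KeyConditionExpression=Key("sensor_array_id").eq(sensor_array_id)
            & Key("measure_date").gte(start_date)
        )
        return response["Items"]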
In terms of security, is it okay to proceed this way?
Although data is encrypted in transit and client-side encryption is straightforward, a lot is left uncovered here. For example, AWS IoT provides TLS mutual authentication with certificates, IAM or Cognito, among other security features, and AWS IoT can store data in DynamoDB using a simple rule.

Orphan management with AWS DynamoDB & S3 data spread across multiple items & buckets?

DynamoDB items are currently limited to a 400KB maximum size. When storing items longer than this limit, Amazon suggests a few options, including splitting long items into multiple items, splitting items across tables, and/or storing large data in S3.
Sounds OK if nothing ever failed. But what's a recommended approach to deal with making updates and deletes consistent across multiple DynamoDB items plus, just to make things interesting, S3 buckets too?
For a concrete example, imagine an email app with:
EmailHeader table in DynamoDB
EmailBodyChunk table in DynamoDB
EmailAttachment table in DynamoDB that points to email attachments stored in S3 buckets
Let's say I want to delete an email. What's a good approach to make sure that orphan data will get cleaned up if something goes wrong during the delete operation and data is only partially deleted? (Ideally, it'd be a solution that won't add additional operational complexity like having to temporarily increase the provisioned read limit to run a garbage-collector script.)
There are a couple of alternatives for your use case:
Use the DynamoDB transactions library that:
enables Java developers to easily perform atomic writes and isolated reads across multiple items and tables when building high scale applications on Amazon DynamoDB.
It is important to note that it requires 7N+4 writes, which will be costly. So go this route only if you require strong ACID properties, such as for banking or other monetary applications.
If you are okay with the DB being inconsistent for a short duration, you can perform the required operations one by one and mark the entire thing complete only at the end.
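A rough sketch of that second option for the email example, with hypothetical table and key names; the header is flagged first and removed last, so a crash in the middle leaves a visible marker that a retry or sweeper can pick up:

    import boto3

    dynamodb = boto3.resource("dynamodb")
    s3 = boto3.client("s3")
    headers = dynamodb.Table("EmailHeader")
    bodies = dynamodb.Table("EmailBodyChunk")
    attachments = dynamodb.Table("EmailAttachment")

    def delete_email(email_id, chunk_ids, attachment_refs):
        """Delete dependents first, header last; the 'deleting' flag marks work in progress."""
        headers.update_item(
            Key={"email_id": email_id},
            UpdateExpression="SET #s = :deleting",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":deleting": "deleting"},
        )
        for ref in attachment_refs:
            s3.delete_object(Bucket=ref["bucket"], Key=ref["key"])
            attachments.delete_item(Key={"email_id": email_id, "attachment_id": ref["attachment_id"]})
        for chunk_id in chunk_ids:
            bodies.delete_item(Key={"email_id": email_id, "chunk_id": chunk_id})
        headers.delete_item(Key={"email_id": email_id})  # only now is the delete complete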
You could manage your deletion events with an SQS queue that supports exactly-once semantics, and use that queue to start a Step Functions workflow that deletes the corresponding header, body chunks and attachments. In retrospect, the queue does not even need to be exactly-once, since you can simply stop the workflow if the header does not exist.
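Building on the previous sketch, the SQS-triggered worker only needs a guard to make redelivery harmless (key and attribute names are again hypothetical):

    import json

    import boto3

    dynamodb = boto3.resource("dynamodb")
    headers = dynamodb.Table("EmailHeader")  # hypothetical table, partition key "email_id"

    def handler(event, context):
        """SQS-triggered deletion worker: a redelivered message is a no-op once
        the header is already gone."""
        for record in event["Records"]:
            email_id = json.loads(record["body"])["email_id"]
            item = headers.get_item(Key={"email_id": email_id}).get("Item")
            if item is None:
                continue  # header already removed by a previous delivery
            # delete_email is the function from the sketch above; chunk_ids and
            # attachment_refs are assumed to be stored on the header item
            delete_email(email_id, item.get("chunk_ids", []), item.get("attachment_refs", []))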