Orphan management with AWS DynamoDB & S3 data spread across multiple items & buckets?


DynamoDB items are currently limited to a 400KB maximum size. When storing items larger than this limit, Amazon suggests a few options, including splitting large items into multiple items, splitting items across tables, and/or storing large data in S3.
That sounds fine as long as nothing ever fails. But what's a recommended approach for keeping updates and deletes consistent across multiple DynamoDB items plus, just to make things interesting, S3 buckets too?
For a concrete example, imagine an email app with:
EmailHeader table in DynamoDB
EmailBodyChunk table in DynamoDB
EmailAttachment table in DynamoDB that points to email attachments stored in S3 buckets
Let's say I want to delete an email. What's a good approach to make sure that orphan data will get cleaned up if something goes wrong during the delete operation and data is only partially deleted? (Ideally, it'd be a solution that won't add additional operational complexity like having to temporarily increase the provisioned read limit to run a garbage-collector script.)

There are a couple of alternatives for your use case:
Use the DynamoDB transactions library, which:
"enables Java developers to easily perform atomic writes and isolated reads across multiple items and tables when building high scale applications on Amazon DynamoDB."
Note that it requires 7N+4 writes, which will be costly. So go this route only if you require strong ACID properties, such as for banking or other monetary applications.
If you are okay with the DB being inconsistent for a short duration, you can perform the required operations one by one and mark the entire operation complete only at the end.
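The second alternative could look roughly like the sketch below. It is plain in-memory Python, not AWS code: the dictionaries stand in for the three DynamoDB tables and the S3 bucket from the question, and `delete_email`, the `status` field and the `fail_after` test hook are hypothetical names introduced only for illustration. The idea is to flag the header first, delete the children, and delete the header last, so a half-deleted email always still has a header that a retry can find.

```python
# In-memory stand-ins for the DynamoDB tables and the S3 bucket (assumptions).
headers = {"e1": {"status": "ACTIVE"}}
body_chunks = {("e1", 0): "Hello ", ("e1", 1): "world"}
attachments = {("e1", "a1"): {"s3_key": "e1/a1.pdf"}}
s3_objects = {"e1/a1.pdf": b"%PDF"}


def delete_email(email_id, fail_after=None):
    """Delete children first and the header last; safe to call again after a failure.

    fail_after is a hypothetical hook that simulates a crash after that many
    child deletions. Returns False if the email is already gone.
    """
    if email_id not in headers:
        return False  # nothing left to clean up, so a retry is a no-op
    headers[email_id]["status"] = "DELETING"  # readers can hide emails in this state

    done = 0
    for key in [k for k in attachments if k[0] == email_id]:
        if fail_after is not None and done >= fail_after:
            raise RuntimeError("simulated crash")
        s3_objects.pop(attachments[key]["s3_key"], None)  # S3 object before its pointer
        del attachments[key]
        done += 1
    for key in [k for k in body_chunks if k[0] == email_id]:
        if fail_after is not None and done >= fail_after:
            raise RuntimeError("simulated crash")
        del body_chunks[key]
        done += 1

    del headers[email_id]  # the header goes last, marking the delete complete
    return True
```

If a call fails midway, the header stays in the `DELETING` state and calling `delete_email` again finishes the job, so any cleanup only has to look for headers in that state instead of scanning for orphans.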

You could manage your deletion events with an SQS FIFO queue, which supports exactly-once processing, and use that queue to start a Step Functions workflow that deletes the corresponding header, body chunks and attachments. In retrospect, the queue does not even need to be exactly-once, as the workflow can simply stop if the header no longer exists.

Related

Database suggestion for large unstructured datasets to integrate with elasticsearch

Our scenario involves millions of records saved in a database. Currently I use DynamoDB for saving metadata (and also for write, update and delete operations on objects), S3 for storing files (e.g. images whose associated metadata is stored in DynamoDB), and Elasticsearch for indexing and searching. But because of DynamoDB's 400KB limit per row (a single object), it is not sufficient for the data we need to save. I thought about saving an object as several versions in DynamoDB itself, but that would be too complicated.
So I was thinking of replacing DynamoDB with some better storage:
AWS DocumentDB
S3 for saving metadata also, along with object files
Which of the two is the better option in your opinion, and why? It should also be cost effective. (It should also be easy to sync with Elasticsearch, but that is not much of an issue, since it is possible with both.)
If you have any better suggestions than these two, please share those as well.

I would suggest looking at DocumentDB over Amazon S3 based on your use case, with the following points in mind:
S3 storage is priced at $0.023 per GB per month for Standard and $0.0125 for Infrequent Access (whereas DocumentDB is $0.10 per GB-month); depending on your data size this could add up greatly. If you use IA, be aware that your retrieval costs could add up greatly as well.
With S3 you would not fetch the data directly; you would use either Athena or S3 Select to filter it. Depending on the amount of data being queried, that would take from a few seconds to possibly minutes (not the milliseconds you requested).
For unstructured data, S3 and the querying technologies around it are targeted more at a data lake used for analysis, whereas DocumentDB is geared toward performance within live applications (it is a MongoDB-compatible data store, after all).

How can I aggregate data from multiple Lambdas in AWS?

I have an SNS topic which triggers 50 Lambdas in multiple accounts.
Each Lambda produces some output in JSON format.
I want to aggregate all those individual JSON outputs into one list and then pass that to another SNS topic.
What is the best way to aggregate the data?

There are a couple of architectural solutions you can use to solve this. There is probably no single "right one"; it will depend on the volume of data, the frequency of triggers and your budget.
You will need some shared storage where your 50 Lambda functions can temporarily store their results, and another component, most probably another Lambda function, in charge of the aggregation to produce the final result.
Depending on the volume of data to handle, I would first consider a shared Amazon S3 bucket where all your 50 functions can drop their piece of JSON, and the aggregation function could read and assemble all the pieces. Other services that can act as a shared storage are Amazon DynamoDB and Amazon Kinesis.
The difficulty will be to detect when all the pieces are available to start the final aggregation. If 50 is a fixed number, that will be easy, otherwise you will need to think about a mechanism to tell the aggregation function it can start to work...
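For the fixed-count case, one possible mechanism is sketched below in plain Python, with a dictionary standing in for the shared storage (S3, DynamoDB or similar). The `Aggregator` class and the `publish` callback are hypothetical names, and in a real deployment the "have all pieces arrived?" check would have to be done atomically in the shared store rather than in process memory.

```python
import json


class Aggregator:
    """Collects one JSON result per worker and publishes once all have arrived."""

    def __init__(self, expected, publish):
        self.expected = expected   # number of workers, assumed fixed (e.g. 50)
        self.publish = publish     # callback standing in for the second SNS topic
        self.results = {}          # worker_id -> parsed JSON
        self.published = False

    def submit(self, worker_id, payload_json):
        # Keying by worker id means a retried worker overwrites instead of duplicating.
        self.results[worker_id] = json.loads(payload_json)
        if not self.published and len(self.results) == self.expected:
            self.published = True
            ordered = [self.results[k] for k in sorted(self.results)]
            self.publish(json.dumps(ordered))
            return True
        return False


# Usage with three workers and a list standing in for the output topic.
out = []
agg = Aggregator(expected=3, publish=out.append)
agg.submit("w1", '{"n": 1}')
agg.submit("w2", '{"n": 2}')
agg.submit("w1", '{"n": 1}')  # duplicate delivery, does not count twice
agg.submit("w3", '{"n": 3}')  # last piece arrives, aggregate is published
```

This does not cover a worker that never reports; that case needs a timeout or a workflow tool on top.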

The scenario you describe does not really match the architectural pattern you are choosing. If you know upfront that you'll have to deal with state (aggregating means keeping track of state), SNS & SQS is not the right solution, and neither is Lambda.
What is not mentioned in the other posts is that one of your 50 processes could fail, and you'll have to take that into account too. Handling all of these cases shouldn't be your focus, since there are tools that do it for you.
I recommend taking a look at AWS Kinesis: https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html
Also, AWS Step Functions provides a solution:
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-parallel-state.html

I would suggest looking at DynamoDB for aggregating the information, if the data being stored lends itself to that.
The various components can drop their data in asynchronously, then the aggregator can perform a single query to pull in the whole result set.
Although it's described as a database, it can be viewed as a simple object store or lookup engine, so you do not really have to think about data keys, only a way to distinguish each contribution from the others.
So you might store each record under "lambda-id + timestamp", which ensures each record is distinct, and then you can just retrieve all records. Don't forget to have a way to retire records, so the system does not fill up!
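As a rough illustration of the "lambda-id + timestamp" key and of retiring old records, here is a small in-memory sketch. The helper names `make_record` and `retire_expired`, the `#` separator and the field names are assumptions made for the example, not a DynamoDB API; a dictionary plays the role of the table.

```python
import time


def make_record(lambda_id, payload, now=None, retain_seconds=3600):
    """Build a record whose key combines the producer id and a millisecond timestamp."""
    now = time.time() if now is None else now
    return {
        "pk": f"{lambda_id}#{int(now * 1000)}",  # distinct per contribution
        "payload": payload,
        "expires_at": int(now) + retain_seconds,  # when the record may be retired
    }


def retire_expired(table, now=None):
    """Remove records past their expiry time and return how many were removed."""
    now = time.time() if now is None else now
    expired = [key for key, rec in table.items() if rec["expires_at"] <= now]
    for key in expired:
        del table[key]
    return len(expired)
```

With a real DynamoDB table, the retiring step can be left to the Time To Live feature by using the expiry field as the TTL attribute.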

AWS hosted data storage for storing simple entities

I need to choose data storage for a simple system. The main purpose of the system is storing events: simple entities with a timestamp, user id and type. No joins, just a single table.
Stored data will be fetched rarely (compared with writes). I expect the following read operations:
get latest events for a list of users
get latest events of a type for a list of users
I expect about 0.5-1 million writes a day. Data older than 2 years can be removed.
I'm looking for the best-fitting service provided by AWS. I wonder if using Redshift is like taking a sledgehammer to crack a nut?

For your requirement you can use AWS DynamoDB and also define the TTL values to remove the older items automatically. You get the following advantages.
Fully managed data storage
Able to scale with the need for write throughput (Though it can be costly)
Use a sort key containing the timestamp to query the latest items.
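To make the item layout concrete, here is a minimal sketch, assuming `user_id` as the partition key, an epoch-millisecond `ts` as the sort key and `expires_at` as the TTL attribute (DynamoDB TTL expects a number holding Unix epoch seconds). All attribute and function names are hypothetical, and the query is simulated over a plain list rather than run against DynamoDB.

```python
from datetime import datetime, timedelta, timezone

TWO_YEARS = timedelta(days=730)  # retention period from the question


def make_event_item(user_id, event_type, occurred_at):
    """Build one event item; occurred_at must be a timezone-aware datetime."""
    return {
        "user_id": user_id,                          # partition key (assumed)
        "ts": int(occurred_at.timestamp() * 1000),   # sort key, epoch milliseconds
        "type": event_type,
        "expires_at": int((occurred_at + TWO_YEARS).timestamp()),  # TTL, epoch seconds
    }


def latest_events(items, user_ids, limit=1, event_type=None):
    """Simulate one descending sort-key query per user, optionally filtered by type."""
    result = {}
    for uid in user_ids:
        rows = [
            item for item in items
            if item["user_id"] == uid
            and (event_type is None or item["type"] == event_type)
        ]
        rows.sort(key=lambda item: item["ts"], reverse=True)
        result[uid] = rows[:limit]
    return result
```

Both read patterns from the question map to one query per user in the list; filtering by type after the fact is the simplest option, and an index keyed on type would be an alternative if that read becomes frequent.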

I would also suggest checking out Amazon SimpleDB, as at first glance it looks like a better fit for your requirements.
Please refer to this article, which describes some practical user experience:
http://www.masonzhang.com/2013/06/2-reasons-why-we-select-simpledb.html

How do you handle Amazon Kinesis Record duplicates?

According to the Amazon Kinesis Streams documentation, a record can be delivered multiple times.
The only way to be sure to process every record just once is to temporarily store them in a database that supports integrity checks (e.g. DynamoDB, ElastiCache or MySQL/PostgreSQL) or just checkpoint the RecordId for each Kinesis shard.
Do you know a better / more efficient way of handling duplicates?

We had exactly that problem when building a telemetry system for a mobile app. In our case we were also unsure that producers were sending each message exactly once, so for each received record we calculated its MD5 on the fly and checked whether it was already present in some form of persistent storage. Which storage to use is indeed the trickiest bit.
First, we tried a plain relational database, but it quickly became a major bottleneck of the whole system, as this is not just a read-heavy but also a write-heavy case, and the volume of data going through Kinesis was quite significant.
We ended up with a DynamoDB table storing the MD5 of each unique message. The issue we had was that it wasn't easy to delete the messages: even though our table had partition and sort keys, DynamoDB does not allow dropping all records with a given partition key, so we had to query all of them to get the sort key values (which wastes time and capacity). Unfortunately, we had to simply drop the whole table once in a while. Another suboptimal solution is to regularly rotate the DynamoDB tables which store message identifiers.
However, DynamoDB recently introduced a very handy feature, Time To Live, which means that we can now control the size of a table by enabling auto-expiry on a per-record basis. In that sense DynamoDB seems quite similar to ElastiCache; however, ElastiCache (at least a Memcached cluster) is much less durable: there is no redundancy, and all data residing on terminated nodes is lost in case of a scale-in operation or failure.
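The MD5-plus-expiry idea can be sketched in a few lines. This is an in-memory stand-in, not our production code: the `Deduplicator` class and its `is_new` method are hypothetical names, and the dictionary replaces the DynamoDB table. Against a real table, the check-and-insert would normally be a single conditional write, so that two consumers cannot both see the same record as new.

```python
import hashlib
import time


class Deduplicator:
    """Remembers the MD5 of each payload for ttl_seconds and rejects repeats."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.seen = {}  # md5 hex digest -> expiry time (epoch seconds)

    def is_new(self, payload, now=None):
        now = time.time() if now is None else now
        # Drop expired digests; this plays the role of the table's auto-expiry.
        self.seen = {d: exp for d, exp in self.seen.items() if exp > now}
        digest = hashlib.md5(payload).hexdigest()
        if digest in self.seen:
            return False  # duplicate within the retention window
        self.seen[digest] = now + self.ttl
        return True
```

The retention window only needs to be as long as the period in which a duplicate can realistically arrive, which keeps the table small.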

What you describe is a general problem of all queue systems with an "at least once" approach. And it is not just the queue systems: producers and consumers may both process the same message multiple times (due to ReadTimeout errors etc.). Kinesis and Kafka both use that paradigm. Unfortunately, there is no easy answer to that.
You may also try to use an "exactly-once" message queue with a stricter transactional approach. For example, Amazon SQS FIFO queues do that: https://aws.amazon.com/about-aws/whats-new/2016/11/amazon-sqs-introduces-fifo-queues-with-exactly-once-processing-and-lower-prices-for-standard-queues/ . Be aware that SQS throughput is far lower than that of Kinesis.
To solve your problem, you should be aware of your application domain and try to solve it internally, as you suggested (database checks). Especially when you communicate with an external service (say, an email server), you should be able to recover the operation state in order to prevent double processing (because double sending, in the email server example, may result in multiple copies of the same message in the recipient's mailbox).
See also the following concepts:
At-least-once Delivery: http://www.cloudcomputingpatterns.org/at_least_once_delivery/
Exactly-once Delivery: http://www.cloudcomputingpatterns.org/exactly_once_delivery/
Idempotent Processor: http://www.cloudcomputingpatterns.org/idempotent_processor/

ETL Possible Between S3 and Redshift with Kinesis Firehose?

My team is attempting to use Redshift to consolidate information from several different databases. In our first attempt to implement this solution, we used Kinesis Firehose to write records of POSTs to our APIs to S3 then issued a COPY command to write the data being inserted to the correct tables in Redshift. However, this only allowed us to insert new data and did not let us transform data, update rows when altered, or delete rows.
What is the best way to maintain an updated data warehouse in Redshift without using batch transformation? Ideally, we would like updates to occur "automatically" (< 5min) whenever data is altered in our local databases.

Neither Firehose nor Redshift has triggers; however, you could potentially use Lambda with Firehose to pre-process the data before it gets inserted, as described here: https://blogs.aws.amazon.com/bigdata/post/Tx2MUQB5PRWU36K/Persist-Streaming-Data-to-Amazon-S3-using-Amazon-Kinesis-Firehose-and-AWS-Lambda
In your case, you could extend it to use Lambda on S3 as Firehose is creating new files, which would then execute COPY/SQL update.
Another alternative is just writing your own KCL client that would implement what Firehose does, and then executing the required updates after COPY of micro-batches (500-1000 rows).
I've done such an implementation (we needed to update old records based on new records) and it works all right from a consistency point of view, though I'd advise against such an architecture in general due to poor Redshift performance for updates. Based on my experience, the key rule is to treat Redshift data as append-only: it is often faster to use filters to exclude unnecessary rows (with optional regular pruning, e.g. daily) than to delete/update those rows in real time.
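To show what "append-only plus filters" can look like, here is a small sketch. It uses Python's built-in sqlite3 purely as a stand-in so the example runs anywhere; it is not Redshift, and the table, view and column names (`orders_log`, `orders_current`, `is_deleted`, `loaded_at`) are made up for illustration. Every change, including a delete, is appended as a new row, and a view filters down to the latest live row per key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders_log (
    order_id   INTEGER,
    status     TEXT,
    is_deleted INTEGER DEFAULT 0,
    loaded_at  INTEGER            -- sequence or timestamp of the micro-batch
);

-- Each change is appended as a new row; nothing is updated in place.
INSERT INTO orders_log VALUES (1, 'new',     0, 100);
INSERT INTO orders_log VALUES (2, 'new',     0, 100);
INSERT INTO orders_log VALUES (1, 'shipped', 0, 200);  -- update of order 1
INSERT INTO orders_log VALUES (2, 'new',     1, 300);  -- tombstone: order 2 deleted

-- The filter: keep only the most recent row per order, and hide tombstones.
CREATE VIEW orders_current AS
SELECT l.order_id, l.status
FROM orders_log AS l
JOIN (
    SELECT order_id, MAX(loaded_at) AS loaded_at
    FROM orders_log
    GROUP BY order_id
) AS latest
  ON latest.order_id = l.order_id AND latest.loaded_at = l.loaded_at
WHERE l.is_deleted = 0;
""")

rows = conn.execute(
    "SELECT order_id, status FROM orders_current ORDER BY order_id"
).fetchall()
```

Here `rows` contains only order 1 in its latest state; the superseded and tombstoned rows stay in the log until a periodic pruning job removes them.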
Yet another alternative is to have Firehose dump data into staging table(s), and then have scheduled jobs take whatever is in those tables, process it, move the data, and rotate the tables.
As a general reference architecture for real-time inserts into Redshift, take a look at this: https://blogs.aws.amazon.com/bigdata/post/Tx2ANLN1PGELDJU/Best-Practices-for-Micro-Batch-Loading-on-Amazon-Redshift
This has been implemented multiple times, and works well.
