How to build an index of S3 objects when data exceeds object metadata limit?

How to build an index of S3 objects when data exceeds object metadata limit? - amazon-web-services

Building an index of S3 objects can be very useful to make them searchable quickly : the natural, most obvious way is to store additional data on the object meta-data and use a lambda to write in DynamoDB or RDS, as described here: https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/
However, this strategy is limited by the amount of data one can store in the object metadata, which is 2KB, as described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html. Suppose you need to build a system where every time an object is uploaded on S3 you store need to add some information not contained in the file and the object name to a database and this data exceeds 2KB:you can't store it in the object metadata.
What are viable strategies to keep the bucket and the index updated?
Implement two chained API calls where each call is idempotent: if the second fails when the first succeed, one can retry until success. What happens if you perform a PUT of an identical object on S3, and you have versioning activated? Will S3 increase the version? In this case, implementing idempotency requires a single writer to be active at each time
Use some sort of workflow engine to keep track of this two-step behaviour, such as AWS Step. What are the gotchas with this solution?

Related

AWS S3 Bucket policy to prevent Object updates

I have set of objects in an S3 Bucket, all with a common prefix. I want to prevent updating of the currently existing objects, however allow users to add new objects in the same prefix.
As I understand it, the S3:PutObject action is both used to update existing objects AND create new ones.
Is there a bucket policy that can limit updating, while allowing creating?
ex: forbid modifying already existing s3:/bucket/Input/obj1, but allow creating s3:/bucket/Input/obj2
edit, context: We're using S3 as a store for regression test data, used to test our transformations. As we're continuously adding new test data, we want to ensure that the already ingested input data doesn't change. This would resolve one of the current causes of failed tests. All our input data is stored with the same prefix, and likewise for the expected data.

No, this is not possible.
The same API call, and the same permissions, are used to upload an object regardless of whether an object already exists with the same name.
You could use Amazon S3 Versioning to retain both the old object and the new object, but that depends on how you will be using the objects.

It is not possible in a way you describe, but there is a mechanism of sorts, called S3 object lock, which allows you to lock a specific version of file. It will not prevent creation of new versions of file, but the version you lock is going to be immutable.

Is AWS S3 read guaranteed to return a newly created object?

I've been reading the docs regarding read-after-write consistency with AWS S3 but I'm still unsure about this.
If I write an object to S3 and after getting a successful response from my write operation, I immediately attempt to read it, is the read operation guaranteed to return the object?
In other words, is it possible that the read operation will fail because it can't find the object? Because the read happened too soon after the write?
I'm only talking about new PUTs here, not updates to existing objects.

Yes guaranteed to return the object (only for new objects) with one caveat:
As per AWS documentation:
Amazon S3 provides read-after-write consistency for PUTS of new
objects in your S3 bucket in all regions with one caveat. The caveat
is that if you make a HEAD or GET request to the key name (to find if
the object exists) before creating the object, Amazon S3 provides
eventual consistency for read-after-write.
Amazon S3 offers eventual consistency for overwrite PUTS and DELETES
in all regions.
EDIT: credits to #Michael - sqlbot, more on HEAD (or) GET caveat:
If you send a GET or HEAD before the object exists, such as to check whether there's an object there before you upload, then the upload is not immediately consistent for read requests even after the upload is complete, because S3 has already made the only immediately consistent internal query it's going to make for that object, discovering, authoritatively, that there's no such key. The object creation becomes eventually consistent, since the creation has to "overwrite" the previous lookup that found nothing.
Based on following table provided in the link, "consistent reads" will never be stale.
Above provided link has nice example regarding how "read-after-write consistency" & "eventual consistency" works.
I would like to add this caution note to this answer to make things more clear:
Amazon S3 achieves high availability by replicating data across multiple servers within Amazon's data centers. If a PUT request is successful, your data is safely stored. However, information about the changes must replicate across Amazon S3, which can take some time, and so you might observe the following behaviors:
A process writes a new object to Amazon S3 and immediately lists keys
within its bucket. Until the change is fully propagated, the object
might not appear in the list.

How long does it take for AWS S3 to save and load an item?

S3 FAQ mentions that "Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES." However, I don't know how long it takes to get eventual consistency. I tried to search for this but couldn't find an answer in S3 documentation.
Situation:
We have a website consists of 7 steps. When user clicks on save in each step, we want to save a json document (contains information of all 7 steps) to Amazon S3. Currently we plan to:
Create a single S3 bucket to store all json documents.
When user saves step 1 we create a new item in S3.
When user saves step 2-7 we override the existing item.
After user saves a step and refresh the page, he should be able to see the information he just saved. i.e. We want to make sure that we always read after write.
The full json document (all 7 steps completed) is around 20 KB.
After users clicked on save button we can freeze the page for some time and they cannot make other changes until save is finished.
Question:
How long does it take for AWS S3 to save and load an item? (We can freeze our website when document is being saved to S3)
Is there a function to calculate save/load time based on item size?
Is the save/load time gonna be different if I choose another S3 region? If so which is the best region for Seattle?

I wanted to add to #error2007s answers.
How long does it take for AWS S3 to save and load an item? (We can freeze our website when document is being saved to S3)
It's not only that you will not find the exact time anywhere - there's actually no such thing exact time. That's just what "eventual consistency" is all about: consistency will be achieved eventually. You can't know when.
If somebody gave you an upper bound for how long a system would take to achieve consistency, then you wouldn't call it "eventually consistent" anymore. It would be "consistent within X amount of time".
The problem now becomes, "How do I deal with eventual consistency?" (instead of trying to "beat it")
To really find the answer to that question, you need to first understand what kind of consistency you truly need, and how exactly the eventual consistency of S3 could affect your workflow.
Based on your description, I understand that you would write a total of 7 times to S3, once for each step you have. For the first write, as you correctly cited the FAQs, you get strong consistency for any reads after that. For all the subsequent writes (which are really "replacing" the original object), you might observe eventual consistency - that is, if you try to read the overwritten object, you might get the most recent version, or you might get an older version. This is what is referred to as "eventual consistency" on S3 in this scenario.
A few alternatives for you to consider:
don't write to S3 on every single step; instead, keep the data for each step on the client side, and then only write 1 single object to S3 after the 7th step. This way, there's only 1 write, no "overwrites", so no "eventual consistency". This might or might not be possible for your specific scenario, you need to evaluate that.
alternatively, write to S3 objects with different names for each step. E.g., something like: after step 1, save that to bruno-preferences-step-1.json; then, after step 2, save the results to bruno-preferences-step-2.json; and so on, then save the final preferences file to bruno-preferences.json, or maybe even bruno-preferences-step-7.json, giving yourself the flexibility to add more steps in the future. Note that the idea here to avoid overwrites, which could cause eventual consistency issues. Using this approach, you only write new objects, you never overwrite them.
finally, you might want to consider Amazon DynamoDB. It's a NoSQL database, you can securely connect to it directly from the browser or from your server. It provides you with replication, automatic scaling, load distribution (just like S3). And you also have the option to tell DynamoDB that you want to perform strongly consistent reads (the default is eventually consistent reads; you have to change a parameter to get strongly consistent reads). DynamoDB is typically used for "small" records, 20kB is definitely within the range -- the maximum size of a record would be 400kB as of today. You might want to check this out: DynamoDB FAQs: What is the consistency model of Amazon DynamoDB?

How long does it take for AWS S3 to save and load an item? (We can freeze our website when document is being saved to S3)
You will not find the exact time anywhere. If you ask AWS they will give you approx timings. Your file is 20 KB so as per my experience from S3 usage the time will be more or less 60-90 Sec.
Is there a function to calculate save/load time based on item size?
No there is no any function using which you can calculate this.
Is the save/load time gonna be different if I choose another S3 region? If so which is the best region for Seattle?
For Seattle US West Oregon Will work with no problem.
You can also take a look at this experiment for comparison https://github.com/andrewgaul/are-we-consistent-yet

AWS - want to upload multiple files to S3 and only when all are uploaded trigger a lambda function

I am seeking advice on what's the best way to design this -
Use Case
I want to put multiple files into S3. Once all files are successfully saved, I want to trigger a lambda function to do some other work.
Naive Approach
The way I am approaching this is by saving a record in Dynamo that contains a unique identifier and the total number of records I will be uploading along with the keys that should exist in S3.
A basic implementation would be to take my existing lambda function which is invoked anytime my S3 bucket is written into, and have it check manually whether all the other files been saved.
The Lambda function would know (look in Dynamo to determine what we're looking for) and query S3 to see if the other files are in. If so, use SNS to trigger my other lambda that will do the other work.
Edit: Another approach is have my client program that puts the files in S3 be responsible for directly invoking the other lambda function, since technically it knows when all the files have been uploaded. The issue with this approach is that I do not want this to be the responsibility of the client program... I want the client program to not care. As soon as it has uploaded the files, it should be able to just exit out.
Thoughts
I don't think this is a good idea. Mainly because Lambda functions should be lightweight, and polling the database from within the Lambda function to get the S3 keys of all the uploaded files and then checking in S3 if they are there - doing this each time seems ghetto and very repetitive.
What's the better approach? I was thinking something like using SWF but am not sure if that's overkill for my solution or if it will even let me do what I want. The documentation doesn't show real "examples" either. It's just a discussion without much of a step by step guide (perhaps I'm looking in the wrong spot).
Edit In response to mbaird's suggestions below-
Option 1 (SNS) This is what I will go with. It's simple and doesn't really violate the Single Responsibility Principal. That is, the client uploads the files and sends a notification (via SNS) that its work is done.
Option 2 (Dynamo streams) So this is essentially another "implementation" of Option 1. The client makes a service call, which in this case, results in a table update vs. a SNS notification (Option 1). This update would trigger the Lambda function, as opposed to notification. Not a bad solution, but I prefer using SNS for communication rather than relying on a database's capability (in this case Dynamo streams) to call a Lambda function.
In any case, I'm using AWS technologies and have coupling with their offering (Lambda functions, SNS, etc.) but I feel relying on something like Dynamo streams is making it an even tighter coupling. Not really a huge concern for my use case but still feels dirty ;D
Option 3 with S3 triggers My concern here is the possibility of race conditions. For example, if multiple files are being uploaded by the client simultaneously (think of several async uploads fired off at once with varying file sizes), what if two files happen to finish uploading at around the same time, and two or more Lambda functions (or whatever implementations we use) query Dynamo and gets back N as the completed uploads (instead of N and N+1)? Now even though the final result should be N+2, each one would add 1 to N. Nooooooooooo!
So Option 1 wins.

If you don't want the client program responsible for invoking the Lambda function directly, then would it be OK if it did something a bit more generic?
Option 1: (SNS) What if it simply notified an SNS topic that it had completed a batch of S3 uploads? You could subscribe your Lambda function to that SNS topic.
Option 2: (DynamoDB Streams) What if it simply updated the DynamoDB record with something like an attribute record.allFilesUploaded = true. You could have your Lambda function trigger off the DynamoDB stream. Since you are already creating a DynamoDB record via the client, this seems like a very simple way to mark the batch of uploads as complete without having to code in knowledge about what needs to happen next. The Lambda function could then check the "allFilesUploaded" attribute instead of having to go to S3 for a file listing every time it is called.
Alternatively, don't insert the DynamoDB record until all files have finished uploading, then your Lambda function could just trigger off new records being created.
Option 3: (continuing to use S3 triggers) If the client program can't be changed from how it works today, then instead of listing all the S3 files and comparing them to the list in DynamoDB each time a new file appears, simply update the DynamoDB record via an atomic counter. Then compare the result value against the size of the file list. Once the values are the same you know all the files have been uploaded. The down side to this is that you need to provision enough capacity on your DynamoDB table to handle all the updates, which is going to increase your costs.
Also, I agree with you that SWF is overkill for this task.

Get arbitrary object from Riak bucket

Is there a way to get a random object from a specific bucket by using Riak's HTTP API? Let's say that you have no knowledge about the contents of a bucket, the only thing you know is that all objects in a bucket share a common data structure. What would be a good way to get any object from a bucket, in order to show its data structure? Preferably using MapReduce over Search, since Search will flatten the response.

The best option is to use predictable keys so you don't have to find them. Since that is not always possible, secondary indexing is the next best.
If you are using eLevelDB, you can query the $BUCKET implicit index with max_results set to 1, which will return a single key. You would then issue a get request for that key.
If you are using Bitcask, you have 2 options:
list all of the keys in the bucket
Key listing in Bitcask will need to fold over every value in all buckets in order to return the list of keys in a single bucket. Effectively this means reading your entire dataset from disk, so this is very heavy on the system and could bring a production cluster to its knees.
MapReduce
MapReduce over a full bucket uses a similar query to key listing so it is also very heave on the system. Since the map phase function is executed separately for each object, if your map phase returns an object, every object in the bucket would be passed over the network to the node running the reduce phase. Thus it would be more efficient (read: less disastrous) to have the map phase function return just the key with no data, then have your reduce phase return the first item in the list, which leaves you needing to issue a get request for the object once you have the key name.
While it is technically possible to find a key in a given bucket when you have no information about the keys or the contents, if you designed your system to create a key named <<"schema">> or <<"sample">> that contains a sample object in each bucket, you could simply issue a get request for that key instead of searching, folding, or mapping.

If you are using Riak 2.X then search (http://docs.basho.com/riak/latest/dev/using/search/) is recommended over Map Reduce or 2i queries in most use cases and it is available via the HTTP API.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js