Get arbitrary object from Riak bucket - mapreduce

Is there a way to get a random object from a specific bucket by using Riak's HTTP API? Let's say that you have no knowledge about the contents of a bucket; the only thing you know is that all objects in the bucket share a common data structure. What would be a good way to get any object from a bucket, in order to show its data structure? Preferably using MapReduce over Search, since Search will flatten the response.

The best option is to use predictable keys so you don't have to find them. Since that is not always possible, secondary indexing is the next best.
If you are using eLevelDB, you can query the $BUCKET implicit index with max_results set to 1, which will return a single key. You would then issue a get request for that key.
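For reference, this can be done directly over the HTTP interface. Here is a rough sketch in Python, assuming Riak's HTTP port is on localhost:8098 and the bucket is named users (both placeholders); it queries the implicit $bucket index for one key, then fetches that object:

import json
import urllib.request

RIAK = "http://localhost:8098"
BUCKET = "users"   # placeholder bucket name

# Ask the implicit $bucket index for a single key (the trailing "_" is the
# placeholder value used when querying $bucket).
url = f"{RIAK}/buckets/{BUCKET}/index/$bucket/_?max_results=1"
with urllib.request.urlopen(url) as resp:
    keys = json.load(resp)["keys"]

if keys:
    # Fetch that one object so you can inspect its structure.
    with urllib.request.urlopen(f"{RIAK}/buckets/{BUCKET}/keys/{keys[0]}") as resp:
        print(resp.read().decode())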
If you are using Bitcask, you have 2 options:
list all of the keys in the bucket
Key listing in Bitcask will need to fold over every value in all buckets in order to return the list of keys in a single bucket. Effectively this means reading your entire dataset from disk, so this is very heavy on the system and could bring a production cluster to its knees.
MapReduce
MapReduce over a full bucket uses a similar query to key listing, so it is also very heavy on the system. Since the map phase function is executed separately for each object, if your map phase returns an object, every object in the bucket would be passed over the network to the node running the reduce phase. Thus it would be more efficient (read: less disastrous) to have the map phase function return just the key with no data, then have your reduce phase return the first item in the list, which leaves you needing to issue a get request for the object once you have the key name.
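As an illustration of that approach, here is a rough sketch of a job posted to the /mapred HTTP endpoint, with a map phase that returns only keys and a reduce phase that keeps the first one. It assumes JavaScript MapReduce is enabled on your nodes and uses a placeholder bucket name:

import json
import urllib.request

RIAK = "http://localhost:8098"
job = {
    "inputs": "users",   # placeholder bucket; a full-bucket input is still expensive
    "query": [
        {"map": {"language": "javascript",
                 "source": "function(v) { return [v.key]; }"}},
        {"reduce": {"language": "javascript",
                    "source": "function(keys) { return keys.slice(0, 1); }"}},
    ],
}
req = urllib.request.Request(f"{RIAK}/mapred",
                             data=json.dumps(job).encode(),
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    keys = json.load(resp)   # e.g. ["some-key"]

# A follow-up GET on /buckets/users/keys/<key> then retrieves the sample object.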
While it is technically possible to find a key in a given bucket when you have no information about the keys or the contents, if you designed your system to create a key named <<"schema">> or <<"sample">> that contains a sample object in each bucket, you could simply issue a get request for that key instead of searching, folding, or mapping.

If you are using Riak 2.x, then Search (http://docs.basho.com/riak/latest/dev/using/search/) is recommended over MapReduce or 2i queries in most use cases, and it is available via the HTTP API.
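For completeness, a single sample document can be pulled over HTTP with a Solr-style query against a search index. A sketch, assuming an index named famous (placeholder) is already attached to the bucket:

import json
import urllib.request

RIAK = "http://localhost:8098"
INDEX = "famous"   # placeholder: a search index already attached to your bucket

# Match anything and return a single row.
url = f"{RIAK}/search/query/{INDEX}?wt=json&q=*:*&rows=1"
with urllib.request.urlopen(url) as resp:
    docs = json.load(resp)["response"]["docs"]
print(docs)   # note: fields come back flattened, as the question mentions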

Related

Boto3, S3 check if keys exist

Right now I do know how to check if a single key exists within my S3 bucket using Boto 3:
res = s3.list_objects_v2(Bucket=record.bucket_name, Prefix='back.jpg', Delimiter='/')
for obj in res.get('Contents', []):
    print(obj)
However, I'm wondering if it's possible to check whether multiple keys exist within a single API call. It feels like a bit of a waste to make 5+ requests for that.
You could either use head_object() to check whether a specific object exists, or retrieve the complete bucket listing using list_objects_v2() and then look through the returned list to check for multiple objects.
Please note that list_objects_v2() only returns up to 1000 objects at a time, so several calls (or a paginator) might be needed to retrieve the complete list.
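For example, here is a sketch of the listing approach that checks several keys in one pass; the bucket name and key list are placeholders:

import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'                              # placeholder
wanted = {'back.jpg', 'front.jpg', 'side.jpg'}    # keys you want to check

# One (paginated) listing pass collects every key, then set membership
# answers all of the existence checks at once.
existing = set()
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket):
    existing.update(obj['Key'] for obj in page.get('Contents', []))

missing = wanted - existing
print('missing:', missing)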

How to build an index of S3 objects when data exceeds object metadata limit?

Building an index of S3 objects can be very useful for making them quickly searchable: the natural, most obvious way is to store additional data in the object metadata and use a Lambda to write to DynamoDB or RDS, as described here: https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/
However, this strategy is limited by the amount of data one can store in the object metadata, which is 2 KB, as described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html. Suppose you need to build a system where, every time an object is uploaded to S3, you need to add some information (not contained in the file or the object name) to a database, and this data exceeds 2 KB: you can't store it in the object metadata.
What are viable strategies to keep the bucket and the index updated?
Implement two chained API calls where each call is idempotent: if the second fails when the first succeeds, one can retry until success. What happens if you perform a PUT of an identical object on S3 and you have versioning activated? Will S3 create a new version? In this case, implementing idempotency requires that only a single writer be active at any given time.
Use some sort of workflow engine to keep track of this two-step behaviour, such as AWS Step Functions. What are the gotchas with this solution?
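For reference, the serverless-index pattern from the linked post boils down to an S3-event-triggered Lambda writing one row per object to the index table. A minimal sketch, assuming a hypothetical DynamoDB table named s3-index and a hypothetical lookup_extra_data() helper that supplies the data that won't fit in object metadata:

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('s3-index')   # hypothetical table, keyed on the object key

def handler(event, context):
    """Triggered by an s3:ObjectCreated:* event."""
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        # The >2 KB of extra data comes from your own source (request payload,
        # another service, etc.), so it never has to fit in S3 object metadata.
        extra = lookup_extra_data(bucket, key)   # hypothetical helper
        table.put_item(Item={'s3_key': key, 'bucket': bucket, **extra})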

Organizing files in S3

I have a social media web application. Users upload pictures such as profile pictures, project pictures, etc. What's the best way to organize these files in an S3 bucket?
I thought of creating a folder with the user ID as its name inside the bucket, and then inside that, multiple other folders, i.e. profile, projects, etc.
Not sure if that's the best approach to follow!
The names (Keys) you assign to objects in Amazon S3 are, frankly, irrelevant.
What matters is that you have a database that tracks the objects, their ownership and their purpose.
You should not use the filename (Key) of an Amazon S3 object as a way of storing information about the object, because your application might have millions of objects in S3 and it is too slow to scan the list of objects to see which ones exist. Instead, consult a database to find them.
To answer your question: yes, create a prefix by user ID if you wish, but then just give the object a unique name (e.g. a universally unique identifier, UUID) that avoids name clashes.
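As an illustration of that approach, here is a rough sketch: generate a UUID-based key under the user's prefix, upload the object, and record the mapping in your own database. The bucket name and the save_to_database() helper are placeholders:

import uuid
import boto3

s3 = boto3.client('s3')
BUCKET = 'my-app-uploads'   # placeholder bucket name

def store_picture(user_id, kind, data):
    """Upload under a per-user prefix with a collision-proof UUID name."""
    key = f"{user_id}/{kind}/{uuid.uuid4()}.jpg"
    s3.put_object(Bucket=BUCKET, Key=key, Body=data, ContentType='image/jpeg')
    # The database, not the key name, is the source of truth about
    # ownership and purpose (hypothetical helper).
    save_to_database(user_id=user_id, kind=kind, s3_key=key)
    return key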
There used to be a need to add random prefixes for better performance; more details here and here.
The following is an extract from one of those pages:
Pay attention to your naming scheme if:
Distributing the key names
Don't save your objects with key names that start with a date or other standard prefix; it adds complexity to the S3 indexing and will reduce performance, because, based on that indexing, objects are saved in a single storage partition.
Amazon S3 maintains keys lexicographically in its internal indices.
However, as of the 17 Jul 2018 announcement, adding a random prefix to S3 keys is no longer required to improve performance.

Copy files from S3 bucket to local machine using file index

I need to copy files from many subdirectories in an S3 bucket to my local machine. The file names are auto-generated and would be difficult to obtain without first using ls, but I do know that the target file is always the 2nd file in the subfolder by creation date.
Is there a way to reference a file in an S3 bucket subfolder by index?
I am envisioning doing this with aws cli, though I'm open to other suggestions.
I'm not aware of any way within S3 to list the second oldest object without listing all objects at a given prefix and then explicitly sorting that list by date. If you need to do this then here are a few ideas:
if objects are only ever added (never deleted), then you could perhaps use a key naming convention when objects are uploaded that allows you to easily locate the 2nd oldest object, e.g. 0001-xxx, 0002-xxx. Then you can find the 2nd oldest object by listing objects with prefix 0002.
maintain an independent index of the objects in an RDBMS or KV database that allows you to easily locate the S3 key of the 2nd oldest object in any part of your S3 hierarchy. Possibly the DB is maintained via a Lambda function called when objects are put or deleted.
use a Lambda function triggered on object PUT that enumerates all of the objects in the relevant 'folder' and writes the key of the 2nd oldest object back to a kind of index object in that same folder (or as metadata on a known index object). Then you can find the 2nd oldest by getting the contents of the index object (or its metadata).
Option #2 might be the best as it's simple, fast, and flexible (what if, as your app changes over time, you find that you also need to know the 4th oldest object, or the 2nd newest object?).
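For reference, the straightforward approach mentioned above (list everything at the prefix, then sort by date) looks roughly like this with boto3; the bucket name and prefix are placeholders:

import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'           # placeholder
prefix = 'some/subfolder/'     # placeholder

# List everything under the prefix (paginated), sort by last-modified date,
# and take the second-oldest object.
objects = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    objects.extend(page.get('Contents', []))

objects.sort(key=lambda obj: obj['LastModified'])
if len(objects) >= 2:
    target_key = objects[1]['Key']
    s3.download_file(bucket, target_key, target_key.split('/')[-1])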
You could use this method to obtain the Key of the second object (in lexicographic key order, not by creation date) in a given bucket:
aws s3api list-objects-v2 --bucket BUCKET-NAME --query 'Contents[1].Key' --output text
This would also work within a path if you add --prefix PATH to the command.
However, you mention that you have many subdirectories, so you would have to know the names of all those subdirectories if you are wanting to avoid doing a full bucket listing.

Top level solution to rename AWS bucket item's folder names?

I've inherited a project at work. It's essentially a niche content repository, and we use S3 to store the content. The project was severely outdated, and I'm in the process of a thorough update.
For some unknown and undocumented reason, the content is stored in an AWS S3 bucket with the pattern web_cl_000000$DB_ID$CONTENT_NAME. So, one particular folder can be named web_cl_0000003458zyxwv. This makes no sense, and requires a bit of transformation logic to construct a URL to serve up the content!
I can write a Python script using the boto3 library to do an item-by-item rename, but would like to know if there's a faster way to do so. There are approximately 4M items in that bucket, which will take quite a long time.
That isn't possible, because the folders are an illusion derived from the strings between / delimiters in the object keys.
Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using key name prefixes for objects. (emphasis added)
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
The console contributes to the illusion by allowing you to "create" a folder, but all that actually does is create a 0-byte object with / as its last character, which the console will display as a folder whether there are other objects with that prefix or not, making it easier to upload objects manually with some organization.
But any tool or technique that allows renaming folders in S3 will in fact be making a copy of each object with the modified name, then deleting the old object, because S3 does not actually support rename or move, either -- objects in S3, including their key and metadata, are actually immutable. Any "change" is handled at the API level with a copy/overwrite or copy-then-delete.
Worth noting, S3 should be able to easily sustain 100 such requests per second, so with asynchronous requests or multi-threaded code, or even several processes each handling a shard of the keyspace, you should be able to do the whole thing in a few hours.
Note also that the less sorted (more random) the new keys are in the requests, the harder you can push S3 during a mass-write operation like this. Sending the requests so that the new keys are in lexical order will be the most likely scenario in which you might see 503 Slow Down errors... in which case, you just back off and retry... but if the new keys are not ordered, S3 can more easily accommodate a large number of requests.
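To make the copy-then-delete concrete, here is a rough sketch with boto3 and a thread pool; the bucket name and the new_key_for() transformation are placeholders for your own naming logic:

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client('s3')
BUCKET = 'my-content-bucket'   # placeholder

def new_key_for(old_key):
    # Hypothetical transform from the old naming scheme to the new one.
    return old_key.replace('web_cl_000000', '', 1)

def rename(old_key):
    # S3 has no rename: copy the object to its new key, then delete the old one.
    s3.copy_object(Bucket=BUCKET, Key=new_key_for(old_key),
                   CopySource={'Bucket': BUCKET, 'Key': old_key})
    s3.delete_object(Bucket=BUCKET, Key=old_key)

paginator = s3.get_paginator('list_objects_v2')
with ThreadPoolExecutor(max_workers=32) as pool:
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get('Contents', []):
            pool.submit(rename, obj['Key'])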