Is there a way to list or iterate over the CONTENT of a file in S3? - amazon-web-services

I have an S3 object identified by a key.
I am trying to iterate over the value stored under a key in S3, which is basically a simple .txt file. I have found similar questions about iterating over objects and listing the objects in a bucket, but nothing so far on iterating over the actual contents of the file itself.
The code below returns the object and the bucket containing the data, but it doesn't list the object's contents or give me an option to iterate over them. It appears to just filter the keys in the bucket, whereas I am trying to open and/or iterate over the values stored under the key.
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('account-id-metadata')
# This only filters object keys by prefix; it never reads the objects' contents.
for i in bucket.objects.filter(Prefix='data.txt'):
    print(i)
I would like to know if this is possible with S3 using boto3.
NOTE: This data was originally in a local file and I was planning to iterate over it locally; however, because of the large amount of data that approach was crashing and using a lot of memory, so I moved the data to S3 hoping to perform the same task there.
Thank you in advance.

The only Amazon S3 operation that works on the "contents" of an object is S3 Select (see S3 Select and Glacier Select – Retrieving Subsets of Objects | AWS News Blog).
It lets you use SQL-like queries to extract rows and columns from a single object, for certain file formats. This is useful when you want to extract a small amount of information from a large object.
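For example, here is a minimal sketch using boto3's select_object_content, reusing the bucket and key from the question and assuming data.txt can be parsed as CSV (the serialization settings are assumptions to adjust for your format):

import boto3

s3 = boto3.client('s3')

response = s3.select_object_content(
    Bucket='account-id-metadata',
    Key='data.txt',
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object s",
    InputSerialization={'CSV': {'FileHeaderInfo': 'NONE'}},
    OutputSerialization={'CSV': {}},
)

# The payload is an event stream; the 'Records' events carry the selected rows.
for event in response['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))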

Related

Store list of Strings in S3

I am new to Amazon AWS S3.
One of my applications processes 40,000 updates an hour, with a unique identifier for each update.
This identifier is basically a string.
At runtime, I want to store the ID of every update in an S3 bucket.
But, as far as I understand, S3 only stores files (objects).
Is there any way around this?
Should I store a file, then read that file each time, append the new ID, and store it again?
Any direction would be very helpful.
Thanks in advance.
I want it to be stored like:
Id1
Id2
Id3
.
.
.
.
Edit: Thanks for the responses, I have added what was asked.
I want to be able to fetch all of these IDs if and when a problem occurs in our system.
I am open to using something other than S3 as well. I was also looking into DynamoDB, with the ID as the primary key, but these IDs might be repeated in 1-2% of cases.
In S3, you do not have the concept of files and folders. All you have is a bucket and objects inside the bucket. However, the AWS console groups objects with common prefixes so that they appear to be in the same folder.
Also, there is no such thing as appending to a file in S3. Since S3 stores whole objects, a so-called append actually replaces the previous object with a new object consisting of the previous object's data plus the appended data.
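To make that concrete, a naive 'append' to an S3 object looks roughly like this (a sketch, assuming a boto3 client and a text object; the names are not from the question):

import boto3

s3 = boto3.client('s3')

def append_to_object(bucket, key, new_line):
    # Read the whole existing object (if any), add one line, and write everything back.
    # This is what an S3 'append' boils down to, and it gets slower as the object grows.
    try:
        existing = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    except s3.exceptions.NoSuchKey:
        existing = b''
    s3.put_object(Bucket=bucket, Key=key, Body=existing + new_line.encode('utf-8') + b'\n')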
So, one way to do what I think you're trying to do is:
Suppose you have all the IDs written at 10:00 in an S3 object called data_corresponding_to_10_00_00. For the next hour (and the next 40,000 updates), if they are all new IDs, you can write them to another S3 object named data_corresponding_to_11_00_00, as in the sketch below.
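A minimal sketch of that per-hour approach, assuming the IDs for the current hour are buffered in memory and flushed once per hour (the bucket name is a placeholder):

import boto3
from datetime import datetime, timezone

s3 = boto3.client('s3')

def flush_ids(ids, bucket='my-id-bucket'):
    # Write all IDs collected during the current hour to a single S3 object,
    # named after the hour, e.g. data_corresponding_to_11_00_00.
    key = datetime.now(timezone.utc).strftime('data_corresponding_to_%H_00_00')
    s3.put_object(Bucket=bucket, Key=key, Body='\n'.join(ids).encode('utf-8'))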
However, if you do not want duplicate entries across the two objects, or you need to update the previous object in place, S3 is not a great fit. In that case, use a database indexed on the ID so that lookups and updates stay fast.

How can I detect orphaned objects in S3 that aren't mapped to our database?

I am trying to find possible orphans in an S3 bucket. What I mean is that we might delete something from the DB and, for whatever reason, it doesn't get cleared from S3. This could be a bug in our system or something of that nature. I want to double-check against our API that each object in S3 maps to something that exists; the naming convention lets us map the two together.
Scraping the entire bucket every X days doesn't seem scalable. I was thinking that each object in the bucket could add itself to an SQS queue so that the relevant check happens every 30 days or so.
I've only found events for uploads and specific modifications at https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html. Is there anything more general that I'm missing? Any creative solutions to this problem?
You should activate Amazon S3 Inventory, which can provide a regular CSV file (as often as daily) that contains a list of every object in the Amazon S3 bucket.
You could then trigger some code that compares the contents of the CSV file against the database to find 'orphan' objects.
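As a rough sketch of that comparison step (the CSV column layout and the known_keys collection are assumptions; the real inventory schema depends on how you configure it):

import csv

def find_orphans(inventory_csv_path, known_keys):
    # Return keys that appear in the S3 inventory report but are unknown to the database.
    orphans = []
    with open(inventory_csv_path, newline='') as f:
        for row in csv.reader(f):
            key = row[1]  # assumes the key is the second column of the inventory report
            if key not in known_keys:
                orphans.append(key)
    return orphans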

Can S3 Select search multiple objects?

I'm testing out S3 Select and as far as I understand from the examples, you can treat a single object (CSV or JSON) as a data store.
I wanted to have a single JSON document per S3 object and search the entire bucket as a 'database'. I'm saving each 'file' as <ID>.json and each file has JSON documents with the same schema.
Is it possible to search multiple objects in a single call, e.g. find all JSON documents where customerId = 123?
It appears that Amazon S3 Select operates on only one object at a time.
You can use Amazon Athena to run queries across a path, which will include all objects under that path. It also supports partitioning.
Alternatively, you can iterate over the keys under the common prefix ('folder') that holds all the files and run S3 Select against each object in turn, as sketched below.
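A rough sketch of that loop, with placeholder bucket/prefix names and assuming each object is a single JSON document:

import boto3

s3 = boto3.client('s3')

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='customers/'):
    for obj in page.get('Contents', []):
        resp = s3.select_object_content(
            Bucket='my-bucket',
            Key=obj['Key'],
            ExpressionType='SQL',
            Expression="SELECT * FROM s3object s WHERE s.customerId = '123'",
            InputSerialization={'JSON': {'Type': 'DOCUMENT'}},
            OutputSerialization={'JSON': {}},
        )
        # Each matching document comes back in 'Records' events on the payload stream.
        for event in resp['Payload']:
            if 'Records' in event:
                print(event['Records']['Payload'].decode('utf-8'))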

Copy files from S3 bucket to local machine using file index

I need to copy a file from each of many subdirectories in an S3 bucket to my local machine. The file names are auto-generated and would be difficult to obtain without first running ls, but I do know that the target file is always the 2nd file in its subfolder by creation date.
Is there a way to reference a file in an S3 bucket subfolder by index?
I am envisioning doing this with the AWS CLI, though I'm open to other suggestions.
I'm not aware of any way within S3 to list the second-oldest object without listing all objects at a given prefix and then explicitly sorting that list by date (a sketch of that listing step follows the options below). If you need to do this, here are a few ideas:
1. If objects are only ever added (never deleted), you could use a key naming convention at upload time that makes it easy to locate the 2nd-oldest object, e.g. 0001-xxx, 0002-xxx. Then you can find the 2nd-oldest object by listing objects with the prefix 0002.
2. Maintain an independent index of the objects in an RDBMS or KV database that lets you easily look up the S3 key of the 2nd-oldest object in any part of your S3 hierarchy. The index could be maintained by a Lambda function invoked when objects are put or deleted.
3. Use a Lambda function triggered on object PUT that enumerates all of the objects in the relevant 'folder' and writes the key of the 2nd-oldest object back to a kind of index object in that same folder (or as metadata on a known index object). Then you can find the 2nd-oldest by getting the contents of the index object (or its metadata).
Option #2 might be the best, as it's simple, fast, and flexible (what if, as your app changes over time, you also need to know the 4th-oldest object, or the 2nd-newest one?).
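Whichever option you choose, the core listing-and-sorting step looks roughly like this (a sketch with placeholder names):

import boto3

s3 = boto3.client('s3')

def second_oldest_key(bucket, prefix):
    # List every object under the prefix and sort by LastModified to find the 2nd oldest.
    # This is the full listing that the options above try to avoid repeating on every lookup.
    objects = []
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        objects.extend(page.get('Contents', []))
    objects.sort(key=lambda o: o['LastModified'])
    return objects[1]['Key'] if len(objects) >= 2 else None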
You could use this command to obtain the name of the second object in a given bucket (note that list-objects-v2 returns objects sorted by key name, not by creation date):
aws s3api list-objects-v2 --bucket BUCKET-NAME --query 'Contents[1].Key' --output text
To restrict this to a particular path, add --prefix PATH/ to the command.
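If you specifically need the second object by creation date rather than by key name, a variant that sorts on LastModified with a JMESPath expression should work (untested sketch):
aws s3api list-objects-v2 --bucket BUCKET-NAME --prefix PATH/ --query 'sort_by(Contents, &LastModified)[1].Key' --output text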
However, you mention that you have many subdirectories, so you would need to know the names of all of those subdirectories if you want to avoid doing a full bucket listing.

download, process, upload large number of s3 files with spark

I have a large number of files (~500k HDF5 files) in an S3 bucket that I need to process and re-upload to another S3 bucket.
I am pretty new to such tasks, so I am not quite sure whether my approach is correct. I do the following:
I use boto to get the list of keys in the bucket and parallelize it with Spark:
s3keys = bucket.list()
data = sc.parallelize(s3keys)
data = data.map(lambda x: download_process_upload(x))
result = data.collect()
where download_process_upload is a function that downloads the file specified by the key, does some processing on it, and re-uploads it to another bucket (returning 1 if everything was successful and 0 if there was an error).
So in the end I could do
success_rate = sum(result) / float(len(s3keys))
I have read that Spark map functions should be stateless, while my custom map function definitely is not: it downloads the file to disk, then loads it into memory, etc.
So is this the proper way to do such a task?
I've successfully used your methodology to download and process data from S3. I have not tried to upload the data from within a map statement, but I see no reason why you wouldn't be able to read the file from S3, process it, and then upload it to a new location.
Also, you can save a few keystrokes and take the explicit lambda out of the map statement, like this: data = data.map(download_process_upload)
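For reference, a minimal sketch of what such a map function could look like (the bucket names and the process_hdf5 step are placeholders, and it assumes each element of the RDD is a key string):

import boto3

def download_process_upload(key):
    # Create the client inside the function so the task stays picklable for Spark workers.
    s3 = boto3.client('s3')
    try:
        body = s3.get_object(Bucket='source-bucket', Key=key)['Body'].read()
        processed = process_hdf5(body)  # hypothetical processing step
        s3.put_object(Bucket='destination-bucket', Key=key, Body=processed)
        return 1
    except Exception:
        return 0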