Can S3 Select search multiple objects? - amazon-web-services

I'm testing out S3 Select and as far as I understand from the examples, you can treat a single object (CSV or JSON) as a data store.
I wanted to have a single JSON document per S3 object and search the entire bucket as a 'database'. I'm saving each 'file' as <ID>.json and each file has JSON documents with the same schema.
Is it possible to search multiple objects in a single call? For example, find all JSON documents where customerId = 123?

It appears that Amazon S3 Select operates on only one object.
You can use Amazon Athena to run queries across paths, which will include all files within that path. It also supports partitioning.
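As a rough illustration of the Athena route, the sketch below starts a query with boto3, polls until it finishes, and collects the rows. The function name, the database name, and the results location are assumptions made for the example; the `athena` argument is expected to behave like `boto3.client("athena")`.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start an Athena query, wait for it to finish, and return the result rows."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    rows = []
    paginator = athena.get_paginator("get_query_results")
    for page in paginator.paginate(QueryExecutionId=query_id):
        for row in page["ResultSet"]["Rows"]:
            rows.append([col.get("VarCharValue") for col in row["Data"]])
    return rows  # for SELECT queries the first row holds the column names
```

With a table defined over the bucket path (the table name here is hypothetical), a call might look like `run_athena_query(boto3.client("athena"), "SELECT * FROM customers WHERE customerid = 123", "my_database", "s3://my-athena-results/")`.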

You can iterate over the keys under the prefix ("folder") that holds all the files and run S3 Select against each key in turn.
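A minimal sketch of that loop, assuming each object holds a single JSON document as described in the question. The function name is made up for illustration, and `client` is expected to behave like `boto3.client("s3")`. Note that this issues one S3 Select request per object, so it is not a single call.

```python
import json


def select_across_objects(client, bucket, prefix, expression):
    """Run one S3 Select query per object under prefix and merge the rows."""
    matches = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            resp = client.select_object_content(
                Bucket=bucket,
                Key=obj["Key"],
                ExpressionType="SQL",
                Expression=expression,
                InputSerialization={"JSON": {"Type": "DOCUMENT"}},
                OutputSerialization={"JSON": {}},
            )
            # The response is an event stream; row data arrives in "Records" events.
            raw = b"".join(
                event["Records"]["Payload"]
                for event in resp["Payload"]
                if "Records" in event
            )
            for line in raw.decode("utf-8").splitlines():
                if line.strip():
                    matches.append((obj["Key"], json.loads(line)))
    return matches
```

For the question's example, the expression could be `SELECT * FROM S3Object s WHERE s.customerId = 123`. Each match is returned together with the key of the object it came from.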

Related

Create Athena table using s3 source data

The S3 path below is where I store the files produced at the end of a process. The path is dynamic: the values of the following fields vary - partner_name, customer_name, product_name.
s3://bucket/{val1}/data/{val2}/output/intermediate_results
I am trying to create Athena tables for each output file present under output/ as well as under intermediate_results/ directories, for each val1-val2.
Each file is a CSV.
I am not very familiar with AWS Athena, so I can't figure out how to implement this. I would really appreciate any help. Thanks!
Use Athena's CREATE TABLE statement. You will need to specify the LOCATION of the data in Amazon S3 by providing a path.
Amazon Athena will automatically use all files in that path, including subdirectories. This means that a table created with a Location of output/ will include all subdirectories, including intermediate_results. Therefore, your data storage format is not compatible with your desired use for Amazon Athena. You would need to put the data into separate paths for each table.
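To make the LOCATION point concrete, here is a small sketch that builds a CREATE EXTERNAL TABLE statement for CSV files under one path. The helper name, the table name, and the column list are hypothetical; the real columns depend on the CSV files, which the question does not show.

```python
def csv_table_ddl(table, columns, location, skip_header=True):
    """Build a CREATE EXTERNAL TABLE statement for CSV files under one S3 path."""
    if not location.endswith("/"):
        location += "/"  # LOCATION should be a folder-like prefix, not a file
    cols = ",\n  ".join(f"`{name}` {ctype}" for name, ctype in columns)
    ddl = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )
    if skip_header:
        # Assumes each CSV file starts with a header row.
        ddl += "\nTBLPROPERTIES ('skip.header.line.count'='1')"
    return ddl


print(csv_table_ddl(
    "intermediate_results",                      # hypothetical table name
    [("id", "string"), ("score", "double")],     # hypothetical columns
    "s3://bucket/val1/data/val2/output/intermediate_results",
))
```

Because a table reads everything under its LOCATION, one statement like this is needed per distinct path, and the paths must not be nested inside each other if the tables are meant to hold different data.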

Is there a way to list or iterate over the CONTENT of a file in S3?

I have an S3 object stored under a key.
I am trying to iterate over the values inside that object, which is basically a simple .txt file. I have found similar questions about iterating over objects and listing files, but nothing so far on iterating over the actual contents of the file itself.
The code below returns the object and the bucket containing the data, but it doesn't list the object's contents or give me an option to iterate over them. It appears to just filter the keys in the bucket, whereas I am trying to open and/or iterate over the values stored under the key.
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('account-id-metadata')
# Lists the matching objects (keys), not their contents
for i in bucket.objects.filter(Prefix='data.txt'):
    print(i)
I would like to know whether this is possible with S3 using boto3.
NOTE: This was originally a local file and I was planning to iterate over it locally; however, because of the large amount of data it was crashing and taking up a lot of memory, so I moved it to S3 hoping to get the same functionality.
Thank you in advance.
The only Amazon S3 operation that works on the "contents" of an object is S3 Select (see the AWS News Blog post "S3 Select and Glacier Select – Retrieving Subsets of Objects").
This allows you to use SQL-like commands to extract rows and columns from a single object for certain file formats. This is useful when wanting to extract a small amount of information from large objects.
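When every line is needed rather than a small subset, a different technique from S3 Select may fit better: a plain GET whose response body is read as a stream. The sketch below does still download the object, but it reads it in chunks, so memory use stays bounded. The function name is made up for illustration, and `client` is expected to behave like `boto3.client("s3")`.

```python
def iter_object_lines(client, bucket, key, encoding="utf-8"):
    """Yield the lines of an S3 object one at a time without loading it whole."""
    body = client.get_object(Bucket=bucket, Key=key)["Body"]
    # StreamingBody.iter_lines reads the object in chunks as it downloads.
    for raw_line in body.iter_lines():
        yield raw_line.decode(encoding)
```

With the names from the question, usage would be along the lines of `for line in iter_object_lines(boto3.client("s3"), "account-id-metadata", "data.txt"): ...`.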

Store list of Strings in S3

I am new to Amazon AWS S3.
One of my applications processes 40000 updates an hour with a unique identifier for each update.
This identifier is basically a string.
At runtime, I want to store the ID in an S3 bucket for all updates.
But, as far as I understand, we need to store files in S3.
Is there any way around this?
Should I store a file, then read that file each time, append the ID, and store it again?
Any direction would be very helpful.
Thanks in advance.
I want it to be stored like:
Id1
Id2
Id3
...
Edit: Thanks for the responses, I have added what is asked..
I want to be able to just fetch all these IDs if and when a problem occurs in our system.
I am open to using something other than S3 as well. I was also looking into DynamoDB, with the ID as the primary key. However, these IDs may repeat in 1-2% of cases.
In S3, there is no concept of files and folders. All you have is a bucket and the objects inside it. However, the AWS console groups objects with common prefixes so that they appear to be in the same folder.
Also, there is no such thing as appending to a file in S3. Since S3 stores whole objects, a so-called append actually replaces the previous object with a new object that contains the previous object's data plus the new data.
So, one way to do what I think you're trying to do is:
Suppose you have all the IDs written at 10:00 in an S3 object called data_corresponding_to_10_00_00. For the next hour (and its 40000 updates), if they all have new IDs, you can write them to another S3 object named data_corresponding_to_11_00_00.
However, if you do not want duplicate entries across the files, and you need to update the previous file itself, S3 is not a great fit. Use a database indexed on the ID instead, so that lookups and updates are fast.
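The one-object-per-hour idea could be sketched as follows. The key layout (an `ids/` prefix plus a date folder) is an assumption added so that keys from different days do not collide; only the `data_corresponding_to_HH_00_00` part comes from the answer above. `client` is expected to behave like `boto3.client("s3")`.

```python
from datetime import datetime, timezone


def hourly_key(moment, prefix="ids"):
    """Object key for the hour containing `moment` (naming scheme is hypothetical)."""
    return f"{prefix}/{moment:%Y-%m-%d}/data_corresponding_to_{moment:%H}_00_00"


def write_hourly_ids(client, bucket, ids, moment=None):
    """Write one batch of IDs as a new newline-delimited object; return its key."""
    moment = moment or datetime.now(timezone.utc)
    unique_ids = list(dict.fromkeys(ids))  # drop repeats, keep first-seen order
    key = hourly_key(moment)
    body = "\n".join(unique_ids) + "\n"
    client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return key
```

This only removes repeats within a single batch. Removing repeats across hours would mean reading earlier objects back, which is where a database keyed on the ID becomes the simpler choice.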

Use AWS Athena To Query S3 Object Tagging

Is it possible to use AWS Athena to query S3 Object Tagging? For example, if I have an S3 layout such as this
bucketName/typeFoo/object1.txt
bucketName/typeFoo/object2.txt
bucketName/typeFoo/object3.txt
bucketName/typeBar/object1.txt
bucketName/typeBar/object2.txt
bucketName/typeBar/object3.txt
And each object has an S3 Object Tag such as this
#For typeFoo/object1.txt and typeBar/object1.txt
id=A
#For typeFoo/object2.txt and typeBar/object2.txt
id=B
#For typeFoo/object3.txt and typeBar/object3.txt
id=C
Then is it possible to run an AWS Athena query to get any object with the associated tag such as this
select * from myAthenaTable where tag.id = 'A'
# returns typeFoo/object1.txt and typeBar/object1.txt
This is just an example and doesn't reflect my actual S3 bucket/object-prefix layout. Feel free to use any layout you wish in your answers/comments.
Ultimately I have a plethora of objects that could be in different buckets and folder paths but they are related to each other and my goal is to tag them so that I can query for a particular id value and get all objects related to that id. The id value would be a GUID and that GUID would map to many different types of objects that are related e.g., I could have a video file, a picture file, a meta-data file, and a json file and I want to get all of those files using their common id value; please feel free to offer suggestions too because I have the ability to structure this as I see fit.
Update - Note
S3 Object Metadata and S3 Object Tagging are two different things.
Athena does not support querying based on S3 object tags.
One workaround:
Create a metadata file that maps each tag to the files that carry it, and keep it up to date with a Lambda function, i.e. whenever a new file arrives in S3, the Lambda function updates the file in S3 with the tag and name details. Athena can then query that mapping.
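A minimal sketch of the Lambda side, with one deliberate change from the description above: instead of rewriting a single shared mapping file (which concurrent invocations could overwrite), it writes one small JSON record per object under an index prefix, which an Athena table can then be pointed at. The function name, the index bucket, and the `tag-index/` prefix are assumptions; `client` is expected to behave like `boto3.client("s3")`, and `event` is an S3 event notification payload.

```python
import json
from urllib.parse import unquote_plus


def index_object_tags(client, event, index_bucket, index_prefix="tag-index/"):
    """For each object in an S3 event, write a small JSON record of its tags."""
    written = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        tag_set = client.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]
        tags = {tag["Key"]: tag["Value"] for tag in tag_set}
        entry = {"bucket": bucket, "key": key, "tags": tags}
        index_key = f"{index_prefix}{bucket}/{key}.json"
        client.put_object(
            Bucket=index_bucket,
            Key=index_key,
            Body=(json.dumps(entry) + "\n").encode("utf-8"),
        )
        written.append(index_key)
    return written
```

The index should live in a separate bucket, or under a prefix excluded from the trigger, so that writing an index record does not invoke the function again.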

Selecting specific files for athena

While creating a table in Athena, I am not able to create the table from specific files only. Is there any way to select all the files starting with "year_2019" from a given bucket? For example:
s3://bucketname/prefix/year_2019*.csv
The documentation is very clear that this is not allowed.
From:
https://docs.aws.amazon.com/athena/latest/ug/tables-location-format.html
Athena reads all files in an Amazon S3 location you specify in the
CREATE TABLE statement, and cannot ignore any files included in the
prefix. When you create tables, include in the Amazon S3 path only the
files you want Athena to read. Use AWS Lambda functions to scan files
in the source location, remove any empty files, and move unneeded
files to another location.
I would like to know whether the community has found a workaround :)
Unfortunately the filesystem abstraction that Athena uses for S3 doesn't support this. It requires table locations to look like directories, and Athena will add a slash to the end of the location when listing files.
There is a way to create tables that contain only a selection of files, but as far as I know it does not support wildcards, only explicit lists of files.
You create a table with
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
and then instead of pointing the LOCATION of the table to the actual files, you point it to a prefix with a single symlink.txt file (or point each partition to a prefix with a single symlink.txt). In the symlink.txt file you add the S3 URIs of the files to include in the table, one per line.
The only documentation that I know of for this feature is the S3 Inventory documentation for integrating with Athena.
You can also find a full example in this Stack Overflow answer: https://stackoverflow.com/a/55069330/1109
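Since the symlink approach needs an explicit list of files, the wildcard has to be expanded outside Athena. A sketch of that step, assuming the files are CSVs as in the question: list the objects, keep those whose file name starts with the wanted prefix, and write their URIs to symlink.txt. The function name and the manifest prefix are hypothetical, and `client` is expected to behave like `boto3.client("s3")`.

```python
def write_symlink_manifest(client, bucket, data_prefix, name_prefix, manifest_prefix):
    """List objects under data_prefix, keep those whose file name starts with
    name_prefix, and write their S3 URIs to <manifest_prefix>/symlink.txt."""
    uris = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=data_prefix):
        for obj in page.get("Contents", []):
            file_name = obj["Key"].rsplit("/", 1)[-1]
            if file_name.startswith(name_prefix) and file_name.endswith(".csv"):
                uris.append(f"s3://{bucket}/{obj['Key']}")
    manifest_key = manifest_prefix.rstrip("/") + "/symlink.txt"
    client.put_object(
        Bucket=bucket,
        Key=manifest_key,
        Body="\n".join(uris).encode("utf-8"),  # one S3 URI per line
    )
    return manifest_key, uris
```

For the path in the question this might be called as `write_symlink_manifest(client, "bucketname", "prefix/", "year_2019", "manifests/year_2019")`, with the table's LOCATION then set to `s3://bucketname/manifests/year_2019/`. The manifest has to be regenerated whenever matching files are added or removed.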