Does EMRFS support custom query parameters in S3 URLs? - amazon-web-services

Is it possible to add custom query parameters to an S3 URL?
We would like to add some custom metadata to S3 objects, but would like it to be transparent to EMRFS.
Something like:
s3://bucket-name/object-name?x-amz-meta-tag=magic-tag
Then, in our PySpark or Hadoop job, we would like to write:
data.write.csv('s3://bucket-name/object-name?x-amz-meta-tag=magic-tag')
Trying this with EMRFS shows that it treats "object-name?x-amz-meta-tag=magic-tag" as the entire object name instead of ignoring the query parameters.

I can't speak for the closed-source EMRFS, but for the ASF S3 connectors the answer is "no". Interesting proposal, though; maybe you should think about contributing it to the ASF. Of course, that adds a new problem: what about existing users who create files with ? in their names, and how to retain compatibility with them?
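If the goal is just for the metadata to end up on the objects, one possible workaround (a sketch only, assuming boto3 and a known output prefix; the bucket, prefix, and tag value below are placeholders) is to attach it in a second pass after the Spark write. Since S3 metadata cannot be edited in place, each object is copied onto itself with a REPLACE directive:

import boto3

s3 = boto3.client("s3")
bucket = "bucket-name"
prefix = "object-name/"  # Spark writes a directory of part files

# Walk everything the job wrote and re-copy each object onto itself,
# replacing its user metadata (surfaced as the x-amz-meta-tag header).
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket=bucket,
            Key=obj["Key"],
            CopySource={"Bucket": bucket, "Key": obj["Key"]},
            Metadata={"tag": "magic-tag"},
            MetadataDirective="REPLACE",
        )

Note that this rewrites every affected object, so it adds time and request cost proportional to what the job produced.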

Related

Update data in a CSV table stored in an AWS S3 bucket

I need a solution for adding new data to a CSV file that is stored in an S3 bucket in AWS.
At this point we are downloading the file, editing it, and then uploading it again to S3, and we would like to automate this process.
We need to add one row to a three-column table.
Thank you in advance!
I think you will be able to do that using Lambda functions. You will need to make the modifications to the CSV programmatically, but there are multiple programming languages that allow you to do that; one quick example is Python and its csv library.
Then you can invoke that Lambda directly, or add more logic around the operations you want to perform using an AWS API Gateway.
You can access the CSV file (object) inside the S3 bucket from the Lambda code using the AWS SDK and append the new rows with data you pass as parameters to the function.
There is no way to directly modify a CSV stored in S3 (if that is what you're asking). The process will always entail some version of download, modify, upload. There are many examples of how you can do this, for example here.
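A minimal sketch of that download/modify/upload flow as a Python Lambda handler, assuming the bucket, key, and new row are passed in the event payload (those field names are placeholders):

import boto3
import csv
import io

s3 = boto3.client("s3")

def lambda_handler(event, context):
    bucket = event["bucket"]
    key = event["key"]
    new_row = event["row"]  # e.g. ["value1", "value2", "value3"]

    # Download the existing CSV; S3 offers no in-place edit.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    # Parse, append the new row, and re-serialize with the csv library.
    rows = list(csv.reader(io.StringIO(body)))
    rows.append(new_row)
    out = io.StringIO()
    csv.writer(out).writerows(rows)

    # Upload the modified file, overwriting the original object.
    s3.put_object(Bucket=bucket, Key=key, Body=out.getvalue())
    return {"rows": len(rows)}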

How can I detect orphaned objects in S3 that aren't mapped to our database?

I am trying to find possible orphans in an S3 bucket. What I mean is that we might delete something from the DB and, for whatever reason, it doesn't get cleared from S3. This could be a bug in our system or something of that nature. I want to double-check against our API that each object in S3 maps to something that exists; the naming convention lets us map things together like that.
Scanning an entire bucket every X days seems unscalable. I was thinking that each object in the bucket could add itself to an SQS queue for the relevant checking to happen, every 30 days or so.
I've only found events around uploads and specific modifications at https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html. Is there anything more generalized that I'm missing? Any creative solutions to this problem?
You should activate Amazon S3 Inventory, which can provide a regular CSV file (as often as daily) that contains a list of every object in the Amazon S3 bucket.
You could then trigger some code that compares the contents of the CSV file against the database to find 'orphan' objects.
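A rough sketch of that comparison step in Python, assuming the inventory report is configured as CSV with Bucket and Key as its first two columns, and that known_keys holds every object key recorded in your database:

import csv
from urllib.parse import unquote_plus

def find_orphans(inventory_csv_path, known_keys):
    # Inventory CSV reports have no header row; the schema lives in the
    # accompanying manifest. Object keys in the report are URL-encoded.
    orphans = []
    with open(inventory_csv_path, newline="") as f:
        for row in csv.reader(f):
            key = unquote_plus(row[1])
            if key not in known_keys:
                orphans.append(key)
    return orphans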

boto3 find object by metadata or tag

Is it possible to search objects in an S3 bucket by an object's metadata or tag key/value (without the object name or ETag)?
I know about the head_object() method (ref), but it requires a Key in its parameters.
It seems that the get_object() method is not a solution either: it takes the same argument set as head_object() and offers nothing for filtering by metadata.
As far as I can see, neither the get_* nor the list_* methods provide any suitable filters, but I would expect the S3 API to offer such a capability.
No. The ListObjects() API call does not accept search criteria.
You will need to retrieve a listing of all objects, then call head_object() to obtain metadata.
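A minimal sketch of that list-then-head approach in boto3 (the bucket name and the metadata key/value being matched are placeholders):

import boto3

s3 = boto3.client("s3")
matches = []

# List every object, then fetch each object's metadata individually.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket"):
    for obj in page.get("Contents", []):
        head = s3.head_object(Bucket="my-bucket", Key=obj["Key"])
        # User metadata is returned without the x-amz-meta- prefix.
        if head["Metadata"].get("my-tag") == "desired-value":
            matches.append(obj["Key"])

Be aware this makes one head_object() call per object, which is slow and costly on large buckets; hence the alternatives below.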
Alternatively, you could use Amazon S3 Inventory, which can provide a regular CSV file containing a list of all objects and their metadata. Your program could use this as a source of information rather than calling ListObjects().
If you require something that can do real-time searching of metadata, the common practice is to store such information in a database (e.g. DynamoDB, RDS, Elasticsearch) and then reference the database to identify the desired Amazon S3 objects.

Use AWS Athena To Query S3 Object Tagging

Is it possible to use AWS Athena to query S3 Object Tagging? For example, if I have an S3 layout such as this
bucketName/typeFoo/object1.txt
bucketName/typeFoo/object2.txt
bucketName/typeFoo/object3.txt
bucketName/typeBar/object1.txt
bucketName/typeBar/object2.txt
bucketName/typeBar/object3.txt
And each object has an S3 Object Tag such as this
#For typeFoo/object1.txt and typeBar/object1.txt
id=A
#For typeFoo/object2.txt and typeBar/object2.txt
id=B
#For typeFoo/object3.txt and typeBar/object3.txt
id=C
Is it then possible to run an AWS Athena query to get any object with the associated tag, such as this:
select * from myAthenaTable where tag.id = 'A'
# returns typeFoo/object1.txt and typeBar/object1.txt
This is just an example and doesn't reflect my actual S3 bucket/object-prefix layout. Feel free to use any layout you wish in your answers/comments.
Ultimately I have a plethora of objects that could be in different buckets and folder paths, but they are related to each other. My goal is to tag them so that I can query for a particular id value and get all objects related to that id. The id value would be a GUID, and that GUID would map to many different types of related objects: I could have a video file, a picture file, a metadata file, and a JSON file, and I want to get all of those files using their common id value. Please feel free to offer suggestions too, because I have the ability to structure this as I see fit.
Update - Note
S3 Object Metadata and S3 Object Tagging are two different things.
Athena does not support querying based on S3 object tags.
One workaround is to create a metadata file that contains the tag-to-file mapping, maintained with a Lambda function: whenever a new file arrives in S3, the Lambda updates a file in S3 with the tag and name details.
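A rough sketch of that workaround: an S3-triggered Lambda that reads the new object's tags and appends them to a JSON-lines mapping file (the mapping key is a placeholder, and a real implementation would need to handle concurrent invocations racing on the file):

import boto3
import json

s3 = boto3.client("s3")
MAPPING_KEY = "tag-mapping/mapping.jsonl"

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        tags = s3.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]
        entry = {"key": key, "tags": {t["Key"]: t["Value"] for t in tags}}

        # S3 objects cannot be appended to, so read-modify-write.
        try:
            body = s3.get_object(Bucket=bucket, Key=MAPPING_KEY)["Body"].read().decode("utf-8")
        except s3.exceptions.NoSuchKey:
            body = ""
        s3.put_object(Bucket=bucket, Key=MAPPING_KEY,
                      Body=body + json.dumps(entry) + "\n")

Pointing an Athena table at the mapping file then makes the tag values queryable with ordinary SQL.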

Is there a way to query S3 object key names for the latest per prefix?

In an S3 bucket, I have thousands and thousands of files stored under names whose structure comes down to a prefix and a number:
A-0001
A-0002
A-0003
B-0001
B-0002
C-0001
C-0002
C-0003
C-0004
C-0005
New objects for a given prefix should come in with varying frequency, but might not come at all. Older objects may disappear.
Is there a way to efficiently query S3 for the highest number of every prefix, i.e. without listing the entire bucket? The result I want is:
A-0003
B-0002
C-0005
The S3 API itself does not seem to offer anything usable for that. However, perhaps another service, like Athena, could do it? So far I have only found it capable of searching within objects, but all I care about are their key names. If it can report on the contents of objects in the bucket, can't it report on the bucket itself?
I would be okay with the latest modification date per prefix, but I want to avoid having to switch to a versioned bucket with just the prefixes as names to achieve that.
I think this is what you are looking for:
Athena exposes the source file's S3 key in the "$path" pseudo-column, and you can use a regexp to extract the pattern you are querying, e.g.:
WHERE regexp_extract(sp."$path", '[^/]+$') LIKE concat('%', cast(current_date - interval '1' day as varchar), '.csv')
The S3 API itself does not seem to offer anything usable for that. However, perhaps another service, like Athena, could do it?
No, at the moment there is no direct way of doing it with AWS S3 alone. Even with Athena, the files must still be scanned to query their content, but Athena's standard SQL support makes this easier, and it is faster since the queries run in parallel.
So far I have only found it capable of searching within objects, but all I care about are their key names.
Both Athena and S3 Select query by content, not by keys.
The best approach I can recommend is to use AWS DynamoDB to keep the files' metadata, including the file names, for faster querying.
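A minimal sketch of that idea, assuming each upload also records its key in a DynamoDB table whose partition key is the prefix and whose sort key is the zero-padded number (the table and attribute names are assumptions):

import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "s3-key-index"

def record_key(prefix, number):
    # Call this whenever a new object lands in S3 (e.g. from an S3 event).
    dynamodb.put_item(
        TableName=TABLE,
        Item={"pfx": {"S": prefix}, "num": {"S": number}},
    )

def latest_for_prefix(prefix):
    # Descending sort-key order with Limit=1 returns the highest number
    # for the prefix without listing the bucket at all.
    resp = dynamodb.query(
        TableName=TABLE,
        KeyConditionExpression="pfx = :p",
        ExpressionAttributeValues={":p": {"S": prefix}},
        ScanIndexForward=False,
        Limit=1,
    )
    items = resp.get("Items", [])
    return f"{prefix}-{items[0]['num']['S']}" if items else None

Zero-padding the sort key matters: it keeps lexicographic order aligned with numeric order, which is what the descending query relies on.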