How to configure prefix events in S3 for a dynamically named subdirectory? - amazon-web-services

So I have an S3 bucket with this structure:
ready_data/{auto_generated_subfolder_name}/some_data.json
The thing is, I want to recursively listen for any data that is put anywhere under the ready_data/ directory.
I have tried to set the prefix to ready_data/ and ready_data/*, but this only seems to capture events when a file is added directly in the ready_data directory. The ML algorithm might create a nested structure like ready_data/{some_dynamically_named_subfolder}/{some_somefolder}/data.json, and I want to be able to know about the data.json object being created in any path where ready_data is the top-level subfolder.

ready_data/ is correct.
The "prefix" is a left-anchored substring. In pseudocode, the test is determining whether left(object_key, length(rule_prefix)) == rule_prefix so wildcards aren't needed and aren't interpreted. (Doesn"t throw an error, but won't match.)
Be sure you create the rule matching s3:ObjectCreated:* because there are multiple ways to create objects in S3 -- not just Put. Selecting only one of the APIs is a common mistake.
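As a minimal sketch of such a rule with boto3 (the bucket name and Lambda ARN below are placeholders, not anything from the question):
import boto3

s3 = boto3.client('s3')

# Hypothetical bucket and Lambda ARN -- substitute your own values.
s3.put_bucket_notification_configuration(
    Bucket='my-example-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:my-handler',
            # Fire for every object-creation API (Put, Post, Copy, multipart complete).
            'Events': ['s3:ObjectCreated:*'],
            # Left-anchored prefix; matches keys at any "depth" below ready_data/.
            'Filter': {'Key': {'FilterRules': [{'Name': 'prefix', 'Value': 'ready_data/'}]}},
        }]
    },
)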

Related

Boto3 AWS Codecommit Delete Folder

The issue is that you can create and update multiple files at once with something like .create_commit. However, you can't do the reverse: files can only be deleted one by one, using the function mentioned in the docs.
For the client I use boto3 and boto3.client('codecommit')
Reference - boto3 docs - delete file
Question:
How to delete folders with boto3 and aws codecommit?
Only the following 4 methods are available:
delete_branch()
delete_comment_content()
delete_file()
delete_repository()
To delete a folder, set keepEmptyFolders=False when invoking delete_file on the last file in that folder. I'm not aware of a single API function that will delete an entire folder and all of its contents.
Note: by default, empty folders will be deleted when calling delete_file.
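As a minimal sketch of that call with boto3 (the repository, branch, and file path here are hypothetical; the parent commit ID comes from get_branch):
import boto3

codecommit = boto3.client('codecommit')

# Hypothetical repository/branch/file -- adjust to your setup.
branch = codecommit.get_branch(repositoryName='my-repo', branchName='main')

codecommit.delete_file(
    repositoryName='my-repo',
    branchName='main',
    filePath='some_folder/last_file.txt',
    parentCommitId=branch['branch']['commitId'],
    keepEmptyFolders=False,  # the default: the folder disappears with its last file
    commitMessage='Delete the last file in some_folder/',
)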
AWS CodeCommit doesn't allow deleting directories (folders) in one call. This implementation works around that: instead of deleting the whole directory at once, you find all the files inside it and then delete them in a single commit.
Basic overview.
Get the file names inside the folder using .get_folder() (note that this returns a lot more information than just the paths).
Clean the .get_folder output down to just the file paths, since that is all we need.
Commit (delete)
Where REPOSITORY_NAME is the name of your repository and folderPath is the name of the folder that you want to delete.
files = codecommit_client.get_folder(repositoryName=REPOSITORY_NAME, folderPath=PATH)
Now we use that information to create a commit that deletes the files. Some manipulation is needed, since the deleteFiles parameter takes only file paths, while the information we got from .get_folder contains a lot more than that. Replace branchName if you need to (it is currently main).
codecommit_client.create_commit(
    repositoryName=REPOSITORY_NAME,
    branchName='main',
    # Parent the delete commit on the commit that get_folder reported.
    parentCommitId=files['commitId'],
    commitMessage='DELETED Files',
    # deleteFiles only wants file paths, so strip them out of the get_folder output.
    deleteFiles=[{'filePath': x['absolutePath']} for x in files['files']],
)

Google Cloud Storage JSON REST API - Insert and List objects in a sub-directory in a bucket

I'm trying to figure out how to:
a) Store / insert an object on Google Cloud Storage within a sub-directory
b) List a given sub-directory's contents
I managed to resolve how to get an object here: Google Cloud Storage JSON REST API - Get object held in a sub-directory
However, the same logic doesn't seem to apply to these other types of call.
For store, this works:
https://www.googleapis.com/upload/storage/v1/b/bucket-name/o?uploadType=media&name=foldername%2objectname
but it then stores the file name on GCS as foldername_filename, which doesn't change functionality but isn't really ideal.
For listing objects in a bucket, not sure where the syntax for a nested directory should go in here: storage.googleapis.com/storage/v1/b/bucketname/o.
Any insight much appreciated.
The first thing to know is that GCS does not have directories or folders. It just has objects whose names share a common prefix. Both gsutil and the UI have features that are used to create the illusion that folders exist.
https://cloud.google.com/storage/docs/gsutil/addlhelp/HowSubdirectoriesWork
With that out of the way: to create an object with a prefix you need to URL-encode the object name; %2F is the encoding for /:
https://cloud.google.com/storage/docs/request-endpoints#encoding
Finally to list only the objects with a common prefix you would use the prefix parameter:
https://cloud.google.com/storage/docs/json_api/v1/objects/list#parameters
Using prefix=foo/bar/baz (after encoding) would list all the objects in the foo/bar/baz "folder", note that this is recursive, it will include foo/bar/baz/quux/object-name in the results. To stop at one level you want to read about the delimiter parameter too.
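A rough sketch of both calls using Python's requests library, under the assumption of a placeholder bucket and OAuth2 access token (the endpoints are the standard JSON API ones linked above):
import urllib.parse
import requests

BUCKET = 'bucket-name'       # placeholder
TOKEN = 'ACCESS_TOKEN'       # placeholder OAuth2 access token
headers = {'Authorization': f'Bearer {TOKEN}'}

# Insert: URL-encode the whole object name so the slash survives as %2F.
object_name = urllib.parse.quote('foldername/objectname.json', safe='')
upload_url = (f'https://storage.googleapis.com/upload/storage/v1/b/{BUCKET}/o'
              f'?uploadType=media&name={object_name}')
requests.post(upload_url, headers=headers, data=b'{"hello": "world"}')

# List: pass the shared prefix, plus delimiter=/ if you only want one level.
list_url = f'https://storage.googleapis.com/storage/v1/b/{BUCKET}/o'
resp = requests.get(list_url, headers=headers,
                    params={'prefix': 'foldername/', 'delimiter': '/'})
print(resp.json())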

Can I run a data pipeline over a directory with a specific name in Data Fusion?

I'm using google-cloud-platform data fusion.
Assuming that the bucket's path is as follows:
test_buk/...
In the test_buk bucket there are four files:
20190901, 20190902
20191001, 20191002
Let's say there is a directory inside test_buk called dir.
I have a prefix-based bundle based on 201909 (e.g. 20190901, 20190902)
and also a prefix-based bundle based on 201910 (e.g. 20191001, 20191002).
I'd like to complete the data-pipeline for 201909 and 201910 bundles.
Here's what I've tried:
setting the regex path filter to gs://test_buk/dir//2019 and running the data pipeline.
If the regex path filter is set, no Input value is read, and likewise there is no Output value.
When I want to create a data pipeline over a specific directory in a bundle, how do I handle that in Data Fusion?
If you use the raw path directly (gs://test_buk/dir/) in the regex, you might be running into an issue with escaping its special characters. That might be the reason why no input file matches your filter and nothing is read into the pipeline.
I suggest instead that you use ".*" to match the initial part (given that you are also specifying the path, no additional files in other folders will match the filter).
Therefore, I would use the following expressions depending on the group of files you want to use (feel free to change the extension of the files):
path = gs://test_buk/dir/
regex path filter = .*201909.*\.csv or .*201910.*\.csv
If you would like to know more about the regex used, you can take a look at (1)
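If it helps, a quick way to sanity-check the filter is to run the same regex against a few example object paths in plain Python (the file names and the .csv extension here are just assumptions):
import re

# Hypothetical object paths -- adjust the names/extension to your actual files.
paths = [
    'gs://test_buk/dir/20190901.csv',
    'gs://test_buk/dir/20190902.csv',
    'gs://test_buk/dir/20191001.csv',
    'gs://test_buk/dir/20191002.csv',
]

september_filter = re.compile(r'.*201909.*\.csv')

for path in paths:
    # Only the 201909 files match; the 201910 files are skipped.
    print(path, bool(september_filter.fullmatch(path)))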

What is definition of Amazon S3 prefix

What exactly is the definition of an S3 prefix?
Let's say I have the following S3 structure:
photos/2006/January/sample.jpg
photos/2006/February/sample2.jpg
photos/2006/February/sample3.jpg
photos/2006/February/sample4.jpg
What will be the prefix for sample.jpg?
Will photos be the prefix, or will the whole path up to sample.jpg be the prefix (i.e. photos/2006/January/)?
I'm asking because there is a read/write request limit for each prefix.
S3 is just an object store, mapping a 'key' to an 'object'. In your case, I see four objects (likely images) with their own keys that are trying to imitate a filesystem's folder structure.
Prefix is referring to any string that would be a prefix to an object's key.
photos/2006/January/sample.jpg is just a key, so any of the following (and more) can be a prefix that would match this key:
pho
photos
photos/2
photos/2006/January/sample.jp
photos/2006/January/sample.jpg
Note that the first three prefixes listed above will be a match for the other keys you mention in your question.
You can think of a prefix as a path to a folder. Although they are not really folders, AWS uses prefixes to make it easier for us to visualize our data.
The prefix path is relative to the object. So for sample.jpg, the prefix is photos/2006/January/, but if I have a sample2.jpg directly inside photos/2006/, then its prefix is photos/2006/.
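As a small illustration, a boto3 sketch (the bucket name is a placeholder) that lists only the January keys by passing that prefix:
import boto3

s3 = boto3.client('s3')

# Hypothetical bucket -- any left-anchored substring of a key works as Prefix.
resp = s3.list_objects_v2(Bucket='my-example-bucket', Prefix='photos/2006/January/')
for obj in resp.get('Contents', []):
    print(obj['Key'])  # -> photos/2006/January/sample.jpg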

Can someone explain AWS GET?

For reference: GET Bucket (List Objects)
When I do a GET request on the root bucket it comes back with test/ and test/subdir/, both 0 bytes. Which is correct: there should be 2 folders up there. When I upload a file to test/subdir/file, the root bucket has an item with the key test/subdir/file, and test/ and test/subdir/ are still 0 bytes. When I do a GET request on test/subdir/ it returns nothing.
What's going on here?
Note: I do not have access to the console.
Greg, this might sound confusing at first, but the truth is that there's no such thing as "a folder" in Amazon S3. I'll explain.
The data structure of S3 is like a flat list of objects -- not like a tree. When you think you have a "file" called puppy.jpg inside a "folder" called pics, what you actually have is an object whose key is pics/puppy.jpg. Note that the / character is not any more special than the . character, or the p characters.
You might be thinking, Bruno is nuts, I see folders in the AWS Management Console. True, you see the folders. But they are actually emulated by the GUI.
When you create a folder through the AWS Management Console, what it will actually do is create an object whose name is the full path of the "folder", with a trailing slash, and 0 bytes. Just like the test/ object (not "folder") and the test/subdir/ object (not "folder") you mention in your question.
To actually identify and draw "folders", what the AWS Management Console (as well as many other S3 browsing tools) does is some API magic with the delimiter and prefix parameters.
Now, knowing the fact that there's no such thing as a folder, and that they are emulated through the use of those 0-byte, trailing-/ objects, it should be easy to understand why you see the test/ object as a 0-byte object... The same reasoning would explain why you see nothing when you do a GET on a "folder" -- you are actually downloading a 0-byte object!
Finally, as a conclusion, there's no easy way to obtain from S3 the size of "a folder" (they don't exist...). The only way would be for you to list all the objects with that prefix and add up their sizes. Or keep an index of your objects ("files" and "folders") in some kind of database with more advanced querying capabilities.
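To make that delimiter/prefix magic concrete, here is a minimal boto3 sketch (the bucket name is a placeholder) of the kind of listing a browsing tool performs to draw one level of "folders":
import boto3

s3 = boto3.client('s3')

# Hypothetical bucket. Delimiter='/' groups deeper keys into CommonPrefixes,
# which is exactly what tools render as sub-"folders".
resp = s3.list_objects_v2(Bucket='my-example-bucket', Prefix='test/', Delimiter='/')

for cp in resp.get('CommonPrefixes', []):
    print('folder:', cp['Prefix'])              # e.g. test/subdir/
for obj in resp.get('Contents', []):
    print('object:', obj['Key'], obj['Size'])   # e.g. test/ with Size 0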