Can someone explain AWS GET? - amazon-web-services

For reference: GET Bucket (List Objects)
When I do a GET request on the root bucket, it comes back with test/ and test/subdir/, both 0 bytes. That looks correct, since there should be 2 folders up there. When I upload a file to test/subdir/file, the root bucket has an item with the key test/subdir/file, while test/ and test/subdir/ are still 0 bytes. But when I do a GET request on test/subdir/, it returns nothing.
What's going on here?
Note: I do not have access to the console.

Greg, this might sound confusing at first, but the truth is that there's no such thing as "a folder" in Amazon S3. I'll explain.
The data structure of S3 is a flat list of objects -- not a tree. When you think you have a "file" called puppy.jpg inside a "folder" called pics, what you actually have is an object whose key is pics/puppy.jpg. Note that the / character is no more special than the . character, or the p characters.
You might be thinking, Bruno is nuts, I see folders in the AWS Management Console. True, you see the folders. But they are actually emulated by the GUI.
When you create a folder through the AWS Management Console, what it actually does is create a 0-byte object whose name is the full path of the "folder" with a trailing slash. Just like the test/ object (not "folder") and the test/subdir/ object (not "folder") you mention in your question.
To actually identify and draw "folders", what the AWS Management Console (as well as many other S3 browsing tools) does is some API magic with the delimiter and prefix parameters.
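For example, a single listing call with the delimiter set to / is enough to get the first "level" of a bucket. A minimal boto3 sketch, assuming a placeholder bucket name:
import boto3
s3 = boto3.client('s3')
# delimiter='/' rolls everything below the first slash into CommonPrefixes,
# which is exactly what the console draws as "folders".
response = s3.list_objects_v2(Bucket='my-bucket', Prefix='', Delimiter='/')
for cp in response.get('CommonPrefixes', []):
    print('folder:', cp['Prefix'])            # e.g. test/
for obj in response.get('Contents', []):
    print('object:', obj['Key'], obj['Size'])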
Now, knowing that there's no such thing as a folder, and that they are emulated through those 0-byte, trailing-/ objects, it should be easy to understand why you see the test/ object as a 0-byte object... The same reasoning explains why you see nothing when you do a GET on a "folder" -- you are actually downloading a 0-byte object!
Finally, there's no easy way to obtain from S3 the size of "a folder" (they don't exist...). The only way is to list all the objects with that prefix and add up their sizes, or to keep an index of your objects ("files" and "folders") in some kind of database with more advanced querying capabilities.
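As a rough sketch of the first option with boto3 (bucket name and prefix are placeholders):
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
# "Folder size" = sum of the sizes of all objects sharing the prefix.
total_bytes = sum(obj.size for obj in bucket.objects.filter(Prefix='test/subdir/'))
print(total_bytes)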

Related

Google Cloud Storage JSON REST API - Insert and List objects in a sub-directory in a bucket

I'm trying to figure out how to:
a) Store / insert an object on Google Cloud Storage within a sub-directory
b) List a given sub-directory's contents
I managed to resolve how to get an object here: Google Cloud Storage JSON REST API - Get object held in a sub-directory
However, the same logic doesn't seem to apply to these other types of call.
For store, this works:
https://www.googleapis.com/upload/storage/v1/b/bucket-name/o?uploadType=media&name=foldername%2objectname
but it then stores the file name on GCS as foldername_filename, which doesn't change functionality but isn't really ideal.
For listing objects in a bucket, not sure where the syntax for a nested directory should go in here: storage.googleapis.com/storage/v1/b/bucketname/o.
Any insight much appreciated.
The first thing to know is that GCS does not have directories or folders. It just has objects that share a fixed prefix. There are features in both gsutil and the UI that create the illusion that folders do exist.
https://cloud.google.com/storage/docs/gsutil/addlhelp/HowSubdirectoriesWork
With that out of the way: to create an object with a prefix in its name you need to URL-encode the object name; %2F is the encoding for /:
https://cloud.google.com/storage/docs/request-endpoints#encoding
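A small sketch of that encoding step in Python, using the foldername/objectname example from the question:
from urllib.parse import quote
# Encode the whole object name, slash included, before putting it in the URL.
encoded = quote('foldername/objectname', safe='')   # -> 'foldername%2Fobjectname'
upload_url = ('https://www.googleapis.com/upload/storage/v1/b/bucket-name/o'
              '?uploadType=media&name=' + encoded)
print(upload_url)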
Finally to list only the objects with a common prefix you would use the prefix parameter:
https://cloud.google.com/storage/docs/json_api/v1/objects/list#parameters
Using prefix=foo/bar/baz (after encoding) would list all the objects in the foo/bar/baz "folder". Note that this is recursive: it will include foo/bar/baz/quux/object-name in the results. To stop at one level, read about the delimiter parameter too.
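A hedged sketch of such a listing call with Python's requests library, assuming the bucket is publicly readable (otherwise add an Authorization header); requests takes care of URL-encoding the parameters:
import requests
# Trailing slash on the prefix so only the "folder's" children match.
resp = requests.get(
    'https://storage.googleapis.com/storage/v1/b/bucketname/o',
    params={'prefix': 'foo/bar/baz/', 'delimiter': '/'},
)
resp.raise_for_status()
listing = resp.json()
for item in listing.get('items', []):       # objects directly under the prefix
    print('object:', item['name'])
for sub in listing.get('prefixes', []):     # one-level-deeper "folders"
    print('sub-"folder":', sub)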

rclone - How do I list which directory has the latest files in AWS S3 bucket?

I am currently using rclone to access AWS S3 data, and since I don't use either one much I am not an expert.
I am accessing the public bucket unidata-nexrad-level2-chunks and there are 1000 folders I am looking at. To see these, I am using the Windows command prompt and entering:
rclone lsf chunks:unidata-nexrad-level2-chunks/KEWX
Only one folder has realtime data being written to it at any time and that is the one I need to find. How do I determine which one is the one I need? I could run a check to see which folder has the newest data. But how can I do that?
The output from my command looks like this:
1/
10/
11/
12/
13/
14/
15/
16/
17/
18/
19/
2/
20/
21/
22/
23/
... ... ... (to 1000)
What can I do to find where the latest data is being written to? Since it is only one folder at a time, I hope it would be simple.
Edit: I realized I need a way to list the latest file (along with its folder #) without listing every single file and timestamp possible in all 999 directories. I am starting a bounty, and the correct answer that allows me to do this without slogging through all of them will be awarded the bounty. If it takes 20 minutes to list all contents from all 999 folders, it's useless, as the next folder will be active by that time.
If you want to know the specific folder with the very latest file, you can write your own script that retrieves a list of ALL objects, then figures out which one is the latest and which folder it is in. Here's a Python script that does it:
import boto3
# List every object under the KEWX/ prefix of the public bucket.
s3_resource = boto3.resource('s3')
objects = s3_resource.Bucket('unidata-nexrad-level2-chunks').objects.filter(Prefix='KEWX/')
# Build (last_modified, key) pairs so they can be sorted by timestamp.
date_key_list = [(obj.last_modified, obj.key) for obj in objects]
print(len(date_key_list)) # How many objects?
# Sort newest first and print the key (and therefore the folder) of the latest object.
date_key_list.sort(reverse=True)
print(date_key_list[0][1])
Output:
43727
KEWX/125/20200912-071306-065-I
It takes a while to go through those 43,700 objects!

How to configure prefix events in S3 for a dynamically named subdirectory?

So I have an S3 bucket with this structure:
ready_data/{auto_generated_subfolder_name}/some_data.json
The thing is, I want to recursively listen for any data that is put under the ready_data/ directory.
I have tried to set the prefix to ready_data/ and ready_data/*, but this only seems to capture events when a file is added directly in the ready_data directory. The ML algorithm might create a nested structure like ready_data/{some_dynamically_named_subfolder}/{some_somefolder}/data.json and I want to be able to know about the data.json object being created in a path where ready_data is the top-level subfolder.
ready_data/ is correct.
The "prefix" is a left-anchored substring. In pseudocode, the test is determining whether left(object_key, length(rule_prefix)) == rule_prefix so wildcards aren't needed and aren't interpreted. (Doesn"t throw an error, but won't match.)
Be sure you create the rule matching s3:ObjectCreated:* because there are multiple ways to create objects in S3 -- not just Put. Selecting only one of the APIs is a common mistake.
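For illustration, a hedged boto3 sketch of such a rule; the bucket name and Lambda ARN are placeholders, and the same filter can be configured in the console or CloudFormation:
import boto3
s3 = boto3.client('s3')
s3.put_bucket_notification_configuration(
    Bucket='my-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:my-handler',
            'Events': ['s3:ObjectCreated:*'],   # all object-creation APIs, not just Put
            'Filter': {'Key': {'FilterRules': [
                # Left-anchored prefix match; no wildcard needed.
                {'Name': 'prefix', 'Value': 'ready_data/'},
            ]}},
        }]
    },
)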

Will S3 create objects itself when we save a file?

I created a dataframe and selected some columns, say col1, col2 and col3, using df.select().
df1=df.select(col1,col2,col3)
I am writing this into a parquet file and saving it to s3.
df1.write.partitionBy("col1").format("parquet").save('s3a://myBucket/fol1/subfolder')
Currently there is no location like 's3a://myBucket/fol1/subfolder' in my S3; the only thing I have is 's3a://myBucket'. My question is: as there are no objects named fol1 and subfolder, will it create the objects itself and save the file, or will the code fail?
I think you're asking whether save('s3a://myBucket/fol1/subfolder') will create the fol1/subfolder structure in S3, and, if it doesn't, whether you need to create it yourself.
The bottom line is that you don't need to worry about creating the intermediate folder structure because Hadoop FS API creates it for you, as needed.
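As a minimal illustration of the same point outside Spark, with boto3 (bucket name and key are placeholders): writing to a deep key needs no preparation, because the key is just a name in a flat namespace.
import boto3
s3 = boto3.client('s3')
# Neither fol1/ nor fol1/subfolder/ needs to exist beforehand.
s3.put_object(Bucket='my-bucket', Key='fol1/subfolder/example.parquet', Body=b'dummy bytes')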
@SteveLoughran's answer provides much more detail and deserves to be the accepted answer.
Although S3 is an object store, Spark, Hive &c all pretend it's a filesystem & use the Hadoop filesystem API.
Some early actions of a Spark save() are:
call FileSystem.exists(dest) & fail if there's something there (unless you have enabled appending to existing data)
call FileSystem.mkdirs(dest).
set up some _temporary dir underneath for the job, renaming things into place when the job is committed.
Action #2 triggers a scan to check whether any entry in the path /a/b/c/dest is a file (which would be a failure), then creates an empty directory marker object /a/b/c/dest/. That marker will be deleted as soon as a child directory (i.e. _temporary) is created.
At the end of the job, then, there won't be any parent marker entries; they go in there just to keep quiet all those bits of code which expect that, after a mkdirs() call, the created directory exists.
Finally, be advised: the whole commit-by-rename mechanism is broken when it comes to S3, as it is (a) slow and (b) at risk of losing data due to directory listing inconsistency. You need a consistent listing layer (EMR: EMRFS consistent view, Apache Hadoop: S3Guard, Databricks: something also DynamoDB based), and, for maximum performance atop Apache Hadoop 3.1, switch to a specific zero-rename S3A committer.
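If you do switch to an S3A committer, it is selected through Hadoop configuration; a hedged PySpark sketch, assuming Hadoop 3.1+ with the cloud committers on the classpath (treat it as a starting point, not a complete recipe):
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("s3a-committer-example")
    # One of the zero-rename committers: "directory", "partitioned" or "magic".
    .config("spark.hadoop.fs.s3a.committer.name", "directory")
    .getOrCreate()
)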

Triggering cloud functions on folders - is it possible?

Is it possible to configure the cloud function trigger-bucket parameter to be a folder in a GCS bucket?
For example, imagine we have the following:
gs://a_bucket/a_folder
Instead of setting --trigger-bucket gs://a_bucket when deploying, I need to set it at the folder level i.e. --trigger-bucket gs://a_bucket/a_folder/.
However, I get the error:
ERROR: (gcloud.beta.functions.deploy) argument --trigger-bucket:
Invalid value 'gs://a_bucket/a_folder/': Bucket must only contain
lower case Latin letters, digits and characters . _ -. It must start
and end with a letter or digit and be from 3 to 232 characters long.
You may optionally prepend the bucket name with gs:// and append / at
the end.
https://cloud.google.com/sdk/gcloud/reference/beta/functions/deploy
It is not possible to set the trigger at the folder level.
However, one workaround I found is that you can perform the filtering programmatically in the Cloud Function, by hooking into the name attribute of the change notification, as it contains the folder(s) and the filename.
E.g. for an a_sample.txt file located in gs://a_bucket/a_folder/, the name attribute will contain a_folder/a_sample.txt as a string value.
So, you can then filter on the folder you are interested in. The trade-offs:
It's not pretty!
Your cloud function will be triggered for all bucket events - even the ones you are not interested in.
If you can live with that, then it's the way to go (until Google supports triggering at the folder level).
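A minimal sketch of that filter in a Python background Cloud Function; the function name and folder are just examples:
def handle_gcs_event(event, context):
    """Runs for every object-change notification in the bucket."""
    name = event.get('name', '')
    # Skip objects outside the "folder" we care about.
    if not name.startswith('a_folder/'):
        return
    print('Processing', name)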