The issue is that you can create and update multiple files at once with something like .create_commit. However, you can't do the reverse: you can only delete files one by one, using the function mentioned in the docs.
For the client I use boto3 and boto3.client('codecommit')
Reference - boto3 docs - delete file
Question:
How to delete folders with boto3 and aws codecommit?
Only the following 4 methods are available:
delete_branch()
delete_comment_content()
delete_file()
delete_repository()
To delete a folder, set keepEmptyFolders=False when invoking delete_file on the last file in that folder. I'm not aware of a single API function that will delete an entire folder and all of its contents.
Note: by default, empty folders will be deleted when calling delete_file.
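For example, a minimal sketch of deleting the last file in a folder (the repository, branch, and file path below are hypothetical placeholders):

import boto3

codecommit = boto3.client('codecommit')

# Hypothetical repository/branch/file names, for illustration only.
branch = codecommit.get_branch(repositoryName='my-repo', branchName='main')

codecommit.delete_file(
    repositoryName='my-repo',
    branchName='main',
    filePath='some-folder/last-file.txt',
    parentCommitId=branch['branch']['commitId'],
    keepEmptyFolders=False,  # the default: drop the folder once its last file is gone
    commitMessage='Remove last file so the folder disappears',
)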
AWS CodeCommit doesn't allow deleting directories (folders) directly. This implementation works around that: instead of deleting the whole directory at once, you find all the files inside it and then delete them.
Basic overview:
Get the file names inside the folder using .get_folder() (note that this returns a lot more information than just the file paths).
Clean the .get_folder output so that only the file paths remain, since that is all we need.
Create a commit that deletes those files.
Where REPOSITORY_NAME is the name of your repository and PATH is the path of the folder that you want to delete.
files = codecommit_client.get_folder(repositoryName=REPOSITORY_NAME, folderPath=PATH)
Now we use that information to create a commit that deletes the files. Some manipulation is needed because the deleteFiles parameter takes only file paths, while the information we got from .get_folder contains a lot more than that. Replace branchName if you need to (currently main).
codecommit_client.create_commit(
repositoryName=REPOSITORY_NAME,
branchName='main',
parentCommitId=files['commitId'],
commitMessage=f"DELETED Files",
deleteFiles=[{'filePath':x['absolutePath']} for x in files['files']],
)
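Note that the commit above only deletes files sitting directly in the folder. As a rough, untested sketch, you could also recurse into the subFolders entries that get_folder returns, collecting every nested file path before making a single commit:

def collect_file_paths(client, repo, folder):
    # Recursively gather every file path under `folder` using get_folder.
    result = client.get_folder(repositoryName=repo, folderPath=folder)
    paths = [f['absolutePath'] for f in result['files']]
    for sub in result.get('subFolders', []):
        paths += collect_file_paths(client, repo, sub['absolutePath'])
    return paths

all_paths = collect_file_paths(codecommit_client, REPOSITORY_NAME, PATH)
codecommit_client.create_commit(
    repositoryName=REPOSITORY_NAME,
    branchName='main',
    parentCommitId=codecommit_client.get_branch(
        repositoryName=REPOSITORY_NAME, branchName='main')['branch']['commitId'],
    commitMessage='DELETED files (recursive)',
    deleteFiles=[{'filePath': p} for p in all_paths],
)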
I'm trying to figure out how to:
a) Store / insert an object on Google Cloud Storage within a sub-directory
b) List a given sub-directory's contents
I managed to resolve how to get an object here: Google Cloud Storage JSON REST API - Get object held in a sub-directory
However, the same logic doesn't seem to apply to these other types of call.
For store, this works:
https://www.googleapis.com/upload/storage/v1/b/bucket-name/o?uploadType=media&name=foldername%2objectname
but it then stores the file name on GCS as foldername_filename, which doesn't change functionality but isn't really ideal.
For listing objects in a bucket, I'm not sure where the syntax for a nested directory should go in: storage.googleapis.com/storage/v1/b/bucketname/o.
Any insight much appreciated.
The first thing to know is that GCS does not have directories or folders. It just has objects that share a fixed prefix. There are features in both gsutil and the UI that create the illusion that folders exist.
https://cloud.google.com/storage/docs/gsutil/addlhelp/HowSubdirectoriesWork
With that out of the way: to create an object with a prefix you need to URL-encode the object name; as I recall, %2F is the encoding for /:
https://cloud.google.com/storage/docs/request-endpoints#encoding
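For instance, a small sketch of building the upload URL with the object name percent-encoded (the bucket and object names are placeholders):

from urllib.parse import quote

object_name = "foldername/objectname"   # the "folder" is just part of the object name
encoded = quote(object_name, safe="")   # -> 'foldername%2Fobjectname'

upload_url = (
    "https://www.googleapis.com/upload/storage/v1/b/bucket-name/o"
    "?uploadType=media&name=" + encoded
)
print(upload_url)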
Finally to list only the objects with a common prefix you would use the prefix parameter:
https://cloud.google.com/storage/docs/json_api/v1/objects/list#parameters
Using prefix=foo/bar/baz (after encoding) would list all the objects in the foo/bar/baz "folder". Note that this is recursive: it will include foo/bar/baz/quux/object-name in the results. To stop at one level you want to read about the delimiter parameter too.
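As a rough illustration (the bucket name and access token are placeholders, and a client library would work just as well), a list call against the JSON API with prefix and delimiter might look like this; requests URL-encodes the query parameters for you:

import requests

resp = requests.get(
    "https://storage.googleapis.com/storage/v1/b/bucket-name/o",
    params={"prefix": "foo/bar/baz/", "delimiter": "/"},
    headers={"Authorization": "Bearer YOUR_ACCESS_TOKEN"},
)
data = resp.json()

for item in data.get("items", []):      # objects directly under the prefix
    print(item["name"])
for p in data.get("prefixes", []):      # one-level "subfolders"
    print(p)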
I am currently using rclone to access AWS S3 data, and since I don't use either one much, I am not an expert.
I am accessing the public bucket unidata-nexrad-level2-chunks and there are 1000 folders I am looking at. To see these, I am using the Windows command prompt and entering:
rclone lsf chunks:unidata-nexrad-level2-chunks/KEWX
Only one folder has realtime data being written to it at any time and that is the one I need to find. How do I determine which one is the one I need? I could run a check to see which folder has the newest data. But how can I do that?
The output from my command looks like this:
1/
10/
11/
12/
13/
14/
15/
16/
17/
18/
19/
2/
20/
21/
22/
23/
... ... ... (to 1000)
What can I do to find where the latest data is being written to? Since it is only one folder at a time, I hope it would be simple.
Edit: I realized I need a way to list the latest file (along with its folder #) without listing every single file and timestamp in all 999 directories. I am starting a bounty, and the correct answer that allows me to do this without slogging through all of them will be awarded the bounty. If it takes 20 minutes to list all contents from all 999 folders, it's useless, as the next folder will be active by that time.
If you want to know the specific folder with the very latest file, you should write your own script that retrieves a list of ALL objects, then figures out which one is the latest and which folder it is in. Here's a Python script that does it:
import boto3

s3_resource = boto3.resource('s3')

# Every object under the KEWX/ prefix
objects = s3_resource.Bucket('unidata-nexrad-level2-chunks').objects.filter(Prefix='KEWX/')

# Pair each object's last-modified timestamp with its key, then sort newest first
date_key_list = [(obj.last_modified, obj.key) for obj in objects]
print(len(date_key_list))  # How many objects?
date_key_list.sort(reverse=True)
print(date_key_list[0][1])  # Key of the most recently modified object
Output:
43727
KEWX/125/20200912-071306-065-I
It takes a while to go through those 43,700 objects!
So I have an S3 bucket with this structure:
ready_data/{auto_generated_subfolder_name}/some_data.json
The thing is, I want to recursively listen for any data that is put under the ready_data/ directory.
I have tried to set the prefix to ready_data/ and ready_data/*, but this only seems to capture events when a file is added directly in the ready_data directory. The ML algorithm might create a nested structure like ready_data/{some_dynamically_named_subfolder}/{some_somefolder}/data.json and I want to be able to know about the data.json object being created in a path where ready_data is the top-level subfolder.
ready_data/ is correct.
The "prefix" is a left-anchored substring. In pseudocode, the test is determining whether left(object_key, length(rule_prefix)) == rule_prefix so wildcards aren't needed and aren't interpreted. (Doesn"t throw an error, but won't match.)
Be sure you create the rule matching s3:ObjectCreated:* because there are multiple ways to create objects in S3 -- not just Put. Selecting only one of the APIs is a common mistake.
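As a rough sketch (the bucket name and Lambda ARN are placeholders, and this assumes you are configuring the notification with boto3 rather than the console), the rule might look like:

import boto3

s3 = boto3.client('s3')

s3.put_bucket_notification_configuration(
    Bucket='my-bucket',  # placeholder bucket name
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                # Placeholder Lambda ARN
                'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:handle-ready-data',
                'Events': ['s3:ObjectCreated:*'],  # every create API, not just Put
                'Filter': {
                    'Key': {
                        'FilterRules': [
                            {'Name': 'prefix', 'Value': 'ready_data/'},  # no wildcard needed
                        ]
                    }
                },
            }
        ]
    },
)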
Let's say we have a bucket named "bucket1" containing a folder named 'new folder'.
Inside 'new folder' are these files:
new folder/a1.pdf --> 2mb
new folder/a2.pdf --> 2mb
new folder/new folder2/b.pdf --> 3mb
When we use amazons3client.listObjects("bucket1","new folder") it returns the list of files and folders under that prefix, and each 's3object' has a 'size' parameter. I can loop through all those s3 objects and add up the sizes to get the folder size, but that is a heavy operation.
/* Will you please suggest another way to get the folder size? */
There is no other way.
A "folder" is not a container in S3, so it has no size of its own.
I have a Talend job which is as simple as below:
ts3Connection -> ts3Get -> tfileinputDelimeted -> tmap -> tamazonmysqloutput.
Now the scenario here is that sometimes I get the file in .txt format and sometimes I get it in a zip file.
So I want to use tFileUnarchive to unzip the file if it's a zip, or bypass the tFileUnarchive component if the file is already unzipped, i.e. only in .txt format.
Any help on this is greatly appreciated.
The trick here is to break the file retrieval and potential unzipping into one sub job and then the processing of the files into another sub job afterwards.
Here's a simple example job:
As normal, you connect to S3 and then list all the relevant objects in the bucket using the tS3List component, passing them to tS3Get. Alternatively, you might have another way of passing the relevant object key that you want to download to tS3Get.
In the above job I set tS3Get up to fetch every object that is iterated on by the tS3List component by setting the key as:
((String)globalMap.get("tS3List_1_CURRENT_KEY"))
and then downloading it to:
"C:/Talend/5.6.1/studio/workspace/S3_downloads/" + ((String)globalMap.get("tS3List_1_CURRENT_KEY"))
The extra bit I've added starts with a Run If conditional link from the tS3Get component, which links to the tFileUnarchive component with the condition:
((String)globalMap.get("tS3List_1_CURRENT_KEY")).endsWith(".zip")
This checks whether the file being downloaded from S3 is a .zip file.
The tFileUnarchive component then just needs to be told what to unzip, which will be the file we've just downloaded:
"C:/Talend/5.6.1/studio/workspace/S3_downloads/" + ((String)globalMap.get("tS3List_1_CURRENT_KEY"))
and where to extract it to:
"C:/Talend/5.6.1/studio/workspace/S3_downloads"
This then puts any extracted files in the same place as the ones that didn't need extracting.
From here we can now iterate through the downloads folder with the tFileList component, looking for the file types we want, by setting the directory to "C:/Talend/5.6.1/studio/workspace/S3_downloads" and the glob expression to "*.csv" (in my case, because I wanted to read in only the CSV files, including the zipped ones, that I had in S3).
Finally, we then read the delimited files by setting the file to be read by the tFileInputDelimited component as:
((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))
In my case I simply printed this to the console, but obviously you would then want to perform some transformation before uploading to your AWS RDS instance.