Extract Links Within Specific Folder in AWS S3 Buckets

I am trying to get my AWS S3 API to list objects that I have stored in my S3 buckets. I have successfully used the code below to pull some of the links from my S3 buckets.
aws s3api list-objects --bucket my-bucket --query Contents[].[Key] --output text
The problem is that the output in my command prompt does not list the entire S3 bucket inventory. Is it possible to alter this command so that the CLI output lists the full inventory? If not, is there a way to alter it to target specific file names within the bucket? For example, all the folder names in my bucket are dates, so I would try to pull all the links from the folder titled 3_15_20 Videos within the "my-bucket" bucket. Thanks in advance!

From list-objects — AWS CLI Command Reference:
list-objects is a paginated operation. Multiple API calls may be issued in order to retrieve the entire data set of results. You can disable pagination by providing the --no-paginate argument.
Therefore, try using --no-paginate and see whether it returns all objects.
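For example, a minimal sketch (the prefix 3_15_20 Videos/ is taken from the question and may need adjusting to your real key names):

# Same listing with the CLI's automatic pagination disabled, as suggested above
aws s3api list-objects --bucket my-bucket --no-paginate --query "Contents[].[Key]" --output text

# To target one dated "folder", restrict the listing to its key prefix instead
aws s3api list-objects --bucket my-bucket --prefix "3_15_20 Videos/" --query "Contents[].[Key]" --output text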
If you are regularly listing a bucket that contains a huge number of objects, you could also consider using Amazon S3 Inventory, which can provide a daily CSV file listing the contents of a bucket.

Related

How to copy subset of files from one S3 bucket folder to another by date

I have a bucket in AWS S3. There are two folders in the bucket - folder1 & folder2. I want to copy the files from s3://myBucket/folder1 to s3://myBucket/folder2. But there is a twist: I ONLY want to copy the items in folder1 that were created after a certain date. I want to do something like this:
aws s3 cp s3://myBucket/folder1 s3://myBucket/folder2 --recursive --copy-source-if-modified-since 2020-07-31
There is no aws-cli command that will do this for you in a single line. If the number of files is relatively small, say a hundred thousand or fewer, it would probably be easiest to write a bash script (or use your favourite language's AWS SDK) that lists the first folder, filters on creation date, and issues the copy commands.
If the number of files is large you can create an S3 Inventory that will give you a listing of all the files in the bucket, which you can download and generate the copy commands from. This will be cheaper and quicker than listing when there are lots and lots of files.
Something like this could be a start, using @jarmod's suggestion about --copy-source-if-modified-since:
# List every key under folder1/, rewrite the prefix to folder2/, and ask S3 to copy
# each object only if it was modified after the cutoff date.
# (Keys containing whitespace would need a more careful loop than this word-split.)
for key in $(aws s3api list-objects --bucket my-bucket --prefix folder1/ --query 'Contents[].Key' --output text); do
  relative_key=${key/folder1/folder2}
  aws s3api copy-object --bucket my-bucket --key "$relative_key" --copy-source "my-bucket/$key" --copy-source-if-modified-since THE_CUTOFF_DATE
done
It will copy each object individually, and it will be fairly slow if there are lots of objects, but it's at least somewhere to start.

AWS CLI filename search: only look inside a specific folder?

Is there a way to not look through the entire bucket when searching for a filename?
We have millions of files so each search like this one takes minutes:
aws s3api list-objects --bucket XXX --query "Contents[?contains(Key, 'tokens.json')]"
I can also make the key contain the folder name, but that doesn't speed things up at all:
aws s3api list-objects --bucket XXX --query "Contents[?contains(Key, 'folder/tokens.json')]"
There is a --prefix option. You have to use this option rather than the query syntax, because --query filtering is applied client-side after the objects have already been listed. See the details in the documentation.
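For example, a minimal sketch (folder/ stands in for your real prefix):

# The prefix narrows the listing server-side; --query then filters that smaller result client-side
aws s3api list-objects --bucket XXX --prefix "folder/" --query "Contents[?contains(Key, 'tokens.json')]"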
If you are regularly searching for objects within an Amazon S3 with a large number of objects, you could consider using Amazon S3 Inventory, which can provide a regular CSV listing of the objects in the bucket.

Delete Folders, Subfolders and All Files from a S3 bucket older than X days

I have an S3 bucket with the following structure -
Bucket
|__ 2019-08-23/
|     |__ SubFolder1
|     |__ Files
|
|__ 2019-08-22/
      |__ SubFolder2
I want to delete all folders, subfolders and files that are older than X days.
How can that be done? I am not sure if S3 Lifecycle can be used for this?
Also when I do -
aws s3 ls s3://bucket/
I get this -
PRE 2019-08-23/
PRE 2019-08-22/
Why do I see PRE in front of the folder name?
As per the valuable comments I tried this -
$ Number=1;current_date=$(date +%Y-%m-%d);
past_date=$(date -d "$current_date - $Number days" +%Y-%m-%d);
aws s3api list-objects --bucket bucketname --query 'Contents[?LastModified<=$past_date ].{Key:Key,LastModified: LastModified}' --output text | xargs -I {} aws s3 rm bucketname/{}
I am trying to remove all files which are 1 day old. But I get this error -
Bad jmespath expression: Unknown token $:
Contents[?LastModified<=$past_date ].{Key:Key,LastModified: LastModified}
How can I pass a variable into the LastModified filter?
You can use a lifecycle rule, a Lambda function (if you need more complex logic), or the command line.
Here is an example using the command line:
aws s3api list-objects --bucket your-bucket --prefix "2019-01-01" --query 'Contents[?LastModified<=`2019-01-01`].{Key: Key}' --output text | xargs -I {} aws s3 rm s3://your-bucket/{}
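To pass a shell variable into the LastModified filter (the error in the question), put the --query expression in double quotes so the shell expands the variable before the CLI parses it. A minimal sketch, assuming GNU date:

# Compute the cutoff in the shell, then let it expand inside the double-quoted JMESPath expression
past_date=$(date -d "1 day ago" +%Y-%m-%d)
aws s3api list-objects --bucket bucketname --query "Contents[?LastModified<='${past_date}'].{Key: Key}" --output text | xargs -I {} aws s3 rm s3://bucketname/{}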
@Elzo's answer already covers the lifecycle policy and how to delete the objects, so here is an answer to the second part of your question:
PRE stands for PREFIX as stated in the aws s3 cli's manual.
If you run aws s3 ls help you will come across the following section:
The following ls command lists objects and common prefixes under a specified bucket and prefix. In this example, the user owns the bucket mybucket with the objects test.txt and somePrefix/test.txt. The LastWriteTime and Length are arbitrary. Note that since the ls command has no interaction with the local filesystem, the s3:// URI scheme is not required to resolve ambiguity and may be omitted:
aws s3 ls s3://mybucket
Output:
PRE somePrefix/
2013-07-25 17:06:27 88 test.txt
This is just to differentiate keys that have a prefix (split by forward slashes) from keys that don't.
Therefore, if your key is prefix/key01 you will always see a PRE in front of it. However, if your key is key01, then PRE is not shown.
Keep in mind that S3 does not work with directories, even though the console UI suggests otherwise. S3's storage is just one flat, single-level container of objects.
From the docs:
In Amazon S3, buckets and objects are the primary resources, where objects are stored in buckets. Amazon S3 has a flat structure with no hierarchy like you would see in a file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using a shared name prefix for objects (that is, objects that have names that begin with a common string). Object names are also referred to as key names.
For example, you can create a folder in the console called photos, and store an object named myphoto.jpg in it. The object is then stored with the key name photos/myphoto.jpg, where photos/ is the prefix.
S3 Lifecycle rules can be applied at the bucket level. For folder and subfolder management you can write a simple AWS Lambda function that deletes the folders and subfolders which are xx days old, using the AWS SDK for JavaScript, Java, Python, etc. to develop the Lambda.
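If a bucket-wide rule is acceptable, a minimal sketch of applying an expiration lifecycle rule from the CLI (the bucket name and day count are placeholders; expiration works per object, and the date-named prefixes disappear on their own once they are empty):

# Delete every object 7 days after creation; the empty Prefix applies the rule to the whole bucket
aws s3api put-bucket-lifecycle-configuration --bucket bucketname --lifecycle-configuration '{
  "Rules": [
    {
      "ID": "expire-objects-older-than-7-days",
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    }
  ]
}'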

Using AWS CLI to query file names inside folders?

Our bucket structure goes from MyBucket -> CustomerGUID(folder) -> [actual files]
I'm having a hell of a time trying to use the AWS CLI (on Windows) --query option to locate a file across all of the customer folders. Can someone look at my --query and see what I'm doing wrong here? Or tell me the proper way to search for a specific file name?
This is an example of how I'm able to list ALL the files in the bucket that were LastModified after a date.
I need to limit the output based on filename, and that is where I'm getting stuck. When I look at the individual files in S3, I can see other files have a "Key" - is the Key the 'name' of the file?
aws s3 ls s3://mybucket --recursive --output text --query "Contents[?contains(LastModified) > '2018-12-8']"
The aws s3 ls command only returns a text list of objects.
If you wish to use --query, then use: aws s3api list-objects
See: list-objects — AWS CLI Command Reference
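For example, a minimal sketch (the file name and date are placeholders for whatever you are actually searching for):

# Filter on both the key name and the LastModified date; contains() takes the field and the search string
aws s3api list-objects --bucket mybucket --query "Contents[?contains(Key, 'myfile.txt') && LastModified>='2018-12-08'].Key" --output text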

AWS S3 uploading hidden files by default

I have created a bucket in AWS and a couple of IAM users. The IAM users are by default included in a group with read-only access. However, when I created my bucket I generated a policy to grant specific IAM users access to list, put and get. Now I'm trying to run a simple command to put a file with one of these IAM users from my site:
aws s3api put-object --bucket <bucket name> --key poc/test.txt --body <windows path file>
The output is successful, which means the files are always uploaded. However, when I look at the bucket in AWS I have to click on Show because all of the uploads are being marked as hidden in the bucket contents.
The account that I'm using to verify the uploaded files has manager access in S3, and I'm going through the web console. How should I upload the files without the hidden mark?
thanks