Amazon S3 contains a file that cannot be found in console

I have an Amazon S3 bucket in AWS, and I have tried to list all the files in the bucket by:
aws s3 ls s3://bucket-1 --recursive | awk '{$1=$2=$3=""; print $0}' | sed 's/^[ \t]*//' | sort > bucket_1_files
This works for most files, but some files that are listed in bucket_1_files cannot be found when I search for their names in the Amazon S3 console (no matches are returned). Would anyone know of possible reasons this could be the case? The file is a .png file, and there are other .png files listed that I can find in the console.

I think there is something wrong with the command I'm using. When I just run
aws s3 ls s3://bucket-1 --recursive
I find that a lot of these "missing" files actually have a "t" in front of their names.

Rather than playing with awk, sed and sort, you can list objects with:
aws s3api list-objects --bucket bucket-1 --query 'Contents[].[Key]' --output text
However, the underlying API returns at most 1000 objects per call, so you may need to paginate for larger buckets.
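As an aside, one possible explanation for the "t" observation: some sed implementations treat \t inside a bracket expression as a literal backslash and "t" rather than a tab, so the 's/^[ \t]*//' step can strip a leading "t" from key names, which would make those names unfindable in the console. If you want every key regardless of bucket size, here is a minimal boto3 sketch (assuming credentials are already configured and reusing the bucket-1 name from the question) that paginates past the 1000-objects-per-call limit and writes the keys to bucket_1_files:
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Each page holds at most 1000 objects; the paginator follows the
# continuation tokens automatically.
with open("bucket_1_files", "w") as out:
    for page in paginator.paginate(Bucket="bucket-1"):
        for obj in page.get("Contents", []):
            out.write(obj["Key"] + "\n")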

Related

AWS S3 File merge using CLI

I am trying to combine/merge the contents of all the files in an S3 bucket folder into a new file. The merge should be done in ascending order of the files' Last Modified timestamps.
I am able to do that manually with hard-coded file names, as follows:
(aws s3 cp s3://bucket1/file1 - && aws s3 cp s3://bucket1/file2 - && aws s3 cp s3://bucket1/file3 - ) | aws s3 cp - s3://bucket1/new-file
But now I want to change the CLI command so that the merge works for however many files exist in the folder, sorted by Last Modified. Ideally, the cp command should receive the list of all files in the S3 bucket folder, sorted by Last Modified, and merge them into a new file.
I appreciate everyone's help on this.
Here are some hints.
First, list the files sorted by Last Modified (reverse() returns them newest first; drop it if you want ascending order):
aws s3api list-objects --bucket bucket1 --query "reverse(sort_by(Contents,&LastModified))"
Then you should be fine to attach the remaining commands, as you did:
aws s3api list-objects --bucket bucket1 --query "reverse(sort_by(Contents,&LastModified))" | jq -r '.[].Key' | while read -r file
do
  echo "$file"
  aws s3 cp "s3://bucket1/$file" - >> new-file   # append each object's contents
done
aws s3 cp new-file s3://bucket1/new-file
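If you would rather skip the shell loop, here is a minimal boto3 sketch of the same idea (reusing the bucket1 and new-file names from above; it assumes the combined contents fit in memory): it lists the objects under a prefix, sorts them by LastModified ascending, and concatenates the bodies into one new object.
import boto3

s3 = boto3.client("s3")
bucket = "bucket1"
prefix = ""  # set this to the folder to merge, e.g. "some-folder/"

# Gather every object under the prefix (handles more than 1000 keys),
# then sort oldest-first by LastModified.
paginator = s3.get_paginator("list_objects_v2")
objects = []
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    objects.extend(page.get("Contents", []))
objects = [o for o in objects if o["Key"] != "new-file"]  # don't merge a previous result into itself
objects.sort(key=lambda o: o["LastModified"])

# Concatenate the object bodies in that order and upload the result.
merged = b"".join(
    s3.get_object(Bucket=bucket, Key=o["Key"])["Body"].read() for o in objects
)
s3.put_object(Bucket=bucket, Key="new-file", Body=merged)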

How to determine s3 folder size?

I have a bunch of S3 folders for different projects/clients and I would like to estimate their total size (so I can, for instance, consider reducing sizes/costs). What is a good way to determine this?
I can do this with a combination of Python and the AWS CLI:
import os

bucket_rows = os.popen('aws s3 ls').read().split(chr(10))
sizes = dict()
for bucket in bucket_rows:
    if not bucket:
        continue  # skip blank lines
    buck = bucket.split(' ')[-1]  # the full row contains additional information (date and time)
    cmd = f"aws s3 ls --summarize --human-readable --recursive s3://{buck}/ | grep 'Total'"
    sizes[buck] = os.popen(cmd).read()
As stated here, the AWS CLI natively supports a --query parameter which can sum the size of every object in an S3 bucket:
aws s3api list-objects --bucket BUCKETNAME --output json --query "[sum(Contents[].Size), length(Contents[])]"
I hope it helps.
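If by "folders" you mean prefixes inside a single bucket rather than separate buckets, a small boto3 sketch like the one below (BUCKETNAME and the prefix names are placeholders) sums the object sizes per prefix without shelling out to the CLI:
import boto3

s3 = boto3.client("s3")
bucket = "BUCKETNAME"
prefixes = ["project-a/", "client-b/"]  # hypothetical folder prefixes

paginator = s3.get_paginator("list_objects_v2")
for prefix in prefixes:
    total_bytes = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        total_bytes += sum(obj["Size"] for obj in page.get("Contents", []))
    print(f"{prefix}: {total_bytes / 1024 ** 3:.2f} GiB")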
If you want to check via the console:
If you mean a single folder rather than the bucket, select that object, open the "Actions" drop-down and select "Get total size".
If you mean the whole bucket, go to the Management tab and open Metrics; it will show the entire bucket size.
This would do the magic 🚀
for bucket_name in `aws s3 ls | awk '{print $3}'`; do
echo "$bucket_name"
aws s3 ls s3://$bucket_name --recursive --summarize | tail -n2
done

Using AWS CLI to query file names inside folders?

Our bucket structure goes from MyBucket -> CustomerGUID (folder) -> [actual files]
I'm having a hell of a time trying to use the AWS CLI (on Windows) --query option to locate a file across all of the customer folders. Can someone look at my --query and see what I'm doing wrong here? Or tell me the proper way to search for a specific file name?
This is an example of how I'm able to list ALL the files in the bucket by LastModified date.
I need to limit the output based on filename, and that is where I'm getting stuck. When I look at individual files in S3, I can see that files have a "Key"; is the Key the 'name' of the file?
aws s3 ls s3://mybucket --recursive --output text --query "Contents[?contains(LastModified) > '2018-12-8']"
The aws s3 ls command only returns a text list of objects.
If you wish to use --query, then use: aws s3api list-objects
See: list-objects — AWS CLI Command Reference
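For what it's worth, the Key is the full object name including the 'folder' path. If the --query syntax keeps fighting you on Windows, a minimal boto3 sketch along these lines (the file name fragment and cutoff date are placeholders) walks every customer folder and keeps only the keys that match a name and were modified after a date:
import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")
bucket = "mybucket"
name_fragment = "report.pdf"                          # hypothetical file name to look for
cutoff = datetime(2018, 12, 8, tzinfo=timezone.utc)   # only keep objects modified after this

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        # LastModified is a timezone-aware datetime, so it compares directly
        if name_fragment in obj["Key"] and obj["LastModified"] > cutoff:
            print(obj["Key"], obj["LastModified"])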

AWS CLI Commands

I want to get a list of all files in an S3 bucket with a particular naming pattern.
For example, if I have files like:
aaaa2018-05-01
aaaa2018-05-23
aaaa2018-06-30
aaaa2018-06-21
I need to get a list of all files for the 5th month. The output should look like:
aaaa2018-05-01
aaaa2018-05-23
I executed the following command and the result was empty:
aws s3api list-objects --bucket bucketname --query "Contents[?contains(Key, 'aaaa2018-05-*')]" > s3list05.txt
When I check s3list05.txt it's empty. I also tried the command below:
aws s3 ls s3:bucketname --recursive | grep aaaa2018-05* > s3list05.txt
This command, however, lists all the objects in the bucket.
Kindly let me know the exact command to get the desired output.
You are almost there. Try this:
aws s3 ls s3://bucketname --recursive | grep aaaa2018-05
or
aws s3 ls bucketname --recursive | grep aaaa2018-05
The contains() function doesn't need a wildcard:
aws s3api list-objects --bucket bucketname --query "Contents[?contains(Key, 'aaaa2018-05')].[Key]" --output text
This provides a list of Keys.
--output text removes the JSON formatting.
Using [Key] instead of just Key puts each Key on its own line.

How to search an Amazon S3 Bucket using Wildcards?

This stackoverflow answer helped a lot. However, I want to search for all PDFs inside a given bucket.
I click "None", start typing, type *.pdf, and press Enter. Nothing happens.
Is there a way to use wildcards or regular expressions to filter bucket search results via the online S3 GUI console?
As stated in a comment, Amazon's UI can only be used to search by prefix as per their own documentation:
http://docs.aws.amazon.com/AmazonS3/latest/UG/searching-for-objects-by-prefix.html
There are other methods of searching but they require a bit of effort. Just to name two options, AWS-CLI application or Boto3 for Python.
I know this post is old but it is high on Google's list for s3 searching and does not have an accepted answer. The other answer by Harish is linking to a dead site.
UPDATE 2020/03/03: The AWS link above has been removed. This is a link to a very similar topic, which was as close as I could find: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
AWS CLI search:
In the AWS Console you can only search for objects within a single directory, and only by the prefix of the file name (an S3 search limitation).
The best approach is to use the AWS CLI with a command like the one below on Linux:
aws s3 ls s3://bucket_name/ --recursive | grep search_word | cut -c 32-
Searching files with wildcards
aws s3 ls s3://bucket_name/ --recursive | grep '\.pdf$'
You can use the copy function with the --dryrun flag:
aws s3 cp s3://your-bucket/any-prefix/ . --recursive --exclude "*" --include "*.pdf" --dryrun
It would show all of the files that are PDFs.
If you use boto3 in Python it's quite easy to find the files. Replace 'bucket' with the name of the bucket.
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')
for obj in bucket.objects.all():
    if '.pdf' in obj.key:
        print(obj.key)
The CLI can do this; aws s3 only supports prefixes, but aws s3api supports arbitrary filtering. For s3 links that look like s3://company-bucket/category/obj-foo.pdf, s3://company-bucket/category/obj-bar.pdf, s3://company-bucket/category/baz.pdf, you can run
aws s3api list-objects --bucket "company-bucket" --prefix "category/" --query "Contents[?ends-with(Key, '.pdf')]"
or for a more general wildcard
aws s3api list-objects --bucket "company-bucket" --prefix "category/" --query "Contents[?contains(Key, 'foo')]"
or even
aws s3api list-objects --bucket "company-bucket" --prefix "category/obj" --query "Contents[?ends_with(Key, '.pdf') && contains(Key, 'ba')]"
The full query language is described at JMESPath.
The documentation using the Java SDK suggests it can be done:
https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingObjectKeysUsingJava.html
Specifically, the listObjectsV2 call allows you to specify a prefix, e.g. "files/2020-01-02", so you only return results matching today's date (the prefix is a literal match rather than a wildcard).
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/ListObjectsV2Result.html
My guess is that the files were uploaded from a Unix system and you're downloading to Windows, so s3cmd is unable to preserve file permissions, which don't apply on NTFS.
To search for files and grab them, try this from the target directory (or change ./ to the target):
for i in `s3cmd ls s3://bucket | grep "searchterm" | awk '{print $4}'`; do s3cmd sync --no-preserve "$i" ./; done
This works in WSL on Windows.
I have used this in one of my projects, but it's a bit hard-coded:
import subprocess

bucket = "Abcd"
# List the objects under sub_dir/ and keep only the .csv entries
command = "aws s3 ls s3://" + bucket + "/sub_dir/ | grep '.csv'"
listofitems = subprocess.check_output(command, shell=True)
listofitems = listofitems.decode('utf-8')
# The key is the last space-separated field on each line; drop the trailing empty line
print([item.split(" ")[-1] for item in listofitems.split("\n")[:-1]])
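For the same task without shelling out to the CLI, here is a small boto3 sketch (reusing the bucket and folder names from the snippet above) that lists the .csv keys directly; it avoids shell=True and behaves the same on Windows and Linux:
import boto3

s3 = boto3.client("s3")
bucket = "Abcd"       # bucket name from the snippet above
prefix = "sub_dir/"   # folder from the snippet above

# Paginate so buckets with more than 1000 objects are fully covered.
paginator = s3.get_paginator("list_objects_v2")
csv_keys = [
    obj["Key"]
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
    for obj in page.get("Contents", [])
    if obj["Key"].endswith(".csv")
]
print(csv_keys)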