How to search an Amazon S3 Bucket using Wildcards? - amazon-web-services

This stackoverflow answer helped a lot. However, I want to search for all PDFs inside a given bucket.
I click "None".
Start typing.
I type *.pdf.
Press Enter.
Nothing happens. Is there a way to use wildcards or regular expressions to filter bucket search results via the online S3 GUI console?

As stated in a comment, Amazon's UI can only be used to search by prefix as per their own documentation:
http://docs.aws.amazon.com/AmazonS3/latest/UG/searching-for-objects-by-prefix.html
There are other methods of searching but they require a bit of effort. Just to name two options, AWS-CLI application or Boto3 for Python.
I know this post is old, but it is high on Google's results for S3 searching and does not have an accepted answer. The other answer by Harish links to a dead site.
UPDATE 2020/03/03: The AWS link above has been removed. The following link covers a very similar topic and was as close as I could find: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
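For illustration, here is a minimal Boto3 sketch of the second option mentioned above; the bucket name and prefix are placeholders, and it keeps only keys ending in .pdf:
import boto3

s3 = boto3.client('s3')
# 'my-bucket' and 'reports/' are placeholder names; adjust for your bucket.
resp = s3.list_objects_v2(Bucket='my-bucket', Prefix='reports/')
for obj in resp.get('Contents', []):
    if obj['Key'].endswith('.pdf'):
        print(obj['Key'])
# Note: a single list_objects_v2 call returns at most 1000 keys; use a paginator for larger buckets.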

AWS CLI search:
In the AWS Console, we can only search for objects within a single directory, and only by the prefix of the file name (an S3 search limitation).
The best way is to use the AWS CLI with the command below on Linux:
aws s3 ls s3://bucket_name/ --recursive | grep search_word | cut -c 32-
Searching for files matching a pattern (note that grep takes a regular expression, not a shell wildcard):
aws s3 ls s3://bucket_name/ --recursive | grep '\.pdf$'

You can use the copy command with the --dryrun flag:
aws s3 cp s3://your-bucket/any-prefix/ . --recursive --exclude "*" --include "*.pdf" --dryrun
This will list all of the files that are PDFs without actually copying anything.

If you use boto3 in Python it's quite easy to find the files. Replace 'bucket' with the name of the bucket.
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')
for obj in bucket.objects.all():
    if '.pdf' in obj.key:
        print(obj.key)

The CLI can do this; aws s3 only supports prefixes, but aws s3api supports arbitrary filtering. For s3 links that look like s3://company-bucket/category/obj-foo.pdf, s3://company-bucket/category/obj-bar.pdf, s3://company-bucket/category/baz.pdf, you can run
aws s3api list-objects --bucket "company-bucket" --prefix "category/" --query "Contents[?ends_with(Key, '.pdf')]"
or for a more general wildcard
aws s3api list-objects --bucket "company-bucket" --prefix "category/" --query "Contents[?contains(Key, 'foo')]"
or even
aws s3api list-objects --bucket "company-bucket" --prefix "category/obj" --query "Contents[?ends_with(Key, '.pdf') && contains(Key, 'ba')]"
The full query language is described in the JMESPath documentation.
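If you prefer Python, boto3's paginators can apply the same JMESPath expressions; a minimal sketch using the bucket and prefix from the examples above:
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='company-bucket', Prefix='category/')
# search() evaluates a JMESPath expression against each page of results.
for obj in pages.search("Contents[?ends_with(Key, '.pdf')]"):
    if obj is not None:
        print(obj['Key'])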

The documentation using the Java SDK suggests it can be done:
https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingObjectKeysUsingJava.html
Specifically, listObjectsV2 (which returns a ListObjectsV2Result) lets you specify a prefix, e.g. "files/2020-01-02", so you can return only the keys that start with today's date (prefixes do not accept wildcards).
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/ListObjectsV2Result.html

My guess is the files were uploaded from a Unix system and you're downloading to Windows, so s3cmd is unable to preserve file permissions, which don't apply on NTFS.
To search for files and grab them try this from the target directory or change ./ to target:
for i in `s3cmd ls s3://bucket | grep "searchterm" | awk '{print $4}'`; do s3cmd sync --no-preserve $i ./; done
This works in WSL on Windows.

I have used this in one of my projects, but it's a bit hard-coded:
import subprocess
bucket = "Abcd"
command = "aws s3 ls s3://" + bucket + "/sub_dir/ | grep '\\.csv$'"
listofitems = subprocess.check_output(command, shell=True,)
listofitems = listofitems.decode('utf-8')
print([item.split(" ")[-1] for item in listofitems.split("\n")[:-1]])

Related

Amazon S3 contains a file that cannot be found in console

I have an Amazon S3 bucket in AWS, and I have tried to list all the files in the bucket by:
aws s3 ls s3://bucket-1 --recursive | awk '{$1=$2=$3=""; print $0}' | sed 's/^[ \t]*//' | sort > bucket_1_files
This works for most files, but there are some files that are listed in bucket_1_files which I cannot find when I search for them in the Amazon S3 console (no matches are returned when I search for the name). Would anyone know of possible reasons this could be the case? The file is a .png file, and there are other .png files listed that I can find within the console.
I think there is something wrong with the command I'm using. When I just do
aws s3 ls s3://bucket-1 --recursive
I am finding that a lot of these missing files actually have a "t" in front of them.
Rather than playing with awk, sed and sort, you can list objects with:
aws s3api list-objects --bucket bucket-1 --query 'Contents[].[Key]' --output text
However, it will only return 1000 objects.
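If you need the complete listing regardless of object count, a paginated listing in boto3 avoids that limit; a minimal sketch using the bucket name from the question:
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
# The paginator follows continuation tokens, so all keys are returned, not just the first 1000.
for page in paginator.paginate(Bucket='bucket-1'):
    for obj in page.get('Contents', []):
        print(obj['Key'])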

How to determine s3 folder size?

I have a bunch of s3 folders for different projects/clients and I would like to estimate total size (so I can for instance consider reducing sizes/cost). What is a good way to determine this?
I can do this with a combination of Python and the AWS client:
import os
bucket_rows = os.popen('aws s3 ls').read().split(chr(10))
sizes = dict()
for bucket in bucket_rows:
    if not bucket:
        continue
    buck = bucket.split(' ')[-1]  # the full row contains additional information
    cmd = f"aws s3 ls --summarize --human-readable --recursive s3://{buck}/ | grep 'Total'"
    sizes[buck] = os.popen(cmd).read()
As stated here, the AWS CLI natively supports a --query parameter which can determine the size of every object in an S3 bucket.
aws s3api list-objects --bucket BUCKETNAME --output json --query "[sum(Contents[].Size), length(Contents[])]"
I hope it helps.
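If you would rather compute the total in Python than via --query, here is a rough boto3 sketch; BUCKETNAME is a placeholder as above:
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

total_bytes = 0
total_objects = 0
for page in paginator.paginate(Bucket='BUCKETNAME'):
    for obj in page.get('Contents', []):
        total_bytes += obj['Size']
        total_objects += 1

print(f"{total_objects} objects, {total_bytes / 1024 ** 3:.2f} GiB")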
If you want to check via the console:
If you mean a folder rather than a whole bucket, select that object, open the "Actions" drop-down, and choose "Get total size".
If you mean an entire bucket, go to the Management tab and then to Metrics; it shows the size of the whole bucket.
This would do the magic 🚀
for bucket_name in `aws s3 ls | awk '{print $3}'`; do
    echo "$bucket_name"
    aws s3 ls s3://$bucket_name --recursive --summarize | tail -n2
done

AWS CLI Download list of S3 files

We have ~400,000 files on a private S3 bucket that are inbound/outbound call recordings. The file names follow a certain pattern that lets me search for numbers, both inbound and outbound. Note that these recordings are in the Glacier storage class.
Using AWS CLI, I can search through this bucket and grep the files I need out. What I'd like to do is now initiate an S3 restore job to expedited retrieval (so ~1-5 minute recovery time), and then maybe 30 minutes later run a command to download the files.
My efforts so far:
aws s3 ls s3://exetel-logs/ --recursive | grep .*042222222.* | cut -c 32-
Retrieves the keys of about 200 files. I am unsure of how to proceed next, as aws s3 cp won't work for objects in the Glacier storage class.
Cheers,
The AWS CLI has two separate commands for S3: s3 and s3api. s3 is a high-level abstraction with limited features, so for restoring files you'll have to use one of the commands available with s3api:
aws s3api restore-object --bucket exetel-logs --key your-key --restore-request '{"Days":1,"GlacierJobParameters":{"Tier":"Expedited"}}'
(restore-object needs a --restore-request that says how long the restored copy should stay available and, optionally, which retrieval tier to use)
If you afterwards want to copy the files, but want to ensure to only copy files which were restored from Glacier, you can use the following code snippet:
for key in $(aws s3api list-objects-v2 --bucket exetel-logs --query "Contents[?StorageClass=='GLACIER'].[Key]" --output text); do
    if [ $(aws s3api head-object --bucket exetel-logs --key ${key} --query "contains(Restore, 'ongoing-request=\"false\"')") == true ]; then
        echo ${key}
    fi
done
Have you considered using a high-level language wrapper for the AWS CLI? It will make these kinds of tasks easier to integrate into your workflows. I prefer the Python implementation (Boto 3). Here is example code for how to download all files from an S3 bucket.
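As a rough boto3 sketch of that restore-then-download workflow (the bucket name comes from the question; the search pattern, Days value, and Expedited tier are assumptions):
import boto3

s3 = boto3.client('s3')
bucket = 'exetel-logs'   # bucket name from the question
pattern = '042222222'    # assumed search string

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if pattern not in key or obj.get('StorageClass') != 'GLACIER':
            continue
        # Kick off an expedited restore; the restored copy stays readable for Days=1.
        s3.restore_object(
            Bucket=bucket,
            Key=key,
            RestoreRequest={'Days': 1, 'GlacierJobParameters': {'Tier': 'Expedited'}},
        )

# Once head_object(...)['Restore'] reports ongoing-request="false", the objects
# can be fetched with s3.download_file(bucket, key, local_name).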

AWS Cli in Windows wont upload file to s3 bucket

Windows server 12r2 with python 2.7.10 and the aws cli tool installed. The following works:
aws s3 cp c:\a\a.txt s3://path/
I can upload that file without problem. What I want to do is upload a file from a mapped drive to an s3 bucket, so I tried this:
aws s3 cp s:\path\file s3://path/
and it works.
Now what I want to do, and cannot figure out, is how not to specify a single file but instead let it grab all files, so I can schedule this to upload the contents of a directory to my S3 bucket. I tried this:
aws s3 cp "s:\path\..\..\" s3://path/ --recursive --include "201512"
and I get this error "TOO FEW ARGUMENTS"
Nearest I can guess it's mad I'm not putting a specific file to send up, but I don't want to do that, I want to automate all things.
If someone could please shed some light on what I'm missing I would really appreciate it.
Thank you
In case this is useful for anyone else coming after me: Add some extra spaces between the source and target. I've been beating my head against running this command with every combination of single quotes, double quotes, slashes, etc:
aws s3 cp /home/<username>/folder/ s3://<bucketID>/<username>/archive/ --recursive --exclude "*" --include "*.csv"
And it would give me: "aws: error: too few arguments" Every. Single. Way. I. Tried.
So I finally saw the --debug option in aws s3 cp help
and ran it again this way:
aws s3 cp /home/<username>/folder/ s3://<bucketID>/<username>/archive/ --recursive --exclude "*" --include "*.csv" --debug
And this was the relevant debug line:
MainThread - awscli.clidriver - DEBUG - Arguments entered to CLI: ['s3', 'cp', 'home/<username>/folder\xc2\xa0s3://<bucketID>/<username>/archive/', '--recursive', '--exclude', '*', '--include', '*.csv', '--debug']
I have no idea where \xc2\xa0 came from in between source and target (it is the UTF-8 encoding of a non-breaking space, most likely pasted in from formatted text), but there it is! Updated the line to add a couple of extra spaces and now it runs without errors:
aws s3 cp /home/<username>/folder/ s3://<bucketID>/<username>/archive/ --recursive --exclude "*" --include "*.csv"
aws s3 cp "s:\path\..\..\" s3://path/ --recursive --include "201512"
TOO FEW ARGUMENTS
This is because, in your command, the closing double-quote (") is escaped by the trailing backslash (\), so the local path (s:\path\..\..\) is not parsed correctly.
What you need to do is escape each backslash with a double backslash, i.e.:
aws s3 cp "s:\\path\\..\\..\\" s3://path/ --recursive --include "201512"
Alternatively you can try 'mc', which comes as a single binary and is available for Windows in both 64-bit and 32-bit builds. 'mc' implements mirror, cp, resumable sessions, JSON-parseable output and more - https://github.com/minio/mc
64-bit from https://dl.minio.io/client/mc/release/windows-amd64/mc.exe
32-bit from https://dl.minio.io/client/mc/release/windows-386/mc.exe
Use aws s3 sync instead of aws s3 cp to copy the contents of a directory.
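If the Windows path quoting keeps getting in the way, a small boto3 script sidesteps shell parsing entirely; a sketch with placeholder paths and bucket name:
import os
import boto3

s3 = boto3.client('s3')
local_dir = r's:\path'   # placeholder local folder
bucket = 'my-bucket'     # placeholder bucket name
prefix = 'path/'         # placeholder key prefix

for root, _, files in os.walk(local_dir):
    for name in files:
        if '201512' not in name:   # rough equivalent of the --include "201512" filter
            continue
        full_path = os.path.join(root, name)
        key = prefix + os.path.relpath(full_path, local_dir).replace(os.sep, '/')
        s3.upload_file(full_path, bucket, key)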
I faced the same situation. Let me share two scenarios I tried in order to check the same code.
Within bash
Please make sure you have an AWS profile in place (run aws configure). Also, make sure you use a proper proxy if applicable.
$aws s3 cp s3://bucket/directory/ /usr/home/folder/ --recursive --region us-east-1 --profile yaaagy
it worked.
Within a perl script
$cmd="$aws s3 cp s3://bucket/directory/ /usr/home/folder/ --recursive --region us-east-1 --profile yaaagy"
I enclosed it within "" and it was successful. Let me know if this works out for you.
I ran into this same problem recently, and quiver's answer -- replacing single backslashes with double backslashes -- resolved the problem I was having.
Here's the Powershell code I used to address the problem, using the OP's original example:
# Notice how my path string contains a mixture of single- and double-backslashes
$folderPath = "c:\\a\a.txt"
echo "`$folderPath = $($folderPath)"
# Use the "Resolve-Path" cmdlet to generate a consistent path string.
$osFolderPath = (Resolve-Path $folderPath).Path
echo "`$osFolderPath = $($osFolderPath)"
# Escape backslashes in the path string.
$s3TargetPath = ($osFolderPath -replace '\\', "\\")
echo "`$s3TargetPath = $($s3TargetPath)"
# Now pass the escaped string to your AWS CLI command.
echo "AWS Command = aws s3 cp `"s3://path/`" `"$s3TargetPath`""

Selective file download in AWS CLI

I have files in an S3 bucket. I was trying to download files based on a date, like 08th Aug, 09th Aug, etc.
I used the following code, but it still downloads the entire bucket:
aws s3 cp s3://bucketname/ folder/file \
--profile pname \
--exclude \"*\" \
--recursive \
--include \"" + "2015-08-09" + "*\"
I am not sure, how to achieve this. How can I download selective date file?
This command will copy all files starting with 2015-08-15:
aws s3 cp s3://BUCKET/ folder --exclude "*" --include "2015-08-15*" --recursive
If your goal is to synchronize a set of files without copying them twice, use the sync command:
aws s3 sync s3://BUCKET/ folder
That will copy all files that have been added or modified since the previous sync.
In fact, this is the equivalent of the above cp command:
aws s3 sync s3://BUCKET/ folder --exclude "*" --include "2015-08-15*"
References:
AWS CLI s3 sync command documentation
AWS CLI s3 cp command documentation
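The same selective download can be scripted with boto3, treating the date as a key prefix; a minimal sketch with placeholder bucket and folder names:
import os
import boto3

s3 = boto3.client('s3')
bucket = 'BUCKET'        # placeholder bucket name
prefix = '2015-08-09'    # the date acts as a key prefix
dest = 'folder'          # placeholder local directory

os.makedirs(dest, exist_ok=True)
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        s3.download_file(bucket, key, os.path.join(dest, os.path.basename(key)))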
Bash Command to copy all files for specific date or month to current folder
aws s3 ls s3://bucketname/ | grep '2021-02' | awk '{print $4}' | xargs -I {} aws s3 cp s3://bucketname/{} folder
The command does the following:
Lists all the files under the bucket
Filters out all the files from 2021-02, i.e. February 2021
Extracts only their names
Runs aws s3 cp on each of those files (via xargs)
In case your bucket is large, upwards of 10 to 20 GB (this was true in my own use case), you can achieve the same goal by using sync in multiple terminal windows.
All the terminal sessions can use the same token, in case you need to generate a token for prod environment.
$ aws s3 sync s3://bucket-name/sub-name/another-name folder-name-in-pwd/
--exclude "*" --include "name_date1*" --profile UR_AC_SomeName
and another terminal window (same pwd)
$ aws s3 sync s3://bucket-name/sub-name/another-name folder-name-in-pwd/
--exclude "*" --include "name_date2*" --profile UR_AC_SomeName
and another two for "name_date3*" and "name_date4*"
Additionally, you can also do multiple excludes in the same sync
command as in:
$ aws s3 sync s3://bucket-name/sub-name/another-name my-local-path/
--exclude="*.log/*" --exclude=img --exclude=".error" --exclude=tmp
--exclude="*.cache"
This Bash Script will copy all files from one bucket to another by modified-date using aws-cli.
aws s3 ls <BCKT_NAME> --recursive | sort | grep "2020-08-*" | cut -b 32- > a.txt
Inside Bash File
while IFS= read -r line; do
    aws s3 cp s3://<SRC_BCKT>/${line} s3://<DEST_BCKT>/${line} --sse AES256
done < a.txt
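A boto3 equivalent of that two-step script, filtering on LastModified instead of grepping the listing (bucket names and the date range are placeholders):
import boto3
from datetime import datetime, timezone

s3 = boto3.client('s3')
src, dest = 'SRC_BCKT', 'DEST_BCKT'   # placeholder bucket names
start = datetime(2020, 8, 1, tzinfo=timezone.utc)
end = datetime(2020, 9, 1, tzinfo=timezone.utc)

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=src):
    for obj in page.get('Contents', []):
        if start <= obj['LastModified'] < end:
            s3.copy_object(
                Bucket=dest,
                Key=obj['Key'],
                CopySource={'Bucket': src, 'Key': obj['Key']},
                ServerSideEncryption='AES256',   # matches the --sse AES256 flag above
            )
# copy_object handles objects up to 5 GB; use s3.copy() for anything larger.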
The AWS CLI is really slow at this. I waited hours and nothing really happened. So I looked for alternatives.
https://github.com/peak/s5cmd worked great.
It supports globs, for example:
s5cmd -numworkers 30 cp 's3://logs-bucket/2022-03-30-19-*' .
It is blazing fast, so you can work with buckets that have S3 access logs without much fuss.