I tried using this post to look for the last modified file then awk for the folder it's contained in: Get last modified object from S3 using AWS CLI
But this isn't ideal for over 1000 folders and, according to the documentation, it should be failing. I have 2000+ folder objects I need to search through. My desired folder will always begin with a D followed by a set of incrementing numbers, e.g. D1200.
The results from the answer led me to creating this call which works:
aws s3 ls main.test.staging/General_Testing/Results/ --recursive | sort | tail -n 1 | awk '{print $4}'
but it takes over 40 seconds to search through thousands of files, and I then need to regex-parse the output to find the folder object rather than the last file modified within it. Also, if I try the following to find my desired folder (which is the object right after the Results object):
aws s3 ls main.test.staging/General_Testing/Results/ | sort | tail -1
Then my output will be D998 because the sort function will order folder names like this:
D119
D12
D13
Because D12 sorts after D119: the comparison is character by character, and the 2 in D12 comes after the 1 in the same position in D119. Following this logic, there's no way I can use that call to reliably retrieve the highest-numbered folder and therefore the last one created. Something to note is that folder objects that contain files do not have a Last Modified tag that one can use to query.
To be clear, my question is: what call can I use to look through a large number of S3 objects to find the largest-numbered folder object? Preferably the answer is fast, works with 1000+ objects, and won't require a regex breakdown.
I wonder whether you can use a list of CommonPrefixes to overcome your problem of having many folders?
Try this command:
aws s3api list-objects-v2 --bucket main.test.staging --delimiter '/' --prefix 'General_Testing/Results/' --query CommonPrefixes --output text
(Note that it uses s3api rather than s3.)
It should provide a list of 'folders'. I don't know whether it has a limit on the number of 'folders' returned.
As for sorting D119 before D2, this is because it is sorting strings. The output is perfectly correct when sorting strings.
To sort by the number portion, you can likely use "version sorting". See: How to sort strings that contain a common prefix and suffix numerically from Bash?
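Putting the two suggestions together, a rough sketch (assuming GNU sort for the -V version-sort flag, and that each common prefix ends with the D-numbered folder) might look like:
aws s3api list-objects-v2 --bucket main.test.staging --delimiter '/' --prefix 'General_Testing/Results/' --query 'CommonPrefixes[].Prefix' --output text | tr '\t' '\n' | sort -V | tail -n 1
The tr turns the tab-separated prefixes into one per line, sort -V orders D998 before D1200, and tail -n 1 keeps the highest one. I haven't timed this against 2000+ prefixes, so treat it as a starting point.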
Related
I would like to test a tool on a small number of files from a directory. To run the tool on all files in the directory, I would run:
./my-tool input/*.test
However, the tool takes a long time to run and I would like to test it only on a subset of the files in input/. Currently, I am copying a random subset to another folder and using the wildcard to grab all files from that folder.
My question is: is there any way to limit the pattern matches? I.e. a way to run ./my-tool input/[PATTERN].test where [PATTERN] is a pattern that will expand to only n matches. Even better, is there a way to do that and randomize which ones are returned?
On GNU/Linux you can easily and robustly select a subset of files with shuf:
shuf -ze -n 10 input/*.test | xargs -0 ./my-tool
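For reference, the same command with the flags spelled out (meanings per GNU shuf and xargs; adjust -n to the sample size you want):
# -e    treat the expanded input/*.test names as the input lines
# -z    end each output name with a NUL byte so odd filenames survive
# -n 10 keep 10 names chosen at random
# xargs -0 splits on those NUL bytes and appends the names to ./my-tool
shuf -ze -n 10 input/*.test | xargs -0 ./my-tool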
I'm trying to use the aws s3 CLI to sync files (then delete the local copy) from the server to an S3 bucket, but I can't find a way to exclude newly created files which are still in use on the local machine.
Any ideas?
This should work:
find /path/to/local/SyncFolder -mtime +1 -print0 | sed -z 's/^/--include=/' | xargs -0 /usr/bin/aws s3 sync /path/to/local/SyncFolder s3://remote.sync.folder --exclude '*'
There's a trick here: we're not excluding the files we don't want, we're excluding everything and then including the files we want. Why? Because either way, we're probably going to have too many parameters to fit into the command line. We can use xargs to split up long lines into multiple calls, but we can't let xargs split up our excludes list, so we have to let it split up our includes list instead.
So, starting from the beginning, we have a find command. -mtime +1 finds all the files that are older than a day, and -print0 tells find to delimit each result with a null byte instead of a newline, in case some of your files have newlines in their names.
Next, sed adds the --include= option to the start of each filename, and the -z option tells sed to use null bytes instead of newlines as delimiters.
Finally, xargs will feed all those include options to the end of our aws command, calling aws multiple times if need be. The -0 option is just like sed's -z option, telling it to use null bytes instead of newlines.
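The assembled command that xargs ends up running looks roughly like this (illustrative only; these file names are made up, the real ones come from find):
/usr/bin/aws s3 sync /path/to/local/SyncFolder s3://remote.sync.folder --exclude '*' --include=/path/to/local/SyncFolder/old-report.csv --include=/path/to/local/SyncFolder/archive/old-log.txt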
To my knowledge you can only include/exclude based on filename, so the only way I see is a really dirty hack.
You could run a bash script to rename all files below your threshold, prefixing/postfixing them like TOO_NEW_%Filename%, and then run the CLI with:
--exclude 'TOO_NEW_*'
But no, don't do that.
Most likely ignoring the newer files is the default behavior. We can read in aws s3 sync help:
The default behavior is to ignore same-sized items unless the local version is newer than the S3 version.
If you'd like to change the default behaviour, you have the following parameters to use:
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
--exact-timestamps (boolean) When syncing from S3 to local, same-sized
items will be ignored only when the timestamps match exactly. The
default behavior is to ignore same-sized items unless the local version
is newer than the S3 version.
To see what files are going to be updated, run the sync with --dryrun.
Alternatively, use find to list all the files which need to be excluded, and pass them to the --exclude parameter.
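A rough sketch of that alternative, assuming GNU find (-printf '%P\n' prints each path relative to the sync folder, which is the form the sync filters expect). It will break on file names containing spaces or newlines, so treat it as a starting point rather than a finished solution:
aws s3 sync /path/to/local/SyncFolder s3://remote.sync.folder $(find /path/to/local/SyncFolder -type f -mtime -1 -printf '--exclude=%P\n')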
I'm not sure how to even ask this question so bear with me. I have a list of (mostly) alpha-numerics that are drawing numbers in a giant XML that I'm tweaking a schema for. There appears to be no standard as to how they've been created, so I'm trying to create an XSD regex pattern for them to validate against. Normally, I'd just grind through them but in this case there are hundreds of them. What I want to do is isolate them down to a single instance of each type of drawing number, and then from that, I can create a regex with appropriate OR statements in the XSD.
My environment is Win7, but I've got an Ubuntu VM as well as Cygwin (where I'm currently doing all of this). I don't know if there's a Linux utility that can do this, or if my grep/sed-fu is just weak. I have no idea how to reduce this problem down except by brute force (which I've done for other pieces of this puzzle that weren't as large as this one).
I used this command line statement to grab the drawing "numbers". It looks for the drawing number, sorts them, only gives me uniques, and then strips away the enclosing tags:
grep "DrawingNumber" uber.xml | sort | uniq | sed -e :a -e 's/<[^>]*>//g;/</N;//ba'
Here is a sample of some of the actual drawing "numbers" (there are hundreds more):
10023C/10024C *<= this is how it's represented in the XML & I can't (easily) change it.
10023C
10043E
10051B
10051D
10058B
10059C
10447B 10447B *<= this is how it's represented in the XML & I can't (easily) change it.
10064A
10079B
10079D
10082B
10095A
10098B
10100B
10102
10109B
10109C
10115
101178
10118F
What I want is a list that would reduce the list of drawing numbers to a single instance of each type. For instance, this group of drawing "numbers":
10023C
10043E
10051B
10051D
10058B
10059C
Would reduce to:
nnnnnx
to represent all instances of 5 digits followed by a single letter for which I can create a pattern like so:
[0-9]{5}[a-zA-Z]
Similarly,
10102
10115
would reduce to:
nnnnn
which would represent all instances of 5 digits with nothing following and be captured with:
[0-9]{5}
and so on. I hope this is enough information to present the problem in a workable form. Like I said, I didn't even know how to frame the question, and frequently by the time I get as far as writing a question on SO I realize a solution and don't even submit it, but this one has me stumped.
Update:
Using @nullrevolution's answer, here's what I came up with (this clarifies my comment below, which is largely unreadable).
The command line I eventually used was:
grep "DrawingNumber" uber.xml | sort -d | uniq | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | sed 's/[A-Za-z]/x/g;s/[0-9]/n/g' | sort -u
On data that looked like this:
<DrawingNumber>10430A</DrawingNumber>
<DrawingNumber>10431</DrawingNumber>
<DrawingNumber>10433</DrawingNumber>
<DrawingNumber>10434</DrawingNumber>
<DrawingNumber>10443A</DrawingNumber>
<DrawingNumber>10444</DrawingNumber>
<DrawingNumber>10446</DrawingNumber>
<DrawingNumber>10446A</DrawingNumber>
<DrawingNumber>10447</DrawingNumber>
<DrawingNumber>10447B 10447B</DrawingNumber>
<DrawingNumber>10447B</DrawingNumber>
<DrawingNumber>10454A</DrawingNumber>
<DrawingNumber>10454B</DrawingNumber>
<DrawingNumber>10455</DrawingNumber>
<DrawingNumber>10457</DrawingNumber>
Which gave me a generified output of (for all my data, not the snippet above):
nnnnn
nnnnnn
nnnnnx
nnnnnx nnnnnx
nnnnnx/nnnnnx
nnxxx
Which is exactly what I needed. Turns out the next two instances of things I need to figure out will benefit from this new method, so who knows how many hours this just saved me?
try stripping away the enclosing tags first, then:
sed 's/[A-Za-z]/x/g;s/[0-9]/n/g' file | sort -u
which will replace all letters with "x" and all digits with "n", then remove all duplicates.
run against your sample input file, the output is:
nnnnnx
if that's not feasible, then could you share a portion of the input file in its original form?
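As a quick illustration of what the substitution does, feeding it a few of the values from the question:
printf '10023C\n10102\n10023C/10024C\n' | sed 's/[A-Za-z]/x/g;s/[0-9]/n/g' | sort -u
prints:
nnnnn
nnnnnx
nnnnnx/nnnnnx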
I have thousands of jpg files that are all called 1.jpg, 2.jpg, 3.jpg and so on. I need to zip up a range of them and I thought I could do this with regex, but so far haven't had any luck.
Here is the command
zip images.zip '[66895-105515]'.jpg
Does anyone have any ideas?
I am very sure that it is not possible to match number ranges like this with regular expressions (digit ranges, yes, but not whole multi-digit numbers), as regular expressions work at the character level. However, you can use the "seq" command to generate the list of file names and use "xargs" to pass them to "zip":
seq --format %g.jpg 66895 105515 | xargs zip images.zip
I tested the command with a bunch of dummy files under Linux and it works fine.
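If some numbers in that range don't exist as files, zip will print a warning for each missing name. A hedged variant that first filters the list down to files that actually exist (zip's -@ reads the file names from standard input):
seq --format %g.jpg 66895 105515 | while read -r f; do [ -f "$f" ] && printf '%s\n' "$f"; done | zip images.zip -@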
Use zip in conjunction with ls and the bash range ({m..n}) operator, like this:
ls {66895..105515}".jpg" 2>/dev/null | zip jpegs -@
You need to pipe some stuff - list the files, filter by the regex, zip up each listed file.
ls | grep [66895-10551] | xargs zip images.zip
Edit: Whoops, didn't test with multi-digit numbers. As denisw mentions, this method won't work.
I have a folder with several hundreds of folders inside it. These folders contain another folder each, called images, and in this folder there is sometimes a strictly numerically named .jpg file. Sometimes there are other JPG files in the folder as well, but these need to be ignored if they aren't strictly numeric.
I would like to learn how to write a script which would, when run in a given folder, traverse every single subfolder and look for this numeric file. It would then add the "_n" suffix to a copy of each, if such a file does not already exist.
Can this be done through the unix terminal easily?
To be more specific, this is the structure I'm dealing with:
master folder
    18556
        images
            2234.jpg
    47772
        images
            2234.jpg
            2234_n.jpg
            some_pic.jpg
    77377
        images
    88723
        images
            22.jpg
            some_pic.jpg
After the script is run, the situation would look like this:
master folder
    18556
        images
            2234.jpg
            2234_n.jpg
    47772
        images
            2234.jpg
            2234_n.jpg
            some_pic.jpg
    77377
        images
    88723
        images
            22.jpg
            22_n.jpg
            some_pic.jpg
Update: Sorry about the typo, I accidentally put 2235 into 47772.
Update 2: Regarding the 2nd comment on mathematical.coffee's answer, the OS I am currently on (at work) is macOS, but my main machines at home run CentOS and Ubuntu, so I just assumed my situation applies to all Unix-based systems.
You can use the -regex switch to find to match /somefolder/images/numeric.jpg:
find . -type f -regex './[^/]+/images/[0-9]+\.jpg$'
Edit: refinement from @JonathanLeffler: add -type f to find so it only finds files (i.e. it won't match a directory called '12345.jpg').
The ./[^/]+/ is for the first folder (if that first folder is always numeric too you can change it to [0-9]+).
The [0-9]+\.jpg$ means a jpg file with file name only being numeric.
You might want to change the jpg to jpe?g to allow .jpeg, but that's up to you.
Then it's a matter of copying these to xxx_n.jpg.
for f in $(find . -type f -regex './[^/]+/images/[0-9]+\.jpg$')
do
# replace '.jpg' in $f (filename) with '_n.jpg'
newf=${f/\.jpg/_n\.jpg}
# see if this new file exists
if [ ! -f "$newf" ];
then
# if not exists, copy it.
cp "$f" "$newf"
fi
done
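The file names here are purely numeric, so word splitting is not a real danger, but a null-delimited variant of the same loop (a sketch, again assuming GNU find) is safer if any of the parent directories ever contain spaces:
find . -type f -regex './[^/]+/images/[0-9]+\.jpg$' -print0 |
while IFS= read -r -d '' f; do
    newf="${f%.jpg}_n.jpg"    # 2234.jpg -> 2234_n.jpg
    [ -f "$newf" ] || cp "$f" "$newf"
done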
What should the logic behind the renames in folder 47772 be? If we assume you want to rename all the files whose names consist only of digits to the same name plus _n, then with mmv you could write it like:
mmv "[0-9][0-9]*.jpg" "#1#2#3_n.jpg"
Note: mmv is for moving; mcp is for copying, and so is more appropriate to this question.
In answer to Vader's question:
Well, I checked the man page and the problem is that it's a bit strange.
I was thinking [0-9]* would match zero or more digits; it turns out that assumption was wrong.
The problem is that I could not find a way to say I want two or more digits at the start of the name.
So [0-9][0-9]* matches a name starting with at least two digits; after that, the * takes all the rest of the name up to the .jpg. Every wildcard is a separate capture, so I had to make the to-pattern:
"#1#2#3_n.jpg". With e.g. 1234.jpg I have #1 = 1, #2 = 2, #3 = 34, so
#1#2#3 -> 1234; _n appends the _n, and .jpg the extension.
However, it would also rename files like 12some_other_stuff.jpg to 12some_other_stuff_n.jpg. That's not ideal, but it achieves what was intended in this context.
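Since the question actually asks for copies rather than renames, the same pattern should work with mcp, which ships in the mmv package and takes the same wildcard syntax (an untested sketch, run from inside each images directory):
mcp "[0-9][0-9]*.jpg" "#1#2#3_n.jpg"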