Find foldername of highest version [duplicate] - regex

This question already has answers here:
Unix sort of version numbers
(7 answers)
Closed 9 years ago.
I've got a few folders
src/1.0.0.1/
src/1.0.0.2/
src/1.0.0.12/
src/1.0.0.13/
src/1.0.1.1/
I'm looking for a bash command-chain that always returns only the latest version.
In the case above that would be 1.0.1.1. If 1.0.1.1 weren't present, the latest version would be 1.0.0.13.
I'm on OS X so sort has no -V option.
Can anyone help me out here?

You can also use an awk script:
#!/usr/bin/awk -f
$0 ~ /[0-9]+(\.[0-9]+)*\/?$/ {
    t = $0
    sub(/\/$/, "", t)
    sub(/.*\//, "", t)
    current_count = split(t, current_array, /\./)
    is_later = 0
    for (i = 1; i <= current_count || i <= latest_count; ++i) {
        if (current_array[i] > latest_array[i]) {
            is_later = 1
            break
        } else if (latest_array[i] > current_array[i]) {
            break
        }
    }
    if (is_later) {
        latest_string = $0
        latest_count = split(t, latest_array, /\./)
    }
}
END {
    if (latest_count) {
        print latest_string
    }
}
Run:
find src/ -maxdepth 1 -type d | awk -f script.awk
Or:
ls -1 src/ | awk -f script.awk
You can also use a minimized version:
... | awk -- '$0~/[0-9]+(\.[0-9]+)*\/?$/{t=$0;sub(/\/$/,"",t);sub(/.*\//,"",t);c=split(t,a,/\./);l=0;for(i=1;i<=c||i<=z;++i){if(a[i]>x[i]){l=1;break}else if(x[i]>a[i])break}if(l){s=$0;z=split(t,x,/\./)}}END{if(z)print s}'
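For the example tree in the question, the find form prints the path of the highest version (the ls form would print just the directory name):
$ find src/ -maxdepth 1 -type d | awk -f script.awk
src/1.0.1.1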

Use sort (the -V option performs version sort):
$ ls -1 src/
1.0.0.1
1.0.0.12
1.0.0.13
1.0.0.2
$ find src/1* -type d | sort -rV
src/1.0.0.13
src/1.0.0.12
src/1.0.0.2
src/1.0.0.1
$ mkdir src/1.0.1.1
$ find src/1* -type d | sort -rV
src/1.0.1.1
src/1.0.0.13
src/1.0.0.12
src/1.0.0.2
src/1.0.0.1
Pipe the output to head and you can get only the highest version:
$ find src/1* -type d | sort -rV | head -1
src/1.0.1.1
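Since the question says OS X's sort has no -V: for version strings with exactly four numeric components, a per-field numeric sort can stand in for it. A sketch, assuming that fixed layout:
$ ls src/ | sort -t. -k1,1n -k2,2n -k3,3n -k4,4n | tail -1
1.0.1.1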

stop condition for emulating "grep -oE" with awk

I'm trying to emulate GNU grep -Eo with a standard awk call.
What the man says about the -o option is:
-o --only-matching
     Print only the matched (non-empty) parts of matching lines, with each such part on a separate output line.
For now I have this code:
#!/bin/sh
regextract() {
    [ "$#" -ge 2 ] || return 1
    __regextract_ere=$1
    shift
    awk -v FS='^$' -v ERE="$__regextract_ere" '
        {
            while ( match($0,ERE) && RLENGTH > 0 ) {
                print substr($0,RSTART,RLENGTH)
                $0 = substr($0,RSTART+1)
            }
        }
    ' "$@"
}
My question is: In the case that the matching part is 0-length, do I need to continue trying to match the rest of the line or should I move to the next line (like I already do)? I can't find a sample of input+regex that would need the former but I feel like it might exist. Any idea?
Here's a POSIX awk version, which works with a* (or any POSIX awk regex):
echo abcaaaca |
awk -v regex='a*' '
    {
        while (match($0, regex)) {
            if (RLENGTH) print substr($0, RSTART, RLENGTH)
            $0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
            if ($0 == "") break
        }
    }'
Prints:
a
aaa
a
POSIX awk and grep -E use POSIX extended regular expressions, except that awk allows C escapes (like \t) but grep -E does not. If you wanted strict compatibility you'd have to deal with that.
If you can consider a gnu-awk solution, then using RS and RT may give behavior identical to grep -Eo.
# input data
cat file
FOO:TEST3:11
BAR:TEST2:39
BAZ:TEST0:20
Using grep -Eo:
grep -Eo '[[:alnum:]]+' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
Using gnu-awk with RS and RT using same regex:
awk -v RS='[[:alnum:]]+' 'RT != "" {print RT}' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
More examples:
grep -Eo '\<[[:digit:]]+' file
11
39
20
awk -v RS='\\<[[:digit:]]+' 'RT != "" {print RT}' file
11
39
20
Thanks to the various comments and answers, I think I now have working, robust, and (maybe) efficient code:
tested on AIX/Solaris/FreeBSD/macOS/Linux
#!/bin/sh
regextract() {
    [ "$#" -ge 1 ] || return 1
    [ "$#" -eq 1 ] && set -- "$1" -
    awk -v FS='^$' '
        BEGIN {
            ere = ARGV[1]
            delete ARGV[1]
        }
        {
            tail = $0
            while ( tail != "" && match(tail,ere) ) {
                if (RLENGTH) {
                    print substr(tail,RSTART,RLENGTH)
                    tail = substr(tail,RSTART+RLENGTH)
                } else
                    tail = substr(tail,RSTART+1)
            }
        }
    ' "$@"
}
regextract "$@"
notes:
I pass the ERE string along with the file arguments so that awk doesn't pre-process it (thanks @anubhava for pointing that out); C-style escape sequences will still be translated by awk's regex engine though (thanks @dan for pointing that out).
Because assigning to $0 resets the values of all the fields, I chose FS = '^$' to limit the overhead of re-splitting.
Copying $0 into a separate variable avoids the overhead of reassigning $0 inside the while loop (thanks @EdMorton for pointing that out).
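A minimal illustration of the FS trick: with FS='^$' the separator never matches inside a record, so awk keeps the whole line as a single field and re-splitting stays cheap:
echo 'a b c' | awk -v FS='^$' '{ print NF }'   # prints 1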
a few examples:
# Multiple matches in a single line:
echo XfooXXbarXXX | regextract 'X*'
X
XX
XXX
# Passing the regex string to awk as a parameter versus a file argument:
echo '[a]' | regextract_as_awk_param '\[a]'
a
echo '[a]' | regextract '\[a]'
[a]
# The regex engine of awk translates C-style escape sequences:
printf '%s\n' '\t' | regextract '\t'
printf '%s\n' '\t' | regextract '\\t'
\t
Your code will malfunction for a regex that can match zero characters. Consider the following simple example: let file.txt's content be
1A2A3
then
grep -Eo 'A*' file.txt
gives output
A
A
Your while condition is match($0,ERE) && RLENGTH > 0; here the former part is true, but the latter is false because the match found is zero-length before the first character (RSTART is set to 1), so the body of the while loop is never executed.
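You can see this directly with match(): on the first record the match succeeds, but it is the zero-length match at position 1:
echo 1A2A3 | awk '{ print match($0, "A*"), RSTART, RLENGTH }'
1 1 0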

How to count the number of s3 folders inside given path?

I tried to search for this solution everywhere but wasn't lucky. Hoping to find some solution quickly here. I have some migrated files in S3 and now there is a requirement to identify the number of folders involved in the given path. Say I have some files as below.
If I give aws s3 ls s3://my-bucket/foo1 --recursive >> file_op.txt
"cat file_op.txt" - will look something like below:
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7/file1.txt
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7/file2.txt
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/file1.pdf
my-bucket/foo1/foo2/foo3/foo4/foo6/file2.txt
my-bucket/foo1/foo2/foo3/file3.txt
my-bucket/foo1/foo8/file1.txt
my-bucket/foo1/foo9/foo10/file4.csv
I have stored the output in a file and processed it to find the number of files with wc -l.
But I couldn't find the number of folders involved in the path.
I need the output as below:
number of files : 7
number of folders : 9
EDIT 1:
Corrected the expected number of folders.
(Excluding my-bucket and foo1)
(foo6 is in foo5 and foo4 directories)
Below is my code; I'm failing to calculate the count of directories:
#!/bin/bash
if [[ "$#" -ne 1 ]] ; then
echo "Usage: $0 \"s3 folder path\" <eg. \"my-bucket/foo1\"> "
exit 1
else
start=$SECONDS
input=$1
input_code=$(echo $input | awk -F'/' '{print $1 "_" $3}')
#input_length=$(echo $input | awk -F'/' '{print NF}' )
s3bucket=$(echo $input | awk -F'/' '{print $1}')
db_name=$(echo $input | awk -F'/' '{print $3}')
pathfinder=$(echo $input | awk 'BEGIN{FS=OFS="/"} {first = $1; $1=""; print}'|sed 's#^/##g'|sed 's#$#/#g')
myn=$(whoami)
cdt=$(date +%Y%m%d%H%M%S)
filename=$0_${myn}_${cdt}_${input_code}
folders=${filename}_folders
dcountfile=${filename}_dir_cnt
aws s3 ls s3://${input} --recursive | awk '{print $4}' > $filename
cat $filename |awk -F"$pathfinder" '{print $2}'| awk 'BEGIN{FS=OFS="/"}{NF--; print}'| sort -n | uniq > $folders
#grep -oP '(?<="$input_code" ).*'
fcount=`cat ${filename} | wc -l`
awk 'BEGIN{FS="/"}
{ if (NF > maxNF)
{
for (i = maxNF + 1; i <= NF; i++)
count[i] = 1;
maxNF = NF;
}
for (i = 1; i <= NF; i++)
{
if (col[i] != "" && $i != col[i])
count[i]++;
col[i] = $i;
}
}
END {
for (i = 1; i <= maxNF; i++)
print count[i];
}' $folders > $dcountfile
dcount=$(cat $dcountfile | xargs | awk '{for(i=t=0;i<NF;) t+=$++i; $0=t}1' )
printf "Bucket name : \e[1;31m $s3bucket \e[0m\n" | tee -a ${filename}.out
printf "DB name : \e[1;31m $db_name \e[0m\n" | tee -a ${filename}.out
printf "Given folder path : \e[1;31m $input \e[0m\n" | tee -a ${filename}.out
printf "The number of folders in the given directory are\e[1;31m $dcount \e[0m\n" | tee -a ${filename}.out
printf "The number of files in the given directory are\e[1;31m $fcount \e[0m\n" | tee -a ${filename}.out
end=$SECONDS
elapsed=$((end - start))
printf '\n*** Script completed in %d:%02d:%02d - Elapsed %d:%02d:%02d ***\n' \
$((end / 3600)) $((end / 60 % 60)) $((end % 60)) \
$((elapsed / 3600)) $((elapsed / 60 % 60)) $((elapsed % 60)) | tee -a ${filename}.out
exit 0
fi
Your question is not clear.
If we count unique relative folder paths in the list provided, there are 12:
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6
my-bucket/foo1/foo2/foo3/foo4/foo6
my-bucket/foo1/foo2/foo3/foo4/foo5
my-bucket/foo1/foo2/foo3/foo4
my-bucket/foo1/foo2/foo3
my-bucket/foo1/foo2
my-bucket/foo1/foo8
my-bucket/foo1/foo9/foo10
my-bucket/foo1/foo9
my-bucket/foo1
my-bucket
The awk script to count this is:
BEGIN {FS = "/";} # set field deperator to "/"
{ # for each input line
commulativePath = OFS = ""; # reset commulativePath and OFS (Output Field Seperator) to ""
for (i = 1; i < NF; i++) { # loop all folders up to file name
if (i > 1) OFS = FS; # set OFS to "/" on second path
commulativePath = commulativePath OFS $i; # append current field to commulativePath variable
dirs[commulativePath] = 0; # insert commulativePath into an associative array dirs
}
}
END {
print NR " " length(dirs); # print records count, and associative array dirs length
}
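Saved to a file (the name count_paths.awk is just an example) and run on the captured listing, it prints the record (file) count followed by the number of unique paths:
awk -f count_paths.awk file_op.txt
7 12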
If we count unique folder names there are 11:
my-bucket
foo1
foo2
foo3
foo4
foo5
foo6
foo7
foo8
foo9
foo10
The awk script to count this is:
awk -F'/' '{for(i=1;i<NF;i++)dirs[$i]=1;}END{print NR " " length(dirs)}' input.txt
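Run on the listing in the question, it prints:
7 11
that is, 7 files and 11 distinct folder names (my-bucket, foo1, foo2 ... foo10).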
You have clarified that you wanted to count the unique names, ignoring the top two levels (my-bucket and foo1) and the last level (the file name).
perl -F/ -lane'
++$f;
++$d{ $F[$_] } for 2 .. $#F - 1;
END {
print "Number of files: ".( $f // 0 );
print "Number of dirs: ".( keys(%d) // 0 );
}
'
Output:
Number of files: 7
Number of dirs: 9
See also: Specifying file to process to Perl one-liner.
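For completeness, one possible invocation against the captured listing from the question (file_op.txt), with the one-liner collapsed onto a single line:
perl -F/ -lane '++$f; ++$d{$F[$_]} for 2 .. $#F - 1; END { print "Number of files: ".($f // 0); print "Number of dirs: ".(keys(%d) // 0) }' file_op.txt
Number of files: 7
Number of dirs: 9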
If you don't mind using a pipe and calling awk twice, then it's rather clean:
mawk 'BEGIN {OFS=ORS;FS="/";_^=_}_+_<NF && --NF~($_="")' file |
mawk 'NF {_[$__]} END { print length(_) }'

Remove duplicates from text file based on second text file

How can I remove all lines from a text file (main.txt) by checking a second text file (removethese.txt)? What is an efficient approach if the files are greater than 10-100 MB? [Using Mac]
Example:
main.txt
3
1
2
5
Remove these lines
removethese.txt
3
2
9
Output:
output.txt
1
5
Example Lines (these are the actual lines I'm working with - order does not matter):
ChIJW3p7Xz8YyIkRBD_TjKGJRS0
ChIJ08x-0kMayIkR5CcrF-xT6ZA
ChIJIxbjOykFyIkRzugZZ6tio1U
ChIJiaF4aOoEyIkR2c9WYapWDxM
ChIJ39HoPKDix4kRcfdIrxIVrqs
ChIJk5nEV8cHyIkRIhmxieR5ak8
ChIJs9INbrcfyIkRf0zLkA1NJEg
ChIJRycysg0cyIkRArqaCTwZ-E8
ChIJC8haxlUDyIkRfSfJOqwe698
ChIJxRVp80zpcEARAVmzvlCwA24
ChIJw8_LAaEEyIkR68nb8cpalSU
ChIJs35yqObit4kR05F4CXSHd_8
ChIJoRmgSdwGyIkRvLbhOE7xAHQ
ChIJaTtWBAWyVogRcpPDYK42-Nc
ChIJTUjGAqunVogR90Kc8hriW8c
ChIJN7P2NF8eVIgRwXdZeCjL5EQ
ChIJizGc0lsbVIgRDlIs85M5dBs
ChIJc8h6ZqccVIgR7u5aefJxjjc
ChIJ6YMOvOeYVogRjjCMCL6oQco
ChIJ54HcCsaeVogRIy9___RGZ6o
ChIJif92qn2YVogR87n0-9R5tLA
ChIJ0T5e1YaYVogRifrl7S_oeM8
ChIJwWGce4eYVogRcrfC5pvzNd4
There are two standard ways to do this:
With grep:
grep -vxFf removethese main
This uses:
-v to invert the match.
-x to match the whole line, to prevent, for example, he from matching lines like hello or highway to hell.
-F to use fixed strings, so that the parameter is taken as it is, not interpreted as a regular expression.
-f to get the patterns from another file. In this case, from removethese.
With awk:
$ awk 'FNR==NR {a[$0];next} !($0 in a)' removethese main
1
5
This way we store every line of removethese in an array a[]. Then, while reading the main file, we print only those lines that are not present in the array.
With grep:
grep -vxFf removethese.txt main.txt >output.txt
With fgrep:
fgrep -vxf removethese.txt main.txt >output.txt
fgrep is deprecated. fgrep --help says:
Invocation as 'fgrep' is deprecated; use 'grep -F' instead.
With awk (from @fedorqui):
awk 'FNR==NR {a[$0];next} !($0 in a)' removethese.txt main.txt >output.txt
With sed:
sed "s=^=/^=;s=$=$/d=" removethese.txt | sed -f- main.txt >output.txt
This will fail if removethese.txt contains special chars. For that you can do:
sed 's/[^^]/[&]/g; s/\^/\\^/g' removethese.txt >newremovethese.txt
and use this newremovethese.txt in the sed command. But this is not worth the effort, it's too slow compared to the other methods.
Test performed on the above methods:
The sed method takes too much time and is not worth testing.
Files Used:
removethese.txt : Size: 15191908 (15MB) Blocks: 29672 Lines: 100233
main.txt : Size: 27640864 (27.6MB) Blocks: 53992 Lines: 180034
Commands:
grep -vxFf | fgrep -vxf | awk
Time taken:
0m7.966s   | 0m7.823s   | 0m0.237s
0m7.877s   | 0m7.889s   | 0m0.241s
0m7.971s   | 0m7.844s   | 0m0.234s
0m7.864s   | 0m7.840s   | 0m0.251s
0m7.798s   | 0m7.672s   | 0m0.238s
0m7.793s   | 0m8.013s   | 0m0.241s
Average:
0m7.8782s  | 0m7.8468s  | 0m0.2403s
This test result implies that fgrep is a little bit faster than grep.
The awk method (from @fedorqui) passes the test with flying colors (0.2403 seconds only!).
Test Environment:
HP ProBook 440 G1 Laptop
8GB RAM
2.5GHz processor with turbo boost up to 3.1GHz
RAM being used: 2.1GB
Swap being used: 588MB
RAM being used when the grep/fgrep command is run: 3.5GB
RAM being used when the awk command is run: 2.2GB or less
Swap being used when the commands are run: 588MB (No change)
Test Result:
Use the awk method.
Here are a lot of simple and effective solutions I've found: http://www.catonmat.net/blog/set-operations-in-unix-shell-simplified/
You need to use one of the Set Complement commands below. 100 MB files can be processed within seconds or minutes.
Set Membership
$ grep -xc 'element' set # outputs 1 if element is in set
# outputs >1 if set is a multi-set
# outputs 0 if element is not in set
$ grep -xq 'element' set # returns 0 (true) if element is in set
# returns 1 (false) if element is not in set
$ awk '$0 == "element" { s=1; exit } END { exit !s }' set
# returns 0 if element is in set, 1 otherwise.
$ awk -v e='element' '$0 == e { s=1; exit } END { exit !s }'
Set Equality
$ diff -q <(sort set1) <(sort set2) # returns 0 if set1 is equal to set2
# returns 1 if set1 != set2
$ diff -q <(sort set1 | uniq) <(sort set2 | uniq)
# collapses multi-sets into sets and does the same as previous
$ awk '{ if (!($0 in a)) c++; a[$0] } END{ exit !(c==NR/2) }' set1 set2
# returns 0 if set1 == set2
# returns 1 if set1 != set2
$ awk '{ a[$0] } END{ exit !(length(a)==NR/2) }' set1 set2
# same as previous, requires >= gnu awk 3.1.5
Set Cardinality
$ wc -l set | cut -d' ' -f1 # outputs number of elements in set
$ wc -l < set
$ awk 'END { print NR }' set
Subset Test
$ comm -23 <(sort subset | uniq) <(sort set | uniq) | head -1
# outputs something if subset is not a subset of set
# does not output anything if subset is a subset of set
$ awk 'NR==FNR { a[$0]; next } { if (!($0 in a)) exit 1 }' set subset
# returns 0 if subset is a subset of set
# returns 1 if subset is not a subset of set
Set Union
$ cat set1 set2 # outputs union of set1 and set2
# assumes they are disjoint
$ awk 1 set1 set2 # ditto
$ cat set1 set2 ... setn # union over n sets
$ cat set1 set2 | sort -u # same, but assumes they are not disjoint
$ sort set1 set2 | uniq
$ sort -u set1 set2
$ awk '!a[$0]++' # ditto
Set Intersection
$ comm -12 <(sort set1) <(sort set2) # outputs the intersection of set1 and set2
$ grep -xF -f set1 set2
$ sort set1 set2 | uniq -d
$ join <(sort -n A) <(sort -n B)
$ awk 'NR==FNR { a[$0]; next } $0 in a' set1 set2
Set Complement
$ comm -23 <(sort set1) <(sort set2)
# outputs elements in set1 that are not in set2
$ grep -vxF -f set2 set1 # ditto
$ sort set2 set2 set1 | uniq -u # ditto
$ awk 'NR==FNR { a[$0]; next } !($0 in a)' set2 set1
Set Symmetric Difference
$ comm -3 <(sort set1) <(sort set2) | sed 's/\t//g'
# outputs elements that are in set1 or in set2 but not both
$ comm -3 <(sort set1) <(sort set2) | tr -d '\t'
$ sort set1 set2 | uniq -u
$ cat <(grep -vxF -f set1 set2) <(grep -vxF -f set2 set1)
$ grep -vxF -f set1 set2; grep -vxF -f set2 set1
$ awk 'NR==FNR { a[$0]; next } $0 in a { delete a[$0]; next } 1;
END { for (b in a) print b }' set1 set2
Power Set
$ p() { [ $# -eq 0 ] && echo || (shift; p "$@") |
while read r ; do echo -e "$1 $r\n$r"; done }
$ p `cat set`
# no nice awk solution, you are welcome to email me one:
# peter@catonmat.net
Set Cartesian Product
$ while read a; do while read b; do echo "$a, $b"; done < set1; done < set2
$ awk 'NR==FNR { a[$0]; next } { for (i in a) print i, $0 }' set1 set2
Disjoint Set Test
$ comm -12 <(sort set1) <(sort set2) # does not output anything if disjoint
$ awk '++seen[$0] == 2 { exit 1 }' set1 set2 # returns 0 if disjoint
# returns 1 if not
Empty Set Test
$ wc -l < set # outputs 0 if the set is empty
# outputs >0 if the set is not empty
$ awk '{ exit 1 }' set # returns 0 if set is empty, 1 otherwise
Minimum
$ head -1 <(sort set) # outputs the minimum element in the set
$ awk 'NR == 1 { min = $0 } $0 < min { min = $0 } END { print min }'
Maximum
$ tail -1 <(sort set) # outputs the maximum element in the set
$ awk '$0 > max { max = $0 } END { print max }'
I like @fedorqui's use of awk for setups where one has enough memory to fit all the "remove these" lines: a concise expression of an in-memory approach.
But for a scenario where the size of the lines to remove is large relative to current memory, and reading that data into an in-memory data structure is an invitation to fail or thrash, consider an ancient approach: sort/join
sort main.txt > main_sorted.txt
sort removethese.txt > removethese_sorted.txt
join -t '' -v 1 main_sorted.txt removethese_sorted.txt > output.txt
Notes:
this does not preserve the order from main.txt: lines in output.txt will be sorted
it requires enough disk to be present to let sort do its thing (temp files), and store same-size sorted versions of the input files
having join's -v option do just what we want here - print "unpairable" from file 1, drop matches - is a bit of serendipity
it does not directly address locales, collating, keys, etc. - it relies on defaults of sort and join (-t with an empty argument) to match sort order, which happen to work on my current machine
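If the original order of main.txt matters, one possible refinement (a sketch, assuming the lines contain no embedded whitespace, which holds for the example IDs) is to decorate each line with its line number, sort by content, join, then restore the order and strip the numbers:
awk '{ print NR, $0 }' main.txt | sort -k2 > main_numbered.txt
sort removethese.txt > removethese_sorted.txt
join -1 2 -2 1 -v 1 main_numbered.txt removethese_sorted.txt | sort -k2,2n | cut -d' ' -f1 > output.txt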

Sed : removing lines where pattern result also appears elsewhere

Let's suppose I have this sample
foo/bar/123-465.txt
foo/bar/456-781.txt
foo/bar/102-445.txt
foo/bar/123-721.txt
I want to remove every line where the regex /[0-9]*- result also appears on another line. In other terms: I want to remove every line whose file prefix is present more than once in my file.
Therefore only keeping :
foo/bar/456-781.txt
foo/bar/102-445.txt
I bet sed can do this, but how ?
Ok I misunderstood your problem, here's how to do it:
grep -vf <(grep -o '/[0-9]*-' file | sort | uniq -d) file
In action:
cat file
foo/bar/123-465.txt
foo/bar/456-781.txt
foo/bar/102-445.txt
foo/bar/123-721.txt
grep -vf <(grep -o '/[0-9]*-' file | sort | uniq -d) file
foo/bar/456-781.txt
foo/bar/102-445.txt
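The process substitution builds the list of duplicated prefixes that grep -vf then uses as exclusion patterns; for the sample file it contains just one entry:
grep -o '/[0-9]*-' file | sort | uniq -d
/123-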
awk '
    match($0, "[0-9]*-") {
        id = substr($0, RSTART, RLENGTH)
        if (store[id])
            dup[id] = 1
        store[id] = $0
    }
    END {
        for (id in store) {
            if (! dup[id]) {
                print store[id]
            }
        }
    }
'
You can use the following awk script:
example.awk:
{
    # Get the value of interest (before the -)
    prefix = substr($3, 1, match($3, /\-/) - 1)
    # Increment the counter for this value (starting at 0)
    counter[prefix]++
    # Buffer the current line
    buffer[prefix] = $0
}
# At the end, print every line whose value of interest appeared just once
END {
    for (key in counter)
        if (counter[key] == 1)
            print buffer[key]
}
Execute it like this:
awk -F'/' -f example.awk input.file
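With the sample paths from the question saved in input.file, a run would look like this (the two lines may come out in either order, since awk's for-in iteration order is unspecified):
awk -F'/' -f example.awk input.file
foo/bar/456-781.txt
foo/bar/102-445.txt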

How to change upper case to lower case using find and awk and regular expressions

I need to change all filenames in a given folder. If there's an uppercase letter it needs to be changed to _lowercase, but the first letter is always just lowercased. Example:
/folder/FileNameOneTwo.txt -> /folder/file_name_one_two.txt
There is no need to save the filenames, only to print to the console.
The code:
find $1 -type f -print | awk '
BEGIN {
    FS = "/";
}
{
    split($NF, nazwa, ".");
}
{
    if (nazwa[1] ~ /([[:upper:]])[[:alnum:]]*/ ) {
        gsub(/[A-Z]/, "_&");
        sub(/_/, "");
        print tolower($nazwa[1])
    }
}
'
$ ls -1
FileNameOneTwo.txt
$ find -maxdepth 1 -type f -exec basename {} \; | sed 's/[A-Z]/_&/g2;s/.*/\L&/'
file_name_one_two.txt
With awk:
awk '{gsub(/[[:upper:]]/,"_&");sub(/^_/,"");print tolower($0)}'
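For example, fed the basenames from find in the same way as the sed answer above (only the transformed names are printed, nothing is renamed):
find /folder -type f -exec basename {} \; |
awk '{gsub(/[[:upper:]]/,"_&");sub(/^_/,"");print tolower($0)}'
file_name_one_two.txt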