Get the multilevel basename of a Path - regex

I am trying to write a program that is sort of similar to UNIX basename, except that I can control the level of its base.
For example, the program would perform tasks like the following:
$PROGRAM /PATH/TO/THE/FILE.txt 1
FILE.txt # returns the first level basename
$PROGRAM /PATH/TO/THE/FILE.txt 2
THE/FILE.txt # returns the second level basename
$PROGRAM /PATH/TO/THE/FILE.txt 3
TO/THE/FILE.txt # returns the third level basename
I was trying to write this in perl, and to quickly test my idea, I used the following command line script to obtain the second level basename, to no avail:
$ echo "/PATH/TO/THE/FILE.txt" | perl -ne '$rev=reverse $_; $rev=~s:((.*?/){2}).*:$2:; print scalar reverse $rev'
/THE
As you can see, it's only printing out the directory name and not the rest.
I feel this has to do with non-greedy matching and quantifiers, but my knowledge is lacking in that area.
If there is a more efficient way to do this in bash, please advise.

You will find that your own solution works fine if you use $1 in the substitution instead of $2. The captures are numbered in the order that their opening parentheses appear within the regex, and you want to retain the outermost capture. However, the code is less than elegant.
The File::Spec module is ideal for this purpose. It has been a core module with every release of Perl v5 and so shouldn't need installing.
use strict;
use warnings;
use File::Spec;
my @path = File::Spec->splitdir($ARGV[0]);
print File::Spec->catdir(splice @path, -$ARGV[1]), "\n";
Output:
E:\Perl\source>bnamen.pl /PATH/TO/THE/FILE.txt 1
FILE.txt
E:\Perl\source>bnamen.pl /PATH/TO/THE/FILE.txt 2
THE\FILE.txt
E:\Perl\source>bnamen.pl /PATH/TO/THE/FILE.txt 3
TO\THE\FILE.txt

A pure bash solution (with no checking of the number of arguments and all that):
#!/bin/bash
IFS=/ read -a a <<< "$1"
IFS=/ scratch="${a[*]:${#a[@]}-$2}"
echo "$scratch"
Done.
Works like this:
$ ./program /PATH/TO/THE/FILE.txt 1
FILE.txt
$ ./program /PATH/TO/THE/FILE.txt 2
THE/FILE.txt
$ ./program /PATH/TO/THE/FILE.txt 3
TO/THE/FILE.txt
$ ./program /PATH/TO/THE/FILE.txt 4
PATH/TO/THE/FILE.txt
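If you want the missing checks, here is a minimal sketch along the same lines (the usage and error messages are mine, not part of the original):
#!/bin/bash
# Same approach, with basic argument validation added.
if [ $# -ne 2 ]; then
    echo "usage: $0 PATH LEVEL" >&2
    exit 1
fi
IFS=/ read -a a <<< "$1"
if ! [[ $2 =~ ^[0-9]+$ ]] || [ "$2" -lt 1 ] || [ "$2" -gt "${#a[@]}" ]; then
    echo "LEVEL must be between 1 and ${#a[@]}" >&2
    exit 1
fi
IFS=/ scratch="${a[*]:${#a[@]}-$2}"
echo "$scratch"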

#!/bin/bash
[ $# -ne 2 ] && exit
input=$1
rdepth=$2
delim=/
[ $rdepth -lt 1 ] && echo "depth must be greater than zero" && exit
parts=$(echo -n "$input" | sed "s,[^$delim],,g" | wc -m)
[ $parts -lt 1 ] && echo "invalid path" && exit
[ $rdepth -gt $parts ] && echo "input has only $parts part(s)" && exit
depth=$((parts-rdepth+2))
echo "$input" | cut -d "$delim" -f$depth-
Usage:
$ ./level.sh /tmp/foo/bar 2
foo/bar
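Asking for the full depth returns everything after the leading slash, and an out-of-range depth trips the guard:
$ ./level.sh /tmp/foo/bar 3
tmp/foo/bar
$ ./level.sh /tmp/foo/bar 4
input has only 3 part(s)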

Here's a bash script to do it with awk:
#!/bin/bash
level=$1
awk -v lvl="$level" 'BEGIN { FS = OFS = "/" }
{
    count = NF - lvl + 1
    if (count < 1) {
        count = 1
    }
    while (count <= NF) {
        if (count > NF - lvl + 1) {
            printf "%s", OFS
        }
        printf "%s", $(count)
        count += 1
    }
    printf "\n"
}'
To use it, do:
$ ./script_name level < input_file
For example, if file input contains the line "/PATH/TO/THE/FILE.txt"
$ ./get_lvl_name 2 < input
THE/FILE.txt
$

As @tripleee said, split on the path delimiter ("/" for Unix-like) and then paste back together. For example:
echo "/PATH/TO/THE/FILE.txt" | perl -ne 'BEGIN{$n=shift} #p = split /\//; $start=($#p-$n+1<0?0:$#p-$n+1); print join("/",#p[$start..$#p])' 1
FILE.txt
echo "/PATH/TO/THE/FILE.txt" | perl -ne 'BEGIN{$n=shift} #p = split /\//; $start=($#p-$n+1<0?0:$#p-$n+1); print join("/",#p[$start..$#p])' 3
TO/THE/FILE.txt
Just for fun, here's one that will work on Unix and Windows (and any other) path types, if you provide the delimiter as the second argument:
# Unix-like
echo "PATH/TO/THE/FILE.txt" | perl -ne 'BEGIN{$n=shift;$d=shift} #p = split /\Q$d\E/; $start=($#p-$n+1<0?0:$#p-$n+1); print join($d,#p[$start..$#p])' 3 /
TO/THE/FILE.txt
# Wrong delimiter
echo "PATH/TO/THE/FILE.txt" | perl -ne 'BEGIN{$n=shift;$d=shift} #p = split /\Q$d\E/; $start=($#p-$n+1<0?0:$#p-$n+1); print join($d,#p[$start..$#p])' 3 \\
PATH/TO/THE/FILE.txt
# Windows
echo "C:\Users\Name\Documents\document.doc" | perl -ne 'BEGIN{$n=shift;$d=shift} #p = split /\Q$d\E/; $start=($#p-$n+1<0?0:$#p-$n+1); print join($d,#p[$start..$#p])' 3 \\
Name\Documents\document.doc
# Wrong delimiter
echo "C:\Users\Name\Documents\document.doc" | perl -ne 'BEGIN{$n=shift;$d=shift} #p = split /\Q$d\E/; $start=($#p-$n+1<0?0:$#p-$n+1); print join($d,#p[$start..$#p])' 3 /
C:\Users\Name\Documents\document.doc

Find regular expression in a file matching a given value

I have some basic knowledge of using regular expressions with grep (bash).
But I want to use regular expressions the other way around.
For example I have a file containing the following entries:
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
Now I want to use bash to figure out which line a particular number matches.
For example:
grep 8 file
should return:
line_three=[7-9]
Note: I am aware that the example of "grep 8 file" doesn't make sense, but I hope it helps to understand what I am trying to achieve.
Thanks for your help,
Marcel
As others have pointed out, awk is the right tool for this:
awk -F'=' '8~$2{print $0;}' file
... and if you want this tool to feel more like grep, a quick bash wrapper:
#!/bin/bash
awk -F'=' -v seek_value="$1" 'seek_value~$2{print $0;}' "$2"
Which would run like:
./not_exactly_grep.sh 8 file
line_three=[7-9]
My first impression is that this is not a task for grep, maybe for awk.
Trying to do things with grep I only see this:
for line in $(cat file); do echo 8 | grep "${line#*=}" && echo "${line%=*}" ; done
Using a while loop to read the file (following the comments):
while IFS= read -r line; do echo 8 | grep "${line#*=}" && echo "${line%=*}" ; done < file
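Note that the grep match itself is printed too, so for the value 8 and the sample file this outputs both the value and the key:
$ while IFS= read -r line; do echo 8 | grep "${line#*=}" && echo "${line%=*}" ; done < file
8
line_three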
This can be done in native bash using the syntax [[ $value =~ $regex ]] to test:
find_regex_matching() {
local value=$1
while IFS= read -r line; do # read from input line-by-line
[[ $line = *=* ]] || continue # skip lines not containing an =
regex=${line#*=} # prune everything before the = for the regex
if [[ $value =~ $regex ]]; then # test whether we match...
printf '%s\n' "$line" # ...and print if we do.
fi
done
}
...used as:
find_regex_matching 8 <file
...or, to test it with your sample input inline:
find_regex_matching 8 <<'EOF'
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
EOF
...which properly emits:
line_three=[7-9]
You could replace printf '%s\n' "$line" with printf '%s\n' "${line%%=*}" to print only the key (contents before the =), if so inclined. See the bash-hackers page on parameter expansion for a rundown on the syntax involved.
This is not built-in functionality of grep, but it's easy to do with awk, with a change in syntax:
/[0-3]/ { print "line one" }
/[4-6]/ { print "line two" }
/[7-9]/ { print "line three" }
If you really need to, you could programmatically change your input file to this syntax, if it doesn't contain any characters that need escaping (mainly / in the regex or " in the string):
sed -e 's#\(.*\)=\(.*\)#/\2/ { print "\1" }#'
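The generated program can then be fed straight back to awk, for example with process substitution (a sketch, assuming the sample file is named file):
$ echo 8 | awk -f <(sed -e 's#\(.*\)=\(.*\)#/\2/ { print "\1" }#' file)
line_three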
As I understand it, you are looking for a range that includes some value.
You can do this in gawk:
$ cat /tmp/file
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
$ awk -v n=8 'match($0, /([0-9]+)-([0-9]+)/, a){ if (a[1]<=n && a[2]>=n) print $0 }' /tmp/file
line_three=[7-9]
Since the digits are treated as numbers (rather than as a regex), it supports larger ranges:
$ cat /tmp/file
line_one=[0-3]
line_two=[4-6]
line_three=[75-95]
line_four=[55-105]
$ awk -v n=92 'match($0, /([0-9]+)-([0-9]+)/, a){ if (a[1]<=n && a[2]>=n) print $0 }' /tmp/file
line_three=[75-95]
line_four=[55-105]
If you are just looking to interpret the right hand side of the = as a regex, you can do:
$ awk -F= -v tgt=8 'tgt~$2' /tmp/file
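which, for the sample file, prints line_three=[7-9] (the default awk action for a matching record is to print it).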
You would like to do something like
grep -Ef <(cut -d= -f2 file) <(echo 8)
This will grep what you want but will not show where it matched.
With sed you can show some message:
echo "8" | sed -n '/[7-9]/ s/.*/Found it in line_three/p'
Now you would like to transform your regex file into such commands:
sed 's#\(.*\)=\(.*\)#/\2/ s/.*/Found at \1/p#' file
Store these commands in a virtual command file and you will have:
echo "8" | sed -nf <(sed 's#\(.*\)=\(.*\)#/\2/ s/.*/Found at \1/p#' file)

Comparing a word in a string with another in another string

I have a file with strings, like below:
ABCEF
RFGTH
ABCEF_ABCT
DRFRF_ABCT
LOIKH
LOIKH_DEFT
I need to extract the lines whose first word also appears on another line, even if one of them has something like _ABCT appended.
while IFS= read -r line
do
if [ $line == $line ];
then
echo "$line"
fi
done < "$file"
The output I want is:
ABCEF
ABCEF_ABCT
LOIKH
LOIKH_DEFT
I know I have a mistake in the if condition, but I've run out of ideas and I don't know how to get the outcome I need.
I would use awk to solve this problem:
awk -F_ '{ ++count[$1]; line[NR] = $0 }
END { for (i = 1; i <= NR; ++i) { split(line[i], a); if (count[a[1]] > 1) print line[i] } }' file
A count is kept of the first field of each line. Each line is saved to an array. Once the file is processed, any lines whose first part has a count greater than one are printed.
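Run against the sample file, this prints exactly the desired lines:
ABCEF
ABCEF_ABCT
LOIKH
LOIKH_DEFT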
for w in $(for wrd in $(grep -o "^[A-Z]*" abc.dat)
do
n=$(grep -c "$wrd" abc.dat)
if (( $n > 1 ))
then
echo $wrd
fi
done | uniq)
do
grep "$w" abc.dat
done
With grep -o, tokens matching "^[A-Z]*" are extracted from the beginning of each line (^), matching only A-Z (so everything before the _). Each token is then searched for again in the same file and counted (grep -c), and tokens occurring more than once are collected. uniq reduces them to a single occurrence each, and a final grep prints every line of the file that matches them.
Here's a pure Bash solution using arrays and associative arrays:
#!/bin/bash
IFS=_
declare -A seen
while read -r -a tokens
do
# ${tokens[0]} contains the first word before the underscore.
word="${tokens[0]}"
if [[ "${seen[$word]}" ]]
then
[[ "${seen[$word]}" -eq 1 ]] && echo "$word"
echo "${tokens[*]}"
(( seen["$word"]++ ))
else
seen["$word"]=1
fi
done < "$file"
Output:
ABCEF
ABCEF_ABCT
LOIKH
LOIKH_DEFT
One more answer, using sed:
#!/bin/bash
#set -x
counter=1;
while read -r line; do
((counter=counter+1))
var=$(sed -n -e "$counter,\$ s/$line/$line/p" file.txt)
if [ -n "$var" ]
then
echo $line
echo $var
fi
done < file.txt

How to select multiple lines from a file or from pipe in a script?

I'd like to have a script, called lines.sh, that I can pipe data to in order to select a series of lines.
For example, if I had the following file:
test.txt
a
b
c
d
Then I could run:
cat test.txt | lines 2,4
and it would output
b
d
I'm using zsh, but would prefer a bash solution if possible.
You can use this awk:
awk -v s='2,4' 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' file
b
d
Via a separate script lines.sh:
#!/bin/bash
awk -v s="$1" 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' "$2"
Then give execute permissions:
chmod +x lines.sh
And call it as:
./lines.sh '2,4' 'test.txt'
Try sed:
sed -n '2p; 4p' inputFile
-n tells sed to suppress output, but for the lines 2 and 4, the p (print) command is used to print these lines.
You can also use ranges, e.g.:
sed -n '2,4p' inputFile
Two pure Bash versions. Since you're looking for general and reusable solutions, you might as well put a little bit of effort into them. (Also, see the last section.)
Version 1
This script slurps the entire stdin into an array (using mapfile, so it's rather efficient) and then prints the lines specified on its arguments. Ranges are valid, e.g.,
1-4 # for lines 1, 2, 3 and 4
3- # for everything from line 3 till the end of the file
You may separate these by spaces or commas. The lines are printed exactly in the order the arguments are given:
lines 1 1,2,4,1-3,4- 1
will print line 1 twice, then line 2, then line 4, then lines 1, 2 and 3, then everything from line 4 till the end, and finally, line 1 again.
Here you go:
#!/bin/bash
lines=()
# Slurp stdin in array
mapfile -O1 -t lines
# Arguments:
IFS=', ' read -ra args <<< "$*"
for arg in "${args[#]}"; do
if [[ $arg = +([[:digit:]]) ]]; then
arg=$arg-$arg
fi
if [[ $arg =~ ([[:digit:]]+)-([[:digit:]]*) ]]; then
((from=10#${BASH_REMATCH[1]}))
((to=10#${BASH_REMATCH[2]:-$((${#lines[@]}))}))
((from==0)) && from=1
((to>=${#lines[@]})) && to=${#lines[@]}
((from<=to)) || printf >&2 'Argument %d-%d: lines not in increasing order\n' "$from" "$to"
for((i=from;i<=to;++i)); do
printf '%s\n' "${lines[i]}"
done
else
printf >&2 "Error in argument \`%s'.\n" "$arg"
fi
done
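For instance, with the test.txt from the question (assuming the script is saved as lines and made executable):
$ cat test.txt | ./lines 2,4
b
d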
Pro: It's really cool.
Con: Needs to read entire stream into memory. Not suitable for infinite streams.
Version 2
This version addresses the previous problem of infinite streams. But you'll lose the ability to repeat and reorder lines.
Same thing, ranges are allowed:
lines 1 1,4-6 9-
will print lines 1, 4, 5, 6, 9 and everything till the end. If the set of lines is bounded, it exits as soon as the last requested line has been read.
#!/bin/bash
lines=()
tillend=0
maxline=0
# Process arguments
IFS=', ' read -ra args <<< "$@"
for arg in "${args[@]}"; do
if [[ $arg = +([[:digit:]]) ]]; then
arg=$arg-$arg
fi
if [[ $arg =~ ([[:digit:]]+)-([[:digit:]]*) ]]; then
((from=10#${BASH_REMATCH[1]}))
((from==0)) && from=1
((tillend && from>=tillend)) && continue
if [[ -z ${BASH_REMATCH[2]} ]]; then
tillend=$from
continue
fi
((to=10#${BASH_REMATCH[2]}))
if ((from>to)); then
printf >&2 "Invalid lines order: %s\n" "$arg"
exit 1
fi
((maxline<to)) && maxline=$to
for ((i=from;i<=to;++i)); do
lines[i]=1
done
else
printf >&2 "Invalid argument \`%s'\n" "$arg"
exit 1
fi
done
# If nothing to read, exit
((tillend==0 && ${#lines[@]}==0)) && exit
# Now read stdin
linenb=0
while IFS= read -r line; do
((++linenb))
((tillend==0 && maxline && linenb>maxline)) && exit
if [[ ${lines[linenb]} ]] || ((tillend && linenb>=tillend)); then
printf '%s\n' "$line"
fi
done
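For instance (again assuming the script is saved as lines):
$ seq 1 12 | ./lines 1,4-6 9-
1
4
5
6
9
10
11
12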
Pro: It's really cool and doesn't read the full stream in memory.
Con: Can't repeat or reorder lines like Version 1. Speed is not its strongest point.
Further thoughts
If you really want an awesome general script that does what Version 1 and Version 2 do, and more, you should definitely consider using another language, e.g., Perl: you'll gain a lot (in particular speed), and you'll be able to add nice options that do much cooler stuff. It might be worth it in the long run, as you want a general and reusable script. You might even end up with a script that reads emails!
Disclaimer. I haven't thoroughly checked these scripts... so beware of bugs!
Well, provided that:
your file is small enough
you don't have any semicolon (or another specific character of your choice) in the file
you don't mind using multiple pipes
you could use something like:
cat test.txt | tr "\\n" ";" | cut -d';' -f2,4 | tr ";" "\\n"
Where -f2,4 indicates the lines you want to extract
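With the question's test.txt:
$ cat test.txt | tr "\\n" ";" | cut -d';' -f2,4 | tr ";" "\\n"
b
d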
Quick solution for you, friend.
Input:
test.txt
a
b
c
d
e
f
g
h
i
j
test.sh
lines () {
sed -n "$( echo "$@" | sed 's/[0-9]\+/&p;/g')"
}
cat test.txt | lines 1 5 10
Or if you want it as a standalone script:
lines.sh
IFS=',' read -a lines <<< "$1"; sed -n "$( echo "${lines[@]}" | sed 's/[0-9]\+/&p;/g')" "$2"
./lines.sh 1,5,10 test.txt
Output in both cases:
a
e
j
If this is a one-time operation and there aren't many lines to pick, you could use pick to manually select them:
cat test.txt | pick | ...
An interactive screen will open, allowing you to select what you want.
Try this:
file=$1
shift # the remaining arguments are the line numbers
for var in "$@"
do
sed -n "${var}p" "$file"
done
I created a script with one file parameter and an unlimited number of line-number parameters. You would call it like so:
lines txt 2 3 4 ...

Bash regex match spanning multiple lines

I'm trying to create a bash script that validates files. One of the requirements is that there has to be exactly one "2" in the file.
Here's my code at the moment:
regex1="[0-9b]*2[0-9b]*2[0-9b]*"
# This regex will match if there are at least two 2's in the file
if [[ ( $(cat "$file") =~ $regex1 ) ]]; then
# stuff to do when there's more than 1 "2"
fi
#...
regex2="^[013456789b]*$"
# This regex will match if there are no 2's in the file
if [[ ( $(cat "$file") =~ $regex2 ) ]]; then
# stuff to do when there are no 2's
fi
What I'm trying to do is match the following pieces:
654654654654
254654845845
845462888888
(because there are 2 2's in there, it should be matched)
987886546548
546546546848
654684546548
(because there are no 2's in there, it should be matched)
Any idea how I make it search all lines with the =~ operator?
Try using grep:
#!/bin/bash
file='input.txt'
n=$(grep -o '2' "$file" | wc -l)
# echo $n
if [[ $n -eq 1 ]]; then
echo 'Valid'
else
echo 'Invalid'
fi
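Note that grep -c would count matching lines rather than occurrences (a line containing 22 would count once), which is why the occurrences are split out with -o and counted with wc -l.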
How about this:
twocount=$(tr -dc '2' < input.txt | wc -c)
if (( twocount != 1 ))
then
# there was either no 2, or more than one 2
else
# exactly one 2
fi
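(tr only reads standard input, which is why the file is redirected in rather than passed as an argument.)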
Using anchors as you have been, match a string of non-2s, a 2, and another string of non-2s:
^[^2]*2[^2]*$
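A minimal sketch of plugging that anchored pattern into the question's [[ =~ ]] test (variable names taken from the question):
regex="^[^2]*2[^2]*$"
# [^2] also matches newlines, so this works on the whole file at once
if [[ $(cat "$file") =~ $regex ]]; then
    echo "exactly one 2"
fi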
A multiline regex match is indeed possible using awk with a null record separator.
Consider the code below:
awk '$0 ~ /^.*2.*2/ || $0 ~ /^[013456789\n]*$/' RS= file
654654654654
254654845845
845462888888
Take note of RS=, which puts awk into paragraph mode: multiple lines are joined into a single record $0 until a blank line is hit. The \n in the bracket expression lets the no-2s pattern match across those embedded newlines.

Delete everything except what is surrounded by ()

Let's say I have a file like this:
adsf(2)
af(3)
g5a(65)
aafg(1245)
a(3)df
How can I get only the numbers between ( and ) from this, using bash?
A couple of solutions come to mind. Some of them handle the empty lines correctly, others don't. It's trivial to remove those, though, using either grep -v '^$' or sed '/^$/d'.
sed
sed 's|.*(\([0-9]\+\).*|\1|' input
awk
awk -F'[()]' '/./{print $2}' input
2
3
65
1245
3
pure bash
#!/bin/bash
IFS="()"
while read a b; do
if [ -z "$b" ]; then
continue
fi
echo $b
done < input
and finally, using tr
cat input | tr -d 'a-z()'
while read line; do
if [ -z "$line" ]; then
continue
fi
line=${line#*(}
line=${line%)*}
echo $line
done < file
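Here ${line#*(} removes the shortest prefix matching *( (everything up to and including the first opening parenthesis), and ${line%)*} removes the shortest suffix matching )* (the last closing parenthesis and everything after it).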
Positive lookbehind and lookahead:
$ echo $'a1b(2)c\nd3e(456)fg7' | grep -Poe '(?<=\()[0-9]*(?=\))'
2
456
Another one:
while read line ; do
[[ $line =~ .*\(([[:digit:]]+)\).* ]] && echo "${BASH_REMATCH[1]}"
done < file
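With the question's sample file this prints 2, 3, 65, 1245 and 3, one per line.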