Comparing a word in a string with another in another string - regex

I have a file with strings, like below:
ABCEF
RFGTH
ABCEF_ABCT
DRFRF_ABCT
LOIKH
LOIKH_DEFT
I need to extract the lines whose first word also appears on another line, ignoring any suffix such as _ABCT at the end.
while IFS= read -r line
do
if [ $line == $line ];
then
echo "$line"
fi
done < "$file"
The output I want is:
ABCEF
ABCEF_ABCT
LOIKH
LOIKH_DEFT
I know I have a mistake in the if branch, but I have run out of options and I don't know how to get the outcome I need.

I would use awk to solve this problem:
awk -F_ '{ ++count[$1]; line[NR] = $0 }
END { for (i = 1; i <= NR; ++i) { split(line[i], a); if (count[a[1]] > 1) print line[i] } }' file
A count is kept of the first field of each line. Each line is saved to an array. Once the file is processed, any lines whose first part has a count greater than one are printed.
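The same counting idea can also be sketched as a two-pass awk call that reads the file twice instead of saving every line in an array. This is only a sketch: the function name shared_prefix_lines is made up for illustration, and it assumes the input is a regular, non-empty file that can be opened twice (not a pipe).

```shell
# Hypothetical helper: print the lines whose text before the first "_"
# occurs on more than one line of the file given as $1.
shared_prefix_lines() {
    # Pass 1 (NR == FNR): count each first field.
    # Pass 2: print the lines whose first field was counted more than once.
    awk -F_ 'NR == FNR { count[$1]++; next } count[$1] > 1' "$1" "$1"
}
```

Usage would be shared_prefix_lines "$file". The trade-off is reading the file twice in exchange for not holding its contents in memory.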

for w in $(for wrd in $(grep -o "^[A-Z]*" abc.dat)
do
n=$(grep -c $wrd abc.dat)
if (( $n > 1 ))
then
echo $wrd
fi
done | uniq)
do
grep $w abc.dat
done
grep -o extracts the token matching "^[A-Z]*" from the beginning of each line (^), that is, only the letters A-Z before any _. Each token is then searched for in the same file and counted (grep -c), and tokens with a count greater than 1 are collected. uniq removes adjacent duplicate tokens, so the outer loop greps the file only once per token to print all of its matches.

Here's a pure Bash solution using arrays and associative arrays:
#!/bin/bash
IFS=_
declare -A seen
while read -r -a tokens
do
# ${tokens[0]} contains the first word before the underscore.
word="${tokens[0]}"
if [[ "${seen[$word]}" ]]
then
[[ "${seen[$word]}" -eq 1 ]] && echo "$word"
echo "${tokens[*]}"
(( seen["$word"]++ ))
else
seen["$word"]=1
fi
done < "$file"
Output:
ABCEF
ABCEF_ABCT
LOIKH
LOIKH_DEFT

One more answer using sed
#!/bin/bash
#set -x
counter=1;
while read line ; do
((counter=counter+1))
var=$(sed -n -e "$counter,\$ s/$line/$line/p" file.txt)
if [ -n "$var" ]
then
echo $line
echo $var
fi
done < file.txt

Related

Check if a string contains valid pattern in Bash

I have a file a.txt contains a string like:
Axxx-Bxxxx
Rules for checking if it is valid or not include:
length is 10 characters.
x here is digits only.
Then, I try to check with:
#!/bin/bash
exp_len=10;
file=a.txt;
msg="checking string";
tmp="File not exist";
echo $msg;
if[ -f $file];then
tmp=$(cat $file);
if[[${#tmp} != $exp_len ]];then
msg="invalid length";
elif [[ $tmp =~ ^[A[0-9]{3}-B[0-9]{4}]$]];then
msg="valid";
else
msg="invalid";
fi
else
msg="file not exist";
fi
echo $msg;
But in the valid case it doesn't work...
Can someone help me correct it?
Thanks :)
Besides the regex fix, your code has syntax errors and can be refactored. Consider this code:
file="a.txt"
msg="checking string"
tmp="File not exist"
echo "$msg"
if [[ -f $file ]]; then
s="$(<$file)"
if [[ $s =~ ^A[0-9]{3}-B[0-9]{4}$ ]]; then
msg="valid"
else
msg="invalid"
fi
else
msg="file not exist"
fi
echo "$msg"
Changes are:
Remove unnecessary cat
Use [[ ... ]] when using bash
Spaces inside [[ ... ]] are required (your code was missing them)
There is no need to check for a length of 10, as the anchored regex already enforces it
As mentioned in the comments earlier, the correct regex is ^A[0-9]{3}-B[0-9]{4}$ or ^A[[:digit:]]{3}-B[[:digit:]]{4}$
Note that a regex like ^[A[0-9]{3}-B[0-9]{4}]$ matches
^ - start of string
[A[0-9]{3} - three occurrences of A, [ or a digit
-B - a -B string
[0-9]{4} - four digits
] - a ] char
$ - end of string.
So, it matches strings like [A[-B1234], [[[-B1939], etc.
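To see the difference described above, here is a small bash sketch that runs both patterns against a valid string and against one of the odd strings the broken pattern accepts. The variable names broken and fixed and the helper matches are made up for this illustration; it assumes bash, since it relies on [[ ... =~ ... ]].

```shell
# Bash sketch: the pattern from the question versus the corrected one.
broken='^[A[0-9]{3}-B[0-9]{4}]$'   # the leading "[" opens a bracket expression
fixed='^A[0-9]{3}-B[0-9]{4}$'

# Hypothetical helper: succeed if string $1 matches regex $2.
matches() {
    # $2 is left unquoted on purpose so bash treats it as a regex.
    [[ $1 =~ $2 ]]
}

for s in 'A123-B1234' '[A[-B1234]'; do
    matches "$s" "$broken" && b=yes || b=no
    matches "$s" "$fixed" && f=yes || f=no
    printf '%-12s broken=%s fixed=%s\n' "$s" "$b" "$f"
done
```

Keeping the regex in a variable also sidesteps quoting problems inside [[ ... ]].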
Your regex checking line must look like
if [[ $tmp =~ ^A[0-9]{3}-B[0-9]{4}$ ]];then
See the online demo:
#!/bin/bash
tmp="A123-B1234";
if [[ $tmp =~ ^A[0-9]{3}-B[0-9]{4}$ ]];then
msg="valid";
else
msg="invalid";
fi
echo $msg;
Output:
valid
Using just grep might be easier:
$ echo A123-B1234 > valid.txt
$ echo 123 > invalid.txt
$ grep -Pq '^A\d{3}-B\d{4}$' valid.txt && echo valid || echo invalid
valid
$ grep -Pq '^A\d{3}-B\d{4}$' invalid.txt && echo valid || echo invalid
invalid
With your shown samples and attempts, please try following code also.
#!/bin/bash
exp_len=10;
file=a.txt;
msg="checking string";
tmp="File not exist";
if [[ -f "$file" ]]
then
echo "File named $file is existing.."
awk '/^A[0-9]{3}-B[0-9]{4}$/{print "valid";next} {print "invalid"}' "$file"
else
echo "Please do check File named $file is not existing, exiting from script now..."
exit 1;
fi
Or, in case each line of your input file must also be exactly 10 characters long (judging by the exp_len shell variable in the OP's attempt), try the following code, which adds a length condition to the awk code.
#!/bin/bash
exp_len=10;
file=a.txt;
msg="checking string";
tmp="File not exist";
if [[ -f "$file" ]]
then
echo "File named $file is existing.."
awk -v len="$exp_len" 'length($0) == len && /^A[0-9]{3}-B[0-9]{4}$/{print "valid";next} {print "invalid"}' "$file"
else
echo "Please do check File named $file is not existing, exiting from script now..."
exit 1;
fi
NOTE: I am using the -f flag here to test whether the file exists; you can change it to -s, e.g. -s "$file", if you want to check that the file exists and is not empty.
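As a rough illustration of the difference between -f and -s, the sketch below classifies a path using both tests. The function name file_state is hypothetical and only serves the example.

```shell
# Hypothetical helper: report how -f and -s see the path given as $1.
file_state() {
    if [ ! -f "$1" ]; then
        echo "missing"      # not a regular file (or does not exist)
    elif [ ! -s "$1" ]; then
        echo "empty"        # regular file with size zero
    else
        echo "non-empty"    # regular file with at least one byte
    fi
}
```

For example, file_state a.txt could be checked before running awk, so that an empty file is reported separately from an invalid string.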

Pattern matching in if statement in bash

I'm trying to count the words with at least two vowels in all the .txt files in the directory. Here's my code so far:
#!/bin/bash
wordcount=0
for i in $HOME/*.txt
do
cat $i |
while read line
do
for w in $line
do
if [[ $w == .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
then
wordcount=`expr $wordcount + 1`
echo $w ':' $wordcount
else
echo "In else"
fi
done
done
echo $i ':' $wordcount
wordcount=0
done
Here is my sample from a txt file
Last modified: Sun Aug 20 18:18:27 IST 2017
To remove PPAs
sudo apt-get install ppa-purge
sudo ppa-purge ppa:
The problem is it doesn't match the pattern in the if statement for all the words in the text file. It goes directly to the else statement. And secondly, the wordcount in echo $i ':' $wordcount is equal to 0 which should be some value.
Immediate Issue: Glob vs Regex
[[ $string = $pattern ]] doesn't perform regex matching; instead, it's a glob-style pattern match. While . means "any character" in regex, it matches only itself in glob.
You have a few options here:
Use =~ instead to perform regular expression matching:
[[ $w =~ .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
Use a glob-style expression instead of a regex:
[[ $w = *[aeiouAEIOU]*[aeiouAEIOU]* ]]
Note the use of = rather than == here; while either is technically valid, the former avoids building finger memory that would lead to bugs when writing code for a POSIX implementation of test / [, as = is the only valid string comparison operator there.
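The three variants can be compared side by side with a short bash sketch. The function names glob_dots, regex_two and glob_two are invented for this example, and it assumes bash because of [[ ... ]].

```shell
# The question's pattern, evaluated as a glob: each "." must match a literal dot.
glob_dots() { [[ $1 == .*[aeiouAEIOU].*[AEIOUaeiou].* ]]; }
# Regex match: "." means any character, no anchoring needed.
regex_two() { [[ $1 =~ [aeiouAEIOU].*[aeiouAEIOU] ]]; }
# Glob match: "*" means any run of characters.
glob_two()  { [[ $1 = *[aeiouAEIOU]*[aeiouAEIOU]* ]]; }

for w in sudo Sun .a.e.; do
    for check in glob_dots regex_two glob_two; do
        "$check" "$w" && r=match || r="no match"
        printf '%-6s %-10s %s\n' "$w" "$check" "$r"
    done
done
```

The word sudo fails the first check only because it contains no literal dots, which is why the original loop always fell through to the else branch.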
Larger Issue: Properly Reading Word-By-Word
Using for w in $line is innately unsafe. Use read -a to read a line into an array of words:
#!/usr/bin/env bash
wordcount=0
for i in "$HOME"/*.txt; do
while read -r -a words; do
for word in "${words[@]}"; do
if [[ $word = *[aeiouAEIOU]*[aeiouAEIOU]* ]]; then
(( ++wordcount ))
fi
done
done <"$i"
printf '%s: %s\n' "$i" "$wordcount"
wordcount=0
done
Try:
awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
Sample output looks like:
$ awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
one.txt:1
sample.txt:9
How it works:
/[aeiouAEIOU].*[AEIOUaeiou]/{n++}
Every time we find a word with two vowels, we increment variable n.
ENDFILE{print FILENAME":"n; n=0}
At the end of each file, we print the name of the file and the 2-vowel word count n. We then reset n to zero.
RS='[[:space:]]'
This tells awk to use any whitespace as a word separator. This makes each word into a record. Awk reads the input one record at a time.
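ENDFILE is a GNU awk extension, and POSIX leaves the behavior of a multi-character RS unspecified, so the one-liner may not run on every awk. As a sketch for other awks, the same count can be made by looping over the whitespace-separated fields of each line. The function name count_two_vowel_words is an assumption for illustration, and it handles one file per call.

```shell
# Hypothetical helper: count the words in file $1 that contain at least two vowels.
count_two_vowel_words() {
    awk '{ for (i = 1; i <= NF; i++)
               if ($i ~ /[aeiouAEIOU].*[aeiouAEIOU]/) n++ }
         END { print n + 0 }' "$1"   # n + 0 prints 0 when nothing matched
}
```

A per-file report could then be produced with a loop such as for f in ./*.txt; do printf '%s:%s\n' "$f" "$(count_two_vowel_words "$f")"; done.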
Shell issues
The use of awk avoids a multitude of shell issues. For example, consider the line for w in $line. This will not work the way you hope. Consider a directory with these files:
$ ls
one.txt sample.txt
Now, let's take line='* Item One' and see what happens:
$ line='* Item One'
$ for w in $line; do echo "w=$w"; done
w=one.txt
w=sample.txt
w=Item
w=One
The shell treats the * in line as a wildcard and expands it into a list of files. Odds are you didn't want this. The awk solution avoids a variety of issues like this.
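If you do want to stay in the shell, one way to keep the unquoted expansion from globbing is to switch off pathname expansion with set -f around the loop. This is a different technique from the awk approach, shown only as a sketch; the function name print_words is hypothetical.

```shell
# Hypothetical helper: split $1 on whitespace without expanding wildcards.
print_words() {
    set -f                       # disable pathname expansion (globbing)
    for w in $1; do              # still unquoted: word splitting is wanted here
        printf 'w=%s\n' "$w"
    done
    set +f                       # re-enable globbing (assumes it was on before)
}
```

With this, print_words '* Item One' reports the * as a word of its own rather than as a list of file names.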
Using grep - this is pretty simple to do.
#!/bin/bash
wordcount=0
for file in ./*.txt
do
count=`cat $file | xargs -n1 | grep -ie "[aeiou].*[aeiou]" | wc -l`
wordcount=`expr $wordcount + $count`
done
echo $wordcount

In bash, how can I check a string for partials in an array?

If I have a string:
s='path/to/my/foo.txt'
and an array
declare -a include_files=('foo.txt' 'bar.txt');
how can I check the string for matches in my array efficiently?
You could loop through the array and use a bash substring check
for file in "${include_files[@]}"
do
if [[ $s = *${file} ]]; then
printf "%s\n" "$file"
fi
done
Alternatively, if you want to avoid the loop and you only care whether a file name matches or not, you could use the @ form of bash extended globbing. The following example assumes that array file names do not contain |.
shopt -s extglob
declare -a include_files=('foo.txt' 'bar.txt');
s='path/to/my/foo.txt'
printf -v pat "%s|" "${include_files[@]}"
pat="${pat%|}"
printf "%s\n" "${pat}"
#prints foo.txt|bar.txt
if [[ ${s##*/} = @(${pat}) ]]; then echo yes; fi
For an exact match to the file name:
#!/bin/bash
s="path/to/my/foo.txt";
ARR=('foo.txt' 'bar.txt');
for str in "${ARR[@]}";
do
# if [ $(echo "$s" | awk -F"/" '{print $NF}') == "$str" ]; then
if [ $(basename "$s") == "$str" ]; then # A better option than awk for sure...
echo "match";
else
echo "no match";
fi;
done
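The loop above prints one line per array element. If a single yes/no answer is enough, a sketch like the following wraps the comparison in a function and stops at the first hit. The name basename_in_list is made up, and it uses the ${1##*/} expansion instead of calling basename, which assumes the path does not end in a slash.

```shell
# Hypothetical helper: succeed if the last path component of $1
# equals one of the remaining arguments.
basename_in_list() {
    name=${1##*/}                # strip everything up to the last "/"
    shift
    for candidate in "$@"; do
        [ "$name" = "$candidate" ] && return 0
    done
    return 1
}

basename_in_list 'path/to/my/foo.txt' foo.txt bar.txt && echo "match"
```

In bash the candidates can be passed straight from the array, as in basename_in_list "$s" "${include_files[@]}".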

Get the multilevel basename of a Path

I am trying to write a program that is sort of similar to UNIX basename, except I can control the level of its base.
For example, the program would perform tasks like the following:
$PROGRAM /PATH/TO/THE/FILE.txt 1
FILE.txt # returns the first level basename
$PROGRAM /PATH/TO/THE/FILE.txt 2
THE/FILE.txt #returns the second level basename
$PROGRAM /PATH/TO/THE/FILE.txt 3
TO/THE/FILE.txt #returns the third level base name
I was trying to write this in perl, and to quickly test my idea, I used the following command line script to obtain the second level basename, to no avail:
$ echo "/PATH/TO/THE/FILE.txt" | perl -ne '$rev=reverse $_; $rev=~s:((.*?/){2}).*:$2:; print scalar reverse $rev'
/THE
As you can see, it's only printing out the directory name and not the rest.
I feel this has to do with nongreedy matching with quantifier or what not, but my knowledge lacks in that area.
If there is more efficient way to do this in bash, please advise
You will find that your own solution works fine if you use $1 in the substitution instead of $2. The captures are numbered in the order that their opening parentheses appear within the regex, and you want to retain the outermost capture. However the code is less than elegant.
The File::Spec module is ideal for this purpose. It has been a core module with every release of Perl v5 and so shouldn't need installing.
use strict;
use warnings;
use File::Spec;
my @path = File::Spec->splitdir($ARGV[0]);
print File::Spec->catdir(splice @path, -$ARGV[1]), "\n";
output
E:\Perl\source>bnamen.pl /PATH/TO/THE/FILE.txt 1
FILE.txt
E:\Perl\source>bnamen.pl /PATH/TO/THE/FILE.txt 2
THE\FILE.txt
E:\Perl\source>bnamen.pl /PATH/TO/THE/FILE.txt 3
TO\THE\FILE.txt
A pure bash solution (with no checking of the number of arguments and all that):
#!/bin/bash
IFS=/ read -a a <<< "$1"
IFS=/ scratch="${a[*]:${#a[@]}-$2}"
echo "$scratch"
Done.
Works like this:
$ ./program /PATH/TO/THE/FILE.txt 1
FILE.txt
$ ./program /PATH/TO/THE/FILE.txt 2
THE/FILE.txt
$ ./program /PATH/TO/THE/FILE.txt 3
TO/THE/FILE.txt
$ ./program /PATH/TO/THE/FILE.txt 4
PATH/TO/THE/FILE.txt
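For comparison, here is a sketch that gets the same results without arrays, by peeling components off the end of the path with parameter expansion. The function name basename_n is hypothetical. It assumes the second argument is a positive integer and that the path has no trailing slash, and, like the script above, it simply returns the whole path (minus the leading slash) when the level exceeds the number of components.

```shell
# Hypothetical helper: print the last $2 components of path $1.
basename_n() {
    path=$1
    n=$2
    result=
    while [ "$n" -gt 0 ] && [ -n "$path" ]; do
        part=${path##*/}                      # last component
        result=$part${result:+/$result}       # prepend it to what we have
        case $path in
            */*) path=${path%/*} ;;           # drop the component just taken
            *)   path= ;;                     # no slash left: nothing remains
        esac
        n=$((n - 1))
    done
    printf '%s\n' "$result"
}

basename_n /PATH/TO/THE/FILE.txt 2
```

Note that the variables are global, since plain POSIX sh has no local; in bash they could be declared with local.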
#!/bin/bash
[ $# -ne 2 ] && exit
input=$1
rdepth=$2
delim=/
[ $rdepth -lt 1 ] && echo "depth must be greater than zero" && exit
parts=$(echo -n $input | sed "s,[^$delim],,g" | wc -m)
[ $parts -lt 1 ] && echo "invalid path" && exit
[ $rdepth -gt $parts ] && echo "input has only $parts part(s)" && exit
depth=$((parts-rdepth+2))
echo $input | cut -d "$delim" -f$depth-
Usage:
$ ./level.sh /tmp/foo/bar 2
foo/bar
Here's a bash script to do it with awk:
#!/bin/bash
level=$1
awk -v lvl=$level 'BEGIN{FS=OFS="/"}
{count=NF-lvl+1;
if (count < 1) {
count=1;
}
while (count <= NF) {
if (count > NF-lvl+1 ) {
printf "%s", OFS;
}
printf "%s", $(count);
count+=1;
}
printf "\n";
}'
To use it, do:
$ ./script_name num_args input_file
For example, if file input contains the line "/PATH/TO/THE/FILE.txt"
$ ./get_lvl_name 2 < input
THE/FILE.txt
$
As @tripleee said, split on the path delimiter ("/" for Unix-like) and then paste back together. For example:
echo "/PATH/TO/THE/FILE.txt" | perl -ne 'BEGIN{$n=shift} @p = split /\//; $start=($#p-$n+1<0?0:$#p-$n+1); print join("/",@p[$start..$#p])' 1
FILE.txt
echo "/PATH/TO/THE/FILE.txt" | perl -ne 'BEGIN{$n=shift} @p = split /\//; $start=($#p-$n+1<0?0:$#p-$n+1); print join("/",@p[$start..$#p])' 3
TO/THE/FILE.txt
Just for fun, here's one that will work on Unix and Windows (and any other) path types, if you provide the delimiter as the second argument:
# Unix-like
echo "PATH/TO/THE/FILE.txt" | perl -ne 'BEGIN{$n=shift;$d=shift} @p = split /\Q$d\E/; $start=($#p-$n+1<0?0:$#p-$n+1); print join($d,@p[$start..$#p])' 3 /
TO/THE/FILE.txt
# Wrong delimiter
echo "PATH/TO/THE/FILE.txt" | perl -ne 'BEGIN{$n=shift;$d=shift} @p = split /\Q$d\E/; $start=($#p-$n+1<0?0:$#p-$n+1); print join($d,@p[$start..$#p])' 3 \\
PATH/TO/THE/FILE.txt
# Windows
echo "C:\Users\Name\Documents\document.doc" | perl -ne 'BEGIN{$n=shift;$d=shift} @p = split /\Q$d\E/; $start=($#p-$n+1<0?0:$#p-$n+1); print join($d,@p[$start..$#p])' 3 \\
Name\Documents\document.doc
# Wrong delimiter
echo "C:\Users\Name\Documents\document.doc" | perl -ne 'BEGIN{$n=shift;$d=shift} @p = split /\Q$d\E/; $start=($#p-$n+1<0?0:$#p-$n+1); print join($d,@p[$start..$#p])' 3 /
C:\Users\Name\Documents\document.doc

Delete everything except all surrounded by ()

Let's say i have file like this
adsf(2)
af(3)
g5a(65)
aafg(1245)
a(3)df
How can i get from this only numbers between ( and ) ?
using BASH
A couple of solutions come to mind. Some of them handle empty lines correctly, others do not. Those are trivial to remove, though, using either grep -v '^$' or sed '/^$/d'.
sed
sed 's|.*(\([0-9]\+\).*|\1|' input
awk
awk -F'[()]' '/./{print $2}' input
2
3
65
1245
3
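If the empty and non-matching lines should be dropped in the same step, a sed variant along these lines may help. It is a sketch that uses -n with the p flag so only lines where the substitution succeeded are printed, and [0-9][0-9]* instead of the GNU-specific \+. The function name numbers_in_parens is made up for illustration.

```shell
# Hypothetical helper: print the digits found between "(" and ")" on each line,
# skipping lines that contain no such group. Reads the named files, or stdin.
numbers_in_parens() {
    # -n suppresses default output; the trailing p prints only substituted lines.
    sed -n 's/.*(\([0-9][0-9]*\)).*/\1/p' "$@"
}
```

It would be called as numbers_in_parens input. Because .* is greedy, a line with several parenthesised numbers yields the last one.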
pure bash
#!/bin/bash
IFS="()"
while read a b; do
if [ -z "$b" ]; then
continue
fi
echo $b
done < input
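and finally, using tr (this deletes letters and parentheses, so it is only correct when no digits occur outside the parentheses; g5a(65) comes out as 565)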
cat input | tr -d '[a-z()]'
while read line; do
if [ -z "$line" ]; then
continue
fi
line=${line#*(}
line=${line%)*}
echo $line
done < file
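The two expansions in that loop can be read in isolation in the sketch below. The function name inside_parens is hypothetical, and the parentheses are backslash-escaped only as a precaution.

```shell
# Hypothetical helper: print what lies between the first "(" and the last ")" of $1.
inside_parens() {
    s=${1#*\(}     # remove the shortest prefix ending in "(", i.e. up to the first "("
    s=${s%\)*}     # remove the shortest suffix starting with ")", i.e. from the last ")"
    printf '%s\n' "$s"
}

inside_parens 'aafg(1245)'
```

A string without parentheses comes back unchanged, so this approach does not validate that the content is numeric.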
Positive lookaround:
$ echo $'a1b(2)c\nd3e(456)fg7' | grep -Poe '(?<=\()[0-9]*(?=\))'
2
456
Another one:
while read line ; do
[[ $line =~ .*\(([[:digit:]]+)\).* ]] && echo "${BASH_REMATCH[1]}"
done < file