Bash regex match spanning multiple lines

I'm trying to create a bash script that validates files. One of the requirements is that there has to be exactly one "2" in the file.
Here's my code at the moment:
regex1="[0-9b]*2[0-9b]*2[0-9b]*"
# This regex will match if there are at least two 2's in the file
if [[ ( $(cat "$file") =~ $regex1 ) ]]; then
# stuff to do when there's more than 1 "2"
fi
#...
regex2="^[013456789b]*$"
# This regex will match if there are no 2's in the file
if [[ ( $(cat "$file") =~ $regex2 ) ]]; then
# stuff to do when there are no 2's
fi
What I'm trying to do is match the following pieces:
654654654654
254654845845
845462888888
(because there are 2 2's in there, it should be matched)
987886546548
546546546848
654684546548
(because there are no 2's in there, it should be matched)
Any idea how I make it search all lines with the =~ operator?

I'm trying to create a bash script that validates files. One of the
requirements is that there has to be exactly one "2" in the file.
Try using grep
#!/bin/bash
file='input.txt'
n=$(grep -o '2' "$file" | wc -l)
# echo $n
if [[ $n -eq 1 ]]; then
echo 'Valid'
else
echo 'Invalid'
fi

How about this:
# tr reads only from standard input, so the file has to be redirected in
twocount=$(tr -dc '2' < input.txt | wc -c)
if (( twocount != 1 ))
then
# there was either no 2, or more than one 2
:
else
# exactly one 2
:
fi

Using anchors as you've been, match a string of non-2s, a 2, and another string of non-2s.
^[^2]*2[^2]*$
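A minimal sketch of how that anchored pattern could be applied to a whole file with =~. In bash's regex matching, a negated bracket expression such as [^2] also matches newlines, so the complete file content can be tested in one go. The function name has_exactly_one_2 and the sample data are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the file contains exactly one "2".
has_exactly_one_2() {
    local content re
    content=$(<"$1")          # whole file in one string (trailing newlines are dropped)
    re='^[^2]*2[^2]*$'        # non-2s, a single 2, non-2s; [^2] also matches newlines
    [[ $content =~ $re ]]
}

tmp=$(mktemp)

# Three lines with a single 2 overall.
printf '654654654654\n154654845845\n845462888888\n' > "$tmp"
if has_exactly_one_2 "$tmp"; then one_result=valid; else one_result=invalid; fi

# Three lines with two 2's overall (the first sample from the question).
printf '654654654654\n254654845845\n845462888888\n' > "$tmp"
if has_exactly_one_2 "$tmp"; then two_result=valid; else two_result=invalid; fi

rm -f "$tmp"
echo "$one_result $two_result"
```

Keeping the pattern in a variable avoids quoting surprises on the right-hand side of =~.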

Multiline regex match is indeed possible using awk with null record separator.
Consider below code:
awk '$0 ~ /^.*2.*2/ || $0 ~ /^[013456789]*$/' RS= file
654654654654
254654845845
845462888888
Take note of RS=, which puts awk in paragraph mode: each run of lines up to the next blank line is read as a single multi-line record in $0.

Related

bash regex not working - works with online editors

Regex works with online editors but not in a bash script. Tried couple different ways
#!/bin/bash
echo -n "Your string> "
read String
regex='(?<!NOT.)TEST_34_TEST'
if [[ "$String" =~ ^(\?\<\!NOT\.)TEST_34_TEST ]]; then
echo Match
else
echo Non-Match
fi
if [[ "$String" =~ $regex ]]; then
echo Match
else
echo Non-Match
fi
I want to match each TEST_34_TEST that does not have NOT prefixed to it:
TEST_34_TEST,TEST_34_TEST,TEST_34_TEST -> should match all 3
TEST_34_TEST, NOT_TEST_34_TEST, TEST_34_TEST -> should match 2 values
NOT_TEST_34_TEST, TEST_34_TEST, TEST_34_TEST -> should match 2 values
Thanks in advance.
You can use GNU grep if you only want to know the number of matches (and not do anything with them)
for s in "TEST_34_TEST,TEST_34_TEST,TEST_34_TEST" "TEST_34_TEST, NOT_TEST_34_TEST, TEST_34_TEST" "NOT_TEST_34_TEST, TEST_34_TEST, TEST_34_TEST"; do
grep -noP '((?<!NOT.)TEST_34_TEST)' <<< "$s" | wc -l
done
This prints:
3
2
2
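For context, bash's =~ uses POSIX extended regular expressions, which have no lookbehind, so (?<!NOT.) cannot work there. If grep -P is not available, the count could be approximated in pure bash with parameter expansion. This is only a sketch and not part of the answer above; count_unprefixed is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count TEST_34_TEST occurrences that are not
# preceded by "NOT" plus one arbitrary character.
count_unprefixed() {
    local s=$1 needle='TEST_34_TEST'
    # Replace every "NOT?TEST_34_TEST" (glob ? = any one character) with "-",
    # a character that cannot be part of a later match of the needle.
    local kept=${s//NOT?$needle/-}
    # Delete the remaining needles and derive their count from the length difference.
    local rest=${kept//$needle/}
    echo $(( (${#kept} - ${#rest}) / ${#needle} ))
}

c1=$(count_unprefixed 'TEST_34_TEST,TEST_34_TEST,TEST_34_TEST')
c2=$(count_unprefixed 'TEST_34_TEST, NOT_TEST_34_TEST, TEST_34_TEST')
c3=$(count_unprefixed 'NOT_TEST_34_TEST, TEST_34_TEST, TEST_34_TEST')
echo "$c1 $c2 $c3"
```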

multi-lines pattern matching

I have some files with content like this:
file1:
AAA
BBB
CCC
123
file2:
AAA
BBB
123
I want to echo the filename only if the first 3 lines are letters, or "file1" in the samples above.
I'm merging the 3 lines into one and comparing the result to my regex [A-Z], but I could not get it to match for some reason.
My script:
file=file1
if [[ $(head -3 $file|tr -d '\n'|sed 's/\r//g') == [A-Z] ]]; then
echo "$file"
fi
I ran it with bash -x, this is the output
+ file=file1
++ head -3 file1
++ tr -d '\n'
++ sed 's/\r//g'
+ [[ ASMUTCEDD == [A-Z] ]]
+exit
What you missed:
You can use grep to check that the input matches only [A-Z] characters (or indeed Bash's built-in regex matching, as @Barmar pointed out)
You can use the pipeline directly in the if statement, without [[ ... ]]
Like this:
file=file1
if head -n 3 "$file" | tr -d '\n\r' | grep -qE '^[A-Z]+$'; then
echo "$file"
fi
To do regular expression matching you have to use =~, not ==. And the regular expression should be ^[A-Z]*$. Your regular expression matches if there's a letter anywhere in the string, not just if the string is entirely letters.
if [[ $(head -n 3 "$file" | tr -d '\n\r') =~ ^[A-Z]*$ ]]; then
echo "$file"
fi
You can use built-ins and character classes for this problem:-
#!/bin/bash
file="file1"
C=0
flag=0
while IFS= read -r line
do
(( ++C ))
[ $C -eq 4 ] && break
# the regex must be unquoted: since bash 3.2 a quoted pattern is matched literally
[[ $line =~ [^[:alpha:]] ]] && flag=1
done < "$file"
[ $flag -eq 0 ] && echo "$file"

how to match regex in bash script with for loop?

I'm trying to match multiple strings from output of a command and do something for each one of them.
#!/usr/bin/env bash
echo 'Howdy, can you please give me the domain (without www)?'
read domain
routes=$(flynn -a shop-app route | grep $domain)
# echo $routes | egrep "http\/\S+"
pattern="http\/[^ ]+"
for word in $routes
do
[[ $word =~ $pattern ]]
if ${BASH_REMATCH[0]}
then
match="${BASH_REMATCH[0]}"
sed -i s/DOMAIN/$domain/g $domain.sh
sed -i s:ROUTE1:$match:g $domain.sh
fi
if ${BASH_REMATCH[1]}
then
match2="${BASH_REMATCH[1]}"
sed -i s:ROUTE2:$match2:g $domain.sh
fi
done
echo $match
Update: the regex part works now, but the loop is not working. I know the loop will find two matches, and I want to do something with each one.
the sample text:
http:www.lipi.ir shop-app-web http/d49ced12-c6ca-46a0-b919-6d97b6580ad3 false false /
http:lipi.ir shop-app-web http/ff919e9d-9bf7-4342-a4b3-ea184c698959 false false /

Pattern matching in if statement in bash

I'm trying to count the words with at least two vowels in all the .txt files in the directory. Here's my code so far:
#!/bin/bash
wordcount=0
for i in $HOME/*.txt
do
cat $i |
while read line
do
for w in $line
do
if [[ $w == .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
then
wordcount=`expr $wordcount + 1`
echo $w ':' $wordcount
else
echo "In else"
fi
done
done
echo $i ':' $wordcount
wordcount=0
done
Here is my sample from a txt file
Last modified: Sun Aug 20 18:18:27 IST 2017
To remove PPAs
sudo apt-get install ppa-purge
sudo ppa-purge ppa:
The problem is it doesn't match the pattern in the if statement for all the words in the text file. It goes directly to the else statement. And secondly, the wordcount in echo $i ':' $wordcount is equal to 0 which should be some value.
Immediate Issue: Glob vs Regex
[[ $string = $pattern ]] doesn't perform regex matching; instead, it's a glob-style pattern match. While . means "any character" in regex, it matches only itself in glob.
You have a few options here:
Use =~ instead to perform regular expression matching:
[[ $w =~ .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
Use a glob-style expression instead of a regex:
[[ $w = *[aeiouAEIOU]*[aeiouAEIOU]* ]]
Note the use of = rather than == here; while either is technically valid, the former avoids building finger memory that would lead to bugs when writing code for a POSIX implementation of test / [, as = is the only valid string comparison operator there.
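A short sketch comparing the two options against the original test, using the word "sudo" from the sample file. The variable names are illustrative only.

```bash
#!/usr/bin/env bash
w='sudo'

# Glob: * means "any run of characters", so this asks for two vowels anywhere.
if [[ $w = *[aeiouAEIOU]*[aeiouAEIOU]* ]]; then glob_result=match; else glob_result=no-match; fi

# Regex: =~ is unanchored, so the leading and trailing .* can be dropped.
re='[aeiouAEIOU].*[aeiouAEIOU]'
if [[ $w =~ $re ]]; then regex_result=match; else regex_result=no-match; fi

# The original test: with ==, each dot is a literal dot, so "sudo" cannot match.
if [[ $w == .*[aeiouAEIOU].*[AEIOUaeiou].* ]]; then orig_result=match; else orig_result=no-match; fi

echo "$glob_result $regex_result $orig_result"
```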
Larger Issue: Properly Reading Word-By-Word
Using for w in $line is innately unsafe. Use read -a to read a line into an array of words:
#!/usr/bin/env bash
wordcount=0
for i in "$HOME"/*.txt; do
while read -r -a words; do
for word in "${words[@]}"; do
if [[ $word = *[aeiouAEIOU]*[aeiouAEIOU]* ]]; then
(( ++wordcount ))
fi
done
done <"$i"
printf '%s: %s\n' "$i" "$wordcount"
wordcount=0
done
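The rewrite also addresses the second symptom from the question (the count printing as 0): feeding the loop through a pipe runs it in a subshell, so its variable changes are lost, while redirecting into the loop keeps it in the current shell. A small sketch of the difference, assuming bash's default settings (the lastpipe option is off):

```bash
#!/usr/bin/env bash
# Pipeline: the while loop runs in a subshell, so its increments are lost.
count=0
printf 'one\ntwo\nthree\n' | while read -r line; do (( ++count )); done
piped=$count

# Redirection: the loop runs in the current shell, so the count survives.
count=0
while read -r line; do (( ++count )); done < <(printf 'one\ntwo\nthree\n')
redirected=$count

echo "piped=$piped redirected=$redirected"
```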
Try (this needs GNU awk, since ENDFILE is a gawk extension):
awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
Sample output looks like:
$ awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
one.txt:1
sample.txt:9
How it works:
/[aeiouAEIOU].*[AEIOUaeiou]/{n++}
Every time we find a word with two vowels, we increment variable n.
ENDFILE{print FILENAME":"n; n=0}
At the end of each file, we print the name of the file and the 2-vowel word count n. We then reset n to zero.
RS='[[:space:]]'
This tells awk to use any whitespace as a word separator. This makes each word into a record. Awk reads the input one record at a time.
Shell issues
The use of awk avoids a multitude of shell issues. For example, consider the line for w in $line. This will not work the way you hope. Consider a directory with these files:
$ ls
one.txt sample.txt
Now, let's take line='* Item One' and see what happens:
$ line='* Item One'
$ for w in $line; do echo "w=$w"; done
w=one.txt
w=sample.txt
w=Item
w=One
The shell treats the * in line as a wildcard and expands it into a list of files. Odds are you didn't want this. The awk solution avoids a variety of issues like this.
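If the loop has to stay in bash, the same line can be split without filename expansion by reading it into an array. A minimal sketch, assuming the same line value as above:

```bash
#!/usr/bin/env bash
line='* Item One'

# read -a splits on whitespace but never expands wildcards.
read -r -a words <<< "$line"

n=${#words[@]}
first=${words[0]}
printf 'w=%s\n' "${words[@]}"
```

Here the first word stays a literal *, whatever files exist in the current directory.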
Using grep - this is pretty simple to do.
#!/bin/bash
wordcount=0
for file in ./*.txt
do
# put one word per line, then count the lines containing two vowels
count=$(tr -s '[:space:]' '\n' < "$file" | grep -ci '[aeiou].*[aeiou]')
wordcount=$(( wordcount + count ))
done
echo $wordcount

use regular expression in if-condition in bash

What are the general rules for using regular expressions in an if clause in bash?
Here is an example
$ gg=svm-grid-ch
$ if [[ $gg == *grid* ]] ; then echo $gg; fi
svm-grid-ch
$ if [[ $gg == ^....grid* ]] ; then echo $gg; fi
$ if [[ $gg == ....grid* ]] ; then echo $gg; fi
$ if [[ $gg == s...grid* ]] ; then echo $gg; fi
$
Why do the last three fail to match?
Hope you could give as many general rules as possible, not just for this example.
When using a glob pattern, a question mark represents a single character and an asterisk represents a sequence of zero or more characters:
if [[ $gg == ????grid* ]] ; then echo $gg; fi
When using a regular expression, a dot represents a single character and an asterisk represents zero or more of the preceding character. So ".*" represents zero or more of any character, "a*" represents zero or more "a", "[0-9]*" represents zero or more digits. Another useful one (among many) is the plus sign which represents one or more of the preceding character. So "[a-z]+" represents one or more lowercase alpha character (in the C locale - and some others).
if [[ $gg =~ ^....grid.*$ ]] ; then echo $gg; fi
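The three cases side by side, as a small sketch using the value from the question (the result variable names are illustrative):

```bash
#!/usr/bin/env bash
gg=svm-grid-ch

# Glob with ==: ? is one character, * is any run of characters.
if [[ $gg == ????grid* ]]; then glob_result=match; else glob_result=no-match; fi

# Regex with =~: . is one character, .* is any run of characters.
re='^....grid.*$'
if [[ $gg =~ $re ]]; then regex_result=match; else regex_result=no-match; fi

# Regex syntax used with ==: the dots are literal, so nothing matches.
if [[ $gg == ....grid* ]]; then mixed_result=match; else mixed_result=no-match; fi

echo "$glob_result $regex_result $mixed_result"
```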
Use =~ for a regular expression check:
if [[ $gg =~ ^....grid.* ]]; then echo $gg; fi
Adding this solution with grep and basic sh builtins for those interested in a more portable solution (independent of bash version; also works with plain old sh, on non-Linux platforms etc.)
# GLOB matching
gg=svm-grid-ch
case "$gg" in
*grid*) echo $gg ;;
esac
# REGEXP
if echo "$gg" | grep '^....grid*' >/dev/null ; then echo $gg ; fi
if echo "$gg" | grep '....grid*' >/dev/null ; then echo $gg ; fi
if echo "$gg" | grep 's...grid*' >/dev/null ; then echo $gg ; fi
# Extended REGEXP
if echo "$gg" | egrep '(^....grid*|....grid*|s...grid*)' >/dev/null ; then
echo $gg
fi
Some grep incarnations also support the -q (quiet) option as an alternative to redirecting to /dev/null, but the redirect is again the most portable.
@OP,
Is glob pattern not only used for file names?
No, glob patterns are not only used for file names; you can use them to compare strings as well. In your examples, you can use case/esac to look for string patterns.
gg=svm-grid-ch
# looking for the word "grid" in the string $gg
case "$gg" in
*grid* ) echo "found";;
esac
# [[ $gg =~ ^....grid* ]]
case "$gg" in ????grid*) echo "found";; esac
# [[ $gg =~ s...grid* ]]
case "$gg" in s???grid*) echo "found";; esac
In bash, when to use glob pattern and when to use regular expression? Thanks!
Regex are more versatile and "convenient" than "glob patterns", however unless you are doing complex tasks that "globbing/extended globbing" cannot provide easily, then there's no need to use regex.
Regex are not supported in older versions of bash (as dennis mentioned; =~ was added in bash 3.0, and its quoting rules changed in 3.2), but you can still use extended globbing (by setting extglob).
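As a rough illustration of extended globbing, +(pattern) matches one or more occurrences of the pattern, which covers some cases that would otherwise call for a regex. The test string is the one from the question; the rest is an assumption made for the example.

```bash
#!/usr/bin/env bash
shopt -s extglob    # enable extended patterns such as +(...), *(...), ?(...)

gg=svm-grid-ch

# One or more lowercase letters, "-grid-", one or more lowercase letters.
if [[ $gg == +([a-z])-grid-+([a-z]) ]]; then ext_result=match; else ext_result=no-match; fi

# Digits in the first part make the same pattern fail.
other=svm2-grid-ch
if [[ $other == +([a-z])-grid-+([a-z]) ]]; then other_result=match; else other_result=no-match; fi

echo "$ext_result $other_result"
```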
Update for OP: Example to find files that start with 2 characters (the dots "." means 1 char) followed by "g" using regex
Example output:
$ shopt -s dotglob
$ ls -1 *
abg
degree
..g
$ for file in *; do [[ $file =~ ..g ]] && echo $file ; done
abg
degree
..g
In the above, the files are matched because their names contain 2 characters followed by "g". (ie ..g).
The equivalent with globbing will be something like this: (look at reference for meaning of ? and * )
$ for file in ??g*; do echo $file; done
abg
degree
..g
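One caveat, shown as a sketch with a made-up name: the regex is unanchored and matches ..g anywhere in the string, while the glob ??g* only matches at the start. A leading * makes the glob behave like the unanchored regex.

```bash
#!/usr/bin/env bash
name=the_dog    # hypothetical file name

re='..g'
# Unanchored regex: finds "dog" inside the name.
if [[ $name =~ $re ]]; then regex_result=match; else regex_result=no-match; fi

# Glob anchored at the start: the third character is "e", so no match.
if [[ $name == ??g* ]]; then glob_result=match; else glob_result=no-match; fi

# Glob with a leading *: two characters followed by "g" anywhere.
if [[ $name == *??g* ]]; then loose_result=match; else loose_result=no-match; fi

echo "$regex_result $glob_result $loose_result"
```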