How to check if string contains characters in regex pattern in shell? - regex

How do I check if a variable contains characters (regex) other than 0-9a-z and - in pure bash?
I need a conditional check. If the string contains characters other than the accepted characters above simply exit 1.

One way of doing it is using the grep command, like this:
grep -qv "[^0-9a-z-]" <<< $STRING
Then you ask for the grep returned value with the following:
if [ ! $? -eq 0 ]; then
echo "Wrong string"
exit 1
fi
As @mpapis pointed out, you can simplify the above to:
grep -qv "[^0-9a-z-]" <<< $STRING || exit 1
Also you can use the bash =~ operator, like this:
if [[ ! "$STRING" =~ [^0-9a-z-] ]] ; then
echo "Valid";
else
echo "Not valid";
fi
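The =~ check above can be wrapped in a small reusable function. This is a minimal sketch, assuming bash; the name is_valid_name is hypothetical, and rejecting the empty string is an assumption, since the question does not say how it should be treated.

```shell
#!/usr/bin/env bash
# is_valid_name: hypothetical helper. Succeeds only if $1 is non-empty and
# consists solely of the characters 0-9, a-z and '-'.
is_valid_name() {
    # Keeping the regex in a variable avoids quoting surprises inside [[ ]].
    local re='^[0-9a-z-]+$'
    [[ $1 =~ $re ]]
}

is_valid_name "abc-123" && echo "valid"
is_valid_name "abc_123" || echo "not valid"
```

In a script you would typically write `is_valid_name "$STRING" || exit 1`.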

case has support for matching:
shopt -s extglob  # bash needs this for +(...) in case patterns
case "$string" in
(+([0-9a-z-])) true ;;
(*) exit 1 ;;
esac
The pattern format is not pure regexp, but it runs faster than a separate grep process, which matters if you have multiple checks.
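The +(...) form is an extended glob. A variant that sticks to plain POSIX case patterns, and so should also run in sh, might look like this. It is a sketch only: has_only_allowed is a hypothetical name, and treating the empty string as invalid is an assumption.

```shell
# has_only_allowed: hypothetical helper. Succeeds if $1 is non-empty and
# contains nothing but 0-9, a-z and '-'. Uses only POSIX case patterns.
has_only_allowed() {
    case $1 in
        '')           return 1 ;;  # empty string: rejected (an assumption)
        *[!0-9a-z-]*) return 1 ;;  # at least one character outside the set
        *)            return 0 ;;
    esac
}

for s in "abc-123" "abc_123"; do
    if has_only_allowed "$s"; then echo "$s: ok"; else echo "$s: rejected"; fi
done
```

The negated bracket expression [!...] looks for one offending character, which avoids the need for a "one or more" operator.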

Using Bash's substitution engine to test if $foo contains $bar
bar='[^0-9a-z-]'
if [ -n "$foo" ] && [ -z "${foo/*$bar*}" ] ; then
echo exit 1
fi

Related

Using regular expressions in a ksh Script

I have a file (file.txt) that contains some text like:
000000000+000+0+00
000000001+000+0+00
000000002+000+0+00
and I am trying to check each line to make sure that it follows the format:
character*9, "+", character*3, "+", etc
so far I have:
#!/bin/ksh
file=file.txt
line_number=1
for line in $(cat $file)
do
if [[ "$line" != "[[.]]{9}+[[.]]{3}+[[.]]{1}+[[.]]{2} ]" ]]
then
echo "Invalid number ($line) check line $line_number"
exit 1
fi
let "line_number++"
done
however this does not evaluate correctly, no matter what I put in the lines the program terminates.
When you want the line numbers of the mismatches, you can use grep -vn. With a correctly written regular expression, this becomes:
grep -Evn "^.{9}[+].{3}[+].[+].{2}$" file.txt
This is not in the layout that you want, so change the layout with sed:
grep -Evn "^.{9}[+].{3}[+].[+].{2}$" file.txt |
sed -r 's/([^:]*):(.*)/Invalid number (\2) check line number \1./'
EDIT:
I changed .{1} into a plain . (a single dot).
The sed step is also more than you strictly need. If you want to see what it does, start by feeding it echo "Linenr:Invalid line" as input.
I'm having funny results putting the regex in the condition directly:
$ line='000000000+000+0+00'
$ [[ $line =~ ^.{9}\+.{3}\+.\+..$ ]] && echo ok
ksh: syntax error: `~(E)^.{9}\+.{3}\+.\+..$ ]] && echo ok
' unexpected
But if I save the regex in a variable:
$ re="^.{9}\+.{3}\+.\+..$"
$ [[ $line =~ $re ]] && echo ok
ok
So you can do
#!/bin/ksh
file=file.txt
line_number=1
re="^.{9}\+.{3}\+.\+..$"
while IFS= read -r line; do
if [[ ! $line =~ $re ]]; then
echo "Invalid number ($line) check line $line_number"
exit 1
fi
let "line_number++"
done < "$file"
You can also use a plain glob pattern:
if [[ $line != ?????????+???+?+?? ]]; then echo error; fi
ksh glob patterns have some regex-like syntax. If there's an optional space in there, you can handle that with the ?(sub-pattern) syntax
pattern="?????????+???+?( )?+??"
line1="000000000+000+0+00"
line2="000000000+000+ 0+00"
[[ $line1 == $pattern ]] && echo match || echo no match # => match
[[ $line2 == $pattern ]] && echo match || echo no match # => match
Read the "File Name Generation" section of the ksh man page.
Your regex looks wrong. Sites like https://regex101.com/ are very helpful for testing. From your description, I suspect it should look more like one of these:
^.{9}\+.{3}\+.{1}\+.{2}$
^[^\+]{9}\+[^\+]{3}\+[^\+]{1}\+[^\+]{2}$
^[0-9]{9}\+[0-9]{3}\+[0-9]{1}\+[0-9]{2}$
From the ksh manpage section on [[ - you would probably want to be using =~.
string =~ ere
True if string matches the pattern ~(E)ere where ere is an extended regular expression.
Note: As far as I know, ksh regex doesn't follow the normal syntax
You may have better luck with using grep:
# X="000000000+000+0+00"
# grep -qE "^[^\+]{9}\+[^\+]{3}\+[^\+]{1}\+[^\+]{2}$" <<<"${X}" && echo true
true
Or:
if ! grep -qE "^[^\+]{9}\+[^\+]{3}\+[^\+]{1}\+[^\+]{2}$" <<<"${line}"
then
exit 1
fi
You may also prefer to use a construct like below for handling files:
while IFS= read -r line; do
echo "${line}";
done < "${file}"
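Combining the read loop with a regex kept in a variable, a line validator could be sketched as follows. This assumes bash and assumes the fields are digits, which the question does not actually state; first_bad_line is a hypothetical name.

```shell
#!/usr/bin/env bash
# first_bad_line: hypothetical helper. Reads lines from stdin and prints the
# number of the first line that is not 9 digits, '+', 3 digits, '+',
# 1 digit, '+', 2 digits. Returns 0 if every line matches, 1 otherwise.
first_bad_line() {
    local re='^[0-9]{9}[+][0-9]{3}[+][0-9][+][0-9]{2}$'
    local line n=0
    while IFS= read -r line; do
        n=$((n + 1))
        if [[ ! $line =~ $re ]]; then
            echo "$n"
            return 1
        fi
    done
    return 0
}

# The second line has only 8 leading digits, so it is reported.
bad=$(printf '%s\n' '000000000+000+0+00' '00000001+000+0+00' | first_bad_line) \
    || echo "check line $bad"
```

Writing the plus signs as [+] sidesteps the question of how backslashes are handled inside quotes.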

multi-lines pattern matching

I have some files with content like this:
file1:
AAA
BBB
CCC
123
file2:
AAA
BBB
123
I want to echo the filename only if the first 3 lines are letters, or "file1" in the samples above.
I'm merging the 3 lines into one and comparing the result to my regex [A-Z], but could not get it to match for some reason.
my script:
file=file1
if [[ $(head -3 $file|tr -d '\n'|sed 's/\r//g') == [A-Z] ]]; then
echo "$file"
fi
I ran it with bash -x, this is the output
+ file=file1
++ head -3 file1
++ tr -d '\n'
++ sed 's/\r//g'
+ [[ ASMUTCEDD == [A-Z] ]]
+exit
What you missed:
You can use grep to check that the input matches only [A-Z] characters (or indeed Bash's built-in regex matching, as @Barmar pointed out)
You can use the pipeline directly in the if statement, without [[ ... ]]
Like this:
file=file1
if head -n 3 "$file" | tr -d '\n\r' | grep -qE '^[A-Z]+$'; then
echo "$file"
fi
To do regular expression matching you have to use =~, not ==. And the regular expression should be ^[A-Z]*$. Your regular expression matches if there's a letter anywhere in the string, not just if the string is entirely letters.
if [[ $(head -n 3 "$file" | tr -d '\n\r') =~ ^[A-Z]*$ ]]; then
echo "$file"
fi
You can use built-ins and character classes for this problem:-
#!/bin/bash
file="file1"
C=0
flag=0
while read line
do
(( ++C ))
[ $C -eq 4 ] && break;
[[ "$line" =~ [^[:alpha:]] ]] && flag=1  # regex must be unquoted, or bash matches it literally
done < "$file"
[ $flag -eq 0 ] && echo "$file"
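A stricter variant of the built-ins approach, sketched under the assumption that a file with fewer than three lines should not be reported. The function name first_three_are_letters is hypothetical, and the code assumes bash.

```shell
#!/usr/bin/env bash
# first_three_are_letters: hypothetical helper. Succeeds only if the file has
# at least three lines and each of the first three consists of A-Z only.
first_three_are_letters() {
    local file=$1 line i
    local re='^[A-Z]+$'
    for i in 1 2 3; do
        # read fails at end of file; an unterminated last line still counts.
        IFS= read -r line || [[ -n $line ]] || return 1
        line=${line%$'\r'}              # tolerate Windows line endings
        [[ $line =~ $re ]] || return 1
    done < "$file"
}

tmp=$(mktemp)
printf 'AAA\nBBB\nCCC\n123\n' > "$tmp"
first_three_are_letters "$tmp" && echo "match"
rm -f "$tmp"
```

Checking each line separately avoids merging the lines, so no tr or sed process is needed.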

How to test if string matches a regex in POSIX shell? (not bash)

I'm using Ubuntu system shell, not bash, and I found the regular way can not work:
#!/bin/sh
string='My string';
if [[ $string =~ .*My.* ]]
then
echo "It's there!"
fi
error [[: not found!
What can I do to solve this problem?
The [[ ... ]] are a bash-ism. You can make your test shell-agnostic by just using grep with a normal if:
if echo "$string" | grep -q "My"; then
echo "It's there!"
fi
Using grep for such a simple pattern can be considered wasteful. Avoid that unnecessary fork, by using the Sh built-in Glob-matching engine (NOTE: This does not support regex):
case "$value" in
*XXX*) echo OK ;;
*) echo fail ;;
esac
It is POSIX compliant. Bash has a simplified syntax for this:
if [[ "$value" == *XXX* ]]; then :; fi
and even regex:
[[ abcd =~ b.*d ]] && echo ok
You could use expr:
if expr "$string" : "My" 1>/dev/null; then
echo "It's there";
fi
This would work with both sh and bash.
As a handy function:
exprq() {
local value
test "$2" = ":" && value="$3" || value="$2"
expr "$1" : "$value" 1>/dev/null
}
# expr anchors the match at the start of the string, so use a leading .* to match anywhere.
# Or `exprq "somebody" ".*body"` if you'd rather ditch the ':'
if exprq "somebody" : ".*body"; then
echo "once told me"
fi
Quoting from man expr:
STRING : REGEXP
anchored pattern match of REGEXP in STRING
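Because the match is anchored, a pattern only succeeds if it matches from the first character of the string. The sketch below illustrates this; contains_re is a hypothetical helper, and the leading X is a common defensive idiom so that strings starting with '-' are not mistaken for options or operators.

```shell
# contains_re: hypothetical helper. Succeeds if the basic regular expression
# $2 matches anywhere in $1. expr anchors the match at the start of the
# string, so ".*" is prepended to allow a match at any position.
contains_re() {
    expr "X$1" : ".*$2" >/dev/null
}

# Plain expr: "body" does not match at the start of "somebody".
expr "somebody" : "body" >/dev/null || echo "anchored: no match"
# With the leading .* the same pattern is found.
contains_re "somebody" "body" && echo "unanchored: match"
```

expr uses basic regular expressions, so + and ? are not special there.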

bash and grep: passing of regex parameter

I'm trying to write a bash script that helps solve crosswords. For example, the question is "Alcoholic Drink in German". I already have a 'B' in the first place, an 'R' in the last place and two gaps in between. So a regex would be ^B..R$
Since I live in Switzerland, I'd like to use the ngerman dictionary (DICT=/usr/share/dict/ngerman).
Here's how I'd do it directly on the shell:
grep -i '^B...$' /usr/share/dict/ngerman
That works perfectly, and the word 'Bier' appears among three others. Since this syntax is cumbersome, I'd like to write a little batch script, that allows me to enter it like this:
crosswords 'B..R'
Here's my approach:
#!/bin/bash
DICT=/usr/share/dict/ngerman
usage () {
progname=$(basename $0)
echo "usage: $progname regex"
}
if [ $# -le 0 ]; then
usage
exit 1
fi
regex="'^$1$'"
cmd="grep -i $regex $DICT"
echo $regex
echo $cmd
$($cmd) | while read word; do
echo "$word"
done
But nothing appears, it doesn't work. I also output the $regex and the $cmd variable for debugging reasons. Here's what comes out:
'^B..R$'
grep -i '^B..R$' /usr/share/dict/ngerman
That's exactly what I need. If I copy/paste the command above, it works perfectly. But if i call it with $($cmd), it fails.
What is wrong?
You do not need to put quotes inside the regex variable's value, and $($cmd) should be changed to $cmd.
So the correct code is:
#!/bin/bash
DICT=/usr/share/dict/ngerman
usage () {
progname=$(basename $0)
echo "usage: $progname regex"
}
if [ $# -le 0 ]; then
usage
exit 1
fi
regex="^$1$"
cmd="grep -i $regex $DICT"
echo $regex
echo $cmd
$cmd | while read word; do
echo "$word"
done
Change regex="'^$1$'" to regex="^$1$" and $($cmd) to $cmd
Here is a fixed version:
#!/bin/bash
DICT=/usr/share/dict/ngerman
usage () {
progname=$(basename "$0")
echo "usage: $progname regex"
}
if [ $# -le 0 ]; then
usage
exit 1
fi
regex="^$1$"
cmd="grep -i $regex $DICT"
echo "$regex"
echo "$cmd"
$cmd | while read -r word; do
echo "$word"
done
But this script has potential problems. For example, try running it as ./script 'asdads * '. The unquoted * will expand to every file in the current directory, and all of them will be passed to grep.
Here is a slightly improved version of your code with correct quoting and, as a bonus, input validation:
#!/bin/bash
DICT=/usr/share/dict/ngerman
usage () {
progname=$(basename "$0")
echo "usage: $progname regex"
}
if [ $# -le 0 ]; then
usage
exit 1
fi
if ! [[ $1 =~ ^[a-zA-Z\.]+$ ]]; then
echo 'Wrong word. Please use only a-zA-Z characters and dots for unknown letters'
exit 1
fi
grep -i "^$1$" "$DICT" | while read -r word; do
echo "$word"
done
Oh, now I get it. When I type the command manually, the shell removes the single quotes before the program sees them! Here's my test program in C (param-test.c):
#include <stdio.h>
int main(int argc, char *argv[]) {
puts(argv[1]);
return 0;
}
Then I call:
param-test 'foo'
And I see:
foo
That's the problem! grep doesn't really get 'B..R', but just B..R.
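If you do want to keep the command in a variable, a bash array is one way to do it without embedding quote characters. This is a sketch, not the poster's script: find_words is a hypothetical helper, and the temporary word list merely stands in for /usr/share/dict/ngerman.

```shell
#!/usr/bin/env bash
# find_words: hypothetical helper. $1 is the crossword pattern (dots for
# unknown letters), $2 is the word list to search.
find_words() {
    # Each array element reaches grep as exactly one argument, so no literal
    # quote characters are needed and no word splitting or globbing occurs.
    local cmd=(grep -i -- "^$1\$" "$2")
    "${cmd[@]}"
}

# Demo with a temporary word list.
words=$(mktemp)
printf '%s\n' Bier Bar Buch Boot > "$words"
find_words 'B..r' "$words"
rm -f "$words"
```

Quotes typed on the command line are syntax for the shell, while quotes stored inside a variable are ordinary data, which is why the string version passed them on to grep.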

use regular expression in if-condition in bash

What are the general rules for using regular expressions in an if clause in bash?
Here is an example
$ gg=svm-grid-ch
$ if [[ $gg == *grid* ]] ; then echo $gg; fi
svm-grid-ch
$ if [[ $gg == ^....grid* ]] ; then echo $gg; fi
$ if [[ $gg == ....grid* ]] ; then echo $gg; fi
$ if [[ $gg == s...grid* ]] ; then echo $gg; fi
$
Why do the last three fail to match?
Hope you could give as many general rules as possible, not just for this example.
When using a glob pattern, a question mark represents a single character and an asterisk represents a sequence of zero or more characters:
if [[ $gg == ????grid* ]] ; then echo $gg; fi
When using a regular expression, a dot represents a single character and an asterisk represents zero or more of the preceding character. So ".*" represents zero or more of any character, "a*" represents zero or more "a", "[0-9]*" represents zero or more digits. Another useful one (among many) is the plus sign which represents one or more of the preceding character. So "[a-z]+" represents one or more lowercase alpha character (in the C locale - and some others).
if [[ $gg =~ ^....grid.*$ ]] ; then echo $gg; fi
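The two notations can be compared side by side. The sketch below assumes bash; matches_glob and matches_regex are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: $1 is the string, $2 the pattern. The right-hand
# side is left unquoted on purpose so that it is treated as a pattern.
matches_glob()  { [[ $1 == $2 ]]; }
matches_regex() { [[ $1 =~ $2 ]]; }

gg=svm-grid-ch

# Glob: ? is any one character, * is any run of characters, and the
# pattern has to cover the whole string.
matches_glob "$gg" '????grid*' && echo "glob: match"

# Regex: . is any one character, .* is any run of characters.
matches_regex "$gg" '^....grid.*$' && echo "regex: match"

# Regex-style dots in a glob are literal dots, so this does not match.
matches_glob "$gg" '....grid*' || echo "glob with dots: no match"
```

A glob is implicitly anchored at both ends, while a regex matches anywhere in the string unless ^ and $ are used.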
Use =~ for a regular expression check:
if [[ $gg =~ ^....grid.* ]]
Adding this solution with grep and basic sh builtins for those interested in a more portable solution (independent of bash version; also works with plain old sh, on non-Linux platforms etc.)
# GLOB matching
gg=svm-grid-ch
case "$gg" in
*grid*) echo $gg ;;
esac
# REGEXP
if echo "$gg" | grep '^....grid' >/dev/null ; then echo $gg ; fi
if echo "$gg" | grep '....grid' >/dev/null ; then echo $gg ; fi
if echo "$gg" | grep 's...grid' >/dev/null ; then echo $gg ; fi
# Extended REGEXP
if echo "$gg" | grep -E '(^....grid|....grid|s...grid)' >/dev/null ; then
echo $gg
fi
Some grep incarnations also support the -q (quiet) option as an alternative to redirecting to /dev/null, but the redirect is again the most portable.
@OP,
Is a glob pattern not only used for file names?
No, a "glob" pattern is not only used for file names. You can use it to compare strings as well. In your examples, you can use case/esac to look for string patterns.
gg=svm-grid-ch
# looking for the word "grid" in the string $gg
case "$gg" in
*grid* ) echo "found";;
esac
# [[ $gg =~ ^....grid* ]]
case "$gg" in ????grid*) echo "found";; esac
# [[ $gg =~ s...grid* ]]
case "$gg" in s???grid*) echo "found";; esac
In bash, when to use glob pattern and when to use regular expression? Thanks!
Regex are more versatile and "convenient" than "glob patterns", however unless you are doing complex tasks that "globbing/extended globbing" cannot provide easily, then there's no need to use regex.
Regex matching with =~ is not available in old versions of bash (it was added in 3.0, and the quoting rules for the pattern changed in 3.2, as dennis mentioned), but you can still use extended globbing there (enable it with shopt -s extglob).
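As an illustration of what extended globbing can express without a regex, here is a minimal sketch assuming bash. The helper name is_version and the "dotted number" format are made up for the example.

```shell
#!/usr/bin/env bash
shopt -s extglob   # enables ?(...), *(...), +(...), @(...) and !(...)

# is_version: hypothetical helper. Succeeds for strings such as "1.20.3",
# that is, one or more groups of digits separated by single dots.
is_version() {
    # +([0-9])       one or more digits
    # *(.+([0-9]))   zero or more repetitions of: a dot, then digits
    [[ $1 == +([0-9])*(.+([0-9])) ]]
}

is_version "1.20.3" && echo "looks like a version"
is_version "1..2"   || echo "rejected"
```

The roughly equivalent regex would be ^[0-9]+(\.[0-9]+)*$.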
Update for OP: an example that uses a regex to find files whose names contain 2 characters (each dot "." matches 1 character) followed by "g".
Example output:
$ shopt -s dotglob
$ ls -1 *
abg
degree
..g
$ for file in *; do [[ $file =~ ..g ]] && echo $file ; done
abg
degree
..g
In the above, the files are matched because their names contain 2 characters followed by "g". (ie ..g).
The equivalent with globbing will be something like this: (look at reference for meaning of ? and * )
$ for file in ??g*; do echo $file; done
abg
degree
..g