Bash regex to check file extensions

I am trying to check whether a given file has one of the three extensions I expect: .fa, .fasta or .fasta.gz. Judging by other questions this should be quite trivial, but the suggestions I have tried do not work for me.
This is what I have tried, none of which match:
#!/bin/bash
test1="abcdef.fa"
test2="ghijkl.fasta"
test3="mnopqr.fasta.gz"
echo "test1: $test1"
echo "test2: $test2"
echo "test3: $test3"
# Attempt 1
if [[ $test1 =~ *.fa|*.fasta|*.fasta.gz ]] &> /dev/null; then printf "Attempt1: Match with $test1\n"; fi
if [[ $test2 =~ *.fa|*.fasta|*.fasta.gz ]] &> /dev/null; then printf "Attempt1: Match with $test2\n"; fi
if [[ $test3 =~ *.fa|*.fasta|*.fasta.gz ]] &> /dev/null; then printf "Attempt1: Match with $test3\n"; fi
# Attempt 2 - do I need to quote the string?
if [[ "$test1" =~ *.fa|*.fasta|*.fasta.gz ]] &> /dev/null; then printf "Attempt2: Match with $test1\n"; fi
if [[ "$test2" =~ *.fa|*.fasta|*.fasta.gz ]] &> /dev/null; then printf "Attempt2: Match with $test2\n"; fi
if [[ "$test3" =~ *.fa|*.fasta|*.fasta.gz ]] &> /dev/null; then printf "Attempt2: Match with $test3\n"; fi
# Attempt 3 - alternative regex
if [[ $test1 =~ .\*.(fa|fasta|fasta.gz) ]] &> /dev/null; then printf "Attempt3: Match with $test1\n"; fi
if [[ $test2 =~ .\*.(fa|fasta|fasta.gz) ]] &> /dev/null; then printf "Attempt3: Match with $test2\n"; fi
if [[ $test3 =~ .\*.(fa|fasta|fasta.gz) ]] &> /dev/null; then printf "Attempt3: Match with $test3\n"; fi
# Attempt 4 - again with the quoted string
if [[ "$test1" =~ .\*.(fa|fasta|fasta.gz) ]] &> /dev/null; then printf "Attempt4: Match with $test1\n"; fi
if [[ "$test2" =~ .\*.(fa|fasta|fasta.gz) ]] &> /dev/null; then printf "Attempt4: Match with $test2\n"; fi
if [[ "$test3" =~ .\*.(fa|fasta|fasta.gz) ]] &> /dev/null; then printf "Attempt4: Match with $test3\n"; fi
# Attempt 5 - put $ on end of regex
if [[ $test1 =~ .\*.(fa|fasta|fasta.gz)$ ]] &> /dev/null; then printf "Attempt5: Match with $test1\n"; fi
if [[ $test2 =~ .\*.(fa|fasta|fasta.gz)$ ]] &> /dev/null; then printf "Attempt5: Match with $test2\n"; fi
if [[ $test3 =~ .\*.(fa|fasta|fasta.gz)$ ]] &> /dev/null; then printf "Attempt5: Match with $test3\n"; fi
# Attempt 6 - again with the quoted string
if [[ "$test1" =~ .\*.(fa|fasta|fasta.gz)$ ]] &> /dev/null; then printf "Attempt6: Match with $test1\n"; fi
if [[ "$test2" =~ .\*.(fa|fasta|fasta.gz)$ ]] &> /dev/null; then printf "Attempt6: Match with $test2\n"; fi
if [[ "$test3" =~ .\*.(fa|fasta|fasta.gz)$ ]] &> /dev/null; then printf "Attempt6: Match with $test3\n"; fi
# Attempt 7 - use double ||
if [[ $test1 =~ .\*.(fa||fasta||fasta.gz) ]] &> /dev/null; then printf "Attempt7: Match with $test1\n"; fi
if [[ $test2 =~ .\*.(fa||fasta||fasta.gz) ]] &> /dev/null; then printf "Attempt7: Match with $test2\n"; fi
if [[ $test3 =~ .\*.(fa||fasta||fasta.gz) ]] &> /dev/null; then printf "Attempt7: Match with $test3\n"; fi
I am close with this:
# Attempt 8 - escape parentheses
if [[ $test1 =~ .\*.\(fa|fasta|fasta.gz\) ]] &> /dev/null; then printf "Attempt8: Match with $test1\n"; fi
if [[ $test2 =~ .\*.\(fa|fasta|fasta.gz\) ]] &> /dev/null; then printf "Attempt8: Match with $test2\n"; fi
if [[ $test3 =~ .\*.\(fa|fasta|fasta.gz\) ]] &> /dev/null; then printf "Attempt8: Match with $test3\n"; fi
However the first test does not work and the output looks like this:
test1: abcdef.fa
test2: ghijkl.fasta
test3: mnopqr.fasta.gz
Attempt8: Match with ghijkl.fasta
Attempt8: Match with mnopqr.fasta.gz
What am I missing?

You could try a case statement, something like:
case "$test1" in
*.fa|*.fasta|*.fasta.gz) printf 'Attempt1: Match with %s\n' "$test1";;
esac
case "$test2" in
*.fa|*.fasta|*.fasta.gz) printf 'Attempt1: Match with %s\n' "$test2";;
esac
case "$test3" in
*.fa|*.fasta|*.fasta.gz) printf 'Attempt1: Match with %s\n' "$test3";;
esac
See help case
See LESS='+/case word in' man bash
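The three near-identical case statements above can be folded into one helper and a loop. This is only a sketch: the function name has_fasta_ext and the extra sample names notes.txt and abcdef.fa.bak are assumptions made for illustration, not part of the original answer.

```shell
#!/usr/bin/env bash

# has_fasta_ext is a hypothetical helper name used for this illustration.
# It returns 0 when the argument ends in .fa, .fasta or .fasta.gz.
has_fasta_ext() {
  case "$1" in
    *.fa|*.fasta|*.fasta.gz) return 0 ;;
    *) return 1 ;;
  esac
}

# The last two names are made-up negative examples.
for f in abcdef.fa ghijkl.fasta mnopqr.fasta.gz notes.txt abcdef.fa.bak; do
  if has_fasta_ext "$f"; then
    printf 'Match with %s\n' "$f"
  else
    printf 'No match with %s\n' "$f"
  fi
done
```

Because case patterns must match the whole word, abcdef.fa.bak is rejected without any explicit anchoring.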

=~ expects a regular expression, not a glob pattern. Try \.(fa|fasta|fasta\.gz)$.
You can also use extended pattern matching: [[ $test1 == *.@(fa|fasta|fasta.gz) ]]
Conditional Constructs ([[ ]])
Pattern Matching

It's much easier to define the regex in a variable:
#!/usr/bin/env bash
test1="abcdef.fa"
test2="ghijkl.fasta"
test3="mnopqr.fasta.gz"
echo "test1: $test1"
echo "test2: $test2"
echo "test3: $test3"
pattern='\.(fa|fasta|fasta\.gz)$'
# Attempt 1
if [[ $test1 =~ $pattern ]] &> /dev/null; then printf "Attempt1: Match with $test1\n"; fi
if [[ $test2 =~ $pattern ]] &> /dev/null; then printf "Attempt1: Match with $test2\n"; fi
if [[ $test3 =~ $pattern ]] &> /dev/null; then printf "Attempt1: Match with $test3\n"; fi

You can use either regular-expression matching or pattern matching with [[ ... ]].
# regular expression
[[ $test1 =~ \.(fa|fasta|fasta\.gz)$ ]]
# pattern match
[[ $test1 = *.@(fa|fasta|fasta.gz) ]]
Regular expressions aren't anchored to either end of the string, so you need the trailing $ to ensure the extension actually occurs at the end of the string, not just somewhere in the middle. The (...) is a list of alternatives to choose from.
Pattern matches are anchored to both ends, so you need the * to match all of the string up to the extension. The @(...) is a list of alternatives to choose from.
Quoting the left-hand operand is optional in both cases.
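The effect of the trailing $ is easiest to see side by side. The sketch below is an illustration only: the function names ends_with_fasta_ext and contains_fasta_ext and the sample name reads.fasta.gz.md5 are hypothetical and do not come from the answer.

```shell
#!/usr/bin/env bash

# Hypothetical helper: regex anchored at the end of the string with $.
ends_with_fasta_ext() {
  local re='\.(fa|fasta|fasta\.gz)$'
  [[ $1 =~ $re ]]
}

# Hypothetical helper: same regex without $, so it may match in the middle.
contains_fasta_ext() {
  local re='\.(fa|fasta|fasta\.gz)'
  [[ $1 =~ $re ]]
}

for f in abcdef.fa reads.fasta.gz reads.fasta.gz.md5; do
  if ends_with_fasta_ext "$f"; then
    printf 'anchored regex matches:   %s\n' "$f"
  fi
  if contains_fasta_ext "$f"; then
    printf 'unanchored regex matches: %s\n' "$f"
  fi
done
```

Only the unanchored version accepts reads.fasta.gz.md5, because .fa occurs inside the name even though the name does not end with one of the three extensions.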

Related

how to use regex variable in zsh?

How can I use a regex variable in zsh the same way it works in bash? I can only get zsh to work with an inline regex. I am just trying to test that a string contains only alphanumerics, underscores or periods, and no dashes. As you can see, the inline regex and the regex variable both work as expected in bash, but zsh only matches the inline regex.
Bash
#!/bin/bash
RE='[0-9A-Za-z_\.]'
for test in $@; do
echo -e "bash test: $test"
if [[ "${test//[0-9A-Za-z_\.]/}" = "" ]]; then
echo -e '\tmatch inline'
fi
if [[ "${test//$RE/}" = "" ]]; then
echo -e '\tmatch var'
fi
done
❯./bash-regex-test.sh foo_bar foo-bar
output:
bash test: foo_bar
match inline
match var
bash test: foo-bar
Zsh
#!/bin/zsh
RE='[0-9A-Za-z_\.]'
for test in $@; do
echo "zsh test: $test"
if [[ "${test//[0-9A-Za-z_\.]/}" = "" ]]; then
echo '\tmatch inline'
fi
if [[ "${test//$RE/}" = "" ]]; then
echo '\tmatch var'
fi
done
❯./zsh-regex-test.zsh foo_bar foo-bar
output:
zsh test: foo_bar
match inline
zsh test: foo-bar
With zsh you need to use ${~RE} instead of $RE so that the value of $RE
is treated as a pattern rather than as a literal string. Change the line to:
if [[ "${test//${~RE}/}" = "" ]]; then
By the way, your use of $RE is not regex matching but pattern matching, as in
pathname expansion.
To use it as a regex, you need the =~ operator:
#!/bin/zsh
RE='^[0-9A-Za-z_\.]+$'
for test in "$@"; do
echo "zsh test: $test"
if [[ $test =~ ^[0-9A-Za-z_\.]+$ ]]; then
echo '\tmatch inline'
fi
if [[ $test =~ $RE ]]; then
echo '\tmatch var'
fi
done
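For comparison, here is a bash sketch of the same whole-string check with =~ and a regex held in a variable. The function name is_plain_name is an assumption for illustration. The dot is written without a backslash, which is a choice of this sketch: inside a bracket expression a dot is already literal.

```shell
#!/usr/bin/env bash

# is_plain_name is a hypothetical helper name.
# It succeeds when the whole string consists only of alphanumerics,
# underscores or periods (at least one character).
is_plain_name() {
  local re='^[0-9A-Za-z_.]+$'
  [[ $1 =~ $re ]]   # $re is left unquoted so it is treated as a regex
}

for test in foo_bar foo-bar; do
  if is_plain_name "$test"; then
    printf '%s: match\n' "$test"
  else
    printf '%s: no match\n' "$test"
  fi
done
```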

Simple regex matching produces wildly different results depending on shell version [duplicate]

The following code
number=1
if [[ $number =~ [0-9] ]]
then
echo matched
fi
works. If I try to use quotes in the regex, however, it stops:
number=1
if [[ $number =~ "[0-9]" ]]
then
echo matched
fi
I tried "\[0-9\]", too. What am I missing?
Funnily enough, bash advanced scripting guide suggests this should work.
Bash version 3.2.39.
It was changed between 3.1 and 3.2. Guess the advanced guide needs an update.
This is a terse description of the new
features added to bash-3.2 since the
release of bash-3.1. As always, the
manual page (doc/bash.1) is the place
to look for complete descriptions.
New Features in Bash
snip
f. Quoting the string argument to the
[[ command's =~ operator now forces
string matching, as with the other pattern-matching operators.
Sadly this will break existing scripts that quote the regex, unless you had the foresight to store patterns in variables and use those instead of inline regexes. Example below.
$ bash --version
GNU bash, version 3.2.39(1)-release (i486-pc-linux-gnu)
Copyright (C) 2007 Free Software Foundation, Inc.
$ number=2
$ if [[ $number =~ "[0-9]" ]]; then echo match; fi
$ if [[ $number =~ [0-9] ]]; then echo match; fi
match
$ re="[0-9]"
$ if [[ $number =~ $re ]]; then echo MATCH; fi
MATCH
$ bash --version
GNU bash, version 3.00.0(1)-release (i586-suse-linux)
Copyright (C) 2004 Free Software Foundation, Inc.
$ number=2
$ if [[ $number =~ "[0-9]" ]]; then echo match; fi
match
$ if [[ "$number" =~ [0-9] ]]; then echo match; fi
match
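The 3.2 behaviour can also be used on purpose: quoting only part of the right-hand side makes that part match literally while the rest is still treated as a regex. A minimal sketch, assuming bash 3.2 or later with compat31 off; the helper name is_version_tag and the sample strings are made up for illustration.

```shell
#!/usr/bin/env bash

# is_version_tag is a hypothetical helper name.
# The unquoted parts are regex syntax; the quoted "." matches a literal dot
# (assumes bash >= 3.2 without the compat31 option).
is_version_tag() {
  [[ $1 =~ ^v[0-9]+"."[0-9]+$ ]]
}

for s in v1.2 v10.25 v1x2; do
  if is_version_tag "$s"; then
    printf '%s: match\n' "$s"
  else
    printf '%s: no match\n' "$s"
  fi
done
```

With an unquoted dot, v1x2 would also match, since a bare dot in a regex stands for any character.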
Bash 3.2 introduced a compatibility option, compat31, which reverts bash's regular-expression quoting behavior to that of 3.1.
Without compat31:
$ shopt -u compat31
$ shopt compat31
compat31 off
$ set -x
$ if [[ "9" =~ "[0-9]" ]]; then echo match; else echo no match; fi
+ [[ 9 =~ \[0-9] ]]
+ echo no match
no match
With compat31:
$ shopt -s compat31
+ shopt -s compat31
$ if [[ "9" =~ "[0-9]" ]]; then echo match; else echo no match; fi
+ [[ 9 =~ [0-9] ]]
+ echo match
match
Link to patch:
http://ftp.gnu.org/gnu/bash/bash-3.2-patches/bash32-039
GNU bash, version 4.2.25(1)-release (x86_64-pc-linux-gnu)
Some examples of string match and regex match
$ if [[ 234 =~ "[0-9]" ]]; then echo matches; fi # string match
$
$ if [[ 234 =~ [0-9] ]]; then echo matches; fi # regex match
matches
$ var="[0-9]"
$ if [[ 234 =~ $var ]]; then echo matches; fi # regex match
matches
$ if [[ 234 =~ "$var" ]]; then echo matches; fi # string match after substituting $var as [0-9]
$ if [[ 'rss$var919' =~ "$var" ]]; then echo matches; fi # string match after substituting $var as [0-9]
$ if [[ 'rss$var919' =~ $var ]]; then echo matches; fi # regex match after substituting $var as [0-9]
matches
$ if [[ "rss\$var919" =~ "$var" ]]; then echo matches; fi # string match won't work
$ if [[ "rss\\$var919" =~ "$var" ]]; then echo matches; fi # string match won't work
$ if [[ "rss'$var'""919" =~ "$var" ]]; then echo matches; fi # $var is substituted on LHS & RHS and then string match happens
matches
$ if [[ 'rss$var919' =~ "\$var" ]]; then echo matches; fi # string match !
matches
$ if [[ 'rss$var919' =~ "$var" ]]; then echo matches; fi # string match failed
$
$ if [[ 'rss$var919' =~ '$var' ]]; then echo matches; fi # string match
matches
$ echo $var
[0-9]
$
$ if [[ abc123def =~ "[0-9]" ]]; then echo matches; fi
$ if [[ abc123def =~ [0-9] ]]; then echo matches; fi
matches
$ if [[ 'rss$var919' =~ '$var' ]]; then echo matches; fi # string match due to single quotes on RHS $var matches $var
matches
$ if [[ 'rss$var919' =~ $var ]]; then echo matches; fi # Regex match
matches
$ if [[ 'rss$var' =~ $var ]]; then echo matches; fi # Above e.g. really is regex match and not string match
$
$ if [[ 'rss$var919[0-9]' =~ "$var" ]]; then echo matches; fi # string match RHS substituted and then matched
matches
$ if [[ 'rss$var919' =~ "'$var'" ]]; then echo matches; fi # trying to string match '$var' fails
$ if [[ '$var' =~ "'$var'" ]]; then echo matches; fi # string match still fails as single quotes are omitted on RHS
$ if [[ \'$var\' =~ "'$var'" ]]; then echo matches; fi # this string match works as single quotes are included now on RHS
matches
As mentioned in other answers, putting the regular expression in a variable is a general way to achieve compatibility over different bash versions. You may also use this workaround to achieve the same thing, while keeping your regular expression within the conditional expression:
$ number=1
$ if [[ $number =~ $(echo "[0-9]") ]]; then echo matched; fi
matched
$
Using a local variable has slightly better performance than using command substitution.
For larger scripts, or collections of scripts, it might make sense to use a utility to prevent unwanted local variables polluting the code, and to reduce verbosity. This seems to work well:
# Bash's built-in regular expression matching requires the regular expression
# to be unquoted (see https://stackoverflow.com/q/218156), which makes it harder
# to use some special characters, e.g., the dollar sign.
# This wrapper works around the issue by using a local variable, which means the
# quotes are not passed on to the regex engine.
regex_match() {
local string regex
string="${1?}"
regex="${2?}"
# shellcheck disable=SC2046 `regex` is deliberately unquoted, see above.
[[ "${string}" =~ ${regex} ]]
}
Example usage:
if regex_match "${number}" '[0-9]'; then
echo matched
fi

bash regex in 4.1

The following code works fine on bash 3.5 but not on 4.1:
regex='^WORD\-([^(WORD2)][^[:space:]]{1,}$)|(WORD2[[:space:]][^[:space:]]{2,}$)'
if ! [[ $appname =~ $regex ]]
then
printf "no match"
ct_dev_error=$((ct_dev_error+1))
fi
Any solutions or ideas?
Your regex can be simplified to this:
regex='^WORD-(WORD2[[:space:]][^[:space:]]{2,}|[^[:space:]]+)$'
Test it:
appname='WORD-APP' && [[ $appname =~ $regex ]] && echo "${BASH_REMATCH[0]}"
WORD-APP
appname='WORD-BUD APP' && [[ $appname =~ $regex ]] && echo "${BASH_REMATCH[0]}"
appname='WORD-WORD2 APP' && [[ $appname =~ $regex ]] && echo "${BASH_REMATCH[0]}"
WORD-WORD2 APP
[^(WORD2)] does not negate a match of WORD2. It is a negated character class: it matches a single character that is NOT one of the characters (, W, O, R, D, 2 or ).
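A quick way to see the character-class behaviour is to test single characters against it. This is a sketch for illustration; the function name outside_set and the sample characters are assumptions, not part of the answer.

```shell
#!/usr/bin/env bash

# outside_set is a hypothetical helper name.
# The regex accepts exactly one character that is none of ( W O R D 2 ).
outside_set() {
  local re='^[^(WORD2)]$'
  [[ $1 =~ $re ]]
}

for c in W 2 '(' A x; do
  if outside_set "$c"; then
    printf '%s: accepted\n' "$c"
  else
    printf '%s: rejected\n' "$c"
  fi
done
```

W, 2 and ( are rejected individually, which shows that the class works character by character and knows nothing about the word WORD2 as a whole.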

regex with variable length

In Bash, can we match varying length of strings?
Eg.,
regex="FOO[0-9]{5}"
if [[ $1 =~ ${regex} ]]
will match FOO1111 or FOO98765 right...
But how do we match FOO1, FOO123 and FOO12345?
regex="FOO[0-9]{1-5}" doesn't work.
Is there a way do that in a simple manner or we just use:
regex5="FOO[0-9]{5}"
regex4="FOO[0-9]{4}"
regex3="FOO[0-9]{3}"
regex2="FOO[0-9]{2}"
regex="FOO[0-9]"
if [[ $1 =~ ${regex} || $1 =~ ${regex2} || $1 =~ ${regex3} || $1 =~ ${regex4} || $1 =~ ${regex5} ]]
You can use {min,max}:
regex="FOO[0-9]{1,5}"
You can also use ^ and $ to require an exact match:
[[ $v =~ ^${regex}$ ]]
Test
$ v=FOO
$ [[ $v =~ ^${regex}$ ]] && echo "yes"
$
$ v=FOO1
$ [[ $v =~ ^${regex}$ ]] && echo "yes"
yes
$ v=FOO123456
$ [[ $v =~ ^${regex}$ ]] && echo "yes"
$
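The same check can be wrapped in a function with the anchors stored in the regex itself. A minimal sketch; the function name is_foo_id is hypothetical and the sample values simply extend the test above.

```shell
#!/usr/bin/env bash

# is_foo_id is a hypothetical helper name.
# {1,5} is an interval expression: between one and five digits after FOO.
is_foo_id() {
  local re='^FOO[0-9]{1,5}$'
  [[ $1 =~ $re ]]
}

for v in FOO FOO1 FOO123 FOO12345 FOO123456; do
  if is_foo_id "$v"; then
    printf '%s: yes\n' "$v"
  else
    printf '%s: no\n' "$v"
  fi
done
```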
