retrieve a word after a regular expression in shell script - regex

I am trying to retrieve specific fields from a text file which has a metadata as follows:
project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN
And I have the following script for retrieving the field 'cell'
while read line
do
cell="$(echo $line | cut -d";" -f2 )"
echo $cell
done < files.txt
However, this script retrieves the whole field, cell=ABC, whereas I just want the value 'ABC'. How do I retrieve the value after the regex, in the same line of code?

If extracting one value (or, generally, a non-repeating set of values captured by distinct capture groups) is enough and you're running bash, ksh, or zsh, consider using the regex-matching operator, =~: [[ string =~ regex ]]:
Tip of the hat to @Adrian Frühwirth for the gist of the ksh and zsh solutions.
Sample input string:
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
Shell-specific use of =~ is discussed next; a multi-shell implementation of the =~ functionality via a shell function can be found at the end.
bash
The special BASH_REMATCH array variable receives the results of the matching operation: element 0 contains the entire match, element 1 the first capture group's (parenthesized subexpression's) match, and so on.
bash 3.2+:
[[ $string =~ \ cell=([^;]+) ]] && cell=${BASH_REMATCH[1]} # -> $cell == 'ABC'
bash 4.x:
While the specific command above works, using regex literals in bash 4.x is buggy, notably when involving word-boundary assertions \< and \> on Linux; e.g., [[ a =~ \<a ]] inexplicably doesn't match; workaround: use an intermediate variable (unquoted!): re='\<a'; [[ a =~ $re ]] works (also on bash 3.2+).
bash 3.0 and 3.1 - or after setting shopt -s compat31:
Quote the regex to make it work:
[[ $string =~ ' cell=([^;]+)' ]] && cell=${BASH_REMATCH[1]} # -> $cell == 'ABC'
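As a minimal sketch of the bash approach above, the match can be wrapped in a small helper that looks up any key. The function name get_field and its calling convention are hypothetical, not part of the original answer, and the sketch assumes the key contains no regex metacharacters. The regex is stored in a variable and expanded unquoted, which sidesteps the version-specific quoting differences described above.

```shell
#!/usr/bin/env bash
# get_field is a hypothetical helper name, not part of the original answer.
# It prints the value stored under KEY in a "key=value; key=value" line.
# Assumption: KEY contains no regex metacharacters.
get_field() {
  local line=$1 key=$2
  # Keep the regex in a variable and expand it unquoted.
  local re="(^|; )${key}=([^;]+)"
  # Group 1 is the anchor/separator, group 2 is the value.
  [[ $line =~ $re ]] && printf '%s\n' "${BASH_REMATCH[2]}"
}

string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
cell=$(get_field "$string" cell)
project=$(get_field "$string" project)
echo "cell=$cell project=$project"
```

With the sample string this prints cell=ABC project=XYZ; a key that is not present yields an empty result and a non-zero exit status.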
ksh
The ksh syntax is the same as in bash, except:
the name of the special array variable that contains the matched strings is .sh.match (you must enclose the name in {...} even when just implicitly referring to the first element with ${.sh.match}):
[[ $string =~ \ cell=([^;]+) ]] && cell=${.sh.match[1]} # -> $cell == 'ABC'
zsh
The zsh syntax is also similar to bash, except:
The regex literal must be quoted - for simplicity as a whole, or at least some shell metacharacters, such as ;.
you may, but needn't double-quote a regex provided as a variable value.
Note how this quoting behavior differs fundamentally from that of bash 3.2+: zsh requires quoting only for syntax reasons and always treats the resulting string as a whole as a regex, whether it or parts of it were quoted or not.
There are 2 variables containing the matching results:
$MATCH contains the entire matched string
array variable $match contains only the matches for the capture groups (note that zsh arrays start with index 1 and that you don't need to enclose the variable name in {...} to reference array elements)
[[ $string =~ ' cell=([^;]+)' ]] && cell=$match[1] # -> $cell == 'ABC'
Multi-shell implementation of the =~ operator as shell function reMatch
The following shell function abstracts away the differences between bash, ksh, zsh with respect to the =~ operator; the matches are returned in array variable ${reMatch[@]}.
As @Adrian Frühwirth notes, to write portable (across zsh, ksh, bash) code with this, you need to execute setopt KSH_ARRAYS in zsh so as to make its arrays start with index 0; as a side effect, you also have to use the ${...[]} syntax when referencing arrays, as in ksh and bash.
Applied to our example we'd get:
# zsh: make arrays behave like in ksh/bash: start at *0*
[[ -n $ZSH_VERSION ]] && setopt KSH_ARRAYS
reMatch "$string" ' cell=([^;]+)' && cell=${reMatch[1]}
Shell function:
# SYNOPSIS
# reMatch string regex
# DESCRIPTION
# Multi-shell implementation of the =~ regex-matching operator;
# works in: bash, ksh, zsh
#
# Matches STRING against REGEX and returns exit code 0 if they match.
# Additionally, the matched string(s) is returned in array variable ${reMatch[@]},
# which works the same as bash's ${BASH_REMATCH[@]} variable: the overall
# match is stored in the 1st element of ${reMatch[@]}, with matches for
# capture groups (parenthesized subexpressions), if any, stored in the remaining
# array elements.
# NOTE: zsh arrays by default start with index *1*.
# EXAMPLE:
# reMatch 'This AND that.' '^(.+) AND (.+)\.' # -> ${reMatch[@]} == ('This AND that.', 'This', 'that')
function reMatch {
typeset ec
unset -v reMatch # initialize output variable
[[ $1 =~ $2 ]] # perform the regex test
ec=$? # save exit code
if [[ $ec -eq 0 ]]; then # copy result to output variable
[[ -n $BASH_VERSION ]] && reMatch=( "${BASH_REMATCH[@]}" )
[[ -n $KSH_VERSION ]] && reMatch=( "${.sh.match[@]}" )
[[ -n $ZSH_VERSION ]] && reMatch=( "$MATCH" "${match[@]}" )
fi
return $ec
}
Note:
function reMatch (as opposed to reMatch()) is used to declare the function, which is required for ksh to truly create local variables with typeset.

I would not use cut, since you cannot specify more than one delimiter.
If your grep supports PCRE, then you can do:
$ string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
$ grep -oP '(?<=cell=)[^;]+' <<< "$string"
ABC
You can use sed, which in simple terms can be done as -
$ sed -r 's/.*cell=([^;]+).*/\1/' <<< "$string"
ABC
Another option is to use awk. With that you can do the following by specifying list of delimiters you want to consider as field separators:
$ awk -F'[;= ]' '{print $5}' <<< "$string"
ABC
You can certainly add more checks by iterating over the fields of the line, so that you don't have to hard-code printing the 5th field.
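One way to avoid the hard-coded field number, as a rough sketch: split the line on semicolons only and compare the text before each "=" with the wanted key. The awk variable name key and the separator regex '; *' are choices made for this illustration, not something from the answer above.

```shell
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

# Split on ";" plus optional spaces, then look for the field whose key matches.
cell=$(printf '%s\n' "$string" | awk -F'; *' -v key=cell '{
  for (i = 1; i <= NF; i++) {
    n = index($i, "=")                      # position of the first "="
    if (n && substr($i, 1, n - 1) == key)   # text before "=" is the key
      print substr($i, n + 1)               # text after "=" is the value
  }
}')
echo "$cell"
```

This keeps working if the order of the fields in the metadata line changes.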
Note that if your shell does not support here-string notation <<< then you can echo the variable and pipe it to the command.
$ echo "$string" | cmd

Here's a native shell solution:
$ string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
$ cell=${string#*cell=}
$ cell=${cell%%;*}
$ echo "${cell}"
ABC
This removes the shortest leading match up to and including cell= from the string, then removes the longest trailing match starting at and including the first remaining ; leaving you with ABC.
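To make the difference between the shortest and longest variants concrete, here is a small illustration on a shortened version of the sample string. The variable names are made up for this sketch.

```shell
string='project=XYZ; cell=ABC; strain=C3H'

short_prefix=${string#*=}    # remove shortest prefix matching *=  -> XYZ; cell=ABC; strain=C3H
long_prefix=${string##*=}    # remove longest prefix matching *=   -> C3H
short_suffix=${string%;*}    # remove shortest suffix matching ;*  -> project=XYZ; cell=ABC
long_suffix=${string%%;*}    # remove longest suffix matching ;*   -> project=XYZ

printf '%s\n' "$short_prefix" "$long_prefix" "$short_suffix" "$long_suffix"
```

These four expansions are specified by POSIX, so they should behave the same in any POSIX-compatible shell, not just bash.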
Here's another solution which uses read to split the strings:
$ cat t.sh
#!/bin/bash
while IFS=$'; \t' read -ra attributes; do
for foo in "${attributes[@]}"; do
IFS='=' read -r key value <<< "${foo}"
[ "${key}" = cell ] && echo "${value}"
done
done <<EOF
foo=X; cell=ABC; quux=Z;
foo=X; cell=DEF; quux=Z;
EOF
$ ./t.sh
ABC
DEF
For solutions using external tools see @jaypal's excellent answer.

Related

In Bash, what is the string replacement pattern to remove any number of leading hyphens?

This strips out any number of the leading hyphens:
§ echo '--nom-nom' | perl -pe 's|^-+||'
nom-nom
What should the replacement pattern look like if I want to use bash string replacement to do the same? This does not work:
§ a=--nom-nom; a="${a/^-+/}"; echo $a
--nom-nom
Replacing all hyphens works, but that is not what I want:
§ a=--nom-nom; a="${a//-/}"; echo $a
nomnom
If shell option extglob is set, you can use an extended pattern
$ shopt -s extglob
$ a=--nom-nom; a="${a##*(-)}"; echo $a
nom-nom
If you don't want to always enable extglob, you can use a subshell to temporarily set it:
$ shopt -u extglob
$ a=--nom-nom; a=$(shopt -s extglob; echo "${a##*(-)}"); echo $a
${var##*(-)} uses the "remove longest matching prefix" replacement. You could also use ${var/#*(-)/}; in this context, the # forces the match to be initial. In both cases, *(pattern) means "nothing or any number of repetitions of 'pattern'", similar to regex syntax except that the * comes first and the parentheses are required.
If you want to use regular expressions, you can use the expr command:
$ expr "$a" : '-*\(.*\)'
nom-nom
Note that this is not a bash built-in, but it is required by POSIX. It always uses POSIX Basic Regular Expressions, which is why the capture parentheses need to be backslashed. (As noted in the documentation, it is expected that there will be precisely one capture group in the regex.)
You can capture what you want vs eliminate what you don't want with a Bash regex:
$ s='--nom-nom'
$ [[ $s =~ ^-*(.*) ]] && echo ${BASH_REMATCH[1]}
nom-nom
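For completeness, here is a sketch of a variant that needs neither extglob nor a regex. It is not taken from the answers above: it first isolates the run of leading hyphens with one parameter expansion and then strips exactly that prefix with a second one. The function name strip_leading_hyphens is hypothetical.

```shell
#!/usr/bin/env bash
# strip_leading_hyphens is a hypothetical helper name used for illustration.
strip_leading_hyphens() {
  local s=$1
  # Remove the longest suffix that starts at the first non-hyphen character;
  # what remains is just the run of leading hyphens (possibly empty).
  local lead=${s%%[!-]*}
  # Strip that exact prefix; quoting makes it a literal string, not a pattern.
  printf '%s\n' "${s#"$lead"}"
}

strip_leading_hyphens --nom-nom
strip_leading_hyphens nom-nom
```

Both calls print nom-nom; hyphens inside the word are left alone.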

Difference between grep -E regex and Bash regex in conditional expression

For the same regex applied to the same string, why does grep -E match, but the Bash =~ operator in [[ ]] does not?
$ D=Dw4EWRwer
$ echo $D|grep -qE '^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_-\ ]{1,22}$' || echo wrong pattern
$ [[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_-\ ]{1,22}$ ]] || echo wrong pattern
wrong pattern
Update: I confirm this worked:
[[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]\ _-]{1,22}$ ]] || echo wrong pattern
The problem (for both versions of the code) is on this character class:
[[:alnum:]_-\ ]
In the grep version, because the regex is enclosed in single quotes, the backslash doesn't escape anything and the character range received by grep is exactly how it is represented above.
In the bash version, the backslash (\) escapes the space that follows it and the actual character class used by [[ ]] to test is [[:alnum:]_- ].
Because in the ASCII table the underscore (_) comes after both the space ( ) and the backslash (\), neither of these character classes is correct: each contains a range (_-\ or _- ) whose end point sorts before its start point.
For the bash version you can use:
[[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_-\ ]{1,22}$ ]]; echo $?
to verify its outcome. If the regex is incorrect, the exit code is 2.
If you want to put a dash (-) into a character class you have to put it either as the first character in the class (just after [ or [^ if it is a negating class) or as the last character in the class (right before the closing ]).
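A quick way to check the placement rule is to try both positions. The following is only an illustrative sketch: the two regexes and the helper name check are made up for this example, and the regexes are kept in variables so that the space needs no escaping.

```shell
#!/usr/bin/env bash
# Illustrative regexes: the hyphen is literal because it is last or first in the class.
re_last='^[[:alnum:]_ -]+$'
re_first='^[-[:alnum:]_ ]+$'

# check is a hypothetical helper: report whether string $1 matches regex $2.
check() {
  local s=$1 re=$2
  if [[ $s =~ $re ]]; then
    echo "match:    '$s'"
  else
    echo "no match: '$s'"
  fi
}

check 'foo-bar baz_1' "$re_last"    # hyphen last in the class
check 'foo-bar baz_1' "$re_first"   # hyphen first in the class
check 'foo+bar' "$re_last"          # + is not in the class
```

The first two calls report a match and the third does not, since + is not part of the bracket expression.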
The grep version of the code should be (there is no need to escape anything inside a string enclosed in single quotes):
$ echo $D | grep -qE '^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_ -]{1,22}$' || echo wrong pattern
The bash version of your code should be:
[[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_\ -]{1,22}$ ]] || echo wrong pattern
Based on your comment, you want the bracket expression to contain alphanumeric characters, spaces, underscores and dashes, so the dash is not supposed to indicate a range. To add a hyphen to a bracket expression, it has to be the first or last character in it. Additionally, you don't have to escape things in bracket expressions, so you can drop the backslash. Your grep regex includes a literal \ in the bracket expression:
$ grep -q '[\]' <<< '\' && echo "Match"
Match
In the Bash regex, the space has to be escaped because the string is first read by the shell, but see below how to avoid that.
First, fixing your regex:
^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_ -]{1,22}$
The backslash is gone, and the hyphen is moved to the end. Using this with grep works fine:
$ D=Dw4EWRwer
$ grep -E '^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_ -]{1,22}$' <<< "$D"
Dw4EWRwer
To use the regex within [[ ]] directly, the space has to be escaped:
$ [[ $D =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_\ -]{1,22}$ ]] && echo "Match"
Match
I would make the following changes:
Use character classes where possible: [A-Z] is [[:upper:]], [A-Za-z0-9] is [[:alnum:]]
Store the regex in a variable for usage in [[ ]]; this has two advantages: no escaping characters special to the shell, and compatibility with older Bash versions, as the quoting requirements changed between 3.1 and 3.2 (see the Patterns article in the BashGuide).
The regex would then become this for grep:
$ grep -E '^[[:upper:]][[:alnum:]]{1,2}[[:alnum:]_ -]{1,22}$' <<< "$D"
Dw4EWRwer
and this in Bash:
$ re='^[[:upper:]][[:alnum:]]{1,2}[[:alnum:]_ -]{1,22}$'
$ [[ $D =~ $re ]] && echo "Match"
Match

Regex stored in a shell variable doesn't work between double brackets

The below is a small part of a bigger script I'm working on, but the below is giving me a lot of pain which causes a part of the bigger script to not function properly. The intention is to check if the variable has a string value matching red hat or Red Hat. If it is, then change the variable name to redhat. But it doesn't quite match the regex I've used.
getos="red hat"
rh_reg="[rR]ed[:space:].*[Hh]at"
if [ "$getos" =~ "$rh_reg" ]; then
getos="redhat"
fi
echo $getos
Any help will be greatly appreciated.
There are multiple things to fix here:
bash supports regex pattern matching within its [[ extended test operator and not within its POSIX standard [ test operator
Never quote the regex match string. bash 3.2 introduced a compatibility option, compat31 (under New Features in Bash 1.l), which reverts bash's regular expression quoting behavior back to that of 3.1, which supported quoting of the regex string.
Fix the regex to use [[:space:]] instead of just [:space:]
So just do
getos="red hat"
rh_reg="[rR]ed[[:space:]]*[Hh]at"
if [[ "$getos" =~ $rh_reg ]]; then
getos="redhat"
fi
echo "$getos"
or enable the compat31 option from the extended shell option
shopt -s compat31
getos="red hat"
rh_reg="[rR]ed[[:space:]]*[Hh]at"
if [[ "$getos" =~ "$rh_reg" ]]; then
getos="redhat"
fi
echo "$getos"
shopt -u compat31
But instead of messing with those shell options just use the extended test operator [[ with an unquoted regex string variable.
There are two issues:
First, replace:
rh_reg="[rR]ed[:space:].*[Hh]at"
With:
rh_reg="[rR]ed[[:space:]]*[Hh]at"
A character class like [:space:] only works when it is in square brackets. Also, it appears that you wanted to match zero or more spaces and that is [[:space:]]* not [[:space:]].*. The latter would match a space followed by zero or more of anything at all.
Second, replace:
[ "$getos" =~ "$rh_reg" ]
With:
[[ "$getos" =~ $rh_reg ]]
Regex matching requires bash's extended test: [[...]]. The POSIX standard test, [...], does not have the feature. Also, in bash, regular expressions only work if they are unquoted.
Examples:
$ rh_reg='[rR]ed[[:space:]]*[Hh]at'
$ getos="red Hat"; [[ "$getos" =~ $rh_reg ]] && getos="redhat"; echo $getos
redhat
$ getos="RedHat"; [[ "$getos" =~ $rh_reg ]] && getos="redhat"; echo $getos
redhat
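The effect of quoting the right-hand side of =~ can be seen in a short experiment. This sketch assumes bash 3.2 or later with compat31 not set; the regex a.c is just an example value.

```shell
#!/usr/bin/env bash
re='a.c'   # example regex: "a", any single character, "c"

# Unquoted: $re is treated as a regex, so "." matches the "b".
[[ abc =~ $re ]] && echo "unquoted: abc matches"

# Quoted: the value is matched as a literal string, so "abc" does not contain "a.c" ...
[[ abc =~ "$re" ]] || echo "quoted:   abc does not match"

# ... but the literal text "a.c" does.
[[ a.c =~ "$re" ]] && echo "quoted:   a.c matches"
```

All three messages are printed, which shows that quoting silently turns the regex into a plain substring test rather than producing an error.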

Pattern matching in if statement in bash

I'm trying to count the words with at least two vowels in all the .txt files in the directory. Here's my code so far:
#!/bin/bash
wordcount=0
for i in $HOME/*.txt
do
cat $i |
while read line
do
for w in $line
do
if [[ $w == .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
then
wordcount=`expr $wordcount + 1`
echo $w ':' $wordcount
else
echo "In else"
fi
done
done
echo $i ':' $wordcount
wordcount=0
done
Here is my sample from a txt file
Last modified: Sun Aug 20 18:18:27 IST 2017
To remove PPAs
sudo apt-get install ppa-purge
sudo ppa-purge ppa:
The problem is it doesn't match the pattern in the if statement for all the words in the text file. It goes directly to the else statement. And secondly, the wordcount in echo $i ':' $wordcount is equal to 0 which should be some value.
Immediate Issue: Glob vs Regex
[[ $string = $pattern ]] doesn't perform regex matching; instead, it's a glob-style pattern match. While . means "any character" in regex, it matches only itself in glob.
You have a few options here:
Use =~ instead to perform regular expression matching:
[[ $w =~ .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
Use a glob-style expression instead of a regex:
[[ $w = *[aeiouAEIOU]*[aeiouAEIOU]* ]]
Note the use of = rather than == here; while either is technically valid, the former avoids building finger memory that would lead to bugs when writing code for a POSIX implementation of test / [, as = is the only valid string comparison operator there.
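The two options can be compared side by side. In this sketch the function names has_two_vowels_glob and has_two_vowels_regex are hypothetical, and the test words are taken from the sample text in the question.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: succeed if the word contains at least two vowels.
has_two_vowels_glob()  { [[ $1 = *[aeiouAEIOU]*[aeiouAEIOU]* ]]; }
has_two_vowels_regex() { local re='[aeiouAEIOU].*[aeiouAEIOU]'; [[ $1 =~ $re ]]; }

for w in remove PPAs sudo To; do
  if has_two_vowels_glob "$w";  then g=yes; else g=no; fi
  if has_two_vowels_regex "$w"; then r=yes; else r=no; fi
  echo "$w glob=$g regex=$r"
done
```

Both forms agree on every word here: remove and sudo qualify, PPAs and To do not. Note that the regex needs no leading or trailing .* because =~ matches anywhere in the string, whereas the glob must cover the whole word.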
Larger Issue: Properly Reading Word-By-Word
Using for w in $line is innately unsafe. Use read -a to read a line into an array of words:
#!/usr/bin/env bash
wordcount=0
for i in "$HOME"/*.txt; do
while read -r -a words; do
for word in "${words[@]}"; do
if [[ $word = *[aeiouAEIOU]*[aeiouAEIOU]* ]]; then
(( ++wordcount ))
fi
done
done <"$i"
printf '%s: %s\n' "$i" "$wordcount"
wordcount=0
done
Try:
awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
Sample output looks like:
$ awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
one.txt:1
sample.txt:9
How it works:
/[aeiouAEIOU].*[AEIOUaeiou]/{n++}
Every time we find a word with two vowels, we increment variable n.
ENDFILE{print FILENAME":"n; n=0}
At the end of each file, we print the name of the file and the 2-vowel word count n. We then reset n to zero.
RS='[[:space:]]'
This tells awk to use any whitespace as a word separator. This makes each word into a record. Awk reads the input one record at a time.
Shell issues
The use of awk avoids a multitude of shell issues. For example, consider the line for w in $line. This will not work the way you hope. Consider a directory with these files:
$ ls
one.txt sample.txt
Now, let's take line='* Item One' and see what happens:
$ line='* Item One'
$ for w in $line; do echo "w=$w"; done
w=one.txt
w=sample.txt
w=Item
w=One
The shell treats the * in line as a wildcard and expands it into a list of files. Odds are you didn't want this. The awk solution avoids a variety of issues like this.
Using grep - this is pretty simple to do.
#!/bin/bash
wordcount=0
for file in ./*.txt
do
count=`cat $file | xargs -n1 | grep -ie "[aeiou].*[aeiou]" | wc -l`
wordcount=`expr $wordcount + $count`
done
echo $wordcount

How do I use a regex in a shell script?

I want to match an input string (contained in the variable $1) with a regex representing the date formats MM/DD/YYYY and MM-DD-YYYY.
REGEX_DATE="^\d{2}[\/\-]\d{2}[\/\-]\d{4}$"
 
echo "$1" | grep -q $REGEX_DATE
echo $?
The echo $? returns the error code 1 no matter the input string.
To complement the existing helpful answers:
Using Bash's own regex-matching operator, =~, is a faster alternative in this case, given that you're only matching a single value already stored in a variable:
set -- '12-34-5678' # set $1 to sample value
kREGEX_DATE='^[0-9]{2}[-/][0-9]{2}[-/][0-9]{4}$' # note use of [0-9] to avoid \d
[[ $1 =~ $kREGEX_DATE ]]
echo $? # 0 with the sample value, i.e., a successful match
Note, however, that the caveat re using flavor-specific regex constructs such as \d equally applies:
While =~ supports EREs (extended regular expressions), it also supports the host platform's specific extensions - it's a rare case of Bash's behavior being platform-dependent.
To remain portable (in the context of Bash), stick to the POSIX ERE specification.
Note that =~ even allows you to define capture groups (parenthesized subexpressions) whose matches you can later access through Bash's special ${BASH_REMATCH[@]} array variable.
Further notes:
$kREGEX_DATE is used unquoted, which is necessary for the regex to be recognized as such (quoted parts would be treated as literals).
While not always necessary, it is advisable to store the regex in a variable first, because Bash has trouble with regex literals containing \.
E.g., on Linux, where \< is supported to match word boundaries, [[ 3 =~ \<3 ]] && echo yes doesn't work, but re='\<3'; [[ 3 =~ $re ]] && echo yes does.
I've changed variable name REGEX_DATE to kREGEX_DATE (k signaling a (conceptual) constant), so as to ensure that the name isn't an all-uppercase name, because all-uppercase variable names should be avoided to prevent conflicts with special environment and shell variables.
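As a sketch of the capture-group feature mentioned above, the date regex can be extended with parentheses to pull out the individual parts. The function name parse_date and the output variables month, day and year are hypothetical; like the original regex, this only checks the format, not whether the date is a real calendar date.

```shell
#!/usr/bin/env bash
# parse_date is a hypothetical helper: on success it sets month, day and year.
parse_date() {
  local re='^([0-9]{2})[-/]([0-9]{2})[-/]([0-9]{4})$'
  [[ $1 =~ $re ]] || return 1
  month=${BASH_REMATCH[1]}
  day=${BASH_REMATCH[2]}
  year=${BASH_REMATCH[3]}
}

if parse_date '12-34-5678'; then
  echo "month=$month day=$day year=$year"
fi
```

With the sample value used earlier this prints month=12 day=34 year=5678, and input that does not fit either format makes the function return a non-zero status.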
I think this is what you want:
REGEX_DATE='^\d{2}[/-]\d{2}[/-]\d{4}$'
echo "$1" | grep -P -q $REGEX_DATE
echo $?
I've used the -P switch to get perl regex.
The problem is that you're trying to use regex features not supported by grep; namely, your \d won't work. Use this instead:
REGEX_DATE="^[[:digit:]]{2}[-/][[:digit:]]{2}[-/][[:digit:]]{4}$"
echo "$1" | grep -qE "${REGEX_DATE}"
echo $?
You need the -E flag to get ERE in order to use {#} style.