bash: assign grep regex results to array - regex

I am trying to assign a regular expression result to an array inside of a bash script but I am unsure whether that's possible, or if I'm doing it entirely wrong. The below is what I want to happen, however I know my syntax is incorrect:
indexes[4]=$(echo b5f1e7bfc2439c621353d1ce0629fb8b | grep -o '[a-f0-9]\{8\}')
such that:
index[1]=b5f1e7bf
index[2]=c2439c62
index[3]=1353d1ce
index[4]=0629fb8b
Any links, or advice, would be wonderful :)

here
array=( $(echo b5f1e7bfc2439c621353d1ce0629fb8b | grep -o '[a-f0-9]\{8\}') )
$ echo "${array[@]}"
b5f1e7bf c2439c62 1353d1ce 0629fb8b
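The unquoted command substitution above relies on word splitting, and it would also glob-expand any result containing * or ?. A minimal sketch of an alternative, assuming bash 4 or later (for mapfile) and a grep that supports -o; the array name chunks is only illustrative:

```shell
#!/usr/bin/env bash
# Read one grep match per line into an array, with no word splitting or globbing.
hexstring="b5f1e7bfc2439c621353d1ce0629fb8b"
mapfile -t chunks < <(printf '%s\n' "$hexstring" | grep -o '[a-f0-9]\{8\}')

echo "${#chunks[@]}"   # number of matches
echo "${chunks[1]}"    # second match, since bash arrays are zero-based
```

Note that the indexes run from 0 to 3 here, not 1 to 4 as in the question.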

#!/bin/bash
# Bash >= 3.2
hexstring="b5f1e7bfc2439c621353d1ce0629fb8b"
# build a regex to get four groups of eight hex digits
for i in {1..4}
do
regex+='([[:xdigit:]]{8})'
done
[[ $hexstring =~ $regex ]] # match the regex
array=("${BASH_REMATCH[@]}") # copy the match array which is readonly
unset 'array[0]' # so we can eliminate the full match and only use the parenthesized captured matches
for i in "${array[@]}"
do
echo "$i"
done
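If you don't need the indexes to start at 1, the copy and the unset can probably be folded into one array slice. A rough sketch, assuming bash 3.2 or later; the regex is spelled out instead of built in a loop, and the array name groups is made up for the example:

```shell
#!/usr/bin/env bash
hexstring="b5f1e7bfc2439c621353d1ce0629fb8b"
regex='^([[:xdigit:]]{8})([[:xdigit:]]{8})([[:xdigit:]]{8})([[:xdigit:]]{8})$'

if [[ $hexstring =~ $regex ]]; then
    # Slice from index 1 to skip the full match stored at index 0.
    groups=("${BASH_REMATCH[@]:1}")
    printf '%s\n' "${groups[@]}"
fi
```

The slice is re-indexed from 0, so the first group ends up in ${groups[0]}.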

here's a pure bash way, no external commands needed
#!/bin/bash
declare -a array
s="b5f1e7bfc2439c621353d1ce0629fb8b"
for((i=0;i<${#s};i+=8))
do
array=("${array[@]}" "${s:$i:8}")
done
echo "${array[@]}"
output
$ ./shell.sh
b5f1e7bf c2439c62 1353d1ce 0629fb8b
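The same loop can be wrapped in a function so the chunk width is a parameter. This is only a sketch: split_fixed and the global array pieces are hypothetical names, and it assumes the width is a positive integer (a width of 0 would loop forever).

```shell
#!/usr/bin/env bash
# split_fixed STRING WIDTH: store consecutive WIDTH-character pieces of STRING
# in the global array "pieces". The last piece may be shorter than WIDTH.
split_fixed() {
    local s=$1 width=$2 i
    pieces=()
    for ((i = 0; i < ${#s}; i += width)); do
        pieces+=("${s:i:width}")
    done
}

split_fixed "b5f1e7bfc2439c621353d1ce0629fb8b" 8
printf '%s\n' "${pieces[@]}"
```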

Related

Excluding the first 3 characters of a string using regex

Given any string in bash, e.g flaccid, I want to match all characters in the string but the first 3 (in this case I want to exclude "fla" and match only "ccid"). The regex also needs to work in sed.
I have tried positive look behind and the following regex expressions (as well as various other unsuccessful ones):
^.{3}+([a-z,A-Z]+)
sed -r 's/(?<=^....)(.[A-Z]*)/,/g'
Google hasn't been very helpful as it only produce results like "get first 3 characters .."
Thanks in advance!
If you want to get all characters but the first 3 from a string, you can use cut:
str="flaccid"
cut -c 4- <<< "$str"
or bash variable substitution:
str="flaccid"
echo "${str:3}"
That will strip the first 3 characters out of your string.
You may just use a capturing group within an expression like ^.{3}(.*) / ^.{3}([a-zA-Z]+) and grab the ${BASH_REMATCH[1]} contents:
#!/bin/bash
text="flaccid"
rx="^.{3}(.*)"
if [[ $text =~ $rx ]]; then
echo ${BASH_REMATCH[1]};
fi
See online Bash demo
In sed, you should also be using capturing groups / backreferences to get what you need. To just remove the first 3 chars, you may use a simple:
echo "flaccid" | sed 's/.\{3\}//'
See this regex demo. The .\{3\} matches any 3 chars, and since the g modifier is not used, only the first match, at the beginning of the string, is removed.
Now, both the solutions above will output ccid, i.e. everything but the first 3 chars.
Using sed, just remove them
echo string | sed 's/^...//g'
How is it that no-one has named the most simple and portable solution:
shell "Parameter expansions":
str="flaccid"
echo "${str#???}"
For a regex (bash):
$ str="flaccid"
$ regex='^.{3}(.*)$'
$ [[ $str =~ $regex ]] && echo "${BASH_REMATCH[1]}"
ccid
Same regex in sed:
$ echo "flaccid" | sed -E "s/$regex/\1/"
ccid
Or sed (Basic Regex):
$ echo "flaccid" | sed 's/^.\{3\}\(.*\)$/\1/'
ccid
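One difference between the two expansions mentioned in this thread shows up when the string has fewer than 3 characters. A small sketch, assuming bash (the ${var:offset} form is not POSIX); the variable names are illustrative:

```shell
#!/usr/bin/env bash
long="flaccid"
short="ab"

echo "${long:3}"     # substring expansion (bash, ksh, zsh)
echo "${long#???}"   # POSIX prefix removal

by_offset=${short:3}     # offset past the end of the string: empty result
by_pattern=${short#???}  # ??? cannot match two characters: value left unchanged
echo "[$by_offset] [$by_pattern]"
```

So if short input should come out empty, the offset form is the safer of the two.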

Capture group from regex in bash

I have the following string /path/to/my-jar-1.0.jar for which I am trying to write a bash regex to pull out my-jar.
Now I believe the following regex would work: ([^\/]*?)-\d but I don't know how to get bash to run it.
The following: echo '/path/to/my-jar-1.0.jar' | grep -Po '([^\/]*?)-\d' captures my-jar-1
In BASH you can do:
s='/path/to/my-jar-1.0.jar'
[[ $s =~ .*/([^/[:digit:]]+)-[[:digit:]] ]] && echo "${BASH_REMATCH[1]}"
my-jar
Here "${BASH_REMATCH[1]}" will print captured group #1 which is expression inside first (...).
You can do this as well with shell prefix and suffix removal:
$ path=/path/to/my-jar-1.0.jar
# Remove the longest prefix ending with a slash
$ base="${path##*/}"
# Remove the longest suffix starting with a dash followed by a digit
$ base="${base%%-[0-9]*}"
$ echo "$base"
my-jar
Although it's a little annoying to have to do the transform in two steps, it has the advantage of only using Posix features so it will work with any compliant shell.
Note: The order is important, because the basename cannot contain a slash, but a path component could contain a dash. So you need to remove the path components first.
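To see why the order matters, here is a small sketch with a made-up path whose directory name also contains a dash followed by a digit:

```shell
# Hypothetical path: the directory "app-2" also contains "-<digit>".
path=/opt/app-2/my-jar-1.0.jar

# Correct order: drop the directories first, then the version suffix.
base=${path##*/}
right=${base%%-[0-9]*}

# Reversed order: the suffix removal already starts inside "app-2".
trimmed=${path%%-[0-9]*}
wrong=${trimmed##*/}

echo "$right"
echo "$wrong"
```

The first echo prints my-jar; the reversed order prints app instead.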
grep -o doesn't recognize "capture groups" I think, just the entire match. That said, with Perl regexps (-P) you have the "lookahead" option to exclude the -\d from the match:
echo '/path/to/my-jar-1.0.jar' | grep -Po '[^/]*(?=-\d)'
Some reference material on lookahead/lookbehind:
http://www.perlmonks.org/?node_id=518444

retrieve a word after a regular expression in shell script

I am trying to retrieve specific fields from a text file which has a metadata as follows:
project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN
And I have the following script for retrieving the field 'cell'
while read line
do
cell="$(echo "$line" | cut -d";" -f2)"
echo "$cell"
done < files.txt
However the following script retrieves the whole field as cell=ABC , whereas I just want the value 'ABC' from the field, how do I retrieve the value after the regex, in the same line of code?
If extracting one value (or, generally, a non-repeating set of values captured by distinct capture groups) is enough and you're running bash, ksh, or zsh, consider using the regex-matching operator, =~: [[ string =~ regex ]]:
Tip of the hat to @Adrian Frühwirth for the gist of the ksh and zsh solutions.
Sample input string:
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
Shell-specific use of =~ is discussed next; a multi-shell implementation of the =~ functionality via a shell function can be found at the end.
bash
The special BASH_REMATCH array variable receives the results of the matching operation: element 0 contains the entire match, element 1 the first capture group's (parenthesized subexpression's) match, and so on.
bash 3.2+:
[[ $string =~ \ cell=([^;]+) ]] && cell=${BASH_REMATCH[1]} # -> $cell == 'ABC'
bash 4.x:
While the specific command above works, using regex literals in bash 4.x is buggy, notably when involving word-boundary assertions \< and \> on Linux; e.g., [[ a =~ \<a ]] inexplicably doesn't match; workaround: use an intermediate variable (unquoted!): re='\<a'; [[ a =~ $re ]] works (also on bash 3.2+).
bash 3.0 and 3.1 - or after setting shopt -s compat31:
Quote the regex to make it work:
[[ $string =~ ' cell=([^;]+)' ]] && cell=${BASH_REMATCH[1]} # -> $cell == 'ABC'
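For bash 3.2 and later, keeping the regex in a variable sidesteps most of the quoting questions above. A minimal sketch, reusing the sample string; the variable name re is arbitrary:

```shell
#!/usr/bin/env bash
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

re=' cell=([^;]+)'              # single quotes keep the shell from touching the regex
if [[ $string =~ $re ]]; then   # $re stays unquoted so it is treated as a regex
    cell=${BASH_REMATCH[1]}
fi
echo "$cell"
```

Quoting "$re" on the right-hand side would make bash 3.2+ match it as a literal string instead.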
ksh
The ksh syntax is the same as in bash, except:
the name of the special array variable that contains the matched strings is .sh.match (you must enclose the name in {...} even when just implicitly referring to the first element with ${.sh.match}):
[[ $string =~ \ cell=([^;]+) ]] && cell=${.sh.match[1]} # -> $cell == 'ABC'
zsh
The zsh syntax is also similar to bash, except:
The regex literal must be quoted - for simplicity as a whole, or at least some shell metacharacters, such as ;.
you may, but needn't double-quote a regex provided as a variable value.
Note how this quoting behavior differs fundamentally from that of bash 3.2+: zsh requires quoting only for syntax reasons and always treats the resulting string as a whole as a regex, whether it or parts of it were quoted or not.
There are 2 variables containing the matching results:
$MATCH contains the entire matched string
array variable $match contains only the matches for the capture groups (note that zsh arrays start with index 1 and that you don't need to enclose the variable name in {...} to reference array elements)
[[ $string =~ ' cell=([^;]+)' ]] && cell=$match[1] # -> $cell == 'ABC'
Multi-shell implementation of the =~ operator as shell function reMatch
The following shell function abstracts away the differences between bash, ksh, zsh with respect to the =~ operator; the matches are returned in array variable ${reMatch[@]}.
As @Adrian Frühwirth notes, to write portable (across zsh, ksh, bash) code with this, you need to execute setopt KSH_ARRAYS in zsh so as to make its arrays start with index 0; as a side effect, you also have to use the ${...[]} syntax when referencing arrays, as in ksh and bash.
Applied to our example we'd get:
# zsh: make arrays behave like in ksh/bash: start at *0*
[[ -n $ZSH_VERSION ]] && setopt KSH_ARRAYS
reMatch "$string" ' cell=([^;]+)' && cell=${reMatch[1]}
Shell function:
# SYNOPSIS
# reMatch string regex
# DESCRIPTION
# Multi-shell implementation of the =~ regex-matching operator;
# works in: bash, ksh, zsh
#
# Matches STRING against REGEX and returns exit code 0 if they match.
# Additionally, the matched string(s) is returned in array variable ${reMatch[@]},
# which works the same as bash's ${BASH_REMATCH[@]} variable: the overall
# match is stored in the 1st element of ${reMatch[@]}, with matches for
# capture groups (parenthesized subexpressions), if any, stored in the remaining
# array elements.
# NOTE: zsh arrays by default start with index *1*.
# EXAMPLE:
# reMatch 'This AND that.' '^(.+) AND (.+)\.' # -> ${reMatch[@]} == ('This AND that.', 'This', 'that')
function reMatch {
typeset ec
unset -v reMatch # initialize output variable
[[ $1 =~ $2 ]] # perform the regex test
ec=$? # save exit code
if [[ $ec -eq 0 ]]; then # copy result to output variable
[[ -n $BASH_VERSION ]] && reMatch=( "${BASH_REMATCH[@]}" )
[[ -n $KSH_VERSION ]] && reMatch=( "${.sh.match[@]}" )
[[ -n $ZSH_VERSION ]] && reMatch=( "$MATCH" "${match[@]}" )
fi
return $ec
}
Note:
function reMatch (as opposed to reMatch()) is used to declare the function, which is required for ksh to truly create local variables with typeset.
I would not use cut, since you cannot specify more than one delimiter.
If your grep supports PCRE, then you can do:
$ string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
$ grep -oP '(?<=cell=)[^;]+' <<< "$string"
ABC
You can use sed, which in simple terms can be done as -
$ sed -r 's/.*cell=([^;]+).*/\1/' <<< "$string"
ABC
Another option is to use awk. With that you can do the following by specifying list of delimiters you want to consider as field separators:
$ awk -F'[;= ]' '{print $5}' <<< "$string"
ABC
You can certainly put more checks by iterating over the line so that you don't have to hard-code to print 5th field.
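A sketch of that idea, looking the field up by name instead of by position. It assumes the fields are separated by ";" plus optional spaces, as in the sample; the awk variable key is just a name chosen for this example:

```shell
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

cell=$(printf '%s\n' "$string" |
    awk -F'; *' -v key=cell '{
        for (i = 1; i <= NF; i++) {
            n = index($i, "=")   # position of the first "=" in this field
            if (n && substr($i, 1, n - 1) == key) print substr($i, n + 1)
        }
    }')
echo "$cell"
```

Changing key=cell to another field name should pick out that field without touching the awk program.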
Note that if your shell does not support here-string notation <<< then you can echo the variable and pipe it to the command.
$ echo "$string" | cmd
Here's a native shell solution:
$ string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
$ cell=${string#*cell=}
$ cell=${cell%%;*}
$ echo "${cell}"
ABC
This removes the shortest leading match up to and including cell= from the string, then removes the longest trailing match starting at the first ; leaving you with ABC.
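The two expansions can be wrapped in a small function that takes the key as an argument. This is a sketch only: get_field is a hypothetical name, and it assumes bash and fields separated by "; " exactly as in the sample string.

```shell
#!/usr/bin/env bash
# get_field KEY STRING: print the value of KEY, or return 1 if KEY is absent.
get_field() {
    local key=$1 padded="; $2" value
    # Padding with "; " lets the first field be matched like all the others.
    value=${padded#*"; ${key}="}
    [[ $value == "$padded" ]] && return 1   # nothing was removed: key not found
    printf '%s\n' "${value%%;*}"
}

string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
get_field cell "$string"
get_field id "$string"
```

Matching on "; key=" rather than just "key=" avoids picking up a key that merely ends with the requested name.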
Here's another solution which uses read to split the strings:
$ cat t.sh
#!/bin/bash
while IFS=$'; \t' read -ra attributes; do
for foo in "${attributes[@]}"; do
IFS='=' read -r key value <<< "${foo}"
[ "${key}" = cell ] && echo "${value}"
done
done <<EOF
foo=X; cell=ABC; quux=Z;
foo=X; cell=DEF; quux=Z;
EOF
$ ./t.sh
ABC
DEF
For solutions using external tools see @jaypal's excellent answer.
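If several fields are needed from the same line, it may be simpler to parse everything once into an associative array. A minimal sketch, assuming bash 4 or later; the names meta and pairs are illustrative:

```shell
#!/usr/bin/env bash
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

declare -A meta                        # associative array, bash 4+
IFS=';' read -ra pairs <<< "$string"   # split on ";" only, so spaces are kept
for pair in "${pairs[@]}"; do
    pair=${pair# }                     # drop the single leading space, if any
    [[ $pair == *=* ]] || continue     # skip anything that is not key=value
    meta[${pair%%=*}]=${pair#*=}
done

echo "${meta[cell]} ${meta[strain]}"
```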

regular expression extract string after a colon in bash

I need to extract the string after the : in an example below:
package:project.abc.def
Where i would get project.abc.def as a result.
I am attempting this in bash and i believe i have a regular expression that will work :([^:]*)$.
In my bash script i have package:project.abc.def as a variable called apk. Now how do i assign the same variable the substring found with the regular expression?
Where the result from package:project.abc.def would be in the apk variable. And package:project.abc.def is initially in the apk variable?
Thanks!
There is no need for a regex here, just a simple prefix substitution:
$ apk="package:project.abc.def"
$ apk=${apk##package:}
$ echo "$apk"
project.abc.def
The ## syntax is one of bash's parameter expansions; it removes a matching prefix. Instead of #, % can be used to trim the end. See this section of the bash man page for the details.
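For reference, a quick sketch of how the four trimming forms differ, using a made-up file name:

```shell
file="archive.tar.gz"   # illustrative value, not from the question

echo "${file#*.}"    # shortest prefix matching "*." removed
echo "${file##*.}"   # longest prefix matching "*." removed
echo "${file%.*}"    # shortest suffix matching ".*" removed
echo "${file%%.*}"   # longest suffix matching ".*" removed
```

With a fixed prefix like package: there is no wildcard, so # and ## give the same result.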
Some alternatives:
$ apk=$(echo $apk | awk -F'package:' '{print $2}')
$ apk=$(echo $apk | sed 's/^package://')
$ apk=$(echo $apk | cut -d':' -f2)
$ string="package:project.abc.def"
$ apk=$(echo $string | sed 's/.*\://')
".*:" matches everything up to and including the last ':', and then it's removed from the string.
Capture groups from regular expressions can be found in the BASH_REMATCH array.
[[ $str =~ :([^:]*)$ ]]
# 0 is the substring that matches the entire regex
# n >= 1: the nth parenthesized group
apk=${BASH_REMATCH[1]}

Return a regex match in a Bash script, instead of replacing it

I just want to match some text in a Bash script. I've tried using sed but I can't seem to make it just output the match instead of replacing it with something.
echo -E "TestT100String" | sed 's/[0-9]+/dontReplace/g'
Which will output TestTdontReplaceString.
Which isn't what I want, I want it to output 100.
Ideally, it would put all the matches in an array.
edit:
Text input is coming in as a string:
newName()
{
#Get input from function
newNameTXT="$1"
if [[ $newNameTXT ]]; then
#Use code that im working on now, using the $newNameTXT string.
fi
}
You could do this purely in bash using the double square bracket [[ ]] test operator, which stores results in an array called BASH_REMATCH:
[[ "TestT100String" =~ ([0-9]+) ]] && echo "${BASH_REMATCH[1]}"
echo "TestT100String" | sed 's/[^0-9]*\([0-9]\+\).*/\1/'
echo "TestT100String" | grep -o '[0-9]\+'
The method you use to put the results in an array depends somewhat on how the actual data is being retrieved. There's not enough information in your question to be able to guide you well. However, here is one method:
index=0
while read -r line
do
array[index++]=$(echo "$line" | grep -o '[0-9]\+')
done < filename
Here's another way:
array=($(grep -o '[0-9]\+' filename))
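If the input is already in a variable, all matches can also be collected without grep by applying =~ repeatedly. This is a rough sketch, not a general solution: collect_matches and the global array matches are hypothetical names, it needs bash 3.2+, and it assumes a simple regex without anchors, since it continues after the first occurrence of the matched text.

```shell
#!/usr/bin/env bash
# collect_matches STRING REGEX: fill the global array "matches" with every
# non-overlapping match of REGEX in STRING, left to right.
collect_matches() {
    local rest=$1 re=$2
    matches=()
    while [[ $rest =~ $re ]]; do
        [[ -n ${BASH_REMATCH[0]} ]] || break   # avoid looping on empty matches
        matches+=("${BASH_REMATCH[0]}")
        rest=${rest#*"${BASH_REMATCH[0]}"}     # continue after this match
    done
}

collect_matches "Test1T100String42" '[0-9]+'
printf '%s\n' "${matches[@]}"
```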
Pure Bash. Use parameter substitution (no external processes and pipes):
string="TestT100String"
echo ${string//[^[:digit:]]/}
Removes all non-digits.
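One thing to watch with this approach: if the string contains more than one number, the digits are run together. A short sketch with a second, made-up input:

```shell
#!/usr/bin/env bash
one="TestT100String"
two="Test1T100String"   # illustrative input with two separate numbers

only_one=${one//[^[:digit:]]/}
merged=${two//[^[:digit:]]/}
echo "$only_one"
echo "$merged"
```

The second echo prints 1100, so this only works when the string holds a single number.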
I know this is an old topic, but I came here through the same searches and found another great possibility: applying a regex to a string/variable using grep:
# Simple
$(echo "TestT100String" | grep -Po "[0-9]{3}")
# More complex using lookaround
$(echo "TestT100String" | grep -Po "(?i)TestT\K[0-9]{3}(?=String)")
With lookaround, the search expression can also constrain what surrounds the match. Here (?i) makes the pattern case-insensitive, \K discards everything matched so far from the reported match (so TestT must precede the digits but is not output, much like a lookbehind), and (?=String) is a lookahead that requires String after the digits without including it in the match.
https://www.regular-expressions.info/lookaround.html
The given example matches the same as the PCRE regex TestT([0-9]{3})String
Use grep. Sed is an editor. If you only want to match a regexp, grep is more than sufficient.
using awk
linux$ echo -E "TestT100String" | awk '{gsub(/[^0-9]/,"")}1'
100
I don't know why nobody ever uses expr: it's portable and easy.
newName()
{
#Get input from function
newNameTXT="$1"
if num=$(expr "$newNameTXT" : '[^0-9]*\([0-9][0-9]*\)'); then
echo "contains $num"
fi
}
Well, sed's s/pattern1/pattern2/g just replaces every occurrence of pattern1 with pattern2.
Besides that, sed prints the entire line by default.
I suggest piping the output to a cut command and extracting the numbers you want.
If you want to use only sed, then use a TRE (capture groups):
sed -n 's/.*\([0-9]\)\([0-9]\)\([0-9]\).*/\1,\2,\3/p'
I didn't try executing the above command, so make sure the syntax is right.
Hope this helped.
using just the bash shell
declare -a array
i=0
while read -r line
do
case "$line" in
*TestT*String* )
while true
do
line=${line#*TestT}
array[$i]=${line%%String*}
line=${line#*String*}
i=$((i+1))
case "$line" in
*TestT*String* ) continue;;
*) break;;
esac
done
esac
done <"file"
echo "${array[@]}"