bash - print regex captured groups

I have a file.xml so composed:
...some xml text here...
<Version>1.0.13-alpha</Version>
...some xml text here...
I need to extract the following information:
major_and_minor_release_number --> 1.0
patch_number --> 13
suffix --> -alpha
I thought the cleanest way to achieve that would be a regex with the grep command:
<Version>(\d+\.\d+)\.(\d+)([\w-]+)?<\/Version>
I've checked with regex101 the correctness of this regex and actually it seems to properly capture the 3 fields I'm looking for. But here comes the problem, since I have no idea how to print those fields.
cat file.xml | grep "<Version>(\d+\.\d+)\.(\d+)([\w-]+)?<\/Version>" -oP
This command prints the entire line so it's quite useless.
Several posts on this site have been written about this topic, so I've also tried to use bash's native regex support, with poor results:
regex="<Version>(\d+\.\d+)\.(\d+)([\w-]+)?<\/Version>"
txt=$(cat file.xml)
[[ "$txt" =~ $regex ]] --> it fails!
echo "${BASH_REMATCH[*]}"
I'm sorry, but I cannot figure out how to overcome this issue. The desired output should be:
1.0
13
-alpha

You may use this read + sed solution with a regex similar to yours:
read -r major minor suffix < <(
sed -nE 's~.*<Version>([0-9]+\.[0-9]+)\.([0-9]+)(-[^<]*)</Version>.*~\1 \2 \3~p' file.xml
)
Check variable contents:
declare -p major minor suffix
declare -- major="1.0"
declare -- minor="13"
declare -- suffix="-alpha"
A few points:
You cannot use \d in grep without the -P (Perl-compatible) mode.
grep does not print capture groups separately; with -o it prints only the whole match.
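The bash attempt in the question fails for a related reason: `[[ ... =~ ... ]]` uses POSIX extended regular expressions, which have no `\d` or `\w`. A minimal sketch with bracket expressions instead, reading standard input line by line, might look like this (the function name `print_version_parts` is made up for illustration, and the suffix is assumed to start with a hyphen):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the capture groups of the first <Version> line on stdin.
print_version_parts() {
    local re='<Version>([0-9]+\.[0-9]+)\.([0-9]+)(-[^<]*)?</Version>'
    local line
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line =~ $re ]]; then
            # BASH_REMATCH[0] is the whole match; elements 1..3 are the groups.
            printf '%s\n' "${BASH_REMATCH[@]:1}"
            return 0
        fi
    done
    return 1
}

print_version_parts <<'EOF'
<Project>
<Version>1.0.13-alpha</Version>
</Project>
EOF
```

Matching one line at a time also avoids loading the whole file into a variable, as the original attempt did with `txt=$(cat file.xml)`.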

Use this Perl one-liner:
perl -lne 'print for m{<Version>(\d+\.\d+)\.(\d+)([\w-]+)?<\/Version>};' file.xml
Example:
echo '<Version>1.0.13-alpha</Version>' | perl -lne 'print for m{<Version>(\d+\.\d+)\.(\d+)([\w-]+)?<\/Version>};'
Output:
1.0
13
-alpha
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

Related

Regular expression in perl does not work as expected

I have a simple bash script that uses a line of perl code + regex to extract the necessary piece of string. It looks like
ANSWER=$(host $IPW 2>/dev/null | perl -p -e 's#^.+\s\b([a-zA-Z]{4,8}\d{1,3})(?=-\d\.).+$#\1#;')
It works for the most part, but produces unexpected matches from time to time. Example:
$ echo "Host 31.201.188.199.in-addr.arpa. not found: 3(NXDOMAIN)" | perl -p -e 's#^.+?\s\b([a-zA-Z]{4,8}\d{1,3})(?=-\d\.).+?(?=\.$)#\1#;'
Host 31.201.188.199.in-addr.arpa. not found: 3(NXDOMAIN)
The regex is supposed to match parts of the string like "server100" (letters + digits) and return the corresponding part. Is there something I am missing or don't understand yet? (Sorry for bothering.)

Your regex doesn't match, so no substitution is made. The line is therefore printed as is.
If you don't want to print when there is no match, you can use -n instead of -p, plus "and print" to print the line only when the substitution succeeds:
echo "Host 31.201.188.199.in-addr.arpa. not found: 3(NXDOMAIN)" |
perl -n -e 's#^.+?\s\b([a-zA-Z]{4,8}\d{1,3})(?=-\d\.).+?(?=\.$)#\1# and print'
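The same print-only-on-match idea carries over to sed, where `-n` suppresses the default output and the `p` flag prints only the lines on which the substitution succeeded. A small sketch, assuming a sed that accepts `-E` (GNU and BSD sed do); the pattern is simplified from the question, and `extract_host` and the second sample line are made up for illustration:

```shell
# extract_host is a hypothetical name; it prints only the captured host part.
extract_host() {
    sed -nE 's/^.*[[:space:]]([a-zA-Z]{4,8}[0-9]{1,3})-[0-9]\..*$/\1/p'
}

printf '%s\n' \
    'Host 31.201.188.199.in-addr.arpa. not found: 3(NXDOMAIN)' \
    '10.0.0.5.in-addr.arpa domain name pointer server100-1.example.com.' |
    extract_host
```

Only the second line produces output, because the NXDOMAIN line never matches and is therefore never printed.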

I assume the sample text that you show shouldn't be printed at all?
I suggest that you use a simple match instead of a substitution. I've also removed the superfluous parts of your regex pattern:
perl -lne 'print $1 if /.*\s([a-z]{4,8}\d{1,3})(?=-\d\.)/i'

extract substring using regex in shell script

The strings could be of form:
com.company.$(PRODUCT_NAME:rfc1034identifier)
$(PRODUCT_BUNDLE_IDENTIFIER)
com.company.$(PRODUCT_NAME:rfc1034identifier).$(someRandomVariable)
I need help writing a regex that extracts all the strings inside $(..).
I created a regex like ([(])\w+([)]), but when I try to execute it in a shell script, it gives me an unmatched parenthesis error.
This is what I executed:
echo "com.io.$(sdfsdfdsf)"|grep -P '([(])\w+([)])' -o
I need to get all matching substrings.

The problem is the use of double quotes in the echo command: inside them the shell interprets $(...) as a command substitution.
You can use single quotes:
echo 'com.io.$(sdfsdfdsf)' | grep -oP '[(]\w+[)]'
Here is an alternative using builtin BASH regex:
$> re='[(][^)]+[)]'
$> [[ 'com.io.$(sdfsdfdsf)' =~ $re ]] && echo "${BASH_REMATCH[0]}"
(sdfsdfdsf)
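The question asks for all matching substrings, while a single `=~` test reports only the first one. One way to collect the rest, sketched here under the assumption that the `$(...)` groups are never nested, is to match repeatedly and cut the consumed prefix off the string each time (`extract_vars` is an illustrative name):

```shell
#!/usr/bin/env bash
# Print the text inside every $(...) group of the first argument, one per line.
extract_vars() {
    local rest=$1
    local re='\$\(([^)]+)\)'
    while [[ $rest =~ $re ]]; do
        printf '%s\n' "${BASH_REMATCH[1]}"
        # Drop everything up to and including the current match.
        rest=${rest#*"${BASH_REMATCH[0]}"}
    done
}

extract_vars 'com.company.$(PRODUCT_NAME:rfc1034identifier).$(someRandomVariable)'
```

Quoting `${BASH_REMATCH[0]}` inside the expansion makes bash treat the matched text literally rather than as a glob pattern.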

You can do it quite simply with sed:
echo 'com.io.$(asdfasdf)'|sed -e 's/.*(\(.*\))/\1/g'
Gives
asdfasdf
For two fields:
echo 'com.io.$(asdfasdf).$(ddddd)'|sed -e 's/.*(\(.*\))\.\$(\(.*\))/\1 \2/g'
Gives
asdfasdf ddddd
Explanation:
sed -e 's/.*(\(.*\))/\1/g'
          \_/\____/  \/
           |    |    |_ print the placeholder content
           |    |___ placeholder selecting the text inside the parentheses
           |____ select the text from the beginning up to and including the opening parenthesis

Your question specifies "shell", but not "bash". So I'll start with a common shell-based tool (awk) rather than assuming you can use any particular set of non-POSIX built-ins.
$ cat inp.txt
com.company.$(PRODUCT_NAME:rfc1034identifier)
$(PRODUCT_BUNDLE_IDENTIFIER)
com.company.$(PRODUCT_NAME:rfc1034identifier).$(someRandomVariable)
$ awk -F'[()]' '{for(i=2;i<=NF;i+=2){print $i}}' inp.txt
PRODUCT_NAME:rfc1034identifier
PRODUCT_BUNDLE_IDENTIFIER
PRODUCT_NAME:rfc1034identifier
someRandomVariable
This awk one-liner defines a field separator that consists of opening or closing brackets. With such a field separator, every even-numbered field will be the content you're looking for, assuming all lines of input are correctly formatted and there are no parentheses embedded inside other parentheses.
If you did want to do this in POSIX shell alone, the following would be an option:
#!/bin/sh
while IFS= read -r line; do
    while expr "$line" : '.*(' >/dev/null; do
        line="${line#*(}"
        echo "${line%%)*}"
    done
done < inp.txt
This steps through each line of input, slicing it up using the parentheses and printing each slice. Note that this uses expr, which is most likely an external binary, but is at least included in POSIX.1.

Copy matched regex to new file

I want to copy regex matched text to a new file.
<SHOPITEM>([\s\S]*?)<YEAR>2015<\/YEAR>([\s\S]*?)<\/SHOPITEM>
([\s\S]*?) = any text, any line
This works (I am able to find) in Sublime editor, but how this regex looks for sed/grep (or any other Unix tool)?

sed and grep normally work line by line rather than in multiline mode, although multiline matching is still possible under certain conditions.
I would advise using Perl, which should be installed on your computer (-0777 makes it read the whole file at once):
perl -0777 -ne 'print $& if /<SHOPITEM>([\s\S]*?)<YEAR>2015<\/YEAR>([\s\S]*?)<\/SHOPITEM>/i' myfile.xml
Be aware that this regex won't work if you have nested <shopitem> tags or even multiple occurrences. Use an XML parser instead.
You can also write a program that parses your XML file; this version captures all the matches.
myparser.pl:
#!/usr/bin/env perl
undef $/;
$_ = <>;
print "$&\n" while /<(shopitem)>[\s\S]*?<(year)>2015<\/\2>[\s\S]*?<\/\1>/ig;
That you can execute:
$ chmod u+x myparser.pl
$ ./myparser.pl myfile.xml
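When no XML parser is at hand and each `<SHOPITEM>` and `</SHOPITEM>` tag sits on a line of its own item, a line-oriented awk sketch sidesteps the greedy-match pitfalls by buffering one item at a time. That layout is an assumption about the input, not something the question confirms, and `items_for_2015` is a made-up name:

```shell
# Print every <SHOPITEM> block that contains <YEAR>2015</YEAR>.
items_for_2015() {
    awk '
        /<SHOPITEM>/   { buf = ""; inside = 1; hit = 0 }   # start a new item
        inside         { buf = buf $0 "\n" }               # buffer its lines
        inside && /<YEAR>2015<\/YEAR>/ { hit = 1 }         # remember the year
        /<\/SHOPITEM>/ { if (inside && hit) printf "%s", buf; inside = 0 }
    ' "$@"
}

items_for_2015 <<'EOF'
<SHOPITEM>
<NAME>old</NAME>
<YEAR>2014</YEAR>
</SHOPITEM>
<SHOPITEM>
<NAME>new</NAME>
<YEAR>2015</YEAR>
</SHOPITEM>
EOF
```

Redirecting the function's output, for example `items_for_2015 myfile.xml > newfile.xml`, copies the matching items to a new file.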

I'm not the best scripter, but I think this should work:
grep "<SHOPITEM>" infile | grep "<YEAR>2015" | sed -e "s/<[^>]*>//g" | sed "s/2015/ /g" > outfile
Edit: I didn't match the regex, instead I got SHOPITEMs with YEAR 2015 tag and removed all the unwanted parts.
Edit: I'd do it this way, but I'm not sure it's the most elegant solution.

Print all matches of a regular expression from the command line?

What's the simplest way to print all matches (either one line per match or one line per line of input) to a regular expression on a unix command line? Note that there may be 0 or more than 1 match per line of input.
I assume there must be some way to do this with sed, awk, grep, and/or perl, and I'm hoping for a simple command line solution so it will show up in my bash history when needed in the future.
EDIT: To clarify, I do not want to print all matching lines, only the matches to the regular expression. For example, a line might have 1000 characters, but there are only two 10-character matches to the regular expression. I'm only interested in those two 10-character matches.

Assuming you only use non-capturing parentheses,
perl -wnE'say /yourregex/g'
or
perl -wnE'say for /yourregex/g'
Sample use:
$ echo -ne 'fod,food,fad\nbar\nfooooood\n' | perl -wnE'say for /fo*d/g'
fod
food
fooooood
$ echo -ne 'fod,food,fad\nbar\nfooooood\n' | perl -wnE'say /fo*d/g'
fodfood
fooooood

Unless I misunderstand your question, the following will do the trick
grep -o 'fo.*d' input.txt
For more details see:
GNU grep (most platforms)
Solaris grep
AIX grep
HP-UX grep
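With `-o`, grep prints each match on its own line, so several matches on one input line come out separately. The pattern matters, though: `fo.*d` is greedy and would report `fod,food,fad` as a single match, whereas `fo*d` yields the individual words from the sample input used in the Perl answer. `-o` is not part of POSIX, so this sketch assumes GNU or BSD grep:

```shell
# Each match is printed on a separate line; the line "bar" produces nothing.
printf 'fod,food,fad\nbar\nfooooood\n' | grep -o 'fo*d'
```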

Going off the comment, and assuming you're passed the input from a pipe or otherwise on STDIN:
perl -e 'my $re=shift;$re=~qr{$re};while(<STDIN>){if(/($re)/g){print"$1\n"}while(m/\G.*?($re)/g){print"$1\n"}}'
Usage:
cat SOME_TEXT_FILE | perl -e 'my $re=shift;$re=~qr{$re};while(<STDIN>){if(/($re)/g){print"$1\n"}while(m/\G.*?($re)/g){print"$1\n"}}' 'YOUR_REGEX'
or I would just stuff that whole mess into a bash function...
bggrep ()
{
    if [ -n "$1" ]; then
        perl -e 'my $re=shift;$re=~qr{$re};while(<STDIN>){if(/($re)/g){print"$1\n"}while(m/\G.*?($re)/g){print"$1\n"}}' "$1";
    else
        echo "Usage: bggrep <regex>";
    fi
}
Usage is the same, just cleaner-looking:
cat SOME_TEXT_FILE | bggrep 'YOUR_REGEX'
(or just type the command itself and enter the text to match line-by-line, but that didn't seem a likely use case :).
Example (from your comment):
bash$ cat garbage
fod,food,fad
bar
fooooooood
bash$ cat garbage | perl -e 'my $re=shift;$re=~qr{$re};while(<STDIN>){if(/($re)/g){print"$1\n"}while(m/\G.*?($re)/g){print"$1\n"}}' 'fo*d'
fod
food
fooooooood
or...
bash$ cat garbage | bggrep 'fo*d'
fod
food
fooooooood

perl -MSmart::Comments -ne '@a=m/(...)/g;print;' -e '### @a'

Return a regex match in a Bash script, instead of replacing it

I just want to match some text in a Bash script. I've tried using sed but I can't seem to make it just output the match instead of replacing it with something.
echo -E "TestT100String" | sed 's/[0-9]+/dontReplace/g'
Which will output TestTdontReplaceString.
Which isn't what I want, I want it to output 100.
Ideally, it would put all the matches in an array.
edit:
Text input is coming in as a string:
newName()
{
    #Get input from function
    newNameTXT="$1"
    if [[ $newNameTXT ]]; then
        #Use code that I'm working on now, using the $newNameTXT string.
    fi
}

You could do this purely in bash using the double square bracket [[ ]] test operator, which stores results in an array called BASH_REMATCH:
[[ "TestT100String" =~ ([0-9]+) ]] && echo "${BASH_REMATCH[1]}"
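The question also asks for all matches in an array. `BASH_REMATCH` holds only the first match, so one possible approach, sketched here, is to loop and consume the string as each match is found (the function name `collect_numbers` and the array name `matches` are illustrative):

```shell
#!/usr/bin/env bash
# Fill the global array "matches" with every run of digits in $1.
collect_numbers() {
    local rest=$1
    matches=()
    while [[ $rest =~ [0-9]+ ]]; do
        matches+=("${BASH_REMATCH[0]}")
        # Remove everything up to and including this match, then search again.
        rest=${rest#*"${BASH_REMATCH[0]}"}
    done
}

collect_numbers "TestT100String42"
printf '%s\n' "${matches[@]}"
```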

echo "TestT100String" | sed 's/[^0-9]*\([0-9]\+\).*/\1/'
echo "TestT100String" | grep -o '[0-9]\+'
The method you use to put the results in an array depends somewhat on how the actual data is being retrieved. There's not enough information in your question to be able to guide you well. However, here is one method:
index=0
while read -r line
do
array[index++]=$(echo "$line" | grep -o '[0-9]\+')
done < filename
Here's another way:
array=($(grep -o '[0-9]\+' filename))
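The `array=($(...))` form relies on word splitting, which is harmless for digits but also subjects the results to glob expansion. In bash 4 or later, `mapfile` reads one match per line without that step. A small sketch, assuming a grep that supports `-o`; `read_numbers` is a hypothetical helper and the temporary file stands in for the answer's `filename`:

```shell
#!/usr/bin/env bash
# read_numbers fills the global array "array" with every number found in file $1.
read_numbers() {
    mapfile -t array < <(grep -oE '[0-9]+' "$1")
}

sample=$(mktemp)
printf 'TestT100String\nno digits here\nx7y88\n' > "$sample"
read_numbers "$sample"
rm -f "$sample"
printf '%s\n' "${array[@]}"
```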

Pure Bash. Use parameter substitution (no external processes and pipes):
string="TestT100String"
echo "${string//[^[:digit:]]/}"
Removes all non-digits.
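One caveat with this approach: it deletes characters rather than matching, so separate runs of digits end up glued together. A quick illustration with a made-up input string:

```shell
#!/usr/bin/env bash
string="Test1String22"
digits=${string//[^[:digit:]]/}
# Both numbers are merged into a single string of digits.
echo "$digits"
```

If the individual numbers matter, a matching tool such as `grep -o` or the `=~` operator is the safer choice.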

I know this is an old topic, but I came here through similar searches and found another great possibility: applying a regex to a string/variable using grep:
# Simple
match=$(echo "TestT100String" | grep -Po "[0-9]{3}")
# More complex using lookaround
match=$(echo "TestT100String" | grep -Po "(?i)TestT\K[0-9]{3}(?=String)")
Lookaround lets you extend the search expression for more precise matching. Here (?i) makes the pattern case-insensitive, \K discards everything matched so far from the reported match (much like a lookbehind), and (?=...) is a lookahead containing the pattern that must follow the match.
https://www.regular-expressions.info/lookaround.html
The given example prints the same text that group 1 of the PCRE regex TestT([0-9]{3})String captures.

Use grep. Sed is an editor. If you only want to match a regexp, grep is more than sufficient.

using awk
linux$ echo -E "TestT100String" | awk '{gsub(/[^0-9]/,"")}1'
100

I don't know why nobody ever uses expr: it's portable and easy.
newName()
{
    #Get input from function
    newNameTXT="$1"
    if num=`expr "$newNameTXT" : '[^0-9]*\([0-9][0-9]*\)'`; then
        echo "contains $num"
    fi
}

Well, sed with s/pattern1/pattern2/g just replaces every occurrence of pattern1 with pattern2.
Besides that, sed prints the entire line by default.
I suggest piping the output to a cut command and extracting the numbers you want.
If you want to use only sed, then use TRE:
sed -n 's/.*\([0-9]\)\([0-9]\)\([0-9]\).*/\1,\2,\3/p'
I didn't try running the above command, so make sure the syntax is right.
Hope this helped.

using just the bash shell
declare -a array
i=0
while read -r line
do
    case "$line" in
        *TestT*String* )
            while true
            do
                line=${line#*TestT}
                array[$i]=${line%%String*}
                line=${line#*String*}
                i=$((i+1))
                case "$line" in
                    *TestT*String* ) continue;;
                    *) break;;
                esac
            done
    esac
done <"file"
echo "${array[@]}"
