Extract version using Regular expressions in bash

Extract version using Regular expressions in bash - regex

All I need to do is extract the versioning information from the following file:
my_archive_1.1.1.201_x86_64.tgz
I am trying to extract both the version number which is 1.1.1 and the release number which is 201. Normally I use python for these purposes, but I have been asked not to. How do I do it by just using bash? The filename will always be of the form
([A-Za-z_]+)_([0-9]+\.[0-9]+\.[0-9]+)\.([0-9]+)_x86_64\.tgz
The groups are in parenthesis. I need the second and third groups if you start counting from 1.

Use pure BASH:
s='my_archive_1.1.1.201_x86_64.tgz'
[[ $s =~ ^[^_]+_[^_]+_(([^.]+\.){2}[^.]+)\.([^_]+) ]] && \
echo "${BASH_REMATCH[1]}, ${BASH_REMATCH[3]}"
OUTPUT:
1.1.1, 201
Using your own regex:
[[ $s =~ ([A-Za-z_]+)_([0-9]+\.[0-9]+\.[0-9]+).([0-9]+)_x86_64\.tgz ]] && \
echo "${BASH_REMATCH[2]}, ${BASH_REMATCH[3]}"

You can use simple string substitutions to extract substrings. You don't really need regular expressions. As a bonus, this is portable to other POSIX shells. Whether this is simpler or not is a matter of taste, and also depends on the problem.
s='my_archive_1.1.1.201_x86_64.tgz'
# ${s%%_[0-9]*} is 'my-archive'
s=${s#${s%%_[0-9]*}_}
# s='1.1.1.201_x86_64.tgz'
s=${s%%_*}
# s='1.1.1.201'
release=${s##*.}
version=${s%."$release"}
You might also want to experiment with set:
s='my_archive_1.1.1.201_x86_64.tgz'
oldIFS=$IFS
IFS=_
set $s
# $1 = my, $2=archive, $3=1.1.1.201, $4=x86, $5=64.tgz
# Shift until $1 contains only numbers and periods
while $1; do
case $1 in *[!.0-9]* ) shift ;; *) break ;; esac
done
IFS=.
set $1
version=$1.$2.$3
release=$4
IFS=$oldIFS

Another alternative without using regular expressions:
split=`echo "my_archive_1.1.1.201_x86_64.tgz" | cut -d'_' -f3`
versionnumber=`echo $split | cut -d'.' -f1,2,3`
releasenumber=`echo $split | cut -d'.' -f4`
echo "$versionnumber $releasenumber"

Related

Bash Regex to extract everything between the last occurrence of a string (release-) and some characters (--)

I have multiple strings, where I want to extract everything between the last occurrence of a string (release-) and some characters (--). More specifically, for a sting like the following:
inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE
I want to have the following output:
PI_4.1-Sprint-3.1a
I created a regex online, which you can find here. There regex is the following:
.*release-(.*)--.*
However, when I am trying to use this script into a bash script, it wont work. Here is an example.
artifactoryVersion="inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE"
[[ "$artifactoryVersion" =~ (.*release-(.*)--.*) ]]
echo $BASH_REMATCH[0]
echo $BASH_REMATCH[1]
Will return:
inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE[0]
inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE[1]
Do you have any ideas about how can I accomplish my goal in bash?

You may use:
s='inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE'
rx='.*-release-(.*)--'
[[ $s =~ $rx ]] && echo "${BASH_REMATCH[1]}"
PI_4.1-Sprint-3.1a
Code Demo
Your regex appears correct but make sure to use "${BASH_REMATCH[1]}" to extract first capture group in the result.

You need to use the following:
#!/bin/bash
artifactoryVersion="inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE"
if [[ "$artifactoryVersion" =~ .*release-(.*)-- ]]; then
echo ${BASH_REMATCH[1]};
fi
See the online demo
Output:
PI_4.1-Sprint-3.1a

With your shown samples please try following BASH code with regex. I have also mentioned comments before executing each statement to understand each statement here.
##Shell variable named var being created here.
var="inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE"
##Mentioning regex which needs to be checked on later in program.
regex="(.*release-release)-(.*)--"
##Check condition on var variable with regex if match found then print 2nd capturing group value.
[[ $var =~ $regex ]] && echo "${BASH_REMATCH[2]}"
Explanation of regex: Following is the detailed explanation for used regex.
regex="(.*release-release)-(.*)--": Creating shell variable named regex in which putting regular expression (.*release-release)-(.*)--.
Where regex is creating 2 capturing groups.
First matching everything till release-release(with greedy match), which is followed by a -(not captured anywhere).
Which is followed by a greedy match, which will basically match everything before -- to get the exactly needed value.

You can also do it with shell parameter expansions (it's slower than a bash regex but it's standard):
artifactoryVersion='inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE'
result=${artifactoryVersion##*-release-}
result=${result%%--*}
printf %s\\n "$result"
PI_4.1-Sprint-3.1a
Or directly with a bash parameter expansion and extended globing:
#!/bin/bash
shopt -s extglob
artifactoryVersion='inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE'
echo "${artifactoryVersion//#(*-release-|--*)}"
PI_4.1-Sprint-3.1a

RegEx : How can I extract a certain part and modify it?

I'd like to extract a certain part of a string and modify it by using a regular expression.
A given string is TestcaseVzwPerformance_8_2_1_4_1_FDD2.
I'd like to extract the part 8_2_1_4_1 from the string and then replace the underscores _ with dots . So the expected result needs to be 8.2.1.4.1.
The numbers and length of the given string can be different.
For example,
Given string // Expected result
TestcaseVzwCqi_3_9_Test2 // 3.9
TestcaseVzwSvd1xRttAclr_6_6_2_3 // 6.6.2.3
TestcaseVzwCsiFading_9_4_1_1_1_FDD4 // 9.4.1.1.1
Here is my RegEx:
((?:\D{0,}_)(\d(_\d)*)(.*))
The numbered capturing group - $2 - contains 8_2_1_4_1 but with underscores.
Can I replace the underscores with dots?
It needs to be done in one RegEx and a Replace.

regex cannot modify, for example with sed
echo TestcaseVzwPerformance_8_2_1_4_1_FDD2 |
sed -E 's/[^_]*_(([_0-9])+)_.*/\1/;s/_/./g'
8.2.1.4.1

If you have a Bash string, you can use a Bash regex to capture and Bash parameter expansions to replace:
$ s="TestcaseVzwSvd1xRttAclr_6_6_2_3"
$ [[ $s =~ ^[^_]*_([[:digit:]_]+)_* ]] && tmp=${BASH_REMATCH[1]//_/.} && echo "${tmp%.}"
6.6.2.3
Which can be in a loop:
while read -r line; do
if [[ $line =~ ^[^_]*_([[:digit:]_]+)_* ]]; then
tmp=${BASH_REMATCH[1]//_/.}
echo "\"$line\" => ${tmp%.}"
fi
done <<< 'Given string
TestcaseVzwCqi_3_9_Test2
TestcaseVzwSvd1xRttAclr_6_6_2_3
TestcaseVzwCsiFading_9_4_1_1_1_FDD4'
Prints:
"TestcaseVzwCqi_3_9_Test2" => 3.9
"TestcaseVzwSvd1xRttAclr_6_6_2_3" => 6.6.2.3
"TestcaseVzwCsiFading_9_4_1_1_1_FDD4" => 9.4.1.1.1
You can use the same loop to process a file.
If you have a file, you may as well use gawk:
$ awk 'BEGIN{FPAT="_[[:digit:]_]+"}
/_[[:digit:]]/ {sub(/^_/,"", $1); sub(/_$/,"",$1); gsub(/_/,".",$1); print $1}' file
3.9
6.6.2.3
9.4.1.1.1

is there any named regular expression capture for grep?

i'd like to know if its possible to get named regular expression with grep -P(linux bash) from a non formatted string? well.. from any string
For example:
John Smith www.website.com john#website.com jan-01-2001
to capture as
$name
$website
$email
$date
but it seems I cant pass any variables from output?
echo "www.website.com" | grep -Po '^(www\.)?(?<domain>.+)$' | echo $domain
has no output

no. grep is a process. you are talking about environment propagation from child to parent. that's forbidden.
instead, you can do
DATA=($your_line)
then take name=DATA[0] so and forth.
or another way using awk:
eval "`echo $your_line | awk '
function escape(s)
{
gsub(/'\''/,"'\''\"'\''\"'\''", s);
s = "'\''"s"'\''";
return s;
}
{
print "name="escape($1);
print "family_name="escape($2);
print "website="escape($3);
print "email="escape($4);
print "date="escape($5);
}'`"
the sense here is to propagate the info via stdout and eval it in the parent environment.
notice that, here, escape function will escape any string correctly such that nothing will be interpreted wrongly(like the evil of quotes).
following is the output from my jessie:
name='John'
family_name='Smith'
website='www.website.com'
email='john#website.com'
date='jan-01-2001'
if the family name is O'Reilly, the eval result will still be correct:
name='John'
family_name='O'"'"'Reilly'
website='www.website.com'
email='john#website.com'
date='jan-01-2001'

Grep is an independent command-line utility; it does not run inside of bash. So it couldn't create bash variables even if it wanted to.
However, bash has a regular expression matcher built-in. It's not a perl-compatible regex matcher, so it doesn't implement named captures. (To be precise, it matches Posix extended regular expressions, the same as grep -E.) But it does implement numbered captures.
You do regular expression matches with the =~ operator inside of the [[ ... ]] compound command syntax. If the regular expression matches, then the expression succeeds, and the captures are inserted into the array variable BASH_REMATCH. ${BASH_REMATCH[0]} will be the entire matched substring, and the remaining elements, starting with ${BASH_REMATCH[1]}, will be the individual captures in order.
For example:
$ url=www.example.com
$ [[ $url =~ ^(www\.)?(.*) ]]
$ echo "${BASH_REMATCH[1]}"
www.
$ echo "${BASH_REMATCH[2]}"
example.com

Bash regex to match substring with exact integer range

I need to match a string $str that contains any of
foo{77..93}
and capture the above substring in a variable.
So far I've got:
str=/random/string/containing/abc-foo78_efg/ # for example
if [[ $str =~ (foo[7-9][0-9]) ]]; then
id=${BASH_REMATCH[1]}
fi
echo $id # gives foo78
but this also captures ids outside of the target range (e.g. foo95).
Is there a way to restrict the regex to an exact integer range? (tried foo[77-93] but that doesn't work.
Thanks

If you want to use a regex, you're going to have to make it slightly more complex:
if [[ $str =~ foo(7[7-9]|8[0-9]|9[0-3]) ]]; then
id=${BASH_REMATCH[0]}
fi
Note that I have removed the capture group around the whole pattern and am now using the 0th element of the match array.
As an aside, for maximum compatibility with older versions of bash, I would recommend assigning the pattern to a variable and using in the test like this:
re='foo(7[7-9]|8[0-9]|9[0-3])'
if [[ $str =~ $re ]]; then
id=${BASH_REMATCH[0]}
fi
An alternative to using a regex would be to use an arithmetic context, like this:
if (( "${str#foo}" >= 77 && "${str#foo}" <= 93 )); then
id=$str
fi
This strips the "foo" part from the start of the variable so that the integer part can be compared numerically.

Sure is easy to do with Perl:
$ echo foo{1..100} | tr ' ' '\n' | perl -lne 'print $_ if m/foo(\d+)/ and $1>=77 and $1<=93'
foo77
foo78
foo79
foo80
foo81
foo82
foo83
foo84
foo85
foo86
foo87
foo88
foo89
foo90
foo91
foo92
foo93
Or awk even:
$ echo foo{1..100} | tr ' ' '\n' | awk -F 'foo' '$2>=77 && $2<=93
{print}'
foo77
foo78
foo79
foo80
foo81
foo82
foo83
foo84
foo85
foo86
foo87
foo88
foo89
foo90
foo91
foo92
foo93

Matching optional parameters with non-capturing groups in Bash regular expression

I want to parse strings similar to the following into separate variables using regular expressions from within Bash:
Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";
or
Category: resource;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Resource";rel="http://schemas.ogf.org/occi/core#entity";attributes="occi.core.summary";
The first part before "title" is common to all strings, the parts title and attributes are optional.
I managed to extract the mandatory parameters common to all strings, but I have trouble with optional parameters not necessarily present for all strings. As far as I found out, Bash doesn't support Non-capturing parentheses which I would use for this purpose.
Here is what I achieved thus far:
CATEGORY_REGEX='Category:\s*([^;]*);scheme="([^"]*)";class="([^"]*)";'
category_string='Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";'
[[ $category_string =~ $CATEGORY_REGEX ]]
echo ${BASH_REMATCH[0]}
echo ${BASH_REMATCH[1]}
echo ${BASH_REMATCH[2]}
echo ${BASH_REMATCH[3]}
The regular expression I would like to use (and which is working for me in Ruby) would be:
CATEGORY_REGEX='Category:\s*([^;]*);\s*scheme="([^"]*)";\s*class="([^"]*)";\s*(?:title="([^"]*)";)?\s*(?:rel="([^"]*)";)?\s*(?:location="([^"]*)";)?\s*(?:attributes="([^"]*)";)?\s*(?:actions="([^"]*)";)?'
Is there any other solution to parse the string with command line tools without having to fall back on perl, python or ruby?

I don't think non-capturing groups exist in bash regex, so your options are to use a scripting language or to remove the ?: from all of the (?:...) groups and just be careful about which groups you reference, for example:
CATEGORY_REGEX='Category:\s*([^;]*);\s*scheme="([^"]*)";\s*class="([^"]*)";\s*(title="([^"]*)";)?\s*(rel="([^"]*)";)?\s*(location="([^"]*)";)?\s*(attributes="([^"]*)";)?\s*(actions="([^"]*)";)?'
category_string='Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";'
[[ $category_string =~ $CATEGORY_REGEX ]]
echo "full: ${BASH_REMATCH[0]}"
echo "category: ${BASH_REMATCH[1]}"
echo "scheme: ${BASH_REMATCH[2]}"
echo "class: ${BASH_REMATCH[3]}"
echo "title: ${BASH_REMATCH[5]}"
echo "rel: ${BASH_REMATCH[7]}"
echo "location: ${BASH_REMATCH[9]}"
echo "attributes: ${BASH_REMATCH[11]}"
echo "actions: ${BASH_REMATCH[13]}"
Note that starting with the optional parameters we need to skip a group each time, because the even numbered groups from 4 on contain the parameter name as well as the value (if the parameter is present).

You can emulate non-matching groups in bash using a little bit of regexp magic:
_2__ _4__ _5__
[[ "fu#k" =~ ((.+)#|)((.+)/|)(.+) ]];
echo "${BASH_REMATCH[2]:--} ${BASH_REMATCH[4]:--} ${BASH_REMATCH[5]:--}"
# Output: fu - k
Characters # and / are parts of string we parse.
Regexp pipe | is used for either left or right (empty) part matching.
For curious, ${VAR:-<default value>} is variable expansion with default value in case $VAR is empty.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Extract version using Regular expressions in bash - regex

Another alternative without using regular expressions: split=`echo "my_archive_1.1.1.201_x86_64.tgz" | cut -d'_' -f3` versionnumber=`echo $split | cut -d'.' -f1,2,3` releasenumber=`echo $split | cut -d'.' -f4` echo "$versionnumber $releasenumber"

Related

Bash Regex to extract everything between the last occurrence of a string (release-) and some characters (--)

RegEx : How can I extract a certain part and modify it?

is there any named regular expression capture for grep?

Bash regex to match substring with exact integer range

Matching optional parameters with non-capturing groups in Bash regular expression

Categories

Resources