RegEx : How can I extract a certain part and modify it? - regex

I'd like to extract a certain part of a string and modify it by using a regular expression.
A given string is TestcaseVzwPerformance_8_2_1_4_1_FDD2.
I'd like to extract the part 8_2_1_4_1 from the string and then replace the underscores _ with dots . So the expected result needs to be 8.2.1.4.1.
The numbers and length of the given string can be different.
For example,
Given string // Expected result
TestcaseVzwCqi_3_9_Test2 // 3.9
TestcaseVzwSvd1xRttAclr_6_6_2_3 // 6.6.2.3
TestcaseVzwCsiFading_9_4_1_1_1_FDD4 // 9.4.1.1.1
Here is my RegEx:
((?:\D{0,}_)(\d(_\d)*)(.*))
The numbered capturing group - $2 - contains 8_2_1_4_1 but with underscores.
Can I replace the underscores with dots?
It needs to be done in one RegEx and a Replace.

A regex by itself cannot modify text; you need a tool that performs the substitution, for example sed:
echo TestcaseVzwPerformance_8_2_1_4_1_FDD2 |
sed -E 's/^[^_]*_([0-9]+(_[0-9]+)*).*/\1/;s/_/./g'
8.2.1.4.1

If you have a Bash string, you can use a Bash regex to capture and Bash parameter expansions to replace:
$ s="TestcaseVzwSvd1xRttAclr_6_6_2_3"
$ [[ $s =~ ^[^_]*_([[:digit:]_]+)_* ]] && tmp=${BASH_REMATCH[1]//_/.} && echo "${tmp%.}"
6.6.2.3
Which can be in a loop:
while read -r line; do
if [[ $line =~ ^[^_]*_([[:digit:]_]+)_* ]]; then
tmp=${BASH_REMATCH[1]//_/.}
echo "\"$line\" => ${tmp%.}"
fi
done <<< 'Given string
TestcaseVzwCqi_3_9_Test2
TestcaseVzwSvd1xRttAclr_6_6_2_3
TestcaseVzwCsiFading_9_4_1_1_1_FDD4'
Prints:
"TestcaseVzwCqi_3_9_Test2" => 3.9
"TestcaseVzwSvd1xRttAclr_6_6_2_3" => 6.6.2.3
"TestcaseVzwCsiFading_9_4_1_1_1_FDD4" => 9.4.1.1.1
You can use the same loop to process a file.
If you have a file, you may as well use gawk:
$ gawk 'BEGIN{FPAT="_[[:digit:]_]+"}
/_[[:digit:]]/ {sub(/^_/,"", $1); sub(/_$/,"",$1); gsub(/_/,".",$1); print $1}' file
3.9
6.6.2.3
9.4.1.1.1
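The capture-then-substitute idea from the answers above can also be wrapped in a small Bash function. This is only a sketch: the name dotted_section is made up for illustration, and it assumes the wanted part is the first run of underscore-separated numbers that directly follows an underscore.

```bash
#!/usr/bin/env bash
# dotted_section is a hypothetical helper name, not part of any tool.
dotted_section() {
  # Match "_<digits>(_<digits>)*"; group 1 excludes the leading underscore.
  local re='_([0-9]+(_[0-9]+)*)'
  [[ $1 =~ $re ]] || return 1
  # Replace every underscore in the captured part with a dot.
  printf '%s\n' "${BASH_REMATCH[1]//_/.}"
}

dotted_section TestcaseVzwPerformance_8_2_1_4_1_FDD2
dotted_section TestcaseVzwCqi_3_9_Test2
```

Because the pattern requires digits on both sides of every inner underscore, no trailing underscore is captured and nothing has to be trimmed afterwards.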

Related

Extracting filenames in bash with regex

Can someone please help me set up a regular expression?
I have a large LaTeX3 TeXDoc file. LaTeX3 TeXDoc defines the macro \TestFiles{}, which lists the names of files that should be used as unit tests. You can name more than one file between the braces, so \TestFiles{foo-bar} and \TestFiles{foo-bar, bar+baz,foo_bar_baz} are both syntactically correct uses of this macro.
I would like to write a bash script that extracts all the unit test files named in the \TestFiles{} macros, compiles them with pdflatex, and checks whether pdflatex produces an output file successfully.
I have something like this in my script:
function get_filenames () {
## This regex works but is not sensible enough
# regex='\\TestFiles{(.*)}'
## This works also, but is again not precise enough
regex='\\TestFiles{([0-9a-zA-Z+-_, ]*)}'
## This should give more than one matching group
## (separated by ", " or ","), but this regex doesn't
## work. I have no idea why or how to modify, to get
## it working
while read -r line ; do
if [[ $line =~ $regex ]] ; then
i=1
while [ $i -le 3 ]; do
echo "Match $i: \"${BASH_REMATCH[$i]}\""
i=$(( i + 1 ))
done
echo
fi
done < mystyle.dtx
}
Here is an excerpt of the DTX file
\TestFiles{foo-bar}
\TestFiles{foo-bar, bar+baz,foo_bar_baz}
(You can store this as mystyle.dtx, in order to reproduce the next example.)
Using the above noted examples, my script gives me the following results:
get_filenames
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar, bar+baz,foo_bar_baz"
Match 2: ""
Match 3: ""
I wasn't able to modify my regular expression so that it splits the content of the last example, \TestFiles{foo-bar, bar+baz,foo_bar_baz}, into three matching results.
I tried a regular expression like this: regex='\\TestFiles{([[:alnum:]+-_]*)[,]+[ ]*}'. I thought [[:alnum:]+-_]* would match the filenames. As far as I understand regular expressions, the (...) should form a group that is then listed in the bash array BASH_REMATCH[$i].
The part [,]+ should reflect that file names must be separated by at least one comma. Between the filenames there might be some white space, so something like [[:space:]]* or at least [ ]* should represent this. The quantifier * means zero or more repetitions, while + means one or more.
But that regular expression did not work at all; it had no matching results.
How must regex be defined to store each filename as a matching group? I am looking for the regular expression that gives this result:
get_filenames
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar"
Match 2: "bar+baz"
Match 3: "foo_bar_baz"
EDIT: in my real-world files, there may be (and are) more than three test files.
Thanks in advance.
## This should give more than one matching group
regex='\\TestFiles{([0-9a-zA-Z+-_, ]*)}'
The element of BASH_REMATCH with index n is the portion of the string matching the nth parenthesized subexpression.
Your regex only has 1 "parenthesized subexpression" - which is why everything ends up in BASH_REMATCH[1]
$ regex='\\TestFiles{([0-9a-zA-Z+-_, ]*)}'
$ [[ $line =~ $regex ]]
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, bar+baz,foo_bar_baz}"
[1]="foo-bar, bar+baz,foo_bar_baz"
)
As you're trying to match an unknown number of filenames, you would have to create your regex dynamically so that it contains the required number of groups.
$ regex='\\TestFiles{([^, }]+)([,}] ?)'
$ [[ $line =~ $regex ]]
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, "
[1]="foo-bar"
[2]=", "
)
Add another group and see if it still matches:
$ regex+='([^, }]+)([,}] ?)'
$ [[ $line =~ $regex ]]
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, bar+baz,"
[1]="foo-bar"
[2]=", "
[3]="bar+baz"
[4]=","
)
You could keep looping until the regex no longer matches - or perhaps a simpler approach would be to count the number of , characters on the line.
regex='\\TestFiles{([^, }]+)([,}] ?)'
line='\TestFiles{foo-bar, bar+baz,foo_bar_baz}'
commas=${line//[!,]}
for ((i=0; i<${#commas}; i++))
do
regex+='([^, }]+)([,}] ?)'
done
[[ $line =~ $regex ]]
Which results in:
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, bar+baz,foo_bar_baz}"
[1]="foo-bar"
[2]=", "
[3]="bar+baz"
[4]=","
[5]="foo_bar_baz"
[6]="}"
)
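Instead of growing the regex, you could also keep two fixed patterns and peel one filename off per iteration. The following is a sketch under that assumption, not the approach shown above; list_test_files is a hypothetical name, and filenames are assumed to contain no commas, whitespace, or closing braces.

```bash
#!/usr/bin/env bash
# list_test_files is a hypothetical helper name used only for this sketch.
list_test_files() {
  # Everything between the braces of the first \TestFiles{...} on the line.
  local outer='\\TestFiles\{([^}]*)\}'
  # Group 1: next filename; group 2: whatever follows it (starts with a separator).
  local item='^[[:space:],]*([^,[:space:]]+)([,[:space:]].*)?$'
  local rest
  [[ $1 =~ $outer ]] || return 1
  rest=${BASH_REMATCH[1]}
  while [[ $rest =~ $item ]]; do
    printf '%s\n' "${BASH_REMATCH[1]}"
    rest=${BASH_REMATCH[2]}
  done
}

list_test_files '\TestFiles{foo-bar, bar+baz,foo_bar_baz}'
```

This prints one filename per line regardless of how many the macro lists, so the output can be read into an array or fed to a loop.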
Alternative approach using IFS
You can set IFS=', ' and have bash do the splitting for you.
line='\TestFiles{foo-bar, bar+baz,foo_bar_baz}'
[[ $line = \\TestFiles{* ]] && {
# Remove leading '\TestFiles{'
line=${line#*{}
# Remove trailing '}' (the brace must be escaped inside the expansion)
line=${line%\}}
IFS=', ' read -ra filenames <<< "$line"
declare -p filenames
}
declare -a filenames=([0]="foo-bar" [1]="bar+baz" [2]="foo_bar_baz")
Use set with IFS to split each line into new positional parameters. Assign $@ to an array so that elements can be accessed by index. Trying this with $@ directly results in a bad substitution error.
get-filenames.sh
#!/usr/bin/env bash
get_filenames() {
local IFS=' {},'
declare -a names
while read -r line; do
set -- $line
names=("$@")
test "${names[0]}" == '\TestFiles' && {
for i in {1..3}; do
printf "Match %i: \"%s\"\n" $i ${names[$i]}
done
}
echo
done < 'mystyle.dtx'
}
get_filenames
mystyle.dtx
\TestFiles{foo-bar}
\TestFiles{foo-bar, bar+baz,foo_bar_baz}
output
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar"
Match 2: "bar+baz"
Match 3: "foo_bar_baz"
EDIT (without external programs, though it's rather impractical, and tied to exactly three matches)
function get_filenames () {
p='([^, }]*) *,? *'
regex="\\\\TestFiles\{$p$p$p"
while read -r line ; do
if [[ $line =~ $regex ]] ; then
i=1
while [ $i -le 3 ]; do
echo "Match $i: \"${BASH_REMATCH[$i]}\""
i=$(( i + 1 ))
done
echo
fi
done < mystyle.dtx
}
If you really need to output exactly three file names (even empty) for each '\TestFiles' row then here's the code.
function get_filenames () {
MAX_FILES_CNT=3
IFS=$'\n'
for line in $(grep -oP '\\TestFiles\{\K[^}]*' < mystyle.dtx); do
filenames=()
for filename in $(grep -m $MAX_FILES_CNT -oP "[^, ]+" <<< "$line"); do
filenames+=("$filename")
done
i=0
while [ $i -lt $MAX_FILES_CNT ]; do
echo "Match $(($i+1)): \"${filenames[i]}\""
i=$(( i + 1 ))
done
echo ""
done
unset IFS
}
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar"
Match 2: "bar+baz"
Match 3: "foo_bar_baz"
By the way, BASH_REMATCH is not well suited to this task, because a repeated group captures only its last repetition. Look:
[[ "asdf" =~ (.)* ]]
echo "${BASH_REMATCH[@]}"
asdf f
Also I would recommend to read this question https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice
Suggesting an awk script that would do the trick on one or more files.
get_filenames.awk
/\\TestFiles{[^}]*}/ { # handle only lines matching regex filter
filesCount = split($0, fileNamesArr, "\\\\TestFiles{[ ]*|[ ]*,[ ]*|[ ]*}"); # parse line to array fileNamesArr
for (i = 2; i < filesCount; i++) { # read elements 2 --> filesCount - 1
printf("Match %d in %s: \"%s\"\n", i - 1, FILENAME, fileNamesArr[i]); # format print fileNames
}
print"";
}
test file: input.1.txt
some text line 1
\TestFiles{foo-bar0}
some text \TestFiles{foo-bar1, bar+baz1, foo_bar_baz1}
some text \TestFiles{foo-bar2 ,bar+baz2 ,foo_bar_baz2 }
some text \TestFiles{ foo-bar3 , bar+baz3 , foo_bar_baz3 } some text
line 4
test file: input.2.txt
\TestFiles{file10, file11}
text
text \TestFiles{ file20 } some text
text\TestFiles{file30,file31,file32 }text
text
testing get_filenames.awk
awk -f get_filenames.awk input.1.txt input.2.txt
Match 1 in input.1.txt: "foo-bar0"
Match 1 in input.1.txt: "foo-bar1"
Match 2 in input.1.txt: "bar+baz1"
Match 3 in input.1.txt: "foo_bar_baz1"
Match 1 in input.1.txt: "foo-bar2"
Match 2 in input.1.txt: "bar+baz2"
Match 3 in input.1.txt: "foo_bar_baz2"
Match 1 in input.1.txt: "foo-bar3"
Match 2 in input.1.txt: "bar+baz3"
Match 3 in input.1.txt: "foo_bar_baz3"
Match 1 in input.2.txt: "file10"
Match 2 in input.2.txt: "file11"
Match 1 in input.2.txt: "file20"
Match 1 in input.2.txt: "file30"
Match 2 in input.2.txt: "file31"
Match 3 in input.2.txt: "file32"
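I believe this is the regular expression you're looking for (note that it relies on a variable-length lookbehind, so it needs an engine that supports one; Bash's =~ operator does not):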
(?<=\\TestFiles{.*)([\w\d\-\+_]+)[, }]+
You can see it working, modify it and have an explanation on what it does in the following link: https://regex101.com/r/0W8PBi/1

Regex/Shell - how to match all except those with specific pattern

I need a regex in shell to match all strings except those with specific pattern.
My specific pattern can be variable, i.e. (i|I)[2 digits numbers](u|U)[2 digits numbers] in every string should not match.
For example :
Some.text.1234.text => should match
Some.text.1234.i10u20.text => shouldn't match
Some.text.1234.I01U02.text => shouldn't match
Some.text.1234.i83U23.text => shouldn't match
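You can try this (it uses a lookahead, so it needs a PCRE-capable tool such as grep -P):
^(?!.*[iI]\d{2}[uU]\d{2}).*$
Demo
Explanation:
^ start of a line
(?!...) negative lookahead: the match fails if the enclosed pattern can match from here
.*[iI]\d{2}[uU]\d{2} anywhere in the string, i or I followed by two digits, then u or U followed by two digits
.*$ if the lookahead did not fail, match the entire string up to its end $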
The Bash script checking if a string matches a regex or not can look like
f='It_is_your_string_to_check';
if [[ "${f^^}" =~ I[0-9]{2}U[0-9]{2} ]]; then
echo "$f is invalid";
else
echo "$f is valid"
fi;
Here, "${f^^}" turns the string into uppercase (so as not to use (U|u) and (I|i)), and then =~ operator triggers a regex check here since the pattern on the right side is not quoted. You may play it safe and define the regex pattern with a separate single-quoted string variable and use
rx='I[0-9]{2}U[0-9]{2}'
if [[ "${f^^}" =~ $rx ]]; then ...
See a Bash demo online:
s='Some.text.1234.text
Some.text.1234.i10u20.text
Some.text.1234.I01U02.text
Some.text.1234.i83U23.text'
for f in $s; do
if [[ "${f^^}" =~ I[0-9]{2}U[0-9]{2} ]]; then
echo "$f is invalid";
else
echo "$f is valid"
fi;
done;
Output:
Some.text.1234.text is valid
Some.text.1234.i10u20.text is invalid
Some.text.1234.I01U02.text is invalid
Some.text.1234.i83U23.text is invalid
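When the strings arrive one per line (from a file or a pipe), inverting the match sidesteps the lookahead entirely. A minimal sketch, assuming one string per line; keep_valid is a hypothetical name:

```bash
#!/usr/bin/env bash
# keep_valid is a hypothetical helper name for this sketch.
keep_valid() {
  # -E: extended regex, -i: ignore case, -v: print only the lines that do NOT match.
  grep -Eiv 'i[0-9]{2}u[0-9]{2}'
}

printf '%s\n' Some.text.1234.text Some.text.1234.i10u20.text Some.text.1234.I01U02.text | keep_valid
```

Only the lines without the i##u## pattern are printed. Note that grep exits with status 1 when every line is filtered out.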

is there any named regular expression capture for grep?

I'd like to know if it's possible to get named regular expression captures with grep -P (Linux, bash) from an unformatted string, or really from any string.
For example:
John Smith www.website.com john@website.com jan-01-2001
to capture as
$name
$website
$email
$date
but it seems I can't pass any variables from the output?
echo "www.website.com" | grep -Po '^(www\.)?(?<domain>.+)$' | echo $domain
has no output
No. grep runs as a separate process, and what you are describing is propagating the environment from a child to its parent, which is not possible.
Instead, you can do
DATA=($your_line)
then take name=${DATA[0]} and so forth.
or another way using awk:
eval "`echo $your_line | awk '
function escape(s)
{
gsub(/'\''/,"'\''\"'\''\"'\''", s);
s = "'\''"s"'\''";
return s;
}
{
print "name="escape($1);
print "family_name="escape($2);
print "website="escape($3);
print "email="escape($4);
print "date="escape($5);
}'`"
The idea here is to pass the information via stdout and eval it in the parent environment.
Note that the escape function quotes any string correctly, so nothing is interpreted wrongly (the usual danger with quotes).
The following is the output on my jessie system:
name='John'
family_name='Smith'
website='www.website.com'
email='john@website.com'
date='jan-01-2001'
If the family name is O'Reilly, the eval result will still be correct:
name='John'
family_name='O'"'"'Reilly'
website='www.website.com'
email='john@website.com'
date='jan-01-2001'
Grep is an independent command-line utility; it does not run inside of bash. So it couldn't create bash variables even if it wanted to.
However, bash has a regular expression matcher built in. It's not a Perl-compatible regex matcher, so it doesn't implement named captures. (To be precise, it matches POSIX extended regular expressions, the same as grep -E.) But it does implement numbered captures.
You do regular expression matches with the =~ operator inside of the [[ ... ]] compound command syntax. If the regular expression matches, then the expression succeeds, and the captures are inserted into the array variable BASH_REMATCH. ${BASH_REMATCH[0]} will be the entire matched substring, and the remaining elements, starting with ${BASH_REMATCH[1]}, will be the individual captures in order.
For example:
$ url=www.example.com
$ [[ $url =~ ^(www\.)?(.*) ]]
$ echo "${BASH_REMATCH[1]}"
www.
$ echo "${BASH_REMATCH[2]}"
example.com
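Applied to the sample line from the question, numbered captures can simply be copied into named variables afterwards. This is a sketch that assumes exactly five space-separated fields, with the first two forming the name; the variable names mirror the ones asked for.

```bash
#!/usr/bin/env bash
line='John Smith www.website.com john@website.com jan-01-2001'
# Four groups: "first last", website, email, date (fields contain no spaces).
re='^([^ ]+ [^ ]+) ([^ ]+) ([^ ]+) ([^ ]+)$'
if [[ $line =~ $re ]]; then
  name=${BASH_REMATCH[1]}
  website=${BASH_REMATCH[2]}
  email=${BASH_REMATCH[3]}
  date=${BASH_REMATCH[4]}
fi
printf '%s\n' "$name" "$website" "$email" "$date"
```

Since the match runs inside the current shell, the variables remain available after the test, which is exactly what a pipeline through grep cannot give you.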

Bash regex to match substring with exact integer range

I need to match a string $str that contains any of
foo{77..93}
and capture the above substring in a variable.
So far I've got:
str=/random/string/containing/abc-foo78_efg/ # for example
if [[ $str =~ (foo[7-9][0-9]) ]]; then
id=${BASH_REMATCH[1]}
fi
echo $id # gives foo78
but this also captures ids outside of the target range (e.g. foo95).
Is there a way to restrict the regex to an exact integer range? (I tried foo[77-93], but that doesn't work.)
Thanks
If you want to use a regex, you're going to have to make it slightly more complex:
if [[ $str =~ foo(7[7-9]|8[0-9]|9[0-3]) ]]; then
id=${BASH_REMATCH[0]}
fi
Note that I have removed the capture group around the whole pattern and am now using the 0th element of the match array.
As an aside, for maximum compatibility with older versions of bash, I would recommend assigning the pattern to a variable and using in the test like this:
re='foo(7[7-9]|8[0-9]|9[0-3])'
if [[ $str =~ $re ]]; then
id=${BASH_REMATCH[0]}
fi
An alternative to using a regex would be to use an arithmetic context, like this:
if (( "${str#foo}" >= 77 && "${str#foo}" <= 93 )); then
id=$str
fi
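This strips the "foo" part from the start of the variable so that the integer part can be compared numerically. It only works when $str holds nothing but foo followed by the number (for example foo78), not a longer string such as the path in the question.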
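The two ideas can be combined: let the regex pull the digits out of a longer string, then do the range check arithmetically. A sketch, with foo_id_in_range as a hypothetical name; it only examines the first foo<digits> occurrence in the string.

```bash
#!/usr/bin/env bash
# foo_id_in_range is a hypothetical helper name for this sketch.
foo_id_in_range() {
  local re='foo([0-9]+)'
  [[ $1 =~ $re ]] || return 1
  # 10# forces base 10 so that a value with a leading zero is not read as octal.
  (( 10#${BASH_REMATCH[1]} >= 77 && 10#${BASH_REMATCH[1]} <= 93 )) || return 1
  # Print the matched id, e.g. foo78.
  printf '%s\n' "${BASH_REMATCH[0]}"
}

foo_id_in_range /random/string/containing/abc-foo78_efg/
```

Unlike the pure regex version, this takes the whole digit run into account, so a string containing foo780 is rejected rather than matched as foo78.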
Sure is easy to do with Perl:
$ echo foo{1..100} | tr ' ' '\n' | perl -lne 'print $_ if m/foo(\d+)/ and $1>=77 and $1<=93'
foo77
foo78
foo79
foo80
foo81
foo82
foo83
foo84
foo85
foo86
foo87
foo88
foo89
foo90
foo91
foo92
foo93
Or awk even:
$ echo foo{1..100} | tr ' ' '\n' | awk -F 'foo' '$2>=77 && $2<=93 {print}'
foo77
foo78
foo79
foo80
foo81
foo82
foo83
foo84
foo85
foo86
foo87
foo88
foo89
foo90
foo91
foo92
foo93

Matching optional parameters with non-capturing groups in Bash regular expression

I want to parse strings similar to the following into separate variables using regular expressions from within Bash:
Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";
or
Category: resource;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Resource";rel="http://schemas.ogf.org/occi/core#entity";attributes="occi.core.summary";
The first part before "title" is common to all strings, the parts title and attributes are optional.
I managed to extract the mandatory parameters common to all strings, but I have trouble with optional parameters not necessarily present for all strings. As far as I found out, Bash doesn't support Non-capturing parentheses which I would use for this purpose.
Here is what I achieved thus far:
CATEGORY_REGEX='Category:\s*([^;]*);scheme="([^"]*)";class="([^"]*)";'
category_string='Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";'
[[ $category_string =~ $CATEGORY_REGEX ]]
echo ${BASH_REMATCH[0]}
echo ${BASH_REMATCH[1]}
echo ${BASH_REMATCH[2]}
echo ${BASH_REMATCH[3]}
The regular expression I would like to use (and which is working for me in Ruby) would be:
CATEGORY_REGEX='Category:\s*([^;]*);\s*scheme="([^"]*)";\s*class="([^"]*)";\s*(?:title="([^"]*)";)?\s*(?:rel="([^"]*)";)?\s*(?:location="([^"]*)";)?\s*(?:attributes="([^"]*)";)?\s*(?:actions="([^"]*)";)?'
Is there any other solution to parse the string with command line tools without having to fall back on perl, python or ruby?
I don't think non-capturing groups exist in bash regex, so your options are to use a scripting language or to remove the ?: from all of the (?:...) groups and just be careful about which groups you reference, for example:
CATEGORY_REGEX='Category:\s*([^;]*);\s*scheme="([^"]*)";\s*class="([^"]*)";\s*(title="([^"]*)";)?\s*(rel="([^"]*)";)?\s*(location="([^"]*)";)?\s*(attributes="([^"]*)";)?\s*(actions="([^"]*)";)?'
category_string='Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";'
[[ $category_string =~ $CATEGORY_REGEX ]]
echo "full: ${BASH_REMATCH[0]}"
echo "category: ${BASH_REMATCH[1]}"
echo "scheme: ${BASH_REMATCH[2]}"
echo "class: ${BASH_REMATCH[3]}"
echo "title: ${BASH_REMATCH[5]}"
echo "rel: ${BASH_REMATCH[7]}"
echo "location: ${BASH_REMATCH[9]}"
echo "attributes: ${BASH_REMATCH[11]}"
echo "actions: ${BASH_REMATCH[13]}"
Note that starting with the optional parameters we need to skip a group each time, because the even numbered groups from 4 on contain the parameter name as well as the value (if the parameter is present).
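If the set or order of optional parameters is not fixed, another option is to consume one key="value"; pair per iteration instead of encoding every parameter in one long regex. This is only a sketch: parse_category is a hypothetical name, and it assumes values never contain a double quote and the category term contains no whitespace.

```bash
#!/usr/bin/env bash
# parse_category is a hypothetical helper name for this sketch.
parse_category() {
  # Group 1: the category term; group 2: the remaining parameter list.
  local head='^Category:[[:space:]]*([^;[:space:]]+);(.*)$'
  # Group 1: key; group 2: quoted value; group 3: what is left after the pair.
  local pair='^[[:space:]]*([[:alnum:]_]+)="([^"]*)";(.*)$'
  local rest
  [[ $1 =~ $head ]] || return 1
  printf 'category=%s\n' "${BASH_REMATCH[1]}"
  rest=${BASH_REMATCH[2]}
  while [[ $rest =~ $pair ]]; do
    printf '%s=%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    rest=${BASH_REMATCH[3]}
  done
}

parse_category 'Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";'
```

Each parameter that is present is printed as key=value, and absent optional parameters simply produce no line, so there are no group indices to keep track of.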
You can emulate non-capturing groups in bash using a little bit of regexp magic:
#             _2__    _4__   _5__
[[ "fu#k" =~ ((.+)#|)((.+)/|)(.+) ]];
echo "${BASH_REMATCH[2]:--} ${BASH_REMATCH[4]:--} ${BASH_REMATCH[5]:--}"
# Output: fu - k
Characters # and / are parts of string we parse.
Regexp pipe | is used for either left or right (empty) part matching.
For the curious, ${VAR:-<default value>} is a parameter expansion that substitutes the default value when $VAR is unset or empty.