bash regex to parse text of the form +incdir+<dir1>+<dir2> - regex

I have an input string of the form +incdir+<dir1>+<dir2>, where <dir1> and <dir2> are directory names. I want to parse this using a bash regex and have the values of the directories inside BASH_REMATCH[1], [2], ...
Here is what I tried:
function match {
if [[ "$1" =~ \+incdir(\+.*)+ ]]; then
for i in $(seq $(expr ${#BASH_REMATCH[#]} - 1)); do
echo $i ":" ${BASH_REMATCH[$i]}
done
else
echo "no match"
fi
}
This works for match +incdir+foo, but doesn't for match +incdir+foo+bar, because it does greedy matching and it outputs +foo+bar. There isn't any non-greedy matching in bash as regex in bash expression mentions so I tried the following for the pattern: \+incdir(\+[^+]*)+ but this just gives me +bar.
The way I would interpret the regex is the following: find the beginning +incdir, then match me at least one group starting with a + followed by as many characters as you can find that are not +. When you hit a + this is the start of the next group. I guess my reasoning is incorrect.
Does anyone have any idea what I'm doing wrong?

Using only bash builtins (but NOT regular expressions, which are the wrong tool for this job):
match() {
[[ $1 = *+incdir+* ]] || return # noop if no +incdir present
IFS=+ read -r -a pieces <<<"${1#*+incdir+}" # read everything after +incdir+
# into +-separated array
for idx in "${!pieces[#]}"; do # iterate over keys in array
echo "$idx: ${pieces[$idx]}" # ...and emit key/value pairs
done
}
$ match "yadda yadda +incdir+foo+bar+baz"
0: foo
1: bar
2: baz

Related

bash IF not matching variable that contains regex numbers

DPHPV = /usr/local/nginx/conf/php81-remi.conf;
I am unable to figure out how to match a string that contains any 2 digits:
if [[ "$DPHPV" =~ *"php[:digit:][:digit:]-remi.conf"* ]]
You are not using the right regex here as * is a quantifier in regex, not a placeholder for any text.
Actually, you do not need a regex, you may use a mere glob pattern like
if [[ "$DPHPV" == *php[[:digit:]][[:digit:]]-remi.conf ]]
Note
== - enables glob matching
*php[[:digit:]][[:digit:]]-remi.conf - matches any text with *, then matches php, then two digits (note that the POSIX character classes must be used inside bracket expressions), and then -rem.conf at the end of string.
See the online demo:
#!/bin/bash
DPHPV='/usr/local/nginx/conf/php81-remi.conf'
if [[ "$DPHPV" == *php[[:digit:]][[:digit:]]-remi.conf ]]; then
echo yes;
else
echo no;
fi
Output: yes.

Extracting filesnames in bash with regex

Can please someone help me to set up a regular expression.
I have a large LaTeX3 TeXDoc file. LaTeX3 TeXDoc defines the macro \TestFiles{}, which should be used, to list the names of files, which itself should be used as an unit tests. You can name more than one file between the braces. So \TestFiles{foo-bar} and \TestFiles{foo-bar, bar+baz,foo_bar_baz} are syntactical correct use cases for this macro.
I would like to write a bash script, to extract all the uni test files, named in the \TestFiles{} macros, to compile them with pdflatex and check, if pdflatex will be able to produce an output file successfully.
I have something like this in my script:
function get_filenames () {
## This regex works but is not sensible enough
# regex='\\TestFiles{(.*)}'
## This works also, but is again not precise enough
regex='\\TestFiles{([0-9a-zA-Z+-_, ]*)}'
## This should give more than one matching group
## (separated by ", " or ","), but this regex doesn't
## work. I have no idea why or how to modify, to get
## it working
while read -r line ; do
if [[ $line =~ $regex ]] ; then
i=1
while [ $i -le 3 ]; do
echo "Match $i: \"${BASH_REMATCH[$i]}\""
i=$(( i + 1 ))
done
echo
fi
done < mystyle.dtx
}
Here is an excerpt of the DTX file
\TestFiles{foo-bar}
\TestFiles{foo-bar, bar+baz,foo_bar_baz}
(You can store this as mystyle.dtx, in order to reproduce the next example.)
Using the above noted examples, my script gives me the following results:
get_filenames
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar, bar+baz,foo_bar_baz"
Match 2: ""
Match 3: ""
I wasn't able, to modify my regex expression, to split the content of the last \TestFiles{foo-bar, bar+baz,foo_bar_baz} example into three matching results.
I tried a regular expression like this regex='\\TestFiles{([[:alnum:]+-_]*)[,]+[ ]*}'. I thought the [:alnum:]+-_]* would match the filenames. As far as I understand regular expressions, the (...) should form a group, that should be listed afterwards in the bash array BASH_REMATCH[$i].
The part [,]+ should reflect that every file name must be separated by at least one comma. Between the filenames there might be some white space, so something like [[:space:]]* or at least [ ]* should represent this. The quantifier * means any repetition, ranging from 0 to ..., while + should at least appear one or more times.
But that regular expression did not work at all, if had no matching results.
How must regex be defined, to store each filenames as a matching group? I am searching for the correct regular expression, to get this result:
get_filenames
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar"
Match 2: "bar+baz"
Match 3: "foo_bar_baz"
EDIT: in my real world files, there may be (and are) more than tree test files.
Thanks in advance.
## This should give more than one matching group
regex='\\TestFiles{([0-9a-zA-Z+-_, ]*)}'
The element of BASH_REMATCH with index n is the portion of the string matching the nth parenthesized subexpression.
Your regex only has 1 "parenthesized subexpression" - which is why everything ends up in BASH_REMATCH[1]
$ regex='\\TestFiles{([0-9a-zA-Z+-_, ]*)}'
$ [[ $line =~ $regex ]]
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, bar+baz,foo_bar_baz}"
[1]="foo-bar, bar+baz,foo_bar_baz"
)
As you're trying to match an unknown number of filenames you would have to "dynamically" create your regex so it contains the needed amount of groups.
$ regex='\\TestFiles{([^, }]+)([,}] ?)'
$ [[ $line =~ $regex ]]
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, "
[1]="foo-bar"
[2]=", "
)
Add another group and see if it still matches:
$ regex+='([^, }]+)([,}] ?)'
$ [[ $line =~ $regex ]]
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, bar+baz,"
[1]="foo-bar"
[2]=", "
[3]="bar+baz"
[4]=","
)
You could keep looping until the regex no longer matches - or perhaps a simpler approach would be to count the number of , characters on the line.
regex='\\TestFiles{([^, }]+)([,}] ?)'
line='\TestFiles{foo-bar, bar+baz,foo_bar_baz}'
commas=${line//[!,]}
for ((i=0; i<${#commas}; i++))
do
regex+='([^, }]+)([,}] ?)'
done
[[ $line =~ $regex ]]
Which results in:
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, bar+baz,foo_bar_baz}"
[1]="foo-bar"
[2]=", "
[3]="bar+baz"
[4]=","
[5]="foo_bar_baz"
[6]="}"
)
Alternative approach using IFS
You can set IFS=', ' and have bash do the splitting for you.
line='\TestFiles{foo-bar, bar+baz,foo_bar_baz}'
[[ $line = \\TestFiles{* ]] && {
# Remove leading '\Testfiles{'
# Remove trailing }
line=${line#*{}
line=${line%}}
IFS=', ' read -a filenames <<< "$line"
declare -p filenames
}
declare -a filenames=([0]="foo-bar" [1]="bar+baz" [2]="foo_bar_baz}}")
Use set with IFS to split each line into new positional parameters. Assign $# to an array so that elements can be accessed by index. Trying this with $# directly results in a bad substitution error.
get-filenames.sh
#!/usr/bin/env bash
get_filenames() {
local IFS=' {},'
declare -a names
while read -r line; do
set -- $line
names=($#)
test "${names[0]}" == '\TestFiles' && {
for i in {1..3}; do
printf "Match %i: \"%s\"\n" $i ${names[$i]}
done
}
echo
done < 'mystyle.dtx'
}
get_filenames
mystyle.dtx
\TestFiles{foo-bar}
\TestFiles{foo-bar, bar+baz,foo_bar_baz}
output
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar"
Match 2: "bar+baz"
Match 3: "foo_bar_baz"
EDIT (without external programs, though it's rather impractical, and tied to exactly three matches)
function get_filenames () {
p='([^, }]*) *,? *'
regex="\\TestFiles\{$p$p$p"
while read -r line ; do
if [[ $line =~ $regex ]] ; then
i=1
while [ $i -le 3 ]; do
echo "Match $i: \"${BASH_REMATCH[$i]}\""
i=$(( i + 1 ))
done
echo
fi
done < mystyle.dtx
}
If you really need to output exactly three file names (even empty) for each '\TestFiles' row then here's the code.
function get_filenames () {
MAX_FILES_CNT=3
IFS=$'\n'
for line in $(grep -oP '\\TestFiles\{\K[^}]*' < mystyle.dtx); do
filenames=()
for filename in $(grep -m $MAX_FILES_CNT -oP "[^, ]+" <<< "$line"); do
filenames+=("$filename")
done
i=0
while [ $i -lt $MAX_FILES_CNT ]; do
echo "Match $(($i+1)): \"${filenames[i]}\""
i=$(( i + 1 ))
done
echo ""
done
unset IFS
}
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar"
Match 2: "bar+baz"
Match 3: "foo_bar_baz"
By the way, BASH_REMATCH is no good for this task, cause it captures only last rematch. Look
[[ "asdf" =~ (.)* ]]
echo "${BASH_REMATCH[#]}"
asdf f
Also I would recommend to read this question https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice
Suggesting an awk script that would do the trick on one or more files.
get_filenames.awk
/\\TestFiles{[^}]*}/ { # handle only lines matching regex filter
filesCount = split($0, fileNamesArr, "\\\\TestFiles{[ ]*|[ ]*,[ ]*|[ ]*}"); # parse line to array fileNamesArr
for (i = 2; i < filesCount; i++) { # read elements 2 --> filesCount - 1
printf("Match %d in %s: \"%s\"\n", i - 1, FILENAME, fileNamesArr[i]); # format print fileNames
}
print"";
}
test file: input.1.txt
some text line 1
\TestFiles{foo-bar0}
some text \TestFiles{foo-bar1, bar+baz1, foo_bar_baz1}
some text \TestFiles{foo-bar2 ,bar+baz2 ,foo_bar_baz2 }
some text \TestFiles{ foo-bar3 , bar+baz3 , foo_bar_baz3 } some text
line 4
test file: input.2.txt
\TestFiles{file10, file11}
text
text \TestFiles{ file20 } some text
text\TestFiles{file30,file31,file32 }text
text
testingget_filenames.awk
awk -f get_filenames.awk input.1.txt input.2.txt
Match 1 in input.1.txt: "foo-bar0"
Match 1 in input.1.txt: "foo-bar1"
Match 2 in input.1.txt: "bar+baz1"
Match 3 in input.1.txt: "foo_bar_baz1"
Match 1 in input.1.txt: "foo-bar2"
Match 2 in input.1.txt: "bar+baz2"
Match 3 in input.1.txt: "foo_bar_baz2"
Match 1 in input.1.txt: "foo-bar3"
Match 2 in input.1.txt: "bar+baz3"
Match 3 in input.1.txt: "foo_bar_baz3"
Match 1 in input.2.txt: "file10"
Match 2 in input.2.txt: "file11"
Match 1 in input.2.txt: "file20"
Match 1 in input.2.txt: "file30"
Match 2 in input.2.txt: "file31"
Match 3 in input.2.txt: "file32"
I believe this is the regular expression you're looking for:
(?<=\\TestFiles{.*)([\w\d\-\+_]+)[, }]+
You can see it working, modify it and have an explanation on what it does in the following link: https://regex101.com/r/0W8PBi/1

Regex for matching directory path depth

I'm trying to regex match a specific folder depth of varying path strings using bash scripts.
I want to match two levels down from packages eg. /packages/[any-folder-name]/[any-folder-name]/.
So for example for /packages/frontend/react-app/src/index.ts I want to match /packages/frontend/react-app/ and store it in an array
array=()
string="/packages/frontend/react-app/src/index.ts"
[[ $string =~ packages/.*/.*/ ]] && array+=(${BASH_REMATCH[0]}
almost works, but it returns /packages/frontend/react-app/src/
I've been going round in circles on this for a few hours now.
Probably this:
#!/usr/bin/env bash
array=()
string="/packages/frontend/react-app/src/index.ts"
[[ $string =~ packages/([^/]+/){2} ]] && array+=("${BASH_REMATCH[0]}")
Explanation:
[^/]+ match any non-empty string that does not contain a /.
$ echo '/packages/frontend/react-app/src/index.ts' | sed 's|^\(/packages/[^/]*/[^/]*/\).*$|\1|'
/packages/frontend/react-app/
Explanation:
use sed regex:
|^...$| - match the whole string, anchor at beginning and end
^\(...\) - capture stuff inside parenthesis
/packages/ - expect this text
[^/]*/ - followed by anything non-slash, followed by a slash
[^/]*/ - rinse and repeat
.* - discard anything after the captured text
|\1| - replace matched text with the captured text
Looks like a glob expression would be enough.
# enable nullglob to get an empty array if there is no match
shopt -s nullglob
array=(/packages/*/*/)
echo ${array[*]}

why this shell script could not work?

My script like this:
#!/bin/env bash
monitor_sock_raw1=socket,id=hmqmondev,port=55919,host=127.0.0.1,nodelay,server,nowait
msock=${monitor_sock_raw1##,port=}
msock=${msock%%,host=}
echo $msock
I expect get '55919', but the result is:
socket,id=hmqmondev,port=55919,host=127.0.0.1,nodelay,server,nowait
Why and how to fix this bug?
For a simple requirement like this, bash supports a regex (See bash ERE support) approach using the ~ operator which you can use it to match the port string and match the digits after it.
#!/bin/env bash
var='monitor_sock_raw1=socket,id=hmqmondev,port=55919,host=127.0.0.1,nodelay'
if [[ $var =~ ^.*port=([[:digit:]]+).*$ ]]; then
printf "%s\n" "${BASH_REMATCH[1]}"
fi
The captured group from the regex is stored in the array BASH_REMATCH from which the first element after index 0 i.e. index 1 contains the value of 1st captured group.
RegEx Demo
You need to add wildcards or the patterns wont match. The pattern needs to match the whole start or end of the text.
msock=${monitor_sock_raw1##*,port=}
msock=${msock%%,host=*}
Script that solves your problem.
#!/bin/bash
monitor_sock_raw1="socket,id=hmqmondev,port=55919,host=127.0.0.1,nodelay,server,nowait"
msock=(${monitor_sock_raw1##*port=})
echo ${msock%%,*}

RegEx : How can I extract a certain part and modify it?

I'd like to extract a certain part of a string and modify it by using a regular expression.
A given string is TestcaseVzwPerformance_8_2_1_4_1_FDD2.
I'd like to extract the part 8_2_1_4_1 from the string and then replace the underscores _ with dots . So the expected result needs to be 8.2.1.4.1.
The numbers and length of the given string can be different.
For example,
Given string // Expected result
TestcaseVzwCqi_3_9_Test2 // 3.9
TestcaseVzwSvd1xRttAclr_6_6_2_3 // 6.6.2.3
TestcaseVzwCsiFading_9_4_1_1_1_FDD4 // 9.4.1.1.1
Here is my RegEx:
((?:\D{0,}_)(\d(_\d)*)(.*))
The numbered capturing group - $2 - contains 8_2_1_4_1 but with underscores.
Can I replace the underscores with dots?
It needs to be done in one RegEx and a Replace.
regex cannot modify, for example with sed
echo TestcaseVzwPerformance_8_2_1_4_1_FDD2 |
sed -E 's/[^_]*_(([_0-9])+)_.*/\1/;s/_/./g'
8.2.1.4.1
If you have a Bash string, you can use a Bash regex to capture and Bash parameter expansions to replace:
$ s="TestcaseVzwSvd1xRttAclr_6_6_2_3"
$ [[ $s =~ ^[^_]*_([[:digit:]_]+)_* ]] && tmp=${BASH_REMATCH[1]//_/.} && echo "${tmp%.}"
6.6.2.3
Which can be in a loop:
while read -r line; do
if [[ $line =~ ^[^_]*_([[:digit:]_]+)_* ]]; then
tmp=${BASH_REMATCH[1]//_/.}
echo "\"$line\" => ${tmp%.}"
fi
done <<< 'Given string
TestcaseVzwCqi_3_9_Test2
TestcaseVzwSvd1xRttAclr_6_6_2_3
TestcaseVzwCsiFading_9_4_1_1_1_FDD4'
Prints:
"TestcaseVzwCqi_3_9_Test2" => 3.9
"TestcaseVzwSvd1xRttAclr_6_6_2_3" => 6.6.2.3
"TestcaseVzwCsiFading_9_4_1_1_1_FDD4" => 9.4.1.1.1
You can use the same loop to process a file.
If you have a file, you may as well use gawk:
$ awk 'BEGIN{FPAT="_[[:digit:]_]+"}
/_[[:digit:]]/ {sub(/^_/,"", $1); sub(/_$/,"",$1); gsub(/_/,".",$1); print $1}' file
3.9
6.6.2.3
9.4.1.1.1