Could someone please help me set up a regular expression?
I have a large LaTeX3 TeXDoc file. LaTeX3 TeXDoc defines the macro \TestFiles{}, which lists the names of files that should be used as unit tests. You can name more than one file between the braces, so \TestFiles{foo-bar} and \TestFiles{foo-bar, bar+baz,foo_bar_baz} are both syntactically correct uses of this macro.
I would like to write a bash script that extracts all the unit test files named in the \TestFiles{} macros, compiles them with pdflatex, and checks whether pdflatex produces an output file successfully.
I have something like this in my script:
function get_filenames () {
## This regex works but is not sensible enough
# regex='\\TestFiles{(.*)}'
## This works also, but is again not precise enough
regex='\\TestFiles{([0-9a-zA-Z+-_, ]*)}'
## This should give more than one matching group
## (separated by ", " or ","), but this regex doesn't
## work. I have no idea why or how to modify, to get
## it working
while read -r line ; do
if [[ $line =~ $regex ]] ; then
i=1
while [ $i -le 3 ]; do
echo "Match $i: \"${BASH_REMATCH[$i]}\""
i=$(( i + 1 ))
done
echo
fi
done < mystyle.dtx
}
Here is an excerpt of the DTX file
\TestFiles{foo-bar}
\TestFiles{foo-bar, bar+baz,foo_bar_baz}
(You can store this as mystyle.dtx, in order to reproduce the next example.)
Using the above noted examples, my script gives me the following results:
get_filenames
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar, bar+baz,foo_bar_baz"
Match 2: ""
Match 3: ""
I wasn't able to modify my regular expression so that it splits the content of the last example, \TestFiles{foo-bar, bar+baz,foo_bar_baz}, into three matches.
I tried a regular expression like regex='\\TestFiles{([[:alnum:]+-_]*)[,]+[ ]*}'. I thought [[:alnum:]+-_]* would match the filenames. As far as I understand regular expressions, (...) forms a group, which should afterwards be available in the bash array BASH_REMATCH[$i].
The part [,]+ should reflect that filenames must be separated by at least one comma. Between the filenames there might be some white space, so something like [[:space:]]* or at least [ ]* should represent this. The quantifier * means zero or more repetitions, while + means one or more.
But that regular expression did not work at all; it had no matches.
How must regex be defined to store each filename as a separate matching group? I am looking for the regular expression that gives this result:
get_filenames
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar"
Match 2: "bar+baz"
Match 3: "foo_bar_baz"
EDIT: in my real-world files, there may be (and are) more than three test files.
Thanks in advance.
## This should give more than one matching group
regex='\\TestFiles{([0-9a-zA-Z+-_, ]*)}'
The element of BASH_REMATCH with index n is the portion of the string matching the nth parenthesized subexpression.
Your regex only has 1 "parenthesized subexpression" - which is why everything ends up in BASH_REMATCH[1]
$ regex='\\TestFiles{([0-9a-zA-Z+-_, ]*)}'
$ [[ $line =~ $regex ]]
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, bar+baz,foo_bar_baz}"
[1]="foo-bar, bar+baz,foo_bar_baz"
)
As you're trying to match an unknown number of filenames, you would have to create your regex "dynamically" so that it contains the needed number of groups.
$ regex='\\TestFiles{([^, }]+)([,}] ?)'
$ [[ $line =~ $regex ]]
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, "
[1]="foo-bar"
[2]=", "
)
Add another group and see if it still matches:
$ regex+='([^, }]+)([,}] ?)'
$ [[ $line =~ $regex ]]
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, bar+baz,"
[1]="foo-bar"
[2]=", "
[3]="bar+baz"
[4]=","
)
You could keep looping until the regex no longer matches - or perhaps a simpler approach would be to count the number of , characters on the line.
regex='\\TestFiles{([^, }]+)([,}] ?)'
line='\TestFiles{foo-bar, bar+baz,foo_bar_baz}'
commas=${line//[!,]}
for ((i=0; i<${#commas}; i++))
do
regex+='([^, }]+)([,}] ?)'
done
[[ $line =~ $regex ]]
Which results in:
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, bar+baz,foo_bar_baz}"
[1]="foo-bar"
[2]=", "
[3]="bar+baz"
[4]=","
[5]="foo_bar_baz"
[6]="}"
)
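The comma-counting idea above could be wrapped in a small function along these lines. This is only a sketch: split_testfiles is a made-up helper name, and it assumes the line contains a single \TestFiles{...} and no commas outside the braces. The odd-numbered groups hold the filenames, the even-numbered ones the separators.

```shell
#!/usr/bin/env bash
# Sketch only: split_testfiles is a hypothetical helper name.
# Assumes one \TestFiles{...} per line and no commas outside the braces.
split_testfiles() {
    local line=$1
    local unit='([^, }]+)([,}] ?)'          # one filename plus its separator
    local regex='\\TestFiles\{'"$unit"
    local commas=${line//[!,]/}             # keep only the commas
    local i
    for ((i = 0; i < ${#commas}; i++)); do
        regex+=$unit                        # one more group pair per comma
    done
    [[ $line =~ $regex ]] || return 1
    # Odd indexes are filenames, even indexes are separators.
    for ((i = 1; i < ${#BASH_REMATCH[@]}; i += 2)); do
        printf '%s\n' "${BASH_REMATCH[i]}"
    done
}

split_testfiles '\TestFiles{foo-bar, bar+baz,foo_bar_baz}'
```

The function prints one filename per line, so the caller can loop over the result regardless of how many files were listed.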
Alternative approach using IFS
You can set IFS=', ' and have bash do the splitting for you.
line='\TestFiles{foo-bar, bar+baz,foo_bar_baz}'
[[ $line = \\TestFiles{* ]] && {
    # Remove the leading '\TestFiles{'
    line=${line#*{}
    # Remove the trailing '}' (it must be escaped inside the expansion)
    line=${line%\}}
    IFS=', ' read -r -a filenames <<< "$line"
    declare -p filenames
}
declare -a filenames=([0]="foo-bar" [1]="bar+baz" [2]="foo_bar_baz")
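Applied to a whole file, the IFS approach might look like the following sketch. list_test_files is a hypothetical name; it reads the DTX content from standard input, assumes at most one \TestFiles{...} per line, and copes with any number of filenames.

```shell
#!/usr/bin/env bash
# Sketch only: list_test_files is a hypothetical helper name.
# Reads lines from stdin and prints every name found in \TestFiles{...}.
list_test_files() {
    local open='\TestFiles{'
    local line name
    local -a names
    while IFS= read -r line; do
        [[ $line == *"$open"*"}"* ]] || continue   # skip lines without the macro
        line=${line#*"$open"}    # drop everything up to and including '\TestFiles{'
        line=${line%%\}*}        # drop the closing brace and whatever follows it
        IFS=', ' read -r -a names <<< "$line"
        for name in "${names[@]}"; do
            printf '%s\n' "$name"
        done
    done
}

list_test_files <<'EOF'
\TestFiles{foo-bar}
\TestFiles{foo-bar, bar+baz,foo_bar_baz}
EOF
```

With a real file it would be called as list_test_files < mystyle.dtx, and each printed name could then be handed to pdflatex.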
Use set with IFS to split each line into new positional parameters. Assign $@ to an array so that elements can be accessed by index. Trying this with $@ directly results in a bad substitution error.
get-filenames.sh
#!/usr/bin/env bash
get_filenames() {
local IFS=' {},'
declare -a names
while read -r line; do
set -- $line    # unquoted on purpose, so that the line is split on IFS
names=("$@")
test "${names[0]}" == '\TestFiles' && {
for i in {1..3}; do
printf "Match %i: \"%s\"\n" "$i" "${names[$i]}"
done
}
echo
done < 'mystyle.dtx'
}
get_filenames
mystyle.dtx
\TestFiles{foo-bar}
\TestFiles{foo-bar, bar+baz,foo_bar_baz}
output
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar"
Match 2: "bar+baz"
Match 3: "foo_bar_baz"
EDIT (without external programs, though it's rather impractical, and tied to exactly three matches)
function get_filenames () {
p='([^, }]*) *,? *'
regex="\\\\TestFiles\\{$p$p$p"  # inside double quotes, \\\\ yields the \\ that matches a literal backslash
while read -r line ; do
if [[ $line =~ $regex ]] ; then
i=1
while [ $i -le 3 ]; do
echo "Match $i: \"${BASH_REMATCH[$i]}\""
i=$(( i + 1 ))
done
echo
fi
done < mystyle.dtx
}
If you really need to output exactly three file names (even empty) for each '\TestFiles' row then here's the code.
function get_filenames () {
    local MAX_FILES_CNT=3
    local IFS=$'\n'    # split the command substitutions below on newlines only
    local line filename i
    local -a filenames
    for line in $(grep -oP '\\TestFiles\{\K[^}]*' < mystyle.dtx); do
        filenames=()
        # grep -o prints every match on its own line (-m would limit lines, not matches)
        for filename in $(grep -oP '[^, ]+' <<< "$line"); do
            filenames+=("$filename")
        done
        i=0
        while [ "$i" -lt "$MAX_FILES_CNT" ]; do
            echo "Match $((i+1)): \"${filenames[i]}\""
            i=$(( i + 1 ))
        done
        echo ""
    done
}
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar"
Match 2: "bar+baz"
Match 3: "foo_bar_baz"
By the way, BASH_REMATCH is not well suited to this task, because a repeated group captures only its last repetition. Look:
[[ "asdf" =~ (.)* ]]
echo "${BASH_REMATCH[@]}"
asdf f
Also I would recommend to read this question https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice
Suggesting an awk script that would do the trick on one or more files.
get_filenames.awk
/\\TestFiles{[^}]*}/ { # handle only lines matching regex filter
filesCount = split($0, fileNamesArr, "\\\\TestFiles{[ ]*|[ ]*,[ ]*|[ ]*}"); # parse line to array fileNamesArr
for (i = 2; i < filesCount; i++) { # read elements 2 --> filesCount - 1
printf("Match %d in %s: \"%s\"\n", i - 1, FILENAME, fileNamesArr[i]); # format print fileNames
}
print"";
}
test file: input.1.txt
some text line 1
\TestFiles{foo-bar0}
some text \TestFiles{foo-bar1, bar+baz1, foo_bar_baz1}
some text \TestFiles{foo-bar2 ,bar+baz2 ,foo_bar_baz2 }
some text \TestFiles{ foo-bar3 , bar+baz3 , foo_bar_baz3 } some text
line 4
test file: input.2.txt
\TestFiles{file10, file11}
text
text \TestFiles{ file20 } some text
text\TestFiles{file30,file31,file32 }text
text
testing get_filenames.awk
awk -f get_filenames.awk input.1.txt input.2.txt
Match 1 in input.1.txt: "foo-bar0"
Match 1 in input.1.txt: "foo-bar1"
Match 2 in input.1.txt: "bar+baz1"
Match 3 in input.1.txt: "foo_bar_baz1"
Match 1 in input.1.txt: "foo-bar2"
Match 2 in input.1.txt: "bar+baz2"
Match 3 in input.1.txt: "foo_bar_baz2"
Match 1 in input.1.txt: "foo-bar3"
Match 2 in input.1.txt: "bar+baz3"
Match 3 in input.1.txt: "foo_bar_baz3"
Match 1 in input.2.txt: "file10"
Match 2 in input.2.txt: "file11"
Match 1 in input.2.txt: "file20"
Match 1 in input.2.txt: "file30"
Match 2 in input.2.txt: "file31"
Match 3 in input.2.txt: "file32"
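I believe this is the regular expression you're looking for (note that it relies on a lookbehind, which bash's =~ operator does not support because it uses POSIX extended regular expressions):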
(?<=\\TestFiles{.*)([\w\d\-\+_]+)[, }]+
You can see it working, modify it and have an explanation on what it does in the following link: https://regex101.com/r/0W8PBi/1
I need a regex in shell to match all strings except those with specific pattern.
My specific pattern can be variable, i.e. (i|I)[2 digits numbers](u|U)[2 digits numbers] in every string should not match.
For example :
Some.text.1234.text => should match
Some.text.1234.i10u20.text => shouldn't match
Some.text.1234.I01U02.text => shouldn't match
Some.text.1234.i83U23.text => shouldn't match
You can try this:
^(?!.*[iI]\d{2}[uU]\d{2}).*$
Explanation:
^ start of a line
(?!.* ... ) negative lookahead: the match fails if the pattern inside occurs anywhere in the line
[iI]\d{2}[uU]\d{2} i or I followed by two digits, then u or U followed by two digits
.*$ if the lookahead did not fail, match the entire string up to the end of the line $
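Lookaheads and \d are not part of POSIX extended regular expressions, so the pattern above needs a PCRE-style engine. As a rough alternative for the shell, the same filter can be sketched by inverting a plain match with grep -v. filter_valid is a hypothetical helper name, and it assumes one string per input line.

```shell
#!/usr/bin/env bash
# Sketch only: filter_valid is a hypothetical helper name.
# Reads strings from stdin and prints those that do NOT contain
# i/I + two digits + u/U + two digits.
filter_valid() {
    grep -Ev '[iI][0-9]{2}[uU][0-9]{2}'
}

printf '%s\n' Some.text.1234.text Some.text.1234.i10u20.text | filter_valid
```

Note that grep exits with status 1 when no line survives the filter, which matters if the script runs under set -e.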
The Bash script checking if a string matches a regex or not can look like
f='It_is_your_string_to_check';
if [[ "${f^^}" =~ I[0-9]{2}U[0-9]{2} ]]; then
echo "$f is invalid";
else
echo "$f is valid"
fi;
Here, "${f^^}" turns the string into uppercase (so that (U|u) and (I|i) are not needed), and the =~ operator performs a regex match; the pattern on the right side must not be quoted, or it is treated as a literal string. You may play it safe and define the regex pattern in a separate single-quoted string variable:
rx='I[0-9]{2}U[0-9]{2}'
if [[ "${f^^}" =~ $rx ]]; then ...
See a Bash demo online:
s='Some.text.1234.text
Some.text.1234.i10u20.text
Some.text.1234.I01U02.text
Some.text.1234.i83U23.text'
for f in $s; do
if [[ "${f^^}" =~ I[0-9]{2}U[0-9]{2} ]]; then
echo "$f is invalid";
else
echo "$f is valid"
fi;
done;
Output:
Some.text.1234.text is valid
Some.text.1234.i10u20.text is invalid
Some.text.1234.I01U02.text is invalid
Some.text.1234.i83U23.text is invalid
I'd like to extract a certain part of a string and modify it by using a regular expression.
A given string is TestcaseVzwPerformance_8_2_1_4_1_FDD2.
I'd like to extract the part 8_2_1_4_1 from the string and then replace the underscores _ with dots . So the expected result needs to be 8.2.1.4.1.
The numbers and length of the given string can be different.
For example,
Given string // Expected result
TestcaseVzwCqi_3_9_Test2 // 3.9
TestcaseVzwSvd1xRttAclr_6_6_2_3 // 6.6.2.3
TestcaseVzwCsiFading_9_4_1_1_1_FDD4 // 9.4.1.1.1
Here is my RegEx:
((?:\D{0,}_)(\d(_\d)*)(.*))
The numbered capturing group - $2 - contains 8_2_1_4_1 but with underscores.
Can I replace the underscores with dots?
It needs to be done in one RegEx and a Replace.
A regex by itself cannot modify text; you need a tool that applies a substitution, for example sed:
echo TestcaseVzwPerformance_8_2_1_4_1_FDD2 |
sed -E 's/^[^_]*_([0-9]+(_[0-9]+)*).*/\1/;s/_/./g'
8.2.1.4.1
If you have a Bash string, you can use a Bash regex to capture and Bash parameter expansions to replace:
$ s="TestcaseVzwSvd1xRttAclr_6_6_2_3"
$ [[ $s =~ ^[^_]*_([[:digit:]_]+)_* ]] && tmp=${BASH_REMATCH[1]//_/.} && echo "${tmp%.}"
6.6.2.3
Which can be in a loop:
while read -r line; do
if [[ $line =~ ^[^_]*_([[:digit:]_]+)_* ]]; then
tmp=${BASH_REMATCH[1]//_/.}
echo "\"$line\" => ${tmp%.}"
fi
done <<< 'Given string
TestcaseVzwCqi_3_9_Test2
TestcaseVzwSvd1xRttAclr_6_6_2_3
TestcaseVzwCsiFading_9_4_1_1_1_FDD4'
Prints:
"TestcaseVzwCqi_3_9_Test2" => 3.9
"TestcaseVzwSvd1xRttAclr_6_6_2_3" => 6.6.2.3
"TestcaseVzwCsiFading_9_4_1_1_1_FDD4" => 9.4.1.1.1
You can use the same loop to process a file.
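For reuse, the loop could be packaged as a function that reads from standard input, roughly as below. extract_version is a made-up name, and the sketch assumes the numeric part starts right after the first underscore, as in all the examples given.

```shell
#!/usr/bin/env bash
# Sketch only: extract_version is a hypothetical helper name.
# Assumes the digits start right after the first underscore of each line.
extract_version() {
    local re='^[^_]*_([0-9]+(_[0-9]+)*)'   # group 1: digits joined by underscores
    local line
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            printf '%s\n' "${BASH_REMATCH[1]//_/.}"   # underscores become dots
        fi
    done
}

extract_version <<'EOF'
TestcaseVzwCqi_3_9_Test2
TestcaseVzwSvd1xRttAclr_6_6_2_3
EOF
```

Because the group only admits an underscore when a digit follows it, there is no trailing separator left to trim afterwards.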
If you have a file, you may as well use gawk:
$ gawk 'BEGIN{FPAT="_[[:digit:]_]+"}
/_[[:digit:]]/ {sub(/^_/,"", $1); sub(/_$/,"",$1); gsub(/_/,".",$1); print $1}' file
3.9
6.6.2.3
9.4.1.1.1
I have a string like this
foo:collection:indexation [options] [--] <text> <text_1> <text_2> <text_3> <text_4>
And I want to use a bash regex to get an array, or a string that I can split, like the following, in order to check whether the syntax is correct:
["text", "text_1", "text_2", "text_3", "text_4"]
I have tried to do this :
COMMAND_OUTPUT=$($COMMAND_HELP)
# get the output of the help
# regex
ARGUMENT_REGEX="<([^>]+)>"
GOOD_REGEX="[a-z-]"
# get all the arguments
while [[ $COMMAND_OUTPUT =~ $ARGUMENT_REGEX ]]; do
ARGUMENT="${BASH_REMATCH[1]}"
# bad syntax
if [[ ! $ARGUMENT =~ $GOOD_REGEX ]]; then
echo "Invalid argument '$ARGUMENT' for the command $FILE"
echo "Must only use characters [a-z:-]"
exit 5
fi
done
But the while does not seem to be appropriate since I always get the first match.
How can I get all the matches for this regex ?
Thanks !
The loop doesn't work because every time you're just testing the same input string against the regexp. It doesn't know that it should start scanning after the match from the previous iteration. You'd need to remove the part of the string up to and including the previous match before doing the next test.
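The stripping step described above can be sketched as follows. all_matches is a hypothetical helper name; it assumes the regex has at least one capture group and never matches the empty string (otherwise the loop would not advance).

```shell
#!/usr/bin/env bash
# Sketch only: all_matches is a hypothetical helper name.
# Prints capture group 1 of every match of regex $2 in string $1.
all_matches() {
    local rest=$1 re=$2
    while [[ $rest =~ $re ]]; do
        printf '%s\n' "${BASH_REMATCH[1]}"
        # Drop everything up to and including this match before testing again.
        rest=${rest#*"${BASH_REMATCH[0]}"}
    done
}

all_matches 'foo:collection:indexation [options] [--] <text> <text_1>' '<([^>]+)>'
```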
A simpler way is to use grep -o to get all the matches.
$COMMAND_HELP | grep -o "$ARGUMENT_REGEX" | while read ARGUMENT; do
if [[ ! $ARGUMENT =~ $GOOD_REGEX ]]; then
echo "Invalid argument '$ARGUMENT' for the command $FILE"
echo "Must only use characters [a-z:-]"
exit 5
fi
done
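Two details of the pipeline above are worth keeping in mind: grep -o prints the angle brackets along with the name, and the while loop on the right side of a pipe runs in a subshell, so exit 5 only leaves that subshell. A possible variant, as a sketch only: check_arguments is a made-up name, the help text is passed as its first argument, and the anchored character set ^[a-z:-]+$ is an assumption taken from the error message.

```shell
#!/usr/bin/env bash
# Sketch only: check_arguments is a hypothetical helper name.
# Returns 5 as soon as an argument name contains a character outside [a-z:-].
check_arguments() {
    local help_text=$1 argument
    local good_re='^[a-z:-]+$'     # assumption: the whole name must match
    while IFS= read -r argument; do
        argument=${argument#'<'}   # strip the brackets that grep -o keeps
        argument=${argument%'>'}
        if [[ ! $argument =~ $good_re ]]; then
            echo "Invalid argument '$argument'" >&2
            echo "Must only use characters [a-z:-]" >&2
            return 5
        fi
    done < <(grep -oE '<[^>]+>' <<< "$help_text")
    return 0
}

check_arguments 'foo:collection:indexation [options] [--] <text> <other-text>' && echo "syntax ok"
```

Feeding the loop through process substitution keeps it in the current shell, so the return value reaches the caller.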
Bash doesn't have this directly, but you can achieve a similar effect with a slight modification.
string='foo...'
re='<([^>]+)>'
while [[ $string =~ $re(.*) ]]; do
string=${BASH_REMATCH[2]}
# process as before
done
This matches the regex we want and also everything in the string after it. We keep shortening $string by assigning only the part after the match to it on every iteration. Once no match is left in the remainder, the test fails and the loop terminates.
I want to parse strings similar to the following into separate variables using regular expressions from within Bash:
Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";
or
Category: resource;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Resource";rel="http://schemas.ogf.org/occi/core#entity";attributes="occi.core.summary";
The first part before "title" is common to all strings, the parts title and attributes are optional.
I managed to extract the mandatory parameters common to all strings, but I have trouble with optional parameters not necessarily present for all strings. As far as I found out, Bash doesn't support Non-capturing parentheses which I would use for this purpose.
Here is what I achieved thus far:
CATEGORY_REGEX='Category:\s*([^;]*);scheme="([^"]*)";class="([^"]*)";'
category_string='Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";'
[[ $category_string =~ $CATEGORY_REGEX ]]
echo ${BASH_REMATCH[0]}
echo ${BASH_REMATCH[1]}
echo ${BASH_REMATCH[2]}
echo ${BASH_REMATCH[3]}
The regular expression I would like to use (and which is working for me in Ruby) would be:
CATEGORY_REGEX='Category:\s*([^;]*);\s*scheme="([^"]*)";\s*class="([^"]*)";\s*(?:title="([^"]*)";)?\s*(?:rel="([^"]*)";)?\s*(?:location="([^"]*)";)?\s*(?:attributes="([^"]*)";)?\s*(?:actions="([^"]*)";)?'
Is there any other solution to parse the string with command line tools without having to fall back on perl, python or ruby?
I don't think non-capturing groups exist in bash regex, so your options are to use a scripting language or to remove the ?: from all of the (?:...) groups and just be careful about which groups you reference, for example:
CATEGORY_REGEX='Category:\s*([^;]*);\s*scheme="([^"]*)";\s*class="([^"]*)";\s*(title="([^"]*)";)?\s*(rel="([^"]*)";)?\s*(location="([^"]*)";)?\s*(attributes="([^"]*)";)?\s*(actions="([^"]*)";)?'
category_string='Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";'
[[ $category_string =~ $CATEGORY_REGEX ]]
echo "full: ${BASH_REMATCH[0]}"
echo "category: ${BASH_REMATCH[1]}"
echo "scheme: ${BASH_REMATCH[2]}"
echo "class: ${BASH_REMATCH[3]}"
echo "title: ${BASH_REMATCH[5]}"
echo "rel: ${BASH_REMATCH[7]}"
echo "location: ${BASH_REMATCH[9]}"
echo "attributes: ${BASH_REMATCH[11]}"
echo "actions: ${BASH_REMATCH[13]}"
Note that starting with the optional parameters we need to skip a group each time, because the even numbered groups from 4 on contain the parameter name as well as the value (if the parameter is present).
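If keeping track of the group numbers becomes error-prone, another option is to look up each parameter separately. The following is a sketch under assumptions: get_param is a hypothetical helper, keys are plain words without regex metacharacters, and values never contain a double quote.

```shell
#!/usr/bin/env bash
# Sketch only: get_param is a hypothetical helper name.
# Prints the value of key="value" in a ;-separated string,
# or returns non-zero when the key is absent.
get_param() {
    local key=$1 string=$2
    local re="(^|;)[[:space:]]*${key}=\"([^\"]*)\""
    [[ $string =~ $re ]] && printf '%s\n' "${BASH_REMATCH[2]}"
}

category_string='Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";'
echo "title: $(get_param title "$category_string")"
echo "rel: $(get_param rel "$category_string")"
```

Each lookup is independent, so the order of the optional parameters in the string no longer matters.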
You can emulate non-capturing groups in bash with a little regexp trickery, by giving each optional group an empty alternative:
_2__ _4__ _5__
[[ "fu#k" =~ ((.+)#|)((.+)/|)(.+) ]];
echo "${BASH_REMATCH[2]:--} ${BASH_REMATCH[4]:--} ${BASH_REMATCH[5]:--}"
# Output: fu - k
Characters # and / are parts of string we parse.
Regexp pipe | is used for either left or right (empty) part matching.
For the curious, ${VAR:-<default value>} is a parameter expansion with a default value that is used when $VAR is unset or empty.