Extracting file names in bash with regex

Can someone please help me set up a regular expression?
I have a large LaTeX3 TeXDoc file. LaTeX3 TeXDoc defines the macro \TestFiles{}, which lists the names of files that serve as unit tests. More than one file can be named between the braces, so \TestFiles{foo-bar} and \TestFiles{foo-bar, bar+baz,foo_bar_baz} are both syntactically correct uses of this macro.
I would like to write a bash script that extracts all the unit test files named in the \TestFiles{} macros, compiles them with pdflatex, and checks whether pdflatex produces an output file successfully.
I have something like this in my script:
function get_filenames () {
## This regex works but is not sensible enough
# regex='\\TestFiles{(.*)}'
## This works also, but is again not precise enough
regex='\\TestFiles{([0-9a-zA-Z+-_, ]*)}'
## This should give more than one matching group
## (separated by ", " or ","), but this regex doesn't
## work. I have no idea why or how to modify, to get
## it working
while read -r line ; do
if [[ $line =~ $regex ]] ; then
i=1
while [ $i -le 3 ]; do
echo "Match $i: \"${BASH_REMATCH[$i]}\""
i=$(( i + 1 ))
done
echo
fi
done < mystyle.dtx
}
Here is an excerpt of the DTX file
\TestFiles{foo-bar}
\TestFiles{foo-bar, bar+baz,foo_bar_baz}
(You can store this as mystyle.dtx, in order to reproduce the next example.)
Using the above noted examples, my script gives me the following results:
get_filenames
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar, bar+baz,foo_bar_baz"
Match 2: ""
Match 3: ""
I wasn't able to modify my regular expression so that it splits the content of the last example, \TestFiles{foo-bar, bar+baz,foo_bar_baz}, into three matches.
I tried a regular expression like regex='\\TestFiles{([[:alnum:]+-_]*)[,]+[ ]*}'. I thought [[:alnum:]+-_]* would match the file names. As far as I understand regular expressions, the (...) should form a group that is then listed in the bash array BASH_REMATCH[$i].
The part [,]+ should reflect that file names must be separated by at least one comma. Between the file names there might be some white space, which something like [[:space:]]* or at least [ ]* should represent. The quantifier * means zero or more repetitions, while + means one or more.
But that regular expression did not work at all; it produced no matches.
How must regex be defined to store each file name as a matching group? I am looking for the regular expression that gives this result:
get_filenames
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar"
Match 2: "bar+baz"
Match 3: "foo_bar_baz"
EDIT: In my real-world files there may be (and are) more than three test files.
Thanks in advance.

## This should give more than one matching group
regex='\\TestFiles{([0-9a-zA-Z+-_, ]*)}'
The element of BASH_REMATCH with index n is the portion of the string matching the nth parenthesized subexpression.
Your regex only has 1 "parenthesized subexpression" - which is why everything ends up in BASH_REMATCH[1]
$ regex='\\TestFiles{([0-9a-zA-Z+-_, ]*)}'
$ [[ $line =~ $regex ]]
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, bar+baz,foo_bar_baz}"
[1]="foo-bar, bar+baz,foo_bar_baz"
)
As you're trying to match an unknown number of file names, you would have to create your regex "dynamically" so that it contains the needed number of groups. Start with a single file name group followed by a separator group:
$ regex='\\TestFiles{([^, }]+)([,}] ?)'
$ [[ $line =~ $regex ]]
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, "
[1]="foo-bar"
[2]=", "
)
Add another group and see if it still matches:
$ regex+='([^, }]+)([,}] ?)'
$ [[ $line =~ $regex ]]
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, bar+baz,"
[1]="foo-bar"
[2]=", "
[3]="bar+baz"
[4]=","
)
You could keep looping until the regex no longer matches - or perhaps a simpler approach would be to count the number of , characters on the line.
regex='\\TestFiles{([^, }]+)([,}] ?)'
line='\TestFiles{foo-bar, bar+baz,foo_bar_baz}'
commas=${line//[!,]}
for ((i=0; i<${#commas}; i++))
do
regex+='([^, }]+)([,}] ?)'
done
[[ $line =~ $regex ]]
Which results in:
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
[0]="\\TestFiles{foo-bar, bar+baz,foo_bar_baz}"
[1]="foo-bar"
[2]=", "
[3]="bar+baz"
[4]=","
[5]="foo_bar_baz"
[6]="}"
)
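The comma-counting approach can be wrapped in a function. This is only a sketch: the name extract_test_files is made up for illustration, and it assumes that every comma on the line sits inside the \TestFiles{...} braces, since any other comma would add a group that cannot match. The opening brace is escaped here so it is always taken literally.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one file name per line for a single \TestFiles{...} line.
extract_test_files() {
    local line=$1
    local group='([^, }]+)([,}] ?)'      # one file name plus the separator after it
    local regex='\\TestFiles\{'"$group"
    local commas=${line//[!,]/}          # keep only the commas of the line
    local i
    for ((i = 0; i < ${#commas}; i++)); do
        regex+=$group                    # one more group pair per comma
    done
    [[ $line =~ $regex ]] || return 1
    # The file names sit at the odd indices 1, 3, 5, ...
    for ((i = 1; i < ${#BASH_REMATCH[@]}; i += 2)); do
        printf '%s\n' "${BASH_REMATCH[i]}"
    done
}

extract_test_files '\TestFiles{foo-bar, bar+baz,foo_bar_baz}'
```

The even indices hold the separators, so the loop skips them.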
Alternative approach using IFS
You can set IFS=', ' and have bash do the splitting for you.
line='\TestFiles{foo-bar, bar+baz,foo_bar_baz}'
[[ $line = \\TestFiles{* ]] && {
    # Remove the leading '\TestFiles{'
    line=${line#*{}
    # Remove the trailing '}' (escaped, so it is not read as the end of the expansion)
    line=${line%\}}
    IFS=', ' read -r -a filenames <<< "$line"
    declare -p filenames
}
declare -a filenames=([0]="foo-bar" [1]="bar+baz" [2]="foo_bar_baz")
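Applied to a whole file, the IFS approach handles any number of file names per macro. The following is a sketch under a few assumptions: the function name list_test_files is invented for illustration, the input arrives on standard input, and only the first \TestFiles{...} on each line is considered.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a .dtx file on stdin and print every test file name on its own line.
list_test_files() {
    local line rest
    local -a filenames
    while IFS= read -r line; do
        [[ $line == *'\TestFiles{'* ]] || continue
        rest=${line#*\\TestFiles\{}     # drop everything up to and including the opening brace
        rest=${rest%%\}*}               # drop the first closing brace and whatever follows it
        IFS=', ' read -r -a filenames <<< "$rest"
        if ((${#filenames[@]} > 0)); then
            printf '%s\n' "${filenames[@]}"
        fi
    done
}

list_test_files <<'EOF'
\TestFiles{foo-bar}
\TestFiles{foo-bar, bar+baz,foo_bar_baz}
EOF
```

With a real file the call would be list_test_files < mystyle.dtx, and the resulting names could then be handed to pdflatex one by one.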

Use set with IFS to split each line into new positional parameters. Assign $@ to an array so that the elements can be accessed by index. Trying this with $@ directly results in a bad substitution error.
get-filenames.sh
#!/usr/bin/env bash
get_filenames() {
    local IFS=' {},'
    declare -a names
    while read -r line; do
        # Split the line on spaces, braces and commas into $1, $2, ...
        set -- $line
        names=($@)
        test "${names[0]}" == '\TestFiles' && {
            for i in {1..3}; do
                printf "Match %i: \"%s\"\n" $i ${names[$i]}
            done
        }
        echo
    done < 'mystyle.dtx'
}
get_filenames
mystyle.dtx
\TestFiles{foo-bar}
\TestFiles{foo-bar, bar+baz,foo_bar_baz}
output
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar"
Match 2: "bar+baz"
Match 3: "foo_bar_baz"

EDIT (without external programs, though it's rather impractical, and tied to exactly three matches)
function get_filenames () {
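p='([^, }]*) *,? *'
# Inside double quotes, four backslashes yield the two that the regex needs for one literal backslash
regex="\\\\TestFiles\{$p$p$p"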
while read -r line ; do
if [[ $line =~ $regex ]] ; then
i=1
while [ $i -le 3 ]; do
echo "Match $i: \"${BASH_REMATCH[$i]}\""
i=$(( i + 1 ))
done
echo
fi
done < mystyle.dtx
}
If you really need to output exactly three file names (even empty) for each '\TestFiles' row then here's the code.
function get_filenames () {
MAX_FILES_CNT=3
IFS=$'\n'
for line in $(grep -oP '\\TestFiles\{\K[^}]*' < mystyle.dtx); do
filenames=()
for filename in $(grep -m $MAX_FILES_CNT -oP "[^, ]+" <<< "$line"); do
filenames+=("$filename")
done
i=0
while [ $i -lt $MAX_FILES_CNT ]; do
echo "Match $(($i+1)): \"${filenames[i]}\""
i=$(( i + 1 ))
done
echo ""
done
unset IFS
}
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar"
Match 2: "bar+baz"
Match 3: "foo_bar_baz"
By the way, BASH_REMATCH is no good for this task, because a repeated group captures only its last repetition. Look:
[[ "asdf" =~ (.)* ]]
echo "${BASH_REMATCH[@]}"
asdf f
I would also recommend reading this question: https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice

Suggesting an awk script that would do the trick on one or more files.
get_filenames.awk
/\\TestFiles{[^}]*}/ { # handle only lines matching regex filter
filesCount = split($0, fileNamesArr, "\\\\TestFiles{[ ]*|[ ]*,[ ]*|[ ]*}"); # parse line to array fileNamesArr
for (i = 2; i < filesCount; i++) { # read elements 2 --> filesCount - 1
printf("Match %d in %s: \"%s\"\n", i - 1, FILENAME, fileNamesArr[i]); # format print fileNames
}
print "";
}
test file: input.1.txt
some text line 1
\TestFiles{foo-bar0}
some text \TestFiles{foo-bar1, bar+baz1, foo_bar_baz1}
some text \TestFiles{foo-bar2 ,bar+baz2 ,foo_bar_baz2 }
some text \TestFiles{ foo-bar3 , bar+baz3 , foo_bar_baz3 } some text
line 4
test file: input.2.txt
\TestFiles{file10, file11}
text
text \TestFiles{ file20 } some text
text\TestFiles{file30,file31,file32 }text
text
Testing get_filenames.awk
awk -f get_filenames.awk input.1.txt input.2.txt
Match 1 in input.1.txt: "foo-bar0"
Match 1 in input.1.txt: "foo-bar1"
Match 2 in input.1.txt: "bar+baz1"
Match 3 in input.1.txt: "foo_bar_baz1"
Match 1 in input.1.txt: "foo-bar2"
Match 2 in input.1.txt: "bar+baz2"
Match 3 in input.1.txt: "foo_bar_baz2"
Match 1 in input.1.txt: "foo-bar3"
Match 2 in input.1.txt: "bar+baz3"
Match 3 in input.1.txt: "foo_bar_baz3"
Match 1 in input.2.txt: "file10"
Match 2 in input.2.txt: "file11"
Match 1 in input.2.txt: "file20"
Match 1 in input.2.txt: "file30"
Match 2 in input.2.txt: "file31"
Match 3 in input.2.txt: "file32"

I believe this is the regular expression you're looking for:
(?<=\\TestFiles{.*)([\w\d\-\+_]+)[, }]+
You can see it working, modify it and have an explanation on what it does in the following link: https://regex101.com/r/0W8PBi/1

Related

How to test if variable is in specified format in Bash?

My Bash script accepts an argument of a version number, in the following format: #.#.#.#.dev or #.#.#.#.prod, where # can be any number, for example: 3.6.212.0.dev.
I want to verify that the argument is in the right format (contains all the 4 numbers separated by dots and includes .prod or .dev in the end).
I'm not sure how to achieve this, so far I've tried this:
re='^[0-9]+$'
if ! [[ $1 =~ $re ]] ; then
echo "error: Incorrect version specified; must be in the format of #.#.#.#.env" >&2; exit 1
fi

You can use
re='^([0-9]+\.){4}(dev|prod)$'
if ! [[ "$1" =~ $re ]] ; then
echo "error: Incorrect version specified; must be in the format of #.#.#.#.env" >&2; exit 1
fi
See a Bash demo online.
^([0-9]+\.){4}(dev|prod)$ is a POSIX ERE compliant pattern that matches:
^ - start of string
([0-9]+\.){4} - four occurrences of one or more digits and a dot
(dev|prod) - dev or prod substring
$ - end of string.
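As a quick check of the pattern above, the test can be put into a small function and run against a few sample values. The function name is_valid_version and the sample versions are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only for versions such as 3.6.212.0.dev or 1.2.3.4.prod.
is_valid_version() {
    local re='^([0-9]+\.){4}(dev|prod)$'
    [[ $1 =~ $re ]]
}

for v in 3.6.212.0.dev 1.2.3.4.prod 3.6.212.dev 3.6.212.0.test; do
    if is_valid_version "$v"; then
        echo "$v: ok"
    else
        echo "$v: rejected"
    fi
done
```

The first two values are accepted; the third has only three numbers and the fourth has an unknown suffix, so both are rejected.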

Regex/Shell - how to match all except those with specific pattern

I need a regex in shell to match all strings except those with specific pattern.
My specific pattern can be variable, i.e. (i|I)[2 digits numbers](u|U)[2 digits numbers] in every string should not match.
For example :
Some.text.1234.text => should match
Some.text.1234.i10u20.text => shouldn't match
Some.text.1234.I01U02.text => shouldn't match
Some.text.1234.i83U23.text => shouldn't match

You can try this:
^(?!.*[tuTU]\d{2}).*$
Demo
Explanation:
^ start of a line
(?!...) negative lookahead
.*[tuTU]\d{2} checks whether one of these characters followed by two digits occurs anywhere in the line
.*$ if the lookahead finds nothing, matches the entire line up to its end $

The Bash script checking if a string matches a regex or not can look like
f='It_is_your_string_to_check';
if [[ "${f^^}" =~ I[0-9]{2}U[0-9]{2} ]]; then
echo "$f is invalid";
else
echo "$f is valid"
fi;
Here, "${f^^}" turns the string into uppercase (so as not to use (U|u) and (I|i)), and then =~ operator triggers a regex check here since the pattern on the right side is not quoted. You may play it safe and define the regex pattern with a separate single-quoted string variable and use
rx='I[0-9]{2}U[0-9]{2}'
if [[ "${f^^}" =~ $rx ]]; then ...
See a Bash demo online:
s='Some.text.1234.text
Some.text.1234.i10u20.text
Some.text.1234.I01U02.text
Some.text.1234.i83U23.text'
for f in $s; do
if [[ "${f^^}" =~ I[0-9]{2}U[0-9]{2} ]]; then
echo "$f is invalid";
else
echo "$f is valid"
fi;
done;
Output:
Some.text.1234.text is valid
Some.text.1234.i10u20.text is invalid
Some.text.1234.I01U02.text is invalid
Some.text.1234.i83U23.text is invalid

RegEx : How can I extract a certain part and modify it?

I'd like to extract a certain part of a string and modify it by using a regular expression.
A given string is TestcaseVzwPerformance_8_2_1_4_1_FDD2.
I'd like to extract the part 8_2_1_4_1 from the string and then replace the underscores _ with dots . So the expected result needs to be 8.2.1.4.1.
The numbers and length of the given string can be different.
For example,
Given string // Expected result
TestcaseVzwCqi_3_9_Test2 // 3.9
TestcaseVzwSvd1xRttAclr_6_6_2_3 // 6.6.2.3
TestcaseVzwCsiFading_9_4_1_1_1_FDD4 // 9.4.1.1.1
Here is my RegEx:
((?:\D{0,}_)(\d(_\d)*)(.*))
The numbered capturing group - $2 - contains 8_2_1_4_1 but with underscores.
Can I replace the underscores with dots?
It needs to be done in one RegEx and a Replace.

A regex on its own cannot modify text; the replacement has to be done by a tool, for example sed:
echo TestcaseVzwPerformance_8_2_1_4_1_FDD2 |
sed -E 's/[^_]*_(([_0-9])+)_.*/\1/;s/_/./g'
8.2.1.4.1

If you have a Bash string, you can use a Bash regex to capture and Bash parameter expansions to replace:
$ s="TestcaseVzwSvd1xRttAclr_6_6_2_3"
$ [[ $s =~ ^[^_]*_([[:digit:]_]+)_* ]] && tmp=${BASH_REMATCH[1]//_/.} && echo "${tmp%.}"
6.6.2.3
Which can be in a loop:
while read -r line; do
if [[ $line =~ ^[^_]*_([[:digit:]_]+)_* ]]; then
tmp=${BASH_REMATCH[1]//_/.}
echo "\"$line\" => ${tmp%.}"
fi
done <<< 'Given string
TestcaseVzwCqi_3_9_Test2
TestcaseVzwSvd1xRttAclr_6_6_2_3
TestcaseVzwCsiFading_9_4_1_1_1_FDD4'
Prints:
"TestcaseVzwCqi_3_9_Test2" => 3.9
"TestcaseVzwSvd1xRttAclr_6_6_2_3" => 6.6.2.3
"TestcaseVzwCsiFading_9_4_1_1_1_FDD4" => 9.4.1.1.1
You can use the same loop to process a file.
If you have a file, you may as well use gawk:
$ awk 'BEGIN{FPAT="_[[:digit:]_]+"}
/_[[:digit:]]/ {sub(/^_/,"", $1); sub(/_$/,"",$1); gsub(/_/,".",$1); print $1}' file
3.9
6.6.2.3
9.4.1.1.1

preg_match_all equivalent for BASH?

I have a string like this
foo:collection:indexation [options] [--] <text> <text_1> <text_2> <text_3> <text_4>
I want to use a bash regex to get an array, or a string that I can split, like the following, in order to check whether the syntax is correct:
["text", "text_1", "text_2", "text_3", "text_4"]
I have tried to do this :
COMMAND_OUTPUT=$($COMMAND_HELP)
# get the output of the help
# regex
ARGUMENT_REGEX="<([^>]+)>"
GOOD_REGEX="[a-z-]"
# get all the arguments
while [[ $COMMAND_OUTPUT =~ $ARGUMENT_REGEX ]]; do
ARGUMENT="${BASH_REMATCH[1]}"
# bad syntax
if [[ ! $ARGUMENT =~ $GOOD_REGEX ]]; then
echo "Invalid argument '$ARGUMENT' for the command $FILE"
echo "Must only use characters [a-z:-]"
exit 5
fi
done
But the while loop does not seem to be appropriate, since I always get the first match.
How can I get all the matches for this regex?
Thanks!

The loop doesn't work because every time you're just testing the same input string against the regexp. It doesn't know that it should start scanning after the match from the previous iteration. You'd need to remove the part of the string up to and including the previous match before doing the next test.
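A simpler way is to use grep -o to get all the matches. The -E option is needed because the pattern uses extended regex syntax (grouping parentheses and +).
$COMMAND_HELP | grep -oE "$ARGUMENT_REGEX" | while read -r ARGUMENT; do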
if [[ ! $ARGUMENT =~ $GOOD_REGEX ]]; then
echo "Invalid argument '$ARGUMENT' for the command $FILE"
echo "Must only use characters [a-z:-]"
exit 5
fi
done
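The pure bash route described at the start of this answer, removing everything up to and including the previous match before testing again, could look roughly like this. The function name match_all is made up for illustration, and the sketch assumes that the regex never matches the empty string, because otherwise the loop would not advance.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first capture group of every match of regex $2 in string $1.
match_all() {
    local string=$1 re=$2
    while [[ $string =~ $re ]]; do
        printf '%s\n' "${BASH_REMATCH[1]}"
        # Cut off everything up to and including this match, then search the remainder.
        string=${string#*"${BASH_REMATCH[0]}"}
    done
}

usage='foo:collection:indexation [options] [--] <text> <text_1> <text_2>'
match_all "$usage" '<([^>]+)>'
```

Because nothing runs in a pipeline here, an exit inside the loop of a calling script would end the whole script rather than only a subshell.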

Bash doesn't have this directly, but you can achieve a similar effect with a slight modification.
string='foo...'
re='<([^>]+)>'
while [[ $string =~ $re(.*) ]]; do
string=${BASH_REMATCH[2]}
# process as before
done
This matches the regex we want and also everything in the string after it. We keep shortening $string by assigning only the part after the match to it on every iteration. Once the remainder contains no further match (after the last match in the example it is empty), the test fails and the loop terminates.

bash regex to parse text of the form +incdir+<dir1>+<dir2>

I have an input string of the form +incdir+<dir1>+<dir2>, where <dir1> and <dir2> are directory names. I want to parse this using a bash regex and have the values of the directories inside BASH_REMATCH[1], [2], ...
Here is what I tried:
function match {
if [[ "$1" =~ \+incdir(\+.*)+ ]]; then
for i in $(seq $(expr ${#BASH_REMATCH[@]} - 1)); do
echo $i ":" ${BASH_REMATCH[$i]}
done
else
echo "no match"
fi
}
This works for match +incdir+foo, but not for match +incdir+foo+bar, because the matching is greedy and it outputs +foo+bar. Bash has no non-greedy matching, as the question "regex in bash expression" mentions, so I tried the following pattern instead: \+incdir(\+[^+]*)+ but this just gives me +bar.
The way I would interpret the regex is the following: find the beginning +incdir, then match me at least one group starting with a + followed by as many characters as you can find that are not +. When you hit a + this is the start of the next group. I guess my reasoning is incorrect.
Does anyone have any idea what I'm doing wrong?

Using only bash builtins (but NOT regular expressions, which are the wrong tool for this job):
match() {
[[ $1 = *+incdir+* ]] || return # noop if no +incdir present
IFS=+ read -r -a pieces <<<"${1#*+incdir+}" # read everything after +incdir+
# into +-separated array
for idx in "${!pieces[@]}"; do # iterate over keys in array
echo "$idx: ${pieces[$idx]}" # ...and emit key/value pairs
done
}
$ match "yadda yadda +incdir+foo+bar+baz"
0: foo
1: bar
2: baz
