Unexpected behavior in a regular expression in bash - regex

I created this regular expression and tested it out successfully
https://regex101.com/r/a7qvuw/1
However the regular expression behaves differently in this bash code that I wrote
# Splitting by colon
IFS=';' read -ra statements <<< $contents
# Splitting by the = sign.
regex="\s*(.*?)\s*=\s*(.*)\b"
for i in "${statements[#]}"; do
if [[ $i =~ $regex ]]; then
key=${BASH_REMATCH[1]}
params=${BASH_REMATCH[2]}
echo "KEY: $key| PARAMS: $params"
fi
done
The variable $contents has the text as is used in the link. The problem is that the $key has a space at its end, while the regular expression I tried matches the words without the space.
I get output like this:
KEY: vclock_spec | PARAMS: clk_i 1 1
As you can see there is a space between vclock_spec and the | which should not be there. What am I doing wrong?

As #Cyrus mentioned, lazy quantifiers are not supported in Bash regex. They act as greedy ones.
You may fix your pattern to work in Bash using
regex="\s*([^=]*\S)\s*=\s*(.*)\b"
^^^^^^^
The [^=]* matches zero or more symbols other then = and \S matches any non-whitespace (maybe [^\s=] will be more precise here as it matches any char but a whitespace (\s) and =, but it looks like regex="\s*([^=]*[^\s=])\s*=\s*(.*)\b" yields the same results).

Related

Why does bash "=~" operator ignore the last part of the pattern specified?

I am trying to do compare a string in bash to a regex pattern and have found something odd. For starters I am using GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu). This is within WSL.
For example here is sample program demonstrating the problem:
#!/bin/env bash
name="John"
if [[ "${name}" =~ "John"* ]]; then
echo "found"
else
echo "not found"
fi
exit
As expected this will echo found since the name "John" matches the regex pattern described. Now what I find odd is if I drop the n in John, it still echos found. Imo "Joh" does match the pattern of "John"*.
If you drop the "hn" and just set $name to "Jo" then it echos not found. It seems to only affect the last character in the Regex pattern (aside from the wildcard).
I am converting an old csh script to bash and this behavior is not happening in csh. What is causing bash to do this?
You're mixing up syntax for shell patterns and regular expressions. Your regular expression, after stripping the quoting, is John*: Joh followed by any number of n, including 0. Matches Joh, John, Johnn, Johnnn, ...
It's not anchored, so it also matches any string containing one of the matches above.
Since it's not anchored, depending on what you want, you could do any of these:
Any string containing John should match:
Regex: [[ $name =~ John ]]
Shell pattern: [[ $name == *John* ]]
Any string that begins with John should match:
Regex: [[ $name =~ ^John ]]
Shell pattern: [[ $name == John* ]]
Notice that shell patterns, unlike the regular expressions, must match the entire string.
A note on quoting: within [[ ... ]], the left-hand side doesn't have to be quoted; on the right-hand side, quoted parts are interpreted literally. For regular expressions, it's a good practice to define it in a separate variable:
re='^John'
if [[ $name =~ $re ]]; then
This avoids a few edge cases with special characters in the regex.
The =~ operator compares using regular expression syntax, not glob syntax. The * isn't a shell wildcard, it means, "the previous character, 0 or more times".
The string Joh matches the regular expression John* because it contains Joh followed by zero n characters.

Mix of regex and non-regex in bash if-statement

Inside of my $foo variable I have this data (please pay close attention to the .s and ,s):
,example.com,de.wikipedia.org,reddit,stackoverflow.com.,amazon.,
I am trying to write an if statement in bash that basically works like this:
if [[ "${foo}" =~ *','[a-z0-9]','* || "${foo}" =~ *','[a-z0-9]'.,'* ]]; then
echo "Invalid input detected"
else
echo "OK"
fi
It would echo Invalid input detected since reddit and amazon. are in $foo.
If I change the contents of $foo to be:
,example.com,de.wikipedia.org,www.reddit.com,stackoverflow.com.,amazon.com,
Then it would echo OK.
I am using bash 3.2.57(1)-release on OS X 10.11.6 El Capitan.
Try:
if [[ $foo =~ ,[a-z0-9]*, || $foo =~ ,[a-z0-9]*\., ]]; then
echo "Invalid input detected"
else
echo "OK"
fi
Notes:
=~ is a regular expression operator. The right-hand-side needs to be a regular expression, not a glob.
, is not a shell-active character. Thus, it does not need any special quoting.
[a-z0-9] matches exactly one alphanumeric. Since we want to allow for more any number, use [a-z0-9]*
In regular expressions, ','* matches zero or more commas. This is not what you want. One might write ,.* which, because, . is a wildcard, matches a comma followed by zero or more of anything. Since the regex is not anchored to the end, adding a final .* makes no difference.
Inside of [[...]] there is no word splitting. So shell variables do not the double-quoting that need elsewhere.
Note that, in [a-z0-9], the exact characters that match a-z or 0-9 depend on the collation order in the locale.

Which regexp would find my datetime format?

I created a regexp for a date with time, formatted like this:
25.06.19 / 16:30
I created the following regexp:
^[0-9]{2}\.[0-9]{2}\.[0-9]{4}\s.\/\s.[0-9]{2}\:[0-9]{2}$
The result should deliver the above match, but it doesn't. Can you help me fix my regexp?
Your year can consist of two or four digits, use
regex='^[0-9]{2}\.[0-9]{2}\.[0-9]{2}([0-9]{2})?[[:space:]]*/[[:space:]]*[0-9]{2}:[0-9]{2}$'
# ^^^^^^^^^^^
Bash demo:
s="25.06.19 / 16:30"
regex='^[0-9]{2}\.[0-9]{2}\.[0-9]{2}([0-9]{2})?[[:space:]]*/[[:space:]]*[0-9]{2}:[0-9]{2}$'
if [[ "$s" =~ $regex ]]; then
echo "Matched!"
fi;
Note I also replaced \s with [[:space:]] that should have wider support in Bash, and / does not need escaping as there are no regex delimiters here, and / is not a special regex metacharacter. Besides, the dots in \s.\/\s. are suspicious, I understand you wanted to match any 0 or more whitespaces, so I replaced . with *.

bash. regexp using bash_rematch

I have a var which can has single quotes or spaces before and/or after:
var="' /path/to/somewhere '"
var=" /path/to/somewhere "#<- here is a space at the end
var="/path/to/somewhere'"
var=" /path/to/somewhere '"
I need a regexp using bash rematch to clean possible single quotes or blank spaces before and after
I know it can be done in this way:
var=${var##+([ \'])}
var=${var%%+([ \'])}
But I need it with BASH_REMATCH (long to explain xd). I'm trying with:
[[ ${var} =~ ^([\' ]*)?(.+)([\' ])?$ ]] && var="${BASH_REMATCH[1]}"
But it doesn't work. Probably .+ is getting the rest of the string. How can I get only the interesting part? Thanks.
The .+ in your example is what's causing the problem, as it is greedy, so it will consume the rest of the line.
In this case, you can prevent it from doing so by requiring that the part in the middle ends in something other than a space or ', like this:
re='^[ '"'"']*(.*[^ '"'"'])[ '"'"']*$'
[[ $var =~ $re ]] && echo "${BASH_REMATCH[1]}"
The nasty-looking '"'"' are needed to insert a literal ' into a single-quoted string. When working with regular expressions in bash, the recommended method is to define a variable containing the pattern in single quotes and then to use it, unquoted (this method works in all versions of bash that support regular expressions).

Regex doesn't match with the lines in txt file

I'm reading the lines from a text file and check if it matches with the regex that I've created or not.
But it always says that your regex didn't match but the regex tool shows that it matches with my regular explanation.
while read line
do
name=$line
BRANCH_REGEX="\d{10}\-[^_]*\_\d{13}"
if [[ $name =~ $BRANCH_REGEX ]];
then
echo "BRANCH '$name' matches BRANCH_REGEX '$BRANCH_REGEX'"
else
echo "BRANCH '$name' DOES NOT MATCH BRANCH_REGEX '$BRANCH_REGEX'"
fi
done < names.txt
names.txt includes lines for example :
9000999484-suchocka_1416578464908
9000989944-schubertk_1416582641605
9001026342-extbeerfelde_1416586904787
9000687045-sturmjo_1416573131629
9001059401-extburghartswieser_1416405627982
9000806302-PDPUPDATE_1357830207068
9000658783-PDPUPDATE_1360445087963
BRANCH_REGEX="/\d{10}\-[^_]*\_\d{13}"
↑
Remove the leading /, none of your lines begin with it.
Also note that _ doesn't need to be escaped, you can write _ instead of \_.
Change your regex to:
BRANCH_REGEX="[0-9]{10}-[^_]*_[0-9]{13}"
Or else:
BRANCH_REGEX="[[:digit:]]{10}-[^_]*_[[:digit:]]{13}"
As BASH regex doesn't support \d property. There is no need to escape hyphens.