Bash regular expression execution hangs on long expressions - regex

I need to validate a 38-field comma-separated string. Fields can be integers, decimals, or strings that may be empty.
The problem is that when I construct a regular expression for all 38 fields and try to execute it, it hangs forever.
I use following per field reg exps:
INT="[0-9]+"
TIM="[0-9]+"
NUM="[0-9]+(\.[0-9]+)?"
STR=".*" # --> (also tried "[^,]*" but no change)
I constructed my regexps with above expressions.
1) This is working fine: (Output: "matches")
[[ "str1,1.1,5,6,7,8,9,str2,str3,str4,str1,1.1,5,6,7,8,9,str2,str3,str4,str1,1.1,5,6,7,8,9,str2,str3,str4,str1,1.1,5,6,7,8,9,str2,str3,str4" =~ ^.*\,[0-9]+(\.[0-9]+)?\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,.*\,.*\,.*\,.*\,[0-9]+(\.[0-9]+)?\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,.*\,.*\,.*\,.*\,[0-9]+(\.[0-9]+)?\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,.*\,.*\,.*\,.*\,[0-9]+(\.[0-9]+)?\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,.*\,.*\,.*$ ]] && echo matches
2) This hangs and execution won't complete:
[[ "str1,1.1,5,6,7,8,9,str2,str3,str4,str5,str6,str7,str8,str9,str10,str11,2.0,str12,0.0,5.0,str13,12312545645,45456456478,78979754545,12312545645,45456456478,78979754545,78979754545,4.74,0.1245,4.174,0.4245,6,80,str14,str15" =~ ^.*\,[0-9]+(\.[0-9]+)?\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,.*\,.*\,.*\,.*\,.*\,.*\,.*\,.*\,.*\,.*\,[0-9]+(\.[0-9]+)?\,.*\,[0-9]+(\.[0-9]+)?\,[0-9]+(\.[0-9]+)?\,.*\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+(\.[0-9]+)?\,[0-9]+(\.[0-9]+)?\,[0-9]+(\.[0-9]+)?\,[0-9]+(\.[0-9]+)?\,[0-9]+\,[0-9]+\,[0-9]+(\.[0-9]+)?\,.*\,.*$ ]] && echo matches
I thought .* was too generic, so I tried [^,]* instead, but nothing changed.
Please advise how I can solve this without splitting on "," and then comparing the fields one by one.
Correction:
Above I stated:
STR=".*" # --> (also tried "[^,]*" but no change)
This is wrong. I noticed that I had failed to replace all of them. When I replace every .* with [^,]* the problem is resolved. See below:
3) This is the fixed version, working as expected:
[[ "str1,1.1,5,6,7,8,9,str2,str3,str4,str5,str6,str7,str8,str9,str10,str11,2.0,str12,0.0,5.0,str13,12312545645,45456456478,78979754545,12312545645,45456456478,78979754545,78979754545,4.74,0.1245,4.174,0.4245,6,80,1,str15,str16" =~ ^[^,]*\,[0-9]+(\.[0-9]+)?\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,[^,]*\,[^,]*\,[^,]*\,[^,]*\,[^,]*\,[^,]*\,[^,]*\,[^,]*\,[^,]*\,[^,]*\,[0-9]+(\.[0-9]+)?\,[^,]*\,[0-9]+(\.[0-9]+)?\,[0-9]+(\.[0-9]+)?\,[^,]*\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+\,[0-9]+(\.[0-9]+)?\,[0-9]+(\.[0-9]+)?\,[0-9]+(\.[0-9]+)?\,[0-9]+(\.[0-9]+)?\,[0-9]+\,[0-9]+\,[0-9]+(\.[0-9]+)?\,[^,]*\,[^,]*$ ]] && echo matches
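Watch out for catastrophic backtracking, which I learned about from this issue. Each .* can also match commas, so when the overall match fails the engine retries every possible way of distributing the text across the .* fields.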
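A minimal sketch of how the per-field patterns above could be assembled instead of typing the long expression by hand. The helper name build_csv_regex and the five-field layout are assumptions made for illustration; the real record would list all 38 field types in order.

```bash
#!/usr/bin/env bash
# Per-field patterns from the question. STR excludes commas so it cannot
# swallow neighbouring fields and force the matcher to backtrack.
INT='[0-9]+'
NUM='[0-9]+(\.[0-9]+)?'
STR='[^,]*'

# Hypothetical helper: join the per-field patterns with commas and anchor
# the result at both ends.
build_csv_regex() {
  local IFS=,
  printf '^%s$' "$*"
}

# Illustrative 5-field layout (assumption; the real record has 38 fields).
re=$(build_csv_regex "$STR" "$NUM" "$INT" "$STR" "$STR")

line='str1,1.1,5,,str3'
if [[ $line =~ $re ]]; then
  result=matches
else
  result='no match'
fi
echo "$result"
```

Keeping the expression in a variable also avoids having to backslash-escape anything for the shell inside [[ ... ]].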

Sorry I am not able to comment since my reputation is lower than 50. :(
Will the following regex work for you?
^([A-Za-z0-9\s\.]+\,){37}[A-Za-z0-9\s\.]+$

Related

How to match string (with regular expression) that begins with a string

In a bash script I have to match strings that begin with the string lo repeated exactly 3 times; so lololoba is good, loloba is bad, lololololoba is good, balololo is bad.
I tried with this pattern: "^$str1/{$n,}" but it doesn't work, how can I do it?
EDIT:
According to OP's comment, lololololoba is bad now.
This should work:
pat="^(lo){3}"
s="lolololoba"
[[ $s =~ $pat ]] && echo good || echo bad
EDIT (as per OP's comment):
If you want to match exactly 3 times (i.e. lolololoba and the like should not match),
change pat="^(lo){3}" to:
pat="^(lo){3}([^l]|l[^o]|l?$)"
You can use following regex :
^(lo){3}.*$
Instead of lo you can put your variable.
See demo https://regex101.com/r/sI8zQ6/1
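A small sketch of putting a variable in place of lo, along the lines of the question's $str1 and $n. The function name starts_with_repeats is made up for illustration, and it assumes the repeated string contains no regex metacharacters. It implements the "at least n times" reading from the original question.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when $1 starts with at least $3 copies of $2.
# Assumes $2 contains no regex metacharacters.
starts_with_repeats() {
  local s=$1 unit=$2 n=$3
  local pat="^(${unit}){${n},}"
  [[ $s =~ $pat ]]
}

for s in lololoba loloba lololololoba balololo; do
  if starts_with_repeats "$s" lo 3; then
    echo "$s: good"
  else
    echo "$s: bad"
  fi
done
```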
You can use this awk to match exactly 3 occurrences of lo at the beginning:
# input file
cat file
lololoba
balololo
loloba
lololololoba
lololo
# awk command to print only valid lines
awk -F '^(lo){3}' 'NF == 2 && !($2 ~ /^lo/)' file
lololoba
lololo
As per your comment:
... more than 3 is bad so "lolololoba" is not good!
You'll find that @Jahid's answer doesn't fit (as his gives you "good" for that test string).
To use his answer while rejecting a fourth "lo":
pat="^(lo){3}"
more="^(lo){4}"
s="lolololoba"
[[ $s =~ $pat && ! $s =~ $more ]] && echo good || echo bad
This verifies that there are three "lo"s at the beginning, and not another one immediately following the three.
Note that a negative lookahead such as ^(lo){3}(?!lo) is not an option here: bash's =~ uses POSIX extended regular expressions, which have no lookahead.

bash regex doesnt match "at least n times but not more than m"

I'm trying to match string such as: "+99", "-82", "5", "auto" and "max"
That is: auto, max, and numbers (let's say integers) with or without a sign.
I tried regex
var='^([+|-]{0,1}[0-9][0-9]*)|(auto)|(max)$'
but it fails on the "at least n times but not more than m" part, in my case {0,1}.
I also tested var='ab{0,1}' and var='ab{2}', and these don't work either.
I didn't get any further, but I think the next problem could be these: ()
I'm using #!/bin/bash version 4.2.24(1)
Thanks in advance!
edit1:
I don't know how to group this regex for ? to be working as Karoly Horvath suggested.
I'm using this check function I found somewhere.
#!/bin/bash
INTEGER_MAX='^([+-])?[0-9][0-9]*$'
function isNumeric() {
check=`echo $1 | sed "s/\($2\)//"`
if [ -z ${check} ]; then
return 0
else
return 1
fi
}
isNumeric "$1" "$INTEGER_MAX" && echo "passed"
edit2 - SOLVED
it's working with
RE='(^([+-])?[0-9]+$)|(^auto$)|(^max$)'
tested on
[[ $string =~ $RE ]] && echo "passed"
THX!
The [+|-] bracket expression matches one character, which is either +, | or -. You probably meant: [+-].
The shorthand for {0,1} is ?, and [0-9][0-9]* is simply [0-9]+, but of course both should work.
"Anyway I tested var='ab{0,1}' and var='ab{2}' and these don't work neither"
ab{0,1} means either a or ab. Quantifiers apply to the preceding expression, which is typically a single character or a bracket expression; if you want to apply the quantifier to a longer expression, you have to group it.
If you have further questions please post how you use the regex, cause I'm not sure what your problem is...
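To illustrate the grouping point, here is a sketch of the check with one outer group, so that ^ and $ apply to every alternative rather than only the first and last. The function name is_valid and the extra test values are assumptions for illustration.

```bash
#!/usr/bin/env bash
# One outer group: the anchors bind to the whole alternation.
# [+-] holds only the two sign characters (no | inside the brackets).
re='^([+-]?[0-9]+|auto|max)$'

# Hypothetical helper wrapping the test.
is_valid() {
  [[ $1 =~ $re ]]
}

for v in +99 -82 5 auto max '|5' automax 5x; do
  if is_valid "$v"; then
    echo "$v: passed"
  else
    echo "$v: rejected"
  fi
done
```

Without the outer group, '^a|b$' reads as "starts with a" or "ends with b", which is why the original expression accepted more than intended.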

BASH regexp matching - including brackets in a bracketed list of characters to match against?

I'm trying to do a tiny bash script that'll clean up the file and folder names of downloaded episodes of some tv shows I like. They often look like "[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE", and I basically just want to strip out that speedcd advertising bit.
It's easy enough to remove www.Speed.Cd, spaces, and dashes using regexp matching in BASH, but for the life of me, I cannot figure out how to include the brackets in a list of characters to be matched against. [- [] doesn't work, neither does [- \[], [- \\[], [- \\\[], or any number of escape characters preceding the bracket I want to remove.
Here's what I've got so far:
[[ "$newfile" =~ ^(.*)([- \[]*(www\.torrenting\.com|spastikustv|www\.speed\.cd|moviesp2p\.com)[- \]]*)(.*)$ ]] &&
newfile="${BASH_REMATCH[1]}${BASH_REMATCH[4]}"
But it breaks on the brackets.
Any ideas?
TIA,
Daniel :)
EDIT: I should probably note that I'm using "shopt -s nocasematch" to ensure case insensitive matching, just in case you're wondering :)
EDIT 2: Thanks to all who contributed. I'm not 100% sure which answer was to be the "correct" one, as I had several problems with my statement. Actually, the most accurate answer was just a comment to my question posted by jw013, but I didn't get it at the time because I hadn't understood yet that spaces should be escaped. I've opted for aefxx's as that one basically says the same, but with explanations :) Would've liked to put a correct answer mark on ormaaj's answer, too, as he spotted more grave issues with my expression.
Anyway, the approach I was using above, trying to match and extract the parts to keep and leave behind the unwanted ones is really not very elegant, and won't catch all cases, not even something really simple like "Some.Show.S07E14.720p.HDTV.X264-SOMEONE - [ www.Speed.Cd ]". I've instead rewritten it to match and extract just the unwanted parts and then do string replacement of those on the original string, like so (loop is in case there's multiple brandings):
# Remove common torrent site brandings, including surrounding spaces, brackets, etc.:
while [[ "$newfile" =~ ([[\ {\(-]*(www\.)?(torrentday\.com|torrenting\.com|spastikustv|speed\.cd|moviesp2p\.com|publichd\.org|publichd|scenetime\.com|kingdom-release)[]\ }\)-]*) ]]; do
newfile=${newfile//"${BASH_REMATCH[1]}"/}
done
Ok, this is the first time I've heard of the =~ operator but nevertheless here's what I found by trial and error:
if [[ $newfile =~ ^(.*)([-[:space:][]*(what|ever)[][:space:]-]*)(.*)$ ]]
                        ^^^^^^^^^^^^^            ^^^^^^^^^^^^^
Looks strange but actually does work (just tested it).
EDIT
Quote from the Linux man pages regex(7):
To include a literal ] in the list, make it the first character (following a possible ^). To include a literal -, make it the first or last character, or the second endpoint of a range. To use a literal '-' as the first endpoint of a range, enclose it in "[." and ".]" to make it a collating element (see below). With the exception of these and some combinations using '[' (see next paragraphs), all other special characters, including '\', lose their special significance within a bracket expression.
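A short sketch applying the rules quoted above: ] goes first, - goes last, and [ needs no escaping in between. The variable names junk, keep and trimmed are made up for illustration, and the sample value is only a fragment of the filename from the question.

```bash
#!/usr/bin/env bash
# Bracket expression holding ], [, space and -:
#   ] comes first, - comes last, [ sits in between unescaped.
junk='[][ -]'
# The same set, negated: ^ first, then ] immediately after it.
keep='[^][ -]'

# Capture the text between leading and trailing junk characters.
re="^${junk}*(${keep}(.*${keep})?)${junk}*\$"

name='[ www.Speed.Cd ] - '
trimmed=
if [[ $name =~ $re ]]; then
  trimmed=${BASH_REMATCH[1]}
fi
echo "$trimmed"
# www.Speed.Cd
```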
Whenever you're doing a regex it's most compatible between Bash versions to put regexes in a variable even if you do manage to dodge all the pitfalls of putting them directly in a test expression. http://mywiki.wooledge.org/BashPitfalls#if_.5B.5B_.24foo_.3D.2BAH4_.27some_RE.27_.5D.5D
Your current regex looks like you're trying to optionally match anything preceding the opening bracket. I'd guess you're actually trying to save for example 3 and 4 from something like this:
$ shopt -s nocasematch
$ newfile='[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE'
$ re='^.*[-[:space:][]*(www\.torrenting\.com|spastikustv|www\.speed\.cd|moviesp2p\.com)[][:space:]-]*(.*)$'
$ [[ $newfile =~ $re ]]
$ declare -p BASH_REMATCH
declare -ar BASH_REMATCH='([0]="[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE" [1]="www.Speed.Cd" [2]="Some.Show.S07E14.720p.HDTV.X264-SOMEONE")'
The basic issue is quite simple, if not obvious.
A bash regex is completely unprotected from the shell, and cannot be protected by "double quotes" (quoting the pattern makes bash match it as a literal string). This means that every literal space (and tab, etc.) must be protected by a backslash \ ... end of story. The rest is just a case of getting your regex to suit your needs.
One other thing; use [\ [] and []\ ] to match [ and ] respectively, within the range square-bracket construct (in this case along with a space).
example:
newfile="[ ]"
[[ "$newfile" =~ ^[\ []\ []\ ]$ ]] &&
echo YES ||
echo NO
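The difference between escaping, quoting, and using a variable can be seen in a quick experiment. This is only a sketch and assumes bash 3.2 or later, where a quoted right-hand side of =~ is matched as a literal string.

```bash
#!/usr/bin/env bash
s='a b'

# Inside a variable the space needs no backslash.
re='^a b$'
[[ $s =~ $re ]] && echo "variable pattern: match"

# Written inline, the space has to be escaped for the shell.
[[ $s =~ ^a\ b$ ]] && echo "inline pattern with escaped space: match"

# Quoting the right-hand side turns it into a literal string, not a regex.
[[ 'axb' =~ a.b ]]   && echo "unquoted a.b matches axb"
[[ 'axb' =~ "a.b" ]] || echo "quoted a.b does not match axb"
```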
You can try something like this (though you weren't 100% clear on what cases you are trying to filter):
newfile="[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE"
if [[ $newfile =~ ^(.*)([^a-zA-Z0-9.]*\[.*\][^a-zA-Z0-9.]*)(.*)$ ]]; then
newfile="${BASH_REMATCH[1]}${BASH_REMATCH[3]}"
fi
echo "$newfile"
# Some.Show.S07E14.720p.HDTV.X264-SOMEONE
It's just stripping any non-alphanumeric (and non-dot) characters around the [], and anything within the [].

RegEx, colon separated list

I am trying to match a list of semicolon-separated emails. For the sake of keeping things simple, I am going to leave the email expression out of the mix and match it with any number of characters with no spaces in between them.
The following will be matched...
somevalues ;somevalues; somevalues;
or
somevalues; somevalues ;somevalues
The ending ; shouldn't be necessary.
The following would not be matched.
somevalues ; some values somevalues;
or
some values; somevalues some values
I have gotten this far, but it doesn't work. Since I allow spaces around the semicolons, the expression can't tell whether a space is inside a word or next to a semicolon.
([a-zA-Z]*\s*\;?\s*)*
The following is matched (which it shouldn't be):
somevalue ; somevalues some values;
How do I make the expression only allow spaces if there is a ; to the left or right of it?
Why not just split on semicolons and then regex out the email addresses?
The following PCRE expression should work.
\w+\s*(?:(?:;(?:\s*\w+\s*)?)+)?
However, adding email address validation to this will require replacing \w+ with (?:<your email validation regex>).
This is probably exactly what you want; tested on http://regexr.com?2rnce
EDIT: Depending on the language, you might need to escape ; as \;
The problem comes from the ? in \;?
[a-zA-Z]*(\s*;\s*[a-zA-Z]*)*
should work.
Try
([a-zA-Z]+\s*;\s*)*([a-zA-Z]+\s*)?
Note that I changed * to + on the e-mail pattern since I assume you don't want strings like ; to match.
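For anyone running this check in bash, here is a rough sketch of the same idea using POSIX character classes, since \s is not part of POSIX extended regular expressions. It follows the question's simplification that an item is a run of letters; the names item, sep and is_list are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Assumption: an item is a run of letters, as in the question's simplification.
item='[a-zA-Z]+'
# A semicolon with optional whitespace on either side.
sep='[[:space:]]*;[[:space:]]*'

# item, then any number of (separator item), then an optional trailing separator.
re="^[[:space:]]*${item}(${sep}${item})*(${sep})?[[:space:]]*\$"

# Hypothetical helper wrapping the test.
is_list() {
  [[ $1 =~ $re ]]
}

for line in \
  'somevalues ;somevalues; somevalues;' \
  'somevalues; somevalues ;somevalues' \
  'somevalues ; some values somevalues;' \
  'some values; somevalues some values'
do
  if is_list "$line"; then
    echo "match:    $line"
  else
    echo "no match: $line"
  fi
done
```

Whitespace is only accepted as part of a separator (or at the very ends), so a space between two words with no semicolon next to it makes the match fail.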
To solve this with a regex, you must prepend and append the delimiter to your input line; otherwise you cannot easily detect the first and last items.
#!/bin/bash
input=a:aa:aaa:aaaa
needle=aa
if [[ ":$input:" =~ ":$needle:" ]]
then
echo found
else
echo not found
fi
# -> found
This takes 45 nanoseconds.
Bash globbing is faster, at 35 nanoseconds:
input=a:aa:aaa:aaaa
needle=aa
if [[ ":$input:" == *":$needle:"* ]]
then
echo found
else
echo not found
fi
# -> found
Stupid solution: split by delimiter and match whole lines. This one is really slow, at 5100 nanoseconds:
echo a:aa:aaa:aaaa | tr ':' $'\n' | grep "^aa$"
# -> aa