Improve regex that works - regex

I'm not a regex expert, so please be nice :-)
I created this regex to check whether a user submitted a day of the week (in Italian):
/((lun|mart|giov)e|mercol(e?)|vener)d(ì|i('?)|í)|sabato|domenica/
This regex works perfectly and matches the following:
lunedi
lunedì
lunedí
lunedi'
martedi
martedì
martedí
martedi'
mercoledi
mercoledì
mercoledí
mercoledi'
mercoldi
mercoldì
mercoldí
mercoldi'
giovedi
giovedì
giovedí
giovedi'
venerdi
venerdì
venerdí
venerdi'
sabato
domenica
Now consider the first part of the regex and focus on venerdì: as you can see, I added an OR (|) just to handle venerdì, because of that "r".
Everything works just fine, but I'd like to know whether there is a way to start the regex like this:
(lun|mar|giov|ven)e
and then handle that "r" somehow.
I read about backreferences and conditionals, but I'm not sure they can be of any help.
My idea is something like: "if the first group captured 'ven', then add an 'r' after the 'e' that follows the group".
Is this possible?

Don't "golf" your regex. If you want to improve it at all, make it more readable. While it is certainly worthwhile to use different cases for the different "i" variants, everything else should IMHO be kept as simple as possible.
How about something like this?
(lune|marte|mercole?|giove|vener)d(ì|i'?|í)|sabato|domenica
Don't use backreferences and other advanced features if you don't need them, just to make your regex a few chars shorter. Even if you would still understand what it means, think about your fellow co-developers -- or just yourself two months from now.
I just removed a few redundant (...) and the "shared e" part. Note how (besides the (...)) it is the same length, whether you use (lun|mart|giov)e or lune|marte|giove, but the latter is arguably more readable. Similarly, a backreference or some conditional would likely make your regex longer instead of shorter -- and considerably more complicated.
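As a rough illustration of the readable version above, here is a minimal Python sketch. Python is an assumption (the question does not name a language), and the helper name is_italian_weekday is hypothetical. It uses re.fullmatch so that the whole input has to be a day name rather than merely contain one.

```python
import re

# The simplified pattern from the answer above, unchanged.
WEEKDAY = re.compile(r"(lune|marte|mercole?|giove|vener)d(ì|i'?|í)|sabato|domenica")


def is_italian_weekday(text: str) -> bool:
    """Return True if the whole string is one of the accepted spellings."""
    return WEEKDAY.fullmatch(text) is not None


# A few accepted spellings plus one deliberate misspelling.
for word in ["lunedì", "martedi'", "mercoldí", "venerdi", "sabato", "lunedo"]:
    print(word, is_italian_weekday(word))
```

Because each day stem is spelled out in full, adding or removing a variant means touching only one alternative.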

Related

Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to parse the whole regex (which could be quite a long operation) and then change its internals to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in Python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See the solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, creates matches 6 characters long.
You were not far off with what you tried; you just need a lookahead assertion rather than a lookbehind, and one parenthesis was misplaced. The fix: put the original pattern in parentheses and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
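Since the question mentions taking the original regex as an argument in Python, the lookahead idea can be sketched as a small helper. The function name prefilter is hypothetical, and the sketch uses a non-capturing group (?:...) instead of the capturing parentheses shown above, so that group numbers inside the original pattern are not shifted.

```python
import re


def prefilter(original: str, prefix: str) -> re.Pattern:
    """Wrap `original` so it only matches where the text starts with `prefix`."""
    # (?=...) is a lookahead: it checks for the prefix without consuming it,
    # so the original pattern still starts matching at the same position.
    # (?:...) keeps any top-level | of the original pattern contained.
    return re.compile(rf"(?={re.escape(prefix)})(?:{original})")


words = ["cat", "car", "bat"]
ca_only = prefilter("cat|car|bat", "ca")
print([w for w in words if ca_only.fullmatch(w)])  # ['cat', 'car']
```

The same wrapper works for the second example: prefilter("[a-z]{4}", "ca") still requires exactly four letters when used with fullmatch, because the lookahead adds no length to the match.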
Ok, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.

Regular expression/Regex with Java/Javascript: performance drop or infinite loop

I'd like to submit a very specific performance problem that I want to understand.
Goal
I'm trying to validate a custom syntax with a regex. I usually don't run into performance issues with regexes, so I like to use them.
Case
The regex:
^(\{[^\][{}(),]+\}\s*(\[\s*(\[([^\][{}(),]+\s*(\(\s*([^\][{}(),]+\,?\s*)+\))?\,?\s*)+\]\s*){1,2}\]\s*)*)+$
A valid syntax:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]
You can find the regex and a test text here:
https://regexr.com/3jama
I hope that is sufficient; I don't know how to explain what I want to match any better than with the regex itself ;-).
Issue
Applying the regex to valid text costs very little; it's almost instant.
But for certain invalid texts, the regexr app hangs. This is not specific to regexr, since I also saw dramatic slowdowns with my own Java and JavaScript code.
I need to validate the text as the user types. I could even settle for validating on click, but I cannot afford to have the app hang when the submitted text is structured like the case below, or like any other input that triggers the same performance drop.
Reproducing the issue
Just remove the trailing "]" character from the test text
So the invalid text that triggers the performance drop becomes:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4
Another invalid text, which causes no performance drop, could be:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]]
Request
I'd be glad if a regex guru passing by could explain what I'm doing wrong, or why my use case isn't suited to regex.
This answer is for the condensed regex from your comment:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+\,?)+\))?\,?)+\]){1,2}\])*)+$
The issues are similar for your original pattern.
You are facing catastrophic backtracking. Whenever the regex engine cannot complete a match, it backtracks into the string, trying to find other ways to match the pattern to certain substrings. If you have lots of ambiguous patterns, especially if they occur inside repetitions, testing all possible variations takes a looooong time. See link for a better explanation.
One of the subpatterns that you use is the following (multilined for better visualisation):
([^\][{}(),]+
(\(
([^\][{}(),]+\,?)+
\))?
\,?)+
That is supposed to match a string like actor4(syno3, syno4). Condensing this pattern a little more, you get to ([^\][{}(),]+,?)+. If you remove the ,? from it, you get ([^\][{}(),]+)+, which opens the gate to catastrophic backtracking, as a string can be matched in a great many different ways with this pattern.
I get what you are trying to do with this pattern: match an identifier, and maybe other identifiers separated by commas. The proper way of doing this, however, is ([^\][{}(),]+(?:,[^\][{}(),]+)*). Now there is no ambiguous way left to backtrack into this pattern.
Doing this for the whole pattern shown above (yes, there is another optional comma that has to be unrolled) and inserting it back into your complete pattern, I arrive at:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?(?:\,[^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+))*\))?)*)\]){1,2}\])*)+$
Which doesn't catastrophically backtrack anymore.
You might want to do yourself a favour and split this into subpatterns that you concatenate, either as strings in your actual source or with defines if you are using a PCRE pattern.
Note that some regex engines support atomic groups and possessive quantifiers, which further help to avoid needless backtracking. As you mentioned different languages in your title, you will have to check for yourself which of these are available in your language of choice.
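The advice to build the pattern from named subpatterns can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not a copy of the regex above: all names (IDENT, ITEM, is_valid, and so on) are hypothetical, whitespace is treated as part of an identifier as in the condensed pattern, and trailing commas are not accepted. Every list is written as "one element, then zero or more comma-plus-element", which is the unambiguous form recommended above.

```python
import re

# One identifier: anything except the structural characters.
IDENT = r"[^\][{}(),]+"
# Comma-separated identifiers, unrolled so there is only one way to match.
IDENT_LIST = rf"{IDENT}(?:,{IDENT})*"
# An identifier with an optional parenthesised list, e.g. actor4(syno3, syno4).
ITEM = rf"{IDENT}(?:\({IDENT_LIST}\))?"
ITEM_LIST = rf"{ITEM}(?:,{ITEM})*"
# [item, item, ...]
INNER = rf"\[{ITEM_LIST}\]"
# [[...]] or [[...][...]]; plain concatenation avoids escaping braces in f-strings.
BLOCK = r"\[(?:" + INNER + r"){1,2}\]"
# {Section} followed by any number of blocks.
SECTION = r"\{" + IDENT + r"\}(?:" + BLOCK + r")*"
SYNTAX = re.compile(r"(?:" + SECTION + r")+")


def is_valid(text: str) -> bool:
    """Return True if the whole text follows the custom syntax."""
    return SYNTAX.fullmatch(text) is not None


valid = ("{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]]"
         "[[actor3,actor4(syno3, syno4)][expr3,expr4]]")
print(is_valid(valid))       # valid text
print(is_valid(valid[:-2]))  # the truncated text from the question
```

Each piece can be tested on its own, and the failing input from the question is rejected without the engine having to explore many equivalent ways of splitting an identifier.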

RegEx to match sets of literal strings along with value ranges

Utter RegEx noob here with a project involving RegEx I need to modify. Has been a blast learning all of this.
I need to search for/verify a set of values that start with one of two string prefixes (NC or KH) followed by a numeric range that is unique to each prefix: NC01-NC13 or KH01-KH11.
I have been able to pull off the first common "chunk" of this with:
^(NC|KH)0[1-9]$
to verify NC01-NC09 or KH01-KH09. The next part is completely throwing me: I need to change the leading digit of the two-digit number from a 0 to a 1, and restrict the second digit to 0–3 for NC and 0–1 for KH.
I have found plenty of references on selecting between two strings (which is where I got the (NC|KH) from), but nothing as detailed as how to restrict the following values based on the text that was found.
Any and all help would be greatly appreciated, as well as any great references/books/tutorials to RegEx (currently using Regular-Expressions.info).
The best way to do this is to just separate the two cases altogether.
((NC(0\d|1[0-3]))|(KH(0\d|1[01])))
You might want to turn some of those internal capturing groups into non-capturing groups, but that makes the regex a little harder to read.
Edit: You might also be able to do this with positive lookbehind.
Edit: Here's a regex using lookbehind. It's a lot messier, and not really necessary here, but hopefully demonstrates the utility:
(KH|NC)(0\d|(?<=KH)(1[01])|(?<=NC)(1[0-3]))
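The "separate the two cases" approach can be checked with a short Python sketch. Two things here are assumptions rather than part of the answer: the pattern uses 0[1-9] (taken from the question) instead of 0\d so that 00 is rejected, and the names CODE and is_valid_code are made up for the example.

```python
import re

# One alternative per prefix, each carrying its own numeric range.
# Assumption: 0[1-9] rather than 0\d, so that "NC00" and "KH00" are rejected.
CODE = re.compile(r"NC(?:0[1-9]|1[0-3])|KH(?:0[1-9]|1[01])")


def is_valid_code(code: str) -> bool:
    """Return True for NC01-NC13 and KH01-KH11 only."""
    return CODE.fullmatch(code) is not None


# Try every two-digit suffix from 00 to 19 for both prefixes.
accepted = [
    f"{prefix}{n:02d}"
    for prefix in ("NC", "KH")
    for n in range(20)
    if is_valid_code(f"{prefix}{n:02d}")
]
print(accepted)
```

Enumerating a small range like this is a cheap way to confirm that each prefix stops at its own upper bound.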
Sticking with your original idea of options for NC or KH, do the same for the numbers, try this:
^(NC|KH)(0[1-9]|1[0-3])$
Hope that makes sense
EDIT:
Based on @Patrick's comment below, and sticking with this original answer, you could use this (although I bet there's a better way):
^((NC|KH)(0[1-9]|1[0-1])|NC1[2-3])$

Stuck on Specific Regex

I have a specific case where I somehow can't find something that suits my needs. I've always struggled when parentheses are heavily involved, and this case is a bit painful. I'm trying to extract as much as I can from a text field to fit it into a more controlled database, and there are a few tricky bits I keep fumbling.
There is ONE thing that holds for every row entered:
series of characters + ( + text + )
Basically, here's what it could look like:
1111111E (CARRIER), 2222222, 33333 (CARRIER2) 44444 (CARRIER 3)
My goal is to get:
1111111E (CARRIER)
2222222, 33333 (CARRIER2)
44444 (CARRIER 3)
And if I could also manage to split on commas and spaces for entries like the middle one, that would be just amazing.
I've been struggling through a few regex tester websites as I write this, starting from scratch over and over again.
If any regex gurus are around, a helping hand would be welcome!
If it has to be RegEx you could split at
(?<=\))[, ]*
Note that since you don't want to remove the ")", you must not match it, so the expression uses a lookbehind, which is not supported by all RegEx engines.
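For illustration, here is how that split could look in Python (an assumption, since the question does not name a language). Python 3.7 or later is assumed because the pattern can match an empty string, and the empty piece produced at the end of the text is filtered out.

```python
import re

text = "1111111E (CARRIER), 2222222, 33333 (CARRIER2) 44444 (CARRIER 3)"

# Split right after each ")" and swallow any commas or spaces that follow.
# The lookbehind (?<=\)) leaves the ")" in the preceding piece.
pieces = [p for p in re.split(r"(?<=\))[, ]*", text) if p]

for piece in pieces:
    print(piece)
```

This yields the three rows asked for in the question; the comma-separated middle row is left intact and would still need a second pass.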
[^\s|\,].*?\s\(.*?\)
With a match-all (global) search, this gives the expected result. I doubt it's the most optimal regex I could have written, but it seems to work fine.
I could try to handle the second case as well, but I think I'll take care of those cases in my code.
Leaving the answer up for anybody who might be looking for something similar.

regex best practice?

Today I got an email from my boss telling me to change the regex in our JavaScript code that goes onto our client's website from
[a-zA-Z0-9]+[a-zA-Z0-9_\.\-]
to
[a-zA-Z0-9]+[a-zA-Z0-9_\-\.]
because one of our clients was complaining that it didn't follow regex best practices and that it was causing problems with their CMS and their DB.
Looking at those two regexes, it appears to me they match the exact same thing.
The . and the - are swapped at the end, but that shouldn't make a difference. Should it?
Am I missing something?
The developer from our client's company is really adamant about us changing it.
Can someone shed some light?
Thanks!
There is no functional difference.
If anything is having issues with that regex, then it is a non-standard/buggy implementation. I recommend finding out exactly what the problem is.
While I see no reason to change it, I see no reason not to change it, so do what you wish.
Tip: I'm guessing the regex is written wrong. If I understand what it is supposed to mean, I would write it:
[a-zA-Z0-9]+[_\.\-]?
If you use an unescaped - in a character group, it goes first or last; otherwise it denotes a range of characters, like A-Z. If you're escaping it, like you are, then it can be anywhere.
It's possible the CMS or other code they use un-escapes the regex, so in this case it will throw errors if the - isn't the last character in the group. I would say that having as few escaped characters in a regular expression as possible makes it easier to read, but that's from a personal perspective.
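The claims in these answers can be checked with a quick sketch. It uses Python's re module as a stand-in (an assumption: the question is about JavaScript, whose engine treats these particular constructs the same way as far as I know), and the helper name accepted_chars is hypothetical.

```python
import re
import string

# The character class from each version of the regex, plus an unescaped form
# where the hyphen is placed last.
original = re.compile(r"[a-zA-Z0-9_\.\-]")
swapped = re.compile(r"[a-zA-Z0-9_\-\.]")
unescaped = re.compile(r"[a-zA-Z0-9_.-]")


def accepted_chars(pattern: re.Pattern) -> set:
    """Collect every printable ASCII character the class accepts."""
    return {ch for ch in string.printable if pattern.fullmatch(ch)}


same = accepted_chars(original) == accepted_chars(swapped) == accepted_chars(unescaped)
print(same)

# An unescaped hyphen between two characters forms a range:
# "," lies between "+" and "." in ASCII, so [+-.] accepts it.
range_matches_comma = re.fullmatch(r"[+-.]", ",") is not None
escaped_matches_comma = re.fullmatch(r"[+\-.]", ",") is not None
print(range_matches_comma, escaped_matches_comma)
```

If something downstream strips the backslashes, a hyphen left in the middle of the class would silently turn into a range like the one in the last lines, which is one plausible reading of the client's complaint.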