Matching within matches by extending an existing Regex - regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex, so that it matches within the original matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to parse the whole regex (which could be quite a long operation) and then change its internal content to produce the output, as in:
^ca[tr]
Nor do I want to run the original regex and then a second one over the results. I'm taking the original regex as an argument in Python, but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See solution including #Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.

You were not far off with what you tried, only you don't need a lookbehind, but rather a lookahead assertion, and a parenthesis was misplaced. The right thing is: Put the original pattern in parentheses, and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
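Since the question mentions taking the original regex as an argument in Python, the idea can be sketched as a small helper. This is only a minimal sketch: the function name prefilter and its parameters are made up for illustration, and the prefix is assumed to be a literal string rather than a regex.

```python
import re

def prefilter(original, prefix):
    """Compile `original` so that it only matches where the text starts with `prefix`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # The lookahead checks the prefix without consuming any characters.
    # The group keeps a top-level alternation such as cat|car|bat together,
    # so the lookahead applies to every branch.
    return re.compile("(?=" + re.escape(prefix) + ")(?:" + original + ")")

words = ["cat", "car", "bat"]
pattern = prefilter("cat|car|bat", "ca")
print([w for w in words if pattern.fullmatch(w)])  # ['cat', 'car']
```

A non-capturing group is used here instead of plain parentheses so that the numbering of any groups inside the original pattern is not shifted.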

Ok, thanks to #Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See #Nick's comment.
I've also raised an issue on the exrex GitHub for this.

Related

Regular expression/Regex with Java/Javascript: performance drop or infinite loop

I want to submit a very specific performance problem that I'd like to understand.
Goal
I'm trying to validate a custom syntax with a regex. I don't usually run into performance issues with regexes, so I like to use them.
Case
The regex:
^(\{[^\][{}(),]+\}\s*(\[\s*(\[([^\][{}(),]+\s*(\(\s*([^\][{}(),]+\,?\s*)+\))?\,?\s*)+\]\s*){1,2}\]\s*)*)+$
A valid syntax:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]
You can find the regex and a test text here:
https://regexr.com/3jama
I hope that is sufficient; I don't know how to explain what I want to match any better than with the regex itself ;-).
Issue
Applying the regex to valid text costs very little; it's almost instant.
But for certain invalid texts, the regexr app hangs. This isn't specific to regexr, since I also saw dramatic slowdowns with my own Java and JavaScript code.
I need to validate the text while the user is typing. I could even settle for validating on click, but I cannot afford to have the app hang when the text submitted by the user is structured like the case below, or like any other input that produces the same performance drop.
Reproducing the issue
Just remove the trailing "]" character from the test text.
So the invalid text that triggers the performance drop becomes:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4
Another invalid text, this one with no performance drop, could be:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]]
Request
I'd be glad if a regex guru passing by could explain what I'm doing wrong, or why my use case isn't suited to regex.
This answer is for the condensed regex from your comment:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+\,?)+\))?\,?)+\]){1,2}\])*)+$
The issues are similar for your original pattern.
You are facing catastrophic backtracking. Whenever the regex engine cannot complete a match, it backtracks into the string, trying to find other ways to match the pattern to certain substrings. If you have lots of ambiguous patterns, especially inside repetitions, testing all possible variations takes a very long time. See link for a better explanation.
One of the subpatterns that you use is the following (multilined for better visualisation):
([^\][{}(),]+
(\(
([^\][{}(),]+\,?)+
\))?
\,?)+
That is supposed to match a string like actor4(syno3, syno4). Condensing this pattern a little more, you get to ([^\][{}(),]+,?)+. If you remove the ,? from it, you get ([^\][{}(),]+)+, which is an open door to catastrophic backtracking, as a string can be matched in quite a lot of different ways with this pattern.
I get what you are trying to do with this pattern: match an identifier, and possibly further identifiers separated by commas. The proper way of doing this, however, is: ([^\][{}(),]+(?:,[^\][{}(),]+)*). Now there is no ambiguous way left to backtrack into this pattern.
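The difference between the ambiguous and the unrolled form can be observed directly. The following is a minimal sketch in Python, whose re module is also a backtracking engine; the names AMBIGUOUS, UNROLLED and time_match are made up for illustration, and the actual timings depend on the machine.

```python
import re
import time

IDENT = r"[^\][{}(),]+"
# Ambiguous: a run of identifier characters can be split between the inner
# and the outer repetition in exponentially many ways.
AMBIGUOUS = re.compile("^(?:" + IDENT + ",?)+$")
# Unrolled: every further identifier must be introduced by a comma,
# so there is only one way to split the input.
UNROLLED = re.compile("^" + IDENT + "(?:," + IDENT + ")*$")

def time_match(pattern, text):
    """Return (matched, seconds) for one match attempt (illustrative helper)."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start

# Both forms accept a valid list.
print(time_match(AMBIGUOUS, "syno1,syno2")[0], time_match(UNROLLED, "syno1,syno2")[0])

# An input that cannot match: the trailing "(" is not an identifier character.
# Keep n small for AMBIGUOUS; its running time roughly doubles per extra character.
for n in (10, 14, 18):
    text = "a" * n + "("
    print(n, time_match(AMBIGUOUS, text)[1], time_match(UNROLLED, text)[1])
```

Note that the two forms are not strictly equivalent: the ambiguous one also accepts a trailing comma, which the unrolled one rejects.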
Doing this for the whole pattern shown above (yes, there is another optional comma that has to be rolled out) and inserting it back to your complete pattern I get to:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?(?:\,[^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?)*)\]){1,2}\])*)+$
Which doesn't catastrophically backtrack anymore.
You might want to do yourself a favour and split this into subpatterns that you concatenate, either using strings in your actual source or using defines if you are using a PCRE pattern.
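Splitting the pattern into subpatterns could look roughly like the following Python sketch. The names IDENT, ITEM, BLOCK and so on are made up for illustration, non-capturing groups are used throughout, and, like the condensed regex, this version omits the \s* parts of the original pattern.

```python
import re

# Building blocks (names are illustrative, not from the original post).
IDENT = r"[^\][{}(),]+"                              # one identifier
IDENT_LIST = IDENT + "(?:," + IDENT + ")*"           # ident,ident,...
ITEM = IDENT + r"(?:\(" + IDENT_LIST + r"\))?"       # actor or actor(syno,...)
ITEM_LIST = ITEM + "(?:," + ITEM + ")*"              # item,item,...
BLOCK = r"\[(?:\[" + ITEM_LIST + r"\]){1,2}\]"       # [[...]] or [[...][...]]
SECTION = r"\{" + IDENT + r"\}(?:" + BLOCK + ")*"    # {name} followed by blocks
PATTERN = re.compile("^(?:" + SECTION + ")+$")

def is_valid(text):
    """Return True if the whole text follows the custom syntax."""
    return PATTERN.match(text) is not None

valid = "{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]"
truncated = "{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4"

print(is_valid(valid))      # True
print(is_valid(truncated))  # False, and it returns without a noticeable delay
```

Each building block starts with a character that the preceding block cannot consume, which is what removes the ambiguity.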
Note that some regex engines allow the use of atomic groups and possessive quantifiers that further help avoiding needless backtracking. As you have used different languages in your title, you will have to check yourself, which one is available for your language of choice.

REGEX number not in a list failing with a long list

I have a list of the following numbers and want a Regular expression that matches when a number is not in the list.
0,1,2,3,4,9,11,12,13,14,15,16,18,19,250
I have written the following REGEX statement.
^(?!.*(0|1|2|3|4|9|11|12|13|14|15|16|18|19|250)).*$
The problem is that it correctly gives a match for 5,6,7,8 etc but not for 17 or 251 for example.
I have been testing this on the online REGEX simulators.
This should resolve your issue:
^(?!\D*(0|1|2|3|4|9|11|12|13|14|15|16|18|19|250)\b).*$
Your earlier regex rejects every input that contains one of the digits 0, 1, 2, 3, 4 or 9 anywhere, because the .* lets the lookahead search the whole string.
So your original regex matches 5, 67 or 885, but not 17 or 251. It also renders the 11-19 and 250 entries in the list useless, since a single digit of theirs already triggers the rejection.
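The behaviour of both patterns can be checked with a short script. This is a sketch in Python (the engine used in the question is not stated), and it assumes each input is a single number on its own; the names ORIGINAL and FIXED are made up for illustration.

```python
import re

ALTERNATIVES = "0|1|2|3|4|9|11|12|13|14|15|16|18|19|250"
# Pattern from the question: the lookahead may find a listed digit anywhere.
ORIGINAL = re.compile(r"^(?!.*(" + ALTERNATIVES + r")).*$")
# Pattern from this answer: only a whole number (closed by \b) is rejected.
FIXED = re.compile(r"^(?!\D*(" + ALTERNATIVES + r")\b).*$")

for number in ("5", "11", "17", "54", "250", "251"):
    print(number, bool(ORIGINAL.match(number)), bool(FIXED.match(number)))
```

The \b matters because the alternation is tried left to right: for 17 the branch 1 matches first, and only the missing word boundary after it forces the engine to give up on the lookahead.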
Like others, though, I would recommend not using regex for this, as I believe it is overkill and a maintenance nightmare!
As an extra note: variable-length lookarounds are also very inefficient compared with regular checks.
I would do \b\d+\b to get each number in the string and check if they are in your list. It would be way faster.
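A minimal sketch of that approach in Python (the language is an assumption, since the question does not name one; the names EXCLUDED and numbers_not_in_list are made up for illustration):

```python
import re

# The list from the question, stored as a set for fast membership tests.
EXCLUDED = {0, 1, 2, 3, 4, 9, 11, 12, 13, 14, 15, 16, 18, 19, 250}

def numbers_not_in_list(text):
    """Return every whole number in `text` that is not in EXCLUDED."""
    found = (int(token) for token in re.findall(r"\b\d+\b", text))
    return [number for number in found if number not in EXCLUDED]

print(numbers_not_in_list("5, 11, 17, 250, 251"))  # [5, 17, 251]
```

Changing the list then only means editing the set, not rewriting a regex.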
You can use the discard technique by matching what you do not want and capturing what you really want.
You can use a regex like this:
\b(?:[0-49]|1[1-689]|250)\b|(\d+)
Here you can check a working demo where in blue you have the matches (what you don't want) and in green the content you want. Then you have to grab the content from the capturing group
Working demo
Not sure what regex engine you are using, but here I created a sample using java:
https://ideone.com/B7kLe0
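The same discard technique can be sketched in Python as follows; the names DISCARD and wanted_numbers are made up for illustration. The first alternative swallows the listed numbers without capturing them, so only the numbers you want end up in group 1.

```python
import re

# Left branch: numbers to discard (0-4, 9, 11-16, 18, 19, 250), not captured.
# Right branch: any other number, captured in group 1.
DISCARD = re.compile(r"\b(?:[0-49]|1[1-689]|250)\b|(\d+)")

def wanted_numbers(text):
    """Return the numbers captured by group 1, skipping the discarded matches."""
    return [m.group(1) for m in DISCARD.finditer(text) if m.group(1) is not None]

print(wanted_numbers("0 5 11 17 250 251"))  # ['5', '17', '251']
```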

RegEx to match sets of literal strings along with value ranges

Utter RegEx noob here with a project involving RegEx I need to modify. Has been a blast learning all of this.
I need to search for/verify a set of values that start with one of two string prefixes (NC or KH), followed by a numeric range that is unique to each prefix: NC01-NC13 or KH01-KH11.
I have been able to pull off the first common "chunk" of this with:
^(NC|KH)0[1-9]$
to verify NC01-NC09 or KH01-KH09. The next part is completely throwing me: I need to change the leading digit of the two-digit number from 0 to 1, and restrict the second digit to 0–3 for NC and 0–1 for KH.
I have found plenty of references for selecting between two strings (which is where I got the (NC|KH) from), but nothing that details how to restrict the following values based on the text that was found.
Any and all help would be greatly appreciated, as well as any great references/books/tutorials to RegEx (currently using Regular-Expressions.info).
The best way to do this is to just separate the two cases altogether.
((NC(0\d|1[0-3]))|(KH(0\d|1[01])))
You might want to turn some of those internal capturing groups into non-capturing groups, but that makes the regex a little harder to read.
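Separating the two cases could be sketched in Python like this. The names CODE and is_valid_code are made up for illustration; the sketch uses non-capturing groups, and it uses 0[1-9] from the question instead of 0\d, because 0\d would also accept NC00 and KH00.

```python
import re

# One branch per prefix, each with its own numeric range.
CODE = re.compile(r"NC(?:0[1-9]|1[0-3])|KH(?:0[1-9]|1[01])")

def is_valid_code(value):
    """Return True for NC01-NC13 and KH01-KH11 only."""
    # fullmatch anchors the pattern at both ends of the string.
    return CODE.fullmatch(value) is not None

for value in ("NC01", "NC13", "NC14", "KH11", "KH12"):
    print(value, is_valid_code(value))
```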
Edit: You might also be able to do this with positive lookbehind.
Edit: Here's a regex using lookbehind. It's a lot messier, and not really necessary here, but hopefully demonstrates the utility:
(KH|NC)(0\d|(?<=KH)(1[01])|(?<=NC)(1[0-3]))
Sticking with your original idea of offering options for NC or KH, do the same for the numbers. Try this:
^(NC|KH)(0[1-9]|1[0-3])$
Hope that makes sense
EDIT:
Based upon #Patrick's comment below, and sticking with this original answer, you could use this (although I bet there's a better way):
^((NC|KH)(0[1-9]|1[0-1])|(NC1[2-3]))$

Regex: Non fixed-width look around assertions?

My colleague asked me to provide him with a regex that only matches if the test string ends with
.rar or .part1.rar or .part01.rar or .part001.rar (and so on).
Should match:
foo.part1.rar
xyz.part01.rar
archive.rar
part3_is_the_best.rar
Should not match:
foo.r61
bar.part03.rar
test.sfv
I immediately came up with the regex \.(part0*1\.)?rar$. But this does match for bar.part03.rar.
Next I tried to add a negative lookbehind assertion: .*(?<!part\d*)\.(part\0*1\.)?rar$ That didn't work either, because lookbehind assertions need to be fixed-width.
Then I tried using a regex-conditional. But that didn't work either.
So my question: Can this even be solved by using pure regex?
An answer should either contain a link to regex101.com providing a working solution, or explain why it can't work by using pure regex.
You could use a negative lookahead to rule out the one case that your original regex wrongly accepts (a name ending in .rar whose .part number isn't 0*1):
^(?!.*\.part0*[^1]\.rar$).*\.(part0*1\.)?rar$
See it in action
This is an old question, but here's another approach:
(?:\.part0*1\.rar|^(?<!\.)\w+\.rar)$
The idea is to match either:
A string that ends with .part0*1.rar (ie foo.part01.rar, foo.part1.rar, bar.part001.rar), OR
A string that ends with .rar and doesn't contain any other dots (.) before that.
Works on all your test cases, plus your extra foo.part19.rar.
https://regex101.com/r/EyHhmo/2
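For a quick local check, the pattern from this answer can be run against the test cases from the question. This is a sketch in Python (an assumption, the question does not name an engine); the names FIRST_VOLUME and is_first_volume are made up for illustration.

```python
import re

FIRST_VOLUME = re.compile(r"(?:\.part0*1\.rar|^(?<!\.)\w+\.rar)$")

def is_first_volume(name):
    """Return True if `name` looks like a plain .rar or the first part of a set."""
    # search() is needed because the first branch is not anchored at the start.
    return FIRST_VOLUME.search(name) is not None

should_match = ["foo.part1.rar", "xyz.part01.rar", "archive.rar", "part3_is_the_best.rar"]
should_not_match = ["foo.r61", "bar.part03.rar", "test.sfv", "foo.part19.rar"]

for name in should_match + should_not_match:
    print(name, is_first_volume(name))
```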

Regular Expression to find CVE Matches

I am pretty new to regex, so I am hoping an expert can help me craft the right expression to find all the matches in a string. I have a string containing a lot of support information about vulnerability data. In that string is a series of CVE references in the format CVE-2015-4000. Can anyone provide a sample regex for finding all occurrences of that? Obviously, the numeric part changes throughout the string...
Generally you should always include your previous efforts in your question, what exactly you expect to match, etc. But since I am aware of the format and this is an easy one...
CVE-\d{4}-\d{4,7}
This matches first CVE- then a 4-digit number for the year identifier and then a 4 to 7 digit number to identify the vulnerability as per the new standard.
See this in action here.
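Finding all occurrences in a larger string could look like this in Python (an assumption, since the question does not name a language; the names CVE_PATTERN and find_cves and the sample text are made up for illustration):

```python
import re

CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,7}")

def find_cves(text):
    """Return every CVE identifier found in `text`, in order of appearance."""
    return CVE_PATTERN.findall(text)

sample = "Addresses CVE-2015-4000 and CVE-2014-0160; unrelated id ABC-2015-4000."
print(find_cves(sample))  # ['CVE-2015-4000', 'CVE-2014-0160']
```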
If you need an exact match without any syntax or logic violations, you can try this:
^(CVE-(1999|2\d{3})-(0\d{2}[1-9]|[1-9]\d{3,}))$
You can run this against the test data supplied by MITRE here to test your code or test it online here.
I will add my two cents to the accepted answer. In case we want to detect "CVE" case-insensitively, we can use the following regex:
r'(?i)\bcve\-\d{4}-\d{4,7}'