Regex that matches an even number of characters

Disclaimer (added after this was solved): this is a university assignment, so the answer may be simple. Hints are shown, but my own answer is not posted here. Alternative answers can be found here, but I take no responsibility for any plagiarism of the direct answers posted here.
Hi, I'm having trouble with the following exercise:
Find a regex that strictly represents the language:
b^(m+1), such that m>=0, m mod 2 = 1
The language breaks down to words:
{bb,bbbb,bbbbbb,bbbbbbbb,...}
I have tried the following:
b(bbb)?(bb)*
But this also accepts
{bb,bbb,bbbb,bbbbb,...}
Is there a way to write it so that one part of the expression depends on another? I.e., (bb)* cannot be chosen if (bbb)? is chosen in the same pass, then the decision repeats and the reverse choice is allowed.
Any help would be appreciated. Thanks.

Update: for an even number of characters, you can use
^(?:bb)+$
The original title of the question was "Regex that matches odd amount of character". For that, you can try this:
^b(?:(?:b{2})+)?$
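A quick way to sanity-check both patterns is to run them against strings of b of increasing length. This is only a sketch in Python; the names EVEN_B, ODD_B and classify are made up for illustration and are not part of the question.

```python
import re

# The two patterns from the answer above.
EVEN_B = re.compile(r"^(?:bb)+$")          # bb, bbbb, bbbbbb, ...
ODD_B = re.compile(r"^b(?:(?:b{2})+)?$")   # b, bbb, bbbbb, ...


def classify(length):
    """Return (matches_even, matches_odd) for a string of `length` b's."""
    s = "b" * length
    return bool(EVEN_B.match(s)), bool(ODD_B.match(s))


if __name__ == "__main__":
    for n in range(0, 9):
        even, odd = classify(n)
        print(f"{n} b's: even pattern={even}, odd pattern={odd}")
```

Note that the empty string matches neither pattern, which agrees with the language in the question, where the shortest word is bb.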

My guess is that this might be closer:
^(?:bb){1,}$
and your set might look like:
bb
bbbb
bbbbbb
I'm not sure, though. If your set was correct, the expression can likely be modified.
Also, b would probably not be in the set, since m=0 does not satisfy the second requirement (m mod 2 = 1).
If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.

Related

Why is a long string of ?char?char etc.. in a regex so slow

I'm trying to create a regex to conform to the following (iupac biology) rule:
< Any characters before a < are considered optional and will be matched after the subsequence text has been found
In words that means:
Given CAT<ATTT, I want to find any text that matches ATTT exactly, and then if there is a T directly before that match, I want to include it in the match, if there is an A directly before the T, I want to include it as well. If there is also a C directly before the A I want to include it as well.
Here's an example of what that rule would look like when applied:
More examples:
CAT<ATTT (example pattern)
CATA ATTT (left side unmatched, right side matched)
CG ATATTT (left side unmatched, right side matched)
GGGT<GAGGGGGG (example pattern)
T GGTGAGGGGGG (left side unmatched, right side matched)
GGG GAGGGGGG (left side unmatched, right side matched)
To satisfy this rule, I basically add a question mark after each character that comes before the "<". For example,
TTGATAGCCATCATCATATCGAAGTTTCACTACCCTTTTTCCATTTGCCATCTATTGAAGTAATAATAGGC<GCATG
becomes:
T?T?G?A?T?A?G?C?C?A?T?C?A?T?C?A?T?A?T?C?G?A?A?G?T?T?T?C?A?C?T?A?C?C?C?T?T?T?T?T?C?C?A?T?T?T?G?C?C?A?T?C?T?A?T?T?G?A?A?G?T?A?A?T?A?A?T?A?G?G?C?GCATG
However, I've found that adding a long run of optional (?) characters creates a very slow regex (at least when run in the Chrome browser). You can try it yourself by running the following code in your browser:
"GACGTCTTATGACAACTTGACGGCTACGCATGATCATTCACTT".match("C?A?T?A?T?CT?T?G?A?T?A?G?C?C?A?T?C?A?T?C?A?T?A?T?C?G?A?A?G?T?T?T?C?A?C?T?A?C?C?C?T?T?T?T?T?C?C?A?T?T?T?G?C?C?A?T?C?T?A?T?T?G?A?A?G?T?A?A?T?A?A?T?A?G?G?C?GCATG", "gi")
Or if that is fast for you, run this one:
"GACGTCTTATGACAACTTGACGGCTACGCATGATCATTCACTT".match("T?T?G?A?T?A?G?C?C?A?T?C?A?T?C?A?T?A?T?CT?T?G?A?T?A?G?C?C?A?T?C?A?T?C?A?T?A?T?CT?T?G?A?T?A?G?C?C?A?T?C?A?T?C?A?T?A?T?C?G?A?A?G?T?T?T?C?A?C?T?A?C?C?C?T?T?T?T?T?C?C?A?T?T?T?G?C?C?A?T?C?T?A?T?T?G?A?A?G?T?A?A?T?A?A?T?A?G?G?C?GCATG", "gi")
My question is twofold. First, why is this regex so slow? And second, how can I implement the rules pictured above (specifically the "<" and ">" ones) in a performant way using regex? (Or maybe it isn't possible?)
Please ask any clarifying questions if you have them.
Thanks so much!
A regex that looks like:
(CAT|AT|T)?ATTT
works well for me. It is orders of magnitude faster than the previous (broken) solution I was attempting to use.
Note: I found that using reg.exec included intermediate results that I didn't want. Thus the code I'm using looks like:
[..."CGCATATTT".matchAll(/(CCAT|CAT|AT|T)?ATTT/gi)]
Thanks to @WiktorStribiżew and my coworker for helping me find a good solution!
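For anyone who wants to generate that alternation automatically, here is a rough sketch in Python (not the JavaScript used above). The function name iupac_left_optional is hypothetical, and it only handles the "<" rule as described in the question. The idea is that one optional group listing the suffixes of the prefix replaces the long chain of independent X? items, so the engine has far fewer combinations to try.

```python
import re


def iupac_left_optional(pattern):
    """Turn 'PREFIX<CORE' into '(?:suffixes of PREFIX)?CORE'.

    Only the '<' rule from the question is handled; a pattern
    without '<' is returned as a literal.
    """
    prefix, sep, core = pattern.partition("<")
    if not sep:
        return re.escape(pattern)
    # Longest suffix first, so as much of the prefix as possible is included.
    suffixes = [re.escape(prefix[i:]) for i in range(len(prefix))]
    if not suffixes:
        return re.escape(core)
    return "(?:" + "|".join(suffixes) + ")?" + re.escape(core)


if __name__ == "__main__":
    regex = iupac_left_optional("CAT<ATTT")
    print(regex)                                    # (?:CAT|AT|T)?ATTT
    print(re.search(regex, "CGATATTT").group(0))    # ATATTT
    print(re.search(regex, "CATAATTT").group(0))    # ATTT
```

The two sample inputs are the "CG ATATTT" and "CATA ATTT" examples from the question with the space removed.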

Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to parse the whole regex, which could be quite a long operation, and then change its internal content to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in Python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See the solution, including @Nick's contribution, below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, creates matches 6 characters long.
You were not far off with what you tried. You don't need a lookbehind but rather a lookahead assertion, and a parenthesis was misplaced. The right approach is to put the original pattern in parentheses and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
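Since the question mentions taking the original regex as an argument in Python, the lookahead wrapping could be sketched like this. The helper name prefilter is made up for illustration. It uses a non-capturing group so the original pattern's own group numbers are not shifted, which is a small departure from the capturing parentheses shown above.

```python
import re


def prefilter(original, prefix):
    """Wrap `original` so it only matches where the text starts with `prefix`.

    The lookahead (?=...) checks for the prefix without consuming it,
    so the original pattern still matches the same characters.
    """
    return f"(?={re.escape(prefix)})(?:{original})"


if __name__ == "__main__":
    words = ["cat", "car", "bat"]
    pattern = re.compile(prefilter("cat|car|bat", "ca"))
    print([w for w in words if pattern.fullmatch(w)])   # ['cat', 'car']

    four = re.compile(prefilter("[a-z]{4}", "ca"))
    print(bool(four.fullmatch("cars")))     # True: 4 letters, starts with ca
    print(bool(four.fullmatch("carpet")))   # False: 6 letters
```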
OK, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than of the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.

RegEx to match sets of literal strings along with value ranges

Utter RegEx noob here with a project involving RegEx I need to modify. It has been a blast learning all of this.
I need to search for/verify a set of values that start with one of two string prefixes (NC or KH) followed by a numeric range that is unique to each prefix: NC01-NC13 or KH01-KH11.
I have been able to pull off the first common "chunk" of this with:
^(NC|KH)0[1-9]$
to verify NC01-NC09 or KH01-KH09. The next part is completely throwing me: changing the leading digit of the two-digit number from 0 to 1, and restricting the second digit to 0-3 for NC and 0-1 for KH.
I have found references abound for selecting between two strings (where I got the (NC|KH) from), but nothing as detailed as how to restrict following values based on the found text.
Any and all help would be greatly appreciated, as well as any great references/books/tutorials to RegEx (currently using Regular-Expressions.info).
The best way to do this is to just separate the two cases altogether.
((NC(0\d|1[0-3]))|(KH(0\d|1[01])))
You might want to turn some of those internal capturing groups into non-capturing groups, but that makes the regex a little harder to read.
Edit: You might also be able to do this with positive lookbehind.
Edit: Here's a regex using lookbehind. It's a lot messier, and not really necessary here, but hopefully demonstrates the utility:
(KH|NC)(0\d|(?<=KH)(1[01])|(?<=NC)(1[0-3]))
Sticking with your original idea of options for NC or KH, do the same for the numbers, try this:
^(NC|KH)(0[1-9]|1[0-3])$
Hope that makes sense
EDIT:
Based upon @Patrick's comment below, and sticking with this original answer, you could use this (although I bet there's a better way):
^((NC|KH)(0[1-9]|1[0-1])|NC1[2-3])$
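One way to gain confidence in any of these patterns is to test every two-digit code for both prefixes against the intended ranges. The sketch below is in Python and uses a "separate the two cases" variant with non-capturing groups, anchors, and 0[1-9] so that 00 is rejected. That exact pattern is my own combination of the ideas above, not one quoted from the answers, and CODE and is_valid are made-up names.

```python
import re

# Assumed variant: NC01-NC13 or KH01-KH11, nothing else.
CODE = re.compile(r"^(?:NC(?:0[1-9]|1[0-3])|KH(?:0[1-9]|1[01]))$")


def is_valid(code):
    """True if `code` is NC01..NC13 or KH01..KH11."""
    return CODE.match(code) is not None


if __name__ == "__main__":
    expected = {f"NC{n:02d}" for n in range(1, 14)} | {
        f"KH{n:02d}" for n in range(1, 12)
    }
    candidates = [f"{p}{n:02d}" for p in ("NC", "KH") for n in range(100)]
    accepted = {c for c in candidates if is_valid(c)}
    print(accepted == expected)   # True
    print(is_valid("KH12"))       # False
```

The same loop can be pointed at any of the other patterns in this thread to see which codes they accept beyond the intended ranges.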

Is this the way Regex works?

I don't know exactly how to ask this question, but as you can see from the picture, I have labelled each $replacement_number and drawn a line to where it ends. The top line is what I am looking for and the bottom line is what I am replacing it with, but I'm sure you all know this.
This is for an MP3 tag editor. It looks for exactly one letter that follows a number, where the number may follow [anything BUT a letter] or follow JUST ONE letter, and it capitalizes the letter after the number. So if I have 22b it will become 22B, and if I have y2k it will become y2K. But if I have yy2k it will stay yy2k, and if I have 2bb it will stay 2bb, and so on.
My question is, are the numbers in the image exactly how regex understands them or am I wrong somewhere?
Also, is my code efficient or not?
Yes, that is how the capture groups will be ordered, with all major (and probably minor) flavors of Regex.
If you're interested to see how exactly regex understands your pattern you can use this nifty visualization tool:
You are correct: the numbers, as defined by the parentheses (as you pictured), are exactly how RegEx will label them.
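The numbering rule can be checked directly: groups are numbered by the position of their opening parenthesis, left to right, including nested ones. The pattern below is a made-up toy example in Python, not the pattern from the picture in the question.

```python
import re

# Toy pattern with nested groups; the numbers follow the opening parentheses:
#   group 1: ((a)(b(c)))   group 2: (a)   group 3: (b(c))   group 4: (c)
m = re.match(r"((a)(b(c)))", "abc")

groups = {i: m.group(i) for i in range(1, m.re.groups + 1)}

if __name__ == "__main__":
    for number, text in groups.items():
        print(number, text)
    # 1 abc
    # 2 a
    # 3 bc
    # 4 c
```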

Regular Expression to find CVE Matches

I am pretty new to regex, so I am hoping an expert can help me craft the right expression to find all the matches in a string. I have a string containing a lot of support information for vulnerability data. In that string are a series of CVE references in the format CVE-2015-4000. Can anyone provide a sample regex for finding all occurrences of that? Obviously, the numeric part changes throughout the string.
Generally you should always include your previous efforts in your question, what exactly you expect to match, etc. But since I am aware of the format and this is an easy one...
CVE-\d{4}-\d{4,7}
This matches first CVE- then a 4-digit number for the year identifier and then a 4 to 7 digit number to identify the vulnerability as per the new standard.
See this in action here.
If you need an exact match without any syntax or logic violations, you can try this:
^(CVE-(1999|2\d{3})-(0\d{2}[1-9]|[1-9]\d{3,}))$
You can run this against the test data supplied by MITRE here to test your code or test it online here.
I will add my two cents to the accepted answer. In case we want to detect "CVE" case-insensitively, we can use the following regex:
r'(?i)\bcve\-\d{4}-\d{4,7}'
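To pull every occurrence out of a longer string in Python, that pattern can be passed to re.findall. This is just a sketch: the helper name find_cves and the sample text are made up, and the results are upper-cased only so that mixed-case hits come back in one form.

```python
import re

# Case-insensitive pattern from the answer above (the \- escape is optional).
CVE_RE = re.compile(r"(?i)\bcve-\d{4}-\d{4,7}")


def find_cves(text):
    """Return all CVE identifiers found in `text`, normalised to upper case."""
    return [match.upper() for match in CVE_RE.findall(text)]


if __name__ == "__main__":
    sample = "Fixed CVE-2015-4000 and cve-2014-0160; see also CVE-2021-44228."
    print(find_cves(sample))
    # ['CVE-2015-4000', 'CVE-2014-0160', 'CVE-2021-44228']
```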