RegEx to match sets of literal strings along with value ranges - regex

Utter RegEx noob here with a project involving RegEx I need to modify. Has been a blast learning all of this.
I need to search for/verify a set of values that start with one of two string prefixes (NC or KH), followed by a numeric range that is unique to each prefix: NC01-NC13 or KH01-KH11.
I have been able to pull off the first common "chunk" of this with:
^(NC|KH)0[1-9]$
to verify NC01-NC09 or KH01-KH09. The next part is completely throwing me: I need to change the leading character of the two-digit number to a 1 instead of a 0, and restrict the second digit to 0-3 for NC and 0-1 for KH.
I have found references abound for selecting between two strings (where I got the (NC|KH) from), but nothing as detailed as how to restrict following values based on the found text.
Any and all help would be greatly appreciated, as well as any great references/books/tutorials to RegEx (currently using Regular-Expressions.info).

The best way to do this is to just separate the two cases altogether.
((NC(0[1-9]|1[0-3]))|(KH(0[1-9]|1[01])))
You might want to turn some of those internal capturing groups into non-capturing groups, but that makes the regex a little harder to read.
Edit: You might also be able to do this with positive lookbehind.
Edit: Here's a regex using lookbehind. It's a lot messier, and not really necessary here, but hopefully demonstrates the utility:
(KH|NC)(0[1-9]|(?<=KH)(1[01])|(?<=NC)(1[0-3]))
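A minimal Python sketch for checking both ideas against the ranges in the question (NC01-NC13, KH01-KH11). It assumes `0[1-9]` for the first nine values so that `00` is rejected, and uses non-capturing groups; the names `SEPARATE`, `LOOKBEHIND` and `accepted` are made up for this illustration.

```python
import re

# One branch per prefix, each with its own numeric range.
SEPARATE = re.compile(r"(?:NC(?:0[1-9]|1[0-3])|KH(?:0[1-9]|1[01]))")
# Lookbehind variant: the allowed second digit depends on the prefix just consumed.
LOOKBEHIND = re.compile(r"(?:KH|NC)(?:0[1-9]|(?<=KH)1[01]|(?<=NC)1[0-3])")

# The values the question wants to accept.
valid = {f"NC{n:02d}" for n in range(1, 14)} | {f"KH{n:02d}" for n in range(1, 12)}
# A wider candidate set, including 00 and out-of-range numbers.
candidates = [f"{p}{n:02d}" for p in ("NC", "KH") for n in range(0, 20)]

def accepted(pattern):
    # fullmatch requires the whole string to match, like ^...$ would.
    return {c for c in candidates if pattern.fullmatch(c)}

print(accepted(SEPARATE) == valid)    # True
print(accepted(LOOKBEHIND) == valid)  # True
```

`fullmatch` is used here instead of adding `^` and `$` to the patterns; either approach should behave the same for single-line input.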

Sticking with your original idea of alternatives for NC or KH, do the same for the numbers. Try this:
^(NC|KH)(0[1-9]|1[0-3])$
Hope that makes sense
EDIT:
Based upon @Patrick's comment below, and sticking with this original answer, you could use this (although I bet there's a better way):
^((NC|KH)(0[1-9]|1[0-1])|NC1[2-3])$
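One thing to watch with a top-level `|` is that `^` and `$` each bind to a single branch unless the alternatives are wrapped in a group. The sketch below (Python, with illustrative names `UNGROUPED` and `GROUPED`) shows the difference on a string with trailing junk.

```python
import re

# Ungrouped: "^" belongs only to the first branch, "$" only to the second.
UNGROUPED = re.compile(r"^(NC|KH)(0[1-9]|1[0-1])|(NC1[2-3])$")
# Grouped: both anchors apply to the whole alternation.
GROUPED = re.compile(r"^(?:(?:NC|KH)(?:0[1-9]|1[01])|NC1[23])$")

print(bool(UNGROUPED.search("KH01junk")))  # True: nothing forces the match to end
print(bool(GROUPED.search("KH01junk")))    # False
print(bool(GROUPED.search("NC13")))        # True
print(bool(GROUPED.search("KH12")))        # False
```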


Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to parse the whole regex, which could be quite a long operation, and then rewrite its internals to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in Python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4} but if I create (?<=^ca([a-z]{4})) it matches against 6 letter strings starting with ca, not 4 letter.
Thanks for any solutions and/or opinions on it.
EDIT: See solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried, but you need a lookahead assertion rather than a lookbehind, and a parenthesis was misplaced. The fix is to put the original pattern in parentheses and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
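Since the original regex arrives as an argument in Python, the lookahead approach can be sketched as a small wrapper. `prefilter` and `matches` are hypothetical helper names, and the wrapper assumes the prefix is a literal string (hence `re.escape`) and that wrapping the original pattern in a non-capturing group is acceptable.

```python
import re

def prefilter(original, prefix):
    """Hypothetical helper: only keep matches of `original` that start with `prefix`."""
    # The lookahead checks the prefix without consuming it, so the original
    # pattern still matches from the same position and keeps its own length.
    return re.compile(f"(?={re.escape(prefix)})(?:{original})")

def matches(pattern, text):
    return [m.group(0) for m in pattern.finditer(text)]

print(matches(prefilter("cat|car|bat", "ca"), "cat car bat"))   # ['cat', 'car']
print(matches(prefilter("[a-z]{4}", "ca"), "cats bats card"))   # ['cats', 'card']
```

The second call returns 4-letter matches, not 6-letter ones, because the lookahead adds no characters to the match.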
Ok, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.

Match a string with a fixed substring in variable positions

I want to create a filter in my email server that matches any message that contains any URL (using either http or https protocols) from a certain domain (let's say domain.org). I want it to match things like:
https://site1.domain.org
https://anothersite.domain.org
http://yetanotherone.domain.org
The problem here is that these strings can be wrapped in the message body at any random position of the string. And even worse, when the string is wrapped an equal sign is added before the end of the line, so I would need it to be able to match strings like these:
ht=
tps://thisisanexample.domain.org
https://thisisane=
xample.domain.org
https://thisisanexample.do=
main.org
I came up with a simple (but huge) solution, but I think there must be a much more elegant one than mine:
/h[=[:cntrl:]]*t[=[:cntrl:]]*t[=[:cntrl:]]*p[=[:cntrl:]]*s?[=[:cntrl:]]*:[=[:cntrl:]]*\/[=[:cntrl:]]*\/[=[:cntrl:]]*[-+_#&%$#|()=?¿:;,.,çÇ^[:cntrl:][:alnum:]\[\]\{\}\*\\]*[=[:cntrl:]]*.[=[:cntrl:]]*d[=[:cntrl:]]*o[=[:cntrl:]]*m[=[:cntrl:]]*a[=[:cntrl:]]*[=[:cntrl:]]*i[=[:cntrl:]]*n[=[:cntrl:]]*.[=[:cntrl:]]*o[=[:cntrl:]]*r[=[:cntrl:]]*g/
I have been looking around, but I cannot find anything I understand well enough to improve my solution, given that my knowledge of regex does not go beyond simple queries.
Thank you very much in advance.
Regards.
2018/04/11 EDIT: Thank you to everyone who tried but the solutions proposed do not meet the requirements of elegance and readability I was expecting. I was looking for something like capturing everything but the equal-return string and performing the web address string search on the captured result of the first search. Is this a doable idea?
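The two-step idea in the edit can be sketched in Python as follows. This only illustrates the approach and is not something a mail server filter can necessarily run: it assumes the wrapping is always an `=` immediately followed by a line break, and `find_domain_urls` is a made-up name.

```python
import re

def find_domain_urls(body, domain="domain.org"):
    # Step 1: drop soft line breaks, i.e. "=" immediately followed by a newline.
    unwrapped = re.sub(r"=\r?\n", "", body)
    # Step 2: search the unwrapped text for http/https URLs on the domain
    # or any of its subdomains.
    url = re.compile(r"https?://(?:[\w-]+\.)*" + re.escape(domain))
    return url.findall(unwrapped)

wrapped = (
    "ht=\ntps://thisisanexample.domain.org\n"
    "https://thisisane=\nxample.domain.org\n"
    "https://thisisanexample.do=\nmain.org\n"
)
print(find_domain_urls(wrapped))
```

All three wrapped variants from the question reduce to the same URL after step 1, so the search pattern itself stays short and readable.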

Too Many Characters Included in Attempt to Parse a CSV File

Background
I am attempting to parse a CSV file using PCRE regular expressions. That is, I want to extract the individual "cells" of the CSV and put them in a neatly organized array containing every part the parsing managed to identify.
The following regular expression is what I have come up with so far:
/(?:;|^)(?:(?:"(?:(?!"(;|$)).)*)|(?:([^;]*)))/g
I would highly recommend that you put this in a tester for regular expressions. Here is a slight bit of test data, that should match to a great extent.
"There; \"be";"but; someone spoke";hence the young man;hence the son;"test;"
The Problem
The regular expression manages to extract the correct number of parts. It is meant to retrieve the text inside each "cell" of the CSV (use the CSV provided above for reference), but it does so only to some extent.
Here is the result of the groups in the regular expression above:
"There; \"be
;"but; someone spoke
hence the young man
hence the son
;"test;
As we can clearly see, the cells that are quoted with double quotation marks include the opening " in the match, and sometimes even the leading semicolon. From my understanding, the group for the negative lookahead should not include those.
I have probably missed something very essential here. Perhaps someone can point me in the right direction towards a fix.
Edit and Potential Solution
It appears as though I might have managed to solve it. Contrary to what I said above, the negative lookahead does not actually create a capture group for the text it scans, as I initially thought. Adding another capturing group around the quoted content therefore extracts the segments I am after.
/(?:;|^)(?:(?:"((?:(?!"(;|$)).)*))|(?:([^;]*)))/g
I will, however, leave the question open for now, and will answer it myself if no other answer comes in. So as not to make it opinion-based, I would further ask whether there is a more efficient way, in terms of speed, than the one I am using above.
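For anyone who wants to check the revised expression outside a regex tester, here is a small sketch using Python's `re` module, which accepts the same syntax for this pattern (the `/…/g` delimiters are replaced by `finditer`). `CELL` and `cells` are illustrative names. Group 1 holds the content of a quoted cell and group 3 an unquoted cell.

```python
import re

# The revised pattern from above, without the /.../g delimiters.
CELL = re.compile(r'(?:;|^)(?:(?:"((?:(?!"(;|$)).)*))|(?:([^;]*)))')

def cells(line):
    out = []
    for m in CELL.finditer(line):
        # Group 1 is set for quoted cells, group 3 for unquoted ones.
        out.append(m.group(1) if m.group(1) is not None else m.group(3))
    return out

line = r'"There; \"be";"but; someone spoke";hence the young man;hence the son;"test;"'
for cell in cells(line):
    print(cell)
```

Note that the backslash before the escaped quote is kept in the first cell; unescaping would be a separate step.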

Regex character required between 1st and 8th character

I am currently using this regex to limit the characters that can be used "([A-Za-z0-9_-]+)". I now have an additional requirement to require a hyphen between the 1st and 8th character. I am not sure where to begin for this and my search results have not been fruitful. Could anyone point me in a direction or give me pointers of where to get started with this request? I can usually cobble together some regex on my own through examples here and elsewhere on the web, but I can't find anything similar to these requirements.
Here are some good examples of what I mean:
this-isvalid
so-isthis
Thank you in advance!
Yeah, typically when you know the requirements use an online regex checker.
http://www.regexplanet.com/advanced/java/index.html
There's a number of them, you can google them.
You can specify between 1 and 7 of those characters followed by a dash, so something like:
(^[A-Za-z0-9_]{1,7}-[A-Za-z0-9_]+)
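A quick way to try this pattern, sketched in Python. `HYPHEN_EARLY` and `is_valid` are made-up names, and `fullmatch` is an assumption added here so that trailing characters outside the allowed set are rejected (the pattern itself has no `$`).

```python
import re

# 1 to 7 allowed characters, then a hyphen, then at least one more allowed character.
HYPHEN_EARLY = re.compile(r"^[A-Za-z0-9_]{1,7}-[A-Za-z0-9_]+")

def is_valid(value):
    return HYPHEN_EARLY.fullmatch(value) is not None

for value in ["this-isvalid", "so-isthis", "nohyphenhere", "toolongprefix-x", "-leading"]:
    print(value, is_valid(value))
```

As written, the part after the hyphen does not allow further hyphens, so a value such as `a-b-c` is rejected; add `-` to the second character class if later hyphens should be permitted.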

Capture string until first caret sign hit in regex?

I am working with legacy systems at the moment, and a lot of work involves breaking up delimited strings and testing against certain rules.
With this string, how could I return "Active" in a back reference and search terms, stopping when it hits the first caret (^)?:
Active^20080505^900^LT^100
Can it be done with an inclusion in the regex of this "(.+)" ? The reason I ask is that the actual regex "(.+)" is defined in a database as cutting up these messages and their associated rules can be set from a front-end system. The content could be anything ('Active' in this case), that's why ".+" has been used in this case.
Rule: The caret sign cannot feature between the brackets, as that would result in it being stored in the database field too, and it is defined elsewhere in another system field.
If you have a better suggestion than "(.+)" I will be happy to hear it.
Thanks in advance.
(.+?)\^
Should grab up to the first ^
If you have to include (.+) w/o modifications you could use this:
(.+?)\^(.+)
The first backreference will still be the correct one and you can ignore the second.
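As a rough check of both patterns, here is a Python sketch using the sample string from the question (the variable names are only for illustration):

```python
import re

message = "Active^20080505^900^LT^100"

# Lazy quantifier: stop at the first caret.
first = re.match(r"(.+?)\^", message)
print(first.group(1))  # Active

# Same idea, with an unmodified (.+) kept after the caret.
both = re.match(r"(.+?)\^(.+)", message)
print(both.group(1))   # Active
print(both.group(2))   # 20080505^900^LT^100
```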
A regex is really overkill here.
Just take the first n characters of the string where n is the position of the first caret.
Pseudo code:
InputString.Left(InputString.IndexOf("^"))
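In Python, the same no-regex idea could look like the sketch below. `before_caret` is a hypothetical name. `str.partition` is used because it returns the whole string when there is no caret, whereas an index-based lookup has to handle the not-found case separately.

```python
def before_caret(s):
    # partition splits at the first "^" only; head is everything before it.
    head, _sep, _rest = s.partition("^")
    return head

print(before_caret("Active^20080505^900^LT^100"))  # Active
print(before_caret("NoCaretHere"))                 # NoCaretHere
```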
^([^\^]+)
That should work if your RE library doesn't support non-greediness.
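The negated character class behaves slightly differently from the lazy version when the input has no caret at all, which may matter for the database-driven rules. A small Python comparison, with illustrative names:

```python
import re

NEGATED = re.compile(r"^([^\^]+)")  # one or more non-caret characters from the start
LAZY = re.compile(r"(.+?)\^")       # requires a caret to be present

print(NEGATED.match("Active^20080505^900^LT^100").group(1))  # Active
print(NEGATED.match("NoCaret").group(1))                     # NoCaret
print(LAZY.match("NoCaret"))                                 # None
```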