Regex if then|else

I am having a difficult time understanding the syntax for if then|else.
I'm looking at a large multiline chunk of text and attempting to retrieve a file number that could match any of several expressions. I want to start with the most restrictive match and, if it's not found, move on to the next possible match. I don't want to use a simple alternation (|) because I want to make sure that one match does not exist before evaluating the other possibility.
For example, I could have a file number in the format ABC-12345 or in the format ABC-1234567. (There are several other possibilities, including entirely different masks, so just matching the number of digits is impractical. I think that if I can understand how to structure the if statement properly for this simple example, I can work out the rest.)
I thought I might be able to accomplish this by using something like
?(?=[a-zA-Z]{3}-[A-Za-z0-9]{5}$)[a-zA-Z]{3}-[A-Za-z0-9]{5}$|[a-zA-Z]{3}-[A-Za-z0-9]{7}$)
What I am trying to express is: if I get a match on ABC-12345, return that; otherwise return anything that matches ABC-1234567.
The candidates are on different lines at the moment, hence the $. However, I suspect I may need to change this to \b if I am searching within a single line.
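One way to get "most restrictive first, then fall back" without a conditional is to try the patterns one at a time in priority order. The following is a minimal Python sketch of that idea, not a fix for the conditional syntax itself. (In Python's re module the (?(...)yes|no) conditional can only test whether a capture group matched, not a lookahead.) The function name find_file_number and the sample text are made up for illustration; the two patterns are the ones from the question, with a leading \b added as an assumption.

```python
import re

# Patterns ordered from most to least restrictive (taken from the question).
# The leading \b is an added assumption so a match cannot start mid-word.
PATTERNS = [
    re.compile(r"\b[a-zA-Z]{3}-[A-Za-z0-9]{5}$", re.MULTILINE),
    re.compile(r"\b[a-zA-Z]{3}-[A-Za-z0-9]{7}$", re.MULTILINE),
]


def find_file_number(text):
    """Return the first match of the highest-priority pattern, or None."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            # A lower-priority pattern is only tried when every
            # higher-priority pattern failed on the whole text.
            return match.group(0)
    return None


sample = "Reference\nABC-1234567\nXYZ-12345\n"
print(find_file_number(sample))
```

Because each pattern is searched over the whole text before the next one is tried, the 5-character form wins even when the 7-character form appears earlier in the text.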

Related

Use regex in UFT PDF comparison

In one of my UFT test cases, I need to verify an amount in a PDF file.
Sometimes the amount is "3000" and sometimes it is "3.000". And sometimes even "3 000"!
I would like to accept those 3 possibilities, knowing that this amount is stored in a datatable.
I tried something like "3.?000" (with regex check in the file checkpoint) but it's not matching any of the 3 solutions.
How would you do this?
One of UFT's idiosyncrasies when dealing with regexes is that it adds implicit anchors at the beginning and end of each line.
Try adding .* before and after the text you want to match, for example: .*\s3[., ]?000\s.*
Also verify that you've activated the regular expression flag for your line. I find the UI for File Content Checkpoints to be a bit unintuitive so you may have missed that.
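The effect of implicit anchors can be illustrated outside UFT. The sketch below uses Python's re.fullmatch as a stand-in for the anchoring behaviour described above; that UFT behaves exactly like fullmatch is an assumption, and the sample lines are invented.

```python
import re

# Invented sample lines: the amount is surrounded by other text.
lines = ["Total: 3000 EUR", "Total: 3.000 EUR", "Total: 3 000 EUR"]

bare = re.compile(r"3[., ]?000")              # amount only
padded = re.compile(r".*\s3[., ]?000\s.*")    # pattern from the answer

# fullmatch requires the whole line to match, which mimics implicit
# anchors at both ends of the line.
bare_results = [bare.fullmatch(line) is not None for line in lines]
padded_results = [padded.fullmatch(line) is not None for line in lines]

print(bare_results)
print(padded_results)
```

Under full-line anchoring the bare pattern fails on every line because of the surrounding text, while the padded pattern accepts all three spellings of the amount.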

Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to interpret the whole regex, which could be quite a long operation, and then change its internal content to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in Python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried, only you don't need a lookbehind, but rather a lookahead assertion, and a parenthesis was misplaced. The right thing is: Put the original pattern in parentheses, and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
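Since the asker receives the original regex as an argument in Python, the lookahead approach can be sketched as a small helper. The function name prefilter and the sample strings are hypothetical. The original pattern is wrapped in a non-capturing group here (rather than the capturing parentheses shown above) so that a top-level | stays scoped and no extra capture group is introduced.

```python
import re


def prefilter(original, prefix):
    """Compile `original` so that only matches starting with `prefix` survive."""
    # (?=prefix) is a zero-width lookahead: it checks the upcoming text
    # without consuming it, so the original pattern still matches from
    # the same position and match lengths are unchanged.
    return re.compile("(?=" + re.escape(prefix) + ")(?:" + original + ")")


def all_matches(pattern, text):
    # group(0) is the whole match, regardless of groups inside `original`.
    return [m.group(0) for m in pattern.finditer(text)]


print(all_matches(prefilter("cat|car|bat", "ca"), "cat car bat"))
print(all_matches(prefilter("[a-z]{4}", "ca"), "cart bats cave"))
```

The matches from the [a-z]{4} example remain 4 characters long, because the lookahead contributes nothing to the match length.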
OK, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to generate matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than of the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.

Regex exclude value from repetition

I'm working with load files and trying to write a regex that will check for any rows in the file that do not have the correct number of delimiters. Let's pretend the delimiter is % (I'm not sure this text field supports the delimiters that are in the load file). The regex I wrote that finds all rows that are correct is:
^%([^%]*%){20}$
Sometimes it's more beneficial to find any rows that do not have the correct number of delimiters, so to accomplish that, I wrote this:
(^%([^%]*%){0,19}$)|(^%([^%]*%){21,}$)
I'm concerned about the efficiency of this (and any regex I write in general), so I'm wondering if there's a better way to write it, or if the way I wrote it is fine. I thought maybe there would be some way to use alternation with the repetition tokens, such as:
{0,19}|{21,}
but that doesn't seem to work.
If it's helpful to know, I'm just searching through the files in Sublime Text, which I believe uses PCRE. I'm also open to suggestions for making the first regex better in general, although I've found it to work pretty well even in exceptionally large load files.
If your regex engine supports negative lookaheads, you can slightly modify your original regex.
^(?!%([^%]*%){20}$)
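The regex above is only useful for testing whether a row is malformed, since it matches just the empty position at the start of the row. If you want to capture the row itself, you need to add the .* part: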
^(?!%([^%]*%){20}$).*$
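As a rough check of the negative-lookahead version outside Sublime Text, here is a Python sketch that applies it line by line. The helper name malformed_rows, its fields parameter, and the sample rows are assumptions added for illustration; the pattern is the one above with the repetition count made configurable and the inner group made non-capturing.

```python
import re


def malformed_rows(text, fields=20):
    """Return rows that do NOT consist of a leading % followed by
    exactly `fields` %-terminated fields."""
    # Same shape as ^(?!%([^%]*%){20}$).*$ with a configurable count.
    pattern = re.compile(r"^(?!%(?:[^%]*%){" + str(fields) + r"}$).*$")
    # Matching one line at a time keeps [^%]* from running across newlines.
    return [line for line in text.splitlines() if pattern.match(line)]


# Small demo with 2 fields per row instead of 20.
rows = "%a%b%\n%a%\n%a%b%c%"
print(malformed_rows(rows, fields=2))
```

A single pattern replaces the {0,19} and {21,} alternation: any row that is not exactly the well-formed shape is reported.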

Having trouble creating a regex for a list of zip codes

I need to test whether a list of zip codes in a textarea has only 5-digit zip codes. Under normal circumstances the list would look like this:
56228, 56243, 55324, 55325, 55329, 55355, 55389
I need to find out if there is anything but the above pattern in the textarea. There can be any number of individual zip codes, but I need to make sure there isn't anything else. (I think I'm going to need to be able to highlight illegal matches in the textarea also, but I'll cross that bridge when I get to it).
I started with this regex:
^\d{5},?\s?$+
I'm very new to building regular expressions, but as I understand it, the above should match any set of 5 digits, and commas and whitespace after the five digits may or may not be there.
Online regex testers (I've tried several) aren't finding any matches, whether I have a legitimate list of zip codes or a list with "illegal" characters.
What am I missing here?
This one should suit your needs:
^([, ]*\d{5})+[, ]*$
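A quick way to try the suggested pattern is the Python sketch below. The function names is_valid_zip_list and invalid_entries are hypothetical. The second helper is an extra not taken from the answer: it splits on commas and whitespace to list the offending entries, which may be a starting point for highlighting them.

```python
import re

# Pattern from the answer: one or more 5-digit groups, each optionally
# preceded by commas/spaces, with optional trailing commas/spaces.
ZIP_LIST = re.compile(r"^([, ]*\d{5})+[, ]*$")


def is_valid_zip_list(text):
    return ZIP_LIST.match(text) is not None


def invalid_entries(text):
    """List the entries that are not exactly five digits (added helper)."""
    tokens = [t for t in re.split(r"[,\s]+", text) if t]
    return [t for t in tokens if not re.fullmatch(r"\d{5}", t)]


print(is_valid_zip_list("56228, 56243, 55324"))
print(is_valid_zip_list("56228, 5624a, 55324"))
print(invalid_entries("56228, 5624a, 55324"))
```

One caveat: because the separators are optional, the pattern also accepts two codes run together, such as 5622856243. If a separator must be present between codes, the pattern would need to be tightened.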

Regex assistance: include/exclude

Hello, I am trying to figure out this regular expression. I have a URL that can have different query string parameters in different positions.
test.aspx?foo=bar&abc=123
test.aspx?abc=123&foo=bar
test.aspx?foo=bar&abc=123#T1
test.aspx?abc=123&foo=bar#T2
I am trying to find only the ones without the #T followed by a number.
Here is what I have so far.
test.aspx\?(?!\#T[0-9])
However, it still selects all of them. Is there a way to have a string constant and scan for it down the line?
Juniorflip
If #Tnum is always at the end, you just need to do a bit of anchoring. For example, like this:
test.aspx\?.*(?!\#T[0-9])...$
But that's very fragile as it depends on the bad URLs always ending in a very particular form and good URLs always having enough characters to soak up that end matching. A negative lookbehind assertion is somewhat better, but still fragile and less commonly supported:
test.aspx\?.*(?<!\#T[0-9])$
It's better to write a regular expression that matches what you don't want and to just invert the logic of what to do when you get a match (i.e., "if it matches throw it away", instead of "if it matches use it"). But really it's much better to delegate the parsing of the URLs to a specialized library and then just do a simpler check against the fragment identifier as a logical component instead of as a horrible RE hack.
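The library approach recommended above might look like this in Python, using the standard urllib.parse module. The helper name has_t_fragment is hypothetical, and treating the unwanted fragment as "T followed by digits" is an assumption based on the #T1 and #T2 examples in the question.

```python
import re
from urllib.parse import urlsplit


def has_t_fragment(url):
    """True if the URL's fragment looks like T<number> (e.g. #T1)."""
    # urlsplit separates path, query and fragment, so the check no longer
    # depends on where the query parameters appear.
    fragment = urlsplit(url).fragment
    return re.fullmatch(r"T[0-9]+", fragment) is not None


urls = [
    "test.aspx?foo=bar&abc=123",
    "test.aspx?abc=123&foo=bar",
    "test.aspx?foo=bar&abc=123#T1",
    "test.aspx?abc=123&foo=bar#T2",
]

# Invert the logic: match what is unwanted and discard it.
wanted = [u for u in urls if not has_t_fragment(u)]
print(wanted)
```

The only regex left is a simple check on the fragment itself; the fragile end-of-string anchoring is no longer needed.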