Regular Expression for irregularly occurring repeating string - regex

I searched but have not found an answer to the question - maybe it is so obvious that no one else had to ask...
I am using UltraEdit 16.00 to run my Regular Expressions in PERL mode...
Situation:
I have a delimited string that can contain a variable number of repeating segments that must adhere to a very specific format. These segments occur randomly throughout the delimited string.
Example:
CLP*data*data*data~REF*data*data~N1*data*data*data~**CAS*OA*29*99.99**~AMT*I*99.99~SVC*data*data*data*data~**CAS*PR*99.99**~**CAS*CO**99.99**~DTM*150*date~AMT*B6*99.99~SVC*data*data*data*data~CAS*PR*N16*99.99~**CAS*CO* *99.99**...line continues from here.
Correct format - CAS*OA*29*99.99~
Incorrect format 1 - CAS*OA* *99.99~
Incorrect format 2 - CAS*OA**99.99~
Goal:
Identify only those strings where ALL of the CAS segments adhere to the format.
Things I've Tried:
(BTW: I know my Regular Expressions are not optimized, so please give me a break)
CAS Segment Missing value or containing one or more spaces
CAS\*(OA|PR|CR|CO)\*\*[-]?[\d]+\.?[\d]{0,2}~ matches the first instance it finds
CAS\*(OA|PR|CR|CO)\*[\s]+?\*[-]?[\d]+\.?[\d]{0,2}~ matches the first instance it finds
CAS segment NOT Missing value or containing space(s)
CAS\*(OA|PR|CR|CO)\*[^0-9A-Z]+?\*[-]?[\d]+\.?[\d]{0,2}~ Again, matches first instance
Negative Lookahead using combinations of the above (I am new to trying this approach)
^(?:(?!ab).)+$ - ab => one of the above regular expressions - never got it to work
Question:
How do I write the regular expression to enforce/validate the format of EVERY CAS instance no matter how often it occurs (there is a potential for 0 instances)?

To say that every CAS instance in your string is valid is to say that the string contains no invalid CAS sequence. The approach you were getting at with a negative lookahead is the simplest way to represent this. Here's an example:
/^(?!.*CAS(?!<whatever matches a valid CAS instance>))/
Basically: "Make sure there does not exist in the string an instance of CAS that is not followed by whatever matches a valid CAS instance". Replace the contents of the second negative lookahead, and include whatever it is before 'CAS' that indicates the start of a CAS instance.
As you can see, you don't need to match the string from start to finish to do what you want.
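A minimal Python sketch of this idea, not a definitive implementation. It assumes, going only by the examples in the question, that a valid CAS segment looks like CAS*<OA|PR|CR|CO>*<one or more digits or capital letters>*<amount>~ and that a segment starts either at the beginning of the line or right after a ~. The names VALID_CAS_TAIL and all_cas_valid are made up for illustration.

```python
import re

# Assumed shape of what must follow "CAS" in a valid segment, based on the
# question's "CAS*OA*29*99.99~" example: group code, reason code, amount, "~".
VALID_CAS_TAIL = r"\*(?:OA|PR|CR|CO)\*[0-9A-Z]+\*-?\d+\.?\d{0,2}~"

# Reject the line if any CAS (at line start or right after "~") is NOT
# followed by a valid tail. The outer lookahead consumes nothing.
ALL_CAS_VALID = re.compile(r"^(?!(?:.*~)?CAS(?!" + VALID_CAS_TAIL + r"))")


def all_cas_valid(line: str) -> bool:
    """Return True when the line contains no malformed CAS segment."""
    return ALL_CAS_VALID.search(line) is not None


print(all_cas_valid("CLP*d*d~CAS*OA*29*99.99~AMT*I*99.99~CAS*PR*N16*99.99~"))  # True
print(all_cas_valid("CLP*d*d~CAS*OA* *99.99~AMT*I*99.99~"))                    # False
print(all_cas_valid("CLP*d*d~AMT*I*99.99~"))                                   # True (no CAS at all)
```

A line with zero CAS segments passes, which matches the "potential for 0 instances" requirement in the question.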

This idea will make sure the whole line is correct, i.e. it will not match the line unless it is correct.
^(regexThatOnlyMatchesASingleCorrectInstance)*$
This starts at the beginning of the line (^), matches as many repetitions of regexThatOnlyMatchesASingleCorrectInstance as it can (*), and ensures that the end of the string ($) is found right after the last one.
Of course, this will only work when there is a ~ at the end of the string. For the ~ part, use (?:~|$) so that it doesn't require the delimiter at the end of the string.
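Here is a rough Python sketch of the repeat-until-end approach. Because the line in the question also contains non-CAS segments, the sketch adds a second alternative for "any other segment"; that alternative, the assumed CAS format, and the names VALID_CAS, OTHER_SEGMENT and line_is_valid are my own assumptions, not something given in the question.

```python
import re

# Assumed valid CAS body (without the trailing delimiter), from the question's example.
VALID_CAS = r"CAS\*(?:OA|PR|CR|CO)\*[0-9A-Z]+\*-?\d+\.?\d{0,2}"
# Any other segment: non-empty, contains no "~", and does not start with "CAS*".
OTHER_SEGMENT = r"(?!CAS\*)[^~]+"

# Every segment must be one of the two shapes, each ended by "~" or end of string.
WHOLE_LINE = re.compile(
    r"^(?:(?:" + VALID_CAS + "|" + OTHER_SEGMENT + r")(?:~|$))*$"
)


def line_is_valid(line: str) -> bool:
    """Return True when every segment is a valid CAS or a non-CAS segment."""
    return WHOLE_LINE.search(line) is not None


print(line_is_valid("CLP*d*d~CAS*OA*29*99.99~AMT*I*99.99"))   # True, no trailing "~" needed
print(line_is_valid("CLP*d*d~CAS*OA*29*99.99~AMT*I*99.99~"))  # True
print(line_is_valid("CLP*d*d~CAS*OA* *99.99~"))               # False
```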

Related

Regular Expression misses matches in string

I'm trying to write a regular expression that captures desired strings between strings
("f38 ","f38 ","f1 ", "..") and ("\par","\hich","{","}","","..") from a decompiled DOC file and append each match to an array to eventually be printed out into a new file.
I'm having an issue with catching certain strings between "f38 " and "\hich" (usually when the string spans multiple lines but there is at least 1 exception to this I've found in the example string snippet of the DOC file I'm using on regex101.com)
Here is the regular expression as I have it now
(?<=f38 |f38 | |f1 |\.\.)\w.+(?=\\par|\\cell |\\hich|{|}|\\|\.\.)
The troublesome matches come out including "\hich". Like "e\hich" and "d\hich" and I want to match "e" and "d" respectively in these examples not the \hich portion. I'm thinking the problem is with handling the newline/line-breaks somehow.
Here is a smaller snippet of the input string, I have bolded what is matched and bolded + capitalized the problematic match. From this I want the "e" not the \hich. Note that above there are 2 examples of things going right and "\hich" is not included in the match.
l\hich\af38\dbch\af31505\loch\f38 ..ikely to involve asbestos exposure: removal, encapsulation, alteration, repair, maintenance, insulation, spill/emergency clean-up, transportation, disposal and storage of ACM. The general industry standards cover all other operations where exposure to asb..\hich\af38\dbch\af31505\loch\f38 E\HICH\af38\dbch\af31505\loch\f38 stos is possible
Here is an example with a longer portion of the input string at regex101.com
Any help would be appreciated. Thanks!
The problem is with the part meant to match those single-character samples. \w.+ requires at least two characters to match. So, when you get "e\hich", that first backslash gets matched by the dot in the regex, and the match lasts until the next backslash (which is one of the "terminators" listed in the positive lookahead portion of the regex).
You might want to use * instead of +.
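A small Python illustration of the difference, under some simplifying assumptions: the input is a shortened fragment, only one lookbehind alternative is used (Python's re module requires fixed-width lookbehinds), only the backslash is kept as a terminator, and the quantifiers are made lazy so the effect is easy to see.

```python
import re

# Simplified fragment of the RTF-like input; the raw string keeps backslashes literal.
text = r"\loch\f38 e\hich\af38\dbch"

# "\w.+?" needs a word character plus at least one more character,
# so the first backslash is swallowed by the dot.
too_long = re.search(r"(?<=f38 )\w.+?(?=\\)", text).group()

# "\w.*?" allows zero extra characters, so the match can stop after "e".
just_e = re.search(r"(?<=f38 )\w.*?(?=\\)", text).group()

print(too_long)  # e\hich
print(just_e)    # e
```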

Regular Expression: Two words in any order but with a string between?

I want to use positive lookaheads so that RegEx will pick up two words from two different sets in any order, but with a string between them of length 1 to 20 that is always in the middle.
It is already case insensitive, and it allows any number of characters (including 0) before the first word found and the same after the second word found. I am unsure whether it is more correct to terminate in $.
Without the any order matching I am so far as:
(?i:.*(new|launch|releas)+.{1,20}(product1|product2)+.*)
I have attempted to add any order matching with the following but it only picks up the first word:
(?i:.*(?=new|launch|releas)+.{1,20}(?=product1|product2)+.*)
I thought perhaps this was because of the +.{1,20} in the middle but I am unsure how it could work if I add this to both sets instead, as for instance this could cause a problem if the first word is the very first part of the source text it is parsing, and so no character before it.
I have seen examples where \b is used for lookaheads but that also seems like it may cause a problem as I want it to match when the first word is at the start of the source text but also when it is not.
How should I edit my RegEx here please?

Numbers between 99 and 9999999 regular expression

I am trying to generate a regular expression that will match any numbers within the range of 99 and 9999999. I have trouble understanding how generating number ranges generally works. I managed to find a range generator online that does the job for me, but I want to understand how it actually works.
My attempt to do this range is as follows:
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
This is supposed to match 99, any 3 digit number or any 4 digit number, but it does not work as expected. When tested it matches only numbers 99 and 3 digit numbers. Four digit numbers are not matched at all. If I only write the part for 4 digit numbers on its own as
[1-9][0-9][0-9][0-9]
It matches 4 digit numbers, but when I construct it as in the first example it does not work. Can someone clarify how this actually works and how to generate a regular expression for the range of 99 to 9999999?
Link to demo - Here
So you want to know how this works...
Regexes have no real understanding of the values of numbers in your string; they only care how those numbers are represented, which is why looking for numbers in a range seems more awkward than it should be. The only reason your regex engine can understand a range in a character class like [0-9] at all is because of the characters' positions in a list (a character range like [&-~] is just as valid, and equally understandable to it.)
So, to match a range like 99-9999999, ya gotta spell out what that looks like: literal "99", or three digits without a leading zero, or four digits without a leading zero, and so on.
But this is what your demo did, right? And it didn't work. Of your test string "9293" your regex only matched "929". What happened here is the regex engine is eager to return a complete match - as soon as it found one it returned it, even though a better/longer match might have occurred later.
Here's how that match happened. (I'll skip some details like grouping, as they're not super relevant here.)
Step 1.
The engine compares the first token in the regex with the first character in the string
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Success, they match.
Step 2.
The engine then advances both to the next token in the regex and the next character in the string and compares them.
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ❌
Failure, no match. The engine would stop and return the failure here, but you're using alternation via |, so it knows there's an alternate expression to try.
Step 3.
The engine advances to the first token of the next alternate expression in the regex, and rewinds the position in the string.
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Success, they match.
Step 4.
Continuing on.
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Match.
Step 5.
And again.
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Success. The complete expression matches. There's no need to try the remaining alternate. The match here returned is:
929
As you've probably figured out, if your input string was instead "9923" then step 2 would've matched and the engine there would've stopped and returned "99".
As you've also probably figured out, if you rearrange your alternate expressions from longest to shortest
([1-9][0-9][0-9][0-9]|[1-9][0-9][0-9]|99)
the longest would be attempted first, which would match and return your expected "9293".
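If you want to see the eager behaviour for yourself, here is a quick check in Python (any backtracking engine with ordered alternation should behave the same way; the variable names are just for illustration).

```python
import re

shortest_first = r"(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])"
longest_first = r"([1-9][0-9][0-9][0-9]|[1-9][0-9][0-9]|99)"

# The first alternative that completes wins, even if a later one is longer.
print(re.search(shortest_first, "9293").group())  # 929
print(re.search(shortest_first, "9923").group())  # 99

# With the longest alternative first, the four-digit number is found.
print(re.search(longest_first, "9293").group())   # 9293
```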
Simplifying
It's still pretty wordy though, especially as you crank up the number of digits in your range. There are a couple things you can do to simplify it.
The character class [0-9] can be represented by the shorthand character class \d.
([1-9]\d\d\d|[1-9]\d\d|99)
And instead of repeating them use a quantifier in curly brackets like so:
([1-9]\d{3}|[1-9]\d{2}|99)
As it happens, quantifiers can also take the form of {min,max} (no space after the comma), so you can combine the two similar alternates:
([1-9]\d{2,3}|99)
You might expect this to land you back returning "929" again, the engine being eager and all, but quantifiers are by default greedy so they'll try to pick up as much as they can. This lends itself well to your larger desired range:
([1-9]\d{2,6}|99)
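A short Python check of the greedy quantifier, with a lazy variant added by me purely for contrast:

```python
import re

# Greedy {2,3} grabs as many digits as it can, so all four digits match.
greedy = re.search(r"([1-9]\d{2,3}|99)", "9293").group()

# A lazy {2,3}? (not part of the answer, shown for comparison) stops at the minimum.
lazy = re.search(r"([1-9]\d{2,3}?|99)", "9293").group()

# The wider range handles seven digits, and still falls back to the literal 99.
seven = re.search(r"([1-9]\d{2,6}|99)", "1234567").group()
ninety_nine = re.search(r"([1-9]\d{2,6}|99)", "99").group()

print(greedy, lazy, seven, ninety_nine)  # 9293 929 1234567 99
```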
Finishing up
What you do with it from here depends on what you need the regex to do. As it stands the parentheses are superfluous, there's no point in creating a capturing group of the entire regex itself. However a decision comes when you've got an input string like:
You will likely be eaten by 1000 grue.
If you're trying to pluck out how many grue are about to eat you, you might use
[1-9]\d{2,6}|99
which will return 1000.
However that sorta runs back into the original problem with your demo. If it's "12345678 grue", which is out of range, this'll match "1234567" which might not be what you want. You can make sure the number you've matched isn't immediately followed by (or preceded by) another digit by using negative lookarounds.
(?<!\d)([1-9]\d{2,6}|99)(?!\d)
(?<!\d) means "from this position, the prior character is not a digit" while (?!\d) means "from this position, the next character is not a digit."
The parentheses around the alternates are back as they're necessary for grouping here, otherwise the lookbehind would only be part of and apply in the first alternate expression and the lookahead would only be part of and apply in the second alternate.
On the other hand if you're trying to make sure the entire string only consists of a number in your range you'll want to instead use the anchors ^ and $ (start of string and end of string, respectively):
^([1-9]\d{2,6}|99)$
And finally you can trade the capturing group out for a non-capturing group (?:...), so:
^(?:[1-9]\d{2,6}|99)$
or
(?<!\d)(?:[1-9]\d{2,6}|99)(?!\d)
You'll still grab the number as the match, it just won't be repeated in a group capture. (Lookarounds are already non-capturing, no need to worry about those.)
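Both final forms can be tried out in Python like this. The names in_text and whole are just labels I picked; the patterns are the ones above.

```python
import re

# Pluck an in-range number out of surrounding text.
in_text = re.compile(r"(?<!\d)(?:[1-9]\d{2,6}|99)(?!\d)")
# Require the entire string to be an in-range number.
whole = re.compile(r"^(?:[1-9]\d{2,6}|99)$")

print(in_text.search("You will likely be eaten by 1000 grue.").group())  # 1000

# Without lookarounds an out-of-range number is partially matched...
print(re.search(r"[1-9]\d{2,6}|99", "12345678 grue").group())  # 1234567
# ...with them, it is rejected outright.
print(in_text.search("12345678 grue"))  # None

print(bool(whole.search("9999999")), bool(whole.search("98")))  # True False
```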
First of all you need some string boundaries for your regex (anything except a digit; in my example I use ^ and $, the beginning and end of the line or string)
Try this one:
^([1-9][0-9]{2,6}|99)$

Regular Expression to match most explicit string

I have some experience with regular expressions but I am far from expert level and need a way to match the record with the most explicit string in a file where each record begins with a unique 1-5 digit integer and is padded with various other characters when it is shorter than 5 digits. For example, my file has records that begin with:
32000
3201X
32014
320xy
In this example, the non-numeric characters represent wildcards. I thought the following regex examples would work but rather than match the record with the MOST explicit number, they always match the record with the LEAST explicit number. Remember, I do not know what is in the file so I need to test all possibilities to locate the MOST explicit match.
If I need to search for 32000, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3200\D|^32000/
It should match 32000 but it matches 320xy
If I need to search for 32014, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3201\D|^32014/
It should match 32014 but it matches 320xy
If I need to search for 32015, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3201\D|^32015/
It should match 3201x but it matches 320xy
In each case, the matched result is the LEAST specific numeric value. I also tried reversing the regex as follows but still get the same results:
/^32014|^3201\D|^320\D{2}|^32\D{3}|^3\D{4}/
Any help is much appreciated.
Okay, if you want to match a string literally then use anchors. Then specify the string you want matched. For instance, to match '123456xyz' where the xyz can be anything except numeric, use:
'^123456[^0-9]{3}$'
If you prefer specific letters to match at the end, if they will always be x y or z then use:
'^123456[xyz]{3}$'
Note the ^ and $ anchor the string to start with 123456 and end with three letters that are x, y or z.
Good luck!
Ok, I did quite some tinkering here. I am 99% sure that this is pretty much impossible (if we don't cheat and interpolate code into the regex). The reason is you will need a negative lookbehind with variable length at some point.
However, I came up with two alternatives. One is if you want just to find the "most exact match", the second one is if you want to replace it with something. Here we go:
/(32000)|\A(?!.*32000).*(3200\D)|\A(?!.*3200[0\D]).*(320\D\D)|\A(?!.*320[0\D][0\D]).*(32\D\D\D)|\A(?!.*32[0\D][0\D][0\D]).*(3\D\D\D\D)/m
Question:
So what is my "most exact match" here?
Answer:
The concatenation of the 5 matched groups - \1\2\3\4\5. In fact always only one of them will match, the other 4 will be empty.
/(32000)|\A(?!.*32000)(.*)(3200\D)|\A(?!.*3200[0\D])(.*)(320\D\D)|\A(?!.*320[0\D][0\D])(.*)(32\D\D\D)|\A(?!.*32[0\D][0\D][0\D])(.*)(3\D\D\D\D)/m
Question:
How can I use this to replace my "most exact match"?
Answer:
In this case your "most exact match" will be the concatenation of \1\3\5\7\9, but we will have also matched some other things before that, namely \2\4\6\8 (again, only one of these can be non empty). Therefore if you want to replace your "most exact match" with fubar you can match with the above regex and replace with \2\4\6\8fubar
Another way you can think about it (and might be helpful) is that your "most exact match" will be the last matched line of either of the two regexes.
Two things to note here:
I used Ruby-style regexes: \A means the beginning of the string (not the beginning of a line, which is ^), and the trailing /m flag means multi-line mode (in Ruby, it lets . match newlines). You should be able to find syntax for the same things in your language/technology as long as it uses some flavor of PCRE.
This can be slow. If we don't find exact match we might possibly have to match and replace the entire string (if the non exact match can be found at the end of the string).
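For anyone not on Ruby, this is roughly how the first (match-only) regex could be ported to Python. Treat it as a sketch: it is hard-coded for the key 32000 just like the original, re.DOTALL stands in for Ruby's /m, and the function name most_exact is made up.

```python
import re
from typing import Optional

# Port of the match-only regex above, hard-coded for the key 32000.
# re.DOTALL lets "." cross line breaks, which the pattern relies on.
MOST_EXACT = re.compile(
    r"(32000)"
    r"|\A(?!.*32000).*(3200\D)"
    r"|\A(?!.*3200[0\D]).*(320\D\D)"
    r"|\A(?!.*320[0\D][0\D]).*(32\D\D\D)"
    r"|\A(?!.*32[0\D][0\D][0\D]).*(3\D\D\D\D)",
    re.DOTALL,
)


def most_exact(records: str) -> Optional[str]:
    """Return the most explicit record key for 32000, or None if nothing fits."""
    m = MOST_EXACT.search(records)
    if m is None:
        return None
    # Only one group takes part in the match; unmatched groups are None in Python.
    return "".join(g or "" for g in m.groups())


print(most_exact("320xy\n3201X\n32014\n32000\n"))  # 32000
print(most_exact("320xy\n3201X\n32014\n"))         # 320xy
```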

Combine Regexp?

After collecting user input for various conditions like
Starts with : /(^#)/
Ends with : /(#$)/
Contains : /#/
Doesn't contain
To make a single regex when the user enters multiple conditions,
I combine them with "|", so if 1 and 2 are given it becomes /(^#)|(#$)/
This method works so far, but
I'm not able to determine what the regex for the 4th condition should be. And does combining regexes this way work?
Update: #(user input) won't be the same for two conditions, and not all four conditions are always present, but they can be. In future I might need more conditions like "is exactly" and "is exactly not" etc., so I'm more curious to know whether this approach will scale. Also, there may be issues of user input cleanup so that the regex is escaped properly, but that is ignored right now.
Will the conditions be ORed or ANDed together?
Starts with: abc
Ends with: xyz
Contains: 123
Doesn't contain: 456
The OR version is fairly simple; as you said, it's mostly a matter of inserting pipes between individual conditions. The regex simply stops looking for a match as soon as one of the alternatives matches.
/^abc|xyz$|123|^(?:(?!456).)*$/
That fourth alternative may look bizarre, but that's how you express "doesn't contain" in a regex. By the way, the order of the alternatives doesn't matter; this is effectively the same regex:
/xyz$|^(?:(?!456).)*$|123|^abc/
The AND version is more complicated. After each individual regex matches, the match position has to be reset to zero so the next regex has access to the whole input. That means all of the conditions have to be expressed as lookaheads (technically, one of them doesn't have to be a lookahead, but I think it expresses the intent more clearly this way). A final .*$ consummates the match.
/^(?=^abc)(?=.*xyz$)(?=.*123)(?=^(?:(?!456).)*$).*$/
And then there's the possibility of combined AND and OR conditions--that's where the real fun starts. :D
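To see the two combined patterns side by side, here is a quick Python run using the sample conditions above (ANY_OF and ALL_OF are just names I chose):

```python
import re

# The four sample conditions from above, combined both ways.
ANY_OF = re.compile(r"^abc|xyz$|123|^(?:(?!456).)*$")
ALL_OF = re.compile(r"^(?=^abc)(?=.*xyz$)(?=.*123)(?=^(?:(?!456).)*$).*$")


def matches(pattern: re.Pattern, s: str) -> bool:
    """Return True when the pattern matches somewhere in s."""
    return pattern.search(s) is not None


print(matches(ANY_OF, "abc123xyz"), matches(ALL_OF, "abc123xyz"))        # True True
print(matches(ANY_OF, "abc123456xyz"), matches(ALL_OF, "abc123456xyz"))  # True False
print(matches(ANY_OF, "456 only"), matches(ALL_OF, "456 only"))          # False False
print(matches(ANY_OF, "hello"), matches(ALL_OF, "hello"))                # True False
```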
Doesn't contain #: /(^[^#]*$)/
Combining works if the intended result of combination is that any of them matching results in the whole regexp matching.
If a string must not contain #, every character must be another character than #:
/^[^#]*$/
This will match any string of any length that does not contain #.
Another possible solution would be to invert the boolean result of /#/.
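In Python, the two options would look something like this (function names are illustrative only):

```python
import re


def lacks_hash_by_pattern(s: str) -> bool:
    """Every character must be something other than '#'."""
    return re.search(r"^[^#]*$", s) is not None


def lacks_hash_by_inversion(s: str) -> bool:
    """Search for '#' and invert the boolean result."""
    return re.search(r"#", s) is None


for sample in ["my name is hal9000", "a#b", ""]:
    print(repr(sample), lacks_hash_by_pattern(sample), lacks_hash_by_inversion(sample))
```

Both functions agree on these samples; the inverted form tends to be easier to read when the forbidden text is longer than one character.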
In my experience with regex you really need to focus on what EXACTLY you are trying to match, rather than what NOT to match.
for example
\d{2}
[1-9][0-9]
The first expression will match any 2 digits....and the second will match 1 digit from 1 to 9 and 1 digit - any digit. So if you type 07 the first expression will validate it, but the second one will not.
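The same comparison as a quick Python check, using re.fullmatch so the whole input has to fit the pattern:

```python
import re

two_digits = re.compile(r"\d{2}")
no_leading_zero = re.compile(r"[1-9][0-9]")

print(bool(two_digits.fullmatch("07")))       # True
print(bool(no_leading_zero.fullmatch("07")))  # False
print(bool(no_leading_zero.fullmatch("70")))  # True
```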
See this for advanced reference:
http://www.regular-expressions.info/refadv.html
EDITED:
^((?!my string).)*$ Is the regular expression for does not contain "my string".
1 + 2 + 4 conditions: starts|ends, but not in the middle
/^#[^#]*#?$|^#?[^#]*#$/
is almost the same as:
/^#?[^#]*#?$/
but this one also matches any string without #, for example 'my name is hal9000'
Combining the regex for the fourth option with any of the others doesn't work within one regex. 4 + 1 would mean either the string starts with # or doesn't contain # at all. You're going to need two separate comparisons to do that.
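One way to do the separate comparisons is to keep each condition as its own check and combine the boolean results in code. The sketch below is Python, and build_checks with its parameter names is a hypothetical helper, not an existing API. It escapes the user input with re.escape and handles "doesn't contain" by inverting a plain search.

```python
import re


def build_checks(starts=None, ends=None, contains=None, excludes=None):
    """Return one predicate per condition the user actually supplied."""
    checks = []
    if starts is not None:
        pat = re.compile("^" + re.escape(starts))
        checks.append(lambda s, p=pat: p.search(s) is not None)
    if ends is not None:
        pat = re.compile(re.escape(ends) + r"\Z")
        checks.append(lambda s, p=pat: p.search(s) is not None)
    if contains is not None:
        pat = re.compile(re.escape(contains))
        checks.append(lambda s, p=pat: p.search(s) is not None)
    if excludes is not None:
        # "Doesn't contain" is just the inverted result of "contains".
        pat = re.compile(re.escape(excludes))
        checks.append(lambda s, p=pat: p.search(s) is None)
    return checks


checks = build_checks(starts="#", excludes="!")
print(all(c("#tag") for c in checks))   # True: starts with "#" and has no "!"
print(all(c("#tag!") for c in checks))  # False: contains "!"
print(any(c("tag") for c in checks))    # True: no "!" satisfies the OR
```

Using all() gives AND semantics and any() gives OR, so adding conditions such as "is exactly" later only means appending another predicate.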