Pattern matching software builds - regex

I need to check if an XML document contains some software build numbers.
Normally they're like ######, where:
First two characters are numbers
Third character is a letter in uppercase
Last three characters are numbers
Example: 10B329 or 11A465.
But there could be some exceptions, like 8L1 or 11B465a (if there's another character after the sixth, it's always a lowercase letter).
I think they always have a minimum length of 3 characters and a maximum length of 7 characters.
So what would be the best pattern to match? I tried this, but it doesn't work since it also matches words:
Dim BuildPattern As String = "<key>[0-9A-Z]*</key>"

Try this regex: \d{2}[A-Z]\d{3}
Use [A-Z] rather than \w for the third character, because \w also matches digits, lowercase letters and underscores.
You can add the <key></key> tags too: <key>(\d{2}[A-Z]\d{3})<\/key>. This way, your match will be in group 1 of the match.
Note that you should rather use an XML parser for this, as it's safer and more accurate than running regexes on XML files.
EDIT: Can't help you with the non-standard length though, my knowledge of regexes is still too low. Perhaps you really should try an XML parser instead?
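For the non-standard lengths, here is one possible sketch in Python (the question uses VB.NET, so the language is an assumption, as are the names BUILD_RE and find_builds and the sample XML). It assumes, based only on the examples in the question, one or two digits, an uppercase letter, one to three digits and an optional lowercase letter, and it reads the <key> elements with an XML parser instead of matching the tags by regex:

```python
import re
import xml.etree.ElementTree as ET

# Assumed shape, inferred from the examples 10B329, 8L1 and 11B465a:
# 1-2 digits, one uppercase letter, 1-3 digits, optional lowercase letter.
BUILD_RE = re.compile(r"\d{1,2}[A-Z]\d{1,3}[a-z]?")


def find_builds(xml_text):
    """Return the text of every <key> element that looks like a build number."""
    root = ET.fromstring(xml_text)
    return [
        key.text
        for key in root.iter("key")
        if key.text and BUILD_RE.fullmatch(key.text.strip())
    ]


# Hypothetical sample document, not taken from the question.
sample = "<dict><key>10B329</key><key>Name</key><key>8L1</key><key>11B465a</key></dict>"
print(find_builds(sample))
```

Using fullmatch means a plain word such as "Name" is rejected, which was the problem with the [0-9A-Z]* attempt. If real build numbers turn out to have a different shape, the quantifiers would need adjusting.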

Related

Regular Expression misses matches in string

I'm trying to write a regular expression that captures the desired strings between the start markers
("f38 ","f38 ","f1 ", "..") and the end markers ("\par","\hich","{","}","","..") in a decompiled DOC file, appending each match to an array that will eventually be written out to a new file.
I'm having trouble catching certain strings between "f38 " and "\hich" (usually when the string spans multiple lines, although I've found at least one exception to this in the snippet of the DOC file I'm using on regex101.com).
Here is the regular expression as I have it now
(?<=f38 |f38 | |f1 |\.\.)\w.+(?=\\par|\\cell |\\hich|{|}|\\|\.\.)
The troublesome matches come out including "\hich", for example "e\hich" and "d\hich", where I want to match just "e" and "d" respectively and not the \hich portion. I suspect the problem is with how newlines/line breaks are handled.
Here is a smaller snippet of the input string, I have bolded what is matched and bolded + capitalized the problematic match. From this I want the "e" not the \hich. Note that above there are 2 examples of things going right and "\hich" is not included in the match.
l\hich\af38\dbch\af31505\loch\f38 ..ikely to involve asbestos exposure: removal, encapsulation, alteration, repair, maintenance, insulation, spill/emergency clean-up, transportation, disposal and storage of ACM. The general industry standards cover all other operations where exposure to asb..\hich\af38\dbch\af31505\loch\f38 E\HICH\af38\dbch\af31505\loch\f38 stos is possible
Here is an example with a longer portion of the input string at regex101.com
Any help would be appreciated. Thanks!
The problem is in the part meant to match those single-character samples: \w.+ requires at least two characters to match. So when you get "e\hich", the first backslash is consumed by the dot, and the match then runs until the next backslash (which is one of the "terminators" listed in the positive lookahead portion of the regex).
You might want to use * instead of +.
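The difference can be seen in a small Python sketch. It is deliberately simplified and rests on a few assumptions: only one lookbehind alternative is kept (Python's re module requires fixed-width lookbehinds), only two terminators are used, and the quantifiers are written as lazy (+? and *?) so that the match stops at the first terminator, which differs from the greedy quantifier in the question. The names AT_LEAST_TWO and AT_LEAST_ONE are made up for the illustration.

```python
import re

# Simplified version of the pattern in the question: one fixed-width
# lookbehind and two terminators (a backslash or the end of the string).
AT_LEAST_TWO = re.compile(r"(?<=f38 )\w.+?(?=\\|$)")  # \w plus one or more characters
AT_LEAST_ONE = re.compile(r"(?<=f38 )\w.*?(?=\\|$)")  # \w plus zero or more characters

sample = r"f38 e\hich"

print(AT_LEAST_TWO.findall(sample))  # the dot is forced to swallow the backslash
print(AT_LEAST_ONE.findall(sample))  # the match can stop right after "e"
```

With the + version the single-character sample comes back with \hich attached; with the * version only "e" is returned, while longer runs of text are still matched up to their terminator.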

positive look ahead and replace

Lately I've been writing and testing regexps on https://regex101.com/.
My question is: is it possible to do a positive look-ahead AND a replacement in the same "replacement", or is only a limited kind of replacement possible?
The input is several lines with phone numbers. Let's say a phone number is correct when it contains exactly 11 digits, no matter how the digits are grouped with - / characters, and no matter whether it starts with + or 00 or the prefix is omitted.
Some example lines:
+48301234567
+48/30/1234567
+48-30-12-345-67
+483011223344556677
0048301234567
+(48)30/1234567
A positive look-ahead is able to check that from the beginning to the end of the line there are exactly 11 digits, regardless of how many of the characters specified above separate them. This works perfectly.
Where the positive look-ahead check succeeds, I would like to delete every character except the digits. The replacement works fine as long as I don't involve the look-ahead.
The regexp for the check works perfectly on its own ("gm" modes):
^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$
The replacement part also works perfectly on its own (replace with nothing):
[^\d\n]
Combining the two, with the check placed in a look-ahead in front of the class that deletes the non-newline, non-digit characters from the matching lines:
(?=^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$)[^\d\n]
Even though I put the ^ and $ inside the look-ahead, the replacement seems to work only from the beginning of each line up to the very first digit.
I know that in real life the replacement and the check would be done separately; however, I'm curious whether I could mix look-ahead/look-behind with string operations like replace and delete, taking the string apart and putting it back together as I like.
UPDATE: This is what would do the trick, but I find it a bit "ugly". Is there a prettier solution?
https://regex101.com/r/yT5dA4/2
Or the version I originally asked for, where only digits remain: regex101.com/r/yT5dA4/3
You cannot replace/delete text with a regex alone. A regex is just a tool for matching certain strings; the surrounding code then takes some action depending on the matched text, e.g. performing a substitution or retrieving the second capture group.
However it is possible to perform certain decisions within a regex engine, by using conditionals. The common syntax for this, with a lookahead assertion, is (?(?=regex)then|else).
With conditionals you can change the behaviour depending on how the text matches the regex. For your example you could do something like:
^(\+)?(?(1)\(|\d)
If the phone number starts with a plus it must be followed by a bracket, else it should start with a digit. Although in your situation, this is not very useful.
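As a small illustration, Python's re module accepts the group-reference form of the conditional used in the pattern above. The sketch below (the name STARTS_OK is made up) only checks how a number begins; it says nothing about the rest of the line:

```python
import re

# Group 1 captures an optional leading "+". The conditional (?(1)...|...) then
# requires "(" if group 1 matched, and a digit otherwise.
STARTS_OK = re.compile(r"^(\+)?(?(1)\(|\d)")

for number in ["+(48)30/1234567", "+48301234567", "0048301234567"]:
    print(number, bool(STARTS_OK.match(number)))
```

The second number is rejected because a plus is present but no bracket follows it.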
If you want to read up more on conditionals in regex you can do so here.
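Going back to the original goal, the check and the clean-up can be kept as two separate steps in code. The following is a minimal Python sketch, assuming the validation pattern from the question is the desired rule; the function name normalize is made up for the example:

```python
import re

# The validation pattern from the question: optional "+" or "00" prefix,
# then exactly 11 digits, each optionally preceded by one separator.
VALID = re.compile(r"^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$")


def normalize(line):
    """Strip non-digits from a line, but only if the line passes the check."""
    if VALID.match(line):
        return re.sub(r"\D", "", line)
    return line


lines = ["+48/30/1234567", "+(48)30/1234567", "+483011223344556677"]
print([normalize(line) for line in lines])
```

Lines that fail the check are returned unchanged, so the over-long number in the sample keeps its plus sign.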

Regular Expression to match most explicit string

I have some experience with regular expressions but I am far from expert level and need a way to match the record with the most explicit string in a file where each record begins with a unique 1-5 digit integer and is padded with various other characters when it is shorter than 5 digits. For example, my file has records that begin with:
32000
3201X
32014
320xy
In this example, the non-numeric characters represent wildcards. I thought the following regex examples would work but rather than match the record with the MOST explicit number, they always match the record with the LEAST explicit number. Remember, I do not know what is in the file so I need to test all possibilities to locate the MOST explicit match.
If I need to search for 32000, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3200\D|^32000/
It should match 32000 but it matches 320xy
If I need to search for 32014, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3201\D|^32014/
It should match 32014 but it matches 320xy
If I need to search for 32015, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3201\D|^32015/
It should match 3201X but it matches 320xy
In each case, the matched result is the LEAST specific numeric value. I also tried reversing the regex as follows but still get the same results:
/^32014|^3201\D|^320\D{2}|^32\D{3}|^3\D{4}/
Any help is much appreciated.
Okay, if you want to match a string literally, use anchors and then specify the string you want matched. For instance, to match '123456xyz' where the xyz can be anything except digits, use:
'^123456[^0-9]{3}$'
If you prefer specific letters to match at the end, if they will always be x y or z then use:
'^123456[xyz]{3}$'
Note the ^ and $ anchor the string to start with 123456 and end with three letters that are x, y or z.
Good luck!
Ok, I did quite some tinkering here. I am 99% sure that this is pretty much impossible with a single regex (if we don't cheat and interpolate code into the regex). The reason is that you will need a negative lookbehind with variable length at some point.
However, I came up with two alternatives. One is if you want just to find the "most exact match", the second one is if you want to replace it with something. Here we go:
/(32000)|\A(?!.*32000).*(3200\D)|\A(?!.*3200[0\D]).*(320\D\D)|\A(?!.*320[0\D][0\D]).*(32\D\D\D)|\A(?!.*32[0\D][0\D][0\D]).*(3\D\D\D\D)/m
Question:
So what is my "most exact match" here?
Answer:
The concatenation of the 5 matched groups - \1\2\3\4\5. In fact always only one of them will match, the other 4 will be empty.
/(32000)|\A(?!.*32000)(.*)(3200\D)|\A(?!.*3200[0\D])(.*)(320\D\D)|\A(?!.*320[0\D][0\D])(.*)(32\D\D\D)|\A(?!.*32[0\D][0\D][0\D])(.*)(3\D\D\D\D)/m
Question:
How can I use this to replace my "most exact match"?
Answer:
In this case your "most exact match" will be the concatenation of \1\3\5\7\9, but we will have also matched some other things before that, namely \2\4\6\8 (again, only one of these can be non empty). Therefore if you want to replace your "most exact match" with fubar you can match with the above regex and replace with \2\4\6\8fubar
Another way you can think about it (and might be helpful) is that your "most exact match" will be the last matched line of either of the two regexes.
Two things to note here:
I used Ruby style RE: \A means the beginning of the string (not the beginning of a line, which is ^), and the /m flag means multi-line mode (in Ruby, this makes . match newlines too). You should be able to find syntax for the same things in your language/technology as long as it uses some flavor of PCRE.
This can be slow. If we don't find an exact match, we might have to match and replace the entire string (if the non-exact match is found at the end of the string).
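If stepping outside a single regex is acceptable, the search can also be driven by a loop in code. This is a different technique from the regexes above, sketched here in Python under the assumption that the records are available as a list of strings; the function name most_explicit is hypothetical. It tries the most specific pattern first and only falls back to a shorter digit prefix when no record matches:

```python
import re


def most_explicit(records, number):
    """Return the record whose leading digits match the longest prefix of number."""
    width = len(number)
    # Try the full number first, then shorter prefixes padded with non-digits.
    for keep in range(width, 0, -1):
        pattern = re.compile(re.escape(number[:keep]) + r"\D{%d}" % (width - keep))
        for record in records:
            if pattern.match(record):
                return record
    return None


records = ["32000", "3201X", "32014", "320xy"]
for wanted in ["32000", "32014", "32015"]:
    print(wanted, most_explicit(records, wanted))
```

Each pass scans the whole list, so this costs up to five passes for a five-digit number, which is usually acceptable for a file of records.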

How to optimise this regex to match string (1234-12345-1)

I've got this RegEx example: http://regexr.com?34hihsvn
I'm wondering if there's a more elegant way of writing it, or perhaps a more optimised way?
Here are the rules:
Digits and dashes only.
Must not contain more than 10 digits.
Must have two hyphens.
Must have at least one digit between each hyphen.
Last number must only be one digit.
I'm new to this so would appreciate any hints or tips.
In case the link expires, the text to search is
----------
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
----------
My regex is: \b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b
I'm using this site for testing: http://gskinner.com/RegExr/
Thanks for any help,
Nick
Here is a regex I came up with:
(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b
This uses a look-ahead to validate the input before attempting the match: it looks for between 3 and 10 characters from the class [\d-], followed by a dash and a digit. After that comes the actual match, which confirms that the format of your string is digits(dash)digits(dash)digit.
From your sample strings this regex matches:
22-22-1
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
1-1-1
It also matches the following strings:
22-7777777-1
1-88888888-1
Your regexp only allows a first and second group of digits with a maximum length of 5. Therefore, valid strings like 1-12345678-1 or 123456-1-1 won't be matched.
This regexp works for the given requirements:
\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b
(RegExr)
You can use this with the m modifier (switch the multiline mode on):
^\d(?!.{12})\d*-\d+-\d$
or this one without the m modifier:
\b\d(?!.{12})\d*-\d+-\d\b
By design, these two patterns match at least three digits separated by hyphens (so there is no need to put a {5,n} quantifier anywhere; it would be useless).
The patterns are also built to fail fast:
I chose to start them with a digit \d, so that every beginning of a line or word boundary not followed by a digit is discarded immediately. Also, having consumed exactly one digit, I know how long the remaining string may be.
Then I test the upper limit of the string length with a negative lookahead that checks whether there is one more character than the maximum allowed (if there are 12 characters at this position, there are at least 13 characters in the string). There is no need for anything more descriptive than the dot metacharacter here; the goal is to test the length quickly.
Finally, I describe the end of the string without doing anything special. That is probably the slowest part of the pattern, but it doesn't matter, since the overwhelming majority of unnecessary positions have already been discarded.
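To check the multiline pattern against the sample list from the question, a short Python sketch can be used (Python is an assumption here; the question only mentions RegExr). The re.M flag plays the role of the m modifier:

```python
import re

# The multiline pattern from the answer above.
PATTERN = re.compile(r"^\d(?!.{12})\d*-\d+-\d$", re.M)

# Sample lines from the question.
text = "\n".join([
    "22-22-1", "22-22-22", "333-333-1", "333-4444-1", "4444-4444-1",
    "4444-55555-1", "55555-4444-1", "666666-7777777-1",
    "88888888-88888888-1", "1-1-1", "88888888-88888888-22",
    "22-333-", "333-22",
])

print(PATTERN.findall(text))
```

Since the dot does not match a newline by default, the length test in the lookahead stays within the current line.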

What is wrong with my simple regex that accepts empty strings and apartment numbers?

So I wanted to limit a textbox which contains an apartment number which is optional.
Here is the regex in question:
([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)
Simple enough eh?
I'm using these tools to test my regex:
Regex Analyzer
Regex Validator
Here are the expected results:
Valid
"1234A"
"Z"
"(Empty string)"
Invalid
"A1234"
"fhfdsahds527523832dvhsfdg"
Obviously, if I'm here, the invalid ones are accepted by the regex. The goal of this regex is to accept either 1 to 4 digits with an optional letter, a single letter, or an empty string.
I just can't seem to figure out what's not working, I mean it is a simple enough regex we have here. I'm probably missing something as I'm not very good with regexes, but this syntax seems ok to my eyes. Hopefully someone here can point to my error.
Thanks for all help, it is greatly appreciated.
You need to use the ^ and $ anchors for your first two options as well. Also you can include the second option into the first one (which immediately matches the third variant as well):
^[0-9]{0,4}[A-Z]?$
Without the anchors your regular expression matches because it will just pick a single letter from anywhere within your string.
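The effect of the anchors can be checked with a short Python sketch (the language is an assumption, since the question does not name one; UNANCHORED and ANCHORED are made-up names). search is used for both patterns so that only the anchors make the difference:

```python
import re

UNANCHORED = re.compile(r"([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)")  # pattern from the question
ANCHORED = re.compile(r"^[0-9]{0,4}[A-Z]?$")                  # pattern from the answer

for value in ["1234A", "Z", "", "A1234", "fhfdsahds527523832dvhsfdg"]:
    print(repr(value), bool(UNANCHORED.search(value)), bool(ANCHORED.search(value)))
```

The unanchored pattern reports a match for all five values, because it finds a digit run or an uppercase letter somewhere inside the invalid strings; the anchored one accepts only the first three.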
Depending on the language, you can also use a negative look ahead.
^[0-9]{0,4}[A-Za-z](?!.*[0-9])
Breakdown:
^[0-9]{0,4} = This looks for a digit, zero to four times, at the beginning of the string
[A-Za-z] = This looks for a single letter (either case)
(?!.*[0-9]) = This only allows the letter if there are no digits anywhere after it.
I haven't quite figured out how to validate against an empty (null) string, but that might be easier to do with tools from whatever language you are using. Something along this logic:
if String Doesn't equal $null Then check the Regex
Something along those lines, just adjusted for however you would do it in your language.
I used RegEx Skinner to validate the answers.
Edit: Fixed error from comments