AUTOHOTKEY: RegExMatch() a series of numbers and letters - regex

I've tested my regular expression in http://www.regextester.com/
([0-9]{4,4})([A-Z]{2})([0-9]{1,3})
It matches the following strings perfectly, just as I want:
1234AB123
2000AZ20
1000XY753
But when I try it in Autohotkey I get 0 result
test := RegExMatch("2000SY155","([0-9]{4,4})([A-Z]{2})([0-9]{1,3})")
MsgBox %test%
testing for:
first 4 characters must be a number
next 2 characters must be caps letters
next 1 to 3 characters must be numbers

You had too many ( )
This is the correct implementation:
test := RegExMatch("1234AB123","[0-9]{4,4}([A-Z]{2})[0-9]{1,3}")
Edit:
So what I noticed is you want this pattern to match, but you aren't really telling it much.
Here's what I was able to come up with that matches what you asked for. It's probably not the best solution, but it works:
test := RegExMatch("1234AB567","^[0-9]{4,4}[A-Z]{2}(?![0-9]{4,})[0-9$]{1,3}")
Breaking it down:
RegExMatch(Haystack, NeedleRegEx [, UnquotedOutputVar = "", StartingPosition = 1])
Circumflex (^) and dollar sign ($) are called anchors because
they don't consume any characters; instead, they tie the pattern to
the beginning or end of the string being searched.
^ may appear at the beginning of a pattern to require the match to occur at
the very beginning of a line. For example, ^abc matches abc123 but not 123abc.
$ may appear at the end of a pattern to require the match to occur at the very end of a line. For example, abc$ matches 123abc but not abc123.
So by adding Circumflex we are requiring that our Pattern [0-9]{4,4} be at the beginning of our Haystack.
Look-ahead and look-behind assertions: The groups (?=...), (?!...) are
called assertions because they demand a condition to be met but don't
consume any characters.
(?!...) is a negative look-ahead because it requires that the specified pattern not exist.
Our next Pattern is looking for two Uppercase Alpha Characters [A-Z]{2}(?![0-9]{4,}) that does not have four or more Numeric characters after it.
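And finally our last Pattern, [0-9$]{1,3}, which needs to match one to three Numeric characters as the last characters in our Haystack. Note that inside a character class $ is a literal dollar sign rather than an anchor, so [0-9]{1,3}$ is the form that actually ties the digits to the end of the Haystack.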
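As a rough cross-check of the breakdown above, here is a minimal sketch in Python rather than AutoHotkey (an assumption made purely for illustration; the constructs used here behave the same way in Python's re module). It compares the pattern as posted with a hypothetical variant, not part of the answer, that moves $ outside the character class so that it acts as an end anchor.

```python
import re

# Pattern as posted: "$" sits inside the class, so it is a literal dollar sign.
posted = re.compile(r"^[0-9]{4,4}[A-Z]{2}(?![0-9]{4,})[0-9$]{1,3}")

# Hypothetical variant: "$" outside the class anchors the match to the end.
anchored = re.compile(r"^[0-9]{4,4}[A-Z]{2}[0-9]{1,3}$")

samples = ["1234AB567", "1234AB5678", "1234AB567XYZ", "1234AB5$"]

# For each sample, record whether each pattern matches from the start.
results = {s: (bool(posted.match(s)), bool(anchored.match(s))) for s in samples}

for s, (p, a) in results.items():
    print(f"{s!r}: posted={p}, anchored={a}")
```

The posted pattern rejects four trailing digits because of the negative look-ahead, but it still accepts trailing letters and a literal dollar sign; the anchored variant rejects all three.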

test := RegExMatch("2000SY155","([0-9]{4,4})([A-Z]{2})([0-9]{1,3})")
MsgBox %test%
But when I try it in Autohotkey I get 0 result
The message box correctly returns 1 for me, meaning your initial script works fine with my version. Parentheses are usually no problem in regexes; you can use as many as you like. Maybe your AutoHotkey version is outdated?
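To illustrate that capturing parentheses do not change whether the pattern matches, here is a small sketch in Python (used instead of AutoHotkey as an assumption for illustration). The helper first_match_position is hypothetical; it only mimics RegExMatch's convention of returning the 1-based position of the match, or 0 when nothing matches.

```python
import re

GROUPED = r"([0-9]{4,4})([A-Z]{2})([0-9]{1,3})"
UNGROUPED = r"[0-9]{4,4}[A-Z]{2}[0-9]{1,3}"


def first_match_position(pattern, haystack):
    """Return the 1-based position of the first match, or 0 if there is none."""
    m = re.search(pattern, haystack)
    return m.start() + 1 if m else 0


print(first_match_position(GROUPED, "2000SY155"))
print(first_match_position(UNGROUPED, "2000SY155"))

# The groups only add sub-captures; they do not affect the overall match.
print(re.search(GROUPED, "2000SY155").groups())
```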

Related

Am I implementing negative lookaheads correctly with my regex?

I'm a beginner with regex and stuck with creating regex with the following conditions:
Minimum of 8 characters
Maximum of 60 characters
Must contain 2 letters
Must contain 1 number
Must contain 1 special character
Special character cannot be the following: & ` ( ) = [ ] | ; " ' < >
So far I have the following...
(?=^.{8,60}$)(?=.*\d)(?=[a-zA-Z]{2,})(?!.*[&`()=[|;"''\]'<>]).*
But my last two tests are failing and I have no idea why...
!##$%^*+-_~?,.{}!HR12345
123456789AB!
If you'd like to see my test and expected results, visit here: https://regexr.com/73m2o
My tests contain an acceptable number of characters, an appropriate number of alphabetic characters, and supported special characters... I don't know why it's failing!
Using .* to verify a character in the string can be very inefficient; I would suggest using negated character classes instead, following the principle of contrast.
Apart from that, there is a point in the question Must contain 1 special character that is not addressed yet in the current answers.
You can use a positive lookahead for that to assert one of the characters that you consider special.
^(?=[^\d\n]*\d)(?=[^a-zA-Z\n]*[a-zA-Z][^a-zA-Z\n]*[a-zA-Z])(?=[^!##$%^\n]*[!##$%^])[^&`()=[|;"''\]'<>\n]{8,60}$
Explanation
^ Start of string (Outside of the lookahead)
(?=[^\d\n]*\d) Assert a digit
(?=[^a-zA-Z\n]*[a-zA-Z][^a-zA-Z\n]*[a-zA-Z]) Assert 2 chars a-zA-Z
(?=[^!##$%^\n]*[!##$%^]) Assert a "special" character
[^&`()=[|;"''\]'<>\n]{8,60} Match 8-60 characters except for the ones that you don't want to match
$ End of string
See a regex demo.
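The same structure can be sketched in Python's re module. This is only an illustration under stated assumptions: the SPECIAL set is a hypothetical stand-in built from a few of the characters shown in the answer, and the forbidden characters are written once each with the brackets escaped.

```python
import re

# Characters the question forbids, each listed once, brackets escaped.
FORBIDDEN = r"""&`()=\[\]|;"'<>"""

# Assumption: a small set of characters counted as "special".
SPECIAL = r"!#$%^"

PATTERN = re.compile(
    r"^(?=[^\d\n]*\d)"                                # at least one digit
    r"(?=[^a-zA-Z\n]*[a-zA-Z][^a-zA-Z\n]*[a-zA-Z])"   # at least two letters
    r"(?=[^" + SPECIAL + r"\n]*[" + SPECIAL + r"])"   # at least one special
    r"[^" + FORBIDDEN + r"\n]{8,60}$"                 # 8-60 allowed characters
)


def is_valid(text):
    """Return True when text satisfies all of the rules above."""
    return PATTERN.match(text) is not None


for candidate in ["123456789AB!", "1234567B89A!", "12345678AB", "12345AB!("]:
    print(candidate, is_valid(candidate))
```

Each look-ahead uses a negated class to skip straight to the character it needs, so none of them relies on .* and its backtracking.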
Part of the issue is that you're missing the .* in (?=[a-zA-Z]{2,}). However, your implementation of "two or more" letters is not correct unless the letters must be consecutive.
You'll see that the string 1234567B89A! fails to match, even with the correction. You can fix this like so:
(?=^.{8,60}$)(?=.*\d)(?=.*[a-zA-Z].*[a-zA-Z])(?!.*[&`()=[|;"''\]'<>]).*
The part I changed is (?=.*[a-zA-Z].*[a-zA-Z]) asserting that we can match a letter, zero or more other characters, and then another letter.
https://regex101.com/r/jEsK0S/1
Also, there's currently no assertion that there must be a special character, only an assertion of which ones shouldn't match. So I'd suggest adding another lookahead with a list of valid special characters.
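A quick way to see the difference between the letter checks is to run them side by side. The sketch below uses Python's re module and, as a simplifying assumption, leaves out the forbidden-character look-ahead so that only the letter assertion varies; the variant names are made up for this example.

```python
import re

# Shared length and digit checks; the forbidden-character check is omitted.
PREFIX = r"(?=^.{8,60}$)(?=.*\d)"

variants = {
    # Letters must appear right at the start of the string.
    "at_start": re.compile(PREFIX + r"(?=[a-zA-Z]{2,}).*"),
    # Letters may appear anywhere but must be adjacent.
    "consecutive": re.compile(PREFIX + r"(?=.*[a-zA-Z]{2,}).*"),
    # Two letters anywhere, not necessarily adjacent.
    "anywhere": re.compile(PREFIX + r"(?=.*[a-zA-Z].*[a-zA-Z]).*"),
}


def check(text):
    """Map each variant name to whether it matches text."""
    return {name: bool(rx.match(text)) for name, rx in variants.items()}


for sample in ["AB1234567!", "123456789AB!", "1234567B89A!"]:
    print(sample, check(sample))
```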
Since the 2+ alphabetical characters can appear anywhere in the string, you need to prepend your check for them with .* (as you have with the other character classes you're checking for); otherwise the positive lookaheads will, in this scenario, try to assert their appearance at the beginning of the string (position 0):
(?=^.{8,60}$)(?=.*\d)(?=.*[a-zA-Z]{2,})(?!.*[&`()=[|;"''\]'<>]).*

php check ncr with negative lookbehind and greedy doesn't work

I want to find an erroneous NCR (numeric character reference) that is missing its &# prefix and repair it. The Unicode code point is a 4- or 5-digit decimal number. I wrote this PHP code:
function repl0($m) {
return '&#'.$m[0];
}
$s = "This is a good 23200; sample ship";
echo "input1= ".htmlentities($s)."<br>";
$out1=preg_replace_callback('/(?<!#)(\d{4,5};)/','repl0',$s);
echo 'output1 = '.htmlentities($out1).'<br>';
The output is:
input1= This is a good 23200; sample ship
output1 = This is a good 2ಀ sample ship
The match only happens once according to the output message.
What I want is to match '23200;' instead of '3200;'.
The default should be greedy mode, so I thought it would capture the 5-digit number instead of the 4-digit one.
Do I misunderstand 'greedy' here? How can I get what I want?
The (?<!#)(\d{4,5};) pattern matches like this:
(?<!#) - matches a location that is not immediately preceded with #
(\d{4,5};) - then tries to match and consume four or five digits and a ; char immediately after these digits.
So, if you have the string input #32000;, 3 cannot be the starting character of a match, as it is preceded by #, but 2 can, since it is not preceded by a # and, starting at 2, there are four digits followed by a ; for the pattern to match.
What you need here is to curb the match on the left by adding a digit to the lookbehind,
(?<![#\d])(\d{4,5};)
With this trick, you ensure that the match cannot be immediately preceded with either # or a digit.
You say you finally used (?<!#)(?<!\d)\d{4,5};, and this pattern is functionally equivalent to the pattern above since the lookbehinds, as all lookarounds, "stand their ground", i.e. the regex index does not move when the lookaround patterns are matched. So, the check for a digit or a # char occurs at the same location in the string.
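The effect of adding the digit to the lookbehind can be sketched as follows. Python's re.sub is used here in place of PHP's preg_replace_callback (an assumption for illustration), and fix_ncr is a hypothetical helper that plays the role of repl0.

```python
import re

ORIGINAL = r"(?<!#)(\d{4,5};)"
FIXED = r"(?<![#\d])(\d{4,5};)"
TWO_LOOKBEHINDS = r"(?<!#)(?<!\d)\d{4,5};"


def fix_ncr(text, pattern):
    """Prefix every match of pattern with '&#', as repl0 does in the question."""
    return re.sub(pattern, lambda m: "&#" + m.group(0), text)


broken = "a good 23200; sample"
already_ok = "a good &#23200; sample"

print(fix_ncr(broken, FIXED))
print(fix_ncr(already_ok, FIXED))
# With the original pattern, a match can start one digit after the '#'.
print(fix_ncr(already_ok, ORIGINAL))
```

With the original pattern an already correct reference gets damaged, because the match can begin at the second digit; both fixed forms leave it alone.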

Regular expression with "not character" not matches as expected

I am trying to satisfy the following restrictions:
line has from 3 to 256 chars that are a-z, 0-9, dash - or dot .
this line cannot start or end with -
I want output like the following:
aaa -> good
aaaa -> good
-aaa -> bad
aaa- -> bad
---a -> bad
I have some regexes that don't give the right answer:
1) ^[^-][a-z0-9\-.]{3,256}[^-]$ gives all test lines as bad;
2) ^[^-]+[a-z0-9\-.]{3,256}[^-]+$ treats first three lines as one matching string since [^-] matches new line I guess.
3) ^[^-]?[a-z0-9\-.]{3,256}[^-]?$ (? for one or zero matching dash) gives all test lines as good
Where is the truth? I'm sensing it's either close to mine or much more complicated.
P.S. I use python 3 re module.
This one is almost correct: ^[^-][a-z0-9\-.]{3,256}[^-]$
The [^-] at the start and end represent one character already, so you will need to change {3,256} into {1,254}
Also, you probably only want a-z, 0-9 and . at the start and end (not just anything except -), so the full regex becomes:
^[a-z0-9.][a-z0-9\-.]{1,254}[a-z0-9.]$
Use a lookahead to confirm that the line matches your basic requirement ((?=^[0-9a-z.-]{3,256}$)) and then apply further restrictions:
^((?=^[0-9a-z.-]{3,256}$)[^-].*[^-])$
Regex101 link
You can use this:
^(?!-)[a-z0-9.-]{3,256}(?<!-)$
Where (?!-) is a negative lookahead assertion (not followed by a dash) and (?<!-) is a negative lookbehind (not preceded by a dash).
You don't want {3,256}. You want {1,254}, because the character classes at the beginning and end of the pattern each match one character as well, so you have to subtract them from the total number of characters allowed.
^[a-z0-9.][a-z0-9.-]{1,254}[a-z0-9.]$
Or, if you want to keep your values you can also use lookahead/behinds:
^(?=[a-z0-9.])[a-z0-9.-]{3,256}(?<=[a-z0-9.])$
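Since the question mentions Python 3's re module, here is a minimal sketch that runs two of the suggested patterns against the sample lines. The classify helper and the extra edge cases (a two-character line and lines of 256 and 257 characters) are additions made for illustration.

```python
import re

# Lookaround version: length stays {3,256}; the assertions consume nothing.
LOOKAROUND = re.compile(r"^(?!-)[a-z0-9.-]{3,256}(?<!-)$")

# Explicit version: first and last characters are matched separately,
# so the middle quantifier drops to {1,254}.
EXPLICIT = re.compile(r"^[a-z0-9.][a-z0-9.-]{1,254}[a-z0-9.]$")


def classify(line, rx=LOOKAROUND):
    """Return 'good' when the line satisfies the pattern, otherwise 'bad'."""
    return "good" if rx.match(line) else "bad"


for line in ["aaa", "aaaa", "-aaa", "aaa-", "---a"]:
    print(f"{line} -> {classify(line)}")
```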

Lookahead regular expression

I am looking to match the following pattern
(1)
10digits sometext (e.g. 1235873490 ABCD EFGK)
In a text that might have the pattern above, as well as very similar pattern like this one
(2)
10digits sometext decimal_number (e.g. 9835873490 VBGF XMF 23.233)
How I can write the regular expression to match only pattern (1) and ignore pattern (2)?
I have looked at negative lookaheads using something like this:
(\d{10})\s*([A-Za-z0-9]+(?:\s+[A-Za-z0-9]+)(?:\s+[A-Za-z0-9]+))\s*(?!(\d+.\d+))
but cannot get it to work. Any ideas? By the way, I am using c++ boost::regex.
First, start with the straightforward version:
(\d{10} # 10 digits
(?:\s+\w+)+) # some text, separated by spaces,
# at least one time
(?!\s*\d+\.\d+) # not followed by a decimal number
I changed your [A-Za-z0-9] to \w for simplicity, and allowed it to occur as many times as it wants.
However, this will also match the second string: it will gobble up the 23 at the end, then see that what follows (".233") is not a decimal number, so it will match.
To prevent this, we can say that it must be followed by a space or the end of the text:
(\d{10}(?:\s+\w+)+)
(?=\s|$) # it must be followed by a space or end of text
(?!\s*\d+\.\d+)
However, this still has a problem. Now, it will match up to "...XMF", but then see it is followed by a decimal number, and backtrack. It will go back to "...VBGF" and then match, since "VBGF" isn't followed by a decimal.
To prevent this, we can tell the regex that it can't backtrack once it has matched our main section:
(?> # added '?>': not allowed to backtrack once this group is matched
\d{10}(?:\s+\w+)+)
(?=\s|$)(?!\s*\d+\.\d+)
Alternately, if you know that there will always be 2 parts in sometext, this will also solve the backtracking:
(\d{10}(?:\s+\w+){2} # can only occur twice
)
(?=\s|$)(?!\s*\d+\.\d+)
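The progression above can be reproduced with a short sketch. Python's re module is used here instead of boost::regex (an assumption for illustration), and the atomic-group step is left out because (?>...) is only available in Python 3.11 and later; the fixed-count variant is shown in its place. The constant and helper names are hypothetical.

```python
import re

# Step 1: only the negative look-ahead.
NAIVE = re.compile(r"(\d{10}(?:\s+\w+)+)(?!\s*\d+\.\d+)")

# Step 2: also require a space or the end of the text after the match.
BOUNDED = re.compile(r"(\d{10}(?:\s+\w+)+)(?=\s|$)(?!\s*\d+\.\d+)")

# Alternative: exactly two words, which leaves nothing to backtrack into.
FIXED_COUNT = re.compile(r"(\d{10}(?:\s+\w+){2})(?=\s|$)(?!\s*\d+\.\d+)")

PATTERN_1 = "1235873490 ABCD EFGK"
PATTERN_2 = "9835873490 VBGF XMF 23.233"


def captured(rx, text):
    """Return the text captured by group 1, or None when nothing matches."""
    m = rx.search(text)
    return m.group(1) if m else None


for name, rx in [("naive", NAIVE), ("bounded", BOUNDED), ("fixed", FIXED_COUNT)]:
    print(name, captured(rx, PATTERN_1), "|", captured(rx, PATTERN_2))
```

The first two versions still produce a partial match on the second string, first by swallowing the 23 and then by backtracking to VBGF; only the fixed-count version rejects it outright.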

Regular Expression to match set of arbitrary codes

I am looking for some help on creating a regular expression that would work with a unique input in our system. We already have some logic in our keypress event that will only allow digits, and will allow the letter A and the letter M. Now I need to come up with a RegEx that can match the input during the onblur event to ensure the format is correct.
I have some examples below of what would be valid. The letter A represents an age, so it is always followed by up to 3 digits. The letter M can only occur at the end of the string.
Valid Input
1-M
10-M
100-M
5-7
5-20
5-100
10-20
10-100
A5-7
A10-7
A100-7
A10-20
A5-A7
A10-A20
A10-A100
A100-A102
Invalid Input
a-a
a45
4
This matches all of the samples.
/A?\d{1,3}-A?\d{0,3}M?/
Not sure if 10-A10M should or shouldn't be legal, or even if M can appear with numbers. If M can only appear on its own, without numbers:
/A?\d{1,3}-(A?\d{1,3}|M)/
Use the brute force method if you have a small amount of well defined patterns so you don't get bad corner-case matches:
^(\d+-M|\d+-\d+|A\d+-\d+|A\d+-A\d+)$
Here are the individual regexes broken out:
\d+-M <- matches anything like '1-M'
\d+-\d+ <- 5-7
A\d+-\d+ <- A5-7
A\d+-A\d+ <- A10-A20
/^[A]?[0-9]{1,3}-[A]?[0-9]{1,3}[M]?$/
Matches anything of the form:
A(optional)[1-3 numbers]-A(optional)[1-3 numbers]M(optional)
^A?\d+-(?:A?\d+|M)$
An optional A followed by one or more digits, a dash, and either another optional A and some digits or an M. The '(?: ... )' notation is a Perl 'non-capturing' set of parentheses around the alternatives; it means there will be no '$1' after the regex matches. Clearly, if you wanted to capture the various bits and pieces, you could - and would - do so, and the non-capturing clause might not be relevant any more.
(You could replace the '+' with '{1,3}' as JasonV did to limit the numbers to 3 digits.)
^A?\d{1,3}-(M|A?\d{1,3})$
^ -- the match must be done from the beginning
A? -- "A" is optional
\d{1,3} -- between one and 3 digits; [0-9]{1,3} also works
- -- A "-" character
(...|...) -- Either one of the two expressions
(M|...) -- Either "M" or...
(...|A?\d{1,3}) -- an optional "A" followed by at least one and at most three digits
$ -- the match should be done to the end
Some consequences of changing the format. If you do not put "^" at the beginning, the match may ignore an invalid beginning. For example, "MAAMA0-M" would be matched at "A0-M".
If, likewise, you leave $ out, the match may ignore an invalid trail. For example, "A0-MMMMAAMAM" would match "A0-M".
Using \d is usually preferred, as is \w for alphanumerics, \s for spaces, \D for non-digit, \W for non-alphanumeric or \S for non-space. But you must be careful that \d is not being treated as an escape sequence. You might need to write it \\d instead.
{x,y} means the last match must occur between x and y times.
? means the last match must occur once or not at all.
When using (), it is treated as one match. (ABC)? will match ABC or nothing at all.
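The anchored pattern and the two consequences of dropping an anchor can be checked with a short sketch. It is written in Python as an assumption for illustration (the question itself concerns an onblur handler), and the names STRICT, NO_START and NO_END are made up for this example.

```python
import re

STRICT = re.compile(r"^A?\d{1,3}-(M|A?\d{1,3})$")
NO_START = re.compile(r"A?\d{1,3}-(M|A?\d{1,3})$")   # "^" removed
NO_END = re.compile(r"^A?\d{1,3}-(M|A?\d{1,3})")     # "$" removed

VALID = [
    "1-M", "10-M", "100-M", "5-7", "5-20", "5-100", "10-20", "10-100",
    "A5-7", "A10-7", "A100-7", "A10-20", "A5-A7", "A10-A20", "A10-A100",
    "A100-A102",
]
INVALID = ["a-a", "a45", "4"]

print(all(STRICT.search(s) for s in VALID))
print(any(STRICT.search(s) for s in INVALID))

# Without an anchor, only part of an invalid string needs to match.
print(NO_START.search("MAAMA0-M").group(0))
print(NO_END.search("A0-MMMMAAMAM").group(0))
```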
I’d use this regular expression:
^(?:[1-9]\d{0,2}-(?:M|[1-9]\d{0,2})|A[1-9]\d{0,2}-A?[1-9]\d{0,2})$
This matches either:
<number>-M or <number>-<number>
A<number>-<number> or A<number>-A<number>
Additionally <number> must not begin with a 0.
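As a rough check of this stricter expression, the sketch below runs it against the valid samples from the question plus a few extra inputs. It is written in Python as an assumption for illustration, and the extra inputs are hypothetical cases chosen to exercise the no-leading-zero rule and the fact that this version does not accept an A-prefixed number followed by M.

```python
import re

PATTERN = re.compile(
    r"^(?:[1-9]\d{0,2}-(?:M|[1-9]\d{0,2})"      # <number>-M or <number>-<number>
    r"|A[1-9]\d{0,2}-A?[1-9]\d{0,2})$"          # A<number>-<number> or A<number>-A<number>
)

VALID = [
    "1-M", "10-M", "100-M", "5-7", "5-20", "5-100", "10-20", "10-100",
    "A5-7", "A10-7", "A100-7", "A10-20", "A5-A7", "A10-A20", "A10-A100",
    "A100-A102",
]

# Extra cases: leading zeros and an A-prefixed number followed by M.
REJECTED = ["a-a", "a45", "4", "05-7", "0-M", "A05-7", "A5-M"]

for text in VALID + REJECTED:
    print(text, bool(PATTERN.match(text)))
```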