Deciphering a simple regex

The regular expression in question is
(\d{3,4}[.-]?)+
sample text
707-7019-789
My progress so far
( )+ a capturing group, capturing one or more
\d{3,4} digit, in quantities 3 or 4
[.-]? dot (or something) or hyphen, in quantities zero or one <-- this is the part I'm interested in
From my understanding, this should match a 3- or 4-digit number, followed by a dot (or anything, since a dot matches anything) or a hyphen, bundled in a group, one or more times. Why doesn't this match
707+123-4567
then?

. in a character class [] is just a literal .; it does not have the special meaning "anything". [.-]? means "a dot or a hyphen or nothing", because the entire class is made optional with the ?.
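The difference can be checked directly. The following is a minimal sketch using Python's re module (an assumption, since the question does not name a regex flavor); the helper name matches_whole and the WILDCARD variant are made up for illustration.

```python
import re

# The pattern from the question; inside [...] the dot is a literal dot.
PATTERN = re.compile(r"(\d{3,4}[.-]?)+")

def matches_whole(text: str) -> bool:
    """Return True if the pattern accounts for the entire string."""
    return PATTERN.fullmatch(text) is not None

print(matches_whole("707-7019-789"))   # True: hyphens are in the class
print(matches_whole("707.7019.789"))   # True: literal dots are in the class
print(matches_whole("707+123-4567"))   # False: '+' is not in [.-]

# Hypothetical variant: outside a class the dot is a wildcard, so '+' is accepted.
WILDCARD = re.compile(r"(\d{3,4}.?)+")
print(WILDCARD.fullmatch("707+123-4567") is not None)   # True
```

Note that an unanchored search on 707+123-4567 still finds the leading 707, so whether the string "matches" depends on whether the whole string must be consumed.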

[.-]?
What this means literally:
character class [.-]
Match only one out of the following characters: . and - literally.
greedy quantifier ?
Match the preceding token between 0 and 1 times, as many times as possible (the lazy form would be ??).
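A single ? after a token is the greedy optional quantifier, while ?? is its lazy counterpart. A small sketch of the difference, assuming Python's re module and a made-up sample string:

```python
import re

text = "707-"

# Single ?: greedy, takes the separator when it is present.
greedy = re.match(r"\d{3}[.-]?", text)
# Double ??: lazy, skips the separator unless the rest of the pattern needs it.
lazy = re.match(r"\d{3}[.-]??", text)

print(greedy.group())   # 707-
print(lazy.group())     # 707
```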

The brackets remove the special meaning of the dot.
Brackets denote a character class (a list or range of characters).
So [.-] says: choose one character from the list . and -
It does not say: choose "anything" or - (matching anything is the meaning of . outside brackets).


Regex Help required for User-Agent Matching

Have used an online regex learning site (regexr) and created something that works but with my very limited experience with regex creation, I could do with some help/advice.
In IIS10 logs, there is a list for time, date... but I am only interested in the cs(User-Agent) field.
My Regex:
(scan\-\d+)(?:\w)+\.shadowserver\.org
which matches these:
scan-02.shadowserver.org
scan-15n.shadowserver.org
scan-42o.shadowserver.org
scan-42j.shadowserver.org
scan-42b.shadowserver.org
scan-47m.shadowserver.org
scan-47a.shadowserver.org
scan-47c.shadowserver.org
scan-42a.shadowserver.org
scan-42n.shadowserver.org
scan-42o.shadowserver.org
but what I would like it to do is:
Match a single number with the option of capturing more than one: scan-2 or scan-02 with an optional letter: scan-2j or scan-02f
Append the rest of the User Agent: .shadowserver.org to the regex.
I will then add it to an existing URL Rewrite rule (as a condition) to abort the request.
Any advice/help would be very much appreciated
Tried:
To write a regex for IIS10 to block requests from a certain user-agent
Expected:
It to work on single numbers as well as double/triple numbers with or without a letter.
(scan\-\d+)(?:\w)+\.shadowserver\.org
Input Text:
scan-2.shadowserver.org
scan-02.shadowserver.org
scan-2j.shadowserver.org
scan-02j.shadowserver.org
scan-17w.shadowserver.org
scan-101p.shadowserver.org
UPDATE:
I eventually came up with this:
scan\-[0-9]+[a-z]{0,1}\.shadowserver\.org
This is an explanation of your regex pattern. If you only want the solution, go directly to the end.
(scan\-\d+)(?:\w)+
(scan\-\d+) Group 1: matches the word scan followed by a literal -. You escaped the hyphen with a \, but an unescaped - outside a character class is also a literal hyphen, so the escape is not needed here. The hyphen is followed by \d+, which means one or more digits from 0-9, so there must be at least one digit. Whatever this group matches is saved in the first capturing group.
(?:\w)+ Non-capturing group: \w matches one word character, which is equal to [A-Za-z0-9_]. The + after the group means the whole group must match one or more times. Since the group contains only \w, this matches one or more word characters. The non-capturing group is redundant here; \w+ does the same thing.
Taking two examples:
The first example: scan-02.shadowserver.org
(scan\-\d+)(?:\w)+
scan matches the word scan in scan-02, and \- matches the hyphen after it, giving scan-. \d+ means one or more digits, so at first it matches the 02, giving scan-02. Next comes (?:\w)+, which must match at least one word character. It tries to match the period . but fails, because a period is not a word character. Is it over at this point? No. The regex engine backtracks into the previous \d+, which this time matches only the 0 in scan-02, so scan-0 is saved in the first capturing group, and (?:\w)+ then matches the 2. Why does the engine go back to \d+? Because both \d+ and (?:\w)+ use +, each needs at least one character (a digit and a word character respectively), and the engine gives back characters from \d+ until both requirements can be met.
The second example: scan-2.shadowserver.org
(scan\-\d+)(?:\w)+
(scan\-\d+) matches scan-2. (?:\w)+ then tries to match the period after scan-2 and fails, and this is the important point: \d+ matched only one digit, so it has nothing to give back. The engine then retries from the next starting position, the character c in scan-2.shadowserver.org, where the s in (scan\-\d+) fails to match c. It keeps trying at each later position, and in the end the overall match fails.
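The backtracking described in the two examples can be observed by inspecting group 1. This is a sketch in Python's re module, used here only for illustration (the engine behind IIS URL Rewrite is not exercised); the helper name group1 is made up.

```python
import re
from typing import Optional

# The original pattern from the question, unchanged.
ORIGINAL = re.compile(r"(scan\-\d+)(?:\w)+\.shadowserver\.org")

def group1(host: str) -> Optional[str]:
    """Return what the first capturing group holds, or None if there is no match."""
    m = ORIGINAL.search(host)
    return m.group(1) if m else None

print(group1("scan-02.shadowserver.org"))    # scan-0  (\d+ gave the 2 back to \w)
print(group1("scan-15n.shadowserver.org"))   # scan-15 (\w takes the letter n)
print(group1("scan-2.shadowserver.org"))     # None    (nothing left for \w)
```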
Simple solution:
(scan-\d+[a-z]?)\.shadowserver\.org
Explanation
(scan-\d+[a-z]?) Group 1: captures the word scan, followed by a literal -, followed by \d+ (one or more digits), followed by an optional lowercase letter [a-z]?. The ? makes the [a-z] part optional; without it, [a-z] would require exactly one lowercase letter.
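A quick check of the simple solution against the input text from the question, sketched with Python's re module as a stand-in (the IIS rule itself is not tested here); the names SOLUTION, hosts and captured are made up for the example.

```python
import re

# The proposed solution: digits, then at most one lowercase letter.
SOLUTION = re.compile(r"(scan-\d+[a-z]?)\.shadowserver\.org")

hosts = [
    "scan-2.shadowserver.org",
    "scan-02.shadowserver.org",
    "scan-2j.shadowserver.org",
    "scan-02j.shadowserver.org",
    "scan-17w.shadowserver.org",
    "scan-101p.shadowserver.org",
]

# Map each host to the text captured by group 1 (None would mean no match).
captured = {}
for host in hosts:
    m = SOLUTION.fullmatch(host)
    captured[host] = m.group(1) if m else None

for host, value in captured.items():
    print(host, "->", value)
```

Every sample host matches, and group 1 holds the full scan-NN or scan-NNx prefix rather than a truncated one.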

Find certain colons in string using Regex

I'm trying to search for colons in a given string so as to split the string at the colon for preprocessing based on the following conditions
1. Preceded or followed by a word, e.g. A Book: Chapter 1 or A Book :Chapter 1
2. Do not match if it is part of an emoticon, i.e. :( or ): or :/ or :-) etc.
3. Do not match if it is part of a given time, i.e. 16:00 etc.
I've come up with a regex as such
(\:)(?=\w)|(?<=\w)(\:)
which satisfies conditions 1 & 2 but still fails on condition 3, as it matches the colon present in the string representation of a time. How do I fix this?
edit: it has to be in a single regex statement if possible
You can use
(:\b|\b:)(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b)
Details:
(:\b|\b:) - Group 1: a : that is either preceded or followed with a word char
(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b) - there should be no one or two digits right after : (followed with a word boundary) if the : is preceded with a single or two digits (preceded with a word boundary).
Note :\b is equal to :(?=\w) and \b: is equal to (?<=\w):.
If you need to get the same capturing groups as in your original pattern, replace (:\b|\b:) with (?:(:)\b|\b(:)).
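As a rough check of the pattern above, here is a sketch in Python's re module (an assumption; the question does not name a flavor, and Python accepts these fixed-width lookbehinds). The helper colon_positions and the sample strings are made up for illustration.

```python
import re
from typing import List

# The single-regex solution: a colon next to a word character,
# unless it sits between one or two digits on each side (a time).
COLON = re.compile(r"(:\b|\b:)(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b)")

def colon_positions(text: str) -> List[int]:
    """Return the index of every colon the pattern accepts as a split point."""
    return [m.start() for m in COLON.finditer(text)]

print(colon_positions("A Book: Chapter 1"))   # [6]
print(colon_positions("A Book :Chapter 1"))   # [7]
print(colon_positions("Meet at 16:00"))       # []
print(colon_positions("oh no :("))            # []
```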
More flexible solution
Note that excluding matches can be done with a simpler pattern that matches and captures what you need and just matches what you do not need. This is called "best regex trick ever". So, you may use a regex like
8:|:[PD]|\d+(?::\d+)+|(:\b|\b:)
that will match 8:, :P, :D, one or more digits and then one or more sequences of : and one or more digits, or will match and capture into Group 1 a : char that is either preceded or followed with a word char. All you need to do is to check if Group 1 matched, and implement required extraction/replacement logic in the code.
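The "check if Group 1 matched" step might look like the following sketch, again assuming Python's re module; the helper split_points and the sample strings are hypothetical.

```python
import re
from typing import List

# Unwanted cases are matched without capturing; only wanted colons land in group 1.
TRICK = re.compile(r"8:|:[PD]|\d+(?::\d+)+|(:\b|\b:)")

def split_points(text: str) -> List[int]:
    """Return indices of colons captured in group 1, ignoring the discarded matches."""
    return [m.start(1) for m in TRICK.finditer(text) if m.group(1) is not None]

print(split_points("Title: see :P at 16:00"))   # [5]
print(split_points("8: ok"))                    # []
```

The emoticon :P and the time 16:00 are still matched, but because they are consumed by the non-capturing alternatives, group 1 stays empty for them and the code simply skips those matches.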
Word characters \w include digits: \w is [a-zA-Z0-9_]
So just use [a-zA-Z] instead
(\:)(?=[a-zA-Z])|(?<=[a-zA-Z])(\:)

Match this regex on perl

I am fairly new with Perl, and even more so with regex.
Have been trying to match the following, but without success:
First, 3 to 4 letters (ideally case insensitive)
Optionally a space (but not mandatory)
Then, also optionally, a known uppercase letter (M) and a number out of 1, 2, 3
An example of a valid string would be abc, but also DEFG M2. Invalid would be mem M, for example
What I have so far is:
$myExpr ~= m/^[a-z,A-z]{3,4}M[1,2,3]$/i
Not sure how to make the M and numbers optional
Why don't you try the following regular expression for it:
$myExpr =~ m/^([a-zA-Z]{3,4})(\s|)(M|)([1-3]|)$/;
([a-zA-Z]{3,4}) - Group of any character in this class: [a-zA-Z] with 3 to 4 repetition.
(\s|) - Either there is a whitespace character (a space) or not.
(M|) - Either there is an uppercase M or not.
([1-3]|) - Either there is a character from this class: [1-3] or not.
Or try the following, which I personally recommend:
$myExpr =~ m/^([a-zA-Z]{3,4})(\s{0,1})(M{0,1})([1-3]{0,1})$/;
([a-zA-Z]{3,4}) - Group of any character in this class: [a-zA-Z] with 3 to 4 repetition i.e., it should contain minimum of 3 characters and maximum of 4.
(\s{0,1}) - Group of \s with 0 to 1 repetition i.e., it's optional.
(M{0,1}) - Group of character M with 0 to 1 repetition i.e., it's optional.
([1-3]{0,1}) - Group of any digit from 1 to 3 with 0 to 1 repetition i.e., it's optional.
Group your optional symbols with (?:) and use the "zero or one" quantifier ?.
$myExpr =~ m/^[a-zA-Z]{3,4}(?: M[123])?$/
I've also fixed errors in your regexp: you don't use , in character classes (that would literally mean "match ,"), fixed the A-Z range, and removed the /i modifier, since you didn't say whether you need a lowercase m, and the first range already covers both lowercase and uppercase letters.
You can use the following regex. You don't need a comma inside the character class []. Also remove the i modifier, as you need to match an uppercase M.
$myExpr =~ m/^[a-zA-Z]{3,4}(?: M[123])?$/
If the space is optional, add a ? after the space too, i.e. (?: ?M[123])?.
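To compare the suggestions against the examples from the question, here is a sketch that ports two of the patterns to Python's re module (a translation, not Perl; the syntax used is common to both flavors). The names GROUPED and INDEPENDENT are made up.

```python
import re

# Space, M and digit bundled in one optional group (space itself optional).
GROUPED = re.compile(r"^[a-zA-Z]{3,4}(?: ?M[123])?$")
# Each of space, M and digit optional on its own, as in the {0,1} variant.
INDEPENDENT = re.compile(r"^([a-zA-Z]{3,4})(\s{0,1})(M{0,1})([1-3]{0,1})$")

for sample in ["abc", "DEFG M2", "abcM1", "mem M"]:
    print(
        repr(sample),
        GROUPED.match(sample) is not None,
        INDEPENDENT.match(sample) is not None,
    )
```

Under these assumptions the grouped pattern rejects mem M, which the question lists as invalid, while the variant with independently optional parts accepts it, because M may appear there without a following digit.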

Regex for 9-digit phone number dot-separated

I would like to check if a phone number contains exactly 3 digits - dot - 3 digits - dot - 3 digits. (e.g. 123.456.789)
So far I have this, but it doesn't work:
^(\d{3}\){2}\d{4}$
Note that an escaped bracket \) loses its special meaning in regex and the pattern becomes invalid since the capturing group is not closed.
If you want to match a dot with a regex, you need to include it in your pattern, and if you say 3 digits must be at the end, there is no point in declaring 4 digits with \d{4}.
^(\d{3}\.){2}\d{3}$
        ^       ^
or if we expand the first group:
^\d{3}\.\d{3}\.\d{3}$
So the whole fix consists of adding a dot after the second backslash and adjusting the final limiting quantifier.
Note that for mostly stylistic reasons (since the efficiency gain is insignificant) I'd use a non-capturing group with the first regex variant:
^(?:\d{3}\.){2}\d{3}$
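A short validation sketch, assuming Python's re module (the question does not name a language); the function name is_dotted_phone is made up. The re.ASCII flag is an added assumption that restricts \d to 0-9, and fullmatch is used so a trailing newline is not tolerated.

```python
import re

# Three digits, dot, three digits, dot, three digits.
PHONE = re.compile(r"^(?:\d{3}\.){2}\d{3}$", re.ASCII)

def is_dotted_phone(text: str) -> bool:
    """Return True only for strings shaped exactly like 123.456.789."""
    return PHONE.fullmatch(text) is not None

print(is_dotted_phone("123.456.789"))    # True
print(is_dotted_phone("123.456.7890"))   # False: four digits at the end
print(is_dotted_phone("123-456-789"))    # False: hyphens are not dots
```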

REGEX Repeater "Or" Operator

I am looking to match a regex with either 2 [0-9] repeats (and then some other pattern)
[0-9]{2}[A-z]{4}
OR 6 [0-9] repeats (and then some other pattern)
[0-9]{6}[A-z]{4}
The following is too inclusive:
[0-9]{2,6}[A-z]{4}
QUESTION
Is there a way that I can specify either 2 or 6 repeats?
You can use the or | like this within a non-capturing group:
(?:[0-9]{2}|[0-9]{6})[A-z]{4}
Be aware that using [A-z] doesn't only include lower and upper case letters, but also [, \, ], ^, _, and `, which lie between Z and a in the ASCII code points. Use [A-Za-z] for letters, as pointed out by @AlanMoore in his comment.
This should work
(?:[0-9]{2}|[0-9]{6})[a-zA-Z]{4}
Do you have some test cases I can verify it with?
12asdf - passes
123456asdf - passes
1234asdf - fails
However, if you don't anchor the start of the regex to a word (\b) or line boundary (^), the 1234asdf will have 34asdf as a partial match.
So either
\b(?:[0-9]{2}|[0-9]{6})[a-zA-Z]{4}
or
^(?:[0-9]{2}|[0-9]{6})[a-zA-Z]{4}
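The partial match and the effect of the anchor can be reproduced with a small sketch, assuming Python's re module; the helper name found is made up.

```python
import re
from typing import Optional, Pattern

UNANCHORED = re.compile(r"(?:[0-9]{2}|[0-9]{6})[a-zA-Z]{4}")
WORD_ANCHORED = re.compile(r"\b(?:[0-9]{2}|[0-9]{6})[a-zA-Z]{4}")

def found(pattern: Pattern[str], text: str) -> Optional[str]:
    """Return the text of the first match anywhere in the string, or None."""
    m = pattern.search(text)
    return m.group() if m else None

print(found(UNANCHORED, "12asdf"))         # 12asdf
print(found(UNANCHORED, "123456asdf"))     # 123456asdf
print(found(UNANCHORED, "1234asdf"))       # 34asdf (partial match)
print(found(WORD_ANCHORED, "1234asdf"))    # None
```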
As a quick rundown of the regex changes
(?: ) creates a non capturing group
| selects between the alternatives [0-9]{2} and [0-9]{6}
^ matches the start of a line
$ matches the end of a line
\b matches a word boundary
[a-zA-Z] is being used instead of [A-z] as it's likely what was intended (all alpha characters, regardless of case)
You can also replace your [0-9]s with \d, which is shorthand for any digit. The best way I can think of to write this, and not get partial matches, is as follows
(?:\b|^)(?:\d{2}|\d{6})[a-zA-Z]{4}(?:\b|$)
The classic way would be:
(?:[0-9]{2}|[0-9]{6})[A-z]{4}
[Literally as [0-9]{2} OR [0-9]{6}]
But you can also use this one, which should be a little more efficient than the above with less potential backtracking:
[0-9]{2}(?:[0-9]{4})?[A-z]{4}
[Here, [0-9]{2} is followed by an optional further 4 [0-9], which makes a total of 6 [0-9] as required]
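That the two forms accept the same inputs can be spot-checked as below. This is a sketch in Python's re module; it swaps in [A-Za-z] for [A-z] to keep the letter class strict, and it checks only which strings are accepted, not the efficiency claim. The names ALTERNATION, OPTIONAL_TAIL and accepted are made up.

```python
import re

ALTERNATION = re.compile(r"(?:[0-9]{2}|[0-9]{6})[A-Za-z]{4}")
OPTIONAL_TAIL = re.compile(r"[0-9]{2}(?:[0-9]{4})?[A-Za-z]{4}")

# For 1 to 7 leading digits, record whether each pattern accepts the whole string.
accepted = {}
for n in range(1, 8):
    candidate = "1" * n + "abcd"
    accepted[n] = (
        ALTERNATION.fullmatch(candidate) is not None,
        OPTIONAL_TAIL.fullmatch(candidate) is not None,
    )

for n, result in accepted.items():
    print(n, result)
```

Both patterns accept exactly the candidates with 2 or 6 digits and reject the rest.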
You might not be aware that [A-z] matches letters and some other characters, but it actually does.
The range [A-z] effectively is equivalent to:
[A-Z\[\\\]^_`a-z]
Notice that the additional characters that match are:
[ \ ] ^ _ `
[spaces are included deliberately for separation and are not part of the character set]
This is because those characters sit between the uppercase letters and the lowercase letters in the ASCII range of the Unicode table.
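The extra characters can be listed programmatically. A minimal sketch, assuming Python and its re module; the variable name extras is made up.

```python
import re

# Every code point from 'A' to 'z' that is not a letter.
extras = [chr(c) for c in range(ord("A"), ord("z") + 1) if not chr(c).isalpha()]
print(extras)   # ['[', '\\', ']', '^', '_', '`']

# Each of them is accepted by [A-z] but rejected by [A-Za-z].
for ch in extras:
    print(repr(ch), bool(re.fullmatch(r"[A-z]", ch)), bool(re.fullmatch(r"[A-Za-z]", ch)))
```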
Not obvious, but yes:
(?:\d{2}|\d{6})