Regex to match time ranges involving am/pm like 7am-10pm - regex

I have written the following regex
(1[012]|[1-9])(am|pm)\-(1[012]|[1-9])(am|pm)
to match following kind of time formats:
7am-10pm (matches correctly and creates 4 match groups 7, am, 10, pm)
13am-10pm (this should not be matched, however it matches and creates 4 match groups 3, am, 10, pm)
10pm (this doesn't match as expected because it doesn't specify the time range end)
111am-10pm (this should not be matched, however it matches and creates 4 match groups 11, am, 10, pm)
How can I improve my regex so that I don't need to repeat the digit and am/pm patterns, and so that it also does the following:
it captures only the time range components, e.g. for 7am-10am there should be only 2 match groups: 7am, 10am.
it matches only proper hours, e.g. 111am or 13pm should be considered a no-match.
I don't know if it's possible with a regex, but can we make the regex match only correct time ranges, e.g. 7am-1pm should match, but 4pm-1pm should be considered a no-match?
Note: I am using Ruby 2.2.1
Thanks.

First let's see what you did wrong :
13am-10pm (this should not be matched, however it matches and creates 4 match groups 3, am, 10, pm)
it matches only proper hours for e.g. 111am or 13pm etc should be considered a no-match.
This matches because the alternative [1-9] in (1[012]|[1-9]) accepts a single digit and nothing stops the match from starting in the middle of a number: in 13am-10pm the engine skips the 1 and matches 3am-10pm.
To fix this, you should allow either one [1-9] digit or a 1 followed by [0-2], and nothing else in front of it. Since we do not know where in the text the match starts, we'll use a word boundary to be sure we are at a "word start".
Since you do not want to capture the number on its own but the whole time including am|pm, you can use a non-capturing group for the hour:
\b((?:1[0-2]|[1-9])[ap]m)
Then it's simply a matter of repeating ourselves and adding a dash:
\b((?:1[0-2]|[1-9])[ap]m)-((?:1[0-2]|[1-9])[ap]m)
Regarding point 3: yes, you could do this with a regex, but you are better off adding a logical check once you have groups 1 and 2 to see whether the time range really makes sense.
All in all this is what you get :
# \b((?:1[0-2]|[1-9])[ap]m)-((?:1[0-2]|[1-9])[ap]m)
#
#
# Assert position at a word boundary «\b»
# Match the regular expression below and capture its match into backreference number 1 «((?:1[0-2]|[1-9])[ap]m)»
# Match the regular expression below «(?:1[0-2]|[1-9])»
# Match either the regular expression below (attempting the next alternative only if this one fails) «1[0-2]»
# Match the character “1” literally «1»
# Match a single character in the range between “0” and “2” «[0-2]»
# Or match regular expression number 2 below (the entire group fails if this one fails to match) «[1-9]»
# Match a single character in the range between “1” and “9” «[1-9]»
# Match a single character present in the list “ap” «[ap]»
# Match the character “m” literally «m»
# Match the character “-” literally «-»
# Match the regular expression below and capture its match into backreference number 2 «((?:1[0-2]|[1-9])[ap]m)»
# Match the regular expression below «(?:1[0-2]|[1-9])»
# Match either the regular expression below (attempting the next alternative only if this one fails) «1[0-2]»
# Match the character “1” literally «1»
# Match a single character in the range between “0” and “2” «[0-2]»
# Or match regular expression number 2 below (the entire group fails if this one fails to match) «[1-9]»
# Match a single character in the range between “1” and “9” «[1-9]»
# Match a single character present in the list “ap” «[ap]»
# Match the character “m” literally «m»
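The logical check suggested for point 3 could look roughly like the sketch below. It is written in Python for illustration (the question uses Ruby, where the same pattern should work with minimal changes). The helper names `to_24h` and `parse_range` are hypothetical, the trailing `\b` is an addition not present in the answer's pattern, and the sketch assumes a range never crosses midnight.

```python
import re

# Pattern from the answer above, with a trailing \b added (an assumption)
# so that input such as "7am-10pmx" is not accepted.
TIME_RANGE = re.compile(r"\b((?:1[0-2]|[1-9])[ap]m)-((?:1[0-2]|[1-9])[ap]m)\b")


def to_24h(label):
    """Convert a label such as '7am' or '12pm' to an hour in 0..23."""
    hour = int(label[:-2]) % 12  # 12 becomes 0; pm adds 12 below
    return hour + 12 if label.endswith("pm") else hour


def parse_range(text):
    """Return (start, end) if text holds a sensible time range, else None."""
    match = TIME_RANGE.search(text)
    if match is None:
        return None
    start, end = match.groups()
    # The ordering check lives outside the regex, as the answer recommends.
    if to_24h(start) >= to_24h(end):
        return None
    return start, end


for sample in ["7am-10pm", "13am-10pm", "111am-10pm", "10pm", "7am-1pm", "4pm-1pm"]:
    print(sample, "->", parse_range(sample))
```

Keeping the comparison in ordinary code means the regex only has to recognise the shape of the input, which keeps it short and readable.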

You are missing ^ (start of the line) in your regex, and that's why it matches starting from the middle of the input.
You have to use:
^(1[012]|[1-9])(am|pm)\-(1[012]|[1-9])(am|pm)
Better solution: you can also use \b (word boundary) if your pattern doesn't always start at the beginning of a line.
\b(1[012]|[1-9])(am|pm)\-(1[012]|[1-9])(am|pm)\b
See DEMO.

Related

Regex Help required for User-Agent Matching

Have used an online regex learning site (regexr) and created something that works but with my very limited experience with regex creation, I could do with some help/advice.
In IIS10 logs, there is a list for time, date... but I am only interested in the cs(User-Agent) field.
My Regex:
(scan\-\d+)(?:\w)+\.shadowserver\.org
which matches these:
scan-02.shadowserver.org
scan-15n.shadowserver.org
scan-42o.shadowserver.org
scan-42j.shadowserver.org
scan-42b.shadowserver.org
scan-47m.shadowserver.org
scan-47a.shadowserver.org
scan-47c.shadowserver.org
scan-42a.shadowserver.org
scan-42n.shadowserver.org
scan-42o.shadowserver.org
but what I would like it to do is:
Match a single number with the option of capturing more than one: scan-2 or scan-02 with an optional letter: scan-2j or scan-02f
Append the rest of the User Agent: .shadowserver.org to the regex.
I will then add it to an existing URL Rewrite rule (as a condition) to abort the request.
Any advice/help would be very much appreciated
Tried:
To write a regex for IIS10 to block requests from a certain user-agent
Expected:
It to work on single numbers as well as double/triple numbers with or without a letter.
(scan\-\d+)(?:\w)+\.shadowserver\.org
Input Text:
scan-2.shadowserver.org
scan-02.shadowserver.org
scan-2j.shadowserver.org
scan-02j.shadowserver.org
scan-17w.shadowserver.org
scan-101p.shadowserver.org
UPDATE:
I eventually came up with this:
scan\-[0-9]+[a-z]{0,1}\.shadowserver\.org
This is an explanation of your regex pattern. If you only want the solution, go directly to the end.
(scan\-\d+)(?:\w)+
(scan\-\d+) Group 1: matches the word scan followed by a literal -. You escaped the hyphen with a \, but outside a character class an unescaped - is also a literal hyphen, so the escape is not needed here. The - is followed by \d+, which means one or more digits from 0-9 (there must be at least one digit). Whatever this group matches is saved in the first capturing group.
(?:\w)+ Non-capturing group: \w matches one word character, which is equal to [A-Za-z0-9_], and the plus sign + after the group means match the whole group one or more times. Since the group contains only \w, it matches one or more word characters. Note that the non-capturing group is redundant here; \w+ can be used directly.
Taking two examples:
The first example: scan-02.shadowserver.org
(scan\-\d+)(?:\w)+
scan matches the word scan in scan-02 and \- matches the hyphen after it. \d+ (one or more digits) first matches 02, so the value so far is scan-02. Then comes (?:\w)+, which must match at least one word character. It tries to match the period . and fails, because the period is not a word character. Is it over at this point? No: the regex engine backtracks into the previous \d+, which this time matches only the 0 in scan-02, so scan-0 is saved in the first capturing group, and (?:\w)+ then matches the 2. Why does the engine go back to \d+? Because both \d+ and (?:\w)+ carry a +, they require at least one digit and at least one word character respectively, and the engine gives characters back from the first to satisfy the second.
The second example: scan-2.shadowserver.org
(scan\-\d+)(?:\w)+
(scan\-\d+) matches scan-2, then (?:\w)+ tries to match the period after scan-2 and fails. This is the important point: \d+ has matched only one digit and cannot give it up, so there is nothing to backtrack into. The engine then returns to the string scan-2.shadowserver.org and tries the whole pattern again starting from the character c; the s in (scan\-\d+) fails to match c, and the same happens at every later position, so in the end the match fails.
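The two walkthroughs above can be reproduced with a short script. This is only an illustration in Python; the variable names are made up, and IIS URL Rewrite uses its own regex engine, although the backtracking behaviour shown here is the same for this pattern.

```python
import re

# The pattern from the question, unchanged.
original = re.compile(r"(scan\-\d+)(?:\w)+\.shadowserver\.org")

# First example: \d+ gives back the "2" so that (?:\w)+ has something to match.
two_digits = original.search("scan-02.shadowserver.org")
print(two_digits.group(1))  # prints: scan-0

# Second example: \d+ holds a single digit and cannot give it up.
one_digit = original.search("scan-2.shadowserver.org")
print(one_digit)  # prints: None
```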
Simple solution:
(scan-\d+[a-z]?)\.shadowserver\.org
Explanation
(scan-\d+[a-z]?) Group 1: captures the word scan, followed by a literal -, followed by \d+ (one or more digits), followed by an optional lowercase letter [a-z]?. The ? makes the [a-z] part optional; without it, [a-z] would require exactly one lowercase letter.
See regex demo
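As a quick check, the simple solution can be run against the input text from the question. This is a sketch in Python with hypothetical variable names; `fullmatch` is used here only because each sample is a bare host name, whereas a real User-Agent condition would normally search within a longer string.

```python
import re

fixed = re.compile(r"(scan-\d+[a-z]?)\.shadowserver\.org")

hosts = [
    "scan-2.shadowserver.org",
    "scan-02.shadowserver.org",
    "scan-2j.shadowserver.org",
    "scan-02j.shadowserver.org",
    "scan-17w.shadowserver.org",
    "scan-101p.shadowserver.org",
]

# Map each host to what group 1 captured, or None when there is no match.
captured = {}
for host in hosts + ["scan-.shadowserver.org", "scan-2jj.shadowserver.org"]:
    match = fixed.fullmatch(host)
    captured[host] = match.group(1) if match else None

for host, group in captured.items():
    print(host, "->", group)
```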

regex match two words based on a matching substring

there are 4 strings as shown below
ABC_FIXED_20220720_VALUEABC.csv
ABC_FIXED_20220720_VALUEABCQUERY_answer.csv
ABC_FIXED_20220720_VALUEDEF.csv
ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv
Two strings are considered as matched based on a matching substring value (VALUEABC, VALUEDEF in the above shown strings). Thus I am looking to match first 2 (having VALUEABC) and then next 2 (having VALUEDEF). The matched strings are identified based on the same value returned for one regex group.
What I tried so far
ABC.*[0-9]{8}_(.*[^QUERY_answer])(?:QUERY_answer)?.csv
This returns regex group-1 (from (.*[^QUERY_answer])) value "VALUEABC" for first 2 strings and "VALUEDEF" for next 2 strings and thus desired matching achieved.
But the problem with the above regex is that as soon as the value ends with any of the characters in "QUERY_answer", the regex no longer matches. For instance, the 2 strings below don't match at all because VALUESTU ends with "U":
ABC_FIXED_20220720_VALUESTU.csv
ABC_FIXED_20220720_VALUESTUQUERY_answer.csv
I tried to use Negative Lookahead:
ABC.*[0-9]{8}_(.*(?!QUERY_answer))(?:QUERY_answer)?.csv
but in this case the grouping-1 value is returned as "VALUESTU" for first string and "VALUESTUQUERY_answer" for second string, thus effectively making the 2 strings unmatched.
Any way to achieve the desired matching?
With your shown samples please try following regex.
^ABC_[^_]*_[0-9]+_(.*?)(?:QUERY_answer)?\.csv$
OR to match exact 8 digits try:
^ABC_[^_]*_[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv$
Here is the online demo for above regex.
Explanation: Adding detailed explanation for above regex.
^ABC_[^_]*_ ##Matching from starting of value ABC followed by _ till next occurrence of _.
[0-9]+_ ##Matching continuous occurrences of digits followed by _ here.
(.*?) ##Creating one and only capturing group using lazy match which is opposite of greedy match.
(?:QUERY_answer)? ##In a non-capturing group matching QUERY_answer and keeping it optional.
\.csv$ ##Matching dot literal csv at the end of the value.
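To show how the captured group can be used to pair the files, here is a small Python sketch. The function name `group_by_key` is hypothetical, and the 8-digit variant of the pattern is used.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"^ABC_[^_]*_[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv$")


def group_by_key(filenames):
    """Group file names by the value captured in group 1."""
    groups = defaultdict(list)
    for name in filenames:
        match = PATTERN.match(name)
        if match:
            groups[match.group(1)].append(name)
    return dict(groups)


files = [
    "ABC_FIXED_20220720_VALUEABC.csv",
    "ABC_FIXED_20220720_VALUEABCQUERY_answer.csv",
    "ABC_FIXED_20220720_VALUEDEF.csv",
    "ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv",
    "ABC_FIXED_20220720_VALUESTU.csv",
    "ABC_FIXED_20220720_VALUESTUQUERY_answer.csv",
]

pairs = group_by_key(files)
for key, names in pairs.items():
    print(key, names)
```

Because `(.*?)` is lazy, it stops as soon as the optional `QUERY_answer` and the `.csv` suffix can complete the match, so values ending in U, Y, r and so on are handled the same as any other.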
You need
ABC.*[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv
See the regex demo.
Note
.*[^QUERY_answer] matches any zero or more chars other than line break chars as many as possible, and then any one char other than Q, U, E, etc., i.e. any char in the negated character class. This is replaced with .*?, to match any zero or more chars other than line break chars as few as possible.
(?:QUERY_answer)? - the group is made non-capturing to reduce grouping complexity.
\.csv - the . is escaped to match a literal dot.

How to conditionally expect particular characters if a prior regex matched?

I want to expect some characters only if a prior regex matched. If not, no characters (empty string) is expected.
For instance, if after the first four characters appears a string out of the group (A10, B32, C56, D65) (kind of enumeration) then a "_" followed by a 3-digit number like 123 is expected. If no element of the mentioned group appears, no other string is expected.
My first attempt was this but the ELSE branch does not work:
^XXX_(?<DT>A12|B43|D14)(?(DT)(_\d{1,3})|)\.ZZZ$
XXX_A12_123.ZZZ --> match
XXX_A11.ZZZ --> match
XXX_A12_abc.ZZZ --> no match
XXX_A23_123.ZZZ --> no match
These are examples of filenames. If the filename contains a string of the mentioned group, like A12 or C56, then I expect this element to be followed by an underscore and 1 to 3 digits. If the filename does not contain a string of that group (no character, or a character sequence different from the strings in the group), then I don't want to see the underscore followed by 1 to 3 digits.
For instance, I could extend the regex to
^XXX_(?<DT>A12|B43|D14)_\d{5}(?(DT)(_\d{1,3})|)_someMoreChars\.ZZZ$
...and then I want these filenames to be valid:
XXX_A12_12345_123_wellDone.ZZZ
XXX_Q21_00000_wellDone.ZZZ
XXX_Q21_00000_456_wellDone.ZZZ
...but this is invalid:
XXX_A12_12345_wellDone.ZZZ
How can I make the ELSE branch of the conditional statement work?
In the end I intend to have two groups like
Group A: (A11, B32, D76, R33)
Group B: (A23, C56, H78, T99)
If an element of group A occurs in the filename then I expect to find _\d{1,3} in the filename.
If an element of group B occurs in the filename then the _\d{1,3} shall be optional (it may or may not occur in the filename).
I ended up in this regex:
^XXX_(?:(?<DT>A12|B43|D14))?(?(DT)(_\d{5}_\d{1,3})|(?!(?&DT))(?!.*_\d{3}(?!\d))).*\.ZZZ$
^XXX_(?:(?<DT>A12|B43|D14))?_\d{5}(?(DT)(_\d{1,3})|(?!(?&DT))(?!.*_\d{3}(?!\d))).+\.ZZZ$
Since I have to use this regex in the OpenApi @Pattern annotation I have the problem that I get the error:
Conditionals are not supported in this regex dialect.
As @The fourth bird suggested, alternation seems to do the trick:
XXX_((((A12|B43|D14)_\d{5}_\d{1,3}))|((?:(A10|B10|C20)((?:_\d{5}_\d{3})|(?:_\d{3}))))).*\.ZZZ$
The else branch is the part after the |, but if you also want to match the 2nd example, the if clause would not work as you have already matched one of A12|B43|D14
The named capture group is not optional, so the if clause will always be true.
What you can do instead is use an alternation to match either the numeration part followed by an underscore and 3 digits, or match an uppercase char and 2 digits.
^XXX_(?:(?<DT>A12|B43|D14)_\d{1,3}|[A-Z]\d{2})\.ZZZ$
Regex demo
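The alternation can be checked against the four filenames from the question. The sketch below uses Python, where a named group is written `(?P<DT>...)` instead of `(?<DT>...)`; the variable names are made up for the example.

```python
import re

# Alternation version of the pattern, using Python's named-group syntax.
alternation = re.compile(r"^XXX_(?:(?P<DT>A12|B43|D14)_\d{1,3}|[A-Z]\d{2})\.ZZZ$")

samples = [
    "XXX_A12_123.ZZZ",
    "XXX_A11.ZZZ",
    "XXX_A12_abc.ZZZ",
    "XXX_A23_123.ZZZ",
]

results = {name: alternation.match(name) is not None for name in samples}
for name, matched in results.items():
    print(name, "-->", "match" if matched else "no match")
```

Since this form uses no conditional, it should also be acceptable to regex dialects that reject `(?(DT)...)`.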
If you want to make use of the if/else clause, you can make the named capture group optional, and then check if group 1 exists.
^XXX_(?<DT>A12|B43|D14)?(?(DT)_\d{1,3}|[A-Z]\d{2})\.ZZZ$
Regex demo
For the updated question:
^XXX_(?<DT>A12|B43|D14)?(?(DT)(?:_\d{5})?_\d{3}(?!\d)|(?!A12|B43|D14|[A-Z]\d{2}_\d{3}(?!\d))).*\.ZZZ$
The pattern matches:
^ Start of string
XXX_ Match literally
(?<DT>A12|B43|D14)?
(?(DT) If we have group DT
(?:_\d{5})? Optionally match _ and 5 digits
_\d{3}(?!\d) Match _ and 3 digits
| Or
(?! Negative lookahead, assert not to the right
A12|B43|D14| Match one of the alternatives, or
[A-Z]\d{2}_\d{3}(?!\d) Match 1 char A-Z, 2 digits _ 3 digits not followed by a digit
) Close lookahead
) Close if clause
.* Match the rest of the line
\.ZZZ Match . and ZZZ
$ End of string
Regex demo

Extract a sub-string from a matched string

I am attempting to extract a sub-string from a string after matching for 24 at the beginning of the string. The substring is a MAC id starting at position 6 till the end of the string. I am aware that a sub string method can do the job. I am curious to know a regex implementation.
String = 2410100:80:a3:bf:72:d45
After much trial and error, this is the regex I have, which I think is convoluted.
[^24*$](?<=^\S{6}).*$
How can this reg-ex be modified to match for 24, then extract the substring from position 6 till the end of the line?
https://regex101.com/r/vcvfMx/2
Expected Results: 00:80:a3:bf:72:d45
You can use:
(?<=^24\S{3}).*$
Here's a demo: https://regex101.com/r/HqT0RV/1/
This will get you the result you expect (i.e., 00:80:a3:bf:72:d45). However, that doesn't seem to be a valid MAC address (the 5 at the end seems to be not part of the MAC). In which case, you should be using something like this:
(?<=^24\S{3})(?:[0-9a-f]{2}:){5}[0-9a-f]{2}
Demo: https://regex101.com/r/HqT0RV/2
Breakdown:
(?<= # Start of a positive Lookbehind.
^ # Asserts position at the beginning of the string.
24 # Matches `24` literally.
\S{3} # Matches any three non-whitespace characters.
) # End of the Lookbehind (five characters so far).
(?: # Start of a non-capturing group.
[0-9a-f] # A number between `0` and `9` or a letter between `a` and `f` (at pos. #6).
{2} # Matches the previous character class exactly two times.
: # Matches `:` literally.
) # End of the non-capturing group.
{5} # Matches the previous group exactly five times.
[0-9a-f] # Any number between `0` and `9` or any letter between `a` and `f`.
{2} # Matches the previous character class exactly two times.
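Both patterns can be tried on the sample string with a few lines of Python (variable names are illustrative). The last pattern is an alternative that is not part of the answer above: it uses a capturing group instead of a lookbehind, which may be useful in engines without lookbehind support.

```python
import re

text = "2410100:80:a3:bf:72:d45"

# Lookbehind version: everything after the five-character prefix.
rest = re.search(r"(?<=^24\S{3}).*$", text).group()

# Stricter version: exactly six colon-separated hex pairs.
mac = re.search(r"(?<=^24\S{3})(?:[0-9a-f]{2}:){5}[0-9a-f]{2}", text).group()

# Alternative without lookbehind (an assumption, not from the answer):
# consume the prefix and read the remainder from group 1.
grouped = re.match(r"^24\S{3}(.*)$", text).group(1)

print(rest)     # 00:80:a3:bf:72:d45
print(mac)      # 00:80:a3:bf:72:d4
print(grouped)  # 00:80:a3:bf:72:d45
```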

Regular expression that would contain a number less than or equal to 20

I need a regular expression that returns true if the string contains a number less than or equal to 20 and that only allows the use of numbers.
Assuming that you are matching numbers which are:
Integers
Within the range of [0,20]
This should work: ^(([01]?[0-9])|(20))$.
If you are matching floats, things get a bit messier. Checking numeric ranges should, ideally, always be done through your platform's numeric operators.
This would match integers less than or equal to 20
(?:\b|-)0*([0-9]|1[0-9]|20)\b
Explanation
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
\b # Assert position at a word boundary
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
- # Match the character “-” literally
)
0 # Match the character “0” literally
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
( # Match the regular expression below and capture its match into backreference number 1
# Match either the regular expression below (attempting the next alternative only if this one fails)
[0-9] # Match a single character in the range between “0” and “9”
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
1 # Match the character “1” literally
[0-9] # Match a single character in the range between “0” and “9”
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
20 # Match the characters “20” literally
)
\b # Assert position at a word boundary
I don't know the language that supports the regex. I will assume that it uses some variants of PCRE.
The code here is to strictly validate the string only contains the number.
Only integer, assuming non-negative, no leading 0's:
^(1?\d|20)$
Only integer, assuming non-negative, allow arbitrary leading 0's:
^0*(1?\d|20)$
Any integer, no leading 0's:
^(\+?(1?\d|20)|-\d+)$
Any integer, allow arbitrary leading 0's:
^(\+?0*(1?\d|20)|-\d+)$
If the number is not arbitrarily large, it is better to capture the number with a loose regex such as \b[+-]?\d+(\.\d*)?\b, then convert it to a number and check it.
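The "loose regex, then numeric check" idea might be sketched as follows in Python. The function name `is_at_most_20` is hypothetical, and the sketch is restricted to integers, so the optional fractional part of the loose regex is left out.

```python
import re

# Loose shape check: optional sign followed by digits only (integers only).
INTEGER = re.compile(r"[+-]?\d+")

# Strict regex from above: non-negative, no leading zeros.
STRICT = re.compile(r"^(1?\d|20)$")


def is_at_most_20(text):
    """True if text is an integer literal whose value is <= 20."""
    text = text.strip()
    if INTEGER.fullmatch(text) is None:
        return False
    # The range check is done numerically rather than in the regex.
    return int(text) <= 20


for sample in ["20", "21", "007", "-5", "abc", "19.5"]:
    print(sample, "->", is_at_most_20(sample))
```

For non-negative integers written without leading zeros, the strict regex and the numeric check agree, but the numeric version is easier to adapt when the limit changes.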
(\b[0-9]\b|\b1[0-9]\b|\b20\b) worked for me
Only integer, assuming non-negative, no leading 0's
I used it to find percentages of 20 or less, so it ended up being:
(\b[0-9]%|\b1[0-9]%|\b20%)