How to create date format regex without repeating previous group

I want to create a regex for matching date formats entered by a user. The user will enter date formats as a string ("dd/MMM/yyyy") and not actual values.
For example:
dd/MMM/yyyy = ✅
MMM/dd/yyyy = ✅
dd/dd/yyyy = ❌ (previously captured groups cannot be repeated)
MMM/MMM/yyyy = ❌ (same reason as above)
I'm having trouble getting a negative lookahead to work. Any assistance is much appreciated.

I believe you could use the regular expression
\b(?:dd\/(?:mm|MMM)\/yyyy|(?:mm|MMM)\/dd\/yyyy|yyyy\/(?:mm|MMM)\/dd)\b
Demo
The regex engine performs the following operations.
\b # match a word boundary
(?: # begin a non-capture group
dd\/(?:mm|MMM)\/yyyy # match 'dd/' followed by 'mm' or 'MMM'
# followed by '/yyyy'
| # or
(?:mm|MMM)\/dd\/yyyy # match 'mm' or 'MMM' followed by '/dd'
# followed by '/yyyy'
| # or
yyyy\/(?:mm|MMM)\/dd # match 'yyyy/' followed by 'mm' or 'MMM'
# followed by '/dd'
) # end the non-capture group
\b # match a word boundary
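As a quick check, the pattern above can be run against the format strings from the question. This is a minimal Python sketch, not part of the original answer: the helper name `is_valid_format` is made up for illustration, and `fullmatch` is used on the assumption that the whole input must be a single format string.

```python
import re

# The alternation from the answer; "/" needs no escaping in Python.
DATE_FORMAT = re.compile(
    r"\b(?:dd/(?:mm|MMM)/yyyy|(?:mm|MMM)/dd/yyyy|yyyy/(?:mm|MMM)/dd)\b"
)

def is_valid_format(fmt: str) -> bool:
    """Return True if the whole string is one of the allowed date formats."""
    return DATE_FORMAT.fullmatch(fmt) is not None

for fmt in ["dd/MMM/yyyy", "MMM/dd/yyyy", "dd/dd/yyyy", "MMM/MMM/yyyy"]:
    print(fmt, "->", is_valid_format(fmt))
```

Because every allowed ordering is spelled out explicitly, a repeated component such as dd/dd/yyyy is rejected without any lookahead.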

Related

how to implement or after group in regex pattern

I want to get the thread-id from my urls in one pattern. The pattern should have just one group (on level 1). My test strings are:
https://www.mypage.com/thread-3306-page-32.html
https://www.mypage.com/thread-3306.html
https://www.mypage.com/Thread-String-Thread-Id
So I want a pattern that gives me the number 3306 for lines 1 and 2, and "String-Thread-Id" for the last line.
My current attempt is .*[t|T]hread-(.*)[\-page.*|.html], but it fails at the end, after the id. How can I do this properly? I also solved it with .*Thread-(.*)|.*thread-(\\w+).*, but that uses two groups, which does not work for my Java code.
I don't know whether this fits all situations, but I would try this:
^.*?thread-((?:(?!-page|\.html).)*)
In Java, that could look something like
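import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// subjectString holds the text to search (one URL per line)
List<String> matchList = new ArrayList<>();
Pattern regex = Pattern.compile("^.*?thread-((?:(?!-page|\\.html).)*)", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.MULTILINE);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group(1)); // group 1 is the thread id
}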
Explanation:
^ # Match start of line
.*? # Match any number of characters, as few as possible
thread- # until "thread-" is matched.
( # Then start a capturing group (number 1) to match:
(?: # (start of non-capturing group)
(?!-page|\.html) # assert that neither "-page" nor ".html" follow
. # then match any character
)* # repeat as often as possible
) # end of capturing group
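For comparison, the same pattern can be tried in Python against the three test URLs from the question. This is only a sketch: the helper name `thread_id` is hypothetical, and `re.IGNORECASE | re.MULTILINE` stand in for the Java flags used above.

```python
import re

# Same pattern as the Java version; the tempered dot stops before "-page" or ".html".
THREAD_ID = re.compile(
    r"^.*?thread-((?:(?!-page|\.html).)*)",
    re.IGNORECASE | re.MULTILINE,
)

def thread_id(url):
    """Return the text captured after the first 'thread-', or None if absent."""
    m = THREAD_ID.search(url)
    return m.group(1) if m else None

urls = [
    "https://www.mypage.com/thread-3306-page-32.html",
    "https://www.mypage.com/thread-3306.html",
    "https://www.mypage.com/Thread-String-Thread-Id",
]
for url in urls:
    print(url, "->", thread_id(url))
```

The lazy `.*?` matters here: it stops at the first "thread-", so the second "Thread-" in the last URL ends up inside the captured id.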

Or between groups when one group has to be preceded by a character

I have the following data:
$200 – $4,500
Points – $2,500
I would like to capture the ranges in dollars, or capture the Points string if that is the lower range.
For example, if I ran my regex on each of the entries above I would expect:
Group 1: 200
Group 2: 4,500
and
Group 1: Points
Group 2: 2,500
For the first group, I can't figure out how to capture only the integer value (without the $ sign) while allowing for capturing Points.
Here is what I tried:
(?:\$([0-9,]+)|Points) – \$([0-9,]+)
https://regex101.com/r/mD9JeR/1
Just use an alternation here:
^(?:(Points)|\$(\d{1,3}(?:,\d{3})*)) – \$(\d{1,3}(?:,\d{3})*)$
Demo
The salient points of the above regex pattern are that we use an alternation to match either Points or a dollar amount on the lower end of the range, and we use the following regex for matching a dollar amount with commas:
\$\d{1,3}(?:,\d{3})*
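A minimal Python sketch of the alternation approach follows. It assumes the separator in the data is an en dash (as in the question's samples). Note that this pattern yields three capture groups, so the lower bound has to be taken from whichever of groups 1 and 2 participated. The helper name `parse_range` is made up for illustration.

```python
import re

# Group 1: "Points"; group 2: lower dollar amount; group 3: upper dollar amount.
RANGE = re.compile(
    r"^(?:(Points)|\$(\d{1,3}(?:,\d{3})*)) – \$(\d{1,3}(?:,\d{3})*)$"
)

def parse_range(text):
    """Return (lower, upper) or None if the text is not a valid range."""
    m = RANGE.match(text)
    if m is None:
        return None
    lower = m.group(1) or m.group(2)  # only one of the two alternatives matched
    return lower, m.group(3)

for sample in ["$200 – $4,500", "Points – $2,500"]:
    print(sample, "->", parse_range(sample))
```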
Coming up with a regex that doesn't match the $ is not difficult. Coming up with a regex that doesn't match the $ and consistently puts the two values, whether they are both numeric or one of them is Points, in capture groups 1 and 2 is not straightforward. The difficulties disappear if you use named capture groups. This regex requires the regex module from the PyPI repository, since it uses the same group name more than once.
import regex
tests = [
'$200 – $4,500',
'Points – $2,500'
]
pattern = r"""(?x)      # verbose mode
    ^                   # start of string
    (
        \$              # first alternate choice
        (?P<G1>[\d,]+)  # named group G1
    |                   # or
        (?P<G1>Points)  # second alternate choice
    )
    \x20–\x20           # ' – '
    \$
    (?P<G2>[\d,]+)      # named group G2
    $                   # end of string
"""
# or pattern = r'^(\$(?P<G1>[\d,]+)|(?P<G1>Points)) – \$(?P<G2>[\d,]+)$'
for test in tests:
    m = regex.match(pattern, test)
    print(m.group('G1'), m.group('G2'))
Prints:
200 4,500
Points 2,500
UPDATE
@marianc was on the right track with his comment but did not ensure that there were no extraneous characters in the input. So, with his useful input:
import re
tests = [
'$200 – $4,500',
'Points – $2,500',
'xPoints – $2,500',
]
rex = r'((?<=^\$)\d{1,3}(?:,\d{3})*|(?<=^)Points) – \$(\d{1,3}(?:,\d{3})*)$'
for test in tests:
    m = re.search(rex, test)
    if m:
        print(test, '->', m.groups())
    else:
        print(test, '->', 'No match')
Prints:
$200 – $4,500 -> ('200', '4,500')
Points – $2,500 -> ('Points', '2,500')
xPoints – $2,500 -> No match
Note that re.search is used rather than re.match: re.match starts at position 0, where a lookbehind for $ cannot succeed. Extraneous characters at the start of the line are still rejected because the ^ anchor is included inside the lookbehind assertions.
For the first capturing group, you could use an alternation that either matches Points, asserting with a negative lookbehind that there is no non-whitespace char directly to the left, or matches the digits (with optional comma-separated groups of three), asserting with a positive lookbehind (if that is supported) that there is a dollar sign directly to the left.
For the second capturing group, there is no alternative, so you can match the dollar sign and capture the digits in group 2.
((?<=\$)\d{1,3}(?:,\d{3})*|(?<!\S)Points) – \$(\d{1,3}(?:,\d{3})*)
Explanation
( Capture group 1
(?<=\$)\d{1,3}(?:,\d{3})* Positive lookbehind, assert a $ to the left and match 1-3 digits and repeat 0+ matching a comma and 3 digits
| Or
(?<!\S)Points Negative lookbehind, assert that there is no non whitespace char to the left and match Points
) Close group 1
– Match literally
\$ Match $
( Capture group 2
\d{1,3}(?:,\d{3})* Match 1-3 digits and 0+ times a comma and 3 digits
) Close group
Regex demo
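The behaviour of this unanchored variant can be sketched in Python as follows. The helper name `find_range` is an assumption for illustration. Since the pattern has no ^ anchor, text before the dollar sign is tolerated, while a non-whitespace character directly before Points is not.

```python
import re

# Lookbehinds keep "$" out of group 1 while still requiring it (or a
# whitespace boundary before "Points") to the left of the match.
PATTERN = re.compile(
    r"((?<=\$)\d{1,3}(?:,\d{3})*|(?<!\S)Points) – \$(\d{1,3}(?:,\d{3})*)"
)

def find_range(text):
    """Return (group 1, group 2) for the first match, or None."""
    m = PATTERN.search(text)
    return m.groups() if m else None

for sample in ["$200 – $4,500", "Points – $2,500", "xPoints – $2,500", "abc$200 – $4,500"]:
    print(sample, "->", find_range(sample))
```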

Use regex to validate angular expressions in a paragraph input

I have a difficult user-input validation question (or at least it's difficult for me). I'm trying to make sure users are inputting a pre-defined subset of allowed Angular expressions if they try to add angular to their input at all.
I'm currently using http://www.regexpal.com/ (the actual implementation is in an HTML webpage using javascript) to test my expression and the two following cases:
VALID
Any text, punctuation (except double-{), or numb3r5 {{model.variable|phone}} is valid
Any text, punctuation (except double-{), or numb3r5 {{model.variable}} is valid.
Stick with the format {{model.variable|zipcode}} and we remain valid.
INVALID
Any text, punctuation (except double-{), or numb3r5 {{model.variable|phone}} is valid
Any text, punctuation (except double-{), or numb3r5 {{model.variable}} is valid.
Any deviation from the format, e.g. {{model.variable|custom}} makes the entire input invalid.
I figured out the regex to identify the three angular blocks and un-match the "custom" one...
{{model\.[^}|]+(\|((ein)|(phone)|(zipcode)|(currency:'':0)){1})?}}
... but I can't get it to enforce that regex. I tried lots of variations on lookaheads, and this is what I think I need, but it doesn't match the valid input, so obviously I'm off.
^(((.(?!({{)|(}})))*({{model\.[^}|]+(\|((ein)|(phone)|(zipcode)|(currency:'':0)){1})?}}))?)+$
Does anyone out there know how I might validate this input?
Nicely composed question. You described the problem and what you have tried.
Using Lookaheads is one solution but you may end up consuming the text for other purposes, so normal groups work fine here.
I would suggest: ^((?:^|[^\r\n\{]*)(?:\{(?:[^{]|$)|(?:\{{2}model\.variable(?:\|(?:(ein)|(phone)|(zipcode)|(currency:'':0)))?\}{2}|$)))+$ (demo)
Be aware that visibly empty strings can pass this regex. I would do a .trim().length check if that is an issue. I didn't think it was appropriate to add more bloat to this regex.
^ # Anchors to beginning of string or line,
# depending on multiline flag
( # Opens capturing group 1
(?: # Opens noncapturing group
^ # Anchors to the beginning of string or line
| # or
[^\r\n\{]* # Any character but carriage return, new line, {, zero or more times
) # Closes noncapturing group
(?: # Opens noncapturing group
\{ # Literal {
(?: # Opens noncapturing group
[^{] # Any character but {
# to filter {{'ss
| # or
$ # End of string or line
) # Closes noncapturing group
| # or
(?: # Opens noncapturing group
\{{2} # {, twice
model\.variable # model.variable
(?: # Opens noncapturing group
\| # Literal |
(?: # Opens noncapturing group
(ein) # ein as capturing group 2
| # or
(phone) # phone as capturing group 3
| # or
(zipcode) # zipcode as capturing group 4
| # or
(currency:'':0) # currency as capturing group 5
) # closes non-capturing group
)? # closes non-capturing group, iterates 0 or 1 times
\}{2} # }, twice
| # or
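$ # end of string or line, depending on multiline
) # Closes noncapturing group
) # Closes noncapturing group
)+ # Closes capturing group 1, iterates one or more times
$ # End of string or line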
Per: I'm going to run with this and see if I can get it to ignore the newlines/carriage returns when building the overall match for the entire input.
^((?:^|[^{]+)(?:\{(?:[^{]|$)|(?:\{{2}model\.variable(?:\|(?:(ein)|(phone)|(zipcode)|(currency:'':0)))?\}{2}|$)))+$ (demo)
I only needed to remove the single \r\n and remove the multiline flag.
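The original implementation is in JavaScript, but the final pattern can also be tried as-is in Python, which makes for a compact check. This is a sketch under that assumption. The helper name `is_valid_input` is made up, and the pattern hardcodes model.variable exactly as the answer does.

```python
import re

# The final pattern from the thread, split across two literals for readability.
# No MULTILINE flag, so ^ and $ anchor the whole input.
ANGULAR_OK = re.compile(
    r"^((?:^|[^{]+)(?:\{(?:[^{]|$)|(?:\{{2}model\.variable"
    r"(?:\|(?:(ein)|(phone)|(zipcode)|(currency:'':0)))?\}{2}|$)))+$"
)

def is_valid_input(text):
    """Return True if every {{...}} block in the text is an allowed expression."""
    return ANGULAR_OK.search(text) is not None

print(is_valid_input("Call {{model.variable|phone}} today."))
print(is_valid_input("Bad {{model.variable|custom}} filter."))
```

As the answer warns, an empty string also passes, so a separate length check may still be needed.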

Selecting if no delimiter, and no selecting if it is

I have a string like "smth 2sg. smth", and sometimes "smth 2sg.| smth.".
What pattern should I use to select "2sg." if the string does not contain "|", and select nothing if it does contain "|"?
I have 2 methods. They both use something called a Negative Lookahead, which is used like so:
(?!data)
Placed at a position in a RegEx, it means that if data follows at that position, the RegEx will not match there.
More info on the Negative Lookahead can be found here
Method 1 (shorter)
Just capture 2sg.
Try this RegEx:
(\dsg\.)(?!\|)
Use (\d+... if the number could be longer than 1 digit
Live Demo on RegExr
How it works:
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if followed by |
Method 2 (longer but safer)
Match the whole string and capture 2sg.
Try this RegEx:
^\w+\s*(\dsg\.)(?!\|)\s*\w+\.?$
Use (\d+sg... if the number could be longer than 1 digit
Live Demo on RegExr
How it works:
^ # String starts with ...
\w+\s* # Letters then Optional Whitespace (smth )
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if followed by |
\s* # Optional Whitespace
\w+ # Letters (smth)
\.? # Optional . (Dot)
$ # ... Strings ends with
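Both methods can be checked against the two sample strings from the question. A minimal Python sketch follows; the names `CAPTURE_ONLY`, `WHOLE_STRING` and `select` are made up for illustration.

```python
import re

CAPTURE_ONLY = re.compile(r"(\dsg\.)(?!\|)")                     # Method 1
WHOLE_STRING = re.compile(r"^\w+\s*(\dsg\.)(?!\|)\s*\w+\.?$")    # Method 2

def select(pattern, text):
    """Return group 1 of the first match, or None when nothing is selected."""
    m = pattern.search(text)
    return m.group(1) if m else None

for text in ["smth 2sg. smth", "smth 2sg.| smth."]:
    print(repr(text), "->", select(CAPTURE_ONLY, text), select(WHOLE_STRING, text))
```

In both cases the lookahead only inspects the character directly after "2sg."; it does not check for a "|" elsewhere in the string.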
Something like this might work for you:
(\d*sg\.)(?!\|)
It assumes an optional number followed by sg. that is not followed by |.
^.*(\dsg\.)[^\|]*$
Explanation:
^ : starts from the beginning of the string
.* : accepts any number of initial characters (even nothing)
(\dsg\.) : looks for the group of digit + "sg."
[^\|]* : considers any number of following characters except for |
$ : stops at the end of the string
You can now select your string by getting the first group from your regex
Try:
(\d+sg\.(?!\|))
Depending on your programming environment, the syntax may differ slightly, but this should get your result.
For more information see Negative Lookahead

regex conditional statement (only parse numbers if string has a certain beginning)

At first I used a string that returned my relay states, so "1.0.0.0.1.1.0.0" would get parsed/grouped with \d+,
then my eight switches used 'format response', e.g. {1}, to get the state for each switch.
Now I need to get the numbers out of this string: "RELAYS.1.0.0.0.1.1.0.0"
\d+ will still get the numbers, but I only want to get them IF the string starts with "RELAYS".
Can anyone please explain how I could do that?
thnx a million in advance!
With a .NET engine, you could use the regex (?<=^RELAYS[\d.]*)\d+. But most regex engines don't support unbounded repetition inside a lookbehind assertion.
See it live on regexhero.net.
Explanation:
(?<= # Assert that the following can be matched before the current position:
^RELAYS # Start of string, followed by "RELAYS"
[\d.]* # and any number of digits/dots.
) # End of lookbehind assertion
\d+ # Match one or more digits.
With a PCRE engine, you could use (?:^RELAYS\.|\G\.)(\d+) and access group 1 for each match.
See it live on regex101.com.
Explanation:
(?: # Start a non-capturing group that matches...
^RELAYS\. # either the start of the string and "RELAYS."
| # or
\G\. # the position after the previous match, followed by "."
) # End of non-capturing group
(\d+) # Match a number and capture it in group 1
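Python's built-in re module supports neither variable-length lookbehind nor \G, so neither pattern above carries over directly. As a sketch of a different technique (not the one from the answers), the prefix can be validated first and the captured tail split on dots. The helper name `relay_states` is hypothetical.

```python
import re

# Require the "RELAYS" prefix, then capture the whole ".n.n.n..." tail at once.
RELAYS = re.compile(r"RELAYS((?:\.\d+)+)")

def relay_states(text):
    """Return the list of numbers if text starts with RELAYS, else an empty list."""
    m = RELAYS.fullmatch(text)
    if m is None:
        return []
    return m.group(1).lstrip(".").split(".")

print(relay_states("RELAYS.1.0.0.0.1.1.0.0"))
print(relay_states("1.0.0.0.1.1.0.0"))
```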