regex not matching for multiple parts of a string - regex

I want to write regex something like
/scheduling/groups/members/[list*|get*|search*].json
which should match with string
/scheduling/groups/list.json or
/scheduling/groups/getGroup.json or
/scheduling/groups/searchMember.json
and should not match with
/scheduling/groups/save.json

You can use this regex
\/scheduling\/groups\/(?:list|get|search).*\.json

You could even use something like
# (?i)/scheduling/groups/(?:list|get|search)[^/]*\.json
(?i) # Case insensitive modifier
/scheduling/groups/ # Literal '/scheduling/groups/'
(?: # Cluster group, one of either
list # 'list'
| # or,
get # 'get'
| # or,
search # 'search'
) # End cluster group
[^/]* # 0 or more characters that are not path separators
\.json # Literal '.json'
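As a quick check, here is a minimal Python sketch that runs the second pattern against the paths from the question. The helper name is_group_action is made up for illustration, and fullmatch is used on the assumption that the whole path should match rather than a substring.

```python
import re

# Pattern from the answer above; (?i) makes it case-insensitive.
GROUP_ACTION = re.compile(r"(?i)/scheduling/groups/(?:list|get|search)[^/]*\.json")


def is_group_action(path):
    """Hypothetical helper: True if the whole path is a list/get/search endpoint."""
    return GROUP_ACTION.fullmatch(path) is not None


if __name__ == "__main__":
    for path in (
        "/scheduling/groups/list.json",
        "/scheduling/groups/getGroup.json",
        "/scheduling/groups/searchMember.json",
        "/scheduling/groups/save.json",
    ):
        print(path, is_group_action(path))
```

Only the last path, save.json, should be rejected.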

Related

Regex Expression to remove "autoplay" parameter in url

I'm trying to match the url https://youtube.com/embed/id and its parameters i.e ?start=10&autoplay=1, but I need the autoplay parameter removed or set to 0.
These are some example urls and what I want the results to look like:
http://www.youtube.com/embed/JW5meKfy3fY?autoplay=1
I want to remove the autoplay parameter and its value:
http://www.youtube.com/embed/JW5meKfy3fY
2nd example
http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1
results should be
http://www.youtube.com/embed/JW5meKfy3fY?start=10
I have tried (https?:\/\/www.youtube.com\/embed\/[a-zA-Z0-9\\-_]+)(\?[^\t\n\f\r \"']*)(\bautoplay=[01]\b&?) and replace with $1$2, but it matches with a trailing ? and & in example 1 and 2 respectively. Also, it doesn't match at all for a url like
http://www.youtube.com/embed/JW5meKfy3fY
I have the regex and examples on here
NB:
The string I am working on contains HTML with one or more YouTube URLs in it, so I don't think I can easily use Go's net/url package to parse the URL.
You're asking for a regex but I think you'd be better off using Go's "net/url" package. Something like this:
import "net/url"
//...
u, _ := url.Parse("http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1")
q := u.Query()
q.Del("autoplay")
u.RawQuery = q.Encode()
cleanURL := u.String()
In real life you'd want to handle the error from url.Parse, of course.
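Since the question says the URLs are embedded in HTML, the two ideas can be combined: use a regex only to locate each embed URL, then let a URL parser drop the parameter. The sketch below shows that in Python with the standard urllib.parse module rather than Go's net/url. The names EMBED_URL, strip_autoplay and clean_html are hypothetical, and it assumes the URLs use a plain & rather than &amp; inside the HTML.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: an embed URL ends at whitespace, a quote or an angle bracket.
EMBED_URL = re.compile(r"https?://www\.youtube\.com/embed/[^\s\"'<>]+")


def strip_autoplay(match):
    """Rebuild one matched URL without its autoplay parameter."""
    parts = urlsplit(match.group(0))
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "autoplay"
    ]
    # An empty query string makes urlunsplit omit the '?' entirely.
    return urlunsplit(parts._replace(query=urlencode(kept)))


def clean_html(html):
    """Apply strip_autoplay to every embed URL found in the HTML string."""
    return EMBED_URL.sub(strip_autoplay, html)


if __name__ == "__main__":
    sample = '<iframe src="http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1"></iframe>'
    print(clean_html(sample))
```

Because the query string is rebuilt from the remaining pairs, no stray ? or & is left behind.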
Here's a solution that ensures a valid page URI. Simply match this and only return capture group 1 and 3.
Edit: The pattern is not elegant, but it ensures no stray ampersands remain. The previous solution was more elegant and wouldn't have broken anything, but it isn't worth the tradeoff, in my opinion.
Pattern
(https?:\/\/www\.youtube\.com\/embed\/[^?]+\?.*)(&autoplay=[01]|autoplay=[01]&?)(.*)
See the demo here.
As the OP has linked to a regex tester that employs the PCRE (PHP) engine, I offer a PCRE-compatible solution. The one token I've used in the regular expression below that is not widely supported in other regex engines is \K (though it is supported by Perl, Ruby, Python's PyPI regex module, R with perl=TRUE and possibly other engines).
\K causes the regex engine to reset the beginning of the match to the current location in the string and to discard any previously-matched characters in the match it returns (if there is one).
With one caveat you can replace matches of the following regular expression with empty strings.
(?x) # assert 'extended'/'free spacing' mode
\bhttps?:\/\/www\.youtube\.com\/embed\/
# match literal
(?=.*autoplay=[01]) # positive lookahead asserts 'autoplay='
# followed by '0' or '1' appears later in
# the string
[a-zA-Z0-9\\_-]+ # match 1+ of the chars in the char class
[^\t\n\f\r \"']* # match 0+ chars other than those in the
# char class
(?<![?&]) # negative lookbehind asserts that previous
# char was neither '?' nor '&'
(?: # begin non-capture group
(?=\?) # use positive lookahead to assert next char
# is a '?'
(?: # begin a non-capture group
(?=.*autoplay=[01]&)
# positive lookahead asserts 'autoplay='
# followed by '0' or '1', then '&' appears
# later in the string
\? # match '?'
)? # end non-capture group and make it optional
\K # reset start of match to current location
# and discard all previously-matched chars
\?? # optionally match '?'
autoplay=[01]&? # match 'autoplay=' followed by '0' or '1',
# optionally followed by '&'
| # or
(?=&) # positive lookahead asserts next char is '&'
\K # reset start of match to current location
# and discard all previously-matched chars
&autoplay=[01]&? # match '&autoplay=' followed by '0' or '1',
# optionally followed by '&'
) # end non-capture group
The one limitation is that it fails to match all instances of .autoplay=.. if more than one such substring appears in the string.
I wrote this expression with the x flag, called extended or free spacing mode, to be able to make it self-documenting.

regex - How to create date format regex without repeating previous group

I want to create a regex for matching date formats entered by a user. The user will enter date formats as a string ("dd/MMM/yyyy") and not actual values.
For example:
dd/MMM/yyyy = ✅
MMM/dd/yyyy = ✅
dd/dd/yyyy = ❌ (previously captured groups cannot be repeated)
MMM/MMM/yyyy = ❌ (same reason as above)
I'm having trouble getting a negative lookahead to work. Any assistance is much appreciated.
I believe you could use the regular expression
\b(?:dd\/(?:mm|MMM)\/yyyy|(?:mm|MMM)\/dd\/yyyy|yyyy\/(?:mm|MMM)\/dd)\b
The regex engine performs the following operations.
\b # match a word boundary
(?: # begin a non-capture group
dd\/(?:mm|MMM)\/yyyy # match 'dd/' followed by 'mm' or 'MMM'
# followed by '/yyyy'
| # or
(?:mm|MMM)\/dd\/yyyy # match 'mm' or 'MMM' followed by '/dd'
# followed by '/yyyy'
| # or
yyyy\/(?:mm|MMM)\/dd # match 'yyyy/' followed by 'mm' or 'MMM'
# followed by '/dd'
) # end the non-capture group
\b # match a word boundary
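To see how the alternation rejects repeated parts, here is a small Python sketch that checks the format strings from the question. The helper name is_valid_date_format is hypothetical, and fullmatch is used on the assumption that the user's entire input must be a format string.

```python
import re

# Same alternation as above; '/' needs no escaping in a Python pattern.
DATE_FORMAT = re.compile(
    r"\b(?:dd/(?:mm|MMM)/yyyy|(?:mm|MMM)/dd/yyyy|yyyy/(?:mm|MMM)/dd)\b"
)


def is_valid_date_format(fmt):
    """Hypothetical helper: True if fmt is one of the enumerated layouts."""
    return DATE_FORMAT.fullmatch(fmt) is not None


if __name__ == "__main__":
    for fmt in ("dd/MMM/yyyy", "MMM/dd/yyyy", "dd/dd/yyyy", "MMM/MMM/yyyy"):
        print(fmt, is_valid_date_format(fmt))
```

Because every accepted layout is listed explicitly, a repeated part such as dd/dd/yyyy simply has no branch that matches it. The cost is that each new layout has to be added by hand.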

Regex: match an empty string instead of nothing

I have a Python script in which I'm trying to parse a string of the form:
one[two=three].four
Each word should be in its own capture group. The punctuation should not be captured.
Additionally, each part of the string is optional, and the part delimited by brackets can be repeated. So the above is the most complete example, but all of the following should also be valid matches:
one
.four
one[two=three][five=six]
[two=three]
[two].four
[two][five]
[]
In the case that one of the words is not present, instead of failing to capture, I'd like to capture a string of length 0.
The regex that I'm using is as follows:
pattern = re.compile(
r"""
^ # Assert start of string
(?P<cap1> # Start a new group for "one"
[a-z]* #
) #
(?: # Start a group for "two" and "three"
\[ # Match the "["
(?P<cap_2> # Start a group for "two"
[a-z]* #
) #
=? # Delimit two/three with "="
(?P<cap_3> # Start a group for "three"
[a-z]* #
) #
\] # Match the "]"
)* # End the two-three group, allowing repeats
\.? # Delimit three/four with "."
(?P<cap_4> # Begin a group for "four"
[a-z]* #
) #
$ # Assert end of string
""", re.IGNORECASE|re.VERBOSE)
What I've tried to do in that regex is, instead of allowing 0 or 1 of a group by appending ? to the entire group, to allow any number of characters in the actual match itself by appending * to the character class. Therefore, the match is forced to exist, but the string itself can have a length of 0.
The problem comes with the bracketed block. The package I'm using allows me to access all captures of a named group using match.captures(groupname). This way, I can access all matches for cap_2 using match.captures("cap_2"):
>>> pattern.match("one[two=three][five=six].four").captures("cap_2")
["two", "five"]
This works fine when the brackets are present. However, when they're not:
>>> pattern.match("one.four").captures("cap_2")
[]
Expected: [""]
I expect there to be at least an empty string present for cap_2 and cap_3. However, there's nothing.
This is because of the * I place after the two+three section of the regex, in order to allow multiple of those groups - this is allowing that part of the regex to be skipped altogether.
Changing that * to + breaks the regex, because it then fails on the example above: it insists on matching the brackets. Adding a ? after each bracket means that cap_1 and cap_2 are no longer delimited, and cap_3 swallows what should be in cap_4.
What's the solution here? How can I allow a group containing two capturing groups to be executed multiple times, but match only empty strings when the brackets are not present?
You can solve the problem by replacing the * after the repeated group (?:\[(?P<cap_2>[a-z]*)=?(?P<cap_3>[a-z]*)\])* with + and adding an alternative that contains a second occurrence of the cap_2 and cap_3 groups (note that the PyPI regex module supports multiple identically named groups in the same regex):
import regex as re
s = 'one.four'
pattern = re.compile(
r"""
^ # Assert start of string
(?P<cap1> # Start a new group for "one"
[a-z]* #
) #
(?:
(?: # Start a group for "two" and "three"
\[ # Match the "["
(?P<cap_2> # Start a group for "two"
[a-z]* #
) #
=? # Delimit two/three with "="
(?P<cap_3> # Start a group for "three"
[a-z]* #
) #
\] # Match the "]"
)+ # End the two-three group, allowing repeats
|
(?P<cap_2>)(?P<cap_3>)
)
\.? # Delimit three/four with "."
(?P<cap_4> # Begin a group for "four"
[a-z]* #
) #
$ # Assert end of string
""", re.IGNORECASE|re.VERBOSE)
print ( pattern.match("one.four").captures("cap_2") )
# => ['']
See the Python demo
The point is that the (?:\[(?P<cap_2>[a-z]*)=?(?P<cap_3>[a-z]*)\])* part always matches, since it can match an empty string, so if you just add the alternative without changing the quantifier, the second branch is never tried and you won't get the expected result. With +, if there are no [...] parts, the first branch fails and the second cap_2 and cap_3 groups, whose patterns are empty, always match and capture an empty string.
If you want it to match either the empty string or something else, you need the OR operator: |
If you want your regexp to match an empty string, you need something that matches the empty string: e.g. () or (not empty|)
Combined and applied to your case, that would look like this (simplified):
((?:\[stuff inside the brackets\])+|)
The outermost group captures the whole bracket construct (e.g. [two][three]) if it's present or the empty string. Notice that the left part of the | operator now has to match at least once (+).
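The simplified pattern above can be tried with Python's built-in re module, which has no captures() method, so this sketch takes a different route from the regex-module answer: it captures the whole bracket block (or the empty string) and then splits it in a second pass. The names BRACKET_BLOCK, BRACKET_PAIR and parse are hypothetical, and the fallback to one empty pair is an assumption about what the asker wants when no brackets are present.

```python
import re

# Group 2 is either one or more [..] parts or the empty string.
BRACKET_BLOCK = re.compile(
    r"^([a-z]*)((?:\[[a-z]*=?[a-z]*\])+|)\.?([a-z]*)$", re.IGNORECASE
)
# Second pass: split the captured block into (two, three) pairs.
BRACKET_PAIR = re.compile(r"\[([a-z]*)=?([a-z]*)\]", re.IGNORECASE)


def parse(text):
    """Return (one, [(two, three), ...], four), or None if text does not match."""
    match = BRACKET_BLOCK.match(text)
    if match is None:
        return None
    head, block, tail = match.groups()
    # Assumption: with no brackets, report a single pair of empty strings.
    pairs = BRACKET_PAIR.findall(block) or [("", "")]
    return head, pairs, tail


if __name__ == "__main__":
    print(parse("one[two=three][five=six].four"))
    print(parse("one.four"))
```

The group for the bracket block always participates in the match, so it yields "" rather than None when the brackets are missing.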

Selecting if no delimiter, and no selecting if it is

I have a string like "smth 2sg. smth", and sometimes "smth 2sg.| smth.".
What pattern should I use to select "2sg." if the string does not contain "|", and to select nothing if the string does contain "|"?
I have 2 methods. They both use something called a Negative Lookahead, which is used like so:
(?!data)
When this is inserted into a regex, the match fails at that point if data comes next in the string.
More info on the Negative Lookahead can be found here
Method 1 (shorter)
Just capture 2sg.
Try this RegEx:
(\dsg\.)(?!\|)
Use (\d+... if the number could be longer than 1 digit
How it works:
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if immediately followed by |
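A short Python sketch of Method 1 follows. The helper name find_sg is hypothetical. It also shows a limit worth knowing: the lookahead only inspects the character directly after "2sg.", so a "|" that appears later in the string is not noticed.

```python
import re

# Method 1: capture "<digit>sg." unless the very next character is "|".
SG_NOT_BEFORE_PIPE = re.compile(r"(\dsg\.)(?!\|)")


def find_sg(text):
    """Hypothetical helper: return the captured token, or None if nothing matches."""
    match = SG_NOT_BEFORE_PIPE.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(find_sg("smth 2sg. smth"))     # token found
    print(find_sg("smth 2sg.| smth."))   # "|" directly after the token: no match
    print(find_sg("smth 2sg. | smth"))   # "|" after a space: still matches
```

If the delimiter can appear anywhere after the token, a pattern that inspects the rest of the string is the safer choice.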
Method 2 (longer but safer)
Match the whole string and capture 2sg.
Try this RegEx:
^\w+\s*(\dsg\.)(?!\|)\s*\w+\.?$
Use (\d+sg... if the number could be longer than 1 digit
How it works:
^ # String starts with ...
\w+\s* # Letters then Optional Whitespace (smth )
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if immediately followed by |
\s* # Optional Whitespace
\w+ # Letters (smth)
\.? # Optional . (Dot)
$ # ... Strings ends with
Something like this might work for you:
(\d*sg\.)(?!\|)
It assumes an optional number followed by sg. that is not immediately followed by |.
^.*(\dsg\.)[^\|]*$
Explanation:
^ : starts from the beginning of the string
.* : accepts any number of initial characters (even nothing)
(\dsg\.) : looks for the group of digit + "sg."
[^\|]* : considers any number of following characters except for |
$ : stops at the end of the string
You can now select your string by getting the first group from your regex
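Here is a minimal Python sketch of this anchored pattern. The helper name select_sg is hypothetical, and it assumes single-line input, since . does not match a newline by default.

```python
import re

# Everything after the token, up to the end of the string, must be free of "|".
SG_NO_PIPE_AFTER = re.compile(r"^.*(\dsg\.)[^|]*$")


def select_sg(text):
    """Hypothetical helper: return group 1, or None when a "|" follows the token."""
    match = SG_NO_PIPE_AFTER.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(select_sg("smth 2sg. smth"))
    print(select_sg("smth 2sg.| smth."))
    print(select_sg("smth 2sg. | smth"))
```

Unlike a lookahead that checks only the next character, this rejects a "|" anywhere after the token. A "|" before the token is still accepted, because the leading .* matches it.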
Try:
(\d+sg\.(?!\|))
Depending on your programming environment, the syntax may differ a little, but this should get you the result.
For more information see Negative Lookahead

Replace dynamic string with Regex

I'm using Visual Basic .NET and I'm trying to download a string of HTML, and I want to replace this
id="dynamicstring"
With
id="replacement"
The dynamicstring can be anything, that's why I'm having trouble replacing it.
You can match the content of the id attribute with this pattern:
(?<=<div\b(?>[^i]+|\Bi|i(?!d\s*=))*id\s*=\s*")[^"]+
details:
(?<= # open a look behind assertion (it's just a check
# nothing is matched inside it)
<div\b # div tag
(?> # atomic group (all the content until the id attribute)
[^i]+ # all that is not an "i"
| # OR
\Bi # an "i" not preceded by a word boundary
| # OR
i(?!d\s*=) # an "i" (with an implicit word boundary)
# not followed by "d="
)* # close the atomic group and repeat as necessary
id\s*=\s*" # the id attribute until the first double quote
) # close the lookbehind
[^"]+ # content of the id attribute
# (all that is not a double quote)
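The pattern above relies on a variable-length lookbehind, which .NET supports but many engines do not. For comparison, here is a Python sketch of the same replacement using a capture group instead, because Python's built-in re module only allows fixed-width lookbehinds. This is a swapped-in technique, not the answer's own method. The names ID_ATTR and replace_div_id are hypothetical, and it assumes the id sits on a div tag and is wrapped in double quotes, as in the answer.

```python
import re

# Group 1 keeps everything from "<div" up to the opening quote of the id value.
# \s before "id" avoids matching attributes such as data-id.
ID_ATTR = re.compile(r'(<div\b[^>]*?\sid\s*=\s*")[^"]*')


def replace_div_id(html, new_id):
    """Hypothetical helper: swap the id value of every div tag for new_id."""
    # A function replacement avoids backslash escapes in new_id being interpreted.
    return ID_ATTR.sub(lambda match: match.group(1) + new_id, html)


if __name__ == "__main__":
    print(replace_div_id('<div class="x" id="dynamicstring">hi</div>', "replacement"))
```

For anything beyond simple, well-formed markup, an HTML parser is usually more robust than a regex.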