Select a match if there is no delimiter, and select nothing if there is - regex

I have a string like "smth 2sg. smth", and sometimes "smth 2sg.| smth.".
What pattern should I use to select "2sg." if the string does not contain "|", and to select nothing if the string does contain "|"?

I have 2 methods. They both use a Negative Lookahead, which is written like so:
(?!data)
When this is inserted into a RegEx, the match fails at that position if data comes next in the string.
More info on the Negative Lookahead can be found here
Method 1 (shorter)
Just capture 2sg.
Try this RegEx:
(\dsg\.)(?!\|)
Use (\d+... if the number could be longer than 1 digit
Live Demo on RegExr
How it works:
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if immediately followed by |
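A minimal sketch of Method 1, assuming Python's re module (the question does not name a language) and the \d+ variant. The helper name select_form is only illustrative.

```python
import re

# Method 1 with \d+ so that multi-digit numbers are accepted too.
PATTERN = re.compile(r"(\d+sg\.)(?!\|)")

def select_form(text):
    """Return the captured token, or None if a | directly follows it."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

print(select_form("smth 2sg. smth"))    # 2sg.
print(select_form("smth 2sg.| smth."))  # None
```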
Method 2 (longer but safer)
Match the whole string and capture 2sg.
Try this RegEx:
^\w+\s*(\dsg\.)(?!\|)\s*\w+\.?$
Use (\d+sg... if the number could be longer than 1 digit
Live Demo on RegExr
How it works:
^ # String starts with ...
\w+\s* # Letters then Optional Whitespace (smth )
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if immediately followed by |
\s* # Optional Whitespace
\w+ # Letters (smth)
\.? # Optional . (Dot)
$ # ... String ends with
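To see what "safer" means in practice, here is a small sketch of Method 2, again assuming Python's re module and the \d+ variant. The helper name select_strict is made up for this example. Because the pattern is anchored at both ends, a string with a different overall shape is rejected even though it contains "2sg.".

```python
import re

# Method 2: the whole string must have the shape "word 2sg. word" (optional final dot).
WHOLE = re.compile(r"^\w+\s*(\d+sg\.)(?!\|)\s*\w+\.?$")

def select_strict(text):
    """Return the captured token only if the entire string fits the pattern."""
    match = WHOLE.match(text)
    return match.group(1) if match else None

print(select_strict("smth 2sg. smth"))             # 2sg.
print(select_strict("smth 2sg.| smth."))           # None
print(select_strict("see 2sg. for more details"))  # None (extra words at the end)
```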

Something like this might work for you:
(\d*sg\.)(?!\|)
It assumes an optional number followed by sg. that is not immediately followed by |.

^.*(\dsg\.)[^\|]*$
Explanation:
^ : starts from the beginning of the string
.* : accepts any number of initial characters (even nothing)
(\dsg\.) : looks for the group of digit + "sg."
[^\|]* : considers any number of following characters except for |
$ : stops at the end of the string
You can now select your string by getting the first group from your regex
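A quick sketch of this pattern in Python (an assumption, since the question names no language; select_no_pipe_after is an illustrative name). Unlike the lookahead versions, it rejects a | anywhere after the token, not only directly after it.

```python
import re

# [^|]* must reach the end of the string, so no | may appear after the token.
ANYWHERE = re.compile(r"^.*(\dsg\.)[^|]*$")

def select_no_pipe_after(text):
    """Return group 1, or None if any | follows the token."""
    match = ANYWHERE.match(text)
    return match.group(1) if match else None

print(select_no_pipe_after("smth 2sg. smth"))      # 2sg.
print(select_no_pipe_after("smth 2sg.| smth."))    # None
print(select_no_pipe_after("smth 2sg. smth | x"))  # None
```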

Try:
(\d+sg\.(?!\|))
Depending on your programming environment, the syntax may differ a little, but this will get your result.
For more information see Negative Lookahead

Regex Expression to remove "autoplay" parameter in url

I'm trying to match the url https://youtube.com/embed/id and its parameters i.e ?start=10&autoplay=1, but I need the autoplay parameter removed or set to 0.
These are some example urls and what I want the results to look like:
http://www.youtube.com/embed/JW5meKfy3fY?autoplay=1
I want to remove the autoplay parameter and its value:
http://www.youtube.com/embed/JW5meKfy3fY
2nd example
http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1
results should be
http://www.youtube.com/embed/JW5meKfy3fY?start=10
I have tried (https?:\/\/www.youtube.com\/embed\/[a-zA-Z0-9\\-_]+)(\?[^\t\n\f\r \"']*)(\bautoplay=[01]\b&?) and replace with $1$2, but it matches with a trailing ? and & in example 1 and 2 respectively. Also, it doesn't match at all for a url like
http://www.youtube.com/embed/JW5meKfy3fY
I have the regex and examples on here
NB:
The string I am working on contains HTML with one or more YouTube URLs in it, so I don't think I can easily use Go's net/url package to parse the URL.
You're asking for a regex but I think you'd be better off using Go's "net/url" package. Something like this:
import "net/url"
//...
u, _ := url.Parse("http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1")
q := u.Query()
q.Del("autoplay")
u.RawQuery = q.Encode()
cleanURL := u.String()
In real life you'd want to handle the error returned by url.Parse, of course.
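Since the question says the URLs sit inside a larger HTML string, the two ideas can be combined: use a regex only to locate each URL, then let a URL parser edit the query. Below is a rough sketch of that combination in Python rather than Go, purely for illustration. The names EMBED_URL, strip_autoplay and clean_html are hypothetical, and the sketch assumes the HTML uses a raw & between parameters, not &amp;.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Locate embed URLs; stop at whitespace, quotes or angle brackets.
EMBED_URL = re.compile(r"https?://www\.youtube\.com/embed/[^\s\"'<>]+")

def strip_autoplay(match):
    """Rebuild one matched URL without its autoplay parameter."""
    parts = urlsplit(match.group(0))
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "autoplay"]
    return urlunsplit(parts._replace(query=urlencode(query)))

def clean_html(html):
    """Apply strip_autoplay to every embed URL found in the text."""
    return EMBED_URL.sub(strip_autoplay, html)

html = '<iframe src="http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1"></iframe>'
print(clean_html(html))
```

The same structure should carry over to Go with a replace-with-function call and net/url, but that translation is not shown here.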
Here's a solution that ensures a valid page URI. Simply match this and only return capture group 1 and 3.
Edit: The pattern is not elegant, but it ensures no stale ampersands stay. The previous solution was more elegant and, although it wouldn't break anything, it isn't worth the tradeoff imo.
Pattern
(https?:\/\/www\.youtube\.com\/embed\/[^?]+\?.*?)(&autoplay=[01]|autoplay=[01]&?)(.*)
See the demo here.
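A hedged sketch of the "keep groups 1 and 3" replacement, assuming Python's re module (the helper name drop_autoplay is made up). The sketch uses a lazy .*? in the first group. With a greedy .* the first group would swallow the & in front of a trailing autoplay parameter and leave it behind.

```python
import re

PATTERN = re.compile(
    r"(https?://www\.youtube\.com/embed/[^?]+\?.*?)"  # group 1: URL up to autoplay
    r"(&autoplay=[01]|autoplay=[01]&?)"               # group 2: the part to drop
    r"(.*)"                                           # group 3: the rest of the line
)

def drop_autoplay(url):
    """Keep groups 1 and 3, which removes the autoplay parameter in group 2."""
    return PATTERN.sub(r"\1\3", url)

print(drop_autoplay("http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1"))
# http://www.youtube.com/embed/JW5meKfy3fY?start=10
print(drop_autoplay("http://www.youtube.com/embed/JW5meKfy3fY?autoplay=1&start=10"))
# http://www.youtube.com/embed/JW5meKfy3fY?start=10
```

Note that when autoplay is the only parameter, the ? stays at the end of the URL, so that case would need extra handling.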
As the OP has linked to a regex tester that employs the PCRE (PHP) engine, I offer a PCRE-compatible solution. The one token I've used in the regular expression below that is not widely supported in other regex engines is \K (though it is supported by Perl, Ruby, Python's PyPI regex module, R with perl=TRUE and possibly other engines).
\K causes the regex engine to reset the beginning of the match to the current location in the string and to discard any previously-matched characters in the match it returns (if there is one).
With one caveat you can replace matches of the following regular expression with empty strings.
(?x) # assert 'extended'/'free spacing' mode
\bhttps?:\/\/www\.youtube\.com\/embed\/
# match literal
(?=.*autoplay=[01]) # positive lookahead asserts 'autoplay='
# followed by '0' or '1' appears later in
# the string
[a-zA-Z0-9\\_-]+ # match 1+ of the chars in the char class
[^\t\n\f\r \"']* # match 0+ chars other than those in the
# char class
(?<![?&]) # negative lookbehind asserts that previous
# char was neither '?' nor '&'
(?: # begin non-capture group
(?=\?) # use positive lookahead to assert next char
# is a '?'
(?: # begin a non-capture group
(?=.*autoplay=[01]&)
# positive lookahead asserts 'autoplay='
# followed by '0' or '1', then '&' appears
# later in the string
\? # match '?'
)? # end non-capture group and make it optional
\K # reset start of match to current location
# and discard all previously-matched chars
\?? # optionally match '?'
autoplay=[01]&? # match 'autoplay=' followed by '0' or '1',
# optionally followed by '&'
| # or
(?=&) # positive lookahead asserts next char is '&'
\K # reset start of match to current location
# and discard all previously-matched chars
&autoplay=[01]&? # match '&autoplay=' followed by '0' or '1',
# optionally followed by '&'
) # end non-capture group
The one limitation is that it does not match every instance of autoplay=[01] if more than one such substring appears in the string.
I wrote this expression with the x flag, called extended or free spacing mode, to be able to make it self-documenting.
Start your engine!

How to use regular expression to use as few groups as possible to match as long string as possible

For example, this is the regular expression
([a]{2,3})
This is the string
aaaa // 1 match "(aaa)a" but I want "(aa)(aa)"
aaaaa // 2 match "(aaa)(aa)"
aaaaaa // 2 match "(aaa)(aaa)"
However, if I change the regular expression
([a]{2,3}?)
Then the results are
aaaa // 2 match "(aa)(aa)"
aaaaa // 2 match "(aa)(aa)a" but I want "(aaa)(aa)"
aaaaaa // 3 match "(aa)(aa)(aa)" but I want "(aaa)(aaa)"
My question is: is it possible to use as few groups as possible to match as long a string as possible?
How about something like this:
(a{3}(?!a(?:[^a]|$))|a{2})
This looks for either the character a three times (as long as it is not followed by exactly one more a and then a different character or the end of the string) or the character a two times.
Breakdown:
( # Start of the capturing group.
a{3} # Matches the character 'a' exactly three times.
(?! # Start of a negative Lookahead.
a # Matches the character 'a' literally.
(?: # Start of the non-capturing group.
[^a] # Matches any character except for 'a'.
| # Alternation (OR).
$ # Asserts position at the end of the line/string.
) # End of the non-capturing group.
) # End of the negative Lookahead.
| # Alternation (OR).
a{2} # Matches the character 'a' exactly two times.
) # End of the capturing group.
Here's a demo.
Note that if you don't need the capturing group, you can actually use the whole match instead by converting the capturing group into a non-capturing one:
(?:a{3}(?!a(?:[^a]|$))|a{2})
Which would look like this.
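A small check of the non-capturing version against the three strings from the question, assuming Python's re module (the helper name split_runs is illustrative):

```python
import re

# Prefer three a's unless that would strand a single a; otherwise take two.
PATTERN = re.compile(r"(?:a{3}(?!a(?:[^a]|$))|a{2})")

def split_runs(text):
    """Return every match of the pattern as a list of strings."""
    return PATTERN.findall(text)

print(split_runs("aaaa"))    # ['aa', 'aa']
print(split_runs("aaaaa"))   # ['aaa', 'aa']
print(split_runs("aaaaaa"))  # ['aaa', 'aaa']
```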
Try this Regex:
^(?:(a{3})*|(a{2,3})*)$
Click for Demo
Explanation:
^ - asserts the start of the line
(?:(a{3})*|(a{2,3})*) - a non-capturing group containing 2 sub-sequences separated by OR operator
(a{3})* - The first subsequence tries to match 3 occurrences of a. The * at the end allows this subsequence to match 0 or 3 or 6 or 9.... occurrences of a before the end of the line
| - OR
(a{2,3})* - matches 2 to 3 occurrences of a, as many as possible. The * at the end would repeat it 0+ times before the end of the line
$ - asserts the end of the line
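Because this pattern is anchored at both ends, it is most directly useful as a test of whether a whole line can be split into runs of 2 or 3 a's. A minimal sketch, assuming Python's re module (is_splittable is a made-up name). In Python, a group under * only keeps its last repetition, so this sketch does not try to read the individual pieces back out of the groups.

```python
import re

SPLITTABLE = re.compile(r"^(?:(a{3})*|(a{2,3})*)$")

def is_splittable(text):
    """True if the whole string consists of runs of 2 or 3 a's."""
    return SPLITTABLE.match(text) is not None

print(is_splittable("aaaa"))    # True
print(is_splittable("aaaaaa"))  # True
print(is_splittable("a"))       # False
print(is_splittable("aab"))     # False
```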
Try this short regex:
a{2,3}(?!a([^a]|$))
Demo
How it's made:
I started with this simple regex: a{2}a?. It looks for 2 consecutive a's that may be followed by another a. If the 2 a's are followed by another a, it matches all three a's.
This worked for most cases, such as aaaaa and aaaaaa.
However, it failed in cases like aaaa, where it matched (aaa) and left a single a behind.
So now, I knew I had to modify my regex in such a way that it would match the third a only if the third a is not followed by a([^a]|$). So now, my regex looked like a{2}a?(?!a([^a]|$)), and it worked for all cases. Then I just simplified it to a{2,3}(?!a([^a]|$)).
That's it.
EDIT
If you want the capturing behavior, then add parentheses around the regex, like:
(a{2,3}(?!a([^a]|$)))
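The step from the simple regex to the final one can be compared side by side. This is a sketch assuming Python's re module, with chunks as an illustrative helper name:

```python
import re

NAIVE = re.compile(r"a{2}a?")               # the starting point
FIXED = re.compile(r"a{2,3}(?!a([^a]|$))")  # with the negative lookahead

def chunks(pattern, text):
    """Return whole matches; findall would return the inner group instead."""
    return [match.group(0) for match in pattern.finditer(text)]

print(chunks(NAIVE, "aaaa"))    # ['aaa']
print(chunks(FIXED, "aaaa"))    # ['aa', 'aa']
print(chunks(FIXED, "aaaaa"))   # ['aaa', 'aa']
print(chunks(FIXED, "aaaaaa"))  # ['aaa', 'aaa']
```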

RegExp ignore special and repeated characters (username test)

All,
I am trying to construct a regular expression (which I will use to test for valid usernames):
^[(0-9)|(_|\.)]|^[0-9]+$|[^a-zA-z0-9_.]{3,}|(_\.|\._)|\.{2,}|_{2,}
and testing it against this string:
1123#sssssasdf sslkdf*.sf...____.__sfsfdddddsss
What this regular expression should test is:
string should not begin with numbers, underscore or dot
string should be alphanumeric
should not contain characters repeated thrice or more --this fails
should not contain underscore and dot together
should not contain dot and underscore together
should not contain repeated dots
should not contain repeated underscores
It looks like all of the cases are matched except the 3rd. It doesn't catch characters that are repeated thrice or more.
My questions are:
How can I fix this regular expression so it can catch repeated characters?
How can I optimize this regular expression?
Thanks in advance
EDIT
As requested the valid string are:
john
john.snow
john.snow123
john1.snow1
john_snow
john_snow123
john1_snow1
The invalid strings are:
123
1john.snow
.john_snow
john__snow
john..snow
jjjohn.snow
_john_snow
And can be done this way too -
(?i)^(?=[a-z])(?!.*?(?:\._|_\.|\.{2}|_{2}|([a-z\d])\1{2}))[a-z\d._]+$
Formatted:
(?i) # Case insensitive
^ # BOS
(?= [a-z] ) # First char alpha
(?! # Not these
.*?
(?:
\._ # dot, underscore
| _\. # underscore, dot
| \.{2} # 2 or more dots
| _{2} # 2 or more underscores
| ( [a-z\d] ) # (1), 3 or more repeated alpha-num's
\1{2}
)
)
[a-z\d._]+ # Get valid char's: alpha-num, dot and underscore
$ # EOS
^(?![0-9_.])(?!.*([._])\1)(?!.*(?:_\.|\._))(?!.*(.)\2{2,})[\w.]+$
You can add a negative lookahead for each of the conditions. See demo.
https://regex101.com/r/rO0yD8/6
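Both patterns can be checked against the valid and invalid usernames listed in the question. A minimal sketch, assuming Python's re module (the names FIRST, SECOND and is_valid are only for this example):

```python
import re

FIRST = re.compile(
    r"(?i)^(?=[a-z])(?!.*?(?:\._|_\.|\.{2}|_{2}|([a-z\d])\1{2}))[a-z\d._]+$"
)
SECOND = re.compile(
    r"^(?![0-9_.])(?!.*([._])\1)(?!.*(?:_\.|\._))(?!.*(.)\2{2,})[\w.]+$"
)

VALID = ["john", "john.snow", "john.snow123", "john1.snow1",
         "john_snow", "john_snow123", "john1_snow1"]
INVALID = ["123", "1john.snow", ".john_snow", "john__snow",
           "john..snow", "jjjohn.snow", "_john_snow"]

def is_valid(pattern, name):
    """True if the username passes the given pattern."""
    return pattern.match(name) is not None

for pattern in (FIRST, SECOND):
    assert all(is_valid(pattern, name) for name in VALID)
    assert not any(is_valid(pattern, name) for name in INVALID)
print("both patterns agree with the sample lists")
```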

The different behavior of OR operator in regex when captured or not

I have two regular expressions: one is ^0|[1-9][0-9]*$, the other is ^(0|[1-9][0-9]*)$. The first expression matches the string "01", while the latter does not. What is the difference between the two expressions? In my opinion, the latter only adds a capture of the matched string. I want to know why the latter can't match the "01" string.
See graphic explanation
^0|[1-9][0-9]*$
Debuggex Demo
Versus
^(0|[1-9][0-9]*)$
Debuggex Demo
So the second RegEx requires the whole string to be either "0" or a number that starts with a character in 1-9.
Look at them this way:
^0 # Match a 0 at the start of the string
| # or
[1-9][0-9]*$ # match a number >= 1 at the end of the string.
versus
^ # Match the start of the string.
( # Start of group 1:
0 # Match a zero
| # or
[1-9][0-9]* # a number >= 1.
) # End of group 1.
$ # Match the end of the string.
The alternation extends to the anchors in the first example whereas it's contained within the group in the second example.
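The difference is easy to see by running both patterns over a few inputs. A small sketch, assuming Python's re module (first_match is an illustrative helper name):

```python
import re

UNGROUPED = re.compile(r"^0|[1-9][0-9]*$")    # (^0) OR ([1-9][0-9]*$)
GROUPED = re.compile(r"^(0|[1-9][0-9]*)$")    # ^ (0 OR [1-9][0-9]*) $

def first_match(pattern, text):
    """Return the matched text, or None if the pattern does not match."""
    match = pattern.search(text)
    return match.group(0) if match else None

for text in ("01", "10", "abc7"):
    print(text, first_match(UNGROUPED, text), first_match(GROUPED, text))
# 01 0 None
# 10 10 10
# abc7 7 None
```

The ungrouped pattern accepts "01" through its ^0 branch and "abc7" through its [1-9][0-9]*$ branch, because each branch carries only one of the two anchors.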

regex conditional statement (only parse numbers if string has a certain beginning)

first I used a string that returned my relaystates, so “1.0.0.0.1.1.0.0” would get parsed/grouped with \d+,
then my eight switches used ‘format response’, e.g. {1} to get the state for each switch.
now I need to get the numbers out of this string: “RELAYS.1.0.0.0.1.1.0.0”
\d+ will still get the numbers but I only want to get them IF the string starts with “RELAYS”
can anyone please explain how I could do that?
thnx a million in advance!
With a .NET engine, you could use the regex (?<=^RELAYS[\d.]*)\d+. But most regex engines don't support indefinite repetition in a lookbehind assertion.
See it live on regexhero.net.
Explanation:
(?<= # Assert that the following can be matched before the current position:
^RELAYS # Start of string, followed by "RELAYS"
[\d.]* # and any number of digits/dots.
) # End of lookbehind assertion
\d+ # Match one or more digits.
With a PCRE engine, you could use (?:^RELAYS\.|\G\.)(\d+) and access group 1 for each match.
See it live on regex101.com.
Explanation:
(?: # Start a non-capturing group that matches...
^RELAYS\. # either the start of the string and "RELAYS."
| # or
\G\. # the position after the previous match, followed by "."
) # End of non-capturing group
(\d+) # Match a number and capture it in group 1
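For engines that support neither variable-length lookbehind nor \G (Python's built-in re module is one), a two-step approach gives the same result: first check the prefix, then pull the numbers out of what follows. This is a different technique from the two regexes above, sketched here under the assumption that Python is available. The function name relay_states is hypothetical.

```python
import re

def relay_states(text):
    """Return the numbers after a leading 'RELAYS', or [] if the prefix is absent."""
    # Step 1: require "RELAYS" at the start and grab the run of digits and dots.
    prefix = re.match(r"RELAYS([\d.]*)", text)
    if prefix is None:
        return []
    # Step 2: extract each number from that run only.
    return re.findall(r"\d+", prefix.group(1))

print(relay_states("RELAYS.1.0.0.0.1.1.0.0"))  # ['1', '0', '0', '0', '1', '1', '0', '0']
print(relay_states("1.0.0.0.1.1.0.0"))         # []
```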