Regex Expression to remove "autoplay" parameter in url - regex

I'm trying to match the url https://youtube.com/embed/id and its parameters i.e ?start=10&autoplay=1, but I need the autoplay parameter removed or set to 0.
These are some example urls and what I want the results to look like:
http://www.youtube.com/embed/JW5meKfy3fY?autoplay=1
I want to remove the autoplay parameter and its value:
http://www.youtube.com/embed/JW5meKfy3fY
2nd example
http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1
results should be
http://www.youtube.com/embed/JW5meKfy3fY?start=10
I have tried (https?:\/\/www.youtube.com\/embed\/[a-zA-Z0-9\\-_]+)(\?[^\t\n\f\r \"']*)(\bautoplay=[01]\b&?) and replace with $1$2, but it matches with a trailing ? and & in example 1 and 2 respectively. Also, it doesn't match at all for a url like
http://www.youtube.com/embed/JW5meKfy3fY
I have the regex and examples on here
NB:
The string I am working on contains HTML with one or more youtube urls in it, so I don't think I can easily use go's net/url package to parse the url.

You're asking for a regex but I think you'd be better off using Go's "net/url" package. Something like this:
import "net/url"
//...
u, _ := url.Parse("http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1")
q := u.Query()
q.Del("autoplay")
u.RawQuery = q.Encode()
clean_url_string = u.String()
In real life you'd want to handle errors from u.Parse of course.

Here's a solution that ensures a valid page URI. Simply match this and only return capture group 1 and 3.
Edit: The pattern is not elegant but it ensures no stale ampersands stay. The previous solution was more elegant and albeit wouldn't break anything, isn't worth the tradeoff imo.
Pattern
(https?:\/\/www\.youtube\.com\/embed\/[^?]+\?.*)(&autoplay=[01]|autoplay=[01]&?)(.*)
See the demo here.

As the OP has linked to a regex tester that employs the the PCRE (PHP) engine I offer a PCRE-compatible solution. The one token I've used in the regular expression below that is not widely supported in other regex engines is \K (though it is supported by Perl, Ruby, Python's PyPI regex module, R with Perl=TRUE and possibly other engines.
\K causes the regex engine to reset the beginning of the match to the current location in the string and to discard any previously-matched characters in the match it returns (if there is one).
With one caveat you can replace matches of the following regular expression with empty strings.
(?x) # assert 'extended'/'free spacing' mode
\bhttps?:\/\/www.youtube.com\/embed\/
# match literal
(?=.*autoplay=[01]) # positive lookahead asserts 'autoplay='
# followed by '1' or '2' appears later in
# the string
[a-zA-Z0-9\\_-]+ # match 1+ of the chars in the char class
[^\t\n\f\r \"']* # match 0+ chars other than those in the
# char class
(?<![?&]) # negative lookbehind asserts that previous
# char was neither '?' nor '&'
(?: # begin non-capture group
(?=\?) # use positive lookahead to assert next char
# is a '?'
(?: # begin a non-capture group
(?=.*autoplay=[01]&)
# positive lookahead asserts 'autoplay='
# followed by '1' or '2', then '&' appears
# later in the string
\? # match '?'
)? # end non-capture group and make it optional
\K # reset start of match to current location
# and discard all previously-matched chars
\?? # optionally match '?'
autoplay=[01]&? # match 'autoplay=' followed by '1' or '2',
# optionally followed by '&'
| # or
(?=&) # positive lookahead asserts next char is '&'
\K # reset start of match to current location
# and discard all previously-matched chars
&autoplay=[01]&? # match '&autoplay=' followed by '1' or '2',
# optionally followed by '&'
) # end non-capture group
The one limitation is that it fails to match all instances of .autoplay=.. if more than one such substring appears in the string.
I wrote this expression with the x flag, called extended or free spacing mode, to be able to make it self-documenting.
Start your engine!

Related

Regex to pick out key information between words/characters

I have a string as follows:
players: 2-8
Using regex how would I match the 2 and the 8 without matching everything else (ie 'players: ' and the '-')?
I have tried:
players:\s*([^.]+|\S+)
However, this matches the entire phrase and also uses a '.' at the end to mark the end of the string which might not always be the case.
It'd be much better if I could use the '-' to match the numbers, but I also need it to be looking ahead from 'players' as I will be using this to know that the data is correct for a given variable.
Using python if that's important
Thanks!!
Using players:\s*([^.]+|\S+) will use a single capture group matching either any char except a dot, or match a non whitespace char. Combining those, it can match any character.
To get the matches only using, you could make use of the Python PyPi regex module you can use the \G anchor:
(?:\bplayers:\s+|\G(?!^))-?\K\d+
The pattern matches:
(?: Non capture group
\bplayers:\s+ A word boundary to prevent a partial word match, then match players: and 1+ whitespace chars
| Or
\G(?!^) Anchor to assert the current position at the end of the previous match to continue matching
) Close non capture group
-?\K Match an optional - and forget what is matched so far
\d+ Match 1+ digits
Regex demo | Python demo
import regex
s = "players: 2-8"
pattern = r"(?:\bplayers:\s+|\G(?!^))-?\K\d+"
print(regex.findall(pattern, s))
Output
['2', '8']
You could also use a approach using 2 capture groups with re
import re
s = "players: 2-8"
pattern = r"\bplayers:\s+(\d+)-(\d+)\b"
print(re.findall(pattern, s))
Output
[('2', '8')]

Regex for find and replace in N++ of number followed by anything

I have a text file with the following lines:
asm-java-2.0.0-lib
cib-slides-3.1.0
lib-hibernate-common-4.0.0-beta
astp
act4lib-4.0.0
I want to remove everything from, including the '-' before the numbers begin so the results look like:
2.0.0-lib
3.1.0
4.0.0-beta
act4lib
Does anyone know the correct regex for this? So far I have come up with -\D.*(a-z)* but its got too many errors.
Ctrl+H
Find what: ^.*?(?=\d|$)
Replace with: LEAVE EMPTY
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
.*? # 0 or more any character but newline, not greedy
(?= # start lookahead, zero-length assertion that makes sure we have after
\d # a digit
| # OR
$ # end of line
) # end lookahead
Result for given example:
2.0.0-lib
3.1.0
4.0.0-beta
Another solution that deals with act4lib-4.0.0:
Ctrl+H
Find what: ^(?:.*-(?=\d)|\D+)
Replace with: LEAVE EMPTY
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(?: # start non capture group
.* # 0 or more any character but newline
- # a dash
(?=\d) # lookahead, zero-length assertion that makes sure we have a digit after
| # OR
\D+ # 1 or more non digit
) # end group
Replacement:
\t # a tabulation, you may replace with what you want
Given:
asm-java-2.0.0-lib
cib-slides-3.1.0
lib-hibernate-common-4.0.0-beta
astp
act4lib-4.0.0
Result for given example:
2.0.0-lib
3.1.0
4.0.0-beta
4.0.0
Use
^\D+\-
If you want to completely remove lines without numbers then use this
^\D+(\-|$)
In case the packages contain numbers in their names like act4lib-4.0.0 then a longer variant is needed
^[\w-]+(\-(?=\d+\.\d+)|$)
It can be shortened to ^.+?(\-(?=\d+\.)|$) but I just want to be sure so I also check the minor version number
The ^ will match from the start of line

Selecting if no delimiter, and no selecting if it is

I have string like "smth 2sg. smth", and sometimes "smth 2sg.| smth.".
What mask should I use for selecting "2sg." if string does not contains"|", and select nothing if string does contains "|"?
I have 2 methods. They both use something called a Negative Lookahead, which is used like so:
(?!data)
When this is inserted into a RegEx, it means if data exists, the RegEx will not match.
More info on the Negative Lookahead can be found here
Method 1 (shorter)
Just capture 2sg.
Try this RegEx:
(\dsg\.)(?!\|)
Use (\d+... if the number could be longer than 1 digit
Live Demo on RegExr
How it works:
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if contains |
Method 2 (longer but safer)
Match the whole string and capture 2sg.
Try this RegEx:
^\w+\s*(\dsg\.)(?!\|)\s*\w+\.?$
Use (\d+sg... if the number could be longer than 1 digit
Live Demo on RegExr
How it works:
^ # String starts with ...
\w+\s* # Letters then Optional Whitespace (smth )
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if contains |
\s* # Optional Whitespace
\w+ # Letters (smth)
\.? # Optional . (Dot)
$ # ... Strings ends with
Something like this might work for you:
(\d*sg\.)(?!\|)
It assumes that there is(or there is no)number followed by sg. and not followed by |.
^.*(\dsg\.)[^\|]*$
Explanation:
^ : starts from the beginning of the string
.* : accepts any number of initial characters (even nothing)
(\dsg\.) : looks for the group of digit + "sg."
[^\|]* : considers any number of following characters except for |
$ : stops at the end of the string
You can now select your string by getting the first group from your regex
Try:
(\d+sg.(?!\|))
depending on your programming environment, it can be little bit different but will get your result.
For more information see Negative Lookahead

How to write a RegEx pattern that accepts a string with at most one of each letter, but unordered?

I have tried this:
[a]?[b]?[c]?[d]?[e]?[f]?[g]?[h]?[i]?[j]?[k]?[l]?[m]?[n]?[o]?[p]?[q]?[r]?[s]?[t]?[u]?[v]?[w]?[x]?[y]?[z]?
But this RegEx rejects string where the order in not alphabetical, like these:
"zabc"
"azb"
I want patterns like these two to be accepted too. How could I do that?
EDIT 1
I don't want letter repetitions, i.e., I want the following strings to be rejected:
aazb
ozob
Thanks.
You can use a negative lookahead assertion to make sure no two characters are the same:
^(?!.*(.).*\1)[a-z]*$
Explanation:
^ # Start of string
(?! # Assert that it's impossible to match the following:
.* # any number of characters
(.) # followed by one character (capture that in group 1)
.* # followed by any number of characters
\1 # followed by the same character as the one captured before
) # End of lookahead
[a-z]* # Match any number of ASCII lowercase letters
$ # End of string
Test it live on regex101.com.
Note: This regex needs to brute-force check all possible character pairs, so performance may be a problem with larger strings. If you can use anything besides regex, you're going to be happier. For example, in Python:
if re.search("^[a-z]*$", mystring) and len(mystring) == len(set(mystring)):
# valid string

regex conditional statement (only parse numbers if string has a certain beginning)

first I used a string that returned my relaystates, so “1.0.0.0.1.1.0.0” would get parsed/grouped with \d+,
then my eight switches used ‘format response’, e.g. {1} to get the state for each switch.
now I need to get the numbers out of this string: “RELAYS.1.0.0.0.1.1.0.0”
\d+ will still get the numbers but I only want to get them IF the string starts with “RELAYS"
can anyone please explain how I could do that?
thnx a million in advance!
Edited icebear (today 00:24)
With a .NET engine, you could use the regex (?<=^RELAYS[\d.]*)\d+. But most regex engines don't support indefinite repetition in a negative lookbehind assertion.
See it live on regexhero.net.
Explanation:
(?<= # Assert that the following can be matched before the current position:
^RELAYS # Start of string, followed by "RELAYS"
[\d.]* # and any number of digits/dots.
) # End of lookbehind assertion
\d+ # Match one or more digits.
With a PCRE engine, you could use (?:^RELAYS\.|\G\.)(\d+) and access group 1 for each match.
See it live on regex101.com.
Explanation:
(?: # Start a non-capturing group that matches...
^RELAYS\. # either the start of the string and "RELAYS."
| # or
\G\. # the position after the previous match, followed by "."
) # End of non-capturing group
(\d+) # Match a number and capture it in group 1