Forcing regex to ignore detection depending on preposition - regex

I'm trying to build a regex, which will detect usernames mentioned in a string. The usernames can look like "username", "username[0-9]", "adm-username", "adm-username[0-9]".
As of now, I have this: \b(adm\-){0,1}username[0-9]{0,1}\b (link: https://regexr.com/4at34)
The problem is with adm-. If the prefix is something else, as in aadm-username, the regex still detects 'username'; I want it to fail. Any tips on how to do that?
Thanks

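In your case, you could replace \b with negative lookarounds built on [\w-], so that a hyphen also counts as part of the name.
Lookarounds only assert; they don't match (consume) the boundary characters.
And finally, don't capture intermediate groups; make a single big group for your matches.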
Demo
(?<![\w-])((?:adm-)?username\d?)(?![\w-])
[v] username
[v] username2
[v] adm-username
[v] adm-username2
[x] aadm-username
[x] aadm-username2
Explanation
(?<![\w-]) # negative lookbehind, only match if no word character or hyphen comes immediately before
(
(?:adm-)? # non-capturing group containing adm- literally, once or not at all; it is captured as part of the outer group
username\d? # literally matching username and a digit, once or not at all
)
(?![\w-]) # negative lookahead, only match if no word character or hyphen comes immediately after
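
A quick way to check the pattern is to run it against the six sample names. The following is only a minimal sketch: the question does not name a language, so Python's re module is an assumption here, and the helper name find_usernames is made up for illustration.

```python
import re

# Pattern from the answer above: no word character or hyphen
# may sit directly before or after the match.
USERNAME = re.compile(r"(?<![\w-])((?:adm-)?username\d?)(?![\w-])")


def find_usernames(text):
    """Return every username mention found in text (hypothetical helper)."""
    return USERNAME.findall(text)


samples = [
    "username", "username2", "adm-username", "adm-username2",
    "aadm-username", "aadm-username2",
]
for sample in samples:
    # The last two samples produce an empty list.
    print(sample, "->", find_usernames(sample))
```

Because the lookbehind rejects any preceding word character or hyphen, the engine cannot fall back to matching the bare username inside aadm-username.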

Related

Regex to pick out key information between words/characters

I have a string as follows:
players: 2-8
Using regex how would I match the 2 and the 8 without matching everything else (ie 'players: ' and the '-')?
I have tried:
players:\s*([^.]+|\S+)
However, this matches the entire phrase and also uses a '.' at the end to mark the end of the string which might not always be the case.
It'd be much better if I could use the '-' to match the numbers, but I also need it to be looking ahead from 'players' as I will be using this to know that the data is correct for a given variable.
Using python if that's important
Thanks!!
Using players:\s*([^.]+|\S+) will use a single capture group matching either any char except a dot, or match a non whitespace char. Combining those, it can match any character.
To get only the digits as separate matches, you could use the PyPI regex module for Python, which supports the \G anchor and \K:
(?:\bplayers:\s+|\G(?!^))-?\K\d+
The pattern matches:
(?: Non capture group
\bplayers:\s+ A word boundary to prevent a partial word match, then match players: and 1+ whitespace chars
| Or
\G(?!^) Anchor to assert the current position at the end of the previous match to continue matching
) Close non capture group
-?\K Match an optional - and forget what is matched so far
\d+ Match 1+ digits
Regex demo | Python demo
import regex
s = "players: 2-8"
pattern = r"(?:\bplayers:\s+|\G(?!^))-?\K\d+"
print(regex.findall(pattern, s))
Output
['2', '8']
You could also use an approach with 2 capture groups and the built-in re module:
import re
s = "players: 2-8"
pattern = r"\bplayers:\s+(\d+)-(\d+)\b"
print(re.findall(pattern, s))
Output
[('2', '8')]

Regex Negative Lookahead Is Being Ignored

I have the following sample text
[Item 1](./path/notes.md)
[Item 2](./path)
[Item 3](./path/notes.md)
[Item 4](./path)
When I applied the following regex \[(.*)\].*(?!notes\.md).*\), I expected the following output when I printed the first capture group
Item 2
Item 4
but what I ended up getting was
Item 1
Item 2
Item 3
Item 4
which, to me, seems that the negative lookahead portion, (?!notes\.md), was being ignored for some reason, so the regex was matching the whole string.
What is really confusing me, is that a positive lookahead works as I would expect. For example using \[(.*)\].*(?=notes\.md).*\) returns the following when printing the first capture group
Item 1
Item 3
which makes sense, so I am really confused as to why the negative lookahead is not functioning properly.
Let's go through what happens when matching your pattern on item 1:
\[(.*)\] matches [Item 1]
.* matches (./path/notes.md
The remaining string is now )
(?!notes\.md) checks whether the remaining string matches the pattern notes\.md. It does not, so the match continues.
\) matches the ) and the match succeeds.
If you change it so that the .* before the lookahead is inside the lookahead instead (\[(.*)\](?!.*notes\.md).*\)), it will now work as follows:
\[(.*)\] matches [Item 1]
The remaining string is now (./path/notes.md)
(?!.*notes\.md) checks whether the remaining string matches the pattern .*notes\.md, which it does, so the match fails (well, more precisely, the regex engine will try to backtrack before it gives up on the match, but there's no alternative way to match \[(.*)\], so the match still fails).
So with that change, it will reject all strings where notes.md appears anywhere before the ). If you want it to only reject instances where notes.md appears directly before the ), you can use a lookbehind (without .*) instead or add the \) to the lookahead.
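
The walkthrough above can be reproduced with a few lines of code. This is a minimal sketch that assumes Python's re module (the question does not say which engine is used); the variable names are illustrative only.

```python
import re

text = """[Item 1](./path/notes.md)
[Item 2](./path)
[Item 3](./path/notes.md)
[Item 4](./path)"""

# Pattern from the question: the .* before the lookahead lets the engine
# find a position where notes.md no longer follows.
original = re.compile(r"\[(.*)\].*(?!notes\.md).*\)")

# Variant with the .* moved inside the lookahead.
fixed = re.compile(r"\[(.*)\](?!.*notes\.md).*\)")

print(original.findall(text))  # ['Item 1', 'Item 2', 'Item 3', 'Item 4']
print(fixed.findall(text))     # ['Item 2', 'Item 4']
```

Since . does not match a newline by default, each line is handled on its own here without any extra flags.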
In short, you have way too many .* (which can lead to catastrophic backtracking, look it up!). Remember that .* matches any character zero or more times. That means the engine will keep trying different lengths until it finds a match, and that is not necessarily the number of characters you want.
A way to solve your problem is to move the negative look ahead to the front, like this:
(?!.*notes\.md)\[([^\]]+)\].*
Explanation:
(?!.*notes\.md) negative look ahead for any number of any char followed by 'notes.md'
\[ a [ character
([^\]]+) group 1, any character not being ], one or more times
\] a ] character
.* the rest of the line
Use the 'multiline' flag to get each line.
The problem here is that the .* before the negative lookahead is greedy: it consumes the line first, so the lookahead ends up being checked at a position where notes.md no longer follows.
A way to manage this is to move the .* inside the negative lookahead, like here:
https://regex101.com/r/yzUQoP/1
/\[(.*)\](?!.*notes\.md)/gm
The pattern \[(.*)\].*(?!notes\.md).*\) that you tried matches from the first [ until the last occurrence of ]
Then what happens is that .* will match the rest of the line, so the following assertion (?!notes\.md) will always be true as the rest of the line is already matched.
Then the engine can backtrack to match the last )
If you don't want to cross [] and () while matching:
\[([^][]+)]\((?![^()]*\bnotes\.md\b)[^()]*\)
\[ Match [
([^][]+) Capture group 1, match 1+ times any char other than [ and ]
]\( Match ](
(?! Negative lookahead
[^()]*\bnotes\.md\b Match 0+ times any char other than ( and ) and then match notes.md between word boundaries to prevent partial matches
) Close lookahead
[^()]* Match 0+ times any char except ( and )
\) Match )
Regex demo
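
The benefit of the negated character classes shows up when several links share one line. Below is a rough sketch, again assuming Python's re module; the extra sample line with two links is invented for illustration and is not part of the question.

```python
import re

# Pattern from the answer above; [^][] and [^()] keep the match
# inside a single [label](target) pair.
LINK = re.compile(r"\[([^][]+)]\((?![^()]*\bnotes\.md\b)[^()]*\)")

text = """[Item 1](./path/notes.md)
[Item 2](./path)
[Item 3](./path/notes.md)
[Item 4](./path)"""

# Hypothetical extra input: two links on the same line.
same_line = "[A](./x) and [B](./notes.md)"

print(LINK.findall(text))       # ['Item 2', 'Item 4']
print(LINK.findall(same_line))  # ['A']
```

Each link is judged on its own target, so a notes.md link later on the same line does not cause an earlier link to be rejected.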

Regex Expression to remove "autoplay" parameter in url

I'm trying to match the url https://youtube.com/embed/id and its parameters i.e ?start=10&autoplay=1, but I need the autoplay parameter removed or set to 0.
These are some example urls and what I want the results to look like:
http://www.youtube.com/embed/JW5meKfy3fY?autoplay=1
I want to remove the autoplay parameter and its value:
http://www.youtube.com/embed/JW5meKfy3fY
2nd example
http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1
results should be
http://www.youtube.com/embed/JW5meKfy3fY?start=10
I have tried (https?:\/\/www.youtube.com\/embed\/[a-zA-Z0-9\\-_]+)(\?[^\t\n\f\r \"']*)(\bautoplay=[01]\b&?) and replace with $1$2, but it matches with a trailing ? and & in example 1 and 2 respectively. Also, it doesn't match at all for a url like
http://www.youtube.com/embed/JW5meKfy3fY
I have the regex and examples on here
NB:
The string I am working on contains HTML with one or more youtube urls in it, so I don't think I can easily use go's net/url package to parse the url.
You're asking for a regex but I think you'd be better off using Go's "net/url" package. Something like this:
import "net/url"
//...
u, _ := url.Parse("http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1")
q := u.Query()
q.Del("autoplay")
u.RawQuery = q.Encode()
clean_url_string := u.String()
In real life you'd want to handle the error returned by url.Parse, of course.
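
Since the question mentions HTML containing several URLs, the parse-and-rebuild idea can be combined with a simple regex whose only job is to locate the URLs. The sketch below does this in Python rather than Go, purely as an illustration. The names EMBED_URL, strip_autoplay and clean_html are hypothetical, and it assumes the URLs in the HTML separate parameters with a plain & rather than &amp;.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: an embed URL ends at whitespace, a quote, or an angle bracket.
EMBED_URL = re.compile(r"https?://www\.youtube\.com/embed/[^\s\"'<>]+")


def strip_autoplay(url):
    """Return url with any autoplay query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "autoplay"
    ]
    # An empty query makes urlunsplit drop the '?' as well.
    return urlunsplit(parts._replace(query=urlencode(kept)))


def clean_html(html):
    """Rewrite every embed URL found in html (hypothetical helper)."""
    return EMBED_URL.sub(lambda match: strip_autoplay(match.group(0)), html)


print(strip_autoplay("http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1"))
# http://www.youtube.com/embed/JW5meKfy3fY?start=10
```

Note that urlencode re-encodes the remaining parameters, so their escaping may change slightly even though their meaning does not.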
Here's a solution that ensures a valid page URI. Simply match this and only return capture group 1 and 3.
Edit: The pattern is not elegant, but it ensures no stale ampersands remain. The previous solution was more elegant and, although the leftover ampersands wouldn't break anything, it isn't worth the tradeoff imo.
Pattern
(https?:\/\/www\.youtube\.com\/embed\/[^?]+\?.*?)(&autoplay=[01]|autoplay=[01]&?)(.*)
See the demo here.
As the OP has linked to a regex tester that employs the PCRE (PHP) engine, I offer a PCRE-compatible solution. The one token I've used in the regular expression below that is not widely supported in other regex engines is \K (though it is supported by Perl, Ruby, Python's PyPI regex module, R with perl=TRUE, and possibly other engines).
\K causes the regex engine to reset the beginning of the match to the current location in the string and to discard any previously-matched characters in the match it returns (if there is one).
With one caveat you can replace matches of the following regular expression with empty strings.
(?x) # assert 'extended'/'free spacing' mode
\bhttps?:\/\/www.youtube.com\/embed\/
# match literal
(?=.*autoplay=[01]) # positive lookahead asserts 'autoplay='
# followed by '0' or '1' appears later in
# the string
[a-zA-Z0-9\\_-]+ # match 1+ of the chars in the char class
[^\t\n\f\r \"']* # match 0+ chars other than those in the
# char class
(?<![?&]) # negative lookbehind asserts that previous
# char was neither '?' nor '&'
(?: # begin non-capture group
(?=\?) # use positive lookahead to assert next char
# is a '?'
(?: # begin a non-capture group
(?=.*autoplay=[01]&)
# positive lookahead asserts 'autoplay='
# followed by '0' or '1', then '&' appears
# later in the string
\? # match '?'
)? # end non-capture group and make it optional
\K # reset start of match to current location
# and discard all previously-matched chars
\?? # optionally match '?'
autoplay=[01]&? # match 'autoplay=' followed by '0' or '1',
# optionally followed by '&'
| # or
(?=&) # positive lookahead asserts next char is '&'
\K # reset start of match to current location
# and discard all previously-matched chars
&autoplay=[01]&? # match '&autoplay=' followed by '0' or '1',
# optionally followed by '&'
) # end non-capture group
The one limitation is that it fails to match all instances of .autoplay=.. if more than one such substring appears in the string.
I wrote this expression with the x flag, called extended or free spacing mode, to be able to make it self-documenting.
Start your engine!

If pattern repeats two times (nonconsecutive) match both patterns, regex

I have 3 values that I'm trying to match. foo, bar and 123. However I would like to match them only if they can be matched twice.
In the following line:
foo;bar;123;foo;123;
Since bar is not present twice, it would only match foo and 123:
foo;bar;123;foo;123;
I understand how to specify to match exactly two matches, (foo|bar|123){2} however I need to use backreferences in order to make it work in my example.
I'm struggling putting the two concepts together and making a working solution for this.
You could use
(?<=^|;)([^\n;]+)(?=.*(?:(?<=^|;)\1(?=;|$)))
Broken down, this is
(?<=^|;) # pos. lookbehind, either start of string or ;
([^\n;]+) # not ; nor newline 1+ times
(?=.* # pos. lookahead
(?:
(?<=^|;) # same pattern as above
\1 # group 1
(?=;|$) # end or ;
)
)
\b # word boundary
([^;]+) # anything not ; 1+ times
\b # another word boundary
(?=.*\1) # pos. lookahead, making sure the pattern is found again
See a demo on regex101.com.
Otherwise - as said in the comments - split on the ; programmatically and use some programming logic afterwards.
Find a demo in Python for example (can be adjusted for other languages as well):
from collections import Counter
string = """
foo;bar;123;foo;123;
foo;bar;foo;bar;
foo;foo;foo;bar;bar;
"""
twins = [element
for line in string.split("\n")
for element, times in Counter(line.split(";")).most_common()
if times == 2]
print(twins)
Making sure to allow room for text that may occur in between matches with a ".*", this should match any of your values that occur at least twice:
(foo|bar|123).*\1
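
One difference between this pattern and a lookahead-based one is worth seeing in action: (foo|bar|123).*\1 consumes everything up to the repeat, so a scan can miss other values inside that stretch. Here is a rough sketch, assuming Python's re module; the lookahead variant is adapted from the lookahead idea shown earlier rather than quoted from it.

```python
import re

line = "foo;bar;123;foo;123;"

# Consuming version: the first match swallows "foo;bar;123;foo",
# which leaves only one 123 for the rest of the scan.
consuming = re.compile(r"(foo|bar|123).*\1")

# Lookahead version: only the value itself is consumed, so every
# value that repeats later in the line is reported.
lookahead = re.compile(r"(foo|bar|123)(?=.*\1)")

print(consuming.findall(line))  # ['foo']
print(lookahead.findall(line))  # ['foo', '123']
```

Neither pattern checks the ; delimiters, so foo would also be found inside a longer item such as foobar.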

Regex - Find all matching words that don't begin with a specific prefix

How would I construct a regular expression to find all words that end in a string but don't begin with a string?
e.g. Find all words that end in 'friend' that don't start with the word 'girl' in the following sentence:
"A boyfriend and girlfriend gained a friend when they asked to befriend them"
The words 'boyfriend', 'friend' and 'befriend' should match. The word 'girlfriend' should not.
Off the top of my head, you could try:
\b # word boundary - matches start of word
(?!girl) # negative lookahead for literal 'girl'
\w* # zero or more letters, numbers, or underscores
friend # literal 'friend'
\b # word boundary - matches end of word
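
As a quick check of the lookahead version, here is a minimal sketch assuming Python's re module (the question does not name a language); re.VERBOSE allows the commented layout above to be kept as it is.

```python
import re

sentence = "A boyfriend and girlfriend gained a friend when they asked to befriend them"

pattern = re.compile(
    r"""
    \b        # word boundary - matches start of word
    (?!girl)  # negative lookahead for literal 'girl'
    \w*       # zero or more letters, numbers, or underscores
    friend    # literal 'friend'
    \b        # word boundary - matches end of word
    """,
    re.VERBOSE,
)

print(pattern.findall(sentence))  # ['boyfriend', 'friend', 'befriend']
```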
Update
Here's another non-obvious approach which should work in any modern implementation of regular expressions:
Assuming you wish to extract a pattern which appears within multiple contexts but you only want to match if it appears in a specific context, you can use an alternation where you first specify what you don't want and then capture what you do.
So, using your example, to extract all of the words that either are or end in friend except girlfriend, you'd use:
\b # word boundary
(?: # start of non-capture group
girlfriend # literal (note 1)
| # alternation
( # start of capture group #1 (note 2)
\w* # zero or more word chars [a-zA-Z0-9_]
friend # literal
) # end of capture group #1
) # end of non-capture group
\b
Notes:
1. This is what we do not wish to capture.
2. And this is what we do wish to capture.
Which can be described as:
for all words
first, match 'girlfriend' and do not capture (discard)
then match any word that is or ends in 'friend' and capture it
In Javascript:
const target = 'A boyfriend and girlfriend gained a friend when they asked to befriend them';
const pattern = /\b(?:girlfriend|(\w*friend))\b/g;
let result = [];
let arr;
while((arr=pattern.exec(target)) !== null){
if(arr[1]) {
result.push(arr[1]);
}
}
console.log(result);
which, when run, will print:
[ 'boyfriend', 'friend', 'befriend' ]
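
The same trick carries over to other engines. Here is a minimal sketch of an equivalent in Python, assuming the standard re module; as in the JavaScript version, matches where group 1 did not participate (the discarded 'girlfriend') are skipped.

```python
import re

target = "A boyfriend and girlfriend gained a friend when they asked to befriend them"
pattern = re.compile(r"\b(?:girlfriend|(\w*friend))\b")

# group(1) is None whenever the 'girlfriend' branch matched instead.
result = [
    match.group(1)
    for match in pattern.finditer(target)
    if match.group(1) is not None
]
print(result)  # ['boyfriend', 'friend', 'befriend']
```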
This may work:
\w*(?<!girl)friend
you could also try
\w*(?<!girl)friend\w* if you wanted to match words like befriended or boyfriends.
I'm not sure if (?<! is available in all regex flavors, but this expression worked in Expresso (which I believe uses the .NET engine).
Try this:
/\b(?!girl)\w*friend\b/ig
I changed Rob Raisch's answer into a regexp that finds words containing a specific substring, but not also containing a different specific substring:
\b(?![\w_]*Unwanted[\w_]*)[\w_]*Desired[\w_]*\b
So for example \b(?![\w_]*mon[\w_]*)[\w_]*day[\w_]*\b will find every word with "day" in it (e.g. day, tuesday, daywalker), except if it also contains "mon" (e.g. monday).
Maybe useful for someone.
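
To try the template with different substrings, it can be wrapped in a small function. This is only a sketch assuming Python's re module; the function name words_with is hypothetical, and re.escape is added so that the substrings are treated literally.

```python
import re


def words_with(desired, unwanted, text):
    """Find words containing desired but not unwanted (hypothetical helper)."""
    pattern = (
        rf"\b(?![\w_]*{re.escape(unwanted)}[\w_]*)"
        rf"[\w_]*{re.escape(desired)}[\w_]*\b"
    )
    return re.findall(pattern, text)


print(words_with("day", "mon", "day tuesday daywalker monday"))
# ['day', 'tuesday', 'daywalker']
```

The match is case-sensitive, so Monday with a capital M would not be excluded unless a flag such as re.IGNORECASE is added.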
In my case, I needed to exclude words with a given prefix from the regex matching result.
The text was a set of query-string params:
?=&sysNew=false&sysStart=true&sysOffset=4&Question=1
The prefix is sys, and I don't want the words that start with sys.
The key to solving the issue was the word boundary \b:
\b(?!sys)\w+\b
Then I added that part to the bigger regex for the query string:
(\b(?!sys)\w+\b)=(\w+)
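
For reference, here is how the final pattern behaves on the sample query string. This is a minimal sketch assuming Python's re module; the answer itself does not say which engine was used.

```python
import re

query = "?=&sysNew=false&sysStart=true&sysOffset=4&Question=1"

# Group 1 is a key that does not start with 'sys', group 2 is its value.
pairs = re.findall(r"(\b(?!sys)\w+\b)=(\w+)", query)
print(pairs)  # [('Question', '1')]
```

The word boundary matters: without it, the engine could start inside a key (for example at ysNew) and slip past the (?!sys) check.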