RegEx: How can I match all characters until the next match? [duplicate] - regex

This question already has answers here:
Tempered Greedy Token - What is different about placing the dot before the negative lookahead?
(3 answers)
Closed 3 years ago.
I have a string like this:
Hello [#foo] how are you [#bar] more text
Ultimately I need to modify each instance of a substring matching /\[#.+?\]/, but I also need to modify each substring before/after the [#foo] and [#bar].
The following regex matches the substring before a [#.+], the [#.+] itself, then a substring after the [#.+] until the next character is followed by another [#.+].
(.*?)(\[(#.+?)\])((.(?!(\[#.+?\])))*)
So the first match is "Hello [#foo] how are you" and the second match is " [#bar] more text".
Note the space at the beginning of the second match. That's the problem. Is there a way to get the first match to include all characters right up to the next [#.+]?
My regex includes characters after the [#.+] that are not followed by an instance of [#.+], and I cannot see any way of getting it to include all characters until we are actually in another instance of [#.+].
I'm really interested in whether I'm missing something - it certainly feels like there should be a simpler way to capture the characters around a given match, or a simpler way to capture characters not part of a match...

You have this regex:
(.*?)(\[(#.+?)\])((.(?!(\[#.+?\])))*)
                   ^
Look at that dot. It comes before the negative lookahead, so it consumes a character first, and the lookahead then checks what follows that character. The dot therefore only matches a character that is not immediately followed by \[#.+?\]. The space just before [#bar] is followed by one, so the dot won't match it, and that space is left out of the first match.
To include it, swap the order so the negative lookahead is tested before the dot consumes the character:
(.*?)(\[(#.+?)\])(((?!(\[#.+?\])).)*)
                                 ^
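To see the difference concretely, here is a minimal sketch in Python (an assumption on my part; the question does not name a language, and the variable names are only illustrative). It runs both patterns over the sample string and collects the full text of each match.

```python
import re

text = "Hello [#foo] how are you [#bar] more text"

# Dot first: a character is consumed only if no tag starts right AFTER it,
# so the space just before [#bar] is rejected.
dot_first = re.compile(r"(.*?)(\[(#.+?)\])((.(?!(\[#.+?\])))*)")

# Lookahead first: a character is consumed only if no tag starts AT it,
# so the space just before [#bar] is accepted.
lookahead_first = re.compile(r"(.*?)(\[(#.+?)\])(((?!(\[#.+?\])).)*)")

before = [m.group(0) for m in dot_first.finditer(text)]
after = [m.group(0) for m in lookahead_first.finditer(text)]

print(before)  # ['Hello [#foo] how are you', ' [#bar] more text']
print(after)   # ['Hello [#foo] how are you ', '[#bar] more text']
```

With the lookahead placed first, the trailing space stays with the first match and the second match starts exactly at [#bar].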

If I understand correctly, you want to separate your text into groups, each one having one instance of [#.+], and all of the text must be matched into a group.
Try (?:^.*?)?\[#.+?\].*?(?=\[|$).
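A quick check of this pattern, sketched with Python's re module (my choice of language, not the poster's). One thing to keep in mind: the lookahead (?=\[|$) stops at any opening bracket, not only at [#, so input that contains other brackets may be cut short.

```python
import re

text = "Hello [#foo] how are you [#bar] more text"

# The optional (?:^.*?)? prefix lets only the first chunk absorb leading text.
pattern = re.compile(r"(?:^.*?)?\[#.+?\].*?(?=\[|$)")

chunks = pattern.findall(text)
print(chunks)  # ['Hello [#foo] how are you ', '[#bar] more text']

# For this sample, every character lands in exactly one chunk.
print("".join(chunks) == text)  # True
```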

This RegEx might help you to get those vars.
(?:\[#[A-Za-z0-9]+\])
You can also add other characters to the [A-Za-z0-9] class, such as ., +, or #:
[A-Za-z0-9\.\+\#]
and change it as you wish:
(?:\[#[A-Za-z0-9\.\+\#]+\])
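As a rough illustration in Python (an assumed language; the names tag, tags and pieces are mine), the simple character-class pattern can be used both to list the tags and to split the string around them. Wrapping the pattern in a capturing group makes re.split keep the tags in the result, which gives the text before, between and after each tag without any lookaheads.

```python
import re

text = "Hello [#foo] how are you [#bar] more text"
tag = r"\[#[A-Za-z0-9]+\]"

# Just the tags.
tags = re.findall(tag, text)
print(tags)  # ['[#foo]', '[#bar]']

# Tags and the surrounding text, in document order.
pieces = re.split(f"({tag})", text)
print(pieces)  # ['Hello ', '[#foo]', ' how are you ', '[#bar]', ' more text']
```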

import re
x = 'Hello [#foo] how are you [#bar] more text'
# Raw string, so the backslashes reach the regex engine unchanged.
out = re.search(r'((.*)(\[.*\])(.*))((\[.*\])(.*))', x)
After getting the match object above, you can use its group method to access the individual groups:
out.group(1)
'Hello [#foo] how are you '
out.group(2)
'Hello '
out.group(3)
'[#foo]'
out.group(4)
' how are you '
out.group(5)
'[#bar] more text'
out.group(6)
'[#bar]'
out.group(7)
' more text'

Related

RegEx for adding a zero between a dash and number [duplicate]

This question already has answers here:
Replacing digits immediately after a saved pattern
(2 answers)
Closed 3 years ago.
I want to find a way to add a leading zero "0" in front of numbers, but BBEdit interprets my replacement as substitute #10. Example:
Original string: Video 2-1: Title Goes Here
Desired result: Video 2-01: Title Goes Here
My find regex is: (-)(\d:)
My replace regex is: \10\2. The first substitute is NOT 10. I simply intend to replace the first position, then add a "0", then replace the second position.
Kindly tell me how to tell BBEdit that I want to add a zero and that I don't mean 10th position.
If you simply need a number preceded by a dash, then I recommend using the regex lookbehind for this one.
Try this out:
(?<=-)(\d+:)
The lookbehind (?<=-) tells the regex that the match must be preceded by a dash -, while the - itself is not part of the match, so group 1 holds only the digits and the colon.
You really don't need to capture the hyphen in group 1: it is a fixed string, so there is no benefit in capturing it and putting it back with \1. Instead, capture only the digits and colon using -(\d+:) and use -0\1 as the replacement.
Also, there are other better ways to make the replacement where you don't need to deal with back references at all.
Another alternate solution is to use this look around based regex,
(?<=-)(?=\d+:)
and replace it with just 0 which will just insert a zero before the digit.
Another alternative, for when lookbehind is not supported (as in JavaScript prior to ECMAScript 2018), is a solution based on a positive lookahead. Match a hyphen - that is followed by digits and a colon using this regex,
-(?=\d+:)
and replace it with -0
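For comparison, here is how the same four approaches might look in Python's re module. This is only a sketch in a different tool: BBEdit's replacement syntax is not Python's, and the variable names are mine. Python avoids the \10 ambiguity with the \g<1> form, which ends the group number explicitly.

```python
import re

title = "Video 2-1: Title Goes Here"

# Two groups: \g<1> is group 1, then a literal 0, then group 2.
with_groups = re.sub(r"(-)(\d:)", r"\g<1>0\2", title)

# One group: the hyphen is retyped literally in the replacement.
with_one_group = re.sub(r"-(\d+:)", r"-0\1", title)

# Lookarounds only: match the empty position and insert a 0 there.
with_lookarounds = re.sub(r"(?<=-)(?=\d+:)", "0", title)

# Lookahead only: match the hyphen and replace it with "-0".
with_lookahead = re.sub(r"-(?=\d+:)", "-0", title)

for result in (with_groups, with_one_group, with_lookarounds, with_lookahead):
    print(result)  # Video 2-01: Title Goes Here
```

The last two variants need no backreference at all, so the "is it \1 followed by 0, or \10?" question never arises.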
Try \1\x30\2 as the replacement. \x30 is the hex escape for the 0 character, so the replacement is \1, then 0, then \2, and cannot be interpreted as \10 then 2. I don't know if BBEdit supports hex escapes in the replacement string though.
This expression might help you to do so, if Video 2- is a fixed input:
(Video 2-)(.+)
If you have other instances, you can add left boundary to this expression, maybe something similar to this:
([A-Za-z]+\s[0-9]+-)(.+)
Then, you can simply insert a leading zero right after capturing group $1.
If you wish, you can add additional boundaries to the expression.
Replacement
For replacing, you can simply use \U0030 or \x30 instead of zero, whichever your program might support, in between $1 and $2.

Regex for string containing one string, but not another [duplicate]

This question already has answers here:
Regular expression for a string containing one word but not another
(5 answers)
Closed 3 years ago.
Have regex in our project that matches any url that contains the string
"/pdf/":
(.+)/pdf/.+
Need to modify it so that it won't match urls that also contain "help"
Example:
Shouldn't match: "/dealer/help/us/en/pdf/simple.pdf"
Should match: "/dealer/us/en/pdf/simple.pdf"
If lookarounds are supported, this is very easy to achieve:
(?=.*/pdf/)(?!.*help)(.+)
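One detail worth checking is where the engine is allowed to start. The sketch below uses Python (an assumption; the question does not say which engine the project uses). When the pattern is tried only at the start of the URL, both lookaheads see the whole string. An unanchored search can instead start just past the "h" of "help" and still succeed, so anchoring (for example with ^ or a whole-string match) seems advisable.

```python
import re

pattern = re.compile(r"(?=.*/pdf/)(?!.*help)(.+)")

good = "/dealer/us/en/pdf/simple.pdf"
bad = "/dealer/help/us/en/pdf/simple.pdf"

# match() only tries position 0, so the lookaheads inspect the full URL.
anchored_good = pattern.match(good) is not None
anchored_bad = pattern.match(bad) is not None
print(anchored_good, anchored_bad)  # True False

# search() may begin later in the string, where "help" is no longer ahead.
unanchored_bad = pattern.search(bad)
print(unanchored_bad.group(1))  # elp/us/en/pdf/simple.pdf
```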
(?:^|\s)((?:[^h ]|h(?!elp))+\/pdf\/\S*)(?:$|\s)
First, match either whitespace or the start of the line:
(?:^|\s)
Then we match anything that is not a space or h, OR any h that is not followed by elp, one or more times (+), until we find /pdf/; then we match non-space characters (\S) any number of times (*).
((?:[^h ]|h(?!elp))+\/pdf\/\S*)
If we want to detect help after the /pdf/, we can duplicate matching from the start.
((?:[^h ]|h(?!elp))+\/pdf\/(?:[^h ]|h(?!elp))+)
Finally, we match whitespace or the end of the line/string ($):
(?:$|\s)
The full match will include leading/trailing spaces, and should be stripped. If you use capture group 1, you don't need to strip the ends.
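Both variants can be compared with a short Python sketch (Python is my assumption here, and the third URL with help after /pdf/ is a made-up test case, not one from the question). The second pattern wraps the "help after /pdf/" subpattern in the same leading and trailing alternations as the first.

```python
import re

# Rejects "help" only before /pdf/.
no_help_before = re.compile(
    r"(?:^|\s)((?:[^h ]|h(?!elp))+\/pdf\/\S*)(?:$|\s)"
)
# Rejects "help" before and after /pdf/.
no_help_anywhere = re.compile(
    r"(?:^|\s)((?:[^h ]|h(?!elp))+\/pdf\/(?:[^h ]|h(?!elp))+)(?:$|\s)"
)

good = "/dealer/us/en/pdf/simple.pdf"
bad = "/dealer/help/us/en/pdf/simple.pdf"
help_after = "/dealer/us/en/pdf/help.pdf"  # hypothetical extra case

print(no_help_before.search(good).group(1))    # /dealer/us/en/pdf/simple.pdf
print(no_help_before.search(bad))              # None
print(no_help_before.search(help_after) is not None)  # True
print(no_help_anywhere.search(help_after))     # None
```

Using group 1 rather than the full match avoids having to strip the surrounding whitespace.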

Regex to match words after dot until a whitespace occurs

Given the following string
span.a.b this.is.really.confusing
I need to return the matches a and b. I've been able to get close with the following regex:
(?<=\.)[\w]+
But it's also matching is, really, and confusing. When I add a positive lookahead I get even closer, but I'm still not there.
(?<=\.)[\w]+(?=\s) # matches b, confusing
How can I match words after a dot until a whitespace occurs?
NB: this is language agnostic pseudo-code, but should work.
regex = "^[^\s.]+\.(\S+).*"
targets = <extracted_group>.split(".")
Regex explanation:
"^": begins with
"[^\s.]+\.": 1 or more non-whitespace, non-period characters, followed by a literal period.
"(\S+)": group and capture all of the following non-whitespace characters
".*": matches 0 or more of any non-newline character
If the split function takes a regex instead of a string, you'll need to escape the '.' or use a character class.
NB: You can do it without the split, but I think that the split is more transparent.
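The pseudo-code above could be written in Python roughly as follows. This is a sketch only: Python is my choice, the period after the first word is escaped so that it matches a literal dot, and str.split takes a plain string, so no escaping is needed there.

```python
import re

line = "span.a.b this.is.really.confusing"

# First word up to the first period, then capture the rest of that token.
match = re.match(r"^[^\s.]+\.(\S+).*", line)

# Split the captured "a.b" on literal periods.
targets = match.group(1).split(".") if match else []
print(targets)  # ['a', 'b']
```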
I am not sure if this is good enough for all your possible cases, but it should work with the provided example:
\.([\w]+)\.([\w]+)\s
$1 = a, $2 = b

Regex * not working

I have the string "abc.aspx?sdfsdfds;eter;yid=10". I want my regex to match the 10 part of that string.
I wrote the regex (abc.aspx?*[?;]yid=), but it is not matching my string.
The regex abc.aspx?yid=10;sdfsdf matches my string, and I used this instead of *.
Why does the first regex with * not match, but the second one without it does?
I also want ['test'?%] in this clause.
i.e. before yid I want only ?, %, or Test. For a single character it works (e.g. $ or %), but it is not working for Test.
I tried this (abc.aspx.*['Test'?%]yid=), but it treats ' as a literal character, whereas I want to match the whole word Test.
Escape the question mark for a literal question mark, and if yid can appear immediately after the question mark you need to make the entire intervening input optional:
abc.aspx\?(.*;)?yid=(\d+)
Your target is in group 2 (group 1 is the optional (.*;) part).
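A small Python sketch (language assumed, names illustrative) that runs this pattern against both sample inputs from the question. The only change from the pattern above is that the dot in abc.aspx is escaped so it matches a literal period. Because (.*;) is the first capturing group, the digits end up in group 2.

```python
import re

pattern = re.compile(r"abc\.aspx\?(.*;)?yid=(\d+)")

samples = [
    "abc.aspx?sdfsdfds;eter;yid=10",  # yid after other parameters
    "abc.aspx?yid=10;sdfsdf",         # yid right after the question mark
]

ids = [pattern.search(s).group(2) for s in samples]
print(ids)  # ['10', '10']
```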
You have to escape ? and use .* instead of * :
abc.aspx\?.*[?;]?yid=(\d+)

Regex to match one or two quotes but not three in a row

For the life of me I can't figure this one out.
I need to search the following text, matching only the quotes in bold:
Don't match: """This is a python docstring"""
Match: " This is a regular string "
Match: "" ← That is an empty string
How can I do this with a regular expression?
Here's what I've tried:
Doesn't work:
(?!"")"(?<!"")
Close, but doesn't match double quotes.
Doesn't work:
"(?<!""")|(?!"")"(?<!"")|(?!""")"
I naively thought that I could add the alternates that I don't want but the logic ends up reversed. This one matches everything because all quotes match at least one of the alternates.
(Please note: I'm not running the code, so solutions around using __doc__ won't help, I'm just trying to find and replace in my code editor.)
You can use /(?<!")"{1,2}(?!")/
Autopsy:
(?<!") a negative look-behind for the literal ". The match cannot have this character in front
"{1,2} the literal " matched once or twice
(?!") a negative look-ahead for the literal ". The match cannot have this character after
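Your first try likely failed because the lookarounds sit on the wrong sides: (?!"") is a negative look-ahead placed before the quote, and (?<!"") is a negative look-behind placed after it, so each one ends up inspecting the quote being matched plus one neighbour. To test what surrounds a match, put the look-behind before it and the look-ahead after it.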
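The pattern can be tried on the three lines from the question with a short Python sketch (Python is an assumption here, since the asker is working in a code editor; the delimiting slashes are dropped because Python does not use them).

```python
import re

pattern = re.compile(r'(?<!")"{1,2}(?!")')

lines = [
    'Don\'t match: """This is a python docstring"""',
    'Match: " This is a regular string "',
    'Match: "" ← That is an empty string',
]

found = [pattern.findall(line) for line in lines]
print(found)  # [[], ['"', '"'], ['""']]
```

Note that the empty string is reported as one two-character match, not as two separate quotes.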
I realized that my original problem description was actually slightly wrong. That is, I need to match only a single quote character at a time, unless it's part of a group of 3 quote characters.
The difference is that this is desirable for editing so that I can find and replace with '. If I match "one or two quotes" then I can't automatically replace with a single character.
I came up with this modification to h20000000's answer that satisfies that case:
(?<!"")(?<=(?!""").)"(?!"")
In the demo, you can see that the "" are matched individually, instead of as a group.
This works very similarly to the other answer, except:
it only matches a single "
the look-arounds check for two quotes rather than one; that leaves us matching everything we want, except that it still matches the middle quote of a """
Finally, adding the (?<=(?!""").) excludes that case specifically, by saying "look back one character, then fail the match if the next three characters from there are """".
I decided not to change the question because I don't want to hijack the answer, but I think this can be a useful addition.
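Here is a sketch of the find-and-replace this pattern is meant for, written in Python as a stand-in for the editor (an assumption; editors differ in which lookbehind forms they accept). One consequence of the one-character lookbehind, as far as I can tell, is that a quote at the very start of the input is never matched, because there is no character to look back over.

```python
import re

pattern = re.compile(r'(?<!"")(?<=(?!""").)"(?!"")')

lines = [
    'Don\'t match: """This is a python docstring"""',
    'Match: " This is a regular string "',
    'Match: "" ← That is an empty string',
]

# Each matched double quote is replaced by one single quote.
replaced = [pattern.sub("'", line) for line in lines]
for line in replaced:
    print(line)
# Don't match: """This is a python docstring"""
# Match: ' This is a regular string '
# Match: '' ← That is an empty string
```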