Regex get all matches including smaller submatches

I have the following input string:
Testing <B><I>bold italic</I></B> text.
and the following regex:
<([A-Z][A-Z0-9]*)\b[^>]*>.*</\1>
This regex only gives the larger match:
<B><I>bold italic</I></B>
How can I use a regex to get the smaller match?
<I>bold italic</I>
I tried using non-greedy operators, but that didn't work either.
Is it also possible to get both as match groups, using something like Java or C# match groups or match collections?

Try the regex below, which uses a positive lookbehind:
(?<=>)<([A-Z][A-Z0-9]*)\b[^>]*>.*<\/\1>
It looks for a tag that starts immediately after a > symbol.
Explanation:
(?<=>) Positive lookbehind, which asserts that the match starts immediately after a > symbol.
< Literal < symbol.
([A-Z][A-Z0-9]*) Captures the tag name.
\b[^>]*> Matches a word boundary and then everything up to and including the next > symbol.
.* Matches any character except \n, zero or more times.
<\/\1> Matches the literal </ followed by the first captured group and >.
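As a quick check, the lookbehind pattern can be tried in Python, whose built-in re module accepts fixed-width lookbehind such as (?<=>). This is only a sketch: the variable names are made up for illustration, and the input is the string from the question.

```python
import re

# Pattern from the answer; in Python the "/" needs no escaping.
pattern = re.compile(r"(?<=>)<([A-Z][A-Z0-9]*)\b[^>]*>.*</\1>")

text = "Testing <B><I>bold italic</I></B> text."

match = pattern.search(text)
inner = match.group(0) if match else None
tag_name = match.group(1) if match else None
print(inner)     # <I>bold italic</I>
print(tag_name)  # I
```

Note that the lookbehind only helps when the inner tag directly follows the > of another tag. With text in between, as in <B>x <I>bold</I></B>, this pattern finds nothing.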

As you probably know, many people prefer using a DOM parser to parse html. But looking at your existing regex, to fix it, I would suggest this:
<([A-Z][A-Z0-9]*)\b[^<>]*>[^<]*</\1>
Explanation
Inside the tags, instead of the .* that matches too many chars, we use [^<]*, which matches any chars that are not an opening angle bracket. That way we won't go into another tag.
Likewise, I changed your [^>]* to [^<>]* so we don't start another tag.
I assume you will make this case-insensitive.
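A minimal Python sketch of this pattern, assuming the case-insensitive flag mentioned above. The second part is not from either answer: it wraps a lazy version of the original regex in a lookahead so that overlapping matches can be collected, which is one possible way to get both the larger and the smaller match. All names are illustrative.

```python
import re

text = "Testing <b><i>bold italic</i></b> text."

# Pattern from the answer: [^<]* cannot cross into another tag,
# so only an innermost element can match.
innermost = re.compile(r"<([A-Z][A-Z0-9]*)\b[^<>]*>[^<]*</\1>", re.IGNORECASE)
inner_matches = [m.group(0) for m in innermost.finditer(text)]
print(inner_matches)  # ['<i>bold italic</i>']

# Assumption (not from the answers): a capturing group inside a lookahead
# consumes nothing, so the scan can report matches that overlap each other.
overlapping = re.compile(r"(?=(<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\2>))", re.IGNORECASE)
all_matches = [m.group(1) for m in overlapping.finditer(text)]
print(all_matches)  # ['<b><i>bold italic</i></b>', '<i>bold italic</i>']
```

The lookahead variant is still a regex heuristic; for arbitrary HTML a DOM parser remains the safer choice, as the answer says.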

Related

RegEx in VSCode: capture every character/letter - not just ASCII

I am working with historical text and I want to reformat it with RegEx. Problem is: There are lots of special characters (that is: letters) in the text that are not matched by RegEx character classes like [a-z] / [A-Z] or \w .
For example I want to match the dot (and only the dot) in the following line:
<tag1>Quomodo restituendus locus Demosth. Olÿnth</tag1>
Without the ÿ I could easily work with the mentioned character classes, like:
(?<=(<tag1>(\w|\s)*))\.(?=((\w|\s)*</tag1>))
But it does not work with special characters that are not covered by ASCII. I tried lots of things but I can't make it work so the RegEx really only captures the dot in this very line. If I use more general Expressions like (.)* (instead of (\w|\s)* ) I get many more of the dots in the document (for example dots that are not between an opening and a closing tag but in between two such tagsets), which is not what I want. Any ideas for an expression that covers like all unicode letters?
You may match any text between < and > with [^<>]*:
(?<=(<tag1>[^<>]*))\.(?=([^<>]*</tag1>))
Not sure you need all those capturing groups; you might get what you need without them:
(?<=<tag1>[^<>]*)\.(?=[^<>]*</tag1>)
Details:
(?<=<tag1>[^<>]*) - a location immediately preceded by <tag1> and then any zero or more chars other than < and >
\. - a dot
(?=[^<>]*</tag1>) - a location immediately followed by any zero or more chars other than < and > and then </tag1>.
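Python's built-in re module only accepts fixed-width lookbehind, so the pattern above cannot be used there as written. The sketch below swaps in a different technique, a capturing group for the tag body, to illustrate the same [^<>]* idea: because the class is negated, it accepts ÿ and any other non-ASCII letter. The function name dot_offsets and the text around the tag in the sample are assumptions for illustration.

```python
import re

# The body of a <tag1> element: any run of chars other than "<" and ">".
TAG_BODY = re.compile(r"<tag1>([^<>]*)</tag1>")

def dot_offsets(text):
    """Return the offsets of every dot located inside a <tag1> body."""
    offsets = []
    for m in TAG_BODY.finditer(text):
        body_start = m.start(1)
        for i, ch in enumerate(m.group(1)):
            if ch == ".":
                offsets.append(body_start + i)
    return offsets

sample = "A. <tag1>Quomodo restituendus locus Demosth. Olÿnth</tag1> B. <tag2>x.</tag2>"
offsets = dot_offsets(sample)
print(len(offsets))  # 1
```

Only the dot after Demosth is reported; the dots outside <tag1> and the one inside <tag2> are ignored.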
Use a negated character class that excludes the dot and the opening angle bracket:
(?<=<tag1>[^.<]*(?:<(?!/tag1>)[^.<]*)*)\.
With this kind of pattern, it isn't even necessary to check the closing tag. But if you absolutely want to check it, end the pattern with:
(?=[^<]*(?:<(?!/tag1>)[^<]*)*</tag1>)

Regex optionally extracting characters between two characters

I have the following string thisIs/My-7777-Any-other-text, and the following is also possible: thisIs/My-7777
I am looking to extract My-7777 in both scenarios using regex. So essentially I am looking to extract everything between the first forward slash and the second hyphen (the second hyphen may not exist). I tried the following regex, which wasn't quite right:
(?<=\/)(.*)(?=-)
You could use a capture group
^[^\/]*\/([^-]*-[^-]*)
^ Start of string
[^\/]*\/ Match optional chars other than /, then a /
( Capture group
[^-]*-[^-]* Match a - between optional chars that are not -
) Close capture group
Without an anchor, and not allowing / before and after -
[^\/]*\/([^-\/]*-[^-\/]*)
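Both capture-group patterns can be checked against the two strings from the question. This is a small Python sketch; the variable names are illustrative, and the / is left unescaped because Python does not use it as a delimiter.

```python
import re

# Anchored version: everything after the first "/" up to the second "-".
anchored = re.compile(r"^[^/]*/([^-]*-[^-]*)")
# Unanchored version that also refuses "/" around the hyphen.
unanchored = re.compile(r"[^/]*/([^-/]*-[^-/]*)")

samples = ["thisIs/My-7777-Any-other-text", "thisIs/My-7777"]

anchored_results = [anchored.search(s).group(1) for s in samples]
unanchored_results = [unanchored.search(s).group(1) for s in samples]
print(anchored_results)    # ['My-7777', 'My-7777']
print(unanchored_results)  # ['My-7777', 'My-7777']
```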
If we take into account the structure of your current input strings, you can use
(?<=\/)[^-]+-[^-]+
If your strings are more complex and look like thisIs/My-7777/more-text-here, and you actually want to match from the first /, then you may use
^[^\/]+\/\K[^\/-]+-[^\/-]+ ## PHP, PCRE, Boost (Notepad++), Onigmo (Ruby)
(?<=^[^\/]+\/)[^\/-]+-[^\/-]+ ## JS (except IE & Safari), .NET, Python (PyPi regex)
Note that if the input is a single multiline string, or a newline char is otherwise expected, you should add \n to each negated character class to keep the match on one line.
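For comparison, here is how these patterns could look with Python's built-in re module. The fixed-width lookbehind works as is, but re has neither \K nor variable-width lookbehind, so the sketch substitutes a capturing group for the two engine-specific patterns. That substitution, and the variable names, are assumptions for illustration.

```python
import re

# Fixed-width lookbehind: start right after a "/".
after_slash = re.compile(r"(?<=/)[^-]+-[^-]+")

# Stand-in for the \K / variable-width lookbehind patterns:
# anchor at the start and capture the part after the first "/".
first_segment = re.compile(r"^[^/]+/([^/-]+-[^/-]+)")

simple = [after_slash.search(s).group(0)
          for s in ("thisIs/My-7777-Any-other-text", "thisIs/My-7777")]
complex_case = first_segment.search("thisIs/My-7777/more-text-here").group(1)
print(simple)        # ['My-7777', 'My-7777']
print(complex_case)  # My-7777
```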
This one works for me. Try it with the case-insensitive option ticked:
Find what: .*?/|-any.*
Replace with: (leave blank)
Output should be: My-7777
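The same find-and-replace can be sketched in Python with re.sub and the IGNORECASE flag. Keep in mind that the pattern depends on the literal text -any, so it only fits inputs shaped like the sample; the variable names are illustrative.

```python
import re

# Case-insensitive find/replace: drop everything up to the first "/",
# and everything from the literal "-any" onwards.
strip_ends = re.compile(r".*?/|-any.*", re.IGNORECASE)

cleaned = [strip_ends.sub("", s)
           for s in ("thisIs/My-7777-Any-other-text", "thisIs/My-7777")]
print(cleaned)  # ['My-7777', 'My-7777']
```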

Mixing Lookahead and Lookbehind in 1 Regexp

I'm trying to match the first occurrence of window.location.replace("http://stackoverflow.com") in an HTML string.
Specifically, I want to capture the URL of the first window.location.replace entry in the whole HTML string.
To capture the URL, I formulated these 2 rules:
it should be after this string: window.location.redirect("
it should be before this string ")
To achieve it I think I need to use lookbehind (for 1st rule) and lookahead (for 2nd rule).
I ended up with this regex:
.+(?<=window\.location\.redirect\(\"?=\"\))
It doesn't work. I'm not even sure that it is legal to mix both rules like I did.
Can you please help me translate my rules into a regex? Other ways of doing this (without lookahead or lookbehind) are also appreciated.
The pattern you wrote is really not the one you need as it matches something very different from what you expect: text window.location.redirect("=") in text window.location.redirect("=") something. And it will only work in PCRE/Python if you remove the ? from before \" (as lookbehinds should be fixed-width in PCRE). It will work with ? in .NET regex.
If it is JS, note that lookbehind is only supported by engines that implement ECMAScript 2018 or later; older engines do not support it.
Instead, use a capturing group around the unknown part you want to get:
/window\.location\.redirect\("([^"]*)"\)/
or
/window\.location\.redirect\("(.*?)"\)/
Omitting the /g modifier makes the regex match just the first occurrence. Access the value you need in Group 1.
The ([^"]*) captures 0+ characters other than a double quote (the URLs you need should not contain one). If the URLs do contain a ", use the second approach, as (.*?) will match any 0+ characters other than a newline up to the first ").
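The same capturing-group idea carries over to other languages. A minimal Python sketch follows, where re.search plays the role of the regex without /g by returning only the first match. The sample HTML and the second URL are made up for illustration.

```python
import re

html = (
    '<script>window.location.redirect("http://stackoverflow.com")</script>'
    '<script>window.location.redirect("http://example.com")</script>'
)

# Group 1 holds the URL between the double quotes.
redirect_url = re.compile(r'window\.location\.redirect\("([^"]*)"\)')

first = redirect_url.search(html)  # search() stops at the first match
url = first.group(1) if first else None
print(url)  # http://stackoverflow.com
```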

Match Latin words which are not in brackets

I'm trying to filter words which are not inside "[ ]".
Why is this not working?
[^\[][\u0000-\u024F]+[^\]]
The reason your expression is not working is that it matches all text inside brackets as well as outside.
This is the best I've been able to do:
/(?:^|])[^[]+/g
It includes the ]s in the match because look-behind is not allowed:
http://regexr.com/3c515
If look-behind were allowed, this would be the ticket:
/(?:^|(?<=]))[^[]+/g
https://regex101.com/r/lK9tS7/3
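The difference between the two patterns is easy to see in an engine that does allow lookbehind. Here is a small Python sketch, with a made-up sample string; the [ inside the character class is escaped because Python warns about an unescaped one.

```python
import re

text = "foo [bar] baz [qux] end"

# Without lookbehind: the leading "]" is part of each later match.
with_bracket = re.findall(r"(?:^|\])[^\[]+", text)
print(with_bracket)  # ['foo ', '] baz ', '] end']

# With lookbehind: the "]" is only asserted, not consumed.
without_bracket = re.findall(r"(?:^|(?<=\]))[^\[]+", text)
print(without_bracket)  # ['foo ', ' baz ', ' end']
```

The surrounding spaces are still part of each match, so a strip() on the results may be needed.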
Because this will match [\u0000-\u024F]+ plus 2 more characters, one matched by [^\[] and one by [^\]]. If you want your regex engine to match the whole pattern against the whole string, you need to use start and end anchors in your regex:
/^[^\[][\u0000-\u024F]+[^\]]$/m
But this will only work if your string contains the words on separate lines, which is not a proper way to do it.
A better way is to use negative lookarounds:
(?<!\[)[\u0000-\u024F]+(?!\])

Trying to figure out how to capture text between slashes regex

I have a regex
/([/<=][^/]*[/=?])$/g
I'm trying to capture text between the last slashes in a file path
/1/2/test/
but this regex matches "/test/" instead of just test. What am I doing wrong?
You need to use lookaround assertions.
(?<=\/)[^\/]*(?=\/[^\/]*$)
or
Use the below regex and then grab the string you want from group index 1.
\/([^\/]*)\/[^\/]*$
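Both variants can be compared side by side. A short Python sketch, using the path from the question (the / is left unescaped, since Python has no regex delimiters; variable names are illustrative):

```python
import re

path = "/1/2/test/"

# Lookaround version: the whole match is the last segment.
lookaround = re.search(r"(?<=/)[^/]*(?=/[^/]*$)", path)
# Capture-group version: the segment is in group 1.
captured = re.search(r"/([^/]*)/[^/]*$", path)

print(lookaround.group(0))  # test
print(captured.group(1))    # test
```

Both patterns return the segment before the last slash, so for a path without a trailing slash, such as /1/2/test, they give 2 rather than test.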
The easy way
Match:
every character that is not a "/"
Get what was matched here. This is done by creating a capturing group, i.e. putting it inside parentheses.
followed by "/" and then the end of string $
Code:
([^/]*)/$
Get the text in group(1)
Harder to read, only if you want to avoid groups
Match exactly the same as before, except now we're telling the regex engine not to consume characters when trying to match the trailing "/" and the end of string. This is done with a lookahead: (?= ).
Code:
[^/]*(?=/$)
Get what is returned by the match object.
The issue with your code is your opening and closing slashes are part of your capture group.
Demo
text: /1/2/test/
regex: /\/([^\/]*?)(?=\/)/g
captures a list of three: "1", "2", "test"
The language you're using affects the results. For instance, JavaScript might not have certain lookarounds, or may actually capture something in a non-capture group. However, the above should work as intended. In PHP, all / match characters must be escaped (according to regex101.com), which is why the cleaner [/] wasn't used.
If you're only after the last match (i.e., test), you don't need the positive lookahead:
/\/([^\/]*?)\/$/
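A Python sketch of both patterns from this answer, assuming the same sample path. re.findall stands in for the /g flag and returns the contents of group 1 for every match; the variable names are illustrative.

```python
import re

path = "/1/2/test/"

# Every segment that is followed by another "/".
segments = re.findall(r"/([^/]*?)(?=/)", path)
print(segments)  # ['1', '2', 'test']

# Only the segment between the last two slashes.
last = re.search(r"/([^/]*?)/$", path)
print(last.group(1))  # test
```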