Greedy Regex Matching - regex

I'm trying to match a string that looks something like this:
<$Fexample text in here>>
with this expression:
<\$F(.+?)>{2}
However, there are some cases where my backreferenced content includes a ">", thus something like this:
<$Fexample text in here <em>>>
only matches example text in here <em in the backreference. What do I need to do to reliably return the correct backreference, with or without these HTML tags?

You can add start and end anchors to the regex as:
^<\$F(.+?)>{2}$
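A minimal sketch of the anchored pattern in JavaScript, assuming each input string holds exactly one <$F...>> token and nothing else (captureAnchored is a hypothetical helper name). Because $ must follow the closing >>, the lazy group is forced to grow until only the final two > remain:

```javascript
// Anchored variant: the whole input must be a single <$F...>> token.
const anchored = /^<\$F(.+?)>{2}$/;

// Hypothetical helper: returns capture group 1, or null when there is no match.
function captureAnchored(input) {
  const m = anchored.exec(input);
  return m ? m[1] : null;
}

console.log(captureAnchored('<$Fexample text in here>>'));      // example text in here
console.log(captureAnchored('<$Fexample text in here <em>>>')); // example text in here <em>
console.log(captureAnchored('see <$Fexample>> here'));          // null
```

The last call shows the trade-off: with anchors, a token embedded in longer text no longer matches at all.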

Try
<\$F(.+?)>>(?!>)
The (?!>) ensures that, in a long run of > characters, only the last >> is matched.
Edit:
<\$F(.+?>*)>>
Also works.
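To compare the original expression with the two suggestions, here is a small JavaScript sketch. The sample string and the firstGroup helper are made up for illustration; the token is embedded in surrounding text to show that no anchors are needed:

```javascript
const original = /<\$F(.+?)>{2}/;         // pattern from the question
const withLookahead = /<\$F(.+?)>>(?!>)/; // closing >> must not be followed by another >
const withOptionalGt = /<\$F(.+?>*)>>/;   // let the group absorb extra > characters

// Hypothetical helper: returns capture group 1, or null when there is no match.
function firstGroup(re, input) {
  const m = re.exec(input);
  return m ? m[1] : null;
}

const sample = 'before <$Fexample text in here <em>>> after';

console.log(firstGroup(original, sample));       // example text in here <em
console.log(firstGroup(withLookahead, sample));  // example text in here <em>
console.log(firstGroup(withOptionalGt, sample)); // example text in here <em>
```

Both variants still return example text in here for the simple input <$Fexample text in here>>.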

Please note that to truly do what (I think) you want to do, you would have to interpret well-formed bracket expressions, which is not possible in a regular language.
In other words, <$Fexample <tag <tag <tag>>> example>> oh this should not happen> will return example <tag <tag <tag>>> example>> oh this should not happen as the capture group.

Related

Using Regex to wrap xml element value with cdata

I have to edit a stored procedure that builds xml strings so that all the element values are wrapped in cdata. Some of the values have already been wrapped in cdata so I need to ignore those.
I figured this is a good opportunity to learn some regex.
From: <element>~DATA_04</element>
to: <element><![CDATA[~DATA_04]]></element>
What are my options on how to do this? I can do simple regex, but this is way more advanced.
NOTE: The <element> is generic for illustration purposes, in reality, it could be anything and is unknown.
Sample text:
declare @sql nvarchar(max) =
' <data>
<header></header>
<docInfo>Blah</docInfo>
<someelement>~DATA_04</someelement>
<anotherelement><![CDATA[~DATA_05]]></anotherelement>
</data>
'
Using the sample xml, the regex would need to find someelement and add cdata to it like <someelement><![CDATA[~DATA_04]]></someelement> and leave the other elements alone.
Bear in mind, I did not write this horrible SQL code, I just have to edit it.
This is C#:
string text = Regex.Replace( inputString, @"<element>~(.+)</element>", "<element><![CDATA[~$1]]></element>" , RegexOptions.None );
The find is:
<element>~(.+)</element>
The replace is:
<element><![CDATA[~$1]]></element>
I'm assuming there is a ~ at the start of the inside of the element tag.
You will also want to watch out for whitespace if that is an issue...
You may want to add some
\s*
Any whitespace characters, zero or more matches
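As a rough illustration of the whitespace-tolerant version, here is the same find and replace sketched in JavaScript rather than C#. The function name wrapElement is hypothetical, and the capture is made lazy (.+?) so that two elements on one line are not merged; that change is an assumption, not part of the answer above:

```javascript
// Hypothetical helper: wraps <element> values that start with ~ and are not yet in CDATA.
// \s* on both sides tolerates (and drops) whitespace around the value.
function wrapElement(text) {
  const pattern = /<element>\s*~(.+?)\s*<\/element>/g;
  return text.replace(pattern, '<element><![CDATA[~$1]]></element>');
}

console.log(wrapElement('<element>  ~DATA_04 </element>'));
// <element><![CDATA[~DATA_04]]></element>

console.log(wrapElement('<element><![CDATA[~DATA_05]]></element>'));
// unchanged, because the value does not start with ~
```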
Try with (<[^>]+>)(\~data_([^<]+))(<[^>]+>) (with case-insensitive matching enabled, since the sample values are written ~DATA_)
and replace with \1<![CDATA[\2]]>\4
this will give you: <element><![CDATA[~DATA_04]]></element>,
where element could be anything else. Check the DEMO
Good luck
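A sketch of the generic-element approach in JavaScript, run against the sample XML from the question. The inner capture group is dropped here, so the closing tag becomes $3 instead of \4; the names xml, unwrapped, and wrapInCdata are illustrative only:

```javascript
const xml = [
  '<data>',
  '<header></header>',
  '<docInfo>Blah</docInfo>',
  '<someelement>~DATA_04</someelement>',
  '<anotherelement><![CDATA[~DATA_05]]></anotherelement>',
  '</data>',
].join('\n');

// Any tag, then a value starting with ~DATA_ (case-insensitive), then any tag.
const unwrapped = /(<[^>]+>)(~data_[^<]+)(<[^>]+>)/gi;

// Hypothetical helper: wraps only values that directly follow an opening tag.
function wrapInCdata(text) {
  return text.replace(unwrapped, '$1<![CDATA[$2]]>$3');
}

console.log(wrapInCdata(xml));
```

Values already wrapped are skipped because the character after their opening tag is < rather than ~, so running the replacement twice should give the same result as running it once.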

Negative lookahead but with something before it

I'm using a regex to parse some HTML. I have the following regex, which matches all tags except img and a.
\<(?!img|a)[^\>]+\>
This works well but I also want it to match the closing tags, I've tried the following but it doesn't work:
\</?(?!img|a)[^\>]+\>
What would be the best way to do this?
(Also before there is a plethora of comments saying not to use regexes to parse HTML I'd just like to say that this HTML is generated by a tool and is very uniform.)
EDIT:
<p>So in this</p>
<p>HTML <strong>with nested tags</strong></p>
<p>It should remove <i>everything</i> except This link
and this <img src="#" alt="image" /> but it also needs to keep the textual content</p>
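I think that the simplest solution would be the following:
<(?!\/?(?:img|a))[^>]+>
It simply matches:
a <,
asserts that what follows is not img or a, optionally preceded by a / (escaped with \, made optional by the quantifier ?),
a sequence of anything but > ([^>]+) and
a >
Note that putting the \/? in front of the lookahead does not work: on </a> the optional slash can match nothing, the lookahead then sees /a and passes, and [^>]+ consumes the slash, so the closing tag is matched after all.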
See it working here on regex101.
Ok here is a pretty wasteful solution:
<(?!img|a|\/img|\/a)[^>]+>
It would be great if someone could find a better one.
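A quick JavaScript check of the variants discussed in this thread. The sample HTML is made up, and the third pattern, which moves the optional slash inside the lookahead and adds a \b so that only the exact tag names img and a are protected, is a suggestion rather than something taken from the answers:

```javascript
const html = '<p>Keep <a href="#">this link</a> and <img src="#" alt="image" /> but drop <strong>the rest</strong></p>';

const slashFirst = /<\/?(?!img|a)[^>]+>/g;       // optional slash before the lookahead
const listed = /<(?!img|a|\/img|\/a)[^>]+>/g;    // the "wasteful" version
const slashInside = /<(?!\/?(?:img|a)\b)[^>]+>/g; // slash handled inside the lookahead

console.log(html.replace(slashFirst, ''));
// Keep <a href="#">this link and <img src="#" alt="image" /> but drop the rest
console.log(html.replace(listed, ''));
// Keep <a href="#">this link</a> and <img src="#" alt="image" /> but drop the rest
console.log(html.replace(slashInside, ''));
// Keep <a href="#">this link</a> and <img src="#" alt="image" /> but drop the rest
```

The first pattern loses </a> because \/? can backtrack to matching nothing, after which the lookahead sees /a instead of a. Without the \b, tags such as <article> or <abbr> would also be left in place, which may or may not matter for tool-generated HTML.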

Limiting a character after a wildcard in regex to its first occurrence

How can I tell a character that comes after a wildcard to use the first occurrence of it?
I did the following to find any tag with the word "title" in it:
<(.*?)(title)(.*?)>
but clearly what happens is I end up with the entire tag to the end of
</title>
So that in
<Bla bla ="nametitle">Yada yada</title>
I want
<Bla bla ="nametitle">
but end up with the whole tag.
Please if anyone is offended by the use of parsing html with regex simply move on and accept my apologies for the transgression. I am simply trying to find out how to use the wildcard which I have not used before correctly and apply as I see fit. Thank you.
You can use this regex:
<title.+?>
The above matches <title and goes till it encounters a >
Stop parsing at the first >. Using your example, you could do this with: <(.*?)(title)([^>]*?)>
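To illustrate why a negated character class keeps the match inside one tag, here is a JavaScript sketch. The strictTitle pattern uses [^<>]* on both sides of title, which goes one step beyond the patterns above and is an assumption; the second sample string is made up:

```javascript
const looseTitle = /<(.*?)(title)(.*?)>/;        // pattern from the question
const strictTitle = /<([^<>]*)(title)([^<>]*)>/; // neither side may cross a < or >

// Hypothetical helper: returns the whole match, or null.
function wholeMatch(re, input) {
  const m = re.exec(input);
  return m ? m[0] : null;
}

const tagWithTitle = '<Bla bla ="nametitle">Yada yada</title>';
const textWithTitle = '<p class="x">Some title</p>';

console.log(wholeMatch(looseTitle, tagWithTitle));   // <Bla bla ="nametitle">
console.log(wholeMatch(strictTitle, tagWithTitle));  // <Bla bla ="nametitle">
console.log(wholeMatch(looseTitle, textWithTitle));  // <p class="x">Some title</p>
console.log(wholeMatch(strictTitle, textWithTitle)); // null
```

The lazy .*? only means "as few characters as possible", so it will still run past a > when that is the only way to reach the word title. The negated class rules that out.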
<(?![\/]).*?title.*?>
This will find title inside any set of < > tags except for closing tags beginning with </
Example:
https://regex101.com/r/QFs4ny/1

matching only bold tag with regular expression

I am trying to match the <b> tag or <b style=".....">
by using a regular expression like
<(/)?(b)[^>]*>
It matches not only the b tag but every tag whose name starts with b.
Try using a word boundary (\b):
<(/)?(b\b)[^>]*>
This ensures that the next character after the <b must not be a 'word' character (a letter, number or underscore).
Of course, this could match a tag like <b-foo>, which might be a concern. In that case, I'd recommend using a lookahead like this:
<(/)?(b(?=[\s>]))[^>]*>
This ensures that the next character after the <b must either be a whitespace character, or a >.
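The difference between the three patterns can be checked with a short JavaScript sketch. The list of test tags is made up for illustration:

```javascript
const anyB = /<(\/)?(b)[^>]*>/;                 // original: any tag starting with b
const boundary = /<(\/)?(b\b)[^>]*>/;           // word boundary after b
const lookahead = /<(\/)?(b(?=[\s>]))[^>]*>/;   // b must be followed by whitespace or >

const tags = ['<b>', '</b>', '<b style="color:red">', '<br>', '<body>', '<b-foo>'];

for (const tag of tags) {
  console.log(tag, anyB.test(tag), boundary.test(tag), lookahead.test(tag));
}
// <b>                    true true  true
// </b>                   true true  true
// <b style="color:red">  true true  true
// <br>                   true false false
// <body>                 true false false
// <b-foo>                true true  false
```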
In JavaScript, to find and replace all <b> and </b> tags, you'd do something like this:
const myStr = '<b>Hello</b>';
const stripped = myStr.replace(/<\/?b>/gi, ''); // replace() returns a new string; myStr itself is unchanged
console.log(stripped); // Outputs 'Hello' with the bold tags removed
Hope that helps someone.
Why don't you use this:
/<\/?b.*?>/g
Your regex:
/<(\/)?(b)[^>]*>/g
You may want to use the global modifier as per the syntax of your language of use. The one I've used is JavaScript.

Regex match for contents of <li> element

I have the following content
<li>Title: [...]</li>
and I'm looking for regex that will match and replace this so that I can parse it as XML. I'm just looking to use a regex find and replace inside Sublime Text 2, so I want to match everything in the above example except for the [...] which is the content.
Why not extract the content and use it to build the XML rather than trying to mold the wrapper of the content into XML? (Or am I misunderstanding you?)
<li>Title: ([^<]*)<\/li>
is the regular expression to extract the content.
It's pretty self-explanatory other than the [^<]*, which means match any number of characters that are not a "<".
I don't know Sublime, but something like this should suffice to get you the contents of the li. It allows for optional extra attributes on the tag. Make sure to turn off case-sensitivity, in case of LI or Li etc. (lifted straight from http://www.regular-expressions.info/examples.html ):
<li\b[^>]*>(.*?)</li>
<li>\S*(.*)?</li>
That should match your string, with the content being capturing group 1.
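For comparison, the first two patterns in this thread can be tried outside Sublime with a small JavaScript sketch. The sample strings and the content helper are invented for the example, and the i flag stands in for turning off case-sensitivity:

```javascript
const titleOnly = /<li>Title: ([^<]*)<\/li>/; // captures only the text after "Title: "
const anyLi = /<li\b[^>]*>(.*?)<\/li>/i;      // allows attributes, ignores case

// Hypothetical helper: returns capture group 1, or null when there is no match.
function content(re, input) {
  const m = re.exec(input);
  return m ? m[1] : null;
}

console.log(content(titleOnly, '<li>Title: Regex basics</li>'));          // Regex basics
console.log(content(anyLi, '<LI class="item">Title: Regex basics</LI>')); // Title: Regex basics
console.log(content(titleOnly, '<LI class="item">Title: Regex basics</LI>')); // null
```

The first pattern is stricter but returns exactly the content; the second is more tolerant of markup variations and leaves the Title: prefix inside the capture.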