Regular Expression in Notepad++ to replace < and > inside CDATA - regex

I'm using Notepad++ to fix a huge XML export file, and one of the challenges is to replace all < and > characters with &lt; and &gt;. The thing is, I can't simply use Replace All, since the XML file is full of < and > that must not be changed.
Luckily, all the < and > that I need to change are wrapped in CDATA sections, like this:
<![CDATA[Text here... <span class="vSpecial"><p>Special Offer - more text here!</p></span>]]>
I was wondering if there'd be a Regular Expression to identify < and > wrapped in CDATA content, so I could easily use the Replace All to change only them.
UPDATE
The content of CDATA can contain line breaks.

Code
See regex in use here
<!\[CDATA\[(?:(?!\]\]>).)*?\K(?:(<)|(>))
Replacement: (?{1}&lt;)(?{2}&gt;)
Note: For display purposes the link above uses \G(?!\A). This is not supported in Notepad++, thus it's been dropped in the actual answer. I added it to the link to show what it basically does.
See the Notepad++ documentation for more information. It mentions the following:
For those readers familiar with Perl, \G is not supported.
Explanation
Click Replace All repeatedly until the message at the bottom shows Replace All: 0 occurrences were replaced. It will replace the first occurrence, then the second occurrence, then third, etc. for each CDATA that is found until there are no more matches.
Pattern
<!\[CDATA\[ Matches <![CDATA[ literally
(?:(?!\]\]>).)*? Tempered lazy token matching any character any number of times, but as few as possible ensuring what follows doesn't match ]]>
\K Resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match
(?:(<)|(>)) Match either of the following
(<) Capture < literally into capture group 1
(>) Capture > literally into capture group 2
Replacement
Notepad++ allows conditional replacements: (?{1}&lt;) inserts &lt; if capture group 1 took part in the match, and (?{2}&gt;) inserts &gt; if capture group 2 did.
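For anyone who would rather script this than click Replace All repeatedly, here is a minimal Python sketch of the same idea. It is not the Notepad++ approach above: Python's re module has no \K, so this swaps in a different technique, matching each whole CDATA section and escaping the brackets inside it with a callback. The function name escape_cdata_brackets is made up for illustration.

```python
import re

# Match one CDATA section; DOTALL lets the body span line breaks.
CDATA_SECTION = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def escape_cdata_brackets(xml_text: str) -> str:
    """Replace < and > with &lt; and &gt;, but only inside CDATA sections."""

    def escape_body(match: re.Match) -> str:
        body = match.group(1).replace("<", "&lt;").replace(">", "&gt;")
        # Put the CDATA wrapper back around the escaped body.
        return "<![CDATA[" + body + "]]>"

    return CDATA_SECTION.sub(escape_body, xml_text)


if __name__ == "__main__":
    sample = '<Title><![CDATA[Text <span class="v"><p>Offer</p></span>]]></Title>'
    print(escape_cdata_brackets(sample))
```

Because each CDATA section is handled in one pass, a single call is enough; tags outside CDATA are left untouched.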

Related

RegEx for transforming the next text using PhpStorm's search and replace dialog

I need to transform text using regex
TPI +2573<br>
NM$ +719<br>
Молоко +801<br>
Прод. жизнь +6.5<br>
Оплод-сть +3.6<br>
Л. отела 6.3/3.9<br>
Вымя +1.48<br>
Ноги +1.61<br>
to this one
<strong>TPI</strong> +2573<br>
<strong>NM$</strong> +719<br>
<strong>Молоко</strong> +801<br>
<strong>Прод. жизнь</strong> +6.5<br>
<strong>Оплод-сть</strong> +3.6<br>
<strong>Л. отела</strong> 6.3/3.9<br>
<strong>Вымя</strong> +1.48<br>
<strong>Ноги</strong> +1.61<br>
Is it possible with regex in PhpStorm's search and replace dialog?
Given your text, you can use this regex,
.* +
and replace it with <strong>$0</strong> (notice there is a space after </strong>).
We're using .* to match everything up to the last run of one or more spaces on the line, because that's the point after which we want the text to remain intact. The back-reference $0 stands for the whole match, so replacing it with <strong>$0</strong> places only the matched text within <strong> tags. Note that $0 includes the matched space(s); to keep them outside the tags, capture the label as (.*) + and use $1 in the replacement instead.
Regex Demo
Just in case, if this doesn't work for any of the samples you haven't included in your post, then please list the rules of replacement and I will give you a more robust solution, that will work flawlessly for your given rules.
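If you want to check the behaviour outside the IDE, here is a small Python sketch of the same replacement. It assumes the capture-group variant, (.*) followed by one or more spaces, so that the space stays outside the <strong> tags; the function name bold_labels is hypothetical, and Python writes the back-reference as \1 rather than $1.

```python
import re

# Greedy (.*) runs to the last space on each line; MULTILINE makes ^ match per line.
LABEL = re.compile(r"^(.*) +", re.MULTILINE)


def bold_labels(text: str) -> str:
    """Wrap everything before the last space of each line in <strong> tags."""
    return LABEL.sub(r"<strong>\1</strong> ", text)


if __name__ == "__main__":
    sample = "TPI +2573<br>\nПрод. жизнь +6.5<br>\nЛ. отела 6.3/3.9<br>"
    print(bold_labels(sample))
```

Lines that contain no space at all are left unchanged, since the pattern needs at least one space to match.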

How to get the string that start after the last > by regular expression?

I am writing C# code that reads a webpage and extracts content from it.
I spent a lot of time working out the content, and now I am stuck on this:
<i class="icon"></i><a href="https://www.nytimes.com/2017/09/12/us/irma-storm-updates.html">Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged
I want to get only "Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged".
I have been using "(?<=\">)(.*)" to get some content out successfully, but it does not fit all cases.
So how can I use a regular expression to say that I want the text that starts after the last '>'?
Thank you.
If the substring that you want to match appears after the last > then the main thing you know about it is that it does not contain a >. This is matched with [^>]. If the string must contain at least one character then you'll want to use + as the quantifier; if it's allowed to be empty then you'll want to use * to allow for zero matches. Finally, you need to match the full remainder of the text, up to the end of the line, which you do with a $.
So the full expression is [^>]*$ (or [^>]+$ if it can't be zero length).
If you also want to require that the preceding text does contain a >, you can make it a bit more complicated by adding a look-behind, (?<=>). This asserts that a > comes immediately before the match without including it in the match. The final expression would then be (?<=>)[^>]*$. The > has no special meaning in a regex, so it does not need to be escaped. If you do write it as \>, remember that C# string literals also use \ for escaping, so you have to double the backslash before passing it to the Regex constructor, new Regex("(?<=\\>)[^>]*$"), or use a verbatim string, new Regex(@"(?<=\>)[^>]*$").
The simpler version, [^>]*$, is probably sufficient for your needs.
Finally, I would add that parsing XML or HTML with regular expressions is usually the wrong thing to do because there are lots of edge cases, and you will have to make assumptions about the formatting. For example, based on your example text, I assumed you are searching up to the end of the input text. It's usually better to parse XML with an XML parser, which won't have these problems.
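To illustrate the simpler expression, here is a minimal sketch in Python rather than C# (the pattern itself is the same in both). The function name text_after_last_gt is made up for this example, and it assumes the input is a single line with no trailing newline.

```python
import re


def text_after_last_gt(line: str) -> str:
    """Return the text after the last '>' (the whole line if there is none)."""
    # [^>]*$ can only succeed where no '>' remains before the end of the input.
    match = re.search(r"[^>]*$", line)
    return match.group(0)


if __name__ == "__main__":
    html = (
        '<i class="icon"></i><a href="https://www.nytimes.com/2017/09/12/us/'
        'irma-storm-updates.html">Latest Updates: 90 Percent of Houses in '
        "Florida Keys Are Damaged"
    )
    print(text_after_last_gt(html))
```

Since the pattern allows a zero-length match, the search always succeeds; a line ending in > simply yields an empty string.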
This is the regex you need (there is a working example on RegexStorm.net):
>([^<>]+)
This says: find a closing angle bracket followed by text that doesn't include angle brackets. The [^<>] matches any character that is NOT an opening or closing angle bracket, the + requires one or more of them, and the parentheses around [^<>]+ capture that text as a separate group.
Here is a C# example that uses it. The text you want is in the first capture group, m.Groups[1] (m.Groups[0] is the whole match, including the leading >).
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string text = "<i class=\"icon\"></i><a href=\"https://www.nytimes.com/2017/09/12/us/irma-storm-updates.html\">Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged";
        // Match a '>' followed by one or more characters that are not angle brackets.
        Regex regex = new Regex(">([^<>]+)");
        // Matches never returns null, so no null check is needed.
        foreach (Match m in regex.Matches(text))
        {
            // Groups[0] is the whole match; Groups[1] is the captured text.
            Console.WriteLine(m.Groups[1].Value);
        }
    }
}
RegexStorm.net is a good .NET test site. Regex101.com is a good site for learning different regex flavors.

Regex Full match

I'm trying to understand regular expressions:
I need to only match on text_01 and text_02 and filter out the tags.
<span>text_01<b>text_02</b>
I've tried to do it like:
(?<=<span>)(([^>]+)<b>)(.+?)(?=</b>)
But it captures 3 groups, and the full match includes a tag:
text_01<b>text_02
Could you give me advice on how I need to build a regex whose Full match contains only text and no tags?
Parsing HTML with regular expressions can get very complicated. In general it is not advised practice and better to use a parser for this (some library in whatever language you are using).
But for cases where you are sure the text content does not have < nor >, and these < and > are not nested, you could use this one:
[^<>]*(?=<[^<>]*>)
This only matches text that is followed by a pair of < and >.
If it is enough to test that text is followed by <, it can be simply:
[^<>]*(?=<)
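As a quick way to try the simpler pattern, here is a Python sketch. One adjustment is an assumption on my part: it uses + instead of * so that empty matches are not reported. The name text_pieces is hypothetical.

```python
import re

# One or more non-bracket characters that are immediately followed by '<'.
TEXT_BEFORE_TAG = re.compile(r"[^<>]+(?=<)")


def text_pieces(html: str) -> list:
    """Return each run of text that sits directly before a tag."""
    return TEXT_BEFORE_TAG.findall(html)


if __name__ == "__main__":
    print(text_pieces("<span>text_01<b>text_02</b>"))
```

Each text run is its own full match, so tag names such as span or b are skipped because they are followed by > rather than <.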
By using a non-capturing group you can keep the middle <b> tag out of the capture groups, but you will never get a full match that excludes the tag. A regular expression cannot skip part of the input within a match; a match must be contiguous.
(?<=<span>)(.+?)(?:<b>)(.+?)(?=<\/b>)
Full match text_01<b>text_02
Group 1. text_01
Group 2. text_02

Regex to grab all <,> not a part of an XML tag

I have an XML file that accidentally contains a bunch of < and > characters, and I need to replace them with &lt; and &gt;. What kind of regex can select < and > while ignoring any string of the form <[any word]>? It may not be possible; if so, a regex that just ignores strings of the form <Abstract> would also be great.
Thanks
You can try this as a good start: /<(?![a-z\/])|(?<![a-z])>/g.
See it working here: https://regex101.com/r/YPNEMU/1.
It matches every occurrence of < that is not directly followed by a letter or /, and every occurrence of > that is not directly preceded by a letter.
What remains is to also match a < or > that sits right next to a letter but does not actually open or close a tag.
[EDIT] improved regex
This one goes further and also matches < occurrences that are directly followed by a letter but do not form a closed tag: /<(?![a-z\/][a-z\/ ]*?>)|(?<![a-z])>/g
See it working here: https://regex101.com/r/YPNEMU/2
[EDIT] best solution
I found it using (*SKIP)(*FAIL)!
/(<[a-z\/][^<>]*?>)(*SKIP)(*FAIL)|[<>]/g.
See it working here: https://regex101.com/r/YPNEMU/3
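The (*SKIP)(*FAIL) verbs are PCRE features and are not available in every engine. As a rough sketch of the same idea in Python, whose re module lacks them, you can match tags and stray brackets with one alternation and let a callback return tags unchanged. The function name escape_stray_brackets is hypothetical, and the IGNORECASE flag is my assumption so that tags like <Abstract> count as tags.

```python
import re

# Group 1 catches whole tags; the second alternative catches any leftover bracket.
TAG_OR_BRACKET = re.compile(r"(<[a-z/][^<>]*?>)|[<>]", re.IGNORECASE)


def escape_stray_brackets(xml_text: str) -> str:
    """Escape < and > that are not part of something shaped like a tag."""

    def replace(match: re.Match) -> str:
        if match.group(1) is not None:
            return match.group(1)  # a tag: keep it as it is
        return "&lt;" if match.group(0) == "<" else "&gt;"

    return TAG_OR_BRACKET.sub(replace, xml_text)


if __name__ == "__main__":
    print(escape_stray_brackets("<Abstract>a < b and c > d</Abstract>"))
```

Like the regex above, this only looks at the shape of the text, so anything that resembles a tag is treated as one.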

RegExp , Notepad++ Replace / remove several values

I have this dataset: (about 10k times)
<Id>HOW2SING</Id>
<PopularityRank>1</PopularityRank>
<Title><![CDATA[Superior Singing Method - Online Singing Course]]></Title>
<Description><![CDATA[High Quality Vocal Improvement Product With High Conversions. Online Singing Lessons Course Converts Like Crazy Using Content Packed Sales Video. You Make 75% On Every Sale Including Front End, Recurring, And 1-click Upsells!]]></Description>
<HasRecurringProducts>true</HasRecurringProducts>
<Gravity>45.9395</Gravity>
<PercentPerSale>74.0</PercentPerSale>
<PercentPerRebill>20.0</PercentPerRebill>
<AverageEarningsPerSale>74.9006</AverageEarningsPerSale>
<InitialEarningsPerSale>70.1943</InitialEarningsPerSale>
<TotalRebillAmt>16.1971</TotalRebillAmt>
<Referred>75.0</Referred>
<Commission>75</Commission>
<ActivateDate>2011-06-23</ActivateDate>
</Site>
I am trying to do the following:
Get the data from within the tags, and use it to create a URL, so in this example it should make
http://www.reviews.how2sing.domain.com
Also, all other data has to go; I want to run a regex that will just give me a list of URLs.
I'd prefer to do it in Notepad++, but I'm bad at regex. Any help would be welcome.
To keep the regex relatively simple you can just use:
.*?<id>(.+?)</id>
Replace with:
http://www.reviews.\1.domain.com\n
That will replace each Id tag, together with the text preceding it, with a URL. You can then remove the leftover text after the last Id manually.
Make sure ". matches newline" is selected.
The regex is straightforward; the only slightly tricky part is that it uses +? and *?, which are non-greedy. This prevents the whole file from being matched at once. The () indicate a capture group that is used in the replacement, i.e. \1.
If you want a regex that also replaces the last part, then use:
.*?(?:(<id>)?(.+?)</id>).+?(?:<id>|\Z)
This is a bit more tricky, it uses:
(?:...) A non-capturing group.
| OR
\Z end of file
Basically, the first time it will match everything up to the end of the first </id> and replace up to and including the next <id>. After that it will have replaced the starting <id> so everything before </id> goes in the group. On the last match it will match the end of file \Z.
If you only want the Id values, you can do:
'<Id>([^<]*)<\/Id>'
Then you can get the first captured group \1 which is the Id text value and then create a link from it.
Here is a demo:
http://regex101.com/r/jE9qN8
[UPDATE]
To get rid of all other lines, match this regex: '.*<Id>([^<]*)<\/Id>.*' and replace by first captured group \1. Note for the regex match, since there are multiple lines, you will need to have the DOTALL or /s flag activated to also match newlines.
Hope that helps.
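If a script is an option, the same extraction can be sketched in a few lines of Python. This is only an illustration of the capture-group idea above: the function name build_urls is made up, and lowercasing the Id is an assumption based on the example URL in the question.

```python
import re

# Capture the text between <Id> and </Id>; [^<]* stops at the closing tag.
ID_TAG = re.compile(r"<Id>([^<]*)</Id>")


def build_urls(xml_text: str) -> list:
    """Return one review URL per <Id> element found in the text."""
    return [
        "http://www.reviews." + site_id.lower() + ".domain.com"
        for site_id in ID_TAG.findall(xml_text)
    ]


if __name__ == "__main__":
    sample = "<Id>HOW2SING</Id>\n<PopularityRank>1</PopularityRank>\n</Site>"
    for url in build_urls(sample):
        print(url)
```

Everything outside the Id elements is ignored, so there is no leftover text to clean up afterwards.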