Using Regex to wrap xml element value with cdata - regex

I have to edit a stored procedure that builds xml strings so that all the element values are wrapped in cdata. Some of the values have already been wrapped in cdata so I need to ignore those.
I figured this is a good attempt to learn some regex
From: <element>~DATA_04</element>
to: <element><![CDATA[~DATA_04]]></element>
What are my options on how to do this? I can do simple regex, this is way more advanced.
NOTE: The <element> is generic for illustration purposes, in reality, it could be anything and is unknown.
Sample text:
declare #sql nvarchar(max) =
' <data>
<header></header>
<docInfo>Blah</docInfo>
<someelement>~DATA_04</someelement>
<anotherelement><![CDATA[~DATA_05]]></anotherelement>
</data>
'
Using the sample xml, the regex would need to find someelement and add cdata to it like <someelement><![CDATA[~DATA_04]]></someelement> and leave the other elements alone.
Bear in mind, I did not write this horrible sql code, i just have to edit it.

This is c#:
string text = Regex.Replace( inputString, #"<element>~(.+)</element>", "<element>![CDATA[~$1]]</element>" , RegexOptions.None );
The find is:
<element>~(.+)</element>
The replace is:
<element>![CDATA[~$1]]</element>
I'm assuming there is a ~ at the start of the inside of the element tag.
You will also want to watch out for whitespace if that is an issue...
You may want to add some
\s*
Any whitespace characters, zero or more matches

Try with (<[^>]+>)(\~data_([^<]+))(<[^>]+>)
and replace for \1<![CDATA[\2]]>\4
this will give you: <element><![CDATA[~DATA_04]]></element>,
where element could be anything else. Check the DEMO
Good luck

Related

Is it possible to put regex directly into XML content?

I have a XML file I use to manually route users to specific pages in a website.
Currently, we have separate entries for every variation of possible searches (plural, typos etc.). I would like to know if there is a way I can condense it with regex to something like so:
<OnSiteSearch>
...
<Search>
<SearchTerm>(horses?|cows?) for sale</SearchTerm>
<Destination>~/some/path.html</Destination>
</Search>
...
</OnSiteSearch>
Is something like this possible? I've looked online for regex and XML but it seems to be about validating content between the XML tags and not about using regex as the content.
Yes, a regex can be stored in XML as long as you mind XML escaping rules to keep the XML well-formed:
Element content: Escape any < as < and any & as & when writing
the regex; reverse the substitution before using the regex.
Attribute value: Follow rules for element content plus escape any " as
&quote; or any ' as &apos; to avoid conflict with chosen attribute
value delimiters.
CDATA: No escaping needed, but make sure your regex doesn't include
the string ]]>.

REGEXP_LIKE to match xml tag content that is not like a specific string

I'm trying to do a regular expression matching with REGEXP_LIKE and I'm looking for a regexp to find if the value of a specific tag is not a specific string.
For example:
<person>
<name>John</name>
<age>40</age>
</person>
My goal is to validate that the name tag's value is not John, so the REGEXP_LIKE would return true for input xmls where name is not John.
Thank you in advance for the help!
A quick and easy way to do this is to simply negate the regex search:
... WHERE NOT REGEXP_LIKE('column_name', '<name>John</name>')
However, as should be mentioned every time a question like this is posted, it's generally a bad idea to parse XML with regex. If you find yourself constructing more complex regex patterns to search this XML data, then you should:
Use an XML parser instead of regular expressions, or
Change how you are storing the data! Make person.age a separate table column; don't bung the entire XML structure into a single place.

Match particular CDATA sections in XML data

I am trying to do a PowerShell Regex, I have the following page (further below) that I want to do a match from, the two parts in bold is the information that I want to capture and assign to a variable. So I need two regex's. From the text below, the two area's I need to find exactly are King and Years & Years. Please note, these two areas change (hence the reason I need to capture them), the rest of of the code stays the same.
This is the regex I have at the moment, but it's not working for me.
\s+artist\s*>\s*<\s*!\s*[CDATA\s*[(.*)\s*]\s*]\s*>\s*<\s*/artist
And here is the page (or data) I am trying to use regex with.
<on_air>
<publishedInfo publishedDate="2015-07-18 16:24:28" />
<stationName><![CDATA[Mix 106.5]]></stationName>
<stationPrefix><![CDATA[mix1065]]></stationPrefix>
<generic_coverart><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></generic_coverart>
<now_playing>
<audio ID="id_1705168034_30458146" type="song">
<title generic="False"><![CDATA[King*]]></title>
<artist><![CDATA[Years & Years]]></artist>
<number><![CDATA[46029]]></number>
<cut><![CDATA[1]]></cut>
<ref><![CDATA[]]></ref>
<played_datetime><![CDATA[2015-07-18 16:24:27]]></played_datetime>
<length><![CDATA[00:03:28]]></length>
<coverart generic="true"><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></coverart>
<options>
<option><![CDATA[KIIS S Integrated]]></option>
</options>
</audio>
</now_playing>
If it is a valid XML, then you does not need to use regular expressions. PowerShell adapt XML objects and you can use standard property syntax to navigate on them:
$xml=[xml]#'
<on_air>
<publishedInfo publishedDate="2015-07-18 16:24:28" />
<stationName><![CDATA[Mix 106.5]]></stationName>
<stationPrefix><![CDATA[mix1065]]></stationPrefix>
<generic_coverart><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></generic_coverart>
<now_playing>
<audio ID="id_1705168034_30458146" type="song">
<title generic="False"><![CDATA[King*]]></title>
<artist><![CDATA[Years & Years]]></artist>
<number><![CDATA[46029]]></number>
<cut><![CDATA[1]]></cut>
<ref><![CDATA[]]></ref>
<played_datetime><![CDATA[2015-07-18 16:24:27]]></played_datetime>
<length><![CDATA[00:03:28]]></length>
<coverart generic="true"><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></coverart>
<options>
<option><![CDATA[KIIS S Integrated]]></option>
</options>
</audio>
</now_playing>
</on_air>
'#
$xml.on_air.now_playing.audio.title.'#cdata-section'
$xml.on_air.now_playing.audio.artist.'#cdata-section'
You want to escape bracket literals.
Also, it's a good practice to avoid using the dot "match almost any character" metacharacter when your intentions are more specific. In your case, what you really want to do is match until you hit the closing bracket, so it's safer to specify that:
'\s+artist\s*>\s*<\s*!\s*\[CDATA\s*\[([^]]*)\s*\]\s*\]\s*>\s*<\s*\/artist'
Note: Regex is contextual, so the reason I don't have to escape the closing bracket within the character class is because of its position, i.e., being the first character specified in the negated class--in that context, it cannot be the closing bracket for the character class. In other words, it's not ambiguous.
To help get off the ground, here is a suggestion for y&y (insert whitespace-selector whereever possible):
artist><!\[CDATA\[Years & Years\]\]></artist

Limiting a character after a wildcard in regex to it's first occurrence,

How can I tell a character that comes after a wildcard to use the first occurrence of it?
I did the following to find any tag with the word "title" in it:
<(.*?)(title)(.*?)>
but clearly what happens is I end up with the entire tag to the end of
</title>
So that in
<Bla bla ="nametitle">Yada yada</title>
I want
<Bla bla ="nametitle">
but end up with the whole tag.
Please if anyone is offended by the use of parsing html with regex simply move on and accept my apologies for the transgression. I am simply trying to find out how to use the wildcard which I have not used before correctly and apply as I see fit. Thank you.
You can use this regex:
<title.+?>
The above matches <title and goes till it encounters a >
Stop parsing at the first >. Using your example, you could do this with: <(.*?)(title)([^>]*?)>
<(?![\/]).*?title.*?>
This will find title inside any set of < > tags except for closing tags beginning with </
Example:
https://regex101.com/r/QFs4ny/1

Regex match for contents of <li> element

I have the following content
<li>Title: [...]</li>
and I'm looking for regex that will match and replace this so that I can parse it as XML. I'm just looking to use a regex find and replace inside Sublime Text 2, so I want to match everything in the above example except for the [...] which is the content.
Why not extract the content and use it to build the xml rather than trying to mold the wrapper of the content into xml? (or am i mis understanding you?)
<li>Title: ([^<]*)<\/li>
is the regular expression to extract the content.
Its pretty self explanatory other than the [^<]* which means match any number of characters that is not a "<"
I don't know Sublime, but something like this should suffice to get you the contents of the li. It allows for there being optional extra attributes on the tag. Make sure and turn off case-sensitivity, incase of LI or Li etc. (lifted straight from http://www.regular-expressions.info/examples.html ):
<li\b[^>]*>(.*?)</li>
<li>\S*(.*)?</li>
That should match your string, with the content being capturing group 1.