Grab only the first or the last match - regex

I need some help with a regex that does not quite work:
/(?<=([H|h][i|I])+\w+\>)(.*)(?=(\<))/
I have several XML documents and need to extract the errorMessage and the errorCode from them. Not all of them use the same syntax: the tag name is sometimes errorMessage, sometimes ERRORTEXT, and sometimes Error_Messages.
An example:
<?xml version="1.0" encoding="UTF-8"?>
<n0:szemelyKutyaFuleResponse xmlns:prx="urn:sap.comproxy:SWP:/1SAI/TREASE1243804269AE457508F4:753" xmlns:n0="http://csajgeneratorws.tny.interfesz.kok.lo/">
<return>
<tanzakciosAzonosito>46981682-4637-49d2-bd4d-dcfff543742ed</tanzakciosAzonosito>
<eredmeny>HIBAS</eredmeny>
<errorCode>TSH08</errorCode>
<errorMessage>Azonosítószám már hozzá lett rendelve üzleti partnerhez</errorMessage>
</return>
</n0:szemelyKutyaFuleResponse>
I think I need two regexes:
one to find the text TSH08 in errorCode,
and another to find Azonosítószám már hozzá lett rendelve üzleti partnerhez in errorMessage.
Please help, thanks.

If you just want the content of each tag, which is what I understood from your question, then something like these should work.
For the first regex:
<errorCode>([^<>]+)</errorCode>
(?<=<errorCode>)[^<>]+(?=</errorCode>)
For the second regex:
<errorMessage>([^<>]+)</errorMessage>
(?<=<errorMessage>)[^<>]+(?=</errorMessage>)
In each pair, the first pattern returns the content in capture group 1, and the second uses lookarounds so that the whole match is the content.
You can also merge them with a | between the two if you don't care which tag matched.
A | can also be used if the tag's name might differ, like this:
<(?:errorMessage|ERRORTEXT|Error_Messages)>([^<>]+)</(?:errorMessage|ERRORTEXT|Error_Messages)>
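The capture-group patterns above can be tried out with a short JavaScript sketch. The sample string is a trimmed copy of the response from the question, and `firstGroup` is a hypothetical helper introduced only for this illustration.

```javascript
// Trimmed copy of the sample response from the question.
const xml = `<return>
<errorCode>TSH08</errorCode>
<errorMessage>Azonosítószám már hozzá lett rendelve üzleti partnerhez</errorMessage>
</return>`;

// Capture-group variant for the error code.
const codeRe = /<errorCode>([^<>]+)<\/errorCode>/;
// Alternation variant covering the tag names listed in the question.
const messageRe =
  /<(?:errorMessage|ERRORTEXT|Error_Messages)>([^<>]+)<\/(?:errorMessage|ERRORTEXT|Error_Messages)>/;

// Hypothetical helper: returns capture group 1 of the first match, or null.
function firstGroup(re, text) {
  const m = re.exec(text);
  return m ? m[1] : null;
}

const errorCode = firstGroup(codeRe, xml);
const errorMessage = firstGroup(messageRe, xml);
console.log(errorCode);
console.log(errorMessage);
```

Because neither pattern uses the g flag, each call returns only the first match in the input.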

Related

regex to match link inside xml with last mod

<?xml version='1.0' encoding='UTF-8'?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://google.com/2020/08/this1.html</loc><lastmod>2020-08-06T11:30:55Z</lastmod></url>
<url><loc>https://google.com/2020/08/this2.html</loc><lastmod>2020-08-05T11:30:06Z</lastmod></url>
<url><loc>https://google.com/2020/08/this3.html</loc><lastmod>2020-08-06T11:29:25Z</lastmod></url>
</urlset>
I'm trying to get the links from the XML above whose lastmod is 2020-08-06.
My regex is https:.+2020-08-05.+<\/url
but it ended up matching everything from the first link to the last.
I want to match only:
<url><loc>https://google.com/2020/08/this1.html</loc><lastmod>2020-08-06T11:30:55Z</lastmod></url>
<url><loc>https://google.com/2020/08/this3.html</loc><lastmod>2020-08-06T11:29:25Z</lastmod></url>
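/<loc>(.+)<\/loc>.*2020-08-06/g
This captures the URL between the loc tags in group 1. Since . does not match a line break by default, it relies on each <url> entry being on its own line.
Demo and explanation here:
https://regex101.com/r/HBvG3K/8
A very simple, blunt regex that matches every whole line containing that date (see regexr):
.*<lastmod>2020-08-06.*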
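As a hedged illustration, here is a slightly tightened variant of the first pattern in JavaScript. It is not either answer's exact regex: it swaps (.+) for [^<]+ so the capture cannot run past a single loc element, which also makes it work when the whole sitemap arrives on one line. The entries are the ones from the question.

```javascript
// The three <url> entries from the question, joined on one line to show the hard case.
const sitemap = [
  "<url><loc>https://google.com/2020/08/this1.html</loc><lastmod>2020-08-06T11:30:55Z</lastmod></url>",
  "<url><loc>https://google.com/2020/08/this2.html</loc><lastmod>2020-08-05T11:30:06Z</lastmod></url>",
  "<url><loc>https://google.com/2020/08/this3.html</loc><lastmod>2020-08-06T11:29:25Z</lastmod></url>",
].join("");

// [^<]+ stops at the next tag, so each match stays inside one <url> entry.
const re = /<loc>([^<]+)<\/loc><lastmod>2020-08-06/g;

// matchAll yields every match; group 1 is the link.
const links = [...sitemap.matchAll(re)].map((m) => m[1]);
console.log(links);
```

The pattern assumes that lastmod directly follows loc, as it does in the sample.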

Not able to select the right data

I have been handed a legacy XML format that is not going to change.
Formatted, it looks like this:
<Result>
<StepSequence>
<RealMeasure>
<Text value="Batman"/>
</RealMeasure>
</StepSequence>
<StepSequence>
<RealMeasure>
<Text value="Superman"/>
</RealMeasure>
</StepSequence>
</Result>
It actually arrives on a single line, like this:
<Result><StepSequence><RealMeasure><Text value="Batman"/></RealMeasure></StepSequence><StepSequence><RealMeasure><Text value="Superman"/></RealMeasure></StepSequence></Result>
The regex I have come up with is:
<RealMeasure><((\w*)\s+value="(.*)".*?)></RealMeasure>
But it selects:
<RealMeasure><Text value="Batman"/></RealMeasure></StepSequence><StepSequence><RealMeasure><Text value="Superman"/></RealMeasure>
I want to select:
<RealMeasure><Text value="Batman"/></RealMeasure>
and
<RealMeasure><Text value="Superman"/></RealMeasure>
I want to get groups so that I can later convert the match to something like:
<RealMeasure type="Text" value="Superman"/>
using pattern like:
<RealMeasure type="$2" value=$3>
Any tips to improve my regex?
Try this:
let reg = /<RealMeasure><((\w+)\s+value="(.*?)".*?)><\/RealMeasure>/g;
let str = `<Result><StepSequence><RealMeasure><Text value="Batman"/></RealMeasure></StepSequence><StepSequence><RealMeasure><Text value="Superman"/></RealMeasure></StepSequence></Result>`;
// replace() returns a new string; str itself is left unchanged
let result = str.replace(reg, `<RealMeasure type="$2" value="$3"/>`);
console.log(result);
// <Result><StepSequence><RealMeasure type="Text" value="Batman"/></StepSequence><StepSequence><RealMeasure type="Text" value="Superman"/></StepSequence></Result>
The group in value="(.*?)" has to be non-greedy as well; with the greedy (.*) the match runs on to the last quoted value in the input. I also changed (\w*) to (\w+) to ensure that the type is not empty.
Also, because the pattern is written as a JavaScript regex literal, the / in </RealMeasure> has to be escaped as <\/RealMeasure>.
I used the following regex:
<RealMeasure><(\w+).*?("[^"]*").*?<\/RealMeasure>
and it seems to do exactly what you want: group 1 holds the element name and group 2 the quoted value.
Please note that the software you use might restrict which regex features are available.
Alternatively, use a proper XML parser to extract and reformat the data.
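For the second regex in this answer, <RealMeasure><(\w+).*?("[^"]*").*?<\/RealMeasure>, a minimal JavaScript sketch of the replacement could look like this. The replacement string is an assumption modelled on the format the asker wanted. Since group 2 already includes the surrounding quotes, no extra quotes are written around $2.

```javascript
// Single-line input from the question.
const input =
  '<Result><StepSequence><RealMeasure><Text value="Batman"/></RealMeasure></StepSequence>' +
  '<StepSequence><RealMeasure><Text value="Superman"/></RealMeasure></StepSequence></Result>';

// Group 1: element name (Text). Group 2: the attribute value including its quotes.
const pattern = /<RealMeasure><(\w+).*?("[^"]*").*?<\/RealMeasure>/g;

// $2 carries its own quotes, so the replacement adds none around it.
const output = input.replace(pattern, '<RealMeasure type="$1" value=$2/>');
console.log(output);
```

The lazy .*? quantifiers keep each match inside one RealMeasure element, so the g flag rewrites both occurrences separately.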

REGEXP_LIKE to match xml tag content that is not like a specific string

I'm trying to do a regular expression matching with REGEXP_LIKE and I'm looking for a regexp to find if the value of a specific tag is not a specific string.
For example:
<person>
<name>John</name>
<age>40</age>
</person>
My goal is to validate that the name tag's value is not John, so the REGEXP_LIKE would return true for input xmls where name is not John.
Thank you in advance for the help!
A quick and easy way to do this is to simply negate the regex search:
... WHERE NOT REGEXP_LIKE(column_name, '<name>John</name>')
However, as should be mentioned every time a question like this is posted, it's generally a bad idea to parse XML with regex. If you find yourself constructing more complex regex patterns to search this XML data, then you should:
Use an XML parser instead of regular expressions, or
Change how you are storing the data! Make person.age a separate table column; don't bung the entire XML structure into a single place.
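The negation idea from the start of this answer can be sketched outside SQL as well. The JavaScript below is only an illustration of the logic, with `nameIsNotJohn` as a hypothetical function name. It also shows one consequence of negating the whole match: input without any name tag counts as "not John" too.

```javascript
// Returns true when the input does NOT contain <name>John</name>.
function nameIsNotJohn(xml) {
  return !/<name>John<\/name>/.test(xml);
}

const john = "<person><name>John</name><age>40</age></person>";
const mary = "<person><name>Mary</name><age>40</age></person>";
const noName = "<person><age>40</age></person>";

console.log(nameIsNotJohn(john));
console.log(nameIsNotJohn(mary));
console.log(nameIsNotJohn(noName));
```

If rows without a name tag should be excluded, a second check for the presence of the tag would be needed.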

Garbage character is getting extracted by the following regex

I want to extract the category ID from the response message. The regex I used is categoryId=(.*?)>
I am running it against the following response message. Can you please tell me what is going wrong here?
<img border="0" src="../images/sm_fish.gif" />
Try this regex:
categoryId=(.*?)"
This uses the non-greedy quantifier so that it only captures the content between the categoryId label and the closing quotation mark.
Try this: categoryId=([^"]+)"
[^"] matches any character that is not in the list, so in this case everything except ".
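To see the difference between the three patterns, here is a small JavaScript sketch. The response line is hypothetical: the sample in the question does not show the categoryId part, so the href below is an assumption made only for this illustration, as is the `capture` helper.

```javascript
// Hypothetical response line; the href is an assumption, not taken from the question.
const response =
  '<a href="viewCategory.shtml?categoryId=FISH"><img border="0" src="../images/sm_fish.gif" /></a>';

const original = /categoryId=(.*?)>/;  // asker's pattern: stops at >, so the quote is captured too
const lazy = /categoryId=(.*?)"/;      // first suggestion: stops at the closing quote
const negated = /categoryId=([^"]+)"/; // second suggestion: captures anything except "

// Hypothetical helper: capture group 1 of the first match, or null.
const capture = (re) => {
  const m = re.exec(response);
  return m ? m[1] : null;
};

console.log(capture(original)); // FISH"
console.log(capture(lazy));     // FISH
console.log(capture(negated));  // FISH
```

Under this assumption, the "garbage character" is the closing quote of the attribute, which sits between the value and the > that the original pattern stops at.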

Using Regex to wrap xml element value with cdata

I have to edit a stored procedure that builds XML strings so that all the element values are wrapped in CDATA. Some of the values have already been wrapped in CDATA, so I need to ignore those.
I figured this is a good opportunity to learn some regex.
From: <element>~DATA_04</element>
to: <element><![CDATA[~DATA_04]]></element>
What are my options for doing this? I can write simple regexes, but this is far more advanced.
NOTE: The <element> is generic for illustration purposes, in reality, it could be anything and is unknown.
Sample text:
declare @sql nvarchar(max) =
' <data>
<header></header>
<docInfo>Blah</docInfo>
<someelement>~DATA_04</someelement>
<anotherelement><![CDATA[~DATA_05]]></anotherelement>
</data>
'
Using the sample XML, the regex would need to find someelement and add CDATA to it, like <someelement><![CDATA[~DATA_04]]></someelement>, and leave the other elements alone.
Bear in mind, I did not write this horrible SQL code; I just have to edit it.
This is C#:
string text = Regex.Replace(inputString, @"<element>~(.+)</element>", "<element><![CDATA[~$1]]></element>", RegexOptions.None);
The find is:
<element>~(.+)</element>
The replace is:
<element><![CDATA[~$1]]></element>
I'm assuming there is a ~ at the start of the element's content.
You will also want to watch out for whitespace, if that is an issue. You may want to add \s* (any whitespace characters, zero or more) wherever whitespace can occur.
Try with (<[^>]+>)(\~data_([^<]+))(<[^>]+>), matched case-insensitively,
and replace with \1<![CDATA[\2]]>\4
This will give you: <element><![CDATA[~DATA_04]]></element>,
where element could be anything else.
Good luck
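The generic pattern from this last answer can be checked with a short JavaScript sketch. The XML is the sample from the question. JavaScript writes backreferences in the replacement as $1, $2 and $4 rather than \1, \2 and \4, and the i flag is added here because the pattern spells data in lower case while the sample uses DATA.

```javascript
// Sample XML from the question.
const xml = `<data>
<header></header>
<docInfo>Blah</docInfo>
<someelement>~DATA_04</someelement>
<anotherelement><![CDATA[~DATA_05]]></anotherelement>
</data>`;

// Group 1: opening tag. Group 2: the ~DATA_nn value. Group 4: closing tag.
const re = /(<[^>]+>)(~data_([^<]+))(<[^>]+>)/gi;

// Wrap the value in CDATA, keeping the surrounding tags as they were.
const wrapped = xml.replace(re, "$1<![CDATA[$2]]>$4");
console.log(wrapped);
```

The already wrapped value is left alone because the opening tag there is followed by <![CDATA[ rather than by ~. Values that do not start with ~DATA_, such as Blah, are not wrapped by this pattern either.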