Regex for matching a complete element in an XML file

I would like to know a regex that matches sequences like this:
<person name="the name I want" ....[other things]>
.... [other tags]
</person>
I tried with something like this:
<person +name="the name I want" +.*
But I'm not getting any further: I can only match the first line, not the complete element.
Can you help me?

Try this:
<person[^>]*name="the name I want"[^>]*>(.|[\r\n])*?<\/person>
If your language supports the "dotall" flag, you can use that and change (.|[\r\n])*? to just .*?.
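As a minimal sketch of the dotall variant, assuming Python and its re.DOTALL flag (the sample document and the variable names are made up for illustration):

```python
import re

# Hypothetical sample input: two <person> elements, only one has the wanted name.
xml = """<people>
<person name="someone else" age="30">
  <note>skip me</note>
</person>
<person id="7" name="the name I want" age="41">
  <note>keep me</note>
</person>
</people>"""

# re.DOTALL lets "." match newlines, so (.|[\r\n])*? can be written as .*?
pattern = re.compile(
    r'<person[^>]*name="the name I want"[^>]*>.*?</person>',
    re.DOTALL,
)

match = pattern.search(xml)
element = match.group(0) if match else None
print(element)
```

Because [^>]* cannot cross a ">", the name attribute has to appear inside the opening tag itself. The lazy .*? stops at the first </person>, so this only works when person elements are not nested.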

I found this in another Stack Overflow thread:
<person(.|\r\n)*?<\/person>
I hope it is useful
Edit:
I forgot to add the name attribute
(<person name="the name I want"(.|\r\n)*?<\/person>)

Related

how to use '*' in XPATH starts-with()?

We receive banking statements from the SAP system. Sometimes a file name does not follow the naming convention, and the file is rejected.
We want to validate the file name, which we get in the name attribute, as in the example below.
Can the country ISO code be skipped in the validation?
We want an XPath that matches GLO_***_UPLOAD_STATEMENT, so that the ISO code is not validated.
Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<Details name="GLO_ZFA_UPLOAD_STATEMENT" type="banking" version="3.0">
<description/>
<object>
<encrypted data="b528f05b96102f5d99743ff6122bb0984aa16a02893984a9e427a44fcedae1612104a7df1173d9c61a99ebe0c34ea67a46aecc86f41f5924f74dd525"/>
</object>
</Details>
Xpath tried:
Details[@type="banking"]/@name[not(starts-with(., "GLO_***_UPLOAD_STATEMENT"))]
which is not working :(
Can anyone help here, please :)
Thanks in advance!
Try using the matches() function for a regex like this:
Details[@type="banking"]/@name[not(matches(., "^GLO_(.){3}_UPLOAD_STATEMENT"))]
starts-with() is char based, it doesn't recognize patterns.
If your XPath version doesn't support regex then you can use something like:
Details[@type="banking"]/@name[not(starts-with(., "GLO_") and ends-with(., "_UPLOAD_STATEMENT"))]
You can match regular expressions using the matches() function. For example:
//Details[@type="banking" and not(matches(@name, "GLO_[A-Z]*_UPLOAD_STATEMENT"))]/@name
This selects the name attribute only of Details elements that have type="banking" and a name that does not match the regular expression "GLO_[A-Z]*_UPLOAD_STATEMENT". You can refine the regex as needed.
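matches() is an XPath 2.0 function, so it is not available in every environment. As a rough sketch of the same check outside XPath, assuming Python's standard library and assuming the ISO code is exactly three uppercase letters (the function name, the wrapper element, and the sample file names are hypothetical):

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the ISO code is exactly three uppercase letters.
NAME_RE = re.compile(r"GLO_[A-Z]{3}_UPLOAD_STATEMENT")

def invalid_banking_names(xml_text):
    """Return the name attributes of banking Details that break the convention."""
    root = ET.fromstring(xml_text)
    # root.iter() also visits the root element itself.
    return [
        el.get("name")
        for el in root.iter("Details")
        if el.get("type") == "banking"
        and not NAME_RE.fullmatch(el.get("name", ""))
    ]

# Hypothetical input with one valid, one invalid, and one non-banking entry.
sample = """<statements>
  <Details name="GLO_ZFA_UPLOAD_STATEMENT" type="banking" version="3.0"/>
  <Details name="GLO_ZF_UPLOAD_STMT" type="banking" version="3.0"/>
  <Details name="OTHER_FILE" type="report" version="3.0"/>
</statements>"""

bad_names = invalid_banking_names(sample)
print(bad_names)
```

fullmatch() anchors the pattern at both ends, which is stricter than the unanchored matches() call above.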

Grab only the first or the last match

I need some help with a regex that does not work perfectly:
/(?<=([H|h][i|I])+\w+\>)(.*)(?=(\<))/
I have a few XML files, and I need to extract the errorMessage and the errorCode from them. Not all of them use the same syntax: the tag name is sometimes errorMessage, sometimes ERRORTEXT, and sometimes Error_Messages.
An example:
<?xml version="1.0" encoding="UTF-8"?>
<n0:szemelyKutyaFuleResponse xmlns:prx="urn:sap.comproxy:SWP:/1SAI/TREASE1243804269AE457508F4:753" xmlns:n0="http://csajgeneratorws.tny.interfesz.kok.lo/">
<return>
<tanzakciosAzonosito>46981682-4637-49d2-bd4d-dcfff543742ed</tanzakciosAzonosito>
<eredmeny>HIBAS</eredmeny>
<errorCode>TSH08</errorCode>
<errorMessage>Azonosítószám már hozzá lett rendelve üzleti partnerhez</errorMessage>
</return>
</n0:szemelyKutyaFuleResponse>
I think I need to create two regex:
One to find the text TSH08 in errorCode
and another regex to find Azonosítószám már hozzá lett rendelve üzleti partnerhez in errorMessage!
Pls help THX
If you just want the content of each tag, which is what I understood from your question, then perhaps something like these:
For the first regex:
<errorCode>([^<>]+)</errorCode>
(?<=<errorCode>)[^<>]+(?=</errorCode>)
For the second regex:
<errorMessage>([^<>]+)</errorMessage>
(?<=<errorMessage>)[^<>]+(?=</errorMessage>)
You also can merge them with an | between the two if you don't care about the tag.
A | can also be added if the tag's name might differ like this:
<(?:errorMessage|ERRORTEXT|Error_Messages)>([^<>]+)</(?:errorMessage|ERRORTEXT|Error_Messages)>
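For illustration, here is one way the merged pattern might be applied, assuming Python. This variant captures the tag name and reuses it with a \1 backreference so that the closing tag must match the opening one; the sample response is shortened from the question and the variable names are made up:

```python
import re

# Shortened sample based on the XML in the question.
response = """<return>
<errorCode>TSH08</errorCode>
<errorMessage>Azonosítószám már hozzá lett rendelve üzleti partnerhez</errorMessage>
</return>"""

# Group 1 is the tag name, group 2 the text; \1 forces the same closing tag.
TAG_RE = re.compile(
    r"<(errorCode|errorMessage|ERRORTEXT|Error_Messages)>([^<>]+)</\1>"
)

# findall() returns (tag, text) tuples because the pattern has two groups.
fields = {tag: text for tag, text in TAG_RE.findall(response)}
print(fields)
```

Keying the result by tag name keeps the code and the message apart even though a single pattern finds both.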

Using Perl with Regex, how can I remove a string within a string?

So I have several XML files that have persons with unique IDs and they each have a favorite food (a person can be in several xml files):
There are cases where the person with id=300 might have the food right in the beginning of the tag.
<person id="299">
<food>
<type> Hot Dog </type>
</food>
</person>
<person id="300">
<food>
<type> Burger</type>
</food>
</person>
Or there might be other tags before the food tag
<person id="300">
<year>
<birth> 1990 </birth>
<marriage> 2020 </marriage>
</year>
<food>
<type> Vegan </type>
</food>
</person>
I need to use a single Perl regex to remove the food tags ONLY for persons whose ID is 300, regardless of whether they are at the beginning, middle, or end of the person tag.
I know if it was for the whole person tag I could use something like :
$fileContents =~ s/<person id=\"300\"[^<]+<\/person>//g;
But I must leave the person tag intact, I must only remove the food tag inside the person tag, but I can't remove all the food tags because I need to leave it for people with other ID's.
Could you help me please? I've been struggling a lot with this D:
You can't safely do that with a substitution.
And even a half-assed approach is more complicated than using an existing XML parser.
$_->unbindNode()
for $doc->findnodes('//person[@id="300"]/food');
Full solution:
use XML::LibXML qw( );
# my $doc = XML::LibXML->new->parse_file(...);
# or
# my $doc = XML::LibXML->new->parse_string(...);
$_->unbindNode()
for $doc->findnodes('//person[@id="300"]/food');
# $doc->toFile(...)
# or
# $doc->toString(...)
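The same parser-based idea can be sketched in other languages too. The following is an illustrative equivalent using Python's standard xml.etree.ElementTree, not a translation of the XML::LibXML API; the <list> wrapper and the sample data are assumptions added so that the input is well-formed:

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the persons from the question wrapped in a root element.
xml_text = """<list>
<person id="299"><food><type> Hot Dog </type></food></person>
<person id="300"><year><birth> 1990 </birth></year><food><type> Vegan </type></food></person>
</list>"""

root = ET.fromstring(xml_text)

# findall() returns a list, so removing children inside the loop is safe.
for person in root.findall(".//person[@id='300']"):
    for food in person.findall("food"):
        person.remove(food)

result = ET.tostring(root, encoding="unicode")
print(result)
```

Only direct food children of the matching person elements are removed, so the food of person 299 stays in place.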
perl -i.bk -pe'BEGIN{undef$/}s|<person (.*?)>.*?</person>|$p=$&;$1=~/id="300"/?$p=~s,<food>.*?</food>,,sr:$p|esg' files*.xml
...removes <food>.....</food> from persons with id="300" in one or more files*.xml. The original files are kept and renamed with .bk added to their file names, so only run this once if you need to keep the original files, or change -i.bk into, for example, -i.bk$(date +%Y%m%d%H%M%S).
Note: I think ikegami's answer is much better.
But sometimes one writes Perl for systems that don't allow extra modules, and XML::LibXML sadly isn't a core module. And sometimes half-assed XML might be best and fastest handled with half-assed methods, for example "XML" written by something beyond your control. Maybe it's missing a root node for the list of persons, as in the first example here (the list of <person>s could be surrounded with <list>...</list> to make it readable to XML::LibXML). Or the ' or " around attribute values might be missing, which also wouldn't be readable to XML::LibXML right away.

REGEXP_LIKE to match xml tag content that is not like a specific string

I'm trying to do a regular expression matching with REGEXP_LIKE and I'm looking for a regexp to find if the value of a specific tag is not a specific string.
For example:
<person>
<name>John</name>
<age>40</age>
</person>
My goal is to validate that the name tag's value is not John, so the REGEXP_LIKE would return true for input xmls where name is not John.
Thank you in advance for the help!
A quick and easy way to do this is to simply negate the regex search:
... WHERE NOT REGEXP_LIKE(column_name, '<name>John</name>')
However, as should be mentioned every time a question like this is posted, it's generally a bad idea to parse XML with regex. If you find yourself constructing more complex regex patterns to search this XML data, then you should:
Use an XML parser instead of regular expressions, or
Change how you are storing the data! Make person.age a separate table column; don't bung the entire XML structure into a single place.
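To illustrate the first recommendation, here is a small sketch of the same check done with a parser on the application side, assuming Python's standard library (the function name is hypothetical). Like the negated REGEXP_LIKE, it returns true when there is no name element at all:

```python
import xml.etree.ElementTree as ET

def name_is_not(xml_text, forbidden):
    """True if the <name> child is missing or its text differs from `forbidden`."""
    root = ET.fromstring(xml_text)
    # findtext() returns None when <person> has no direct <name> child.
    name = root.findtext("name")
    return name is None or name.strip() != forbidden

john = "<person><name>John</name><age>40</age></person>"
mary = "<person><name>Mary</name><age>40</age></person>"

print(name_is_not(john, "John"))
print(name_is_not(mary, "John"))
```

Unlike the regex, this does not depend on the exact whitespace or formatting of the stored XML.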

Greedy Regex Matching

I'm trying to match a string that looks something like this:
<$Fexample text in here>>
with this expression:
<\$F(.+?)>{2}
However, there are some cases where my backreferenced content includes a ">", thus something like this:
<$Fexample text in here <em>>>
only matches example text in here <em in the backreference. What do I need to do to return the correct backreference whether or not the content includes these HTML tags?
You can add start and end anchors to the regex as:
^<\$F(.+?)>{2}$
Try
<\$F(.+?)>>(?!>)
The (?!>) ensures that only the last >> in a long sequence of >>>..>>> is matched.
Edit:
<\$F(.+?>*)>>
Also works.
Please note that to truly do what (I think) you want to do, you would have to interpret well-formed bracket expressions, which is not possible in a regular language.
In other words, for <$Fexample <tag <tag <tag>>> example>> oh this should not happen> the patterns above stop at the first run of > characters and capture only example <tag <tag <tag>, not the whole nested expression.
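A quick way to compare the original pattern with the lookahead version on the input from the question, assuming Python's re module (the variable names are only for illustration):

```python
import re

text = "<$Fexample text in here <em>>>"

# Original pattern: the lazy group stops at the first ">>" it can find.
lazy = re.search(r"<\$F(.+?)>{2}", text).group(1)

# Lookahead version: the closing ">>" must not be followed by another ">".
fixed = re.search(r"<\$F(.+?)>>(?!>)", text).group(1)

print(repr(lazy))
print(repr(fixed))
```

With the lookahead, the engine has to extend the lazy group by one more ">" before the closing ">>" is accepted, which keeps the closing bracket of the em tag inside the capture.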