Regex to find subelement in XML - regex

I am using the Regular Expression search feature in Notepad++ to find matches in a few hundred files.
My goal is to find a parent/child combo in each. I don't much care what exactly is selected (parent and child, or just the child). I just want to know whether the parent contains a specific child.
I want to find a <description> parent element that also has a <description> child element.
Example of what it should find (since one of the sub-elements is a <description>):
<description>
<otherstuff>
</otherstuff>
<something>
</something>
<description>
</description>
<otherstuff>
</otherstuff>
</description>
Example of what it should NOT find:
<description>
<otherstuff>
</otherstuff>
<something>
</something>
<notadescription>
</notadescription>
<otherstuff>
</otherstuff>
</description>
Each <description> may have other children and sub-children as well. Both cases may also occur in the same document.
If I search for this:
<description>(.*)<description>(.*)</description>
It selects too much, because the second <description> can match another top-level element when I only want it to match a child.

You said you're working with Notepad++, so here is a way to go:
Ctrl+F
Find what: <description>(?:(?!</description).)*<description>(?:(?!<description>).)*</description>
check Match case
check Wrap around
check Regular expression
CHECK . matches newline
Explanation:
<description> # opening tag
(?:(?!</description).)* # tempered greedy token: make sure there is no closing tag before the next opening tag
<description> # opening tag
(?:(?!<description>).)* # tempered greedy token: make sure there is no opening tag before the closing tag
</description> # closing tag
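The pattern above can also be tried outside Notepad++. The following is a minimal sketch in Python, assuming that `re.DOTALL` is an adequate stand-in for the ". matches newline" option. The function name `has_nested_description` and the shortened sample strings are made up for illustration.

```python
import re

# Pattern copied from the answer; re.DOTALL lets "." cross line breaks,
# which is what ". matches newline" does in Notepad++.
NESTED = re.compile(
    r"<description>(?:(?!</description).)*"
    r"<description>(?:(?!<description>).)*</description>",
    re.DOTALL,
)


def has_nested_description(xml_text: str) -> bool:
    """Return True if some <description> opens inside another one."""
    return NESTED.search(xml_text) is not None


# Shortened versions of the two examples from the question.
should_match = """<description>
<otherstuff>
</otherstuff>
<description>
</description>
</description>"""

should_not_match = """<description>
<notadescription>
</notadescription>
</description>"""

print(has_nested_description(should_match))
print(has_nested_description(should_not_match))
```

Because the first tempered token refuses to cross a closing tag, a sibling <description> later in the same document is not mistaken for a child.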

You should not use (.*), because it is greedy.
Here is an example of why you shouldn't use it in your case:
<description>
<otherstuff>
</otherstuff>
<description>
<description>hello</description>
</description>
</description>
Suppose that we use <description>(.*)<description>(.*)</description> here.
It will parse:
<description>
<description>hello</description>
</description>
</description>
So if you want to parse only what is inside the 2nd description, you should use (.*?), which is called non-greedy.
Using <description>(.*)<description>(.*?)</description> will parse:
<description>
<description>hello</description> # end of parse
# here the trailing </description> is missing because (.*?) stops at the first match
So you must use (.*?): it stops parsing as soon as it finds the first closing match, whereas (.*) is greedy and looks for the largest match possible.
So <description>(.*)<description>(.*?)</description> will be fine in your case, because it parses only what is inside the sub-description.
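The difference between greedy and non-greedy quantifiers is easiest to see on a one-line string. This is a small Python sketch. The `<b>` tags and the sample text are invented for the demonstration and do not come from the question.

```python
import re

text = "<b>one</b><b>two</b>"

# Greedy: (.*) grabs as much as possible, so it runs to the LAST </b>.
greedy = re.search(r"<b>(.*)</b>", text).group(1)

# Non-greedy: (.*?) grabs as little as possible, so it stops at the FIRST </b>.
lazy = re.search(r"<b>(.*?)</b>", text).group(1)

print(greedy)  # one</b><b>two
print(lazy)    # one
```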

I'm guessing that we want an expression that excludes <notadescription>, such as:
<description>(?!<notadescription>)[\s\S]*<\/description>
If we also want to capture the description element, we can wrap it in a capturing group:
(<description>(?!<notadescription>)[\s\S]*<\/description>)

Related

Regex - How to remove last instance of the search value it finds?

I have multiple XML files that I need to delete a line from. The same line exists in different sections of the file but I only need to delete the last instance it finds. For example -
(Opening tag here)Simple name="DisplayValue" value="{?Consumer}" />
(Opening tag here)Simple name="DisplayValue" value="{?Consumer}" />
(Opening tag here)Simple name="DisplayValue" value="{?Consumer}" /> - This is the line I need to delete
This is the line in file.
I am using the Find in Files feature in Notepad++ to achieve this. Tia.
Try the following find and replace, in regex mode (with dot all enabled):
Find: (.*)Same Text(?:\r?\n|$)(.*)
Replace: $1$2
This should work because the initial (.*) capture group should match and capture all content up to, but not including, the last occurrence of Same Text. Then, we also match and capture all content after this last occurrence. Finally, we replace with just the first two capture groups, to effectively splice out the line you want to remove.
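The same splice can be sketched in Python to check the behaviour before running it across many files. This assumes `re.DOTALL` corresponds to the "dot all" option. The helper name `remove_last_line` and the sample text are hypothetical, and `re.escape` is used so that characters such as `{` and `?` in the target line are treated literally.

```python
import re


def remove_last_line(text: str, line: str) -> str:
    """Remove only the last occurrence of `line` (plus its line break)."""
    pattern = re.compile(
        r"(.*)" + re.escape(line) + r"(?:\r?\n|$)(.*)",
        re.DOTALL,  # let (.*) span line breaks, like "dot all" in the editor
    )
    # The greedy first group backtracks from the end of the text,
    # so the literal ends up matching the LAST occurrence.
    return pattern.sub(lambda m: m.group(1) + m.group(2), text, count=1)


TARGET = '<Simple name="DisplayValue" value="{?Consumer}" />'
sample = f"<a>\n{TARGET}\n</a>\n<b>\n{TARGET}\n</b>\n"

print(remove_last_line(sample, TARGET))
```

A function is used as the replacement instead of a `\1\2` template so that backslashes in the captured text cannot be misread as escapes.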

Regular Expressions: Lookback to only the first occurrence (non-greedy lookback?)

Here's the problem:
XML:
<userPermissions>
<enabled>true</enabled>
<name>ViewPublicReports</name>
</userPermissions>
<userPermissions>
<enabled>true</enabled>
<name>ViewRoles</name>
</userPermissions>
<userPermissions>
<enabled>true</enabled>
<name>ViewSetup</name>
</userPermissions>
What I'm trying to match is:
<userPermissions>
<enabled>true</enabled>
<name>ViewRoles</name>
</userPermissions>
All the patterns that I've managed to put together start matching at the first <userPermissions>:
(?<=<userPermissions>)[\s\S]+?ViewRoles[\s\S]*?<\/userPermissions>
Not quite sure how to make the backwards match from "ViewRoles" non-greedy.
Thanks in advance for your help.
*Edit: I'm using a tool that deploys metadata between Salesforce instances, which are captured as XML. The tool provides a "find/replace" functionality that uses regex for the "find." I don't have the option of using an XML parser.
This <userPermissions>(?:(?!</userPermissions>)[\S\s])*?ViewRoles[\S\s]*?</userPermissions>
matches that tag.
Formatted
<userPermissions>
(?:
(?! </userPermissions> )
[\S\s]
)*?
ViewRoles
[\S\s]*?
</userPermissions>
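To see why the negative lookahead matters, the pattern can be run against the sample from the question. This is a Python sketch. The "naive" pattern is a simplified version of the question's attempt without the lookbehind, included only for comparison, and the variable names are made up.

```python
import re

XML = """<userPermissions>
<enabled>true</enabled>
<name>ViewPublicReports</name>
</userPermissions>
<userPermissions>
<enabled>true</enabled>
<name>ViewRoles</name>
</userPermissions>
<userPermissions>
<enabled>true</enabled>
<name>ViewSetup</name>
</userPermissions>"""

# Without the lookahead, the lazy part may cross closing tags,
# so the match starts at the first <userPermissions>.
naive = re.search(
    r"<userPermissions>[\s\S]+?ViewRoles[\s\S]*?</userPermissions>", XML
).group(0)

# With the lookahead, each step refuses to step over </userPermissions>,
# so only the block that really contains ViewRoles can match.
tempered = re.search(
    r"<userPermissions>(?:(?!</userPermissions>)[\S\s])*?ViewRoles"
    r"[\S\s]*?</userPermissions>",
    XML,
).group(0)

print(tempered)
```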
As has been said, the correct way to extract this would be to use an XML parser. However, you can also use the following regex:
(.+\n){2}.+ViewRoles.+\n.+
Which actually matches the following structure:
2 rows without restrictions
a row that includes "ViewRoles"
another row without restrictions

notepad++ xml node regex find and replace

Can you tell me what to search for in Notepad++ in order to find all nodes with the OptionCode 09 below and delete them? For example, I would like to search the XML below, delete the first entry, and be left with what is below it.
I tried searching for <Vehicle>.*?</Vehicle>, which works when replacing with a blank value, but I also want to restrict the search to nodes that contain the 09 value. Is it possible to add a search condition for the ">09<" text string?
Search here:
<Vehicle>
<InvoiceDateTime>2016-03-20T00:00:00</InvoiceDateTime>
<InvoiceChargeCents>63</InvoiceChargeCents>
<OptionCode>09</OptionCode>
<JobEndDateTime>2016-03-19T00:00:00</JobEndDateTime>
<AuthorizationCode />
</Vehicle>
<Vehicle>
<InvoiceDateTime>2016-03-20T00:00:00</InvoiceDateTime>
<InvoiceChargeCents>63</InvoiceChargeCents>
<OptionCode>35</OptionCode>
<JobEndDateTime>2016-03-19T00:00:00</JobEndDateTime>
<AuthorizationCode />
</Vehicle>
Return the entry below:
<Vehicle>
<InvoiceDateTime>2016-03-20T00:00:00</InvoiceDateTime>
<InvoiceChargeCents>63</InvoiceChargeCents>
<OptionCode>35</OptionCode>
<JobEndDateTime>2016-03-19T00:00:00</JobEndDateTime>
<AuthorizationCode />
</Vehicle>
My approach consists of matching the <Vehicle> opening node and then restricting dot matching, with a tempered greedy token, so that it can match neither opening nor closing Vehicle nodes:
<Vehicle>(?:(?!</?Vehicle>).)*>09<.*?</Vehicle>\R*
The . matches newline option should be enabled.
Replace with an empty string.
The (?:(?!</?Vehicle>).)* is a tempered greedy token that matches any text up to the first >09< after the closest <Vehicle> opening node.
Note that \R* matches zero or more newline sequences (either CRLF, or CR, or LF).
Also please note that a more efficient pattern will be the unrolled version of the above pattern:
<Vehicle>[^<]*(?:<(?!\/?Vehicle>)[^<]*)*>09<.*?<\/Vehicle>\R*
The unrolled part is [^<]*(?:<(?!\/?Vehicle>)[^<]*)*
It matches the same text, but is more efficient with larger texts.
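For checking the replacement outside the editor, here is a rough Python equivalent of the first pattern. Python's `re` module does not support `\R`, so the sketch assumes `(?:\r\n|\r|\n)*` as a substitute, and `re.DOTALL` stands in for ". matches newline". The shortened sample data and the variable names are invented.

```python
import re

# Tempered greedy token version; \R* replaced by an explicit
# newline alternation because Python's re has no \R.
VEHICLE_09 = re.compile(
    r"<Vehicle>(?:(?!</?Vehicle>).)*>09<.*?</Vehicle>(?:\r\n|\r|\n)*",
    re.DOTALL,
)

vehicle_09 = (
    "<Vehicle>\n"
    "<InvoiceChargeCents>63</InvoiceChargeCents>\n"
    "<OptionCode>09</OptionCode>\n"
    "</Vehicle>\n"
)
vehicle_35 = (
    "<Vehicle>\n"
    "<InvoiceChargeCents>63</InvoiceChargeCents>\n"
    "<OptionCode>35</OptionCode>\n"
    "</Vehicle>\n"
)

# Replacing with an empty string deletes every Vehicle node containing >09<.
cleaned = VEHICLE_09.sub("", vehicle_09 + vehicle_35)
print(cleaned)
```

The node with OptionCode 35 survives regardless of whether it comes before or after the 09 node, because the token cannot step over a Vehicle tag while looking for >09<.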

Regular Expressions - match content in XML page

I am new to regular expressions and need to write one that will pull certain data out of an XML page. For instance,
<name>Number of test runs</name>
<value>2</value>
The only number I need to pull is the 2. I want it to look at the XML <name> tag so I don't pull any other numbers from the page. Below is what I have, but it matches all the content instead of just the 2. Any help would be appreciated.
Current Regular Expression:
/<name>Number of Failed BGPs</name>\n<value>(.+?)/
You said the problem is that it's matching all the content, not just the value (2). But you do need to match all the content to ensure it's the correct <name> tag.
The distinction you want is the matched group, designated by parens.
/<name>Number of Failed BGPs<\/name>\n<value>(.+?)<\/value>/
You want to get the first matched group, which should be just the value itself. Notice I also added the </value> tag to the regex. If you don't, your lazy quantifier would pick up only the first digit.
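The effect of the closing tag on the lazy group can be checked with a short Python sketch. The sample page text and the value 12 are made up, using two digits so that the difference is visible.

```python
import re

page = "<name>Number of Failed BGPs</name>\n<value>12</value>"

# With </value> in the pattern, the lazy group must expand
# until the closing tag, so it captures the whole value.
with_close = re.search(
    r"<name>Number of Failed BGPs</name>\n<value>(.+?)</value>", page
)

# Without it, nothing forces the lazy group past one character.
without_close = re.search(
    r"<name>Number of Failed BGPs</name>\n<value>(.+?)", page
)

print(with_close.group(1))     # 12
print(without_close.group(1))  # 1
```

In both cases `group(0)` is the whole matched text and `group(1)` is only the part inside the parentheses.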

How to only match elements without a closing tag in regex?

I am trying to match all XML nodes within a parent node that do not have closing tags. Does anybody know a regular expression to do so?
A simple start:
<.*?\/>
Note that it will fail with, for example, this:
<bar attr="oops/>"/>
However, having /> in an attribute is a very rare occurrence, and you could always escape it.
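A quick Python sketch shows the failure case mentioned above, plus one more thing worth knowing: because `.` also matches `<` and `>`, the simple pattern can start at an earlier tag that is not self-closing. The `stricter` variant below is an assumption of mine and not part of the answer. It still fails on the attribute case, and the sample document is invented.

```python
import re

simple = re.compile(r"<.*?/>")       # pattern from the answer
stricter = re.compile(r"<[^<>]*/>")  # hypothetical variant: stay inside one tag

# The failure case from the answer: the match stops inside the attribute.
attr_case = '<bar attr="oops/>"/>'
print(simple.findall(attr_case))

# With several tags on one line, the simple pattern starts at the first "<"
# it sees, even when that tag is not self-closing.
doc = '<root><a/><b>text</b><c x="1"/></root>'
print(simple.findall(doc))
print(stricter.findall(doc))
```

Neither pattern understands quoting or nesting, so for anything beyond a quick search an XML parser remains the safer choice.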