Regular Expressions - match content in XML page - regex

I am new to regular expressions and need to write one that will pull certain data out of an XML page. For instance,
<name>Number of test runs</name>
<value>2</value>
The only number I need to pull is the 2. I want the regex to key on the <name> tag so that I don't pull any other numbers from the page. Below is what I have, but it matches all of the content instead of just the 2. Any help would be appreciated.
Current Regular Expression:
/<name>Number of Failed BGPs</name>\n<value>(.+?)/

You said the problem is that it's matching all the content, not just the value (2). But you do need to match all the content to ensure it's the correct <name> tag.
The distinction you want is the matched group, designated by parens.
/<name>Number of Failed BGPs<\/name>\n<value>(.+?)<\/value>/
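You want to get the first matched group, which should be just the value itself. Notice that I escaped the forward slashes in the closing tags, since an unescaped / would end a pattern written between /.../ delimiters, and that I added the </value> tag to the regex. If you don't add it, your lazy quantifier will pick up only the first character of the value.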
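As a minimal sketch of reading the first matched group, assuming Python's built-in re module and the sample from the question (the variable names are illustrative, and \s* stands in for \n so that indentation between the tags is tolerated):

```python
import re

xml = "<name>Number of test runs</name>\n<value>2</value>"

# The whole pattern anchors on the <name> text; only the parenthesized
# part (group 1) is the value we want.
pattern = re.compile(r"<name>Number of test runs</name>\s*<value>(.+?)</value>")

match = pattern.search(xml)
value = match.group(1) if match else None
print(value)  # 2
```

Here match.group(0) is still the full matched text, while match.group(1) is only what sat inside the parentheses.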

Related

regex match get first match

I'm working with existing code where regular expressions are used to parse HTML. For specific reasons it is not possible to use XPath. The HTML is actually an HTML/text email. In the email I have multiple div elements with text content. I'm trying to write a regex that matches the n-th div element. Unfortunately, these div elements do not have any attributes such as classes or ids. I tried this, but it matches all occurrences:
<div>(.*)<\/div>{1}
There are many suggestions out there, but none of them is working for me.
Thanks.
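One possible approach, sketched here under the assumption that Python's re module is available and that the div elements are not nested: in the pattern above, (.*) is greedy and {1} applies only to the final >, so it cannot select a position. Collecting all lazy matches and indexing into the list picks the n-th one. The sample string and the name n are illustrative.

```python
import re

html = "<div>first</div>\n<div>second</div>\n<div>third</div>"

# Lazy (.*?) stops at the nearest </div>; DOTALL lets a div span several lines.
divs = re.findall(r"<div>(.*?)</div>", html, flags=re.DOTALL)

n = 2  # 1-based position of the div to pick (illustrative)
nth = divs[n - 1] if len(divs) >= n else None
print(nth)  # second
```

This does not handle nested div elements; those would need a real HTML parser.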

RegEx for mining XML tag content

Fellow Forum Members,
I am using the latest NotePad++. I have 430 separate XML files and my goal is to make a "dmcode" list of all 430 XML files. The dmcode identifies each XML file and looks like the example code shown below. I need help developing a Regular Expression that will grab the dmcode tag content located between the <dmCode opening tag and the closing /> terminator. I also need this extraction to apply only to dmcode tags that follow the <dmIdent> tag. In other words, any dmcode tag that is not preceded by a <dmIdent> tag should not end up on my NotePad++ search result list. Is a Regular Expression that can pull targeted data from a lot of XML files possible?
<dmIdent>
<dmCode assyCode="00" disassyCode="00" disassyCodeVariant="00" infoCode="042" infoCodeVariant="A" itemLocationCode="O" modelIdentCode="SASA" subSubSystemCode="6" subSystemCode="0" systemCode="A03" systemDiffCode="XY"/>
As an alternative I have been researching using an XPath expression to accomplish the same task. However, I can't seem to find a NotePad++ XPath plugin that will enable me to specify the data I want to extract from 430 XML files by using an XPath expression instead of a Regular Expression. I will also appreciate it if anyone can provide an example of an XPath expression that will perform the same task I'm trying to accomplish by using a Regular Expression.
Any help will be greatly appreciated.
I know there are plugins for XPath, but I don't know of one that lets you search several files. The following XPath would select every <dmCode> that has at least one attribute and is a child of the root element <dmIdent> (use /dmIdent/dmCode/@* to select the attributes themselves):
/dmIdent/dmCode[@*]
I need help in developing a Regular Expression that will grab the dmcode tag content located between the <dmCode opening tag and the closing /> terminator. Also I only need this extraction to only apply to dmCode tags that follow the <dmIdent> tag.
This will work for the most simple cases, where:
<dmCode> is the first child of <dmIdent>
There are no comments, CDATA tags, or similar constructs that could make it fail.
(?i)<dmIdent>\s*<dmCode \K[^"/>]*(?>(?:"[^\\"]*(?:\\.[^\\"]*)*"|/(?!>))[^"/>]*)*(?=/>)
Matches:
(?i)<dmIdent>\s*<dmCode both tags, separated by optional whitespace (case-insensitively)
\K resets the matched text
[^"/>]* Any characters except ", / or >
And loops:
"[^\\"]*(?:\\.[^\\"]*)*" text in quotes, or
/(?!>) a / not followed by >
both followed by the previous [^"/>]*
(?=/>) All followed by />
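The pattern above relies on \K, which Python's built-in re module does not support. As a rough sketch of the same idea in Python, a capturing group can take over the role of \K and the lookahead. This is a simplified adaptation, not the pattern above: it assumes double-quoted attribute values without backslash escapes, and the names DMCODE, content and attrs are illustrative.

```python
import re

sample = '''<dmIdent>
<dmCode assyCode="00" disassyCode="00" disassyCodeVariant="00" infoCode="042" infoCodeVariant="A" itemLocationCode="O" modelIdentCode="SASA" subSubSystemCode="6" subSystemCode="0" systemCode="A03" systemDiffCode="XY"/>'''

# Group 1 holds everything between "<dmCode " and the closing "/>".
# Each alternative starts with a different character: a plain character,
# a quoted value, or a "/" that is not followed by ">".
DMCODE = re.compile(
    r'<dmIdent>\s*<dmCode\s+((?:[^"/>]|"[^"]*"|/(?!>))*)/>',
    re.IGNORECASE,
)

m = DMCODE.search(sample)
content = m.group(1) if m else ""

# Split the captured text into name/value pairs.
attrs = dict(re.findall(r'(\w+)="([^"]*)"', content))
print(attrs.get("infoCode"))  # 042
```

A <dmCode> that does not directly follow <dmIdent> is not matched, which mirrors the restriction in the question.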

REGEX replace all " style='anything'" except within tables

I am parsing HTML. I know this shouldn't be done with regex but with DOM/XPath. In my case it should just be fast, simple and without tidy, so I chose regex.
The task is to replace every style='xxx' with an empty string, except within tables.
This regex for preg_replace catches every style="xxx", no matter where it appears:
'/ style="([^"]+)"/s'
The content can look like this
<!-- more html here -->
<span style='do:smtg'><table class=... > <span style="...">
<table> <div style=""></div></table></span></table>
<!-- more html here -->
or it can contain just simple, non-nested tables. The regex should leave every style='...' inside a table untouched, including inside nested tables.
Is there a simple syntax doing this?
Thou Shalt Not Parse HTML with Regular Expressions!
No, really, you shouldn't.
As evidenced by your example, you can expect nested tables. That means the regex should keep track of the level of nesting, to decide whether or not you're in a table. If you find a way to do this, it will certainly not be "fast and simple".
Resurrecting this question because it had a regex solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
With all the disclaimers about using regex to parse html, here is a simple way to do it.
First we need a regex to match tables, nested or not. This does it with simple recursion:
<table(?:.*?(?R).*?|.*?)</table>
Next, we exclude these, and match what we do want. Here is the whole regex:
(?s)<table(?:.*?(?R).*?|.*?)<\/table>(*SKIP)(*F)|style=(['"])[^'"]*\1
The left side of the alternation matches complete tables, nested or not, then deliberately fails. The right side matches your style attributes, capturing the opening quote in Group 1 so that the backreference \1 requires the same quote to close the value. We know these are the right styles because they were not matched by the expression on the left.
With this regex, you can do a simple preg_replace($regex, "", $yourstring);
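Python's built-in re module supports neither (?R) recursion nor the (*SKIP)(*F) verbs, so the pattern above cannot be used there unchanged. A minimal sketch of the same outcome in Python tracks the table nesting depth in a replacement callback instead. This is a different technique from the PCRE pattern above; the helper name strip_styles_outside_tables and the sample markup are hypothetical.

```python
import re

# One pass over three kinds of tokens: table open, table close, style attribute.
TOKEN = re.compile(
    r'''(<table\b)|(</table\s*>)|\sstyle=(["'])[^"']*\3''',
    re.IGNORECASE,
)

def strip_styles_outside_tables(html):
    depth = 0  # current table nesting level

    def replace(m):
        nonlocal depth
        if m.group(1):              # opening <table
            depth += 1
            return m.group(0)
        if m.group(2):              # closing </table>
            depth = max(depth - 1, 0)
            return m.group(0)
        # style attribute: keep it inside a table, drop it elsewhere
        return m.group(0) if depth else ""

    return TOKEN.sub(replace, html)

html = ("<span style='a:b'><table><div style=\"x:y\"></div>"
        "<table><p style=\"q\"></p></table><i style='z'></i></table>"
        "<b style=\"c:d\"></b></span>")
print(strip_styles_outside_tables(html))
```

Unlike the pattern above, this sketch also removes the whitespace before each style attribute, and it assumes the table tags are balanced.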

How to only match elements without a closing tag in regex?

I am trying to match all XML nodes within a parent node that do not have closing tags. Does anybody know a regular expression to do so?
A simple start:
<.*?\/>
Note that it will fail with, for example, this:
<bar attr="oops/>"/>
However, having /> inside an attribute value is very rare, and you could always escape the > there as &gt;.
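To make the failure case concrete, here is a short sketch assuming Python's re module. The second pattern is an illustrative quote-aware variant, not part of the answer above, and it only understands double-quoted attribute values.

```python
import re

xml = '<foo/><bar attr="oops/>"/>'

# The simple pattern stops at the first "/>", even inside an attribute value.
simple = re.findall(r"<.*?/>", xml)

# Quote-aware variant: step over "..." attribute values as whole units.
quote_aware = re.findall(r'<[^<>"]*(?:"[^"]*"[^<>"]*)*/>', xml)

print(simple)       # ['<foo/>', '<bar attr="oops/>']
print(quote_aware)  # ['<foo/>', '<bar attr="oops/>"/>']
```

Excluding < and > from the unquoted parts also keeps the match from starting at an earlier opening tag and running on to the next />.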

Regex match for contents of <li> element

I have the following content
<li>Title: [...]</li>
and I'm looking for a regex that will match and replace this so that I can parse it as XML. I just want to use a regex find and replace inside Sublime Text 2, so I want to match everything in the above example except for the [...], which is the content.
Why not extract the content and use it to build the XML, rather than trying to mold the wrapper of the content into XML? (Or am I misunderstanding you?)
<li>Title: ([^<]*)<\/li>
is the regular expression to extract the content.
It's pretty self-explanatory, apart from [^<]*, which matches any number of characters that are not a "<".
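As an illustration of extracting the content and building the XML from it, here is a small sketch assuming Python. The sample text and the <title> element name are hypothetical, since the question does not say what the target XML should look like.

```python
import re
from xml.sax.saxutils import escape

line = "<li>Title: Some heading</li>"

# Group 1 is everything after "Title: " up to the next "<".
m = re.search(r"<li>Title: ([^<]*)</li>", line)
content = m.group(1) if m else None

# Build the XML from the extracted text; <title> is a hypothetical element name.
xml = f"<title>{escape(content)}</title>" if content is not None else None
print(xml)  # <title>Some heading</title>
```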
I don't know Sublime, but something like this should suffice to get you the contents of the li. It allows for optional extra attributes on the tag. Make sure to turn off case sensitivity, in case of LI or Li etc. (lifted straight from http://www.regular-expressions.info/examples.html ):
<li\b[^>]*>(.*?)</li>
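For comparison, a brief sketch of that pattern outside Sublime, assuming Python's re module, where case-insensitivity is switched on with a flag. The sample markup is illustrative.

```python
import re

html = '<LI class="item">Title: One</LI><li>Title: Two</li>'

# IGNORECASE covers LI/Li/li; DOTALL lets the content span several lines.
items = re.findall(r"<li\b[^>]*>(.*?)</li>", html, flags=re.IGNORECASE | re.DOTALL)
print(items)  # ['Title: One', 'Title: Two']
```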
<li>\S*(.*)?</li>
That should match your string, with the content being capturing group 1.