Simple Regex from HTML - regex

I have the following code grabbed from a webpage source code:
<span>41,396</span>
And the following regex:
("<span>.*</span>")
Which returns
<span>New Users</span>
However, I don't want to have the tags in the results. I've tried a few things, but Regular Expressions are new to me.
More so than this I need to get the Regex for the following code:
<span>41,396</span>
</span>
<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>
I was thinking this may involve line breaks and more, which is why I threw this in separately.

You could try something like
<span( class=\".+\")?>(.*)</span>
And then get capture group 2 for the tag's body. But be aware that regular expressions are NOT good for parsing HTML/XML. What would happen if you had nested <span> tags?
If the input gets even the slightest bit more complicated than what you've shown, look for an HTML parser and try using that instead.
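As a rough sketch of the parser route, here is one way it could look in Python with the standard library's html.parser module. The language choice, the class name SpanTextCollector and the helper span_texts are assumptions made for illustration, not something from the question. It counts how many <span> tags are open and keeps any non-blank text seen while at least one is open, so nesting and line breaks stop mattering.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collects non-blank text that appears inside any <span> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <span> tags currently open
        self.values = []  # stripped text chunks found inside a <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        # Ignore stray closing tags so the counter never goes negative.
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.values.append(text)


def span_texts(html):
    """Hypothetical helper: return the text of every <span> in html."""
    parser = SpanTextCollector()
    parser.feed(html)
    parser.close()
    return parser.values


# The fragment from the question, including its stray closing tag.
sample = """<span>41,396</span>
</span>
<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>"""

print(span_texts(sample))
```

The wrapper spans with a class attribute contain only whitespace of their own, so only the three numbers are collected.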

You can use a capturing group to get just the value instead of the tag plus the value:
"<span>(.*)</span>"
Consider using an HTML parsing library in your language of choice if the regex becomes more complicated.
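To illustrate the capture group idea, here is a minimal Python sketch (the question does not name a language, so Python and the helper name span_bodies are assumptions). With a single group in the pattern, re.findall returns only the group's text, not the tags. The sketch uses the lazy .*? rather than .* so that two spans on one line are not swallowed as a single match.

```python
import re

# One capture group: the text between a bare <span> and its </span>.
SPAN_BODY = re.compile(r"<span>(.*?)</span>")


def span_bodies(html):
    """Hypothetical helper: return the captured text of each bare <span>."""
    return SPAN_BODY.findall(html)


print(span_bodies("<span>41,396</span>"))
print(span_bodies("<span>2,150</span> <span>161,305,807</span>"))
```

Spans that carry attributes, such as <span class="xpColumn">, are simply not matched by this pattern.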

As far as I know, a regex is matched line by line by default, but you could write an expression that works around that.
Try: <span>(.*)</span>
You should be able to retrieve the information you want with \1
In the case of <span class="xpColumn"> it would just not match and \1 would be empty.
Cheers :)

Related

Unable to accurately search a particular text in a html tag using Python

I have the regex below to identify text in an HTML tag, but it doesn't yield the expected result.
HTML Tag:
<td>Issue Amount</td>
<td>:</td>
<td>20,000,000.00</td>
Find = re.findall(?<=Issue Amount</td> <td>:</td> <td>) [0-9,]),soup_string)[0]
I need to get the numerical value 20,000,000.00 from this tag.
Any advice on what I am doing wrong here? I did try a couple of other ways, but with no success.
Do not under any circumstances try to parse XML with a regex unless you wish to invoke rite 666 Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.
Use an HTML parsing library; see this page for some ways to do it.
However, in your case you have mucked up your regex by looking for a space between your </td> and <td> tags, whereas your data has carriage returns. You can use the \s meta-character to match any whitespace character.
Below is the regex piece that helped me get the desired output. Thanks all for your inputs.
(?<=Issue Amount[td\W]{21})([\d,.]+)
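For comparison, here is a small sketch of the \s suggestion in Python. One thing to keep in mind is that Python's re module only accepts fixed-width lookbehinds, so \s* cannot go inside (?<=...); this sketch uses a capture group instead. The function name issue_amount is made up for the example, and soup_string is filled with the three lines from the question.

```python
import re

# The table cells from the question, separated by line breaks.
soup_string = """<td>Issue Amount</td>
<td>:</td>
<td>20,000,000.00</td>"""

# \s* tolerates any whitespace (spaces, tabs, line breaks) between cells.
AMOUNT = re.compile(r"Issue Amount</td>\s*<td>:</td>\s*<td>([\d,.]+)</td>")


def issue_amount(html):
    """Hypothetical helper: return the amount as a string, or None."""
    match = AMOUNT.search(html)
    return match.group(1) if match else None


print(issue_amount(soup_string))
```

Returning None when nothing matches avoids the IndexError that indexing [0] on an empty findall result would raise.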

regex for address in span tags

I need to extract an address which will change on every new page from a sample like this. So I need a regex to extract 100 E Faith Ter from the following html code snippet.
<span style="..." class="addr">100 E Faith Ter<br>
<span class="locality">Maitland</span>,
<span class="region">FL</span>
<span class="postal-code">32751</span>
</span>
I am using Javascript.
You don't specify a language, and regular expressions are pretty language agnostic, but they differ in specifying how they deal with multiple lines. In javascript: /^.*$/m selects the first line.
Having updated your question to be full HTML instead of raw text, you can use:
^\<.+?\>(.+?)\<br\>$
and retrieve the first parenthesized submatch (be sure you use the multiline option)
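The asker is using JavaScript, but to keep the examples on this page in one language, here is the same idea sketched in Python. re.MULTILINE plays the role of the m flag, and the backslashes before < and > are dropped because they are not needed. The name street_address is an assumption for illustration.

```python
import re

# The snippet from the question, unchanged.
snippet = """<span style="..." class="addr">100 E Faith Ter<br>
<span class="locality">Maitland</span>,
<span class="region">FL</span>
<span class="postal-code">32751</span>
</span>"""

# With MULTILINE, ^ and $ anchor at the start and end of each line.
FIRST_LINE = re.compile(r"^<.+?>(.+?)<br>$", re.MULTILINE)


def street_address(html):
    """Hypothetical helper: text between the opening tag and <br>."""
    match = FIRST_LINE.search(html)
    return match.group(1) if match else None


print(street_address(snippet))
```

Only the first line ends in <br>, so the other lines cannot match.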
The Pony He Comes!!
A regex is not necessary for the whole thing. Instead, just strip all HTML tags. If you're using PHP, strip_tags does this nicely; otherwise you can use a regex to replace <[^>]+> with an empty string. You should get the plain text of the address, which you can then split into its separate lines.
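A quick sketch of the strip-then-split approach, written in Python rather than the asker's JavaScript or the PHP mentioned above (that choice and the name text_lines are assumptions). It replaces <[^>]+> with an empty string and keeps the non-blank lines.

```python
import re

# The snippet from the question, unchanged.
snippet = """<span style="..." class="addr">100 E Faith Ter<br>
<span class="locality">Maitland</span>,
<span class="region">FL</span>
<span class="postal-code">32751</span>
</span>"""


def text_lines(html):
    """Hypothetical helper: strip tags, then return the non-blank lines."""
    plain = re.sub(r"<[^>]+>", "", html)
    return [line.strip() for line in plain.splitlines() if line.strip()]


lines = text_lines(snippet)
print(lines[0])  # the street address is the first line
```

This relies on the address parts sitting on separate lines in the source; if the markup were all on one line, the split would have nothing to work with.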

Regex replace is eating up the whole string! How do I make regex ungreedy?

I'm working with a really large spreadsheet in Open Office and I've had to learn regular expressions to clean it up.
Right now I'm trying to remove all <span> tags and I've come up with an expression to do so:
(<span.*?>|</span>)
The problem is that OpenOffice doesn't seem to like the question mark (which should make it ungreedy), so when I try to remove the <span> tags, it removes most of my string.
Here is a sample of the data: http://pastebin.com/AKWZJJCv
What is an alternative way of removing the <span> tags that would work in OpenOffice's find and replace?
You could also try (<span[^>]*>|</span>)
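To see why the negated character class helps, here is a small comparison run through Python's re module. This only shows how the patterns behave in general; it has not been tried in OpenOffice's find and replace dialog, and the sample cell is made up since the pastebin data is not reproduced here.

```python
import re

# Hypothetical cell content; the real data is in the linked pastebin.
cell = '<span class="a">Widget</span> <span>42</span>'

# Greedy .* runs to the LAST > on the line, eating everything in between.
greedy = re.sub(r"(<span.*>|</span>)", "", cell)

# Lazy .*? stops at the first >, but needs engine support for ?.
lazy = re.sub(r"(<span.*?>|</span>)", "", cell)

# [^>]* cannot cross a >, so it needs no lazy quantifier at all.
negated = re.sub(r"(<span[^>]*>|</span>)", "", cell)

print(repr(greedy))
print(repr(lazy))
print(repr(negated))
```

The negated class gives the same result as the lazy version without depending on lazy quantifier support.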
Give this a try:
<(\/)?span([a-zA-Z\-\="0-9 ]*)?>
Tested here.

Regular Expression matching nested TAGS

Hello, I'm trying to match multiply nested blockquotes and transform them back into BBCode.
This is the regex I have so far.
(I converted it back to HTML entities so that it displays on Stack Overflow.)
<div class="quoteheader"><div class="topslice_quote">([\s\S]*?)</div></div><blockquote>([\s\S]*?)(?:</blockquote><div class="quotefooter"><div class="botslice_quote"></div></div>){2,})
I'm trying to match this
<div class="quoteheader"><div class="topslice_quote">Quote</div></div><blockquote>Outside quote is this
<div class="quoteheader"><div class="topslice_quote">Quote</div></div><blockquote>Inner quote is this</blockquote><div class="quotefooter"><div class="botslice_quote"></div></div>
</blockquote><div class="quotefooter"><div class="botslice_quote"></div></div>
to generate this
[quote]Outside quote is
this[quote]Inner quote is
this[/quote][/quote]
I'm using VBScript 5.5 Regular Expressions for this (but this isn't that important).
I really need help with the expression. I've tried using an HTML parser for this, but it turns out to be more difficult than using regex.
I'm just repeating what's said here.
Regular expressions can't match context-free languages, like groups of tags. You can't match opening tags to closing tags, so parsing a block (especially a nested one) becomes impossible to do reliably.
You can certainly build a kludge to help, but there will be situations where it won't work.
Well, this is all you need to do with the parser.
Here's the pseudocode. I don't know your parser so this is the best I can offer.
First find the div tag with the quoteheader class. Get the next sibling.
That is the blockquote tag. Let's call this tag theQuote.
Get the first child of theQuote. It will be a html text item. That is the outer quote.
Get the third child of theQuote. It will be another blockquote tag. Let's call this tag theInner.
Get the first child of theInner. It will be a html text item. That is the inner quote.
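The pseudocode above walks siblings and children in a tree. As a loosely related sketch, here is an event-driven variant in Python using the standard library's html.parser, which is a different technique from the tree navigation described and not the asker's VBScript. The class name QuoteToBBCode, the helper html_to_bbcode, and the decision to skip everything inside quoteheader and quotefooter divs are assumptions made for this example. Because the parser reports open and close events in order, any nesting depth is handled without special cases.

```python
from html.parser import HTMLParser


class QuoteToBBCode(HTMLParser):
    """Turns <blockquote> nesting into [quote]...[/quote] BBCode."""

    def __init__(self):
        super().__init__()
        self.out = []        # output fragments, joined at the end
        self.skip_depth = 0  # > 0 while inside a header/footer div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            cls = dict(attrs).get("class", "")
            # Skip header/footer divs and any div nested inside them.
            if self.skip_depth or cls in ("quoteheader", "quotefooter"):
                self.skip_depth += 1
        elif tag == "blockquote":
            self.out.append("[quote]")

    def handle_endtag(self, tag):
        if tag == "div" and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "blockquote":
            self.out.append("[/quote]")

    def handle_data(self, data):
        # Keep text only when outside the skipped decoration divs.
        if not self.skip_depth:
            self.out.append(data)


def html_to_bbcode(html):
    """Hypothetical helper: convert the quote markup to BBCode."""
    parser = QuoteToBBCode()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


HEADER = '<div class="quoteheader"><div class="topslice_quote">Quote</div></div>'
FOOTER = '<div class="quotefooter"><div class="botslice_quote"></div></div>'

# The nested sample from the question, rebuilt from its three lines.
sample = (
    HEADER + "<blockquote>Outside quote is this\n"
    + HEADER + "<blockquote>Inner quote is this</blockquote>" + FOOTER + "\n"
    + "</blockquote>" + FOOTER
)

print(html_to_bbcode(sample))
```

Line breaks from the source HTML are passed through unchanged, so the output keeps whatever whitespace the input had between the quotes.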

How to Find Quotes within a Tag?

I have a string like this:
This <span class="highlight">is</span> a very "nice" day!
What should my RegEx-pattern in VB look like, to find the quotes within the tag? I want to replace it with something...
This <span class=^highlight^>is</span> a very "nice" day!
Something like <(")[^>]+> doesn't work :(
Thanks
It depends on your regex flavor, but this works for most of them:
"(?=[^<]*>)
EDIT: For anyone curious how this works. This translates into English as "Find a quote that is followed by a > before the next <".
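Here is a quick check of that lookahead, sketched in Python instead of VB (an assumption for illustration; the pattern itself is unchanged). It replaces each matching quote with ^ as in the question.

```python
import re

text = 'This <span class="highlight">is</span> a very "nice" day!'

# A quote qualifies only if a > comes before the next <, i.e. it is
# inside a tag rather than in the text between tags.
result = re.sub(r'"(?=[^<]*>)', "^", text)

print(result)
```

One limitation worth knowing: a literal > in the text content after a quote would fool the lookahead, since the pattern only looks at which bracket comes next.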
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
If you are using VB.net you should be able to use HTMLAgilityPack.
Try this: <span class="([^"]+?)?">
This should get you the first attribute value in a tag:
<[^">]+"(?<value>[^"]*)"[^>]*>
If your intention is to replace ALL quotation marks within tags, you could use the following regular expression:
(<[^>"]*)(")([^>]*>)
That will isolate the substrings before and after your quotation mark. Note that this does not attempt to match opening and closing quotation marks. It simply matches a quotation mark within a tag.
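Since the first group excludes quotation marks, each pass can only reach the first remaining quote in a tag. A possible way to replace all of them is to repeat the substitution until nothing changes, sketched here in Python rather than VB (the language and the function name replace_quotes_in_tags are assumptions).

```python
import re

# Group 1: tag start up to the first quote; group 3: the rest of the tag.
QUOTE_IN_TAG = re.compile(r'(<[^>"]*)(")([^>]*>)')


def replace_quotes_in_tags(text, replacement="^"):
    """Hypothetical helper: replace every quote that sits inside a tag.

    The replacement must not itself contain a quotation mark,
    otherwise the loop would never reach a stable result.
    """
    previous = None
    while previous != text:
        previous = text
        # One pass replaces the first remaining quote in each tag.
        text = QUOTE_IN_TAG.sub(
            lambda m: m.group(1) + replacement + m.group(3), text
        )
    return text


text = 'This <span class="highlight">is</span> a very "nice" day!'
print(replace_quotes_in_tags(text))
```

The quotes around "nice" are left alone because they are not between a < and a >.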