I'm trying to figure out the regex for the following:
String</td><td>[number 0-100]%</td><td>[number 0-100]%</td><td>String</td><td>String</td>
Also, some of these td tags may have style attributes at some point.
I tried this:
String<.*>
and that returned
String</td>
but trying
String<.*><.*>
returned nothing. Why is this?
You probably shouldn't be trying to use a regex to parse HTML, because that way lies madness.
(.+)</td><td>(1?\d?\d)%</td><td>(1?\d?\d)%</td><td>(.+)</td><td>(.+)</td>
Use a character class, such as <td[^>]*>, which matches both <td> and <td class="abc">.
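As a rough illustration of the character-class suggestion, here is a minimal Python sketch. The sample row and the names `row`, `TD`, and `fields` are made up for the example, and the leading and trailing text groups are made lazy, which the original pattern does not do.

```python
import re

# Hypothetical sample row; the third cell carries a style attribute.
row = ('Name</td><td>42%</td><td style="color:red">100%</td>'
       '<td>foo</td><td>bar</td>')

# "</td>" followed by an opening <td> tag with optional attributes:
# [^>]* consumes any run of characters other than ">".
TD = r"</td><td[^>]*>"

pattern = re.compile(
    r"(.+?)" + TD + r"(1?\d?\d)%" + TD + r"(1?\d?\d)%"
    + TD + r"(.+?)" + TD + r"(.+?)</td>"
)

match = pattern.search(row)
fields = match.groups() if match else None
print(fields)
```

Note that 1?\d?\d also accepts values such as 150, so a stricter range check may still be needed after matching.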
Try the following:
(.+)(<[^>]+>){2}(1?\d?\d)%(<[^>]+>){2}(1?\d?\d)%(<[^>]+>){2}(.+)(<[^>]+>){2}(.+)<[^>]+>
You can test it here.
EDIT: Although this will work most of the time, it breaks if a > character appears inside one of the tag's attributes.
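The limitation mentioned in the edit can be reproduced with a small Python sketch. The two sample tags are invented for the illustration.

```python
import re

tag = re.compile(r"<[^>]+>")

plain = '<td style="color:red">'
tricky = '<td title="a > b">'  # hypothetical tag with ">" inside an attribute

plain_match = tag.match(plain).group()
tricky_match = tag.match(tricky).group()

print(plain_match)   # the whole tag
print(tricky_match)  # cut short at the first ">"
```

The second match stops at the > inside the title attribute, so everything after it is treated as text rather than as part of the tag.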
I have the regex below to identify text in an HTML tag, but it doesn't yield the expected result.
HTML Tag:
<td>Issue Amount</td>
<td>:</td>
<td>20,000,000.00</td>
Find = re.findall(?<=Issue Amount</td> <td>:</td> <td>) [0-9,]),soup_string)[0]
I need to get the numerical value 20,000,000.00 from this tag.
Any advice on what I am doing wrong here? I tried a couple of other approaches, but with no success.
Do not under any circumstances try to parse XML with a regex unless you wish to invoke rite 666 Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.
Use an HTML parsing library; see this page for some ways to do it.
However, in your case you have mucked up your regex by looking for a space between your </td> and <td> tags, whereas your data has carriage returns (line breaks) there. You can use the \s metacharacter to match any whitespace character.
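A minimal Python sketch of the \s suggestion follows. It assumes the three cells are separated by single line breaks, as in the question, and it uses a capture group rather than a lookbehind, because Python's re module only accepts fixed-width lookbehinds and \s* has no fixed width.

```python
import re

# The sample cells from the question, separated by line breaks.
soup_string = "<td>Issue Amount</td>\n<td>:</td>\n<td>20,000,000.00</td>"

# \s* between the tags tolerates spaces, tabs, and line breaks alike.
pattern = re.compile(r"Issue Amount</td>\s*<td>:</td>\s*<td>([\d,.]+)")

found = pattern.search(soup_string)
amount = found.group(1) if found else None
print(amount)
```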
Below is the regex piece that helped me get the desired output. Thanks all for your inputs.
(?<=Issue Amount[td\W]{21})([\d,.]+)
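For anyone wondering why the {21} is there: the lookbehind has to be fixed-width in Python, and 21 happens to be the number of characters between the label and the number in this particular sample. The sketch below checks that, and also shows what seems to be the main fragility, assuming the page could arrive with Windows line endings.

```python
import re

soup_string = "<td>Issue Amount</td>\n<td>:</td>\n<td>20,000,000.00</td>"

# Everything between the label and the number: tags, colon, line breaks.
between = soup_string.split("Issue Amount")[1].split("20,000,000.00")[0]
width = len(between)

fixed = re.findall(r"(?<=Issue Amount[td\W]{21})([\d,.]+)", soup_string)

# Same markup with \r\n line endings: the gap is no longer 21 characters.
crlf = soup_string.replace("\n", "\r\n")
fixed_crlf = re.findall(r"(?<=Issue Amount[td\W]{21})([\d,.]+)", crlf)

print(width, fixed, fixed_crlf)
```

Any change in whitespace or attributes between the cells changes the width, so the hard-coded count would need to be updated along with it.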
Regex is not being very friendly with me, giving me 0 matches haha.
Basically, I have a big string, that includes this:
<td class="fieldLabel02Std">FIELD_LABEL</td>
<td class="fieldLabel02Std">
VALUE
</td>
Thanks to the FIELD_LABEL I should be able to find it inside the bigger string. The "VALUE" is what I want to get.
I tried this pattern
String field = "FIELD_NAME";
String pattern = field + @"[\s\S]*?\<td[\s\S]*?\<\/td\>";
That didn't work. I was thinking about this:
Get the field_name + some characters + => which would be able to give me VALUE.
This gives me 0 matches.
Any help is much appreciated!
You can use something like this:
FIELD_LABEL</td>[\n\r\s]*<td class="fieldLabel02Std">[\n\r\s]*(.+?)[\n\r\s]*</td>
Generally it's a bad idea to use a regex to parse HTML, but it can be acceptable if you have a small problem with a known HTML format and you don't mind it breaking when they change so much as a comma...
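Here is the same idea as a Python sketch rather than C#. The helper name `field_value` is hypothetical, [\n\r\s]* is shortened to \s* (which already covers line breaks), and the class attribute is loosened to <td[^>]*>.

```python
import re

# The sample markup from the question.
html = (
    '<td class="fieldLabel02Std">FIELD_LABEL</td>\n'
    '<td class="fieldLabel02Std">\n'
    'VALUE\n'
    '</td>'
)

def field_value(label, text):
    """Return the text of the cell that follows the label cell, or None."""
    pattern = re.escape(label) + r"</td>\s*<td[^>]*>\s*(.+?)\s*</td>"
    found = re.search(pattern, text)
    return found.group(1) if found else None

print(field_value("FIELD_LABEL", html))
print(field_value("FIELD_NAME", html))
```

A label that does not occur in the text gives no match at all. That may be worth checking, since the snippet in the question searches for FIELD_NAME while the sample markup contains FIELD_LABEL.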
Consider the following Regex...
(?<=FIELD_LABEL[\S\s]*?\<td.*?\>[\S\s]*?)\w+(?=[\S\s]*?\</td\>)
Good Luck!
Is this what you are looking for?
FIELD_LABEL<\/td>[.\s]*?<td.*?>[.\s]*?VALUE[.\s]*?<\/td>
or
String pattern = field + @"<\/td>[.\s]*?<td.*?>[.\s]*?VALUE[.\s]*?<\/td>";
I have the following code grabbed from a webpage source code:
<span>41,396</span>
And the following regex:
("<span>.*</span>")
Which returns
<span>New Users</span>
However, I don't want to have the tags in the results. I've tried a few things, but Regular Expressions are new to me.
More importantly, I need a regex for the following code:
<span>41,396</span>
</span>
<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>
I was thinking this may involve line breaks and more, which is why I threw this in separately.
You could try something like
<span( class=\".+\")?>(.*)</span>
And then get capture group 2 for the tag's body. But be aware that regular expressions are NOT good for parsing HTML/XML. What would happen if you had nested <span> tags?
If the input gets even the slightest bit more complicated than what you've shown, look for an HTML parser and try using that instead.
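To give an idea of what the parser route could look like, here is a sketch using Python's standard-library html.parser. The class name `SpanText` and the sample markup (a balanced version of the nested spans from the question) are assumptions made for the example; a dedicated library would likely be more convenient on real pages.

```python
from html.parser import HTMLParser

class SpanText(HTMLParser):
    """Collects non-empty text found inside <span> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <span> elements are currently open
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.values.append(text)

# Hypothetical, well-formed variant of the nested spans in the question.
html = """<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>
</span>"""

parser = SpanText()
parser.feed(html)
parser.close()
print(parser.values)
```

Nested spans and class attributes need no special handling here, which is the main thing the regex versions struggle with.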
You can place the capturing group around the value only, so that you get the value instead of tag + value:
"<span>(.*)</span>"
Consider using an HTML parsing library in your language of choice if the regex becomes more complicated.
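A quick Python sketch of the capturing-group approach, run against the multi-line fragment from the question. The group is made lazy here (.*? instead of .*), which is an addition to the pattern above, so that two spans on the same line would not be merged into one match.

```python
import re

# The fragment from the question, line breaks included.
html = (
    '<span>41,396</span>\n'
    '</span>\n'
    '<span class="levelColumn">\n'
    '<span>2,150</span>\n'
    '</span>\n'
    '<span class="xpColumn">\n'
    '<span>161,305,807</span>'
)

# With a single capture group, findall returns only the captured text.
values = re.findall(r"<span>(.*?)</span>", html)
print(values)
```

The spans that carry a class attribute are skipped, because the pattern requires a bare <span> opening tag.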
As far as I know, the regex will match line by line (by default, . does not match a line break), but you could write an expression that works around that.
Try: <span>(.*)</span>
You should be able to retrieve the information you want with \1
In the case of <span class="xpColumn"> it would simply not match and \1 would be empty.
Cheers :)
Sorry, this might be a simple question, but I could not figure it out. What I need is to strip all the <a href...> and </a> strings from an HTML text. I'm not sure what regular expression I should use. I tried the following search without any luck:
/<\shref^(>)>
What I mean here is to search for any string starting with "< href", followed by any characters other than '>', and finally '>'. My search code is not working. What is the correct one?
If I understand what you're looking for it should be <\shref[^>]*>.
Another way would be to use non-greedy matching:
/<a\shref.\{-}>
I think I got it:
/<a\shref[^>]+>
where [] is a character set and a leading ^ inside it negates the set.
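The patterns in this thread are written as Vim searches (the leading / and the \{-} non-greedy quantifier are Vim syntax). For comparison, a rough Python equivalent of the negated-set version could look like the sketch below; the sample string is made up, and the closing </a> is added to the pattern as an alternative so both halves of the link are removed.

```python
import re

# Hypothetical sample text with two links.
html = ('See <a href="https://example.com/page">this page</a> and '
        '<a href="/other" class="x">that one</a>.')

# Opening tag: "<a", whitespace, "href", then anything up to the next ">".
# Closing tag: the literal "</a>".
anchor_tags = re.compile(r"<a\s+href[^>]*>|</a>")

text = anchor_tags.sub("", html)
print(text)
```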
I have a string like this:
This <span class="highlight">is</span> a very "nice" day!
What should my regex pattern in VB look like to find the quotes within the tag? I want to replace them with something else, for example:
This <span class=^highlight^>is</span> a very "nice" day!
Something like <(")[^>]+> doesn't work :(
Thanks
It depends on your regex flavor, but this works for most of them:
"(?=[^<]*>)
EDIT: For anyone curious how this works. This translates into English as "Find a quote that is followed by a > before the next <".
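The lookahead can be tried out in Python, which supports the same syntax as far as this pattern is concerned. The ^ replacement character is taken from the question; the variable names are just for the example.

```python
import re

source = 'This <span class="highlight">is</span> a very "nice" day!'

# A quote qualifies only if a ">" is reached before any "<" to its right.
result = re.sub(r'"(?=[^<]*>)', "^", source)
print(result)
```

The quotes around "nice" are left alone because no > follows them, while the two quotes inside the span tag are replaced.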
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
If you are using VB.net you should be able to use HTMLAgilityPack.
Try this: <span class="([^"]+?)?">
This should get you the first attribute value in a tag:
<[^">]+"(?<value>[^"]*)"[^>]*>
If your intention is to replace ALL quotation marks within tags, you could use the following regular expression:
(<[^>"]*)(")([^>]*>)
That will isolate the substrings before and after your quotation mark. Note that this does not attempt to match opening and closing quotation marks. It simply matches a quotation mark within a tag.
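Because each match consumes the whole tag, a single pass with this pattern seems to replace only the first quotation mark in each tag. One way around that, sketched here in Python with a hypothetical helper `replace_quotes_in_tags`, is to repeat the substitution until nothing is left to replace.

```python
import re

QUOTE_IN_TAG = re.compile(r'(<[^>"]*)(")([^>]*>)')

def replace_quotes_in_tags(text, replacement="^"):
    """Replace every quotation mark inside a tag, one per tag per pass."""
    if '"' in replacement:
        # A replacement containing a quote would never terminate.
        raise ValueError("replacement must not contain a quotation mark")
    while True:
        # Keep the text before and after the quote, swap the quote itself.
        text, count = QUOTE_IN_TAG.subn(
            lambda m: m.group(1) + replacement + m.group(3), text
        )
        if count == 0:
            return text

source = 'This <span class="highlight">is</span> a very "nice" day!'
cleaned = replace_quotes_in_tags(source)
print(cleaned)
```

As with the other patterns here, a > inside an attribute value would confuse the match, so this is only reasonable for simple, known markup.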