Regex - How to pick out these integers, individually

I have the following HTML that I am trying to pick apart. For some reason I can't figure out the Regex (which, admittedly, I suck at):
<td class="score">
286
<span class="pos">(2455 of 3921)</span>
</td>
I'm looking to get the 3 integers out, individually. So, basically:
Score = 286
Place = 2455
Entries = 3921
I went through the 'numeric ranges' page on regular-expressions.info, but still can't figure it out!!! Yes, I know it is easy... apparently my brain can't comprehend this type of logic.
I will be using it in vb.net, BTW. In case that matters.

Here's a simple example of code that does it for you at ideone.com.
The gist looks something like:
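' Requires: Imports System.Text.RegularExpressions
' Three digit runs, each separated by any number of non-digit characters.
Dim regex As Regex = New Regex("(\d+)\D*(\d+)\D*(\d+)")
Dim match As Match = regex.Match("<td class='score'> 286 <span class='pos'>(2455 of 3921)</span> </td>")
If match.Success Then
    Console.WriteLine(match.Groups(1).Value) ' Score: 286
    Console.WriteLine(match.Groups(2).Value) ' Place: 2455
    Console.WriteLine(match.Groups(3).Value) ' Entries: 3921
End If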

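This regex matches every run of digits in a string. It is written here as a JavaScript literal with the global flag; in VB.NET, pass the pattern \d+ to Regex.Matches instead:
/\d+/g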
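For comparison, here is a minimal Python sketch of the two approaches above, the three-group pattern and the "find every digit run" pattern. Python is used only for illustration (the question targets VB.NET), and the function name extract_scores is a hypothetical helper, not something from the thread.

```python
import re

# The HTML fragment from the question.
HTML = """<td class="score">
286
<span class="pos">(2455 of 3921)</span>
</td>"""


def extract_scores(html: str) -> dict:
    """Return score, place and entries as integers (hypothetical helper)."""
    # Three digit runs separated by any non-digit characters.
    match = re.search(r"(\d+)\D*(\d+)\D*(\d+)", html)
    if match is None:
        raise ValueError("expected three integers in the input")
    score, place, entries = (int(group) for group in match.groups())
    return {"score": score, "place": place, "entries": entries}


if __name__ == "__main__":
    print(extract_scores(HTML))
    # The second answer's idea: collect every run of digits.
    print(re.findall(r"\d+", HTML))
```

Both variants rely on the fragment containing no other digits (for example in attribute values), so they are only as robust as that assumption.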

Related

xpath+ regex: matches text

I'm trying to write an xpath such that only nodes with text with numbers alone will be returned.
I wanted to use regex and was hoping this would work
td[matches(text(),'[\d.]')]
Can anyone please help me understand what I am doing wrong here?
<tr>
<td>1</td>
<td>10</td>
<td>a</td>
</tr>
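It seems that you are missing quantification: [\d.] matches only a single character. Note also that matches() succeeds when the pattern is found anywhere in the text, so 1 and 10 both pass, and so would a cell such as a1. To require that the whole text consists of digits, quantify and anchor the pattern:
td[matches(text(),'^\d+$')]
Also, the . inside the character class matches a literal dot, which is not a digit, so do not add it.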
You can test all your regex queries on regex101.
AFAIK, Selenium supports XPath 1.0 only, so matches() is not supported.
You can try below instead:
//td[number(.) >= 0 or number(.) < 0]
To match table cells with integers
Replace:
td[matches(text(),'[\d+]')]
with:
td[matches(text(),'\d+')]
Note: regex functions such as matches() are available only in XPath 2.0 and later.
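The difference between a partial match and a whole-text match can be sketched outside XPath. The Python example below is an illustration only: it parses the row with the standard library and filters the cells with re, on the assumption that re.search mirrors the unanchored behaviour of XPath 2.0 matches(). The extra a1 cell and both function names are hypothetical additions for the demonstration.

```python
import re
import xml.etree.ElementTree as ET

# The row from the question plus one extra cell ("a1") added for illustration.
ROW = "<tr><td>1</td><td>10</td><td>a</td><td>a1</td></tr>"


def cell_texts(row_xml: str) -> list:
    """Return the text of every td element in the row."""
    return [td.text or "" for td in ET.fromstring(row_xml).iter("td")]


def loosely_matching_cells(row_xml: str) -> list:
    """Cells containing a digit or dot anywhere (unanchored search)."""
    return [text for text in cell_texts(row_xml) if re.search(r"[\d.]", text)]


def digit_only_cells(row_xml: str) -> list:
    """Cells whose entire text is one or more digits (anchored match)."""
    return [text for text in cell_texts(row_xml) if re.fullmatch(r"\d+", text)]


if __name__ == "__main__":
    print(loosely_matching_cells(ROW))
    print(digit_only_cells(ROW))
```

The unanchored version lets a1 through, while the anchored version keeps only 1 and 10.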

How to get this field in regex using a pattern?

Regex is not being very friendly with me, giving me 0 matches haha.
Basically, I have a big string, that includes this:
<td class="fieldLabel02Std">FIELD_LABEL</td>
<td class="fieldLabel02Std">
VALUE
</td>
Thanks to the FIELD_LABEL I should be able to find it inside the bigger string. The "VALUE" is what I want to get.
I tried this pattern
String field = "FIELD_NAME";
String pattern = field + @"[\s\S]*?\<td[\s\S]*?\<\/td\>";
That didn't work. I was thinking about this:
Get the field_name + some characters + => which would be able to give me VALUE.
This gives me 0 matches.
Help is very appreciated!
You can use something like this:
FIELD_LABEL</td>[\n\r\s]*<td class="fieldLabel02Std">[\n\r\s]*(.+?)[\n\r\s]*</td>
Generally it's a bad idea to use a regex to parse HTML, but it can be acceptable if you have a small problem with a known HTML format and you don't mind it breaking as soon as the markup changes by so much as a comma...
Consider the following Regex...
(?<=FIELD_LABEL[\S\s]*?\<td.*?\>[\S\s]*?)\w+(?=[\S\s]*?\</td\>)
Good Luck!
Is this what you are looking for?
FIELD_LABEL<\/td>[.\s]*?<td.*?>[.\s]*?VALUE[.\s]*?<\/td>
or
String pattern = field + @"<\/td>[.\s]*?<td.*?>[.\s]*?VALUE[.\s]*?<\/td>";
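The label-then-value idea from these answers can be sketched in Python as follows. This is an illustration under assumptions: the function name field_value is hypothetical, the label cell is assumed to be followed directly by the value cell, and Python stands in for the asker's language.

```python
import re
from typing import Optional

# The fragment from the question.
PAGE = """<td class="fieldLabel02Std">FIELD_LABEL</td>
<td class="fieldLabel02Std">
VALUE
</td>"""


def field_value(page: str, label: str) -> Optional[str]:
    """Return the text of the cell that follows the label cell, or None."""
    # Escape the label so regex metacharacters in it are treated literally.
    pattern = re.escape(label) + r"</td>\s*<td[^>]*>\s*(.+?)\s*</td>"
    # DOTALL lets the captured value span several lines if needed.
    match = re.search(pattern, page, re.DOTALL)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(field_value(PAGE, "FIELD_LABEL"))
```

Capturing the value in a group, rather than writing VALUE literally into the pattern, is what makes the expression reusable for other fields.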

regex for address in span tags

I need to extract an address which will change on every new page from a sample like this. So I need a regex to extract 100 E Faith Ter from the following html code snippet.
<span style="..." class="addr">100 E Faith Ter<br>
<span class="locality">Maitland</span>,
<span class="region">FL</span>
<span class="postal-code">32751</span>
</span>
I am using Javascript.
You don't specify a language, and regular expressions are pretty language agnostic, but they differ in specifying how they deal with multiple lines. In javascript: /^.*$/m selects the first line.
Having updated your question to be full HTML instead of raw text, you can use:
^\<.+?\>(.+?)\<br\>$
and retrieve the first parenthesized submatch (be sure you use the multiline option)
The Pony He Comes!!
A regex is not necessary for the whole thing. Instead, just strip all HTML tags: if you're using PHP, strip_tags does this nicely; otherwise you can use a regex to replace <[^>]+> with an empty string. You should get the plain text of the address, which you can then split into its separate lines.
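The strip-and-split approach can be sketched like this. Python is used only for illustration (the asker is working in JavaScript), and address_lines is a hypothetical helper name.

```python
import re

# The snippet from the question.
SNIPPET = """<span style="..." class="addr">100 E Faith Ter<br>
<span class="locality">Maitland</span>,
<span class="region">FL</span>
<span class="postal-code">32751</span>
</span>"""


def address_lines(html: str) -> list:
    """Strip tags, then return the non-empty lines of the remaining text."""
    # Replace every tag with an empty string, as suggested above.
    text = re.sub(r"<[^>]+>", "", html)
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    lines = address_lines(SNIPPET)
    print(lines[0])  # the street address
    print(lines)
```

This assumes no attribute value contains a > character; if one does, the tag pattern cuts the tag short and an HTML parser is the safer choice.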

Regular expression very slow. Trying to extract multiple strings

I am fairly new to Regular Expressions and practicing a little with Notepad++. I am trying to extract some stock related data from Yahoo but somewhat lack the experience. Maybe somebody could give me a hand. It would be highly appreciated.
An example of what I try to parse is:
<strong>230.00</strong></a></td><td class="yfnc_tabledata1">AMZN121026C00230000</td><td class="yfnc_tabledata1" align="right"><b>9.35</b></td><td class="yfnc_tabledata1" align="right"><span id="yfs_c10_amzn121026c00230000"><img style="margin-right:-2px;" src="op_files/up_g.gif" alt="Up" border="0" height="14" width="10"> <span class="yfi-price-change-green">0.35</span></span></td><td class="yfnc_tabledata1" align="right">9.25</td><td class="yfnc_tabledata1" align="right">9.40</td><td class="yfnc_tabledata1" align="right">3,857</td><td class="yfnc_tabledata1" align="right">1,041</td></tr><tr><td class="yfnc_tabledata1" nowrap="nowrap">
I basically try to extract the numbers 230.00, 9.35, 0.35, 9.25, 9.40, 3,857, 1,041.
What I managed so far is:
<strong>(\d.*?)</strong>.*?<b>(.*?)<
But it is really slow. Is that correct so far?
a possible faster variant could be (?<=>)(\d{1,3}(?:,\d{3})*+(?:\.\d+)?)(?=<)
It matches only the numbers between > and < and ignores the rest...
but keep in mind, like SomeKittens said: "Generally, parsing HTML with regex is a bad idea...."
Demo
You can use this example; it will match the tag and its number so you can do whatever you want with them. You can even filter by tag by changing [a-z]+ to (span|b|td|whatever)
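The lookaround variant above can be tried in Python as a rough sketch. One assumption to flag: the possessive quantifier *+ is replaced by a plain *, because Python's re module only accepts possessive quantifiers from version 3.11 onward. The result is the same on this input. The name numbers_between_tags is hypothetical.

```python
import re

# The table row from the question, split across string literals for readability.
ROW = (
    '<strong>230.00</strong></a></td>'
    '<td class="yfnc_tabledata1">AMZN121026C00230000</td>'
    '<td class="yfnc_tabledata1" align="right"><b>9.35</b></td>'
    '<td class="yfnc_tabledata1" align="right">'
    '<span id="yfs_c10_amzn121026c00230000">'
    '<img style="margin-right:-2px;" src="op_files/up_g.gif" alt="Up" '
    'border="0" height="14" width="10"> '
    '<span class="yfi-price-change-green">0.35</span></span></td>'
    '<td class="yfnc_tabledata1" align="right">9.25</td>'
    '<td class="yfnc_tabledata1" align="right">9.40</td>'
    '<td class="yfnc_tabledata1" align="right">3,857</td>'
    '<td class="yfnc_tabledata1" align="right">1,041</td></tr>'
)

# A number directly preceded by ">" and directly followed by "<".
NUMBER = re.compile(r"(?<=>)(\d{1,3}(?:,\d{3})*(?:\.\d+)?)(?=<)")


def numbers_between_tags(html: str) -> list:
    """Return every number that fills the whole gap between two tags."""
    return NUMBER.findall(html)


if __name__ == "__main__":
    print(numbers_between_tags(ROW))
```

Because each match is anchored to a single > ... < gap, the engine never has to scan lazily across the rest of the line, which is a likely reason for the speed difference.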

Simple Regex from HTML

I have the following code grabbed from a webpage source code:
<span>41,396</span>
And the following regex:
("<span>.*</span>")
Which returns
<span>New Users</span>
However, I don't want to have the tags in the results. I've tried a few things, but Regular Expressions are new to me.
More so than this I need to get the Regex for the following code:
<span>41,396</span>
</span>
<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>
I was thinking this may involve line breaks and more, which is why I threw this in separately.
You could try something like
<span( class=\".+\")?>(.*)</span>
And then get capture group 2 for the tag's body. But be aware that regular expressions are NOT good for parsing HTML/XML. What would happen if you had nested <span> tags?
If the input gets even the slightest bit more complicated than what you've shown, look for an HTML parser and try using that instead.
You can use capturing group differently to get the value instead of tag + value
"<span>(.*)</span>"
Consider using an HTML parsing library in your language of choice if the regex becomes more complicated.
As far as I know, the regex will match line by line (by default . does not cross line breaks), but you could write an expression that accounts for that.
Try: <span>(.*)</span>
You should be able to retrieve the information you want from the first capture group, \1
In the case of <span class="xpColumn"> it would just not match and \1 would be empty.
Cheers :)
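A minimal Python sketch of the capture-group approach on the multi-line sample from the question is shown below. It is an illustration only: the question does not name a language, the lazy .*? is a small deviation from the .* used in the answers, and span_values and as_int are hypothetical helper names.

```python
import re

# The multi-line sample from the question.
HTML = """<span>41,396</span>
</span>
<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>"""


def span_values(html: str) -> list:
    """Return the text inside every attribute-free <span>...</span> pair."""
    # The group captures only the body, so the tags are left out of the
    # result; the lazy quantifier stops at the first closing tag.
    return re.findall(r"<span>(.*?)</span>", html)


def as_int(value: str) -> int:
    """Convert a value such as '41,396' to an integer."""
    return int(value.replace(",", ""))


if __name__ == "__main__":
    values = span_values(HTML)
    print(values)
    print([as_int(v) for v in values])
```

Spans with attributes, such as <span class="xpColumn">, are skipped because the pattern requires the literal opening tag <span>.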