Regular expression very slow. Trying to extract multiple strings

I am fairly new to regular expressions and am practicing a little with Notepad++. I am trying to extract some stock-related data from Yahoo but lack the experience. Maybe somebody could give me a hand; it would be highly appreciated.
An example of what I am trying to parse is:
<strong>230.00</strong></a></td><td class="yfnc_tabledata1">AMZN121026C00230000</td><td class="yfnc_tabledata1" align="right"><b>9.35</b></td><td class="yfnc_tabledata1" align="right"><span id="yfs_c10_amzn121026c00230000"><img style="margin-right:-2px;" src="op_files/up_g.gif" alt="Up" border="0" height="14" width="10"> <span class="yfi-price-change-green">0.35</span></span></td><td class="yfnc_tabledata1" align="right">9.25</td><td class="yfnc_tabledata1" align="right">9.40</td><td class="yfnc_tabledata1" align="right">3,857</td><td class="yfnc_tabledata1" align="right">1,041</td></tr><tr><td class="yfnc_tabledata1" nowrap="nowrap">
I basically want to extract the numbers 230.00, 9.35, 0.35, 9.25, 9.40, 3,857 and 1,041.
What I have managed so far is:
<strong>(\d.*?)</strong>.*?<b>(.*?)<
But it is really slow. Is that correct so far?

A possibly faster variant could be (?<=>)(\d{1,3}(?:,\d{3})*+(?:\.\d+)?)(?=<)
It matches only the numbers between > and < and ignores the rest.
But keep in mind, as SomeKittens said: "Generally, parsing HTML with regex is a bad idea...."
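To try the lookaround idea from this answer outside Notepad++, here is a minimal Python sketch. Python is only a stand-in for the asker's tool. The possessive quantifier *+ is swapped for a plain * because Python's re module accepts possessive quantifiers only from version 3.11 on. The shortened sample row and the names NUMBER and extract_numbers are made up for illustration.

```python
import re

# Same idea as the pattern above: a number sitting directly between '>' and '<'.
# The possessive '*+' is replaced by '*' so the pattern also compiles on Python < 3.11.
NUMBER = re.compile(r"(?<=>)(\d{1,3}(?:,\d{3})*(?:\.\d+)?)(?=<)")

def extract_numbers(html):
    """Return every number found directly between '>' and '<'."""
    return NUMBER.findall(html)

# Shortened, hypothetical version of the row from the question.
sample = ('<strong>230.00</strong></a></td>'
          '<td class="yfnc_tabledata1">AMZN121026C00230000</td>'
          '<td class="yfnc_tabledata1" align="right"><b>9.35</b></td>'
          '<td class="yfnc_tabledata1" align="right">3,857</td>')

print(extract_numbers(sample))
```

The option symbol AMZN121026C00230000 is skipped because the character right after > is a letter, so the lookbehind never lines up with a digit.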

Demo: this example matches the tag together with its number, so you can do whatever you want with them. You can even filter by tag by changing [a-z]+ to (span|b|td|whatever).

Related

RegEx to find a string included between two characters while EXCLUDING the delimiters

I'm kinda lost with Regex and would appreciate some help.
Target: To extract the URL between the two " ", without returning the " themselves.
Base string:
<span class="fa fa-eye fa-fw poptip" data-toggle="tooltip" title="" data-original-title="Inspect in-game"></span>
I came up with the following solution:
(="(.*)" class="btn btn-xs btn-default ")
Too bad it is matching
="somerandomurl" class="btn btn-xs btn-default "
Is it possible to match only the inner result, without the delimiters?
somerandomurl
Since this should be included in a script that should run as fast as possible, maybe there is a faster and better approach? In reality this regex search will be applied on a complete website.

Using RegEx to match markup is usually not a good idea. If you have the option, you might prefer an HTML / DOM parser.
That said, your RegEx should match the sample in most languages. But it defines two sets of parentheses, so the result you want is located in group 2. Groups 0 and 1 will both hold the full match.
If you have trouble reading the correct result group, please provide some additional information, such as which language you're working in and preferably a snippet.
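As a small illustration of the group numbering described above, here is a sketch in Python (the asker never named a language, so Python and the sample line are assumptions):

```python
import re

# The pattern from the question, unchanged: the outer parentheses are group 1,
# the inner (.*) is group 2.
PATTERN = re.compile(r'(="(.*)" class="btn btn-xs btn-default ")')

# Hypothetical input line; the real page markup was not shown in full.
line = '<a href="somerandomurl" class="btn btn-xs btn-default ">'

match = PATTERN.search(line)
whole = match.group(0)   # the full match
outer = match.group(1)   # same text, because the outer parentheses wrap everything
inner = match.group(2)   # only the text between the quotes

print(inner)
```

Dropping the outer parentheses would make the URL group 1 instead, which is usually easier to read.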

Vb.net help me with regex please

I haven't worked with regex before, but I need to parse values from about 500 URLs and I need a regex to automate it.
Each site contains about 10 values, and I need to separate them into their own lists.
1.
<td width="78" style="padding-left:9px;" align="left"><a style="font-weight:bold;color:#E93393;" href="/meanings/Example1.html">Example1</a> </td>
2.
<td width="78" style="padding-left:9px;" align="left"><a style="font-weight:bold;color:#004EFF;" href="/meanings/Example2.html">Example2</a> </td>
So, I need to put those 2 values into separate lists. It should look at the color code to determine which list each value goes into.
Could somebody help me? :)

NO..NO..NO..
Regex doesn't work for parsing HTML files.
HTML is neither strict nor regular in its format.
Use Html Agility Pack.
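Html Agility Pack is a .NET library. To show the same parser-based idea in a self-contained way, here is a rough sketch that swaps in Python's standard html.parser module instead. The class name LinksByColor and the assumption that the color always appears as a six-digit hex value in the link's style attribute are illustrative only.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

class LinksByColor(HTMLParser):
    """Collects the text of each <a> element, keyed by the color in its style."""

    def __init__(self):
        super().__init__()
        self.by_color = defaultdict(list)
        self._color = None  # color of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            style = dict(attrs).get("style") or ""
            # The lookbehind keeps "background-color:" from being picked up.
            m = re.search(r"(?<![-\w])color:\s*(#[0-9A-Fa-f]{6})", style)
            self._color = m.group(1).upper() if m else None

    def handle_endtag(self, tag):
        if tag == "a":
            self._color = None

    def handle_data(self, data):
        # Only text inside a colored link is recorded.
        if self._color and data.strip():
            self.by_color[self._color].append(data.strip())

page = (
    '<td width="78" style="padding-left:9px;" align="left">'
    '<a style="font-weight:bold;color:#E93393;" href="/meanings/Example1.html">Example1</a> </td>'
    '<td width="78" style="padding-left:9px;" align="left">'
    '<a style="font-weight:bold;color:#004EFF;" href="/meanings/Example2.html">Example2</a> </td>'
)

parser = LinksByColor()
parser.feed(page)
parser.close()
print(dict(parser.by_color))
```

Each color code ends up with its own list, so values from all 500 pages could be fed through the same parser class.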

replacing image path with regular expression

I have massive HTML code, with loooads of images. The problem is that every single image has a different path, for example:
<img src="../media/2010/01/something.jpg" />
<img src="../media/logo.png" />
What I want to do with regular expressions is find every image path and replace it with:
<img src="../img/FILENAME.EXTENSION" />
I know that it's definitely possible with regular expressions ... but it's just not my cup of tea. Could anyone help me please?
Cheers, Mart

This might not be the best solution but it might work:
(<img.*?src=")([^"]*?(\/[^/]*\.[^"]+))
and then you use capture groups 1 and 3 to create the new string (depending on flavor):
$1../img$3
You can see it in action here: http://regexr.com?2v8ir
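For a flavor-specific illustration, this is how the pattern and replacement above might look in Python, where group references in the replacement are written \1 and \3 rather than $1 and $3. The function name flatten_image_paths is made up for this sketch.

```python
import re

# Pattern from the answer above: group 1 is everything up to the opening quote,
# group 3 is the final "/name.ext" part of the path.
IMG_SRC = re.compile(r'(<img.*?src=")([^"]*?(\/[^/]*\.[^"]+))')

def flatten_image_paths(html):
    """Rewrite every img src so that the file lives directly under ../img."""
    return IMG_SRC.sub(r'\1../img\3', html)

html = ('<img src="../media/2010/01/something.jpg" />\n'
        '<img src="../media/logo.png" />')

result = flatten_image_paths(html)
print(result)
```

Group 3 already starts with a slash, which is why the replacement uses ../img and not ../img/.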

If you want to parse HTML, it's much better to use an HTML parser instead of regex. There are quite a lot of them and they do a very good job.
Html Agility Pack is a good one.

Try using the regex <img src="[\w/\.]+"(\s|)/> and replacing each match with <img src="../img/FILENAME.EXTENSION" />

Simple Regex from HTML

I have the following code grabbed from a webpage source code:
<span>41,396</span>
And the following regex:
("<span>.*</span>")
Which returns
<span>New Users</span>
However, I don't want to have the tags in the results. I've tried a few things, but Regular Expressions are new to me.
More so than this I need to get the Regex for the following code:
<span>41,396</span>
</span>
<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>
I was thinking this may involve line breaks and more, which is why I threw this in separately.

You could try something like
<span( class=\".+\")?>(.*)</span>
And then get capture group 2 for the tag's body. But be aware that regular expressions are NOT good for parsing HTML/XML. What would happen if you had nested <span> tags?
If the input gets even the slightest bit more complicated than what you've shown, look for an HTML parser and try using that instead.

You can place the capturing group differently to get the value instead of tag + value:
"<span>(.*)</span>"
Consider using an HTML parsing library in your language of choice if the regex becomes more complicated.
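A quick way to check the capturing-group pattern against the multi-line sample from the question is a sketch like the following. Python is an assumption here, since the asker did not say which language is in use.

```python
import re

# Only the part inside the parentheses is returned by findall.
SPAN_VALUE = re.compile(r"<span>(.*)</span>")

source = """<span>41,396</span>
</span>
<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>"""

values = SPAN_VALUE.findall(source)
print(values)
```

Because . does not match a newline by default in Python, each match stays within one line. The lines with a class attribute are skipped since they do not contain the literal text <span>.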

As far as I know, regex will look things up line by line, but you could write an expression that works around that.
Try: <span>(.*)</span>
You should be able to retrieve the information you want with \1
In the case of <span class="xpColumn"> it would simply not match, and \1 would be empty.
Cheers :)

A regular expression question

I have content something like
<div class="c2">
<div class="c3">
<p>...</p>
</div>
</div>
What I want is to match the div.c2's inner HTML. Its contents may vary a lot. The only problem I am facing is how to make it work so that the right closing div is taken.

You can't. This problem is unsolvable with classic regular expressions, and with most of the existing regex implementations.
However, some regex engines, such as .NET's, have special support for balanced pair matching. Even in this case, your regex will be able to parse only a subset of syntactically correct texts (e.g., what if a </div> is embedded in a comment?). You need an HTML parser to get reliable results.

Any chance this will always be valid XHTML? If so, you'd be better off parsing it as XML than trying to regex this.
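If the input really is well-formed XHTML, the XML route could look roughly like this sketch, which uses Python's standard xml.etree.ElementTree. The function name inner_xml is hypothetical, and the sketch assumes the class attribute holds exactly one class name.

```python
import xml.etree.ElementTree as ET

def inner_xml(xml_text, css_class):
    """Return the serialized contents of the first <div> whose class equals css_class."""
    root = ET.fromstring(xml_text)
    for el in root.iter("div"):          # iter() also visits the root element itself
        if el.get("class") == css_class:
            parts = [el.text or ""]      # text before the first child
            # tostring() includes each child's trailing text (its "tail").
            parts.extend(ET.tostring(child, encoding="unicode") for child in el)
            return "".join(parts)
    return None

doc = """<div class="c2">
<div class="c3">
<p>...</p>
</div>
</div>"""

inner = inner_xml(doc, "c2")
print(inner.strip())
```

The parser keeps track of nesting on its own, so the question of which closing div belongs to div.c2 never comes up.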

Delete the first line, delete the last line. Problem solved. No need for RegEx.
The following pattern works well with .Net RegEx implementation:
\<div class="c2"\>{[\n a-z.<>="0-9/]+}\</div\>
And we replace that with \1.
Input:
<div class="c2">
<div class="c3">
<p>...</p>
</div></div></div></div></div></div></div></div>
</div>
Output:
<div class="c3">
<p>...</p>
</div></div></div></div></div></div></div></div>
