What regex can I use to extract URLs from a Google search? - regex

I'm using Delphi with the JCLRegEx and want to capture all the result URL's from a google search. I looked at HackingSearch.com and they have an example RegEx that looks right, but I cannot get any results when I try it.
I'm using it similar to:
Var re:JVCLRegEx;
I:Integer;
Begin
re := TJclRegEx.Create;
With re do try
Compile('class="?r"?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?><a href="(.+?)"><\/div><[li|\/ol]',false,false);
If match(memo1.lines.text) then begin
For I := 0 to captureCount -1 do
memo2.lines.add(captures[1]);
end;
finally free;
end;
freeandnil(re);
end;
Regex is available at hackingsearch.com
I'm using the Delphi Jedi version, since everytime I install TPerlRegEx I get a conflict with the two...

Offtopic: You can try Google AJAX Search API: http://code.google.com/apis/ajaxsearch/documentation/

Below is a relevant section from Google search results for the term python tuple. (I modified it to fit the screen here by adding new lines here and there, but I tested your regex on the raw string obtained from Google's source as revealed by Firebug). Your regex gave no matches for this string.
<li class="g w0">
<h3 class="r">
<a onmousedown="return rwt(this,'','','res','2','AFQjCNG5WXSP8xy6BkJFyA2Emg8JrFW2_g','&sig2=4MpG_Ib3MrwYmIG6DbZjSg','0CBUQFjAB')"
class="l" href="http://www.korokithakis.net/tutorials/python">Learn <em>Python</em> in 10 minutes | Stavros's Stuff</a>
</h3>
<span style="display: inline-block;">
<button class="w10">
</button>
<button class="w20">
</button>
</span>
<span class="m"> <span dir="ltr">- 2 visits</span> <span dir="ltr">- Jan 21</span></span>
<div class="s">
The data structures available in <em>python</em> are lists, <em>tuples</em>
and dictionaries. Sets are available in the sets library (but are built-in in <em>
Python</em> 2.5 and <b>...</b><br>
<cite>
www.korokithakis.net/tutorials/<b>
python</b>
-
</cite>
<span class="gl">
<a onmousedown="return rwt(this,'','','clnk','2','AFQjCNFVaSJCprC5enuMZ9Nt7OZ8VzDkMg','&sig2=4qxw5AldSTW70S01iulYeA')"
href="http://74.125.153.132/search?q=cache:oeYpHokMeBAJ:www.korokithakis.net/tutorials/python+python+tuple&cd=2&hl=en&ct=clnk&client=firefox-a">
Cached
</a>
- <button title="Comment" class="wci">
</button>
<button class="w4" title="Promote">
</button>
<button class="w5" title="Remove">
</button>
</span>
</div>
<div class="wce">
</div>
<!--n-->
<!--m-->
</li>
FWIW, I guess one of the many reasons is that there is no <Va> in this result at all. I copied the full html source from Firebug and tried to match it with your regex - didn't get any match at all.
Google might change the way they display the results from time to time - at a given time, it can vary depending on factors like your logged in status, web history etc. The particular regex you came up with might be working for you for now, but in the long run it will become difficult to maintain. People suggest using html parser instead of giving a regex because they know that the solution won't be stable.

If you need to debug regular expressions in any language you need to look at RegExBuddy, its not free but it will pay for itself in a day.

class=r?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?>
works for now.

Related

Which regex tag to use in a Mechanize function?

I retrieved all the links from the web page containing /title/tt inside the url in a list.
my #url_links= $mech->find_all_links( url_regex => qr/title\/tt/i );
but the list is too long so I want to filter by adding in the function find_all_Links that the link must be also in the tags starting with <id="actor-tt..."> here is where the link (/title/tt...) is, in the code source retrieved by cmd.exe:
<div class="filmo-row odd" id="actor-tt0361748">
<span class="year_column">
2009
</span>
<b><a href="/title/tt0361748/"
>Inglourious Basterds</a></b>
<br/>
Lt. Aldo Raine
</div>
I imagine you have to use a tag_regex but I don't know how because the command prompt doesn't seem to take tag_regex into account when I put it in.
Using HTML::TreeBuilder and HTML::Element instead of Mechanize:
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;
my $html_string = join "", <DATA>;
my $tree = HTML::TreeBuilder->new_from_content($html_string);
my #url_links = map { $_->attr_get_i("href") }
map { $_->look_down(href => qr{/title/tt}) }
$tree->look_down(id => qr/^actor-tt/);
say for #url_links;
__DATA__
<div class="filmo-row odd" id="actor-tt0361748">
<span class="year_column">
2009
</span>
<b>Inglourious Basterds</b>
<br/>
Lt. Aldo Raine
</div>
<div id="not-the-right-id">
</div>
<div class="filmo-row odd" id="actor-tt0123456">
<b>Another movie</b>
</div>
<div class="filmo-row odd" id="actor-tt0123456">
the id will match, but no href in here
</div>
$tree->look_down(id => qr/^actor-tt/); finds all elements whose id matches actor-tt. Then $_->look_down(href => qr{/title/tt}) will find all elements within them with a field href matching /title/tt. Finally, $_->attr_get_i("href") returns the value of their href fields.
You might be interested in the method new_from_url or new_from_file from HTML::TreeBuilder rather than the new_from_content I used.
WWW::Mechanize is not sophisticated enough to do what you're trying to do. It can only search links on one criterium at a time, and it converts them to WWW::Mechanize::Link objects, which do not maintain their ancestry (as in position in the DOM tree).
Mechanize is meant to be a browser, not a scraper. It's important to pick the right tools for the job you have to do.
As Dada suggested in their answer, you can use your own parser to search for this. You can still extract the HTML out of WWW::Mechanize and then use the code they suggest. Use $mech->content or $mech->content_raw to get the HTML out.
There are several alternatives to this. While I personally like Web::Scraper for this kind of task, its interface is a bit weird and has a learning curve.
Instead, I would suggest using Mojo::UserAgent and Mojo::DOM. In fact, the handy ojo package for one-liners should be able to do this.
perl -Mojo -E 'g("https://www.imdb.com/name/nm0000093/")->dom->find("div[id^=actor-tt] a")->map(sub {say $_->attr("href")})'
Broken down, this does the following:
use Mojo::UserAgent to get that page
look at the DOM tree
find all <a>s inside <div>s that have an id that starts with actor-tt (see https://metacpan.org/pod/Mojo::DOM::CSS#SELECTORS for details)
for each of them, print out the href attribute
You can customise this as much as you want.
Please note that according to their Terms of Services, scraping IMDB is not allowed.

Can't parse Google Finance html

I'm trying to scrape some stock prices, and variations, from Google Finance using python3 but I just can't figure out if there's something wrong with the page, or my regex. I'm thinking that either the svg graphic or the many script tags throughout the page are making the regex parsers fail to properly analyze the code.
I have tested this regex on many online regex builders/testers and it looks ok. As ok as a regex designed for HTML can be, anyway.
The Google Finance page I'm testing this out on is https://www.google.com/finance?q=NYSE%3AAAPL
And my python code is the following
import urllib.request
import re
page = urllib.request.urlopen('https://www.google.com/finance?q=NYSE%3AAAPL')
text = page.read().decode('utf-8')
m = re.search("id=\"price-panel.*>(\d*\d*\d\.\d\d)</span>.*\((-*\d\.\d\d%)\)", text, re.S)
print(m.groups())
It would extract the stock price and its percent variation.
I have also tried using python2 + BeautifulSoup, like so
soup.find(id='price-panel')
but it returns empty even for this simple query. This is especially why I'm thinking that there's something weird with the html.
And here's the most important bit of html that I'm aiming for
<div id="price-panel" class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span class="unchanged" id="ref_22144_l"><span class="unchanged">96.41</span><span></span></span>
</span>
<div class="id-price-change nwp goog-inline-block">
<span class="ch bld"><span class="down" id="ref_22144_c">-1.13</span>
<span class="down" id="ref_22144_cp">(-1.16%)</span>
</span>
</div>
</div>
<div>
<span class="nwp">
Real-time:
<span class="unchanged" id="ref_22144_ltt">3:42PM EDT</span>
</span>
<div class="mdata-dis">
<span class="dis-large"><nobr>NASDAQ
real-time data -
Disclaimer
</nobr></span>
<div>Currency in USD</div>
</div>
</div>
</div>
I'm wondering if any of you have encountered a similar problem with this page and/or can figure out if there's anything wrong with my code. Thanks in advance!
You might try a different URL that will be easier to parse, such as: http://www.google.com/finance/info?q=AAPL
The catch is that Google has said that using this API in an application for public consumption is against their Terms of Service. Maybe there is an alternative that Google will allow you to use?
I managed to get it working using BeautifulSoup, on the link posted originally.
Here's the bit of code I finaly used:
response = urllib2.urlopen('https://www.google.com/finance?q=NYSE%3AAAPL')
html = response.read()
soup = BeautifulSoup(html, "lxml")
aaplPrice = soup.find(id='price-panel').div.span.span.text
aaplVar = soup.find(id='price-panel').div.div.span.find_all('span')[1].string.split('(')[1].split(')')[0]
aapl = aaplPrice + ' ' + aaplVar
I couldn't get it working with BeautifulSoup before because I was actually trying to parse the table in this page https://www.google.com/finance?q=NYSE%3AAAPL%3BNYSE%3AGOOG, not the one I posted.
Neither method described on my question has worked on this page.

Trouble accessing attribute after using BeautifulSoup's findAll

I'm trying to scrape sites like this one on the BBC website to grab the relevant parts of the programme listing, and I've just started using BeautifulSoup to do this.
The parts of interest start with sections like:
<li about="/programmes/p013zzsl#segment" class="segment track" id="segmentevent-p013zzsm" typeof="po:MusicSegment">
<li about="/programmes/p014003v#segment" class="segment speech alt" id="segmentevent_p014003w" typeof="po:SpeechSegment">
What I've done so far is opened the HTML as soup and then used soup.findAll(typeof=['po:MusicSegment', 'po:SpeechSegment']) to give a ResultSet of the parts I'm interested in the order in which they appear.
What I then want to do is check whether a section refers to po:MusicSegment or po:SpeechSegment in HTML that looks like:
<li about="/programmes/p01400m9#segment" class="segment track" id="segmentevent-p01400mb" typeof="po:MusicSegment"> <span class="artist-image"> <span class="depiction" rel="foaf:depiction"><img alt="" height="63" src="http://static.bbci.co.uk/programmes/2.54.3/img/thumbnail/artists_default.jpg" width="112"/></span> </span> <script type="text/javascript"> window.programme_data.tracklist.push({ segment_event_pid : "p01400mb", segment_pid : "p01400m9", playlist : "http://www.bbc.co.uk/programmes/p01400m9.emp" }); </script> <h3> <span rel="mo:performer"> <span class="artist no-image" property="foaf:name" typeof="mo:MusicArtist">Mala</span> </span> <span class="title" property="dc:title">Calle F</span> </h3></li>
I want to access the typeof attribute associated with <li>, but if this chunk of HTML (as a BS4 tag) is called section and I enter section.li, it returns None.
Note that if I do section.img instead, I get something back:
<img alt="" height="63" src="http://static.bbci.co.uk/programmes/2.54.3/img/thumbnail/artists_default.jpg" width="112"/>
and I could then do, e.g. section.img['height'] to get back u'63'
What I want is something analogous for the section.li part, so section.li['typeof'] to give me po:MusicSegment or po:SpeechSegment
Of course, I could simply convert each result to text and then do a simple string search, but searching by attribute seems more elegant.
I'd iterate over the list returned by findAll:
soup = BeautifulSoup('<li about="/programmes/p013zzsl#segment" class="segment track" id="segmentevent-p013zzsm" typeof="po:MusicSegment"><li about="/programmes/p014003v#segment" class="segment speech alt" id="segmentevent_p014003w" typeof="po:SpeechSegment">')
for elem in soup.findAll(typeof=['po:MusicSegment', 'po:SpeechSegment']):
print elem['typeof']
returns
po:MusicSegment
po:SpeechSegment
and then conditionally perform your other tasks:
if elem['typeof'] == 'po:MusicSegment'
do.something()
elif elem['typeof'] == 'po:SpeechSegment':
do.something_else()

Matching text that is not html tags with regular expression

So I am trying to create a regular expression that matches text inside different kinds of html tags. It should match the bold text in both of these cases:
<div class="username_container">
<div class="popupmenu memberaction">
<a rel="nofollow" class="username offline " href="http://URL/surfergal.html" title="Surfergal is offline"><strong><!-- google_ad_section_start(weight=ignore) -->**Surfergal**<!-- google_ad_section_end --></strong></a>
</div>
<div class="username_container">
<span class="username guest"><b><a>**Advertisement**</a></b></span>
</div>
I have tried with the following regular expression without any result:
/<div class="username_container">.*?((?<=^|>)[^><]+?(?=<|$)).*?<\/div>/is
This is my first time posting here on stackoverflow so if I am doing something incredibly stupid I can only apologize.
Using regex to parse html is.. hard. See the links in the comments to your question.
What do you plan to do with these matches? Here's a quick jquery script that logs the results in the console:
var a = [];
$('strong, b').each(function(){
a.push($(this).html());
});
console.log(a);
results:
["<!-- google_ad_section_start(weight=ignore) -->**Surfergal**<!-- google_ad_section_end -->", "<a>**Advertisement**</a>"] ​
http://jsfiddle.net/Mk7xf/

How to write this Regex

HTML:
<dt>
<a href="#profile-experience" >Past</a>
</dt>
<dd>
<ul class="past">
<li>
President, CEO & Founder <span class="at">at</span> China Connection
</li>
<li>
Professional Speaker and Trainer <span class="at">at</span> Edgemont Enterprises
</li>
<li>
Nurse & Clinic Manager <span class="at">at</span> <span>USAF</span>
</li>
</ul>
</dd>​​​​​
I want match the <li> node.
I write the Regex:
<dt>.+?Past+?</dt>\s+?<dd>\s+?<ul class=""past"">\s+?(?:<li>\s*?([\W\w]+?)+?\s*?</li>)+\s+?</ul>
In fact they do not work.
No not parse HTML using a regex like it's just a big pile of text. Using a DOM parser is a proper way.
Don't use regular expressions to parse HTML...
Don't use a regular expression to match an html document. It is better to parse it as a DOM tree using a simple state machine instead.
I'm assuming you're trying to get html list items. Since you're not specifying what language you use here's a little pseudo code to get you going:
Pseudo code:
while (iterating through the text)
if (<li> matched)
find position to </li>
put the substring between <li> to </li> to a variable
There are of course numerous third-party libraries that do this sort of thing. Depending on your development environment, you might have a function that does this already (e.g. javascript).
Which language do you use?
If you use Python, you should try lxml: http://lxml.de. With lxml, you can search for the node with tag ul and class "past". You then retrieve its children, which are li, and get text of those nodes.
If you are trying to extract from or manipulate this HTML, xPath, xsl, or CSS selectors in jQuery might be easier and more maintainable than a regex. What exactly is your goal and in what framework are you operating?
please learn to use jQuery for this sort of thing