Interpreting A Regular expression [duplicate] - regex

I am new to regular expressions and came across this piece of code in WordPress, but I can't work out what it does, despite the comments. Please help me figure it out.
// catch base url
preg_match('/href="(.+?)"/i', $content, $matches);
$baseref = (is_array($matches) && !empty($matches)) ? $matches[1] : '';
// get the first image from content
preg_match('/<img.+?src="(.+?)"[^}]+>/i', $content, $matches);
$img_url = (is_array($matches) && !empty($matches)) ? $matches[1] : '';
Here's what $content contains.
<![CDATA[<p>Buganda Road Chief Magistrate James Mawanda Eremye has released Makerere University administrator Edward Kisuze. The suspended administrator is accused of sexually harassing a student. Court told Kisuzze to pay cash bail of Shs2m and each of his three sureties Shs10m.</p>
<p><img class="alignnone wp-image-32386" src="http://matookerepublic.com/wp-content/uploads/2018/04/kisuze-300x175.png" alt="" width="680" height="396" srcset="http://matookerepublic.com/wp-content/uploads/2018/04/kisuze-300x175.png 300w, http://matookerepublic.com/wp-content/uploads/2018/04/kisuze-696x405.png 696w, http://matookerepublic.com/wp-content/uploads/2018/04/kisuze.png 720w" sizes="(max-width: 680px) 100vw, 680px" /></p>
<p>However, before releasing him the magistrate ordered the prosecutor to disclose to the defence the evidence to enable commencement of hearing of this case come <span data-term="goog_350196878">May 28 2018</span>.</p>
<p>On April 14, police arrested Kisuze after a viral picture of him kissing the student’s private parts in office was released online. On May 4<sup>,</sup> he appeared before court and was remanded to Luzira prison after pleading not guilty to charges.</p>
]]>

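The first one, /href="(.+?)"/i, captures the value of the first href attribute it finds (typically on an <a> tag): (.+?) lazily matches everything up to the next double quote, and the i flag makes the match case-insensitive.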
Check out the live example: https://regex101.com/r/84dhEk/2 (the green part is the matching one)
The second one, /<img.+?src="(.+?)"[^}]+>/i, captures the value of the src attribute of the first <img> tag in the same way.
Check out this example: https://regex101.com/r/SOPN5I/2
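For readers who want to experiment outside PHP, here is a minimal Python sketch of the same two patterns. The content string and the first_group helper are made up for illustration; re.search stands in for preg_match and re.IGNORECASE for the i flag.

```python
import re

# Hypothetical sample markup, much shorter than the feed content above.
content = (
    '<p><a href="http://example.com/post">link</a></p>'
    '<p><img class="alignnone" src="http://example.com/a.png" alt="" /></p>'
)

def first_group(pattern, text):
    """Return capture group 1 of the first match, or '' when nothing matches."""
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group(1) if match else ''

# Same patterns as the PHP code, without the /.../i delimiters.
baseref = first_group(r'href="(.+?)"', content)
img_url = first_group(r'<img.+?src="(.+?)"[^}]+>', content)
```

As in the PHP version, a missing match yields an empty string, which is what happens to the href pattern on the $content shown in the question, since it contains no href attribute.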

Related

Which regex tag to use in a Mechanize function?

I retrieved all the links on the web page whose URL contains /title/tt into a list:
my @url_links = $mech->find_all_links( url_regex => qr/title\/tt/i );
But the list is too long, so I want to narrow it down in find_all_links: the link must also sit inside a tag whose id starts with actor-tt (<div id="actor-tt...">). Here is where the link (/title/tt...) appears in the page source, as retrieved from cmd.exe:
<div class="filmo-row odd" id="actor-tt0361748">
<span class="year_column">
2009
</span>
<b><a href="/title/tt0361748/"
>Inglourious Basterds</a></b>
<br/>
Lt. Aldo Raine
</div>
I imagine I have to use tag_regex, but I don't know how: it doesn't seem to have any effect when I add it.
Using HTML::TreeBuilder and HTML::Element instead of Mechanize:
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;
my $html_string = join "", <DATA>;
my $tree = HTML::TreeBuilder->new_from_content($html_string);
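# Read bottom-up: find elements whose id starts with "actor-tt", then their
# descendants with a matching href, then read that attribute's value.
my @url_links = map { $_->attr_get_i("href") }
                map { $_->look_down(href => qr{/title/tt}) }
                $tree->look_down(id => qr/^actor-tt/);
say for @url_links;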
__DATA__
<div class="filmo-row odd" id="actor-tt0361748">
<span class="year_column">
2009
</span>
<b><a href="/title/tt0361748/">Inglourious Basterds</a></b>
<br/>
Lt. Aldo Raine
</div>
<div id="not-the-right-id">
</div>
<div class="filmo-row odd" id="actor-tt0123456">
<b><a href="/title/tt0123456/">Another movie</a></b>
</div>
<div class="filmo-row odd" id="actor-tt0123456">
the id will match, but no href in here
</div>
$tree->look_down(id => qr/^actor-tt/) finds all elements whose id starts with actor-tt. Then $_->look_down(href => qr{/title/tt}) finds all elements within them whose href attribute matches /title/tt. Finally, $_->attr_get_i("href") returns the value of their href attributes.
You might be interested in the method new_from_url or new_from_file from HTML::TreeBuilder rather than the new_from_content I used.
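The same two-step idea (first narrow down to containers whose id starts with actor-tt, then collect matching hrefs inside them) can be sketched with Python's standard html.parser module. This is only an illustration of the technique, not a port of HTML::TreeBuilder: the ActorLinkParser class and the VOID set are names invented here, and the depth counter assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag; they must not change the depth.
VOID = {"br", "img", "hr", "input", "meta", "link"}

class ActorLinkParser(HTMLParser):
    """Collect href values found inside elements whose id starts with 'actor-tt'."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # 0 = outside any matching container
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            # Inside a matching container: track nesting and look for hrefs.
            if tag not in VOID:
                self.depth += 1
            href = attrs.get("href") or ""
            if "/title/tt" in href:
                self.links.append(href)
        elif (attrs.get("id") or "").startswith("actor-tt") and tag not in VOID:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID:
            self.depth -= 1
```

Feeding the page source to an instance with parser.feed(html) fills parser.links with the hrefs; links outside the actor-tt containers are ignored.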
WWW::Mechanize is not sophisticated enough to do what you're trying to do. It can only search links on one criterion at a time, and it converts them to WWW::Mechanize::Link objects, which do not maintain their ancestry (as in position in the DOM tree).
Mechanize is meant to be a browser, not a scraper. It's important to pick the right tools for the job you have to do.
As Dada suggested in their answer, you can use your own parser to search for this. You can still extract the HTML out of WWW::Mechanize and then use the code they suggest. Use $mech->content or $mech->content_raw to get the HTML out.
There are several alternatives to this. While I personally like Web::Scraper for this kind of task, its interface is a bit weird and has a learning curve.
Instead, I would suggest using Mojo::UserAgent and Mojo::DOM. In fact, the handy ojo package for one-liners should be able to do this.
perl -Mojo -E 'g("https://www.imdb.com/name/nm0000093/")->dom->find("div[id^=actor-tt] a")->map(sub {say $_->attr("href")})'
Broken down, this does the following:
use Mojo::UserAgent to get that page
look at the DOM tree
find all <a>s inside <div>s that have an id that starts with actor-tt (see https://metacpan.org/pod/Mojo::DOM::CSS#SELECTORS for details)
for each of them, print out the href attribute
You can customise this as much as you want.
Please note that, according to IMDb's Terms of Service, scraping IMDb is not allowed.

RegEx for removing all spam links in a <div> The only identifier is overflow:hidden

I have just discovered around a thousand posts on our site with hidden links. They are all contained in divs with styles like this:
<div style='width:10px;height:13px;overflow:hidden'>
<div style='overflow:hidden;width:7px;height:13px'>
The width and height differ each time; the only identifier is overflow:hidden.
Here is one example
<div style='width:10px;height:13px;overflow:hidden'>
<p>BRANDO CHANGED WILL IN LAST DAYS.(News)</p>
<p>The Mirror (London, England) July 8, 2004 Byline: IAN MARKHAM-SMITH HOLLYWOOD legend Marlon Brando changed his will days before his death, it emerged last night.</p>
<p>Movie mogul Mike Medavoy revealed that before the eccentric 80-year-old succumbed to illness on Friday, he summoned lawyers and some friends to make significant changes to his estate. lastnightmovienow.net last night movie</p>
</div>
How do I create a RegEx that finds every div whose style contains overflow:hidden, and then matches any characters up to the closing </div>?
I tried this, but it didn't work:
<div style='.*overflow:hidden'>(.*)</div>
I think it's due to not escaping the normal HTML.
I'm a RegEx noob.
Thanks
Ollie
Thanks mate, very detailed response :)
As you say it's sketchy, worked on some posts and not others.
We solved this by adding this to the functions.php file to strip all the problematic divs out server side.
RegEx was the incorrect approach.
function my_the_content_filter( $content ) {
    // Drop every <div> whose opening tag mentions overflow:hidden, up to the first </div>.
    $content = preg_replace("#<div[^>]*overflow:hidden[^>]*>.*?</div>#is", "", $content);
    return $content;
}
add_filter( 'the_content', 'my_the_content_filter');
?>
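To see what that pattern does and where it stops, here is a small Python sketch using the same regular expression. The strip_hidden_divs name and the sample strings are invented for illustration; re.IGNORECASE and re.DOTALL correspond to the i and s modifiers in the PHP call.

```python
import re

# Same pattern as the preg_replace call: an opening <div> whose tag text
# contains overflow:hidden, then the shortest run of text up to a </div>.
HIDDEN_DIV = re.compile(
    r"<div[^>]*overflow:hidden[^>]*>.*?</div>",
    re.IGNORECASE | re.DOTALL,
)

def strip_hidden_divs(html):
    """Remove every hidden div that contains no nested div."""
    return HIDDEN_DIV.sub("", html)

flat = ("<p>keep</p><div style='width:10px;height:13px;overflow:hidden'>\n"
        "<p>spam</p>\n</div><p>also keep</p>")
nested = "<div style='overflow:hidden'><div>a</div>b</div>"

cleaned_flat = strip_hidden_divs(flat)
cleaned_nested = strip_hidden_divs(nested)
```

Because the lazy .*? stops at the first </div>, a hidden div that itself contains another div is only partly removed (cleaned_nested still holds the tail of the outer div). That limitation is one reason a regex can work on some posts and not on others.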

How to Map photos and texts

Please see the Google Doc below:
https://docs.google.com/document/d/1dw6mJW0VxHzD3_h86RgtZwmelBQE8tYGgi41jb1oz-o/edit
I am attempting to load the data into HBase using either MapReduce or ImportTsv, but my main problem is dealing with the photos. I would like to put the photos in a separate column family. How do I go about selecting only the photos and importing them into HBase, given that the photos have nothing to identify them by, such as a (text) name?
I thought about using regex, but some of the districts have a different structure, for instance "Arizona 1" vs. "Alaska at large".
I need to know how to identify the photos specifically, so that they can be distinguished and imported appropriately.
Having in mind the structure of the document mentioned above, this is the expression you need. It will match all image URLs and each image description.
<image\sxlink:href="(https:\/\/[^"\s]+)".*?<title><\/title><desc>(.+?)<\/desc><\/image>
Usage in PHP:
$html = '<p>Members of our tim</p><image xlink:href="https://lh4.googleusercontent.com/z3GK1MdYyLTo0Q0xLmawvcptIrK4qkQx7XJWUgTK_i6Psm22GBqZXBh-w0TeQ5xgKxckQOB2wHWySSIpNj3tXx65MPXmaxKjK4ye_Xu-wAUFKLVhvWFgIedtzxo" width="100%" height="100%" preserveAspectRatio="none"><title></title><desc>Bradley Byrne.jpg</desc></image><h1>Some big title</h1><p>Something <span>more</span> here</p><image xlink:href="https://lh5.googleusercontent.com/fWYh7qTWqu4_4oxAiNhmnMCmD6DScZ6bIvkF5nSFunU8NxKlBT1T-1J85MJCqghhbChFzoLi-p4ZFVDCA2DWWBP9Paagp9ZgshqnGK5CQQF6D7IoBGihcFZoOms" width="100%" height="100%" preserveAspectRatio="none"><title></title><desc>Spencer Bachus 113th Congress.jpg</desc></image><h1>TITLE</h1><p>Testing, testing, testing</p><image xlink:href="https://lh5.googleusercontent.com/VAHzM6OkdtxT61j9XSgTDKlpVi99WsFfzNAlvqmnpCi90XFs9aUNMfuCeeeQ3e26fykjveoxldHvv5jO1Bk9IeEmeU7DdGVAM1N9xXoB8tJTYBeTeFBxigXtT5s" width="100%" height="100%" preserveAspectRatio="none"><title></title><desc>Kyrsten Sinema 113th Congress.jpg</desc></image><p>Last updated on 25th of July, 2014</p>';
$pattern = '/<image\sxlink:href="(https:\/\/[^"\s]+)".*?<title><\/title><desc>(.+?)<\/desc><\/image>/';
if (preg_match_all($pattern, $html, $matches)) {
    $size_of_matches = count($matches[0]);
    for ($i = 0; $i < $size_of_matches; $i++) {
        echo $matches[1][$i] . " -> " . $matches[2][$i] . "<br />";
    }
}
Output:
https://lh4.googleusercontent.com/z3GK1MdYyLTo0Q0xLmawvcptIrK4qkQx7XJWUgTK_i6Psm22GBqZXBh-w0TeQ5xgKxckQOB2wHWySSIpNj3tXx65MPXmaxKjK4ye_Xu-wAUFKLVhvWFgIedtzxo -> Bradley Byrne.jpg
https://lh5.googleusercontent.com/fWYh7qTWqu4_4oxAiNhmnMCmD6DScZ6bIvkF5nSFunU8NxKlBT1T-1J85MJCqghhbChFzoLi-p4ZFVDCA2DWWBP9Paagp9ZgshqnGK5CQQF6D7IoBGihcFZoOms -> Spencer Bachus 113th Congress.jpg
https://lh5.googleusercontent.com/VAHzM6OkdtxT61j9XSgTDKlpVi99WsFfzNAlvqmnpCi90XFs9aUNMfuCeeeQ3e26fykjveoxldHvv5jO1Bk9IeEmeU7DdGVAM1N9xXoB8tJTYBeTeFBxigXtT5s -> Kyrsten Sinema 113th Congress.jpg
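For comparison, the same extraction can be sketched in Python with re.findall, which returns one (url, description) tuple per match. The pattern is the one given above; the sample string and its example.com URLs are placeholders, not data from the document.

```python
import re

IMAGE = re.compile(
    r'<image\sxlink:href="(https://[^"\s]+)".*?<title></title><desc>(.+?)</desc></image>'
)

# Hypothetical, shortened stand-in for the exported document source.
sample = (
    '<p>Intro</p>'
    '<image xlink:href="https://example.com/a" width="100%"><title></title>'
    '<desc>Bradley Byrne.jpg</desc></image>'
    '<h1>Title</h1>'
    '<image xlink:href="https://example.com/b" width="100%"><title></title>'
    '<desc>Spencer Bachus 113th Congress.jpg</desc></image>'
)

# Each tuple pairs an image URL with the description that follows it.
pairs = IMAGE.findall(sample)
```

Pairing each URL with its description like this gives every photo a text key, which is what the question was missing.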
I do not have experience with MapReduce or ImportTsv, so I went about this a different way, using C#. As hex494D49 pointed out, the images do have text associated with them. You just have to obtain that data from the document's source (i.e. right click-->View page source).
This code reads in the document's source, makes an attempt to match the politician with an image file (based on the available information that was posted), and writes the results to a text file. The code has many examples of the C# flavor of regex. A sample of the output is here.

html parsing with grep and regex

I'm writing a shell script that takes a mountain (only those over 8000 m) as a parameter and returns the name or names of the first people to climb it. I found a page that has this information, which I can download with curl, but I don't really know my way around regex. Given the mountain's name, how can I extract the climbers from HTML like this? Thanks in advance.
site: http://www.valandre.com/blog/2011/06/21/the-14-peaks-over-8000-meters/
html sample
<p class="wp-caption-text">Everest</p></div></div></div><p><strong>Other names: </strong>Sagamartha, Chomolangma or Qomolangma<br
/> <strong>Altitude:</strong> 8848 m<br
/> <strong>Location: </strong>Tibet / Nepal<br
/> <strong>First ascent:</strong> May 29, 1953 by Sir Edmund Hillary and Tenzing Norgay<br
/> <strong>Expedition</strong><strong>: </strong>New Zeeland/India</p><blockquote><p> </p><p><strong>Difficulty</strong> : <em>Mostly a non-technical climb regardless on which of the two normal routes you choose. On the south you have to deal with a dangerous ice fall and The Hillary Step, a short section of rock, on the north side there are some short technical passages. On both routes (permanent) fixed ropes are placed at the tricky sections. The altitude is main obstacle. Nowadays also crowding is mentioned as a factor of difficulty</em>.</p>
I found another site that may be easier to parse: http://www.alpineascents.com/8000m-peaks.asp
html sample
<tr>
<td><strong>Everest</strong></td>
<td>8,850m <br /></td>
<td>29,035ft</td>
<td><div align="center">Nepal/Tibet </div></td>
<td>1953; Sir E. Hillary, T. Norgay</td>
</tr>
Using the first HTML sample:
grep '<strong>First ascent:</strong>' | sed 's/.*by \([^>]*\)<.*/\1/'
Output:
Sir Edmund Hillary and Tenzing Norgay
Achille Compagnoni and Lino Lacedelli
George Band and Joe Brown
Kurt Diemberger, Peter Diener, Nawang Dorje, Nima Dorje, Ernst Forrer and Albin Schelbert
Hermann Buhl
Maurice Herzog and Louis Lachenal
Andrew Kauffman and Peter Schoening
Hermann Buhl, Kurt Diemberger, Marcus Schmuck and Fritz Wintersteller
It finds all lines with the 'First ascent' label and grabs everything between by and the <br /> tag.
Edit:
The original answer doesn't filter by the name of the mountain. In addition, the <strong>First ascent:</strong> is too specific for the page (sometimes there is a space after the :). The following should work.
grep -i "$1" -A3 | grep 'First ascent:' | sed 's/.*by \([^>]*\)<.*/\1/'
Explanation:
grep -i "$1" -A3 selects the line with the mountain. -i makes the search case insensitive. The -A3 selects the 3 lines following the matched line, which gets the line with the list of climbers. The quotes around "$1" are for mountains with names that have spaces.
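The logic of that pipeline (find the line naming the mountain, look at the next three lines, pull out the text after "by ") can also be sketched in Python. The first_climbers function and its lookahead parameter are names made up here, and the sketch assumes the page keeps the line layout shown in the first HTML sample.

```python
import re

def first_climbers(html, mountain, lookahead=3):
    """Mimic: grep -i MOUNTAIN -A3 | grep 'First ascent:' | sed 's/.*by \\([^>]*\\)<.*/\\1/'"""
    lines = html.splitlines()
    results = []
    for i, line in enumerate(lines):
        if mountain.lower() not in line.lower():
            continue
        # The matched line plus the `lookahead` lines after it, like grep -A3.
        for candidate in lines[i:i + lookahead + 1]:
            if "First ascent:" in candidate:
                match = re.search(r".*by ([^>]*)<", candidate)
                if match:
                    results.append(match.group(1))
    return results
```

Calling first_climbers(page_source, "Everest") on the downloaded page returns a list, which is empty when the mountain is not found.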
You can use my Xidel which does pattern matching on the html tree:
xidel http://www.alpineascents.com/8000m-peaks.asp -e "<tr><strong>Everest</strong><td/>{3}<td>{.}</td></tr>"
Just 109 characters...
(Replace Everest with $1 if it is inside a script with that as parameter)
Or for the other site:
xidel http://www.valandre.com/blog/2011/06/21/the-14-peaks-over-8000-meters/ -e "<p class=\"wp-caption-text\">Everest</p><strong>First ascent:</strong>{text()}"
Firstly, go with the first page in your question. Here's a Java scraper for the "curl" downloaded file:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.util.Scanner;

public class PageInfo {
    public static void main(String[] args) throws FileNotFoundException {
        // args[0] is the file you downloaded with curl
        try (Scanner scan = new Scanner(new File(args[0]));
             PrintWriter output = new PrintWriter("climbers.txt")) {
            while (scan.hasNextLine()) {
                String s = scan.nextLine();
                if (s.contains("wp-caption-text\">")) {
                    // mountain name: the text between the caption tag and </p>
                    s = s.split("wp-caption-text\">", 2)[1];
                    if (s.length() > 1) output.println(s.split("</p>")[0]);
                } else if (s.contains("First ascent:") && s.contains("by ")) {
                    // climbers: the text between "by " and the <br tag
                    s = s.split("by ", 2)[1];
                    output.println(s.split("<br")[0]);
                }
            }
        }
    }
}

Get Youtube Id from youtube embed using Regex [duplicate]

Possible Duplicate:
php regex - find all youtube video ids in string
How can I get the YouTube ID from the embed code using regex, whether it is in the old <object> format or an iframe?
example
<iframe width="560" height="315" src="http://www.youtube.com/embed/ghc8cYOA1Vo" frameborder="0" allowfullscreen></iframe>
or
<object width="560" height="315"><param name="movie" value="http://www.youtube.com/v/ghc8cYOA1Vo?hl=en_US&version=3"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/ghc8cYOA1Vo?hl=en_US&version=3" type="application/x-shockwave-flash" width="560" height="315" allowscriptaccess="always" allowfullscreen="true"></embed></object>
Please advise.
youtube.com/((v|embed)/)?[a-zA-Z0-9]+
Youtube has recently gone from 10-character IDs to 11 characters, and it's possible that they may eventually increase that number.
Using the regexp youtube[.]com/(v|embed)/([^"?]+), the YouTube ID will be captured in the second group.
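A quick way to check that pattern against both embed styles is a short Python sketch. The youtube_ids helper is a name invented here; the pattern is the one from the answer, and the two sample strings are shortened versions of the embeds in the question.

```python
import re

# Group 1 is the path segment (v or embed); group 2 is the video ID,
# which runs until the next double quote or question mark.
YOUTUBE_ID = re.compile(r'youtube[.]com/(v|embed)/([^"?]+)')

def youtube_ids(html):
    """Return the distinct video IDs found in html, in order of appearance."""
    ids = []
    for match in YOUTUBE_ID.finditer(html):
        if match.group(2) not in ids:
            ids.append(match.group(2))
    return ids

iframe = '<iframe width="560" src="http://www.youtube.com/embed/ghc8cYOA1Vo" frameborder="0"></iframe>'
old_style = ('<object><param name="movie" value="http://www.youtube.com/v/ghc8cYOA1Vo?hl=en_US&version=3"></param>'
             '<embed src="http://www.youtube.com/v/ghc8cYOA1Vo?hl=en_US&version=3"></embed></object>')
```

Because the ID is defined by what ends it rather than by a fixed length, the pattern keeps working if the ID length changes.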
I tested this regex ((v|embed))\/?[a-zA-Z0-9_-]+ and it works fine