HTML parsing with grep and regex

I'm writing a shell script that takes the name of a mountain (only those over 8000 m) as a parameter and returns the name or names of the first people to climb it. I found a page with the information I need, which I can download with curl, but I don't know my way around regex very well. Given HTML like the sample below and a mountain's name, how can I extract the climbers? Thanks in advance.
site: http://www.valandre.com/blog/2011/06/21/the-14-peaks-over-8000-meters/
html sample
<p class="wp-caption-text">Everest</p></div></div></div><p><strong>Other names: </strong>Sagamartha, Chomolangma or Qomolangma<br
/> <strong>Altitude:</strong> 8848 m<br
/> <strong>Location: </strong>Tibet / Nepal<br
/> <strong>First ascent:</strong> May 29, 1953 by Sir Edmund Hillary and Tenzing Norgay<br
/> <strong>Expedition</strong><strong>: </strong>New Zeeland/India</p><blockquote><p> </p><p><strong>Difficulty</strong> : <em>Mostly a non-technical climb regardless on which of the two normal routes you choose. On the south you have to deal with a dangerous ice fall and The Hillary Step, a short section of rock, on the north side there are some short technical passages. On both routes (permanent) fixed ropes are placed at the tricky sections. The altitude is main obstacle. Nowadays also crowding is mentioned as a factor of difficulty</em>.</p>
I found another site that may be easier to parse: http://www.alpineascents.com/8000m-peaks.asp
html sample
<tr>
<td><strong>Everest</strong></td>
<td>8,850m <br /></td>
<td>29,035ft</td>
<td><div align="center">Nepal/Tibet </div></td>
<td>1953; Sir E. Hillary, T. Norgay</td>
</tr>

Using the first HTML sample:
grep '<strong>First ascent:</strong>' | sed 's/.*by \([^>]*\)<.*/\1/'
Output:
Sir Edmund Hillary and Tenzing Norgay
Achille Compagnoni and Lino Lacedelli
George Band and Joe Brown
Kurt Diemberger, Peter Diener, Nawang Dorje, Nima Dorje, Ernst Forrer and Albin Schelbert
Hermann Buhl
Maurice Herzog and Louis Lachenal
Andrew Kauffman and Peter Schoening
Hermann Buhl, Kurt Diemberger, Marcus Schmuck and Fritz Wintersteller
It reads the page from standard input (for example, piped from curl), finds every line carrying the 'First ascent' label, and grabs everything between "by " and the start of the <br /> tag.
Edit:
The original answer doesn't filter by the name of the mountain. In addition, the pattern '<strong>First ascent:</strong>' is too specific for the page (sometimes there is a space after the colon), so the filter below matches on 'First ascent:' alone. The following should work.
grep -i "$1" -A3 | grep 'First ascent:' | sed 's/.*by \([^>]*\)<.*/\1/'
Explanation:
grep -i "$1" -A3 selects the line with the mountain. -i makes the search case insensitive. The -A3 selects the 3 lines following the matched line, which gets the line with the list of climbers. The quotes around "$1" are for mountains with names that have spaces.
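For comparison, the same select-then-extract logic can be sketched in Python. This is only an illustration under stated assumptions: the function name first_ascent and the embedded sample are made up for the example, and unlike the grep pipeline it returns only the first match.

```python
import re
from typing import Optional


def first_ascent(html: str, mountain: str) -> Optional[str]:
    """Return the climbers listed after 'by' on the mountain's 'First ascent' line."""
    lines = html.splitlines()
    for i, line in enumerate(lines):
        # Like grep -i "$1": case-insensitive substring match on the mountain name.
        if mountain.lower() in line.lower():
            # Like -A3: look at the matched line and the three lines after it.
            for candidate in lines[i:i + 4]:
                m = re.search(r"First ascent:.*?by ([^<]*)<", candidate)
                if m:
                    return m.group(1).strip()
    return None


# Shortened copy of the first HTML sample from the question.
sample = """<p class="wp-caption-text">Everest</p></div></div></div><p><strong>Other names: </strong>Sagamartha, Chomolangma or Qomolangma<br
/> <strong>Altitude:</strong> 8848 m<br
/> <strong>Location: </strong>Tibet / Nepal<br
/> <strong>First ascent:</strong> May 29, 1953 by Sir Edmund Hillary and Tenzing Norgay<br
/> <strong>Expedition</strong><strong>: </strong>New Zeeland/India</p>"""

print(first_ascent(sample, "everest"))
```

As with the shell version, this depends on the page keeping the 'First ascent' line within three lines of the mountain's caption.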

You can use my Xidel which does pattern matching on the html tree:
xidel http://www.alpineascents.com/8000m-peaks.asp -e "<tr><strong>Everest</strong><td/>{3}<td>{.}</td></tr>"
Just 109 characters...
(Replace Everest with $1 if it is inside a script with that as parameter)
Or for the other site:
xidel http://www.valandre.com/blog/2011/06/21/the-14-peaks-over-8000-meters/ -e "<p class=\"wp-caption-text\">Everest</p><strong>First ascent:</strong>{text()}"

Firstly, go with the first page in your question. Here's a Java scraper for the "curl" downloaded file:
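import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.util.Scanner;

public class PageInfo {
    public static void main(String[] args) throws FileNotFoundException {
        // args[0] is the file you downloaded with curl
        try (Scanner scan = new Scanner(new File(args[0]));
             PrintWriter output = new PrintWriter("climbers.txt")) {
            while (scan.hasNextLine()) {
                String s = scan.nextLine();
                if (s.contains("wp-caption-text\">")) {
                    // Mountain name: the text between the caption marker and </p>
                    String[] parts = s.split("wp-caption-text\">");
                    if (parts.length > 1 && parts[1].length() > 1) {
                        output.println(parts[1].split("</p>")[0]);
                    }
                } else if (s.contains("First ascent:") && s.contains("by ")) {
                    // Climbers: the text between the first "by " and the <br tag
                    output.println(s.split("by ", 2)[1].split("<br")[0]);
                }
            }
        }
    }
}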

Related

RegEx for removing all spam links in a <div> The only identifier is overflow:hidden

I have just discovered around a thousand posts on our site with hidden links. They are all contained in divs with styles like this:
<div style='width:10px;height:13px;overflow:hidden'>
<div style='overflow:hidden;width:7px;height:13px'>
The widths and heights all differ; the only common identifier is overflow:hidden.
Here is one example:
<div style='width:10px;height:13px;overflow:hidden'>
<p>BRANDO CHANGED WILL IN LAST DAYS.(News)</p>
<p>The Mirror (London, England) July 8, 2004 Byline: IAN MARKHAM-SMITH HOLLYWOOD legend Marlon Brando changed his will days before his death, it emerged last night.</p>
<p>Movie mogul Mike Medavoy revealed that before the eccentric 80-year-old succumbed to illness on Friday, he summoned lawyers and some friends to make significant changes to his estate. lastnightmovienow.net last night movie</p>
</div>
How do I create a RegEx that finds every div whose style contains overflow:hidden, and then matches any characters up to the closing div?
I tried this, but it didn't work:
<div style='.*overflow:hidden'>(.*)</div>
I think it's due to not escaping the normal HTML.
I'm a RegEx noob.
Thanks
Ollie
Thanks mate, very detailed response :)
As you say, it's sketchy: it worked on some posts and not on others.
We solved this by adding this to the functions.php file to strip all the problematic divs out server side.
RegEx was the incorrect approach.
function my_the_content_filter( $content ) {
$content = preg_replace("#<div[^>]*overflow:hidden[^>]*>.*?</div>#is", "", $content);
return $content;
}
add_filter( 'the_content', 'my_the_content_filter');
?>
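To illustrate why the pattern from the question fails while the one in the filter works, here is a small Python sketch. The post text is a shortened, made-up stand-in for a real post; the two patterns are the ones quoted above, translated to Python's re flags.

```python
import re

# Hypothetical post body: real content around one hidden spam div.
post = """<p>Real content.</p>
<div style='width:10px;height:13px;overflow:hidden'>
<p>BRANDO CHANGED WILL IN LAST DAYS.(News)</p>
</div>
<p>More real content.</p>"""

# Pattern from the question: '.' does not match newlines by default, and it
# also requires overflow:hidden to be the last property in the style.
naive = re.compile(r"<div style='.*overflow:hidden'>(.*)</div>")

# Same idea as the preg_replace above: [^>]* allows any attribute order,
# DOTALL (PHP's "s") lets '.' cross newlines, and '.*?' stops at the first </div>.
robust = re.compile(r"<div[^>]*overflow:hidden[^>]*>.*?</div>",
                    re.IGNORECASE | re.DOTALL)

print(naive.search(post))  # no match, because the div spans several lines
cleaned = robust.sub("", post)
print(cleaned)
```

One caveat: because the lazy match stops at the first closing div, a spam block that itself contains nested divs would only be partly removed. That is the kind of case where an HTML parser is the safer tool.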

Interpreting A Regular expression [duplicate]

I am new to Regular expressions and I came across this piece of code in Wordpress but I have failed to understand what's going on, despite the comments. Kindly help me figure it out.
// catch base url
preg_match('/href="(.+?)"/i', $content, $matches);
$baseref = (is_array($matches) && !empty($matches)) ? $matches[1] : '';
// get the first image from content
preg_match('/<img.+?src="(.+?)"[^}]+>/i', $content, $matches);
$img_url = (is_array($matches) && !empty($matches)) ? $matches[1] : '';
Here's what $content contains.
<![CDATA[<p>Buganda Road Chief Magistrate James Mawanda Eremye has released Makerere University administrator Edward Kisuze. The suspended administrator is accused of sexually harassing a student. Court told Kisuzze to pay cash bail of Shs2m and each of his three sureties Shs10m.</p>
<p><img class="alignnone wp-image-32386" src="http://matookerepublic.com/wp-content/uploads/2018/04/kisuze-300x175.png" alt="" width="680" height="396" srcset="http://matookerepublic.com/wp-content/uploads/2018/04/kisuze-300x175.png 300w, http://matookerepublic.com/wp-content/uploads/2018/04/kisuze-696x405.png 696w, http://matookerepublic.com/wp-content/uploads/2018/04/kisuze.png 720w" sizes="(max-width: 680px) 100vw, 680px" /></p>
<p>However, before releasing him the magistrate ordered the prosecutor to disclose to the defence the evidence to enable commencement of hearing of this case come <span data-term="goog_350196878">May 28 2018</span>.</p>
<p>On April 14, police arrested Kisuze after a viral picture of him kissing the student’s private parts in office was released online. On May 4<sup>,</sup> he appeared before court and was remanded to Luzira prison after pleading not guilty to charges.</p>
]]>
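The first one, /href="(.+?)"/i, captures the value of the first href attribute it finds (typically on an a tag). The lazy (.+?) stops at the next double quote, and the i flag makes the match case-insensitive. If $content has no href, as in the sample above, $baseref ends up as an empty string.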
Check out the live example: https://regex101.com/r/84dhEk/2 (the green part is the matching one)
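The second one, /<img.+?src="(.+?)"[^}]+>/i, captures the value of the src attribute of the first img tag. Note that [^}]+ matches any character except }, so the overall match can run well past the end of the tag, up to the last > before any }; [^>]+ was probably intended. The captured URL in $matches[1] is normally unaffected by this.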
Check out this example: https://regex101.com/r/SOPN5I/2
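The behaviour of both patterns can also be checked with a short Python sketch. The content string and its example.com URLs are invented for the illustration; the patterns are the ones from the question.

```python
import re

# Hypothetical content with one link and one image (example.com is a placeholder).
content = ('<p><a href="http://example.com/post">Read more</a></p>'
           '<p><img class="alignnone" src="http://example.com/a.png" alt="" width="680" /></p>')

# First pattern: lazily capture whatever sits between href=" and the next quote.
href = re.search(r'href="(.+?)"', content, re.IGNORECASE)
base_ref = href.group(1) if href else ""

# Second pattern: skip lazily to src=", capture up to the next quote, then
# [^}]+> greedily consumes everything up to the last '>' (no '}' is present).
img = re.search(r'<img.+?src="(.+?)"[^}]+>', content, re.IGNORECASE)
img_url = img.group(1) if img else ""

print(base_ref)
print(img_url)
print(img.group(0))  # the whole match runs past the img tag to the final </p>
```

The conditional expressions mirror the PHP ternaries: when nothing matches, the result is an empty string rather than an error.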

Regex (or no?) : encode all < > & in XML file and preserve XML markup

I'm building a large XML file that I want to import into MediaWiki.
The file is done, but the content inside <text>content</text> still contains < and > characters that I must encode first.
I'd like to do the encoding step with a regex (I'm on Windows, with editors such as Sublime Text, EditPad or Vim). I should be able to run a PHP script as well.
Using ({{word)(.*?)(?=</text>) I was able to select all the targets for replacement, since I don't want to encode the XML markup itself, but I don't know how to do the hard part, i.e. how to replace every < and > inside the selected text.
For clarity, here is a short extract showing what the content I need to encode looks like (I have 50000 more like it in a 30 MB file):
<page>
<title>Title:75002</title>
<ns>510</ns>
<id>21</id>
<revision>
<id></id>
<parentid></parentid>
<timestamp>2015-1-5T14:49:09Z</timestamp>
<contributor>
<ip>0:0:0:0:0:0:0:1</ip>
</contributor>
<text xmlspace="preserve" bytes="345">{{word
| vedette ={{{vedette}}}
| id ={{ROOTPAGENAME}}
| vedette =boutique, with forbidden > and
evil < multiline
<!-----------encyclo---------->
| étymologie = still have sometimes a messing >
and maybe a < more.
<!-----------relations-------->
| synonyme ={{AutoLienSyno | }}
}}</text>
<sha1></sha1>
<model>wikitext</model>
<format>text/x-wiki</format>
</revision>
</page>
Thank you.
The easy way to do multiple substitutions in a repeated selection of text, for me, was to use sed.
Write a command.txt file with :
/<text/,/<\/text>/{
/<text/b
/<\/text>/b
s/&/\&amp;/g
s/>/\&gt;/g
s/</\&lt;/g
}
Then run sed -f command.txt input.xml > output.xml
This way, all < > & will be encoded, only in the targeted portions of text delimited by <text and </text> (these boundaries remain unaltered).
doc here : http://sed.sourceforge.net/sedfaq4.html#s4.24
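Since the question mentions that running a script is an option, the same idea can be sketched in Python. This is an illustration, not a drop-in replacement: the function name encode_text_elements is made up, and unlike the sed script it also encodes content that shares a line with the opening or closing text tag.

```python
import re
from html import escape

# Opening <text ...> tag, its content (lazily, across lines), closing tag.
TEXT_ELEMENT = re.compile(r"(<text\b[^>]*>)(.*?)(</text>)", re.DOTALL)


def encode_text_elements(xml: str) -> str:
    """Encode &, < and > inside <text>...</text>, leaving the markup itself alone."""
    def repl(match):
        # quote=False: only &, < and > are replaced, quotes stay as they are.
        return match.group(1) + escape(match.group(2), quote=False) + match.group(3)
    return TEXT_ELEMENT.sub(repl, xml)


# Shortened extract based on the sample in the question.
sample = """<revision>
<text xmlspace="preserve" bytes="345">{{word
| vedette =boutique, with forbidden > and
evil < multiline
<!-----------encyclo---------->
| synonyme ={{AutoLienSyno | }}
}}</text>
<model>wikitext</model>
</revision>"""

print(encode_text_elements(sample))
```

Like the sed version, this would double-encode content that already contains entities such as &amp;, so it should be run only once on unencoded input.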

How to Map photos and texts

Please see the Google Doc below:
https://docs.google.com/document/d/1dw6mJW0VxHzD3_h86RgtZwmelBQE8tYGgi41jb1oz-o/edit
I am attempting to load the data into HBase using either MapReduce or ImportTsv, but my main problem is dealing with the photos. I would like to put the photos in a separate column family. How do I select only the photos and import them into HBase, given that the photos have nothing they can be identified by, such as a (text) name?
I thought about using a regex, but some of the districts have a different structure, for instance "Arizona 1" vs. "Alaska at large".
I need to know how to identify the photos specifically, so that they can be distinguished and imported appropriately.
Having in mind the structure of the document mentioned above, this is the expression you need. It will match all image URLs and each image description.
<image\sxlink:href="(https:\/\/[^"\s]+)".*?<title><\/title><desc>(.+?)<\/desc><\/image>
Usage in PHP:
$html = '<p>Members of our tim</p><image xlink:href="https://lh4.googleusercontent.com/z3GK1MdYyLTo0Q0xLmawvcptIrK4qkQx7XJWUgTK_i6Psm22GBqZXBh-w0TeQ5xgKxckQOB2wHWySSIpNj3tXx65MPXmaxKjK4ye_Xu-wAUFKLVhvWFgIedtzxo" width="100%" height="100%" preserveAspectRatio="none"><title></title><desc>Bradley Byrne.jpg</desc></image><h1>Some big title</h1><p>Something <span>more</span> here</p><image xlink:href="https://lh5.googleusercontent.com/fWYh7qTWqu4_4oxAiNhmnMCmD6DScZ6bIvkF5nSFunU8NxKlBT1T-1J85MJCqghhbChFzoLi-p4ZFVDCA2DWWBP9Paagp9ZgshqnGK5CQQF6D7IoBGihcFZoOms" width="100%" height="100%" preserveAspectRatio="none"><title></title><desc>Spencer Bachus 113th Congress.jpg</desc></image><h1>TITLE</h1><p>Testing, testing, testing</p><image xlink:href="https://lh5.googleusercontent.com/VAHzM6OkdtxT61j9XSgTDKlpVi99WsFfzNAlvqmnpCi90XFs9aUNMfuCeeeQ3e26fykjveoxldHvv5jO1Bk9IeEmeU7DdGVAM1N9xXoB8tJTYBeTeFBxigXtT5s" width="100%" height="100%" preserveAspectRatio="none"><title></title><desc>Kyrsten Sinema 113th Congress.jpg</desc></image><p>Last updated on 25th of July, 2014</p>';
$pattern = '/<image\sxlink:href="(https:\/\/[^"\s]+)".*?<title><\/title><desc>(.+?)<\/desc><\/image>/';
if(preg_match_all($pattern, $html, $matches)){
$size_of_matches = count($matches[0]);
for($i = 0; $i < $size_of_matches; $i++){
echo $matches[1][$i] . " -> " . $matches[2][$i] . "<br />";
}
}
Output:
https://lh4.googleusercontent.com/z3GK1MdYyLTo0Q0xLmawvcptIrK4qkQx7XJWUgTK_i6Psm22GBqZXBh-w0TeQ5xgKxckQOB2wHWySSIpNj3tXx65MPXmaxKjK4ye_Xu-wAUFKLVhvWFgIedtzxo -> Bradley Byrne.jpg
https://lh5.googleusercontent.com/fWYh7qTWqu4_4oxAiNhmnMCmD6DScZ6bIvkF5nSFunU8NxKlBT1T-1J85MJCqghhbChFzoLi-p4ZFVDCA2DWWBP9Paagp9ZgshqnGK5CQQF6D7IoBGihcFZoOms -> Spencer Bachus 113th Congress.jpg
https://lh5.googleusercontent.com/VAHzM6OkdtxT61j9XSgTDKlpVi99WsFfzNAlvqmnpCi90XFs9aUNMfuCeeeQ3e26fykjveoxldHvv5jO1Bk9IeEmeU7DdGVAM1N9xXoB8tJTYBeTeFBxigXtT5s -> Kyrsten Sinema 113th Congress.jpg
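The same expression can be tried in Python. In this sketch the input is a shortened, made-up version of the document source with placeholder example.com URLs, and re.DOTALL is an addition (not in the PHP pattern) so that a line break inside an image tag would not prevent a match.

```python
import re

# The expression from above; Python needs no escaping of the forward slashes.
IMAGE = re.compile(
    r'<image\sxlink:href="(https://[^"\s]+)".*?<title></title><desc>(.+?)</desc></image>',
    re.DOTALL,
)

# Hypothetical extract of the document source (URLs are placeholders).
html = ('<p>Intro</p>'
        '<image xlink:href="https://example.com/img/1" width="100%"><title></title>'
        '<desc>Bradley Byrne.jpg</desc></image>'
        '<h1>Title</h1>'
        '<image xlink:href="https://example.com/img/2" width="100%"><title></title>'
        '<desc>Kyrsten Sinema 113th Congress.jpg</desc></image>')

# findall returns one (url, description) tuple per image element.
pairs = IMAGE.findall(html)
for url, description in pairs:
    print(url, "->", description)
```

Each (url, description) pair gives the photo a text name taken from its desc element, which is what the question needs in order to tell the photos apart before loading them.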
I do not have experience with MapReduce or ImportTsv, so I went about this a different way, using C#. As hex494D49 pointed out, the images do have text associated with them. You just have to obtain that data from the document's source (i.e. right click --> View page source).
This code reads in the document's source, makes an attempt to match the politician with an image file (based on the available information that was posted), and writes the results to a text file. The code has many examples of the C# flavor of regex. A sample of the output is here.

Extracting variables from string, regular expression?

My puzzle: as a PHP newbie I am trying to extract some data from a string using a regular expression, but I cannot find the correct syntax.
The string contains the HTML of several images scraped from a website. I want the final output to be 3 separate variables: "$Number1", "$Number2" and "$Status".
An example of the content of the input string $html:
<div id="system">
<img alt="2" height="35" src="/images/numbers/2.jpg" width="18" /><img alt="2" height="35" src="/images/numbers/2.jpg" width="18" /><img alt=".5" height="35" src="/images/numbers/point5.jpg" style="margin-left: -4px" width="26" /><img alt="system statusA" height="35" src="/images/numbers/statusA.jpg" width="37" /><img alt="2" height="35" src="/images/numbers/2.jpg" width="18" /><img alt="1" height="35" src="/images/numbers/1.jpg" width="18" /><img alt=".0" height="35" src="/images/numbers/point0.jpg" style="margin-left: -4px" width="26" />
</div>
The possible values which can appear in this string are:
0.jpg
1.jpg
2.jpg
3.jpg
4.jpg
5.jpg
6.jpg
7.jpg
8.jpg
9.jpg
point0.jpg
point5.jpg
statusA.jpg
statusB.jpg
statusC.jpg
statusD.jpg
statusE.jpg
statusF.jpg
The result should be variables:
"Number1" (XX.X) based upon the first two numbers (0-9) and .0 or .5
"Status" (statusX) based upon the status
"Number2" (XX.X) based upon the last two numbers (0-9) and .0 or .5
Code so far:
$regex = '\balt='(.*?)';
preg_match($regex,$html,$match);
var_dump($match);
echo $match[0];
I probably have to do this in multiple steps or use another function. Can anyone help?
The very first thing that you should ask yourself is: "in what format is my input data". Since in this case it is clearly a snippet of HTML, you should feed that snippet to an HTML parser, and not to a regular expression engine.
I don't know the exact function names, but your code should look like this:
$htmltext = '<div id="system">[...]</div>';
$htmltree = htmlparser_parse($htmltext);
$images = $htmltree->find_all('img');
foreach ($images as $image) {
echo $image->src;
}
So you need to find an HTML parser that parses a string into a tree of nodes. The nodes should have methods for finding nodes inside them based on CSS classes, element names or node IDs. For Python this library is called Beautiful Soup, for Java it is jsoup, and I'm sure that there is something similar for PHP.
The examples provided with simplehtmldom look promising.
Possibly DOM : http://www.php.net/manual/en/book.dom.php
See Robust and Mature HTML Parser for PHP too
You want just the alt's? Try this xpath example:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DomXpath($doc);
foreach($xpath->query('//img/@alt') as $node){
echo $node->nodeValue."\n";
}
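Once the alt values are available, assembling the three variables is a matter of splitting the list at the status image. The following Python sketch uses the standard library's html.parser. The names AltCollector and split_reading are made up for the example, and it assumes, based on the sample, that exactly the status image has the word "status" in its alt text.

```python
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Collect the alt attribute of every img tag, in document order."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags are routed here as well by default.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt is not None:
                self.alts.append(alt)


def split_reading(html):
    """Return (number1, status, number2) built from the img alt texts."""
    parser = AltCollector()
    parser.feed(html)
    parser.close()
    alts = parser.alts
    # Assumption: the status image is the only one whose alt contains "status".
    idx = next(i for i, alt in enumerate(alts) if "status" in alt)
    number1 = "".join(alts[:idx])       # e.g. "2" + "2" + ".5"
    status = alts[idx].split()[-1]      # "system statusA" becomes "statusA"
    number2 = "".join(alts[idx + 1:])   # e.g. "2" + "1" + ".0"
    return number1, status, number2


# Shortened copy of the sample from the question (same alt values, same order).
sample = ('<div id="system">'
          '<img alt="2" src="/images/numbers/2.jpg" /><img alt="2" src="/images/numbers/2.jpg" />'
          '<img alt=".5" src="/images/numbers/point5.jpg" />'
          '<img alt="system statusA" src="/images/numbers/statusA.jpg" />'
          '<img alt="2" src="/images/numbers/2.jpg" /><img alt="1" src="/images/numbers/1.jpg" />'
          '<img alt=".0" src="/images/numbers/point0.jpg" />'
          '</div>')

print(split_reading(sample))
```

The same split works in PHP on the node values returned by the XPath query: collect them into an array, find the entry containing "status", and join the entries before and after it.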