How to Map photos and texts - regex

Please Observe the google Doc below:
https://docs.google.com/document/d/1dw6mJW0VxHzD3_h86RgtZwmelBQE8tYGgi41jb1oz-o/edit
I am attempting to put the data into Hbase using either MapReduce or Importtsv. But my main problem is dealing with the photos. I would like to put the photos in a seperate column family. How do i go about selecting only the photos and importing them into HBase, given that the photos dont have nothing that it can be identified by...like a (text) name.
I thought about using Regex. But some of the districts are of different structure. for instance, "Arizona 1" vs. "Alaska at large".
I need to know how to specifically identify the photos, so they that can be distinguished and imported appropriately.

Having in mind the structure of the document mentioned above, this is the expression you need. It will match all image URLs and each image description.
<image\sxlink:href="(https:\/\/[^"\s]+)".*?<title><\/title><desc>(.+?)<\/desc><\/image>
Demo
Usage in PHP:
$html = '<p>Members of our tim</p><image xlink:href="https://lh4.googleusercontent.com/z3GK1MdYyLTo0Q0xLmawvcptIrK4qkQx7XJWUgTK_i6Psm22GBqZXBh-w0TeQ5xgKxckQOB2wHWySSIpNj3tXx65MPXmaxKjK4ye_Xu-wAUFKLVhvWFgIedtzxo" width="100%" height="100%" preserveAspectRatio="none"><title></title><desc>Bradley Byrne.jpg</desc></image><h1>Some big title</h1><p>Something <span>more</span> here</p><image xlink:href="https://lh5.googleusercontent.com/fWYh7qTWqu4_4oxAiNhmnMCmD6DScZ6bIvkF5nSFunU8NxKlBT1T-1J85MJCqghhbChFzoLi-p4ZFVDCA2DWWBP9Paagp9ZgshqnGK5CQQF6D7IoBGihcFZoOms" width="100%" height="100%" preserveAspectRatio="none"><title></title><desc>Spencer Bachus 113th Congress.jpg</desc></image><h1>TITLE</h1><p>Testing, testing, testing</p><image xlink:href="https://lh5.googleusercontent.com/VAHzM6OkdtxT61j9XSgTDKlpVi99WsFfzNAlvqmnpCi90XFs9aUNMfuCeeeQ3e26fykjveoxldHvv5jO1Bk9IeEmeU7DdGVAM1N9xXoB8tJTYBeTeFBxigXtT5s" width="100%" height="100%" preserveAspectRatio="none"><title></title><desc>Kyrsten Sinema 113th Congress.jpg</desc></image><p>Last updated on 25th of July, 2014</p>';
$pattern = '/<image\sxlink:href="(https:\/\/[^"\s]+)".*?<title><\/title><desc>(.+?)<\/desc><\/image>/';
if(preg_match_all($pattern, $html, $matches)){
$size_of_matches = count($matches[0]);
for($i = 0; $i < $size_of_matches; $i++){
echo $matches[1][$i] . " -> " . $matches[2][$i] . "<br />";
}
}
Output:
https://lh4.googleusercontent.com/z3GK1MdYyLTo0Q0xLmawvcptIrK4qkQx7XJWUgTK_i6Psm22GBqZXBh-w0TeQ5xgKxckQOB2wHWySSIpNj3tXx65MPXmaxKjK4ye_Xu-wAUFKLVhvWFgIedtzxo -> Bradley Byrne.jpg
https://lh5.googleusercontent.com/fWYh7qTWqu4_4oxAiNhmnMCmD6DScZ6bIvkF5nSFunU8NxKlBT1T-1J85MJCqghhbChFzoLi-p4ZFVDCA2DWWBP9Paagp9ZgshqnGK5CQQF6D7IoBGihcFZoOms -> Spencer Bachus 113th Congress.jpg
https://lh5.googleusercontent.com/VAHzM6OkdtxT61j9XSgTDKlpVi99WsFfzNAlvqmnpCi90XFs9aUNMfuCeeeQ3e26fykjveoxldHvv5jO1Bk9IeEmeU7DdGVAM1N9xXoB8tJTYBeTeFBxigXtT5s -> Kyrsten Sinema 113th Congress.jpg

I do not have experience with MapReduce or Importtsv, so I went about this a different way using c#. As hex494D49 pointed out, the images do have text associated with them. You just have to obtain that data from the document's source (i.e. right click-->View page source).
This code reads in the document's source, makes an attempt to match the politician with an image file (based on the available information that was posted), and writes the results to a text file. The code has many examples of the c# flavor of regex. A sample of the output is here.

Related

Web crawler to extract data particular subset of html tags

1.
<p class="followText">Follow us</p>
<p><a class="symbol ss-social-circle ss-facebook" href="http://www.facebook.com/HowStuffWorks" target="_blank">Facebook</a></p>
2.
<p>Gyroscopes can be very perplexing objects because they move in peculiar ways and even seem to defy gravity. These special properties make ­gyroscopes extremely important in everything from your bicycle to the advanced navigation system on the space shuttle. A typical airplane uses about a dozen gyroscopes in everything from its compass to its autopilot. The Russian Mir space station used 11 gyroscopes to keep its orientation to the sun, and the Hubble Space Telescope has a batch of navigational gyros as well. Gyroscopic effects are also central to things like yo-yos and Frisbees!</p>
This is part of source of the website http://science.howstuffworks.com/gyroscope.htm, from which I'm trying to extract contents of the <p> tag from.
This is the code I'm using to do that
def trade_spider(max_pages):
page = 1
while page <= max_pages:
url = 'http://science.howstuffworks.com/gyroscope' + str(page) + ".htm"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('p'):
paragraph = link.string
print paragraph
But I'm getting both types of data( both 1 and 2) inside the p tag.
I need to get only the data from the part 2 section and not part 1.
Please suggest me a way to leave out tags with attributes but keep the basic tags of the same html tag.

Laravel extract excerpt from content using tinymce

I'm using tinymce as rich text editor and separate excerpt from content via pagebreak button that insert a <!-- pagebreak --> tag . I'm wondering what is the best way to extract excerpt from database.
I know i can use preg_math as well as preg_split , but is it realy best and optimized solution?
wouldn't it be better and faster to save excerpt in a separate column?
This should work, without using any regex functions:
$pagebreak = '<!-- pagebreak -->';
$content = 'I am the excerpt<!-- pagebreak -->I am the rest of the content';
$excerpt = substr($content, 0, strpos($content, $pagebreak));
$restOfTheContent = substr($content, strpos($content, $pagebreak) + strlen($pagebreak));
var_dump($excerpt); // string(16) "I am the excerpt"
var_dump($restOfTheContent); // string(28) "I am the rest of the content"
Please note that this is really only designed to work with a single page break. It wouldn't be too difficult to modify it to generate an array of $pages based off of the string $content should multiple page breaks be necessary.

actionscript htmltext. removing a table tag from dynamic html text

AS 3.0 / Flash
I am consuming XML which I have no control over.
the XML has HTML in it which i am styling and displaying in a HTML text field.
I want to remove all the html except the links.
Strip all HTML tags except links
is not working for me.
does any one have any tips? regEx?
the following removes tables.
var reTable:RegExp = /<table\s+[^>]*>.*?<\/table>/s;
but now i realize i need to keep content that is the tables and I also need the links.
thanks!!!
cp
Probably shouldn't use regex to parse html, but if you don't care, something simple like this:
find /<table\s+[^>]*>.*?<\/table\s+>/
replace ""
ActionScript has a pretty neat tool for handling XML: E4X. Rather than relying on RegEx, which I find often messes things up with XML, just modify the actual XML tree, and from within AS:
var xml : XML = <page>
<p>Other elements</p>
<table><tr><td>1</td></tr></table>
<p>won't</p>
<div>
<table><tr><td>2</td></tr></table>
</div>
<p>be</p>
<table><tr><td>3</td></tr></table>
<p>removed</p>
<table><tr><td>4</td></tr></table>
</page>;
clearTables (xml);
trace (xml.toXMLString()); // will output everything but the tables
function removeTables (xml : XML ) : void {
xml.replace( "table", "");
for each (var child:XML in xml.elements("*")) clearTables(child);
}

how to match a URL inside a HTML comment with regular expressions?

I'm making an automated script with PHP to check if my link exists at my partner website ( link exchange) .. besides making sure my link exists in the source code , I want to make sure he is not placing it in a HTML comment like <!-- http://www.mywebsite.com --> and cheating me ..
I tried to match it with REGEXP , but have failed
Use the DOM and XPath, it ignores comments:
$doc = new DOMDocument();
$doc->loadHTML($htmlstring);
$xpath = new DOMXPath($doc);
$result = $xpath->query('//a[contains(#href, "mywebsite.com")]');
if (!$result->length) echo "You've been cheated\n";
And then if you still want to know if your website is being commented out
if (strpos($htmlstring, 'mywebsite.com') !== false && !$result->length)
echo "Your partner is hiding your link in a comment, sneaky bastard\n";
Sounds like a perfect use for an HTML parser like DOMDocument->loadHTML() and look for an anchor tag with your link. He could still remove it via javascript on the browser side, but that's a different issue.
If it's a cat and mouse game of "are you showing a link to my site" using a standard parser is your best bet. There are just too many ways for a regex to fail on html.

The regular expression for finding the image url in <img> tag in HTML using VB .Net code

I want to extract the image url from any website. I am reading the source info through webRequest. I want a regular expression which will fetch the Image url from this content i.e the Src value in the <img> tag.
I'd recommend using an HTML parser to read the html and pull the image tags out of it, as regexes don't mesh well with data structures like xml and html.
In C#: (from this SO question)
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
var nodes = doc.DocumentNode.SelectNodes("//img[#src]");
foreach (var node in nodes)
{
Console.WriteLine(node.src);
}
/(?:\"|')[^\\x22*<>|\\\\]+?\.(?:jpg|bmp|gif|png)(?:\"|')/i
is a decent one I have used before. This gets any reference to an image file within an html document. I didn't strip " or ' around the match, so you will need to do that.
Try this*:
<img .*?src=["']?([^'">]+)["']?.*?>
Tested here with:
<img class="test" src="/content/img/so/logo.png" alt="logo homepage">
Gives
$1 = /content/img/so/logo.png
The $1 (you have to mouseover the match to see it) corresponds to the part of the regex between (). How you access that value will depend on what implementation of regex you are using.
*If you want to know how this works, leave a comment
EDIT
As nearly always with regexp, there are edge cases:
<img title="src=hack" src="/content/img/so/logo.png" alt="logo homepage">
This would be matched as 'hack'.