Extracting variables from string, regular expression? - regex

My puzzle: as a PHP newbie I am trying to extract some data from a string using a regular expression, but I cannot find the correct syntax.
The string contains HTML scraped from a website, consisting of several images. I want the final output to be 3 separate variables: "$Number1", "$Number2" and "$Status".
An example of the content of the input string $html:
<div id="system">
<img alt="2" height="35" src="/images/numbers/2.jpg" width="18" /><img alt="2" height="35" src="/images/numbers/2.jpg" width="18" /><img alt=".5" height="35" src="/images/numbers/point5.jpg" style="margin-left: -4px" width="26" /><img alt="system statusA" height="35" src="/images/numbers/statusA.jpg" width="37" /><img alt="2" height="35" src="/images/numbers/2.jpg" width="18" /><img alt="1" height="35" src="/images/numbers/1.jpg" width="18" /><img alt=".0" height="35" src="/images/numbers/point0.jpg" style="margin-left: -4px" width="26" />
</div>
The possible values which can appear in this string are:
0.jpg
1.jpg
2.jpg
3.jpg
4.jpg
5.jpg
6.jpg
7.jpg
8.jpg
9.jpg
point0.jpg
point5.jpg
statusA.jpg
statusB.jpg
statusC.jpg
statusD.jpg
statusE.jpg
statusF.jpg
The result should be variables:
"Number1" (XX.X) based upon the first two numbers (0-9) and .0 or .5
"Status" (statusX) based upon the status
"Number2" (XX.X) based upon the last two numbers (0-9) and .0 or .5
Code so far:
$regex = '\balt='(.*?)';
preg_match($regex,$html,$match);
var_dump($match);
echo $match[0];
Probably I have to do this in multiple steps or use another function, who can help me?

The very first thing that you should ask yourself is: "in what format is my input data". Since in this case it is clearly a snippet of HTML, you should feed that snippet to an HTML parser, and not to a regular expression engine.
I don't know the exact function names, but your code should look like this:
$htmltext = '<div id="system">[...]</div>';
$htmltree = htmlparser_parse($htmltext);
$images = $htmltree->find_all('img');
foreach ($images as $image) {
echo $image->src;
}
So you need to find an HTML parser that parses a string into a tree of nodes. The nodes should have methods for finding nodes inside them based on CSS classes, element names or node IDs. For Python this library is called BeautifulSoup, for Java it is JSoup, and I'm sure that there is something similar for PHP.
The examples provided with simplehtmldom look promising.

Possibly DOM : http://www.php.net/manual/en/book.dom.php
See Robust and Mature HTML Parser for PHP too

You want just the alt's? Try this xpath example:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach($xpath->query('//img/@alt') as $node){
echo $node->nodeValue."\n";
}
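Building on the XPath idea above, here is a minimal sketch of how the alt values could be grouped into the three variables the question asks for. The helper name extractReadings() is hypothetical, and the sketch assumes the status image always sits between the two numbers, as in the sample input.

```php
<?php
// Sketch only: extractReadings() is a hypothetical helper, not a library function.
// Assumes the layout from the question: digits, a status image, then digits again.
function extractReadings(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // fragments can trigger parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $numbers = ['', '']; // text before and after the status image
    $status = null;

    foreach ($xpath->query('//div[@id="system"]//img/@alt') as $attr) {
        $alt = $attr->nodeValue;
        if (preg_match('/status[A-F]$/', $alt, $m)) {
            $status = $m[0];       // "system statusA" gives "statusA"
        } elseif ($status === null) {
            $numbers[0] .= $alt;   // pieces before the status image
        } else {
            $numbers[1] .= $alt;   // pieces after the status image
        }
    }

    return ['Number1' => $numbers[0], 'Status' => $status, 'Number2' => $numbers[1]];
}
```

Called with the $html from the question, the returned array should hold the two concatenated numbers and the status name, which can then be assigned to $Number1, $Status and $Number2.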

Related

Yesod Hamlet breaks HTML by replacing single quotes with double quotes

I have some HTML code that I'm using in Hamlet:
<div .modal-card .card data-options='{"valueNames": ["name"]}' data-toggle="lists">
Notice that the single quotes for data-options allows the use of double quotes inside the string.
The problem is that when Hamlet renders the page, Hamlet puts " around the ' and so the HTML is broken:
<div class="modal-card card" data-options="'{" valuenames":"="" ["name"]}'="" data-toggle="lists">
Some external JS library plugin code runs, it tries to parse the JSON inside data-options and fails.
How can I tell Hamlet to include a literal string?
I've tried various combinations of:
let theString = "{\"valueNames\": [\"name\"]}"
let theString2 = "data-options='{\"valueNames\": [\"name\"]}'"
etc
And in the hamlet file:
<div .modal-card .card data-options='#{ preEscapedText theString }' data-toggle="lists">
or
<div .modal-card .card #{ preEscapedText theString2 } data-toggle="lists">
But all attempts produce invalid HTML or invalid JSON inside the string.
How can I instruct Hamlet to simply include a literal string in the output HTML?
Update:
Tried more things, no result.
The string2 example doesn't work because Hamlet seems to think that I'm trying to set id="{" as per https://www.yesodweb.com/book/shakespearean-templates#shakespearean-templates_attributes
Why not render the JSON escaped (" becomes &quot;) and “handle” the quotes later when parsing?
Interpolate in Hamlet:
<div #the-modal .modal-card .card data-options='#{theString}' data-toggle="lists">
Parse the data attribute as JSON:
let json = document.getElementById("the-modal").getAttribute("data-options");
let opts = JSON.parse(json); // At least in Chrome, it works!
As for theString2 alternative, you can also interpolate attributes in Hamlet using a tuple or list of tuples and the star symbol:
let dataOptions = ("data-options", "{\"valueNames\": [\"name\"]}") :: (Text, Text)
...
<div #the-modal .modal-card .card *{dataOptions} data-toggle="lists">

preg_match_all grab everything in a HTML tag when malformatted

I am trying to automatically grab everything inside a particular tag in an HTML string.
What I need to do is grab everything in
<font size="8"></font>
so I wrote the following preg_match_all:
preg_match_all('/<font(.*?)size="8"(.*?)>(.*?)<\/font\>/s', $row['html'], $titles,PREG_PATTERN_ORDER);
However, it only works in certain cases. For example, the following (malformed) string fails to match. Do you have any idea how to fix this, or how to modify the pattern above to handle it?
<font FACE="Times New Roman" SIZE="8">
<p><font color="#003300">adadas <br>
dfsf sdfsdf <font size="4"><br>
<br>
gdfgdg
</font>
</font>
Give something like this a try:
<?php
$titles = array(); // CREATE AN ARRAY
$string = '<font FACE="Times New Roman" SIZE="8"><p><font color="#003300">adadas <br>dfsf sdfsdf <font size="4"><br><br>gdfgdg</font></font>';
$dom_document = new DOMDocument(); // CREATE A NEW DOCUMENT
$dom_document->loadHTML($string); // LOAD THE STRING INTO THE DOCUMENT
// LOOP THROUGH EACH font TAG
foreach ($dom_document->getElementsByTagName('font') as $font_item) {
// CHECK TO SEE IF IT HAS A SIZE ATTRIBUTE OF 8
if ($font_item->getAttribute('size') == 8) {
$titles[] = $font_item->ownerDocument->saveXML($font_item);
}
}
print_r($titles);
Basically, instead of using REGEX, you can use PHP's built-in DOM Parser. What this script does is creates a new document named $dom_document and loads your string into it. Then it loops through any font tags that it finds and checks to see if any of them have an attribute of size="8". If it finds any, it grabs the HTML and stores it into the $titles array.
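As a variation on the script above, the size check can also be pushed into an XPath query. This is only a sketch: fontsWithSize() is a hypothetical helper name, and it relies on the same behaviour the script above relies on, namely that the parser exposes SIZE="8" under the lower-case attribute name size.

```php
<?php
// Sketch only: fontsWithSize() is a hypothetical helper, not a library function.
function fontsWithSize(string $html, int $size): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // malformed markup triggers warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $matches = [];
    // $size is an int, so it is safe to concatenate into the expression.
    foreach ($xpath->query('//font[@size="' . $size . '"]') as $font) {
        $matches[] = $doc->saveHTML($font); // serialise the element and its children
    }
    return $matches;
}
```

How the parser repairs unclosed tags decides what ends up inside each matched element, so it is worth inspecting the returned markup on real input before relying on it.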

Trying to match src part of HTML <img> tag Regular Expression

I've got a bunch of strings already separated from an HTML file, examples:
<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys<p><span class='points-q7Vdm'>18,736</span> <span class='points-text-q7Vdm'>points</span> : 316,091 views</p>">
<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
<img src="//s.imgur.com/images/blog_rss.png">
I am trying to make a regular expression that will grab the src="URL" part of the img tag so that I can replace it later based on a few other conditions. The many instances of quotation marks are giving me the biggest problem. I'm still relatively new to regex, so a lot of the tricks are beyond my knowledge.
Thanks in advance
Use DOM or another parser for this, don't try to parse HTML with regular expressions.
Example:
$html = <<<DATA
<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys<p><span class='points-q7Vdm'>18,736</span> <span class='points-text-q7Vdm'>points</span> : 316,091 views</p>">
<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
<img src="//s.imgur.com/images/blog_rss.png">
DATA;
$doc = new DOMDocument();
$doc->loadHTML($html); // load the html
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//img');
foreach ($imgs as $img) {
echo $img->getAttribute('src') . "\n";
}
Output
//i.imgur.com/tApg8ebb.jpg
//i.imgur.com/SwmwL4Gb.jpg
//s.imgur.com/images/blog_rss.png
If you would rather store the results in an array, you could do..
foreach ($imgs as $img) {
$sources[] = $img->getAttribute('src');
}
print_r($sources);
Output
Array
(
[0] => //i.imgur.com/tApg8ebb.jpg
[1] => //i.imgur.com/SwmwL4Gb.jpg
[2] => //s.imgur.com/images/blog_rss.png
)
$pattern = '/<img.+src="([\w\/\._\-]+)"/';
I'm not sure which language you're using, so quote syntax will vary.

Regular Expression to replace link in case having no particular class

I tried real hard to find solution but couldn't do. Yup regex is way too complex. Anyways here is problem.
Objective:
I want to replace image links with CDN image links in PHP. To do that, I thought preg_replace would be the best tool.
If a link is /var/b.png or http://www.example.com/png, it should be replaced with the CDN link, but if src or class contains 'captcha' it should not be, as those images are dynamic in nature.
For start I am trying:
$_SERVER["HTTP_HOST"] = 'www.bring.com';
$preg_host = preg_quote($_SERVER["HTTP_HOST"], '/');
$content = preg_replace('/((\<image\s+.*?src\=)(["\']http\:\/\/'.$preg_host.')(\/.*?["\'](^(?=.*(captcha)))(.*)?\>))/i', '$2$3.nyud.net:8080$4', $content);
$content = preg_replace('/(\<image\s+.*?src\=["\'])(\/.*?["\'].*?\>)/i', '$1http://'.$_SERVER['HTTP_HOST'].'.nyud.net:8080$2', $content);
Condition is that:
When not to replace: src can contain the word "captcha", and in some cases the class contains "captcha". The class attribute can appear before or after src, which makes it more complicated. In these cases I don't want to replace the links, for example:
$content = <<<END
<image
type="image" src="/skins/bph/customer/images/icons/go.gif" alt="Search" title="Search" class="go-button" />
<image
id="verification_image_login_login_popup_form" src="http://www.bring.com/index.php?dispatch=image.captcha&verification_id=%3Alogin_login_popup_form&login_login_popup_form4ef33269bf30b=" alt="" onclick="this.src += 'reload' ;" width="100" height="25" class="image-captcha valign" /></p><div
class="clear">
<image
id="verification_image_login_login_popup_form" class="valign" src="http://www.bring.com/skins/bph/customer/images/icons/go.gif" alt="" onclick="this.src += 'reload' ;" width="100" height="25" /></p><div
class="clear">
END;
So as a result:
The captcha images shouldn't be replaced, but the opposite is happening :(
The following should be replaced, as it has neither a class containing captcha nor a link with the word captcha in it:
<image
id="verification_image_login_login_popup_form" class="valign" src="http://www.bring.com/skins/bph/customer/images/icons/xxx" alt="" onclick="this.src += 'reload' ;" width="100" height="25" /></p>
Rather than trying to solve the whole problem with regex magic (which can bite you at unexpected times), it is highly recommended to use PHP's DOM parser.
Using the DOM parser, iterate through all the images, examine their src and class attributes, and modify the links as needed.
You can find plenty of examples of using DOM if you search here on SO or on Google.
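A minimal sketch of that approach follows. The function name rewriteImagesToCdn() and its parameters are hypothetical, and the sketch assumes the images are real img elements (the sample in the question uses image tags, which would need the tag name changed accordingly).

```php
<?php
// Sketch only: rewriteImagesToCdn() and its parameters are hypothetical names.
function rewriteImagesToCdn(string $html, string $host, string $cdnSuffix): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $prefix = 'http://' . $host;
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        $class = $img->getAttribute('class');

        // Skip dynamic captcha images, whichever attribute carries the word.
        if (stripos($src, 'captcha') !== false || stripos($class, 'captcha') !== false) {
            continue;
        }

        if (strpos($src, $prefix . '/') === 0) {
            // Absolute URL on our own host: keep only the path part.
            $src = substr($src, strlen($prefix));
        } elseif (strpos($src, '/') !== 0 || strpos($src, '//') === 0) {
            // Other hosts and relative paths are left untouched.
            continue;
        }

        $img->setAttribute('src', $prefix . $cdnSuffix . $src);
    }

    // Note: saveHTML() returns a full document, including the wrappers loadHTML() added.
    return $doc->saveHTML();
}
```

With the values from the question, the call would look like rewriteImagesToCdn($content, $_SERVER['HTTP_HOST'], '.nyud.net:8080'). Because the attributes are read from parsed nodes, it no longer matters whether class comes before or after src.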

Regex HTML help

Hey all I'm in need of some help trying to figure out the RegEx formula for finding the values within the tags of HTML mark-up like this:
<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span>
I only need 1993, R, 2.8 and 94% from that HTML above.
Any help would be great as I don't have much knowledge when it comes to forming one of these things.
Don't use a regular expression to parse HTML. Use an HTML parser. There is a good one here.
If you already have the HTML in a string:
string html = @"
<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span>
";
Or you can load a page from the internet directly (saves you from 5 lines of streams and requests):
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.rottentomatoes.com/m/source_code/");
Using the HTML Agility Pack:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span");
Now you can iterate over them, or simply get the text of each node:
IEnumerable<string> texts = spans.Select(option => option.InnerText).ToList();
Alternatively, you can search for the node you're after:
HtmlNode nodeReleaseYear = doc.DocumentNode
.SelectSingleNode("//span[@class='releaseYear']");
string year = nodeReleaseYear.InnerText;