preg_match_all grab everything in a HTML tag when malformatted - regex

I am trying to automatically grab everything inside a particular tag in an HTML string.
What I need is everything inside
<font size="8"></font>
so I wrote the following preg_match_all call:
preg_match_all('/<font(.*?)size="8"(.*?)>(.*?)<\/font\>/s', $row['html'], $titles,PREG_PATTERN_ORDER);
However, it only works in certain cases. For example, the following (malformed) string fails to match. Do you have any idea how to fix this, or how to modify the pattern above to handle it?
<font FACE="Times New Roman" SIZE="8">
<p><font color="#003300">adadas <br>
dfsf sdfsdf <font size="4"><br>
<br>
gdfgdg
</font>
</font>

Give something like this a try:
<?php
$titles = array(); // CREATE AN ARRAY
$string = '<font FACE="Times New Roman" SIZE="8"><p><font color="#003300">adadas <br>dfsf sdfsdf <font size="4"><br><br>gdfgdg</font></font>';
$dom_document = new DOMDocument(); // CREATE A NEW DOCUMENT
$dom_document->loadHTML($string); // LOAD THE STRING INTO THE DOCUMENT
// LOOP THROUGH EACH font TAG
foreach ($dom_document->getElementsByTagName('font') as $font_item) {
    // CHECK TO SEE IF IT HAS A SIZE ATTRIBUTE OF 8
    if ($font_item->getAttribute('size') == 8) {
        $titles[] = $font_item->ownerDocument->saveXML($font_item);
    }
}
print_r($titles);
Basically, instead of using a regular expression, you can use PHP's built-in DOM parser. The script creates a new document named $dom_document and loads your string into it. It then loops through every font tag it finds and checks whether the tag has a size attribute equal to 8. For each match, it grabs the markup and stores it in the $titles array.
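As a variation on the same idea, the size check can be pushed into an XPath query. The following is only a sketch: it assumes libxml lowercases attribute names while parsing HTML (the loop-based approach relies on the same behaviour to see SIZE="8" as size), and it uses libxml_use_internal_errors() to keep warnings about the malformed markup quiet.

```php
<?php
// Sample input from the question (malformed: unclosed <p>, one unclosed <font>).
$string = '<font FACE="Times New Roman" SIZE="8"><p><font color="#003300">adadas <br>dfsf sdfsdf <font size="4"><br><br>gdfgdg</font></font>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($string);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$titles = array();
// Select only <font> elements whose size attribute is exactly "8".
foreach ($xpath->query('//font[@size="8"]') as $font) {
    // saveHTML($node) serializes the element together with its children.
    $titles[] = $doc->saveHTML($font);
}

print_r($titles);
```

Because the parser repairs the markup before the query runs, the serialized result may not be byte-for-byte identical to the input string.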

Related

How to apply Drupal's Ckeditor HTML filters when manually saving node fields?

I have written a custom content import script for Drupal 8, which imports content from a JSON export from another website.
My CKEditor fields use fairly basic HTML filtering which, for example, replaces <i> tags with <em> tags and <b> with <strong>.
Now, when I save my HTML into a field with these settings, it works fine for <p> and <ul> tags, but <i> tags are not displayed:
$html = '<p><i>Italic text</i> and some <b>bold</b> text</p>';
$node->set('field_some_html', ['value' => $html, 'format' => 'basic_html']);
It now renders as:
<p>Italic text and some bold text</p>
When I then edit the node, I do see the text in italic or bold in the editor.
When I save the node, everything is corrected: the tags have been converted.
It now renders as:
<p><em>Italic text</em> and some <strong>bold</strong> text</p>
So my question is: how do I fix this? How do I apply the filters to my HTML input before saving the node to the database?
Update 1: After some more investigation I found FilterFormat, and I tried this:
$html = '<p><i>Italic text</i> and some <b>bold</b> text</p>';
$filter_format = FilterFormat::load($format);
$filters = $filter_format->filters();
/** @var \Drupal\filter\Plugin\Filter\FilterHtml $filter */
foreach ($filters as $filter) {
    // After the first pass $html is a result object, so unwrap it before processing again.
    $html = $filter->process(is_string($html) ? $html : $html->getProcessedText(), 'nl');
}
die($html->getProcessedText());
However, this does the opposite of what I want to achieve: it returns the HTML stripped of its <i> and <b> tags.
I think I may be close to the solution though...

Trying to match src part of HTML <img> tag Regular Expression

I've got a bunch of strings already separated from an HTML file, examples:
<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys<p><span class='points-q7Vdm'>18,736</span> <span class='points-text-q7Vdm'>points</span> : 316,091 views</p>">
<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
<img src="//s.imgur.com/images/blog_rss.png">
I am trying to write a regular expression that grabs the src="URL" part of the img tag so that I can replace it later based on a few other conditions. The many quotation marks are giving me the biggest problem. I'm still relatively new to regex, so a lot of the tricks are beyond my knowledge.
Thanks in advance.
Use DOM or another parser for this, don't try to parse HTML with regular expressions.
Example:
$html = <<<DATA
<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys<p><span class='points-q7Vdm'>18,736</span> <span class='points-text-q7Vdm'>points</span> : 316,091 views</p>">
<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
<img src="//s.imgur.com/images/blog_rss.png">
DATA;
$doc = new DOMDocument();
$doc->loadHTML($html); // load the html
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//img');
foreach ($imgs as $img) {
    echo $img->getAttribute('src') . "\n";
}
Output
//i.imgur.com/tApg8ebb.jpg
//i.imgur.com/SwmwL4Gb.jpg
//s.imgur.com/images/blog_rss.png
If you would rather store the results in an array, you could do:
$sources = array(); // start with an empty array
foreach ($imgs as $img) {
    $sources[] = $img->getAttribute('src');
}
print_r($sources);
Output
Array
(
[0] => //i.imgur.com/tApg8ebb.jpg
[1] => //i.imgur.com/SwmwL4Gb.jpg
[2] => //s.imgur.com/images/blog_rss.png
)
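Since the goal is to replace the src values later, the same DOM nodes can also be used for the replacement itself. This is a minimal sketch: the rule applied here (prefixing protocol-relative URLs with https:) is invented purely for illustration and stands in for whatever conditions you actually have.

```php
<?php
$html = '<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">'
      . '<img src="//s.imgur.com/images/blog_rss.png">';

// Keep parser warnings out of the output.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$sources = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');

    // Hypothetical rule: turn protocol-relative URLs into https URLs.
    if (strpos($src, '//') === 0) {
        $img->setAttribute('src', 'https:' . $src);
    }

    // Read the attribute back to confirm what the document now holds.
    $sources[] = $img->getAttribute('src');
}

print_r($sources);
```

Calling $doc->saveHTML($img) on an individual node would give back the updated tag as a string if you need the markup rather than just the URL.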
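$pattern = '/<img.+src="([\w\/._\-]+)"/';
I'm not sure which language you're using, so the quoting and delimiter syntax may vary. In PHP, the / inside the character class has to be escaped because / is also the pattern delimiter.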

Extracting variables from string, regular expression?

My puzzle: as a PHP newbie I am trying to extract some data from a string using a regular expression, but I cannot find the correct syntax.
The string contains the scraped HTML of several images from a website. I want the final output to be 3 separate variables: "$Number1", "$Number2" and "$Status".
An example of the content of the input string $html:
<div id="system">
<img alt="2" height="35" src="/images/numbers/2.jpg" width="18" /><img alt="2" height="35" src="/images/numbers/2.jpg" width="18" /><img alt=".5" height="35" src="/images/numbers/point5.jpg" style="margin-left: -4px" width="26" /><img alt="system statusA" height="35" src="/images/numbers/statusA.jpg" width="37" /><img alt="2" height="35" src="/images/numbers/2.jpg" width="18" /><img alt="1" height="35" src="/images/numbers/1.jpg" width="18" /><img alt=".0" height="35" src="/images/numbers/point0.jpg" style="margin-left: -4px" width="26" />
</div>
The possible values which can appear in this string are:
0.jpg
1.jpg
2.jpg
3.jpg
4.jpg
5.jpg
6.jpg
7.jpg
8.jpg
9.jpg
point0.jpg
point5.jpg
statusA.jpg
statusB.jpg
statusC.jpg
statusD.jpg
statusE.jpg
statusF.jpg
The result should be variables:
"Number1" (XX.X) based upon the first two numbers (0-9) and .0 or .5
"Status" (statusX) based upon the status
"Number2" (XX.X) based upon the last two numbers (0-9) and .0 or .5
Code so far:
$regex = '\balt='(.*?)';
preg_match($regex,$html,$match);
var_dump($match);
echo $match[0];
I probably have to do this in multiple steps or use another function. Who can help me?
The very first thing that you should ask yourself is: "in what format is my input data". Since in this case it is clearly a snippet of HTML, you should feed that snippet to an HTML parser, and not to a regular expression engine.
I don't know the exact function names, but your code should look like this:
$htmltext = '<div id="system">[...]</div>';
$htmltree = htmlparser_parse($htmltext);
$images = $htmltree->find_all('img');
foreach ($images as $image) {
    echo $image->src;
}
So you need to find an HTML parser that parses a string into a tree of nodes. The nodes should have methods for finding other nodes inside them based on CSS classes, element names or node IDs. For Python this library is called BeautifulSoup, for Java it is JSoup, and I'm sure that there is something similar for PHP.
The examples provided with simplehtmldom look promising.
Possibly DOM : http://www.php.net/manual/en/book.dom.php
See Robust and Mature HTML Parser for PHP too
You want just the alt's? Try this xpath example:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DomXpath($doc);
foreach ($xpath->query('//img/@alt') as $node) {
    echo $node->nodeValue . "\n";
}
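Going one step further, the alt values can be assembled into the three variables the question asks for. This is a rough sketch under stated assumptions: it assumes the alt text always mirrors the image (digits, ".0"/".5", or a text containing statusA to statusF, as in the sample), and that exactly one status image separates the two numbers. The shortened $html below keeps only the attributes that matter here.

```php
<?php
$html = '<div id="system">'
      . '<img alt="2" src="/images/numbers/2.jpg" />'
      . '<img alt="2" src="/images/numbers/2.jpg" />'
      . '<img alt=".5" src="/images/numbers/point5.jpg" />'
      . '<img alt="system statusA" src="/images/numbers/statusA.jpg" />'
      . '<img alt="2" src="/images/numbers/2.jpg" />'
      . '<img alt="1" src="/images/numbers/1.jpg" />'
      . '<img alt=".0" src="/images/numbers/point0.jpg" />'
      . '</div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$numbers = array('', ''); // [0] = digits before the status, [1] = digits after it
$status  = null;
$index   = 0;

foreach ($xpath->query('//div[@id="system"]/img/@alt') as $alt) {
    $value = $alt->nodeValue;
    if (preg_match('/status[A-F]/', $value, $m)) {
        // The status image marks the switch from the first number to the second.
        $status = $m[0];
        $index  = 1;
    } else {
        // Digits and ".0"/".5" are simply concatenated in document order.
        $numbers[$index] .= $value;
    }
}

list($number1, $number2) = $numbers;

echo $number1 . "\n"; // 22.5
echo $status . "\n";  // statusA
echo $number2 . "\n"; // 21.0
```

If the alt text ever turns out to be unreliable, the same loop could read the src attribute instead and map file names such as point5.jpg to ".5".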

how to bold a text which is on <strong> tag in blackberry

I've got HTML string returned from a Web Service. Sample string would be:
I want to display the HTML string in the label field.
I've replaced the <b> and <br/> tags with </n>, and that works fine.
I want to display in bold the text between <strong> and </strong>.
If you want to make all the text bold, just use:
yourLabelField.setFont(getFont().derive(Font.BOLD, getFont().getHeight()));
If you want only a part of the text, you will have to use RichTextField:
http://www.blackberry.com/developers/docs/3.7api/net/rim/device/api/ui/component/RichTextField.html
Many tags do not work in a LabelField, so if you want the display to follow the HTML tags you need to use a BrowserField, like this:
String str = "<html><head><style type=\"text/css\">a {color:OLIVE;}</style></head><body style=\"background-image:url('local:///background.png');background-repeat:no-repeat; width:100%;height:100%;\"> <font size=3 color=olive><b>About Us</b></font> <font size=2>" + "paste the text you want to show here" + "</font></body></html>";
browser_field.displayContent(str,"");

Regex HTML help

Hey all I'm in need of some help trying to figure out the RegEx formula for finding the values within the tags of HTML mark-up like this:
<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span>
I only need 1993, R, 2.8 and 94% from that HTML above.
Any help would be great as I don't have much knowledge when it comes to forming one of these things.
Don't use a regular expression to parse HTML. Use an HTML parser. There is a good one here.
If you already have the HTML in a string:
string html = @"
<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span>
";
Or you can load a page from the internet directly (saves you from 5 lines of streams and requests):
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.rottentomatoes.com/m/source_code/");
Using the HTML Agility Pack:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span");
Now you can iterate over them, or simply get the text of each node:
IEnumerable<string> texts = spans.Select(option => option.InnerText).ToList();
Alternatively, you can search for the node you're after:
HtmlNode nodeReleaseYear = doc.DocumentNode
    .SelectSingleNode("//span[@class='releaseYear']");
string year = nodeReleaseYear.InnerText;