Regular Expression issue

I have code like this:
<div class="rgz">
<div class="xyz">
</div>
<div class="ckh">
</div>
</div>
The div with class ckh won't appear every time. Can someone suggest a regex to get the data of div rgz? The data inside ckh is not needed, but that div won't always be there.
Thanks in advance

@diEcho and @Dve are correct: you should learn to use something like the native DOMDocument class rather than regex. Your code will be easier to read and maintain, and it will handle malformed HTML much better.
Here is some sample code which may or may not do what you want:
$contents = '';
$doc = new DOMDocument();
// loadHTMLFile() parses HTML from a file or URL; load() expects well-formed XML
$doc->loadHTMLFile($page_url);
$nodes = $doc->getElementsByTagName('div');
foreach ($nodes as $node)
{
    if ($node->hasAttributes()) {
        $attributes = $node->attributes;
        if (!is_null($attributes)) {
            foreach ($attributes as $index => $attr) {
                // collect the text of every div whose class is exactly "rgz"
                if ($attr->name == 'class' && $attr->value == 'rgz') {
                    $contents .= $node->nodeValue;
                }
            }
        }
    }
}

Regex is probably not your best option here.
A JavaScript framework such as jQuery will allow you to use CSS selectors to get to the element you require, by doing something like
$('.rgz').children('.xyz').html()

Related

Which regex tag to use in a Mechanize function?

I retrieved all the links on the web page that contain /title/tt in their URL into a list:
my @url_links = $mech->find_all_links( url_regex => qr/title\/tt/i );
But the list is too long, so I want to filter it further in find_all_links: the link must also sit inside a tag whose id starts with actor-tt. Here is where the link (/title/tt...) appears in the page source, as printed in cmd.exe:
<div class="filmo-row odd" id="actor-tt0361748">
<span class="year_column">
2009
</span>
<b><a href="/title/tt0361748/"
>Inglourious Basterds</a></b>
<br/>
Lt. Aldo Raine
</div>
I imagine you have to use a tag_regex but I don't know how because the command prompt doesn't seem to take tag_regex into account when I put it in.

Using HTML::TreeBuilder and HTML::Element instead of Mechanize:
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;
my $html_string = join "", <DATA>;
my $tree = HTML::TreeBuilder->new_from_content($html_string);
my @url_links = map { $_->attr_get_i("href") }
    map { $_->look_down(href => qr{/title/tt}) }
    $tree->look_down(id => qr/^actor-tt/);
say for @url_links;
__DATA__
<div class="filmo-row odd" id="actor-tt0361748">
<span class="year_column">
2009
</span>
<b><a href="/title/tt0361748/">Inglourious Basterds</a></b>
<br/>
Lt. Aldo Raine
</div>
<div id="not-the-right-id">
</div>
<div class="filmo-row odd" id="actor-tt0123456">
<b>Another movie</b>
</div>
<div class="filmo-row odd" id="actor-tt0123456">
the id will match, but no href in here
</div>
$tree->look_down(id => qr/^actor-tt/); finds all elements whose id starts with actor-tt. Then $_->look_down(href => qr{/title/tt}) finds all elements within them whose href attribute matches /title/tt. Finally, $_->attr_get_i("href") returns the value of their href attributes.
You might be interested in the method new_from_url or new_from_file from HTML::TreeBuilder rather than the new_from_content I used.

WWW::Mechanize is not sophisticated enough to do what you're trying to do. It can only search links on one criterion at a time, and it converts them to WWW::Mechanize::Link objects, which do not keep their ancestry (that is, their position in the DOM tree).
Mechanize is meant to be a browser, not a scraper. It's important to pick the right tools for the job you have to do.
As Dada suggested in their answer, you can use your own parser to search for this. You can still extract the HTML out of WWW::Mechanize and then use the code they suggest. Use $mech->content or $mech->content_raw to get the HTML out.
There are several alternatives to this. While I personally like Web::Scraper for this kind of task, its interface is a bit weird and has a learning curve.
Instead, I would suggest using Mojo::UserAgent and Mojo::DOM. In fact, the handy ojo package for one-liners should be able to do this.
perl -Mojo -E 'g("https://www.imdb.com/name/nm0000093/")->dom->find("div[id^=actor-tt] a")->map(sub {say $_->attr("href")})'
Broken down, this does the following:
use Mojo::UserAgent to get that page
look at the DOM tree
find all <a>s inside <div>s that have an id that starts with actor-tt (see https://metacpan.org/pod/Mojo::DOM::CSS#SELECTORS for details)
for each of them, print out the href attribute
You can customise this as much as you want.
Please note that according to their Terms of Service, scraping IMDb is not allowed.

Trying to match src part of HTML <img> tag Regular Expression

I've got a bunch of strings already separated from an HTML file, examples:
<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys<p><span class='points-q7Vdm'>18,736</span> <span class='points-text-q7Vdm'>points</span> : 316,091 views</p>">
<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
<img src="//s.imgur.com/images/blog_rss.png">
I am trying to write a regular expression that grabs the src="URL" part of the img tag so that I can replace it later based on a few other conditions. The many quotation marks are giving me the biggest problem. I'm still relatively new to regex, so a lot of the tricks are beyond my knowledge.
Thanks in advance

Use DOM or another parser for this, don't try to parse HTML with regular expressions.
Example:
$html = <<<DATA
<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys<p><span class='points-q7Vdm'>18,736</span> <span class='points-text-q7Vdm'>points</span> : 316,091 views</p>">
<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
<img src="//s.imgur.com/images/blog_rss.png">
DATA;
$doc = new DOMDocument();
$doc->loadHTML($html); // load the html
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//img');
foreach ($imgs as $img) {
echo $img->getAttribute('src') . "\n";
}
Output
//i.imgur.com/tApg8ebb.jpg
//i.imgur.com/SwmwL4Gb.jpg
//s.imgur.com/images/blog_rss.png
If you would rather store the results in an array, you could do this:
$sources = array();
foreach ($imgs as $img) {
    $sources[] = $img->getAttribute('src');
}
print_r($sources);
Output
Array
(
[0] => //i.imgur.com/tApg8ebb.jpg
[1] => //i.imgur.com/SwmwL4Gb.jpg
[2] => //s.imgur.com/images/blog_rss.png
)

$pattern = '#<img.+?src="([\w/._-]+)"#';
I'm not sure which language you're using, so quote syntax will vary.
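For tag strings that have already been cut out of the page, as in the question, a narrow pattern can be enough. The following is a minimal JavaScript sketch, not a general HTML parser: extractSrc is a hypothetical helper name, and it assumes that src is written with double quotes and that no > appears in an attribute before it.

```javascript
// Hypothetical helper: return the src value of one <img ...> tag string,
// or null when no double-quoted src attribute is found.
function extractSrc(tag) {
  // [^>]*? skips earlier attributes; \s keeps names like data-src from matching
  const match = /<img\b[^>]*?\ssrc="([^"]*)"/i.exec(tag);
  return match ? match[1] : null;
}

// Sample tags modelled on the ones in the question (first title shortened)
const tags = [
  '<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills">',
  '<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">',
  '<img src="//s.imgur.com/images/blog_rss.png">',
];

for (const tag of tags) {
  console.log(extractSrc(tag));
}
```

Capturing [^"]* rather than a list of allowed URL characters means the quotes themselves delimit the value, which sidesteps most of the quoting trouble described in the question.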

Styling menu block's menu links in Drupal 7

I'm trying to style a block in Drupal 7 and I'm having a very hard time figuring things out!
I've used the menu_block module to get all links from the main menu. It produces a block with links in a ul, which I would like to theme as divs for each menu tree.
The styling itself should be easy, but I'm really struggling with finding the theme hook/template filename that I should use to style it.
I've tried to hook into theme_menu_tree and theme_menu_link, but they theme way too many places, and I can't see what I'm styling. I've tried menu-tree--menu-block--main-menu.tpl.php, but the variables are nothing like what I need.
My thought is that I need to style the $content variable in block.tpl.php, but I can't figure out how to do it for a specific block. Where should I hook in if I want to style the menu items when the block (block type) is displayed (in the footer)?

I think the easiest (not necessarily the best) place to do this is in hook_block_view_alter():
function MYMODULE_block_view_alter(&$data, $block) {
    if ($block->module == 'menu_block') {
        // Extract the links from the available data
        $items = $data['content']['#content'];
        $content = '';
        // element_children() returns the array keys of the child elements,
        // so look each link up by its key while building the output.
        foreach (element_children($items) as $key) {
            $link = $items[$key];
            $content .= '<div class="something">' . l($link['#title'], $link['#href']) . '</div>';
        }
        // Assign the new output to the block content...done :)
        $data['content'] = $content;
    }
}
The Devel module and its handy dpm() function are your best friends here. They let you examine any PHP variable in a nicely structured format in the standard messages area. If you don't already have it installed I'd advise doing so; it's an absolute must for Drupal development.
Don't forget to clear Drupal's caches once you've implemented that hook or the system won't pick it up.

I had a very similar problem trying to figure out how to name my templates and hooks properly. Googling didn't help (way too much noise), but eventually I tried the Menu Block module documentation on drupal.org and it led me in the right direction...
Template: menu-block-wrapper--main-menu.tpl.php
<nav role="navigation" id="siteNavigation">
<?php echo render($content); ?>
</nav>
Hooks: THEMENAME_menu_tree__menu_block__MENUNAME() and THEMENAME_menu_link__menu_block__MENUNAME():
function THEME_menu_tree__menu_block__main_menu($vars) {
return '<ul class="my-custom-menu-wrapper">' . $vars['tree'] . '</ul>';
}
function THEME_menu_link__menu_block__main_menu($data) {
$el = $data['element'];
// ... render any classes or other attributes that need to go in this <li>
$attr = drupal_attributes($el['#attributes']);
// ... render the menu link
$link = l($el['#title'], $el['#href'], $el['#localized_options']);
// ... and render any submenus
$sub_menu = drupal_render($el['#below']);
return sprintf("\n<li %s>%s %s</li>", $attr, $link, $sub_menu);
}

With theme('links') you can add CSS classes to the <ul>:
print theme('links', array('links' => menu_navigation_links($your_menu_name), 'attributes' => array('class'=> array('ul_class')) ));

RegEx to modify urls in htmlText as3

I have some HTML text that I set into a TextField in Flash. I want to highlight links (either in a different colour or just by underlining them) and make sure the link target is set to "_blank".
I am really bad at RegEx. I found a handy expression on RegExr :
</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>
but I couldn't use it.
What I will be dealing with is this:
<a href="http://randomwebsite.web" />
I will need to do a String.replace()
to get something like this:
<u><a href="http://randomwebsite.web" target="_blank"/></u>
I'm not sure this can be done in one go. Priority is making sure the link has target set to blank.

I do not know how ActionScript regexes work, but since attributes can appear anywhere in the tag, you can substitute <a target="_blank" href= for every <a href=. Something like this maybe:
var pattern:RegExp = /<a\s+href=/g;
var str:String = "<a href=\"http://stackoverflow.com/\">";
// replace() returns a new string, so keep the result
str = str.replace(pattern, "<a target=\"_blank\" href=");
Copied from Adobe docs because I do not know much about AS3 regex syntax.
Now, manipulating HTML through regex is usually very fragile, but I think you can get away with it in this case. First, a better way to style the link would be through CSS, rather than using the <font> tag:
str = str.replace(pattern, "<a style=\"color:#00d\" target=\"_blank\" href=");
To surround the link with other tags, you have to capture everything in <a ...>anchor text</a> which is fraught with difficulty in the general case, because pretty much anything can go in there.
Another approach would be to use:
var start:RegExp = /<a\s+href=/g;
var end:RegExp = /<\/a>/g;
var str:String = "<a href=\"http://stackoverflow.com/\">";
str = str.replace(start, "<font color=\"#0000dd\"><a target=\"_blank\" href=");
str = str.replace(end, "</a></font>");
As I said, I have never used AS and so take this with a grain of salt. You might be better off if you have any way of manipulating the DOM.
Something like this might appear to work as well:
var pattern:RegExp = /<a\s+href=(.+?)<\/a>/mg;
...
str = str.replace(pattern,
    "<font color=\"#0000dd\"><a target=\"_blank\" href=$1</a></font>");

I recommend this simple test tool:
http://www.regular-expressions.info/javascriptexample.html
Here's a working example with a more complex input string.
var pattern:RegExp = /<a href="([\w:\/._-]*)"[ ]* \/>/gi;
var str:String = 'hello world <a href="http://www.stackoverflow.com/" /> hello there';
var newstr:String = str.replace(pattern, '<li><a href="$1" target="_blank" /></li>');
trace(newstr);

What about this? I needed it myself; it looks for all links (a tags), with or without a target already set.
var pattern:RegExp = /<a ( ( [^>](?!target) )* ) ((.+)target="[^"]*")* (.*)<\/a> /xgi;
str = str.replace(pattern, '<a$1$4 target="_blank"$5<\/a>');
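Both goals from the question, forcing target="_blank" and underlining the link, can be handled in one helper by passing a function to replace(). The sketch below is written in JavaScript rather than ActionScript 3, so treat it as an illustration to adapt; AS3's String.replace() also accepts a function as its second argument. highlightLinks is a hypothetical name, and the pattern assumes that no > appears inside an attribute value.

```javascript
// Hypothetical helper: force target="_blank" on every <a> tag and wrap
// each link in <u>...</u>.
function highlightLinks(html) {
  // Pass 1: rewrite opening (or self-closing) <a ...> tags.
  const opened = html.replace(/<a\b([^>]*?)\s*(\/?)>/gi, function (whole, attrs, slash) {
    // Drop any existing target attribute so it is never duplicated.
    const cleaned = attrs.replace(/\s+target\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/i, "");
    const tag = "<u><a" + cleaned + ' target="_blank"' + slash + ">";
    // A self-closing <a ... /> has no </a>, so close the underline right away.
    return slash ? tag + "</u>" : tag;
  });
  // Pass 2: close the underline after each </a>.
  return opened.replace(/<\/a\s*>/gi, "</a></u>");
}

console.log(highlightLinks('<a href="http://randomwebsite.web" />'));
// <u><a href="http://randomwebsite.web" target="_blank"/></u>
```

Removing an existing target before appending the new one keeps the result stable when the helper runs on text that already contains a target attribute.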

Add to <body> tag of a cakePHP app

I have an app where I need to call some JS in the onLoad event of the BODY tag of two forms. However, I can't find how to modify the tag for just them. Does anyone know?
Frank

inkedmn certainly has provided the right answer for this case, but in general, you can "hand information up" like this:
(in views/controller/view.ctp)
$this->set('bodyAttr', 'onload="something"');
(in views/layouts/default.ctp)
<?php
if (isset($bodyAttr)) {
$bodyAttr = " $bodyAttr";
} else {
$bodyAttr = null;
}
?>
<body<?php echo $bodyAttr; ?>>
I often use it like this to add extra classes to a "top level element":
<?php
if (!isset($docClass)) {
$docClass = null;
}
?>
<div id="doc" class="<?php echo $docClass; ?>">

You don't need to modify the body tag to have JavaScript execute when the page loads. You could just include something like this in your layout where appropriate:
(jQuery)
$(window).on("load",
    function(){
        // do stuff once the page, including the body element, has loaded.
    }
);
Or, if you want to have the code execute when the document.ready event fires:
$(function(){
// do stuff when the document is ready
}
);
Or, if you don't want to use jQuery, you could do something like this:
function doStuff(){
// whatever you want to happen when the load completes
}
window.onload = doStuff;
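One caveat with assigning window.onload directly is that it replaces any handler set earlier, which matters when a layout and a view both want to run code on load. A minimal sketch of the additive alternative, addEventListener, follows; onPageLoad is a hypothetical helper name, and in a browser the target passed in would be window.

```javascript
// Hypothetical helper: register a load callback without replacing
// callbacks that were registered earlier on the same target.
function onPageLoad(target, callback) {
  target.addEventListener("load", callback);
}

// In a browser, each view that needs load-time code could call, for example:
//   onPageLoad(window, function () { /* per-view setup */ });
```

Because every call adds a listener instead of overwriting a property, only the views that need load-time code have to emit the call, and the body tag stays unchanged.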
Good luck - and please clarify your question if this answer isn't satisfactory.

I do the following:
In the layout, I apply $bodyLoad to the body tag:
<body <?= (isset($bodyLoad) ? 'onload=\'' . $bodyLoad . '\'' : ''); ?>>
Then, in my [action].ctp, I do the following:
<?php $this->set('bodyLoad', 'field.focus();'); ?>
If you want, you can also put this code in the controller.
Good luck
