Regex to match only the first occurrence of an html element

Yes yes, I know, "don't parse HTML with Regex". I'm doing this in notepad++ and it's a one-time thing so please bear with me for a moment.
I'm trying to simplify some HTML code by using some more advanced techniques. Notably, I have "inserts" or "callouts" or whatever you call them, in my documentation, indicating "note", "warning" and "technical" short phrases to grab the attention of the reader on important information:
<div class="note">
<p><strong>Notes</strong>: This icon shows you something that complements
the information around it. Understanding notes is not critical but
may be helpful when using the product.</p>
</div>
<div class="warning">
<p><strong>Warnings</strong>: This icon shows information that may
be critical when using the product.
It is important to pay attention to these warnings.</p>
</div>
<div class="technical">
<p><strong>Technical</strong>: This icon shows technical information
that may require some technical knowledge to understand. </p>
</div>
I want to simplify this HTML into the following:
<div class="box note"><strong>Notes</strong>: This icon shows you something that complements
the information around it. Understanding notes is not critical but
may be helpful when using the product.</div>
<div class="box warning"><strong>Warnings</strong>: This icon shows information that may
be critical when using the product.
It is important to pay attention to these warnings.</div>
<div class="box technical"><strong>Technical</strong>: This icon shows technical information
that may require some technical knowledge to understand.</div>
I almost have the regex necessary to do a nice global search & replace in my project from notepad++, but it's not picking up "only" the first div, it's picking up all of them - if my cursor is at the beginning of my file, the "select" when I click Find is from the first <div class="something"> up until the last </div>, essentially.
Here's my expression: <div class="(.*[^"])">[^<]*<p>(.*?)<\/p>[^<]*<\/div> (notepad++ "automatically" adds the / / around it, kinda).
What am I doing wrong, here?

You have a greedy dot-quantifier while matching the class attribute — that's the evil guy who's causing your problems.
Make it non-greedy: <div class="(.*?[^"])"> or change it to a character class: <div class="([^"]*)">.
Compare: greedy class vs. non-greedy class.
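The difference can be reproduced outside Notepad++. The following is a minimal sketch using Python's `re` module rather than Notepad++'s own regex engine. It assumes the ". matches newline" option is enabled (modelled here with `re.DOTALL`), and the sample markup is a shortened, made-up version of the callouts in the question.

```python
import re

# Shortened sample in the shape of the question's callouts (illustrative only).
html = (
    '<div class="note">\n'
    '<p><strong>Notes</strong>: first callout.</p>\n'
    '</div>\n'
    '<div class="warning">\n'
    '<p><strong>Warnings</strong>: second callout.</p>\n'
    '</div>\n'
)

# Original pattern: the greedy .* in the class group runs to the LAST '">' in the text.
greedy = re.compile(r'<div class="(.*[^"])">[^<]*<p>(.*?)</p>[^<]*</div>', re.DOTALL)

# Fixed pattern: [^"]* cannot cross the closing quote of the attribute.
fixed = re.compile(r'<div class="([^"]*)">[^<]*<p>(.*?)</p>[^<]*</div>', re.DOTALL)

greedy_matches = greedy.findall(html)   # one match swallowing both divs
fixed_matches = fixed.findall(html)     # one match per div

# Rewrite each callout into the simplified form asked for in the question.
simplified = fixed.sub(r'<div class="box \1">\2</div>', html)
print(simplified)
```

With the greedy pattern the first capture group contains everything from `note` up to `warning`, which is the "first div to last div" selection described in the question.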

Related

Bootstrap group list shows text sprites at runtime

The code is listed below, using the bootstrap class class="list-group-item-text".
As the image here shows, the text seems to have a duplicate version just above it - looks almost like dotted lines.
`
<asp:Panel ID="pnlInstructions" runat="server" Visible="true">
<h4>Instructions:</h4>
<div class="form-inline col-lg-12" id="Instructions" style="padding-top:10px;padding-bottom:20px;padding-left:10px;">
<ol class="list-group-item-text" style="text-emphasis:filled;outline:none;">
<li>First, please use MyEd to get your extract file ready.</li>
<li>Then, fill in the following to Log in.</li>
</ol>
</div>
</asp:Panel>
`
I've researched the problem using the word "sprites", but that seems to have a different meaning than I expected, since I thought it meant "unwanted junk" in a display.
I'm not sure if this appears on all browsers.

Which regex tag to use in a Mechanize function?

I retrieved all the links from the web page containing /title/tt inside the url in a list.
my @url_links = $mech->find_all_links( url_regex => qr/title\/tt/i );
but the list is too long, so I want to filter it further: in find_all_links, the link must also sit inside a tag whose id starts with actor-tt (<div id="actor-tt...">). Here is where the link (/title/tt...) is, in the page source retrieved by cmd.exe:
<div class="filmo-row odd" id="actor-tt0361748">
<span class="year_column">
2009
</span>
<b><a href="/title/tt0361748/"
>Inglourious Basterds</a></b>
<br/>
Lt. Aldo Raine
</div>
I imagine you have to use a tag_regex but I don't know how because the command prompt doesn't seem to take tag_regex into account when I put it in.
Using HTML::TreeBuilder and HTML::Element instead of Mechanize:
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;
my $html_string = join "", <DATA>;
my $tree = HTML::TreeBuilder->new_from_content($html_string);
my @url_links = map { $_->attr_get_i("href") }
map { $_->look_down(href => qr{/title/tt}) }
$tree->look_down(id => qr/^actor-tt/);
say for @url_links;
__DATA__
<div class="filmo-row odd" id="actor-tt0361748">
<span class="year_column">
2009
</span>
<b><a href="/title/tt0361748/">Inglourious Basterds</a></b>
<br/>
Lt. Aldo Raine
</div>
<div id="not-the-right-id">
</div>
<div class="filmo-row odd" id="actor-tt0123456">
<b><a href="/title/tt0123456/">Another movie</a></b>
</div>
<div class="filmo-row odd" id="actor-tt0123456">
the id will match, but no href in here
</div>
$tree->look_down(id => qr/^actor-tt/); finds all elements whose id starts with actor-tt. Then $_->look_down(href => qr{/title/tt}) finds all elements within them whose href attribute matches /title/tt. Finally, $_->attr_get_i("href") returns the value of their href attributes.
You might be interested in the method new_from_url or new_from_file from HTML::TreeBuilder rather than the new_from_content I used.
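For readers who do not use Perl, the same two-step filter (first the container id, then the href) can be sketched in Python with only the standard library. This is not the HTML::TreeBuilder API; the class name `ActorLinkParser` is made up for illustration, and the sketch assumes reasonably well-formed HTML in which every non-void tag is closed.

```python
from html.parser import HTMLParser


class ActorLinkParser(HTMLParser):
    """Collect hrefs containing /title/tt inside elements whose id starts with actor-tt."""

    # Tags that never get a closing tag, so they must not be pushed on the stack.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []   # one flag per open element: are we inside an actor-tt container?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        inside = bool(self.stack and self.stack[-1]) or (attrs.get("id") or "").startswith("actor-tt")
        href = attrs.get("href") or ""
        if inside and "/title/tt" in href:
            self.links.append(href)
        if tag not in self.VOID:
            self.stack.append(inside)

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.stack:
            self.stack.pop()


sample = """
<div class="filmo-row odd" id="actor-tt0361748">
<span class="year_column">2009</span>
<b><a href="/title/tt0361748/">Inglourious Basterds</a></b>
<br/>
Lt. Aldo Raine
</div>
<div id="not-the-right-id"><a href="/title/tt9999999/">ignored</a></div>
"""

parser = ActorLinkParser()
parser.feed(sample)
links = parser.links
print(links)
```

Only the link inside the actor-tt container is kept; the second link matches the URL pattern but is dropped because its parent has the wrong id.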
WWW::Mechanize is not sophisticated enough to do what you're trying to do. It can only search links on one criterion at a time, and it converts them to WWW::Mechanize::Link objects, which do not keep track of their ancestry (that is, their position in the DOM tree).
Mechanize is meant to be a browser, not a scraper. It's important to pick the right tools for the job you have to do.
As Dada suggested in their answer, you can use your own parser to search for this. You can still extract the HTML out of WWW::Mechanize and then use the code they suggest. Use $mech->content or $mech->content_raw to get the HTML out.
There are several alternatives to this. While I personally like Web::Scraper for this kind of task, its interface is a bit weird and has a learning curve.
Instead, I would suggest using Mojo::UserAgent and Mojo::DOM. In fact, the handy ojo package for one-liners should be able to do this.
perl -Mojo -E 'g("https://www.imdb.com/name/nm0000093/")->dom->find("div[id^=actor-tt] a")->map(sub {say $_->attr("href")})'
Broken down, this does the following:
use Mojo::UserAgent to get that page
look at the DOM tree
find all <a>s inside <div>s that have an id that starts with actor-tt (see https://metacpan.org/pod/Mojo::DOM::CSS#SELECTORS for details)
for each of them, print out the href attribute
You can customise this as much as you want.
Please note that according to IMDb's Terms of Service, scraping the site is not allowed.

Remove all HTML tags from a cell

I'm trying to remove all the HTML tags and comments within the following cell in Google Sheets:
<div class="prod-desc" itemprop="description">
<div class="row">
<div class="col-md-8">
<p>This is a 100 count box of the ACC-DX01A Proximity Card to be used with any of our DX line of Access Control Readers. It is the size of a credit card so it can easily fit into your wallet. Use these like a proximity card and carry them on your key ring for easy access. </p>
<p> Please note: To add a DX Card or FOB to the DX Access Control System, you must use the Auto/Add Function. If you need assistance, FREE US based tech support is just a phone call away. </p>
</div>
<!-- Description Side Bar START ************************************ -->
<div class="col-md-4"> <img src="/images_templ/Accesss-Control_product-image.jpg"> <span class="boxtitle ">Full Line of Access Control</span> <span style="font-size: 18px; font-family: inherit; font-weight: 400">Access Control Proximity Card Readers and Electronic Door Locks and more!</span> </div>
<!-- Description Side Bar END ************************************ -->
</div>
</div>
So ideally the input should come out as:
This is a 100 count box of the ACC-DX01A Proximity Card to be used with any of our DX line of Access Control Readers. It is the size of a credit card so it can easily fit into your wallet. Use these like a proximity card and carry them on your key ring for easy access.
Please note: To add a DX Card or FOB to the DX Access Control System, you must use the Auto/Add Function. If you need assistance, FREE US based tech support is just a phone call away.
Full Line of Access Control Access Control Proximity Card Readers and Electronic Door Locks and more!
I've searched around and found several answers, but none of them seems to work for me. Maybe it's because of the newlines and carriage returns? I don't know. What I want to do is remove all the HTML and keep all the newlines and carriage returns in the text. Here are some posts that I was following:
Remove HTML In Google Sheets Cells
https://superuser.com/questions/564727/html-tags-in-google-spreadsheet
try like this:
=ARRAYFORMULA(TEXTJOIN(CHAR(10), 1,
TRIM(SPLIT(REGEXREPLACE(A1, "</?\S+[^<>]*>", ), CHAR(10)))))
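To see what the formula does step by step, here is a rough Python model of it, not Sheets itself. It assumes that TRIM also collapses repeated inner spaces and that empty lines are dropped by the join; the function name `strip_html` and the shortened cell text are made up for illustration.

```python
import re

# Same pattern as in the formula: an opening or closing tag, or a comment
# (for a comment, \S+ matches the leading "!--").
TAG_OR_COMMENT = re.compile(r"</?\S+[^<>]*>")


def strip_html(cell: str) -> str:
    without_tags = TAG_OR_COMMENT.sub("", cell)           # REGEXREPLACE(A1, ..., )
    lines = without_tags.split("\n")                      # SPLIT(..., CHAR(10))
    trimmed = (" ".join(line.split()) for line in lines)  # TRIM(...)
    return "\n".join(line for line in trimmed if line)    # TEXTJOIN(CHAR(10), 1, ...)


cell = """<div class="row">
<p>First paragraph. </p>
<!-- Side Bar START **** -->
<div class="col-md-4"> <img src="/x.jpg"> <span class="boxtitle ">Full Line</span> <span style="font-size: 18px">Readers and more!</span> </div>
</div>"""

cleaned = strip_html(cell)
print(cleaned)
```

Lines that held only tags or comments become empty and are skipped, so the remaining text keeps its original line breaks.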
Yes. Besides the answer that @player0 gave, you can also use 'Search and Replace' (Ctrl+H): paste whatever you want to change or remove and replace it with nothing. It works for more than one cell too.
It's more laborious, but you can target the entire workbook or specific ranges if needed.

span inside .addthis_button_facebook_like gets width of 450px after like, breaks addThis layout

So yeah, I have this on the page:
<div class="addthis_toolbox addthis_default_style " id="divAddThis" runat="server">
<a class="addthis_button_facebook_like" style="opacity:1;" <%="fb:like:layout='button_count'"%>></a>
<a class="addthis_button_tweet"></a>
<a class="addthis_counter addthis_pill_style"></a>
</div>
But after clicking the like button (and closing the bubble), a <span> just inside the <fb:like ...> element dynamically gets a width of 450px, breaking the layout of AddThis (the buttons are all inline).
Recommendations?
Seems like it might be a bug on Facebook's side of things. I've seen a few reports pop up this week and I'm experiencing it myself as well.
Here is a link to one of the bug reports for it that says that Facebook is looking into it:
Facebook Bug Report
Hopefully we'll have an answer soon.

CSS3 class match letter range [a-z]+?

Is it possible to create a CSS rule for any element whose class is "icon-" followed by a set of letters, but not numbers?
According to this article something like:
[class^='/icon\-([a-zA-Z]+)/'] {}
should work. But for some reason it doesn't.
In particular I need to create style definition for all elements like "icon-user", "icon-ok" etc but not "icon-16" or "icon-32"
Is it possible at all?
CSS attribute selectors do not support regular expressions.
If you actually read that article closely:
Regex Matching Attribute Selectors
They don’t exist, but wouldn’t that be so cool? I’ve no idea how hard it would be to implement, or how to expensive to parse, but wouldn’t it just be the bomb?
Notice the first three words. They don't exist. That article is nothing more than a blog post lamenting the absence of regex support in CSS attribute selectors.
But if you're using jQuery, James Padolsey's :regex selector for jQuery may interest you. Your given CSS selector might look like this for example:
$(":regex(class, ^icon\-[a-zA-Z]+)")
I answered this one on facebook but thought I'd best share here too :)
I haven't tested this so don't shoot me if it doesn't work :) but my guess would be to explicitly target elements that contain the word icon in the class name, but to instruct the browser not to include those classes containing numbers.
Example code:
div[class|=icon]:not(.icon-16, .icon-32, .icon-64, .icon-96) {.....}
Reference:
attribute selectors... (http://www.w3.org/TR/CSS2/selector.html#attribute-selectors):
[att|=val]
Represents an element with the att attribute, its value either being exactly "val" or beginning with "val" immediately followed by "-" (U+002D).
:not selector...
(http://kilianvalkhof.com/2008/css-xhtml/the-css3-not-selector/)
Hope this helps,
Waseem
I tested my previous solution and can confirm that it DOES NOT work (see comment from BoltClock). This however does:
OP: "In particular I need to create style definition for all elements like "icon-user", "icon-ok" etc but not "icon-16" or "icon-32""
The required CSS code would look something like this:
/* target every element where the class name begins with ( ^= ) "icon" but NOT those that begin with ( ^= ) "icon-16", or "icon-32" */
*[class^="icon"]:not([class^="icon-16"]):not([class^="icon-32"]) {.....}
or
/* target every element where the class name begins with ( ^= ) "icon" but NOT those that contain ( *= ) the number "16" or the number "32" */
*[class^="icon"]:not([class*="16"]):not([class*="32"]) { ...... }
Test code:
<!DOCTYPE html>
<html>
<head>
<style>
div{border:1px solid #999;margin-bottom:1em;height:100px;}
*[class|=icon]:not([class|=icon-16]):not([class|=icon-32]) {background:red;color:white;}
</style>
</head>
<body>
<div class="icon-something">
<h4>icon-something</h4>
<p><strong>IS</strong> targeted therefore background colour will be red</p>
</div>
<div class="icon-anotherthing">
<h4>icon-anotherthing</h4>
<p><strong>IS</strong> targeted therefore background colour will be red</p>
</div>
<div class="icon-16-install">
<h4>icon-16-install</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
<div class="icon-16-redirect">
<h4>icon-16-redirect</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
<div class="icon-16-login">
<h4>icon-16-login</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
<div class="icon-32-install">
<h4>icon-32-install</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
<div class="icon-32-redirect">
<h4>icon-32-redirect</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
<div class="icon-32-login">
<h4>icon-32-login</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
</body>
</html>
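One caveat with the `^=` selectors above: they compare against the whole class attribute string, not against individual class names, and they only exclude the sizes that are listed. The following Python sketch models the first `^=` selector next to the rule the question actually asks for. Both function names are made up for illustration, and the model is plain string matching, not a browser's selector engine.

```python
import re


def css_approximation(class_attr: str) -> bool:
    """Model of [class^="icon"]:not([class^="icon-16"]):not([class^="icon-32"])."""
    return (class_attr.startswith("icon")
            and not class_attr.startswith("icon-16")
            and not class_attr.startswith("icon-32"))


def intended_rule(class_attr: str) -> bool:
    """What the question asks for: some class name is 'icon-' followed by letters only."""
    return any(re.fullmatch(r"icon-[a-zA-Z]+", name) for name in class_attr.split())


samples = ["icon-user", "icon-ok", "icon-16", "icon-32-login", "icon-64", "btn icon-user"]
results = {s: (css_approximation(s), intended_rule(s)) for s in samples}

for name, (approx, intended) in results.items():
    print(f"{name!r}: selector={approx}, intended={intended}")
```

The two rules agree on the first four samples but differ on the last two: `icon-64` is styled because it is not in the exclusion list, and `btn icon-user` is missed because the attribute value does not start with `icon`.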