How to match a URL inside an HTML comment with regular expressions? - regex

I'm writing an automated PHP script to check whether my link exists on my partner's website (link exchange). Besides making sure my link exists in the source code, I want to make sure he is not placing it inside an HTML comment like <!-- http://www.mywebsite.com --> and cheating me.
I tried to match it with a regular expression, but failed.

Use the DOM and XPath; the query below only sees real elements, so links inside comments are ignored:
$doc = new DOMDocument();
$doc->loadHTML($htmlstring);
$xpath = new DOMXPath($doc);
$result = $xpath->query('//a[contains(@href, "mywebsite.com")]');
if (!$result->length) echo "You've been cheated\n";
And then, if you still want to know whether your link is being commented out:
if (strpos($htmlstring, 'mywebsite.com') !== false && !$result->length)
echo "Your partner is hiding your link in a comment, sneaky bastard\n";

Sounds like a perfect use for an HTML parser such as DOMDocument->loadHTML(): load the page and look for an anchor tag with your link. He could still remove it via JavaScript on the browser side, but that's a different issue.
If it's a cat-and-mouse game of "are you showing a link to my site", using a standard parser is your best bet. There are just too many ways for a regex to fail on HTML.
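The same parser-based idea can be sketched outside PHP. The following is a minimal Python sketch using only the standard library's html.parser; the class name LinkAudit and the helper audit are hypothetical names made up for this illustration, and the check is a plain substring match on the domain, which is an assumption about what counts as "my link".

```python
from html.parser import HTMLParser


class LinkAudit(HTMLParser):
    """Records whether a domain shows up in a live <a href> or inside a comment."""

    def __init__(self, domain):
        super().__init__()
        self.domain = domain
        self.found_live = False       # seen in a real <a href="...">
        self.found_commented = False  # seen inside <!-- ... -->

    def handle_starttag(self, tag, attrs):
        # Anchors inside comments never reach this method.
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.domain in href:
                self.found_live = True

    def handle_comment(self, data):
        # data is the raw text between <!-- and -->.
        if self.domain in data:
            self.found_commented = True


def audit(html, domain):
    """Return (found_live, found_commented) for the given page source."""
    parser = LinkAudit(domain)
    parser.feed(html)
    parser.close()
    return parser.found_live, parser.found_commented
```

A result of (False, True) would correspond to the "hidden in a comment" case described above. Like the PHP version, this cannot see links removed later by JavaScript.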

Related

Regex Custom Redirect in Blogger for every archive.html

I need to create a regex Custom Redirect in Blogger. The purpose is to redirect all HTML archives to somewhere else.
Currently I'm using the following in Settings / Search preferences / Custom Redirects:
From:/2018_11_21_archive.html
To:/p/somewhere_else.html
Permanent:Yes
The problem is that this method requires adding every date, which is not acceptable:
/2016_10_21_archive.html
/2016_10_22_archive.html
/2016_10_23_archive.html
/2017_07_10_archive.html
/2017_07_10_archive.html
/2017_07_10_archive.html
/2018_11_21_archive.html
/2019_11_21_archive.html
...
So far I've tried this regex with no success:
From:/2018_(.*)
To:/p/somewhere_else.html
Permanent:Yes
Blogger custom redirects do not support regex.
However, there is a workaround: put this code right after <head>:
<b:if cond='data:view.isArchive and data:view.url contains "_archive"'>
<b:with value='"https://www.example.com/p/somewhere_else.html"' var='destination'>
<script>window.location.replace("<data:destination/>")</script>
<noscript><meta expr:content='"0; URL=" + data:destination' http-equiv='refresh'/></noscript>
</b:with>
</b:if>
You have to escape the "/" character by inserting a "\" before it.
The line should look like this:
From:\/2018_.*
But be aware that this only matches the 2018 archives; from your list, that is just /2018_11_21_archive.html.
If you need ALL dates, as you mentioned, I recommend the regex below:
\/([12]\d{3}_(0[1-9]|1[0-2])_(0[1-9]|[12]\d|3[01]))_archive\.html
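Whether Blogger itself accepts a regex is disputed above, but the pattern can at least be checked on its own. Here is a small Python sketch that tests it against paths like the ones in the question; the function name is_archive_path is hypothetical, and using a full match on the path is an assumption about how the rule would be applied.

```python
import re

# The date-aware pattern from the answer above, copied verbatim.
ARCHIVE_RE = re.compile(
    r"\/([12]\d{3}_(0[1-9]|1[0-2])_(0[1-9]|[12]\d|3[01]))_archive\.html"
)


def is_archive_path(path):
    """True if the whole path looks like /YYYY_MM_DD_archive.html."""
    return ARCHIVE_RE.fullmatch(path) is not None


if __name__ == "__main__":
    for candidate in ["/2016_10_21_archive.html",
                      "/2018_13_01_archive.html",
                      "/p/somewhere_else.html"]:
        print(candidate, is_archive_path(candidate))
```

The month and day groups reject values such as month 13, but they do not check that the day exists in that month (for example, 02_31 would still match).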

How to add an extra parameter to the img source in HTML using perl

I have a situation where I need to differentiate two calls by the path in the src of an HTML img tag. This is what the img tag looks like:
<img src="/folder/12280218/160024536.images.jpg" />
I am planning to alter the source to
<img src="/folder/12280218/160024536.images.jpg/1" />
Note the "/1" at the end of src.
I need this so that I can change the flow in the controller when I am serving this image.
This is what I have tried so far:
my $string = '<p><img src="/folder/12280218/160024536.images.jpg" /></p>';
$string =~ s/<img\s+src\=\"(.*)"\s+\/><\/p>/<img src\=\"$1\/1" \><\/p>/g;
This works as long as $string looks like this.
In our application, the user can alter the HTML input using CKEditor.
He can alter the image tag by adding width="800" before or after the src attribute. I want the regular expression to handle all these situations.
Please let me know how to proceed.
Thanks in advance.
Replace:
(<img.*src="[^"]*)(".*\/>)
with:
$1/1$2
Edit: Changed the regex to handle situations with other attributes (like the "width" part).
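To see how this substitution behaves, here is a minimal Python sketch. ANSWER_PATTERN is the pattern from the answer, copied as is. TIGHT_PATTERN is a variation that is not part of the answer: it swaps the greedy .* for [^>]* so that a match cannot run across several tags. The function name add_suffix is hypothetical.

```python
import re

# Pattern from the answer, unchanged.
ANSWER_PATTERN = re.compile(r'(<img.*src="[^"]*)(".*\/>)')

# Hypothetical tighter variant (not from the answer): stays inside one tag.
TIGHT_PATTERN = re.compile(r'(<img[^>]*?src="[^"]*)("[^>]*\/>)')


def add_suffix(html, pattern=ANSWER_PATTERN, suffix="/1"):
    """Append suffix to the src value of each <img ... /> that pattern matches."""
    return pattern.sub(lambda m: m.group(1) + suffix + m.group(2), html)


if __name__ == "__main__":
    print(add_suffix('<p><img src="/folder/12280218/160024536.images.jpg" /></p>'))
    print(add_suffix('<img width="800" src="/a.jpg" />'))
    print(add_suffix('<img src="/a.jpg" /><img src="/b.jpg" />', TIGHT_PATTERN))
```

With the original pattern, two img tags on the same line are treated as one match and only the last src is changed, because .* is greedy. The tighter variant handles each tag separately, but both patterns still assume double-quoted attributes and a self-closing "/>".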

Laravel extract excerpt from content using tinymce

I'm using TinyMCE as a rich text editor and separate the excerpt from the content via the pagebreak button, which inserts a <!-- pagebreak --> tag. I'm wondering what the best way to extract the excerpt from the database is.
I know I can use preg_match as well as preg_split, but is that really the best and most optimized solution?
Wouldn't it be better and faster to save the excerpt in a separate column?
This should work, without using any regex functions:
$pagebreak = '<!-- pagebreak -->';
$content = 'I am the excerpt<!-- pagebreak -->I am the rest of the content';
$excerpt = substr($content, 0, strpos($content, $pagebreak));
$restOfTheContent = substr($content, strpos($content, $pagebreak) + strlen($pagebreak));
var_dump($excerpt); // string(16) "I am the excerpt"
var_dump($restOfTheContent); // string(28) "I am the rest of the content"
Please note that this is really only designed to work with a single page break. It wouldn't be too difficult to modify it to generate an array of $pages based off of the string $content should multiple page breaks be necessary.
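The multi-page variant mentioned above could look roughly like this. It is a Python sketch rather than PHP, and the function names split_pages and excerpt_and_rest are hypothetical. Returning None as the excerpt when no marker is present is a design assumption, not something the answer specifies.

```python
PAGEBREAK = "<!-- pagebreak -->"


def split_pages(content, marker=PAGEBREAK):
    """Split content into a list of pages at every page break marker."""
    return content.split(marker)


def excerpt_and_rest(content, marker=PAGEBREAK):
    """Return (excerpt, rest), splitting at the first marker only.

    If the marker is missing, the excerpt is None and the rest is the
    whole content (an assumption made for this sketch).
    """
    excerpt, found, rest = content.partition(marker)
    if not found:
        return None, content
    return excerpt, rest


if __name__ == "__main__":
    text = "I am the excerpt<!-- pagebreak -->I am the rest of the content"
    print(excerpt_and_rest(text))
    print(split_pages("one<!-- pagebreak -->two<!-- pagebreak -->three"))
```

Handling the missing-marker case explicitly matters in the PHP version too: strpos() returns false when the marker is absent, which the substr() calls above do not check for.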

Getting the website title from a link in a string

string: "Here is the badges, https://stackoverflow.com/badges bla bla bla"
If the string contains a link (see above), I want to get the website title of that link.
It should return: Badges - Stack Overflow.
How can I do that?
Thanks.
#!/usr/bin/perl -w
require LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
my $response = $ua->get('http://search.cpan.org/');
if ($response->is_success) {
print $response->title();
}
else {
die $response->status_line;
}
See LWP::UserAgent. Cheers :-)
I use URI::Find::Simple's list_uris method and URI::Title for this.
Depending on how the link is given and how you define the title, you need one approach or the other.
In the exact scenario that you have presented, getting the URL with URI::Find, HTML::LinkExtractor etc., and then my $title=URI->new($link)->path() will provide the title and the link.
But if the website title is the linked text, like "badges", then "How can I extract URL and link text from HTML in Perl?" will give you the answer.
If the title is encoded in the link itself and the link is the text itself of the link, how do you define the title?
Do you want the last bit of the URI before any query? What happens with the queries set as URL paths?
Do you want the part between the host and the query?
Do you want to parse the link source and retrieve the title tag if any?
As always, going from a trivial first implementation to covering all corner cases is a daunting task ;-)
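For the last interpretation (fetch the page and read its title tag), the two steps can be sketched in Python with only the standard library. Everything here is illustrative: the URL regex is deliberately simple and will miss or over-match some links, and the names find_first_url, extract_title and fetch_title are hypothetical. fetch_title needs network access and assumes the page decodes as UTF-8.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen

# Deliberately simple URL pattern (an assumption, not a full URL grammar).
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.title = None
        self._inside = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._inside:
            self._inside = False
            self.title = "".join(self._parts).strip()


def find_first_url(text):
    """Return the first http(s) URL found in text, or None."""
    match = URL_RE.search(text)
    return match.group(0) if match else None


def extract_title(html):
    """Return the <title> text of an HTML document, or None."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def fetch_title(url, timeout=10):
    """Download url and return its title (needs network; UTF-8 assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return extract_title(response.read().decode("utf-8", errors="replace"))
```

Usage would be along the lines of fetch_title(find_first_url(text)), after checking that a URL was actually found. Some sites reject requests without a browser-like User-Agent header, which this sketch does not set.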

The regular expression for finding the image url in <img> tag in HTML using VB .Net code

I want to extract the image URL from any website. I am reading the source info through webRequest. I want a regular expression which will fetch the image URL from this content, i.e. the src value in the <img> tag.
I'd recommend using an HTML parser to read the HTML and pull the image tags out of it, as regexes don't mesh well with data structures like XML and HTML.
In C#: (from this SO question)
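// Requires the HtmlAgilityPack library
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
// Select every <img> element that has a src attribute
var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
foreach (var node in nodes)
{
    Console.WriteLine(node.GetAttributeValue("src", ""));
}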
/(?:\"|')[^\\x22*<>|\\\\]+?\.(?:jpg|bmp|gif|png)(?:\"|')/i
is a decent one I have used before. This gets any reference to an image file within an html document. I didn't strip " or ' around the match, so you will need to do that.
Try this*:
<img .*?src=["']?([^'">]+)["']?.*?>
Tested here with:
<img class="test" src="/content/img/so/logo.png" alt="logo homepage">
Gives
$1 = /content/img/so/logo.png
The $1 (you have to mouseover the match to see it) corresponds to the part of the regex between (). How you access that value will depend on what implementation of regex you are using.
*If you want to know how this works, leave a comment
EDIT
As nearly always with regexp, there are edge cases:
<img title="src=hack" src="/content/img/so/logo.png" alt="logo homepage">
This would be matched as 'hack'.
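A parser avoids this particular edge case because it reads attributes by name instead of searching for the text "src=". As a rough illustration in Python (standard library only, not VB.NET), with hypothetical names ImgSrcCollector and image_sources:

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs already split by the parser,
        # so text like title="src=hack" is never mistaken for the src value.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(html):
    """Return the list of img src values found in html."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


if __name__ == "__main__":
    tricky = '<img title="src=hack" src="/content/img/so/logo.png" alt="logo homepage">'
    print(image_sources(tricky))
```

The returned values are whatever the page wrote in src, so relative paths such as /content/img/so/logo.png would still need to be resolved against the page URL.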