Extract all images from HTML whose width or height is higher than a specified value - Regex

I'm trying to build a small link-sharing function in Classic ASP, like the ones on LinkedIn or Facebook.
What I need to do is fetch the HTML of a remote URL and extract all the images whose width is greater than, for example, 50px.
I can crawl the page and get the HTML, and I can find the images with this regex:
<img([^<>+]*)>
It matches: <img src="/images/icon.jpg" width="60" height="90" style="display:none"/>
I'm then able to extract the path, but sometimes it matches <img src="/track.php" style="display:none" width="1" height="1"/>, which is not a real image.
Anyway, I feel like you are going to be mad because of Classic ASP, but my company ....
I know there are lots of topics about this issue, and most of them recommend not using regex, but I couldn't find a way to do this with Classic ASP. Is there a component or something for this?
Regards

This will get you close:
<img [^>]*width="(0?[1-9]\d{2,}|[5-9]\d)"[^>]*>
It accepts image tags with a width of 50 or greater.
Edit: tags with unspecified widths:
<img [^>]*width="(0?[1-9]\d{2,}|[5-9]\d)"[^>]*>|<img ((?!width=)[^>])*>
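The same filter can be sketched with an HTML parser instead of a regex. The question is about Classic ASP, so this Python version is only an illustration of the idea; the names LargeImageCollector and large_images are hypothetical, and the sketch assumes sizes are given as width/height attributes (plain integers or values such as "60px"). Unlike the second regex above, it skips images that declare no size at all.

```python
from html.parser import HTMLParser


class LargeImageCollector(HTMLParser):
    """Collect src values of <img> tags whose width or height meets a minimum."""

    def __init__(self, min_size=50):
        super().__init__()
        self.min_size = min_size
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if not src:
            return
        for name in ("width", "height"):
            value = (attributes.get(name) or "").strip()
            # Accept plain integers and values such as "60px".
            digits = value[:-2] if value.endswith("px") else value
            if digits.isdecimal() and int(digits) >= self.min_size:
                self.sources.append(src)
                break


def large_images(html, min_size=50):
    """Return the src of every image at least min_size wide or tall."""
    collector = LargeImageCollector(min_size)
    collector.feed(html)
    return collector.sources
```

Because the parser hands over attributes as name/value pairs, the order of width, height and src inside the tag does not matter, which is the part that is awkward to express in a single regex.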

Related

How to add an extra parameter to the img source in HTML using perl

I have a situation where I need to differentiate two calls by the path in the src of an HTML img tag. This is what the img tag looks like:
<img src="/folder/12280218/160024536.images.jpg" />
I am planning to alter the source to
<img src="/folder/12280218/160024536.images.jpg/1" />
observe the "/1" at the end of src
I need this so that I can change the flow in the controller when I am serving this image.
This is what I have tried until now.
my $string = '<p><img src="/folder/12280218/160024536.images.jpg" /></p>';
$string =~ s/<img\s+src\=\"(.*)"\s+\/><\/p>/<img src\=\"$1\/1" \><\/p>/g;
This works as long as $string looks exactly like this.
In our application, the user can alter the HTML input using CKEditor.
They can alter the image tag by adding width="800" before or after the src attribute. I want the regular expression to handle all of these situations.
Please let me know how to proceed.
Thanks in advance.
Replace :
(<img.*src="[^"]*)(".*\/>)
by
$1/1$2
Demo here
Edit : Changed the regex to handle situations with other attributes (like the "width" part)
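As an illustration of the same replacement outside Perl, here is a minimal Python sketch. The helper name append_to_src is hypothetical, and the pattern assumes src values are double-quoted, as in the question. It restricts the match to a single tag with [^>]* rather than .*, so that several images in the same string are each rewritten.

```python
import re

# Group 1: the tag up to and including src=" ; group 2: the src value;
# group 3: the closing quote. [^>]*? keeps the match inside one tag.
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]*)(")', re.IGNORECASE)


def append_to_src(html, suffix="/1"):
    """Append suffix to the src value of every <img> tag in html."""
    return IMG_SRC.sub(
        lambda m: m.group(1) + m.group(2) + suffix + m.group(3), html
    )
```

Note that the sketch is not idempotent: running it twice over the same HTML appends the suffix twice, so it should be applied only once per document.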

Sitecore - setting background-image a div to a CMS value

I've been trying to set the background image of a div to a value from Sitecore (8.0) in C# MVC using the code
<div style="background-image: url({Model.MyImage.Src})">
where MyImage is of type Image as returned by GlassView.
This is returning html such as
<div style="background-image: url(/~/media/myFolders/myImage.ashx)">
This image isn't displayed when the page is rendered, although the URL resolves when entered into the browser's address bar, so it must be an issue with using the .ashx extension as a background image for a div.
I also tried using Sitecore.Resources.Media.MediaManager.GetMediaUrl(mediaItem), but this also returned the .ashx URL, which couldn't be resolved!
Try background-image: url('@Model.Image.Src'). While your example doesn't show it, you most likely have spaces in your folder or file name, which requires single or double quotes.
Add single quotes around it.
url('{Model.MyImage.Src}')
Should make this
url('/~/media/myFolders/myImage.ashx')
In my case, on Sitecore 8.2, the background-image approach stopped working because of the style attribute.
I replaced the style attribute with an img tag and it started working:
<span><img src="@item.GetImageUrl("MyImage")" alt="" class="icon" /></span>

Specific xPath and Regex - Web Crawling

I'm currently in the process of trying to scrape a website. The problem is the information is placed on google maps in an iframe. Specifically, Latitude and Longitude.
I'm able to get all the other information I currently need except this. Searching around, and working with import.io tech support, I found I need to use a specific XPath and regex to pull this information, but the code I found on the site has me lost. Ideally I'd like to pull latitude and longitude separately. This is the code I have to work with.
What are my options? Thank you.
<div class="padding-listItem--sm">
<iframe width="100%" height="310" frameborder="0" allowfullscreen="" src="https://www.google.com/maps/embed/v1/place?q=33.3929503,-111.908652&key=AIzaSyDK08tC4NRubbIiw-xwDR1WEp-YAXX1Mx8" style="border:0"></iframe>
</div>
1) Get the src attribute of the iframe element.
String srcText = driver.findElement(By.tagName("iframe")).getAttribute("src");
2) Parse the url (found in srcText) for the latitude and longitude values.
Regex to find both numbers:
/([-]?\d+\.\d+)/g
when the url is as you specified:
https://www.google.com/maps/embed/v1/place?q=33.3929503,-111.908652&key=AIzaSyDK08tC4NRubbIiw-xwDR1WEp-YAXX1Mx8
The XPath to obtain the iframe source is:
//div[@class='padding-listItem--sm']/iframe/@src
Then you can apply a regex like this one to obtain latitude and longitude
/q=(-?[\d\.]*),(-?[\d\.]*)/g
Implementation online Here
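Once the src value has been obtained, the second step can also be done without a regex by parsing the query string. This is a minimal Python sketch, assuming the q parameter always has the "latitude,longitude" form shown in the question; the function name lat_lng_from_embed is hypothetical and the API key in the usage note is a placeholder.

```python
from urllib.parse import urlparse, parse_qs


def lat_lng_from_embed(src):
    """Return (latitude, longitude) from a Google Maps embed URL's q parameter."""
    query = parse_qs(urlparse(src).query)
    # q looks like "33.3929503,-111.908652"; split on the first comma only.
    lat_text, lng_text = query["q"][0].split(",", 1)
    return float(lat_text), float(lng_text)
```

For example, lat_lng_from_embed("https://www.google.com/maps/embed/v1/place?q=33.3929503,-111.908652&key=PLACEHOLDER") yields the two values as separate floats. A URL without a q parameter raises KeyError, and a q value that is a place name rather than coordinates raises ValueError.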

Embedding issuu

I need to embed an issuu document inside a website. The website administrator should be allowed to decide which document is displayed on the frontend.
This is an easy task, using the embed link on the issuu page. But I need to customize some options - for instance, disable sharing, set the dimensions and so on. I cannot rely on the administrators doing this process every time they need to change the document.
I can easily customize the issuu embed code to my taste, and all that I need is the document id. Unfortunately, the id is not included in the issuu page for the document. For instance, the id for this random link happens to be 110209071155-d0ed1d10ac0b40dda80dad24166a76ee, which is nowhere to be found, neither in the URL nor easily inside the page. You have to dig into the embed code to find it.
I thought the issuu API could allow me to get a document id given its URL, but I cannot find anything like this. The closest match is the search API, but if I search for the exact name of the document I get only one match for a different document!
Is there some easy way to be able to embed a document only knowing its URL? Or an easy way for a non techie person to find a document id in the page?
Unfortunately, the only way to customize it is to pay for the service, which is $39 per month =/.
You can force a fullscreen mode without ads by using
<body style="margin:0px;padding:0px;overflow:hidden">
<iframe src="YOUR ISSU EMBED" frameborder="0" style="overflow:hidden;height:105%;width:105%;position:absolute;" height="100%" width="100%"></iframe>
</body>
You can of course embed stacks, but that option isn't shown on the Issuu site. This is the code (it's old code, but it works):
<iframe src="http://static.issuu.com/widgets/shelf/index.html?folderId=FOLDERID&theme=theme1&rows=1&thumbSize=large&roundedCorners=true&showTitle=true&showAuthor=false&shadow=true&effect3d=true" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" width="100%" height="200"></iframe>
FOLDERID is the 36-character identifier that appears in the address bar when you open a stack (example: https://issuu.com/username/stacks/FOLDERID). When you substitute it into the code, you must paste the 36 characters in the 8-4-4-4-12 format, with - between the groups. And voilà, it works.
You can change the theme and other settings in the code.
The Document ID is found in the HTML source of every document. It is in the og:video meta property.
<meta property="og:video" content="http://static.issuu.com/webembed/viewers/style1/v2/IssuuReader.swf?mode=mini&documentId=XXXXXXXX-XXXXXXXXXXXXX&pageNumber=0">
You can easily extract it using the DOMDocument and DOMXPath PHP classes.
Here is how to do it in PHP:
// Your document URL
$url = 'https://issuu.com/proyectotres/docs/proyecto_3_edicion_135';
// Suppress libxml parse warnings while loading the URL, then restore error handling
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
libxml_use_internal_errors(false);
// DOMXPath helps find the <meta property="og:video" content="http://hereyoucanfindthedocumentid?documentId=xxxxx-xxxxxxx"/>
$xpath = new DOMXPath($dom);
$meta = $xpath->query("//html/head/meta[@property='og:video']");
// Get the content attribute of the first matching <meta> node and parse its query string
$vars = [];
parse_str(parse_url($meta->item(0)->getAttribute('content'))['query'], $vars);
// Ready. The document ID is here:
$docID = $vars['documentId'];
// You can print it:
echo $docID;
You can try it with the URL of your own Issuu document.
You can use the Issuu URL of your document to complete this iframe:
<iframe width="100%" height="283" style="display: block; margin-left: auto; margin-right: auto;" src="https://e.issuu.com/issuu-reader3-embed-files/latest/twittercard.html?u=nantucketchamber&d=program-update1&p=1" frameborder="0" allowfullscreen="allowfullscreen" span="" id="CmCaReT"></iframe>
You just need to replace "nantucketchamber" with the user name and "program-update1" with the file name from the Issuu URL
(for this example the URL is https://issuu.com/nantucketchamber/docs/program-update1)
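The substitution described in the last answer can be sketched as a small helper. This is an illustrative Python sketch, not part of any Issuu API: the function name issuu_embed_src is hypothetical, the embed base URL is copied from the iframe above (whether that endpoint is still served is not verified here), and document URLs are assumed to have the form https://issuu.com/<user>/docs/<name>.

```python
from urllib.parse import urlparse, urlencode

# Base URL copied from the iframe in the answer above (an assumption, not a documented API).
EMBED_BASE = "https://e.issuu.com/issuu-reader3-embed-files/latest/twittercard.html"


def issuu_embed_src(document_url, page=1):
    """Build the iframe src for an Issuu document URL of the form /<user>/docs/<name>."""
    parts = [p for p in urlparse(document_url).path.split("/") if p]
    if len(parts) != 3 or parts[1] != "docs":
        raise ValueError("expected https://issuu.com/<user>/docs/<name>")
    user, _, name = parts
    return EMBED_BASE + "?" + urlencode({"u": user, "d": name, "p": page})
```

With a helper like this, the administrator only has to paste the document URL, and the site can fill in the iframe with whatever fixed dimensions and options it needs.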

The regular expression for finding the image url in <img> tag in HTML using VB .Net code

I want to extract the image URLs from any website. I am reading the page source through WebRequest. I want a regular expression which will fetch the image URL from this content, i.e. the src value in the <img> tag.
I'd recommend using an HTML parser to read the html and pull the image tags out of it, as regexes don't mesh well with data structures like xml and html.
In C#: (from this SO question)
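// Requires the HtmlAgilityPack package (using HtmlAgilityPack;)
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
// Select every <img> element that has a src attribute
var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
foreach (var node in nodes)
{
    Console.WriteLine(node.GetAttributeValue("src", ""));
}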
/(?:\"|')[^\\x22*<>|\\\\]+?\.(?:jpg|bmp|gif|png)(?:\"|')/i
is a decent one I have used before. This gets any reference to an image file within an html document. I didn't strip " or ' around the match, so you will need to do that.
Try this*:
<img .*?src=["']?([^'">]+)["']?.*?>
Tested here with:
<img class="test" src="/content/img/so/logo.png" alt="logo homepage">
Gives
$1 = /content/img/so/logo.png
The $1 (you have to mouseover the match to see it) corresponds to the part of the regex between (). How you access that value will depend on what implementation of regex you are using.
*If you want to know how this works, leave a comment
EDIT
As nearly always with regexp, there are edge cases:
<img title="src=hack" src="/content/img/so/logo.png" alt="logo homepage">
This would be matched as 'hack'.
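An HTML parser avoids this edge case because it reads src as an attribute rather than as text inside the tag. The question asks for VB .Net, so the following Python sketch is only an illustration of the approach, using the standard library's html.parser; the names ImageSrcCollector and image_sources are hypothetical.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs parsed from the tag.
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(html):
    """Return the src values of all <img> tags in html, in document order."""
    collector = ImageSrcCollector()
    collector.feed(html)
    return collector.sources
```

Here the title attribute containing "src=hack" is parsed as a separate attribute value, so only the real src is collected.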