The regular expression for finding the image url in <img> tag in HTML using VB .Net code - regex

I want to extract the image URLs from any website. I am reading the page source through WebRequest. I want a regular expression that will fetch the image URL from this content, i.e. the src value in the <img> tag.

I'd recommend using an HTML parser to read the html and pull the image tags out of it, as regexes don't mesh well with data structures like xml and html.
In C#: (from this SO question)
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
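// Select every <img> element that has a src attribute (SelectNodes returns null when nothing matches).
var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
foreach (var node in nodes)
{
    Console.WriteLine(node.GetAttributeValue("src", ""));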
}
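For comparison, the same parser-based idea can be sketched in Python using only the standard library's html.parser module. The names ImgSrcCollector and extract_img_sources are made up for this illustration, and the sketch assumes the page source has already been downloaded into a string.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lower-cased from the parser.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def extract_img_sources(html_text):
    """Return the src values of all <img> tags in html_text, in order."""
    parser = ImgSrcCollector()
    parser.feed(html_text)
    parser.close()
    return parser.sources


if __name__ == "__main__":
    page = '<p><img class="test" src="/content/img/so/logo.png" alt="logo"></p>'
    print(extract_img_sources(page))
```

Because the parser tokenizes attributes itself, text that merely looks like src= inside another attribute value is not picked up.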

/(?:\"|')[^\\x22*<>|\\\\]+?\.(?:jpg|bmp|gif|png)(?:\"|')/i
is a decent one I have used before. This gets any reference to an image file within an html document. I didn't strip " or ' around the match, so you will need to do that.

Try this*:
<img .*?src=["']?([^'">]+)["']?.*?>
Tested here with:
<img class="test" src="/content/img/so/logo.png" alt="logo homepage">
Gives
$1 = /content/img/so/logo.png
The $1 (you have to mouseover the match to see it) corresponds to the part of the regex between (). How you access that value will depend on what implementation of regex you are using.
*If you want to know how this works, leave a comment
EDIT
As nearly always with regexp, there are edge cases:
<img title="src=hack" src="/content/img/so/logo.png" alt="logo homepage">
This would be matched as 'hack'.
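A small Python sketch can reproduce both results, assuming the pattern is used unchanged with Python's re module. STRICTER is a hypothetical variant, not part of the answer above: it requires whitespace before src, which avoids this particular edge case but can still be fooled (for example by a title containing " src=").

```python
import re

# The pattern from the answer, unchanged.
ORIGINAL = re.compile(r'''<img .*?src=["']?([^'">]+)["']?.*?>''')

# Hypothetical tightened variant: src must be preceded by whitespace
# and the scan may not run past the end of the tag.
STRICTER = re.compile(r'''<img\b[^>]*?\ssrc\s*=\s*["']?([^'">\s]+)''', re.IGNORECASE)


def first_src(pattern, html_text):
    """Return capture group 1 of the first match, or None if there is no match."""
    match = pattern.search(html_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    plain = '<img class="test" src="/content/img/so/logo.png" alt="logo homepage">'
    tricky = '<img title="src=hack" src="/content/img/so/logo.png" alt="logo homepage">'
    print(first_src(ORIGINAL, plain))
    print(first_src(ORIGINAL, tricky))
    print(first_src(STRICTER, tricky))
```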

Related

Removing entire tags containing a specific term using regex

I am altering a database with approximately 500 HTML pages using phpMyAdmin.
Several pages contain a Facebook Pixel or Google Tag that I would like to remove.
The easiest way I thought would be to search via regex the entire tag that contains some expression or term related to Facebook or Google, and replace it with blank.
An example would be
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag('js', new Date());
gtag('config', 'G-XXXXXXXX');
</script>
or
<script>
(window, document, 'script', 'https://connect.facebook.net/en_US/fbevents.js');
fbq('init', '9999999999999999');
fbq('track', 'salespage_xxxxxx');
</script>
Although all are unique, some have the same code or another element that makes it possible to identify each one of them.
Before running it in phpMyAdmin, I'm trying to formulate the expression using Sublime Text 3.
This is my first contact with regex and I find it fascinating, but even after following some references I can't get the search to match.
The expression I came up with after some research was
<(.*)>[\s\S]face[\s\S]<\/(.*)>
Where I thought the expression would select the entire tag containing the word "face", but it doesn't find anything.
I would like some help.
If it works, I would be able to make several other necessary changes the same way.
This regular expression matches a <script> tag whose body contains the keyword face:
<(script)>(?:(?!<\/\1>|face)[\s\S])+face(?:(?!<\/\1>)[\s\S])+<\/\1>
See example: https://regex101.com/r/LfRlBV/1
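The same technique (a negative lookahead checked before every consumed character, so the match can never run past the closing tag) can be sketched in Python. The helper name remove_tags_containing is made up for this illustration. Unlike the pattern above, this sketch also accepts attributes on the opening tag and matches the keyword case-insensitively; it only looks for the keyword in the tag body, so a script that references Facebook solely through its src attribute would not be removed.

```python
import re


def remove_tags_containing(html_text, tag, keyword):
    """Remove every <tag>...</tag> block whose body contains keyword."""
    tag_re = re.escape(tag)
    kw_re = re.escape(keyword)
    pattern = re.compile(
        rf"<{tag_re}\b[^>]*>"                    # opening tag, attributes allowed
        rf"(?:(?!</{tag_re}>|{kw_re})[\s\S])*"   # body up to the keyword, never crossing the closing tag
        rf"{kw_re}"                              # the keyword itself
        rf"(?:(?!</{tag_re}>)[\s\S])*"           # remainder of the body
        rf"</{tag_re}>",                         # closing tag
        re.IGNORECASE,
    )
    return pattern.sub("", html_text)


if __name__ == "__main__":
    sample = (
        "<script>\ngtag('config', 'G-XXXXXXXX');\n</script>\n"
        "<script>\n(window, document, 'script', "
        "'https://connect.facebook.net/en_US/fbevents.js');\n</script>"
    )
    print(remove_tags_containing(sample, "script", "face"))
```

Run on the sample, the Google block is kept and only the block mentioning facebook is dropped, because the first block reaches </script> before any occurrence of the keyword.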

How to add an extra parameter to the img source in HTML using perl

I have a situation where I need to differentiate two calls by the path in the src of an HTML img tag. This is what the img tag looks like:
<img src="/folder/12280218/160024536.images.jpg" />
I am planning to alter the source to
<img src="/folder/12280218/160024536.images.jpg/1" />
Note the "/1" at the end of the src value.
I need this so that I can change the flow in the controller when I am serving this image.
This is what I have tried until now.
my $string = '<p><img src="/folder/12280218/160024536.images.jpg" /></p>';
$string =~ s/<img\s+src\=\"(.*)"\s+\/><\/p>/<img src\=\"$1\/1" \><\/p>/g;
This is working as long as the $string looks like this.
In our application, users can alter the HTML input using CKEditor.
They can change the image tag by adding width="800" before or after the src attribute. I want the regular expression to handle all these situations.
Please let me know how to proceed.
Thanks in advance.
Replace :
(<img.*src="[^"]*)(".*\/>)
by
$1/1$2
Demo here
Edit : Changed the regex to handle situations with other attributes (like the "width" part)
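As an illustration of the same substitution outside Perl, here is a minimal Python sketch. The function name add_src_suffix is hypothetical, and the pattern is a variant rather than the one above: it uses [^>]*? instead of .* so that each img tag on a line is handled separately, and it assumes the src value is double-quoted.

```python
import re

# Capture everything from "<img" up to the end of the src value;
# the closing quote is matched outside the group.
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc="[^"]*)"', re.IGNORECASE)


def add_src_suffix(html_text, suffix="/1"):
    """Append suffix to the src value of every <img> tag in html_text."""
    return IMG_SRC.sub(lambda m: m.group(1) + suffix + '"', html_text)


if __name__ == "__main__":
    print(add_src_suffix('<p><img width="800" src="/folder/12280218/160024536.images.jpg" /></p>'))
```

Attributes before or after src are left untouched, since only the text up to the closing quote of src is rewritten.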

Hyperlinks inside object fields

I have an object inside my Ember app, with a description field. This description field may contain hyperlinks, like this
My fancy text <a href='http://other.site.com' target='_blank'>My link</a> My fancy text continues...
However, when I output it normally, like {{ description }}, my hyperlinks are displayed as plain text. Why is this happening and how can I fix it?
Handlebars escapes any HTML within output by default. For unescaped text in markup use triple-stashes:
{{{ description }}}
There's an alternative when one controls the property: Handlebars.SafeString. SafeStrings are assumed to be safe and are not escaped either. From the documentation:
Handlebars.registerHelper('link', function(text, url) {
text = Handlebars.Utils.escapeExpression(text);
url = Handlebars.Utils.escapeExpression(url);
var result = '<a href="' + url + '">' + text + '</a>';
return new Handlebars.SafeString(result);
});
Note - please be careful with this. There are security concerns with rendering unescaped text that has come from user input; an attacker could inject a malicious script into the description and hijack your page, for example.

Extracting string from jmeter response using regular expression extractor

I'm trying to extract the string (201 & 202) from the html response code below.
So far I have tried the following regex
punumber=(.+)
but the problem is that there are many instances of punumber on the page, so it returns values that I don't need.
The strings I need are inside the <h3 class="content-title"> elements.
Can someone please help me write a regex that extracts the punumber within those h3 elements only?
<h3 class="content-title">
<!-- change when this is completed -->
<a href="/container/recentIssue.jsp?punumber=201">
Title 1
</a>
</h3>
<h3 class="content-title">
<!-- change when this is completed -->
<a href="/container/mostRecentIssue.jsp?punumber=202">
Title 1
</a>
</h3>
This works for me:
Reference Name : test
Regexp : punumber=([^"]+?)"
Template : $1$
Match No : -1
(this will get all values)
With -1, JMeter will create:
${test_1} => 201
${test_2} => 202
Here is the regex that works for me :
punumber=(\d+)
If you're parsing HTML, you should consider using something other than regex to extract the information, such as jsoup.
Anyway, here is a JMeter test file with a dummy sampler (with a regex post-processor) simulating your case and a debug sampler that shows the result you want.
http://pastebin.com/Uti8Pv9E
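The requirement from the question (punumber values inside <h3 class="content-title"> only) could be met by a scoped pattern. The following Python sketch only illustrates that pattern's behaviour; the names H3_PUNUMBER and punumbers_in_titles are made up, and whether the identical expression behaves the same in JMeter's own regex engine is an assumption that would need checking there.

```python
import re

H3_PUNUMBER = re.compile(
    r'<h3 class="content-title">'   # start only at the wanted heading
    r'(?:(?!</h3>)[\s\S])*?'        # stay inside this <h3>, never cross </h3>
    r'punumber=(\d+)'               # capture the digits
)


def punumbers_in_titles(html_text):
    """Return the punumber values found inside content-title headings."""
    return H3_PUNUMBER.findall(html_text)


if __name__ == "__main__":
    page = (
        '<a href="/other.jsp?punumber=999">elsewhere</a>\n'
        '<h3 class="content-title">\n'
        '<a href="/container/recentIssue.jsp?punumber=201">\nTitle 1\n</a>\n</h3>\n'
        '<h3 class="content-title">\n'
        '<a href="/container/mostRecentIssue.jsp?punumber=202">\nTitle 1\n</a>\n</h3>\n'
    )
    print(punumbers_in_titles(page))
```

The link outside the headings is ignored because the pattern can only begin at an <h3 class="content-title"> opening tag.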
In this case you can combine an XPath Extractor with a structured query (to get all href values containing punumber ONLY from links inside <h3> tags) with a Regular Expression Extractor inside a ForEach Controller loop, which then pulls the punumber value out of each href.
. . .
YOUR HTTP REQUEST
XPath Extractor
Use Tidy = true
Reference Name = punum
XPath Query = //h3[@class="content-title"]/a[text()="Title 1"]/@href
Default value = NOT_FOUND
ForEach Controller
Input variable prefix = punum
Output variable name = pnum
Add "_" before number = true
User Parameters
cnt = ${__counter(FALSE,)}
Regular Expression Extractor
Apply to = Jmeter Variable = pnum
Reference Name = punumber_${cnt}
Regular Expression = punumber=(\d+)
Template = $1$
Match No. = 1
Default value = NOT_FOUND
...
. . .
The XPath Extractor gives you the href values of all the <a> items under <h3> tags as the variables punum_1, punum_2, ..., punum_N.
The ForEach Controller takes each punum_X variable in turn, exposes it as pnum, applies the Regular Expression Extractor to it to get the punumber value, and stores the extracted value as punumber_1, punumber_2, ..., punumber_N (using the counter defined in User Parameters, which is incremented on each step).
NOTE: Since the XPath Extractor is used here to parse an HTML (not XML) response, ensure that the Use Tidy (tolerant parser) option is CHECKED in the XPath Extractor's control panel.
The same test plan is available here: http://db.tt/dnACZtGL (I've used @ant's one from his answer, thank him).

actionscript htmltext. removing a table tag from dynamic html text

AS 3.0 / Flash
I am consuming XML which I have no control over.
The XML has HTML in it, which I am styling and displaying in an HTML text field.
I want to remove all the HTML except the links.
Strip all HTML tags except links
is not working for me.
Does anyone have any tips? A regex?
The following removes tables:
var reTable:RegExp = /<table\s+[^>]*>.*?<\/table>/s;
but now I realize I need to keep the content that is inside the tables, and I also need the links.
thanks!!!
cp
Probably shouldn't use regex to parse html, but if you don't care, something simple like this:
find /<table\b[^>]*>.*?<\/table\s*>/gs
replace ""
ActionScript has a pretty neat tool for handling XML: E4X. Rather than relying on RegEx, which I find often messes things up with XML, just modify the actual XML tree, and from within AS:
var xml : XML = <page>
<p>Other elements</p>
<table><tr><td>1</td></tr></table>
<p>won't</p>
<div>
<table><tr><td>2</td></tr></table>
</div>
<p>be</p>
<table><tr><td>3</td></tr></table>
<p>removed</p>
<table><tr><td>4</td></tr></table>
</page>;
clearTables (xml);
trace (xml.toXMLString()); // will output everything but the tables
function clearTables (xml : XML ) : void {
xml.replace( "table", "");
for each (var child:XML in xml.elements("*")) clearTables(child);
}
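For the stated goal of keeping the table content and the links while dropping every other tag, a regex-only approach can be sketched as follows. This is a Python illustration rather than ActionScript, the name strip_tags_except_links is made up, and it assumes reasonably well-formed markup with no ">" characters inside attribute values.

```python
import re

# Match any opening or closing tag whose name is not exactly "a".
# (?!a\b) rejects <a ...> and </a> but still allows <abbr>, <address>, etc.
NON_LINK_TAG = re.compile(r'</?(?!a\b)[a-zA-Z][^>]*>', re.IGNORECASE)


def strip_tags_except_links(html_text):
    """Remove all tags except <a>...</a>, keeping the text between them."""
    return NON_LINK_TAG.sub("", html_text)


if __name__ == "__main__":
    snippet = ('<table><tr><td>See <a href="http://other.site.com">My link</a>'
               ' here</td></tr></table><p>abc</p>')
    print(strip_tags_except_links(snippet))
```

Only the tags themselves are deleted, so cell text and link markup survive; the table layout is lost, which matches the requirement as described in the question.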