I have a string that is the value of an attribute of some HTML tag.
How can I check whether this string contains JavaScript?
For example (SRC attribute of IMG tag):
1. <IMG src="javascript:alert('XSS')"> - contains script
2. <IMG src="JaVaScRiPt:alert('XSS')"> - contains script
3. <IMG javascript:alert('XSS')> - also contains javascript
You first have to canonicalize, then check. But I would look at HtmlPurifier or OWASP AntiSamy for that.
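To illustrate what "canonicalize, then check" can mean, here is a minimal Python sketch. The function name and the list of schemes are my own assumptions, and this covers only a few obfuscation tricks (character references, mixed case, embedded whitespace or control characters). It is not a substitute for a real sanitizer such as the libraries named above.

```python
import html
import re

# ASCII control characters and whitespace, which can be used to
# break up a scheme name (assumption: we drop them all before checking).
_IGNORED = re.compile(r"[\x00-\x20\x7f]+")

# Hypothetical, non-exhaustive list of script-capable schemes.
_SCRIPT_SCHEMES = ("javascript:", "vbscript:")


def looks_like_script_url(value: str) -> bool:
    """Canonicalize an attribute value, then test it for a script scheme."""
    decoded = html.unescape(value)                   # &#106; -> j, &amp; -> &
    canonical = _IGNORED.sub("", decoded).lower()    # drop whitespace, fold case
    return canonical.startswith(_SCRIPT_SCHEMES)
```

Calling looks_like_script_url("JaVaScRiPt:alert('XSS')") returns True, while an ordinary path such as "/images/logo.png" returns False.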
It's pretty hard to do, as there are lots of odd and tricky ways to sneak JavaScript in.
HTMLPurifier has pretty complex parsing to filter out all potentially unsafe HTML if you must allow HTML input in the first place.
However, generally you shouldn't even try to do that, and simply always escape the string.
In PHP that is:
echo htmlspecialchars($string);
In JS you can use document.createTextNode() or jQuery's equivalent $(el).text() to safely insert text into DOM (those two methods don't require escaping).
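For comparison, the same "always escape" idea in Python uses html.escape from the standard library. This is only a sketch: the wrapper name render_text is hypothetical, and it assumes the string is being placed in an HTML text or quoted-attribute context.

```python
import html


def render_text(untrusted: str) -> str:
    """Escape &, <, > and both quote characters so the browser shows the
    string as text instead of interpreting it as markup."""
    return html.escape(untrusted, quote=True)
```

After escaping, the IMG example from the question is displayed literally and no tag is created.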
Related
I'm struggling here, trying to figure out how to replace all double slashes that come after a specific word.
Example:
<img alt="" src="/pt/webf//2015//47384_1.JPG" height="235" width="378" />
<div>Don't remove this // or this//</div>
I want the string above to look like this:
<img alt="" src="/pt/webf/2015/47384_1.JPG" height="235" width="378" />
<div>Don't remove this // or this//</div>
Notice the double slashes have been replaced with just one slash in the img tag but left unscathed in the div tag. I only want to replace the double slashes IF they come after the word: pt.
I tried something like this:
(?=pt)((.*?)\/\/)+
However, the first thing wrong with it is (?=) does not do pattern backtracking, as far as I'm aware. That is, it'll only look for the first matching pattern. The second thing wrong with it is it doesn't work as I intended it to.
https://regex101.com/r/kC4tA5/1
Or maybe I'm going about this the wrong way, since regular expression support is not expansive in VBScript/Classic ASP and I should try to break up the string and process, instead of trying to do everything in one regular expression???
Any help would be appreciated.
Thank you.
I am interpreting your issue as "Removing repeated slashes in all <img src> attributes."
As I said in the comments, working with HTML requires a parser. HTML is too complex for regular expressions; all kinds of things can go wrong.
Luckily, there is a parser available to VBScript: The htmlfile object. It creates a standard DOM from your HTML string. So the solution becomes exactly as described:
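Function FixHtml(htmlString)
    Dim doc, img, slashes

    ' Matches any run of one or more slashes
    ' (note: this also affects the // in absolute URLs such as http://)
    Set slashes = New RegExp
    slashes.Pattern = "/+"
    slashes.Global = True

    ' Parse the HTML string into a DOM
    Set doc = CreateObject("htmlfile")
    doc.Write htmlString

    For Each img In doc.getElementsByTagName("IMG")
        ' Collapse each run of slashes into a single slash
        img.src = slashes.Replace(img.src, "/")
        ' Strip the "about:" / "about:blank" prefix that htmlfile prepends
        img.src = Replace(Replace(img.src, "about:blank", ""), "about:", "")
    Next

    FixHtml = doc.body.innerHTML
End Function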
Unfortunately, htmlfile is not the most advanced HTML parser in the world, but rest assured that it will still do way better than any regex.
There are two minor issues:
I found in my tests that for some reason it insists on prepending the img.src with about: or about:blank. This should not happen, but it does. The second line of Replace() calls gets rid of the unwanted additions.
The .innerHTML will produce tag names in upper case, so <img> becomes <IMG> in the output. Also, insignificant line breaks in the HTML source might be removed. This is a minor annoyance; I recommend you don't obsess over it.(*)
But there are two big plus sides as well:
The DOM puts you in a position where you can work with the input in a structured way. You can put in any number of complex fixes now that would have been impossible to do with regex.
The return value of .innerHTML is sane HTML. It will fix any gross blunder in the input and turn it into something that is well-nested, well-escaped and otherwise well-behaved.
(*) If you do find yourself obsessing over it, you can use the wisdom from this blog post to create a function that replaces all uppercase tags that come out of .innerHTML with lowercase versions of themselves. This actually is something you can use regex for ("(</?[A-Z]+)", to be exact), because we know that there will be no stray < not belonging to a tag anywhere in the string, because that's .innerHTML's guarantee. While it would be a nice exercise (and it introduces you to the little-known fact that VBScript has function pointers), I would say it's not really worth it.
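For readers outside VBScript, the same parse-then-fix idea can be sketched in Python with the standard library's html.parser. This is not the htmlfile approach itself: the class and function names are my own, the output is re-serialized by hand (comments and doctype declarations are dropped for brevity), and the slash pattern deliberately skips a // that directly follows a colon so that absolute URLs keep their scheme separator.

```python
import re
from html import escape
from html.parser import HTMLParser

# Two or more slashes, unless the run directly follows a colon (http://).
_EXTRA_SLASHES = re.compile(r"(?<!:)/{2,}")


class SlashFixer(HTMLParser):
    """Re-emits the document, collapsing repeated slashes in <img src>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, closing):
        parts = [tag]
        for name, value in attrs:
            if value is None:                 # bare attribute, e.g. "hidden"
                parts.append(name)
                continue
            if tag == "img" and name == "src":
                value = _EXTRA_SLASHES.sub("/", value)
            parts.append(f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def fix_html(source: str) -> str:
    fixer = SlashFixer()
    fixer.feed(source)
    fixer.close()
    return "".join(fixer.out)
```

Because only the src of img elements is touched, the double slashes inside the div text of the question's example are left alone.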
I'm using the following expression in Classic ASP, which successfully grabs any image tag with a .jpg or .png suffix.
re.Pattern = " ]*src=[""'][^ >]*(jpg|png)[""']"
The problem I've found is that many sites I need to use do not actually use a suffix. So I need a new regex that finds an image tag and grabs whatever is in the src attribute.
As simple as this sounds, finding a regular expression to accomplish this in Classic ASP seems impossible without writing it myself (which IS impossible).
Please advise.
To match plainly on the img src you can do:
\<img src\=\"(\w+\.(gif|jpg|png)\")
And then if you only want the value that's in the img src, you can do a match for anything in quotes ending in a picture extension (but this may get you false positives depending on what you want):
\w+\.(gif|jpg|png)
But to match just the value while ensuring that it follows img src, you need a negative lookahead to do this (note that I added a matching group there):
(?!.*\<img src\=\")(\w+\.(gif|jpg|png))
Now to include the possibility of having image links in your image source:
(?!.*\<img src\=\")([\/\.\-\:\w]+\.(gif|jpg|png)?[\?\w+\%]+)
And then let's remove the false positives we get by fixing that optional quantifier (the ?) after (gif|jpg|png): move it to after the next set (which matches data you may get in a JS link, etc.) and make sure we have an end quote:
(?!.*\<img src\=\")([\/\.\-\:\w]+\.(gif|jpg|png)([\?\w+\%]+)?)(?=\")
Note: This will match this data, but regular expressions don't parse HTML, and I personally don't recommend using regular expressions to look through HTML data unless you're doing it on a case-by-case basis. If you're wanting to do some URL/Image scraping via a script, look into an XML/HTML parser.
Sample data:
<img src="picture.gif">
<img src="pic859.jpg">
<img src="859.png">
<img id="test1" class="answer1" src="text.jpg">
<img src="http://media.site.com/media/img/staff/2013/ROTHBARD-350_s90x126.jpg?e3e29f4a7131cd3bc7c4bf334be801215db5e3c2%22%3E">
<img src="yahoo.com/images/imagename.gif">
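As a sketch of the parser route recommended in the note above, the following Python uses the standard library's html.parser to collect every img src value, with or without a file extension. The class and function names are hypothetical, and this is Python rather than Classic ASP, so treat it as an illustration of the approach only.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def img_sources(html_text: str) -> list:
    collector = ImgSrcCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources
```

Attribute order does not matter here, so a tag like the sample's <img id="test1" class="answer1" src="text.jpg"> is handled the same way as the simpler ones.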
HTML Source
I'm trying to match the following video url:
<iframe width="420" height="315" src="//www.youtube.com/embed/F40ZBDAG8-o?rel=0" frameborder="0" allowfullscreen></iframe>
I have the following:
^<iframe
(\swidth="\d{1,3}")?
(\sheight="\d{1,3}")?
(\salt=""[^""<>]*"")?
(\stitle=""[^""<>]*"")?
\ssrc="//(www.youtube.com|player.vimeo.com)/[-a-z0-9+&##/%?=~_|!:,.;\(\)]+"
(\sframeborder="[^""<>]*")?
(\sallowfullscreen)?
\s?/?></iframe>$
This is working, but I can't rely on the fact that youtube will always provide embed links that follow this structure. If they move the width attribute to after src, my regex will fail.
Is there any way to do order-agnostic groupings, to address this?
You can make each of the search terms a lookahead - these don't consume the strings, so they can be in any order. Example:
<iframe (?=.*height="\d{1,3}")(?=.*width="\d{1,3}").*
will match both
<iframe width="123" height="321"
and
<iframe height="321" width="123"
demo on regex101.com
I am sure you can finish this yourself (adding all the terms you want to match).
Note - this "matches" - it does not "extract". But it will tell you that all these terms are present in the expression, in any order.
EDIT since I started writing this answer a number of comments appeared that change my understanding of your request. If you "just" want to extract the src= thing, you simply do
<iframe.*?src="([^"]+)"
and the match (the thing in brackets) will be whatever is between the first and the second double quote. Typically there are better tools than regex for parsing HTML - my personal preference is BeautifulSoup (Python).
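Both patterns from this answer can be tried out in Python's re module, as in the sketch below. The constant and function names are my own; the patterns are the ones given above, unchanged.

```python
import re

# Each lookahead checks for one attribute without consuming input,
# so the attributes may appear in any order.
HAS_SIZE = re.compile(r'<iframe (?=.*height="\d{1,3}")(?=.*width="\d{1,3}").*')

# Captures whatever sits between the quotes of the first src attribute.
IFRAME_SRC = re.compile(r'<iframe.*?src="([^"]+)"')


def has_width_and_height(tag: str) -> bool:
    return HAS_SIZE.match(tag) is not None


def iframe_src(tag: str):
    match = IFRAME_SRC.search(tag)
    return match.group(1) if match else None
```

The first function only answers yes or no; the second returns the captured src value, or None when there is no match.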
I am trying to read an HTML page using file_get_contents. After I process the data, some incomplete tags are left over, for example:
</p><p> test test test test</p>
In this case there is no <p> to open the first </p>
or
<font color="#333333">abc</font><div><p>go go go go </p>
In this case there is no </div> to close the <div>
So I want to use preg_replace to remove all these incomplete tags. In my examples, the extra </p> and <div> should be removed. How can I do that? These tags can be any valid HTML5 tags.
First, you need to understand what a "well-formed markup document" is in XHTML.
Well-formedness alone does not guarantee that the two tags you pick as a start/end (open/close) pair are the correct two if there is a spare unpaired tag.
Second, you will need a loop that takes one tag type per iteration from an array of the tag types. The tags in the array should be literals.
Take each tag's length (an integer) and store it inside the loop before testing for the tag's presence.
When an open/close pair is found, the preg match call puts the section into an array of matches together with its position and length; read the start position and length of the match from that result array (print the array for debugging while developing the script).
Inside each matched open/close pair, run a sub-loop that does the same thing to check the nested tags.
Synopsis:
Building such a system as a custom script is on a par with writing a reasonably efficient XML well-formedness parser and debugger. Done properly, it would amount to a markup debugger for an IDE.
Good luck.
You should investigate the use of the PHP Tidy extension (http://php.net/manual/en/book.tidy.php). You can use Tidy to clean up malformed HTML based on whatever DOCTYPE you are attempting to validate.
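To show why this is a job for a tokenizer plus a stack rather than a single regular expression, here is a rough Python sketch. It is neither Tidy nor preg_replace: the names are hypothetical, the void-element list is an assumption based on HTML5, and comments are dropped for brevity. It removes closing tags that have no opener and opening tags that are never closed.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (assumed HTML5 void elements).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class _Tokenizer(HTMLParser):
    """Splits markup into (kind, tag, raw_text) tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        kind = "void" if tag in VOID else "open"
        self.tokens.append((kind, tag, self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        self.tokens.append(("void", tag, self.get_starttag_text()))

    def handle_endtag(self, tag):
        self.tokens.append(("close", tag, f"</{tag}>"))

    def handle_data(self, data):
        self.tokens.append(("text", None, data))

    def handle_entityref(self, name):
        self.tokens.append(("text", None, f"&{name};"))

    def handle_charref(self, name):
        self.tokens.append(("text", None, f"&#{name};"))


def drop_unpaired_tags(source: str) -> str:
    tokenizer = _Tokenizer()
    tokenizer.feed(source)
    tokenizer.close()
    tokens = tokenizer.tokens
    keep = [True] * len(tokens)
    stack = []  # indexes of open tags still waiting for their close
    for i, (kind, tag, _) in enumerate(tokens):
        if kind == "open":
            stack.append(i)
        elif kind == "close":
            # Look for the nearest still-open tag with the same name.
            for pos in range(len(stack) - 1, -1, -1):
                if tokens[stack[pos]][1] == tag:
                    for j in stack[pos + 1:]:
                        keep[j] = False   # opened inside, never closed
                    del stack[pos:]
                    break
            else:
                keep[i] = False           # close tag without an opener
    for j in stack:
        keep[j] = False                   # never closed at all
    return "".join(tok[2] for tok, k in zip(tokens, keep) if k)
```

Applied to the two examples in the question, this drops the leading </p> in the first and the unclosed <div> in the second, leaving the rest untouched.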
I want to get the string between the em tags, including any other HTML inside them.
for example:
<em>UNIVERSALPOSTAL UNION - International Bureau Circular<br />
By: K.J.S. McKeown</em>
output should be as:
UNIVERSALPOSTAL UNION - International Bureau Circular<br />
By: K.J.S. McKeown
please help me.
Thanks
Use the regular expression function like this:
REMatch("(?s)<em>.*?</em>", html)
See also: http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=regexp_01.html
The (?s) sets the mode to single line, so that the input text is interpreted as one line even if it contains line feeds. As Peter pointed out in a comment, this is not the default and therefore must be set.
The .*? matches all characters in between <em> and </em>. The question mark after the quantifier makes it "non-greedy", so that as few characters as possible are matched. This is needed in case the input html contains something like <em>foo</em><em>bar</em>, where a greedy match would otherwise run from the first <em> to the last </em>.
The returned array contains all matches found, i.e. all texts including html that was in <em> tags.
Note that this could fail in circumstances where </em> also occurs as attribute text and is incorrectly not html-encoded, for example: <em><a title="Help for </em> tag">click</a></em>, or in other rare circumstances (e.g. javascript script tags etc.). A regex cannot replace a full HTML/XML parser, and if you need 100% accuracy, you should consider using one: http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=functions_t-z_23.html
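The effect of (?s) and of the non-greedy quantifier is easy to see in a small Python sketch. This is Python's re module rather than ColdFusion's REMatch, and the function name is my own; unlike REMatch, the capture group here returns only the text inside the tags.

```python
import re

# (?s) lets "." match line feeds; ".*?" stops at the first </em>.
EM_CONTENT = re.compile(r"(?s)<em>(.*?)</em>")

# Greedy variant, kept only to show the difference.
EM_GREEDY = re.compile(r"(?s)<em>(.*)</em>")


def em_contents(html_text: str) -> list:
    """Return the inner HTML of every <em>...</em> pair found."""
    return EM_CONTENT.findall(html_text)
```

On an input like <em>foo</em><em>bar</em>, the non-greedy pattern finds two separate matches, while the greedy one captures everything from the first <em> to the last </em> as a single match.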
If your input is exactly in the format given above, you don't even need regex - just strip the outer tags:
<cfsavecontent variable="Input">[text from above]</cfsavecontent>
<cfset Output = mid( Input, 5, len(Input) - 9 ) />
If your input is more than this (i.e. a significant piece of HTML, or a full HTML document), regex is still not the ideal tool. Instead, you should be using an HTML parser, such as JSoup:
<cfset jsoup = createObject('java','org.jsoup.Jsoup') />
<cfset Output = jsoup.parse(Input).select('em').html() />
(With CF8, this code requires placing the jsoup JAR file in CF's lib directory, or using a tool such as JavaLoader.)
If you are using jQuery, you can also do this pretty easily.
$("em").html();
This returns the HTML between the em tags of the first matching element.
See this fiddle
I had to remove any text that followed a particular tag. The HTML content was generated dynamically from a database that caters to 5 different languages, so I only had the div tag to help me. I am not sure why REMatch("(?s).*?", html) did not work for me. However, Ben helped me here (http://www.bennadel.com/blog/769-Learning-ColdFusion-8-REMatch-For-Regular-Expression-Matching.htm). My code looks like this:
<cfset extContentArr = REMatch("(?i)<div class=""inlineBlock"" style=""margin-right:30px;"">.+?</div>",qry_getContent.colval) />
<cfif !ArrayIsEmpty(extContentArr)>
<!--- Loop over the array and do whatever you need with the extracts; I just deleted them. --->
</cfif>