Regex with iframe in Yahoo! Pipes

I'm building a Yahoo! Pipe to pull an RSS feed from Reddit which links to some content in the description. I'm using a regex to match the href attribute of the anchor link in an item.description field. The regex I'm using is:
^.+?href="([^"]+)">\[link\].+?$
As a test, I set the replace to simply:
$1
and I see that the entire description field has been replaced with the URL. So far, so good.
I then put the following in the replace field, the idea being to iframe the content that's linked to:
Content: <iframe src="$1">no iframe support</iframe> End
What I get out however is:
Content: no iframe support End
I've confirmed that this is also coming through in the pipe's output and not just in the Yahoo! Pipes debug console.
I've so far tried replacing my angle brackets with &lt; and &gt; entities. I've tried wrapping the entire thing in a <![CDATA[ ... ]]> block and still, I get nothing. If I break my iframe tag by removing an angle bracket, the broken content comes through fine, but if I have a well-formed iframe element, it vanishes, leaving the "no iframe support" text. Am I doing something wrong here, or is Yahoo! actively preventing me from using iframe tags in my generated pipe? A cursory search on Google isn't turning up anything related to this.
The pipe in question is here:
http://pipes.yahoo.com/pipes/pipe.info?_id=2ba41448cadd2347d86f377efd3d199f
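
To confirm the regex and replacement behave as intended outside Pipes, here is a quick Python sketch (the sample description string is assumed for illustration, not taken from the live feed):

import re

# A typical Reddit item.description (structure assumed for illustration).
description = ('<a href="http://example.com/article">[link]</a> '
               '<a href="http://reddit.com/r/example/comments/abc12">[comments]</a>')

pattern = r'^.+?href="([^"]+)">\[link\].+?$'
replacement = r'Content: <iframe src="\1">no iframe support</iframe> End'

print(re.sub(pattern, replacement, description))
# Content: <iframe src="http://example.com/article">no iframe support</iframe> End

So the transformation itself is sound; whatever strips the iframe is happening inside Pipes.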

The Pipes FAQ question "Why does Pipes Strip <object> and <embed> tags... ?" shows that a certain amount of sanitization is performed for the safety of RSS consumers, by placing content (at least certain content) into an iframe of Pipes' own. Though the FAQ does not state it specifically, this sanitization probably also removes other iframes, in order to avoid nesting and similar work-arounds.
Yahoo is big enough that I would doubt they have a weak sanitizer, but an extremely long shot is that you might be able to fool it by nesting the iframe in a bunch of other tags (again, I doubt this will work). Also, depending upon which step does the sanitization, adding part of the tag in one step and another part somewhere else might work (yet again, doubt overwhelms me).
Not sure what else to suggest, other than getting something else to consume and transform your RSS a little bit more (by fixing otherwise-broken tags?) - but that's what you're using Pipes for to begin with, isn't it?
Good luck!

Pipes has a fanatical devotion to the RSS spec, and the spec says the description field is plain text only. HTML and the like are supposed to go in the content:encoded field - not that I've had much luck getting Pipes to do that.

XSS DOM vulnerable

I tested a site for vulnerabilities (folder /service-contact) and a possible DOM XSS issue came up (using Kali Linux, Vega and XSSER). However, I tried to manually test the URL with an 'alert' script to make sure it's vulnerable. I used
www.babyland.nl/service-contact/<script>alert("test")</script>
No alert box/pop-up was shown; only the HTML code showed up in the contact form box.
I am not sure I used the right code (I'm a rookie) or interpreted the result correctly. The server is Apache, and the site uses JavaScript.
Can you help?
Thanks!
This is not vulnerable to XSS. Whatever you write in the URL ends up in the form section below (Vraag/opmerking), and the double quotes (") are escaped. If you try another payload like <script>alert(/xss/)</script>, that also won't work, because the input is neither reflected nor stored; you will see the output as plain text in Vraag/opmerking. Don't rely on online scanners - test manually. For DOM-based XSS, check sinks and sources and analyze them.
The tool is right: there is an XSS vulnerability on the site, but the proof-of-concept (PoC) code is wrong. The content of a <textarea> can only contain character data (see the <textarea> description on MDN). So your <script>alert("test")</script> is interpreted as text and not as HTML code. But you can close the <textarea> tag and insert the JavaScript code after that.
Here is the working PoC URL:
https://www.babyland.nl/service-contact/</textarea><script>alert("test")</script>
which is rendered as:
<textarea rows="" cols="" id="comment" name="comment"></textarea><script>alert("test")</script></textarea>
A little note on testing for XSS injection: Chrome/Chromium has built-in XSS protection, so this code won't trigger in that browser. For manual testing you can use Firefox, or run Chrome with --disable-web-security (see this Stack Overflow question and this one for more information).
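
As an aside, the underlying fix for the site is to HTML-encode user input before echoing it into the textarea. A minimal sketch of the idea in Python (the variable names are illustrative):

import html

user_input = '</textarea><script>alert("test")</script>'
safe = html.escape(user_input)  # '&lt;/textarea&gt;&lt;script&gt;...'
# The encoded text can no longer close the <textarea> early.
page = '<textarea id="comment" name="comment">%s</textarea>' % safe
print(page)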

Using a regular expression to replace everything between tags contained within XML output

I've been trawling the internet trying to find a solution to this issue. Basically, I am using a web service provided by the company that runs our support software to retrieve customer tickets and output them (subject to filtering) through our system, so that customers can see from their dashboard which support tickets they currently have active.

I've managed to get the desired tags from the XML returned by the web service and place their content in an HTML table (listing the active tickets row by row). However, as the ticket description tag is populated with the content of emails sent by clients, there is a lot of nasty redundant CSS and styling applied to the email that I would like to remove.
So far I have managed to use the 'replace' function to replace some of the redundant content in this email ->
l_html_build := replace(l_html_build,'&lt;','<');
l_html_build := replace(l_html_build,'&gt;','>');
l_html_build := replace(l_html_build,'&amp;lt;','');
l_html_build := replace(l_html_build,'&amp;gt;','');
l_html_build := replace(l_html_build,'&amp;nbsp;',' ');
However, I now need to overwrite the p tags, which have all sorts of garbage added to them, so that they become standard p tags ->
From this:
<p 0in;"="" 3.0pt="" padding:="" 1.0pt;="" solid="" border-top:="" none;="" _mce_style=""border:" 0in"="" 0in="" 1.0pt;padding:3.0pt="" #b5c4df="" style=""border:none;border-top:solid">
To this:
<p>
I've looked into using the REGEXP_REPLACE function listed here (psoug), but this appears to require a SELECT statement that is performed each time. The data I need to manipulate is stored in a CLOB called l_html_build, so is there any way of adapting the REGEXP function to be used in a similar way to the replace function above, or is there an alternative method that I am not aware of?
I apologise if this is a noob question. My expertise lies in front-end development, PHP and MySQL, but unfortunately I'm now required to do bits of PL/SQL in my new role.
Any help would be greatly appreciated.
Knowing that:
There is no standard PL/SQL package that parses HTML.
You can't reliably parse HTML with regex. Furthermore, Oracle only supports basic regular expressions, which restricts its capabilities.
You want to stay in PL/SQL.
You are left with few options (that I can think of):
Write a simple procedure yourself that will work in most cases (but there will be many exceptions that break your parser).
Use a Java parser, load the class into the database, and call Java from PL/SQL. Oracle comes with an integrated JVM, so this involves no extra setup.
I would go with option (2) if you want reliability, or option (1) if infrequent but inevitable losses are acceptable.
Since your content will be coming from email clients, we can assume that only a tiny (negligible?) fraction will have very obscure HTML.
In that case you could start with simple regular expressions that may need some tweaking:
SELECT regexp_replace(
         '<p1 3.0pt="" padding:="" #b5c4df="">
text
</p>',
         '<([[:alpha:]]+)[^>]*>',
         '<\1>') remove_attr_simple
  FROM dual;

REMOVE_ATTR_SIMPLE
------------------
<p>
text
</p>
This will fail to catch tricky valid HTML (such as <P attr=">">) but since your input is somewhat standard this should be fine often enough. You may need to remove HTML comments with another procedure -- I'm not sure it can be done with regex.
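Applied to the CLOB from the question, no SELECT is needed - in PL/SQL, REGEXP_REPLACE can be assigned directly, just like REPLACE (it accepts CLOB arguments in 10g and later; worth verifying on your version):

l_html_build := regexp_replace(l_html_build, '<([[:alpha:]]+)[^>]*>', '<\1>');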
SQL is really not the best tool for this job. Nor will regexes be able to perform this kind of task reliably. You would be better off extracting the data and processing it in another language using an XML parser.
Presumably Oracle itself is not sending these emails. What program does the sending, and can you add some programmatic processing at that point?
Since you already know PHP, here is a discussion of parsing HTML/XML in PHP. Similar tools are available in most other languages.
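To illustrate the parser route (sketched in Python for brevity; PHP's DOMDocument from the discussion above offers equivalent operations), here is a minimal attribute stripper. A real parsing library handles many more edge cases than this sketch does:

from html.parser import HTMLParser

class AttributeStripper(HTMLParser):
    # Rebuild the markup with every attribute dropped. Illustrative only:
    # entities, comments and self-closing tags are not handled here.
    def __init__(self):
        super().__init__()
        self.out = []
    def handle_starttag(self, tag, attrs):
        self.out.append('<%s>' % tag)
    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)
    def handle_data(self, data):
        self.out.append(data)

s = AttributeStripper()
s.feed('<p 3.0pt="" padding:="" #b5c4df="">text</p>')
print(''.join(s.out))  # prints: <p>text</p>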

MVC - Strip unwanted text from rss feed

I've got the following code in my RSS consumer (Vandelay Industries RemoteRSS) in my Orchard CMS implementation:
@using System.Xml.Linq
@{
    var feed = Model.Feed as XElement;
}
<ul>
@foreach(var item in feed
    .Element("channel")
    .Elements("item")
    .Take((int)Model.ItemsToDisplay))
{
    <li>@T(item.Element("description").Value)</li>
}
</ul>
The RSS feed I'm using is from Pinterest, and this bundles the image, link, and a short description all inside the 'description' elements of the feed.
<description><a href="/pin/215609900882251703/"><img src="http://media-cache-ec2.pinterest.com/upload/88664686384961121_UIyVRN8A_b.jpg"></a>How to install Orchard CMS on IIS Server</description>
My issue is that I don't want the text bits, and I also need to prefix the 'href=' links with 'http://www.pinterest.com'.
I've managed to edit the original code with my newbie skills to the above, which essentially displays the images as links; however, the links are only relative and thus point locally to my server. These images are then followed by the short description.
So to summarise, I need a way to prefix all links with 'http://pinterest.com' and then to remove the free text after the image/links.
Any pointers will be greatly appreciated, Thanks.
You should probably parse the description with something like http://htmlagilitypack.codeplex.com/, and then tweak it to add the prefix. Or you can learn regular expressions and do it without a library, which could be a little trickier and more error-prone.
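
A rough sketch of the Html Agility Pack approach, as a helper you could call from the view; the method name and structure are mine, not part of any existing API:

using System.Linq;
using HtmlAgilityPack;

// Place in a static helper class; names are illustrative.
static string CleanDescription(string description)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(description);

    // Prefix site-relative hrefs with the Pinterest domain.
    var links = doc.DocumentNode.SelectNodes("//a[@href]");
    if (links != null)
    {
        foreach (var a in links)
        {
            var href = a.GetAttributeValue("href", "");
            if (href.StartsWith("/"))
                a.SetAttributeValue("href", "http://www.pinterest.com" + href);
        }
    }

    // Drop the top-level text nodes - the free text after the image/link.
    foreach (var textNode in doc.DocumentNode.ChildNodes
                                .Where(n => n.NodeType == HtmlNodeType.Text)
                                .ToList())
    {
        textNode.Remove();
    }

    return doc.DocumentNode.OuterHtml;
}

The view would then emit the cleaned markup unescaped, with something like @Html.Raw(CleanDescription(item.Element("description").Value)) in place of the @T(...) call.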

Is there any performance implication in using one big <cfoutput> tag?

I'm being forced/paid to work on a legacy ColdFusion project (I'm usually a C# programmer) and one peculiarity of CF is that it has its own tags that are supposed to blend with HTML (a bad, bad decision, IMO, since it just confuses the hell out of me, even with the "starts with cf" rule).
Besides this, CF has the # character to indicate the start of CF "territory", much like <% in ASP.NET or $ in Spark, or so many equivalents. But this only gets parsed inside a tag such as <cfoutput>.
My question is: is there a problem with opening one tag at the beginning of the file and closing it at the end, versus using it only where I'm going to use the # character?
To illustrate here's some code:
<cfoutput>
Some text #SomeVar# Some text.<br />
Some Images some other things #AnotherVar#
</cfoutput>
Against:
Some text <cfoutput>#SomeVar#</cfoutput> Some text.<br/>
Some Images some other things <cfoutput>#AnotherVar#</cfoutput>
Granted, this might seem trivial for small content, but I'm talking about a whole page.
Depending on the page contents, either is fine. There may be a performance impact (minor) by putting all of your page inside the CFOUTPUT tag, because the CFML engine needs to parse and scan the contents of the tag for executable code. Outside of the CFOUTPUT tag, the CFML engine can ignore the page as static content.
If you have CSS and HTML code that uses pound signs (for example named anchors or Hex color codes), you need to escape all pound signs (by adding a second one like "##") when within a CFOUTPUT. Because of this, I generally only put the CFOUTPUT around code I specifically want the CF engine to run.
That said, the CFML engine pays a bit of a performance penalty for constantly opening and closing the CFOUTPUT. If you're looping over some content, put the CFOUTPUT around the entire loop, rather than opening and closing it in each iteration, as sketched below.
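For example, assuming a qFriends query, wrap the whole loop once (note the doubled ## where a literal # is needed inside cfoutput):

<cfoutput>
  <cfloop query="qFriends">
    <!--- ## renders a literal #, so the colour value survives --->
    <li style="color: ##336699;">#qFriends.fname# #qFriends.lname#</li>
  </cfloop>
</cfoutput>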
Also, if you're having trouble knowing what code is CFML and what isn't, you might want to get a better IDE/editor for CFML like CFEclipse. It color codes the tags and lets you see the difference between CFML and HTML tags immediately. It's open source.
One problem you might find is that cfoutput is often used to display queries, and those cannot be nested inside other cfoutput tags. So this will cause an 'Invalid tag nesting configuration' error:
<cfoutput>
<cfoutput query="qFriends">
<li>#qFriends.fname# #qFriends.lname#</li>
</cfoutput>
</cfoutput>
It should not be a big issue, but be careful using hex-valued colors: you'll need to escape those with an extra #. If it were me, I would try to break down those huge chunks of content into smaller pieces. Let HTML, JS, Flash and CSS do their jobs, and use CF for the server side.
If you want to put cfoutput at the beginning and end of the page, you have to use the double sign ## for color values.

Markdown and XSS

OK, so I have been reading about Markdown here on SO and elsewhere, and the steps between user input and the DB are usually given as:
convert markdown to html
sanitize html (w/whitelist)
insert into database
but to me it makes more sense to do the following:
sanitize markdown (remove all tags - no exceptions)
convert to html
insert into database
Am I missing something? This seems to me to be pretty nearly XSS-proof.
Please see this link:
http://michelf.com/weblog/2010/markdown-and-xss/
> hello <a name="n"
> href="javascript:alert('xss')">*you*</a>
Becomes
<blockquote>
<p>hello <a name="n"
href="javascript:alert('xss')"><em>you</em></a></p>
</blockquote>
∴ you must sanitize after converting to HTML.
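A minimal convert-then-sanitize sketch in Python, with the markdown and bleach packages standing in for whatever converter and whitelisting sanitizer you use:

import bleach
import markdown

ALLOWED_TAGS = ['p', 'em', 'strong', 'blockquote', 'a', 'ul', 'ol', 'li', 'code', 'pre']
ALLOWED_ATTRS = {'a': ['href', 'title']}

def render(md_text):
    # Sanitize AFTER conversion, so HTML smuggled through markdown is caught too.
    raw_html = markdown.markdown(md_text)
    return bleach.clean(raw_html, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRS,
                        protocols=['http', 'https', 'mailto'])

print(render('hello <a name="n" href="javascript:alert(\'xss\')">*you*</a>'))
# the javascript: href and the name attribute are gone from the output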
There are two issues with what you've proposed:
I don't see a way for your users to be able to format posts. You took advantage of Markdown to provide nice numbered lists, for example. In the proposed no-tags-no-exceptions world, I'm not seeing how the end user would be able to do such a thing.
Considerably more important: when using Markdown as the "native" formatting language and whitelisting the other available tags, you are limiting not just the input side of the world, but the output as well. In other words, if your display engine expects Markdown and only allows whitelisted content out, even if (God forbid) somebody gets to the database and injects some nasty malware-laden code into a bunch of posts, the actual site and its users are protected, because you are sanitizing it upon display as well.
There are some good resources on the web about output sanitization:
Sanitizing user data: Where and how to do it
Output sanitization (One of my clients, who shall remain nameless and whose affected system was not developed by me, was hit with this exact worm. We have since secured those systems, of course.)
BizTech: Best Practices: Never heard of XSS?
Well, certainly removing/escaping all tags would make a markup language more secure. However, the whole point of Markdown is that it allows users to include arbitrary HTML tags as well as its own forms of markup (*). When you are allowing HTML, you have to clean/whitelist the output anyway, so you might as well do it after the Markdown conversion to catch everything.
*: It's a design decision I don't agree with at all, and one that I think has not proven useful at SO, but it is a design decision and not a bug.
Incidentally, step 3 should be ‘output to page’; this normally takes place at the output stage, with the database containing the raw submitted text.
insert into database
convert markdown to html
sanitize html (w/whitelist)
Perl:
use Text::Markdown ();
use HTML::StripScripts::Parser ();

my $hss = HTML::StripScripts::Parser->new(
    {
        Context        => 'Document',
        AllowSrc       => 0,
        AllowHref      => 1,
        AllowRelURL    => 1,
        AllowMailto    => 1,
        EscapeFiltered => 1,
    },
    strict_comment => 1,
    strict_names   => 1,
);

$hss->filter_html(Text::Markdown::markdown(shift))
convert markdown to html
sanitize html (w/whitelist)
insert into database
Here, the assumptions are
Given dangerous HTML, the sanitizer can produce safe HTML.
The definition of safe HTML will not change, so if it is safe when I insert it into the DB, it is safe when I extract it.
sanitize markdown (remove all tags - no exceptions)
convert to html
insert into database
Here the assumptions are
Given dangerous markdown, the sanitizer can produce markdown that when converted to HTML by a different program will be safe.
The definition of safe HTML will not change, so if it is safe when I insert it into the DB, it is safe when I extract it.
The markdown sanitizer has to know not just about dangerous HTML and dangerous markdown, but how the markdown->HTML converter does its job. That makes it more complex, and more likely to be wrong than the simpler unsafeHTML->safeHTML function above.
As a concrete example, "remove all tags" assumes you can identify tags, and would not work against UTF-7 attacks. There might be other encoding attacks out there that render this assumption moot, or there might be a bug that causes the markdown->HTML program to convert a sequence like (full-width '<', exotic white-space characters that markdown strips, SCRIPT) into a real <script> tag.
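To make that concrete, here is a naive tag stripper of the kind the question proposes, sketched in Python; it never sees full-width brackets, so any later normalization to ASCII would resurrect a live tag:

import re

def strip_tags(text):
    # The naive "remove all tags - no exceptions" pass from the question.
    return re.sub(r'<[^>]*>', '', text)

payload = '\uff1cscript\uff1ealert(1)\uff1c/script\uff1e'  # full-width < and >
print(strip_tags(payload))  # unchanged: the filter never sees an ASCII '<'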
The most secure would be:
sanitize markdown (remove all tags - no exceptions)
convert markdown to HTML
sanitize HTML
insert into a DB column marked risky
re-sanitize HTML every time you fetch that column from the DB
That way, when you update your HTML sanitizer you get protection against any newly discovered attacks. This is often inefficient, but you can get pretty good security by storing a timestamp with the inserted HTML, so that you can tell which rows might have been inserted while an attack that gets past your sanitizer was publicly known.
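A sketch of that pipeline's last two steps in Python (bleach again as the stand-in sanitizer; a version stamp plays the role of the timestamp, and all names are illustrative):

import bleach

SANITIZER_VERSION = 3  # bump whenever the whitelist changes

def html_for_display(row):
    # 'row' stands in for the DB record, e.g. {'html': ..., 'sanitized_with': 0}.
    if row['sanitized_with'] < SANITIZER_VERSION:
        row['html'] = bleach.clean(row['html'],
                                   tags=['p', 'em', 'strong', 'a', 'code', 'pre'],
                                   attributes={'a': ['href']})
        row['sanitized_with'] = SANITIZER_VERSION  # persist back so the next fetch skips this
    return row['html']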