Modify (potentially) many URLs within an HTML document in C++

I'm given a string which contains the contents of an HTML document, and I need to modify some of the URLs contained within the document. The URLs which need modification begin with the form:
<script src="https://foo.com/some/variable/path/to/file.js" ...
And must be modified to:
<script src="https://foo.com/some/variable/path/to/NEW/file.js" ...
My current approach has been to use the GlobalReplace function from Google's RE2 with the regexp:
"(?i)(<script\\s+(?:[^>]+\\s+)?src=[\"']https://foo\\.com/"
"(?:.*?/)*?)(.*?\\.js[\"'][^>]*>)"
This almost works, but I realized that the HTML I'm given might already have some of the URLs modified and some not. The already-modified ones should be left alone.
Question: What's the easiest way to go about modifying the URLs without modifying the ones that have already been modified upstream?
A single pass approach is essential.
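One way to sketch a single-pass fix, assuming an already-modified URL is recognizable by a final NEW/ directory: let the pattern optionally consume an existing NEW/ segment and have the replacement always write it back, so the substitution is idempotent. The sketch below is in Go, whose regexp package follows RE2 syntax. The name rewriteScriptURLs and the tightened character classes are illustrative assumptions, and carrying the same pattern over to RE2's GlobalReplace (with a \1NEW/\2 rewrite string) has not been tested here.

```go
package main

import (
	"fmt"
	"regexp"
)

// Group 1: the tag prefix and the directory part of the URL, stopping before
// any existing NEW/ segment (the directory repetition is lazy).
// The optional, case-sensitive NEW/ segment is matched but not captured.
// Group 2: the file name and the rest of the tag.
var scriptSrc = regexp.MustCompile(
	`(?i)(<script\s+(?:[^>]+\s+)?src=["']https://foo\.com/(?:[^"'/>]+/)*?)` +
		`(?-i:NEW/)?` +
		`([^"'/>]+\.js["'][^>]*>)`)

// rewriteScriptURLs inserts NEW/ before the file name of every matching
// script URL. URLs that already end in NEW/<file>.js come out unchanged,
// because the NEW/ that was matched is simply written back.
func rewriteScriptURLs(doc string) string {
	return scriptSrc.ReplaceAllString(doc, "${1}NEW/${2}")
}

func main() {
	doc := `<script src="https://foo.com/a/b/file.js"></script>` +
		`<script src="https://foo.com/a/b/NEW/file.js"></script>`
	fmt.Println(rewriteScriptURLs(doc))
}
```

Both tags in the sample end up pointing at a/b/NEW/file.js, and running the function on its own output changes nothing. A directory that is legitimately named NEW upstream would be indistinguishable from a modified URL under this assumption.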

Related

Getting WebPage to use a specific URL to download HTML resources

I have a Qt program that downloads webpages (HTML), parses them and then generates its own HTML, which is then displayed with QWebPage. Sometimes the HTML that I download contains IMG tags, which work fine when the src attribute contains a full URL. However, sometimes the IMG tag uses a relative path like:
<IMG SRC="images/foo.png" />
Since I know the URL that should be prepended to the SRC my first thought was to just tack it onto my resulting HTML when I'm parsing. However, this is proving more difficult than I anticipated and now I'm wondering if there's a better way.
Is there any mechanism/property of QWebPage that lets me say "use this URL for relative paths"? Or maybe someone can suggest a better way to accomplish what I want?
Thanks!
In the comments, you mentioned that you're using QWebView::setHtml(). The second, optional parameter of this method sets the URL to use for resolving relative paths. According to the documentation:
External objects such as stylesheets or images referenced in the HTML
document are located relative to baseUrl.
Setting that parameter should be all that's needed here.

HTML Purifier - Change default allowed HTML tags configuration

I want to allow a limited white list of HTML tags that users can use in my forum. So I have configured the HTML Purifier like so:
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,a[href|rel|target|title],img[src],span[style],strong,em,ul,ol,li');
$purifier = new HTMLPurifier($config);
What I am wondering is, does the default configuration of the HTML Purifier still apply, with the exception of a reduced number of accepted HTML tags or do I need to re-set every possible configuration parameter manually?
Additionally, should I tweak the default configuration in any way to stay safe? I am new to the whole XSS protection thing, new to HTML Purifier and didn't find that the manual gave a lot of 'basic' tips and hints.
HTML Purifier is safe by default and any restrictions you impose on it by changing %HTML.Allowed are guaranteed only to reduce the permitted tag set. Check out http://htmlpurifier.org/live/smoketests/printDefinition.php to see how tweaking configuration changes the allowed tagset.
Why not just use a DOM parser and check if tag type is in allowed white list of HTML tags?
After converting the input to a DOM node list, you should be able to loop through all the DOM nodes and check whether each node's type is allowed. php.net has great examples of how to do this, written by others who, like you, were trying to solve the input sanitization problem.
More information here:
http://php.net/manual/en/class.domdocument.php

Whitelist tags exempt from escaping using Go's html/template

Pass a []byte into a template as the body of a message post on a forum-style web app. In the template, call a method to convert to string and along the way, switch out all newlines for line breaks:
<p>{{.BodyString}}</p>
...
func (p *Post) BodyString() string {
nl := regexp.MustCompile(`\n`)
return nl.ReplaceAllString(string(p.Body), `<br>`)
}
What you'll end up with (the <br> tags are escaped and show up as literal text):
paragraphs <br> <br>in <br> <br>this <br> <br>post
I don't want to pass the entire post in with HTML(p.Body), as it represents third party data from potentially untrustworthy sources. Is there a way to whitelist only some tags for formatting purposes using the vanilla Go1 template package?
I do think you want to parse the HTML. The HTML parser in exp/html was deemed incomplete and so removed from Go 1, although the exp tree is still in the Go source tree and can be accessed by weekly tag, for example. I don't know exactly what is incomplete. I used it for a simple task once and it met my needs.
Also, of course, check the dashboard and see the related SO post, Any smart method to get exp/html back after Go1?, mostly for the recommendation of http://code.google.com/p/go-html-transform/
I'm afraid the template package can't help much with this. If you want to remove specific (blacklisted) tags, including the subtrees they enclose, or let only specific (whitelisted) tags through, then I think nothing short of parsing and rewriting the HTML AST is a good solution. That said, one sees crazy REs here and there trying to do the same, but I don't consider that a "good solution", and I doubt they can be a "correct" solution for spec-conforming HTML in general, with its several legal irregularities, since HTML is probably not a regular language.
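For the narrow case where only a handful of attribute-free formatting tags need to survive, there is a sketch that avoids parsing altogether: escape the whole body first, then re-enable exact matches of the allowed tags. This is a different and much more limited technique than the AST rewriting recommended above, and it does not extend to tags with attributes. The names formatBody and allowedTags are hypothetical.

```go
package main

import (
	"html"
	"html/template"
	"os"
	"strings"
)

// allowedTags lists attribute-free tags that may pass through unescaped.
var allowedTags = []string{"b", "i", "em", "strong"}

// formatBody escapes the untrusted body, restores only exact occurrences of
// the allowed opening and closing tags, and turns newlines into <br>.
// The result is marked as template.HTML so html/template will not escape it
// a second time.
func formatBody(body []byte) template.HTML {
	s := html.EscapeString(string(body))
	for _, tag := range allowedTags {
		s = strings.ReplaceAll(s, "&lt;"+tag+"&gt;", "<"+tag+">")
		s = strings.ReplaceAll(s, "&lt;/"+tag+"&gt;", "</"+tag+">")
	}
	s = strings.ReplaceAll(s, "\n", "<br>")
	return template.HTML(s)
}

func main() {
	tmpl := template.Must(template.New("post").Parse("<p>{{.}}</p>\n"))
	body := []byte("some <b>bold</b> text\n<script>alert(1)</script>")
	if err := tmpl.Execute(os.Stdout, formatBody(body)); err != nil {
		panic(err)
	}
}
```

Because everything is escaped before anything is restored, a tag with attributes such as <b onclick=...> stays escaped and is shown as text. The sketch does not check that the allowed tags are balanced.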

How to simply parse html references

How is it possible to simply parse HTML links? For example, I receive an HTTP response containing HTML, which has links to other files that need to be downloaded, for example JPGs, CSS files and JS files. What is the simplest way to parse all these references?
Use an HTML parser for your platform/language.
There are some recommendations for c++ ones here.
Once you have a parsed document, you will need to look at each src and href in it - you will also need to remember the base tag, if one exists and add logic for external, relative and absolute paths.
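The path logic described above can be sketched on its own, assuming the src and href values have already been pulled out of the parsed document. The example below is in Go and uses only the standard net/url package. The function name resolveRefs and its parameters are hypothetical, and the extraction step itself is not shown.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveRefs turns raw src/href values into absolute URLs.
// pageURL is the address the document was fetched from; baseHref is the
// value of a <base href="..."> tag, or "" if the page has none.
func resolveRefs(pageURL, baseHref string, refs []string) ([]string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, err
	}
	if baseHref != "" {
		b, err := url.Parse(baseHref)
		if err != nil {
			return nil, err
		}
		// The base tag may itself be relative to the page URL.
		base = base.ResolveReference(b)
	}
	out := make([]string, 0, len(refs))
	for _, r := range refs {
		u, err := url.Parse(r)
		if err != nil {
			return nil, err
		}
		// Relative and root-relative paths are resolved against base;
		// references that are already absolute are returned as they are.
		out = append(out, base.ResolveReference(u).String())
	}
	return out, nil
}

func main() {
	refs := []string{"images/foo.png", "/css/site.css", "https://cdn.example.org/lib.js"}
	abs, err := resolveRefs("http://example.com/articles/page.html", "", refs)
	if err != nil {
		panic(err)
	}
	for _, a := range abs {
		fmt.Println(a)
	}
}
```

With no base tag, images/foo.png resolves next to the page under /articles/, /css/site.css resolves against the host root, and the external URL is left alone.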

How can I sanitize user input but keep the content of <pre> tags?

I'm using CKEditor in Markdown format to submit user created content. I would like to sanitize this content from malicious tags, but I would like to keep the formatting that is the result of the markdown parser. I've used two methods that do not work.
Method one
<!--- Sanitize post content --->
<cfset this.text = HTMLEditFormat(this.text)>
<!--- Apply mark down parser --->
<cfx_markdown textIn="#this.text#" variable="parsedNewBody">
Problem: For some reason <pre> and <blockquote> are being escaped, so I'm unable to use them; only the special characters appear. Other Markdown formatting works well, such as bold and italic. Could it be that CKEditor does not apply Markdown correctly to <pre> and <blockquote>?
Example: If I were to type <pre><script>alert("!");</script></pre> I would get the following: <script>alert("!");</script>
Method two
Same as method one, but with the order reversed, so that the sanitization takes place after the Markdown parser has done its work. This is effectively useless, since the sanitization function will escape all the tags, both the malicious ones and the ones created by the Markdown parser.
While I want to sanitize malicious content, I do want to keep basic HTML tags and the contents of <pre> and <blockquote> tags. Any ideas how?
Thanks!
There are two important sanitizations that need to be done on user generated content. First, you want to protect your database from SQL injection. You can do this by using stored procedures or the <cfqueryparam> tag, without modifying the data.
The other thing you want to do is protect your site from XSS and other content-display based attacks. The way you do this is by sanitizing the content on display. It would be fine, technically, to do it before saving, but generally the best practice is to store the highest fidelity data possible and only modify it for display. Either way, I think your problem is that you're doing this sanitization out of order. You should run the Markdown formatter on the content first, THEN run it through HTMLEditFormat().
It's also important to note that HTMLEditFormat will not protect you from all attacks, but it's a good start. You'll want to look into implementing OWASP utilities, which is not difficult in ColdFusion, as you can directly use the provided Java implementation.
Why don't you just prepend and append pre tag after parsing?
I mean, if you only care about the first and last pre and you don't have nested pre's or similar. If your cfx tag strips the pre, write a new wrapper method that checks whether <pre> exists and, if not, adds it. Also, if you use pre tags, I guess newline characters are important, so check what your cfx tag does with those.
Maybe HTMLEditFormat's twin, HTMLCodeFormat, is what you need?