Realtime URI-translation of HTML content in C/C++

Realtime URI-translation of HTML content in C/C++ - c++

For the development of a custom reverse proxy (written in C++) I want to do a realtime translation of URIs in HTML content. For example if I want to access a ressource on http://myserver/ using http://my-reverse-proxy/myserver, all absolute and toplevel links like http://myserver/somecontent1.ext or /somecontent2.ext need to be modified.
An HTML tag
<img src="/sample.png">
would therefore be translated to
<img src="/myserver/sample.png">
From my point of view there are to approaches:
1) Using regular expressions and string replacement to find all related HTML tags and their paths using capture groups and do some string replacement.
2) Parse entire HTML content, do some transformation on the parse tree and pretty-print the result back to a valid HTML ressource.
And this is what this question is all about: Do you have any experiences what solution might be faster and maybe even more reasonable? Do you know a framework I might use to not reinvent the wheel? As this process should be used later for CSS and XML-based ressources as well, it should not be a HTML-depend solution.
Thanks in advance!

Proxy servers generally work by being servers. They handle all HTTP requests, modify the requested URLs, and then pass the modified request on to the server on the other side.
You should stick to this paradigm. It is far easier and more efficient than mucking around with the files themselves. Anything that is being done real-time can be done at the point of the request.
Also, it should probably be asked: why a custom reverse proxy? Such things exist already.

Related

easiest way to parse html from soapui http request

I created a http request step in soapui which is an html page from which I need to extract one single value from
<span class="result">12345<span>
I'm thinking about using groovy,is it the best way ? If yes I'm beginner in both soapui and groovy, any snippet code to get started (how to get html Content from http request step, how to parse in groovy) thanks.

If you feel more comfortable with Groovy, then go for it!
SoapUI internally represents almost everything as XML. Therefore the easiest way to manipulate things in SoapUI is using XPath. In your case, you could probably use a Property Transfer step to extract //span[#class="result"].

Siking answer worked setting the source 'property' to ResponseAsXml along with XPath expresions, as for common HTML XPath expressions there's a resource online at
https://devhints.io/xpath

Multi language website, how to approach this?

I have a website (Coldfusion) on which I want to offer multi language, but no idea what is the best way to do this.
There 2 plans I have:
1:
Of course all content (text) is in a database.
If a user would want a different language, the user would click on a link/flag, this would put the requested language in a session variable, for example: session.language = "es"
In the database I would have 2 columns (every language has 1 column) and then select the text which belongs to 'es'
Every page would then do a request to the database to get the text beloging to the session.language.
PROS: Relatively simple to implement
CONS: SEO wise I don't think this could be very good. http:// www.domain.com/page.cfm would give an english text or spanish text (or other language). Google will not add duplicate URL's
2:
Do something with http:// www.domain.com/en/page.cfm for english and http:// www.domain.com/es/page.cfm for english.
With a URL rewrite rule the language value in the URL http:// www.domain.com/en/page.cfm would actually be a page http:// www.domain.com/page.cfm?language=en
The url.language variable will then select the correct language from the database.
PROS: Unique URL for each language. Good for SEO and Google indexing.
CONS: A bit more difficult to implement. (I think)
Or does anyone have other / better ideas?
Thanks!!

You should always first check the browser header "Accept-Language" for the default language(s) (the correct standard way to do it), and offer links (the intuitively seemingly right way) only as an alternative.
Doing it in a database doesn't seem very standard. Let's assume you would like to use MVC architecture (model-view-controller). Most software uses keys in the presentation layer (view) (eg. html) and along with the presentation layer, you have language files (in Java, this is typically properties files) which are mapped simply by their filenames, and can be modified by regular users, without any special skills, such as professional translators with no computer skills. Certainly you could put it in a database, but then it is just more work, and moves the information out of the presentation layer.
There are various libraries for doing this. You should find the normal one for your application. Please edit your question to include what you are using to develop the application. (eg. JSP, Tapestry, Wicket, ASP, PHP, etc.) So for example, if you wanted to use JSPs, I would then suggest you use the JSTL tag library's language support. Or if you were using Tapestry, I would point you to http://tapestry.apache.org/localization.html or http://tapestry.apache.org/tapestry4.1/UsersGuide/localization.html
To look it up, you can look for the terms "internationalization" aka "i18n", or "localization". (The terms don't mean the same thing, but few use them correctly, so either works. http://www.w3.org/International/questions/qa-i18n)

I would go for option 2. Every translation should have its own url. Links to your website will already be in the intended translation.
To store translations in a database, I wouldn't put every translation in a seperate column, but rather put them in a seperate table:
Table Posts:
- id
- title_id
- ...
Table Translations:
- label_id
- value
- country_code
- language_code
Where title_id matches label_id
This way you won't have to alter your table structure when a new translation is added. This allows you to have infinite translations for any label or text.

To effectively do a multi-lingual site then you need set a rule for yourself that NO TEXT is ever put in the source as hard coded. It either needs to come from the database and / or a Resource Bundle.
Text from the database
You need to make sure that the column you are storing your data in is unicode otherwise you'll have issues with accented character. Also don't have a column per language as this is not scalable, do what #jan suggests and have a translations table where the items are keyed on a reference as well as a language.
Resource bundles
You are not going to want to get every last little bit of text from the database so for those you can utilise a resource bundle. This is an, admittedly old, link http://www.sustainablegis.com/blog/cfg11n/index.cfm?mode=entry&entry=FD48909C-50FC-543B-1FE177C1B97E8CC1 from Paul Hastings's blog about some solutions to resource bundles. To be honest his blog is an excellent resource on this very subject.
With regards to how you handle the URLs do not do option 1 as you quite rightly identified you will cause issues the SEO rankings of the page and it will mean that users cannot correctly share or return to the page.
Two approaches are having the language code in the URL as you identified in option 1.
Pros
Simpler to configure
Cons
You have one application which means that as you add more languages you add more complexity and weight on the memory of that app
Or you can have a different sub domain or domain per application e.g. es.yourdomain.com or yourdomain.es they can all be the same codebase
Pros
Each language is a standalone application meaning it has it's own memory
Cons
more effort to configure

http://i18n.riaforge.org/ has a download for i18n. It can be used to make sure that all string labels match. That way if some one wants to change "Save" to "Update", it can all be done in one spot.
It is also important to consider the technical background of those that will being doing the translation. It is often easier to get the translation team to edit files in notepad as opposed to updating a db. Text files work well with version control.

The best way i found is to use an XML to hold just that pages language stuff, one xml to cover each page, and you then vary it for language. when the page loads, just load a different XML from the database or files... many ways to do this. all other methods i have tried have their issues, and at least this one allows you to take a language XML, hand it to someone who will copy it, and then change the boxes... you put it in the DB to serve it.
one can also do this for text, and have the DB make the XML for just the text for that page by using a list of items to include in the XML for the page.
once you get the idea, the rest becomes very easy...
and given CF ways of accessing such data with dot notation, easy peasy to us
say you have "Load Images"
in english xml it may be <LoadIMGS>Load Images</LoadIMGS>
in chinese it may be <LoadIMGS>加载图像</LoadIMGS>
or <LoadIMGS>Jiā zǎi túxiàng</LoadIMGS>
regardless, in your CFM code you would just put #variablename.LoadIMGS# in the place... i would also suggest putting in the loadimages tag the size the font should be adjusted to if not normal size. that way, when translations are too large, you can shrink that font there for that... etc.
enjoy!!!

How to provide image data for embedded web control in C++

In my C++ app I'm embedding (via COM) a web browser (Internet Explorer) control (CLSID_WebBrowser).
I can display my own html in that control by using IHTMLDocument2::write() method but if the html has <img src="foo.png"> element, it's not displayed.
I assume there is a way for me to provide the data for foo.png somehow to the web control, but I can't find the right place to hook this functionality?
I need to be in full control of providing the content of foo.png, so work-arounds like using res:// protocol or saving to disk and using file:// protocol are not good enough. I just want to plug my code somehow so that when embedded CLSID_WebBrowser control sees <img src="foo.png"> in html data given with IHTMLDocument2::write() it will ask me to provide this data.

To answer my own question, the solution that finally worked for me is:
register custom IInternetProtocol/IInternetProtocolInfo/ via custom IClassFactory given to IInternetSession::RegisterNameSpace(). For reasons that seem like a bug to me, it has to be a protocol already known to IE (I've chosen "its") even though it would be much better if it was my own, unique namespace.
feed html data via custom IMoniker through IPersistentMoniker::Load() and make sure that IMoniker::GetDisplayName() (which is a base url according to which relative links in provided html will be resolved) starts with that protocol scheme (in my case "its://"). That way relative link "foo.png" in the html data will be its://foo.png to IE which will make urlmon call IInternetProtocol::Start() and IInternetProtocol::Read() to ask for the data for that url.
This is all rather complicated, you can look at the actual (BSD-licensed) code here:
http://code.google.com/p/sumatrapdf/source/browse/trunk/src/utils/HtmlWindow.cpp

You can embed a small webserver such as mongoose and reference those impage from there.
In mongoose, you can attach callback to specific path, thus returning images from C++ code.
We use this for our debugging tools, where each images is accessible from a web interface

The easiest solution would be a Data URI. You'd inline out the image directly with IHTMLDocument2::write().

Cleansing string / input in Coldfusion 9

I have been working with Coldfusion 9 lately (background in PHP primarily) and I am scratching my head trying to figure out how to 'clean/sanitize' input / string that is user submitted.
I want to make it HTMLSAFE, eliminate any javascript, or SQL query injection, the usual.
I am hoping I've overlooked some kind of function that already comes with CF9.
Can someone point me in the proper direction?

Well, for SQL injection, you want to use CFQUERYPARAM.
As for sanitizing the input for XSS and the like, you can use the ScriptProtect attribute in CFAPPLICATION, though I've heard that doesn't work flawlessly. You could look at Portcullis or similar 3rd-party CFCs for better script protection if you prefer.

This an addition to Kyle's suggestions not an alternative answer, but the comments panel is a bit rubbish for links.
Take a look a the ColdFusion string functions. You've got HTMLCodeFormat, HTMLEditFormat, JSStringFormat and URLEncodedFormat. All of which can help you with working with content posted from a form.
You can also try to use the regex functions to remove HTML tags, but its never a precise science. This ColdFusion based regex/html question should help there a bit.
You can also try to protect yourself from bots and known spammers using something like cfformprotect, which integrates Project Honeypot and Akismet protection amongst other tools into your forms.

You've got several options:
"Global Script Protection" Administrator setting, which applies a regular expression against post and get (i.e. FORM and URL) variables to strip out <script/>, <img/> and several other tags
Use isValid() to validate variables' data types (see my in depth answer on this one).
<cfqueryparam/>, which serves to create SQL bind parameters and validate the datatype passed to it.
That noted, if you are really trying to sanitize HTML, use Java, which ColdFusion can access natively. In particular use the OWASP AntiSamy Project, which takes an HTML fragment and whitelists what values can be part of it. This is the same approach that sites like SO and slashdot.org use to protect submissions and is a more secure approach to accepting markup content.

Sanitation of strings in coldfusion and in quite any language is very important and depends on what you want to do with the string. most mitigations are for
saving content to database (e.g. <cfqueryparam ...>)
using content to show on next page (e.g. put url-parameter in link or show url-parameter in text)
saving files and using upload filenames and content
There is always a risk if you follow the idea to prevent and reduce a string by allow basically everything in the first step and then sanitize malicious code "away" by deleting or replacing characters (blacklist approach).
The better solution is to replace strings with rereplace(...) agains regular expressions that explicitly allow only the characters needed for the scenario you use it as an easy solution, whenever this is possible. use cases are inputs for numbers, lists, email-addresses, urls, names, zip, cities, etc.
For example if you want to ask for a email-address, you could use
<cfif reFindNoCase("^[A-Z0-9._%+-]+#[A-Z0-9.-]+\.(?:[A-Z]{5})$", stringtosanitize)>...ok, clean...<cfelse>...not ok...</cfif>
(or an own regex).
For HTML-Imput or CSS-Imput I would also recommend OWASP Java HTML Sanitizer Project.

Preventing XSS in Node.js / server side javascript

Any idea how one would go about preventing XSS attacks on a node.js app? Any libs out there that handle removing javascript in hrefs, onclick attributes,etc. from POSTed data?
I don't want to have to write a regex for all that :)
Any suggestions?

I've created a module that bundles the Caja HTML Sanitizer
npm install sanitizer
http://github.com/theSmaw/Caja-HTML-Sanitizer
https://www.npmjs.com/package/sanitizer
Any feedback appreciated.

One of the answers to Sanitize/Rewrite HTML on the Client Side suggests borrowing the whitelist-based HTML sanitizer in JS from Google Caja which, as far as I can tell from a quick scroll-through, implements an HTML SAX parser without relying on the browser's DOM.
Update: Also, keep in mind that the Caja sanitizer has apparently been given a full, professional security review while regexes are known for being very easy to typo in security-compromising ways.
Update 2017-09-24: There is also now DOMPurify. I haven't used it yet, but it looks like it meets or exceeds every point I look for:
Relies on functionality provided by the runtime environment wherever possible. (Important both for performance and to maximize security by relying on well-tested, mature implementations as much as possible.)
Relies on either a browser's DOM or jsdom for Node.JS.
Default configuration designed to strip as little as possible while still guaranteeing removal of javascript.
Supports HTML, MathML, and SVG
Falls back to Microsoft's proprietary, un-configurable toStaticHTML under IE8 and IE9.
Highly configurable, making it suitable for enforcing limitations on an input which can contain arbitrary HTML, such as a WYSIWYG or Markdown comment field. (In fact, it's the top of the pile here)
Supports the usual tag/attribute whitelisting/blacklisting and URL regex whitelisting
Has special options to sanitize further for certain common types of HTML template metacharacters.
They're serious about compatibility and reliability
Automated tests running on 16 different browsers as well as three diffferent major versions of Node.JS.
To ensure developers and CI hosts are all on the same page, lock files are published.

All usual techniques apply to node.js output as well, which means:
Blacklists will not work.
You're not supposed to filter input in order to protect HTML output. It will not work or will work by needlessly malforming the data.
You're supposed to HTML-escape text in HTML output.
I'm not sure if node.js comes with some built-in for this, but something like that should do the job:
function htmlEscape(text) {
return text.replace(/&/g, '&').
replace(/</g, '<'). // it's not neccessary to escape >
replace(/"/g, '"').
replace(/'/g, ''');
}

I recently discovered node-validator by chriso.
Example
get('/', function (req, res) {
//Sanitize user input
req.sanitize('textarea').xss(); // No longer supported
req.sanitize('foo').toBoolean();
});
XSS Function Deprecation
The XSS function is no longer available in this library.
https://github.com/chriso/validator.js#deprecations

You can also look at ESAPI. There is a javascript version of the library. It's pretty sturdy.

In newer versions of validator module you can use the following script to prevent XSS attack:
var validator = require('validator');
var escaped_string = validator.escape(someString);

Try out the npm module strip-js. It performs the following actions:
Sanitizes HTML
Removes script tags
Removes attributes such as "onclick", "onerror", etc. which contain JavaScript code
Removes "href" attributes which contain JavaScript code
https://www.npmjs.com/package/strip-js

Update 2021-04-16: xss is a module used to filter input from users to prevent XSS attacks.
Sanitize untrusted HTML (to prevent XSS) with a configuration specified by a Whitelist.
Visit https://www.npmjs.com/package/xss
Project Homepage: http://jsxss.com

You should try library npm "insane".
https://github.com/bevacqua/insane
I try in production, it works well. Size is very small (around ~3kb gzipped).
Sanitize html
Remove all attributes or tags who evaluate js
You can allow attributes or tags that you don't want sanitize
The documentation is very easy to read and understand.
https://github.com/bevacqua/insane

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js