C++, web browser control: cannot change encoding/charset - c++

There's a document I'm displaying in a web browser ActiveX control hosted in a C++ app. This document has a META tag that specifies incorrect charset, so the output is funny. I know the correct encoding and want to change it programmatically to fix that. But whatever I try, the encoding remains unchanged.
I alredy tried, in various combinations and flavors:
IHTMLDocument2::put_Charset (after the document finished loading);
changing the "charset" property of the "META" tag (using IHTMLMetaElement);
deleting the "META" tag altogether (by setting its "outerHTML" to empty string);
refreshing the control.
The control demonstrates remarkable persistence in preserving the incorrect encoding. What are my other options? I can't manipulate the source of the document being loaded.

try to put the designMode property "On".

According to this, it should work if you call IWebBrowser->Refresh() after calling IHTMLDocument2->put_charset().

Here's what eventually worked:
In the handler of the "NavigateComplete2" browser event,
the charset is modified using the charset property,
then the META tag is thrown away by setting its outerHTML to empty string,
and then the control is refreshed.
Modifying the order of these actions, or omitting a step, will render the entire operation void. MSHTML is picky.

Related

Why is Visible controlled by

In my home page I have the following snippet that fetches all blog posts:
var docs = CurrentPage.Children.Where("Visible")
What I don't understand is that Visible is controlled by a property in the document named umbracoNaviHide. Setting it to true on the document excludes the page from the list above.
How is umbracoNaviHide translated to Visible? I have no macros or XSLT (none actually) that is doing anything funny...
umbracoNaviHide is one of umbraco's internal property implementations.
We used to have to check the property explicitly in xslt but nowadays it is used as you are using it here.
Here is a more complete explanation from the Umbraco wiki
The "umbracoNaviHide" is an Umbraco convention for marking nodes which
should not show up in a navigational context. It is normally added (or
inherited) on every Document Type with a Data Type of "True/false".
NOTE: This property is not added by default on new installations,
meaning you need to add it manually
There are a number of other useful properties that everybody should know about:
umbracoSitemapHide
umbracoUrlAlias
umbracoUrlName
umbracoInternalRedirectId
umbracoRedirect
We always insert these properties on a master page doctype so that all other doc types that represent data on web page content nodes inherit them
Wing

encoding of query string parameters in IE10

I got a request from a customer that he wants to be able to type the query string of my web service with parameters in the IE10 address bar and get the service results. The parameters include string in Hebrew, like:
http://mywebsite.com/service.asmx/foo?param1=123&param2=מחרוזתבעברית
It seems to me that that IE10 won't encode the query string parameters - every non-ASCII character that goes after the ? mark would be turned to '3f' byte, though it does encode what goes before the ? mark - the url itself.
For example, if i try to reach the url (the parameter is fictional, url is not, and I have no connection with the site)
http://www.shlomo.co.il/pageshe/sales/רכב-למכירה.asp?param=פאראם
and look in wireshark for the bytes I send to the server, it shows me
You can see it does substitute the hebrew part of the URL with urlencoded string, but substitutes the hebrew parameters with ?????, which are '3f's.
The same string in chrome would be encoded in it's entirety:
GET http://www.shlomo.co.il/pageshe/sales/%D7%A8%D7%9B%D7%91-%D7%9C%D7%9E%D7%9B%D7%99%D7%A8%D7%94.asp?param=%D7%A4%D7%90%D7%A8%D7%90%D7%9D HTTP/1.1
I tried it on machines with win7/IE10 and winXPheb/IE8.
My IE settings are (especially checked the "Always show encoded addresses option" to see if it helps and restarted, but made no difference):
I tried to search around for any info about the issue, but didn't find much of it.
My questions are:
Is it indeed like this, or am I missing something?
Is this behavior documented anywhere?
Are there any settings in IE/Win which enable the parameters encoding.
p.s. Sure if I was developing the client/web ui, I would simply urlencode my query, but my request from customer was exactly to paste the query to IE address bar, that's why I'm interested in this specific behavior.
Thanks.
Yes, your observation of the behavior is accurate. Internet Explorer 10 and below follow a complicated algorithm for encoding the URL. This was allegedly updated in Internet Explorer 11, but I've found that the new option doesn't seem to work.
The "Always show encoded addresses option" concerns whether PunyCode is shown for IDN hostnames, and does not impact the query string. Send UTF-8 URLs mostly applies to the encoding of the path, although it can also affect other codepaths
The behavior isn't fully documented anywhere. I'd meant to write a full post on my IEInternals blog about it but ended up moving on from Microsoft before doing so. There's a partial explanation in this blog post.
Yes, there are settings that impact the behavior. The Send UTF-8 URLs checkbox inside Tools > Internet Options > Advanced is one of the variables that determines how URLs are sent, but the option does not blindly do what it implies (it only UTF-8 encodes the path, not the query string). Other variables involved include:
Where the URL was typed (e.g. address bar vs. Start > Run, etc)
What the system's ANSI codepage is (e.g. what locale the OS uses as default)
The charset of the currently loaded page in the browser
As a consequence of these variables, you cannot reliably use URLs which are not properly encoded (e.g. %-escaped UTF8) in Internet Explorer.
Unfortunately this is still true for Internet Explorer 11 (build 11.0.9600.17358, win7-x64)
I saw that you can not unfortunately change the web server. However those who are developing new services may consider changing request parameters into path variables, e.g. from http://myserver.com/page?τεστ into http://myserver.com/τεστ/
If the client is calling the web-service from javascript,
encodeuricomponent can be used. In your case encodeuricomponent("מחרוזתבעברית");
http://www.w3schools.com/jsref/jsref_encodeURIComponent.asp

HTML Purifier - Change default allowed HTML tags configuration

I want to allow a limited white list of HTML tags that users can use in my forum. So I have configured the HTML Purifier like so:
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,a[href|rel|target|title],img[src],span[style],strong,em,ul,ol,li');
$purifier = new HTMLPurifier($config);
What I am wondering is, does the default configuration of the HTML Purifier still apply, with the exception of a reduced number of accepted HTML tags or do I need to re-set every possible configuration parameter manually?
Additionally, should I tweak the default configuration in any way to stay safe? I am new to the whole XSS protection thing, new to HTML Purifier and didn't find that the manual gave a lot of 'basic' tips and hints.
HTML Purifier is safe by default and any restrictions you impose on it by changing %HTML.Allowed are guaranteed only to reduce the permitted tag set. Check out http://htmlpurifier.org/live/smoketests/printDefinition.php to see how tweaking configuration changes the allowed tagset.
Why not just use a DOM parser and check if tag type is in allowed white list of HTML tags?
Converting the input to a DOM node list you should be able to loop through all the DOM nodes and check if the type is allowed that way. php.net has great examples for how to do this written by others like you trying to solve the input sanitization problem.
More information here:
http://php.net/manual/en/class.domdocument.php

How to provide image data for embedded web control in C++

In my C++ app I'm embedding (via COM) a web browser (Internet Explorer) control (CLSID_WebBrowser).
I can display my own html in that control by using IHTMLDocument2::write() method but if the html has <img src="foo.png"> element, it's not displayed.
I assume there is a way for me to provide the data for foo.png somehow to the web control, but I can't find the right place to hook this functionality?
I need to be in full control of providing the content of foo.png, so work-arounds like using res:// protocol or saving to disk and using file:// protocol are not good enough. I just want to plug my code somehow so that when embedded CLSID_WebBrowser control sees <img src="foo.png"> in html data given with IHTMLDocument2::write() it will ask me to provide this data.
To answer my own question, the solution that finally worked for me is:
register custom IInternetProtocol/IInternetProtocolInfo/ via custom IClassFactory given to IInternetSession::RegisterNameSpace(). For reasons that seem like a bug to me, it has to be a protocol already known to IE (I've chosen "its") even though it would be much better if it was my own, unique namespace.
feed html data via custom IMoniker through IPersistentMoniker::Load() and make sure that IMoniker::GetDisplayName() (which is a base url according to which relative links in provided html will be resolved) starts with that protocol scheme (in my case "its://"). That way relative link "foo.png" in the html data will be its://foo.png to IE which will make urlmon call IInternetProtocol::Start() and IInternetProtocol::Read() to ask for the data for that url.
This is all rather complicated, you can look at the actual (BSD-licensed) code here:
http://code.google.com/p/sumatrapdf/source/browse/trunk/src/utils/HtmlWindow.cpp
You can embed a small webserver such as mongoose and reference those impage from there.
In mongoose, you can attach callback to specific path, thus returning images from C++ code.
We use this for our debugging tools, where each images is accessible from a web interface
The easiest solution would be a Data URI. You'd inline out the image directly with IHTMLDocument2::write().

How to retrieve codepage from cURL HTTP response?

I'm using lib-cURL as a HTTP client to retrieve various pages (can be any URL for that matter).
Usually the data comes as a UTF-8 string and then I just call "MultiByteToWideChar" and it works well.
However, some web-pages still use code-page encoding and I see gibberish if i try to convert those pages to UTF-8.
Is there an easy way to retrieve the code page from the data? or I'll have to scan it manually (for "encoding=") and then translate it accordingly.
If so, how do i get the code-page id from name (Code Page Identifiers)?
Thanks,
Omer
There are several location where a document can state its encoding:
the Content-Type HTTP header
the (optional) XML declaration
the Content-Type meta tag inside the document header
for HTML5 documents the charset meta tag.
There are probably even more I've forgotten.
In the end, detecting the actual encoding is rather hard. You really shouldn't do this yourself but use high-level libraries for retrieving and parsing HTML content. I'm sure they are available even for C++, even if they have to be thiefed from the a browser environment. :)
I used DetectInputCodepage in IMultiLanguage2 interface and it worked great !