How to retrieve codepage from cURL HTTP response? - c++

I'm using libcurl as an HTTP client to retrieve various pages (it can be any URL, for that matter).
Usually the data comes as a UTF-8 string, and then I just call "MultiByteToWideChar" and it works well.
However, some web pages still use a code-page encoding, and I see gibberish if I try to decode those pages as UTF-8.
Is there an easy way to retrieve the code page from the data, or will I have to scan it manually (for "encoding=") and then translate it accordingly?
If so, how do I get the code-page ID from its name (Code Page Identifiers)?
Thanks,
Omer

There are several locations where a document can state its encoding:
the Content-Type HTTP header
the (optional) XML declaration
the Content-Type meta tag inside the document header
for HTML5 documents the charset meta tag.
There are probably even more I've forgotten.
In the end, detecting the actual encoding is rather hard. You really shouldn't do this yourself; use a high-level library for retrieving and parsing HTML content instead. I'm sure such libraries are available for C++, even if they have to be borrowed from a browser environment. :)
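As a minimal sketch of the first item in the list above, assuming you already have the Content-Type header value as a string (with libcurl it can be read via curl_easy_getinfo with CURLINFO_CONTENT_TYPE), the charset parameter could be pulled out like this. The function name charsetFromContentType is made up for illustration, and the sketch ignores meta tags and the XML declaration entirely.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns the lower-cased charset parameter of a
// Content-Type value, or an empty string if none is present.
std::string charsetFromContentType(const std::string& contentType) {
    std::string lower(contentType);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    std::string::size_type pos = lower.find(key);
    if (pos == std::string::npos) return "";
    pos += key.size();

    // The value may be quoted, e.g. charset="utf-8".
    if (pos < lower.size() && lower[pos] == '"') {
        std::string::size_type end = lower.find('"', pos + 1);
        if (end == std::string::npos) end = lower.size();
        return lower.substr(pos + 1, end - pos - 1);
    }

    // Otherwise the value runs until the next separator or the end.
    std::string::size_type end = lower.find_first_of("; \t", pos);
    if (end == std::string::npos) end = lower.size();
    return lower.substr(pos, end - pos);
}
```

An empty result means the header did not declare a charset, in which case the other locations in the list would have to be checked.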

I used DetectInputCodepage from the IMultiLanguage2 interface, and it worked great!
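For the name-to-identifier part of the question, when the page does declare a charset, a small lookup table may be enough. The sketch below is only an illustration: the function name codePageFromCharset is made up, and the table covers just a handful of names, with values taken from Microsoft's Code Page Identifiers list.

```cpp
#include <map>
#include <string>

// Hypothetical helper: maps a lower-cased charset name to a Windows code
// page identifier. Returns -1 for names missing from this deliberately
// small table (0 is avoided because it means CP_ACP on Windows).
int codePageFromCharset(const std::string& charset) {
    static const std::map<std::string, int> table = {
        {"utf-8",        65001},
        {"windows-1252", 1252},
        {"windows-1255", 1255},
        {"iso-8859-1",   28591},
        {"iso-8859-8",   28598},
        {"shift_jis",    932},
        {"koi8-r",       20866},
    };
    const auto it = table.find(charset);
    return it == table.end() ? -1 : it->second;
}
```

The returned identifier is the kind of value MultiByteToWideChar expects as its CodePage argument.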

Related

encoding of query string parameters in IE10

I got a request from a customer that he wants to be able to type the query string of my web service with parameters in the IE10 address bar and get the service results. The parameters include string in Hebrew, like:
http://mywebsite.com/service.asmx/foo?param1=123&param2=מחרוזתבעברית
It seems to me that IE10 won't encode the query string parameters: every non-ASCII character after the ? is turned into a 0x3F byte (a literal '?'), although it does encode what comes before the ?, i.e. the path.
For example, if I try to reach the URL (the parameter is fictional, the URL is not, and I have no connection with the site)
http://www.shlomo.co.il/pageshe/sales/רכב-למכירה.asp?param=פאראם
and look in Wireshark at the bytes sent to the server, I can see that IE does substitute the Hebrew part of the path with a URL-encoded string, but replaces the Hebrew parameter value with ?????, i.e. 0x3F bytes.
The same string in Chrome would be encoded in its entirety:
GET http://www.shlomo.co.il/pageshe/sales/%D7%A8%D7%9B%D7%91-%D7%9C%D7%9E%D7%9B%D7%99%D7%A8%D7%94.asp?param=%D7%A4%D7%90%D7%A8%D7%90%D7%9D HTTP/1.1
I tried it on machines with win7/IE10 and winXPheb/IE8.
In my IE settings I specifically checked the "Always show encoded addresses" option to see if it helps and restarted, but it made no difference.
I tried to search for information about the issue, but didn't find much.
My questions are:
Is it indeed like this, or am I missing something?
Is this behavior documented anywhere?
Are there any settings in IE/Windows which enable encoding of the parameters?
P.S. Sure, if I were developing the client/web UI, I would simply URL-encode my query, but the customer's request was specifically to paste the query into the IE address bar. That's why I'm interested in this specific behavior.
Thanks.
Yes, your observation of the behavior is accurate. Internet Explorer 10 and below follow a complicated algorithm for encoding the URL. This was allegedly updated in Internet Explorer 11, but I've found that the new option doesn't seem to work.
The "Always show encoded addresses" option concerns whether Punycode is shown for IDN hostnames, and does not impact the query string. "Send UTF-8 URLs" mostly applies to the encoding of the path, although it can also affect other codepaths.
The behavior isn't fully documented anywhere. I'd meant to write a full post on my IEInternals blog about it but ended up moving on from Microsoft before doing so. There's a partial explanation in this blog post.
Yes, there are settings that impact the behavior. The Send UTF-8 URLs checkbox inside Tools > Internet Options > Advanced is one of the variables that determines how URLs are sent, but the option does not blindly do what it implies (it only UTF-8 encodes the path, not the query string). Other variables involved include:
Where the URL was typed (e.g. address bar vs. Start > Run, etc)
What the system's ANSI codepage is (e.g. what locale the OS uses as default)
The charset of the currently loaded page in the browser
As a consequence of these variables, you cannot reliably use URLs that are not properly encoded (i.e. %-escaped UTF-8) in Internet Explorer.
Unfortunately this is still true for Internet Explorer 11 (build 11.0.9600.17358, win7-x64)
I see that, unfortunately, you cannot change the web server. However, those who are developing new services may consider changing request parameters into path variables, e.g. from http://myserver.com/page?τεστ into http://myserver.com/τεστ/
If the client is calling the web service from JavaScript,
encodeURIComponent can be used. In your case: encodeURIComponent("מחרוזתבעברית");
http://www.w3schools.com/jsref/jsref_encodeURIComponent.asp
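For a native client, the same idea can be sketched in C++. This is a minimal illustration, assuming the input string is already UTF-8; the name percentEncode is made up. It escapes everything outside the RFC 3986 unreserved set, which is slightly stricter than encodeURIComponent (that function also leaves ! * ' ( ) untouched).

```cpp
#include <string>

// Hypothetical helper: percent-encodes every byte except the RFC 3986
// unreserved characters (A-Z a-z 0-9 - . _ ~).
// Assumption: the input is already UTF-8 encoded.
std::string percentEncode(const std::string& utf8) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : utf8) {
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '.' || c == '_' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];     // high nibble
            out += hex[c & 0x0F];   // low nibble
        }
    }
    return out;
}
```

Applied to the UTF-8 bytes of פאראם, this yields the same %D7%A4%D7%90%D7%A8%D7%90%D7%9D sequence that Chrome sends in the request shown in the question.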

Realtime URI-translation of HTML content in C/C++

For the development of a custom reverse proxy (written in C++) I want to do a realtime translation of URIs in HTML content. For example, if I want to access a resource on http://myserver/ using http://my-reverse-proxy/myserver, all absolute and root-relative links like http://myserver/somecontent1.ext or /somecontent2.ext need to be modified.
An HTML tag
<img src="/sample.png">
would therefore be translated to
<img src="/myserver/sample.png">
From my point of view there are two approaches:
1) Use regular expressions with capture groups to find all relevant HTML tags and their paths, and rewrite them by string replacement.
2) Parse the entire HTML content, transform the parse tree, and serialize the result back to a valid HTML resource.
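To make approach 1 concrete, here is a minimal sketch using std::regex. The function name rewriteRootLinks is made up, and the sketch makes several simplifying assumptions: it only handles double-quoted src and href attributes, only root-relative paths (not absolute http://myserver/ links), and the prefix must not contain '$', which has a special meaning in the replacement format.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: prepends `prefix` to root-relative src/href values.
// Protocol-relative URLs ("//host/...") are left untouched.
std::string rewriteRootLinks(const std::string& html, const std::string& prefix) {
    // Matches src="/ or href="/ when the slash is not followed by another slash.
    static const std::regex attr(R"re(\b(src|href)="/(?!/))re", std::regex::icase);
    // $1 re-emits the attribute name exactly as it appeared in the input.
    return std::regex_replace(html, attr, "$1=\"" + prefix + "/");
}
```

With the prefix "/myserver", the tag <img src="/sample.png"> from the example above becomes <img src="/myserver/sample.png">.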
And this is what this question is all about: do you have any experience with which solution might be faster, and maybe even more reasonable? Do you know a framework I might use so as not to reinvent the wheel? As this process should later be used for CSS and XML-based resources as well, it should not be an HTML-specific solution.
Thanks in advance!
Proxy servers generally work by being servers. They handle all HTTP requests, modify the requested URLs, and then pass the modified request on to the server on the other side.
You should stick to this paradigm. It is far easier and more efficient than mucking around with the files themselves. Anything that is being done real-time can be done at the point of the request.
Also, it should probably be asked: why a custom reverse proxy? Such things exist already.
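The request-side mapping described in this answer might be sketched as follows, assuming the proxy path layout from the question, where the first path segment names the upstream host. The function name upstreamUrl and the fixed http scheme are assumptions made for illustration.

```cpp
#include <string>

// Hypothetical helper: maps a proxy request path such as
// "/myserver/sample.png" to the upstream URL "http://myserver/sample.png".
// Returns an empty string if the path has no host segment.
std::string upstreamUrl(const std::string& proxyPath) {
    if (proxyPath.size() < 2 || proxyPath[0] != '/') return "";

    // The host is everything between the first and the second slash.
    const std::string::size_type slash = proxyPath.find('/', 1);
    const std::string host = (slash == std::string::npos)
        ? proxyPath.substr(1)
        : proxyPath.substr(1, slash - 1);
    if (host.empty()) return "";

    // Whatever follows the host segment is forwarded unchanged.
    const std::string rest = (slash == std::string::npos)
        ? std::string("/")
        : proxyPath.substr(slash);
    return "http://" + host + rest;
}
```

The proxy would then fetch the returned URL and relay the response to the client.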

C++, web browser control: cannot change encoding/charset

There's a document I'm displaying in a web browser ActiveX control hosted in a C++ app. This document has a META tag that specifies incorrect charset, so the output is funny. I know the correct encoding and want to change it programmatically to fix that. But whatever I try, the encoding remains unchanged.
I already tried, in various combinations and flavors:
IHTMLDocument2::put_Charset (after the document finished loading);
changing the "charset" property of the "META" tag (using IHTMLMetaElement);
deleting the "META" tag altogether (by setting its "outerHTML" to empty string);
refreshing the control.
The control demonstrates remarkable persistence in preserving the incorrect encoding. What are my other options? I can't manipulate the source of the document being loaded.
Try setting the designMode property to "On".
According to this, it should work if you call IWebBrowser->Refresh() after calling IHTMLDocument2->put_charset().
Here's what eventually worked:
In the handler of the "NavigateComplete2" browser event,
the charset is modified using the charset property,
then the META tag is thrown away by setting its outerHTML to empty string,
and then the control is refreshed.
Modifying the order of these actions, or omitting a step, will render the entire operation void. MSHTML is picky.

How to provide image data for embedded web control in C++

In my C++ app I'm embedding (via COM) a web browser (Internet Explorer) control (CLSID_WebBrowser).
I can display my own html in that control by using IHTMLDocument2::write() method but if the html has <img src="foo.png"> element, it's not displayed.
I assume there is a way for me to provide the data for foo.png to the web control somehow, but I can't find the right place to hook in this functionality.
I need to be in full control of providing the content of foo.png, so work-arounds like using res:// protocol or saving to disk and using file:// protocol are not good enough. I just want to plug my code somehow so that when embedded CLSID_WebBrowser control sees <img src="foo.png"> in html data given with IHTMLDocument2::write() it will ask me to provide this data.
To answer my own question, the solution that finally worked for me is:
Register a custom IInternetProtocol/IInternetProtocolInfo via a custom IClassFactory given to IInternetSession::RegisterNameSpace(). For reasons that seem like a bug to me, it has to be a protocol already known to IE (I've chosen "its"), even though it would be much better if it were my own, unique namespace.
Feed the HTML data via a custom IMoniker through IPersistMoniker::Load() and make sure that IMoniker::GetDisplayName() (which is the base URL against which relative links in the provided HTML are resolved) starts with that protocol scheme (in my case "its://"). That way the relative link "foo.png" in the HTML data becomes its://foo.png to IE, which makes urlmon call IInternetProtocol::Start() and IInternetProtocol::Read() to ask for the data for that URL.
This is all rather complicated; you can look at the actual (BSD-licensed) code here:
http://code.google.com/p/sumatrapdf/source/browse/trunk/src/utils/HtmlWindow.cpp
You can embed a small web server such as mongoose and reference those images from there.
In mongoose, you can attach a callback to a specific path, thus returning images from C++ code.
We use this for our debugging tools, where each image is accessible from a web interface.
The easiest solution would be a Data URI. You'd inline the image directly in the HTML passed to IHTMLDocument2::write().
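A minimal sketch of building such a Data URI in C++ is shown below. The helper names base64Encode and pngDataUri are made up, and the image/png MIME type is an assumption based on the foo.png example in the question.

```cpp
#include <string>

// Hypothetical helper: standard Base64 encoding with '=' padding.
std::string base64Encode(const std::string& bytes) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    auto byteAt = [&bytes](std::string::size_type i) {
        return static_cast<unsigned>(static_cast<unsigned char>(bytes[i]));
    };

    std::string out;
    std::string::size_type i = 0;
    // Full 3-byte groups become 4 output characters.
    for (; i + 2 < bytes.size(); i += 3) {
        const unsigned n = (byteAt(i) << 16) | (byteAt(i + 1) << 8) | byteAt(i + 2);
        out += tbl[(n >> 18) & 63];
        out += tbl[(n >> 12) & 63];
        out += tbl[(n >> 6) & 63];
        out += tbl[n & 63];
    }
    // A trailing group of 1 or 2 bytes is padded with '='.
    if (i + 1 == bytes.size()) {
        const unsigned n = byteAt(i) << 16;
        out += tbl[(n >> 18) & 63];
        out += tbl[(n >> 12) & 63];
        out += "==";
    } else if (i + 2 == bytes.size()) {
        const unsigned n = (byteAt(i) << 16) | (byteAt(i + 1) << 8);
        out += tbl[(n >> 18) & 63];
        out += tbl[(n >> 12) & 63];
        out += tbl[(n >> 6) & 63];
        out += '=';
    }
    return out;
}

// Hypothetical helper: wraps raw PNG bytes in a data: URI for use in <img src="...">.
std::string pngDataUri(const std::string& pngBytes) {
    return "data:image/png;base64," + base64Encode(pngBytes);
}
```

The returned string would replace "foo.png" in the src attribute before the HTML is written to the control. Large images make the HTML correspondingly larger, since Base64 grows the data by roughly a third.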

How to get input from web?

I am trying to find out how to get input from HTML inputs using C++. In Windows you can send WM_GETTEXT to a window and it returns the text you wanted. Is there any way to do the same thing with a web interface?
I am not interested in sniffing packets for now.
For example: some site has an HTML input which expects a name. I type a name into the input, and then I want to capture it with my program.
If I understood correctly what you want to do, you have to set up a web server that calls your C++ application via CGI. You'll have an HTML page (static or generated by your program) containing a form that refers to the URL of your application. When the user clicks Submit, the browser issues a request to the web server, which in turn calls your application, passing it the various POST/GET parameters of the form.
Your application then can process the data, extracting such parameters from the environment variables (if the data is passed using the GET method) or from the standard input (if the POST method is used). To generate the output page (along with the output HTTP header) you'll simply have to write it to the standard output.
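The reading side described above can be sketched as follows. This is only an illustration, assuming a CGI environment that sets the usual REQUEST_METHOD, QUERY_STRING and CONTENT_LENGTH variables and a form submitted as application/x-www-form-urlencoded; the function names are made up.

```cpp
#include <cctype>
#include <cstdlib>
#include <istream>
#include <map>
#include <string>

// Decodes form-urlencoded text: '+' becomes a space, %XX becomes a byte.
std::string urlDecode(const std::string& s) {
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (s[i] == '+') {
            out += ' ';
        } else if (s[i] == '%' && i + 2 < s.size() &&
                   std::isxdigit(static_cast<unsigned char>(s[i + 1])) &&
                   std::isxdigit(static_cast<unsigned char>(s[i + 2]))) {
            out += static_cast<char>(std::stoi(s.substr(i + 1, 2), nullptr, 16));
            i += 2;
        } else {
            out += s[i];
        }
    }
    return out;
}

// Splits "a=1&b=2" into decoded key/value pairs.
std::map<std::string, std::string> parseForm(const std::string& data) {
    std::map<std::string, std::string> fields;
    std::string::size_type start = 0;
    while (start <= data.size()) {
        std::string::size_type end = data.find('&', start);
        if (end == std::string::npos) end = data.size();
        const std::string pair = data.substr(start, end - start);
        if (!pair.empty()) {
            const std::string::size_type eq = pair.find('=');
            const std::string key = urlDecode(pair.substr(0, eq));
            fields[key] = (eq == std::string::npos) ? "" : urlDecode(pair.substr(eq + 1));
        }
        start = end + 1;
    }
    return fields;
}

// Returns the raw form data: the request body for POST (read from `in`,
// normally std::cin), otherwise the QUERY_STRING environment variable.
std::string readCgiFormData(std::istream& in) {
    const char* method = std::getenv("REQUEST_METHOD");
    if (method && std::string(method) == "POST") {
        const char* len = std::getenv("CONTENT_LENGTH");
        std::string body(len ? std::strtoul(len, nullptr, 10) : 0, '\0');
        in.read(&body[0], static_cast<std::streamsize>(body.size()));
        body.resize(static_cast<std::string::size_type>(in.gcount()));
        return body;
    }
    const char* query = std::getenv("QUERY_STRING");
    return query ? query : "";
}
```

A CGI program's main would typically call parseForm(readCgiFormData(std::cin)), write a header line such as "Content-Type: text/html" followed by a blank line to standard output, and then write the page itself.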
One thing I can think of (if you're using Linux) is using wget via system() from within your C++ app.
Wget to fetch the html page and output it to a file, parse the file for the URL of the form and data that it needs, pass the response as POST / GET via wget and so on.
That is, if I understood what you meant by "do it from existing page" correctly.