I'm having a lot of trouble dealing with certain characters in a URL. Let's suppose I have the following URL:
http://localhost/somewere/myLibrary.dll/rest/something?parameter=An%C3%A1lisis
which must be converted to:
http://localhost/somewere/myLibrary.dll/rest/something?parameter=Análisis
In order to handle the decoding of diacritic letters, I've decided to use the InternetCanonicalizeUrl function, because the application I'm working on will only run on Windows and I don't want to install additional libraries. The helper function I've used is the following:
String DecodeURL(const String &a_URL)
{
    String result;
    unsigned long size = a_URL.Length() * 2;
    wchar_t *buffer = new wchar_t[size];
    if (InternetCanonicalizeUrlW(a_URL.c_str(), buffer, &size, ICU_DECODE | ICU_NO_ENCODE))
    {
        result = buffer;
    }
    delete [] buffer;
    return result;
}
That works fairly well for almost any URL passed through it, except for diacritic letters; my example URL is decoded as follows:
http://localhost/somewere/myLibrary.dll/rest/something?parameter=AnÃ¡lisis
The IDE I'm working with is CodeGear™ C++Builder® 2009 (that's why I'm forced to use String instead of std::string). I've also tried an AnsiString and char buffer version, with the same results.
Any hint/alternative about how to deal with this error?
Thanks in advance.
InternetCanonicalizeUrl() is doing the right thing, you just have to take into account what it is actually doing.
URLs do not support Unicode (IRIs do), so Unicode data has to be charset-encoded into byte octets and then those octets are url-encoded using %HH sequences as needed. In this case, the data was encoded as UTF-8 (not uncommon in many URLs nowadays, but also not guaranteed), but InternetCanonicalizeUrl() has no way of knowing that as URLs do not have a syntax for describing which charset is being used. All it can do is decode %HH sequences to the relevant byte octet values, it cannot charset-decode the octets for you. In the case of the Unicode version, InternetCanonicalizeUrlW() returns those byte values as-is as wchar_t elements. But either way, you have to charset-decode the octets yourself to recover the original Unicode data.
So what you can do in this case is copy the decoded data to a UTF8String and then assign/return that as a String so it gets decoded to UTF-16. That will only work for UTF-8 encoded URLs, of course. For example:
String DecodeURL(const String &a_URL)
{
    DWORD size = 0;
    if (!InternetCanonicalizeUrlW(a_URL.c_str(), NULL, &size, ICU_DECODE | ICU_NO_ENCODE))
    {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER)
        {
            String buffer;
            buffer.SetLength(size - 1);
            if (InternetCanonicalizeUrlW(a_URL.c_str(), buffer.c_str(), &size, ICU_DECODE | ICU_NO_ENCODE))
            {
                UTF8String utf8;
                utf8.SetLength(buffer.Length());
                for (int i = 1; i <= buffer.Length(); ++i)
                    utf8[i] = (char) buffer[i];
                return utf8;
            }
        }
    }
    return String();
}
Alternatively:
// encoded URLs are always ASCII, so it is safe
// to pass an encoded URL UnicodeString as an
// AnsiString...
String DecodeURL(const AnsiString &a_URL)
{
    DWORD size = 0;
    if (!InternetCanonicalizeUrlA(a_URL.c_str(), NULL, &size, ICU_DECODE | ICU_NO_ENCODE))
    {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER)
        {
            UTF8String buffer;
            buffer.SetLength(size - 1);
            if (InternetCanonicalizeUrlA(a_URL.c_str(), buffer.c_str(), &size, ICU_DECODE | ICU_NO_ENCODE))
            {
                return buffer;
            }
        }
    }
    return String();
}
FYI, C++Builder ships with Indy pre-installed. Indy has a TIdURI class, which can decode URLs and take charsets into account, e.g.:
#include <IdGlobal.hpp>
#include <IdURI.hpp>
String DecodeURL(const String &a_URL)
{
    return TIdURI::URLDecode(a_URL, enUTF8);
}
In any case, you have to know the charset used to encode the URL data. If you do not, all you can do is decode the raw octets and then use heuristic analysis to guess what the charset might be, but that is not 100% reliable for non-ASCII and non-UTF charsets.
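As a rough illustration of such a heuristic (my own sketch, not code from this answer): check whether the decoded octets form valid UTF-8 sequences, and only then treat them as UTF-8, falling back to an ANSI codepage otherwise.
bool LooksLikeValidUtf8(const unsigned char *p, int len)
{
    const unsigned char *end = p + len;
    while (p < end)
    {
        if (*p < 0x80) { ++p; continue; }  // plain ASCII byte
        // determine how many continuation bytes the lead byte announces
        int extra = (*p >= 0xF0) ? 3 : (*p >= 0xE0) ? 2 : (*p >= 0xC0) ? 1 : -1;
        if (extra < 0 || (end - p) <= extra) return false;  // bad lead byte or truncated sequence
        for (int i = 1; i <= extra; ++i)
            if ((p[i] & 0xC0) != 0x80) return false;        // continuation byte expected
        p += 1 + extra;
    }
    return true;
}
A return value of true does not guarantee the data really is UTF-8, but false reliably rules it out.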
I would like to generate a random string with OpenSSL and use it as a salt in a hashing function afterwards (it will be Argon2). Currently I'm generating the random data this way:
if (length < CryptConfig::sMinSaltLen){
    return 1;
}
if (!sInitialized){
    RAND_poll();
    sInitialized = true;
}
unsigned char *buf = new unsigned char[length];
if (!sInitialized || !RAND_bytes(buf, length)) {
    return 1;
}
salt = std::string(reinterpret_cast<char*>(buf));
delete[] buf;
return 0;
But printing salt with std::cout doesn't look like a proper string (it contains control characters and other junk). This is most likely my own fault.
Am I using the wrong functions of OpenSSL to generate the random data?
Or is my conversion from buf to string faulty?
Random data is random data. That's what you're asking for and that's exactly what you are getting. Your salt variable is a proper string that happens to contain unprintable characters. If you wish to have printable characters, one way of achieving that is using base64 encoding, but that will blow up its length. Another option is to somehow discard non-printable characters, but I don't see any mechanism to force RAND_bytes to do this. I guess you could simply fetch random bytes in a loop until you get length printable characters.
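A minimal sketch of that last idea (the function name and error handling are mine, not from the question's code; only RAND_bytes is the OpenSSL API):
#include <openssl/rand.h>
#include <cctype>
#include <string>

// Collect `length` printable ASCII characters by drawing random bytes
// and discarding the unprintable ones.
int make_printable_salt(std::string &salt, size_t length)
{
    salt.clear();
    unsigned char byte;
    while (salt.size() < length)
    {
        if (RAND_bytes(&byte, 1) != 1)
            return 1;                    // RNG failure
        if (std::isprint(byte))          // keep printable characters only
            salt.push_back(static_cast<char>(byte));
    }
    return 0;
}
Note that rejecting bytes lowers the entropy per character compared to raw bytes, so you may want a slightly longer salt to compensate.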
If encoding base64 is acceptable for you, here is an example of how to use the OpenSSL base64 encoder, extracted from Joe Linoff's Cipher library:
string Cipher::encode_base64(uchar* ciphertext,
                             uint ciphertext_len) const
{
    DBG_FCT("encode_base64");
    BIO* b64 = BIO_new(BIO_f_base64());
    BIO* bm = BIO_new(BIO_s_mem());
    b64 = BIO_push(b64, bm);
    if (BIO_write(b64, ciphertext, ciphertext_len) < 2) {
        throw runtime_error("BIO_write() failed");
    }
    if (BIO_flush(b64) < 1) {
        throw runtime_error("BIO_flush() failed");
    }
    BUF_MEM *bptr = 0;
    BIO_get_mem_ptr(b64, &bptr);
    uint len = bptr->length;
    char* mimetext = new char[len + 1];
    memcpy(mimetext, bptr->data, bptr->length - 1);
    mimetext[bptr->length - 1] = 0;
    BIO_free_all(b64);
    string ret = mimetext;
    delete [] mimetext;
    return ret;
}
To this code, I suggest adding BIO_set_flags(b64, BIO_FLAGS_BASE64_NO_NL), because otherwise you'll get a new line character inserted after every 64 characters. See OpenSSL's -A switch for details.
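For example, the flag would go right after the BIO_push() call in the code above (sketch):
b64 = BIO_push(b64, bm);
BIO_set_flags(b64, BIO_FLAGS_BASE64_NO_NL);  // emit base64 without line breaks
Keep in mind that with this flag set there is no trailing newline either, so the bptr->length - 1 trimming in the quoted code would then cut off a real base64 character.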
As I'm currently working on a program for a TeamSpeak server, I need to retrieve the names of the currently online users, which I'm doing with sockets - that's working fine so far. In my UI I'm displaying all clients in a ListBox, which is basically working. Nevertheless, I'm having problems with incorrectly displayed characters and symbols in the ListBox.
I'm using the following code:
//...
auto getClientList() -> void{
    i = 0;
    queryString.str("");
    queryString.clear();
    queryString << clientlist << " \n";
    send(sock, queryString.str().c_str(), strlen(queryString.str().c_str()), NULL);
    TeamSpeak::getAnswer(1);
    while(p_1 != -1){
        p_1 = lastLog.find(L"client_nickname=", sPos + 1);
        if(p_1 != -1){
            sPos = p_1;
            p_2 = lastLog.find(L" ", p_1);
            temporary = lastLog.substr(p_1 + 16, p_2 - (p_1 + 16));
            users[i].assign(temporary.begin(), temporary.end());
            SendMessage(hwnd_2, LB_ADDSTRING, (WPARAM)NULL, (LPARAM)(LPTSTR)(users[i].c_str()));
            i++;
        }
        else{
            sPos = 0;
            p_1 = 0;
            break;
        }
    }
    TeamSpeak::getAnswer(0);
}
//...
I've already checked lastLog, temporary and users[i] (by writing them to a file), and none of them have an encoding problem with characters or symbols (for example Andrè). If I add a string directly, e.g. SendMessage(hwnd_2, LB_ADDSTRING, (WPARAM)NULL, (LPARAM)(LPTSTR)L"Andrè");, it is displayed correctly in the ListBox. What might be the issue here, is it a problem with my code or something else?
Update 1: I recently continued working on this problem and looked at the word Olè! as received from the socket. The result I got is the following: O (79) | l (108) | ? (-61) | ? (-88) | ! (33). How can I convert this char array to a wstring containing the correct characters?
Solution: As @isanae mentioned in his post, the std::wstring_convert template did the trick for me, thank you very much!
Many things can go wrong in this code, and you don't show much of it. What's particularly lacking is the definition of all those variables.
Assuming that users[i] contains meaningful data, you also don't say how it is encoded. Is it ASCII? UTF-8? UTF-16? The fact that you can output it to a file and read it with an editor doesn't mean anything, as most editors are able to guess at encoding.
If it really is UTF-16 (the native encoding on Windows), then I see no reason for this code not to work. One way to check would be to break into the debugger and look at the individual bytes in users[i]. If you see every character with a value less than 128 followed by a 0, then it's probably UTF-16.
If it is not UTF-16, then you'll need to convert it. There are a variety of ways to do this, but MultiByteToWideChar may be the easiest. Make sure you set the codepage to the same encoding used by the sender. It may be CP_UTF8, or an actual codepage.
Note also that hardcoding a string with non-ASCII characters doesn't help you much either, as you'd first have to find out the encoding of the file itself. I know some versions of Visual C++ will convert your source file to UTF-16 if it encounters non-ASCII characters, which may be what happened to you.
O (79) | l (108) | ? (-61) | ? (-88) | ! (33)
How can I convert this char array to a wstring containing the correct characters?
This is a UTF-8 string. It has to be converted to UTF-16 so Windows can use it.
This is a portable, C++11 solution on implementations where sizeof(wchar_t) == 2. If this is not the case, then char16_t and std::u16string may be used, but the most recent version of Visual C++ as of this writing (2015 RC) doesn't implement std::codecvt for char16_t and char32_t.
#include <string>
#include <codecvt>

std::wstring utf8_to_utf16(const std::string& s)
{
    static_assert(sizeof(wchar_t) == 2, "wchar_t needs to be 2 bytes");
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.from_bytes(s);
}

std::string utf16_to_utf8(const std::wstring& s)
{
    static_assert(sizeof(wchar_t) == 2, "wchar_t needs to be 2 bytes");
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(s);
}
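For reference, the char16_t/std::u16string variant mentioned above would look like this (a sketch; as noted, it will not build on compilers that lack the char16_t codecvt facets):
#include <string>
#include <locale>
#include <codecvt>

std::u16string utf8_to_u16(const std::string& s)
{
    // same conversion as above, but with an explicit 16-bit character type
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(s);
}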
Windows-only:
#include <string>
#include <cassert>
#include <memory>
#include <codecvt>
#include <Windows.h>

std::wstring utf8_to_utf16(const std::string& s)
{
    // getting the required size in characters (not bytes) of the
    // output buffer
    const int size = ::MultiByteToWideChar(
        CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()),
        nullptr, 0);

    // error handling
    assert(size != 0);

    // creating a buffer with enough characters in it
    std::unique_ptr<wchar_t[]> buffer(new wchar_t[size]);

    // converting from utf8 to utf16
    const int written = ::MultiByteToWideChar(
        CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()),
        buffer.get(), size);

    // error handling
    assert(written != 0);

    return std::wstring(buffer.get(), buffer.get() + written);
}

std::string utf16_to_utf8(const std::wstring& ws)
{
    // getting the required size in bytes of the output buffer
    const int size = ::WideCharToMultiByte(
        CP_UTF8, 0, ws.c_str(), static_cast<int>(ws.size()),
        nullptr, 0, nullptr, nullptr);

    // error handling
    assert(size != 0);

    // creating a buffer with enough characters in it
    std::unique_ptr<char[]> buffer(new char[size]);

    // converting from utf16 to utf8
    const int written = ::WideCharToMultiByte(
        CP_UTF8, 0, ws.c_str(), static_cast<int>(ws.size()),
        buffer.get(), size, nullptr, nullptr);

    // error handling
    assert(written != 0);

    return std::string(buffer.get(), buffer.get() + written);
}
Test:
// utf-8 string
const std::string s = {79, 108, -61, -88, 33};
::MessageBoxW(0, utf8_to_utf16(s).c_str(), L"", MB_OK);
In a project that still uses Xcode 3 (so no C++11 features like codecvt), how can I convert UTF-16 data to a wchar_t string on the Mac?
Use a conversion library, like libiconv. You can set its input encoding to "UTF-16LE" or "UTF-16BE" as needed, and set its output encoding to "wchar_t" rather than any specific charset.
#include <iconv.h>
#include <errno.h>

uint16_t *utf16 = ...;  // input data
size_t utf16len = ...;  // in bytes

wchar_t *outbuf = ...;  // allocate an initial buffer
size_t outbuflen = ...; // in bytes

char *inptr = (char*) utf16;
char *outptr = (char*) outbuf;

iconv_t cvt = iconv_open("wchar_t", "UTF-16LE");
while (utf16len > 0)
{
    if (iconv(cvt, &inptr, &utf16len, &outptr, &outbuflen) == (size_t)(-1))
    {
        if (errno == E2BIG)
        {
            // resize outbuf to a larger size and
            // update outptr and outbuflen accordingly...
        }
        else
            break; // conversion failure
    }
}
iconv_close(cvt);
Why do you want wchar_t on the Mac? wchar_t is not necessarily 16-bit, and it is not very useful on the Mac.
I suggest converting to NSString using:
char* payload; // points to a string with UTF-16 encoding
NSString* s = [NSString stringWithCString:payload encoding:NSUTF16LittleEndianStringEncoding];
To convert an NSString back to UTF-16:
const char* payload = [s cStringUsingEncoding:NSUTF16LittleEndianStringEncoding];
Note that the Mac supports NSUTF16BigEndianStringEncoding as well.
Note 2: although const char* is used, the data is encoded as UTF-16, so don't pass it to strlen().
I would go the safest route:
Get the UTF-16 string as a UTF-8 string (using NSString)
Set the locale to UTF-8
Use mbstowcs() to convert the UTF-8 multi-byte string to wchar_t (see the sketch after this list)
At each step you are ensured the string value will be protected.
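A minimal sketch of the last two steps, assuming the UTF-8 bytes have already been obtained from the NSString (the function name and error handling are illustrative):
#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

std::wstring utf8_to_wstring_mbs(const std::string &utf8)
{
    std::setlocale(LC_CTYPE, "en_US.UTF-8");    // select a UTF-8 locale
    std::vector<wchar_t> buf(utf8.size() + 1);  // UTF-8 never yields more wchar_t than bytes
    size_t n = std::mbstowcs(&buf[0], utf8.c_str(), buf.size());
    if (n == (size_t)-1)
        return std::wstring();                  // invalid UTF-8 sequence
    return std::wstring(&buf[0], n);
}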
libxml2 seems to store all its strings in UTF-8, as xmlChar *.
/**
* xmlChar:
*
* This is a basic byte in an UTF-8 encoded string.
* It's unsigned allowing to pinpoint case where char * are assigned
* to xmlChar * (possibly making serialization back impossible).
*/
typedef unsigned char xmlChar;
As libxml2 is a C library, there are no provided routines to get a std::wstring out of an xmlChar *. I'm wondering whether the prudent way to convert xmlChar * to a std::wstring in C++11 is to use the mbstowcs C function, via something like this (work in progress):
std::wstring xmlCharToWideString(const xmlChar *xmlString) {
    if (!xmlString) { abort(); }                  //provided string was null
    int charLength = xmlStrlen(xmlString);        //excludes null terminator
    wchar_t *wideBuffer = new wchar_t[charLength];
    size_t wcharLength = mbstowcs(wideBuffer, (const char *)xmlString, charLength);
    if (wcharLength == (size_t)(-1)) { abort(); } //mbstowcs failed
    std::wstring wideString(wideBuffer, wcharLength);
    delete[] wideBuffer;
    return wideString;
}
Edit: Just an FYI, I'm very aware of what xmlStrlen returns; it's the number of xmlChar used to store the string. I know it's not the number of characters but rather the number of unsigned char. It would have been less confusing if I had named it byteLength, but I thought it would have been clearer as I have both charLength and wcharLength. As for the correctness of the code, the wideBuffer will always be larger than or equal to the required size to hold the result (I believe), as characters that require more space than wchar_t will be truncated (I think).
xmlStrlen() returns the number of UTF-8 encoded codeunits in the xmlChar* string. That is not going to be the same as the number of wchar_t encoded codeunits needed when the data is converted, so do not use xmlStrlen() to allocate the size of your wchar_t string. You need to call std::mbstowcs() once to get the correct length, then allocate the memory, and call mbstowcs() again to fill the memory. You will also have to use std::setlocale() to tell mbstowcs() to use UTF-8 (messing with the locale may not be a good idea, especially if multiple threads are involved). For example:
std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //provided string was null

    std::wstring wideString;
    int charLength = xmlStrlen(xmlString);
    if (charLength > 0)
    {
        char *origLocale = setlocale(LC_CTYPE, NULL);
        setlocale(LC_CTYPE, "en_US.UTF-8");
        size_t wcharLength = mbstowcs(NULL, (const char*) xmlString, charLength); //excludes null terminator
        if (wcharLength != (size_t)(-1))
        {
            wideString.resize(wcharLength);
            mbstowcs(&wideString[0], (const char*) xmlString, wcharLength);
        }
        setlocale(LC_CTYPE, origLocale);
        if (wcharLength == (size_t)(-1)) { abort(); } //mbstowcs failed
    }
    return wideString;
}
A better option, since you mention C++11, is to use std::codecvt_utf8 with std::wstring_convert instead so you do not have to deal with locales:
std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //provided string was null
    try
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
        return conv.from_bytes((const char*)xmlString);
    }
    catch(const std::range_error& e)
    {
        abort(); //wstring_convert failed
    }
}
An alternative option is to use an actual Unicode library, such as ICU or ICONV, to handle Unicode conversions.
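For instance, with ICU the conversion might look something like this (a sketch of mine, not code from this answer; it assumes ICU is available and that wchar_t is 16-bit, as on Windows):
#include <unicode/unistr.h>
#include <string>

std::wstring xmlCharToWideStringIcu(const xmlChar *xmlString)
{
    icu::UnicodeString u = icu::UnicodeString::fromUTF8(
        reinterpret_cast<const char*>(xmlString));
    // UnicodeString stores UTF-16 code units, so on a 16-bit wchar_t
    // platform they can be copied over directly.
    return std::wstring(u.getBuffer(), u.getBuffer() + u.length());
}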
There are some problems in this code, besides the fact that you are using wchar_t and std::wstring which is a bad idea unless you're making calls to the Windows API.
xmlStrlen() does not do what you think it does. It counts the number of UTF-8 code units (a.k.a. bytes) in a string. It does not count the number of characters. This is all stuff in the documentation.
Counting characters will not portably give you the correct size for a wchar_t array anyway. So not only does xmlStrlen() not do what you think it does, what you wanted isn't the right thing either. The problem is that the encoding of wchar_t varies from platform to platform, making it 100% useless for portable code.
The mbstowcs() function is locale-dependent. It only converts from UTF-8 if the locale is a UTF-8 locale!
This code will leak memory if the std::wstring constructor throws an exception.
My recommendations:
Use UTF-8 if at all possible. The wchar_t rabbit hole is a lot of extra work for no benefit (except the ability to make Windows API calls).
If you need UTF-32, then use std::u32string. Remember that wstring has a platform-dependent encoding: it could be a variable-length encoding (Windows) or fixed-length (Linux, OS X).
If you absolutely must have wchar_t, then chances are good that you are on Windows. Here is how you do it on Windows:
#include <string>
#include <cstring>
#include <Windows.h>

std::wstring utf8_to_wstring(const char *utf8)
{
    size_t utf8len = std::strlen(utf8);
    int wclen = MultiByteToWideChar(
        CP_UTF8, 0, utf8, static_cast<int>(utf8len), NULL, 0);
    wchar_t *wc = NULL;
    try {
        wc = new wchar_t[wclen];
        MultiByteToWideChar(
            CP_UTF8, 0, utf8, static_cast<int>(utf8len), wc, wclen);
        std::wstring wstr(wc, wclen);
        delete[] wc;
        wc = NULL;
        return wstr;
    } catch (std::exception &) {
        if (wc)
            delete[] wc;
        throw;
    }
}
If you absolutely must have wchar_t and you are not on Windows, use iconv() (see man 3 iconv, man 3 iconv_open and man 3 iconv_close for the manual). You can specify "WCHAR_T" as one of the encodings for iconv().
Remember: You probably don't want wchar_t or std::wstring. What wchar_t does portably isn't useful, and making it useful isn't portable. C'est la vie.
Add:
#include <boost/locale.hpp>
Convert the xmlChar* to a std::string:
std::string strGbk((char*)node);
Convert the string to a std::wstring:
std::string strGbk = "china powerful forever";
std::wstring wstr = boost::locale::conv::to_utf<wchar_t>(strGbk, "gbk");
std::cout << strGbk << std::endl;
std::wcout << wstr << std::endl;
It works for me, good luck.
I'm trying to write a universal text editor which can open and display ANSI and Unicode text in an EditControl. Do I need to repeatedly call ReadFile() if I determine that the text is ANSI? I can't figure out how to perform this task. My attempt below does not work; it displays '?' characters in the EditControl.
LARGE_INTEGER fSize;
GetFileSizeEx(hFile, &fSize);
int bufferLen = fSize.QuadPart / sizeof(TCHAR) + 1;
TCHAR* buffer = new TCHAR[bufferLen];
buffer[0] = _T('\0');
DWORD wasRead = 0;
ReadFile(hFile, buffer, fSize.QuadPart, &wasRead, NULL);
buffer[wasRead / sizeof(TCHAR)] = _T('\0');
if (!IsTextUnicode(buffer, bufferLen, NULL))
{
    CHAR* ansiBuffer = new CHAR[bufferLen];
    ansiBuffer[0] = '\0';
    WideCharToMultiByte(CP_ACP, 0, buffer, bufferLen, ansiBuffer, bufferLen, NULL, NULL);
    SetWindowTextA(edit, ansiBuffer);
    delete[] ansiBuffer;
}
else
    SetWindowText(edit, buffer);
CloseHandle(hFile);
delete[] buffer;
There are a few buffer length errors and oddities, but here's your big problem. You call WideCharToMultiByte incorrectly. That is meant to receive UTF-16 encoded text as input. But when IsTextUnicode returns false that means that the buffer is not UTF-16 encoded.
The following is basically what you need:
if (!IsTextUnicode(buffer, bufferLen*sizeof(TCHAR), NULL))
    SetWindowTextA(edit, (char*)buffer);
Note that I've fixed the length parameter to IsTextUnicode.
For what it is worth, I think I'd read into a buffer of char. That would remove the need for sizeof(TCHAR). In fact, I'd stop using TCHAR altogether. This program should be Unicode all the way - TCHAR is what you use when you compile for both NT and 9x variants of Windows. You aren't compiling for 9x anymore, I imagine.
So I'd probably code it like this:
char* buffer = new char[filesize+2]; //+2 for UTF-16 null terminator
DWORD wasRead = 0;
ReadFile(hFile, buffer, filesize, &wasRead, NULL);
//add error checking for ReadFile, including that wasRead == filesize
buffer[filesize] = '\0';
buffer[filesize+1] = '\0';
if (IsTextUnicode(buffer, filesize, NULL))
    SetWindowText(edit, (wchar_t*)buffer);
else
    SetWindowTextA(edit, buffer);
delete[] buffer;
Note also that this code makes no allowance for the possibility of receiving UTF-8 encoded text. If you want to handle that, you'd need to take your char buffer and send it through MultiByteToWideChar using CP_UTF8.
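A possible extension along those lines (a sketch of mine, not part of the answer; the helper name and the BOM check are illustrative):
#include <string>
#include <Windows.h>

// Convert a buffer assumed to hold UTF-8 text (optionally with a BOM)
// to UTF-16 and hand it to the edit control.
void SetEditTextFromUtf8(HWND edit, const char* buffer, DWORD size)
{
    // Skip a UTF-8 BOM if present.
    if (size >= 3 && (unsigned char)buffer[0] == 0xEF &&
        (unsigned char)buffer[1] == 0xBB && (unsigned char)buffer[2] == 0xBF)
    {
        buffer += 3;
        size -= 3;
    }
    int wlen = MultiByteToWideChar(CP_UTF8, 0, buffer, (int)size, NULL, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, buffer, (int)size, &wide[0], wlen);
    SetWindowTextW(edit, wide.c_str());
}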