Using Poco XMLWriter with UTF8 strings in C++

I have a problem trying to get my head around using UTF-8 with Poco::XML::XMLWriter. In the following code example, everything works fine when the input contains ASCII characters. However, occasionally the string in wordmapIt->first contains a non-ASCII value, such as a character -105 occurring in the middle of a string. When this happens the XML stream seems to terminate on the -105 char even though there are many other words after this one. I want to save whatever string was there, so just stripping the char out isn't the right answer - there's got to be some kind of encoding I can apply (I think), but what?
I'm clearly missing something conceptually, but for the life of me I can't figure out the right way to do this.
Poco::XML::XMLString EDocument::makeXMLString()
{
    std::stringstream xmlstream;
    Poco::UTF8Encoding utf8encoding;
    Poco::XML::XMLWriter writer(xmlstream, 0, "UTF-8", &utf8encoding);

    writer.startDocument();
    std::map<std::string, std::string>::iterator wordmapIt;
    for ( wordmapIt = nodeinfo->wordmap.begin(); wordmapIt != nodeinfo->wordmap.end(); wordmapIt++ )
    {
        writer.startElement("", "", "word");
        writer.characters(Poco::XML::toXMLString(wordmapIt->first));
        writer.endElement("", "", "word");
    }
    writer.endDocument();

    return xmlstream.str();
}
Edit:
Solution based on answer below.
Poco::XML::XMLString EDocument::makeXMLString()
{
    std::stringstream xmlstream;
    Poco::UTF8Encoding utf8encoding;
    Poco::XML::XMLWriter writer(xmlstream, 0, "UTF-8", &utf8encoding);
    Poco::Windows1252Encoding windows1252encoding;
    Poco::TextConverter textconverter(windows1252encoding, utf8encoding);

    writer.startDocument();
    std::map<std::string, std::string>::iterator wordmapIt;
    for ( wordmapIt = nodeinfo->wordmap.begin(); wordmapIt != nodeinfo->wordmap.end(); wordmapIt++ )
    {
        // Convert each Windows-1252 string to UTF-8 before handing it to the writer.
        std::string strword;
        textconverter.convert(wordmapIt->first, strword);
        writer.startElement("", "", "word");
        writer.characters(strword);
        writer.endElement("", "", "word");
    }
    writer.endDocument();

    return xmlstream.str();
}

It sounds like you have a byte string in Windows code page 1252 encoding. “Character -105” presumably really means byte 0x97, which would map to Unicode character U+2014 Em Dash (—) in cp1252.
I'm not familiar with Poco, but I would guess you're expected to convert your cp1252 strings to UTF-8 output encoding using a TextConverter with Windows1252Encoding and UTF8Encoding.
Although if what you really have is an “ANSI string” (a byte string in the default code page for the current machine's locale), 1252 might not be the right answer and you might have to use a function from another library to do the conversion properly.

Related

C++ wstring - Error reading characters of strings | Happens after the 10th character

I am asking the user for information to save into the database, like their username and such. You can use up to 10 characters perfectly fine: the wstring holds it and I can use it how I like, but if a user were to type 11 characters it would all of a sudden say "Error reading characters of strings".
At first I thought I didn't have a big enough buffer for GetWindowText, so I pumped that up. I didn't know if you can change the wstring capacity, so that is why I am asking here.
Why is wstring only working with 10 or less characters? Thanks!
case WM_COMMAND: {
    switch (LOWORD(wParam))
    {
    case IDB_REGISTERACCOUNT: {
        std::wstringstream SQLStatementStream;
        std::wstring SQLUsername;
        std::wstring SQLPassword;
        // Get user information, then store it in wide strings
        GetWindowText(hUserNameRegister, &SQLUsername[0], 50);
        GetWindowText(hPasswordRegister, &SQLPassword[0], 50);
        std::wstring SQLStatement = SQLStatementStream.str();
        break;
    }
    }
    break;
}
Your code is essentially saying, "I am giving you the address of the first character of this string, start writing to it indiscriminately." This is going to be a bad time. What you need to do is first allocate an actual buffer to store the result; then, if you want it in the wstring, you can use the wstring's '=' operator to do the proper assignment.
something like:
WCHAR temp[50];
GetWindowText(hUserNameRegister, temp, 50);
SQLUsername = temp;
You write past the end of your string. This causes undefined behaviour.
Only use &x[0] on a standard library string when you are in read-only mode, or if you are writing but not going to change the length of the string.
You should have written:
wchar_t SQLUsername[50] = {};
GetWindowTextW(hUserNameRegister, SQLUsername, 50);
Then you can convert to wstring if you wanted with std::wstring wsSQLUsername = SQLUsername; - however your code never actually uses your wstrings before they go out of scope so perhaps you have some other misconceptions too.
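If you would rather avoid the raw array, a minimal sketch that sizes the wstring before handing out its buffer (this relies on the C++11 contiguous-storage guarantee; names are taken from the question):

std::wstring SQLUsername(50, L'\0');              // real storage, not an empty string
int len = GetWindowTextW(hUserNameRegister, &SQLUsername[0],
                         (int)SQLUsername.size());
SQLUsername.resize(len);                          // shrink to the text actually read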

How to work with UTF-8 in C++, Conversion from other Encodings to UTF-8

I don't know how to solve that:
Imagine, we have 4 websites:
A: UTF-8
B: ISO-8859-1
C: ASCII
D: UTF-16
My program, written in C++, does the following: it downloads a website and parses it. But it has to understand the content. My problem is not the parsing, which is done with ASCII characters like ">" or "<".
The problem is that the program should find all words out of the website's text. A word is any combination of alphanumerical characters.
Then I send these words to a server. The database and the web-frontend are using UTF-8.
So my questions are:
How can I convert "any" (or the most used) character encoding to UTF-8?
How can I work with UTF-8-strings in C++? I think wchar_t does not work because it is 2 bytes long. Code-Points in UTF-8 are up to 4 bytes long...
Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8-strings?
Please note: I do not do any output (like std::cout) in C++. Just filtering out the words and sending them to the server.
I know about UTF8-CPP but it has no is*() functions. And as I read, it does not convert from other character encodings to UTF-8. Only from UTF-* to UTF-8.
Edit: I forgot to say, that the program has to be portable: Windows, Linux, ...
How can I convert "any" (or the most used) character encoding to UTF-8?
ICU (International Components for Unicode) is the solution here. It is generally considered to be the last word in Unicode support. Even Boost.Locale and Boost.Regex use it when it comes to Unicode. See my comment on Dory Zidon's answer as to why I recommend using ICU directly, instead of wrappers (like Boost).
You create a converter for a given encoding...
#include <unicode/ucnv.h>

UConverter * converter;
UErrorCode err = U_ZERO_ERROR;
converter = ucnv_open( "ISO-8859-1", &err );
if ( U_SUCCESS( err ) )
{
    // ... use the converter ...
    ucnv_close( converter );
}
...and then use the UnicodeString class as appropriate.
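A minimal sketch of that flow, assuming the source bytes are ISO-8859-1 and using only documented ICU calls (the function name is illustrative):

#include <unicode/ucnv.h>
#include <unicode/unistr.h>
#include <string>

// Sketch: convert Latin-1 bytes to UTF-8 via ICU's internal UTF-16 pivot.
std::string latin1ToUtf8(const std::string& latin1)
{
    UErrorCode err = U_ZERO_ERROR;
    UConverter* conv = ucnv_open("ISO-8859-1", &err);
    std::string utf8;
    if (U_SUCCESS(err)) {
        // Decode the source bytes into ICU's UTF-16 representation...
        icu::UnicodeString ustr(latin1.data(), (int32_t)latin1.size(), conv, err);
        if (U_SUCCESS(err))
            ustr.toUTF8String(utf8);   // ...then re-encode as UTF-8.
        ucnv_close(conv);
    }
    return utf8;
}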
I think wchar_t does not work because it is 2 bytes long.
The size of wchar_t is implementation-defined. AFAICR, Windows is 2 byte (UCS-2 / UTF-16, depending on Windows version), Linux is 4 byte (UTF-32). In any case, since the standard doesn't define Unicode semantics for wchar_t, using it is non-portable guesswork. Don't guess, use ICU.
Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8-strings?
Not in their UTF-8 encoding, but you don't use that internally anyway. UTF-8 is good for external representation, but internally UTF-16 or UTF-32 is the better choice. The above-mentioned functions do exist for Unicode code points (i.e., UChar32); see <unicode/uchar.h>.
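For illustration, a small sketch of those code-point-level functions (the wrapper names are mine):

#include <unicode/uchar.h>

// Code-point-level analogues of the <cctype> functions:
// u_isspace(cp), u_isalnum(cp), u_tolower(cp) all take a UChar32.
bool isWordChar(UChar32 cp) { return u_isalnum(cp); }
UChar32 lowered(UChar32 cp) { return u_tolower(cp); }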
Please note: I do not do any output(like std::cout) in C++. Just filtering out the words and send them to the server.
Check BreakIterator.
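A sketch of word extraction with it (error handling trimmed; the helper name is illustrative):

#include <unicode/brkiter.h>
#include <unicode/unistr.h>
#include <memory>
#include <vector>

// Split text into segments using ICU's locale-aware word boundary analysis.
std::vector<icu::UnicodeString> words(const icu::UnicodeString& text)
{
    UErrorCode err = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> bi(
        icu::BreakIterator::createWordInstance(icu::Locale::getDefault(), err));
    std::vector<icu::UnicodeString> out;
    if (U_FAILURE(err)) return out;
    bi->setText(text);
    int32_t start = bi->first();
    for (int32_t end = bi->next(); end != icu::BreakIterator::DONE;
         start = end, end = bi->next())
    {
        icu::UnicodeString segment;
        text.extractBetween(start, end, segment);
        out.push_back(segment);  // includes punctuation/space segments; filter as needed
    }
    return out;
}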
Edit: I forgot to say, that the program has to be portable: Windows, Linux, ...
In case I haven't said it already, do use ICU, and save yourself tons of trouble. Even if it might seem a bit heavyweight at first glance, it is the best implementation out there, it is extremely portable (using it on Windows, Linux, and AIX myself), and you will use it again and again and again in projects to come, so time invested in learning its API is not wasted.
Not sure if this will give you everything you're looking for, but it might help a little.
Have you tried looking at:
1) Boost.Locale library ?
Boost.Locale was released in Boost 1.48 (November 15th, 2011), making it easier to convert to and from UTF-8/16.
Here are some convenient examples from the docs:
// (these assume using namespace boost::locale::conv)
string utf8_string = to_utf<char>(latin1_string, "Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string, "Latin1");
string latin1_string = from_utf(wide_string, "Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);
2) Or at the conversions that are part of C++11?
#include <codecvt>
#include <locale>
#include <string>
#include <cassert>

int main() {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    std::string utf8 = convert.to_bytes(0x5e9);  // U+05E9, Hebrew letter shin
    assert(utf8.length() == 2);
    assert(utf8[0] == '\xD7');
    assert(utf8[1] == '\xA9');
}
How can I work with UTF-8-strings in C++? I think wchar_t does not work because it is 2 bytes long. Code-Points in UTF-8 are up to 4 bytes long...
This is easy: there is a project named tinyutf8, which is a drop-in replacement for std::string/std::wstring. You can then operate elegantly on code points, while their representation is always encoded in chars.
How can I convert "any" (or the most used) character encoding to
UTF-8?
You might want to have a look at std::codecvt_utf8 and similar templates from <codecvt> (C++11).
UTF-8 is an encoding that uses multiple bytes for characters outside ASCII (a 7-bit code), utilising the 8th bit. As a consequence you won't find the bytes for '\' or '/' inside a multi-byte sequence, and isdigit() keeps working (though not for Arabic and other digits).
It is a superset of ASCII and can hold all Unicode characters, so it is definitely the thing to use with char and std::string.
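A small sketch of the byte-level property just described (the helper names are mine):

// In UTF-8, every byte of a multi-byte sequence has its high bit set, so
// plain ASCII delimiters like '<', '>' or '/' can be matched byte-by-byte
// without decoding.
inline bool isAsciiByte(unsigned char b)        { return b < 0x80; }
inline bool isContinuationByte(unsigned char b) { return (b & 0xC0) == 0x80; }
inline bool isLeadByte(unsigned char b)         { return (b & 0xC0) == 0xC0; }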
Inspect the HTTP headers (case insensitive); they are in ISO-8859-1, and precede an empty line and then the HTML content.
Content-Type: text/html; charset=UTF-8
If not present, there might also be
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="UTF-8"> <!-- HTML5 -->
ISO-8859-1 is Latin-1, and you might do better to convert from Windows-1252, the Windows Latin-1 extension that uses 0x80 - 0x9F for special characters like curly quotes and dashes.
Even browsers on MacOS will understand these, even when ISO-8859-1 was specified.
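A minimal sketch of pulling the charset name out of such a header line (the function name is illustrative; real header parsing has more edge cases):

#include <algorithm>
#include <cctype>
#include <string>

// Returns the (lower-cased) charset from e.g.
// "Content-Type: text/html; charset=UTF-8", or "" if absent.
std::string charsetOf(std::string header)
{
    std::transform(header.begin(), header.end(), header.begin(),
                   [](unsigned char c) { return (char)std::tolower(c); });
    const std::string key = "charset=";
    std::string::size_type pos = header.find(key);
    if (pos == std::string::npos) return "";   // fall back to the <meta> tags
    pos += key.size();
    std::string::size_type end = header.find_first_of("; \t\r\n\"", pos);
    return header.substr(pos, end == std::string::npos ? std::string::npos
                                                       : end - pos);
}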
Conversion libraries: already mentioned by @syam.
Conversion
Let's not consider UTF-16. One can read the headers, and the content up to a meta statement declaring the charset, as single-byte chars.
The conversion from a single-byte encoding to UTF-8 can happen via a table: a const char* table[256] indexed by the byte value, generated for instance with the Java program below. For example, byte 157 (unassigned in windows-1252) maps to the Unicode replacement character:
table[157] = "\xEF\xBF\xBD"; /* U+FFFD */
// requires: import java.io.UnsupportedEncodingException;
public static void main(String[] args) {
    final String SOURCE_ENCODING = "windows-1252";
    byte[] sourceBytes = new byte[1];
    System.out.println(" const char* table[] = {");
    for (int c = 0; c < 256; ++c) {
        String comment = "";
        System.out.printf(" /* %3d */ \"", c);
        if (32 <= c && c < 127) {
            // Pure ASCII
            if (c == '\"' || c == '\\')
                System.out.print("\\");
            System.out.print((char)c);
        } else {
            if (c == 0) {
                comment = " // Unusable";
            }
            sourceBytes[0] = (byte)c;
            try {
                byte[] targetBytes = new String(sourceBytes, SOURCE_ENCODING).getBytes("UTF-8");
                for (int j = 0; j < targetBytes.length; ++j) {
                    int b = targetBytes[j] & 0xFF;
                    System.out.printf("\\x%02X", b);
                }
            } catch (UnsupportedEncodingException ex) {
                comment = " // " + ex.getMessage().replaceAll("\\s+", " "); // No newlines.
            }
        }
        System.out.print("\"");
        if (c < 255) {
            System.out.print(",");
        }
        System.out.println(comment); // append any note for this entry
    }
    System.out.println(" };");
}

How to convert (char *) from ISO-8859-1 to UTF-8 in C++ multiplatformly?

I'm changing a piece of software in C++, which processes texts in ISO Latin-1 format, to store data in an SQLite database.
The problem is that SQLite works in UTF-8... and the Java modules that use the same database work in UTF-8.
I want a way to convert the ISO Latin-1 characters to UTF-8 before storing them in the database. I need it to work on Windows and Mac.
I heard ICU would do that, but I think it's too bloated. I just need a simple conversion system (preferably back and forth) for these 2 charsets.
How would I do that?
ISO-8859-1 was incorporated as the first 256 code points of ISO/IEC 10646 and Unicode. So the conversion is pretty simple.
for each char:
    uint8_t ch = code_point; /* assume that code points above 0xff are impossible since latin-1 is 8-bit */
    if(ch < 0x80) {
        append(ch);
    } else {
        append(0xc0 | (ch & 0xc0) >> 6); /* first byte, simplified since our range is only 8-bits */
        append(0x80 | (ch & 0x3f));
    }
See http://en.wikipedia.org/wiki/UTF-8#Description for more details.
EDIT: according to a comment by ninjalj, Latin-1 translates directly to the first 256 Unicode code points, so the above algorithm should work.
For C++ I use this:
#include <cstdint>
#include <string>

std::string iso_8859_1_to_utf8(const std::string& str)
{
    std::string strOut;
    for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
    {
        uint8_t ch = *it;
        if (ch < 0x80) {
            strOut.push_back(ch);                 // ASCII: copy as-is
        }
        else {
            strOut.push_back(0xc0 | ch >> 6);     // 2-byte sequence
            strOut.push_back(0x80 | (ch & 0x3f));
        }
    }
    return strOut;
}
If general-purpose charset frameworks (like iconv) are too bloated for you, roll your own.
Compose a static translation table (char to UTF-8 sequence) and put together your own translation. Depending on what you use for string storage (char buffers, std::string, or whatever) it would look somewhat different, but the idea is: scroll through the source string and replace each character with a code over 127 with its UTF-8 counterpart string. Since this can potentially increase string length, doing it in place would be rather inconvenient. For added benefit, you can do it in two passes, as sketched below: pass one determines the necessary target string size, pass two performs the translation.
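A sketch of that two-pass idea (utf8Of is a hypothetical per-byte translation table as described above):

#include <cstring>
#include <string>

extern const char* utf8Of[256];  // hypothetical per-byte translation table

std::string translateTwoPass(const char* src, std::size_t n)
{
    std::size_t outLen = 0;
    for (std::size_t i = 0; i < n; ++i)        // pass 1: measure target size
        outLen += std::strlen(utf8Of[(unsigned char)src[i]]);

    std::string out;
    out.reserve(outLen);                       // single allocation
    for (std::size_t i = 0; i < n; ++i)        // pass 2: translate
        out += utf8Of[(unsigned char)src[i]];
    return out;
}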
If you don't mind doing an extra copy, you can just "widen" your ISO Latin 1 chars to 16-bit characters and thus get UTF-16. Then you can use something like UTF8-CPP to convert it to UTF-8.
In fact, I think UTF8-CPP could even convert ISO Latin 1 to UTF-8 directly (utf16to8 function) but you may get a warning.
Of course, it needs to be real ISO Latin-1, not Windows CP-1252.
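Assuming real Latin-1 input, a sketch of that widening route with utf8-cpp (the header-only utf8.h is assumed to be on the include path; the function name is illustrative):

#include <iterator>
#include <string>
#include <vector>
#include "utf8.h"   // utf8-cpp

std::string latin1ToUtf8(const std::string& latin1)
{
    // Latin-1 bytes equal the first 256 Unicode code points, so widening
    // each byte yields valid UTF-16 with no surrogates.
    std::vector<unsigned short> wide;
    wide.reserve(latin1.size());
    for (unsigned char c : latin1)
        wide.push_back(c);
    std::string utf8;
    utf8::utf16to8(wide.begin(), wide.end(), std::back_inserter(utf8));
    return utf8;
}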

How to open a file with wchar_t* containing non-Ascii string in Linux?

Environment: Gcc/G++ Linux
I have a non-ASCII file in the file system and I'm going to open it.
Now I have a wchar_t*, but I don't know how to open it (my trusted fopen only opens char* filenames).
Please help. Thanks a lot.
There are two possible answers:
If you want to make sure all Unicode filenames are representable, you can hard-code the assumption that the filesystem uses UTF-8 filenames. This is the "modern" Linux desktop-app approach. Just convert your strings from wchar_t (UTF-32) to UTF-8 with library functions (iconv would work well) or your own implementation (but look up the specs so you don't get it horribly wrong like Shelwien did), then use fopen.
If you want to do things the more standards-oriented way, you should use wcsrtombs to convert the wchar_t string to a multibyte char string in the locale's encoding (which hopefully is UTF-8 anyway on any modern system) and use fopen. Note that this requires that you previously set the locale with setlocale(LC_CTYPE, "") or setlocale(LC_ALL, "").
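A sketch of that second route (the wrapper name is illustrative; setlocale must have been called first, as noted above):

#include <clocale>
#include <cstdio>
#include <cwchar>
#include <string>

// Open a file whose name arrives as a wide string, by converting the name
// to the locale's multibyte encoding (ideally UTF-8) first.
std::FILE* fopenWide(const wchar_t* wname, const char* mode)
{
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = wname;
    std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);  // measure only
    if (len == (std::size_t)-1)
        return nullptr;                 // name not representable in this locale
    std::string name(len, '\0');
    src = wname;
    state = std::mbstate_t();
    std::wcsrtombs(&name[0], &src, len, &state);
    return std::fopen(name.c_str(), mode);
}

// Somewhere at program startup:
//   std::setlocale(LC_ALL, "");   // adopt the user's (hopefully UTF-8) locale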
And finally, not exactly an answer but a recommendation:
Storing filenames as wchar_t strings is probably a horrible mistake. You should instead store filenames as abstract byte strings, and only convert those to wchar_t just-in-time for displaying them in the user interface (if it's even necessary for that; many UI toolkits use plain byte strings themselves and do the interpretation as characters for you). This way you eliminate a lot of possible nasty corner cases, and you never encounter a situation where some files are inaccessible due to their names.
Linux is not UTF-8, but it's your only choice for filenames anyway
(Files can have anything you want inside them.)
With respect to filenames, linux does not really have a string encoding to worry about. Filenames are byte strings that need to be null-terminated.
This doesn't precisely mean that Linux is UTF-8, but it does mean that it's not compatible with wide characters as they could have a zero in a byte that's not the end byte.
But UTF-8 preserves the no-nulls-except-at-the-end model, so I have to believe that the practical approach is "convert to UTF-8" for filenames.
The content of files is a matter for standards above the Linux kernel level, so here there isn't anything Linux-y that you can or want to do. The content of files will be solely the concern of the programs that read and write them. Linux just stores and returns the byte stream, and it can have all the embedded nuls you want.
Convert the wchar string to a UTF-8 char string, then use fopen.
typedef unsigned int   uint;
typedef unsigned short word;
typedef unsigned char  byte;

// Caution: this assumes 16-bit wide characters and only handles the Basic
// Multilingual Plane (no surrogate pairs). On Linux, wchar_t is typically
// 4 bytes, so the word* cast below is not portable.
int UTF16to8( wchar_t* w, char* s ) {
    uint c;
    word* p = (word*)w;
    byte* q = (byte*)s; byte* q0 = q;
    while( 1 ) {
        c = *p++;
        if( c==0 ) break;
        if( c<0x080 ) *q++ = c;                                      // 1-byte (ASCII)
        else if( c<0x800 ) *q++ = 0xC0+(c>>6), *q++ = 0x80+(c&63);   // 2-byte sequence
        else *q++ = 0xE0+(c>>12), *q++ = 0x80+((c>>6)&63), *q++ = 0x80+(c&63); // 3-byte
    }
    *q = 0;
    return q-q0;
}

int UTF8to16( char* s, wchar_t* w ) {
    uint cache,wait,c;
    byte* p = (byte*)s;
    word* q = (word*)w; word* q0 = q;
    wait = 0;                                   // must start with no pending bytes
    while(1) {
        c = *p++;
        if( c==0 ) break;
        if( c<0x80 ) cache=c, wait=0;           // ASCII byte
        else if( c>=0xE0 ) cache=c&15, wait=2;  // 3-byte lead (checked first so 0xE0 lands here)
        else if( c>=0xC0 ) cache=c&31, wait=1;  // 2-byte lead
        else if( wait ) (cache<<=6)+=c&63, wait--; // continuation byte
        if( wait==0 ) *q++=cache;               // note: malformed input is not validated
    }
    *q = 0;
    return q-q0;
}
Check out this document
http://www.firstobject.com/wchar_t-string-on-linux-osx-windows.htm
I think Linux follows POSIX standard, which treats all file names as UTF-8.
I take it it's the name of the file that contains non-ascii characters, not the file itself, when you say "non-ascii file in file system". It doesn't really matter what the file contains.
You can do this with normal fopen, but you'll have to match the encoding the filesystem uses.
It depends on what version of Linux and what filesystem you're using and how you've set it up, but likely, if you're lucky, the filesystem uses UTF-8. So take your wchar_t (which is probably a UTF-16 encoded string?), convert it to a char string encoded in UTF-8, and pass that to fopen.

WideCharToMultiByte problem

I have the lovely functions from my previous question, which work fine if I do this:
wstring temp;
wcin >> temp;
string whatever( toUTF8(getSomeWString()) );
// store whatever, copy, but do not use it as UTF8 (see below)
wcout << toUTF16(whatever) << endl;
The original form is reproduced, but the in-between form often contains extra characters. If I enter for example àçé as the input, and add a cout << whatever statement, I'll get ┬à┬ç┬é as output.
Can I still use this string to compare to others, procured from an ASCII source? Or asked differently: if I would output ┬à┬ç┬é through the UTF8 cout in linux, would it read àçé? Is the byte content of a string àçé, read in UTF8 linux by cin, exactly the same as what the Win32 API gets me?
Thanks!
PS: the reason I'm asking is because I need to use the string a lot to compare to other read values (comparing and concatenating...).
Let's start by me saying that it appears that there is simply no way to output UTF-8 text to the console in Windows via cout (assuming you compile with Visual Studio).
What you can do, however, for your tests is to output your UTF-8 text via the Win32 API function WriteConsoleA:
if(!SetConsoleOutputCP(CP_UTF8)) { // 65001
    cerr << "Failed to set console output mode!\n";
    return 1;
}
HANDLE const consout = GetStdHandle(STD_OUTPUT_HANDLE);
DWORD nNumberOfCharsWritten;
const char* utf8 = "Umlaut AE = \xC3\x84 / ue = \xC3\xBC \n";
if(!WriteConsoleA(consout, utf8, strlen(utf8), &nNumberOfCharsWritten, NULL)) {
    DWORD const err = GetLastError();
    cerr << "WriteConsole failed with " << err << "!\n";
    return 1;
}
This should output Umlaut AE = Ä / ue = ü if you set your console (cmd.exe) to use the Lucida Console font.
As for your question (taken from your comment) whether a Win32-API-converted string is the same as a raw UTF-8 (Linux) string:
I will say yes. Given a Unicode character sequence, its UTF-16 (Windows wchar_t) representation converted to a UTF-8 (char) representation via the WideCharToMultiByte function will always yield the same byte sequence.
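For reference, a sketch of such a conversion (the toUTF8 from the asker's previous question isn't shown; this is the standard measure-then-convert pattern with WideCharToMultiByte, where CP_UTF8 requires the last two arguments to be NULL):

#include <windows.h>
#include <string>

std::string toUTF8(const std::wstring& wide)
{
    // First call measures the required buffer, second call converts.
    int len = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(),
                                  NULL, 0, NULL, NULL);
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(),
                        &utf8[0], len, NULL, NULL);
    return utf8;
}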
When you convert the string to UTF-16, each unit is a 16-bit wide character; you can't compare it to ASCII values because they aren't 16-bit values. You have to convert one of them to compare, or write a specialized ASCII-comparison function.
I doubt the UTF-8 cout in Linux would produce the same correct output unless it were regular ASCII values, as UTF-8 encoding is binary-compatible with ASCII for code points below 128, and I assume UTF-16 behaves in a similar fashion.
The good news is there are many converters out there written to convert these strings to different character sets.