What unicode format do I pass tinyxml functions that take char*? - c++

When I call a tinyxml function that takes a char*, what unicode format do I need to pass it?
TiXmlText *element_text = new TiXmlText(string);
The reason I ask is that I am using a wxString object, and there are many different encodings I can give it. If I just do string.c_str(), the wxString object will query the encoding for the current locale and create a char* string in that format. Or if I do string.utf8_str(), it will pass a UTF-8 string, but it seems that TinyXML does not realize the string is already UTF-8 encoded and re-encodes it as UTF-8 (yes, the result is double UTF-8 encoding). So when I write the file out and set Notepad++ to show UTF-8, I see:
baÄŸlam instead of bağlam.
I'd like to do the encoding to UTF-8 myself (string.utf8_str()) and have TinyXML write it out untouched.
How do I do this? What format does TinyXML expect to be passed in the function parameter (the constructor in the above code)? From my testing the answer does not appear to be UTF-8, although it eventually writes the file out as UTF-8, if that makes sense.

TinyXML only supports UTF-8 encoding. So if you want to provide characters outside of ASCII, you must provide them in UTF-8.
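The double encoding described in the question can be reproduced at the byte level. The sketch below is not TinyXML or wxWidgets code; latin1_to_utf8 is a hypothetical helper that treats every input byte as a Latin-1 character and encodes it as UTF-8. Feeding it text that is already UTF-8 turns each non-ASCII byte into two bytes, which is the kind of damage shown above. The Ÿ in the question's output suggests the real conversion went through Windows-1252 rather than strict Latin-1; Latin-1 is assumed here only to keep the example short.

```cpp
#include <string>

// Hypothetical helper: interpret each byte as a Latin-1 character and
// encode it as UTF-8. Applying it to data that is already UTF-8
// reproduces "double encoding".
std::string latin1_to_utf8(const std::string& bytes) {
    std::string out;
    for (unsigned char c : bytes) {
        if (c < 0x80) {
            out += static_cast<char>(c);                 // ASCII passes through unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));   // lead byte of a 2-byte sequence
            out += static_cast<char>(0x80 | (c & 0x3F)); // continuation byte
        }
    }
    return out;
}
```

The UTF-8 form of "bağlam" is the bytes 62 61 C4 9F 6C 61 6D. Passing those bytes through latin1_to_utf8 yields 62 61 C3 84 C2 9F 6C 61 6D, so finding C3 84 where C4 was expected in the written file is a sign that a second conversion happened somewhere before or inside the write.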

You may want to look at this section on http://www.grinninglizard.com/tinyxmldocs/index.html
TinyXML can be compiled to use or not use STL. When using STL, TinyXML uses the std::string class, and fully supports std::istream, std::ostream, operator <<, and operator >>. Many API methods have both 'const char*' and 'const std::string&' forms.
When STL support is compiled out, no STL files are included whatsoever. All the string classes are implemented by TinyXML itself. API methods all use the 'const char*' form for input.
Use the compile time define TIXML_USE_STL to compile one version or the other. This can be passed by the compiler, or set as the first line of "tinyxml.h".

Related

UTF8 conversion wxString::ToStdString()

I am writing an application and I am using wxWidgets as the GUI backend. The core part of the app uses std::string with UTF-8 as the encoding. I need a sane way to convert between wxString and std::string. I know about wxString::ToUTF8(), but it is somewhat awkward to use (and inefficient, I think, as it returns some proxy object). There is a better method, wxString::ToStdString(), but, if I understood properly, it uses the current locale encoding. Is there a way to configure wxWidgets globally so that it uses UTF-8 when converting between wxString and narrow char types (const char*, std::string)?
No, there is no way to do this, you will have to write your own helper function using ToUTF8() or equivalent utf8_str(). This will be inefficient in the sense that it will require a conversion from UTF-32 or UTF-16 every time it's called, but this is unlikely to be a bottleneck.
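To give a sense of what "a conversion from UTF-32" costs per call, here is a minimal standard-library sketch of such a conversion. It is not the wxWidgets implementation; append_utf8 and utf32_to_utf8 are hypothetical names, and the sketch does not reject surrogate values or code points above U+10FFFF.

```cpp
#include <string>

// Append the UTF-8 encoding of one code point to `out`.
// Sketch only: invalid code points are not rejected.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a whole UTF-32 string to UTF-8: one linear pass over the input.
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) append_utf8(out, cp);
    return out;
}
```

The work is a single pass with a few shifts per character, which is consistent with the point above that the conversion is unlikely to be a bottleneck.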

How to convert unicode QString to an std::string?

I need convert a QString to an std::string. However, if this string contains unicode symbols, I get ????. How can I convert the string with the proper encoding?
Thank you.
How did you try to convert the string so far?
According to the documentation, std::string QString::toStdString() should convert the Unicode data to an ASCII string.
But be warned that you lose special characters that ASCII can't handle.
According to Qt documentation, QString::toStdString internally uses toAscii() function: http://doc.qt.nokia.com/latest/qstring.html#toStdString
Basically, you'll need to make your own converter function that would use QString::toUtf8() instead.
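The "????" symptom is what a lossy narrowing to ASCII produces. The sketch below is a standard-library stand-in, not Qt's implementation: to_ascii_lossy is a hypothetical function that takes UTF-16 code units and substitutes '?' for anything ASCII cannot represent.

```cpp
#include <string>

// Hypothetical stand-in for an ASCII narrowing conversion: every UTF-16
// code unit outside the ASCII range is replaced by '?'.
std::string to_ascii_lossy(const std::u16string& in) {
    std::string out;
    for (char16_t unit : in) {
        out += (unit < 0x80) ? static_cast<char>(unit) : '?';
    }
    return out;
}
```

With this function, u"bağlam" becomes "ba?lam", and the original character cannot be recovered afterwards. A UTF-8 conversion avoids the loss because every code point has a byte sequence.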

Is there an STL string class that properly handles Unicode?

I know all about std::string and std::wstring, but they don't seem to fully handle the extended character encodings UTF-8 and UTF-16 (on Windows at least). There is also no support for UTF-32.
So does anyone know of cross-platform drop-in replacement classes that provide full UTF-8, UTF-16 and UTF-32 support?
And let's not forget the lightweight, very user-friendly, header-only UTF-8 library UTF8-CPP. Not a drop-in replacement, but can easily be used in conjunction with std::string and has no external dependencies.
Well, in C++0x there are the classes std::u32string and std::u16string. GCC already partially supports them, so you can already use them, but stream support for Unicode is not yet done (see "Unicode support in C++0x").
It's not STL, but if you want proper Unicode in C++, then you should take a look at ICU.
There is no support for UTF-8 in the STL. As an alternative you can use Boost codecvt:
//...
// My encoding type
typedef wchar_t ucs4_t;
std::locale old_locale;
std::locale utf8_locale(old_locale, new utf8_codecvt_facet<ucs4_t>);
// Set a new global locale
std::locale::global(utf8_locale);
// Send the UCS-4 data out, converting to UTF-8
{
    std::wstringstream oss;
    oss.imbue(utf8_locale);
    std::copy(ucs4_data.begin(), ucs4_data.end(),
              std::ostream_iterator<ucs4_t, ucs4_t>(oss));
    std::wcout << oss.str() << std::endl;
}
For UTF-8 support, there is the Glib::ustring class. It is modeled after std::string but is UTF-8 aware, e.g. when you are scanning the string with an iterator. It also has some restrictions, e.g. the iterator is always const, as replacing a character can change the length of the string and so can invalidate other iterators.
ustring does not automatically convert other encodings to UTF-8; the Glib library has various conversion functions for this. You can, however, validate whether the string is valid UTF-8.
And also, ustring and std::string are interchangeable, i.e. ustring has a cast operator to std::string so you can pass a ustring as a parameter where an std::string is expected, and vice versa of course, as ustring can be constructed from std::string.
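What "UTF-8 aware" means for scanning can be shown with plain std::string. The sketch below is not Glib code; utf8_length is a hypothetical helper that counts code points by skipping continuation bytes, and it assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string. Continuation bytes have the bit
// pattern 10xxxxxx, so every byte that does not match it starts a new
// code point. Assumes valid UTF-8 input.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the UTF-8 bytes of "bağlam", std::string::size() reports 7 while utf8_length reports 6. The same mismatch is why replacing one character can change the byte length of the string and invalidate iterators.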
Qt has QString, which uses UTF-16 internally but has methods for converting to or from std::wstring, UTF-8, Latin1 or the locale encoding. There is also the QTextCodec class, which can convert QStrings to or from basically anything. But using Qt just for strings seems like overkill to me.
Also look at http://grigory.info/UTF8Strings.About.html; it is UTF-8 native.

Convert ICU UnicodeString to platform dependent char * (or std::string)

In my application I use ICU UnicodeString to store my strings. Since I use some libraries that are incompatible with ICU, I need to convert UnicodeString to its platform-dependent representation.
Basically, what I need is the reverse of creating a new UnicodeString object: new UnicodeString("string encoded in system locale").
I found this topic, so I know it can be done with a stringstream.
So my question is: can it be done in some other, simpler way, without using a stringstream to convert?
I use:
std::string converted;
us.toUTF8String(converted);
where us is an (ICU) UnicodeString.
You could use UnicodeString::extract() with a codepage (or a converter). Actually passing NULL for the codepage will use what ICU detected as the default codepage.
You could use the functions in ucnv.h -- namely void ucnv_fromUnicode (UConverter *converter, char **target, const char *targetLimit, const UChar **source, const UChar *sourceLimit, int32_t *offsets, UBool flush, UErrorCode *err). It's not a nice C++ API like UnicodeString, but it will work.
I'd recommend just sticking with the operator<< you're already using if at all possible. It's the standard way to handle lexical conversions (i.e. string to/from integers) in C++ in any case.

UnicodeString to char* (UTF-8)

I am using the ICU library in C++ on OS X. All of my strings are UnicodeStrings, but I need to use system calls like fopen, fread and so forth. These functions take const char* or char* as arguments. I have read that OS X supports UTF-8 internally, so that all I need to do is convert my UnicodeString to UTF-8, but I don't know how to do that.
UnicodeString has a toUTF8() member function, but it returns a ByteSink. I've also found these examples: http://source.icu-project.org/repos/icu/icu/trunk/source/samples/ucnv/convsamp.cpp and read about using a converter, but I'm still confused. Any help would be much appreciated.
Call UnicodeString::extract(...) to extract into a char*, and pass NULL for the converter to get the default converter (which uses the charset your OS is using).
ICU User Guide > UTF-8 provides methods and descriptions of doing that.
The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++ icu::UnicodeString methods fromUTF8(const StringPiece &utf8) and toUTF8String(StringClass &result). There is also toUTF8(ByteSink &sink).
And extract() is no longer the preferred approach.
Note: icu::UnicodeString has constructors, setTo() and extract() methods which take either a converter object or a charset name. These can be used for UTF-8, but are not as efficient or convenient as the fromUTF8()/toUTF8()/toUTF8String() methods mentioned above.
This will work:
std::string utf8;
uStr.toUTF8String(utf8);