Convert ICU UnicodeString to platform dependent char * (or std::string) - c++

In my application I use ICU UnicodeString to store my strings. Since I use some libraries that are incompatible with ICU, I need to convert UnicodeString to its platform-dependent representation.
Basically, what I need is the reverse of creating a new UnicodeString object with new UnicodeString("string encoded in system locale").
I found this topic, so I know it can be done using a stringstream.
So my question is: can it be done in some other, simpler way, without using a stringstream for the conversion?

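I use:
std::string converted;
us.toUTF8String(converted);
where us is an (ICU) UnicodeString. Note that the result is always UTF-8, which matches the platform encoding only on systems whose locale is UTF-8.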

You could use UnicodeString::extract() with a codepage (or a converter). Actually passing NULL for the codepage will use what ICU detected as the default codepage.

You could use the functions in ucnv.h -- namely void ucnv_fromUnicode (UConverter *converter, char **target, const char *targetLimit, const UChar **source, const UChar *sourceLimit, int32_t *offsets, UBool flush, UErrorCode *err). It's not a nice C++ API like UnicodeString, but it will work.
I'd recommend just sticking with the operator<< you're already using if at all possible. It's the standard way to handle lexical conversions (i.e. string to/from integers) in C++ in any case.

Related

Proper cross-platform way to convert from std::string to 'const TCHAR *'

I'm working on a cross-platform project in C++. I have a variable of type std::string and need to convert it to const TCHAR *. What is the proper way to do this? Maybe with functions from some library?
UPD 1: as I see in the function definition, there are separate Windows and non-Windows implementations:
#if defined _MSC_VER || defined __MINGW32__
#define _tinydir_char_t TCHAR
#else
#define _tinydir_char_t char
#endif
So is there really no way to pass a parameter from a std::string without splitting the implementation per platform?
TCHAR should not be used in cross-platform programs at all, except of course when interacting with Windows API calls. Those calls need to be abstracted away from the rest of the program, or else it won't be cross-platform. So you only need to convert between TCHAR strings and char strings in Windows-specific code.
The rest of the program should use char, and preferably assume that it contains UTF-8 encoded strings. If user input or system calls return strings in a different encoding, you need to figure out what that encoding is and convert accordingly.
The character encoding conversion functionality of the C++ standard library is rather weak, so it is not of much use here. You can implement the conversion according to the encoding specification, or you can use a third-party implementation, as always.
Regarding "maybe functions from some library?":
I recommend this.
Regarding "as I see in the function definition, there are separate Windows and non-Windows implementations":
The library that you use doesn't provide a uniform API to different platforms, so it cannot be used in a truly cross-platform way. You can write a wrapper library with uniform function declarations that handles the character encoding conversion on platforms that need it.
Or, you can use another library, which provides a uniform API and converts the encoding transparently.
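The wrapper idea above could look roughly like the following. This is only a minimal sketch under stated assumptions: the names native_string and to_native are hypothetical, the input is assumed to be valid UTF-8, and on Windows it always targets the wide (W) API via MultiByteToWideChar rather than TCHAR. On other platforms the string is passed through unchanged.

```cpp
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>
using native_string = std::wstring;  // hypothetical alias: UTF-16 for the W API
#else
using native_string = std::string;   // hypothetical alias: UTF-8 bytes as-is
#endif

// Hypothetical helper: convert a UTF-8 std::string to whatever the
// platform's native string calls expect. Callers never see TCHAR.
inline native_string to_native(const std::string& utf8) {
#ifdef _WIN32
    if (utf8.empty()) return native_string();
    const int in_len = static_cast<int>(utf8.size());
    // First call asks for the required number of wchar_t units.
    const int out_len =
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), in_len, nullptr, 0);
    if (out_len <= 0) return native_string();  // conversion failed
    native_string out(static_cast<std::size_t>(out_len), L'\0');
    // Second call performs the actual conversion into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), in_len, &out[0], out_len);
    return out;
#else
    return utf8;  // assume the platform APIs accept UTF-8 directly
#endif
}
```

With such a helper, the platform split lives in one place, and the rest of the program keeps working with std::string.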
TCHAR is a Windows type, and it is defined like this:
#ifdef UNICODE
typedef wchar_t TCHAR, *PTCHAR;
#else
typedef char TCHAR, *PTCHAR;
#endif
The UNICODE macro is typically defined in the project settings (for example, when you use a Visual Studio project on Windows).
You can get a const TCHAR* from a std::string (which is ASCII or UTF-8 in most cases) like this:
// needs <string>, <locale>, <codecvt>, and <windows.h> for TCHAR
std::string s("hello world");
const TCHAR* pstring = nullptr;
#ifdef UNICODE
// std::wstring_convert is deprecated since C++17 but still widely available
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::wstring wstr = converter.from_bytes(s);
pstring = wstr.c_str(); // valid only while wstr is alive and unmodified
#else
pstring = s.c_str(); // valid only while s is alive and unmodified
#endif
pstring now holds the result.
However, using TCHAR on other platforms is strongly discouraged. It is better to keep UTF-8 strings (char*) in std::string.
I came across boost.nowide the other day. I think it will do exactly what you want.
http://cppcms.com/files/nowide/html/
As others have pointed out, you should not be using TCHAR except in code that interfaces with the Windows API (or libraries modeled after the Windows API).
Another alternative is to use the character conversion classes/macros defined in atlconv.h. CA2T will convert an 8-bit character string to a TCHAR string. CA2CT will convert to a const TCHAR string (LPCTSTR). Assuming your 8-bit strings are UTF-8, you should specify CP_UTF8 as the code page for the conversion.
If you want to declare a variable containing a TCHAR copy of a std::string:
CA2T tstr(stdstr.c_str(), CP_UTF8);
If you want to call a function that takes an LPCTSTR:
FunctionThatTakesString(CA2CT(stdstr.c_str(), CP_UTF8));
If you want to construct a std::string from a TCHAR string:
std::string mystdstring(CT2CA(tstr, CP_UTF8));
If you want to call a function that takes an LPTSTR then maybe you should not be using these conversion classes. (But you can if you know that the function you are calling does not modify the string outside its current length.)

Easiest way to convert UnicodeString to const char* in c++?

I'm new to C++ and have a problem converting a UnicodeString to a string, so I'm looking for the easiest way to convert from one type to the other.
I want to call a basic Windows function that needs a string, but what I have is a UnicodeString. How can I make this code work?
UnicodeString Exec = "notepad";
WinExec(Exec.c_str(), 0);
The environment used is C++Builder XE2.
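A std::string has no notion of Unicode: it just stores bytes (it can hold UTF-8, but not wide characters). To hold the wchar_t data of a UnicodeString you will need a std::wstring.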
I've never heard of UnicodeString before, but looking at the API here:
http://docwiki.embarcadero.com/Libraries/XE2/en/System.UnicodeString_Methods
It has a function called .c_str() which returns a wchar_t*, which you can then use to construct a std::wstring.
If you really need a std::string, then have a look at this answer: "How to convert wstring into string?"
If you are looking for complete unicode support in C++ go for ICU API. Here is the website where you can find everything about it. http://site.icu-project.org/

What type of string is best to use for Win32 and DirectX?

I am in the process of developing a small game in DirectX 10 and C++ and I'm finding it hell with the various different types of strings that are required for the various different directx / win32 function calls.
Can anyone recommend which of the various available string types is best to use? Ideally it would be one type of string that converts well to the other types (LPCWSTR and LPCSTR mostly). Thus far I have been using std::string and std::wstring and calling .c_str() to get the correct format.
I would like to have just one type of string that I pass in to all functions, and then do the conversion inside the function.
Use std::wstring with c_str() exactly as you have been doing. I see no reason to use std::string on Windows, your code may as well always use the native UTF-16 encoding.
I would stick to std::wstring as well. If you really need to pass std::string somewhere, you can still convert it on the fly:
std::string s = "Hello, World";
std::wstring ws(s.begin(), s.end());
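It works the other way around as well. Note that this copies the characters one by one without any encoding conversion, so it is only correct for plain ASCII text.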
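Constructing a std::wstring from s.begin() and s.end() widens each char individually, so any byte outside the ASCII range would end up as a wrong character. A minimal sketch of a guarded version follows; the helper name widen_ascii is hypothetical, and rejecting non-ASCII input by throwing is just one possible design choice.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: widen a string char by char, but refuse input
// that is not 7-bit ASCII, where a plain copy would give wrong results.
inline std::wstring widen_ascii(const std::string& s) {
    std::wstring out;
    out.reserve(s.size());
    for (unsigned char c : s) {
        if (c >= 0x80) {
            throw std::invalid_argument(
                "non-ASCII byte: a real encoding conversion is needed");
        }
        out.push_back(static_cast<wchar_t>(c));
    }
    return out;
}
```

For text that may contain non-ASCII characters, a real conversion (for example MultiByteToWideChar on Windows) is needed instead.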
If you're using Native COM (the stuff of #import <type_library>), then _bstr_t. It natively typecasts to both LPCWSTR and LPCSTR, and it meshes nicely with COM's memory allocation model. No need for .c_str() calls.

Is there an STL string class that properly handles Unicode?

I know all about std::string and std::wstring but they don't seem to fully pay attention to extended character encoding of UTF-8 and UTF-16 (On windows at least). There is also no support for UTF-32.
So does anyone know of cross-platform drop-in replacement classes that provide full UTF-8, UTF-16 and UTF-32 support?
And let's not forget the lightweight, very user-friendly, header-only UTF-8 library UTF8-CPP. Not a drop-in replacement, but can easily be used in conjunction with std::string and has no external dependencies.
Well, in C++0x (C++11) there are the classes std::u16string and std::u32string. GCC already partially supports them, so you can use them already, but stream support for Unicode is not done yet; see "Unicode support in C++0x".
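std::u32string only provides storage for UTF-32 code units; it does not decode anything by itself. As an illustration, here is a hand-written sketch that decodes UTF-8 bytes into a std::u32string. The function name utf8_to_utf32 is hypothetical, and the decoder is deliberately simplified: it rejects malformed lead and continuation bytes, but it does not reject overlong encodings or surrogate code points, which a production decoder should.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
inline std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Each element of the result is one code point, so size() on the returned string counts code points rather than bytes.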
It's not STL, but if you want proper Unicode in C++, then you should take a look at ICU.
There is no support for UTF-8 in the STL. As an alternative you can use the Boost codecvt facet:
//...
// My encoding type
typedef wchar_t ucs4_t;
std::locale old_locale;
std::locale utf8_locale(old_locale,new utf8_codecvt_facet<ucs4_t>);
// Set a New global locale
std::locale::global(utf8_locale);
// Send the UCS-4 data out, converting to UTF-8
{
std::wstringstream oss;
oss.imbue(utf8_locale);
std::copy(ucs4_data.begin(),ucs4_data.end(),
std::ostream_iterator<ucs4_t,ucs4_t>(oss));
std::wcout << oss.str() << std::endl;
}
For UTF-8 support, there is the Glib::ustring class. It is modeled after std::string but is UTF-8 aware, e.g. when you are scanning the string with an iterator. It also has some restrictions, e.g. the iterator is always const, as replacing a character can change the length of the string and so can invalidate other iterators.
ustring does not automatically convert other encodings to UTF-8; the Glib library has various conversion functions for this. You can validate whether the string is valid UTF-8, though.
And also, ustring and std::string are interchangeable, i.e. ustring has a cast operator to std::string so you can pass a ustring as a parameter where an std::string is expected, and vice versa of course, as ustring can be constructed from std::string.
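To illustrate what "UTF-8 aware" means in practice: with a plain std::string, size() counts bytes, not characters. The following sketch counts code points by skipping continuation bytes (those of the form 10xxxxxx). The function name utf8_length is hypothetical, and the input is assumed to be valid UTF-8; this is not how Glib::ustring is implemented, only a demonstration of the difference.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-8 encoded std::string
// by counting every byte that is not a continuation byte.
// Assumes the input is valid UTF-8.
inline std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++count;  // lead byte or ASCII byte
    }
    return count;
}
```

For a string such as "h\xC3\xA9" (an h followed by U+00E9), size() reports three bytes while this function reports two code points.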
Qt has QString which uses UTF-16 internally, but has methods for converting to or from std::wstring, UTF-8, Latin1 or locale encoding. There is also the QTextCodec class which can convert QStrings to or from basically anything. But using Qt for just strings seems like an overkill to me.
Also look at http://grigory.info/UTF8Strings.About.html; it is UTF-8 native.

UnicodeString to char* (UTF-8)

I am using the ICU library in C++ on OS X. All of my strings are UnicodeStrings, but I need to use system calls like fopen, fread and so forth. These functions take const char* or char* as arguments. I have read that OS X supports UTF-8 internally, so that all I need to do is convert my UnicodeString to UTF-8, but I don't know how to do that.
UnicodeString has a toUTF8() member function, but it returns a ByteSink. I've also found these examples: http://source.icu-project.org/repos/icu/icu/trunk/source/samples/ucnv/convsamp.cpp and read about using a converter, but I'm still confused. Any help would be much appreciated.
Call UnicodeString::extract(...) to extract into a char*, and pass NULL for the converter to get the default converter (which uses the charset your OS is using).
ICU User Guide > UTF-8 provides methods and descriptions of doing that.
The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++ icu::UnicodeString methods fromUTF8(const StringPiece &utf8) and toUTF8String(StringClass &result). There is also toUTF8(ByteSink &sink).
Also, extract() is no longer the preferred way to do this.
Note: icu::UnicodeString has constructors, setTo() and extract() methods which take either a converter object or a charset name. These can be used for UTF-8, but are not as efficient or convenient as the fromUTF8()/toUTF8()/toUTF8String() methods mentioned above.
This will work:
std::string utf8;
uStr.toUTF8String(utf8);