How to use unicode in C++ without much pain? [closed]

In C++ there's no solid standard when it comes to encoding. If I want to use Unicode, for example UTF-8, in C++ on Windows, how can I achieve that?
On Windows I have to use something like wide strings to use Unicode; is that the only way?
If I have to use third-party libraries, which libraries can you advise?
What do I have to remember when using Unicode instead of plain std::string?

If you are talking about source code, then it's implementation-specific for each compiler, but I believe every modern compiler supports UTF-8 at least.
C++ itself has the following types to support Unicode:
wchar_t, char16_t, char32_t, and char8_t for characters (char16_t and char32_t since C++11, char8_t since C++20), and the corresponding std::wstring, std::u16string, std::u32string, and std::u8string for strings.
And the following notations for literals:
char8_t ch_utf8 = u8'c';        // UTF-8 character literal (C++20)
char16_t ch_utf16 = u'c';       // UTF-16 character literal (C++11)
char32_t ch_utf32 = U'C';       // UTF-32 character literal (C++11)
wchar_t ch_wide = L'c';         // wide character literal (implementation-defined width)
char8_t str_utf8[] = u8"str";   // UTF-8 string literal
char16_t str_utf16[] = u"str";  // UTF-16 string literal
char32_t str_utf32[] = U"str";  // UTF-32 string literal
wchar_t str_wide[] = L"str";    // wide string literal
There is also the std::codecvt facet template for string conversions between different encodings (though note that std::wstring_convert and the <codecvt> header facets were deprecated in C++17).
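On Windows in particular, a common approach is to keep UTF-8 in std::string and convert to UTF-16 only at the Win32 API boundary. A minimal sketch using the documented MultiByteToWideChar call (error handling mostly omitted):
#include <windows.h>
#include <string>

// Convert a UTF-8 std::string to a UTF-16 std::wstring for Win32 "W" APIs.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call with a null output buffer just measures the required length.
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), static_cast<int>(utf8.size()),
                                  nullptr, 0);
    if (len <= 0) return std::wstring();   // invalid UTF-8 input
    std::wstring utf16(static_cast<size_t>(len), L'\0');
    // Second call performs the actual conversion into the sized buffer.
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &utf16[0], len);
    return utf16;
}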

Related

How can I best use UTF-8 for text in Windows program development? [closed]

I've just started doing some Windows programming.
I'm trying to decide how best to handle non-ASCII text.
I'd prefer to use 8-bit characters rather than 16-bit i.e. declare all my strings as char.
I've read the UTF-8 Everywhere proposals, and I think they misrepresent the current state of Windows.
Since Windows 10 version 1803 (10.0.17134.0), support for a UTF-8 code page has been implemented to the same standard as other multibyte character encodings.
I think now that I can:
Ensure Visual Studio stores source code as UTF-8 (using an EditorConfig file) and compile with UTF-8 as the source and execution character set by specifying /utf-8 as an "additional" option under C/C++ > Command Line.
Make sure the system knows the program uses UTF-8 character strings by calling setlocale(LC_ALL, ".UTF-8"); and/or setting <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage> in the manifest. (The system will actually expect UTF-8 by default if 'Beta: Use Unicode UTF-8 for worldwide language support' is ticked under Region / Language / Administrative language settings / Region Settings. I believe this sets the active code page to UTF-8, and is the default for Windows 11.)
Not define UNICODE and _UNICODE in source, and so use the Win32 'ANSI' interfaces; Windows will convert any text to UTF-16 internally.
Use the standard strings and char variables I'm used to, rather than wstring and wchar_t.
Have I got this right?
Is there anything else I need to do, apart from watching out for any code that in some way depends on a single character being held in a single byte?
Or is there some gotcha that is waiting to trip me?
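For reference, a minimal sketch of how I picture steps 2–4 fitting together (assuming the /utf-8 compiler flag and a UTF-8 active code page; MessageBoxA stands in here for any 'ANSI' Win32 call):
#include <clocale>
#include <cstdio>
#include <windows.h>

int main()
{
    // Tell the CRT to use UTF-8 for its char-based locale machinery.
    std::setlocale(LC_ALL, ".UTF-8");
    // Plain char strings hold UTF-8 bytes; with a UTF-8 active code
    // page, the 'ANSI' entry points convert them to UTF-16 internally.
    const char* text = "caf\xC3\xA9";   // "café" as explicit UTF-8 bytes
    MessageBoxA(nullptr, text, "UTF-8 demo", MB_OK);
    std::puts(text);
    return 0;
}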

Should I use TCHAR today [closed]

I am starting to work on a completely new project for Windows Desktops written in C++. When I learned Windows programming, I read that using TCHAR is a great improvement because I can build an ANSI or a Unicode version of my program without changing the code. However, I have never really used the option to build the ANSI version. Moreover, in the C++ standard library there is no TCHAR; I have to create typedefs for std::string, std::stringstream, etc. and their wide-string counterparts. So currently I am thinking of abandoning TCHAR in favor of wchar_t, and I have collected the following advantages and disadvantages.
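(The typedefs I mean look something like this; the names are just illustrative:)
#include <sstream>
#include <string>

// Hypothetical TCHAR-style aliases for the standard library string types.
#ifdef _UNICODE
typedef std::wstring       tstring;
typedef std::wstringstream tstringstream;
#else
typedef std::string        tstring;
typedef std::stringstream  tstringstream;
#endif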
Advantages:
TCHAR is a macro, so if I don't use it, the compiler front end and IntelliSense will give better results.
It is more explicit what the type of a variable is.
L"" is easier to type than _T("").
Disadvantages:
Loss of modularity regarding the character type (even though I don't really need the ANSI version, I find using an abstract character type to be a neat feature, and what if in the future I will need a UTF-8 or UTF-32 version?).
I would have to postfix some API functions with W, like GetWindowTextW.
And my questions:
Is there an easier way in the C++ standard library to use TCHAR than the one I described above? Like a standard header file that has these typedefs?
Do you think that my reasoning is correct?
Do I miss any important point?
What is the state-of-the-art solution today? Do professional Windows programmers still use TCHAR (in new code)?
If I remove TCHAR, then should I write L"" or u"" instead of _T("")?
In modern Windows, all ANSI functions internally convert char* to wchar_t* and call the Unicode version of the same function. Basically, by adopting TCHAR instead of wchar_t you gain nothing, but you have to deal with its quirky syntax.
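For comparison, the two styles side by side (a sketch; hwnd is assumed to be an existing window handle):
#include <windows.h>
#include <tchar.h>
#include <cstdio>
#include <cwchar>

void show_title(HWND hwnd)
{
    // TCHAR style: compiles as ANSI or Unicode depending on the _UNICODE macro.
    TCHAR buf1[256];
    GetWindowText(hwnd, buf1, 256);
    _tprintf(_T("%s\n"), buf1);

    // Explicit wide style: always UTF-16, no macro indirection.
    wchar_t buf2[256];
    GetWindowTextW(hwnd, buf2, 256);
    std::wprintf(L"%ls\n", buf2);
}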

Why std::string doesn't have methods for upper/lower case, format etc? [closed]

Why is std::string not smart at all? Why doesn't it have string manipulation features like Format/sprintf, convert to upper case, convert to lower case, construction from integer/real, conversion to integer/real, and the other important functions any string class should have (for reference: CString, wxString, System.String, BASIC strings...)?
I am aware that there are new functions like std::to_string, but why is string itself so dumb? Why is it just a vector<char>? Why is it still in the stone age? Why don't the standards make it smart?
Case comparisons and conversions, in full generality, are hard and require too much information; it's as simple as that.
In American and British English it's simple indeed.
But what about German? E.g. the lower-case ß, which is a single character in lower case but becomes two characters (SS) in upper case.
What about wide character sets which std::string can support? What about accented characters from other European languages like ë?
There's nothing idiotic about this class at all. It has a well defined specification and the standards committee will not emit functionality that could break the language.
As for formatting, this is largely deferred to the streaming libraries, e.g. std::stringstream. There's no reason to incorporate it directly into std::string.
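For the ASCII-only case the standard pieces do compose easily; a minimal sketch (deliberately labelled ASCII-only, which is exactly the point about ß and ë above):
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>

// Correct only for single-byte ASCII; fails for ß, ë, and friends.
std::string to_upper_ascii(std::string s)
{
    // The unsigned char cast avoids undefined behavior for negative char values.
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}

// Formatting lives in the stream library, not in std::string itself.
std::string format_example(int n, double x)
{
    std::ostringstream os;
    os << "n = " << n << ", x = " << x;
    return os.str();
}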

visual c++ doesn't understand u'' and U'' literals [closed]

I use Visual Studio 2013. When I write this code
char16_t ch1 = u'q';
visual studio complains with Error: identifier "u" is undefined.
I thought VS 2013 was supposed to support the C++11 standard, and the u'' literal as well.
While Microsoft's Visual C++ 2013 supports many C++11 features, the support still isn't complete.
As for string literals, they support only two (or three; depending on how you count) string literal prefixes so far:
L"Hello \"World\"" using Lto mark wide character strings (i.e. wchar_t rather than char).
R"(Hello "World")" using R to mark raw strings with special user defined delimiters (new to C++11).
LR"(Hello "World")" using a combination of both.

Size of implementation specific std::mbstate_t [closed]

The docs on this are rather lacking, so I'm hoping the community can run a simple test and post the results here so that I, and anybody else, have a reference.
#include <cwchar>
#include <cstdio>

int main() { std::printf("%zu\n", sizeof(std::mbstate_t)); }
If you could post the results here and also mention which compiler you are using, I would be very grateful.
On VS2010 it's declared as typedef int mbstate_t; and its size is 4 bytes for both 32- and 64-bit builds.
I'm asking this because mbstate_t is a member of streampos. I need to use this member to store the conversion state of an encoding. The minimum space I can get away with is 3 bytes so I need to know if any implementation is going to break my code.
Thanks in advance.
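(For context, the streampos member I mean is exposed through std::fpos<std::mbstate_t>::state(), roughly like this:)
#include <cwchar>
#include <iostream>

std::streampos pos;               // std::fpos<std::mbstate_t>
std::mbstate_t st = pos.state();  // read the stored conversion state
// pos.state(st);                 // ...and write it back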
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 on x86_64
size = 8
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 on armv7l
size = 8
You just want to know the results of the sizeof?
Qt 5.1 with GCC x86 32bit under Debian:
size = 8
From the C11 specification (7.29.1/2):
mbstate_t
which is a complete object type other than an array type that can hold the conversion state information necessary to convert between sequences of multibyte characters and wide characters;
So while I was wrong in that it can be an array, it could be anything else (including a structure containing an array). The language in the specification doesn't say anything about how it should be implemented, just that it's "a complete object type other than an array type".
From the C++11 specification (multiple places, for example 21.2.3.1/4):
The type mbstate_t is defined in <cwchar> and can represent any of the conversion states that can occur in an implementation-defined set of supported multibyte character encoding rules.
In conclusion, you cannot rely on mbstate_t being an integer type or having a specific size. If you want to be portable, you have to let the standard library manage the state for you.
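In practice that means treating mbstate_t as opaque: value-initialize one and hand it to the conversion functions, without ever storing or inspecting its bytes. A minimal sketch:
#include <clocale>
#include <cstring>
#include <cwchar>

int main()
{
    std::setlocale(LC_ALL, "");   // pick up the environment's multibyte encoding
    std::mbstate_t state{};       // value-initialized = initial conversion state
    const char* src = "hello";
    wchar_t wc;
    // The library updates the state; the caller never touches its contents.
    std::size_t n = std::mbrtowc(&wc, src, std::strlen(src), &state);
    return n == static_cast<std::size_t>(-1) ? 1 : 0;
}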