How to create multibyte characters in C - c++

During my study of character encoding in C and C++ I came across two general ways of encoding: multibyte characters and wide characters. In order to strengthen my understanding of those systems (benefits and drawbacks) I wanted to do some examples.
Doing examples with wide characters is not a problem due to the native support with the wchar_t type. But when I wanted to create a string which contains those so called multibyte characters I came to a problem.
How can I actually create a multibyte character string which uses an encoding that works with a char array (using Visual C++)? This kind of encoding sure does exist: http://www.gnu.org/software/libc/manual/html_node/Shift-State.html. But I read only about it and never saw an actual example. Or do you have to create your own encoding for this kind of string?

If you are able to create a wide character string literal, simply omitting the L should give you a multibyte character string literal with an implementation defined encoding (gcc has an option to chose it, I don't know about visual C++).
If you have a wide character string, you can get the equivalent multibyte string according to the C locale using the functions wcstombs (in <stdlib.h>) and wcsrtombs (in <wchar.h>).
C++ locale system also provides a way to do that conversion. (Look for the in and out member of the codecvt facet, I won't provide here a tutorial on their use, the site cppreference has example codes, for instance for out).
I'm not sure you'll be able to find easily support either on Unix or on Windows for an encoding with a shift state. You should search for encoding for China, Japan, Korea, Vietman (for instance ISO 2022-JP, but it seems to me that Unix tend to use EUC-JP instead and Windows Shift JIS).

Related

how character sets are stored in strings and wstrings?

So, i've been trying to do a bit of research of strings and wstrings as i need to understand how they work for a program i'm creating so I also looked into ASCII and unicode, and UTF-8 and UTF-16.
I believe i have an okay understanding of the concept of how these work, but what i'm still having trouble with is how they are actually stored in 'char's, 'string's, 'wchar_t's and 'wstring's.
So my questions are as follows:
Which character set and encoding is used for char and wchar_t? and are these types limited to using only these character sets / encoding?
If they are not limited to these character sets / encoding, how is it decided what character set / encoding is used for a particular char or wchar_t? is it automatically decided at compile for example or do we have to explicitly tell it what to use?
From my understanding UTF-8 uses 1 byte when using the first 128 code points in the set but can use more than 1 byte when using code point 128 and above. If so how is this stored? for example is it simply stored identically to ASCII if it only uses 1 byte? and how does the type (char or wchar_t or whatever) know how many bytes it is using?
Finally, if my understanding is correct I get why UTF-8 and UTF-16 are not compatible, eg. a string can't be used where a wstring is needed. But in a program that requires a wstring would it be better practice to write a conversion function from a string to a wstring and the use this when a wstring is required to make my code exclusively string-based or just use wstring where needed instead?
Thanks, and let me know if any of my questions are incorrectly worded or use the wrong terminology as i'm trying to get to grips with this as best as I can.
i'm working in C++ btw
They use whatever characterset and encoding you want. The types do not imply a specific characterset or encoding. They do not even imply characters - you could happily do math problems with them. Don't do that though, it's weird.
How do you output text? If it is to a console, the console decides which character is associated with each value. If it is some graphical toolkit, the toolkit decides. Consoles and toolkits tend to conform to standards, so there is a good chance they will be using unicode, nowadays. On older systems anything might happen.
UTF8 has the same values as ASCII for the range 0-127. Above that it gets a bit more complicated; this is explained here quite well: https://en.wikipedia.org/wiki/UTF-8#Description
wstring is a string made up of wchar_t, but sadly wchar_t is implemented differently on different platforms. For example, on Visual Studio it is 16 bits (and could be used to store UTF16), but on GCC it is 32 bits (and could thus be used to store unicode codepoints directly). You need to be aware of this if you want your code to be portable. Personally I chose to only store strings in UTF8, and convert only when needed.
Which character set and encoding is used for char and wchar_t? and are these types limited to using only these character sets / encoding?
This is not defined by the language standard. Each compiler will have to agree with the operating system on what character codes to use. We don't even know how many bits are used for char and wchar_t.
On some systems char is UTF-8, on others it is ASCII, or something else. On IBM mainframes it can be EBCDIC, a character encoding already in use before ASCII was defined.
If they are not limited to these character sets / encoding, how is it decided what character set / encoding is used for a particular char or wchar_t? is it automatically decided at compile for example or do we have to explicitly tell it what to use?
The compiler knows what is appropriate for each system.
From my understanding UTF-8 uses 1 byte when using the first 128 code points in the set but can use more than 1 byte when using code point 128 and above. If so how is this stored? for example is it simply stored identically to ASCII if it only uses 1 byte? and how does the type (char or wchar_t or whatever) know how many bytes it is using?
The first part of UTF-8 is identical to the corresponding ASCII codes, and stored as a single byte. Higher codes will use two or more bytes.
The char type itself just store bytes and doesn't know how many bytes we need to form a character. That's for someone else to decide.
The same thing for wchar_t, which is 16 bits on Windows but 32 bits on other systems, like Linux.
Finally, if my understanding is correct I get why UTF-8 and UTF-16 are not compatible, eg. a string can't be used where a wstring is needed. But in a program that requires a wstring would it be better practice to write a conversion function from a string to a wstring and the use this when a wstring is required to make my code exclusively string-based or just use wstring where needed instead?
You will likely have to convert. Unfortunately the conversion needed will be different for different systems, as character sizes and encodings vary.
In later C++ standards you have new types char16_t and char32_t, with the string types u16string and u32string. Those have known sizes and encodings.
Everything about used encoding is implementation defined. Check your compiler documentation. It depends on default locale, encoding of source file and OS console settings.
Types like string, wstring, operations on them and C facilities, like strcmp/wstrcmp expect fixed-width encodings. So the would not work properly with variable width ones like UTF8 or UTF16 (but will work with, e.g., UCS-2). If you want to store variable-width encoded strings, you need to be careful and not use fixed-width operations on it. C-string do have some functions for manipulation of such strings in standard library .You can use classes from codecvt header to convert between different encodings for C++ strings.
I would avoid wstring and use C++11 exact width character string: std::u16string or std::u32string
As an example here is some info on how windows uses these types/encodings.
char stores ASCII values (with code pages for non-ASCII values)
wchar_t stores UTF-16, note this means that some unicode characters will use 2 wchar_t's
If you call a system function, e.g. puts then the header file will actually pick either puts or _putws depending on how you've set things up (i.e. if you are using unicode).
So on windows there is no direct support for UTF-8, which means that if you use char to store UTF-8 encoded strings you have to covert them to UTF-16 and call the corresponding UTF-16 system functions.

UTF-8 Compatibility in C++

I am writing a program that needs to be able to work with text in all languages. My understanding is that UTF-8 will do the job, but I am experiencing a few problems with it.
Am I right to say that UTF-8 can be stored in a simple char in C++? If so, why do I get the following warning when I use a program with char, string and stringstream: warning C4566: character represented by universal-character-name '\uFFFD' cannot be represented in the current code page (1252). (I do not get that error when I use wchar_t, wstring and wstringstream.)
Additionally, I know that UTF is variable length. When I use the at or substr string methods would I get the wrong answer?
To use UTF-8 string literals you need to prefix them with u8, otherwise you get the implementation's character set (in your case, it seems to be Windows-1252): u8"\uFFFD" is null-terminated sequence of bytes with the UTF-8 representation of the replacement character (U+FFFD). It has type char const[4].
Since UTF-8 has variable length, all kinds of indexing will do indexing in code units, not codepoints. It is not possible to do random access on codepoints in an UTF-8 sequence because of it's variable length nature. If you want random access you need to use a fixed length encoding, like UTF-32. For that you can use the U prefix on strings.
Yes, the UTF-8 encoding can be used with char, string, and stringstream. A char will hold a single UTF-8 code unit, of which up to four may be required to represent a single Unicode code point.
However, there are a few issues using UTF-8 specifically with Microsoft's compilers. C++ implementations use an 'execution character set' for a number of things, such as encoding character and string literals. VC++ always use the system locale encoding as the execution character set, and Windows does not support UTF-8 as the system locale encoding, therefore UTF-8 can never by the execution character set.
This means that VC++ never intentionally produces UTF-8 character and string literals. Instead the compiler must be tricked.
The compiler will convert from the known source code encoding to the execution encoding. That means that if the compiler uses the locale encoding for both the source and execution encodings then no conversion is done. If you can get UTF-8 data into the source code but have the compiler think that the source uses the locale encoding, then character and string literals will use the UTF-8 encoding. VC++ uses the so-called 'BOM' to detect the source encoding, and uses the locale encoding if no BOM is detected. Therefore you can get UTF-8 encoded string literals by saving all your source files as "UTF-8 without signature".
There are caveats with this method. First, you cannot use UCNs with narrow character and string literals. Universal Character Names have to be converted to the execution character set, which isn't UTF-8. You must either write the character literally so it appears as UTF-8 in the source code, or you can use hex escapes where you manually write out a UTF-8 encoding. Second, in order to produce wide character and string literals the compiler performs a similar conversion from the source encoding to the wide execution character set (which is always UTF-16 in VC++). Since we're lying to the compiler about the encoding, it will perform this conversion to UTF-16 incorrectly. So in wide character and string literals you cannot use non-ascii characters literally, and instead you must use UCNs or hex escapes.
UTF-8 is variable length (as is UTF-16). The indices used with at() and substr() are code units rather than character or code point indices. So if you want a particular code unit then you can just index into the string or array or whatever as normal. If you need a particular code point then you either need a library that can understand composing UTF-8 code units into code points (such as the Boost Unicode iterators library), or you need to convert the UTF-8 data into UTF-32. If you need actual user perceived characters then you need a library that understands how code points are composed into characters. I imagine ICU has such functionality, or you could implement the Default Grapheme Cluster Boundary Specification from the Unicode standard.
The above consideration of UTF-8 only really matters for how you write Unicode data in the source code. It has little bearing on the program's input and output.
If your requirements allow you to choose how to do input and output then I would still recommend using UTF-8 for input. Depending on what you need to do with the input you can either convert it to another encoding that's easy for you to process, or you can write your processing routines to work directly on UTF-8.
If you want to ever output anything via the Windows console then you'll want a well defined module for output that can have different implementations, because internationalized output to the Windows console will require a different implementation from either outputting to a file on Windows or console and file output on other platforms. (On other platforms the console is just another file, but the Windows console needs special treatment.)
The reason you get the warning about \uFFFD is that you're trying to fit FF FD inside a single byte, since, as you noted, UTF-8 works on chars and is variable length.
If you use at or substr, you will possibly get wrong answers since these methods count that one byte should be one character. This is not the case with UTF-8. Notably, with at, you could end up with a single byte of a character sequence; with substr, you could break a sequence and end up with an invalid UTF-8 string (it would start or end with �, \uFFFD, the same one you're apparently trying to use, and the broken character would be lost).
I would recommend that you use wchar to store Unicode strings. Since the type is at least 16 bits, many many more characters can fit in a single "unit".

Internal and external encoding vs. Unicode

Since there was a lot of missinformation spread by several posters in the comments for this question: C++ ABI issues list
I have created this one to clarify.
What are the encodings used for C style strings?
Is Linux using UTF-8 to encode strings?
How does external encoding relate to the encoding used by narrow and wide strings?
Implementation defined. Or even application defined; the standard
doesn't really put any restrictions on what an application does with
them, and expects a lot of the behavior to depend on the locale. All
that is really implemenation defined is the encoding used in string
literals.
In what sense. Most of the OS ignores most of the encodings; you'll
have problems if '\0' isn't a nul byte, but even EBCDIC meets that
requirement. Otherwise, depending on the context, there will be a few
additional characters which may be significant (a '/' in path names,
for example); all of these use the first 128 encodings in Unicode, so
will have a single byte encoding in UTF-8. As an example, I've used
both UTF-8 and ISO 8859-1 for filenames under Linux. The only real
issue is displaying them: if you do ls in an xterm, for example,
ls and the xterm will assume that the filenames are in the same
encoding as the display font.
That mainly depends on the locale. Depending on the locale, it's
quite possible for the internal encoding of a narrow character string not to
correspond to that used for string literals. (But how could it be
otherwise, since the encoding of a string literal must be determined at
compile time, where as the internal encoding for narrow character
strings depends on the locale used to read it, and can vary from one
string to the next.)
If you're developing a new application in Linux, I would strongly
recommend using Unicode for everything, with UTF-32 for wide character
strings, and UTF-8 for narrow character strings. But don't count on
anything outside the first 128 encoding points working in string
literals.
This depends on the architecture. Most Unix architectures are using UTF-32 for wide strings (wchar_t) and ASCII for (char). Note that ASCII is just 7bit encoding. Windows was using UCS-2 until Windows 2000, later versions use variable encoding UTF-16 (for wchar_t).
No. Most system calls on Linux are encoding agnostic (they don't care what the encoding is, since they are not interpreting it in any way). External encoding is actually defined by your current locale.
The internal encoding used by narrow and wide strings is fixed, it does not change with changing locale. By changing the locale you are chaning the translation functions that encode and decode data which enters/leaves your program (assuming you stick with standard C text functions).

Using Different Character Encodings

Recently, I have gotten interested in Text Encoding. As you know, there are many kinds of Text Encoding such as CRC949, UTF-8 and so on.
I am wondering how to express them properly. (To the screen and users.) I mean, they are different from each other. I remember there was particular way to express text accrording to encoding in C#.
Is it possible one can use just simple printf() in C to express string regardless of encoding? Does the compiler automatically do it?
Read Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
From the article:
We decided to do everything internally in UCS-2 (two byte) Unicode,
which is what Visual Basic, COM, and Windows NT/2000/XP use as their
native string type. In C++ code we just declare strings as wchar_t
("wide char") instead of char and use the wcs functions instead of the
str functions (for example wcscat and wcslen instead of strcat and
strlen). To create a literal UCS-2 string in C code you just put an L
before it as so: L"Hello".

Is a wide character string literal starting with L like L"Hello World" guaranteed to be encoded in Unicode?

I've recently tried to get the full picture about what steps it takes to create platform independent C++ applications that support unicode. A thing that is confusing to me is that most howtos and stuff equalize the character encoding (i.e. ANSI or Unicode) and the character type (char or wchar_t). As I've learned so far, these are different things and there may exist a character sequence encodeded in Unicode but represented by std::string as well as a character sequence encoded in ANSI but represented as std::wstring, right?
So the question that comes to my mind is whether the C++ standard gives any guarantee about the encoding of string literals starting with L or does it just say it's of type wchar_t with implementation specific character encoding?
If there is no such guaranty, does that mean I need some sort of external resource system to provide non ASCII string literals for my application in a platform independent way?
What is the prefered way for this? Resource system or proper encoding of source files plus proper compiler options?
The L symbol in front of a string literal simply means that each character in the string will be stored as a wchar_t. But this doesn't necessarily imply Unicode. For example, you could use a wide character string to encode GB 18030, a character set used in China which is similar to Unicode. The C++03 standard doesn't have anything to say about Unicode, (however C++11 defines Unicode char types and string literals) so it's up to you to properly represent Unicode strings in C++03.
Regarding string literals, Chapter 2 (Lexical Conventions) of the C++ standard mentions a "basic source character set", which is basically equivalent to ASCII. So this essentially guarantees that "abc" will be represented as a 3-byte string (not counting the null), and L"abc" will be represented as a 3 * sizeof(wchar_t)-byte string of wide-characters.
The standard also mentions "universal-character-names" which allow you to refer to non-ASCII characters using the \uXXXX hexadecimal notation. These "universal-character-names" usually map directly to Unicode values, but the standard doesn't guarantee that they have to. However, you can at least guarantee that your string will be represented as a certain sequence of bytes by using universal-character-names. This will guarantee Unicode output provided the runtime environment supports Unicode, has the appropriate fonts installed, etc.
As for string literals in C++03 source files, again there is no guarantee. If you have a Unicode string literal in your code which contains characters outside of the ASCII range, it is up to your compiler to decide how to interpret these characters. If you want to explicitly guarantee that the compiler will "do the right thing", you'd need to use \uXXXX notation in your string literals.
The C++03 does not mention unicode (future C++0x does). Currently you have to either use external libraries (ICU, UTF-CPP, etc.) or build your own solution using platform specific code. As others have mentioned, wchar_t encoding (or even size) is not specified. Consequently, string literal encoding is implementation specific. However, you can give unicode codepoints in string literals by using \x \u \U escapes.
Typically unicode apps in Windows use wchar_t (with UTF-16 encoding) as internal character format, because it makes using Windows APIs easier as Windows itself uses UTF-16. Unix/Linux unicode apps in turn usually use char (with UTF-8 encoding) internally. If you want to exchange data between different platforms, UTF-8 is usual choice for data transfer encoding.
The standard makes no mention of encoding formats for strings.
Take a look at ICU from IBM (its free). http://site.icu-project.org/