When does encoding actually matter (e.g., when storing or printing strings)? - C++

Just curious about the encodings the system uses (if it cares at all) when storing and printing strings.
Question 1: If I store a one-byte string in std::string or a two-byte string in std::wstring, will the underlying integer values differ depending on the encoding currently in use? (I remember Bjarne saying that an encoding is the mapping between characters and integers, so a char should be stored as integer(s) in memory, and different encodings don't necessarily have the same mapping.)
Question 2: If so, must std::string and std::wstring know the encoding themselves (although someone told me this is NOT true)? Otherwise, how can they translate a character to the correct integers and store them? How does the system know the encoding?
Question 3: What is the default encoding on a particular system, and how do I change it (is this the so-called "locale")? I guess the same mechanism applies?
Question 4: If I print a string to the screen with std::cout, is it the same encoding?

(I remember Bjarne saying that an encoding is the mapping between characters and integers, so a char should be stored as integer(s) in memory)
Not quite. Make sure you understand one important distinction.
A character is the minimum unit of text. A letter, digit, punctuation mark, symbol, space, etc.
A byte is the minimum unit of memory. On the overwhelming majority of computers, this is 8 bits.
Encoding is converting a sequence of characters to a sequence of bytes. Decoding is converting a sequence of bytes to a sequence of characters.
The confusing thing for C and C++ programmers is that char means byte, NOT character! The name char for the byte type is a legacy of the pre-Unicode days, when everyone (except East Asians) used single-byte encodings. But nowadays we have Unicode, whose encoding schemes use up to 4 bytes per character.
Question 1: If I store a one-byte string in std::string or a two-byte string in std::wstring, will the underlying integer value depend on the encoding currently in use?
Yes, it will. Suppose you have std::string euro = "€"; Then:
With the windows-1252 encoding, the string will be encoded as the byte 0x80.
With the ISO-8859-15 encoding, the string will be encoded as the byte 0xA4.
With the UTF-8 encoding, the string will be encoded as the three bytes 0xE2, 0x82, 0xAC.
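As a minimal sketch of what that means in memory (assuming an 8-bit char; the bytes are hard-coded here so the example does not depend on your compiler's source/execution character set):

#include <cstdio>
#include <string>

int main() {
    // These are the three bytes a UTF-8 "€" literal would produce.
    std::string euro = "\xE2\x82\xAC";
    for (unsigned char c : euro)
        std::printf("%02X ", c);   // prints: E2 82 AC
    std::printf("\n");
    return 0;
}

Under windows-1252 the same string would hold the single byte 0x80 instead; std::string itself neither knows nor cares.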
Question 3: What is the default encoding on a particular system, and how do I change it (is this the so-called "locale")?
Depends on the platform. On Unix, the encoding can be specified as part of the LANG environment variable.
~$ echo $LANG
en_US.utf8
Windows has a GetACP function to get the "ANSI" code page number.
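As a rough sketch (assuming a Unix-like system where LANG is set; on Windows you would call GetACP() from <windows.h> instead):

#include <cstdio>
#include <cstdlib>

int main() {
    // The locale, and therefore the expected encoding, is usually advertised
    // through environment variables such as LANG on Unix-like systems.
    const char* lang = std::getenv("LANG");
    std::printf("LANG=%s\n", lang ? lang : "(not set)");
    return 0;
}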
Question 4: If I print a string to the screen with std::cout, is it the same encoding?
Not necessarily. On Windows, the command line uses the "OEM" code page, which is usually different from the "ANSI" code page used elsewhere.

Encoding and decoding are inherently the same process: both transform one integral sequence into another integral sequence.
The difference between encoding and decoding is at the conceptual level. When you "decode", you transform an integral sequence encoded in a known encoding (a "string") into a system-specific integral sequence ("text"). When you "encode", you transform a system-specific integral sequence ("text") into an integral sequence in a particular encoding (a "string").
This difference is conceptual, not physical; the memory still holds the decoded "text" as a "string". However, since a particular system always represents "text" in one particular encoding, text transformations do not need to deal with the specifics of the actual system encoding, and can safely assume they are working on a sequence of conceptual "characters" rather than "bytes".
Generally, however, the encoding used for "text" has properties that make it easy to work with (e.g. fixed-length characters, a simple one-to-one mapping between characters and byte sequences), while the encoded "string" uses an encoding optimized for efficiency (e.g. variable-length characters, context-dependent encoding).
Joel On Software has a writeup on this: http://www.joelonsoftware.com/articles/Unicode.html
This one is a good one as well: http://www.jerf.org/programming/encoding.html

Question 1: If I store a one-byte string in std::string or a two-byte string in std::wstring, will the underlying integer value differ depending on the encoding currently in use? (I remember Bjarne saying that an encoding is the mapping between characters and integers, so a char should be stored as integer(s) in memory, and different encodings don't necessarily have the same mapping)
You're sort of thinking about this backwards. Different encodings interpret the underlying integers as different characters (or parts of characters, if we're talking about a multi-byte character set), depending on the encoding.
Question 2: If so, must std::string and std::wstring know the encoding themselves (although someone told me this is NOT true)? Otherwise, how can they translate a character to the correct integers and store them? How does the system know the encoding?
Both std::string and std::wstring are completely encoding-agnostic. As far as C++ is concerned, they simply store arrays of char objects and wchar_t objects respectively. The only requirement is that char is one byte and wchar_t is some implementation-defined width (usually 2 bytes on Windows and 4 on Linux/UNIX).
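A small sketch illustrating that point: the containers only see fixed-width code units, and they will store any bytes you hand them without checking any encoding.

#include <iostream>
#include <string>

int main() {
    std::cout << "sizeof(char)    = " << sizeof(char)    << '\n';  // always 1
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';  // 2 on Windows, 4 on most Unix systems

    // std::string happily stores any byte sequence; it never checks whether
    // the bytes form valid text in any particular encoding.
    std::string raw("\xFF\xFE\x00", 3);   // not valid UTF-8, stored anyway
    std::cout << "raw.size() = " << raw.size() << '\n';  // 3
    return 0;
}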
Question 3: What is the default encoding on a particular system, and how do I change it (is this the so-called "locale")?
That depends on the platform. ISO C++ only talks about the global locale object, std::locale(), which generally refers to your current system-specific settings.
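As a rough sketch (note that std::locale("") picks up the user's environment settings and may throw if the runtime does not recognize the configured locale):

#include <iostream>
#include <locale>

int main() {
    std::locale user("");              // the user's preferred locale (LANG/LC_* on Unix, regional settings on Windows)
    std::locale::global(user);         // make it the global locale
    std::cout.imbue(user);             // and use it for this stream
    std::cout << "current locale: " << user.name() << '\n';
    return 0;
}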
Question 4: If I print a string to the screen with std::cout, is it the same encoding?
Generally, if you output to the screen through stdout, the characters you see displayed are interpreted and rendered according to your system's current locale settings.

Anyone working with encodings should read this Joel on Software article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). I found it useful when I started working with encodings.
Question 1: If I store one-byte string in std::string or two-byte string in std::wstring, will the underlying integer value differ depending on the encoding currently in use?
C/C++ programmers are used to thinking of characters as bytes, because almost everyone starts out with the ASCII character set, which maps the integers 0-127 to symbols such as the letters of the alphabet and Arabic numerals. The fact that the C char data type is actually a byte doesn't help matters.
The std::string class stores its data as 8-bit code units (char), while std::wstring stores wchar_t units, which are 16 bits wide on Windows and typically 32 bits on Linux/UNIX. Neither class has any concept of encoding. You can use any 8-bit encoding such as ASCII, UTF-8, Latin-1, or Windows-1252 with a std::string, and any encoding whose code units fit in wchar_t, such as UTF-16, with a std::wstring.
Data stored in std::string and std::wstring must always be interpreted through some encoding. This generally comes into play when you interact with the operating system: reading or writing data from a file or a stream, or making OS API calls that take strings.
So to answer your question: if you store the same byte value in a std::string and a std::wstring, the memory will contain the same value (except that each wstring element is wider and is padded with zero bytes), but the interpretation of that value will depend on the encoding in use.
If you store the same character in each of the strings, then the bytes may be different, again depending on the encoding. For example, the Euro symbol (€) might be stored in the std::string using the UTF-8 encoding, which corresponds to the bytes 0xE2 0x82 0xAC. In the std::wstring, it might be stored using the UTF-16 encoding, which would be the single 16-bit code unit 0x20AC.
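A sketch of that difference, using std::u16string for the UTF-16 side so the example does not depend on the platform's wchar_t width (the UTF-8 bytes are hard-coded to avoid source-encoding issues):

#include <cstdio>
#include <string>

int main() {
    std::string    utf8  = "\xE2\x82\xAC";   // the UTF-8 encoding of U+20AC (€)
    std::u16string utf16 = u"\u20AC";        // the same character in UTF-16

    std::printf("UTF-8 : %zu code units:", utf8.size());    // 3
    for (unsigned char c : utf8)
        std::printf(" %02X", c);                             // E2 82 AC
    std::printf("\nUTF-16: %zu code unit: %04X\n",
                utf16.size(), (unsigned)utf16[0]);           // 1, 20AC
    return 0;
}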
Question 3: What is the default encoding on a particular system, and how do I change it (is this the so-called "locale")? I guess the same mechanism applies?
Yes, the locale determines how the OS interprets strings at its API boundaries. Locales define more than just an encoding: they also include information on how money, dates, times, and other things should be formatted. On Linux or OS X, you can use the locale command in the terminal to see what the current locale is:
mch@bohr:/$ locale
LANG=en_CA.UTF-8
LC_CTYPE="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_TIME="en_CA.UTF-8"
LC_COLLATE="en_CA.UTF-8"
LC_MONETARY="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_PAPER="en_CA.UTF-8"
LC_NAME="en_CA.UTF-8"
LC_ADDRESS="en_CA.UTF-8"
LC_TELEPHONE="en_CA.UTF-8"
LC_MEASUREMENT="en_CA.UTF-8"
LC_IDENTIFICATION="en_CA.UTF-8"
LC_ALL=
So in this case, my locale is Canadian English. Each locale defines an encoding used to interpret strings. In this case the locale name makes it clear that it is using a UTF-8 encoding, but you can run locale -ck LC_CTYPE to see more information about the current encoding:
mch@bohr:/$ locale -ck LC_CTYPE
LC_CTYPE
ctype-class-names="upper";"lower";"alpha";"digit";"xdigit";"space";"print";"graph";"blank";"cntrl";"punct";"alnum";"combining";"combining_level3"
ctype-map-names="toupper";"tolower";"totitle"
ctype-width=16
ctype-mb-cur-max=6
charmap="UTF-8"
... output snipped ...
If you want to test a program using encodings, you can set the LC_ALL environment variable to the locale you want to use. You can also change the locale using setlocale. Permanently changing the locale depends on your distribution.
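A minimal sketch of the setlocale route; passing "" asks the C runtime to pick up whatever the environment (LANG/LC_*) specifies:

#include <clocale>
#include <cstdio>

int main() {
    // The return value names the locale actually selected,
    // or is NULL if the request could not be honored.
    const char* name = std::setlocale(LC_ALL, "");
    std::printf("active locale: %s\n", name ? name : "(failed)");
    return 0;
}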
On Windows, most API functions come in a narrow and a wide format. For example, GetCurrentDirectory comes in GetCurrentDirectoryW (Unicode) and GetCurrentDirectoryA (ANSI) variants. Unicode, in this context, means UTF-16.
I don't know enough about Windows to tell you how to set the locale, other than to try the languages control panel.
Question 4: What if I print a string to the screen with std::cout, is it the same encoding?
When you print a string to std::cout, the OS will interpret that string in the encoding set by the locale. If your string is UTF-8 encoded and the OS is using Windows-1252, it will be necessary to convert it to that encoding. One way to do this is with the iconv library.
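A sketch of such a conversion with the POSIX iconv API (assuming iconv is available; the exact encoding names accepted, such as "WINDOWS-1252" here, vary a little between implementations):

#include <iconv.h>   // POSIX / GNU libiconv, not part of standard C++
#include <cstdio>
#include <cstring>

int main() {
    iconv_t cd = iconv_open("WINDOWS-1252", "UTF-8");
    if (cd == (iconv_t)-1) { std::perror("iconv_open"); return 1; }

    char in[] = "\xE2\x82\xAC";          // "€" in UTF-8
    char out[16] = {0};
    char* inp = in;    size_t inleft  = std::strlen(in);
    char* outp = out;  size_t outleft = sizeof(out);

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        std::perror("iconv");
    else
        std::printf("converted byte: %02X\n", (unsigned char)out[0]);  // 80 in Windows-1252

    iconv_close(cd);
    return 0;
}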

Related

How are character sets stored in strings and wstrings?

So, I've been trying to do a bit of research on strings and wstrings, as I need to understand how they work for a program I'm creating, so I also looked into ASCII and Unicode, and UTF-8 and UTF-16.
I believe I have an okay understanding of the concept of how these work, but what I'm still having trouble with is how they are actually stored in chars, strings, wchar_ts and wstrings.
So my questions are as follows:
Which character set and encoding are used for char and wchar_t? And are these types limited to using only those character sets / encodings?
If they are not limited to these character sets / encodings, how is it decided what character set / encoding is used for a particular char or wchar_t? Is it decided automatically at compile time, for example, or do we have to explicitly tell it what to use?
From my understanding, UTF-8 uses 1 byte for the first 128 code points in the set but can use more than 1 byte for code point 128 and above. If so, how is this stored? For example, is it simply stored identically to ASCII if it only uses 1 byte? And how does the type (char or wchar_t or whatever) know how many bytes it is using?
Finally, if my understanding is correct, I get why UTF-8 and UTF-16 are not compatible, e.g. a string can't be used where a wstring is needed. But in a program that requires a wstring, would it be better practice to write a conversion function from a string to a wstring and use this whenever a wstring is required, keeping my code exclusively string-based, or just use wstring where needed instead?
Thanks, and let me know if any of my questions are incorrectly worded or use the wrong terminology, as I'm trying to get to grips with this as best as I can.
I'm working in C++, btw.
They use whatever character set and encoding you want. The types do not imply a specific character set or encoding. They do not even imply characters - you could happily do math problems with them. Don't do that though, it's weird.
How do you output text? If it is to a console, the console decides which character is associated with each value. If it is some graphical toolkit, the toolkit decides. Consoles and toolkits tend to conform to standards, so there is a good chance they will be using Unicode nowadays. On older systems anything might happen.
UTF-8 has the same values as ASCII for the range 0-127. Above that it gets a bit more complicated; this is explained here quite well: https://en.wikipedia.org/wiki/UTF-8#Description
wstring is a string made up of wchar_t, but sadly wchar_t is implemented differently on different platforms. For example, with Visual Studio it is 16 bits (and could be used to store UTF-16), but with GCC it is 32 bits (and could thus be used to store Unicode code points directly). You need to be aware of this if you want your code to be portable. Personally I chose to only store strings in UTF-8, and convert only when needed.
Which character set and encoding are used for char and wchar_t? And are these types limited to using only those character sets / encodings?
This is not defined by the language standard. Each compiler will have to agree with the operating system on what character codes to use. We don't even know how many bits are used for char and wchar_t.
On some systems char is UTF-8, on others it is ASCII, or something else. On IBM mainframes it can be EBCDIC, a character encoding already in use before ASCII was defined.
If they are not limited to these character sets / encodings, how is it decided what character set / encoding is used for a particular char or wchar_t? Is it decided automatically at compile time, for example, or do we have to explicitly tell it what to use?
The compiler knows what is appropriate for each system.
From my understanding, UTF-8 uses 1 byte for the first 128 code points in the set but can use more than 1 byte for code point 128 and above. If so, how is this stored? For example, is it simply stored identically to ASCII if it only uses 1 byte? And how does the type (char or wchar_t or whatever) know how many bytes it is using?
The first part of UTF-8 is identical to the corresponding ASCII codes, and stored as a single byte. Higher codes will use two or more bytes.
The char type itself just stores bytes and doesn't know how many bytes we need to form a character. That's for someone else to decide.
The same goes for wchar_t, which is 16 bits on Windows but 32 bits on other systems, like Linux.
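To make that concrete, here is a sketch of how a decoder (the "someone else") would read the sequence length off a UTF-8 lead byte; this is only the length rule, and a real decoder must also validate the continuation bytes:

#include <cstdio>

int utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80)           return 1;   // 0xxxxxxx: ASCII range
    if ((lead & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;   // 11110xxx
    return -1;                             // 10xxxxxx is a continuation byte, not a lead byte
}

int main() {
    const unsigned char euro[] = { 0xE2, 0x82, 0xAC };   // "€" in UTF-8
    std::printf("lead byte %02X starts a %d-byte sequence\n",
                euro[0], utf8_sequence_length(euro[0]));  // 3
    return 0;
}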
Finally, if my understanding is correct, I get why UTF-8 and UTF-16 are not compatible, e.g. a string can't be used where a wstring is needed. But in a program that requires a wstring, would it be better practice to write a conversion function from a string to a wstring and use this whenever a wstring is required, keeping my code exclusively string-based, or just use wstring where needed instead?
You will likely have to convert. Unfortunately the conversion needed will be different for different systems, as character sizes and encodings vary.
In later C++ standards you have new types char16_t and char32_t, with the string types u16string and u32string. Those have known sizes and encodings.
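A small sketch of those types (C++11); the sizes show why a fixed-width type is easier to reason about:

#include <iostream>
#include <string>

int main() {
    std::u16string bmp   = u"\u20AC";       // Euro sign: 1 UTF-16 code unit
    std::u16string emoji = u"\U0001F600";   // outside the BMP: 2 code units (a surrogate pair)
    std::u32string fixed = U"\U0001F600";   // always exactly 1 UTF-32 code unit
    std::cout << bmp.size() << ' ' << emoji.size() << ' ' << fixed.size() << '\n';  // 1 2 1
    return 0;
}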
Everything about the encoding used is implementation-defined. Check your compiler documentation. It depends on the default locale, the encoding of the source file, and the OS console settings.
Types like string and wstring, operations on them, and C facilities like strcmp/wcscmp expect fixed-width encodings. So they will not work properly with variable-width ones like UTF-8 or UTF-16 (but will work with, e.g., UCS-2). If you want to store variable-width encoded strings, you need to be careful and not use fixed-width operations on them. C strings do have some functions in the standard library for manipulating such strings. You can also use the classes from the <codecvt> header to convert between different encodings for C++ strings.
I would avoid wstring and use the C++11 exact-width character strings instead: std::u16string or std::u32string.
As an example, here is some info on how Windows uses these types/encodings.
char stores ASCII values (with code pages for non-ASCII values)
wchar_t stores UTF-16; note this means that some Unicode characters will use 2 wchar_t's
If you call a system function through the generic-text mappings in <tchar.h> (e.g. _putts), the header will actually pick either puts or _putws depending on how you've set things up (i.e. whether _UNICODE is defined).
So on Windows there is no direct support for UTF-8, which means that if you use char to store UTF-8 encoded strings you have to convert them to UTF-16 and call the corresponding UTF-16 system functions.
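A Windows-only sketch of that conversion step, using MultiByteToWideChar (error handling trimmed for brevity):

#include <windows.h>
#include <cstdio>
#include <string>

// Convert a UTF-8 std::string to UTF-16 so it can be passed to a "W" API function.
std::wstring utf8_to_wide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(),
                                  nullptr, 0);            // required length, no terminator
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], len);
    return wide;
}

int main() {
    std::wstring w = utf8_to_wide("caf\xC3\xA9");         // "café" in UTF-8
    std::wprintf(L"%u UTF-16 code units\n", (unsigned)w.size());   // 4
    return 0;
}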

Determine if a byte array contains an ANSI or Unicode string?

Say I have a function that receives a byte array:
void fcn(byte* data)
{
...
}
Does anyone know a reliable way for fcn() to determine if data is an ANSI string or a Unicode string?
Note that I'm intentionally NOT passing a length arg, all I receive is the pointer to the array. A length arg would be a great help, but I don't receive it, so I must do without.
This article mentions an OLE API that apparently does it, but of course they don't tell you WHICH api function: http://support.microsoft.com/kb/138142
First, a word on terminology. There is no such thing as an ANSI string; there are ASCII strings, where ASCII is a character encoding. ASCII was developed by ANSI, but they're not interchangeable.
Also, there is no such thing as a Unicode string. There are Unicode encodings, but those are only a part of Unicode itself.
I will assume that by "Unicode string" you mean "UTF-8 encoded codepoint sequence." And by ANSI string, I'll assume you mean ASCII.
If so, then every ASCII string is also a UTF-8 string, by the definition of UTF-8's encoding. ASCII only defines characters up to 0x7F, and all UTF-8 code units (bytes) up to 0x7F mean the same thing as they do under ASCII.
Therefore, your concern would be for the other 128 possible values. That is... complicated.
The only reason you would ask this question is if you have no control over the encoding of the string input. And therefore, the problem is that ASCII and UTF-8 are not the only possible choices.
There's Latin-1, for example. There are many strings out there that are encoded in Latin-1, which takes the other 128 bytes that ASCII doesn't use and defines characters for them. That's bad, because those other 128 bytes will conflict with UTF-8's encoding.
There are also code pages. Many strings were encoded against a particular code page; this is particularly so on Windows. Decoding them requires knowing what codepage you're working on.
If you are in a situation where you are certain that a string is either ASCII (7-bit, with the high bit always 0) or UTF-8, then you can make the determination easily. Either the string is ASCII (and therefore also UTF-8), or one or more of the bytes will have the high bit set to 1. In which case, you must use UTF-8 decoding logic.
Unless you are truly certain that these are the only possibilities, you are going to need to do a bit more. You can validate the data by trying to run it through a UTF-8 decoder. If it runs into an invalid code unit sequence, then you know it isn't UTF-8. The problem is that it is theoretically possible to create a Latin-1 string that is technically valid UTF-8. You're kinda screwed at that point. The same goes for code page-based strings.
Ultimately, if you don't know what encoding the string is, there's no guarantee you can display it properly. That's why it's important to know where your strings come from and what they mean.
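A sketch of the easy half of that check: if every byte has the high bit clear, the data is plain ASCII (and therefore also valid UTF-8); anything else needs real UTF-8 validation.

#include <cstdio>

bool is_ascii(const char* s) {
    for (; *s; ++s)
        if (static_cast<unsigned char>(*s) & 0x80)   // high bit set: not 7-bit ASCII
            return false;
    return true;
}

int main() {
    std::printf("%d %d\n",
                is_ascii("hello"),          // 1: pure ASCII, also valid UTF-8
                is_ascii("caf\xC3\xA9"));   // 0: contains bytes with the high bit set
    return 0;
}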

Internal and external encoding vs. Unicode

Since there was a lot of misinformation spread by several posters in the comments on this question: C++ ABI issues list
I have created this one to clarify.
What are the encodings used for C style strings?
Is Linux using UTF-8 to encode strings?
How does external encoding relate to the encoding used by narrow and wide strings?
Implementation defined. Or even application defined; the standard doesn't really put any restrictions on what an application does with them, and expects a lot of the behavior to depend on the locale. All that is really implementation defined is the encoding used in string literals.
In what sense? Most of the OS ignores most of the encodings; you'll have problems if '\0' isn't a nul byte, but even EBCDIC meets that requirement. Otherwise, depending on the context, there will be a few additional characters which may be significant (a '/' in path names, for example); all of these use the first 128 encodings in Unicode, so will have a single-byte encoding in UTF-8. As an example, I've used both UTF-8 and ISO 8859-1 for filenames under Linux. The only real issue is displaying them: if you do ls in an xterm, for example, ls and the xterm will assume that the filenames are in the same encoding as the display font.
That mainly depends on the locale. Depending on the locale, it's quite possible for the internal encoding of a narrow character string not to correspond to that used for string literals. (But how could it be otherwise, since the encoding of a string literal must be determined at compile time, whereas the internal encoding for narrow character strings depends on the locale used to read it, and can vary from one string to the next.)
If you're developing a new application in Linux, I would strongly recommend using Unicode for everything, with UTF-32 for wide character strings and UTF-8 for narrow character strings. But don't count on anything outside the first 128 code points working in string literals.
This depends on the architecture. Most Unix architectures use UTF-32 for wide strings (wchar_t) and ASCII for char. Note that ASCII is just a 7-bit encoding. Windows used UCS-2 until Windows 2000; later versions use the variable-width encoding UTF-16 (for wchar_t).
No. Most system calls on Linux are encoding agnostic (they don't care what the encoding is, since they are not interpreting it in any way). External encoding is actually defined by your current locale.
The internal encoding used by narrow and wide strings is fixed; it does not change with the locale. By changing the locale you are changing the translation functions that encode and decode the data entering and leaving your program (assuming you stick with the standard C text functions).

How to UTF-8 encode a character/string

I am using a Twitter API library to post a status to Twitter. Twitter requires that the post be UTF-8 encoded. The library contains a function that URL-encodes a standard string, which works perfectly for all special characters such as !@#$%^&*(), but is the incorrect encoding for accented characters (and other UTF-8).
For example, 'é' gets converted to '%E9' rather than '%C3%A9' (it pretty much only converts to a hexadecimal value). Is there a built-in function that could take something like 'é' and return something like '%C3%A9'?
edit: I am fairly new to UTF-8 in case what I am requesting makes no sense.
edit: if I have a
string foo = "bar é";
I would like to convert it to
"bar %C3%A9"
Thanks
If you have a wide character string, you can encode it in UTF-8 with the standard wcstombs() function. If you have it in some other encoding (e.g. Latin-1), you will have to decode it to a wide string first.
Edit: ... but wcstombs() depends on your locale settings, and it looks like you can't select a UTF-8 locale on Windows. (You don't say what OS you're using.) WideCharToMultiByte() might be more useful on Windows, as you can specify the encoding in the call.
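A sketch of the wcstombs route (assuming the environment provides a UTF-8 locale, which is exactly why it is of limited use on Windows):

#include <clocale>
#include <cstdio>
#include <cstdlib>

int main() {
    std::setlocale(LC_ALL, "");                  // wcstombs converts according to the current C locale

    const wchar_t* wide = L"\u00E9";             // 'é' as a wide string
    char buf[8] = {0};
    std::size_t n = std::wcstombs(buf, wide, sizeof(buf));
    if (n == (std::size_t)-1) {
        std::puts("conversion failed (locale cannot represent the character)");
    } else {
        for (std::size_t i = 0; i < n; ++i)
            std::printf("%%%02X", (unsigned char)buf[i]);   // %C3%A9 in a UTF-8 locale
        std::printf("\n");
    }
    return 0;
}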
To understand what needs to be done, you have to first understand a bit of background. Different encodings use different values for the "same" character. Latin-1, for example, says "é" is a single byte with value E9 (hex), while UTF-8 says "é" is the two byte sequence C3 A9, and yet UTF-16 says that same character is the single double-byte value 00E9 – a single 16-bit value rather than two 8-bit values as in UTF-8. (Unicode, which isn't an encoding, actually uses the same codepoint value, U+E9, as Latin-1.)
To convert from one encoding to another, you must first take the encoded value, decode it to a value independent of the source encoding (i.e. Unicode codepoint), then re-encode it in the target encoding. If the target encoding doesn't support all of the source encoding's codepoints, then you'll either need to translate or otherwise handle this condition.
This re-encoding step requires knowing both the source and target encodings.
Your API function is not converting encodings; it appears to be URL-escaping an arbitrary byte string. The authors of the function apparently assume you will have already converted to UTF-8.
In order to convert to UTF-8, you must know what encoding your system is using and be able to map to Unicode codepoints. From there, the UTF-8 encoding is trivial.
Depending on your system, this may be as easy as converting the "native" character set (which has "é" as E9 for you, so probably Windows-1252, Latin-1, or something very similar) to wide characters (which is probably UTF-16 or UCS-2 if sizeof(wchar_t) is 2, or UTF-32 if sizeof(wchar_t) is 4) and then to UTF-8. Wcstombs, as Martin answers, may be able to handle the second part of this conversion, but this is system-dependent. However, I believe Latin-1 is a subset of Unicode, so conversion from this source encoding can skip the wide character step. Windows-1252 is close to Latin-1, but replaces some control characters with printable characters.
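Once the data is UTF-8, the percent-encoding itself is just a per-byte transformation. A sketch (the unreserved set here follows RFC 3986, so a space becomes %20 rather than staying literal):

#include <cstdio>
#include <string>

// Percent-encode every byte outside the unreserved ASCII set, assuming the
// input string already holds UTF-8 bytes.
std::string url_encode_utf8(const std::string& in) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : in) {
        bool unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                          (c >= '0' && c <= '9') ||
                          c == '-' || c == '_' || c == '.' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

int main() {
    std::string foo = "bar \xC3\xA9";             // "bar é" with the é already in UTF-8
    std::puts(url_encode_utf8(foo).c_str());      // prints: bar%20%C3%A9
    return 0;
}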

UTF usage in C++ code

What is the difference between UTF and UCS?
What are the best ways to represent not European character sets (using UTF) in C++ strings. I would like to know your recommendations for:
Internal representation inside the code
For string manipulation at run-time
For using the string for display purposes.
Best storage representation (i.e. In file)
Best on-wire transport format (transfer between applications that may be on different architectures and have different standard locales)
What is the difference between UTF and UCS?
UCS encodings are fixed width, and are marked by how many bytes are used for each character. For example, UCS-2 requires 2 bytes per character. Characters with code points outside the available range can't be encoded in a UCS encoding.
UTF encodings are variable width, and marked by the minimum number of bits to store a character. For example, UTF-16 requires at least 16 bits (2 bytes) per character. Characters with large code points are encoded using a larger number of bytes -- 4 bytes for astral characters in UTF-16.
Internal representation inside the code
Best storage representation (i.e. In file)
Best on-wire transport format (transfer between applications that may be on different architectures and have different standard locales)
For modern systems, the most reasonable storage and transport encoding is UTF-8. There are special cases where others might be appropriate -- UTF-7 for old mail servers, UTF-16 for poorly-written text editors -- but UTF-8 is most common.
Preferred internal representation will depend on your platform. In Windows, it is UTF-16. In UNIX, it is UCS-4. Each has its good points:
UTF-16 strings never use more memory than a UCS-4 string. If you store many large strings with characters primarily in the basic multi-lingual plane (BMP), UTF-16 will require much less space than UCS-4. Outside the BMP, it will use the same amount.
UCS-4 is easier to reason about. Because UTF-16 characters might be split over multiple "surrogate pairs", it can be challenging to correctly split or render a string. UCS-4 text does not have this issue. UCS-4 also acts much like ASCII text in "char" arrays, so existing text algorithms can be ported easily.
Finally, some systems use UTF-8 as an internal format. This is good if you need to inter-operate with existing ASCII- or ISO-8859-based systems because NULL bytes are not present in the middle of UTF-8 text -- they are in UTF-16 or UCS-4.
Have you read Joel Spolsky's article on The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)?
I would suggest:
For representation in code, wchar_t or equivalent.
For storage representation, UTF-8.
For wire representation, UTF-8.
The advantage of UTF-8 in storage and wire situations is that machine endianness is not a factor. The advantage of using a fixed size character such as wchar_t in code is that you can easily find out the length of a string without having to scan it.
UCS is the Universal Coded Character Set defined by ISO/IEC 10646 (not to be confused with UTC, which is Coordinated Universal Time and not a character set at all).
For internal representation, you may want to use wchar_t for each code unit and std::wstring for strings. On Windows wchar_t is 2 bytes (so wstring is effectively UTF-16); on most Unix systems it is 4 bytes (UTF-32). With a fixed-width representation such as UTF-32, seeking and random access are fast.
For storage, if most of the data is not ASCII (i.e. codes >= 128), you may want to use UTF-16, which on Windows is almost the same as a serialized wstring of wchar_t.
Since UTF-16 can be little-endian or big-endian, for wire transport try to convert it to UTF-8, which is byte-order independent.
For the internal representation inside the code, it is best to write both European and non-European characters as universal character names:
\uNNNN
Characters in the range \u0020 to \u007E, plus a little bit of whitespace (e.g. end of line), can be written as ordinary characters. For anything at or above \u0080, if you write it as an ordinary character it will compile correctly only under your own code page (e.g. OK in France but breaking in Russia, OK in Russia but breaking in Japan, OK in China but breaking in the US, etc.).
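A sketch of what that looks like in practice; the \u00E9 spelling keeps the source file pure ASCII, while the bytes that end up in the literal still depend on the compiler's execution character set:

#include <cstdio>

int main() {
    const char*    narrow = "caf\u00E9";    // 'é' written as a universal character name
    const wchar_t* wide   = L"caf\u00E9";
    std::printf("%s\n", narrow);            // how this renders depends on the console's encoding
    (void)wide;
    return 0;
}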