I have been having trouble working with 3-byte UTF-8 characters in arrays. When they are in char arrays I get "multi-character character constant" and implicit constant conversion warnings, but when I use wchar_t arrays, wcout prints nothing at all. Because of the nature of the project, it must be an array and not a string. Below is an example of what I've been trying to do.
#include <iostream>
#include <string>
using namespace std;
int main()
{
    wchar_t testing[40];
    testing[0] = L'\u0B95';
    testing[1] = L'\u0BA3';
    testing[2] = L'\u0B82';
    testing[3] = L'\0';
    wcout << testing[0] << endl;
    return 0;
}
Any suggestions? I'm working on OS X.
Since '\u0B95' requires 3 bytes in UTF-8, it is considered a multicharacter literal. A multicharacter literal has type int and an implementation-defined value. (Actually, I don't think GCC is correct to treat it this way.)
Putting the L prefix before the literal gives it type wchar_t and an implementation-defined value: it maps to a value in the execution wide-character set, which is an implementation-defined superset of the basic execution wide-character set.
The C++11 standard provides us with some more Unicode-aware types and literals. The additional types are char16_t and char32_t, whose values are the Unicode code points that represent the character; they correspond to UTF-16 and UTF-32 respectively.
Since the characters you need are all in the Basic Multilingual Plane, char16_t literals can represent them directly. Such a literal is written with a u prefix, for example u'\u0B95'. You can therefore write your code as follows, with no warnings or errors:
char16_t testing[40];
testing[0] = u'\u0B95';
testing[1] = u'\u0BA3';
testing[2] = u'\u0B82';
testing[3] = u'\0';
Unfortunately, the I/O library does not play nicely with these new types.
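For instance, std::cout has no overload that prints a char16_t as text (before C++20 it simply prints the promoted integer value). One workaround is to print the numeric code points explicitly; a minimal sketch, the hex formatting being just illustrative:

#include <iostream>

int main()
{
    char16_t testing[] = { u'\u0B95', u'\u0BA3', u'\u0B82', u'\0' };
    // Print each UTF-16 code unit as a hexadecimal number rather than as text.
    for (int i = 0; testing[i] != u'\0'; ++i)
        std::cout << std::hex << static_cast<unsigned>(testing[i]) << '\n';
    return 0;
}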
If you do not truly require using character literals as above, you may make use of the new UTF-8 string literals:
const char* testing = u8"\u0B95\u0BA3\u0B82";
This will encode the characters as UTF-8.
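On a terminal that expects UTF-8 (the default on OS X), these bytes can be written straight to std::cout. A minimal sketch, assuming a pre-C++20 compiler where u8 string literals have type const char[] (in C++20 they become const char8_t[]):

#include <iostream>

int main()
{
    const char* testing = u8"\u0B95\u0BA3\u0B82";
    std::cout << testing << std::endl; // the terminal decodes the UTF-8 bytes
    return 0;
}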
Related
I'm trying to assign the Chinese character 牛 as a char value in C++. On Xcode, I get the error:
"Character too large for enclosing character literal type."
When I use an online IDE like JDoodle or Browxy, I get the error:
"multi-character character constant."
It doesn't matter whether I use char, char16_t, char32_t or wchar_t, it won't work. I thought any Chinese character could at least fit into wchar_t, but this appears not to be the case. What can I do differently?
char letter = '牛';
char16_t character = '牛';
char32_t hanzi = '牛';
wchar_t word = '牛';
All of your character literals are plain char literals, regardless of the type you assign them to. To get a wider type, you need to include the proper prefix on the literal:
// char letter = '牛'; // no prefix helps here: the character cannot fit in a single char
char16_t character = u'牛';
char32_t hanzi = U'牛';
wchar_t word = L'牛';
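To convince yourself that the prefixed literals hold the expected code point (牛 is U+725B), you can print the numeric value; a small sketch, assuming a C++11 compiler:

#include <iostream>

int main()
{
    char32_t hanzi = U'牛';
    // 0x725B is the Unicode code point for 牛.
    std::cout << std::hex << static_cast<unsigned long>(hanzi) << '\n'; // prints 725b
    return 0;
}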
How can I declare a multibyte character array in which each character is represented by 3 or 4 bytes?
I know I can do char var[] = "AA";, which writes the bytes 61 61 (hex) to memory, and I can do wchar_t var[] = L"AA";, which writes 0061 0061. How can I declare a wider character array in C or C++?
Is there any other prefix like the L to instruct the compiler to do so?
Both C and C++ offer char32_t. In C, char32_t is a typedef of (the same type as) uint_least32_t. In C++, char32_t has the same size, signedness, and alignment as std::uint_least32_t, but it is a distinct type.
Both of them can be used like
char32_t string[] = U"some text";
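Each element of such an array is at least 32 bits wide, so every code point occupies exactly one element; a quick check, as a sketch:

#include <iostream>

int main()
{
    char32_t string[] = U"some text";
    // sizeof reports bytes: 9 characters plus the terminator, 4 bytes each on typical platforms.
    std::cout << sizeof(string) / sizeof(string[0]) - 1 << " characters, "
              << sizeof(string) << " bytes\n";
    return 0;
}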
You could try this, as long as you don't mind manually typing out each character:
int characters[3] = { 'h', 'e', 'y' };
You can also use a capital U in front of the string literal to get UTF-32:
char32_t characters[] = U"hey";
Your best bet when dealing with multi-byte character arrays is to use UTF-8 encoding. That way all of the standard string library functions continue to work, and ASCII representations remain the same.
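As a sketch of that point: a UTF-8 string is just a char array, so functions like strlen keep working, though they count bytes rather than characters. This assumes a pre-C++20 compiler where u8 literals are arrays of char:

#include <cstring>
#include <iostream>

int main()
{
    // Three Tamil characters, each encoded as 3 UTF-8 bytes.
    const char utf8[] = u8"\u0B95\u0BA3\u0B82";
    std::cout << std::strlen(utf8) << " bytes\n"; // prints 9, not 3
    return 0;
}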
I have an array of wchar_t. I need to add a unicode null character at a specific position in the array.
wchar_t var1[100];
var1[79] = '\u0000';
I tried the above but get the following compilation error.
error C3850: '\u0000': a universal-character-name specifies an invalid character
How do I add a unicode null character?
I think you can use
var1[79] = L'\0';
The language doesn't allow you to use universal character names for characters that you can easily write without using a UCN. That's why '\u0000' isn't permitted. (I'm not quite sure what the rationale for that rule is.)
Since var1 is an array of wchar_t, L'\0' is the most straightforward thing to use.
But since char, wchar_t, and int are all integral types, and since values of any integral type can be assigned to an object of another integral type (as long as the value is in range of the target type), any of the following will work:
var1[79] = L'\0'; // best
var1[79] = '\0'; // char value converted to wchar_t
var1[79] = 0; // int value converted to wchar_t
I have programmed some simple Windows applications using the raw Win32 API, and my best guess is to use L'\0'.
Integer zero will also do:
var1[79] = 0;
I'm porting part of a C# application to Linux, where I need to get the bytes of a UTF-16 string:
string myString = "ABC";
byte[] bytes = Encoding.Unicode.GetBytes(myString);
So that the bytes array is now:
"65 00 66 00 67 00" (bytes)
How can I achieve the same in C++ on Linux? I have myString defined as a std::string, and it seems that wchar_t (the character type behind std::wstring) is 4 bytes on Linux?
Your question isn't really clear, but I'll try to clear up some confusion.
Introduction
Here is the status of character-set handling in C (inherited by C++) after the 1995 amendment to the C standard:
the character set used is given by the current locale
wchar_t is meant to store one code point
char is meant to store a multibyte encoded form (one constraint, for instance, is that characters in the basic character set must each be encoded in one byte)
string literals are encoded in an implementation-defined manner; if they use characters outside of the basic character set, you can't assume they are valid in all locales
Thus with a 16-bit wchar_t you are restricted to the BMP. Using the surrogates of UTF-16 is not compliant, but I think MS and IBM were more or less forced into it because they believed Unicode when it said it would forever be a 16-bit character set. Vendors who delayed their Unicode support tend to use a 32-bit wchar_t.
Newer standards don't change much. Mostly, there are now literals for UTF-8, UTF-16 and UTF-32 encoded strings, and there are 16-bit and 32-bit character types. There is little or no additional support for Unicode in the standard libraries.
How to do the transformation of one encoding to the other
You have to be in a locale which uses Unicode. Hopefully
std::locale::global(std::locale(""));
will be enough for that. If not, your environment is not properly set up (or it is set up for another charset, in which case assuming Unicode won't be a service to your user).
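As a sketch of that setup (assuming the environment's locale is a UTF-8 one, as is typical on Linux):

#include <iostream>
#include <locale>

int main()
{
    std::locale::global(std::locale(""));   // adopt the environment's locale
    std::wcout.imbue(std::locale());        // make wcout use the new global locale
    std::wcout << L"\u0B95\u0BA3\u0B82" << std::endl;
    return 0;
}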
C Style
Use the wcstombs and mbstowcs functions. Here is an example for what you asked:
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

std::string narrow(std::wstring const& s)
{
    // Worst case: up to 4 narrow bytes per wide character, plus the terminator.
    std::vector<char> result(4 * s.size() + 1);
    std::size_t used = std::wcstombs(&result[0], s.c_str(), result.size());
    assert(used < result.size());
    return result.data();
}
C++ Style
The codecvt facet of the locale provides the needed functionality. The advantage is that you don't have to change the global locale to use it. The inconvenience is that the usage is more complex.
#include <locale>
#include <iostream>
#include <string>
#include <vector>
#include <cwchar>
#include <cassert>

std::string narrow(std::wstring const& s,
                   std::locale loc = std::locale())
{
    std::vector<char> result(4 * s.size() + 1);
    wchar_t const* fromNext;
    char* toNext;
    std::mbstate_t state = std::mbstate_t();
    std::codecvt_base::result convResult
        = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t> >(loc)
              .out(state, &s[0], &s[s.size()], fromNext,
                   &result[0], &result[result.size()], toNext);
    assert(fromNext == &s[s.size()]);
    assert(toNext != &result[result.size()]);
    assert(convResult == std::codecvt_base::ok);
    *toNext = '\0';
    return &result[0];
}

std::wstring widen(std::string const& s,
                   std::locale loc = std::locale())
{
    std::vector<wchar_t> result(s.size() + 1);
    char const* fromNext;
    wchar_t* toNext;
    std::mbstate_t state = std::mbstate_t();
    std::codecvt_base::result convResult
        = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t> >(loc)
              .in(state, &s[0], &s[s.size()], fromNext,
                  &result[0], &result[result.size()], toNext);
    assert(fromNext == &s[s.size()]);
    assert(toNext != &result[result.size()]);
    assert(convResult == std::codecvt_base::ok);
    *toNext = L'\0';
    return &result[0];
}
You should replace the assertions with better error handling.
By the way, this is standard C++ and doesn't assume Unicode, except for the computation of the size of result (you can do better by checking convResult, which can indicate a partial conversion).
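A minimal sketch of how these helpers might be called (assuming the environment provides a UTF-8 locale):

int main()
{
    std::locale::global(std::locale(""));   // pick up the environment's locale
    std::wstring wide = L"ABC";
    std::string bytes = narrow(wide);       // multibyte-encoded form
    assert(widen(bytes) == wide);           // and back again
    std::cout << bytes.size() << " bytes\n";
    return 0;
}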
The easiest way is to grab a small library, such as UTF8-CPP, and do something like:
utf8::utf8to16(line.begin(), line.end(), back_inserter(utf16line));
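A slightly fuller sketch, assuming the single-header UTF8-CPP library is available as utf8.h and that line holds valid UTF-8 (the helper name is just illustrative):

#include <iterator>
#include <string>
#include <vector>
#include "utf8.h" // UTF8-CPP; header name/path depends on how you vendored it

std::vector<char16_t> to_utf16(const std::string& line)
{
    std::vector<char16_t> utf16line;
    // utf8to16 reads UTF-8 code units and writes UTF-16 code units to the output iterator.
    utf8::utf8to16(line.begin(), line.end(), std::back_inserter(utf16line));
    return utf16line;
}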
I usually use the UnicodeConverter class from the Poco C++ libraries. If you don't want the dependency then you can have a look at the code.
Recently, I have been working on some tasks involving chars and strings on the Windows platform. I see that there are different character types like char, TCHAR, WCHAR, LPSTR, LPWSTR, LPCTSTR. Can someone give me some information about them, and about how to use them like the regular char and char *? I am confused by these types.
Best Regards,
They are documented on MSDN. Here are a few:
TCHAR: A WCHAR if UNICODE is defined, a CHAR otherwise.
WCHAR: A 16-bit Unicode character.
CHAR: An 8-bit Windows (ANSI) character.
LPTSTR: An LPWSTR if UNICODE is defined, an LPSTR otherwise.
LPSTR: A pointer to a null-terminated string of 8-bit Windows (ANSI) characters.
LPWSTR: A pointer to a null-terminated string of 16-bit Unicode characters.
LPCTSTR: An LPCWSTR if UNICODE is defined, an LPCSTR otherwise.
LPCWSTR: A pointer to a constant null-terminated string of 16-bit Unicode characters.
LPCSTR: A pointer to a constant null-terminated string of 8-bit Windows (ANSI) characters.
Note that some of these types map to something different depending on whether UNICODE has been #define'd. By default, they resolve to the ANSI versions:
#include <windows.h>
// LPCTSTR resolves to LPCSTR
When you #define UNICODE before #include <windows.h>, they resolve to the Unicode versions.
#define UNICODE
#include <windows.h>
// LPCTSTR resolves to LPCWSTR
They are in reality typedefs to some fundamental types in the C and C++ language. For example:
typedef char CHAR;
typedef wchar_t WCHAR;
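The T-generic names are then selected by the preprocessor; a simplified sketch of the idea, building on the typedefs above (the real <winnt.h> and <tchar.h> definitions differ in detail):

#ifdef UNICODE
typedef WCHAR TCHAR;     // wide build
typedef LPWSTR LPTSTR;
typedef LPCWSTR LPCTSTR;
#else
typedef CHAR TCHAR;      // ANSI build
typedef LPSTR LPTSTR;
typedef LPCSTR LPCTSTR;
#endif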
On compilers like Visual C++, there's really no difference between an LPCSTR and a const char*, or between an LPCWSTR and a const wchar_t*. This might differ between compilers, however, which is why these data types exist in the first place!
It's sort of like the Windows API equivalent of <cstdint> or <stdint.h>. The Windows API has bindings in other languages, and having data types with a known size is useful, if not required.
char is the standard 8-bit character type.
wchar_t on Windows is a 16-bit character type holding UTF-16 code units (originally UCS-2), used since about Windows 95. WCHAR is another name for it.
TCHAR can be either one, depending on your compiler settings. Most of the time in a modern program it's wchar_t.
The P and LP prefixes are pointers to the different types. The L is legacy (stands for Long pointer), and became obsolete with Windows 95; you still see it quite a bit though.
The C after the prefix stands for const.
TCHAR, LPTSTR and LPCTSTR are all generalized macros that resolve to either regular character strings or wide character strings depending on whether or not the UNICODE define is set. CHAR, LPSTR and LPCSTR are regular character strings. WCHAR, LPWSTR and LPCWSTR are wide character strings. TCHAR, CHAR and WCHAR represent a single character. LPTSTR, LPSTR and LPWSTR are "Long Pointer to STRing" types. LPCTSTR, LPCSTR and LPCWSTR are constant string pointers.
Let me try to shed some light (I've blogged this on my site at https://www.dima.to/blog/?p=190 in case you want to check it out):
#include "stdafx.h"
#include "Windows.h"
int _tmain(int argc, _TCHAR* argv[])
{
    /* Quick Tutorial on Strings in Microsoft Visual C++

       The Unicode Character Set and Multibyte Character Set options in MSVC++
       provide a project with two flavours of string encoding. They will use
       different encodings for characters in your project. Here are the two main
       character types in MSVC++ that you should be concerned about:

       1. char    <-- char characters use an 8-bit character encoding (8 bits = 1 byte) according to MSDN.
       2. wchar_t <-- wchar_t uses a 16-bit character encoding (16 bits = 2 bytes) according to MSDN.

       From above, we can see that the size of each character in our strings will
       change depending on our chosen character set.

       WARNING: Do NOT assume that any given character you append to either a
       Multibyte or Unicode string will always take up the single-byte or
       double-byte space defined by char or wchar_t! That is up to the discretion
       of the encoding used. Sometimes, characters need to be combined to define
       the character that the user wants in their string. In other words, a
       multibyte string is stored byte by byte, but a given byte will not always
       produce the character you desire at a particular location, because even
       multibyte characters may take up more than a single byte. MSDN says it may
       take up TWO character spaces to produce a single multibyte-encoded
       character: "A multibyte-character string may contain a mixture of
       single-byte and double-byte characters. A two-byte multibyte character has
       a lead byte and a trail byte."

       WARNING: Do NOT assume that Unicode contains every character for every
       language. For more information, please see
       http://stackoverflow.com/questions/5290182/how-many-bytes-takes-one-unicode-character.

       Note: The ASCII Character Set is a subset of both the Multibyte and
       Unicode Character Sets (in other words, both of these flavours encompass
       ASCII characters).

       Note: You should always use Unicode for new development, according to
       MSDN. For more information, please see
       http://msdn.microsoft.com/en-us/library/ey142t48.aspx.
    */

    // Strings that are Multibyte.
    LPSTR a;   // Regular Multibyte string (synonymous with char *).
    LPCSTR b;  // Constant Multibyte string (synonymous with const char *).

    // Strings that are Unicode.
    LPWSTR c;  // Regular Unicode string (synonymous with wchar_t *).
    LPCWSTR d; // Constant Unicode string (synonymous with const wchar_t *).

    // Strings that take on either Multibyte or Unicode depending on project settings.
    LPTSTR e;  // Multibyte or Unicode string (can be either char * or wchar_t *).
    LPCTSTR f; // Constant Multibyte or Unicode string (can be either const char * or const wchar_t *).

    /* From above, it is safe to assume that the pattern is as follows:

       LP:  Specifies a long pointer type (this is synonymous with prefixing the type with a *).
       W:   Specifies that the type is of the Unicode Character Set.
       C:   Specifies that the type is constant.
       T:   Specifies that the type has a variable encoding.
       STR: Specifies that the type is a string type.
    */

    // String format specifiers:
    e = _T("Example.");   // Formats a string as either Multibyte or Unicode depending on project settings.
    e = TEXT("Example."); // Formats a string as either Multibyte or Unicode depending on project settings (same as _T).
    c = L"Example.";      // Formats a string as Unicode.
    a = "Example.";       // Formats a string as Multibyte.

    return 0;
}