clang++ and u16string - c++

I'm having a hell of a time with this simple line of code and the latest clang++
#include <cstdlib> // EXIT_SUCCESS lives here, not in <stdio.h>
#include <string>

using std::u16string;

int main(int argc, char** argv)
{
    u16string s16 = u"鵝滿是快烙滴好耳痛";
    return EXIT_SUCCESS;
}
Ben-iMac:Desktop Ben$ clang++ -std=c++0x -stdlib=libc++ main.cpp -o main
main.cpp:15:21: error: use of undeclared identifier 'u'
u16string s16 = u"鵝滿是快烙滴好耳痛"

The latest released versions of clang (v2.9 from llvm.org and Apple's clang 3.0) do not support Unicode string literals. The latest version, built from top-of-trunk source, does support them.
The next llvm.org release of clang (i.e., 3.0) will support the Unicode string literal syntax, but does not support any source file encoding beyond ASCII. So even with that llvm.org release you won't be able to type those characters literally in your source and have them converted to a UTF-16 encoded string value; instead you'll have to use \u escapes. Again, top of trunk does support UTF-8 source code, but that support didn't land in time for the llvm.org 3.0 release that is currently under testing. The release after that (in six months or so) should have better support for UTF-8 source code (but not other source encodings).
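For example, the string from the question could be written with escapes like this (a sketch, not from the original thread; each \uXXXX is the code point of the corresponding character):

std::u16string s16 = u"\u9D5D\u6EFF\u662F\u5FEB\u70D9\u6EF4\u597D\u8033\u75DB"; // 鵝滿是快烙滴好耳痛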
Edit: The Xcode 4.3 version of clang does have these features.
Edit: And now the 3.1 release from LLVM.org has them.
So clang now fully supports the following:
#include <string>

int main() {
    std::u16string a = u"鵝"; // UTF-8 source is transformed into a UTF-16 literal
    std::u32string b = U"滿"; // UTF-8 source is transformed into a UTF-32 literal
}
It turns out the standard does not actually require much support for char16_t and char32_t in the iostreams library, so you'll probably have to convert to another string type to get much use out of this. At least the ability to convert between these and the more useful std::string is required (though not exactly convenient to set up...).
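For example, one way to set up that conversion with the C++11 facilities (note that std::wstring_convert and <codecvt> were themselves later deprecated in C++17):

#include <codecvt> // std::codecvt_utf8_utf16
#include <locale>  // std::wstring_convert
#include <string>

int main() {
    std::u16string s16 = u"鵝滿是快烙滴好耳痛";

    // UTF-16 -> UTF-8, so the data can live in a std::string
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::string s8 = conv.to_bytes(s16);

    // ...and back again: UTF-8 -> UTF-16
    std::u16string roundtrip = conv.from_bytes(s8);
}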

You can test clang for individual C++11 features, e.g.:
http://clang.llvm.org/docs/LanguageExtensions.html#cxx_unicode_literals
and here's a status page:
http://clang.llvm.org/cxx_status.html
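For instance, the documented __has_feature probe for this particular feature looks like this:

#ifndef __has_feature
#define __has_feature(x) 0 // fallback for compilers without __has_feature
#endif

#if __has_feature(cxx_unicode_literals)
// u"..." and U"..." string literals are available here
#endif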

Related

Same version of clang giving different results on different OS's

I have clang 15.0.7 installed with brew on macOS, and the same version installed with MSYS2 on Windows 10.
When I compile the following program:
#include <filesystem>
#include <string>

int main()
{
    std::filesystem::path p("/some/path");
    std::string s(p);
}
Using clang++ -std=c++20 test.cpp I get no compilation errors on macOS, but on Windows it gives:
test.cpp:6:15 error: no matching constructor for initialization of 'std::string' (aka 'basic_string<char>')
std::string s(p);
^ ~
[more errors]
What is going on?
std::filesystem::path::value_type is wchar_t on Windows and char elsewhere.
Hence, std::filesystem::path has a conversion operator to std::wstring on Windows and to std::string elsewhere (who even thought this was a good idea?!).
Call .string() to get std::string in a "portable" manner.
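A minimal sketch of the portable version (the question's program with the explicit conversion):

#include <filesystem>
#include <string>

int main()
{
    std::filesystem::path p("/some/path");
    std::string s = p.string(); // explicit conversion; compiles on both platforms
}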
On Windows, make sure to test UTF-8 support on all standard library flavors you're interested in (MSVC STL, GCC's libstdc++, Clang's libc++). I remember that at least on MSVC you had to enable UTF-8 support with a locale, or use std::u8string.
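And a sketch of the std::u8string route (C++20, where path::u8string() returns UTF-8 on every platform):

#include <filesystem>
#include <string>

int main()
{
    std::filesystem::path p("/some/path");
    std::u8string u8 = p.u8string(); // always UTF-8, regardless of platform
}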

Cyrillic characters are not saved in char16_t under msvc with cmake build

I'm facing the following problem: when I try to store a Cyrillic character in a char16_t variable, I get error C2015 (too many characters in constant). The interesting part is that it happens only when I build with MSVC via CMake. If I build directly in Visual Studio, or with CMake using another compiler, it's fine. Moreover, in the same configuration that fails with MSVC, storing Cyrillic characters in a wchar_t works. I'm using the Visual Studio 16 generator.
Example of broken code:
char16_t val = u'б';
EDIT:
The trouble is with the UTF literals. If I assign an integer value instead:
char16_t val = some_int_val;
it works correctly. The IDE even shows the correct integer value of the literal, but everything breaks at compile time.

C++ Use of wstring_convert on Linux

I would like to be able to convert text read from a file into multibyte characters.
I have the following C++ code on Windows that is working for me.
When I try to compile the code on Linux though it is failing.
#include <codecvt> // std::codecvt_utf8_utf16 lives here, not in <locale>
#include <locale>  // std::wstring_convert
....
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> utfconv;
std::string line;
while (std::getline(infile, line))
{
    std::wstring widestr = utfconv.from_bytes(line);
This throws the following errors:
wstring_convert is not a member of std
codecvt_utf8_utf16 is not a member of std
expected primary-expression before wchar_t
I'm using GCC Red Hat 4.4.7-4.
Based on what I read, I've included <locale>, but it still can't find it.
If wstring_convert is not available, is there something equivalent that I can do?
Most likely your standard is not set properly. std::wstring_convert was first introduced in C++11 and deprecated in C++17, so you need to add the compiler flag -std=c++11 or -std=c++14.
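For example, assuming the file is named main.cpp (a hypothetical name):

$ g++ -std=c++11 main.cpp -o main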
EDIT:
Just saw the GCC version you're using. It's way out of date. Everything should work fine if you download GCC 4.9.x or above.
You will have to use a newer GCC version. Precompiled versions are part of Developer Toolset (available as part of most Red Hat Enterprise Linux subscriptions) or as part of Software Collections for CentOS:
devtoolset-7

POSIX locale for different unicode scripts to use with Boost

I am using Boost's ofstream for Unicode std::string output. I am stuck on choosing the right POSIX locale (via Boost) and encoding to support all languages (across all Unicode versions up to 6.3). The code is targeted to compile on both VS2010 and GCC 4.8.
#include <boost/filesystem/fstream.hpp>
#include <boost/locale.hpp>
#include <string>

namespace loc = boost::locale; // alias implied by the original snippet

loc::generator gen;
std::locale _loc = gen.generate("en_US.utf-8"); // use the right POSIX locale/encoding
                                                // to support different versions of Unicode
                                                // and different compilers
std::string str = "my unicode string";
boost::filesystem::ofstream _file("my file.txt");
_file.imbue(_loc);
_file << str;
I am trying to understand the different Unicode versions, encodings, locale support from different compilers.

Non-ASCII wchar_t literals under LLVM

I've migrated an Xcode iOS project from Xcode 3.2.6 to 4.2. Now I'm getting warnings when I try to initialize a wchar_t with a literal with a non-ASCII character:
wchar_t c1;
if (c1 <= L'я') // that's Cyrillic "ya"
The messages are:
MyFile.cpp:148:28: warning: character unicode escape sequence too long for its type [2]
MyFile.cpp:148:28: warning: extraneous characters in wide character constant ignored [2]
And the literal does not work as expected - the comparison misfires.
I'm compiling with -fshort-wchar, the source file is in UTF-8. The Xcode editor displays the file fine. It compiled and worked on GCC (several flavors, including Xcode 3), worked on MSVC. Is there a way to make LLVM compiler recognize those literals? If not, can I go back to GCC in Xcode 4?
EDIT: Xcode 4.2 on Snow Leopard - long story why.
EDIT2: confirmed on a brand new project. File extension does not matter - same behavior in .m files. -fshort-wchar does not affect it either. Looks like I've gotta go back to GCC until I can upgrade to a version of Xcode where this is fixed.
Not an answer, but hopefully helpful information — I could not reproduce the problem with clang 4.0 (Xcode 4.5.1):
$ uname -a
Darwin air 12.2.0 Darwin Kernel Version 12.2.0: Sat Aug 25 00:48:52 PDT 2012; root:xnu-2050.18.24~1/RELEASE_X86_64 x86_64
$ env | grep LANG
LANG=en_US.UTF-8
$ clang -v
Apple clang version 4.0 (tags/Apple/clang-421.0.60) (based on LLVM 3.1svn)
Target: x86_64-apple-darwin12.2.0
Thread model: posix
$ cat test.c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    wchar_t c1 = 0;
    printf("sizeof(c1) == %lu\n", sizeof(c1));
    printf("sizeof(L'Я') == %lu\n", sizeof(L'Я'));
    if (c1 < L'Я') {
        printf("Я люблю часы Заря!\n");
    } else {
        printf("Что за....?\n");
    }
    return EXIT_SUCCESS;
}
$ clang -Wall -pedantic ./test.c
$ ./a.out
sizeof(c1) == 4
sizeof(L'Я') == 4
Я люблю часы Заря!
$ clang -Wall -pedantic ./test.c -fshort-wchar
$ ./a.out
sizeof(c1) == 2
sizeof(L'Я') == 2
Я люблю часы Заря!
$
The same behavior is observed with clang++ (where wchar_t is built-in type).
If in fact the source is UTF-8, then this isn't correct behavior. However, I can't reproduce the behavior in the most recent version of Xcode.
MyFile.cpp:148:28: warning: character unicode escape sequence too long for its type [2]
This warning should be referring to a 'Universal Character Name' (UCN), which looks like "\U001012AB" or "\u0403". It indicates that the value represented by the escape sequence is larger than the enclosing literal's type is capable of holding. For example, if the code point value requires more than 16 bits, then a 16-bit wchar_t will not be able to hold the value.
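For instance, assuming a 16-bit wchar_t (e.g. with -fshort-wchar), a code point above U+FFFF cannot fit:

wchar_t w = L'\U0001F600'; // U+1F600 needs more than 16 bits, triggering the warning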
MyFile.cpp:148:28: warning: extraneous characters in wide character constant ignored [2]
This indicates that the compiler thinks there's more than one code point represented inside a wide character literal, e.g. L'ab'. The behavior is implementation-defined, and both clang and gcc simply use the last code point value.
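A minimal example of a literal that triggers it:

wchar_t c = L'ab'; // two code points in one wide character constant
// implementation-defined: clang and gcc keep the last one, so c == L'b'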
The code you show shouldn't trigger either of these, at least in clang: the first because it applies only to UCNs, not to mention that 'я' fits easily within a single 16-bit wchar_t; and the second because the source code encoding is always taken to be UTF-8, so the compiler will see the UTF-8 multibyte representation of 'я' as a single code point.
You might recheck and ensure that the source actually is UTF-8. Then you should check that you're using an up-to-date version of Xcode. You can also try switching the compiler in your project settings ('Compiler for C/C++/Objective-C').
I don't have an answer to your specific question, but I wanted to point out that llvm-gcc has been permanently discontinued. In my experience dealing with deltas between Clang, llvm-gcc, and gcc, Clang is often correct with regard to the C++ specification, even when its behavior is surprising.