Non-ASCII wchar_t literals under LLVM - c++

I've migrated an Xcode iOS project from Xcode 3.2.6 to 4.2. Now I'm getting warnings when I compare a wchar_t against a literal that contains a non-ASCII character:
wchar_t c1;
if(c1 <= L'я') //That's Cyrillic "ya"
The messages are:
MyFile.cpp:148:28: warning: character unicode escape sequence too long for its type [2]
MyFile.cpp:148:28: warning: extraneous characters in wide character constant ignored [2]
And the literal does not work as expected - the comparison misfires.
I'm compiling with -fshort-wchar, the source file is in UTF-8. The Xcode editor displays the file fine. It compiled and worked on GCC (several flavors, including Xcode 3), worked on MSVC. Is there a way to make LLVM compiler recognize those literals? If not, can I go back to GCC in Xcode 4?
EDIT: Xcode 4.2 on Snow Leopard - long story why.
EDIT2: confirmed on a brand new project. File extension does not matter - same behavior in .m files. -fshort-wchar does not affect it either. Looks like I've gotta go back to GCC until I can upgrade to a version of Xcode where this is fixed.

Not an answer, but hopefully helpful information — I could not reproduce the problem with clang 4.0 (Xcode 4.5.1):
$ uname -a
Darwin air 12.2.0 Darwin Kernel Version 12.2.0: Sat Aug 25 00:48:52 PDT 2012; root:xnu-2050.18.24~1/RELEASE_X86_64 x86_64
$ env | grep LANG
$ clang -v
Apple clang version 4.0 (tags/Apple/clang-421.0.60) (based on LLVM 3.1svn)
Target: x86_64-apple-darwin12.2.0
Thread model: posix
$ cat test.c
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    wchar_t c1 = 0;
    printf("sizeof(c1) == %lu\n", sizeof(c1));
    printf("sizeof(L'Я') == %lu\n", sizeof(L'Я'));
    if (c1 < L'Я') {
        printf("Я люблю часы Заря!\n");
    } else {
        printf("Что за....?\n");
    }
    return 0;
}
$ clang -Wall -pedantic ./test.c
$ ./a.out
sizeof(c1) == 4
sizeof(L'Я') == 4
Я люблю часы Заря!
$ clang -Wall -pedantic ./test.c -fshort-wchar
$ ./a.out
sizeof(c1) == 2
sizeof(L'Я') == 2
Я люблю часы Заря!
The same behavior is observed with clang++ (where wchar_t is built-in type).

If the source is in fact UTF-8, then this isn't correct behavior. However, I can't reproduce the behavior in the most recent version of Xcode.
MyFile.cpp:148:28: warning: character unicode escape sequence too long for its type [2]
This warning should be referring to a 'Universal Character Name' (UCN), which looks like "\U001012AB" or "\u0403". It indicates that the value represented by the escape sequence is larger than the enclosing literal type is capable of holding. For example, if the codepoint value requires more than 16 bits, then a 16-bit wchar_t will not be able to hold the value.
MyFile.cpp:148:28: warning: extraneous characters in wide character constant ignored [2]
This indicates that the compiler thinks there's more than one codepoint represented inside a wide character literal. E.g. L'ab'. The behavior is implementation defined and both clang and gcc simply use the last codepoint value.
The code you show shouldn't trigger either of these, at least in clang. The first because it applies only to UCNs, not to mention that 'я' fits easily within a single 16-bit wchar_t; and the second because the source code encoding is always taken to be UTF-8, so the compiler will see the UTF-8 multibyte representation of 'я' as a single codepoint.
You might recheck and ensure that the source actually is UTF-8. Then you should check that you're using an up-to-date version of Xcode. You can also try switching the compiler in your project settings > Compiler for C/C++/Objective-C.
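One way to take the source encoding out of the picture is to spell the character as a UCN. The following is only a sketch: `kYa`, `fits_in_wchar_t` and `is_at_or_before_ya` are hypothetical names, and nothing in this thread confirms that it silences the warnings on the affected Xcode 4.2 compiler.

```cpp
#include <limits>

// Cyrillic small letter "ya" (U+044F), written as a universal character
// name so the literal does not depend on how the compiler decodes the file.
const wchar_t kYa = L'\u044F';

// True when the code point can be stored in a single wchar_t on this
// platform (the limit is smaller when wchar_t is 16 bits wide,
// for example under -fshort-wchar).
bool fits_in_wchar_t(unsigned long code_point) {
    return code_point <=
           static_cast<unsigned long>(std::numeric_limits<wchar_t>::max());
}

// Hypothetical helper mirroring the comparison from the question.
bool is_at_or_before_ya(wchar_t c) {
    return c <= kYa;
}
```

`fits_in_wchar_t(0x044F)` holds for both 16-bit and 32-bit wchar_t, which is why the "too long for its type" warning looks wrong for this character.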

I don't have an answer to your specific question, but wanted to point out that llvm-gcc has been permanently discontinued. In my experience dealing with deltas between Clang, llvm-gcc, and gcc, Clang is often correct with regard to the C++ specification, even if that behavior is surprising.


Do I still have to link -std=c++11? [duplicate]

I have a piece of code that looks like the following. Let's say it's in a file named example.cpp
#include <fstream>
#include <string> // line added after edit for clarity
int main() {
std::string filename = "input.txt";
std::ifstream in(filename);
return 0;
}
On Windows, if I run the command g++ example.cpp in cmd, it fails. There is a long list of errors, I think mostly due to the linker complaining about not being able to convert from string to const char*.
But if I run the compiler using an additional argument like so: g++ -std=c++17 example.cpp, it will compile and work fine with no problems.
What happens when I run the former command? I'm guessing a default version standard of the C++ compiler gets called, but I don't know which? And as a programmer/developer, should I always use the latter command with the extra argument?
If your version of g++ is later than 4.7 I think you can find the default version of C++ standard supported like so:
g++ -dM -E -x c++ /dev/null | grep -F __cplusplus
An example from my machine:
mburr@mint17 ~ $ g++ --version | head -1
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
mburr@mint17 ~ $ g++ -dM -E -x c++ /dev/null | grep -F __cplusplus
#define __cplusplus 199711L
Some references:
Details on the g++ options used
Why this only works for g++ 4.7 or later
g++ man page actually tells what is the default standard for C++ code.
Use following script to show the relevant part:
man g++ | col -b | grep -B 2 -e '-std=.* This is the default'
For example, in RHEL 6 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), the output:
GNU dialect of -std=c++98. This is the default for C++ code.
And in Fedora 28 g++ (GCC) 8.1.1 20180502 (Red Hat 8.1.1-1), the output:
GNU dialect of -std=c++14. This is the default for C++ code. The name gnu++1y is deprecated.
You can also check with gdb
$ g++ example.cpp -g Compile program with -g flag to generate debug info
$ gdb a.out Debug program with gdb
(gdb) b main Put a breakpoint at main
(gdb) run Run program (will pause at breakpoint)
(gdb) info source
Prints out something like:
Current source file is example.cpp
Compilation directory is /home/xxx/cpp
Located in /home/xxx/cpp/example.cpp
Contains 7 lines.
Source language is c++.
Producer is GNU C++14 6.3.0 20170516 -mtune=generic -march=x86-64 -g.
Compiled with DWARF 2 debugging format.
Does not include preprocessor macro info.
That line shows the standard used by the compiler: Producer is GNU C++14
If you recompile your program using -std=c++11 (for example), gdb detects it:
Producer is GNU C++11
I believe that it is possible to tell by looking at the man page (at least for g++):
Under the description of -std, the man page lists all C++ standards, including the GNU dialects. Under one specific standard, it is rather inconspicuously stated, This is the default for C++ code. (there is an analogous statement for C standards: This is the default for C code.).
For instance, for g++/gcc version 5.4.0, this is listed under gnu++98/gnu++03, whereas for g++/gcc version 6.4.0, this is listed under gnu++14.
I'm guessing a default version of the C++ compiler gets called, but I don't know which?
This is only guessable by reading the documentation of your particular compiler version.
If you are using a recent GCC, I recommend first finding out which version you have by running
g++ -v
g++ --version
and then refer to the documentation for that particular GCC release. For example, for GCC 7, read GCC 7 changes, etc.
Alternatively, run
g++ -dumpspecs
and decipher the default so called spec file.
BTW, you could ensure (e.g. in some of your common header file) that C++ is at least C++17 by coding
#if __cplusplus < 201412L
#error expecting C++17 standard
#endif
and I actually recommend doing it that way.
PS. Actually, think of C++98 & C++17 being two different languages (e.g. like Ocaml4 and C++11 are). Require your user to have a compiler supporting some defined language standard (e.g. C++11), not some particular version of GCC. Read also about package managers.
If you are using the GCC compiler, you can find it on the man pages:
man g++ | grep "This is the default for C++ code"
Typing g++ --version in your command shell will reveal the version of the compiler, and from that you can infer the default standard. So you can't tell directly but you can infer it, with some effort.
Compilers are supposed to #define __cplusplus which can be used to extract the standard that they purport to implement at compile time; but many don't do this yet.
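A minimal sketch of reading that macro from inside a program, assuming the compiler sets it honestly. `standard_name` and `current_standard` are hypothetical helper names; values reported by draft modes fall between the published ones and are mapped to the previous standard here.

```cpp
#include <string>

// Maps a __cplusplus value to the newest published standard it covers.
std::string standard_name(long value) {
    if (value >= 202002L) return "C++20 or later";
    if (value >= 201703L) return "C++17";
    if (value >= 201402L) return "C++14";
    if (value >= 201103L) return "C++11";
    if (value >= 199711L) return "C++98/03";
    return "pre-standard or unknown";
}

// Standard that this translation unit is being compiled as.
std::string current_standard() {
    return standard_name(__cplusplus);
}
```

Printing `current_standard()` from a build with and without -std=... shows what the default is for that particular compiler.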
(And don't forget to include all the C++ standard library headers you need: where is the one for std::string for example? Don't rely on your C++ standard library implementation including other headers automatically - in doing that you are not writing portable C++.)
Your question is specific to gnu compilers, so probably better to tag it appropriately, rather than just C++ and C++11.
Your code will compile with any compilers (and associated libraries) compliant with C++11 and later.
The reason is that C++11 introduced a std::ifstream constructor that accepts a const std::string &. Before C++11, a std::string could not be passed, and it would be necessary in your code to pass filename.c_str() rather than filename.
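For code that has to build with or without the flag, the c_str() form can be sketched like this (`can_open` is a hypothetical helper, not something from the question):

```cpp
#include <fstream>
#include <string>

// Opens the file through c_str(), which compiles as C++98 and as C++11
// or later, so the result does not depend on the compiler's default.
bool can_open(const std::string& filename) {
    std::ifstream in(filename.c_str());
    return in.is_open();
}
```

With -std=c++11 or later, `std::ifstream in(filename);` works as well; the c_str() call is only needed for older dialects.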
According to information from GNU, GCC 4.8.1 was the first version to fully support C++11. At the command line, g++ -v will prod g++ into telling you its version number.
If you dig into documentation, you might be able to find the version/subversion that first supported enough features so your code - as given - would compile. But such a version would support some C++11 features and not others.
Since Windows isn't distributed with g++, you will have whatever version someone (you?) has chosen to install. There will be no default version of g++ associated with your version of Windows.
The default language standards for both C and C++ are specified in the GCC Manuals. You can find these as follows:
Browse to
Select the GCC #.## Manual link for the version of GCC you are interested in, e.g. for GCC 7.5.0:
Click the topic link Language Standards Supported by GCC, followed by the topic C++ Language (or C language). Either of these topics will have a sentence such as:
The default, if no C++ language dialect options are given, is -std=gnu++14.
The default, if no C language dialect options are given, is -std=gnu11.
The above two examples are for GCC 7.5.0.

Getting weird result by using %I64u inside Mingw-w64

This is my code :
Note: the \n inside scanf is my way of avoiding the trailing-newline problem. It isn't the best solution, but I've used it so much that it has become a habit. :-)
#include <stdio.h>
int main()
{
    unsigned long long int input[2], calc_square;
    while(scanf("\n%I64u %I64u", input[0], input[1]) == 2)
        printf("%I64u %I64u\n", input[0], input[1]);
    return 0;
}
My input and the program result are:
Input:
89 89
For output, instead of printing back 89, it shows this:
I64u I64u
I'm using g++ (GCC) 4.9.1 from the MSYS2 package. Note that I use g++ because some portions of my code currently use the C++ STL.
Edit: I changed my code to use the standard %llu instead of %I64u, and here are my input and the program result:
89 89
The output is a rather weird result:
25769968512 2337536
This code is wrong:
while(scanf("\n%I64u %I64u", input[0], input[1]) == 2)
input[0] and input[1] each have type unsigned long long, but they are required to have type unsigned long long * (pointer to unsigned long long) for scanf operations. I'm unsure if MinGW supports checking printf and scanf format specifiers, but ordinary GCC is capable of detecting these kinds of errors at compile time as long as you enable the proper warnings. I highly recommend always compiling with as high of a warning level as you possibly can, such as -Wall -Wextra -Werror -pedantic in the most extreme case.
You need to pass in the address of these variables:
while(scanf("\n%I64u %I64u", &input[0], &input[1]) == 2)
//                           ^          ^
//                           |          |
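If you want to sidestep the %I64u versus %llu question entirely, the format macros from <cinttypes> choose the specifier for you. This is a sketch under assumptions: `parse_pair` is a hypothetical helper, and it uses sscanf on a string instead of scanf on stdin only so that it can be tried in isolation.

```cpp
#include <cinttypes>
#include <cstdio>

// Parses two unsigned 64-bit integers from text; returns true on success.
bool parse_pair(const char* text, std::uint64_t& first, std::uint64_t& second) {
    // scanf-family functions need the ADDRESS of each destination,
    // which is what the original code was missing.
    return std::sscanf(text, "%" SCNu64 " %" SCNu64, &first, &second) == 2;
}
```

For printing, the matching macro is PRIu64, used the same way: `printf("%" PRIu64 "\n", value);`.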
I suspect you have been using MSYS2's GCC which isn't a native Windows compiler, and doesn't support the MS-specific %I64 format modifiers (MSYS2's GCC is very much like Cygwin's GCC).
If you wanted to use MinGW-w64 GCC, you should have launched mingw64_shell.bat or mingw32_shell.bat and have the appropriate toolchain installed:
pacman -S mingw-w64-i686-toolchain
pacman -S mingw-w64-x86_64-toolchain
With that done, you can safely use either modifier on any Windows version dating back to Windows XP SP3 provided you pass -D__USE_MINGW_ANSI_STDIO=1.
FWIW, I avoid using the MS-specific modifiers and always pass -D__USE_MINGW_ANSI_STDIO=1
Finally, annoyingly, your sample doesn't work when launched from the MSYS2 shell due to mintty not being a proper Windows console; you need to run it from cmd.exe

clang++ and u16string

I'm having a hell of a time with this simple line of code and the latest clang++
#include <stdio.h>
#include <string>
using std::u16string;
int main ( int argc, char** argv )
{
    u16string s16 = u"鵝滿是快烙滴好耳痛";
}
Ben-iMac:Desktop Ben$ clang++ -std=c++0x -stdlib=libc++ main.cpp -o main
main.cpp:15:21: error: use of undeclared identifier 'u'
u16string s16 = u"鵝滿是快烙滴好耳痛"
The latest released versions of clang, v2.9 from the LLVM project or Apple's clang 3.0, do not support Unicode string literals. The latest available version, built from top-of-trunk source, does support Unicode string literals.
The next release of clang (i.e., 3.0) will support the Unicode string literal syntax, but does not have support for any source file encoding beyond ASCII. So even with that release you won't be able to type in those characters literally in your source and have them converted to a UTF-16 encoded string value. Instead you'll have to use the \u escapes. Again, top of trunk does support UTF-8 source code, but it didn't get put in in time for the 3.0 release that is currently under testing. The next release after that (in 6 months or so) should have better support for UTF-8 source code (but not other source encodings).
Edit: The Xcode 4.3 version of clang does have these features.
Edit: And now the 3.1 release from the LLVM project has them.
So clang now fully supports the following:
#include <string>
int main() {
std::u16string a = u"鵝"; // UTF-8 source is transformed into UTF-16 literal
std::u32string b = U"滿"; // UTF-8 source is transformed into UTF-32 literal
}
It turns out the standard does not actually require much support for char16_t and char32_t in the iostreams library, so you'll probably have to convert to another string type to get much use out of this. At least the ability to convert between these and the more useful std::string is required (though not exactly convenient to set up...).
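The conversion mentioned above normally goes through the standard library's codecvt machinery. As a dependency-free alternative, here is a hand-rolled UTF-16 to UTF-8 sketch. `utf16_to_utf8` is a hypothetical helper, not a standard function, and replacing unpaired surrogates with U+FFFD is an assumption about the error handling you want.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encodes a UTF-16 string as UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool is_low = cp >= 0xDC00 && cp <= 0xDFFF;
        if (is_high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (is_high || is_low) {
            cp = 0xFFFD;  // assumption: replace unpaired surrogates
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The resulting std::string can then be written to std::cout or a file like any other byte string.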
You can test clang for individual C++11 features, e.g.:
and here's a status page:

Sun Studio 10 has strange `sun` constant?

Strangely, the following C++ program compiles on Sun Studio 10 without producing a warning for an undefined variable:
int main()
return sun;
The value of sun seems to be 1. Where does this variable come from and what is it for?
It's almost certainly a predefined macro. Formally, the C and
C++ standards reserve names starting with an underscore and
a capital letter, or containing two underscores, for this, but
practically, compilers had such symbols defined before the
standard, and continue to support them, at least in their
non-compliant modes which is the default mode for all of the
compilers I know. I can remember having problems with `linux`
at one time, but not when I invoked g++ with -std=c++98.
It must be one of the automatic macros created by the compiler.
Try the same thing: replace sun with linux and use a gcc compiler on Linux. You'll get a similar result.
With gcc, you can get all the predefined macros with: echo "" | gcc -E - -dM.
sun is defined for historical backwards compatibility from before the convention to start with an underscore was adopted. For Studio, it's documented in the cc(1) and CC(1) man pages under the -D flag:
Defines a macro symbol name to the preprocessor. Doing so is
equivalent to including a #define directive at the beginning of the
source. You can use multiple -D options.
The following values are predefined.
SPARC and x86 platforms:
__STDC__ = 0
__SUNPRO_CC = 0x5130
_BOOL if type bool is enabled (see "-features=[no%]bool")
__SVR4 (Oracle Solaris)
__SunOS_5_10 (Oracle Solaris)
__SunOS_5_11 (Oracle Solaris)
Various standards compliance options can disable it, as can the +p flag to CC.
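Because the unprefixed names can disappear in strict modes, new code usually tests the reserved-name variants instead. A minimal sketch, where `platform_name` is a hypothetical function and the list of platforms is only illustrative:

```cpp
// Returns a label for the platform, based on predefined macros whose
// names are reserved for the implementation (leading underscores).
const char* platform_name() {
#if defined(__sun)
    return "Solaris";
#elif defined(__linux__)
    return "Linux";
#elif defined(__APPLE__)
    return "Apple";
#elif defined(_WIN32)
    return "Windows";
#else
    return "unknown";
#endif
}
```

Unlike `sun` or `linux`, these names cannot collide with an ordinary identifier in your own code.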