Is there any C/C++ system() function accepting unicode? - c++

Question: In C/C++, is there any system() function that accepts Unicode?
See below for the reason:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char* argv[])
{
// http://stackoverflow.com/questions/3313332/bringing-another-apps-window-to-front-on-mac-in-c
system("osascript -e \"tell application \\\"Address Book\\\" to activate\"");
return EXIT_SUCCESS;
}

Use _wsystem.
_wsystem is a wide-character version of system; the command argument to _wsystem is a wide-character string. These functions behave identically otherwise.
http://msdn.microsoft.com/en-us/library/277bwbdz.aspx
A special function named main is the starting point of execution for all C and C++ programs. If you are writing code that adheres to the Unicode programming model, you can use wmain, which is the wide-character version of main.
http://msdn.microsoft.com/en-us/library/vstudio/6wd819wh.aspx
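A minimal sketch of how this could be wrapped, assuming the command is held as UTF-8 in a std::string. The helper name run_utf8_command is hypothetical. On Windows it widens the string with MultiByteToWideChar and calls _wsystem; everywhere else it simply hands the bytes to std::system, since _wsystem only exists in Microsoft's runtime.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#ifdef _WIN32
#include <windows.h>  // MultiByteToWideChar, CP_UTF8
#endif

// Hypothetical helper: run a command line that is encoded as UTF-8.
// Returns whatever system()/_wsystem() returns, or -1 if widening fails.
int run_utf8_command(const std::string& utf8_cmd) {
#ifdef _WIN32
    // First call asks for the required length (including the terminator).
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8_cmd.c_str(), -1, nullptr, 0);
    if (len <= 0) return -1;
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8_cmd.c_str(), -1, &wide[0], len);
    return _wsystem(wide.c_str());
#else
    // Non-Windows: system() takes plain bytes and passes them to the shell.
    return std::system(utf8_cmd.c_str());
#endif
}
```

Note that this does not help on OS X, where the question's osascript call lives; there the plain system() branch is the one that runs.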

system() does not care about the encoding as far as I know; it should just pass the bytes through.
Maybe your question is "how do I type a UTF-8 string literal in C", or "what encoding does osascript expect"?
The portable way to write UTF-8 in C is with \x escape sequences, though if you are willing to rely on C99 or a specific compiler you can often type the UTF-8 directly.
I would guess osascript expects UTF-8, though I have no real idea.
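To illustrate the \x approach, here is a sketch that spells a non-ASCII name as UTF-8 bytes and builds the command string from it. The helper build_activate_command and the application name "Café" are made up for the example, and whether osascript really expects UTF-8 is the same guess as above.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// "Café" spelled as UTF-8 bytes: U+00E9 is the two bytes 0xC3 0xA9.
// The literal is split in two only for readability here; splitting becomes
// necessary when the character after a \x escape is itself a hex digit,
// because the escape swallows every hex digit that follows it.
const char kAppName[] = "Caf\xc3" "\xa9";

// Hypothetical helper: builds the shell command, does not run it.
// The name is inserted as-is, so it must not contain quote characters.
std::string build_activate_command(const std::string& app_name) {
    return "osascript -e 'tell application \"" + app_name + "\" to activate'";
}

// Usage: std::system(build_activate_command(kAppName).c_str());
```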

Standard C and C++ do not explicitly understand Unicode at all: unlike Java, none of the standard APIs are defined as accepting or returning Unicode strings. Whether a wide string or multibyte string is actually a Unicode-encoded string is system dependent. So the simple answer is no.

Related

How can I disable or have a warning when using wchar_t outside Windows?

The wchar_t type is used extensively in the Windows API and in the C++ standard library APIs derived from it, so it's hard to change Windows code to use something else: you would have to cast/convert back and forth every time.
But outside Windows, wide characters are rarely used and UTF-8 is preferred instead. Code that uses wchar_t outside Windows is therefore probably doing something wrong, and even if it's intended, it's better to use types that communicate the intent better, e.g. std::u16string and char16_t when dealing with UTF-16 strings instead of wstring, and std::u32string and char32_t when the intent is to store Unicode code points.
Is there a GCC option to turn on a project-wide diagnostic that warns or errors when it sees wchar_t, thereby identifying potential sites for refactoring?
This is a small workaround rather than a dedicated GCC feature. It will break your build (and, more or less, any included third-party code), but it lets you find where you use wchar_t.
You can override the definition of wchar_t with the preprocessor, which then leads to errors wherever it is used. That way you can find the potential usages:
#define wchar_t void
wchar_t Foo() { }
int main()
{
auto wchar_used = Foo();
}
Error message:
error: 'void wchar_used' has incomplete type
10 | auto wchar_used = Foo();

Convert execution character set string to a UTF-8 string

In my program I have a std::string that contains text encoded using the "execution character set" (which is not guaranteed to be UTF-8 or even US-ASCII), and I want to convert that to a std::string that contains the same text, but encoded using UTF-8. How can I do that?
I guess I need a std::codecvt<char, char, std::mbstate_t> character-converter object, but where can I get hold of a suitable object? What function or constructor must I use?
I assume the standard library provides some means for doing this (somewhere, somehow), because the compiler itself must know about UTF-8 (to support UTF-8 string literals) and the execution character set.
I guess I need a std::codecvt<char, char, std::mbstate_t> character-converter object, but where can I get hold of a suitable object?
You can get a std::codecvt object only as a base class instance (by inheriting from it) because the destructor is protected. That said, no: std::codecvt<char, char, std::mbstate_t> is not the facet you need, since it represents the identity conversion (i.e. no conversion at all).
At the moment, the C++ standard library has no functionality for converting between the native (aka execution) character encoding (aka character set) and UTF-8. As such, you can implement the conversion yourself using the Unicode standard: https://www.unicode.org/versions/Unicode11.0.0/UnicodeStandard-11.0.pdf
To use an external library I guess you would need to know the "name" (or ID) of the execution character set. But how would you get that?
There is no standard library function for that either. On POSIX systems, for example, you can use nl_langinfo(CODESET).
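For example, on a POSIX system the lookup might look like the sketch below. Keep in mind that nl_langinfo reports the codeset of the current runtime locale, which is only an approximation of the compile-time execution character set. current_codeset is a hypothetical helper name.

```cpp
#include <cassert>
#include <string>
#include <langinfo.h>  // POSIX only: nl_langinfo, CODESET

// Hypothetical helper: name of the character encoding used by the
// current LC_CTYPE locale, or an empty string if none is reported.
std::string current_codeset() {
    const char* name = nl_langinfo(CODESET);
    return name ? std::string(name) : std::string();
}
```

Call std::setlocale(LC_ALL, "") from <clocale> beforehand if the result should reflect the user's environment rather than the default "C" locale. The returned name is the kind of string an external conversion library would take as its source encoding.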
This is hacky but it worked for me in MS VS2019
#pragma execution_character_set( "utf-8" )

Unicode support in C++0x

I'm trying to use new unicode characters in C++0x.
So I wrote sample code:
#include <fstream>
#include <string>
int main()
{
std::u32string str = U"Hello World";
std::basic_ofstream<char32_t> fout("output.txt");
fout<<str;
return 0;
}
But after executing this program I get an empty output.txt file. So why isn't it printing Hello World?
Also, is there something like cout and cin already defined for these types, or do stdin and stdout not support Unicode?
Edit: I'm using g++ and Linux.
EDIT: ATTENTION. I have discovered that the standard committee dropped Unicode streams from C++0x, so the previously accepted answer is no longer correct. For more information, see my answer!
Unicode string literals support began in GCC 4.5. Maybe that's the problem.
[edit]
After some digging I've found that streams for these new Unicode literals are described in N2035, which was included in a draft of the standard. According to this document you need u32ofstream to output your string, but this class is absent from the GCC 4.5 C++0x library.
As a workaround you can use ordinary fstream:
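std::ofstream fout2("output2.txt", std::ios::out | std::ios::binary);
// Dump the raw char32_t code units; the byte order is the machine's own.
fout2.write(reinterpret_cast<const char*>(str.data()), str.size() * sizeof(char32_t));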
This way I've output your string in UTF-32LE on my Intel machine (which is little-endian).
[edit]
I was a little bit wrong about the status of u32ofstream: according to the latest draft on the C++ Standards Committee's web site, you have to use std::basic_ofstream<char32_t> as you did. This class would use the codecvt<char32_t,char,typename traits::state_type> class (see the end of §27.9.1.1), which has to be implemented in the standard library (search for codecvt<char32_t in the document), but it's not available in GCC 4.5.
In the new C++ standard there will be no Unicode streams.
As @ssmir mentioned, the standard committee was going to add stream support for Unicode in C++0x. However, in later drafts the committee decided to remove stream support for Unicode. For more information see this link.
It seems like the only way to output a Unicode string is to convert it to a narrow (char) string with codecvt.
When creating, the stream tries to obtain a 'codecvt' from the global locale, but fails to get one because the only standard codecvt's are for char and wchar_t.
As a result, _M_codecvt member of the stream object is NULL.
Later, during the attempt to output, your code throws an exception (not visible to user) in facet checking function in basic_ios.h, because the facet is initialized from _M_codecvt.
Add a facet to the locale associated with the stream to do the conversion from char32_t to the correct output.
Imbue the stream with a locale containing a codecvt of the right type.
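A different route from imbuing a facet, sketched here as an alternative: convert the std::u32string to UTF-8 by hand and write the bytes through an ordinary std::ofstream, which always has a working char codecvt. The helper names to_utf8 and write_utf8_file are made up for this example, and replacing invalid code points with U+FFFD is a choice of this sketch, not something the standard library does for you.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: encode UTF-32 code points as UTF-8 bytes.
std::string to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        // Surrogates and values above U+10FFFF are not valid code points.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: write the text to `path` as UTF-8, no BOM.
bool write_utf8_file(const char* path, const std::u32string& text) {
    std::ofstream fout(path, std::ios::out | std::ios::binary);
    const std::string bytes = to_utf8(text);
    fout.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(fout);
}
```

With this, write_utf8_file("output.txt", U"Hello World") would stand in for the fout << str line from the question.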

What std::locale names are available on common windows compilers?

The standard is pretty much silent on what constitutes a valid locale name; only that passing an invalid locale name results in std::runtime_error. What locale names are usable on common windows compilers such as MSVC, MinGW, and ICC?
Ok, there is a difference between C and C++ locales.
Let's start:
MSVC C++ std::locale and C setlocale
Accepts locale names of the form "Language[_Country][.Codepage]", for example "English_United States.1251"; anything else throws. Note: the codepage can't be 65001/UTF-8 and should be consistent with the ANSI codepage for this locale (or just be omitted).
MSVC C++ std::locale and C setlocale in Vista and 7 should accept locales of the form [Language][-Script][-Country], like "en-US", using ISO 639 language codes, ISO 3166 regions and script names.
I tested it with Visual Studio on Windows 7: it does not work.
MinGW C++ std::locale accepts "C" and "POSIX"; it does not support other locales. gcc actually supports locales only through the GNU C library, so basically only under Linux.
setlocale is a native Windows API call, so it should support everything I mentioned above.
MinGW may support a wider range of locales when used with alternative C++ libraries like Apache stdcxx or STLport.
ICC: I haven't tested it, but it depends on the standard C++ library it uses. For example, under Linux it uses GCC's libstdc++, so it supports all the locales gcc supports. I don't know what standard C++ library it uses under Windows.
If you want compiler- and platform-independent locale support (and actually much better support), take a look at Boost.Locale.
Artyom
I believe the information you need is here:
locale "lang[_country_region[.code_page]]"
| ".code_page"
| ""
| NULL
This page provides links to:
Language Strings
Country/Region String
Code Pages
Although my answer covers setlocale instead of std::locale, this MSDN page seems to imply that the format is indeed the same:
An object of class locale also stores a locale name as an object of class string. Using an invalid locale name to construct a locale facet or a locale object throws an object of class runtime_error. The stored locale name is "*" if the locale object cannot be certain that a C-style locale corresponds exactly to that represented by the object. Otherwise, you can establish a matching locale within the Standard C Library, for the locale object loc, by calling setlocale(LC_ALL, loc.name().c_str()).
Also see this page and this thread which tend to show that std::locale internally uses setlocale.
Here's one locale name that's usable pretty much anywhere: "". That is, the empty string. This is in contrast to the "C" locale that you are probably getting by default. The empty string as an argument to std::setlocale() means something like "Use the preferred locale set by the user or environment." If you use this, the downside is that your program won't have the same output everywhere; the upside is that your users might think it works just the way they want.
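Since the accepted names vary by compiler and library, one option is to probe them at run time. The sketch below relies only on the behaviour quoted above, namely that constructing std::locale from a name it does not accept throws std::runtime_error. try_locale is a hypothetical helper name.

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true and stores the resulting locale name
// in `resolved` if this implementation accepts `name`.
bool try_locale(const char* name, std::string& resolved) {
    try {
        resolved = std::locale(name).name();
        return true;
    } catch (const std::runtime_error&) {
        return false;  // name not supported by this compiler/library
    }
}
```

Calling it with "", "C", "en-US" and "English_United States.1252" on each toolchain of interest shows quickly which of the forms discussed in these answers it actually takes.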

What new Unicode functions are there in C++0x?

It has been mentioned in several sources that C++0x will include better language-level support for Unicode (including types and literals).
If the language is going to add these new features, it's only natural to assume that the standard library will as well.
However, I am currently unable to find any references to the new standard library. I expected to find answers to these questions:
Does the new library provide standard methods to convert UTF-8 to UTF-16, etc.?
Does the new library allow writing UTF-8 to files, to the console (or from files, from the console)? If so, can we use cout or will we need something else?
Does the new library include "basic" functionality such as: discovering the byte count and length of a UTF-8 string, converting to upper-case/lower-case (does this consider the influence of locales?)
Finally, are any of these functions available in any popular compilers such as GCC or Visual Studio?
I have tried to look for information, but I can't seem to find anything. I am actually starting to think that maybe these things aren't even decided yet (I am aware that C++0x is a work in progress).
Does the new library provide standard methods to convert UTF-8 to UTF-16, etc.?
No. The new library does provide std::codecvt facets which do the conversion for you when dealing with iostream, however. ISO/IEC TR 19769:2004, the C Unicode Technical Report, is included almost verbatim in the new standard.
Does the new library allow writing UTF-8 to files, to the console (or from files, from the console)? If so, can we use cout or will we need something else?
Yes, you'd just imbue cout with the correct codecvt facet. Note, however, that the console is not required to display those characters correctly.
Does the new library include "basic" functionality such as: discovering the byte count and length of a UTF-8 string, converting to upper-case/lower-case (does this consider the influence of locales?)
AFAIK that functionality exists in the existing C++03 standard. std::toupper and std::towupper of course function just as in previous versions of the standard. There aren't any new functions which specifically operate on Unicode for this.
If you need these kinds of things, you're still going to have to rely on an external library -- the <iostream> is the primary piece that was retrofitted.
What, specifically, is added for unicode in the new standard?
Unicode literals, via u8"", u"", and U""
std::char_traits classes for UTF-8, UTF-16, and UTF-32
mbrtoc16, c16rtomb, mbrtoc32, and c32rtomb from ISO/IEC TR 19769:2004
std::codecvt facets for the locale library
The std::wstring_convert class template (which uses the codecvt mechanism for code set conversions)
The std::wbuffer_convert class template, which does the same as wstring_convert but for stream buffers rather than strings.
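As a small illustration of the literal prefixes from the list, the sketch below counts the code units each encoding needs for the same text. code_units is a hypothetical helper. It counts storage units (bytes for u8"", 16-bit units for u"", 32-bit units for U""), not user-perceived characters, for which the standard library still offers nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null. Works for u8"", u"" and U"" literals.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// The same information at run time, via the new string typedefs.
inline std::size_t utf16_units(const std::u16string& s) { return s.size(); }
inline std::size_t utf32_units(const std::u32string& s) { return s.size(); }
```

U+1F600 lies outside the Basic Multilingual Plane, so it takes four UTF-8 bytes, a surrogate pair in UTF-16, and a single unit in UTF-32.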