_wfopen equivalent under Mac OS X - c++

I'm looking to the equivalent of Windows _wfopen() under Mac OS X. Any idea?
I need this in order to port a Windows library that uses wchar* for its File interface. As this is intended to be a cross-platform library, I am unable to rely on how the client application will get the file path and give it to the library.

POSIX API in Mac OS X are usable with UTF-8 strings. In order to convert a wchar_t string to UTF-8, it is possible to use the CoreFoundation framework from Mac OS X.
Here is a class that will wrap an UTF-8 generated string from a wchar_t string.
class Utf8
{
public:
Utf8(const wchar_t* wsz): m_utf8(NULL)
{
// OS X uses 32-bit wchar
const int bytes = wcslen(wsz) * sizeof(wchar_t);
// comp_bLittleEndian is in the lib I use in order to detect PowerPC/Intel
CFStringEncoding encoding = comp_bLittleEndian ? kCFStringEncodingUTF32LE
: kCFStringEncodingUTF32BE;
CFStringRef str = CFStringCreateWithBytesNoCopy(NULL,
(const UInt8*)wsz, bytes,
encoding, false,
kCFAllocatorNull
);
const int bytesUtf8 = CFStringGetMaximumSizeOfFileSystemRepresentation(str);
m_utf8 = new char[bytesUtf8];
CFStringGetFileSystemRepresentation(str, m_utf8, bytesUtf8);
CFRelease(str);
}
~Utf8()
{
if( m_utf8 )
{
delete[] m_utf8;
}
}
public:
operator const char*() const { return m_utf8; }
private:
char* m_utf8;
};
Usage:
const wchar_t wsz = L"Here is some Unicode content: éà€œæ";
const Utf8 utf8 = wsz;
FILE* file = fopen(utf8, "r");
This will work for reading or writing files.

You just want to open a file handle using a path that may contain Unicode characters, right? Just pass the path in filesystem representation to fopen.
If the path came from the stock Mac OS X frameworks (for example, an Open panel whether Carbon or Cocoa), you won't need to do any conversion on it and will be able to use it as-is.
If you're generating part of the path yourself, you should create a CFStringRef from your path and then get that in filesystem representation to pass to POSIX APIs like open or fopen.
Generally speaking, you won't have to do a lot of that for most applications. For example, many applications may have auxiliary data files stored the user's Application Support directory, but as long as the names of those files are ASCII, and you use standard Mac OS X APIs to locate the user's Application Support directory, you don't need to do a bunch of paranoid conversion of a path constructed with those two components.
Edited to add: I would strongly caution against arbitrarily converting everything to UTF-8 using something like wcstombs because filesystem encoding is not necessarily identical to the generated UTF-8. Mac OS X and Windows both use specific (but different) canonical decomposition rules for the encoding used in filesystem paths.
For example, they need to decide whether "é" will be stored as one or two code units (either LATIN SMALL LETTER E WITH ACUTE or LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT). These will result in two different — and different-length — byte sequences, and both Mac OS X and Windows work to avoid putting multiple files with the same name (as the user perceives them) in the same directory.
The rules for how to perform this canonical decomposition can get pretty hairy, so rather than try to implement it yourself it's best to leave it to the functions the system frameworks have provided for you to do the heavy lifting.

#JKP:
Not all functions in MacOS X accept UTF8, but filenames and filepaths may be UTF8, thus all POSIX functions dealing with file access (open, fopen, stat, etc.) accept UTF8.
See here. Quote:
How a file name looks at the API level
depends on the API. Current Carbon
APIs handle file names as an array of
UTF-16 characters; POSIX ones handle
them as an array of UTF-8, which is
why UTF-8 works well in Terminal. How
it's stored on disk depends on the
disk format; HFS+ uses UTF-16, but
that's not important in most cases.
Some other POSIX functions handle UTF8 as well. E.g. functions dealing with user names, group names or user passwords use UTF8 to store the information (thus a user name can be Japanese and your password can be Chinese, no problem).
But not all handle UTF8. E.g. for all string functions an UTF8 string is just a normal C String and characters above 126 have no special meaning. They don't understand the concept of multiple bytes (chars in C) forming a single Unicode character. How other APIs handle char * pointer being passed to them is different from API to API. However, as a rule as the thumb you can say:
Either the function only accepts C strings with pure ASCII characters (only in the range 0 to 126) or it will accept UTF8. Usually functions don't allow characters above 126 and interpret them in any other encoding than UTF8. If this really was the case, it is documented and then there must be a way to pass the encoding along with the string.

If you're using Cocoa it's fairly easy with NSString. Just load the UTF16 data in using -initWithBytes:length:encoding: (or perhaps -initWithCString:encoding:) and then get a UTF8 version by calling UTF8String on the result. Then, just call fopen with your new UTF8 string as the param.
You can definitely call fopen with a UTF-8 string, regardless of language - can't help with C++ on OSX though - sorry.

I have read file name from configuration UTF8 file through wifstream (it uses wchar_t buffer).
Mac implementation is different from Linux and Windows.
wifstream reads each byte from file to separate wchar_t cell in the buffer. So we have 3 empty bytes, although open requires char string. Thus programmer can use wcstombs function to convert wide character string to multi-byte string.
The API supports UTF8. For better understanding use memory watcher and hex editor for your file.

Related

C++ output Unicode in variable

I'm trying to output a string containing unicode characters, which is received with a curl call. Therefore, I'm looking for something similar to u8 and L options for literal strings, but than applicable for variables. E.g.:
const char *s = u8"\u0444";
However, since I have a string containing unicode characters, such as:
mit freundlichen Grüßen
When I want to print this string with:
cout << UnicodeString << endl;
it outputs:
mit freundlichen Gr??en
When I use wcout, it returns me:
mit freundlichen Gren
What am I doing wrong and how can I achieve the correct output. I return the output with RapidJSON, which returns the string as:
mit freundlichen Gr��en
Important to note, the application is a CGI running on Ubuntu, replying on browser requests
If you are on Windows, what I would suggest is using Unicode UTF-16 at the Windows boundary.
It seems to me that on Windows with Visual C++ (at least up to VS2015) std::cout cannot output UTF-8-encoded-text, but std::wcout correctly outputs UTF-16-encoded text.
This compilable code snippet correctly outputs your string containing German characters:
#include <fcntl.h>
#include <io.h>
#include <iostream>
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
// ü : U+00FC
// ß : U+00DF
const wchar_t * text = L"mit freundlichen Gr\u00FC\u00DFen";
std::wcout << text << L'\n';
}
Note the use of a UTF-16-encoded wchar_t string.
On a more general note, I would suggest you using the UTF-8 encoding (and for example storing text in std::strings) in your cross-platform C++ portions of code, and convert to UTF-16-encoded text at the Windows boundary.
To convert between UTF-8 and UTF-16 you can use Windows APIs like MultiByteToWideChar and WideCharToMultiByte. These are C APIs, that can be safely and conveniently wrapped in C++ code (more details can be found in this MSDN article, and you can find compilable C++ code here on GitHub).
On my system the following produces the correct output. Try it on your system. I am confident that it will produce similar results.
#include <string>
#include <iostream>
using namespace std;
int main()
{
string s="mit freundlichen Grüßen";
cout << s << endl;
return 0;
}
If it is ok, then this points to the web transfer not being 8-bit clean.
Mike.
containing unicode characters
You forgot to specify which unicode encoding does the string contain. There is the "narrow" UTF-8, which can be stored in a std::string and printed using std::cout, as well as wider variants, which can't. It is crucial to know which encoding you're dealing with. For the remainder of my answer, I'm going to assume you want to use UTF-8.
When I want to print this string with:
cout << UnicodeString << endl;
EDIT:
Important to note, the application is a CGI running on Ubuntu, replying on browser requests
The concerns here are slightly different from printing onto a terminal.
You need to set the Content-Type response header appropriately or else the client cannot know how to interpret the response. For example Content-Type: application/json; charset=utf-8.
You still need to make sure that the source string is in fact the correct encoding corresponding to the header. See the old answer below for overview.
The browser has to support the encoding. Most modern browsers have had support for UTF-8 a long time now.
Answer regarding printing to terminal:
Assuming that
UnicodeString indeed contains an UTF-8 encoded string
and that the terminal uses UTF-8 encoding
and the font that the terminal uses has the graphemes that you use
the above should work.
it outputs:
mit freundlichen Gr??en
Then it appears that at least one of the above assumptions don't hold.
Whether 1. is true, you can verify by inspecting the numeric value of each code unit separately and comparing it to what you would expect of UTF-8. If 1. isn't true, then you need to figure out what encoding does the string actually use, and either convert the encoding, or configure the terminal to use that encoding.
The terminal typically, but not necessarily, uses the system native encoding. The first step of figuring out what encoding your terminal / system uses is to figure out what terminal / system you are using in the first place. The details are probably in a manual.
If the terminal doesn't use UTF-8, then you need to convert the UFT-8 string within your program into the character encoding that the terminal does use - unless that encoding doesn't have the graphemes that you want to print. Unfortunately, the standard library doesn't provide arbitrary character encoding conversion support (there is some support for converting between narrow and wide unicode, but even that support is deprecated). You can find the unicode standard here, although I would like to point out that using an existing conversion implementation can save a lot of work.
In the case the character encoding of the terminal doesn't have the needed grapehemes - or if you don't want to implement encoding conversion - is to re-configure the terminal to use UTF-8. If the terminal / system can be configured to use UTF-8, there should be details in the manual.
You should be able to test if the font itself has the required graphemes simply by typing the characters into the terminal and see if they show as they should - although, this test will also fail if the terminal encoding does not have the graphemes, so check that first. Manual of your terminal should explain how to change the font, should it be necessary. That said, I would expect üß to exist in most fonts.

What function should I use to open file in mac OS X?

I use C++ Builder to create my cross-platform application.
In the app, I will get file name/path by open file dialog.
In Windows, it's no problem to take care the unicode string. (ex. "C:\測試")
In mac OS X, I can get correct string from UnicodeString. But I can't find a good method to convert it to char array and use "fopen" to open the file correctly.
I tried to assign the UnicodeString to AnsiString directly but it became "C:\??".
Because "fopen" only accepts "char*" and UnicodeString can only export "char16*", I need to convert it to char for "fopen".
Any idea?
Just because fopen() takes a char* does not mean you should give it an ANSI string. POSIX APIs on OSX accept UTF-8 encoded filenames, so use UTF8String instead of AnsiString. A char* can point to a UTF-8 string.
Otherwise, don't use fopen() directly. Use the RTL's own functions instead, like FileCreate()/FileOpen() in System.SysUtils.hpp unit, or the TFileStream class in System.Classes.hpp unit. Let the RTL decide internally how to interact with platform APIs for you.

Converting UTF16(Windows wchar_t) to UTF8 in C++ Non-English letters corrupted(Korean)

I'm trying to make a multiplatform app. On the Windows Store App(winrt) side, open a file and read its path in Platform::String format which is wchar_t, UTF16 in Windows.
Since my core logic is platform independent and only use standard C++ data types, I've converted the path into std::string in UTF8 via this code:
Platform::String^ copyPath = copy->Path;
std::wstring source(copyPath->Data());
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t >, wchar_t > convert;
std::string u8CopyPath = convert.to_bytes(source);
However, when I check u8CopyPath in debugger, it shows corrupted letters for non-English chars. Far as I know, UTF-8 is perfectly capable of encoding non-English languages since it can use multiple bytes for a single letter. Is there something in the conversion that corrupts the non-English letters?
It turns out it's just a debugger thing. Once I wrote it to a file and examine it, it printed out correctly.

c++ Lithuanian language, how to get more than ascii

I am trying to use Lithuanian in my c++ application, but every try is unsuccesfull.
Multi-byte character set is used. I have tryed everything i have tought of, i am new in c++. Never ever tryed to do something in Lithuanian.
Tryed every setlocale(LC_ALL, "en_US.utf8"); setlocale(LC_ALL, "Lithuanian");...
Researched for 2 hours and didnt found proper examples, solution.
I do have a average sized project which needs Lithuanian translation from database and it cant understand most of "ĄČĘĖĮŠŲŪąčęėįšųū".
Compiler - "Visual studio 2013"
Database - sqlite3.
I cant get simple strings to work(defined myself), and output as Lithuanian to win32 application, even.
In Windows use wide character strings (1UTF-16 encoding, wchar_t type) for internal text handling, and preferably UTF-8 for external text files and networking.
Note that Visual C++ will translate narrow text literals from the source encoding to Windows ANSI, which is a platform-dependent usually single-byte encoding (you can check which one via the GetACP API function), i.e., Visual C++ has the platform-specific Windows ANSI as its narrow C++ execution character set.
But also do note that for an app restricted to non-Windows platforms, i.e. Unix-land, it makes practical sense to do everything in UTF-8, based on char type.
For the database communication you may need to translate to and from the program's internal text representation.
This depends on what the database interface requires, which is not stated.
Example for console output in Windows:
#include <iostream>
#include <fcntl.h>
#include <io.h>
auto main() -> int
{
_setmode( _fileno( stdout ), _O_WTEXT );
using namespace std;
wcout << L"ĄČĘĖĮŠŲŪąčęėįšųū" << endl;
}
To make this compile by default with g++, the source code encoding needs to be UTF-8. Then, to make it produce correct results with Visual C++ the source code encoding needs to be UTF-8 with BOM, which happily is also accepted by modern versions of g++. For otherwise the Visual C++ compiler will assume the Windows ANSI encoding and produce an incorrect UTF-16 string.
Not coincidentally this is the default meaning of UTF-8 in Windows, e.g. in the Notepad editor, namely UTF-8 with BOM.
But note that while in Windows the problem is that the main system compiler requires a BOM for UTF-8, in Unix-land the problem is the opposite, that many old tools can't handle the BOM (for example, even MinGW g++ 4.9.1 isn't yet entirely up to speed: it sometimes includes the BOM bytes, then incorrectly interpreted, in error messages).
1) On other platforms wide character text can be encoded in other ways, e.g. with UTF-32. In fact the Windows convention is in direct conflict with the C and C++ standards which require that a single wchar_t should be able to encode any character in the extended character set. However, this requirement was, AFAIK, imposed after Windows adopted UTF-16, so the fault probably lies with the politics of the C and C++ standardization process, not yet another Microsoft'ism.
Complexity of internationalisation
There are several related but distinct topics that can cause mismatches between them, making try and error approach very tedious:
type used for storing strings and chars: windows iuses wchar_t by default, but for most APIs you have also char equivalents functions
character set encoding this defines how the chars stored in the type are to be understood. For exemple unicode (UTF8, UTF16, UTF32), 7 bits ascii, 8 bit ansii. In windows, by default it is UTF16 for wchar_t and ansi/windows for char
locale defines, among other things, the character set asumptions, when processing strings. This permit to use language independent functions like isalpha(i, loc), islower(i, loc), ispunct(i, loc) to find out if a given character is alphanumeric, a lower case alphabetic, or a punctuation, for example to bereak down a user text into words. C++ offers here portable functions.
output codepage or font used to show a character to the user. This assumes that the font used shows the characters using the same character set used in the code internals.
source code encoding. For example your editor could assume an ansi encoding, with windows 1252 character set.
Most typical errors
The problem n°1 is Win32 console output, as unicode is not well supported by the console. But this is not your problem here.
Another cause of mismatch is the encoding of your text editor. It might not be unicode, but use a windows code page. In this case, you type "Č", the deditor displays it as such, but editor might use windows 1257 encoding for lithuanian and store 0xC8 in the file. If you then display this literal with a windows unicode function, it will interpret 0xC8 as "latin E grave accent" and print something else, as the right unicode encoding for "Č" is 0x010C !
I can be even worse: the compiler may have its own assumption about character set encoding used and convert your litterals into unicode using false assumptions (it happened to me when I used some exotic code generation switch).
How to do ?
To figure out what goes wront, proceed by elimination:
First, for plain windows, use the native unicode setting. Ok it's UTF16 and wchar_t instead of UTF8 and as thus comes with some drawbacks, but it's native and well supported.
Then use explict unicode coding in litterals, for example TEXT("\u010C") instead of TEXT("Č"). This avoids editor and compiler mismatch.
If it's still not the right character, make sure that your font FULLY supports unicode. The default system font for instance doesn't while most other do. You can easily check with the windows font pannel (WindowKey+R fonts then click on "search char") to display the character table of your font.
Set fonts explicitely in your code
For example, a very tiny experiment :
...
case WM_PAINT:
{
hdc = BeginPaint(hWnd, &ps);
auto hf = CreateFont(24, 0, 0, 0, 0, TRUE, 0, 0, 0, 0, 0, 0, 0, L"Times New Roman");
auto hfOld = SelectObject(hdc, hf); // if you comment this out, € and Č won't display
TextOut(hdc, 50, 50, L"Test with éç € \u010C special chars", 30);
SelectObject(hdc, hfOld);
DeleteObject(hf);
EndPaint(hWnd, &ps);
break;
}

Windows usage of char * functions with UTF-16

I port one application from Linux to Windows.
On Linux I use libmagic library from which I wouldn't be glad to rid of on Windows.
The problem is that I need pass name of file that is held in UTF-16 encoding to such function:
int magic_load(magic_t cookie, const char *filename);
Unfortunately it accepts only const char *filename. My first idea was to convert UTF-16 string to local encoding, but there are some problems - like string can contain e.g. Chinese symbols and local encoding may be Russian.
As result we will get trash on the output and program will not reach its aim.
Converting into UTF-8 doesn't help either, because this is Windows and Windows holds file name in UTF-16.
But I somehow need make that function able to open file with Unicode name.
I came only to one very very bad solution:
1. I have a filename
2. I can copy file with unicode name to file with ASCII name like "1.mp3"
3. open it with libmagic functions and get what I want
4. remove temporarily file
But I understand how this solution is bad and how it could make my application slower, so I wonder, perhaps are there some better ways to do it?
Thanks in advance for any tips, 'cause I'm really confused with it.
Use 8.3 file names to access the files.
In addition to long file names up to 255 characters in length, Windows also generates an MS-DOS-compatible (short) file name in 8.3 format.
http://support.microsoft.com/kb/142982