Windows usage of char * functions with UTF-16 - c++

I'm porting an application from Linux to Windows.
On Linux I use the libmagic library, which I would rather not give up on Windows.
The problem is that I need to pass a file name held in UTF-16 to this function:
int magic_load(magic_t cookie, const char *filename);
Unfortunately it accepts only a const char *filename. My first idea was to convert the UTF-16 string to the local encoding, but there are problems: the string may contain, for example, Chinese characters while the local encoding is Russian.
The result would be garbage and the program would not achieve its goal.
Converting to UTF-8 doesn't help either, because this is Windows and Windows stores file names in UTF-16.
But I somehow need to make that function able to open a file with a Unicode name.
The only solution I came up with is a very bad one:
1. I have the file name.
2. I copy the file with the Unicode name to a file with an ASCII name like "1.mp3".
3. I open that copy with the libmagic functions and get what I want.
4. I remove the temporary file.
But I understand how bad this solution is and how much it could slow down my application, so I wonder: are there better ways to do it?
Thanks in advance for any tips, 'cause I'm really confused with it.

Use 8.3 file names to access the files.
In addition to long file names up to 255 characters in length, Windows also generates an MS-DOS-compatible (short) file name in 8.3 format.
http://support.microsoft.com/kb/142982
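A minimal sketch of that approach (the helper name is hypothetical, and GetShortPathNameW only yields a short name if 8.3 name generation is enabled on the volume, so check the result):

#include <windows.h>
#include <magic.h>    // libmagic header, assumed to be available in the Windows build
#include <string>

// Hypothetical helper: obtain the ASCII-only 8.3 alias for a UTF-16 path and
// hand it to a const char* API such as magic_load().
bool load_with_short_name(magic_t cookie, const std::wstring& utf16Path)
{
    wchar_t shortPath[MAX_PATH];
    DWORD n = GetShortPathNameW(utf16Path.c_str(), shortPath, MAX_PATH);
    if (n == 0 || n >= MAX_PATH)
        return false;                         // no short name available on this volume

    // A generated 8.3 name is typically ASCII-only, so this narrow conversion is safe.
    char narrow[MAX_PATH];
    WideCharToMultiByte(CP_ACP, 0, shortPath, -1, narrow, MAX_PATH, NULL, NULL);

    return magic_load(cookie, narrow) == 0;   // magic_load() returns 0 on success
}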

Related

How to read a file name containing 'œ' as a character in C/C++ on Windows

This post is not a duplicate of this one: dirent not working with unicode
Because here I'm using it on a different OS and I also don't want to do the same thing. The other thread simply tries to count the files, whereas I want to access the file name, which is more complex.
I'm trying to retrieve information from file names on a Windows 10 OS.
For this purpose I use dirent.h (an external C library, but still very useful in C++ too).
#include <dirent.h>   // external C library for directory listing
#include <iostream>
using namespace std;

DIR* directory = opendir(path);   // path: narrow (char*) folder path
struct dirent* direntStruct;
if (directory != NULL)
{
    while ((direntStruct = readdir(directory)) != NULL)
    {
        cout << direntStruct->d_name << endl;
    }
    closedir(directory);
}
This code is able to retrieve all files names located in a specific folder (one by one). And it works pretty well!
But when it encounters a file whose name contains the character 'œ', things go wrong:
Example:
grosse blessure au cœur.txt
is read in my program as:
GUODU0~6.TXT
I'm not able to find the original data in that string, because as you can see my string variable has nothing to do with the actual file name!
I can rename the file and it works, but I don't want to do this, I just need to read the data from that file name and it seems impossible. How can I do this?
On Windows you can use FindFirstFile() or FindFirstFileEx() followed by FindNextFile() to read the contents of a directory with Unicode in the returned file names.
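A minimal sketch of that, assuming the folder path is already held as a UTF-16 string (the helper name is hypothetical):

#include <windows.h>
#include <iostream>
#include <string>

void listFolder(const std::wstring& folder)
{
    WIN32_FIND_DATAW data;
    HANDLE h = FindFirstFileW((folder + L"\\*").c_str(), &data);
    if (h == INVALID_HANDLE_VALUE)
        return;
    do
    {
        // cFileName holds the full Unicode name, e.g. L"grosse blessure au cœur.txt"
        std::wcout << data.cFileName << L'\n';
    } while (FindNextFileW(h, &data));
    FindClose(h);
}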
Short File Name
The name you receive is the 8.3 short file name that NTFS generates for non-ASCII file names so that they can be accessed by programs that don't support Unicode.
clinging to dirent
If dirent doesn't support UTF-16, your best bet may be to change your library.
However, depending on the implementation of the library you may have luck with:
adding / changing the manifest of your application to support UTF-8 in char-based Windows APIs. This requires a very recent version of Windows 10.
see MSDN:
Use the UTF-8 code page under Windows - Apps - UWP - Design and UI - Usability - Globalization and localization.
setting the C++ Runtime's code page to UTF-8 using setlocale
I do not recommend this, and I don't know if this will work.
life is change
Use std::filesystem to enumerate directory content.
A simple example can be found here (see the "Update 2017").
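A minimal sketch with C++17 std::filesystem (the folder path below is just a placeholder):

#include <filesystem>
#include <iostream>

int main()
{
    for (const auto& entry : std::filesystem::directory_iterator(L"C:\\some\\folder"))
    {
        // The path keeps the native UTF-16 name on Windows; u8string() would give UTF-8.
        std::wcout << entry.path().filename().wstring() << L'\n';
    }
}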
Windows only
You can use FindFirstFileW and FindNextFileW as platform APIs that support UTF-16 strings. However, with std::filesystem there's little reason to do so (at least for your use case).
If you're in C, use the OS functions directly, specifically FindFirstFileW and FindNextFileW. Note the W at the end, you want to use the wide versions of these functions to get back the full non-ASCII name.
In C++ you have more options, specifically with Boost. You have classes like recursive_directory_iterator which allow cross-platform file searching, and they provide UTF-8/UTF-16 file names.
Edit: Just to be absolutely clear, the file name you get back from your original code is correct. Due to backwards compatibility in Windows filesystems (FAT32 and NTFS), every file has two names: the "full", Unicode aware name, and the "old" 8.3 name from DOS days.
You can absolutely use the 8.3 name if you want, just don't show it to your users or they'll be (correctly) confused. Or just use the proper, modern API to get the real name.

What function should I use to open a file in Mac OS X?

I use C++ Builder to create my cross-platform application.
In the app I get the file name/path from an open-file dialog.
On Windows there's no problem handling the Unicode string. (e.g. "C:\測試")
On Mac OS X, I can get the correct string from the UnicodeString, but I can't find a good way to convert it to a char array and use "fopen" to open the file correctly.
I tried to assign the UnicodeString directly to an AnsiString, but it became "C:\??".
Because "fopen" only accepts a "char*" and UnicodeString can only export a "char16*", I need to convert it to char for "fopen".
Any idea?
Just because fopen() takes a char* does not mean you should give it an ANSI string. POSIX APIs on OSX accept UTF-8 encoded filenames, so use UTF8String instead of AnsiString. A char* can point to a UTF-8 string.
Otherwise, don't use fopen() directly. Use the RTL's own functions instead, like FileCreate()/FileOpen() in System.SysUtils.hpp unit, or the TFileStream class in System.Classes.hpp unit. Let the RTL decide internally how to interact with platform APIs for you.
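A rough sketch of the first suggestion, assuming the C++Builder RTL's System::UTF8String and its c_str() member (the helper name is hypothetical):

#include <System.hpp>   // UnicodeString, UTF8String from the C++Builder RTL
#include <cstdio>

FILE* openUnicodePath(const System::UnicodeString& path)
{
    System::UTF8String utf8(path);            // the RTL converts UTF-16 to UTF-8
    return std::fopen(utf8.c_str(), "rb");    // POSIX fopen on macOS accepts UTF-8 paths
}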

Can't read Unicode (Japanese) from a file

Hi, I have a file containing Japanese text, saved as a Unicode file.
I need to read from the file and display the information on the standard output.
I am using Visual Studio 2008.
int main()
{
    wstring line;
    wifstream myfile("D:\sample.txt"); // file containing Japanese characters, saved as a Unicode file
    //myfile.imbue(locale("Japanese_Japan"));
    if(!myfile)
        cout << "While opening a file an error is encountered" << endl;
    else
        cout << "File is successfully opened" << endl;
    //wcout.imbue (locale("Japanese_Japan"));
    while ( myfile.good() )
    {
        getline(myfile,line);
        wcout << line << endl;
    }
    myfile.close();
    system("PAUSE");
    return 0;
}
This program generates some random output and I don't see any Japanese text on the screen.
Oh boy. Welcome to the Fun, Fun world of character encodings.
The first thing you need to know is that your console is not Unicode on Windows. The only way you'll ever see Japanese characters in a console application is if you set your non-Unicode (ANSI) locale to Japanese. That will also make backslashes look like yen signs and break paths containing European accented characters for programs using the ANSI Windows API (which was supposed to have been deprecated when Windows XP came around, but people still use it to this day...).
So first thing you'll want to do is build a GUI program instead. But I'll leave that as an exercise to the interested reader.
Second, there are a lot of ways to represent text. You first need to figure out the encoding in use. Is it UTF-8? UTF-16 (and if so, little or big endian)? Shift-JIS? EUC-JP? You can only use a wstream to read the file directly if it is little-endian UTF-16, and even then you need to futz with its internal buffer. Anything other than UTF-16 and you'll get unreadable junk. And this is all only the case on Windows; other OSes may have a different wstream representation. It's really best not to use wstreams at all.
So, let's assume it's not UTF-16 (for full generality). In that case you must read it as a char stream, not a wstream. You must then convert this character string into UTF-16 (assuming you're on Windows; other OSes tend to use UTF-8 char*s). On Windows this can be done with MultiByteToWideChar. Make sure you pass in the right code page value; CP_ACP or CP_OEMCP are almost always the wrong answer.
Now, you may be wondering how to determine which code page (i.e., character encoding) is correct. The short answer is: you don't. There is no prima facie way of looking at a text string and saying which encoding it is. Sure, there may be hints, e.g. if you see a byte order mark, chances are it's whatever variant of Unicode makes that mark. But in general, you have to be told by the user, or make an attempt to guess (relying on the user to correct you if you're wrong), or pick a fixed character set and not attempt to support any others.
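A minimal sketch of the MultiByteToWideChar step, assuming the file turned out to be UTF-8 (swap in the correct code page once you know it):

#include <windows.h>
#include <string>

std::wstring toUtf16(const std::string& bytes, UINT codePage /* e.g. CP_UTF8 */)
{
    if (bytes.empty())
        return std::wstring();
    // First call computes the required length in wchar_t units.
    int len = MultiByteToWideChar(codePage, 0, bytes.data(), (int)bytes.size(), NULL, 0);
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(codePage, 0, bytes.data(), (int)bytes.size(), &utf16[0], len);
    return utf16;
}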
Someone here had the same problem with Russian characters (he's using basic_ifstream<wchar_t>, which should be the same as wifstream according to this page). In the comments of that question they also link to this, which should help you further.
If I understood everything correctly, it seems that wifstream reads the characters correctly but your program tries to convert them to whatever locale your program is running in.
Two errors:
std::wifstream(L"D:\\sample.txt");
And do not mix cout and wcout.
Also check that your file is encoded in UTF-16, little-endian. If it is not, you will have trouble reading it.
wfstream uses a wfilebuf for the actual reading and writing of the data. wfilebuf defaults to using a char buffer internally, which means the text in the file is assumed to be narrow and is converted to wide before you see it. Since the text was actually wide, you get a mess.
The solution is to replace the wfilebuf buffer with a wide one.
You probably also need to open the file as binary.
const size_t bufsize = 128;
wchar_t buffer[bufsize];
wifstream myfile("D:\\sample.txt", ios::binary);
myfile.rdbuf()->pubsetbuf(buffer, bufsize);
Make sure the buffer outlives the stream object!
See details here: http://msdn.microsoft.com/en-us/library/tzf8k3z8(v=VS.80).aspx

Reading a file with Cyrillic

I have to open a file with Cyrillic symbols. I've encoded the file into UTF-8. Here is an example:
en: Couldn't your family afford a
costume for you
ru: Не ваша семья
позволить себе костюм для вас
Here is how I open the file:
ifstream readFile(fileData.c_str());
while (!readFile.eof())
{
    std::getline(readFile, buffer);
    ...
}
The first problem: there is some symbol before the text 'en' (I saw this in the debugger):
"en: least"
And the other problem is the Cyrillic symbols:
" ru: наименьший"
What's wrong?
there is some symbol before the text 'en'
That's a faux-BOM, the result of encoding a U+FEFF BYTE ORDER MARK character into UTF-8.
Since UTF-8 is an encoding that does not have a byte order, the faux-BOM shouldn't ever be used, but unfortunately quite a bit of existing software (especially in the MS world) uses it nonetheless. Load the messages file into a text editor and save it back out as UTF-8, choosing a "UTF-8 without BOM" encoding if one is explicitly listed.
ru: наименьший
That's what you get when you've got a UTF-8 byte string (representing наименьший) and you print it as if it were a Code Page 1252 (Windows Western European) byte string. It's not an input problem; you have read in the string OK and have a UTF-8 byte string. But then, in code you haven't quoted, it gets output as cp1252.
If you're just printing it to the console, this is to be expected, as the console always uses the system default code page (1252 on a Western Windows install), and not UTF-8. If you need to send Unicode to the console you'll have to convert the bytes to native-Unicode wchar_ts and write them from there. I don't know what the final destination for your strings is, though... if you're just going to write them to another file or something, you could just keep them as bytes and not care about what encoding they're in.
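If re-saving the file without the BOM isn't an option, here is a small sketch (hypothetical helper) that strips a leading UTF-8 BOM from the first line you read:

#include <string>

void stripUtf8Bom(std::string& line)
{
    // A UTF-8 encoded BOM is the three bytes EF BB BF.
    if (line.size() >= 3 && line.compare(0, 3, "\xEF\xBB\xBF") == 0)
        line.erase(0, 3);
}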
I suppose that your OS is Windows. Several simple approaches exist:
Use wchar_t, wstring, wifstream, etc.
Use the ICU library.
Use some other library (there really are many of them).
Note: for console printing you must use WinAPI functions to convert UTF-8 to cp866 (my default Cyrillic Windows encoding is cp1251), because the Windows console supports only DOS encodings.
Note: for file printing you need to know what encoding your file uses.
Use libiconv to convert the text to a usable encoding after reading.
Use icu to convert the text.
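A sketch of the libiconv route, assuming the text was read as raw UTF-8 bytes; the target encoding is whatever the consumer needs (e.g. "CP1251"), and the helper name is hypothetical:

#include <iconv.h>
#include <string>
#include <vector>

std::string convertEncoding(const std::string& in, const char* from, const char* to)
{
    iconv_t cd = iconv_open(to, from);          // e.g. iconv_open("CP1251", "UTF-8")
    if (cd == (iconv_t)-1)
        return std::string();

    std::vector<char> out(in.size() * 4 + 4);   // generous output buffer
    char* inPtr    = const_cast<char*>(in.data());
    char* outPtr   = out.data();
    size_t inLeft  = in.size();
    size_t outLeft = out.size();

    iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);  // error handling omitted for brevity
    iconv_close(cd);
    return std::string(out.data(), out.size() - outLeft);
}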

_wfopen equivalent under Mac OS X

I'm looking for the equivalent of Windows' _wfopen() under Mac OS X. Any idea?
I need this in order to port a Windows library that uses wchar_t* for its file interface. As this is intended to be a cross-platform library, I am unable to rely on how the client application will get the file path and give it to the library.
The POSIX APIs in Mac OS X are usable with UTF-8 strings. In order to convert a wchar_t string to UTF-8, you can use the CoreFoundation framework.
Here is a class that wraps a UTF-8 string generated from a wchar_t string.
#include <CoreFoundation/CoreFoundation.h>
#include <cwchar>

class Utf8
{
public:
    Utf8(const wchar_t* wsz): m_utf8(NULL)
    {
        // OS X uses 32-bit wchar
        const int bytes = wcslen(wsz) * sizeof(wchar_t);
        // comp_bLittleEndian is in the lib I use in order to detect PowerPC/Intel
        CFStringEncoding encoding = comp_bLittleEndian ? kCFStringEncodingUTF32LE
                                                       : kCFStringEncodingUTF32BE;
        CFStringRef str = CFStringCreateWithBytesNoCopy(NULL,
                                                        (const UInt8*)wsz, bytes,
                                                        encoding, false,
                                                        kCFAllocatorNull);
        const int bytesUtf8 = CFStringGetMaximumSizeOfFileSystemRepresentation(str);
        m_utf8 = new char[bytesUtf8];
        CFStringGetFileSystemRepresentation(str, m_utf8, bytesUtf8);
        CFRelease(str);
    }

    ~Utf8()
    {
        if( m_utf8 )
        {
            delete[] m_utf8;
        }
    }

public:
    operator const char*() const { return m_utf8; }

private:
    char* m_utf8;
};
Usage:
const wchar_t* wsz = L"Here is some Unicode content: éà€œæ";
const Utf8 utf8 = wsz;
FILE* file = fopen(utf8, "r");
This will work for reading or writing files.
You just want to open a file handle using a path that may contain Unicode characters, right? Just pass the path in filesystem representation to fopen.
If the path came from the stock Mac OS X frameworks (for example, an Open panel whether Carbon or Cocoa), you won't need to do any conversion on it and will be able to use it as-is.
If you're generating part of the path yourself, you should create a CFStringRef from your path and then get that in filesystem representation to pass to POSIX APIs like open or fopen.
Generally speaking, you won't have to do a lot of that for most applications. For example, many applications may have auxiliary data files stored in the user's Application Support directory, but as long as the names of those files are ASCII, and you use standard Mac OS X APIs to locate the user's Application Support directory, you don't need to do a bunch of paranoid conversion of a path constructed with those two components.
Edited to add: I would strongly caution against arbitrarily converting everything to UTF-8 using something like wcstombs because filesystem encoding is not necessarily identical to the generated UTF-8. Mac OS X and Windows both use specific (but different) canonical decomposition rules for the encoding used in filesystem paths.
For example, they need to decide whether "é" will be stored as one or two code units (either LATIN SMALL LETTER E WITH ACUTE or LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT). These will result in two different — and different-length — byte sequences, and both Mac OS X and Windows work to avoid putting multiple files with the same name (as the user perceives them) in the same directory.
The rules for how to perform this canonical decomposition can get pretty hairy, so rather than try to implement it yourself it's best to leave it to the functions the system frameworks have provided for you to do the heavy lifting.
#JKP:
Not all functions in Mac OS X accept UTF-8, but file names and file paths may be UTF-8; thus all POSIX functions dealing with file access (open, fopen, stat, etc.) accept UTF-8.
See here. Quote:
How a file name looks at the API level depends on the API. Current Carbon APIs handle file names as an array of UTF-16 characters; POSIX ones handle them as an array of UTF-8, which is why UTF-8 works well in Terminal. How it's stored on disk depends on the disk format; HFS+ uses UTF-16, but that's not important in most cases.
Some other POSIX functions handle UTF-8 as well, e.g. functions dealing with user names, group names or user passwords use UTF-8 to store the information (thus a user name can be Japanese and your password can be Chinese, no problem).
But not all of them handle UTF-8. For string functions, for example, a UTF-8 string is just a normal C string, and characters above 126 have no special meaning; they don't understand the concept of multiple bytes (chars in C) forming a single Unicode character. How other APIs handle a char * pointer passed to them differs from API to API. However, as a rule of thumb you can say:
Either the function only accepts C strings with pure ASCII characters (only in the range 0 to 126) or it accepts UTF-8. Functions usually don't accept characters above 126 and interpret them in any encoding other than UTF-8; if one really does, that is documented, and then there must be a way to pass the encoding along with the string.
If you're using Cocoa it's fairly easy with NSString. Just load the UTF-16 data using -initWithBytes:length:encoding: (or perhaps -initWithCString:encoding:) and then get a UTF-8 version by calling UTF8String on the result. Then just call fopen with your new UTF-8 string as the parameter.
You can definitely call fopen with a UTF-8 string, regardless of language - can't help with C++ on OSX though - sorry.
I read a file name from a UTF-8 configuration file through a wifstream (it uses a wchar_t buffer).
The Mac implementation is different from Linux and Windows.
wifstream reads each byte of the file into a separate wchar_t cell of the buffer, so we end up with three empty bytes per character, even though open requires a char string. The programmer can therefore use the wcstombs function to convert the wide-character string to a multi-byte string.
The API supports UTF-8. For a better understanding, use a memory watcher and a hex editor on your file.
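A sketch of that wcstombs conversion; note that wcstombs uses the current C locale, which is assumed here to be a UTF-8 one (the helper name is hypothetical):

#include <clocale>
#include <cstdlib>
#include <string>

std::string toMultiByte(const std::wstring& wide)
{
    std::setlocale(LC_CTYPE, "en_US.UTF-8");          // assumed UTF-8 locale name on macOS
    std::string out(wide.size() * 4 + 1, '\0');       // worst case: 4 bytes per character
    size_t n = std::wcstombs(&out[0], wide.c_str(), out.size());
    if (n == (size_t)-1)
        return std::string();                         // an unconvertible character was found
    out.resize(n);
    return out;
}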