Trying to test unicode filename retrieval/conversion to/from UTF8-UTF16 - c++

I've looked at a lot of examples for WideCharToMultiByte, etc. This question is more about testing.
I downloaded another language set, Chinese, Simplified China to my machine. Then using the virtual keyboard I created a directory on C:\ with some Chinese characters in the path, and placed a file inside the directory.
I'm trying to see that I get the correct filename by testing _wfopen with my path. I also have the same file in another location for testing:
//setlocale(LC_ALL, "zh-CN");
//setlocale(LC_ALL, "Chinese_China.936");
setlocale(LC_ALL, "");
wchar_t* outfilename = L"C:\\特殊他\\和阿涛和润\\bracket3holes.sat";
//wchar_t* outfilename = L"C:\\heather\\bracket3holes.sat";
wchar_t w[] = L"r";
FILE* foo = _wfopen(outfilename, w);
First I tried without setting locale, then I tried various combinations of setting locale to the language I downloaded (therefore the language of the path).
_wfopen works fine with the C:\heather path, but always returns a NULL pointer with the unicode path.
What am I missing? Any insight would be greatly appreciated. Note my code must be compilable back to vc9.
--- Based on the feedback, I saved the file as UTF-8 with BOM, added const before the wchar_t declarations, and now in the debugger I do see the right string and the file pointer is no longer null.
Thank you for your help. I'm still trying to wrap my head around this all, we're trying to transition from const char* to unicode-friendly.

Related

Can't display 'ä' in GLFW window title

glfwSetWindowTitle(win, "Nämen");
Becomes "N?men", where '?' is in a little black, twisted square, indicating that the character could not be displayed.
How do I display 'ä'?
If you want to use non-ASCII letters in the window title, then the string has to be utf-8 encoded.
GLFW: Window title:
The window title is a regular C string using the UTF-8 encoding. This means for example that, as long as your source file is encoded as UTF-8, you can use any Unicode characters.
If you see a little black, twisted square then this indicates that the ä is encoded with some iso encoding that is not UTF-8, maybe something like latin1. To fix this you need to open it in the editor in which you can change the encoding of the file, change it to uft-8 (without BOM) and fix the ä in the title.
It seems like the GLFW implementation does not work according to the specification in this case. Probably the function still uses Latin-1 instead of UTF-8.
I had the same problem on GLFW 3.3 Windows 64 bit precompiled binaries and fixed it like this:
SetWindowTextA(glfwGetWin32Window(win),"Nämen")
The issue does not lie within GLFW but within the compiler. Encodings are handled by the major compilers as follows:
Good guy clang assumes that every file is encoded in UTF-8
Trusty gcc checks the system's settings1 and falls back on UTF-8, when it fails to determine one.
MSVC checks for BOM and uses the detected encoding; otherwise it assumes that file is encoded using the users current code page2.
You can determine your current code page by simply running chcp in your (Windows) console or PowerShell. For me, on a fresh install of Windows 11, it yields "850" (language: English; keyboard: German), which stands for Code Page 850.
To fix this issue you have several solutions:
Change your systems code page. Arguably the worst solution.
Prefix your strings with u8, escape all unicode literals and convert the string to wide char before passing Win32 functions; e.g.:
const char* title = u8"\u0421\u043b\u0430\u0432\u0430\u0020\u0423\u043a\u0440\u0430\u0457\u043d\u0456\u0021";
// This conversion is actually performed by GLFW; see footnote ^3
const int l = MultiByteToWideChar(CP_UTF8, 0, title, -1, NULL, 0);
wchar_t* buf = _malloca(l * sizeof(wchar_t));
MultiByteToWideChar(CP_UTF8, 0, title, -1, buf, l);
SetWindowTextW(hWnd, buf);
_freea(buf);
Save your source files with UTF-8 encoding WITH BOM. This allows you to write your strings without having to escape them. You'll still need to convert the string to a wide char string using the method seen above.
Specify the /utf-8 flag when compiling; this has the same effect as the previous solution, but you don't need the BOM anymore.
The solutions stated above still require you convert your good and nice string to a big chunky wide string.
Another option would be to provide a manifest4 with the activeCodePage set to UTF-8. This way all Win32 functions with the A-suffix (e.g. SetWindowTextA) now accept and properly handle UTF-8 strings, if the running system is at least or newer than Windows Version 1903.
TL;DR
Compile your application with the /utf-8 flag active.
IMPORTANT: This works for the Win32 APIs. This doesn't let you magically write Unicode emojis to the console like a hipster JS developer.
1 I suppose it reads the LC_ALL setting on linux. In the last 6 years I have never seen a distribution, that does NOT specify UTF-8. However, take this information with a grain of salt; I might be entirely wrong on how gcc handles this now.
2 If no byte-order mark is found, it assumes that the source file is encoded in the current user code page [...].
3 GLFW performs the conversion as seen here.
4 More about Win32 ANSI-APIs, Manifest and UTF-8 can be found here.

SDL2 loading files with special characters

I got a problem, that is: in a Windows application using SDL2 & SDL2_Image, it opens image files, for later saving them with modifications on the image data.
When it opens an image without special characters (like áéíóúñ, say, "buenos aires.jpg") it works as intended. But, if there is any special character as mentioned (say, "córdoba.jpg"), SDL_Image generates an error saying "Couldn't open file". Whatever, if i use the std::ifstream flux with the exact file name that i got from the CSV file (redundant, as "córdoba.jpg" or "misiónes.jpg"), the ifstream works well... Is it an error using the special characters? UNICODE, UTF, have something to do?
A little information about the environment: Windows 10 (spanish, latin american), SDL2 & SDL2_Image (up to date versions), GCC compiler using Mingw64 7.1.0
About the software I'm trying to make: it uses a CSV form, with the names of various states of Argentina, already tried changing encoding on the .CSV. It loads images based on the names found on the CSV, changes them, and saves.
I know maybe I am missing something basic, but already depleted my resources.
IMG_Load() forwards its file argument directly to SDL_RWFromFile():
// http://hg.libsdl.org/SDL_image/file/8fee51506499/IMG.c#l125
SDL_Surface *IMG_Load(const char *file)
{
SDL_RWops *src = SDL_RWFromFile(file, "rb");
const char *ext = SDL_strrchr(file, '.');
if(ext) {
ext++;
}
if(!src) {
/* The error message has been set in SDL_RWFromFile */
return NULL;
}
return IMG_LoadTyped_RW(src, 1, ext);
}
And SDL_RWFromFile()'s file argument should be a UTF-8 string:
SDL_RWops* SDL_RWFromFile(const char* file,
const char* mode)
Function Parameters:
file: a UTF-8 string representing the filename to open
mode: an ASCII string representing the mode to be used for opening the file; see Remarks for details
So pass UTF-8 paths into IMG_Load().
C++11 has UTF-8 string literal support built-in via the u8 prefix:
IMG_Load( u8"córdoba.jpg" );

CStdioFile problems with encoding on read file

I can't read a file correctly using CStdioFile.
I open notepad.exe, I type àèìòùáéíóú and I save twice, once I set codification as ANSI (really is CP-1252) and other as UTF-8.
Then I try to read it from MFC with the following block of code
BOOL ReadAllFileContent(const CString &FilePath, CString *fileContent)
{
CString sLine;
BOOL isSuccess = false;
CStdioFile input;
isSuccess = input.Open(FilePath, CFile::modeRead);
if (isSuccess) {
while (input.ReadString(sLine)) {
fileContent->Append(sLine);
}
input.Close();
}
return isSuccess;
}
When I call it, with ANSI file I've got the expected result àèìòùáéíóú
but when I try to read the UTF8 encoded file I've got à èìòùáéíóú
I would like my function works with all files regardless of the encoding.
Why I need to implement?
.EDIT.
Unfortunately, in the real app, files come from external app so change the file encoding isn't an option.I must be able to read both UTF-8 and CP-1252 files.
Any file is valid ANSI, what notepad told ANSI is really Windows-1252 encode.
I've figured out a way to read UTF-8 and CP-1252 right based on the example provided here. Although it works, I need to pass the file encode which I don't know in advance.
Thnks!
I personally use the class as advertised here:
https://www.codeproject.com/Articles/7958/CTextFileDocument
It has excellent support for reading and writing text files of various encodings including unicode in its various flavours.
I have not had a problem with it.

Understanding Multibyte/Unicode

I'm just getting back into Programming C++, MFC, Unicode. Lots have changed over the past 20 years.
Code on another project compiled just fine, but had errors when I paste it into my code. It took me 1-1/2 days of wasted time to solve the function call below:
enter code here
CString CFileOperation::ChangeFileName(CString sFileName)
{
char drive[MAX_PATH], dir[MAX_PATH], name[MAX_PATH], ext[MAX_PATH];
_splitpath_s(sFileName, drive, dir, name, ext); //error
------- other code
}
After reading help, I changed the CString sFileName to use a cast:
enter code here
_splitpath_s((LPTCSTR)sFileName, drive, dir, name, ext); //error
This created an error too. So then I used GetBuffer() which is really the same as above.
enter code here
char* s = sFileName.GetBuffer(300);
_splitpath_s(s, drive, dir, name, ext); //same error for the 3rd time
sFileName.ReleaseBuffer();
At this point I was pretty upset, but finally realized that I needed to change the CString to Ascii (I think because I'm set up as Unicode).
hence;
enter code here
CT2A strAscii(sFileName); //convert CString to ascii, for splitpath()
then use strAscii.m_pz in the function _splitpath_s()
This finally worked. So after all this, to make a story short, I need help focusing on:
1. Unicode vs Mulit-Byte (library calls)
2. Variables to uses
I'm willing to purchase another book, please recommend.
Also, is there a way to filter my help on VS2015 so that when I'm on a variable and press F1, it only gives me help for Unicode and ways to convert old code to unicode or convert Mylti-Byte to Unicode.
Hope this is not to confusing, but I have some catching up to do. Be patient if my verbiage is not perfect.
Thanks in advance.
The documentation of _splitpath lists a Unicode (wchar_t based) version _wsplitpath. That's the one you should be using. Don't convert to ASCII or Windows ANSI, that will in general lose information and not produce a valid path when you recombine the pieces.
Modern Windows programming is Unicode based.
A Visual Studio C++ project is Unicode-based by default, in particular it defines the macro symbol UNICODE, which affects the declarations from <windows.h>.
All supported versions of Windows use Unicode internally throughout, and your application should, too. Windows uses UTF-16 encoding.
To make your application Unicode-enabled you need to perform the following steps:
Set up your project's Character Set to "Use Unicode Character Set" (if it's currently set to "Use Multi-Byte Character Set"). This is not strictly required, but it deals with those cases, where you aren't using the Unicode version explicitly.
Use wchar_t (in place of char or TCHAR) for your strings.
Use wide character string literals (L"..." in place of "...").
Use CStringW (in place of CStringA or CString) in an MFC project.
Explicitly call the Unicode version of the CRT (e.g. wcslen in place of strlen or _tcslen).
Explicitly call the Unicode version of any Windows API call where it exists (e.g. CreateWindowExW in place of CreateWindowExA or CreateWindowEx).
Try using _tsplitpath_s and TCHAR.
So the final code looks something like:
CString CFileOperation::ChangeFileName(CString sFileName)
{
TCHAR drive[MAX_PATH], dir[MAX_PATH], name[MAX_PATH], ext[MAX_PATH];
_tsplitpath_s(sFileName, drive, dir, name, ext); //error
------- other code
}
This will enable C++ compiler to use the correct character width during build time depending on the project settings

Windows usage of char * functions with UTF-16

I port one application from Linux to Windows.
On Linux I use libmagic library from which I wouldn't be glad to rid of on Windows.
The problem is that I need pass name of file that is held in UTF-16 encoding to such function:
int magic_load(magic_t cookie, const char *filename);
Unfortunately it accepts only const char *filename. My first idea was to convert UTF-16 string to local encoding, but there are some problems - like string can contain e.g. Chinese symbols and local encoding may be Russian.
As result we will get trash on the output and program will not reach its aim.
Converting into UTF-8 doesn't help either, because this is Windows and Windows holds file name in UTF-16.
But I somehow need make that function able to open file with Unicode name.
I came only to one very very bad solution:
1. I have a filename
2. I can copy file with unicode name to file with ASCII name like "1.mp3"
3. open it with libmagic functions and get what I want
4. remove temporarily file
But I understand how this solution is bad and how it could make my application slower, so I wonder, perhaps are there some better ways to do it?
Thanks in advance for any tips, 'cause I'm really confused with it.
Use 8.3 file names to access the files.
In addition to long file names up to 255 characters in length, Windows also generates an MS-DOS-compatible (short) file name in 8.3 format.
http://support.microsoft.com/kb/142982