OSX and C++ unicode conversion from NFD to NFC - c++

I have a problem with the NFD Unicode strings I get from the OSX filesystem.
For the "Ä" umlaut, OSX gives me "A\xcc\x88", but I expect "\xc3\x84". The same function returns the expected result under Windows (a simple boost filesystem operation, listing a directory).
After searching for a while, I found out that Apple uses the NFD (decomposed) form for UTF-8 file names, while the rest of the world uses NFC (precomposed). I experimented with converting through NSString and with boost::locale::normalize, but without success.
Does anybody know a way to do this in C++ (I can use Cocoa through Objective-C if necessary)?
Afterwards, I would like the raw Unicode string as a UTF-8 encoded std::string.

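This is the solution to get the precomposed form.
#include <CoreFoundation/CoreFoundation.h>
#include <string>
#include <vector>
std::string precomposeFilename(const std::string& name)
{
    // CFStringCreateWithCString returns NULL if name is not valid UTF-8.
    CFStringRef cfStringRef = CFStringCreateWithCString(kCFAllocatorDefault, name.c_str(), kCFStringEncodingUTF8);
    if (!cfStringRef) return name;
    CFMutableStringRef cfMutable = CFStringCreateMutableCopy(kCFAllocatorDefault, 0, cfStringRef);
    CFRelease(cfStringRef);
    CFStringNormalize(cfMutable, kCFStringNormalizationFormC);
    // Size the buffer from the string length instead of assuming 255 bytes;
    // the size passed to CFStringGetCString includes the NUL terminator.
    CFIndex size = CFStringGetMaximumSizeForEncoding(CFStringGetLength(cfMutable), kCFStringEncodingUTF8) + 1;
    std::vector<char> buffer(static_cast<std::size_t>(size));
    bool ok = CFStringGetCString(cfMutable, buffer.data(), size, kCFStringEncodingUTF8);
    CFRelease(cfMutable);
    return ok ? std::string(buffer.data()) : name;
}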

NSString has the method - (NSString *)precomposedStringWithCanonicalMapping, which returns the NFC form, plus a few related methods; they look like they will help you.
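To see why the two forms of "Ä" do not compare equal, it can help to look at the raw bytes. The following is a minimal, portable sketch that only inspects UTF-8; it does not normalize anything. The helper names countCodePoints and toHex are made up for this illustration, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Formats each byte as two lowercase hex digits, separated by spaces.
std::string toHex(const std::string& bytes)
{
    std::string out;
    char buf[4];
    for (unsigned char byte : bytes) {
        std::snprintf(buf, sizeof(buf), "%02x", static_cast<unsigned>(byte));
        if (!out.empty()) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

For example, countCodePoints("A\xcc\x88") returns 2 (U+0041 followed by the combining diaeresis U+0308), while countCodePoints("\xc3\x84") returns 1 (U+00C4). Both render as "Ä", but a byte-wise std::string comparison treats them as different strings, which is why the normalization step is needed.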

Related

Understanding Multibyte/Unicode

I'm just getting back into programming with C++, MFC, and Unicode. A lot has changed over the past 20 years.
Code from another project compiled just fine, but produced errors when I pasted it into my code. It took me a day and a half of wasted time to solve the function call below:
CString CFileOperation::ChangeFileName(CString sFileName)
{
char drive[MAX_PATH], dir[MAX_PATH], name[MAX_PATH], ext[MAX_PATH];
_splitpath_s(sFileName, drive, dir, name, ext); //error
------- other code
}
After reading help, I changed the CString sFileName to use a cast:
_splitpath_s((LPCTSTR)sFileName, drive, dir, name, ext); //error
This created an error too. So then I used GetBuffer() which is really the same as above.
char* s = sFileName.GetBuffer(300);
_splitpath_s(s, drive, dir, name, ext); //same error for the 3rd time
sFileName.ReleaseBuffer();
At this point I was pretty upset, but I finally realized that I needed to convert the CString to ASCII (I think because my project is set up as Unicode).
hence;
CT2A strAscii(sFileName); // convert CString to ASCII, for _splitpath_s()
then use strAscii.m_psz in the function _splitpath_s()
This finally worked. So after all this, to make a long story short, I need help focusing on:
1. Unicode vs Multi-Byte (library calls)
2. Variables to use
I'm willing to purchase another book; please recommend one.
Also, is there a way to filter the help in VS2015 so that when I'm on a variable and press F1, it only shows help for Unicode, and ways to convert old code to Unicode or convert Multi-Byte to Unicode?
Hope this is not too confusing, but I have some catching up to do. Be patient if my verbiage is not perfect.
Thanks in advance.
The documentation of _splitpath lists a Unicode (wchar_t based) version, _wsplitpath. That's the one you should be using. Don't convert to ASCII or Windows ANSI: that will in general lose information and not produce a valid path when you recombine the pieces.
Modern Windows programming is Unicode based.
A Visual Studio C++ project is Unicode-based by default; in particular, it defines the macro symbol UNICODE, which affects the declarations from <windows.h>.
All supported versions of Windows use Unicode internally throughout, and your application should, too. Windows uses UTF-16 encoding.
To make your application Unicode-enabled you need to perform the following steps:
1. Set up your project's Character Set to "Use Unicode Character Set" (if it's currently set to "Use Multi-Byte Character Set"). This is not strictly required, but it deals with those cases where you aren't using the Unicode version explicitly.
2. Use wchar_t (in place of char or TCHAR) for your strings.
3. Use wide character string literals (L"..." in place of "...").
4. Use CStringW (in place of CStringA or CString) in an MFC project.
5. Explicitly call the Unicode version of the CRT (e.g. wcslen in place of strlen or _tcslen).
6. Explicitly call the Unicode version of any Windows API call where it exists (e.g. CreateWindowExW in place of CreateWindowExA or CreateWindowEx).
Try using _tsplitpath_s and TCHAR.
So the final code looks something like:
CString CFileOperation::ChangeFileName(CString sFileName)
{
TCHAR drive[MAX_PATH], dir[MAX_PATH], name[MAX_PATH], ext[MAX_PATH];
_tsplitpath_s(sFileName, drive, dir, name, ext); // TCHAR buffers match the project's character set
------- other code
}
This lets the C++ compiler pick the correct character width at build time, depending on the project settings.
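The build-time switch behind TCHAR and the _t-prefixed functions can be sketched in portable C++. This is only an illustration of the mechanism, not the contents of <tchar.h>: the names DEMO_UNICODE, demo_tchar, DEMO_T, demo_tcslen and extensionLength are hypothetical stand-ins for _UNICODE, TCHAR, _T and _tcslen.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins: DEMO_UNICODE plays the role that _UNICODE plays
// for the real generic-text mappings.
#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define DEMO_T(x) L##x
inline std::size_t demo_tcslen(const demo_tchar* s) { return std::wcslen(s); }
#else
typedef char demo_tchar;
#define DEMO_T(x) x
inline std::size_t demo_tcslen(const demo_tchar* s) { return std::strlen(s); }
#endif

// The same source line compiles against either character type,
// depending on whether DEMO_UNICODE is defined for the build.
inline std::size_t extensionLength()
{
    const demo_tchar* ext = DEMO_T(".txt");
    return demo_tcslen(ext);
}
```

The source code stays the same in both configurations; only the macro definition decides whether it works on char or wchar_t. This is also why mixing a plain char buffer with a CString in a Unicode build fails to compile: the two sides end up with different character types.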

How to get the locale of the current user in OSX using C++

I have a C++ application where I need to retrieve the locale of the current user. How can I do it with OSX Yosemite and newer?
I've tried something like setlocale(LC_CTYPE, NULL); but it just returns UTF-8, although my system is clearly set to Spanish (es_AR).
After some trial and error and a lot of help from the internet and other questions, I got it working.
To get only the language:
CFLocaleRef cflocale = CFLocaleCopyCurrent();
CFStringRef value = (CFStringRef)CFLocaleGetValue(cflocale, kCFLocaleLanguageCode);
// CFStringGetCStringPtr may return NULL, so copy into a buffer instead.
char buffer[64] = "";
if (!value || !CFStringGetCString(value, buffer, sizeof(buffer), kCFStringEncodingUTF8))
    buffer[0] = '\0';
std::string str(buffer);
CFRelease(cflocale);
This way, str holds the language code as a std::string. If I need something else, I can replace kCFLocaleLanguageCode with any other key constant from CFLocale.
I also needed the header: #include <CoreFoundation/CoreFoundation.h>

Convert wide CString to char*

This question has been asked many times, with as many answers, none of which work for me or, it seems, for many others. The question is about wide CStrings and 8-bit chars under MFC. We all want an answer that works in ALL cases, not just a specific instance.
void Dosomething(CString csFileName)
{
char cLocFileNamestr[1024];
char cIntFileNamestr[1024];
// Convert from whatever version of CString is supplied
// to an 8 bit char string
cIntFileNamestr = ConvertCStochar(csFileName);
sprintf_s(cLocFileNamestr, "%s_%s", cIntFileNamestr, "pling.txt" );
m_KFile = fopen(cLocFileNamestr, "wt");
}
This is an addition to existing code (by somebody else) for debugging.
I don't want to change the function signature, it is used in many places.
I cannot change the signature of sprintf_s, it is a library function.
You are leaving out a lot of details, or ignoring them. If you are building with UNICODE defined (which it seems you are), then the easiest way to convert to MBCS is like this:
CStringA strAIntFileNameStr = csFileName.GetString(); // uses default code page
CStringA is the 8-bit/MBCS version of CString.
However, the result will contain garbage characters if the Unicode string you are converting from contains characters that are not in the default code page.
Instead of using fopen(), you could use _wfopen(), which opens a file with a Unicode filename. To build the file name, you would use swprintf_s().
"an answer that will work in ALL cases, not a specific instance..."
There is no such thing.
It's easy to convert "ABCD..." from wchar_t* to char*, but it doesn't work that way for non-Latin text.
Stick to CString and wchar_t when your project is Unicode.
If you need to upload data to a web page or similar, use CW2A and CA2W to convert between UTF-16 and UTF-8.
CStringW unicode = L"Россия";
MessageBoxW(0,unicode,L"Russian",0);//should be okay
CStringA utf8 = CW2A(unicode, CP_UTF8);
::MessageBoxA(0,utf8,"format error",0);//WinApi doesn't get UTF-8
char buf[1024];
strcpy(buf, utf8);
::MessageBoxA(0,buf,"format error",0);//same problem
//send this buf to webpage or other utf-8 systems
//this should be compatible with notepad etc.
//text will appear correctly
ofstream f(L"c:\\stuff\\okay.txt");
f.write(buf, strlen(buf));
//convert utf8 back to utf16
unicode = CA2W(buf, CP_UTF8);
::MessageBoxW(0,unicode,L"okay",0);

How do I convert wchar_t* to string?

I am new to C++.
I am trying to convert a wchar_t* to a string.
I cannot use wstring in this situation.
I have code below:
const wchar_t *wide = L"中文";
wstring ret = wstring( wide );
string str2( ret.begin(), ret.end() );
But str2 contains some strange characters.
What do I have to fix?
You're trying to do it backwards. Instead of truncating wide characters to chars (which is very lossy), expand your chars to wide characters.
That is, transform your std::string into an std::wstring and concatenate the two std::wstrings.
I'm not sure what platform you're targeting. If you're on Windows, you can call the WideCharToMultiByte API function; refer to MSDN for the documentation.
If you're on Linux, I think you can use the libiconv functions; search for them online.
There is also a port of libiconv for Windows.
In general, this is quite a complex topic for a beginner who knows nothing about character encodings; there is a lot of background knowledge to learn.
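The strange characters in the question come from string(ret.begin(), ret.end()), which converts each wchar_t to a char individually and discards the high bits. What is usually wanted instead is a re-encoding, for example to UTF-8. The following is a hand-written, portable sketch of that re-encoding, meant as a stand-in for what WideCharToMultiByte with CP_UTF8 or libiconv would do. The function names appendUtf8 and wideToUtf8 are hypothetical, and the code assumes that wchar_t holds UTF-16 when it is 16 bits wide and UTF-32 otherwise. It does not validate its input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends one Unicode code point to out, encoded as UTF-8.
void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts a wide string to UTF-8. With a 16-bit wchar_t, surrogate pairs
// are combined into one code point first; with a 32-bit wchar_t, each
// element is already a full code point.
std::string wideToUtf8(const std::wstring& wide)
{
    std::string out;
    for (std::size_t i = 0; i < wide.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(wide[i]);
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < wide.size()) {
            const std::uint32_t low = static_cast<std::uint32_t>(wide[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i; // the low surrogate has been consumed
            }
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

With this sketch, wideToUtf8(L"中文") yields the six bytes e4 b8 ad e6 96 87, which is the UTF-8 encoding of the two characters. Whether UTF-8 is the right target depends on where the std::string goes next; a Windows ANSI API would expect the active code page instead.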

Get Strings in right encoding in c++

I asked a similar question before.
But I am still having trouble with encodings in C++.
I'll try to describe the problem as well as possible.
I have a C++ client communicating with a C# service over TCP.
Now I need to display the messages from the service in a message box (Win32 API).
The bytes sent by the C# service are UTF-8 encoded.
Important to know: the C++ client will only run on Windows systems.
This is the code to receive the bytes and to display the Text:
char buffer[1024];
int receivedBytes = recv(socketHandle, buffer, sizeof(buffer) - 1, 0);
char str[receivedBytes];
for (int index = 0; index < receivedBytes; index++)
{
str[index] = buffer[index];
}
MessageBox(mainWindow, (LPCTSTR)str, (LPCTSTR) "Fehler", MB_OK|MB_ICONERROR);
If the text contains characters like üäö, they are not shown correctly in the message box.
What can I do to receive the message as a UTF-8 string in C++?
Is it possible to convert the char[] to a UTF-8 string?
Thx for helping
Tobi
If you want to display Unicode characters in Windows, you need to convert the UTF-8 string to UTF-16 (formerly UCS-2), as this is the Unicode encoding Windows uses. You do that with the MultiByteToWideChar function.
Also make sure that UNICODE is defined before you include the Windows headers, so that MessageBox maps to MessageBoxW, or call MessageBoxW explicitly.
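To show what the UTF-8 to UTF-16 step does to the received bytes, here is a portable, hand-written sketch of the decoding. It is an illustration only: on Windows the actual conversion would be done by MultiByteToWideChar with CP_UTF8. The function name utf8ToUtf16 is hypothetical. Malformed input is replaced with U+FFFD, and overlong encodings are not rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes UTF-8 bytes into UTF-16 code units.
std::u16string utf8ToUtf16(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0xFFFD; // replacement character for malformed input
        std::size_t extra = 0;     // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        ++i;

        // Fold in the continuation bytes (10xxxxxx), six bits each.
        std::size_t seen = 0;
        for (; seen < extra && i < utf8.size(); ++seen, ++i) {
            const unsigned char next = static_cast<unsigned char>(utf8[i]);
            if ((next & 0xC0) != 0x80) {
                break;
            }
            cp = (cp << 6) | (next & 0x3F);
        }
        if (seen != extra || cp > 0x10FFFF) {
            cp = 0xFFFD; // truncated sequence or value out of range
        }

        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

For the bytes c3 bc c3 a4 c3 b6 this produces the three code units U+00FC, U+00E4 and U+00F6, that is "üäö". In the client above, only the first receivedBytes bytes of buffer should be converted, and the resulting wide string (not the raw char array cast to LPCTSTR) is what gets passed to MessageBoxW.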