Motivation to only provide wide-string logical string comparison - c++

I've been puzzling over this for quite a while, and as of yet I haven't managed to find a suitable rationale.
The Win32 API provides a function for "logical string comparison" for which the prototype is:
int StrCmpLogicalW( _In_ PCWSTR psz1, _In_ PCWSTR psz2 );
This function treats digits as numbers rather than as plain text and thus provides a more 'logical' comparison of two strings.
However, most string-handling functions in the Win32 API come in both multibyte and Unicode versions. For instance, SendMessage is a macro that expands to SendMessageW for Unicode or SendMessageA for ANSI encodings (depending on which macro switch is enabled). So why does this function only have a wide-string version? I've searched the internet but have been unable to find anything that explains this, so I'd be grateful if anyone can enlighten me.
Thanks in advance!

The documentation says "Behavior of this function, and therefore the results it returns, can change from release to release. It should not be used for canonical sorting applications." so it does not seem meant for general usage.
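To make the idea of "digits as numbers" concrete, here is a minimal portable sketch of a digit-aware comparison. It is not the algorithm StrCmpLogicalW uses (which is undocumented and, per the quote above, may change between releases); logical_compare is a hypothetical helper, and ignoring case and leading zeros is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cwctype>
#include <string>

// Hypothetical helper, not a Win32 API. Returns <0, 0 or >0 like strcmp.
// Runs of digits are compared by numeric value; other characters are
// compared case-insensitively, one at a time.
int logical_compare(const std::wstring& a, const std::wstring& b) {
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (std::iswdigit(a[i]) && std::iswdigit(b[j])) {
            // Skip leading zeros so that "007" and "7" compare equal.
            std::size_t si = i, sj = j;
            while (si < a.size() && a[si] == L'0') ++si;
            while (sj < b.size() && b[sj] == L'0') ++sj;
            // Find the end of each run of significant digits.
            std::size_t ei = si, ej = sj;
            while (ei < a.size() && std::iswdigit(a[ei])) ++ei;
            while (ej < b.size() && std::iswdigit(b[ej])) ++ej;
            // More significant digits means a larger number.
            if (ei - si != ej - sj) return (ei - si < ej - sj) ? -1 : 1;
            // Same length: the first differing digit decides.
            const int c = a.compare(si, ei - si, b, sj, ej - sj);
            if (c != 0) return c < 0 ? -1 : 1;
            i = ei;
            j = ej;
        } else {
            const auto ca = std::towlower(a[i]);
            const auto cb = std::towlower(b[j]);
            if (ca != cb) return ca < cb ? -1 : 1;
            ++i;
            ++j;
        }
    }
    if (i < a.size()) return 1;   // a has characters left over
    if (j < b.size()) return -1;  // b has characters left over
    return 0;
}

void logical_compare_examples() {
    // Numeric order: 2 comes before 10.
    assert(logical_compare(L"file2.txt", L"file10.txt") < 0);
    // Plain text order puts "file10.txt" first, because '1' < '2'.
    assert(std::wstring(L"file2.txt") > std::wstring(L"file10.txt"));
    assert(logical_compare(L"a007", L"a7") == 0);
}
```

Calling logical_compare_examples() exercises the difference between the two orderings.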

Exporting functions from DLLs, LoadLibrary() needs the string cast with TEXT to compile without error

I'm learning to write and use DLLs and this is my first attempt at exporting a function from my DLL. It works, but this line is what gave me trouble, and from what I've been able to find regarding the TEXT cast for UNICODE and ANSI, I think I need some guidance. As far as I can find, this question has not been asked elsewhere on the site, so I apologize if anyone finds what I couldn't.
HINSTANCE hInstLibrary = LoadLibrary("MyDLL.dll");
My initial usage, from a short tutorial on explicit linking, gives E0167 and C2664 errors regarding the LPCWSTR type.
HINSTANCE hInstLibrary = LoadLibrary(TEXT("MyDLL.dll"));
Casting the string to TEXT solves the problem, though I'm not sure why and would like to know.
HINSTANCE hInstLibrary = LoadLibraryA("MyDLL.dll");
The line I decided to use in the working example. LoadLibraryA() expands LoadLibrary to accept ANSI rather than Wide, which may be the root of my misunderstanding. Why is this necessary when most examples I find show LoadLibrary("NameOfDLL.dll")?
Why does the string not satisfy the standard LoadLibrary() call?
LoadLibrary() is a preprocessor macro. It maps to either LoadLibraryW() or LoadLibraryA() depending on whether UNICODE is defined or not, respectively. LoadLibraryW() takes a const wchar_t* string as input, while LoadLibraryA() takes a const char * string instead.
The string literal "MyDLL.dll" is a const char[10], which decays into a const char *. If UNICODE is defined, LoadLibrary("MyDLL.dll") will fail to compile, as you cannot pass a const char * where a const wchar_t * is expected.
TEXT() is also a preprocessor macro. If UNICODE is defined, it appends an L prefix to the specified literal making the literal use wchar_t, otherwise no prefix is added and the literal uses char instead.
Thus, if UNICODE is defined, then LoadLibrary(TEXT("MyDLL.dll")) is compiled as LoadLibraryW(L"MyDLL.dll"), otherwise it is compiled as LoadLibraryA("MyDLL.dll") instead.
A majority of Win32 APIs that deal with textual data have similar A and W versions, and corresponding UNICODE-aware preprocessor macros. So, when passing character/string literals to these macros, you should always use the TEXT() macro. Otherwise, just call the A or W APIs directly as needed, depending on the type of textual data you are working with.
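The macro mechanism described above can be imitated in a few lines of portable C++. This is only a sketch: DEMO_UNICODE, DEMO_TEXT, demo_load, demo_load_w and demo_load_a are hypothetical stand-ins for UNICODE, TEXT(), LoadLibrary, LoadLibraryW and LoadLibraryA, and the stand-in functions load nothing; they only report which variant the preprocessor selected.

```cpp
#include <cassert>

#define DEMO_UNICODE  // hypothetical stand-in for UNICODE; remove for the narrow build

#ifdef DEMO_UNICODE
#define DEMO_TEXT(s) L##s   // pastes an L prefix onto the literal
#else
#define DEMO_TEXT(s) s      // leaves the literal as a narrow string
#endif

// Hypothetical stand-ins for the W and A entry points.
inline char demo_load_w(const wchar_t*) { return 'W'; }
inline char demo_load_a(const char*) { return 'A'; }

// The "generic" name is just a macro that maps to one of the two functions.
#ifdef DEMO_UNICODE
#define demo_load demo_load_w
#else
#define demo_load demo_load_a
#endif

inline char demo_selected_variant() {
    // With DEMO_UNICODE defined this expands to demo_load_w(L"MyDLL.dll").
    // Writing demo_load("MyDLL.dll") here would fail to compile, because a
    // const char* cannot be passed where a const wchar_t* is expected.
    return demo_load(DEMO_TEXT("MyDLL.dll"));
}

void demo_text_example() {
    assert(demo_selected_variant() == 'W');
}
```

Removing the DEMO_UNICODE definition makes the same call site compile against demo_load_a with a narrow literal, which is what happens to LoadLibrary(TEXT("MyDLL.dll")) in a non-UNICODE build.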

What happens to the non-unicode API functions when I define UNICODE and/or _UNICODE?

In MS Visual Studio, when you do not set the character set, the likes of AfxMessageBox() (and countless other API functions) will happily accept a CStringA argument. But the moment you set the character set to Unicode, what appear to be the very same functions will only accept CStringW arguments.
Now this is precisely what the documentation says should happen... but...
where exactly did those non-Unicode API functions go? Are they still there to be linked to under other names (AfxMessageBoxA() perhaps?). By what magic does one API disappear and another one appear in its place... or alternatively... by what mischievous hacker trick can one make them reappear? And if it is possible to make them reappear in the presence of Unicode, should one (judiciously) use such hacker mischief?
The declaration of AfxMessageBox() in afxwin.h is:
int AFXAPI AfxMessageBox(LPCTSTR lpszText, UINT nType = MB_OK,
UINT nIDHelp = 0);
It is LPCTSTR that adapts the string type. If you compile with UNICODE in effect then it is an alias for const wchar_t*. Without it, it is const char*. There is no AfxMessageBoxA() version.
This is very different from the way the winapi functions work, necessarily so since this is a C++ function that mangles differently. Technically they could have provided another overload of the function, but they didn't. You'll also have a different link demand: you need to link the non-Unicode version of the MFC library to keep the linker happy. Note that the non-Unicode MFC library is deprecated and no longer ships with recent VS editions, but is still available (right now) as a separate download.
This should answer your question: the function doesn't go anywhere, it simply doesn't exist. Mixing cannot work; you'll need A2W() to convert the string. You could of course simply write your own overload if necessary.

How do I read the MSDN and apply?

Okay, so I want to stop asking lots of questions on how to do most programming stuff because most of my questions are given answers that say "Read the MSDN" found here. Thing is, I have no idea how to read it or most programming languages. For example, let's take the FtpCreateDirectory function on MSDN (which you can find here).
Now, pretend I just learned this feature and I want to try it out. How do I read it, and how do I take the functions/commands it shows me? How do I type it? This really does not help:
BOOL FtpCreateDirectory(
_In_ HINTERNET hConnect,
_In_ LPCTSTR lpszDirectory
);
Thanks!
I've not used this myself, but let's step through and give an example:
HINTERNET hinternet = InternetConnect(...); //assume hinternet is valid
if (!FtpCreateDirectory(hinternet, "C:\\example")) {
std::cerr << "Error creating FTP directory. Code: " << GetLastError();
}
Step by step:
First, we get a HINTERNET handle. How? Well, the docs say this about the parameter:
Handle returned by a previous call to InternetConnect using INTERNET_SERVICE_FTP.
That's why I called InternetConnect in the example.
Next, we look at the second parameter. Looking at the Windows Data Types article, you can see that it takes either a CONST WCHAR * or CONST CHAR *, depending on whether UNICODE is defined. For simplicity, I acted as though it wasn't, though you can use the TEXT macro to make a string literal wide or narrow depending on UNICODE.
Pointer to a null-terminated string that contains the name of the directory to be created. This can be either a fully qualified path or a name relative to the current directory.
As we can see, it's just a path, so I passed in an example path. This is just an example, but keep in mind what the Remarks section says about this parameter.
Now, we check the return value:
Returns TRUE if successful, or FALSE otherwise. To get a specific error message, call GetLastError. (more not shown)
Therefore, we wrap the call in an if statement to catch an error, which we can retrieve the code for using GetLastError. It's important to use the error handling technique described in each function's article. A lot of them say that upon an error, you can use GetLastError, but some don't support GetLastError usage, and some support different types of error retrieving functions, so make sure to follow the guidelines for each function individually.
Other than that, the _In_ means that the parameter goes in and it's no use after. This is opposed to, among others, _Out_, which means that you'd pass in allocated memory and the function would write to it, so you can use it after the function call with the value the function writes.
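The _In_ and _Out_ distinction, together with the "check the return value" habit, can be sketched without any Windows headers. demo_copy_name below is a hypothetical function, not part of any API: its first parameter plays the role of an _In_ argument (only read), the buffer plays the role of an _Out_ argument (allocated by the caller, filled by the function), and the boolean result is checked the same way the BOOL from FtpCreateDirectory is.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical example function.
//   name        : "in" parameter, only read by the function
//   buffer      : "out" parameter, caller-allocated memory the function writes to
//   buffer_size : "in" parameter, capacity of buffer in bytes
// Returns true on success, false if an argument is null or the buffer is too small.
bool demo_copy_name(const char* name, char* buffer, std::size_t buffer_size) {
    if (name == nullptr || buffer == nullptr) return false;
    const std::size_t length = std::strlen(name);
    if (length + 1 > buffer_size) return false;  // no room for the terminator
    std::memcpy(buffer, name, length + 1);       // copies the terminator as well
    return true;
}

void demo_copy_name_example() {
    char buffer[16];  // the caller owns the output memory
    if (!demo_copy_name("example", buffer, sizeof buffer)) {
        return;       // handle the failure before touching buffer
    }
    // Only after a successful call does buffer hold a meaningful value.
    assert(std::strcmp(buffer, "example") == 0);
}
```

The output buffer is only read after the call reports success, which mirrors the advice above to follow each function's documented error handling.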
In the reference part of MSDN there is a basic assumption that you understand the context for the API set.
If Win32 C(++) programming is what you want, then you need to read an intro to Windows programming / Win32. It's not clear what your area of interest is: are you trying to write desktop apps, servers, drivers, ...?
For some cases, classic books like Charles Petzold's Programming Windows are a good place to start. MSDN has a lot of intro-level stuff too (google 'start win32 programming').

Unicode string normalization in C/C++

I am wondering how to normalize strings (containing UTF-8/UTF-16) in C/C++.
In .NET there is a function String.Normalize .
I used UTF8-CPP in the past but it does not provide such a function.
ICU and Qt provide string normalization but I prefer lightweight solutions.
Is there any "lightweight" solution for this?
As I wrote in another question, utf8proc is a very nice, lightweight, library for basic Unicode functionality, including Unicode string normalization.
For Windows, there is the NormalizeString() function (unfortunately for Vista and later only - as far as I see on MSDN):
http://msdn.microsoft.com/en-us/library/windows/desktop/dd319093%28v=vs.85%29.aspx
It's the simplest way to go that I have found so far.
I guess it's quite lightweight too.
int NormalizeString(
_In_ NORM_FORM NormForm,
_In_ LPCWSTR lpSrcString,
_In_ int cwSrcLength,
_Out_opt_ LPWSTR lpDstString,
_In_ int cwDstLength
);
You could build ICU with minimal data (or possibly no other data, as I think all of the normalization data is now internal), and then statically link. I haven't tried this recently, but I believe the total size is pretty small in that case.
A good UTF-8 solution is glib's g_utf8_normalize() function. Would require to convert std::wstring to std::string (utf16 to utf8) if you need this for wstring too (which would make it quite an expensive solution, hence I'm looking myself for a better solution, if possible with pure C++(11) means).
"Lightweight" in your context means "with limited functionality". I would use ICU source as an example, and reference http://unicode.org/reports/tr15/ to implement this "lightweight" functionality.

LoadStringW - winuser.h. What does it do?

I have been unable to find any decent documentation on this function. The code base I am working with uses a function from winuser.h called LoadStringW which takes as arguments: (HINSTANCE hInstance, UINT uID, LPWSTR lpBuffer, int cchBufferMax).
What is this function? What is it for? When might it return 0?
It might be worth a mention that nearly all Win32 APIs that deal with strings have an 'A' and a 'W' variant.
The variant actually called is determined by the definition of macros that don't end in 'A' or 'W' - those macro names are what you might usually think of as the API function's name (LoadString() in this case). UNICODE builds will use the 'W' names and non-UNICODE builds will use the 'A' names.
https://learn.microsoft.com/en-us/windows/desktop/Intl/unicode-in-the-windows-api
There are times when you might want to call a Unicode version of an API even if the build isn't Unicode, in which case you just directly use the name with the 'W' tacked on to the end (it's less often necessary to need to call the non-Unicode APIs in a Unicode build, but it's just as possible). Since the non-Unicode versions of Windows are obsolete, Microsoft has started more and more to implement only Unicode versions of APIs. Note that in nearly all cases, all that the non-Unicode versions of the APIs do is to convert the ANSI/MBCS strings to Unicode, call the 'W' function, then clean up afterward.
LoadStringW is the Unicode version of LoadString.
The documentation states "If the function succeeds, the return value is the number of TCHARs copied into the buffer, not including the terminating NULL character, or zero if the string resource does not exist. To get extended error information, call GetLastError."
Here is the documentation for LoadString(): http://msdn.microsoft.com/en-us/library/ms647486%28VS.85%29.aspx
.. and here is the documentation explaining the differences between ANSI and Unicode functions in the Windows API: http://msdn.microsoft.com/en-us/library/cc500321.aspx.
Basically, the function LoadString comes in two flavours, ANSI and Unicode. LoadStringW is the Unicode-specific version of LoadString.
Edit: Just to be clear, there aren't really two completely separate functions. The ANSI version really just converts the string and calls the unicode version, which does all of the real work.
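The "convert, then call the wide version" pattern can be sketched portably as follows. All names are hypothetical, and the widening step only handles 7-bit ASCII; the real A functions convert using the process's ANSI code page (the documented conversion API for that is MultiByteToWideChar).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical "W" function: all of the real work lives here.
std::size_t demo_count_w(const std::wstring& text) {
    return text.size();
}

// Simplified conversion step. Assumption: the input is 7-bit ASCII.
std::wstring demo_widen_ascii(const std::string& narrow) {
    std::wstring wide;
    wide.reserve(narrow.size());
    for (unsigned char ch : narrow) {
        assert(ch < 0x80);  // anything else needs a real code-page conversion
        wide.push_back(static_cast<wchar_t>(ch));
    }
    return wide;
}

// Hypothetical "A" function: a thin wrapper that converts and forwards.
std::size_t demo_count_a(const std::string& text) {
    return demo_count_w(demo_widen_ascii(text));
}

void demo_thunk_example() {
    // Both entry points give the same answer, but only one implements it.
    assert(demo_count_a("hello") == demo_count_w(L"hello"));
}
```

Because the narrow wrapper pays for a conversion on every call, calling the W function directly with wide strings avoids that extra step.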
LoadStringW() is the WideCharacter version of the LoadString function.
See MSDN
It loads a widestring from a stringtable resource using the Windows Unicode Layer for Win95 and NT 3.51. See MSDN for details (see the remarks section).
For the umpteenth time, I just confirmed that when the resource compiler is instructed to null terminate the strings, the count returned by LoadString includes the terminal NULL character. I did so by examining the output buffer that I made available to LoadString.
Resource strings are not null terminated by default. In that case, the returned count excludes the terminal null character, as described in the documentation, because the null is appended by the function after the string is copied into the output buffer.
I suspect this behavior is due to the fact that LoadString disregards the fact that the resource compiler was instructed to null terminate the strings. Indeed, I suspect that it has no way of knowing that they were.
With respect to why you would want to null terminate your resource strings in the first place, when they work just fine without them, and your PE file is thereby a tad smaller, the reason is that the wide character implementation of LoadString, at the LoadStringW entry point, returns a pointer to the string, rather than copying it into a buffer, if the buffer address passed into it is a NULL pointer. Unless your strings are null terminated, using LoadString in this way produces quite unwelcome results.
Since resource strings are always stored as Unicode (wide character) strings, the ANSI implementation of LoadString cannot return a pointer, as the string must be converted to ANSI; hence, it cannot simply be copied.
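Why a raw pointer into the string table is awkward to use can be seen from a small model of counted strings. The sketch below assumes a simplified layout in which each entry is a length value followed by that many UTF-16 code units with no terminator; read_counted_string is a hypothetical helper, and the vector stands in for the resource data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper. Walks a block of length-prefixed strings and returns
// a copy of the entry at the given index, or an empty string if the index is
// out of range or the block is malformed.
std::u16string read_counted_string(const std::vector<char16_t>& block,
                                   std::size_t index) {
    std::size_t pos = 0;
    for (std::size_t i = 0; pos < block.size(); ++i) {
        const std::size_t length = block[pos];   // the length prefix
        ++pos;
        if (pos + length > block.size()) break;  // entry runs past the data
        if (i == index) {
            // The length, not a terminator, decides where the string ends.
            return std::u16string(block.begin() + pos,
                                  block.begin() + pos + length);
        }
        pos += length;
    }
    return std::u16string();
}

void read_counted_string_example() {
    // Three entries: "hi", an empty string, and "abc".
    const std::vector<char16_t> block = {2, u'h', u'i', 0, 3, u'a', u'b', u'c'};
    assert(read_counted_string(block, 0) == u"hi");
    assert(read_counted_string(block, 2) == u"abc");
}
```

In this model, a pointer to the 'h' of the first entry is followed directly by the next entry's length prefix rather than by a terminator, so code that scans for a null would run past the end of the string unless it also honours the length.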