Unicode string normalization in C/C++

I am wondering how to normalize strings (containing UTF-8/UTF-16) in C/C++.
In .NET there is a function for this, String.Normalize.
I used UTF8-CPP in the past, but it does not provide such a function.
ICU and Qt provide string normalization, but I prefer lightweight solutions.
Is there any "lightweight" solution for this?

As I wrote in another question, utf8proc is a very nice, lightweight library for basic Unicode functionality, including Unicode string normalization.

For Windows, there is the NormalizeString() function (unfortunately available on Vista and later only, as far as I can see on MSDN):
http://msdn.microsoft.com/en-us/library/windows/desktop/dd319093%28v=vs.85%29.aspx
It's the simplest way to go that I have found so far.
I guess it's quite lightweight too.
int NormalizeString(
    _In_      NORM_FORM NormForm,
    _In_      LPCWSTR   lpSrcString,
    _In_      int       cwSrcLength,
    _Out_opt_ LPWSTR    lpDstString,
    _In_      int       cwDstLength
);

You could build ICU with minimal data (or possibly no other data at all; I think all of the normalization data is now internal), and then link it statically. I haven't tried this recently, but I believe the total size is pretty small in that case.

A good UTF-8 solution is glib's g_utf8_normalize() function. If you need this for std::wstring too, you would first have to convert std::wstring to std::string (UTF-16 to UTF-8), which makes it quite an expensive solution. Hence I am looking for a better solution myself, if possible with pure C++(11) means.

"Lightweight" in your context means "with limited functionality". I would use ICU source as an example, and reference http://unicode.org/reports/tr15/ to implement this "lightweight" functionality.

Related

Motivation to only provide wide-string logical string comparison

I've been puzzling over this for quite a while, and as of yet I haven't managed to find a suitable rationale.
The Win32 API provides a function for "logical string comparison" for which the prototype is:
StrCmpLogicalW( _In_ PCWSTR psz1, _In_ PCWSTR psz2 );
This function treats runs of digits as numbers rather than as plain text and thus provides a more 'logical' comparison of two strings.
However, most string functions in the Win32 API seem to come in both a multibyte and a Unicode version. For instance, SendMessage is a macro which expands into SendMessageW for Unicode or SendMessageA for ANSI encodings (depending on whether the UNICODE macro is defined). So why does this function only have a wide-string version? I've searched the internet, but have been unable to find anything that explains this, so I'd be grateful if anyone can enlighten me.
Thanks in advance!
The documentation says "Behavior of this function, and therefore the results it returns, can change from release to release. It should not be used for canonical sorting applications." so it does not seem meant for general usage.
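To illustrate what "digits as numbers" means in the question above, here is a minimal portable sketch of such a comparison. It is not the algorithm Windows uses (which is undocumented and, per the quote above, subject to change); natural_compare and its helpers are hypothetical names, and the sketch is case-sensitive and only looks at ASCII digits.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true for the ASCII digits '0'..'9' only.
static bool is_ascii_digit(wchar_t c) { return c >= L'0' && c <= L'9'; }

// Hypothetical helper: returns the index one past the digit run starting at i.
static std::size_t digit_run_end(const std::wstring& s, std::size_t i) {
    assert(i < s.size() && is_ascii_digit(s[i]));
    while (i < s.size() && is_ascii_digit(s[i])) ++i;
    return i;
}

// Returns <0, 0 or >0 like strcmp, but compares runs of digits by numeric value.
int natural_compare(const std::wstring& a, const std::wstring& b) {
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (is_ascii_digit(a[i]) && is_ascii_digit(b[j])) {
            const std::size_t ei = digit_run_end(a, i);
            const std::size_t ej = digit_run_end(b, j);
            // Skip leading zeros so that "007" and "7" have the same value.
            std::size_t si = i, sj = j;
            while (si < ei && a[si] == L'0') ++si;
            while (sj < ej && b[sj] == L'0') ++sj;
            // A number with more significant digits is the larger one.
            const std::size_t la = ei - si, lb = ej - sj;
            if (la != lb) return la < lb ? -1 : 1;
            // Same digit count: the first differing digit decides.
            const int c = a.compare(si, la, b, sj, lb);
            if (c != 0) return c < 0 ? -1 : 1;
            i = ei;
            j = ej;
        } else {
            // Ordinary characters are compared by code unit.
            if (a[i] != b[j]) return a[i] < b[j] ? -1 : 1;
            ++i;
            ++j;
        }
    }
    if (i < a.size()) return 1;   // a has characters left, so it sorts after b
    if (j < b.size()) return -1;  // b has characters left, so it sorts after a
    return 0;
}
```

With this sketch, natural_compare(L"file2.txt", L"file10.txt") is negative, whereas a plain std::wstring comparison puts "file10.txt" first.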

What type of string is best to use for Win32 and DirectX?

I am developing a small game in DirectX 10 and C++, and I'm finding it hell to deal with the different types of strings required by the various DirectX / Win32 function calls.
Can anyone recommend which of the available string types is the best to use? Ideally it would be one type of string that converts well to the other types (LPCWSTR and LPCSTR mostly). Thus far I have been using std::string and std::wstring and calling .c_str() to get the correct format.
I would like to use just one type of string that I pass in to all functions, and then do the cast inside the function.
Use std::wstring with c_str() exactly as you have been doing. I see no reason to use std::string on Windows; your code may as well always use the native UTF-16 encoding.
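I would stick to std::wstring as well. If you really need to pass a std::string somewhere, you can still convert it on the fly, as long as the text is pure ASCII (each char is simply widened; no encoding conversion takes place):
std::string s = "Hello, World";
std::wstring ws(s.begin(), s.end());
It works the other way around as well, with the same caveat: narrowing truncates every character that does not fit into a char.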
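For text that is not pure ASCII, the element-wise copy is not an encoding conversion. As a rough illustration, assuming the std::string holds UTF-8, a real conversion could look like the sketch below. utf8_to_utf16 and append_utf16 are hypothetical helpers, not part of any library; on Windows you would more likely call MultiByteToWideChar with CP_UTF8 and fill a std::wstring.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Appends one code point as UTF-16, using a surrogate pair above U+FFFF.
static void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Hypothetical helper: decodes UTF-8 and re-encodes it as UTF-16.
// Throws std::runtime_error on malformed input.
std::u16string utf8_to_utf16(const std::string& in) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const char32_t min_cp[] = {0, 0x80, 0x800, 0x10000};
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong encodings, surrogates and values beyond U+10FFFF.
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point in UTF-8 input");
        append_utf16(out, cp);
        i += extra + 1;
    }
    return out;
}
```

The result is a std::u16string because wchar_t is only 16 bits wide on Windows; there the code units could be copied into a std::wstring before calling c_str().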
If you're using Native COM (the stuff of #import <type_library>), then use _bstr_t. It converts implicitly to both LPCWSTR and LPCSTR, and it meshes nicely with COM's memory allocation model. No need for .c_str() calls.

how many types of strings in visual c++

How many types of string classes are there in Visual C++? I got confused when I was going through the MSDN center.
I found this type under the System namespace:
http://msdn.microsoft.com/en-us/library/system.string(v=VS.71).aspx
Then, in the headers section, I found the string header definitions. These seemed different from the above. One thing I noticed is that this one comes under the STL.
(Please see the comment for the link; I can't post two links in the same post.)
Which one is normally used? I'm having a hard time getting around the different string classes.
Thanks in advance :)
Different libraries come with different string types:
In plain old C you would use char*. The C++ standard library provides std::string, which is widely used in C++ development. (string is defined as typedef basic_string<char> string;)
Microsoft created the MFC CString class, which is (was?) used in MFC-style programming, and Qt has its QString, which is used in Qt programs. The System.String you mention is a .NET string class, which can only be used in managed code (with .NET).
I'd suggest to stick with std::string (#include <string>) if you're new to C++. It's standard and platform independent.
String types in common use in Microsoft code are char*, wchar_t*, LPSTR, LPTSTR, LPWSTR, LPCSTR, LPCTSTR, LPCWSTR, BSTR, OLESTR, UNICODE_STRING, String, string, wstring, _bstr_t, CString
The last 5 are classes. You pick the one that gives you the least conversion headaches, depending on what API you need to use:
std::string and wstring, standard C++ library
System::String, the string type for managed code
_bstr_t, wrapper for a BSTR, used in COM automation
CString, string type for the ATL and MFC libraries.
You're likely to encounter additional string types when you work with other APIs.

using boost string algorithm with MFC CString to check for the end of a string

I need to check whether my CString object in MFC ends with a specific string.
I know that boost::algorithm has many functions meant for string manipulation and that the header boost/algorithm/string/predicate.hpp could be used for that purpose.
I usually use this library with std::string. Do you know a convenient way to use this library with CString as well?
I know that the library is generic and can also be used with other string types as template arguments, but it is not clear how (or whether it is possible) to apply this feature to CString.
Can you help me with that in case it is possible?
According to Boost String Algorithms Library, "consult the design chapter to see precise specification of supported string types", which says amongst other things, "first requirement of string-type is that it must [be] accessible using Boost.Range", and note at the bottom the MFC/ATL implementation written by Shunsuke Sogame which should allow you to combine libraries.
Edit: Since you mention regex in the comments below, this is all you really need to do (assuming a unicode build):
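// Assumes the regex names are in scope, e.g. from std (<regex>) or boost (<boost/regex.hpp>).
CString inputString;
wcmatch matchGroups;
wregex yourRegex(L"^(.*)$", regex::icase);
if (regex_search(static_cast<LPCWSTR>(inputString), matchGroups, yourRegex))
{
    // matchGroups[1] is the first capture group; copy it back into a CString.
    CString firstCapture = matchGroups[1].str().c_str();
}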
Note how we reduce the different string types to raw pointers to pass them between libraries. Replace my contrived yourRegex with your requirements, including whether or not you ignore case or are explicit about anchors.
Why don't you save yourself the trouble and just use CStringT::Right?
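The CStringT::Right suggestion boils down to comparing the last characters of the string with the suffix, roughly str.Right(suffix.GetLength()) == suffix (not compiled here). The same check can be sketched portably with std::wstring; ends_with below is a hypothetical helper, case-sensitive, shown only to make the logic explicit.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true when `text` ends with `suffix` (case-sensitive).
bool ends_with(const std::wstring& text, const std::wstring& suffix) {
    // A suffix longer than the text can never match.
    if (suffix.size() > text.size()) return false;
    // Compare only the tail of `text` that has the same length as `suffix`.
    return text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Usage example: call this from main() to run the checks.
void ends_with_examples() {
    assert(ends_with(L"report.txt", L".txt"));
    assert(!ends_with(L"report.txt", L".TXT"));  // comparison is case-sensitive
    assert(!ends_with(L"txt", L"report.txt"));   // suffix longer than the text
}
```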

pwsz string confusion

I have never posted before, so I am sorry if I am not clear. I am trying to use a third party DLL written in C++ on 2005 and all I have is some very poor documentation. I am dynamically linking to the DLL and using the ordinal value read from Dependency Walker to get a pointer to a method in the DLL, such as (LPFNDDLLZC) GetProcAddress(hHILCdll, (LPCSTR)15);
My code is written in C++ and compiled in Microsoft VS 6.0. I cannot turn on the UNICODE defines or I will break existing code.
The documentation for the DLL says all string arguments are pwsz, which I believe means a pointer to a null-terminated wide-character string.
I have tried passing in a pointer to an unsigned short, a BSTR and various other things, and the DLL crashes on the string. I am totally lost as to why. I believe it has to do with my pwsz string construction, and I don't know how to fix this. I have read so many related articles, but nothing works.
Can anyone help? I can post code if need be.
Thanks.
You could use MultiByteToWideChar to turn your LPSTR into an LPWSTR, which should solve your problem.
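As a rough portable analogue of that conversion, the sketch below uses the standard C function std::mbstowcs instead of MultiByteToWideChar, so it converts from the current C locale's encoding rather than from an explicit code page. to_wide is a hypothetical helper, and the code is written in modern C++, so it would need adapting for VS 6.0.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a narrow, null-terminated string in the
// current C locale's encoding into a wide string.
std::wstring to_wide(const char* narrow) {
    assert(narrow != nullptr);
    // A multibyte string never yields more wide characters than it has bytes,
    // so strlen + 1 elements are enough for the text plus the terminator.
    std::wstring wide(std::strlen(narrow) + 1, L'\0');
    const std::size_t n = std::mbstowcs(&wide[0], narrow, wide.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    wide.resize(n);  // drop the unused tail; c_str() stays null-terminated
    return wide;
}
```

The pointer returned by to_wide(text).c_str() is a null-terminated wide string, which is what a pwsz parameter is usually taken to mean; it is only valid while the std::wstring is alive.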
Thanks to everyone. I did finally get a copy of the DLL source, and my problem wasn't my string construction; it was the poor documentation. It turns out they are using double pointers, which fixes a ton of things!