I asked a similar question yesterday, but recognize that I need to rephrase it in a different way.
In short:
In C++ on Windows, how do I do a case-insensitive search for a string (inside another string) when the strings are in Unicode format (wide char, wchar_t) and I don't know the language of the strings? I just want to know whether the needle exists in the haystack; the location of the needle isn't relevant to me.
Background:
I have a repository containing a lot of email bodies. The messages are in different languages (Japanese, German, Russian, Finnish; you name it). All the data is in Unicode format, and I load it into wide strings (wchar_t) in my C++ application (the bodies have been MIME decoded, so in my debugger I can see the actual Japanese and German characters). I don't know the language of the messages, since email messages don't carry that detail, and a single email body may contain characters from several languages.
I'm looking for something like wcsstr, but with the ability to do the search in a case-insensitive manner. I know that it's not possible to do a 100% proper conversion from upper case to lower case without knowing the language of the text. I want a solution which works in the 99% of cases where it's possible.
I'm using Visual Studio 2008 with C++, STL and Boost.
You have to specify the language to do a case-insensitive comparison. For example, in Turkish, 'i' is NOT the lower-case letter corresponding to 'I'. If the language appears not to be specified, then the comparison is being done with an implicitly selected language.
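To see this on Windows (a hedged sketch using the Win32 CompareStringW API; whether the Turkish comparison actually reports "different" depends on the locale data installed on the system):

    #include <windows.h>
    #include <cstdio>

    int main() {
        LCID en = MAKELCID(MAKELANGID(LANG_ENGLISH, SUBLANG_ENGLISH_US), SORT_DEFAULT);
        LCID tr = MAKELCID(MAKELANGID(LANG_TURKISH, SUBLANG_DEFAULT), SORT_DEFAULT);
        // CompareStringW returns CSTR_EQUAL when the strings compare equal.
        printf("en: %s\n",
               CompareStringW(en, NORM_IGNORECASE, L"i", 1, L"I", 1) == CSTR_EQUAL
                   ? "equal" : "different");
        // Turkish pairs dotted 'i' with dotted capital I, not with plain 'I'.
        printf("tr: %s\n",
               CompareStringW(tr, NORM_IGNORECASE, L"i", 1, L"I", 1) == CSTR_EQUAL
                   ? "equal" : "different");
        return 0;
    }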
Boost String Algorithms has an icontains() function template which may do what you need.
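For example (a minimal sketch; icontains takes an optional std::locale that drives its per-character case folding, so one-to-many mappings such as German ß -> SS will not match):

    #include <boost/algorithm/string.hpp>
    #include <iostream>
    #include <string>

    int main() {
        std::wstring haystack = L"Grüße aus Berlin";
        std::wstring needle = L"grüße";
        // Case is folded one character at a time using the supplied locale.
        bool found = boost::algorithm::icontains(haystack, needle, std::locale(""));
        std::cout << (found ? "found" : "not found") << std::endl;
        return 0;
    }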
You should use the ICU library, which provides support for Unicode regular expressions that follow the Unicode rules for case-insensitive matching. The library is available as C/C++ and Java libraries, and many other languages, such as Python, have wrappers for the ICU libraries.
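A rough sketch of that approach (the flag UREGEX_CASE_INSENSITIVE turns on Unicode case-insensitive matching; \Q...\E quotes the needle as a literal, and this assumes the needle itself contains no \E):

    #include <unicode/regex.h>
    #include <unicode/unistr.h>

    bool containsCaseless(const icu::UnicodeString& haystack,
                          const icu::UnicodeString& needle) {
        UErrorCode status = U_ZERO_ERROR;
        icu::UnicodeString pattern("\\Q");
        pattern += needle;   // treated as a literal thanks to \Q...\E
        pattern += "\\E";
        icu::RegexMatcher matcher(pattern, haystack,
                                  UREGEX_CASE_INSENSITIVE, status);
        return U_SUCCESS(status) && matcher.find();
    }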
You could convert both needle and haystack to lower case (or upper case) and then do the wcsstr().
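On Windows that could look like the sketch below; CharLowerBuffW lower-cases a wchar_t buffer in place using the OS default mapping (no per-language control, so the Turkish caveat above applies):

    #include <windows.h>
    #include <cwchar>
    #include <string>

    bool containsCaseless(std::wstring haystack, std::wstring needle) {
        if (needle.empty()) return true;
        if (haystack.empty()) return false;
        // Lower-case both working copies in place.
        CharLowerBuffW(&haystack[0], (DWORD)haystack.size());
        CharLowerBuffW(&needle[0], (DWORD)needle.size());
        return wcsstr(haystack.c_str(), needle.c_str()) != NULL;
    }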
I have tried searching Stack Overflow to find an answer to this, but the questions and answers I've found are around 10 years old, and I can't seem to find consensus on the subject given the changes and possible progress since then.
There are several libraries that I know of outside of the STL that are supposed to handle Unicode:
http://userguide.icu-project.org/
https://github.com/nemtrif/utfcpp
https://github.com/CaptainCrowbar/unicorn-lib
There are a few features of the STL (std::wstring, codecvt_utf8) that were included, but people seem to be ambivalent about using them because they deal with UTF-16, which this site (UTF-8 Everywhere) says shouldn't be used, and many people online seem to agree with the premise.
The only thing I'm looking for is the ability to do 4 things with Unicode strings:
Read a string into memory
Search the string with a regex using Unicode or ASCII, and concatenate or do text replacement/formatting with it using either ASCII + Unicode numbers or characters.
Convert to ASCII plus the Unicode number format for characters that don't fit in the ASCII range.
Write a string to disk or send wherever.
From what I can tell, ICU handles this and more. What I would like to know is whether there is a standard way of handling this on Linux, Windows, and macOS.
Thank you for your time.
I will try to throw some ideas here:
Most C++ programs/programmers just assume that text is an almost opaque sequence of bytes. UTF-8 is probably guilty of that, and it is no surprise that many comments boil down to: don't worry about Unicode, just process UTF-8-encoded strings.
Files only contain bytes. At some point, if you internally process true Unicode code points, you will have to serialize them back to bytes -> here again UTF-8 wins the point.
As soon as you go outside the Basic Multilingual Plane (16-bit code points), things become more and more complex. Emoji are especially awful to process: an emoji can be followed by a variation selector (U+FE0E VARIATION SELECTOR-15 (VS15) for text style or U+FE0F VARIATION SELECTOR-16 (VS16) for emoji style) to alter its display, more or less like the old "i backspace ^" sequence used with 1970s ASCII when one wanted to print î. And that's not all: the characters U+1F3FB to U+1F3FF are used to provide a skin color for 102 human emoji spread across six blocks: Dingbats, Emoticons, Miscellaneous Symbols, Miscellaneous Symbols and Pictographs, Supplemental Symbols and Pictographs, and Transport and Map Symbols.
That simply means that up to 3 consecutive Unicode code points can represent one single glyph... so the idea that one character is one char32_t is still an approximation.
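A quick illustration (C++11; the two code points below render as a single thumbs-up glyph with a skin-tone modifier):

    #include <iostream>
    #include <string>

    int main() {
        // U+1F44D THUMBS UP SIGN + U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4
        std::u32string thumbs = U"\U0001F44D\U0001F3FD";
        std::cout << thumbs.size() << " char32_t units, one glyph\n"; // prints 2
        return 0;
    }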
My conclusion is that Unicode is a complex thing, and really requires a dedicated library like ICU. You can try to use simple tools like the converters of the standard library when you only deal with the BMP, but full support is far beyond that.
BTW: even other languages like Python that claim to have native Unicode support (which is, IMHO, far better than the current C++ one) often fail on some parts:
the tkinter GUI library cannot display any code point outside the BMP, even though it is what the standard IDLE Python tool is built on
different modules of the standard library are dedicated to Unicode in addition to the core language support (codecs and unicodedata), and other modules are available in the Python Package Index, like the emoji package, because the standard library does not meet all needs
So support for Unicode has been poor for more than 10 years, and I do not really expect things to get much better in the next 10 years...
I'm trying to write a parser for "text" files which I know will be encoded in one of the Windows single-byte code pages. These files contain text representations of basic data types, and the spec I have for these representations is lacking, to say the least.
I noticed in Windows-874 ten little inconspicuous characters near the end called THAI DIGIT ZERO to THAI DIGIT NINE.
I'm trying to write this parser to be pretty robust, but I'm working a bit in the dark, as there are many different programs that can generate these data files and I don't have access to their sources.
What I want to know is: do any functions in the Microsoft C++ libraries convert real-number data types into a std::string or char const * (i.e. serialization) which would contain non-Arabic numerals?
I don't use the Microsoft C++ libraries, so I can't reference any in particular, but a made-up example could be char const * IntegerFunctions::ToString(int i).
These digits certainly could be created by Microsoft libraries. The locale properties LOCALE_IDIGITSUBSTITUTION and LOCALE_SNATIVEDIGITS determine whether numbers formatted by the OS will use native (i.e. non-ASCII) digits. Those are initially Unicode, because that's how Windows creates strings internally. When you have a Thai locale and you convert Unicode to CP874, those characters will be kept.
A simple function that demonstrates this behavior is GetNumberFormatA.
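For instance (a minimal sketch, Windows-only; whether Thai digits actually appear depends on the locale's LOCALE_IDIGITSUBSTITUTION setting, which users can override):

    #include <windows.h>
    #include <cstdio>

    int main() {
        LCID thai = MAKELCID(MAKELANGID(LANG_THAI, SUBLANG_DEFAULT), SORT_DEFAULT);
        char buf[64];
        // The "A" variant returns the result in the locale's ANSI code page,
        // which for Thai is Windows-874.
        if (GetNumberFormatA(thai, 0, "1234.5", NULL, buf, sizeof(buf)))
            printf("%s\n", buf);
        return 0;
    }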
Sort of the inverse answer, but this page seems to indicate that Microsoft's runtime libraries understand quite a few (but not all) non-Latin numerals when doing what you want to do, i.e. parsing a string into a number.
Thai is included, which seems to indicate that it's a good idea to support it in custom code, too.
To include more information here, the linked-to page states that Microsoft's msvcr100 runtime supports decoding numerals from the following character sets:
ASCII
Arabic-Indic
Extended Arabic
Devanagari
Bengali
Gurmukhi
Gujarati
Oriya
Telugu
Kannada
Malayalam
Thai
Lao
Tibetan
Myanmar
Khmer
Mongolian
Full Width
The full page includes more programming environments and more languages (there are plenty of negatives, too).
How should you use Unicode strings in VC++? Of course you should #define UNICODE, but what about your strings?
Should the TEXT() or _T() macro be used around all text, or should you just put an L in front of strings? It's my belief that all programs should use Unicode these days, so wouldn't it be cleanest to use the L prefix?
Opinions?
It depends on what you want to achieve. If you want to make sure your code will compile and work correctly both with and without Unicode, use the TEXT or _T macros, and call the "default" Win32 function names (for example CreateWindow).
If you want to make sure your program always uses the Unicode API, then you should use an L prefix in front of your strings, and call the wide versions of the Win32 functions (such as CreateWindowW).
In the latter case, you'll get Unicode behavior whether or not UNICODE is defined.
In the former case, your application will change its behavior based on whether UNICODE is defined.
I agree with you that the non-Unicode versions haven't really been relevant since Win98, so I'd go with the second approach.
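The two styles side by side (a sketch; the window parameters are just placeholders):

    #include <windows.h>
    #include <tchar.h>

    void demo(HWND parent, HINSTANCE inst) {
        // Style 1: compiles as ANSI or Unicode depending on UNICODE/_UNICODE.
        CreateWindowEx(0, _T("BUTTON"), _T("OK"), WS_CHILD | WS_VISIBLE,
                       0, 0, 80, 24, parent, NULL, inst, NULL);
        // Style 2: always the wide (Unicode) API, whatever the build settings.
        CreateWindowExW(0, L"BUTTON", L"OK", WS_CHILD | WS_VISIBLE,
                        0, 0, 80, 24, parent, NULL, inst, NULL);
    }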
Declare Unicode string literals with the L prefix.
The TEXT() or _T() macros were for the bad old days when you wanted a single source to compile for both Unicode and non-Unicode versions of Windows (Windows 9x). Thankfully, you can safely ignore Windows 9x today.
Something I learned a while ago:
Golden rule:
Don't fight the framework.
Do what the framework was designed to do: if you use Windows, use _T to make your code independent of the character type. If you're on Linux, use UTF-8. If you have a cross-platform framework, do whatever it does. But don't try to invent something of your own unless you have a really good reason to. (It's usually just not worth the effort of working against a framework.)
Does anyone know of a more permissively licensed (MIT / public domain) version of this:
http://library.gnome.org/devel/glibmm/unstable/classGlib_1_1ustring.html
(a 'drop-in' replacement for std::string that's UTF-8 aware)
It's lightweight and does everything I need, and even more (I doubt I'll even use the UTF-XX conversions).
I really don't want to be carrying ICU around with me.
std::string is fine for UTF-8 storage.
If you need to analyze the text itself, UTF-8 awareness will not help you much, as there are too many things in Unicode that do not work at the code point level.
Take a look at the Boost.Locale library (it uses ICU under the hood):
Reference http://cppcms.sourceforge.net/boost_locale/html/
Tutorial http://cppcms.sourceforge.net/boost_locale/html/tutorial.html
Download https://sourceforge.net/projects/cppcms/files/
It is not lightweight, but it allows you to handle Unicode correctly, and it uses std::string as storage.
If you expect to find a Unicode-aware lightweight library to deal with strings, you will not find such a thing, because Unicode is not lightweight. Even relatively "simple" stuff like upper-case/lower-case conversion or Unicode normalization requires complex algorithms and access to the Unicode character database.
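For example, full case conversion is a locale-sensitive, string-level operation, and Boost.Locale exposes it directly over std::string (a sketch assuming an ICU-backed build and a UTF-8 source file):

    #include <boost/locale.hpp>
    #include <iostream>
    #include <string>

    int main() {
        boost::locale::generator gen;
        std::locale loc = gen("de_DE.UTF-8"); // ICU-backed locale
        std::string s = "grüßen";
        // One-to-many mapping: ß upper-cases to SS, so the string grows.
        std::cout << boost::locale::to_upper(s, loc) << std::endl; // GRÜSSEN
        return 0;
    }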
If you need the ability to iterate over code points (which, BTW, are not characters), take a look at http://utfcpp.sourceforge.net/
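A minimal sketch with utfcpp (header-only; utf8::next advances the iterator by one code point and returns its value):

    #include "utf8.h"
    #include <iostream>
    #include <string>

    int main() {
        std::string s = "\xC3\xBC" "ber"; // "über" encoded as UTF-8 bytes
        std::string::iterator it = s.begin();
        while (it != s.end()) {
            unsigned int cp = utf8::next(it, s.end());
            std::cout << "U+" << std::hex << cp << std::dec << "\n";
        }
        std::cout << utf8::distance(s.begin(), s.end()) << " code points\n";
        return 0;
    }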
Answer to comment:
1) Find file formats for files included by me
std::string::find is perfectly fine for this.
2) Line break detection
This is not a simple issue. Have you ever tried to find a line break in Chinese or Japanese text? Probably not, as spaces do not separate words there, so line-break detection is a hard job. (I don't think even glib does this correctly; I think only Pango has something like that.)
And of course Boost.Locale does this, and does it correctly.
And if you need to do this for European languages only, just search for spaces or punctuation marks, so std::string::find is more than fine.
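For the hard case (no spaces), Boost.Locale's boundary analysis reports line-break opportunities; a sketch, where the Japanese sample text and locale name are placeholders:

    #include <boost/locale.hpp>
    #include <iostream>
    #include <string>

    namespace bl = boost::locale;

    int main() {
        bl::generator gen;
        std::locale loc = gen("ja_JP.UTF-8");
        std::string text = "日本語のテキストです"; // no spaces between words
        bl::boundary::ssegment_index lines(bl::boundary::line,
                                           text.begin(), text.end(), loc);
        // Each segment ends where a line break is permitted.
        for (bl::boundary::ssegment_index::iterator it = lines.begin();
             it != lines.end(); ++it)
            std::cout << "[" << *it << "] ";
        std::cout << std::endl;
        return 0;
    }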
3) Character (or now, code point) counting -- looking at utfcpp, thx
Characters are not code points. For example, the Hebrew word Shalom -- "שָלוֹם" -- consists of 4 characters but 6 Unicode code points, where two code points are used for vowels. The same goes for European languages, where a single character can be represented by two code points; for example, "ü" can be represented as "u" and "¨" -- two code points.
So if you are aware of these issues, then utfcpp will be fine; otherwise you will not find anything simpler.
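To make that concrete, here is the "ü" example counted with utfcpp (the two forms look identical on screen but differ in code point count):

    #include "utf8.h"
    #include <iostream>
    #include <string>

    int main() {
        std::string precomposed = "\xC3\xBC"; // U+00FC u with diaeresis
        std::string decomposed = "u\xCC\x88"; // U+0075 + U+0308 COMBINING DIAERESIS
        std::cout << utf8::distance(precomposed.begin(), precomposed.end())
                  << " vs "
                  << utf8::distance(decomposed.begin(), decomposed.end())
                  << " code points\n"; // prints "1 vs 2 code points"
        return 0;
    }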
I never used it, but I stumbled upon this UTF-8 CPP library a while ago and had enough good feelings about it to bookmark it. It is released under a BSD-like license, IIUC.
It still relies on std::string for strings and provides lots of utility functions to help check that a string is really UTF-8, count the number of characters, and go back or forward by one character … It is really small and lives only in header files: looks really good!
You might be interested in the Flexible and Economical UTF-8 Decoder by Björn Höhrmann, but by no means is it a drop-in replacement for std::string.
Could someone give some info regarding the different character set options in Visual Studio's project property sheets?
The options are:
None
Unicode
Multi byte
I would like to make an informed decision as to which to choose.
Thanks.
All new software should be Unicode-enabled. For Windows apps that means the UTF-16 character set, and for pretty much everyone else UTF-8 is often the best choice. The other character-set choices in Windows programming should only be used for compatibility with older apps; they do not support the same range of characters as Unicode.
Multi-byte uses 1 or 2 bytes per character (it is a variable-width encoding driven by the active code page); None uses exactly 1; Unicode (UTF-16) uses 2 bytes per code unit, or 4 for characters outside the BMP.
None is not good, as it doesn't support non-Latin symbols. It's very annoying when a non-English user tries to input their name into an edit box and can't. Do not use None.
If you do not do custom computation of string lengths, then from the programmer's point of view multi-byte and Unicode do not differ, as long as you use the TEXT macro to wrap your string constants.
Some libraries explicitly require a certain encoding (DirectShow etc.); just use what they want.
As Mr. Shiny recommended, Unicode is the right thing.
If you want to understand a bit more on what are the implications of that decision, take a look here: http://www.mihai-nita.net/article.php?artID=20050306b