Anyone know of a more permissive license (MIT / public domain) version of this:
http://library.gnome.org/devel/glibmm/unstable/classGlib_1_1ustring.html
('drop-in' replacement for std::string that's UTF-8 aware)
Lightweight, and it does everything I need and more (I doubt I'll even use the UTF-XX conversions).
I really don't want to be carrying ICU around with me.
std::string is fine for UTF-8 storage.
If you need to analyze the text itself, the UTF-8 awareness will not help you much, as there are too many things in Unicode that do not work at the code-point level.
Take a look at the Boost.Locale library (it uses ICU under the hood):
Reference http://cppcms.sourceforge.net/boost_locale/html/
Tutorial http://cppcms.sourceforge.net/boost_locale/html/tutorial.html
Download https://sourceforge.net/projects/cppcms/files/
It is not lightweight, but it allows you to handle Unicode correctly, and it uses std::string as storage.
If you expect to find a lightweight, Unicode-aware library to deal with strings, you won't find one, because Unicode is not lightweight. Even relatively "simple" stuff like upper-case/lower-case conversion or Unicode normalization requires complex algorithms and access to the Unicode character database.
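For a sense of what "correct" looks like here, a minimal sketch using Boost.Locale (assuming it is built with the ICU backend; the locale name is just an example):

#include <boost/locale.hpp>
#include <iostream>
#include <string>

int main()
{
    boost::locale::generator gen;            // locale generator backed by ICU
    std::locale loc = gen("en_US.UTF-8");    // UTF-8 locale with Unicode data loaded

    std::string s = u8"grüße";
    // Linguistically correct full case mapping (with ICU, "ß" maps to "SS")
    std::cout << boost::locale::to_upper(s, loc) << "\n";
    // NFC normalization, so precomposed and decomposed forms compare equal
    std::cout << boost::locale::normalize(s, boost::locale::norm_nfc, loc) << "\n";
}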
If you need the ability to iterate over code points (which, BTW, are not characters),
take a look at http://utfcpp.sourceforge.net/
Answer to comment:
1) Find file formats for files included by me
std::string::find is perfectly fine for this.
2) Line break detection
This is not a simple issue. Have you ever tried to find a line break in Chinese/Japanese text? Probably not, as spaces do not separate words. So line-break detection is a hard job. (I don't think even glib does this correctly; I think only Pango has something like that.)
And of course Boost.Locale does this, and does it correctly.
And if you need to do this for European languages only, just search for spaces or punctuation marks, so std::string::find is more than fine.
3) Character (or now, code point) counting — looking at utfcpp, thx
Characters are not code points. For example, the Hebrew word shalom -- "שָלוֹם" -- consists of 4 characters and 6 code points, where two code points are used for vowels. The same goes for European languages, where a single character can be represented by two code points: for example, "ü" can be represented as "u" and "¨" -- two code points.
So if you are aware of these issues then utfcpp will be fine, otherwise you will not
find anything simpler.
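For example, a minimal sketch with utfcpp (it counts and iterates code points only; it does not normalize, so the two forms of "ü" below stay distinct):

#include <iostream>
#include <string>
#include "utf8.h"   // utfcpp, header-only

int main()
{
    std::string precomposed = u8"\u00FC";   // "ü" as one code point (2 UTF-8 bytes)
    std::string decomposed  = u8"u\u0308";  // "ü" as 'u' + COMBINING DIAERESIS (2 code points)

    std::cout << utf8::distance(precomposed.begin(), precomposed.end()) << "\n"; // 1
    std::cout << utf8::distance(decomposed.begin(), decomposed.end()) << "\n";   // 2

    // Iterate code point by code point
    for (auto it = decomposed.begin(); it != decomposed.end(); )
        std::cout << std::hex << utf8::next(it, decomposed.end()) << "\n";       // 75, then 308
}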
I have never used it, but I stumbled upon this UTF-8 CPP library a while ago and liked it enough to bookmark it. It is released under a BSD-like license, IIUC.
It still relies on std::string for strings and provides lots of utility functions to help check that a string really is UTF-8, count the number of characters, move back or forward by one character… It is really small and lives only in header files: it looks really good!
You might be interested in the Flexible and Economical UTF-8 Decoder by Björn Höhrmann, but it is by no means a drop-in replacement for std::string.
I have tried searching Stack Overflow to find an answer to this, but the questions and answers I've found are around 10 years old and I can't seem to find consensus on the subject due to changes and possible progress.
There are several libraries that I know of outside of the STL that are supposed to handle Unicode:
http://userguide.icu-project.org/
https://github.com/nemtrif/utfcpp
https://github.com/CaptainCrowbar/unicorn-lib
There are a few features of the STL (wstring, codecvt_utf8) that were included, but people seem ambivalent about using them because they deal with UTF-16, which this site (utf-8 everywhere) says shouldn't be used, and many people online seem to agree with the premise.
The only things I'm looking for are the ability to do four things with Unicode strings:
Read a string into memory
Search the string with a regex using Unicode or ASCII, concatenate it, or do text replacement/formatting on it using either ASCII plus Unicode escape numbers or characters.
Convert it to ASCII plus the Unicode escape-number format for characters that don't fit in the ASCII range.
Write a string to disk or send wherever.
From what I can tell, ICU handles this and more. What I would like to know is whether there is a standard way of handling this on Linux, Windows, and macOS.
Thank you for your time.
I will try to throw some ideas here:
most C++ programs/programmers just assume that text is an almost opaque sequence of bytes. UTF-8 is probably responsible for that, and it is no surprise that many comments boil down to: don't worry about Unicode, just process UTF-8-encoded strings
files only contain bytes. If at some point you try to internally process true Unicode code points, you will have to serialize them back to bytes -> here again UTF-8 wins the point
as soon as you go outside the Basic Multilingual Plane (16-bit code points), things become more and more complex. Emoji are especially awful to process: an emoji can be followed by a variation selector (U+FE0E VARIATION SELECTOR-15 (VS15) for text style or U+FE0F VARIATION SELECTOR-16 (VS16) for emoji style) to alter its display, more or less like the old i BS ^ sequence that was used with 1970s ASCII when one wanted to print î. That's not all: the characters U+1F3FB to U+1F3FF are used to provide a skin color for 102 human emoji spread across six blocks: Dingbats, Emoticons, Miscellaneous Symbols, Miscellaneous Symbols and Pictographs, Supplemental Symbols and Pictographs, and Transport and Map Symbols.
That simply means that up to 3 consecutive Unicode code points can represent one single glyph... So the idea that one character is one char32_t is still an approximation (a tiny illustration follows).
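A quick check of how many char32_t values hide behind one visible emoji (a minimal sketch):

#include <iostream>
#include <string>

int main()
{
    // U+1F44D THUMBS UP SIGN followed by the U+1F3FD skin-tone modifier:
    // one glyph on screen, but two code points (and four UTF-16 code units).
    std::u32string thumbs = U"\U0001F44D\U0001F3FD";
    std::cout << thumbs.size() << "\n"; // prints 2
}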
My conclusion is that Unicode is a complex thing, and really requires a dedicated library like ICU. You can try to use simple tools like the converters of the standard library when you only deal with the BMP, but full support is far beyond that.
BTW: even other languages like Python, which claims to have native Unicode support (IMHO far better than the current C++ one), often fail in places:
the tkinter GUI library cannot display any code point outside the BMP, even though it is what the standard IDLE Python tool is built on
several modules of the standard library are dedicated to Unicode in addition to the core language support (codecs and unicodedata), and other modules are available in the Python Package Index, like emoji support, because the standard library does not meet all needs
So Unicode support has been poor for more than 10 years, and I do not really expect that things will get much better in the next 10 years...
I want to write a language-learning app for myself using Visual Studio 2017, C++ and the Windows API (formerly known as Win32). The operating system is the latest Windows 10 insider build, and backwards compatibility is a non-issue. Since I assume English to be the mother tongue of the user, and the language I am currently interested in is another European language, ASCII might suffice. But I want to future-proof it (more exotic languages) and I also want to try my hand at UTF-32. I have previously used both UTF-8 and UTF-16, though I have more experience with the latter.
Thanks to std::basic_string, it was easy to figure out how to get a UTF-32 string:
typedef std::basic_string<char32_t> stringUTF32;
Since I am using the WinAPI for all GUI stuff, I need to do some conversion between UTF-32 and UTF-16.
Now to my problem: Since UTF-32 is not widely used because of its inefficiencies, there is hardly any material about it on the web. To avoid unnecessary conversions, I want to save my vocabulary lists and other data as UTF-32 (for all UTF-8 advocates/evangelists, the alternative would be UTF-16). The problem is, I cannot find how to write and open files in UTF-32.
So my question is: How to write/open files in UTF-32? I would prefer if no third-party libraries are needed unless they are a part of Windows or are usually shipped with that OS.
If you have a char32_t sequence, you can write it to a file using a std::basic_ofstream<char32_t> (which I will refer to as u32_ofstream, but this typedef does not exist). This works exactly like std::ofstream, except that it writes char32_ts instead of chars. But there are limitations.
Most standard library types that have an operator<< overload are templated on the character type. So they will work with u32_ofstream just fine. The problem you will encounter is for user types. These almost always assume that you're writing char, and thus are defined as ostream &operator<<(ostream &os, ...);. Such stream output can't work with u32_ofstream without a conversion layer.
But the big issue you're going to face is endian issues. u32_ofstream will write char32_t as your platform's native endian. If your application reads them back through a u32_ifstream, that's fine. But if other applications read them, or if your application needs to read something written in UTF-32 by someone else, that becomes a problem.
The typical solution is to use a "byte order mark" as the first character of the file. Unicode even has a specific codepoint set aside for this: \U0000FEFF.
The way a BOM works is like this. When writing a file, you write the BOM before any other codepoints.
When reading a file of an unknown encoding, you read the first codepoint as normal. If it comes out equal to the BOM in your native encoding, then you can read the rest of the file as normal. If it doesn't, then you need to read the file and endian-convert it before you can process it. That process would look a bit like this:
constexpr char32_t native_bom = U'\U0000FEFF';
u32_ifstream is(...);
char32_t bom;
is >> bom;
if(native_bom == bom)
{
process_stream(is);
}
else
{
basic_stringstream<char32_t> char_stream;
//Load the rest of `is` and endian-convert it into `char_stream`.
process_stream(char_stream);
}
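That endian-convert step could look roughly like this (a sketch; swap_endian and swap_endian_inplace are hypothetical helper names, and std::byteswap only arrives in C++23):

#include <cstdint>
#include <string>

constexpr char32_t swap_endian(char32_t c)
{
    std::uint32_t v = static_cast<std::uint32_t>(c);
    v = ((v & 0x000000FFu) << 24) |
        ((v & 0x0000FF00u) <<  8) |
        ((v & 0x00FF0000u) >>  8) |
        ((v & 0xFF000000u) >> 24);
    return static_cast<char32_t>(v);
}

// Flip every code unit that was read with the "wrong" byte order.
void swap_endian_inplace(std::u32string &s)
{
    for (char32_t &c : s)
        c = swap_endian(c);
}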
I am currently interested in is another European language, [so] ASCII might suffice
No. Not even in plain English. You know how Microsoft Word creates “curly quotes”? Those are non-ASCII characters. All those letters with accents and umlauts in e.g. French or English are non-ASCII characters.
I want to future-proof it
UTF-8, UTF-16 and UTF-32 all can encode every Unicode code point. They’re all future-proof. UTF-32 does not have an advantage over the other two.
Also for future proofing: I’m quite sure some scripts use characters (the technical term is ‘grapheme clusters’) consisting of more than one code point. A cursory search turns up Playing around with Devanagari characters.
A downside of UTF-32 is support in other tools. Notepad won’t open your files. Beyond Compare won’t. Visual Studio Code… nope. Visual Studio will, but it won’t let you create such files.
And the Win32 API: it has a function MultiByteToWideChar which can convert UTF-8 to UTF-16 (the encoding you need to pass to all Win32 calls), but it doesn't accept UTF-32.
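That conversion is only a few lines; a rough sketch (error handling omitted, the function name is just illustrative):

#include <string>
#include <windows.h>

std::wstring utf8_to_utf16(const std::string &utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call computes the required length, second call does the conversion.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &utf16[0], len);
    return utf16;
}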
So my honest answer to this question is, don’t. Otherwise follow Nicol’s answer.
I've been doing a bit of reading around the subject of Unicode -- specifically, UTF-8 -- (non) support in C++11, and I was hoping the gurus on Stack Overflow could reassure me that my understanding is correct, or point out where I've misunderstood or missed something if that is the case.
A short summary
First, the good: you can define UTF-8, UTF-16 and UCS-4 literals in your source code. Also, the <locale> header contains several std::codecvt implementations which can convert between any of UTF-8, UTF-16, UCS-4 and the platform multibyte encoding (although the API seems, to put it mildly, less than straightforward). These codecvt implementations can be imbue()'d on streams to allow you to do conversion as you read or write a file (or other stream).
[EDIT: Cubbi points out in the comments that I neglected to mention the <codecvt> header, which provides std::codecvt implementations that do not depend on a locale. Also, the std::wstring_convert and wbuffer_convert class templates can use these codecvts to convert strings and buffers directly, without relying on streams.]
C++11 also includes the C99/C11 <uchar.h> header which contains functions to convert individual characters from the platform multibyte encoding (which may or may not be UTF-8) to and from UCS-2 and UCS-4.
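For illustration, roughly what using those per-character functions looks like (a sketch that assumes a UTF-8 locale named "en_US.UTF-8" is available):

#include <clocale>
#include <cstring>
#include <cuchar>
#include <cwchar>
#include <iostream>

int main()
{
    std::setlocale(LC_ALL, "en_US.UTF-8"); // platform multibyte encoding = UTF-8 (if available)

    const char *mb = u8"\u00E9";           // "é", 2 bytes in UTF-8
    char32_t c32 = 0;
    std::mbstate_t state{};
    std::size_t n = std::mbrtoc32(&c32, mb, std::strlen(mb), &state);
    std::cout << n << " bytes -> U+" << std::hex
              << static_cast<unsigned long>(c32) << "\n"; // 2 bytes -> U+e9
}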
However, that's about the extent of it. While you can of course store UTF-8 text in a std::string, there are no ways that I can see to do anything really useful with it. For example, other than defining a literal in your code, you can't validate an array of bytes as containing valid UTF-8, you can't find out the length (i.e. number of Unicode characters, for some definition of "character") of a UTF-8-containing std::string, and you can't iterate over a std::string in any way other than byte-by-byte.
Similarly, even the C++11 addition of std::u16string doesn't really support UTF-16, but only the older UCS-2 -- it has no support for surrogate pairs, leaving you with just the BMP.
Observations
Given that UTF-8 is the standard way of handling Unicode on pretty much every Unix-derived system (including Mac OS X* and Linux) and has largely become the de facto standard on the web, the lack of support in modern C++ seems like a pretty severe omission. Even on Windows, the fact that the new std::u16string doesn't really support UTF-16 seems somewhat regrettable.
* As pointed out in the comments and made clear here, the BSD-derived parts of Mac OS use UTF-8 while Cocoa uses UTF-16.
Questions
If you managed to read all that, thanks! Just a couple of quick questions, as this is Stack Overflow after all...
Is the above analysis correct, or are there any other Unicode-supporting facilities I'm missing?
The standards committee has done a fantastic job in the last couple of years moving C++ forward at a rapid pace. They're all smart people and I assume they're well aware of the above shortcomings. Is there a particular well-known reason that Unicode support remains so poor in C++?
Going forward, does anybody know of any proposals to rectify the situation? A quick search on isocpp.org didn't seem to reveal anything.
EDIT: Thanks everybody for your responses. I have to confess that I find them slightly disheartening -- it looks like the status quo is unlikely to change in the near future. If there is a consensus among the cognoscenti, it seems to be that complete Unicode support is just too hard, and that any solution must reimplement most of ICU to be considered useful.
I personally don't agree with this; I think there is valuable middle ground to be found. For example, the validation and normalisation algorithms for UTF-8 and UTF-16 are well specified by the Unicode Consortium, and could be supplied by the standard library as free functions in, say, a std::unicode namespace. These alone would be a great help for C++ programs which need to interface with libraries expecting Unicode input. But based on the answer below (tinged, it must be said, with a hint of bitterness), it seems Puppy's proposal for just this sort of limited functionality was not well received.
Is the above analysis correct
Let's see.
you can't validate an array of bytes as containing valid UTF-8
Incorrect. std::codecvt_utf8<char32_t>::length(start, end, max_length) returns the number of valid bytes in the array.
you can't find out the length
Partially correct. One can convert to char32_t and find out the length of the result. There is no easy way to find out the length without doing the actual conversion (but see below). I must say that the need to count characters (in any sense) arises rather infrequently.
you can't iterate over a std::string in any way other than byte-by-byte
Incorrect. std::codecvt_utf8<char32_t>::length(start, end, 1) gives you the possibility to iterate over UTF-8 "characters" (i.e. Unicode code points), and of course to determine their number (that's not an "easy" way to count the number of characters, but it's a way).
doesn't really support UTF-16
Incorrect. One can convert to and from UTF-16 with e.g. std::codecvt_utf8_utf16<char16_t>. A result of conversion to UTF-16 is, well, UTF-16. It is not restricted to BMP.
Demo that illustrates these points.
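For the UTF-16 point in particular, something along these lines works (a minimal sketch; note the deprecation caveat in the addendum below):

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // UTF-8 <-> UTF-16 with surrogate pairs handled, i.e. not UCS-2.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

    std::u16string utf16 = conv.from_bytes(u8"\U0001D11E"); // MUSICAL SYMBOL G CLEF, outside the BMP
    std::cout << utf16.size() << "\n";                      // 2: a surrogate pair

    std::string utf8 = conv.to_bytes(utf16);                // back to UTF-8
    std::cout << utf8.size() << "\n";                       // 4 bytes
}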
If I have missed some other "you can't", please point it out and I will address it.
Important addendum. These facilities are deprecated in C++17. This probably means they will go away in some future version of C++. Use them at your own risk. All the things enumerated in the original question once again cannot (safely) be done using only the standard library.
Is the above analysis correct, or are there any other
Unicode-supporting facilities I'm missing?
You're also missing the utter failure of UTF-8 literals. They don't have a type distinct from narrow-character literals, which may have a totally unrelated encoding (e.g. codepages). So not only did they not add any serious new facilities in C++11, they broke what little there was, because now you can't even assume that a char* is in the narrow-string encoding for your platform unless UTF-8 is the narrow-string encoding. So the new feature here is "We totally broke char-based strings on every platform where UTF-8 isn't the existing narrow string encoding".
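To make that concrete (a sketch; the char8_t fix for this only arrived much later, in C++20):

// In C++11/14 both of these produce arrays of plain char -- the type system
// cannot tell them apart, only the bytes differ.
const char *narrow = "gr\u00FC\u00DFe";   // bytes in the compiler's execution charset (maybe a codepage)
const char *utf8   = u8"gr\u00FC\u00DFe"; // bytes guaranteed to be UTF-8

// Any const char* API can silently receive either encoding; only C++20's
// char8_t finally gives u8 literals a distinct type.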
The standards committee has done a fantastic job in the last couple of
years moving C++ forward at a rapid pace. They're all smart people and
I assume they're well aware of the above shortcomings. Is there a
particular well-known reason that Unicode support remains so poor in
C++?
The Committee simply doesn't seem to give a shit about Unicode.
Also, many of the Unicode support algorithms are just that: algorithms. This means that to offer a decent interface, we need ranges. And we all know that the Committee can't figure out what they want w.r.t. ranges. The new Iterables thing from Eric Niebler may have a shot.
Going forward, does anybody know of any proposals to rectify the
situation? A quick search on isocpp.org didn't seem to reveal
anything.
There was N3572, which I authored. But when I went to Bristol and presented it, there were a number of problems.
Firstly, it turns out that the Committee doesn't bother to give feedback on proposals authored by non-Committee members between meetings, resulting in months of lost work when you iterate on a design they don't want.
Secondly, it turns out that it's voted on by whoever happens to wander by at the time. This means that if your paper gets rescheduled, you have a relatively random bunch of people who may or may not know anything about the subject matter. Or indeed, anything at all.
Thirdly, for some reason they don't seem to view the current situation as a serious problem. You can get endless discussion about how exactly optional<T>'s comparison operations should be defined, but dealing with user input? Who cares about that?
Fourthly, each paper needs a champion, effectively, to present and maintain it. Given the previous issues, plus the fact that there's no way I could afford to travel to other meetings, it was certainly not going to be me, will not be me in the future unless you want to donate all my travel expenses and pay a salary on top, and nobody else seemed to care enough to put the effort in.
In my application I have to constantly convert strings between std::string and std::wstring due to different APIs (Boost, Win32, FFmpeg, etc.). Especially with FFmpeg the strings end up utf8->utf16->utf8->utf16, just to open a file.
Since UTF-8 is backwards compatible with ASCII, I thought I would consistently store all my strings as UTF-8 in std::string and only convert to std::wstring when I have to call certain unusual functions.
This worked reasonably well; I implemented to_lower, to_upper and iequals for UTF-8. However, I then hit several dead ends: std::regex and regular string comparisons. To make this usable I would need to implement a custom ustring class based on std::string with re-implementations of all the corresponding algorithms (including regex).
Basically my conclusion is that UTF-8 is not very good for general usage, and the current std::string/std::wstring situation is a mess.
However, my question is: why are the default std::string and "" not simply changed to use UTF-8, especially as UTF-8 is backwards compatible? Is there possibly some compiler flag which can do this? Of course the STL implementation would need to be adapted automatically.
I've looked at ICU, but it is not very compatible with APIs assuming basic_string, e.g. no begin/end/c_str, etc.
The main issue is the conflation of in-memory representation and encoding.
None of the Unicode encodings is really amenable to text processing. Users will in general care about graphemes (what's on the screen) while the encoding is defined in terms of code points... and some graphemes are composed of several code points.
As such, when one asks: what is the 5th character of "Hélène" (French first name) the question is quite confusing:
In terms of graphemes, the answer is n.
In terms of code points... it depends on the representation of é and è (they can be represented either as a single code point or as a pair using diacritics...)
Depending on the source of the question (an end-user in front of her screen or an encoding routine) the response is completely different.
Therefore, I think that the real question is Why are we speaking about encodings here?
Today it does not make sense, and we would need two "views": Graphemes and Code Points.
Unfortunately the std::string and std::wstring interfaces were inherited from a time when people thought that ASCII was sufficient, and the progress made since didn't really solve the issue.
I don't even understand why the in-memory representation should be specified; it is an implementation detail. All a user should want is:
to be able to read/write in UTF-* and ASCII
to be able to work on graphemes
to be able to edit a grapheme (to manage the diacritics)
... who cares how it is represented? I thought that good software was built on encapsulation?
Well, C cares, and we want interoperability... so I guess it will be fixed when C is.
You cannot; the primary reason for this is named Microsoft. They decided not to support Unicode as UTF-8, so the support for UTF-8 under Windows is minimal.
Under Windows you cannot use UTF-8 as a codepage, but you can convert from or to UTF-8.
There are two snags to using UTF-8 on Windows.
You cannot tell how many bytes a string will occupy - it depends on which characters are present, since some characters take 1 byte, some take 2, some take 3, and some take 4.
The Windows API uses UTF-16. Since most Windows programs make numerous calls to the Windows API, there is quite an overhead converting back and forth. (Note that you can do a "non-Unicode" build, which looks like it uses a UTF-8 Windows API, but all that is happening is that the conversion back and forth on each call is hidden.)
The big snag with UTF-16 is that the binary representation of a string depends on the byte order in a word on the particular hardware the program is running on. This does not matter in most cases, except when strings are transmitted between computers where you cannot be sure that the other computer uses the same byte order.
So what to do? I use UTF-16 everywhere 'inside' all my programs. When string data has to be stored in a file, or transmitted over a socket, I first convert it to UTF-8.
This means that 95% of my code runs simply and most efficiently, and all the messy conversions between UTF-8 and UTF-16 can be isolated to routines responsible for I/O.
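One of those boundary routines could look roughly like this (a sketch, error handling omitted, the function name is just illustrative):

#include <string>
#include <windows.h>

// UTF-16 (internal representation) -> UTF-8 (file / socket).
std::string utf16_to_utf8(const std::wstring &utf16)
{
    if (utf16.empty()) return std::string();
    // First call computes the required size, second call performs the conversion.
    int len = WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(),
                                  nullptr, 0, nullptr, nullptr);
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(),
                        &utf8[0], len, nullptr, nullptr);
    return utf8;
}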
I asked a similar question yesterday, but realize that I need to rephrase it in a different way.
In short:
In C++ on Windows, how do I do a case-insensitive search for a string (inside another string) when the strings are in Unicode format (wide char, wchar_t) and I don't know the language of the strings? I just want to know whether the needle exists in the haystack. The location of the needle isn't relevant to me.
Background:
I have a repository containing a lot of email bodies. The messages are in different languages (Japanese, German, Russian, Finnish; you name it). All the data is in Unicode format, and I load it into wide strings (wchar_t) in my C++ application (the bodies have been MIME-decoded, so in my debugger I can see the actual Japanese and German characters). I don't know the language of the messages, since email messages don't contain that detail; also, a single email body may contain characters from several languages.
I'm looking for something like wcsstr, but with the ability to do the search in a case-insensitive manner. I know that it's not possible to do a 100% proper conversion from upper case to lower case without knowing the language of the text. I want a solution which works in the 99% of cases where it's possible.
I'm using Visual Studio 2008 with C++, STL and Boost.
You have to specify the language to do a case-insensitive comparison. For example, in Turkish, 'i' is NOT the lower-case letter corresponding to 'I'. If the language appears not to be specified, then the comparison is being done with an implicitly selected language.
Boost String Algorithms has an icontains() function template which may do what you need.
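Something along these lines (a sketch; it case-folds per character via the locale, so it does not handle full Unicode case folding such as ß vs SS):

#include <boost/algorithm/string.hpp>
#include <string>

bool contains_ci(const std::wstring &haystack, const std::wstring &needle)
{
    // Case-insensitive substring test; pass a specific std::locale if the
    // default locale is not suitable for your data.
    return boost::algorithm::icontains(haystack, needle, std::locale());
}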
You should use the ICU library which provides support for Unicode regular expressions which follow the Unicode rules for case-insensitive matching. The library is available as C/C++ and Java libraries. Many other languages such as Python support a wrapper for the ICU libraries.
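A rough sketch of what that looks like with ICU (note the needle is treated as a regex pattern here, so metacharacters in it would need quoting, e.g. by wrapping it in \Q ... \E):

#include <unicode/regex.h>
#include <unicode/unistr.h>

bool icu_icontains(const icu::UnicodeString &haystack, const icu::UnicodeString &needle)
{
    UErrorCode status = U_ZERO_ERROR;
    // UREGEX_CASE_INSENSITIVE enables Unicode case-folding rules for the match.
    icu::RegexMatcher matcher(needle, haystack, UREGEX_CASE_INSENSITIVE, status);
    return U_SUCCESS(status) && matcher.find();
}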
You could convert both needle and haystack to lowercase (or uppercase), then do the wcsstr().
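Roughly like this (a sketch; per-code-unit lowercasing via the C runtime, which ignores locale subtleties such as the Turkish I and ß/SS, and ignores surrogate pairs):

#include <algorithm>
#include <cwchar>
#include <cwctype>
#include <string>

std::wstring lower_rough(std::wstring s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](wchar_t c) { return static_cast<wchar_t>(std::towlower(c)); });
    return s;
}

bool contains_ci_rough(const std::wstring &haystack, const std::wstring &needle)
{
    return std::wcsstr(lower_rough(haystack).c_str(), lower_rough(needle).c_str()) != nullptr;
}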