How to write a std::string to a UTF-8 text file - c++

I just want to write a few simple lines to a text file in C++, but I want them to be encoded in UTF-8. What is the easiest and simplest way to do so?

The only way UTF-8 affects std::string is that size(), length(), and all the indices are measured in bytes, not characters.
And, as sbi points out, incrementing the iterator provided by std::string will step forward by byte, not by character, so it can actually point into the middle of a multibyte UTF-8 codepoint. There's no UTF-8-aware iterator provided in the standard library, but there are a few available on the 'Net.
If you remember that, you can put UTF-8 into std::string, write it to a file, etc. all in the usual way (by which I mean the way you'd use a std::string without UTF-8 inside).
You may want to start your file with a byte order mark so that other programs will know it is UTF-8.
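To make "the usual way" concrete, here is a minimal sketch, assuming the std::string already holds UTF-8 bytes. The function name write_utf8_file and the with_bom flag are made up for illustration; nothing here converts or validates the text.

```cpp
#include <fstream>
#include <string>

// Writes `text` byte-for-byte to `path`. `text` is assumed to already be
// UTF-8; this function neither converts nor validates it.
// (write_utf8_file and with_bom are illustrative names, not a library API.)
bool write_utf8_file(const std::string& path, const std::string& text,
                     bool with_bom = false) {
    // Binary mode stops the runtime from translating '\n' on Windows.
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    if (with_bom) out << "\xEF\xBB\xBF";  // U+FEFF encoded as UTF-8
    out << text;
    out.flush();
    return static_cast<bool>(out);
}
```

The BOM is optional, which is why it sits behind a flag that defaults to off.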

There is a nice, tiny library for working with UTF-8 from C++: utfcpp

libiconv is a great library for all our encoding and decoding needs.
If you are using Windows, you can use WideCharToMultiByte and specify CP_UTF8 as the code page.

What is the easiest and simplest way to do so?
The most intuitive, and thus easiest, way to handle UTF-8 in C++ is surely a drop-in replacement for std::string.
As the internet still lacked one, I implemented the functionality on my own:
tinyutf8 (EDIT: now Github).
This library provides a very lightweight drop-in replacement for std::string (or std::u32string if you will, because you iterate over codepoints rather than chars). It strikes a balance between fast access and small memory consumption, while being very robust. This robustness to 'invalid' UTF-8 sequences makes it (nearly completely) compatible with ANSI (0-255).
Hope this helps!

If by "simple" you mean ASCII, there is no need to do any encoding, since characters with an ASCII value of 127 or less are the same in UTF-8.
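If you want to check whether that shortcut applies to your data, a small test like the following is enough. It is only a sketch, and is_ascii is a made-up helper name.

```cpp
#include <string>

// Returns true when every byte is in the ASCII range 0..127. Such a string
// is already valid UTF-8 and can be written out unchanged.
bool is_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c > 0x7F) return false;
    }
    return true;
}
```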

std::wstring text = L"Привет";
QString qstr = QString::fromStdWString(text);
QByteArray byteArray(qstr.toUtf8());
std::string str_std( byteArray.constData(), byteArray.length());

My preference is to convert to and from a std::u32string and work with codepoints internally, then convert to utf8 when writing out to a file using these converting iterators I put on github.
#include <utf/utf.h>
int main()
{
    using namespace utf;
    u32string u32_text = U"ɦΈ˪˪ʘ";
    // do stuff with string
    // convert to utf8 string
    utf32_to_utf8_iterator<u32string::iterator> pos(u32_text.begin());
    utf32_to_utf8_iterator<u32string::iterator> end(u32_text.end());
    u8string u8_text(pos, end);
    // write out utf8 to file.
    // ...
}

Use Glib::ustring from glibmm.
It is the only widespread UTF-8 string container (AFAIK). While character (not byte) based, it has the same method signatures as std::string so the port should be simple search and replace (just make sure that your data is valid UTF-8 before loading it into a ustring).

Because UTF-8 is a multibyte encoding, strings in it are awkward to work with, and I think it is a bad idea. Use plain Unicode instead.
In my opinion, the best option is ordinary ASCII char text with some code page. You only need Unicode if you mix more than two sets of symbols (languages) in a single text.
That is a rather rare case. In most cases two sets of symbols are enough, and for this common case you can use ASCII chars, not Unicode.
The only texts for which a multibyte encoding like UTF-8 pays off are traditional Chinese, Arabic, or some hieroglyphic text. That is a very, very rare case!
I don't think many people need that, so I never use UTF-8. It avoids the headache of manipulating such strings.

Related

Storing math symbols into string c++

Is there a way to store math symbols into strings in c++ ?
I notably need the union/intersection symbols.
Thanks in advance!
This seemingly simple question is actually a tangle of multiple questions:
What character set to use?
Unicode is almost certainly the best choice nowadays.
What encoding to use?
C++ std::strings are strings of chars, but you can decide how those chars correspond to "characters" in your character set. The default representation assumed by the language and the system could be ASCII, some random code page like Latin-1 or Windows-1252, or UTF-8.
If you're on Linux or Mac, your best bet is to use UTF-8. If you're on Windows, you might choose to use wide strings instead (std::wstring), and to use UTF-16 as the encoding. But many people suggest that you always use UTF-8 in std::strings even on Windows, and simply convert from and to UTF-16 as needed to do I/O.
How to specify string literals in the code?
To store UTF-8 in older versions of C++ (before C++11), you could manually encode your string literals like this:
const std::string subset = "\xE2\x8A\x82";
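To store UTF-8 in C++11 or newer, you use the u8 prefix to tell the compiler you want UTF-8 encoding (note that since C++20 a u8 literal has type char8_t, so assigning it directly to a std::string only compiles in C++11 through C++17). You can use escaped characters: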
const std::string subset = u8"\u2282";
Or you can enter the character directly into the source code:
const std::string subset = u8"⊂";
I tend to use the escaped versions to avoid worrying about the encoding of the source file and whether all the editors and viewers and IDEs I use will consistently understand the source file encoding.
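As a rough illustration that the escaped u8 literal and the hand-encoded bytes are the same thing, here is a sketch. The helper from_u8 is hypothetical; it exists only so the same line compiles both before C++20 (where u8 literals are arrays of char) and from C++20 on (where they are arrays of char8_t).

```cpp
#include <cstddef>
#include <string>

// Copies the bytes of a u8 string literal into a std::string.
// Char is deduced as char (C++11..17) or char8_t (C++20 and later);
// N includes the terminating null, hence N - 1.
// (from_u8 is an illustrative helper, not a standard function.)
template <typename Char, std::size_t N>
std::string from_u8(const Char (&literal)[N]) {
    return std::string(reinterpret_cast<const char*>(literal), N - 1);
}

// U+2282 SUBSET OF, written once as an escape and once as raw bytes.
const std::string subset_escaped = from_u8(u8"\u2282");
const std::string subset_bytes = "\xE2\x8A\x82";
```

Both strings hold the same three bytes, so either spelling can be written to a file or passed to a UTF-8 aware API.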
If you're on Windows and you choose to use UTF-16 instead, then, regardless of C++ version, you can specify wide string literals in your code like this:
const std::wstring subset = L"\u2282"; // or L"⊂";
How to display these strings?
This is very system dependent.
On Mac and Linux, I suspect things will generally just work.
In a console program on Windows (e.g., one that just uses <iostream> or printf to display in a command prompt), you're probably in trouble because the legacy command prompts don't have good Unicode and font support. (Maybe this is better on Windows 10?)
In a GUI program on Windows, you have to make sure you use the "Unicode" version of the API and to give it the wide string. ("Unicode" is in quotation marks here because the Windows API documentation often uses "Unicode" to mean a UTF-16 encoded wide character string, which isn't exactly what Unicode means.) So if you want to use an API like TextOut or MessageBox to display your string, you have to make sure you do two things: (1) call the "wide" version of the API, and (2) pass a UTF-16 encoded string.
You solve (1) by explicitly calling the wide versions (e.g., TextOutW or MessageBoxW) or by making sure you compile with "Unicode" selected in your project settings. (You can also do it by defining several C++ preprocessor macros instead, but this answer is already long enough.)
For (2), if you are using std::wstrings, you're already done. If you're using UTF-8, you'll need to make a wide copy of the string to pass to the output function. Windows provides MultiByteToWideChar for making such a copy. Make sure you specify CP_UTF8.
For (2), do not try to call the narrow versions of the API functions themselves (e.g., TextOutA or MessageBoxA). These will convert your string to a wide string automatically, but they do so assuming the string is encoded in the user's current code page. If the string is really in UTF-8, then these will do the wrong thing for all of the "interesting" (non-ASCII) characters.
How to read these strings from a file, a socket, or the user?
This is very system specific and probably worth a separate question.
Yes, you can, as follows:
std::string unionChar = "∪";
std::string intersectionChar = "∩";
They are just characters but don't expect this code to be portable. You could also use Unicode, as follows:
std::string unionChar = u8"\u222A";
std::string intersectionChar = u8"\u2229";

Choosing encoding for icu::UnicodeString

I found myself in need of a way to change a string to lower case that was safe to use for ASCII and for UTF16-LE (as found in some windows registry strings) and came across this question: How to convert std::string to lower case?
The answer that seemed to be the "most correct" to me (I'm not using Boost) was one that demonstrated using the icu library.
In this answer, he specified the encoding "ISO-8859-1" for the UnicodeString constructor. Why is this the correct value and how do I know what to use?
ISO-8859-1 has worked for the few unit tests I've run against ASCII encoded strings that used only Latin characters, but I don't like using it if I don't know why.
If it matters, I'm mainly concerned with manipulating English data that is typically stored in ASCII, but the windows registry has the ability to store things in UTF-16LE and I don't want to block myself from supporting other languages down the road by littering my code with non-unicode safe stuff.
I found myself in need of a way to change a string to lower case for the purpose of case-insensitive string comparison
UnicodeString in ICU has many caseCompare() methods for performing comparisons "case-insensitively using full case folding". You don't need to transform your strings manually.
In this answer, he specified the encoding "ISO-8859-1" for the UnicodeString constructor. Why is this the correct value and how do I know what to use?
Because the author is passing an ISO-8859-1 encoded char* string literal to the constructor. UnicodeString represents a UTF-16 encoded string. If you construct it using a char* as input, you have to specify the correct charset the input data is encoded with so UnicodeString can decode it to Unicode and then re-encode it as UTF-16.

Unicode std::string class replacement

I'm looking for suggestions regarding unicode aware std::string library replacements. I have a bunch of code that uses std::string, its iterators etc, and would like to now support unicode strings (free or open source implementations preferred, regex capabilities would be great!).
I'm not sure at this point if I require a complete rewrite or if I can get away with dropping in a new string library that supports all of the std::string interfaces. The unicode world seems very complex and I'm just wanting to enable it in my applications not have to learn every single aspect of it.
By the way, how does the index operator work when it has to return a reference to a 1-, 2-, 3- or 4-byte structure that could then be assigned a value of a different size? If a larger or smaller value is assigned, does the internal data get shifted back and forth in situ?
You don't need a complete rewrite if you make sure about what your std::string contains. For example, you could assume (and convert inputs to be sure) that your std::string contains UTF8 encoded strings (for those that need localization). Don't forget that std::string is only a container of raw data, it's not associated with an encoding (even in C++0x, it's only a possibility, not a requirement).
Then when you pass text to other libraries that require different encodings, you can use libraries like UTF8CPP to convert to the required encoding (but most of the time such libraries will do it themselves).
That way makes it simple. UTF8 with standard std::string in your code, enabling passing unicode string to everything else (with conversion if necessary).
There have been a lot of discussions about this in the boost community mailing list. Maybe reading it (if you have enough time...) can help you understand other possible solutions.
Depending on your needs, use std::wstring or the larger and more complex (but de facto standard) ICU: http://site.icu-project.org/
What Unicode encoding do you need? If UTF-8 is OK, you can have a look at Glib::ustring:
Glib::ustring has much the same interface as std::string, but contains Unicode characters encoded as UTF-8.
Asking for "a type like std::string, but for Unicode" is like asking for "a type like unsigned, but for primes." std::string is perfectly capable of storing Unicode, in many encodings - the most generally useful being UTF-8.
What you need to replace is your iterators, not your storage type. The iterators should iterate over the codepoints of the string rather than the bytes. That is, ++i should advance one codepoint, and *i should return a codepoint (via uint32_t) rather than a char.
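A full iterator class is more than fits here, but the decoding step such an iterator would perform can be sketched as a plain function. This is an assumption-laden sketch, not a validating decoder: it replaces stray or truncated sequences with U+FFFD and does not reject overlong forms or surrogates. The name decode_utf8 is made up.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 bytes into code points, one char32_t per code point.
// Malformed or truncated sequences yield U+FFFD; overlong encodings and
// surrogates are NOT rejected (a real library should do that).
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k, ++i) {
            if (i >= s.size() ||
                (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
                ok = false;  // sequence ended early
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

With this, iterating the returned std::u32string gives the "++i advances one codepoint" behaviour, at the cost of a copy.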
I've written my own C++ UTF-8 library, which is a drop-in replacement for std::wstring/string. The data type that is shown to the user is char32_t, but internally the wide characters are all packed into UTF-8 chars.
The whole thing is quite fast and its performance is best with few unicode codepoints within many ascii codepoints. All operations that are known from std::string are available with this class (except for substring find) and operate on codepoint indices, in contrast to byte indices.
As a bonus of defensive programming, the whole ANSI range of 0-255 can be used without multibytes :)
Hope this helps!

Strings and character encoding in C++

I read a few posts about best practices for strings and character encoding in C++, but I am struggling a bit with finding a general purpose approach that seems to me reasonably simple and correct. Could I ask for comments on the following? I'm inclined to use UTF-8 and UTF-32, and to define something like:
typedef std::string string8;
typedef std::basic_string<uint32_t> string32;
The string8 class would be used for UTF-8, and having a separate type is just a reminder of the encoding. An alternative would be for string8 to be a subclass of std::string and to remove the methods that aren't quite right for UTF-8.
The string32 class would be used for UTF-32 when a fixed character size is desired.
The UTF-8 CPP functions, utf8::utf8to32() and utf8::utf32to8(), or even simpler wrapper functions, would be used to convert between the two.
If you plan on just passing strings around and never inspect them, you can use plain std::string though it's a poor man's job.
The issue is that most frameworks, even the standard, have stupidly (I think) enforced encoding in memory. I say stupid because encoding should only matter on the interface, and those encoding are not adapted for in-memory manipulation of the data.
Furthermore, encoding is easy (it's a simple transposition CodePoint -> bytes and reversely) while the main difficulty is actually about manipulating the data.
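To show how small the CodePoint -> bytes step is, here is a sketch of a UTF-32 to UTF-8 encoder. It assumes every input value is a valid Unicode scalar value (at most 0x10FFFF and not a surrogate) and does no checking; append_utf8 and encode_utf8 are made-up names.

```cpp
#include <string>

// Appends the UTF-8 encoding of one code point to `out`.
// Assumes `cp` is a valid Unicode scalar value; nothing is validated.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encodes a whole UTF-32 string as UTF-8.
std::string encode_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) append_utf8(out, cp);
    return out;
}
```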
With an 8-bit or 16-bit encoding you run the risk of cutting a character in the middle because neither std::string nor std::wstring is aware of what a Unicode character is. Worse, even with a 32-bit encoding, there is the risk of separating a character from the diacritics that apply to it, which is also stupid.
The support of Unicode in C++ is therefore extremely subpar, as far as the standard is concerned.
If you really wish to manipulate Unicode string, you need a Unicode aware container. The usual way is to use the ICU library, though its interface is really C-ish. However you'll get everything you need to actually work in Unicode with multiple languages.
It's not specified what character encoding must be used for string, wstring etc. The common way is to use unicode in wide strings. What types and encodings should be used depends on your requirements.
If you only need to pass data from A to B, choose std::string with UTF-8 encoding (don't introduce a new type, just use std::string). If you must work with strings (extract, concat, sort, ...) choose std::wstring and as encoding UCS2/UTF-16 (BMP only) on Windows and UCS4/UTF-32 on Linux.
The benefit is the fixed size: each character takes 2 bytes (or 4 for UCS4), while length() on a std::string holding UTF-8 counts bytes rather than characters.
For conversion, you can check sizeof(std::wstring::value_type) == 2 or 4 to choose UCS2 or UCS4. I'm using the ICU library, but there may be simple wrapper libs.
Deriving from std::string is not recommended because basic_string is not designed for it (it has no virtual members, etc.). If you really really really need your own type like std::basic_string< my_char_type > write a custom specialization for this.
The new C++0x standard defines wstring_convert<> and wbuffer_convert<> to convert with a std::codecvt from a narrow charset to a wide charset (for example UTF-8 to UCS2). Note that both were later deprecated in C++17.
Visual Studio 2010 has already implemented this, afaik.
The traits approach described here might be helpful. It's an old but useful technique.

How do I get STL std::string to work with unicode on windows?

At my company we have a cross-platform (Linux & Windows) library that contains our own extension of the STL std::string. This class provides all sorts of functionality on top of the string: split, format, to/from base64, etc. Recently we were given the requirement of making this string Unicode "friendly"; basically it needs to support characters from Chinese, Japanese, Arabic, etc. After initial research this seems fine on the Linux side since everything is inherently UTF-8, however I am having trouble with the Windows side. Is there a trick to getting the STL std::string to work as UTF-8 on Windows? Is it even possible? Is there a better way? Ideally we would keep ourselves based on the std::string since that is what the string class is based on in Linux.
Thank you,
There are several misconceptions in your question.
Neither C++ nor the STL deal with encodings.
std::string is essentially a string of bytes, not characters. So you should have no problem stuffing UTF-8 encoded Unicode into it. However, keep in mind that all string functions also work on bytes, so myString.length() will give you the number of bytes, not the number of characters.
Linux is not inherently UTF-8. Most distributions nowadays default to UTF-8, but it should not be relied upon.
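On the point that length() counts bytes: if you need the number of code points instead, one common trick is to count every byte that is not a continuation byte. The sketch below assumes the string is well-formed UTF-8, and count_code_points is a made-up name. Note that code points are still not the same as user-perceived characters when combining marks are involved.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
// Assumes well-formed UTF-8; malformed input gives a meaningless count.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```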
Yes - by being more aware of locales and encodings.
Windows has two function calls for everything that requires text, a FoobarA() and a FoobarW(). The *W() functions take UTF-16 encoded strings, the *A() functions take strings in the current code page. However, Windows has historically not allowed UTF-8 as that code page (Windows 10 version 1903 and later let an application opt in through its manifest), so you can't rely on it with the *A() functions, nor would you want to depend on that being set by users. If you want "Unicode" in Windows, use the Unicode-capable (*W) functions. There are tutorials out there, Googling "Unicode Windows tutorial" should get you some.
If you are storing UTF-8 data in a std::string, then before you pass it off to Windows, convert it to UTF-16 (Windows provides functions for doing such), and then pass it to Windows.
Many of these problems arise from C/C++ being generally encoding-agnostic. char isn't really a character, it's just an integral type. Even using char arrays to store UTF-8 data can get you into trouble if you need to access individual code units, as char's signedness is implementation-defined. A statement like str[x] < 0x80 to check for multiple-byte characters can quickly introduce a bug. (That statement is always true if char is signed.) A UTF-8 code unit is an unsigned integral type with a range of 0-255. That maps to the C type of uint8_t exactly, although unsigned char works as well. Ideally then, I'd make a UTF-8 string an array of uint8_ts, but due to old APIs, this is rarely done.
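A sketch of the safe form of that check, assuming the data stays in a std::string: convert the char to unsigned char before comparing. The helper name is_single_byte is made up.

```cpp
#include <cstddef>
#include <string>

// True when the code unit at index `i` is a single-byte (ASCII) character.
// The cast matters: with a signed char, bytes 0x80..0xFF are negative and
// a plain `s[i] < 0x80` would be true for every byte.
bool is_single_byte(const std::string& s, std::size_t i) {
    return static_cast<unsigned char>(s[i]) < 0x80;
}
```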
Some people have recommended wchar_t, claiming it to be "A Unicode character type" or something like that. Again, here the standard is just as agnostic as before, as C is meant to work anywhere, and anywhere might not be using Unicode. Thus, wchar_t is no more Unicode than char. The standard states:
which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales
In Linux, a wchar_t represents a UTF-32 code unit / code point. It is thus 4 bytes. However, in Windows, it's a UTF-16 code unit, and is only 2 bytes. (Which, I would have said does not conform to the above, since 2-bytes cannot represent all of Unicode, but that's the way it works.) This size difference, and difference in data encoding, clearly puts a strain on portability. The Unicode standard itself recommends against wchar_t if you need portability. (§5.2)
The end lesson: I find it easiest to store all my data in some well-declared format. (Typically UTF-8, usually in std::string's, but I'd really like something better.) The important thing here is not the UTF-8 part, but rather, I know that my strings are UTF-8. If I'm passing them to some other API, I must also know that that API expects UTF-8 strings. If it doesn't, then I must convert them. (Thus, if I speak to the Windows API, I must convert strings to UTF-16 first.) A UTF-8 text string is an "orange", and a "latin1" text string is an "apple". A char array that doesn't know what encoding it is in is a recipe for disaster.
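To see why a 2-byte unit cannot hold every code point directly, here is a sketch of how UTF-16 encodes one. It uses char16_t and std::u16string rather than wchar_t so that it behaves the same on every platform, and it assumes the input is a valid Unicode scalar value; append_utf16 is a made-up name.

```cpp
#include <string>

// Appends the UTF-16 code units for one code point to `out`.
// Code points above U+FFFF need two units (a surrogate pair).
// Assumes `cp` is a valid Unicode scalar value; nothing is validated.
void append_utf16(std::u16string& out, char32_t cp) {
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;  // leaves a 20-bit value
        out += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    }
}
```

So on Windows a single wchar_t can be half of a character, which is the same "cut in the middle" hazard that UTF-8 has.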
Putting UTF-8 code points into an std::string should be fine regardless of platform. The problem on Windows is that almost nothing else expects or works with UTF-8 -- it expects and works with UTF-16 instead. You can switch to an std::wstring which will store UTF-16 (at least on most Windows compilers) or you can write other routines that will accept UTF-8 (probably by converting to UTF-16, and then passing through to the OS).
Have you looked at std::wstring? It's a version of std::basic_string for wchar_t rather than the char that std::string uses.
No, there is no way to make Windows treat "narrow" strings as UTF-8.
Here is what works best for me in this situation (cross-platform application that has Windows and Linux builds).
Use std::string in the cross-platform portion of the code. Assume that it always contains UTF-8 strings.
In the Windows portion of the code, use the "wide" versions of the Windows API explicitly, i.e. write e.g. CreateFileW instead of CreateFile. This avoids a dependency on the build system configuration.
In the platform abstraction layer, convert between UTF-8 and UTF-16 where needed (MultiByteToWideChar/WideCharToMultiByte).
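The conversion step in that last item might look roughly like the sketch below. It is Windows-only (hence the _WIN32 guard), utf8_to_wide is a made-up name, and the choice to return an empty string on invalid input is just one possible error policy.

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>

// Converts UTF-8 bytes to a UTF-16 std::wstring for use with the *W APIs.
// Returns an empty string if the input is empty or is not valid UTF-8.
std::wstring utf8_to_wide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    const int bytes = static_cast<int>(utf8.size());
    // First call: ask how many UTF-16 code units the result needs.
    const int units = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                          utf8.data(), bytes, nullptr, 0);
    if (units <= 0) return std::wstring();
    std::wstring wide(static_cast<std::size_t>(units), L'\0');
    // Second call: convert into the buffer that was just sized.
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), bytes, &wide[0], units);
    return wide;
}
#endif
```

The result can then be passed as wide.c_str() to functions such as CreateFileW. The reverse direction with WideCharToMultiByte follows the same two-call pattern.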
Other approaches that I tried but don't like much:
typedef std::basic_string<TCHAR> tstring; then use tstring in the business code. Wrappers/overloads can be made to streamline conversion between std::string and std::tstring, but it still adds a lot of pain.
Use std::wstring everywhere. Does not help much since wchar_t is 16 bit on Windows, so you either have to restrict yourself to BMP or go to a lot of complications to make the code dealing with Unicode cross-platform. In the latter case, all benefits over UTF-8 evaporate.
Use ATL/WTL/MFC CString in the platform-specific portion; use std::string in the cross-platform portion. This is actually a variant of what I recommend above. CString is in many aspects superior to std::string (in my opinion). But it introduces an additional dependency and thus is not always acceptable or convenient.
If you want to avoid headache, don't use the STL string types at all. C++ knows nothing about Unicode or encodings, so to be portable, it's better to use a library that is tailored for Unicode support, e.g. the ICU library. ICU uses UTF-16 strings by default, so no conversion is required, and supports conversions to many other important encodings like UTF-8. Also try to use cross-platform libraries like Boost.Filesystem for things like path manipulations (boost::wpath). Avoid std::string and std::fstream.
In the Windows API and C runtime library, char* parameters are interpreted as being encoded in the "ANSI" code page. The problem is that UTF-8 isn't supported as an ANSI code page, which I find incredibly annoying.
I'm in a similar situation, being in the middle of porting software from Windows to Linux while also making it Unicode-aware. The approach we've taken for this is:
Use UTF-8 as the default encoding for strings.
In Windows-specific code, always call the "W" version of functions, converting string arguments between UTF-8 and UTF-16 as necessary.
This is also the approach Poco has taken.
It is really platform-dependent; Unicode is a headache. It depends on which compiler you use. For older ones from MS (VS2010 or older), you would need to use the API described in MSDN.
For VS2015:
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt"s;
according to their docs. I can't check that one.
For mingw, gcc, etc.:
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt";
std::cout << _old.data();
output contains proper file name...
You should consider using QString and QByteArray; they have good Unicode support.