I have a problem with a std::string comparison that I think is related to encoding. The problem is that I have to compare a string that I receive, without knowing what kind of encoding it has, against a Spanish string with unusual characters. I can't change s_area.m_s_area_text, so I need to set the string s2 to an identical value, and I don't know how to do that in a generic way that works for other cases.
std::string s2= "Versión de sistema";
std::cout << s_area.m_s_area_text << std::endl;
for (const char* p = s2.c_str(); *p; ++p)
{
    // note: *p is a (signed) char, so bytes >= 0x80 are sign-extended
    // when promoted to int, which produces the ffffff runs below
    printf("%02x", *p);
}
printf("\n");
for (const char* p = s_area.m_s_area_text.c_str(); *p; ++p)
{
    printf("%02x", *p);
}
printf("\n");
And the result of the execution is:
Versi├│n de sistema
5665727369fffffff36e2064652073697374656d61
5665727369ffffffc3ffffffb36e2064652073697374656d61
Obviously, as the two strings do not have the same byte values, every comparison method fails: strncmp, std::string operator==, std::string::compare, etc.
Any idea how to do this without touching the s_area.m_s_area_text string?
In general it is impossible to guess the encoding of a string by inspecting its raw bytes. The exception to this rule is when a byte order mark (BOM) is present at the start of the byte stream. The BOM will tell you which Unicode encoding the bytes use and their endianness.
As an aside: if at some point in the future you decide you need a canonical string encoding (as some have pointed out in the comments, it would be a good idea), there are strong arguments in favour of UTF-8 as the best choice for C++. See UTF-8 Everywhere for further information on this.
First of all, to compare two strings correctly you at least need to know their encoding. In your example, s_area.m_s_area_text happens to be encoded in UTF-8, while s2 uses ISO/IEC 8859-1 (Latin-1).
If you are sure that s_area.m_s_area_text will always be encoded in UTF-8, you can make s2 use the same encoding and then just compare them. One way of defining a UTF-8 encoded string is to escape every character that is not in the basic character set with \u.
std::string s2 = u8"Versi\u00F3n de sistema";
...
if (s_area.m_s_area_text == s2)
...
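One portability note: in C++20 the u8 prefix yields a const char8_t array rather than const char, so the literal above no longer converts implicitly to std::string; with a C++20 compiler you may need to drop the prefix or convert explicitly.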
It should also be possible to do it without escaping the characters by setting an appropriate encoding for the source file and specifying the encoding to the compiler.
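For example, GCC and Clang accept -finput-charset=utf-8 and -fexec-charset=utf-8, and MSVC has the /utf-8 switch; check your toolchain's documentation for the details.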
As @nwp mentioned, you may also want to normalise the strings before comparing them. Otherwise, two strings that look the same may have different Unicode representations, and that will cause your comparison to yield a false negative.
For example, "Versión de sistema" will not be equal to "Versión de sistema".
Background:
I am making a hash that will allow you to look up the description you see below by feeding it a QString containing its character.
I got a full list of the relevant data, looking something like this:
QHash<QString, QString> lookupCharacterDescription;
...
lookupCharacterDescription.insert("003F","QUESTION MARK");
lookupCharacterDescription.insert("0040","COMMERCIAL AT");
lookupCharacterDescription.insert("0041","LATIN CAPITAL LETTER A");
lookupCharacterDescription.insert("0042","LATIN CAPITAL LETTER B");
...
lookupCharacterDescription.insert("1F648","SEE-NO-EVIL MONKEY");
lookupCharacterDescription.insert("1F649","HEAR-NO-EVIL MONKEY");
lookupCharacterDescription.insert("1F64A","SPEAK-NO-EVIL MONKEY");
lookupCharacterDescription.insert("1F64B","HAPPY PERSON RAISING ONE HAND");
...
lookupCharacterDescription.insert("FFFD","REPLACEMENT CHARACTER");
lookupCharacterDescription.insert("FFFE","<not a character>");
lookupCharacterDescription.insert("FFFF","<not a character>");
lookupCharacterDescription.insert("FFFFE","<not a character>");
lookupCharacterDescription.insert("FFFFF","<not a character>");
Now obviously "1F64B" needs to be wrapped in something here. I have tried playing around with things like 0x1F64B as a QChar, but I am honestly groping in the dark here. I could make it work with the lower values like the Latin letters, but it fails with the five-character addresses.
Questions:
How do I classify 1F64B?
Is this considered UTF-32?
What can I wrap this value "1F64B" in to produce the QString("🙋")?
Will the wrappings also work for the lower values?
When you use QString(0x1F64B) it'll call QString::QString(QChar ch). Since QChar is a 16-bit type, it'll truncate the value to 0xF64B, and you get an invalid character since that code point is currently unassigned. I'm pretty sure you'll get an out-of-range warning on that line. You can see the value F64B easily in the character if you zoom in or inspect it with a hex editor. Since 0x1F64B can't fit into a single 16-bit QChar and must be represented by a surrogate pair, you can't initialize the string that way.
OTOH QString("🙋") works since it's constructing the string from another string. You must construct the string from a string like that, or manually, by assigning the UTF-8/UTF-16 code units.
Is this considered UTF-32?
No. UTF-32 is a Unicode encoding that uses 32 bits per code unit. You only have a QString and not a bare byte array, so you don't need to care about its underlying encoding (which is actually UTF-16).
What can I wrap this value "1F64B" in to produce the QString("🙋")?
You shouldn't deal with the numeric values as strings. Store them as a numeric type instead:
QHash<qint32, QString> lookupCharacterDescription;
lookupCharacterDescription.insert(0x1F64B, "HAPPY PERSON RAISING ONE HAND");
and then to make a string that contains the character at code point 0x1F64B use
uint cp = 0x1F64B;
QString mystr = QString::fromUcs4(&cp, 1);
Will the wrappings also work for the lower values?
Yes, since UCS-4, a.k.a. UTF-32, can store any possible Unicode character.
Alternatively you can construct the character from UTF-16 or UTF-8. U+1F64B is encoded in UTF-16 as D83D DE4B, and as F0 9F 99 8B in UTF-8, so you can use either of the following:
QChar utf16[2] = { 0xD83D, 0xDE4B };
str1 = QString(utf16, 2);
const char utf8[4] = { '\xF0', '\x9F', '\x99', '\x8B' };
str2 = QString::fromUtf8(utf8, 4);
If you want to include the string in its literal form in source code then either of the following will work
str1 = QString::fromWCharArray(L"\xD83D\xDE4B");
str2 = QString::fromUtf8("\xF0\x9F\x99\x8B");
If you have C++11 support then simply use the prefix u8, u and U for UTF-8, UTF-16 and UTF-32 respectively like
u8"🙋"
u"🙋"
U"🙋"
u8"\U0001F64B"
u"\U0001F64B"
u"\uD83D\uDE4B"
U"\U0001F64B"
Mandatory article to understand text and encodings: There Ain't No Such Thing as Plain Text
I am wondering if this way of reversing a string is safe:
void ReverseString( std::string & stringToReverse )
{
stringToReverse.assign( stringToReverse.rbegin(), stringToReverse.rend() );
}
According to §21.4.6.3/20, assign(first,last) (with iterators first and last) is equivalent to
assign(string(first,last))
Hence it first creates a new string object and then assigns it. There is no risk that the string you copy from (in reverse) is being modified while you still copy (if that is what you were afraid of).
However, using std::reverse(begin(str),end(str)) as suggested by the others is better and potentially more efficient.
I don't know if this is a request to have your code reviewed, or whether you don't know about other options, but you should just use std::reverse from <algorithm>:
std::string str = "Hello world!";
std::reverse(str.begin(), str.end());
This reverses the string in place. If you want to create a new string, you're essentially doing what you have in your code using assign(), but with the std::string constructor:
std::string reversed(str.rbegin(), str.rend());
As suggested by others, what you did does, in fact, reverse the char sequence.
Whether this actually reverses the string depends on what the concepts of "reverse", "string", and "char" are meant to be.
A std::string is a sequence of char objects that are 8 bits long (at least on most platforms).
A Japanese string (but even a French, Italian, or German one) can contain code points outside the 0..127 range, which therefore have to be encoded somehow to be represented in 8-bit units, so a single "character" may occupy more than one char. Putting the chars in reverse order doesn't reverse the text; it messes it up completely.
Assuming 1 character <=> 1 char is true only for pure ASCII text.
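A small demonstration of the problem (a sketch; "café" is hard-coded here with its UTF-8 bytes):
#include <algorithm>
#include <cstdio>
#include <string>

int main()
{
    std::string s = "caf\xC3\xA9";     // "café": é is the two-byte UTF-8 sequence C3 A9
    std::reverse(s.begin(), s.end());  // bytes are now A9 C3 66 61 63
    // A9 C3 is not a valid UTF-8 sequence, so the result is mojibake, not "éfac"
    for (unsigned char c : s) std::printf("%02x ", c);
    std::printf("\n");
}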
I have a working algorithm to convert a UTF-8 string to a UTF-32 string; however, I have to allocate all the space for my UTF-32 string ahead of time. Is there any way to know how many UTF-32 characters a UTF-8 string will take up?
For example, the UTF-8 string "¥0" is 3 chars, and once converted to UTF-32 it is 2 unsigned ints. Is there any way to know the number of UTF-32 'chars' I will need before doing the conversion? Or am I going to have to rewrite the algorithm?
There are two basic options:
You could make two passes through the UTF-8 string, the first one counting the number of UTF-32 characters you'll need to generate, and the second one actually writing them to a buffer.
Allocate the max number of 32-bit chars you could possibly need -- i.e., the length of the UTF-8 string. This is wasteful of memory, but means you can transform utf8->utf32 in one pass.
You could also use a hybrid -- e.g., if the string is shorter than some threshold then use the second approach, otherwise use the first.
For the first approach, the first pass would look something like this:
size_t len=0; // warning: untested code.
for(const char *p=src; *p; ++p) {
// characters that begin with binary 10xxxxxx... are continuations; all other
// characters should begin a new utf32 char (assuming valid utf8 input)
if ((*p & 0xc0) != 0x80) ++len;
}
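For completeness, the second pass could then look something like this (a sketch that assumes the input is valid UTF-8 and that dst has already been sized by the first pass; it performs no validation):
#include <cstddef>
#include <cstdint>

size_t utf8_to_utf32(const char* src, uint32_t* dst)
{
    size_t n = 0;
    for (const unsigned char* p = (const unsigned char*)src; *p; ) {
        uint32_t cp;
        if (*p < 0x80) {                        // 1-byte sequence (ASCII)
            cp = *p++;
        } else if (*p < 0xE0) {                 // 2-byte sequence
            cp  = (uint32_t)(*p++ & 0x1F) << 6;
            cp |= *p++ & 0x3F;
        } else if (*p < 0xF0) {                 // 3-byte sequence
            cp  = (uint32_t)(*p++ & 0x0F) << 12;
            cp |= (uint32_t)(*p++ & 0x3F) << 6;
            cp |= *p++ & 0x3F;
        } else {                                // 4-byte sequence
            cp  = (uint32_t)(*p++ & 0x07) << 18;
            cp |= (uint32_t)(*p++ & 0x3F) << 12;
            cp |= (uint32_t)(*p++ & 0x3F) << 6;
            cp |= *p++ & 0x3F;
        }
        dst[n++] = cp;
    }
    return n;  // number of UTF-32 code units written
}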
I have a function to read the value of one variable (integer, double, or boolean) on a single line in an ifstream:
template <typename Type>
void readFromFile (ifstream &in, Type &val)
{
string str;
getline (in, str);
stringstream ss(str);
ss >> val;
}
However, it fails on text files created with editors inserting a BOM (byte order mark) at the beginning of the first line, which unfortunately includes {Note,Word}pad. How can I modify this function to ignore the byte-order mark if present at the beginning of str?
(I'm assuming you're on Windows, since using U+FEFF as a signature in UTF-8 files is mostly a Windows thing and should simply be avoided elsewhere)
You could open the file as a UTF-8 file and then check to see if the first character is U+FEFF. You can do this by opening a normal char based fstream and then use wbuffer_convert to treat it as a series of code units in another encoding. VS2010 doesn't yet have great support for char32_t so the following uses UTF-16 in wchar_t.
std::fstream fs(filename);
std::wbuffer_convert<std::codecvt_utf8_utf16<wchar_t>,wchar_t> wb(fs.rdbuf());
std::wistream is(&wb);
// if you don't do this on the stack remember to destroy the objects in reverse order of creation. is, then wb, then fs.
std::wistream::int_type ch = is.get();
const std::wistream::int_type ZERO_WIDTH_NO_BREAK_SPACE = 0xFEFF;
if(ZERO_WIDTH_NO_BREAK_SPACE != ch)
is.putback(ch);
// now the stream can be passed around and used without worrying about the extra character in the stream.
int i;
readFromStream<int>(is,i);
Remember that this should be done on the file stream as a whole, not inside readFromFile on your stringstream, because ignoring U+FEFF should only be done if it's the very first character in the whole file, if at all. It shouldn't be done anywhere else.
On the other hand, if you're happy using a char-based stream and just want to skip U+FEFF if present, then James Kanze's suggestion seems good, so here's an implementation:
std::fstream fs(filename);
char a,b,c;
a = fs.get();
b = fs.get();
c = fs.get();
if (a != (char)0xEF || b != (char)0xBB || c != (char)0xBF) {
fs.seekg(0);
} else {
std::cerr << "Warning: file contains the so-called 'UTF-8 signature'\n";
}
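One caveat: if the file holds fewer than three bytes, the calls to get() will put the stream into a failed state and the subsequent seekg(0) will be ignored, so a robust version should call fs.clear() before seeking.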
Additionally if you want to use wchar_t internally the codecvt_utf8_utf16 and codecvt_utf8 facets have a mode that can consume 'BOMs' for you. The only problem is that wchar_t is widely recognized to be worthless these days* and so you probably shouldn't do this.
std::wifstream fin(filename);
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));
* wchar_t is worthless because it is specified to do just one thing: provide a fixed-size data type that can represent any code point in a locale's character repertoire. It does not provide a common representation between locales (i.e., the same wchar_t value can be different characters in different locales, so you cannot necessarily convert to wchar_t, switch to another locale, and then convert back to char in order to do iconv-like encoding conversions).
The fixed-size representation itself is worthless for two reasons: first, many code points have semantic meanings, and so understanding text means you have to process multiple code points anyway; second, some platforms, such as Windows, use UTF-16 as the wchar_t encoding, which means a single wchar_t isn't even necessarily a code point value. (Whether using UTF-16 this way even conforms to the standard is ambiguous. The standard requires that every character supported by a locale be representable as a single wchar_t value; if no locale supports any character outside the BMP, then UTF-16 could be seen as conformant.)
You have to start by reading the first byte or two of the stream and deciding whether it is part of a BOM or not. It's a bit of a pain, since you can only putback a single byte, whereas you typically will want to read four. The simplest solution is to open the file, read the initial bytes, memorize how many you need to skip, then seek back to the beginning and skip them.
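A sketch of that idea, checking for the common BOMs (a hypothetical helper; a complete version would also check the four-byte UTF-32 BOMs, which this code would otherwise mis-detect as UTF-16):
#include <fstream>
#include <istream>

std::streamoff bom_length(std::istream& in)
{
    char buf[4] = {};
    in.read(buf, 4);
    in.clear();   // the file may hold fewer than four bytes
    in.seekg(0);
    const unsigned char* b = reinterpret_cast<const unsigned char*>(buf);
    if (b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return 3; // UTF-8
    if (b[0] == 0xFF && b[1] == 0xFE) return 2;                 // UTF-16 LE
    if (b[0] == 0xFE && b[1] == 0xFF) return 2;                 // UTF-16 BE
    return 0;                                                   // no BOM
}

// usage: skip any BOM before reading
// std::ifstream in(filename, std::ios::binary);
// in.seekg(bom_length(in));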
With a not-so-clean solution, I solved it by removing the non-printing chars:
// despite the name, this flags any byte outside the printable ASCII range
bool isNotAlnum(unsigned char c)
{
    return (c < ' ' || c > '~');
}
...
str.erase(remove_if(str.begin(), str.end(), isNotAlnum), str.end());
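Note that this also erases every byte of any multi-byte UTF-8 character (so "Versión" becomes "Versin"); it is only appropriate when the characters you actually care about are plain ASCII.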
Here's a simple C++ function to skip the BOM on an input stream on Windows. This assumes byte-sized data, as in UTF-8:
// skip BOM for UTF-8 on Windows (note: the auto parameter requires C++20)
void skip_bom(auto& fs) {
    const unsigned char boms[]{ 0xef, 0xbb, 0xbf };
    bool have_bom{ true };
    for(const auto& c : boms) {
        if((unsigned char)fs.get() != c) have_bom = false;
    }
    if(!have_bom) {
        fs.clear();  // the file may hold fewer than three bytes; clear eof/fail before seeking
        fs.seekg(0);
    }
}
It simply checks the first three bytes for the UTF-8 BOM signature, and skips them if they all match. There's no harm if there's no BOM.
Edit: This works with a file stream, but not with cin. I found it did work with cin on Linux with GCC 11, but that's clearly not portable. See @Dúthomhas's comment below.
I have two words, both of type std::string, and they are Unicode words. They are the same; I mean, when I write them to some file they both have the same representation. But when I call word1.compare(word2), I don't get the right result. Why are they not the same?
Or should I use another function instead of compare to compare two Unicode strings?
Thanks.
ifstream myfile;
string term = "";
myfile.open("homograph.txt");
istream_iterator<string> i(myfile);
multiset<string> s(i, istream_iterator<string>());
for(multiset<string>::const_iterator i = s.begin(); i != s.end(); i = s.upper_bound(*i))
{
term = *i;
}
pugi::xml_document doc;
std::ifstream stream("words0.xml");
pugi::xml_parse_result result = doc.load(stream);
pugi::xml_node words = doc.child("Words");
for (pugi::xml_node_iterator it = words.begin(); it != words.end(); ++it)
{
std::string wordValue = as_utf8(it->child("WORDVALUE").child_value());
if(!wordValue.compare(term))
{
o << wordValue << endl;
}
}
The first word is term and the second word is wordValue.
The overload of as_utf8() is:
std::string wordNet::as_utf8(const char* str)
{
return str;
}
In Unicode (and UTF-8 is a Unicode encoding), there is the problem of composition. A glyph like é can be represented by its own code point, or by the code point for e followed by a combining acute accent. It could be that one string is encoded using precomposition (é) and the other using decomposition (e + ´). Both will usually be displayed the same way. To avoid the problem, one should normalize the strings to one of these composition forms.
Of course, there could be another problem, but this is one of the problems that can make equal looking strings not compare as equal. OTOH, if your text does not have any characters outside ASCII, this is hardly the problem.
The correct way to compare the strings is to normalize them first. (In Python, for example, you can do this with the unicodedata module.)
Unicode Standard Annex #15 (Unicode Normalization Forms) describes composition and normalization in detail.
Unicode is more complicated than you think. There are combining characters, invisible code points, and whatnot. If two strings look the same when printed, it doesn't mean they are byte-for-byte identical.
To take all the complications of Unicode into account, you need to use a Unicode-aware string library. One such library is ICU. The C++ standard library is most definitely not Unicode-aware; it can correctly count the bytes in a UTF-8 string, but that's about it.
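A sketch of a normalized comparison using ICU (assuming the ICU headers and libraries are available; NFC is used here, but NFD works equally well as long as both sides use the same form):
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <string>

bool equalNormalized(const std::string& a, const std::string& b)
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return false;  // ICU data not available
    icu::UnicodeString ua = icu::UnicodeString::fromUTF8(a);
    icu::UnicodeString ub = icu::UnicodeString::fromUTF8(b);
    return nfc->normalize(ua, status) == nfc->normalize(ub, status);
}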
Try using std::wstring instead.