Check if all characters in UTF16 string are valid? - c++

I have a problem where I have UTF16 strings (std::wstring) that might contain "invalid" characters, which cause my console terminal to stop printing (see question).
I wonder if there is a fast way to check all the characters in a string and replace any invalid chars with ?.
I know I could do something along these lines with a regex, but it would be difficult to make it validate all valid chars, and it would also be slow. Is there, for example, a numeric range for the char codes that I could use, e.g. all char codes between 26 and 5466 are valid?

It should be possible to use std::ctype<wchar_t> to determine if a character is printable:
#include <algorithm>
#include <locale>

std::locale loc;
std::replace_if(string.begin(), string.end(),
    [&](wchar_t c) -> bool { return !std::isprint(c, loc); }, L'?');

I suspect your problem is not related to the validity of characters, but to the capability of the console to print them.
What Unicode defines as "printable" does not necessarily coincide with what the console is actually able to print.
Characters like '€' are "printable" but, for example, not printable on Windows XP consoles.

Related

How to search a non-ASCII character in a c++ string?

string s = "x1→(y1⊕y2)∧z3";
for (auto i = s.begin(); i != s.end(); i++) {
    if (*i == '→') {
        ...
    }
}
Comparing the char like this is definitely wrong; what's the correct way to do it? I am using VS2013.
First you need some basic understanding of how programs handle Unicode. If you don't have it yet, you should read up; I quite like this post on Joel on Software.
You actually have 2 problems here:
Problem #1: getting the string into your program
Your first problem is getting that actual string in your string s. Depending on the encoding of your source code file, MSVC may corrupt any non-ASCII characters in that string.
One option is to save your C++ file as UTF-16 (which Windows confusingly calls "Unicode") and use wchar_t and wstring (effectively encoding the expression as UTF-16). Saving as UTF-8 with BOM will also work. With any other encoding, your L"..." character literals will contain the wrong characters.
Note that other platforms may define wchar_t as 4 bytes instead of 2. So the handling of characters above U+FFFF will be non-portable.
In all other cases, you can't just write those characters in your source file. The most portable way is encoding your string literals as UTF-8, using \x escape codes for all non-ASCII characters. Like this: "x1\xe2\x86\x92a\xe2\x8a\x95" "b)" rather than "x1→(a⊕b)".
And yes, that's as unreadable and cumbersome as it gets. The root problem is MSVC doesn't really support using UTF-8. You can go through this question here for an overview: How to create a UTF-8 string literal in Visual C++ 2008 .
But, also consider how often those strings will actually show up in your source code.
Problem #2: finding the character
(If you're using UTF-16, you can just find the L'→' character, since that character is representable as one wchar_t. For characters above U+FFFF you'll have to use the wide version of the workaround below.)
It's impossible to define a single char representing the arrow character. You can, however, do it with a string: "\xe2\x86\x92" (that's a string with 3 chars for the arrow, plus the \0 terminator).
You can now search for this string in your expression:
s.find("\xe2\x86\x92");
The UTF-8 encoding scheme guarantees this always finds the correct character, but keep in mind this is an offset in bytes.
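As a minimal sketch, the whole expression can be written with the same \x escape codes and searched byte-wise (the variable names here are just placeholders):

#include <iostream>
#include <string>

int main() {
    // "x1→(y1⊕y2)" spelled with explicit UTF-8 escape codes
    std::string s = "x1\xe2\x86\x92(y1\xe2\x8a\x95y2)";
    std::size_t pos = s.find("\xe2\x86\x92"); // offset in bytes, not characters
    if (pos != std::string::npos)
        std::cout << "arrow found at byte " << pos << '\n'; // prints 2
    return 0;
}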
My comment is too large, so I am submitting it as an answer.
The problem is that everybody is concentrating on the issue of the different encodings that Unicode may use (UTF-8, UTF-16, UCS-2, etc.). But your problems are only just beginning there.
There is also an issue of composite characters, which will really mess up any search that you are trying to make.
Let's say you are looking for the character 'é'. You find it in Unicode as U+00E9 and do your search, but it is not guaranteed that this is the only way to represent that character. The document may also contain the combination U+0065 U+0301, which is actually exactly the same character.
Yes, not just "a character that looks the same": it is exactly the same, so any software, and even some programming libraries, will freely convert from one form to the other without even telling you.
So if you wish to make a search that is robust, you will need something that handles not just the different Unicode encodings, but Unicode characters themselves, treating composed and precomposed forms as equal.
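To see the mismatch concretely, here is a minimal byte-level sketch of the two representations of 'é' (plain standard C++, no Unicode library; a normalization library such as ICU is needed to treat them as equal):

#include <iostream>
#include <string>

int main() {
    std::string precomposed = "\xc3\xa9";  // U+00E9: 'é' as a single code point
    std::string decomposed  = "e\xcc\x81"; // U+0065 U+0301: 'e' plus combining acute accent
    // Both render as the same character, but compare unequal byte-for-byte:
    std::cout << (precomposed == decomposed) << '\n'; // prints 0
    return 0;
}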

How to compare/replace non-ASCII chars in array in C++?

I have a large char array which contains Czech diacritical characters (e.g. "á"), coded in UTF-8. I need to replace them with their ASCII equivalents (e.g. "a"), because the program must work on Windows (a Linux console accepts these chars perfectly).
I am reading the array char by char and writing the content into a string.
Here is the code I am using; it doesn't work:
int array_size = 50000;              // size of file array
char * array = new char[array_size]; // array to store file contents
string ascicontent = "";
if ('\u00E1' == array[zacatek]) {    // check if char is "á"
    ascicontent += 'a';              // write plain "a" into string
}
I even tried replacing '\u00E1' with 'á', but it also doesn't work. I'm guessing the problem is that these chars are longer than ASCII chars.
How can I declare the non-ASCII char so it can be compared?
Each char is a single byte, however UTF-8 can use multiple bytes to encode a single character. In particular U+00E1 is encoded as two bytes: 0xC3 0xA1. So you can't do what you want with just comparing a single char.
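A minimal sketch of what the two-byte comparison looks like, reusing the question's array and ascicontent and assuming i is the current read position:

if (i + 1 < array_size &&
    (unsigned char)array[i] == 0xC3 && (unsigned char)array[i + 1] == 0xA1) {
    ascicontent += 'a'; // found "á" (0xC3 0xA1), emit plain "a"
    i += 2;             // skip both bytes of the UTF-8 sequence
} else {
    ascicontent += array[i];
    ++i;
}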
There are multiple ways that you might be able to tackle your problem:
A) First, try googling for "windows console utf-8" and see if that gives anything which might make things just work without having to alter the characters at all. (I don't know if anything can work for you, I've never tried this.)
B) Convert the data to wide characters (wchar_t) using MultiByteToWideChar or mbstowcs and then google how to use wcout or such to output UTF-16 to the console.
C) Use MultiByteToWideChar to convert the data from UTF-8 to UTF-16. Then use WideCharToMultiByte to convert from UTF-16 to the console's code page, relying on the fact that it can automatically "best fit" common characters (such as "á" to "a").
D) If you really only care about a limited set of characters (such as only the accented characters in the Czech code page), then you could write your own lookup table of UTF-8 byte sequences and your desired replacements (a rough sketch follows below). You just need to do the comparisons on the UTF-8 as those multi-byte sequences rather than as individual chars. Among various tools out there, I've found this page helpful for seeing how characters are encoded in various ways.
Which of these make the most sense for your program depends on various factors, such as how easy or hard it might be to keep the Windows-specific pieces from conflicting with the Linux-specific or cross-platform parts.
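For option D, a rough sketch of such a lookup table might look like the following (only a few mappings shown, and the function name is just a placeholder):

#include <string>
#include <utility>
#include <vector>

// Replace a few known UTF-8 sequences with ASCII fallbacks; extend the table as needed.
std::string strip_diacritics(const std::string& utf8) {
    static const std::vector<std::pair<std::string, char>> table = {
        { "\xC3\xA1", 'a' }, // á (U+00E1)
        { "\xC3\xA9", 'e' }, // é (U+00E9)
        { "\xC5\xA1", 's' }, // š (U+0161)
        { "\xC4\x8D", 'c' }, // č (U+010D)
    };
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ) {
        bool matched = false;
        for (const auto& entry : table) {
            if (utf8.compare(i, entry.first.size(), entry.first) == 0) {
                out += entry.second;       // emit the ASCII replacement
                i += entry.first.size();   // skip the whole multi-byte sequence
                matched = true;
                break;
            }
        }
        if (!matched) {
            out += utf8[i];                // pass other bytes through unchanged
            ++i;
        }
    }
    return out;
}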
char in C is not Unicode; it is really a byte. It only gets converted to a glyph by the terminal console you happen to use. On some Linux distributions (like Debian) the terminal defaults to UTF-8, so if your program outputs a sequence of bytes encoded in UTF-8, the terminal will display the proper glyphs. If you know that array is UTF-8 encoded, you must check for the proper byte sequence.
Edit: take a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Please take a look at this link http://en.wikipedia.org/wiki/Wide_character.
And I believe this code might help you:
std::wstring str(L"cccccááddddddd");
std::replace( str.begin(), str.end(), L'á', L'a');

Adding a diacritic mark to a string fails in C++

I want to add a diacritic mark to my string in C++. Assume I want to modify a word string in the following manner:
String respj = resp[j];
std::string respjz1 = respj; // create respjz1 and respjz2
std::string respjz2 = respj;
respjz1[i] = 'ź'; // put diacritic marks
respjz2[i] = 'ż';
I keep receiving: wordş and wordĽ (instead of wordź and wordż). I tried to google it but I keep getting results related to the opposite problem - diacritic normalization to non-diacritic mark.
First, what is String? Does it support accented characters or not?
But the real issue is one of encodings. When you say "I keep receiving", what do you mean? What the string will contain is not a character, but a numeric value representing the code point of a character in some encoding. If the encoding used by the compiler for accented characters is the same as the encoding used by whatever you use to visualize them, then you will get the same character. If it isn't, you will get something different. Thus, for example, depending on the encoding, LATIN SMALL LETTER Z WITH DOT ABOVE (what I think you're trying to assign to respjz2[i]) can be 0xFD or 0xBF in the encoding tables I have access to (and it's absent in most single-byte encodings); in the single-byte encoding I normally use (ISO 8859-1), these code points correspond to LATIN SMALL LETTER Y WITH ACUTE and INVERTED QUESTION MARK, respectively.
In the end, there is no real solution. Long term, I think you should probably move to UTF-8 and try to ensure that all of the tools you use (and all of the tools used by your users) understand that. Short term, it may not be that simple: for starters, you're more or less stuck with what your compiler provides (unless you enter the characters in the form \u00BF or \u00FD, and even then the compiler may do some funny mappings when it puts them into a string literal). And you may not even know what other tools your users use.
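If wide strings are an option, one way to sidestep the source-encoding issue is to spell the characters as universal character names, as mentioned above; a minimal sketch (assuming the surrounding code can work with std::wstring):

#include <string>

int main() {
    std::wstring respjz1 = L"word";
    std::wstring respjz2 = L"word";
    respjz1 += L'\u017A'; // ź: LATIN SMALL LETTER Z WITH ACUTE
    respjz2 += L'\u017C'; // ż: LATIN SMALL LETTER Z WITH DOT ABOVE
    return 0;
}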

Does Re2 use string size or null termination?

The title is pretty much it. If a standard C++ string with UTF-8 characters has no zero bytes, does the scanning terminate at the end of the string defined by its size? Conversely, if the string has a zero byte, does scanning stop at that byte, or continue to the full length of the string?
I've looked at the Re2.h file and it does not seem to address this issue.
A std::string containing UTF-8 text can't have 0-bytes as part of the text (only as termination), because UTF-8 doesn't use the 0 byte for anything other than the NUL character itself.
And given you're using something C++11-compliant, a terminating 0 is guaranteed (it doesn't matter whether you use data() or c_str(); and data() is the original data, so...).
See http://en.cppreference.com/w/cpp/string/basic_string/data
or the standard (21.4.7.1/1 etc.).
=> The processing of a string will stop at the 0
The interface to Re2 seems to use std::string, which almost certainly means that it uses the begin and the end of the string, and that null characters are characters like any other. (They are, after all, defined in Unicode and in UTF-8.) Of course, '\0' is in the category of control characters, so it won't match something like "\pL" (which matches a letter), but it should match "\pC", and of course '\u0000' and other representations of the null character.
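Whether an embedded 0 byte even ends up inside the std::string depends on how the string was built, which is worth checking before worrying about the regex engine; a small illustration:

#include <iostream>
#include <string>

int main() {
    std::string a("ab\0cd", 5); // explicit length keeps the embedded 0 byte
    std::string b("ab\0cd");    // the const char* constructor stops at the first 0
    std::cout << a.size() << ' ' << b.size() << '\n'; // prints "5 2"
    return 0;
}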

C++ - Escaping or disabling backslash on string

I am writing a C++ program to solve a common problem of message decoding. Part of the problem requires me to get a bunch of random characters, including '\', and map them to a key, one by one.
My program works fine in most cases, except that when I read characters such as '\' from a string, I obviously get a completely different character representation (e.g. '\0' yields a null character, or '\' simply escapes itself when it needs to be treated as a character).
Since I am not supposed to have any control on what character keys are included, I have been desperately trying to find a way to treat special control characters such as the backslash as the character itself.
My questions are basically these:
Is there a way to turn all special characters off within the scope of my program?
Is there a way to override the current digraph definitions of special characters and define them as something else (like digraphs using very rare keys)?
Is there some obscure method on the String class that I missed which can force the actual character on the string to be read instead of the pre-defined constant?
I have been trying to look for a solution for hours now but all possible fixes I've found are for other languages.
Any help is greatly appreciated.
If you read in a string like "\0" from stdin or a file, it will be treated as two separate characters: '\\' and '0'. There is no additional processing that you have to do.
Escaping characters is only used for string/character literals. That is to say, when you want to hard-code something into your source code.
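A quick way to convince yourself of that, assuming the input is read with std::getline:

#include <iostream>
#include <string>

int main() {
    std::string line;
    std::getline(std::cin, line);     // type: \0
    std::cout << line.size() << '\n'; // prints 2: the characters '\' and '0'
    // Escape sequences are source-code notation; runtime input is taken literally.
    return 0;
}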