How can I force the user/OS to input an ASCII string - C++

This is an extended question of this one: Is std::string supposed to have only ASCII characters
I want to build a simple console application that takes input from the user as a set of characters. Those characters include the digits 0-9 and the letters a-z.
I am dealing with the input on the assumption that it is ASCII. For example, I am using something like static_cast<unsigned int>(my_char - '0') to get the number as an unsigned int.
How can I make this code cross-platform? How can I tell that I want the input to always be ASCII? Or have I missed a lot of concepts, and is static_cast<unsigned int>(my_char - '0') just a bad way?
P.S. In ASCII (at least), the digits are ordered sequentially. However, I do not know whether they are in other encodings. (I am pretty sure they are, but is there a guarantee?)

How can I force the user/OS to input an ASCII string
You cannot, unless you let the user specify the numeric values of such ASCII input.
It all depends on how the terminal implementation used to serve std::cin translates keystrokes like 0 to a specific number, and on what your toolchain expects to match that number with its intrinsic translation for '0'.
You simply shouldn't rely on explicit ASCII values (e.g. magic numbers); use char literals instead to get portable code. The assumption that my_char - '0' results in the actual digit's value is true for all character sets. The C++ standard states in [lex.charset]/3 that
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.[...]
emphasis mine
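For illustration, here is a minimal sketch of that portable idiom: since '0' through '9' are guaranteed to be contiguous in every conforming execution character set, my_char - '0' yields the digit's value regardless of platform. The validation with std::isdigit is my addition.
#include <cctype>
#include <iostream>
int main() {
    char my_char;
    if (std::cin >> my_char && std::isdigit(static_cast<unsigned char>(my_char))) {
        // Guaranteed by [lex.charset]/3: '1' - '0' == 1, '2' - '0' == 2, ...
        unsigned int value = static_cast<unsigned int>(my_char - '0');
        std::cout << "digit value: " << value << '\n';
    } else {
        std::cout << "not a digit\n";
    }
}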

You can't force or even verify that beforehand. An "evil user" can always sneak a UTF-8 encoded string into your application with no characters above U+7F, and such a string also happens to be ASCII-encoded.
Also, whatever platform-specific measures you take, the user can pipe in a UTF-16LE encoded file. Or /dev/urandom.
You are mistaking string encoding for some magic property of an input stream - and it is not. It is, well, an encoding, like JPEG or AVI, and must be handled in exactly the same way - read the input, match it against the format, and report errors on parsing failure.
For your case, if you want to accept only ASCII, read the input stream byte by byte and throw/exit with an error if you ever encounter a byte with a value outside the ASCII domain.
However, if you later encounter a terminal providing data in some incompatible encoding, like UTF-16LE, you have no choice but to write detection (based on the byte order mark) and a conversion routine.
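A minimal sketch of that read-and-validate approach might look like this (the exact error handling is up to you):
#include <cstdlib>
#include <iostream>
#include <string>
int main() {
    std::string line;
    std::getline(std::cin, line);
    for (unsigned char c : line) {
        if (c > 0x7F) {                      // byte outside the ASCII domain
            std::cerr << "error: input is not ASCII\n";
            return EXIT_FAILURE;
        }
    }
    std::cout << "ok: " << line << '\n';
}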

Related

Get decimal value of Unicode Character C++

How do I get the decimal value of a Unicode character such as "Ồ"?
std::string a = "Ồ";
unsigned char c = a[0];
long val = long(c);
cout << val << endl;
OUTPUT
7,891;
Your question may look pretty straightforward, but as we delve into it, we'll find it isn't as simple as it might first appear.
The first problem is that std::string is defined as std::basic_string<char>, which isn't really compatible with "Ồ". Thus the results you get from your code will probably depend on the compiler you use and/or the environment and OS you are running on. For example, my copy of Visual Studio treats "Ồ" as an invalid ASCII character and puts "?" (or 0x3F) in a[0].
The second problem is that the character "Ồ" is more than eight bits wide, so it may not fit into the variable c. Whatever the compiler put into a[0], the variable c will only hold as many bits of that value as fit in a char. Again, the results you get are likely to change depending on the compiler you use and/or the environment you run in.
Leaving that aside, let's start by assuming the character "Ồ" is LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND GRAVE (0x1ED2). With that assumption, one might imagine that the answer we are seeking is 0x1ED2, right? But not necessarily.
There are several ways to encode a Unicode character. The UTF-32 encoding is 0x1ED2 (or 0x00001ED2 if we include all the leading zeros to get thirty-two bits). The UTF-8 encoding is 0xE1BB92.
So the decimal value of "Ồ" is 7,890 if it is encoded in UTF-32, or 14,793,618 if it is encoded in UTF-8 (I'm ignoring the effects of endianness to keep things simple).
The Unicode site has a FAQ on encodings and Wikipedia has a page too.
As you can see, the answer to your question (to some extent) depends on the encoding you want to use. One C++ way to deal with encodings is std::codecvt. Another solution is to just treat your string as a sequence of bytes - which your code attempts to do - but that rather depends on you knowing how your system encodes strings, what endianness you are dealing with, etc. And the code won't necessarily be portable.
Another wrinkle to consider is that - in the general case - "Ồ" might not be one character. Obviously it is one character in your code. But if you read a string in from a disk file, say, and when printed or displayed that file produces "Ồ", we can't assume the file contains a single "Ồ" character.
Unicode defines COMBINING CIRCUMFLEX ACCENT (0x0302) and COMBINING GRAVE ACCENT (0x0300) as separate characters which can be combined with other characters. And it defines intermediate characters like LATIN CAPITAL LETTER O WITH CIRCUMFLEX (0x00D4), so there are actually several ways you can create a string in memory (or in a disk file) that would give you the same effect as the character "Ồ".
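To make the encodings above concrete, here is a small sketch (assuming a compiler and terminal that handle the literals as written) that prints the UTF-8 bytes of "Ồ" and its UTF-32 code point; the expected output is e1 bb 92 followed by 7890.
#include <iostream>
#include <string>
int main() {
    std::string utf8 = "\xE1\xBB\x92";        // "Ồ" spelled out as its UTF-8 bytes
    for (unsigned char c : utf8)
        std::cout << std::hex << static_cast<unsigned int>(c) << ' ';
    std::cout << '\n';
    char32_t cp = U'\u1ED2';                  // the same character as a UTF-32 code point
    std::cout << std::dec << static_cast<unsigned long>(cp) << '\n';   // 7890
}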

byte representation of ASCII symbols in std::wstring with different locales

Windows C++ app. We have a string that contains only ASCII symbols: std::wstring(L"abcdeABCDE ... any other ASCII symbol"). Note that this is a std::wstring, which uses wchar_t.
Question - does the byte representation of this string depend on the localization settings, or on anything else? Can I assume that if I receive such a string (for example, from the Windows API) while the app is running, its bytes will be the same as on my PC?
In general, for characters (not escape sequences), wchar_t and wstring have to use the same codes as ASCII (just extended to 2 bytes).
But I am not sure about codes less than 32, and of course codes greater than 127 can have a different meaning than in ASCII at the moment of output, so to avoid problems on output, set a particular locale explicitly, e.g.:
locale("en_US.UTF-8")
for standard output
wcout.imbue(locale("en_US.UTF-8"));
UPDATE:
I found one more suggestion about adding
std::ios_base::sync_with_stdio(false);
before setting the locale with imbue;
see details in How can I use std::imbue to set the locale for std::wcout?
The byte representation of the literal string does not depend on the environment. It is hardcoded to the binary data from the editor. However, the way that binary data is interpreted depends on the current code page, so you can end up with different results when converting at runtime to a wide string (as opposed to defining the string with a leading L, which means the wide characters are set at compile time).
To be safe, use setlocale() to guarantee the encoding used for conversion. Then you don't have to worry about the environment.
This might help: "By definition, the ASCII character set is a subset of all multibyte-character sets. In many multibyte character sets, each character in the range 0x00 – 0x7F is identical to the character that has the same value in the ASCII character set. For example, in both ASCII and MBCS character strings, the 1-byte NULL character ('\0') has value 0x00 and indicates the terminating null character."
From:
Visual Studio Character Sets 'Not set' vs 'Multi byte character set'
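As a small illustration of the point above, here is a sketch that prints the numeric value of each wide character in an ASCII-only literal; on common platforms the values are simply the ASCII codes (97 98 99 100 101 65 66 67 68 69 32):
#include <iostream>
#include <string>
int main() {
    std::wstring s = L"abcdeABCDE ";
    for (wchar_t wc : s)
        std::cout << static_cast<unsigned int>(wc) << ' ';
    std::cout << '\n';
}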

Get number of characters in string?

I have an application accepting a UTF-8 string of at most 255 characters.
If the characters are all ASCII, the number of characters equals the size in bytes.
If the characters are not all ASCII and contain Japanese letters, for example, how can I get the number of characters in the string, given its size in bytes?
Input: char *data, int bytes_no
Output: int char_no
You can use mblen to count the length, or use mbstowcs.
source:
http://www.cplusplus.com/reference/cstdlib/mblen/
http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
The number of characters can be counted in C in a portable way using mbstowcs(NULL,s,0). This works for UTF-8 like for any other supported encoding, as long as the appropriate locale has been selected. A hard-wired technique to count the number of characters in a UTF-8 string is to count all bytes except those in the range 0x80 – 0xBF, because these are just continuation bytes and not characters of their own. However, the need to count characters arises surprisingly rarely in applications.
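Here is a sketch of the mbstowcs(NULL,s,0) trick quoted above (the locale name is an assumption; any UTF-8 locale installed on the system will do):
#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <cstring>
int main() {
    std::setlocale(LC_ALL, "en_US.UTF-8");            // assumed locale name
    const char* data = "abc\xE3\x81\x82";             // "abc" plus U+3042 (Japanese hiragana) in UTF-8
    std::size_t char_no = std::mbstowcs(nullptr, data, 0);
    if (char_no == (std::size_t)-1) {
        std::puts("invalid multibyte sequence");
        return 1;
    }
    std::printf("%zu characters in %zu bytes\n", char_no, std::strlen(data));   // 4 characters in 6 bytes
}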
You can save a Unicode character in a wide char (wchar_t).
There's no such thing as "character".
Or, more precisely, what "character" is depends on whom you ask.
If you look in the Unicode glossary you will find that the term has several not fully compatible meanings. As the smallest component of written language that has semantic value (the first meaning), á is a single character. If you take á and count the basic units of encoding for the Unicode character encoding (the third meaning) in it, you may get either one or two, depending on which exact representation (normalized or denormalized) is being used.
Or maybe not. This is a very complicated subject and nobody really knows what they are talking about.
Coming down to earth, you probably need to count code points, which is essentially the same as characters (meaning 3). mblen is one method of doing that, provided your current locale uses a UTF-8 encoding. Modern C++ offers more C++-ish methods; however, they are not supported on some popular implementations. Boost has something of its own and is more portable. Then there are specialized libraries like ICU, which you may want to consider if your needs are much more complicated than counting characters.
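If you go the mblen route, a rough sketch (again assuming a UTF-8 locale is available; the locale name and helper name are assumptions) could be:
#include <clocale>
#include <cstdlib>
int count_chars(const char* s) {
    std::setlocale(LC_ALL, "en_US.UTF-8");   // assumed locale name
    std::mblen(nullptr, 0);                  // reset the conversion state
    int count = 0;
    for (const char* p = s; *p != '\0'; ++count) {
        int n = std::mblen(p, MB_CUR_MAX);   // length in bytes of the next multibyte character
        if (n <= 0) return -1;               // invalid sequence in this locale
        p += n;
    }
    return count;
}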

Encoding binary data using string class

I am going through one of the requirements for string implementations as part of a study project.
Let us assume that the standard library did not exist and we were
forced to design our own string class. What functionality would it
support and what limitations would we improve? Let us consider the
following factors.
Does binary data need to be encoded?
Is multi-byte character encoding acceptable or is unicode necessary?
Can C-style functions be used to provide some of the needed functionality?
What kind of insertion and extraction operations are required?
My questions on the above text:
What does the author mean by "Does binary data need to be encoded?" Please explain with an example and how we can implement this.
What does the author mean by point 2? Please explain with an example and how we can implement this.
Thanks for your time and help.
Regarding point one, "Binary data" refers to sequences of bytes, where "bytes" almost always means eight-bit words. In the olden days, most systems were based on ASCII, which requires seven bits (or eight, depending on who you ask). There was, therefore, no need to distinguish between bytes and characters. These days, we're more friendly to non-English speakers, and so we have to deal with Unicode (among other codesets). This raises the problem that string types need to deal with the fact that bytes and characters are no longer the same thing.
This segues into point two, which is about how you represent strings of characters in a program. UTF-8 uses a variable-length encoding, which has the remarkable property that it encodes the entire ASCII character set using exactly the same bytes that ASCII encoding uses. However, it makes it more difficult to, e.g., count the number of characters in a string. For pure ASCII, the answer is simple: characters = bytes. But if your string might have non-ASCII characters, you now have to walk the string, decoding characters, in order to find out how many there are.¹
These are the kinds of issues you need to think about when designing your string class.
¹ This isn't as difficult as it might seem, since the first byte of each character is guaranteed not to have 10 in its two high bits. So you can simply count the bytes that satisfy (c & 0xC0) != 0x80. Nonetheless, it is still expensive relative to just treating the length of a string buffer as its character count.
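A sketch of the counting technique from that footnote (the helper name is mine):
#include <cstddef>
std::size_t utf8_char_count(const char* data, std::size_t bytes_no) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < bytes_no; ++i) {
        unsigned char c = static_cast<unsigned char>(data[i]);
        if ((c & 0xC0) != 0x80)              // not a continuation byte, so it starts a character
            ++count;
    }
    return count;
}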
The question here is: can we store ANY old data in the string, or do certain byte values need to be encoded in some special way? An example of that would be in the standard C language: if you want to use a newline character, it is "encoded" as \n to make it more readable and clear - of course, in this example I'm talking about the source code. In the case of binary data stored in the string, how would you deal with "strange" data - e.g. what about zero bytes? Will they need special treatment?
The values guaranteed to fit in a char are the ASCII characters and a few others (a total of 256 different characters in a typical implementation, but char is not GUARANTEED to be 8 bits by the standard). But if we take non-European languages, such as Chinese or Japanese, they consist of vastly more characters than can fit in a single char. Unicode allows for over a million different characters, so any character from any European, Chinese, Japanese, Thai, Arabic, Mayan, or ancient hieroglyphic language can be represented in one "unit". This is done by using a wider character - for the full size, we need 32 bits. The drawback here is that most of the time we don't actually use that many different characters, so it is a bit wasteful to use 32 bits for each character, only to have zeros in the upper 24 bits nearly all the time.
A multibyte character encoding is a compromise, where "common" characters (common in European languages) are stored as one char, but less common characters are encoded with multiple char values, using a special range of characters to indicate "there is more data in the next char to combine into a single unit". (Or, one could decide to always use 2, 3, or 4 chars to encode a single character.)
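As a rough sketch of that compromise (UTF-8 specifically; the function name is mine), here is how a single code point maps to one or more char values:
#include <string>
std::string encode_utf8(char32_t cp) {       // assumes cp is a valid code point <= 0x10FFFF
    std::string out;
    if (cp <= 0x7F) {                        // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {                // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {               // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                 // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}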

Failsafe conversion between different character encodings

I need to convert strings from one encoding (UTF-8) to another. The problem is that the target encoding does not have all the characters of the source encoding, and the libc iconv(3) function fails in such a situation. What I want is to be able to perform the conversion, but have the problematic characters in the output string replaced with some symbol, say '?'.
The programming language is C or C++.
Is there a way to address this issue?
Try appending "//TRANSLIT" or "//IGNORE" to the end of the destination charset string. Note that this is only supported under the GNU C library.
From iconv_open(3):
//TRANSLIT
When the string "//TRANSLIT" is appended to tocode, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking characters.
//IGNORE
When the string "//IGNORE" is appended to tocode, characters that cannot be represented in the target character set will be silently discarded.
Alternatively, manually skip a character and insert a substitution in the output when iconv(3) fails with errno set to EILSEQ.
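A minimal sketch of the //TRANSLIT approach (glibc assumed, since the suffix is a GNU extension; the function name and buffer sizing are mine):
#include <iconv.h>
#include <stdexcept>
#include <string>
std::string utf8_to_ascii_translit(const std::string& in) {
    iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");
    std::string out(in.size() * 4 + 16, '\0');        // generous output buffer
    char* inbuf = const_cast<char*>(in.data());       // iconv does not modify the input
    char* outbuf = &out[0];
    size_t inleft = in.size(), outleft = out.size();
    // On failure (errno == EILSEQ), one could skip a byte and emit '?' by hand, as suggested above.
    if (iconv(cd, &inbuf, &inleft, &outbuf, &outleft) == (size_t)-1) {
        iconv_close(cd);
        throw std::runtime_error("iconv failed");
    }
    iconv_close(cd);
    out.resize(out.size() - outleft);
    return out;
}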
A regex based on the translatable source ranges can be used to swap a corresponding placeholder in for any chars that don't match.