When is it safe to use `isalpha()`? - c++

I have a GLFW key_callback() and I'm trying to detect whether an alphabetic character is pressed, using the function isalpha(). I've noticed that some keys such as Shift and Alt are treated as alphabetic characters if I run the following code:
#include <cctype>
#include <iostream>

int main()
{
    // 340: the int that is printed if the Shift key is pressed.
    if ( isalpha(340) )
        std::cout << "alpha" << std::endl;
    else
        std::cout << "not alpha" << std::endl;
    return 0;
}
The preceding code yields alpha. I could validate the range of integers that are passed to the function, but that makes no sense, since then I'm not taking advantage of the function. My question is: is it safe to pass any integer to this function and use a simple if-statement to validate alphabetic characters? What precautions, if any, need to be taken when using this function?

key_callback is not a standard part of C++, so I searched and assume that the one you're talking about is part of the GLFW library. In that library, the key_callback gives a keyboard scancode.
Scancodes are not characters. In a typical input model, there's a state machine that maps scancodes (and sequences and combinations of scancodes) to characters. This is the layer that would allow you to change your QWERTY keyboard to work like a Dvorak keyboard or like one designed for typing in another language. This key_callback is lower level, and leaves the mapping to you.
The C++ std::isalpha function takes an int whose value must be representable as an unsigned char, or be the special sentinel value EOF, which is typically -1. (This is why, on systems where char is signed, you need to cast your char to an unsigned char before passing it in.) Scancodes are not chars, so passing a scancode to std::isalpha is meaningless.
In particular, the value 340 is outside the range of an unsigned char, and it's not EOF, so passing it is undefined behavior and really can't be expected to do anything sensible.
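As an aside, even when you do have genuine chars, the usual way to call the function safely is to convert through unsigned char first. A minimal sketch (the helper name is my own):

#include <cctype>

// isalpha's argument must be representable as an unsigned char (or be
// EOF); converting first avoids undefined behavior for negative chars.
bool is_alpha_char(char c)
{
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}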
If you need chars, you will have to build your own mapping from scancodes (and combinations of scancodes) to chars. It looks like that library has constants defined for the scancodes. For example, GLFW_KEY_LEFT_SHIFT is 340. That should help. If you just need to know if a particular key is pressed or released, you can compare the scancodes to the appropriate constants.
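For example, a key callback might distinguish letter keys from modifier keys by comparing against GLFW's named constants. A sketch, assuming the callback signature from GLFW's documentation (GLFW_KEY_A through GLFW_KEY_Z are contiguous in GLFW's header):

#include <GLFW/glfw3.h>

void key_callback(GLFWwindow* window, int key, int scancode, int action, int mods)
{
    if (action != GLFW_PRESS)
        return;
    if (key >= GLFW_KEY_A && key <= GLFW_KEY_Z)
    {
        // A letter key was pressed; map it to a char yourself if needed.
    }
    else if (key == GLFW_KEY_LEFT_SHIFT || key == GLFW_KEY_RIGHT_SHIFT)
    {
        // Shift has its own key code (340 and 344); it is not a character.
    }
}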
Note: You tagged the question C++, but you linked to the documentation for the C version of isalpha.

Just to summarize the wisdom from the discussion in the comments into a short answer.
Your code has undefined behaviour (UB), for reasons discussed in the comments.
Therefore, the outcome is unpredictable and most likely incorrect.
Code with UB should be avoided as much as possible.
It's not clear what you mean by safe. It's unlikely that it will immediately lead to disaster, but UB in important code (such as a program controlling an airplane) can be disastrous.

Related

C++ When should I use std::ctype<char>::widen()?

Apparently, writing a single character of type char to a stream whose character type is char is guaranteed by the standard not to invoke ctype<char>::widen() on the associated locale.
On the other hand, according to my reading of the standard (C++17), when writing a string of chars (const char*) instead of a single char, ctype<char>::widen() must be invoked.
I am struggling to understand how to make sense of this.
On one hand, the fact that widen() is required when writing strings suggests that there are valid scenarios where widen() has an effect. But if that is the case, then how can it be alright to omit the widening operation when writing single characters?
It seems to me that there must be an intended difference in the roles (domains of applicability) of the two operations, output of single char (char) and output of string (const char*), but I do not see what it is.
To make things more concrete, let us say that I wanted to implement an output operator for a range object, and have the output be of the form 0->2. My first inkling would be something like this:
std::ostream& operator<<(std::ostream& out, const Range& range)
{
    // ...
    out << "->"; // Invokes widen()
    // ...
}
But, is this how I am supposed to do it? Or would out << '-' << '>' (no widening) have been better / more correct?
Curiously, the formulation of the standard suggests to me that the two forms do not always produce the same result. Also, as far as I can tell, the latter form (with separate chars), could be much faster on some platforms.
What is the upshot? What are the rules that should guide me in choosing between the two types of output operations?
For reference, here is an earlier attempt of mine at posing the same question (3 years ago): C++ What is the role of std::ctype<char>::widen()?
Since the old question never got much traction, I'd prefer to mark that one as a duplicate of this one, rather than vice versa.
EDIT: I recognize that a good output operator might not want to use formatted output operations internally, but that is not what I am interested in here. I'm interested in the reasoning behind the difference in behavior of the two types of output operations.
EDIT: Here is one explanation that would make sense to me: << on single char is to be understood as a special case of << on std::string, and not as a special case of << on const char*. But, is this the right explanation? If so, I believe it means that I should use << "->" above. Not << '-' << '>'.
EDIT: Here is what makes me think that the explanation above (2nd EDIT) is not the right one: in the case of a wchar_t stream, both << on char and << on const char* invoke widen(), so from this point of view they are in the same "family". So, from a consistency point of view, we should expect that when we switch the stream type from wchar_t to char, either both of those operators should still invoke widen(), or both should not.
EDIT: Here is another kind of explanation, which I don't think is right, but I'll include it for exposition: for a char stream out, out << "->" has the same effect as out << '-' << '>', because even though the first form is required to invoke widen(), widen() is required to be a "no op" on a char stream in any locale (I don't believe this is the case). So, while there may be a significant difference in performance, the results are always the same. This would suggest that the difference in the formulation of required behavior is a kind of unintended, but fairly benign, accident. If this is the right explanation, then I should choose out << '-' << '>' due to the possibly much better performance.
EDIT: Ok, I found another 3-year-old question from myself, where I am coming at it from a slightly different angle: C++ When are characters widened in output stream operator<<()?. The comments from Dietmar Kühl suggest that widen() is always a "no op" on a char stream, and the whole "issue" is due to imprecise wording in the standard. If so, it would render my second proposed explanation above correct (4th EDIT). Still, it would be nice to get this corroborated by somebody else.
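EDIT: For what it's worth, widen() is easy to observe directly. A minimal sketch of my own (basic_ios::widen() forwards to ctype<char>::widen() on the stream's locale, and on every implementation I'm aware of it returns its argument unchanged for a char stream, consistent with the "no op" reading above):

#include <iostream>

int main()
{
    // basic_ios::widen(c) calls ctype<CharT>::widen(c) on the stream's
    // locale; for a plain char stream it returns c unchanged in practice.
    char dash = std::cout.widen('-');
    char gt = std::cout.widen('>');
    std::cout << dash << gt << '\n'; // prints "->"
}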

Microsoft's implementation of lstrcmpi and Unicode characters

I'm trying to understand whether what I'm seeing is a bug or an accepted behaviour of Microsoft's lstrcmpi function.
I can illustrate it with the code:
WCHAR buff1[] = L"abc ";
WCHAR buff2[] = L"abc ";
buff1[3] = 0xFFFF;
buff2[3] = 0x0;
int res = lstrcmpi(buff1, buff2);
// res is 0, i.e. the strings compare equal!
EDIT: Addition for the comment below:
lstrcmpi calls CompareString with the current locale (from thread or user) and returns "a linguistically appropriate result".
From Michael Kaplan's blog:
... Now if the functions were named lstrcoll and lstrcolli then perhaps the function would not be so commonly misused
and:
Remember that when checking for equality, especially on an item like a registry value where OS semantics are involved, the best answer is CompareStringOrdinal, with a fallback to RtlCompareUnicodeString or even better RtlEqualUnicodeString or if you absolutely must wcsicmp (with awareness that there is one character it can be wrong about) for anything that has to run pre-Vista.
and finally:
Because if you are calling lstrcmpi for appropriate reasons (i.e. you wanted to get linguistically meaningful results, say in the sorting of a list in a user interface) but you wanted to have behavior that did not vary with different locales, then CompareString with LOCALE_INVARIANT is a good answer.
But if you wanted almost anything else, including all of the non-linguistic purposes hinted at earlier, then CompareStringOrdinal or RtlCompareUnicodeString is a much better choice.
How it handles non-characters has actually changed over time.
The code point U+FFFF is a noncharacter in the Unicode spec, so it is probably being ignored during the string comparison. This results in both strings comparing equal.
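To illustrate the quoted advice, here is the same comparison done ordinally with CompareStringOrdinal (a sketch; the function is available on Vista and later):

#include <windows.h>
#include <iostream>

int main()
{
    WCHAR buff1[] = L"abc ";
    WCHAR buff2[] = L"abc ";
    buff1[3] = 0xFFFF;
    buff2[3] = 0x0;

    // Ordinal (binary) comparison: -1 tells the API to compute lengths
    // from the null terminators; FALSE requests a case-sensitive compare.
    int res = CompareStringOrdinal(buff1, -1, buff2, -1, FALSE);
    std::cout << (res == CSTR_EQUAL ? "equal" : "not equal") << '\n';
    // Prints "not equal": the ordinal comparison sees the 0xFFFF code
    // unit that the linguistic lstrcmpi comparison ignored.
}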

Is there keyboard input that isn't considered a char?

First of all, thanks for taking the time to read this.
I'm currently writing a driver program for a class in C++ and I need some input from the user. I've started using typedef to create validation routines so I can switch between different types fairly easily. For the particular problem I'm working on, I've found that I'm only working with char, which leads me to my questions:
My validation checks to see if the input is a char. Is using validation pointless if I know I'm only working with char? Everything the user types in seems to be a char.
Is there anything that the user could type in that won't be considered a char?
This question may seem a bit trivial but I've never really thought about this before! Still learning the language, so any guidance is appreciated.
Code in question (ElementType is of type char):
void getInput( ElementType & cho )
{
    while ( !(cin >> cho) )
    {
        // Clear the stream's error state and discard the offending
        // input; otherwise a failed read would loop here forever.
        cin.clear();
        cin.ignore(numeric_limits<streamsize>::max(), '\n'); // needs <limits>
        cout << "That is an invalid input..."
             << "\nTry again: ";
    }
    cout << endl;
}
Is there anything that the user could type in that won't be considered a char?
Yes: Shift, Ctrl, Alt, Num Lock to name a few.
The point is that a keyboard is its own beast, with a potential output for each key-press and key-release of every key. A keyboard driver (software) takes these events and translates them into a series of char for a program's cin/stdin. Alternatively, a program could get access to the low-level events, but that is beyond standard C++ code.
I recommend staying with the model that cin sporadically receives any of the (usually 256) different char values, including '\0', until program-end or until cin is closed, from some source, be it a keyboard, redirected file input, piped input, or a remote device. Ignore the idea that input usually comes from a keyboard. It is simply a sequence of char.
Is using validation pointless?
Validation is useful. Code should validate the char arriving per the requirements of the program - not the requirements of char. For example, code may have trouble handling a null character, a negative char, a char outside the ASCII range of 0-127, or too many chars between line endings. Validating input makes code resilient against hackers who will exploit a vulnerable program.
Under Linux, with UTF-8 input, I can certainly type a lot of stuff that isn't a single char. E.g. typing á gives two bytes. There are quite a lot of other such characters, directly typeable depending on the keyboard handling in place.
International character handling is complex. Not all the world types ASCII.
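To see this concretely, here is a small sketch that dumps the byte values cin actually receives (assuming a UTF-8 terminal):

#include <iostream>

int main()
{
    // Echo each received byte's numeric value. On a UTF-8 terminal,
    // the single keystroke á delivers two bytes: 195 161 (0xC3 0xA1).
    char c;
    while (std::cin.get(c))
        std::cout << static_cast<int>(static_cast<unsigned char>(c)) << ' ';
}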

Is it always safe to assume that values of ..stream::int_type are >= 0 except for eof?

I want to parse a file and use a std::stringstream to parse its contents. I use get() to read it character by character, which yields a std::stringstream::int_type. Now, in certain cases I want to use a lookup table to convert ASCII characters into other values (for example, to determine whether a certain character is allowed in an identifier or not).
Now can I assume that the values I get from get() are non-negative, unless it is std::stringstream::traits_type::eof()? (And hence use them as indices for the lookup tables).
I couldn't find anything in the standard regarding that, which might be due to a lack of understanding on my part how this whole bytes to characters thing works in C++.
First, let's look at the more general case of basic_stringstream.
You can't assume that eof() is negative (I see the constraint nowhere, and the C standard states that "The value of the macro WEOF may differ from that of EOF and need not be negative.").
In general, int_type comes from the traits parameter, and the description of int_type for character traits doesn't mandate that to_int_type return something positive.
Now, stringstream is basic_stringstream<char> and thus uses char_traits<char>; eof() is negative, but I haven't found a mandate that to_int_type return non-negative values (it isn't in 21.2.3.1, and I see no way to deduce it from other constraints). I wonder if I'm missing something, as my expectation was that to_int_type(c) had to be equivalent to (int)(unsigned char)c -- that is the case for the GNU standard C++ library, and I'd expect the same behavior as in C, where functions taking or returning characters in an int return non-negative values for characters.
For information, the other standard specializations of char_traits behave as follows:
char_traits<char16_t> and char_traits<char32_t> have an unsigned int_type, so even eof() is positive;
char_traits<wchar_t>::to_int_type isn't mandated to return a positive value for non-eof() input either (but in contrast with char_traits<char>, I didn't expect such a mandate to be there).
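Given that uncertainty, a defensive approach is to test for eof() first and then normalize through unsigned char before indexing the table. A sketch of my own:

#include <array>
#include <iostream>
#include <sstream>

int main()
{
    std::array<bool, 256> is_ident_char{}; // one entry per byte value
    for (unsigned c = 'a'; c <= 'z'; ++c)
        is_ident_char[c] = true;

    std::stringstream in("ab_1");
    for (;;)
    {
        std::stringstream::int_type v = in.get();
        if (v == std::stringstream::traits_type::eof())
            break;
        // to_char_type recovers the char; converting it through
        // unsigned char guarantees a non-negative table index.
        auto idx = static_cast<unsigned char>(
            std::stringstream::traits_type::to_char_type(v));
        std::cout << (is_ident_char[idx] ? '+' : '-');
    }
    std::cout << '\n'; // prints "++--" for the input "ab_1"
}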

std.algorithm.joiner(string[],string) - why are the result elements dchar and not char?

I try to compile the following code:
import std.algorithm;

void main()
{
    string[] x = ["ab", "cd", "ef"]; // 'string' is the same as 'immutable(char)[]'
    string space = " ";
    char z = joiner( x, space ).front(); // error
}
Compilation with dmd ends with error:
test.d(8): Error: cannot implicitly convert expression (joiner(x,space).front()) of type dchar to char
Changing char z to dchar z does fix the error message, but I'm interested why it appears in the first place.
Why is the result of joiner(string[],string).front() dchar and not char?
(There is nothing on this in documentation http://dlang.org/phobos/std_algorithm.html#joiner)
All strings are treated as ranges of dchar. That's because a dchar is guaranteed to be a single code point, since in UTF-32, every code unit is a code point, whereas in UTF-8 (char) and UTF-16 (wchar), the number of code units per code point varies. So, if you were operating on individual chars or wchars, you'd be operating on pieces of characters rather than whole characters, which would be very bad. If you don't know much about unicode, I'd advise reading this article by Joel Spolsky. It explains things fairly well.
In any case, because operating on individual chars and wchars doesn't make sense, strings of char and wchar are treated as ranges of dchar (ElementType!string is dchar). As far as ranges are concerned, this means they don't have length (hasLength!string is false - walkLength must be used to get their length), aren't sliceable (hasSlicing!string is false), and aren't indexable (isRandomAccessRange!string is false). This also means that anything which builds a new range from any kind of string is going to result in a range of dchar. joiner is one of those. There are some functions which understand unicode and special-case strings for efficiency, taking advantage of length, slicing, and indexing where they can, but unless their result is ultimately a slice of the original, any range they return is going to have to be made of dchars.
So, front on any range of characters will always be dchar, and popFront will always pop off a full code point.
If you don't know much about ranges, I'd advise reading this. It's a chapter in a book on D which is online and is currently the best tutorial on ranges that we have. We really should get a proper article on ranges (including on how they work with strings) onto dlang.org, but no one's gotten around to writing it yet. Regardless, you're going to need to have at least a basic grasp of ranges to be able to use a lot of D's standard library (especially std.algorithm), because it uses them very heavily.