How do I check whether character constants conform to ASCII?

A comment on an earlier version of this answer of mine alerted me to the fact that I can't assume that 'A', 'B', 'C' etc. have successive numeric values. I had sort of assumed the C or C++ language standards guarantee that this is the case.
So, how should I determine whether consecutive letter characters' values are themselves consecutive? Or rather, how can I determine whether the character constants I can express within single quotes have their ASCII codes for a numeric value?
I'm asking how to do this both in C and in C++. Obviously the C way would work in C++ also, but if there's a C++ish facility for doing this I'm interested in that as well. Also, I'm asking about the newest relevant standards (C11, C++17).

You can use the preprocessor to check whether particular characters have the values you expect from a given charset:
#include <iostream>
int main() {
#if ('A' == 65 && 'Z' - 'A' == 25)
std::cout << "ASCII" << std::endl;
#else
std::cout << "Other charset" << std::endl;
#endif
return 0;
}
The drawback is that you need to know the expected values in advance.
Incidentally, the digit characters '0' to '9' are guaranteed to have consecutive values.
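A compile-time variant of the same idea, as a minimal sketch: as far as I know, C11 and C++17 leave it implementation-defined whether a character constant inside #if has the same value as in an ordinary expression, so checking in a constexpr function sidesteps that question. The names has_consecutive_codes and basic_letters_and_digits_are_ascii are made up for this illustration, and the code assumes C++14 or later.

```cpp
#include <cstddef>

// Hypothetical helper (not a standard facility): true when every character
// of `chars` has the code `first_code + its index in the string`.
constexpr bool has_consecutive_codes(const char* chars, int first_code) {
    for (std::size_t i = 0; chars[i] != '\0'; ++i) {
        if (chars[i] != first_code + static_cast<int>(i)) return false;
    }
    return true;
}

// The expected values (65, 97, 48) are the ASCII codes of 'A', 'a' and '0'.
constexpr bool basic_letters_and_digits_are_ascii =
    has_consecutive_codes("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 65) &&
    has_consecutive_codes("abcdefghijklmnopqrstuvwxyz", 97) &&
    has_consecutive_codes("0123456789", 48);
```

If the program cannot work at all on another charset, the flag can be fed to static_assert so the build fails early; otherwise it can select a fallback code path. This only covers letters and digits; punctuation would need its own checks.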

... (2) I expect to be able to obtain the distance in number-of-letters between two letters ...
This comment specifying your goal makes much more sense than your actual question! Why didn't you ask about that? You can use strchr on an array of characters, and strchr doesn't care what the native character set is, meaning your code won't care what the native character set is... For example:
char alphabet[] = "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz";
ptrdiff_t fubar = strchr(alphabet, 'y') - strchr(alphabet, 'X');
printf("'X' and 'y' have a distance of %td and a case difference of %td\n", fubar / 2, fubar % 2);
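Since the question also asks for a C++ish way, here is a rough C++17 sketch of the same lookup-table idea using std::string_view instead of strchr. The function name letter_distance and the choice to return an empty optional for non-letters are assumptions made for this example.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: number of letters from `from` to `to`, ignoring case.
// Returns an empty optional if either argument is not a basic Latin letter.
// The result does not depend on the native character set.
inline std::optional<std::ptrdiff_t> letter_distance(char from, char to) {
    constexpr std::string_view alphabet =
        "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz";
    const auto a = alphabet.find(from);
    const auto b = alphabet.find(to);
    if (a == std::string_view::npos || b == std::string_view::npos) {
        return std::nullopt;
    }
    // Each letter occupies two slots (upper, lower), so halve the index.
    return static_cast<std::ptrdiff_t>(b / 2) - static_cast<std::ptrdiff_t>(a / 2);
}
```

For example, letter_distance('X', 'y') yields 1, matching the C snippet above.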
... how should I determine whether consecutive letter characters' values are themselves consecutive?
Consecutive letter characters' values are consecutive, by definition, because they're consecutive letter characters. I know this isn't what you meant, but your actual question illustrates a lack of planning and thought, and... a stupid question warrants a stupid answer.
You're much better off programming in such a way that you don't care what values they have. Nonetheless, create an array containing the characters you care about, loop through the elements and test for inconsistencies. For example:
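/* Returns nonzero if each character's value is one more than its predecessor's. */
int is_consecutive(char const *alphabet) {
    size_t x = 0;
    /* Stop at the last character so the terminator is never compared. */
    while (alphabet[x] && alphabet[x + 1] && alphabet[x] + 1 == alphabet[x + 1])
        x++;
    return !alphabet[x] || !alphabet[x + 1];
}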
... how can I determine whether the character constants I can express within single quotes have their ASCII codes for a numeric value?
Again with the lack of sense, and again with the caring about values... Alternatively, build two translation tables, native_to_ascii and ascii_to_native, and work it out from there. I won't help you with this, as it's a silly exercise involving the use of magic numbers that most likely aren't necessary for your actual goal.

Related

Is the comparison of strings or string views terminated at a null-character?

May a string or string_view include '\0' characters so that the following code prints 1 twice?
Or is this just implementation-defined?
#include <iostream>
#include <string_view>
#include <string>
using namespace std;
int main()
{
string_view sv( "\0hello world", 12 );
cout << (sv == sv) << endl;
string str( sv );
cout << (str == sv) << endl;
}
This isn't a duplicate of the question of whether strings can have embedded nulls, since they obviously can. What I want to ask is whether the comparison of strings or string views is terminated at a 0-character.

Language lawyer answer since the standards documents are, by definition, the one true source of truth :-)
The standard is clear on this. In C++17 (since that's the tag you provided, but later iterations are similar), [string.operator==] states that, for comparisons involving strings and/or string views, it:
Returns: lhs.compare(rhs) == 0.
The [string.compare] section further states that these all boil down to a comparison with a string view and explains that it:
Determines the effective length rlen of the strings to compare as the smaller of size() and sv.size(). The function then compares the two strings by calling traits::compare(data(), sv.data(), rlen).
These sizes are not restricted in any way by embedded nulls.
And, if you look at the traits information in table 54 of [char.traits.require], you'll see it's as clear as mud until you separate it out into sections:
X::compare(p,q,n) Returns int:
0 if for each i in [0,n), X::eq(p[i],q[i]) is true; else
a negative value if, for some j in [0,n), X::lt(p[j],q[j]) is true and for each i in [0,j) X::eq(p[i],q[i]) is true; else
a positive value.
The first bullet point is easy, it gives zero if every single character is equal.
The second is a little harder but it basically gives a negative value where the first difference between characters has the first string on the lower side (all previous characters are equal and the offending character is lower in the first string).
The third is just the default "if it's neither equal nor lesser, it must be greater".
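To see the rule in action, here is a small sketch (C++17). The helper compare_counted is a name invented for this example; it just wraps the string_view comparison described above so that the length is always given explicitly.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: compare the first n characters of two buffers as
// string views. Embedded '\0' characters are ordinary data here because
// the length comes from n, not from a terminator.
inline int compare_counted(const char* a, const char* b, std::size_t n) {
    return std::string_view(a, n).compare(std::string_view(b, n));
}
```

With this, compare_counted("ab\0cd", "ab\0cd", 5) is zero and compare_counted("ab\0cd", "ab\0ce", 5) is negative, because the difference after the embedded null is seen. By contrast, std::string_view("ab\0cd") built without a length has size 2, since that constructor measures the C string up to the first null.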

The null character is part of the comparison; see https://en.cppreference.com/w/cpp/string/basic_string/operator_cmp
Two strings are equal if both the size of lhs and rhs are equal and each character in lhs has equivalent character in rhs at the same position.

Sign & Unsigned Char is not working in C++

In C++ Primer 5th Edition I saw this
when I tried to use it---
This time it didn't work: for the unsigned char the program's output was a weird symbol, and for the signed char it was totally blank. The compiler also gave some warnings when I tried to compile it. But C++ Primer and so many websites said it should work... So I don't think they give the wrong information. Did I do something wrong?
I am newbie btw :)

But C++ primer ... said it should work
No, it doesn't. The quote from C++ Primer doesn't use std::cout at all. The output that you see doesn't contradict what the book says.
So I don't think they give the wrong information
No [1].
did I do something wrong?
It seems that you've possibly misunderstood what the value of a character means, or possibly misunderstood how character streams work.
Character types are integer types (but not all integer types are character types). The values of unsigned char are 0..255 (on systems where a byte is 8 bits). Each [2] of those values represents some textual symbol. The mapping from a set of values to a set of symbols is called a "character set" or "character encoding".
std::cout is a character stream. << is the stream insertion operator. When you insert a character into a stream, the behaviour is not to show the numerical value. Instead, the behaviour is to show the symbol that the value is mapped to [3] in the character set that your system uses. In this case, it appears that the value 255 is mapped to whatever strange symbol you saw on the screen.
If you wish to print the numerical value of a character, what you can do is convert to a non-character integer type and insert that to the character stream:
int i = c;
std::cout << i;
[1] At least, there's no wrong information regarding your confusion. The quote is a bit inaccurate and outdated in the case of c2. Before C++20, the value was "implementation defined" rather than "undefined". Since C++20, the value is actually defined, and the value is 0, which is the null terminator character that signifies the end of a string. If you try to print this character, you'll see no output.
[2] This was a bit of a lie for simplicity's sake. Some characters are not visible symbols. For example, there is the null terminator character as well as other control characters. The situation becomes even more complex in the case of variable-width encodings such as the ubiquitous Unicode, where symbols may consist of a sequence of several char. In such an encoding, an individual char cannot necessarily be interpreted correctly without the other char that are part of the sequence.
[3] And this behaviour should feel natural once you grok the purpose of character types. Consider the following program:
unsigned char c = 'a';
std::cout << c;
It would be highly confusing if the output would be a number that is the value of the character (such as 97 which may be the value of the symbol 'a' on the system) rather than the symbol 'a'.
For extra meditation, think about what this program might print (and feel free to try it out):
char c = 57;
std::cout << c << '\n';
int i = c;
std::cout << i << '\n';
c = '9';
std::cout << c << '\n';
i = c;
std::cout << i << '\n';

This is due to the behavior of the << operator on the char type and the character stream cout. Note that << performs formatted output, which means it does some implicit formatting.
We can say that the value of a variable is not the same as its representation in certain contexts. For example:
#include <iostream>
int main() {
bool t = true;
std::cout << t << std::endl; // Prints 1, not "true"
}
Think of it this way: why would we need char if it still behaved like a number when printed? Why not just use int or unsigned? In essence, we have different types in order to have different behaviors, which can be deduced from those types.
So the underlying numeric value of a char is probably not what we are looking for when we print one.
Check this, for example:
#include <iostream>
int main() {
unsigned char c = -1;
int i = c;
std::cout << i << std::endl; // Prints 255
}
If I recall correctly, you're somewhat close in the Primer to the topic of built-in type conversions; things will become clearer once you get to know these rules better. Anyway, I'm sure you will benefit greatly from looking into this article, especially the "Printing chars as integers via type casting" part.
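To make the difference between a character's value and its symbol concrete, here is a minimal sketch. The helper names as_number and as_symbol are invented for this example; they write to a string stream instead of std::cout so the result can be inspected, but the << behaviour is the same.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: the numeric value of c, formatted as decimal text.
// Converting to int first selects the integer overload of <<.
inline std::string as_number(unsigned char c) {
    std::ostringstream out;
    out << static_cast<int>(c);
    return out.str();
}

// Hypothetical helper: what << produces for the character itself,
// i.e. the single byte c, which the terminal then renders as a symbol.
inline std::string as_symbol(unsigned char c) {
    std::ostringstream out;
    out << c;
    return out.str();
}
```

So as_number(static_cast<unsigned char>(-1)) gives the text 255 on a system with 8-bit bytes, while as_symbol of the same value gives one byte that your terminal shows as whatever symbol its encoding assigns to 255.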

C++ toupper Syntax

I've just been introduced to toupper, and I'm a little confused by the syntax; it seems like it's repeating itself. What I've been using it for is for every character of a string, it converts the character into an uppercase character if possible.
for (int i = 0; i < string.length(); i++)
{
if (isalpha(string[i]))
{
if (islower(string[i]))
{
string[i] = toupper(string[i]);
}
}
}
Why do you have to list string[i] twice? Shouldn't this work?
toupper(string[i]); (I tried it, so I know it doesn't.)

toupper is a function that takes its argument by value. It could have been defined to take a reference to character and modify it in-place, but that would have made it more awkward to write code that just examines the upper-case variant of a character, as in this example:
// compare chars case-insensitively without modifying anything
if (std::toupper(*s1++) == std::toupper(*s2++))
...
In other words, toupper(c) doesn't change c for the same reasons that sin(x) doesn't change x.
To avoid repeating expressions like string[i] on the left and right side of the assignment, take a reference to a character and use it to read and write to the string:
for (size_t i = 0; i < string.length(); i++) {
char& c = string[i]; // reference to character inside string
c = std::toupper(c);
}
Using range-based for, the above can be written more briefly (and executed more efficiently) as:
for (auto& c: string)
c = std::toupper(c);

According to the documentation, the character is passed by value.
Because of that, the answer is no, it shouldn't.
The prototype of toupper is:
int toupper( int ch );
As you can see, the character is passed by value, transformed, and returned by value.
If you don't assign the returned value to a variable, it is lost.
That's why your example assigns it back, replacing the original character.

As many of the other answers already say, the argument to std::toupper is passed and the result returned by-value which makes sense because otherwise, you wouldn't be able to call, say std::toupper('a'). You cannot modify the literal 'a' in-place. It is also likely that you have your input in a read-only buffer and want to store the uppercase-output in another buffer. So the by-value approach is much more flexible.
What is redundant, on the other hand, is your checking for isalpha and islower. If the character is not a lower-case alphabetic character, toupper will leave it alone anyway so the logic reduces to this.
#include <cctype>
#include <iostream>
int
main()
{
char text[] = "Please send me 400 $ worth of dark chocolate by Wednesday!";
for (auto s = text; *s != '\0'; ++s)
*s = std::toupper(*s);
std::cout << text << '\n';
}
You could further eliminate the raw loop by using an algorithm, if you find this prettier.
#include <algorithm>
#include <cctype>
#include <iostream>
#include <iterator>
int
main()
{
char text[] = "Please send me 400 $ worth of dark chocolate by Wednesday!";
std::transform(std::cbegin(text), std::cend(text), std::begin(text),
[](auto c){ return std::toupper(c); });
std::cout << text << '\n';
}

toupper takes an int by value and returns, as an int, the uppercase version of that character. Whenever a function doesn't take a pointer or reference as a parameter, the argument is passed by value: the parameter is a copy of the variable passed in, so changes to it cannot be seen from outside the function. The way to capture the result is to save what the function returns, in this case the upper-cased character.

Note that there is a nasty gotcha in isalpha(), which is the following: the function only works correctly for inputs in the range 0-255 + EOF.
So what, you think.
Well, if your char type happens to be signed, and you pass a value greater than 127, this is considered a negative value, and thus the int passed to isalpha will also be negative (and thus outside the range of 0-255 + EOF).
In Visual Studio, this will crash your application. I have complained about this to Microsoft, on the grounds that a character classification function that is not safe for all inputs is basically pointless, but received an answer stating that this was entirely standards conforming and I should just write better code. Ok, fair enough, but nowhere else in the standard does anyone care about whether char is signed or unsigned. Only in the isxxx functions does it serve as a landmine that could easily make it through testing without anyone noticing.
The following code crashes Visual Studio 2015 (and, as far as I know, all earlier versions):
int x = toupper ('é');
So not only is the isalpha() in your code redundant, it is in fact actively harmful, as it will cause any strings that contain characters with values greater than 127 to crash your application.
See http://en.cppreference.com/w/cpp/string/byte/isalpha: "The behavior is undefined if the value of ch is not representable as unsigned char and is not equal to EOF."
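The usual way around this gotcha is to convert the char to unsigned char before calling the classification or conversion function. A minimal sketch follows; the wrapper names to_upper_safe and to_upper_copy are made up for this example.

```cpp
#include <cctype>
#include <string>

// Hypothetical wrapper: converting to unsigned char first keeps the
// argument within the range std::toupper is defined for, even when
// plain char is signed and c holds a value above 127.
inline char to_upper_safe(char c) {
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

// Hypothetical convenience function: upper-case a copy of the string.
inline std::string to_upper_copy(std::string s) {
    for (char& c : s) {
        c = to_upper_safe(c);
    }
    return s;
}
```

Whether a byte like 0xE9 is actually converted depends on the current C locale, but with the cast the call is at least well-defined.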

Why do "strings", i.e. character arrays, have a null-terminating element, whereas integer arrays don't?

From what I understand, character arrays in C/C++ have a null-terminating character for the purpose of denoting an off-the-end element of that array, while integer arrays don't; they have some internal mechanism that is hidden from the user, but they obviously know their own size since the user can do sizeof(myArray)/sizeof(int) (Is that technically a hack?). Wouldn't it make sense for an integer array to have some null-terminating int -- call it i or something?
Why is this? It has never made any sense to me.

Because, in C, strings are not the same as character arrays, they exist at a level above arrays in much the same way as a linked list exists at a level above structures.
This is an example of a string:
"pax is great"
This is an example of a character array:
{ 'p', 'a', 'x' }
This is an example of a character array that just happens to be equivalent to a string:
{ 'p', 'a', 'x', '\0' }
In other words, C strings are built on top of character arrays.
If you look at it another way, neither integer arrays nor "real" character arrays (like {'a', 'b', 'c'} for example) have a terminating character.
You can quite easily do the same thing (have a terminator) with an integer array of people's ages, using -1 (or any negative number) as the terminator.
The only difference is that you'll write your own code to handle it rather than using code helpfully provided in the C standard library, things like:
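/* Number of ages before the negative terminator (like strlen). */
size_t agelen (int *ages) {
    size_t len = 0;
    while (*ages++ >= 0)
        len++;
    return len;
}
/* Copy ages from src to dst, including the terminator (like strcpy). */
int *agecpy (int *src, int *dst) {
    int *d = dst;
    while (*src >= 0)
        *d++ = *src++;
    *d = -1;
    return dst;
}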

Because a string type does not exist in C.
The null terminator is there to mark the end of the input, which doesn't have to coincide with the length of the given array.

This is by convention, treating null as a non-character. Unlike other major systems languages of the time, e.g. PL/1, which used a leading integer to denote the length of a variable-length character string, C was designed to treat strings as simply character arrays and did not want the overhead, and in particular any portability issues (such as sizeof int) or any limitations (what about very long strings?). The convention has stuck because it worked out rather well.
Denoting the end of an int array as you suggest would require a marker that is not a valid int. That could be rather difficult to arrange. And sizeof an int array, as you are figuring out, is merely taking advantage of your knowledge of *alloc - there is absolutely nothing in C to prevent you from cobbling together an "array" by clever management of allocated memory. Modern compilers of course contain many convenience checks on wayward code, and someone with better knowledge of compilers could clarify/rectify my comments here. C++'s vector keeps explicit knowledge of the array capacity, for example.
In a lot of places you can see a different field separator (FS) character used to separate out strings, e.g. CSV. But if you were to do that, you would need to write your own standard libraries - thousands and thousands of lines of good, tested code.

A C-Style string is a collection of characters terminated by '\0'. It is not an array.
The collection can be indexed like an array.
Because the length of the collection can vary, the length must be determined by counting the number of characters in the collection.
A convenient representation is an array because an array is also a collection.
One difference is that an array is a fixed sized data structure. The collection of characters may not be a fixed size; for example, it can be concatenated.

If you think about the problem of how to represent strings, you have two choices: 1) store a count of letters followed by the letters or 2) store the letters followed by some unique special character used as an end of string marker.
End of string marker is more flexible - longer strings possible, easier to use, etc.
BTW you can have a terminator on an int array if you want... Nothing is stopping you from saying that a -1, for example, means the end of the list, as long as you are sure that the -1 is unique.
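The two approaches can be put side by side in a short C++ sketch. Both function names are invented for this example, and -1 as the end marker is just the convention suggested above, not anything the language provides.

```cpp
#include <cstddef>

// For a real array the element count is part of its type, so the compiler
// can report it with no terminator stored in memory. (Hypothetical helper;
// it does the same job as sizeof(a) / sizeof(a[0]).)
template <typename T, std::size_t N>
constexpr std::size_t array_length(const T (&)[N]) {
    return N;
}

// Through a plain pointer that information is gone, so a convention is
// needed. Here the assumed convention is a sentinel value (default -1)
// that must never occur as real data.
inline std::size_t sentinel_length(const int* values, int sentinel = -1) {
    std::size_t n = 0;
    while (values[n] != sentinel) {
        ++n;
    }
    return n;
}
```

For int ages[] = {31, 4, 77, -1}, array_length(ages) is 4 (the terminator is an element like any other), while sentinel_length(ages) is 3, just as sizeof on a char array counts the '\0' and strlen does not.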

Does an empty string contain an empty string in C++?

Just had an interesting argument in the comment to one of my questions. My opponent claims that the statement "" does not contain "" is wrong.
My reasoning is that if "" contained another "", that one would also contain "" and so on.
Who is wrong?
P.S.
I am talking about a std::string
P.S. P.S
I was not talking about substrings, but even if I add to my question " as a substring", it still makes no sense. An empty substring is nonsense. If you allow empty substrings to be contained in strings, that means you have an infinity of empty substrings. What is the point of that?
Edit:
Am I the only one that thinks there's something wrong with the function std::string::find?
C++ reference clearly says
Return Value: The position of the first character of the first match.
Ok, let's assume it makes sense for a minute and run this code:
string empty1 = "";
string empty2 = "";
int position = empty1.find(empty2);
cout << "found \"\" at index " << position << endl;
The output is: found "" at index 0
Nonsense part: how can there be index 0 in a string of length 0? It is nonsense.
To be able to even have a 0th position, the string must be at least 1 character long.
And C++ throws an exception in this case, which proves my point:
cout << empty2.at( empty1.find(empty2) ) << endl;
If it really contained an empty string, it would have no problem printing it out.

It depends on what you mean by "contains".
The empty string is a substring of the empty string, and so is contained in that sense.
On the other hand, if you consider a string as a collection of characters, the empty string can't contain the empty string, because its elements are characters, not strings.
Relating to sets, the set
{2}
is a subset of the set
A = {1, 2, 3}
but {2} is not a member of A - all A's members are numbers, not sets.
In the same way, {} is a subset of {}, but {} is not an element in {} (it can't be because it's empty).
So you're both right.

C++ agrees with your "opponent":
#include <iostream>
#include <string>
using namespace std;
int main()
{
bool contains = string("").find(string("")) != string::npos;
cout << "\"\" contains \"\": "
<< boolalpha << contains;
}
Output: "" contains "": true

It's easy. String A contains sub-string B if there is an argument offset such that A.substr(offset, B.size()) == B. No special cases for empty strings needed.
So, let's see. std::string("").substr(0,0) turns out to be std::string(""). And we can even check your "counter-example". std::string("").substr(0,0).substr(0,0) is also well-defined and empty. Turtles all the way down.
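Written out literally, that definition looks something like the sketch below. The function name contains is chosen for this example; it is a direct transcription of the definition, not how std::string::find is implemented.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: A contains B if some offset in [0, A.size()]
// satisfies A.substr(offset, B.size()) == B. Note the inclusive upper
// bound: offset == A.size() is a valid argument to substr and yields
// the empty string.
inline bool contains(const std::string& a, const std::string& b) {
    for (std::size_t offset = 0; offset <= a.size(); ++offset) {
        if (a.substr(offset, b.size()) == b) {
            return true;
        }
    }
    return false;
}
```

With no special case anywhere, contains("", "") comes out true at offset 0, and contains("", "a") comes out false.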

The first thing that is unclear is whether you are talking about std::string or null-terminated C strings; the second is why it should matter. I will assume std::string.
The requirements on std::string determine how the component must behave, not what its internal representation must be (although some of the requirements affect the internal representation). As long as the requirements for the component are met, whether it holds something internally is an implementation detail that you might not even be able to test.
In the particular case of an empty string, there is nothing that mandates that it holds anything. It could just hold a size member set to 0 and a pointer (for the dynamically allocated memory if/when not empty) also set to 0. The requirement in operator[] requires that it returns a reference to a character with value 0, but since that character cannot be modified without causing undefined behavior, and since strict aliasing rules allow reading from an lvalue of char type, the implementation could just return a reference to one of the bytes in the size member (all set to 0) in the case of an empty string.
Some implementations of std::string use small object optimizations, in those implementations there will be memory reserved for small strings, including an empty string. While the std::string will obviously not contain a std::string internally, it might contain the sequence of characters that compose an empty string (i.e. a terminating null character)

empty string doesn't contain anything - it's EMPTY. :)

Of course an empty string does not contain an empty string. It'll be turtles all the way down if it did.
Take String empty = ""; that is declaring a string literal that is empty, if you want a string literal to represent a string literal that is empty you would need String representsEMpty = """"; but of course, you need to escape it, giving you string actuallyRepresentsEmpty = "\"\"";
ps, I am taking a pragmatic approach to this. Leave the maths nonsense at the door.
Thinking about your amendment, it is possible that what your 'opponent' meant was that an 'empty' std::string still has internal storage for characters, which is itself empty of characters. That would be an implementation detail, I am sure; it could perhaps just keep an array of characters of a certain size (say 10) 'just in case', so it would technically not be empty.
Of course, there is the trick question answer that 'nothing' fits into anything infinite times, a sort of 'divide by zero' situation.

Today I had the same question since I'm currently bound to a lousy STL implementation (dating back to the pre-C++98 era) that differs from C++98 and all following standards:
TEST_ASSERT(std::string().find(std::string()) == string::npos); // WRONG!!! (non-standard)
This is especially bad if you try to write portable code because it's so hard to prove that no feature depends on that behaviour. Sadly in my case that's actually true: it does string processing to shorten phone numbers input depending on a subscriber line spec.
On Cppreference, I see in std::basic_string::find an explicit description about empty strings that I think matches exactly the case in question:
an empty substring is found at pos if and only if pos <= size()
The referred pos defines the position where to start the search, it defaults to 0 (the beginning).
A standard-compliant C++ Standard Library will pass the following tests:
TEST_ASSERT(std::string().find(std::string()) == 0);
TEST_ASSERT(std::string().substr(0, 0).empty());
TEST_ASSERT(std::string().substr().empty());
This interpretation of "contain" answers the question with yes.
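The role of pos in that rule can be checked with a small sketch. The helper name empty_found_at is invented for this example; it only wraps the standard find call.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: does searching s for the empty string, starting
// at pos, report a match exactly at pos? Per the rule quoted above this
// should hold if and only if pos <= s.size().
inline bool empty_found_at(const std::string& s, std::size_t pos) {
    return s.find(std::string(), pos) == pos;
}
```

On a conforming library, empty_found_at("abc", 3) is true (one past the last character is still a valid position), and empty_found_at("abc", 4) is false because find returns npos.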
