I've just been introduced to toupper, and I'm a little confused by the syntax; it seems like it's repeating itself. I've been using it to convert every character of a string to uppercase where possible.
for (int i = 0; i < string.length(); i++)
{
if (isalpha(string[i]))
{
if (islower(string[i]))
{
string[i] = toupper(string[i]);
}
}
}
Why do you have to list string[i] twice? Shouldn't this work?
toupper(string[i]); (I tried it, so I know it doesn't.)
toupper is a function that takes its argument by value. It could have been defined to take a reference to character and modify it in-place, but that would have made it more awkward to write code that just examines the upper-case variant of a character, as in this example:
// compare chars case-insensitively without modifying anything
if (std::toupper(*s1++) == std::toupper(*s2++))
...
In other words, toupper(c) doesn't change c for the same reasons that sin(x) doesn't change x.
To avoid repeating expressions like string[i] on the left and right side of the assignment, take a reference to a character and use it to read and write to the string:
for (size_t i = 0; i < string.length(); i++) {
char& c = string[i]; // reference to character inside string
c = std::toupper(c);
}
Using range-based for, the above can be written more briefly (and executed more efficiently) as:
for (auto& c: string)
c = std::toupper(c);
According to the documentation, the character is passed by value.
Because of that, the answer is no, it shouldn't.
The prototype of toupper is:
int toupper( int ch );
As you can see, the character is passed by value, transformed and returned by value.
If you don't assign the returned value to a variable, it is lost.
That's why your example assigns the result back to string[i]: it replaces the original character.
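A small sketch of that point, using two hypothetical helper names (call_without_assignment_changes and upper_in_place are made up for illustration, not standard functions). The first calls toupper and throws the result away; the second assigns it back through a reference.

```cpp
#include <cctype>

// Hypothetical helper: reports whether merely calling std::toupper,
// without assigning its result, changed the argument. It never does.
bool call_without_assignment_changes(char c)
{
    const char before = c;
    // Result discarded; the cast to void only silences unused-result warnings.
    (void)std::toupper(static_cast<unsigned char>(c));
    return c != before;
}

// Hypothetical helper: the result is assigned back, so the caller's
// variable really changes.
void upper_in_place(char& c)
{
    c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}
```

Calling upper_in_place(string[i]) would have the same effect as string[i] = toupper(string[i]); the assignment has simply moved inside the helper.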
As many of the other answers already say, the argument to std::toupper is passed and the result returned by-value which makes sense because otherwise, you wouldn't be able to call, say std::toupper('a'). You cannot modify the literal 'a' in-place. It is also likely that you have your input in a read-only buffer and want to store the uppercase-output in another buffer. So the by-value approach is much more flexible.
What is redundant, on the other hand, is your checking for isalpha and islower. If the character is not a lower-case alphabetic character, toupper will leave it alone anyway so the logic reduces to this.
#include <cctype>
#include <iostream>
int
main()
{
char text[] = "Please send me 400 $ worth of dark chocolate by Wednesday!";
for (auto s = text; *s != '\0'; ++s)
*s = std::toupper(*s);
std::cout << text << '\n';
}
You could further eliminate the raw loop by using an algorithm, if you find this prettier.
#include <algorithm>
#include <cctype>
#include <iostream>
#include <iterator> // std::begin, std::cbegin, std::cend
int
main()
{
char text[] = "Please send me 400 $ worth of dark chocolate by Wednesday!";
std::transform(std::cbegin(text), std::cend(text), std::begin(text),
[](auto c){ return std::toupper(c); });
std::cout << text << '\n';
}
toupper takes an int by value and returns, as an int, the uppercase equivalent of that character. Whenever a function does not take a pointer or reference as a parameter, the argument is passed by value: the parameter is a copy of the variable passed in, so changes made inside the function are not visible outside it. The way to get the result is to save what the function returns, which in this case is the upper-cased character.
Note that there is a nasty gotcha in isalpha(), which is the following: the function only works correctly for inputs in the range 0-255 + EOF.
So what, you think.
Well, if your char type happens to be signed, and you pass a value greater than 127, this is considered a negative value, and thus the int passed to isalpha will also be negative (and thus outside the range of 0-255 + EOF).
In Visual Studio, this will crash your application. I have complained about this to Microsoft, on the grounds that a character classification function that is not safe for all inputs is basically pointless, but received an answer stating that this was entirely standards conforming and I should just write better code. Ok, fair enough, but nowhere else in the standard does anyone care about whether char is signed or unsigned. Only in the isxxx functions does it serve as a landmine that could easily make it through testing without anyone noticing.
The following code crashes Visual Studio 2015 (and, as far as I know, all earlier versions):
int x = toupper ('é');
So not only is the isalpha() in your code redundant, it is in fact actively harmful, as it will cause any strings that contain characters with values greater than 127 to crash your application.
See http://en.cppreference.com/w/cpp/string/byte/isalpha: "The behavior is undefined if the value of ch is not representable as unsigned char and is not equal to EOF."
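One way to stay inside the allowed range is to convert each char to unsigned char before it reaches the classification or conversion function. The sketch below assumes a hypothetical helper name, to_upper_copy, and uses a lambda whose parameter type does the conversion.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns an uppercased copy of s. The lambda takes
// unsigned char, so std::toupper never sees a negative value even when
// plain char is signed and the string holds bytes above 127.
std::string to_upper_copy(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}
```

Which non-ASCII bytes are actually changed still depends on the current C locale; the cast only removes the undefined behavior.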
Related
I was solving a question online on strings where we had to perform run-length encoding on a given string, I wrote this function to achieve the answer
#include <string>
#include <vector>
using namespace std;
string runLengthEncoding(string str) {
vector <char> encString;
int runLength = 1;
for(int i = 1; i < str.length(); i++)
{
if(str[i - 1] != str[i] || runLength == 9)
{
encString.push_back(to_string(runLength)[0]);
encString.push_back(str[i - 1]);
runLength = 0;
}
runLength++;
}
encString.push_back(to_string(runLength)[0]);
encString.push_back(str[str.size() - 1]);
string encodedString(encString.begin(), encString.end());
return encodedString;
}
Here I was getting a very long error on this particular line in the for loop and outside it when I wrote:
encString.push_back(to_string(runLength));
which I later found out should be:
encString.push_back(to_string(runLength)[0]);
instead
I don't quite understand why I have to insert it as a 2D element (I don't know if that is the right way to say it, forgive me, I am a beginner in this) when I am just trying to insert the integer...
In stupid terms - why do I gotta add [0] in this?
std::to_string() returns a std::string. That's what it does, if you check your C++ textbook for a description of this C++ library function that's what you will read there.
encString.push_back( /* something */ )
Because encString is a std::vector<char>, it logically follows that the only thing that can be push_back()ed into it is a char. Just a single char. C++ does not allow you to pass an entire std::string to a function that takes a single char parameter. C++ does not work this way; it allows only certain, specific conversions between different types, and this isn't one of them.
And that's why encString.push_back(to_string(runLength)); does not work. The [0] operator returns the first char from the returned std::string. What a lucky coincidence! You get a char from that, the push_back() expects a single char value, and everyone lives happily ever after.
Also, it is important to note that you do not, do not "gotta add [0]". You could use [1] if you had to add the second character from the string, or any other character from the string, in the same manner. This explains the compilation error. Whether [0] is the right solution or not is something that you'll need to figure out separately. You wanted to know why this does not compile without the [0], and that's the answer: to_string() returns a std::string, but you must push_back() a single char value, and using [0] makes it happen. Whether it's the right char or not is a completely different question.
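To make the type difference concrete, here is a minimal sketch with two hypothetical helpers (push_number and digit_to_char are names invented for this example). The first pushes every char of the string returned by to_string; the second builds a single char directly, which is only valid for values from 0 to 9.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: append the decimal digits of n to out, one char at a time.
void push_number(std::vector<char>& out, int n)
{
    const std::string digits = std::to_string(n); // a std::string, e.g. "12"
    for (char d : digits) {
        out.push_back(d);                         // each element pushed is a single char
    }
}

// Hypothetical helper: for a value known to be in 0..9, compute the digit char directly.
char digit_to_char(int n)
{
    return static_cast<char>('0' + n);
}
```

In the question's code runLength never exceeds 9, so to_string(runLength)[0] and digit_to_char(runLength) would produce the same character.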
I'm trying to get the length of a character array in a second function. I've looked at a few questions on here (1 2) but they don't answer my particular question (although I'm sure something does, I just can't find it). My code is below, but I get the error "invalid conversion from 'char' to 'const char*'". I don't know how to convert my array to what is needed.
#include <cstring>
#include <iostream>
int ValidInput(char, char);
int main() {
char user_input; // user input character
char character_array[26];
int valid_guess;
valid_guess = ValidGuess(user_input, character_array);
// another function to do stuff with valid_guess output
return 0;
}
int ValidGuess (char user_guess, char previous_guesses) {
for (int index = 0; index < strlen(previous_guesses); index++) {
if (user_guess == previous_guesses[index]) {
return 0; // invalid guess
}
}
return 1; // valid guess, reaches this if for loop is complete
}
Based on what I've done so far, I feel like I'm going to have a problem with previous_guesses[index] as well.
char user_input;
defines a single character
char character_array[26];
defines an array of 26 characters.
valid_guess = ValidGuess(user_input, character_array);
calls the function
int ValidGuess (char user_guess, char previous_guesses)
where char user_guess accepts a single character, lining up correctly with the user_input argument, and char previous_guesses accepts a single character, not the 26 characters of character_array. previous_guesses needs a different type to accommodate character_array. This would be the cause of the reported error.
Where this gets tricky is character_array will decay to a pointer, so
int ValidGuess (char user_guess, char previous_guesses)
could be changed to
int ValidGuess (char user_guess, char * previous_guesses)
or
int ValidGuess (char user_guess, char previous_guesses[])
both ultimately mean the same thing.
Now for where things get REALLY tricky. When an array decays to a pointer it loses how big it is. The asker has gotten around this problem, kudos, with strlen, which computes the length, but this needs a bit of extra help. strlen zips through an array, counting until it finds a null terminator, and there are no signs of character_array being null terminated. This is bad. Without knowing where to stop, strlen will probably keep going (see footnote 1). A quick solution to this is to go back up to the definition of character_array and change it to
char character_array[26] = {};
to force all of the slots in the array to 0, which just happens to be the null character.
That gets the program back on its feet, but it could be better. Every call to strlen may recount the characters in the string (compilers are smart and could compute the length once and store it if they can prove the contents won't change). Even then, that is still at least one scan through character_array looking for the null, when what you really want to do is scan for user_input. Basically the program looks at every item in the array twice.
Instead, look for both the null terminator and user_input in the same loop.
int index = 0;
while (previous_guesses[index] != '\0' ) {
if (user_guess == previous_guesses[index]) {
return 0; // prefer returning false here. The intent is clearer
}
index++;
}
You can also wow your friends by using pointers and eliminating the need for the index variable.
while (*previous_guesses != '\0' ) {
if (user_guess == *previous_guesses) {
return false;
}
previous_guesses++;
}
The compiler knows and uses this trick too, so use the one that's easier for you to understand.
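Put together, the whole function might look like the sketch below. The bool return type and the const qualifier are assumptions of this example rather than something the question used, and the function still relies on previous_guesses being null-terminated.

```cpp
// Sketch of the complete function. previous_guesses must point to a
// null-terminated array, e.g. one defined as char character_array[26] = {};
bool ValidGuess(char user_guess, const char* previous_guesses)
{
    while (*previous_guesses != '\0') {
        if (user_guess == *previous_guesses) {
            return false; // invalid guess: already used
        }
        ++previous_guesses;
    }
    return true; // reached the terminator without finding a match
}
```

The call site from the question, ValidGuess(user_input, character_array), works unchanged because the array decays to a pointer to its first element.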
For 26 entries it probably doesn't matter, but if you really want to get fancy, or have a lot more than 26 possibilities, use a std::set or a std::unordered_set. They allow only one of an item and have much faster look-up than scanning a list one by one, so long as the list is large enough to get over the added complexity of a set and take advantage of its smarter logic. ValidGuess is replaced with something like
if (used.find(user_input) != used.end())
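A minimal sketch of that idea, assuming a hypothetical helper named TryGuess and a std::unordered_set<char> called used that the caller owns. It both checks and records the guess, which the answer leaves open.

```cpp
#include <unordered_set>

// Hypothetical helper: returns true and records the guess if it is new,
// returns false if the character is already in the set.
bool TryGuess(std::unordered_set<char>& used, char user_input)
{
    if (used.find(user_input) != used.end()) {
        return false;        // already guessed
    }
    used.insert(user_input); // remember it for next time
    return true;
}
```

Because the set tracks its own size, there is no null terminator to maintain and no fixed limit of 26 entries.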
Side note: Don't forget to make the user read a value into user_input before the program uses it. I've also left out how to store the previous inputs because the question does as well.
1 I say probably because the Standard doesn't say what to do. This is called Undefined Behaviour. C++ is littered with the stuff. Undefined Behaviour can do anything -- work, not work, visibly not work, look like it works until it doesn't, melt your computer, anything -- but what it usually does is the easiest and fastest thing. In this case that's just keep going until the program crashes or finds a null.
I wrote a piece of code to count how many 'e' characters are in a bunch of words.
For example, if I type "I read the news", the counter for how many e's are present should be 3.
#include <iostream>
#include <cstring>
using namespace std;
int main()
{
char s[255],n,i,nr=0;
cin.getline(s,255);
for(i=1; i<=strlen(s); i++)
{
if(s[i-1]=='e') nr++;
}
cout<<nr;
return 0;
}
I have 2 unclear things about characters in C++:
In the code above, if I replace strlen(s) with 255, my code just doesn't work. I can only type a word and the program stops. I have been taught at school that strlen(s) is the length for the string s, which in this case, as I declared it, is 255. So, why can't I just type 255, instead of strlen(s)?
If I run the program above normally, it doesn't show me a number, like it is supposed to do. It shows me a character (I believe it is from the ASCII table, but I'm not sure), like a heart or a diamond. It is supposed to print the number of e's from the words.
Can anybody please explain these to me?
strlen(s) gives you the length of the string held in the s variable, up to the first null character ('\0'). So if you input "hello", the length will be 5, even though s has a capacity of 255.
nr is displayed as a character because it's declared as a char. Either declare it as int, for example, or cast it to int when cout'ing, and you'll see a number.
strlen() counts the actual length of strings - the number of real characters up to the first \0 character (marking end of string).
So, if you input "Hello":
sizeof(s) == 255
strlen(s) == 5
For the second question: you declared nr as char. std::cout treats a char as a single character and tries to print it as such. Declare your variable as int, or cast it before printing, to avoid this.
int nr = 42;
std::cout << nr;
//or
char charNr = 42;
std::cout << static_cast<int>(charNr);
Additional mistakes not mentioned by others, and notes:
You should always check whether the stream operation was successful before trying to use the result.
i is declared as char and cannot hold values greater than 127 on common platforms. In general, the maximum value for char can be obtained as either CHAR_MAX or std::numeric_limits<char>::max(). So, on common platforms, i <= 255 will always be true because 255 is greater than CHAR_MAX. Incrementing i once it has reached CHAR_MAX does not help either: the incremented value no longer fits in a char, and on common platforms it wraps around to a negative number, so the loop then indexes s out of bounds, which is undefined behavior. I recommend declaring i at least as int (which is guaranteed to have sufficient range for this particular use case). If you want to be on the safe side, use something like std::ptrdiff_t (add #include <cstddef> at the start of your program), which is guaranteed to be large enough to hold any valid array size.
n is declared but never used. This by itself is harmless but may indicate a design issue. It can also lead to mistakes such as trying to use n instead of nr.
You probably want to output a newline ('\n') at the end, as your program's output may look odd otherwise.
Also note that calling a potentially expensive function such as strlen repeatedly (as in the loop condition) can have negative performance implications (strlen is typically an intrinsic function, though, and the compiler may be able to optimize most calls away).
You do not need strlen anyway, and can use cin.gcount() instead.
Nothing wrong with return 0; except that it is redundant – this is a special case that only applies to the main function.
Here's an improved version of your program, without trying to change your code style overly much:
#include <iostream>
#include <cstring>
#include <cstddef>
using namespace std;
int main()
{
char s[255];
int nr=0;
if ( cin.getline(s,255) )
{ // only if reading was successful
for(int i=0; i<cin.gcount(); i++)
{
if(s[i]=='e') nr++;
}
cout<<nr<<'\n';
}
return 0;
}
For exposition, the following is a more concise and expressive version using std::string (for arbitrary length input), and a standard algorithm. (As an interviewer, I would set this, modulo minor stylistic differences, as the canonical answer i.e. worth full credit.)
#include <algorithm>
#include <iostream>
#include <string>
using namespace std;
int main()
{
string s;
if ( getline(cin, s) )
{
cout << std::count(begin(s), end(s), 'e') << '\n';
}
}
I have 2 unclear things about characters in C++: 1) In the code above,
if I replace the "strlen(s)" with 255, my code just doesn't work, I
can only type a word and the program stops, and I have been taught at
school that "strlen(s)" is the length for the string s, wich in this
case, as I declared it, is 255. So, why can't I just type 255, instead
of strlen(s);
That's right, but strings only go up to the null terminator, even if there's more space allocated. Consider this, for example:
char buf[32];
strcpy(buf, "Hello World!");
There's 32 chars worth of space, but my string is only 12 characters long. That's why strlen returns 12 in this example. It's because it doesn't know how long the buffer is, it only knows the address of the string and parses it until it finds the null terminator.
So if you enter 255, you're going past what was set by cin and you'll read the rest of the buffer. Which, in this case, is uninitialized. That's undefined behavior - in this case it will most likely read some rubbish values, and those might coincidentally have the 'e' value and thus give you a wrong result.
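The difference between the buffer size and the string length can be sketched as follows. count_char is a hypothetical helper for this example; it stops at the terminator reported by strlen instead of walking the whole buffer.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: count occurrences of ch in the null-terminated
// string s, reading only up to the terminator and never past it.
std::size_t count_char(const char* s, char ch)
{
    std::size_t n = 0;
    const std::size_t len = std::strlen(s); // computed once, up to the first '\0'
    for (std::size_t i = 0; i < len; ++i) {
        if (s[i] == ch) {
            ++n;
        }
    }
    return n;
}
```

For a char buf[32] holding "Hello World!", sizeof buf is 32 while std::strlen(buf) is 12, and only those first 12 characters are ever inspected.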
2) If you run the program above normaly, it doesn't show you a number,
like it's supposed to do, it shows me a character(I believe it's from
the ASCII table but I'm not sure), like a heart or a diamond, but it
is supposed to print the number of e's from the words. So can anybody
please explain these to me?
You declared nr as char. While that can indeed hold an integer value, if you print it like this, it will be printed as a character. Declare it as int instead or cast it when you print it.
So I am currently writing a part of a program that takes user text input. I want to ignore all input characters that are not alphabetic, and so I figured std::isalpha() would be a good way to do this. Unfortunately, as far as I know there are two std::isalpha() functions, and the general one needs to be disambiguated from the locale-specific one thusly:
(int(*)(int))std::isalpha()
If I don't disambiguate, std::isalpha seems to return true when reading uppercase but false when reading lowercase letters (if I directly print the returned value, though, it returns 0 for non-alpha chars, 1 for uppercase chars, and 2 for lowercase chars). So I need to do this.
I've done so in another program before, but for some reason, in this project, I sometimes get "ISO C++ forbids" errors. Note, only sometimes. Here is the problematic area of code (this appears together without anything in between):
std::cout << "Is alpha? " << (int(*)(int))std::isalpha((char)Event.text.unicode) << "\n";
if ( (int(*)(int))std::isalpha((char)Event.text.unicode) == true)
{
std::cout << "Is alpha!\n";
//...snip...
}
The first instance, where I send the returned value to std::cout, works fine - I get no errors for this, I get the expected values (0 for non-alpha, 1 for alpha), and if that's the only place I try to disambiguate, the program compiles and runs fine.
The second instance, however, throws up this:
error: ISO C++ forbids comparison between pointer and integer
and only compiles if I remove the (int(*)(int)) snippet, at which point bad behavior ensues. Could someone enlighten me here?
You are casting the return value of the std::isalpha() call to int(*)(int), and then comparing that pointer to true. Comparing pointers to boolean values doesn't make much sense, so you get an error.
Now, without the cast, you compare the int returned by std::isalpha() to true. bool is an integer type, and to compare the two different integer types the values are first converted to the same type. In this case they are both converted to int. true becomes 1, and if std::isalpha() returned 2 the comparison ends up as 2 == 1, which is false.
If you want to compare the result of std::isalpha() against a bool, you should cast the returned int to bool, or simply leave out the comparison and use something like if (std::isalpha(c)) {...}
There is no need to disambiguate, because there is no ambiguity in a normal call.
Also, there is no need to use the std:: prefix when you get the function declaration from <ctype.h>, which after C++11 is the header you should preferably use (i.e., not <cctype>) – and for that matter also before C++11, but C++11 clinched it.
Third, you should not compare the result to true.
However, you need to cast a char argument to unsigned char, lest you get Undefined Behavior for anything but 7-bit ASCII.
E.g. do like this:
bool isAlpha( char const c )
{
typedef unsigned char UChar;
return !!isalpha( UChar( c ) );
}
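The function-pointer cast from the question only matters when the function name itself is passed to something like an algorithm, not when the function is called. A lambda sidesteps the overload question altogether, as in this sketch (count_alpha is a hypothetical helper name):

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: count the alphabetic characters in s.
// The lambda takes unsigned char (avoiding negative arguments) and turns
// the int result into a bool by testing for nonzero, never for == true.
std::size_t count_alpha(const std::string& s)
{
    return static_cast<std::size_t>(
        std::count_if(s.begin(), s.end(),
                      [](unsigned char c) { return std::isalpha(c) != 0; }));
}
```

Testing != 0 matters because the classification functions only promise a nonzero value for a match, which is why the question saw 1 for some letters and 2 for others.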
So I was playing around with some code and wanted to see which method of converting a std::string to upper case was most efficient. I figured that the two would be somewhat similar performance-wise, but I was terribly wrong. Now I'd like to find out why.
The first method of converting the string works as follows: for each character in the string (save the length, iterate from 0 to length), if it's between 'a' and 'z', then shift it so that it's between 'A' and 'Z' instead.
The second method works as follows: for each character in the string (start from 0, keep going till we hit a null terminator), apply the built-in toupper() function.
Here's the code:
#include <iostream>
#include <string>
inline std::string ToUpper_Reg(std::string str)
{
for (int pos = 0, sz = str.length(); pos < sz; ++pos)
{
if (str[pos] >= 'a' && str[pos] <= 'z') { str[pos] += ('A' - 'a'); }
}
return str;
}
inline std::string ToUpper_Alt(std::string str)
{
for (int pos = 0; str[pos] != '\0'; ++pos) { str[pos] = toupper(str[pos]); }
return str;
}
int main()
{
std::string test = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~!##$%^&*()_+=-`'{}[]\\|\";:<>,./?";
for (size_t i = 0; i < 100000000; ++i) { ToUpper_Reg(test); /* ToUpper_Alt(test); */ }
return 0;
}
The first method ToUpper_Reg took about 169 seconds per 100 million iterations.
The second method ToUpper_Alt took about 379 seconds per 100 million iterations.
What gives?
Edit: I changed the second method so that it iterates the string how the first one does (set the length aside, loop while less than length) and it's a bit faster, but still about twice as slow.
Edit 2: Thanks everybody for your submissions! The data I'll be using it on is guaranteed to be ASCII, so I think I'll be sticking with the first method for the time being. I'll keep in mind that toupper is locale specific for when/if I need it.
std::toupper uses the current locale to do case conversions, which involves a function call and other abstractions. So naturally, it will be slower. But it will also work on non-ASCII text.
toupper() does more than just shift characters in the range [a-z]. For one thing it's locale dependent and can handle more than just ASCII.
toupper() takes the locale into account so it can handle (some) international characters and is much more complex than just handling the character range 'a'-'z'.
Well, ToUpper_Reg() doesn't work. For example, it doesn't turn my name into all uppercase characters. That said, ToUpper_Alt() also doesn't work, because toupper() gets passed a negative value on some platforms, i.e. it causes undefined behavior (normally a crash) when used with my name. This is easily fixed, though, by calling it correctly, something like this:
toupper(static_cast<unsigned char>(str[pos]))
That said, the two versions of the code are not equivalent: the version not using toupper() isn't writing the characters all the time while the other version is: once everything is converted to uppercase, it always takes the same branch after a test and then does nothing. You might want to change ToUpper_Alt() to look like this and retest:
inline std::string ToUpper_Alt(std::string str)
{
for (int pos = 0; str[pos] != '\0'; ++pos) {
if (islower(static_cast<unsigned char>(str[pos]))) {
str[pos] = toupper(static_cast<unsigned char>(str[pos]));
}
}
return str;
}
I would guess the difference is the writing: toupper() trades the comparison for an array look-up. The locale is quickly accessed and all toupper() does is get the current pointer and access the location at a given offset. With data in the cache this is probably as fast as the branch.
The second one involves a function call, and a function call is an expensive operation in an inner loop. toupper also uses locales to determine how the character should be changed.
The advantage of the call is that it is standard and will work regardless of the character encoding on the host machine.
That said, I would highly recommend using the Boost function:
boost::algorithm::to_upper
It is a template so is more than likely to be inlined, however it does involve locales. I would still use it.
http://www.boost.org/doc/libs/1_40_0/doc/html/boost/algorithm/to_upper.html
I guess it's because the second one calls a C standard library function, that on the one hand isn't inlined, so you got the overhead of a function call. But even more important, this function probably does a lot more than just two comparisons, two jumps and two integer additions. It performs additional checks on the character and takes the current locale into account and all that stuff.
std::toupper uses the current locale and the reason why this is slower than the C function is that the current locale is shared and mutable from different threads, so it's necessary to lock the locale object when it's accessed to ensure it's not switched during the call. This happens once per call to toupper and introduces quite a large overhead (obtaining the lock might require a syscall depending on implementation). One workaround if you want to get the performance and respect the locale is to get the locale object first (creating a local copy) and then call the toupper facet on your copy, thus avoiding the need to lock for each toupper call. See the link below for an example.
http://www.cplusplus.com/reference/std/locale/ctype/toupper/
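A minimal sketch of the workaround described above, assuming a hypothetical helper named to_upper_with_locale. It fetches the std::ctype<char> facet from a locale object once and then uses the facet's range overload of toupper on the whole string, instead of making one locale-aware call per character. Whether this is measurably faster depends on the implementation; no timing is claimed here.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: uppercase s in place using the ctype facet of loc.
// The facet is looked up once and then applied to the whole range.
void to_upper_with_locale(std::string& s, const std::locale& loc = std::locale())
{
    if (s.empty()) {
        return;
    }
    const std::ctype<char>& facet = std::use_facet<std::ctype<char>>(loc);
    facet.toupper(&s[0], &s[0] + s.size()); // range overload: converts [begin, end)
}
```

The default argument takes a copy of the global locale at call time; passing std::locale::classic() gives plain "C" locale behavior.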
The question has already been answered, but as an aside, replacing the guts of your loop in the first method with:
std::string::value_type &c = str[pos];
if ('a' <= c && c <= 'z') { c += ('A' - 'a'); }
makes it even faster. Maybe my compiler just sucks.