Is string::compare reliable to determine alphabetical order? - c++

Simply put, if the input is always in the same case (here, lower case), and if the characters are always ASCII, can one use string::compare to determine reliably the alphabetical order of two strings?
That is, with stringA.compare(stringB): if the result is 0, the strings are the same; if it is negative, stringA comes before stringB alphabetically; and if it is positive, stringA comes after?

According to the docs at cplusplus.com,
The member function returns 0 if all
the characters in the compared
contents compare equal, a negative
value if the first character that does
not match compares to less in the
object than in the comparing string,
and a positive value in the opposite
case.
So it will sort strings in ASCII order, which will be alphabetical for English strings (with no diacritical marks or other extended characters) of the same case.

Yes, as long as all of the characters in both strings are of the same case, and as long as both strings consist only of letters, this will work.
compare is a member function, though, so you would call it like so:
stringA.compare(stringB);
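As a minimal sketch of the sign convention described above (the helper name comes_before is made up for illustration, and the last check assumes an ASCII-compatible character set):

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if a sorts strictly before b under
// std::string::compare. Only "alphabetical" when both strings are
// same-case ASCII letters, as in the question.
bool comes_before(const std::string& a, const std::string& b) {
    return a.compare(b) < 0;
}

void compare_examples() {
    assert(std::string("apple").compare("apple") == 0);  // identical contents
    assert(comes_before("apple", "banana"));             // 'a' < 'b'
    assert(comes_before("app", "apple"));                // a prefix sorts first
    // Mixed case breaks the alphabetical reading: in ASCII every
    // upper-case letter has a smaller code than every lower-case letter.
    assert(comes_before("Zebra", "apple"));
}
```

The last assertion is the reason for the "same case" condition: the comparison is by character code, not by dictionary rules.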

In C++, string is the instantiation of the template class basic_string with the default parameters: basic_string<char, char_traits<char>, allocator<char> >. The compare function in the basic_string template will use the char_traits<TChar>::compare function to determine the result value.
For std::string the ordering is that of the implementation's character codes, which is usually ASCII order. If you require a different ordering (say you want to consider { a, á, à, â } as equivalent), you can instantiate basic_string with your own char_traits<> implementation that provides a different compare function.
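A rough sketch of the custom-traits idea follows. It swaps in case-insensitive comparison rather than accent folding, because accented letters are usually multi-byte in UTF-8 and would not fit a per-char traits class. The names ci_char_traits and ci_string are illustrative, not standard.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical traits class: identical to std::char_traits<char> except
// that characters are compared after folding to lower case.
struct ci_char_traits : public std::char_traits<char> {
    static char fold(char c) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    static bool eq(char a, char b) { return fold(a) == fold(b); }
    static bool lt(char a, char b) { return fold(a) < fold(b); }
    static int compare(const char* a, const char* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
};

using ci_string = std::basic_string<char, ci_char_traits>;

void ci_examples() {
    assert(ci_string("Hello").compare("hELLO") == 0);    // case is ignored
    assert(ci_string("apple").compare("Banana") < 0);    // 'a' < 'b' after folding
    assert(std::string("apple").compare("Banana") > 0);  // ASCII: 'a' (97) > 'B' (66)
}
```

Note that ci_string is a distinct type from std::string, so values have to be converted explicitly (for example via c_str()) when mixing the two.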

Yes,
The member function returns 0 if all
the characters in the compared
contents compare equal, a negative
value if the first character that does
not match compares to less in the
object than in the comparing string,
and a positive value in the opposite
case.
For string objects, the result of a
character comparison depends only on
its character code (i.e., its ASCII
code), so the result has some limited
alphabetical or numerical ordering
meaning.

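The specifications for the C and C++ languages guarantee the ordering of the decimal digits: '0' < '1' < ... < '9', with each digit's value one greater than the previous.
They make no such guarantee for letters. 'A' < 'B' < 'C' ... < 'Z' (and the same for lowercase) holds in ASCII and in other common character sets such as EBCDIC, but it is a property of the character set rather than of the language.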
When working with multiple languages, many people create an array of characters listed in the desired order. Each character is looked up in the array, and the indices are compared instead of the character codes.
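The lookup-table approach might be sketched like this. The table contents and the names kAlphabet, rank_of and compare_by_table are assumptions made for the example; a real table would list whatever characters and order the application needs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical collation table: a character's position defines its rank.
// Here each lower-case letter sorts just before its upper-case form.
const std::string kAlphabet =
    "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ";

// Rank of a character; characters missing from the table get npos,
// the largest possible value, so they sort last.
std::size_t rank_of(char c) { return kAlphabet.find(c); }

// Returns <0, 0 or >0, comparing table indices instead of character codes.
int compare_by_table(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t ra = rank_of(a[i]);
        const std::size_t rb = rank_of(b[i]);
        if (ra != rb) return ra < rb ? -1 : 1;
    }
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;  // a prefix sorts first
}
```

With this table, compare_by_table("apple", "Banana") is negative, whereas a plain std::string comparison in ASCII puts "Banana" first.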

Related

Comparing a char variable to empty char does not work

Say x is a character.
Whenever I do if(x <> '') to know whether the variable is empty or not, it just does not work.
However, when I attempt to do this if(x <> chr(0)), it does work.
I have tried the same thing on two versions of the compiler : Free Pascal and Charm Pascal, but I am still facing the same problem.
There is no such thing as an "empty char". The Char type is always a single character.
That character could be 1 byte AnsiChar representing a value from 0..255. (In Delphi and fpc, it could also be a 2 byte WideChar representing a value from 0..65535.) Either way it is always represented as '<something>'. That "something" must be a character value.
When you compare x <> Chr(0) you are taking the byte value of 0 and converting it to a Char so a valid comparison can be performed.
Side Notes
For Char to reliably have the concept "no value" requires storing additional information. E.g. Databases may have a hidden internal bit field indicating the value is NULL. It's important to be aware that this is fundamentally different from any of the valid values it may have if it's not NULL. Libraries that interact with databases need to provide a way to determine if a value is NULL.
You haven't provided any information about the actual problem you're trying to solve but here are some thoughts that may yield progress:
If you're dealing with user input, it may be more appropriate to compare with a space character ' '.
If you're dealing with characters read from a file, you should probably be checking number of bytes/characters actually read.
If you're trying to determine the end of a string it's much more reliable to use the Length() of the string.
(Though there are some environments that use the convention of treating Char(0) as a special character meaning "end-of-string".) But the convention requires allocating an extra character making the string internally longer than its text length. So the technique is not usable if the environment doesn't support it.
Most importantly, from comments it seems you might be struggling with the difference between empty-string and how that's represented as a Char. And the point is that it isn't. You need to check the length of the string.
E.g. You can do the following:
if (s <> '') then
begin
  { You now know there is at least 1 character in the string so
    you can safely read it and not worry about "if it has a value". }
  x := s[1];
  ...
end;

C++: What if there's a null character before any other character in an array?

If we're outputting said array and the first character is \0, is it just ignored and the next character that isn't null treated as the first character?
Depends on the 'outputting' function...if it knows the length of the array, it could output every element regardless of value. Most functions working with 'C strings' will stop at the first \0.
C-style strings are sentinel-terminated character arrays, meaning they end at the first appearance of \0 (or some form of null), so it isn't ignored. It terminates the string, which is then treated as an empty string.
By convention, code that handles a "C string" (read: char array) with "string-ish" functions stops at the first ASCII zero, and any length parameter is treated as an "up to" length. Code that uses "array-ish" functions treats a length parameter as an exact length and processes precisely that many elements. The latter is often found in binary data handling, while the former is the convention for treating char arrays as strings.
Examples in the C standard headers are the mem* functions (such as memcpy and memcmp) for binary arrays, as opposed to the str* and strn* functions (such as strcpy and strncmp) for strings.
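To illustrate the difference between the two families, here is a small sketch using a four-byte array whose first element is \0 (the array contents and the function name are made up for the example):

```cpp
#include <cassert>
#include <cstring>

void leading_null_examples() {
    const char buf[]   = {'\0', 'h', 'i', '\0'};
    const char same[]  = {'\0', 'h', 'i', '\0'};
    const char other[] = {'\0', 'h', 'o', '\0'};

    // "String-ish" functions stop at the first '\0': buf looks empty.
    assert(std::strlen(buf) == 0);
    assert(std::strcmp(buf, "") == 0);
    assert(std::strcmp(buf, other) == 0);  // both look empty to strcmp

    // "Array-ish" functions honour the explicit length and see all bytes.
    assert(std::memcmp(buf, same, sizeof buf) == 0);
    assert(std::memcmp(buf, other, sizeof buf) != 0);
}
```

So the leading \0 is not skipped: string functions see an empty string, and only length-based functions reach the characters after it.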

How do I compare base36 values in C++?

As we all know, with C-style string comparisons, the result depends on the ASCII value of each character, and one just uses the strcmp function to compare. What does the std::string compare depend on?
Although I have searched Google, I still haven't found the answer.
In addition, if the strings are all base36 strings and they are all in lower case, could I compare their values as strings directly? Or should I convert them to long values using the strtol function? Which method is better?
Your premise "As we all know, with C-style string comparisons, the value is dependent on the ASCII value of each character..." is unfortunately wrong already. With e.g. UTF-8 strings and various forms of collation, that's simply untrue.
Then "...and just uses the strcmp function to compare." is also wrong, because C-style strings don't have one inherent way to compare but multiple ways that also depend on e.g. the encoding and locale. You could use strcmp() for bytewise equality comparisons though, but that won't always give you the expected results.
To answer your question about what std::string uses, that's simple. std::string is a specialization of the std::basic_string template and it delegates comparisons to its char_traits template parameter. This parameter typically uses memcmp(). It cannot possibly use strcmp(), because unlike a C-style string, std::string can include null chars, and strcmp() would stop at those.
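The point about embedded null characters can be checked with a short sketch (the string contents and the function name are invented for the example):

```cpp
#include <cassert>
#include <cstring>
#include <string>

void embedded_null_examples() {
    // Construct with an explicit length so the embedded '\0' is kept.
    const std::string a("ab\0x", 4);
    const std::string b("ab\0y", 4);

    assert(a.size() == 4 && b.size() == 4);
    assert(a.compare(b) < 0);  // differs at index 3: 'x' < 'y'

    // strcmp only sees "ab" in both buffers and reports them equal.
    assert(std::strcmp(a.c_str(), b.c_str()) == 0);
}
```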
std::string compare depends upon 'ASCII values' in exactly the same way that strcmp does.
For base36 comparisons, simple string comparison (either strcmp or std::string) doesn't work because "00123" and "123" are equal when representing base36 integers but they compare differently as strings. Neither does strtol work very well because of integer overflow. Instead you should probably write your own comparison routine that removes leading zeros, then compares length and finally for strings of equal length does a string comparison.
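One possible shape for such a routine is sketched below. The name compare_base36 is hypothetical, and the code assumes non-negative values written with '0'-'9' and lower-case 'a'-'z' in an ASCII-compatible character set, where digits sort before letters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns <0, 0 or >0 for two base36 numbers
// given as lower-case strings of arbitrary length.
int compare_base36(const std::string& a, const std::string& b) {
    // Step 1: skip leading zeros (an all-zero string has no digits left).
    const std::size_t ia = std::min(a.find_first_not_of('0'), a.size());
    const std::size_t ib = std::min(b.find_first_not_of('0'), b.size());
    const std::size_t la = a.size() - ia;
    const std::size_t lb = b.size() - ib;

    // Step 2: the number with more significant digits is larger.
    if (la != lb) return la < lb ? -1 : 1;

    // Step 3: equal length, so character order matches numeric order.
    return a.compare(ia, la, b, ib, lb);
}
```

For example, compare_base36("00123", "123") returns 0, and compare_base36("10", "z") is positive even though "10" sorts before "z" as a plain string. Because no conversion to an integer takes place, there is no overflow limit on the input length.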

Is the expression 'ab' == "ab" true in C++

My question probably sounds quite stupid, but I have to answer it while preparing for my bachelor's exam.
So, what do you think about the expression 'ab' == "ab" in C++? Is it false, or is it simply illegal and a compile error? I have googled a little and learned that 'ab' has type int and "ab" of course does not...
What matters to me is not what a compiler says but what the formal description of the language says.
It definitely generates a warning, but by default, gcc compiles it.
It should normally be false.
That being said, it is theoretically possible, depending on the platform you're running this on, for the compile-time constant "ab" to sit at a memory location whose address is numerically equal to the value of 'ab', in which case the expression would be true (although the comparison is of course meaningless).
In both C and C++ expression 'ab' == "ab" is invalid. It has no meaning. Neither language allows comparing arbitrary integral values with pointer values. For this reason, the matter of it being "true" or not does not even arise. In order to turn it into a compilable expression you have to explicitly cast the operands to comparable types.
The only loophole here is that the value of multi-char character constant is implementation-defined. If in some implementation the value of 'ab' happens to be zero, it can serve as a null-pointer constant. In that case 'ab' == "ab" becomes equivalent to 0 == "ab" and NULL == "ab". This is guaranteed to be false.
It is going to give you a warning, but it will build. What it will do is compare the multi-character constant 'ab' (an int) with the address of the string literal "ab".
Bottom line, the result of the comparison won't reflect the choice of letters being the same or not.
The Standard has absolutely nothing to say about comparing an integral type with a pointer. All it says is the following (in section 5.9):
The operands shall have arithmetic, enumeration, or pointer type, or
type std::nullptr_t...
It then goes into a detailed description on what it means to compare two pointers, and mentions comparing two integers. So my interpretation of the lack of specification would be "whatever the compiler writer decides", which is either an error or a warning.
Let's consider this in two parts. In plain C, 'c' is a single char; if you want to manipulate strings you have to use an array of chars, so 'ca' won't work the way you expect, and the same holds in C++. If you want to use strings you have to use the std::string class, which isn't a built-in type. It is simply a class with methods and typedefs that makes arrays of chars easier to handle. A character constant and a C-style string are different things, so 'ab' == "ab" is not going to give a meaningful boolean result. It's like trying to compare an int to a string, so this comparison will most likely produce a compile error.

What is '\0' in C++?

I'm trying to translate a huge project from C++ to Delphi and I'm finalizing the translation. One of the things I left is the '\0' monster.
if (*asmcmd=='\0' || *asmcmd==';')
where asmcmd is char*.
I know that \0 marks the end of a string in C++, but I need to know its value as a byte. Is it 0?
In other words, would the code below be the equivalent of the C++ line?
if(asmcmd^=0) or (asmcmd^=';') then ...
where asmcmd is PAnsiChar.
You need not know Delphi to answer my question; just tell me what \0 is as a byte. That would work too. :)
'\0' equals 0. It's a relic from C, which doesn't have any string type at all and uses char arrays instead. The null character is used to mark the end of a string; not a very wise decision in retrospect - most other string implementations use a dedicated counter variable somewhere, which makes finding the end of a string O(1) instead of C's O(n).
*asmcmd=='\0' is just a convoluted way of checking length(asmcmd) == 0 or asmcmd.is_empty() in a hypothetical language.
Strictly it is an escape sequence for the character with the octal value zero (which is of course also zero in any base).
Although you can write a backslash followed by up to three octal digits to specify any character code (for example '\040' is a space character in ASCII encoding), you would seldom have cause to do so. '\0' is idiomatic for specifying a NUL character (because you cannot type such a character from the keyboard or display it in your editor).
You could equally specify '\x0', which is a NUL character expressed in hexadecimal.
The NUL character is used in C and C++ to terminate a string stored in a character array. This representation is used for literal string constants and by convention for strings that are manipulated by the <cstring>/<string.h> library. In C++ the std::string class can be used instead.
Note that in C++ a character constant such as '\0' or 'a' has type char. In C, for perhaps obscure reasons, it has type int.
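These claims can be checked at compile time in C++. The sketch below also mirrors the condition from the question in a helper whose name, at_end_or_comment, is made up for the example; the '\040' check assumes an ASCII-compatible character set.

```cpp
#include <cassert>

// '\0' is the character with value zero, however it is spelled.
static_assert('\0' == 0, "octal escape for NUL has value 0");
static_assert('\x0' == '\0', "hexadecimal spelling of the same character");
static_assert('\040' == ' ', "octal 40 is a space (ASCII assumption)");

// In C++ a character literal has type char, so its size is 1.
static_assert(sizeof('\0') == sizeof(char), "character literal is a char in C++");

// Hypothetical helper mirroring the condition from the question:
// true if the text is empty or starts with a ';'.
bool at_end_or_comment(const char* asmcmd) {
    return *asmcmd == '\0' || *asmcmd == ';';
}
```

So a translation only needs to compare the first byte against the numeric value 0 and against ';'.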
That is the char for null or char value 0. It is used at the end of the string.