How do I compare base36 values in C++? - c++

As we all know, with C-style string comparisons, the value is dependent on the ASCII value of each character and just uses the strcmp function to compare. I'm confused that what the std::string compare depends on?
Although I have searched Google, I still didn't find the answer.
In addition, if strings are all base36 strings and they are all in lower case, could I compare their values by strings directly? Or I should convert them as a long variable using the strtol function? Which method is better?

Your outset "As we all know, with C-style string comparisons, the value is dependent on the ASCII value of each character..." is unfortunately wrong already. With e.g. UTF-8 strings and various forms of collation, that's simply untrue.
Then "...and just uses the strcmp function to compare." is also wrong, because C-style strings don't have an inherent way to compare but multiple ways that also depend on e.g. the encoding and locale. You could use strcmp() for bytewise equality comparisons though, but that won't always give you expected results.
To answer your question what std::string uses, that's simple. std::string is a specialization of the std::basic_string template and it delegates comparisons to its char_traits template parameter. This parameter typically uses memcmp(). It can not possibly use strcmp(), because other than a C-style string, std::string can include null chars, but strcmp() would stop at those.

std::string compare depends upon 'ASCII values' in exactly the same way that strcmp does.
For base36 comparisons, simple string comparison (either strcmp or std::string) doesn't work because "00123" and "123" are equal when representing base36 integers but they compare differently as strings. Neither does strtol work very well because of integer overflow. Instead you should probably write your own comparison routine that removes leading zeros, then compares length and finally for strings of equal length does a string comparison.

Related

What is the proper function for comparing two C-style strings?

So I have a dilemma. I need to compare two C-style strings and I searched for the functions that would be the most appropiate:
memcmp //Compare two blocks of memory (function)
strcmp //Compare two strings (function )
strcoll //Compare two strings using locale (function)
strncmp //Compare characters of two strings (function)
strxfrm //Transform string using locale (function)
The first one I think is for addresses, so the idea is out.
The second one sounds like the best choice to me, but I wanna hear feedback anyway.
The other three leave me clueless.
For general string comparisons, strcmp is the appropriate function. You should use strncmp to only compare some number of characters from a string (for example, a prefix), and memcmp to compare blocks of memory.
That said, since you're using C++, you should avoid this altogether and use the std::string class, which is much easier to use and generally safer than C-style strings. You can compare two std::strings for equality easily by just using the == operator.
Hope this helps!
Both memcmp and strcmp will work fine. To use the former, you'll need to know the length of the shorter string in advance.

Templated string class use of strcmp, strcpy and strlen

I overheard sometime ago a discussion about how when creating a templated string class that you should not use strcmp, strcpy and strlen for a templated string class that can make use of UTF8 and UTF16. From what I recall, you are suppose to use functions from algorithm.h, however, I do not remember how the implementation is, or why it is so. Could someone please explain what functions to use instead, how to use them and why?
The example of the templated string class would be something such as
String<UTF8> utf8String;
String<UTF16> utf16String;
This is where UTF8 will be a unsigned char and UTF16 is an unsigned short.
First off, C++ has no need of additional string classes. There are probably already hundreds or thousands too many string classes that have been developed, and yours won't improve the situation. Unless you're doing this purely for your edification, you should think long and hard and then decide not to write a new one.
You can use std::basic_string<char> to hold UTF-8 code unit sequences, std::basic_string<char16_t> to hold UTF-16 code unit sequences, std::basic_string<char32_t> to hold UTF-32 code unit sequences, etc. C++ even offers short, handy names for these types: string, u16string, and u32string. basic_string already solves the problem you're asking about here by offering member functions for copying, comparing, and getting the length of the string that work for any code unit you template it with.
I can't think of any good reason for new code that's not interfacing with legacy code to use anything else as its canonical storage type for strings. Even if you do interface with legacy code that uses something else, if the surface area of that interface isn't large you should probably still use one of the standard types and not anything else, and of course if you're interfacing with legacy code you'll be using that legacy type anyway, not writing your own new type.
With that said, the reason you can't use strcmp, strcpy, and strlen for your templated string type is that they all operate on null terminated byte sequences. If your code unit is larger than one byte then there may be bytes that are zero before the actual terminating null code unit (assuming you use null termination at all, which you probably shouldn't). Consider the bytes of this UTF-16 representation of the string "Hello" (on a little endian machine).
48 00 65 00 6c 00 6c 00 6f 00
Since UTF-16 uses 16 bit code units, the character 'H' ends up stored as the two bytes 48 00. A function operating on the above sequence of bytes by assuming the first null byte is the end would assume that the second half of the first character marks the end of the whole string. This clearly will not work.
So, strcmp, strcpy, and strlen are all specialized versions of algorithms that can be implemented more generally. Since they only work with byte sequences, and you need to work with code unit sequences where the code unit may be larger than a byte, you need need generic algorithms that can work with any code unit. The standard library offers has lots of generic algorithms to offer you. Here are my suggestions for replacing these str* functions.
strcmp compares two sequences of code units and returns 0 if the two sequences are equal, positive if the first is lexicographically less than the second, and negative otherwise. The standard library contains the generic algorithm lexicographical_compare which does nearly the same thing, except that it returns true if the first sequences is lexicographically less than the second and false otherwise.
strcpy copies a sequences of code units. You can use the standard library's copy algorithm instead.
strlen takes a pointer to a code unit and counts the number of code units before it finds a null value. If you need this function as opposed to one that just tells you the number of code units in the string, you can implement it with the algorithm find by passing the null value as the value to be found. If instead you want to find the actual length of the sequence, your class should just offer a size method that directly accesses whatever method your class uses internally to store the size.
Unlike the str* functions, the algorithms I've suggested take two iterators to demarcate code unit sequences; one pointing to the first element in the sequence, and one pointing to the position after the final element of the sequence. The str* functions only take a pointer to the first element and then assume the sequence continues until the first zero valued code unit it finds. When you're implementing your own templated string class it's best to move away from the explicit null termination convention as well, and just offer an end() method that provides the correct end point for your string.
The reason you can't use strcmp, strcpy, or strlen is that they operate on strings whose length is indicate by a terminating zero byte. Since your strings may contain zero bytes inside them, you can't use these functions.
I would just code exactly what you want. What you want depends on what you're trying to do.
In UTF16, you may see bytes that are equal to '\0' in the middle of the string; strcmp, strcpy, and strlen will return incorrect results for strings like that, because they operate under the assumption that strings are zero-terminated.
You can use copy, equal, and distance from the STL to copy, compare, and calculate length based on template-based iterators.

strcmp or string::compare?

I want to compare two strings. Is it possible with strcmp? (I tried and it does not seem to work). Is string::compare a solution?
Other than this, is there a way to compare a string to a char?
Thanks for the early comments. I was coding in C++ and yes it was std::string like some of you mentioned.
I didn't post the code because I wanted to learn the general knowledge and it is a pretty long code, so it was irrelevant for the question.
I think I learned the difference between C++ and C, thanks for pointing that out. And I will try to use overloaded operators now. And by the way string::compare worked too.
For C++, use std::string and compare using string::compare.
For C use strcmp. If your (i meant your programs) strings (for some weird reason) aren't nul terminated, use strncmp instead.
But why would someone not use something as simple as == for std::string
?
Assuming you mean std::string, why not use the overloaded operators: str1 == str2, str1 < str2?
See std::basic_string::compare and std::basic_string operators reference (in particular, there exists operator==, operator!=, operator<, etc.). What else do you need?
When using C++ use C++ functions viz. string::compare. When using C and you are forced to use char* for string, use strcmp
By your question, "is there a way to compare a string to a char?" do you mean "How do I find out if a particular char is contained in a string?" If so, the the C-library function:
char *strchr(const char *s, int c);
will do it for you.
-- pete
std::string can contain (and compare!) embedded null characters.
are*comp(...) will compare c-style strings, comparing up to the first null character (or the specified max nr of bytes/characters)
string::compare is actually implemented as a template basic_string so you can expect it to work for other types such as wstring
On the unclear phrase to "compare a string to a char" you can compare the char to *string.begin() or lookup the first occurrence (string::find_first_of and string::find_first_not_of)
Disclaimer: typed on my HTC, typos reserved :)

Why isn't ("Maya" == "Maya") true in C++?

Any idea why I get "Maya is not Maya" as a result of this code?
if ("Maya" == "Maya")
printf("Maya is Maya \n");
else
printf("Maya is not Maya \n");
Because you are actually comparing two pointers - use e.g. one of the following instead:
if (std::string("Maya") == "Maya") { /* ... */ }
if (std::strcmp("Maya", "Maya") == 0) { /* ... */ }
This is because C++03, §2.13.4 says:
An ordinary string literal has type “array of n const char”
... and in your case a conversion to pointer applies.
See also this question on why you can't provide an overload for == for this case.
You are not comparing strings, you are comparing pointer address equality.
To be more explicit -
"foo baz bar" implicitly defines an anonymous const char[m]. It is implementation-defined as to whether identical anonymous const char[m] will point to the same location in memory(a concept referred to as interning).
The function you want - in C - is strmp(char*, char*), which returns 0 on equality.
Or, in C++, what you might do is
#include <string>
std::string s1 = "foo"
std::string s2 = "bar"
and then compare s1 vs. s2 with the == operator, which is defined in an intuitive fashion for strings.
The output of your program is implementation-defined.
A string literal has the type const char[N] (that is, it's an array). Whether or not each string literal in your program is represented by a unique array is implementation-defined. (§2.13.4/2)
When you do the comparison, the arrays decay into pointers (to the first element), and you do a pointer comparison. If the compiler decides to store both string literals as the same array, the pointers compare true; if they each have their own storage, they compare false.
To compare string's, use std::strcmp(), like this:
if (std::strcmp("Maya", "Maya") == 0) // same
Typically you'd use the standard string class, std::string. It defines operator==. You'd need to make one of your literals a std::string to use that operator:
if (std::string("Maya") == "Maya") // same
What you are doing is comparing the address of one string with the address of another. Depending on the compiler and its settings, sometimes the identical literal strings will have the same address, and sometimes they won't (as apparently you found).
Any idea why i get "Maya is not Maya" as a result
Because in C, and thus in C++, string literals are of type const char[], which is implicitly converted to const char*, a pointer to the first character, when you try to compare them. And pointer comparison is address comparison.
Whether the two string literals compare equal or not depends whether your compiler (using your current settings) pools string literals. It is allowed to do that, but it doesn't need to. .
To compare the strings in C, use strcmp() from the <string.h> header. (It's std::strcmp() from <cstring>in C++.)
To do so in C++, the easiest is to turn one of them into a std::string (from the <string> header), which comes with all comparison operators, including ==:
#include <string>
// ...
if (std::string("Maya") == "Maya")
std::cout << "Maya is Maya\n";
else
std::cout << "Maya is not Maya\n";
C and C++ do this comparison via pointer comparison; looks like your compiler is creating separate resource instances for the strings "Maya" and "Maya" (probably due to having an optimization turned off).
My compiler says they are the same ;-)
even worse, my compiler is certainly broken. This very basic equation:
printf("23 - 523 = %d\n","23"-"523");
produces:
23 - 523 = 1
Indeed, "because your compiler, in this instance, isn't using string pooling," is the technically correct, yet not particularly helpful answer :)
This is one of the many reasons the std::string class in the Standard Template Library now exists to replace this earlier kind of string when you want to do anything useful with strings in C++, and is a problem pretty much everyone who's ever learned C or C++ stumbles over fairly early on in their studies.
Let me explain.
Basically, back in the days of C, all strings worked like this. A string is just a bunch of characters in memory. A string you embed in your C source code gets translated into a bunch of bytes representing that string in the running machine code when your program executes.
The crucial part here is that a good old-fashioned C-style "string" is an array of characters in memory. That block of memory is often referred to by means of a pointer -- the address of the start of the block of memory. Generally, when you're referring to a "string" in C, you're referring to that block of memory, or a pointer to it. C doesn't have a string type per se; strings are just a bunch of chars in a row.
When you write this in your code:
"wibble"
Then the compiler provides a block of memory that contains the bytes representing the characters 'w', 'i', 'b', 'b', 'l', 'e', and '\0' in that order (the compiler adds a zero byte at the end, a "null terminator". In C a standard string is a null-terminated string: a block of characters starting at a given memory address and continuing until the next zero byte.)
And when you start comparing expressions like that, what happens is this:
if ("Maya" == "Maya")
At the point of this comparison, the compiler -- in your case, specifically; see my explanation of string pooling at the end -- has created two separate blocks of memory, to hold two different sets of characters that are both set to 'M', 'a', 'y', 'a', '\0'.
When the compiler sees a string in quotes like this, "under the hood" it builds an array of characters, and the string itself, "Maya", acts as the name of the array of characters. Because the names of arrays are effectively pointers, pointing at the first character of the array, the type of the expression "Maya" is pointer to char.
When you compare these two expressions using "==", what you're actually comparing is the pointers, the memory addresses of the beginning of these two different blocks of memory. Which is why the comparison is false, in your particular case, with your particular compiler.
If you want to compare two good old-fashioned C strings, you should use the strcmp() function. This will examine the contents of the memory pointed two by both "strings" (which, as I've explained, are just pointers to a block of memory) and go through the bytes, comparing them one-by-one, and tell you whether they're really the same.
Now, as I've said, this is the kind of slightly surprising result that's been biting C beginners on the arse since the days of yore. And that's one of the reasons the language evolved over time. Now, in C++, there is a std::string class, that will hold strings, and will work as you expect. The "==" operator for std::string will actually compare the contents of two std::strings.
By default, though, C++ is designed to be backwards-compatible with C, i.e. a C program will generally compile and work under a C++ compiler the same way it does in a C compiler, and that means that old-fashioned strings, "things like this in your code", will still end up as pointers to bits of memory that will give non-obvious results to the beginner when you start comparing them.
Oh, and that "string pooling" I mentioned at the beginning? That's where some more complexity might creep in. A smart compiler, to be efficient with its memory, may well spot that in your case, the strings are the same and can't be changed, and therefore only allocate one block of memory, with both of your names, "Maya", pointing at it. At which point, comparing the "strings" -- the pointers -- will tell you that they are, in fact, equal. But more by luck than design!
This "string pooling" behaviour will change from compiler to compiler, and often will differ between debug and release modes of the same compiler, as the release mode often includes optimisations like this, which will make the output code more compact (it only has to have one block of memory with "Maya" in, not two, so it's saved five -- remember that null terminator! -- bytes in the object code.) And that's the kind of behaviour that can drive a person insane if they don't know what's going on :)
If nothing else, this answer might give you a lot of search terms for the thousands of articles that are out there on the web already, trying to explain this. It's a bit painful, and everyone goes through it. If you can get your head around pointers, you'll be a much better C or C++ programmer in the long run, whether you choose to use std::string instead or not!

Is string::compare reliable to determine alphabetical order?

Simply put, if the input is always in the same case (here, lower case), and if the characters are always ASCII, can one use string::compare to determine reliably the alphabetical order of two strings?
Thus, with stringA.compare(stringB) if the result is 0, they are the same, if it is negative, stringA comes before stringB alphabetically , and if it is positive, stringA comes after?
According to the docs at cplusplus.com,
The member function returns 0 if all
the characters in the compared
contents compare equal, a negative
value if the first character that does
not match compares to less in the
object than in the comparing string,
and a positive value in the opposite
case.
So it will sort strings in ASCII order, which will be alphabetical for English strings (with no diacritical marks or other extended characters) of the same case.
Yes, as long as all of the characters in both strings are of the same case, and as long as both strings consist only of letters, this will work.
compare is a member function, though, so you would call it like so:
stringA.compare(stringB);
In C++, string is the instantiation of the template class basic_string with the default parameters: basic_string<char, char_traits<char>, allocator<char> >. The compare function in the basic_string template will use the char_traits<TChar>::compare function to determine the result value.
For std::string the ordering will be that of the default character code for the implementation (compiler) and that is usually ASCII order. If you require a different ordering (say you want to consider { a, á, à, â } as equivalent), you can instantiate a basic_string with your own char_traits<> implementation. providing a different compare function pointer.
yes,
The member function returns 0 if all
the characters in the compared
contents compare equal, a negative
value if the first character that does
not match compares to less in the
object than in the comparing string,
and a positive value in the opposite
case.
For string objects, the result of a
character comparison depends only on
its character code (i.e., its ASCII
code), so the result has some limited
alphabetical or numerical ordering
meaning.
The specifications for the C and C++ language guarantee for lexical ordering, 'A' < 'B' < 'C' ... < 'Z'. The same is true for lowercase.
The ordering for text digits is also guaranteed: '0' < ... < '9'.
When working with multiple languages, many people create an array of characters. The array is searched for the character. Instead of comparing characters, the indices are compared.