C++ string sort like a human being? - c++

I would like to sort alphanumeric strings the way a human being would sort them. I.e., "A2" comes before "A10", and "a" certainly comes before "Z"! Is there any way to do with without writing a mini-parser? Ideally it would also put "A1B1" before "A1B10". I see the question "Natural (human alpha-numeric) sort in Microsoft SQL 2005" with a possible answer, but it uses various library functions, as does "Sorting Strings for Humans with IComparer".
Below is a test case that currently fails:
#include <set>
#include <iterator>
#include <iostream>
#include <vector>
#include <cassert>
template <typename T>
struct LexicographicSort {
inline bool operator() (const T& lhs, const T& rhs) const{
std::ostringstream s1,s2;
s1 << toLower(lhs); s2 << toLower(rhs);
bool less = s1.str() < s2.str();
//Answer: bool less = doj::alphanum_less<std::string>()(s1.str(), s2.str());
std::cout<<s1.str()<<" "<<s2.str()<<" "<<less<<"\n";
return less;
}
inline std::string toLower(const std::string& str) const {
std::string newString("");
for (std::string::const_iterator charIt = str.begin();
charIt!=str.end();++charIt) {
newString.push_back(std::tolower(*charIt));
}
return newString;
}
};
int main(void) {
const std::string reference[5] = {"ab","B","c1","c2","c10"};
std::vector<std::string> referenceStrings(&(reference[0]), &(reference[5]));
//Insert in reverse order so we know they get sorted
std::set<std::string,LexicographicSort<std::string> > strings(referenceStrings.rbegin(), referenceStrings.rend());
std::cout<<"Items:\n";
std::copy(strings.begin(), strings.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
std::vector<std::string> sortedStrings(strings.begin(), strings.end());
assert(sortedStrings == referenceStrings);
}

Is there any way to do with without writing a mini-parser?
Let someone else do that?
I'm using this implementation: http://www.davekoelle.com/alphanum.html, I've modified it to support wchar_t, too.

It really depends what you mean by "parser." If you want to avoid writing a parser, I would think you should avail yourself of library functions.
Treat the string as a sequence of subsequences which are uniformly alphabetic, numeric, or "other."
Get the next alphanumeric sequence of each string using isalnum and backtrack-checking for + or - if it is a number. Use strtold in-place to find the end of a numeric subsequence.
If one is numeric and one is alphabetic, the string with the numeric subsequence comes first.
If one string has run out of characters, it comes first.
Use strcoll to compare alphabetic subsequences within the current locale.
Use strtold to compare numeric subsequences within the current locale.
Repeat until finished with one or both strings.
Break ties with strcmp.
This algorithm has something of a weakness in comparing numeric strings which exceed the precision of long double.

Is there any way to do it without writing a mini parser? I would think the answer is no. But writing a parser isn't that tough. I had to do this a while ago to sort our company's stock numbers. Basically just scan the number and turn it into an array. Check the "type" of every character: alpha, number, maybe you have others you need to deal with special. Like I had to treat hyphens special because we wanted A-B-C to sort before AB-A. Then start peeling off characters. As long as they are the same type as the first character, they go into the same bucket. Once the type changes, you start putting them in a different bucket. Then you also need a compare function that compares bucket-by-bucket. When both buckets are alpha, you just do a normal alpha compare. When both are digits, convert both to integer and do an integer compare, or pad the shorter to the length of the longer or something equivalent. When they're different types, you'll need a rule for how those compare, like does A-A come before or after A-1 ?
It's not a trivial job and you have to come up with rules for all the odd cases that may arise, but I would think you could get it together in a few hours of work.

Without any parsing, there's no way to compare human written numbers (high values first with leading zeroes stripped) and normal characters as part of the same string.
The parsing doesn't need to be terribly complex though. A simple hash table to deal with things like case sensitivity and stripping special characters ('A'='a'=1,'B'='b'='2,... or 'A'=1,'a'=2,'B'=3,..., '-'=0(strip)), remap your string to an array of the hashed values, then truncate number cases (if a number is encountered and the last character was a number, multiply the last number by ten and add the current value to it).
From there, sort as normal.

Related

Find All Palindrome Substrings in a String by Rearranging Characters

For fun and practice, I have tried to solve the following problem (using C++): Given a string, return all the palindromes that can be obtained by rearranging its characters.
I've come up with an algorithm that doesn't work completely. Sometimes, it finds all the palindromes, but other times it finds some but not all.
It works by swapping each adjacent pair of characters N times, where N is the length of the input string. Here is the code:
std::vector<std::string> palindromeGen(std::string charactersSet) {
std::vector<std::string> pals;
for (const auto &c : charactersSet) {
for (auto i = 0, j = 1; i < charactersSet.length() - 1; ++i, ++j) {
std::swap(charactersSet[i], charactersSet[j]);
if (isPalandrome(charactersSet)) {
if (std::find(pals.begin(), pals.end(), charactersSet) == pals.end()) {
// if palindrome is unique
pals.push_back(charactersSet);
}
}
}
}
return pals;
}
What's the fault in this algorithm? I'm mostly concerned about the functionality of the algorithm, rather than the efficiency. Although I'll appreciate tips about efficiency as well. Thanks.
This probably fits a bit better in Code Review but here goes:
Logic Error
You change charactersSet while iterating over it, meaning that your iterator breaks. You need to make a copy of characterSet, and iterate over that.
Things to Change
Since pals holds only unique values, it should be a std::set instead of a std::vector. This will simplify some things. Also, your isPalandrome method spells palindrome wrong!
Alternative Approach
Since palindromes can only take a certain form, consider sorting the input string first, so that you can have a list of characters with an even number of occurrences, and a list of characters with an odd number. You can only have one character with an odd number of occurrences (and this only works for an odd length input). This should let you discard a bunch of possibilities. Then you can work through the different possible combinations of one half of the palindrome (since you can build one half from the other).
Here is another implementation that leverages std::next_permutation:
#include <string>
#include <algorithm>
#include <set>
std::set<std::string> palindromeGen(std::string charactersSet)
{
std::set<std::string> pals;
std::sort(charactersSet.begin(), charactersSet.end());
do
{
// check if the string is the same backwards as forwards
if ( isPalindrome(charactersSet))
pals.insert(charactersSet);
} while (std::next_permutation(charactersSet.begin(), charactersSet.end()));
return pals;
}
We first sort the original string. This is required for std::next_permutation to work correctly. A call to the isPalindrome function with a permutation of the string is done in a loop. Then if the string is a palindrome, it's stored in the set. The subsequent call to std::next_permutation just rearranges the string.
Here is a Live Example. Granted, it uses a reversed copy of the string as the "isPalindrome" function (probably not efficient), but you should get the idea.

How can I compare if a char is higher or lower in alphabetical order than another?

Pretty much as the title. I'm writing a linked list and I need a function to sort the list alphabetically, and I'm pretty stumped. Not sure how this has never come up before, but I have no idea how to do it other than to create my own function listing the entire alphabet and comparing positions of letters from scratch.
Is there any easy way to do this?
Edit for clarity:
I've got a linear linked list of class objects, each class object has a char name, and I'm writing a function to compare the name of each object in the list, to find the highest object alphabetically, and then find the next object down alphabetically, etc, linking them together as I go. I already have a function that does this for an int field, so I just need to rewrite it to compare inequalities between alphabetical characters where a is largest and z is smallest.
In hindsight that was probably a lot more relevant than I thought.
I think a couple of the answers I've gotten already should work so I'll pop back and select a best answer once I've gotten it working.
I'm also working with g++ and unity.
I think that in general case the best approach will be to use std::char_traits:
char a, b;
std::cin >> a >> b;
std::locale loc;
a = std::tolower(a, loc);
b = std::tolower(b, loc);
std::cout << std::char_traits::compare(&a, &b, 1u);
But in many common situations you can simply compare chars as you do it with other integer types.
My guess is that your list contains char* as data (it better contain std::strings as data). If the list is composed of the latter, you can simply sort using the overloaded std::string's operator<, like
return str1 < str2; // true if `str1` is lexicographically before `str2`
If your list is made of C-like null-terminated strings, then you can sort them using std::strcmp like
return std::strcmp(s1, s2);
or use the std::char_traits::compare (as mentioned by #Anton) like
return std::char_traits<char>::compare(s1, s2, std::min(std::strlen(s1), std::strlen(s2)));
or sort them via temporary std::strings (most expensive), like
return std::string(s1) < std::string(s2); // here s1 and s2 are C-strings
If your list simply contains characters, then, as mentioned in the comments,
return c1 < c2; // returns true whenever c1 is before c2 in the alphabet
If you don't care about uppercase/lowercase, then you can use std::toupper to transform the character into uppercase, then always compare the uppercase.
#include <stdio.h>
#include <ctype.h>
void main(void) {
char a = 'X', b = 'M';
printf("%i\n", a < b);
printf("%i\n", b < a);
printf("%i\n", 'a' < 'B');
printf("%i\n", tolower('a') < tolower('B'));
}
prints out:
0
1
0
1
chars are still numbers, and can be compared as such. The upper case letters and lower case letters are all in order, with the upper case letters before the lower. (Such that 'Z' < 'a'.) See an ASCII table.
As you can see from this ASCII table, all of the alphanumeric characters appear in the correct alphabetical order, regarding their actual values:
"Is there any easy way to do this?"
So yes, comparing the character values will provide to have them sorted in alphabetical order.
would something like below suffice ? convert everything to upper first.
class compareLessThanChar{
public:
bool operator()(const char a, const char b)
{ return toupper(a) < toupper(b); }
}
std::multiset<char, compareLessThanChar> sortedContainer;

Word Frequency Statistics

In an pre-interview, I am faced with a question like this:
Given a string consists of words separated by a single white space, print out the words in descending order sorted by the number of times they appear in the string.
For example an input string of “a b b” would generate the following output:
b : 2
a : 1
Firstly, I'd say it is not so clear that whether the input string is made up of single-letter words or multiple-letter words. If the former is the case, it could be simple.
Here is my thought:
int c[26] = {0};
char *pIn = strIn;
while (*pIn != 0 && *pIn != ' ')
{
++c[*pIn];
++pIn;
}
/* how to sort the array c[26] and remember the original index? */
I can get the statistics of the frequecy of every single-letter word in the input string, and I can get it sorted (using QuickSort or whatever). But after the count array is sorted, how to get the single-letter word associated with the count so that I can print them out in pair later?
If the input string is made of of multiple-letter word, I plan to use a map<const char *, int> to track the frequency. But again, how to sort the map's key-value pair?
The question is in C or C++, and any suggestion is welcome.
Thanks!
I would use a std::map<std::string, int> to store the words and their counts. Then I would use something this to get the words:
while(std::cin >> word) {
// increment map's count for that word
}
finally, you just need to figure out how to print them in order of frequency, I'll leave that as an exercise for you.
You're definitely wrong in assuming that you need only 26 options, 'cause your employer will want to allow multiple-character words as well (and maybe even numbers?).
This means you're going to need an array with a variable length. I strongly recommend using a vector or, even better, a map.
To find the character sequences in the string, find your current position (start at 0) and the position of the next space. Then that's the word. Set the current position to the space and do it again. Keep repeating this until you're at the end.
By using the map you'll already have the word/count available.
If the job you're applying for requires university skills, I strongly recommend optimizing the map by adding some kind of hashing function. However, judging by the difficulty of the question I assume that that is not the case.
Taking the C-language case:
I like brute-force, straightforward algos so I would do it in this way:
Tokenize the input string to give an unsorted array of words. I'll have to actually, physically move each word (because each is of variable length); and I think I'll need an array of char*, which I'll use as the arg to qsort( ).
qsort( ) (descending) that array of words. (In the COMPAR function of qsort(), pretend that bigger words are smaller words so that the array acquires descending sort order.)
3.a. Go through the now-sorted array, looking for subarrays of identical words. The end of a subarray, and the beginning of the next, is signalled by the first non-identical word I see.
3.b. When I get to the end of a subarray (or to the end of the sorted array), I know (1) the word and (2) the number of identical words in the subarray.
EDIT new step 4: Save, in another array (call it array2), a char* to a word in the subarry and the count of identical words in the subarray.
When no more words in sorted array, I'm done. it's time to print.
qsort( ) array2 by word frequency.
go through array2, printing each word and its frequency.
I'M DONE! Let's go to lunch.
All the answers prior to mine did not give really an answer.
Let us think on a potential solution.
There is a more or less standard approach for counting something in a container.
We can use an associative container like a std::map or a std::unordered_map. And here we associate a "key", in this case the word, to a count, with a value, in this case the count of the specific word.
And luckily the maps have a very nice index operator[]. This will look for the given key and, if found, return a reference to the value. If not found, then it will create a new entry with the key and return a reference to the new entry. So, in both cases, we will get a reference to the value used for counting. And then we can simply write:
std::unordered_map<char,int> counter{};
counter[word]++;
And that looks really intuitive.
After this operation, you have already the frequency table. Either sorted by the key (the word), by using a std::map or unsorted, but faster accessible with a std::unordered_map.
Now you want to sort according to the frequency/count. Unfortunately this is not possible with maps.
Therefore we need to use a second container, like a ```std::vector`````which we then can sort unsing std::sort for any given predicate, or, we can copy the values into a container, like a std::multiset that implicitely orders its elements.
For getting out the words of a std::string we simply use a std::istringstream and the standard extraction operator >>. No big deal at all.
And because writing all this long names for the std containers, we create alias names, with the using keyword.
After all this, we now write ultra compact code and fulfill the task with just a few lines of code:
#include <iostream>
#include <string>
#include <sstream>
#include <utility>
#include <set>
#include <unordered_map>
#include <type_traits>
#include <iomanip>
// ------------------------------------------------------------
// Create aliases. Save typing work and make code more readable
using Pair = std::pair<std::string, unsigned int>;
// Standard approach for counter
using Counter = std::unordered_map<Pair::first_type, Pair::second_type>;
// Sorted values will be stored in a multiset
struct Comp { bool operator ()(const Pair& p1, const Pair& p2) const { return (p1.second == p2.second) ? p1.first<p2.first : p1.second>p2.second; } };
using Rank = std::multiset<Pair, Comp>;
// ------------------------------------------------------------
std::istringstream text{ " 4444 55555 1 22 4444 333 55555 333 333 4444 4444 55555 55555 55555 22 "};
int main() {
Counter counter;
// Count
for (std::string word{}; text >> word; counter[word]++);
// Sort
Rank rank(counter.begin(), counter.end());
// Output
for (const auto& [word, count] : rank) std::cout << std::setw(15) << word << " : " << count << '\n';
}

Is it possible to construct an "infinite" string?

Is there any real sequence of characters that always compares greater than any other string?
My first thought was that a string constructed like so:
std::basic_string<T>(std::string::max_size(), std::numeric_limits<T>::max())
Would do the trick, provided that the fact that it would almost definitely fail to work isn't such a big issue. So I presume this kind of hackery could only be accomplished in Unicode, if it can be accomplished at all. I've never heard of anything that would indicate that it really is possible, but neither have I heard tell that it isn't, and I'm curious.
Any thoughts on how to achieve this without a possibly_infinite<basic_string<T>>?
I assume that you compare strings using their character value. I.e. one character acts like a digit, a longer string is greater than shorter string, etc.
s there any real sequence of characters that always compares greater than any other string?
No, because:
Let's assume there is a string s that is always greater than any other string.
If you make a copy of s, the copy will be equal to s. Equal means "not greater". Therefore there can be a string that is not greater than s.
If you make a copy of s and append one character at the end, it will be greater than original s. Therefore there can be a string that is greater than s.
Which means, it is not possible to make s.
I.e.
A string s that is always greater than any other string cannot exist. A copy of s (copy == other string) will be equal to s, and "equal" means "not greater".
A string s that is always greater or equal to any other string, can exist if a maximum string size has a reasonable limit. Without a size limit, it will be possible to take a copy of s, append one character at the end, and get a string that is greater than s.
In my opinion, the proper solution would be to introduce some kind of special string object that represents infinitely "large" string, and write a comparison operator for that object and standard string. Also, in this case you may need custom string class.
It is possible to make string that is always less or equal to any other string. Zero length string will be exactly that - always smaller than anything else, and equal to other zero-length strings.
Or you could write counter-intuitive comparison routine where shorter string is greater than longer string, but in this case next code maintainer will hate you, so it is not a good idea.
Not sure why would you ever need something like that, though.
You probably need a custom comparator, for which you define a magic "infinite string" value and which will always treat that value as greater than any other.
Unicode solves a lot of problems, but not that one. Unicode is just a different encoding for a character, 1, 2 or 4 bytes, they are still stored in a plain array. You can use infinite strings when you find a machine with infinite memory.
Yes. How you do it, I have no idea :)
You should try to state what you intend to achieve and what your requirements are. In particular, does it have to be a string? is there any limitation on the domain? do they need to be compared with <?
You can use a non-string type:
struct infinite_string {};
bool operator<( std::string const & , infinite_string const & ) {
return true;
}
bool operator<( infinite_string const &, std::string const & ) {
return false;
}
If you can use std::lexicographical_compare and you don't need to store it as a string, then you can write an infinite iterator:
template <typename CharT>
struct infinite_iterator
{
CharT operator*() { return std::numeric_limits<CharT>::max(); }
infinite_iterator& operator++() { return *this; }
bool operator<( const infinite_iterator& ) { return true; }
// all other stuff to make it proper
};
assert( std::lexicographical_compare( str.begin(), str.end(),
infinite_iterator, infinite_iterator ) );
If you can use any other comparisson functor and your domain has some invalid you can use that to your advantage:
namespace detail {
// assume that "\0\0\0\0\0" is not valid in your domain
std::string const infinite( 5, 0 );
}
bool compare( std::string const & lhs, std::string const & rhs ) {
if ( lhs == detail::infinite ) return false;
if ( rhs == detail::infinite ) return true;
return lhs < rhs;
}
if you need an artificial bound within a space of objects that isn't bounded, the standard trick is to add an extra element and define a new comparison operator that enforces your property.
Or implement lazy strings.
Well if you were to dynamically construct a string of equal length as the one that you are comparing to and fill it with the highest ASCII code available (7F for normal ASCII or FF for extended) you would be guaranteed that this string would compare equal to or greater than the one you compare it to.
What's your comparator?
Based on that, you can construct something that is the 'top' of your lattice.

How to get the next prefix in C++?

Given a sequence (for example a string "Xa"), I want to get the next prefix in order lexicographic (i.e "Xb"). The next of "aZ" should be "b"
A motivating use case where this function is useful is described here.
As I don't want to reinvent the wheel, I'm wondering if there is any function in C++ STL or boost that can help to define this generic function easily?
If not, do you think that this function can be useful?
Notes
Even if the examples are strings, the function should work for any Sequence.
The lexicographic order should be a template parameter of the function.
From the answers I conclude that there is nothing on C++/Boost that can help to define this generic function easily and also that this function is too specific to be proposed for free. I will implement a generic next_prefix and after that I will request if you find it useful.
I have accepted the single answer that gives some hints on how to do that even if the proposed implementation is not generic.
I'm not sure I understand the semantics by which you wish the string to transform, but maybe something like the following can be a starting point for you. The code will increment the sequence, as if it was a sequence of digits representing a number.
template<typename Bi, typename I>
bool increment(Bi first, Bi last, I minval, I maxval)
{
if( last == first ) return false;
while( --last != first && *last == maxval ) *last = minval;
if( last == first && *last == maxval ) {
*last = minval;
return false;
}
++*last;
return true;
}
Maybe you wish to add an overload with a function object, or an overload or specialization for primitives. A couple of examples:
string s1("aaz");
increment(s1.begin(), s1.end(), 'a', 'z');
cout << s1 << endl; // aba
string s2("95");
do {
cout << s2 << ' '; // 95 96 97 98 99
} while( increment(s2.begin(), s2.end(), '0', '9') );
cout << endl;
That seem so specific that I can't see how it would get in STL or boost.
When you say the order is a template parameter, what are you envisaging will be passed? A comparator that takes two characters and returns bool?
If so, then that's a bit of a nightmare, because the only way to find "the least char greater than my current char" is to sort all the chars, find your current char in the result, and step forward one (or actually, if some chars might compare equal, use upper_bound with your current char to find the first greater char).
In practice, for any sane string collation you can define a "get the next char, or warn me if I gave you the last char" function more efficiently, and build your "get the next prefix" function on top of that. Hopefully, permitting an arbitrary order is more flexibility than you need.
Orderings are typically specified as a comparator, not as a sequence generator.
Lexicographical orderings in particular tend be only partial, for example, in case or diacritic insensitivity. Therefore your final product will be nondeterministic, or at best arbitrary. ("Always choose lowest numerical encoding"?)
In any case, if you accept a comparator as input, the only way to translate that to an increment operation would be to compare the current value against every other in the character space. Which could work, 127 values being so few (a comparator-sorted table would make short work of the problem), or could be impossibly slow, if you use any other kind of character.
The best way is likely to define the character ordering somehow, then define the rules from going from one character to two characters to three characters.
Use whatever sort function you wish to use over the complete list of characters that you want to include, then just use that as the ordering. Find the index of the current character, and you can easily find the previous and next characters. Only advance the right-most character, unless it's going to roll over, then advance the next character to the left.
In other words, reinventing the wheel is like 10 lines of Python. Probably less than 500 lines of C++. :)