How to get the next prefix in C++? - c++

Given a sequence (for example the string "Xa"), I want to get the next prefix in lexicographic order (here "Xb"). The next prefix of "aZ" should be "b".
A motivating use case where this function is useful is described here.
As I don't want to reinvent the wheel, I'm wondering if there is any function in C++ STL or boost that can help to define this generic function easily?
If not, do you think that this function can be useful?
Notes
Even if the examples are strings, the function should work for any Sequence.
The lexicographic order should be a template parameter of the function.
From the answers I conclude that there is nothing in C++/Boost that helps to define this generic function easily, and also that this function is too specific to be proposed for free. I will implement a generic next_prefix and then ask whether you find it useful.
I have accepted the single answer that gives some hints on how to do that even if the proposed implementation is not generic.

I'm not sure I understand the semantics by which you wish the string to transform, but maybe something like the following can be a starting point for you. The code will increment the sequence, as if it was a sequence of digits representing a number.
template<typename Bi, typename I>
bool increment(Bi first, Bi last, I minval, I maxval)
{
    if( last == first ) return false;
    while( --last != first && *last == maxval ) *last = minval;
    if( last == first && *last == maxval ) {
        *last = minval;
        return false;
    }
    ++*last;
    return true;
}
Maybe you wish to add an overload with a function object, or an overload or specialization for primitives. A couple of examples:
string s1("aaz");
increment(s1.begin(), s1.end(), 'a', 'z');
cout << s1 << endl; // aba
string s2("95");
do {
cout << s2 << ' '; // 95 96 97 98 99
} while( increment(s2.begin(), s2.end(), '0', '9') );
cout << endl;

That seems so specific that I can't see how it would get into the STL or Boost.

When you say the order is a template parameter, what are you envisaging will be passed? A comparator that takes two characters and returns bool?
If so, then that's a bit of a nightmare, because the only way to find "the least char greater than my current char" is to sort all the chars, find your current char in the result, and step forward one (or actually, if some chars might compare equal, use upper_bound with your current char to find the first greater char).
In practice, for any sane string collation you can define a "get the next char, or warn me if I gave you the last char" function more efficiently, and build your "get the next prefix" function on top of that. Hopefully, permitting an arbitrary order is more flexibility than you need.
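A minimal sketch of that layering, assuming C++17 and a lowercase 'a'..'z' alphabet with contiguous character codes. The names next_lowercase and next_prefix are illustrative only. Any character that has no successor, including one outside the alphabet, is dropped from the end of the prefix.

```cpp
#include <optional>
#include <string>

// Hypothetical "next char" step for the alphabet 'a'..'z':
// returns the successor, or nothing when c has no successor.
std::optional<char> next_lowercase(char c) {
    if (c >= 'a' && c < 'z') return static_cast<char>(c + 1);
    return std::nullopt;
}

// Drops trailing characters that have no successor, then advances the last
// remaining one. Returns nothing if no character can be advanced.
template <typename NextChar>
std::optional<std::string> next_prefix(std::string s, NextChar next_char) {
    while (!s.empty()) {
        if (auto n = next_char(s.back())) {
            s.back() = *n;
            return s;
        }
        s.pop_back();
    }
    return std::nullopt;
}
```

With this alphabet, next_prefix("xa", next_lowercase) gives "xb" and next_prefix("az", next_lowercase) gives "b". A string made only of 'z' characters has no next prefix.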

Orderings are typically specified as a comparator, not as a sequence generator.
Lexicographical orderings in particular tend to be only partial, for example in the case of case or diacritic insensitivity. Therefore your final product will be nondeterministic, or at best arbitrary. ("Always choose lowest numerical encoding"?)
In any case, if you accept a comparator as input, the only way to translate that to an increment operation would be to compare the current value against every other in the character space. Which could work, 127 values being so few (a comparator-sorted table would make short work of the problem), or could be impossibly slow, if you use any other kind of character.

The best way is likely to define the character ordering somehow, then define the rules for going from one character to two characters to three characters.
Use whatever sort function you wish to use over the complete list of characters that you want to include, then just use that as the ordering. Find the index of the current character, and you can easily find the previous and next characters. Only advance the right-most character, unless it's going to roll over, then advance the next character to the left.
In other words, reinventing the wheel is like 10 lines of Python. Probably less than 500 lines of C++. :)

Related

C++ code performance strings compare

I have an array of struct (arrBoards) which has some integer values, vector and a string type.
I want to compare if certain string in the struct is equal with entered parameter (string p1).
Which idea is faster: checking the input string for equality against every string element in the array, or first checking whether string.length() of the current array element is greater than 0 and only then comparing the strings?
if (p1.length())
{
    transform(p1.begin(), p1.end(), p1.begin(), ::tolower); //to lowercase
    for (int i=0; i<arrSize; i++) //check if string element already exists
        if ( rdPtr->arrBoards[i].sName == p1 )
        {
            /* some code */
            break;
        }
}
if (p1.length())
{
    transform(p1.begin(), p1.end(), p1.begin(), ::tolower); //to lowercase
    for (int i=0; i<arrSize; i++) //check if string element already exists
        if ( rdPtr->arrBoards[i].sName.length() ) //check length of the string in the element of the array
            if ( rdPtr->arrBoards[i].sName == p1 )
            {
                /* some code */
                break;
            }
}
I think the second idea is better because it doesn't need to compare the name every time, but I may be wrong because the extra if could slow the code down.
Thanks for the answers
I'm sure the comparison operator (==) of the string class is already optimized enough. Just use it.
operator==(...) returns a bool based on a short-circuit comparison
return __x.size() == __n && _Traits::compare(__x.data(), __s, __n) == 0;
It checks the size of the strings before calling compare(), so, there is no need for further optimization.
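A minimal sketch of the search with plain operator== and no extra length check. Board is a hypothetical stand-in for the struct in the question, and find_board is an illustrative name.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for the struct in the question.
struct Board {
    std::string sName;
};

// Lowercases a copy of the input, then looks for an exact match.
// Returns the index of the match, or -1 if there is none.
int find_board(const std::vector<Board>& boards, std::string name) {
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    auto it = std::find_if(boards.begin(), boards.end(),
                           [&](const Board& b) { return b.sName == name; });
    return it == boards.end() ? -1 : static_cast<int>(it - boards.begin());
}
```

Unlike the code in the question, this sketch does not skip an empty input string, so the caller would still guard against that if empty names can occur in the array.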
Always remember one of the principles of Software Engineering: KISS :P
What you want to do is play percentages.
Since the strings are highly likely to be different, you want to find that out as quickly as possible.
You're comparing length first, but don't assume length is cheap to compute, compared to whatever else you're doing.
Here's the kind of thing I've done (in C):
if (a[0]==b[0] && strcmp(a, b)==0)
so if the leading characters are different, it never gets to the string compare.
If the dataset is such that the leading characters are likely to be different, it saves a lot of time.
(strcmp also has this kind of optimization, but you still have to pay the price of setting up the arguments and getting in and out of the function. We're talking about small numbers of cycles here.)
If you do something like that, then you may find the loop iteration overhead is costing a significant fraction of time.
If so, you might consider unrolling it.
(The compiler might unroll it for you, but I wouldn't depend on it.)
Comparing a number is faster than comparing a string. Try comparing the strings' lengths before comparing the strings themselves.

How can I compare if a char is higher or lower in alphabetical order than another?

Pretty much as the title. I'm writing a linked list and I need a function to sort the list alphabetically, and I'm pretty stumped. Not sure how this has never come up before, but I have no idea how to do it other than to create my own function listing the entire alphabet and comparing positions of letters from scratch.
Is there any easy way to do this?
Edit for clarity:
I've got a linear linked list of class objects, each class object has a char name, and I'm writing a function to compare the name of each object in the list, to find the highest object alphabetically, and then find the next object down alphabetically, etc, linking them together as I go. I already have a function that does this for an int field, so I just need to rewrite it to compare inequalities between alphabetical characters where a is largest and z is smallest.
In hindsight that was probably a lot more relevant than I thought.
I think a couple of the answers I've gotten already should work so I'll pop back and select a best answer once I've gotten it working.
I'm also working with g++ and unity.
I think that in the general case the best approach is to use std::char_traits:
char a, b;
std::cin >> a >> b;
std::locale loc;
a = std::tolower(a, loc);
b = std::tolower(b, loc);
std::cout << std::char_traits<char>::compare(&a, &b, 1u);
But in many common situations you can simply compare chars as you do it with other integer types.
My guess is that your list contains char* as data (it better contain std::strings as data). If the list is composed of the latter, you can simply sort using the overloaded std::string's operator<, like
return str1 < str2; // true if `str1` is lexicographically before `str2`
If your list is made of C-like null-terminated strings, then you can sort them using std::strcmp like
return std::strcmp(s1, s2) < 0; // true if s1 sorts before s2
or use std::char_traits<char>::compare (as mentioned by @Anton) like
return std::char_traits<char>::compare(s1, s2, std::min(std::strlen(s1), std::strlen(s2))) < 0;
or sort them via temporary std::strings (most expensive), like
return std::string(s1) < std::string(s2); // here s1 and s2 are C-strings
If your list simply contains characters, then, as mentioned in the comments,
return c1 < c2; // returns true whenever c1 is before c2 in the alphabet
If you don't care about uppercase/lowercase, then you can use std::toupper to transform the character into uppercase, then always compare the uppercase.
#include <stdio.h>
#include <ctype.h>
int main(void) {
char a = 'X', b = 'M';
printf("%i\n", a < b);
printf("%i\n", b < a);
printf("%i\n", 'a' < 'B');
printf("%i\n", tolower('a') < tolower('B'));
}
prints out:
0
1
0
1
chars are still numbers, and can be compared as such. The upper case letters and lower case letters are all in order, with the upper case letters before the lower. (Such that 'Z' < 'a'.) See an ASCII table.
As you can see from this ASCII table, all of the alphanumeric characters appear in the correct alphabetical order, regarding their actual values:
"Is there any easy way to do this?"
So yes, comparing the character values will sort them in alphabetical order.
Would something like the code below suffice? It converts everything to upper case first.
class compareLessThanChar{
public:
    bool operator()(const char a, const char b) const
    { return toupper(a) < toupper(b); }
};
std::multiset<char, compareLessThanChar> sortedContainer;

String Comparison return value (Is it used in applications that sort characters?)

When we use strcmp(str1, str2); or str1.compare(str2); the return values are like -1, 0 and 1, for str1 < str2, str1 == str2 or str1 > str2 respectively.
The question is, is it defined like this for a specific reason?
For instance, in binary tree sorting algorithm, we push smaller values to the left child and larger values to the right child. This strcmp or string::compare functions seem to be perfect for that. However, does anyone use string matching in order to sort a tree (integer index are easier to use) ?
So, what is the actual purpose of the three return values ( -1, 0, 1). Why cant it just return 1 for true, and 0 for false?
Thanks
The purpose of having three return values is exactly what it seems like: to answer all questions about string comparisons at once.
Everyone has different needs. Some people sometimes need a simple less-than test; strncmp provides this. Some people need equality testing; strncmp provides this. Some people really do need to know the full relationship between two strings; strncmp provides this.
What you absolutely don't want is someone writing this:
if(strless(lhs, rhs))
{
}
else if(strequal(lhs, rhs))
{
}
That's doing two potentially expensive comparison operations. strless also knows if they were equal, because it had to get to the end of both strings to return that it was not less.
Oh, and FYI: the return value isn't -1 or +1; it's greater than zero or less than zero. Or zero if they're equal.
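A small sketch of using one comparison to settle all three cases. Order and classify are illustrative names, and only the sign of the result is inspected.

```cpp
#include <string>

enum class Order { Less, Equal, Greater };

// One call to compare() answers all three questions at once.
// Only the sign of the result matters, not its magnitude.
Order classify(const std::string& lhs, const std::string& rhs) {
    const int c = lhs.compare(rhs);
    if (c < 0) return Order::Less;
    if (c > 0) return Order::Greater;
    return Order::Equal;
}
```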
It's useful for certain cases where knowing all three cases is important. Use operator< for string when you just care about a boolean comparison.
It could, but then you would need multiple functions for sorting and comparison. With strcmp() returning smaller, equal or bigger, you can use them easily for comparison and for sorting.
Remember that BSTs are not the only place where you would like to compare strings. You might want to sort a name list or similar. Also, it is not uncommon to have a string as key in a tree too.
As others have stated, there are real purposes for comparison of strings with < > == implications. For example; fixed length numbers assigned to strings will resolve correctly; ie: "312235423" > "312235422". On some occasions this is useful.
However the feature you're asking for, true/false for solutions still works with the given return values.
if (-1)
{
// resolves true
}
else if (1)
{
// also resolves true
}
else if (0)
{
// resolves false
}

Searching For Elements With Multiple Matches

I have a vector of Key-Value pairs, where each Key-Value pair is also tagged with an Entry Type code. The possible Entry Type codes are:
enum Type
{
tData = 0,
tSeqBegin = 1, // the beginning of a sequence
tSeqEnd = 2 // the end of a sequence
};
So the Key-Value pair itself looks like this:
struct KeyVal
{
int key_;
string val_;
Type type_;
};
Within the vector are sub-arrays of additional Key-Value pairs. These sub-arrays are called 'sequences'. Sequences can be nested to any level. So sequences can themselves have (optional) sub-sequences of varying lengths. The combination of a Key and Type is unique within a sequence element. That is, within a single sequence element there can only be one 269 data row, but other sequence elements can have their own 269 data rows.
Here is a graphical representation of some sample data, grossly oversimplified (If the 'Type' column is blank, it is of type tData):
Row#  Type        Key    Value
----  ----------  -----  --------
1                 35     "W"
2                 1181   "IBM"
3     tSeqBegin   268    "3"
4                 269    "0"
5                 270    "160.3"
6     tSeqEnd     0
7                 269    "0"
8                 290    "0"
9     tSeqBegin   453    "1"      <-- subsequence
10    tSeqEnd     0               <-- end of subsequence
11    tSeqEnd     0
12                269    "0"
13                290    "1"
14                270    "160.4"
15    tSeqEnd     0
16                1759   "ABC"
[EDIT: A note on the above. There is one tSeqBegin that marks the beginning of the whole sequence. The end of each sequence element is marked by a tSeqEnd. But there is no special tSeqEnd that also marks the end of the whole sequence. So for a sequence you will see 1 tSeqBegin and n tSeqEnds, where n is the number of elements within the sequence.
Another note, in the above sequence beginning at row #3 and ending at row #15, there is one subsequence in the 2nd element (rows 7-11). The subsequence is empty, and occupies rows 9 and 10.]
What I'm trying to do is find a sequence element which has multiple Key-Value matches to certain criteria. For example, suppose I want to find the sequence element that has both 269="0" and 290="0". In this case, it should not find element #0 (starting at row 3) because that element doesn't have a 290=... row at all. It should find the element starting at row #7 instead. Ultimately I will extract other fields from this element, but that's beyond the scope of this problem, so I haven't included that data above.
I can't use std::find_if() because find_if() will evaluate each row individually, not the whole sequence element as a unit. So I can't construct a functor that evaluates something like if 269=="0" && 290=="0" because no single row will ever evaluate this to true.
I had thought to implement my own find_sequence_element(...) function. But this would involve some fairly complex logic. First I would have to identify the begin() and end() of the entire sequence, noting where each element begin()'s and end()'s. Then I would have to construct some kind of evaluation structure that I could string together like this pseudocode:
Condition cond = KeyValueMatch(269, "0") + KeyValueMatch(290, "0");
But this is also complex. I can't just construct a find_sequence_element() that takes exactly 2 parameters, one for the 269 match and another for the 290 match, because I want to use this algorithm for other sequences as well, with more or fewer conditions.
Moreover, it seems like I should be able to use the STL <algorithm>'s that already exist. While I know the STL rather well, I can't figure out a way to use find_if() in any straightforward way.
So, finally, here's the question. If you were faced with the above problem, how would you solve it? I know the question is vague. I'm hoping that with some discussion we can narrow the problem domain down until we have an answer.
Some conditions:
I cannot change the single flat vector to a vector of vectors or anything of the like. The reasons for this are complex.
(Placeholder for more conditions :) )
(If consensus is that this should be CW, I will mark it as such)
I would want to process in an online fashion. Have a type which tracks:
where the current sequence started
a count how many requirements have been met so far by the current sequence.
In your example requirements could be represented as a map<int,string>. In general they could be a sequence of binary predicates, or something polymorphic if you need to use different functors for different conditions in the same set, and for efficiency progress could be represented as a sequence of booleans, "has this predicate been met yet?"
When you see a tSeqEnd you clear the set of met requirements and start again. If your count hits the number of requirements, you're done.
The simplest case is that all predicates specify the key value, and hence only match once. It might look something like:
template<typename DataIterator, typename PredIterator>
DataIterator find_matching_sequence(
        DataIterator dataFirst,
        DataIterator dataLast,
        PredIterator predFirst,
        PredIterator predLast) {
    DataIterator sequence_start = dataFirst;
    size_t required = std::distance(predFirst, predLast);
    size_t sofar = 0;
    while (dataFirst != dataLast) {
        if (dataFirst->type_ == tSeqEnd) {
            // end of the current element: reset progress and start again
            sofar = 0;
            ++dataFirst;
            sequence_start = dataFirst;
            continue;
        }
        // Matches(row) is assumed to be a functor that applies a predicate to the row
        sofar += std::count_if(predFirst, predLast, Matches(*dataFirst));
        if (sofar == required) return sequence_start;
        ++dataFirst;
    }
    return dataLast; // no element met every requirement
}
If the same predicate could match multiple rows in a subsequence, then you can use a vector<bool> instead of a count, or possibly a valarray<bool>.
To cope with multiply-nested sub-sequences, you actually need a stack of "how am I doing" records, and you might be able to implement that by the function recursively calling itself, and returning early if it sees enough "end" records to know that it has reached the end of its outermost sequence. But I don't really understand that part of the data format.
So no serious use of STL algorithms, unless you want to std::copy your initial range into an output iterator that performs the online processing ;-)
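The vector<bool> variant of this online approach might look like the sketch below. It assumes every tSeqEnd closes the current element and does not track nesting depth, so it is not a full treatment of sub-sequences. Requirement and find_matching_element are hypothetical names. KeyVal and Type repeat the definitions from the question.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum Type { tData = 0, tSeqBegin = 1, tSeqEnd = 2 };

struct KeyVal {
    int key_;
    std::string val_;
    Type type_;
};

// Hypothetical requirement type: a (key, value) pair a data row must equal.
using Requirement = std::pair<int, std::string>;

// Returns the index of the first row of the first element that satisfies
// every requirement, or rows.size() if there is none.
// Assumption: every tSeqEnd closes the current element; nesting is ignored.
std::size_t find_matching_element(const std::vector<KeyVal>& rows,
                                  const std::vector<Requirement>& reqs) {
    std::vector<bool> met(reqs.size(), false);  // "has this requirement been met yet?"
    std::size_t metCount = 0;
    std::size_t start = 0;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        const KeyVal& row = rows[i];
        if (row.type_ == tSeqEnd) {
            // Element finished without a full match: clear progress.
            met.assign(reqs.size(), false);
            metCount = 0;
            start = i + 1;
            continue;
        }
        if (row.type_ != tData) continue;  // tSeqBegin rows carry no data to match
        for (std::size_t r = 0; r < reqs.size(); ++r) {
            if (!met[r] && row.key_ == reqs[r].first && row.val_ == reqs[r].second) {
                met[r] = true;
                ++metCount;
            }
        }
        if (metCount == reqs.size()) return start;
    }
    return rows.size();
}
```

On the sample data from the question, asking for 269="0" and 290="0" returns index 6, which is row #7. Because the sketch resets only on tSeqEnd, rows that precede the first tSeqBegin are counted as part of the first element.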
Hoping I understand your setup correctly, I would proceed as a two-step fashion, nesting search algorithms along the lines of:
template<typename It, typename Pr>
It find_sequence_element ( It begin, It end, Pr predicate );
except that Pr here is a predicate that takes a sequence and returns if that sequence matches, yes or no. An example for a single match could be:
class HasPair
{
    int key_; string value_;
public:
    HasPair ( int key, string value );
    template<typename It>
    bool operator() ( It begin, It end ) const {
        return std::find_if(begin, end, item_predicate(key_, value_)) != end;
    }
};
Where item_predicate() is suitable to find the (key_,value_) pair in [begin,end).
If you're interested in finding a sequence with two pairs, write a HasPairs predicate that invokes std::find_if twice, or some more optimized version of a search for two elements.

C++ string sort like a human being?

I would like to sort alphanumeric strings the way a human being would sort them. I.e., "A2" comes before "A10", and "a" certainly comes before "Z"! Is there any way to do this without writing a mini-parser? Ideally it would also put "A1B1" before "A1B10". I see the question "Natural (human alpha-numeric) sort in Microsoft SQL 2005" with a possible answer, but it uses various library functions, as does "Sorting Strings for Humans with IComparer".
Below is a test case that currently fails:
#include <set>
#include <iterator>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <cassert>
#include <cctype>
template <typename T>
struct LexicographicSort {
inline bool operator() (const T& lhs, const T& rhs) const{
std::ostringstream s1,s2;
s1 << toLower(lhs); s2 << toLower(rhs);
bool less = s1.str() < s2.str();
//Answer: bool less = doj::alphanum_less<std::string>()(s1.str(), s2.str());
std::cout<<s1.str()<<" "<<s2.str()<<" "<<less<<"\n";
return less;
}
inline std::string toLower(const std::string& str) const {
std::string newString("");
for (std::string::const_iterator charIt = str.begin();
charIt!=str.end();++charIt) {
newString.push_back(std::tolower(*charIt));
}
return newString;
}
};
int main(void) {
const std::string reference[5] = {"ab","B","c1","c2","c10"};
std::vector<std::string> referenceStrings(&(reference[0]), &(reference[5]));
//Insert in reverse order so we know they get sorted
std::set<std::string,LexicographicSort<std::string> > strings(referenceStrings.rbegin(), referenceStrings.rend());
std::cout<<"Items:\n";
std::copy(strings.begin(), strings.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
std::vector<std::string> sortedStrings(strings.begin(), strings.end());
assert(sortedStrings == referenceStrings);
}
Is there any way to do this without writing a mini-parser?
Let someone else do that?
I'm using this implementation: http://www.davekoelle.com/alphanum.html, I've modified it to support wchar_t, too.
It really depends what you mean by "parser." If you want to avoid writing a parser, I would think you should avail yourself of library functions.
Treat the string as a sequence of subsequences which are uniformly alphabetic, numeric, or "other."
Get the next alphanumeric sequence of each string using isalnum and backtrack-checking for + or - if it is a number. Use strtold in-place to find the end of a numeric subsequence.
If one is numeric and one is alphabetic, the string with the numeric subsequence comes first.
If one string has run out of characters, it comes first.
Use strcoll to compare alphabetic subsequences within the current locale.
Use strtold to compare numeric subsequences within the current locale.
Repeat until finished with one or both strings.
Break ties with strcmp.
This algorithm has something of a weakness in comparing numeric strings which exceed the precision of long double.
Is there any way to do it without writing a mini parser? I would think the answer is no. But writing a parser isn't that tough. I had to do this a while ago to sort our company's stock numbers. Basically just scan the number and turn it into an array. Check the "type" of every character: alpha, number, maybe you have others you need to deal with special. Like I had to treat hyphens special because we wanted A-B-C to sort before AB-A. Then start peeling off characters. As long as they are the same type as the first character, they go into the same bucket. Once the type changes, you start putting them in a different bucket. Then you also need a compare function that compares bucket-by-bucket. When both buckets are alpha, you just do a normal alpha compare. When both are digits, convert both to integer and do an integer compare, or pad the shorter to the length of the longer or something equivalent. When they're different types, you'll need a rule for how those compare, like does A-A come before or after A-1 ?
It's not a trivial job and you have to come up with rules for all the odd cases that may arise, but I would think you could get it together in a few hours of work.
Without any parsing, there's no way to compare human written numbers (high values first with leading zeroes stripped) and normal characters as part of the same string.
The parsing doesn't need to be terribly complex though. A simple hash table to deal with things like case sensitivity and stripping special characters ('A'='a'=1, 'B'='b'=2, ... or 'A'=1, 'a'=2, 'B'=3, ..., '-'=0 (strip)), remap your string to an array of the hashed values, then truncate number cases (if a number is encountered and the last character was a number, multiply the last number by ten and add the current value to it).
From there, sort as normal.