Case-insensitive sorting of an array of strings - C++

Basically, I have to use selection sort to sort a string[]. I have done that part, but here is what I am having difficulty with.
The sort, however, should be case-insensitive, so that "antenna" would come before "Jupiter". In ASCII all the uppercase letters sort before the lowercase ones, so would there not be a way to just adjust the ordering of the sorted strings? Or is there a simpler solution?
void stringSort(string array[], int size) {
    int startScan, minIndex;
    string minValue;
    for (startScan = 0; startScan < (size - 1); startScan++) {
        minIndex = startScan;
        minValue = array[startScan];
        for (int index = startScan + 1; index < size; index++) {
            if (array[index] < minValue) {
                minValue = array[index];
                minIndex = index;
            }
        }
        array[minIndex] = array[startScan];
        array[startScan] = minValue;
    }
}

C++ provides you with std::sort, which takes a comparison function. In your case, with a vector<string>, you'll be comparing two strings. The comparison function should return true if the first argument is smaller.
For our comparison function we'll want to find the first mismatched character between the strings after tolower has been applied. To do this we can use std::mismatch, which takes a comparator between two characters that returns true as long as they are equal:
const auto result = mismatch(lhs.cbegin(), lhs.cend(), rhs.cbegin(), rhs.cend(),
                             [](const unsigned char lhs, const unsigned char rhs) {
                                 return tolower(lhs) == tolower(rhs);
                             });
To decide if the lhs is smaller than the rhs fed to mismatch we need to test 3 things:
Were the strings of unequal length
Was string lhs shorter
Or was the first mismatched char from lhs smaller than the first mismatched char from rhs
This evaluation can be performed by:
result.second != rhs.cend() && (result.first == lhs.cend() || tolower(*result.first) < tolower(*result.second));
Ultimately, we'll want to wrap this up in a lambda and plug it back into sort as our comparator:
sort(foo.begin(), foo.end(), [](const string& lhs, const string& rhs) {
    const auto result = mismatch(lhs.cbegin(), lhs.cend(), rhs.cbegin(), rhs.cend(),
                                 [](const unsigned char lhs, const unsigned char rhs) {
                                     return tolower(lhs) == tolower(rhs);
                                 });
    return result.second != rhs.cend() && (result.first == lhs.cend() || tolower(*result.first) < tolower(*result.second));
});
This will correctly sort vector<string> foo. You can see a live example here: http://ideone.com/BVgyD2
EDIT:
Just saw your question update. You can use sort with string array[] as well. You'll just need to call it like this: sort(array, std::next(array, size), ...
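For example, a minimal sketch of the raw-array form, with the comparator above wrapped in a helper (the name case_insensitive_less and the sample data are mine, not from the original answer):
#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>
using namespace std;

// Same logic as the lambda above, as a named function for reuse.
bool case_insensitive_less(const string& lhs, const string& rhs) {
    const auto result = mismatch(lhs.cbegin(), lhs.cend(), rhs.cbegin(), rhs.cend(),
                                 [](unsigned char a, unsigned char b) { return tolower(a) == tolower(b); });
    return result.second != rhs.cend() &&
           (result.first == lhs.cend() ||
            tolower(static_cast<unsigned char>(*result.first)) < tolower(static_cast<unsigned char>(*result.second)));
}

int main() {
    string array[] = {"Jupiter", "antenna", "Banana"};
    const int size = 3;
    sort(array, next(array, size), case_insensitive_less); // sorts the raw array in place
}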

#include <algorithm>
#include <cctype>
#include <vector>
#include <string>
using namespace std;

void CaseInsensitiveSort(vector<string>& strs)
{
    sort(
        begin(strs),
        end(strs),
        [](const string& str1, const string& str2) {
            return lexicographical_compare(
                begin(str1), end(str1),
                begin(str2), end(str2),
                [](const char& char1, const char& char2) {
                    return tolower(char1) < tolower(char2);
                }
            );
        }
    );
}

I use this lambda function to sort a vector of strings:
std::sort(entries.begin(), entries.end(), [](const std::string& a, const std::string& b) -> bool {
    for (size_t c = 0; c < a.size() and c < b.size(); c++) {
        if (std::tolower(a[c]) != std::tolower(b[c]))
            return (std::tolower(a[c]) < std::tolower(b[c]));
    }
    return a.size() < b.size();
});

Instead of the < operator, use a case-insensitive string comparison function.
C89/C99 provide strcoll (string collate), which does a locale-aware string comparison. It's available in C++ as std::strcoll. In some (most?) locales, like en_CA.UTF-8, A and a (and all accented forms of either) are in the same equivalence class. I think strcoll only compares within an equivalence class as a tiebreak if the whole string is otherwise equal, which gives a very similar sort order to a case-insensitive compare. Collation (at least in English locales on GNU/Linux) ignores some characters (like [). So ls /usr/share | sort gives output like
acpi-support
adduser
ADM_scripts
aglfn
aisleriot
I pipe through sort because ls does its own sorting, which isn't quite the same as sort's locale-based sorting.
If you want to sort some user-input arbitrary strings into an order that the user will see directly, locale-aware string comparison is usually what you want. Strings that differ only in case or accents won't compare equal, so this won't work if you were using a stable sort and depending on case-differing strings to compare equal, but otherwise you get nice results. Depending on the use-case, nicer than plain case-insensitive comparison.
FreeBSD's strcoll was and maybe still is case sensitive for locales other than POSIX (ASCII). That forum post suggests that on most other systems it is not case sensitive.
MSVC provides a _stricoll for case-insensitive collation, implying that its normal strcoll is case sensitive. However, this might just mean that the fallback to comparing within an equivalence class doesn't happen. Maybe someone can test the following example with MSVC.
// strcoll.c: show that these strings sort in a different order, depending on locale
#include <stdio.h>
#include <string.h>
#include <locale.h>
int main()
{
    // TODO: try some strings containing characters like '[' that strcoll ignores completely.
    const char * s[] = { "FooBar - abc", "Foobar - bcd", "FooBar - cde" };
#ifdef USE_LOCALE
    setlocale(LC_ALL, ""); // empty string means look at env vars
#endif
    strcoll(s[0], s[1]);
    strcoll(s[0], s[2]);
    strcoll(s[1], s[2]);
    return 0;
}
output of gcc -DUSE_LOCALE -Og strcoll.c && ltrace ./a.out (or run LANG=C ltrace a.out):
__libc_start_main(0x400586, 1, ...
setlocale(LC_ALL, "") = "en_CA.UTF-8" # my env contains LANG=en_CA.UTF-8
strcoll("FooBar - abc", "Foobar - bcd") = -1
strcoll("FooBar - abc", "FooBar - cde") = -2
strcoll("Foobar - bcd", "FooBar - cde") = -1
# the three strings are in order
+++ exited (status 0) +++
with gcc -Og -UUSE_LOCALE strcoll.c && ltrace ./a.out:
__libc_start_main(0x400536, ...
# no setlocale, so current locale is C
strcoll("FooBar - abc", "Foobar - bcd") = -32
strcoll("FooBar - abc", "FooBar - cde") = -2
strcoll("Foobar - bcd", "FooBar - cde") = 32 # s[1] should sort after s[2], so it's out of order
+++ exited (status 0) +++
POSIX.1-2001 provides strcasecmp. The POSIX spec says the results are "unspecified" for locales other than plain-ASCII, though, so I'm not sure whether common implementations handle utf-8 correctly or not.
See this post for portability issues with strcasecmp, e.g. to Windows. See other answers on that question for other C++ ways of doing case-insensitive string compares.
Once you have a case-insensitive comparison function, you can use it with other sort algorithms, like C standard lib qsort, or c++ std::sort, instead of writing your own O(n^2) selection-sort.
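For example, a minimal sketch (mine, not from the answer) of plugging strcoll into std::sort as the comparator:
#include <algorithm>
#include <clocale>
#include <cstring>
#include <string>
#include <vector>

int main()
{
    std::setlocale(LC_ALL, "");               // collate according to the user's locale
    std::vector<std::string> v = {"Jupiter", "antenna", "Banana"};
    std::sort(v.begin(), v.end(),
              [](const std::string& a, const std::string& b) {
                  return std::strcoll(a.c_str(), b.c_str()) < 0; // locale-aware comparison
              });
}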
As b.buchhold's answer points out, doing a case-insensitive comparison on the fly might be slower than converting everything to lowercase once and sorting an array of indices: the lowercase version of each string is needed many times during the sort. std::strxfrm will transform a string so that strcmp on the result gives the same result as strcoll on the original string.
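A sketch of that idea (again mine, not from the answer): build the strxfrm key for each string once, then sort an index array by key.
#include <algorithm>
#include <cstring>
#include <numeric>
#include <string>
#include <vector>

// Returns the indices of `strs` in locale-aware collation order, computing each
// collation key (via std::strxfrm) only once per string.
// Assumes setlocale(LC_ALL, "") has been called, as in the example above.
std::vector<size_t> collation_order(const std::vector<std::string>& strs)
{
    std::vector<std::string> keys(strs.size());
    for (size_t i = 0; i < strs.size(); ++i) {
        size_t n = std::strxfrm(nullptr, strs[i].c_str(), 0); // required key length
        keys[i].resize(n + 1);
        std::strxfrm(&keys[i][0], strs[i].c_str(), n + 1);
        keys[i].pop_back();                                   // drop the trailing '\0'
    }
    std::vector<size_t> idx(strs.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](size_t a, size_t b) { return keys[a] < keys[b]; });
    return idx;
}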

You could call tolower on every character you compare. This is probably the easiest, yet not a great, solution, because:
You look at every char multiple times, so you'd call the method more often than necessary
You need extra care to handle wide characters w.r.t. their encoding (UTF-8 etc.)
You could also replace the comparator with your own function. I.e. there will be some place where you compare something like stringone[i] < stringtwo[j] or charA < charB; change it to my_less_than(stringone[i], stringtwo[j]) and implement the exact ordering you want there.
Another way would be to transform every string to lowercase once and create an array of pairs. Then you base your comparisons on the lowercase value only, but you swap whole pairs, so that your final strings end up in the right order as well (a sketch of this follows below).
Finally, you can create an array with lowercase versions and sort this one. Whenever you swap two elements in this one, you also swap the corresponding elements in the original array.
Note that all of those proposals would still need proper handling of wide characters (if you need that at all).
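As an illustration, a minimal sketch of the lowercase-pair ("decorate, sort, undecorate") approach for narrow, ASCII-only strings; the helper name is mine:
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Sort `strs` case-insensitively by pairing each string with a lowercase key,
// sorting the pairs by key, and copying the originals back in the new order.
void sort_by_lowercase_key(std::vector<std::string>& strs)
{
    std::vector<std::pair<std::string, std::string>> decorated; // {key, original}
    decorated.reserve(strs.size());
    for (const auto& s : strs) {
        std::string key = s;
        std::transform(key.begin(), key.end(), key.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        decorated.emplace_back(std::move(key), s);
    }
    std::sort(decorated.begin(), decorated.end()); // pairs compare by key first
    for (size_t i = 0; i < strs.size(); ++i)
        strs[i] = std::move(decorated[i].second);
}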

This solution is much simpler to understand than Jonathan Mee's, and pretty inefficient, but for educational purposes it could be fine:
std::string lowercase( std::string s )
{
    std::transform( s.begin(), s.end(), s.begin(), ::tolower );
    return s;
}

std::sort( array, array + length,
           []( const std::string &s1, const std::string &s2 ) {
               return lowercase( s1 ) < lowercase( s2 );
           } );
If you have to use your own sort function, you can use the same approach:
....
    minValue = lowercase( array[startScan] );
    for (int index = startScan + 1; index < size; index++) {
        const std::string &tstr = lowercase( array[index] );
        if (tstr < minValue) {
            minValue = tstr;
            minIndex = index;
        }
    }
    // minValue now holds a lowercased copy, so finish the pass by swapping
    // array[startScan] and array[minIndex] rather than assigning minValue back.
...

Related

Generate string lexicographically larger than input

Given an input string A, is there a concise way to generate a string B that is lexicographically larger than A, i.e. A < B == true?
My raw solution would be to say:
B = A;
++B.back();
but in general this won't work because:
A might be empty
The last character of A may already be at the maximum value, in which case incrementing it wraps around and the resulting character has a smaller value, i.e. B < A.
Adding an extra character every time is wasteful and will quickly result in unreasonably large strings.
So I was wondering whether there's a standard library function that can help me here, or if there's a strategy that scales nicely when I want to start from an arbitrary string.
You can duplicate A into B then look at the final character. If the final character isn't the final character in your range, then you can simply increment it by one.
Otherwise you can look at last-1, last-2, last-3, and so on. If you reach the front of the string without finding a character you can increment, then append a character to increase the length.
Here is my dummy solution:
std::string make_greater_string(std::string const &input)
{
    std::string ret{std::numeric_limits<std::string::value_type>::min()};
    if (!input.empty())
    {
        if (std::numeric_limits<std::string::value_type>::max() == input.back())
        {
            ret = input + ret;
        }
        else
        {
            ret = input;
            ++ret.back();
        }
    }
    return ret;
}
Ideally I'd hope to avoid the explicit handling of all special cases, and use some facility that can more naturally handle them. Already looking at the answer by @JosephLarson I see that I could increment more than just the last character, which would improve the range achievable without adding more characters.
And here's the refinement after the suggestions in this post:
std::string make_greater_string(std::string const &input)
{
    constexpr char minC = ' ', maxC = '~';
    // Working with limits was a pain,
    // using ASCII typical limit values instead.
    std::string ret{minC};
    auto rit = input.rbegin();
    while (rit != input.rend())
    {
        if (maxC == *rit)
        {
            ++rit;
            if (rit == input.rend())
            {
                ret = input + ret;
                break;
            }
        }
        else
        {
            ret = input;
            ++(*(ret.rbegin() + std::distance(input.rbegin(), rit)));
            break;
        }
    }
    return ret;
}
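A quick usage sketch (my own test harness, assuming the refined function above) that checks the A < B invariant:
#include <cassert>
#include <string>

std::string make_greater_string(std::string const &input); // defined above

int main()
{
    for (const std::string& a : {std::string{}, std::string{"abc"}, std::string{"ab~"}, std::string{"~~~"}})
    {
        const std::string b = make_greater_string(a);
        assert(a < b); // the generated string must compare lexicographically larger
    }
}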
You can copy the string and append some letters - this will produce a lexicographically larger result.
B = A + "a"

Character pointers messed up in simple Boyer-Moore implementation

I am currently experimenting with a very simple Boyer-Moore variant.
In general my implementation works, but if I try to utilize it in a loop the character pointer containing the haystack gets messed up. And I mean that characters in it are altered, or mixed.
The result is consistent, i.e. running the same test multiple times yields the same screw up.
This is the looping code:
string src("This haystack contains a needle! needless to say that only 2 matches need to be found!");
string pat("needle");
const char* res = src.c_str();
while((res = boyerMoore(res, pat)))
    ++res;
This is my implementation of the string search algorithm (the above code calls a convenience wrapper which pulls the character pointer and length of the string):
unsigned char*
boyerMoore(const unsigned char* src, size_t srcLgth, const unsigned char* pat, size_t patLgth)
{
    if(srcLgth < patLgth || !src || !pat)
        return nullptr;

    size_t skip[UCHAR_MAX]; //this is the skip table
    for(int i = 0; i < UCHAR_MAX; ++i)
        skip[i] = patLgth; //initialize it with default value
    for(size_t i = 0; i < patLgth; ++i)
        skip[(int)pat[i]] = patLgth - i - 1; //set skip value of chars in pattern

    std::cout << src << "\n"; //just to see what's going on here!

    size_t srcI = patLgth - 1; //our first character to check
    while(srcI < srcLgth)
    {
        size_t j = 0; //char match ct
        while(j < patLgth)
        {
            if(src[srcI - j] == pat[patLgth - j - 1])
                ++j;
            else
            {
                //since the number of characters to skip may be negative, I just increment in that case
                size_t t = skip[(int)src[srcI - j]];
                if(t > j)
                    srcI = srcI + t - j;
                else
                    ++srcI;
                break;
            }
        }
        if(j == patLgth)
            return (unsigned char*)&src[srcI + 1 - j];
    }
    return nullptr;
}
The loop produced this output (i.e. these are the haystacks the algorithm received):
This haystack contains a needle! needless to say that only 2 matches need to be found!
eedle! needless to say that only 2 matches need to be found!
eedless to say that eed 2 meed to beed to be found!
As you can see the input is completely messed up after the second run. What am I missing? I thought the contents could not be modified, since I'm passing const pointers.
Is the way of setting the pointer in the loop wrong, or is my string search screwing up?
Btw: This is the complete code, except for includes and the main function around the looping code.
EDIT:
The missing nullptr of the first return was due to a copy/paste error, in the source it is actually there.
For clarification, this is my wrapper function:
inline char* boyerMoore(const string &src, const string &pat)
{
    return (const char*) boyerMoore((const unsigned char*) src.c_str(), src.size(),
                                    (const unsigned char*) pat.c_str(), pat.size());
}
In your boyerMoore() function, the first return isn't returning a value (you have just return; rather than return nullptr;). GCC doesn't always warn about missing return values, and not returning anything is undefined behavior. That means that when you store the return value in res and call the function again, there's no telling what will print out. You can see a related discussion here.
Also, you have omitted your convenience function that calculates the length of the strings that you are passing in. I would recommend double checking that logic to make sure the sizes are correct - I'm assuming you are using strlen or similar.

Working with strcmp and string array

I'm trying to eliminate extra elements in the string array, and I wrote the code below. There seems to be a problem with the strcmp function and string arrays: strcmp doesn't accept the string array elements that way. Can you help me fix that? array3 is an array of std::string. I'm coding in C++, and what I want to do is this: there are multiple "apple"s or "banana"s in the string array, but I only need one "apple" or one "banana".
for(int l = 0; l < 9999; l++)
{
    for(int m = l + 1; m < 10000; m++)
        if(!strcmp(array3[l], array3[m]))
        {
            array3[m] = array3[m+1];
        }
}
strcmp returns 0 on equality, so if (!strcmp(s1,s2)) means "if the strings are equal then do this...". Is that what you mean?
First of all, you can use operator== to compare strings of std::string type:
std::string a = "asd";
std::string b = "asd";
if(a == b)
{
//do something
}
Second, you have an error in your code, provided 10000 is the size of the array:
array3[m]=array3[m+1];
In this line you are accessing element m+1; since m can go up to 9999, you will eventually access index 10000 and run out of the array's bounds.
Finally, your approach is wrong, and this way will not let you remove all the duplicate strings.
A better (but not the best) way to do it is this (pseudocode):
std::string array[];   //initial array
std::string result[];  //the array without duplicate elements
int resultSize = 0;    //The number of unique elements.
bool isUnique = false; //A flag to indicate if the current element is unique.

for( int i = 0; i < array.size; i++ )
{
    isUnique = true; //we assume that the element is unique
    for( int j = 0; j < result.size; j++ )
    {
        if( array[i] == result[j] )
        {
            /*if the result array already contains such an element, it is, obviously,
              not unique, and we have no interest in it.*/
            isUnique = false;
            break;
        }
    }
    //Now, if the isUnique flag is true, which means we didn't find a match in the result array,
    //we add the current element into the result array, and increase the count by one.
    if( isUnique == true )
    {
        result[resultSize] = array[i];
        resultSize++;
    }
}
strcmp works on C strings only, so if you want to use it I suggest you change the call to strcmp(array3[l].c_str(), array3[m].c_str()), which passes the strings as C strings.
Another option would be to simply compare them with the equality operator, array3[l] == array3[m]; this would tell you whether the strings are equal or not.
Another way to do what you're trying to do is just to put the array's elements into a set and iterate over that. Sets don't keep more than one string with the same content! (A sketch of this follows below.)
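For illustration, a minimal sketch of the set-based deduplication (my own, assuming array3 is an array of std::string of known size; note that std::set also reorders the elements):
#include <set>
#include <string>

// Copies the unique strings back to the front of the array and returns how many there are.
size_t deduplicate(std::string array3[], size_t size)
{
    std::set<std::string> unique(array3, array3 + size);
    size_t n = 0;
    for (const std::string& s : unique)
        array3[n++] = s;        // unique strings, in sorted order
    return n;                   // only the first n elements are meaningful afterwards
}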
References:
More about strcmp: http://en.cppreference.com/w/cpp/string/byte/strcmp
More about c_str: http://en.cppreference.com/w/cpp/string/basic_string/c_str
Regarding string comparison: http://en.cppreference.com/w/cpp/string/basic_string/compare
C++ sets: http://en.cppreference.com/w/cpp/container/set

Removing specified characters from a string - Efficient methods (time and space complexity)

Here is the problem: Remove specified characters from a given string.
Input: The string is "Hello World!" and characters to be deleted are "lor"
Output: "He Wd!"
Solving this involves two sub-parts:
Determining if the given character is to be deleted
If so, then deleting the character
To solve the first part, I am reading the characters to be deleted into a std::unordered_map, i.e. I parse the string "lor" and insert each character into the hashmap. Later, when I am parsing the main string, I will look into this hashmap with each character as the key and if the returned value is non-zero, then I delete the character from the string.
Question 1: Is this the best approach?
Question 2: Which would be better for this problem: std::map or std::unordered_map? Since I am not interested in ordering, I used an unordered_map. But is there a higher overhead for creating the hash table? What should I do in such situations: use a map (balanced tree) or an unordered_map (hash table)?
Now coming to the next part, i.e. deleting the characters from the string. One approach is to delete the character and shift the data from that point on, back by one position. In the worst case, where we have to delete all the characters, this would take O(n^2).
The second approach would be to copy only the required characters to another buffer. This would involve allocating enough memory to hold the original string and copy over character by character leaving out the ones that are to be deleted. Although this requires additional memory, this would be a O(n) operation.
The third approach would be to start reading and writing from the 0th position, incrementing the source pointer every time I read and incrementing the destination pointer only when I write. Since the source pointer will always be at or ahead of the destination pointer, I can write over the same buffer. This saves memory and is also an O(n) operation. I am doing the same, and calling resize at the end to remove the leftover characters.
Here is the function I have written:
// str contains the string (Hello World!)
// chars contains the characters to be deleted (lor)
void remove_chars(string& str, const string& chars)
{
    unordered_map<char, int> chars_map;
    for(string::size_type i = 0; i < chars.size(); ++i)
        chars_map[chars[i]] = 1;

    string::size_type i = 0; // source
    string::size_type j = 0; // destination
    while(i < str.size())
    {
        if(chars_map[str[i]] != 0)
            ++i;
        else
        {
            str[j] = str[i];
            ++i;
            ++j;
        }
    }
    str.resize(j);
}
Question 3: What are the different ways by which I can improve this function. Or is this best we can do?
Thanks!
Good job, now learn about the standard library algorithms and boost:
str.erase(std::remove_if(str.begin(), str.end(), boost::is_any_of("lor")), str.end());
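If you would rather avoid the Boost dependency, a roughly equivalent sketch using only the standard library (a lambda in place of boost::is_any_of):
#include <algorithm>
#include <string>

void remove_chars(std::string& str, const std::string& chars)
{
    str.erase(std::remove_if(str.begin(), str.end(),
                             [&](char c) { return chars.find(c) != std::string::npos; }),
              str.end());
}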
Assuming that you're studying algorithms, and not interested in library solutions:
Hash tables are most valuable when the number of possible keys is large, but you only need to store a few of them. Your hash table would make sense if you were deleting specific 32-bit integers from digit sequences. But with ASCII characters, it's overkill.
Just make an array of 256 bools and set a flag for the characters you want to delete. It only uses one table lookup per input character, whereas a hash map needs at least a few more instructions to compute the hash function. Space-wise, the hash map is probably no more compact once you add up all its auxiliary data.
void remove_chars(string& str, const string& chars)
{
    // set up the look-up table
    std::vector<bool> discard(256, false);
    for (size_t i = 0; i < chars.size(); ++i)
    {
        discard[static_cast<unsigned char>(chars[i])] = true;
    }
    for (size_t j = 0; j < str.size(); ++j)
    {
        if (discard[static_cast<unsigned char>(str[j])])
        {
            // do something, depending on your storage choice
        }
    }
}
Regarding your storage choices: Choose between options 2 and 3 depending on whether you need to preserve the input data or not. 3 is obviously most efficient, but you don't always want an in-place procedure.
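For completeness, a sketch (mine, not part of the answer) of the boolean table combined with option 3, in-place compaction:
#include <string>
#include <vector>

void remove_chars(std::string& str, const std::string& chars)
{
    std::vector<bool> discard(256, false);
    for (unsigned char c : chars)
        discard[c] = true;

    std::string::size_type out = 0;                       // next write position
    for (std::string::size_type in = 0; in < str.size(); ++in)
    {
        if (!discard[static_cast<unsigned char>(str[in])])
            str[out++] = str[in];                         // keep this character
    }
    str.resize(out);                                      // drop the tail that was not overwritten
}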
Here is a KISS solution with many advantages:
void remove_chars (char *dest, const char *src, const char *excludes)
{
    do {
        if (!strchr (excludes, *src))
            *dest++ = *src;
    } while (*src++);
    *dest = '\000';
}
You can ping pong between strcspn and strspn to avoid the need for a hash table:
void remove_chars(
    const char *input,
    char *output,
    const char *characters)
{
    const char *next_input = input;
    char *next_output = output;
    while (*next_input != '\0')
    {
        // copy the run of characters that are not in the deletion set...
        int copy_length = strcspn(next_input, characters);
        memcpy(next_output, next_input, copy_length);
        next_output += copy_length;
        next_input += copy_length;
        // ...then skip the run of characters that are in it
        next_input += strspn(next_input, characters);
    }
    *next_output = '\0';
}
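A small usage sketch (my own test, not part of the answer), assuming the function above:
#include <stdio.h>
#include <string.h>

void remove_chars(const char *input, char *output, const char *characters); // defined above

int main(void)
{
    const char *input = "Hello World!";
    char output[64];
    remove_chars(input, output, "lor");
    printf("%s\n", output); // expected: "He Wd!"
    return 0;
}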

Unexpected collision with std::hash

I know that hashing an infinite number of strings into 32-bit ints must generate collisions, but I expect a hashing function to give a reasonably nice distribution.
Isn't it weird that these 2 strings have the same hash?
size_t hash0 = std::hash<std::string>()("generated_id_0");
size_t hash1 = std::hash<std::string>()("generated_id_1");
//hash0 == hash1
I know I can use boost::hash<std::string> or others, but I want to know what is wrong with std::hash. Am I using it wrong? Shouldn't I somehow "seed" it?
There's nothing wrong with your usage of std::hash. The problem is that the specialization std::hash<std::string> provided by the standard library implementation bundled with Visual Studio 2010 only takes a subset of the string's characters to determine the hash value (presumably for performance reasons). Coincidentally the last character of a string with 14 characters is not part of this set, which is why both strings yield the same hash value.
As far as I know this behaviour is in conformance with the standard, which demands only that multiple calls to the hash function with the same argument must always return the same value. However, the probability of a hash collision should be minimal. The VS2010 implementation fulfills the mandatory part, yet fails to account for the optional one.
For details, see the implementation in the header file xfunctional (starting at line 869 in my copy) and §17.6.3.4 of the C++ standard (latest public draft).
If you absolutely need a better hash function for strings, you should implement it yourself. It's actually not that hard.
The exact hash algorithm isn't specified by the standard, so the results will vary. The algorithm used by VC10 doesn't seem to take all of the characters into account if the string is longer than 10 characters; it advances with an increment of 1 + s.size() / 10. This is legal, albeit, from a QoI point of view, rather disappointing; such hash codes are known to perform very poorly for some typical sets of data (like URLs). I'd strongly suggest you replace it with either an FNV hash or one based on a Mersenne prime:
FNV hash:
struct hash
{
    size_t operator()( std::string const& s ) const
    {
        size_t result = 2166136261U;
        std::string::const_iterator end = s.end();
        for ( std::string::const_iterator iter = s.begin();
              iter != end;
              ++iter ) {
            result = (16777619 * result)
                     ^ static_cast< unsigned char >( *iter );
        }
        return result;
    }
};
Mersenne prime hash:
struct hash
{
    size_t operator()( std::string const& s ) const
    {
        size_t result = 2166136261U;
        std::string::const_iterator end = s.end();
        for ( std::string::const_iterator iter = s.begin();
              iter != end;
              ++iter ) {
            result = 127 * result
                     + static_cast< unsigned char >( *iter );
        }
        return result;
    }
};
(The FNV hash is supposedly better, but the Mersenne prime hash will be faster on a lot of machines, because multiplying by 127 is often significantly faster than multiplying by 16777619.)
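As a usage note, either functor can be plugged into an unordered container as its hash parameter. A self-contained sketch (mine, with the FNV loop compacted):
#include <string>
#include <unordered_set>

// Compact copy of the FNV functor above, for a self-contained example.
struct fnv_hash
{
    size_t operator()(std::string const& s) const
    {
        size_t result = 2166136261U;
        for (unsigned char c : s)
            result = (16777619 * result) ^ c;
        return result;
    }
};

int main()
{
    std::unordered_set<std::string, fnv_hash> ids;
    ids.insert("generated_id_0");
    ids.insert("generated_id_1");
    return ids.size() == 2 ? 0 : 1; // two distinct keys, almost certainly two distinct hashes
}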
You should normally be getting different hash values; I do (GCC 4.5):
hashtest.cpp
#include <string>
#include <iostream>
#include <functional>

int main(int argc, char** argv)
{
    size_t hash0 = std::hash<std::string>()("generated_id_0");
    size_t hash1 = std::hash<std::string>()("generated_id_1");
    std::cout << hash0 << (hash0 == hash1 ? " == " : " != ") << hash1 << "\n";
    return 0;
}
Output
# g++ hashtest.cpp -o hashtest -std=gnu++0x
# ./hashtest
16797002355621538189 != 16797001256109909978
You do not seed a hashing function; at most you can salt the input.
You are using the function in the right way, and this collision could simply be bad luck.
You cannot tell whether the hashing function is poorly distributed unless you perform a massive test with random keys.
The TR1 hash function and the newest standard define proper overloads for things like strings. When I run this code using std::tr1::hash (g++ 4.1.2), I get different hash values for these two strings.