How to order strings case-insensitively (not lexicographically)? - c++

I'm attempting to order a list input from a file alphabetically (not lexicographically). So, if the list were:
C
d
A
b
I need it to become:
A
b
C
d
Not the lexicographic ordering:
A
C
b
d
I'm using string variables to hold the input, so I'm looking for some way to modify the strings I'm comparing to all uppercase or lowercase, or if there's some easier way to force an alphabetic comparison, please impart that wisdom. Thanks!
I should also mention that we are limited to the following libraries for this assignment: iostream, iomanip, fstream, string, as well as C libraries, like cstring, cctype, etc.

It looks like I'm just going to have to defeat this problem via some very tedious method of character extraction and toupper-ing for each string.
Converting the individual strings to upper case and comparing them is not made particularly worse by being restricted from using algorithm, iterator, etc. The comparison logic is about four lines of code. Even though it would be nice not to have to write those four lines, having to write a sorting algorithm is far more difficult and tedious. (Well, assuming that the usual C version of toupper is acceptable in the first place.)
Below I show a simple strcasecmp() implementation and then put it to use in a complete program which uses restricted libraries. The implementation of strcasecmp() itself doesn't use restricted libraries.
#include <string>
#include <cctype>
#include <iostream>
void toupper(std::string &s) {
for (char &c : s)
c = static_cast<char>(std::toupper(static_cast<unsigned char>(c))); // cast avoids UB for negative char values
}
bool strcasecmp(std::string lhs, std::string rhs) {
toupper(lhs); toupper(rhs);
return lhs < rhs;
}
// restricted libraries used below
#include <algorithm>
#include <iterator>
#include <vector>
// Example usage:
// > ./a.out <<< "C d A b"
// A b C d
int main() {
std::vector<std::string> input;
std::string word;
while(std::cin >> word) {
input.push_back(word);
}
std::sort(std::begin(input), std::end(input), strcasecmp);
std::copy(std::begin(input), std::end(input),
std::ostream_iterator<std::string>(std::cout, " "));
std::cout << '\n';
}
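Since the assignment rules out algorithm and vector, here is a minimal sketch of the same idea using only cctype, cstddef and string: a case-insensitive "less than" plus a hand-written insertion sort over a plain array. The names less_ignore_case and insertion_sort are made up for this illustration, and the comparison assumes plain single-byte (ASCII-style) text.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true if lhs orders before rhs when letter case is ignored.
bool less_ignore_case(const std::string &lhs, const std::string &rhs) {
    const std::size_t n = lhs.size() < rhs.size() ? lhs.size() : rhs.size();
    for (std::size_t i = 0; i < n; ++i) {
        // Cast to unsigned char first: toupper is undefined for negative values.
        const int a = std::toupper(static_cast<unsigned char>(lhs[i]));
        const int b = std::toupper(static_cast<unsigned char>(rhs[i]));
        if (a != b)
            return a < b;
    }
    return lhs.size() < rhs.size();  // a proper prefix sorts first
}

// Hypothetical helper: plain insertion sort over an array of strings.
void insertion_sort(std::string words[], std::size_t count) {
    for (std::size_t i = 1; i < count; ++i) {
        const std::string key = words[i];
        std::size_t j = i;
        // Shift larger elements right until the slot for key is found.
        while (j > 0 && less_ignore_case(key, words[j - 1])) {
            words[j] = words[j - 1];
            --j;
        }
        words[j] = key;
    }
}
```

Insertion sort is quadratic, which is usually acceptable for a word list read from a small file; the strings themselves are never modified, only reordered.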

You don't have to modify the strings before sorting. You can sort them in place with a case-insensitive single character comparator and std::sort:
bool case_insensitive_cmp(char lhs, char rhs) {
return ::toupper(static_cast<unsigned char>(lhs)) <
::toupper(static_cast<unsigned char>(rhs));
}
std::string input = ....;
std::sort(input.begin(), input.end(), case_insensitive_cmp);

std::vector<std::string> vec {"A", "a", "lorem", "Z"};
std::sort(vec.begin(),
vec.end(),
[](const std::string& s1, const std::string& s2) {
// strcasecmp() here is the POSIX function from <strings.h>, not standard C++
return strcasecmp(s1.c_str(), s2.c_str()) < 0;
});

Use strcasecmp() as comparison function in qsort().

I am not completely sure how to write it, but what you want to do is convert the strings to lower or uppercase.
If the strings are in an array to begin with, you would run through the list, and save the indexes in order in an (int) array.

If you're just comparing letters, then a terrible hack which will work is to mask off the upper bits of each character, keeping only the low five (c & 0x1F). Then upper and lower case letters fall on top of each other.

Related

Count word frequency using map

This is my first time implementing map in C++. So given a character array with text, I want to count the frequency of each word occurring in the text. I decided to implement map to store the words and compare following words and increment a counter.
Following is the code I have written so far.
const char *kInputText = "\
So given a character array with text, I want to count the frequency of\n\
each word occurring in the text.\n\
I decided to implement map to store the\n\
words and compare following words and increment a counter.\n";
typedef struct WordCounts
{
int wordcount;
}WordCounts;
typedef map<string, int> StoreMap;
//countWord function is to count the total number of words in the text.
void countWord( const char * text, WordCounts & outWordCounts )
{
outWordCounts.wordcount = 0;
size_t i;
if(isalpha(text[0]))
outWordCounts.wordcount++;
for(i=1;i<strlen(text);i++) // start at 1 so that text[i-1] stays in bounds
{
if((isalpha(text[i])) && (!isalpha(text[i-1])))
outWordCounts.wordcount++;
}
cout<<outWordCounts.wordcount;
}
//count_for_map() is to count the word frequency using map.
void count_for_map(const char *text, StoreMap & words)
{
string st;
while(text >> st)
words[st]++;
}
int main()
{
WordCounts wordCounts;
StoreMap w;
countWord( kInputText, wordCounts );
count_for_map(kInputText, w);
for(StoreMap::iterator p = w.begin();p != w.end();++p)
{
std::cout<<p->first<<"occurred" <<p->second<<"times. \n";
}
return 0;
}
Error: No match for 'operator >>' in 'text >> st'
I understand this is an operator overloading error, so I went ahead and
wrote the following lines of code.
//In the count_for_map()
/*istream & operator >> (istream & input,const char *text)
{
int i;
for(i=0;i<strlen(text);i++)
input >> text[i];
return input;
}*/
Am I implementing map in the wrong way?
There is no overload for >> with a const char* left hand side.
text is a const char*, not an istream, so your overload doesn't apply (besides, the overload is (1) wrong and (2) already provided by the standard library).
You want to use the more suitable std::istringstream, like this:
std::istringstream textstream(text);
while(textstream >> st)
words[st]++;
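Put together, the corrected function might look like the following sketch. It keeps the StoreMap typedef and the count_for_map signature from the question and only swaps the raw pointer for a std::istringstream.

```cpp
#include <map>
#include <sstream>
#include <string>

typedef std::map<std::string, int> StoreMap;

// Counts whitespace-separated words in text.
void count_for_map(const char *text, StoreMap &words) {
    std::istringstream textstream(text);  // wrap the C string in a stream
    std::string st;
    while (textstream >> st)              // extracts one word per iteration
        ++words[st];                      // operator[] inserts 0 on first use
}
```

Note that extraction with >> splits on whitespace only, so "text," (with the comma) and "text" are counted as different words.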
If you use modern C++, life gets much easier.
First. Usage of a std::map is the correct approach.
This is a more or less standard approach for counting something in a container.
We can use an associative container like a std::map or a std::unordered_map. And here we associate a "key", in this case the "word" to count, with a value, in this case the count of the specific word.
And luckily the maps have a very nice index operator[]. This will look for the given key and, if found, return a reference to the value. If not found, it will create a new entry with the key and return a reference to the new entry. So, in both cases, we will get a reference to the value used for counting. And then we can simply write:
std::unordered_map<std::string, unsigned int> counter{};
counter[word]++;
But how do we get words from a string? A string is like a container of elements, and in C++ many containers have iterators. For strings in particular there is a dedicated iterator that iterates over patterns in a std::string. It is called std::sregex_token_iterator. The pattern is given as a std::regex, which gives you great flexibility.
And, because we have such a wonderful and dedicated iterator, we should use it!
Everything glued together gives a very compact solution, with a minimal number of code lines.
Please see:
#include <iostream>
#include <string>
#include <regex>
#include <map>
#include <iomanip>
const std::regex re{ "\\w+" };
const std::string text{ R"(So given a character array with text, I want to count the frequency of
each word occurring in the text.
I decided to implement map to store the
words and compare following words and increment a counter.")" };
int main() {
std::map<std::string, unsigned int> counter{};
for (auto word{ std::sregex_token_iterator(text.begin(),text.end(),re) }; word != std::sregex_token_iterator(); ++word)
counter[*word]++;
for (const auto& [word, count] : counter)
std::cout << std::setw(20) << word << "\toccurred\t" << count << " times\n";
}

Sorting a vector of ints that have been converted into strings

So I need to sort a vector of strings in numerical order. I am using the sort function and it almost works. Say I have the numbers 10, 20, 5, 200, 50, 75 that have been converted to strings. The sort function sorts them like so: 10, 20, 200, 5, 50, 75. So it is only sorting by the first character, I suppose? Is there an easy way to get it to sort by more than the first character? And yes, they must be converted to strings for my particular use.
Thanks!
Look at the following piece of code:
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
int main()
{
std::vector<std::string> v {"123", "453", "78", "333"};
std::sort(std::begin(v), std::end(v), [] (std::string const &A, std::string const &B) { return std::stoi(A) < std::stoi(B);});
for(auto i : v) std::cout << i << std::endl;
}
The question is really why you want to sort this after it became a vector of strings and not before that.
The simplest way to sort a vector of strings holding ints might be to convert it to ints, sort that and then convert back to strings into the first vector... which in your case could be more efficient if you did not convert to strings in the first place.
Regarding the suggestion to convert to int on the fly inside the comparator, that is going to be expensive. Comparing int is trivial compared with the process of conversion from string to int. Sorting is O(N log N) (expected) number of comparisons, if you convert on the fly you will be doing O(N log N) conversions, if you convert once you will do O(N) conversions and O(N log N) trivial int compares.
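A minimal sketch of the convert-once approach, assuming every string holds a value that fits in an int. The function name sort_numeric_strings is invented here. Note that the round trip normalizes the text, so an input such as "007" would come back as "7".

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: returns the inputs sorted by numeric value.
std::vector<std::string> sort_numeric_strings(const std::vector<std::string> &in) {
    std::vector<int> values;
    values.reserve(in.size());
    for (const std::string &s : in)
        values.push_back(std::stoi(s));      // N conversions, done once

    std::sort(values.begin(), values.end()); // O(N log N) cheap int compares

    std::vector<std::string> out;
    out.reserve(values.size());
    for (int v : values)
        out.push_back(std::to_string(v));    // N conversions back
    return out;
}
```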
You can also handcraft an algorithm to do the comparison. If you can assume that all values are positive and there are no leading zeros, a number represented as a string is larger than any other number represented as a string with a shorter length. You could use that to build a comparison function:
struct Compare {
bool operator()(std::string const & lhs, std::string const & rhs) const {
return lhs.size() < rhs.size()
|| (lhs.size() == rhs.size() && lhs < rhs);
}
};
If there can be leading zeros, it is simple to find how many there are and adjust the size accordingly inside the comparator. If the numbers can be negative, you can further extend the comparator to detect the sign and then apply something similar to the comparison above.
Can you use a standard map instead?
// now since map is already sorted by keys, you look up on the integer to get the equivalent string.
std::map<int, string> integersAndStrings;
integersAndStrings[1] = "one";
integersAndStrings[2] = "two";
integersAndStrings[3] = "three";
You could also write a variant of 40two's example. Instead of stoi, you can just make your own predicate to compare the characters. If lhs has fewer digits than rhs, lhs must be a smaller number (assuming no floating point); if they have the same number of digits, then compare the strings (i.e., what David Rodriguez showed you in his answer). I didn't notice that he had already suggested that when I wrote my answer. The only additional thing that I am adding is really the suggestion of using another container (i.e., std::map).

Is there a CompareTo method in C++ similar to Java where you can use > < = operations on a data type

I know that in Java there is a compareTo method that you can write in a class that will compare two objects and return -1, 1, or 0, signifying less than, greater than, and equal to respectively. Is there a way to do this in C++?
Background:
I'm creating a modified string class that holds a string and an array list. I want to be able to compare the strings in the traditional fashion: a word that comes earlier in the alphabet is less than one that comes later. The array list just stores the pages on which the word was indexed in a text file. Anyway, the specifics do not matter since I already have the class written. I just need to create a compareTo method that can be used in the main of my cpp file or by other data types, like various trees for instance.
I'll write the code in Java, as I know how, and maybe someone can help me with the C++ syntax (I'm required to write in C++ for this project, unfortunately, and I am new to C++).
I will shorten the code to give a rough outline of what I'm doing, then write the compareTo method as I know how in Java.
class name ModifiedString
Has variables: word , arraylist pagelist
Methods:
getWord (returns the word associated with the class, i.e its string)
appendPageList (adds page numbers to the array list, this doesnt matter in this question)
Here's how I would do it in Java:
int compareTo(ModifiedString a){
if(this.getWord() > a.getWord())
return 1;
else if (this.getWord() < a.getWord())
return -1;
else return 0;
}
Then when <, >, or == is used on a ModifiedString, the operations would be valid.
std::string already includes a working overload of operator<, so you can just compare strings directly. Java uses compareTo primarily because the built-in comparison operator produces results that aren't generally useful for strings. Being a lower-level language, Java doesn't support user-defined operator overloads, so it uses compareTo as a band-aid to cover for the inadequacy of the language.
From your description, however, you don't need to deal with any of that directly at all. At least as you've described the problem, what you really want is something like:
std::map<std::string, std::vector<int> > page_map;
You'll then read words in from your text file, and insert the page number where each occurs into the page map:
page_map[current_word].push_back(current_page);
Note that I've used std::map above, on the expectation that you may want ordered results (e.g., be able to quickly find all words from age to ale in alphabetical order). If you don't care about ordering, you may want to use std::unordered_map instead.
Edit: here's a simple text cross-reference program that reads a text file (from standard input) and writes out a cross-reference by line number (i.e., each "word", and the numbers of the lines on which that word appeared).
#include <map>
#include <unordered_map>
#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <iterator>
#include "infix_iterator.h"
typedef std::map<std::string, std::vector<unsigned> > index;
namespace std {
ostream &operator<<(ostream &os, index::value_type const &i) {
os << i.first << ":\t";
std::copy(i.second.begin(), i.second.end(),
infix_ostream_iterator<unsigned>(os, ", "));
return os;
}
}
void add_words(std::string const &line, size_t num, index &i) {
std::istringstream is(line);
std::string temp;
while (is >> temp)
i[temp].push_back(num);
}
int main() {
index i;
std::string line;
size_t line_number = 0;
while (std::getline(std::cin, line))
add_words(line, ++line_number, i);
std::copy(i.begin(), i.end(),
std::ostream_iterator<index::value_type>(std::cout, "\n"));
return 0;
}
If you look at the first typedef (of index), you can change it from map to unordered_map if you want to test a hash table vs. a red-black tree. Note that this interprets "word" pretty loosely -- basically any sequence of non-whitespace characters, so for example, it'll treat example, as a "word" (and it'll be separate from example).
Note that this uses the infix_iterator I've posted elsewhere.
Before C++20 there was no standard operator in C++ that does what the Java compareTo() function does (C++20 added the three-way comparison operator<=>). You can, however, implement
int compareTo(const ModifiedString&, const ModifiedString&);
Another option is to overload the <, <=, >, >=, == and != operators, e.g. by implementing
bool operator<(const ModifiedString&, const ModifiedString&);
In C++, you define bool operator< directly, no need to invent funny names; the same goes for operator> and operator==. They're generally implemented as member functions taking one extra argument, the right-hand side, but you could also define them as non-member functions taking two arguments.
Sun decided not to include operator overloading in Java, so they provided an in-class way (through member functions) to do that job: the equals() and compareTo() functions.
C++ has operator overloading, which allows you to specify the behaviour of the language operators within your own types.
To learn how to overload operators, I suggest you read this thread: Operator overloading
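As a concrete illustration of both options, here is a small sketch of a ModifiedString class. The member names follow the outline in the question; the constructor and the choice of std::vector<int> for the page list are assumptions. compareTo simply forwards to std::string::compare, which already returns a negative, zero, or positive value, and the operators are written as non-member functions on top of it.

```cpp
#include <string>
#include <vector>

class ModifiedString {
public:
    explicit ModifiedString(const std::string &w) : word(w) {}

    const std::string &getWord() const { return word; }
    void appendPageList(int page) { pagelist.push_back(page); }

    // Java-style three-way comparison: negative, zero, or positive.
    int compareTo(const ModifiedString &other) const {
        return word.compare(other.word);
    }

private:
    std::string word;
    std::vector<int> pagelist;  // assumed representation of the page list
};

// Non-member comparison operators built on compareTo.
bool operator<(const ModifiedString &a, const ModifiedString &b) {
    return a.compareTo(b) < 0;
}
bool operator>(const ModifiedString &a, const ModifiedString &b) {
    return b < a;
}
bool operator==(const ModifiedString &a, const ModifiedString &b) {
    return a.compareTo(b) == 0;
}
```

Defining operator< is also what lets the class be used directly as a key in std::map or std::set, which covers the "various trees" use case.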

trouble with case insensitive partial matching two strings?

I am trying to partially match two strings without case sensitivity. I do not want to use the Boost libraries, as most people don't have them on their compilers. I tried .find() from the standard C++ library, but it only checks whether the user-entered string is in the first word of the string that is already there. For example, if I have a DVD named Harry_Potter_Goblet and I search for "goblet" or "Goblet", the program doesn't show Harry_Potter_Goblet as a result; only if I do a case-sensitive search for "Harry" does the result show a match. What am I doing wrong here? Here is my code.
Define a case-insensitive character comparison function:
#include <cctype>
bool case_insensitive_comp(char lhs, char rhs)
{
return std::toupper(static_cast<unsigned char>(lhs)) == std::toupper(static_cast<unsigned char>(rhs));
}
Then, use std::search to find the sub-string within the larger string.
#include <algorithm>
....
std::string s1="Harry_Potter_Goblet";
std::string s2 = "goblet";
bool found = std::search(s1.begin(), s1.end(), s2.begin(), s2.end(), case_insensitive_comp) != s1.end();
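The two pieces can be wrapped into one reusable function, sketched below. The names equal_ignore_case and contains_ignore_case are invented for this example, and the comparison is only meaningful for single-byte (ASCII-style) text.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: compares two characters ignoring case.
bool equal_ignore_case(char lhs, char rhs) {
    return std::toupper(static_cast<unsigned char>(lhs)) ==
           std::toupper(static_cast<unsigned char>(rhs));
}

// Hypothetical helper: true if needle occurs anywhere in haystack.
bool contains_ignore_case(const std::string &haystack, const std::string &needle) {
    if (needle.empty())
        return true;  // an empty pattern matches everything
    return std::search(haystack.begin(), haystack.end(),
                       needle.begin(), needle.end(),
                       equal_ignore_case) != haystack.end();
}
```

std::search returns an iterator to the start of the first match, so the same call can also report where the match begins if that is needed.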

Sorting UTF-8 strings?

My std::strings are encoded in UTF-8 so the std::string < operator doesn't cut it. How could I compare 2 utf-8 encoded std::strings?
Where it does not cut it is accents: é comes after z, which it should not.
Thanks
If you don't want a lexicographic ordering (which is what sorting the UTF-8 encoded strings lexicographically will give you), then you will need to decode your UTF-8 encoded strings into UCS-2 or UCS-4 as appropriate, and apply a suitable comparison function of your choosing.
To reiterate the point, the UTF-8 encoding mechanism is cleverly designed so that if you sort by looking at the numeric value of each 8-bit encoded byte, you will get the same result as if you first decoded the string into Unicode and compared the numeric values of each code point.
Update: Your updated question indicates that you want a more complex comparison function than purely a lexicographic sort. You will need to decode your UTF-8 strings and compare the decoded characters.
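To make the decoding step concrete, here is a small sketch of a UTF-8 decoder. The name decode_utf8 is made up, and the function assumes well-formed input: it performs no validation of continuation bytes, overlong forms, or surrogates. Comparing the decoded vectors gives code point order, which is the same order as comparing the raw bytes; a language-aware order (é near e) needs a collation step on top.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes well-formed UTF-8 into code points.
std::vector<char32_t> decode_utf8(const std::string &s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        // Number of continuation bytes implied by the lead byte.
        const std::size_t extra =
            lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        // Payload bits of the lead byte: 7, 5, 4 or 3 of them.
        char32_t cp = lead < 0x80 ? lead : lead & (0x3F >> extra);
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k <= extra && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For instance, the two bytes 0xC3 0xA9 decode to U+00E9 (é), which is numerically greater than U+007A (z) both as bytes and as a code point, hence the ordering observed in the question.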
The standard has std::locale for locale-specific things such as collation (sorting). If the environment contains LC_COLLATE=en_US.utf8 or similar, this program will sort lines as desired.
#include <algorithm>
#include <functional>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>
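// std::binary_function was deprecated in C++11 and removed in C++17;
// std::sort does not need it, so the comparator is a plain class.
class collate_in {
protected:
std::locale loc; // kept as a member so the facet reference below stays valid
const std::collate<char> &coll;
public:
collate_in(const std::locale &l)
: loc(l), coll(std::use_facet<std::collate<char> >(loc)) {}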
bool operator()(const std::string &a, const std::string &b) const {
// std::collate::compare() takes C-style string (begin, end)s and
// returns values like strcmp or strcoll. Compare to 0 for results
// expected for a less<>-style comparator.
return coll.compare(a.c_str(), a.c_str() + a.size(),
b.c_str(), b.c_str() + b.size()) < 0;
}
};
int main() {
std::vector<std::string> v;
copy(std::istream_iterator<std::string>(std::cin),
std::istream_iterator<std::string>(), back_inserter(v));
// std::locale("") is the locale from the environment. One could also
// std::locale::global(std::locale("")) to set up this program's global
// first, and then use locale() to get the global locale, or choose a
// specific locale instead of using the environment's.
sort(v.begin(), v.end(), collate_in(std::locale("")));
copy(v.begin(), v.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));
return 0;
}
$ cat >file
f
é
e
d
^D
$ LC_COLLATE=C ./a.out <file
d
e
f
é
$ LC_COLLATE=en_US.utf8 ./a.out <file
d
e
é
f
It's been brought to my attention that std::locale::operator()(a, b) exists, obviating the std::collate<>::compare(a, b) < 0 wrapper I wrote above.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>
int main() {
std::vector<std::string> v;
copy(std::istream_iterator<std::string>(std::cin),
std::istream_iterator<std::string>(), back_inserter(v));
sort(v.begin(), v.end(), std::locale(""));
copy(v.begin(), v.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));
return 0;
}
Encoding (UTF-8, 16, etc) isn't the problem, it's whether the container itself is treating the string as Unicode string or 8-bit (ASCII or Latin-1) string that matters.
I found the question "Is there an STL and UTF-8 friendly C++ Wrapper for ICU, or other powerful Unicode library", which could help you.
One option would be to use ICU collators (http://userguide.icu-project.org/collation/api) which provide a properly internationalized "compare" method that you can then use to sort.
Chromium has a small wrapper that should be easy to copy&paste/reuse
https://code.google.com/p/chromium/codesearch#chromium/src/base/i18n/string_compare.cc&sq=package:chromium&type=cs