Sorting UTF-8 strings? - c++

My std::strings are encoded in UTF-8, so the std::string < operator doesn't cut it. How can I compare two UTF-8 encoded std::strings?
Where it falls short is accents: é sorts after z, which it should not.
Thanks

If you don't want a lexicographic ordering (which is what sorting the UTF-8 encoded strings lexicographically will give you), then you will need to decode your UTF-8 encoded strings into UCS-2 or UCS-4 as appropriate, and apply a suitable comparison function of your choosing.
To reiterate the point, the UTF-8 encoding mechanism is cleverly designed so that if you sort by looking at the numeric value of each 8-bit encoded byte, you will get the same result as if you first decoded the string into Unicode and compared the numeric values of each code point.
Update: Your updated question indicates that you want a more complex comparison function than purely a lexicographic sort. You will need to decode your UTF-8 strings and compare the decoded characters.
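To illustrate both points (byte order equals code point order, and decoding is the first step towards any other ordering), here is a minimal sketch. The helper names decode_utf8 and code_point_less are hypothetical, and the decoder assumes well-formed UTF-8 and performs no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes well-formed UTF-8 into code points.
// No validation is done, so malformed input yields garbage, not an error.
std::u32string decode_utf8(const std::string &s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 0; k < extra && i < s.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Compares by code point value. For well-formed input this gives the same
// answer as std::string's operator<, which compares bytes as unsigned values.
bool code_point_less(const std::string &a, const std::string &b) {
    return decode_utf8(a) < decode_utf8(b);
}
```

A linguistically aware ordering would replace the final comparison of code points with a collation-aware one; the decoding step stays the same.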

The standard has std::locale for locale-specific things such as collation (sorting). If the environment contains LC_COLLATE=en_US.utf8 or similar, this program will sort lines as desired.
#include <algorithm>
#include <functional>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>
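// std::binary_function was deprecated in C++11 and removed in C++17;
// std::sort does not need it.
class collate_in {
protected:
// Keep a copy of the locale so the facet reference below stays valid.
std::locale loc;
const std::collate<char> &coll;
public:
collate_in(std::locale l)
: loc(l), coll(std::use_facet<std::collate<char> >(loc)) {}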
bool operator()(const std::string &a, const std::string &b) const {
// std::collate::compare() takes C-style string (begin, end)s and
// returns values like strcmp or strcoll. Compare to 0 for results
// expected for a less<>-style comparator.
return coll.compare(a.c_str(), a.c_str() + a.size(),
b.c_str(), b.c_str() + b.size()) < 0;
}
};
int main() {
std::vector<std::string> v;
copy(std::istream_iterator<std::string>(std::cin),
std::istream_iterator<std::string>(), back_inserter(v));
// std::locale("") is the locale from the environment. One could also
// std::locale::global(std::locale("")) to set up this program's global
// first, and then use locale() to get the global locale, or choose a
// specific locale instead of using the environment's.
sort(v.begin(), v.end(), collate_in(std::locale("")));
copy(v.begin(), v.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));
return 0;
}
$ cat >file
f
é
e
d
^D
$ LC_COLLATE=C ./a.out <file
d
e
f
é
$ LC_COLLATE=en_US.utf8 ./a.out <file
d
e
é
f
It's been brought to my attention that std::locale::operator()(a, b) exists, obviating the std::collate<>::compare(a, b) < 0 wrapper I wrote above.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>
int main() {
std::vector<std::string> v;
copy(std::istream_iterator<std::string>(std::cin),
std::istream_iterator<std::string>(), back_inserter(v));
sort(v.begin(), v.end(), std::locale(""));
copy(v.begin(), v.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));
return 0;
}
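One practical wrinkle: constructing std::locale from a name throws std::runtime_error when that locale is not available on the system. The following is a minimal sketch of a fallback, not part of the original answer; sort_with_locale is a hypothetical helper name.

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: sorts with the named locale's collation rules,
// falling back to the classic "C" locale if that name is unavailable.
void sort_with_locale(std::vector<std::string> &v, const char *name) {
    std::locale loc = std::locale::classic();
    try {
        loc = std::locale(name);
    } catch (const std::runtime_error &) {
        // Locale not available here: keep the byte-wise "C" ordering.
    }
    // std::locale is itself usable as a comparator via its operator().
    std::sort(v.begin(), v.end(), loc);
}
```

Passing "" as the name selects the environment's locale, as in the programs above.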

The encoding (UTF-8, UTF-16, etc.) isn't the problem. What matters is whether the container itself treats the string as a Unicode string or as an 8-bit (ASCII or Latin-1) string.
I found the question "Is there an STL and UTF-8 friendly C++ Wrapper for ICU, or other powerful Unicode library", which could help you.

One option would be to use ICU collators (http://userguide.icu-project.org/collation/api) which provide a properly internationalized "compare" method that you can then use to sort.
Chromium has a small wrapper that should be easy to copy&paste/reuse
https://code.google.com/p/chromium/codesearch#chromium/src/base/i18n/string_compare.cc&sq=package:chromium&type=cs

Related

Is there some way to use std::remove_if on std::string_view iterators?

I'm wanting to effectively trim an already created std::string_view using an iterator that doesn't point to the trimmed characters thanks to std::remove_if(). However, I can't use std::remove_if() on a std::basic_string_view::iterator directly because that's really a std::basic_string_view::const_iterator and std::remove_if() can't take non-moveable iterators as arguments.
The only workaround I've thought of is converting the std::string_view to a std::string and then taking the iterator. Here's an example of that:
#include <string>
#include <string_view>
#include <algorithm>
#include <locale>
int main() {
std::string_view foo{"Whitepace...\nThe Final Frontier"};
const auto is_space{
[](const auto& character) {
return std::isspace(character, std::locale{});
}
};
// Doesn't compile
//auto without_conversion{
// std::remove_if(foo.begin(), foo.end(), is_space)
//};
// Works, for the most part.
auto with_conversion{
std::remove_if(std::string{foo}.begin(), std::string{foo}.end(), is_space)
};
}
But this kinda defeats the whole point of using std::string_view, as a string_view constructed from this iterator wouldn't be viewing the original string.
Is there some (preferably elegant) way to do this while keeping the view on the original string? Perhaps some way to make the string_view iterator non-const?
If your goal is to trim a string_view of spaces, and store the result in a std::string, then you should choose the appropriate algorithm that allows const iterators.
One such algorithm is std::copy_if:
#include <iostream>
#include <string_view>
#include <algorithm>
#include <iterator>
#include <cctype>
int main()
{
std::string_view foo{"Whitepace...\nThe Final Frontier"};
std::string result;
std::copy_if(foo.begin(), foo.end(), std::back_inserter(result), [](char ch)
{ return !std::isspace(static_cast<unsigned char>(ch)); });
std::cout << result;
}
Output:
Whitepace...TheFinalFrontier
std::string_view is a constant view of the string sequence.
For example, begin returns a const_iterator.
https://en.cppreference.com/w/cpp/string/basic_string_view/begin
Maybe you will have better luck with std::span, however take into account that literals in the program are always immutable.
You have to make a copy first anyway.
Also your last line doesn't do what you think because you are iterating over different temporaries, even if it compiles.
The correct code is, for example:
std::string FOO{foo}; // the string_view-to-string constructor is explicit
auto with_conversion{
std::remove_if(FOO.begin(), FOO.end(), is_space)
};
In other words, the whole idea of your program (that you can modify a "program" string) is flawed in the first place.
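If "trim" means stripping whitespace only from the two ends (an assumption; the question's remove_if would also drop interior whitespace), then no copy is needed, because a narrower view of the same characters is enough. A minimal sketch, with trim_view as a hypothetical helper name and a fixed ASCII whitespace set:

```cpp
#include <string_view>

// Hypothetical helper: returns a view of sv without leading and trailing
// ASCII whitespace. The result still points into the original characters.
std::string_view trim_view(std::string_view sv) {
    constexpr std::string_view whitespace = " \t\n\r\f\v";
    const auto first = sv.find_first_not_of(whitespace);
    if (first == std::string_view::npos)
        return sv.substr(0, 0);  // all whitespace: empty view
    const auto last = sv.find_last_not_of(whitespace);
    return sv.substr(first, last - first + 1);
}
```

Removing characters from the middle cannot be expressed as a single view, which is why that case needs a copy.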

How to convert a std::set of strings to lowercase c++ using some combination of for_each, transform, iterators and lambda?

I'm trying to convert a set of strings to lowercase without using low level loops like while and for(;;) because I'm practicing using the STL. I was thinking about using for_each and transform and lambda but I'm not really sure how they work.
#include <string>
#include <iterator>
#include <algorithm>
#include <set>
using namespace std;
int main()
{
set<string> words;
//insert a bunch of words using words.insert().....
//convert everything in words to lower case
return 0;
}
How would I convert each string in the set words using any combination of for_each, transform, iterators and lambdas?
I was thinking of doing something like: transform(words.begin(), words.end(), words.begin(), ??lambda??) however I don't know how to do the 4th parameter
You can't do something like transform(words.begin(), words.end(), words.begin(), ??lambda??),
because words.begin() would be a const iterator.
You'll have to do something like this:
set<string> words;
set<string> new_words;
transform(words.begin(), words.end(), std::inserter(new_words, new_words.begin()), [](string s) {
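// Convert via unsigned char: passing a negative char to tolower is undefined.
transform(s.begin(), s.end(), s.begin(), [](unsigned char c) { return static_cast<char>(::tolower(c)); });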
return s;
});
return new_words;
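For reference, the same approach as a self-contained function. This is only a sketch: to_lower_set is a hypothetical name, and std::tolower handles single-byte characters in the current C locale only.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <set>
#include <string>

// Hypothetical helper: builds a new set holding the lowercased words.
std::set<std::string> to_lower_set(const std::set<std::string> &words) {
    std::set<std::string> lowered;
    std::transform(words.begin(), words.end(),
                   std::inserter(lowered, lowered.end()),
                   [](std::string s) {  // taken by value so it can be edited
                       std::transform(s.begin(), s.end(), s.begin(),
                                      [](unsigned char c) {
                                          return static_cast<char>(std::tolower(c));
                                      });
                       return s;
                   });
    return lowered;
}
```

Because words that differ only in case collapse into one element, the resulting set can be smaller than the input.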
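Note that the elements of a set are const, so they cannot be modified in place through the set's iterators. To lowercase a single std::string (called word here), try
std::for_each(word.begin(), word.end(), [](char & c) { c = static_cast<char>(tolower(static_cast<unsigned char>(c))); });
and include cctype.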

How to order strings case-insensitively (not lexicographically)?

I'm attempting to order a list input from a file alphabetically (not lexicographically). So, if the list were:
C
d
A
b
I need it to become:
A
b
C
d
Not the lexicographic ordering:
A
C
b
d
I'm using string variables to hold the input, so I'm looking for some way to modify the strings I'm comparing to all uppercase or lowercase, or if there's some easier way to force an alphabetic comparison, please impart that wisdom. Thanks!
I should also mention that we are limited to the following libraries for this assignment: iostream, iomanip, fstream, string, as well as C libraries, like cstring, cctype, etc.
It looks like I'm just going to have to defeat this problem via some very tedious method of character extraction and toupper-ing for each string.
Converting the individual strings to upper case and comparing them is not made particularly worse by being restricted from using algorithm, iterator, etc. The comparison logic is about four lines of code. It would be nice not to have to write those four lines, but having to write a sorting algorithm is far more difficult and tedious. (This assumes that the usual C version of toupper is acceptable in the first place.)
Below I show a simple strcasecmp() implementation and then put it to use in a complete program which uses restricted libraries. The implementation of strcasecmp() itself doesn't use restricted libraries.
#include <string>
#include <cctype>
#include <iostream>
void toupper(std::string &s) {
for (char &c : s)
c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}
bool strcasecmp(std::string lhs, std::string rhs) {
toupper(lhs); toupper(rhs);
return lhs < rhs;
}
// restricted libraries used below
#include <algorithm>
#include <iterator>
#include <vector>
// Example usage:
// > ./a.out <<< "C d A b"
// A b C d
int main() {
std::vector<std::string> input;
std::string word;
while(std::cin >> word) {
input.push_back(word);
}
std::sort(std::begin(input), std::end(input), strcasecmp);
std::copy(std::begin(input), std::end(input),
std::ostream_iterator<std::string>(std::cout, " "));
std::cout << '\n';
}
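If std::sort really is off limits, the sorting step has to be written by hand. The following is a minimal sketch using only string, cctype and cstddef; the names less_ignore_case and insertion_sort are hypothetical, and insertion sort is chosen here only because it is short, not because the assignment asks for it.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Case-insensitive "less than", comparing character by character.
bool less_ignore_case(const std::string &a, const std::string &b) {
    std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        int ca = std::toupper(static_cast<unsigned char>(a[i]));
        int cb = std::toupper(static_cast<unsigned char>(b[i]));
        if (ca != cb) return ca < cb;
    }
    return a.size() < b.size();  // a shorter prefix sorts first
}

// Plain insertion sort over an array of strings.
void insertion_sort(std::string items[], std::size_t count) {
    for (std::size_t i = 1; i < count; ++i) {
        std::string key = items[i];
        std::size_t j = i;
        // Shift larger elements right until key's position is found.
        while (j > 0 && less_ignore_case(key, items[j - 1])) {
            items[j] = items[j - 1];
            --j;
        }
        items[j] = key;
    }
}
```

Unlike the version above, this comparison does not copy or modify the strings being compared.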
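You don't have to modify the strings before sorting. You can sort in place with a case-insensitive single-character comparator and std::sort. Note that, as written, the snippet below orders the characters inside a single string; to order a list of strings, apply the same comparison character by character.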
bool case_insensitive_cmp(char lhs, char rhs) {
return ::toupper(static_cast<unsigned char>(lhs)) <
::toupper(static_cast<unsigned char>(rhs));
}
std::string input = ....;
std::sort(input.begin(), input.end(), case_insensitive_cmp);
std::vector<std::string> vec {"A", "a", "lorem", "Z"};
std::sort(vec.begin(),
vec.end(),
[](const std::string& s1, const std::string& s2) {
// strcasecmp() is POSIX (<strings.h>), not standard C++.
return strcasecmp(s1.c_str(), s2.c_str()) < 0;
});
Use strcasecmp() as comparison function in qsort().
I am not completely sure how to write it, but what you want to do is convert the strings to lower or uppercase.
If the strings are in an array to begin with, you would run through the list, and save the indexes in order in an (int) array.
If you're just comparing ASCII letters, then a terrible hack which will work is to mask off the upper bits of each character, keeping only the low five (c & 0x1F). Then upper and lower case letters fall on top of each other.

How to create the right interface for std::transform

The signature of transform is:
OutputIterator transform (InputIterator first1, InputIterator last1,
OutputIterator result, UnaryOperation op);
I want to create a generic token-replacing functor, msg_parser (below), so that I can use any container (a string in the example below) and pass the container's begin and end to transform. That's the idea.
But I can't get this to compile.
Here is my code. Any help would be much appreciated.
#include <iostream>
#include <iterator>
#include <string>
#include <map>
#include <algorithm>
class msg_parser {
public:
msg_parser(const std::map<std::string, std::string>& mapping, const char token = '$')
: map_(mapping), token_(token) {}
// I can use a generic istream type interface to handle the parsing.
std::ostream_iterator operator() (std::istream_iterator in) {
//body will go through input and when get to end of input return output
}
private:
const char token_;
const std::map<std::string, std::string>& map_;
};
int main(int argc, char* argv[]) {
std::map<std::string, std::string> mapping;
mapping["author"] = "Winston Churchill";
std::string str_test("I am $(author)");
std::string str_out;
std::transform(str_test.begin(), str_test.end(), str_out.begin(), msg_parser(mapping));
return 0;
}
Since std::string is a collection of chars, std::transform will iterate over chars exactly distance(first1, last1) times, so in your case it's not possible to change the size of the string. You may be able to transform "$(author)" into another string exactly the same size, though, but I guess it's not what you want.
You probably want to iterate over stream iterators instead of chars:
std::stringstream istrstr(str_test);
std::stringstream ostrstr;
std::transform(std::istream_iterator<std::string>(istrstr),
std::istream_iterator<std::string>(),
std::ostream_iterator<std::string>(ostrstr, " "), // note the delimiter
msg_parser(mapping));
std::cout << ostrstr.str() << std::endl;
By the way, your UnaryOperation works on the iterated type, not on iterators, so operator() should be:
std::string operator() (std::string in) { // ...
You should read the documentation and examples for std::transform in a good reference.
You'll notice that the operation takes an element of the input container and generates an element for the output container. Since your containers are strings and the elements are chars, the signature should be char operator()(char). Iterators would be the wrong parameter type in this case. In any case, the iterators of std::string iterate over chars, so std::ostream_iterator and std::istream_iterator make no sense here.
Having said that, you will notice that transform, when applied to your string, works on single characters, not on the whole "$(author)" substring. What you are trying to do is best achieved with C++11's std::regex regular expression library, not with std::transform.
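A minimal sketch of the std::regex route, assuming tokens have the form $(name) as in the question's test string. The function name replace_tokens and the choice to leave unknown tokens untouched are assumptions, not something the question specifies.

```cpp
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: replaces each $(name) with mapping[name].
// Tokens whose name is not in the mapping are copied through unchanged.
std::string replace_tokens(const std::string &text,
                           const std::map<std::string, std::string> &mapping) {
    static const std::regex token(R"re(\$\(([^)]*)\))re");
    std::string out;
    std::string::const_iterator last = text.begin();
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), token);
         it != end; ++it) {
        const std::smatch &m = *it;
        out.append(last, m[0].first);  // literal text before the token
        const auto found = mapping.find(m[1].str());
        out += (found != mapping.end()) ? found->second : m[0].str();
        last = m[0].second;            // continue after the token
    }
    out.append(last, text.end());      // trailing literal text
    return out;
}
```

Unlike std::transform, this can change the length of the output, which is what substituting "$(author)" with a longer name requires.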

How to get a string of union set from a vector string?

I have a vector string filled with some file extensions as follows:
vector<string> vExt;
vExt.push_back("*.JPG;*.TGA;*.TIF");
vExt.push_back("*.PNG;*.RAW");
vExt.push_back("*.BMP;*.HDF");
vExt.push_back("*.GIF");
vExt.push_back("*.JPG");
vExt.push_back("*.BMP");
I now want to get a string of union set from the above-mentioned vector string, in which each file extension must be unique in the resulting string. As for my given example, the resulting string should take the form of "*.JPG;*.TGA;*.TIF;*.PNG;*.RAW;*.BMP;*.HDF;*.GIF".
I know that std::unique can remove consecutive duplicates in a range, but that doesn't work in my case. Would you please show me how to do that? Thank you!
See it live here: http://ideone.com/0fmy0 (FIXED)
#include <iostream>
#include <sstream>
#include <iterator>
#include <algorithm>
#include <vector>
#include <set>
#include <string>
int main()
{
std::vector<std::string> vExt;
vExt.push_back("*.JPG;*.TGA;*.TIF");
vExt.push_back("*.PNG;*.RAW");
vExt.push_back("*.BMP;*.HDF");
vExt.push_back("*.GIF");
vExt.push_back("*.JPG");
vExt.push_back("*.BMP");
std::stringstream ss;
std::copy(vExt.begin(), vExt.end(),
std::ostream_iterator<std::string>(ss, ";"));
std::string element;
std::set<std::string> unique;
while (std::getline(ss, element, ';'))
unique.insert(unique.end(), element);
std::stringstream oss;
std::copy(unique.begin(), unique.end(),
std::ostream_iterator<std::string>(oss, ";"));
std::cout << oss.str() << std::endl;
return 0;
}
output:
*.BMP;*.GIF;*.HDF;*.JPG;*.PNG;*.RAW;*.TGA;*.TIF;
I'd tokenize each string into constituent parts (using semicolon as the separator), and stick the resulting tokens into a set. The resultant contents of that set is what you're looking for.
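A std::set iterates in sorted order, whereas the expected string in the question keeps the order of first appearance. The following is a minimal sketch of the same tokenize-and-deduplicate idea that preserves that order; unique_extensions is a hypothetical name, and the set is used only to remember what has already been seen.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: joins all distinct extensions with ';',
// keeping each one at the position where it first appeared.
std::string unique_extensions(const std::vector<std::string> &groups) {
    std::set<std::string> seen;
    std::string result;
    for (const std::string &group : groups) {
        std::istringstream tokens(group);
        std::string ext;
        while (std::getline(tokens, ext, ';')) {
            // insert().second is false when ext was already in the set.
            if (ext.empty() || !seen.insert(ext).second)
                continue;
            if (!result.empty()) result += ';';
            result += ext;
        }
    }
    return result;
}
```

This also avoids the trailing semicolon that joining with std::ostream_iterator produces.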
You need to parse the strings that contain multiple file extensions and then push the individual extensions into the vector. After that, sorting the vector and applying std::unique (followed by erase) will do what you want. Have a look at the Boost.Tokenizer class, which should make the parsing trivial.