How to make languages-friendly function to lower?


I want a single 'to lower' function (for a word) that works correctly in two languages, for example English and Russian. What should I do? Should I use std::wstring for it, or can I stick with std::string?
I also want it to be cross-platform, and I don't want to reinvent the wheel.

The canonical library for this kind of thing is ICU:
http://site.icu-project.org/
There is also a boost wrapper:
http://www.boost.org/doc/libs/1_55_0/libs/locale/doc/html/index.html
See also this question:
Is there an STL and UTF-8 friendly C++ Wrapper for ICU, or other powerful Unicode library
First make sure that you understand the concept of locales, and that you have a firm grasp of what Unicode and, more generally, character encodings are all about.
Some good reads for a quick start:
http://joelonsoftware.com/articles/Unicode.html
http://en.wikipedia.org/wiki/Locale

I think this solution is OK. I'm not sure it suits every situation, but it may well be enough.
#include <locale>
#include <codecvt>
#include <string>
std::string toLowerCase(const std::string& word) {
    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    // Throws std::runtime_error if this locale is not installed on the system.
    std::locale loc("en_US.UTF-8");
    std::wstring wword = conv.from_bytes(word);
    for (std::size_t i = 0; i < wword.length(); ++i) {
        // Lowercase the decoded wide character, not the raw UTF-8 byte.
        wword[i] = std::tolower(wword[i], loc);
    }
    return conv.to_bytes(wword);
}
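If pulling in ICU or Boost.Locale is not an option and the input is known to be UTF-8 text limited to English and Russian, the case mapping can also be written directly on std::string. The following is only a minimal sketch under those assumptions: the helper names lower_code_point and to_lower_utf8 are hypothetical, only ASCII and the Cyrillic capitals U+0400 to U+042F are handled, and everything else is copied through unchanged. It does not depend on which locales are installed, but it is no substitute for a real Unicode library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: lowercases one code point. Covers only ASCII and the
// Cyrillic capitals U+0400..U+042F; any other value is returned unchanged.
char32_t lower_code_point(char32_t c) {
    if (c >= U'A' && c <= U'Z') return c + 0x20;        // A-Z -> a-z
    if (c >= 0x0410 && c <= 0x042F) return c + 0x20;    // U+0410..U+042F -> U+0430..U+044F
    if (c >= 0x0400 && c <= 0x040F) return c + 0x50;    // U+0400..U+040F -> U+0450..U+045F
    return c;
}

// Hypothetical helper: lowercases a UTF-8 string. Only 1-byte and 2-byte
// sequences are decoded; longer or malformed sequences are copied as-is.
std::string to_lower_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        const unsigned char b1 =
            i + 1 < in.size() ? static_cast<unsigned char>(in[i + 1]) : 0;
        if (b0 < 0x80) {
            // Plain ASCII byte.
            out += static_cast<char>(lower_code_point(b0));
        } else if ((b0 & 0xE0) == 0xC0 && (b1 & 0xC0) == 0x80) {
            // Well-formed 2-byte sequence: decode, map, re-encode.
            char32_t cp = (static_cast<char32_t>(b0 & 0x1F) << 6) | (b1 & 0x3F);
            cp = lower_code_point(cp);
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
            ++i;  // the continuation byte has been consumed
        } else {
            out += in[i];
        }
    }
    return out;
}
```

Every mapped code point stays below U+0800, so a 2-byte sequence always re-encodes to 2 bytes and the result has the same byte length as the input.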

Related

Print unicode char

I tried a very simple code in C++:
#include <iostream>
#include <string>
int main()
{
    std::wstring test = L"asdfa-";
    test += u'ç';
    std::wcout << test;
}
But the result was:
asdfa-?
I could not print 'ç' with either cout or wcout. How can I print this string correctly?
OS: Linux.
PS: I use wstring instead of string because I sometimes need to calculate the length of the string, and that length must match what is shown on the screen.
PS: I need to concatenate the Unicode character; it can't go in the string constructor.

First, here's something that does work:
#include <iostream>
#include <string>
int main() {
    std::string test = "asdfa-";
    test += "ç";
    std::cout << test;
}
I used just regular strings here and let C++ keep everything in UTF-8. I think you already know that this would work because you mentioned that you wanted to concatenate the ç rather than just leaving it in the string constructor.
Dealing with char, char16_t, char32_t, and wchar_t in C++ has never really been fun. You have to be careful with the L, u, and U prefixes.
However, where possible, if you deal with utf-8 strings, and avoid characters, you can generally get things to work much better. And since most consoles (with the possible exception of old Windows machines) understand utf-8 pretty well, this is the approach that often just works the best. So if you have wide characters, see if you can convert them to regular std::string objects and work in that domain.
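The question mentions needing the on-screen length, which is the usual reason for reaching for wstring. With UTF-8 in a std::string, a rough equivalent is to count code points instead of bytes. The sketch below assumes the input is valid UTF-8, and the name utf8_length is hypothetical; note that code points are still only an approximation of screen columns (combining marks and double-width characters break the one-to-one mapping).

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // this byte starts a new code point
        }
    }
    return count;
}
```

For the string from the question, "asdfa-ç" encoded as UTF-8, size() reports 8 bytes while utf8_length reports 7 code points.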

One general way of handling this would be:
Input (convert from multibyte to wide using current locale)
Your App: work with wide strings
Output or saving to a file (convert from wide to multibyte)
For wide string manipulations such as counting characters or taking substrings, there is the wcsXXX family of functions.
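The input and output steps above can be sketched with the C library functions std::mbstowcs and std::wcstombs, which convert according to the current C locale. The wrapper names to_wide and to_multibyte are hypothetical. This assumes the program has already selected a suitable locale (for example with std::setlocale(LC_ALL, "") at startup); under the default "C" locale only ASCII is guaranteed to convert. Both functions stop at the first embedded null character.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Multibyte (current C locale) -> wide. Hypothetical wrapper name.
std::wstring to_wide(const std::string& bytes) {
    // A multibyte string never decodes to more wide characters than bytes.
    std::wstring wide(bytes.size(), L'\0');
    const std::size_t n = std::mbstowcs(&wide[0], bytes.c_str(), wide.size());
    if (n == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multibyte sequence for this locale");
    }
    wide.resize(n);
    return wide;
}

// Wide -> multibyte (current C locale). Hypothetical wrapper name.
std::string to_multibyte(const std::wstring& wide) {
    // MB_CUR_MAX is the largest number of bytes one character can need.
    std::string bytes(wide.size() * MB_CUR_MAX, '\0');
    const std::size_t n = std::wcstombs(&bytes[0], wide.c_str(), bytes.size());
    if (n == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("character not representable in this locale");
    }
    bytes.resize(n);
    return bytes;
}
```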

If you are using libstdc++ on Linux: you forgot an essential call at the beginning of the program
std::locale::global(std::locale(""));
This is assuming you are on Linux and your locale supports UTF-8.
If you are using libc++: forget about using wstreams. This library does not support I/O of wide characters in a useful way (i.e. translation to UTF-8 like libstdc++ does).
Windows has a wholly separate set of quirks regarding Unicode. You are lucky if you don't have to deal with them.
demo with gcc/libstdc++ and a call to std::locale
demo with gcc/libstdc++ and no call to std::locale
Different versions of clang/libc++ behave differently with this example: some output ? instead of the non-ascii char, some output nothing; some crash on call to std::locale, some don't. None do the right thing, which is printing the ç, or maybe I just haven't found one that works. I don't recommend using libc++ if you need anything related to locale or wchar_t.

I solved this problem using a conversion function:
#include <iostream>
#include <string>
#include <codecvt>
#include <locale>
std::string wstr2str(const std::wstring& wstr) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    return myconv.to_bytes(wstr);
}
int main()
{
    std::wstring test = L"asdfa-";
    test += L'ç';
    std::string str = wstr2str(test);
    std::cout << str;
}

What is the equivalent of `string` in C++

In Python, there is a type named string, what is the exact equivalent of python's string in C++?

The equivalent is std::string or std::wstring declared in the <string> header file.
Note, though, that Python probably behaves differently when it comes to automatic conversion to Unicode strings, as mentioned in @Vincent Savard's comment.
To overcome these problems in C++ we use additional libraries such as libiconv, which is available on a wide range of platforms.
You should seriously note to do some better research before asking at Stack Overflow, or ask your question more clearly. std::string is ubiquitous.

You could either use std::string (see the std::string reference for its interface)
or use a char array (or const char*) to represent a basic sequence of characters that can serve as a primitive string.

Do you mean the std::string family?
#include <iostream>
#include <string>
int main() {
    const std::string example = "test";
    std::string exclaim = example + "!";
    std::cout << exclaim << std::endl;
    return 0;
}

C++ & Boost: encode/decode UTF-8

I'm trying to do a very simple task: take a unicode-aware wstring and convert it to a string, encoded as UTF8 bytes, and then the opposite way around: take a string containing UTF8 bytes and convert it to unicode-aware wstring.
The problem is, I need it cross-platform and I need it work with Boost... and I just can't seem to figure a way to make it work. I've been toying with
http://www.edobashira.com/2010/03/using-boost-code-facet-for-reading-utf8.html and
http://www.boost.org/doc/libs/1_46_0/libs/serialization/doc/codecvt.html
I tried to convert the code to use stringstream/wstringstream instead of files or whatever, but nothing seems to work.
For instance, in Python it would look like so:
>>> u"שלום"
u'\u05e9\u05dc\u05d5\u05dd'
>>> u"שלום".encode("utf8")
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'.decode("utf8")
u'\u05e9\u05dc\u05d5\u05dd'
What I'm ultimately after is this:
wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
wstring ws(uchars);
string s = encode_utf8(ws);
// s now holds "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"
wstring ws2 = decode_utf8(s);
// ws2 now holds {0x5e9, 0x5dc, 0x5d5, 0x5dd}
I really don't want to add another dependency on the ICU or something in that spirit... but to my understanding, it should be possible with Boost.
Some sample code would greatly be appreciated! Thanks

Thanks everyone, but ultimately I resorted to http://utfcpp.sourceforge.net/ -- it's a header-only library that's very lightweight and easy to use. I'm sharing a demo code here, should anyone find it useful:
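#include <iterator>
#include <string>
#include "utf8.h"
// Note: utf8to32/utf32to8 fit std::wstring only where wchar_t is 32 bits wide
// (e.g. Linux); with a 16-bit wchar_t (Windows) use utf8to16/utf16to8 instead.
inline void decode_utf8(const std::string& bytes, std::wstring& wstr)
{
    utf8::utf8to32(bytes.begin(), bytes.end(), std::back_inserter(wstr));
}
inline void encode_utf8(const std::wstring& wstr, std::string& bytes)
{
    utf8::utf32to8(wstr.begin(), wstr.end(), std::back_inserter(bytes));
}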
Usage:
wstring ws(L"\u05e9\u05dc\u05d5\u05dd");
string s;
encode_utf8(ws, s);
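For readers who want to see what such an encoding step involves, here is a dependency-free sketch of UTF-32 to UTF-8 encoding. It is not utfcpp's implementation: the function name utf32_to_utf8 is hypothetical, it takes a std::u32string so that the element width does not depend on the platform's wchar_t, and it assumes every input value is a valid Unicode scalar value (no surrogates, nothing above U+10FFFF), which a real library would validate.

```cpp
#include <string>

// Encodes Unicode code points as UTF-8. Hypothetical name; performs no
// validation, so the caller must pass valid scalar values only.
std::string utf32_to_utf8(const std::u32string& code_points) {
    std::string out;
    for (char32_t cp : code_points) {
        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Passing the four Hebrew code points from the question (0x5e9, 0x5dc, 0x5d5, 0x5dd) yields the bytes "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d", matching the Python session shown there.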

There's already a boost link in the comments, but in the almost-standard C++0x, there is wstring_convert that does this
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
int main()
{
    wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string s = conv.to_bytes(uchars);
    std::wstring ws2 = conv.from_bytes(s);
    std::cout << std::boolalpha
              << (s == "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d" ) << '\n'
              << (ws2 == uchars ) << '\n';
}
Output when compiled with MS Visual Studio 2010 EE SP1 or with Clang++ 2.9:
true
true

Boost.Locale was released in Boost 1.48 (November 15th, 2011), making it easier to convert to and from UTF-8/UTF-16.
Here are some convenient examples from the docs:
string utf8_string = to_utf<char>(latin1_string,"Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
string latin1_string = from_utf(wide_string,"Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);
Almost as easy as Python encoding/decoding :)
Note that Boost.Locale is not a header-only library.

For a drop-in replacement for std::string/std::wstring that handles utf8, see TINYUTF8.
In combination with <codecvt> you can convert pretty much from/to every encoding from/to utf8, which you then handle through the above library.

Wide to narrow characters

What is the cleanest way of converting a std::wstring into a std::string? I have used W2A et al macros in the past, but I have never liked them.

What you might be looking for is icu, an open-source, cross-platform library for dealing with Unicode and legacy encodings amongst many other things.

The most native way is std::ctype<wchar_t>::narrow(), but that does little more than std::copy as gishu suggested and you still need to manage your own buffers.
If you're not trying to perform any translation but just want a one-liner, you can do std::string my_string( my_wstring.begin(), my_wstring.end() ).
If you want actual encoding translation, you can use locales/codecvt or one of the libraries from another answer, but I'm guessing that's not what you're looking for.
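The one-liner above is only lossless when every character is ASCII, since each wchar_t is simply narrowed to a char. A minimal sketch of a guarded version follows; the name narrow_ascii and the choice to throw std::range_error are assumptions for illustration, not part of any library.

```cpp
#include <stdexcept>
#include <string>

// Narrows a wide string that is expected to contain only ASCII.
// Throws instead of silently truncating anything outside 0..0x7F.
std::string narrow_ascii(const std::wstring& wide) {
    std::string out;
    out.reserve(wide.size());
    for (wchar_t wc : wide) {
        // The cast also turns negative values (signed wchar_t) into large ones.
        if (static_cast<unsigned long>(wc) > 0x7F) {
            throw std::range_error("non-ASCII character in wide string");
        }
        out += static_cast<char>(wc);
    }
    return out;
}
```

For anything beyond ASCII, an actual encoding conversion (codecvt, ICU, or one of the other libraries mentioned in the answers) is needed.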

Since this is one of the first results for a search of "c++ narrow string," and it is from before C++11, here is the C++11 way of solving this problem:
#include <codecvt>
#include <locale>
#include <string>
std::string narrow( const std::wstring& str ){
    std::wstring_convert<
        std::codecvt_utf8_utf16< std::wstring::value_type >,
        std::wstring::value_type
    > utf16conv;
    return utf16conv.to_bytes( str );
}
std::wstring_convert: http://en.cppreference.com/w/cpp/locale/wstring_convert
std::codecvt_utf8_utf16: http://en.cppreference.com/w/cpp/locale/codecvt_utf8_utf16

If the encoding in the wstring is UTF-16 and you want conversion to a UTF-8 encoded string, you can use UTF8 CPP library:
utf8::utf16to8(wstr.begin(), wstr.end(), back_inserter(str));

See if this helps. This one uses std::copy to achieve your goal.
http://www.codeguru.com/forum/archive/index.php/t-193852.html

I don't know if it's the "cleanest" but I've used copy() function without any problems so far.
#include <iostream>
#include <algorithm>
#include <string>
using namespace std;
// Note: copy() narrows each element without any encoding conversion,
// so this is only correct for ASCII content.
string wstring2string(const wstring & wstr)
{
    string str(wstr.length(), ' ');
    copy(wstr.begin(), wstr.end(), str.begin());
    return str;
}
wstring string2wstring(const string & str)
{
    wstring wstr(str.length(), L' ');
    copy(str.begin(), str.end(), wstr.begin());
    return wstr;
}
http://agraja.wordpress.com/2008/09/08/cpp-string-wstring-conversion/

C++: what is the optimal way to convert a double to a string?

What is the most optimal way to achieve the same as this?
void foo(double floatValue, char* stringResult)
{
    sprintf(stringResult, "%f", floatValue);
}

I'm sure someone will say boost::lexical_cast, so go for that if you're using boost, but it's basically the same as this anyway:
#include <sstream>
#include <string>
std::string doubleToString(double d)
{
    std::ostringstream ss;
    ss << d;
    return ss.str();
}
Note that you could easily make this into a template that works on anything that can be stream-inserted (not just doubles).
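One difference from the sprintf version in the question is worth spelling out: a plain ss << d uses the stream's default formatting (6 significant digits, trailing zeros dropped), whereas "%f" always prints 6 digits after the decimal point. The sketch below shows how the stream version could be made to match; the function name doubleToFixedString and its digits parameter are hypothetical.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Formats like printf("%.*f", digits, d) using stream manipulators.
std::string doubleToFixedString(double d, int digits = 6)
{
    std::ostringstream ss;
    ss << std::fixed << std::setprecision(digits) << d;
    return ss.str();
}
```

With this, doubleToFixedString(1.5) gives "1.500000", the same text sprintf produces with "%f", while the unmodified doubleToString(1.5) gives "1.5".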

http://www.cplusplus.com/reference/iostream/stringstream/
double d=123.456;
stringstream s;
s << d; // insert d into s

boost::lexical_cast<>

On dinkumware STL, the stringstream is filled out by the C library snprintf.
Thus using snprintf formatting directly will be comparable with the STL formatting part.
But someone once told me that the whole is greater than or equal to the sum of its known parts.
Since it is platform dependent whether stringstream will do an allocation (and I am quite sure that DINKUMWARE DOES NOT YET include a small buffer in stringstream for conversions of single items like yours), it is truly doubtful that ANYTHING that requires an allocation (ESPECIALLY if MULTITHREADED) can compete with snprintf.
In fact (formatting+allocation) has a chance of being really terrible as an allocation and a release might well require 2 full read-modify-write cycles in a multithreaded environment unless the allocation implementation has a thread local small heap.
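That being said, if I were truly concerned about performance, I would take the advice from some of the other comments above, change the interface to include a size, and use snprintf, i.e.
#include <cstddef>
#include <cstdio>
// Returns true if the formatted value, including its terminator, fit into p[0..n).
bool foo(const double d, char* const p, const std::size_t n) {
    const int written = std::snprintf(p, n, "%f", d);
    // snprintf returns the length it wanted to write, or a negative value on error.
    return written >= 0 && static_cast<std::size_t>(written) < n;
}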
If you want a std::string you are still better off using the above and instantiating the string from the resultant char* as there will be 2 allocations + 2 releases involved with the std::stringstream, std::string solution.
BTW I cannot tell if the "string" in the question is std::string or just generic ascii chars usage of "string"
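The std::string variant described above could look roughly like the following sketch. The function name format_double and the 64-byte stack buffer are assumptions chosen for illustration; the fallback path covers the rare values whose "%f" representation is longer than the buffer (very large magnitudes).

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Formats a double with "%f" via snprintf and returns it as std::string.
std::string format_double(double value) {
    char buffer[64];
    const int needed = std::snprintf(buffer, sizeof buffer, "%f", value);
    if (needed < 0) {
        return std::string();  // snprintf reported an error
    }
    const std::size_t length = static_cast<std::size_t>(needed);
    if (length < sizeof buffer) {
        // Common case: one formatting pass, then one string construction.
        return std::string(buffer, length);
    }
    // Rare case: the output was truncated, so format again into a
    // string that is large enough (+1 for the terminator).
    std::string result(length + 1, '\0');
    std::snprintf(&result[0], result.size(), "%f", value);
    result.resize(length);
    return result;
}
```

Whether this is actually faster than a stringstream depends on the standard library and allocator in use, so it is worth measuring before committing to it.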

The best thing to do would be to build a simple templatized function to convert any streamable type into a string. Here's the way I do it:
#include <sstream>
#include <string>
template <typename T>
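std::string to_string(const T& data)
{
    std::ostringstream conv;
    conv << data;
    return conv.str();
}
If you want a const char* representation, call c_str() on the returned std::string at the point of use. Do not return conv.str().c_str() from the function: the temporary string is destroyed when the function returns, so the pointer would dangle.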

I'd probably go with what you suggested in your question, since there's no built-in ftoa() function and sprintf gives you control over the format. A google search for "ftoa asm" yields some possibly useful results, but I'm not sure you want to go that far.

I'd say sprintf is pretty much the optimal way. You may prefer snprintf over it, but it doesn't have much to do with performance.

Herb Sutter has done an extensive study on the alternatives for converting an int to a string, but I would think his arguments hold for a double as well.
He looks at the balances between safety, efficiency, code clarity and usability in templates.
Read it here: http://www.gotw.ca/publications/mill19.htm

_gcvt or _gcvt_s.

If you use the Qt4 framework, you could write:
double d = 5.5;
QString num = QString::number(d);

This is a very useful thread. I use sprintf_s for this, but I started to doubt whether it is really faster than the other ways. I came across the following document on the Boost website, which compares the performance of printf/scanf, stringstream and Boost.
Double to string is the most common conversion we do in our code, so I'll stick with what I've been using. In other scenarios, though, using Boost could be the deciding factor.
http://www.boost.org/doc/libs/1_58_0/doc/html/boost_lexical_cast/performance.html

In the future, you can use std::to_chars to write code like https://godbolt.org/z/cEO4Sd . Unfortunately, only VS2017 and VS2019 support part of this functionality...
#include <iostream>
#include <charconv>
#include <system_error>
#include <string_view>
#include <array>
int main()
{
    std::array<char, 10> chars;
    auto [parsed, error] = std::to_chars(
        chars.data(),
        chars.data() + chars.size(),
        static_cast<double>(12345.234)
    );
    std::cout << std::string_view(chars.data(), parsed - chars.data());
}
For a lengthy discussion on MSVC details, see
https://www.reddit.com/r/cpp/comments/a2mpaj/how_to_use_the_newest_c_string_conversion/eazo82q/
