Issues with Wide Characters in C++

Issues with Wide Characters in C++ - c++

I have a program that is meant to read in a text file of words (each on a separate line), and then print out a random word from that file. It also gives you the ability to select a non-English language (e.g., Greek or Russian). Because of the latter condition, I use std::wstring to capture the text. Here is the code:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <cstdlib>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/random_device.hpp>
#include <boost/random/uniform_int_distribution.hpp>
int main(int argc, char* argv[]) {
if (argc != 2) {
std::cout << "Usage: word [lang]" << std::endl;
std::cout << "\tlang: Choose from de,en,es,fr,gr,it,la,ru" << std::endl;
return EXIT_FAILURE;
}
std::string file = static_cast<std::string>("C:\\util_bin\\data\\words_") + static_cast<std::string>(argv[1]) + static_cast<std::string>(".txt");
std::wfstream fin(file, std::wifstream::in);
std::vector<std::wstring> data;
std::wstring line;
while (std::getline(fin, line))
data.push_back(line);
int size = data.size();
boost::random::random_device rd;
boost::random::mt19937 mt(rd());
boost::random::uniform_int_distribution<int> dist(0, size - 1);
std::wcout << data[dist(mt)] << std::endl;
}
This code compiles just fine, however when I run it with Russian (for instance), I just get garbage text:
C:\util_bin>word ru
������������
C:\util_bin>
I'm not all that familiar with the ins and outs of wide characters in C++, so I can't really discern what's going wrong. Anyone have any ideas?

I'm going to guess you're using Visual Studio. This is a quirk of the implementation of std::basic_filebuf in Windows. From the relevant MSDN page:
Objects of type basic_filebuf are created with an internal buffer of type char * regardless of the char_type specified by the type parameter Elem. This means that a Unicode string (containing wchar_t characters) will be converted to an ANSI string (containing char characters) before it is written to the internal buffer. To store Unicode strings in the buffer, create a new buffer of type wchar_t and set it using the basic_streambuf::pubsetbuf() method.
As it was explained to me, the filebuf is implemented with a FILE*; there is an internal flag that performs the ANSI conversion whether you want it or not, and you can't clear. the flag except by allocating and setting your own buffer (via pubsetbuf). Putting a codecvt in your locale won't do it. It has to happen right after a successful file open. Really, infuriatingly intrusive. I wound up having to write a wrapper class ( which wasn't so bad, as it gave you the ability to store the file name before opening).
You can also open the file with std::binary. Some people recommend that you always do that. But opening the file that way probably makes you do your own code conversions before inserting into a stream or extracting from it.

After you create instantiate your wfstream object, call imbue on it like this:
fin.imbue( std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>) );

Related

Print unicode char

I tried a very simple code in C++:
#include <iostream>
#include <string>
int main()
{
std::wstring test = L"asdfa-";
test += u'ç';
std::wcout << test;
}
But the result was:
asdfa-?
It was not possible print 'ç', with cout or wcout, how can I can print this string correctally?
OS: Linux.
Ps: I use wstring instead of string, because sometimes I need calculate the length of the string, and this size must be the same of what is on the screen.
Ps: I need concatenate the unicode char, it can't be on the string constructor.

First, here's something that does work:
#include <iostream>
#include <string>
int main() {
std::string test = "asdfa-";
test += "ç";
std::cout << test;
}
I used just regular strings here and let C++ keep everything in UTF-8. I think you already know that this would work because you mentioned that you wanted to concatenate the ç rather than just leaving it in the string constructor.
Dealing with char, char16_t, char32_t, and wchar_t in C++ has never really been fun. You have to be careful with the L, u, and U prefixes.
However, where possible, if you deal with utf-8 strings, and avoid characters, you can generally get things to work much better. And since most consoles (with the possible exception of old Windows machines) understand utf-8 pretty well, this is the approach that often just works the best. So if you have wide characters, see if you can convert them to regular std::string objects and work in that domain.

One general way of handling this would be:
Input (convert from multibyte to wide using current locale)
Your App: work with wide strings
Output or saving to a file (convert from wide to multibyte)
For wide string manipulations like num of characters, substring etc. there is wcsXXX class of functions.

If you are using libstdc++ on Linux: you forgot an essential call at the beginning of the program
std::locale::global(std::locale(""));
This is assuming you are on Linux and your locale supports UTF-8.
If you are using libc++: forget about using wstreams. This library does not support I/O of wide characters in a useful way (i.e. translation to UTF-8 like libstdc++ does).
Windows has a wholly separate set of quirks regarding Unicode. You are lucky if you don't have to deal with them.
demo with gcc/libstdc++ and a call to std::locale
demo with gcc/libstdc++ and no call to std::locale
Different versions of clang/libc++ behave differently with this example: some output ? instead of the non-ascii char, some output nothing; some crash on call to std::locale, some don't. None do the right thing, which is printing the ç, or maybe I just haven't found one that works. I don't recommend using libc++ if you need anything related to locale or wchar_t.

I solved this problem using a conversion function:
#include <iostream>
#include <string>
#include <codecvt>
#include <locale>
std::string wstr2str(const std::wstring& wstr) {
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.to_bytes(wstr);
}
int main()
{
std::wstring test = L"asdfa-";
test += L'ç';
std::string str = wstr2str(test)
std::cout << str;
}

What's the most efficient way to read a file into a std::string?

I currently do this, and the conversion to std::string at the end take 98% of the execution time. There must be a better way!
std::string
file2string(std::string filename)
{
std::ifstream file(filename.c_str());
if(!file.is_open()){
// If they passed a bad file name, or one we have no read access to,
// we pass back an empty string.
return "";
}
// find out how much data there is
file.seekg(0,std::ios::end);
std::streampos length = file.tellg();
file.seekg(0,std::ios::beg);
// Get a vector that size and
std::vector<char> buf(length);
// Fill the buffer with the size
file.read(&buf[0],length);
file.close();
// return buffer as string
std::string s(buf.begin(),buf.end());
return s;
}

Being a big fan of C++ iterator abstraction and the algorithms, I would love the following to be the fasted way to read a file (or any other input stream) into a std::string (and then print the content):
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
int main()
{
std::string s(std::istreambuf_iterator<char>(std::ifstream("file")
>> std::skipws),
std::istreambuf_iterator<char>());
std::cout << "file='" << s << "'\n";
}
This certainly is fast for my own implementation of IOStreams but it requires a lot of trickery to actually get it fast. Primarily, it requires optimizing algorithms to cope with segmented sequences: a stream can be seen as a sequence of input buffers. I'm not aware of any STL implementation consistently doing this optimization. The odd use of std::skipws is just to get reference to the just created stream: the std::istreambuf_iterator<char> expects a reference to which the temporary file stream wouldn't bind.
Since this probably isn't the fastest approach, I would be inclined to use std::getline() with a particular "newline" character, i.e. on which isn't in the file:
std::string s;
// optionally reserve space although I wouldn't be too fuzzed about the
// reallocations because the reads probably dominate the performances
std::getline(std::ifstream("file") >> std::skipws, s, 0);
This assumes that the file doesn't contain a null character. Any other character would do as well. Unfortunately, std::getline() takes a char_type as delimiting argument, rather than an int_type which is what the member std::istream::getline() takes for the delimiter: in this case you could use eof() for a character which never occurs (char_type, int_type, and eof() refer to the respective member of char_traits<char>). The member version, in turn, can't be used because you would need to know ahead of time how many characters are in the file.
BTW, I saw some attempts to use seeking to determine the size of the file. This is bound not to work too well. The problem is that the code conversion done in std::ifstream (well, actually in std::filebuf) can create a different number of characters than there are bytes in the file. Admittedly, this isn't the case when using the default C locale and it is possible to detect that this doesn't do any conversion. Otherwise the best bet for the stream would be to run over the file and determine the number of characters being produced. I actually think that this is what would be needed to be done when the code conversion could something interesting although I don't think it actually is done. However, none of the examples explicitly set up the C locale, using e.g. std::locale::global(std::locale("C"));. Even with this it is also necessary to open the file in std::ios_base::binary mode because otherwise end of line sequences may be replaced by a single character when reading. Admittedly, this would only make the result shorter, never longer.
The other approaches using the extraction from std::streambuf* (i.e. those involving rdbuf()) all require that the resulting content is copied at some point. Given that the file may actually be very large this may not be an option. Without the copy this could very well be the fastest approach, however. To avoid the copy, it would be possible to create a simple custom stream buffer which takes a reference to a std::string as constructor argument and directly appends to this std::string:
#include <fstream>
#include <iostream>
#include <string>
class custombuf:
public std::streambuf
{
public:
custombuf(std::string& target): target_(target) {
this->setp(this->buffer_, this->buffer_ + bufsize - 1);
}
private:
std::string& target_;
enum { bufsize = 8192 };
char buffer_[bufsize];
int overflow(int c) {
if (!traits_type::eq_int_type(c, traits_type::eof()))
{
*this->pptr() = traits_type::to_char_type(c);
this->pbump(1);
}
this->target_.append(this->pbase(), this->pptr() - this->pbase());
this->setp(this->buffer_, this->buffer_ + bufsize - 1);
return traits_type::not_eof(c);
}
int sync() { this->overflow(traits_type::eof()); return 0; }
};
int main()
{
std::string s;
custombuf sbuf(s);
if (std::ostream(&sbuf)
<< std::ifstream("readfile.cpp").rdbuf()
<< std::flush) {
std::cout << "file='" << s << "'\n";
}
else {
std::cout << "failed to read file\n";
}
}
At least with a suitably chosen buffer I would expect the version to be the fairly fast. Which version is the fastest will certainly depend on the system, the standard C++ library being used, and probably a number of other factors, i.e. you want to measure the performance.

You can try this:
#include <fstream>
#include <sstream>
#include <string>
int main()
{
std::ostringstream oss;
std::string s;
std::string filename = get_file_name();
if (oss << std::ifstream(filename, std::ios::binary).rdbuf())
{
s = oss.str();
}
else
{
// error
}
// now s contains your file
}
You can also just use oss.str() directly if you like; just make sure you have some sort of error check somewhere.
No guarantee that it's the most efficient; you probably can't beat <cstdio> and fread. As #Benjamin pointed out, the string stream only exposes the data by copy, so you could instead read directly into the target string:
#include <string>
#include <cstdio>
std::FILE * fp = std::fopen("file.bin", "rb");
std::fseek(fp, 0L, SEEK_END);
unsigned int fsize = std::ftell(fp);
std::rewind(fp);
std::string s(fsize, 0);
if (fsize != std::fread(static_cast<void*>(&s[0]), 1, fsize, fp))
{
// error
}
std::fclose(fp);
(You might like to use a RAII wrapper for the FILE*.)
Edit: The fstream-analogue of the second version goes like this:
#include <string>
#include <fstream>
std::ifstream infile("file.bin", std::ios::binary);
infile.seekg(0, std::ios::end);
unsigned int fsize = infile.tellg();
infile.seekg(0, std::ios::beg);
std::string s(fsize, 0);
if (!infile.read(&s[0], fsize))
{
// error
}
Edit: Yet another version, using streambuf-iterators:
std::ifstream thefile(filename, std::ios::binary);
std::string s((std::istreambuf_iterator<char>(thefile)), std::istreambuf_iterator<char>());
(Mind the aditional parentheses to get the correct parsing.)

Ironically, the example for string::reserve is reading a file into a string. You don't want to read the file into one buffer and then have to allocate/copy into another one.
Here's the example code:
// string::reserve
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main ()
{
string str;
size_t filesize;
ifstream file ("test.txt",ios::in|ios::ate);
filesize=file.tellg();
str.reserve(filesize); // allocate space in the string
file.seekg(0);
for (char c; file.get(c); )
{
str += c;
}
cout << str;
return 0;
}

I don't know how efficient it is, but here is a simple (to read) way, by just setting the EOF as the delimiter:
string buffer;
ifstream fin;
fin.open("filename.txt");
if(fin.is_open()) {
getline(fin,buffer,'\x1A');
fin.close();
}
The efficiency of this obviously depends on what's going on internally in the getline algorithm, so you could take a look at the code in the standard libraries to see how it works.

saving files in c++ at different full paths

I am writing a program in C++ which I need to save some .txt files to different locations as per the counter variable in program what should be the code? Please help
I know how to save file using full path
ofstream f;
f.open("c:\\user\\Desktop\\**data1**\\example.txt");
f.close();
I want "c:\user\Desktop\data*[CTR]*\filedata.txt"
But here the data1,data2,data3 .... and so on have to be accessed by me and create a textfile in each so what is the code?
Counter variable "ctr" is already evaluated in my program.

You could snprintf to create a custom string. An example is this:
char filepath[100];
snprintf(filepath, 100, "c:\\user\\Desktop\\data%d\\example.txt", datanum);
Then whatever you want to do with it:
ofstream f;
f.open(filepath);
f.close();
Note: snprintf limits the maximum number of characters that can be written on your buffer (filepath). This is very useful for when the arguments of *printf are strings (that is, using %s) to avoid buffer overflow. In the case of this example, where the argument is a number (%d), it is already known that it cannot have more than 10 characters and so the resulting string's length already has an upper bound and just making the filepath buffer big enough is sufficient. That is, in this special case, sprintf could be used instead of snprintf.

You can use the standard string streams, such as:
#include <fstream>
#include <string>
#include <sstream>
using namespace std;
void f ( int data1 )
{
ostringstream path;
path << "c:\\user\\Desktop\\" << data1 << "\\example.txt";
ofstream file(path.str().c_str());
if (!file.is_open()) {
// handle error.
}
// write contents...
}

Streaming output to save things

I would like to << stream to save all inputs from console into a file
Here is how I tried
ofstream of("file.txt");
while(1)
{
string str;
cin>>str;
of<<str;
}
I don't see the non-English characters in the file (Edit: I mean they are Japanese, Chinese or Korean etc)

You could stream char by char. Then it would be a true binary copy.
ofstream of("file.txt");
while(1)
{
char c;
cin>>c;
of<<c;
}

Streaming using the formatted extraction operator is a poor choice. Specifically, it will consume all of the white space.
If it were me, I would copy using std::getline or istream::rdbuf:
std::getline:
// Copy standard input to named file, one line at a time.
#include <iostream>
#include <fstream>
int main (int argc, char **argv) {
std::string s;
std::ofstream of(argv[1]);
while(std::getline(std::cin, s)) {
of << s << "\n";
}
}
istream::rdbuf:
// Copy entire standard input stream to named file in one go
#include <iostream>
#include <fstream>
int main (int argc, char **argv) {
std::ofstream(argv[1]) << std::cin.rdbuf();
}

Just a couple of points as a starter:
ofstream of("file.txt");
If you want to see Japanese, Chinese or Korean character you should not be using an ofstream here. You want a stream that writes wide characters: a std::wofstream. You will also haveendow that stream with a locale. See Why does wide file-stream in C++ narrow written data by default? for details.
Another point: You apparently are have a using namespace std;. You can find many questions here at Stack Overflow that indicate that this is a bad idea. Typing those extra five characters isn't very hard, it avoids problems with names from the standard library polluting your namespace, and it makes the code clearer.
while(1)
Your loop doesn't have any break statements to escape the loop, so this plus the while (1) means your program will never stop. It is just going to keep on going and going and going and going. You want it to stop (or should want it to stop) on encountering an error or end of file in the input stream.
A better approach is to use a construct such as
while (std::getline (std::cin, s))
to control the loop (except you need to use something special to get wide characters).

What is the most elegant way to read a text file with c++?

I'd like to read whole content of a text file to a std::string object with c++.
With Python, I can write:
text = open("text.txt", "rt").read()
It is very simple and elegant. I hate ugly stuff, so I'd like to know - what is the most elegant way to read a text file with C++?
Thanks.

There are many ways, you pick which is the most elegant for you.
Reading into char*:
ifstream file ("file.txt", ios::in|ios::binary|ios::ate);
if (file.is_open())
{
file.seekg(0, ios::end);
size = file.tellg();
char *contents = new char [size];
file.seekg (0, ios::beg);
file.read (contents, size);
file.close();
//... do something with it
delete [] contents;
}
Into std::string:
std::ifstream in("file.txt");
std::string contents((std::istreambuf_iterator<char>(in)),
std::istreambuf_iterator<char>());
Into vector<char>:
std::ifstream in("file.txt");
std::vector<char> contents((std::istreambuf_iterator<char>(in)),
std::istreambuf_iterator<char>());
Into string, using stringstream:
std::ifstream in("file.txt");
std::stringstream buffer;
buffer << in.rdbuf();
std::string contents(buffer.str());
file.txt is just an example, everything works fine for binary files as well, just make sure you use ios::binary in ifstream constructor.

There's another thread on this subject.
My solutions from this thread (both one-liners):
The nice (see Milan's second solution):
string str((istreambuf_iterator<char>(ifs)), istreambuf_iterator<char>());
and the fast:
string str(static_cast<stringstream const&>(stringstream() << ifs.rdbuf()).str());

You seem to speak of elegance as a definite property of "little code". This is ofcourse subjective in some extent. Some would say that omitting all error handling isn't very elegant. Some would say that clear and compact code you understand right away is elegant.
Write your own one-liner function/method which reads the file contents, but make it rigorous and safe underneath the surface and you will have covered both aspects of elegance.
All the best
/Robert

But beware that a c++-string (or more concrete: An STL-string) is as little as a C-String capable of holding a string of arbitraty length - of course not!
Take a look at the member max_size() which gives you the maximum number of characters a string might contain. This is an implementation definied number and may not be portable among different platforms. Visual Studio gives a value of about 4gigs for strings, others might give you only 64k and on 64Bit-platforms it might give you something really huge! It depends and of course normally you will run into a bad_alloc-exception due to memory exhaustion a long time before reaching the 4gig limit...
BTW: max_size() is a member of other STL-containers as well! It will give you the maximum number of elements of a certain type (for which you instanciated the container) which this container will (theoretically) be able to hold.
So, if you're reading from a file of unknow origin you should:
- Check its size and make sure it's smaller than max_size()
- Catch and process bad_alloc-exceptions
And another point:
Why are you keen on reading the file into a string? I would expect to further process it by incrementally parsing it or something, right? So instead of reading it into a string you might as well read it into a stringstream (which basically is just some syntactic sugar for a string) and do the processing. But then you could do the processing directly from the file as well. Because if properly programmed the stringstream could seamlessly be replaced by a filestream, i. e. by the file itself. Or by any other input stream as well, they all share the same members and operators and can thus be seamlessly interchanged!
And for the processing itself: There's also a lot you can have automated by the compiler! E. g. let's say you want to tokenize the string. When defining a proper template the following actions:
- Reading from a file (or a string or any other input stream)
- Tokenizing the content
- pushing all found tokens into an STL-container
- sort the tokens alphabetically
- eleminating any double values
can all(!!) be achived in one single(!) line of C++-code (let aside the template itself and the error handling)! It's just a single call of the function std::copy()! Just google for "token iterator" and you'll get an idea of what I mean. So this appears to me to be even more "elegant" than just reading from a file...

I like Milan's char* way, but with std::string.
#include <iostream>
#include <string>
#include <fstream>
#include <cstdlib>
using namespace std;
string& getfile(const string& filename, string& buffer) {
ifstream in(filename.c_str(), ios_base::binary | ios_base::ate);
in.exceptions(ios_base::badbit | ios_base::failbit | ios_base::eofbit);
buffer.resize(in.tellg());
in.seekg(0, ios_base::beg);
in.read(&buffer[0], buffer.size());
return buffer;
}
int main(int argc, char* argv[]) {
if (argc != 2) {
cerr << "Usage: this_executable file_to_read\n";
return EXIT_FAILURE;
}
string buffer;
cout << getfile(argv[1], buffer).size() << "\n";
}
(with or without the ios_base::binary, depending on whether you want newlines tranlated or not. You could also change getfile to just return a string so that you don't have to pass a buffer string in. Then, test to see if the compiler optimizes the copy out when returning.)
However, this might look a little better (and be a lot slower):
#include <iostream>
#include <string>
#include <fstream>
#include <cstdlib>
using namespace std;
string getfile(const string& filename) {
ifstream in(filename.c_str(), ios_base::binary);
in.exceptions(ios_base::badbit | ios_base::failbit | ios_base::eofbit);
return string(istreambuf_iterator<char>(in), istreambuf_iterator<char>());
}
int main(int argc, char* argv[]) {
if (argc != 2) {
cerr << "Usage: this_executable file_to_read\n";
return EXIT_FAILURE;
}
cout << getfile(argv[1]).size() << "\n";
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js