Read Unicode UTF-8 file into wstring - c++

How can I read a Unicode (UTF-8) file into wstring(s) on the Windows platform?

With C++11 support, you can use the std::codecvt_utf8 facet, which encapsulates conversion between a UTF-8 encoded byte string and a UCS-2 or UCS-4 character string and can be used to read and write UTF-8 files, both text and binary. (Note that the <codecvt> header has been deprecated since C++17.)
To use a facet, you usually create a locale object, which encapsulates culture-specific information as a set of facets that collectively define a specific localized environment. Once you have a locale object, you can imbue your stream with it:
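#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename);
    // std::locale::empty() is a Microsoft extension; the locale takes
    // ownership of the facet allocated with new.
    wif.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream wss;
    wss << wif.rdbuf(); // copy the whole decoded file into the string stream
    return wss.str();
}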
which can be used like this:
std::wstring wstr = readFile("a.txt");
Alternatively, you can set the global C++ locale before you work with string streams. All future calls to the std::locale default constructor then return a copy of the global C++ locale, so you don't need to imbue streams with it explicitly:
std::locale::global(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
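
Since std::locale::empty() only exists in the Microsoft library, here is a minimal sketch of the same idea that uses a default-constructed std::locale() as the base instead. The name readUtf8File is made up for illustration, and the sketch assumes a standard library that still ships the deprecated <codecvt> header:

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Reads a UTF-8 encoded file into a std::wstring.
// Returns an empty string if the file cannot be opened.
std::wstring readUtf8File(const std::string& filename)
{
    std::wifstream in(filename);
    if (!in) {
        return std::wstring();
    }
    // std::locale() copies the current global locale; the new facet replaces
    // only its codecvt<wchar_t, char, mbstate_t> part. The locale takes
    // ownership of the facet, so no delete is needed.
    in.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}
```

Keep in mind that where wchar_t is 16 bits wide (as on Windows), codecvt_utf8<wchar_t> produces UCS-2, so characters outside the Basic Multilingual Plane are not converted.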

According to a comment by @Hans Passant, the simplest way is to use _wfopen_s. Open the file with the mode string "rt, ccs=UTF-8".
Here is another pure C++ solution that works at least with VC++ 2010:
#include <locale>
#include <codecvt>
#include <string>
#include <fstream>
#include <cstdlib>

int main() {
    const std::locale empty_locale = std::locale::empty();
    typedef std::codecvt_utf8<wchar_t> converter_type;
    const converter_type* converter = new converter_type;
    const std::locale utf8_locale = std::locale(empty_locale, converter);
    std::wifstream stream(L"test.txt");
    stream.imbue(utf8_locale);
    std::wstring line;
    std::getline(stream, line);
    std::system("pause");
}
Except for locale::empty() (a default-constructed std::locale(), which copies the global locale, should work here as well) and the wchar_t* overload of the basic_ifstream constructor, this should even be pretty standard-compliant (where “standard” means C++0x, of course).

Here's a platform-specific function for Windows only:
#include <cstdio>
#include <string>
#include <sys/types.h>
#include <sys/stat.h>

// Returns the file size in bytes, or 0 if the file cannot be queried.
size_t GetSizeOfFile(const std::wstring& path)
{
    struct _stat fileinfo;
    if (_wstat(path.c_str(), &fileinfo) != 0)
        return 0;
    return static_cast<size_t>(fileinfo.st_size);
}

std::wstring LoadUtf8FileToString(const std::wstring& filename)
{
    std::wstring buffer; // stores file contents
    FILE* f = _wfopen(filename.c_str(), L"rtS, ccs=UTF-8");
    // Failed to open file
    if (f == NULL)
    {
        // ...handle some error...
        return buffer;
    }
    size_t filesize = GetSizeOfFile(filename);
    // Read entire file contents into memory
    if (filesize > 0)
    {
        // The byte count is an upper bound on the number of wide characters,
        // because every UTF-8 character occupies at least one byte.
        buffer.resize(filesize);
        size_t wchars_read = fread(&(buffer.front()), sizeof(wchar_t), filesize, f);
        buffer.resize(wchars_read);
        buffer.shrink_to_fit();
    }
    fclose(f);
    return buffer;
}
Use like so:
std::wstring mytext = LoadUtf8FileToString(L"C:\\MyUtf8File.txt");
Note that the entire file is loaded into memory, so you might not want to use it for very large files.
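
For large files, one option is to process the text line by line, so that only the current line is held in memory. The following is a rough sketch of that idea using the standard codecvt_utf8 facet rather than the Windows-specific function above; forEachUtf8Line and its callback parameter are hypothetical names chosen for illustration:

```cpp
#include <codecvt>
#include <cstddef>
#include <fstream>
#include <functional>
#include <locale>
#include <string>

// Calls onLine once per line of a UTF-8 file, keeping one line in memory
// at a time. Returns the number of lines read, or 0 if the file cannot be
// opened.
std::size_t forEachUtf8Line(const std::string& filename,
                            const std::function<void(const std::wstring&)>& onLine)
{
    std::wifstream in(filename);
    if (!in) {
        return 0;
    }
    in.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
    std::size_t count = 0;
    std::wstring line;
    while (std::getline(in, line)) {
        onLine(line); // line does not include the trailing '\n'
        ++count;
    }
    return count;
}
```

A caller would pass a lambda, for example one that searches each line for a keyword or appends matching lines to a container.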

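#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <cstdlib>

int main()
{
    std::wifstream wif("filename.txt");
    // The locale name is platform-specific; std::locale throws
    // std::runtime_error if "zh_CN.UTF-8" is not available on the system.
    wif.imbue(std::locale("zh_CN.UTF-8"));
    std::wcout.imbue(std::locale("zh_CN.UTF-8"));
    std::wcout << wif.rdbuf();
}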

This question was addressed in Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a windows GUI. In sum, on Windows wstring holds 16-bit wchar_t units. It was originally based upon the UCS-2 standard, the strictly two-byte predecessor of UTF-16, and Windows treats its contents as UTF-16 today. The standard Arabic letters lie within the range that a single 16-bit unit covers.
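
To illustrate the difference between UCS-2 and UTF-16, here is a small sketch of how a single code point maps to UTF-16 code units. The function name encodeUtf16 is made up for this example, and the code assumes its input is a valid Unicode scalar value:

```cpp
#include <string>

// Encodes one Unicode code point as UTF-16 code units.
// Code points up to U+FFFF take one unit (the range UCS-2 can represent);
// higher ones take a surrogate pair. Assumes cp is a valid scalar value
// (not a surrogate and not above U+10FFFF).
std::u16string encodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000; // 20-bit value to split in two
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

The Arabic letter U+0641 comes out as one unit, while something like U+1F600 needs two, which is exactly the case a strictly two-byte encoding cannot express.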

I recently had to deal with all of these encodings and solved it this way. It is better to use std::u32string, as its character size is the same on all platforms and most fonts work with the UTF-32 format. (The file itself should still be in UTF-8.)
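#include <fstream>
#include <string>

std::u32string readFile(std::string filename) {
    std::basic_ifstream<char32_t> fin(filename);
    std::u32string str{};
    // Pass the delimiter explicitly; U'\0' reads up to the first null
    // character, which for a text file normally means the whole file.
    std::getline(fin, str, U'\0');
    return str;
}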
For this approach to work across platforms when you read a file piecewise, use only the getline function to move between lines (or to find a certain character), and always pass the delimiter explicitly: without one, getline throws std::bad_cast, because the default delimiter is obtained through widen('\n'), which needs a std::ctype<char32_t> facet that the standard locales do not provide. You can save a line's position with tellg and return to it with seekg. Don't move between individual characters in the stream; use substr on the string you have read instead.
None of the other file-reading methods in the standard library that I have found works adequately with files whose characters vary in size.
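
If relying on the stream's conversion facet is a concern, another option (not what the answer above does) is to read the raw bytes and decode them by hand. The sketch below is a simplified decoder under stated assumptions: decodeUtf8 and readUtf8FileAsUtf32 are hypothetical names, malformed input is replaced with U+FFFD, and overlong forms and encoded surrogates are not rejected:

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Decodes UTF-8 bytes into UTF-32 code points. Malformed or truncated
// sequences are replaced by U+FFFD; overlong forms and surrogates are not
// rejected, to keep the sketch short.
std::u32string decodeUtf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; } // stray or invalid lead byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80) {
                ok = false; // truncated sequence or missing continuation byte
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : static_cast<char32_t>(0xFFFD));
    }
    return out;
}

// Reads the whole file as bytes and decodes it; returns an empty string if
// the file cannot be opened.
std::u32string readUtf8FileAsUtf32(const std::string& filename)
{
    std::ifstream in(filename, std::ios::binary);
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return decodeUtf8(bytes);
}
```

Because the result is a std::u32string, every element is one code point, so indexing and substr operate on whole characters.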

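This is a bit raw, but how about reading the file as plain old bytes and then casting the byte buffer to wchar_t*? Note that this only yields correct text when the file already uses the platform's wchar_t encoding (UTF-16 LE on Windows); the bytes of a UTF-8 file would be misinterpreted.
Something like: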
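#include <fstream>
#include <string>

std::wstring ReadFileIntoWstring(const std::wstring& filepath)
{
    std::wstring wstr;
    // The wchar_t* filename overload is a Microsoft extension.
    std::ifstream file(filepath.c_str(), std::ios::in | std::ios::binary | std::ios::ate);
    if (!file)
        return wstr;
    const size_t size = (size_t)file.tellg();
    file.seekg(0, std::ios::beg);
    // Read straight into the string's storage; the explicit length means no
    // terminating null is needed. A trailing odd byte is ignored.
    wstr.resize(size / sizeof(wchar_t));
    if (!wstr.empty())
        file.read(reinterpret_cast<char*>(&wstr[0]), wstr.size() * sizeof(wchar_t));
    return wstr;
}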

Related

C++ Read File Accent Problems [duplicate]


How to read a single char with std::wifstream? [duplicate]

You wouldn't imagine that something as basic as opening a file using the C++ standard library for a Windows application would be tricky... but it appears to be. By Unicode here I mean UTF-8, but I can convert to UTF-16 or whatever; the point is getting an ofstream instance from a Unicode filename. Before I hack up my own solution, is there a preferred route here? Especially a cross-platform one?
The C++ standard library is not Unicode-aware. char and wchar_t are not required to be Unicode encodings.
On Windows, wchar_t is UTF-16, but there's no direct support for UTF-8 filenames in the standard library (the char datatype is not Unicode on Windows)
With MSVC (and thus the Microsoft STL), a constructor for filestreams is provided which takes a const wchar_t* filename, allowing you to create the stream as:
wchar_t const name[] = L"filename.txt";
std::fstream file(name);
However, this overload is not specified by the C++11 standard (it only guarantees the presence of the char based version). It is also not present on alternative STL implementations like GCC's libstdc++ for MinGW(-w64), as of version g++ 4.8.x.
Note that just as char on Windows is not UTF-8, on other OSes wchar_t may not be UTF-16. So overall, this isn't likely to be portable. Opening a stream given a wchar_t filename isn't defined by the standard, and specifying the filename in chars may be difficult because the encoding used by char varies between OSes.
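Since C++17, there is a cross-platform way to open an std::fstream with a Unicode filename using the std::filesystem::path overload. Note that path interprets a u8 literal as UTF-8 only from C++20 on, where such literals have type char8_t; in C++17 the literal is a plain char string that path reads in the native narrow encoding, and std::filesystem::u8path is the function that treats it as UTF-8. Example: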
std::ofstream out(std::filesystem::path(u8"こんにちは"));
out << "hello";
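
When the filename arrives at run time as UTF-8 bytes in a std::string, a sketch along these lines may be more practical. writeFileUtf8Name is a made-up helper name; the code assumes C++17 and uses std::filesystem::u8path, which still compiles in C++20 but is deprecated there:

```cpp
#include <filesystem>
#include <fstream>
#include <string>

// Creates (or truncates) a file whose name is given as UTF-8 bytes and
// writes text into it. Returns true on success.
bool writeFileUtf8Name(const std::string& utf8Path, const std::string& text)
{
    // u8path interprets the bytes as UTF-8 regardless of the native narrow
    // encoding. It is deprecated in C++20, where constructing
    // std::filesystem::path from a std::u8string plays the same role.
    const std::filesystem::path p = std::filesystem::u8path(utf8Path);
    std::ofstream out(p, std::ios::binary);
    out << text;
    out.flush(); // surface write errors before reporting success
    return static_cast<bool>(out);
}
```

On Windows the path is converted to UTF-16 internally before the file is opened; on most other systems the bytes are passed through unchanged.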
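In current versions of Visual C++, std::basic_fstream has an open() method that takes a const wchar_t*, according to http://msdn.microsoft.com/en-us/library/4dx08bh4.aspx.
Use std::wofstream, std::wifstream and std::wfstream. They accept a Unicode filename. The filename has to be a wstring, an array of wchar_ts, or a literal that is wrapped in the _T() macro or prefixed with L. (In the Microsoft implementation the narrow stream classes accept such filenames too; the w variants change the character type of the file's contents, not of its name.)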
Have a look at Boost.Nowide:
#include <boost/nowide/fstream.hpp>
#include <boost/nowide/cout.hpp>
using boost::nowide::ifstream;
using boost::nowide::cout;
// #include <fstream>
// #include <iostream>
// using std::ifstream;
// using std::cout;
#include <string>

int main() {
    ifstream f("UTF-8 (e.g. ß).txt");
    std::string line;
    std::getline(f, line);
    cout << "UTF-8 content: " << line;
}
Use wfstream instead of fstream, wofstream instead of ofstream, and so on.
You can find this information in the iosfwd header file.
If you're using Qt mixed with std::ifstream:
return std::wstring(reinterpret_cast<const wchar_t*>(qString.utf16()));
Note that the std::basic_ifstream constructor normally doesn't accept a const wchar_t*, but in the MS implementation of the STL it does. With other implementations you would probably call qString.toUtf8() (utf8() in older Qt versions) and use the const char* constructor.

Create file in arabic name in c++

I want to create a file with an Arabic name in C++. Below is the program that I tried.
#include <iostream>
#include <fstream>
#include <string>

int main() {
    const char *path="D:\\user\\c++\\files\\فثسف.txt";
    std::ofstream file(path); //open in constructor
    std::string data("Hello World");
    file << data;
    return 0;
}
But the file gets created with junk characters: ÙثسÙ.txt.
I am using the Windows platform and the g++ compiler.
The default encoding for string literals can be specified with the -fexec-charset compiler option for gcc / g++.
In C++11 and later, you can also use the u8, u, and U prefixes on string literals to specify UTF-8, UTF-16, and UTF-32 encodings:
const char * utf8literal = u8"This is an unicode UTF8 string! 剝Ц";
const char16_t * utf16literal = u"This is an unicode UTF16 string! 剝Ц";
const char32_t * utf32literal = U"This is an unicode UTF32 string! 剝Ц";
Using the above prefixes can upset some functions that aren't expecting these specific types of strings, though (and since C++20, u8 literals have type char8_t, so the first line needs const char8_t*); in general it may be better to set the compiler option.
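
To make the difference between the three prefixes concrete, here is a small sketch that counts the code units each encoding needs for the same character. The helper codeUnits is a made-up name, and the characters are written as \u and \U escapes so the result does not depend on how the source file is saved:

```cpp
#include <cstddef>

// Number of code units (excluding the terminating null) in a string literal.
template <typename CharT, std::size_t N>
constexpr std::size_t codeUnits(const CharT (&)[N])
{
    return N - 1;
}

// U+0641 is the Arabic letter feh used in the filename above.
static_assert(codeUnits(u8"\u0641") == 2, "two bytes in UTF-8");
static_assert(codeUnits(u"\u0641") == 1, "one 16-bit unit in UTF-16");
static_assert(codeUnits(U"\u0641") == 1, "one 32-bit unit in UTF-32");

// U+1F600 lies outside the Basic Multilingual Plane.
static_assert(codeUnits(u8"\U0001F600") == 4, "four bytes in UTF-8");
static_assert(codeUnits(u"\U0001F600") == 2, "a surrogate pair in UTF-16");
static_assert(codeUnits(U"\U0001F600") == 1, "one unit in UTF-32");
```

The junk filename in the question is what you get when the two UTF-8 bytes of each letter are interpreted as two separate characters of a single-byte code page.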
There's a great writeup about this topic on this blog post: http://cppwhispers.blogspot.com/2012/11/unicode-and-your-application-3-of-n.html
I hope this helps. :)
Use UTF8:
#include <iostream>
#include <fstream>
#include <filesystem>

int main()
{
    namespace fs = std::filesystem;
    fs::path path { u8"فثسف.txt" };
    std::ofstream file { path };
    file << "Hello World";
    return 0;
}
Using the <filesystem> library may require additional compiler/linker options. The GNU implementation requires linking with -lstdc++fs before GCC 9, and the LLVM implementation requires linking with -lc++fs before LLVM 9.

Strange behavior with c++ io

I am using zlib to compress data for a game I am making. Here is the code I have been using
#include <SFML/Graphics.hpp>
#include <Windows.h>
#include <fstream>
#include <iostream>
#include "zlib.h"
#include "zconf.h"
using namespace std;

void compress(Bytef* toWrite, int bufferSize, char* filename)
{
    uLongf comprLen = compressBound(bufferSize);
    Bytef* data = new Bytef[comprLen];
    compress(data, &comprLen, &toWrite[0], bufferSize);
    ofstream file(filename);
    file.write((char*) data, comprLen);
    file.close();
    cout<<comprLen;
}

int main()
{
    const int X_BLOCKS = 1700;
    const int Y_BLOCKS = 19;
    int bufferSize = X_BLOCKS * Y_BLOCKS;
    Bytef world[X_BLOCKS][Y_BLOCKS];
    //fill world with integer values
    compress(&world[0][0], bufferSize, "Level.lvl");
    while(2);
    return EXIT_SUCCESS;
}
Now I would have expected the program to simply compress the array world and save it to a file. However, I noticed a weird behavior. When I printed the value of 'comprLen', it was different from the length of the created file. I couldn't understand where the extra bytes in the file were coming from.
You need to open the file in binary mode:
std::ofstream file(filename, std::ios_base::binary);
Without the std::ios_base::binary flag, the system will replace end-of-line characters (\n) with end-of-line sequences (\r\n on Windows). Suppressing this conversion is the only purpose of the std::ios_base::binary flag.
Note that the conversion is made on the bytes written to the stream. That is, the number of bytes actually written will be larger than the second argument to write(). Also note that you need to make sure that you are using the "C" locale rather than some locale with a non-trivial code conversion facet (since you don't explicitly set the global std::locale in your code, you should get the default, which is the "C" locale).
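
As a small self-contained check of this behavior, the sketch below writes a byte buffer in binary mode and reads the file size back. writeBytes and fileSize are hypothetical helper names, not part of the code in the question:

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Writes the bytes unchanged. std::ios::binary disables the newline
// translation that text mode performs on Windows.
bool writeBytes(const std::string& filename, const std::vector<unsigned char>& data)
{
    std::ofstream file(filename, std::ios::binary | std::ios::trunc);
    file.write(reinterpret_cast<const char*>(data.data()),
               static_cast<std::streamsize>(data.size()));
    file.flush();
    return static_cast<bool>(file);
}

// Size of the file in bytes, or -1 if it cannot be opened.
long long fileSize(const std::string& filename)
{
    std::ifstream file(filename, std::ios::binary | std::ios::ate);
    if (!file) {
        return -1;
    }
    return static_cast<long long>(file.tellg());
}
```

With the binary flag, the size reported by fileSize matches data.size() even when the buffer contains 0x0A bytes, which is what compressed data needs.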

How to open an std::fstream (ofstream or ifstream) with a unicode filename?
