I want to create a file with an Arabic name in C++. Below is the program that I tried.
#include <iostream>
#include <fstream>
#include <string>
int main() {
const char *path="D:\\user\\c++\\files\\فثسف.txt";
std::ofstream file(path); //open in constructor
std::string data("Hello World");
file << data;
return 0;
}
But the file gets created with junk characters in its name: ÙثسÙ.txt.
I am using Windows and the g++ compiler.
The default encoding for string literals can be specified with the -fexec-charset compiler option for gcc / g++.
In C++11 and later, you can also use the u8, u, and U prefixes on string literals to specify UTF-8, UTF-16, and UTF-32 encodings:
const char * utf8literal = u8"This is a Unicode UTF-8 string! 剝Ц"; // const char8_t * since C++20
const char16_t * utf16literal = u"This is a Unicode UTF-16 string! 剝Ц";
const char32_t * utf32literal = U"This is a Unicode UTF-32 string! 剝Ц";
Using the above prefixes can upset functions that aren't expecting these specific string types, though; in general it may be better to set the compiler option.
There's a great writeup about this topic on this blog post: http://cppwhispers.blogspot.com/2012/11/unicode-and-your-application-3-of-n.html
I hope this helps. :)
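To make the difference between the three prefixes concrete, here is a minimal sketch that counts the code units each literal occupies. The helper name code_units and the sample characters are my own choices for illustration, not part of the answer above; the characters are written as universal character names so the result does not depend on how the source file is saved.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
// Hypothetical helper, introduced only for this illustration.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N])
{
    return N - 1;
}

// U+0641 (ARABIC LETTER FEH) lies in the Basic Multilingual Plane.
constexpr std::size_t utf8_units  = code_units(u8"\u0641"); // two bytes in UTF-8
constexpr std::size_t utf16_units = code_units(u"\u0641");  // one 16-bit unit
constexpr std::size_t utf32_units = code_units(U"\u0641");  // one 32-bit unit

// U+1F600 lies outside the Basic Multilingual Plane.
constexpr std::size_t utf8_units_astral  = code_units(u8"\U0001F600"); // four bytes
constexpr std::size_t utf16_units_astral = code_units(u"\U0001F600");  // surrogate pair
constexpr std::size_t utf32_units_astral = code_units(U"\U0001F600");  // one unit
```

The template accepts any character type, so it compiles both before and after C++20 changed the type of u8 literals to char8_t.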
Use UTF8:
#include <iostream>
#include <fstream>
#include <filesystem>
int main()
{
namespace fs = std::filesystem;
fs::path path { u8"فثسف.txt" };
std::ofstream file { path };
file << "Hello World";
return 0;
}
Using the <filesystem> library may require additional compiler/linker options: GNU libstdc++ before GCC 9 requires linking with -lstdc++fs, and LLVM libc++ before version 9 requires linking with -lc++fs.
Related
How can I read a Unicode (UTF-8) file into wstring(s) on the Windows platform?
With C++11 support, you can use the std::codecvt_utf8 facet (deprecated since C++17), which encapsulates conversion between a UTF-8 encoded byte string and a UCS-2 or UCS-4 character string, and which can be used to read and write UTF-8 files, both text and binary.
To use a facet, you usually create a locale object, which encapsulates culture-specific information as a set of facets that collectively define a specific localized environment. Once you have a locale object, you can imbue your stream buffer with it:
#include <sstream>
#include <fstream>
#include <codecvt>
std::wstring readFile(const char* filename)
{
std::wifstream wif(filename);
wif.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
std::wstringstream wss;
wss << wif.rdbuf();
return wss.str();
}
which can be used like this:
std::wstring wstr = readFile("a.txt");
Alternatively you can set the global C++ locale before you work with string streams which causes all future calls to the std::locale default constructor to return a copy of the global C++ locale (you don't need to explicitly imbue stream buffers with it then):
std::locale::global(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
According to a comment by @Hans Passant, the simplest way is to use _wfopen_s. Open the file with the mode "rt, ccs=UTF-8".
Here is another pure C++ solution that works at least with VC++ 2010:
#include <locale>
#include <codecvt>
#include <string>
#include <fstream>
#include <cstdlib>
int main() {
const std::locale empty_locale = std::locale::empty();
typedef std::codecvt_utf8<wchar_t> converter_type;
const converter_type* converter = new converter_type;
const std::locale utf8_locale = std::locale(empty_locale, converter);
std::wifstream stream(L"test.txt");
stream.imbue(utf8_locale);
std::wstring line;
std::getline(stream, line);
std::system("pause");
}
Except for locale::empty() (here locale::global() might work as well) and the wchar_t* overload of the basic_ifstream constructor, this should even be pretty standard-compliant (where “standard” means C++0x, of course).
Here's a platform-specific function for Windows only:
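#include <sys/stat.h>
#include <cstdio>
#include <string>
size_t GetSizeOfFile(const std::wstring& path)
{
    struct _stat fileinfo;
    // _wstat returns non-zero on failure; treat such files as empty
    if (_wstat(path.c_str(), &fileinfo) != 0)
        return 0;
    return static_cast<size_t>(fileinfo.st_size);
}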
std::wstring LoadUtf8FileToString(const std::wstring& filename)
{
std::wstring buffer; // stores file contents
FILE* f = _wfopen(filename.c_str(), L"rtS, ccs=UTF-8");
// Failed to open file
if (f == NULL)
{
// ...handle some error...
return buffer;
}
size_t filesize = GetSizeOfFile(filename);
// Read entire file contents into memory
if (filesize > 0)
{
buffer.resize(filesize);
size_t wchars_read = fread(&(buffer.front()), sizeof(wchar_t), filesize, f);
buffer.resize(wchars_read);
buffer.shrink_to_fit();
}
fclose(f);
return buffer;
}
Use like so:
std::wstring mytext = LoadUtf8FileToString(L"C:\\MyUtf8File.txt");
Note the entire file is loaded into memory, so you might not want to use it for very large files.
#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <cstdlib>
int main()
{
std::wifstream wif("filename.txt");
wif.imbue(std::locale("zh_CN.UTF-8"));
std::wcout.imbue(std::locale("zh_CN.UTF-8"));
std::wcout << wif.rdbuf();
}
This question was addressed in "Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a windows GUI". In sum, on Windows wstring holds 16-bit units: originally UCS-2, the strictly two-byte predecessor of UTF-16, and nowadays UTF-16. Either one covers Arabic, whose main block (U+0600 to U+06FF) lies in the Basic Multilingual Plane.
I recently had to deal with all these encodings and solved it this way. It is better to use std::u32string, as its character size is the same on all platforms, and most fonts work with the UTF-32 format. (The file itself should still be in UTF-8.)
std::u32string readFile(std::string filename) {
std::basic_ifstream<char32_t> fin(filename);
std::u32string str{};
std::getline(fin, str, U'\0');
return str;
}
For this approach to work across platforms when you need to read a file only partially, use only the getline function to move between lines (or to find a certain character). Remember to pass the separator: without it the function throws std::bad_cast. You can save the line position with tellg and restore it with seekg. Don't move between individual characters in the stream; use substr on the string instead.
All the other methods of reading files in the standard library that I have found cannot work adequately with files that have variable-width characters.
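As an alternative that does not depend on stream facets at all, the file can be read as raw bytes and decoded by hand. The following is only a sketch of that idea, not the method used above: decode_utf8 and read_utf8_file are hypothetical names, and the decoder rejects truncated sequences and bad continuation bytes but does not reject overlong forms or surrogate code points.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes into UTF-32 code points (hypothetical helper).
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra; // number of continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (bytes.size() - i - 1 < extra)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F); // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Read a whole file as bytes, then decode it (hypothetical helper).
std::u32string read_utf8_file(const std::string& filename)
{
    std::ifstream in(filename, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + filename);
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return decode_utf8(bytes);
}
```

Because the result is a std::u32string, indexing and substr operate on whole code points, which is the property the answer above is after.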
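This is a bit raw, but how about reading the file as plain old bytes and then casting the byte buffer to wchar_t*?
Something like:
#include <fstream>
#include <string>
// Note: this only reinterprets bytes, so it is correct only if the file already
// holds text in the platform's wchar_t encoding (UTF-16 on Windows), not UTF-8.
std::wstring ReadFileIntoWstring(const std::wstring& filepath)
{
    std::wstring wstr;
    // Opening a stream with a wchar_t* path is an MSVC extension
    std::ifstream file(filepath.c_str(), std::ios::in | std::ios::binary | std::ios::ate);
    if (!file)
        return wstr;
    const size_t size = (size_t)file.tellg();
    file.seekg(0, std::ios::beg);
    // Read straight into the string's storage, so no terminating null is needed
    wstr.resize(size / sizeof(wchar_t));
    if (!wstr.empty())
        file.read(reinterpret_cast<char*>(&wstr[0]), wstr.size() * sizeof(wchar_t));
    return wstr;
}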
Here is a little code that reads a line from a UTF-8 file:
#include <iostream>
#include <io.h>
#include <fcntl.h>
#include <locale>
#include <fstream>
#include <codecvt>
int main()
{
_setmode(_fileno(stdout), _O_U8TEXT);
auto inputFileStream = std::wifstream("input.txt");
const auto utf8Locale = std::locale(std::locale(), new std::codecvt_utf8<wchar_t>());
inputFileStream.imbue(utf8Locale);
std::wstring line;
std::getline(inputFileStream, line);
std::wcout << line << std::endl;
inputFileStream.close();
return 0;
}
When I build it with the Visual Studio Visual C++ compiler, I get the following result:
test τεστ тест
as expected.
But when I use MinGW with the GCC compiler, I get
琀攀猀琀 쐃딃쌃쐃 䈄㔄䄄䈄
As you can see, this is not the expected result.
Is there a simple way to fix the output for GCC so that it produces the expected string?
OR
Is there a simple way to use UTF-8 with both MSVC and GCC?
Answer (thanks to Igor Tandetnik and Remy Lebeau):
It seems we must specify the endianness explicitly, because MSVC and GCC have different defaults. So
new std::codecvt_utf8<wchar_t, 0x10ffff, std::little_endian>()
should be used.
Fixed code:
#include <iostream>
#include <io.h>
#include <fcntl.h>
#include <locale>
#include <fstream>
#include <codecvt>
int main()
{
_setmode(_fileno(stdout), _O_U8TEXT);
auto inputFileStream = std::wifstream("input.txt");
const auto utf8Locale = std::locale(std::locale(), new std::codecvt_utf8<wchar_t, 0x10ffff, std::little_endian>());
inputFileStream.imbue(utf8Locale);
std::wstring line;
std::getline(inputFileStream, line);
std::wcout << line << std::endl;
inputFileStream.close();
return 0;
}
For your second question, one option is to limit the use of UTF-16 and the std::w-prefixed facilities to the cases where you need to exchange UTF-16-encoded strings with the operating system. This happens when you receive arguments in wmain, open a file with _wfopen, call a Windows API function, etc. Otherwise, you would store, get from the user, and return to the user UTF-8 strings using the char type (char*, std::string, etc.). Conversion between UTF-8 and UTF-16 can be done with MultiByteToWideChar and WideCharToMultiByte, bypassing the awkward C++ encoding API.
The place where this does not work well is console input/output. Overall, you can output UTF-8 to the console if the user sets chcp 65001 and a TrueType font. At least in Windows 7, you will also have to make sure not to split a character between two write calls, otherwise it will not print correctly (this also implies you cannot use std::cout, because msvcrt will call putc for every byte separately, and you'll need to use puts, fprintf, etc. instead); I have heard that this was fixed in Windows 10, but cannot confirm. Reading UTF-8 from the console with the file API does not work as far as I know; if you want that, you'd need to detect that stdin is attached to a console and use the console API instead.
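A rough sketch of that boundary conversion, assuming file names are kept as UTF-8 in std::string everywhere else in the program. The helper names utf8_to_wide and open_utf8_path are hypothetical; MultiByteToWideChar and _wfopen are the real Windows calls, and on other platforms the sketch assumes the narrow file API already accepts UTF-8. Some MinGW setups hide _wfopen in strict ANSI mode, so a -std=gnu++ dialect may be needed there.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Convert UTF-8 to UTF-16 with the Win32 API; returns an empty string on failure.
static std::wstring utf8_to_wide(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();
    // First call: ask how many UTF-16 code units are needed.
    const int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                        static_cast<int>(utf8.size()), nullptr, 0);
    if (len <= 0)
        return std::wstring();
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    // Second call: perform the conversion into the sized buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}
#endif

// Open a file whose name is given as UTF-8 (hypothetical helper).
std::FILE* open_utf8_path(const std::string& utf8_path, const std::string& mode)
{
#ifdef _WIN32
    // UTF-16 appears only here, at the operating system boundary.
    return _wfopen(utf8_to_wide(utf8_path).c_str(), utf8_to_wide(mode).c_str());
#else
    // Assumption: the platform's narrow API takes UTF-8 file names as-is.
    return std::fopen(utf8_path.c_str(), mode.c_str());
#endif
}
```

It could be called as open_utf8_path(name, "w"), where name holds the UTF-8 bytes of the file name; the returned FILE* is used and closed like any other.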
I am iterating through all the files in a folder and just want their names in a string. I want to get a string from a std::filesystem::path. How do I do that?
My code:
#include <string>
#include <iostream>
#include <filesystem>
namespace fs = std::experimental::filesystem;
int main()
{
std::string path = "C:/Users/user1/Desktop";
for (auto & p : fs::directory_iterator(path))
std::string fileName = p.path;
}
However I get the following error:
non-standard syntax; use '&' to create a pointer to a member.
To convert a std::filesystem::path to a natively-encoded string (whose type is std::filesystem::path::value_type), use the string() method. Note the other *string() methods, which enable you to obtain strings of a specific encoding (e.g. u8string() for a UTF-8 string).
C++17 example:
#include <filesystem>
#include <string>
namespace fs = std::filesystem;
int main()
{
fs::path path{fs::u8path(u8"愛.txt")};
std::string path_string{path.u8string()};
}
C++20 example (better language and library UTF-8 support):
#include <filesystem>
#include <string>
namespace fs = std::filesystem;
int main()
{
fs::path path{u8"愛.txt"};
std::u8string path_string{path.u8string()};
}
The examples given in the accepted answer, using the UTF-8 operations, are fine and a good guideline. There is just one error in the introductory explanation given in the answer, which Windows/MSVC developers should be aware of:
The string() method does not return the natively-encoded string (which would be a std::wstring on Windows); it always returns a std::string. It also tries to convert the path to the local encoding, which is not always possible: if the path contains a Unicode character that is not representable in the current code page, the method throws an exception!
If you actually want the behavior described in the answer (a method that returns the native string, i.e., std::string on Linux and std::wstring on Windows), you have to use the native() method or the implicit conversion provided by std::filesystem::path::operator string_type(). But as @tambre correctly pointed out in the examples, you should consider using the UTF-8 versions throughout.
In C++ 17 and above, you can use .generic_string() to convert the path to a string: https://en.cppreference.com/w/cpp/filesystem/path/generic_string.
The following is an example that gets the current working directory and converts it into a string.
#include <string>
#include <filesystem>
using std::filesystem::current_path;
int main()
{
    std::filesystem::path directoryPath = current_path();
    std::string stringpath = directoryPath.generic_string();
}
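Applied to the loop from the question, a minimal sketch could look like the following. The function name file_names_in is hypothetical, and string() is used for brevity, so the caveat about non-representable characters on Windows still applies.

```cpp
#include <algorithm>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Collect the name (without the directory part) of every entry in `dir`.
std::vector<std::string> file_names_in(const fs::path& dir)
{
    std::vector<std::string> names;
    for (const fs::directory_entry& entry : fs::directory_iterator(dir)) {
        // path() is a member function, so it needs the parentheses that the
        // question's code left out; filename() drops the directory part.
        names.push_back(entry.path().filename().string());
    }
    // directory_iterator visits entries in an unspecified order.
    std::sort(names.begin(), names.end());
    return names;
}
```

The sort is only there to give a predictable order; drop it if the order does not matter.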
#include <iostream>
using namespace std;
int main() {
std::wstring str = L"\u00A2";
std::wcout << str;
return 0;
}
Why doesn't this work? And how can I solve it?
It doesn't work because in the default C locale, there is no character which corresponds to U+00A2.
If you're using a standard Ubuntu install, it is most likely that your user locale uses a larger character set than US-ASCII, quite possibly Unicode encoded with UTF-8. So you just need to switch to the locale specified in the environment, as follows:
#include <iostream>
/* <clocale> is needed for std::setlocale */
#include <clocale>
#include <string>
int main() {
/* The following switches to the locale specified
 * by the environment (LC_ALL, LC_CTYPE, LANG, ...).
 */
std::setlocale (LC_ALL, "");
std::wstring str = L"\u00A2";
std::wcout << str;
return 0;
}
If you use std::string instead of std::wstring and std::cout instead of std::wcout, then you don't need the setlocale because no translation is needed (provided the console expects UTF-8).
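To illustrate the std::string route, here is a small sketch that builds the UTF-8 bytes for a code point by hand. encode_utf8 is a hypothetical helper, not a standard function, and it does not reject surrogates or values above U+10FFFF.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 (hypothetical helper).
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // One byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Writing std::cout << encode_utf8(0x00A2) then sends the two bytes 0xC2 0xA2 to the terminal unchanged, which shows as the cent sign if the terminal expects UTF-8.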
I have written some code but there is a problem. In this code I am trying to convert a string to a wstring, but the string contains "█" characters. This character has the (extended) ASCII code 219.
The conversion gives a wrong result.
In my code:
string strs= "█and█something else";
wstring wstr(strs.begin(),strs.end());
After debugging, I am getting a result like this:
?and?something else
How do I correct this problem?
Thanks...
The C-library solution for converting between the system's narrow and wide encoding uses the mbsrtowcs and wcsrtombs functions from the <cwchar> header. I've spelt this out in this answer.
In C++11, you can use the wstring_convert template instantiated with a suitable codecvt facet. Unfortunately this requires some custom rigging, which is spelt out on the cppreference page.
I've adapted it here into a self-contained example which converts a wstring to a string, converting from the system's wide into the system's narrow encoding:
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
// utility wrapper to adapt locale-bound facets for wstring/wbuffer convert
template <typename Facet>
struct deletable_facet : Facet
{
using Facet::Facet;
};
int main()
{
std::wstring_convert<
deletable_facet<std::codecvt<wchar_t, char, std::mbstate_t>>> conv;
std::wstring ws(L"Hello world.");
std::string ns = conv.to_bytes(ws);
std::cout << ns << std::endl;
}
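For the direction asked about in the question (string to wstring), the C-library route mentioned above could be sketched as follows. The function name widen is hypothetical; the conversion follows the current C locale, so a program would normally call std::setlocale(LC_ALL, "") first, and a byte such as 219 converts only if that locale's narrow encoding defines it. Conversion stops at the first embedded null character.

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Convert from the current C locale's narrow encoding to wchar_t
// (hypothetical helper built on std::mbsrtowcs).
std::wstring widen(const std::string& narrow)
{
    std::mbstate_t state = std::mbstate_t();
    const char* src = narrow.c_str();

    // First pass: a null destination only counts the wide characters needed.
    const std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence for the current locale");

    // Second pass: convert into a buffer of exactly that size.
    std::wstring wide(len, L'\0');
    src = narrow.c_str();
    state = std::mbstate_t();
    std::mbsrtowcs(&wide[0], &src, len, &state);
    return wide;
}
```

Unlike wstring wstr(strs.begin(), strs.end()), which copies each byte value into a wchar_t without any decoding, this asks the locale how the bytes map to characters.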