How to read a UCS-2 file? - c++

I'm writing a program to get the infomation in *.rc file encoding in UCS-2 Little Endian.
int _tmain(int argc, _TCHAR* argv[]) {
wstring csvLine(wstring sLine);
wifstream fin("en.rc");
wofstream fout("table.csv");
wofstream fout_rm("temp.txt");
wstring sLine;
fout << "en\n";
while(getline(fin,sLine)) {
if (sLine.find(L"IDS") == -1)
fout_rm << sLine << endl;
else
fout << csvLine(sLine);
}
fout << flush;
system("pause");
return 0;
}
The first line in "en.rc" is #include <windows.h> but sLine shows as below:
[0] 255 L'ÿ'
[1] 254 L'þ'
[2] 35 L'#'
[3] 0
[4] 105 L'i'
[5] 0
[6] 110 L'n'
[7] 0
[8] 99 L'c'
. .
. .
. .
This program can work out correctly for UTF-8. How can I do it to UCS-2?

Wide streams use a wide stream buffer to access the file. The Wide stream buffer reads bytes from the file and uses its codecvt facet to convert these bytes to wide characters. The default codecvt facet is std::codecvt<wchar_t, char ,std::mbstate_t> which converts between the native character sets for wchar_t and char (i.e., like mbstowcs() does).
You're not using the native char character set, so what you want is a codecvt facet that reads UCS-2 as a multibyte sequence and converts it to wide characters.
#include <fstream>
#include <string>
#include <codecvt>
#include <iostream>
int main(int argc, char *argv[])
{
wifstream fin("en.rc", std::ios::binary); // You need to open the file in binary mode
// Imbue the file stream with a codecvt facet that uses UTF-16 as the external multibyte encoding
fin.imbue(std::locale(fin.getloc(),
new std::codecvt_utf16<wchar_t, 0xffff, consume_header>));
// ^ We set 0xFFFF as the maxcode because that's the largest that will fit in a single wchar_t
// We use consume_header to detect and use the UTF-16 'BOM'
// The following is not really the correct way to write Unicode output, but it's easy
std::wstring sLine;
std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
while (getline(fin, sLine))
{
std::cout << convert.to_bytes(sLine) << '\n';
}
}
Note that there's an issue with UTF-16 here. The purpose of wchar_t is for one wchar_t to represent one codepoint. However Windows uses UTF-16 which represents some codepoints as two wchar_ts. This means that the standard API doesn't work very well with Windows.
The consequence here is that when the file contains a surrogate pair, codecvt_utf16 will read that pair, convert it to a single codepoint value greater than 16 bits and have to truncate the value to 16 bits to stick it in a wchar_t. This means this code really is limited to UCS-2. I've set the maxcode template parameter to 0xFFFF to reflect this.
There are a number of other problems with wchar_t, and you might want to just avoid it entirely: What's “wrong” with C++ wchar_t?

#include <filesystem>
namespace fs = std::filesystem;
FILE* f = _wfopen(L"myfile.txt", L"rb");
auto file_size = fs::file_size(filename);
std::wstring buf;
buf.resize((size_t)file_size / sizeof(decltype(buf)::value_type));// buf in my code is a template object, so I use decltype(buf) to decide its type.
fread(&buf[0], 1, 2, f); // escape UCS2 BOM
fread(&buf[0], 1, file_size, f);

Related

Can't write chinese character into textfile with wofstream

I'm using std::wofstream to write characters in a text file.My characters can have chars from very different languages(english to chinese).
I want to print my vector<wstring> into that file.
If my vector contains only english characters I can print them without a problem.
But if I write chineses characters my file remains empty.
I browsed trough stackoverflow and all answers said bascially to use functions from the library:
#include <codecvt>
I can't include that library, because I am using Dev-C++ in version 5.11.
I did:#define UNICODE in all my header files.
I guess there is a really simple solution for that problem.
It would be great, if someone could help me out.
My code:
#define UNICODE
#include <string>
#include <fstream>
using namespace std;
int main()
{
string Path = "D:\\Users\\\t\\Desktop\\korrigiert_RotCommon_zh_check_error.log";
wofstream Out;
wstring eng = L"hello";
wstring chi = L"程序";
Out.open(Path, ios::out);
//works.
Out << eng;
//fails
Out << chi;
Out.close();
return 0;
}
Kind Regards
Even if the name of the wofstream implies it's a wide char stream, it's not. It's still a char stream that uses a convert facet from a locale to convert the wchars to char.
Here is what cppreference says:
All file I/O operations performed through std::basic_fstream<CharT> use the std::codecvt<CharT, char, std::mbstate_t> facet of the locale imbued in the stream.
So you could either set the global locale to one that supports Chinese or imbue the stream. In both cases you'll get a single byte stream.
#include <locale>
//...
const std::locale loc = std::locale(std::locale(), new std::codecvt_utf8<wchar_t>);
Out.open(Path, ios::out);
Out.imbue(loc);
Unfortunately std::codecvt_utf8 is already deprecated[2]. This MSDN
magazine
article explains how to do UTF-8 conversion using MultiByteToWideChar C++ - Unicode Encoding Conversions with STL Strings and Win32 APIs.
Here the Microsoft/vcpkg variant of an to_utf8 conversion:
std::string to_utf8(const CWStringView w)
{
const size_t size = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, nullptr, 0, nullptr, nullptr);
std::string output;
output.resize(size - 1);
WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, output.data(), size - 1, nullptr, nullptr);
return output;
}
On the other side you can use normal binary stream and write the wstring data with write().
std::ofstream Out(Path, ios::out | ios::binary);
const uint16_t bom = 0xFEFF;
Out.write(reinterpret_cast<const char*>(&bom), sizeof(bom)); // optional Byte order mark
Out.write(reinterpret_cast<const char*>(chi.data()), chi.size() * sizeof(wchar_t));
You forgot to tell your stream what locale to use:
Out.imbue(std::locale("zh_CN.UTF-8"));
You'll obviously need to include <locale> for this.

Coding a path in unicode c++

I had a problem with opening UTF-8 path files. Path that has a UTF-8 char (like Cyrillic or Latin). I found a way to solve that with _wfopen but the way a solved it was when I encode the UTF-8 char with UTF by hand (\Uxxxx).
Is there a function, macro or anything that when I supply the string (path) it will return the Unicode??
Something like this:
https://www.branah.com/unicode-converter
I tried with MultiByteToWideChar but it returns some Hex numbers that are not relavent.
Tried:
std::wstring s2ws(const std::string& s)
{
int len;
int slength = (int)s.length() + 1;
len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
wchar_t* buf = new wchar_t[len];
MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
std::wstring r(buf);
delete[] buf;
return r;
}
std::wstring stemp = s2ws(x);
LPCWSTR result = stemp.c_str();
The result I get: 0055F7E8
Thank you in advance
Update:
I installed boost, and now I am trying to do it with boost. Can some one maybe help me out with boost.
So I have a path:
wchar_t path[100] = _T("čaćšžđ\\test.txt");
I need it converted to:
wchar_t s[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");
Here's a way to convert between UTF-8 and UTF-16 on Windows, as well as showing the real values of the stored code units for both input and output:
#include <codecvt>
#include <iostream>
#include <iomanip>
#include <string>
int main() {
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
std::string s = "test";
std::cout << std::hex << std::setfill('0');
std::cout << "Input `char` data: ";
for (char c : s) {
std::cout << std::setw(2) << static_cast<unsigned>(static_cast<unsigned char>(c)) << ' ';
}
std::cout << '\n';
std::wstring ws = convert.from_bytes(s);
std::cout << "Output `wchar_t` data: ";
for (wchar_t wc : ws) {
std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
}
std::cout << '\n';
}
Understanding the real values of the input and output is important because otherwise you may not correctly understand the transformation that you really need. For example it looks to me like there may be some confusion as to how VC++ deals with encodings, and what \Uxxxxxxxx and \uxxxx actually do in C++ source code (e.g., they don't necessarily produce UTF-8 data).
Try using code like that shown above to see what your input data really is.
To emphasize what I've written above; there are strong indications that you may not correctly understand the processing that's being done on your input, and you need to thoroughly check it.
The above program does correctly transform the UTF-8 representation of ć (U+0107) into the single 16-bit code unit 0x0107, if you replace the test string with the following:
std::string s = "\xC4\x87"; // UTF-8 representation of U+0107
The output of the program, on Windows using Visual Studio, is then:
Input char data: c4 87
Output wchar_t data: 0107
This is in contrast to if you use test strings such as:
std::string s = "ć";
Or
std::string s = "\u0107";
Which may result in the following output:
Input char data: 3f
Output wchar_t data: 003f
The problem here is that Visual Studio does not use UTF-8 as the encoding for strings without some trickery, so your request to convert from UTF-8 probably isn't what you actually need; or you do need conversion from UTF-8, but you're testing potential conversion routines using input that differs from your real input.
So I have a path: wchar_t path[100] = _T("čaćšžđ\test.txt");
I need it converted to:
wchar_t s[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\test.txt");
Okay, so if I understand correctly, your actual problem is that the following fails:
wchar_t path[100] = _T("čaćšžđ\\test.txt");
FILE *f = _wfopen(path, L"w");
But if you instead write the string like:
wchar_t path[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");
Then the _wfopen call succeeds and opens the file you want.
First of all, this has absolutely nothing to do with UTF-8. I assume you found some workaround using a char string and converting that to wchar_t and you somehow interpreted this as involving UTF-8, or something.
What encoding are you saving the source code with? Is the string L"čaćšžđ\\test.txt" actually being saved properly? Try closing the source file and reopening it. If some characters show up replaced by ?, then part of your problem is the source file encoding. In particular this is true of the default encoding used by Windows in most of North America and Western Europe: "Western European (Windows) - Codepage 1252".
You can also check the output of the following program:
#include <iomanip>
#include <iostream>
int main() {
wchar_t path[16] = L"čaćšžđ\\test.txt";
std::cout << std::hex << std::setfill('0');
for (wchar_t wc : path) {
std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
}
std::cout << '\n';
wchar_t s[16] = L"\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt";
for (wchar_t wc : s) {
std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
}
std::cout << '\n';
}
Another thing you need to understand is that the \uxxxx form of writing characters, called Universal Character Names or UCNs, is not a form that you can convert strings to and from in C++. By the time you've compiled the program and it's running, i.e. by the time any code you write could be attempting to produce strings containing \uxxxx, the time when UCNs are interpreted by the compiler as different characters is long past. The only UCNs that will work are ones that are written directly in the source file.
Also, you're using _T() incorrectly. IMO You shouldn't be using TCHAR and the related macros at all, but if you do use it then you ought to use it consistently: don't mix TCHAR APIs with explicit use of the *W APIs or wchar_t. The whole point of TCHAR is to allow code to be independent and switch between those wchar_t and Microsoft's "ANSI" APIs, so using TCHAR and then hard coding an assumption that TCHAR is wchar_t defeats the entire purpose.
You should just write:
wchar_t path[100] = L"čaćšžđ\\test.txt";
Your code is Windows-specific, and you're using Visual C++. So, just use wide literals. Visual C++ supports wide strings for file stream constructors.
It's as simple as that &dash; when you don't require portability.
#include <fstream>
#include <iostream>
#include <stdlib.h>
using namespace std;
auto main() -> int
{
wchar_t const path[] = L"cacšžd/test.txt";
ifstream f( path );
int ch;
while( (ch = f.get()) != EOF )
{
cout.put( ch );
}
}
Note, however, that this code is Visual C++ specific. That's reasonable for Windows-specific code. Possibly with C++17 we will have Boost file system library adopted into the standard library, and then for conformance g++ will ideally offer the constructor used here.
The problem was that I was saving the CPP file as ANSI... I had to convert it to UTF-8. I tried this before posting but VS 2015 turns it into ANSI, I had to change it in VS so I could get it working.
I tried opening the cpp file with notepad++ and changing the encoding but when I turn on VS it automatically returns. So I was looking to Save As option but there is no encoding option. Finally i found it, in Visual Studio 2015
File -> Advanced Save Options in the Encoding dropdown change it to Unicode
One thing that is still strange to me, how did VS display the characters normally but when I opened the file in N++ there was ? (like it was supposed to be, because of ANSI)?

how to print each character of strings that mix ascii character with unicode?

for example, I want to create some typewriter effects so need to print strings like that:
#include <string>
int main(){
std::string st1="ab》cd《ef";
for(int i=0;i<st1.size();i++){
std::string st2=st1.substr(0,i).c_str();
printf("%s\n",st2.c_str());
}
return 0;
}
but the output is
a
ab
ab?
ab?
ab》
ab》c
ab》cd
ab》cd?
ab》cd?
ab》cd《
ab》cd《e
and not:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
how to know the upcoming character is unicode?
similar question, print each character also has the problem:
#include <string>
int main(){
std::string st1="ab》cd《ef";
for(int i=0;i<st1.size();i++){
std::string st2=st1.substr(i,1).c_str();
printf("%s\n",st2.c_str());
}
return 0;
}
the output is:
a
b
?
?
?
c
d
?
?
?
e
f
not:
a
b
》
c
d
《
e
f
I think the problem is encoding. Likely your string is in UTF-8 encoding which has variable sized characters. This means you can not iterate one char at a time because some characters are more than one char wide.
The fact is, in unicode, you can only iterate reliably one fixed character at a time with UTF-32 encoding.
So what you can do is use a UTF library like ICU to convert vetween UTF-8 and UTF-32.
If you have C++11 then there are some tools to help you here, mostly std::u32string which is able to hold UTF-32 encoded strings:
#include <string>
#include <iostream>
#include <unicode/ucnv.h>
#include <unicode/uchar.h>
#include <unicode/utypes.h>
// convert from UTF-32 to UTF-8
std::string to_utf8(std::u32string s)
{
UErrorCode status = U_ZERO_ERROR;
char target[1024];
int32_t len = ucnv_convert(
"UTF-8", "UTF-32"
, target, sizeof(target)
, (const char*)s.data(), s.size() * sizeof(char32_t)
, &status);
return std::string(target, len);
}
// convert from UTF-8 to UTF-32
std::u32string to_utf32(const std::string& utf8)
{
UErrorCode status = U_ZERO_ERROR;
char32_t target[256];
int32_t len = ucnv_convert(
"UTF-32", "UTF-8"
, (char*)target, sizeof(target)
, utf8.data(), utf8.size()
, &status);
return std::u32string(target, (len / sizeof(char32_t)));
}
int main()
{
// UTF-8 input (needs UTF-8 editor)
std::string utf8 = "ab》cd《ef"; // UTF-8
// convert to UTF-32
std::u32string utf32 = to_utf32(utf8);
// Now it is safe to use string indexing
// But i is for length so starting from 1
for(std::size_t i = 1; i < utf32.size(); ++i)
{
// convert back to to UTF-8 for output
// NOTE: i + 1 to include the BOM
std::cout << to_utf8(utf32.substr(0, i + 1)) << '\n';
}
}
Output:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
ab》cd《ef
NOTE:
The ICU library adds a BOM (Byte Order Mark) at the beginning of the strings it converts into Unicode. Therefore you need to deal with the fact that the first character of the UTF-32 string is the BOM. This is why the substring uses i + 1 for its length parameter to include the BOM.
Your C++ code is simply echoing octets to your terminal, and it is your terminal display that's converting octets encoded in its default character set to unicode characteers.
It looks like, based on your example, that your terminal display uses UTF-8. The rules for converting UTF-8-encoded characters to unicode are fairly well specified (Google is your friend), so all you have to do is to check the first character of a UTF-8 sequence to figure out how many octets make up the next unicode character.

writing a string to file as a sequence of bytes

I want to write a wide string to a file as a sequence of bytes. I tried two ways, the first way:
std::wstring str = L"This is a test";
LPBYTE pBuf = (LPBYTE)str.c_str();
FILE* hFile = _wfopen( L"c:\\temp.txt", L"w" );
for( int i = 0; i<(str.length()*sizeof(wchar_t)); ++i)
fwprintf( hFile, L"%02X", pBuf[i] );
fclose(hFile);
The second way:
std::wstring str = L"This is a test";
LPBYTE pBuf = (LPBYTE)str.c_str();
HANDLE hFile = CreateFile( L"c:\\temp.txt", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL );
DWORD dwRet;
WriteFile( hFile, pBuf, str.length()*sizeof(wchar_t), &dwRet, NULL );
CloseHandle(hFile);
When I open the result file, in the first case the contents of the file are:
54006800690073002000690073002000610020007400650073007400
In the second case, the contents of the file are:
This is a test
Why the first way doesn't work as expected? it looks like both ways are equal.
In the first example, you used fwprintf to format the bytes as 2-digit hex strings so that is why you see hex in that file.
I suspect you should spend some time researching the ASCII code and UTF-16LE and looking at text using a hex editor.
Every file is just a sequence of bytes so your question is not well defined and makes me think you have some fundamental misunderstanding about bytes and encodings but I'm not sure what it is.
Assuming you want to write out the in-memory representation of the string:
#include <fstream>
int main (int argc,char *argv[]) {
std::wstring str = L"This is a test";
std::ofstream fout(R"(c:\temp.txt)");
fout.exceptions(std::ios::badbit | std::ios::failbit);
fout.write(reinterpret_cast<const char*>(str.data()), sizeof(wchar_t) * str.size());
}
We use ofstream because this is C++ and it's better to use RAII types instead of having to manually call fclose or CloseHandle. We use a raw string for the filename so we don't have to deal with escaping the backslash. (On platforms that use a sensible path separator ; ) the raw string here is unnecessary.) We also turn on exceptions so that we don't have to explicitly check for errors.
Then we write out the bytes using the write member function. Note that the codecvt facet is still applied to the data written using this method. This is the reason we're using ofstream instead of wofstream; The default facet for ofstream does nothing, but the default facet for wofstream would convert the wchar_t to char using the default locale.
If you simply want to write UTF-16 data out then there are better ways than trying to write the raw bytes of a wchar_t string. (wchar_t isn't necessarily UTF-16. Some platforms just happen to use UTF-16.)
One way is to use a the codecvt_utf16 facet:
#include <fstream>
#include <codecvt>
int main(int argc, char *argv[]) {
std::wstring str = L"This is a test";
std::wofstream fout(R"(C:\temp.txt)");
fout.exceptions(std::ios::badbit | std::ios::failbit);
fout.imbue(std::locale(std::locale("C"), new std::codecvt_utf16<wchar_t>));
fout << str;
}
Here we write a wchar_t string normally, but we've imbued the wstream with codecvt_utf16, so that the the wchar_t is converted to UTF-16. If you want little endian UTF-16, or you want to include U+FEFF at the beginning of the file (these are frequently done on Windows) then there are flags to enable that: std::codecvt_utf16<wchar_t, 0x10FFFF, std::codecvt_mode::generate_header | std::codecvt_mode::little_endian>. (also note that codecvt_utf16 will treat wchar_t as UCS-2 or UCS-4, never UTF-16. The upshot is that this only handles the BMP on Windows)
Another option is to use normal streams and the wstring_convert facility:
#include <fstream>
#include <codecvt>
int main(int argc, char *argv[]) {
std::wstring str = L"This is a test";
std::ofstream fout(R"(C:\temp.txt)");
fout.exceptions(std::ios::badbit | std::ios::failbit);
std::wstring_convert<std::codecvt_utf16<wchar_t>, wchar_t> convert;
fout << convert.to_bytes(str);
}
This is probably the option I would choose, since it allows one to almost completely avoid wchar_t.

How I can print the wchar_t values to console?

Example:
#include <iostream>
using namespace std;
int main()
{
wchar_t en[] = L"Hello";
wchar_t ru[] = L"Привет"; //Russian language
cout << ru
<< endl
<< en;
return 0;
}
This code only prints HEX-values like adress.
How to print the wchar_t string?
Edit: This doesn’t work if you are trying to write text that cannot be represented in your default locale. :-(
Use std::wcout instead of std::cout.
wcout << ru << endl << en;
Can I suggest std::wcout ?
So, something like this:
std::cout << "ASCII and ANSI" << std::endl;
std::wcout << L"INSERT MULTIBYTE WCHAR* HERE" << std::endl;
You might find more information in a related question here.
You cannot portably print wide strings using standard C++ facilities.
Instead you can use the open-source {fmt} library to portably print Unicode text. For example (https://godbolt.org/z/nccb6j):
#include <fmt/core.h>
int main() {
const char en[] = "Hello";
const char ru[] = "Привет";
fmt::print("{}\n{}\n", ru, en);
}
prints
Привет
Hello
This requires compiling with the /utf-8 compiler option in MSVC.
For comparison, writing to wcout on Linux:
wchar_t en[] = L"Hello";
wchar_t ru[] = L"Привет";
std::wcout << ru << std::endl << en;
may transliterate the Russian text into Latin (https://godbolt.org/z/za5zP8):
Privet
Hello
This particular issue can be fixed by switching to a locale that uses UTF-8 but a similar problem exists on Windows that cannot be fixed just with standard facilities.
Disclaimer: I'm the author of {fmt}.
Windows has the very confusing information. You should learn C/C++ concept from Unix/Linux before programming in Windows.
wchar_t stores character in UTF-16 which is a fixed 16-bit memory size called wide character but wprintf() or wcout() will never print non-english wide characters correctly because no console will output in UTF-16. Windows will output in current locale while unix/linux will output in UTF-8, all are multi-byte. So you have to convert wide characters to multi-byte before printing. The unix command wcstombs() doesn't work on Windows, use WideCharToMultiByte() instead.
First you need to convert file to UTF-8 using notepad or other editor. Then install font in command prompt console so that it can read/write in your language and change code page in console to UTF-8 to display correctly by typing in the command prompt "chcp 65001" while cygwin is already default to UTF-8. Here is what I did in Thai.
#include <windows.h>
#include <stdio.h>
int main()
{
wchar_t* in=L"ทดสอบ"; // thai language
char* out=(char *)malloc(15);
WideCharToMultiByte(874, 0, in, 15, out, 15, NULL, NULL);
printf(out); // result is correctly in Thai although not neat
}
Note that
874=(Thai) code page in the operating system, 15=size of string
My suggestion is to avoid printing non-english wide characters to console unless necessary because it is not easy.
#include <iostream>
using namespace std;
void main()
{
setlocale(LC_ALL, "Russian");
cout << "\tДОБРО ПОЖАЛОВАТЬ В КИНО!\n";
}
The way to do it is to convert UTF-16 LE (Default Windows encoding) into UTF-8, and then print to console (chcp 65001 first, to switch codepage to UTF-8).
It's pretty trivial to convert UTF-16 to UTF-8. Use this page as a guide, if you need more than 2 byte characters.
short* cmd_s = (short*)cmd;
while(cmd_s[i] != 0)
{
short u16 = cmd_s[i++];
if(u16 > 0x7F)
{
unsigned char c0 = ((char)u16 & 0x3F) | 0x80; // Least significant
unsigned char c1 = char(((u16 >> 6) & 0x1F) | 0xC0); // Most significant
cout << c1 << c0; // Use Big-endian network order
}
else
{
unsigned char c0 = (char)u16;
cout << c0;
}
}
Of course, you can put it in a function and extend it to handle wider characters (For Cyrillic it should be enough), but I wanted to show basic algorithm, and to prove that it's not hard at all and you don't need any libraries, just a few lines of code.
You could use use a normal char array that is actually filled with utf-8 characters. This should allow mixing characters across languages.
You can print wide characters with wprintf.
#include <iostream>
int main()
{
wchar_t en[] = L"Hello";
wchar_t ru[] = L"Привет"; //Russian language
wprintf(en);
wprintf(ru);
return 0;
}
Output:
Hello
Привет