Writing wide string to a file in byte mode stopped - c++

I am writing out Unicode text (stored as a wstring) to a file in byte mode, but the string in the file ends before the "™" character is printed. Is "™" not Unicode, or am I doing something wrong?
wofstream output;
output.open("output.txt", ofstream::binary);
wstring a =L"ABC™";
output << a;

TM is definitely Unicode. ofstream and wofstream do not write text as UTF-8 by default. You have to encode the output buffer in UTF-8 yourself to see the result you expect. On Windows, try using "WideCharToMultiByte".

There is a common misconception about the iostream binary mode: that it is for reading and writing binary files. The iostream library works only with text files and only reads and writes text. The only thing the "binary" mode changes is how NL (new line) characters are handled. In binary mode, no transformation occurs. In non-binary mode, writing an LF character ('\n') to a stream converts it to the platform-specific new line sequence (Unix -> LF, Windows -> CR LF ("\r\n"), classic Mac OS -> CR), and when reading, the platform-specific new line sequence is converted to a single LF ('\n') character.
For everything else, nothing changes, meaning a wofstream will always convert the Unicode wide character string to a single-byte or multi-byte character stream depending on the locale used by your process. If you have a locale of "en_US.utf8" on Linux, for example, it will be converted to UTF-8. Now, if the current locale does not have a representation for the TM Unicode symbol, then either nothing or a '?' will be written to the file.
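One way to sidestep the locale entirely is to encode the text to UTF-8 by hand and write the resulting bytes through a narrow stream. The following is only a minimal sketch: the helper names (`append_utf8`, `to_utf8`, `write_utf8_file`) are made up for illustration, and it takes `char32_t` code points rather than a `wstring`, on the assumption that you convert first, because the width of `wchar_t` differs between platforms.

```cpp
#include <fstream>
#include <string>

// Append the UTF-8 encoding of one Unicode code point to `out`.
// Surrogates and values above U+10FFFF are replaced with U+FFFD.
void append_utf8(std::string& out, char32_t cp) {
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) cp = 0xFFFD;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode a whole string of code points as UTF-8 bytes.
std::string to_utf8(const std::u32string& text) {
    std::string bytes;
    for (char32_t cp : text) append_utf8(bytes, cp);
    return bytes;
}

// Write the encoded bytes unchanged; returns false if writing failed.
bool write_utf8_file(const std::string& path, const std::u32string& text) {
    std::ofstream out(path, std::ios::binary);
    const std::string bytes = to_utf8(text);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}
```

Calling `write_utf8_file("output.txt", U"ABC\u2122")` should then produce a six-byte file, with the trademark sign stored as the three bytes E2 84 A2.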

Related

How to set file encoding to ISO-8859-1 or WinCP-1252 in C++

I am learning C++ and I have a requirement to write a CSV file encoded in ISO-8859-1 or WinCP-1252.
I've tried the following code snippet to set a locale that will use 1252 codepage encoding, but when I open the output file in Notepad.exe, the encoding is displayed as UTF-8.
std::ofstream ofs;
ofs.imbue(std::locale("English_United States.1252"));
ofs.open("file.txt");
ofs << 78123.456 << std::endl;
If you use only characters with ASCII codes 0..127, you do not need to care about the file encoding. UTF-8, an 8-bit multibyte encoding, is Notepad's default. Notepad is not a tool for determining a file's encoding. In other words, text made up only of characters with ASCII codes 0..127 can be considered to be in any 8-bit ISO or multibyte encoding.
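To check whether the encoding question even arises for a given output, you can test whether every byte falls in the ASCII range. This is a small sketch with a hypothetical helper name, `is_pure_ascii`; it is not part of any library.

```cpp
#include <string>

// True when every byte is in the ASCII range 0..127. Such a buffer has
// the same bytes in ISO-8859-1, Windows-1252 and UTF-8.
bool is_pure_ascii(const std::string& bytes) {
    for (unsigned char b : bytes) {
        if (b > 0x7F) return false;
    }
    return true;
}
```

The output of `ofs << 78123.456` passes this check, so no editor can tell the three encodings apart for that file. A string such as "café" would differ: é is the single byte E9 in ISO-8859-1 and Windows-1252, but the two bytes C3 A9 in UTF-8.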

How to convert unsigned hex values to corresponding unicode characters which should be written to file using c++ [duplicate]

I need to convert unsigned hex values to corresponding unicode characters which should be written to file using c++
so far I have tried this
unsigned short array[2]={0x20ac,0x20ab};
this should be converted to corresponding character in a file using c++
It depends on which encoding you have chosen.
If you are using UTF-8 encoding, you first need to convert each Unicode character to its corresponding UTF-8 byte sequence and then write that byte sequence to the file.
The pseudocode looks like this:
EncodeCharToUTF8(charin, charout, &numbytes); //EncodeCharToUTF8(short,char*, int*);
WriteToFile(charout, numbytes);
If you are using UTF-16 encoding, you first need to write the BOM at the beginning of the file and then encode each character into a UTF-16 byte sequence (byte order matters here: little-endian or big-endian, matching your BOM).
WriteToFile("\xFF\xFE", 2); //Write BOM
EncodeCharToUTF16(charin, charout, &numbytes); //EncodeCharToUTF16(short,char*, int*);
//Write the character.
WriteToFile(charout, numbytes);
UTF-32 is not recommended, although the steps are similar to UTF-16.
I think this should help you get started.
From your array, it seems that you are going to use UTF-16.
Write the UTF-16 BOM as the bytes FF FE for little-endian or FE FF for big-endian. After that, write each character in the same byte order.
I have given pseudocode here, which you can flesh out. Search for more on encoding conversion.
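The UTF-16 pseudocode above can be sketched in C++ roughly as follows. This assumes little-endian output and that the input values are already UTF-16 code units, as in the question's array; the function name `utf16le_with_bom` is invented for this example.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Serialize UTF-16 code units as little-endian bytes, preceded by the
// byte order mark U+FEFF.
std::string utf16le_with_bom(const std::vector<std::uint16_t>& units) {
    std::string bytes;
    auto put = [&bytes](std::uint16_t u) {
        bytes += static_cast<char>(u & 0xFF);         // low byte first
        bytes += static_cast<char>((u >> 8) & 0xFF);  // then high byte
    };
    put(0xFEFF);  // appears in the file as FF FE
    for (std::uint16_t u : units) put(u);
    return bytes;
}
```

For the array {0x20ac, 0x20ab} this yields the six bytes FF FE AC 20 AB 20. Write them with a `std::ofstream` opened with `std::ios::binary` so that no new line translation alters them.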
Actually you are facing two problems:
1. How to convert buffer from UTF-8 encoding to UTF-16 encoding?
I suggest you use the Boost.Locale library.
Sample code could look like this:
std::string ansi = "This is what we want to convert";
try
{
    std::string utf8 = boost::locale::conv::to_utf<char>(ansi, "ISO-8859-1");
    std::wstring utf16 = boost::locale::conv::to_utf<wchar_t>(ansi, "ISO-8859-1");
    std::wstring utf16_2 = boost::locale::conv::utf_to_utf<wchar_t, char>(utf8);
}
catch (const boost::locale::conv::conversion_error&)
{
    std::cout << "Fail to convert to unicode!" << std::endl;
}
2. How to save buffer to a file as UTF-16 encoding?
This involves writing a BOM (byte order mark) at the beginning of the file manually; you can find reference here
That means if you want to save a buffer encoded as UTF-8 to a Unicode file, you should first write the 3 bytes "EF BB BF" at the beginning of the output file. Use "FE FF" for big-endian UTF-16 and "FF FE" for little-endian UTF-16.
If you still don't understand how the BOM works, just open Notepad, write some words, save the file with different "Encoding" options, and then open the saved files with a hex editor to see the BOM.
Hope it helps you!
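The same BOM values can also be checked in code when reading a file back. Below is a minimal sketch; the `Bom` enum and `detect_bom` function are hypothetical names, and the check deliberately ignores UTF-32, whose little-endian BOM also starts with FF FE.

```cpp
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Identify the byte order mark, if any, at the start of a buffer.
// Limitation: a UTF-32LE file (FF FE 00 00) is reported as Utf16LE.
Bom detect_bom(const std::string& bytes) {
    auto starts_with = [&bytes](const std::string& prefix) {
        return bytes.compare(0, prefix.size(), prefix) == 0;
    };
    if (starts_with("\xEF\xBB\xBF")) return Bom::Utf8;
    if (starts_with("\xFF\xFE"))     return Bom::Utf16LE;
    if (starts_with("\xFE\xFF"))     return Bom::Utf16BE;
    return Bom::None;
}
```

A file without a BOM returns `Bom::None`, which says nothing about its encoding; plenty of UTF-8 files are written without one.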

Reading From A File Which Contains Unicode Characters

I have this huge file which contains Unicode strings at the beginning (the first ~10,000 characters or so).
I don't care about the Unicode part; the parts I'm interested in aren't Unicode. But whenever I try to read those parts I get '=', and if I load the entire file into a char array and write it to a temporary file (without altering the data) with ofstream, I get incorrect data: all I get is a text file filled with Í. If I remove the Unicode part manually, everything works fine. So it seems ifstream cannot deal with streams that contain Unicode data. If this assumption is true, is there any way to work on this file without introducing a new library to my project?
Thanks,
EDIT: Here's some sample code. The program reads from this file, which contains characters (some, not all) that can't be represented in ASCII.
ifstream inFile("somefile");
inFile.seekg(0,ios_base::end);
size_t size = inFile.tellg();
inFile.seekg(0,ios_base::beg);
char *book = new char[size];
inFile.read(book,size);
for (int i = 0; i < size; i++) {
    cout << book[i] << " " << i << endl; //book[i] will always be '='
}
ofstream outFile("TEST.txt");
outFile.write(book,size);
outFile.close();
Keith Thompson's question is very important. Depending on which Unicode encoding, writing a small C routine that reads (and discards) the Unicode characters can be trivial, or slightly more complex.
Supposing the encoding is UTF-8, you will have a problem determining when to stop discarding, because ASCII is a subset of UTF-8: any time you encounter an ASCII char, you might be tempted to say "this is it, we're back in ASCII land", yet the next char might still be outside the ASCII range.
So you need to read the file and determine where the last byte with a value > 127 is. Anything after that is plain ASCII -- hopefully.
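That scan could be sketched as below, assuming the file fits in memory and really is UTF-8 followed by plain ASCII. Both function names are made up for this example. Opening the stream in binary mode is an assumption of the sketch, chosen so that the bytes reach the buffer without any text-mode translation.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Read a whole file as raw bytes. Binary mode keeps the stream from
// translating line endings on the way in.
std::string read_all_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Offset of the first byte after the last non-ASCII byte (value > 127).
// Returns 0 when the buffer contains no such byte.
std::size_t ascii_tail_offset(const std::string& bytes) {
    for (std::size_t i = bytes.size(); i > 0; --i) {
        if (static_cast<unsigned char>(bytes[i - 1]) > 0x7F) return i;
    }
    return 0;
}
```

With `const std::string book = read_all_bytes("somefile");`, the ASCII-only part would be `book.substr(ascii_tail_offset(book))`. Note that ASCII characters sitting between two non-ASCII ones are still treated as part of the Unicode section.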
A text file is generally in just one encoding: UTF-8, UTF-16 (big- or little-endian), UTF-32 (big- or little-endian), ASCII, or one of the ANSI code pages. Mixing encodings is only possible in some custom ways.
That said, you will have to read both the data that you need and the data that you don't in the same encoding. If you know the format is UTF-8 you could, depending on what you are going to do with the data, read the file as a binary file into a char buffer piece by piece. Then you could use API(s) like strnextc (on Windows; an equivalent API should be available on other platforms) to move character by character through the buffer. Once you reach the end, you could move the remainder to the front of the buffer and load the rest of the buffer from the file.
In fact, you could use the above approach for any encoding. But for UTF-16, you could try using wifstream, provided the endianness of the file and of the platform you are running on is the same. You also need to check whether the implementation of wifstream handles a change in endianness and takes care of the BOM (byte order mark), the 2-byte sequence ("FE FF" or "FF FE") that is generally present at the beginning of a file, let alone surrogate pairs.

how to get a single character from UTF-8 encoded URDU string written in a file?

I am working on Urdu-Hindi translation/transliteration. My objective is to translate an Urdu sentence into Hindi and vice versa. I am using Visual C++ 2010 with the C++ language. I have written an Urdu sentence in a text file saved in UTF-8 format. Now I want to read that file one character at a time so that I can convert each character into its equivalent Hindi character. When I read a single character from the input file and write it to the output file, I get some unknown, ugly-looking character in the output file. Kindly help me with proper code. My code is as follows:
#include<iostream>
#include<fstream>
#include<cwchar>
#include<cstdlib>
using namespace std;
void main()
{
    wchar_t arry[50];
    wifstream inputfile("input.dat",ios::in);
    wofstream outputfile("output.dat");
    if(!inputfile)
    {
        cerr<<"File not open"<<endl;
        exit(1);
    }
    while (!inputfile.eof()) // i am using this while just to
                             // make sure copy-paste operation of
                             // written urdu text from one file to
                             // another when i try to pick only one character
                             // from file, it does not work.
    { inputfile>>arry; }
    int i=0;
    while(arry[i] != '\0') // i want to get urdu character placed at
                           // each-index so that i can work on it to convert
                           // it into its equivalent hindi character
    {
        outputfile<<arry[i]<<endl;
        i++;
    }
    inputfile.close();
    outputfile.close();
    cout<<"Hello world"<<endl;
}
Assuming you are on Windows, the easiest way to get "useful" characters is to read a larger chunk of the file (for example a line, or the entire file) and convert it to UTF-16 using the MultiByteToWideChar function. Use the "pseudo" code page CP_UTF8. In many cases, decoding the UTF-16 isn't required, but I don't know about the languages you are referring to; if you expect non-BMP characters (with codes above 65535), you might want to consider decoding the UTF-16 (or decoding the UTF-8 yourself) to avoid having to deal with characters that take two 16-bit code units (surrogate pairs).
You can also write your own UTF-8 decoder, if you prefer. It's not complicated, and just requires some bit-juggling to extract the proper bits from the input bytes and assemble them into the final unicode value.
HINT: Windows also has a NormalizeString() function, which you can use to make sure the characters from the file are what you expect. This can be used to transform characters that have several representations in Unicode into their "canonical" representation.
EDIT: if you read up on UTF-8 encoding, you can easily see that you can read the first byte, figure out how many more bytes you need, read these as well, and pass the whole thing to MultiByteToWideChar or your own decoder (although your own decoder could just read from the file, of course). That way you could really do a "read one char at a time".
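The "read the first byte, figure out how many more you need" idea might look like the sketch below. It is a hand-written decoder with a made-up name, `next_code_point`, working on a byte buffer already read from the file. It assumes `pos` is a valid index into the buffer, and it is deliberately lenient: overlong encodings and encoded surrogates are not rejected.

```cpp
#include <cstddef>
#include <string>

// Decode the code point that starts at bytes[pos] and advance pos past it.
// Precondition: pos < bytes.size().
// On a malformed or truncated sequence, returns U+FFFD and skips one byte.
char32_t next_code_point(const std::string& bytes, std::size_t& pos) {
    const unsigned char lead = static_cast<unsigned char>(bytes[pos]);
    std::size_t extra = 0;  // number of continuation bytes expected
    char32_t cp = 0;
    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else { ++pos; return 0xFFFD; }  // stray continuation or invalid lead

    if (pos + extra >= bytes.size()) { ++pos; return 0xFFFD; }  // truncated

    for (std::size_t i = 1; i <= extra; ++i) {
        const unsigned char c = static_cast<unsigned char>(bytes[pos + i]);
        if ((c & 0xC0) != 0x80) { ++pos; return 0xFFFD; }
        cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
    }
    pos += extra + 1;
    return cp;
}
```

Most Urdu letters live in the Arabic block and take two bytes each in UTF-8, so the bytes D8 A7 decode to U+0627 (alef) and `pos` advances by two. Looping with `while (pos < bytes.size())` visits one character per iteration.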
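The 'w' stream classes do not read and write UTF-8 by default. They convert between the bytes in the file and wchar_t using the stream's locale, and with the default "C" locale no UTF-8 decoding takes place. If your file is in UTF-8, reading it with this code will produce gibberish.
You will need to read it as bytes and then convert it, or imbue the stream with a locale that decodes the file's actual encoding.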

Typographic apostrophe + wide string literal broke my wofstream (C++)

I’ve just encountered some strange behaviour when dealing with the ominous typographic apostrophe ( ’ ) – not the typewriter apostrophe ( ' ). Used with wide string literal, the apostrophe breaks wofstream.
This code works
ofstream file("test.txt");
file << "A’B" ;
file.close();
==> A’B
This code works
wofstream file("test.txt");
file << "A’B" ;
file.close();
==> A’B
This code fails
wofstream file("test.txt");
file << L"A’B" ;
file.close();
==> A
This code fails...
wstring test = L"A’B";
wofstream file("test.txt");
file << test ;
file.close();
==> A
Any ideas?
You should "enable" locale before using wofstream:
std::locale::global(std::locale("")); // Enable locale support: "" selects the user's preferred locale
wofstream file("test.txt");
file << L"A’B";
So if your system locale is en_US.UTF-8, the file test.txt will contain UTF-8 encoded data (5 bytes). If your system locale is en_US.ISO8859-1, the text would be written in that 8-bit encoding (3 bytes), unless the encoding lacks the character, as ISO 8859-1 does for ’ (U+2019).
wofstream file("test.txt");
file << "A’B" ;
file.close();
This code works because "A’B" is actually a UTF-8 string, and you save the UTF-8 string to the file byte by byte.
Note: I assume you are using a POSIX-like OS and that your system locale is something other than "C", which is the default locale at program startup.
Are you sure it's not your compiler's support for Unicode characters in source files that is "broken"? What if you use \x or similar to encode the character in the string literal? Is your source file even in whatever encoding might map to a wchar_t for your compiler?
Try wrapping the stream insertion in a try-catch block and tell us what, if any, exception it throws.
I am not sure what is going on here, but I'll hazard a guess anyway. The typographic apostrophe probably has a value that fits into one byte. This works with "A’B" since it blindly copies bytes without bothering about the underlying encoding. However, with L"A’B", an implementation-dependent encoding factor comes into play. It probably doesn't find the proper UTF-16 (if you are on Windows) or UTF-32 (if you are on *nix/Mac) value to store for this particular character.