Need explanation on creating utf-8 encoded files on linux using c++

I need some explanation of file encoding when using g++ on Linux.
I have this simple code:
int main ()
{
FILE * pFile;
char buffer[] = { 'x' , 'y' , 'z' ,'é' };
pFile = fopen ("myfile", "wt, ccs=UTF-8");
//pFile = fopen ("myfile", "wt");
fwrite (buffer , sizeof(char), sizeof(buffer), pFile);
fclose (pFile);
return 0;
}
Even if the "ccs=UTF-8" part is added on the fopen line, the output file of this program is always encoded in iso-8859-1. However, if I create a file containing these characters using vi on Linux, the resulting file is UTF-8 encoded (I use the command "file myfile" to check the encoding of the file, and "xxd -b myfile" confirms this behavior).
So I would like to understand:
1- Why doesn't g++ on Linux create a UTF-8 file by default?
2- What is the purpose of ccs=UTF-8 if the file created is not encoded in UTF-8?
3- How can I create a UTF-8 file based on this simple code?
Thanks.

Your file may appear to be in ISO-8859-1, but it's actually not. It's simply broken.
Your file contains byte A9, which is the low (second) byte of the UTF-8 representation of é (C3 A9).
When you wrote 'é', the compiler should have warned you:
aaa.c:4:38: warning: multi-character character constant [-Wmultichar]
char buffer[] = { 'x' , 'y' , 'z' ,'é' };
^
char is not a type for a character, it's a type for one byte. GCC treats multibyte character literals as big-endian integers. Here, the value is converted immediately to char, leaving only the lowest byte: A9.
(BTW, é in ISO-8859-1 is E9, not A9)
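The truncation described above can be sketched in a few lines. This assumes the source file is saved as UTF-8, so the compiler sees 'é' as the two bytes C3 A9 and the multi-character constant has the value 0xC3A9; low_byte is a hypothetical helper name used only for illustration.

```cpp
#include <cstdint>

// Assumption: the source file is UTF-8, so GCC gives the multi-character
// constant 'é' the integer value 0xC3A9 (bytes C3 A9, big-endian order).
// low_byte() is a hypothetical helper that mimics the narrowing to one byte.
std::uint8_t low_byte(unsigned int multichar_value) {
    // Keep only the least significant byte, as the conversion to char does.
    return static_cast<std::uint8_t>(multichar_value & 0xFFu);
}
```

Calling low_byte(0xC3A9u) yields 0xA9, which is the byte that ends up in the file.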
You open your file with an encoding, but then you save bytes into it. The bytes correspond to ISO-8859-1 characters xyz©.
If you want to write characters, not bytes, then use wchar_t instead of char and fputws instead of fwrite:
#include <stdio.h>
#include <wchar.h>
int main ()
{
FILE * pFile;
// note final zero and L indicating wchar_t literal
wchar_t buffer[] = { 'x' , 'y' , 'z' , L'é' , 0};
// note no space before ccs
pFile = fopen ("myfile", "wt,ccs=UTF-8");
fputws(buffer, pFile);
fclose (pFile);
return 0;
}
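Another possible route, if the data should stay in char and go through fwrite, is to put the UTF-8 bytes of é (C3 A9) into the buffer explicitly and write them unchanged. This is only a minimal sketch; the helper name write_xyz_e_acute is made up for illustration.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper: writes "xyzé" as UTF-8 using plain bytes.
// Returns true if every byte was written and the file closed cleanly.
bool write_xyz_e_acute(const char* path) {
    // "\xC3\xA9" is the two-byte UTF-8 encoding of é (U+00E9).
    const char text[] = "xyz\xC3\xA9";
    std::FILE* f = std::fopen(path, "wb");   // binary mode: bytes go out unchanged
    if (!f) return false;
    const std::size_t len = std::strlen(text);            // 5 bytes
    const std::size_t written = std::fwrite(text, 1, len, f);
    return std::fclose(f) == 0 && written == len;
}
```

Writing the escapes by hand keeps the result independent of the encoding the source file happens to be saved in.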

Related

Unicode big endian some character not getting properly from wchar_t array

I am trying to extract the exact "Unicode big endian" characters from an array.
The values are taken directly from a file that uses big-endian encoding. I use VS 2015 with the MFC framework (Unicode support).
values: 𠀐亙𠀃𠀃亙亙𠀐𠀐Val𪛕𨕥
Reading these values from the file into an array and writing the unchanged array directly to another txt file in Unicode big-endian format is possible. But when I work with individual characters, some of them give a wrong result.
Written directly in the editor.cpp file:
wchar_t chr[] = {L'𠀐', L'亙', L'𠀃', L'𠀃', L'亙', L'亙', L'𠀐', L'𠀐', L'V', L'a', L'l', L'𪛕', L'𨕥'};
wchar_t chVal = (wchar_t) chr[0]; // getting � or a rectangle mark
if(chVal == L'𠀐')
MessageBox(_T("Show msg")); // results wrong
wchar_t chVal = (wchar_t) chr[1]; // getting 亙 proper element.
if(chVal == L'亙')
MessageBox(_T("Show msg")); // results correct
Similarly, 'V', 'a' and 'l' give correct results.
=======================================
Before this, I placed the code:
wchar_t* ch = _wsetlocale(LC_ALL, _T("Chinese"));
Is this a problem caused by _wsetlocale?
In the editor we can write those characters directly, but during debugging or when running the exe the results are wrong.
Why does the editor not display some characters during debugging or execution?
================
updated:
// wcstring is wchar_t array with unicode characters
CStringW str; wchar_t wh;
System::Text::Encoding^ encodingWr = System::Text::Encoding::BigEndianUnicode;
StreamWriter^ writer = gcnew StreamWriter("Converted.txt", true, encodingWr );
//String^ line = reader->ReadLine();
for(int ct = 0; ct< ctTot; ct++)
{
int ln = wcstring[ct]; // correct number
wh = /*(wchar_t)*/ wcstring[ct]; //wrong
str.Format(_T("UNNUM %d %lc"), ln, wh);
/* https://learn.microsoft.com/en-us/cpp/text/how-to-convert-between-various-string-types?view=vs-2017*/
// Convert a wide character CStringW to a
// System::String.
String ^systemstringw = gcnew String(str);
//systemstringw += " (System::String)";
//Console::WriteLine("{0}", systemstringw);
//delete systemstringw;
writer->WriteLine(systemstringw);
delete systemstringw;
OutputDebugString(str);
}
But I need the correct Unicode characters to be written to the file.
I would also like to know whether this is a compiler problem.

How can I read a text file containing Unicode, into a wchar_t pointer using wifstream?

I'm trying to read Unicode characters from a text file into a wchar_t pointer array, using wifstream. Here is a code snippet:
locale::global(std::locale("en_US.UTF-8"));
std::wifstream inputFile("gsmCharacterSet.txt", std::ifstream::binary | std::ifstream::ate);
int length = inputFile.tellg();
inputFile.seekg(0,inputFile.beg);
wchar_t *charArray = new wchar_t[length];
inputFile.read(charArray,length);
It's not working. The length returned is 252 which is the correct file size in bytes. However, the array remains empty.
The following condition returns true:
if ( inputFile.peek() == std::wifstream::traits_type::eof() )
cout << "File is empty";
I'm compiling the program on Linux using g++.
What am I doing wrong? Thanks for the help.

How to process CSV lines with nul char in some elements?

When reading and parsing a CSV-file line, I need to process the nul character that appears as the value of some row fields. It is complicated by the fact that sometimes the CSV file is in windows-1250 encoding, sometimes it is in UTF-8, and sometimes in UTF-16. Because of this, I had already started down one path before I ran into the nul char problem -- see below.
Details: I need to clean CSV files from a third party into the form common to our data extractor (that is, the utility works as a filter, converting one CSV form into another CSV form).
My initial approach was to open the CSV file in binary mode and check whether the first bytes form BOM. I know all the given Unicode files start with BOM. If there is no BOM, I know that it is in windows-1250 encoding.
The converted CSV file should use the windows-1250 encoding. So, after checking the input file, I open it using the related mode, like this:
// Open the file in binary mode first to see whether BOM is there or not.
FILE * fh{ nullptr };
errno_t err = fopen_s(&fh, fnameIn.string().c_str(), "rb"); // const fs::path & fnameIn
assert(err == 0);
vector<char> buf(4, '\0');
fread(&buf[0], 1, 3, fh);
::fclose(fh);
// Set the isUnicode flag and open the file according to that.
string mode{ "r" }; // init
bool isUnicode = false; // pessimistic init
if (buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) // UTF-8 BOM
{
mode += ", ccs=UTF-8";
isUnicode = true;
}
else if ((buf[0] == 0xFE && buf[1] == 0xFF) // UTF-16 BE BOM
|| (buf[0] == 0xFF && buf[1] == 0xFE)) // UTF-16 LE BOM
{
mode += ", ccs=UNICODE";
isUnicode = true;
}
// Open in the suitable mode.
err = fopen_s(&fh, fnameIn.string().c_str(), mode.c_str());
assert(err == 0);
After the successful open, the input line is read either via fgets or via fgetws, depending on whether Unicode was detected or not. The idea was then to convert the buffer content from Unicode to 1250 if Unicode was detected earlier, or to leave the buffer in 1250 otherwise. The s variable should contain the string in the windows-1250 encoding. ATL::CW2A(buf, 1250) is used when conversion is needed:
const int bufsize = 4096;
wchar_t buf[bufsize];
// Read the line from the input according to the isUnicode flag.
while (isUnicode ? (fgetws(buf, bufsize, fh) != NULL)
: (fgets(reinterpret_cast<char*>(buf), bufsize, fh) != NULL))
{
// If the input is in Unicode, convert the buffer content
// to the string in cp1250. Otherwise, do not touch it.
string s;
if (isUnicode) s = ATL::CW2A(buf, 1250);
else s = reinterpret_cast<char*>(buf);
...
// Now processing the characters of the `s` to form the output file
}
It worked fine... until a file with a nul character used as the value in the row appeared. The problem is that when the s variable is assigned, the nul cuts the rest of the line. In the observed case, it happened with the file that used 1250 encoding. But it can probably happen also in the UTF encoded files.
How to solve the problem?

The NUL character problem is solved by using either C++ or Windows functions. In this case, the easiest solution is MultiByteToWideChar which will accept an explicit string length, precisely so it doesn't stop on NUL.
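On the C++ side, the same idea (pass the length explicitly instead of relying on a terminator) might look like the sketch below. It assumes the number of bytes in the line is known, for example from fread or std::getline rather than fgets; to_string_keep_nul is a hypothetical helper name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies exactly `len` bytes into the string, so an
// embedded NUL does not cut the line short the way assignment from a
// plain char* would.
std::string to_string_keep_nul(const char* data, std::size_t len) {
    return std::string(data, len);
}
```

With a five-byte field such as 'a', ';', NUL, ';', 'b', the resulting string keeps all five bytes, whereas std::string(data) alone would stop after the first two.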

fread and fgetc in the same file stream not working

I have a binary file with the following repeating format: 6 float values + 3 unsigned char (byte, integer value from 0 to 255) values.
I am parsing it like this:
FILE *file = fopen("file.bin", "r");
bool valid = true;
while(!feof(file)) {
float vals[6];
valid = valid && (fread((void*)(&vals), sizeof(float), 6, file) == 6);
unsigned char a,b,c;
a = fgetc(file); b = fgetc(file); c = fgetc(file);
(...)
}
This works fine for the first 30 iterations or so, but after that it simply stops parsing (way way before the end of the file).
What could be wrong?
I also tried parsing the unsigned char bytes with
fread((void*)&(a), sizeof(unsigned char), 1, file);

it simply stops parsing (way way before the end of the file).
You and the C standard library are having a difference of opinion about where the end of the file is. On DOS/Windows, the byte with value 26 (hex 1A, aka Ctrl+Z) is a control character meaning "end of file" when a file is read in text mode. (On Unix/Linux, Ctrl+D, decimal 4, hex 04, signals end of input only when typed at a terminal; it has no special meaning inside a file.) There's also the file length stored in the filesystem metadata.
The C stdio functions can open a file in text mode or binary mode, and the mode controls several behaviors:
Newline translations (implementation-defined): enabled in text mode, disabled in binary mode.
End of file: implementation-defined, but on Windows usually the EOF character in text mode and the filesystem file length in binary mode.
When the mode string contains no "b", as in your fopen call, the stream is opened in text mode.
Since your file contains binary data, you should force binary mode by using "b" in the mode string to fopen, e.g.
FILE* file = fopen("file.bin", "rb");
When you do so, characters with value 26 are treated like any other byte and lose their "EOF" meaning.
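A minimal sketch of the read loop with binary mode applied is shown below. Apart from the "rb" mode, it also checks the result of each read instead of looping on feof(), which is a separate change not discussed above; the Record struct and the count_records helper are hypothetical names used for illustration.

```cpp
#include <cstdio>

// One record as described in the question: 6 floats followed by 3 bytes.
struct Record {
    float vals[6];
    unsigned char a, b, c;
};

// Hypothetical helper: reads records until a read comes up short.
// Returns the number of complete records, or -1 if the file cannot be opened.
long count_records(const char* path) {
    std::FILE* file = std::fopen(path, "rb");   // "b": bytes such as 0x1A are plain data
    if (!file) return -1;
    long count = 0;
    Record r;
    // Test the result of each read rather than calling feof() up front.
    while (std::fread(r.vals, sizeof(float), 6, file) == 6) {
        unsigned char tail[3];
        if (std::fread(tail, 1, 3, file) != 3) break;   // truncated record
        r.a = tail[0];
        r.b = tail[1];
        r.c = tail[2];
        ++count;
    }
    std::fclose(file);
    return count;
}
```

On a file of 40 records whose trailing bytes include 0x1A, this loop would be expected to report all 40 records.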

Strange text-mode file output behavior

Consider the following code
FILE * pOutFile;
unsigned char uid;
pOutFile = fopen("OutFile.bin","w") ; // open a file to write
uid = 0x0A;
fprintf (pOutFile,"%c",uid); // Trying to print 0x0A in to the file
But the print I get in the file is
0x0D 0x0A
Where is this 0x0D coming from? Am I missing something? What consideration must I take to prevent this.
Corrected: uidl was a typo.

Windows text files want new lines to be represented by two consecutive chars: 0x0D and 0x0A.
In C, a new line is represented by a single char: 0x0A.
Thus, on Windows, in C, you have two ways to open a file: text mode or binary mode.
In binary mode, when you write a LineFeed (0x0A) char, a single byte (0x0A) is appended to the file.
In text mode, whenever you write a LineFeed (0x0A) char, two bytes (0x0D and 0x0A) are appended to the file.
The solution is to open the file in binary mode, using "wb".

Because you have opened the file in "w" mode it is in TEXT mode, which means \n's (aka 0x0a) are translated into \r\n (carriage return and line feed).
If you only want 0x0a written to the file open it in binary mode ("wb").
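The effect of "wb" can be checked with a small sketch like the following, which writes the single byte 0x0A and then counts the bytes that actually landed in the file. The helper name write_lf_binary is hypothetical.

```cpp
#include <cstdio>

// Hypothetical helper: writes the single byte 0x0A in binary mode and
// returns the number of bytes found in the file afterwards, or -1 on failure.
long write_lf_binary(const char* path) {
    std::FILE* out = std::fopen(path, "wb");   // binary: no \n -> \r\n translation
    if (!out) return -1;
    std::fputc(0x0A, out);
    std::fclose(out);

    std::FILE* in = std::fopen(path, "rb");
    if (!in) return -1;
    long size = 0;
    while (std::fgetc(in) != EOF) ++size;      // count bytes one by one
    std::fclose(in);
    return size;
}
```

In binary mode the count is 1; with "w" on Windows the same write would produce the two bytes 0x0D 0x0A described above.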

Actually, none of those are the issue...
FILE * pOutFile;
unsigned char uid;
pOutFile = fopen("OutFile.bin","wb") ; // open a file to write (in binary, not text mode)
uid = 0x0A; //changed from uidl = 0x0A (which didnt set uid)
fprintf (pOutFile, "%c", uid); // Trying to print 0x0A in to the file
What I changed: you were setting uidl and NOT uid, which is the variable you printed.
You could always do the following:
fprintf(pOutFile, "%c", 0x0A); or
fprintf(pOutFile, "%c", '\n'); or
fprintf(pOutFile, "\n");
if you wanted (the last option is probably your best).
I also opened your file in wb mode.
