Strange text-mode file output behavior - c++

Consider the following code
FILE * pOutFile;
unsigned char uid;
pOutFile = fopen("OutFile.bin","w") ; // open a file to write
uid = 0x0A;
fprintf (pOutFile,"%c",uid); // Trying to print 0x0A in to the file
But the print I get in the file is
0x0D 0x0A
Where is this 0x0D coming from? Am I missing something? What must I do to prevent this?
Corrected: uidl was a typo.

Windows text files want new lines to be represented by two consecutive chars: 0x0D and 0x0A.
In C, a new line is represented by a single char: 0x0A.
Thus, on Windows, in C, you have two ways to open a file: text mode or binary mode.
In binary mode, when you write a LineFeed (0x0A) char, a single byte (0x0A) is appended to the file.
In text mode, whenever you write a LineFeed (0x0A) char, two bytes (0x0D and 0x0A) are appended to the file.
The solution is to open the file in binary mode, using "wb".
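To see the difference for yourself, here is a minimal sketch that writes one 0x0A byte with a given mode string and then counts the bytes that ended up on disk. The function name and file name are made up for illustration.

```cpp
#include <cstdio>

// Writes a single line-feed byte (0x0A) to `path` using the given fopen mode,
// then returns the number of bytes actually stored in the file (-1 on error).
long writeLineFeedAndMeasure(const char* path, const char* mode)
{
    std::FILE* out = std::fopen(path, mode);
    if (out == nullptr) return -1;
    std::fputc(0x0A, out);
    std::fclose(out);

    // Reopen in binary mode so the count reflects the raw bytes on disk.
    std::FILE* in = std::fopen(path, "rb");
    if (in == nullptr) return -1;
    long size = 0;
    while (std::fgetc(in) != EOF) ++size;
    std::fclose(in);
    return size;
}
```

Calling writeLineFeedAndMeasure("lf_demo.bin", "wb") should give 1 on any platform. With "w" you would expect 2 on Windows, because of the translation described above, and 1 on Unix-like systems.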

Because you have opened the file in "w" mode it is in TEXT mode, which means \n's (aka 0x0a) are translated into \r\n (carriage return and line feed).
If you only want 0x0a written to the file open it in binary mode ("wb").

Actually, none of those are the issue...
FILE * pOutFile;
unsigned char uid;
pOutFile = fopen("OutFile.bin","wb") ; // open a file to write (in binary, not text mode)
uid = 0x0A; //changed from uidl = 0x0A (which didnt set uid)
fprintf (pOutFile, "%c", uid); // Trying to print 0x0A in to the file
What I changed: you were setting uidl and NOT uid, which is the variable you printed.
You could always do the following:
fprintf(pOutFile, "%c", 0x0A); or
fprintf(pOutFile, "%c", '\n'); or
fprintf(pOutFile, "\n");
if you wanted (the last option is probably your best).
I also opened your file in wb mode.

Related

Copying picture (.tga) from text content

I am working on a game and I want to hide all my .tga files.
I concatenate the contents of all my files into a single file to make them unreadable for players.
I want my program to load a picture by creating a temporary .tga file from
the saved content.
To that end, I am trying to recreate a .tga file from the content of an original one.
More precisely, I read a .tga file as text and write it.
Even though Notepad++ reports the original file and the new file as identical, the new file cannot be opened as a .tga file. Windows reports file sizes that differ by 1 byte.
Can you explain what I'm doing wrong?
Or maybe suggest a better way to hide my files.
Regards
More precisely, I read a .tga file as text and write it
Herein may lie your problem: You have to read and write the .tga file as a binary file. Otherwise, any occurrence of the byte sequence 0x0D 0x0A (CR LF, Windows line ending) may be replaced with a single 0x0A (LF, Unix line ending) or vice versa, or 0x1A (DOS end of file) may be stripped or appended. Depending on the code you are using, you may also end up stripping any 0x00 (NUL) bytes.
I tried to read and write a .tga file as a binary file with my program (C++), but the generated file was still corrupted. The code is below.
std::string name = "my_picture.tga";
std::ifstream FileIn(name, std::ios_base::binary);
std::vector<char> listChar;
bool stopp = false;
if (FileIn) {
while (!(stopp))
{
char xin;
FileIn.read(reinterpret_cast<char*>(&xin), sizeof(char));
listChar.push_back(xin);
if (FileIn.eof()) stopp = true;
}
FileIn.close();
}
std::ofstream FileOut(".\\test.tga", std::ios_base::binary);
bool isCarierReturn = false;
for (char xout : listChar) {
isCarierReturn = xout == '\r';
if (!isCarierReturn) FileOut.write(reinterpret_cast<const char*>(&xout), sizeof(char));
}
FileOut.close();
I compared the original file and the new one in a hex viewer and the files are indeed different.
The difference is a mismatch in line endings: where the original file has just 0x0A ('\n'), the new file has the byte sequence 0x0D 0x0A ('\r' and '\n'). For some other pictures the generated file was incomplete; the break is always before a 0x1A value (as Christoph Lipka said).
I managed to write the right sequence by testing whether the char is a carriage return followed by a line feed. In that case the char is not written, so only the 0x0D byte is skipped, see below:
std::ofstream FileOut(".\\test.tga", std::ios_base::binary);
bool isCarrierReturn = false;
char xout_p1 = '\0';
if (listChar.size() >= 1) xout_p1 = listChar.at(0);
for (unsigned i(0); i < listChar.size(); i++) {
char xout = xout_p1;
if (i < listChar.size() - 1) xout_p1 = listChar.at(i + 1);
else xout_p1 = '\0';
isCarrierReturn = xout == '\r' && xout_p1 == '\n';
if (!isCarrierReturn) FileOut.write(reinterpret_cast<const char*>(&xout), sizeof(char));
}
FileOut.close();
The incomplete file reading is solved by reading the file as a binary file.
It works.
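For comparison, here is a minimal sketch of a byte-for-byte copy where both streams are opened in binary mode. Under that assumption no CR filtering should be needed, since nothing translates line endings or reacts to 0x1A. The function name is made up for illustration; this is not the code from the question.

```cpp
#include <fstream>
#include <string>

// Copies `src` to `dst` byte for byte. Both streams are opened in binary
// mode, so no newline translation or 0x1A handling takes place.
// Note: copying an empty source sets failbit on `out`, so this sketch
// reports failure for empty files.
bool copyBinary(const std::string& src, const std::string& dst)
{
    std::ifstream in(src, std::ios::binary);
    std::ofstream out(dst, std::ios::binary);
    if (!in || !out) return false;
    out << in.rdbuf();   // stream the whole input buffer into the output
    return static_cast<bool>(out);
}
```

If a copy made this way still differs from the original, the original itself was probably written through a text-mode stream at some earlier step.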

trying to append new text to an existing file using fwrite() with mode "a+" but get weird string written

I am writing a program that appends text to a file every time it is called. I don't want to rewrite the entire file, and I want the new text to go on a new line. Here is my test code:
void writeFile()
{
FILE *pFile;
char* data = "hahaha";
int data_size = 7;
int count = 1;
pFile = fopen("textfile.bin","a+");
if (pFile!=NULL)
{
fwrite (data, data_size, count, pFile);
fclose (pFile);
}
}
The first time it was called, everything worked fine: a new file was created and the data was written successfully. But when I called it again, expecting the new data to be appended, I got weird strings in the file, something like: 慨慨慨栀桡桡a.
I am not really familiar with C++ I/O functions. Can someone tell me what I did wrong? Also, any suggestion for appending text to the next line?
I think you are running into a code set issue, and the program you're using to look at the file you write expects to find UTF-16 data in the file.
I base this on an analysis of the string you quote:
慨慨慨栀桡桡a
When that (UTF-8) data is converted to Unicode values, I get:
0xE6 0x85 0xA8 = U+6168
0xE6 0x85 0xA8 = U+6168
0xE6 0x85 0xA8 = U+6168
0xE6 0xA0 0x80 = U+6800
0xE6 0xA1 0xA1 = U+6861
0xE6 0xA1 0xA1 = U+6861
0x61 = U+0061
0x0A = U+000A
The Unicode value U+6168 is represented in little-endian as bytes 0x68 0x61, and the ASCII code for h is 104 (0x68) and for a is 97 (0x61). So, the data is probably written correctly, but the interpretation of the data that is written is incorrect.
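The pairing can be reproduced with a small sketch that groups bytes into little-endian 16-bit units, roughly the way a viewer that assumes UTF-16LE would. The function name is made up for illustration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Pairs up bytes as little-endian 16-bit code units, the way a viewer
// that assumes UTF-16LE would. A trailing odd byte is ignored.
std::vector<std::uint16_t> asUtf16LeUnits(const std::string& bytes)
{
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const auto lo = static_cast<unsigned char>(bytes[i]);
        const auto hi = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<std::uint16_t>(lo | (hi << 8)));
    }
    return units;
}
```

Feeding it the 14 bytes of two "hahaha" writes, each with its trailing null byte, gives 0x6168 three times, then 0x6800, 0x6861 twice, and 0x0061, which matches the values listed above.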
As I noted in a comment:
If you want lines in the file, you'll need to put them there (by adding newlines to the data that is written), because fwrite() won't output any newlines unless they are in the data it is given to write. You have written a null byte to the file (because you used data_size = 7), which means the file is not really a text file (text files don't contain null bytes). What happens next depends on the code set you're using.
The trailing single-byte codes in the output appear because the second null byte isn't visible in what's pasted on this page, and the trailing U+000A was added by the echo in the command line I used for the analysis (where utf8-unicode is a program I wrote):
echo "慨慨慨栀桡桡a" | utf8-unicode
Change your code to this:
const char* data = "hahaha"; // the literal already ends with an implicit '\0'; strlen() stops there
pFile = fopen("textfile.bin","a+");
if (pFile!=NULL)
{
fwrite (data, sizeof(char), strlen(data), pFile);
fclose (pFile);
}
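To also get each piece of text on its own line, the newline has to be written explicitly. Here is a minimal sketch, assuming a plain text file opened in append mode. The function name and file name are made up for illustration.

```cpp
#include <cstdio>
#include <cstring>

// Appends `text` followed by a newline to `path`. Only strlen(text) bytes
// are written, so the terminating NUL never reaches the file.
bool appendLine(const char* path, const char* text)
{
    std::FILE* f = std::fopen(path, "a");
    if (f == nullptr) return false;
    const std::size_t len = std::strlen(text);
    const bool ok = std::fwrite(text, 1, len, f) == len
                 && std::fputc('\n', f) != EOF;
    // fclose is evaluated first, so the file is closed even if a write failed.
    return std::fclose(f) == 0 && ok;
}
```

Calling appendLine("textfile.txt", "hahaha") twice should leave two lines in the file, with no null bytes to confuse an editor's encoding detection.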

How to process CSV lines with nul char in some elements?

When reading and parsing a CSV file line, I need to process the nul character that appears as the value of some row fields. It is complicated by the fact that sometimes the CSV file is in windows-1250 encoding, sometimes it is in UTF-8, and sometimes in UTF-16. Because of this, I started down one path and only found the nul char problem later (see below).
Details: I need to clean CSV files from a third party into the form common to our data extractor (that is, the utility works as a filter, converting one CSV form to another).
My initial approach was to open the CSV file in binary mode and check whether the first bytes form a BOM. I know all the given Unicode files start with a BOM. If there is no BOM, I know that the file is in windows-1250 encoding.
The converted CSV file should use the windows-1250 encoding. So, after checking the input file, I open it using the related mode, like this:
// Open the file in binary mode first to see whether BOM is there or not.
FILE * fh{ nullptr };
errno_t err = fopen_s(&fh, fnameIn.string().c_str(), "rb"); // const fs::path & fnameIn
assert(err == 0);
vector<unsigned char> buf(4, 0); // unsigned, so the comparisons with 0xEF etc. below can match
fread(&buf[0], 1, 3, fh);
::fclose(fh);
// Set the isUnicode flag and open the file according to that.
string mode{ "r" }; // init
bool isUnicode = false; // pessimistic init
if (buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) // UTF-8 BOM
{
mode += ", ccs=UTF-8";
isUnicode = true;
}
else if ((buf[0] == 0xFE && buf[1] == 0xFF) // UTF-16 BE BOM
|| (buf[0] == 0xFF && buf[1] == 0xFE)) // UTF-16 LE BOM
{
mode += ", ccs=UNICODE";
isUnicode = true;
}
// Open in the suitable mode.
err = fopen_s(&fh, fnameIn.string().c_str(), mode.c_str());
assert(err == 0);
After the successful open, the input line is read via either fgets or fgetws, depending on whether Unicode was detected. The idea was then to convert the buffer content from Unicode to 1250 if Unicode was detected earlier, or to leave the buffer in 1250 otherwise. The s variable should contain the string in the windows-1250 encoding. ATL::CW2A(buf, 1250) is used when conversion is needed:
const int bufsize = 4096;
wchar_t buf[bufsize];
// Read the line from the input according to the isUnicode flag.
while (isUnicode ? (fgetws(buf, bufsize, fh) != NULL)
: (fgets(reinterpret_cast<char*>(buf), bufsize, fh) != NULL))
{
// If the input is in Unicode, convert the buffer content
// to the string in cp1250. Otherwise, do not touch it.
string s;
if (isUnicode) s = ATL::CW2A(buf, 1250);
else s = reinterpret_cast<char*>(buf);
...
// Now processing the characters of the `s` to form the output file
}
It worked fine... until a file with a nul character used as the value in the row appeared. The problem is that when the s variable is assigned, the nul cuts the rest of the line. In the observed case, it happened with the file that used 1250 encoding. But it can probably happen also in the UTF encoded files.
How to solve the problem?
The NUL character problem is solved by using either C++ or Windows functions. In this case, the easiest solution is MultiByteToWideChar which will accept an explicit string length, precisely so it doesn't stop on NUL.
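On the C++ side the same explicit-length idea applies: a std::string built from a pointer plus a byte count keeps embedded NUL bytes, while one built from a bare char* stops at the first NUL. The sketch below uses made-up helper names. It assumes you know how many bytes were read, for example from fread or std::getline, since fgets does not report that count.

```cpp
#include <cstddef>
#include <string>

// Builds a string from exactly `len` bytes, so embedded NUL bytes are kept.
// By contrast, std::string(buf) would stop at the first '\0'.
std::string fieldFromBuffer(const char* buf, std::size_t len)
{
    return std::string(buf, len);
}

// Replaces each NUL byte with `replacement` so that later C-string based
// code (or a conversion that stops at '\0') still sees the whole line.
std::string replaceNuls(std::string line, char replacement)
{
    for (char& c : line)
        if (c == '\0') c = replacement;
    return line;
}
```

Whether a NUL should be replaced, dropped, or passed through depends on what the data extractor expects. Replacing it with a space is only one possible choice.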

fread and fgetc in the same file stream not working

I have a binary file with the following repeating format: 6 float values + 3 unsigned char (byte, integer value from 0 to 255) values.
I am parsing it like this:
FILE *file = fopen("file.bin", "r");
bool valid = true;
while(!feof(file)) {
float vals[6];
valid = valid && (fread((void*)(&vals), sizeof(float), 6, file) == 6);
unsigned char a,b,c;
a = fgetc(file); b = fgetc(file); c = fgetc(file);
(...)
}
This works fine for the first 30 iterations or so, but after that it simply stops parsing (way way before the end of the file).
What could be wrong?
I also tried parsing the unsigned char bytes with
fread((void*)&(a), sizeof(unsigned char), 1, file);
it simply stops parsing (way way before the end of the file).
You and the C Standard library are having a difference of opinion about where the end of the file is. ASCII character EOF (for DOS/Windows: decimal 26, hex 1A, aka Ctrl+Z, for Unix/Linux: decimal 4, hex 04, aka Ctrl+D) is a control character meaning "end of file". There's also the file length stored by the filesystem metadata.
The C stdio functions can operate in several modes: text, default, binary, and these control several behaviors:
Newline translations (implementation-defined): enabled in text mode, disabled in binary mode, default: ???
End of file: Implementation-defined, but usually EOF character in text mode, by filesystem file length in binary mode, default: ???
Since your file contains binary data, you should force binary mode by using "b" in the mode string to fopen, e.g.
FILE* file = fopen("file.bin", "rb");
When you do so, characters with value 26 are treated like any other byte and lose their "EOF" meaning.
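Putting that together, here is a minimal sketch of the parsing loop with the file opened in binary mode. It is driven by the return value of fread instead of feof(), so a short read ends the loop cleanly. The function name is made up, and the record layout (6 floats followed by 3 bytes) is taken from the question.

```cpp
#include <cstddef>
#include <cstdio>

// Reads records of 6 floats followed by 3 bytes from `path` in binary mode
// and returns how many complete records were parsed.
std::size_t countRecords(const char* path)
{
    std::FILE* file = std::fopen(path, "rb");
    if (file == nullptr) return 0;

    std::size_t count = 0;
    float vals[6];
    unsigned char bytes[3];
    // Stop as soon as either part of a record cannot be read in full.
    while (std::fread(vals, sizeof(float), 6, file) == 6 &&
           std::fread(bytes, 1, 3, file) == 3) {
        ++count;   // a real parser would process vals and bytes here
    }
    std::fclose(file);
    return count;
}
```

With "rb", a byte of value 0x1A inside a record is read like any other byte and does not end the loop early.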

Need explanation on creating utf-8 encoded files on linux using c++

I need some explanation of file encoding when using g++ on Linux.
I have some simple code:
int main ()
{
FILE * pFile;
char buffer[] = { 'x' , 'y' , 'z' ,'é' };
pFile = fopen ("myfile", "wt, ccs=UTF-8");
//pFile = fopen ("myfile", "wt");
fwrite (buffer , sizeof(char), sizeof(buffer), pFile);
fclose (pFile);
return 0;
}
Even if the "ccs=UTF-8" part is added to the fopen line, the output file of this program is always encoded in iso-8859-1. However, if I create a file containing these characters using vi on Linux, the resulting file is UTF-8 encoded (I use the command "file myfile" to see the encoding of the file, and "xxd -b myfile" confirms this behavior).
So I would like to understand:
1- Why g++ on Linux doesn't create a UTF-8 file by default?
2- What is the aim of the ccs=UTF-8 if the file created is not encoded in UTF-8?
3- How can I create an UTF-8 file based on this simple code?
Thanks.
Your file may appear to be in ISO-8859-1, but it's actually not. It's simply broken.
Your file contains byte A9, which is the lower byte of UTF-8 representation of é.
When you wrote 'é', the compiler should have warned you:
aaa.c:4:38: warning: multi-character character constant [-Wmultichar]
char buffer[] = { 'x' , 'y' , 'z' ,'é' };
^
char is not a type for a character, it's a type for one byte. GCC treats multibyte character literals as big-endian integers. Here, you cast it immediately to char, leaving the lowest byte: A9
(BTW, é in ISO-8859-1 is E9, not A9)
You open your file with an encoding, but then you save bytes into it. The bytes correspond to ISO-8859-1 characters xyz©.
If you want to write characters, not bytes, then use wchar_t instead of char and fputws instead of fwrite
#include <stdio.h>
#include <wchar.h>
int main ()
{
FILE * pFile;
// note final zero and L indicating wchar_t literal
wchar_t buffer[] = { 'x' , 'y' , 'z' , L'é' , 0};
// note no space before ccs
pFile = fopen ("myfile", "wt,ccs=UTF-8");
fputws(buffer, pFile);
fclose (pFile);
return 0;
}
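A different route, not the wchar_t approach above, is to stay with bytes and make sure those bytes are already valid UTF-8. The sketch below is an illustration with made-up names. It spells é as its two UTF-8 bytes (C3 A9) so the result does not depend on how the source file itself is encoded.

```cpp
#include <cstdio>
#include <string>

// Writes `utf8` to `path` exactly as given. The caller is responsible for
// supplying text that is already UTF-8 encoded.
bool writeUtf8File(const char* path, const std::string& utf8)
{
    std::FILE* f = std::fopen(path, "wb");
    if (f == nullptr) return false;
    const bool ok = std::fwrite(utf8.data(), 1, utf8.size(), f) == utf8.size();
    // fclose is evaluated first, so the file is closed even if the write failed.
    return std::fclose(f) == 0 && ok;
}

// "xyz" followed by U+00E9 (é), written out as the UTF-8 byte pair C3 A9.
const std::string kSample = "xyz\xC3\xA9";
```

After writeUtf8File("myfile", kSample) the file holds five bytes: 78 79 7A C3 A9.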