Understanding the format of a file - C++

I have a question regarding file reading, and I am getting frustrated over it. I am doing some handwriting recognition development, and the tool I am using doesn't seem to read my training data file.
I have one file which works perfectly fine. I paste some of its contents here:
è Aڈ2*A ê“AêA mwA)àXA$NلAئ~A›إA:ozA)"ŒA%IœA&»ّAم3ACA
|®AH÷AD¢A ô-A گ&AJXAsAA mGA قQAٍALs#÷8´A
The file is in a format I know about: the first 12 bytes are 2 longs and 2 shorts, most probably holding the values 4, 1000, 1024 and 9, but I cannot read the file to get these values.
Actually, I also want to write my own first 12 bytes in a format similar to the one mentioned above, and I don't see how to do it.
I forgot to mention that the remaining data are floating-point values. When I write data into a file I get human-readable text, not these symbols, and when I read these symbols I do not get the actual values. How do I get the actual floats and integers out of these symbols?
My code is
struct rec
{
    long a;
    long b;
    short c;
    short d;
}; // this is the struct

FILE *pFile;
struct rec my_record;

// then I read using fread
fread(&my_record, 1, sizeof(my_record), pFile);
and the values I get in a, b, c and d are 85991456, -402448352, 8193 and 2336 instead of the actual values.

First of all, you should open that file in a hex editor to see exactly what bytes it contains. From the text excerpt you have posted, I think it does not contain 4, 1000, 1024 and 9 as you expect, but the text form may be very misleading, because different character encodings show different characters for the same sequences of bytes.
If you have confirmed that the file contains the expected data, there may still be other issues. One of these is endianness: some machines and file formats encode a 4-byte long with the least significant byte first, while others read and write the most significant byte first.
Another issue concerns the long data type you use. If your computer has a 64-bit architecture and you are using Linux, long is a 64-bit value, and your structure becomes at least 20 bytes long instead of 12.
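If the file really is two 32-bit integers followed by two 16-bit integers, fixed-width types from <cstdint> remove that platform dependence. A minimal sketch, assuming that layout; the struct name and filename are made up for illustration:
#include <cstdint>
#include <cstdio>

// Fixed-width fields give a 12-byte record (4 + 4 + 2 + 2, no padding needed
// for this member order) regardless of what `long` means on the platform.
struct rec32
{
    int32_t a;
    int32_t b;
    int16_t c;
    int16_t d;
};

int main()
{
    std::FILE* pFile = std::fopen("training.dat", "rb");   // hypothetical filename
    if (!pFile)
        return 1;

    rec32 my_record;
    if (std::fread(&my_record, sizeof my_record, 1, pFile) == 1)
        std::printf("%d %d %d %d\n", (int)my_record.a, (int)my_record.b,
                    (int)my_record.c, (int)my_record.d);

    std::fclose(pFile);
    return 0;
}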
Edit:
To read big-endian longs on a little-endian machine like yours, you should read the data byte by byte and build the longs from it manually:
// Read 4 bytes
unsigned char buf[4];
fread(buf, 4, 1, pFile);
// Convert to long
my_record.a = (((long)buf[0]) << 24) | (((long)buf[1]) << 16) | (((long)buf[2]) << 8) | ((long)buf[3]);

The compiler may add padding around your structure members to keep them (typically) 4-byte aligned; in this case the shorts c and d can end up followed by padding.
You should read the predefined data types one at a time with fread, instead of reading your whole structure in a single call.
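For example, a sketch of reading the header field by field (the helper name is invented; it assumes the file's byte order matches the machine's, otherwise combine it with the byte-swapping shown in the previous answer):
#include <cstdint>
#include <cstdio>

// Read the 12-byte header field by field, so neither struct padding nor the
// size of `long` can change how many bytes are consumed from the file.
bool readHeader(std::FILE* pFile, int32_t& a, int32_t& b, int16_t& c, int16_t& d)
{
    return std::fread(&a, sizeof a, 1, pFile) == 1
        && std::fread(&b, sizeof b, 1, pFile) == 1
        && std::fread(&c, sizeof c, 1, pFile) == 1
        && std::fread(&d, sizeof d, 1, pFile) == 1;
}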


C++: I'm having problems writing an array of doubles to a disc file as 4-byte (single-precision, 32-bit) IEEE754 values

I have an application I'm trying to write which will take a table of numbers (generated by the user) and write the table to a file on disc. This file will later be transferred to an Arduino AVR device over USB, into its EEPROM.
The format I wish to save this information in on disc is raw 4-byte little-endian binary data. My table array, called "tbl1Array[]" in the code below, is declared as an array of doubles.
Below is a snippet of the bunk code I have in place now, in line following some array preparation code. The file open/close works fine, and in fact data DOES get transferred to the file, but the format is not what I want.
ofstream fileToOutput("318file.bin");
for (int i = 0; i < 41; i++)
{
    fileToOutput << tbl1Array[i];
}
fileToOutput.close();
THE PROBLEM is that what is written to the file is a plain-text (ASCII) representation of the decimal value. Not what I want! I don't know what I need to do to get my array written as a nice, neat, concatenated list of 4-byte little-endian words for my doubles that I can later read from within the Arduino code. I have a working method for transferring the file to the Arduino using AVRDUDE (tested and confirmed), so my only real hang-up is getting the doubles in my application's array onto disc as 4-byte IEEE754 values. Any guidance would be greatly appreciated.
Regards - Mark
In a text stream, the new-line character is translated; some operating systems translate it to \r\n. A binary stream leaves it as it is.
Regardless of how you opened the stream – binary or text:
a. When you use the insertion/extraction operator the data is written as text. Assuming 0x01 is a one byte integer, ASCII 1 will be written (that is 0x31).
b. If you use the ostream::write method, you write the actual bytes; no translations happens. Assuming 0x01 is a one byte integer, 0x01 will be written.
If you want to write the actual bytes, open the stream in binary mode and use ostream::write:
ofstream os{ "out.bin", ios::binary };
if (!os)
return -1;
double d = 1.234567;
os.write((const char*)&d, sizeof d);
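For the question's actual goal, a rough sketch (not part of this answer originally): narrow each double to a 4-byte float and write its raw bytes, assuming the table has 41 entries and the host CPU is little-endian, as on a typical PC:
#include <fstream>

void writeTable(const double* tbl1Array)
{
    std::ofstream fileToOutput("318file.bin", std::ios::binary);
    if (!fileToOutput)
        return;

    for (int i = 0; i < 41; i++)
    {
        float f = static_cast<float>(tbl1Array[i]);   // narrow 8-byte double to 4-byte IEEE754 float
        fileToOutput.write(reinterpret_cast<const char*>(&f), sizeof f);
    }
}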
If you want to write the actual bytes as a string, open the stream in text mode and use the insertion operator:
ofstream os{ "out.txt" };
if (!os)
return -1;
double d = 1.234567;
static_assert(sizeof(double) == sizeof(long long));
os << hex << *reinterpret_cast<long long*>(&d);
If you want to write a double as string, using the maximum precision use:
os << setprecision(numeric_limits<double>::max_digits10) << d;
Don't use translated & untranslated methods on the same stream.
I do not know if all above examples will work on an Arduino compiler.

Reading a PCM audio file sometimes gives wrong samples

I have a 16-bit, 48 kHz, 1-channel (mono) PCM audio file (with no header, but it would be the same with a WAV header anyway) and I can read that file correctly using software such as Audacity; however, when I try to read it programmatically (in C++), some samples seem to be out of place while most are correct when compared with Audacity's values.
My process of reading the PCM file is the following:
Convert the byte array of PCM to a short array to get readable values by bitshifting (the order of bytes is little-endian here).
for(int i = 0; i < bytesSize - 1; i += 2)
shortValue[i] = bytes[i] | bytes[i + 1] << 8;
Note: bytes is a char array holding the binary contents of the PCM file, and shortValue is a short array.
Convert the short values to amplitude levels in a float array by dividing by the maximum value of a short (32767):
for(int i = 0; i < shortsSize ; i++)
amplitude[i] = static_cast<float>(shortValue[i]) / 32767;
This is obviously not optimal code and I could do it in one loop but for the sole purpose of explaining I separated the two steps.
So what happens is that when I look for very big changes of amplitude level in my final array, it shows me samples that are not correct. In Audacity, notice how the wave is perfectly smooth and how sample 276,467, pointed to in green, goes down just a bit to the next sample, pointed to in red, which should be around -0.17.
However, when reading with my code, I get a totally wrong value for the red sample (-0.002), while still getting a good value for the green sample (around -0.17); the sample after the red one is also correct (around -0.17 as well).
I don't really understand what's happening or how Audacity is able to read those bytes correctly. I tried with multiple PCM/WAV files and I get the same results. Any help would really be appreciated!
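A likely culprit, offered only as a guess: if bytes is a plain (signed) char array, bytes[i] is sign-extended to int before the OR, which corrupts exactly those samples whose low byte is 0x80 or above, matching the symptom described. A minimal sketch of the conversion with the bytes treated as unsigned, folding both steps into one loop:
// Sketch: little-endian 16-bit PCM to float amplitude in one pass.
// Going through unsigned char prevents sign extension of the low byte.
for (int i = 0; i + 1 < bytesSize; i += 2)
{
    const unsigned char lo = static_cast<unsigned char>(bytes[i]);
    const unsigned char hi = static_cast<unsigned char>(bytes[i + 1]);
    const short sample = static_cast<short>(lo | (hi << 8));
    amplitude[i / 2] = static_cast<float>(sample) / 32767.0f;
}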

Endianness in wav files

I have tried to make a simple wav writer. I wanted to do this so that I could read in a wav file (using a pre-existing wav reader), resample the audio data, then write the resampled data to another wav file. Input files can be 16 bitsPerSample or 32 bitsPerSample, and I want to save the resampled audio with the same number of bitsPerSample.
I previously had no experience of reading or writing binary files. I began by looking up the wav file format online and tried to write the data following the correct format. At first the writing wasn't working; I then found out that wav files are little-endian, and trying to make my file writer consistent with this brought up the majority of my problems.
I have got the wav writer to work now (by way of a test whereby I read in a wav file and checked that I could write the unresampled audio and reproduce the exact same file), however there are a couple of points to do with endianness that I am still unsure about, and I was hoping someone may be able to help me.
Assuming the relevant variables have already been set, here is my code for the wav writer:
// Write RIFF header
out_stream.write(chunkID.c_str(), 4);
out_stream.write((char*)&chunkSize, 4);
out_stream.write(format.c_str(), 4);
// Write format chunk
out_stream.write(subchunk1ID.c_str(), 4);
out_stream.write((char*)&subchunk1Size, 4);
out_stream.write((char*)&audioFormat, 2);
out_stream.write((char*)&numOfChannels, 2);
out_stream.write((char*)&sampleRate, 4);
out_stream.write((char*)&byteRate, 4);
out_stream.write((char*)&blockAlign, 2);
out_stream.write((char*)&bitsPerSample, 2);
// Write data chunk
out_stream.write(subchunk2ID.c_str(), 4);
out_stream.write((char*)&subchunk2Size, 4);

// Variables for writing 16 bitsPerSample data
std::vector<short> soundDataShort;
soundDataShort.resize(numSamples);
char theSoundDataBytes[2];

// soundData samples are written as shorts if bitsPerSample==16 and floats if bitsPerSample==32
switch (bitsPerSample)
{
case 16:
    // cast each of the soundData samples from float to short,
    // then save the samples in little-endian form (requires reversal of the byte order of the short variable)
    for (int sample = 0; sample < numSamples; sample++)
    {
        soundDataShort[sample] = static_cast<short>(soundData[sample]);
        theSoundDataBytes[0] = (soundDataShort[sample]) & 0xFF;
        theSoundDataBytes[1] = (soundDataShort[sample] >> 8) & 0xFF;
        out_stream.write(theSoundDataBytes, 2);
    }
    break;
case 32:
    // save the soundData samples in binary form (does not require a change to byte order for floats)
    out_stream.write((char*)&soundData[0], numSamples);
}
The questions that I have are:
In the soundData vector why does the endianness of a vector of shorts matter but the vector of floats doesn't? In my code I have reversed the byte order of the shorts but not the floats.
Originally I tried to write the shorts without reversing the byte order. When I wrote the file it ended up being half the size it should have been (i.e. half the audio data was missing, but the half that was there sounded correct), why would this be?
I have not reversed the byte order of the shorts and longs in the other single variables which are essentially all the other fields that make up the wav file e.g. sampleRate, numOfChannels etc but this does not seem to affect the playing of the wav file. Is this just because media players do not use these fields (and hence I can't tell that I have got them wrong) or is it because the byte order of these variables does not matter?
In the soundData vector why does the endianness of a vector of shorts matter but the vector of floats doesn't? In my code I have reversed the byte order of the shorts but not the floats.
Actually, if you take a closer look at your code, you will see that you are not reversing the endianness of your shorts at all. Nor do you need to, on Intel CPUs (or on any other little-endian CPU).
Originally I tried to write the shorts without reversing the byte order. When I wrote the file it ended up being half the size it should have been (i.e. half the audio data was missing, but the half that was there sounded correct), why would this be?
I have no idea without seeing the code but I suspect that some other factor was in play.
I have not reversed the byte order of the shorts and longs in the other single variables which are essentially all the other fields that make up the wav file e.g. sampleRate, numOfChannels etc but this does not seem to affect the playing of the wav file. Is this just because media players do not use these fields (and hence I can't tell that I have got them wrong) or is it because the byte order of these variables does not matter?
These fields are in fact very important and must also be little-endian, but, as we have seen, you don't need to swap those either.
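If you ever need the writer to behave identically on a big-endian host as well, one option (a sketch only, with invented helper names, not tested against this code) is a pair of helpers that always emit bytes in little-endian order, just as the question's 16-bit sample loop already does:
#include <cstdint>
#include <ostream>

// Always write the least significant byte first, regardless of host endianness.
void write_le16(std::ostream& out, std::uint16_t v)
{
    const char b[2] = { static_cast<char>(v & 0xFF),
                        static_cast<char>((v >> 8) & 0xFF) };
    out.write(b, 2);
}

void write_le32(std::ostream& out, std::uint32_t v)
{
    const char b[4] = { static_cast<char>(v & 0xFF),
                        static_cast<char>((v >> 8) & 0xFF),
                        static_cast<char>((v >> 16) & 0xFF),
                        static_cast<char>((v >> 24) & 0xFF) };
    out.write(b, 4);
}

// Usage, e.g.: write_le32(out_stream, sampleRate); write_le16(out_stream, bitsPerSample);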

How to read chunks (of unknown size) of a binary file? [closed]

I am a little bit confused about binary files. I know the data is stored in chunks in binary files, and from my knowledge through experimenting I found that if we have a struct with member variables like this:
struct student
{
    int Roll_No;
    char Name[10];
};
then after filling the variables with content and saving the struct to a binary file, the file is 14 bytes: 4 bytes of int and 10 bytes of char. If we analyse the file in a hex editor, 4 bytes are reserved for Roll_No and 10 bytes are reserved for Name; the bytes that were filled in show their contents and the rest appear as dots. In other words, if we create a program with a struct/class like the one above and save its contents, the file's size matches the structure we created: 4 bytes of int and 10 of char. So, from my knowledge, if I created a new image format, e.g. (dot)MyIMG, from a program whose struct/class is like this:
struct MyIMG
{
    char Header[5];
    int width, height;
    int Pixels[124000];
};
then my program will create a file of 496,013 bytes (roughly 484 kilobytes): 5 bytes of header, plus 8 bytes for the int width and height, plus 4 × 124000 bytes for the int Pixels array. Whether 4, 8, 100 or however many pixels are actually set, it will write the whole Pixels array, even the empty part. So why doesn't this happen with large programs like MS Paint or Adobe Photoshop? What do they do that makes their programs write files whose size depends on the pixels actually stored, not on blank arrays?
EDIT: I have now edited my question and defined it more clearly. Please help me; thanks in advance!
File formats like .png and .bmp have a specific format. File formats can either specify a layout of bytes (such as 4 bytes for the width, 4 bytes for the height, 2MB of RGBA pixel data, or whatever), or the format may give you information about the size of various objects.
For example, a TIFF file will specify that there are a number of tags at specific byte offsets within the file. Those tags then contain information about the size, location, and format of the image data. So you might have a fixed-size header that says "there is a list of tags starting at byte 100, and it contains 40 tags." The tags would each be a fixed size (say 16 bytes), so you'd know to read 40 16-byte chunks starting at byte 100. The tags would then contain information such as the byte offset of the start of the image data, how many bytes are in a pixel, and how many pixels there are. From this, you can read the data without knowing ahead of time what the entire format is.
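As a toy illustration (an invented format, not how .png or .tiff actually work): store the dimensions in a small header and then write only width * height pixels, so the file size follows the image content rather than the size of a fixed array:
#include <cstdint>
#include <fstream>
#include <vector>

// Toy format: 5-byte magic, 4-byte width, 4-byte height, then width*height
// 4-byte pixels. A reader recovers the pixel count from the header.
void saveMyIMG(const char* path, uint32_t width, uint32_t height,
               const std::vector<uint32_t>& pixels)   // pixels.size() == width * height
{
    std::ofstream out(path, std::ios::binary);
    out.write("MYIMG", 5);
    out.write(reinterpret_cast<const char*>(&width), sizeof width);
    out.write(reinterpret_cast<const char*>(&height), sizeof height);
    out.write(reinterpret_cast<const char*>(pixels.data()),
              static_cast<std::streamsize>(pixels.size() * sizeof(uint32_t)));
}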
The code writing the file has to choose its own format. For example, when writing your student structure to a file, you could say something like:
size_t name_len = strlen(my_student.Name);
// write everything up to the Name field, then the name plus its terminating NUL
my_ofstream.write((const char*)&my_student, offsetof(student, Name) + name_len + 1);
This would then write the name up to and including the first 0/NUL character to the binary file. When reading the file back, the program could read a block of data from the ifstream and then, knowing that a student record starts at some offset, use the NUL in the .Name part of the incoming data to recover the length, partly so it copies only the necessary data into a student object, and partly so it knows where to start parsing the next data item in the input stream:
// needs <cstddef>, <cstring>, <algorithm> and <stdexcept>
char buffer[32768];
student my_student;
my_ifstream.read(buffer, sizeof buffer);              // may read fewer bytes than requested
const std::streamsize got = my_ifstream.gcount();
const std::streamsize pre_name_len = offsetof(student, Name);
if (got > pre_name_len)
{
    // look for the NUL without reading past buffer[got - 1]
    const char* p_name = buffer + pre_name_len;
    const std::size_t name_span = std::min(sizeof my_student.Name,
                                           static_cast<std::size_t>(got - pre_name_len));
    const char* p_nul = static_cast<const char*>(std::memchr(p_name, '\0', name_span));
    if (p_nul == nullptr)
        throw std::runtime_error("file didn't contain a complete student record");
    std::memcpy(&my_student, buffer, p_nul - buffer + 1);
    // keep parsing input from p_nul + 1, not going past buffer + got
}
As you can see, it's a bit of a pain to scan for the single NUL while keeping track of the amount of data read from the file, so that you don't crash if you get a corrupt input file.
For a beginner, it's probably easiest and far more robust to learn about the Boost serialisation library, which abstracts much of the low-level (some would say C-style) I/O, casting and offset calculations to provide a cleaner logical interface for you.
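For completeness, a rough sketch of the same record with Boost.Serialization (the archive headers shown are part of Boost, but this exact snippet is illustrative rather than taken from the answer, and it needs the boost_serialization library linked in):
#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/string.hpp>
#include <fstream>
#include <string>

struct Student
{
    int roll_no;
    std::string name;   // variable length; stored with its own length prefix

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/)
    {
        ar & roll_no;
        ar & name;
    }
};

int main()
{
    {
        std::ofstream ofs("student.bin", std::ios::binary);
        boost::archive::binary_oarchive oa(ofs);
        const Student s{ 7, "Alice" };
        oa << s;                        // writes only as many name bytes as needed
    }

    std::ifstream ifs("student.bin", std::ios::binary);
    boost::archive::binary_iarchive ia(ifs);
    Student s2;
    ia >> s2;                           // roll_no and name restored without manual offset maths
    return 0;
}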

I get "Invalid utf 8 error" when checking string, but it seems correct when i use std::cout

I am writing some code that must read UTF-8 encoded text files and send them to OpenGL.
I am also using a library which I downloaded from this site: http://utfcpp.sourceforge.net/
When I write this down, I can show the right images in the OpenGL window:
std::string somestring = "abcçdefgğh";
// Convert string to utf32 encoding..
// I also set local on program startup.
But when I read the UTF-8 encoded string from a file, the library warns me that the string does not have a valid UTF encoding, and I can't send the string read from the file to OpenGL (it crashes).
Yet I can still use std::cout with the string I read from the file, and it looks right.
I use this code to read from the file:
void something()
{
    std::ifstream ifs("words.xml");
    std::string readd;
    if (ifs.good())
    {
        while (!ifs.eof())
        {
            std::getline(ifs, readd);
            // do something..
        }
    }
}
Now the question is:
If the string which is read from the file is not correct, why does it look as expected when I check it with std::cout?
How can I get this issue solved?
Thanks in advance :)
The shell to which you write output is probably rather robust against characters it doesn't understand; it seems not all of the software you use is. It should, however, be relatively straightforward to verify whether your byte sequence is valid UTF-8, because the encoding itself is simple:
each code point starts with a byte that encodes the number of bytes to be read and the first few bits of the value:
if the high bit is 0, the code point consists of one byte, represented by the 7 lower bits
otherwise, the number of leading 1 bits gives the total number of bytes, followed by a zero bit (obviously), and the remaining bits become the high bits of the code point
since one byte is already represented by the start byte, bytes with the high bit set and the next bit not set (10xxxxxx) are continuation bytes: their lower 6 bits are part of the representation of the code point
Based on these rules, there are two things which can go wrong and make the UTF-8 invalid:
a continuation byte is encountered at a point where a start byte is expected
there was a start byte indicating more continuation bytes than actually followed
I don't have code around which could indicate where things are going wrong, but it should be fairly straightforward to write such code.
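Along those lines, a minimal sketch of such a check (the function name is invented; it validates the start/continuation-byte structure described above, but deliberately ignores overlong encodings and the U+10FFFF upper bound):
#include <cstddef>
#include <string>

// Returns the offset of the first byte that breaks the UTF-8 structure,
// or -1 if every code point is well formed.
long firstInvalidUtf8(const std::string& s)
{
    for (std::size_t i = 0; i < s.size(); )
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;                       // number of continuation bytes expected
        if      (c < 0x80)            extra = 0; // 0xxxxxxx: plain ASCII
        else if ((c & 0xE0) == 0xC0)  extra = 1; // 110xxxxx
        else if ((c & 0xF0) == 0xE0)  extra = 2; // 1110xxxx
        else if ((c & 0xF8) == 0xF0)  extra = 3; // 11110xxx
        else                          return static_cast<long>(i); // stray continuation or invalid byte

        if (i + extra >= s.size())               // start byte promises more bytes than remain
            return static_cast<long>(i);

        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return static_cast<long>(i + k); // expected a 10xxxxxx continuation byte

        i += extra + 1;
    }
    return -1;
}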