Reading every file format as 0's and 1's - compression

I am making a file compressor using Huffman coding. I want to read the file in binary (1's and 0's), whatever the format of the file. Is it possible? If yes, how?

First, make sure to read the file in binary mode. (Under some operating systems, like MS Windows, reading in text mode will corrupt any bytes that happen to look like newlines.)
Second, you will need to decide what order to read the bits in each byte: least significant bit (LSB) first, or most significant bit (MSB) first. It doesn't matter which order you choose, as long as you're consistent (and, of course, you must write the bits in the same order as you wish to read them!)
Finally, you will need to use bitwise operators to access the individual bits. An example (in the form of a partial C++ class):
#include <istream>

class bit_buffer {
public:
    explicit bit_buffer(std::istream *input) : m_input(input) {}

    bool get_bit() {
        if (m_bit_idx < 0) {                  // refill the buffer once all 8 bits are used
            char c;
            m_input->get(c);                  // istream::get() takes a plain char
            m_char = static_cast<unsigned char>(c);
            m_bit_idx = 7;
        }
        bool retval = (m_char >> m_bit_idx) & 0x01;
        --m_bit_idx;
        return retval;
    }

private:
    std::istream *m_input;
    unsigned char m_char = 0;
    int m_bit_idx = -1;                       // -1 means "no bits buffered yet"
};
Note that the above code reads the bits most-significant-bit (MSB) first.
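For completeness, here is a matching writer sketch (my own addition, not part of the original answer) that emits bits in the same MSB-first order and pads the final byte with zero bits when flushed:

#include <ostream>

class bit_writer {
public:
    explicit bit_writer(std::ostream *output) : m_output(output) {}

    void put_bit(bool bit) {
        m_char = static_cast<unsigned char>((m_char << 1) | (bit ? 1 : 0));
        if (++m_bit_count == 8) {             // a full byte has accumulated
            m_output->put(static_cast<char>(m_char));
            m_char = 0;
            m_bit_count = 0;
        }
    }

    // Pad the last partial byte with zero bits and write it out.
    void flush() {
        while (m_bit_count != 0)
            put_bit(false);
    }

private:
    std::ostream *m_output;
    unsigned char m_char = 0;
    int m_bit_count = 0;
};

As long as put_bit() and get_bit() agree on the bit order, whatever the compressor writes the decompressor can read back bit for bit.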

Related

How to handle binary files in a portable way using std::fstream?

The put/get methods of std::fstream classes operate on char arguments rather than ints.
Is there a portable way of representing these char-bytes as integers? (My naive expectation is that a binary file is a sequence of bytes, i.e. a sequence of integers.)
To make this question more concrete, consider the following two functions:
#include <fstream>
#include <iostream>
#include <string>

void print_binary_file_to_cout( const std::string &filename)
{
    std::ifstream ifs(filename, std::ios_base::binary|std::ios_base::in);
    char c;
    while(ifs.get(c))
        std::cout << static_cast<int>(c) << std::endl;
}
and
void make_binary_file_from_cin( const std::string &filename)
{
    std::ofstream ofs(filename, std::ios_base::binary|std::ios_base::out);
    const int no_char = 256;
    int cInt = no_char;
    while(std::cin >> cInt && cInt != no_char)
        ofs.put( static_cast<char>( cInt ) );
}
Now, suppose that one function is compiled on Windows in Visual Studio, and the other in gcc on Linux. If the output of print...() is given as the input to make...()
will the original file be reproduced?
I guess not, so I'm asking how to correctly implement this idea, i.e.
how to get a portable (and human-understandable) representation of bytes in binary files?
The most common human-readable representation of bytes is hex (base 16) notation. You can tell iostreams to use hex format by passing std::hex into the stream; std::hex modifies the stream's behavior for both input and output streams. This format is also canonical, working independently of compilers and platforms, and you do not need to use a separator (like a newline) between values. As a stop value, you can use any character outside [0-9a-fA-F].
Note that you should use unsigned chars.
There is a lot of code out there that presumes the char functions will work correctly with unsigned char variables, perhaps with a static_cast, i.e. that the two forms are bit-identical, but the language lawyers will say that assumption can't be relied on if you are writing "perfect" portable code.
Luckily, reinterpret_cast does offer the facility to cast any pointer into a pointer to signed or unsigned char, and that is the easiest get-out.
Two notes to consider for all binary files:
On Windows the file must be opened in binary mode, otherwise any bytes with the value 13 (carriage return) will mysteriously disappear.
To store numbers larger than 255 you will need to span them across several byte values. You need to decide the convention for doing this: whether the first byte is the least or most significant part of the value. Certain architectures (native ARM and 68K) use the "big-endian" model, where the most significant byte comes first, while Intel (and ARM in switched mode) uses a "little-endian" model. If you are reading byte by byte you just have to specify it.
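Putting those pieces together, here is a hedged rewrite of the two functions above (my own sketch, not from the original answer) using std::hex and unsigned char so the text output round-trips between compilers and platforms:

#include <fstream>
#include <iomanip>
#include <iostream>
#include <string>

// Print each byte of the file as a two-digit hex value, one per line.
void print_binary_file_to_cout(const std::string &filename)
{
    std::ifstream ifs(filename, std::ios_base::binary | std::ios_base::in);
    char c;
    while (ifs.get(c))
        std::cout << std::hex << std::setw(2) << std::setfill('0')
                  << static_cast<int>(static_cast<unsigned char>(c)) << '\n';
}

// Read hex values from std::cin and write them back out as raw bytes.
void make_binary_file_from_cin(const std::string &filename)
{
    std::ofstream ofs(filename, std::ios_base::binary | std::ios_base::out);
    unsigned int value;
    while (std::cin >> std::hex >> value)
        ofs.put(static_cast<char>(static_cast<unsigned char>(value)));
}

Extraction stops at the first character outside [0-9a-fA-F], which plays the role of the stop value mentioned above.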

compatibility and structure padding

There is a structure (shown below) in code that runs on Linux (a 64-bit OS), and I did the following to output this structure as hex.
After the code below runs, "strBuff" is written out to a file, much as with "printf".
That file then needs to be read on Windows and stored in the same structure, "example".
However, there was a problem here.
On my Windows machine, unsigned long is 4 bytes.
On my Linux machine, unsigned long is 8 bytes.
So there are far too many zeros in the output text.
This also seems to be related to padding: I expected only 2 bytes of padding, but 4 bytes are inserted.
It is not possible to change the structure "example", because the code was written assuming the field is 4 bytes when outputting from Linux, and the code is already at the completion stage.
I have two things to ask.
How do I get rid of the unnecessary zero hex digits in the output?
Currently we are using a hard-coded approach that skips bytes for all unsigned long and signed long variables.
Compatibility between Windows and Linux should be solved.
The code can be changed both on the reading side and on the output side. Is there a library related to this problem that can solve the padding and compatibility issues?
#include <cstdio>

struct example
{
    unsigned long Ul;
    int a;
    signed long Sl;
};

struct example eg;
char strBuff[3 * sizeof(example) + 1];   // room for "XX " per byte plus a terminator

// data input at eg
char *tempDataPtr = (char*)(&eg);
for (size_t i = 0; i < sizeof(example); i++)
{
    sprintf(&strBuff[i*3], "%02X ", (unsigned char)tempDataPtr[i]);
}
Use types that have an explicit, fixed size:
(And order them from largest to smallest for good measure, to protect against padding discrepancies between fields)
#include <cstdint>

struct example
{
    uint32_t Ul;
    int32_t  Sl;
    int16_t  a;
};
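If the existing struct really cannot change, another option (a sketch of my own, not from the original answer) is to serialize field by field into the hex buffer with a fixed byte count per field, so that neither the differing sizeof(unsigned long) nor any padding ever reaches the file:

#include <cstdint>
#include <cstdio>

// The original struct from the question, unchanged.
struct example
{
    unsigned long Ul;
    int a;
    signed long Sl;
};

// Hypothetical helper: append one value to the hex buffer as a fixed number
// of bytes, least significant byte first, regardless of the platform's
// sizeof(unsigned long) or of any struct padding.
char *dump_hex(char *out, uint64_t value, int nbytes)
{
    for (int i = 0; i < nbytes; i++)
    {
        out += std::sprintf(out, "%02X ", (unsigned)(value & 0xFF));
        value >>= 8;
    }
    return out;
}

void dump_example(const example &eg, char *strBuff)
{
    char *p = strBuff;
    p = dump_hex(p, (uint32_t)eg.Ul, 4);   // always 4 bytes, as the Windows reader expects
    p = dump_hex(p, (uint32_t)eg.a,  4);
    p = dump_hex(p, (uint32_t)eg.Sl, 4);
}

The reader on Windows can then consume exactly 4 bytes per field in the same order, and no padding bytes are ever written.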

How to read and write data in 8-bit integer form with C++ file functions

Is it possible to store data in integer form from 0 to 255 rather than as 8-bit characters? Although both are the same thing, how can we do it, for example, with the write() function?
Is it ok to directly cast any integer to char and vice versa? Does something like
{
    int a[1] = {213};
    write((char*)a, 1);
}
and
{
    int a[1];
    read((char*)a, 1);
    cout << a[0];
}
work to get 213 back from the same location in the file? It may work on that computer, but is it portable; in other words, is it suitable for cross-platform projects? If I create a file format for each game level (which will store objects' coordinates in the current level's file) using this principle, will it work on other computers/systems/platforms so the same level can be loaded?
The code you show would write the first (lowest-address) byte of a[0]'s object representation - which may or may not be the byte with the value 213. The particular object representation of an int is implementation defined.
The portable way of writing one byte with the value of 213 would be
unsigned char c = a[0];
write(&c, 1);
You have the right idea, but it could use a bit of refinement.
{
    int intToWrite = 213;
    unsigned char byteToWrite = 0;
    if ( intToWrite > 255 || intToWrite < 0 )
    {
        doError();
        return;
    }
    // since your range is 0-255, you really want the low order byte of the int.
    // Just reading the 1st byte may or may not work for your architecture. I
    // prefer to let the compiler handle the conversion via casting.
    byteToWrite = (unsigned char) intToWrite;
    write( &byteToWrite, sizeof(byteToWrite) );
    // you can hard code the size, but I try to be in the habit of using sizeof
    // since it is better when dealing with multibyte types
}
{
    int a = 0;
    unsigned char toRead = 0;
    // just like the write, the byte ordering of the int will depend on your
    // architecture. You could write code to explicitly handle this, but it's
    // easier to let the compiler figure it out via implicit conversions
    read( &toRead, sizeof(toRead) );
    a = toRead;
    cout << a;
}
If you need to minimize space or otherwise can't afford the extra char sitting around, then it's definitely possible to read/write a particular location in your integer. However, it may require pulling in extra headers (e.g. for htons/ntohs) or annoying platform-specific #defines.
It will work, with some caveats:
Use reinterpret_cast<char*>(x) instead of (char*)x to be explicit that you’re performing a cast that’s ordinarily unsafe.
sizeof(int) varies between platforms, so you may wish to use a fixed-size integer type from <cstdint> such as int32_t.
Endianness can also differ between platforms, so you should be aware of the platform byte order and swap byte orders to a consistent format when writing the file. You can detect endianness at runtime and swap bytes manually, or use htonl and ntohl to convert between host and network (big-endian) byte order.
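For instance, a minimal sketch (my own illustration, not part of the original answer) that combines a fixed-size type with htonl/ntohl when writing a value to a file:

#include <arpa/inet.h>   // htonl/ntohl; on Windows use <winsock2.h>
#include <cstdint>
#include <fstream>

void save_value(std::ofstream &out, int32_t value)
{
    uint32_t wire = htonl(static_cast<uint32_t>(value));     // fixed size, big-endian on disk
    out.write(reinterpret_cast<char*>(&wire), sizeof(wire));
}

int32_t load_value(std::ifstream &in)
{
    uint32_t wire = 0;
    in.read(reinterpret_cast<char*>(&wire), sizeof(wire));
    return static_cast<int32_t>(ntohl(wire));                // back to host byte order
}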
Also, as a practical matter, I recommend you prefer text-based formats—they’re less compact, but far easier to debug when things go wrong, since you can examine them in any text editor. If you determine that loading and parsing these files is too slow, then consider moving to a binary format.

dealing with endianness in c++

I am working on translating a system from python to c++. I need to be able to perform actions in c++ that are generally performed by using Python's struct.unpack (interpreting binary strings as numerical values). For integer values, I am able to get this to (sort of) work, using the data types in stdint.h:
struct.unpack("i", str) ==> *(int32_t*) str; //str is a char* containing the data
This works properly for little-endian binary strings, but fails on big-endian binary strings. Basically, I need an equivalent to using the > tag in struct.unpack:
struct.unpack(">i", str) ==> ???
Please note, if there is a better way to do this, I am all ears. However, I cannot use c++11, nor any 3rd party libraries other than Boost. I will also need to be able to interpret floats and doubles, as in struct.unpack(">f", str) and struct.unpack(">d", str), but I'll get to that when I solve this.
NOTE I should point out that the endianness of my machine is irrelevant in this case. I know that the bitstream I receive in my code will ALWAYS be big-endian, and that's why I need a solution that will always cover the big-endian case. The article pointed out by BoBTFish in the comments seems to offer a solution.
For 32 and 16-bit values:
This is exactly the problem you have for network data, which is big-endian. You can use ntohl to turn a 32-bit value into host order (little-endian in your case).
The ntohl() function converts the unsigned integer netlong from network byte order to
host byte order.
int res = ntohl(*((int32_t*) str));
This will also take care of the case where your host is big-endian, in which case it simply does nothing.
For 64-bit values
Non-standardly, on Linux/BSD you can take a look at "64 bit ntohl() in C++?", which points to htobe64
These functions convert the byte encoding of integer values from the byte order that
the current CPU (the "host") uses, to and from little-endian and big-endian byte
order.
For windows try: How do I convert between big-endian and little-endian values in C++?
That question points to _byteswap_uint64, as well as 16- and 32-bit solutions and the gcc-specific __builtin_bswap32/__builtin_bswap64 calls.
Other Sizes
Most systems don't have values that aren't 16/32/64 bits long. At that point I might try to store it in a 64-bit value, shift it and then translate. I'd write some good tests. I suspect it is an uncommon situation and more details would help.
Unpack the string one byte at a time.
unsigned char *str;      // points at the four bytes to decode
unsigned int result;
// assemble the bytes MSB-first, independent of host endianness
result  = (unsigned int)*str++ << 24;
result |= *str++ << 16;
result |= *str++ << 8;
result |= *str++;
First, the cast you're doing:
char *str = ...;
int32_t i = *(int32_t*)str;
results in undefined behavior due to the strict aliasing rule (unless str is initialized with something like int32_t x; char *str = (char*)&x;). In practical terms that cast can result in an unaligned read which causes a bus error (a crash) on some platforms and slow performance on others.
Instead you should be doing something like:
int32_t i;
std::memcpy(&i, str, sizeof(i));
There are a number of functions for swapping bytes between the host's native byte ordering and a host independent ordering: ntoh*(), hton*(), where * is nothing, l, or s for the different types supported. Since different hosts may have different byte orderings then this may be what you want to use if the data you're reading uses a consistent serialized form on all platforms.
i = ntohl(i);
You can also manually move bytes around in str before copying it into the integer.
std::swap(str[0],str[3]);
std::swap(str[1],str[2]);
std::memcpy(&i,str,sizeof(i));
Or you can manually manipulate the integer's value using shifts and bitwise operators.
std::memcpy(&i,str,sizeof(i));
i = (i&0xFFFF0000)>>16 | (i&0x0000FFFF)<<16;
i = (i&0xFF00FF00)>>8 | (i&0x00FF00FF)<<8;
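The question also mentions struct.unpack(">f") and struct.unpack(">d"). A hedged sketch of one way to handle those (my own addition, building on the memcpy approach above, and assuming the host uses IEEE-754 floats, which mainstream platforms do) is to reverse the bytes into a temporary buffer on little-endian hosts and memcpy from there:

#include <cstring>

// Returns true when the host stores the least significant byte first.
static bool host_is_little_endian()
{
    unsigned int probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Unpack a big-endian float from a raw byte buffer.
float unpack_be_float(const unsigned char *src)
{
    unsigned char tmp[sizeof(float)];
    if (host_is_little_endian())
    {
        for (std::size_t k = 0; k < sizeof(float); ++k)
            tmp[k] = src[sizeof(float) - 1 - k];   // reverse the byte order
    }
    else
    {
        std::memcpy(tmp, src, sizeof(float));
    }
    float f;
    std::memcpy(&f, tmp, sizeof(f));
    return f;
}

The same pattern handles double by swapping 8 bytes instead of 4.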
This falls in the realm of bit twiddling.
for (i=0;i<sizeof(struct foo);i++) dst[i] = src[i ^ mask];
where mask == (sizeof type -1) if the stored and native endianness differ.
With this technique one can convert a struct to bit masks:
struct foo {
    unsigned char a, b;  // mask = 0,0
    short e;             // mask = 1,1
    int g;               // mask = 3,3,3,3
    double i;            // mask = 7,7,7,7,7,7,7,7
} s;                     // notice that all units must be aligned according to their native size
Again, these masks can be encoded with two bits per byte (the code n stands for the mask (1<<n)-1), meaning that on 64-bit machines one can encode the necessary masks for a 32-byte struct in a single constant (with 1-, 2-, 4- and 8-byte alignments).
unsigned int mask = 0xffffaa50;   // or zero if the endianness matches
for (int i = 0; i < 16; i++) {
    dst[i] = src[i ^ ((1 << (mask & 3)) - 1)];
    mask >>= 2;
}
If the values you receive are truly strings (char* or std::string) and you know their format, sscanf() and atoi() (really the whole atoX() family) will be your friends. They take well-formatted strings and convert them per the passed-in format (a kind of reverse printf).
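For example (a tiny sketch of my own, with made-up buffer contents), sscanf() pulls numbers straight out of such a text buffer:

#include <cstdio>

void parse_text_values()
{
    const char *buf = "42 3.14";   // hypothetical received text
    int i = 0;
    float f = 0.0f;
    if (std::sscanf(buf, "%d %f", &i, &f) == 2)
    {
        // i == 42 and f == 3.14f here
    }
}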

How to pack data in binary format in c++

Say I have a binary protocol, where the first 4 bits represent a numeric value which can be less than or equal to 10 (ten in decimal).
In C++, the smallest data type available to me is char, which is 8 bits long. So, within my application, I can hold the value represented by 4 bits in a char variable. My question is, if I have to pack the char value back into 4 bits for network transmission, how do I pack my char's value back into 4 bits?
You do bitwise operations on the char, like so:
unsigned char packedvalue = 0;
packedvalue |= 0xF0 & (7 <<4);
packedvalue |= 0x0F & (10);
This sets the upper 4 bits to 7 and the lower 4 bits to 10.
Unpack them again with:
int upper, lower;
upper = (packedvalue & 0xF0) >>4;
lower = packedvalue & 0x0F;
As an extra answer to the question -- you may also want to look at protocol buffers for a way of encoding and decoding data for binary transfers.
Sure, just use one char for your value:
#include <fstream>

std::ofstream outfile("thefile.bin", std::ios::binary);
unsigned int n = 10;                          // at most 10!
char c = static_cast<char>(n << 4);           // fits
outfile.write(&c, 1);                         // we wrote the value "10"
The lower 4 bits will be left at zero. If they're also used for something, you'll have to populate c fully before writing it. To read:
std::ifstream infile("thefile.bin", std::ios::binary);
infile.read(&c, 1);
unsigned int n = static_cast<unsigned char>(c) >> 4;
Well, there's the popular but non-portable "Bit Fields". They're standard-compliant, but may create a different packing order on different platforms. So don't use them.
Then, there are the highly portable bit shifting and bitwise AND and OR operators, which you should prefer. Essentially, you work on a larger field (usually 32 bits, for TCP/IP protocols) and extract or replace subsequences of bits. See Martin's link and Soren's answer for those.
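As a minimal sketch of that shift-and-mask approach (my own illustration; the field position is an arbitrary choice), packing and extracting a 4-bit value within a 32-bit protocol word might look like this:

#include <cstdint>

// Pack a 4-bit value into bits 28-31 of a 32-bit word, leaving the rest untouched.
uint32_t set_4bit_field(uint32_t word, unsigned value)
{
    word &= ~(0xFu << 28);           // clear the field
    word |= (value & 0xFu) << 28;    // insert the new value
    return word;
}

// Extract the same 4-bit field again.
unsigned get_4bit_field(uint32_t word)
{
    return (word >> 28) & 0xFu;
}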
Are you familiar with C's bitfields? You simply write
struct my_bits {
    unsigned v1 : 4;
    ...
};
Be warned, various operations are slower on bitfields because the compiler must unpack them for things like addition. I'd imagine unpacking a bitfield will still be faster than the addition operation itself, even though it requires multiple instructions, but it's still overhead. Bitwise operations should remain quite fast. Equality too.
You must also take care with endianness and threads (see the wikipedia article I linked for details, but the issues are kinda obvious). You should learn about endianness anyway since you said "binary protocol" (see this previous question).