Converting a byte into bit and writing the binary data to file - c++

Suppose I have a character array, char a[8], containing 10101010. If I store this data in a .txt file, the file is 8 bytes in size. How can I convert this data to binary format and save it to a file as 8 bits (not 8 bytes), so that the file size is only 1 byte?
Also, once I convert these 8 bytes to a single byte, which file format should I save the output in: .txt, .dat or .bin?
I am working on Huffman encoding of text files. I have already converted the text into binary, i.e. 0's and 1's, but when I store this output in a file, each digit (1 or 0) takes a byte instead of a bit. I want a solution in which each digit takes only one bit.
char buf[100];
void build_code(node n, char *s, int len)
{
static char *out = buf;
if (n->c) {
s[len] = 0;
strcpy(out, s);
code[n->c] = out;
out += len + 1;
return;
}
s[len] = '0'; build_code(n->left, s, len + 1);
s[len] = '1'; build_code(n->right, s, len + 1);
}
This is how I build up my code table with the help of a Huffman tree. And
void encode(const char *s, char *out)
{
while (*s)
{
strcpy(out, code[*s]);
out += strlen(code[*s++]);
}
}
This is how I encode to get the final output.

Not entirely sure how you end up with a string representing the binary representation of a value,
but you can get an integer value from a string (in any base) using standard functions like std::strtoul.
That function provides an unsigned long value, since you know your value is within 0-255 range you can store it in an unsigned char:
unsigned char v=(unsigned char)(std::strtoul(binary_string_value.c_str(),0,2) & 0xff);
To write it to disk, you can use std::ofstream opened in binary mode (std::ios::binary).
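As a rough illustration of the two steps above (convert, then write), here is a minimal sketch. The function names bitStringToByte and writeByteToFile are made up for this example, and it assumes the input string holds at most 8 characters, all of them '0' or '1'.

```cpp
#include <cstdlib>
#include <fstream>
#include <string>

// Convert a string of up to 8 '0'/'1' characters into one byte.
// (Hypothetical helper name; std::strtoul does the base-2 parsing.)
unsigned char bitStringToByte(const std::string& bits) {
    return static_cast<unsigned char>(std::strtoul(bits.c_str(), nullptr, 2) & 0xff);
}

// Write a single byte to the named file in binary mode.
// Returns true if the stream is still in a good state afterwards.
bool writeByteToFile(const std::string& path, unsigned char value) {
    std::ofstream file(path, std::ios::binary);
    if (!file) return false;
    file.put(static_cast<char>(value));
    return static_cast<bool>(file);
}
```

Calling writeByteToFile("out.bin", bitStringToByte("10101010")) should leave a file that is exactly one byte long.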
Which File format should I save the output in? .txt or .dat or .bin?
Keep in mind that the extension (.txt, .dat or .bin) does not really mandate the format (i.e. the structure of the contents). The extension is a convention commonly used to indicate that you're using some well-known format (and in some OSes/environments, it determines which program is configured to handle that file). Since this is your file, it is up to you to define the actual format, and to name the file with whatever extension (or even no extension) best represents your contents, as long as it is meaningful to you and to those who are going to consume your files.
Edit: additional details
Assuming we have a buffer of some length where you're storing your string of '0' and '1'
int codeSize; // size of the code buffer
char *code; // code array/pointer
std::ofstream file; // File stream where we're writing to.
unsigned char *byteArray = new unsigned char[codeSize / 8 + ((codeSize % 8 != 0) ? 1 : 0)];
int bytes = 0;
int i = 0; // index of the first bit that has not been packed yet
for (; i + 8 <= codeSize; i += 8) {
    std::string binstring(code + i, 8); // create a temp string from the slice of the code
    byteArray[bytes++] = (unsigned char)(std::strtoul(binstring.c_str(), 0, 2) & 0xff);
}
if (i < codeSize) {
    // At this point, if the number of bits is not a multiple of 8,
    // there are some bits that have not
    // been written. Not sure how you would like to handle it.
    // One option is to pad with 0 bits up to
    // the next multiple of 8... but it all depends on what you're representing.
}
file.write(reinterpret_cast<const char *>(byteArray), bytes);
delete[] byteArray;
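One possible way to handle the leftover bits is to pad the last byte with zeros. The sketch below does that; packBits is a hypothetical name, and the choice of most-significant-bit-first order with zero padding on the right is an assumption, not something the question specifies.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Pack a string of '0'/'1' characters into bytes, most significant bit first.
// Assumption: a trailing partial byte is padded with zero bits on the right.
std::vector<unsigned char> packBits(const std::string& bits) {
    std::vector<unsigned char> out((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i] == '1') {
            // Set bit (7 - i % 8) of the byte that character i belongs to.
            out[i / 8] |= static_cast<unsigned char>(0x80u >> (i % 8));
        }
    }
    return out;
}
```

With padding, the reader cannot tell real trailing zeros from padding, so the bit count usually has to be stored somewhere as well.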

A function that converts 8 input chars, each representing one bit, into one byte:
unsigned char BitsToByte( const char in[8] )
{
    unsigned char ret = 0;
    for( int i=0, pow=128; i<8; ++i, pow/=2 )
        if( in[i] == '1' ) ret += pow;
    return ret;
}
We iterate over the array passed to the function (of size 8, one char per bit) and, based on its contents, increase the return value (the first element of the array represents the most significant bit). pow starts at 128 because 2^(n-1) is the value of the n-th bit, counting from the least significant one.

You can shift them into a byte pretty easily, like this:
byte x = (s[3] - '0') + ((s[2] - '0') << 1) + ((s[1] - '0') << 2) + ((s[0] - '0') << 3);
In my example, I only shifted a nibble, or 4 bits. You can expand the example to shift an entire byte. This avoids a loop, although an optimizing compiler may well unroll an equivalent loop for you.
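Expanded to a full byte, the same idea might look like the sketch below. The function name byteFromBits is invented here, and it assumes every element of s is either '0' or '1'.

```cpp
// Combine eight '0'/'1' characters into one byte without a loop.
// s[0] is taken as the most significant bit, matching the nibble example.
unsigned char byteFromBits(const char s[8]) {
    return static_cast<unsigned char>(
        ((s[0] - '0') << 7) | ((s[1] - '0') << 6) |
        ((s[2] - '0') << 5) | ((s[3] - '0') << 4) |
        ((s[4] - '0') << 3) | ((s[5] - '0') << 2) |
        ((s[6] - '0') << 1) |  (s[7] - '0'));
}
```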

One way:
/** Converts 8 bytes to 8 bits **/
unsigned char BinStrToNum(const char a[8])
{
    /* Each conditional needs its own parentheses: ?: binds more loosely than + */
    return( (('1' == a[0]) ? 128 : 0)
          + (('1' == a[1]) ? 64 : 0)
          + (('1' == a[2]) ? 32 : 0)
          + (('1' == a[3]) ? 16 : 0)
          + (('1' == a[4]) ? 8 : 0)
          + (('1' == a[5]) ? 4 : 0)
          + (('1' == a[6]) ? 2 : 0)
          + (('1' == a[7]) ? 1 : 0) );
}
Save it in any of the formats you mentioned; or invent your own!
#include <errno.h>
#include <stdio.h>
int main()
{
int rCode=0;
const char *a = "10101010";
unsigned char byte;
FILE *fp=NULL;
fp=fopen("data.xyz", "wb");
if(NULL==fp)
{
rCode=errno;
fprintf(stderr, "fopen() failed. errno:%d\n", errno);
goto CLEANUP;
}
byte=BinStrToNum(a);
fwrite(&byte, 1, 1, fp);
CLEANUP:
if(fp)
fclose(fp);
return(rCode);
}

Related

Alternate reading as char* and wchar_t*

I'm trying to write a program that parses ID3 tags, for educational purposes (so please explain in depth, as I'm trying to learn). So far I've had great success, but stuck on an encoding issue.
When reading the mp3 file, the default encoding for all text is ISO-8859-1. All header info (frame IDs etc) can be read in that encoding.
This is how I've done it:
ifstream mp3File("../myfile.mp3", ios::binary); // binary mode: no newline translation
mp3File.read(mp3Header, 10); // char mp3Header[10];
// .... Parsing the header
// After reading the main header, we get into the individual frames.
// Read the first 10 bytes from buffer, get size and then read data
char encoding[1];
while(1){
char frameHeader[10] = {0};
mp3File.read(frameHeader, 10);
ID3Frame frame(frameHeader); // Parses frameHeader
if (frame.frameId[0] == 'T'){ // Text Information Frame
mp3File.read(encoding, 1); // Get encoding
if (encoding[0] == 1){
// We're dealing with UCS-2 encoded Unicode with BOM
char data[frame.size];
mp3File.read(data, frame.size);
}
}
}
This is bad code, because data is a char*; its contents should look like this (undisplayable chars converted to int):
char = [0xFF, 0xFE, C, 0, r, 0, a, 0, z, 0, y, 0]
Two questions:
What are the first two bytes? - Answered.
How can I read wchar_t from my already open file? And then get back to reading the rest of it?
Edit (clarification): I'm not sure if this is the correct way to do it, but essentially what I wanted to do was: read the first 11 bytes into a char array (header + encoding), then the next 12 bytes into a wchar_t array (the name of the song), and then the next 10 bytes into a char array (the next header). Is that possible?
I figured out a decent solution: create a new wchar_t buffer and add the characters from the char array in pairs.
wchar_t* charToWChar(char* cArray, int len) {
char wideChar[2];
wchar_t wideCharW;
wchar_t *wArray = (wchar_t *) malloc(sizeof(wchar_t) * len / 2);
int counter = 0;
int endian = BIGENDIAN;
// Check endianness
if ((uint8_t) cArray[0] == 255 && (uint8_t) cArray[1] == 254)
endian = LITTLEENDIAN;
else if ((uint8_t) cArray[1] == 255 && (uint8_t) cArray[0] == 254)
endian = BIGENDIAN;
for (int j = 2; j < len; j+=2){
switch (endian){
case LITTLEENDIAN: {wideChar[0] = cArray[j]; wideChar[1] = cArray[j + 1];} break;
default:
case BIGENDIAN: {wideChar[1] = cArray[j]; wideChar[0] = cArray[j + 1];} break;
}
wideCharW = (uint16_t)((uint8_t)wideChar[1] << 8 | (uint8_t)wideChar[0]);
wArray[counter] = wideCharW;
counter++;
}
wArray[counter] = '\0';
return wArray;
}
Usage:
if (encoding[0] == 1){
// We're dealing with UCS-2 encoded Unicode with BOM
char data[frame.size];
mp3File.read(data, frame.size);
wcout << charToWChar(data, frame.size) << endl;
}
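For comparison, here is a sketch of the same pairing idea that returns a std::u16string instead of a malloc'ed wchar_t array. It swaps in char16_t because wchar_t is 32 bits wide on many non-Windows platforms; decodeUtf16 is a made-up name, and falling back to big-endian when there is no BOM is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode a byte buffer holding 16-bit code units, preceded by an optional BOM,
// into a std::u16string. Assumption: big-endian when no BOM is present.
// A trailing odd byte is ignored.
std::u16string decodeUtf16(const char* data, std::size_t len) {
    bool littleEndian = false;
    std::size_t start = 0;
    if (len >= 2) {
        const std::uint8_t b0 = static_cast<std::uint8_t>(data[0]);
        const std::uint8_t b1 = static_cast<std::uint8_t>(data[1]);
        if (b0 == 0xFF && b1 == 0xFE) { littleEndian = true; start = 2; }
        else if (b0 == 0xFE && b1 == 0xFF) { littleEndian = false; start = 2; }
    }
    std::u16string out;
    for (std::size_t j = start; j + 1 < len; j += 2) {
        const unsigned first = static_cast<std::uint8_t>(data[j]);
        const unsigned second = static_cast<std::uint8_t>(data[j + 1]);
        // Little-endian stores the low byte first, big-endian the high byte first.
        const unsigned unit = littleEndian ? ((second << 8) | first)
                                           : ((first << 8) | second);
        out.push_back(static_cast<char16_t>(unit));
    }
    return out;
}
```

The caller no longer has to free anything, and the length travels with the string instead of relying on a terminator.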

How to convert a char array to a byte array?

I'm working on my project and I'm stuck on a problem: how can I convert a char array to a byte array?
For example, I need to convert the char[9] "fff2bdf1" to a byte[4] containing 0xff, 0xf2, 0xbd, 0xf1.
Here is a little Arduino sketch illustrating one way to do this:
void setup() {
Serial.begin(9600);
char arr[] = "abcdef98";
byte out[4];
auto getNum = [](char c){ return c > '9' ? c - 'a' + 10 : c - '0'; };
byte *ptr = out;
for(char *idx = arr ; *idx ; ++idx, ++ptr ){
byte high = getNum( *idx++ ); // read the high nibble first, then advance to the low one
*ptr = (high << 4) + getNum( *idx );
}
//Check converted byte values.
for( byte b : out )
Serial.println( b, HEX );
}
void loop() {
}
The loop will keep converting until it hits a null character. Also, the code used in getNum only deals with lower case values. If you need to parse uppercase values, it's an easy change. If you need to parse both, it's only a little more code; I'll leave that for you if needed (let me know if you cannot work it out and need it).
This will output to the serial monitor the 4 byte values contained in out after conversion.
AB
CD
EF
98
Edit: How to use different length inputs.
The loop does not care how much data there is, as long as there is an even number of inputs (two ASCII chars for each byte of output) plus a single terminating null. It simply stops converting when it hits the input string's terminating null.
So to do a longer conversion in the sketch above, you only need to change the length of the output (to accommodate the longer number). I.e:
char arr[] = "abcdef9876543210";
byte out[8];
The 4 inside the loop doesn't change. It is shifting the first number into position.
For the first two inputs ("ab") the code first converts the 'a' to the number 10, or hexadecimal A. It then shifts it left 4 bits, so it resides in the upper four bits of the byte: 0A becomes A0. Then the second value, B, is simply added to the number, giving AB.
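If both upper and lower case digits have to be accepted, a sketch along these lines may help. It is plain C++ rather than Arduino-specific code, the names hexDigitValue and hexToBytes are invented for the example, and returning 0 on any error is just one possible convention.

```cpp
#include <cstddef>

// Convert one hex digit of either case to its value; returns -1 for any other character.
int hexDigitValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Convert a null-terminated hex string to bytes, two digits per byte.
// Returns the number of bytes written, or 0 if the input has an odd length,
// contains an invalid digit, or does not fit in the output buffer.
std::size_t hexToBytes(const char* hex, unsigned char* out, std::size_t outSize) {
    std::size_t count = 0;
    while (hex[0] != '\0') {
        if (hex[1] == '\0' || count == outSize) return 0;
        const int hi = hexDigitValue(hex[0]);
        const int lo = hexDigitValue(hex[1]);
        if (hi < 0 || lo < 0) return 0;
        // The high digit goes into the upper four bits, the low digit into the lower four.
        out[count++] = static_cast<unsigned char>((hi << 4) | lo);
        hex += 2;
    }
    return count;
}
```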
Assuming you want to parse the hex values in your string, and two letters always make up one byte value (so you use leading zeros), you can use sscanf like this:
char input[] = "fff2bdf1";
unsigned char output[4];
for (int i=0; i<4; i++) {
sscanf(&input[i*2], "%2hhx", &output[i]); // %2hhx: read exactly two hex digits into an unsigned char
}
Just shift 0 or 1 to its position in binary format :)
char lineChars[8] = {1,1,0,0,0,1,0,1};
char lineChar = 0;
for(int i=0; i<8;i++)
{
lineChar |= lineChars[i] << (7-i);
}
Example 2 (not tested):
void abs()
{
char* charData = new char;
*charData = 'h';
BYTE* byteData = new BYTE;
*byteData = *(BYTE*)charData;
}

Compare uint32_t with a loaded char[] from file C++

I have a binary file from which I load whole text in unsigned char[] and a variable const uint32_t LITTLE_ENDIAN_ID = 0x49696949;
I need to compare first four characters from loaded char[] with given uint32_t.
Is that possible somehow?
If buff is your unsigned char[] buffer, you can do:
memcmp((unsigned char*)&LITTLE_ENDIAN_ID, buff, 4) == 0
memcmp is declared in <string.h> (<cstring> in C++).
Yes, it's absolutely possible, but your question is underspecified. What you want to do is take the first 4 characters of your character array and convert them into a uint32_t; the obvious question is which character corresponds to which byte of the 32-bit int. This is equivalent to asking whether these bytes are stored in little-endian or big-endian order. Though now that I see your LITTLE_ENDIAN_ID, I realize that it doesn't matter: it's (oddly) the same forwards and backwards.
Anyhow, what you want is either:
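unsigned char text[] = ...
// Parentheses are required: + binds more tightly than <<
uint32_t x = ((uint32_t)text[0] << 24) + ((uint32_t)text[1] << 16) + ((uint32_t)text[2] << 8) + text[3];
if (x == LITTLE_ENDIAN_ID)
// do something
Or the same thing, but with
uint32_t x = ((uint32_t)text[3] << 24) + ((uint32_t)text[2] << 16) + ((uint32_t)text[1] << 8) + text[0];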
Alternatively we could do something a little more unusual like
union converter {
uint32_t int_value;
unsigned char characters[4];
};
unsigned char text[] = ...
converter x;
for (int i=0; i < 4; i++)
x.characters[i] = text[i];
if (x.int_value == LITTLE_ENDIAN_ID)
// do something
This is probably closer to what you want if you are actually looking to test the endianness of the current system.
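The three interpretations discussed above (explicit big-endian, explicit little-endian, and whatever the host uses) can be put side by side in a small sketch. The function names are invented for the example, and std::memcpy is used for the native read as an alternative to the union.

```cpp
#include <cstdint>
#include <cstring>

// Interpret four bytes as a big-endian 32-bit value (first byte most significant).
std::uint32_t readBigEndian32(const unsigned char* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
            static_cast<std::uint32_t>(p[3]);
}

// Interpret four bytes as a little-endian 32-bit value (first byte least significant).
std::uint32_t readLittleEndian32(const unsigned char* p) {
    return (static_cast<std::uint32_t>(p[3]) << 24) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
            static_cast<std::uint32_t>(p[0]);
}

// Interpret four bytes in the host's native byte order.
std::uint32_t readNative32(const unsigned char* p) {
    std::uint32_t value;
    std::memcpy(&value, p, sizeof value);
    return value;
}
```

For the palindromic bytes 0x49 0x69 0x69 0x49 all three functions return the same value; for any other input the native result matches one of the first two, depending on the machine.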

How can I print a bit instead of byte in a file?

I am using the Huffman algorithm to develop a file compressor, and right now I am facing the following problem.
Applying the algorithm to the word
stackoverflow, I get the following result:
a,c,e,f,k,l,r,s,t,v,w = 1 time repeated
o = 2 times repeated
a,c,e,f,k,l,r,s,t,v,w = 7.69231%
and
o = 15.3846%
So I start inserting them into a binary tree, which gives me these codes:
o=00
a=010
e=0110
c=0111
t=1000
s=1001
w=1010
v=1011
k=1100
f=1101
r=1110
l=1111
Each code is the path to the character in the tree, with 0 meaning left and 1 meaning right.
Then the word "stackoverflow" becomes:
100110000100111010011111000010110110111011011111001010
I want to write that whole value into a binary file as bits. It is 47 bits, which would fit in 6 bytes, but instead I can only make it 47 bytes, because the minimum that fwrite or fprintf can put into a file is 1 byte (using sizeof(something)).
So my question is: how can I write only a single bit to my file?
Just write the "header" to the file: the number of bits and then "pack" the bits into bytes padding the last one. Here's a sample.
#include <stdio.h>
FILE* f;
/* how many bits in current byte */
int bit_counter;
/* current byte */
unsigned char cur_byte;
/* write 1 or 0 bit */
void write_bit(unsigned char bit)
{
/* add the new bit first, then write the byte out once it holds 8 bits */
cur_byte <<= 1;
cur_byte |= bit;
if(++bit_counter == 8)
{
fwrite(&cur_byte,1,1,f);
bit_counter = 0;
cur_byte = 0;
}
}
int main()
{
f = fopen("test.bits", "wb"); /* "wb": binary mode, no newline translation */
cur_byte = 0;
bit_counter = 0;
/* write the number of bits here to decode the bitstream later (47 in your case) */
/* int num = 47; */
/* fwrite(&num, 1, 4, f); */
write_bit(1);
write_bit(0);
write_bit(0);
/* etc... - do this in a loop for each encoded character */
/* 100110000100111010011111000010110110111011011111001010 */
if(bit_counter > 0)
{
// pad the last byte with zeroes
cur_byte <<= 8 - bit_counter;
fwrite(&cur_byte, 1, 1, f);
}
fclose(f);
return 0;
}
To do the full Huffman encoder you'll have to write the bit codes at the beginning, of course.
This is sort of an encoding issue. The problem is that files can only contain bytes - so 1 and 0 can only be '1' and '0' in a file - the characters for 1 and 0, which are bytes.
What you'll have to do is to pack the bits into bytes, creating a file that contains the bits in a set of bytes. You won't be able to open the file in a text editor - it doesn't know you want to display each bit as a 1 or 0 char, it will display whatever each packed byte turns out to be. You could open it with an editor that understands how to work with binary files, though. For instance, vim can do this.
As far as extra trailing bytes or an end-of-file marker, you're going to have to create some sort of encoding convention. For example, you can pack and pad with extra zeros, like you mention in your comments, but then by convention have the first N bytes be metadata - e.g. the data length, how many bits are interesting in your file. This sort of thing is very common.
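To show why the stored bit count matters, here is a sketch of the reading side. It assumes the bits were packed most significant bit first and that the count of meaningful bits has already been read from the file's metadata; unpackBits is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Recover the '0'/'1' string from packed bytes, given the number of meaningful bits.
// Assumption: bits were packed most significant bit first, so anything after
// bitCount in the last byte is padding and is skipped.
std::string unpackBits(const std::vector<unsigned char>& bytes, std::size_t bitCount) {
    std::string bits;
    bits.reserve(bitCount);
    for (std::size_t i = 0; i < bitCount && i / 8 < bytes.size(); ++i) {
        const bool set = ((bytes[i / 8] >> (7 - i % 8)) & 1u) != 0;
        bits.push_back(set ? '1' : '0');
    }
    return bits;
}
```

Without the count, a byte such as 0xA0 could stand for "101", "1010", or anything up to "10100000".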
You'll need to manage this yourself, by buffering the bits to write and only actually writing data when you have a complete byte. Something like...
void writeBit(bool b, FILE *out) // out: the file you opened for binary writing
{
static unsigned char buffer=0;
static int bitcount=0;
buffer = (buffer << 1) | (b ? 1:0);
if (++bitcount == 8)
{
fputc(buffer, out); // write out the byte
bitcount = 0;
buffer = 0;
}
}
The above isn't reentrant (and is likely to be pretty inefficient) - and you need to make sure you somehow flush any half-written byte at the end, (write an extra 7 zero bits, maybe) but you should get the general idea.
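One way to get rid of the static state and make the final flush explicit is to wrap the buffer in a small class. This is only a sketch: BitWriter is an invented name, error checking on fputc is omitted, and padding with zero bits is an assumed convention.

```cpp
#include <cstdio>

// Accumulates bits and writes them to a FILE* one byte at a time.
class BitWriter {
public:
    explicit BitWriter(std::FILE* file) : file_(file), buffer_(0), bitCount_(0) {}

    // Append one bit; a full byte is written as soon as eight bits are collected.
    void writeBit(bool bit) {
        buffer_ = static_cast<unsigned char>((buffer_ << 1) | (bit ? 1 : 0));
        if (++bitCount_ == 8) {
            std::fputc(buffer_, file_);
            buffer_ = 0;
            bitCount_ = 0;
        }
    }

    // Pad a partial byte with zero bits on the right and write it.
    // Does nothing if no bits are pending.
    void flush() {
        if (bitCount_ > 0) {
            buffer_ = static_cast<unsigned char>(buffer_ << (8 - bitCount_));
            std::fputc(buffer_, file_);
            buffer_ = 0;
            bitCount_ = 0;
        }
    }

private:
    std::FILE* file_;
    unsigned char buffer_;
    int bitCount_;
};
```

Each writer keeps its own state, so several files can be written independently; flush() has to be called once before the file is closed.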

Dealing with hex values in C/C++

I receive values using winsock from another computer on the network. It is a TCP socket, with the 4 first bytes of the message carrying its size. The rest of the message is formatted by the server using protobuf (protocol buffers from google).
The problem, I think, is that the values sent by the server seem to be hex values sent as chars (i.e. only 10 is received for 0x10). To receive the values, I do this:
bytesreceived = recv(sock, buffer, msg_size, 0);
for (int i=0;i<bytesreceived;i++)
{
data_s << hex << buffer[i];
}
where data_s is a stringstream. Then I can use the ParseFromIstream(&data_s) method from protobuf and recover the information I want.
The problem I have is that this is VERY VERY slow (I have another implementation using QSock that I can't use for my project but which is much faster, so the problem is not on the server side).
I tried many things that I took from here and everywhere on the internet (using Arrays of bytes, strings), but nothing works.
Do I have any other options ?
Thank you for your time and comments ;)
Not sure if this will be of any use, but I've used a similar protocol before (the first 4 bytes hold an int with the length, the rest is encoded using protobuf), and to decode it I did something like this (probably not the most efficient solution, due to appending to strings):
// Once I've got the first 4 bytes, cast it to an int:
int msgLen = ntohl(*reinterpret_cast<const int*>(buffer));
// Check I've got enough bytes for the message, if I have then
// just parse the buffer directly
MyProtobufObj obj;
if( bytesreceived >= msgLen+4 )
{
obj.ParseFromArray(buffer+4,msgLen);
}
else
{
// just keep appending buffer to an STL string until I have
// msgLen+4 bytes and then do
// obj.ParseFromString(myStlString)
}
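The "keep appending until I have msgLen+4 bytes" branch could be sketched as a small buffer class like the one below. FrameBuffer is an invented name, and the sketch assumes the 4-byte prefix is an unsigned length in network (big-endian) byte order, which is what the ntohl call above implies.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Accumulates received bytes and extracts complete length-prefixed messages.
// Assumption: the 4-byte prefix is an unsigned big-endian payload length.
class FrameBuffer {
public:
    // Append a chunk, e.g. the bytes returned by one recv() call.
    void append(const char* data, std::size_t len) { pending_.append(data, len); }

    // If a complete message is buffered, copy its payload into `payload`,
    // remove it from the buffer and return true; otherwise return false.
    bool next(std::string& payload) {
        if (pending_.size() < 4) return false;
        std::uint32_t msgLen = 0;
        for (int i = 0; i < 4; ++i) {
            msgLen = (msgLen << 8) | static_cast<unsigned char>(pending_[i]);
        }
        if (pending_.size() - 4 < msgLen) return false;  // payload not complete yet
        payload.assign(pending_, 4, msgLen);
        pending_.erase(0, 4 + static_cast<std::size_t>(msgLen));
        return true;
    }

private:
    std::string pending_;
};
```

Each payload returned by next() can then be handed to ParseFromString, as in the comment above.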
I wouldn't use the stream operators. They're for formatted data and that's not what you want.
You can keep the values received in a std::vector with the char type (vector of bytes). That would essentially just be a dynamic array. If you want to continue using a string stream, you can use the stringstream::write function which takes a buffer and a length. You should have the buffer and number of bytes received from your call to recv.
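A minimal sketch of the stringstream::write route follows. The wrapper name appendRaw is made up; buffer and count stand in for the array passed to recv and the number of bytes it reported.

```cpp
#include <cstddef>
#include <sstream>

// Append raw bytes to a stream without any formatting.
// `buffer` and `count` stand in for the array passed to recv() and its return value.
void appendRaw(std::stringstream& stream, const char* buffer, std::size_t count) {
    stream.write(buffer, static_cast<std::streamsize>(count));
}
```

Unlike operator<<, write copies the bytes as they are, including zero bytes.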
If you want to use the vector method, you can use std::copy to make it easier.
#include <algorithm>
#include <iterator>
#include <vector>
char buf[256];
std::vector<char> bytes;
int n = recv(sock, buf, sizeof buf, 0); // recv returns a negative value on error
if (n > 0)
std::copy(buf, buf + n, std::back_inserter(bytes));
Your question is kind of ambiguous. Let's follow your example. You receive 10 as characters and you want to retrieve this as a hex number.
Assuming recv will give you this character string, you can do this.
First of all, make it null-terminated (buffer must have room for one extra byte):
buffer[bytesreceived] = '\0';
Then you can very easily read the value from this buffer using the standard *scanf function for strings:
unsigned int hexValue;
sscanf(buffer, "%x", &hexValue);
There you go!
Edit: If you receive the number in reverse order (so 01 for 10), probably your best shot is to convert it manually:
int hexValue = 0;
int positionValue = 1;
for (int i = 0; i < bytesreceived; ++i)
{
int digit = 0;
if (buffer[i] >= '0' && buffer[i] <= '9')
digit = buffer[i]-'0';
else if (buffer[i] >= 'a' && buffer[i] <= 'f')
digit = buffer[i]-'a'+10; // 'a' stands for 10
else if (buffer[i] >= 'A' && buffer[i] <= 'F')
digit = buffer[i]-'A'+10;
else // Some kind of error!
return error;
hexValue += digit*positionValue;
positionValue *= 16;
}
This is just a clear example though. In reality you would do it with bit shifting for example rather than multiplying.
What data type is buffer?
The whole thing looks like a great big no-op, since operator<<(stringstream&, char) ignores the base specifier. The hex specifier only affects formatting of non-character integral types. For certain you don't want to be handing textual data to protobuf.
Just hand the buffer pointer to protobuf, you're done.
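The point about the hex specifier is easy to check in isolation. The two helper functions below are invented for this demonstration; they format a char and an int through the same manipulator.

```cpp
#include <sstream>
#include <string>

// Insert a char after std::hex: the character itself is written, unchanged.
std::string hexFormatChar(char c) {
    std::ostringstream s;
    s << std::hex << c;
    return s.str();
}

// Insert an int after std::hex: the value is written as hexadecimal digits.
std::string hexFormatInt(int v) {
    std::ostringstream s;
    s << std::hex << v;
    return s.str();
}
```

hexFormatChar('A') gives "A", while hexFormatInt(65) gives "41", so the loop in the question copies the buffer byte for byte, just slowly.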
OK, a shot into the dark: Let's say your ingress stream is "71F4E81DA...", and you want to turn this into a byte stream { 0x71, 0xF4, 0xE8, ...}. Then we can just assemble the bytes from the character literals as follows, schematically:
char * p = getCurrentPointer();
while (chars_left() >= 2)
{
unsigned char b;
b = get_byte_value(*p++) << 4; // the first digit is the high nibble
b += get_byte_value(*p++);
output_stream.insert(b);
}
Here we use a little helper function:
unsigned char get_byte_value(char c)
{
if ('0' <= c && c <= '9') return c - '0';
if ('A' <= c && c <= 'F') return 10 + c - 'A';
if ('a' <= c && c <= 'f') return 10 + c - 'a';
return 0; // error
}