How to convert Text to Binary (and Reverse) in C++?

Okay, this will be a very beginner question, though I can't seem to find a good resource on this topic.
What I want is simple: take a string (or char*) and convert it to a binary file that I can store somewhere on my system.
Then, at a later date, I want to be able to read that binary file and convert it back to a string (or char*).
Now...
Whenever I search for this I keep landing on the concept of serialisation, which is basically what I want.
There's a problem though: most often Boost.Serialization is recommended, which (IMO) is quite heavy for just converting simple text to binary and simple binary back to text. (OK, I know it isn't THAT easy, but you get the idea.)
There has got to be an easier way to handle this. I hope you can help me find it. :D
Thank you very much in advance for your answers.

How to convert Text to Binary (and Reverse)
There's nothing to do. Text is already data, and the in-memory representation of all data in any modern computer is always binary.
You need to know what you mean. If you just mean "write it to a file" (in any representation), then just do that:
#include <fstream>
#include <string>

std::string my_text;   // the text you want to store
std::ofstream ofs("myfile.bin", std::ios::binary);
ofs.write(my_text.data(), my_text.size());
If you need some specific representation (different character sets, encodings or even (archive) file formats) you might need to do that conversion.
Oh, lest I forget, to read-back:
#include <iterator>   // std::istreambuf_iterator

std::ifstream ifs("myfile.bin", std::ios::binary);
std::string my_text(std::istreambuf_iterator<char>(ifs), {});

You just have to use bit manipulation tricks. C and C++ both have operators that apply bitwise logic to integer values. So for example:
x = 3 & 1;
will set x to 1, because an AND operation takes each bit from the left side of the & and the corresponding bit from the right side, and the resulting bit is 1 only if both of those bits are 1.
You can also do bit shifting, where you shift the bits over by some number of positions. For example:
y = 1 << 2;
will shift all the bits in the integer 1 to the left by two, and the new rightmost bits will be set to zero (so y is 4).
So the way to do this is: for every byte in the string, AND it with 128; if the result is zero, the leftmost bit is zero (print "0"), otherwise it is one (print "1"). Then shift the byte left by one and repeat. Do that eight times and you have converted one byte to binary.
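A minimal sketch of that mask-and-shift loop, assuming plain 8-bit bytes (the function name and the sample text are mine):

#include <iostream>
#include <string>

// Convert every byte of a string to its "01010101" text form by testing
// the leftmost bit (AND with 128), then shifting left, eight times per byte.
std::string to_bits(const std::string& text)
{
    std::string bits;
    for (unsigned char byte : text)
    {
        for (int i = 0; i < 8; ++i)
        {
            bits += (byte & 128) ? '1' : '0';
            byte <<= 1;   // bring the next bit into the leftmost position
        }
    }
    return bits;
}

int main()
{
    std::cout << to_bits("AB") << '\n';   // prints 0100000101000010
}

Going back is the same idea in reverse: read eight '0'/'1' characters, shifting each one into a byte as you go.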

You could use ofstream/ifstream. They work similarly to cout/cin, except they read from and write to files instead of the console. Maybe this link is helpful: https://www.cplusplus.com/doc/tutorial/files/
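For instance (the file name here is just an example), the same << and >>/getline syntax you use with cout and cin works on file streams:

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    // Writing: same << syntax as std::cout, but the target is a file.
    std::ofstream out("notes.txt");
    out << "hello file\n";
    out.close();

    // Reading: same as std::cin, but the source is the file.
    std::ifstream in("notes.txt");
    std::string line;
    while (std::getline(in, line))
        std::cout << line << '\n';
}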

Related

Detect endianness of binary file data

Recently I was (again) reading about endianness. I know how to identify the endianness of the host, as there are lots of posts on SO, and I have also seen this, which I think is a pretty good resource.
However, one thing I'd like to know is how to detect the endianness of an input binary file. For example, I am reading a binary file (using C++) like the following:
ifstream mydata("mydata.raw", ios::binary);
short value;
char buf[sizeof(short)];
int dataCount = 0;
short myDataMat[DATA_DIMENSION][DATA_DIMENSION];
while (mydata.read(reinterpret_cast<char*>(&buf), sizeof(buf)))
{
memcpy(&value, buf, sizeof(value));
myDataMat[dataCount / DATA_DIMENSION][dataCount%DATA_DIMENSION] = value;
dataCount++;
}
I'd like to know how I can detect the endianness of mydata.raw, and whether endianness affects this program at all.
Additional Information:
I am only manipulating the data in myDataMat using mathematical operations; no pointer or bitwise operations are done on the data.
My machine (host) is little endian.
It is impossible to "detect" the endianity of data in general. Just like it is impossible to detect whether the data is an array of 4-byte integers, or twice that many 2-byte integers. Without any knowledge about the representation, raw data is just a mass of meaningless bits.
However, with some extra knowledge about the data representation, it becomes possible. Some examples:
Most file formats mandate a particular endianity, in which case this is never a problem.
Unicode text files may optionally start with a byte order mark. The same idea can be implemented by other data representations (see the sketch after this list).
Some file formats contain a checksum. You can guess one endianity, and if the checksum does not match, try again with the other endianity. It is unlikely that the checksum will match under the wrong interpretation of the data.
Sometimes you can make guesses based on the data. Is the temperature outside 33'554'432 degrees, or maybe 2? You can pick the endianity that represents sane data. Of course, this type of guesswork fails miserably when the aliens invade and start melting our planet.
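Here is a sketch of the byte-order-mark idea. The magic value 0x1234 is an arbitrary choice of mine, not part of any standard: the writer stores it first in its native byte order, and the reader uses it to decide whether the rest of the values need swapping.

#include <cstdint>
#include <fstream>
#include <iostream>

// Swap the two bytes of a 16-bit value.
std::uint16_t swap16(std::uint16_t v)
{
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

int main()
{
    std::ifstream in("mydata.raw", std::ios::binary);

    std::uint16_t bom = 0;
    in.read(reinterpret_cast<char*>(&bom), sizeof bom);

    bool must_swap;
    if (bom == 0x1234)          // file was written with our byte order
        must_swap = false;
    else if (bom == 0x3412)     // file was written with the opposite byte order
        must_swap = true;
    else
    {
        std::cerr << "unknown byte-order mark\n";
        return 1;
    }

    std::uint16_t value = 0;
    while (in.read(reinterpret_cast<char*>(&value), sizeof value))
    {
        if (must_swap)
            value = swap16(value);
        // ... use value ...
    }
}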
You can't tell.
The endianness transformation is essentially an operator E(x) on a number x such that x = E(E(x)), i.e. it is its own inverse. So from the data alone you don't know "which way round" the elements in your file are.
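A tiny illustration of that property for 16-bit values (the helper name is mine):

#include <cassert>
#include <cstdint>

// E(x): swap the two bytes of a 16-bit value.
std::uint16_t byteswap16(std::uint16_t x)
{
    return static_cast<std::uint16_t>((x << 8) | (x >> 8));
}

int main()
{
    std::uint16_t x = 0xC8FA;
    assert(byteswap16(x) == 0xFAC8);          // one swap gives the other reading
    assert(byteswap16(byteswap16(x)) == x);   // E(E(x)) == x, so nothing in the bytes says which reading is "right"
}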

Type bit lengths and architecture-specific implementations

I'm doing stuff in C++, but lately I've found that there are slight differences regarding how much data a type can accommodate, and the byte order is also an issue.
Suppose I've got a binary file where I've encoded shorts that are 2 bytes in size. The file is in a binary format like:
FA C8 - data segment 1
BA 32 - data segment 2
53 56 - data segment 3
All is well up to this point. Now I want to read this data back. There are 2 problems:
1. What data type should I choose to store these values?
2. How do I deal with the endianness of the target architecture?
The first problem is actually related to the second because here I will have to do bit shifts in order to swap the order of bytes.
I know that I could read the file byte by byte and add every two bytes. But is there an approach that could ease that pain?
I'm sorry if I'm being ambiguous; the problem is hard to explain. Hope you get a glimpse of what I'm talking about. I just want to store this data internally.
So I would appreciate some advice, or if you could share some of your experience on this topic.
If you store the data in the file as big endian, then you can just rely on htons(), htonl(), ntohs() and ntohl() to convert the integers to the right endianness before saving and after reading.
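A short sketch of that approach for the 2-byte values above; htons()/ntohs() come from <arpa/inet.h> on POSIX systems (on Windows they live in <winsock2.h>), and the file name is made up:

#include <arpa/inet.h>   // htons, ntohs (POSIX)
#include <cstdint>
#include <fstream>

int main()
{
    // Writing: convert from host order to network (big-endian) order first.
    std::uint16_t value = 0xC8FA;
    std::uint16_t be = htons(value);
    std::ofstream out("data.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(&be), sizeof be);
    out.close();

    // Reading: convert back from big-endian to whatever the host uses.
    std::ifstream in("data.bin", std::ios::binary);
    std::uint16_t raw = 0;
    in.read(reinterpret_cast<char*>(&raw), sizeof raw);
    std::uint16_t back = ntohs(raw);   // back == 0xC8FA on any host
}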
There is no easy way to do this.
Rather than doing that yourself, you might want to look into serialization libraries (for example Protobuf or boost serialization), they'll take care of a lot of that for you.
If you want to do it yourself, use fixed-width types (uint32_t and the like from <cstdint>), and endian conversion functions as appropriate. Either have a "prefix" in your file that determines what endianness it contains (a BOM/Byte Order Mark), or always store in either big or little endian, and systematically convert.
Be extra careful if you need to serialize strings, they have encoding problems of their own too.
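One way to apply the fixed-width-plus-conversion advice is to define the file format as, say, little-endian and assemble the bytes yourself, which works identically on every host. A sketch (the helper name and file name are mine; treating the example data as little-endian is an assumption):

#include <cstdint>
#include <fstream>
#include <istream>

// Read a 16-bit little-endian value regardless of the host's byte order.
bool read_u16_le(std::istream& in, std::uint16_t& out)
{
    unsigned char bytes[2];
    if (!in.read(reinterpret_cast<char*>(bytes), 2))
        return false;
    out = static_cast<std::uint16_t>(bytes[0] | (bytes[1] << 8));
    return true;
}

int main()
{
    std::ifstream in("data.bin", std::ios::binary);
    std::uint16_t v;
    while (read_u16_le(in, v))
    {
        // the first segment FA C8 becomes 0xC8FA on every platform
    }
}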

Writing to a text file, binary vs ascii

So I am having the hardest time trying to understand this concept. I have a program that reads a text file and writes it to another file, replacing the most common words with unsigned chars. But what I cannot for the life of me understand is how I then determine the difference between the two.
If what I write to the new file is either an original char I read in or an unsigned char value in the range 1-255, how do I tell the two apart when I later reverse the process to get back the original file contents?
When you write a number such as 1253553 to a file in binary, it is written using the raw bytes of the integer (typically 4 bytes for an int) rather than as its decimal text. So, in a binary file, you will see a sequence of 4 bytes representing that number. For chars it should not make a difference, as each char is represented in one byte.
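To make the difference concrete, here is a sketch comparing the two ways of writing the same number (the file names are mine):

#include <cstdint>
#include <fstream>

int main()
{
    std::int32_t n = 1253553;

    // Text: writes the seven ASCII characters '1','2','5','3','5','5','3'.
    std::ofstream text_out("number.txt");
    text_out << n;

    // Binary: writes the four raw bytes of the integer
    // (B1 20 13 00 on a little-endian machine).
    std::ofstream bin_out("number.bin", std::ios::binary);
    bin_out.write(reinterpret_cast<const char*>(&n), sizeof n);
}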
Usually, you have to have some well known and obvious way to determine the format of your file.
One way to do this is to create your own file extension. You could naively expect that any file with that extension is in your compressed format, but it's actually quite likely other files out there have the same extension (e.g., ".dat" is probably a bad choice). So, you'll want to take further steps, like having the first few bytes of the file be something that is unlikely to be there in any other file (some "magic numbers"). Let's use two bytes, and let's simply choose 0xAB 0xCD as those two bytes.
So, when your program is presented with a file that has the proper extension, open it and read the first two bytes. If they're 0xAB and 0xCD, you can assume you're reading your special format.
This isn't a very strong way of accomplishing this task, but it is one way of doing it. You could get more extravagant if you like.
For more information, you might want to read the Wikipedia page on the subject. It's a start.
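A sketch of that magic-number check, using the 0xAB 0xCD bytes from above (the file name and extension are made up):

#include <fstream>
#include <iostream>
#include <string>

// Returns true if the file starts with the two magic bytes 0xAB 0xCD.
bool has_magic(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    unsigned char magic[2] = {0, 0};
    in.read(reinterpret_cast<char*>(magic), 2);
    return in && magic[0] == 0xAB && magic[1] == 0xCD;
}

int main()
{
    // When writing, emit the magic bytes first, then the compressed data.
    std::ofstream out("words.mycmp", std::ios::binary);
    const unsigned char magic[2] = {0xAB, 0xCD};
    out.write(reinterpret_cast<const char*>(magic), 2);
    out.close();

    std::cout << (has_magic("words.mycmp") ? "my format\n" : "something else\n");
}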

Write a program that takes text as input and produces a program that reproduces that text

Recently I came across a nice problem, which turned out to be as simple to understand as it is hard to solve. The problem is:
Write a program, that reads a text from input and prints some other
program on output. If we compile and run the printed program, it must
output the original text.
The input text is supposed to be rather large (more than 10000 characters).
The only (and very strong) requirement is that the size of the archive (i.e. the program printed) must be strictly less than the size of the original text. This rules out obvious solutions like
#include <iostream>
#include <string>

std::string s;
/* read the text into s */
std::cout << "#include<iostream>\nint main () { std::cout<<\"" << s << "\"; }";
I believe some archiving techniques are to be used here.
Unfortunately, such a program does not exist.
To see why this is so, we need to do a bit of math. First, let's count up how many binary strings there are of length n. Each of the bits can be either a 0 or 1, which gives us one of two choices for each of those bits. Since there are two choices per bit and n bits, there are thus a total of 2^n binary strings of length n.
Now, let's suppose that we want to build a compression algorithm that always compresses a bitstring of length n into a bitstring of length less than n. In order for this to work, we need to count up how many different strings of length less than n there are. Well, this is given by the number of bitstrings of length 0, plus the number of bitstrings of length 1, plus the number of bitstrings of length 2, etc., all the way up to n - 1. This total is
2^0 + 2^1 + 2^2 + ... + 2^(n-1)
Using a bit of math, we can get that this number is equal to 2^n - 1. In other words, the total number of bitstrings of length less than n is one smaller than the number of bitstrings of length n.
But this is a problem. In order for us to have a lossless compression algorithm that always maps a string of length n to a string of length at most n - 1, we would have to have some way of associating every bitstring of length n with some shorter bitstring such that no two bitstrings of length n are associated with the same shorter bitstring. This way, we can compress the string by just mapping it to the associated shorter string, and we can decompress it by reversing the mapping. The restriction that no two bitstrings of length n map to the same shorter string is what makes this lossless - if two length-n bitstrings were to map to the same shorter bitstring, then when it came time to decompress the string, there wouldn't be a way to know which of the two original bitstrings we had compressed.
This is where we reach a problem. Since there are 2^n different bitstrings of length n and only 2^n - 1 shorter bitstrings, there is no possible way we can pair up each bitstring of length n with some shorter bitstring without assigning at least two length-n bitstrings to the same shorter string. This means that no matter how hard we try, no matter how clever we are, and no matter how creative we get with our compression algorithm, there is a hard mathematical limit that says that we can't always make the text shorter.
So how does this map to your original problem? Well, if we get a string of text of length at least 10000 and need to output a shorter program that prints it, then we would have to have some way of mapping each of the 2^10000 strings of length 10000 onto the 2^10000 - 1 strings of length less than 10000. That mapping has some other properties, namely that we always have to produce a valid program, but that's irrelevant here - there simply aren't enough shorter strings to go around. As a result, the problem you want to solve is impossible.
That said, we might be able to get a program that can compress all but one of the strings of length 10000 to a shorter string. In fact, we might find a compression algorithm that does this, meaning that with probability 1 - 2^(-10000) any string of length 10000 could be compressed. This is such a high probability that if we kept picking strings for the lifetime of the universe, we'd almost certainly never guess the One Bad String.
For further reading, there is a concept from information theory called Kolmogorov complexity, which is the length of the smallest program necessary to produce a given string. Some strings are easily compressed (for example, abababababababab), while others are not (for example, sdkjhdbvljkhwqe0235089). There exist strings that are called incompressible strings, for which the string cannot possibly be compressed into any smaller space. This means that any program that would print that string would have to be at least as long as the given string. For a good introduction to Kolmogorov Complexity, you may want to look at Chapter 6 of "Introduction to the Theory of Computation, Second Edition" by Michael Sipser, which has an excellent overview of some of the cooler results. For a more rigorous and in-depth look, consider reading "Elements of Information Theory," chapter 14.
Hope this helps!
If we are talking about ASCII text...
I think this actually could be done, and I think the restriction that the text will be larger than 10000 chars is there for a reason (to give you coding room).
People here are saying that the string cannot be compressed, yet it can.
Why?
Requirement: OUTPUT THE ORIGINAL TEXT
Text is not arbitrary data. When you read input text you read ASCII chars (bytes), which have both printable and non-printable values inside.
Take this for example:
ASCII values characters
0x00 .. 0x08 NUL, (other control codes)
0x09 .. 0x0D (white-space control codes: '\t','\f','\v','\n','\r')
0x0E .. 0x1F (other control codes)
... rest of printable characters
Since you have to print text as output, you are not interested in the range (0x00-0x08,0x0E-0x1F).
You can compress the input bytes by using a different storing and retrieving mechanism (binary patterns), since you don't have to give back the original data but the original text. You can recalculate what the stored values mean and readjust them to bytes to print. You would effectively lose only data that was not text data anyway, and is therefore not printable or inputtable. If WinZip did that it would be a big fail, but for your stated requirements it simply does not matter.
Since the requirement states that the text is 10000 chars and you can save 26 of 255 values, if your packing did not have any loss you are effectively saving around 10% of the space, which means that if you can code the 'decompression' in 1000 (10% of 10000) characters you can achieve that. You would have to treat groups of 10 bytes as 11 chars, and from there extrapolate the 11th, by some extrapolation method for your range of 229. If that can be done then the problem is solvable.
Nevertheless it requires clever thinking, and coding skills that can actually do that in 1 kilobyte.
Of course this is just a conceptual answer, not a functional one.
I don't know if I could ever achieve this.
But I had the urge to give my 2 cents on this, since everybody felt it cannot be done, by being so sure about it.
The real problem in your problem is understanding the problem and the requirements.
What you are describing is essentially a program for creating self-extracting zip archives, with the small difference that a regular self-extracting zip archive would write the original data to a file rather than to stdout. If you want to make such a program yourself, there are plenty of implementations of compression algorithms, or you could implement e.g. DEFLATE (the algorithm used by gzip) yourself. The "outer" program must compress the input data and output the code for the decompression, and embed the compressed data into that code.
Pseudocode:
string originalData;
cin >> originalData;
char * compressedData = compress(originalData);
cout << "#include<...> string decompress(char * compressedData) { ... }" << endl;
cout << "int main() { char compressedData[] = {";
(output the int values of the elements of the compressedData array)
cout << "}; cout << decompress(compressedData) << endl; return 0; }" << endl;
Assuming "character" means "byte" and assuming the input text may contains at least as many valid characters as the programming language, its impossible to do this for all inputs, since as templatetypedef explained, for any given length of input text all "strictly smaller" programs are themselves possible inputs with smaller length, which means there are more possible inputs than there can ever be outputs. (It's possible to arrange for the output to be at most one bit longer than the input by using an encoding scheme that starts with a "if this is 1, the following is just the unencoded input because it couldn't be compressed further" bit)
Assuming its sufficient to have this work for most inputs (eg. inputs that consist mainly of ASCII characters and not the full range of possible byte values), then the answer readily exists: use gzip. That's what its good at. Nothing is going to be much better. You can either create self-extracting archives, or treat the gzip format as the "language" output. In some circumstances you may be more efficient by having a complete programming language or executable as your output, but often, reducing the overhead by having a format designed for this problem, ie. gzip, will be more efficient.
It's called a file archiver producing self-extracting archives.

Multibyte character constants and bitmap file header type constants

I have some existing code that I've used to write out an image to a bitmap file. One of the lines of code looks like this:
bfh.bfType='MB';
I think I probably copied that from somewhere. One of the other devs says to me "that doesn't look right, isn't it supposed to be 'BM'?" Anyway it does seem to work ok, but on code review it gets refactored to this:
bfh.bfType=*(WORD*)"BM";
A google search indicates that most of the time, the first line seems to be used, while some of the time people will do this:
bfh.bfType=0x4D42;
So what is the difference? How can they all give the correct result? What does the multi-byte character constant mean anyway? Are they the same really?
All three are (probably) equivalent, but for different reasons.
bfh.bfType=0x4D42;
This is the simplest to understand, it just loads bfType with a number that happens to represent ASCII 'M' in bits 8-15 and ASCII 'B' in bits 0-7. If you write this to a stream in little-endian format, then the stream will contain 'B', 'M'.
bfh.bfType='MB';
This is essentially equivalent to the first statement -- it's just a different way of expressing an integer constant. It probably depends on the compiler exactly what it does with it, but it will probably generate a value according to the endian-ness of the machine you compile on. If you compile and execute on a machine of the same endian-ness, then when you write the value out on the stream you should get 'B', 'M'.
bfh.bfType=*(WORD*)"BM";
Here, the "BM" causes the compiler to create a block of data that looks like 'B', 'M', '\0' and get a char* pointing to it. This is then cast to WORD* so that when it's dereferenced it will read the memory as a WORD. Hence it reads the 'B', 'M' into bfType in whatever endian-ness the machine has. Writing it out using the same endian-ness will obviously put 'B', 'M' on your stream. So long as you only use bfType to write out to the stream this is the most portable version. However, if you're doing any comparisons/etc with bfType then it's probably best to pick an endian-ness for it and convert as necessary when reading or writing the value.
I did not find the API, but according to http://cboard.cprogramming.com/showthread.php?t=24453, bfType is a field of the bitmap header. A value of 'BM' would most likely mean "bitmap".
0x4D42 is a hexadecimal value (0x4D for 'M' and 0x42 for 'B'). In the little-endian way of writing (least significant byte first), that would be the same as "BM" (not "MB"). If it also works with 'MB' then probably some default value is taken.
Addendum to tehvan's post:
From Wikipedia's entry on BMP:
File header
Note that the first two bytes of the BMP file format (thus the BMP header) are stored in big-endian order. This is the magic number 'BM'. All of the other integer values are stored in little-endian format (i.e. least-significant byte first).
So it looks like the refactored code is correct according to the specification.
Have you tried opening the file with 'MB' as the magic number with a few different photo-editors?