How does compiling C++ code produce machine code? - c++

I'm studying C++ using the website learncpp.com. Chapter 0.5 states that the purpose of a compiler is to translate human-readable source code to machine-readable machine code, consisting of 1's and 0's.
I've written a short hello-world program and used g++ hello-world.cpp to compile it (I'm using macOS). The result is a.out. It does print "Hello World" just fine; however, when I try to look at a.out in vim/less/Atom/..., I don't see 1's and 0's, but rather a lot of this:
H�E�H��X�����H�E�H�}���H��X���H9��
Why are the contents of a.out not just 1's and 0's, as would be expected from machine code?

They are binary bits (1s and 0s) but whatever piece of software you are using to view the file's contents is trying to read them as human readable characters, not as machine code.
If you think about it, everything that you open in a text editor is made up of binary bits stored on the hardware. Those 1s and 0s can be interpreted in many different ways, and most text editors will attempt to read them in as characters. Take the character 'A', for example. Its ASCII code is 65, which is 01000001 in binary. When a text editor reads through the file on your computer, it is processing those bits as characters rather than machine instructions, so when it reads in 8 bits (a byte) in the pattern 01000001 it knows it has just read an 'A'.
This process results in the jumble of symbols you see when you open the executable. While some of the content happens to fall into patterns that make human-readable characters, most of it will be outside what the character encoding considers valid or knows how to print, resulting in the '�' characters you see.
I won't go into the intricacies of how character encodings work here, but read Character Encodings for Beginners for a bit more info.
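To see this for yourself, you can dump the same bytes two ways. Below is a minimal sketch (not part of the original answer): it reads the first 16 bytes of a file, prints them as hex, and then prints what a text editor would try to show for each byte. The default file name a.out is just an assumption; pass any path on the command line.

#include <cctype>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <vector>

int main(int argc, char* argv[]) {
    const char* path = (argc > 1) ? argv[1] : "a.out";   // assumed default name
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        std::fprintf(stderr, "cannot open %s\n", path);
        return 1;
    }

    std::vector<unsigned char> buf(16);
    in.read(reinterpret_cast<char*>(buf.data()), buf.size());
    std::size_t n = static_cast<std::size_t>(in.gcount());

    for (std::size_t i = 0; i < n; ++i)
        std::printf("%02x ", static_cast<unsigned>(buf[i]));          // the raw bits, written as hex
    std::printf("\n");

    for (std::size_t i = 0; i < n; ++i)
        std::printf(" %c ", std::isprint(buf[i]) ? buf[i] : '.');     // roughly what a text editor shows
    std::printf("\n");
}

The first line is the same data a hex editor would show you; the second line is the "text" interpretation that produces the jumble in the question.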

Related

How to keep characters in C++ from combining when outputted to text file

I have a fairly simple program with a vector of characters which is then outputted to a .txt file.
ofstream op ("output.txt");
vector <char> outp;
for (int i = 0; i < outp.size(); i++) {
    op << outp[i];   // the final output of this is incorrect
    cout << outp[i]; // this output is correct
}
op.close();
The text that is output by cout is correct, but when I open the text file that was created, the output is wrong, with what look like Chinese characters that shouldn't have been an option for the program to output. For example, when the program should output:
O dsof
And cout prints the right output, the .txt file has this:
O獤景
I have even tried adding the characters into a string before outputting it, but it doesn't help. My best guess is that the characters are combining and getting a different value in Unicode or ASCII, but I don't know enough about character codes to know for sure or how to stop this from happening. Is there a way to correct the output so that it doesn't do this? I am currently using a Windows 8.1 computer with Code::Blocks 12.11 and the GNU GCC compiler, in case that helps.
Some text editors try to guess the encoding of a file and occasionally get it wrong. This can particularly happen with very small amounts of text, because whatever statistical analysis is being used just doesn't have enough data to reach a good conclusion. Windows Notepad has/had an infamous example with the text "Bush hid the facts".
More advanced text editors (for example Notepad++) may either not experience the same problem or may give you options to change what encoding is being assumed. You could use such to verify that the contents of the file are actually correct.
Hex editors/viewers are another way, since they allow you to examine the raw bytes of the file without interpretation. For instance, HxD is a hex editor that I have used in the past.
Alternatively, you can simply output more text. The more there is, the less likely the editor is to guess wrong. In my experience, newlines are particularly helpful in convincing the text editor to assume the correct encoding.
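For reference, here is a minimal self-contained sketch of that suggestion (the vector contents are made up here, since the original snippet never fills outp). The bytes written are plain ASCII, and the trailing newline gives the editor's encoding guesser a little more to work with:

#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    std::vector<char> outp = {'O', ' ', 'd', 's', 'o', 'f'};   // assumed sample data

    std::ofstream op("output.txt");
    for (std::size_t i = 0; i < outp.size(); ++i) {
        op << outp[i];
        std::cout << outp[i];
    }
    op << '\n';          // ending with a newline helps some editors settle on ASCII/UTF-8
    std::cout << '\n';
}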
There is nothing wrong with your code.
Your text editor is probably guessing the wrong default encoding for the file.
Open it in a more capable editor (or tell your editor which encoding to use) and you will see the right output.

Add/edit string in compiled C program?

I have a strange question: I am wondering if there is a way to add/edit a string (or something else that can be accessed from inside the C program, i.e. not an external file) after it has been compiled?
The purpose is to change a URL in a Windows program via PHP on Linux (obviously I cannot just recompile it there).
Many POSIX platforms come with the program strings, which will read through a binary file searching for strings. There is an option to print out the offset of each string it finds. For example:
strings -td myexec
From there you can use a hex editor but the main problem is that you wouldn't be able to make a string bigger than it already is.
A Hex Editor is probably your best bet.
A hex editor will work, but you have to be careful not to alter the size of the executable. If the string happens to be in the .res file, you can use ResEdit.
There are specialized tools to modify existing executable files. A notable one is Resource Tuner, which can be used to edit all sorts of resources in an executable.
Another option is to use a hex editor, such as Hex Workshop, to edit the characters in the strings of an executable. However, bear in mind that with this method you can only edit existing strings in an executable, and the replacement strings must have an equal or smaller length than the originals, otherwise you'll end up modifying executable code.
As others have suggested, you can use a binary file editor (hex editor) to change the string in the executable file. You will want to embed into the string a marker (unique sequence of bytes) so that you can find the string in your file. And you will want to ensure that you are reading/writing the file at correct offsets.
As the OP plans to use PHP on Linux to rewrite the file, you will need to use fseek to position the file pointer at the starting offset of this URL string, ensure you stay within the size of the string as you replace bytes, and then use fseek/rewind and fwrite to change the file.
This technique can be used to change a URL embedded in a binary file, and it can also be used to embed a license key into a binary, or to embed an application checksum value into a binary so that one can detect when the binary has changed.
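As an illustration only, here is a hedged C++ sketch of that in-place technique (the discussion above is about doing the same from PHP). The marker "@@URL@@", the file name myprog.exe, and the 128-byte capacity are all made-up assumptions; the patched program would have to reserve that space, e.g. declare char url[128] = "@@URL@@http://old.example.com/";.

#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    const std::string marker = "@@URL@@";                       // hypothetical marker baked into the binary
    const std::string::size_type capacity = 128 - marker.size(); // assumed space reserved after the marker
    const std::string newUrl = "http://new.example.com/";

    std::fstream f("myprog.exe", std::ios::in | std::ios::out | std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(f)), std::istreambuf_iterator<char>());

    std::string::size_type pos = data.find(marker);
    if (pos == std::string::npos) { std::cerr << "marker not found\n"; return 1; }
    if (newUrl.size() + 1 > capacity) { std::cerr << "new URL too long\n"; return 1; }

    std::string patch = newUrl;
    patch.resize(capacity, '\0');                               // NUL-pad so the file size never changes

    f.clear();                                                  // clear the EOF state left by the read above
    f.seekp(static_cast<std::streamoff>(pos + marker.size()));
    f.write(patch.data(), static_cast<std::streamsize>(patch.size()));
}

The same find-the-offset-then-overwrite logic maps directly onto PHP's fopen/fseek/fwrite.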
As some posters have suggested, you may need to recompute a checksum or re-sign a binary file. A quick way to check for this behavior would be to compile two versions of your binary with different URL values. Then compare the files and see if there are differences other than in the URL values.
To properly edit a string in a compiled program you need to:
Read in the file's bytes.
Search the .rdata section for strings and record the address of the first occurrence of the string.
Convert that address to a virtual address using some of the data in the file header.
Write a new .rdata section onto the executable, write your new string into it, record its address and compute its virtual address.
Search the .text section for references to the virtual address of the old string and replace them with references to your new string.
Fortunately I made a program to do this on Windows; it only works on 32-bit programs (here).
Not unless you want to poke around in the generated hex or assembly code.

How to read output of hexdump of a file?

I wrote a program in C++ that compresses a file.
Now I want to see the contents of the compressed file.
I used hexdump, but I don't know what the hex numbers mean.
For example I have:
0000000 00f8
0000001
How can I convert that back to something that I can compare with the original file contents?
If you implemented a well-known compression algorithm you should be able to find a tool that performs the same kind of compression and compare its results with yours. Otherwise you need to implement an uncompressor for your format and check that the result of compressing and then uncompressing is identical to your original data.
That looks like a file containing the single byte 0xf8. I say that since it appears to have the same behaviour as od under UNIX-like operating systems, with the last line containing the length and the contents padded to a word boundary (you can use od -t x1 to get rid of the padding, assuming your od is advanced enough).
As to how to recreate it, you need to run it through a decompression process that matches the compression used.
Given that the compressed file is that short, you either started with a very small file, your compression process is broken, or it's incredibly efficient.
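If hexdump's default 16-bit grouping is what's confusing, a byte-by-byte dump is easy to write yourself. This is just a sketch (it assumes the file name is passed on the command line); for a one-byte file containing 0xf8 it prints "0000000 f8" followed by the final offset, much like od -t x1 (offsets here are decimal).

#include <cstdio>
#include <fstream>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    std::ifstream in(argv[1], std::ios::binary);

    long offset = 0;
    char ch;
    while (in.get(ch)) {
        // one byte per line: the offset, then the byte value in hex
        std::printf("%07ld %02x\n", offset, static_cast<unsigned>(static_cast<unsigned char>(ch)));
        ++offset;
    }
    std::printf("%07ld\n", offset);   // final offset = file length, as od/hexdump also print
}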

Understanding the concept of 'File encoding'

I have already gone through some stuff on the web and SOF explaining 'file encoding' but I still have questions. A file is a group of related records, and on disk its contents are just stored as '1's and '0's. Every time a running program wants to read in a file or write to the file, the file is brought into RAM and put into the address space of the running program (aka process). Now what determines how the bits (or bytes) in the file should be decoded/encoded and read and displayed/written?
There is one explanation on SOF which reads 'At the storage level, a file contains an array of bytes. On top of this you have the encoding layer for text files. The format layer comes last, on top of the encoding layer for text files or on top of the array of bytes for all the other binary files'. I am sort of fine with this but would like to know if it is 100% correct.
The question basically came up when understanding file opening modes in C++.
I think the description of the order of the layers is confusing here. I would consider formats and encodings to be related but not tied together so tightly. Let's try to define it formally.
A file is a contiguous sequence of bytes. A byte is a contiguous sequence of bits.
A symbol is a unit of data. Bytes are one kind of symbol. There are other symbols that are not bytes. Consider the number 6 - it is a symbol but not a byte. It can however be encoded as a byte, commonly as 00000110 (this is the two's complement encoding of 6).
An encoding maps a set of symbols to another set of symbols. Most commonly, it maps from a set of non-byte symbols to bytes, which when applied to an entire file makes it a file encoding. Two's complement gives a representation of the numeric values. On the other hand, ASCII, for example, gives a representation of the Latin alphabet and related characters in bytes. If you take ASCII and apply it to a string of text, say "Hello, World!", you get a sequence of bytes. If you store this sequence of bytes as a file, you have a file encoded as ASCII.
A format describes a set of valid sequences of symbols. When applied to the bytes of a file, it is a file format. An example is the BMP file format for storing raster graphics. It specifies that there must be a few bytes at the beginning that identify the file format as BMP, followed by a few bytes to describe the size and depth of the image, and so on. An example of a format that is not a file format would be how we write decimal numbers in English. The basic format is a sequence of numerical characters followed by an optional decimal point with more numerical characters.
Text Files
A text file is a kind of file that has a very simple format. Its format is very simple because it has no structure. It immediately begins with some encoding of a character and ends with the encoding of the final character. There's usually no header or footer or metadata or anything like that. You just start interpreting the bytes as characters right from the beginning.
But how do you interpret the characters in the file? That's where the encoding comes in. If the file is encoded as ASCII, the byte 01000001 represents the Latin letter A. There are much more complicated encodings, such as UTF-8. In UTF-8, a character cannot necessarily be represented in a single byte. Some can, some can't. You determine the number of bytes to interpret as a character from the first few bits of the first byte.
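To make the "first few bits" rule concrete, here is a short sketch (this is standard UTF-8, not something specific to the answer above) that works out the sequence length from the leading byte:

#include <cstdio>

// How many bytes a UTF-8 sequence occupies, judged from its first byte.
int utf8_length(unsigned char first) {
    if ((first & 0x80) == 0x00) return 1;   // 0xxxxxxx: a plain ASCII character
    if ((first & 0xE0) == 0xC0) return 2;   // 110xxxxx: start of a two-byte sequence
    if ((first & 0xF0) == 0xE0) return 3;   // 1110xxxx: start of a three-byte sequence
    if ((first & 0xF8) == 0xF0) return 4;   // 11110xxx: start of a four-byte sequence
    return -1;                              // continuation byte or invalid lead byte
}

int main() {
    std::printf("0x41 ('A'): %d byte(s)\n", utf8_length(0x41));                                  // 1
    std::printf("0xC3 (lead byte of many accented Latin letters): %d byte(s)\n", utf8_length(0xC3)); // 2
    std::printf("0xE6 (lead byte of many CJK characters): %d byte(s)\n", utf8_length(0xE6));         // 3
}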
When you open a file in your favourite text editor, how does it know how to interpret the bytes? Well that's an interesting problem. The text editor has to determine the encoding of the file. It can attempt to do this in many ways. Sometimes the file name gives a hint through its extension (.txt is likely to be at least ASCII compatible). Sometimes the first character of the file gives a good hint as to what the encoding is. Most text editors will, however, give you the option to specify which encoding to treat the file as.
A text file can have a format. Often the format is entirely independent of the encoding of the text. That is, the format doesn't describe the valid sequences of bytes at all. It instead describes the valid sequences of characters. For example, HTML is a format for text files for marking up documents. It describes the sequences of characters that determine the contents of a document (note: not the sequence of bytes). As an example, it says that the sequence of characters <html> are an opening tag and must be followed at some point by the closing tag </html>. Of course, the format is much more detailed than this.
Binary file
A binary file is a file with meaning determined by its file format. The file format describes the valid sequences of bytes within the file and the meaning that that sequence has. It is not some interpretation of the bytes that matters at the file format level - it is the order and arrangement of bytes.
As described above, the BMP file format gives a way of storing raster graphics. It says that the first two bytes must be 01000010 01001101, the next four bytes must give the size of the file as a count of the number of bytes, and so on, leading up to the actual pixel data.
A binary file can have encodings within it. To illustrate this, consider the previous example. I said that the four bytes following the first two in a BMP file give the size of the file in bytes. How are those bytes interpreted? The BMP file format states that those bytes give the size as an unsigned integer. This is the encoding of those bytes.
So when you browse the directories on your computer for a BMP file and open it, how does your system know how to open it? How does it know which program to use to view it? The format of a binary file is much more strongly hinted by the file extension than the encoding of a text file. If the filename has .bmp at the end, your system will likely consider it to be a BMP file and just open it in whatever graphics program you have. It may also look at the first few bytes and see what they suggest.
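As a small illustration of the format/encoding split described above, this sketch reads just those first six bytes of a BMP file (the file name image.bmp is a placeholder): the format check is the two magic bytes, and the encoding is the unsigned little-endian integer that follows.

#include <cstdint>
#include <cstdio>
#include <fstream>

int main() {
    std::ifstream in("image.bmp", std::ios::binary);    // placeholder file name
    unsigned char header[6];
    if (!in.read(reinterpret_cast<char*>(header), sizeof header)) {
        std::fprintf(stderr, "could not read header\n");
        return 1;
    }

    if (header[0] != 'B' || header[1] != 'M') {          // format: the bytes 01000010 01001101
        std::fprintf(stderr, "not a BMP file\n");
        return 1;
    }

    // encoding of bytes 2..5: an unsigned 32-bit integer, least significant byte first
    std::uint32_t size = std::uint32_t(header[2])
                       | (std::uint32_t(header[3]) << 8)
                       | (std::uint32_t(header[4]) << 16)
                       | (std::uint32_t(header[5]) << 24);
    std::printf("declared file size: %u bytes\n", static_cast<unsigned>(size));
}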
Summary
The first level of understanding the meaning of bytes in a file is that file's format. A text file has an extremely simple format - start at the beginning, interpreting characters until you reach the end. How you interpret the characters depends on that text file's character encoding. Most formats are more complicated, however, and will likely have encodings nested within them. At some level you have to start extracting abstract information from your bytes and that's where the encodings kick in. But then whatever is being encoded can also have a format that is applied to it. You have a chain of formats and encodings until you get the information that you want.
Let's see if this helps...
A Unix file is just an array of bits (1/0); the current minimum number of bits in a file is 8, i.e. 1 byte, and all file interaction is done at no less than the byte level. On most systems nowadays you don't really have to concern yourself with the maximum size of a file. There are still some small variances between operating systems, but very few, if any, have a maximum file size of less than 1 GB.
The encoding or format of a file is only dependent on the applications that use it.
There are many common file formats, such as 'Unix ASCII text' and PDF. Most of the files you will come across will have a documented format specification somewhere on the net. For example, the specification of a 'Unix ASCII text file' is:
A collection of ASCII characters where each line is terminated by an end-of-line character. The end-of-line character is written in C++ as std::endl or the quoted "\n". Unix specifies this character as the binary value 012 (octal), i.e. 00001010.
Hope this helps :)
The determination of how to encode/display something is entirely up to the designer of the program. Of course, there are standards for certain types of files - a PDF or JPG file has a standard format for its content. The definition of both PDF and JPG is quite complex.
Text files have at least somewhat of a standard - but how to interpret or use the contents of a text-file may be just as complex and confusing as JPEG - the only difference is that the content is (some sort of) text, so you can load it up in a text editor and try to make sense of it. But see below for an example line of "text in a database type application".
In C and C++, there is essentially just one distinction: files are either "binary" or "text" ("not-binary"). The difference is about the treatment of "special bytes", mostly to do with "endings" - a text file will contain end-of-line markers, or newlines ('\n') [more in a bit about newlines], and in some operating systems may also contain "end-of-file marker(s)". For example, in old CP/M the file was sized in blocks of 128 or 256 bytes, so if we had "Hello, World!\n" in a text file, that file would be 128 bytes long and the remaining 114 bytes would be "end-of-file" markers. Most modern operating systems track file size in bytes, so there's no need to have an end-of-file marker in the file, but C supports many operating systems, both new and old, so the language has an allowance for this. The end-of-file character is typically CTRL-Z (DOS, Windows, etc.) or CTRL-D (Unix - Linux, etc.). When the C runtime library hits the end-of-file character, it will stop reading and give the same error code/behaviour as if "there is no more file to read here".
Line endings or newlines need special treatment because they are not always the same on the OS that the file lives on. For example, Windows and DOS use "Carriage Return, Line Feed" (CR, LF - CTRL-M, CTRL-J, ASCII 13 and 10 respectively) as the end of line. In the various forms of Unix (Linux, MacOS X and BSD, for example), the line ending is "Line Feed" (LF, CTRL-J) alone. In older MacOS, the line ending is ONLY "Carriage Return". So that you as a programmer don't have to worry about exactly how lines end, the C runtime library will translate the "native" line ending to a standardized line ending of '\n' (which translates to "Line Feed" or character value 10). Of course, this means that the C runtime library needs to know that "if there is a CR followed by LF, we should just give out an LF character."
For binary files, we really DO NOT want any translation of the data, just because our pixels happen to be the values 13 and 10 next to each other, doesn't mean we want it merged to a single 10 byte, right? And if the code reads a byte of the value 26 (CTRL-Z) or 4 (CTRL-D), we certainly don't want the input to stop there...
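A quick way to see that text/binary distinction for yourself is the little sketch below: on Windows the first file ends up with the two bytes CR LF (13, 10) wherever '\n' was written, while the second keeps exactly the bytes given; on Unix-like systems the two files come out identical.

#include <fstream>

int main() {
    std::ofstream text("text.txt");                      // text mode (the default): '\n' may be translated
    text << "Hello, World!\n";

    std::ofstream bin("binary.dat", std::ios::binary);   // binary mode: no newline translation
    bin << "Hello, World!\n";
}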
Now, if I have a database text file that contains:
10 01353-897617 14000 Mats
You probably have very little idea what that means - I mean you can probably figure out that "Mats" is my name - but it could also be those little cardboard things to go under glasses (aka "Beer-mats") or something to go on the floor, e.g. "Prayer Mats" for Muslims.
The number 10 could be a customer number, article number, "row number" or something like that. 01353-897617 could be just about anything - perhaps my telephone number [no it isn't, but it does indeed resemble it] - but it could also be a "manufacturer's part number" or some form of serial number or some such. 14000? Price per item, number of units in stock, my salary [I hope not!], distance in miles from my address to Sydney in Australia [roughly, I think].
I'm sure someone else, not given anything else could come up with hundreds of other answers.
[The truth is that it's just made up nonsense for the purpose of this answer, except for the bit at the beginning of the "phone number", which is a valid UK area code - the point is to explain that "the meaning of a set of fields in a text-file can only be understood if there is something describing the meaning of the fields"]
Of course the same applies to binary files, except that it's often even harder to figure out what the content is, because of the lack of separators - if you didn't have spaces and dashes in the text above, it would be much harder to know what belongs where, right? There are typically no 'spaces' and other such things in a binary file. It's all down to someone's description or definition in some code somewhere, or something like that.
I hope my ramblings here have given you some idea.
Now what determines how the bits (or bytes) in the file should be decoded/encoded and read and displayed/written?
The format of the file, obviously. If you are reading a BMP file, you have to first read the header, then height*width pixel data. If you are reading .txt, just read the characters as-is. Text files can have different encodings, such as Unicode.
Some formats, like .png, are compressed, meaning that their raw data takes more space in memory than the file does on disk.
The particular algorithm is chosen depending on various factors. On Windows, it's usually the file extension that matters. On the web, the content type is dominant.
In general, if you try to read a file in the wrong format, you will usually get garbage. Sometimes that can be forced: try opening a .bmp file in your text editor, for example.
So basically we're talking about text files mainly, right?
Now to the point: when your text editor loads the file into memory, from some information it deduces its file encoding (either you tell it or it has a special file format marker among the first few bytes of the file, or whatever). Then it's the program itself that decides how it treats the raw bytes.
For example, if you tell your text editor to open a file as ASCII, it will treat each byte as an individual character, and it will display the character A whenever encounters the number 65 as the current byte to show, etc (because 65 is the ASCII character code for A).
However, if you tell it to open your file as UTF-16, then it will grab two bytes (well, more precisely, two octets) at a time, use this so-called "word" as the numeric value to be looked up, and it will, for example, display a ç character when the two bytes it read corresponded to 231, the Unicode character code of ç.
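A tiny sketch of those two readings of the very same bytes (the byte values are just an example, and little-endian UTF-16 is assumed):

#include <cstdio>

int main() {
    unsigned char bytes[2] = {0xE7, 0x00};   // an example pair of octets

    // "one character per byte" reading: 0xE7 isn't even a valid ASCII code
    std::printf("as single bytes: %u and %u\n",
                static_cast<unsigned>(bytes[0]), static_cast<unsigned>(bytes[1]));

    // "UTF-16LE" reading: the two octets form one 16-bit code unit
    unsigned code_unit = static_cast<unsigned>(bytes[0])
                       | (static_cast<unsigned>(bytes[1]) << 8);
    std::printf("as one 16-bit unit: %u (that is U+00E7, the letter ç)\n", code_unit);
}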

Encoding issue using XZIP

I wrote a C++ program that needs to zip files as part of its work. For creating these zip files I used the XZip library. During development this program ran on a Win7 machine, and there it works fine.
Now the program should be used on a Windows XP machine. The issue I run into is:
If I let XZip create the zip archive "ü.zip" and add the file "ü.txt" to it on Win7, it works as intended. On Windows XP, however, I end up with the "ü.zip" file containing "³.txt" as the file in it.
The "³" => "ü" thing is of course an encoding issue between UTF8 and Ascii (ü = 252 in UTF8 and 252 = ³ in Ascii) BUT I can't really imagine how this could affect the creating of the internal zip structure in different ways depending on the OS.
//EDIT to clear it up:
the problem is that I run a test with XZip on Win7 and get the archive "ü.zip" containing the file with name "ü.txt".
When I run that test on an XP machine I get the archive "ü.zip" containing the file "³.txt".
//Edit2:
The thing that makes me wonder about that is, what exactly causes the zip to change between XP and Win7. The fact that it does change means that either a windows function behaves differently or XZip has specific behavior for different OS built in.
Having a quick look at XZip, I can't see that it changes the encoding flag on the zip archives. The question of course can only be answered by people who have had a closer look into this exact problem before.
As a general rule, if you want any sort of portability between locales, OSs (including different versions) and what have you, you should limit your filenames to the usual 26 letters, the 10 digits, and perhaps '_' and '-' (and I'm not even sure about the latter), and one '.', no more than three characters from the end. Once you start using letters beyond the original ASCII character set, you're at the mercy of the various programs which interpret the character set.
Also, 252 isn't anything in ASCII, since ASCII only uses character codes in the range 0...127. And in UTF-8, 252 would be the first byte of a six-byte character, something that doesn't exist in Unicode; in UTF-8, LATIN SMALL LETTER U WITH DIAERESIS would be the two-byte sequence 0xC3, 0xBC. 252 is the encoding of LATIN SMALL LETTER U WITH DIAERESIS in ISO 8859-1, otherwise known as Latin-1; it's also the code unit value in UTF-16 and UTF-32.
None of this, of course, should affect what is in the file.
Maybe you are building your Win32 program (or the library) as ASCII (not as UNICODE). It may help if you build your Win32 applications with the UNICODE configuration setting (you can change it in your Visual Studio project settings).
It is impossible to say what happened in your program without seeing your code. Maybe your library or the archive format is not Unicode-aware, maybe your program's code is not Unicode-aware, maybe you don't handle strings carefully enough, or maybe you just have to change your project setting to UNICODE. Also, your "8-bit encoding for non-Unicode programs" Windows OS setting matters if you don't use UNICODE strings.
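If the culprit turns out to be an ANSI/Unicode mismatch around the file name, one hedged option (Windows-only, and it only helps if the library accepts wide-character names, which is an assumption here) is to convert the UTF-8 name explicitly instead of relying on the "encoding for non-Unicode programs" setting. A minimal sketch:

#include <windows.h>
#include <string>

// Convert a UTF-8 string to the wide (UTF-16) form used by the UNICODE Win32 APIs.
std::wstring utf8_to_wide(const std::string& utf8) {
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    if (len <= 0) return std::wstring();
    std::wstring wide(len, L'\0');                         // includes room for the terminator
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], len);
    wide.resize(len - 1);                                  // drop the embedded terminator
    return wide;
}

int main() {
    std::wstring name = utf8_to_wide("\xC3\xBC.txt");      // the UTF-8 bytes for "ü.txt"
    // pass 'name' to a wide-character (W) API or to a Unicode-aware archive call
}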
As for 252, UTF-8 and ASCII, read the post by James Kanze. It is more or less safe to use ASCII file names with no ':', '?', '*', '/', '\' characters. Using non-ASCII characters may lead to encoding problems if you are not using Unicode-based programs and file systems.