(Lower level of C++) When using "cout" on a piece of data, where does it go before being displayed on screen? - c++

Specifically talking about the C++ part of the code here: [LINK]
(intel x86, .cpp & .asm hybrid program.)
From dealing with chars/strings' pointers in .asm I know it uses dl/dx registers for their storage-before-display (in case of 2h and 9h functions).
How is it in the case when the data (specifically, a floating-point value) gets sent to C++ portion of the hybrid, and then is treated with cout?
Where is that value stored before the cout converts it into a string to be displayed? (Is it a register, or a stack, or something else?)

The lower level stuff of C++ is platform dependent. For example, reading a character from the keyboard. Some platforms don't have keyboards. Some platforms send messages when a character arrives, others wait (poll the input port).
Let's talk one level down from the high level language.
For cin, the underlying level reads characters from the input buffer. If the buffer is empty, the underlying layer reads characters from the standard input and stores them into a buffer until an end-of-line character is detected.
Note: there are methods to bypass this layer, still using C++.
In many OS based platforms, the C++ libraries eventually call an OS function to fetch a single character. In Linux, the OS delegates this request to a driver. The driver has the responsibility of reading the character from the hardware and returning it. The driver is the piece of code that gets the character from the keyboard.
There are exceptions to this path, for example piping. With piping, the OS redirects the requests from standard input to a file or device, depending on command line.
Where is that value stored before the cout converts it into a string to be displayed? (Is it a register, or a stack, or something else?)
The compiler calls a function that converts the internal representation of a floating point variable into a textual representation. This textual representation is sent to the underlying cout function, character by character; or as a pointer to a string. The textual representation can reside almost anywhere: stack, heap, cache, etc. It really doesn't make a difference. Most processor registers are too small to contain all the characters in a textual representation of a floating point number.
The floating point value may be stored in a register, on the stack, or elsewhere before being passed to the conversion function. It depends on the optimization level of the compiler and the API of the conversion function. The compiler will try to use the most efficient storage available.
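As a rough illustration of the conversion step described above, here is a minimal sketch. The helper name format_double is hypothetical, and it uses std::snprintf rather than whatever routine a particular standard library calls inside operator<<. It only shows that the textual form ends up in a memory buffer (here on the stack) while the double itself is passed in however the calling convention dictates.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: turn a double into text by formatting it into a
// small stack buffer, then copying the characters into a std::string.
std::string format_double(double value) {
    char buffer[64];  // the textual form lives in memory, not in a register
    const int written = std::snprintf(buffer, sizeof buffer, "%g", value);
    // snprintf returns the number of characters it wanted to write.
    assert(written > 0 && written < static_cast<int>(sizeof buffer));
    return std::string(buffer, static_cast<std::size_t>(written));
}
```

Calling format_double(3.5) yields the string "3.5", which could then be handed to the output layer as a pointer plus a length.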

Related

C++ Question about memory pretty basic but this is confusing me

Not really too c++ related I guess but say I have a signed int
int a =50;
This sets aside something like 32 bits of memory for it, right? It gets some bit pattern and a memory address. My question is this: we created the variable, but the computer itself doesn't know what the type is. It just sees a bit pattern and a memory address; it doesn't know this is an int. How does that work? Does the computer know that those 4 bytes all belong to a? And how does the computer not know the type? It set aside 4 bytes for one variable. I get that this doesn't automatically make it an int, but does the computer really know nothing? The value was the number 50, which gets turned into binary and stored as the bit pattern, so how can the computer know nothing?
Type information is used by the compiler. It knows the size in bytes of each type and will create an executable that at runtime will correctly access memory for each variable.
C++ is a compiled language and is not run directly by the computer. The computer itself runs machine code. Since machine code is all binary, we often look at assembly language instead. This language uses human readable symbols to represent the machine code and corresponds to it on a line by line basis.
When the compiler sees int a = 50; it might generate (depending on architecture and compiler):
mov r1 #50 // move the literal 50 into the register r1
Registers are small storage locations inside the CPU that it uses to hold and manipulate values. In the above statement the compiler will remember that whenever it translates a statement that uses a into machine code, it needs to fetch the value from r1.
In the above case the compiler decided to map a to a register; it may well use memory instead.
The type itself is not translated into machine code. Rather, it determines which assembly operations subsequent C++ statements are translated into. The CPU itself does not understand types. Size is used when loading and storing values between memory and registers: with a char only 1 byte is read, with a short 2, and with an int typically 4.
When it comes to signedness, the CPU has different instructions for signed and unsigned comparisons.
Lastly, floats either have to be simulated using integer maths or handled with special floating point instructions.
So once translated into assembler, there is no easy way to know what the original C++ code was.
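To make the point about types concrete, the following sketch (helper names are made up for illustration) copies the same 32 bits between a signed and an unsigned variable and compares two bit patterns under both interpretations. The bits never change; only the operation the compiler selects does.

```cpp
#include <cassert>   // used only by the example checks in the note below
#include <cstdint>
#include <cstring>

// Copy the raw 32 bits of a signed value into an unsigned one.
std::uint32_t raw_bits(std::int32_t value) {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Compare two bit patterns as unsigned numbers.
bool less_as_unsigned(std::uint32_t a, std::uint32_t b) {
    return a < b;
}

// Compare the very same bit patterns as signed numbers.
bool less_as_signed(std::uint32_t a, std::uint32_t b) {
    std::int32_t sa = 0;
    std::int32_t sb = 0;
    std::memcpy(&sa, &a, sizeof sa);
    std::memcpy(&sb, &b, sizeof sb);
    return sa < sb;
}
```

For example, assert(raw_bits(50) == 0x32u) holds, and the pattern 0xFFFFFFFF is less than 1 when treated as signed (it is -1) but not when treated as unsigned.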

What are common values for uninitialized memory for debugging?

A long time ago I learned about filling unused / uninitialized memory with 0xDEADBEEF so that in a debugger or a crash report if I ever see that value I know I'm looking at uninitialized memory. I saw from a crash report iOS uses 0xBBADBEEF.
What other creative values have people used? Do any particular values have any kind of specific benefit?
The most obvious benefit of values that turn into words is that, at least for most people, words in their own language stick out easily, whereas a strictly numeric value is less likely to.
But maybe there are other reasons to pick particular numbers? For example, an odd number might crash some processors (such as the 68000) on certain memory accesses, so it's probably better to pick 0x0BADBEEF over 0xBADBEEF0. Are there any other values (maybe processor specific) that have a concrete benefit when used for uninitialized memory?
Generally speaking, you want a value which is unlikely to happen to "work" when interpreted as either an integer, a pointer, or a string. So, here are a few constraints:
Don't use a value that's a multiple of the smallest "usual" alignment on your target architecture. For x86, that's 4 (bytes), so no values that are divisible by 4. This ensures that if the value is interpreted as a pointer, it'll be obviously-incorrect. If you're on a non-x86 architecture, you might even be able to use a value that will cause an alignment trap if used as a pointer.
Don't use a value which could reasonably be a small (positive or negative) integer. Your typical "int" variable in a C program rarely gets larger than 1,000 or so, so don't use small numbers as your empty data fill.
Don't use a value which is composed entirely of valid ASCII characters. Make sure there's at least one byte in there with the high bit set. These days, you'd want to make sure they weren't valid UTF-8 or possibly UTF-16 values, either.
Don't have any zero bytes in the value. There are too many cases where this would work out to be "helpful" to keeping the program from crashing - terminating a string, giving a non-int field a reasonable-looking value, etc.
Don't use a single (or two) byte values, repeated over and over. Having a full-word length pattern can make it easier to determine how your wild pointer ended up pointing where it is, at least narrowing down which operations offset it from the start of the pattern.
Don't use a value that maps to a valid address for a "typical" process. If the highest bits are set, it'll typically take a whole lot of malloc() before your process grows large enough to make that a valid address.
Perhaps unsurprisingly, patterns like 0xDEADBEEF meet basically all of these requirements.
One technical term for values like this is "poison value".
Hex numbers that form English words are called Hexspeak. Wikipedia's Hexspeak article pretty much answers this question, cataloguing many known constants in use for various things, including several that are used as poison values / canaries / sanity checks, as well as other uses like error codes or IPv6 addresses.
I seem to recall some variation of 0xBADF00D. (maybe with a repeated letter like your 2nd example).
There's also 0xDEADC0DE. (Googling for where I've seen this used found the wikipedia article linked above).
Other English words in hex I've seen: Java .class files use 0xCAFEBABE as the magic number (first 4 bytes of the file). As a play on this, I guess, the Jikes JVM uses 0xDEADBABE as a sanity check constant.
Apparently Java wasn't the first user of 0xCAFEBABE. Wikipedia says "It was originally created by NeXTSTEP developers as a reference to the baristas at Peet's Coffee & Tea", and was used by the people developing Java before they thought of the name "Java". So it didn't come out of Java -> coffee (if anything the other way around), it's just plain old non-feminist tech culture. :(
re: update: Choosing a good value. For a poison value (not an error code), you want all the bytes to be different and not 0x00 or 0xFF, since those are probably the most likely values for an errant single-byte store. This applies especially for things like stack canaries (to detect buffer overruns), or other cases where detecting that it didn't get overwritten is important.
Your speculation about picking an odd value makes a lot of sense. Not being a valid memory address in the virtual memory layout of typical processes is a big advantage. Failing noisily as early as possible is optimal for debugging. Anyway, this probably means that having the high bit set is a good idea, so 0x0... is probably not a good idea.
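The rules of thumb from the answers above can be turned into a small checker. This is only a sketch: the function name and the exact thresholds (for instance, treating anything within 65536 of zero as a "small integer") are assumptions made for illustration, and it covers 32-bit patterns only.

```cpp
#include <cassert>   // used only by the example checks in the note below
#include <cstdint>

// Hypothetical helper: true when a 32-bit fill pattern satisfies the
// rules of thumb listed in the answers (thresholds are illustrative).
bool is_reasonable_poison(std::uint32_t v) {
    const std::uint32_t b0 = v & 0xffu;
    const std::uint32_t b1 = (v >> 8) & 0xffu;
    const std::uint32_t b2 = (v >> 16) & 0xffu;
    const std::uint32_t b3 = v >> 24;

    const bool misaligned    = (v % 4u) != 0u;                    // bad as a pointer
    const bool not_small_int = v >= 0x10000u && v <= 0xffff0000u; // not near 0 or -1
    const bool has_non_ascii = ((b0 | b1 | b2 | b3) & 0x80u) != 0u;
    const bool no_zero_byte  = b0 != 0u && b1 != 0u && b2 != 0u && b3 != 0u;
    const bool not_repeated  = (v >> 16) != (v & 0xffffu);        // no 1- or 2-byte repeat
    const bool high_bit_set  = (v & 0x80000000u) != 0u;           // unlikely valid address

    return misaligned && not_small_int && has_non_ascii &&
           no_zero_byte && not_repeated && high_bit_set;
}
```

Under these assumptions assert(is_reasonable_poison(0xDEADBEEFu)) passes, while 0xBADBEEF0 is rejected for being 4-byte aligned and 0x0BADBEEF for having its top bit clear.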

What's the difference between read() and getc()

I have two code segments:
    while((n=read(0,buf,BUFFSIZE))>0)
        if(write(1,buf,n)!=n)
            err_sys("write error");

    while((c=getc(stdin))!=EOF)
        if(putc(c,stdout)==EOF)
            err_sys("write error");
Some explanations on the internet confuse me. I know that standard I/O does buffering automatically, but I have passed a buf to read(), so read() is also doing buffering, right? And it seems that getc() reads data char by char, so how much data will the buffer hold before all the data is sent out?
Thanks
While both functions can be used to read from a file, they are very different. First of all, on many systems read is a lower-level function, and may even be a system call directly into the OS. The read function also isn't standard C or C++; it's part of e.g. POSIX. It can read arbitrarily sized blocks, not only one byte at a time. There's no buffering (except maybe at the OS/kernel level), and it doesn't distinguish between "binary" and "text" data. And on POSIX systems, where read is a system call, it can be used to read from all kinds of devices and not only files.
The getc function is a higher level function. It usually uses buffered input (so input is read in blocks into a buffer, sometimes by using read, and getc gets its characters from that buffer). It only returns a single character at a time. It's part of the C and C++ specifications as part of the standard library. Also, there may be conversions between the data read and the data returned by the function, depending on whether the file was opened in text or binary mode.
Another difference is that read is also always a function, while getc might be a preprocessor macro.
Comparing read and getc doesn't really make much sense, more sense would be comparing read with fread.
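A small sketch of that more natural comparison, using only standard library calls: one helper pulls characters out of the stdio buffer with getc, the other asks fread for blocks of a size chosen by the caller. The helper names and the 4096-byte block size are arbitrary choices for illustration.

```cpp
#include <cassert>   // used only by the example checks in the note below
#include <cstdio>
#include <string>

// Read everything one character at a time; stdio refills its own
// internal buffer behind the scenes whenever it runs empty.
std::string slurp_with_getc(std::FILE* in) {
    std::string out;
    int c;  // int rather than char, so EOF stays distinguishable
    while ((c = std::getc(in)) != EOF) {
        out.push_back(static_cast<char>(c));
    }
    return out;
}

// Read everything in blocks whose size the caller chooses.
std::string slurp_with_fread(std::FILE* in) {
    std::string out;
    char block[4096];
    std::size_t n;
    while ((n = std::fread(block, 1, sizeof block, in)) > 0) {
        out.append(block, n);
    }
    return out;
}
```

Both helpers return the same bytes for the same stream; they differ only in how many library calls it takes to get them.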

Storing hexadecimal addresses in a file

I have a pintool application which stores the memory addresses accessed by an application in a file. These addresses are in hexadecimal form. If I write these addresses as strings, the file takes a huge amount of storage (nearly 300GB). Writing such a large file also takes a long time. So I am looking for an alternative way to reduce the amount of storage used.
Each character of hexadecimal address represent 4 bits and each ASCII character is of 8 bits. So I am thinking of representing two hexadecimal characters by one ASCII character.
For example :
if my hexadecimal address is 0x26234B
then corresponding converted ASCII address will be &#K (0x is ignored as I know all address will be hexadecimal).
I want to know whether there is a more efficient method for doing this that takes less storage.
NOTE : I am working in c++
This is a good start. If you really want to go further, you can consider compressing the data using something like a zip library or Huffman encoding.
Assuming your addresses are 64-bit pointers, and that such a representation is sensible for your platform, you can just store them as 64-bit ints. For example, the address 0x1234567890abcdef could be stored as the eight bytes:
12 34 56 78 90 ab cd ef
(your pointer, stored in 8 bytes.)
or the same, but backwards, depending on what endianness you choose. Specifically, you should read this.
We can even do this somewhat platform-independently: uintptr_t is an unsigned integer type wide enough to hold a pointer (assuming one exists, which it usually does, but it's not a sure thing), and sizeof(our_pointer) gives us the size in bytes of a pointer. We can arrive at the above bytes with these steps:
Convert the pointer to an integer representation (i.e., 0x0026234b)
Shift the bytes around to pick out the one we want.
Stick it somewhere.
In code:
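    // Requires <cstdint> for uintptr_t.
    unsigned char buffer[sizeof(YourPointerType)];
    for(unsigned int i = 0; i < sizeof(YourPointerType); ++i) {
        // Shift by whole bytes (8 bits each), most significant byte first.
        buffer[i] = (
            (reinterpret_cast<uintptr_t>(your_pointer) >> (8 * (sizeof(YourPointerType) - i - 1)))
            & 0xff
        );
    }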
Some notes:
That'll do a >> 0 on the last loop iteration. That is well-defined: a shift is only undefined when the count is negative or at least the width of the type, so no special case is needed.
This will write out pointers of the size of your platform, and requires that they can be converted sensibly to integers. (I think uintptr_t won't exist if this isn't the case.) It won't do the same thing on 64- as it will on 32-bit platforms, as they have different pointer sizes. (Or any other pointer-sized platform you run across.)
A program's pointers aren't valid once the program dies, and might not even remain valid when the program is still running. (If the pointer points to memory that the program decides to free, then the pointer is invalid.)
There's likely a library that'll do this for you. (struct, in Python, does this.)
The above is a big-endian encoder. Alternatively, you can write out little endian — the Wikipedia article details the difference.
Last, you can just cast a pointer to the pointer to an unsigned char *, and write that. (I.e., dump the actual memory of the pointer to a file.) That's way more platform dependent though.
If you need to save even more space, I'd run it through gzip.
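For reference, here is a self-contained sketch of the big-endian encoding discussed above, together with the matching decoder. It assumes addresses fit in 64 bits and uses a fixed-width std::uint64_t rather than a pointer type so that files look the same on 32- and 64-bit builds; the function names are made up for illustration.

```cpp
#include <array>
#include <cassert>   // used only by the example checks in the note below
#include <cstddef>
#include <cstdint>

// Encode an address as 8 bytes, most significant byte first.
std::array<unsigned char, 8> encode_be64(std::uint64_t address) {
    std::array<unsigned char, 8> out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        // Shift by whole bytes: 56, 48, ..., 8, 0 bits.
        const unsigned shift = 8u * static_cast<unsigned>(out.size() - i - 1);
        out[i] = static_cast<unsigned char>((address >> shift) & 0xffu);
    }
    return out;
}

// Rebuild the address from the 8 bytes produced by encode_be64.
std::uint64_t decode_be64(const std::array<unsigned char, 8>& bytes) {
    std::uint64_t value = 0;
    for (unsigned char b : bytes) {
        value = (value << 8) | b;
    }
    return value;
}
```

With this, 0x1234567890abcdef becomes the bytes 12 34 56 78 90 ab cd ef, and the 8-byte array can be written to the file with a single fwrite or ostream::write call instead of an 18-character hex string.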

Unexpected "padding" in a Fortran unformatted file

I don't understand the format of unformatted files in Fortran.
For example:
open (3,file=filename,form="unformatted",access="sequential")
write(3) matrix(i,:)
outputs a row of a matrix into a file. I've discovered that it pads the file with 4 bytes on either end, but I don't really understand why, or how to control this behavior. Is there a way to remove the padding?
For unformatted IO, Fortran compilers typically write the length of the record at the beginning and end of the record. Most but not all compilers use four bytes. This aids in reading records, e.g., the length at the end assists with a backspace operation. You can suppress this with the Stream IO mode introduced in Fortran 2003, which was added for compatibility with other languages. Use access='stream' in your open statement.
I never used sequential access with unformatted output for this exact reason. However, it depends on the application, and sometimes it is convenient to have a record length indicator (especially for unstructured data). As suggested by steabert in Looking at binary output from fortran on gnuplot, you can avoid this by using the keyword argument ACCESS = 'DIRECT', in which case you need to specify the record length. This method is convenient for efficient storage of large multi-dimensional structured data (constant record length). The following example writes an unformatted file whose size equals the size of the array:
    REAL(KIND=4),DIMENSION(10) :: a = 3.141
    INTEGER :: reclen
    INQUIRE(iolength=reclen) a
    OPEN(UNIT=10,FILE='direct.out',FORM='UNFORMATTED',&
         ACCESS='DIRECT',RECL=reclen)
    WRITE(UNIT=10,REC=1) a
    CLOSE(UNIT=10)
    END
Note that this is not the ideal approach in terms of portability. In an unformatted file written with direct access, there is no information about the size of each element. A readme text file that describes the data size does the job fine for me, and I prefer this method to the padding in sequential mode.
Fortran IO is record based, not stream based. Every time you write something through write() you are not only writing the data, but also begin and end markers for that record. Both record markers hold the size of that record. This is why writing a bunch of reals in a single write (one record: one begin marker, the bunch of reals, one end marker) produces a different file size than writing each real in a separate write (multiple records, each with one begin marker, one real, and one end marker). This is extremely important if you are writing large matrices, as the file size can balloon if they are written carelessly.
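To show what that layout means for a program in another language, here is a sketch in C++ that splits the raw bytes of a sequential unformatted file into records. It assumes 4-byte little-endian length markers before and after each record, which is a common layout but not guaranteed by the Fortran standard; the names Bytes, read_marker and split_records are made up for illustration.

```cpp
#include <cassert>   // used only by the example checks in the note below
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Bytes = std::vector<unsigned char>;

// Read a 4-byte little-endian record length starting at pos (assumed layout).
std::uint32_t read_marker(const Bytes& file, std::size_t pos) {
    return static_cast<std::uint32_t>(file[pos]) |
           (static_cast<std::uint32_t>(file[pos + 1]) << 8) |
           (static_cast<std::uint32_t>(file[pos + 2]) << 16) |
           (static_cast<std::uint32_t>(file[pos + 3]) << 24);
}

// Split [marker][payload][marker] ... into the individual payloads.
std::vector<Bytes> split_records(const Bytes& file) {
    std::vector<Bytes> records;
    std::size_t pos = 0;
    while (pos < file.size()) {
        if (file.size() - pos < 8) {
            throw std::runtime_error("truncated record markers");
        }
        const std::uint32_t len = read_marker(file, pos);
        if (file.size() - pos - 8 < len) {
            throw std::runtime_error("truncated record payload");
        }
        const std::size_t payload = pos + 4;
        if (read_marker(file, payload + len) != len) {
            throw std::runtime_error("begin and end markers disagree");
        }
        const auto first = file.begin() + static_cast<std::ptrdiff_t>(payload);
        records.emplace_back(first, first + static_cast<std::ptrdiff_t>(len));
        pos = payload + len + 4;  // skip payload and the trailing marker
    }
    return records;
}
```

Under this assumed layout, N four-byte reals written in one record occupy 4N + 8 bytes, while the same reals written one per record occupy 12N bytes.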
Regarding Fortran unformatted IO: I am quite familiar with the differing outputs of the Intel and GNU compilers. Fortunately, my experience dating back to 1970s IBM machines allowed me to decode things. GNU pads records with 4 byte integer counters giving the record length. Intel uses a 1 byte counter and a number of embedded coding values to signify a continuation record or the end of a count. One can still have very long record lengths even though only 1 byte is used.
I have software compiled by the GNU compiler that I had to modify so it could read an unformatted file generated by either compiler, so it has to detect which format it finds. Reading an unformatted file generated by the Intel compiler (which follows the "old" IBM days) takes "forever" using GNU's fgetc or opening the file in stream mode. Converting the file to what GNU expects makes reading up to 100 times faster. Whether detection and conversion are worth the bother depends on your file size. I reduced my program startup time (which opens a large unformatted file) from 5 minutes down to 10 seconds. I had to add options to convert back again if the user wants to take the file back to an Intel compiled program. It's all a pain, but there you go.