In http://www.parashift.com/c++-faq-lite/intrinsic-types.html#faq-26.6, it is written that
"Another valid approach would be to define a "byte" as 9 bits, and simulate a char* by two words of memory: the first could point to the 36-bit word, the second could be a bit-offset within that word. In that case, the C++ compiler would need to add extra instructions when compiling code using char* pointers."
I couldn't understand what is meant by "simulating a char* by two words of memory", or the rest of the quote.
Could somebody please explain it with an example?
I think this is what they were describing:
The PDP-10 referenced in the second paragraph had 36-bit words and was unable to address anything inside those words. What follows is a description of one way this problem could have been solved while staying within the restrictions of the C++ language spec (included in the first paragraph).
Let's assume that you want to make 9-bit-long bytes (for some reason). By the spec, a char* must be able to address individual bytes. The PDP-10 can't do this, because it can't address anything smaller than a 36-bit word.
One way around the PDP-10's limitations would be to simulate a char* using two words of memory. The first word would be a pointer to the 36-bit word containing the char (this is normally as precise as the PDP-10's pointers allow). The second word would indicate an offset (in bits) within that word. Now, the char* can access any byte in the system and complies with the C++ spec's limitations.
ASCII-art visual aid:
| Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 | Byte 7 | Byte 8 |
-------------------------------------------------------------------------
|              Word 1               |              Word 2               |
|             (Address)             |             (Offset)              |
-------------------------------------------------------------------------
Say you had a char* with word1 = 0x0100 and word2 = 0x12. This would point to bit offset 18 (the start of the third 9-bit byte) within word 0x0100, the 256th word of memory.
If this technique were really used to build a conforming C++ implementation on the PDP-10, then the C++ compiler would have to do extra work juggling the extra bits required by this rather funky internal format.
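A minimal sketch, in modern C++, of what such a two-word char* could look like (the struct, the field names, and the byte ordering within the word are all illustrative assumptions, not how a real PDP-10 compiler worked):

#include <cstdint>

// Hypothetical two-word "fat" char*: one word holds the word address,
// the other holds the bit offset (bytes numbered from the low end here).
struct SimulatedCharPtr {
    std::uint64_t word_address; // e.g. word1 = 0x0100
    std::uint64_t bit_offset;   // e.g. word2 = 0x12 (18): the third 9-bit byte
};

// Every dereference costs extra instructions: load the word, shift, mask.
unsigned read_byte(const std::uint64_t* memory, SimulatedCharPtr p) {
    std::uint64_t word = memory[p.word_address];
    return static_cast<unsigned>((word >> p.bit_offset) & 0x1FF); // one 9-bit byte
}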
The whole point of that article is to illustrate that a char isn't always 8 bits. It is at least 8 bits, but there is no defined maximum. The internal representation of data types depends on the platform architecture and may be different from what you expect.
Since the C++ spec says that a char* must point to individual bytes, and the PDP-6/10 does not allow addressing individual bytes in a word, you have a problem with char* (which is a byte pointer) on the PDP-6/10.
So one workaround is: define a byte as 9 bits; then you essentially have 4 bytes in a word (4 * 9 = 36 bits = 1 word).
You still can't have a char* point to individual bytes on the PDP-6/10, so instead have a char* be made up of two 36-bit words. The lower word would be the actual address, and the upper word would be some byte-mask magic that the C++ compiler could use to point to the right 9 bits in the lower word.
In this case,
sizeof(int*) (36 bits) is different from sizeof(char*) (72 bits).
It's just a contrived example that shows how the spec doesn't constrain primitives to specific bit/byte sizes.
data: [char1|char2|char3|char4]
To access char1:
ptrToChar = &data
index = 0
To access char2:
ptrToChar = &data
index = 9
To access char3:
ptrToChar = &data
index = 18
...
then to access a char, you would:
(*ptrToChar >> index) & 0x001ff
but ptrToChar and index would be saved in some sort of structure that the compiler creates so they would be associated with each other.
Actually, the PDP-10 can address (load, store) 'bytes' smaller than a (36-bit) word with a single-word pointer. On the -10, a byte pointer includes the address of the word containing the 'byte', the width (in bits) of the 'byte', and the position (in bits from the right) of the 'byte' within the word. Incrementing the pointer (with an explicit increment, or an increment-and-load/deposit instruction) increments the position part by the size part and handles overflow into the next word address. (No decrementing, though.) A byte pointer can address individual bits, but widths of 6, 8, 9, and 18(!) bits were probably common, as there were specially-formatted versions of byte pointers (global byte pointers) that made their use somewhat easier.
Suppose a PDP-10 implementation wanted to get as close to 8-bit bytes as possible. The most reasonable way to split up a 36-bit word (the smallest unit of memory the machine's assembly language can address) is to divide it into four 9-bit bytes. To access a particular 9-bit byte, you need to know which word it's in (you'd use the machine's native addressing mode for that, with a pointer that takes up one word), and you'd need extra data to indicate which of the four bytes inside the word you're interested in. This extra data would be stored in a second machine word. The compiler would generate lots of extra instructions to pull the right byte out of the word, using the extra data stored in the second word.
This is a very simple piece of code in C++. The addresses of the strings are separated by a constant gap of 28 bytes. What do these 28 bytes contain? I am trying to find an analogy with the gap of 4 bytes in an array of integers. As far as I know, the 4 bytes determine the upper limit of the value an integer can hold. What happens in the case of the 28 bytes? Do they really contain 28*8 bits of character data? I do not believe that. I have tried giving it a large text, and it still prints without any issues.
#include <iostream>
#include <string>

int main() {
    std::string str[3] = { "a", "b", "c" };
    for (int i = 0; i < 3; ++i)
        std::cout << &str[i] << std::endl;
}
What do these 28 bytes contain?
It contains an object of type string. We can't say more unless we know how you have defined that type.
If string is an alias of std::string, then it is a class defined by the standard library. The exact contents and thus the exact size depend on and vary between standard library implementations, and the target architecture.
If we consider what some implementation might do in practice:
Do they really contain 28*8 bits of character data? I do not believe that.
Believe it or not, (modern) string implementations really do store roughly sizeof(string) bytes of character data (minus some bookkeeping overhead) directly inside the object, when the characters fit in that space.
They use advanced tricks to change the internal layout to support longer strings. For those, they use pointers: typically a pointer to the beginning, a pointer to the end of the string (storing a length or offset is another option), and a pointer (or offset) to the end of the dynamically allocated storage. This representation is essentially identical to a vector.
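As a rough illustration (this pokes at implementation details, so treat it as a sketch rather than guaranteed behavior), you can test whether a given string's buffer lives inside the object itself:

#include <iostream>
#include <string>

// True when the character buffer sits inside the string object's own
// footprint, i.e. the small-string optimization is in effect.
bool stored_inline(const std::string& s) {
    const char* obj = reinterpret_cast<const char*>(&s);
    return s.data() >= obj && s.data() < obj + sizeof(std::string);
}

int main() {
    std::string small = "abc";
    std::string big(1000, 'x');
    std::cout << sizeof(std::string) << '\n';  // e.g. 32 with libstdc++
    std::cout << stored_inline(small) << ' ' << stored_inline(big) << '\n'; // likely "1 0"
}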
If you read the standard library headers that you use, you'll find the exact definition of the class there.
I'm writing alignment-dependent code, and I'm quite surprised that there's no standard function to test whether a given pointer is aligned properly.
It seems that most code on the internet uses (long)ptr or reinterpret_cast<uintptr_t>(ptr) to test alignment, and I have used them too, but I wonder whether using the pointer cast to an integral type this way is standard-conformant.
Is there any system that fires the assertion here?
#include <cassert>
#include <cstdint>

int main() {
    char ch[2];
    assert(reinterpret_cast<char*>(reinterpret_cast<std::uintptr_t>(&ch[0]) + 1)
           == &ch[1]);
}
To answer the question in the title: No.
Counter-example: on the old Pr1me mini-computer, a normal pointer was two 16-bit words. The first word was a 12-bit segment number, 2 ring bits, and a flag bit (I can't remember the 16th bit). The second word was a 16-bit word offset within a segment. A char* (and hence void*) needed a third word. If the flag bit was set, the third word was either 0 or 8 (being the bit offset within the addressed word). A uintptr_t for such a machine would need to be uint48_t or uint64_t. Either way, adding 1 to such an integer would not advance to the next character in memory.
A capability-addressed machine is also likely to have pointers much larger than the address space, and there is no particular reason why the least significant part of the corresponding integer should be part of the "address" rather than part of the extra info.
In practice, of course, nobody is writing C++ for a Pr1me, and capability-addressed machines seem not to have caught on either. It will work on all real systems; the standard just doesn't guarantee it.
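For what it's worth, the usual test looks like the sketch below; as noted above, it relies on the pointer-to-integer mapping behaving "arithmetically", which the standard doesn't guarantee but every mainstream platform provides:

#include <cstdint>

// Aligned for T when the address is a multiple of alignof(T).
// Implementation-defined in theory, reliable on real systems.
template <typename T>
bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}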
The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size.
So can I assume that reading a char from memory should be as fast as reading a word (4 bytes)?
If the answer is YES, then why do we even use char variables when coding instead of word-sized variables (other than the obvious type-checking necessity)?
Why do we use char variables? To avoid wasting memory. If you need 4 variables, you can fit 4 chars in a single word, but if you declared them as int they would take 4 words. If they only need to hold small numbers, all that extra memory is unnecessary.
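For example (the sizes shown are typical, not mandated by the standard):

#include <iostream>

struct FourChars { char a, b, c, d; }; // commonly 4 bytes total
struct FourInts  { int  a, b, c, d; }; // commonly 16 bytes total

int main() {
    std::cout << sizeof(FourChars) << ' ' << sizeof(FourInts) << '\n'; // e.g. "4 16"
}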
One reason I can immediately think of is that we just want a byte for an ASCII character.
const WORD* str = W"Hello World?";
// ^ Some theoretical WORD-charactered string
Hehe.
And so, because each character is just one byte, the processor only has to do one fetch to get, say, 4 characters.
Per the following code, I get that the size of a character pointer is 8 bytes. Yet this site shows a size of 1 byte for the char pointer.
#include <stdio.h>

int main(void) {
    char *a = "saher asd asd asldasdas;daksd ahwal";
    printf("\nSize = %zu\n", sizeof(a));
    return 0;
}
Is this always the case? I am writing a connector for a simple database I am implementing and want to read the TEXT field of MySQL into my database. Since TEXT has a variable size, I was wondering whether my column type/metadata can have a fixed size of 8 bytes, where I store the in-memory pointer to the string (char *).
Per the following code, I get that the size of a character pointer is 8 bytes. Yet this site shows a size of 1 byte for the char pointer.
It's implementation-defined. It's usually 8 on a 64-bit Intel system and 4 on a 32-bit Intel system. Don't rely on it being any particular size.
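You can check on your own machine (the output in the comments is what a typical 64-bit target prints; nothing guarantees it):

#include <cstdio>

int main() {
    // All data-pointer types usually share one size on a given platform.
    std::printf("%zu %zu %zu\n", sizeof(char*), sizeof(int*), sizeof(void*));
    // e.g. "8 8 8" on a 64-bit system, "4 4 4" on a 32-bit one
}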
I am writing a connector for a simple database I am implementing and want to read the TEXT field of MySQL into my database. Since TEXT has a variable size, I was wondering whether my column can have a fixed size of 8 bytes, where I store the in-memory pointer to the string (char *).
It makes no sense at all to store pointers into memory in a database. A database is for persistent data. On the other hand, data stored in memory is liable to disappear whenever a process exits (or the system is restarted).
No, it is not. The size of a pointer depends on the CPU architecture, and some architectures even have different sizes depending on the "type" of the pointer. On x86_64, only 48 bits of an address are currently meaningful, yet pointers occupy a full 64 bits, since storage is allocated in whole bytes and words rather than individual bits. One could, however, use pointer packing to serialize/deserialize pointers into 48-bit chunks.
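A sketch of that packing idea, assuming (as with current x86_64 user-space addresses) that only the low 48 bits are meaningful:

#include <cstdint>

// Pack a 64-bit pointer into 6 little-endian bytes.
void pack48(const void* p, unsigned char out[6]) {
    std::uint64_t v = reinterpret_cast<std::uint64_t>(p);
    for (int i = 0; i < 6; ++i)
        out[i] = static_cast<unsigned char>(v >> (8 * i));
}

// Rebuild the pointer; zero-filling the top 16 bits is only valid for
// canonical user-space addresses (bit 47 clear).
void* unpack48(const unsigned char in[6]) {
    std::uint64_t v = 0;
    for (int i = 0; i < 6; ++i)
        v |= static_cast<std::uint64_t>(in[i]) << (8 * i);
    return reinterpret_cast<void*>(v);
}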
A variable can be different sizes depending on the computer you are using; this is what causes the discrepancy between your results and the results you see online.
However, a variable of a given type will always be the same size on the same machine with the same compiler.
The size of any data pointer on a given platform is the same, regardless of the pointed-to type: char, string, object, etc.
On a PC with a 64-bit operating system (and a compiler targeting 64 bits), the size of a pointer is 8 bytes (a 64-bit address space).
Another platform may have 4-byte, 2-byte, or even 1-byte pointers (such as an 8-bit microcontroller).
I have a pintool application which stores the memory addresses accessed by an application in a file. These addresses are in hexadecimal form. If I write them out as strings, they take a huge amount of storage (nearly 300 GB), and writing such a large file also takes a long time. So I thought of an alternate way to reduce the amount of storage used.
Each character of a hexadecimal address represents 4 bits, while each ASCII character is 8 bits. So I am thinking of representing two hexadecimal characters by one ASCII character.
For example :
if my hexadecimal address is 0x26234B,
then the corresponding converted ASCII address will be &#K (the 0x is dropped, as I know all addresses are hexadecimal).
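A sketch of the conversion I have in mind (assuming the 0x prefix is stripped and the digit count is even):

#include <string>

// Two hex digits -> one byte, so "26234B" becomes 0x26 0x23 0x4B, i.e. "&#K".
std::string pack_hex(const std::string& hex) {
    std::string out;
    for (std::string::size_type i = 0; i + 1 < hex.size(); i += 2)
        out.push_back(static_cast<char>(std::stoul(hex.substr(i, 2), nullptr, 16)));
    return out;
}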
I want to know whether there is any more efficient method of doing this that takes even less storage.
NOTE: I am working in C++.
This is a good start. If you really want to go further, you can consider compressing the data with something like a zip library or Huffman coding.
Assuming your addresses are 64-bit pointers, and that such a representation is sensible for your platform, you can just store them as 64-bit integers. For example, you list 0x1234567890abcdef, which could be stored as the eight bytes:
12 34 56 78 90 ab cd ef
(your pointer, stored in 8 bytes.)
or the same, but backwards, depending on what endianness you choose. Specifically, you should read this.
We can even do this somewhat platform-independently: uintptr_t is an unsigned integer type the same width as a pointer (assuming one exists, which it usually does, but it's not a sure thing), and sizeof(our_pointer) gives us the size of a pointer in bytes. We can arrive at the above bytes as follows:
Convert the pointer to an integer representation (i.e., 0x0026234b)
Shift the bytes around to pick out the one we want.
Stick it somewhere.
In code:
unsigned char buffer[sizeof(YourPointerType)];
for (unsigned int i = 0; i < sizeof(YourPointerType); ++i) {
    buffer[i] = (
        (reinterpret_cast<uintptr_t>(your_pointer)
            >> ((sizeof(YourPointerType) - i - 1) * 8))
        & 0xff
    );
}
Some notes:
That'll do a >> 0 on the last loop iteration, which is well-defined; shifting is only undefined behavior when the shift count reaches the full width of the type, and that can't happen here.
This will write out pointers of the size of your platform, and requires that they can be converted sensibly to integers. (I think uintptr_t won't exist if this isn't the case.) It won't do the same thing on 64- as it will on 32-bit platforms, as they have different pointer sizes. (Or any other pointer-sized platform you run across.)
A program's pointers aren't valid once the program dies, and might not even remain valid when the program is still running. (If the pointer points to memory that the program decides to free, then the pointer is invalid.)
There's likely a library that'll do this for you. (struct, in Python, does this.)
The above is a big-endian encoder. Alternatively, you can write out little endian — the Wikipedia article details the difference.
Last, you can just cast the address of the pointer itself to unsigned char * and write that (i.e., dump the pointer's own bytes to a file). That's far more platform-dependent, though.
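That raw dump can be as short as this (it assumes the bytes are only ever read back on the same platform, and the pointer value is only meaningful within the same run):

#include <cstdio>

int main() {
    int x = 42;
    int* p = &x;
    // Write the pointer variable's own bytes, in native byte order.
    std::fwrite(&p, sizeof p, 1, stdout);
}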
If you need to save even more space, I'd run the output through gzip.