Storing hexadecimal addresses in a file - c++

I have a pintool application that stores the memory addresses accessed by an application in a file. These addresses are in hexadecimal form. If I write them as strings, the file takes a huge amount of storage (nearly 300GB), and writing such a large file also takes a long time. So I am looking for an alternative way to reduce the storage used.
Each character of a hexadecimal address represents 4 bits, and each ASCII character takes 8 bits. So I am thinking of representing two hexadecimal characters with one ASCII character.
For example:
if my hexadecimal address is 0x26234B
then the corresponding converted ASCII address will be &#K (0x is ignored, as I know all addresses will be hexadecimal).
I want to know whether there is a more efficient method for doing this that takes less storage.
NOTE: I am working in C++.

This is a good start. If you really want to go further, you can consider compressing the data using something like a zip library or Huffman encoding.
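The packing described in the question can be sketched roughly as follows. This is only an illustration, assuming an ASCII-compatible character set; hex_digit_value and pack_hex are hypothetical helper names, not part of any library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Convert one hexadecimal digit to its 4-bit value.
// hex_digit_value is a hypothetical helper name used only for this sketch.
inline unsigned hex_digit_value(char c) {
    if (c >= '0' && c <= '9') return static_cast<unsigned>(c - '0');
    if (c >= 'a' && c <= 'f') return static_cast<unsigned>(c - 'a' + 10);
    if (c >= 'A' && c <= 'F') return static_cast<unsigned>(c - 'A' + 10);
    throw std::invalid_argument("not a hexadecimal digit");
}

// Pack a hex string (without the "0x" prefix) into raw bytes, two digits per byte.
// An odd-length input is treated as if it had a leading zero digit.
inline std::string pack_hex(const std::string& hex) {
    std::string out;
    out.reserve((hex.size() + 1) / 2);
    std::size_t i = 0;
    if (hex.size() % 2 != 0) {
        out.push_back(static_cast<char>(hex_digit_value(hex[0])));
        i = 1;
    }
    for (; i < hex.size(); i += 2) {
        unsigned byte = (hex_digit_value(hex[i]) << 4) | hex_digit_value(hex[i + 1]);
        out.push_back(static_cast<char>(byte));
    }
    return out;
}
```

With this sketch, pack_hex("26234B") yields the three bytes 0x26, 0x23, 0x4B, which print as &#K in ASCII. Most packed bytes will not be printable, so the output file should be opened in binary mode.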

Assuming your addresses are 64-bit pointers, and that such a representation is sensible for your platform, you can just store them as 64-bit ints. For example, an address such as 0x1234567890abcdef could be stored as the eight bytes:
12 34 56 78 90 ab cd ef
(your pointer, stored in 8 bytes.)
or the same, but backwards, depending on what endianness you choose. Specifically, you should read up on endianness.
We can even do this somewhat platform-independently: uintptr_t is an unsigned integer type wide enough to hold a pointer (assuming one exists, which it usually does, but it's not a sure thing), and sizeof(our_pointer) gives us the size of a pointer in bytes. We can arrive at the above bytes as follows:
Convert the pointer to an integer representation (i.e., 0x0026234b)
Shift the bytes around to pick out the one we want.
Stick it somewhere.
In code:
unsigned char buffer[sizeof(YourPointerType)];
for (unsigned int i = 0; i < sizeof(YourPointerType); ++i) {
    // Shift counts are in bits, so the byte index is multiplied by 8.
    buffer[i] = (
        (reinterpret_cast<uintptr_t>(your_pointer) >> (8 * (sizeof(YourPointerType) - i - 1)))
        & 0xff
    );
}
Some notes:
That'll do a >> 0 on the last loop iteration. That is well-defined (a shift is only undefined when the count is negative or at least the width of the type), so no special case is needed.
This will write out pointers of the size of your platform, and requires that they can be converted sensibly to integers. (I think uintptr_t won't exist if this isn't the case.) It won't do the same thing on 64- as it will on 32-bit platforms, as they have different pointer sizes. (Or any other pointer-sized platform you run across.)
A program's pointers aren't valid once the program dies, and might not even remain valid when the program is still running. (If the pointer points to memory that the program decides to free, then the pointer is invalid.)
There's likely a library that'll do this for you. (struct, in Python, does this.)
The above is a big-endian encoder. Alternatively, you can write out little endian — the Wikipedia article details the difference.
Last, you can just cast a pointer to the pointer to an unsigned char *, and write that. (I.e., dump the actual memory of the pointer to a file.) That's way more platform dependent though.
If you need to save even more space, I'd run it through gzip.

Related

Is `reinterpret_cast<char*>(reinterpret_cast<uintptr_t>(&ch) + 1) == &ch +1` guaranteed?

I'm writing alignment-dependent code, and I'm quite surprised that there's no standard function testing whether a given pointer is aligned properly.
It seems that most code on the internet uses (long)ptr or reinterpret_cast<uintptr_t>(ptr) to test the alignment, and I also used them, but I wonder if using the pointer cast to an integral type is standard-conformant.
Is there any system that fires the assertion here?
char ch[2];
assert(reinterpret_cast<char*>(reinterpret_cast<uintptr_t>(&ch[0]) + 1)
== &ch[1]);
To answer the question in the title: No.
Counter example: On the old Pr1me mini-computer, a normal pointer was two 16-bit words. First word was 12-bit segment number, 2 ring bits, and a flag bit (can't remember the 16th bit). Second word was a 16-bit word offset within a segment. A char* (and hence void*) needed a third word. If the flag bit was set, the third word was either 0 or 8 (being the bit offset within the addressed word). A uintptr_t for such a machine would need to be uint48_t or uint64_t. Either way, adding 1 to such an integer would not advance to the next character in memory.
A capability addressed machine is also likely to have pointers which are much larger than the address space, and there is no particular reason why the least significant part of the corresponding integer should be part of the "address" rather than part of the extra info.
In practice, of course, nobody is writing C++ for a Pr1me, and capability addressed machines seem not to have appeared either. It will work on all real systems - but the standard doesn't guarantee it.
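For reference, the alignment test the question mentions usually looks something like the sketch below. It relies on the pointer-to-integer mapping being a flat byte address, which is implementation-defined, so it carries exactly the caveat described above. is_aligned is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if ptr is a multiple of `alignment` when viewed as an integer.
// Assumption: reinterpret_cast to uintptr_t yields a flat byte address, which
// is implementation-defined and not guaranteed by the standard.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```

On mainstream platforms, is_aligned(p, alignof(T)) behaves as expected; on a machine like the one described in the counterexample it could give meaningless results.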

Why does using reinterpret_cast to convert from char* to a structure seem to work normally?

People say it's not good to trust reinterpret_cast to convert from raw data (like char*) to a structure. For example, for the structure
struct A
{
unsigned int a;
unsigned int b;
unsigned char c;
unsigned int d;
};
sizeof(A) = 16 and __alignof(A) = 4, exactly as expected.
Suppose I do this:
char *data = new char[sizeof(A) + 1];
A *ptr = reinterpret_cast<A*>(data + 1); // +1 is to ensure it doesn't point to 4-byte aligned data
Then copy some data to ptr:
memcpy_s(ptr, sizeof(A),
    "\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00", sizeof(A));
Then ptr->a is 1, ptr->b is 2, ptr->c is 3 and ptr->d is 4.
Okay, seems to work. Exactly what I was expecting.
But the data pointed to by ptr is not 4-byte aligned like A should be. What problems may this cause on an x86 or x64 platform? Performance issues?
For one thing, your initialization string assumes that the underlying integers are stored in little endian format. But another architecture might use big endian, in which case your string will produce garbage. (Some huge numbers.) The correct string for that architecture would be
"\x00\x00\x00\x01\x00\x00\x00\x02\x03\x00\x00\x00\x00\x00\x00\x04".
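One way to avoid depending on the host's byte order is to assemble each integer from its bytes with shifts. The sketch below decodes a 32-bit little-endian value and gives the same result on any architecture; load_le32 is a hypothetical helper name, and 8-bit bytes are assumed.

```cpp
#include <cstdint>

// Decode a 32-bit unsigned integer stored least significant byte first.
// The result does not depend on the endianness of the machine running this.
inline std::uint32_t load_le32(const unsigned char* bytes) {
    return static_cast<std::uint32_t>(bytes[0])
         | (static_cast<std::uint32_t>(bytes[1]) << 8)
         | (static_cast<std::uint32_t>(bytes[2]) << 16)
         | (static_cast<std::uint32_t>(bytes[3]) << 24);
}
```

Because it reads one byte at a time, this approach also has no alignment requirement on the source buffer.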
Then, of course, there is the issue of alignment.
Certain architectures won't even allow you to assign the address of data + 1 to a non-character pointer; they will issue a memory alignment trap.
But even architectures which will allow this (like x86) will perform miserably, having to perform two memory accesses for each integer in the structure. (For more information, see this excellent answer:
https://stackoverflow.com/a/381368/773113)
Finally, I am not completely sure about this, but I think that C and C++ do not even guarantee to you that an array of characters will contain characters packed in bytes. (I hope someone who knows more might clarify this.) Conceivably, there can be architectures which are completely incapable of addressing non-word-aligned data, so in such architectures each character would have to occupy an entire word. This would mean that it would be valid to take the address of data + 1, because it would still be aligned, but your initialization string would be unsuitable for the intended job, as the first 4 characters in it would cover your entire structure, producing a=1, b=0, c=0 and d=0.
The problem is that you cannot be sure whether this code will run on another platform, with the next version of Visual Studio, etc. When running on another processor, it may cause a hardware exception.
There was a time when you could read out arbitrary memory locations, but all those programs crash with an "access violation" exception nowadays. Something similar could happen to this program in the future.
However, what you can do, and what any compiler that calls itself "C++ standard compliant" must compile correctly, is this:
You can reinterpret_cast a pointer to something else, and then back to the original type. The value of the type, when read before and after, must stay the same.
I don't know what exactly you want to do, but you might get away with, for example
allocating a struct A
reinterpret_casting it to chars
saving the memory content to a file
and restore everything later:
allocate a struct A
reinterpret_cast it to chars
load the content to memory
reinterpret_cast it back to a struct A
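The save and restore steps can be sketched with memcpy, which copies the object representation byte by byte and has no alignment requirement on the buffer side. This is a minimal sketch, assuming the bytes are read back by the same program built for the same platform; store_A and load_A are hypothetical helper names.

```cpp
#include <cstring>

struct A {
    unsigned int a;
    unsigned int b;
    unsigned char c;
    unsigned int d;
};

// Copy the object representation of an A into a byte buffer at any offset.
inline void store_A(char* dest, const A& value) {
    std::memcpy(dest, &value, sizeof(A));
}

// Rebuild an A from bytes; memcpy places no alignment requirement on src.
inline A load_A(const char* src) {
    A value;
    std::memcpy(&value, src, sizeof(A));
    return value;
}
```

Unlike the reinterpret_cast<A*>(data + 1) in the question, load_A(data + 1) never accesses an A through a misaligned pointer, so it avoids the alignment problem entirely. The byte layout (padding, endianness) is still platform specific.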

does the size of char * (character pointer) in C/C++ vary? - use for database column fixed size

per the following code, I get the size of a character pointer is 8 bytes. Yet this site has a size of 1 byte for the char pointer.
#include <stdio.h>
int main(void ){
    const char *a = "saher asd asd asldasdas;daksd ahwal";
    printf("\nSize = %zu \n", sizeof(a));
return 0;
}
Is this always the case? I am writing a connector for a simple database I am implementing and want to read TEXT field of mysql into my database. Since TEXT has variable size, I was wondering if my column Type/metadata can have a fixed size of 8 bytes where I store the pointer in memory to the string (char *)?
per the following code, I get the size of a character pointer is 8 bytes. Yet this site has a size of 1 byte for the char pointer.
It's implementation-defined. It's usually 8 on a 64-bit Intel system and 4 on a 32-bit Intel system. Don't rely on it being any particular size.
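A small sketch for checking the sizes on a given implementation is shown below. Note that sizeof(char) is 1 by definition, while sizeof(char*) is the size of the pointer itself; print_pointer_sizes is a hypothetical name used only here.

```cpp
#include <cstdio>

// Print pointer and char sizes for this implementation.
// %zu is the printf conversion for size_t, the type that sizeof yields.
inline void print_pointer_sizes() {
    std::printf("sizeof(char)  = %zu\n", sizeof(char));
    std::printf("sizeof(char*) = %zu\n", sizeof(char*));
    std::printf("sizeof(void*) = %zu\n", sizeof(void*));
}
```

The first line always prints 1; the other two depend on the platform and compiler settings.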
I am writing a connector for a simple database I am implementing and want to read TEXT field of mysql into my database. Since TEXT has variable size, I was wondering if my column can have a fixed size of 8 bytes where I store the pointer in memory to the string (char *)?
It makes no sense at all to store pointers into memory in a database. A database is for persistent data. On the other hand, data stored in memory is liable to disappear whenever a process exits (or the system is restarted).
No, it is not. The size of a pointer depends on the CPU architecture. Some architectures even have different sizes depending on the "type" of the pointer. On x86_64, pointers occupy 64 bits, but common implementations use only the low 48 bits for virtual addresses. One could, therefore, use pointer packing to serialize/deserialize pointers into 48-bit chunks.
A variable can be different sizes based on the computer that you are using. This is causing the discrepancy between your results and the results you see online.
However, the variable will always be the same size on the same machine.
On a given platform, data pointers usually all have the same size, regardless of the pointed-to type (char, string, object, etc.).
On a PC with a 64-bit operating system (and a compiler that targets 64-bit), the size of a pointer is 8 bytes (64-bit address space).
Another platform may have 4-byte, 2-byte, or 1-byte pointers (like an 8-bit microcontroller).

size of char being written to file as a binary value in C++

What I understood about the char type from a few questions asked here is that it is always 1 byte in C++, but the number of bits can vary from system to system.
The sizeof operator uses char as its unit, so sizeof(char) is always 1 in C++ (a char takes the number of bits of the smallest addressable unit of the local machine). When using the fstream file functions in binary mode, we read and write directly from/to the address of a variable in RAM, so the smallest unit of data written to the file should be the size of the value read from RAM, and vice versa when reading from the file. Then can we say that data may not be written in groups of 8 bits if something like this is tried:
ofstream file;
file.open("blabla.bin",ios::out|ios::binary);
char a[]="asdfghjkkll";
file.seekp(0);
file.write((char*)a,sizeof(a)-1);
file.close();
Unless char is always the standard 8-bit byte, what happens if a block of data is written to a file on a 16-bit machine and read on a 32-bit machine? Or should I use OS-dependent text mode? If not, what have I misunderstood?
Edit : I have corrected my mistake.
Thanks for warning.
Edit2: My system is 64-bit but I get the number of bits of the char type as 8. What is wrong? Is the way I get the result of 8 false?
I got 000... by shifting a char variable by more than its possible size with bitwise operators. After guaranteeing that all bits of the variable are zero, I got 111... by inverting it, and then shifted it until it became zero. If we shift it as many times as its size in bits, we get zero, so the number of bits is the index at which the loop below terminates.
char zero,test;
zero<<=64; //hoping that system is not more than 64 bit(most likely)
test=~zero; //we have a 111...
int i;
for(i=0; test!=zero; i++)
test=test<<1;
The value of i after the loop is the number of bits in the char type. According to this, the result is 8.
My last question is:
Are a filesystem byte and the char type different data types, because how the computer addresses positions in a file stream is different from the standard char type, which is at least 8 bits?
So, exactly what is going on the background?
Edit3: Why these minuses? What is my mistake? Isn't the question clear enough? Maybe my question is stupid but why there is no any response related to my question?
A language standard can't really specify what the filesystem does - it can only specify how the language interacts with it. The C and C++ standards also don't address anything to do with interoperability or communication between different implementations. In other words, there isn't a general answer to this question except to say that:
the VAST majority of systems use 8-bit bytes
the C and C++ standards require that char is at least 8 bits
it is very likely that greater-than-8-bit systems have mechanisms in place to somehow utilize (or at least transcode) 8-bit files.
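As a side note on measuring the width of char: the number of bits per char is available directly as CHAR_BIT from <climits>, so no shifting experiment is needed. A minimal sketch (bits_per_char is just an illustrative name):

```cpp
#include <climits>
#include <limits>

// Number of bits in a char on this implementation.
constexpr int bits_per_char = CHAR_BIT;

// The standard requires at least 8 bits per char.
static_assert(CHAR_BIT >= 8, "char must have at least 8 bits");

// unsigned char has no padding bits, so its digit count equals CHAR_BIT.
static_assert(std::numeric_limits<unsigned char>::digits == CHAR_BIT,
              "unsigned char uses every bit of its storage");
```

A result of 8 on a 64-bit system is expected: the "64-bit" label describes register and address width, not the size of char.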

Binary How The Processor Distinguishes Between Two Same Byte Size Variable Types

I'm trying to figure out how the computer distinguishes between two variable types that have the same byte size.
If I have a variable that is one byte in size, how is the computer able to tell that it is a character instead of a Boolean variable? Or even a character or half of a short integer?
The processor doesn't know. The compiler does, and generates the appropriate instructions for the processor to execute to manipulate bytes in memory in the appropriate manner, but to the processor itself a byte of data is a byte of data and it could be anything.
The language gives meaning to these things, but it's an abstraction the processor isn't really aware of.
The computer is not able to do that. The compiler is. You use the char or bool keyword to declare a variable and the compiler produces code that makes the computer treat the memory occupied by that variable in a way that makes sense for that particular type.
A 32-bit integer for example, takes up 4 bytes in memory. To increment it, the CPU has an instruction that says "increment a 32-bit integer at this address". That's what the compiler produces and the CPU blindly executes it. It doesn't care if the address is correct or what binary data is located there.
The size of the instruction for incrementing the variable is another matter. It may very well be another 4 or so bytes, but instructions (code) are stored separately from data. There may be many instructions generated for a program that deal with the same location in memory. It is not possible to formally specify the size of the instructions beforehand because of optimizations that may change the number of instructions used for a given operation. The only way to tell is to compile your program and look at the generated assembly code (the instructions).
Also, take a look at unions in C. They let you use the same memory location for different data types. The compiler lets you do that and produces code for it but you have to know what you're doing.
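To illustrate that the same bytes mean different things depending on the type the compiler is told to use, the sketch below copies the bytes of a float into a 32-bit integer. It uses memcpy rather than a union because reading an inactive union member is undefined behavior in C++ (it is permitted in C). float_bits is a hypothetical helper name, and a 32-bit float is assumed.

```cpp
#include <cstdint>
#include <cstring>

// View the bytes of a float as a 32-bit unsigned integer.
// Assumption: float is 32 bits wide, which the static_assert checks.
inline std::uint32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

On platforms using IEEE 754 single precision, float_bits(1.0f) is 0x3f800000: the memory holds the same four bytes either way, and only the instructions the compiler emits decide whether they are treated as a float or an integer.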
Because you specify the type. C++ is a strongly typed language. You can't write $x = 10. :)
It knows
char c = 0;
is a char because of... well, the char keyword.
The computer only sees 1 and 0. You are in command of what the variable contains.
You can also cast that data into whatever you want.
char foo = 'a';
if ( (bool)(foo) ) // true, because 'a' is non-zero
{
    int sumA = (unsigned char)(foo) + (unsigned char)(foo);
    // sumA == (97 + 97), assuming ASCII
}
Also look into data casting to look at the memory location as different data types. This can be as small as a char or entire structs.
In general, it can't. Look at the restrictions of dynamic_cast<>, which tries to do exactly that. dynamic_cast can only work in the special case of objects derived from polymorphic base classes. That's because such objects (and only those) have extra data in them. Chars and ints do not have this information, so you can't use dynamic_cast on them.