Getting the offset to .text section code PE file format? VirtualAddress, PointerToRawData? - c++

I've been trying to do this for about two days, with no success. I have been reading over many PE file format tutorials to no avail.
I map a 32 bit executable into memory via CreateFileMapping which works perfectly. My program then loops through the section headers, and checks the characteristics against my default characteristics (to make sure the section is executable and is code). If it is true the program returns the (PIMAGE_SECTION_HEADER) pointer to that section header (program works perfectly so far).
Now that I have the pointer, there are two specific entries to the structure that have baffled me, and that is PointerToRawData and VirtualAddress, when I cout the entries;
VirtualSize = 4096, PointerToRawData = 1536.
From what I have read in the PE documentation, PointerToRawData is supposedly an offset (RVA???) to the first byte of the section's data on disk (am I correct?), and is a multiple of an alignment value (512). The question is what do I add this value to in order to obtain a pointer which I can use to access the section's data? On a memory-mapped file would it be better to use (VirtualAddress value + the imagebase value) to find the first byte of the section?
Another point of confusion is VirtualSize vs SizeOfRawData. This has confused me because in this article - http://msdn.microsoft.com/en-us/library/ms809762.aspx, it says "The SizeOfRawData field (seems a bit of a misnomer) later on in the structure holds the rounded up value" yet my VirtualSize is greater than my SizeOfRawData value which has led to confusion on which one I should use.
The object of this program is to find the executable section (.text section) and perform a bitwise operation on all the bits in the section, and end the operation before the next section.
I don't want it to seem like I expect a spoonfeed, I just want some clarifications.
Thank you for your time/help, it is appreciated.

I don't happen to have the spec handy or any PE code to look at for reference (I'm writing this on my iPad from my couch ;) ), but the key point to realize is that there are two modes to consider. All talk of RVAs is only relevant when the PE is mapped into memory as an image by the loader, and the alignment there is the section alignment (page alignment). When you're reading the file off disk, or mapping it as a plain data file (CreateFileMapping without SEC_IMAGE), the offsets are file offsets such as PointerToRawData, and each section uses the file alignment.
I hope this helps.
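To make the two modes concrete, here is a minimal sketch of how the section's bytes could be located in a plain file view. The `SectionInfo` struct is a hypothetical stand-in for the four IMAGE_SECTION_HEADER fields involved, and both helper names are made up for illustration; with <windows.h> you would read the same fields from the real header.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the IMAGE_SECTION_HEADER fields used here.
struct SectionInfo {
    std::uint32_t VirtualSize;      // size of the section once loaded in memory
    std::uint32_t VirtualAddress;   // RVA: offset from the image base once loaded
    std::uint32_t SizeOfRawData;    // size on disk, rounded up to the file alignment
    std::uint32_t PointerToRawData; // plain file offset of the section's data
};

// In a view of the file as it sits on disk, the section's first byte is at
// view base + PointerToRawData. VirtualAddress plays no part in this mode.
inline const std::uint8_t* sectionDataInFileView(const std::uint8_t* viewBase,
                                                 const SectionInfo& s) {
    assert(viewBase != nullptr);
    return viewBase + s.PointerToRawData;
}

// Number of section bytes that really exist in the file. VirtualSize can be
// larger (zero-filled tail in memory) or smaller (alignment padding on disk),
// so the smaller of the two is the safe amount to process.
inline std::uint32_t sectionBytesInFile(const SectionInfo& s) {
    if (s.VirtualSize == 0) return s.SizeOfRawData;  // e.g. object files
    return s.VirtualSize < s.SizeOfRawData ? s.VirtualSize : s.SizeOfRawData;
}
```

With the values from the question (PointerToRawData = 1536), the section data would start 1536 bytes into the view. ImageBase + VirtualAddress only applies when the loader has mapped the file as an image.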

Related

c++ How can I change the size of a void* according to a file I want to process

I am currently trying to make a program that can read a .blend file. Well trying is the important part, since I am already stuck on reading the file block info.
I'm going to quickly explain my problem; please refer to this page for context.
So in the .blend header there is a char that determines whether the pointer size, later used in the file block info (or just fileBlock on the linked webpage) among other things, is 4 or 8 bytes long. From what I have read, in C++ the void pointer only changes size according to the target platform it was compiled for (8 bytes for 64-bit and 4 bytes for 32-bit). However, .blend files can have either one, regardless of the platform, I presume.
Now since blender itself does also read its own files using c, there must be a way to change the pointer to match the required pointer size, according to the info in the header. However my best guess would be to dynamically allocate a void pointer array to either one or two pointers, which then makes actually using the data even more complicated.
Please help me find the intended way of handling the different pointer sizes!
Go back to the top of the wiki page and you will find the File Header structure. The header of a blend file starts with "BLENDER" which is followed by the pointer size for the file -
Size of a pointer
All pointers in the file are stored in this format
'_' (underscore) means 4 bytes or 32 bit
'-' (minus) means 8 bytes or 64 bits.
So by reading the eighth byte of the file you know the size of the pointers in the file.
if (file_bytes[7] == '_')
    ptr_size = 4;
else if (file_bytes[7] == '-')
    ptr_size = 8;
The copy of blender creating the file determines the sizes used for the file, so a 32bit build will save 32bit pointers in the file while a 64 bit build will save 64bit pointers.
You should also read the next byte, it tells you whether the file was saved as big or little endian, to see if you need to do any byte swapping. The use of blender on big endian machines might be getting smaller, but you may still come across big endian files.
Another important thing that doesn't seem to be mentioned, is that blend files can be compressed and often are. Reading a compressed blend file will mean using gzread() to read the file. A compressed file has the first two bytes set to 0x1f 0x8b
You will find the code that blender uses to read blend files in source/blender/blenloader.
Yup, that's painful. The solution is not to treat them as C++ pointers at all. Instead, create your own class BlendPointer to abstract this away. Those would be read from a BlendFile, and that BlendFile would store whether its BlendPointers are 4 or 8 bytes on disk.
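A minimal sketch of that idea follows. The names BlendPointer and BlendFile come from the answer above, but their members and the two helper functions are hypothetical. It also assumes that header byte 8 is 'v' for little endian and 'V' for big endian, which should be checked against the wiki page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// A pointer value as stored in the file. It is always held in 64 bits in
// memory, however wide it was on disk, and is never dereferenced.
struct BlendPointer {
    std::uint64_t value = 0;
};

// Reader state taken from the file header (hypothetical layout).
struct BlendFile {
    int  ptrSize = 0;          // 4 or 8, from header byte 7
    bool littleEndian = true;  // from header byte 8 (assumed 'v' / 'V')
};

inline BlendFile parseHeader(const unsigned char* h, std::size_t len) {
    if (len < 9) throw std::runtime_error("header too short");
    BlendFile f;
    if (h[7] == '_')      f.ptrSize = 4;
    else if (h[7] == '-') f.ptrSize = 8;
    else throw std::runtime_error("unknown pointer-size marker");
    f.littleEndian = (h[8] == 'v');
    return f;
}

// Reads ptrSize bytes at p and assembles them according to the file's
// byte order, independent of the host's pointer size or endianness.
inline BlendPointer readPointer(const BlendFile& f, const unsigned char* p) {
    assert(f.ptrSize == 4 || f.ptrSize == 8);
    BlendPointer out;
    for (int i = 0; i < f.ptrSize; ++i) {
        int idx = f.littleEndian ? i : f.ptrSize - 1 - i;
        out.value |= static_cast<std::uint64_t>(p[idx]) << (8 * i);
    }
    return out;
}
```

Since these values are only ever used as identifiers for matching blocks up after loading, storing them as plain 64-bit integers is usually enough.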

What's the meaning of HIGHLOW in a disassembled binary file?

I just used DUMPBIN for the first time and I see the term HIGHLOW repeatedly in the output file:
BASE RELOCATIONS #7
11000 RVA, E0 SizeOfBlock
...
3B5 HIGHLOW 2001753D ___onexitbegin
3C1 HIGHLOW 2001753D ___onexitbegin
...
I'm curious what this term stands for. I didn't find anything on Google or Stackoverflow about it.
To apply a fixup, a delta is calculated as the difference between the
preferred base address, and the base where the image is actually
loaded.
The basic idea is that when doing a fixup at some address, we must know
what memory must be changed ("offset" field)
what value is needed for its relocation ("delta" value)
which parts of relocated data and delta value to use ("type" field)
Here are some possible values of the "type" field
HIGH - add higher word (16 bits) of delta to the 16-bit value at "offset"
LOW - add lower word of delta to the value at "offset"
HIGHLOW - add full delta to the 32-bit value at "offset"
In other words, HIGHLOW type tells the program that it's doing a fix-up on offset "offset" from the page of this relocation block*, and that there is a doubleword that needs to be modified in order to have properly working executable.
* all of the relocation entries are grouped into blocks, and every block has a page on which its entries are applied
Let's say that you have this instruction in your code:
section .data
message: db "Hello World!", 0
section .code
...
mov eax, message
...
You run the assembler and linker, and immediately afterwards a disassembler. Now your code looks like this:
mov eax, 0x702000
You're now curious why the address is 0x702000, and when you look into the file dump, you see that
ImageBase: 0x00700000
Now you understand where this number came from (the image base plus the offset of message within the image) and you're ready to run the executable.
The loader, which loads executable files into memory and creates an address space for them, finds out that the memory at 0x700000 is unavailable and that it needs to place the file somewhere else. It decides that 0xf00000 will be OK and copies the file contents there.
But your program was linked to work only with data at 0x700000, and there was no way for the linker to know that its output would be relocated. Because of this, the loader must do its magic. It
1. calculates the delta value: the preferred address (image base) is 0x700000, but the image was actually loaded at 0xf00000. It subtracts the former from the latter and gets 0x800000 as the result.
2. goes to the .reloc section of the file
3. checks if there is still another page (4KB of data) to be relocated. If not, it continues toward calling the file's entry point.
4. for every relocation for the current page, it
   - gets the data at the relocation offset
   - adds the delta value (in the way the type field states)
   - places the new value at the relocation offset
   and then continues with step 3
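The inner step of that loop can be sketched roughly as below. This is a simplified illustration, not the real Windows loader: `applyBlock` and its parameters are hypothetical names. It relies on each 16-bit relocation entry holding the type in its top 4 bits and the page offset in its low 12 bits, and it assumes a little-endian host so the in-memory values match an x86 image.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Relocation type values (IMAGE_REL_BASED_*).
constexpr unsigned kRelAbsolute = 0;  // padding entry, nothing to do
constexpr unsigned kRelHigh     = 1;
constexpr unsigned kRelLow      = 2;
constexpr unsigned kRelHighLow  = 3;

// Applies the entries of one relocation block. `image` points at the place
// the image was actually loaded, `pageRva` is the block's page RVA and
// `delta` is (actual base - preferred base).
inline void applyBlock(std::uint8_t* image, std::uint32_t pageRva,
                       const std::uint16_t* entries, std::size_t count,
                       std::uint32_t delta) {
    assert(image != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        unsigned type = entries[i] >> 12;
        std::uint32_t offset = entries[i] & 0x0FFFu;
        std::uint8_t* target = image + pageRva + offset;
        if (type == kRelHighLow) {
            std::uint32_t v;
            std::memcpy(&v, target, sizeof v);  // target may be unaligned
            v += delta;                         // add the full 32-bit delta
            std::memcpy(target, &v, sizeof v);
        } else if (type == kRelHigh || type == kRelLow) {
            std::uint16_t v;
            std::memcpy(&v, target, sizeof v);
            v = static_cast<std::uint16_t>(
                v + (type == kRelHigh ? (delta >> 16) : (delta & 0xFFFFu)));
            std::memcpy(target, &v, sizeof v);
        }
        // kRelAbsolute and other types are left untouched in this sketch.
    }
}
```

With the numbers from the example, a HIGHLOW entry pointing at the stored value 0x702000 and a delta of 0x800000 leaves 0xf02000 in its place.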
There are also more types of relocation entry and some of them are architecture-specific. To see a full list, read the "Microsoft Portable Executable and Common Object File Format, section 6.6.2. Fixup Types".
What you see here is the content of the "Base relocation table" in Microsoft Windows executable files.
Base relocation tables are necessary in Windows for DLL files and they are optional for executable files; they contain information about the location of address information in the EXE/DLL file that must be updated when the actual address of the DLL file in memory is known (when loading the DLL into memory). Windows uses the information stored in this table to update the address information.
The table supports different types of addresses while the naming is Microsoft-specific: ABSOLUTE (= dummy), HIGH, LOW, HIGHLOW, HIGHADJ and MIPS_JMPADDR.
The full name of the constant is "IMAGE_REL_BASED_HIGHLOW".
The "ABSOLUTE" type is typically a dummy entry inserted to ensure the parts of the table are a multiple of 4 (or 8) bytes long.
On x86 CPUs only the "HIGHLOW" type is used: It tells Windows about the location of an absolute (32-bit) address in the file.
Some background info:
In your example the "Image Base" could be 0x20000000 which means that the EXE/DLL file has been compiled to be loaded into address 0x20000000. At the addresses 0x200113B5 (0x20000000 + 0x11000 + 0x3B5) and 0x200113C1 there are absolute addresses.
Let's say the memory at location 0x200113B5 contains the value 0x20012345 which is the address of a function or variable in the program.
Maybe the memory at address 0x20000000 cannot be used and Windows decides to load the DLL into the memory at 0x50000000 instead. Then the 0x20012345 must be replaced by 0x50012345.
The information in the base relocation table is used by Windows to find all addresses that must be replaced.
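The arithmetic in this example can be checked with a few lines of C++. The function names are made up for illustration; the numbers are the ones from the DUMPBIN output and the explanation above.

```cpp
#include <cstdint>

// Address of a fixup: image base + RVA of the relocation block + entry offset.
constexpr std::uint32_t fixupAddress(std::uint32_t imageBase,
                                     std::uint32_t blockRva,
                                     std::uint32_t offset) {
    return imageBase + blockRva + offset;
}

// New value of an absolute 32-bit address after the image has been moved.
constexpr std::uint32_t relocate(std::uint32_t oldValue,
                                 std::uint32_t preferredBase,
                                 std::uint32_t actualBase) {
    return oldValue + (actualBase - preferredBase);
}

static_assert(fixupAddress(0x20000000u, 0x11000u, 0x3B5u) == 0x200113B5u, "");
static_assert(fixupAddress(0x20000000u, 0x11000u, 0x3C1u) == 0x200113C1u, "");
static_assert(relocate(0x20012345u, 0x20000000u, 0x50000000u) == 0x50012345u, "");
```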

Access violation reading location

I am trying to debug the project in MSVS 2010.
Implementation: C++. When I am debugging the source code, I get the following failure reported by MSVS.
Failure reported:
"First chance exception at 0x00000013fb5b9ee in unit.exe: 0xC0000005: Access violation reading location 0x00000000000000c."
The problem lies in obtaining the address.
int base = (*(abc::g_runc1.m_paulsenderpin.m_lastchunk_p)).xcpp::cxcppoutput::m_baseaddress;
My project is too big to include the source code.
In short it can be described as:
- paul is a module with sender pin connected to c1.
- xcpp is the interface
This source code and the project are correct and work without failure with the ARM compiler, but on MSVS they give an access violation error.
On MSDN there are some posts about permissions set by the assembly that prevent reading the addressed location. If so, how do I change them?
Or is there a better way to find the problem?
Any help is appreciated.
Your code is trying to access a location that isn't owned by its process. No user application data can be located at addresses so close to zero. Your expression is too long to see at a glance which member holds the null pointer; my guess is m_lastchunk_p, with m_baseaddress being the member at offset 12 (0xC).
There is one simple explanation for why your code works fine when it's compiled by something that targets ARM: ARM uses aligned memory access, so class and structure members are aligned to full blocks, although they don't always use the whole space allocated for them. So you probably use a bigger pointer or wrong memset parameters somewhere in your code, and your pointer gets overwritten.
The problem may also disappear when you compile with another version of the compiler (or a different compiler), or on a machine with a different processor architecture (32/64-bit), as the size of fundamental types isn't always the same.
You should check which pointer in your expression is actually zero (or possibly 12) and set a watch on it. Be sure you use sizeof properly everywhere.
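To illustrate why a null pointer would show up as a read from 0xC rather than from 0, here is a small sketch. The `Output` struct is a made-up layout, not the real cxcppoutput class: it just places m_baseaddress 12 bytes into the object.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: three 4-byte members precede m_baseaddress, so it
// sits 12 bytes (0xC) from the start of the object.
struct Output {
    std::uint32_t a;
    std::uint32_t b;
    std::uint32_t c;
    std::uint32_t m_baseaddress;
};

// Address the CPU would try to read for objectAddress->m_baseaddress.
constexpr std::uintptr_t faultingAddress(std::uintptr_t objectAddress) {
    return objectAddress + offsetof(Output, m_baseaddress);
}

static_assert(offsetof(Output, m_baseaddress) == 12, "member at offset 0xC");
static_assert(faultingAddress(0) == 0xC, "null object pointer faults at 0xC");
```

So a fault address that is a small number often equals the offset of the member being read through a null object pointer.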
The problem lies with the memory addressing: the ARM debugger uses 32 bits of addressing and MSVS10 uses 48 bits, so the most significant bytes of the address are lost and the correct memory address cannot be found.
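If the address really was passing through a 32-bit variable somewhere (the `int base = ...` line in the question hints at that, but it is only a guess), the loss of the upper bits can be sketched like this. The function name and the sample address are made up for illustration.

```cpp
#include <cstdint>

// Keeps only the low 32 bits, which is what storing an address in a 32-bit
// integer amounts to on a 64-bit target.
constexpr std::uint32_t truncateTo32(std::uint64_t address) {
    return static_cast<std::uint32_t>(address);
}

// A 64-bit address above 4 GB loses its upper bits...
static_assert(truncateTo32(0x000000013FB5B9EEull) == 0x3FB5B9EEu, "");
// ...while an address that fits in 32 bits survives, as on a 32-bit target.
static_assert(truncateTo32(0x3FB5B9EEull) == 0x3FB5B9EEu, "");
```

Holding addresses in std::uintptr_t or a real pointer type instead of int avoids this on both platforms.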

Write a C++ struct to a file and read file using another programming language?

I have a challenging situation; we will have programs on Mac, PC, iOS and Android receiving files in a legacy format and parsing data from those files. We cannot change how those files are created.
The files are produced by a C++ program filling a struct with numbers and Strings and then writing it out. Here's a sanitized version.
struct MyObject {
String Kfkj(MAXKYS);
String Oern(MAXKYS);
String Vdflj(MAXKYS, 9);
int Muic;
int Tdfkj;
int VdfkAsdk;
int SsdjsdDsldsk;
int Ndsoief;
String TdflsajPdlj;
String TdckjdfPas;
String AdsfakjIdd;
int IdkfjdKasdkj;
int AsadkjaKadkja(MAXKYS);
int Kasldsdkj;
bool Usadl;
String PsadkjOasdj(9);
String PasdkjOsdkj;
};
Primitives and Strings, as you can see.
Then here is how they write it out to a file:
MyObject MyInstance;
const char* FileName = "C:\\MyFile.ab2";
ofstream fout(FileName, ios::binary);
fout.write((char*)&MyInstance, sizeof(MyInstance));
There is no option for us to translate it once and then distribute the file to other platforms; we must translate it on each and every different platform, and this is what we have to work with. I'd appreciate any information on how C++ serializes data, so we know how to parse the file.
EDIT: solution
The feedback I received from multiple answers here was VERY helpful. Using that, I did extensive analysis with hex editors and discovered:
the elements come in the file one after another
a "String," in this case, starts with an int describing how many characters follow the int for that String. If the String does not exist, it will still have that int with a value of 0.
integers, for the files and machines I saw, are two bytes, little-endian, and MOSTLY unsigned (there were a few that were signed, just to keep me on my toes)
the boolean was two bytes, with apparently -1 (FF FF) representing "true"
So far we have not run into issues with different padding or endianness on different devices, but those are very real concerns. The skilled notes and warnings in these answers provide us with more ammunition to try to convince the client to change to a less fragile alternative, such as XML or JSON, for transferring data online across platforms.
As for those of you asking if the developer was fired... well, let's just say their code is very old, but after multiple conversations we're still having trouble convincing them writing out the C++ struct and trying to read that on different platforms is not a good idea.
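For anyone facing a similar file, the layout found in the EDIT above could be parsed along these lines. This is only a sketch under the stated findings: `LegacyReader` is a hypothetical name, the string length prefix is assumed to be one of the two-byte integers, and any nonzero boolean is treated as true.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Cursor over the raw bytes of the file, which the caller has loaded into a
// vector that outlives the reader.
class LegacyReader {
public:
    explicit LegacyReader(const std::vector<std::uint8_t>& bytes) : bytes_(bytes) {}

    // Two-byte little-endian unsigned integer, assembled byte by byte so the
    // host's own endianness does not matter.
    std::uint16_t readU16() {
        need(2);
        std::uint16_t v = static_cast<std::uint16_t>(
            bytes_[pos_] | (bytes_[pos_ + 1] << 8));
        pos_ += 2;
        return v;
    }

    // The same two bytes, reinterpreted as signed for the fields that need it.
    std::int16_t readI16() { return static_cast<std::int16_t>(readU16()); }

    // Two-byte boolean; FF FF was observed for "true".
    bool readBool() { return readU16() != 0; }

    // Length-prefixed string: a two-byte count, then that many characters.
    std::string readString() {
        std::size_t n = readU16();
        need(n);
        auto first = bytes_.begin() + static_cast<std::ptrdiff_t>(pos_);
        std::string s(first, first + static_cast<std::ptrdiff_t>(n));
        pos_ += n;
        return s;
    }

private:
    void need(std::size_t n) const {
        assert(pos_ <= bytes_.size());
        if (bytes_.size() - pos_ < n) throw std::runtime_error("truncated file");
    }
    const std::vector<std::uint8_t>& bytes_;
    std::size_t pos_ = 0;
};
```

The fields would then be read in declaration order, for example readString() three times for the first three Strings, then readU16() or readI16() for each int, depending on what the hex editor showed for that field.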
You're going to run into many problems.
C++ doesn't have a specific format for serializing data per se. It is highly dependent on the computer architecture/processor that you are running on.
The compiler is allowed to add padding to help alignment on systems. When we say alignment we basically are referring to an architecture/processor's affinity for having data lie on specific byte boundaries. For example, some processors vastly prefer floating point numbers to lie at 4 or 8 byte boundaries - if they don't the processor may work much slower or may not work at all.
So, you can't simply know what padding your system is adding magically.
What you can do is use #pragma pack(push, 1) / #pragma pack(pop) around the struct definition to stop your compiler from padding its members.
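As a small illustration of what packing changes, here is a sketch using the push/pop form of the pragma, which MSVC, GCC and Clang all accept. The struct names and members are made up for the example.

```cpp
#include <cstddef>
#include <cstdint>

// Default layout: the compiler typically inserts 3 bytes of padding after
// `flag` so that `count` starts on a 4-byte boundary.
struct Padded {
    std::uint8_t  flag;
    std::uint32_t count;
};

// Packed layout: members follow each other with no padding at all.
#pragma pack(push, 1)
struct Packed {
    std::uint8_t  flag;
    std::uint32_t count;
};
#pragma pack(pop)

static_assert(sizeof(Packed) == 5, "packed struct has no padding");
static_assert(offsetof(Packed, count) == 1, "count directly follows flag");
static_assert(sizeof(Padded) >= sizeof(Packed), "padding can only add bytes");
```

Packing only removes padding. It does nothing about the size of int or bool, or about byte order, so it does not make the file format portable by itself.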
PS: you also have to worry about endianness. What if one computer is running on big-endian and one is little endian? They will interpret bytes differently without a conversion.
Simply put, you either have to fix the application generating the files so it uses a proper serialization scheme OR you need to look at it running on a SPECIFIC computer, look at exactly how it writes the files, and write a translator for every target platform (which is just silly).
Interesting Suggestion
If you're really stuck, write an app that monitors the folder where you write files. Have the app pick up the files (since it's on the same PC it'll be able to read their format without issue). Have it write the files back in XML or some other true serialization format and distribute those instead.
Whoa - that's crazy. So String objects don't contain any pointers? They must not, because you claim this is working code.
Anyway, that code isn't doing any serialization. It's just writing the structure out to the file exactly the way it is laid out in memory. The only issue you have is that on some platforms the padding and the sizeof integral types like int may be different.
You'll have to find the size of the integral types, and use that information in the reader/writer for newer platforms to make sure they get laid out the same way as on the legacy platform.
You're running a real risk with that code though. As it is, a compiler change could suddenly cause the file layout to change.
The format of your data file is entirely down to the compiler that your C++ program is compiled with, and the definition of your String class. You can rely on the fields being in the order they're declared in, and in this case, I think you can rely on there not being any padding at the start, but that's about all. Some tips that might help you out in this case:-
You don't give the definition of the String class you're using. If it's a typedef for std::string, you're completely screwed, because the contents of the string aren't in the memory. I assume your C++ programmers are using some special local buffer, in which case I'll guess you will find the first bytes of the object are the string, and there is some amount of useless padding afterwards. I hope the struct contains an int at the start telling you how much data in it is useful.
You'll probably find the int fields are four bytes long.
You'll probably find the bool field is one byte long, followed by three bytes of useless padding. Only one bit, most likely the bottom bit, will be set.
That's about all the useful guesswork I can offer you. In your target language, make sure to read the whole file in as the closest thing to a byte array available in the language, and only after that, use the language features to convert it into the right kind of thing in your language. Don't try reading it in as integers, as that won't let you byte-swap if you're on a platform with different endianness to the C++ program. I suggest also looking through the file in a text editor to reverse-engineer it and help you find the offset of each field.
Last piece of advice: consider printing P45s (or pink slips, or whatever you have in your country) for whichever programmers or project managers thought this kind of 'serialization' was a good idea. This kind of sloppy work might have been acceptable in a life-or-death situation, but they have seriously screwed you over in a way you're going to find it very hard to recover from. Writing the code to read in these files will not be that hard, if it's only one struct like this, but keeping it reliable will be a world of pain, and they've effectively made it impossible for themselves to change compilers or compiler version safely.
The way it's done, the struct is written in raw form to a file. So basically what you need to know to parse this file is the binary layout of your struct.
Basically, the fields are just one after the other, so to read an int, you just read 4 bytes and cast that to an int, etc.
Strings are a particular case. It's not clear from your code whether this "String" type is an inline array of characters, or a pointer to such an array. In the first case, you need to know how many characters each string contains and simply read that number of characters sequentially. In the second case, you won't be able to get the string back, since it won't have been written to file. The pointer will be useless to you.
One last concern is whether the struct is packed or not. Since you gave no indication of that, assume the default: each field is aligned to its natural boundary (typically 4 bytes for an int), so there may be padding, for instance after the boolean field, that you need to account for. If the struct is packed, then each field comes directly after the previous one.
So, to make a long story short, figure out your struct binary layout using its definition and, if all else fails, inspecting the memory at run-time with the debugger, or use a hex editor to study the output file. Then write that specification down somewhere and this will give you what you need to read from the file. It's impossible to tell exactly what that layout is simply by looking at the pseudo-definition you gave.
Writing to an ofstream does not serialize data. This code writes the raw memory content of the struct as if it were an array of char. Depending on your compiler, its version, its options and the system it is running on, the content will be completely different.
Even the number of bits in a char is allowed to change between C++ implementations.
Data referenced by the struct object won't be written (forget the content of a std::string).
If you cannot change the writer code, you must know the alignment policy, the sizes of the base types and the data representation. You will have to analyze the produced files by hand, for example with a hexadecimal editor like this one: http://www.physics.ohio-state.edu/~prewett/hexedit/, and probably look at your compiler documentation.
If you can change the writer code, use a proper serialization format like JSON, Protocol Buffers or simply XML.
No one has pointed out something that sticks out to me as particularly problematic (maybe because I've been bit by it). That problem: the data member bool Usadl;. sizeof(bool) varies across platforms, across compilers, and even across releases of the same compiler. Common values for sizeof(bool) are 4 and 1. This will bite you. It's getting hard to find a big endian machine nowadays, very, very hard to find a computer where CHAR_BIT is not 8 or sizeof(int) is not 4. This is not the case for sizeof(bool).
In agreement with everyone else, Chad's team needs to document the structure of the records in the file, and then make sure the program that produces the file writes this structure explicitly, including element sizes, padding, and endianness. Don't depend on class layout to do this for you. That's just asking for trouble.
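Writing the structure explicitly could look something like the sketch below. The helper names and the chosen sizes (16-bit little-endian integers, one-byte bool, 16-bit string length) are assumptions for illustration; the point is only that every size and byte order is spelled out in code rather than left to the compiler.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Appends a 16-bit value as exactly two bytes, least significant byte first.
inline void putU16LE(Bytes& out, std::uint16_t v) {
    out.push_back(static_cast<std::uint8_t>(v & 0xFF));
    out.push_back(static_cast<std::uint8_t>(v >> 8));
}

// Appends a bool as exactly one byte, whatever sizeof(bool) happens to be.
inline void putBool(Bytes& out, bool b) {
    out.push_back(b ? 1 : 0);
}

// Appends a string as a 16-bit length followed by its characters.
inline void putString(Bytes& out, const std::string& s) {
    assert(s.size() <= 0xFFFF);
    putU16LE(out, static_cast<std::uint16_t>(s.size()));
    out.insert(out.end(), s.begin(), s.end());
}
```

The resulting byte vector can then be written to the file in one call, and a reader on any platform can decode it with the mirror-image functions.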
The best way would probably be to use JSON or if you want a more robust solution go with something like Avro. Avro has a C++ API and a Java API, so it covers most of the cases you're encountering.

Writing binary files C++, way to force something to be at byte 18?

I'm currently trying to write a .bmp file in C++ and for the most part it works, there is however, just one issue. When I start trying to save images with different widths and heights everything goes askew and I'm struggling to solve it, so is there any way to force something to write to a specific byte (adding padding in between it and the last thing written)?
There are several sort of obvious answers, such as keeping your data in memory in a buffer, then putting the desired value in as bufr[offset]=mydata;. I presume you want something a little fancier than that, because you are, for example, doing this in a streaming sort of application where you can't have the whole object in memory at the same time.
In that case, what you're looking for is the magic offered by fseek(3) and ftell(3) (see man pages). Seek positions the file at a specific offset; tell gets the file's current offset. If it's a constant offset of 18, then you simply finish up with the file, and do
fseek(fp, 18L, SEEK_SET);
where fp is the file pointer, SEEK_SET is a constant declared in stdio.h meaning "measure from the start of the file", and 18 is the number 18. (SEEK_CUR would instead move 18 bytes forward from wherever the file currently is.)
Update
By the way, this is based on the system call lseek(2). Something that confuses people (read "me", I never remember this until I have been searching) is there is no matching "ltell(2)" system call. Instead, to get the current file offset, you use
off_t offset;
offset = lseek(fd, 0L, SEEK_CUR);
because lseek returns the offset after its operation. Note that lseek takes a file descriptor (fd), not a FILE pointer. The example code above gives us the offset after moving 0 bytes from the current offset, which is of course the current offset.
Update
Aha, C++, not C. For C++, the stream classes have member functions for seek and tell (seekp/tellp for output, seekg/tellg for input). See the fstream documentation.
Count how many bytes have been written. Write zeroes until the count hits 18. Then resume writing your real data.
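The counting approach can be sketched with the C++ stream functions like this. `padTo` and `demo` are hypothetical names. In a real BMP header the bytes before offset 18 are meaningful fields rather than zeros, so this only shows the positioning mechanics.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Writes zero bytes until the stream's put position reaches `target`.
// Returns false if the stream is already past `target` or a write failed.
inline bool padTo(std::ostream& out, std::streamoff target) {
    assert(target >= 0);
    std::streamoff pos = out.tellp();
    if (pos < 0 || pos > target) return false;
    for (; pos < target; ++pos) out.put('\0');
    return static_cast<bool>(out);
}

// Usage sketch: two signature bytes, zero padding, then a value at byte 18.
inline std::string demo() {
    std::ostringstream out;
    out.write("BM", 2);
    padTo(out, 18);
    out.put('\x2A');  // this byte lands at offset 18
    return out.str();
}
```

The same function works on a std::ofstream opened with std::ios::binary. With a file you could also call seekp(18) to jump back and overwrite bytes that have already been written.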
If you are on Windows, everything comes down to writing predefined structures: "Bitmap storage".
Also there is an example that shows how they should be filled: "Storing an Image".
If you are writing not-just-for-Windows code then you can mimic these structs and follow the guide.