I would like to include NASM itself (the assembler) in a C++ project. Can I compile NASM as a shared library? If not, is there another assembler that works as a C or C++ library?
I checked libyasm but couldn't understand how I can use it to assemble my code.
Woah, this exploded when I was away.
I had solved this problem by tampering with the YASM source code, and totally forgot about the question on SO since it received absolutely no attention 8 months ago. Below are the details, followed by a better suggestion.
For the project that I had in mind, I needed to use YASM as a library, and I was in a hurry because I was doing this for a company. Back then there were no good libraries that I was aware of, and I had concluded that getting used to the LLVM framework was overkill for the task (all I wanted was to assemble individual x86/x86_64 instructions and get the bytes back).
So I downloaded the source code for YASM.
Upon meddling with the code for a while, I noticed that the executable receives the file paths for the input and output files and passes these two strings along. I wanted char arrays in memory for the input and output, not files. So I figured that if I could find all the FILE pointers that get passed around, I could convert them to char pointers and change every file read/write into an array operation.
This turned out to be even more cumbersome than it sounds. YASM apparently does not open the input/output files once and reuse the same FILE pointers; instead it passes around copies of the file path strings. Making that change everywhere would have required a script of its own, and that was not a workable approach for me.
Eventually, I found all fopen/fclose calls in the program with a script and replaced them with my_fopen/my_fclose. In each file where I made these replacements, I included a header of mine in which I implemented these two functions.
In both of these functions, I compared the incoming file name with "fake_file". If they were equal, I returned a 'fake' FILE pointer backed by memory, obtained from fmemopen and open_memstream; otherwise I simply called the real fopen/fclose. In other words, I redirected these two calls (only for that one file name) to a memory file, and then called the library with the file name parameter set to "fake_file".
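A minimal, reconstructed sketch of the idea (not the actual code, which I cannot share) looks roughly like this, assuming glibc's fmemopen/open_memstream; the buffers and names are purely illustrative:

/* my_file.h -- sketch of the fopen/fclose redirection trick */
#define _GNU_SOURCE          /* for fmemopen/open_memstream on glibc */
#include <stdio.h>
#include <string.h>

static char   in_buf[]  = "mov eax, 1\n"; /* the "input file" contents      */
static char  *out_buf   = NULL;           /* filled in by open_memstream    */
static size_t out_size  = 0;

static FILE *my_fopen(const char *path, const char *mode)
{
    if (strcmp(path, "fake_file") == 0) {
        if (mode[0] == 'r')               /* reads come from memory         */
            return fmemopen(in_buf, strlen(in_buf), mode);
        return open_memstream(&out_buf, &out_size); /* writes go to memory  */
    }
    return fopen(path, mode);             /* every other file is untouched  */
}

static int my_fclose(FILE *fp)
{
    return fclose(fp);
}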
Since I had limited myself to Linux at that point, this approach worked for me. I also found out (using Valgrind) that there was a memory leak in the library version, so I wrote a very primitive garbage collector for it: basically I wrapped malloc and friends to keep track of all allocations that were not freed, and cleaned them up after each run.
This approach also allowed me to automate the changes with a script. Unfortunately I did all of this at a company, so I cannot share any actual code.
Better suggestion:
As of May 31, 2016, you can use Keystone Engine instead. It is "based on LLVM, but it goes much further with a lot more to offer." Together with the disassembly engine Capstone, it makes a near-perfect pair for assembly and disassembly. If you need either of these components, I suggest them instead of the hacks I described. Both engines are under active development, and even though Keystone still has some small bugs, Capstone is very robust at the moment.
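For a taste of what that looks like, here is a minimal sketch using Keystone's documented C API (ks_open/ks_asm/ks_free/ks_close); link against libkeystone:

#include <keystone/keystone.h>
#include <stdio.h>

int main(void)
{
    ks_engine *ks;
    unsigned char *encoding;
    size_t size, count;

    if (ks_open(KS_ARCH_X86, KS_MODE_64, &ks) != KS_ERR_OK)
        return 1;

    /* assemble one x86-64 instruction and print the resulting bytes */
    if (ks_asm(ks, "mov rax, 1", 0, &encoding, &size, &count) == KS_ERR_OK) {
        for (size_t i = 0; i < size; i++)
            printf("%02x ", encoding[i]);
        printf("\n");
        ks_free(encoding);
    }

    ks_close(ks);
    return 0;
}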
TL;DR: Use Keystone.
I know that the inline asm exists, but is it also possible to execute machine code from a file during RUNTIME?
Would I need to write my own interpreter?
I'm using the GNU C++ compiler with C++14 enabled, on Windows 7.
Thanks for reading.
With your rephrasing into machine code, this question starts taking a more reasonable shape.
A short answer: Yes, you can run machine code from within your application.
A longer answer is - it's complicated.
Essentially, any string of bits and bytes in memory can be executed, provided some conditions are met, such as the data being legal machine instructions (otherwise the processor will raise an illegal-instruction exception and the OS will terminate your program) and the memory page into which the data is loaded being marked with execute permission.
Having said that, the conditions required for that machine code to actually run correctly and do what you expect are significantly harder to meet, and have to do with understanding virtual memory, dynamic loaders and dynamic linkers.
To bluntly answer your question, for a POSIX compliant environment at the least, you could always use the mmap system call to map a file into memory with PROT_EXEC permissions and jump into that memory space hoping for the best.
Naturally, any symbols that code would be expecting to find in memory aren't likely to be there, and the code had better be compiled as PIC (Position Independent Code), but this roughly answers your question with a YES.
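A rough sketch of that mmap approach, assuming a POSIX system and a file "code.bin" (a made-up name) containing raw, position-independent machine code for the current architecture with its entry point at offset 0:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("code.bin", O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    fstat(fd, &st);

    /* map the file's bytes with execute permission */
    void *mem = mmap(NULL, st.st_size, PROT_READ | PROT_EXEC,
                     MAP_PRIVATE, fd, 0);
    close(fd);
    if (mem == MAP_FAILED) return 1;

    /* jump into the mapped bytes, hoping they really are valid code */
    void (*entry)(void) = (void (*)(void))mem;
    entry();

    munmap(mem, st.st_size);
    return 0;
}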
For better control, you'd usually prefer to use a more standard method, such as compiling your extra code as a shared object (Dynamic Link Library, DLL in Windows) and loading it into your application with dlopen while using dlsym to access symbols within it. It still allows you to load machine code from the disk into your application, but it also stores the machine code in a well formatted, standard way, which allows the dynamic linker to properly load and link the new code segment into your application, reducing unexpected behavior.
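A minimal sketch of the dlopen/dlsym route (the file name "plugin.so" and the symbol "do_work" are made up; link with -ldl on Linux):

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *handle = dlopen("./plugin.so", RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* look up a function exported by the shared object */
    void (*do_work)(void) = (void (*)(void))dlsym(handle, "do_work");
    if (do_work)
        do_work();

    dlclose(handle);
    return 0;
}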
In neither of these cases will you need an interpreter, but neither is it a matter of language or compiler used - this is OS specific functionality, and will behave quite differently on Windows.
As a different approach, you could consider using the #include directive to import an external chunk of assembly code into your work while you're still working on it and properly incorporate it in compile time, which will yield far more deterministic results.
Edit:
For Windows, the parallel of mmap is CreateFileMapping, and the parallel of dlopen is LoadLibrary.
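A rough sketch of the LoadLibrary/GetProcAddress route, parallel to the dlopen snippet above (again, "plugin.dll" and "do_work" are made-up names):

#include <windows.h>

int main()
{
    HMODULE handle = LoadLibraryA("plugin.dll");
    if (!handle)
        return 1;

    /* GetProcAddress is the LoadLibrary counterpart of dlsym */
    typedef void (*work_fn)(void);
    work_fn do_work = (work_fn)GetProcAddress(handle, "do_work");
    if (do_work)
        do_work();

    FreeLibrary(handle);
    return 0;
}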
Not a Windows expert, sorry...
Let us distinguish between "assembler code"/assembly code (which is what this question initially asked about) and machine code (after one of the edits).
Anything you might describe as "assembler code" (or more usually "assembly code") but not machine code (i.e. anything not being actual, binary, executable machine code) cannot be "executed". You can only read it into what I would call an "assembly-code interpreter" and have it processed. I do not know of any such program.
Alternatively, you can have it processed at runtime by a build process and execute the resulting executable. That however seems not to be what you are asking about.
Note that this does not mean that you can execute any machine code you might find in a file on your disk. It needs to be for the same platform and be supported by the appropriate runtime environment. That applies to executables created for your machine or a compatible one, e.g. the result of a build.
Note that I understand "assembler code" ("assembly code") to mean source code in assembly language, which is a (probably the most basic) representation of programs in more or less human-readable form. (As immortal has commented, an assembler is the program that processes assembly code into machine code.) Opcode mnemonics are used, e.g. cmp r1, r2 for comparing two registers. That string of characters, however, is guaranteed not to make any sense when you try to execute it directly. (OK, strictly speaking I should say "almost guaranteed"...)
Machine code which is appropriately made for your environment, including a loader, can be executed from a file. Any operating system will support you doing that; most will even provide a GUI for it. (I notice this sounds somewhat cynical; sorry, it is not meant to be.) Windows, for example, will execute an executable if you double-click its icon in Windows Explorer.
An alternative to such executable programs are libraries. Dynamic link libraries in particular are probably quite close to what you are thinking of. They are very similar, in that they need to be targeted at your environment. They can then (usually partially) be executed from a linked program, via agreed calling mechanisms. Those mechanisms in turn ensure that the code is executed in a matching environment, including being able to return results.
So, to make things clear, my goal is to write a program which can save crucial parts of its own code (like the main function, other classes, and so on),
which I will just be copying into a class. Then I want my program to add/remove some code (functions, objects, classes, ...) based on user input, and after all of that I want it to regenerate that code, recreate the class that holds the crucial parts automatically, compile it, and delete itself.
So I have all of the above planned and figured out except the part where I want it to compile that code. Is there any way to link g++ into my code? I know that g++ has a main function; wouldn't that create problems with my main function?
Plus, I can't use the compiler on the existing system, and I can't have the compiler as a separate executable...
You don't need to link your code with a compiler. You could package your executable with a compiler. Your code could generate C++ source code and then call the compiler to generate a new executable.
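A bare-bones sketch of that flow; the compiler invocation and the file names here are illustrative only, and it assumes a g++ binary packaged with (or at least reachable from) your program:

#include <cstdlib>
#include <fstream>

int main()
{
    // 1. emit the regenerated source code
    std::ofstream src("generated.cpp");
    src << "#include <iostream>\n"
           "int main() { std::cout << \"regenerated\\n\"; return 0; }\n";
    src.close();

    // 2. call the bundled compiler as a separate process
    int rc = std::system("g++ -o regenerated generated.cpp");
    if (rc != 0)
        return 1;

    // 3. hand control to the freshly built executable
    return std::system("./regenerated");
}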
Keep in mind that most compilers are huge in size. Try installing G++ or MinGW on your system to get an understanding.
For more details, search the internet for "Compiler Design Theory" which will give you information about translating languages like C++ into an executable.
Also, you will need to have the Operating System launch the new executable (and kill the present instance). This will take some research into the Operating System's API.
IMHO, the best method is to use an interpreted or scripting language. The alternative is to have your C++ program execute a script or pieces of a script.
Edit 1: very low level
At the very lowest level, machine code (the instruction bytes that the processor executes) needs to be generated.
The steps would be:
Generate the machine code and place it in some known location in memory (that you have access to and that has execute permission).
Transfer execution to that code, remembering to push the return address on the stack.
The hard part is generating the machine code, especially adjusting the target addresses of all the branch instructions (unless you use something called Position Independent Code).
You could spend many months or years writing code that generates the machine code, or figure out how to use pieces of compilers (like Clang or G++).
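A tiny illustration of the two steps above, assuming a POSIX system on x86-64 (note that some hardened systems refuse writable-and-executable mappings); the bytes encode "mov eax, 42; ret", so the buffer can be called as a function returning int:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00,  /* mov eax, 42 */
                             0xC3 };                        /* ret         */

    /* 1. place the generated bytes in memory with execute permission */
    void *mem = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;
    memcpy(mem, code, sizeof code);

    /* 2. transfer execution there; the call pushes the return address,
          and the final ret brings us back */
    int (*fn)(void) = (int (*)(void))mem;
    printf("%d\n", fn());   /* prints 42 */

    munmap(mem, sizeof code);
    return 0;
}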
I want to know if it is possible to "edit" the code inside an already compiled DLL.
I.e., imagine that there is a function called sum(a,b) inside Math.dll which adds the two numbers a and b.
Let's say I've lost the source code of my DLL, so the only thing I have is the binary DLL file.
Is there a way I could open that binary file, locate where my function resides, and replace the sum(a,b) routine with, for example, another routine that returns the multiplication of a and b (instead of the sum)?
In summary, is it possible to edit binary code files?
Maybe using reverse-engineering tools like OllyDbg?
Yes it is definitely possible (as long as the DLL isn't cryptographically signed), but it is challenging. You can do it with a simple Hex editor, though depending on the size of the DLL you may have to update a lot of sections. Don't try to read the raw binary, but rather run it through a disassembler.
Inside the compiled binary you will see a bunch of esoteric bytes. All of the opcodes that are normally written in assembly as instructions like "call," "jmp," etc. will be translated to the machine architecture dependent byte equivalent. If you use a disassembler, the disassembler will replace these binary values with assembly instructions so that it is much easier to understand what is happening.
Inside the compiled binary you will also see a lot of references to hard coded locations. For example, instead of seeing "call add()" it will be "call 0xFFFFF." The value here is typically a reference to an instruction sitting at a particular offset in the file. Usually this is the first instruction belonging to the function being called. Other times it is stack setup/cleanup code. This varies by compiler.
As long as the instructions you replace are the exact same size as the original instructions, your offsets will still be correct and you won't need to update the rest of the file. However if you change the size of the instructions you replace, you'll need to manually update all references to locations (this is really tedious btw).
Hint: If the instructions you're adding are smaller than what you replaced, you can pad the rest with NOPs to keep the locations from getting off.
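As a crude illustration of an in-place patch (the offset here is completely made up; a disassembler tells you the real one), overwriting a few instruction bytes with x86 NOPs (0x90) so the surrounding offsets stay valid:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("Math.dll", "r+b");   /* open for in-place binary update */
    if (!f) return 1;

    long offset = 0x1234;                 /* hypothetical offset of old code */
    unsigned char nops[5] = { 0x90, 0x90, 0x90, 0x90, 0x90 };

    fseek(f, offset, SEEK_SET);
    fwrite(nops, 1, sizeof nops, f);      /* same size as what was replaced  */

    fclose(f);
    return 0;
}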
Hope that helps, and happy hacking :-)
Detours, a library for instrumenting arbitrary Win32 functions on x86 machines. Detours intercepts Win32 functions by re-writing target function images. The Detours package also contains utilities to attach arbitrary DLLs and data segments (called payloads) to any Win32 binary.
You can, of course, hex-edit the DLL to your heart's content and do all sorts of fancy things. But the question is why go to all that trouble if your intention is to replace the function to begin with?
Create a new DLL with the new function, and change the code that calls the function in the old DLL to call the function in the new DLL.
Or did you lose the source code to the application as well? ;)
I know many have asked this question before, but as far as I can see, there's no clear answer that helps C++ beginners. So, here's my question (or request if you like),
Say I'm writing C++ code using Xcode or any text editor, and I want to use some of the tools provided by another C++ program, for instance an executable. So, how can I call that executable file from my code?
Also, can I exploit other functions/objects/classes provided in a C++ program and use them in my C++ code via this calling technique? Or is it just executables that I can call?
I hope someone could provide a clear answer that beginners can absorb.. :p
So, how can I call that executable file in my code?
The easiest way is to use system(). For example, if the executable is called tool, then:
system( "tool" );
However, there are a lot of caveats with this technique. This call just asks the operating system to do something, but each operating system can understand or answer the same command differently.
For example:
system( "pause" );
...will work on Windows, stopping the execution, but not on other operating systems. Also, the rules regarding spaces inside the path to the file are different. Finally, even the path separator can be different ('\' on Windows only).
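A small sketch of calling an external program and checking whether the call worked at all ("tools/tool" is just an example path, and keep in mind that the exact meaning of the value returned by system() is implementation-defined):

#include <cstdlib>
#include <iostream>

int main()
{
    int rc = std::system("tools/tool input.txt");   // run the other program
    if (rc != 0)
        std::cerr << "tool failed or could not be started (code " << rc << ")\n";
    return 0;
}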
And can I also exploit other functions/objects/classes... from a C++ program and use them in my C++ code via this calling technique?
Not really. If you want to use classes or functions created by others, you will have to get their source code and compile it with your program. This is probably one of the easiest ways to do it, provided that the source code is small enough.
Many times, people create libraries, which are collections of useful classes and/or functions. If the library is distributed in binary form, then you'll need the dll file (or the equivalent for other OSes), and a header file describing the classes and functions provided by the library. This is a rich source of frustration for C++ programmers, since even libraries created with different compilers on the same operating system are potentially incompatible. That's why libraries are often distributed in source code form, with a list of instructions (a makefile, or even worse) to obtain a binary version in a single file, plus a header file, as described before.
This is because the C++ standard does not specify the low-level stuff that happens inside a compiler. There are lots of implementation details that were deliberately left for compiler vendors to handle as they wanted, possibly trying to achieve better performance. This unfortunately means that it is difficult to distribute a simple library.
You can call another program easily - this will start an entirely separate copy of the program. See the system() or exec() family of calls.
This is common in unix where there are lots of small programs which take an input stream of text, do something and write the output to the next program. Using these you could sort or search a set of data without having to write any more code.
On Windows it's easy to start the default application for a file automatically, so you could write a PDF file and start the default app for viewing a PDF. What is harder on Windows is to control a separate GUI program: unless the program has been deliberately written to allow remote control (e.g. with COM/OLE on Windows), you can't control anything the user does in that program.
test.c:
int main()
{
return 0;
}
I haven't used any flags (I am a newbie to gcc), just the command:
gcc test.c
I have used the latest TDM build of GCC on win32.
The resulting executable is almost 23KB, way too big for an empty program.
How can I reduce the size of the executable?
Don't follow its suggestions, but for amusement sake, read this 'story' about making the smallest possible ELF binary.
How can I reduce its size?
Don't do it. You're just wasting your time.
Use -s flag to strip symbols (gcc -s)
By default some standard libraries (e.g. the C runtime) are linked with your executable. Check out the options -nostdlib, -nostartfiles and -nodefaultlibs for details. The link options are described here.
For a real program, a second option is to try the optimization options, e.g. -Os (optimize for size).
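As an illustration of how far those options go, here is a minimal sketch of a program built without the standard startup files; it assumes Linux on x86-64 (on Windows the entry point and the exit mechanism are different):

/* build with something like: gcc -Os -s -nostdlib -nostartfiles tiny.c -o tiny
   there is no C runtime, so the program exits via a raw syscall */
void _start(void)
{
    __asm__ volatile (
        "mov $60, %%rax\n\t"   /* syscall number 60 = exit on x86-64 Linux */
        "xor %%rdi, %%rdi\n\t" /* exit status 0                            */
        "syscall"
        : : : "rax", "rdi");
}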
Give up. On x86 Linux, gcc 4.3.2 produces a 5K binary. But wait! That's with dynamic linking! The statically linked binary is over half a meg: 516K. Relax and learn to live with the bloat.
And they said Modula-3 would never go anywhere because of a 200K hello world binary!
In case you wonder what's going on, the Gnu C library is structured such as to include certain features whether your program depends on them or not. These features include such trivia as malloc and free, dlopen, some string processing, and a whole bucketload of stuff that appears to have to do with locales and internationalization, although I can't find any relevant man pages.
Creating small executables for programs that require minimum services is not a design goal for glibc. To be fair, it has also been not a design goal for every run-time system I've ever worked with (about half a dozen).
Actually, if your code does nothing, is it even fair that the compiler still creates an executable? ;-)
Well, on Windows any executable would still have a size, although it can be reasonably small. With the old MS-DOS system, a complete do-nothing application would be just a couple of bytes. (I think four bytes, using the 21h interrupt to close the program.) Then again, those applications were loaded straight into memory.
When the EXE format became more popular, things changed a bit. Now executables had additional information about the process itself, like the relocation of code and data segments plus some checksums and version information.
The introduction of Windows added another header to the format, to tell MS-DOS that it couldn't execute the executable since it needed to run under Windows. And Windows would recognize it without problems.
Of course, the executable format was also extended with resource information, like bitmaps, icons and dialog forms and much, much more.
A do-nothing executable would nowadays be between 4 and 8 kilobytes in size, depending on your compiler and on the methods you've used to reduce its size. It would be at a size where UPX would actually result in bigger executables! Additional bytes in your executable might be added because you added certain libraries to your code. Especially libraries with initialized data or resources will add a considerable number of bytes. Adding debug information also increases the size of the executable.
But while this all makes a nice exercise in reducing size, you could wonder if it's practical to keep worrying about the bloatedness of applications. Modern hard disks divide files up into segments, and for really large disks the difference would be very small. However, the amount of trouble it takes to keep the size as small as possible will slow down development speed, unless you're an expert developer who is used to these optimizations. These kinds of optimizations don't tend to improve performance, and considering the average disk space of most systems, I don't see why it would be practical. (Still, I do optimize my own code in similar ways, but then again, I am experienced with these optimizations.)
Interested in the EXE header? It starts with the letters MZ, for "Mark Zbikowski". The first part is the old-style MS-DOS header for executables and is used as a stub telling MS-DOS that the program is not an MS-DOS executable. (In the binary, you can find the text 'This program cannot be run in DOS mode.', which is basically all it does: displaying that message.) Next is the PE header, which Windows will recognise and use instead of the MS-DOS header. It starts with the letters PE, for Portable Executable. After this second header comes the executable itself, divided into several blocks of code and data. The header contains special relocation tables which tell the OS where to load each block. And if you can keep this to a limit, the final executable can be smaller than 4 KB, but 90% of it would then be header information with no functionality.
I like the way the DJGPP FAQ addressed this many many years ago:
In general, judging code sizes by looking at the size of "Hello" programs is meaningless, because such programs consist mostly of the startup code. ... Most of the power of all these features goes wasted in "Hello" programs. There is no point in running all that code just to print a 15-byte string and exit.
What is the purpose of this exercise?
Even with as low a level language as C, there's still a lot of setup that has to happen before main can be called. Some of that setup is handled by the loader (which needs certain information), some is handled by the code that calls main. And then there's probably a little bit of library code that any normal program would have to have. At the least, there's probably references to the standard libraries, if they are in dlls.
Examining the binary size of the empty program is a worthless exercise in and of itself. It tells you nothing. If you want to learn something about code size, try writing non-empty (and preferably non-trivial) programs. Compare programs that use standard libraries with programs that do everything themselves.
If you really want to know what's going on in that binary (and why it's so big), then find out the executable format, get a binary dump tool, and take the thing apart.
What does 'size a.out' tell you about the size of the code, data, and bss segments? The majority of the code is likely to be the start up code (classically crt0.o on Unix machines) which is invoked by the o/s and does set up work (like sorting out command line arguments into argc, argv) before invoking main().
Run strip on the binary to get rid of the symbols. With gcc version 3.4.4 (cygming special) I drop from 10k to 4K.
You can try linking a custom runtime (the part that calls main) to set up your runtime environment. All programs use the same runtime setup that comes with gcc, but your executable may not need initialized data or zeroed memory. That means you could get rid of unused library functions like memset/memcpy and reduce the crt0 size. When looking for info on this, look at GCC in embedded environments; embedded developers are generally the only people that use custom runtime environments.
The rest is overhead for the OS that loads the executable. You are not going to save much there unless you tune it by hand.
Using GCC, compile your program using -Os rather than one of the other optimization flags (-O2 or -O3). This tells it to optimize for size rather than speed. Incidentally, it can sometimes make programs run faster than the speed optimizations would have, if some critical segment happens to fit more nicely. On the other hand, -O3 can actually induce code-size increases.
There might also be some linker flags telling it to leave out unused code from the final binary.