Tracking native instructions in Intel PIN [duplicate] - c++

This question already has an answer here:
What instructions 'instCount' Pin tool counts?
(1 answer)
Closed 5 years ago.
I am using the Intel PIN tool to do some analysis on the assembly instructions of a C program. I have a simple C program that prints "Hello World", which I have compiled into an executable. I have the assembly instruction trace generated from gdb, like this:
Dump of assembler code for function main:
0x0000000000400526 <+0>: push %rbp
0x0000000000400527 <+1>: mov %rsp,%rbp
=> 0x000000000040052a <+4>: mov $0x4005c4,%edi
0x000000000040052f <+9>: mov $0x0,%eax
0x0000000000400534 <+14>: callq 0x400400 <printf@plt>
0x0000000000400539 <+19>: mov $0x0,%eax
0x000000000040053e <+24>: pop %rbp
0x000000000040053f <+25>: retq
End of assembler dump.
I ran a pintool to which I gave the executable as input; it traces instructions and prints the instruction count. I wish to trace the instructions that come from my C program, and ideally get the machine opcodes and do some kind of analysis. I am using this C++ PIN tool to count the number of instructions:
#include "pin.H"
#include <iostream>
#include <stdio.h>
UINT64 icount = 0;
using namespace std;
//====================================================================
// Analysis Routines
//====================================================================
void docount(THREADID tid) {
icount++;
}
//====================================================================
// Instrumentation Routines
//====================================================================
VOID Instruction(INS ins, void *v) {
INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_THREAD_ID, IARG_END);
}
VOID Fini(INT32 code, VOID *v) {
printf("count = %ld\n",(long)icount);
}
INT32 Usage() {
PIN_ERROR("This Pintool failed\n"
+ KNOB_BASE::StringKnobSummary() + "\n");
return -1;
}
int main(int argc, char *argv[]) {
if (PIN_Init(argc, argv)) return Usage();
PIN_InitSymbols();
PIN_AddInternalExceptionHandler(ExceptionHandler,NULL);
INS_AddInstrumentFunction(Instruction, 0);
PIN_AddFiniFunction(Fini, 0);
PIN_StartProgram();
return 0;
}
When I run my Hello World program under this tool, I get icount = 81563. I understand that PIN adds its own instructions for analysis, but I don't understand how it adds so many when my C program has fewer than 10 instructions. Also, is there a way to identify which assembly instructions come from my code and which are generated by PIN? I can find no way to differentiate between the two. Please help!

You're not measuring what you think you're measuring. See my answer here for details:
What instructions 'instCount' Pin tool counts?
Pin does not count its own instructions. The large count is the result of preparation before and after main() and the call to printf().
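If you want to restrict the count to instructions that belong to your own executable (rather than the dynamic loader, libc startup, and shared libraries), one common approach is to filter by image in the instrumentation routine. A minimal sketch, assuming Pin's IMG API (IMG_FindByAddress, IMG_IsMainExecutable) behaves as documented; note this still counts everything linked into the main binary, including CRT startup code:

VOID Instruction(INS ins, VOID *v) {
    // Only instrument instructions whose address falls inside the
    // main executable image; this skips libc, ld.so, and friends.
    IMG img = IMG_FindByAddress(INS_Address(ins));
    if (IMG_Valid(img) && IMG_IsMainExecutable(img))
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount,
                       IARG_THREAD_ID, IARG_END);
}

To narrow it further to a single routine such as main, RTN-based instrumentation (RTN_AddInstrumentFunction together with RTN_Name) is the usual route.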

Related

Delete instruction using PIN

I am using the Intel PIN tool to emulate some new instructions and check the corresponding results. For this purpose I am using illegal opcodes of x86_64 to represent my instructions. For example, opcodes 0x16 and 0x17 are illegal in x86_64, so I use them to represent my instruction opcodes. I am using a C program to generate an executable and then passing it to the Pintool. The C program I am using is this:
#include <stdio.h>

int main()
{
    asm(".byte 0x16");
    asm(".byte 0x17");
    return 0;
}
So in the instruction trace, 0x16 and 0x17 will appear as bad instructions, and if we try to run the executable we get:
Illegal instruction (core dumped)
which is expected, as 0x16 and 0x17 are illegal in x86_64 and hence the executable should not run. I am using this executable as input to my Pintool, which examines the instruction trace and will therefore encounter 0x16 and 0x17.
The Pintool I am using is this:
#include "pin.H"
#include <iostream>
#include <fstream>
#include <cstdint>
UINT64 icount = 0;
using namespace std;
KNOB<string> KnobOutputFile(KNOB_MODE_WRITEONCE, "pintool", "o", "test.out","This pin tool simulates ULI");
FILE * op;
//====================================================================
// Analysis Routines
//====================================================================
VOID simulate_ins(VOID *ip, UINT32 size) {
fprintf(op,"Wrong instruction encountered here\n");
// Do something based on the instruction
}
//====================================================================
// Instrumentation Routines
//====================================================================
VOID Instruction(INS ins, void *v) {
UINT8 opcodeBytes[15];
UINT64 fetched = PIN_SafeCopy(&opcodeBytes[0],(void *)INS_Address(ins),INS_Size(ins));
if (fetched != INS_Size(ins))
fprintf(op,"\nBad\n");
else {
if(opcodeBytes[0]==0x16 || opcodeBytes[0]==0x17) {
INS_InsertCall( ins, IPOINT_BEFORE, (AFUNPTR)simulate_ins, IARG_INST_PTR, IARG_UINT64, INS_Size(ins) , IARG_END);
INS_Delete(ins);
}
}
VOID Fini(INT32 code, VOID *v) {
//Display some end result
}
INT32 Usage() {
PIN_ERROR("This Pintool failed\n" + KNOB_BASE::StringKnobSummary() + "\n");
return -1;
}
int main(int argc, char *argv[])
{
op = fopen("test.out", "w");
if (PIN_Init(argc, argv))
return Usage();
PIN_InitSymbols();
PIN_AddInternalExceptionHandler(ExceptionHandler,NULL);
INS_AddInstrumentFunction(Instruction, 0);
PIN_AddFiniFunction(Fini, 0);
PIN_StartProgram();
return 0;
}
So I extract the instruction bytes, and if the first byte is 0x16 or 0x17 I send the instruction to my analysis routine and then delete it. However, when I run this Pintool on the executable I still get the Illegal instruction (core dumped) error and my code fails to run. My understanding is that the instrumentation routine is called every time a new instruction is encountered in the trace, and the analysis routine is called before the instruction is executed. Here I check the opcode and, based on the result, send the instruction to the analysis routine and delete it. I will be simulating my new instruction in the analysis routine, so I just need to delete the old instruction, let the program proceed further, and make sure it doesn't raise the illegal-instruction error again.
Am I doing something wrong anywhere?
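One possibility worth checking, sketched here as a guess rather than a confirmed diagnosis: Pin must be able to decode an instruction before the Instruction callback can see it, so an opcode that is illegal in 64-bit mode may never reach instrumentation at all, in which case INS_Delete never runs. An alternative route is to intercept the SIGILL the opcode raises and emulate the instruction there. A minimal sketch, assuming Pin's PIN_InterceptSignal API and its documented callback signature:

#include <signal.h>

// Returning FALSE tells Pin to squash the signal instead of
// delivering it to the application.
BOOL OnSigill(THREADID tid, INT32 sig, CONTEXT *ctxt,
              BOOL hasHandler, const EXCEPTION_INFO *pExceptInfo, VOID *v) {
    ADDRINT ip = PIN_GetContextReg(ctxt, REG_INST_PTR);
    UINT8 byte0;
    if (PIN_SafeCopy(&byte0, (VOID *)ip, 1) == 1 &&
        (byte0 == 0x16 || byte0 == 0x17)) {
        // ... emulate the new instruction here ...
        PIN_SetContextReg(ctxt, REG_INST_PTR, ip + 1); // skip the 1-byte opcode
        return FALSE; // suppress SIGILL and resume at the updated IP
    }
    return TRUE; // not one of ours: deliver the signal normally
}

// In main(), before PIN_StartProgram():
//     PIN_InterceptSignal(SIGILL, OnSigill, NULL);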

How to write x64 machine code into virtual memory and execute it for Windows in C++

I have been wondering how V8 JavaScript Engine and any other JIT compilers execute the generated code.
Here are the articles I read during my attempt to write a small demo.
http://eli.thegreenplace.net/2013/11/05/how-to-jit-an-introduction
http://nullprogram.com/blog/2015/03/19/
I only know very little about assembly, so I initially used http://gcc.godbolt.org/ to write a function and get the disassembled output, but the code is not working on Windows.
I then wrote a small C++ program, compiled it with -g -Og, and got the disassembled output with gdb.
#include <stdio.h>

int square(int num) {
    return num * num;
}

int main() {
    printf("%d\n", square(10));
    return 0;
}
Output:
Dump of assembler code for function square(int):
=> 0x00000000004015b0 <+0>: imul %ecx,%ecx
0x00000000004015b3 <+3>: mov %ecx,%eax
0x00000000004015b5 <+5>: retq
I copy-pasted the output ('%' removed) into an online x86 assembler and got { 0x0F, 0xAF, 0xC9, 0x89, 0xC1, 0xC3 }.
Here is my final code. If I compile it with gcc, I always get 1. If I compile it with VC++, I get a random number. What is going on?
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <windows.h>

typedef unsigned char byte;
typedef int (*int0_int)(int);

const byte square_code[] = {
    0x0f, 0xaf, 0xc9,
    0x89, 0xc1,
    0xc3
};

int main() {
    byte* buf = reinterpret_cast<byte*>(VirtualAlloc(0, 1 << 8,
        MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
    if (buf == nullptr) return 0;
    memcpy(buf, square_code, sizeof(square_code));
    {
        DWORD old;
        VirtualProtect(buf, 1 << 8, PAGE_EXECUTE_READ, &old);
    }
    int0_int square = reinterpret_cast<int0_int>(buf);
    int ans = square(100);
    printf("%d\n", ans);
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}
Note
I am trying to learn how JIT works, so please do not suggest that I use LLVM or any other library. I promise I will use a proper JIT library in a real project rather than writing one from scratch.
Note: as Ben Voigt points out in the comments, this is really only valid for x86, not x86_64. For x86_64 you just have some errors in your assembly (which are errors in x86 as well), as Ben Voigt also points out in his answer.
This is happening because your compiler could see both sides of the function call when you generated your assembly. Since the compiler was in control of generating code for both the caller and the callee, it didn't have to follow the cdecl calling convention, and it didn't.
The default calling convention for MSVC is cdecl. Basically, function parameters are pushed onto the stack in the reverse of the order they're listed, so a call to foo(10, 100) could result in the assembly:
push 100
push 10
call foo(int, int)
In your case, the compiler will generate something like the following at the call site:
push 100
call esi ; assuming the address of your code is in the register esi
That's not what your code is expecting though. Your code is expecting its argument to be passed in the register ecx, not the stack.
The compiler has used what looks like the fastcall calling convention. If I compile a similar program (I get slightly different assembly) I get the expected result:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <windows.h>

typedef unsigned char byte;
typedef int (__fastcall *int0_int)(int);

const byte square_code[] = {
    0x8b, 0xc1,       // mov eax, ecx
    0x0f, 0xaf, 0xc0, // imul eax, eax
    0xc3              // ret
};

int main() {
    byte* buf = reinterpret_cast<byte*>(VirtualAlloc(0, 1 << 8,
        MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
    if (buf == nullptr) return 0;
    memcpy(buf, square_code, sizeof(square_code));
    {
        DWORD old;
        VirtualProtect(buf, 1 << 8, PAGE_EXECUTE_READ, &old);
    }
    int0_int square = reinterpret_cast<int0_int>(buf);
    int ans = square(100);
    printf("%d\n", ans);
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}
Note that I've told the compiler to use the __fastcall calling convention. If you want to use cdecl, the assembly would need to look more like this:
push ebp
mov ebp, esp
mov eax, DWORD PTR _n$[ebp]
imul eax, eax
pop ebp
ret 0
(DISCLAIMER: I'm not great at assembly, and that was generated by Visual Studio)
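For comparison, a cdecl version of square would read its argument from the stack instead of ecx. A sketch with hand-assembled bytes (double-check them with an assembler before relying on them):

typedef int (__cdecl *int0_int)(int);

const byte square_cdecl[] = {
    0x8b, 0x44, 0x24, 0x04, // mov eax, [esp+4] ; fetch the argument
    0x0f, 0xaf, 0xc0,       // imul eax, eax    ; square it
    0xc3                    // ret              ; result in eax
};

Since the caller cleans up the stack under cdecl, a plain ret with no operand is enough here.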
I copy-pasted the output ('%' removed)
Well, that means your second instruction was
mov ecx, eax
which makes no sense at all (it overwrites the result of the multiplication with the uninitialized return value).
On the other hand
mov eax, foo
ret
is a very common pattern for ending a function with non-void return type.
The difference between your two assembly languages (AT&T style vs Intel style) is more than just the % marker: the operand order is reversed, and pointers and offsets are denoted very differently as well.
You'll want to issue a set disassembly-flavor intel command in gdb.
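To see why the operand order matters here, compare the same instruction in both flavors; feeding AT&T operand order into an Intel-syntax assembler silently reverses the data flow:

# AT&T (gdb default):   mov %ecx,%eax   ; copies ecx INTO eax
# Intel equivalent:     mov eax, ecx
# Pasting the AT&T operand order into an Intel-syntax assembler
# instead produces mov ecx, eax -- the reverse of what was intended.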

Why is _mm_set_epi16 sometimes faster than _mm_load_si128?

My understanding is that it's best to avoid _mm_set_epi*, and instead rely on _mm_load_si128 (or even _mm_loadu_si128, with a small performance hit if the data is not aligned). However, the impact this has on performance seems inconsistent to me. The following is a good example.
Consider the two following functions that utilize SSE intrinsics:
static uint32_t clmul_load(uint16_t x, uint16_t y)
{
    const __m128i c = _mm_clmulepi64_si128(
        _mm_load_si128((__m128i const*)(&x)),
        _mm_load_si128((__m128i const*)(&y)), 0);
    return _mm_extract_epi32(c, 0);
}

static uint32_t clmul_set(uint16_t x, uint16_t y)
{
    const __m128i c = _mm_clmulepi64_si128(
        _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, x),
        _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, y), 0);
    return _mm_extract_epi32(c, 0);
}
The following function benchmarks the performance of the two:
template <typename F>
void benchmark(int t, F f)
{
    std::mt19937 rng(static_cast<unsigned int>(std::time(0)));
    std::uniform_int_distribution<uint32_t> uint_dist10(
        0, std::numeric_limits<uint32_t>::max());
    std::vector<uint32_t> vec(t);
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < t; ++i)
    {
        vec[i] = f(uint_dist10(rng), uint_dist10(rng));
    }
    auto duration = std::chrono::duration_cast<
        std::chrono::milliseconds>(
            std::chrono::high_resolution_clock::now() - start);
    std::cout << (duration.count() / 1000.0) << " seconds.\n";
}
Finally, the following main program does some testing:
int main()
{
    const int N = 10000000;
    benchmark(N, clmul_load);
    benchmark(N, clmul_set);
}
On an i7 Haswell with MSVC 2013, a typical output is
0.208 seconds. // _mm_load_si128
0.129 seconds. // _mm_set_epi16
Using GCC with parameters -O3 -std=c++11 -march=native (with slightly older hardware), a typical output is
0.312 seconds. // _mm_load_si128
0.262 seconds. // _mm_set_epi16
What explains this? Are there actually cases where _mm_set_epi* is preferable to _mm_load_si128? There are other times where I've noticed _mm_load_si128 performing better, but I can't really characterize those observations.
Your compiler is optimizing away the "gather" behavior of your _mm_set_epi16() call since it really isn't needed. From g++ 4.8 (-O3) and gdb:
(gdb) disas clmul_load
Dump of assembler code for function clmul_load(uint16_t, uint16_t):
0x0000000000400b80 <+0>: mov %di,-0xc(%rsp)
0x0000000000400b85 <+5>: mov %si,-0x10(%rsp)
0x0000000000400b8a <+10>: vmovdqu -0xc(%rsp),%xmm0
0x0000000000400b90 <+16>: vmovdqu -0x10(%rsp),%xmm1
0x0000000000400b96 <+22>: vpclmullqlqdq %xmm1,%xmm0,%xmm0
0x0000000000400b9c <+28>: vmovd %xmm0,%eax
0x0000000000400ba0 <+32>: retq
End of assembler dump.
(gdb) disas clmul_set
Dump of assembler code for function clmul_set(uint16_t, uint16_t):
0x0000000000400bb0 <+0>: vpxor %xmm0,%xmm0,%xmm0
0x0000000000400bb4 <+4>: vpxor %xmm1,%xmm1,%xmm1
0x0000000000400bb8 <+8>: vpinsrw $0x0,%edi,%xmm0,%xmm0
0x0000000000400bbd <+13>: vpinsrw $0x0,%esi,%xmm1,%xmm1
0x0000000000400bc2 <+18>: vpclmullqlqdq %xmm1,%xmm0,%xmm0
0x0000000000400bc8 <+24>: vmovd %xmm0,%eax
0x0000000000400bcc <+28>: retq
End of assembler dump.
The vpinsrw (insert word) is ever-so-slightly faster than the unaligned double-quadword move from clmul_load, likely due to the internal load/store unit being able to do the smaller reads simultaneously but not the 16B ones. If you were doing more arbitrary loads, this would go away, obviously.
The slowness of _mm_set_epi* comes from the need to scrape together various variables into a single vector. You'd have to examine the generated assembly to be certain, but my guess is that since most of the arguments to your _mm_set_epi16 calls are constants (and zeroes, at that), GCC is generating a fairly short and fast set of instructions for the intrinsic.
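As a side note, _mm_load_si128((__m128i const*)&x) reads 16 bytes starting at a 2-byte object, which is undefined behavior even when it happens to work (the vmovdqu above reads past x into neighboring stack bytes). Here is a sketch of a third variant that avoids both the over-read and most of the gather cost, assuming SSE2's _mm_cvtsi32_si128 (a single movd that zeroes the upper lanes):

static uint32_t clmul_cvt(uint16_t x, uint16_t y)
{
    // movd: the low 32 bits hold the zero-extended scalar,
    // the rest of the vector is zeroed
    const __m128i c = _mm_clmulepi64_si128(
        _mm_cvtsi32_si128(x),
        _mm_cvtsi32_si128(y), 0);
    return _mm_extract_epi32(c, 0);
}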

Why are there 8 bytes between the end of a buffer and the saved frame pointer?

I am doing a stack-smashing exercise for coursework, and I have already completed the assignment, but there is one aspect that I do not understand.
Here is the target program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int bar(char *arg, char *out)
{
    strcpy(out, arg);
    return 0;
}

void foo(char *argv[])
{
    char buf[256];
    bar(argv[1], buf);
}

int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        fprintf(stderr, "target1: argc != 2\n");
        exit(EXIT_FAILURE);
    }
    foo(argv);
    return 0;
}
Here are the commands used to compile it, on an x86 virtual machine running Ubuntu 12.04, with ASLR disabled.
gcc -ggdb -m32 -g -std=c99 -D_GNU_SOURCE -fno-stack-protector -m32 target1.c -o target1
execstack -s target1
When I look at the memory of this program on the stack, I see that buf has the address 0xbffffc40. Moreover, the saved frame pointer is stored at 0xbffffd48, and the return address is stored at 0xbffffd4c.
These specific addresses are not relevant, but I observe that even though buf only has length 256, the distance 0xbffffd48 - 0xbffffc40 = 264. Symbolically, this computation is $fp - buf.
Why are there 8 extra bytes between the end of buf and the stored frame pointer on the stack?
Here is some disassembly of the function foo. I have already examined it, but I did not see any obvious use of that memory region, unless it is implicit (i.e., a side effect of some instruction).
0x080484ab <+0>: push %ebp
0x080484ac <+1>: mov %esp,%ebp
0x080484ae <+3>: sub $0x118,%esp
0x080484b4 <+9>: mov 0x8(%ebp),%eax
0x080484b7 <+12>: add $0x4,%eax
0x080484ba <+15>: mov (%eax),%eax
0x080484bc <+17>: lea -0x108(%ebp),%edx
0x080484c2 <+23>: mov %edx,0x4(%esp)
0x080484c6 <+27>: mov %eax,(%esp)
0x080484c9 <+30>: call 0x804848c <bar>
0x080484ce <+35>: leave
0x080484cf <+36>: ret
Basile Starynkevitch gets the prize for mentioning alignment.
It turns out that gcc 4.7.2 defaults to aligning the frame boundary to a 4-word boundary. On 32-bit emulated hardware, that is 16 bytes. Since the saved frame pointer and the saved instruction pointer together take up only 8 bytes, the compiler put another 8 bytes after the end of buf to align the top of the stack frame to a 16-byte boundary.
With the following additional compiler flag, the 8 bytes disappear, because the 8 bytes already present are enough to align to a 2-word boundary.
-mpreferred-stack-boundary=2
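If you want to confirm the gap without reading the disassembly, gcc's __builtin_frame_address can print it directly. A quick sketch (gcc-specific; compile without optimization so buf isn't elided, and note the exact gap depends on compiler version and flags):

#include <stdio.h>

void probe(void)
{
    char buf[256];
    // __builtin_frame_address(0) is this function's frame pointer,
    // i.e. the location of the saved %ebp; on the setup above the
    // difference should come out to 264 rather than 256.
    printf("gap = %ld\n", (long)((char *)__builtin_frame_address(0) - buf));
}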

Super weird segfault with gcc 4.7 -- Bug?

Here is a piece of code that I've been trying to compile:
#include <cstdio>

#define N 3

struct Data {
    int A[N][N];
    int B[N];
};

int foo(int uloc, const int A[N][N], const int B[N])
{
    for (unsigned int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++) {
            for (int r = 0; r < N; r++) {
                for (int q = 0; q < N; q++) {
                    uloc += B[i]*A[r][j] + B[j];
                }
            }
        }
    }
    return uloc;
}

int apply(const Data *d)
{
    return foo(4, d->A, d->B);
}

int main(int, char **)
{
    Data d;
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            d.A[i][j] = 0.0;
        }
        d.B[i] = 0.0;
    }
    int res = 11 + apply(&d);
    printf("%d\n", res);
    return 0;
}
Yes, it looks quite strange and does not do anything useful at the moment, but it is the most concise version of a much larger program in which I had the problem initially.
It compiles and runs just fine with GCC (g++) 4.4 and 4.6, but if I use GCC 4.7 and enable third-level optimizations:
g++-4.7 -g -O3 prog.cpp -o prog
I get a segmentation fault when running it. Gdb does not really give much information on what went wrong:
(gdb) run
Starting program: /home/kalle/work/code/advect_diff/c++/strunt
Program received signal SIGSEGV, Segmentation fault.
apply (d=d#entry=0x7fffffffe1a0) at src/strunt.cpp:25
25 int apply(const Data *d)
(gdb) bt
#0 apply (d=d#entry=0x7fffffffe1a0) at src/strunt.cpp:25
#1 0x00000000004004cc in main () at src/strunt.cpp:34
I've tried tweaking the code in different ways to see if the error goes away. It seems necessary to have all four loop levels in foo, and I have not been able to reproduce it with a single level of function calls. Oh yeah, the outermost loop must use an unsigned loop index.
I'm starting to suspect that this is a bug in the compiler or runtime, since it is specific to version 4.7 and I cannot see what memory accesses are invalid.
Any insight into what is going on would be very much appreciated.
It is possible to get the same situation with the C-version of GCC, with a slight modification of the code.
My system is:
Debian wheezy
Linux 3.2.0-4-amd64
GCC 4.7.2-5
Okay so I looked at the disassembly offered by gdb, but I'm afraid it doesn't say much to me:
Dump of assembler code for function apply(Data const*):
0x0000000000400760 <+0>: push %r13
0x0000000000400762 <+2>: movabs $0x400000000,%r8
0x000000000040076c <+12>: push %r12
0x000000000040076e <+14>: push %rbp
0x000000000040076f <+15>: push %rbx
0x0000000000400770 <+16>: mov 0x24(%rdi),%ecx
=> 0x0000000000400773 <+19>: mov (%rdi,%r8,1),%ebp
0x0000000000400777 <+23>: mov 0x18(%rdi),%r10d
0x000000000040077b <+27>: mov $0x4,%r8b
0x000000000040077e <+30>: mov 0x28(%rdi),%edx
0x0000000000400781 <+33>: mov 0x2c(%rdi),%eax
0x0000000000400784 <+36>: mov %ecx,%ebx
0x0000000000400786 <+38>: mov (%rdi,%r8,1),%r11d
0x000000000040078a <+42>: mov 0x1c(%rdi),%r9d
0x000000000040078e <+46>: imul %ebp,%ebx
0x0000000000400791 <+49>: mov $0x8,%r8b
0x0000000000400794 <+52>: mov 0x20(%rdi),%esi
What should I see when I look at this?
Edit 2015-08-13: This seems to be fixed in g++ 4.8 and later.
You never initialized d. Its value is indeterminate, and trying to do math with its contents is undefined behavior. (Even trying to read its values without doing anything with them is undefined behavior.) Initialize d and see what happens.
Now that you've initialized d and it still fails, that looks like a real compiler bug. Try updating to 4.7.3 or 4.8.2; if the problem persists, submit a bug report. (The list of known bugs currently appears to be empty, or at least the link is going somewhere that only lists non-bugs.)
It is indeed, unfortunately, a bug in gcc. I have not the slightest idea what it is doing there, but the generated assembly for the apply function is (I compiled it without main, by the way, and foo is inlined into it):
_Z5applyPK4Data:
pushq %r13
movabsq $17179869184, %r8
pushq %r12
pushq %rbp
pushq %rbx
movl 36(%rdi), %ecx
movl (%rdi,%r8), %ebp
movl 24(%rdi), %r10d
and it crashes exactly at the movl (%rdi,%r8), %ebp, since it adds a nonsensical 0x400000000 to %rdi (the first parameter, i.e., the pointer to Data) and dereferences it.
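Given the asker's observation that the crash only reproduces when the outermost loop uses an unsigned index, a pragmatic workaround while staying on 4.7 (a sketch based on that observation, not on an analysis of the underlying gcc bug) is to make the index signed:

int foo(int uloc, const int A[N][N], const int B[N])
{
    for (int j = 0; j < N; j++) {   // signed index sidesteps the miscompile here
        for (int i = 0; i < N; i++) {
            for (int r = 0; r < N; r++) {
                for (int q = 0; q < N; q++) {
                    uloc += B[i]*A[r][j] + B[j];
                }
            }
        }
    }
    return uloc;
}

Lowering the optimization level for the affected translation unit, or upgrading to g++ 4.8 as noted in the edit above, may also avoid it.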