How to inline self-modifying assembly code in C++?

How would I inline this in a C++ function?
0041F84E . 7B 02 JPO SHORT Unmodifi.0041F852
0041F850 B8 DB B8
0041F851 00 DB 00
0041F852 . 8B46 38 MOV EAX,DWORD PTR DS:[ESI+38]
0041F855 . 8B56 24 MOV EDX,DWORD PTR DS:[ESI+24]
0041F858 . 8B4E 10 MOV ECX,DWORD PTR DS:[ESI+10]
0041F85B . 81EA 8B4B8636 SUB EDX,36864B8B
How would I put
DB B8
DB 00
void test() {
__asm {
...
JPO label_0041F852
__emit 0xB8
__emit 00
label_0041F852:
MOV EAX,DWORD PTR DS:[ESI+0x38]
MOV EDX,DWORD PTR DS:[ESI+0x24]
MOV ECX,DWORD PTR DS:[ESI+0x10]
SUB EDX,0x36864B8B
...
}
}
error C2400: inline assembler syntax error in 'opcode'; found 'constant'
Error executing cl.exe.
I don't think I can put this in the .data section; from what I've read, that's the only place I can include raw bytes like this.

This is an answer-length comment to reply to SSpoke's request for an example. A long time ago, when emulating Turing machines was a cool thing to do, I wrote a Turing machine emulator program to search for busy beavers on a DEC Vax minicomputer. When the program decided which Turing machine to try next, it compiled the machine code for the Turing machine into an array, and called the array as if it was a function. (All this was written in C.)
That's self-modifying code. To run it, you need an area of memory that is simultaneously writable and executable.
Your code is not self-modifying -- you don't write to it at all. So you can run it in a read-only program segment.
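To make the memory requirement concrete, here is a minimal sketch, assuming x86-64 Linux and the POSIX mmap/mprotect calls. The name run_copied_code and the six-byte payload (mov eax, 42; ret) are made up for illustration and do not come from the question. It writes the bytes while the page is writable and only then makes the page executable, which is a stricter variant of the "writable and executable" region described above.

```cpp
#include <cstddef>
#include <cstring>

#if defined(__x86_64__) && defined(__linux__)
#include <sys/mman.h>
#endif

// Hypothetical helper: copies a tiny x86-64 routine into freshly mapped
// memory, calls it, and stores its return value in `out`.
// Returns false on platforms this sketch does not cover or if mapping fails.
bool run_copied_code(int& out) {
#if defined(__x86_64__) && defined(__linux__)
    // Machine code for: mov eax, 42 ; ret
    static const unsigned char payload[] = {0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3};

    // Step 1: get a writable (not yet executable) page.
    void* mem = mmap(nullptr, sizeof payload, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return false;

    // Step 2: "compile" into it; here that is just a byte copy.
    std::memcpy(mem, payload, sizeof payload);

    // Step 3: drop write permission and add execute permission.
    if (mprotect(mem, sizeof payload, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, sizeof payload);
        return false;
    }

    // Step 4: call the bytes as if they were a function.
    auto fn = reinterpret_cast<int (*)()>(mem);
    out = fn();
    munmap(mem, sizeof payload);
    return true;
#else
    (void)out;
    return false;
#endif
}
```

On a supported platform, calling run_copied_code(value) should leave 42 in value. On Windows the analogous calls would be VirtualAlloc and VirtualProtect, which this sketch does not attempt.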

Related

C++ odd assembly output query

Using Windows 10 Pro with Visual Studio 2022, Debug mode, X64 platform, I have the following code...
int main()
{
int var = 1;
int* varPtr = &var;
*varPtr = 10;
return 0;
}
In the disassembly window we see this...
int var = 1;
00007FF75F1D1A0D C7 45 04 01 00 00 00 mov dword ptr [var],1
int* varPtr = &var;
00007FF75F1D1A14 48 8D 45 04 lea rax,[var]
00007FF75F1D1A18 48 89 45 28 mov qword ptr [varPtr],rax
*varPtr = 10;
00007FF75F1D1A1C 48 8B 45 28 mov rax,qword ptr [varPtr]
00007FF75F1D1A20 C7 00 0A 00 00 00 mov dword ptr [rax],0Ah
return 0;
Upon stepping through the above, the RAX register is loaded with the memory address for the stack variable, var, via...
00007FF75F1D1A14 48 8D 45 04 lea rax,[var]
Since RAX is not changed after this, why is that same var address being loaded into RAX again, 2 instructions later with...
00007FF75F1D1A1C 48 8B 45 28 mov rax,qword ptr [varPtr]
The memory view window shows that the &var address is constant throughout. Am I missing something daft?
[Updated] - switching to release mode and optimisation off returns the above in full. Turning on speed/size optimization returns only that "return 0" code. Would be interesting to see if there's a way to force the compiler to compile everything (using fast switch) and force it to not remove what it thought was redundant, for this example. This minimal appears to be too minimal, lol.
Still concerned about that unneeded double load of RAX - primarily, for such a small program, though yes, that's what 'optimisation' is all about. Sill.
When compiling in Debug mode (i.e. with all optimisations disabled), the compiler generates code like this for a reason.
Suppose you are stepping through the code and you stop on the line that reads *varPtr = 10;. At that point, you decide that you loaded the wrong address into varPtr and would like to change it and continue debugging without stopping, rebuilding and restarting your program.
Well, in Debug mode, you can. Just change the address stored in varPtr (in the Watch window, say) and carry on debugging. Without the 'redundant' second load, this wouldn't work. When the compiler emits said load, it does.
So, to summarise, Debug mode is designed to make debugging easier, while Release mode is designed to make your code run as fast (or be as small) as possible, hopefully with the same semantics.
And just be grateful that compiler writers understand the need for these two modes of operation. Without them, our lives as developers would be much, much harder.
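As for forcing the compiler to keep those loads and stores with optimisation turned on, one standard C++ tool is volatile. The following is only a sketch of that idea; the function name store_through_pointer is made up, and the exact instructions emitted will still vary by compiler and flags.

```cpp
// Hypothetical variant of the question's main(): `volatile` tells the
// compiler that every read and write of these objects is observable,
// so it may not fold them away even at /O2 or -O2.
int store_through_pointer() {
    volatile int var = 1;                  // the store of 1 must be emitted
    volatile int* volatile varPtr = &var;  // the pointer itself is volatile too
    *varPtr = 10;                          // varPtr is reloaded, then 10 is stored
    return var;                            // var is read back from memory
}
```

Because varPtr is itself volatile, an optimised build has to reload it before the store, much like the Debug build in the question does.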

Is it possible to write asm in C++ with opcode instead of shellcode

I'm curious if there's a way to use __asm in c++ then write that into memory instead of doing something like:
BYTE shell_code[] = { 0x48, 0x03 ,0x1c ,0x25, 0x0A, 0x00, 0x00, 0x00 };
write_to_memory(function, &shell_code, sizeof(shell_code));
So I would like to do:
asm_code = __asm("add rbx, &variable\n\t""jmp rbx") ;
write_to_memory(function, &asm_code , sizeof(asm_code ));
Worst case I can use GCC and objdump externally or something but hoping there's an internal way
You can put an asm(""); statement at global scope, with start/end labels inside it, and declare those labels as extern char start_code[], end_code[0]; so you can access them from C. C char arrays work most like asm labels, in terms of being able to use the C name and have it work as an address.
// compile with gcc -masm=intel
// AFAIK, no way to do that with clang
asm(
".pushsection .rodata \n" // we don't want to run this from here, it's just data
"start_code: \n"
" add rax, OFFSET variable \n" // *absolute* address as 32-bit sign-extended immediate
"end_code: \n"
".popsection"
);
__attribute__((used)) static int variable = 1;
extern char start_code[], end_code[0]; // C declarations for those asm labels
#include <string.h>
void copy_code(void *dst)
{
memcpy(dst, start_code, end_code - start_code);
}
It would be fine to have the payload code in the default .text section, but we can put it in .rodata since we don't want to run it.
Is that the kind of thing you're looking for? asm output on Godbolt (without assembling + disassembling):
start_code:
add rax, OFFSET variable
end_code:
copy_code(void*):
mov edx, OFFSET FLAT:end_code
mov esi, OFFSET FLAT:start_code
sub rdx, OFFSET FLAT:start_code
jmp [QWORD PTR memcpy@GOTPCREL[rip]]
To see if it actually assembles to what we want, I compiled with
gcc -O2 -fno-plt -masm=intel -fno-pie -no-pie -c foo.c to get a .o. objdump -drwC -Mintel shows:
0000000000000000 <copy_code>:
0: ba 00 00 00 00 mov edx,0x0 1: R_X86_64_32 .rodata+0x6
5: be 00 00 00 00 mov esi,0x0 6: R_X86_64_32 .rodata
a: 48 81 ea 00 00 00 00 sub rdx,0x0 d: R_X86_64_32S .rodata
11: ff 25 00 00 00 00 jmp QWORD PTR [rip+0x0] # 17 <end_code+0x11> 13: R_X86_64_GOTPCRELX memcpy-0x4
And with -D to see all sections, the actual payload is there in .rodata, still not linked yet:
Disassembly of section .rodata:
0000000000000000 <start_code>:
0: 48 05 00 00 00 00 add rax,0x0 2: R_X86_64_32S .data
-fno-pie -no-pie is only necessary for the 32-bit absolute address of variable to work. (Without it, we get two RIP-relative LEAs and a sub rdx, rsi. Unfortunately neither way of compiling gets GCC to subtract the symbols at build time with mov edx, OFFSET end_code - start_code, but that's just in the code doing the memcpy, not in the machine code being copied.)
In a linked executable
We can see how the linker filled in those relocations.
(I tested by using -nostartfiles instead of -c. I didn't want to run it, just look at the disassembly, so there was no point in actually writing a main.)
$ gcc -O2 -fno-plt -masm=intel -fno-pie -no-pie -nostartfiles foo.c
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000401000
$ objdump -D -rwC -Mintel a.out
(manually edited to remove uninteresting sections)
Disassembly of section .text:
0000000000401000 <copy_code>:
401000: ba 06 20 40 00 mov edx,0x402006
401005: be 00 20 40 00 mov esi,0x402000
40100a: 48 81 ea 00 20 40 00 sub rdx,0x402000
401011: ff 25 e1 2f 00 00 jmp QWORD PTR [rip+0x2fe1] # 403ff8 <memcpy@GLIBC_2.14>
The linked payload:
0000000000402000 <start_code>:
402000: 48 05 18 40 40 00 add rax,0x404018 # from add rax, OFFSET variable
0000000000402006 <end_code>:
402006: 48 c7 c2 06 00 00 00 mov rdx,0x6
# this was from mov rdx, OFFSET end_code - start_code to see if that would assemble + link
Our non-zero-init dword variable that we're taking the address of:
Disassembly of section .data:
0000000000404018 <variable>:
404018: 01 00 add DWORD PTR [rax],eax
...
Your specific asm instruction is weird
&variable isn't valid asm syntax, but I'm guessing you wanted to add the address?
Since you're going to be copying the machine code somewhere, you must avoid RIP-relative addressing modes and any other relative references to things outside the block you're copying. Only mov can use 64-bit absolute addresses, like movabs rdi, OFFSET variable instead of the usual lea rdi, [rip + variable]. Also, you can even load / store into/from RAX/EAX/AX/AL with 64-bit absolute addresses movabs eax, [variable]. (mov-immediate can use any register, load/store are only the accumulator. https://www.felixcloutier.com/x86/mov)
(movabs is an AT&T mnemonic, but GAS allows it in .intel_syntax noprefix to force using 64-bit immediates, instead of the default 32-bit-sign-extended.)
This is kind of opposite of normal position-independent code, which works when the whole image is loaded at an arbitrary base. This will make code that works when the image is loaded to a fixed base (or even variable since runtime fixups should work for symbolic references), and then copied around relative to the rest of your code. So all your memory refs have to be absolute, except for within the asm.
So we couldn't have made PIE-compatible machine code by using lea rdx, [RIP+variable] / add rax, rdx - that would only get the right address for variable when run from the linked location in .rodata, not from any copy. (Unless you manually fixup the code when copying it, but it's still only a rel32 displacement.)
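If the absolute address is only known at copy time, one option is to build the mov r64, imm64 instruction (what GAS calls movabs) yourself. The sketch below assumes the standard encoding REX.W + (B8+rd) followed by a little-endian 64-bit immediate; the function name encode_mov_r64_imm64 is made up for illustration.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: encodes `mov r64, imm64` for register number `reg`
// (0 = RAX, 1 = RCX, ... 7 = RDI, 8..15 = R8..R15).
std::array<std::uint8_t, 10> encode_mov_r64_imm64(unsigned reg, std::uint64_t imm) {
    std::array<std::uint8_t, 10> out{};
    // REX prefix: 0100WRXB. W=1 selects 64-bit operand size,
    // B carries the fourth bit of the register number.
    out[0] = static_cast<std::uint8_t>(0x48 | ((reg >> 3) & 1u));
    // Opcode B8+rd: the low three register bits live in the opcode byte.
    out[1] = static_cast<std::uint8_t>(0xB8 + (reg & 7u));
    // 64-bit immediate, least significant byte first.
    for (int i = 0; i < 8; ++i) {
        out[2 + i] = static_cast<std::uint8_t>(imm >> (8 * i));
    }
    return out;
}
```

For example, encode_mov_r64_imm64(7, reinterpret_cast<std::uint64_t>(&variable)) would give the ten bytes of a movabs rdi with the run-time address of variable, which stay valid wherever the block is copied.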
Terminology:
An opcode is part of a machine instruction, e.g. add ecx, 123 assembles to 3 bytes: 83 c1 7b. Those are the opcode, modrm, and imm8 respectively. (https://www.felixcloutier.com/x86/add).
"opcode" also gets misused (especially in shellcode usage) to describe the whole instruction represented as bytes.
Text names for instructions like add are mnemonics.
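To illustrate the opcode / modrm / imm8 split for 83 c1 7b, here is a small sketch that pulls the three ModRM fields out of the middle byte. The struct and function names are made up for illustration.

```cpp
#include <cstdint>

// The three bit-fields of an x86 ModRM byte: mod (2 bits), reg (3), rm (3).
struct ModRM {
    unsigned mod;
    unsigned reg;
    unsigned rm;
};

// Hypothetical helper: splits a ModRM byte into its fields.
constexpr ModRM decode_modrm(std::uint8_t byte) {
    return ModRM{(byte >> 6) & 3u, (byte >> 3) & 7u, byte & 7u};
}

// For `add ecx, 123` = 83 c1 7b:
//   0x83 is the opcode (group 1, r/m32 with a sign-extended imm8),
//   0xc1 is the ModRM byte decoded below,
//   0x7b is the imm8 (123).
constexpr ModRM add_ecx_modrm = decode_modrm(0xC1);
// mod = 3 : register-direct operand
// reg = 0 : selects ADD within the 0x83 opcode group
// rm  = 1 : ECX
```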
This is just a guess; I don't know if it will work. Sorry in advance for an ugly answer, I don't have much time due to work.
I think you can enclose your asm code between labels.
Get the address of the start label and the size, treat the code as a blob of data, and you can write it anywhere.
void funcA(){
//some code here.
labelStart:
__asm("
;asm code here.
")
labelEnd:
//some code here.
//---make code as movable data.
char* pDynamicProgram = labelStart;
size_t sizeDP = labelEnd - labelStart;
//---writing to some memory.
char* someBuffer = malloc(sizeDP);
memcpy(someBuffer, pDynamicProgram, sizeDP);
//---execute: cast as a function pointer then execute call.
((func*)someBuffer)(/* parameters if any*/);
}
The sample code above is of course not compilable, but the logic is roughly like that. I've seen viruses do it this way, though I haven't seen the actual C++ code, only what disassemblers show. For the "return" logic after the call there are many ad hoc ways to do it; just be creative.
Also, I think you first have to enable some settings so that your program can write to otherwise protected memory, in case you want to overwrite an existing function.

Why is there a locked xadd instruction in this disassambled std::string dtor?

I have a very simple code:
#include <string>
#include <iostream>
int main() {
std::string s("abc");
std::cout << s;
}
Then, I compiled it:
g++ -Wall test_string.cpp -o test_string -std=c++17 -O3 -g3 -ggdb3
And then decompiled it, and the most interesting piece is:
00000000004009a0 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10>:
4009a0: 48 81 ff a0 11 60 00 cmp rdi,0x6011a0
4009a7: 75 01 jne 4009aa <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0xa>
4009a9: c3 ret
4009aa: b8 00 00 00 00 mov eax,0x0
4009af: 48 85 c0 test rax,rax
4009b2: 74 11 je 4009c5 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x25>
4009b4: 83 c8 ff or eax,0xffffffff
4009b7: f0 0f c1 47 10 lock xadd DWORD PTR [rdi+0x10],eax
4009bc: 85 c0 test eax,eax
4009be: 7f e9 jg 4009a9 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x9>
4009c0: e9 cb fd ff ff jmp 400790 <_ZdlPv@plt>
4009c5: 8b 47 10 mov eax,DWORD PTR [rdi+0x10]
4009c8: 8d 50 ff lea edx,[rax-0x1]
4009cb: 89 57 10 mov DWORD PTR [rdi+0x10],edx
4009ce: eb ec jmp 4009bc <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x1c>
Why does _ZNSs4_Rep10_M_disposeERKSaIcE.isra.10 (which is std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_M_dispose(std::allocator<char> const&) [clone .isra.10]) contain a lock-prefixed xadd?
A follow-up question is how I can avoid it?
It looks like code associated with copy on write strings. The locked instruction is decrementing a reference count and then calling operator delete only if the reference count for the possibly shared buffer containing the actual string data is zero (i.e., it is not shared: no other string object refers to it).
Since libstdc++ is open source, we can confirm this by looking at the source!
The function you've disassembled, _ZNSs4_Rep10_M_disposeERKSaIcE, de-mangles [1] to std::basic_string<char>::_Rep::_M_dispose(std::allocator<char> const&). Here's the corresponding source for libstdc++ in the gcc-4.x era [2]:
void
_M_dispose(const _Alloc& __a)
{
#if _GLIBCXX_FULLY_DYNAMIC_STRING == 0
if (__builtin_expect(this != &_S_empty_rep(), false))
#endif
{
// Be race-detector-friendly. For more info see bits/c++config.
_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&this->_M_refcount);
if (__gnu_cxx::__exchange_and_add_dispatch(&this->_M_refcount,
-1) <= 0)
{
_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&this->_M_refcount);
_M_destroy(__a);
}
}
} // XXX MT
Given that, we can annotate the assembly you provided, mapping each instruction back to the C++ source:
00000000004009a0 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10>:
# the next two lines implement the check:
# if (__builtin_expect(this != &_S_empty_rep(), false))
# which is an empty string optimization. The S_empty_rep singleton
# is at address 0x6011a0 and if the current buffer points to that
# we are done (execute the ret)
4009a0: cmp rdi,0x6011a0
4009a7: jne 4009aa <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0xa>
4009a9: ret
# now we are in the implementation of
# __gnu_cxx::__exchange_and_add_dispatch(&this->_M_refcount, -1)
# which dispatches either to an atomic version of the add function
# or the non-atomic version, depending on the value of `eax` which
# is always directly set to zero, so the non-atomic version is
# *always called* (see details below)
4009aa: mov eax,0x0
4009af: test rax,rax
4009b2: je 4009c5 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x25>
# this is the atomic version of the decrement you were concerned about
# but we never execute this code because the test above always jumps
# to 4009c5 (the non-atomic version)
4009b4: or eax,0xffffffff
4009b7: lock xadd DWORD PTR [rdi+0x10],eax
4009bc: test eax,eax
# check if the result of the xadd was zero, if not skip the delete
4009be: jg 4009a9 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x9>
# the delete call
4009c0: jmp 400790 <_ZdlPv@plt> # tailcall
# the non-atomic version starts here, this is the code that is
# always executed
4009c5: mov eax,DWORD PTR [rdi+0x10]
4009c8: lea edx,[rax-0x1]
4009cb: mov DWORD PTR [rdi+0x10],edx
# this jumps up to the test eax,eax check which calls operator delete
# if the refcount was zero
4009ce: jmp 4009bc <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x1c>
A key note is that the lock xadd code you were concerned about is never executed. There is a mov eax, 0 followed by a test rax, rax; je - this test always succeeds and the jump always occurs because rax is always zero.
What's happening here is that __gnu_cxx::__exchange_and_add_dispatch is implemented in a way that it checks whether the process is definitely single threaded. If it is definitely single threaded, it doesn't bother to use expensive atomic instructions for things like __exchange_and_add_dispatch - it simply uses a regular non-atomic addition. It does this by checking the address of a pthreads function, __pthread_key_create - if this is zero, the pthread library hasn't been linked in, and hence the process is definitely single threaded. In your case, the address of this pthread function gets resolved at link time to 0 (you didn't have -lpthread on your compile command line), which is where the mov eax, 0x0 comes from. At link time, it's too late to optimize on this knowledge, so the vestigial atomic decrement code remains but never executes. This mechanism is described in more detail in this answer.
The code that does execute is the last part of the function, starting at 4009c5. This code also decrements the reference count, but in a non-atomic way. The check which decides between these two options is probably based on whether the process is multithreaded or not, e.g., whether -lpthread has been linked. For whatever reason this check, inside __exchange_and_add_dispatch, is implemented in a way that prevents the compiler from actually removing the atomic half of the branch, even though the fact that it will never be taken is known at some point during the build process (after all, the hard-coded mov eax, 0 got there somehow).
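The dispatch described above can be sketched in portable C++. This is not the libstdc++ implementation: the bool parameter threads_active stands in for the weak-symbol check on __pthread_key_create, std::atomic replaces the library's internal primitives, and the function names are made up for illustration.

```cpp
#include <atomic>

// Hypothetical stand-in for __exchange_and_add_dispatch: returns the old
// value of `counter` and adds `delta` to it.
int exchange_and_add_dispatch_sketch(std::atomic<int>& counter, int delta,
                                     bool threads_active) {
    if (threads_active) {
        // Atomic path: on x86 this becomes a lock xadd.
        return counter.fetch_add(delta, std::memory_order_acq_rel);
    }
    // Single-threaded path: a plain load, add, store sequence,
    // matching the mov / lea / mov at 4009c5 in the disassembly.
    int old_value = counter.load(std::memory_order_relaxed);
    counter.store(old_value + delta, std::memory_order_relaxed);
    return old_value;
}

// Mirrors the shape of _M_dispose: the count is 0 when there is exactly one
// owner, so the buffer must be freed when the old value is <= 0.
bool release_reference_sketch(std::atomic<int>& refcount, bool threads_active) {
    return exchange_and_add_dispatch_sketch(refcount, -1, threads_active) <= 0;
}
```

With a count of 1 (two owners), the first release returns false and the second returns true, whichever path is taken.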
A follow-up question is how I can avoid it?
Well, you've already avoided the lock xadd part, so if that's what you care about, you're good to go. However, you still have a cause for concern:
Copy on write std::string implementations are not standards compliant due to changes made in C++11, so the question remains why exactly you are getting this COW string behavior even when specifying -std=c++17.
The problem is most likely distribution related: CentOS 7 by default uses an ancient gcc version (< 5) which still uses the non-compliant COW strings. However, you mention that you are using gcc 8.2.1, which in a normal install uses non-COW strings by default. It seems that if you installed 8.2.1 using the RHEL "devtools" method, you get a new gcc which still uses the old ABI and links against the old system libstdc++.
To confirm this, you might want to check the value of _GLIBCXX_USE_CXX11_ABI macro in your test program, and also your libstdc++ version (the version information here might prove useful).
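A quick way to do that check is a tiny function like the one below. _GLIBCXX_USE_CXX11_ABI is the real libstdc++ macro; the function name and the -1 fallback for other standard libraries are choices made for this sketch.

```cpp
#include <string>  // any libstdc++ header defines the macro

// Hypothetical helper: reports which std::string ABI this translation unit
// was compiled against.
//   1  -> new (non-COW, C++11-conforming) std::string
//   0  -> old COW std::string
//  -1  -> macro not defined (not libstdc++, or a version predating the dual ABI)
int cxx11_abi_flag() {
#ifdef _GLIBCXX_USE_CXX11_ABI
    return _GLIBCXX_USE_CXX11_ABI;
#else
    return -1;
#endif
}
```

Printing the result from main would show 0 on a toolchain that still links against the old system libstdc++, as suspected above.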
You can avoid it by using an OS other than CentOS, one that doesn't ship ancient gcc and glibc versions. If you need to stick with CentOS for some reason, you'll have to look into whether there is a supported way to use a newer libstdc++ version on that distribution. You could also consider using a containerization technology to build an executable independent of the library versions of your local host.
[1] You can demangle it like so: echo '_ZNSs4_Rep10_M_disposeERKSaIcE' | c++filt.
[2] I'm using gcc-4 era source since I'm guessing that's what you end up using in CentOS 7.

AND operator + addition faster than a subtraction

I've measured the execution time of following codes:
volatile int r = 768;
r -= 511;
volatile int r = 768;
r = (r & ~512) + 1;
assembly:
mov eax, DWORD PTR [rbp-4]
sub eax, 511
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
and ah, 253
add eax, 1
mov DWORD PTR [rbp-4], eax
the results:
Subtraction time: 141ns
AND + addition: 53ns
I've run the snippet multiple times with consistent results.
Can someone explain to me why this is the case, even though there is one more line of assembly in the AND + addition version?
Your assertion that one snippet is faster than the other is mistaken.
If you look at the code:
mov eax, DWORD PTR [rbp-4]
....
mov DWORD PTR [rbp-4], eax
You'll see that the running time is dominated by the load/store to memory.
Even on Skylake this will take 2+2 = 4 cycles minimum.
The 1 cycle that the sub takes, or the 3 cycles *) that the and bytereg / add full reg pair takes, simply disappears into the memory access time.
On older processors such as Core2 it takes 5 cycles minimum to do a load/store pair to the same address.
It is difficult to time such short sequences of code and care should be taken to ensure you have the correct methodology.
You also need to remember that rdtsc is not accurate on Intel processors and runs out of order to boot.
If you use proper timing code like:
.... x 100,000 //stress the cpu using integercode in a 100,000 x loop to ensure it's running at 100%
cpuid //serialize instruction to make sure rdtscp does not run early.
rdtscp //use the serializing version to ensure it does not run late
push eax
push edx
mov reg1,1000*1000 //time a minimum of 1,000,000 runs to ensure accuracy
loop:
... //insert code to time here
sub reg1,1 //don't use dec, it causes a partial register stall on the flags.
jnz loop //loop
//kernel mode only!
//mov eax,cr0 //reading and writing to cr0 serializes as well.
//mov cr0,eax
cpuid //serialization in user mode.
rdtscp //make sure to use the 'p' version of rdtsc.
push eax
push edx
pop 4x //retrieve the start and end times from the stack.
Run the timing code a 100x and take the lowest cycle count.
Now you'll have an accurate count to within 1 or 2 cycles.
You'll want to time an empty loop as well and subtract the times for that so that you can see the net time spend executing the instructions of interest.
If you do this you'll discover that add and sub run at exactly the same speed, just as they have on every x86/x64 CPU since the 8086.
This, of course, is also what Agner Fog, the Intel CPU manuals, the AMD cpu manuals, and just about any other source available say.
*) and ah,value takes 1 cycle, then the CPU stalls for 1 cycle due to the partial register write, and the add eax,value takes another cycle.
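The same methodology (many iterations per trial, many trials, keep the lowest result) can be approximated in portable C++. Note the swap: this sketch uses std::chrono::steady_clock rather than cpuid/rdtscp, so it reports nanoseconds instead of cycles and is coarser. All names are made up for illustration, and unsigned is used instead of int so that the repeated subtraction wraps instead of overflowing.

```cpp
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: runs `body` `iterations` times per trial, repeats the
// trial `repeats` times, and returns the lowest nanoseconds-per-iteration.
template <class F>
double min_ns_per_iteration(F body, std::uint64_t iterations, int repeats) {
    using clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::infinity();
    for (int trial = 0; trial < repeats; ++trial) {
        const auto start = clock::now();
        for (std::uint64_t i = 0; i < iterations; ++i) {
            body();
        }
        const auto stop = clock::now();
        const double ns =
            std::chrono::duration<double, std::nano>(stop - start).count();
        const double per_iteration = ns / static_cast<double>(iterations);
        if (per_iteration < best) best = per_iteration;  // keep the fastest trial
    }
    return best;
}

// The two snippets from the question, timed 1,000,000 times x 100 trials.
double time_subtraction() {
    volatile unsigned r = 768;
    return min_ns_per_iteration([&r] { r = r - 511u; }, 1000000, 100);
}

double time_and_plus_add() {
    volatile unsigned r = 768;
    return min_ns_per_iteration([&r] { r = (r & ~512u) + 1u; }, 1000000, 100);
}
```

Timing an empty body with the same helper and subtracting it gives a rough net cost per snippet; with this setup the two results would be expected to land close together, since the load and store dominate.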
Optimized code
sub DWORD PTR [rbp-4],511
Might be faster if you don't need to reuse the value elsewhere, the latency is slow at 5 cycles, but the reciprocal throughput is 1 cycle, which is much better than either of your versions.
The full machine code is
8b 45 fc mov eax,DWORD PTR [rbp-0x4]
2d ff 01 00 00 sub eax,0x1ff
89 45 fc mov DWORD PTR [rbp-0x4],eax
vs
8b 45 fc mov eax,DWORD PTR [rbp-0x4]
80 e4 fd and ah,0xfd
83 c0 01 add eax,0x1
89 45 fc mov DWORD PTR [rbp-0x4],eax
This means the code for the second operation is in fact only one byte longer (12 vs 11). Most likely the CPU fetches code in larger units than bytes, so fetching isn't much slower. It can also decode multiple instructions at the same time, so the first sample doesn't have an advantage there either. Executing a single add, and, or sub each takes a single ALU pass, so each takes only one clock on a single execution unit. That's a 1 ns advantage for your sub on a 1 GHz CPU.
So basically both operations are more or less the same. The difference may be attributed to other factors. Maybe memory cell rbp-0x4 is still in L1 cache when you run the second code snippet. Or the instructions for the first snippet are located somewhere in memory that is slower to reach. Or the CPU was able to run the second snippet speculatively before you started measuring, and so on. You would need to say how you measured the speed to decide that.

MASM Fixing 64 bit Truncation in a DLL

I am working with the Adobe Flash ocx by loading it into my C++ program. The ocx is supposed to be 64 bit but for some reason it has issues when I compile with the x64 platform. I have read up on this and found that it is likely that some function receives DWORD userData instead of void* userData through some structure and then casts it to an object pointer. This works ok in a 32-bit environment, but crashes in 64-bit.
The disassembly of the function calls inside the ocx that cause the crash are the following lines:
mov ecx,r8d
The first operation copies only the low 32 bits of R8 (that is, R8D) to ECX (ECX is 32-bit).
cmp dword ptr [rcx+11BCh],0
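The second operation dereferences the full 64-bit RCX, whose low 32 bits hold the correct part of the address but whose high 32 bits no longer match the original pointer (a write to a 32-bit register zero-extends into the full register, so the upper half of the pointer in R8 is lost). This leads to a crash, of course.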
Solution
I have read that one possible solution is to do the following:
Create an asm file containing the following code:
nop
nop
nop
mov ecx,r8d
cmp dword ptr [rcx+11BCh],0
nop
nop
nop
mov rcx,r8d // I've replaced ecx with rcx here
cmp dword ptr [rcx+11BCh],0
Build an obj file using this asm file and MASM.exe
Open the obj file with a hex editor and locate the 90's that represent the nop
In the Flash ocx locate the first string of bytes between the nops and replace it with the new string of bytes that comes after the nops. This will change it from 32 bit to 64 bit function calls.
Problem
I have attempted this by making the following asm file and building it with ml64.exe (I do not have masm.exe but I think that ml.exe is the new 32 bit version of it, and this code would only build with the ml64.exe, probably because of the 64-bit only operators?):
TITLE: Print String Assembly Program (test.asm)
.Code
main Proc
nop
nop
nop
mov ecx,r8d
cmp dword ptr [rcx+11BCh],0
nop
nop
nop
mov rcx,r8
cmp dword ptr [rcx+11BCh],0
main ENDP
END
I had trouble getting it to build (I kept getting errors about instruction length matching) until I changed r8d to r8 in the second section.
I got this obj to build, opened it with a hex editor, and was able to locate the two byte strings. My problem is that when I search the Flash ocx for the first byte string, I cannot find it. It is not there, so I cannot replace it with the second one.
What am I doing wrong?
Thanks!
Create an asm file containing the following code:
nop
nop
nop
mov ecx,r8d
cmp dword ptr [rcx+11BCh],0
nop
nop
nop
mov rcx,r8d // I've replaced ecx with rcx here
cmp dword ptr [rcx+11BCh],0
Build an obj file using this asm file and MASM.exe
Open the obj file with a hex editor and locate the 90's that represent the nop
In the Flash ocx locate the first string of bytes between the nops and replace it with the new string of bytes that comes after the nops. This will change it from 32 bit to 64 bit function calls.
I made the following asm file and built it with ml64.exe
TITLE: Print String Assembly Program (test.asm)
.Code
main Proc
nop
nop
nop
mov ecx,r8d
cmp dword ptr [rcx+11BCh],0
nop
nop
nop
mov rcx,r8
cmp dword ptr [rcx+11BCh],0
main ENDP
END
I got this obj to build, and opened it with a hex editor and was able to locate the two byte strings. I found the first byte string in the Flash OCX and changed it to the second one. (The only actual change was a 41 to a 49 in the strings)
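The 41 to 49 change is the REX prefix gaining its W bit, which switches the mov from 32-bit to 64-bit operands. The sketch below decodes that prefix and performs the same kind of same-length byte replacement on an in-memory buffer. The byte strings 41 8B C8 and 49 8B C8 are an assumption about how the assembler encoded the two mov instructions (another valid encoding exists), and all names are made up for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// REX prefix layout: 0100WRXB.
struct RexPrefix {
    bool w;  // 1 = 64-bit operand size
    bool r;  // extends the ModRM reg field
    bool x;  // extends the SIB index field
    bool b;  // extends the ModRM rm field (selects R8 here)
};

bool is_rex(std::uint8_t byte) { return (byte & 0xF0) == 0x40; }

RexPrefix decode_rex(std::uint8_t byte) {
    return RexPrefix{(byte & 8) != 0, (byte & 4) != 0,
                     (byte & 2) != 0, (byte & 1) != 0};
}

// Hypothetical helper: replaces the first occurrence of `from` in `image`
// with `to`. Both patterns must have the same length so that nothing after
// the patch shifts. Returns false if the pattern is not found.
bool patch_first(std::vector<std::uint8_t>& image,
                 const std::vector<std::uint8_t>& from,
                 const std::vector<std::uint8_t>& to) {
    if (from.empty() || from.size() != to.size()) return false;
    auto it = std::search(image.begin(), image.end(), from.begin(), from.end());
    if (it == image.end()) return false;
    std::copy(to.begin(), to.end(), it);
    return true;
}
```

In practice the search pattern would include the following cmp bytes as well, so that an unrelated mov ecx,r8d elsewhere in the file is not patched by mistake.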