I have a directory change monitor process that reads updates from files within a set of directories. I have another process that performs small writes to a lot of files to those directories (test program). Figure about 100 directories with 10 files in each, and about 500 files being modified per second.
After running for a while, the directory monitor process hangs on a call to fclose() in a method that is basically tailing the file. In this method, I fopen() the file, check that the handle is valid, do a few seeks and reads, and then call fclose(). These reads are all performed by the same thread in the process. After the hang, the thread never progresses.
I couldn't find any good information on why fclose() might deadlock instead of returning some kind of error code. The documentation does mention _fclose_nolock(), but it doesn't seem to be available to me (Visual Studio 2003).
The hang occurs for both debug and release builds. In a debug build, I can see that fclose() calls _free_base(), which hangs before returning. Some kind of call into kernel32.dll => ntdll.dll => KernelBase.dll => ntdll.dll is spinning. Here's the assembly from ntdll.dll that loops indefinitely:
77CEB83F cmp dword ptr [edi+4Ch],0
77CEB843 lea esi,[ebx-8]
77CEB846 je 77CEB85E
77CEB848 mov eax,dword ptr [edi+50h]
77CEB84B xor dword ptr [esi],eax
77CEB84D mov al,byte ptr [esi+2]
77CEB850 xor al,byte ptr [esi+1]
77CEB853 xor al,byte ptr [esi]
77CEB855 cmp byte ptr [esi+3],al
77CEB858 jne 77D19A0B
77CEB85E mov eax,200h
77CEB863 cmp word ptr [esi],ax
77CEB866 ja 77CEB815
77CEB868 cmp dword ptr [edi+4Ch],0
77CEB86C je 77CEB87E
77CEB86E mov al,byte ptr [esi+2]
77CEB871 xor al,byte ptr [esi+1]
77CEB874 xor al,byte ptr [esi]
77CEB876 mov byte ptr [esi+3],al
77CEB879 mov eax,dword ptr [edi+50h]
77CEB87C xor dword ptr [esi],eax
77CEB87E mov ebx,dword ptr [ebx+4]
77CEB881 lea eax,[edi+0C4h]
77CEB887 cmp ebx,eax
77CEB889 jne 77CEB83F
Any ideas what might be happening here?
I posted this as a comment, but I realize this could be an answer in its own right...
Based on the disassembly, my guess is you've overwritten some internal heap structure maintained by ntdll, and it is looping forever iterating through a linked list.
In particular at the start of the loop, the current list node seems to be in ebx. At the end of the loop, the expected last node (or terminator, if you like -- it looks a bit like these are circular lists and the last node is the same as the first, pointer to this node being at [edi+4Ch]) is contained in eax. Probably the result of cmp ebx, eax is never equal, because there is some cycle in the list introduced by a heap corruption.
I don't think this has anything to do with locks, otherwise we would see some atomic instructions (eg. lock cmpxchg, xchg, etc.) or calls to other synchronization functions.
I had a same case with file close function. In my case, I solved by located the close function embedded other function body instead of having own function.
I was also suspicious on
(1) the name of file being duplicated (2) Windows scheduling (file IO wasn't completed before next task treading being started. Windows scheduling and multi-threading is behind of the curtain, so it is hard to verify, but I have similar issue when I tried to save many data in ASCII in the loop. Saving on binary solved at this case.)
My environment, IDE: Visual Studio 2015, OS: Windows 7, language: C++
Related
I am currently reading "The Ultimate Anti Debugging Reference" and I am trying to implement some of the techniques.
To check the Value of the NtglobalFlag they use this code -
push 60h
pop rsi
gs:lodsq ;Process Environment Block
mov al, [rsi*2+rax-14h] ;NtGlobalFlag
and al, 70h
cmp al, 70h
je being_debugged
I did all the correct adjustments for running x64 code on visual studio 2017 I used this tutorial.
I used this instruction to accesses the NtGlobalFlag
lodsq gs:[rsi]
because their syntax didn't work on Visual studio.
But still, it didn't work.
While debugging I've noticed that the value of the gs register is set to 0x0000000000000000 while the fs register is set to a real value 0x0000007595377000.
I don't understand why the value of GS was zeroed, because it should have its value set on x64.
64 bit Windows is apparently using fs to point to "per thread" memory, since gs is zero. I don't know what variables are kept in "per thread" memory, other than the seed value for rand(). You could debug a program that used rand(), and step through it in a disassembler window, to see how it is accessed.
The success of adding an anti-debugger feature to a program will depend on how much motivation there is to defeat it. The main issue is Windows remote debugging, and/or using a hacker installed device driver running in kernel mode to defeat an anti-debugger feature.
So I still don't understand why the code posted here caused so many problems, As I said I just copied it from "The “Ultimate”Anti-Debugging Reference"
push 60h
pop rsi
gs:lodsq ;Process Environment Block
mov al, [rsi*2+rax-14h] ;NtGlobalFlag
and al, 70h
cmp al, 70h
je being_debugged
But I've found a simpler solution that works perfectly.
As #"Peter Cordes" said I should be good with just accessing the value without lodsq like so -
mov rax, gs:[60h]
And after further investigation, I found this reference,
Code -
mov rax, gs:[60h]
mov al, [rax+BCh]
and al, 70h
cmp al, 70h
jz being_debugged
And I modified it a little bit for my program -
.code
GetValueFromASM proc
mov rax, gs:[60h]
mov al, [rax+0BCh]
and al, 70h
cmp al, 70h
jz being_debugged
mov rax,0
ret
being_debugged:
mov rax, 1
ret
GetValueFromASM endp
end
Just one thing to note -
When running inside visual studio 2017 the result returned was 0. Meaning no debugger attached which is False (Because I used the Local Windows Debugger).
But when launching the process with WinDBG it did return 1 which means that it works.
I ran the debugger on CodeBlocks and viewed the disassembly window.
The full source code for the program I debugged is the following:
int main(){}
and the assembly code I saw in the window was this:
00401020 push %ebp
00401021 mov %esp,%ebp
00401023 push %ebx
00401024 sub $0x34,%esp
00401027 movl $0x401150,(%esp)
0040102E call 0x401984 <SetUnhandledExceptionFilter#4>
00401033 sub $0x4,%esp
00401036 call 0x401330 <__cpu_features_init>
0040103B call 0x401740 <fpreset>
00401040 lea -0x10(%ebp),%eax
00401043 movl $0x0,-0x10(%ebp)
0040104A mov %eax,0x10(%esp)
0040104E mov 0x402000,%eax
00401053 movl $0x404004,0x4(%esp)
0040105B movl $0x404000,(%esp)
00401062 mov %eax,0xc(%esp)
00401066 lea -0xc(%ebp),%eax
00401069 mov %eax,0x8(%esp)
0040106D call 0x40192c <__getmainargs>
00401072 mov 0x404008,%eax
00401077 test %eax,%eax
00401079 jne 0x4010c5 <__mingw_CRTStartup+165>
0040107B call 0x401934 <__p__fmode>
00401080 mov 0x402004,%edx
00401086 mov %edx,(%eax)
00401088 call 0x4014f0 <_pei386_runtime_relocator>
0040108D and $0xfffffff0,%esp
00401090 call 0x401720 <__main>
00401095 call 0x40193c <__p__environ>
0040109A mov (%eax),%eax
0040109C mov %eax,0x8(%esp)
004010A0 mov 0x404004,%eax
004010A5 mov %eax,0x4(%esp)
004010A9 mov 0x404000,%eax
004010AE mov %eax,(%esp)
004010B1 call 0x401318 <main>
004010B6 mov %eax,%ebx
004010B8 call 0x401944 <_cexit>
004010BD mov %ebx,(%esp)
004010C0 call 0x40198c <ExitProcess#4>
004010C5 mov 0x4050f4,%ebx
004010CB mov %eax,0x402004
004010D0 mov %eax,0x4(%esp)
004010D4 mov 0x10(%ebx),%eax
004010D7 mov %eax,(%esp)
004010DA call 0x40194c <_setmode>
004010DF mov 0x404008,%eax
004010E4 mov %eax,0x4(%esp)
004010E8 mov 0x30(%ebx),%eax
004010EB mov %eax,(%esp)
004010EE call 0x40194c <_setmode>
004010F3 mov 0x404008,%eax
004010F8 mov %eax,0x4(%esp)
004010FC mov 0x50(%ebx),%eax
004010FF mov %eax,(%esp)
00401102 call 0x40194c <_setmode>
00401107 jmp 0x40107b <__mingw_CRTStartup+91>
0040110C lea 0x0(%esi,%eiz,1),%esi
Is it normal to get this much assembly code from so little C++ code?
By normal, I mean is this close to the average amount of assembly code the MinGW compiler generates relative to the amount of C++ source code I provided above?
Yes, this is fairly typical startup/shutdown code.
Before your main runs, a few things need to happen:
stdin/stdout/stderr get opened
cin/cout/cerr/clog get opened, referring to stdin/stdout/stderr
Any static objects you define get initialized
command line gets parsed to produce argc/argv
environment gets retrieved (maybe)
Likewise, after your main exits, a few more things have to happen:
Anything set up with atexit gets run
Your static objects get destroyed
cin/cout/cerr/clog get destroyed
all open output streams get flushed and closed
all open input streams get closed
Depending on the platform, there may be a few more things as well, such as setting up some default exception handlers (for either C++ exceptions, some platform-specific exceptions, or both).
Note that most of this is fixed code that gets linked into essentially every program, regardless of what it does or doesn't contain. In theory, they can use some tricks (e.g., "weak externals") to avoid linking in some of this code when it isn't needed, but most of what's above is used so close to universally (and the code to handle it is sufficiently trivial) that it's pretty rare to bother going to any work to eliminate this little bit of code, even when it's not going to be used (like your case, where nothing gets used at all).
Note that what you've shown is startup/shutdown code though. It's linked into your program, traditionally from a file named something like crt0 (along with, perhaps, some additional files).
If you look through your file for the code generated for main itself, you'll probably find that it's a lot shorter--possibly as short and simple as just ret. It may be so tiny that you missed the fact that it's there at all though.
This call 0x401318 <main>
is what you code resolved to, basically. main() is a function and there is code surrounding it, often called something like __start and __end.
What you see amounts, in part, to the CRT support code in __start, and cleanup afterward in __end.
C++ compiled (from the same source) DLL with Visual Studio C++ 2010 Express on both a 64 bit Windows 7 and 32 bit Windows XP. External 64 bit app on windows 7 calls the DLL and executes properly. Equivalent 32 bit app on Windows XP bombs on return from DLL call with stack or memory corruption.
Trying to debug this I put a breakpoint where the DLL is copying the data from some internal structures to what the external app wants, last step before returning. At a given point I'm looking at something like this in Visual Studio:
destination[i].field = source[i].field;
where both fields in the source and destination are doubles or longs.
Hovering over the source it shows the correctly computed values. Hovering over the destination, before executing the statement, shows that it was properly initialized to zeros. After executing the statement the destination contain a different value, e.g. 36.3468 becomes 0.00104800000000122891, 6 becomes 10, etc.
This is strange. Maybe there is a structure element misalignment, but wouldn't that show up as a warning somewhere else? Maybe I'm stepping over memory (in the 32 bit version only!?), but then shouldn't the value be apparently correct after stepping over the assignment? Haven't stepped into machine code in a while and don't know x86/x86_64 assembly that well, do I have to do that to see what the code that does that assignment is really doing?
Here is one of the lines that seems to not execute properly and the disassembly in both the 64 bit and 32 bit versions, in that order:
destination[i].field = source[i].field;
000007FEEF6B4DD3 movsxd rax,dword ptr [i]
000007FEEF6B4DDB imul rax,rax,4A68h
000007FEEF6B4DE2 movsxd rcx,dword ptr [i]
000007FEEF6B4DEA imul rcx,rcx,30h
000007FEEF6B4DEE mov rdx,qword ptr [destination]
000007FEEF6B4DF3 mov r8,qword ptr [source]
000007FEEF6B4DF8 movsd xmm0,mmword ptr [r8+rax+4A40h]
000007FEEF6B4E02 movsd mmword ptr [rdx+rcx+8],xmm0
destination[i].field = source[i].field;
09E64361 mov eax,dword ptr [i]
09E64364 imul eax,eax,4A38h
09E6436A mov ecx,dword ptr [i]
09E6436D imul ecx,ecx,2Ch
09E64370 mov edx,dword ptr [destination]
09E64373 mov esi,dword ptr [source]
09E64376 fld qword ptr [esi+eax+4A10h]
09E6437D fstp qword ptr [edx+ecx+4]
If I step over that line in the 64 bit version, VS shows me the proper value for destination[i].field, but not in the 32 bit version. Seems that the structures have different sizes in different versions, thus different offsets and 4 vs 8 bytes in the last assignment, but shouldn't at that point VS show me the proper value?
If I step over the fld instruction on the 32 bit version, I can see that st0 is loaded with the wrong value, i.e. not what is shown for source[i].field, For i=0, eax=0, esi=source, thus probably the 4A10h offset is wrong and/or differently computed in the code and what VS uses to show me the value. How is this possible?
Under Windows, there are three compiler-intrinsic functions to implement memory barrier:
1. _ReadBarrier;
2. _WriteBarrier;
3. _ReadWriteBarrier;
However, I found a weird problem: _ReadBarrier seems a dummy function doing nothing! The following is my assembly code generated by VC++ 2012.
My question is: How to implement a memory barrier function in assembly instructions?
int main()
{
013EEE10 push ebp
013EEE11 mov ebp,esp
013EEE13 sub esp,0CCh
013EEE19 push ebx
013EEE1A push esi
013EEE1B push edi
013EEE1C lea edi,[ebp-0CCh]
013EEE22 mov ecx,33h
013EEE27 mov eax,0CCCCCCCCh
013EEE2C rep stos dword ptr es:[edi]
int n = 0;
013EEE2E mov dword ptr [n],0
n = n + 1;
013EEE35 mov eax,dword ptr [n]
013EEE38 add eax,1
013EEE3B mov dword ptr [n],eax
_ReadBarrier();
n = n + 1;
013EEE3E mov eax,dword ptr [n]
013EEE41 add eax,1
013EEE44 mov dword ptr [n],eax
}
013EEE56 xor eax,eax
013EEE58 pop edi
013EEE59 pop esi
013EEE5A pop ebx
013EEE5B add esp,0CCh
013EEE61 cmp ebp,esp
013EEE63 call __RTC_CheckEsp (013EC3B0h)
013EEE68 mov esp,ebp
013EEE6A pop ebp
013EEE6B ret
_ReadBarrier, _WriteBarrier, and _ReadWriteBarrier are intrinsics that affect how the compiler can reorder code; they have absolutely nothing to do with CPU memory barriers and are only valid for specific kinds of memory (see "Affected Memory" here).
MemoryBarrier() is the intrinsic that you use to force a CPU memory barrier. However, the recommendation from Microsoft is to use std::atomic<T> going forward with VC++.
Modern processors are capable of executing instructions quite a long way ahead of where it actually is "completing" the instructions, so memory barriers are used to prevent it from running to far ahead when it comes to certain types of memory operations, where strict ordering is required - for most things, it doesn't actually matter if you write to variable a before variable b, or b before a. But sometimes it does.
The x86 instruction set has lfence, sfence and fence, which are instructions that "fence in" loads, stores and all memory operations respectively. The point about a "fence" or "barrier" instruction is to ensure that all the instructions that precede the barrier instruction has completed their loads, stores or both before the next instruction after the barrier can continue.
This is important if you are implementing for example semaphores, mutexes or similar instructions, since it's important to store the value saying "I've locked the semaphore" before you continue to read other data, for example. Otherwise things can go wrong, let's say.
Note that unless you REALLY know what you are doing with memory barriers, it's probably best to NOT use them - and rely on already existing code that solves the same problem - std::atomic are one place to fund such code. I have written quite a bit of "tricky" code, but only once or twice have I needed a memory barrier in my code.
Several times, I've needed to make the compiler not spread the code around, which you can do with "no-op functions", and apparently there are even special intrinsic functions these days to do that.
There are several important points to consider. Perhaps the
first is that barriers only have an effect in multithreaded
code, and most compilers require a special option to produce
multithreaded code. And things like _ReadBarrier are almost
certainly compiler built-ins, and should do nothing unless
you've given the options for multithreaded code.
The second is that what the hardware requires, even in a
multithreaded context, varies. On most of the machines I've
worked on (over some forty years), the machine never required
anything; barriers only become relevant if the machine has
sophisticated read and write pipelines. (Most earlier machines
didn't even have fence or barrier instructions, so the generated
code would have to be empty.)
I'm trying to inject a 64 Bit DLL into 64 Bit Process (explorer for the matter).
I've tried using Remote-thread\Window Hooks techniques but some Anti-Viruses detects my loader as a false positive.
After reading this article : Dll Injection by Darawk, I decided to use code caves.
It worked great for 32bit but because VS doesn't support inline assembly for 64 Bit I had to write the op-codes and operands explicitly.
I looked at this article : 64Bit injection using code cave, as the article states, there are some differences:
There are several differences that had to be incorporated here:
MASM64 uses fastcall, so the function's argument has to be passed in a
register and not on the stack.
The length of the addresses - 32 vs. 64 bit - must be taken into account.
MASM64 has no instruction that
pushes all registers on the stack (like pushad in 32bit) so this had
to be done by pushing all the registers explicitly.
I followed those guidelines and ran the article's example but none of what I did worked.
The target process just crashed at the moment I resumed the main thread and I don't know how to really look into it because ollydbg has no 64 bit support.
This is how the code looks before I injected it:
codeToInject:
000000013FACD000 push 7741933Ah
000000013FACD005 pushfq
000000013FACD006 push rax
000000013FACD007 push rcx
000000013FACD008 push rdx
000000013FACD009 push rbx
000000013FACD00A push rbp
000000013FACD00B push rsi
000000013FACD00C push rdi
000000013FACD00D push r8
000000013FACD00F push r9
000000013FACD011 push r10
000000013FACD013 push r11
000000013FACD015 push r12
000000013FACD017 push r13
000000013FACD019 push r14
000000013FACD01B push r15
000000013FACD01D mov rcx,2CA0000h
000000013FACD027 mov rax,76E36F80h
000000013FACD031 call rax
000000013FACD033 pop r15
000000013FACD035 pop r14
000000013FACD037 pop r13
000000013FACD039 pop r12
000000013FACD03B pop r11
000000013FACD03D pop r10
000000013FACD03F pop r9
000000013FACD041 pop r8
000000013FACD043 pop rdi
000000013FACD044 pop rsi
000000013FACD045 pop rbp
000000013FACD046 pop rbx
000000013FACD047 pop rdx
000000013FACD048 pop rcx
000000013FACD049 pop rax
000000013FACD04A popfq
000000013FACD04B ret
Seems fine to me but I guess I'm missing something.
My complete code can be found here : Source code
Any ideas\suggestions\alternatives?
The first push that stores the return value only pushes a 32-bit value. dwOldIP in your code is a DWORD as well, it should be a DWORD64. Having to cast to DWORD from ctx.Rip should've been enough of a hint ;)
Also, make sure the stack is 16-byte aligned upon entering the call to LoadLibrary. Some APIs throw exceptions if the stack is not aligned properly.
Apparently, The main problem was that I allocated the code cave data without the EXECUTE_PAGE_READWRITE permission and therefore the chunk of data was treated as data and not as opcodes.