Doubting about the Threads window of visual studio - c++

As you can see above , there are 4 win32 threads at exactly the same location, how to understand it?
UPDATE
7C92E4BE mov dword ptr [esp],eax
7C92E4C1 mov dword ptr [esp+4],0
7C92E4C9 mov dword ptr [esp+8],0
7C92E4D1 mov dword ptr [esp+10h],0
7C92E4D9 push esp
7C92E4DA call 7C92E508
7C92E4DF mov eax,dword ptr [esp]
7C92E4E2 mov esp,ebp
7C92E4E4 pop ebp
7C92E4E5 ret
7C92E4E6 lea esp,[esp]
7C92E4ED lea ecx,[ecx]
7C92E4F0 mov edx,esp
7C92E4F2 sysenter
7C92E4F4 ret

At a guess, they're probably sleeping in something like WaitForSingleObject or similar.

The debugger shows the next ring3 processor instruction that is going to be executed. In this case the thread has called sysenter, which makes a ring0 system call to the operating system's kernel. This kernel system call is waiting for something to happen before returning control back to the calling code. Once that something happens, then it will call the next user-mode instruction, which in this case is ret.
If you have 4 threads that are all calling the same function that waits for a system call at the same location, you will have 4 threads that show the same address in the Threads window. This is something that you will see quite often in applications built with the Windows subsystem, which usually have a number of threads that are started by the Windows API that spend most of their time waiting for kernel events.

At a guess, you have a thread pool of some sort, so you have four threads all executing the same thread function. In this case, all four are mostly likely idle, waiting for a task they need to execute. If that's the case, it's quite sensible that all four show the same location.

You'll need to ignore the threads that are started by Microsoft code. I'm guessing at mmsys or DirectX from your screen shot. Microsoft code is very thread-happy.
You can get better diagnostics about what they do when you enable the Microsoft Symbol Server. You'll get decent names in the Call Stack window, often letting you guess what their purpose is. Of course, you'll never get to look at their code.

Related

C++ SYSENTER x86 calls in inline assembly

I'm about to learning how sysenter on x86 works. and i created a simple console application on x86 platform, which should call the NtWriteVirtualMemory function manually in inline assembly.
i started with this code here, but it seems that the compiler dont understand the opcode "sysenter" so i decided to _emit them with the bytes for sysenter.(maybe i need to change something in my project settings?) it compiles but when its about to calling the function visual studio gives me an error that my ret is an illegal instruction while executing, and the program stops.
someone have knowledge how to do that correctly?
#include <windows.h>
#include <iostream>
__declspec(naked) void __KiFastSystemCall()
{
__asm
{
mov edx, esp
// need to emit "sysenter" because of syntaxerrors, "Opcode"; "newline"
_emit 0x0F
_emit 0x34
ret // illegal instructiona after execute?
}
}
void Test_NtWriteVirtualMemory(HANDLE hProcess, PVOID BaseAddress, PVOID Buffer, SIZE_T sizeToWrite, SIZE_T* NumberOfBytesWritten)
{
__asm
{
push NumberOfBytesWritten
push sizeToWrite
push Buffer
push BaseAddress
push hProcess
mov eax, 0x3A // Syscall ID NtWriteVirtualMemory in Windows10
mov edx, __KiFastSystemCall
call edx
add esp, 0x14 // 5 push * 4 bytes 20 dec
retn
}
}
void Test_NtWriteVirtualMemory(HANDLE hProcess, PVOID BaseAddress, PVOID Buffer, SIZE_T sizeToWrite, SIZE_T* NumberOfBytesWritten)
{
__asm
{
push NumberOfBytesWritten
push sizeToWrite
push Buffer
push BaseAddress
push hProcess
mov eax, 0x3a // Syscall ID NtWriteVirtualMemory in Windows10
mov edx, 0x76F88E00
call edx
ret 0x14
}
}
int main()
{
std::cout << "Test Hello World\n";
HANDLE hProcess = OpenProcess(PROCESS_ALL_ACCESS, FALSE, GetProcessId("MyGame.exe"));
if (hProcess == NULL)
return false;
DWORD TestAddress = 0x87A0B4; // harcoded
DWORD TestValue = 4;
Test_NtWriteVirtualMemory(hProcess, (PVOID)TestAddress, (PVOID)TestValue, sizeof(DWORD), NULL);
CloseHandle(hProcess);
return 0;
}
Do you have a 32-bit only version of Windows?
sysenter was the "successor" of int 2eh and introduced during the Windows XP era.
The 64-bit versions of Windows don't use it, in fact it was removed since:
sysenter and sysret are illegal in long mode in an AMD CPU (irrespective of the compatibility mode).
The IA32_SYSENTER_CS MSR is left to zero by a 64-bit version of Windows1.
This will cause a #GP fault when executing sysenter.
If you single-step through your __KiFastSystemCall you should see the debugger catch an exception with code 0xc0000005 when executing sysenter.
So, in order to use sysenter you must have a real 32-bit version of Windows.
Running a 32-bit program on a 64-bit version of Windows won't work, that's compatibility mode (done through the WOW64 machinery).
If, besides having a 64-bit version of Windows, you also have an AMD CPU then it won't work double time.
Windows 64-bit uses syscall for 64-bit program or an indirect call to the WOW32Reserved field of the TEB2, you should use those.
Beware that the 64-bit system call convention is slightly different from the usual one: particularly it assumes the syscall is in a function of its own, thus it expects the parameters on the stack to be shifted up by 8.
Plus, the first parameter must be in r10, not rcx.
For example, if you inline the syscall instruction, the first parameter on the stack (if any) must be at rsp + 28h and not at rsp + 20h.
The 32-bit compatibility mode syscall convention is also different, you need to set both eax and ecx to specific values.
I didn't dig what exactly ecx is used for but it may be related to an optimization called Turbo thunks and must be set to a specific value.
Note that while syscall numbers are very volatile, turbo thunks are even more because they can be disabled by the admin.
1I don't have a definitive source for this, it is just zero on my version of Windows and it makes sysenter fault.
2I.e. a call DWORD [fs:0c0h], this will point to a code that will jump to a gate descriptor for a 64-bit code segment that in turn will execute a syscall
Since you got illegal instruction on the ret instruction rather than the sysenter instruction, you know the sysenter instruction was decoded correctly by the CPU. Your call got into kernel mode, but the kernel didn't like your system call invocation.
Probably it was depending on user-space to help save some registers because sysenter is very minimal. Check the stack pointer after returning from the kernel as you single-step before letting ret execute.
I'd be only speculating as the the problem, but wrapping the syscall gate in another function call looks wrong to my eyes. As I said in comments, do not do this because the syscall numbers can change on you.
Under Linux, 32-bit processes call through the VDSO (a library injected into their address space by the kernel) to get the optimal system-call instruction, used in a way that matches what the kernel wants. (sysenter doesn't preserve the stack pointer so user-space has to help.)
Perhaps if you want to play with this instruction you're better off writing a toy OS.
Sorry it's not a ton of answer, but it's not completely unreasonable.
Making system calls in x86 Windows is different from x64. You need to specify the correct arguments length in ret otherwise you will get illegal instruction and/or runtime esp.
Furthermore I don't recommend you to use inline assembly, instead use it inside an .asm file or as shellcode.
To make a correct x86 system call on x86 Windows:
mov eax, SYSCALL_INDEX
call sysentry
ret ARGUMENTS_LENGTH_SIZE
mov edx,esp
sysenter
retn
To make a correct x64 system call on x64 Windows:
mov eax, SYSCALL_INDEX
mov r10,rcx
syscall
retn
The above will work 100% correctly on any x86 and x64 Windows (tested). Can't help you with inline assembly though, because I never used it that way.
Enjoy.

Mixing c++ and assembly cant pass multiple paramaters from C++ function to assembly

I've been frustrated by passing parameters from a c++ function to assembly. I couldn't find anything that helped on Google and would really like your help. I am using Visual Studio 2017 and masm to compile my assembly code.
This is a simplified version of my c++ file where I call the assembly procedure set_clock
int main()
{
TimeInfo localTime;
char clock[4] = { 0,0,0,0 };
set_clock(clock,&localTime);
system("pause");
return 0;
}
I run into problems in the assembly file. I can't figure out why the second parameter passed to the function turns out huge. I was going off my textbook, which shows similar code with PROC followed by parameters. I don't know why the first parameter is passed successfully and the second one isn't. Can someone tell me the correct way to pass multiple parameters?
.code
set_clock PROC,
array:qword,address:qword
mov rdx,array ; works fine memory address: 0x1052440000616
mov rdi,address ; value of rdi is 14757395258967641292
mov al, [rdx]
mov [rdi],al ; ERROR: cant access that memory location
ret
set_clock ENDP
END
MASM's high-level crap is biting you in the ass. x64 Windows passes the first 4 args in rcx, rdx, r8, r9 (for any of those 4 that are integer/pointer).
mov rdx,array
mov rdi,address
assembles to
mov rdx, rcx ; clobber 2nd arg with a copy of the 1st
mov rdi, rdx ; copy array again
Use a disassembler to check for yourself. Always a good idea to check the real machine code by disassembling or using your debuggers disassembly instead of source mode, if anything weird is happening with assembler macros.
I'm not sure why this would result in an inaccessible memory location. If both args really are pointers to locals, then it should just be loading and storing back into the same stack location. But if char clock[4] is a const in static storage, it might be in a read-only memory page which would explain the store failing.
Either way, use a debugger and find out.
BTW, rdi is a call-preserved (aka non-volatile) register in the x64 Windows convention. (https://msdn.microsoft.com/en-us/library/9z1stfyw.aspx). Use call-clobbered registers for scratch regs unless you run out and need to save/restore some call-preserved regs. See also Agner Fog's calling conventions doc (http://agner.org/optimize/), and other links in the x86 tag wiki.
It's call-clobbered in x86-64 System V, which also passes args in different registers. Maybe you were looking at a different example?
Hopefully-fixed version, using movzx to avoid a false dependency on RAX when loading a byte.
set_clock PROC,
array:qword,address:qword
movzx eax, byte ptr [array]
mov [address], al
ret
set_clock ENDP
I don't use MASM, but I think array:qword makes array an alias for rcx. Or you could skip declaring the parameters and just use rcx and rdx directly, and document it with comments. That would be easier for everyone to understand.
You definitely don't want useless mov reg,reg instructions cluttering your code; if you're writing in asm in the first place, wasted instructions would cut into any speedups you're getting.

Why does this function push RAX to the stack as the first operation?

In the assembly of the C++ source below. Why is RAX pushed to the stack?
RAX, as I understand it from the ABI could contain anything from the calling function. But we save it here, and then later move the stack back by 8 bytes. So the RAX on the stack is, I think only relevant for the std::__throw_bad_function_call() operation ... ?
The code:-
#include <functional>
void f(std::function<void()> a)
{
a();
}
Output, from gcc.godbolt.org, using Clang 3.7.1 -O3:
f(std::function<void ()>): # #f(std::function<void ()>)
push rax
cmp qword ptr [rdi + 16], 0
je .LBB0_1
add rsp, 8
jmp qword ptr [rdi + 24] # TAILCALL
.LBB0_1:
call std::__throw_bad_function_call()
I'm sure the reason is obvious, but I'm struggling to figure it out.
Here's a tailcall without the std::function<void()> wrapper for comparison:
void g(void(*a)())
{
a();
}
The trivial:
g(void (*)()): # #g(void (*)())
jmp rdi # TAILCALL
The 64-bit ABI requires that the stack is aligned to 16 bytes before a call instruction.
call pushes an 8-byte return address on the stack, which breaks the alignment, so the compiler needs to do something to align the stack again to a multiple of 16 before the next call.
(The ABI design choice of requiring alignment before a call instead of after has the minor advantage that if any args were passed on the stack, this choice makes the first arg 16B-aligned.)
Pushing a don't-care value works well, and can be more efficient than sub rsp, 8 on CPUs with a stack engine. (See the comments).
The reason push rax is there is to align the stack back to a 16-byte boundary to conform to the 64-bit System V ABI in the case where je .LBB0_1 branch is taken. The value placed on the stack isn't relevant. Another way would have been subtracting 8 from RSP with sub rsp, 8. The ABI states the alignment this way:
The end of the input argument area shall be aligned on a 16 (32, if __m256 is
passed on stack) byte boundary. In other words, the value (%rsp + 8) is always
a multiple of 16 (32) when control is transferred to the function entry point. The stack pointer, %rsp, always points to the end of the latest allocated stack frame.
Prior to the call to function f the stack was 16-byte aligned per the calling convention. After control was transferred via a CALL to f the return address was placed on the stack misaligning the stack by 8. push rax is a simple way of subtracting 8 from RSP and realigning it again. If the branch is taken to call std::__throw_bad_function_call()the stack will be properly aligned for that call to work.
In the case where the comparison falls through, the stack will appear just as it did at function entry once the add rsp, 8 instruction is executed. The return address of the CALLER to function f will now be back at the top of the stack and the stack will be misaligned by 8 again. This is what we want because a TAIL CALL is being made with jmp qword ptr [rdi + 24] to transfer control to the function a. This will JMP to the function not CALL it. When function a does a RET it will return directly back to the function that called f.
At a higher optimization level I would have expected that the compiler should be smart enough to do the comparison, and let it fall through directly to the JMP. What is at label .LBB0_1 could then align the stack to a 16-byte boundary so that call std::__throw_bad_function_call() works properly.
As #CodyGray pointed out, if you use GCC (not CLANG) with optimization level of -O2 or higher, the code produced does seem more reasonable. GCC 6.1 output from Godbolt is:
f(std::function<void ()>):
cmp QWORD PTR [rdi+16], 0 # MEM[(bool (*<T5fc5>) (union _Any_data &, const union _Any_data &, _Manager_operation) *)a_2(D) + 16B],
je .L7 #,
jmp [QWORD PTR [rdi+24]] # MEM[(const struct function *)a_2(D)]._M_invoker
.L7:
sub rsp, 8 #,
call std::__throw_bad_function_call() #
This code is more in line with what I would have expected. In this case it would appear that GCC's optimizer may handle this code generation better than CLANG.
In other cases, clang typically fixes up the stack before returning with a pop rcx.
Using push has an upside for efficiency in code-size (push is only 1 byte vs. 4 bytes for sub rsp, 8), and also in uops on Intel CPUs. (No need for a stack-sync uop, which you'd get if you access rsp directly because the call that brought us to the top of the current function makes the stack engine "dirty").
This long and rambling answer discusses the worst-case performance risks of using push rax / pop rcx for aligning the stack, and whether or not rax and rcx are good choices of register. (Sorry for making this so long.)
(TL:DR: looks good, the possible downside is usually small and the upside in the common case makes this worth it. Partial-register stalls could be a problem on Core2/Nehalem if al or ax are "dirty", though. No other 64-bit capable CPU has big problems (because they don't rename partial regs, or merge efficiently), and 32-bit code needs more than 1 extra push to align the stack by 16 for another call unless it was already saving/restoring some call-preserved regs for its own use.)
Using push rax instead of sub rsp, 8 introduces a dependency on the old value of rax, so you'd think it might slow things down if the value of rax is the result of a long-latency dependency chain (and/or a cache miss).
e.g. the caller might have done something slow with rax that's unrelated to the function args, like var = table[ x % y ]; var2 = foo(x);
# example caller that leaves RAX not-ready for a long time
mov rdi, rax ; prepare function arg
div rbx ; very high latency
mov rax, [table + rdx] ; rax = table[ value % something ], may miss in cache
mov [rsp + 24], rax ; spill the result.
call foo ; foo uses push rax to align the stack
Fortunately out-of-order execution will do a good job here.
The push doesn't make the value of rsp dependent on rax. (It's either handled by the stack engine, or on very old CPUs push decodes to multiple uops, one of which updates rsp independently of the uops that store rax. Micro-fusion of the store-address and store-data uops let push be a single fused-domain uop, even though stores always take 2 unfused-domain uops.)
As long as nothing depends on the output push rax / pop rcx, it's not a problem for out-of-order execution. If push rax has to wait because rax isn't ready, it won't cause the ROB (ReOrder Buffer) to fill up and eventually block the execution of later independent instruction. The ROB would fill up even without the push because the instruction that's slow to produce rax, and whatever instruction in the caller consumes rax before the call are even older, and can't retire either until rax is ready. Retirement has to happen in-order in case of exceptions / interrupts.
(I don't think a cache-miss load can retire before the load completes, leaving just a load-buffer entry. But even if it could, it wouldn't make sense to produce a result in a call-clobbered register without reading it with another instruction before making a call. The caller's instruction that consumes rax definitely can't execute/retire until our push can do the same.)
When rax does become ready, push can execute and retire in a couple cycles, allowing later instructions (which were already executed out of order) to also retire. The store-address uop will have already executed, and I assume the store-data uop can complete in a cycle or two after being dispatched to the store port. Stores can retire as soon as the data is written to the store buffer. Commit to L1D happens after retirement, when the store is known to be non-speculative.
So even in the worst case, where the instruction that produces rax was so slow that it led to the ROB filling up with independent instructions that are mostly already executed and ready to retire, having to execute push rax only causes a couple extra cycles of delay before independent instructions after it can retire. (And some of the caller's instructions will retire first, making a bit of room in the ROB even before our push retires.)
A push rax that has to wait will tie up some other microarchitectural resources, leaving one fewer entry for finding parallelism between other later instructions. (An add rsp,8 that could execute would only be consuming a ROB entry, and not much else.)
It will use up one entry in the out-of-order scheduler (aka Reservation Station / RS). The store-address uop can execute as soon as there's a free cycle, so only the store-data uop will be left. The pop rcx uop's load address is ready, so it should dispatch to a load port and execute. (When the pop load executes, it finds that its address matches the incomplete push store in the store buffer (aka memory order buffer), so it sets up the store-forwarding which will happen after the store-data uop executes. This probably consumes a load buffer entry.)
Even an old CPUs like Nehalem has a 36 entry RS, vs. 54 in Sandybridge, or 97 in Skylake. Keeping 1 entry occupied for longer than usual in rare cases is nothing to worry about. The alternative of executing two uops (stack-sync + sub) is worse.
(off topic)
The ROB is larger than the RS, 128 (Nehalem), 168 (Sandybridge), 224 (Skylake). (It holds fused-domain uops from issue to retirement, vs. the RS holding unfused-domain uops from issue to execution). At 4 uops per clock max frontend throughput, that's over 50 cycles of delay-hiding on Skylake. (Older uarches are less likely to sustain 4 uops per clock for as long...)
ROB size determines the out-of-order window for hiding a slow independent operation. (Unless register-file size limits are a smaller limit). RS size determines the out-of-order window for finding parallelism between two separate dependency chains. (e.g. consider a 200 uop loop body where every iteration is independent, but within each iteration it's one long dependency chain without much instruction-level parallelism (e.g. a[i] = complex_function(b[i])). Skylake's ROB can hold more than 1 iteration, but we can't get uops from the next iteration into the RS until we're within 97 uops of the end of the current one. If the dep chain wasn't so much larger than RS size, uops from 2 iterations could be in flight most of the time.)
There are cases where push rax / pop rcx can be more dangerous:
The caller of this function knows that rcx is call-clobbered, so won't read the value. But it might have a false dependency on rcx after we return, like bsf rcx, rax / jnz or test eax,eax / setz cl. Recent Intel CPUs don't rename low8 partial registers anymore, so setcc cl has a false dep on rcx. bsf actually leaves its destination unmodified if the source is 0, even though Intel documents it as an undefined value. AMD documents leave-unmodified behaviour.
The false dependency could create a loop-carried dep chain. On the other hand, a false dependency can do that anyway, if our function wrote rcx with instructions dependent on its inputs.
It would be worse to use push rbx/pop rbx to save/restore a call-preserved register that we weren't going to use. The caller likely would read it after we return, and we'd have introduced a store-forwarding latency into the caller's dependency chain for that register. (Also, it's maybe more likely that rbx would be written right before the call, since anything the caller wanted to keep across the call would be moved to call-preserved registers like rbx and rbp.)
On CPUs with partial-register stalls (Intel pre-Sandybridge), reading rax with push could cause a stall or 2-3 cycles on Core2 / Nehalem if the caller had done something like setcc al before the call. Sandybridge doesn't stall while inserting a merging uop, and Haswell and later don't rename low8 registers separately from rax at all.
It would be nice to push a register that was less likely to have had its low8 used. If compilers tried to avoid REX prefixes for code-size reasons, they'd avoid dil and sil, so rdi and rsi would be less likely to have partial-register issues. But unfortunately gcc and clang don't seem to favour using dl or cl as 8-bit scratch registers, using dil or sil even in tiny functions where nothing else is using rdx or rcx. (Although lack of low8 renaming in some CPUs means that setcc cl has a false dependency on the old rcx, so setcc dil is safer if the flag-setting was dependent on the function arg in rdi.)
pop rcx at the end "cleans" rcx of any partial-register stuff. Since cl is used for shift counts, and functions do sometimes write just cl even when they could have written ecx instead. (IIRC I've seen clang do this. gcc more strongly favours 32-bit and 64-bit operand sizes to avoid partial-register issues.)
push rdi would probably be a good choice in a lot of cases, since the rest of the function also reads rdi, so introducing another instruction dependent on it wouldn't hurt. It does stop out-of-order execution from getting the push out of the way if rax is ready before rdi, though.
Another potential downside is using cycles on the load/store ports. But they are unlikely to be saturated, and the alternative is uops for the ALU ports. With the extra stack-sync uop on Intel CPUs that you'd get from sub rsp, 8, that would be 2 ALU uops at the top of the function.

How does memory barrier work?

Under Windows, there are three compiler-intrinsic functions to implement memory barrier:
1. _ReadBarrier;
2. _WriteBarrier;
3. _ReadWriteBarrier;
However, I found a weird problem: _ReadBarrier seems a dummy function doing nothing! The following is my assembly code generated by VC++ 2012.
My question is: How to implement a memory barrier function in assembly instructions?
int main()
{
013EEE10 push ebp
013EEE11 mov ebp,esp
013EEE13 sub esp,0CCh
013EEE19 push ebx
013EEE1A push esi
013EEE1B push edi
013EEE1C lea edi,[ebp-0CCh]
013EEE22 mov ecx,33h
013EEE27 mov eax,0CCCCCCCCh
013EEE2C rep stos dword ptr es:[edi]
int n = 0;
013EEE2E mov dword ptr [n],0
n = n + 1;
013EEE35 mov eax,dword ptr [n]
013EEE38 add eax,1
013EEE3B mov dword ptr [n],eax
_ReadBarrier();
n = n + 1;
013EEE3E mov eax,dword ptr [n]
013EEE41 add eax,1
013EEE44 mov dword ptr [n],eax
}
013EEE56 xor eax,eax
013EEE58 pop edi
013EEE59 pop esi
013EEE5A pop ebx
013EEE5B add esp,0CCh
013EEE61 cmp ebp,esp
013EEE63 call __RTC_CheckEsp (013EC3B0h)
013EEE68 mov esp,ebp
013EEE6A pop ebp
013EEE6B ret
_ReadBarrier, _WriteBarrier, and _ReadWriteBarrier are intrinsics that affect how the compiler can reorder code; they have absolutely nothing to do with CPU memory barriers and are only valid for specific kinds of memory (see "Affected Memory" here).
MemoryBarrier() is the intrinsic that you use to force a CPU memory barrier. However, the recommendation from Microsoft is to use std::atomic<T> going forward with VC++.
Modern processors are capable of executing instructions quite a long way ahead of where it actually is "completing" the instructions, so memory barriers are used to prevent it from running to far ahead when it comes to certain types of memory operations, where strict ordering is required - for most things, it doesn't actually matter if you write to variable a before variable b, or b before a. But sometimes it does.
The x86 instruction set has lfence, sfence and fence, which are instructions that "fence in" loads, stores and all memory operations respectively. The point about a "fence" or "barrier" instruction is to ensure that all the instructions that precede the barrier instruction has completed their loads, stores or both before the next instruction after the barrier can continue.
This is important if you are implementing for example semaphores, mutexes or similar instructions, since it's important to store the value saying "I've locked the semaphore" before you continue to read other data, for example. Otherwise things can go wrong, let's say.
Note that unless you REALLY know what you are doing with memory barriers, it's probably best to NOT use them - and rely on already existing code that solves the same problem - std::atomic are one place to fund such code. I have written quite a bit of "tricky" code, but only once or twice have I needed a memory barrier in my code.
Several times, I've needed to make the compiler not spread the code around, which you can do with "no-op functions", and apparently there are even special intrinsic functions these days to do that.
There are several important points to consider. Perhaps the
first is that barriers only have an effect in multithreaded
code, and most compilers require a special option to produce
multithreaded code. And things like _ReadBarrier are almost
certainly compiler built-ins, and should do nothing unless
you've given the options for multithreaded code.
The second is that what the hardware requires, even in a
multithreaded context, varies. On most of the machines I've
worked on (over some forty years), the machine never required
anything; barriers only become relevant if the machine has
sophisticated read and write pipelines. (Most earlier machines
didn't even have fence or barrier instructions, so the generated
code would have to be empty.)

Why would fclose hang / deadlock? (Windows)

I have a directory change monitor process that reads updates from files within a set of directories. I have another process that performs small writes to a lot of files to those directories (test program). Figure about 100 directories with 10 files in each, and about 500 files being modified per second.
After running for a while, the directory monitor process hangs on a call to fclose() in a method that is basically tailing the file. In this method, I fopen() the file, check that the handle is valid, do a few seeks and reads, and then call fclose(). These reads are all performed by the same thread in the process. After the hang, the thread never progresses.
I couldn't find any good information on why fclose() might deadlock instead of returning some kind of error code. The documentation does mention _fclose_nolock(), but it doesn't seem to be available to me (Visual Studio 2003).
The hang occurs for both debug and release builds. In a debug build, I can see that fclose() calls _free_base(), which hangs before returning. Some kind of call into kernel32.dll => ntdll.dll => KernelBase.dll => ntdll.dll is spinning. Here's the assembly from ntdll.dll that loops indefinitely:
77CEB83F cmp dword ptr [edi+4Ch],0
77CEB843 lea esi,[ebx-8]
77CEB846 je 77CEB85E
77CEB848 mov eax,dword ptr [edi+50h]
77CEB84B xor dword ptr [esi],eax
77CEB84D mov al,byte ptr [esi+2]
77CEB850 xor al,byte ptr [esi+1]
77CEB853 xor al,byte ptr [esi]
77CEB855 cmp byte ptr [esi+3],al
77CEB858 jne 77D19A0B
77CEB85E mov eax,200h
77CEB863 cmp word ptr [esi],ax
77CEB866 ja 77CEB815
77CEB868 cmp dword ptr [edi+4Ch],0
77CEB86C je 77CEB87E
77CEB86E mov al,byte ptr [esi+2]
77CEB871 xor al,byte ptr [esi+1]
77CEB874 xor al,byte ptr [esi]
77CEB876 mov byte ptr [esi+3],al
77CEB879 mov eax,dword ptr [edi+50h]
77CEB87C xor dword ptr [esi],eax
77CEB87E mov ebx,dword ptr [ebx+4]
77CEB881 lea eax,[edi+0C4h]
77CEB887 cmp ebx,eax
77CEB889 jne 77CEB83F
Any ideas what might be happening here?
I posted this as a comment, but I realize this could be an answer in its own right...
Based on the disassembly, my guess is you've overwritten some internal heap structure maintained by ntdll, and it is looping forever iterating through a linked list.
In particular at the start of the loop, the current list node seems to be in ebx. At the end of the loop, the expected last node (or terminator, if you like -- it looks a bit like these are circular lists and the last node is the same as the first, pointer to this node being at [edi+4Ch]) is contained in eax. Probably the result of cmp ebx, eax is never equal, because there is some cycle in the list introduced by a heap corruption.
I don't think this has anything to do with locks, otherwise we would see some atomic instructions (eg. lock cmpxchg, xchg, etc.) or calls to other synchronization functions.
I had a same case with file close function. In my case, I solved by located the close function embedded other function body instead of having own function.
I was also suspicious on
(1) the name of file being duplicated (2) Windows scheduling (file IO wasn't completed before next task treading being started. Windows scheduling and multi-threading is behind of the curtain, so it is hard to verify, but I have similar issue when I tried to save many data in ASCII in the loop. Saving on binary solved at this case.)
My environment, IDE: Visual Studio 2015, OS: Windows 7, language: C++