Why would a process hang within RtlExitUserProcess/LdrpDrainWorkQueue? - c++

To debug a locked file problem, we're calling Sysinternals' Handle64.exe 4.11 from a .NET process (via Process.Start with asynchronous output redirection). The calling process hangs on Process.WaitForExit because the Handle64 process doesn't exit (for more than two hours).
We took a dump of the corresponding Handle64 process and checked it in the Visual Studio 2017 debugger. It shows two threads ("Main Thread" and "ntdll.dll!TppWorkerThread").
Main thread's call stack:
ntdll.dll!NtWaitForSingleObject () Unknown
ntdll.dll!LdrpDrainWorkQueue() Unknown
ntdll.dll!RtlExitUserProcess() Unknown
kernel32.dll!ExitProcessImplementation () Unknown
handle64.exe!000000014000664c() Unknown
handle64.exe!00000001400082a5() Unknown
kernel32.dll!BaseThreadInitThunk () Unknown
ntdll.dll!RtlUserThreadStart () Unknown
Worker thread's call stack:
ntdll.dll!NtWaitForSingleObject() Unknown
ntdll.dll!LdrpDrainWorkQueue() Unknown
ntdll.dll!LdrpInitializeThread() Unknown
ntdll.dll!_LdrpInitialize() Unknown
ntdll.dll!LdrInitializeThunk() Unknown
My question is: Why would a process hang in LdrpDrainWorkQueue? From https://stackoverflow.com/a/42789684/62838, I gather that this is the Windows 10 parallel loader at work, but why would it get stuck while exiting the process? Can this be caused by how we invoke Handle64 from another process? I.e., are we doing something wrong or is this rather a bug in Handle64?

How long did you wait?
According to this analysis,
The worker thread idle timeout is set to 30 seconds. Programs which
execute in less than 30 seconds will appear to hang due to
ntdll!TppWorkerThread waiting for the idle timeout before the process
terminates.
I would recommend setting the registry value specified in that article to disable the parallel loader and seeing whether that resolves the issue.
Parent Key: HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\handle64.exe
Value Name: MaxLoaderThreads
Type: DWORD
Value: 1 to disable
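From an elevated command prompt, that corresponds to something like the following (the image name handle64.exe is taken from your question):
reg add "HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\handle64.exe" /v MaxLoaderThreads /t REG_DWORD /d 1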

Related

Process replayd killed by jetsam reason highwater

Recently I added a broadcast upload extension to my host app to implement system-wide screen casting. I found that the broadcast upload extension sometimes stops for an unknown reason. If I debug the broadcast upload extension process in Xcode, it stops without hitting a breakpoint (if the extension is killed for exceeding the 50 MB memory limit, it does stop at a breakpoint and Xcode points out that it was killed for exceeding the 50 MB memory limit). For more information, I read the console log line by line. Finally, I found a significant line:
osanalyticshelper Process replayd [26715] killed by jetsam reason highwater
It looks like the ReplayKit serving process 'replayd' was killed by jetsam, and the reason is 'highwater'. So I searched the internet for more information and found a post:
https://www.jianshu.com/p/30f24bb91222
After reading it, I checked the JetsamEvent report on the device and found that when the 'replayd' process was killed it occupied 100 MB of memory. Is there a 100 MB memory limit for the 'replayd' process? How can I keep it from occupying more than 100 MB of memory?
Furthermore, I found that this problem often occurs if the previous extension process was stopped via RPBroadcastSampleHandler's finishBroadcastWithError method. If I stop the extension via the Control Center button, this rarely occurs.
For comparison, when the 'Wemeet' app stops its broadcast upload extension, it rarely causes this problem. I compared the console log from when 'Wemeet' stops its broadcast extension with the log from when my app stops its broadcast extension. I found this line is different:
Wemeet:
mediaserverd MEDeviceStreamClient.cpp:429 AQME Default-InputOutput: client stopping: <ZenAQIONodeClient#0x1080f7a40, sid:0x3456e, replayd(30213), 'prim'>; running count now 0
My app:
mediaserverd MEDeviceStreamClient.cpp:429 AQME Default-InputOutput: client stopping: <ZenAQIONodeClient#0x107e869a0, sid:0x3464b, replayd(30232), 'prim'>; running count now 3
As we can see, the 'running count' is different.

AmazonDynamoDBLockClient - Heartbeat thread received interrupt

I have this error in AmazonDynamoDBLockClient. I'm using org.springframework.cloud:spring-cloud-stream-binder-kinesis:2.0.1.RELEASE.
#Spring Cloud Stream Kinesis Binder properties
spring.cloud.stream.bindings.cdcInput.group=listener
spring.cloud.stream.bindings.cdcInput.destination=my_stream
spring.cloud.stream.bindings.cdcInput.content-type=application/json
spring.cloud.stream.kinesis.binder.auto-create-stream=false
spring.cloud.stream.kinesis.binder.locks.table=LockRegistry
spring.cloud.stream.kinesis.binder.checkpoint.table=MetadataStore
spring.cloud.stream.kinesis.binder.locks.leaseDuration=10
spring.cloud.stream.kinesis.binder.locks.heartbeat-period=3
The properties above are my application.properties configuration. Here is the error:
com.amazonaws.services.dynamodbv2.AmazonDynamoDBLockClient - Heartbeat thread recieved interrupt, exiting run() (possibly exiting thread)
java.lang.InterruptedException: sleep interrupted
java.lang.Thread.sleep(Native Method)
com.amazonaws.services.dynamodbv2.AmazonDynamoDBLockClient.run(AmazonDynamoDBLockClient.java:1184)
java.lang.Thread.run(Thread.java:748)
[SpringContextShutdownHook] INFO org.springframework.integration.aws.inbound.kinesis.KinesisMessageDrivenChannelAdapter - stopped KinesisMessageDrivenChannelAdapter{shardOffsets=[KinesisShardOffset{iteratorType=TRIM_HORIZON, sequenceNumber='null', timestamp=null, stream='binlog_updates', shard='shardId-000000000000', reset=false}], consumerGroup='cdc-listener'}
[-kinesis-shard-locks-1] ERROR org.springframework.integration.aws.inbound.kinesis.KinesisMessageDrivenChannelAdapter - Error during unlocking: DynamoDbLock [lockKey=cdc-listener:rds_binlog_updates:shardId-000000000000,lockedAt=2021-01-21#15:54:36.735, lockItem=null]
org.springframework.dao.DataAccessResourceFailureException: Failed to release lock at cdc-listener:binlog_updates:shardId-000000000000; nested exception is java.util.concurrent.RejectedExecutionException: Task org.springframework.integration.aws.lock.DynamoDbLockRegistry$DynamoDbLock$ rejected from java.util.concurrent.ThreadPoolExecutor#[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
org.springframework.integration.aws.lock.DynamoDbLockRegistry$DynamoDbLock.unlock(DynamoDbLockRegistry.java:526)
org.springframework.integration.aws.inbound.kinesis.KinesisMessageDrivenChannelAdapter$ShardConsumerManager.run(KinesisMessageDrivenChannelAdapter.java:1294)
We are getting this error even if the stream is empty and no data is read by the consumer. The service starts the application context as usual without any errors, but within 1-2 minutes this error appears and the application goes down.

Returning from exe entry point does not terminate the process on Windows 10

My attempt
I created a minimal, CRT-free, dependency-depleted executable with Microsoft Visual Studio by specifying the /GS- compiler flag and the /NoDefaultLib linker flag, and naming the main function mainCRTStartup. The application does not create additional threads and returns from mainCRTStartup after < 5 seconds, but it takes 30 seconds in total for the process to terminate.
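For reference, the build roughly corresponds to the following command lines (a sketch from memory; the exact flags may need adjusting for your Visual Studio version):
cl /c /GS- /O1 main.cpp
link /NODEFAULTLIB /SUBSYSTEM:CONSOLE /ENTRY:mainCRTStartup main.obj kernel32.lib user32.lib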
Problem description
From my experience, if an application executed on Windows 10 only depends on dynamic libraries that are loaded by default into every Windows process, namely ntdll.dll, KernelBase.dll and kernel32.dll, the process exits normally when the main thread returns from the mainCRTStartup function.
If other libraries are loaded, statically or dynamically (e.g. by calling LoadLibraryW), returning from the main function leaves the process alive: for 30 seconds when run normally, and indefinitely when run under a debugger.
Context
On process creation, the Windows 10 process loader creates additional threads to load dynamic libraries faster, see:
Why does Windows 10 start extra threads in my program?
Why there are three unexpected worker threads when a Win32 console application starts up?
Cylance mentions in Windows 10 Parallel Loading Breakdown:
The worker thread idle timeout is set to 30 seconds. Programs which execute in less than 30 seconds will appear to hang due to ntdll!TppWorkerThread waiting for the idle timeout before the process terminates.
Microsoft mentions in Terminating a Process: How Processes are Terminated:
Note that some implementation of the C run-time library (CRT) call ExitProcess if the primary thread of the process returns.
On the other hand, Microsoft mentions in ExitProcess:
Note that returning from the main function of an application results in a call to ExitProcess.
Test code
This is the minimal test code I worked with. I used kernel32!CloseHandle and user32!CloseWindow as examples; the calls to them do not actually do anything useful:
#include <cstdint>

namespace windows {
    typedef const intptr_t Handle;
    typedef const void * Module;
    constexpr Handle InvalidHandleValue = -1;

    namespace kernel32 {
        extern "C" uint32_t __stdcall CloseHandle(Handle);
        extern "C" uint32_t __stdcall FreeLibrary(Module);
        extern "C" Module __stdcall LoadLibraryW(const wchar_t *);
    }

    namespace user32 {
        extern "C" uint32_t __stdcall CloseWindow(Handle);
    }
}

int mainCRTStartup() {
    // 0 seconds
    // windows::kernel32::CloseHandle(windows::InvalidHandleValue);

    // 30 seconds
    // windows::user32::CloseWindow(windows::InvalidHandleValue);

    // 0 seconds
    // windows::kernel32::FreeLibrary(windows::kernel32::LoadLibraryW(L"kernel32.dll"));

    // 30 seconds
    // windows::kernel32::FreeLibrary(windows::kernel32::LoadLibraryW(L"user32.dll"));

    // 0 seconds
    // windows::kernel32::FreeLibrary(windows::kernel32::LoadLibraryW(L""));

    return 0;
}
Debugging
Uncommenting one of the WinAPI calls in the mainCRTStartup function results in the execution time stated in the comment above the respective call.
This is the execution flow of the program traced in a debugger in pseudo C++:
ntdll.RtlUserThreadStart() {
    kernel32.BaseThreadInitThunk() {
        const auto return_code = test.mainCRTStartup();
        ntdll.RtlExitUserThread(return_code) {
            if (ntdll.NtQueryInformationThread(CURRENT_THREAD, ThreadAmILastThread) != STATUS_SUCCESS || !AmILastThread) {
                // Bad path - for `30 seconds`.
                ntdll.LdrShutdownThread();
                ntdll.TpCheckTerminateWorker(0);
                ntdll.NtTerminateThread(0, return_code);
                // The thread execution does not return from `NtTerminateThread`, but the process still runs.
            } else {
                // Good path - for `0 seconds`.
                ntdll.RtlExitUserProcess(return_code) {
                    ntdll.EtwpShutdownPrivateLoggers();
                    ntdll.LdrpDrainWorkQueue(0);
                    ntdll.LdrpAcquireLoaderLock();
                    ntdll.RtlEnterCriticalSection(ntdll.FastPebLock);
                    ntdll.RtlLockHeap(peb.ProcessHeap);
                    ntdll.NtTerminateProcess(0, return_code);
                    ntdll.RtlUnlockProcessHeapOnProcessTerminate();
                    ntdll.RtlLeaveCriticalSection(ntdll.FastPebLock);
                    ntdll.RtlReportSilentProcessExit(CURRENT_PROCESS, return_code);
                    ntdll.LdrShutdownProcess();
                    ntdll.NtTerminateProcess(CURRENT_PROCESS, return_code);
                    // The thread execution does not return from `NtTerminateProcess` and the process is terminated.
                }
            }
        }
    }
}
Expected results
I expected the process to terminate if it does not create additional threads and returns from the main function.
Calling ExitProcess at the end of the main function terminates the process immediately, even when one of the WinAPI calls that otherwise causes the 30-second delay is made. Using this API is not always possible, because the problematic application might not be mine but a 3rd-party application (from Microsoft), like here: Why would a process hang within RtlExitUserProcess/LdrpDrainWorkQueue?
It seems to me that the Windows 10 process loader is broken, if even Microsoft processes behave incorrectly.
Is there a clean solution to this problem?
What are those loader threads still needed for once the last user-created thread exits? AFAIK it is impossible at that point to load any other libraries.
I expected the process to terminate if it does not create additional
threads and returns from the main function.
A process can implicitly create additional threads; the loader worker threads are one example. You also need to understand what is meant by
returns from the main function
Here it means the function that is called from the standard CRT mainCRTStartup function; after that function returns, mainCRTStartup calls ExitProcess. So it is not the exe's real entry point that returns, but a sub-function called from the entry point, and the entry point then calls ExitProcess.
If we do not use the CRT, we need to call ExitProcess ourselves. If we simply return from the entry point, RtlExitUserThread runs, and it calls ExitProcess only if this is the last thread in the process (AmILastThread); there can also be a race here if two or more threads call ExitThread in parallel.
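As a concrete illustration of that last point, here is a minimal sketch of the CRT-free workaround (assumes linking against kernel32.lib; the declarations are simplified by hand rather than taken from windows.h):
// Call ExitProcess explicitly instead of returning, so the process terminates
// immediately even though the loader worker threads are still waiting out
// their 30-second idle timeout.
extern "C" __declspec(dllimport) void * __stdcall LoadLibraryW(const wchar_t *);
extern "C" __declspec(dllimport) int __stdcall FreeLibrary(void *);
extern "C" __declspec(dllimport) __declspec(noreturn) void __stdcall ExitProcess(unsigned int);

int mainCRTStartup() {
    // Loading user32.dll spins up the parallel-loader worker threads.
    FreeLibrary(LoadLibraryW(L"user32.dll"));
    // Returning here would leave the process alive for ~30 seconds;
    // ExitProcess terminates it right away.
    ExitProcess(0);
}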

Output thread IDs as seen by debugger

I'm developing a multi-threaded C++ application using GCC 4.4.5 and GDB 7.2.
At the moment, I have four threads. Each one interacts with a CAN bus in one form or another, either reading, writing, polling or handling messages.
In order to determine which thread is doing what, I have decided to add the thread IDs to log messages.
In my logging functions, I have the following code:
// This is for outputting debug messages
void logDebug(string msg, thread::id threadId = thread::id()) {
#ifdef _DEBUG
    threadState.outputLock->lock();
    // A default-constructed thread::id means "no id was passed".
    if (threadId != thread::id())
        cout << "[Thread #" << threadId << "] ";
    // The rest of the output
    threadState.outputLock->unlock();
#endif
}
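For reference, a call site passing the caller's own id would look like this (std::this_thread::get_id() yields the std::thread::id value that ends up in the output below):
logDebug("CAN frame was empty or no message on bus...", std::this_thread::get_id());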
This is the (debug) output from the application:
[Thread #3085296768] [DEBUG] [Mon Jun 17 10:18:45 2019] CAN frame was empty or no message on bus...
----------
And this is the what GDB is telling me:
Thread #3 7575 [core: 0] (Suspended: Breakpoint)
----
Why does the debugger report different thread IDs/numbers than the application, and is there a way to output from the application the same identifiers the debugger shows?
The expected behaviour is that the thread IDs are identical.
EDIT:
I forgot to add some possibly important information.
I'm cross-compiling to an embedded device powered by a POWERPC chip, running a derivative of Debian Wheezy.
You can get the thread id from your application with the following system call: syscall(SYS_gettid).
From there you can set the thread name by either:
writing the name directly to /proc/PID/task/TID/comm
using the pthread function int pthread_setname_np(pthread_t thread, const char *name)
Then in GDB you can easily match a given thread name, its Linux TID and the GDB thread ID with the info threads command.
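For example, a minimal sketch (the helper name announceThread is illustrative; pthread_setname_np is a GNU extension available since glibc 2.12):
#include <pthread.h>      // pthread_self, pthread_setname_np (GNU extension)
#include <sys/syscall.h>  // SYS_gettid
#include <unistd.h>       // syscall
#include <iostream>

// Print the kernel thread id (the LWP number gdb lists in "info threads")
// and give the calling thread a name the debugger can display.
void announceThread(const char* name) {
    pid_t tid = static_cast<pid_t>(syscall(SYS_gettid)); // kernel TID (LWP)
    pthread_setname_np(pthread_self(), name);            // name limited to 15 chars
    std::cout << "[" << name << "] kernel TID " << tid << std::endl;
}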
Hope this helps.

Efficient variable watching in C/C++

I'm currently writing a multi-threaded, highly efficient and scalable algorithm. Because I have to guess a parameter for the code and I'm not sure how the calculation performs on a specific data set, I would like to watch a variable. The test only works with a real-world, huge data set. It is possible to analyze the collected data after profiling. Imagine the following simple code example (real code can contain multiple watch points):
// function gets called by loops of multiple threads
void payload(data_t* data, double threshold) {
    double value = calc(data);
    // here I want to watch the value
    if (value < threshold) {
        doSomething(data);
    } else {
        doSomethingElse(data);
    }
}
I thought about the following approaches:
1. Use cout or other system outputs
2. Use a binary output (file, network)
3. Set a breakpoint via gdb/lldb
4. Use variable watching + logging via gdb/lldb
I'm not happy with these options because: to use 1. and 2. I have to change the code, but this is a debugging/evaluating task. Furthermore, 1. requires locking, and 1.+2. require I/O operations, which heavily slow down the entire code and make testing with real data nearly impossible. 3. is also too slow. To use 4., I have to know the variable's address, but because it is not a global variable and threads get created by a dynamic scheduler, this would require breaking + stepping for each thread.
So my conclusion is that I need a profiler/debugger that works at machine-code level and dumps/logs/watches the variable without double->string conversion and is highly efficient. To sum up in other words: I would like to profile the internal state of my algorithm without heavy slow-down and without deep modification. Does anybody know a tool that is able to do this?
OK, this took some time, but now I'm able to present a solution for my problem. It's called tracepoints. Instead of breaking the program every time, it's more lightweight and (ideally) doesn't change performance/timing too much. It does not require code changes. Here is an explanation of how to use them with gdb:
Make sure you compiled your program with debugging symbols (using the -g flag). Now, start the gdb server and provide a network port (e.g. 10000) and the program arguments:
gdbserver :10000 ./program --parameters you --want --to use
Now, switch to a second console and start gdb (program parameters are not required here):
gdb ./program
All following commands are entered in the gdb command line interface. So let's connect to the server:
target remote :10000
After you got the connection confirmation, use trace or ftrace to set a tracepoint to a specific source location (try ftrace first, it should be faster but doesn't work on all platforms):
trace source.c:127
This should create tracepoint #1. Now you can setup an action for this tracepoint. Here I want to collect the data from myVariable
action 1
collect myVariable
end
If you expect a lot of data or want to use the data later (after a restart), you can set a binary trace file:
tsave trace.bin
Now, start tracing and run the program:
tstart
continue
You can wait for the program to exit or interrupt it using CTRL-C (still on the gdb console, not the server side). Continue by telling gdb that you want to stop tracing:
tstop
Now we come to the tricky part, and I'm not really happy with the following code because it's really slow:
set pagination off
set logging file trace.txt
tfind start
while ($trace_frame != -1)
set logging on
printf "%f\n", myVariable
set logging off
tfind
end
This dumps all variable data to a text file. You can add some filter or preparation here. Now you're done and you can exit gdb. This will also shutdown the server:
quit
For detailed documentation especially for explanation of filtering and more advanced tracepoint positions, you can visit the following document: http://sourceware.org/gdb/onlinedocs/gdb/Tracepoints.html
To isolate trace file writing from your program execution, you can use cgroups or another network connected computer. When using another computer, you have to add the host to the port information (e.g. 192.168.1.37:10000). To load a binary trace file later, just start gdb as shown above (forget the server) and change the target:
gdb ./program
target tfile trace.bin
You can set a hardware watchpoint using the gdb debugger. For example, if you have a
bool b;
variable and you want to be notified every time its value changes (by any thread),
you would declare a watchpoint like this:
(gdb) watch *(bool*)0x7fffffffe344
example:
root@comp:~# gdb prog
GNU gdb (GDB) 7.5-ubuntu
Copyright ...
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /dist/Debug/GNU-Linux-x86/cppapp_socket5_ipaddresses...done.
(gdb) watch *(bool*)0x7fffffffe344
Hardware watchpoint 1: *(bool*)0x7fffffffe344
(gdb) start
Temporary breakpoint 2 at 0x40079f: file main.cpp, line 26.
Starting program: /dist/Debug/GNU-Linux-x86/cppapp_socket5_ipaddresses
Hardware watchpoint 1: *(bool*)0x7fffffffe344
Old value = true
New value = false
main () at main.cpp:50
50 if (strcmp(mask, "255.0.0.0") != 0) {
(gdb) c
Continuing.
Hardware watchpoint 1: *(bool*)0x7fffffffe344
Old value = false
New value = true
main () at main.cpp:41
41 if (ifa ->ifa_addr->sa_family == AF_INET) { // check it is IP4
(gdb) c
Continuing.
mask:255.255.255.0
eth0 IP Address 192.168.1.5
[Inferior 1 (process 18146) exited normally]
(gdb) q