Slow communication using shared memory between user mode and kernel - c++

I am running a thread in the Windows kernel that communicates with an application over shared memory. Everything works except that the communication is slow due to a Sleep loop. I have been investigating spin locks, mutexes and interlocked operations but can't really figure this one out. I have also considered Windows events but don't know about their performance. Please advise on a faster solution that keeps the communication over shared memory, possibly using Windows events.
KERNEL CODE
typedef struct _SHARED_MEMORY
{
    BOOLEAN mutex;
    CHAR data[BUFFER_SIZE];
} SHARED_MEMORY, *PSHARED_MEMORY;

ZwCreateSection(...)
ZwMapViewOfSection(...)

while (TRUE) {
    if (((PSHARED_MEMORY)SharedSection)->mutex == TRUE) {
        //... do work...
        ((PSHARED_MEMORY)SharedSection)->mutex = FALSE;
    }
    KeDelayExecutionThread(KernelMode, FALSE, &PollingInterval);
}
APPLICATION CODE
OpenFileMapping(...)
MapViewOfFile(...)
...
RtlCopyMemory(&SM->data, WriteData, Size);
SM->mutex = TRUE;

while (SM->mutex != FALSE) {
    Sleep(1); // Slow, and removing it will cause an infinite loop
}

RtlCopyMemory(ReadData, &SM->data, Size);
UPDATE 1
Currently the fastest solution I have come up with is:
while (InterlockedCompareExchange(&SM->mutex, FALSE, FALSE));
However, I find it funny that you need to do an exchange when all you want is a compare, and that there is no function for that.

You don't want to use InterlockedCompareExchange. It burns the CPU, saturates core resources that might be needed by another thread sharing that physical core, and can saturate inter-core buses.
You do need to do two things:
1) Write an InterlockedGet function and use it.
2) Prevent the loop from burning CPU resources and from taking the mother of all mispredicted branches when it finally gets unblocked.
For 1, this is known to work on all compilers that support InterlockedCompareExchange, at least last time I checked:
__inline static int InterlockedGet(int *val)
{
    return *((volatile int *)val);
}
For 2, put this as the body of the wait loop:
__asm
{
    rep nop
}
For x86 CPUs, this is specified to solve the resource saturation and branch prediction problems (rep nop encodes the PAUSE instruction, also exposed as YieldProcessor() in the Windows headers).
Putting it together:
while ((*(volatile int *)&SM->mutex) != FALSE) {
    __asm
    {
        rep nop
    }
}
Change int as needed if it's not appropriate.
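Since the question also asks about Windows events: below is a minimal sketch of an event-based handshake, assuming a named notification event shared between the driver and the application (the event name and the omitted error handling are illustrative, not from the question). The kernel thread then blocks instead of polling, and wakes only when the application signals.

// Kernel side (sketch): create a named notification event and block on it.
UNICODE_STRING eventName;
HANDLE eventHandle;
PKEVENT Event;

RtlInitUnicodeString(&eventName, L"\\BaseNamedObjects\\DataReady");
Event = IoCreateNotificationEvent(&eventName, &eventHandle);
KeClearEvent(Event);

while (TRUE) {
    // Blocks until the application calls SetEvent -- no Sleep loop needed.
    KeWaitForSingleObject(Event, Executive, KernelMode, FALSE, NULL);
    KeClearEvent(Event);
    //... do work on SharedSection ...
}

// Application side (sketch): open the same event and signal after writing.
HANDLE hEvent = OpenEventW(EVENT_MODIFY_STATE, FALSE, L"Global\\DataReady");
RtlCopyMemory(&SM->data, WriteData, Size);
SetEvent(hEvent); // wakes the kernel thread

A second event signalled in the opposite direction can replace the application's Sleep loop in the same way.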

Related

std::thread does not exit

I have code that acts as a sort of anti-idle guard for the CPU. A USB camera is attached to the laptop, grabbing images, and if the CPU is allowed to enter idle states, I lose images. Since I don't have admin rights on the system in question, I instead run a thread that just does a stupid ++ on an integer to keep the CPU out of idle (with one core at 100% usage). The issue is that on the system in question the code never exits. On my development system the code exits just fine; on the system where the application should run, it works fine but never exits.
The output I get in console is
Setting bool to exit.
Reached join 1.
Reached join 2.
That's it. The exit does not happen, so the join() on the anti-idle thread does not return. Why? On one system it does, on the other it does not.
bool g_ExitProgram = false;

void AntiIdle()
{
    int32_t ch = 0;
    while (!g_ExitProgram)
    {
        ch++;
    }
}

int main()
{
    std::thread antiIdleThread(AntiIdle);
    while (!g_ExitProgram)
    {
        if (_kbhit())
        {
            char ch = _getch();
            switch (ch)
            {
            case 27:
                printf("Setting bool to exit.\n");
                g_ExitProgram = true;
                break;
            default:
                break;
            }
        }
    }
    printf("Reached join 1.\n");
    displayThread.join();   // displayThread is created elsewhere (see note below)
    printf("Reached join 2.\n");
    antiIdleThread.join();
    printf("Exiting code.\n");
    return 0;
}
Edit: note, displayThread has the exact same exit condition, just with a few sleeps() in between, waiting for the next image to arrive.
This is a data race, because there's no synchronization at all on your global flag.
The simplest solution is to change the flag to std::atomic_bool - the default sequential consistency will work, and you probably don't need to optimize it in this case.
In terms of the documentation, std::atomic with either the default sequential consistency, or the more relaxed store(memory_order_release)/load(memory_order_acquire) gives you release-acquire ordering.
Just for the sake of perfect clarity, making the flag volatile does not address this problem. It may work in Java, but it doesn't work in C++, and it never did. If you're very unlucky it will appear to work for long enough to get you in trouble.
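A minimal sketch of that fix, reusing the names from the question (only the flag's type changes; the rest of the program stays the same):

#include <atomic>

std::atomic<bool> g_ExitProgram(false);   // default: sequentially consistent

void AntiIdle()
{
    int32_t ch = 0;
    while (!g_ExitProgram)   // atomic load: the compiler must actually re-read it
    {
        ch++;
    }
}

With a plain bool, the optimizer is allowed to hoist the load out of the loop, which produces exactly the "exits on one machine, spins forever on another" behaviour described above.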

Resource intensive multithreading killing other processes

I have some very resource-intensive code that I made so I could split the workload over multiple pthreads. Everything works, and the computation is done faster, but my guess at what happens is that other processes on those processor cores get so slow that they crash after a few seconds of runtime.
I already managed to kill random processes like Chrome tabs, the Cinnamon DE or even the entire OS (Kernel?).
Code: (It's late, and I'm too tired to write pseudo code or even comments...)
It's brute-force code, not so much for cracking as for testing passwords and/or CPU IPS.
Any ideas how to fix this, while still keeping as much performance as possible?
static unsigned int NTHREADS = std::thread::hardware_concurrency();
static int THREAD_COMPLETE = -1;
static std::string PASSWORD = "";
static std::string CHARS;
static std::mutex MUTEX;

void *find_seq(void *arg_0)
{
    unsigned int _arg_0 = *((unsigned int *) arg_0);
    std::string *str_CURRENT = new std::string(" ");

    while (true)
    {
        for (unsigned int loop_0 = _arg_0; loop_0 < CHARS.length() - 1; loop_0 += NTHREADS)
        {
            str_CURRENT->back() = CHARS[loop_0];
            if (*str_CURRENT == PASSWORD)
            {
                THREAD_COMPLETE = _arg_0;
                return (void *) str_CURRENT;
            }
        }
        str_CURRENT->back() = CHARS.back();
        for (int loop_1 = (str_CURRENT->length() - 1); loop_1 >= 0; loop_1--)
        {
            if (str_CURRENT->at(loop_1) == CHARS.back())
            {
                if (loop_1 == 0)
                    str_CURRENT->assign(str_CURRENT->length() + 1, CHARS.front());
                else
                {
                    str_CURRENT->at(loop_1) = CHARS.front();
                    str_CURRENT->at(loop_1 - 1) = CHARS[CHARS.find(str_CURRENT->at(loop_1 - 1)) + 1];
                }
            }
        }
    }
}
Areuz,
Can you post the full code? I suspect the issue is the NTHREADS value. On my Ubuntu box, the value is set to 8, which is the number of cores listed in the /proc/cpuinfo file. Kicking off 8 'hot' threads on my box hogs 100% of the CPU. The kernel will time-slice for its own critical processes, but in general all other processes will starve for CPU.
Check out the max processor value in /proc/cpuinfo and go at least one lower than that. The CPUs are numbered 0-7 on my box, so 7 would be the max for me. The actual max might be 3, since 4 of my cores are hyper-threads. For completely CPU-bound processes, hyper-threading generally doesn't help.
Bottom line, don't hog all the CPU, it will destabilize the system.
--Matt
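A minimal sketch of that advice, keeping the std::thread-based setup from the question (note that hardware_concurrency() may also return 0 when it cannot be determined, which the original initializer does not guard against):

#include <thread>

// Leave one hardware thread free so the rest of the system stays responsive.
static unsigned int worker_threads()
{
    unsigned int n = std::thread::hardware_concurrency();
    if (n <= 1)
        return 1;    // unknown (0) or single core: use a single worker
    return n - 1;    // otherwise keep one core for everything else
}

static unsigned int NTHREADS = worker_threads();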
Thank you for your answers and especially Matthew Fisher for his suggestion to try it on another system.
After some trial and error I decided to pull back my CPU overclock, which I had thought was stable (I had run it for over a year), and that solved this weird behaviour. I guess I had never run such a CPU-intensive and (I'm guessing) efficient script (in the sense of never yielding the CPU) to see this happen.
As Matthew suggested, I need to come up with a better way than constantly checking the THREAD_COMPLETE variable in a while(true) loop, but I hope to resolve that in the comments (a sketch of one alternative follows below).
Full and updated code for future visitors is here: pastebin.com/jbiYyKBu
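For the busy-checking issue mentioned above, a hedged sketch of one common alternative (illustrative, not the pastebin code): make THREAD_COMPLETE a std::atomic so every worker can poll it safely and stop as soon as any thread finds the password.

#include <atomic>
#include <string>

static std::atomic<int> THREAD_COMPLETE(-1);   // -1 means "still searching"

void *find_seq(void *arg_0)
{
    unsigned int _arg_0 = *((unsigned int *) arg_0);
    std::string *str_CURRENT = new std::string(" ");
    while (THREAD_COMPLETE.load(std::memory_order_relaxed) == -1)
    {
        // ... generate and test candidates as in the original loop ...
        if (*str_CURRENT == PASSWORD)
        {
            THREAD_COMPLETE.store(_arg_0);   // visible to all other workers
            return (void *) str_CURRENT;
        }
    }
    delete str_CURRENT;
    return NULL;                             // another thread already finished
}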

Multithreading results in slower execution

Basically I haven't done multi-threaded programming before; conceptually I am aware of it.
So I started with some coding around random number generation. The code works, but it produces results more slowly than the single-threaded program, so I wanted to find the loopholes in my code and learn how to improve performance.
If I try to generate the numbers 1-1500 randomly using a single thread versus 10 threads (or 5 threads), the single thread executes faster. Thread switching or locking seems to be taking the time, so how do I handle that?
pthread_mutex_t map_lock;
std::set<int> numSet;
int randcount = 0;

static void *thread_fun(void *arg)
{
    int randNum = *(int *)arg;
    int result;
    std::set<int> findItr;   // note: unused
    while (randcount != randNum - 1) {   // note: read without holding the lock
        result = rand() % randNum;
        if (result == 0) continue;
        pthread_mutex_lock(&map_lock);
        const bool is_in = (numSet.find(result) != numSet.end());
        if (!is_in)
        {
            numSet.insert(result);
            printf(" %d\t", result);
            randcount++;
        }
        pthread_mutex_unlock(&map_lock);
    }
    return NULL;   // required: thread_fun must return a void *
}
Since the majority of your code blocks all parallel threads (because it sits between a pthread_mutex_lock(&map_lock); and a pthread_mutex_unlock(&map_lock); pair), your code works as if it were running sequentially, only with the added overhead of parallelisation.
Tip: try to only collect the results in each thread, then pass them back to the main thread, which will display them. Also, if you don't access your set in parallel but instead pass back partial lists from each thread, you don't have to deal with the concurrency that is slowing down your code. A minimal sketch follows.
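A sketch of that tip, assuming each thread is simply given its own share of numbers to produce (the ThreadArg struct and the rand_r usage are illustrative, not from the question): no mutex is taken in the hot loop, and the main thread merges the partial sets after joining.

#include <pthread.h>
#include <cstdio>
#include <cstdlib>
#include <set>

struct ThreadArg { int limit; int share; unsigned int seed; };

static void *thread_fun(void *arg)
{
    ThreadArg *a = (ThreadArg *)arg;
    std::set<int> *local = new std::set<int>;       // private to this thread
    while ((int)local->size() < a->share) {
        int result = rand_r(&a->seed) % a->limit;   // rand_r: no shared state
        if (result != 0)
            local->insert(result);
    }
    return local;                                   // handed back via pthread_join
}

int main()
{
    const int NTHREADS = 4;
    pthread_t tid[NTHREADS];
    ThreadArg args[NTHREADS];
    std::set<int> numSet;

    for (int i = 0; i < NTHREADS; i++) {
        args[i].limit = 1500;
        args[i].share = 1500 / NTHREADS;
        args[i].seed = i + 1;                        // distinct seed per thread
        pthread_create(&tid[i], NULL, thread_fun, &args[i]);
    }
    for (int i = 0; i < NTHREADS; i++) {
        void *ret;
        pthread_join(tid[i], &ret);
        std::set<int> *local = (std::set<int> *)ret;
        numSet.insert(local->begin(), local->end()); // merge in the main thread only
        delete local;
    }
    printf("collected %zu distinct numbers\n", numSet.size());
    return 0;
}

Note that the per-thread sets can overlap, so this sketches the lock-free structure rather than reproducing the exact "every number exactly once" semantics of the original.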

CUDA, mutex and atomicCAS()

Recently I started to develop on CUDA and ran into a problem with atomicCAS().
To do some manipulation of memory in device code, I have to create a mutex so that only one thread can work with the memory in a critical section of the code.
The device code below runs on 1 block and several threads.
__global__ void cudaKernelGenerateRandomGraph(..., int* mutex)
{
    int i = threadIdx.x;
    ...
    do
    {
        atomicCAS(mutex, 0, 1 + i);
    }
    while (*mutex != i + 1);

    // critical section
    // do some manipulations with objects in device memory

    *mutex = 0;
    ...
}
When the first thread executes
atomicCAS(mutex, 0, 1 + i);
mutex becomes 1. After that, the first thread changes its status from Active to Inactive, and the line
*mutex = 0;
is never executed. The other threads stay in the loop forever. I have tried many variants of this cycle, such as while(){};, do{}while();, with a temp variable = *mutex inside the loop, even a variant with if(){} and goto. But the result is the same.
The host part of code:
...
int verticlesCount = 5;
int *mutex;
cudaMalloc((void **)&mutex, sizeof(int));
cudaMemset(mutex, 0, sizeof(int));
cudaKernelGenerateRandomGraph<<<1, verticlesCount>>>(..., mutex);
I use Visual Studio 2012 with CUDA 5.5.
The device is NVidia GeForce GT 240 with compute capability 1.2.
Thanks in advance.
UPD:
After some time working on my diploma project this spring, I found a solution for a critical section on CUDA.
It is a combination of lock-free and mutex mechanisms.
Here is the working code. I used it to implement an atomic dynamically-resizable array.
// *mutex should be 0 before calling this function
__global__ void kernelFunction(..., unsigned long long* mutex)
{
    bool isSet = false;
    do
    {
        if (isSet = (atomicCAS(mutex, 0, 1) == 0))
        {
            // critical section goes here
        }
        if (isSet)
        {
            *mutex = 0;   // release the lock so other threads can enter
        }
    }
    while (!isSet);
}
The loop in question
do
{
    atomicCAS(mutex, 0, 1 + i);
}
while (*mutex != i + 1);
would work fine if it were running on the host (CPU) side; once thread 0 sets *mutex to 1, the other threads would wait exactly until thread 0 sets *mutex back to 0.
However, GPU threads are not as independent as their CPU counterparts. GPU threads are grouped into groups of 32, commonly referred to as warps. Threads in the same warp execute instructions in complete lock-step. If a control statement such as if or while causes some of the 32 threads to diverge from the rest, the remaining threads will wait (i.e. sleep) until the divergent threads finish. [1]
Going back to the loop in question, thread 0 becomes inactive because threads 1, 2, ..., 31 are still stuck in the while loop. So thread 0 never reaches the line *mutex = 0, and the other 31 threads loop forever.
A potential solution is to make a local copy of the shared resource in question, let the 32 threads modify the copy, and then pick one thread to 'push' the change back to the shared resource. A __shared__ variable is ideal in this situation: it will be shared by the threads belonging to the same block but not by other blocks. We can use __syncthreads() to fine-control access to this variable by the member threads (see the sketch after the quoted excerpt below).
[1] CUDA Best Practices Guide - Branching and Divergence
Avoid different execution paths within the same warp.
Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.
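A minimal sketch of the suggested pattern, using a simple counter in global memory as the shared resource (the kernel name and the counter are illustrative, not the asker's graph code):

__global__ void kernelFunction(int *globalCounter)
{
    __shared__ int localCounter;                 // one copy per block

    if (threadIdx.x == 0)
        localCounter = 0;                        // a single thread initializes it
    __syncthreads();                             // all threads see the init

    atomicAdd(&localCounter, 1);                 // every thread updates the block copy
    __syncthreads();                             // wait until all updates are done

    if (threadIdx.x == 0)
        atomicAdd(globalCounter, localCounter);  // one thread pushes the result back
}

Shared-memory atomics require compute capability 1.2, which the GT 240 in the question has.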

Can boost::mutex lock out an OS if enough are active?

I'm working on a producer-consumer problem with an intermediate processing thread. When I run 200 of these applications, it locks up the system on Win7 when lots of connections time out. Unfortunately, not in a way that I know how to debug: the system becomes unresponsive and I have to restart it with the power button. It works fine on my Mac, and oddly enough, it works fine on Windows in safe mode.
I'm using boost 1.44 as that is what the host application uses.
Here is my queue. My intention is that the queues are synchronized on their size. I've manipulated this to use timed_wait to make sure I wasn't losing notifications, though I saw no difference in effect.
class ConcurrentQueue {
public:
    void push(const std::string& str, size_t notify_size, size_t max_size);
    std::string pop();
private:
    std::queue<std::string> queue;
    boost::mutex mutex;
    boost::condition_variable cond;
};

void ConcurrentQueue::push(
    const std::string& str, size_t notify_size, size_t max_size) {
    size_t queue_size;
    {
        boost::mutex::scoped_lock lock(mutex);
        if (queue.size() < max_size) {
            queue.push(str);
        }
        queue_size = queue.size();
    }
    if (queue_size >= notify_size)
        cond.notify_one();
}

std::string ConcurrentQueue::pop() {
    boost::mutex::scoped_lock lock(mutex);
    while (!queue.size())
        cond.wait(lock);
    std::string str = queue.front();
    queue.pop();
    return str;
}
These threads use the below queues to process and send using libcurl.
boost::shared_ptr<ConcurrentQueue> queue_a(new ConcurrentQueue);
boost::shared_ptr<ConcurrentQueue> queue_b(new ConcurrentQueue);

void prod_run(size_t iterations) {
    try {
        // stagger startup
        boost::this_thread::sleep(
            boost::posix_time::seconds(random_num(0, 25)));
        size_t save_frequency = random_num(41, 97);
        for (size_t i = 0; i < iterations; i++) {
            // compute
            size_t v = 1;
            for (size_t j = 2; j < (i % 7890) + 4567; j++) {
                v *= j;
                v = std::max(v % 39484, v % 85783);
            }
            // save
            if (i % save_frequency == 0) {
                std::string iv =
                    boost::str(boost::format("%1%=%2%") % i % v);
                queue_a->push(iv, 1, 200);
            }
            sleep_frame();
        }
    } catch (boost::thread_interrupted&) {
    }
}

void prodcons_run() {
    try {
        for (;;) {
            std::string iv = queue_a->pop();
            queue_b->push(iv, 1, 200);
        }
    } catch (boost::thread_interrupted&) {
    }
}

void cons_run() {
    try {
        for (;;) {
            std::string iv = queue_b->pop();
            send_http_post("http://127.0.0.1", iv);
        }
    } catch (boost::thread_interrupted&) {
    }
}
My understanding is that using mutexes in this way should not make a system unresponsive. If anything, my apps would deadlock and sleep forever.
Is there some way that having 200 of these at once creates a scenario where this isn't the case?
Update:
When I restart the computer, most of the time I need to replug in the USB keyboard to get it to respond. Given the driver comment, I thought that may be relevant. I tried updating the northbridge drivers, though they were up to date. I'll look to see if there are other drivers that need attention.
Update:
I've watched memory, non-paged pool, cpu, handles, ports and none of them are at alarming rates at any time while the system is responsive. It's possible something spikes at the end, though that is not visible to me.
Update:
When the system hangs, it stops rendering and does not respond to the keyboard. The last frame that it rendered stays up though. The system sounds like it is still running and when the system comes back up there is nothing in the event viewer saying that it crashed. There are no crash dump files either. I interpret this as the OS is being blocked out of execution.
A mutex lock only blocks other applications that use the same lock. Any mutex used by the OS should not be (directly) available to any application.
Of course, if the mutex is implemented using the OS in some way, it may well call into the OS, and thus use CPU resources. However, a mutex lock should not cause any worse behaviour than the application using CPU resources in any other way.
It may of course be that if you use locks in an inappropriate way, different parts of the application become deadlocked: function 1 acquires lock A and function 2 acquires lock B; if function 1 then tries to acquire lock B and function 2 tries to acquire lock A before releasing their respective locks, you have a deadlock. The trick here is to always acquire multiple locks in the same order: if you need two locks at the same time, always acquire lock A first, then lock B.
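A minimal sketch of that ordering rule, using the same boost::mutex primitives as the question (the function and lock names are illustrative):

#include <boost/thread/mutex.hpp>

boost::mutex lockA, lockB;

void function1() {
    boost::mutex::scoped_lock a(lockA);   // always A first...
    boost::mutex::scoped_lock b(lockB);   // ...then B
    // work with both protected resources
}

void function2() {
    boost::mutex::scoped_lock a(lockA);   // same order here, so function1 and
    boost::mutex::scoped_lock b(lockB);   // function2 can never deadlock each other
    // work with both protected resources
}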
Deadlocking should not affect the OS as such - if anything, it makes things better, but if the application is in some way "misbehaving" in case of a deadlock, it may cause problems by calling into the OS a lot - e.g. if the locking is done by:
while (!trylock(lock))
{
    // do nothing here
}
it may cause peaks in system usage.