Context:
I am working on an application that needs fast access to large files, so I use memory-mapping. Reading and writing then become simple memcpy calls. I am now trying to add the ability to abort any reads or writes in progress.
The first thing that came to mind (since I don't know of any interruptible memcpy function) was to memcpy a few KB at a time and periodically check whether the operation should be aborted. This should ensure a near-instantaneous abort, provided the read is reasonably fast.
If it isn't, however, the application still shouldn't take ages to abort, so my second idea was to use multithreading. The memcpy happens in its own thread, and a controlling thread uses WaitForMultipleObjects on both an event that signals abortion and the memcpy thread. It would then kill the memcpy thread if the abortion event was signaled. However, the documentation for TerminateThread states that one should be absolutely sure not to leave the system in a bad state, for instance by failing to release resources.
Question:
Does memcpy do anything that would make it unsafe to kill it while it is copying mapped memory? Is it safe to do so? Is it implementation dependent (on operating systems/architectures other than Windows x86-64)?
I do realize that the second approach may be complete overkill, since no 1 KB read/write is realistically ever going to take that long, but I just want to be safe.
If at all possible you should choose a different design; TerminateThread should not be thought of as a normal function, it is more of a debugging/power tool.
I would recommend that you create a wrapper around memcpy that copies in chunks. The chunk size is really up to you and depends on your responsiveness requirements; 1 MiB is probably a good starting point.
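A minimal sketch of such a wrapper might look like this (the std::atomic<bool> abort flag and the function name are illustrative assumptions, not part of any existing API):

#include <atomic>
#include <cstddef>
#include <cstring>

// Copy `size` bytes in chunks, checking an abort flag between chunks.
// Returns false if the copy was aborted; the destination contents are
// then undefined, since we don't know how much was actually copied.
bool chunked_memcpy(void* dst, const void* src, std::size_t size,
                    const std::atomic<bool>& abort_requested,
                    std::size_t chunk = 1024 * 1024) // 1 MiB per slice
{
    auto* d = static_cast<char*>(dst);
    auto* s = static_cast<const char*>(src);
    while (size > 0) {
        if (abort_requested.load(std::memory_order_relaxed))
            return false;
        const std::size_t n = size < chunk ? size : chunk;
        std::memcpy(d, s, n);
        d += n;
        s += n;
        size -= n;
    }
    return true;
}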
If you absolutely want to kill threads, you have to take a couple of things into account:
You obviously don't know anything about how memcpy works internally, nor how much it has already copied, so you have to assume that the whole range is undefined when you abort.
Terminating a thread will leak memory on some versions of Windows. There are workarounds for that.
Don't hold any locks in the thread.
Related
Can someone clarify the following for me? I have seen some implementations of std::queue for multithreading purposes, where all operations of pushing/popping/erasing elements were protected with a mutex. But when I see that, I imagine the following scenario: we have two threads (thread1 and thread2) and they are running on different cores of the processor, thus they have different L1 caches. And we have the following queue:
struct Message {
char* buf;
size_t size;
};
std::queue<Message> messageQueue;
Thread1 adds some element to the queue, then thread2 tries to access an element with the front() method. But what if that piece of memory was previously cached for thread2's core of the processor (so the size variable may not indicate the current size of buf, or the buf pointer may hold a wrong (not updated) address)?
I ran into this problem while designing a client/server application. On the server side, the server runs in one thread and works directly with sockets; when it receives a new message, it allocates memory for that message and adds it to a message queue. Another thread then accesses this queue, processes the message and deletes it. I am always afraid of caching problems, and because of that I created my own implementation of a queue with volatile pointers.
What is the proper way to work with such things? Do I have to avoid using std::list or std::queue with locks? If this problem cannot actually occur, could you please explain why?
It's not your problem. You're writing C++ code. It's the compiler's job to ensure your code makes the CPU and its caches do the right thing, not yours. You just have to comply with the rules for whatever threading standard you are using.
But if I lock some mutex, how can I be sure that this memory isn't cached, or does mutex locking somehow guarantee reading directly from memory?
You can't. And that's a good thing. Caching massively improves performance and main memory is terribly slow. Fortunately, no modern CPU that you're likely to write multi-threaded code on requires you to sacrifice performance like that. They have incredibly sophisticated optimizations such as cache coherency hardware and prefetch pinning to avoid things that hurt performance that much. What you want is for your code to work, not for it to work awfully.
You can safely use mutex locking as long as you use it everywhere you MIGHT access or update the data from multiple threads. For this example, that means adding to, removing from, and traversing your queue must all be mutex-guarded.
You also have to be careful that the objects you're storing in your queue are properly protected.
Also, please please please do not use char *. That's what std::string is for.
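To make that concrete, here is a rough sketch of a mutex-guarded queue along those lines (the class and method names are made up for illustration). The lock/unlock pair is also what provides the visibility guarantee the question worries about: everything the producer wrote before unlocking is visible to the consumer after it locks.

#include <mutex>
#include <optional>
#include <queue>
#include <string>

struct Message {
    std::string buf; // owns its memory, unlike a raw char*
};

class MessageQueue {
    std::queue<Message> queue_;
    std::mutex mutex_;
public:
    void push(Message m) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(m));
    }
    // Pop and return the front element, or nothing if the queue is empty.
    std::optional<Message> try_pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty())
            return std::nullopt;
        Message m = std::move(queue_.front());
        queue_.pop();
        return m;
    }
};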
I would like to write a multithreading-safe logger using a lock-free queue. Logging threads will push messages onto the queue, and the logger will pop them and send them to the output. I am considering how to solve one part of that: sending to the output.
I would like to avoid using mutexes/locks for as long as possible.
So, let's assume that I am going to use C++ streams to write to the file/console. We can assume that target system is Linux.
OK, writing to a stream must be just a wrapper (perhaps an advanced one) around the write system call offered by Unix. From what I know, syscalls are atomic (only one process can execute a syscall at the same time). So it is tempting not to use locks to make writing to the file safe.
But write doesn't guarantee that the whole output is written; it returns the number of bytes that were successfully written to the file.
Basically, my question is:
How do I solve this? Is it possible to avoid a mutex? (I think it is not.) And please check my considerations above: am I wrong anywhere?
Igor is right: just have one thread do all the log writes. Keep in mind that the kernel has to do locking to synchronize access to the open file descriptor (which keeps track of the file position), so by doing writes from multiple cores you're causing contention inside the kernel. Even worse, you're making system calls from multiple cores, which means the kernel's code / data accesses will dirty your caches on multiple cores.
See this paper for more about the impact of making system calls on the performance of user-space code after the syscall completes. (And about data / instruction cache misses inside the kernel for infrequent syscalls). It definitely makes sense to have one thread doing all the system calls, at least all the write system calls, to keep that part of your process's footprint isolated to one core. As well as the locking contention inside the kernel.
That FlexSC paper is about an idea for batching system calls to reduce user->kernel->user transitions, but they also measure overhead for the normal synchronous system-call method. More important is the discussion of cache-pollution from making system calls.
Alternatively, if you can let multiple threads write to your log file, you could just do that and not use the queue at all.
It's not guaranteed that a large write will finish uninterrupted, but a small to medium sized write should (almost?) always copy its whole buffer on most OSes, especially if you're writing to a file rather than a pipe. I don't know how Linux write() behaves when it's preempted, but I expect it usually resumes and finishes the write instead of returning without having written all the requested bytes. Partial writes might be more likely when interrupted by a signal.
It is guaranteed that bytes from two write() system calls won't be mixed together; all the bytes from one will be before or after the bytes from the other. You're correct that partial writes are a potential problem, though. I forget if the glibc syscall wrapper will resume the call for you on EINTR. Although in that case, it means no bytes actually got written, or it would have returned success with a byte count.
You should test this, for partial writes and for performance. kernel-space locking might be cheaper than the overhead of your lock-free queue, but making system calls from every thread that generates log messages might be worse for performance. (And when you test this, make sure you do it with some real work happening in your user-space process, not just a loop that only calls write.)
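If you do end up handling partial writes yourself, the usual pattern is a retry loop around write(). A sketch under POSIX, with the function name made up (note that this only makes one caller's write complete; retried writes from two threads can still interleave with each other, so it is no substitute for the single-writer thread):

#include <unistd.h>
#include <cerrno>
#include <cstddef>

// Keep calling write() until the whole buffer is written,
// retrying on EINTR and continuing after short writes.
bool write_all(int fd, const char* buf, std::size_t len)
{
    while (len > 0) {
        const ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue; // interrupted before anything was written; retry
            return false; // a real error
        }
        buf += n;
        len -= static_cast<std::size_t>(n);
    }
    return true;
}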
Let's say I have a computationally intensive algorithm running.
For example, let's say it's a routing algorithm, and in a window running on a separate thread I want to show the user which routes are currently being analyzed and such; for whatever reason, the algorithm contains heavily CPU-intensive code.
The important thing is that I don't want to slow down the worker thread just for the sake of displaying progress; it needs to run at full speed as much as possible. It is perfectly OK if the user sees stale data, such as an in-between state that never actually occurred (say, two active routes at once), because this progress visualization is for informational purposes only, and nothing else.
From a theoretical standpoint, I think that according to the C++ standard, my best option is to use std::atomic with std::memory_order_relaxed on both threads. But that would slow down the code on the worker thread noticeably.
From a practical standpoint, though, I'm just tempted to ignore std::atomic altogether and have the worker thread work with all the variables normally. Who cares if the GUI thread reads stale data? I don't, and presumably neither will the user. In reality it won't matter, because there is only one worker thread, and only that thread needs to observe valid writes, which in practice is the only thing that will happen.
What I'm wondering about is:
What is the best way to solve this kind of problem, both in theory and in practice?
Do people just ignore the standard and go for raw primitives, or do they bite the bullet and take the performance hit of using std::atomic?
Or are there other facilities I'm not aware of for solving this problem?
Ignoring proper fences for std::atomic wouldn't buy you much, and you might be at risk of losing the communication between threads completely, mostly on the compiler side. The problem does not exist on the x86 hardware side at all, because each store to memory (if you can ensure your compiler emits it as expected) already has the required store-with-release semantics.
Also, I doubt that sharing the progress more often than 30-100 times per second brings any value. On the other hand, it can certainly put an unnecessary burden on system resources (if repeated in a tight loop) and break compiler optimizations such as vectorization.
So, if the overhead for the worker thread is the concern, share the info less frequently, e.g. update the atomic counter once every 1024 iterations:
#include <atomic>

std::atomic<int> my_atomic_progress{0}; // shared between the two threads

// worker thread
if (i % 1024 == 0) // update the progress info
    my_atomic_progress.store(i, std::memory_order_release); // a regular `mov` on x86

// GUI thread
auto i = my_atomic_progress.load(std::memory_order_consume);
This example also shows the minimal fences necessary to establish the communication; otherwise the compiler would be free to optimize the memory operations out of a loop, for example.
There is no best way; it depends on how much data you need to send to the display. If it's just a single long integer value, and the display carries no guarantees at all, then I'd just write the value and have done with it. Occasionally the reader will read a corrupted value, but it won't matter, so I won't care.
Otherwise, I'd be tempted to push the value to a queue and use an event or condition variable to trigger the read (as often you do not want the reader running full tilt, and you need some way to inform it that there is new data to read).
I'm not sure the overhead of std::atomic is that great; isn't it going to be implemented with the OS primitives anyway? If so, those primitives (on Windows x86 at least, via the InterlockedExchange function) end up as a single CPU instruction after the compiler and optimiser have done their thing.
I've tried to profile one of my applications using Qt.
The results I found seemed to show that Qt is a big user of threads: it seems to create and destroy threads a lot, and that coincides with the peak of its memory consumption. Is that true?
So I've tried to do some research on "how to optimize a Qt application", but I haven't found anything relevant so far.
So I was wondering if there is any "general way" of programming with Qt that could be optimized. Should I use threads in a specific manner? Can I do anything besides respecting the C++ standard, using -pedantic compiler options, and so on?
Generally speaking, if you create and destroy threads a lot, then that's probably not a very good design. Assuming your threads do the same (or similar) things, consider a fixed "pool" of threads: each thread runs for as long as it takes, and is put back in the pool at the point where your current code destroys the thread.
Or, let the thread run forever, and feed it data through some suitable IPC.
I would also say that unless you are doing something very special, if a piece of work takes less than about a quarter of a second, you shouldn't create a thread just to do it. That's not a fixed rule, though.
Threads as such don't use that much memory, but the stack of each thread may use quite a bit of memory.
If you're creating and destroying QThreads a lot, consider using a QThreadPool or QtConcurrent. These will hold threads in reserve and serve them on demand.
If you're not creating and destroying threads a lot, then your problem is elsewhere.
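As a minimal sketch of what that looks like with the global pool (the Task class is hypothetical):

#include <QThreadPool>
#include <QRunnable>

class Task : public QRunnable {
    void run() override {
        // ... the work you would otherwise start a new QThread for ...
    }
};

// autoDelete() is true by default, so the pool deletes the task when done.
QThreadPool::globalInstance()->start(new Task);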
I am considering the use of potentially hundreds of threads to implement tasks that manage devices over a network.
This is a C++ application running on a PowerPC processor with a Linux kernel.
After an initial phase, when each task synchronizes to copy data from the device into the task, the task becomes idle and only wakes up when it receives an alarm or needs to change some data (configuration), which is rare after the start phase. Once all tasks reach the "idle" phase, I expect that only a few per second will need to wake.
So, my main concern is, if I have hundreds of threads will they have a negative impact on the system once they become idle?
Thanks.
amso
edit:
I'm updating the question based on the answers that I got. Thanks guys.
So it seems that having a ton of threads idling (I/O-blocked, waiting, sleeping, etc.) will not, per se, have an impact on the system in terms of responsiveness.
Of course, each thread costs extra memory for its stack and TLS data, but that's okay as long as we can throw more memory at the thing (making it more €€€).
But then other issues have to be accounted for. Having hundreds of threads waiting will likely increase memory usage in the kernel, due to the need for wait queues or other similar resources. There's also a latency issue, which looks non-deterministic. To check the responsiveness and memory usage of each solution, one should measure and compare them.
Finally, the whole idea of hundreds of mostly-idling threads may be modeled as a thread pool. This reduces code linearity a bit, but dramatically increases the scalability of the solution, and with proper care it can easily be tuned to adjust the compromise between performance and resource usage.
I think that's all. Thanks everyone for their input.
--
amso
Each thread has overhead; most importantly, each one has its own stack and TLS. Performance is not that much of a problem, since the threads will not get any time slices unless they actually do anything. You may still want to consider using thread pools.
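For reference, the core of a thread pool is just a shared task queue plus a condition variable. A bare-bones sketch (the class name and interface are made up for illustration):

#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stop_ = false;
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mutex_);
                        // Sleep until there is work or we are shutting down.
                        cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                        if (stop_ && tasks_.empty())
                            return;
                        task = std::move(tasks_.front());
                        tasks_.pop();
                    }
                    task(); // run outside the lock
                }
            });
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_)
            w.join();
    }
};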
Chiefly they will use up address space and memory for stacks; once you get to, say, 1000 threads, this becomes quite significant, as I've seen that 10 MB per thread is typical for stacks (on x86_64). That is changeable, but only with care.
If you have a 32-bit processor, address space will be the main limitation; once you hit 1000s of threads, you can easily exhaust the address space.
They also use up some kernel memory, but probably not as much as the userspace side.
Edit: of course threads only share address space with each other if they are in the same process; I am assuming that they are.
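If the per-thread stacks are the concern, the stack size can be lowered explicitly when creating each thread. A sketch with POSIX threads (the worker function is a placeholder):

#include <pthread.h>

void* worker(void*) {
    // ... per-device task ...
    return nullptr;
}

int main() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 256 * 1024); // 256 KiB instead of the multi-MB default

    pthread_t tid;
    pthread_create(&tid, &attr, worker, nullptr);
    pthread_attr_destroy(&attr);
    pthread_join(tid, nullptr);
}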
I'm not a Linux hacker, but assuming that Linux's thread scheduling is similar to Windows'...
Yes, of course there will be some impact. Every bit of memory you consume will potentially have some impact.
However, in a time-sliced environment, threads that are in a Wait/Sleep/Join state will not consume CPU cycles until they are awoken.
I would be worried about offering 1:1 thread-to-connection mappings, if nothing else because it leaves you rather exposed to denial-of-service attacks. (pthread_create() is a fairly expensive operation compared to just a call to accept().)
EboMike has already answered the question directly - provided threads are blocked and not busy-waiting then they won't consume much in the way of resources although they will occupy memory and swap for all the per-thread state.
I'm learning the basics of the kernel now. I can't give you a specific answer yet; I'm still a noob... but here are some things for you to chew on.
Linux implements each POSIX thread as a unique process. This will create overhead, as others have mentioned. In addition to this, your waiting model appears flawed any way you do it. If you create one condition variable for each thread, then I think (based on my interpretation of the website below) that you'll actually be expending a lot of kernel memory, as each thread would be placed into its own wait queue. If instead you break your threads up so that each group of X threads shares a condition variable, then you've got problems as well, because every time the variable signals, you must wake up _EVERY_DARN_PROCESS_ in that variable's wait queue.
I also assume that you will need to do some object sharing and synchronization. In that case, your code may get slower because of the need to wake up all processes waiting on a resource, as I mentioned earlier.
I know this wasn't much help, but as I said, I'm a kernel noob. Hope it helped a little.
http://book.chinaunix.net/special/ebook/PrenticeHall/PrenticeHallPTRTheLinuxKernelPrimer/0131181637/ch03lev1sec7.html
I'm not sure what "device" you are talking about, but if it's a file descriptor, I'd suggest that you look at starting to migrate to either poll or epoll (I'd suggest the latter, given the description of how active you expect each file descriptor to be). That way, you could use one process which would be responsible for all the fds.
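A rough outline of that single-process approach with epoll (device_fd and handle_event are hypothetical stand-ins for your descriptors and processing code; error handling omitted):

#include <sys/epoll.h>

void handle_event(int fd); // hypothetical: process one device's data

void event_loop(int device_fd) {
    int epfd = epoll_create1(0);

    // Register each descriptor once.
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = device_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, device_fd, &ev);

    // One loop services all registered descriptors; epoll_wait sleeps
    // until one of them has something to report.
    epoll_event events[64];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; ++i)
            handle_event(events[i].data.fd);
    }
}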