Fastest way to communicate with subprocess - c++

I have a parent process which spawns several subprocesses to do some CPU intensive work. For each batch of work, the parent needs to send several 100MB of data (as one single chunk) to the subprocess and when it is done, it must receive about the same amount of data (again as one single chunk).
The parent process and the subprocess are different applications and even different languages (Python and C++ mostly) but if I have any solution in C/C++, I could write a Python wrapper if needed.
I thought the simplest way would be to use pipes. That has many advantages: it's mostly cross-platform, simple, and flexible, and I could later extend the code without too much work to communicate over a network.
However, now that I'm profiling the whole application, I see noticeable overhead in the communication and I wonder whether there are faster ways. Cross-platform support is not really needed in my case (scientific research); it's enough if it works on Ubuntu >= 12 or so (although MacOSX would also be nice). In principle, I thought that copying a big chunk of data into a pipe and reading it at the other end should not take much more time than setting up some shared memory and doing a memcpy. Am I wrong? How much worse would you expect the performance to be?
The profiling itself is complicated and I don't really have reliable and exact data, only clues (it's all a quite complicated system). I wonder where I should spend my time now: trying to get more exact profiling data, trying to implement a shared memory solution and seeing how much it improves things, or something else? I have also thought about wrapping and compiling the subprocess application as a library and linking it into the main process, thus avoiding the communication with another process entirely - in that case I would just need a memcpy.
There are quite a few related questions here on StackOverflow but I haven't really seen a performance comparison for different methods of communication.

Ok, so I wrote a small benchmarking tool here which copies some data (~200MB) either via shared memory or via pipe, 10 times.
Results on my MacBook with MacOSX:
Shared memory:
24.34 real 18.49 user 5.96 sys
Pipe:
36.16 real 20.45 user 17.79 sys
So, first we see that the shared memory is noticeably faster. Note that if I copy smaller chunks of data (~10MB), I almost don't see a difference in total time.
The second noticeable difference is the time spent in kernel. It is expected that the pipe needs more kernel time because the kernel has to handle all those reads and writes. But I would not have expected it to be that much.
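For anyone curious, the shared-memory variant is conceptually something like the minimal sketch below (not the actual benchmark code). It uses POSIX shm_open/mmap; the segment name "/ipc_buf" and the 200MB size are placeholders, and the synchronization between writer and reader (e.g. a semaphore) is omitted.

    // Minimal POSIX shared-memory setup (writer side). The segment name
    // "/ipc_buf" is a placeholder; link with -lrt on older glibc.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>

    int main() {
        const size_t kSize = 200u * 1024 * 1024;            // ~200MB payload
        int fd = shm_open("/ipc_buf", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, kSize) != 0) { perror("ftruncate"); return 1; }

        void* p = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        // The writer fills the mapping directly; the other process maps the
        // same name and reads it back - one memcpy on each side, with no
        // per-read()/write() copy through the kernel.
        memset(p, 0x42, kSize);

        munmap(p, kSize);
        close(fd);
        shm_unlink("/ipc_buf");                             // remove the segment
        return 0;
    }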

Related

Is there any way to isolate two thread memory space?

I have an online service which is single-threaded. Since the whole program shares one memory pool (using new or malloc), one module can corrupt memory and make another module work incorrectly. I want to split the program into two parts, each running on its own thread. Is it possible to isolate thread memory the way separate processes are isolated, so I can check where the problem is? (Splitting it into multiple processes would cost a lot of time and be risky, so I don't want to try that.)
As long as you use threads, memory can easily be corrupted since, BY DESIGN, threads share the same memory. Splitting your program across two threads won't help in ANY manner for security - it can greatly help with CPU load, latency, performance, etc., but in no way as an anti-memory-corruption mechanism.
So either you need to ensure proper development practices so that your code won't plow over memory it must not touch, or you use multiple processes - those are isolated from each other by the operating system.
You can try to sanitize your code using tools designed for this purpose, but it depends on your platform - you didn't give it in your question.
It can range from a simple Debug build with MSVC under Windows up to a Valgrind analysis under Linux, and so on - the list can be huge, so please give additional information.
Also, if it's your threading code that contains the error, maybe rethink what you're doing: switching to multiple processes may be the cheapest solution in the end - don't be fooled by sunk cost, especially since threading can NOT protect part 1 against part 2 and vice versa...
Isolation like this is quite often done by using separate processes.
The downsides of doing that are
harder communication between the processes (but that's kind of the point)
the overhead of starting a process is typically a lot larger than starting a thread, so you would not typically spawn a new process for each request, for example.
A common model is a lead process that starts a child process to do the request serving; the lead process just monitors the health of the worker (see the sketch below).
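A minimal sketch of that lead/worker pattern on POSIX; run_worker() is a hypothetical stand-in for the request-serving code, and the restart policy is deliberately simplistic.

    // Lead process forks a worker and restarts it if it dies. A crash in
    // the worker cannot corrupt the lead process's memory.
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    static void run_worker() {
        // ... serve requests here (hypothetical) ...
        _exit(0);
    }

    int main() {
        for (;;) {
            pid_t pid = fork();
            if (pid == 0) {                  // child: isolated address space
                run_worker();
            } else if (pid > 0) {            // parent: monitor the worker
                int status = 0;
                waitpid(pid, &status, 0);
                if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                    break;                   // clean shutdown
                fprintf(stderr, "worker died, restarting\n");
            } else {
                perror("fork");
                return 1;
            }
        }
        return 0;
    }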
Or you could fix your code so that it doesn't get corrupted (an easy thing to say, I know).

Create huge text file - multi-threading a good idea?

I need to create huge (>10 GB) text files where every line is a very long number - basically a string, since even types like unsigned long long won't be enough. So I will be using a random generator, and my first thought was that it's probably a good idea to create several threads. From what I see, every thread would be writing one line at a time, which is considered a thread-safe operation in C++.
Is it a good idea or am I missing something and it's better just to write line by line from one thread?
A correct answer here will depend fairly heavily on the type of drive to which you're writing the file.
If it's an actual hard drive, then a single thread will probably be able to generate your random numbers and write the data to the disk as fast as the disk can accept it. With reasonable code on a modern CPU, the core running that code will probably spend more time idle than it does doing actual work.
If it's a SATA SSD, effectively the same thing will be true, but (especially if you're using an older CPU) the core running the code will probably spend a lot less time idle. A single thread will probably still be able to generate the data and write it to the drive as fast as the drive can accept it.
Then we get to things like NVMe and Optane drives. Here you honestly might stand a decent chance of improving performance by writing from multiple threads. But (at least in my experience) to do that, you just about have to skip past using iostreams and instead talk directly to the OS. Under Windows, that would mean opening the file with CreateFile (specifying FILE_FLAG_OVERLAPPED when you do). Windows also has built-in support for I/O completion ports (which are really sort of thread pools) to minimize overhead and improve performance - but using it is somewhat nontrivial.
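A minimal sketch of that overlapped-write approach follows; the file name, buffer contents and sizes are placeholders, and error handling is trimmed.

    // Open with FILE_FLAG_OVERLAPPED and issue a write at an explicit
    // offset, then wait for it with GetOverlappedResult.
    #include <windows.h>
    #include <cstring>

    int main() {
        HANDLE h = CreateFileA("numbers.txt", GENERIC_WRITE, 0, nullptr,
                               CREATE_ALWAYS, FILE_FLAG_OVERLAPPED, nullptr);
        if (h == INVALID_HANDLE_VALUE) return 1;

        char buf[1 << 16];                           // one chunk of generated lines
        memset(buf, '7', sizeof(buf));

        OVERLAPPED ov = {};
        ov.Offset = 0;                               // low 32 bits of file offset
        ov.OffsetHigh = 0;                           // high 32 bits
        ov.hEvent = CreateEventA(nullptr, TRUE, FALSE, nullptr);

        // Returns immediately with ERROR_IO_PENDING; the write completes
        // in the background.
        WriteFile(h, buf, sizeof(buf), nullptr, &ov);

        DWORD written = 0;
        GetOverlappedResult(h, &ov, &written, TRUE); // block until done

        CloseHandle(ov.hEvent);
        CloseHandle(h);
        return 0;
    }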
On Linux, asynchronous I/O is a bit more of a sore point. There's an official AIO interface for asynchronous I/O, and Linux has had an implementation of it for a long time, but it has never really worked very well. More recently, something called io_uring was added to the Linux kernel. I haven't used it a lot, but it looks like it's probably a better design - though it doesn't (as far as I know) support the standard AIO interface, so you pretty much have to use it via its own liburing library instead. Rather like Windows I/O completion ports, this works well, but using it is somewhat nontrivial.
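For completeness, a minimal liburing sketch of a single asynchronous write; it assumes liburing is installed (link with -luring), and the file name and buffer are placeholders.

    // Queue one write through io_uring and wait for its completion.
    #include <liburing.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>

    int main() {
        int fd = open("numbers.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);                 // queue depth 8

        char buf[4096];
        memset(buf, '7', sizeof(buf));

        struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0); // write at offset 0
        io_uring_submit(&ring);

        struct io_uring_cqe* cqe;
        io_uring_wait_cqe(&ring, &cqe);                   // block for completion
        if (cqe->res < 0)
            fprintf(stderr, "write failed: %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }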
If there is no explicit synchronization between the threads, you need to make sure that the library functions you use are thread-safe. For example, a C++ random number generator from <random> is not safe to share, so it would be best to have one RNG per thread. Additionally, you need to look at the bottlenecks. Conversion of a number to text is one, and multithreading would help with that. Output is another, and multithreading would not help there. Profiling would help resolve this.
Ostreams are not thread-safe, so you'll have to use synchronization to protect each thread's access.
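A sketch combining both points - one RNG per thread and synchronized access to the shared stream; the thread count, line length and line count are arbitrary demo values.

    // Each thread owns its own RNG and formats lines into a local buffer;
    // only the final write to the shared stream is serialized.
    #include <fstream>
    #include <mutex>
    #include <random>
    #include <string>
    #include <thread>
    #include <vector>

    int main() {
        std::ofstream out("numbers.txt");
        std::mutex out_mutex;
        const int kThreads = 4;
        const int kLinesPerThread = 100000;               // arbitrary demo sizes
        const int kDigitsPerLine = 100;

        std::vector<std::thread> workers;
        for (int t = 0; t < kThreads; ++t) {
            workers.emplace_back([&, t] {
                std::mt19937_64 rng(std::random_device{}() + t);  // per-thread RNG
                std::uniform_int_distribution<int> digit('0', '9');
                std::string block;
                for (int i = 0; i < kLinesPerThread; ++i) {
                    for (int d = 0; d < kDigitsPerLine; ++d)
                        block.push_back(static_cast<char>(digit(rng)));
                    block.push_back('\n');
                }
                std::lock_guard<std::mutex> lock(out_mutex);      // serialize I/O
                out << block;
            });
        }
        for (auto& w : workers) w.join();
        return 0;
    }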

Profiling a multiprocess system

I have a system that I need to profile.
It is comprised of tens of processes, mostly C++, some with several threads, that communicate with the network and with one another through various system calls.
I know there are performance bottlenecks sometimes, but no one has put in the time/effort to check where they are: they may be in userspace code, inefficient use of syscalls, or something else.
What would be the best way to approach profiling a system like this?
I have thought of the following strategy:
Manually logging the round-trip times of various code sequences (for example, processing an incoming packet or a CLI command) and seeing which process takes the largest time. After that, profiling that process, fixing the problem, and repeating.
This method seems sort of hacky and guesswork-driven. I don't like it.
How would you suggest to approach this problem?
Are there tools that would help me out (multi-process profiler?)?
What I'm looking for is more of a strategy than just specific tools.
Should I profile every process separately and look for problems? If so, how do I approach this?
Do I try to isolate the problematic processes and go from there? If so, how do I isolate them?
Are there other options?
I don't think there is a single answer to this sort of question, and every type of issue has its own problems and solutions.
Generally, the first step is to figure out WHERE in the big system is the time spent. Is it CPU-bound or I/O-bound?
If the problem is CPU-bound, a system-wide profiling tool can be useful to determine where in the system the time is spent. The next question is of course whether that time is actually necessary or not; no automated tool can tell the difference between a badly written piece of code that does a million completely useless processing steps and one that does a matrix multiplication with a million elements very efficiently - it takes the same amount of CPU time to do both, but one isn't actually achieving anything. However, knowing which program takes most of the time in a multi-program system can be a good starting point for figuring out whether that code is well written or can be improved.
If the system is I/O-bound, such as network or disk I/O, then there are tools for analysing disk and network traffic that can help. But again, expecting the tool to tell you what packet response or disk access time you should expect is a different matter - whether you contact Google to search for "kerflerp" or contact your local webserver that is a metre away has a dramatic impact on what counts as a reasonable response time.
There are lots of other issues - running two pieces of code in parallel that use LOTS of memory can cause both to run slower than if they were run in sequence, because the high memory usage causes swapping, or because the OS isn't able to use spare memory for caching file I/O, for example.
On the other hand, two or more simple processes that use very little memory will benefit quite a lot from running in parallel on a multiprocessor system.
Adding logging to your applications so that you can see WHERE they spend their time is another method that works reasonably well, particularly if you KNOW what the use-case is where it takes time.
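One cheap way to add that kind of timing log is a scoped timer around the code sequence you care about; handle_packet() below is just a hypothetical hot path.

    // RAII timer: logs how long a labelled scope took, which is often
    // enough to see where a request's round-trip time goes.
    #include <chrono>
    #include <cstdio>

    struct ScopedTimer {
        const char* label;
        std::chrono::steady_clock::time_point start;
        explicit ScopedTimer(const char* l)
            : label(l), start(std::chrono::steady_clock::now()) {}
        ~ScopedTimer() {
            auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                          std::chrono::steady_clock::now() - start).count();
            fprintf(stderr, "[timing] %s: %lld us\n", label,
                    static_cast<long long>(us));
        }
    };

    void handle_packet() {                       // hypothetical hot path
        ScopedTimer t("handle_packet");
        // ... existing processing code ...
    }

    int main() {
        handle_packet();
        return 0;
    }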
If you have a use-case where you know "this should take no more than X seconds", running regular pre- or post-commit tests to check that the code is behaving as expected, and that no one has added a lot of code that slows it down, would also be useful.

C++, always running processes or invoked executable files?

I'm working on a project made up of several separate processes (services). Some services are called every second, some every minute, and some services may not be called for days. (There are also some services that are called randomly, with no exact information about their call times.)
I have two approaches for developing the project: make the services always-running processes that use interprocess messaging, or write separate C++ programs and run the executable files when I need them.
I have two questions that I couldn't find a suitable answer to.
Is there any way I could calculate an approximate threshold that can help me answer "when to use which way"?
How much faster are always-running processes? (I mean compared with the OS initializing and running an executable file.)
Edit 1: As mentioned in the comments and in Mats Petersson's answer, the answer to my questions depends heavily on the environment, so let me explain more about these conditions.
OS: CentOS 6.3
The services are small (normally fewer than 1000 lines of code) and use no additional resources (such as a database).
I don't think anyone can answer your two direct questions, as it depends on many factors, such as what OS, what secondary storage, how large the application is, and what the application does (loading the contents of a database with a million entries takes much longer than int x = 73; as the whole of the initialization outside main).
There is overhead with both approaches, and assuming there isn't enough memory to hold EVERYTHING in RAM at all times (and modern OS's will try to make use of the RAM as disk-cache or for other caching, rather than keep old crusty application code that doesn't run, so eventually your application code will be swapped out if it's not being run), you are going to have approximately the same disk I/O amount for both solutions.
For me, "having memory available" trumps other things, so executing a process when itäs needed is better than leaving it running in the expectation that in some time, it will need to be reused. The only exceptions are if the executable takes a long time to start (in other words, it's large and has a complex starting procedure) AND it's being run fairly frequently (at the very least several times per minute). Or you have high real-time requirements, so the extra delay of starting the process is significantly worse than "we're holding it in memory" penalty (but bear in mind that holding it in memory isn't REALLY holding it in memory, since the content will be swapped out to disk anyway, if it isn't being used).
Starting a process that was recently run is typically done from cache, so it's less of an issue. Also, if the application uses shared libraries (.so, .dll or .dynlib depending on OS) that are genuinely shared, then it will normally shorten the load time if that shared library is in memory already.
Both Linux and Windows (and I expect OS X) are optimised to load a program much faster the second time it executes in short succession - because it caches things, etc. So for the frequent calling of the executable, this will definitely work in your favour.
I would start by "execute every time", and if you find that this is causing a problem, redesign the programs to stay around.

How to avoid HDD thrashing

I am developing a large program which uses a lot of memory. The program is quite experimental and I add and remove big chunks of code all the time. Sometimes I add a routine that is rather too memory-hungry, and the HDD starts thrashing and the program (and the whole system) slows to a snail's pace. It can easily take 5 minutes to shut it down!
What I would like is a mechanism for avoiding this scenario: either a run-time procedure, or even something done before running the program, which could say something like "If you run this program there is a risk of HDD thrashing - aborting now to avoid slowing to a snail's pace".
Any ideas?
EDIT: Forgot to mention, my program uses multiple threads.
You could consider using SetProcessWorkingSetSize (sketched below). This would be useful in debugging, because your app will crash with a fatal exception when it runs out of memory instead of dragging your machine into a thrashing situation.
http://msdn.microsoft.com/en-us/library/ms686234%28VS.85%29.aspx
Similar SO question
Set Windows process (or user) memory limit
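A rough sketch of applying the cap early in startup; the 16MB/512MB figures are arbitrary examples, and exactly how allocations behave beyond the cap depends on the OS version and limit flags, so treat it as a debugging aid.

    // Cap this process's working set during startup.
    #include <windows.h>

    int main() {
        SIZE_T minWs = 16 * 1024 * 1024;         // 16MB minimum (example)
        SIZE_T maxWs = 512 * 1024 * 1024;        // 512MB maximum (example)
        if (!SetProcessWorkingSetSize(GetCurrentProcess(), minWs, maxWs)) {
            // GetLastError() tells you why the call failed
        }
        // ... rest of the memory-hungry program ...
        return 0;
    }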
Windows XP is terrible when multiple threads or processes access the disk at the same time. This is effectively what you experience when your application begins to swap, as the OS writes out some pages while reading in others. Windows XP (and Server 2003, for that matter) is utter trash at this. This is a real shame, as it means that swapping is almost synonymous with thrashing on these systems.
Your options:
Microsoft fixed this problem in Vista and Server 2008. So stop using a 9-year-old OS. :)
Use unbuffered I/O to read/write data to a file, and implement your own paging inside your application. Implementing your own "swap" like this lets you avoid thrashing (see the sketch after this list).
See here many more details of this problem: How to obtain good concurrent read performance from disk
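A minimal sketch of option 2's unbuffered write; the 4096-byte sector size is an assumption (query it at runtime with GetDiskFreeSpace in real code) and the file name is a placeholder.

    // Unbuffered write: the buffer address, size and file offset must all
    // be multiples of the sector size.
    #include <windows.h>
    #include <malloc.h>
    #include <cstring>

    int main() {
        const DWORD kSector = 4096;                          // assumed sector size
        void* buf = _aligned_malloc(kSector * 16, kSector);  // aligned block
        memset(buf, 0, kSector * 16);

        HANDLE h = CreateFileA("pagefile.bin", GENERIC_READ | GENERIC_WRITE,
                               0, nullptr, CREATE_ALWAYS,
                               FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                               nullptr);
        if (h == INVALID_HANDLE_VALUE) { _aligned_free(buf); return 1; }

        DWORD written = 0;
        WriteFile(h, buf, kSector * 16, &written, nullptr);  // bypasses the OS cache

        CloseHandle(h);
        _aligned_free(buf);
        return 0;
    }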
I'm not familiar with Windows programming, but under Unix you can limit the amount of memory that a program can use with setrlimit(). Maybe there is something similar on Windows. The goal is to get the program to abort once it uses too much memory, rather than thrashing. The limit should be a bit less than the total physical memory on the machine; I would guess somewhere between 75% and 90%, but some experimentation would be necessary to find the optimal setting.
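A sketch of that setrlimit() idea on Linux; the 6GB figure is an arbitrary example, so pick roughly 75-90% of your physical RAM instead.

    // Cap the address space so allocations fail (and the program can abort
    // cleanly) instead of pushing the machine into swap.
    #include <sys/resource.h>
    #include <cstdio>

    int main() {
        struct rlimit rl;
        getrlimit(RLIMIT_AS, &rl);
        rl.rlim_cur = 6ULL * 1024 * 1024 * 1024;   // soft limit: 6GB (example)
        if (setrlimit(RLIMIT_AS, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }
        // Beyond the limit, malloc returns NULL / new throws std::bad_alloc.
        // ... rest of the memory-hungry program ...
        return 0;
    }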
Chances are your program could use some memory management. While there are a few programs that do need to hold everything in memory at once, odds are good that with a little bit of foresight you might be able to rework your program to reuse or discard a lot of the memory you need.
Your program will run much faster too. If you are using that much memory, then basically all of your built-in first and second level caches are likely overflowing, meaning the CPU is mostly waiting on memory loads instead of processing your code's instructions.
I'd rather determine reasonable minimum requirements for the computer your program is supposed to run on, and during installation either warn the user if there's not enough memory available, or refuse to install.
Warning the user every time they start the program would be nonsensical.