How to implement a communication between MATLAB and a separate C++ application?

I read that I can use TCP/IP to send data between the two applications, but I was wondering if that would be faster than using binary files and polling the files for change? I don't have time to implement both methods and benchmark, so if someone has knowledge of this I'd appreciate the input.
I would need to send two buffers back and forth, one very small (a few KB) and one that could be 0.1 - 1 MB in size.
I should also mention that the C++ application runs on a cluster and is parallelized with MPI such that each process needs to read the entire buffer. When reading binary files, they can do it in parallel at the same time, so it's not an issue. I'm not sure if that can be done with TCP/IP.

If you try to do this via TCP/IP you'll need to serialise the data in one application and reconstruct it in the other, using the API of your network stack. This could get messy.
One thing you could try is a memory-mapped file, which basically shares memory between the processes. It is very fast and quite common. Find an example for each side, C++ and MATLAB, and take it from there.
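A minimal sketch of the C++ side of the memory-mapped file idea, assuming a POSIX system (Linux or macOS). The helper names map_shared_file, publish and fetch are made up for illustration, and the buffer is assumed to be a flat array of doubles. On the MATLAB side the same file could be opened with memmapfile. Note that a mapping is only shared between processes on the same host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: map `size` bytes of the file at `path` as shared memory.
// The file is created and resized if needed. Returns nullptr on failure.
void* map_shared_file(const std::string& path, std::size_t size) {
    assert(size > 0);
    int fd = open(path.c_str(), O_RDWR | O_CREAT, 0600);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, static_cast<off_t>(size)) != 0) {
        close(fd);
        return nullptr;
    }
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return p == MAP_FAILED ? nullptr : p;
}

// Writer side: copy a buffer into the mapping and push it to the file.
// The mapping must be at least data.size() * sizeof(double) bytes long.
bool publish(void* mapping, const std::vector<double>& data) {
    const std::size_t bytes = data.size() * sizeof(double);
    std::memcpy(mapping, data.data(), bytes);
    return msync(mapping, bytes, MS_SYNC) == 0;
}

// Reader side: copy `count` doubles out of a mapping of the same file.
std::vector<double> fetch(const void* mapping, std::size_t count) {
    std::vector<double> out(count);
    std::memcpy(out.data(), mapping, count * sizeof(double));
    return out;
}
```

One process would call map_shared_file and publish, the other map_shared_file and fetch on the same path. Some signal that new data is ready (a flag or counter at a known offset, for example) is still needed and is left out here.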

Elementary, dear Watson...
This is what a process has to pay:
whenever a process touches a file [SERIAL]-ly, expect + ~12-15 [ms] access latency on spinning drives
whenever a process touches a file [CONCURRENT]-ly, add + a few [us] of fileIO overhead, as more than one process tries to govern when and what gets read/written ( the more MPI processes compete for access, the lower the probability of an isolated access, so these overheads naturally only grow ). The more MPI-nodes keep ( cit. ) "polling the files for change", the larger the overheads grow. This is a very bad idea if the goal is performance: the accumulated tens to hundreds to thousands of [ms] make it the most expensive way to go.
whenever a process touches RAM [SERIAL]-ly, expect + ~80-100 [ns] access latency on classical DDR-x DRAM. Re-reads may get masked by the Cache hierarchy to cost only + a few [ns]
whenever a process touches RAM [CONCURRENT]-ly, add + a few [ns], as most CPUs are out-of-order execution hardware and can automatically detect and re-arrange memory access patterns to hide a part of this add-on latency, though at the cost of losing some Cache hits, be it due to temporal or capacity induced collisions ( so having to pay those + ~80-100 [ns] to re-fetch from DRAM once the Cached LRU-part got evicted ).
Next comes the cost of passing a volume of data through a limited bandwidth:
reading from a file takes a lot of time. For the maximum flow, see the SATA-x, NAS etc. specification to work out how long it will take to move 1 MB through. Nevertheless, expect a lower actual throughput / longer time, as not all drives deliver the performance their interface allows ( interface performance does not imply that the inner parts of the device are as fast as the "cable" towards the external world, which is why this is sometimes masked by "buffered" storage in the in-drive controller's embedded cache )
reading from memory is way faster
reading from cache is fastest of all ( for 1 MB, the whole circus may easily fit inside a localhost CPU's L3-cache )
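The two cost components above (access latency plus time to push the volume through a limited bandwidth) can be combined into a back-of-the-envelope estimate. This is only a sketch: transfer_seconds is a made-up helper, and the bandwidth figures in the usage note are illustrative assumptions, not measurements.

```cpp
#include <cassert>

// Estimated seconds to deliver `bytes` over a channel with a fixed access
// latency (seconds) and a sustained bandwidth (bytes per second).
double transfer_seconds(double bytes, double latency_s, double bandwidth_bytes_per_s) {
    assert(bandwidth_bytes_per_s > 0.0);
    return latency_s + bytes / bandwidth_bytes_per_s;
}
```

Assuming a spinning drive with 12 [ms] latency that sustains 100 MB/s, transfer_seconds(1e6, 0.012, 100e6) gives about 22 [ms] for the 1 MB buffer. Assuming DRAM with 100 [ns] latency and 10 GB/s, transfer_seconds(1e6, 100e-9, 10e9) gives about 0.1 [ms].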
What will cause problems:
Since your code is designed to run distributed under MPI, all data manipulations get a bit outside of your caching / latency masking control.
If the MPI-distributed processes try to read from some common source, be it a file or memory, all the advanced tricks may lose their ground, as non-localhost MPI-nodes may run into issues not seen in localhost-only implementation models, and MEMMAP-alike tricks naturally drop out of the game.
If performance matters the most:
You may use a lightweight, high-performance, asynchronous smart messaging/signalling toolkit like ZeroMQ, which has bindings for MATLAB and almost all other languages ( ref. documentation ). Going this way, your design can avoid polling for changes and instead explicitly signal each relevant change to the affected peer-processing nodes. That is way smarter, yet more lightweight, than trying to re-read files, and it works even in LAN/(WAN)-distributed clusters at a per-message cost in the [us] range rather than the [ms] range of fileIO, i.e. roughly a thousand times less expensive, while still being able to serve a widely distributed, non-localhost-only computing cluster.

Related

Why do file IO speeds change when reading file size?

I'm creating a memory analysis program in C++ on Windows 10 using a 7200rpm HDD that essentially scans your drive and reports back on which folders are using how much space, allowing you to figure out where most of your drive storage is being used.
For efficiency reasons, I'm using C++ and my methodology is scanning the entire drive recursively, then reading each file's size in another thread so that I can both scan and analyze size at the same time. For obvious reasons, scanning is much faster than reporting on size but I've noticed that the IO speeds jump around a lot. Sometimes it'll read the size of 5000 files/second, whereas other times it'll read 10 files/second. Take a look at the video at this link. The first number is how many files' sizes have been read and the second number is how many files have been found altogether. The first number is what's important here.
Why does my file IO speed change, and is there anything I can do about this?

You have many bottlenecks to consider, both on the processor side and on the hard drive side.
Locating The Data
Essentially, the hard drive has to locate the sectors and tracks that contain the data in the file. If you are really lucky, the data will be in sequential sectors on sequential tracks, thus causing very little head movement or repositioning. However, file data can be "scattered", and thus the hard drive will read as much as it can, then calculate the next position of the data, relocate the head to that position and keep reading. This affects the flow of the data. If your drive is intelligent and has a lot of cache, the drive could place this data into a cache and deliver data from the cache instead of the platters, possibly making up some of the milliseconds lost to repositioning.
The Data Bus
The data has to go into the PC's memory. Usually there is only one bus for the data. This bus is shared among many entities in your system, the processor and the hard-drive controller to name a few. If you're lucky, your PC has a Direct Memory Access (DMA) controller for the hard drive. The controller can transfer data from the hard drive port into memory, bypassing the processor. However, the DMA controller must share the data bus with the processor (and friends). The bus arbitration is another source of slowdown and inconsistency.
Sharing the Drive
Many operating systems use the hard drive as virtual memory; swapping out blocks of memory. These file requests will need to be intermingled with the requests from your program.
Sequential Access
Most of the cheaper platforms have sequential access to the drive: only one entity can read at a time, and most drives deliver a single bit stream. Higher-performance, custom platforms may actually have more than one drive running in parallel. Because of the sequential nature of the device, entities must either wait for one another to finish or interleave their transactions. Compare this with memory, which is accessed in parallel (8 or more bits read at the same time).
Interruptions & Scheduling
There are lots of activities going on inside your PC, from internet or wifi communications to audio and video playback (as well as other system tasks). These all need to run. No matter how many cores you have, there are never enough. Most operating systems schedule tasks by time and priority. Very rarely will one task have exclusive ownership of a processor until the task finishes. Your task will be interleaved with the other tasks that are running, which slows down your program.
Chunking It
Most disk clean up utilities work in chunks or pieces of files. Speed is not as important as the quality of the data operation. For example, a smaller chunk of a file will have better success at being moved or copied than a huge chunk. The program can be interrupted (from a User, for example). Smaller chunks allow for easier recovery from an interruption.
There are probably more reasons why your program is executing slowly or has inconsistent timings, but the above information should give you better insight as to the behavior of your PC.

Hard disk contention using multiple threads

I have not performed any profile testing of this yet, but what would the general consensus be on the advantages/disadvantages of resource loading from the hard disk using multiple threads vs one thread? Note. I am not talking about the main thread.
I would have thought that using more than one "other" thread to do the loading to be pointless because the HD cannot do 2 things at once, and therefore would surely only cause disk contention.
Not sure which way to go architecturally, appreciate any advice.
EDIT: Apologies, I meant an SSD, not a magnetic drive. Both are HDs to me, but I am more interested in the case of a system with a single SSD.
As pointed out in the comments, one advantage of using multiple threads is that a large file load will not delay the delivery of a smaller one to the receiver of the loader thread. In my case, this is a big advantage, and so even if it costs a little perf to do it, having multiple threads is desirable.
I know there are no simple answers, but the real question I am asking is, what kind of performance % penalty would there be for making the parallel disk writes sequential (in the OS layer) as opposed to allowing only 1 resource loader thread? And what are the factors that drive this? I don't mean like platform, manufacturer etc. I mean technically, what aspects of the OS/HD interaction influence this penalty? (in theory).
FURTHER EDIT:
My exact use case is texture loading threads which only exist to load from the HD and then "pass" the textures on to OpenGL, so there is minimal computation in the threads (maybe some type conversion etc). In this case, the thread would spend most of its time waiting for the HD (I would have thought), and therefore how the OS-HD interaction is managed is important to understand. My OS is Windows 10.

"Note. I am not talking about the main thread."
Main vs non-main thread makes zero difference to the speed of reading a disk.
"I would have thought that using more than one "other" thread to do the loading to be pointless because the HD cannot do 2 things at once, and therefore would surely only cause disk contention."
Indeed. Not only are the attempted parallel reads forced to wait for each other (and thus not actually parallel), but they also make the access pattern of the disk random as opposed to sequential, which is much, much slower due to disk head seek time.
Of course, if you were to deal with multiple hard disks, then one thread dedicated for each drive would probably be optimal.
Now, if you were using a solid state drive instead of a hard drive, the situation isn't quite so clear cut. Multiple threads may be faster, slower, or comparable. There are probably many factors involved, such as firmware, file system, operating system, speed of the drive relative to some other bottleneck, etc.
In either case, RAID might invalidate assumptions made here.

It depends on how much processing of the data you're going to do. This will determine whether the application is I/O bound or compute bound.
For example, if all you are going to do to the data is some simple arithmetic, e.g. add 1, then you will end up being I/O bound. The CPU can add 1 to data far quicker than any I/O system can deliver flows of data.
However, if you're going to do a large amount of work on each batch of data, e.g. a FFT, then a filter, then a convolution (I'm picking random DSP routine names here), then it's likely that you will end up being compute bound; the CPU cannot keep up with the data being delivered by the I/O subsystem which owns your SSD.
It is quite an art to judge just how an algorithm should be structured to match the capabilities of the underlying machine, and vice versa. There are profiling tools such as ftrace/KernelShark and Intel's VTune, which are both useful in analysing exactly what is going on. Google does a lot to measure how many searches-per-Watt their hardware accomplishes, power being their biggest cost.
In general I/O of any sort, even a big array of SSDs, is painfully slow. Even the main memory in a PC (DDR4) is painfully slow in comparison to what the CPU can consume. Even the L3 and L2 caches are sluggards in comparison to the CPU cores. It's hard to design and multi-threadify an algorithm just right so that the right amount of work is done on each data item whilst it is in L1 cache so that the L2, L3 caches, DDR4 and I/O subsystems can deliver the next data item to the L1 caches just in time to keep the CPU cores busy. And the ideal software design for one machine is likely hopeless on another with a different CPU, or SSD, or memory DIMMs. Intel design for good general purpose computer performance, and actually extracting peak performance from a single program is a real challenge. Libraries like Intel's MKL and IPP are very big helps in doing this.
General Guidance
In general one should look at it in terms of data bandwidth required by any particular arrangement of threads and work those threads are doing.
This means benchmarking your program's inner processing loop and measuring how much data it processed and how quickly it managed to do it, choosing a number of data items that makes sense but is much more than the size of L3 cache. A single 'data item' is an amount of input data, the amount of corresponding output data, and any variables used processing the input to the output, the total size of which fits in L1 cache (with some room to spare). And no cheating - use the CPU's SSE/AVX instructions where appropriate, don't forego them by writing plain C or not using something like Intel's IPP/MKL. [Though if one is using IPP/MKL, it kinda does all this for you to the best of its ability.]
These days DDR4 memory is going to be good for anything between 20 and 100 GByte/second (depending on the CPU, number of DIMMs, etc.), so long as you're not making random, scattered accesses to the data. By saturating the L3 you are forcing yourself into being bound by the DDR4 speed. Then you can start changing your code, increasing the work done by each thread on a single data item. Keep increasing the work per item and the speed will eventually start increasing; you've reached the point where you are no longer limited by the speed of DDR4, then L3, then L2.
If after this you can still see ways of increasing the work per data item, then keep going. You eventually get to a data bandwidth somewhere near that of the IO subsystems, and only then will you be getting the absolute most out of the machine.
It's an iterative process, and experience allows one to short cut it.
Of course, if one runs out of ideas for things to increase the work done per data item then that's the end of the design process. More performance can be achieved only by improving the bandwidth of whatever has ended up being the bottleneck (almost certainly the SSD).
For those of us who like doing this sort of thing, the PS3's Cell processor was a dream. No need to second guess the cache, there was none. One had complete control over what data and code was where and when it was there.

A lot of people will tell you that an HD can't do more than one thing at once. This isn't quite true because modern IO systems have a lot of indirection. Saturating them is difficult to do with one thread.
Here are three scenarios that I have experienced where multi-threading the IO helps.
Sometimes the IO reading library has a non-trivial amount of computation to do; think about reading compressed videos, or parity checking after the transfer has happened. One example is using robocopy with multiple threads. It's not unusual to launch robocopy with 128 threads!
Many operating systems are designed so that a single process can't saturate the IO, because this would lead to system unresponsiveness. In one case I got a 3 percent read speed improvement because I came closer to saturating the IO. This is doubly true if some system policy exists to stripe the data across different drives, as might be set on a Lustre file system in an HPC cluster. For my application, the optimal number of threads was two.
More complicated IO hardware, like a RAID card, contains a substantial cache that keeps the HD head constantly reading and writing. To get optimal throughput you need to be sure that whenever the platter is spinning the head is constantly reading/writing and not just moving. In practice, the only way to do this is to saturate the card's on-board RAM.
So, many times you can overlap some minor amount of computation by using multiple threads, and stuff starts getting tricky with larger disk arrays.
"Not sure which way to go architecturally, appreciate any advice."
Determining the amount of work per thread is the most common architectural optimization. Write code so that it's easy to increase the IO worker count. You're going to need to benchmark.
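One way to keep the IO worker count easy to change is to make it a plain parameter of the loading routine. The following is a minimal sketch under that assumption; load_all is a made-up name, files are read whole into memory, and error handling is reduced to returning an empty string for unreadable files.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <thread>
#include <vector>

// Read every file in `paths` using `worker_count` threads.
// Result i holds the bytes of paths[i]; unreadable files yield an empty string.
std::vector<std::string> load_all(const std::vector<std::string>& paths,
                                  unsigned worker_count) {
    assert(worker_count > 0);
    std::vector<std::string> contents(paths.size());
    std::atomic<std::size_t> next{0};  // index of the next file nobody has claimed

    auto worker = [&]() {
        for (;;) {
            const std::size_t i = next.fetch_add(1);
            if (i >= paths.size()) return;
            std::ifstream in(paths[i], std::ios::binary);
            // Each worker writes only to its own slot, so no lock is needed.
            contents[i].assign(std::istreambuf_iterator<char>(in),
                               std::istreambuf_iterator<char>());
        }
    };

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < worker_count; ++w) workers.emplace_back(worker);
    for (auto& t : workers) t.join();
    return contents;
}
```

Benchmarking then comes down to timing load_all(paths, n) for n = 1, 2, 4, ... on the target machine and keeping the value that performs best.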

What does "Disk Profiling" mean (related to hard disks)?

Currently I am working on an MFC application which reads from and writes to the disk. Sometimes this application runs amazingly fast and sometimes it is damn slow. I am guessing that it is because of the disk access involved, hence I want to profile it. These are some questions in this regard:
(1) Currently I am using the AQTime profiler to profile the application. Has anybody tried profiling disk access using this? Or is there any other tool available which I can use?
(2) What are the most important disk parameters I should be looking at?
(3) If I have multiple threads trying to read and write data from disk, does it affect the performance? I.e. am I better off having single-threaded access to the disk?

You can use the Windows Performance Toolkit for this. You can enable trace providers for disk I/O events and see the I/O time and disk service time for each. It does have a bit of a learning curve though. This will also let you determine which file I/Os actually result in real access to the disk and which are handled by the cache manager.
Most important parameters are disk service time and queue length. Disk service time is how long the disk actually took to service the request. Queue length indicates if your disk request is backed up behind other requests.
For many threads w/ reads & writes - Many disks have poor performance in the face of reads with background writes. If you have various threads doing lots of disk I/O to random locations on the disk, you may wind up starving certain requests.

To help you with (2):
Try to batch up your writes to disk to avoid many small calls to write. When you're done flushing your buffer, call commit. commit (aka fsync) is an expensive operation, and the cost adds up quickly if it is called after each of many small writes.
On Windows file handles you can experiment with FILE_FLAG_WRITE_THROUGH to see whether it improves write speeds. Writes on such a handle go through to the disk instead of waiting in the system cache, so supposedly commit doesn't have to be called on handles opened with this flag.
If data you are writing to disk will also be accessed through reading, consider writing to an in memory structure first, having another thread read from the structure to write it to disk. This will help avoid calls to read data from disk that you have just written.
Hopefully this helps....
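The batching advice above could look roughly like this in portable C++. It is a sketch only: BatchedWriter is a made-up class, and it stops at handing large chunks to the stream. A durable commit (fsync, _commit or FlushFileBuffers, depending on the platform) would be a separate and much rarer step.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: collects small records in memory and hands them to the
// stream in large chunks, so the OS sees few big writes instead of many small ones.
class BatchedWriter {
public:
    BatchedWriter(const std::string& path, std::size_t batch_bytes)
        : out_(path, std::ios::binary | std::ios::trunc), batch_bytes_(batch_bytes) {
        assert(batch_bytes > 0);
        buffer_.reserve(batch_bytes);
    }
    ~BatchedWriter() { flush(); }

    // Queue one record; write everything out once the batch is full.
    void append(const std::string& record) {
        buffer_ += record;
        if (buffer_.size() >= batch_bytes_) flush();
    }

    // Push the pending bytes to the stream. This is not a durable commit.
    void flush() {
        if (buffer_.empty()) return;
        out_.write(buffer_.data(), static_cast<std::streamsize>(buffer_.size()));
        out_.flush();
        buffer_.clear();
        ++flush_count_;
    }

    std::size_t flush_count() const { return flush_count_; }

private:
    std::ofstream out_;
    std::size_t batch_bytes_;
    std::string buffer_;
    std::size_t flush_count_ = 0;
};
```

The batch size is a tuning knob; the best value depends on the drive, file system and OS, so it is worth measuring rather than guessing.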

What I would do is, if you can't pause all threads at the same time and examine their state, focus on one of them and pause that, while it's being "damn slow". This is a little known but effective technique.
Since it is being extremely slow compared to what it could be, whatever it is waiting for it is waiting for probably 99% of the time, so when you pause it you will see it. That's true whether it's one big wait, or a zillion little ones. Look at the whole call stack. The culprit may be somewhere in the middle of the stack.
If you're not sure, pause it two or three times. The culprit will be on all stack samples.

Writing data chunks while processing - is there a convergence value due to hardware constraints?

I'm processing data from a hard disk from one large file (processing is fast and not a lot of overhead) and then have to write the results back (hundreds of thousands of files).
I started by writing the results straight away to files, one at a time, which was the slowest option. I figured it gets a lot faster if I build a vector of a certain number of files and then write them all at once, then go back to processing while the hard disk is occupied with writing all the stuff that I poured into it (that at least seems to be what happens).
My question is, can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints ? To me it seems to be a hard disk buffer thing, I have 16MB buffer on that hard disk and get these values (all for ~100000 files):
Buffer size     Time (min:sec)
------------------------------
no buffer       ~ 8:30
1 MB            ~ 6:15
10 MB           ~ 5:45
50 MB           ~ 7:00
Or is this just a coincidence ?
I would also be interested in experience / rules of thumb about how writing performance is to be optimized in general, for example are larger hard disk blocks helpful, etc.
Edit:
Hardware is a pretty standard consumer drive (I'm a student, not a data center): WD 3.5" 1TB/7200/16MB/USB2, HFS+ journalled; the OS is Mac OS X 10.5. I'll soon give it a try on Ext3/Linux with an internal disk rather than an external one.

"Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?"
Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:
Which filesystem you're using
What disk-scheduling algorithm the kernel is using
The hardware characteristics of your disk
The hardware interconnect you're using
For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.
If I were you I'd take these two steps:
Split my program into multiple threads (or even processes) and use one thread to deliver system calls open, write, and close to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.
Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
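The "try a bunch of alternatives and keep the fastest" step can be sketched as follows. The names time_seconds and pick_fastest are made up for illustration; the measurement is passed in as a callable so that the real write routine (not shown here) can be plugged in.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Run `work` once and return the elapsed wall-clock time in seconds.
template <class Work>
double time_seconds(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Try every candidate setting and return the one with the smallest measured cost.
// `measure(candidate)` returns a cost in seconds, e.g. obtained via time_seconds().
template <class Measure>
std::size_t pick_fastest(const std::vector<std::size_t>& candidates, Measure measure) {
    assert(!candidates.empty());
    std::size_t best = candidates.front();
    double best_cost = measure(best);
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        const double cost = measure(candidates[i]);
        if (cost < best_cost) {
            best_cost = cost;
            best = candidates[i];
        }
    }
    return best;
}
```

Fed with the timings from the question (6:15, 5:45 and 7:00 for 1 MB, 10 MB and 50 MB), pick_fastest would return the 10 MB setting. The chosen value can then be saved to a file and reloaded on later runs.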

The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
That being said, you should look at optimizing your read access. The OS (at least Windows) is already really good at helping write access via buffering "under the hood", but if you're reading serially there isn't too much it can do to help. If you use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased perf.

Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.
You might want to use some dirty tricks. Writing hundreds of thousands of files is not going to be efficient with the normal API.
Test this by writing sequentially to a single file first, not to 100,000 files. Compare the performance. If the difference is interesting, read on.
If you really understand the file system you're writing to, you can make sure you're writing a contiguous block you just later split into multiple files in the directory structure.
You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.
[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.

Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.
Communication between the read thread and write helper(s) consists of a two-std::vector double-buffer per helper. (One buffer is owned by the write thread and one by the read thread.) The read thread fills its buffer up to a specified limit, then blocks. The write thread times the write speed with gettimeofday or whatever, and adjusts the limit. If writing went faster than last time, increase the limit by X%. If it went slower, adjust by –X%. X can be small.
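The limit adjustment described above can be isolated in a small controller. This is a sketch with a made-up class name, AdaptiveLimit; it only covers the "faster grows, slower shrinks" rule, and leaves out the double-buffer hand-off between the threads.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical controller for the buffer limit: compares each reported write
// speed with the previous one and moves the limit by a fixed percentage.
class AdaptiveLimit {
public:
    AdaptiveLimit(std::size_t initial_bytes, double step_percent)
        : limit_(initial_bytes), step_(step_percent / 100.0) {
        assert(initial_bytes > 0 && step_percent > 0.0 && step_percent < 100.0);
    }

    // Report the write speed (bytes per second) of the batch just written.
    // Faster than the previous batch grows the limit; slower shrinks it.
    void report(double bytes_per_second) {
        if (has_previous_) {
            if (bytes_per_second > previous_) grow();
            else if (bytes_per_second < previous_) shrink();
        }
        previous_ = bytes_per_second;
        has_previous_ = true;
    }

    std::size_t limit() const { return limit_; }

private:
    void grow() { limit_ = static_cast<std::size_t>(limit_ * (1.0 + step_)); }
    void shrink() {
        const std::size_t next = static_cast<std::size_t>(limit_ * (1.0 - step_));
        limit_ = next > 0 ? next : 1;  // never let the limit reach zero
    }

    std::size_t limit_;
    double step_;
    double previous_ = 0.0;
    bool has_previous_ = false;
};
```

The write thread would call report() after each batch, and the read thread would consult limit() before deciding whether to block.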

