How to measure the amount of data transmitted by my MPI program? - profiling

I'm experimenting with my distributed clustering algorithm (implemented with MPI) on 24 computers that I set up as a cluster using BCCD (Bootable Cluster CD), which can be downloaded at http://bccd.net/.
I've written a batch program to run my experiment, which consists of running my algorithm several times while varying the number of nodes and the size of the input data.
I want to know the amount of data transmitted in the MPI communications for each run of my algorithm, so I can see how it changes as the previously mentioned parameters vary. And I want to do all of this automatically from the batch program.
Someone told me to use tcpdump, but I ran into some difficulties with this approach.
First, I don't know how to call tcpdump from my batch program (which is written in C++ and uses the system function to make calls) before each run of my algorithm, since tcpdump needs to run in another terminal, in parallel with my application. And I can't run tcpdump on another computer, since the network uses a switch; I need to run it on the master node.
Second, I watched the traffic with tcpdump while my experiment was running, and I couldn't figure out which port MPI uses. It seems to use many ports. I wanted to know that in order to filter the packets.
Third, I tried capturing whole packets with tcpdump and saving them to a file, and within a few seconds the file was 3.5 MB. But my whole experiment takes 2 days, so the final log file would be huge if I followed this approach.
The ideal approach would be to capture just the size field in the header of each packet and sum these up to obtain the total amount of data transmitted. That way the log file would be much smaller than if I captured whole packets. But I don't know how to do it.
Another restriction is that I don't have access to the computers' disks, so I only have the RAM and my 4 GB USB flash drive. Huge log files are out of the question.
I have already thought about using an MPI tracing or profiling tool such as those mentioned at http://www.open-mpi.org/faq/?category=perftools. So far I have only tested the Sun Performance Analyzer. The problem is that I suspect it will be difficult, maybe even impossible, to install those tools on BCCD. In addition, such a tool will make my experiment take longer to finish, since it adds overhead. But if someone is familiar with BCCD and thinks one of those tools is a good choice, please let me know.
I hope someone has a solution.

Approaches like tcpdump won't work anyway if there are multi-core nodes that use shared memory to communicate.
Using something like MPE is almost certainly the way to go. Those tools add very little overhead, and some overhead is always going to be necessary if you want to count messages. You can use mpitrace to write out every MPI call and parse the resulting text file yourself. Note, by the way, that MPE is explicitly discussed on the BCCD website. MPICH2 comes with MPE built in, but it can be compiled for any implementation. In my experience the overhead of MPE is very modest.
IPM is another nice tool that counts messages and their sizes; you should be able to either parse the XML output, or use the postprocessing tools and just manually integrate the graphs (say, bytes_rx/bytes_tx by rank, or the message buffer size/count graph). The overhead of IPM is even lower than MPE's, and mostly comes after the program has finished running, when the file I/O is done.
If you were really worried about the overhead of either of these approaches, you could always write your own MPI wrappers using the profiling interface (PMPI): wrap MPI_Send, MPI_Recv, etc., count the number of bytes sent and received by each process, and output only those totals at the end. A minimal sketch follows.
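For instance, something along these lines, linked in front of the MPI library (a sketch only; note that MPI-3 declares the send buffer as const void *, while older headers use plain void *, so match whatever your mpi.h declares):

#include <mpi.h>
#include <stdio.h>

static long long bytes_sent = 0, bytes_received = 0;

/* MPI-3 signature; drop the const on buf for MPI-2 headers. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    MPI_Type_size(datatype, &size);
    bytes_sent += (long long)count * size;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source,
             int tag, MPI_Comm comm, MPI_Status *status)
{
    MPI_Status local; /* the caller may pass MPI_STATUS_IGNORE */
    int err = PMPI_Recv(buf, count, datatype, source, tag, comm, &local);
    int received;
    /* count what actually arrived, not the size of the posted buffer */
    if (MPI_Get_count(&local, datatype, &received) == MPI_SUCCESS
        && received != MPI_UNDEFINED) {
        int size;
        MPI_Type_size(datatype, &size);
        bytes_received += (long long)received * size;
    }
    if (status != MPI_STATUS_IGNORE) *status = local;
    return err;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %lld bytes sent, %lld bytes received\n",
           rank, bytes_sent, bytes_received);
    return PMPI_Finalize();
}

A real version would also wrap the nonblocking and collective calls, but this shows the pattern: every MPI_* entry point can be intercepted and then forwarded to its PMPI_* twin.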

Related

Multiple processes reading different parts of a big binary file simultaneously

I have a large binary file saved on an NFS shared disk. In the cluster, I want multiple processes to read this big file simultaneously. Each process is given an offset, opens the big file, seeks to the supplied offset, and reads some number of bytes.
How do I design this? As far as I'm concerned, it is similar to some concurrent databases. Is there any lightweight library or open-source project related to this? I'm using C++.
Not sure there is any point in using a library.
You could use the basics: open the file, reposition yourself within it, and then perform the read (a short sketch follows the links):
http://www.cplusplus.com/reference/fstream/ifstream/open/
http://www.cplusplus.com/reference/istream/istream/seekg/
or
http://www.cplusplus.com/reference/cstdio/fopen/
http://www.cplusplus.com/reference/cstdio/fseek/
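For instance, a minimal sketch of the ifstream/seekg approach, where each process calls something like this with its own offset (the function name and arguments are illustrative):

#include <cstdint>
#include <fstream>
#include <vector>

std::vector<char> read_slice(const char *path, std::uint64_t offset,
                             std::size_t nbytes)
{
    std::ifstream in(path, std::ios::binary);      // each process opens independently
    in.seekg(static_cast<std::streamoff>(offset)); // jump to this process's region
    std::vector<char> buf(nbytes);
    in.read(buf.data(), static_cast<std::streamsize>(nbytes));
    buf.resize(static_cast<std::size_t>(in.gcount())); // shrink on a short read
    return buf;
}

Since the readers never write, no locking is needed; each process simply has its own file handle and file position.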
nicolae: I agree :-)
mining: so far you haven't said anything about a need for interaction between your readers.
Consider a simple scenario.
Let's say you have your C++ program called "dostuff" which takes the following arguments:
--name: something to label your output
--offset: the offset point; seek to here (defaults to zero)
--bytes: the number of bytes to process
inputfile: the file you want to read
The following would run your two processes in the background.
$ dostuff --name "proc1" --offset=0 --bytes=100 \\myserver\myshare\bigfile.dat &
$ dostuff --name "proc2" --offset=100 --bytes=100 \\myserver\myshare\bigfile.dat &
You can open a file handle within each process.
So long as the data access is read-only, why make it more complex?
Important: I'm not saying it shouldn't be more complex; I'm suggesting you haven't yet shown a need for additional complexity. And that complexity is going to come from a need for your readers to collaborate. If they don't need to collaborate, then you're pretty much done with your architecture - use the links Nicolae provided, and good luck to you.

prioritizing torrent download sequences using libtorrent

Suppose I have 2+ clients (developed by me) ALL using libtorrent ( http://www.rasterbar.com/products/libtorrent/manual.html#queuing )
Can I effectively prioritize the download of a file from the other clients so that they download the file's pieces/chunks (whatever the torrent terminology is) from the beginning of the file towards its end, rather than in essentially random order?
(Of course I'm allowing some "multiplexing"/"intertwining" of pieces for reasons of availability and performance, but the goal here is to download as linearly, and as quickly, as possible from the start of the file towards the end.)
The goal I have in mind here is obviously previewing the file quickly. How can I do this most effectively using libtorrent, or possibly another C++ torrent library?
(I'm not interested in torrent implementations in languages that don't compile to machine code, like Java or Python - I need native code for reasons of performance and security, so C, C++, or possibly D would all fit the bill.)
You can certainly prioritize pieces and files with torrent_handle::prioritize_pieces() and torrent_handle::prioritize_files(). See the documentation.
This won't be enough to download in order, though. To do that, you can enable sequential download with torrent_handle::set_sequential_download(). This will issue new piece requests in order. Keep in mind that the time a request takes to be satisfied varies a lot depending on which peer you talk to; making requests in order does not necessarily mean receiving the pieces in order.
There is another mechanism that attempts to do that: torrent_handle::set_piece_deadline() sets a target completion time for a piece. Such pieces are considered time-critical; they are ordered by deadline, and the fastest peers are used to request blocks from them, attempting to download them in deadline order.
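A hedged sketch of how those calls fit together (the exact signatures vary a little between libtorrent versions, so check against the manual linked above):

#include <libtorrent/torrent_handle.hpp>

void prefer_linear_download(libtorrent::torrent_handle &h)
{
    // Issue new piece requests from the start of the file onwards.
    h.set_sequential_download(true);

    // Additionally mark the first pieces as time-critical so the fastest
    // peers are used for them; deadlines are given in milliseconds.
    for (int i = 0; i < 16; ++i)
        h.set_piece_deadline(i, 1000 * (i + 1));
}

The choice of 16 pieces and the one-second spacing are arbitrary; you would tune them to the piece size and the preview buffer you need.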
Now, I also get the impression that you want two separate clients (presumably running on different machines) to coordinate which pieces they download. Is that right? It's not entirely clear what you're asking about, but there's no simple way of asking libtorrent to do that.
You could write a plugin for libtorrent that implements a new extension message for these clients to chat and coordinate, which could de-select certain pieces the other client is downloading by setting their priority to 0.

Programmatically getting per-process disk io statistics on Windows?

I would like to display a list of processes (Windows, C++) and how much they are reading and writing from the disk in KB/sec.
The Resource Monitor in Windows 7 can do this, so I should be able to do the same.
However, I have been unable to find a relevant API call or anything in the perfmon counters. Could anyone point me in the right direction?
You can call GetProcessIoCounters to get overall disk I/O data per process - you'll need to keep track of the deltas and convert them to a time-based rate yourself.
This API will tell you the total number of I/O operations as well as the total bytes transferred.
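A minimal sketch of taking one sample (to get KB/sec you would take two samples and divide the byte deltas by the elapsed time; error handling is omitted):

#include <windows.h>
#include <iostream>

int main()
{
    IO_COUNTERS io = {};
    // GetCurrentProcess() for your own process; use OpenProcess with
    // PROCESS_QUERY_INFORMATION to sample other processes.
    if (GetProcessIoCounters(GetCurrentProcess(), &io)) {
        std::cout << "read ops:    " << io.ReadOperationCount << "\n"
                  << "read bytes:  " << io.ReadTransferCount  << "\n"
                  << "write bytes: " << io.WriteTransferCount << "\n";
    }
    return 0;
}

Note that these counters cover all I/O the process issues (file, network, and device), so they are an approximation of the per-disk figures Resource Monitor shows.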
WMI can do it, as long as you periodically snapshot it to get differential stats for some "recent" slice of time. This post presents a peculiarly mixed solution, with VBScript reading the info from WMI and Perl continually presenting the information in a Windows console. Despite the strange language mix, I think it stands as a good example of how to get at the kind of information you require (it should be quite possible to recode all of it in C++, of course).

How do I extract the network protocol from the source code of the server?

I'm trying to write a chat client for a popular network. The original client is proprietary, and is about 15 GB larger than I would like. (To be fair, others call it a game.)
There is absolutely no documentation available for the protocol on the internet, and most search results only come back with the client's scripting interface. I can understand that, since, used the wrong way, it could ruin other people's experience.
I've downloaded the source code of a couple of alternative servers, including the one I want to connect to, but those
contain no documentation other than install instructions
are poorly commented (from a superficial browse)
are HUGE (the src folder of the target server contains 12 MB worth of .cpp and .h files), and grep didn't find anything related
I've also tried searching their forums and contacting the maintainers of the server, but so far, no luck.
Packet sniffing isn't likely to help, as the protocol relies heavily on encryption.
At this point, all my hope is my ability to chew through an ungodly amount of code. How do I start?
Edit: A related question.
If the original code does its encryption with some well-known library like OpenSSL or Crypto++, it might be useful to write your own wrappers for the main entry points of these libraries, delegating the calls to the actual library. If you make such a substitution and build the project successfully, you will be able to trace everything that goes out, in plain text.
If the project is not using third-party encryption libraries, it is hopefully still possible to substitute the encryption routines with wrappers that trace their input and then delegate encryption to the actual code.
Your bet is that encryption is usually implemented in a separate, relatively small number of source files, so it should be easier to track the input/output in those files.
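For example, on Linux one concrete way to substitute such a wrapper without touching the build is an LD_PRELOAD interposer; here is a hedged sketch for OpenSSL's SSL_write (this catches nothing if the program links OpenSSL statically):

#include <dlfcn.h>  // dlsym, RTLD_NEXT (needs _GNU_SOURCE; g++ defines it by default)
#include <cstdio>

using ssl_write_fn = int (*)(void *ssl, const void *buf, int num);

// void* for the SSL handle avoids needing the OpenSSL headers here.
extern "C" int SSL_write(void *ssl, const void *buf, int num)
{
    // Look up the real SSL_write in the library loaded after us.
    static ssl_write_fn real =
        reinterpret_cast<ssl_write_fn>(dlsym(RTLD_NEXT, "SSL_write"));
    std::fwrite(buf, 1, static_cast<std::size_t>(num), stderr); // trace the plaintext
    return real(ssl, buf, num);
}

Compile with g++ -shared -fPIC -o trace.so trace.cpp -ldl, run the program with LD_PRELOAD=./trace.so, and everything passed to SSL_write is mirrored to stderr before it is encrypted.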
Good luck!
I'd say
find the command that is used to send data through the socket (the call depends on the network library)
find references to this call and unwind from there. If you can modify and recompile the server code, that might help.
Along the way, you will be able to log decrypted (or, more likely, not-yet-encrypted) network activity.
IMO, the best answer is to read the source code of the alternative server. Try using a good C++ IDE to help you. It will make a lot of difference.
It is likely that the protocol related material you need to understand will be limited to a subset of the files. These will contain references to network sockets and things. Start from there and work outwards as far as you need to.
A viable approach is to tackle this as a crypto challenge. That makes it easy, because you control so much.
For instance, you can use a current client to send a known message to the server, and then check server memory for that string. Once you've found out in which object the string ends up, it also becomes possible to trace its ancestry through the code. Set a breakpoint on any non-const method of the object and collect the stack traces. This gives you a live view of how messages arrive at the server, and a list of core functions essential to message processing. You can then find related functions (callers and callees of the functions on your list).

Any way to determine the speed of a removable drive in Windows?

Is there any way to determine a removable drive's speed in Windows without actually reading in a file? And if I do have to read in a file, how much needs to be read to get a semi-accurate speed (e.g., to determine whether a device is USB2 or USB1)?
EDIT: Just to clarify, USB2 and USB1 were only examples; the device could be CompactFlash, an SSD, or a removable drive. And I am trying to determine this as fast as possible, since it has a real effect on the responsiveness of the application.
EDIT: I should also clarify that this has to be done programmatically. It will probably be done in C++.
EDIT: The Boost answer is kind of what I was looking for (though I haven't written any WMI in C++). But I need to know which properties to check to determine relative speed. I don't need the exact speed (as I said regarding the difference between USB1 and USB2), but I need to know whether it is going to be SLLOOOOWWW.
WMI - Physical Disks Properties is an article I found which would at least help you figure out what you have connected. I foresee things heading toward tables equating particular manufacturers and models to speeds, which is not as simple a solution as you may have hoped for.
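If you go the WMI route from C++, here is a hedged sketch of such a query (the usual COM boilerplate, with most error handling trimmed; link against wbemuuid.lib):

#define _WIN32_DCOM
#include <comdef.h>
#include <wbemidl.h>
#include <iostream>
#pragma comment(lib, "wbemuuid.lib")

int main()
{
    CoInitializeEx(nullptr, COINIT_MULTITHREADED);
    CoInitializeSecurity(nullptr, -1, nullptr, nullptr,
                         RPC_C_AUTHN_LEVEL_DEFAULT,
                         RPC_C_IMP_LEVEL_IMPERSONATE,
                         nullptr, EOAC_NONE, nullptr);

    IWbemLocator *loc = nullptr;
    CoCreateInstance(CLSID_WbemLocator, nullptr, CLSCTX_INPROC_SERVER,
                     IID_IWbemLocator, reinterpret_cast<void **>(&loc));
    IWbemServices *svc = nullptr;
    loc->ConnectServer(_bstr_t(L"ROOT\\CIMV2"), nullptr, nullptr, nullptr,
                       0, nullptr, nullptr, &svc);
    CoSetProxyBlanket(svc, RPC_C_AUTHN_WINNT, RPC_C_AUTHZ_NONE, nullptr,
                      RPC_C_AUTHN_LEVEL_CALL, RPC_C_IMP_LEVEL_IMPERSONATE,
                      nullptr, EOAC_NONE);

    // Model and InterfaceType tell you roughly what kind of device it is.
    IEnumWbemClassObject *rows = nullptr;
    svc->ExecQuery(_bstr_t(L"WQL"),
                   _bstr_t(L"SELECT Model, InterfaceType FROM Win32_DiskDrive"),
                   WBEM_FLAG_FORWARD_ONLY | WBEM_FLAG_RETURN_IMMEDIATELY,
                   nullptr, &rows);

    IWbemClassObject *row = nullptr;
    ULONG got = 0;
    while (rows->Next(WBEM_INFINITE, 1, &row, &got) == S_OK && got) {
        VARIANT model, iface;
        row->Get(L"Model", 0, &model, nullptr, nullptr);
        row->Get(L"InterfaceType", 0, &iface, nullptr, nullptr);
        std::wcout << model.bstrVal << L" (" << iface.bstrVal << L")\n";
        VariantClear(&model);
        VariantClear(&iface);
        row->Release();
    }
    rows->Release(); svc->Release(); loc->Release();
    CoUninitialize();
}

That tells you what the device claims to be; as noted above, mapping that to an actual speed still needs a lookup table or a measurement.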
You may have better results querying the operating system for information about the hardware rather than trying to reverse engineer it from data transfer timing information.
For example, identical transfer speeds don't necessarily mean the same technology is being used by two devices, although other factors such as seek times would improve the accuracy, if such information is available to your application.
In order to keep the application responsive while this work is done, try doing the calls asynchronously and provide some sort of progress indicator to the user. As an example, take a look at how WinDirStat handles this progress indication (I love the pac-man animation as each directory is analyzed).
Several megabytes, I'd say. Transfer speeds can start out slow, and then speed up as the transfer progresses. There are also variations because of file sizes (a single 1GB file will transfer much faster than 1GB of smaller files).
The best way to do that would be to copy a file to/from the device and time how long it takes with your code. USB1 full speed is 12 Mb/s and USB2 high speed is 480 Mb/s (note that those are numbers for the whole bus, not for each port, so multiple devices on the same bus will change the actual throughput).
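If you do measure, a minimal sketch of the timing loop (the probe file name is illustrative; note that the OS file cache will skew a second run over the same file, so use a fresh or large file):

#include <chrono>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    const std::size_t chunk = 1 << 20;    // 1 MiB per read
    const std::size_t target = 8 * chunk; // stop after ~8 MiB
    std::vector<char> buf(chunk);
    std::ifstream in("E:\\probe.bin", std::ios::binary);

    const auto start = std::chrono::steady_clock::now();
    std::size_t total = 0;
    while (total < target && in.read(buf.data(), buf.size()))
        total += static_cast<std::size_t>(in.gcount());
    const double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();

    std::cout << (total / (1024.0 * 1024.0)) / secs << " MB/s\n";
    return 0;
}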
Try TeraCopy: copy one large file (~400-500 MB) from the device and then to the device, and you'll see the speed.
In Windows you can determine whether a connected USB device is USB2 by selecting View -> "Devices by Connection" in Device Manager and then checking whether the device is under a USB2 controller (USB2 Enhanced Host Controller).
Note that this doesn't mean your device will actually perform at the higher speeds though, you would still need actual throughput tests for that. The Sisoft Sandra benchmarking software lists removable hard drives as supported in its feature list.
EDIT: Due to the clarification in the original question, I have submitted a new answer.
Consider the number of things that could affect data transfer speed:
The speed of the bus used to connect the device to the system. This is unlikely to be the limiting factor unless the device is connected via USB1.
For hard drives, rotational speed and seek time matter. 7200 RPM drives will read and write blocks of data faster than 5400 RPM drives.
Optical and magnetic drives usually spin down when not in use, so the first access will take orders of magnitude more than the second access.
The filesystem used on the particular device.
Caching of data and filesystem metadata. The less metadata is cached, the more a magnetic or optical drive has to seek to figure out where the data is.
Data access pattern. Accessing a small number of large, contiguous files is almost always faster than accessing a large number of small files scattered around the disk.
File system fragmentation
You might be able to work up some heuristics based on the various characteristics of the devices you expect to see, but in general there's no good way to figure out transfer speed for a particular combination of bus, media, filesystem, and data access pattern without actually measuring it. If you decide to measure, try to simulate your final access pattern as closely as possible.
I'm going to borrow Raymond Chen's crystal ball and say that you really don't want this. You probably want to use asynchronous I/O. If you do not get the result of your I/O within a second, check how much of it did complete. Take the inverse of that number, and you have a good estimate to quote to the user.
If nothing has happened after a second, you may be in for a surprise. But even that can happen: for instance, a hard disk may need a second to spin up. Just poll every second until something has happened.
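A hedged sketch of that polling idea using Windows overlapped I/O (error handling trimmed; the file name is illustrative):

#include <windows.h>
#include <iostream>
#include <vector>

int main()
{
    HANDLE file = CreateFileA("E:\\probe.bin", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_FLAG_OVERLAPPED,
                              nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    std::vector<char> buf(4u << 20); // a 4 MiB probe read
    OVERLAPPED ov = {};
    ov.hEvent = CreateEventA(nullptr, TRUE, FALSE, nullptr);
    ReadFile(file, buf.data(), static_cast<DWORD>(buf.size()), nullptr, &ov);

    // Poll once per second; a drive that is spinning up simply keeps us waiting.
    while (WaitForSingleObject(ov.hEvent, 1000) == WAIT_TIMEOUT)
        std::cout << "still waiting...\n";

    DWORD bytes = 0;
    GetOverlappedResult(file, &ov, &bytes, FALSE);
    std::cout << bytes << " bytes completed\n";

    CloseHandle(ov.hEvent);
    CloseHandle(file);
    return 0;
}

A real version would issue a sequence of smaller overlapped reads and count how many complete each second; that count gives you the rate estimate to quote to the user.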