Writing to a file in MPI - C++

I've written a program in MPI and I want to run it on 3 or more cores (or computers), and I want each process to write something to a file. For two cores I created two different files, but I don't know what to do for more cores. Is it necessary to create a separate file for each process? If so, how should I do that based on my code, and is it different when running on multiple computers?
void main() // it is just for 2 cores; what should I do if I use more cores or more computers?
{
    FILE* fp  = fopen("C:\\a.txt", "w");
    FILE* fp1 = fopen("C:\\b.txt", "w");
    if (Id == 0)
    {
        // here I write to "fp"
    }
    else
    {
        // here I write to "fp1"
    }
}

If you are using this model, yes, you would need to create a file per process. The reason is that if you have two processes write to the same file, they would most likely end up overwriting each other unless you do something with MPI to ensure that they serialize themselves.
Depending on the data you're trying to write, you could take a look at the I/O section of MPI. There is functionality in there for parallel I/O but it can be a bit tricky.

int main(int argc, char** argv) {
    int me;
    char filename[1000];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    sprintf(filename, "C:\\a%05d.txt", me);   // a different file name for each rank
    FILE *fp = fopen(filename, "w");
    // here I write to "fp"
    fclose(fp);
    MPI_Finalize();
    return 0;
}
This is probably the hint you required, but usually the correct approach is one of:
collect the data (see MPI_Gather() to start with) and let only the process with rank == 0 perform I/O
if I/O performance matters, or if all the data cannot fit in the memory of one node (isn't that the same thing, actually?), use MPI-IO; a minimal sketch follows.
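For the MPI-IO option, something along these lines (a sketch only; the file name all_ranks.txt and the fixed-size text record are assumptions made for illustration):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Every rank writes a fixed-size record at an offset derived from its rank,
    // so all ranks can share one file without overwriting each other.
    char record[64];
    snprintf(record, sizeof record, "rank %4d says hello\n", rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "all_ranks.txt",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * (MPI_Offset)strlen(record);
    MPI_File_write_at(fh, offset, record, strlen(record), MPI_CHAR, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}

Because every record has the same length, rank * record length gives each process a non-overlapping offset; for variable-length data you would typically compute per-rank offsets first, e.g. with MPI_Exscan over the local sizes.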

Related

Limit CPU usage of fwrite operations

I'm developing a program with several threads that manages the streaming from several cameras. I have to write every raw image to an SSD. I'm using fwrite to put each image in a binary file. Something like:
FILE* output;
output = fopen(fileName, "wb");
fwrite(imageData, imageSize, 1, output);
fclose(output);
The procedure seems to run fast enough to save all images at the given camera throughput. The problem is that the save procedure is CPU-consuming, and I start to have sync issues when saving is enabled, due to the CPU usage of the save threads.
Is there any way to reduce the CPU load of fwrite operations? Like playing with buffering, better DMA settings, ...?
Thanks!
MIX
-- UPDATE 1
Setting the multithreaded software aside, here is a simple file-writer program:
#include <stdio.h>
#include <stdlib.h>

const unsigned int TOT_DATA = 1280*2*960;

int main(int argc, char* argv[])
{
    if(argc != 2)
    {
        printf("Usage:\n");
        printf(" %s totWrite\n\n", argv[0]);
        return -1;
    }

    char* imageData;
    FILE* output;
    char fileName[256];
    unsigned int totWrite;

    totWrite = atoi(argv[1]);
    imageData = new char[TOT_DATA];

    printf("Write imageData[%u] on file %u times.\n", TOT_DATA, totWrite);
    for(unsigned int i = 0; i < totWrite; i++)
    {
        sprintf(fileName, "image_%06u.raw", i);
        output = fopen(fileName, "wb");
        fwrite(imageData, TOT_DATA, 1, output);
        fclose(output);
    }
    printf("DONE!\n");

    delete [] imageData;
    return 0;
}
A char buffer is created and written to a file totWrite times. There are no overwrites, since each cycle writes to a new file. (Of course, one has to remove the files written by a previous run...)
Running top (I'm on Linux) while the program runs, I see that ~50% of the CPU (that is, 50% of one of the 4 cores) is used. I suppose fwrite is the bottleneck for CPU usage, since it is the "slowest" operation in the loop and therefore the one "most probably" running when top updates its stats. Even "more probably" so if TOT_DATA were increased, say, 100 times.
Any further thoughts on what could reduce CPU usage in such a program?
If you are considering playing with DMA settings, you're way outside the scope of the standard C library. It will be nowhere near portable - and then you lose the benefits of using portable functions.
The first step you should probably take (after you've confirmed that the CPU is the bottleneck) is to use lower-level functions such as open/write (or whatever your OS calls them).
Basically, what can happen with fwrite is that the program first copies the data to another place in memory (the FILE* buffer) before actually writing it to disk. That copy is certainly CPU-bound, and if data transfer by the CPU is slower than the data transfer to the SSD, CPU power is being consumed for no good reason.
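As a rough illustration of that lower-level route on a POSIX system (a sketch only, with minimal error handling; the file name is a placeholder), open/write hand the buffer straight to the kernel and skip the extra copy into the stdio buffer:

#include <fcntl.h>     // open
#include <unistd.h>    // write, close
#include <stdio.h>
#include <stdlib.h>

int main() {
    const size_t imageSize = 1280 * 2 * 960;
    char* imageData = (char*)malloc(imageSize);

    // open/write pass the caller's buffer directly to the kernel, without the
    // extra copy into a FILE* buffer that fwrite may perform.
    int fd = open("image_000000.raw", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    ssize_t written = write(fd, imageData, imageSize);
    if (written < 0) perror("write");

    close(fd);
    free(imageData);
    return 0;
}

Whether this actually lowers the CPU load still needs to be confirmed with a profiler, as noted above.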
Also, one should note that using multiple threads has its drawbacks. First, if it were not an SSD, multiple threads writing to disk could result in redundant head movements, which would hurt performance; even an SSD may suffer somewhat, as you might fragment the layout of the data.
There's also a problem with holding an entire file in memory, as you seem to do in the example, especially if you do it in multiple threads: it simply consumes a lot of memory (which could mean swapping is required). If possible, write the data to the file as it arrives.

C++ write filestream

I am writing an application that produces and logs a lot of data in the form of ASCII and binary output (not mixed; one or the other depending on the log). The application is single-threaded (which should make things easier), and I want to write my data to disk in the order it was generated. I need to implement a write(char* data) method that takes a null-terminated character array and writes it to disk. Ideally, I want the function to buffer the data and return before the data is actually written to disk... I figure there must be some way for Windows to set up a thread and do this in the background. The only thing I care about is that the data ends up in the log file in the order it was written. What is the best way to do this? Someone else implemented the current write method, and it looks like:
void writeData(const char* data, int size)
{
    if (fp != 0)
        fwrite(data, 1, size, fp);
}
fp is the file pointer.
C++ Stdio.h header:
http://www.cplusplus.com/reference/cstdio/fwrite/
In a multi-threaded program, maybe you could use something like a log queue.
In a single-threaded program, the order is guaranteed.
If you are talking Windows-only, then you pretty much have two options: overlapped I/O through the WinAPI, or setting up a separate thread in your program to handle file I/O (which can potentially be made cross-platform by using pthreads).
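As an illustration of the separate-thread option, here is a minimal sketch in portable C++11 rather than pthreads (the class name AsyncLogger and the single-file setup are assumptions, not part of the question's code): write() copies the data into a queue and returns immediately, and one background thread drains the queue in FIFO order, so the on-disk order matches the call order.

#include <condition_variable>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class AsyncLogger {
public:
    explicit AsyncLogger(const char* path)
        : fp(std::fopen(path, "wb")), done(false),
          worker(&AsyncLogger::run, this) {}

    ~AsyncLogger() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
        worker.join();          // drains whatever is still queued
        if (fp) std::fclose(fp);
    }

    // Buffers the data and returns; the background thread does the fwrite.
    void write(const char* data, std::size_t size) {
        { std::lock_guard<std::mutex> lk(m); q.emplace(data, size); }
        cv.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [this] { return done || !q.empty(); });
            if (q.empty() && done) return;
            std::string chunk = std::move(q.front());
            q.pop();
            lk.unlock();        // do the slow I/O with the lock released
            if (fp) std::fwrite(chunk.data(), 1, chunk.size(), fp);
        }
    }

    std::FILE* fp;
    bool done;
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::string> q;
    std::thread worker;
};

Usage would be something like AsyncLogger log("output.log"); log.write(data, size); in the single-threaded producer. Because there is only one consumer thread, writes reach the disk in exactly the order write() was called.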

Proper, efficient file reading

I'd like to read and process (e.g. print) entries from the first row of a CSV file one at a time. I assume Unix-style \n newlines, that no entry is longer than 255 chars and (for now) that there's a newline before EOF. This is meant to be a more efficient alternative to fgets() followed by strtok().
#include <stdio.h>
#include <string.h>

int main() {
    int i;
    char ch, buf[256];
    FILE *fp = fopen("test.csv", "r");
    for (;;) {
        for (i = 0; ; i++) {
            ch = fgetc(fp);
            if (ch == ',') {
                buf[i] = '\0';
                puts(buf);
                break;
            } else if (ch == '\n') {
                buf[i] = '\0';
                puts(buf);
                fclose(fp);
                return 0;
            } else buf[i] = ch;
        }
    }
}
Is this method as efficient and correct as possible?
What is the best way to test for EOF and file reading errors using this method? (Possibilities: testing against the character macro EOF, feof(), ferror(), etc.).
Can I perform the same task using C++ file I/O with no loss of efficiency?
What is most efficient is going to depend a lot on the operating system, standard libraries (e.g. libc), and even the hardware you are running on. This makes it nearly impossible to tell you what's "most efficient".
That having been said, there are a few things you could try:
Use mmap() or a local operating system equivalent (Windows has CreateFileMapping / OpenFileMapping / MapViewOfFile, and probably others). Then you don't do explicit file reads: you simply access the file as if it were already in memory, and anything that isn't there will be faulted in by the page fault mechanism.
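For instance, a small POSIX mmap sketch (Linux/OS X; the file name is taken from the question, and the comma count is just a stand-in for real processing):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("test.csv", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    // Map the whole file read-only; pages are faulted in on first access.
    const char* data = (const char*)mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // Example access: count the commas in the first row.
    long commas = 0;
    for (off_t i = 0; i < st.st_size && data[i] != '\n'; ++i)
        if (data[i] == ',') ++commas;
    printf("commas in first row: %ld\n", commas);

    munmap((void*)data, st.st_size);
    close(fd);
    return 0;
}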
Read the entire file into a buffer manually, then work on that buffer. The fewer times you call into file read functions, the fewer function-call overheads you take, and likely also fewer application/OS domain switches. Obviously this uses more memory, but may very well be worth it.
Use a more optimal string scanner for your problem and platform. Going character-by-character yourself is almost never as fast as relying on something existing that's close to your problem domain. For example, you can bet that strchr and memchr are probably better-optimized than most code you can roll yourself, doing things like reading entire cachelines or words at once, scanning using better algorithms for this kind of search, etc. For more complicated cases, you might consider a full regular expression engine that could compile your regex to something fast for your complicated case.
Avoid copying your string around. It may be helpful to think in terms of "find delimiters" and then "output between delimiters". You could for example use strchr to find the next character of interest, and then fwrite or something to write to stdout directly from your input buffer. Then you're keeping most of your work in a few local registers, rather than using a stack or heap buf.
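Combining the last two suggestions, a sketch along these lines (assuming the whole file fits in memory and the file name from the question) reads everything once and lets memchr find the delimiters, writing each field of the first row straight from the input buffer:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    FILE* fp = fopen("test.csv", "rb");
    if (!fp) { perror("fopen"); return 1; }

    // Read the whole file in one go.
    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);
    char* buf = (char*)malloc(size);
    fread(buf, 1, size, fp);
    fclose(fp);

    // Scan only the first row; let memchr find each delimiter.
    const char* end = (const char*)memchr(buf, '\n', size);
    if (!end) end = buf + size;
    const char* p = buf;
    while (p < end) {
        const char* comma = (const char*)memchr(p, ',', end - p);
        const char* stop = comma ? comma : end;
        fwrite(p, 1, stop - p, stdout);   // print the field without copying it
        fputc('\n', stdout);
        p = stop + 1;
    }

    free(buf);
    return 0;
}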
When in doubt, though, try a few possibilities and profile, profile, profile.
Also for this kind of problem, be very aware of differences between runs that are caused by OS and hardware caches: profile a bunch of runs rather than just one after each change -- and if possible, use tests that will either likely always hit caches (if you're trying to measure best-case performance) or tests that will likely miss (if you're trying to measure worst-case performance).
Regarding C++ file IO (fstream and such), just be aware that they're larger, more complicated beasts. They tend to include things such as locale management, automatic buffering, and the like -- as well as being less prone to particular types of coding mistakes.
If you're doing something pretty simple (like what you describe here), I tend to find C++ library stuff gets in the way. (Use a debugger and "step instruction" through a stringstream method versus some C string functions some time, you'll get a good feel for this quickly.)
It all depends on whether you're going to want or need that additional functionality or safety in the future.
Finally, the obligatory "don't sweat the small stuff". Only spend time on optimizing here if it's really important. Otherwise trust the libraries and OS to do the right thing for you most of the time -- if you get too far into micro-optimizations you'll find you're shooting yourself in the foot later. This is not to discourage you from thinking in terms of "should I read the whole file in ahead of time, will that break future use cases" -- because that's macro, rather than micro.
But generally speaking if you're not doing this kind of "make it faster" investigation for a good reason -- i.e. "need this app to perform better now that I've written it, and this code shows up as slow in profiler", or "doing this for fun so I can better understand the system" -- well, spend your time elsewhere first. =)
One method, provided you are going to scan through the file serially, is to use 2 buffers of a decent enough size (16K is the optimal size for SSDs and 4K for HDDs IIRC. But 16K should suffice for both). You start off by performing an asynchronous load (In windows look up Overlapped I/O and on Unix/OSX use O_NONBLOCK) of the first 16K into buffer 0 and then start another load into buffer 1 of bytes 16K to 32K. When your read position hits 16K, swap the buffers (so you are now reading from buffer 1 instead) wait for any further loads to complete into buffer 1 and perform an asynchronous load of bytes 32K to 48K into buffer 0 and so on. This way, you have far less chance of ever having to wait for a load to complete as it should be happening while you are processing the previous 16K.
I moved over to a scheme like this in my XML parser, having previously used fopen and fgetc, and the speedup was huge. Loading a 15 MB XML file and processing it went from minutes to seconds. Of course, your mileage may vary.
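As a portable approximation of that double-buffering idea (a sketch only: it uses a background std::async read instead of OS-specific overlapped or non-blocking I/O, and the 16K chunk size comes from the description above; the file name is the one from the question):

#include <cstdio>
#include <future>
#include <vector>

static size_t readChunk(std::FILE* fp, char* dst, size_t n) {
    return std::fread(dst, 1, n, fp);
}

int main() {
    const size_t CHUNK = 16 * 1024;
    std::FILE* fp = std::fopen("test.csv", "rb");
    if (!fp) return 1;

    std::vector<char> bufs[2] = { std::vector<char>(CHUNK), std::vector<char>(CHUNK) };

    // Fill buffer 0 synchronously, then always have the *other* buffer loading
    // in the background while we process the current one.
    size_t have = readChunk(fp, bufs[0].data(), CHUNK);
    int cur = 0;
    std::future<size_t> pending =
        std::async(std::launch::async, readChunk, fp, bufs[1].data(), CHUNK);

    size_t total = 0;
    while (have > 0) {
        // ... process bufs[cur][0 .. have) here ...
        total += have;

        have = pending.get();              // wait for the prefetched chunk
        cur = 1 - cur;                     // swap buffers
        if (have > 0)
            pending = std::async(std::launch::async, readChunk, fp,
                                 bufs[1 - cur].data(), CHUNK);
    }

    std::printf("read %zu bytes\n", total);
    std::fclose(fp);
    return 0;
}

The processing step is left as a comment; the point is only that the next chunk is already loading while the current one is being consumed.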
Use fgets to read one line at a time. C++ file I/O is basically wrapper code with some compiler optimization tucked inside (and many unwanted features). Unless you are reading millions of lines and measuring the time, it does not matter.

Best practice to implement fixed byte serial protocol in C++?

I have a device connected via a serial interface to a BeagleBone computer. It communicates in a simple binary format like:
| MessageID (1 byte) | Data (n bytes) | Checksum (2 bytes) |
The message length is fixed for each command, meaning that it is known how many bytes to read after the first byte of a command has been received. After some initial setup communication, it sends packets of data every 20 ms.
My approach would be to use either termios or something like a serial library, and then run a read loop along these lines:
while (keepRunning)
{
    char buffer[256];      // 256 bytes (the original char* buffer[256] declared 256 pointers)
    serial.read(buffer, 1);
    switch (buffer[0])
    {
    case COMMAND1:
        serial.read(&buffer[1], sizeof(MessageHello) + 2); // read data + checksum
        if (calculateChecksum(buffer, sizeof(MessageHello) + 3))
        {
            extractDatafromCommand(buffer);
        }
        else
        {
            doSomeErrorHandling(buffer[0]);
        }
        break;
    case COMMAND2:
        serial.read(&buffer[1], sizeof(MessageFoo) + 2);
        [...]
    }
}
extractDatafromCommand would then create some structs like:
struct MessageHello
{
    char name[20];
    int version;
};
Put everything in its own read thread and signal the availability of a new packet to other parts of the program using a semaphore (or a simple flag).
Is this a viable solution, or are there better ways to do it (I assume so)?
Maybe make an abstract class Message and derive the other messages from it?
It really depends. The two major ways would be threaded (like you mentioned) and evented.
Threaded code is tricky because you can easily introduce race conditions. Code that you tested a million times could occasionally stumble and do the wrong thing after working for days or weeks or years. It's hard to 'prove' that things will always behave correctly. Seemingly trivial things like "i++" suddenly become leaky abstractions. (See why is i++ not thread safe on a single core machine? )
The other alternative is evented programming. Basically, you have a main loop that does a select() on all your file handles. Anything that is ready gets looked at, and you try to read/write as many bytes as you can without blocking. (pass O_NONBLOCK). There are two tricky parts: 1) you must never do long calculations without having a way to yield back to the main loop, and 2) you must never do a blocking operation (where the kernel stops your process waiting for a read or write).
In practice, most programs don't have long computations and it's easier to audit a small amount of your code for blocking calls than for races. (Although doing DNS without blocking is trickier than it should be.)
The upside of evented code is that there's no need for locking (there are no other threads to worry about) and it wastes less memory (compared with the general case of creating lots of threads).
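A bare-bones sketch of that evented shape on POSIX (the device path /dev/ttyO1 and the framing details are placeholders; real code would use the per-command lengths discussed in the question):

#include <fcntl.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int main() {
    // Open the serial device non-blocking so read() never stalls the loop.
    int fd = open("/dev/ttyO1", O_RDONLY | O_NONBLOCK);
    if (fd < 0) { perror("open"); return 1; }

    unsigned char buf[256];
    size_t have = 0;   // bytes accumulated of the current message

    for (;;) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(fd, &readable);

        // Block here until the serial port has data (other fds could be added).
        if (select(fd + 1, &readable, NULL, NULL, NULL) < 0) { perror("select"); break; }

        if (FD_ISSET(fd, &readable)) {
            ssize_t n = read(fd, buf + have, sizeof buf - have);
            if (n <= 0) continue;
            have += (size_t)n;
            // ... once 'have' covers the full length implied by buf[0],
            //     verify the checksum, hand the packet off, and reset 'have' ...
        }
    }
    close(fd);
    return 0;
}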
Most likely, you want to use a serial lib. termios processing is just overhead and a chance for stray bytes to do bad things.

How to optimize input/output in C++

I'm solving a problem which requires very fast input/output. More precisely, the input data file will be up to 15 MB. Is there a fast way to read/print integer values?
Note: I don't know if it helps, but the input file has the following form:
line 1: a number n
line 2..n+1: three numbers a,b,c;
line n+2: a number r
line n+3..n+4+r: four numbers a,b,c,d
Note 2: The input file will be stdin.
Edit: Something like the following isn't fast enough:
void fast_scan(int &n) {
    char buffer[10];
    gets(buffer);        // note: gets() is unsafe and removed from newer standards
    n = atoi(buffer);
}

void fast_scan_three(int &a, int &b, int &c) {
    char buffval[3][20], buffer[60];
    gets(buffer);
    int n = strlen(buffer);
    int buffindex = 0, curindex = 0;
    for (int i = 0; i < n; ++i) {
        if (!isdigit(buffer[i]) && !isspace(buffer[i])) break;
        if (isspace(buffer[i])) {
            buffval[buffindex][curindex] = '\0';   // terminate the current field
            buffindex++;
            curindex = 0;
        } else {
            buffval[buffindex][curindex++] = buffer[i];
        }
    }
    if (buffindex < 3)
        buffval[buffindex][curindex] = '\0';       // terminate the last field
    a = atoi(buffval[0]);
    b = atoi(buffval[1]);
    c = atoi(buffval[2]);
}
The general input/output optimization principle is to perform as few I/O operations as possible, reading/writing as much data per operation as possible.
So a performance-aware solution typically looks like this:
Read all data from the device into a buffer (using the principle mentioned above)
Process the data, generating the results into a buffer (in place or in another one)
Output the results from the buffer to the device (using the principle mentioned above)
E.g. you could use std::basic_istream::read to input data in big chunks instead of doing it line by line. The same idea applies to output - generate a single string as the result, adding line-feed characters manually, and output it all at once.
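A rough sketch of that three-step shape (assumptions: the whole input fits comfortably in memory given the 15 MB bound, and the numbers are whitespace-separated non-negative integers):

#include <cctype>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // 1) Read all of stdin in large chunks into one buffer.
    std::string input;
    char chunk[1 << 16];
    size_t n;
    while ((n = std::fread(chunk, 1, sizeof chunk, stdin)) > 0)
        input.append(chunk, n);

    // 2) Parse every integer out of the buffer in one pass.
    std::vector<long> values;
    const char* p = input.data();
    const char* end = p + input.size();
    while (p < end) {
        while (p < end && !std::isdigit((unsigned char)*p)) ++p;
        if (p == end) break;
        long v = 0;
        while (p < end && std::isdigit((unsigned char)*p))
            v = v * 10 + (*p++ - '0');
        values.push_back(v);
    }

    // 3) Build the output in one string and write it once.
    std::string out;
    out.reserve(values.size() * 8);
    for (size_t i = 0; i < values.size(); ++i) {
        out += std::to_string(values[i]);
        out += '\n';
    }
    std::fwrite(out.data(), 1, out.size(), stdout);
    return 0;
}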
If you want to minimize the physical I/O overhead, load the whole file into memory using a technique called memory-mapped files. I doubt you'll get a noticeable performance gain, though. Parsing will most likely be a lot costlier.
Consider using threads. Threading is useful for lots of things, but this is exactly the kind of problem that motivated the invention of threads.
The underlying idea is to separate the input, processing, and output, so these different operations can run in parallel. Do it right and you will see a significant speedup.
Have one thread doing close to pure input. It reads lines into a buffer of lines. Have a second thread do a quick pre-parse and organize the raw input into blocks. You have two things that need to be parsed: the line that contains the number of lines that contain triples, and the line that contains the number of lines that contain quads. This thread forms the raw input into blocks that are still mostly text. A third thread parses the triples and quads, re-forming the input into fully parsed structures. Since the data are now organized into independent blocks, you can have multiple instances of this third operation so as to better take advantage of the multiple processors on your computer. Finally, other threads will operate on these fully parsed structures. Note: it might be better to combine some of these operations, for example combining the input and pre-parsing operations into one thread.
Put several input lines in a buffer, split them, and then parse them simultaneously in different threads.
It's only 15MB. I would just slurp the whole thing into a memory buffer, then parse it.
The parsing looks something like this, approximately:
#define DIGIT(c) ((c) >= '0' && (c) <= '9')

while (*p == ' ') p++;
if (DIGIT(*p)) {
    a = 0;
    while (DIGIT(*p)) {
        a *= 10;
        a += (*p++ - '0');
    }
}
// and so on...
You should be able to write this kind of code in your sleep.
I don't know if that's any faster than atoi, but it's not fussing around figuring out where numbers begin and end. I would stay away from scanf because it goes through nine yards of figuring out its format string.
If you run this whole thing in a loop 1000 times and grab some stack samples, you should see that it is spending nearly 100% of its time reading in the file and generating output (which you didn't mention).
You just can't beat that.
If you do see that noticeable time is spent in the actual parsing, it might be possible to do overlapped I/O, but the machine would have to be really slow, or the I/O really fast (like from a solid state drive) before that would make sense.