For loop hangs when I make files in parallel. Why? (see code below) Also, what's a safe/efficient way to write to multiple binary files (pointer and offset determined by iteration variable)?
Context and questions:
What I would like my code to do is the following:
(1) All processes read a single binary file containing a matrix of doubles -> already achieved this using MPI_File_read_at()
(2) For each 'column' of input data, perform calculations using the numbers in each 'row', and save the data for each column into its own binary output file ("File0.bin" -> column 0)
(3) To enable the user to specify an arbitrary number of processes, I use simple indexing to treat the matrix as one long (rows)X(cols) vector, and split that vector by the number of processes. Each process gets (rows)X(cols)/tot_proc of entries to process... using this approach, the columns will not be neatly divided by each process, therefore, each process needs to access whatever file(s) correspond to it and, using proper offsets, write to the correct section of the correct file. At the moment, it does not matter that the resulting file will be fragmented.
As I work toward that goal, I have written a short program to create binary files in a loop, but the loop hangs on the very last iteration (13 files divided over 4 processes). Number of files to create = (rows).
Question 1 Why does this code hang at the very end of the loop? In my toy example of 4 processes, id_proc 1-3 have 3 files to create, while id_proc 0 (the root process) has 4 files to create. The loop hangs when the root process tries to make its 4th file. Note: I'm compiling this on a laptop running Ubuntu using mpic++.
Question 2 Eventually I will add a second for loop just like the one you see below, except in this loop, the process must write to the appropriate section of the binary files that have already been created. I plan to use MPI_File_write_at() to do this, but I have also read that the files should be statically sized using MPI_File_set_size(), and then, every process should have its own view of the file using MPI_File_set_view(). So, my question is, in order for this to work, should I do the following?
(Loop 1) MPI_File_open(...,MPI_MODE_WRONLY | MPI_MODE_CREATE,...), MPI_File_set_size(), MPI_File_close()
(Loop 2) MPI_File_open(...,MPI_MODE_WRONLY,...), MPI_File_set_view(), MPI_File_write_at(), MPI_File_close()
.... Loop 2 seems like it will be slowed by having to open and close files each iteration, but I do not know in advance how much input data the user will provide, nor how many processes the user will provide. For example, Process N might need to write to the end of file 1, the middle of file 2, and the end of file 8. In principle, all of that can be taken care of with offsets. What I don't know is whether MPI allows for this level of flexibility or not.
Code attempting to create multiple files in parallel:
#include <iostream>
#include <cstdlib>
#include <stdio.h>
#include <vector>
#include <fstream>
#include <string>
#include <sstream>
#include <cmath>
#include <sys/types.h>
#include <sys/stat.h>
#include <mpi.h>
using namespace std;
int main(int argc, char** argv)
{
    //Variable declarations
    string oname;
    stringstream temp;
    int rows = 13, cols = 7, sz_dbl = sizeof(double);
    //each binary file will eventually have 7*sz_dbl bytes
    int id_proc, tot_proc, loop_min, loop_max;
    vector<double> output(rows*cols,1.0);//data to write
    //MPI routines
    MPI_Init(&argc,&argv);//initialize MPI
    MPI_Comm_rank(MPI_COMM_WORLD,&id_proc);//get "this" node's id#/rank
    MPI_Comm_size(MPI_COMM_WORLD,&tot_proc);//get the number of processors
    //MPI loop variable assignments
    loop_min = id_proc*(rows/tot_proc) + min(rows % tot_proc, id_proc);
    loop_max = loop_min + rows/tot_proc + (rows % tot_proc > id_proc);
    //File handle
    MPI_File outfile;
    //Create binary files in parallel
    for(int i = loop_min; i < loop_max; i++)
    {
        temp << i;
        oname = "Myout" + temp.str() + ".bin";
        MPI_File_open(MPI_COMM_WORLD, oname.c_str(), MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &outfile);
        MPI_File_close(&outfile);
        temp.str(string());//reset the stream for the next file name
        temp.clear();
    }
    MPI_Barrier(MPI_COMM_WORLD);//with or without this, same error
    MPI_Finalize();//MPI - end mpi run
    return 0;
}
Tutorial/information pages I've read so far:
Parallel output using MPI IO to a single file
Is it possible to write with several processors in the same file, at the end of the file, in an ordonated way?

MPI_File_open() is a collective operation, which means that all tasks in MPI_COMM_WORLD must open the same file at the same time.
If you want each task to open its own file, use MPI_COMM_SELF instead.
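For the second question, the offset bookkeeping can be worked out independently of MPI. The following is only a sketch under assumptions: the names Segment and segments_for_rank are hypothetical, and it assumes every output file holds the same number of doubles. It splits the flattened index range across ranks and lists, for one rank, which file each piece lands in and at which byte offset. Each segment could then be written by opening that file on MPI_COMM_SELF and passing the offset to MPI_File_write_at(), provided no two ranks write the same bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One contiguous piece of a rank's work that lands in a single output file.
struct Segment {
    int file;            // index of the output file (hypothetical "Myout<file>.bin")
    std::size_t offset;  // byte offset inside that file
    int count;           // number of doubles to write at that offset
};

// Split n_files * per_file entries over tot_proc ranks and return the
// segments owned by rank id_proc. Assumes all files have the same length.
std::vector<Segment> segments_for_rank(int id_proc, int tot_proc,
                                       int n_files, int per_file)
{
    assert(tot_proc > 0 && id_proc >= 0 && id_proc < tot_proc);
    const int total = n_files * per_file;
    const int base = total / tot_proc;   // entries every rank gets
    const int rem = total % tot_proc;    // the first rem ranks get one extra
    int first = id_proc * base + std::min(rem, id_proc);
    const int last = first + base + (id_proc < rem ? 1 : 0);

    std::vector<Segment> out;
    while (first < last) {
        const int file = first / per_file;
        const int in_file = first % per_file;
        // Stop at the end of the file or at the end of this rank's range.
        const int count = std::min(per_file - in_file, last - first);
        out.push_back({file, in_file * sizeof(double), count});
        first += count;
    }
    return out;
}
```

With 13 files of 7 doubles and 4 ranks, rank 1 starts in the middle of file 3 and ends in the middle of file 6, which is the kind of fragmented access the question describes.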


Speed-up a single task using multi-threading in c++

I'm sorry if this is a repeat question but I already tried to search for an answer and came up empty handed.
I have code that transfers data (189156 numbers) from a text file (input.txt) to another file (test.txt). The program takes about 23 seconds to transfer all the data from input.txt to test.txt.
I wanted to speed this up, so I divided the work among 4 threads, each processing 1/4 of the data. After running the program, there was no difference in the time it took to transfer all the data.
Here is my code:
// This program reads data from a file into an array.
#include <iostream>
#include <fstream> // To use ifstream
#include <vector>
#include <thread>
using namespace std;
void test(int start, int end)
{
    std::vector<int> numbers;
    ifstream inputFile("input.txt"); // Input file stream object
    // Check if exists and then open the file.
    if (inputFile.good()) {
        // Push items into a vector
        int current_number = 0;
        while (inputFile >> current_number) {
            numbers.push_back(current_number);
        }
        // Close the file.
        inputFile.close();
        // Display the numbers read:
        cout << "The numbers are: ";
        for (int count = start ; count < end; count++) {
            cout << numbers[count] << " " ;
            std::ofstream ofs;
            ofs.open("test.txt", std::ofstream::out | std::ofstream::app);
            ofs << numbers[count] << endl;
            ofs.close();
        }
        cout << endl;
    }
    else {
        cout << "Error!";
    }
}
int main() {
    std::thread worker1(test, 0, 50000);
    std::thread worker2(test, 50000, 100000);
    std::thread worker3(test, 100000, 150000);
    std::thread worker4(test, 150000, 189156);
    // Wait for all workers; destroying a joinable std::thread calls std::terminate.
    worker1.join();
    worker2.join();
    worker3.join();
    worker4.join();
    return 0;
}
I am a beginner and do not know whether multithreading is appropriate in such a case. If it is, where is my mistake? If not, what is the correct way to speed up the process?
There is a big race condition in the code that not only prevents it from being fast but should also produce wrong results (possibly non-deterministically). Indeed, all threads can write to the same file "test.txt" simultaneously. While this operation may be thread-safe on the target system, the order in which the threads append data to the target file is undefined, so the result can be shuffled. The file appends have to be serialized, and when this operation is thread-safe, it is typically protected by a lock that prevents any parallel execution.
Additionally, the open+write+close sequence should be extremely slow, since it results in 3 system calls per line, and system calls are generally slow, especially I/O ones.
That being said, you cannot use one ofstream object from multiple threads without protection, since that would cause undefined behaviour. Indeed, here is what the C++ standard explicitly states:
Concurrent access to a stream object [string.streams, file.streams], stream buffer object [stream.buffers], or C Library stream [c.files] by multiple threads may result in a data race [intro.multithread] unless otherwise specified [iostream.objects]. [Note: Data races result in undefined behavior [intro.multithread]. --end note]
An efficient solution is to do a per-thread reduction: each thread appends its data to a thread-local ostringstream, so that the integer-to-string serialization goes into one big buffer, and the buffers are then written out one after another (so that the order is the same as in the sequential program). The serialization should be sped up by the use of multiple threads, while the I/O part will still be sequential. In practice, the serialization should be pretty slow, so using multiple threads should help to significantly reduce the execution time.
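A minimal sketch of the thread-local buffer idea, assuming the numbers are already in memory. The function names format_range and serialize_parallel are hypothetical, chosen only for illustration. Each worker formats its slice into its own ostringstream, and the caller gets the buffers back in thread order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Format numbers[begin, end) into one text buffer, one value per line.
static void format_range(const std::vector<int>& numbers, std::size_t begin,
                         std::size_t end, std::string& out)
{
    std::ostringstream oss;  // owned by the calling thread only
    for (std::size_t i = begin; i < end; ++i) {
        oss << numbers[i] << '\n';
    }
    out = oss.str();
}

// Serialize in parallel and return one buffer per thread, in thread order.
std::vector<std::string> serialize_parallel(const std::vector<int>& numbers,
                                            unsigned n_threads)
{
    assert(n_threads > 0);
    std::vector<std::string> buffers(n_threads);
    std::vector<std::thread> workers;
    const std::size_t chunk = (numbers.size() + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        const std::size_t begin = std::min(numbers.size(), t * chunk);
        const std::size_t end = std::min(numbers.size(), begin + chunk);
        // Each thread writes only to its own slot of buffers.
        workers.emplace_back(format_range, std::cref(numbers), begin, end,
                             std::ref(buffers[t]));
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return buffers;
}
```

The caller would then open "test.txt" once and stream the returned buffers to it in order from a single thread, which keeps the output identical to the sequential version.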
There is another big issue: the input file is read in its entirety by each thread! This means the 4 threads together do 4 times more work than a single thread would. This completely defeats the benefit of using multiple threads. You need to split the input file into roughly equal parts and then perform the computation. This is not so easy, since the line delimiters have to be taken into account.
One solution to this problem is to first retrieve the size of the file and then divide the 0..size range into N parts, where N is the number of workers. The split ranges then need to be corrected so that each one starts at the beginning of a line. You can do this correction by reading a line in the file at the starting location of each range and then adjusting the start/end location of each range accordingly (you just need to add the size of the line read). Once corrected, each worker can operate on a completely independent part of the file and read it in parallel (using a different ifstream object, like you did).

Reading a specific line from a .txt file

I have a text file full of names:
I want to code a random name generator that will copy a specific line from the .txt file and return it.
When reading from a file, you must start at the beginning and continue on. My best advice would be to read in all of the names, store them in a container (a std::vector lets you index into it), and randomly access them that way, if you don't have stringent concerns over efficiency.
You cannot pick a random string from the end of the file without first reading up to that name in the file.
You may also want to look at fseek(), which will allow you to "jump" to a location within the input stream. You could randomly generate an offset and then provide that as an argument to fseek(). Keep in mind that such an offset will usually land in the middle of a line, so you would have to skip to the next newline before reading a name.
You cannot do that unless you do one of two things:
Generate an index for that file, containing the address of each line, then you can go straight to that address and read it. This index can be stored in many different ways, the easiest one being on a separate file, this way the original file can still be considered a text file, or;
Structure the file so that each line starts at a fixed distance in bytes of each other, so you can just go to the line you want by multiplying (desired index * size). This does not mean the texts on each line need to have the same length, you can pad the end of the line with null-terminators (character '\0'). In this case it is not recommended to work this file as a text file anymore, but a binary file instead.
You can write a separate program that will generate this index or generate the structured file for your main program to use.
All this of course, considering you want the program to run and read the line without having to load the entire file in memory first. If your program will constantly read lines from the file, you should probably just load the entire file into a std::vector<std::string> and then read the lines at will from there.
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <cstdlib>
#include <ctime>
using namespace std;
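int main()
{
    string filePath = "test.txt";
    vector<std::string> qNames;
    ifstream openFile(filePath.c_str());
    if (openFile.is_open()) {
        string line;
        // Load every line (one name per line) into memory.
        while (getline(openFile, line)) {
            qNames.push_back(line);
        }
        openFile.close();
    }
    if (!qNames.empty()) {
        srand((unsigned int)time(NULL));
        // Print 10 randomly chosen names.
        for (int i = 0; i < 10; i++) {
            int num = rand();
            int linePos = num % qNames.size();
            cout << qNames[linePos] << endl;
        }
    }
    return 0;
}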

Parallel HDF5 C++ program that creates a group for each MPI process rank

I am trying to find a minimal example for opening and closing a HDF5 file in parallel using the MPIO driver in the C++ interface to HDF5, that creates a HDF5 Group for each MPI process rank and saves the file. The parallel programming example given in the repo is not quite what I would call minimal, but I tried to use parts of that example, together with the C++ API docs and the simple C++ parallel HDF5 example set.
This is what I came up with so far:
Edit: I have added a loop over MPI ranks to try and create the HDF5 groups in collective mode, the result is the same.
#include <iostream>
#include <mpi.h>
#include <sstream>
#include <memory>
using std::cout;
using std::endl;
#include <string>
#include "H5Cpp.h"
using namespace H5;
using namespace std;
int main(void)
{
    MPI_Init(NULL, NULL);
    // Get the number of processes
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // Get the rank of the process
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    auto acc_tpl1 = H5Pcreate(H5P_FILE_ACCESS);
    /* set Parallel access with communicator */
    H5Pset_fapl_mpio(acc_tpl1, MPI_COMM_WORLD, MPI_INFO_NULL);
    // Creating the file with H5File stores only a single group with 4 MPI processes.
    auto testFile = H5File("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, acc_tpl1);
    for (unsigned int i = 0; i < size; ++i)
    {
        std::stringstream ss;
        ss << "/RANK_GROUP" << rank;
        string rankGroup {ss.str()};
        // Create the rank group with testFile.
        if (! testFile.exists(rankGroup))
        {
            cout << rankGroup << endl;
            testFile.createGroup(rankGroup);
        }
    }
    // Release the file-access template
    H5Pclose(acc_tpl1);
    // Release the testFile
    testFile.close();
    MPI_Finalize();
    return 0;
}
I can't figure out from the C++ API how to set the MPIO driver.
Also, the groups are not written by every rank:
?> h5c++ test-mpi-group-creation.cpp -o test-mpi-group-creation
?> mpirun -np 4 ./test-mpi-group-creation
?> h5ls -lr test.h5
/ Group
What do I need to change to have this minimal parallel example with groups running using the C++ API to hdf5?
In HDF5, all the "metadata" must be created by all ranks in collective mode. That is: every processor will open the file, create all groups, create all datasets. Then, you can write to the specified datasets individually. Note that in the case of extendable datasets the resizing must also be done collectively.
In practice: you must loop in the program for the creation of groups, attributes and datasets.
The reason is that every rank must know about the whole layout of the HDF5 file.
An alternative in some cases is to write one hdf5 file per rank. In the case of fully independent groups this makes sense.
The page Collective Calling Requirements in Parallel HDF5 Applications lists the routines that must be called in the "collective" mode. The requirements are the same for all APIs (C, C++, Fortran, etc).

C++ - Opening text files sequentially

I have hundreds of .txt files ordered by number: 1.txt, 2.txt, 3.txt,...n.txt. In each file there are two columns with decimal numbers.
I wrote an algorithm that does some operations on one .txt file alone, and now I want to repeat the same for all of them.
This helpful question gave me some idea of what I'm trying to do.
Now I'm trying to write an algorithm to read all of the files:
#include <iostream>
#include <fstream>
#include <string>
#include <cstdio>
using namespace std;
int main ()
{
    int i, n;
    char filename[6];
    double column1[100], column2[100];
    for (n=1;n=200;n++)
    {
        sprintf(filename, "%d.txt", n);
        ifstream datafile;
        datafile.open(filename);
        for (i=0;i<100;i++)
        {
            datafile >> column1[i] >> column2[i];
            cout << column1[i] << column2[i];
        }
        datafile.close();
    }
    return 0;
}
What I think the code is doing: it is creating string names from 1.txt to 200.txt, then it opens the files with these names. For each file, the first 100 rows of the two columns are stored in the arrays column1 and column2, then the values are shown on the screen.
I don't get any error when compiling it, but when I run it the output is huge and simply won't stop. If I redirect the output to a .txt file, it easily reaches several GB!
I also tried decreasing the loop count and reducing the number of rows read (to 3 or so), but I still get an infinite output. I would be glad if someone could point out the mistakes I'm making in the code...
I am using gcc 5.2.1 with Linux.
A 6-element array is too short to store "200.txt". It must have at least 8 elements (7 characters plus the terminating null).
The condition n=200 is wrong: it is an assignment, not a comparison, so it is always true. It should be n<=200.
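One way to apply both fixes is sketched below, building the file name in a std::string so that no fixed-size buffer is involved. The helper names numbered_name, read_two_columns and count_rows_in_all are hypothetical, and the sketch assumes the files sit in the current directory.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Build "<n>.txt" in a std::string, which grows as needed.
std::string numbered_name(int n)
{
    return std::to_string(n) + ".txt";
}

// Read up to max_rows pairs of numbers from one two-column file.
std::vector<std::pair<double, double>> read_two_columns(
    const std::string& filename, std::size_t max_rows)
{
    std::vector<std::pair<double, double>> rows;
    std::ifstream datafile(filename);
    double a = 0.0, b = 0.0;
    // Stops at max_rows, at end of file, or if the file could not be opened.
    while (rows.size() < max_rows && datafile >> a >> b) {
        rows.emplace_back(a, b);
    }
    return rows;
}

// Visit 1.txt .. n_files.txt; note the comparison n <= n_files.
std::size_t count_rows_in_all(int n_files, std::size_t max_rows)
{
    std::size_t total = 0;
    for (int n = 1; n <= n_files; ++n) {
        total += read_two_columns(numbered_name(n), max_rows).size();
    }
    return total;
}
```

Checking the result of the extraction inside the loop condition also avoids printing stale values when a file is missing or shorter than expected.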
If all your files are in the same directory, you could also use boost::filesystem, e.g.:
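auto path = "path/to/folder";
std::for_each(boost::filesystem::directory_iterator(path),
              boost::filesystem::directory_iterator(),
    [](boost::filesystem::directory_entry file){
        // test if file is of the correct type
        // do sth with file
    });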
I think this is a cleaner solution.

Reading key-value pairs as fast as possible in C++ from file

I have a file with roughly 2 million lines like this:
2s,3s,4s,5s,6s 100000
2s,3s,4s,5s,8s 101
2s,3s,4s,5s,9s 102
The first comma separated part indicates a poker result in Omaha, while the latter score is an example "value" of the cards. It is very important for me to read this file as fast as possible in C++, but I cannot seem to get it to be faster than a simple approach in Python (4.5 seconds) using the base library.
Using the Qt framework (QHash and QString), I was able to read the file in 2.5 seconds in release mode. However, I do not want to have the Qt dependency. The goal is to allow quick simulations using those 2 million lines, i.e. some_container["2s,3s,4s,5s,6s"] to yield 100 (though if applying a translation function or any non-readable format will allow for faster reading that's okay as well).
My current implementation is extremely slow (8 seconds!):
std::map<std::string, int> get_file_contents(const char *filename)
{
    std::map<std::string, int> outcomes;
    std::ifstream infile(filename);
    std::string c;
    int d;
    while (infile.good())
    {
        infile >> c;
        infile >> d;
        //std::cout << c << d << std::endl;
        outcomes[c] = d;
    }
    return outcomes;
}
What can I do to read this data into some kind of a key/value hash as fast as possible?
Note: The first 16 characters are always going to be there (the cards), while the score can go up to around 1 million.
Some further information gathered from various comments:
sample file:
ram restrictions: 750MB
initialization time restriction: 5s
computation time per hand restriction: 0.5s
As I see it, there are two bottlenecks in your code.
1 Bottleneck
I believe that the file reading is the biggest problem there. Having a binary file is the fastest option. Not only can you read it directly into an array with a raw istream::read in a single operation (which is very fast), but you can even map the file into memory if your OS supports it. Here is a link that's very informative on how to use memory mapped files.
2 Bottleneck
The std::map is usually implemented as a self-balancing BST that stores all the data in order. This makes insertion an O(log n) operation. You can change it to std::unordered_map, which uses a hash table instead. A hash table has constant-time insertion if the number of collisions is low. As the number of elements that you need to read is known, you can reserve a suitable number of buckets before inserting the elements. Keep in mind that you need more buckets than the number of elements that will be inserted into the hash table in order to avoid as many collisions as possible.
Ian Medeiros already mentioned the two major bottlenecks.
A few thoughts about data structures:
The number of different cards is known: 4 colors of 13 cards each -> 52 cards.
So a card requires at most 6 bits to store. Your current file format uses 24 bits per card (including the comma).
So by simply enumerating the cards and omitting the comma, you can save ~2/3 of the file size, and you can determine a card by reading only one character per card.
If you want to keep the file text-based, you may use a-m, n-z, A-M and N-Z for the four colors.
Another thing that bugs me is the string-based map. String operations are inefficient.
One hand contains 5 cards.
That means 52^5 possibilities if we keep it simple and do not consider the already drawn cards.
--> 52^5 = 380.204.032 < 2^32
That means we can enumerate every possible hand with a uint32 number. By defining a special sorting scheme for the cards (since order is irrelevant), we can assign a number to the hand and use this number as the key in our map, which is a lot faster than using strings.
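One possible encoding, sketched under assumptions: cards are written as two characters, a rank from "23456789TJQKA" followed by a suit from "cdhs" (the question only shows ranks 2-9 and the suit s, so the remaining letters are a guess), and a hand is exactly five cards separated by commas. The function names are hypothetical. Sorting the card indices first makes the code independent of the order in which the cards are listed.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Map a two-character card such as "2s" to an index in 0..51.
int card_index(char rank, char suit)
{
    const std::string ranks = "23456789TJQKA";  // assumed rank letters
    const std::string suits = "cdhs";           // assumed suit letters
    const std::size_t r = ranks.find(rank);
    const std::size_t s = suits.find(suit);
    assert(r != std::string::npos && s != std::string::npos);
    return static_cast<int>(s * 13 + r);
}

// Encode a hand like "2s,3s,4s,5s,6s" as a base-52 number below 52^5.
std::uint32_t encode_hand(const std::string& hand)
{
    assert(hand.size() == 14);  // five 2-character cards and four commas
    std::array<int, 5> cards;
    for (std::size_t i = 0; i < 5; ++i) {
        cards[i] = card_index(hand[3 * i], hand[3 * i + 1]);
    }
    std::sort(cards.begin(), cards.end());  // order of the cards is irrelevant
    std::uint32_t code = 0;
    for (int c : cards) {
        code = code * 52 + static_cast<std::uint32_t>(c);
    }
    return code;
}
```

The resulting number can serve as the key of a std::unordered_map<std::uint32_t, int>, or as an index into a plain array if memory allows.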
If we have enough memory (1.5 GB), we do not even need a map; we can simply use an array.
Of course, most cells are unused, but access may be very fast. We can even omit the ordering of the cards, since the cells are present regardless of whether we fill them or not, so we can use them. But in this case you should not forget to fill all possible permutations of the hand read from the file.
With this scheme we may also be able to further optimize our file reading speed, if we store only the hand's number and the rating, so that only 2 values need to be parsed.
In fact, we can optimize the required storage space by using a more complex addressing scheme for the different hands, since in reality there are only 52*51*50*49*48 = 311.875.200 possible hands. In addition, the ordering is irrelevant, as mentioned, but I think this saving is not worth the increased complexity of encoding the hands.
A simple idea might be to use the C API, which is considerably simpler:
#include <cstdio>
int n;
char s[128];
while (std::fscanf(stdin, "%127s %d", s, &n) == 2)
outcomes[s] = n;
A rough test showed a considerable speedup for me compared to the iostreams library.
Further speedups may be achieved by storing the data in a contiguous array, e.g. a vector of std::pair<std::string, int>; it depends on whether your data is already sorted and how you need to access it later.
For a serious solution, though, you should probably step back further and think of a better way to represent your data. For example, a fixed-width, binary encoding would be much more space-efficient and faster to parse, since you won't need to look ahead for line endings or parse strings.
Update: From some quick experimentation I've found it fairly fast to first read the entire file into memory and then perform alternating strtok calls with either " " or "\n" as the delimiter; whenever a pair of calls succeed, apply strtol on the second pointer to parse the integer. Here's a skeleton:
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
int main()
{
    std::vector<char> data;
    // Read entire file to memory
    char buf[4096];
    for (std::size_t n; (n = std::fread(buf, 1, sizeof buf, stdin)) > 0; )
    {
        data.insert(data.end(), buf, buf + n);
    }
    data.push_back('\0'); // strtok needs a null-terminated buffer
    // Tokenize the in-memory data
    char * p = &data.front();
    for (char * q = std::strtok(p, " "); q; q = std::strtok(nullptr, " "))
    {
        if (char * r = std::strtok(nullptr, "\n"))
        {
            char * e;
            errno = 0;
            int const n = std::strtol(r, &e, 10);
            if (*e != '\0' || errno != 0) { continue; }
            // At this point we have data:
            // * the string is "q"
            // * the integer is "n"
        }
    }
}