How to buffer efficiently when writing to 1000s of files in C++

I am quite inexperienced when it comes to C++ I/O operations, especially when dealing with buffers etc., so please bear with me.
I have a programme that has a vector of objects (1000s - 10,000s). At each time-step the state of the objects is updated. I want to have the functionality to log a complete state time history for each of these objects.
Currently I have a function that loops through my vector of objects, updates the state, and then calls a logging function which opens the file (ASCII) for that object, writes the state to the file, and closes the file (using std::ofstream). The problem is this significantly slows down my run time.
I've been recommended a couple of things to help speed this up:
1. Buffer my output to prevent extensive I/O calls to the disk
2. Write to binary, not ASCII, files
My question mainly concerns 1. Specifically, how would I actually implement this? Would each object effectively require its own buffer? Or would this be a single buffer that somehow knows which file to send each bit of data to? If the latter, what is the best way to achieve this?
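To make 1. more concrete, the kind of thing I imagine (completely untested, all names made up) is each object owning a small in-memory buffer that collects state lines and only touches the disk every N time-steps:

#include <fstream>
#include <sstream>
#include <string>

// Rough idea only: accumulate many states in memory, write them out in one go.
struct ObjectLogger {
    std::string        filename;
    std::ostringstream buffer;                      // in-memory state history

    void log(const std::string& state_line) {
        buffer << state_line << '\n';               // no disk I/O here
    }

    void flush() {                                  // call every N time-steps
        std::ofstream out(filename, std::ios::app);
        out << buffer.str();                        // one large write instead of many small ones
        buffer.str("");                             // reset the in-memory buffer
    }
};

Is that the right sort of idea, or is there a better way?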
Thanks!

Maybe the simplest idea first: instead of logging to separate files, why not log everything to an SQLite database?
Given the following table structure:
create table iterations (
    id integer not null,
    iteration integer not null,
    value text not null
);
At the start of the program, prepare a statement once:
sqlite3_stmt *stmt;
sqlite3_prepare_v3(db, "insert into iterations values(?,?,?)", -1, SQLITE_PREPARE_PERSISTENT, &stmt, NULL);
The question marks here are placeholders for future values.
After every iteration of your simulation, you could walk your state vector and execute the stmt a number of times to actually insert rows into the database, like so:
for (std::size_t i = 0; i < objects.size(); i++) {
    sqlite3_reset(stmt);
    // Fill in the three placeholders and execute the query.
    sqlite3_bind_int(stmt, 1, static_cast<int>(i));
    sqlite3_bind_int(stmt, 2, current_iteration); // Could be done once, but here for illustration.
    std::string state = objects[i].get_state();
    // SQLITE_STATIC: SQLite will not copy the buffer, so `state` must stay alive
    // until sqlite3_step() has run (it does here, since we step in the same iteration).
    sqlite3_bind_text(stmt, 3, state.c_str(), static_cast<int>(state.size()), SQLITE_STATIC);
    sqlite3_step(stmt); // Execute the query.
}
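For thousands of inserts per time-step you would typically also wrap the whole loop in a single transaction, so SQLite commits (and syncs to disk) once per time-step rather than once per row. A minimal sketch:

sqlite3_exec(db, "BEGIN TRANSACTION", nullptr, nullptr, nullptr); // start the batch
// ... the reset/bind/step loop shown above ...
sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);            // one sync for the whole time-step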
You can then easily query the history of each individual object using the SQLite command-line tool or any database manager that understands SQLite.

Related

Better way to bulk insert in sqlite3 C++

I have bulk insert statements to be executed. I can create a single SQL string with all the insert statements and execute it once, like this:
std::string sql_string = "";
for (int i = 0; i < 1000; i++) {
    Transaction transaction = transactions[i];
    std::string tx_sql = CREATE_SQL_STRING(transaction, lastTxId); // "INSERT INTO (...) VALUES (...);"
    sql_string += tx_sql;
    lastTxId++;
}
sqlite3_exec(db, sql_string.c_str(), callback, 0, &zErrMsg);
Is there any other, better-performing way to bulk insert with the sqlite3 C++ API?
The insertions are not security critical; only performance matters. Should I use prepared statements?
Most important: put all inserts into a single transaction.
For more detailed information regarding performance, see the official documentation: https://sqlite.org/speed.html (a bit outdated).
For more recent information I'd recommend https://medium.com/@JasonWyatt/squeezing-performance-from-sqlite-insertions-971aff98eef2
As for prepared statements: I am not sure about the performance impact, but it is good practice to always use them.
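Putting both suggestions together, a rough sketch with the sqlite3 C API could look like this (the table name, column layout and payload() accessor are made up for illustration; db, transactions and lastTxId are from the question):

#include <sqlite3.h>
#include <string>

// One transaction around the whole batch, plus one prepared statement reused for every row.
sqlite3_exec(db, "BEGIN TRANSACTION", nullptr, nullptr, nullptr);

sqlite3_stmt* stmt = nullptr;
sqlite3_prepare_v2(db, "INSERT INTO transactions(id, payload) VALUES (?, ?)",
                   -1, &stmt, nullptr);

for (int i = 0; i < 1000; i++) {
    sqlite3_reset(stmt);
    sqlite3_bind_int(stmt, 1, lastTxId++);
    const std::string payload = transactions[i].payload();            // assumed accessor
    sqlite3_bind_text(stmt, 2, payload.c_str(),
                      static_cast<int>(payload.size()), SQLITE_TRANSIENT);
    sqlite3_step(stmt);
}

sqlite3_finalize(stmt);
sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);

The prepared statement avoids re-parsing the SQL for every row, and the single transaction avoids one disk sync per row.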

C++ Insert into MySQL database

I have an application in C++ and I'm using the MySQL Connector/C++ (https://dev.mysql.com/downloads/connector/cpp/).
I need to save some logs in one table. Depending on the situation, I may have large amounts of data, on the order of thousands of rows (for example, 80,000).
I have already implemented a function that iterates over my std::vector<std::string> and saves each std::string to my database.
For example:
std::vector<std::string> lines = explode(filedata, '\n');
for (std::size_t i = 0; i < lines.size(); i++)
{
    std::vector<std::string> elements = explode(lines[i], ';');
    ui64 timestamp = strtol(elements.at(0).c_str(), 0, 10);
    std::string pointId = elements.at(6);
    std::string pointName = elements.at(5);
    std::string data = elements.at(3);
    database->SetLogs(timestamp, pointId, pointName, data);
}
The logs come from a CSV file; I save all the fields into my vector. After this I parse each line (with explode) and get only the fields that I need to save.
But I have a problem. If I have e.g. 80,000 rows, I'm calling my save function 80,000 times. It works and saves all the data correctly, but it takes a lot of time.
Is there some way to save all the data with a single call, instead of e.g. 80,000 calls, and thus reduce the time?
EDIT 1
I changed the insert code to this:
std::string insertLog = "INSERT INTO Logs (timestamp,pointId,pointName,data) VALUES (?,?,?,?)";
// pstmt is a sql::PreparedStatement* obtained from the connection, e.g.:
// sql::PreparedStatement* pstmt = con->prepareStatement(insertLog);
pstmt->setString(1, timestampString);
pstmt->setString(2, pointId);
pstmt->setString(3, pointName);
pstmt->setString(4, data);
pstmt->executeUpdate();
You could change the code to do bulk inserts, rather than insert one row at a time.
I would recommend doing it by generating an insert statement as a string and passing the string to mysqlpp::Query. I expect you can use a prepared statement to do bulk inserts in a similar way.
If you do an insert statement for each row (which I assume is the case here, given the use of explode()), there's a lot more traffic between the client and the server, which must slow things down.
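For illustration, a rough sketch of that single-statement approach with Connector/C++ (this assumes con is an open sql::Connection*, reuses the asker's explode() helper, and skips quoting/escaping of the values, which you would still need to handle):

#include <cppconn/connection.h>
#include <cppconn/statement.h>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Build one multi-row INSERT and send it to the server in a single round trip.
std::ostringstream sql;
sql << "INSERT INTO Logs (timestamp,pointId,pointName,data) VALUES ";
for (std::size_t i = 0; i < lines.size(); ++i) {
    std::vector<std::string> e = explode(lines[i], ';');
    if (i > 0) sql << ",";
    sql << "(" << e.at(0) << ",'" << e.at(6) << "','"
        << e.at(5) << "','" << e.at(3) << "')";
}
std::unique_ptr<sql::Statement> stmt(con->createStatement());
stmt->execute(sql.str());

For very large batches you may need to split this into chunks so each statement stays under the server's max_allowed_packet limit.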
EDIT
I also suspect the explode() function is adding to the execution time, as you are going over the file data twice by calling explode twice. It would be nice if you could post the code for explode().

Fortran unformatted output with each MPI process writing part of an array

In my parallel program, there is a big matrix. Each process computes and stores a part of it. Then the program writes the matrix to a file by letting each process write its own part of the matrix in the correct order. The output file is in "unformatted" form. But when I tried to read the file in a serial code (with the correct size of the big matrix allocated), I got an error which I don't understand.
My question is: in an MPI program, how do you produce a binary file, matching what the serial version would output, for a big matrix that is stored across different processes?
Here is my attempt:
if(ThisProcs == RootProcs) then
  open(unit = file_restart%unit, file = file_restart%file, form = 'unformatted')
  write(file_restart%unit)psi
  close(file_restart%unit)
endif
#ifdef USEMPI
call mpi_barrier(mpi_comm_world,MPIerr)
#endif
do i = 1, NProcs - 1
  if(ThisProcs == i) then
    open(unit = file_restart%unit, file = file_restart%file, form = 'unformatted', status = 'old', position = 'append')
    write(file_restart%unit)psi
    close(file_restart%unit)
  endif
#ifdef USEMPI
  call mpi_barrier(mpi_comm_world,MPIerr)
#endif
enddo
Psi is the big matrix, it is allocated as:
Psi(N_lattice_points, NPsiStart:NPsiEnd)
But when I tried to load the file in a serial code:
open(2,file=File1,form="unformatted")
read(2)psi
I got the error: forrtl: severe (67): input statement requires too much data, unit 2. (I am using MSVS 2012 + Intel Fortran 2013.)
How can I fix the parallel part to make the binary file readable for the serial code? Of course one can combine them into one big matrix in the MPI program, but is there an easier way?
Edit 1
The two answers are really nice. I'll use access = "stream" to solve my problem. I also just figured out that I can use inquire to check whether the file access is "sequential" or "stream".
This isn't a problem specific to MPI, but would also happen in a serial program which took the same approach of writing out chunks piecemeal.
Ignore the opening and closing for each process and look at the overall connection and transfer statements. Your connection is an unformatted file using sequential access. It's unformatted because you explicitly asked for that, and sequential because you didn't ask for anything else.
Sequential file access is based on records. Each of your write statements transfers out a record consisting of a chunk of the matrix. Conversely, your input statement attempts to read from a single record.
Your problem is that while you try to read the entire matrix from the first record of the file that record doesn't contain the whole matrix. It doesn't contain anything like the correct amount of data. End result: "input statement requires too much data".
So, you need to either read in the data based on the same record structure, or move away from record files.
The latter is simple: use stream access:
open(unit = file_restart%unit, file = file_restart%file, &
form = 'unformatted', access='stream')
Alternatively, read with a similar loop structure:
do i = 1, NPROCS
  ! read statement with a slice
end do
This of course requires understanding the correct slicing.
Alternatively, one can consider using MPI-IO for output, which is very similar to using stream output. Read this back in with stream access. You can find out more about this concept elsewhere on SO.
Fortran unformatted sequential writes to record files are not quite completely raw data. Each write puts data before and after the record in a processor-dependent form. The size of your reads cannot exceed the record size of your writes. This means that if psi is written in two writes, you will need to read it back in two reads; you cannot read it all in at once.
Perhaps the most straightforward option is to use stream access instead of sequential access. A stream file is indexed by bytes (generally) and does not contain record start and end information. Using this access method you can split the write but read it all at once. Stream access is a feature of Fortran 2003.
If you stick with sequential access, you'll need to know how many MPI ranks wrote the file and loop over properly sized records to read the data as it was written. You could make the user specify the number of ranks or store that as the first record in the file and read that first to determine how to read the rest of the data.
If you are writing MPI, why not MPI-IO? Each process will call MPI_File_set_view to set a subarray view of the file, then each process can collectively write the data with MPI_FILE_WRITE_ALL. This approach is likely to scale really well on big machines (though your approach will be fine up to, oh, maybe 100 processors).
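To illustrate the MPI-IO idea, here is a rough sketch using the MPI C API (usable from C or C++; the Fortran calls are analogous). It assumes each rank's slice of psi is a contiguous block of count doubles starting at element offset offset_elems of the global matrix; for non-contiguous layouts you would build a file type with MPI_Type_create_subarray instead. The resulting file is raw values with no record markers, so the serial code can read it with access='stream'.

#include <mpi.h>

// Each rank writes its contiguous slice of the matrix at its own offset, collectively.
void write_matrix_slice(const char* filename, const double* slice,
                        int count, MPI_Offset offset_elems)
{
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, filename,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Every rank views the file as a sequence of doubles, starting at the
    // byte offset where its own block belongs.
    MPI_File_set_view(fh, offset_elems * (MPI_Offset)sizeof(double),
                      MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

    // Collective write: all ranks participate, each contributing its slice.
    MPI_File_write_all(fh, slice, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}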

Adjusting granularity in tbb parallel_pipeline

The task for the pipeline is the following:
read a huge number (10-15k) of ~100-200 MB compressed files sequentially
decompress each file in parallel
deserialize each decompressed file in parallel
process the resulting deserialized objects and compute some values based on all objects (mean, median, groupings, etc.)
When I get the decompressed file memory buffer, the serialized blocks come one after another, so I'd like to pass them to the next filter in the same manner or, at least, adjust this process by packing the serialized blocks into groups of some size and then passing those on. However (as I understand it), tbb_pipeline makes me pass a pointer to the buffer with ALL serialized blocks, because each filter has to take a pointer and return a pointer.
Using a concurrent queue to accumulate packs of serialized objects defeats the point of using tbb_pipeline, as I understand it. Moreover, the constness of operator() in filters doesn't allow me to have my own intermediate "task pool" (but nevertheless, if each thread had its own local copy of storage for "tasks" and just cut the right pieces from it, that would be great).
Primary question:
Is there some way to "adjust" the granularity in this situation? (i.e. some filter gets a pointer to all serialized objects and passes small packs of objects to the next filter)
Reformatting (splitting, etc.) the input files is almost impossible.
Secondary question:
When I accumulate the processing results, I don't really care about order; I only need aggregate statistics. Can I use a parallel filter instead of serial_out_of_order, accumulate the results for each thread somewhere, and then just merge them?
However (as I understand it), tbb_pipeline makes me pass a pointer to the buffer with ALL serialized blocks, because each filter has to take a pointer and return a pointer.
First, I think it's better to use the more modern, type-safe form of the pipeline: parallel_pipeline. It does not require you to pass any specific pointer to any specific data. You just specify which data of which type is needed for the next stage to be able to process it. So it's rather a matter of how your first filter partitions the data to be processed by the following filters.
Primary question: Is there some way to "adjust" the granularity in this situation? (i.e. some filter gets a pointer to all serialized objects and passes small packs of objects to the next filter)
You can safely embed one parallel algorithm into another in order to change the granularity for some stages, e.g. on the top level, the 1st pipeline goes through the file list; the 2nd pipeline reads big blocks of each file on the nested level; and finally, the innermost pipeline breaks the big blocks down into smaller ones for some of the 2nd-level stages. See a general example of nesting below.
Secondary question: Can I use a parallel filter instead of serial_out_of_order, accumulate the results for each thread somewhere, and then just merge them?
Yes, you can always use a parallel filter if it does not modify shared data. For example, you can use tbb::combinable in order to collect thread-specific partial sums and then combine them.
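A minimal sketch of that idea (the statistic here is just a running sum, and extract_value() is a made-up placeholder for whatever you compute per object):

#include <tbb/combinable.h>
#include <functional>

// Thread-local partial sums, merged once after the pipeline finishes.
tbb::combinable<double> partial_sums([] { return 0.0; });

// ... inside the parallel output filter:
//     partial_sums.local() += extract_value(obj);

// ... after parallel_pipeline() returns:
double total = partial_sums.combine(std::plus<double>());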
but nevertheless, if each thread had its own local copy of storage for "tasks" and just cut the right pieces from it, that would be great
Yes, they do. Each thread has its own local pool of tasks.
General example of nested parallel_pipelines
parallel_pipeline( 2/*only two files at once*/,
    make_filter<void,std::string>(
        filter::serial,
        [&](flow_control& fc)-> std::string {
            if( !files.empty() ) {
                std::string filename = files.front();
                files.pop();
                return filename;
            } else {
                fc.stop();
                return "stop";
            }
        }
    ) &
    make_filter<std::string,void>(
        filter::parallel,
        [](std::string s) {
            // a nested pipeline
            parallel_pipeline( 1024/*max number of tokens in flight in the inner pipeline*/,
                make_filter<void,char>(
                    filter::serial,
                    [&s](flow_control& fc)-> char {
                        if( !s.empty() ) {
                            char c = s.back();
                            s.pop_back();
                            return c;
                        } else {
                            fc.stop();
                            return 0;
                        }
                    }
                ) &
                make_filter<char,void>(
                    filter::parallel,
                    [](char c) {
                        putc(c, stdout);
                    }
                )
            );
        }
    )
);

Append to a JSON array in a JSON file on disk, every second using C++

This is my first post here, so please bear with me.
I have searched high and low on the internet for an answer, but I've not been able to resolve my issue, so I have decided to write a post here.
I am trying to write (append) to a JSON array in a file using C++ and Jzon, at an interval of one write per second. The JSON file is initially written by a "Prepare" function. Another function is then called every second to add an array to the JSON file and append a new object to that array each second.
I have tried many things, most of which resulted in all sorts of issues. My latest attempt gave me the best results and this is the code that I have included below. However, the approach I took is very inefficient as I am writing an entire array every second. This is having a massive hit on CPU utilisation as the array grows, but not so much on memory as I had first anticipated.
What I really would like to be able to do is to append to an existing array contained in a JSON file on disk, line by line, rather than having to clear the entire array from the JSON object and rewriting the entire file, each and every second.
I am hoping that some of the geniuses on this website will be able to point me in the right direction.
Thank you very much in advance.
Here is my code:
//Create some objects somewhere at the top of the cpp file
Jzon::Object jsonFlight;
Jzon::Array jsonFlightPath;
Jzon::Object jsonCoordinates;

int PrepareFlight(const char* jsonfilename) {
    //...SOME PREPARE FUNCTION STUFF GOES HERE...
    //Add the Flight Information to the jsonFlight root JSON Object
    jsonFlight.Add("Flight Number", flightnum);
    jsonFlight.Add("Origin", originicao);
    jsonFlight.Add("Destination", desticao);
    jsonFlight.Add("Pilot in Command", pic);
    //Write the jsonFlight object to a .json file on disk. Filename is passed in as a param of the function.
    Jzon::FileWriter::WriteFile(jsonfilename, jsonFlight, Jzon::NoFormat);
    return 0;
}

int UpdateJSON_FlightPath(ACFT_PARAM* pS, const char* jsonfilename) {
    //Add the current returned coordinates to the jsonCoordinates Jzon object
    jsonCoordinates.Add("altitude", pS->altitude);
    jsonCoordinates.Add("latitude", pS->latitude);
    jsonCoordinates.Add("longitude", pS->longitude);
    //Add the coordinates to the flight path, then clear the coordinates.
    jsonFlightPath.Add(jsonCoordinates);
    jsonCoordinates.Clear();
    //Now add the entire flightpath array to the jsonFlight object.
    jsonFlight.Add("Flightpath", jsonFlightPath);
    //Write the jsonFlight object to a JSON file on disk.
    Jzon::FileWriter::WriteFile(jsonfilename, jsonFlight, Jzon::NoFormat);
    //Remove the entire jsonFlightPath array from the jsonFlight object to avoid duplication next time the function executes.
    jsonFlight.Remove("Flightpath");
    return 0;
}
For sure you can do "flat file" storage yourself... but this is a symptom of needing a database. Something very light like SQLite, or mid-weight and open-source like MySQL, Firebird, or PostgreSQL.
But as to your question:
1) Leave the closing ] bracket off, and just keep the file open and appending (a rough sketch of this approach is at the end of this answer) -- but if you don't close the file correctly, it will be damaged and need repair to be readable.
2) Your current option -- writing a complete file each time -- isn't safe from data loss either, as the moment you "open to overwrite" you lose all data previously stored in the file. The workaround here, is to rename the old file as a backup before you start writing.
With the first option you should also make backup copies of your file (say at daily intervals). Otherwise data loss is likely to occur eventually -- on Ctrl-C, power loss, a program error or a system crash.
Of course if you use any of SQLite, MySQL, Firebird or PostgreSQL, all the data-integrity problems will be handled for you.
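If you do stick with option 1), a minimal sketch of the keep-the-file-open approach could look like the following (AppendCoordinates() takes an already-serialized JSON object as a string; the file only becomes valid JSON once CloseFlightLog() has written the closing brackets):

#include <fstream>
#include <string>

std::ofstream flightLog;   // kept open for the whole flight

void PrepareFlightLog(const char* jsonfilename) {
    flightLog.open(jsonfilename, std::ios::out | std::ios::trunc);
    flightLog << "{\"Flightpath\":[\n";   // deliberately no closing ]} yet
}

// coordJson is one serialized object, e.g. {"altitude":...,"latitude":...,"longitude":...}
void AppendCoordinates(const std::string& coordJson, bool first) {
    if (!first) flightLog << ",\n";
    flightLog << coordJson;
    flightLog.flush();                    // one small append per second
}

void CloseFlightLog() {
    flightLog << "\n]}\n";                // only now is the file valid JSON
    flightLog.close();
}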