Read a large amount of ASCII numbers and write them in binary form - C++

I have data files with about 1.5 GB worth of floating-point numbers stored as ASCII text separated by whitespace, e.g., 1.2334 2.3456 3.4567 and so on.
Before processing these numbers I first translate the original file to binary format. This is helpful because I can choose whether to use float or double, reduce the file size (to about 800 MB for double and 400 MB for float), and read in chunks of the appropriate size once I am processing the data.
I wrote the following function to make the ASCII-to-binary translation:
template<typename RealType=float>
void ascii_to_binary(const std::string& fsrc, const std::string& fdst){
    RealType value;
    std::fstream src(fsrc.c_str(), std::fstream::in | std::fstream::binary);
    std::fstream dst(fdst.c_str(), std::fstream::out | std::fstream::binary);
    while(src >> value){
        dst.write((char*)&value, sizeof(RealType));
    }
    // RAII closes both files
}
I would like to speed up ascii_to_binary, but I seem unable to come up with anything. I tried reading the file in chunks of 8192 bytes and then processing the buffer in another subroutine. This seems very complicated because the last few characters in the buffer may be whitespace (in which case all is good), or a truncated number (which is very bad) - the logic to handle the possible truncation seems hardly worth it.
What would you do to speed up this function? I would rather rely on standard C++ (C++11 is OK) with no additional dependencies such as Boost.
Thank you.
Edit:
@DavidSchwartz:
I tried to implement your suggestion as follows:
template<typename RealType=float>
void ascii_to_binary(const std::string& fsrc, const std::string& fdst){
    std::vector<RealType> buffer;
    typedef typename std::vector<RealType>::iterator VectorIterator;
    buffer.reserve(65536);
    std::fstream src(fsrc, std::fstream::in | std::fstream::binary);
    std::fstream dst(fdst, std::fstream::out | std::fstream::binary);
    while(true){
        size_t k = 0;
        while(k<65536 && src >> buffer[k]) k++;
        dst.write((char*)&buffer[0], buffer.size());
        if(k<65536){
            break;
        }
    }
}
But it does not seem to be writing the data! I'm working on it...

I did exactly the same thing, except that my fields were separated by tab '\t' and I also had to handle non-numeric comments at the end of each line and header rows interspersed with the data.
Here is the documentation for my utility.
And I also had a speed problem. Here are the things I did to improve performance by around 20x:
Replace explicit file reads with memory-mapped files. Map two blocks at once. When you are in the second block after processing a line, remap with the second and third blocks. This way a line that straddles a block boundary is still contiguous in memory. (This assumes that no line is larger than a block; you can probably increase the block size to guarantee this.)
Use SIMD instructions such as _mm_cmpeq_epi8 to search for line endings or other separator characters. In my case, any line containing an '=' character was a metadata row that needed different processing.
Use a barebones number parsing function (I used a custom one for parsing times in HH:MM:SS format; strtod and strtol are perfect for grabbing ordinary numbers). These are much faster than istream formatted extraction functions (see the sketch at the end of this answer).
Use the OS file write API instead of the standard C++ API.
If you dream of throughput in the 300,000 lines/second range, then you should consider a similar approach.
Your executable also shrinks when you don't use C++ standard streams. Mine is 205 KB, including a graphical interface, and depends only on DLLs that ship with Windows (no MSVCRTxx.dll needed). And looking again, I see I'm still using C++ streams for status reporting.
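As a rough illustration of the "barebones parsing" point only, here is a minimal sketch that skips the memory mapping and SIMD parts: read the whole file into one NUL-terminated buffer and walk it with strtod. The function name and the decision to slurp the file rather than map it are mine, not the original utility's.

#include <cstdlib>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Read every whitespace-separated floating-point number from 'path'
// by running strtod over a single in-memory, NUL-terminated buffer.
std::vector<double> parse_all_doubles(const std::string& path)
{
    std::ifstream src(path, std::ios::in | std::ios::binary);
    std::string text((std::istreambuf_iterator<char>(src)),
                     std::istreambuf_iterator<char>());   // whole file in memory
    std::vector<double> values;
    const char* p = text.c_str();                          // c_str() guarantees the trailing NUL
    char* end = nullptr;
    for (double v = std::strtod(p, &end); p != end; v = std::strtod(p, &end)) {
        values.push_back(v);
        p = end;                                           // strtod skips leading whitespace itself
    }
    return values;
}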

Aggregate the writes into a fixed buffer, using a std::vector of RealType. Your logic should work like this:
1. Allocate a std::vector<RealType> with 65,536 default-constructed entries.
2. Read up to 65,536 entries into the vector, replacing the existing entries.
3. Write out as many entries as you were able to read in.
4. If you read in exactly 65,536 entries, go to step 2.
5. Stop, you are done.
This will prevent you from alternating reads and writes to two different files, minimizing the seek activity significantly. It will also allow you to make far fewer write calls, reducing copying and buffering logic.
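A minimal sketch of these steps might look like the following (note the sized vector rather than reserve, so buffer[k] refers to a real element, and that the write call passes a byte count of k * sizeof(RealType)):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

template<typename RealType = float>
void ascii_to_binary(const std::string& fsrc, const std::string& fdst)
{
    std::vector<RealType> buffer(65536);                 // step 1: 65,536 default-constructed entries
    std::ifstream src(fsrc, std::ios::in | std::ios::binary);
    std::ofstream dst(fdst, std::ios::out | std::ios::binary);
    while (true) {
        std::size_t k = 0;
        while (k < buffer.size() && src >> buffer[k])    // step 2: read up to 65,536 entries
            ++k;
        dst.write(reinterpret_cast<const char*>(buffer.data()),
                  k * sizeof(RealType));                 // step 3: write what was read, in bytes
        if (k < buffer.size())                           // steps 4/5: a partial fill means EOF
            break;
    }
}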

Related

C++: read small portions of a big number of files

I have a relatively simple question to ask. There has been an ongoing discussion across many programming languages about which method provides the fastest file read, mostly debating read() versus mmap(). As a person who also participated in these debates, I failed to find an answer to my current problem, because most answers help in the situation where the file to read is huge (for example, how to read a 10 TB text file...).
But my problem is a bit different: I have lots of files, let's say 100 million. I want to read the first 1-2 lines from these files. Whether the file is 10 kB or 100 TB is irrelevant. I just want the first one or two lines from every file. So I want to avoid reading or buffering the unnecessary parts of the files. My knowledge was not enough to thoroughly test which method is faster, or to discover what all my options are in the first place.
What I am doing right now (multithreaded for the moment):
for(const auto& p: std::filesystem::recursive_directory_iterator(path)) {
    if (!std::filesystem::is_directory(p)) {
        std::ifstream read_file(p.path().string());
        if (read_file.is_open()) {
            while (getline(read_file, line)) {
                // Get two lines here.
            }
        }
    }
}
What do C++ or the Linux environment provide me in this situation? Is there a faster or more efficient way to read small portions of millions of files?
Thank you for your time.
Info: I have access to C++20 and Ubuntu 18.04
You can save one underlying call to fstat by not testing whether the path is a directory, and instead relying on the is_open test:
#include <iostream>
#include <fstream>
#include <filesystem>
#include <string>

int main()
{
    std::string line, path = ".";
    for(const auto& p: std::filesystem::recursive_directory_iterator(path)) {
        std::ifstream read_file(p.path().string());
        if (read_file.is_open()) {
            std::cout << "opened: " << p.path().string() << '\n';
            while (getline(read_file, line)) {
                // Get two lines here.
            }
        }
    }
}
At least on Windows this code skips the directories. And as suggested in the comments, the is_open test can even be skipped, since getline doesn't read anything from a directory either.
Not the cleanest, but if it can save time it's worth it.
Any function in a program accessing a file under Linux will result in calling some "system calls" (for example read()).
All other available functions in some programming language (like fread(), fgets(), std::filesystem ...) call functions or methods which in turn call some system calls.
For this reason you can't be faster than calling the system calls directly.
I'm not 100% sure, but I think in most cases, the combination open(), read(), close() will be the fastest method for reading data from the start of a file.
(If the data is not located at the start of the file, pread() might be faster than read(); I'm not sure.)
Note that read() does not read a certain number of lines but a certain number of bytes (e.g. into an array of char), so you have to find the end(s) of the line(s) "manually" by searching the '\n' character(s) and/or the end of the file in the array of char.
Unfortunately, a line may be much longer than you expect, so the first N bytes of the file may not contain the first M lines and you have to call read() again.
In this case it depends on your system (e.g. file system or even hard disks) how many bytes you should read in each call to read() to get the maximum performance.
Example: Let's say in 75% of all files, the first N lines are found in the first 512 bytes of the file; in the other 25% of all files, first N lines are longer than 512 bytes in sum.
On some computers, reading 1024 bytes at once might require nearly the same time as reading 512 bytes, but reading 512 bytes twice will be much slower than reading 1024 bytes at once; on such computers it makes sense to read() 1024 bytes at once: You save a lot of time for 25% of the files and you lose only very little time for the other 75%.
On other computers, reading 512 bytes is significantly faster than reading 1024 bytes; on such computers it would be better to read() 512 bytes: Reading 1024 bytes would save you only little time when processing the 25% of files but cost you very much time when processing the other 75%.
I think in most cases this "optimal value" will be a multiple of 512 bytes because most modern file systems organize files in units that are a multiple of 512 bytes.
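As a rough illustration of the open()/read()/close() approach, here is a sketch. It is POSIX-only, has no retry loop for lines longer than the initial read, and the 4096-byte read size is an arbitrary guess rather than a measured optimum:

#include <fcntl.h>      // open
#include <unistd.h>     // read, close
#include <cstring>      // memchr
#include <string>

// Return up to the first two lines of 'path', reading at most one
// fixed-size block from the start of the file.
std::string first_two_lines(const char* path)
{
    char buf[4096];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return {};
    ssize_t n = read(fd, buf, sizeof(buf));
    close(fd);
    if (n <= 0)
        return {};
    // Find the line endings "manually", as described above.
    const char* first_nl = static_cast<const char*>(memchr(buf, '\n', n));
    if (!first_nl)
        return std::string(buf, n);               // fewer than two lines in this block
    const char* rest = first_nl + 1;
    const char* second_nl = static_cast<const char*>(
        memchr(rest, '\n', buf + n - rest));
    return std::string(buf, second_nl ? second_nl - buf : n);
}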
I was just typing something similar to Martin Rosenau's answer (when his popped up): do an unstructured read of the maximum length of two lines. But I would go further: queue that text buffer with the corresponding file name and let another thread parse/analyze it. If parsing takes about the same time as reading, you can save half of the time. If it takes longer (unlikely) - you can use multiple threads and save even more.
Side note - you should not parallelize reading (I tried that).
It may be worth experimenting: can you open one file, read it asynchronously while proceeding to open the next one? I don't know if any OS can overlap those things.

Compressing/decompressing data on the go

I'm running a physics simulation and I'd like to improve the way it handles its data. I'm saving and reading files that contain one float, then two ints, followed by 512*512 = 262144 values that are each +1 or -1, weighing in at about 595 kB per data file. All these numbers are separated by a single space.
I'm saving hundreds of thousands of these files, so it quickly adds up to gigabytes of storage. I'd like to know if there is a quick (and hopefully CPU-light) way of compressing and decompressing this kind of data on the go (I mean without tarring/untarring before/after use).
How much could I expect to save in the end?
If you want relatively fast read-write, you would probably want to store and read them in "binary" format, i.e. native to the way they are internally stored in bytes. A float uses 4 bytes of data, and you do not need any kind of "separator" when storing a large sequence of them.
To do this you might consider Boost's Serialization library.
Note that using data compression methods (zlib etc) will save you on bytes stored but will be relatively slow to zip and unzip them for usage.
Storing in binary format will not only use less disk storage (than storing in text format) but should also be more performant, not just because there is less file I/O but also because there is no string writing/parsing going on.
Note that when you input/output to binary_iarchive or binary_oarchive you pass in an underlying istream or ostream and if this is a file, you need to open it with ios::binary flag because of the issue of line-endings being potentially converted.
Even if you do decide that data-compression (zlib or some other library) is the way to go, it is still worth using boost::serialize to get your data into a "blob" to compress. In that case you would probably use std::ostringstream as your output stream to create the blob.
Incidentally, if you have 2^18 "boolean" values that can only be 1 or -1, you only need 1 bit for each one (they would be physically stored as 1 or 0, but you would logically translate that). That would come to 2^15 bytes, which is 32 KB, not 595 KB.
Given the extra info about the valid data, define your class like this:-
class Data
{
    float m_float_value;
    int m_int_value_1, m_int_value_2;
    unsigned m_weights [8192];
};
Then use binary file IO to stream this class to and from a file, don't convert to text!
The weights are stored as Boolean values, packed into unsigned integers.
To get the weight, add an accessor:-
int Data::GetWeight (size_t index)
{
    return m_weights [index >> 5] & (1 << (index & 31)) ? 1 : -1;
}
This gives you a data file size of 32780 bytes (5.4%) if there's no packing in the class data.
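For completeness, a sketch of a matching setter and the raw binary file I/O mentioned above. This assumes SetWeight is declared in the class alongside GetWeight, that Data contains only the plain members shown, and the free function names Save/Load are purely illustrative:

#include <cstddef>
#include <fstream>

// Pack a +1/-1 weight into the corresponding bit.
void Data::SetWeight (size_t index, int weight)
{
    unsigned mask = 1u << (index & 31);
    if (weight > 0)
        m_weights [index >> 5] |= mask;
    else
        m_weights [index >> 5] &= ~mask;
}

// Stream the whole object to and from a file as raw bytes.
void Save (const Data &data, const char *path)
{
    std::ofstream out (path, std::ios::binary);
    out.write (reinterpret_cast<const char *>(&data), sizeof data);
}

void Load (Data &data, const char *path)
{
    std::ifstream in (path, std::ios::binary);
    in.read (reinterpret_cast<char *>(&data), sizeof data);
}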
I would suggest that if you are concerned about size a binary format would be the most useful way to "compress" your data. It sounds like you are dealing with something like the following:
#include <fstream>

struct data {
    float a;
    int b, c;
    signed char d[512][512];
};

void someFunc() {
    data* someData = new data;
    std::ifstream inFile("inputData.bin", std::ifstream::binary);
    std::ofstream outFile("outputData.bin", std::ofstream::binary);
    // Read from file
    inFile.read(reinterpret_cast<char*>(someData), sizeof(data));
    inFile.close();
    // Write to file
    outFile.write(reinterpret_cast<const char*>(someData), sizeof(data));
    outFile.close();
    delete someData;
}
I should also mention that if you encode your +1/-1 values as bits you should get a lot of space savings (another factor of 8 on top of what I'm showing here).
For that amount of data, anything homemade isn't going to perform anywhere near as well as good-quality open-source binary-storage libraries. Try Boost Serialization or - for this type of storage requirement - HDF5. I've used HDF5 successfully on a few projects with very, very large amounts of double, float, long and int data. I found it useful that one can control the compression rate vs CPU effort on the fly per "file". Also useful is storing millions of "files" in a hierarchically-structured single "disk" file. NASA - probably ripping off my style ;) - also uses it.

How to optimize input/output in C++

I'm solving a problem which requires very fast input/output. More precisely, the input data file will be up to 15 MB. Is there a fast way to read/print integer values?
Note: I don't know if it helps, but the input file has the following form:
line 1: a number n
line 2..n+1: three numbers a,b,c;
line n+2: a number r
line n+3..n+4+r: four numbers a,b,c,d
Note 2: The input file will be stdin.
Edit: Something like the following isn't fast enough:
void fast_scan(int &n) {
    char buffer[10];
    gets(buffer);
    n=atoi(buffer);
}

void fast_scan_three(int &a,int &b,int &c) {
    char buffval[3][20],buffer[60];
    gets(buffer);
    int n=strlen(buffer);
    int buffindex=0, curindex=0;
    for(int i=0; i<n; ++i) {
        if(!isdigit(buffer[i]) && !isspace(buffer[i]))break;
        if(isspace(buffer[i])) {
            buffindex++;
            curindex=0;
        } else {
            buffval[buffindex][curindex++]=buffer[i];
        }
    }
    a=atoi(buffval[0]);
    b=atoi(buffval[1]);
    c=atoi(buffval[2]);
}
The general input/output optimization principle is to perform as few I/O operations as possible, reading/writing as much data as possible per operation.
So a performance-aware solution typically looks like this:
Read all data from the device into some buffer (using the principle mentioned above)
Process the data, generating the results into some buffer (in place or in another one)
Output the results from the buffer to the device (using the principle mentioned above)
E.g. you could use std::basic_istream::read to input data in big chunks instead of doing it line by line. The same idea applies to output - generate a single string as the result, adding line feed symbols manually, and output it at once.
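A sketch of that approach, assuming the whitespace-separated integer input described in the question: read all of stdin in one go, parse with strtol, and accumulate the output in one string before writing it.

#include <cstdlib>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

int main()
{
    // Read everything from stdin into one buffer.
    std::string input((std::istreambuf_iterator<char>(std::cin)),
                      std::istreambuf_iterator<char>());

    // Parse all integers with strtol; c_str() guarantees NUL termination.
    std::vector<long> numbers;
    const char* p = input.c_str();
    char* end = nullptr;
    for (long v = std::strtol(p, &end, 10); p != end; v = std::strtol(p, &end, 10)) {
        numbers.push_back(v);
        p = end;
    }

    // Build the whole output in one string and write it once.
    std::string output;
    for (long v : numbers) {
        output += std::to_string(v);
        output += '\n';
    }
    std::cout << output;
}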
If you want to minimize the physical I/O operation overhead, load the whole file into memory using a technique called memory-mapped files. I doubt you'll get a noticeable performance gain though. Parsing will most likely be a lot costlier.
Consider using threads. Threading is useful for lots of things, but this is exactly the kind of problem that motivated the invention of threads.
The underlying idea is to separate the input, processing, and output, so these different operations can run in parallel. Do it right and you will see a significant speedup.
Have one thread doing close to pure input. It reads lines into a buffer of lines. Have a second thread do a quick pre-parse and organize the raw input into blocks. You have two things that need to be parsed: the line that contains the number of lines containing triples, and the line that contains the number of lines containing quads. This thread forms the raw input into blocks that are still mostly text. A third thread parses the triples and quads, re-forming the input into fully parsed structures. Since the data are now organized into independent blocks, you can have multiple instances of this third operation so as to better take advantage of the multiple processors on your computer. Finally, other threads will operate on these fully-parsed structures. Note: It might be better to combine some of these operations, for example combining the input and pre-parsing operations into one thread.
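A much-simplified two-stage version of that pipeline, as a sketch: one reader thread hands batches of raw lines to one parser thread. The batch size of 1024 is arbitrary, and the running sum is only a stand-in for real processing.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

int main()
{
    std::queue<std::vector<std::string>> batches;   // blocks of raw, unparsed lines
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Stage 1: close-to-pure input, reading lines and handing off blocks.
    std::thread reader([&] {
        std::vector<std::string> batch;
        std::string line;
        while (std::getline(std::cin, line)) {
            batch.push_back(line);
            if (batch.size() == 1024) {
                std::lock_guard<std::mutex> lk(m);
                batches.push(std::move(batch));
                batch.clear();
                cv.notify_one();
            }
        }
        std::lock_guard<std::mutex> lk(m);
        if (!batch.empty())
            batches.push(std::move(batch));
        done = true;
        cv.notify_one();
    });

    // Stage 2: parse each block; the sum stands in for the real work.
    long long sum = 0;
    while (true) {
        std::vector<std::string> batch;
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !batches.empty() || done; });
            if (batches.empty() && done)
                break;
            batch = std::move(batches.front());
            batches.pop();
        }
        for (const std::string& s : batch) {        // parsing happens outside the lock
            std::istringstream iss(s);
            int x;
            while (iss >> x)
                sum += x;
        }
    }
    reader.join();
    std::cout << sum << '\n';
}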
Put several input lines in a buffer, split them, and then parse them simultaneously in different threads.
It's only 15MB. I would just slurp the whole thing into a memory buffer, then parse it.
The parsing looks something like this, approximately:
#define DIGIT(c) ((c) >= '0' && (c) <= '9')

while (*p == ' ') p++;
if (DIGIT(*p)) {
    a = 0;
    while (DIGIT(*p)) {
        a *= 10; a += (*p++ - '0');
    }
}
// and so on...
You should be able to write this kind of code in your sleep.
I don't know if that's any faster than atoi, but it's not fussing around figuring out where numbers begin and end. I would stay away from scanf because it goes through the whole nine yards of interpreting its format string.
If you run this whole thing in a loop 1000 times and grab some stack samples, you should see that it is spending nearly 100% of its time reading in the file and generating output (which you didn't mention).
You just can't beat that.
If you do see that noticeable time is spent in the actual parsing, it might be possible to do overlapped I/O, but the machine would have to be really slow, or the I/O really fast (like from a solid state drive) before that would make sense.

Binary file write problem in C++

This is my function which creates a binary file
void writefile()
{
    ofstream myfile ("data.abc", ios::out | ios::binary);
    streamoff offset = 1;
    if (myfile.is_open())
    {
        char c = 'A';
        myfile.write(&c, offset);
        c = 'B';
        myfile.write(&c, offset);
        c = 'C';
        myfile.write(&c, offset);
        myfile.write(StartAddr, streamoff(16));
        myfile.close();
    }
    else
        cout << "Some error" << endl;
}
The value of StartAddr is 1000, hence the expected output file is:
A B C 1000 NUL NUL NUL
However, strangely my output file appends this: data.abc
So the final outcome is: A B C 1000 NUL NUL NUL data.abc
Please help me out with this. How do I deal with it? Why this strange behavior?
I recommend you quit with binary writing and work on writing the data in a textual format. You've already encountered some of the problems with writing data. There are still issues for you to come across about reading the data and portability. Expect more pain if you continue this route.
Use textual representations. For simplicity you can put one field per line and use std::getline to read it in. The textual representation allows you to view the data in any text editor, easily. Try using Notepad to view a binary file!
Oh, but binary data is soo much faster and takes up less space in the file. You've already wasted more time and money than you would ever gain back by using binary data. The speed of computers and huge memory capacities (disk and RAM) make binary representations a thing of the past (except in extreme cases).
As a learning tool, go ahead and use binary. For ease of development and quick schedules (IOW, finishing early), use textual representations.
Search Stack Overflow for "C++ micro optimization" for the justifications.
There are several issues with this code.
For starters, if you want to write individual characters to a stream, you don't need to use ostream::write. Instead, just use ostream::put, as shown here:
myfile.put('A');
Second, if you want to write out a string into a file stream, just use the stream insertion operator:
myfile << StartAddr;
This is perfectly safe, even in binary mode.
As for the particular problem you're reporting, I think that the issue is that you're trying to write out a string of length four (StartAddr), but you've told the stream to write out sixteen bytes. This means that you're writing out the four bytes for the string contents, then the null terminator, and then nine bytes of whatever happens to be in memory after the buffer. In your case, this is two more null bytes, then the meaningless text that you saw after that. To fix this, either change your code to write fewer bytes or, if StartAddr is a string, then just write it using <<.
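Putting that advice together, the original function might be rewritten roughly like this. The declaration of StartAddr is assumed (it was not shown in the question), and this version drops the fixed 16-byte field, so the trailing NUL padding from the expected output disappears:

#include <fstream>
#include <iostream>

const char* StartAddr = "1000";   // assumed; the original declaration was not shown

void writefile()
{
    std::ofstream myfile("data.abc", std::ios::out | std::ios::binary);
    if (!myfile.is_open()) {
        std::cout << "Some error" << std::endl;
        return;
    }
    myfile.put('A');
    myfile.put('B');
    myfile.put('C');
    myfile << StartAddr;          // writes exactly the characters of the string, no more
}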
With the line myfile.write(StartAddr,streamoff (16) ); you are instructing the myfile object to write 16 bytes to the stream starting at the address StartAddr. Imagine that StartAddr is an array of 16 bytes:
char StartAddr[16] = "1000\0\0\0data.b32\0";
myfile.write(StartAddr, sizeof(StartAddr));
Would generate the output that you see. Without seeing the declaration / definition of StartAddr I cannot say for certain, but it appears you are writing out a five byte nul terminated string "1000" followed by whatever happens to reside in the next 11 bytes after StartAddr. In this case, it appears a couple of nul bytes followed by the constant nul terminated string "data.b32" (which the compiler must put somewhere in memory) are what follow StartAddr.
Regardless, it is clear that you overread a buffer.
If you are trying to write a 16 bit integer type to a stream you have a couple of options, both based on the fact that there are typically 8 bits in a byte. The 'cleanest' one would be something like:
char x = static_cast<char>(StartAddr & 0xFF);
myfile.put(x);
x = static_cast<char>((StartAddr >> 8) & 0xFF);
myfile.put(x);
This assumes StartAddr is a 16 bit integer type and does not take into account any translation that might occur (such as potential conversion of a value of 10 [a linefeed] into a carriage return / linefeed sequence).
Alternatively, you could write something like:
myfile.write(reinterpret_cast<char*>(&StartAddr), sizeof(StartAddr));

How to perform fast formatted input from a stream in C++?

The situation is: there is a file with 14,294,508 unsigned integers and 13,994,397 floating-point numbers (which need to be read as doubles). The total file size is ~250 MB.
Using std::istream takes ~30 s. Reading the data from the file into memory (just copying bytes, without formatted input) is much faster. Is there any way to improve reading speed without changing the file format?
Do you need to use STL-style I/O? You must check out this excellent piece of work from one of the experts - a specialized iostream by Dietmar Kuhl.
I hate to suggest this, but take a look at the C formatted I/O routines. Also, are you reading in the whole file in one go?
You might also want to look at Matthew Wilson's FastFormat library:
http://www.fastformat.org/
I haven't used it, but he makes some pretty impressive claims and I've found a lot of his other work to be worth studying and using (and stealing on occasion).
You haven't specified the format. It's possible that you could memory map it, or could read in very large chunks and process in a batch algorithm.
Also, you haven't said whether you know for sure that the file and the process that will read it will be on the same platform. If a big-endian process writes it and a little-endian process reads it, or vice versa, it won't work.
Parsing the input yourself (atoi & atof) usually boosts speed at least twofold, compared to "universal" read methods.
Something quick and dirty is to just dump the file into a standard C++ string, and then use a stringstream on it:
#include <sstream>
// Load file into string file_string
std::stringstream s( file_string );
int x; float y;
s >> x >> y;
This may not give you much of a performance improvement (you will get a larger speed-up by avoiding iostreams), but it's very easy to try, and it may well be fast enough.
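The "load file into string" step left as a comment above could be done like this, for instance (any of the usual whole-file-read idioms works; this one uses rdbuf()):

#include <fstream>
#include <sstream>
#include <string>

// Read the entire file into a std::string in one go.
std::string load_file(const std::string& path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    std::ostringstream contents;
    contents << in.rdbuf();         // streams the whole file buffer at once
    return contents.str();
}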