Reading a text file with floats fast - c++

I have a text file with ~4 million floats, i.e. about 30 MB, and I want to read them into a vector<float>.
The code I have is very bare-bones, and it gets the job done:
std::fstream is("data.txt", std::ios_base::in);
float number;
while (is >> number)
{
//printf("%f ", number);
number_vec.push_back(number);
}
The problem is that it takes 20-30 s on a modern desktop workstation. At first I assumed I had done something stupid, but the more I stared at the code, the more I started accepting that maybe it is just the time it takes to parse all those ASCII float values into floats.
However, then I remembered that Matlab can read and parse the same file almost instantly (disk speed seems to be the limit), so it is obvious that my code is just very inefficient.
The only thing I could think of was to reserve the required number of elements in the vector in advance, but it didn't improve the situation at all.
Can someone help me understand why, and maybe help me write a faster solution?
EDIT: The text file looks like this:
152.00256 45.8569 5.87214 0.225 -0.0005 .....
i.e. one row, space-delimited.

Please consider taking a look at the possible duplicates shared by @gsamaras and @Brad Allred. Anyway, I will try to give a simple answer that aims to keep the code simple and friendly, under the following two premises:
You have a constraint regarding the file and will change neither the file format nor the way the floats are represented textually in it.
You want to keep using the STL and are not looking for a library specialized/optimized for the challenge you are facing.
With those constraints and that mindset, my main suggestion would be to preallocate your containers, both the float vector and the internal iostream buffer:
Increase the performance of insertion into number_vec by reserving the required size in the std::vector. This can be achieved by a call to reserve, as explained in this Stack Overflow post.
Increase the performance of the iostream by setting the size of the buffer it uses internally. This can be achieved by a call to pubsetbuf, as explained in this other Stack Overflow post.

Related

Optimizing memory for writes

I am working on a sorting algorithm that iterates over a bunch of integers and puts them into buckets.
The exact type of the buckets is a custom data structure similar to std::vector. As you can imagine, there is a snippet similar to this one, for the case that there is already enough memory allocated in the bucket to write the element I'm adding:
*_end = _new_value; // LINE 1
++_end; // LINE 2
I discovered in the VTune profiler that LINE 1 accounts for about 1/3 of the runtime of my algorithm. I was curious whether I could do better, so I started trying some things.
My workstation is Linux and I usually compile with gcc. Our software has to support other compilers and systems, too, but Linux-only optimizations are considered OK since we "suggest" users use Linux.
First I simply added a look-ahead to the loop from which the above snippet is called. Looking ahead buffer_size iterations, it got the result from:
int * Bucket::get_end() {
__builtin_prefetch(_end, 1); // Line 3
return _end++; // Line 4
}
And it stored these results in a buffer similar to the following:
using delayed_write = std::pair<int, int*>; // Line 5
std::array<delayed_write, buffer_size> buffer; // Line 6
I'd run the equivalent of:
*(buffer[i + buffer_size].second) = buffer[i + buffer_size].first;
This eliminated the bottleneck I saw at line 2 in VTune, but the algorithm was slower overall. (I tried 4 and 8 as buffer_size.)
I tried a few other things. In particular, I did some pretty complex stuff where I batched 4 or 8 integers at a time and did each step on all of them at once. I wrote code to look ahead to see whether reallocation would be necessary; if not, I cleverly wrote some loops that avoided any data dependencies across steps of the loop. Of course, all this complexity predictably made the algorithm much slower. :)
It's possible it simply can't be made faster, but I feel intuitively that there should be some way to exploit the fact that there is no data dependency on that write until after the loop is over, so that there's no need to wait for the likely cache miss there to be resolved.
My understanding is that a cache miss is very high-latency, but I sort of wonder why the CPU can't keep going and leave the writes in a buffer to handle asynchronously.
It'd be really cool if there were e.g. a way to promise that I'm not going to read that memory until I call some synchronization function to commit all of the writes so far.
Do you think in fact that I'm filling up the write buffer? (In which case there is no solution?)
If not, does anyone know of any ways to exploit the fact that the write will not be read until after the hot loop?

Why is reading big text file in parallel bad?

I have a big txt file with ~30 million rows, each row separated by a line separator \n, and I'd like to read all the lines into an unordered container (e.g. std::list<std::string>).
std::list<std::string> list;
std::ifstream file(path);
std::string tmp;
while (std::getline(file, tmp))
{
list.emplace_back(tmp);
}
process_data(list);
The current implementation is very slow, so I'm learning how to read data by chunk.
But after seeing this comment:
parallelizing on a HDD will make things worse, with the impact depending on the distribution of the files on the HDD. On a SSD it might (!) improve things.
Is it bad to read a file in parallel? And what's the fastest way to read all lines of a file into an unordered container (e.g. std::list, a plain array, ...), without using any libraries and with cross-platform code?
Is it bad to read a file in parallel? What's the algorithm to read all lines of a file to an unordered container (e.g. std::list, normal array, ...) as fast as possible, without using any libraries, and the code must be cross-platform?
I guess I'll attempt to answer this one to avoid spamming the comments. I have, in multiple scenarios, sped up text file parsing substantially using multithreading. However, the keyword here is parsing, not disk I/O (though just about any text file read involves some level of parsing). Now first things first:
VTune here was telling me that my top hotspots were in parsing (sorry, the screenshot was taken years ago and I didn't expand the call graph to show what inside obj_load was taking most of the time, but it was sscanf). This profiling session actually surprised me quite a bit. I've been profiling for decades, to the point where my hunches aren't too inaccurate (not accurate enough to avoid profiling, mind you, not even close, but I've tuned my intuitive spider senses enough that profiling sessions usually don't surprise me much in the absence of glaring algorithmic inefficiencies, though I might still be off about exactly why the hotspots exist, since I'm not so good at assembly).
Yet this time I was really taken aback and shocked, so this example has always been the one I use to show even the most skeptical colleagues who don't want to use profilers why profiling is so important. Some of them were actually good at guessing where hotspots exist, and some were creating very competent-performing solutions despite never having used a profiler, but none of them were good at guessing what isn't a hotspot, and none of them could draw a call graph based on their hunches. So I always liked to use this example to try to convert the skeptics and get them to spend a day just trying out VTune (we had a boatload of free licenses from Intel, who worked with us, which were largely going to waste on our team -- a tragedy, since VTune is a really expensive piece of software).
And the reason I was taken aback this time was not that I was surprised by the sscanf hotspot. It's kind of a no-brainer that non-trivial parsing of epic text files is generally going to be bottlenecked by string parsing. I could have guessed that. My colleagues who never touched a profiler could have guessed that. What I couldn't have guessed was how much of a bottleneck it was. I thought that, given that I was loading millions of polygons and vertices, texture coordinates, normals, creating edges and finding adjacency data, using index FOR compression, associating materials from the MTL file with the polygons, reverse engineering object normals stored in the OBJ file and consolidating them to form edge creasing, etc., I would at least have a good chunk of the time distributed in the mesh system as well (I would have guessed 25-33% of the time spent in the mesh engine).
It turned out the mesh system took barely any time, to my most pleasant surprise, and there my hunches were completely off. It was, by far, the parsing that was the uber bottleneck (not disk I/O, not the mesh engine).
So that's when I applied this optimization to multithread the parsing, and there it helped a lot. I even initially started off with a very modest multithreaded implementation which barely did any parsing except scanning the character buffers for line endings in each thread just to end up parsing in the loading thread, and that already helped by a decent amount (reduced the operation from 16 seconds to about 14 IIRC, and I eventually got it down to ~8 seconds and that was on an i3 with just two cores and hyperthreading). So anyway, yeah, you can probably make things faster with multithreaded parsing of character buffers you read in from text files in a single thread. I wouldn't use threads as a way to make disk I/O any faster.
I'm reading the characters from the file in binary into big char buffers in a single thread, then, using a parallel loop, have the threads figure out integer ranges for the lines in that buffer.
// Stores all the characters read in from the file in big chunks.
// This is shared for read-only access across threads.
vector<char> buffer;
// Local to a thread:
// Stores the starting position of each line.
vector<size_t> line_start;
// Stores the assigned buffer range for the thread:
size_t buffer_start, buffer_end;
Basically like so:
LINE1 and LINE2 are considered to belong to THREAD 1, while LINE3 is considered to belong to THREAD 2. LINE6 is not considered to belong to any thread since it doesn't have an EOL. Instead the characters of LINE6 will be combined with the next chunky buffer read from the file.
Each thread begins by looking at the first character in its assigned character buffer range. Then it works backwards until it finds an EOL or reaches the beginning of the buffer. After that it works forward and parses each line, looking for EOLs and doing whatever else we want, until it reaches the end of its assigned character buffer range. The last "incomplete line" is not processed by the thread, but instead the next thread (or if the thread is the last thread, then it is processed on the next big chunky buffer read by the first thread). The diagram is teeny (couldn't fit much) but I read in the character buffers from the file in the loading thread in big chunks (megabytes) before the threads parse them in parallel loops, and each thread might then parse thousands of lines from its designated buffer range.
std::list<std::string> list;
std::ifstream file(path);
std::string tmp;
while (std::getline(file, tmp))
{
list.emplace_back(tmp);
}
process_data(list);
Kind of echoing Veedrac's comments: storing your lines in std::list<std::string> is not a good idea if you really want to load an epic number of lines quickly. That would actually be a bigger priority to address than multithreading. I'd turn that into just a std::vector<char> all_lines storing all the strings, plus a std::vector<size_t> line_start storing the starting position of the nth line, which you can retrieve like so:
// note that 'line' will be EOL-terminated rather than null-terminated
// if it points to the original buffer.
const char* line = all_lines.data() + line_start[n];
The immediate problem with std::list without a custom allocator is a heap allocation per node. On top of that, we're wasting memory storing two extra pointers per line. std::string is problematic here because its small-buffer optimization would either make it take too much memory for small strings (and thereby increase cache misses) or still end up invoking a heap allocation for every non-small string. You avoid all of these problems by storing everything in one giant char buffer, like a std::vector<char>.
I/O streams, including stringstreams and functions like getline, are also horrible for performance, just awful, in ways that really disappointed me at first: my first OBJ loader used them, and it was over 20 times slower than the second version, where I ported all those I/O stream operators and functions and uses of std::string to C functions and my own hand-rolled code operating on char buffers. When it comes to parsing in performance-critical contexts, C functions like sscanf and memchr and plain old character buffers tend to be much faster than the C++ ways of doing it, but you can at least still use std::vector<char> to store the huge buffers, e.g., to avoid dealing with malloc/free and to get some debug-build sanity checks when accessing the characters stored inside.

Improving/optimizing file write speed in C++

I've been running into some issues with writing to a file - namely, not being able to write fast enough.
To explain, my goal is to capture a stream of data coming in over gigabit Ethernet and simply save it to a file.
The raw data is coming in at a rate of 10MS/s, and it's then saved to a buffer and subsequently written to a file.
Below is the relevant section of code:
std::string path = "Stream/raw.dat";
ofstream outFile(path, ios::out | ios::app| ios::binary);
if(outFile.is_open())
cout << "Yes" << endl;
while(1)
{
rxSamples = rxStream->recv(&rxBuffer[0], rxBuffer.size(), metaData);
switch(metaData.error_code)
{
//Irrelevant error checking...
//Write data to a file
std::copy(begin(rxBuffer), end(rxBuffer), std::ostream_iterator<complex<float>>(outFile));
}
}
The issue I'm encountering is that it's taking too long to write the samples to a file. After a second or so, the device sending the samples reports its buffer has overflowed. After some quick profiling of the code, nearly all of the execution time is spent on std::copy(...) (99.96% of the time to be exact). If I remove this line, I can run the program for hours without encountering any overflow.
That said, I'm rather stumped as to how I can improve the write speed. I've looked through several posts on this site, and it seems like the most common suggestion (in regard to speed) is to implement file writes as I've already done - through the use of std::copy.
If it's helpful, I'm running this program on Ubuntu x86_64. Any suggestions would be appreciated.
So the main problem here is that you try to write in the same thread that you receive in, which means that recv() can only be called again after the copy is complete. A few observations:
Move the writing to a different thread. This is about a USRP, so GNU Radio might really be the tool of your choice -- it's inherently multithreaded.
Your output iterator is probably not the most performant solution. Simply write()-ing to a file descriptor might be better, but that's a performance measurement that is up to you.
If your hard drive/file system/OS/CPU aren't up to the rates coming in from the USRP, then even after decoupling receiving from writing thread-wise, there's nothing you can do -- get a faster system.
Try writing to a RAM disk instead
In fact, I don't know how you came up with the std::copy approach. The rx_samples_to_file example that comes with UHD does this with a simple write, and you should definitely favor that over copying; file I/O can, on good OSes, often be done with one copy fewer, and iterating over all the elements is probably very slow.
Let's do a bit of math.
Your samples are (apparently) of type std::complex<float>. Given a (typical) 32-bit float, that means each sample is 64 bits. At 10 MS/s, that means the raw data is around 80 megabytes per second -- within what you can expect to write to a desktop (7200 RPM) hard drive, but getting fairly close to the limit (which is typically around 100 megabytes per second or so).
Unfortunately, despite the std::ios::binary, you're actually writing the data in text format (because std::ostream_iterator basically does stream << data;).
This not only loses some precision, but increases the size of the data, at least as a rule. The exact amount of increase depends on the data--a small integer value can actually decrease the quantity of data, but for arbitrary input, a size increase close to 2:1 is fairly common. With a 2:1 increase, your outgoing data is now around 160 megabytes/second--which is faster than most hard drives can handle.
The obvious starting point for an improvement would be to write the data in binary format instead:
uint32_t nItems = std::end(rxBuffer)-std::begin(rxBuffer);
outFile.write((char *)&nItems, sizeof(nItems));
outFile.write((char *)&rxBuffer[0], sizeof(rxBuffer));
For the moment I've used sizeof(rxBuffer) on the assumption that it's a real array. If it's actually a pointer or vector, you'll have to compute the correct size (what you want is the total number of bytes to be written).
I'd also note that, as it stands right now, your code has an even more serious problem: since it doesn't specify a separator between elements when it writes the data, the data will be written with nothing to separate one item from the next. That means if you wrote two values of (for example) 1 and 0.2, what you'd read back would not be 1 and 0.2 but the single value 10.2. Adding separators to your text output would add yet more overhead (figure around 15% more data) to a process that's already failing because it generates too much data.
Writing in binary format means each float will consume precisely 4 bytes, so delimiters are not necessary to read the data back in correctly.
The next step after that would be to descend to a lower-level file I/O routine. Depending on the situation, this might or might not make much difference. On Windows, you can specify FILE_FLAG_NO_BUFFERING when you open a file with CreateFile. This means that reads and writes to that file will basically bypass the cache and go directly to the disk.
In your case, that's probably a win -- at 10 MS/s, you're going to use up the cache space quite a while before you reread the same data. In such a case, letting the data go into the cache gains you virtually nothing, but costs you time copying the data into the cache and then, somewhat later, copying it out to the disk. Worse, it's likely to pollute the cache with all this data, so the cache no longer holds other data that's a lot more likely to benefit from caching.

Efficient implementation of tail -n [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How would you implement tail efficiently?
A friend of mine was asked how he'd implement tail -n.
To be clear, we are required to print the last n lines of the file specified.
I thought of using an array of n strings and overwriting them in a cyclic manner.
But if we are given, say a 10 GB file, this approach doesn't scale at all.
Is there a better way to do this?
Memory map the file, iterate from the end looking for end of line n times, write from that point to the end of file to standard out.
You could potentially complicate the solution by not mapping the whole file, but just the last X KB (say, a couple of memory pages) and seeking there. If there aren't enough lines, memory map a larger region until you get what you want. You can use some heuristic to guess how much memory to map (say, 1 KB per line as a rough estimate). I would not really do this, though.
"It depends", no doubt. Given that the size of the file should be knowable, and given a sensible file-manipulation library that can seek to the end of a very large file without literally traversing each byte in turn or thrashing virtual memory, you could simply scan backwards from the end, counting newlines.
When you're dealing with files that big, though, what do you do about the degenerate case where n is close to the number of lines in the multi-gigabyte file? Storing stuff in temporary strings won't scale then, either.

C++ string memory management

Last week I wrote a few lines of code in C# to fire up a large text file (300,000 lines) into a Dictionary. It took ten minutes to write and it executed in less than a second.
Now I'm converting that piece of code into C++ (because I need it in an old C++ COM object). I've spent two days on it so far. :-( Although the productivity difference is shocking on its own, it's the performance that I need some advice on.
It takes seven seconds to load, and even worse, it takes exactly as long to free all the CStringWs afterwards. This is not acceptable, and I must find a way to increase the performance.
Is there any chance that I can allocate this many strings without seeing this horrible performance degradation?
My guess right now is that I'll have to stuff all the text into a large array and then let my hash table point to the beginning of each string within this array and drop the CStringW stuff.
But before that, any advice from you C++ experts out there?
EDIT: My answer to myself is given below. I realized that that is the fastest route for me, and also step in what I consider the right direction - towards more managed code.
This sounds very much like the Raymond Chen vs. Rico Mariani C++ vs. C# Chinese/English dictionary performance bake-off. It took Raymond several iterations to beat C#.
Perhaps there are ideas there that would help.
http://blogs.msdn.com/ricom/archive/2005/05/10/performance-quiz-6-chinese-english-dictionary-reader.aspx
You are stepping into the shoes of Raymond Chen. He did the exact same thing, writing a Chinese dictionary in unmanaged C++. Rico Mariani did too, writing it in C#. Mr. Mariani made one version. Mr. Chen wrote 6 versions, trying to match the perf of Mariani's version. He pretty much rewrote significant chunks of the C/C++ runtime library to get there.
Managed code got a lot more respect after that. The GC allocator is impossible to beat. Check this blog post for the links. This blog post might interest you too, instructive to see how the STL value semantics are part of the problem.
Yikes, get rid of the CStrings...
Try a profiler as well.
Are you sure you weren't just running debug code?
Use std::string instead.
EDIT:
I just did a simple comparative test of ctor and dtor costs.
CStringW seems to take between 2 and 3 times as long to do a new/delete.
I iterated 1,000,000 times doing a new/delete for each type, nothing else, with a GetTickCount() call before and after each loop. CStringW consistently took twice as long.
That doesn't address your entire issue, though, I suspect.
EDIT:
I also don't think that using string or CStringW is the real problem - there is something else going on that is causing your issue.
(But for god's sake, use the STL anyway!)
You need to profile it. That is a disaster.
If it is a read-only dictionary then the following should work for you.
Use fseek/ftell to find the size of the text file.
Allocate a chunk of memory of that size + 1 to hold it.
fread the entire text file into your memory chunk.
Iterate through the chunk:
push_back into a vector<const char *> the starting address of each line.
search for the line terminator using strchr.
when you find it, deposit a NUL, which turns the line into a string.
the next character is the start of the next line,
until you do not find a line terminator.
Insert a final NUL character.
You can now use the vector to get the pointer that will let you access the corresponding value.
When you are finished with your dictionary, deallocate the memory and let the vector die when it goes out of scope.
[EDIT]
This can be a little more complicated on the DOS platform, as the line terminator is CRLF.
In that case, use strstr to find it, and increment by 2 to find the start of the next line.
What sort of container are you storing your strings in? If it's a std::vector of CStringW and you haven't reserve-d enough memory beforehand, you're bound to take a hit. A vector typically resizes once it reaches its limit (which is not very high) and then copies the entirety out to the new memory location, which can give you a big hit. As your vector grows exponentially (i.e., if the initial size is 1, it allocates 2 the next time, then 4, and so on), the hits become less and less frequent.
It also helps to know how long the individual strings are. (At times :)
Thanks all of you for your insightful comments. Upvotes for you! :-)
I must admit I wasn't prepared for this at all - that C# would beat the living crap out of good old C++ in this way. Please don't read that as an offence to C++, but instead as a testament to what an amazingly good memory manager sits inside the .NET Framework.
I decided to take a step back and fight this battle in the InterOp arena instead! That is, I'll keep my C# code and let my old C++ code talk to the C# code over a COM interface.
A lot of questions were asked about my code and I'll try to answer some of them:
The compiler was Visual Studio 2008 and no, I wasn't running a debug build.
The file was read with a UTF-8 file reader which I downloaded from a Microsoft employee who published it on their site. It returned CStringWs, and about 30% of the time was actually spent just reading the file.
The container I stored the strings in was just a fixed size vector of pointers to CStringW's and it was never resized.
EDIT: I'm convinced that the suggestions I was given would indeed work, and that I could probably beat the C# code if I invested enough time in it. On the other hand, doing so would provide no customer value at all, and the only reason to pull through with it would be to prove that it could be done...
The problem is not in the CString, but rather that you are allocating a lot of small objects - the default memory allocator isn't optimized for this.
Write your own allocator: allocate a big chunk of memory and then just advance a pointer in it when allocating. This is actually what the .NET allocator does. When you are done, delete the whole buffer.
I think there was a sample of writing custom new/delete operators in (More) Effective C++.
Load the string to a single buffer, parse the text to replace line breaks with string terminators ('\0'), and use pointers into that buffer to add to the set.
Alternatively - e.g. if you have to do an ANSI/UNICODE conversion during load - use a chunk allocator that sacrifices the ability to delete individual elements.
class ChunkAlloc
{
std::vector<BYTE> m_data;
size_t m_fill;
public:
ChunkAlloc(size_t chunkSize) : m_data(chunkSize), m_fill(0) {}
void * Alloc(size_t size)
{
if (m_data.size() - m_fill < size)
{
// normally, you'd reserve a new chunk here
return 0;
}
void * result = &(m_data[m_fill]);
m_fill += size;
return result;
}
};
// all allocations from the chunk are freed when the chunk is destroyed.
I wouldn't hack that together in ten minutes, but 30 minutes plus some testing sounds fine :)
When working with string classes, you should always keep an eye out for unnecessary operations; for example, don't use constructors, concatenation and similar operations too often, and especially avoid them in loops. I suppose there's some character-encoding reason you use CStringW, so you probably can't use something different; otherwise, reducing those operations would be another way to optimize your code.
It's no wonder that the CLR's memory management is better than the bunch of old and dirty tricks MFC is based on: it is at least two times younger than MFC itself, and it is pool-based. When I had to work on a similar project with string arrays and WinAPI/MFC, I just used std::basic_string instantiated with WinAPI's TCHAR and my own allocator based on Loki::SmallObjAllocator. You can also take a look at boost::pool in this case (if you want it to have an "std feel", or if you have to use a version of the VC++ compiler older than 7.1).