Read a big file by lines in C++

I have a big file, nearly 800 MB, and I want to read it line by line.
At first I wrote my program in Python, using linecache.getlines:
lines = linecache.getlines(fname)
It takes about 1.2 s.
Now I want to port my program to C++.
I wrote this code:
std::ifstream DATA(fname);
std::string line;
vector<string> lines;
while (std::getline(DATA, line)) {
    lines.push_back(line);
}
But it's slow (it takes minutes). How can I improve it?
Joachim Pileborg mentioned mmap(), and on Windows CreateFileMapping() will work.
My code runs under VS2013: in Debug mode it takes 162 seconds;
in Release mode, only 7 seconds!
(Great thanks to @DietmarKühl and @Andrew)

First of all, you should probably make sure you are compiling with optimizations enabled. This might not matter for such a simple algorithm, but that really depends on your vector/string library implementations.
As suggested by @angew, std::ios_base::sync_with_stdio(false) makes a big difference on routines like the one you have written.
Another, lesser, optimization would be to use lines.reserve() to preallocate your vector so that push_back() doesn't trigger repeated reallocations and copies. However, this is most useful if you happen to know in advance approximately how many lines you are likely to receive.
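A minimal sketch with both suggestions applied; the file name and the reserve() figure (roughly 8 million lines for an 800 MB file of ~100-character lines) are assumptions, not measurements:

#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::ios_base::sync_with_stdio(false);   // decouple C++ standard streams from C stdio
    std::ifstream data("fname.txt");         // hypothetical file name
    std::vector<std::string> lines;
    lines.reserve(8000000);                  // rough guess at the number of lines
    std::string line;
    while (std::getline(data, line))
        lines.push_back(std::move(line));    // move to avoid copying each line
    return 0;
}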
Using the optimizations suggested above, I get the following results for reading an 800MB text stream:
20 seconds ## if average line length = 10 characters
3 seconds ## if average line length = 100 characters
1 second ## if average line length = 1000 characters
As you can see, the speed is dominated by per-line overhead. This overhead is primarily occurring inside the std::string class.
It is likely that any approach based on storing a large quantity of std::string will be suboptimal in terms of memory allocation overhead. On a 64-bit system, std::string will require a minimum of 16 bytes of overhead per string. In fact, it is very possible that the overhead will be significantly greater than that -- and you could find that memory allocation (inside of std::string) becomes a significant bottleneck.
For optimal memory use and performance, consider writing your own routine that reads the file in large blocks rather than using getline(). Then you could apply something similar to the flyweight pattern to manage the indexing of the individual lines using a custom string class.
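As a rough sketch of that idea, assuming the whole file fits in memory: slurp the file into one buffer and record (offset, length) pairs for the lines, instead of materialising a separate std::string per line. The file name is a placeholder.

#include <fstream>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

int main()
{
    std::ifstream in("fname.txt", std::ios::binary);        // hypothetical file name
    std::string contents((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());

    // Each "line" is just an (offset, length) pair into `contents`.
    std::vector<std::pair<std::size_t, std::size_t> > lines;
    std::size_t start = 0;
    for (std::size_t i = 0; i < contents.size(); ++i) {
        if (contents[i] == '\n') {
            lines.push_back(std::make_pair(start, i - start));
            start = i + 1;
        }
    }
    if (start < contents.size())                            // final line without a trailing '\n'
        lines.push_back(std::make_pair(start, contents.size() - start));
    return 0;
}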
P.S. Another relevant factor will be the physical disk I/O, which might or might not be bypassed by caching.

For C++ you could try something like this:
#include <fstream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
using namespace std;

void do_some_operation(const vector<string>& arr);   // defined elsewhere

void processData(const string& str)
{
    vector<string> arr;
    boost::split(arr, str, boost::is_any_of(" \n"));
    do_some_operation(arr);
}

int main()
{
    const streamsize read_bytes = 45 * 1024 * 1024;
    const char* fname = "input.txt";
    ifstream fin(fname, ios::in);
    vector<char> memblock(read_bytes);
    while (fin.read(memblock.data(), read_bytes) || fin.gcount() > 0)
    {
        // gcount() reports how many bytes were actually read, so the final
        // partial chunk gets the right length instead of relying on a
        // terminating NUL that read() never writes.
        string str(memblock.data(), static_cast<size_t>(fin.gcount()));
        processData(str);
    }
    // Note: a token that straddles two chunks is split in half; handling
    // that boundary is left to the caller.
    return 0;
}

Related

Memory and time issue when reading/writing from a file

I am trying to solve a school problem, and I have a working solution, but it should run faster and use less memory if possible. Can you please help me achieve that?
Problem statement: Read a natural number N and a string from a file, and output the same string N times to another file.
Example of input file:
3
dog
Example of output file:
dog
dog
dog
Restrictions:
1 ≤ n ≤ 50, and the line to be read is at most 1,000,000 characters long
Time limit: 0.27 seconds
This is what I tried (but run time exceeds the limit):
#include <fstream>
using namespace std;
ifstream cin("afisaren.in");
ofstream cout("afisaren.out");
short n;
char s[1000005];
int main() {
    cin >> n;
    cin >> s;
    while (n) {
        cout << s << '\n';
        n--;
    }
    cin.close();
    cout.close();
    return 0;
}
Generally, when given this type of problem, you should profile your own code to see which parts consume the most time. This can usually be done by adding a few calls to a timekeeping function before and after a piece of code, to see how long it executed. However, that is not so easy with your code, since one of the biggest problems (optimisation-wise) is your char s[1000005]; line. The memory for it is allocated before your main() function starts executing, which is operating-system dependent (or rather depends on the libc and compiler used).
So first, do not use pre-allocated char arrays. You're using C++! Why not simply read the text into a std::(w)string or any other C++ class that does dynamic memory allocation (and will not crash your program if the line length does exceed 1,000,000)?
And second, the C++ std::streams usually perform a flush-to-disk every time a line-ending character is written. This is highly inefficient unless your text is exactly the same size as the block size of the underlying file system. To optimize this, create a memory object (i.e. a std::string) and copy your text into it k times, where k = fs-block-size / text-length. The fs block size will most likely be 1024, 2048 or 4096 bytes. There are system calls to find it out, but performance will usually not be affected too much when writing twice (or 4x) the fs block size, so you can safely assume it to be 4096 for close-to-maximum performance.
Since the maximum number of repetitions is 1 ≤ n ≤ 50 and the line length is at most 1,000,000 characters (approx. 1 MiB if ASCII), the maximum size of the output file will be 50,000,000 characters. You could also build everything in memory and then write it out in one call to write(). This would probably be the most efficient approach in terms of disk activity, but obviously not in terms of memory consumption. A sketch of that approach follows.
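Here is a minimal sketch of the build-in-memory idea under the stated limits; the file names match the question's code, everything else is illustrative:

#include <fstream>
#include <string>

int main()
{
    std::ifstream in("afisaren.in");
    std::ofstream out("afisaren.out");

    int n;
    std::string line;
    in >> n >> line;

    // Build the whole output (at most ~50 MB) in memory, then write it once.
    std::string output;
    output.reserve((line.size() + 1) * n);
    for (int i = 0; i < n; ++i) {
        output += line;
        output += '\n';
    }
    out.write(output.data(), static_cast<std::streamsize>(output.size()));
    return 0;
}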
I'm not a C++ expert, but I had a similar problem when I used C++-style file streams. After googling a bit, I tried switching to C-style file I/O and it boosted my performance a lot, because the C++ file streams copy the file contents into an internal buffer, and that takes time. You can try the C-style approach, but mixing C-style I/O into C++ code is usually not recommended.

Why does this C++ vector use so much RAM?

I have two files. One file is 15 gigabytes. The other is 684 megabytes. Both of these files have identical structures: they consist of many strings, one per line (which is to say, each string is separated by a \n).
While bored one day, and being the curious novice that I am, I decided to write a little C++ program to read these files into RAM. I compiled the program with G++ 8.1.1 on Fedora 28, and I found that when I read the small file into RAM, it consumes 2154 megabytes of RAM, and when I read the large file, it consumes 70.2 gigabytes of RAM. That's 3.15 times and 4.68 times the size of the original files, respectively.
Why is this the case?
This is the source code for this simple program. I'm using a std::vector to store each line as an std::string. I get the feeling that this question may actually boil down to, how does C++ handle strings? Is there an alternative datatype I should consider using?
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <cstdlib>

int main()
{
    std::ifstream inFile;
    std::vector<std::string> inStrings;
    std::string line;
    inStrings.reserve(1212356398);
    inFile.open("bigfile.txt");
    if (!inFile)
    {
        std::cerr << "Unable to open the hardcoded file" << std::endl;
        exit(1);
    }
    while (getline(inFile, line))
    {
        inStrings.push_back(line);
    }
    std::cout << "done reading" << std::endl;
    std::cin.get();
    return 0;
}
If you have tried implementing a dynamic array in school or as an exercise, recall allocation strategies like doubling the capacity each time it fills up; similarly, the vector prepares to store more than it actually stores.
Meanwhile, a string by itself stores a length, a capacity, and a reference counter, which makes 3 words at minimum even for an empty string.
Edit
Yeah, I guess the bit about the reference counter wasn't correct. I was remembering it was 3 words, and for some reason thought it wasn't counting the pointer to the actual allocated memory. But I guess maybe it is just that: the pointer to the actual string.
In any case the actual story differs due to optimizations across compilers. Search "std::string memory allocation" or something like that to read more.
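If you want to see what your own implementation does, a quick check like the following prints the per-object footprint of std::string and the capacity jumps of a growing vector (the exact numbers vary by standard library):

#include <iostream>
#include <string>
#include <vector>

int main()
{
    // Typically 32 bytes on 64-bit libstdc++/libc++, before any heap allocation.
    std::cout << "sizeof(std::string) = " << sizeof(std::string) << '\n';
    std::string s;
    std::cout << "empty string capacity = " << s.capacity() << '\n';   // SSO buffer size

    // Watch how the vector over-allocates as it grows.
    std::vector<int> v;
    std::size_t last = 0;
    for (int i = 0; i < 1000; ++i) {
        v.push_back(i);
        if (v.capacity() != last) {
            std::cout << "size " << v.size() << " -> capacity " << v.capacity() << '\n';
            last = v.capacity();
        }
    }
    return 0;
}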

How to read 4GB file on 32bit system

In my case I have several files; let's assume I have a >4 GB file with data. I want to read that file line by line and process each line. One of my restrictions is that the software has to run on 32-bit MS Windows, or on 64-bit with a small amount of RAM (min 4 GB). You can also assume that processing these lines isn't the bottleneck.
My current solution reads the file with an ifstream and copies each line into a string. Here is a snippet of how it looks:
std::ifstream file(filename_xml.c_str());
uintmax_t m_numLines = 0;
std::string str;
while (std::getline(file, str))
{
    m_numLines++;
}
And OK, that works, but too slowly. Here is the time for my 3.6 GB of data:
real 1m4.155s
user 0m0.000s
sys 0m0.030s
I'm looking for a method that will be much faster than that. For example, I found How to parse space-separated floats in C++ quickly? and I loved the solution presented there with boost::mapped_file, but I ran into another problem: what if my file is too big? In my case a file 1 GB large was enough to bring the whole process down. I have to care about how much data is currently in memory; the people who will be using this tool probably don't have more than 4 GB of RAM installed.
So I found boost's mapped_file, but how do I use it in my case? Is it possible to map that file partially and still receive the lines?
Maybe you have a much better solution. I just have to process each line.
Thanks,
Bart
Nice to see you found my benchmark at How to parse space-separated floats in C++ quickly?
It seems you're really looking for the fastest way to count lines (or perform any linear single-pass analysis). I've done a similar analysis and benchmark of exactly that here:
Fast textfile reading in c++
Interestingly, you'll see that the most performant code there does not need to rely on memory mapping at all.
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// handle_error() is assumed to print a message and abort; it is not shown here.

static uintmax_t wc(char const *fname)
{
    static const auto BUFFER_SIZE = 16*1024;
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    /* Advise the kernel of our access pattern. */
    posix_fadvise(fd, 0, 0, 1);  // FDADVICE_SEQUENTIAL

    char buf[BUFFER_SIZE + 1];
    uintmax_t lines = 0;

    while (size_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if (bytes_read == (size_t)-1)
            handle_error("read failed");
        if (!bytes_read)
            break;

        for (char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    return lines;
}
On a 64-bit system with little memory, mapping a large file should be fine, since it's all about address space, although it may well be slower than the "fastest" option in that case; it really depends on what else is in memory and how much of the address space is available for mapping the file into. On a 32-bit system it won't work, since the pointers into the file mapping won't go beyond about 3.5 GB at the very most, and typically around 2 GB is the maximum, again depending on what memory addresses are available to the OS to map the file into.
However, the benefit of memory-mapping a file is pretty small: the huge majority of the time is spent actually reading the data. The saving from memory mapping comes from not having to copy the data once it's loaded into RAM (with other file-reading mechanisms, the read function copies the data into the buffer you supply, whereas memory-mapping a file puts it straight into the correct location directly).
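If you do want the mapped-file route on a memory-constrained machine, one option is to map the file one fixed-size window at a time with boost::iostreams, so only that window occupies address space. This is a hedged sketch, not Boost's documented recipe: the file name is made up, offsets stay aligned by using a power-of-two window, and handling a line that straddles two windows is deliberately left out.

#include <boost/iostreams/device/mapped_file.hpp>
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    namespace io = boost::iostreams;
    const std::string fname = "huge_input.txt";            // hypothetical file name

    // File size via a plain stream, so we know where the last window ends.
    std::ifstream probe(fname.c_str(), std::ios::binary | std::ios::ate);
    const std::uintmax_t file_size = static_cast<std::uintmax_t>(probe.tellg());

    const std::uintmax_t window = 64ULL * 1024 * 1024;     // map 64 MiB at a time
    std::uintmax_t lines = 0;

    for (std::uintmax_t offset = 0; offset < file_size; offset += window) {
        io::mapped_file_params params(fname);
        params.offset = static_cast<io::stream_offset>(offset);
        params.length = static_cast<std::size_t>(
            std::min<std::uintmax_t>(window, file_size - offset));

        io::mapped_file_source view(params);               // read-only view of one window
        const char* p   = view.data();
        const char* end = p + view.size();
        while ((p = static_cast<const char*>(std::memchr(p, '\n', end - p)))) {
            ++lines;
            ++p;
        }
        // A line crossing a window boundary still needs special handling here.
    }
    std::cout << lines << " lines\n";
    return 0;
}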
You might want to look at increasing the buffer for the ifstream - the default buffer is often rather small, which leads to lots of expensive reads.
You should be able to do this using something like:
std::ifstream file;
char buffer[1024*1024];
// Set the buffer before opening the file; some implementations ignore
// pubsetbuf() once the file is already open.
file.rdbuf()->pubsetbuf(buffer, 1024*1024);
file.open(filename_xml.c_str());
uintmax_t m_numLines = 0;
std::string str;
while (std::getline(file, str))
{
    m_numLines++;
}
See this question for more info:
How to get IOStream to perform better?
Since this is Windows, you can use the native Windows file functions with the "Ex" suffix:
Windows file management functions
specifically functions like GetFileSizeEx(), SetFilePointerEx(), and so on. The plain read and write functions are limited to 32-bit byte counts, and the "Ex" read and write functions (ReadFileEx/WriteFileEx) are for asynchronous I/O rather than for handling large files.
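A hedged Win32 sketch of that approach: query the 64-bit size with GetFileSizeEx() and stream the file through ReadFile() in fixed-size chunks, which keeps memory use constant even on 32-bit Windows. The file name and chunk size are illustrative.

#include <windows.h>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

int main()
{
    const char* fname = "huge_input.txt";                   // hypothetical file name
    HANDLE h = CreateFileA(fname, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    LARGE_INTEGER size;
    GetFileSizeEx(h, &size);                                // 64-bit size, even on 32-bit Windows

    std::vector<char> buf(1024 * 1024);                     // 1 MiB chunks
    std::uint64_t lines = 0, total = 0;
    DWORD got = 0;
    while (ReadFile(h, buf.data(), static_cast<DWORD>(buf.size()), &got, nullptr) && got > 0) {
        const char* p   = buf.data();
        const char* end = p + got;
        while ((p = static_cast<const char*>(std::memchr(p, '\n', end - p)))) {
            ++lines;
            ++p;
        }
        total += got;
    }
    CloseHandle(h);
    std::cout << lines << " lines in " << total << " of " << size.QuadPart << " bytes\n";
    return 0;
}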

c++ overhead from string concatenation

I'm reading in a text file of random ASCII from an ifstream. I need to be able to put the whole message into a string type for character parsing. My current solution works, but I think I'm murdering processing time on the lengthier files by using the equivalent of this:
std::string result;
for (std::string line; std::getline(std::cin, line); )
{
    result += line;
}
I'm concerned about the overhead associated with concatenating strings like this (this is happening a few thousand times, with a message tens of thousands of characters long). I've spent the last few days browsing different potential solutions, but nothing quite fits. I don't know the length of the message ahead of time, so I don't think using a dynamically sized character array is my answer.
I read through this SO thread, which sounded almost applicable, but it still left me unsure.
Any suggestions?
The problem really is that you don't know the full size ahead of time, so you cannot allocate memory appropriately. I would expect that the performance hit you see is related to that, not to the way strings are concatenated, since that is done efficiently in the standard library.
Thus, I would recommend deferring concatenation until you know the full size of your final string. That is, you start by storing all your strings in a big vector as in:
using namespace std;
vector<string> allLines;
size_t totalSize = 0;
// If you can have access to the total size of the data you want
// to read (size of the input file, ...) then just initialize totalSize
// and use only the second code snippet below.
for (string line; getline(cin, line); )
{
    allLines.push_back(line);
    totalSize += line.size();
}
Then, you can create your big string knowing its size in advance:
string finalString;
finalString.reserve(totalSize);
for (vector<string>::iterator itS = allLines.begin(); itS != allLines.end(); ++itS)
{
    finalString += *itS;
}
Although, I should mention that you should do this only if you experience performance issues. Don't try to optimize things that don't need it; otherwise you will complicate your program with no noticeable benefit. The places where we need to optimize are often counterintuitive and can vary from environment to environment, so do it only if your profiling tool tells you to.
If you know the file size, use result's member function 'reserve()' once.
I'm too sleepy to put together any solid data for you but, ultimately, without knowing the size ahead of time you're always going to have to do something like this. And the truth is that your standard library implementation is smart enough to handle string resizing fairly well. (That's despite the fact that there's no exponential-growth guarantee for std::string, the way there is for std::vector.)
So although you may see unwanted re-allocations the first fifty or so iterations, after a while, the re-allocated block becomes so large that re-allocations become rare.
If you profile and find that this is still a bottleneck, perhaps use std::string::reserve yourself with a typical quantity.
You're copying the result array for every line in the file (as you expand result). Instead pre-allocate the result and grow it exponentially:
std::string result;
result.reserve(1024); // pre-allocate a typical size
for (std::string line; std::getline(std::cin, line); )
{
    // every time we run out of space, double the available space
    while (result.capacity() < result.length() + line.length())
        result.reserve(result.capacity() * 2);
    result += line;
}

Read large amount of ASCII numbers and write in binary form

I have data files with about 1.5 Gb worth of floating-point numbers stored as ASCII text separated by whitespace, e.g., 1.2334 2.3456 3.4567 and so on.
Before processing such numbers I first translate the original file to binary format. This is helpful because I can choose whether to use float or double, reduce file size (to about 800 MB for double and 400 MB for float), and read in chunks of the appropriate size once I am processing the data.
I wrote the following function to make the ASCII-to-binary translation:
template<typename RealType=float>
void ascii_to_binary(const std::string& fsrc, const std::string& fdst){
    RealType value;
    std::fstream src(fsrc.c_str(), std::fstream::in | std::fstream::binary);
    std::fstream dst(fdst.c_str(), std::fstream::out | std::fstream::binary);
    while(src >> value){
        dst.write((char*)&value, sizeof(RealType));
    }
    // RAII closes both files
}
I would like to speed up ascii_to_binary, but I can't seem to come up with anything. I tried reading the file in chunks of 8192 bytes and then processing the buffer in another subroutine. This seems very complicated, because the last few characters in the buffer may be whitespace (in which case all is good) or a truncated number (which is very bad) - the logic to handle the possible truncation seems hardly worth it.
What would you do to speed up this function? I would rather rely on standard C++ (C++11 is OK) with no additional dependencies, like boost.
Thank you.
Edit:
@DavidSchwarts:
I tried to implement your suggestion as follows:
template<typename RealType=float>
void ascii_to_binary(const std::string& fsrc, const std::string& fdst){
    std::vector<RealType> buffer;
    typedef typename std::vector<RealType>::iterator VectorIterator;
    buffer.reserve(65536);
    std::fstream src(fsrc, std::fstream::in | std::fstream::binary);
    std::fstream dst(fdst, std::fstream::out | std::fstream::binary);
    while(true){
        size_t k = 0;
        while(k < 65536 && src >> buffer[k]) k++;
        dst.write((char*)&buffer[0], buffer.size());
        if(k < 65536){
            break;
        }
    }
}
But it does not seem to be writing the data! I'm working on it...
I did exactly the same thing, except that my fields were separated by tabs ('\t') and I also had to handle non-numeric comments at the end of each line and header rows interspersed with the data.
Here is the documentation for my utility.
And I also had a speed problem. Here are the things I did to improve performance by around 20x:
Replace explicit file reads with memory-mapped files. Map two blocks at once. When you are in the second block after processing a line, remap with the second and third blocks. This way a line that straddles a block boundary is still contiguous in memory. (Assumes that no line is larger than a block, you can probably increase blocksize to guarantee this.)
Use SIMD instructions such as _mm_cmpeq_epi8 to search for line endings or other separator characters. In my case, any line containing an '=' character was a metadata row that needed different processing.
Use a barebones number-parsing function (I used a custom one for parsing times in HH:MM:SS format; strtod and strtol are perfect for grabbing ordinary numbers). These are much faster than the istream formatted extraction functions.
Use the OS file write API instead of the standard C++ API.
If you dream of throughput in the 300,000 lines/second range, then you should consider a similar approach.
Your executable also shrinks when you don't use the C++ standard streams. Mine is 205 KB, including a graphical interface, and depends only on DLLs that ship with Windows (no MSVCRTxx.dll needed). And looking again, I see I am still using C++ streams for status reporting.
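As an illustration of the strtod suggestion above, here is a minimal sketch that pulls whitespace-separated doubles out of an in-memory, NUL-terminated buffer without any stream extraction; the function name is made up:

#include <cstdlib>
#include <vector>

// Parse every whitespace-separated number in `text`; parsing stops at the
// first token strtod cannot interpret as a number.
std::vector<double> parse_numbers(const char* text)
{
    std::vector<double> values;
    const char* p = text;
    char* end = nullptr;
    for (double v = std::strtod(p, &end); end != p; v = std::strtod(p, &end)) {
        values.push_back(v);
        p = end;                 // continue right after the parsed number
    }
    return values;
}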
Aggregate the writes into a fixed buffer, using a std::vector of RealType. Your logic should work like this:
1. Allocate a std::vector<RealType> with 65,536 default-constructed entries.
2. Read up to 65,536 entries into the vector, replacing the existing entries.
3. Write out as many entries as you were able to read in.
4. If you read in exactly 65,536 entries, go to step 2.
5. Stop, you are done.
This will prevent you from alternating reads and writes to two different files, minimizing the seek activity significantly. It will also allow you to make far fewer write calls, reducing copying and buffering logic. A sketch of this loop is shown below.
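Here is a hedged sketch of that loop; it uses resize-on-construction so the elements actually exist before being read into, and writes k * sizeof(RealType) bytes, which are the two details the edited code in the question gets wrong. The file names in main() are made up.

#include <fstream>
#include <string>
#include <vector>

template<typename RealType = float>
void ascii_to_binary(const std::string& fsrc, const std::string& fdst)
{
    std::vector<RealType> buffer(65536);        // default-constructed entries (step 1)
    std::ifstream src(fsrc);
    std::ofstream dst(fdst, std::ios::binary);

    while (true) {
        std::size_t k = 0;
        while (k < buffer.size() && src >> buffer[k])
            ++k;                                // read up to 65,536 values (step 2)

        dst.write(reinterpret_cast<const char*>(buffer.data()),
                  static_cast<std::streamsize>(k * sizeof(RealType)));   // step 3

        if (k < buffer.size())
            break;                              // short read: end of input (step 5)
    }                                           // full read: loop again (step 4)
}

int main()
{
    ascii_to_binary<double>("numbers.txt", "numbers.bin");   // hypothetical file names
    return 0;
}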