I have two files. One file is 15 gigabytes. The other is 684 megabytes. Both of these files have identical structures: they consist of many strings, one per line (which is to say, each string is separated by a \n).
While bored one day, and being the curious novice that I am, I decided to write a little C++ program to read these files into RAM. I compiled the program with G++ 8.1.1 on Fedora 28, and I found that when I read the small file into RAM, it consumes 2154 megabytes of RAM, and when I read the large file, it consumes 70.2 gigabytes of RAM. That's 3.15 times and 4.68 times the size of the original files, respectively.
Why is this the case?
This is the source code for this simple program. I'm using a std::vector to store each line as an std::string. I get the feeling that this question may actually boil down to, how does C++ handle strings? Is there an alternative datatype I should consider using?
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
int main()
{
std::ifstream inFile;
std::vector<std::string> inStrings;
std::string line;
inStrings.reserve(1212356398);
inFile.open("bigfile.txt");
if (!inFile)
{
std::cerr << "Unable to open the hardcoded file" << std::endl;
exit(1);
}
while(getline(inFile, line))
{
inStrings.push_back(line);
}
std::cout << "done reading" << std::endl;
std::cin.get();
return 0;
}
If you have tried implementing a dynamic array in school or as an exercise, recall allocation strategies like doubling the capacity each time the capacity is full; similarly the vector prepares to store more than it actually stores.
Meanwhile, a string by itself stores a length, a capacity as well, and a reference counter, which makes 3 words at minimum even for an empty string.
Edit
Yeah, I guess the bit about the reference counter wasn't correct. I was remembering it was 3 words, and for some reason thought it wasn't counting the pointer to the actual allocated memory. But I guess maybe it is just that: the pointer to the actual string.
In any case the actual story differs due to optimizations across compilers. Search "std::string memory allocation" or something like that to read more.
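To see the per-string cost on a given compiler, something like the sketch below can help. The names Footprint, stored_inline and footprint are made up for illustration. It assumes the library keeps short strings inside the string object itself (the small-string optimization), and it does not count the allocator's own bookkeeping, so real usage will be somewhat higher than what it reports.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Rough per-string memory estimate (hypothetical helper, not a library API).
// Allocator bookkeeping is NOT included.
struct Footprint {
    std::size_t object_bytes; // sizeof(std::string), paid for every vector element
    std::size_t heap_bytes;   // separately allocated buffer, 0 if the text is stored inline
};

// True if the character buffer lives inside the string object itself.
inline bool stored_inline(const std::string& s) {
    const auto obj = reinterpret_cast<std::uintptr_t>(&s);
    const auto buf = reinterpret_cast<std::uintptr_t>(s.data());
    return buf >= obj && buf < obj + sizeof(std::string);
}

inline Footprint footprint(const std::string& s) {
    // +1 for the terminating '\0' that the buffer also holds.
    return { sizeof(std::string), stored_inline(s) ? 0 : s.capacity() + 1 };
}
```

Summing object_bytes + heap_bytes over a sample of lines gives a lower bound on what a std::vector<std::string> of those lines will occupy.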
I have a relatively simple question. There is an ongoing discussion, across many programming languages, about which method gives the fastest file read, mostly read() versus mmap(). Although I have taken part in these debates, I could not find an answer to my current problem, because most answers address the case where the file to read is huge (for example, how to read a 10 TB text file...).
But my problem is a bit different: I have lots of files, let's say 100 million. I want to read the first 1-2 lines from these files. Whether the file is 10 kB or 100 TB is irrelevant. I just want the first one or two lines from every file, so I want to avoid reading or buffering the unnecessary parts of the files. My knowledge was not enough to thoroughly test which method is faster, or even to discover what all my options are.
What I am doing right now (multithreaded, for the moment):
for(const auto& p: std::filesystem::recursive_directory_iterator(path)) {
if (!std::filesystem::is_directory(p)) {
std::ifstream read_file(p.path().string());
if (read_file.is_open()) {
while (getline(read_file, line)) {
// Get two lines here.
}
}
}
}
What does C++, or the Linux environment, provide me in this situation? Is there a faster or more efficient way to read small portions of millions of files?
Thank you for your time.
Info: I have access to C++20 and Ubuntu 18.04
You can save one underlying call to stat by not testing whether the path is a directory and relying on the is_open test instead:
#include <iostream>
#include <fstream>
#include <filesystem>
#include <string>
int main()
{
std::string line,path=".";
for(const auto& p: std::filesystem::recursive_directory_iterator(path)) {
{
std::ifstream read_file(p.path().string());
if (read_file.is_open()) {
std::cout << "opened: " << p.path().string() << '\n';
while (getline(read_file, line)) {
// Get two lines here.
}
}
}
}
}
At least on Windows this code skips the directories. And as suggested in comments is_open test can even be skipped since getline doesn't read anything from a directory either.
Not the cleanest, but if it can save time it's worth it.
Any function in a program accessing a file under Linux will result in calling some "system calls" (for example read()).
All other available functions in some programming language (like fread(), fgets(), std::filesystem ...) call functions or methods which in turn call some system calls.
For this reason you can't be faster than calling the system calls directly.
I'm not 100% sure, but I think in most cases, the combination open(), read(), close() will be the fastest method for reading data from the start of a file.
(If the data is not located at the start of the file, pread() might be faster than read(); I'm not sure.)
Note that read() does not read a certain number of lines but a certain number of bytes (e.g. into an array of char), so you have to find the end(s) of the line(s) "manually" by searching the '\n' character(s) and/or the end of the file in the array of char.
Unfortunately, a line may be much longer than you expect, so reading the first N bytes from the file does not contain the first M lines and you have to call read() again.
In this case it depends on your system (e.g. file system or even hard disks) how many bytes you should read in each call to read() to get the maximum performance.
Example: Let's say in 75% of all files, the first N lines are found in the first 512 bytes of the file; in the other 25% of all files, first N lines are longer than 512 bytes in sum.
On some computers, reading 1024 bytes at once might require nearly the same time as reading 512 bytes, but reading 512 bytes twice will be much slower than reading 1024 bytes at once; on such computers it makes sense to read() 1024 bytes at once: You save a lot of time for 25% of the files and you lose only very little time for the other 75%.
On other computers, reading 512 bytes is significantly faster than reading 1024 bytes; on such computers it would be better to read() 512 bytes: Reading 1024 bytes would save you only little time when processing the 25% of files but cost you very much time when processing the other 75%.
I think in most cases this "optimal value" will be a multiple of 512 bytes, because most modern file systems organize files in units that are a multiple of 512 bytes.
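The open(), read(), close() approach described above could be sketched roughly as follows. This assumes a POSIX system; the function name read_first_lines and its parameters are made up for illustration, and the default chunk size of 512 bytes is just the value from the example, to be tuned by measurement.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <string>

// Returns the first max_lines lines of the file (newlines included), or the
// whole content if the file has fewer lines. Returns "" if the file cannot be
// opened or read (which also covers directories).
std::string read_first_lines(const char* path, int max_lines, std::size_t chunk = 512) {
    std::string out;
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return out;

    std::string buf(chunk, '\0');
    int seen = 0; // number of '\n' characters found so far
    while (seen < max_lines) {
        const ssize_t n = read(fd, &buf[0], buf.size());
        if (n <= 0) break; // end of file or error
        for (ssize_t i = 0; i < n; ++i) {
            out.push_back(buf[i]);
            if (buf[i] == '\n' && ++seen == max_lines) break;
        }
    }
    close(fd);
    return out;
}
```

read() is only called again when the previous chunk did not contain enough newline characters, so files with short first lines cost one open(), one read() and one close().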
I was just typing something similar to Martin Rosenau's answer when his popped up: an unstructured read of the maximum length of two lines. But I would go further: queue that text buffer together with the corresponding file name and let another thread parse and analyze it. If parsing takes about the same time as reading, you can save half of the time. If it takes longer (unlikely), you can use multiple parsing threads and save even more.
Side note - you should not parallelize reading (tried that).
It may be worth experimenting: can you open one file, read it asynchronously while proceeding to open the next one? I don't know if any OS can overlap those things.
I am trying to solve a school problem and I did that, but it should run faster and on less memory if possible - can you please help me achieve that?
Problem statement: Read a natural number N and a string from a file, and output in another file the same string N number of times.
Example of input file:
3
dog
Example of output file:
dog
dog
dog
Restrictions:
1 ≤ n ≤ 50, and the length of the line to be read is maximum 1,000,000
Time limit: 0.27 seconds
This is what I tried (but run time exceeds the limit):
#include<fstream>
using namespace std;
ifstream cin("afisaren.in");
ofstream cout("afisaren.out");
short n;
char s[1000005];
int main() {
cin >> n;
cin >> s;
while(n) {
cout << s << '\n';
n--;
}
cin.close();
cout.close();
return 0;
}
Generally, when given this type of problem, you should profile your own code to see which part consumes how much time. This can mostly be done by adding a few calls to a timekeeping function before and after the code in question, to see how long it ran. However, this is not so easy with your code, since one of the biggest problems (optimisation-wise) is your char s[1000005]; line. The memory will be allocated before your main() function executes, and how that happens is operating-system dependent (or rather depends on the libc and compiler used).
So first, do not use pre-allocated char-arrays. You're using C++! Why not simply read the text into a std::(w)string or any of the C++-classes which will do dynamic memory allocation (and not crash your program if line-length does exceed 1,000,000).
And second, C++ file streams write to disk every time std::endl is used, and otherwise whenever their internal buffer fills up. Writing in small pieces is inefficient unless each write matches the block size of the underlying file system. To optimize this, build a memory object (e.g. a std::string) holding k copies of your text, where k = fs-block-size / text-length, and write it in one go. fs-block-size will most likely be 1024, 2048 or 4096 bytes. There are system calls to find that out, but performance will usually not be affected much when writing twice (or 4x) the fs-block-size, so you can safely assume 4096 for close-to-maximum performance.
Since the number of repetitions is 1 ≤ n ≤ 50 and the line length is at most 1,000,000 (approx. 1 MB if ASCII), the maximum size of the output file will be about 50,000,000 characters. You could also build the whole output in memory and then write it in one call to write(). This would probably be the most efficient way in terms of disk activity, but obviously not in terms of memory consumption.
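A minimal sketch of the "build everything in memory, then write once" variant might look like this. The function names repeat_lines and write_repeated are made up for illustration, and, like the code in the question, it reads the string with operator>>, so it assumes the string contains no whitespace.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Builds "line\n" repeated n times in a single, pre-sized buffer.
std::string repeat_lines(const std::string& line, int n) {
    std::string out;
    if (n <= 0) return out;
    out.reserve((line.size() + 1) * static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {
        out += line;
        out += '\n';
    }
    return out;
}

// Reads N and the string from in_path, writes the repeated text to out_path
// with a single write() call on the stream. Returns false on any I/O failure.
bool write_repeated(const char* in_path, const char* out_path) {
    std::ifstream in(in_path);
    int n = 0;
    std::string line;
    if (!(in >> n >> line)) return false;

    const std::string all = repeat_lines(line, n);
    std::ofstream out(out_path, std::ios::binary);
    out.write(all.data(), static_cast<std::streamsize>(all.size()));
    return static_cast<bool>(out);
}
```

With the limits from the problem statement the buffer tops out at roughly 50 MB, which is the memory-for-speed trade-off mentioned above.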
I'm not a C++ expert, but I had a similar problem when I used C++-style file streams. After googling a bit, I tried switching to C-style file I/O, and it boosted my performance a lot, because C++ file streams copy the file contents into an internal buffer and that takes time. You can try it C-style, although using C idioms in C++ is usually not recommended.
I am trying to understand the code behind the copy command, which copies a file from one place to another. I studied the basics of C++ file streams and have written the following code for my task.
#include<iostream>
#include<fstream>
#include<string>
using namespace std;
int main()
{
cout<<"Copy file\n";
string from,to;
cout<<"Enter file address: ";
cin>>from;
ifstream in(from,ios::in | ios::binary);
if(!in)
{
cout<<"could not find file "<<from<<endl;
return 1;
}
cout<<"Enter file destination: ";
cin>>to;
ofstream out(to,ios::out | ios::binary);
char ch;
while(in.get(ch))
{
out.put(ch);
}
cout<<"file has been copied\n";
in.close();
out.close();
}
This code works, but it is much slower than the copy command of my OS, which is Windows. I want to know how I can make my program faster, to reduce the gap between my program's time and that of my OS's copy command.
Reading one byte at time is going to waste a lot of time in function calls... use a bigger buffer:
char ch[4096];
while(in) {
in.read(ch, sizeof(ch));
out.write(ch, in.gcount());
}
(you may want to add some more error handling, e.g. out may go in a bad state and the like)
(the most C++-ish way is reported here, but takes advantage of streambuf functionalities that typically a beginner rarely has reason to know, and to me is also way less instructive)
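As one possible way to add the error handling mentioned above, here is a hedged sketch of the same buffered loop wrapped in a function. The name copy_file_chunked and the 64 KiB default buffer are arbitrary choices for illustration.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Copies `from` to `to` in buffer_size chunks. Returns true only if the whole
// source was read and every write succeeded.
bool copy_file_chunked(const std::string& from, const std::string& to,
                       std::size_t buffer_size = 64 * 1024) {
    if (buffer_size == 0) return false;
    std::ifstream in(from, std::ios::binary);
    if (!in) return false; // do not create the destination if the source is missing
    std::ofstream out(to, std::ios::binary);
    if (!out) return false;

    std::vector<char> buf(buffer_size);
    // read() sets failbit on the final, short read, so also check gcount().
    while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) || in.gcount() > 0) {
        out.write(buf.data(), in.gcount());
        if (!out) return false; // e.g. disk full
    }
    out.flush();
    return in.eof() && static_cast<bool>(out);
}
```

Checking in.eof() at the end distinguishes "reached the end of the source" from "a read failed part-way through".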
You have correctly opened the files for binary read and binary write. However, instead of reading single characters (which is not meaningful for binary data), use istream::read and ostream::write.
Like other answers say, use bigger buffers. I'd go for 1MB.
But there's a lot more to it.
Also, avoid stream lib and FILE stuff. They buffer the data so you get 2 memcpy calls instead of 1.
Disabling buffering on the streams can achieve a similar result, but I think you're better off using the system calls directly.
And one last thing, on the "do it yourself" front: you must check the return values of the read and write calls. They may read or write fewer bytes than you asked for.
If you can manage a circular buffer, you should switch between reading and writing whenever a call returns short. The disk may be more ready for reading or for writing at that moment, so there is no point in waiting instead of switching to the other thing you have to do.
And now the very last thing you might want to explore- look into the sendfile system call. It was built to speed up web servers by doing all the copy in the kernel and avoiding context switches and memcpys, but may serve here if it works with two disk file descriptors.
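To illustrate the point about checking return values, here is a minimal sketch using the POSIX read() and write() calls. It assumes a POSIX system (on Windows the equivalent calls differ); the names write_all and copy_fd are made up, and the 1 MB default follows the buffer size suggested above. It does not implement the circular-buffer idea or sendfile.

```cpp
#include <unistd.h>
#include <cerrno>
#include <cstddef>
#include <vector>

// Writes all len bytes, retrying after short writes. Returns false on error.
bool write_all(int fd, const char* data, std::size_t len) {
    while (len > 0) {
        const ssize_t n = write(fd, data, len);
        if (n < 0) {
            if (errno == EINTR) continue; // interrupted by a signal, try again
            return false;
        }
        data += n; // write() may have accepted only part of the buffer
        len -= static_cast<std::size_t>(n);
    }
    return true;
}

// Copies everything readable from in_fd to out_fd. Returns false on error.
bool copy_fd(int in_fd, int out_fd, std::size_t buffer_size = 1 << 20) {
    if (buffer_size == 0) return false;
    std::vector<char> buf(buffer_size);
    for (;;) {
        const ssize_t n = read(in_fd, buf.data(), buf.size());
        if (n == 0) return true; // end of input
        if (n < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        if (!write_all(out_fd, buf.data(), static_cast<std::size_t>(n))) return false;
    }
}
```

The descriptors would come from open() on the source and destination files; the caller remains responsible for closing them.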
I have a big file nearly 800M, and I want to read it line by line.
At first I wrote my program in Python, using linecache.getlines:
lines = linecache.getlines(fname)
It takes about 1.2 s.
Now I want to port my program to C++.
I wrote this code:
std::ifstream DATA(fname);
std::string line;
std::vector<std::string> lines;
while (std::getline(DATA, line)){
lines.push_back(line);
}
But it's slow (it takes minutes). How can I improve it?
Joachim Pileborg mentioned mmap(), and on windows CreateFileMapping() will work.
My code runs under VS2013. In "DEBUG" mode it takes 162 seconds;
in "RELEASE" mode, only 7 seconds!
(Many thanks to @DietmarKühl and @Andrew)
First of all, you should probably make sure you are compiling with optimizations enabled. This might not matter for such a simple algorithm, but that really depends on your vector/string library implementations.
As suggested by @angew, std::ios_base::sync_with_stdio(false) can make a big difference when the data is read from std::cin. Note that it only affects the standard streams, not a std::ifstream like the one in your code.
Another, lesser, optimization would be to use lines.reserve() to preallocate your vector so that push_back() doesn't result in huge copy operations. However, this is most useful if you happen to know in advance approximately how many lines you are likely to receive.
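A small sketch of the reserve() idea, with the function name read_lines and the expected_lines parameter made up for illustration. It also moves each line into the vector instead of copying it, which is a separate, small saving not discussed above.

```cpp
#include <cstddef>
#include <istream>
#include <sstream> // only needed for the in-memory usage example below
#include <string>
#include <utility>
#include <vector>

// Reads all lines from `in`. expected_lines is a caller-supplied estimate;
// a wrong guess only costs extra reallocations or some unused capacity.
std::vector<std::string> read_lines(std::istream& in, std::size_t expected_lines = 0) {
    std::vector<std::string> lines;
    lines.reserve(expected_lines);
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(std::move(line)); // hand the buffer over instead of copying it
        line.clear();                     // leave the moved-from string in a known state
    }
    return lines;
}

// Usage with an in-memory stream (a std::ifstream works the same way):
//   std::istringstream in("a\nbb\n");
//   std::vector<std::string> lines = read_lines(in, 2);
```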
Using the optimizations suggested above, I get the following results for reading an 800MB text stream:
20 seconds ## if average line length = 10 characters
3 seconds ## if average line length = 100 characters
1 second ## if average line length = 1000 characters
As you can see, the speed is dominated by per-line overhead. This overhead is primarily occurring inside the std::string class.
It is likely that any approach based on storing a large quantity of std::string will be suboptimal in terms of memory allocation overhead. On a 64-bit system, std::string will require a minimum of 16 bytes of overhead per string. In fact, it is very possible that the overhead will be significantly greater than that -- and you could find that memory allocation (inside of std::string) becomes a significant bottleneck.
For optimal memory use and performance, consider writing your own routine that reads the file in large blocks rather than using getline(). Then you could apply something similar to the flyweight pattern to manage the indexing of the individual lines using a custom string class.
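One way to read that suggestion is sketched below: the whole input is kept in a single buffer, read in large blocks, and each line is represented only by its start offset. This is an illustration under assumptions, not a full flyweight implementation; the class name LineIndex and the 64 KiB block size are made up, and it needs C++17 for std::string_view.

```cpp
#include <cstddef>
#include <istream>
#include <sstream> // only needed for the in-memory usage example below
#include <string>
#include <string_view>
#include <vector>

class LineIndex {
public:
    explicit LineIndex(std::istream& in) {
        char block[1 << 16];
        // Read in large blocks; the last, short block is detected via gcount().
        while (in.read(block, sizeof(block)) || in.gcount() > 0)
            text_.append(block, static_cast<std::size_t>(in.gcount()));
        // Record where each line starts: one size_t per line, no per-line allocation.
        std::size_t start = 0;
        while (start < text_.size()) {
            starts_.push_back(start);
            const std::size_t nl = text_.find('\n', start);
            if (nl == std::string::npos) break;
            start = nl + 1;
        }
    }

    std::size_t size() const { return starts_.size(); }

    // View of line i without its '\n'. Valid only while this object is alive.
    std::string_view line(std::size_t i) const {
        const std::size_t begin = starts_[i];
        std::size_t end = (i + 1 < starts_.size()) ? starts_[i + 1] : text_.size();
        if (end > begin && text_[end - 1] == '\n') --end; // drop the terminator
        return std::string_view(text_).substr(begin, end - begin);
    }

private:
    std::string text_;                // whole input, one buffer
    std::vector<std::size_t> starts_; // offset of the first character of each line
};

// Usage with an in-memory stream (a std::ifstream works the same way):
//   std::istringstream in("alpha\nbeta\n");
//   LineIndex idx(in);   // idx.line(1) == "beta"
```

The per-line overhead here is one std::size_t, compared with a full std::string object plus a possible heap allocation per line.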
P.S. Another relevant factor will be the physical disk I/O, which might or might not be bypassed by caching.
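For C++ you could try something like this:
#include <boost/algorithm/string.hpp>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>
using namespace std;
void do_some_operation(const vector<string>& tokens); // your processing, defined elsewhere
void processData(const string& str)
{
    vector<string> arr;
    boost::split(arr, str, boost::is_any_of(" \n"));
    do_some_operation(arr);
}
int main()
{
    const size_t read_bytes = 45 * 1024 * 1024;
    const char* fname = "input.txt";
    ifstream fin(fname, ios::in);
    vector<char> memblock(read_bytes); // allocate once and reuse for every chunk
    // read() sets failbit on the final, short read, so test gcount() as well.
    while (fin.read(memblock.data(), memblock.size()) || fin.gcount() > 0)
    {
        // The buffer is not null-terminated: build the string from the byte count.
        string str(memblock.data(), static_cast<size_t>(fin.gcount()));
        // Note: a token that straddles two chunks is still split in two here.
        processData(str);
    }
    return 0;
}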
I have to write 5 to 6 large arrays of the same size to text files in C++. I am using a single for loop to write all the arrays to their text files. The following is the code I'm using.
ofstream kx_ecl("COARSE_PERMX.GRDECL");
ofstream ky_ecl("COARSE_PERMY.GRDECL");
ofstream kz_ecl("COARSE_PERMZ.GRDECL");
ofstream ntg_c("COARSE_NTG.GRDECL");
ofstream phi_c("COARSE_PORO.GRDECL");
ofstream multbv("MULTBRV.GRDECL");
kx_ecl<<"PERMX"<<endl;
ky_ecl<<"PERMY"<<endl;
kz_ecl<<"PERMZ"<<endl;
ntg_c<<"NTG"<<endl;
phi_c<<"PORO"<<endl;
multbv<<"MULTBRV"<<endl;
for (i=0;i<N;i++) {
kx_ecl<<new_KX[i]<<endl;
ky_ecl<<new_KY[i]<<endl;
kz_ecl<<new_KZ[i]<<endl;
ntg_c<<new_NTG[i]<<endl;
phi_c<<new_por[i]<<endl;
multbv<<MULTBRV[i]<<endl;
}
kx_ecl<<"/";
ky_ecl<<"/";
kz_ecl<<"/";
ntg_c<<"/";
phi_c<<"/";
multbv<<"/";
kx_ecl.close();
ky_ecl.close();
kz_ecl.close();
ntg_c.close();
phi_c.close();
multbv.close();
My question is: would it be faster to output the arrays one at a time, which would mean many for loops, or to leave the code as it is? Also, is there any way to write the arrays to a file without using a for loop? I am asking because the output to files takes a lot of time compared to the actual calculation in my code.
Are you aware that std::endl is flushing your ofstreams constantly? You could push out '\n' characters instead and just flush once at the end for each stream.
I/O might still end up being the biggest piece of your time spent, but you are working against the buffering and do not seem in this example at least to be gaining anything from it. Going one array/file at a time might give you infinitesimally faster performance due to better memory locality (unless you're using so much memory you're swapping pages to disk), but the difference is probably getting completely lost in the huge cost of file writing.
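A minimal sketch of that suggestion: write '\n' instead of std::endl and flush once at the end. The helper name write_keyword_block is made up, and it assumes the arrays hold double values, which the question does not state.

```cpp
#include <ostream>
#include <sstream> // only needed for the in-memory usage example below
#include <string>
#include <vector>

// Writes one keyword section: the keyword, one value per line, then "/".
// '\n' leaves buffering to the stream; the only explicit flush is at the end.
void write_keyword_block(std::ostream& out, const std::string& keyword,
                         const std::vector<double>& values) {
    out << keyword << '\n';
    for (double v : values) out << v << '\n';
    out << "/";
    out.flush();
}

// Usage with an in-memory stream (a std::ofstream works the same way):
//   std::ostringstream os;
//   write_keyword_block(os, "PERMX", {1.0, 2.5});
```

Calling it once per array with a std::ofstream, for example write_keyword_block(kx_ecl, "PERMX", new_KX) if new_KX is a std::vector<double>, also gives the one-array-at-a-time ordering discussed above.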