To be frank, I have an assignment that says, quite vaguely,
"If the file exists, the one-argument constructor allocates memory for the number of records contained in the file and copies them into memory."
Now, considering this instruction, it would seem I am to allocate the dynamic memory /before/ copying the data over, and this seems, in principle, impossible.
To dynamically allocate memory, to my knowledge, you need to know at runtime the size of the block to be reserved.
Given that the file size, or number of 'entries' is unknown, how can one possibly allocate that much memory? Does not the notion defeat the very purpose of dynamic allocation?
Solution-wise, it would seem the only option is to parse the entire file to determine its size, allocate the proper amount of memory afterward, and then read through the file again, copying the data into the allocated memory.
Given that this must be a common operation in any program that reads file data, I wonder: What is the proper, or most efficient way of loading a file into RAM?
The notion of reading once to determine the size and then again to copy seems very inefficient. I assume there is a way to jump to the end of the file to determine the maximum length, which would make the process faster. Or perhaps I could use a static buffer and load it into RAM in blocks?
Is it possible to read all of the data and then move it into dynamic memory using move semantics? Or would it be more efficient to use a linked list of some kind?
The most efficient method is to have the operating system map the file to memory. Search your OS API for "mmap" or "memory mapping".
Another approach is to seek to the end of the file and get the position there (tellg()); that position is the size of the file. Allocate an array in dynamic memory, or create a std::vector and reserve at least this amount of space.
Some operating systems have an API you can call to get the size of a file (without having to seek to the end). You could use that, then dynamically allocate the memory or use a std::vector<char>.
You will need to come up with a plan if the file doesn't fit into memory.
If you need to read the entire file into memory, you could use istream::read using the file length.
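A minimal sketch of that seek-to-end/tellg/read approach (the function name and the lack of real error handling are my own simplifications):

#include <fstream>
#include <vector>

std::vector<char> load_file(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return {};                               // file missing or not readable

    in.seekg(0, std::ios::end);                  // jump to the end...
    const std::streamsize size = in.tellg();     // ...the position there is the file size
    in.seekg(0, std::ios::beg);                  // rewind to the beginning

    std::vector<char> buffer(static_cast<std::size_t>(size));
    in.read(buffer.data(), size);                // read the whole file in one call
    return buffer;
}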
It all depends on the file format. One way to store records is to first write how many records are stored in the file. If you have two phone numbers, your file might look like this:
2
Jon
555-123
Mary
555-456
In this case the solution is straightforward:
// ...
is >> count;
record_type *record = new record_type[count];
for ( int i = 0; i < count; ++i )
    is >> record[i].name >> record[i].number; // stream checks omitted
// ...
If the file does not store the number of records (I wouldn't do this), you will have to count them first, and then use the above solution (skipping the is >> count line, since the count is already known):
// ...
int count = 0;
std::string dummy;
while ( is >> dummy >> dummy )
    ++count;
is.clear();
is.seekg( 0 );
// ...
A second solution for the second case would be to write a dynamic container (I assume you are not allowed to use standard containers) and push the records onto it as you read them:
// ...
list_type list;
record_type r;
while ( is >> r.name >> r.number )
    list.push_back( r );
// ...
The solutions are ordered by complexity. I did not compile the examples above.
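For reference, the snippets above assume a record type roughly like the following (my guess at its shape, based on the name/number fields used above):

#include <string>

// Hypothetical record matching the name/number layout of the example file.
struct record_type {
    std::string name;
    std::string number;
};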
Related
I have a big file (let's assume I can make it binary) that cannot fit in RAM, and I want to sort the numbers in it. In the process I need to read/write a large amount of numbers from/to the file (from/to a vector<int> or int[]) quickly, so I'd rather not read/write them one by one, but in blocks of a fixed size. How can I do that?
Given that the file is binary, perhaps the simplest, and presumably an efficient, solution is to memory-map the file. Unfortunately there is no standard interface for memory mapping. On POSIX systems, there is the mmap function.
Now, the memory-mapped file is simply an array of raw bytes. Treating it as an array of integers is technically not allowed until C++20, where C-style "implicit creation of low level objects" is introduced. In practice, that already works on most current language implementations (see note 1 below).
For this reinterpretation to work, the representation of the integers in the file must match the representation of integers used by the CPU. The file will not be portable to the same program running on other, incompatible systems.
We can simply use std::sort on this array. The operating system should take care of paging the file in and out of memory. The algorithm used by std::sort isn't necessarily optimised for this use case however. To find the optimal algorithm, you may need to do some research.
Note 1: If pre-C++20 standard conformance is a concern, it is possible to iterate over the array, copy the underlying bytes of each element into an integer, and placement-new an integer object into that memory using the copied value. A compiler can optimise these operations down to zero instructions, and doing so makes the program's behaviour well defined.
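A POSIX-only sketch of the memory-mapping approach (the file name, the assumption that the file holds native-endian ints, and the omission of error checks are mine):

#include <algorithm>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    const int fd = open("numbers.bin", O_RDWR);   // hypothetical file of raw ints
    struct stat st;
    fstat(fd, &st);

    void* addr = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

    // Pre-C++20 this reinterpretation is technically UB; see note 1 above.
    int* first = static_cast<int*>(addr);
    int* last  = first + static_cast<std::size_t>(st.st_size) / sizeof(int);

    std::sort(first, last);   // the OS pages the file in and out as needed

    munmap(addr, st.st_size);
    close(fd);
}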
You can use ostream::write to write into a file and istream::read to read from a file.
To make the process clean, it is good to also store the number of items in the file.
Let's say you have a vector<int>.
You can use the following code to write its contents to a file.
std::vector<int> myData;
// .. Fill up myData;
// Open a file to write to, in binary mode.
std::ofstream out("myData.bin", std::ofstream::binary);
// Write the size first.
auto size = myData.size();
out.write(reinterpret_cast<char const*>(&size), sizeof(size));
// Write the data.
out.write(reinterpret_cast<char const*>(myData.data()), sizeof(int)*size);
You can read the contents of such a file using the following code.
std::vector<int> myData;
// Open the file to read from, in binary mode.
std::ifstream in("myData.bin", std::ifstream::binary);
// Read the size first.
auto size = myData.size(); // gives size the same type used when writing; its value is replaced by the read below
in.read(reinterpret_cast<char*>(&size), sizeof(size));
// Resize myData so it has enough space to read into.
myData.resize(size);
// Read the data.
in.read(reinterpret_cast<char*>(myData.data()), sizeof(int)*size);
If not all of the data can fit into RAM, you can read and write the data in smaller chunks. However, if you read/write them in smaller chunks, I don't know how you would sort them.
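For the block-wise reads and writes themselves, here is a rough sketch (the block size and function names are my own choices; the external-sorting step is not shown):

#include <cstddef>
#include <fstream>
#include <vector>

constexpr std::size_t kBlockInts = 1 << 20;   // 1M ints per block (arbitrary choice)

// Reads the next block of ints; returns how many ints were actually read.
std::size_t read_block(std::istream& in, std::vector<int>& block)
{
    block.resize(kBlockInts);
    in.read(reinterpret_cast<char*>(block.data()),
            static_cast<std::streamsize>(block.size() * sizeof(int)));
    const std::size_t got = static_cast<std::size_t>(in.gcount()) / sizeof(int);
    block.resize(got);
    return got;
}

// Writes a block of ints in one call.
void write_block(std::ostream& out, const std::vector<int>& block)
{
    out.write(reinterpret_cast<const char*>(block.data()),
              static_cast<std::streamsize>(block.size() * sizeof(int)));
}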
I am wondering how memory is managed when different files are stored in a map of string vectors.
I tried reading several files of 10 MB each into memory, and when I check the memory with KSysGuard, the memory shown is more than twice the size of my files (~70 MB).
Here is a code example:
There is a function readfile():
std::vector<std::string> read_file(std::string& path){
    std::ifstream fichier(path);
    std::vector<std::string> fich;
    if(fichier){
        std::string ligne;
        while(std::getline(fichier, ligne)){
            fich.push_back(ligne);
        }
    }
    fichier.close();
    return fich;
}
This function is used in another which is building my map:
std::map<std::string, std::vector<std::string>> buildmap(std::string folder){
    std::map<std::string, std::vector<std::string>> evaluations;
    std::vector<std::string> vecFiles = {"file1", "file2", "file3"};
    for( auto i = 0; i < vecFiles.size(); i++ )
    {
        std::stringstream strad;
        strad << vecFiles[i];
        std::string path(folder + vecFiles[i]);
        std::vector<std::string> a = read_file(path);
        evaluations[strad.str()] = a;
    }
    return evaluations;
}
So I do not understand why the memory usage is so high compared to the file sizes. Is there a more efficient way to construct this kind of container?
There is a lot of memory overhead in your scenario:
You store each file line as a separate std::string object. Each such object occupies some space by itself (typically 24 or 32 bytes on a 64-bit architecture); the line characters are stored inside the object only when the string is short enough for the small/short string optimization (SSO) to apply (which common standard library implementations have done since C++11). If lines are long, the space for the characters is dynamically allocated, and each allocation carries some additional memory overhead of its own.
You push_back these std::string objects into a std::vector, which typically grows its internal buffer exponentially (such as doubling it when it runs out of space). That is why reserving space (std::vector::reserve) is useful when you know the number of elements in advance.
This is the price of such a "comfortable" approach. What might help is to store the whole file contents as a single std::string and then keep just indexes/pointers to the beginnings of the individual lines in a separate array/vector (though you then cannot treat those pointers as strings, since they won't be null-terminated; or, in fact, you can, if you replace the newline characters with null characters).
In C++17, you can store the lines as instances of std::string_view pointing into the whole file contents stored in a single std::string.
Just note that std::string_view will likely be larger than a pointer/index. For instance, with libstdc++ on x86_64, sizeof(std::string_view) is 16 bytes, whereas a pointer/index occupies 8 bytes. And for files smaller than 4 GB, you could even use 32-bit indexes. If you have a lot of lines in the processed files, these differences can matter.
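A rough C++17 sketch of that idea (the struct and function names are mine): the whole file lives in one std::string, and each line is just a std::string_view into it.

#include <fstream>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

struct FileLines {
    std::string text;                     // owns the whole file contents
    std::vector<std::string_view> lines;  // views into text, one per line
};

void read_file_lines(const std::string& path, FileLines& out)
{
    std::ifstream in(path, std::ios::binary);
    std::ostringstream ss;
    ss << in.rdbuf();                     // slurp the whole file in one go
    out.text = ss.str();
    out.lines.clear();

    std::string_view rest(out.text);
    while (!rest.empty()) {
        const std::size_t nl = rest.find('\n');
        out.lines.push_back(rest.substr(0, nl));   // substr(0, npos) = the final line
        if (nl == std::string_view::npos)
            break;
        rest.remove_prefix(nl + 1);
    }
}

The FileLines object is filled in place (rather than returned) so the string_views cannot be invalidated by a move of the owning string.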
UPDATE
This question is highly relevant: C++ Fast way to load large txt file in vector.
In the below code:
file_mapping fm(FilePath, read_only);
mapped_region region(fm, read_only);
char* const data = static_cast<char*>(region.get_address());
for(size_t n = 0; n < region.get_size(); ++n){
    cout << data[n];
}
is there any way to access characters from the mapped memory without needing to create the data array?
EDIT: the code above assumes using namespace boost::interprocess;
The data "array" is not actually created as an expensive allocation or copy - it's merely a pointer to the virtual memory space the OS uses to represent the file contents in memory. So that's a bit of bookkeeping but no actual significant work.
When you first access it (i.e. data[0]), the OS pages in the first block of the file using optimised routines that are more efficient than C++ streams or C's (f)read. Good OSes will also preload the second and subsequent blocks and silently drop old, already-read blocks, managing physical memory efficiently while being faster than you'd expect. Just make sure your file fits in your free virtual memory space - usually only a problem for 1+ GB files in 32-bit code.
So no, there's no other way - wanted or known - of accessing the contents. (I'm discounting re-opening the file using standard I/O routines!)
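If the raw pointer feels awkward, one option (still the very same pointer underneath, as explained above) is to wrap the mapped region in a std::string_view; a sketch assuming C++17 and the same Boost.Interprocess types as in the question, with a placeholder file name:

#include <iostream>
#include <string_view>
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>

using namespace boost::interprocess;

int main()
{
    file_mapping fm("data.txt", read_only);   // "data.txt" is a placeholder
    mapped_region region(fm, read_only);

    // No copy is made: the view simply refers to the mapped bytes.
    std::string_view contents(static_cast<const char*>(region.get_address()),
                              region.get_size());

    for (char c : contents)
        std::cout << c;
}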
I have a tough problem right now. I have a big dictionary file to load into my program; its format basically is:
word1 val1
word2 val2
word3 val3
...
...
This file has 170k lines, and its size is 3.9 MB on disk (in plain text). In my implementation I used boost::unordered_map (a hash table) to store this data, to support read-only lookup operations in my program.
Yet, after it is loaded into memory at runtime, the memory usage increases by 20 MB due to the loading operation (I checked this via the Private Working Set size in the Windows Task Manager; maybe this is not the right way to determine memory usage?). I know there must be some auxiliary data structures in the hash table that increase the memory usage, but I didn't expect the in-memory size to be as much as 5x the size on disk!
Is this normal? I have tried another hash map version from the std extension library, as well as a trie structure, and neither brought a significant improvement.
So I want to apply some space optimization to this problem. Could anyone give me some tips or keywords to guide me in improving the space usage?
A hash map data structure allocates much more memory than it uses at one time. This is to facilitate quick inserts and removals. When a hash table reaches a certain capacity (implementation defined, but it's a number like 50% full, 70% full, 90% full, etc.) it will reallocate more memory and copy everything over. The point is that it allocates more memory than is in use.
Also, the 20 MB that you see the program using is the size of all the memory your program is using, not just the one hash map.
Furthermore, if you are using std::string or an equivalent structure to store the values, you have already created a copy of half the data you get from the file: one copy in the buffer you read the file into, and another in the strings in the hash table.
If your strings have a reasonably small maximum size, you could store them in one large character array and use binary search for lookup (after sorting them of course).
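A rough sketch of that last suggestion (the fixed key width, the int value type, and all names are my assumptions):

#include <cstddef>
#include <cstring>
#include <vector>

// All keys are padded to a fixed width and stored back to back in one flat
// array, with the values in a parallel vector; both are sorted by key.
constexpr std::size_t kKeyWidth = 32;   // assumed maximum word length

struct FlatDict {
    std::vector<char> keys;    // size() == kKeyWidth * number of entries
    std::vector<int>  values;  // values[i] belongs to the key at keys[i * kKeyWidth]

    // Binary search over the flat key array; returns nullptr if not found.
    const int* find(const char* word) const
    {
        std::size_t lo = 0, hi = values.size();
        while (lo < hi) {
            const std::size_t mid = (lo + hi) / 2;
            const int cmp = std::strncmp(word, &keys[mid * kKeyWidth], kKeyWidth);
            if (cmp == 0) return &values[mid];
            if (cmp < 0)  hi = mid;
            else          lo = mid + 1;
        }
        return nullptr;
    }
};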
I normally use the method described in csv parser to read spreadsheet files. However, when reading a 64MB file which has around 40 columns and 250K rows of data, it takes about 4 minutes. In the original method, a CSVRow class is used to read the file row by row, and a private vector is used to store all the data in a row.
Several things to note:
I did reserve enough capacity for the vector, but it did not help much.
I also need to create instances of some class when reading each line, but even when the code just reads in the data without creating any instances, it takes a long time.
The file is tab-delimited instead of comma-delimited, but I don't think it matters.
Since some columns in that file are not useful data, I changed the method to keep a private string member that stores the whole line and then find the positions of the (n-1)th and nth delimiters to get the useful data (of course there are many useful columns). By doing so, I avoid some push_back operations and cut the time to a little more than 2 minutes. However, that still seems too long to me.
Here are my questions:
Is there a way to read such a spreadsheet file more efficiently?
Shall I read the file by buffer instead of line by line? If so, how do I read by buffer and use the CSVRow class?
I haven't tried boost tokenizer; is that more efficient?
Thank you for your help!
It looks like you're being bottlenecked by I/O. Instead of reading the file line by line, read it in blocks of maybe 8 MB. Parse each block for records and determine whether the end of the block is a partial record. If it is, copy that portion of the last record and prepend it to the next block. Repeat until the file is all read. This way, for a 64 MB file you're only making 8 I/O requests. You can experiment with the block size to determine what gives the best performance vs. memory usage.
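A rough sketch of that block-reading loop (the 8 MB block size follows the suggestion above; newline-terminated records and all names are my assumptions):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

constexpr std::size_t kBlockSize = 8 * 1024 * 1024;   // 8 MB per read

void read_in_blocks(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> block(kBlockSize);
    std::string carry;   // partial record left over from the previous block

    while (in) {
        in.read(block.data(), static_cast<std::streamsize>(block.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0)
            break;

        std::string chunk = carry;
        chunk.append(block.data(), got);

        // Keep the trailing partial record for the next iteration.
        const std::size_t lastNewline = chunk.rfind('\n');
        if (lastNewline == std::string::npos) {
            carry = chunk;   // no complete record in this chunk yet
            continue;
        }
        carry = chunk.substr(lastNewline + 1);
        const std::string records = chunk.substr(0, lastNewline + 1);
        // ... parse records line by line here (e.g. with the CSVRow class) ...
    }
    // carry now holds a final record if the file does not end in a newline.
}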
If reading all the data into memory is acceptable (and apparently it is), then I'd do this:
Read the whole file into a std::vector<char> (call it data).
Populate a vector<vector<vector<char>::size_type>> which contains, for each row, the position where the row starts and the positions of all commas in it. These positions denote the start/end of each cell.
Some code sketch to demonstrate the idea:
vector<vector<vector<char>::size_type> > rows;
for ( vector<char>::size_type i = 0; i < data.size(); ++i ) {
    vector<vector<char>::size_type> currentRow;
    currentRow.push_back( i );
    while ( data[i] != '\n' ) {
        if ( data[i] == ',' ) { // XXX consider comma at end of line
            currentRow.push_back( i );
        }
        ++i;
    }
    rows.push_back( currentRow );
}
// XXX consider files which don't end in a newline
Thus, you know the positions of all line starts and all commas, and you have the complete CSV data available as one contiguous memory block. So you can easily extract a cell's text like this:
// XXX error checking omitted for simplicity
string getCellText( int row, int col )
{
    // XXX Needs handling for the last cell of a line
    const vector<char>::size_type start = rows[row][col] + ( col > 0 ? 1 : 0 ); // skip the separator itself
    const vector<char>::size_type end = rows[row][col + 1];
    return string( data.begin() + start, data.begin() + end );
}
This article should be helpful.
In short:
1. Either use memory-mapped files OR read the file in 4-kilobyte blocks to access the data. Memory-mapped files will be faster.
2. Try to avoid using push_back, std::string operations (like +), and similar routines from the STL within the parsing loop. They are nice, but they all use dynamically allocated memory, and dynamic memory allocation is slow. Anything that is frequently allocated dynamically will make your program slower. Try to preallocate all buffers before parsing. Counting all the tokens in order to preallocate memory for them shouldn't be difficult (see the sketch at the end of this answer).
3. Use a profiler to identify what causes the slowdown.
4. You may want to avoid iostream's << and >> operators and parse the file yourself.
In general, an efficient C/C++ parser implementation should be able to parse a 20-megabyte text file within 3 seconds.
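As a small illustration of the preallocation point in item 2 (the names and the tab/newline counting are my assumptions, matching the tab-delimited file described in the question):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Count the separators once so the containers can be sized up front;
// content is assumed to already hold the whole file.
void preallocate_for(const std::string& content,
                     std::vector<std::string>& cells,
                     std::vector<std::size_t>& rowStarts)
{
    const std::size_t rows = static_cast<std::size_t>(
        std::count(content.begin(), content.end(), '\n'));
    const std::size_t tabs = static_cast<std::size_t>(
        std::count(content.begin(), content.end(), '\t'));

    rowStarts.reserve(rows + 1);
    cells.reserve(rows + tabs);   // one cell per tab plus one per line
}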