I have a file with multiple arrays of variable length arrays like this:
15
1 5 2 7
8 4 9
53 21 60 4 342 4321
...
Let's say the first number (15) gives the number of arrays, so that it's easier to understand (everything is random, though).
How can I read all the numbers from the file in C++ and put them into a variable, let's say x[100][100], so that when I write x[1][1] + x[2][2] it gives me 14 (5 + 9)? I thought about reading until the end of the line, but I don't know how to keep track of the columns.
If you have e.g. int x[100][100] and only need a few of those elements, you have quite a large amount of wasted memory that won't be used (but still exists and must be initialized).
The solution to that is pointers and dynamic allocation. Allocating the correct number of "sub-arrays" is easy once you have read the first line. The problem is how to handle the sub-arrays, since they all seem to have a variable number of elements. You can allocate a fixed number of elements for each sub-array and hope you will not need more (which brings back the issue of wasted memory that needs to be initialized). Some of the problems can be mitigated if you do two passes over the input: one to get the maximum number of elements in any line, and a second to actually read the data.
A second option is to read and dynamically allocate just enough elements for each line. This requires you to parse the input so you know when the line ends, and also to use reallocation as you add new numbers. You also need to keep track of the number of elements in each sub-array so you don't risk going out of bounds.
To keep track of the number of elements for each line you can either use a second array with the counts, or you can use an array of structures instead, where each structure contains the number of elements and the sub-array for that line.
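A minimal sketch of that second option, keeping the count in a structure next to each sub-array (the names Row and read_rows are just illustrative; error handling is omitted):
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

// Each Row owns its own element array and remembers how many elements it holds.
struct Row {
    std::size_t count;
    int        *elements;
};

Row *read_rows(std::ifstream &input_file, std::size_t &number_of_rows)
{
    std::string line;
    std::getline(input_file, line);          // first line: how many rows follow
    number_of_rows = std::stoul(line);

    Row *rows = new Row[number_of_rows];
    for (std::size_t i = 0; i < number_of_rows; ++i) {
        std::getline(input_file, line);
        std::istringstream iss(line);

        // First pass over the line: count the numbers.
        std::size_t n = 0;
        int value;
        while (iss >> value) ++n;

        // Second pass: allocate exactly n ints and fill them in.
        rows[i].count    = n;
        rows[i].elements = new int[n];
        iss.clear();
        iss.seekg(0);
        for (std::size_t k = 0; k < n; ++k) iss >> rows[i].elements[k];
    }
    return rows;   // caller must delete[] each rows[i].elements, then rows
}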
A better solution (now that I noticed this is a C++-tagged question and not C) is to use std::vector, or rather a vector of vectors (of int).
When you have read the first line and parsed its number, you know how many sub-vectors you need and can preallocate them.
Then it's just a matter of reading the rest of the data, which is very easy in C++ with std::getline, std::istringstream and std::istream_iterator.
Perhaps something like this (it assumes <string>, <sstream>, <iterator>, <vector> and <fstream> are included, and that input_file is an open std::ifstream):
std::string line;
// Get the first line, the amount of extra lines to read
std::getline(input_file, line);
// Create the vector (of vectors)
std::vector<std::vector<int>> data;
size_t number_of_sub_vectors = std::stoi(line);
// Preallocate memory for the sub-vectors
data.reserve(number_of_sub_vectors);
// Now read the data for each line
for (size_t i = 0; i < number_of_sub_vectors; ++i)
{
    // Get the data for the current line
    std::getline(input_file, line);
    // And put into an input string stream for parsing
    std::istringstream iss(line);
    // Create the sub-vector in-place, and populate it with the data from the file
    data.emplace_back(std::istream_iterator<int>(iss),
                      std::istream_iterator<int>());
}
Of course the above example doesn't have any kind of error handling, which is really needed.
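As a usage note: with the sample input from the question, data[0] is {1, 5, 2, 7} and data[1] is {8, 4, 9}, so the 5 and 9 the question mentions live at data[0][1] and data[1][2] (the count line itself is not stored, so the row indices shift relative to the question's x):
int sum = data[0][1] + data[1][2];              // 5 + 9 == 14
// .at() does the same but throws std::out_of_range on a bad index
// instead of silently reading out of bounds:
int checked_sum = data.at(0).at(1) + data.at(1).at(2);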
Related
Basically, I'm not sure why this is happening, but I really want to get it figured out. This is an idea I had and I'm trying to get it to work.
So what I am trying to do: Read strings from a file and store them into an array, basically.
But, I wanted to do this:
instead of initializing the array to some given constant size, I wanted to have a loop that first reads the number of lines in the file and assigns that value to a variable. That value then becomes the size of the array, since it corresponds to the number of elements I want to store.
I want each line of text in the file, i.e. each string, to be one element of the array.
Here is the code:
#include <iostream>
#include <string>
#include <fstream>
using namespace std;
int main() {
    ifstream the_file;
    int i, j, z;
    string lineCount;

    the_file.open("sentence.txt");
    j = 0;
    while(!the_file.eof())
    {
        getline(the_file, lineCount);
        j++;
    }
    z = j;

    string array[z];
    for (i = 0; i < z; i++) {
        getline(the_file, array[i]); // Array is empty
        cout << i << " " << array[i] << endl;
    }
    cout << z << endl;
    return 0;
}
So then after getting this number, I next want to actually store the elements in the array.
For this part, I went with a for loop that runs from index 0 up to the array size - 1, so that each element read is stored at one index of the array.
I just don't understand why, after running it, the array turns out to be empty. It has the size corresponding to the number of lines in the file, yet there don't appear to be any elements in any of its slots.
The thing is, I tried simply giving the array a constant size instead: read elements from the file and store them in the array without that first counting loop. In that case it works fine, all of the elements get stored normally and none of the slots are empty.
But why is it that, when I use a loop first to determine the size the array should be and then attempt to store the elements, the file seemingly doesn't get read anymore and the array slots stay empty?
Here is the output:
/Users/macuser/CLionProjects/PA2supp/cmake-build-debug/PA2supp
0
1
2
3
4
5
Process finished with exit code 0
Here is the file that I want read, "sentence.txt". Like I said, I want each line to be an element in my array:
This is just one sentence.
This is the second sentence.
This is the third sentence.
This is the fourth sentence.
Fifth sentence right here.
Please don't suggest using a vector; while I am aware of some of the benefits of vectors, I really want to do this specific thing with an array.
There are multiple bugs in the shown code.
string array[z];
Variable-length arrays are not standard C++; you're relying on a compiler extension here. Either use new to create an array of variable size, or use std::vector.
while(!the_file.eof())
This is a common bug: eof() only becomes true after a read has already failed, so the loop body runs one extra time and your eventual line count can be wrong. Loop on the result of getline() itself instead.
Then, at the end of this while loop, you have reached the end of the file.
And you're still at the end of the file when you start your second loop. The for loop therefore begins with the stream already at end-of-file, so getline() reads nothing on any iteration: the entire file was already consumed by the first while loop.
Remember the golden rule of computer programming: a computer will always do exactly what you tell it to do, and not what you want it to do. Here, you want the second loop to read the file from the beginning, but you haven't told your computer to do that. You will need to use seekg() to reposition the input stream to the beginning of the file before reading it again, and, for good measure, clear() the stream state (the file stream will have its end-of-file and/or fail bits set as a result of reaching the end of the file in the first loop).
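Putting those fixes together, a minimal corrected version of the program might look like this (kept as close to the original as possible; it still assumes the file is named sentence.txt):
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main() {
    ifstream the_file("sentence.txt");
    string line;

    // First pass: count the lines. Looping on getline() itself stops at the
    // real end of input, unlike the !eof() loop.
    int z = 0;
    while (getline(the_file, line))
        ++z;

    // The stream is now at end-of-file with its eof/fail bits set; clear the
    // state and seek back to the beginning before reading again.
    the_file.clear();
    the_file.seekg(0);

    // Variable-length arrays are not standard C++; use new[] instead.
    string *array = new string[z];
    for (int i = 0; i < z; ++i) {
        getline(the_file, array[i]);
        cout << i << " " << array[i] << endl;
    }
    cout << z << endl;

    delete[] array;
    return 0;
}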
Okay, thanks very much for the input.
I got it resolved: I closed the file with .close() and then opened it again, and now it does what I was aiming for. It's simple, but it went right by me at first; I missed the fact that I was still at the end of the file after the first loop! Thanks again for what you've added too, it's helpful.
I am wondering how memory is managed when different files are stored in a map of string vectors.
I have tried reading several files of 10 MB each into memory, and when I check the memory with KSysGuard, the usage shown is more than twice the total size of my files (~70 MB).
Here is a code example for it.
There is a function read_file():
std::vector<std::string> read_file(std::string& path){
    std::ifstream fichier(path);
    std::vector<std::string> fich;
    if(fichier){
        std::string ligne;
        while(std::getline(fichier, ligne)){
            fich.push_back(ligne);
        }
    }
    fichier.close();
    return fich;
}
This function is used in another one, which builds my map:
std::map<std::string, std::vector<std::string>> buildmap(std::string folder){
    std::map<std::string, std::vector<std::string>> evaluations;
    std::vector<std::string> vecFiles = {"file1", "file2", "file3"};
    for( auto i = 0; i < vecFiles.size(); i++ )
    {
        std::stringstream strad;
        strad << vecFiles[i];
        std::string path(folder + vecFiles[i]);
        std::vector<std::string> a = read_file(path);
        evaluations[strad.str()] = a;
    }
    return evaluations;
}
So, I do not understand why the memory usage is so high compared to the file sizes. Is there a more efficient way to construct this kind of container?
There is a lot of memory overhead in your scenario:
You store each file line as a separate std::string object. Each such object occupies some space itself (typically 24 or 32 bytes on a 64-bit architecture); however, the line characters are stored inside the object only when the string is short enough for the small/short string optimization (SSO) to apply (which common Standard library implementations do since C++11). If lines are long, the space for the characters is allocated dynamically, and each such allocation carries some additional memory overhead.
You push_back these std::string objects into a std::vector, which typically grows its internal buffer exponentially (such as doubling it when it runs out of space). That is why reserving space (std::vector::reserve) is used when you know the number of vector elements in advance.
This is the price for such a "comfortable" approach. What might help is to store the whole file contents as a single std::string and then store just indexes/pointers to the beginnings of the individual lines in a separate array/vector (though you then cannot treat these pointers as strings, since they won't be null-terminated; or, in fact, you can, if you substitute the newline characters with null characters).
In C++17, you can store the lines as instances of std::string_view into the whole file contents stored in a single std::string.
Just note that std::string_view will likely be larger than a pointer/index. For instance, with libstdc++ and x86_64, sizeof(std::string_view) is 16 bytes, while a pointer/index occupies 8 bytes. And for files smaller than 4 GB, you can even use 32-bit indexes. If you have a lot of lines in the processed files, these differences can matter.
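A minimal C++17 sketch of that idea, reading one file into a single std::string and keeping only std::string_view lines into it (the file name is illustrative, error handling omitted):
#include <fstream>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

int main() {
    // Read the whole file into one contiguous std::string.
    std::ifstream file("file1");
    std::ostringstream buffer;
    buffer << file.rdbuf();
    std::string contents = buffer.str();

    // Store one std::string_view per line; the views point into 'contents'
    // and allocate nothing themselves. 'contents' must outlive the views.
    std::vector<std::string_view> lines;
    std::string_view rest = contents;
    while (!rest.empty()) {
        std::size_t pos = rest.find('\n');
        lines.push_back(rest.substr(0, pos));
        if (pos == std::string_view::npos) break;
        rest.remove_prefix(pos + 1);
    }
}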
UPDATE
This question is highly relevant: C++ Fast way to load large txt file in vector.
I am trying to achieve something in C++, where I have an API that reads out objects from a byte array, while the array I pass in is constrained to a fixed size. After it parses out a complete object, the API knows the pointer location where it finishes reading (the beginning of next object to be read from but not complete in the current byte array).
Then I simply need to prepend the remaining bytes to the next fixed-size array I read, and start reading a new object at the pointer location as if it were the beginning of a new array.
I am new to C++, and the following approach works, but it looks rather cumbersome and inefficient. It requires three vectors and lots of clearing, reserving and inserting. I wonder if there is an alternative that is more efficient, or at least as efficient but with much more concise code. I've been reading about things like stringstream, but they don't seem to require less memory copying (probably more, as my API requires that a byte array gets passed in). Thanks!
std::vector<char> checkBuffer;
std::vector<char> remainingBuffer;
std::vector<char> readBuffer(READ_BUFFER_SIZE);
size_t pointerPosition = 0;
// loop while I still have stuff to read from the input stream
while (in.good()) {
    in.read(readBuffer.data(), READ_BUFFER_SIZE);
    // This is the holding buffer for the API to parse an object from
    checkBuffer.clear();
    // concatenate what's remaining in remainingBuffer (initially empty)
    // with what's newly read from the input inside readBuffer
    checkBuffer.reserve(remainingBuffer.size() + readBuffer.size());
    checkBuffer.insert(checkBuffer.end(), remainingBuffer.begin(),
                       remainingBuffer.end());
    checkBuffer.insert(checkBuffer.end(), readBuffer.begin(),
                       readBuffer.end());
    // Call the API here; it also reports back pointerPosition, i.e. where
    // it was inside the buffer when it finished reading the object
    Object parsedObject = parse(checkBuffer, &pointerPosition);
    // Then calculate the number of bytes not yet consumed in checkBuffer
    int remainingBufSize = checkBuffer.size() - pointerPosition;
    remainingBuffer.clear();
    remainingBuffer.reserve(remainingBufSize);
    // Then just copy whatever is remaining in checkBuffer into
    // remainingBuffer so it gets used in the next iteration
    remainingBuffer.insert(remainingBuffer.end(),
                           checkBuffer.begin() + pointerPosition,
                           checkBuffer.end());
}
Write append_chunk_into(in, vect). It appends one chunk of data at the end of vect, resizing as needed. As an aside, a char-sized standard-layout struct that does not zero its memory might be a better choice of element type than char.
To append to end:
size_t old_size = vect.size();
vect.resize(vect.size() + new_bytes);
in.read(vect.data() + old_size, new_bytes);
or whatever the read API is.
To parse, feed it vect.data() and get back a pointer ptr to where parsing ended.
Then vect.erase(vect.begin(), vect.begin() + (ptr - vect.data())) removes the parsed bytes. (Only do this after you have parsed everything you can from the buffer, to avoid wasted memory moves.)
One vector. It reuses its memory and never grows larger than the read size plus the size of the largest object minus one, so you can pre-reserve it.
But really, most of the time will usually be spent on I/O, so focus your optimization on keeping the data flowing smoothly.
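A sketch of that single-vector loop, assuming the same READ_BUFFER_SIZE constant, Object type and parse(buffer, &pointerPosition) API as in the question (pointerPosition taken as an index into the buffer):
#include <cstddef>
#include <istream>
#include <vector>

void read_all(std::istream& in)
{
    std::vector<char> buffer;
    buffer.reserve(READ_BUFFER_SIZE * 2);   // roughly read size + largest object

    while (in.good()) {
        // Append one chunk at the end of the not-yet-parsed bytes.
        std::size_t old_size = buffer.size();
        buffer.resize(old_size + READ_BUFFER_SIZE);
        in.read(buffer.data() + old_size, READ_BUFFER_SIZE);
        buffer.resize(old_size + in.gcount());   // trim a short final read

        // Parse what we can; the API reports where it stopped.
        std::size_t pointerPosition = 0;
        Object parsedObject = parse(buffer, &pointerPosition);

        // Remove only the parsed bytes; the tail is reused next iteration.
        buffer.erase(buffer.begin(), buffer.begin() + pointerPosition);
    }
}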
If I were in your position I would keep only the readBuffer, and reserve READ_BUFFER_SIZE + sizeof(LargestMessage).
After parsing, you get back a pointer to the last thing the API was able to read in the vector. The end of the buffer (readBuffer.data() + readBuffer.size()) then bounds the leftover data, which you copy to the head of the vector. Once that leftover data sits at the head of the vector, you can issue the same read call again, just offset by the number of bytes remaining. You do need some way of determining how many characters are left over, but that shouldn't be insurmountable.
Basically I am wondering what would be a faster way of handling input from standard input:
Method one: declare an array of some arbitrary size, read into it, and if the input exceeds that size, allocate a new array of twice the size, copy the contents over, and deallocate the previous array.
Method two: read the whole input and count the number of lines while reading, reset the pointer back to the top of the input, declare an array whose length is that line count, and then read the input into that array.
some background:
I'm not using vectors. Please don't say to just use vectors...
The input won't be typed; it will be redirected from a file on the command line, akin to ./program < input.txt.
I understand that the first method is less efficient in terms of space, but is it faster than method two? If so, by how much? Method two essentially takes 2n time to finish. I want to know whether the first method would increase the runtime of my code.
Both methods are O(n). However, you're reading from stdin, so there's no way to rewind it back to the beginning unless something is already storing the data somewhere, so I don't see how you could use method 2.
You would need to use method 1. If you can use realloc, it might not even have to do any copying. If you're worried about the extra copying, you can store the items in a linked-list of buffers of exponentially increasing size, then create a single array at the end and copy each one only once.
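A sketch of method one without vectors, growing a plain new[]-allocated array of std::string by doubling (realloc is not usable here, since std::string is not trivially copyable, so the copy is explicit):
#include <iostream>
#include <string>
#include <utility>

int main() {
    std::size_t capacity = 16;         // arbitrary starting size
    std::size_t count = 0;
    std::string *lines = new std::string[capacity];

    std::string line;
    while (std::getline(std::cin, line)) {
        if (count == capacity) {
            // Full: allocate twice the size, move the contents, free the old array.
            std::size_t new_capacity = capacity * 2;
            std::string *bigger = new std::string[new_capacity];
            for (std::size_t i = 0; i < count; ++i)
                bigger[i] = std::move(lines[i]);
            delete[] lines;
            lines = bigger;
            capacity = new_capacity;
        }
        lines[count++] = line;
    }

    std::cout << count << " lines read\n";
    delete[] lines;
    return 0;
}
The total copying across all reallocations is bounded by about 2n moves, so method one stays O(n) overall while reading the input only once.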
I normally use the method described in csv parser to read spreadsheet files. However, when reading a 64MB file which has around 40 columns and 250K rows of data, it takes about 4 minutes. In the original method, a CSVRow class is used to read the file row by row, and a private vector is used to store all the data in a row.
Several things to note:
I did reserve enough capacity for the vector, but it did not help much.
I also need to create instances of some class when reading each line, but even when the code just reads in the data without creating any instances, it takes a long time.
The file is tab-delimited instead of comma-delimited, but I don't think it matters.
Since some columns in that file are not useful data, I changed the method to have a private string member to store all the data and then find the position of the (n-1)th and the nth delimiter to get the useful data (of course there are many useful columns). By doing so, I avoid some push_back operations, and cut the time to a little more than 2 minutes. However, that still seems too long to me.
Here are my questions:
1. Is there a way to read such a spreadsheet file more efficiently?
2. Should I read the file by buffer instead of line by line? If so, how do I read by buffer and still use the CSVRow class?
3. I haven't tried the Boost tokenizer; is it more efficient?
Thank you for your help!
It looks like you're being bottlenecked by I/O. Instead of reading the file line by line, read it in blocks of maybe 8 MB. Parse the block for records and determine whether the end of the block is a partial record. If it is, copy that portion of the last record and prepend it to the next block. Repeat until the whole file is read. This way, for a 64 MB file you're only making 8 I/O requests. You can experiment with the block size to determine what gives the best performance vs. memory usage.
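A sketch of that block-reading loop, assuming records are newline-terminated rows; process_record() is a placeholder for whatever per-row parsing you already do:
#include <cstddef>
#include <istream>
#include <string>
#include <vector>

void process_record(const std::string &record) { /* parse one row here */ }

void read_in_blocks(std::istream &in, std::size_t block_size = 8 * 1024 * 1024)
{
    std::vector<char> block(block_size);
    std::string carry;                       // partial record from the last block

    while (in) {
        in.read(block.data(), block.size());
        std::streamsize got = in.gcount();
        if (got == 0) break;

        std::string chunk = carry + std::string(block.data(), got);

        // Everything up to the last newline is complete records.
        std::size_t last_newline = chunk.rfind('\n');
        if (last_newline == std::string::npos) {
            carry = chunk;                   // no complete record yet
            continue;
        }

        std::size_t start = 0;
        while (start <= last_newline) {
            std::size_t end = chunk.find('\n', start);
            process_record(chunk.substr(start, end - start));
            start = end + 1;
        }
        carry = chunk.substr(last_newline + 1);
    }
    if (!carry.empty()) process_record(carry);   // file without trailing newline
}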
If reading the whole data into memory is acceptable (and apparently it is), then I'd do this:
Read the whole file into a std::vector<char> called data.
Populate a vector<vector<vector<char>::size_type>> which contains, for each row, the start position of the row and the positions of the commas in it. These positions denote the start/end of each cell.
Some code sketch to demonstrate the idea:
vector<vector<vector<char>::size_type> > rows;
for ( vector<char>::size_type i = 0; i < data.size(); ++i ) {
    vector<vector<char>::size_type> currentRow;
    currentRow.push_back( i );
    while ( i < data.size() && data[i] != '\n' ) {
        if ( data[i] == ',' ) { // XXX consider comma at end of line
            currentRow.push_back( i );
        }
        ++i;
    }
    rows.push_back( currentRow );
}
// XXX consider files which don't end in a newline
Thus, you know the positions of all newlines and all commas, and you have the complete CSV data available as one contiguous memory block. So you can easily extract a cell's text like this:
// XXX error checking omitted for simplicity
string getCellText( int row, int col )
{
    // XXX Needs handling for last cell of a line
    const vector<char>::size_type start = rows[row][col];
    const vector<char>::size_type end = rows[row][col + 1];
    return string( data.begin() + start, data.begin() + end );
}
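For completeness, the first step (slurping the whole file into the contiguous data block used above) might look like this; the file name is illustrative and error handling is omitted:
#include <fstream>
#include <iterator>
#include <vector>
using namespace std;   // matching the sketch above

// Read the complete file into one contiguous block of memory.
ifstream file("spreadsheet.csv", ios::binary);
vector<char> data((istreambuf_iterator<char>(file)),
                  istreambuf_iterator<char>());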
This article should be helpful.
In short:
1. Either use memory-mapped files OR read the file in 4 KB blocks to access the data. Memory-mapped files will be faster.
2. Try to avoid using push_back, std::string operations (like +) and similar STL routines inside the parsing loop. They are nice, but they all use dynamically allocated memory, and dynamic memory allocation is slow. Anything that is frequently allocated dynamically will make your program slower. Try to preallocate all buffers before parsing; counting all tokens in order to preallocate memory for them shouldn't be difficult.
3. Use a profiler to identify what causes the slowdown.
4. You may want to avoid iostream's << and >> operators and parse the file yourself.
In general, an efficient C/C++ parser implementation should be able to parse a 20 MB text file within 3 seconds.
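For option 1, a POSIX-only sketch of the memory-mapped approach (the file name is illustrative; every call below can fail and should be checked in real code):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    // Error handling omitted for brevity.
    int fd = open("data.csv", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // Map the file read-only; the OS pages it in on demand, with no read()
    // copies into user-space buffers.
    const char *begin = static_cast<const char *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    const char *end = begin + st.st_size;

    // ... scan [begin, end) for '\t' and '\n' positions here ...

    munmap(const_cast<char *>(begin), st.st_size);
    close(fd);
}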