C++ reading random lines of txt? - c++

I am running C++ code that needs to import data from a txt file.
The text file contains 10,000 lines. Each line contains n columns of binary data.
The code has to loop 100,000 times; each time it has to randomly select a line from the txt file and assign the binary values in the columns to some variables.
What is the most efficient way to write this code? Should I load the file into memory first, or should I open the file at a random line number each time?
How can I implement this in C++?

To access a line of a text file at random, all lines need to have the same byte length, so that the offset of a given line can be computed directly. If they don't, you have to read from the start until you reach the correct line. Since that is slow when repeated 100,000 times, it is better to load the file into a std::vector of std::string, one entry per line (this is easily done with std::getline). Or, since you want to assign values from the different columns, you can use a std::vector of your own struct, like
struct MyValues{
double d;
int i;
// whatever you have / need
};
std::vector<MyValues> vec;
This might be better than parsing the line every time it is selected.
With the std::vector you get random access and only have to loop through the whole file once.
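A minimal sketch of the load-once approach, assuming each line holds whitespace-separated 0/1 values. The Row alias and the names loadRows and pickRandomRow are made up for illustration, and std::mt19937 is used here in place of rand():

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical row type: one line of whitespace-separated 0/1 values.
using Row = std::vector<int>;

// Parse the whole stream once; each non-empty line becomes one Row.
std::vector<Row> loadRows(std::istream& in) {
    std::vector<Row> rows;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        Row row;
        int bit;
        while (fields >> bit) row.push_back(bit);
        if (!row.empty()) rows.push_back(row);
    }
    return rows;
}

// Pick one row uniformly at random; rows must not be empty.
const Row& pickRandomRow(const std::vector<Row>& rows, std::mt19937& rng) {
    assert(!rows.empty());
    std::uniform_int_distribution<std::size_t> dist(0, rows.size() - 1);
    return rows[dist(rng)];
}
```

With a file, you would pass a std::ifstream to loadRows once and then call pickRandomRow inside the 100,000-iteration loop.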

10K lines is a pretty small file.
If you have, say, 100 chars per line, it will use the HUGE amount of 1MB of your RAM.
Load it to a vector and access it the way you want.

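Maybe not THE most efficient, but you could try this:
#include <fstream>
#include <limits>
#include <random>
#include <string>
using namespace std;

int main() {
    // use ifstream to read
    ifstream in("yourfile.txt", ios::binary);
    in.seekg(0, ios::end);
    const long long size = in.tellg();   // file size in bytes, -1 on failure
    if (size <= 0) return 1;
    string line;                         // string to store the line
    mt19937 gen(random_device{}());      // random number generator
    uniform_int_distribution<long long> offset(0, size - 1);
    for (int i = 0; i < 100000; i++) {
        in.clear();                      // reset eof from the previous pass
        in.seekg(offset(gen));           // seekg takes a byte offset, not a line number
        in.ignore(numeric_limits<streamsize>::max(), '\n'); // skip the partial line
        if (!getline(in, line)) continue; // landed in the last line, try again
        // do what you want with the line here...
    }
}
Note that seekg() takes a byte offset, not a line number, so this seeks to a random byte and reads the line that follows it. The first line is therefore never selected, and a line that follows a long line is picked more often. I'm too lazy right now, but you should also check your ifstream for any other errors you care about.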

Since you're taking 100,000 samples from just 10,000 lines, the majority of lines will be sampled. Read the entire file into an array data structure, and then randomly sample the array. This avoids file seeking entirely.
The more common case is to sample only a small subset of the file's data. To do that, assuming the lines have different lengths, seek to random points in the file, skip to the next newline (for example by calling ignore( numeric_limits< streamsize >::max(), '\n' ) on the file stream), and then parse the subsequent text.

Related

Counting the number of data points in a line from ifstream

I have a bunch of data files I need to read in to some multidimensional container, all of which are of the following form:
a1,a2,a3,...,aN,
b1,b2,b3,...,bN,
c1,c2,c3,...,cN,
................
z1,z2,z3,...,zN,
I know from this previous question that a quick way of counting the total number of lines in a file can be achieved as follows:
std::ifstream is("filename");
int lines = std::count(std::istreambuf_iterator<char>(is), std::istreambuf_iterator<char>(), '\n');
This lets me know what z, the total number of data sets to read in, each of which contains N data points. The next challenge is to count the number of data values per line, for which I can do the following:
std::ifstream is("filename");
std::string line;
std::getline(is, line);
std::istringstream line_(line);
int points = std::count(std::istreambuf_iterator<char>(line_), std::istreambuf_iterator<char>(), ',');
I can be confident that each file has the same number of data values per line. My question is, is there a nicer/faster way of achieving the above without resorting to getline and dumping a single line into a string? I was wondering if this could be achieved with stream buffers, but having done a bit of searching it's not quite clear to me.
Any help would be much appreciated, thank-you!
If you were required to use
int points = std::count(std::istreambuf_iterator<char>(line_), std::istreambuf_iterator<char>(), ',');
for every line of text, I would advise you to look for a way to make it more efficient.
However, you said:
I can be confident that each file has the same amount of data values per line.
That means you can compute the number of points from the first line and assume it to be valid for the rest of the lines.
I wouldn't sweat it for a one time call.

ifstream how to start read line from a particular line using c++

I'm using ifstream to parse a file in C++ code. I'm not able to use seekg() and tellg() to jump to a particular line of the file.
In particular, I would like to read a line with getline from a particular position in the file, a position saved in the previous iteration.
You just have to skip the required number of lines.
The best way to do it is to discard the lines with std::istream::ignore:
for (int currLineNumber = 0; currLineNumber < startLineNumber; ++currLineNumber) {
    if (addressesFile.ignore(numeric_limits<streamsize>::max(), addressesFile.widen('\n'))) {
        // just skipping the line
    } else {
        // todo: handle the error
    }
}
The first argument is the maximum number of characters to extract. If this is exactly numeric_limits<streamsize>::max(), there is no limit.
You should use it instead of std::getline because it performs better: the skipped characters are not stored anywhere.
It seems there is no specific C++ function like "seekline" for your needs, and I see two ways to solve this task:
1. Pad every line in the text file with spaces beforehand so that all lines have a constant length L. Then, to seek to line N, just use seekg with an offset of L * N.
2. This method is more complicated. Create an auxiliary binary file in which each byte holds the length of one line of the source file (so a line can be at most 255 bytes long). This auxiliary file is a kind of database. Load it into an array during your program's initialization phase. The offset of line N in the text file is then the sum of the first N array elements. Of course, the auxiliary file has to be updated whenever the source file is.
The first approach is more efficient if the text file can tolerate the extra size. The second approach gives the best performance for long text files with rare edit operations.

c++ overwriting file data?

I am trying to run a program to replace certain data within a file. The relevant parts of the file that I am trying to replace look like the following:
1 Information 15e+10
2 Information 2e+16
3 Information 6e+2
And so on.
The files in question can be very large, in the multiple-gigabyte range, and to my understanding this makes buffering the whole file and rewriting it impossible or unreasonable. That is fine; I just want to replace the values (e.g. the 15e+10).
This all works with simple ios::in|ios::out and tellp() if I am replacing the value with one of the same size (15e+10->12e+12), or even with a smaller one, as I can simply add an extra space that can be ignored down the line (e.g. 15e+10->4e+10 ). But I run into a problem when the replacement value is longer than the one already in the file (e.g. 6e+2->16e+10): it will write over the newline character or start writing over the information in the next line.
I have searched the forums and everyone says you can either overwrite in place, append to the end of the file, or buffer and recreate the whole file. Is there any way I can overwrite the value correctly without having to recreate the file?
If not, how can I have 2 files open (1 input, 1 output) to do this if the files in question are too large for memory?
Note: I would also like to avoid using boost:: as I need to be able to run this on a system without the boost library.
Open a stream to read from the input (IN) file and a second stream (OUT) to write to a new output (tmp) file.
Read from IN and write to OUT. When you get a value from IN that you want to replace write the replacement to OUT instead of the value you got from IN.
When parsing is complete replace the first file with the second (tmp) file.
Would this work for you?
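A sketch of that copy loop, assuming every data line has the form "<id> <name> <value>" as in the question. The function name copyWithReplacements and the updates map (id to new value text) are made up for illustration. Only one line is held in memory at a time, so the file size does not matter.

```cpp
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Copy in to out line by line. When a line's id appears in updates,
// its value column is replaced; every other line is copied unchanged.
void copyWithReplacements(std::istream& in, std::ostream& out,
                          const std::map<long, std::string>& updates) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        long id;
        std::string name, value;
        if (fields >> id >> name >> value) {
            auto it = updates.find(id);
            if (it != updates.end()) {
                out << id << ' ' << name << ' ' << it->second << '\n';
                continue;
            }
        }
        out << line << '\n';   // unchanged (or unparseable) line
    }
}
```

With real files you would pass a std::ifstream for the original and a std::ofstream for the tmp file, close both, and then replace the original, for example with std::rename from <cstdio>.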
Use lseek()/fseek() to jump to a given position in a file.
You can use seekp to go to the location and rewrite it with <<
Example:
example.txt ( |?| = 1 byte of data )
|A|B|C|\n|1|2|3|D|E|F|\n|4|5|6|
// Somewhere in the code
fstream file;
file.open("example.txt");
// Somehow find the character distance and store it in "distance"
file.seekp(distance); // if distance is 0, this goes to "A", like rewind() but easier for me
// If distance is 4, the next character to be overwritten is "1"
file << "987";
And the file will be
|A|B|C|\n|9|8|7|D|E|F|\n|4|5|6|
BUT the only problem here is when you need to increase/decrease the size:
Increase:
You would overwrite the characters that follow, so you need a temporary string to hold the rest of the data (or process it in smaller chunks if the data is too large), like
|A|B|C|\n|9|8|7|D|E|F|\n|4|5|6|
string tempstring;
file.seekg(distance);
// read everything from distance to the end of the file, whitespace included
tempstring.assign(istreambuf_iterator<char>(file), istreambuf_iterator<char>());
file.seekp(distance);
file << content << tempstring; // content is the new data
Decrease:
The easiest solution is to write null characters (\0) over the excess space, like
|A|B|C|\n|1|\0|\0|D|E|F|\n|4|5|6|
The only side effect is that the file size stays the same as before

How to read partial data from large text file in C++

I have a big text file with more than 200,000 lines, and I need to read just a few of them, for instance lines 10,000 to 20,000.
Important: I don't want to open and search the full file to extract these lines, because of performance issues.
Is this possible?
If the lines are fixed length, then it would be possible to seek to a specific byte position and load just the lines you want. If lines are variable length, the only way to find the lines you're looking for is to parse the file and count the number of end-of-line markers. If the file changes infrequently, you might be able to get sufficient performance by performing this parsing once and then keeping an index of the byte positions of each line to speed future accesses (perhaps writing that index to disk so it doesn't need to be done every time your program is run).
You will have to search through the file to count the newlines, unless you know that all lines are the same length (in which case you could seek to the offset = line_number * line_size_in_bytes, where line_number counts from zero and line_size_in_bytes includes all characters in the line).
If the lines are variable / unknown length then while reading through it once you could index the beginning offset of each line so that subsequent reads could seek to the start of a given line.
If these lines are all the same length you could compute an offset for a given line and read just those bytes.
If the lines are varying length then you really have to read the entire file to count how many lines there are. Line terminating characters are just arbitrary bytes in the file.
If the lines are fixed length then you just compute the offset, no problem.
If they're not (i.e. a regular CSV file) then you'll need to go through the file, either to build an index or to just read the lines you need. To make the file reading a little faster a good idea would be to use memory mapped files (see the implementation that's part of the Boost iostreams: http://www.boost.org/doc/libs/1_39_0/libs/iostreams/doc/classes/mapped_file.html).
As others noted, if you do not have the lines of fixed width, it is impossible to do without building the index. However, if you are in control of the format of the file, you can get a ~O(log(size)) instead of O(size) performance in finding the start line, if you manage to store number of the line itself on each line, i.e. to have the file contents look something like this:
1: val1, val2, val3
2: val4
3: val5, val6
4: val7, val8, val9, val10
With this format of the file, you can quickly find the needed line by binary search: start with seeking into the middle of the file. Read till the next newline. Then read the line, and parse the number. If the number is bigger than the target, then you need to repeat the algorithm on the first half of the file, if it is smaller than the target line number, then you need to repeat it on the second half of the file.
You'd need to be careful about the corner cases (e.g.: your "beginning" of the range and "end" of the range are on the same line, etc.), but for me this approach worked excellently in the past for parsing the logfiles which had the date in it (and I needed to find the lines that are between the certain timestamps).
Of course, this still does not beat the performance of the explicitly built index or the fixed-size records.

How do you read a word in from a file in C++?

So I was feeling bored and decided I wanted to make a hangman game. I did an assignment like this back in high school when I first took C++. But this was before I even took geometry, so unfortunately I didn't do well in it in any way, shape or form, and after the semester I trashed everything in a fit of rage.
I'm looking to make a txt document and just throw in a whole bunch of words
(ie:
test
love
hungery
flummuxed
discombobulated
pie
awkward
you
get
the
idea
)
So here's my question:
How do I get C++ to read a random word from the document?
I have a feeling #include<ctime> will be needed, as well as srand(time(0)); to get some kind of pseudorandom choice...but I haven't the foggiest on how to have a random word taken from a file...any suggestions?
Thanks ahead of time!
Here's a rough sketch, assuming that the words are separated by whitespaces (space, tab, newline, etc):
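vector<string> words;
ifstream in("words.txt");
string word;
while (in >> word) { // stops at end of file without storing an extra empty word
    words.push_back(word);
}
string r = words[rand() % words.size()]; // assumes the file held at least one word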
The operator >> used on a string will read 1 (white) space separated word from a stream.
So the question is do you want to read the file each time you pick a word or do you want to load the file into memory and then pick up the word from a memory structure. Without more information I can only guess.
Pick a Word from a file:
// Note that an ifstream is also an istream.
std::string pickWordFromAStream(std::istream& s, std::size_t pos)
{
    std::istream_iterator<std::string> iter(s);
    for (; pos; --pos)
    {
        ++iter;
    }
    // This code assumes that pos is smaller than
    // the number of words in the file
    return *iter;
}
Load a file into memory:
// words is taken by reference so that the caller sees the loaded words
void loadStreamIntoVector(std::istream& s, std::vector<std::string>& words)
{
    std::copy(std::istream_iterator<std::string>(s),
              std::istream_iterator<std::string>(),
              std::back_inserter(words)
             );
}
Generating a random number should be easy enough, assuming you only want pseudo-random.
I would recommend creating a plain text file (.txt) in Notepad and using the standard C file APIs (fopen(), and fread()) to read from it. You can use fgets() to read each line one at a time.
Once you have your plain text file, just read each line into an array and then randomly choose an entry in the array using the method you've suggested above.