C++ Read file into Array / List / Vector

I am currently working on a small program to join two text files (similar to a database join). One file might look like:
269ED3
86356D
818858
5C8ABB
531810
38066C
7485C5
948FD4
The second one is similar:
hsdf87347
7485C5
rhdff
23487
948FD4
Both files have over 1,000,000 lines and are not limited to a specific number of characters per line. What I would like to do is find all matching lines in both files.
I have tried a few things (arrays, vectors, lists), but I am struggling to decide which approach is best (fastest and most memory-efficient).
My code currently looks like:
#include <iostream>
#include <fstream>
#include <string>
#include <ctime>
#include <list>
#include <algorithm>
#include <iterator>

using namespace std;

int main()
{
    string line;
    clock_t startTime = clock();

    list<string> data;
    // read first file
    ifstream myfile("test.txt");
    if (myfile.is_open())
    {
        while (getline(myfile, line)) {
            data.push_back(line);
        }
        myfile.close();
    }

    list<string> data2;
    // read second file
    ifstream myfile2("test2.txt");
    if (myfile2.is_open())
    {
        while (getline(myfile2, line)) {
            data2.push_back(line);
        }
        myfile2.close();
    }
    else cout << "Unable to open file";

    // comparison logic still missing; roughly:
    // if data[j] == data2[k], output the match
    // if data[j] < data2[k], j++
    // if data[j] > data2[k], k++
    return 0;
}
My thinking is: with a vector, random access to elements is very difficult and jumping to the next element is not optimal (not shown in the code, but I hope you get the point). It also takes a long time to read the file into a vector by pushing back the lines one by one. With arrays, random access is easier, but reading more than 1,000,000 records into an array is very memory-intensive and also takes a long time. Lists can read the files faster, but random access is expensive again.
Eventually I will not only look for exact matches, but also for the first 4 characters of each line.
Can you please help me decide what the most efficient way is? I have tried arrays, vectors, and lists, but am not satisfied with the speed so far. Is there any other way to find matches that I have not considered? I am very happy to change the code completely; looking forward to any suggestion!
Thanks a lot!
EDIT: The output should list the matching values / lines. In this example the output is supposed to look like:
7485C5
948FD4

Reading 2 million lines won't be too slow; what is likely slowing you down is your comparison logic.
Use std::set_intersection:
std::sort(data1.begin(), data1.end()); // N1 log(N1)
std::sort(data2.begin(), data2.end()); // N2 log(N2)

std::vector<std::string> v; // receives the matching elements
std::set_intersection(data1.begin(), data1.end(),
                      data2.begin(), data2.end(),
                      std::back_inserter(v));
// Does at most 2*(N1+N2)-1 comparisons
You can also try a std::set: insert the first file's lines into it, then try inserting each line of the second file; since a set holds only unique elements, a failed insertion signals a match.
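Putting that together, here is a minimal end-to-end sketch of the sort-plus-intersection approach. The file names test.txt and test2.txt come from the question; using std::vector (so that std::sort applies) is my assumption:

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

// read a whole file into a vector of lines
static std::vector<std::string> read_lines(const char *name) {
    std::ifstream in(name);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}

int main() {
    std::vector<std::string> data1 = read_lines("test.txt");
    std::vector<std::string> data2 = read_lines("test2.txt");

    std::sort(data1.begin(), data1.end()); // N1 log(N1)
    std::sort(data2.begin(), data2.end()); // N2 log(N2)

    std::vector<std::string> matches;
    std::set_intersection(data1.begin(), data1.end(),
                          data2.begin(), data2.end(),
                          std::back_inserter(matches));

    // print the matching lines, e.g. 7485C5 and 948FD4 for the sample data
    std::copy(matches.begin(), matches.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
}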

If the values in the first file are unique, this becomes trivial by exploiting the O(log n) lookup characteristics of a std::set. The following stores all lines of the first file (passed as a command-line argument) in a set, then performs an O(log n) search for each line of the second file.
EDIT: Added 4-character-prefix searching. To do this, the set contains only the first four characters of each line, and the search over the second file likewise uses only the first four characters of each line. A second-file line is printed in its entirety if there is a match. Printing the first file's full line as well would be a bit more challenging (see the sketch after the code).
#include <iostream>
#include <fstream>
#include <string>
#include <set>
#include <cstdlib>

int main(int argc, char *argv[])
{
    if (argc < 3)
        return EXIT_FAILURE;

    // load set with the 4-char prefixes from the first file
    std::ifstream inf(argv[1]);
    std::set<std::string> lines;
    std::string line;
    while (std::getline(inf, line))
        lines.insert(line.substr(0, 4));

    // read second file, printing every line whose prefix matches
    std::ifstream inf2(argv[2]);
    while (std::getline(inf2, line))
    {
        if (lines.find(line.substr(0, 4)) != lines.end())
            std::cout << line << std::endl;
    }
    return 0;
}
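As a hedged sketch of the harder variant mentioned in the edit (printing the first file's matching lines in full as well), one could key a std::multimap on the 4-character prefix. This extension is mine, not part of the original answer:

#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main(int argc, char *argv[]) {
    if (argc < 3)
        return 1;

    // map 4-char prefix -> full line(s) from the first file
    std::multimap<std::string, std::string> byPrefix;
    std::ifstream inf(argv[1]);
    std::string line;
    while (std::getline(inf, line))
        byPrefix.emplace(line.substr(0, 4), line);

    // for each second-file line, print every first-file line sharing its prefix
    std::ifstream inf2(argv[2]);
    while (std::getline(inf2, line)) {
        auto range = byPrefix.equal_range(line.substr(0, 4));
        for (auto it = range.first; it != range.second; ++it)
            std::cout << it->second << " matches " << line << '\n';
    }
}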

One solution is to read each entire file at once.
Use istream::seekg and istream::tellg to figure out the sizes of the two files. Allocate a character array large enough to store them both, then read each file into the appropriate part of the array using istream::read.
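A rough sketch of that approach (the helper name read_all and the file names are my assumptions; error handling is mostly omitted):

#include <fstream>
#include <vector>

// Read the whole file `name` into `buf` starting at `offset`; returns bytes read.
static std::streamsize read_all(const char *name, std::vector<char> &buf,
                                std::size_t offset) {
    std::ifstream in(name, std::ios::binary);
    in.seekg(0, std::ios::end);        // jump to the end...
    std::streamsize size = in.tellg(); // ...to learn the file size
    if (size < 0) return 0;            // open failed
    in.seekg(0, std::ios::beg);

    if (buf.size() < offset + static_cast<std::size_t>(size))
        buf.resize(offset + static_cast<std::size_t>(size));
    in.read(buf.data() + offset, size); // one bulk read
    return size;
}

int main() {
    std::vector<char> buffer;
    std::streamsize n1 = read_all("test.txt", buffer, 0);
    read_all("test2.txt", buffer, static_cast<std::size_t>(n1));
    // buffer now holds both files back to back
}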


Reading a specific line from a .txt file

I have a text file full of names:
smartgem
marshbraid
seamore
stagstriker
meadowbreath
hydrabrow
startrack
wheatrage
caskreaver
seaash
I want to code a random name generator that will copy a specific line from the .txt file and return it.
While reading from a file you must start at the beginning and continue on. My best advice, if you don't have stringent efficiency concerns, is to read in all of the names, store them in a container such as a vector, and pick one at random from there.
You cannot pick a random string from near the end of the file without first reading up to that name in the file.
You may also want to look at fseek() which will allow you to "jump" to a location within the input stream. You could randomly generate an offset and then provide that as an argument to fseek().
http://www.cplusplus.com/reference/cstdio/fseek/
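For illustration, a minimal sketch of that idea using C stdio. Note the caveat: seeking to a random byte and taking the next full line is biased toward lines that follow long lines, and the file name names.txt is an assumption:

#include <cstdio>
#include <cstdlib>
#include <ctime>

int main() {
    std::FILE *f = std::fopen("names.txt", "rb");
    if (!f) return 1;

    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);

    std::srand((unsigned)std::time(nullptr));
    std::fseek(f, std::rand() % size, SEEK_SET); // jump somewhere random

    char line[256];
    // we probably landed mid-line: discard the partial line, then read the next
    if (std::fgets(line, sizeof line, f) && std::fgets(line, sizeof line, f))
        std::printf("%s", line);

    std::fclose(f);
    return 0;
}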
You cannot do that unless you do one of two things:
Generate an index for that file, containing the address of each line, then you can go straight to that address and read it. This index can be stored in many different ways, the easiest one being on a separate file, this way the original file can still be considered a text file, or;
Structure the file so that each line starts at a fixed offset in bytes from the previous one, so you can jump to the line you want by multiplying (desired index * record size). This does not mean the text on each line needs to have the same length; you can pad the end of each line with null terminators (the character '\0'). In this case it is no longer recommended to treat the file as a text file, but as a binary file instead (a sketch of this option follows after the next paragraph).
You can write a separate program that will generate this index or generate the structured file for your main program to use.
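A minimal sketch of the second option (fixed-size records); the 16-byte record size and the file name names.bin are assumptions:

#include <fstream>
#include <iostream>
#include <string>
#include <cstdlib>
#include <ctime>

const std::size_t RECORD = 16; // assumed fixed record size, '\0'-padded

int main() {
    std::ifstream in("names.bin", std::ios::binary);
    in.seekg(0, std::ios::end);
    std::size_t count = static_cast<std::size_t>(in.tellg()) / RECORD;
    if (count == 0) return 1;

    std::srand((unsigned)std::time(nullptr));
    std::size_t index = std::rand() % count;

    // jump straight to the chosen record -- no need to scan earlier lines
    in.seekg(index * RECORD, std::ios::beg);
    char buf[RECORD + 1] = {};
    in.read(buf, RECORD);
    std::cout << std::string(buf) << '\n'; // the '\0' padding ends the string
}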
All this of course, considering you want the program to run and read the line without having to load the entire file in memory first. If your program will constantly read lines from the file, you should probably just load the entire file into a std::vector<std::string> and then read the lines at will from there.
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <cstdlib>
#include <ctime>
using namespace std;
int main()
{
    string filePath = "test.txt";
    vector<string> qNames;

    ifstream openFile(filePath.data());
    if (openFile.is_open())
    {
        string line;
        while (getline(openFile, line))
        {
            qNames.push_back(line);
        }
        openFile.close();
    }

    if (!qNames.empty())
    {
        srand((unsigned int)time(NULL));
        for (int i = 0; i < 10; i++)
        {
            int linePos = rand() % qNames.size();
            cout << qNames.at(linePos) << endl;
        }
    }
    return 0;
}

C++ - Opening text files sequentially

I have hundreds of .txt files ordered by number: 1.txt, 2.txt, 3.txt, ..., n.txt. In each file there are two columns of decimal numbers.
I wrote an algorithm that does some operations on one .txt file alone, and now I want to do the same to all of them, one after another.
This helpful question gave me some idea of what I'm trying to do.
Now I'm trying to write an algorithm to read all of the files:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main ()
{
    int i, n;
    char filename[6];
    double column1[100], column2[100];

    for (n=1;n=200;n++)
    {
        sprintf(filename, "%d.txt", n);
        ifstream datafile;
        datafile.open(filename);
        for (i=0;i<100;i++)
        {
            datafile >> column1[i] >> column2[i];
            cout << column1[i] << column2[i];
        }
        datafile.close();
    }
    return 0;
}
What I think the code does: it creates filename strings from 1.txt up to 200.txt, then opens the files with these names. For each file, the first 100 rows of the two columns are stored in the arrays column1 and column2, and the values are printed to the screen.
I don't get any error when compiling, but when I run it the output is huge and simply won't stop. If I redirect the output to a .txt file, it easily reaches several GB!
I also tried decreasing the loop count and reducing the number of rows read (to 3 or so), but I still get infinite output. I would be glad if someone could point out the mistakes in the code...
I am using gcc 5.2.1 on Linux.
Thanks!
A 6-element array is too short to store "200.txt"; it must be at least 8 elements (7 characters plus the terminating '\0').
The condition n=200 is an assignment, not a comparison, and is always true. It should be n<=200.
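With those two fixes applied, the loop might look like this sketch; it also reads until extraction fails rather than assuming exactly 100 rows, so a missing file cannot print garbage forever:

#include <cstdio>
#include <fstream>
#include <iostream>

int main() {
    char filename[16];               // room for "200.txt" plus the terminator
    for (int n = 1; n <= 200; n++) { // n <= 200, not n = 200
        std::snprintf(filename, sizeof filename, "%d.txt", n);
        std::ifstream datafile(filename);
        double c1, c2;
        while (datafile >> c1 >> c2) // stop at end of file or on bad input
            std::cout << c1 << " " << c2 << "\n";
    }
}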
If all your files are in the same directory, you could also use boost::filesystem, e.g.:
auto path = "path/to/folder";
std::for_each(boost::filesystem::directory_iterator{path},
              boost::filesystem::directory_iterator{},
              [](boost::filesystem::directory_entry file){
                  // test if file is of the correct type
                  // do sth with file
              });
I think this is a cleaner solution.

Read a line of a file c++

I'm just trying to use the fstream library, and I want to read a given row. I came up with this, but I don't know if it is the most efficient way.
#include <iostream>
#include <fstream>
using namespace std;
int main(){
    fstream input2;
    string line;
    int countLine = 0;

    input2.open("theinput.txt");
    if (input2.is_open()) {
        while (getline(input2, line)) {
            countLine++;
            if (countLine == 1) { //1 is the line I want to read.
                cout << line << endl;
            }
        }
    }
}
Is there another way?
This does not appear to be the most efficient code, no.
In particular, you're currently reading the entire input file even though you only care about one line of the file. Unfortunately, doing a good job of skipping a line is somewhat difficult. Quite a few people recommend using code like:
your_stream.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
...for this job. This can work, but has a couple of shortcomings. First and foremost, if you try to use it on a non-text file (especially one that doesn't contain new-lines) it can waste inordinate amounts of time reading an entire huge file, long after you've read enough that you would normally realize that there must be a problem. For example, if you're reading a "line", that's a pretty good indication that you're expecting a text file, and you can pretty easily set a much lower limit on how long that first line could reasonably be, such as (say) a megabyte, and usually quite a lot less than that.
You also usually want to detect whether it stopped reading because it reached that maximum, or because it got to the end of the line. Skipping a line "succeeded" only if a new-line was encountered before reaching the specified maximum. To do that, you can use gcount() to compare against the maximum you specified. If you stopped reading because you reached the specified maximum, you typically want to stop processing that file (and log the error, print out an error message, etc.)
With that in mind, we might write code like this:
bool skip_line(std::istream &in) {
    size_t max = 0xfffff;
    in.ignore(max, '\n');
    return in.gcount() < max;
}
Depending on the situation, you might prefer to pass the maximum line size as a parameter (probably with a default) instead:
bool skip_line(std::istream &in, size_t max = 0xfffff) {
    // skip the definition of `max`; the remainder is identical
With this, you can skip up to a megabyte by default, but if you want to specify a different maximum, you can do so quite easily.
Either way, with that defined, the remainder becomes fairly trivial, something like this:
int main(){
    std::ifstream in("theinput.txt");

    if (!skip_line(in)) {
        std::cerr << "Error reading file\n";
        return EXIT_FAILURE;
    }

    // copy the second line:
    std::string line;
    if (std::getline(in, line))
        std::cout << line;
}
Of course, if you want to skip more than one line, you can do that pretty easily as well by putting the call to skip_line in a loop--but note that you still usually want to test the result from it, and break the loop (and log the error) if it fails. You don't usually want something like:
for (int i=0; i<lines_to_skip; i++)
    skip_line(in);
With this, you'd lose one of the basic benefits: assuring that your input really is what you expected and that you're not producing garbage.
I think you can condense your code to the following; if (input) is sufficient to check for failure.
#include <iostream>
#include <fstream>
#include <string>
#include <limits>

int main()
{
    std::ifstream input("file.txt");
    int row = 5;
    int count = 0;
    if (input)
    {
        // skip the first `row` lines, then print the next one
        while (count++ < row)
            input.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
        std::string line;
        std::getline(input, line);
        std::cout << line;
    }
}

Big csv file c++ parsing performance

I have a big CSV file (25 MB) that represents a symmetric graph (about 18k x 18k). While parsing it into an array of vectors, I profiled the code (with the VS2012 analyzer), and it shows that most of the parsing time (about 19 seconds total) is spent reading each character one at a time (the hot spot is basic_string::operator+= inside getline).
This leaves me frustrated, as with Java's simple buffered line reading and a tokenizer I achieve the same in less than half a second.
My code uses only the STL:
int allColumns = initFirstRow(file, secondRow);
// secondRow has been initialized with one value
int column = 1; // don't forget, first column is 0
VertexSet* rows = new VertexSet[allColumns];
rows[1] = secondRow;

string vertexString;
long double vertexDouble;
for (int row = 1; row < allColumns; row++) {
    // don't do the last row
    for (; column < allColumns; column++) {
        // don't do the last column
        getline(file, vertexString, ',');
        vertexDouble = stold(vertexString);
        if (vertexDouble > _TH) {
            rows[row].add(column);
        }
    }
    // do the last one in the row
    getline(file, vertexString);
    vertexDouble = stold(vertexString);
    if (vertexDouble > _TH) {
        rows[row].add(++column);
    }
    column = 0;
}
initLastRow(file, rows[allColumns-1], allColumns);
initFirstRow and initLastRow basically do the same thing as the loop above, but initFirstRow also counts the number of columns.
VertexSet is basically a vector of indexes (int). Each vertex read (separated by ',') is no more than 7 characters long (values are between -1 and 1).
At 25 megabytes, I'm going to guess that your file is machine generated. As such, you (probably) don't need to worry about things like verifying the format (e.g., that every comma is in place).
Given the shape of the file (i.e., each line is quite long) you probably won't impose a lot of overhead by putting each line into a stringstream to parse out the numbers.
Based on those two facts, I'd at least consider writing a ctype facet that treats commas as whitespace, then imbuing the stringstream with a locale using that facet to make it easy to parse out the numbers. Overall code length would be a little greater, but each part of the code would end up pretty simple:
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <time.h>
#include <stdlib.h>
#include <locale>
#include <sstream>
#include <algorithm>
#include <iterator>
class my_ctype : public std::ctype<char> {
    // Build the modified classification table once. A function-local static
    // avoids referring to a data member before the base class is constructed.
    static mask const *make_table() {
        static std::vector<mask> table(classic_table(), classic_table() + table_size);
        table[','] = (mask)space;
        return table.data();
    }
public:
    my_ctype(size_t refs = 0) : std::ctype<char>(make_table(), false, refs) {}
};

template <class T>
class converter {
    std::stringstream buffer;
    my_ctype *m;
    std::locale l;
public:
    converter() : m(new my_ctype), l(std::locale::classic(), m) { buffer.imbue(l); }

    std::vector<T> operator()(std::string const &in) {
        buffer.clear(); // reset the stream state
        buffer.str(in); // replace contents with the new line
        return std::vector<T> {std::istream_iterator<T>(buffer),
                               std::istream_iterator<T>()};
    }
};

int main() {
    std::ifstream in("somefile.csv");
    std::vector<std::vector<double>> numbers;
    std::string line;
    converter<double> cvt;

    clock_t start = clock();
    while (std::getline(in, line))
        numbers.push_back(cvt(line));
    clock_t stop = clock();

    std::cout << double(stop-start)/CLOCKS_PER_SEC << " seconds\n";
}
To test this, I generated a 1.8K x 1.8K CSV file of pseudo-random doubles like this:
#include <iostream>
#include <stdlib.h>
int main() {
    for (int i=0; i<1800; i++) {
        for (int j=0; j<1800; j++)
            std::cout << rand()/double(RAND_MAX) << ",";
        std::cout << "\n";
    }
}
This produced a file around 27 megabytes. After compiling the reading/parsing code with gcc (g++ -O2 trash9.cpp), a quick test on my laptop showed it running in about 0.18 to 0.19 seconds. It never seems to use (even close to) all of one CPU core, indicating that it's I/O bound, so on a desktop/server machine (with a faster hard drive) I'd expect it to run faster still.
The inefficiency here is in Microsoft's implementation of std::getline, which is being used in two places in the code. The key problems with it are:
It reads from the stream one character at a time
It appends to the string one character at a time
The profile in the original post shows that the second of these problems is the biggest issue in this case.
I wrote more about the inefficiency of std::getline here.
GNU's implementation of std::getline, i.e. the version in libstdc++, is much better.
Sadly, if you want your program to be fast and you build it with Visual C++ you'll have to use lower level functions than std::getline.
The debug runtime library in VS is very slow because it does a lot of debug checks (for out-of-bounds accesses and things like that) and calls lots of very small functions that are not inlined when you compile in Debug.
Running your program in release should remove all these overheads.
My bet for the next bottleneck is string allocation.
I would try reading bigger chunks of memory at once and then parsing them: read a full line, then parse that line using pointers and specialized functions.
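A sketch of that idea: read a whole line at a time, then walk it with std::strtod via raw pointers. The file name and the CSV-of-doubles format are taken from the question's setup:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::ifstream in("somefile.csv");
    std::string line;
    std::vector<double> row;
    while (std::getline(in, line)) { // one big read per line
        row.clear();
        const char *p = line.c_str();
        char *end;
        // strtod stops at the first character it can't parse; end == p means done
        for (double v = std::strtod(p, &end); p != end;
             v = std::strtod(p, &end)) {
            row.push_back(v);
            p = (*end == ',') ? end + 1 : end; // skip the comma separator
        }
        std::cout << row.size() << " values\n";
    }
}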
Good answers here. It took me a while, but I had the same problem; after this fix my write-and-process time went from 38 seconds to 6 seconds.
Here's what I did.
First get the data using a boost memory-mapped file. Then you can use boost threads to speed up processing of the const char* that the mapping returns, something like this (the multithreading differs depending on your implementation, so I excluded that part):
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/thread/thread.hpp>
#include <boost/lockfree/queue.hpp>
#include <cstdlib>
#include <string>
#include <vector>

void foo(const std::string &path)
{
    boost::iostreams::mapped_file mmap(path, boost::iostreams::mapped_file::readonly);
    auto chars = mmap.const_data();    // pointer to the mapped char array
    auto eofile = chars + mmap.size(); // used to detect end of file
    std::string next = "";             // accumulates the current value
    std::vector<double> data;          // stores the parsed data
    for (; chars && chars != eofile; chars++) {
        if (chars[0] == ',' || chars[0] == '\n') {   // end of value
            data.push_back(std::atof(next.c_str())); // convert and store
            next = "";                               // clear for the next value
        }
        else
            next += chars[0];                        // append to current value
    }
}

C++ length of file and vectors

Hi, I have a file with some text in it. Is there an easy way to get the number of lines in the file without traversing it?
I also need to put the lines of the file into a vector. I am new to C++, but I think vector is like ArrayList in Java, so I wanted to use a vector and insert things into it. How would I do that?
Thanks.
There is no way of finding the number of lines in a file without reading it. To read all lines:
1) create a std::vector of std::string
2) open a file for input
3) read a line as a std::string using getline()
4) if the read failed, stop
5) push the line into the vector
6) goto 3
You would need to traverse the file to count the number of lines (or at least call a library method that traverses the file).
Here is a sample code for parsing text file, assuming that you pass the file name as an argument, by using the getline method:
#include <string>
#include <vector>
#include <fstream>
#include <iostream>
int main(int argc, char* argv[])
{
    std::vector<std::string> lines;
    std::string line;

    // open the desired file for reading
    std::ifstream infile(argv[1], std::ios_base::in);

    // read each line individually (watch out for Windows line endings)
    while (getline(infile, line, '\n'))
    {
        // add line to vector
        lines.push_back(line);
    }

    // do anything you like with the vector. Output the size, for example:
    std::cout << "Read " << lines.size() << " lines.\n";
    return 0;
}
Update: The code could fail for many reasons (e.g. file not found, concurrent modifications to file, permission issues, etc). I'm leaving that as an exercise to the user.
1) No way to find number of lines without reading the file.
2) Take a look at the getline function from the C++ Standard Library. Something like:
string line;
fstream file;
vector <string> vec;
...
while (getline(file, line)) vec.push_back(line);
Traversing the file is fundamentally required to determine the number of lines, regardless of whether you do it or some library routine does it. New lines are just another character, and the file must be scanned one character at a time in its entirety to count them.
Since you have to read the lines into a vector anyway, you might as well combine the two steps:

// Read lines from input stream `in` into vector `out`.
// Returns the number of lines read.
int getlines(std::vector<std::string>& out, std::istream& in = std::cin) {
    out.clear(); // remove any data in the vector
    std::string buffer;
    while (std::getline(in, buffer))
        out.push_back(buffer);
    // return number of lines read
    return static_cast<int>(out.size());
}