Reading lines from input - c++

I'm looking to read from std::cin with the syntax below (each line is always int, int, int, char[]/str). What would be the fastest way to parse the data into an int array[3] and either a string or a char array?
#NumberOfLines(i.e.10000000)
1,2,2,'abc'
2,2,2,'abcd'
1,2,3,'ab'
...1M+ to 10M+ more lines, always in the form of (int,int,int,str)
At the moment, I'm doing something along the lines of:
//unsync stdio
std::ios_base::sync_with_stdio (false);
std::cin.tie(NULL);
//read from cin
for(i in amount of lines in stdin){
getline(cin,str);
if(i<3){
int commaindex = str.find(',');
string substring = str.substr(0,commaindex);
array[i]=atoi(substring.c_str());
str.erase(0,commaindex+1);
}else{
label = str;
}
//assign array and label to other stuff and do other stuff, repeat
}
I'm quite new to C++ and recently learned profiling with Visual Studio, though I'm not the best at interpreting the results. IO takes up 68.2% and kernel takes 15.8% of CPU usage. getline() covers 35.66% of the elapsed inclusive time.
Is there any way I can do something similar to reading large chunks at once to avoid calling getline() as much? I've been told fgets() is much faster, however, I'm unsure of how to use it when I cannot predict the number of characters to specify.
I've attempted to use scanf as follows, however it was slower than the getline method. I have also used stringstreams, but that was incredibly slow.
scanf("%i,%i,%i,%s",&array[0],&array[1],&array[2],str);
Also if it matters, it is run on a server with low memory available. I think reading the entire input to buffer would not be viable?
Thanks!
Update: Using @ted-lyngmo's approach, I gathered the results below.
time wc datafile
real 4m53.506s
user 4m14.219s
sys 0m36.781s
time ./a.out < datafile
real 2m50.657s
user 1m55.469s
sys 0m54.422s
time ./a.out datafile
real 2m40.367s
user 1m53.523s
sys 0m53.234s

You could use std::from_chars (and reserve() the approximate number of lines you have in the file, if you store the values in a vector for example). I also suggest adding support for reading directly from the file. Reading from a file opened by the program is (at least for me) faster than reading from std::cin (even with sync_with_stdio(false)).
Example:
#include <algorithm> // std::for_each
#include <cctype> // std::isspace
#include <charconv> // std::from_chars
#include <cstdint> // std::uintmax_t
#include <cstdio> // std::perror
#include <fstream>
#include <iostream>
#include <iterator> // std::istream_iterator
#include <limits> // std::numeric_limits
#include <string>
#include <system_error> // std::errc
struct foo {
int a[3];
std::string s;
};
std::istream& operator>>(std::istream& is, foo& f) {
if(std::getline(is, f.s)) {
std::from_chars_result fcr{f.s.data(), {}};
const char* end = f.s.data() + f.s.size();
// extract the numbers
for(unsigned i = 0; i < 3 && fcr.ptr < end; ++i) {
fcr = std::from_chars(fcr.ptr, end, f.a[i]);
if(fcr.ec != std::errc{}) {
is.setstate(std::ios::failbit);
return is;
}
// find next non-whitespace
do ++fcr.ptr;
while(fcr.ptr < end &&
std::isspace(static_cast<unsigned char>(*fcr.ptr)));
}
// extract the string
if(++fcr.ptr < end)
f.s = std::string(fcr.ptr, end - 1);
else
is.setstate(std::ios::failbit);
}
return is;
}
std::ostream& operator<<(std::ostream& os, const foo& f) {
for(int i = 0; i < 3; ++i) {
os << f.a[i] << ',';
}
return os << '\'' << f.s << "'\n";
}
int main(int argc, char* argv[]) {
std::ifstream ifs;
if(argc >= 2) {
ifs.open(argv[1]); // if a filename is given as argument
if(!ifs) {
std::perror(argv[1]);
return 1;
}
} else {
std::ios_base::sync_with_stdio(false);
std::cin.tie(nullptr);
}
std::istream& is = argc >= 2 ? ifs : std::cin;
// ignore the first line - it's of no use in this demo
is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
// read all `foo`s from the stream
std::uintmax_t co = 0;
std::for_each(std::istream_iterator<foo>(is), std::istream_iterator<foo>(),
[&co](const foo& f) {
// Process each foo here
// Just counting them for demo purposes:
++co;
});
std::cout << co << '\n';
}
My test runs on a file with 1'000'000'000 lines with content looking like below:
2,2,2,'abcd'
2, 2,2,'abcd'
2, 2, 2,'abcd'
2, 2, 2, 'abcd'
time wc datafile
1000000000 2500000000 14500000000 datafile
real 1m53.440s
user 1m48.001s
sys 0m3.215s
time ./my_from_chars_prog datafile
1000000000
real 1m43.471s
user 1m28.247s
sys 0m5.622s
From this comparison I think one can see that my_from_chars_prog is able to successfully parse all entries pretty fast. It was consistently faster at doing so than wc, a standard unix tool whose only purpose is to count lines, words and characters.
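To look at the parsing step in isolation, here is a minimal sketch, assuming lines of exactly the form 1,2,3,'abc' with no whitespace around the fields. The names record and parse_line are made up for illustration and are not part of the program above; like that program, it needs C++17.

```cpp
#include <charconv>     // std::from_chars
#include <optional>
#include <string>
#include <string_view>
#include <system_error> // std::errc

// Hypothetical record type for one input line: three ints and a label.
struct record {
    int a[3];
    std::string label;
};

// Parses one line of the form  1,2,3,'abc'  (no whitespace handling).
// Returns std::nullopt if the line does not match that shape.
std::optional<record> parse_line(std::string_view line) {
    record r{};
    const char* p = line.data();
    const char* end = line.data() + line.size();
    for (int i = 0; i < 3; ++i) {
        auto [ptr, ec] = std::from_chars(p, end, r.a[i]);
        // Each number must parse and must be followed by a comma.
        if (ec != std::errc{} || ptr == end || *ptr != ',') return std::nullopt;
        p = ptr + 1; // skip the comma
    }
    // What remains must be a quoted label: at least the two quote characters.
    if (end - p < 2 || *p != '\'' || *(end - 1) != '\'') return std::nullopt;
    r.label.assign(p + 1, end - 1); // copy the text between the quotes
    return r;
}
```

Calling parse_line on each string returned by std::getline keeps the I/O and the parsing separate, which can make it easier to see in a profiler which of the two dominates.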

Related

f.getline() iterator not increasing

I don't understand why my iterator(nr) doesn't increase.
#include <iostream>
#include <fstream>
#include <stdio.h>
#include <string.h>
using namespace std;
ifstream f("date.in");
ofstream g("date.out");
int main()
{
int l, nr = 0;
char x, s[100];
f >> l;
while(!f.eof())
{
f.getline(s, 100);
{
g << s;
nr++;
}
if(nr == 19)
{
g << '\n';
nr = 0;
}
}
return 0;
}
I expect to get the output to start on a new line every 20 characters.
The problem is that you read and count complete lines, as @Andrey Akhmetov said in the comments. If you want to inject a \n every 20 chars, the easiest way would be to read one character at a time:
void add_newlines(std::istream& in, std::ostream& out) {
char ch;
int nr = 0;
// Read one char with "<istream>.get()". The returned stream reference (in) will
// be true or false in a boolean context (the while(<condition>)) depending on
// the state of the stream. If it fails extracting a character, the failbit will
// be set on the stream and "in" will be "false" in the boolean context and
// the while loop will end.
while( in.get((ch)) ) {
out.put(ch);
if(++nr == 20) {
out << '\n';
nr = 0;
}
}
}
Call it with add_newlines(f, g);.
Note that get() and put() are unformatted I/O functions while out << '\n' is formatted output. On Windows, a stream opened in text mode translates every \n it writes into \r\n, so if the input was read in binary mode and already contains \r\n line endings, sequences like \r\r\n could appear in your output.
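A quick way to try the character-by-character approach without touching any files is to run it on in-memory streams. The sketch below is a variant, not the function from the answer: the width parameter and the wrap helper are additions made up for illustration.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copies `in` to `out`, inserting '\n' after every `width` characters.
// `width` is an illustrative parameter; it must be positive.
void add_newlines(std::istream& in, std::ostream& out, int width) {
    assert(width > 0);
    char ch;
    int nr = 0;
    while (in.get(ch)) {       // loop ends when no more characters can be read
        out.put(ch);
        if (++nr == width) {
            out.put('\n');
            nr = 0;
        }
    }
}

// Hypothetical convenience wrapper for trying the function on a string.
std::string wrap(const std::string& text, int width) {
    std::istringstream in(text);
    std::ostringstream out;
    add_newlines(in, out, width);
    return out.str();
}
```

For example, wrap("abcdefg", 3) yields "abc\ndef\ng". Any newline characters already present in the input are counted like every other character.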

Reading number list from file to a dynamic array

I'm having trouble reading a number list from a .txt file to a dynamic array of type double. The first number in the list is the number of numbers to add to the array. After the first number, the numbers in the list all have decimals.
My header file:
#include <iostream>
#ifndef SORT
#define SORT
class Sort{
private:
double i;
double* darray; // da array
double j;
double size;
public:
Sort();
~Sort();
std::string getFileName(int, char**);
bool checkFileName(std::string);
void letsDoIt(std::string);
void getArray(std::string);
};
#endif
main.cpp:
#include <stdio.h>
#include <stdlib.h>
#include "main.h"
int main(int argc, char** argv)
{
Sort sort;
std::string cheese = sort.getFileName(argc, argv); //cheese is the file name
bool ean = sort.checkFileName(cheese); //pass in file name fo' da check
sort.letsDoIt(cheese); //starts the whole thing up
return 0;
}
impl.cpp:
#include <iostream>
#include <fstream>
#include <cstring>
#include <stdlib.h>
#include "main.h"
Sort::Sort(){
darray[0];
i = 0;
j = 0;
size = 0;
}
Sort::~Sort(){
std::cout << "Destroyed" << std::endl;
}
std::string Sort::getFileName(int argc, char* argv[]){
std::string fileIn = "";
for(int i = 1; i < argc;)//argc the number of arguements
{
fileIn += argv[i];//argv the array of arguements
if(++i != argc)
fileIn += " ";
}
return fileIn;
}
bool Sort::checkFileName(std::string userFile){
if(userFile.empty()){
std::cout<<"No user input"<<std::endl;
return false;
}
else{
std::ifstream tryread(userFile.c_str());
if (tryread.is_open()){
tryread.close();
return true;
}
else{
return false;
}
}
}
void Sort::letsDoIt(std::string file){
getArray(file);
}
void Sort::getArray(std::string file){
double n = 0;
int count = 0;
// create a file-reading object
std::ifstream fin;
fin.open(file.c_str()); // open a file
fin >> n; //first line of the file is the number of numbers to collect to the array
size = n;
std::cout << "size: " << size << std::endl;
darray = (double*)malloc(n * sizeof(double)); //allocate storage for the array
// read each line of the file
while (!fin.eof())
{
fin >> n;
if (count == 0){ //if count is 0, don't add to array
count++;
std::cout << "count++" << std::endl;
}
else {
darray[count - 1] = n; //array = line from file
count++;
}
std::cout << std::endl;
}
free((void*) darray);
}
I have to use malloc, but I think I may be using it incorrectly. I've read other posts but I am still having trouble understanding what is going on.
Thanks for the help!
Your use of malloc() is fine. Your reading is not doing what you want it to do.
Say I have the input file:
3
1.2
2.3
3.7
My array would be:
[0]: 2.3
[1]: 3.7
[2]: 0
This is because you are reading in the value 1.2 as if you were rereading the number of values.
When you have this line:
fin >> n; //first line of the file is the number of numbers to collect to the array
You are reading in the count, in this case 3, and advancing where in the file you will read from next. You are then attempting to reread that value but are getting the first entry instead.
I believe that replacing your while() {...} with the code below will do what you are looking for.
while (count != size && fin >> n)
{
darray[count++] = n; //array = line from file
std::cout << n << std::endl;
}
This should give you the correct values in the array:
[0]: 1.2
[1]: 2.3
[2]: 3.7
You appear to be writing the next exploitable program. You are mistakenly trusting the first line of the file to determine your buffer size, then reading an unlimited amount of data from the remainder of the file into a buffer that is not unlimited. This allows an evil input file to trash some other memory in your program, possibly allowing the creator of that file to take control of your computer. Oh noes!
Here's what you need to do to fix it:
1. Remember how much memory you allocated (you'll need it in step #2). Have a variable alleged_size or array_length that is separate from the one you use to read the rest of the data.
2. Don't allow count to run past the end of the array. Your loop should look more like this:
while ((count < alleged_size) && (fin >> n))
This both prevents array overrun and decides whether to process data based on whether it was parsed successfully, not whether you reached the end-of-file at some useless point in the past.
The less problematic bug is the one @bentank noticed: you didn't realize that you kept your position in the file, which is after the first line, and shouldn't expect to hit that line within the loop.
In addition to this, you probably want to deallocate the memory in your destructor. Right now you throw the data away immediately after parsing it. Wouldn't other functions like to party on that data too?
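Putting these points together, a bounded read could look roughly like the sketch below. It is only an illustration under stated assumptions: read_doubles and max_values are made-up names, and the cap of one million values is arbitrary. It keeps malloc, since the question requires it.

```cpp
#include <cstddef>
#include <cstdlib>  // std::malloc, std::free
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this on in-memory text

// Illustrative upper bound on how many values we are willing to allocate for.
const std::size_t max_values = 1000000;

// Hypothetical helper: reads a count followed by at most that many doubles.
// Returns a malloc'ed array (the caller must std::free it) and stores the
// number of values actually read in `out_count`. Returns nullptr on failure.
double* read_doubles(std::istream& in, std::size_t& out_count) {
    out_count = 0;
    std::size_t alleged_size = 0;
    if (!(in >> alleged_size) || alleged_size == 0 || alleged_size > max_values)
        return nullptr;

    double* data = static_cast<double*>(std::malloc(alleged_size * sizeof(double)));
    if (data == nullptr) return nullptr;

    double n = 0;
    // Stop at the allocated size or at the first failed extraction,
    // whichever comes first, so the array can never be overrun.
    while (out_count < alleged_size && in >> n) {
        data[out_count++] = n;
    }
    return data;
}
```

Because the function reports how many values it really read, the caller can notice a file that promises more numbers than it contains, and extra numbers beyond the promised count are simply left unread.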

Comparing numbers in strings to numbers in int?

I'm trying to make a program that will open a txt file containing a list of names in this format (ignore the bullets):
3 Mark
4 Ralph
1 Ed
2 Kevin
and will create a file w/ organized names based on the number in front of them:
1 Ed
2 Kevin
3 Mark
4 Ralph
I think I'm experiencing trouble in line 40, where I try to compare the numbers stored in strings with a number stored in an int.
I can't think of any other way to tackle this, any advice would be wonderful!
#include <iostream>
#include <fstream>
#include <vector>
#include <cstdlib>
using namespace std;
int main()
{
ifstream in;
ofstream out;
string line;
string collection[5];
vector <string> lines;
vector <string> newLines;
in.open("infile.txt");
if (in.fail())
{
cout << "Input file opening failed. \n";
exit(1);
}
out.open("outfile.txt");
if (out.fail())
{
cout << "Output file opening failed. \n";
exit(1);
}
while (!in.eof())
{
getline(in, line);
lines.push_back(line);
}
for (int i = 0; i < lines.size(); i++)
{
collection[i] = lines[i];
}
for (int j = 0; j < lines.size(); j++)
{
for (int x = 0; x < lines.size(); x--)
{
if (collection[x][0] == j)
newLines.push_back(collection[x]);
}
}
for (int k = 0; k < newLines.size(); k++)
{
out << newLines[k] << endl;
}
in.close( );
out.close( );
return 0;
}
Using a debugger would tell you where you went wrong, but let me highlight the mistake:
if (collection[x][0] == j)
You're expecting a string like 3 Mark. The first character of this string is '3', but that has the ASCII value of 51, and that is the numerical value you'll get when trying to work with it this way! This will never equal j, unless you've got a lot of lines in your file, and then your search system will not work at all like you wanted. You need to convert that character into an integer, and then do your comparison.
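For a single leading digit, the conversion can be sketched as below. The helper name leading_digit is made up for illustration, and the approach only covers one-digit numbers; multi-digit numbers need real parsing.

```cpp
#include <cctype>   // std::isdigit
#include <string>

// Returns the numeric value of the leading digit of `line`, or -1 if the
// line is empty or does not start with a digit. Illustrative helper only.
int leading_digit(const std::string& line) {
    if (line.empty() || !std::isdigit(static_cast<unsigned char>(line[0])))
        return -1;
    // The digit characters are contiguous, so subtracting '0' gives the value:
    // '3' (51 in ASCII) minus '0' (48 in ASCII) is 3.
    return line[0] - '0';
}
```

With this, the comparison in the question would become something like leading_digit(collection[x]) == j.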
C++ offers many ways to process data via streams, including parsing simple datafiles and converting text to numbers and vice versa. Here's a simple standalone function that will read a datafile like you have (only with arbitrary text including spaces after the number on each line).
#include <algorithm>
#include <istream>
#include <string>
#include <vector>
// snip
struct file_entry { int i; std::string text; };
std::vector<file_entry> parse_file(std::istream& in)
{
std::vector<file_entry> data;
file_entry e;
while (in >> e.i) // read the first number on the line; stop when that fails
{
in.ignore(); // skip the space between the number and the text
std::getline(in, e.text); // read the whole of the rest of the line
data.push_back(e);
}
return data;
}
Because the standard way that >> works involves reading until the next space (or end of line), if you want to read a chunk of text which contains whitespace, it will be much easier to use std::getline to just slurp up the whole of the rest of the current line.
Note: I've made no attempt to handle malformed lines in the textfile, or any number of other possible error conditions. Writing a proper file parser is outside of the scope of this question, but there are plenty of tutorials out there on using C++'s stream functionality appropriately.
Now that you have the file in a convenient structure, you can use other standard C++ features to sort it, rather than reinventing the wheel and trying to do it yourself:
bool sort_file_entry(const file_entry& a, const file_entry& b)
{
return a.i < b.i;
}
int main()
{
// set up all your streams, etc.
std::vector<file_entry> lines = parse_file(in);
std::sort(lines.begin(), lines.end(), sort_file_entry);
// now you can write the sorted vector back out to disk.
}
Again, a full introduction to how iterators and containers work is well outside the scope of this answer, but the internet has no shortage of introductory C++ guides out there. Good luck!
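As a compact end-to-end illustration of the same idea, the sketch below reads number/name pairs from any input stream and returns them sorted by number. The names entry and read_sorted are made up for this example, and it assumes every line really has a name after the number.

```cpp
#include <algorithm>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this on in-memory text
#include <string>
#include <vector>

// Hypothetical record for one line such as "3 Mark".
struct entry { int number; std::string name; };

// Reads "<number> <name>" lines until extraction fails, then sorts by number.
std::vector<entry> read_sorted(std::istream& in) {
    std::vector<entry> entries;
    entry e;
    // std::ws skips the whitespace between the number and the name.
    while (in >> e.number && std::getline(in >> std::ws, e.name)) {
        entries.push_back(e);
    }
    std::sort(entries.begin(), entries.end(),
              [](const entry& a, const entry& b) { return a.number < b.number; });
    return entries;
}
```

Passing an std::ifstream opened on infile.txt works the same way as passing an std::istringstream, since both are input streams. Writing the result back out is then a simple loop over the returned vector.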

Writing to file line by line vs writing whole text at once

I was always told that file IO operations are the slowest ones. However, when I test the two scenarios below:
Scenario 1:
test.open("test.xml",fstream::out);
for(int i=0;i<1000;i++)
{
test<<"<p> attr1=\"test1\" attr2=\"test2\" attr3=\"test3\" attr4=\"test4\">test5</p>\n";
}
test.close();
Scenario 2:
test.open("test.xml",fstream::out);
stringstream fileDataStr;
for(int i=0;i<1000;i++)
{
fileDataStr<<"<p> attr1=\"test1\" attr2=\"test2\" attr3=\"test3\" attr4=\"test4\">test5</p>\n";
}
test<<fileDataStr;
test.close();
I expected scenario 1 to be slower because it does 1000 file IO operations, but the test results showed that it has the same speed as scenario 2 (in terms of clock_t). Why is this so? Is it related to OS optimization, as in file reads?
getline while reading a file vs reading whole file and then splitting based on newline character
Edited: With the advice of @irW
string fileDataStr;
changed to
stringstream fileDataStr;
Because of the way std::ofstream buffers output, you end up
doing exactly the same amount of IO in both cases. (Usually,
at any rate; an implementation could optimize things when
you output a very long string.) The only difference is that in
the second case, you've introduced an additional intermediate
buffer, which means a little more copying, and a few more
dynamic allocations. (How many dynamic allocations depends on
the implementation, but it shouldn't be too many.)
Each time you do fileDataStr += ..., the string may have to allocate a larger buffer and copy its previous contents into it as it grows. If you use a stringstream instead, it might be a fairer comparison.
There is no one answer to this, because the results can and will vary with the compiler and standard library you use. For example, I put your different attempts together into a single program with a little test/timing harness. Then, just for fun, I added a fourth attempt (test3 in the code below):
#include <algorithm> // std::copy
#include <iostream>
#include <iterator> // std::back_inserter
#include <vector>
#include <string>
#include <sstream>
#include <time.h>
#include <fstream>
#include <string.h>
static const int limit = 1000000;
void test1() {
std::ofstream test("test.xml");
for (int i = 0; i < limit; i++)
{
test << "<p> attr1=\"test1\" attr2=\"test2\" attr3=\"test3\" attr4=\"test4\">test5</p>\n";
}
test.close();
}
void test11() {
std::ofstream test("test.xml");
std::string fileDataStr;
for (int i = 0; i < limit; i++)
{
fileDataStr += "<p> attr1=\"test1\" attr2=\"test2\" attr3=\"test3\" attr4=\"test4\">test5</p>\n";
}
test << fileDataStr;
test.close();
}
void test2() {
std::ofstream test("test.xml");
std::stringstream fileDataStr;
for (int i = 0; i < limit; i++)
{
fileDataStr << "<p> attr1=\"test1\" attr2=\"test2\" attr3=\"test3\" attr4=\"test4\">test5</p>\n";
}
test << fileDataStr.str();
test.close();
}
void test3() {
std::ofstream test("test.xml");
std::vector<char> buffer;
char line [] = "<p> attr1=\"test1\" attr2=\"test2\" attr3=\"test3\" attr4=\"test4\">test5</p>\n";
size_t len = strlen(line);
buffer.reserve(limit * len + 1);
for (int i = 0; i < limit; i++)
std::copy(line, line + len, std::back_inserter(buffer));
test.write(&buffer[0], buffer.size());
test.close();
}
template <class T>
void timer(T f) {
clock_t start = clock();
f();
clock_t stop = clock();
std::cout << double(stop - start) / CLOCKS_PER_SEC << " seconds\n";
}
int main() {
timer(test1);
timer(test11);
timer(test2);
timer(test3);
}
Then I compiled it with VC++, and got the following results:
0.681 seconds
0.659 seconds
0.874 seconds
0.955 seconds
Then, I compiled with g++, and got these results:
1.267 seconds
0.725 seconds
0.795 seconds
0.649 seconds
The fourth version (the one I added) gives the worst performance with VC++, but the best performance with g++. The one that was next to fastest with VC++ is (by far) the slowest with g++.
You're asking why X is true. Unfortunately, X isn't consistently true at all.
We'd probably have to do pretty detailed analysis of the exact compiler and standard library you were using to give an answer that really meant much.
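One more thing that can differ between platforms is what clock() measures: on POSIX systems it reports processor time, while Microsoft's runtime reports elapsed wall-clock time. For I/O-heavy tests, a wall-clock timer may be the more comparable measure. A minimal sketch using std::chrono follows; seconds_taken is a made-up name, not part of the harness above.

```cpp
#include <chrono>

// Runs `f` once and returns the elapsed wall-clock time in seconds.
// Illustrative helper; steady_clock is used because it never jumps backwards.
template <class F>
double seconds_taken(F f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

It can be used like the timer template in the harness, for example seconds_taken(test1), with the caller printing the returned value.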

C/C++ reading and writing long strings to files

I have a list of cities that I'm formatting like this:
{town, ...},
{...},
...
Reading and building each town and creating town1, town2, ... works.
The problem is when I output it, 1st line works {town, ...}, but the second line crashes.
Any idea why?
I have [region] [town] (excel table).
So each region repeats by how many towns are in it.
Each file has 1 region/town per line.
judete contains each region repeated 1 time.
AB
SD
PC
....
orase contains the towns list.
town1
town2
....
orase-index contains the region of each town
AB
AB
AB
AB
SD
SD
SD
PC
PC
...
I want an output like this {"town1", "town2", ...}, where each row (e.g. row 5) contains the towns that belong to the region on the same row (row 5) of judete.
Here's my code:
#include<stdio.h>
#include<string.h>
char judet[100][100];
char orase[50][900000];
char oras[100], ceva[100];
void main ()
{
int i=0, nr;
FILE *judete, *index, *ORASE, *output;
judete = fopen("judete.txt", "rt");
index = fopen("orase-index.txt", "rt");
ORASE = fopen("orase.txt", "rt");
output = fopen("output.txt", "wt");
while( !feof(judete) )
{
fgets(judet[i], 100, judete);
i++;
}
nr = i;
char tmp[100];
int where=0;
for(i=0;i<nr;i++)
strcpy(orase[i],"");
while( !feof(index) )
{
fgets(tmp, 100, index);
for(i=0;i<nr;i++)
{
if( strstr(judet[i], tmp) )
{
fgets(oras, 100, ORASE);
strcat(ceva, "\"");
oras[strlen(oras)-1]='\0';
strcat(ceva, oras);
strcat(ceva, "\", ");
strcat(orase[i], ceva);
break;
}
}
}
char out[900000];
for(i=0;i<nr;i++)
{
strcpy(out, "");
strcat(out, "{");
strcat(out, orase[i]); //fails here
fprintf(output, "%s},\n", out);
}
}
The result I get from running the code is:
Unhandled exception at 0x00D4F7A9 (msvcr110d.dll) in orase-judete.exe: 0xC0000005: Access violation writing location 0x00A90000.
You don't clear the orase array, because your loop
for(i-0;i<nr;i++)
strcpy(orase[i],"");
by mistake ('-' instead of '=') executes 0 times.
I think you need to start by making up your mind whether you're writing C or C++. You've tagged this with both, but the code looks like it's pure C. While a C++ compiler will accept most C, the result isn't what most would think of as ideal C++.
Since you have tagged it as C++, I'm going to assume you actually want (or are all right with) C++ code. Well-written C++ code is going to be different enough from your current C code that it's probably easier to start over than to try to rewrite the code line by line or anything like that.
The immediate problem I see with doing that, however, is that you haven't really specified what you want as your output. For the moment I'm going to assume you want each line of output to be something like this: "{" <town> "," <town> "}".
If that's the case, I'd start by noting that the output doesn't seem to depend on your judete file at all. The orase and orase-index seem to be entirely adequate. For that, our code can look something like this:
#include <iostream>
#include <string>
#include <iterator>
#include <fstream>
#include <vector>
// a class that overloads `operator>>` to read a line at a time:
class line {
std::string data;
public:
friend std::istream &operator>>(std::istream &is, line &l) {
return std::getline(is, l.data);
}
operator std::string() const { return data; }
};
int main() {
// open the input files:
std::ifstream town_input("orase.txt");
std::ifstream region_input("orase-index.txt");
// create istream_iterator's to read from the input files. Note
// that these iterate over `line`s, (i.e., objects of the type
// above, so they use its `operator>>` to read each data item).
//
std::istream_iterator<line> regions(region_input),
towns(town_input),
end;
// read in the lists of towns and regions:
std::vector<std::string> town_list {towns, end};
std::vector<std::string> region_list {regions, end};
// write out the file of town-name, region-name:
std::ofstream result("output.txt");
for (int i=0; i<town_list.size(); i++)
result << "{" << town_list[i] << "," << region_list[i] << "}\n";
}
Note that since this is C++, you'll typically need to save the source as something.cpp instead of something.c for the compiler to recognize it correctly.
Edit: Based on the new requirements you've given in the comments, you apparently want something closer to this:
#include <algorithm> // std::copy
#include <iostream>
#include <string>
#include <iterator>
#include <fstream>
#include <vector>
#include <map>
// a class that overloads `operator>>` to read a line at a time:
class line {
std::string data;
public:
friend std::istream &operator>>(std::istream &is, line &l) {
return std::getline(is, l.data);
}
operator std::string() const { return data; }
};
int main() {
// open the input files:
std::ifstream town_input("orase.txt");
std::ifstream region_input("orase-index.txt");
// create istream_iterator's to read from the input files. Note
// that these iterate over `line`s, (i.e., objects of the type
// above, so they use its `operator>>` to read each data item).
//
std::istream_iterator<line> regions(region_input),
towns(town_input),
end;
// read in the lists of towns and regions:
std::vector<std::string> town_list (towns, end);
std::vector<std::string> region_list (regions, end);
// consolidate towns per region:
std::map<std::string, std::vector<std::string> > consolidated;
for (int i = 0; i < town_list.size(); i++)
consolidated[region_list[i]].push_back(town_list[i]);
// write out towns by region
std::ofstream output("output.txt");
for (auto pos = consolidated.begin(); pos != consolidated.end(); ++pos) {
output << pos->first << ": ";
std::copy(pos->second.begin(), pos->second.end(),
std::ostream_iterator<std::string>(output, "\t"));
output << "\n";
}
}
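Notice that ceva is never reset. As a global array it starts out empty, but every pass through the loop appends to it with strcat, so it keeps growing until it overflows its 100 bytes.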
Instead of using strcpy to initialize strings, I would recommend using static initialization:
char ceva[100]="";
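One way to avoid a scratch buffer that carries state from one iteration to the next is to build each quoted entry in a fresh local buffer and check sizes before appending. The sketch below is only an illustration: append_quoted is a made-up helper, and the 128-byte entry buffer is an assumption about the longest town name. It sticks to C library functions so it fits the style of the original code.

```cpp
#include <cstddef>
#include <cstdio>   // std::snprintf
#include <cstring>  // std::strlen, std::strcspn, std::memcpy

// Appends  "town",  (quoted, followed by a comma and a space) to `dest`,
// which has room for `dest_size` bytes in total. Returns false and leaves
// `dest` unchanged if the result would not fit. Hypothetical helper.
bool append_quoted(char* dest, std::size_t dest_size, const char* town) {
    char entry[128]; // fresh buffer on every call, so nothing accumulates
    // Use only the text before the first line break, so a line read with
    // fgets() can be passed in directly.
    int town_len = static_cast<int>(std::strcspn(town, "\r\n"));
    int written = std::snprintf(entry, sizeof entry, "\"%.*s\", ", town_len, town);
    if (written < 0 || static_cast<std::size_t>(written) >= sizeof entry)
        return false; // formatting failed or the entry was truncated
    std::size_t used = std::strlen(dest);
    if (used + static_cast<std::size_t>(written) + 1 > dest_size)
        return false; // not enough room left in dest
    std::memcpy(dest + used, entry, static_cast<std::size_t>(written) + 1);
    return true;
}
```

In the loop from the question, the four strcat calls could then be replaced by a single call such as append_quoted(orase[i], sizeof orase[i], oras), with the return value checked.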