Writing to file line by line vs writing whole text at once - c++

I was always told that file I/O operations are the slowest ones. However, when I test the two scenarios below:
Scenario 1:
test.open("test.xml",fstream::out);
for(int i=0;i<1000;i++)
{
test<<"<p> attr1=\"test1\" attr2=\"test2\" attr3=\"test3\" attr4=\"test4\">test5</p>\n";
}
test.close();
Scenario 2:
test.open("test.xml",fstream::out);
stringstream fileDataStr;
for(int i=0;i<1000;i++)
{
fileDataStr<<"<p> attr1=\"test1\" attr2=\"test2\" attr3=\"test3\" attr4=\"test4\">test5</p>\n";
}
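test<<fileDataStr.str(); // .str() yields the accumulated text; streaming the stringstream object itself does not write its contents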
test.close();
I expected scenario 1 to be slower because it does 1000 file I/O operations, but the test results showed that it runs at the same speed as scenario 2 (measured with clock_t). Why is this so? Is it related to OS optimization, as with file reads?
See also: getline while reading a file vs reading whole file and then splitting based on newline character
Edited: On the advice of @irW,
string fileDataStr;
changed to
stringstream fileDataStr;

Because of the way std::ofstream buffers output, you end up
doing exactly the same amount of IO in both cases. (Usually,
at any rate; an implementation could optimize things when
you output a very long string.) The only difference is that in
the second case, you've introduced an additional intermediate
buffer, which means a little more copying, and a few more
dynamic allocations. (How many dynamic allocations depends on
the implementation, but it shouldn't be too many.)
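To make the buffering point concrete, here is a minimal sketch (not taken from either poster's code) that counts how often a stream is actually asked to flush. CountingBuf and count_flushes are hypothetical names, and a std::stringbuf stands in for the file buffer. Writing '\n' only appends to the buffer, whereas std::endl also flushes on every line, which is the case where line-by-line output would genuinely cost more.

```cpp
#include <ostream>
#include <sstream>

// Hypothetical helper: a string buffer that counts flush requests.
class CountingBuf : public std::stringbuf {
public:
    int flushes = 0;

protected:
    // sync() is what std::ostream::flush() (and therefore std::endl) ends up calling.
    int sync() override {
        ++flushes;
        return std::stringbuf::sync();
    }
};

// Writes `lines` short lines and reports how many flushes were requested.
int count_flushes(bool use_endl, int lines) {
    CountingBuf buf;
    std::ostream out(&buf);
    for (int i = 0; i < lines; ++i) {
        out << "<p>test</p>";
        if (use_endl)
            out << std::endl;  // newline plus an explicit flush
        else
            out << '\n';       // newline only, stays in the buffer
    }
    return buf.flushes;
}
```

Under these assumptions, count_flushes(false, 1000) reports no flushes at all, while count_flushes(true, 1000) reports one per line. With a real std::ofstream each of those flushes would hand data to the OS, so Scenario 1 stays cheap only as long as it ends lines with "\n" rather than std::endl.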

Each time you do fileDataStr += ..., the string may have to allocate a larger buffer and copy its previous contents into it once its capacity is exhausted. Using a stringstream might make for a fairer comparison.

There is no one answer to this, because the results can and will vary with the compiler and standard library you use. For example, I put your different attempts together into a single program with a little test/timing harness. Then, just for fun, I added a fourth attempt (test3 in the code below):
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>
#include <string.h>
#include <time.h>
static const int limit = 1000000;
void test1() {
std::ofstream test("test.xml");
for (int i = 0; i < limit; i++)
{
test << "<p> attr1=\"test1\" attr2=\"test2\" attr3=\"test3\" attr4=\"test4\">test5</p>\n";
}
test.close();
}
void test11() {
std::ofstream test("test.xml");
std::string fileDataStr;
for (int i = 0; i < limit; i++)
{
fileDataStr += "<p> attr1=\"test1\" attr2=\"test2\" attr3=\"test3\" attr4=\"test4\">test5</p>\n";
}
test << fileDataStr;
test.close();
}
void test2() {
std::ofstream test("test.xml");
std::stringstream fileDataStr;
for (int i = 0; i < limit; i++)
{
fileDataStr << "<p> attr1=\"test1\" attr2=\"test2\" attr3=\"test3\" attr4=\"test4\">test5</p>\n";
}
test << fileDataStr.str();
test.close();
}
void test3() {
std::ofstream test("test.xml");
std::vector<char> buffer;
char line [] = "<p> attr1=\"test1\" attr2=\"test2\" attr3=\"test3\" attr4=\"test4\">test5</p>\n";
size_t len = strlen(line);
buffer.reserve(limit * len + 1);
for (int i = 0; i < limit; i++)
std::copy(line, line + len, std::back_inserter(buffer));
test.write(&buffer[0], buffer.size());
test.close();
}
template <class T>
void timer(T f) {
clock_t start = clock();
f();
clock_t stop = clock();
std::cout << double(stop - start) / CLOCKS_PER_SEC << " seconds\n";
}
int main() {
timer(test1);
timer(test11);
timer(test2);
timer(test3);
}
Then I compiled it with VC++, and got the following results:
0.681 seconds
0.659 seconds
0.874 seconds
0.955 seconds
Then, I compiled with g++, and got these results:
1.267 seconds
0.725 seconds
0.795 seconds
0.649 seconds
The fourth version (the one I added) gives the worst performance with VC++, but the best performance with g++. The one that was second fastest with VC++ is (by far) the slowest with g++.
You're asking why X is true. Unfortunately, X isn't consistently true at all.
We'd probably have to do pretty detailed analysis of the exact compiler and standard library you were using to give an answer that really meant much.

Related

Reading lines from input

I'm looking to read from std::cin with the syntax below (it is always int, int, int, char[]/str). What would be the fastest way to parse the data into an int array[3] and either a string or a char array?
#NumberOfLines(i.e.10000000)
1,2,2,'abc'
2,2,2,'abcd'
1,2,3,'ab'
...1M+ to 10M+ more lines, always in the form of (int,int,int,str)
At the moment, I'm doing something along these lines:
//unsync stdio
std::ios_base::sync_with_stdio(false);
std::cin.tie(NULL);
//read from cin, one line per iteration
while(getline(cin,str)){
//peel the three leading ints off the front of the line
for(int i=0;i<3;i++){
size_t commaindex = str.find(',');
string substring = str.substr(0,commaindex);
array[i]=atoi(substring.c_str());
str.erase(0,commaindex+1);
}
//whatever is left is the quoted string
label = str;
//assign array and label to other stuff and do other stuff, repeat
}
I'm quite new to C++ and recently learned profiling with Visual Studio, but I'm not the best at interpreting the results. I/O takes up 68.2% and the kernel takes 15.8% of CPU usage. getline() accounts for 35.66% of the elapsed inclusive time.
Is there any way I can do something similar to reading large chunks at once to avoid calling getline() as much? I've been told fgets() is much faster, however, I'm unsure of how to use it when I cannot predict the number of characters to specify.
I've attempted to use scanf as follows; however, it was slower than the getline method. I have also tried stringstreams, but that was incredibly slow.
scanf("%i,%i,%i,%s",&array[0],&array[1],&array[2],str);
Also if it matters, it is run on a server with low memory available. I think reading the entire input to buffer would not be viable?
Thanks!
Update: Using @ted-lyngmo's approach, I gathered the results below.
time wc datafile
real 4m53.506s
user 4m14.219s
sys 0m36.781s
time ./a.out < datafile
real 2m50.657s
user 1m55.469s
sys 0m54.422s
time ./a.out datafile
real 2m40.367s
user 1m53.523s
sys 0m53.234s
You could use std::from_chars (and reserve() the approximate amount of lines you have in the file, if you store the values in a vector for example). I also suggest adding support for reading directly from the file. Reading from a file opened by the program is (at least for me) faster than reading from std::cin (even with sync_with_stdio(false)).
Example:
#include <algorithm> // std::for_each
#include <cctype> // std::isspace
#include <charconv> // std::from_chars
#include <cstdint> // std::uintmax_t
#include <cstdio> // std::perror
#include <fstream>
#include <iostream>
#include <iterator> // std::istream_iterator
#include <limits> // std::numeric_limits
#include <string>
#include <system_error> // std::errc
struct foo {
int a[3];
std::string s;
};
std::istream& operator>>(std::istream& is, foo& f) {
if(std::getline(is, f.s)) {
std::from_chars_result fcr{f.s.data(), {}};
const char* end = f.s.data() + f.s.size();
// extract the numbers
for(unsigned i = 0; i < 3 && fcr.ptr < end; ++i) {
fcr = std::from_chars(fcr.ptr, end, f.a[i]);
if(fcr.ec != std::errc{}) {
is.setstate(std::ios::failbit);
return is;
}
// find next non-whitespace
do ++fcr.ptr;
while(fcr.ptr < end &&
std::isspace(static_cast<unsigned char>(*fcr.ptr)));
}
// extract the string
if(++fcr.ptr < end)
f.s = std::string(fcr.ptr, end - 1);
else
is.setstate(std::ios::failbit);
}
return is;
}
std::ostream& operator<<(std::ostream& os, const foo& f) {
for(int i = 0; i < 3; ++i) {
os << f.a[i] << ',';
}
return os << '\'' << f.s << "'\n";
}
int main(int argc, char* argv[]) {
std::ifstream ifs;
if(argc >= 2) {
ifs.open(argv[1]); // if a filename is given as argument
if(!ifs) {
std::perror(argv[1]);
return 1;
}
} else {
std::ios_base::sync_with_stdio(false);
std::cin.tie(nullptr);
}
std::istream& is = argc >= 2 ? ifs : std::cin;
// ignore the first line - it's of no use in this demo
is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
// read all `foo`s from the stream
std::uintmax_t co = 0;
std::for_each(std::istream_iterator<foo>(is), std::istream_iterator<foo>(),
[&co](const foo& f) {
// Process each foo here
// Just counting them for demo purposes:
++co;
});
std::cout << co << '\n';
}
My test runs on a file with 1'000'000'000 lines with content looking like below:
2,2,2,'abcd'
2, 2,2,'abcd'
2, 2, 2,'abcd'
2, 2, 2, 'abcd'
Unix time wc datafile
1000000000 2500000000 14500000000 datafile
real 1m53.440s
user 1m48.001s
sys 0m3.215s
time ./my_from_chars_prog datafile
1000000000
real 1m43.471s
user 1m28.247s
sys 0m5.622s
From this comparison I think one can see that my_from_chars_prog is able to successfully parse all entries pretty fast. It was consistently faster at doing so than wc, a standard Unix tool whose only purpose is to count lines, words and characters.
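For readers who only want the parsing step, here is a stripped-down sketch of the same std::from_chars idea applied to a single line. parse_line is a hypothetical name, not part of the program above, and it assumes C++17 and the strict layout int,int,int,'text' with optional spaces after the commas. The label is returned as a std::string_view into the input line, so nothing is copied.

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

// Hypothetical helper: parses "int,int,int,'text'" from one line.
// Returns false on malformed input. `label` refers into `line` (no copy).
bool parse_line(std::string_view line, int (&nums)[3], std::string_view& label) {
    const char* p = line.data();
    const char* end = p + line.size();

    for (int i = 0; i < 3; ++i) {
        while (p < end && *p == ' ') ++p;          // from_chars does not skip spaces
        const auto res = std::from_chars(p, end, nums[i]);
        if (res.ec != std::errc{}) return false;   // not a number, or out of range
        p = res.ptr;
        if (p == end || *p != ',') return false;   // every number is followed by a comma
        ++p;
    }

    while (p < end && *p == ' ') ++p;
    // What remains must be a quoted label: 'text'
    if (end - p < 2 || *p != '\'' || end[-1] != '\'') return false;
    label = std::string_view(p + 1, static_cast<std::size_t>(end - p - 2));
    return true;
}
```

Called on each line obtained from std::getline, this fills the three ints and leaves the label pointing at the characters between the quotes. Copy the label into a std::string before reading the next line if it has to outlive the line buffer.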

Stack overflow with no recursion?

As far as I can see, my code has no recursion, but I am getting exception 0xC00000FD. According to Rider For Unreal Engine, it is happening in the main function. It's being encountered at the decompiled code
mov byte ptr [r11], 0x0
It was working fine, then suddenly when I ran it a second time, just to make sure it worked, it broke. It now gives that exception every time.
Here is my code:
// Language.cpp
#include <fstream>
#include <iostream>
#include <regex>
#include <stdexcept>
#include <string>
#include "Lexer/Lexer.h"
#include "Util/StrUtil.h"
int main(int argc, char* argv[])
{
std::string line, fileString;
std::ifstream file;
file.open(argv[1]);
if(file.is_open())
{
int lines = 0;
while(file.good())
{
if (lines > 10000)
{
std::cerr << "Whoa there, that file's a little long! Try compiling something less than 10,000 lines." << '\n';
return -1;
}
getline(file, line);
fileString += line + (char)10;
lines++;
}
}
fileString += (char)0x00;
std::string arr[10000];
int arrLength = StrUtil::split(fileString, "\n", arr);
Line lines[10000];
Lexer::lex(arr, arrLength, lines);
return 0;
}
// Lexer.cpp
#include "Lexer.h"
void Lexer::lex(std::string (&str)[10000], int length, Line (&lines)[10000])
{
for (int i = 0; i < length; i++)
{
}
}
// StrUtil.cpp
#include "StrUtil.h"
#include <stdexcept>
#include <string>
int StrUtil::split(std::string str, std::string delimiter, std::string (&out)[10000])
{
int pos = 0;
out[0] = str;
while (out[pos].find(delimiter) != std::string::npos)
{
const size_t found = out[pos].find(delimiter);
out[pos + 1] = out[pos].substr(found + delimiter.length());
out[pos] = out[pos].substr(0, found);
pos++;
}
return pos + 2;
}
For custom types, Line is an array of Tokens, which are just <TokenType, std::string> pairs.
As far as I can see, my code has no recursion
Indeed, there is no recursion in the shown code. Lack of recursion doesn't mean that you won't overflow the stack.
Stack overflow with no recursion?
Yes, this program is likely going to overflow the stack on some systems.
std::string arr[10000];
The typical size of std::string is 32 bytes (it can vary greatly between language implementations), so 10000 strings take about 312 KiB. The execution stack on most desktop systems is one to a few MiB. That one array is therefore about a third of the memory that has to fit all of the automatic variables at the deepest stack frame of the program. It is very feasible that the remaining stack memory isn't enough for the rest of the program, especially considering that you have another huge array of Line objects.
To fix the program, do not allocate massive variables such as this in automatic storage.
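One common way to follow that advice is to let std::vector own the storage, since a vector keeps its elements on the heap and grows as needed. The sketch below is an assumption about how the splitting step could look, not the asker's StrUtil; split_lines is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical replacement for a fixed std::string[10000] output array:
// the vector's elements live on the heap, so the stack frame stays small.
std::vector<std::string> split_lines(const std::string& text,
                                     const std::string& delimiter = "\n") {
    std::vector<std::string> out;
    if (delimiter.empty()) {          // an empty delimiter would never advance
        out.push_back(text);
        return out;
    }
    std::size_t start = 0;
    std::size_t found;
    while ((found = text.find(delimiter, start)) != std::string::npos) {
        out.push_back(text.substr(start, found - start));
        start = found + delimiter.size();
    }
    out.push_back(text.substr(start)); // piece after the last delimiter
    return out;
}
```

The caller then works with the returned vector and its size() instead of passing array references and a separate length around. The same change applies to the Line array: a std::vector<Line> sized from the number of lines avoids the second large stack allocation.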

Reversing a string? a more optimal way

#include<iostream>
#include<string.h>
using namespace std;
int main ()
{
char str[50], temp;
int i, j;
cout << "Enter a string : ";
cin.getline(str, sizeof(str)); // gets() cannot limit the input length and was removed in C++14
j = strlen(str) - 1;
for (i = 0; i < j; i++,j--)
{
temp = str[i];
str[i] = str[j];
str[j] = temp;
}
cout << "\nReverse string : " << str;
return 0;
}
Is there a more optimal way to reverse a string than this? The function below, instead of using the temp variable, starts from the last position of s and copies the characters into a new string in reverse order.
string reverse(string s)
{
string reversed ;
for(int i = (int)s.length() - 1; i >= 0; --i)
{
reversed +=s[i];
}
return reversed;
}
You can use std::reverse to reverse the string in place, with complexity being (last - first)/2 swaps, which is exactly the complexity of the first function, but is cleaner.
The second method has an overhead of extra allocation which will probably end up being slower.
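A short sketch of both variants, for comparison. The function names are made up for illustration. The appending version calls reserve() first, which is an addition of mine to limit the allocation overhead mentioned above to a single allocation.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// In place: swaps characters from both ends, no extra allocation.
void reverse_in_place(std::string& s) {
    std::reverse(s.begin(), s.end());
}

// Builds a new string back to front; the original is left untouched.
std::string reverse_by_appending(const std::string& s) {
    std::string reversed;
    reversed.reserve(s.size());              // one allocation up front
    for (std::size_t i = s.size(); i > 0; --i)
        reversed += s[i - 1];
    return reversed;
}
```

Counting down with an unsigned index and reading s[i - 1] avoids the off-by-one and signed/unsigned pitfalls of starting at length().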
I have tested the two options with quite large files.
The std::reverse(ss.begin(), ss.end()); corresponds to the first option (no copy)
The ss_r = new std::string(ss.rbegin(), ss.rend()); corresponds to the second option (copy)
#include <iostream>
#include <fstream>
#include <chrono>
#include <string>
#include <algorithm>
//Just for reading the file
std::string read_file(char * file_name)
{
std::ifstream file(file_name);
std::string ss;
file.seekg(0, std::ios::end);
std::cout << file.tellg() <<std::endl;
ss.resize(file.tellg());
file.seekg(0, std::ios::beg);
file.read(&ss[0], ss.size());
file.close();
return ss;
}
//The real test
int main(int arg, char ** args)
{
std::string ss = read_file(args[1]);
std::string * ss_r=NULL;
std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
start = std::chrono::high_resolution_clock::now();
if(args[2]==std::string("copy"))
{
//Second option
ss_r = new std::string(ss.rbegin(), ss.rend());
}
else
{
//First option
std::reverse(ss.begin(), ss.end());
}
end = std::chrono::high_resolution_clock::now();
int elapsed_nano_seconds = std::chrono::duration_cast<std::chrono::nanoseconds>
(end-start).count();
if(ss_r!=NULL)
{
std::cout<<*ss_r<<std::endl;
}
else
{
std::cout<<ss<<std::endl;
}
std::cout << elapsed_nano_seconds<<std::endl;
}
Testing with icpc test.cpp -O3 --std=c++11
a.out Test_file no_copy runs in 160 microseconds
a.out Test_file copy runs in 320 microseconds
On the other hand, with the first option you lose the original string.
So, in summary: if you don't care about losing the original string, go with std::reverse; if you want to keep it, go with std::string(ss.rbegin(), ss.rend()).

Reading number list from file to a dynamic array

I'm having trouble reading a number list from a .txt file to a dynamic array of type double. The first number in the list is the number of numbers to add to the array. After the first number, the numbers in the list all have decimals.
My header file:
#include <iostream>
#ifndef SORT
#define SORT
class Sort{
private:
double i;
double* darray; // da array
double j;
double size;
public:
Sort();
~Sort();
std::string getFileName(int, char**);
bool checkFileName(std::string);
void letsDoIt(std::string);
void getArray(std::string);
};
#endif
main.cpp:
#include <stdio.h>
#include <stdlib.h>
#include "main.h"
int main(int argc, char** argv)
{
Sort sort;
std::string cheese = sort.getFileName(argc, argv); //cheese is the file name
bool ean = sort.checkFileName(cheese); //pass in file name fo' da check
sort.letsDoIt(cheese); //starts the whole thing up
return 0;
}
impl.cpp:
#include <iostream>
#include <fstream>
#include <cstring>
#include <stdlib.h>
#include "main.h"
Sort::Sort(){
darray[0];
i = 0;
j = 0;
size = 0;
}
Sort::~Sort(){
std::cout << "Destroyed" << std::endl;
}
std::string Sort::getFileName(int argc, char* argv[]){
std::string fileIn = "";
for(int i = 1; i < argc;)//argc the number of arguements
{
fileIn += argv[i];//argv the array of arguements
if(++i != argc)
fileIn += " ";
}
return fileIn;
}
bool Sort::checkFileName(std::string userFile){
if(userFile.empty()){
std::cout<<"No user input"<<std::endl;
return false;
}
else{
std::ifstream tryread(userFile.c_str());
if (tryread.is_open()){
tryread.close();
return true;
}
else{
return false;
}
}
}
void Sort::letsDoIt(std::string file){
getArray(file);
}
void Sort::getArray(std::string file){
double n = 0;
int count = 0;
// create a file-reading object
std::ifstream fin;
fin.open(file.c_str()); // open a file
fin >> n; //first line of the file is the number of numbers to collect to the array
size = n;
std::cout << "size: " << size << std::endl;
darray = (double*)malloc(n * sizeof(double)); //allocate storage for the array
// read each line of the file
while (!fin.eof())
{
fin >> n;
if (count == 0){ //if count is 0, don't add to array
count++;
std::cout << "count++" << std::endl;
}
else {
darray[count - 1] = n; //array = line from file
count++;
}
std::cout << std::endl;
}
free((void*) darray);
}
I have to use malloc, but I think I may be using it incorrectly. I've read other posts but I am still having trouble understanding what is going on.
Thanks for the help!
Your use of malloc() is fine. Your reading is not doing what you want it to do.
Say I have the input file:
3
1.2
2.3
3.7
My array would be:
[0]: 2.3
[1]: 3.7
[2]: 0
This is because you are reading in the value 1.2 as if you were rereading the number of values.
When you have this line:
fin >> n; //first line of the file is the number of numbers to collect to the array
You are reading in the count, in this case 3, and advancing where in the file you will read from next. You are then attempting to reread that value but are getting the first entry instead.
I believe that replacing your while() {...} with the code below will do what you are looking for.
while (count != size && fin >> n)
{
darray[count++] = n; //array = line from file
std::cout << n << std::endl;
}
This should give you the correct values in the array:
[0]: 1.2
[1]: 2.3
[2]: 3.7
You appear to be writing the next exploitable program. You are mistakenly trusting the first line of the file to determine your buffer size, then reading an unlimited amount of data from the remainder of the file into a buffer that is not unlimited. This allows an evil input file to trash some other memory in your program, possibly allowing the creator of that file to take control of your computer. Oh noes!
Here's what you need to do to fix it:
1. Remember how much memory you allocated (you'll need it in step #2). Have a variable alleged_size or array_length that is separate from the one you use to read the rest of the data.
2. Don't allow count to run past the end of the array. Your loop should look more like this:
while ((count < alleged_size) && (fin >> n))
This both prevents array overrun and decides whether to process data based on whether it was parsed successfully, not whether you reached the end-of-file at some useless point in the past.
The less problematic bug is the one @bentank noticed: you didn't realize that the stream keeps its position in the file, which is after the first line, so you shouldn't expect to hit that line again within the loop.
In addition to this, you probably want to deallocate the memory in your destructor. Right now you throw the data away immediately after parsing it. Wouldn't other functions like to party on that data too?

How to output array of doubles to hard drive?

I would like to know how to output an array of doubles to the hard drive.
edit:
For further clarification: I would like to output it to a file on the hard drive (I/O functions), preferably in a file format that can be quickly translated back into an array of doubles in another program. It would also be nice if it was stored in a standard 4 byte configuration so that I can look at it through a hex viewer and see the actual values.
Hey... so you want to do it in a single write/read? Well, it's not too hard. The following code should work fine; it may need some extra error checking, but the trial case was successful:
#include <cstddef> // size_t
#include <cstdlib> // rand
#include <fstream>
#include <iostream>
#include <string>
bool saveArray( const double* pdata, size_t length, const std::string& file_path )
{
std::ofstream os(file_path.c_str(), std::ios::binary | std::ios::out);
if ( !os.is_open() )
return false;
os.write(reinterpret_cast<const char*>(pdata), std::streamsize(length*sizeof(double)));
os.close();
return true;
}
bool loadArray( double* pdata, size_t length, const std::string& file_path)
{
std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
if ( !is.is_open() )
return false;
is.read(reinterpret_cast<char*>(pdata), std::streamsize(length*sizeof(double)));
is.close();
return true;
}
int main()
{
double* pDbl = new double[1000];
int i;
for (i=0 ; i<1000 ; i++)
pDbl[i] = double(rand());
saveArray(pDbl,1000,"test.txt");
double* pDblFromFile = new double[1000];
loadArray(pDblFromFile, 1000, "test.txt");
for (i=0 ; i<1000 ; i++)
{
if ( pDbl[i] != pDblFromFile[i] )
{
std::cout << "error, loaded data not the same!\n";
break;
}
}
if ( i==1000 )
std::cout << "success!\n";
delete [] pDbl;
delete [] pDblFromFile;
return 0;
}
Just make sure you allocate appropriate buffers! But that's a whole other topic.
Use std::copy() with the stream iterators. This way if you change 'data' into another type the alterations to code would be trivial.
#include <algorithm>
#include <iterator>
#include <fstream>
int main()
{
double data[1000] = {/*Init Array */};
{
// Write data to a file.
std::ofstream outfile("data");
std::copy(data,
data+1000,
std::ostream_iterator<double>(outfile," ")
);
}
{
// Read data from a file
std::ifstream infile("data");
std::copy(std::istream_iterator<double>(infile),
std::istream_iterator<double>(),
data // Assuming data is large enough.
);
}
}
You can use iostream .read() and .write().
It works (very roughly!) like this:
double d[2048];
fill(d, d+2048, 0);
ofstream outfile ("save.bin", ios::binary);
outfile.write(reinterpret_cast<char*>(&d), sizeof(d));
outfile.close(); // flush buffered data before reading the file back
ifstream infile ("save.bin", ios::binary);
infile.read(reinterpret_cast<char*>(&d), sizeof(d));
Note that this is not portable between CPU architectures. Some may have different sizes of double. Some may store the bytes in a different order. It shouldn't be used for data files that move between machines or data that is sent over the network.
#include <fstream.h>
void saveArray(double* array, int length);
int main()
{
double array[] = { 15.25, 15.2516, 84.168, 84356};
saveArray(array, 4);
return 0;
}
void saveArray(double* array, int length)
{
ofstream output("output.txt");
for(int i=0;i<length;i++)
{
output<<array[i]<<endl;
}
}
Here is a way to output an array of doubles to a text file, one per line. Hope this helps.
EDIT
Replace the top line with these two and it will compile in VS. You can use multithreading to avoid blocking the system while saving data.
#include <fstream>
using namespace std;
Now I feel old. I asked this question a long time ago (except about ints).
comp.lang.c++ link
#include <iostream>
#include <fstream>
using namespace std;
int main () {
double theArray[]=...;
int arrayLength=...;
ofstream myfile;
myfile.open ("example.txt");
for(int i=0; i<arrayLength; i++) {
myfile << theArray[i]<<"\n";
}
myfile.close();
return 0;
}
adapted from http://www.cplusplus.com/doc/tutorial/files/
Just set theArray and arrayLength to whatever your code requires.