Many of my programs output huge volumes of data for me to review in Excel. The best way to view all these files is to use a tab-delimited text format. Currently I use this chunk of code to get it done:
ofstream output (fileName.c_str());
for (int j = 0; j < dim; j++)
{
for (int i = 0; i < dim; i++)
output << arrayPointer[j * dim + i] << " ";
output << endl;
}
This seems to be a very slow operation. Is there a more efficient way of outputting text files like this to the hard drive?
Update:
Taking the two suggestions into mind, the new code is this:
ofstream output (fileName.c_str());
for (int j = 0; j < dim; j++)
{
for (int i = 0; i < dim; i++)
output << arrayPointer[j * dim + i] << "\t";
output << "\n";
}
output.close();
This writes to the hard drive at 500 KB/s.
But this writes to the hard drive at 50 MB/s:
{
output.open(fileName.c_str(), std::ios::binary | std::ios::out);
output.write(reinterpret_cast<char*>(arrayPointer), std::streamsize(dim * dim * sizeof(double)));
output.close();
}
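For reference, here is a minimal sketch of the raw binary dump together with a read-back check. The helper names (write_raw, read_raw, round_trips) are made up for illustration, and an in-memory stringstream stands in for the file. Keep in mind that a raw dump like this is not text, so Excel cannot open it directly, and the bytes are only portable between machines with the same double layout and endianness.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Dump the doubles as raw bytes (native endianness and layout).
void write_raw(std::ostream& out, const std::vector<double>& data)
{
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(data.size() * sizeof(double)));
}

// Read back `count` doubles written by write_raw on the same platform.
std::vector<double> read_raw(std::istream& in, std::size_t count)
{
    std::vector<double> data(count);
    in.read(reinterpret_cast<char*>(data.data()),
            static_cast<std::streamsize>(count * sizeof(double)));
    // The stream must have held at least `count` doubles.
    assert(in.gcount() == static_cast<std::streamsize>(count * sizeof(double)));
    return data;
}

// Round-trip check through an in-memory stream, which stands in for a
// file opened with std::ios::binary.
bool round_trips(const std::vector<double>& data)
{
    std::stringstream buf(std::ios::in | std::ios::out | std::ios::binary);
    write_raw(buf, data);
    return read_raw(buf, data.size()) == data;
}
```

With a real file, the same write_raw call works on an ofstream opened with std::ios::binary.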
Use C IO, it's a lot faster than C++ IO. I've heard of people in programming contests timing out purely because they used C++ IO and not C IO.
#include <cstdio>
FILE* fout = fopen(fileName.c_str(), "w");
for (int j = 0; j < dim; j++)
{
for (int i = 0; i < dim; i++)
fprintf(fout, "%d\t", arrayPointer[j * dim + i]);
fprintf(fout, "\n");
}
fclose(fout);
Just change %d to the format specifier that matches your element type (for example, %g for double).
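As a sketch of what that looks like when the array holds double values (which the sizeof(double) in the question suggests), the stdio version could be written as follows. The function names write_tsv_stdio and slurp are hypothetical helpers, not part of any library, and unlike the loop above this version does not emit a trailing tab at the end of each row.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Write a dim x dim row-major matrix of doubles as tab-separated text.
void write_tsv_stdio(std::FILE* fout, const double* data, int dim)
{
    assert(fout != nullptr && dim >= 0);
    for (int j = 0; j < dim; j++) {
        for (int i = 0; i < dim; i++) {
            // "%g" matches double; passing a double to "%d" is undefined behaviour.
            std::fprintf(fout, "%g", data[j * dim + i]);
            // Tab between columns, newline after the last column.
            std::fputc(i + 1 < dim ? '\t' : '\n', fout);
        }
    }
}

// Helper for checking the result: read a whole FILE* back into a string.
std::string slurp(std::FILE* f)
{
    std::rewind(f);
    std::string text;
    int c;
    while ((c = std::fgetc(f)) != EOF)
        text.push_back(static_cast<char>(c));
    return text;
}
```

Note that "%g" prints six significant digits by default; use something like "%.15g" if you need more precision in the file.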
Don't use endl. It will be flushing the stream buffers, which is potentially very inefficient. Instead:
output << '\n';
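Applied to the loop from the question, that could look like the sketch below. The names write_tsv and to_tsv are illustrative only, and writing to a generic std::ostream is my assumption so that the same function works for an ofstream or an in-memory stream.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Write a dim x dim row-major matrix as tab-separated text without
// flushing the stream after every row.
void write_tsv(std::ostream& out, const double* data, int dim)
{
    assert(dim >= 0);
    for (int j = 0; j < dim; j++) {
        for (int i = 0; i < dim; i++) {
            if (i > 0) out << '\t';      // tab between columns only
            out << data[j * dim + i];
        }
        out << '\n';                     // plain newline, no flush
    }
}

// Convenience wrapper that returns the text instead of writing a file.
std::string to_tsv(const double* data, int dim)
{
    std::ostringstream out;
    write_tsv(out, data, dim);
    return out.str();
}
```

The stream is still flushed once when it is closed or destroyed, so nothing is lost by dropping endl.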
I decided to test JPvdMerwe's claim that C stdio is faster than C++ IO streams. (Spoiler: yes, but not necessarily by much.) To do this, I used the following test programs:
Common wrapper code, omitted from programs below:
#include <iostream>
#include <cstdio>
int main (void) {
// program code goes here
}
Program 1: normal synchronized C++ IO streams
for (int j = 0; j < ROWS; j++) {
for (int i = 0; i < COLS; i++) {
std::cout << (i-j) << "\t";
}
std::cout << "\n";
}
Program 2: unsynchronized C++ IO streams
Same as program 1, except with std::cout.sync_with_stdio(false); prepended.
Program 3: C stdio printf()
for (int j = 0; j < ROWS; j++) {
for (int i = 0; i < COLS; i++) {
printf("%d\t", i-j);
}
printf("\n");
}
All programs were compiled with GCC 4.8.4 on Ubuntu Linux, using the following command:
g++ -Wall -ansi -pedantic -DROWS=10000 -DCOLS=1000 prog.cpp -o prog
and timed using the command:
time ./prog > /dev/null
Here are the results of the test on my laptop (measured in wall clock time):
Program 1 (synchronized C++ IO): 3.350s (= 100%)
Program 2 (unsynchronized C++ IO): 3.072s (= 92%)
Program 3 (C stdio): 2.592s (= 77%)
I also ran the same test with g++ -O2 to test the effect of optimization, and got the following results:
Program 1 (synchronized C++ IO) with -O2: 3.118s (= 100%)
Program 2 (unsynchronized C++ IO) with -O2: 2.943s (= 94%)
Program 3 (C stdio) with -O2: 2.734s (= 88%)
(The last line is not a fluke; program 3 consistently runs slower for me with -O2 than without it!)
Thus, my conclusion is that, based on this test, C stdio is indeed about 10% to 25% faster for this task than (synchronized) C++ IO. Using unsynchronized C++ IO saves about 5% to 10% over synchronized IO, but is still slower than stdio.
Ps. I tried a few other variations, too:
Using std::endl instead of "\n" is, as expected, slightly slower, but the difference is less than 5% for the parameter values given above. However, printing more but shorter output lines (e.g. -DROWS=1000000 -DCOLS=10) makes std::endl more than 30% slower than "\n".
Piping the output to a normal file instead of /dev/null slows down all the programs by about 0.2s, but makes no qualitative difference to the results.
Increasing the line count by a factor of 10 also yields no surprises; the programs all take about 10 times longer to run, as expected.
Prepending std::cout.sync_with_stdio(false); to program 3 has no noticeable effect.
Using (double)(i-j) (and "%g\t" for printf()) slows down all three programs a lot! Notably, program 3 is still fastest, taking only 9.3s where programs 1 and 2 each took a bit over 14s, a speedup of nearly 40%! (And yes, I checked, the outputs are identical.) Using -O2 makes no significant difference either.
Does it have to be written in C? If not, there are many tools already written in C, e.g. (g)awk (which can be used on Unix and Windows), that do the job of file parsing really well, even on big files.
awk '{$1=$1}1' OFS="\t" file
It may be faster to do it this way:
ofstream output (fileName.c_str());
for (int j = 0; j < dim; j++)
{
for (int i = 0; i < dim; i++)
output << arrayPointer[j * dim + i] << '\t';
output << '\n';
}
ofstream output (fileName.c_str());
for (int j = 0; j < dim; j++)
{
for (int i = 0; i < dim; i++)
output << arrayPointer[j * dim + i] << '\t';
output << endl;
}
Use '\t' instead of " "
I have a program that reads data from a file (a matrix of 400,000 * 3 elements) into a two-dimensional array. However, there is a problem: it all takes a long time (6 seconds). According to the test requirements, this should take no more than 2 seconds.
int main()
{
ifstream file_for_reading("C:\\Tests\\20");
int a,b,c;
int edge, number_of_vertexes;
file_for_reading >> number_of_vertexes >> edge;
if (number_of_vertexes < 1 || number_of_vertexes > 30000 || edge < 0 || edge>400000) { cout << "Correct your values"; exit(1); };
short** matrix = new short* [edge];
for (int i = 0; i < edge; i++)
matrix[i] = new short[3];
int tmp = 0;
for (int i = 0; i < edge; i++) {
file_for_reading >> matrix[i][tmp] >> matrix[i][tmp+1] >> matrix[i][tmp+2];
tmp = 0;
}
file_for_reading.close();
//Dijkstra(matrix, number_of_vertexes);
}
S.M.'s advice is promising - just short* matrix = new short [edge * 3]; then for (int i = 0; i < edge * 3; i++) file_for_reading >> matrix[i]; to read the file. Crucially, this puts all the file content into contiguous memory, which is more CPU cache friendly.
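A minimal sketch of that contiguous layout, using std::vector instead of a raw new[] (my substitution, so the memory is released automatically). The names read_edges and at are hypothetical; at(matrix, row, col) takes the place of matrix[row][col].

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>   // only needed for the in-memory usage example
#include <vector>

// Read `edge` rows of three shorts into one contiguous, row-major buffer.
std::vector<short> read_edges(std::istream& in, std::size_t edge)
{
    std::vector<short> matrix(edge * 3);
    for (std::size_t i = 0; i < matrix.size(); ++i)
        in >> matrix[i];
    return matrix;
}

// Element (row, col) of the flat buffer; replaces matrix[row][col].
inline short at(const std::vector<short>& matrix, std::size_t row, std::size_t col)
{
    assert(col < 3 && row * 3 + col < matrix.size());
    return matrix[row * 3 + col];
}
```

For a quick check without a file, pass a std::istringstream such as "1 2 3\n4 5 6\n" with edge set to 2; with the real input, pass the ifstream after reading the two header values.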
Using the following code I generated test input and measured the performance of your original approach and the contiguous-memory approach:
#include <iostream>
#include <fstream>
#include <cstdlib>
#include <cstring>
#include <string>
using namespace std::literals;
#define ASSERT(X) \
do { \
if (X) break; \
std::cerr << ':' << __LINE__ << " ASSERT " << #X << '\n'; \
exit(1); \
} while (false)
int main(int argc, const char* argv[])
{
// for (int i = 0; i < 400000 * 3; ++i)
// std::cout << rand() % 32768 << ' ';
// old way...
std::ifstream in{"fasterread.in"};
ASSERT(in);
if (argc == 2 && argv[1] == "orig"s) {
short** m = new short*[400000];
for (int i = 0; i < 400000; ++i)
m[i] = new short[3];
for (int i = 0; i < 400000; ++i)
in >> m[i][0] >> m[i][1] >> m[i][2];
}
if (argc == 2 && argv[1] == "contig"s) {
short* m = new short[400000 * 3];
for (int i = 0; i < 400000 * 3; ++i)
in >> m[i];
}
}
I then compiled them with optimisations using GCC on Linux:
g++ -O2 -Wall -std=c++20 fasterread.cc -o fasterread
And ran them with the time utility to show elapsed time:
time ./fasterread orig; time ./fasterread contig
Over a dozen runs of each, the fastest the orig version completed was 0.063 seconds (I have a fast SSD), whilst contig took as little as 0.058 seconds. Still not fast enough to meet your 3-fold reduction target.
That said, C++ ifstream supports locale translations whilst parsing numbers - using a slowish virtual dispatch mechanism - so may be slower than other text-to-number parsing that you could use or write.
But, when you're 100x slower than me - it's obviously your old HDD that sucks, and not the software parsing the numbers....
FWIW, I tried C-style I/O using fscanf and it proved slower for me - 0.077s.
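If the locale-aware stream parsing mentioned above does turn out to be the bottleneck, one locale-independent alternative is std::from_chars from C++17. The sketch below assumes the whole file has already been read into a std::string; parse_shorts is a made-up name, and I have not timed this variant, so treat it as something to measure rather than a guaranteed win.

```cpp
#include <cassert>
#include <charconv>
#include <string>
#include <system_error>
#include <vector>

// Parse whitespace-separated integers from an in-memory buffer.
std::vector<short> parse_shorts(const std::string& text)
{
    std::vector<short> values;
    const char* p = text.data();
    const char* end = p + text.size();
    while (p < end) {
        // Skip separators between numbers.
        while (p < end && (*p == ' ' || *p == '\n' || *p == '\r' || *p == '\t'))
            ++p;
        if (p == end) break;
        short v = 0;
        const std::from_chars_result res = std::from_chars(p, end, v);
        if (res.ec != std::errc())
            break;                 // stop at the first malformed or out-of-range token
        assert(res.ptr > p);       // a successful parse consumes at least one character
        values.push_back(v);
        p = res.ptr;
    }
    return values;
}
```

The returned vector is contiguous, so it combines naturally with the flat row-major layout: element (row, col) is values[row * 3 + col].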
Here's an optimized version for you:
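int i = 0;  // declared outside the loop so the clean-up loop below can continue from it
// Unrolled by two: each iteration reads two rows (six values).
for (; i + 1 < edge; i += 2)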
{
/* register */ int a, b, c, d, e, f;
file_for_reading >> a >> b >> c >> d >> e >> f;
matrix[i][0] = a;
matrix[i][1] = b;
matrix[i][2] = c;
matrix[i + 1][0] = d;
matrix[i + 1][1] = e;
matrix[i + 1][2] = f;
}
for (; i < edge; ++i)
{
/* register */ int a, b, c;
file_for_reading >> a >> b >> c;
matrix[i][0] = a;
matrix[i][1] = b;
matrix[i][2] = c;
}
Here are the optimization principles I'm trying to apply in the above example:
Keep the file streaming (more), read more data per transaction.
Group the matrix assignments together, separate from the input.
This allows the compiler and processor to optimize. The processor can reduce memory fetches to take advantage of prefetching.
Hopefully, the compiler can use registers for the local variables. Register access is faster than memory access.
By grouping the assignments, maybe the compiler can use some advanced processor instructions.
Loop unrolling. The loop overhead (comparison and increment) is incurred less often.
The best idea is to set your compiler for highest optimization and create a release build. Also have your compiler print the assembly language for the functions. The compiler may already perform some of the above optimizations. IMHO, it never hurts to make your code easier for the compiler to optimize. :-)
Edit 1:
I'm hoping also that the matrix assignment may occur while reading in the "next" group of variables. This would be a great optimization. I'm open to people suggesting edits to this answer showing how to do that (without using threads).
My program opens a file which contains 100,000 numbers and parses them out into a 10,000 x 10 array correlating to 10,000 sets of 10 physical parameters. The program then iterates through each row of the array, performing overlap calculations between that row and every other row in the array.
The process is quite simple, and being new to C++, I programmed it the most straightforward way that I could think of. However, I know that I'm not doing this in the most optimal way possible, which is something that I would love to do, as the program is going to face off against my cohort's identical program, coded in Fortran, in a "race".
I have a feeling that I am going to need to implement multithreading to accomplish my goal of speeding up the program, but not only am I new to C++, I am new to multithreading, so I'm not sure how I should go about creating new threads in a beneficial way, or if it is even something that would give me that much "gain on investment" so to speak.
The program has the potential to be run on a machine with over 50 cores, but because the program is so simple, I'm not convinced that more threads is necessarily better. I think that if I implement two threads to compute the complex parameters of the two gaussians, one thread to compute the overlap between the gaussians, and one thread that is dedicated to writing to the file, I could speed up the program significantly, but I could also be wrong.
CODE:
cout << "Working...\n";
double **gaussian_array;
gaussian_array = (double **)malloc(N*sizeof(double *));
for(int i = 0; i < N; i++){
gaussian_array[i] = (double *)malloc(10*sizeof(double));
}
fstream gaussians;
gaussians.open("GaussParams", ios::in);
if (!gaussians){
cout << "File not found.";
}
else {
//generate the array of gaussians -> [10000][10]
int i = 0;
while(i < N) {
char ch;
string strNums;
string Num;
string strtab[10];
int j = 0;
getline(gaussians, strNums);
stringstream gaussian(strNums);
while(gaussian >> ch) {
if(ch != ',') {
Num += ch;
strtab[j] = Num;
}
else {
Num = "";
j += 1;
}
}
for(int c = 0; c < 10; c++) {
stringstream dbl(strtab[c]);
dbl >> gaussian_array[i][c];
}
i += 1;
}
}
gaussians.close();
//Below is the process to generate the overlap file between all gaussians:
string buffer;
ofstream overlaps;
overlaps.open("OverlapMatrix", ios::trunc);
overlaps.precision(15);
for(int i = 0; i < N; i++) {
for(int j = 0 ; j < N; j++){
double r1[6][2];
double r2[6][2];
double ol[2];
//compute complex parameters from the two gaussians
compute_params(gaussian_array[i], r1);
compute_params(gaussian_array[j], r2);
//compute overlap between the gaussians using the complex parameters
compute_overlap(r1, r2, ol);
//write to file
overlaps << ol[0] << "," << ol[1];
if(j < N - 1)
overlaps << " ";
else
overlaps << "\n";
}
}
overlaps.close();
return 0;
Any suggestions are greatly appreciated. Thanks!
This loop:
long n = 0;
unsigned int i, j, innerLoopLength = 4;
for (i = 0; i < 10000000; i++) {
for (j = 0; j < innerLoopLength; j++) {
n += v[j];
}
}
finishes in 0 ms, while this one:
long n = 0;
unsigned int i, j, innerLoopLength = argc;
for (i = 0; i < 10000000; i++) {
for (j = 0; j < innerLoopLength; j++) {
n += v[j];
}
}
takes 35 ms.
No matter what innerLoopLength is, the first version is always pretty fast, while the second gets slower and slower as it grows.
Does anybody know why, and is there a way to speed up the second version? I'm grateful for every ms.
Full code:
#include <iostream>
#include <chrono>
#include <vector>
using namespace std;
int main(int argc, char *argv[]) {
vector<long> v;
cout << "argc: " << argc << endl;
for (long l = 1; l <= argc; l++) {
v.push_back(l);
}
auto start = chrono::steady_clock::now();
long n = 0;
unsigned int i, j, innerLoopLength = 4;
for (i = 0; i < 10000000; i++) {
for (j = 0; j < innerLoopLength; j++) {
n += v[j];
}
}
auto end = chrono::steady_clock::now();
cout << "duration: " << chrono::duration_cast<chrono::microseconds>(end - start).count() / 1000.0 << " ms" << endl;
cout << "n: " << n << endl;
return 0;
}
Compiled with -std=c++1z and -O3.
The fixed-length loop was far quicker due to loop unrolling:
Loop unrolling, also known as loop unwinding, is a loop transformation
technique that attempts to optimize a program's execution speed at the
expense of its binary size, which is an approach known as space–time
tradeoff. The transformation can be undertaken manually by the
programmer or by an optimizing compiler.
The goal of loop unwinding is to increase a program's speed by
reducing or eliminating instructions that control the loop, such as
pointer arithmetic and "end of loop" tests on each iteration; reducing
branch penalties; as well as hiding latencies, including the delay in
reading data from memory. To eliminate this computational overhead,
loops can be re-written as a repeated sequence of similar independent
statements.
Essentially, the compiler transforms the inner loop of your C(++) code into the following:
for (i = 0; i < 10000000; i++) {
n += v[0];
n += v[1];
n += v[2];
n += v[3];
}
As you can see, it is a little bit faster.
In your specific case, there is yet another optimization at work: you add the same values to n 10000000 times. GCC has been able to detect this since around version 3.* and converts it to a multiplication. You can check this: running the same loop 100000000000 times also finishes in about 0 ms. You can also check at the assembly level (g++ -S -o bench.s bench.c -O3); you will see only a multiplication and no addition in a loop. To avoid this, you should add something that can't be converted to a multiplication so easily.
Neither of these optimizations can be applied in the second case. Thus, at the assembly level, you end up with a lot of conditional jumps. These are costly on a modern CPU, because a mispredicted branch forces the pipeline to be flushed.
What you can do to help:
If you know something about innerLoopLength, for example that it is always divisible by 4, you can unroll the loop yourself.
Use a gcc (g++) optimization flag to tell the compiler that you need fast code here. Compile with at least -O3 -funroll-loops.
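As a rough sketch of the first suggestion, here is what unrolling by hand could look like when the inner length is known to be a multiple of 4. The function names are made up for illustration, and whether this actually beats what the compiler generates on its own needs to be measured on your machine.

```cpp
#include <cassert>
#include <vector>

// Reference version: variable-length inner loop, as in the question.
long sum_plain(const std::vector<long>& v, unsigned innerLoopLength, unsigned reps)
{
    long n = 0;
    for (unsigned i = 0; i < reps; i++)
        for (unsigned j = 0; j < innerLoopLength; j++)
            n += v[j];
    return n;
}

// Manually unrolled by 4. Assumes innerLoopLength is a multiple of 4.
long sum_unrolled4(const std::vector<long>& v, unsigned innerLoopLength, unsigned reps)
{
    assert(innerLoopLength % 4 == 0 && innerLoopLength <= v.size());
    long n = 0;
    for (unsigned i = 0; i < reps; i++)
        for (unsigned j = 0; j < innerLoopLength; j += 4)
            n += v[j] + v[j + 1] + v[j + 2] + v[j + 3];  // one test and increment per four elements
    return n;
}
```

Both functions return the same sum; only the number of loop-control instructions per element differs.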
I'm having problems with creating a 2D boolean array in C++. I wrote a quick program to create and print the whole bool array.
#include <iostream>
using namespace std;
int main() {
const int WIDTH = 20;
const int HEIGHT = 20;
bool world [HEIGHT][WIDTH];
for(int i = 0; i < HEIGHT; i++){
for(int j = 0; j < WIDTH; j++){
world[i][j] = true;
}
}
for(int i = 0; i < HEIGHT; i++){
for(int j = 0; j < WIDTH; j++){
if(world[i][j]){
cout << j;
}else{
cout << ' ';
};
}
cout << "-" << i << endl;
}
return 0;
}
And this is its output:
012345678910111213141516171819-0
012345678910111213141516171819-1
012345678910111213141516171819-2
012345678910111213141516171819-3
012345678910111213141516171819-4
012345678910111213141516171819-5
012345678910111213141516171819-6
012345678910111213141516171819-7
012345678910111213141516171819-8
012345678910111213141516171819-9
012345678910111213141516171819-10
012345678910111213141516171819-11
012345678910111213141516171819-12
012345678910111213141516171819-13
012345678910111213141516171819-14
012345678910111213141516171819-15
012345678910111213141516171819-16
012345678910111213141516171819-17
012345678910111213141516171819-18
012345678910111213141516171819-19
It creates a 2D array, sets all its values to true, and prints the array. This is fine; the problem appears when the 2D array gets bigger. For example, if I change WIDTH and HEIGHT to 30, I get the following output when I print the array:
01234567891011121314151617181920212223242526272829-0
01234567891011121314151617181920212223242526272829-1
01234567891011121314151617181920212223242526272829-2
01234567891011121314151617181920212223242526272829-3
01234567891011121314151617181920212223242526272829-4
01234567891011121314151617181920212223242526272829-5
01234567891011121314151617181920212223242526272829-6
01234567891011121314151617181920212223242526272829-7
01234567891011121314151617181920212223242526272829-8
01234567891011121314151617181920212223242526272829-9
01234567891011121314151617181920212223242526272829-10
01234567891011121314151617181920212201234567891011121314151617181920212223242526272829-11
01234567891011121314151617181920212223242526272829-12
01234567891011121314151617181920212223242526272829-13
01234567891011121314151617181920212223242526272829-14
01234567891011121314151617181920212223242526272829-15
01234567891011121314151617181920212223242526272829-16
01234567891011121314151617181920212223242526272829-17
01234567891011121314151617181920212223242526272829-18
01234567891011121314151617181920212223242526272829-19
01234567891011121314151617181920212223242526272829-20
01234567891011121314151617181920212223242526272829-21
01234567891011121314151617181920212223242526272829-22
01234567891011121314151617181920212223242526272829-23
01234567891011121314151617181920212223242526272829-24
01234567891011121314151617181920212223242526272829-25
01234567891011121314151617181920212223242526272829-26
01234567891011121314151617181920212223242526272829-27
01234567891011121314151617181920212223242526272829-28
01234567891011121314151617181920212223242526272829-29
As you can see, on line 11 it counts up to 22 and then restarts the for loop over j. I need a 2D array of bools of size [50][50], but I don't know what is wrong here.
EDIT: The problem is the compiler. I tried the same code with GCC on a Linux machine and it works perfectly. The code itself is fine; the problem is the compiler, or the compiler combined with the CLion IDE. It compiles, but the program misbehaves when run or produces wrong output.
This is logically correct. I tested your code on an online compiler with sizes of both 20 and 30, and it produced the expected output. Restart your compiler, or try another, more reliable one.
The line
bool world [WIDTH][HEIGHT];
Should be
bool world [HEIGHT][WIDTH];
since i in your loop ranges from 0 to HEIGHT-1 and j ranges from 0 to WIDTH-1.
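One way to make the row/column order harder to mix up is to wrap the storage in a small type that checks its indices. This is only a sketch under my own assumptions: Grid, get and set are hypothetical names, and std::vector<char> is used instead of a built-in bool array so the size can be chosen at run time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: a height x width grid of flags stored row-major.
struct Grid {
    std::size_t height, width;
    std::vector<char> cells;   // char instead of bool avoids the vector<bool> proxy type

    Grid(std::size_t h, std::size_t w, bool value)
        : height(h), width(w), cells(h * w, value ? 1 : 0) {}

    // row must be < height and col must be < width.
    bool get(std::size_t row, std::size_t col) const {
        assert(row < height && col < width);
        return cells[row * width + col] != 0;
    }

    void set(std::size_t row, std::size_t col, bool value) {
        assert(row < height && col < width);
        cells[row * width + col] = value ? 1 : 0;
    }
};
```

In a debug build, swapping the two indices on a non-square grid then trips the assert instead of silently reading the wrong cell.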
In my code I'm modifying my array (int*) and then I want to compare it with the Matlab results.
Since my array is big (1200 x 1000 elements), it takes forever to load it into Matlab.
I'm trying to copy the printed output file into the Matlab command line...
for (int i = 0; i < _roiY1; i++)
{
for (int j = 0; j < newWidth; j++)
{
channel_gr[i*newWidth + j] = clipLevel;
}
}
ofstream myfile;
myfile.open("C:\\Users\\gdarmon\\Desktop\\OpenCVcliptop.txt");
for (int i = 0; i < newHeight ; i++)
{
for (int j = 0; j < newWidth; j++)
{
myfile << channel_gr[i * newWidth + j] << ", ";
}
myfile<<";" <<endl;
}
Is there a faster way to get readable matrix data from C++ into Matlab?
The simplest answer is that it's much quicker to transfer the data in binary form, rather than - as suggested in the question - rendering to text and having Matlab parse it back to binary. You can achieve this by using fwrite() at the C/C++ end, and fread() at the Matlab end.
int* my_data = ...;
int my_data_count = ...;
FILE* fid = fopen("my_data_file", "wb");
fwrite((void*)my_data, sizeof(int), my_data_count, fid);
fclose(fid);
In Matlab:
fid = fopen('my_data_file', 'r');
my_data = fread(fid, inf, '*int32');
fclose(fid);
It's maybe worth noting that you can call C/C++ functions from within Matlab, so depending on what you are doing that may be an easier architecture (look up "mex files").
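For completeness, here is a slightly more defensive sketch of the C/C++ side. It uses std::int32_t so the element size matches the '*int32' precision given to fread on the Matlab side, and it reports short writes. The function names are illustrative, and read_int32 exists only to check the file contents from C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Write the values as raw 32-bit integers in native byte order.
// Returns true if every element was written.
bool write_int32(std::FILE* fid, const std::vector<std::int32_t>& data)
{
    assert(fid != nullptr);
    const std::size_t written =
        std::fwrite(data.data(), sizeof(std::int32_t), data.size(), fid);
    return written == data.size();
}

// Read back up to `count` values; used here only to verify the file.
std::vector<std::int32_t> read_int32(std::FILE* fid, std::size_t count)
{
    assert(fid != nullptr);
    std::vector<std::int32_t> data(count);
    data.resize(std::fread(data.data(), sizeof(std::int32_t), count, fid));
    return data;
}
```

In real use, fid would come from fopen("my_data_file", "wb"), and should be closed with fclose before Matlab opens the file.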
Don't write the output as text.
Write your matrix into your output file the way Matlab likes to read: big array of binary.
ofstream myfile;
myfile.open("C:\\Users\\gdarmon\\Desktop\\OpenCVcliptop.txt", ofstream::app | ofstream::binary);
myfile.write((char*) channel_gr, newHeight*newWidth*sizeof(channel_gr[0]));
You may want to play some games on output to get the array ordered column-row rather than row-column because of the way matlab likes to see data. I remember orders of magnitude improvements in performance when writing mex file plug-ins for file readers, but it's been a while since I've done it.
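The reordering mentioned above can be sketched as follows. Matlab stores matrices column by column, while the C++ buffer in the question is row by row (index i * newWidth + j), so the data can be transposed into a temporary buffer before writing. The name to_column_major is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Convert a rows x cols matrix from row-major (C++ style) to
// column-major (Matlab style) element order.
std::vector<int> to_column_major(const std::vector<int>& rowMajor,
                                 std::size_t rows, std::size_t cols)
{
    assert(rowMajor.size() == rows * cols);
    std::vector<int> colMajor(rowMajor.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            colMajor[c * rows + r] = rowMajor[r * cols + c];
    return colMajor;
}
```

After writing the reordered buffer, reshape(my_data, rows, cols) in Matlab gives the matrix in its original orientation. Alternatively, write the row-major data unchanged, read it as a cols x rows matrix in Matlab, and transpose it there.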