How can I speed up the program execution? [closed] - c++

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
I have a program that reads data from a file (a matrix of 400,000 * 3 elements) and writes it into a two-dimensional array. The problem is that this takes a long time (about 6 seconds), while according to the test requirements it must take no more than 2 seconds.
#include <iostream>
#include <fstream>
#include <cstdlib>
using namespace std;

int main()
{
    ifstream file_for_reading("C:\\Tests\\20");
    int a, b, c;
    int edge, number_of_vertexes;
    file_for_reading >> number_of_vertexes >> edge;
    if (number_of_vertexes < 1 || number_of_vertexes > 30000 || edge < 0 || edge > 400000) {
        cout << "Correct your values";
        exit(1);
    }
    short** matrix = new short*[edge];
    for (int i = 0; i < edge; i++)
        matrix[i] = new short[3];
    int tmp = 0;
    for (int i = 0; i < edge; i++) {
        file_for_reading >> matrix[i][tmp] >> matrix[i][tmp + 1] >> matrix[i][tmp + 2];
        tmp = 0;
    }
    file_for_reading.close();
    //Dijkstra(matrix, number_of_vertexes);
}

S.M.'s advice is promising - just short* matrix = new short [edge * 3]; then for (int i = 0; i < edge * 3; i++) file_for_reading >> matrix[i]; to read the file. Crucially, this puts all the file content into contiguous memory, which is more CPU cache friendly.
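For reference, here is a minimal sketch of that layout, reusing the question's edge and file_for_reading (my own illustration, not code from either post); element (i, j) of the conceptual edge-by-3 matrix is simply matrix[i * 3 + j]:
short* matrix = new short[edge * 3];   // one contiguous block for all edge * 3 values
for (int i = 0; i < edge * 3; i++)
    file_for_reading >> matrix[i];
short endpoint = matrix[5 * 3 + 0];    // "row 5, column 0" - purely illustrative
delete[] matrix;                       // a single delete[] instead of edge + 1 of them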
Using the following code I generated test input and measured the performance of your original approach and the contiguous-memory approach:
#include <iostream>
#include <fstream>
#include <cstdlib>
#include <cstring>
#include <string>

using namespace std::literals;

#define ASSERT(X) \
    do { \
        if (X) break; \
        std::cerr << ':' << __LINE__ << " ASSERT " << #X << '\n'; \
        exit(1); \
    } while (false)

int main(int argc, const char* argv[])
{
    // for (int i = 0; i < 400000 * 3; ++i)
    //     std::cout << rand() % 32768 << ' ';

    // old way...
    std::ifstream in{"fasterread.in"};
    ASSERT(in);

    if (argc == 2 && argv[1] == "orig"s) {
        short** m = new short*[400000];
        for (int i = 0; i < 400000; ++i)
            m[i] = new short[3];
        for (int i = 0; i < 400000; ++i)
            in >> m[i][0] >> m[i][1] >> m[i][2];
    }

    if (argc == 2 && argv[1] == "contig"s) {
        short* m = new short[400000 * 3];
        for (int i = 0; i < 400000 * 3; ++i)
            in >> m[i];
    }
}
I then compiled them with optimisations using GCC on Linux:
g++ -O2 -Wall -std=c++20 fasterread.cc -o fasterread
And ran them with the time utility to show elapsed time:
time ./fasterread orig; time ./fasterread contig
Over a dozen runs of each, the fastest the orig version completed was 0.063 seconds (I have a fast SSD), whilst contig took as little as 0.058 seconds - an improvement, but nowhere near the three-fold speedup you need.
That said, C++ ifstream supports locale translations whilst parsing numbers - using a slowish virtual dispatch mechanism - so may be slower than other text-to-number parsing that you could use or write.
But, when you're 100x slower than me - it's obviously your old HDD that sucks, and not the software parsing the numbers....
FWIW, I tried C-style I/O using fscanf and it proved slower for me - 0.077s.
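If the stream extraction itself turns out to be the bottleneck, one alternative (a hedged sketch of my own, not something from the question - the function name read_all_shorts is made up) is to slurp the whole file into a string and parse it with std::from_chars from <charconv>, which does no locale handling at all:
#include <cctype>
#include <charconv>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Read the whole file into memory, then parse whitespace-separated integers
// with std::from_chars (no locale, no virtual dispatch).
std::vector<short> read_all_shorts(const char* path)
{
    std::ifstream in{path, std::ios::binary};
    std::string text{std::istreambuf_iterator<char>{in},
                     std::istreambuf_iterator<char>{}};
    std::vector<short> values;
    const char* p = text.data();
    const char* end = p + text.size();
    while (p != end) {
        while (p != end && std::isspace(static_cast<unsigned char>(*p))) ++p; // skip separators
        if (p == end) break;
        short v = 0;
        auto [next, ec] = std::from_chars(p, end, v);
        if (ec != std::errc{}) break;   // stop on malformed input
        values.push_back(v);
        p = next;
    }
    return values;
}
Whether this actually beats operator>> on your machine is something you would have to measure, just like the runs above.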

Here's an optimized version for you:
int i = 0;
const int limit = edge - (edge % 2);   // largest even count we can unroll by two
for (; i < limit; i += 2)
{
    /* register */ int a, b, c, d, e, f;
    file_for_reading >> a >> b >> c >> d >> e >> f;
    matrix[i][0] = a;
    matrix[i][1] = b;
    matrix[i][2] = c;
    matrix[i + 1][0] = d;
    matrix[i + 1][1] = e;
    matrix[i + 1][2] = f;
}
for (; i < edge; ++i)   // handle a possible odd last row
{
    /* register */ int a, b, c;
    file_for_reading >> a >> b >> c;
    matrix[i][0] = a;
    matrix[i][1] = b;
    matrix[i][2] = c;
}
Here are the optimization principles I'm trying to apply in the example above:
Keep the file streaming: read more data per transaction.
Group the matrix assignments together, separate from the input.
This allows the compiler and processor to optimize. The processor can reduce memory fetches to take advantage of prefetching.
Hopefully, the compiler can use registers for the local variables. Register access is faster than memory access.
By grouping the assignments, maybe the compiler can use some advanced processor instructions.
Loop unrolling. The loop overhead (comparison and increment) is performed less often.
The best idea is to set your compiler for highest optimization and create a release build. Also have your compiler print the assembly language for the functions. The compiler may already perform some of the above optimizations. IMHO, it never hurts to make your code easier for the compiler to optimize. :-)
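For example, with GCC (standard flags; the file name reader.cc here is just a placeholder for your own source file):
g++ -O3 -Wall reader.cc -o reader      # optimized, release-style build
g++ -O3 -S -fverbose-asm reader.cc     # writes reader.s so you can inspect the generated assembly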
Edit 1:
I'm hoping also that the matrix assignment may occur while reading in the "next" group of variables. This would be a great optimization. I'm open to people suggesting edits to this answer showing how to do that (without using threads).

Related

Performance bottleneck in writing a large matrix of doubles to a file

My program opens a file which contains 100,000 numbers and parses them out into a 10,000 x 10 array correlating to 10,000 sets of 10 physical parameters. The program then iterates through each row of the array, performing overlap calculations between that row and every other row in the array.
The process is quite simple, and being new to C++, I programmed it in the most straightforward way I could think of. However, I know that I'm not doing this in the most optimal way possible, which is something that I would love to fix, as the program is going to face off against my cohort's identical program, coded in Fortran, in a "race".
I have a feeling that I am going to need to implement multithreading to accomplish my goal of speeding up the program, but not only am I new to C++, I am also new to multithreading, so I'm not sure how I should go about creating new threads in a beneficial way, or whether it would even give me that much "gain on investment", so to speak.
The program has the potential to be run on a machine with over 50 cores, but because the program is so simple, I'm not convinced that more threads is necessarily better. I think that if I implement two threads to compute the complex parameters of the two gaussians, one thread to compute the overlap between the gaussians, and one thread that is dedicated to writing to the file, I could speed up the program significantly, but I could also be wrong.
CODE:
cout << "Working...\n";
double **gaussian_array;
gaussian_array = (double **)malloc(N*sizeof(double *));
for(int i = 0; i < N; i++){
gaussian_array[i] = (double *)malloc(10*sizeof(double));
}
fstream gaussians;
gaussians.open("GaussParams", ios::in);
if (!gaussians){
cout << "File not found.";
}
else {
//generate the array of gaussians -> [10000][10]
int i = 0;
while(i < N) {
char ch;
string strNums;
string Num;
string strtab[10];
int j = 0;
getline(gaussians, strNums);
stringstream gaussian(strNums);
while(gaussian >> ch) {
if(ch != ',') {
Num += ch;
strtab[j] = Num;
}
else {
Num = "";
j += 1;
}
}
for(int c = 0; c < 10; c++) {
stringstream dbl(strtab[c]);
dbl >> gaussian_array[i][c];
}
i += 1;
}
}
gaussians.close();
//Below is the process to generate the overlap file between all gaussians:
string buffer;
ofstream overlaps;
overlaps.open("OverlapMatrix", ios::trunc);
overlaps.precision(15);
for(int i = 0; i < N; i++) {
for(int j = 0 ; j < N; j++){
double r1[6][2];
double r2[6][2];
double ol[2];
//compute complex parameters from the two gaussians
compute_params(gaussian_array[i], r1);
compute_params(gaussian_array[j], r2);
//compute overlap between the gaussians using the complex parameters
compute_overlap(r1, r2, ol);
//write to file
overlaps << ol[0] << "," << ol[1];
if(j < N - 1)
overlaps << " ";
else
overlaps << "\n";
}
}
overlaps.close();
return 0;
Any suggestions are greatly appreciated. Thanks!

Allocate more RAM to executable

I've made a program which processes a lot of data, and it takes forever at runtime, but looking in Task Manager I found out that the executable only uses a small part of my CPU and my RAM...
How can I tell my IDE to allocate more resources (as much as it can) to my program?
Running it in Release x64 helps but not enough.
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

int main() {
    using namespace std;

    struct library {
        int num = 0;
        unsigned int total = 0;
        int booksnum = 0;
        int signup = 0;
        int ship = 0;
        vector<int> scores;
    };

    unsigned int libraries = 30000; // in the program this number is read from a file
    unsigned int books = 20000;     // in the program this number is read from a file
    unsigned int days = 40000;      // in the program this number is read from a file

    vector<int> scores(books, 0);
    vector<library*> all(libraries);
    for (auto& it : all) {
        it = new library;
        it->booksnum = 15000; // in the program this number is read from a file
        it->signup = 50000;   // in the program this number is read from a file
        it->ship = 99999;     // in the program this number is read from a file
        it->scores.resize(it->booksnum, 0);
    }

    unsigned int past = 0;
    for (size_t done = 0; done < all.size(); done++) {
        if (!(done % 1000)) cout << done << '-' << all.size() << endl;
        for (size_t m = done; m < all.size() - 1; m++) {
            all[m]->total = 0;
            {
                double run = past + all[m]->signup;
                for (auto at : all[m]->scores) {
                    if (days - run > 0) {
                        all[m]->total += scores[at];
                        run += 1. / all[m]->ship;
                    } else
                        break;
                }
            }
        }
        for (size_t n = done; n < all.size(); n++)
            for (size_t m = 0; m < all.size() - 1; m++) {
                if (all[m]->total < all[m + 1]->total) swap(all[m], all[m + 1]);
            }
        past += all[done]->signup;
        if (past > days) break;
    }
    return 0;
}
This is the loop which takes up so much time... For some reason even using pointers to library doesn't speed it up.
RAM doesn't make things go faster. RAM is just there to store data your program uses; if it's not using much then it doesn't need much.
Similarly, in terms of CPU usage, the program will use everything it can (the operating system can change priority, and there are APIs for that, but this is probably not your issue).
If you're seeing it using a fraction of CPU percentage, the chances are you're either waiting on I/O or writing a single threaded application that can only use a single core at any one time. If you've optimised your solution as much as possible on a single thread, then it's worth looking into breaking its work down across multiple threads.
What you need to do is use a tool called a profiler to find out where your code is spending its time and then use that information to optimise it. This will help you with microoptimisations especially, but for larger algorithmic changes (i.e. changing how it works entirely), you'll need to think about things at a higher level of abstraction.
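For example, on Linux one common workflow (perf is just one profiler of many, and mycode.cc is a placeholder for your own source file) looks like this:
g++ -O2 -g mycode.cc -o mycode    # optimized build, with debug symbols so the profiler can name functions
perf record ./mycode              # sample the program while it runs
perf report                       # browse which functions the time was spent in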

Multithreaded storing into file

I have this code:
#include <cstdio>

const int N = 100000000;

int main() {
    FILE* fp = fopen("result.txt", "w");
    for (int i = 0; i < N; ++i) {
        int res = f(i);
        fprintf(fp, "%d\t%d\n", i, res);
    }
    return 0;
}
Here f takes several milliseconds on average in a single thread.
To make it faster I'd like to use multithreading.
1. What is a good way for a thread to get the next i? Or do I need to lock, get, add and unlock?
2. Should the writing happen in a separate thread to make things easier?
3. Do I need temporary memory in case f(7) is worked out before f(3)?
4. If 3, is it likely that f(3) is not calculated for a long time and the temporary memory fills up?
I'm currently using C++11, but requiring a higher version of C++ may be acceptable.
The general rule for how to improve performance:
1. Find a way to measure performance (an automated test).
2. Profile the existing code (find the bottlenecks).
3. Understand the findings from point 2 and try to fix them (without mutilating the code).
4. Repeat the measurement from point 1 and decide whether the change provided the expected improvement.
5. Go back to point 2 a couple of times.
Only if steps 1 to 5 didn't help, try using multithreading. The procedure is the same as in points 2 - 5, but you have to think: can you split the large task into a couple of smaller ones? If yes, do they need synchronization? Can you avoid it?
Now in your example, just split the result into 8 (or more) separate files and merge them at the end if you have to.
This can look like this:
#include <vector>
#include <future>
#include <fstream>

std::vector<int> multi_f(int start, int stop)
{
    std::vector<int> r;
    r.reserve(stop - start);
    for (; start < stop; ++start) r.push_back(f(start));
    return r;
}

int main()
{
    const int N = 100000000;
    const int tasks = 100;
    const int sampleCount = N / tasks;

    std::vector<std::future<std::vector<int>>> allResults;
    for (int i = 0; i < N; i += sampleCount) {
        allResults.push_back(std::async(&multi_f, i, i + sampleCount));
    }

    std::ofstream f{ "result.txt" }; // it is a myth that printf is faster
    int i = 0;
    for (auto& task : allResults)
    {
        for (auto r : task.get()) {
            f << i++ << '\t' << r << '\n';
        }
    }
    return 0;
}
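On the question's first point ("lock, get, add and unlock"): a std::atomic counter lets each worker claim the next i without an explicit mutex. A minimal sketch of my own (with a stand-in f and a small N; this is not part of the answer above):
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

int f(int x) { return x * x; }   // stand-in for the question's expensive f

std::atomic<int> next_index{0};  // shared counter; fetch_add hands out each i exactly once

void worker(int n, std::vector<int>& results)
{
    for (;;) {
        int i = next_index.fetch_add(1);   // "get and add" without an explicit lock
        if (i >= n) break;
        results[i] = f(i);                 // each i is written by exactly one thread
    }
}

int main()
{
    const int N = 1000;
    std::vector<int> results(N);
    unsigned threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back(worker, N, std::ref(results));
    for (auto& th : pool) th.join();
    // results[0..N-1] can now be written to the file in order, from one thread
}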

How can I optimize C++ for loops/if statements?

Is there any way to make C++ code run faster? I'm trying to optimize the slowest parts of my code, such as this:
void removeTrail(char floor[][SIZEX], int trail[][SIZEX])
{
    for (int y = 1; y < SIZEY - 1; y++)
        for (int x = 1; x < SIZEX - 1; x++)
        {
            if (trail[y][x] <= SLIMELIFE && trail[y][x] > 0)
            {
                trail[y][x]--;
                if (trail[y][x] == 0)
                    floor[y][x] = NONE;
            }
        }
}
Most of the guides I have found online are for more complex C++.
It really depends on what kind of optimization you are seeking. It seems to me that you are talking about a more "low-level" optimization, which can be achieved, in combination with compile flags, by techniques such as changing the order of the nested loops, changing where you place your if statements, deciding between recursive vs. iterative approaches, etc.
However, the most effective optimizations are those targeted at the algorithms, which means that you are changing the complexity of your routines and, thus, often diminishing execution time by orders of magnitude. This would be the case, for example, when you decide to implement a Quicksort instead of a Selection Sort. An optimization from an O(n^2) to an O(n lg n) algorithm will hardly be beaten by any micro optimization.
In this particular case, I see that you are trying to "remove" elements from the matrix when they reach a certain value. Depending on how those values change, simply tracking when they reach that value and adding them to a removal queue right there, instead of always checking the whole matrix, might do it:
trail[y][x]--; // In some part of your code, this happens
if (trail[y][x] == 0) { // add for removal
    removalQueueY[yQueueTail++] = y;
    removalQueueX[xQueueTail++] = x;
}

// Then, instead of checking for removal as you currently do:
while (yQueueHead < yQueueTail) {
    // Remove the current element and advance the heads
    floor[removalQueueY[yQueueHead]][removalQueueX[xQueueHead]] = NONE;
    yQueueHead++, xQueueHead++;
}
Depending on how those values change (if it is not a simple trail[y][x]--), another data structure might prove more useful. You could try using a heap, for example, or an std::set, std::priority_queue, among other possibilities. It all comes down to what operations your algorithm must support, and which data structures allow you to execute those operations as efficiently as possible (contemplating memory and execution time, depending on your priorities and needs).
The first thing to do is to turn on compiler optimization. The most powerful optimization I know is profile guided optimization. For gcc:
1) g++ -fprofile-generate .... -o my_program
2) run my_program (typical load)
3) g++ -fprofile-use -O3 ... -o optimized_program
With a profile, -O3 does make sense.
The next thing is to perform algorithmic optimization, like in Renato_Ferreira's answer. If that doesn't work for your situation, you can improve your performance by a factor of 2-8 using vectorization. Your code looks vectorizable:
#include <cassert>
#include <cstdint>
#include <emmintrin.h>
#include <iostream>

#define SIZEX 100 // SIZEX % 4 == 0
#define SIZEY 100
#define SLIMELIFE 100
#define NONE 0xFF

void removeTrail(char floor[][SIZEX], int trail[][SIZEX]) {
    // check that trail is 16-byte aligned
    assert((((size_t)(&trail[0][0])) & (size_t)0xF) == 0);
    static const int lower_a[] = {0, 0, 0, 0};
    static const int sub_a[]   = {1, 1, 1, 1};
    static const int floor_a[] = {1, 1, 1, 1}; // will underflow after decrement
    static const int upper_a[] = {SLIMELIFE, SLIMELIFE, SLIMELIFE, SLIMELIFE};
    __m128i lower_v = *(__m128i*) lower_a;
    __m128i upper_v = *(__m128i*) upper_a;
    __m128i sub_v   = *(__m128i*) sub_a;
    __m128i floor_v = *(__m128i*) floor_a;
    for (int i = 0; i < SIZEY; i++) {
        for (int j = 0; j < SIZEX; j += 4) { // only for SIZEX % 4 == 0
            __m128i x = *(__m128i*)(&trail[i][j]);
            __m128i floor_mask = _mm_cmpeq_epi32(x, floor_v);     // 32-bit
            floor_mask = _mm_packs_epi32(floor_mask, floor_mask); // now 16-bit
            floor_mask = _mm_packs_epi16(floor_mask, floor_mask); // now 8-bit
            int32_t fl_mask[4];
            *(__m128i*)fl_mask = floor_mask;
            *(int32_t*)(&floor[i][j]) |= fl_mask[0];
            __m128i less_mask  = _mm_cmplt_epi32(lower_v, x);
            __m128i upper_mask = _mm_cmplt_epi32(x, upper_v);
            __m128i mask = less_mask & upper_mask;
            *(__m128i*)(&trail[i][j]) = _mm_sub_epi32(x, mask & sub_v);
        }
    }
}

int main()
{
    int T[SIZEY][SIZEX];
    char F[SIZEY][SIZEX];
    for (int i = 0; i < SIZEY; i++) {
        for (int j = 0; j < SIZEX; j++) {
            F[i][j] = 0x0;
            T[i][j] = j - 10;
        }
    }
    removeTrail(F, T);
    for (int j = 0; j < SIZEX; j++) {
        std::cout << (int) F[2][j] << " " << T[2][j] << '\n';
    }
    return 0;
}
It looks like it does what it's supposed to do. No ifs, and 4 values per iteration. It only works for NONE = 0xFF; it could be done for other values, but that is more difficult.

Killed process by SIGKILL

I have a process that gets killed immediately after executing the program. This is the code of the compiled executable; it is a small program that reads several graphs, represented by numbers, from standard input (usually a description file) and finds the minimum spanning tree of every graph using Prim's algorithm (it does not show the results yet, it just finds the solution).
#include <stdlib.h>
#include <iostream>
using namespace std;

const int MAX_NODOS = 20000;
const int infinito = 10000;

int nnodos;
int nAristas;
int G[MAX_NODOS][MAX_NODOS];
int solucion[MAX_NODOS][MAX_NODOS];
int menorCoste[MAX_NODOS];
int masCercano[MAX_NODOS];

void leeGrafo(){
    if (nnodos < 0 || nnodos > MAX_NODOS) {
        cerr << "Numero de nodos (" << nnodos << ") no valido\n";
        exit(0);
    }
    for (int i = 0; i < nnodos; i++)
        for (int j = 0; j < nnodos; j++)
            G[i][j] = infinito;
    int A, B, P;
    for (int i = 0; i < nAristas; i++) {
        cin >> A >> B >> P;
        G[A][B] = P;
        G[B][A] = P;
    }
}

void prepararEstructuras(){
    // output graph
    for (int i = 0; i < nnodos; i++)
        for (int j = 0; j < nnodos; j++)
            solucion[i][j] = infinito;
    // closest node
    for (int i = 1; i < nnodos; i++) {
        masCercano[i] = 0;
        // lowest cost
        menorCoste[i] = G[0][i];
    }
}

void prim(){
    prepararEstructuras();
    int min, k;
    for (int i = 1; i < nnodos; i++) {
        min = menorCoste[1];
        k = 1;
        for (int j = 2; i < nnodos; j++) {
            if (menorCoste[j] < min) {
                min = menorCoste[j];
                k = j;
            }
        }
        solucion[k][masCercano[k]] = G[k][masCercano[k]];
        menorCoste[k] = infinito;
        for (int j = 1; j < nnodos; j++) {
            if (G[k][j] < menorCoste[j] && menorCoste[j] != infinito) {
                menorCoste[j] = G[k][j];
                masCercano[j] = k;
            }
        }
    }
}

void output(){
    for (int i = 0; i < nnodos; i++) {
        for (int j = 0; j < nnodos; j++)
            cout << G[i][j] << ' ';
        cout << endl;
    }
}

int main (){
    while (true) {
        cin >> nnodos;
        cin >> nAristas;
        if ((nnodos == 0) && (nAristas == 0)) break;
        else {
            leeGrafo();
            output();
            prim();
        }
    }
}
I have learned that I must use strace to find out what is going on, and this is what I get:
execve("./412", ["./412"], [/* 38 vars */] <unfinished ...>
+++ killed by SIGKILL +++
Killed
I am running Ubuntu and this is the first time I have gotten this type of error. The program is supposed to stop after reading two zeros in a row from the input, which I can guarantee are present in my graph description file. The problem also happens even if I run the program without redirecting input from my graph file.
Although I'm not 100% sure that this is the problem, take a look at the sizes of your global arrays:
const int MAX_NODOS = 20000;
int G[MAX_NODOS][MAX_NODOS];
int solucion[MAX_NODOS][MAX_NODOS];
Assuming int is 4 bytes, you'll need:
20000 * 20000 * 4 bytes * 2 = ~3.2 GB
For one, you might not even have that much memory. Secondly, if you're on 32-bit, it's likely that the OS will not allow a single process to have that much memory at all.
Assuming you're on 64-bit (and assuming you have enough memory), the solution would be to allocate it all at run-time.
Your arrays G and solucion each contain 400,000,000 integers, which is about 1.6 GiB each on most machines. Unless you have enough (virtual) memory for that (3.2 GiB and counting), and permission to use it (try ulimit -d; that's correct for bash on MacOS X 10.7.2), your process will fail to start and will be killed by SIGKILL (which cannot be trapped, not that the process is really going yet).
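As a hedged sketch of the "allocate at run-time" suggestion (my own illustration, reusing the question's variable names): read nnodos first, then size the structures from it with std::vector, so a small graph no longer pays for the full 20000 x 20000 worst case:
#include <iostream>
#include <vector>

int main()
{
    int nnodos = 0, nAristas = 0;
    std::cin >> nnodos >> nAristas;
    if (nnodos <= 0) return 0;

    const int infinito = 10000;
    // Heap-allocated and sized to the actual graph instead of MAX_NODOS * MAX_NODOS.
    std::vector<std::vector<int>> G(nnodos, std::vector<int>(nnodos, infinito));
    std::vector<std::vector<int>> solucion(nnodos, std::vector<int>(nnodos, infinito));
    std::vector<int> menorCoste(nnodos, infinito);
    std::vector<int> masCercano(nnodos, 0);

    for (int i = 0; i < nAristas; ++i) {
        int A, B, P;
        std::cin >> A >> B >> P;
        G[A][B] = P;
        G[B][A] = P;
    }
    std::cout << "Read " << nnodos << " nodes and " << nAristas << " edges\n";
}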