I've made a program which processes a lot of data, and it takes forever to run, but looking in Task Manager I found out that the executable only uses a small part of my CPU and my RAM...
How can I tell my IDE to allocate more resources (as much as it can) to my program?
Running it in Release x64 helps, but not enough.
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

int main() {
    using namespace std;

    struct library {
        int num = 0;
        unsigned int total = 0;
        int booksnum = 0;
        int signup = 0;
        int ship = 0;
        vector<int> scores;
    };

    unsigned int libraries = 30000; // in the real program this number is read from a file
    unsigned int books = 20000;     // in the real program this number is read from a file
    unsigned int days = 40000;      // in the real program this number is read from a file

    vector<int> scores(books, 0);
    vector<library*> all(libraries);
    for (auto& it : all) {
        it = new library;
        it->booksnum = 15000; // in the real program this number is read from a file
        it->signup = 50000;   // in the real program this number is read from a file
        it->ship = 99999;     // in the real program this number is read from a file
        it->scores.resize(it->booksnum, 0);
    }

    unsigned int past = 0;
    for (size_t done = 0; done < all.size(); done++) {
        if (!(done % 1000)) cout << done << '-' << all.size() << endl;
        for (size_t m = done; m < all.size() - 1; m++) {
            all[m]->total = 0;
            {
                double run = past + all[m]->signup;
                for (auto at : all[m]->scores) {
                    if (days - run > 0) {
                        all[m]->total += scores[at];
                        run += 1. / all[m]->ship;
                    } else
                        break;
                }
            }
        }
        for (size_t n = done; n < all.size(); n++)
            for (size_t m = 0; m < all.size() - 1; m++) {
                if (all[m]->total < all[m + 1]->total) swap(all[m], all[m + 1]);
            }
        past += all[done]->signup;
        if (past > days) break;
    }
    return 0;
}
This is the loop that takes up so much time... For some reason even using pointers to library doesn't speed it up.
RAM doesn't make things go faster. RAM is just there to store data your program uses; if the program isn't using much, then it doesn't need much.
Similarly, in terms of CPU usage, the program will use everything it can (the operating system can change priority, and there are APIs for that, but this is probably not your issue).
If you're seeing it use only a fraction of the CPU, the chances are you're either waiting on I/O or you've written a single-threaded application that can only use a single core at any one time. If you've optimised your solution as much as possible on a single thread, then it's worth looking into breaking its work down across multiple threads, for example as sketched below.
What you need to do is use a tool called a profiler to find out where your code is spending its time, and then use that information to optimise it. This will help you with micro-optimisations especially, but for larger algorithmic changes (i.e. changing how it works entirely), you'll need to think about things at a higher level of abstraction.
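For illustration only (this sketch is mine, not part of the original answer): one common way to spread independent iterations across cores is to give each thread its own slice of the range. The work function and data here are made up; the pattern is only safe when the slices share no mutable state.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the heavy per-element work; each slice must be
// independent of the others for this pattern to be safe.
void processRange(std::vector<int>& data, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) data[i] *= 2;
}

int main() {
    std::vector<int> data(1000000, 1);
    unsigned numThreads = std::thread::hardware_concurrency();
    if (numThreads == 0) numThreads = 4; // fallback if the runtime can't tell us
    std::size_t chunk = (data.size() + numThreads - 1) / numThreads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(begin + chunk, data.size());
        if (begin >= end) break;
        workers.emplace_back(processRange, std::ref(data), begin, end);
    }
    for (auto& w : workers) w.join(); // wait for every slice to finish
    return 0;
}

Note that the loop in the question is not trivially parallelisable as written, because each pass sorts and mutates the shared all vector; it would need restructuring before a split like this applies.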
I have this code:
#include <cstdio>

int f(int); // defined elsewhere; one call takes several milliseconds

const int N = 100000000;

int main() {
    FILE* fp = fopen("result.txt", "w");
    for (int i = 0; i < N; ++i) {
        int res = f(i);
        fprintf(fp, "%d\t%d\n", i, res);
    }
    fclose(fp);
    return 0;
}
Here f runs for several milliseconds on average, in a single thread.
To make it faster I'd like to use multithreading.
1. What provides a way to get the next i? Or do I need to lock, get, add and unlock?
2. Should writing be done in a separate thread to make things easier?
3. Do I need temporary memory in case f(7) is worked out before f(3)?
4. If 3, is it likely that f(3) is not calculated for a long time and the temporary memory fills up?
I'm currently using C++11, but requiring a higher version of C++ may be acceptable.
General rule for how to improve performance:
1. Find a way to measure performance (an automated test).
2. Profile the existing code (find bottlenecks).
3. Understand the findings from point 2 and try to fix them (without mutilating the code).
4. Repeat the measurement from point 1 and decide if the change provided the expected improvement.
5. Go back to point 2 a couple of times.
6. Only if steps 1 to 5 didn't help, try to use multithreading. The procedure is the same as in points 2 - 5, but you have to think: can you split a large task into a couple of smaller ones? If yes, do they need synchronization? Can you avoid it?
Now in your example: just split the result into 8 (or more) separate files and merge them at the end if you have to.
That could look like this:
#include <fstream>
#include <future>
#include <vector>

int f(int); // the user's function, defined elsewhere

std::vector<int> multi_f(int start, int stop)
{
    std::vector<int> r;
    r.reserve(stop - start);
    for (; start < stop; ++start) r.push_back(f(start));
    return r;
}

int main()
{
    const int N = 100000000;
    const int tasks = 100;
    const int sampleCount = N / tasks;

    std::vector<std::future<std::vector<int>>> allResults;
    for (int i = 0; i < N; i += sampleCount) {
        // std::launch::async forces a real thread; the default policy may defer
        allResults.push_back(std::async(std::launch::async, &multi_f, i, i + sampleCount));
    }

    std::ofstream f{ "result.txt" }; // it is a myth that printf is faster
    int i = 0;
    for (auto& task : allResults)
    {
        for (auto r : task.get()) {
            f << i++ << '\t' << r << '\n';
        }
    }
    return 0;
}
I have a program which reads a file line by line and then stores each possible substring of length 50 in a hash table along with its frequency. I tried to use threads in my program so that it will read 5 lines and then use five different threads to do the processing. The processing involves reading each substring of that line and putting it into a hash map with its frequency. But something seems to be wrong which I could not figure out: the program is not faster than the serial approach. Also, for large input files it aborts. Here is the piece of code I am using:
#include <cstdlib>
#include <cstring>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>
using namespace std;

unordered_map<string, int> m;
mutex mtx;

void parseLine(char *line, int subLen) {
    int no_substr = strlen(line) - subLen;
    for (int i = 0; i <= no_substr; i++) {
        char *subStr = (char*) malloc(sizeof(char) * (subLen + 1));
        strncpy(subStr, line + i, subLen);
        subStr[subLen] = '\0';
        mtx.lock();
        string s(subStr);
        if (m.find(s) != m.end()) m[s]++;
        else {
            pair<string, int> ret(s, 1);
            m.insert(ret);
        }
        mtx.unlock();
        free(subStr); // the original leaked this buffer for every substring
    }
}

int main() {
    const int num_th = 5;  // number of worker threads, as described above
    const int subLen = 50; // substring length, as described above
    char line[1024];       // buffer for the current line (filled by the read loop)

    char **Array = (char **) malloc(sizeof(char *) * num_th);
    int num = 0;
    while (NOT END OF FILE) { // pseudocode: read the next line into `line`
        if (num < num_th) {
            if (num == 0)
                for (int x = 0; x < num_th; x++)
                    Array[x] = (char*) malloc(sizeof(char) * (strlen(line) + 1));
            strcpy(Array[num], line);
            num++;
        }
        else {
            vector<thread> threads;
            for (int i = 0; i < num_th; i++) {
                threads.push_back(thread(parseLine, Array[i], subLen));
            }
            for (int i = 0; i < num_th; i++) {
                if (threads[i].joinable()) {
                    threads[i].join();
                }
            }
            for (int x = 0; x < num_th; x++) free(Array[x]);
            num = 0;
        }
    }
}
It's a myth that just by virtue of using threads, the end result must be faster. In general, in order to take advantage of multithreading, two conditions must be met(*):
1) You actually have to have sufficient physical CPU cores that can run the threads at the same time.
2) The threads must have independent tasks that they can do on their own.
From a cursory examination of the shown code, it seems to fail on the second part. It seems to me that, most of the time, all of these threads will be fighting each other to acquire the same mutex. There's little to be gained from multithreading in this situation.
(*) Of course, you don't always use threads for purely performance reasons. Multithreading also comes in useful in many other situations; for example, in a program with a GUI, having a separate thread updating the GUI keeps the UI responsive even while the main execution thread is chewing on something for a while...
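One common restructuring (my sketch, not code from the answer): give each thread a private map and take the shared lock only once per thread, when merging. The name parseLineLocal is illustrative.

#include <cstring>
#include <mutex>
#include <string>
#include <unordered_map>

std::unordered_map<std::string, int> m;
std::mutex mtx;

// Each thread counts into its own private map, then merges it into the shared
// map under the lock exactly once, instead of locking for every substring.
void parseLineLocal(const char* line, int subLen) {
    std::unordered_map<std::string, int> local;
    int no_substr = (int)strlen(line) - subLen;
    for (int i = 0; i <= no_substr; ++i)
        ++local[std::string(line + i, subLen)]; // no malloc, no leak
    std::lock_guard<std::mutex> lock(mtx);
    for (const auto& kv : local)
        m[kv.first] += kv.second;
}

This turns thousands of lock acquisitions per line into one per thread, which is usually the difference between threads fighting and threads helping.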
I have created a model of a more complex program that will use multithreading and multiple hard drives to increase performance. The data size is so large that reading all of it into memory is not feasible, so the data will be read, processed, and written back out in chunks. This test program uses a pipeline design to be able to read, process, and write at the same time on 3 different threads. Because reads and writes go to different hard drives, there is no problem with reading and writing at the same time. However, the multithreaded version seems to run 2x slower than its linear version (also in the code). I have tried to keep the read and write threads alive instead of destroying them after each chunk, but the synchronization seemed to slow things down even more than the current version. I was wondering if I am doing something wrong, or how I can improve this. Thank you.
Tested using an i3-2100 @ 3.1GHz and 16GB of RAM.
#include <iostream>
#include <fstream>
#include <ctime>
#include <cstdlib>
#include <thread>

#define CHUNKSIZE 8192   //size of each chunk to process
#define DATASIZE 2097152 //total size of data

using namespace std;

int data[3][CHUNKSIZE];
int run = 0;
int totalRun = DATASIZE / CHUNKSIZE;
bool finishRead = false, finishWrite = false;
ifstream infile;
ofstream outfile;
clock_t starttime, endtime;

/*
Process a chunk of data (simulation only; it does not need to sort all the data)
*/
void quickSort(int arr[], int left, int right) {
    int i = left, j = right;
    int tmp;
    int pivot = arr[(left + right) / 2];

    while (i <= j) {
        while (arr[i] < pivot) i++;
        while (arr[j] > pivot) j--;
        if (i <= j) {
            tmp = arr[i];
            arr[i] = arr[j];
            arr[j] = tmp;
            i++;
            j--;
        }
    }

    if (left < j) quickSort(arr, left, j);
    if (i < right) quickSort(arr, i, right);
}

/*
Find runtime
*/
void diffclock() {
    double diff = (endtime - starttime) / (CLOCKS_PER_SEC / 1000);
    cout << "Total run time: " << diff << "ms" << endl;
}

/*
Read a chunk of data
*/
void readData() {
    for (int i = 0; i < CHUNKSIZE; i++) {
        infile >> data[run % 3][i];
    }
    finishRead = true;
}

/*
Write a chunk of data
*/
void writeData() {
    for (int i = 0; i < CHUNKSIZE; i++) {
        outfile << data[(run - 2) % 3][i] << endl;
    }
    finishWrite = true;
}

/*
Pipeline read, process, write using multithreading
*/
void threadtransfer() {
    starttime = clock();
    infile.open("/home/pcg/test/iothread/source.txt");
    outfile.open("/media/pcg/Data/test/iothread/ThreadDuplicate.txt");
    thread read, write;
    run = 0;
    readData();
    run = 1;
    readData();
    quickSort(data[(run - 1) % 3], 0, CHUNKSIZE - 1);
    run = 2;
    while (run < totalRun) {
        //cout<<run<<endl;
        finishRead = finishWrite = false;
        read = thread(readData);
        write = thread(writeData);
        read.detach();
        write.detach();
        quickSort(data[(run - 1) % 3], 0, CHUNKSIZE - 1);
        while (!finishRead || !finishWrite) {} //busy-wait until the next cycle is ready
        run++;
    }
    quickSort(data[(run - 1) % 3], 0, CHUNKSIZE - 1);
    writeData();
    run++;
    writeData();
    infile.close();
    outfile.close();
    endtime = clock();
    diffclock();
}

/*
Linearly read, sort, and write a chunk, then repeat.
*/
void lineartransfer() {
    int totalRun = DATASIZE / CHUNKSIZE;
    int holder[CHUNKSIZE];
    starttime = clock();
    infile.open("/home/pcg/test/iothread/source.txt");
    outfile.open("/media/pcg/Data/test/iothread/Linearduplicate.txt");
    run = 0;
    while (run < totalRun) {
        for (int i = 0; i < CHUNKSIZE; i++) infile >> holder[i];
        quickSort(holder, 0, CHUNKSIZE - 1);
        for (int i = 0; i < CHUNKSIZE; i++) outfile << holder[i] << endl;
        run++;
    }
    endtime = clock();
    diffclock();
}

/*
Create a large amount of data for testing
*/
void createData() {
    outfile.open("/home/pcg/test/iothread/source.txt");
    for (int i = 0; i < DATASIZE; i++) {
        outfile << rand() << endl;
    }
    outfile.close();
}

int main() {
    int mode = 0;
    cout << "Number of threads: " << thread::hardware_concurrency() << endl;
    cout << "Enter mode\n1.Create Data\n2.thread copy\n3.linear copy\ninput mode:";
    cin >> mode;
    if (mode == 1) createData();
    else if (mode == 2) threadtransfer();
    else if (mode == 3) lineartransfer();
    return 0;
}
Don't busy-wait. This wastes precious CPU time and may well slow down the rest (not to mention that the compiler can optimize it into an infinite loop because it can't guess whether those flags will change, so it's not even correct in the first place). And don't detach() either. Replace both detach() and the busy-waiting with join():
while (run < totalRun) {
    read = thread(readData);
    write = thread(writeData);
    quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
    read.join();
    write.join();
    run++;
}
As to the global design, well, ignoring the global variables, I guess it's otherwise acceptable if you don't expect the processing (quickSort) part to ever exceed the read/write time. I for one would use message queues to pass the buffers between the various threads (which allows you to add more processing threads if you need them, either doing the same tasks in parallel or different tasks in sequence), but maybe that's because I'm used to doing it that way.
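For concreteness (my illustration, not the answer's code): the kind of message queue meant here can be as small as a mutex plus a condition variable; producers push filled buffers and consumers block until one arrives.

#include <condition_variable>
#include <mutex>
#include <queue>

// Minimal thread-safe queue: push() hands an item to another thread,
// pop() blocks until one is available.
template <typename T>
class MessageQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            q_.push(std::move(item));
        }
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
private:
    std::queue<T> q_;
    std::mutex mtx_;
    std::condition_variable cv_;
};

With one queue of free buffers and one of filled buffers between each pair of stages, the reader, sorter, and writer can all run at their own pace with no busy-waiting.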
Since you are measuring time using clock() on a Linux machine, I expect the total CPU time to be (roughly) the same whether you run one thread or multiple threads.
Maybe you want to use time myprog instead? Or use gettimeofday to fetch the time (which will give you the wall-clock time in seconds + microseconds).
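In portable C++ terms (my suggestion, equivalent in spirit to gettimeofday), a steady_clock measurement gives wall time, which is what actually improves when threads overlap:

#include <chrono>
#include <iostream>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    // ... run threadtransfer() or lineartransfer() here ...
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    // Unlike clock(), which sums CPU time across all threads, this value
    // shrinks when work genuinely runs in parallel.
    std::cout << "Wall time: " << ms << "ms\n";
    return 0;
}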
Edit:
Next, don't use endl when writing to a file. It slows things down a lot, because the C++ runtime flushes to the file on every endl, which is an operating system call. It is almost certainly protected against multiple threads in some way, so you have three threads doing write-data a single line at a time, synchronously. That is most likely going to take nearly 3x as long as running a single thread. Also, don't write to the same file from three different threads - that's going to be bad in one way or another.
Please correct me if I am wrong, but it seems your threaded function is basically a linear function doing 3 times the work of your linear function?
In a threaded program you would create three threads and run the readData/quickSort functions once on each thread (distributing the workload), but in your program it seems like the thread simulation is actually just reading three times, quicksorting three times, and writing three times, and totalling the time it takes to do all three of each.
I'm trying to get a good understanding of branch prediction by measuring the time to run loops with predictable branches vs. loops with random branches.
So I wrote a program that takes large arrays of 0's and 1's arranged in different orders (i.e. all 0's, repeating 0-1, all random), and iterates through the array, branching based on whether the current index is 0 or 1 and doing time-wasting work.
I expected that harder-to-guess arrays would take longer to run on, since the branch predictor would guess wrong more often, and that the time delta between runs on two sets of arrays would remain the same regardless of the amount of time-wasting work.
However, as the amount of time-wasting work increased, the difference in time-to-run between arrays increased, A LOT.
(X-axis is the amount of time-wasting work, Y-axis is the time-to-run)
Does anyone understand this behavior? The code I'm running is below:
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <stdio.h>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

static const int s_iArrayLen = 999999;
static const int s_iMaxPipelineLen = 60;
static const int s_iNumTrials = 10;

int doWorkAndReturnMicrosecondsElapsed(int* vals, int pipelineLen) {
    int* zeroNums = new int[pipelineLen];
    int* oneNums = new int[pipelineLen];
    for (int i = 0; i < pipelineLen; ++i)
        zeroNums[i] = oneNums[i] = 0;

    chrono::time_point<chrono::system_clock> start, end;
    start = chrono::system_clock::now();
    for (int i = 0; i < s_iArrayLen; ++i) {
        if (vals[i] == 0) {
            for (int i = 0; i < pipelineLen; ++i)
                ++zeroNums[i];
        }
        else {
            for (int i = 0; i < pipelineLen; ++i)
                ++oneNums[i];
        }
    }
    end = chrono::system_clock::now();
    int elapsedMicroseconds = (int)chrono::duration_cast<chrono::microseconds>(end - start).count();

    //This should never fire; it just exists to guarantee the compiler doesn't optimize out zeroNums/oneNums
    for (int i = 0; i < pipelineLen - 1; ++i)
        if (zeroNums[i] != zeroNums[i+1] || oneNums[i] != oneNums[i+1])
            return -1;

    delete[] zeroNums;
    delete[] oneNums;
    return elapsedMicroseconds;
}

struct TestMethod {
    string name;
    void (*func)(int, int&);
    int* results;

    TestMethod(string _name, void (*_func)(int, int&)) { name = _name; func = _func; results = new int[s_iMaxPipelineLen]; }
};

int main() {
    srand( (unsigned int)time(nullptr) );

    vector<TestMethod> testMethods;
    testMethods.push_back(TestMethod("all-zero", [](int index, int& out) { out = 0; } ));
    testMethods.push_back(TestMethod("repeat-0-1", [](int index, int& out) { out = index % 2; } ));
    testMethods.push_back(TestMethod("repeat-0-0-0-1", [](int index, int& out) { out = (index % 4 == 0) ? 0 : 1; } ));
    testMethods.push_back(TestMethod("rand", [](int index, int& out) { out = rand() % 2; } ));

    int* vals = new int[s_iArrayLen];

    for (int currentPipelineLen = 0; currentPipelineLen < s_iMaxPipelineLen; ++currentPipelineLen) {
        for (int currentMethod = 0; currentMethod < (int)testMethods.size(); ++currentMethod) {
            int resultsSum = 0;
            for (int trialNum = 0; trialNum < s_iNumTrials; ++trialNum) {
                //Generate a new array...
                for (int i = 0; i < s_iArrayLen; ++i)
                    testMethods[currentMethod].func(i, vals[i]);

                //And record how long it takes
                resultsSum += doWorkAndReturnMicrosecondsElapsed(vals, currentPipelineLen);
            }
            testMethods[currentMethod].results[currentPipelineLen] = (resultsSum / s_iNumTrials);
        }
    }

    cout << "\t";
    for (int i = 0; i < s_iMaxPipelineLen; ++i) {
        cout << i << "\t";
    }
    cout << "\n";
    for (int i = 0; i < (int)testMethods.size(); ++i) {
        cout << testMethods[i].name.c_str() << "\t";
        for (int j = 0; j < s_iMaxPipelineLen; ++j) {
            cout << testMethods[i].results[j] << "\t";
        }
        cout << "\n";
    }

    int end;
    cin >> end;

    delete[] vals;
    return 0;
}
Pastebin link: http://pastebin.com/F0JAu3uw
I think you may be measuring cache/memory performance more than branch prediction. Your inner 'work' loop is accessing an ever-increasing chunk of memory, which may explain the linear growth, the periodic behaviour, etc.
I could be wrong, as I've not tried replicating your results, but if I were you I'd factor memory accesses out before timing other things. Perhaps sum one volatile variable into another, rather than working in an array.
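A sketch of that suggestion (mine, not JasonD's), as a drop-in replacement for the timed region of doWorkAndReturnMicrosecondsElapsed: the volatile accumulators keep the compiler from deleting the work while avoiding any array traffic.

#include <chrono>

// Same branch structure as the original, but the 'work' is arithmetic on
// volatile scalars, so the measurement isn't dominated by memory accesses.
int doWorkVolatile(int* vals, int arrayLen, int pipelineLen) {
    volatile int zeroAcc = 0, oneAcc = 0;
    auto start = std::chrono::system_clock::now();
    for (int i = 0; i < arrayLen; ++i) {
        if (vals[i] == 0)
            for (int k = 0; k < pipelineLen; ++k) zeroAcc = zeroAcc + 1;
        else
            for (int k = 0; k < pipelineLen; ++k) oneAcc = oneAcc + 1;
    }
    auto end = std::chrono::system_clock::now();
    return (int)std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}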
Note also that, depending on the CPU, the branch prediction can be a lot smarter than just recording the last time a branch was taken - repeating patterns, for example, aren't as bad as random data.
Ok, a quick and dirty test I knocked up on my tea break tried to mirror your own test method, but without thrashing the cache. (The resulting graph is not reproduced here.)
Is that more what you expected?
If I can spare any time later there's something else I want to try, as I've not really looked at what the compiler is doing...
Edit:
And, here's my final test - I recoded it in assembler to remove the loop branching, ensure an exact number of instructions in each path, etc.
I also added an extra case, of a 5-bit repeating pattern. It seems pretty hard to upset the branch predictor on my ageing Xeon.
In addition to what JasonD pointed out, I would also like to note that there is a condition inside the for loop, which may affect branch prediction:
if(vals[i] == 0)
{
    for(int i = 0; i < pipelineLen; ++i)
        ++zeroNums[i];
}
The condition i < pipelineLen is a branch just like your ifs. Of course the compiler may unroll this loop; however, pipelineLen is an argument passed to the function, so it probably does not.
I'm not sure if this can explain the wavy pattern of your results, but:
Since the BTB is only 16 entries long in the Pentium 4 processor, the prediction will eventually fail for loops that are longer than 16 iterations. This limitation can be avoided by unrolling a loop until it is only 16 iterations long. When this is done, a loop conditional will always fit into the BTB, and a branch misprediction will not occur on loop exit. The following is an example of loop unrolling:
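(The article's own listing isn't reproduced here; this is a generic sketch of the transformation it describes.) Unrolling by 4 makes the backward branch execute a quarter as often and shortens the trip count the predictor has to track:

// Before: the loop branch is taken once per element.
int sumRolled(const int* a, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

// After 4x unrolling (assuming n is a multiple of 4): the same work with a
// quarter of the loop branches.
int sumUnrolled(const int* a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i += 4)
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    return sum;
}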
Read full article: http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts
So your loops are not only measuring memory throughput - they are also affecting the BTB.
If you passed the 0-1 pattern in your list but then executed a for loop with pipelineLen = 2, your BTB will be filled with something like 0-1-1-0 - 1-1-1-0 - 0-1-1-0 - 1-1-1-0, and then it will start to overlap. This can indeed explain the wavy pattern of your results (some overlaps will be more harmful than others).
Take this as an example of what may happen rather than a literal explanation. Your CPU may have a much more sophisticated branch prediction architecture.
What is the overhead in splitting a for-loop like this,
int i;
for (i = 0; i < exchanges; i++)
{
    // some code
    // some more code
    // even more code
}
into multiple for-loops like this?
int i;
for (i = 0; i < exchanges; i++)
{
    // some code
}
for (i = 0; i < exchanges; i++)
{
    // some more code
}
for (i = 0; i < exchanges; i++)
{
    // even more code
}
The code is performance-sensitive, but doing the latter would improve readability significantly. (In case it matters, there are no other loops, variable declarations, or function calls, save for a few accessors, within each loop.)
I'm not exactly a low-level programming guru, so it'd be even better if someone could quantify the performance hit in terms of basic operations, e.g. "Each additional for-loop would cost the equivalent of two int allocations." But, I understand (and wouldn't be surprised) if it's not that simple.
Many thanks in advance.
There are often way too many factors at play... And it's easy to demonstrate both ways:
For example, splitting the following loop results in almost a 2x slow-down (full test code at the bottom):
for (int c = 0; c < size; c++){
    data[c] *= 10;
    data[c] += 7;
    data[c] &= 15;
}
And this is almost stating the obvious, since you pay the loop overhead 3 times instead of once and you make 3 passes over the entire array instead of 1.
On the other hand, if you take a look at this question: Why are elementwise additions much faster in separate loops than in a combined loop?
for (int j = 0; j < n; j++){
    a1[j] += b1[j];
    c1[j] += d1[j];
}
The opposite is sometimes true due to memory alignment.
What to take from this?
Pretty much anything can happen. Neither way is always faster and it depends heavily on what's inside the loops.
And as such, determining whether such an optimization will increase performance is usually trial-and-error. With enough experience you can make fairly confident (educated) guesses. But in general, expect anything.
"Each additional for-loop would cost the equivalent of two int allocations."
You are correct that it's not that simple. In fact it's so complicated that the numbers don't mean much. A loop iteration may take X cycles in one context, but Y cycles in another due to a multitude of factors such as Out-of-order Execution and data dependencies.
Not only is the performance context-dependent, it also varies across different processors.
Here's the test code:
#include <cstdlib>
#include <time.h>
#include <iostream>
using namespace std;

int main(){
    int size = 10000;
    int *data = new int[size]();  // zero-initialise so the reads are well-defined

    clock_t start = clock();

    for (int i = 0; i < 1000000; i++){
#ifdef TOGETHER
        for (int c = 0; c < size; c++){
            data[c] *= 10;
            data[c] += 7;
            data[c] &= 15;
        }
#else
        for (int c = 0; c < size; c++){
            data[c] *= 10;
        }
        for (int c = 0; c < size; c++){
            data[c] += 7;
        }
        for (int c = 0; c < size; c++){
            data[c] &= 15;
        }
#endif
    }

    clock_t end = clock();
    cout << (double)(end - start) / CLOCKS_PER_SEC << endl;

    delete[] data;
    system("pause");
}
Output (one loop): 4.08 seconds
Output (3 loops): 7.17 seconds
Processors prefer a higher ratio of data instructions to jump instructions.
Branch instructions may force your processor to clear the instruction pipeline and reload.
Based on the reloading of the instruction pipeline, the first method would be faster, but not significantly: you would add at least 2 new branch instructions by splitting.
A faster optimization is to unroll the loop. Unrolling the loop tries to improve the ratio of data instructions to branch instructions by performing more instructions inside the loop before branching back to the top of the loop.
Another significant performance optimization is to organize the data so it fits into one of the processor's cache lines. So, for example, you could have inner loops that process one cache-sized block of data, while the outer loop loads new blocks into the cache (see the sketch below).
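A minimal sketch of that blocking idea (my illustration; kBlock and processBlocked are made-up names, and the block size would need tuning to the target cache):

#include <cstddef>

// Process the array in cache-sized tiles: both passes run over a tile while
// it is still hot, instead of streaming the whole array once per pass.
void processBlocked(int* data, std::size_t n) {
    const std::size_t kBlock = 4096; // elements per tile; tune to the cache level
    for (std::size_t base = 0; base < n; base += kBlock) {
        std::size_t end = base + kBlock < n ? base + kBlock : n;
        for (std::size_t i = base; i < end; ++i) data[i] *= 10; // pass 1 on the hot tile
        for (std::size_t i = base; i < end; ++i) data[i] += 7;  // pass 2 reuses it
    }
}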
These optimizations should only be applied after the program runs correctly and robustly, and the environment demands more performance. The environment is defined by observers (animation / movies), users (waiting for a response) or hardware (performing operations before a critical time event). Any other purpose is a waste of your time, as the OS (running concurrent programs) and storage access will contribute more to your program's performance issues.
This will give you a good indication of whether or not one version is faster than another.
#include <array>
#include <chrono>
#include <iostream>
#include <numeric>
#include <string>

const int iterations = 100;

namespace
{
    const int exchanges = 200;

    template<typename TTest>
    void Test(const std::string &name, TTest &&test)
    {
        typedef std::chrono::high_resolution_clock Clock;
        typedef std::chrono::duration<float, std::milli> ms;

        std::array<float, iterations> timings;
        for (auto i = 0; i != iterations; ++i)
        {
            auto t0 = Clock::now();
            test();
            timings[i] = ms(Clock::now() - t0).count();
        }
        // The initial value must be a float (0.0f): accumulating into an int
        // would truncate the fractional milliseconds away.
        auto avg = std::accumulate(timings.begin(), timings.end(), 0.0f) / iterations;
        std::cout << "Average time, " << name << ": " << avg << std::endl;
    }
}

int main()
{
    Test("single loop",
        []()
        {
            for (auto i = 0; i < exchanges; ++i)
            {
                // some code
                // some more code
                // even more code
            }
        });

    Test("separated loops",
        []()
        {
            for (auto i = 0; i < exchanges; ++i)
            {
                // some code
            }
            for (auto i = 0; i < exchanges; ++i)
            {
                // some more code
            }
            for (auto i = 0; i < exchanges; ++i)
            {
                // even more code
            }
        });
}
The thing is quite simple. The first code is like taking a single lap on a race track, and the other code is like taking a full 3-lap race: more time is required to take three laps than one lap. However, if the loops are doing something that needs to be done in sequence, and they depend on each other, then the second code will do the job. For example, if the first loop does some calculations and the second loop does some work with those calculations, then both loops need to be done in sequence; otherwise not...
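To make that dependence point concrete (my illustration, not from the answer): splitting is only safe when the later statements don't consume values the combined loop hadn't produced yet.

// Hypothetical example of when loop splitting changes the result.
void fissionExample(int* a, int* b, int n) {
    // Safe to split: the two statements touch independent elements.
    for (int i = 0; i < n; ++i) { a[i] = i * 2; b[i] = i * 3; }

    // NOT safe to split: the combined loop reads the old a[i + 1], but after
    // splitting, the second loop would see the freshly written value instead.
    for (int i = 0; i + 1 < n; ++i) { a[i] = i * 2; b[i] = a[i + 1]; }
}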