I have code
const int N = 100000000;
int main() {
FILE* fp = fopen("result.txt", "w");
for (int i=0; i<N; ++i) {
int res = f(i);
fprintf (fp, "%d\t%d\n", i, res);
}
return 0;
}
Here f takes several milliseconds on average when run in a single thread.
To make it faster I'd like to use multithreading.
1. What provides a way to get the next i? Or do I need to lock, get, increment and unlock?
2. Should the writing be done in a separate thread to make things easier?
3. Do I need temporary memory in case f(7) is finished before f(3)?
4. If 3, is it likely that f(3) is not calculated for a long time and the temporary memory fills up?
I'm currently using C++11, but requiring a higher version of C++ may be acceptable.
General rule for improving performance:
1. Find a way to measure performance (an automated test).
2. Profile the existing code (find the bottlenecks).
3. Understand the findings from point 2 and try to fix them (without multithreading).
4. Repeat the measurement from point 1 and decide whether the change gave the expected improvement.
5. Go back to point 2 a couple of times.
6. Only if steps 1 to 5 didn't help, try multithreading. The procedure is the same as in points 2 to 5, but you have to think: can you split the large task into a couple of smaller ones? If yes, do they need synchronization? Can you avoid it?
In your example, just split the work into 8 (or more) separate parts, for instance one output file per part, and merge them at the end if you have to.
It can look like this:
#include <vector>
#include <future>
#include <fstream>
int f(int); // the question's function, assumed to be defined elsewhere
std::vector<int> multi_f(int start, int stop)
{
std::vector<int> r;
r.reserve(stop - start);
for (;start < stop; ++start) r.push_back(f(start));
return r;
}
int main()
{
const int N = 100000000;
const int tasks = 100;
const int sampleCount = N / tasks;
std::vector<std::future<std::vector<int>>> allResults;
for (int i=0; i < N; i += sampleCount) {
allResults.push_back(std::async(&multi_f, i, i + sampleCount));
}
std::ofstream out{ "result.txt" }; // it is a myth that printf is faster; named `out` so it does not shadow f()
int i = 0;
for (auto& task : allResults)
{
for (auto r : task.get()) {
out << i++ << '\t' << r << '\n';
}
}
return 0;
}
Related
I've made a program which processes a lot of data, and it takes forever at runtime, but looking in Task Manager I found out that the executable only uses a small part of my CPU and my RAM...
How can I tell my IDE to allocate more resources (as much as it can) to my program?
Running it in Release x64 helps but not enough.
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>
int main() {
using namespace std;
struct library {
int num = 0;
unsigned int total = 0;
int booksnum = 0;
int signup = 0;
int ship = 0;
vector<int> scores;
};
unsigned int libraries = 30000; // in the program this number is read from a file
unsigned int books = 20000; // in the program this number is read from a file
unsigned int days = 40000; // in the program this number is read from a file
vector<int> scores(books, 0);
vector<library*> all(libraries);
for(auto& it : all) {
it = new library;
it->booksnum = 15000; // in the program this number is read from a file
it->signup = 50000; // in the program this number is read from a file
it->ship = 99999; // in the program this number is read from a file
it->scores.resize(it->booksnum, 0);
}
unsigned int past = 0;
for(size_t done = 0; done < all.size(); done++) {
if(!(done % 1000)) cout << done << '-' << all.size() << endl;
for(size_t m = done; m < all.size() - 1; m++) {
all[m]->total = 0;
{
double run = past + all[m]->signup;
for(auto at : all[m]->scores) {
if(days - run > 0) {
all[m]->total += scores[at];
run += 1. / all[m]->ship;
} else
break;
}
}
}
for(size_t n = done; n < all.size(); n++)
for(size_t m = 0; m < all.size() - 1; m++) {
if(all[m]->total < all[m + 1]->total) swap(all[m], all[m + 1]);
}
past += all[done]->signup;
if (past > days) break;
}
return 0;
}
This is the loop which takes up so much time... For some reason even using pointers to library doesn't make it faster.
RAM doesn't make things go faster. RAM is just there to store data your program uses; if it's not using much then it doesn't need much.
Similarly, in terms of CPU usage, the program will use everything it can (the operating system can change priority, and there are APIs for that, but this is probably not your issue).
If you're seeing it using a fraction of CPU percentage, the chances are you're either waiting on I/O or writing a single threaded application that can only use a single core at any one time. If you've optimised your solution as much as possible on a single thread, then it's worth looking into breaking its work down across multiple threads.
What you need to do is use a tool called a profiler to find out where your code is spending its time and then use that information to optimise it. This will help you with microoptimisations especially, but for larger algorithmic changes (i.e. changing how it works entirely), you'll need to think about things at a higher level of abstraction.
I want to run n instances of an algorithm in parallel and compute the mean of a function f of the results. If I'm not terribly wrong, the following code achieves this goal:
struct X {};
int f(X) { return /* ... */; }
int main()
{
std::size_t const n = /* ... */;
std::vector<std::future<X>> results;
results.reserve(n);
for (std::size_t i = 0; i < n; ++i)
results.push_back(std::async([]() -> X { /* ... */ }));
int mean = 0;
for (std::size_t i = 0; i < n; ++i)
mean += f(results[i].get());
mean /= n;
}
However, is there a better way to do this? The obvious problem with the code above is the following: The order of summation in the line mean += f(results[i].get()); doesn't matter. Thus, it would be better to add the results to mean as soon as they are available. If in the code above, the result of the ith task is not yet available, the program waits for that result, while it might be possible that all results of task i + 1 to n - 1 are already available.
So, how can we do this in a better way?
You're blocking on the future, which is one operation too early.
Why not update the accumulated sum in the async thread and then block on all threads being complete?
#include <condition_variable>
#include <thread>
#include <mutex>
struct X {};
int f(X);
X make_x(int);
struct algo_state
{
std::mutex m;
std::condition_variable cv;
int remaining_tasks;
int accumulator;
};
void task(X x, algo_state& state)
{
auto part = f(x);
auto lock = std::unique_lock(state.m);
state.accumulator += part;
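if (--state.remaining_tasks == 0)
{
// Notify while still holding the lock: main() cannot return from wait()
// and destroy `state` until this thread has released the mutex.
state.cv.notify_one();
}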
}
int main()
{
int get_n();
auto n = get_n();
algo_state state = {
{},
{},
n,
0
};
for(int i = 0 ; i < n ; ++i)
std::thread([&state, i] { task(make_x(i), state); }).detach(); // capture i by value: the loop variable keeps changing while the threads run
auto lock = std::unique_lock(state.m);
state.cv.wait(lock, [&] { return state.remaining_tasks == 0; });
auto mean = state.accumulator / n;
return mean;
}
Couldn't fit this into comment:
Instead of passing N functions to M threads for N data points (X), you can have:
K queues, each holding N/K data elements
M threads in a pool (producers, all running the same function)
1 consumer (adder) thread (main?)
and pass only the N data points between threads. Passing functions and executing them can have more overhead than passing just data.
The functions can also add into a shared variable, so no extra summation is needed outside. Then only the M producers do the work, with suitable synchronization such as atomics or lock guards.
What is the sizeof that struct?
Easiest way
What about making the lambda return f(x) instead of x:
for (std::size_t i = 0; i < n; ++i)
results.push_back(std::async([]() -> int { /* ... */ }));
In this case, f() could be performed as soon as possible and without waiting. The average computation would still need to wait in sequential order. But this is a non-problem, since there's nothing faster than summing integers, and anyway you would not be able to finish calculating the average before having summed every part.
Easy alternative
Still another approach could be to use atomic<int> mean;, capture it in the lambda, and update the sum there. In the end you'd only need to be sure that every future has delivered before doing the division. But as said, considering the cost of an integer addition, this might be overkill here.
std::vector<std::future<void>> results;
...
atomic<int> mean{0};
for (std::size_t i = 0; i < n; ++i)
results.push_back(std::async([&mean]() -> void
{ X x = ...; int i=f(x); mean+=i; return; }));
for (std::size_t i = 0; i < n; ++i)
results[i].get();
mean = mean/n; // attention: not an atomic operation, but all concurrent work is done by now
I am very new to the modern C++ library, and I am trying to learn how to use std::async to perform some operations on a big pointer array. The sample code I have written crashes at the point where the async task is launched.
Sample code:
#include <iostream>
#include <future>
#include <tuple>
#include <numeric>
#define maximum(a,b) (((a) > (b)) ? (a) : (b))
class Foo {
bool flag;
public:
Foo(bool b) : flag(b) {}
//******
//
//******
std::tuple<long long, int> calc(int* a, int begIdx, int endIdx) {
long sum = 0;
int max = 0;
if (!(*this).flag) {
return std::make_tuple(sum, max);
}
if (endIdx - begIdx < 100)
{
for (int i = begIdx; i < endIdx; ++i)
{
sum += a[i];
if (max < a[i])
max = a[i];
}
return std::make_tuple(sum, max);
}
int midIdx = endIdx / 2;
auto handle = std::async(&Foo::calc, this, std::ref(a), midIdx, endIdx);
auto resultTuple = calc(a, begIdx, midIdx);
auto asyncTuple = handle.get();
sum = std::get<0>(asyncTuple) +std::get<0>(resultTuple);
max = maximum(std::get<1>(asyncTuple), std::get<1>(resultTuple));
return std::make_tuple(sum, max);
}
//******
//
//******
void call_calc(int*& a) {
auto handle = std::async(&Foo::calc, this, std::ref(a), 0, 10000);
auto resultTuple = handle.get();
std::cout << "Sum = " << std::get<0>(resultTuple) << " Maximum = " << std::get<1>(resultTuple) << std::endl;
}
};
//******
//
//******
int main() {
int* nums = new int[10000];
for (int i = 0; i < 10000; ++i)
nums[i] = rand() % 10000 + 1;
Foo foo(true);
foo.call_calc(nums);
delete[] nums;
}
Can anyone help me identify why it crashes?
Is there any better approach to apply parallelism to operations on a big pointer array?
The fundamental problem is your code wants to launch more than array size / 100 threads. That means more than 100 threads. 100 threads won't do anything good; they'll thrash. See std::thread::hardware_concurrency, and in general don't use raw async or thread in production applications; write task pools and splice together futures and the like.
That many threads is both extremely inefficient and could exhaust system resources.
The second problem is you failed to calculate the average of 2 values.
The average of begIdx and endIdx is not endIdx/2 but rather:
int midIdx = begIdx + (endIdx-begIdx) / 2;
Live example.
You'll notice I discovered the problem with your program by adding intermediate output. In particular, I had it print out the ranges it was working on, and I noticed it was repeating ranges. This is known as "printf debugging", and is pretty powerful especially when step-based debugging isn't (with this many threads, stepping through the code will be brain-numbing)
The problem with async calls is that they are not executed in some universe where an infinite number of tasks can run at exactly the same time.
Async calls are executed on a machine with a finite number of processors/cores, and the calls have to queue up to be executed on them.
This is where synchronization problems, blocking, starvation and other multithreading issues come into play.
Your algorithm is very difficult to follow, as it is spawning tasks inside already created tasks. Something is happening, but it is difficult to follow.
I would solve this problem by:
Creating a vector of results (which will be from async threads)
In a loop execute the async calls (assigning the result to the vector)
Afterwards, loop through the results vector, gathering the results
I have two arrays. One is x times the size of the other.
I need to copy only every x-th element from the first (bigger) array to the second (smaller) array.
Meaning elements 0, x, 2x, and so on.
Each array sits as one contiguous block in memory.
The arrays hold simple values.
I am currently doing it using a loop.
Is there any faster smarter way to do this?
Maybe with ostream?
Thanks!
You are doing something like this right?
#include <cstddef>
int main()
{
const std::size_t N = 20;
const std::size_t x = 5;
int input[N*x];
int output[N];
for(std::size_t i = 0; i < N; ++i)
output[i] = input[i*x];
}
Well, I don't know of any function that can do that, so I would use the for loop. It is fast.
EDIT: even faster solution (to avoid multiplications)(C++03 Version)
int* inputit = input;
int* outputit = output;
int* outputend = output+N;
while(outputit != outputend)
{
*outputit = *inputit;
++outputit;
inputit+=x;
}
If I get you right, you want to copy every n-th element. The simplest solution would be
#include <iostream>
int main(int argc, char **argv) {
const int size[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
int out[5];
int *pout = out;
for (const int *i = &size[0]; i < &size[10]; i += 3) {
std::cout << *i << ", ";
*pout++ = *i;
if (pout > &out[4]) {
break;
}
}
std::cout << "\n";
for (const int *i = out; i < pout; i++) {
std::cout << *i << ", ";
}
std::cout << std::endl;
}
You can use copy_if and lambda in C++11:
copy_if(a.begin(), a.end(), b.begin(), [&] (const int& i) -> bool
{ size_t index = &i - &a[0]; return index % x == 0; });
A test case would be:
#include <iostream>
#include <vector>
#include <algorithm> // std::copy_if
using namespace std;
int main()
{
std::vector<int> a;
a.push_back(0);
a.push_back(1);
a.push_back(2);
a.push_back(3);
a.push_back(4);
std::vector<int> b(3);
int x = 2;
std::copy_if(a.begin(), a.end(), b.begin(), [&] (const int& i) -> bool
{ size_t index = &i - &a[0]; return index % x == 0; });
for(int i=0; i<b.size(); i++)
{
std::cout<<" "<<b[i];
}
return 0;
}
Note that you need to use a C++11 compatible compiler (if gcc, with -std=c++11 option).
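#include <array>
#include <cstddef>
#include <iterator>
// Copies every x-th element of [first, last) to result.
// Needs random-access iterators so the remaining distance can be checked
// before stepping; advancing past `last` would be undefined behaviour.
template<typename InIt, typename OutIt>
void copy_step_x(InIt first, InIt last, OutIt result, std::ptrdiff_t x)
{
for(auto it = first; it != last; ) {
*result++ = *it;
if(last - it <= x) break; // fewer than x elements left: stop before overshooting
it += x;
}
}
int main()
{
std::array<int, 64> ar0{}; // value-initialised so no indeterminate values are read
std::array<int, 32> ar1{};
copy_step_x(std::begin(ar0), std::end(ar0), std::begin(ar1), ar0.size() / ar1.size());
}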
The proper and clean way of doing this is a loop, as has been said before. A number of good answers here show you how to do that.
I do NOT recommend doing it the following way. It depends on a lot of specifics (the value range of X, the size and value range of the variables, and so on), but for some cases you could do it like this:
for every 4 bytes:
tmp = copy a 32 bit variable from the array, this now contains the 4 new values
real_tmp = bitmask tmp to get the right variable of those 4
add it to the list
This only works if you want values <= 255 and X==4, but if you want something faster than a loop this is one way of doing it. This could be modified for 16bit, 32bit or 64bit values and every 2,3,4,5,6,7,8(64 bit) values but for X>8 this method will not work, or for values that are not allocated in a linear fashion. It won't work for classes either.
For this kind of optimization to be worth the hassle, the code needs to run often. I assume you've run a profiler to confirm that the old copy is a bottleneck before starting to implement something like this.
The following is an observation on how most CPU designs are unimaginative when it comes to this sort of thing.
On some OpenVPX systems you have the ability to DMA data from one processor to another. The one that I use has a pretty advanced DMA controller, and it can do this sort of thing for you.
For example, I could ask it to copy your big array to another CPU, but skipping over N elements of the array, just like you're trying to do. As if by magic the destination CPU would have the smaller array in its memory. I could also if I wanted perform matrix transformations, etc.
The nice thing is that it takes no CPU time at all to do this; it's all done by the DMA engine. My CPUs can then concentrate on harder sums instead of being tied down shuffling data around.
I think the Cell processor in the PS3 can do this sort of thing internally (I know it can DMA data around, I don't know if it will do the strip mining at the same time). Some DSP chips can do it too. But x86 doesn't do it, meaning us software programmers have to write ridiculous loops just moving data in simple patterns. Yawn.
I have written a multithreaded memcpy() in the past to do this sort of thing. The only way you're going to beat a for loop is to have several threads doing your for loop in several parallel chunks.
If you pick the right compiler (e.g. Intel's ICC or Sun/Oracle's Sun Studio), it can be made to parallelise your for loops automatically (so your source code doesn't change). That's probably the simplest way to beat your original for loop.
I'm trying to get a good understanding of branch prediction by measuring the time to run loops with predictable branches vs. loops with random branches.
So I wrote a program that takes large arrays of 0's and 1's arranged in different orders (i.e. all 0's, repeating 0-1, all rand), and iterates through the array branching based on if the current index is 0 or 1, doing time-wasting work.
I expected that harder-to-guess arrays would take longer to run on, since the branch predictor would guess wrong more often, and that the time-delta between runs on two sets of arrays would remain the same regardless of the amount of time-wasting work.
However, as the amount of time-wasting work increased, the difference in time-to-run between arrays increased, A LOT.
(Graph: X-axis is the amount of time-wasting work, Y-axis is time-to-run)
Does anyone understand this behavior? You can see the code I'm running below:
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <string>
using namespace std;
static const int s_iArrayLen = 999999;
static const int s_iMaxPipelineLen = 60;
static const int s_iNumTrials = 10;
int doWorkAndReturnMicrosecondsElapsed(int* vals, int pipelineLen){
int* zeroNums = new int[pipelineLen];
int* oneNums = new int[pipelineLen];
for(int i = 0; i < pipelineLen; ++i)
zeroNums[i] = oneNums[i] = 0;
chrono::time_point<chrono::system_clock> start, end;
start = chrono::system_clock::now();
for(int i = 0; i < s_iArrayLen; ++i){
if(vals[i] == 0){
for(int i = 0; i < pipelineLen; ++i)
++zeroNums[i];
}
else{
for(int i = 0; i < pipelineLen; ++i)
++oneNums[i];
}
}
end = chrono::system_clock::now();
int elapsedMicroseconds = (int)chrono::duration_cast<chrono::microseconds>(end-start).count();
//This should never fire, it just exists to guarantee the compiler doesn't compile out our zeroNums/oneNums
for(int i = 0; i < pipelineLen - 1; ++i)
if(zeroNums[i] != zeroNums[i+1] || oneNums[i] != oneNums[i+1])
return -1;
delete[] zeroNums;
delete[] oneNums;
return elapsedMicroseconds;
}
struct TestMethod{
string name;
void (*func)(int, int&);
int* results;
TestMethod(string _name, void (*_func)(int, int&)) { name = _name; func = _func; results = new int[s_iMaxPipelineLen]; }
};
int main(){
srand( (unsigned int)time(nullptr) );
vector<TestMethod> testMethods;
testMethods.push_back(TestMethod("all-zero", [](int index, int& out) { out = 0; } ));
testMethods.push_back(TestMethod("repeat-0-1", [](int index, int& out) { out = index % 2; } ));
testMethods.push_back(TestMethod("repeat-0-0-0-1", [](int index, int& out) { out = (index % 4 == 0) ? 0 : 1; } ));
testMethods.push_back(TestMethod("rand", [](int index, int& out) { out = rand() % 2; } ));
int* vals = new int[s_iArrayLen];
for(int currentPipelineLen = 0; currentPipelineLen < s_iMaxPipelineLen; ++currentPipelineLen){
for(int currentMethod = 0; currentMethod < (int)testMethods.size(); ++currentMethod){
int resultsSum = 0;
for(int trialNum = 0; trialNum < s_iNumTrials; ++trialNum){
//Generate a new array...
for(int i = 0; i < s_iArrayLen; ++i)
testMethods[currentMethod].func(i, vals[i]);
//And record how long it takes
resultsSum += doWorkAndReturnMicrosecondsElapsed(vals, currentPipelineLen);
}
testMethods[currentMethod].results[currentPipelineLen] = (resultsSum / s_iNumTrials);
}
}
cout << "\t";
for(int i = 0; i < s_iMaxPipelineLen; ++i){
cout << i << "\t";
}
cout << "\n";
for (int i = 0; i < (int)testMethods.size(); ++i){
cout << testMethods[i].name.c_str() << "\t";
for(int j = 0; j < s_iMaxPipelineLen; ++j){
cout << testMethods[i].results[j] << "\t";
}
cout << "\n";
}
int end;
cin >> end;
delete[] vals;
}
Pastebin link: http://pastebin.com/F0JAu3uw
I think you may be measuring the cache/memory performance, more than the branch prediction. Your inner 'work' loop is accessing an ever increasing chunk of memory. Which may explain the linear growth, the periodic behaviour, etc.
I could be wrong, as I've not tried replicating your results, but if I were you I'd factor out memory accesses before timing other things. Perhaps sum one volatile variable into another, rather than working in an array.
Note also that, depending on the CPU, the branch prediction can be a lot smarter than just recording the last time a branch was taken - repeating patterns, for example, aren't as bad as random data.
Ok, a quick and dirty test I knocked up on my tea break which tried to mirror your own test method, but without thrashing the cache, looks like this:
Is that more what you expected?
If I can spare any time later there's something else I want to try, as I've not really looked at what the compiler is doing...
Edit:
And, here's my final test - I recoded it in assembler to remove the loop branching, ensure an exact number of instructions in each path, etc.
I also added an extra case, of a 5-bit repeating pattern. It seems pretty hard to upset the branch predictor on my ageing Xeon.
In addition to what JasonD pointed out, I would also like to note that there are conditions inside the for loops, which may affect branch prediction:
if(vals[i] == 0)
{
for(int i = 0; i < pipelineLen; ++i)
++zeroNums[i];
}
i < pipelineLen; is a condition just like your ifs. Of course the compiler may unroll this loop; however, pipelineLen is an argument passed to the function, so it probably does not.
I'm not sure if this can explain the wavy pattern of your results, but:
Since the BTB is only 16 entries long in the Pentium 4 processor, the prediction will eventually fail for loops that are longer than 16 iterations. This limitation can be avoided by unrolling a loop until it is only 16 iterations long. When this is done, a loop conditional will always fit into the BTB, and a branch misprediction will not occur on loop exit. The following is an example of loop unrolling:
Read full article: http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts
So your loops are not only measuring memory throughput, they are also affecting the BTB.
If you have passed the 0-1 pattern in your list but then executed a for loop with pipelineLen = 2, your BTB will be filled with something like 0-1-1-0 - 1-1-1-0 - 0-1-1-0 - 1-1-1-0, and then it will start to overlap. So this can indeed explain the wavy pattern of your results (some overlaps will be more harmful than others).
Take this as an example of what may happen rather than a literal explanation. Your CPU may have a much more sophisticated branch prediction architecture.