I have tested three C++ programs which load data from memory to the CPU, perform simple + or x operations (this is the part I time), and then report the result. The three programs have the same structure but use different data types (int, double, float).
The testing result: the time doubles when the data size doubles, for all three programs.
However, I have the following observations.
Observation 1: the code is roughly 2x slower when compiler optimization is not used. This is strange to me, since the loading time (the bottleneck) should not be affected by the compiler.
Observation 2: the double version is roughly 2x faster than both the int and float versions when no compiler optimization is used and the data size is fixed (256MB, 512MB, 1024MB, 2048MB, 4096MB). This is also strange, since double should be the slowest one.
Remark on Observation 2: the times for all three programs are similar when I add compiler optimization (-O, -O2, -O3).
The attached code is here:
#include <cstdlib>
#include <ctime>
#include <sys/time.h>
#include <iostream>
using namespace std;

int main()
{
    float value = 0;                 // accumulator; must be initialized before use
    double totalTimeDifference;
    const int numberOFElements = 178956970; // ~4GB for the 6 float arrays in total

    float* FLOAT_Array_one   = new float[numberOFElements];
    float* FLOAT_Array_two   = new float[numberOFElements];
    float* FLOAT_Array_three = new float[numberOFElements];
    float* FLOAT_Array_four  = new float[numberOFElements];
    float* FLOAT_Array_five  = new float[numberOFElements];
    float* FLOAT_Array       = new float[numberOFElements];

    srand(time(NULL));
    for(int i = 0; i < numberOFElements; i++)
    {
        FLOAT_Array_one[i]  = rand() % 400;
        FLOAT_Array_two[i]  = rand() % 400;
        FLOAT_Array_four[i] = rand() % 400;
        FLOAT_Array_five[i] = rand() % 400;
    }

    timeval tim1;
    timeval tim2;
    gettimeofday(&tim1, NULL);
    //****************************//
    for(int i = 0; i < numberOFElements; i++)
    {
        FLOAT_Array[i] = FLOAT_Array_one[i] + FLOAT_Array_two[i];
    }
    //****************************//
    //****************************//
    for(int i = 0; i < numberOFElements; i++)
    {
        FLOAT_Array_three[i] = FLOAT_Array_four[i] * FLOAT_Array_five[i];
    }
    //****************************//
    gettimeofday(&tim2, NULL);
    double t1 = tim1.tv_sec + (tim1.tv_usec / 1000000.0);
    double t2 = tim2.tv_sec + (tim2.tv_usec / 1000000.0);

    // Consume the results so the timed loops cannot be optimized away.
    for(int i = 0; i < numberOFElements; i++)
    {
        if(i % 2 == 0)
            value = value + FLOAT_Array[i] + FLOAT_Array_three[i];
        else
            value = value - FLOAT_Array[i] - FLOAT_Array_three[i];
    }

    totalTimeDifference = t2 - t1;
    cout << value << endl;
    cout << totalTimeDifference << endl;
}
A few educated guesses:
Since you are doing arithmetic operations in your "loading" loop between the two time checks, the compiler may be emitting SSE streaming floating-point instructions only when optimization is enabled. That should result in a significant speedup.
If you are on a 64-bit OS, accessing two half-words can take more time than accessing one whole word. In a 64-bit program a float occupies only 32 bits, so successive accesses to half-words may cost more than a single access to a whole word. The part I don't understand about that, however, is how the optimizer is able to get around it.
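One rough way to test the first guess (my suggestion, not part of the original answer): ask g++ to report its vectorization decisions with -fopt-info-vec-optimized. For a loop shaped like the timed ones it typically prints a note that the loop was vectorized at -O3 and stays scalar without optimization:
// add_loop.cpp -- same shape as the timed loops in the question.
// Compile with: g++ -O3 -fopt-info-vec-optimized -c add_loop.cpp
#include <cstddef>

void add_arrays(const float* a, const float* b, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)   // g++ normally reports this loop as vectorized at -O3
        out[i] = a[i] + b[i];
}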
Related
Before I start, let me say that I've only used threads once when we were taught about them in university. Therefore, I have almost zero experience using them and I don't know if what I'm trying to do is a good idea.
I'm doing a project of my own and I'm trying to make a for loop run fast because I need the calculations in the loop for a real-time application. After "optimizing" the calculations in the loop, I've gotten closer to the desired speed. However, it still needs improvement.
Then I remembered threading. I thought I could make the loop run even faster if I split it into 4 parts, one for each core of my machine. So this is what I tried to do:
#include <thread>

void doYourThing(int size, int threadNumber, int numOfThreads) {
    int start = (threadNumber - 1) * size / numOfThreads;
    int end = threadNumber * size / numOfThreads;
    for (int i = start; i < end; i++) {
        //Calculations...
    }
}

int main(void) {
    int size = 100000;
    int numOfThreads = 4;
    int start = 0;
    int end = size / numOfThreads;
    // Parts 2-4 run on worker threads; the main thread handles the first quarter below.
    std::thread coreB(doYourThing, size, 2, numOfThreads);
    std::thread coreC(doYourThing, size, 3, numOfThreads);
    std::thread coreD(doYourThing, size, 4, numOfThreads);
    for (int i = start; i < end; i++) {
        //Calculations...
    }
    coreB.join();
    coreC.join();
    coreD.join();
}
With this, computation time changed from 60ms to 40ms.
Questions:
1) Do my threads really run on different cores? If they do, I would expect a greater speedup; specifically, I assumed the loop would take close to 1/4 of the initial time.
2) If they don't, should I use even more threads to split the work? Would that make my loop faster or slower?
(1). The question @François Andrieux asked is a good one: the original code contains a well-structured for-loop, and if you compile with -O3 the compiler may be able to vectorize the computation. That vectorization alone will give you a speedup.
Also, it depends on what the critical path in your computation is. According to Amdahl's law, the achievable speedup is limited by the non-parallelizable part: with a parallelizable fraction p of the work spread over n cores, the speedup is at most 1 / ((1 - p) + p/n). You should also check whether the computation touches a variable protected by a lock; if so, time may also be spent spinning on that lock.
(2). To find out the total number of cores and hardware threads on your computer you can use the lscpu command, which shows the core and thread information for your computer/server.
(3). It is not necessarily true that more threads give better performance; a reasonable default is to use about as many threads as the hardware provides, as in the sketch below.
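As a minimal sketch (my illustration, not from the original answer) of choosing the thread count from the hardware and reusing the question's partitioning:
#include <algorithm>
#include <thread>
#include <vector>

// Same partitioning scheme as in the question.
void doYourThing(int size, int threadNumber, int numOfThreads) {
    int start = (threadNumber - 1) * size / numOfThreads;
    int end = threadNumber * size / numOfThreads;
    for (int i = start; i < end; i++) {
        // Calculations...
    }
}

int main() {
    int size = 100000;
    // hardware_concurrency() reports the number of hardware threads (0 if unknown).
    int numOfThreads = static_cast<int>(std::max(1u, std::thread::hardware_concurrency()));
    std::vector<std::thread> workers;
    for (int t = 2; t <= numOfThreads; ++t)      // parts 2..N run on worker threads
        workers.emplace_back(doYourThing, size, t, numOfThreads);
    doYourThing(size, 1, numOfThreads);          // part 1 runs on the main thread
    for (auto& w : workers)
        w.join();
}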
There is a header-only library on GitHub which may be just what you need. Presumably your doYourThing processes an input vector (of size 100000 in your code) and stores the results in another vector. In that case, all you need to do is say
auto vectorOut = Lazy::runForAll(vectorIn, myFancyFunction);
The library will decide how many threads to use based on how many cores you have.
On the other hand, if the compiler is able to vectorize your algorithm and it still looks like a good idea to split the work into 4 chunks as in your example code, you could do it, for example, like this:
#include "Lazy.h"
void doYourThing(const MyVector& vecIn, int from, int to, MyVector& vecOut)
{
for (int i = from; i < to; ++i) {
// Calculate vecOut[i]
}
}
int main(void) {
int size = 100000;
MyVector vecIn(size), vecOut(size)
// Load vecIn vector with input data...
Lazy::runForAll({{std::pair{0, size/4}, {size/4, size/2}, {size/2, 3*size/4}, {3*size/4, size}},
[&](auto indexPair) {
doYourThing(vecIn, indexPair.first, indexPair.second, vecOut);
});
// Now the results are in vecOut
}
The library's README.md gives further examples of parallel execution which you might find useful.
In a self-educational project I measure memory bandwidth with the help of the following code (paraphrased here; the whole code follows at the end of the question):
unsigned int doit(const std::vector<unsigned int> &mem){
    const size_t BLOCK_SIZE = 16;
    size_t n = mem.size();
    unsigned int result = 0;
    for(size_t i = 0; i < n; i += BLOCK_SIZE){
        result += mem[i];
    }
    return result;
}

//... initialize mem, result and so on
int NITER = 200;
//... measure time of
for(int i = 0; i < NITER; i++)
    result += doit(mem);
BLOCK_SIZE is chosen so that a whole 64-byte cache line is fetched per single integer addition (64 bytes / 4 bytes per unsigned int = 16). My machine (an Intel Broadwell) needs about 0.35 nanoseconds per integer addition, so the code above could at most saturate a bandwidth of about 182 GB/s (this value is just an upper bound and is probably quite off; what matters is the ratio of bandwidths for different sizes). The code is compiled with g++ and -O3.
Varying the size of the vector, I can observe the expected bandwidths for the L1(*), L2 and L3 caches and for RAM:
However, there is an effect I'm really struggling to explain: the collapse of the measured L1-cache bandwidth for sizes around 2 kB, shown here in somewhat higher resolution:
I could reproduce the results on all machines I have access to (which have Intel Broadwell and Intel Haswell processors).
My question: What is the reason for the performance collapse for memory sizes around 2 kB?
(*) I hope I understand correctly that for the L1 cache not 64 bytes but only 4 bytes are read/transferred per addition (there is no further, faster cache where a whole cache line must be filled), so the plotted bandwidth for L1 is only an upper limit and not the bandwidth itself.
Edit: When the step size in the inner for-loop is chosen to be
8 (instead of 16), the collapse happens at 1 kB
4 (instead of 16), the collapse happens at 0.5 kB
i.e. when the inner loop consists of about 31-35 steps/reads. That means the collapse isn't due to the memory size but due to the number of steps in the inner loop.
It can be explained by branch misses, as shown in @user10605163's great answer.
Listing for reproducing the results
bandwidth.cpp:
#include <vector>
#include <chrono>
#include <iostream>
#include <algorithm>

//returns minimal time needed for one execution in seconds:
template<typename Fun>
double timeit(Fun&& stmt, int repeat, int number)
{
    std::vector<double> times;
    for(int i = 0; i < repeat; i++){
        auto begin = std::chrono::high_resolution_clock::now();
        for(int i = 0; i < number; i++){
            stmt();
        }
        auto end = std::chrono::high_resolution_clock::now();
        double time = std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()/1e9/number;
        times.push_back(time);
    }
    return *std::min_element(times.begin(), times.end());
}

const int NITER = 200;
const int NTRIES = 5;
const size_t BLOCK_SIZE = 16;

struct Worker{
    std::vector<unsigned int> &mem;
    size_t n;
    unsigned int result;
    void operator()(){
        for(size_t i = 0; i < n; i += BLOCK_SIZE){
            result += mem[i];
        }
    }

    Worker(std::vector<unsigned int> &mem_):
        mem(mem_), n(mem.size()), result(1)
    {}
};

double PREVENT_OPTIMIZATION = 0.0;

double get_size_in_kB(int SIZE){
    return SIZE*sizeof(int)/(1024.0);
}

double get_speed_in_GB_per_sec(int SIZE){
    std::vector<unsigned int> vals(SIZE, 42);
    Worker worker(vals);
    double time = timeit(worker, NTRIES, NITER);
    PREVENT_OPTIMIZATION += worker.result;
    return get_size_in_kB(SIZE)/(1024*1024)/time;
}

int main(){
    int size = BLOCK_SIZE*16;
    std::cout << "size(kB),bandwidth(GB/s)\n";
    while(size < 10e3){
        std::cout << get_size_in_kB(size) << "," << get_speed_in_GB_per_sec(size) << "\n";
        size = (static_cast<int>(size + BLOCK_SIZE)/BLOCK_SIZE)*BLOCK_SIZE;
    }

    //ensure that nothing is optimized away:
    std::cerr << "Sum: " << PREVENT_OPTIMIZATION << "\n";
}
create_report.py:
import sys
import pandas as pd
import matplotlib.pyplot as plt
input_file=sys.argv[1]
output_file=input_file[0:-3]+'png'
data=pd.read_csv(input_file)
labels=list(data)
plt.plot(data[labels[0]], data[labels[1]], label="my laptop")
plt.xlabel(labels[0])
plt.ylabel(labels[1])
plt.savefig(output_file)
plt.close()
Building/running/creating report:
>>> g++ -O3 -std=c++11 bandwidth.cpp -o bandwidth
>>> ./bandwidth > report.txt
>>> python create_report.py report.txt
# image is in report.png
I changed the values slightly: NITER = 100000 and NTRIES=1 to get a less noisy result.
I don't have a Broadwell available right now; however, I tried your code on my Coffee Lake and got a performance drop, not at 2 kB, but around 4.5 kB. In addition, I see erratic throughput behavior slightly above 2 kB.
The blue line in the graph corresponds to your measurement (left axis):
The red line here is the result from perf stat -e branch-instructions,branch-misses, giving the fraction of branches that were not correctly predicted (in percent, right axis). As you can see there is a clear anti-correlation between the two.
Looking into the more detailed perf report, I found that basically all of these branch mispredictions happen in the innermost loop in Worker::operator(). If the taken/not-taken pattern of the loop branch becomes too long, the branch predictor is unable to keep track of it, so the exit branch of the inner loop is mispredicted, leading to the sharp drop in throughput. With a further increasing number of iterations the impact of this single mispredict becomes less significant, leading to the slow recovery of the throughput.
For further information on the erratic behavior before the drop, see the comments made by @PeterCordes below.
In any case, the best way to avoid branch mispredictions is to avoid branches, so I manually unrolled the loop in Worker::operator(), e.g.:
void operator()(){
    for(size_t i = 0; i + 3*BLOCK_SIZE < n; i += BLOCK_SIZE*4){
        result += mem[i];
        result += mem[i + BLOCK_SIZE];
        result += mem[i + 2*BLOCK_SIZE];
        result += mem[i + 3*BLOCK_SIZE];
    }
}
Unrolling by 2, 3, 4, 6 or 8 iterations gives the results below. Note that I did not correct for the blocks at the end of the vector which are skipped due to the unrolling; therefore the periodic peaks in the blue line should be ignored, and the lower-bound baseline of the periodic pattern is the actual bandwidth. (A small tail loop, as sketched below, would pick up those skipped blocks.)
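For completeness, a sketch (my addition, not part of the original measurements) of the 4-fold unrolled loop with a scalar tail loop that processes the remaining blocks:
void operator()(){
    size_t i = 0;
    // main unrolled loop: 4 blocks per iteration, so 4x fewer loop branches
    for(; i + 3*BLOCK_SIZE < n; i += BLOCK_SIZE*4){
        result += mem[i];
        result += mem[i + BLOCK_SIZE];
        result += mem[i + 2*BLOCK_SIZE];
        result += mem[i + 3*BLOCK_SIZE];
    }
    // tail loop: the at most 3 blocks that the unrolled loop skipped
    for(; i < n; i += BLOCK_SIZE){
        result += mem[i];
    }
}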
As you can see, the fraction of branch mispredictions didn't really change, but because the total number of branches is reduced by the unrolling factor, they no longer contribute strongly to the performance.
There is also the additional benefit that the processor is freer to execute the calculations out of order when the loop is unrolled.
If this is supposed to have a practical application, I would suggest giving the hot loop a compile-time-fixed number of iterations or some guarantee on divisibility, so that (maybe with some extra hints) the compiler can decide on the optimal number of iterations to unroll.
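For illustration (my addition, assuming the number of blocks is known at compile time), a variant where the trip count is a template parameter, so the compiler knows the exact iteration count and can pick its own unroll factor:
#include <cstddef>

// NumBlocks and BlockSize are compile-time constants, so the exact trip count is
// known to the compiler, which is then free to unroll or vectorize as it sees fit.
template<std::size_t NumBlocks, std::size_t BlockSize = 16>
unsigned int sum_blocks(const unsigned int* mem){
    unsigned int result = 0;
    for(std::size_t i = 0; i < NumBlocks*BlockSize; i += BlockSize)
        result += mem[i];
    return result;
}

// usage: unsigned int s = sum_blocks<64>(vals.data());  // 64 blocks, fixed at compile time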
This might be unrelated, but your Linux machine may be playing with the CPU frequency: I know Ubuntu 18 has a governor that balances power against performance. You also want to play with the process affinity to make sure it does not get migrated to a different core while running.
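As a minimal, Linux-specific sketch (my addition) of pinning the measuring thread to one core; the frequency governor itself has to be changed separately (e.g. to "performance" via cpupower). Compile with -pthread:
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          // for cpu_set_t, CPU_ZERO/CPU_SET and pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to the given core so the scheduler cannot migrate it mid-measurement.
void pin_to_core(int core){
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}

// usage: call pin_to_core(0) at the start of main(), before the timed loops run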
I keep getting the wrong result when I try to sum big arrays. I have isolated the problem to the following code sample (it's not a sum of a big array, but I believe this is the problem).
Compilable Sample:
#include <iostream>
#include <iomanip>
#include <cstdio>
using namespace std;

template <typename T>
void cpu_sum(const unsigned int size, T & od_value) {
    od_value = 0;
    for (unsigned int i = 0; i < size; i++) {
        od_value += 1;
    }
}

int main() {
    typedef float Data;
    const unsigned int size = 800000000;
    Data sum;
    cpu_sum(size, sum);
    cout << setprecision(35) << sum << endl; // prints: 16777216 // ERROR !!!
    getchar();
}
Environment:
OS: Windows 8.1 x64 home
IDE: Microsoft Visual Studio 2015
Error Description:
While my outcome should obviously be sum == 800000000, I keep getting sum == 16777216.
That is very weird to me, because the float max value is far above this one, and yet it looks like the sum variable has reached some limit.
What did I miss??
It is a well-known problem. Gradually your sum becomes so big that the next summand becomes comparable with its machine epsilon (for float, about 1.2e-7, i.e. roughly 7 significant decimal digits). At that moment you start losing precision: once the sum reaches 2^24 = 16777216, adding 1.0f no longer changes the value at all.
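A tiny demonstration of where the cut-off happens: 2^24 = 16777216 is the first value at which adding 1.0f is lost entirely.
#include <iostream>

int main() {
    float f = 16777216.0f;                  // 2^24: neighbouring floats are now 2 apart
    std::cout << (f + 1.0f == f) << "\n";   // prints 1: adding 1.0f no longer changes f
    std::cout << (16777215.0f + 1.0f == 16777216.0f) << "\n"; // prints 1: this step is still exact
}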
The standard solution is to change the summation tactic: when the array is larger than, say, 100 elements, split it in halves and sum each half separately. This continues recursively and tends to keep precision much better.
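A minimal sketch of that recursive (pairwise) summation (my illustration, assuming the data sits in a std::vector<float>):
#include <cstddef>
#include <vector>

// Recursively split the range in half and add the partial sums.
// Partial sums stay of similar magnitude, so far less precision is lost.
float pairwise_sum(const std::vector<float>& v, std::size_t begin, std::size_t end) {
    if (end - begin <= 100) {                 // small ranges: plain accumulation is fine
        float s = 0.0f;
        for (std::size_t i = begin; i < end; ++i)
            s += v[i];
        return s;
    }
    std::size_t mid = begin + (end - begin) / 2;
    return pairwise_sum(v, begin, mid) + pairwise_sum(v, mid, end);
}

// usage: float total = pairwise_sum(data, 0, data.size());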
As pointed out by Michael Simbursky, this is a well-known problem when trying to sum a large number of limited-precision numbers. The solution he provides works well and is quite efficient.
For the curious reader, a slower method is to sort your array before computing the sum, ensuring that the values are added in order of increasing absolute value. This way the least-significant values make their contributions to the overall sum before being overwhelmed by the larger values (see the sketch below). This technique is used more as an example/illustration than in any serious programming ventures.
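A short sketch of the sort-then-sum idea (my illustration, again assuming a std::vector<float>):
#include <algorithm>
#include <cmath>
#include <vector>

float sorted_sum(std::vector<float> v) {      // take a copy, since we reorder it
    // add the smallest magnitudes first so they are not swallowed by large partial sums
    std::sort(v.begin(), v.end(),
              [](float a, float b) { return std::fabs(a) < std::fabs(b); });
    float s = 0.0f;
    for (float x : v)
        s += x;
    return s;
}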
For serious scientific programming where the overall size of the data can vary a great deal, another algorithm to be aware of is the Kahan summation algorithm; the referenced Wikipedia page provides a nice description of it.
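A compact sketch of Kahan (compensated) summation (my illustration of the standard algorithm; note it must not be compiled with -ffast-math, which can optimize the compensation away):
#include <vector>

// Kahan summation: carry the rounding error of each addition in a compensation term.
float kahan_sum(const std::vector<float>& v) {
    float sum = 0.0f;
    float c = 0.0f;                 // running compensation for lost low-order bits
    for (float x : v) {
        float y = x - c;            // apply the correction from the previous step
        float t = sum + y;          // low-order bits of y are lost here...
        c = (t - sum) - y;          // ...and recovered here
        sum = t;
    }
    return sum;
}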
I encountered this weird bug in a C++ multithreaded program on Linux. The multithreaded part basically executes a loop. A single iteration first loads a SIFT file containing some features and then queries these features against a tree. Since I have a lot of images, I use multiple threads to do this querying. Here are the code snippets.
struct MultiMatchParam
{
    int thread_id;
    float *scores;
    double *scores_d;
    int *perm;
    size_t db_image_num;
    std::vector<std::string> *query_filenames;
    int start_id;
    int num_query;
    int dim;
    VocabTree *tree;
    FILE *file;
};

// multi-thread will do normalization anyway
void MultiMatch(MultiMatchParam &param)
{
    // Clear scores
    for(size_t t = param.start_id; t < param.start_id + param.num_query; t++)
    {
        for (size_t i = 0; i < param.db_image_num; i++)
            param.scores[i] = 0.0;

        DTYPE *keys;
        int num_keys;
        keys = ReadKeys_sfm((*param.query_filenames)[t].c_str(), param.dim, num_keys);

        int normalize = true;
        double mag = param.tree->MultiScoreQueryKeys(num_keys, normalize, keys, param.scores);
        delete [] keys;
    }
}
I run this on an 8-core CPU. At first it runs perfectly and the CPU usage is nearly 100% on all 8 cores. After each thread has queried several images (about 20 images), all of a sudden the performance (CPU usage) drops drastically, down to about 30% across all eight cores.
I suspect the key to this bug lies in this line of code.
double mag = param.tree->MultiScoreQueryKeys(num_keys, normalize, keys, param.scores);
The reason I suspect it: if I replace this line with another costly operation (e.g., a large for-loop containing sqrt), the CPU usage stays at nearly 100%. This MultiScoreQueryKeys function performs a complex operation on a tree. Since all eight cores may read the same tree (there is no write operation to this tree), I wonder whether the read operation has some kind of blocking effect. But it shouldn't, because I don't have write operations in this function. Also, the operations in the loop are basically the same; if they were going to block the CPU usage, it would happen in the first few iterations. If you need to see the details of this function or other parts of this project, please let me know.
Use std::async() instead of zeta::SimpleLock lock
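For illustration only, a rough, self-contained sketch of the std::async pattern; QueryRange below is a placeholder standing in for the per-image query work, and since zeta::SimpleLock does not appear in the posted snippet this is not a drop-in fix:
#include <future>
#include <iostream>
#include <vector>

// Stand-in for the per-image query work; in the real code this would call
// tree->MultiScoreQueryKeys for the images in [start, start + count).
double QueryRange(int start, int count) {
    double mag = 0.0;
    for (int i = start; i < start + count; ++i)
        mag += i * 0.5;                        // placeholder computation
    return mag;
}

int main() {
    const int num_images = 160;
    const int num_threads = 8;
    const int per_thread = num_images / num_threads;

    std::vector<std::future<double>> futures;
    for (int t = 0; t < num_threads; ++t)
        futures.push_back(std::async(std::launch::async, QueryRange,
                                     t * per_thread, per_thread));   // one real thread per task

    double total = 0.0;
    for (auto& f : futures)
        total += f.get();                      // get() waits for each task to finish
    std::cout << total << "\n";
}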
I've made a small application that averages the numbers between 1 and 1000000. It's not hard to see (using a very basic algebraic formula) that the average is 500000.5 but this was more of a project in learning C++ than anything else.
Anyway, I made clock variables that were designed to find the number of clock ticks required for the application to run. When I first ran the program, it said that it took 3770000 clock ticks, but every time I've run it since then, it's taken "0.0" seconds...
I've attached my code at the bottom.
Either a.) It's saved the variables from the first time I ran it, and it's just running quickly to the answer...
or b.) something is wrong with how I'm declaring the time variables.
Regardless... it doesn't make sense.
Any help would be appreciated.
FYI: I'm running this on a Linux computer, not sure if that matters.
#include <cstdio>
#include <ctime>

double avg (int arr[], int beg, int end)
{
    int nums = end - beg + 1;
    double sum = 0.0;
    for(int i = beg; i <= end; i++)
    {
        sum += arr[i];
    }
    //for(int p = 0; p < nums*10000; p ++){}
    return sum/nums;
}

int main (int argc, char *argv[])
{
    int nums = 1000000;//atoi(argv[0]);
    int myarray[nums];
    double timediff;
    //printf("Arg is: %d\n",argv[0]);
    printf("Nums is: %d\n",nums);

    clock_t begin_time = clock();
    for(int i = 0; i < nums; i++)
    {
        myarray[i] = i+1;
    }
    double average = avg(myarray, 0, nums - 1);
    printf("%f\n",average);
    clock_t end_time = clock();

    timediff = (double) difftime(end_time, begin_time);
    printf("Time to Average: %f\n", timediff);
    return 0;
}
You are measuring the I/O operation (printf) too; that depends on external factors and might be affecting the run time. Also, clock() might not be precise enough to measure such a small task; look into higher-resolution functions such as clock_gettime(), sketched below. Even then, other processes might affect the run time by generating page-fault interrupts, occupying the memory bus, etc. So this kind of fluctuation is not abnormal at all.
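A minimal sketch of reading a high-resolution monotonic timestamp with clock_gettime() (on very old glibc versions you may need to link with -lrt):
#include <cstdio>
#include <time.h>    // clock_gettime, CLOCK_MONOTONIC (POSIX)

int main() {
    struct timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // ... work to be measured ...

    clock_gettime(CLOCK_MONOTONIC, &stop);
    double seconds = (stop.tv_sec - start.tv_sec)
                   + (stop.tv_nsec - start.tv_nsec) / 1e9;
    printf("elapsed: %f s\n", seconds);
}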
On the machine I tested, Linux's clock call was only accurate to 1/100th of a second. If your code runs in less than 0.01 seconds, it will usually say zero seconds have passed. Also, I ran your program a total of 50 times in 0.13 seconds, so I find it suspicious that you claim it takes 2 seconds to run it once on your computer.
Your code incorrectly uses difftime, which expects two time_t values (calendar times in seconds), not clock_t values, so it may display incorrect output even when clock says time did pass. The clock_t difference should instead be divided by CLOCKS_PER_SEC, as sketched below.
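A minimal sketch of the corrected conversion, keeping clock():
#include <cstdio>
#include <ctime>

int main() {
    clock_t begin_time = clock();

    // ... work to be measured ...

    clock_t end_time = clock();
    // clock() returns CPU time in ticks; divide by CLOCKS_PER_SEC to get seconds.
    double seconds = (double)(end_time - begin_time) / CLOCKS_PER_SEC;
    printf("Time to Average: %f\n", seconds);
}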
I'd guess that the first timing you got was with different code than that posted in this question, because I can't think of any way the code in this question could produce a time of 3770000.
Finally, benchmarking is hard, and your code has several benchmarking mistakes:
You're timing how long it takes to (1) fill an array, (2) calculate an average, (3) format the result string, and (4) make an OS call (slow) that prints said string in the right language/font/color/etc., which is especially slow.
You're attempting to time a task which takes less than a hundredth of a second, which is WAY too small for any accurate measurement.
Here is my take on your code, measuring that the average takes ~0.001968 seconds on this machine.